Upgrading Kafka from 2.5.0 to 3.2.3

Bidirectional client/broker compatibility was introduced in 2017 (Kafka 0.10.2) – which means my experience, where you needed to upgrade the brokers first and then the clients, is no longer accurate. Rejoice!

Sandbox Setup

Two CentOS docker containers were provisioned as follows:

docker run -dit --name=kafka1 -p 9092:9092 centos:latest
docker run -dit --name=kafka2 -p 9093:9092 -p 9000:9000 centos:latest

# Shell into each container and do the following:

sed -i -e "s|mirrorlist=|#mirrorlist=|g" /etc/yum.repos.d/CentOS-*
sed -i -e "s|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g" /etc/yum.repos.d/CentOS-*

# Add each container's IP address and hostname to /etc/hosts on both containers

172.17.0.2 40c2222cfea0
172.17.0.3 2923addbcb6d

# Update installed packages & install required tools

dnf update
yum install -y passwd vim net-tools wget git unzip
# Add a kafka user, make a kafka folder, and give the kafka user ownership of the kafka folder
useradd kafka
passwd kafka
usermod -aG wheel kafka

mkdir /kafka

chown kafka:kafka /kafka

# Install Kafka

su - kafka
cd /kafka
wget https://archive.apache.org/dist/kafka/2.5.0/kafka_2.12-2.5.0.tgz
tar vxzf kafka_2.12-2.5.0.tgz
rm kafka_2.12-2.5.0.tgz
ln -s /kafka/kafka_2.12-2.5.0 /kafka/kafka

# Configure zookeeper

vi /kafka/kafka/config/zookeeper.properties
dataDir=/kafka/zookeeperdata
server.1=172.17.0.2:2888:3888

# Start Zookeeper on the first server

screen -S zookeeper
/kafka/kafka/bin/zookeeper-server-start.sh /kafka/kafka/config/zookeeper.properties

# Configure the cluster

vi /kafka/kafka/config/server.properties

# broker.id must be a unique number on each cluster node
broker.id=1
listeners=PLAINTEXT://:9092
zookeeper.connect=172.17.0.2:2181
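
# For reference, the second container's server.properties is identical except for the broker id — a sketch assuming the IPs above:

broker.id=2
listeners=PLAINTEXT://:9092
zookeeper.connect=172.17.0.2:2181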

# Start Kafka

screen -S kafka
/kafka/kafka/bin/kafka-server-start.sh /kafka/kafka/config/server.properties

# Edit producer.properties on a server

vi /kafka/kafka/config/producer.properties
bootstrap.servers=172.17.0.2:9092,172.17.0.3:9092

# Create test topic

/kafka/kafka/bin/kafka-topics.sh --create --zookeeper 172.17.0.2:2181 --replication-factor 2 --partitions 1 --topic ljrTest

# Post messages to the topic

/kafka/kafka/bin/kafka-console-producer.sh --broker-list 172.17.0.2:9092 --producer.config /kafka/kafka/config/producer.properties --topic ljrTest

# Retrieve messages from topic

/kafka/kafka/bin/kafka-console-consumer.sh --bootstrap-server 172.17.0.2:9092 --topic ljrTest --from-beginning
/kafka/kafka/bin/kafka-console-consumer.sh --bootstrap-server 172.17.0.3:9092 --topic ljrTest --from-beginning

Voila, a functional Kafka sandbox cluster.

Now we’ll install CMAK, the cluster manager.

cd /kafka
git clone --depth 1 --branch 3.0.0.6 https://github.com/yahoo/CMAK.git
cd CMAK
vi conf/application.conf
cmak.zkhosts="40c2222cfea0:2181"

# CMAK requires Java 11 or later … so set up Zulu JDK 11
cd /usr/lib/jvm
wget https://cdn.azul.com/zulu/bin/zulu11.58.23-ca-jdk11.0.16.1-linux_x64.zip
unzip zulu11.58.23-ca-jdk11.0.16.1-linux_x64.zip
mv zulu11.58.23-ca-jdk11.0.16.1-linux_x64 zulu-11
PATH=/usr/lib/jvm/zulu-11/bin:$PATH

cd /kafka/CMAK
./sbt -java-home /usr/lib/jvm/zulu-11 clean dist

cp /kafka/CMAK/target/universal/cmak-3.0.0.6.zip /kafka

cd /kafka
unzip cmak-3.0.0.6.zip
cd cmak-3.0.0.6
screen -S CMAK
bin/cmak -java-home /usr/lib/jvm/zulu-11 -Dconfig.file=/kafka/cmak-3.0.0.6/conf/application.conf -Dhttp.port=9000

Access it at http://cmak_host:9000

Sandbox Upgrade Process

# Back up the Kafka installation (excluding log files)

tar cvfzp /kafka/kafka-2.5.0.tar.gz --exclude logs /kafka/kafka_2.12-2.5.0

# Get newest Kafka version installed
# From another host where you can download the file, transfer it to the kafka server

scp kafka_2.12-3.2.3.tgz list@kafka1:/tmp/

# Back on the Kafka server — move the tgz file into the /kafka directory

mv /tmp/kafka_2.12-3.2.3.tgz /kafka/

# Verify Kafka data is stored outside of the install directory:

[kafka@40c2222cfea0 config]$ grep log.dir server.properties
log.dirs=/tmp/kafka-logs

# Verify zookeeper data is stored outside of the install directory:

[kafka@40c2222cfea0 config]$ grep dataDir zookeeper.properties
dataDir=/kafka/zookeeperdata

# Unpack the new version of Kafka – upgrade the zookeeper(s) first, then the other nodes
# (the wget below is an alternative to the scp above, if the Kafka server has internet access)

cd /kafka
wget https://downloads.apache.org/kafka/3.2.3/kafka_2.12-3.2.3.tgz
tar vxfz /kafka/kafka_2.12-3.2.3.tgz

# Copy config from old iteration to new

cp /kafka/kafka_2.12-2.5.0/config/* /kafka/kafka_2.12-3.2.3/config/

# Edit server.properties and add a configuration line to force the inter-broker protocol version to the currently running Kafka version
# This ensures your cluster is using the “old” version to communicate and you can, if needed, revert to the previous version

vi /kafka/kafka_2.12-3.2.3/config/server.properties
inter.broker.protocol.version=2.5.0

# Restart each Kafka server – waiting until it has come online before restarting the next one – with the new binaries
# Stop kafka

systemctl stop kafka

# Move symlink to new folder

unlink /kafka/kafka
ln -s /kafka/kafka_2.12-3.2.3 /kafka/kafka

# start kafka

systemctl start kafka

# Or, to watch it run,

/kafka/kafka/bin/kafka-server-start.sh /kafka/kafka/config/server.properties
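
# Note: this sandbox runs Kafka in a screen session; the systemctl commands above assume Kafka is wrapped in a systemd service.
# A minimal unit file sketch using the paths and kafka user from this post (hypothetical; save as /etc/systemd/system/kafka.service):

[Unit]
Description=Apache Kafka broker
After=network.target

[Service]
Type=simple
User=kafka
ExecStart=/kafka/kafka/bin/kafka-server-start.sh /kafka/kafka/config/server.properties
ExecStop=/kafka/kafka/bin/kafka-server-stop.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target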

# Finally, ensure you’ve still got ‘stuff’

/kafka/kafka/bin/kafka-console-consumer.sh --bootstrap-server 172.17.0.3:9092 --topic ljrTest --from-beginning

# And verify the version has updated

[kafka@40c2222cfea0 bin]$ ./kafka-topics.sh --version
3.2.3 (Commit:50029d3ed8ba576f)

# Until this point, we can just move the symlink back to the old folder & revert to the previous version of Kafka … that’s our backout plan.

# Once everything has been confirmed to be working, bump the inter-broker protocol version to the new version & restart Kafka

vi /kafka/kafka/config/server.properties
inter.broker.protocol.version=3.2
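
# After restarting with the new inter-broker protocol version, you can double-check that each broker picked it up.
# A sketch using the stock kafka-configs tool (broker 1 on the first sandbox node) — the static broker configs it lists should include the new value:

/kafka/kafka/bin/kafka-configs.sh --bootstrap-server 172.17.0.2:9092 --entity-type brokers --entity-name 1 --describe --all | grep inter.broker.protocol.version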

OpenSearch Evaluation Overview

What is ElasticSearch?

ElasticSearch, based on the Lucene search software, is a distributed search and analytics application which ingests, stores, and indexes data. Kibana is a web-based front-end providing user access to data stored within ElasticSearch.

What is OpenSearch?

In short, it’s the same but different. OpenSearch is also based on the Lucene search software, is designed to be a distributed search and analytics application, and ingests/stores/indexes data. If it’s essentially the same thing, why does OpenSearch exist? ElasticSearch was initially licensed under the open-source Apache 2.0 license – a rather permissive free software license. ElasticCo did not agree with how their software was being used by Amazon; and, in 2021, the license for ElasticSearch was changed to Server Side Public License (SSPL). One of the requirements of SSPL is that anyone who implements the software and sells their implementation as a service needs to publish their source code under the SSPL license – not just changes made to the original program but all other software a user would require to run the software-as-a-service environment for themselves. Amazon used ElasticSearch for their Amazon Elasticsearch Service offering, but was unable/unwilling to continue doing so under the new license terms. In April of 2021, Amazon Web Services created a fork of ElasticSearch as the basis for OpenSearch.

Differences Between OpenSearch and ElasticSearch

After the OpenSearch fork was created, the product roadmap for ElasticSearch was driven by ElasticCo and the roadmap for OpenSearch was community driven (with significant oversight and input from Amazon) – this means the products are not identical although they provide the same core functionality. Elastic publishes a list of features unique to ElasticSearch, and the underlying machine learning algorithms are different. However, the important components of the “unique” feature list have been implemented in OpenSearch over time.

The biggest differences are price and support. OpenSearch is free software – there is no purchasing a license to unlock features. It does appear that Amazon has an internal iteration of OpenSearch as their as-a-service offering provides features not available in the open-source OpenSearch code base, but that is only available for cloud customers. ElasticCo offers ElasticSearch as free software with a limited feature set. One critical limitation is user authentication mechanisms – we are unable to implement PingID as an authentication source with the free feature set. Advanced features not currently used today – machine learning based anomaly detection, as an example – are also unavailable in the free iteration of ElasticSearch. With an ElasticSearch license, we would also get vendor support. OpenSearch does not offer vendor support, although there are third party companies that will provide support services.

Both OpenSearch and ElasticSearch have community-based support forums available – I have gotten responses from developers on both forums for questions regarding usage nuances.

Salient Feature Comparison

Most companies have a list differentiating their product from the products offered by competitors – but the important thing is how the products differ as it relates to how an individual customer uses the product. A car that can have a fresh cup of espresso waiting for you as you leave for work might be amazing to some people, but those who don’t drink coffee won’t be nearly as impressed. So how do the two products compare for me?

Data ingestion – Data is ingested using the same mechanisms – ElasticCo’s filebeat and logstash are important components of data ingestion, and these components remain unchanged. This means existing processes that feed data into ElasticSearch today would not need to be changed to begin ingesting data into OpenSearch.
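
Because both products expose the same indexing API, the HTTP call that logstash ultimately makes can be pointed at either cluster unchanged. A minimal sketch — the host, credentials, index name, and document below are all made up:

curl -k -u "$USER:$PASS" -H 'Content-Type: application/json' -XPOST 'https://opensearch.example.com:9200/ljrtest-2022.11/_doc' -d '{"@timestamp":"2022-11-01T12:00:00Z","host":"server1","message":"test record"}'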

Data storage – Both products distribute searchable data over a cluster of servers. Data storage is “tiered” as hot, warm, and cold which allows less used data to reside on slower, less expensive resources. We have confirmed that ingested data is properly housed on cluster nodes designated for ‘hot’ storage and moved to ‘warm’ and ‘cold’ storage as dictated by defined policies. The item count to size ratio is similar between both products (i.e. storing ten million documents takes about the same amount of disk space). OpenSearch provides the ability to alert on transition failures (moving from hot to warm, for instance) which will reduce the amount of manual “health checking” required for the environment.
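
The placement checks can be done with the _cat APIs, which behave the same in both products — the host, credentials, and index pattern here are placeholders:

# Which node holds each shard of an index (confirms hot/warm/cold placement)
curl -s -u "$USER:$PASS" 'https://opensearch.example.com:9200/_cat/shards/ljrtest-*?v&h=index,shard,prirep,node'
# The tiering attributes assigned to each node
curl -s -u "$USER:$PASS" 'https://opensearch.example.com:9200/_cat/nodeattrs?v'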

Search and aggregation – Both products allow both GUI and API searches of indexed data. Data can be aggregated as it is searched – returning the max/min/average value from a search, counting records matching search criteria, creating sub-aggregations. ElasticSearch does have aggregations not available in OpenSearch, although these could be handled through custom scripted aggregations, and many have corresponding GitHub issues requesting such an aggregation be added to OpenSearch (e.g. weighted average, geohash grid, or geotile grid); a sketch of one such workaround follows the table below.

Aggregation Name ElasticSearch 8.x OpenSearch 2.x
auto-interval date histogram x
categorize text x
children x
composite x
frequent items x
geohex grid x
geotile grid x
ip prefix x
multi terms x
parent x
random sampler x
rare terms x
terms x
variable width histogram x
boxplot x
geo-centroid x
geo-line x
median absolute deviation x
rate x
string stats x
t-test x
top metrics x
weighted avg x
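
As an example of the scripted workaround mentioned above the table, a weighted average — native in ElasticSearch but absent from OpenSearch 2.x — can be approximated with two sum aggregations and a client-side division. This is only a sketch; the hosts, credentials, index, and the 'grade'/'weight' fields are hypothetical:

# Native weighted_avg aggregation in ElasticSearch
curl -s -u "$USER:$PASS" -H 'Content-Type: application/json' 'https://elastic.example.com:9200/ljrtest/_search' -d '{ "size": 0, "aggs": { "avg_grade": { "weighted_avg": { "value": { "field": "grade" }, "weight": { "field": "weight" } } } } }'

# OpenSearch workaround: sum of value*weight and sum of weight, then divide the two results in the caller
curl -s -u "$USER:$PASS" -H 'Content-Type: application/json' 'https://opensearch.example.com:9200/ljrtest/_search' -d '{ "size": 0, "aggs": { "weighted_sum": { "sum": { "script": { "source": "doc[\"grade\"].value * doc[\"weight\"].value" } } }, "total_weight": { "sum": { "field": "weight" } } } }'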

Alerting – ElastAlert2 can be used to provide the same index monitoring and alerting functionality that ElastAlert currently provides with ElasticSearch. Additionally, OpenSearch includes a built-in alerting capability that might allow us to streamline the functionality into the base OpenSearch implementation.

API Access – Both ElasticSearch and OpenSearch provide API-based access to data. Queries written for the ElasticSearch API endpoint returned the expected data when directed at the OpenSearch API endpoint. The ElasticSearch python module can be used to access OpenSearch data, although there is a purpose-built OpenSearch python module as well.
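
As a concrete illustration, the same query URI (host, credentials, and index are placeholders) returns the expected records from either product:

curl -s -u "$USER:$PASS" 'https://elastic.example.com:9200/ljrtest/_search?q=message:test&size=5'
curl -s -u "$USER:$PASS" 'https://opensearch.example.com:9200/ljrtest/_search?q=message:test&size=5'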

UX – ElasticSearch allows users to search and visualize data through Kibana; OpenSearch provides graphical user access through OpenSearch Dashboards. While the “look and feel” of the GUI differs (Kibana 8 looks different than the Kibana 7 we use today, too), the user functionality remains the same.

Screenshots: Kibana 7.7 and OpenSearch Dashboards 2.2

Kibana uses “KQL” (Kibana Query Language) to compose searches while OpenSearch Dashboards uses “DQL” (Dashboards Query Language), but queries used in Kibana worked in OpenSearch Dashboards without modification.
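
For example, a filter like the following (field names are hypothetical) behaves the same whether it is typed into Kibana’s KQL bar or the OpenSearch Dashboards DQL bar:

response : 404 and url : *admin*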

Currently used visualizations are available in both Kibana and OpenSearch Dashboards.

Screenshots: the same visualization rendered in Kibana and in OpenSearch Dashboards

But there are some currently unused visualizations that are unique to each product.

Visualization Kibana OpenSearch Dashboard
Area x x
Controls x x
Coordinate Map x
Data Table x x
Gantt Chart x
Gauge x x
Goal x x
Heat Map x x
Horizonal Bar x x
Lens x
Line x x
Maps x
Markdown x x
Metric x x
Pie x x
Region Map x
Tag Cloud x x
Timeline x x
TSVB x x
Vega x x
Vertical Bar x x

Dashboards can be used to group visualizations.

Screenshots: a dashboard in Kibana and the equivalent dashboard in OpenSearch Dashboards

Moving forward, new features will be available to us with either OpenSearch or a licensed installation of ElasticSearch. Currently, data is either retained as written or aged out of the system to save disk space. Either path allows us to roll up data – as an example, retaining the total number of users per month or total bytes per month instead of retaining each detailed record. Additionally, we would be able to use “anomaly detection”, which monitors large volumes of indexed data and highlights unusual events. Both newer ElasticSearch versions and OpenSearch offer a Tableau connector, which may make data stored in the platform more accessible to users.


ElasticSearch – Listing Snapshots in AWS S3

To view the snapshots held in AWS S3, you should be able to use Kibana: from “Management” navigate to “Snapshot and Restore” and look at the list of snapshots. We, however, get a timeout attempting to view the snapshots that way. Instead, use the _snapshot ES API endpoint (GET _snapshot) to get the name of the snapshot repository.

Then use that name to build the ES API URI that lists the snapshots in the repository – GET _snapshot/<repository_name>/*?verbose=false – you will get a list of snapshots, which indices are included in each snapshot, and a state (SUCCESS or FAILED).
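
A sketch of the equivalent curl calls — the endpoint, credentials, and repository name below are placeholders:

# List the registered snapshot repositories
curl -s -u "$USER:$PASS" 'https://elastic.example.com:9200/_snapshot?pretty'
# List the snapshots held in a repository
curl -s -u "$USER:$PASS" 'https://elastic.example.com:9200/_snapshot/my-s3-repository/*?verbose=false&pretty'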

Heritage Turkeys

In addition to growing open pollinated, heirloom vegetables — we’ve got a flock of heritage turkeys. These guys are Black Spanish turkeys. Unlike the broad-breasted turkeys raised commercially today, they walk around and do turkey things all day. They are all waiting by the gate as we walk over to the poultry pasture, and there are always a few turkeys following us around if we’re working in their area.

The two males we have from last year were amazing with the little poults this Spring. They’d take a share of poults and snuggle them at night to keep them warm. They’d march around them as the little ones pecked around during the day. Even now that the younger turkeys are almost fully grown, the older turkeys stand guard and make sure everyone gets access to food and water. Watching the adult turkeys with the younger ones has been right educational, and I am eager to hatch some of our own poults next year!

Knitting – Finally

I learned to knit and crochet at the same time — crocheting was something I could do easily but knitting? It was awkward and never really worked. I’ve always suspected I was just doing something wrong — if only someone who knew what they were doing could spot it. I even managed to teach a friend of mine to knit, and she couldn’t show me what I was doing wrong.

I got Anya a knitting book — it wasn’t quite enough for her to figure out knitting, so I let her find a few videos on YouTube. She has gotten to where she knits quite well. At first, she was using pencils — but I got her some knitting needles with little cats on the top. And she decided to teach me how to knit. Casting on — check. Knitting — not a check. It’s this strange awkward motion and the yarn ends up way too tight on the needle. So she sat and watched what I was doing — corrected the couple of things I was doing wrong, and …

I am actually able to knit now (yes, there are a few mistakes … but the tension is reasonable and it’s reasonable looking).

Managing Photos

We have a lot of photos — I expect that is true of most people. When taking a picture required someone to make sure there was a roll of film in the camera, take a picture, and then develop (or get developed) the pictures? We didn’t have that many pictures — my family, I recall, had a pile of undeveloped film and a few albums of photos. With the advent of digital photography, I took a lot more pictures. But it was still a manageable quantity. With smart phones — a camera available any time you have half a thought to record something for posterity? We have a lot of pictures. 264 already this month, 4727 last year! And then we have movies. While it is awesome to be able to preserve all of these memories, it’s also impossible to find anything. Ideally, you could search for files tagged with ‘garden’ and find all of the garden pictures.

Which brought us to a quest for a good photo and video tagging application. Haven’t found one yet, but I have discovered that a lot of applications use their own database. Dolphin stores its tags in Baloo — I remember encountering something similar with Windows Media Player and a shared music library — we thought we were making changes that would be visible to everyone, but the changes didn’t even persist if you blew away your local store and repopulated your library. I’ve found apps with sqlite, ones with an external MySQL server, etc — but it’s something that locks you into their application. A few (DigiKam) have a feature to sync their data over to the image metadata, which (I think) will have to suffice.

Worse still, I haven’t identified any reasonable consistency to where metadata is stored — when you add ‘tags’ in the Windows File Explorer, the tags appear in “Subject”, “LastKeywordXMP”, and “XPKeywords”:

Windows image tags in EXIF data

Tags written in the DigiKam application, however, don’t appear anywhere in the metadata by default – it’s all hidden stuff in the DigiKam database. Since we have SQL servers, we could just share a database for tagging our images. But that seems silly since the file metadata already has places to include these tags. I could write something that reads from the SQL tables and uses something like exiftool to write the file metadata. Fortunately, there is a configuration option to actually write the tags into the metadata:

Include tags when writing to metadata

Now tags are written into IPTC “Keywords” as well as the XMP CatalogSets, Categories, HierarchicalSubject, LastKeywordXMP, Subject, and TagList fields. Which is sufficient for Windows to display something in the “tags” column.
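
To spot-check (or bulk-write) those fields outside of any one application, exiftool works directly on the files — a sketch with a hypothetical file name and tag:

# Show the keyword/tag fields the various applications read
exiftool -IPTC:Keywords -XMP:Subject -XMP:TagsList -XMP:LastKeywordXMP -XMP:HierarchicalSubject IMG_1234.jpg
# Add a keyword to both the IPTC and XMP fields
exiftool -IPTC:Keywords+=garden -XMP:Subject+=garden IMG_1234.jpg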

Tags written in Darkroom can be stored in a sidecar file — which is messy, adds to the backup requirements, and generally doesn’t work for me.

gThumb lets you multi-select across a directory and bulk-add tags. These get added to IPTC Keywords and XMP Keywords — which do show up in the “Tags” column of Windows Explorer.

Building Vouch OAuth Proxy

I am using an NGINX container which is based on Debian 11 — following the vouch-proxy build instructions failed spectacularly on the first step, reporting that “package embed is not in GOROOT”. It appears that Debian package installation gets you go 1.15 — and ’embed’ wasn’t added until 1.16. So … that’s not great.

As a note to myself — here are the additional packages I install to the base container:

apt-get update
apt-get upgrade
apt-get install vim wget net-tools procps git make gcc g++

To manually install golang on Debian:

  • Find the version you want to run on https://golang.org/dl/ and wget that tar.gz file
    • wget https://go.dev/dl/go1.19.linux-amd64.tar.gz
  • tar -vxf go1.19.linux-amd64.tar.gz
  • mv go /usr/local/
  • vi /etc/bash.bashrc and append the following lines:
    export GOROOT=/usr/local/go
    export PATH=$GOROOT/bin:$PATH
  • Log out and log back in. Test the go installation by running:
    • go version

Now I am able to run their shell script to build the vouch-proxy binary:

  • cd /opt
  • git clone https://github.com/vouch/vouch-proxy.git
  • cd vouch-proxy
  • ./do.sh goget
  • ./do.sh build
  • cd configure
  • cp config.yml_example_oidc config.yml
  • ./vouch-proxy
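
Once it starts, a quick sanity check that the proxy is listening — this assumes the default vouch-proxy port of 9090 from the example config:

curl -s http://localhost:9090/healthcheck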


Medina County GIS — Accessing Linked Data

There’s a lot of data that I’ve been using the old Medina County GIS system to access because I couldn’t figure out how to get it from the new Medina County GIS platform — finally got it!

The first step is to select a data layer that has linked data — this could be surveys, old tax maps, building permits, septic permits, etc.

Once a layer is selected, click on the “Identify” icon — a blue circle with a letter i inside of it. I’ve been using “Draw Point” — this changes your cursor.

On the map, draw your point on the icon for the data element — for tax maps, this is a gray box with some text in it. Building, septic, and other permits have colored shapes.

The trick, then, is to scroll down on the “Identify” tab — it’ll always start with the general parcel information (ownership, parcel size). But, if you scroll down, there will be new sections specific to the data layer you selected. Here, it’s the archived Tax Map scans. Click one of the “View” links …

Voila — you’ll have the information.