Category: Technology

Fedora 40: NFTables not logging

We upgraded Anya’s laptop to Fedora 40, and Skype has evidently moved from an installable RPM to a snap package. That didn’t work with the firewall rules we built earlier in the year (video and audio calls would not connect); and, worse, nothing was being logged. It looks like netfilter kernel logging wasn’t enabled.

Enabled the logging:

echo 1 | sudo tee /proc/sys/net/netfilter/nf_log_all_netns

And, voila, we’ve got log records from nftables. And now Skype works … so I don’t know what to add. Sigh!
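For future reference, the sysctl only turns on the kernel side; records show up for rules that actually include a log statement. Something like this illustrative rule (the table/chain, port, and prefix are placeholders, not our actual ruleset) produces entries you can watch in the kernel log:

# hypothetical rule with a log statement, plus a way to watch for its records
sudo nft add rule inet filter input udp dport 50000 log prefix \"skype-test: \" accept
sudo journalctl -k -f | grep "skype-test"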

Azure DevOps Pipeline Error – Veracode Scan Fails

Pipeline Error:

Build Failed: Error: Exiting Veracode Upload and Scan Task: App not in state where new builds are allowed.

Resolution: There’s a scan in Veracode that never completed. Log into the web UI and delete it!

View the scans in the sandbox. Select the one that says “Request Incomplete”.

Use the ellipsis button to select “Delete Request”.

Confirm deletion.

Voila – now you can re-run the pipeline and the scan will proceed.

Kafka Streams, Consumer Groups, and Stickiness

The Java application I recently inherited had a lot of … quirks. One of the strangest was that it calculated throughput statistics based on ‘start’ values in a cache that was only refreshed every four hours. So at a minute past the data refresh, the throughput is averaged out over that one minute. At three hours and fifty-nine minutes past the data refresh, the throughput is averaged out over three hours and fifty-nine minutes. In the process of correcting this (reading directly from the cached data rather than using an in-memory copy of the cached data), I noticed that the running application paused a lot as the Kafka group was re-balanced.

Which is especially odd because I’ve got a stable number of clients in each consumer group. But pods restart occasionally, and there was nothing done to attempt to stabilize partition assignment.

That’s avoidable, though — Kafka has had mechanisms to reduce re-balancing for a while. StickyAssignor was added in 0.11:

// Set the partition assignment strategy to StickyAssignor
config.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, "org.apache.kafka.clients.consumer.StickyAssignor");

And group.instance.id (static group membership) in 2.3.0:

// Set the group instance ID
String groupInstanceId = UUID.randomUUID().toString();
config.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, groupInstanceId);

Now, I’m certain that a UUID isn’t the best way to go about crafting your group instance ID name … but it produces a “name” that isn’t likely to be duplicated. Since deploying this change, I went from seeing three or four re-balance operations an hour to zero.
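If the pods have stable names (a StatefulSet, for instance), deriving the ID from the pod name would keep the member ID constant across restarts, which is really the point of static membership. A minimal sketch, assuming HOSTNAME holds the pod name:

// Hypothetical alternative: use the stable pod hostname rather than a random UUID
config.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, System.getenv("HOSTNAME"));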

Kafka Streams Group Members and Topic Partitions

I encountered an oddity in a Java application that uses Kafka Streams to scale out reading data from Kafka topics. Data is broken out into multiple topics, and there are Kubernetes pods (“workers”) reading from each topic. Each deployment has a different number of replicas defined, but it appears that no one ever aligned the topic partition counts with the number of workers being deployed.

Kafka Streams assigns “work” to group members by partition. If you have ten partitions and five workers, each worker processes the data from two partitions. However, when the numbers don’t line up … some workers get more partitions than others. Were you to have eleven partitions and five workers, four workers would get data from two partitions and the fifth would get data from three.

Worse – in some cases we have more workers than partitions. Those extra workers are using up some resources, but they’re not actually processing data.

It’s a quick fix — partitions can be added mostly invisibly (the consumer group re-balances, existing write operations don’t really change, and new data just starts landing in the new partitions), so I increased our partition counts to 2x the number of workers. This allows us to add a few workers to a topic if it gets backlogged, but the configuration still distributes the work evenly across all of the normally running pods.
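For reference, bumping a partition count is a one-liner with the stock tooling (the broker, topic name, and count below are placeholders). One caveat: if producers use message keys, adding partitions changes the key-to-partition mapping, so per-key ordering only holds for records written after the change.

# increase the partition count (it can only go up, never down)
kafka-topics.sh --bootstrap-server broker1.example.net:9092 --alter --topic sample-topic --partitions 20

# confirm the new layout
kafka-topics.sh --bootstrap-server broker1.example.net:9092 --describe --topic sample-topic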

PostgreSQL 12 — Cascading Replication

I’ve got replicated PostgreSQL database pairs that each hold some 50TB of data. The server operating systems need to be upgraded, but there is a constraint: no in-place upgrades. I don’t get to veto that constraint (never mind that we could just cross our fingers and upgrade a replica … and, if it failed, build new and pull the data again). Unfortunately, trying to add a second replica delays the existing replication. Since all write operations go to the RW server and all reads go to the read-only replica, having the read-only copy a day or two out of sync whilst this secondary replica comes online is a non-starter.

Fortunately, you can cascade replication — seed the new replica from the current read-only replica. Create a new replication slot — here new_pg_ro_replica_pgdata (slot names may only contain lower-case letters, digits, and underscores). You also need to verify the new server is in the pg_hba.conf file so it can authenticate with the replication account.
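Roughly, that prep looks like this (the IP address and auth method in the pg_hba.conf entry are placeholders):

-- on the server you'll seed from, create the physical replication slot
SELECT pg_create_physical_replication_slot('new_pg_ro_replica_pgdata');

# and add the new replica to pg_hba.conf on that server
host    replication    replicatorID    10.0.0.99/32    md5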

pg_basebackup -h pg-ro-replica.example.net -D /pgdata -U replicatorID -v -P --wal-method=stream --slot=new_pg_ro_replica_pgdata

Wait … wait … wait. It’ll finish eventually. Then add the replication settings to postgresql.auto.conf (recovery.conf is gone as of PostgreSQL 12, and standby_mode has been replaced by the standby.signal file):

primary_conninfo = 'host=pg-rw-replica.example.net port=5432 user=replicatorID password=your_password sslmode=require'
primary_slot_name = 'new_pg_ro_replica_pgdata'

And
touch /pgdata/standby.signal

Finally, start the server

pg_ctl start -D /pgdata
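Once it’s started, it’s worth confirming the new replica is actually streaming before anyone decommissions anything. On the upstream server:

SELECT client_addr, state, sync_state, replay_lag FROM pg_stat_replication;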

Voila — a second read-only replica. Now they can decom the old server.

OpenSearch 2.x CACerts Permission Error

In my dev OpenSearch 2.x environment, I get a strange error indicating that the application cannot read the cacerts file — except the file is world-readable, SELinux is disabled, and there’s nothing at the OS level actually preventing access.

[2024-09-17T12:48:52,666][ERROR][c.a.d.a.h.j.AbstractHTTPJwtAuthenticator] [linux1569.mgmt.windstream.net] Error creating JWT authenticator. JWT authentication will not work
com.amazon.dlic.util.SettingsBasedSSLConfigurator$SSLConfigException: Error loading trust store from /opt/elk/opensearch/jdk/lib/security/cacerts
        at com.amazon.dlic.util.SettingsBasedSSLConfigurator.initFromKeyStore(SettingsBasedSSLConfigurator.java:338) ~[opensearch-security-2.15.0.0.jar:2.15.0.0]
        at com.amazon.dlic.util.SettingsBasedSSLConfigurator.configureWithSettings(SettingsBasedSSLConfigurator.java:196) ~[opensearch-security-2.15.0.0.jar:2.15.0.0]
        at com.amazon.dlic.util.SettingsBasedSSLConfigurator.buildSSLContext(SettingsBasedSSLConfigurator.java:117) ~[opensearch-security-2.15.0.0.jar:2.15.0.0]
        at com.amazon.dlic.util.SettingsBasedSSLConfigurator.buildSSLConfig(SettingsBasedSSLConfigurator.java:131) ~[opensearch-security-2.15.0.0.jar:2.15.0.0]
        at com.amazon.dlic.auth.http.jwt.keybyoidc.HTTPJwtKeyByOpenIdConnectAuthenticator.getSSLConfig(HTTPJwtKeyByOpenIdConnectAuthenticator.java:65) ~[opensearch-security-2.15.0.0.jar:2.15.0.0]
        at com.amazon.dlic.auth.http.jwt.keybyoidc.HTTPJwtKeyByOpenIdConnectAuthenticator.initKeyProvider(HTTPJwtKeyByOpenIdConnectAuthenticator.java:47) ~[opensearch-security-2.15.0.0.jar:2.15.0.0]
        at com.amazon.dlic.auth.http.jwt.AbstractHTTPJwtAuthenticator.<init>(AbstractHTTPJwtAuthenticator.java:89) [opensearch-security-2.15.0.0.jar:2.15.0.0]
        at com.amazon.dlic.auth.http.jwt.keybyoidc.HTTPJwtKeyByOpenIdConnectAuthenticator.<init>(HTTPJwtKeyByOpenIdConnectAuthenticator.java:26) [opensearch-security-2.15.0.0.jar:2.15.0.0]
        at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62) ~[?:?]
        at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502) ~[?:?]
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486) ~[?:?]
        at org.opensearch.security.support.ReflectionHelper.instantiateAAA(ReflectionHelper.java:62) [opensearch-security-2.15.0.0.jar:2.15.0.0]
        at org.opensearch.security.securityconf.DynamicConfigModelV7.lambda$newInstance$1(DynamicConfigModelV7.java:432) [opensearch-security-2.15.0.0.jar:2.15.0.0]
        at java.base/java.security.AccessController.doPrivileged(AccessController.java:319) [?:?]
        at org.opensearch.security.securityconf.DynamicConfigModelV7.newInstance(DynamicConfigModelV7.java:430) [opensearch-security-2.15.0.0.jar:2.15.0.0]
        at org.opensearch.security.securityconf.DynamicConfigModelV7.buildAAA(DynamicConfigModelV7.java:329) [opensearch-security-2.15.0.0.jar:2.15.0.0]
        at org.opensearch.security.securityconf.DynamicConfigModelV7.<init>(DynamicConfigModelV7.java:102) [opensearch-security-2.15.0.0.jar:2.15.0.0]
        at org.opensearch.security.securityconf.DynamicConfigFactory.onChange(DynamicConfigFactory.java:288) [opensearch-security-2.15.0.0.jar:2.15.0.0]
        at org.opensearch.security.configuration.ConfigurationRepository.notifyAboutChanges(ConfigurationRepository.java:570) [opensearch-security-2.15.0.0.jar:2.15.0.0]
        at org.opensearch.security.configuration.ConfigurationRepository.notifyConfigurationListeners(ConfigurationRepository.java:559) [opensearch-security-2.15.0.0.jar:2.15.0.0]
        at org.opensearch.security.configuration.ConfigurationRepository.reloadConfiguration0(ConfigurationRepository.java:554) [opensearch-security-2.15.0.0.jar:2.15.0.0]
        at org.opensearch.security.configuration.ConfigurationRepository.loadConfigurationWithLock(ConfigurationRepository.java:538) [opensearch-security-2.15.0.0.jar:2.15.0.0]
        at org.opensearch.security.configuration.ConfigurationRepository.reloadConfiguration(ConfigurationRepository.java:531) [opensearch-security-2.15.0.0.jar:2.15.0.0]
        at org.opensearch.security.configuration.ConfigurationRepository.initalizeClusterConfiguration(ConfigurationRepository.java:284) [opensearch-security-2.15.0.0.jar:2.15.0.0]
        at org.opensearch.security.configuration.ConfigurationRepository.lambda$initOnNodeStart$10(ConfigurationRepository.java:439) [opensearch-security-2.15.0.0.jar:2.15.0.0]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.security.AccessControlException: access denied ("java.io.FilePermission" "/opt/elk/opensearch/jdk/lib/security/cacerts" "read")
        at java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:488) ~[?:?]
        at java.base/java.security.AccessController.checkPermission(AccessController.java:1071) ~[?:?]
        at java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:411) ~[?:?]
        at java.base/java.lang.SecurityManager.checkRead(SecurityManager.java:742) ~[?:?]
        at java.base/sun.nio.fs.UnixPath.checkRead(UnixPath.java:789) ~[?:?]
        at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:49) ~[?:?]
        at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:171) ~[?:?]
        at java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99) ~[?:?]
        at java.base/java.nio.file.spi.FileSystemProvider.readAttributesIfExists(FileSystemProvider.java:1270) ~[?:?]
        at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributesIfExists(UnixFileSystemProvider.java:191) ~[?:?]
        at java.base/java.nio.file.Files.isDirectory(Files.java:2319) ~[?:?]
        at org.opensearch.security.support.PemKeyReader.checkPath(PemKeyReader.java:214) ~[opensearch-security-2.15.0.0.jar:2.15.0.0]
        at org.opensearch.security.support.PemKeyReader.resolve(PemKeyReader.java:290) ~[opensearch-security-2.15.0.0.jar:2.15.0.0]
        at org.opensearch.security.support.PemKeyReader.resolve(PemKeyReader.java:276) ~[opensearch-security-2.15.0.0.jar:2.15.0.0]
        at com.amazon.dlic.util.SettingsBasedSSLConfigurator.initFromKeyStore(SettingsBasedSSLConfigurator.java:327) ~[opensearch-security-2.15.0.0.jar:2.15.0.0]
        ... 25 more

Looks like Java has its own security mechanism — the java.policy needed to be updated to allow read access to cacerts (why!?!?!?)

vi /opt/elk/opensearch/jdk/conf/security/java.policy

// Add this permission (it has to live inside a grant block):
grant {
    permission java.io.FilePermission "/opt/elk/opensearch/jdk/lib/security/cacerts", "read";
};
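The policy file is only read at JVM startup, so the node needs a restart to pick up the change (assuming a systemd-managed service named opensearch; adjust for however your nodes are started):

systemctl restart opensearch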

Removing and Recreating a ZFS Pool

In testing out various ways to achieve disk compression on our PostgreSQL servers, I ended up with a server built with a version of ZFS newer than what the package distribution provides. Which means I needed to recreate the pool with an older version of ZFS that would be updated as part of routine patching. Beyond backing up and restoring the data …

# Get rid of existing pool

zpool export pgpool
zpool destroy pgpool
zpool list # this still shows a pool on sdb

# Clear the label

zpool labelclear /dev/sdb

# Didn’t work, so blow away everything on sdb

dd if=/dev/zero of=/dev/sdb bs=1M count=10
wipefs -a /dev/sdb

# Uninstall custom built zfs

cd /root/zfs
make uninstall

# Install new ZFS

yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
yum install kernel-devel

yum install https://zfsonlinux.org/epel/zfs-release-2-3$(rpm --eval "%{dist}").noarch.rpm
dnf config-manager --disable zfs
dnf config-manager --enable zfs-kmod
yum install zfs

# Sign kernel modules

/usr/src/kernels/$(uname -r)/scripts/sign-file sha256 /root/signing/MOK.priv /root/signing/MOK.der /lib/modules/$(uname -r)/extras/zfs/avl/zavl.ko

/usr/src/kernels/$(uname -r)/scripts/sign-file sha256 /root/signing/MOK.priv /root/signing/MOK.der /lib/modules/$(uname -r)/extras/zfs/zfs/zfs.ko
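If Secure Boot is enabled, the signed modules still won’t load unless the MOK public key is enrolled. If it isn’t already, import it and complete the enrollment prompt on the next boot (same key pair used for signing above):

mokutil --import /root/signing/MOK.der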

# Reboot

init 6

# And start over — recreate the pool

zpool create pgpool sdb
zfs create pgpool/pgdata
zfs set compression=lz4 pgpool/pgdata
df -h /pgpool/pgdata/
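And a quick sanity check that the pool is healthy, compression took effect, and the module is now the packaged version:

zpool status pgpool
zfs get compression pgpool/pgdata
zfs version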

Using Templates in Azure Build Pipelines

I inherited a Java application that is actually five applications — and the build pipeline had a lot of repetition. Tell Maven to use this POM file, now use that one, and now the other one. It wasn’t great, but it got even more cumbersome when I needed to split the production and development builds onto different pools (network rule: prod and dev servers may not communicate … so the dev agent talks to the dev image repo which is used by the dev deployment, and the prod agent talks to the prod image repo which is used by the prod deployment). Instead of having five “hey, Maven, do this build” blocks, I now have ten.

So I created a template for the build step — jdk-path and maven-path are pipeline variables. The rest is the Maven build task with parameters to supply the step display name, pom file to use, and environment flag.

Maven Build Template:

# maven-build-template.yml
parameters:
  - name: pomFile
    type: string
  - name: dockerEnv
    type: string
  - name: displayName
    type: string
 
steps:
  - task: Maven@3
    displayName: '${{ parameters.displayName }}'
    inputs:
      mavenPomFile: '${{ parameters.pomFile }}'
      mavenOptions: '-Xmx3072m'
      javaHomeOption: 'Path'
      jdkDirectory: $(jdk-path)
      mavenVersionOption: 'Path'
      mavenDirectory: $(maven-path)
      mavenSetM2Home: true
      jdkArchitectureOption: 'x64'
      publishJUnitResults: true
      testResultsFiles: '**/surefire-reports/TEST-*.xml'
      goals: 'package -Denv=${{ parameters.dockerEnv }} jib:build'

Then my build pipeline uses the template and supplies a few parameters

Pipeline:

# azure-pipelines.yml
trigger: none
 
variables:
  appName: 'NPM'
 
stages:
  - stage: Build
    jobs:
      - job: NonProdBuild
        condition: ne(variables['Build.SourceBranchName'], 'production')
        displayName: 'Build non-production branch'
        variables:
          DockerFlag: 'docker_dev'
        pool:
          name: 'Engineering NPM'
        steps:
          - template: maven-build-template.yml
            parameters:
              pomFile: 'JAVA/KafkaStreamsApp/npm/pom.xml'
              dockerEnv: $(DockerFlag)
              displayName: 'Building Kafka Streams App'
 
          - template: maven-build-template.yml
            parameters:
              pomFile: 'JAVA/DataSync/npmInfo/pom.xml'
              dockerEnv: $(DockerFlag)
              displayName: 'Building Data Sync App'
 
          - template: maven-build-template.yml
            parameters:
              pomFile: 'JAVA/GroupingRules/pom.xml'
              dockerEnv: $(DockerFlag)
              displayName: 'Building Grouping Rules App'
 
          - template: maven-build-template.yml
            parameters:
              pomFile: 'JAVA/Errorhandler/pom.xml'
              dockerEnv: $(DockerFlag)
              displayName: 'Building Error Handler App'
 
          - template: maven-build-template.yml
            parameters:
              pomFile: 'JAVA/Events/pom.xml'
              dockerEnv: $(DockerFlag)
              displayName: 'Building Events App'
 
      - job: ProdBuild
        condition: eq(variables['Build.SourceBranchName'], 'production')
        displayName: 'Build production branch'
        variables:
          DockerFlag: 'docker_prod'
        pool:
          name: 'Engineering NPM Prod'
        steps:
          - template: maven-build-template.yml
            parameters:
              pomFile: 'JAVA/KafkaStreamsApp/npm/pom.xml'
              dockerEnv: $(DockerFlag)
              displayName: 'Building Kafka Streams App'
 
          - template: maven-build-template.yml
            parameters:
              pomFile: 'JAVA/DataSync/npmInfo/pom.xml'
              dockerEnv: $(DockerFlag)
              displayName: 'Building Data Sync App'
 
          - template: maven-build-template.yml
            parameters:
              pomFile: 'JAVA/GroupingRules/pom.xml'
              dockerEnv: $(DockerFlag)
              displayName: 'Building Grouping Rules App'
 
          - template: maven-build-template.yml
            parameters:
              pomFile: 'JAVA/Errorhandler/pom.xml'
              dockerEnv: $(DockerFlag)
              displayName: 'Building Error Handler App'
 
          - template: maven-build-template.yml
            parameters:
              pomFile: 'JAVA/Events/pom.xml'
              dockerEnv: $(DockerFlag)
              displayName: 'Building Events App'

I think this could be made more concise … but it will do for now!
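One option for trimming it further (an untested sketch, not what’s deployed): Azure Pipelines templates support ${{ each }} loops, so the five template references in each job could collapse into a single wrapper template that takes the list of POM files as an object parameter.

# maven-build-loop.yml (hypothetical wrapper around the existing template)
parameters:
  - name: dockerEnv
    type: string
  - name: pomFiles
    type: object

steps:
  - ${{ each pom in parameters.pomFiles }}:
      - template: maven-build-template.yml
        parameters:
          pomFile: '${{ pom }}'
          dockerEnv: '${{ parameters.dockerEnv }}'
          displayName: 'Building ${{ pom }}'

Each job would then reference maven-build-loop.yml once and pass the five pom.xml paths as the pomFiles parameter.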

Docker Registry: Listing Images and Timestamps

I wanted a quick way to verify that Docker images have actually been pushed to the registry … I’m using Distribution, and only wanted to report on images that start with sample (because the repository is shared & I don’t want to read through the very long list of other people’s images)

#!/bin/bash
 
registry="registryhost.example.net:5443"
authHeader="Authorization: Basic AUTHSTRINGHERE"
 
# List all repositories
repositories=$(curl -s -H "$authHeader" https://$registry/v2/_catalog | jq -r '.repositories[]')
 
for repo in $repositories; do
  # Check if the repository name starts with "sample"
  if [[ $repo == sample* ]]; then
 
    # List all tags for the repository
    tags=$(curl -s -H "$authHeader" https://$registry/v2/$repo/tags/list | jq -r '.tags[]')
 
    for tag in $tags; do
 
      # Get the manifest for the tag
      manifest=$(curl -s -H "$authHeader" -H "Accept: application/vnd.docker.distribution.manifest.v2+json" https://$registry/v2/$repo/manifests/$tag)
 
      # Extract the digest for the config
      configDigest=$(echo $manifest | jq -r '.config.digest')
 
      # Get the config blob
      configBlob=$(curl -s -H "$authHeader" https://$registry/v2/$repo/blobs/$configDigest)
 
      # Extract the last modified date from the config history
      lastModifiedDate=$(echo $configBlob | jq -r '[.history[].created] | max')
 
      echo -e "$repo\t$tag\t$lastModifiedDate"
    done
  fi
done
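The output is one tab-separated line per tag; these values are made up, just to show the shape:

sample-data-sync        1.4.2     2024-09-12T14:03:27.618901520Z
sample-kafka-streams    latest    2024-09-16T08:15:03.123456789Z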

Verifying Public and Private Keys Go Together

I have no idea how exactly I managed this — but I was renewing certificates on a group of servers and had one that would not work. It’s a Java app, and it just threw a generic handshake error. Even adding debugging didn’t add any useful information. It just didn’t work. Turns out my public key and private key files didn’t go together. I didn’t bother figuring out which one I got wrong — I just downloaded the zip file from our cert provider again.

Using openssl to check the modulus of the cert and key — by getting an md5 checksum of each value, it’s a little easier to compare. This public/private key pair goes together — they’ve got the same modulus. My original files? Not so much — two different values!

linux1570:certs # openssl x509 -noout -modulus -in /opt/elk/opensearch_config/certs/20240722/$(hostname).pem | openssl md5
(stdin)= 52ca3e85fa7cb564dd395a8f801f9bdf
linux1570:certs # openssl rsa -noout -modulus -in /opt/elk/opensearch_config/certs/20240722/$(hostname)-nopass.key | openssl md5
(stdin)= 52ca3e85fa7cb564dd395a8f801f9bdf
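The modulus trick only applies to RSA keys; comparing the full public key works for any key type (same idea, paths are placeholders):

openssl x509 -noout -pubkey -in cert.pem | openssl md5
openssl pkey -pubout -in key.pem | openssl md5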