
On UUIDs

RFC 4122 UUID Versions:

1 — Datetime and MAC based
48-bit MAC address, 60-bit timestamp, and a 13-14 bit clock sequence to uniquify UUIDs generated within the same timestamp

2 — Datetime and MAC based with DCE security
A local domain replaces the 8 least significant bits of the clock sequence, and a local identifier (e.g. a POSIX UID) replaces the least significant 32 bits of the timestamp. The RFC doesn't really provide details on DCE security.

3 — Hashed Namespace
MD5 hash of a namespace UUID and a name

4 — Random
6 pre-determined bits (4 version bits plus 2-3 variant bits for variant 1 or 2) and 122 random bits, yielding 2^122 possible v4 variant 1 UUIDs

5 — Hashed Namespace
SHA-1 hash of a namespace UUID and a name

In my case, I hesitate to use a v1 or v2 UUID because I have scripts executing via cron on the same host. The probability of the function being called in the same microsecond seems higher than the probability of the pseudo-random number generator producing the same value within the handful of hours for which the UUIDs are persisted for deduplication.

v3 or v5 UUIDs are my fallback position if we're seeing duplicates with v4: the name hashed under the namespace would need to glom together the script name and microsecond time to make a unique string when multiple scripts are running the function concurrently. A rough sketch of both options follows.
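
Using Python's built-in uuid module, that plan looks something like this sketch. The script name, the use of time.time_ns(), and the choice of uuid.NAMESPACE_DNS are illustrative placeholders, not settled design:

import time
import uuid

# v4: 122 random bits, no time or host component
print(uuid.uuid4())

# v5 fallback: SHA-1 of a namespace UUID plus a name; gluing the script
# name to the current microsecond keeps concurrently running scripts
# from hashing identical inputs (placeholder naming scheme)
strName = f"my_batch_script.py:{time.time_ns() // 1000}"
print(uuid.uuid5(uuid.NAMESPACE_DNS, strName))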

Kafka Troubleshooting (for those who enjoy reading network traces)

I finally had a revelation that allowed me to definitively prove that I am not doing anything strange that is causing duplicated messages to appear in the Kafka stream: it's a cleartext protocol! That means you can use Wireshark, tcpdump, etc. to capture everything that goes over the wire. This shows that the GUID I generated for the duplicated message only appears one time in the network trace. Whatever funky stuff is going on that makes the client see it twice? Not me 😊

I used tcpdump because the batch server doesn’t have tshark (and it’s not my server, so I’m not going to go requesting additional binaries if there’s something sufficient for my need already available). Ran tcpdump -w /srv/data/ljr.cap port 9092 to grab everything that transits port 9092 while my script executed. Once the batch completed, I stopped tcpdump and transferred the file over to my workstation to view the capture in Wireshark. Searched the packet bytes for my duplicated GUID … and there’s only one.
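
Wireshark's packet-bytes search did the job, but the same check can be scripted. A sketch assuming the scapy package is installed and the GUID appears as ASCII in the payload (the GUID value is a placeholder):

from scapy.all import rdpcap

strGuid = b'0f8fad5b-d9cb-469f-a165-70867728950e'  # placeholder GUID to hunt for
packets = rdpcap('/srv/data/ljr.cap')

# Count packets whose raw bytes contain the GUID
iHits = sum(1 for pkt in packets if strGuid in bytes(pkt))
print(f"{iHits} packet(s) contain the GUID")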

Confluent Kafka Queue Length

The documentation for the Python Confluent Kafka module includes a len function on the producer. I wanted to use the function because we're getting a number of duplicated messages on the client, and I was trying to isolate what might be causing the problem. Unfortunately, calling producer.len() failed, indicating there's no len() method. I used dir(producer) to show that, no, there isn't a len() method.

I realized today that the documentation is telling me that I can call the built-in len() function on a producer to get the queue length.

Code:

print(f"Before produce there are {len(producer)} messages awaiting delivery")
producer.produce(topic, key=bytes(str(int(cs.timestamp)), 'utf8'), value=cs.SerializeToString())
print(f"After produce there are {len(producer)} messages awaiting delivery")
producer.poll(0) # Per https://github.com/confluentinc/confluent-kafka-python/issues/16 for queue full error
print(f"After poll0 there are {len(producer)} messages awaiting delivery")

Output:

Before produce there are 160 messages awaiting delivery
After produce there are 161 messages awaiting delivery
After poll0 there are 155 messages awaiting delivery

Boolean Opts in Python

I have a few command line arguments on a Python script that are most readily used if they are boolean. I sometimes need a “verbose” option for script debugging — print a lot of extra stuff to show what’s going on, and I usually want a “dry run” option where the script reads data, performs calculations, and prints results to the screen without making any changes or sending data anywhere (database, email, etc). To use command line arguments as boolean values, I use a function that converts a variety of possible inputs to True/False.

import argparse

def string2boolean(strInput):
    """
    :param strInput: String to be converted to boolean
    :return: Boolean representation of input
    """
    if isinstance(strInput, bool):
        return strInput
    if strInput.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    elif strInput.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    else:
        raise argparse.ArgumentTypeError('Boolean value expected.')

Use “type” when adding the argument to run the input through your function.

    parser.add_argument('-r', '--dryrun', action='store', type=string2boolean, dest='boolDryRun', default=False, help="Preview data processing without sending data to DB or Kafka. Valid values: 'true' or 'false'.")
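
Putting it together, a minimal sketch (the verbose flag and the sample argument values are illustrative; passing a list to parse_args simulates the command line):

parser = argparse.ArgumentParser()
parser.add_argument('-r', '--dryrun', action='store', type=string2boolean, dest='boolDryRun', default=False, help="Preview data processing without sending data to DB or Kafka. Valid values: 'true' or 'false'.")
parser.add_argument('-v', '--verbose', action='store', type=string2boolean, dest='boolVerbose', default=False, help="Print extra debugging output. Valid values: 'true' or 'false'.")

args = parser.parse_args(['--dryrun', 'yes', '--verbose', 'n'])
print(args.boolDryRun, args.boolVerbose)  # True False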

Troubleshooting Kafka

Our server metrics are fed into a Kafka bus, and various applications are able to pick up and process this data. The problem, however, is that not everything I send ends up in the downstream system. The confluent_kafka module I'm using in Python reports that data is sent along its merry way, but the primary system used to present metrics to end users says it's not consistently getting data across the channel. It's not that nothing ever arrives, as though something were outright broken; rather, there are long periods with no data followed by a cycle where data shows up.

I've exhausted all of the in-script debugging I can; the messages are getting there. But I wondered if the async nature of Kafka might mean that the client's "it got there" wouldn't actually mean something arrived. So I had to figure out how to test a Kafka server the same way I test my MQTT server: a quick command-line program to send a message, and another quick command-line program to subscribe to various topics.
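
As a sanity check on the client side, confluent_kafka can also report per-message delivery results through a callback, so the broker's acknowledgement, not just the local enqueue, confirms a message arrived. A minimal sketch, with the broker address and topic as placeholders:

from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'yourkafkaserver.example.com:9092'})

def on_delivery(err, msg):
    # Invoked once the broker acks the message (or delivery fails)
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}")

producer.produce('Test', value=b'hello', on_delivery=on_delivery)
producer.flush()  # Block until all outstanding deliveries are resolved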

Turns out this is easier than anticipated: the binary build of Kafka includes Windows batch files. Download the latest Kafka binary and untar/unzip it somewhere. This is easy if you have the Win32 port of the GNU utilities and can just run "tar vxfz kafka_2.13-2.8.0.tgz".

In the .\kafka<version>\bin\windows folder, there are kafka-console-consumer.bat and kafka-console-producer.bat files that can be used for testing Kafka. You can open two command prompts — one for the producer (sending data to Kafka) and one for the consumer (watching Kafka for new messages). In the consumer window, run

kafka-console-consumer.bat --bootstrap-server yourkafkaserver.example.com:Port --topic Test

Then, in the producer, run

kafka-console-producer.bat --broker-list yourkafkaserver.example.com:Port --topic Test

The producer will bring you to a “>” prompt where you can type some strings and hit enter to send the message to Kafka. You should see the messages pop into the consumer window.

To subscribe to multiple topics, use "--whitelist" followed by a pipe-delimited list of topics, e.g. --whitelist "Test|OtherTopic".

Oracle Collection Instead of Dual

I'm still retrofitting a bunch of SQL queries to use bind_by_name and came across a strange scenario. I created a recursive query (STARTS WITH / CONNECT BY PRIOR), but I needed to grab the original value too. The quickest way to accomplish this was to union in something like "select '12345CDE' as equipment_id from dual". But the only way to get a bunch of these original values grafted onto the result set is to iterate through the array once to build my :placeholder1, :placeholder2, …, :placeholderN placeholders and then iterate through the array again to bind each placeholder to its proper value.

I’ve been working with Oracle collections for LIKE and IN queries, and thought I could use a table that only exists within the query to glom the entire array into a single placeholder. It works! A query like

select column_value equipment_id from TABLE(sys.ODCIVARCHAR2LIST('12345CDE', '23456BCD', '34567ABC') );

adds each of the values to my result set.

Which means I can use a query like “select column_value as equipment_id from TABLE(:myIDs)” and bind the collection to :myIDs.
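
If you're doing the same thing from Python, a sketch of the equivalent bind with the cx_Oracle module (connection details and ID values are placeholders):

import cx_Oracle

conn = cx_Oracle.connect('user', 'password', 'dbhost.example.com/SERVICE')  # placeholder credentials
cursor = conn.cursor()

# Build a SYS.ODCIVARCHAR2LIST collection from a Python list
typCollection = conn.gettype('SYS.ODCIVARCHAR2LIST')
collIDs = typCollection.newobject()
collIDs.extend(['12345CDE', '23456BCD', '34567ABC'])

cursor.execute('SELECT column_value AS equipment_id FROM TABLE(:myIDs)', myIDs=collIDs)
for row in cursor:
    print(row)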

Docker – List Container Startup Policies

A quick one-line Bash command to output all containers and the startup policy:

docker container ls -aq | xargs docker container inspect --format '{{ .Name }}: {{.HostConfig.RestartPolicy.Name}}'

For Docker on Windows, the following PowerShell command produces similar results:

$jsonData = docker container ls -aq |%{docker container inspect --format "{{json .}}" $_}
[System.Collections.ArrayList]$arrayContainerConfig = @()
foreach($jsonContainerConfig in $jsonData ){
	$psobjContainerConfig = ConvertFrom-JSON $jsonContainerConfig
	$null = $arrayContainerConfig.Add(@{Name=$psobjContainerConfig.Name;Hostname=$psobjContainerConfig.Config.Hostname;CurrentlyRunning=$psobjContainerConfig.State.Running;RestartPolicy=$psobjContainerConfig.HostConfig.RestartPolicy.Name}) # discard the index Add() returns
}

$arrayContainerConfig | ForEach {[PSCustomObject]$_} | Format-Table -AutoSize

Oracle – Collections and IN or LIKE Queries

I’ve been retrofitting a lot of PHP/SQL queries to use oci_bind_by_name recently. When using “IN” clauses, you can iterate through your array twice, but it’s an inefficient approach.

// Build the query string with a bunch of placeholders
$strQuery = "select Col1, Col2, Col3 from TableName where ColName IN (";
for($i=0; $i < count($array); $i++){
    if($i > 0){
        $strQuery = $strQuery . ", ";
    }
    $strQuery = $strQuery . ":bindvar" . $i;
}
$strQuery = $strQuery . ")";
... 
// Then bind each placeholder to something
for($i=0; $i < count($array); $i++){
    oci_bind_by_name($stmt, ":bindvar".$i, $array[$i]);
}

Building a table from the array data and using an Oracle collection object creates cleaner code and avoids a second iteration of the array:

$strQuery = "SELECT indexID, objName FROM table WHERE objName in (SELECT column_value FROM table(:myIds))";
$stmt = oci_parse($conn, $strQuery);

$coll = oci_new_collection($conn, 'ODCIVARCHAR2LIST', 'SYS');
foreach ($arrayValues as $strValue) {
    $coll->append($strValue);
}
oci_bind_by_name($stmt, ':myIds', $coll, -1, OCI_B_NTY);
oci_set_prefetch($stmt, 300);
oci_execute($stmt);

A simple LIKE clause is quite straightforward:

$strNameLikeString = "SomeName%";
$strQuery = "SELECT ds_dvrsty_set_nm from ds_dvrsty_set WHERE ds_dvrsty_set_nm LIKE :divsetnm ORDER BY ds_dvrsty_set_nm DESC fetch first 1 row only";

$stmt = oci_parse($connDB, $strQuery);
oci_bind_by_name($stmt, ":divsetnm", $strNameLikeString);
oci_set_prefetch($stmt, 300);
oci_execute($stmt);

But what about an array of inputs, essentially reproducing the LIKE ANY predicate in PostgreSQL? There's no direct equivalent in Oracle, and iterating through the array twice to build out a query with WHERE (Field1 LIKE 'Thing1%' OR Field1 LIKE 'Thing2%' OR Field1 LIKE 'Thing3%') is undesirable. Using EXISTS with a collection table allows me to create a LIKE ANY type query and only iterate through my array once, binding variables to placeholders using the same collection approach as the IN clause.

$arrayLocs = array('ERIEPAXE%', 'HNCKOHXA%', 'LTRKARXK%');
$strQuery = "SELECT location_id, clli_code FROM network_location WHERE EXISTS (select 1 FROM TABLE(:likelocs) WHERE clli_code LIKE column_value)";
$stmt = oci_parse($connDB, $strQuery);

$coll = oci_new_collection($connDB, 'ODCIVARCHAR2LIST','SYS');
foreach ($arrayLocs as $strLocation) {
    $coll->append($strLocation);
}
oci_bind_by_name($stmt, ':likelocs', $coll, -1, OCI_B_NTY);
oci_execute($stmt);
print "<table>\n";
print "<tr><th>Loc ID</th><th>CLLI</th></tr>\n";
while ($row = oci_fetch_array($stmt, OCI_ASSOC+OCI_RETURN_NULLS)) {
    print "<tr><td>" . $row['LOCATION_ID'] . "</td><td>" . $row['CLLI_CODE'] . "</td></tr>\n";
}
print "</table>\n";

There are many different collection types in Oracle which can be used with oci_new_collection. A full list of the system collection types can be queried from the database.

SELECT * FROM SYS.ALL_TYPES WHERE TYPECODE = 'COLLECTION' and OWNER = 'SYS';