You ever navigate away from a discussion and realize you needed to go back — or not quite remember where you just posted that message? Teams now has a “Back” button — in the upper left-hand corner of the Teams client, you can click back and forth to navigate between the last 12 channels/chats you’ve visited.
Kibana – Defining Index Patterns
This isn’t something I will notice again now that I’ve got my heap space sorted … but when you are using Kibana to add index patterns, there’s no debounce: nothing waits until you’ve stopped typing for a few seconds or requires a minimum number of characters before searching. So, after each character you type into the dialog, a search is performed against the ES server. Here I wanted to create a pattern for logstash-*, which sent nine different search requests to my ES. And, if the server is operating reasonably, all of these searches would deliver a result set that Kibana essentially ignores as I’m typing more characters. But, in my case, the ES server fell over before I could even type logstash!
ElasticSearch Java Heap Size — Order of Precedence
I was asked to look at a malfunctioning ElasticSearch server this week. It’s a lab sandbox, so not a huge deal … but still something they wanted online and functioning for some proof-of-concept testing. And, as a bonus, they’re willing to let us use their lab servers for our own sandboxing (IPv6 implementation, ELK upgrades). There were a handful of problems (it looks like the whole thing used to be a multi-node cluster but the second cluster node has vanished without any trace, the entire platform was re-IP’d, vm.max_map_count wasn’t set on the Docker server, and the logstash folder had a backup of a pipeline config in /path/to/logstash/pipeline/ … which still loads, so causes a continual stream of port-in-use exceptions). But the biggest problem was that any attempt at actually using the ElasticSearch server resulted in it falling over. Java crashed with an out of memory error because the heap space was exhausted. Now I’ve had really small sandboxes before — my first ES sandbox only had a gig of memory and was quite prone to this crash. But the lab server they’ve set up has 64GB of memory. So allocating a few gigs seemed like a quick solution.
The jvm.options file and jvm.options.d folder weren’t mounted into the container — they were the default files held within the container. Which seemed odd, and I made a mental note that it was something we’d need to either mount in or update again when the container gets updated. But no matter how much heap space I allocated, ES crashed.
I discovered that the Docker deployment set an ENV variable for ES_JAVA_OPTS — something which, per the ElasticSearch documentation, overrides all other JVM options. So no matter what I was putting into the jvm options file, the 256 meg set in the ENV was actually being used.
Luckily it’s not terribly difficult to modify the ENVs within an existing container. You could, of course, redeploy the container with the new settings (and I’ll do that next time, since I’ve also got to get IPv6 enabled). But I wasn’t planning on making any other changes just yet.
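As a sketch of what that redeployment could look like — the container name, ports, data mount, and image tag below are placeholders, not the actual lab configuration:

# Hypothetical example: redeploy with a 4GB heap set via ES_JAVA_OPTS
docker run -d --name elasticsearch \
  -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "ES_JAVA_OPTS=-Xms4g -Xmx4g" \
  -v /path/to/esdata:/usr/share/elasticsearch/data \
  docker.elastic.co/elasticsearch/elasticsearch:7.17.0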
Linux – Clearing Caches
I encountered some documentation at work that provided a process for clearing caches. It wasn’t wrong per se, but it showed a lack of understanding of what was being performed. I enhanced our documentation to explain what was happening and why the series of commands was redundant. Figured I’d post my revisions here in case they’re useful for someone else.
Only clean caches can be dropped — dirty ones need to be written somewhere before they can be dropped. Before dropping caches, flush the file system buffer using sync — this tells the kernel to write dirty cache pages to disk (or, well, write as many as it can). This maximizes the number of cache pages that can be dropped. You don’t have to run sync, but doing so makes the subsequent commands more effective.
Page cache is memory that’s held after reading a file. Linux tends to keep the files in cache on the assumption that a file that’s been read once will probably be read again. Clear the pagecache using echo 1 > /proc/sys/vm/drop_caches — this is the safest to use in production and generally a good first try.
If clearing the pagecache has not freed sufficient memory, proceed to this step. The dentries (directory cache) and inodes cache are memory held after reading file attributes (run strace and look at all of those stat() calls!). Clear the dentries and inodes using echo 2 > /proc/sys/vm/drop_caches — this is kind of a last-ditch effort for a production environment. Better than having it all fall over, but things will be a little slow as all of the in-flight processes repopulate the cached data.
You can clear the pagecache, dentries, and inodes all at once using echo 3 > /proc/sys/vm/drop_caches — this is a good shortcut in a non-production environment. But, if you’ve already run 1 and 2 … well, 3 = 1+2, so running 1, 2, and then 3 is redundant.
Another note from other documentation I’ve encountered — you can use sysctl to clear the caches, but this can cause a deadlock under heavy load … as such, I don’t do this. The syntax is sysctl -w vm.drop_caches=1 where the number corresponds to the 1, 2, and 3 described above.
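Putting it all together, a minimal sketch of the sequence (run as root), with free -h before and after so you can see how much memory was actually released:

free -h                               # memory usage before
sync                                  # flush dirty pages to disk
echo 1 > /proc/sys/vm/drop_caches     # drop the pagecache
echo 2 > /proc/sys/vm/drop_caches     # then dentries and inodes, if you still need more memory
free -h                               # memory usage after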
ElasticSearch – Useful API Commands
In all of these examples, the copy/paste text uses localhost and port 9200. Since some of my sandboxes don’t use the default port, some of the example outputs will use a different port. Obviously, use your hostname and port. And, if your ES instance requires authentication, add the “-u” option with the user (or user:password … but that’s not a good idea outside of sandboxes, as the password is then stored in the shell history). If you are using https for the API endpoint, you may also need to add the “-k” option, which skips certificate verification for an untrusted SSL connection (e.g. when the CA isn’t trusted by your OS).
curl -k -u elastic https://localhost...
Listing All Indices
Use the following command to list all of the indices in the ES system:
curl http://localhost:9200/_cat/indices?v
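The output can also be filtered by index health, which is handy when you only care about the problem indices (this is a standard _cat parameter, not anything specific to my install):

curl 'http://localhost:9200/_cat/indices?v&health=yellow'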
Listing All Templates
Use the following command to list all of the templates:
curl http://localhost:9200/_cat/templates?pretty
Explain Shard Allocation
I was asked to help get an ELK installation back into working order, and one of the things I noticed was that all of the indices were yellow. The log file showed allocation errors. This command reports on the allocation decision being made. In the case I was looking at, the problem became immediately obvious: it was a single-node system with 1 replica defined, and the explanation was that the replica shard could not be allocated because a copy of the shard already exists on that node.
curl http://localhost:9200/_cluster/allocation/explain
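With no request body, the API explains an arbitrary unassigned shard. You can also ask about a specific shard by sending a body (the index name here is just an example):

curl -X POST http://localhost:9200/_cluster/allocation/explain -H 'content-type: application/json' -d '{"index": "logstash-2021.05.08", "shard": 0, "primary": false}'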
If the maximum number of allocation retries has been exceeded, you can force ES to retry allocation (for example, a disk was full for an extended period of time, but space has since been cleared and everything should work now):
curl -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'
Set the Number of Replicas for a Single Index
Once I identified that the single-node ELK instance had indices configured with replicas, the fix was to set the replica count to zero — the following updates a single index:
curl -X PUT http://127.0.0.1:9200/logstash-2021.05.08/_settings \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{"index" : {"number_of_replicas" : 0}}'
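If a lot of indices are in that state, the same settings update can be applied to every index at once by targeting _all instead of a single index name (use with care anywhere other than a sandbox):

curl -X PUT http://127.0.0.1:9200/_all/_settings -H 'content-type: application/json' -d '{"index" : {"number_of_replicas" : 0}}'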
Add an Alias to an Index
To add an alias to an existing index, use PUT /<indexname>/_alias/<aliasname> — e.g.
PUT /ljr-2022.07.05/_alias/ljr
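Or, as the equivalent curl command:

curl -X PUT http://localhost:9200/ljr-2022.07.05/_alias/ljr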
Docker – Changing an Existing Container
I’ll start by acknowledging that, of course, you could just redeploy the container with the settings you want now. The whole point of containerized development is that anything “good” should either be part of the deployment settings or data persisted outside of the container. So, in theory, redeploying the container every day shouldn’t really be detectable. Even when you didn’t deploy the original container (i.e. you don’t have the Dockerfile or docker run command handy to tweak as needed), you can reverse engineer what you need from docker inspect. But sometimes? It’s quicker/easier/more convenient to just fix what you need to within the existing container. And it is possible to do so.
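As an example of that reverse engineering, docker inspect will show the environment variables, image, command, and mounts the container was created with (the container name here is a placeholder):

# Environment variables the container was started with
docker inspect --format '{{json .Config.Env}}' my_container
# Image and command
docker inspect --format '{{.Config.Image}} {{json .Config.Cmd}}' my_container
# Volume mounts
docker inspect --format '{{json .Mounts}}' my_container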
The trickiest part is finding the right file to edit.
# cd into docker container definition folder
cd /var/lib/docker/containers/
# find the guid for the container you want to edit
docker ps
# Find the corresponding folder name
ls -al | grep bc9dc66882af
# cd into that folder
cd bc9dc66882af18f59c209faf10031fe21765571d0a2fe4a32a768a1d52cc1e53
# Stop docker before editing -- the daemon rewrites config.v2.json when it shuts down,
# so edits made while it is running can be lost
systemctl stop docker
# Edit the config.v2.json file for the container
vi config.v2.json
# And, finally, start docker back up
systemctl start docker
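Once docker is back up (and the container started, if it doesn’t come back automatically), it’s worth confirming the edit actually took, e.g. by checking that the updated ENV value now shows up in the container’s configuration:

docker start bc9dc66882af
docker inspect --format '{{json .Config.Env}}' bc9dc66882af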
Docker – Reporting IP Addresses of Containers
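A quick way to do this is docker inspect with a format template; the sketch below prints each running container’s name and IP address (assuming the containers are attached to standard bridge-style networks):

docker inspect --format '{{.Name}}: {{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' $(docker ps -q)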
Farm Automation
Scott set up one of the ESP32s that we use as environmental sensors to monitor the incubator. There’s an audible alarm if it gets too warm or cold, but that’s only useful if you’re pretty close to the unit. We had gotten a cheap rectangular incubator from Amazon, and it’s got some quirks. The display shows a “C” after the temperature, but if we really had eggs in a box at 100C, they’d be cooked; the number is actually Fahrenheit. There’s supposed to be a calibration menu option, but there isn’t. Worse, the temperature sensor is off by a few degrees. If calibration were an option, that wouldn’t be a big deal, but as it stands the only way we’re able to get the device over 97F is by taping the temperature probe up to the top of the unit.
So we’ve got a DHT11 sensor inside of the incubator, and the ESP32 sends temperature and humidity readings over MQTT to OpenHAB. There are text and audio alerts if the temperature or humidity isn’t in the “good” window. This means you can be out in the yard or away from home and still know if there’s a problem (and, since the data is stored in a database table, we can even chart the temperature or humidity across the entire incubation period).
We also bought a larger incubator for the chicken eggs — and there’s a new ESP32 and sensor in the larger incubator.
Postgresql – Querying Hot Standby Server
We hit our maximum connection limit on some PostgreSQL servers — which made me wonder why the hot standby servers weren’t being used … well, at all. They’re equally big, expensive servers with loads of disk space. But they’re just sitting there “in case”.
So we directed some traffic over to the standby server. I’m also going to tweak a few settings related to user limits: increase max connections, since these are dedicated hosts with plenty of I/O, memory, and CPU available; increase the number of reserved connections, since replication filled up all of the reserved slots; and implement a per-user connection limit on one account that runs a lot of threads. But directing some people who were only trying to look at data over to the standby server seemed like a quick fix.
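For reference, a sketch of those limit changes (the role name and values are hypothetical, and both server-level settings require a restart to take effect):

psql -c "ALTER SYSTEM SET max_connections = 500;"                  # illustrative value; restart required
psql -c "ALTER SYSTEM SET superuser_reserved_connections = 10;"    # also restart required
psql -c "ALTER ROLE batch_user CONNECTION LIMIT 20;"               # cap the account that runs lots of threads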
Now, we discovered something interesting about how queries against the standby interact with replication. It makes a lot of sense when you start thinking about it — if you query against the writable replica, there’s some blocking that goes on. The system isn’t going to vacuum data that you’re currently trying to use. The standby, however, doesn’t have any way to clue the writable replica in to the fact you are trying to use some data. So the writable replica gets a delete, does its thing to hide those rows from future queries, and eventually auto-vacuum comes through and cleans up those rows. All of this gets pushed over to the standby … and there goes the data you were trying to read.
Odds of this happening on a query that takes eight seconds? Incredibly low! Odds increase, however, the longer a query runs. So some of our super massive reports started seeing an error indicating that their query was cancelled “due to a conflict with recovery”.
There are two solutions in the PostgreSQL documentation. One is to increase the max_standby_streaming_delay value (there’s also an archive delay, but we aren’t particularly concerned about clients querying the server during recovery operations); the other is to avoid vacuuming data too quickly, either by setting hot_standby_feedback on the standby or by increasing vacuum_defer_cleanup_age on the primary.
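A sketch of the relevant knobs (the values here are illustrative, not what we settled on):

# In postgresql.conf on the standby (a reload is sufficient for these):
max_standby_streaming_delay = 300s
hot_standby_feedback = on
# In postgresql.conf on the primary:
vacuum_defer_cleanup_age = 10000
# Then reload the configuration on each server:
psql -c "SELECT pg_reload_conf();"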
There’s a third option too — don’t use the standby for long-running queries. That’s easily done in our case … and doesn’t require tweaking any PostgreSQL settings. Ad hoc reporting and direct user access really shouldn’t be implementing such substantial queries (it’s always good to have a SQL expert plan out and optimize complex queries if that’s an option).
Analyzing Postgresql Tmp Files
PostgreSQL stores temporary files for in-flight queries. These don’t normally hang around for long, but sorting a large amount of data or building a large hash can create a lot of temp files, and a dead query that was doing either can leave them behind. We’ve ended up with terabytes of temp files associated with multiple backend process IDs. The file names are algorithmic: the string “pgsql_tmp” followed by the backend PID, a period, and then some other number. Thus, I can extract the PID from each file name and provide a summary of the processes associated with temp files.
To view a summary of the temp files within the pgsql_tmp folder, run the following command to print a count then a PID number:
ls /path/to/pgdata/base/pgsql_tmp | sed -nr 's/pgsql_tmp([0-9]*)\.[0-9]*/\1/p' | sort | uniq -c
A slightly longer command reverses the columns, producing a list of process IDs followed by the count of files for each PID:
ls /path/to/pgdata/base/pgsql_tmp | sed -nr 's/pgsql_tmp([0-9]*)\.[0-9]*/\1/p' | sort | uniq -c | sort -k2nr | awk '{printf("%s\t%s\n",$2,$1)}'
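As a follow-up, the extracted PIDs can be checked against running processes; temp files belonging to PIDs that no longer exist are the ones left behind by dead queries (same placeholder path as above):

for pid in $(ls /path/to/pgdata/base/pgsql_tmp | sed -nr 's/pgsql_tmp([0-9]*)\.[0-9]*/\1/p' | sort -u); do
   ps -p "$pid" > /dev/null 2>&1 || echo "PID $pid is no longer running"
done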