Tag: postgresql

PostgreSQL Wraparound

We had a Postgres server go into read-only mode, which provided a stressful opportunity to learn more nuances of Postgres internals. It appears this “read only mode” is something Postgres does to save it from itself. Transaction IDs are assigned to each row in the database, and the ID values are used to determine what transactions can see. For each transaction, Postgres increments the last transaction ID and assigns the incremented value to the current transaction. When a row is written, the transaction ID is stored in the row and used to determine whether a row is visible to a transaction.

Inserting a row stores the inserting transaction’s ID in the row’s xmin column. A transaction can see all rows where xmin is less than its transaction ID. Updating a row actually creates a new row: the old row gets an xmax value and the new row has the same number as its xmin, so transactions with IDs newer than the xmax value will not see the old row. Similarly, deleting a row sets the row’s xmax value. Older transactions will still be able to see the row, but newer ones will not.
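The visibility rule described above can be sketched in a few lines of Python. This is a deliberately simplified model, not how PostgreSQL actually implements visibility: it ignores commit status, snapshots, and wraparound, and the RowVersion class and is_visible function are names I made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    """One physical version of a row (hypothetical model, not a Postgres structure)."""
    value: str
    xmin: int                    # ID of the transaction that created this version
    xmax: Optional[int] = None   # ID of the transaction that deleted or replaced it


def is_visible(row: RowVersion, xid: int) -> bool:
    """Simplified check: created at or before xid, and not yet removed as of xid."""
    created = row.xmin <= xid
    removed = row.xmax is not None and row.xmax <= xid
    return created and not removed


# An UPDATE by transaction 105 stamps the old version's xmax and creates a new version.
old_version = RowVersion("before", xmin=100, xmax=105)
new_version = RowVersion("after", xmin=105)

for xid in (103, 110):
    seen = [r.value for r in (old_version, new_version) if is_visible(r, xid)]
    print(xid, seen)
```

In this model, transaction 103 sees only the old version of the row and transaction 110 sees only the new one.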

You can even view the xmax and xmin values by specifically asking for them in a select statement: select *, xmin, xmax from TableName;

The transaction ID is stored in a 32-bit number, making the possible values 0 through 4,294,967,295. That can become a problem for an I/O-heavy or long-running database (i.e. even if I only get a couple of records an hour, that adds up over years of service) because … what happens when we get to 4,294,967,295 and need to write another record? To combat this, Postgres does something that reminds me of the “doomsday” Mayan calendar: this number range isn’t aligned on a straight line where one eventually runs into a wall. The numbers are arranged in a circle, so there’s always a new cycle and numbers are issued all over again. In the Postgres source, the wrap limit is “where the world ends”! But, like the Mayan calendar … this isn’t actually the end as much as it’s a new beginning.

How do you know if transaction 5 is ‘old’ or ‘new’ if the number can be reissued? The database considers half of the IDs to be in the past and half to be reserved for future use. When transaction ID four billion is issued, ID number 5 is considered part of the “future”; but when the current transaction ID is one billion, ID number 5 is considered part of the “past”. That could be problematic if one of the first records in the database has never been updated but is still perfectly legitimate. Reserving in-use transaction IDs would make the re-issuing of transaction IDs more resource intensive: not just “assign ++xid to this transaction”, but “increment xid; check whether xid is already in use; if so, increment and check again until the answer is no; then assign xid to this transaction”. Instead of implementing more complex logic, rows can be “frozen”. This is a special flag that basically says “I am a row from the past and ignore my transaction ID number”. In versions 9.4 and later, both the committed and aborted hint bits are set on a row to mark it frozen; earlier versions replaced the row’s xmin with the special FrozenTransactionId value.
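To make the “circle” comparison concrete, here is a minimal Python sketch of comparing two transaction IDs modulo 2^32. As far as I can tell, the Postgres source does a comparable signed 32-bit subtraction for ordinary transaction IDs; this sketch ignores the handful of reserved special IDs, and the function name xid_precedes is my own.

```python
XID_BITS = 32
XID_SPACE = 1 << XID_BITS          # 4,294,967,296 possible values
HALF_SPACE = 1 << (XID_BITS - 1)   # half of the circle is "past", half is "future"


def xid_precedes(a: int, b: int) -> bool:
    """Return True if transaction ID a is 'older' than b on the 32-bit circle."""
    diff = (a - b) % XID_SPACE
    # Reinterpret the unsigned difference as a signed 32-bit number
    if diff >= HALF_SPACE:
        diff -= XID_SPACE
    return diff < 0


# With the current ID at one billion, ID 5 is in the past ...
print(xid_precedes(5, 1_000_000_000))
# ... but with the current ID at four billion, ID 5 is in the future.
print(xid_precedes(5, 4_000_000_000))
```

The first call prints True and the second prints False, matching the one billion and four billion examples above.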

There is a minimum age for freezing a row — it generally doesn’t make sense to mark a row that’s existed for eight seconds as frozen. This is configured in the database as the vacuum_freeze_min_age. But it’s also not good to let rows sit around without being frozen for too long — the database could wrap around to the point where the transaction ID is reissued and the row would be lost (well, it’s still there but no one can see it). Since vacuuming doesn’t look through every page of the database on every cycle, there is a vacuum_freeze_table_age which defines the age of a transaction where vacuum will look through an entire table to freeze rows instead of relying on the visibility map. This combination, hopefully, balances the I/O of freezing rows with full scans that effectively freeze rows.

What I believe led to our outage — most of our data is time-series data. It is written, never modified, and eventually deleted. Auto-vacuum will skip tables that don’t need vacuuming. In our case, that’s most of the tables. The autovacuum_freeze_max_age parameter sets an ‘age’ at which vacuuming is forced. If these special vacuum processes don’t complete fully … you eventually get into a state where the server stops accepting writes in order to avoid potential data loss.

So monitoring for transaction IDs approaching the wraparound and emergency vacuum values is important. I set up a task that alerts us when we approach wraparound (fortunately, we’ve not gotten there again) as well as when we approach the emergency auto-vacuum threshold — a state which we reach a few times a week.

Using the following query, we monitor how close each of our databases is to both the auto-vacuum threshold and the ‘end of the world’ wrap-around point.

WITH max_age AS (
    SELECT 2000000000 AS max_old_xid
         , setting AS autovacuum_freeze_max_age
    FROM pg_catalog.pg_settings
    WHERE name = 'autovacuum_freeze_max_age' )
, per_database_stats AS (
    SELECT datname
         , m.max_old_xid::int
         , m.autovacuum_freeze_max_age::int
         , age(d.datfrozenxid) AS oldest_current_xid
    FROM pg_catalog.pg_database d
    JOIN max_age m ON (true)
    WHERE d.datallowconn )
SELECT max(oldest_current_xid) AS oldest_current_xid
     , max(ROUND(100*(oldest_current_xid/max_old_xid::float))) AS percent_towards_wraparound
     , max(ROUND(100*(oldest_current_xid/autovacuum_freeze_max_age::float))) AS percent_towards_emergency_autovac
FROM per_database_stats;

If we are approaching either point, e-mail alerts are sent.
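The alerting decision itself is simple arithmetic on the query results. Here is a rough Python sketch of that check. The function name and the two warning percentages are placeholders I picked for illustration, not the values from our actual alerting job.

```python
from typing import List

# Same ceiling the monitoring query uses for max_old_xid
WRAPAROUND_LIMIT = 2_000_000_000


def xid_alerts(oldest_xid_age: int,
               autovacuum_freeze_max_age: int,
               wraparound_warn_pct: float = 50.0,   # hypothetical alert threshold
               autovac_warn_pct: float = 90.0) -> List[str]:  # hypothetical alert threshold
    """Return an alert message for each threshold the oldest XID age has reached."""
    alerts = []
    pct_wraparound = 100 * oldest_xid_age / WRAPAROUND_LIMIT
    pct_autovac = 100 * oldest_xid_age / autovacuum_freeze_max_age
    if pct_wraparound >= wraparound_warn_pct:
        alerts.append(f"{pct_wraparound:.0f}% of the way to wraparound")
    if pct_autovac >= autovac_warn_pct:
        alerts.append(f"{pct_autovac:.0f}% of the way to emergency autovacuum")
    return alerts


# 200,000,000 is the stock default for autovacuum_freeze_max_age
print(xid_alerts(oldest_xid_age=190_000_000, autovacuum_freeze_max_age=200_000_000))
```

An empty list means nothing needs attention; anything else would go into the e-mail body.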

When a database approaches the emergency auto-vacuum threshold, we freeze data manually: vacuumdb --all --freeze --jobs=1 --echo --verbose --analyze (or --jobs=3 if I want the process to hurry up and get done).

Postgresql – Querying Hot Standby Server

We hit our maximum connection limit on some PostgreSQL servers — which made me wonder why the hot standby servers weren’t being used … well, at all. They’re equally big, expensive servers with loads of disk space. But they’re just sitting there “in case”.

So we directed some traffic over to the standby server. I’m also going to tweak a few settings related to user limits — increase the max connections since these are dedicated hosts and have plenty of available I/O, memory, CPU, etc resources; increase the number of reserved connections since replication filled up all of the reserved slots; implement a per-user connection limit on one account that runs a lot of threads — but directing some people who were only trying to look at data over to the standby server seemed like a quick fix.

Now, we discovered something interesting about how queries against the standby interact with replication. It makes a lot of sense when you start thinking about it. If you query against the primary (the writable server), there’s some blocking that goes on: the system isn’t going to vacuum data that you’re currently trying to use. The standby, however, doesn’t by default have any way to clue the primary in to the fact you are trying to use some data. So the primary gets a delete, does its thing to hide those rows from future queries, and eventually auto-vacuum comes through and cleans up those rows. All of this gets pushed over to the standby … and there goes the data you were trying to read.

Odds of this happening on a query that takes eight seconds? Incredibly low! Odds increase, however, the longer a query runs. So some of our super massive reports started seeing an error indicating that their query was cancelled “due to a conflict with recovery”.

There are two solutions in the PostgreSQL documentation. One is to increase the max_standby_streaming_delay value (there’s also an archive delay, but we aren’t particularly concerned about clients querying the server during recovery operations). The other is to avoid vacuuming data too quickly, either by setting hot_standby_feedback on the standby or increasing vacuum_defer_cleanup_age on the primary (note that vacuum_defer_cleanup_age was removed in PostgreSQL 16).

There’s a third option too — don’t use the standby for long-running queries. That’s easily done in our case … and doesn’t require tweaking any PostgreSQL settings. Ad hoc reporting and direct user access really shouldn’t be implementing such substantial queries (it’s always good to have a SQL expert plan out and optimize complex queries if that’s an option).

Analyzing Postgresql Tmp Files

Postgresql stores temporary files for in-flight queries. These don’t normally hang around for long, but sorting a large amount of data or building a large hash can create a lot of temp files. A dead query that was sorting a large amount of data or …. well, we’ve gotten terabytes of temp files associated with multiple backend process IDs. The file names are algorithmic: the string “pgsql_tmp” followed by the backend PID, a period, and then some other number. Thus, I can extract the PID from each file name and provide a summary of the processes associated with temp files.

To view a summary of the temp files within the pgsql_tmp folder, run the following command to print a count then a PID number:
ls /path/to/pgdata/base/pgsql_tmp | sed -nr 's/pgsql_tmp([0-9]*)\.[0-9]*/\1/p' | sort | uniq -c

A slightly longer command can be used to reverse the columns – producing a list of process IDs followed by the count of files for that PID – too:
ls /path/to/pgdata/base/pgsql_tmp | sed -nr 's/pgsql_tmp([0-9]*)\.[0-9]*/\1/p' | sort | uniq -c | sort -k2nr | awk '{printf("%s\t%s\n",$2,$1)}'
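If you’d rather do the same summary from a script, here is a small Python sketch of the same idea: pull the PID out of each file name and count files per PID. The function name and the sample file names are made up for illustration; in practice you would pass in something like os.listdir() on the pgsql_tmp folder.

```python
import re
from collections import Counter
from typing import Iterable

# "pgsql_tmp", then the backend PID, a period, and then some other number
TEMP_FILE_PATTERN = re.compile(r"^pgsql_tmp(\d+)\.\d+$")


def count_temp_files_by_pid(file_names: Iterable[str]) -> Counter:
    """Count temp files per backend PID, skipping names that don't match the pattern."""
    counts = Counter()
    for name in file_names:
        match = TEMP_FILE_PATTERN.match(name)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Made-up file names for illustration
sample = ["pgsql_tmp4101.0", "pgsql_tmp4101.1", "pgsql_tmp977.3", "not_a_temp_file"]
for pid, count in count_temp_files_by_pid(sample).most_common():
    print(f"{pid}\t{count}")
```

Using most_common() lists the PIDs holding the most temp files first, which is usually the process you want to look at.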

 

Tracking Down Which Pod is Exhausting IP Connections

We’ve been seeing an error that prevents clients from connecting to Postgresql servers – basically that all available connections are in use and the remaining connections are reserved for superuser and replication activity.

First, we need to determine what the connection limit is

SELECT setting, source, sourcefile, sourceline FROM pg_settings WHERE name = 'max_connections';

And if there are any per-user connection limits – a limit of -1 means unlimited connections are allowed.

SELECT rolname, rolconnlimit FROM pg_roles;

The next step is to identify what connections are exhausting available connections – are there a lot of long-running queries? Are there just more active queries than anticipated? Are there a bunch of idle connections?

SELECT pid, usename, client_addr, client_port 
 ,to_char(pg_stat_activity.query_start, 'YYYY-MM-DD HH24:MI:SS') as query_start
 , state, query 
FROM pg_stat_activity
-- where state = 'idle'
-- and usename = 'app_user'
order by query_start;

In our case, there were over 100 idle connections using up about 77% of the available connections. Auto-vacuum, client read operations, and replication easily filled up the remaining available connections.
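To get numbers like that 77% without counting rows by hand, the connection list can be grouped by user and state. This is a minimal Python sketch that assumes the (usename, state) pairs have already been fetched from pg_stat_activity; the function name and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def connection_share(rows: Iterable[Tuple[str, str]],
                     max_connections: int) -> Dict[Tuple[str, str], float]:
    """Percent of max_connections held by each (usename, state) pair."""
    counts = Counter(rows)
    return {key: round(100 * n / max_connections, 1) for key, n in counts.items()}


# Made-up rows standing in for "SELECT usename, state FROM pg_stat_activity"
sample_rows = [("app_user", "idle")] * 3 + [("report_user", "active")]
for (user, state), pct in connection_share(sample_rows, max_connections=10).items():
    print(f"{user:12} {state:8} {pct}%")
```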

Because the clients keeping these idle connections open are an app running in a Kubernetes cluster, there’s an extra layer of complexity identifying where the connection is actually sourced. When you view the list of connections from the Postgresql server’s perspective, “client_addr” is the worker hosting the pod.

On the worker server, use conntrack to identify the actual source of the connection – the IP address in “-d” is the IP address of the Postgresql server. To isolate a specific connection, select a “client_port” from the list of connections (37900 in this case) and grep for the port. You will see the src IP of the individual pod.

lhost1750:~ # conntrack -L -f ipv4 -d 10.24.29.140 -o extended | grep 37900
ipv4 2 tcp 6 86394 ESTABLISHED src=10.244.4.80 dst=10.24.29.140 sport=37900 dport=5432 src=10.24.29.140 dst=10.24.29.155 sport=5432 dport=37900 [ASSURED] mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 27 flow entries have been shown.

Then use kubectl to identify which pod is assigned that address:

lhost1745:~ # kubectl get po --all-namespaces -o wide | grep "10.244.4.80"
kstreams kafka-stream-app-deployment-1336-d8f7d7456-2n24x 2/2 Running 0 10d 10.244.4.80 lhost0.example.net <none> <none>
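If you need to do this lookup for more than a couple of ports, the conntrack output is easy to parse. Here is a rough Python sketch that pulls the pod’s source IP out of conntrack output for a given client port. It assumes the output looks like the line shown above, where the first src/sport pair on each line describes the original direction of the connection; the function name is my own.

```python
from typing import Optional


def pod_ip_for_port(conntrack_output: str, client_port: int) -> Optional[str]:
    """Return the original source IP of the flow whose original sport is client_port."""
    for line in conntrack_output.splitlines():
        original = {}
        for token in line.split():
            key, sep, value = token.partition("=")
            # Keep only the first occurrence of each key: the original direction
            if sep and key not in original:
                original[key] = value
        if original.get("sport") == str(client_port):
            return original.get("src")
    return None


# The conntrack output shown above
sample_output = (
    "ipv4 2 tcp 6 86394 ESTABLISHED src=10.244.4.80 dst=10.24.29.140 sport=37900 dport=5432 "
    "src=10.24.29.140 dst=10.24.29.155 sport=5432 dport=37900 [ASSURED] mark=0 use=1\n"
    "conntrack v1.4.4 (conntrack-tools): 27 flow entries have been shown."
)
print(pod_ip_for_port(sample_output, 37900))
```

For the sample line this prints 10.244.4.80, the same pod address found with grep above.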

In this case, we’ve got an application automatically scaling up that can have 25 connections held open and idle … so there isn’t really a solution other than increasing the number of available connections to a number that’s appropriate given the number of client connections we plan on leaving open. I also want to enact a connection limit on the individual account – if there are 250 connections available on the Postgresql server, then limit the application to 200 of those connections.

 

PostgreSQL Matching Functions

Queries using POSIX regex

-- Case insensitive match
SELECT * FROM mytable WHERE columnName ~* 'this|that';
-- Case sensitive match
SELECT * FROM mytable WHERE columnName ~ 'this|that';

Queries Using ANY

SELECT * FROM mytable WHERE columnName like any (array['%this%', '%that%']);

Queries Using SIMILAR TO

-- This is translated to a regex query internally, so not effectively different than constructing the regex query yourself
SELECT * FROM mytable WHERE columnName SIMILAR TO '%(this|that)%';

Postgresql Through an SSH Tunnel in Python

Our production Postgresql servers have a fairly restrictive IP access control list, which means you cannot VPN in and query the server directly. We’ve been using DBeaver with an SSH tunnel to connect, but it’s a bit time consuming to run a query across all of the servers for monitoring and troubleshooting. To work around the restriction, I built a Python script that uses an SSH tunnel to relay communications to the Postgresql servers.

import psycopg2
from sshtunnel import SSHTunnelForwarder

from config import strSSHRelayHost, iSSHRelayPort, strSSHRelayUser, strSSHAuthKeyFile, dictHost
# In the config.py, dictHost should contain the following information
# dictHost = {"host":"dbserver.example.com","port":5432,"database": "dbname", "username":"dbuser", "password":"S3cr3tPhr@5e"}

# Example query -- listing out locks 
# Note: the doubled backslash in \\s+ reaches PostgreSQL as the regex \s+ (collapse runs of whitespace)
sqlQuery = "WITH RECURSIVE l AS (  SELECT pid, locktype, mode, granted, ROW(locktype,database,relation,page,tuple,virtualxid,transactionid,classid,objid,objsubid) obj FROM pg_locks ), pairs AS ( SELECT w.pid waiter, l.pid locker, l.obj, l.mode FROM l w JOIN l ON l.obj IS NOT DISTINCT FROM w.obj AND l.locktype=w.locktype AND NOT l.pid=w.pid AND l.granted  WHERE NOT w.granted ), tree AS ( SELECT l.locker pid, l.locker root, NULL::record obj, NULL AS mode, 0 lvl, locker::text path, array_agg(l.locker) OVER () all_pids FROM ( SELECT DISTINCT locker FROM pairs l WHERE NOT EXISTS (SELECT 1 FROM pairs WHERE waiter=l.locker) ) l  UNION ALL  SELECT w.waiter pid, tree.root, w.obj, w.mode, tree.lvl+1, tree.path||'.'||w.waiter, all_pids || array_agg(w.waiter) OVER () FROM tree JOIN pairs w ON tree.pid=w.locker AND NOT w.waiter = ANY ( all_pids )) SELECT (clock_timestamp() - a.xact_start)::interval(3) AS ts_age, replace(a.state, 'idle in transaction', 'idletx') state, (clock_timestamp() - state_change)::interval(3) AS change_age, a.datname,tree.pid,a.usename,a.client_addr,lvl, (SELECT count(*) FROM tree p WHERE p.path ~ ('^'||tree.path) AND NOT p.path=tree.path) blocked, repeat(' .', lvl)||' '||left(regexp_replace(query, '\\s+', ' ', 'g'),100) query FROM tree JOIN pg_stat_activity a USING (pid) ORDER BY path"

# Entering the "with" block starts the tunnel and leaving it stops the tunnel,
# so no explicit start()/stop() calls are needed. "server" is never None here:
# a tunnel that fails to come up surfaces as an exception rather than a None value.
with SSHTunnelForwarder( (strSSHRelayHost, iSSHRelayPort), ssh_username=strSSHRelayUser, ssh_private_key=strSSHAuthKeyFile, local_bind_address=("localhost",55432), remote_bind_address=(dictHost.get('host'), dictHost.get('port'))) as server:
# Alternately, you can use password authentication
#with SSHTunnelForwarder( (strSSHRelayHost, iSSHRelayPort), ssh_username=strSSHRelayUser, ssh_password=strSSHRelayUserPass, local_bind_address=("localhost",55432), remote_bind_address=(dictHost.get('host'), dictHost.get('port'))) as server:
    #print("Tunnel server connected")
    # Connect to the local end of the tunnel, not to the database host itself
    params = {'database': dictHost.get('database'),'user': dictHost.get('username'),'password': dictHost.get('password'), 'host': server.local_bind_host, 'port': server.local_bind_port}
    conn = psycopg2.connect(**params)
    try:
        cursor = conn.cursor()
        cursor.execute(sqlQuery)
        # Print the column headers, then each row of the result
        column_names = [desc[0] for desc in cursor.description]
        print(column_names)
        for row in cursor.fetchall():
            print(row)
        cursor.close()
    finally:
        # Close the database connection before the tunnel is torn down
        conn.close()

Postgresql Replication Lag

Replication involves sending records from the master, receiving the record on the remote replica, writing the record on the remote replica, flushing the record to persistent storage, and finally replaying the record. Replication lag can occur when a large amount of data hasn’t been fully replayed into the remote replica. Identifying where the lag occurs can help in rectifying the underlying problem.

select client_addr, usename, application_name, state, sync_state, (pg_wal_lsn_diff(pg_current_wal_lsn(),sent_lsn) / 1024)::bigint as PendingLag, (pg_wal_lsn_diff(sent_lsn,write_lsn) / 1024)::bigint as WriteLag, (pg_wal_lsn_diff(write_lsn,flush_lsn) / 1024)::bigint as FlushLag, (pg_wal_lsn_diff(flush_lsn,replay_lsn) / 1024)::bigint as ReplayLag, (pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn))::bigint / 1024 as TotalLag FROM pg_stat_replication;

Commented to explain what each column means:

select client_addr , usename , application_name, state, sync_state,
    (pg_wal_lsn_diff(pg_current_wal_lsn(),sent_lsn) / 1024)::bigint as PendingLag,    -- The amount of WAL data that hasn't been sent ... check network stuff if lag persists
    (pg_wal_lsn_diff(sent_lsn,write_lsn) / 1024)::bigint as WriteLag,                 -- The amount of WAL data that has been sent but not yet written on the replica ... check iostat stuff if lag persists
    (pg_wal_lsn_diff(write_lsn,flush_lsn) / 1024)::bigint as FlushLag,                -- similar to write lag, and often these two numbers are high in conjunction
    (pg_wal_lsn_diff(flush_lsn,replay_lsn) / 1024)::bigint as ReplayLag,              -- The amount of log data that is waiting to be replayed ... check iostat stuff but could also be high CPU or memory utilization
    (pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn))::bigint / 1024 as TotalLag     -- Basically a sum of the previous values
FROM pg_stat_replication;
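If you only have raw LSN values (from a log line, say) and want the same kind of number, the arithmetic is straightforward. An LSN is printed as two hexadecimal numbers separated by a slash, representing the high and low 32 bits of a 64-bit byte position in the WAL. This Python sketch converts that text form to an integer and reports the difference in KB; the function names and the sample LSNs are made up for illustration, and it rounds down where the query above rounds to the nearest KB.

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN in 'high/low' hex form into a byte position in the WAL."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_kb(newer: str, older: str) -> int:
    """Difference between two LSNs in whole kilobytes (rounded down)."""
    return (parse_lsn(newer) - parse_lsn(older)) // 1024


# Made-up LSN values standing in for pg_current_wal_lsn() and sent_lsn
print(lag_kb("0/3000000", "0/2FFFC00"))
# The difference is still small when the high word rolls over
print(lag_kb("1/0", "0/FFFFFC00"))
```

Both sample pairs are 1024 bytes apart, so both lines print 1.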

Accessing Postgresql Server Through SSH Tunnel

Development servers can be accessed directly, but access to our production Postgresql servers is restricted. To use a SQL client from the VPN network, you need to connect through an SSH tunnel.

In the connection configuration for a database connection, click the “SSH” tab.

You can one-off configure an SSH tunnel for a connection, but the most efficient approach for setting up the SSH tunnel used to connect to all of our production databases is to create a tunnel profile. The profile lets you type connection info in one time and use it for multiple servers; and, when you need to update a setting, you only have one configuration set to update.

To create a profile, click the pencil to the right of the profile drop-down box.

Under the “Network Profiles” section, click “Create” to create a new profile.

Give the profile a descriptive name and click “OK”.

Check the “Use SSH Tunnel” box and enter the hostname (one of the ETL dev boxes is a good choice in our case – ltrkarkvm553.mgmt.windstream.net). Supply the username for the connection. You can use password authentication or, if you have a key exchange set up for authentication, select public key authentication. Click “Apply and Close” to save the profile.

Your profile will appear in the profile drop-down – select the profile …

And all of the settings will pop in. They’re grayed out here; if you need to update the SSH tunnel profile settings, click the little pencil again.