Apache HTTPD Log File Analysis — Hits by IP Address

When we are decommissioning a website (or web server), I always watch the log files to ensure there aren’t a lot of people still accessing it. Sometimes there are, and it’s worth tracking them down individually to clue them in to the site’s imminent demise. Usually there aren’t, and it’s just confirmation that our decommissioning efforts won’t impact anyone.

This Python script looks for IP addresses in the log files and outputs each IP and its access count per log file. It isn’t great if the recorded URI strings contain a bunch of IP addresses (the regex will count those too), but it’s good enough for 99% of our log data.

import os
import re
from collections import Counter

def parseApacheHTTPDLog(strLogFile):
    regexIPAddress = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

    with open(strLogFile) as f:
        objLog = f.read()
        listIPAddresses = re.findall(regexIPAddress, objLog)
        counterAccessByIP = Counter(listIPAddresses)
        for strIP, iAccessCount in counterAccessByIP.items():
            print(f"{strLogFile}\t{strIP}\t{iAccessCount}")

if __name__ == '__main__':
    strLogDirectory = '/var/log/httpd/'
    for strFileName in os.listdir(strLogDirectory):
        if "access_log" in strFileName:
        #if "hostname.example.com" in strFileName and "access_log" in strFileName:
            parseApacheHTTPDLog(f"{strLogDirectory}{strFileName}")

Postgresql Through an SSH Tunnel in Python

Our production Postgresql servers have a fairly restrictive IP access control list, which means you cannot VPN in and query the server. We’ve been using DBeaver with an SSH tunnel to connect, but that’s time-consuming when you want to run a query across all of the servers for monitoring and troubleshooting. To work around the restriction, I built a Python script that uses an SSH tunnel to relay communications to the Postgresql servers.

import psycopg2
from sshtunnel import SSHTunnelForwarder

from config import strSSHRelayHost, iSSHRelayPort, strSSHRelayUser, strSSHAuthKeyFile, dictHost
# In the config.py, dictHost should contain the following information
# dictHost = {"host":"dbserver.example.com","port":5432,"database": "dbname", "username":"dbuser", "password":"S3cr3tPhr@5e"}

# Example query -- listing out locks, showing which PIDs block which
sqlQuery = r"""
WITH RECURSIVE l AS (
  SELECT pid, locktype, mode, granted,
    ROW(locktype,database,relation,page,tuple,virtualxid,transactionid,classid,objid,objsubid) obj
  FROM pg_locks
), pairs AS (
  SELECT w.pid waiter, l.pid locker, l.obj, l.mode
  FROM l w
  JOIN l ON l.obj IS NOT DISTINCT FROM w.obj AND l.locktype=w.locktype AND NOT l.pid=w.pid AND l.granted
  WHERE NOT w.granted
), tree AS (
  SELECT l.locker pid, l.locker root, NULL::record obj, NULL AS mode, 0 lvl, locker::text path,
    array_agg(l.locker) OVER () all_pids
  FROM ( SELECT DISTINCT locker FROM pairs l WHERE NOT EXISTS (SELECT 1 FROM pairs WHERE waiter=l.locker) ) l
  UNION ALL
  SELECT w.waiter pid, tree.root, w.obj, w.mode, tree.lvl+1, tree.path||'.'||w.waiter,
    all_pids || array_agg(w.waiter) OVER ()
  FROM tree
  JOIN pairs w ON tree.pid=w.locker AND NOT w.waiter = ANY ( all_pids )
)
SELECT (clock_timestamp() - a.xact_start)::interval(3) AS ts_age,
  replace(a.state, 'idle in transaction', 'idletx') state,
  (clock_timestamp() - state_change)::interval(3) AS change_age,
  a.datname, tree.pid, a.usename, a.client_addr, lvl,
  (SELECT count(*) FROM tree p WHERE p.path ~ ('^'||tree.path) AND NOT p.path=tree.path) blocked,
  repeat(' .', lvl)||' '||left(regexp_replace(query, '\s+', ' ', 'g'),100) query
FROM tree
JOIN pg_stat_activity a USING (pid)
ORDER BY path
"""

# The with-statement opens the tunnel on entry and tears it down on exit,
# so no explicit start()/stop() calls are needed
with SSHTunnelForwarder(
        (strSSHRelayHost, iSSHRelayPort),
        ssh_username=strSSHRelayUser,
        ssh_private_key=strSSHAuthKeyFile,
        local_bind_address=("localhost", 55432),
        remote_bind_address=(dictHost.get('host'), dictHost.get('port'))) as server:
    # Alternately, you can use password authentication by replacing
    # ssh_private_key=strSSHAuthKeyFile with ssh_password=strSSHRelayUserPass
    params = {'database': dictHost.get('database'), 'user': dictHost.get('username'),
              'password': dictHost.get('password'), 'host': server.local_bind_host,
              'port': server.local_bind_port}
    conn = psycopg2.connect(**params)
    cursor = conn.cursor()
    cursor.execute(sqlQuery)
    column_names = [desc[0] for desc in cursor.description]
    print(column_names)
    for row in cursor.fetchall():
        print(row)
    cursor.close()
    conn.close()

Python List Joining

I’ve seen lists joined with a delimiter like this before:

strDelimiter = "\n"
strDelimiter.join(listOfStuff)

But it seems silly to allocate a variable just for the delimiter. It’s not a big deal from a resource perspective, and a named delimiter variable is arguably more readable, but I’ve always wondered if you couldn’t just call the join method on a string literal. Turns out you can:

msgContent.attach(MIMEText("\n".join(listOfStuff), 'plain'))

Porkbun DDNS API

I’ve been working on a script that updates our host names in Porkbun, but the script had a problem with apex A records (i.e. example.com itself). Updating host.example.com worked fine, but example.com became example.com.example.com.

Now, in a Bind zone, you just fully qualify the record by appending the implied root dot (i.e. instead of “example.com”, you use “example.com.”), but Porkbun didn’t understand a fully qualified record. You cannot say the name is null (or “”). You cannot say the name is “example.com” or “example.com.”

In what I hope is my final iteration of the script, I now identify cases where the name matches the zone and don’t include the name parameter in the JSON data. Otherwise I include the ‘name’ as the short hostname (i.e. the fully qualified hostname minus the zone name). This appears to be working properly, so (fingers crossed, knock on wood, and all that) our ‘stuff’ won’t go offline next time our IP address changes.
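The conditional looks something like this minimal sketch. The endpoint URI, key values, and the buildPayload helper are my illustration here, assuming Porkbun’s v3 JSON dns/create endpoint; check the Porkbun API docs for the current details.

import json
import requests

strAPIKey = "pk1_xxxxxxxx"        # placeholder API key
strSecretKey = "sk1_xxxxxxxx"     # placeholder secret key

def buildPayload(strFQDN, strZone, strIPAddress):
    dictPayload = {"apikey": strAPIKey, "secretapikey": strSecretKey,
                   "type": "A", "content": strIPAddress, "ttl": "600"}
    # Apex record: the hostname *is* the zone, so omit the name parameter entirely.
    # Otherwise, name is the short hostname (the FQDN minus the zone name).
    if strFQDN != strZone:
        dictPayload["name"] = strFQDN[: -(len(strZone) + 1)]
    return dictPayload

# Assumed endpoint -- verify against the current Porkbun API documentation
response = requests.post("https://porkbun.com/api/json/v3/dns/create/example.com",
                         data=json.dumps(buildPayload("host.example.com", "example.com", "203.0.113.10")))
print(response.json())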

Scraping Google Calendar Data, take 2

I had written a script that uses the Google Calendar API to pull records from the Township’s calendar. Unfortunately, the pickle / token / whatever has started expiring every week, which means manual intervention is required for my automated process to run. That made me wonder … for a private calendar, using the API makes sense: I need to authenticate in order to read my private appointments. I could get the token to last for a year, but then I’d have to go through whatever it takes to become a real, approved application. That’s a lot of effort for something I’m using to read my own data. Why do I need to authenticate at all to read events on a public calendar!?

Turns out I don’t. I just need to use the iCal feed for the calendar. Using requests to pull data from a URL and then parsing out the iCal data is simple enough. So now I have a script that pulls the iCal file to populate my Exchange calendar. Since it’s unauthenticated, I shouldn’t have to do anything to get it working again next week 🙂
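The gist of the new approach, as a sketch. I’m assuming the icalendar module here, and the calendar ID in the URL is a placeholder:

import requests
from icalendar import Calendar

# Public Google calendars expose an unauthenticated iCal feed; the calendar ID
# in this URL is a placeholder
strICalFeedURL = "https://calendar.google.com/calendar/ical/yourCalendarID%40group.calendar.google.com/public/basic.ics"

response = requests.get(strICalFeedURL)
response.raise_for_status()

objCalendar = Calendar.from_ical(response.text)
for objEvent in objCalendar.walk('VEVENT'):
    print(objEvent.get('DTSTART').dt, objEvent.get('SUMMARY'))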

Python Selenium Headed v/s Headless

We are automating a file download. It works fine when run headed, but headless execution doesn’t manage to log in. Proxying the requests through Fiddler shows that several of the JavaScript pages return unexpected content.

I’ve added a user-agent to the request, but I’ve noticed that the ChromeDriver also sets sec-ch-* headers … I expect the null sec-ch-ua causes the web server to refuse our request. I don’t see any issues in the ChromeDriver repo for the sec-ch-* headers … and I don’t really want to walk back versions until I find one that doesn’t try setting this header value. Firefox’s GeckoDriver, though, doesn’t set them … so I moved the script over to use Firefox instead of Chrome and am able to download the file.
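Swapping the driver was about all it took; a minimal sketch with Firefox, assuming geckodriver is on the PATH (the URL and user-agent string are placeholders):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

objOptions = Options()
objOptions.headless = True
# GeckoDriver doesn't send the sec-ch-* client hint headers that headless
# Chrome was sending empty; override the user-agent too if the site is picky
objOptions.set_preference("general.useragent.override",
                          "Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0")

objDriver = webdriver.Firefox(options=objOptions)
try:
    objDriver.get("https://example.com/web/guest/login")
    # ... log in and download the file as before ...
finally:
    objDriver.quit()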

Headed run:

GET /o/telx-theme/css/A.bootstrap.css+slick,,_slick.css,Mcc.JKqfH-juDS.css.pagespeed.cf.ZO22sEGAvO.css HTTP/1.1
Host: example.com
Connection: keep-alive
sec-ch-ua: "Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"
sec-ch-ua-mobile: ?0
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36
Accept: text/css,*/*;q=0.1
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: no-cors
Sec-Fetch-Dest: style
Referer: https://example.com/web/guest/login
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cookie: JSESSIONID=0330C2C988F31010790779A126EA6F55.node1; COOKIE_SUPPORT=true; GUEST_LANGUAGE_ID=en_US; AWSELB=039B496118DDEAD697B2B51C93852940763289C324F9E7C7223F953330AF5506573D13C4D5599541FD3CADB645303C1CAEB6D26992826965DA6C8BEDBDE9C297AE26CD76ED; AWSELBCORS=039B496118DDEAD697B2B51C93852940763289C324F9E7C7223F953330AF5506573D13C4D5599541FD3CADB645303C1CAEB6D26992826965DA6C8BEDBDE9C297AE26CD76ED; TS0194d418=01092b79076749232d762d2a6c232e015d103453fbeda3826bd3d20e1d937f5a90cabe03655c97a79198969eea539e4c2e7fc426216092c78ccda85763d52300ce05672704e45b4fc25516d2c24279656db7b0242f7c8b9c8bfed35b7608afb0c54bbc33d489f431059d048094c1e707a20d28031885ca6c61f81613ac299044f0c2b9ba36


Headless run:

GET /o/telx-theme/css/A.bootstrap.css+slick,,_slick.css,Mcc.JKqfH-juDS.css.pagespeed.cf.ZO22sEGAvO.css HTTP/1.1
Host: example.com
Connection: keep-alive
sec-ch-ua:
sec-ch-ua-mobile: ?0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36
Accept: text/css,*/*;q=0.1
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: no-cors
Sec-Fetch-Dest: style
Referer: https://example.com/web/guest/login
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US
Cookie: JSESSIONID=F4293ECE33B134CC368C0E62D6923B48.node1; COOKIE_SUPPORT=true; GUEST_LANGUAGE_ID=en_US; AWSELB=039B496118DDEAD697B2B51C93852940763289C324A5AB24AE470C70960B5319A93C181302D27B4C9425A4AA05795334C4404D491FBCC8E6A9B809746A802EAC2EC8C2FBFA; AWSELBCORS=039B496118DDEAD697B2B51C93852940763289C324A5AB24AE470C70960B5319A93C181302D27B4C9425A4AA05795334C4404D491FBCC8E6A9B809746A802EAC2EC8C2FBFA; TS0194d418=01ba3b12a4ef612e3839114024b5082fd19d56b17293c914ff867740ad37ae362e385934695ad3fc275074bfd1ee24c7d1591b146ad39d153a8758aecc8eb44d374dc1c689e540deca9566f723df65e9f5ad26551e25bacd5df14e4e6104a91a0ecdb59a65176bd5a0ebed284847e0e6618a05ed1d9db6b544e195d8e1f41164e7199a6596

On UUIDs

RFC 4122 UUID Versions:

1 — Datetime and MAC based
48-bit MAC address, 60-bit timestamp, 13-14 bit uniquifying sequence

2 — Datetime and MAC based with DCE security
The least significant 8 bits of the clock sequence and the least significant 32 bits of the timestamp are replaced with local domain and ID values. The RFC doesn’t really provide details on DCE security

3 — Hashed Namespace
MD5 hash of a namespace UUID and a name

4 — Random
6 pre-determined bits (4 bits for version, 2-3 bits for variant 1 or 2) and 122 bits for 2^122 possible v4 variant 1 UUIDs

5 — Hashed Namespace
SHA-1 hash of a namespace UUID and a name

In my case, I hesitate to use a v1 or v2 UUID because I have multiple scripts executing from cron on the same host. The probability of two of them calling the function at the same microsecond seems higher than the probability of the pseudo-random number generator popping the same value within the handful of hours for which the UUIDs are persisted for deduplication.

v3 or v5 UUIDs are my fallback position if we’re seeing dups in v4 — the namespace would need to glom together the script name and microsecond time to make a unique string when multiple scripts are running the function concurrently.
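Both options are one-liners with Python’s uuid module; in this sketch the script name and timestamp-based name string are my own illustration:

import time
import uuid

# v4: random; 2^122 possible values makes a collision within a few hours unlikely
strMessageUUID = str(uuid.uuid4())

# v3/v5 fallback: hash a name glommed together from the script name and the
# microsecond timestamp so concurrent cron jobs still produce unique values
strUniqueName = f"myScript.py-{time.time_ns() // 1000}"
strMessageUUID = str(uuid.uuid5(uuid.NAMESPACE_DNS, strUniqueName))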

Confluent Kafka Queue Length

The documentation for the Python Confluent Kafka module includes a len function on the producer. I wanted to use it because we’re getting a number of duplicated messages on the client, and I was trying to isolate what might be causing the problem. Unfortunately, calling producer.len() failed, indicating there’s no len() method; dir(producer) confirmed that, no, there isn’t a len() method.

I realized today that the documentation is telling me that I can call the built-in len() function on a producer to get the queue length.

Code:

print(f"Before produce there are {len(producer)} messages awaiting delivery")
producer.produce(topic, key=bytes(str(int(cs.timestamp) ), 'utf8'), value=cs.SerializeToString() )
print(f"After produce there are {len(producer)} messages awaiting delivery")
producer.poll(0) # Per https://github.com/confluentinc/confluent-kafka-python/issues/16 for queue full error
print(f"After poll0 there are {len(producer)} messages awaiting delivery")

Output:

Before produce there are 160 messages awaiting delivery
After produce there are 161 messages awaiting delivery
After poll0 there are 155 messages awaiting delivery

Boolean Opts in Python

I have a few command line arguments on a Python script that are most readily used as booleans. I sometimes need a “verbose” option for script debugging (print a lot of extra stuff to show what’s going on), and I usually want a “dry run” option where the script reads data, performs calculations, and prints results to the screen without making any changes or sending data anywhere (database, email, etc.). To use command line arguments as boolean values, I use a function that converts a variety of possible inputs to True/False.

import argparse

def string2boolean(strInput):
    """
    :param strInput: String to be converted to boolean
    :return: Boolean representation of input
    """
    if isinstance(strInput, bool):
        return strInput
    if strInput.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    elif strInput.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    else:
        raise argparse.ArgumentTypeError('Boolean value expected.')

Use “type” when adding the argument to run the input through your function.

    parser.add_argument('-r', '--dryrun', action='store', type=string2boolean, dest='boolDryRun', default=False, help="Preview data processing without sending data to DB or Kafka. Valid values: 'true' or 'false'.")
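The parsed value then comes through as a real boolean on the dest attribute:

    args = parser.parse_args()
    if args.boolDryRun:
        print("Dry run only: results will be printed, not sent to the DB or Kafka")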

Google OAUTH Stuff

Reminder to self: when you set up a desktop app with OAUTH to use the Google APIs, you have to hit the authorization URL from the computer running the code. For my calendar scraper, that means doing X-redirection from the server and running the script there. Firefox launches and the flow actually completes. Attempting to hit the URL from my own computer yields a connection failure to https://localhost:SomePort at the end of the workflow.

Move token.pickle to a backup file, then run getCalendarEvents.py with X-redirection so the auth can be processed through the web form.
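For reference, a minimal sketch of the desktop flow that produces token.pickle, assuming google-auth-oauthlib and a credentials.json downloaded from the API console; run_local_server() is why the browser redirect has to land on the machine running the script:

import pickle
from google_auth_oauthlib.flow import InstalledAppFlow

SCOPES = ['https://www.googleapis.com/auth/calendar.readonly']

# run_local_server() listens on localhost for the OAuth redirect, which is why
# the flow must run (or be X-redirected) on the machine executing the code
flow = InstalledAppFlow.from_client_secrets_file('credentials.json', SCOPES)
creds = flow.run_local_server(port=0)

with open('token.pickle', 'wb') as token:
    pickle.dump(creds, token)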