Category: System Administration

MTU Probing

We’ve had a number of very strange network problems lately — Zoneminder cannot talk to cameras, clients veg out talking to Myth, Twonky is non-functional (even the web page — you get enough of the header to have a title, but the page just hangs, Scott cannot get to our Discourse site. And, more frustratingly, he cannot SSH to some of our hosts. Using “ssh -v” and throwing on a bunch of flags to not attempt key auth (-o PasswordAuthentication=yes -o PreferredAuthentications=keyboard-interactive,password -o PubkeyAuthentication=no) and his connection still hung. But, at least, I could see something. The last thing the SSH connection reported is:

debug1: expecting SSH2_MSG_KEX_ECDH_REPLY

Which I’ve seen before … fortunately when I had a great Unix support guy working in the same office building that I did. Who let me stop over and bounce really oddball problems off of him. He told me to enable mtu probing.

echo 1 >/proc/sys/net/ipv4/tcp_mtu_probing

And, if that doesn’t work, use “echo 2”. Which …. yeah, wouldn’t have been any of my first thirty guesses. Cloudflare published a good article on what exactly MTU path discovery is, and I can RTFM enough to figure out what I’ve set here. But no idea what’s got a smaller MTU than our computers.

 

tcp_mtu_probing - INTEGER
	Controls TCP Packetization-Layer Path MTU Discovery.  
	  0 - Disabled
	  1 - Disabled by default, enabled when an ICMP black hole detected
	  2 - Always enabled, use initial MSS of tcp_base_mss.

PIP SSL Error

Upgraded pip today, and I pretty quickly regretted it. SSL Error attempting to install anything from the Internet (and, amazingly, some things where I downloaded the wheel file). The answer is to downgrade PIP until you hit a version that doesn’t have the error. Annoying. Not sure what the latest rev I could have used was — going back one level and getting the error in loop was more time than I could devote to the project, so I just jumped back six months. Had success with 20.0.2 and left working alone.

Everything from 20.3.1 through 21.0.1 has this failure:

D:\tmp\5\pip>pip install basic_sftp
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by ‘SSLError(SSLError(1, ‘[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1076)’))’: /simple/basic-sftp/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by ‘SSLError(SSLError(1, ‘[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1076)’))’: /simple/basic-sftp/
WARNING: You are using pip version 20.3.1; however, version 21.0.1 is available.
You should consider upgrading via the ‘c:\programs\anaconda3\python.exe -m pip install –upgrade pip’ command.

SCP From Solaris to RHEL?

Evidently you cannot just scp files from an old Solaris box when you’re on a RHEL/CentOS system … there’s an incompatibility between them that requires you to (1) install scp1 on the Solaris server {not likely in a prod environment} or (2) use sftp to transfer the files.

 

Server1: Red Hat Enterprise Linux Server release 7.6 (Maipo)
Server2: Solaris 5.9

lisa@server1:~/$ scp lisa@server2:/data/stuff/file1.txt ./input/
lisa@server2’s password:
scp: warning: Executing scp1.
scp: FATAL: Executing ssh1 in compatibility mode failed (Check that scp1 is in your PATH).

Fedora — Disabling IPv6

Since it’s the third time I’ve had to do this so far this year, I’m going to write down how I disable IPv6 in Fedora. Add these lines to /etc/sysctl.conf

[lisa@server~]# grep ipv6 /etc/sysctl.conf
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1

Then load the sysctl settings (sysctl -p) or reboot.

Without IPv6, if you do X-redirection, you may get an error indicating the redirection was refused. In journalctl, there’s an error “error: Failed to allocate internet-domain X11 display socket”. Evidently you’ve got to configure sshd to use IPv4 by setting “AddressFamily inet” in /etc/ssh/sshd_config

[lisa@server~/]# grep AddressFamily /etc/ssh/sshd_config
AddressFamily inet

 

Fedora – Why were my packets dropped?

We’ve been seeing dropped packets on one of our servers — that usually means more data is coming in than can be processed, but it’s nice to confirm rather than guess. The command “netstat -s” displays summary statistics that are nicely grouped into causes:

TcpExt:
16 invalid SYN cookies received
88 resets received for embryonic SYN_RECV sockets
18 packets pruned from receive queue because of socket buffer overrun
2321 ICMP packets dropped because they were out-of-window
838512 TCP sockets finished time wait in fast timer

Notes on Adding Drive to Linux Host

Create partition, format, find UUID, and add line to fstab to mount the volume

[lisa@linuxhost ~]# parted /dev/sdb
GNU Parted 3.2.153
Using /dev/sdb
Welcome to GNU Parted! Type ‘help’ to view a list of commands.
(parted) mklabel GPT
Warning: The existing disk label on /dev/sdb will be destroyed and all data on this disk will be lost. Do you want to continue?
Yes/No? y
(parted) mkpart primary 2048s 100%
(parted) q
Information: You may need to update /etc/fstab.

[lisa@linuxhost ~]# mkfs.xfs -f /dev/sdb1
meta-data=/dev/sdb1 isize=512 agcount=4, agsize=163839872 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1
data = bsize=4096 blocks=655359488, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=319999, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0

[lisa@linuxhost ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 20G 0 disk
├─sda1 8:1 0 1G 0 part /boot
└─sda2 8:2 0 19G 0 part
├─fedora-lisa 253:0 0 17G 0 lvm /
└─fedora-swap 253:1 0 2G 0 lvm [SWAP]
sdb 8:16 0 2.5T 0 disk
└─sdb1 8:17 0 2.5T 0 part
sr0 11:0 1 650M 0 rom
[lisa@linuxhost ~]# blkid | grep sdb1
/dev/sdb1: UUID=”801ebed3-ddd6-459d-bd62-04a0a75f91b8″ TYPE=”xfs” PARTLABEL=”primary” PARTUUID=”b9a9a340-28f5-4efb-b649-af804ef5bc4c”

Add a line to /etc/fstab to mount the volume — here I’m mounting it to /mnt/data/mythtv:

UUID=801ebed3-ddd6-459d-bd62-04a0a75f91b8 /mnt/data/mythtv xfs defaults 0 0

Discourse acme.sh Script Failure

I had a hellacious time updating the certificate on my Dockerized Discourse server — the acme.sh script doesn’t have a slash delimiter between the hostname and the ./well-known folder within the URI. Which means the request fails. Repeatedly.

 

[Sat Oct 10 00:01:09 UTC 2020] _post_url='https://acme-v02.api.letsencrypt.org/acme/chall-v3/7784162898/nr42-g'
[Sat Oct 10 00:01:09 UTC 2020] _CURL='curl -L --silent --dump-header /shared/letsencrypt/http.header -g '
[Sat Oct 10 00:01:10 UTC 2020] _ret='0'
[Sat Oct 10 00:01:10 UTC 2020] code='200'
[Sat Oct 10 00:01:10 UTC 2020] trigger validation code: 200
[Sat Oct 10 00:01:10 UTC 2020] sleep 2 secs to verify
[Sat Oct 10 00:01:12 UTC 2020] checking
[Sat Oct 10 00:01:12 UTC 2020] url='https://acme-v02.api.letsencrypt.org/acme/chall-v3/7784162898/nr42-g'
[Sat Oct 10 00:01:12 UTC 2020] payload
[Sat Oct 10 00:01:12 UTC 2020] POST
[Sat Oct 10 00:01:12 UTC 2020] _post_url='https://acme-v02.api.letsencrypt.org/acme/chall-v3/7784162898/nr42-g'
[Sat Oct 10 00:01:12 UTC 2020] _CURL='curl -L --silent --dump-header /shared/letsencrypt/http.header -g '
[Sat Oct 10 00:01:13 UTC 2020] _ret='0'
[Sat Oct 10 00:01:13 UTC 2020] code='200'
[Sat Oct 10 00:01:13 UTC 2020] discourse.example.com:Verify error:Fetching https://discourse.example.com.well-known/acme-challenge/XY02T_40TL92IADByQ45JMj4JzC2qJCatVd2odJMAlU: Invalid host in redirect target
[Sat Oct 10 00:01:13 UTC 2020] pid
[Sat Oct 10 00:01:13 UTC 2020] No need to restore nginx, skip.

 

Turns out that’s my bad config — I’ve got a reverse proxy in front of Discourse, and we don’t use the clear text http site. The reverse proxy just bounces you over to the https site. Two problems — one, I failed to put the trailing slash after my redirect, s http://discourse.example.com/.well-known/blah is being redirected to https://discourse.example.com.well-known/blah

<VirtualHost 10.1.2.3:80>
ServerName discourse.example.com
ServerAlias discourse

Redirect 301 / https://discourse.example.com

</VirtualHost>

 

That’s easy enough to fix — add the trailing slash I should have had anyway. But the subsequent problem is that the bootstrap nginx config that is used to serve up the validation page only listens on port 80. So I cannot redirect the clear-text traffic over to the SSL site. I have to reverse proxy the clear text site as well (at least whenever the certificate needs to be renewed).

ProxyPass / https://discourse.example.com/
ProxyPassReverse / https://discourse.example.com/

Voila, a web server with an updated certificate.

Using Process Monitor To Troubleshoot Applications

SysInternals used to produce a suite of tools for working with Microsoft Windows systems — the company appears to have been acquired by Microsoft, and the tools continue to be developed. I used PSKill and PSExec to automate a lot of system administration tasks. ProcessMonitor is like truss/strace for Windows. Unlike the HFS standard, Windows files end up all over the place (plus info is stashed in the registry). Sometimes applications or services fall over for no reason. Process monitor reports out

When you open procmon, you can build filters to exclude uninteresting operations — there’s a default set of exclusions (no need to log out what procmon is doing!)

Adding exclusions for specific process names can eliminate a lot of I/O — I was looking to troubleshoot a problem on a Domain Controller that had nothing to do with AD specifically, so excluding activity by lsass.exe significantly reduced the amount of data being logged. If I’m using a browser to troubleshoot the problem, I’ll exclude the firefox.exe or chrome.exe binary too.

From the filter screen, click “OK” to begin grabbing data. The easiest thing I’ve found to do is stop capturing data when the program opens (use ctrl-a followed by ctrl-x to clear the already logged stuff). Stage whatever you want to log, use ctrl-e to start capturing. Perform the actions you want to log, return to procmon and use ctrl-e to stop again.

You’ll see reads (and writes) against the registry, including the specific keys. Network operations. File reads and writes. In the “Result” and “Detail” column, you can determine if the operation was successful. There are a lot of expected not found failures — I see these in truss/strace logs too, programs try a bunch of different things and one of them needs to work.

I’ve had programs using a specific, undocumented file for a critical operation — like the service would fail to start because the file didn’t exist. And seeing the path and file open failure allowed me to create that needed file and run my service. I’ve wanted to find out where a program stashes data, and procmon makes that easy to identify.