zenodotus280

SYSLOG 25-W10

Installing Sanoid on KRONOS; Experimenting with 'sanoid --monitor-snapshots' and adding basic health checks to KRONOS


Sanoid, Syncoid, and "Checkoid"

I have tried many different ZFS snapshot and replication tools, including znapzend and various pruning scripts. In the end, it's hard to beat the simplicity and efficacy of Sanoid. I'm a shill for excellent free software, and these Perl scripts are exactly my cup of tea.

With remote booting on KRONOS sorted, I am now looking at the actual data transfer automation. I won't describe the process here since my own notes are mostly copied directly from Jim Salter's GitHub.

Based on this post I can simply use:

  • sanoid --monitor-snapshots for snapshot health
  • sanoid --monitor-health for pool health

When I ran 'monitor-health' I discovered that the pool was degraded! A missing drive. Not marked as faulted though, just "REMOVED".

I keep a table of all 16 drives and their serial numbers since I can't physically tell which is which without removing them (and pulling the wrong drive on a RAIDZ1 VDEV would crash the pool).

root@node-kronos:~# lsblk -o PATH,SERIAL | grep -E '/dev/sd' | while read path serial; do wwn_link=$(ls -l /dev/disk/by-id | grep "wwn-0x.*$(basename $path)" | awk '{print $9}'); echo "| ${serial:0:8} | $path | $wwn_link |" | grep Z1P00ZDF; done

Result:

| Z1P00ZDF | /dev/sdo | wwn-0x5000c50033dcf23b |

With the physical position confirmed and the new WWN noted, I can run zpool replace ZPOOL wwn-0x...23b wwn-0x...05b... except that I didn't need to! zpool status showed that the replacement was already in progress! I had forgotten that I ran zpool set autoreplace=on ZPOOL on this pool.
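For the record, here is a sketch of the manual path (ZPOOL stands in for the real pool name and the WWNs are the truncated values from above):

root@node-kronos:~# zpool get autoreplace ZPOOL                    # confirm the property that triggered the automatic swap
root@node-kronos:~# zpool status ZPOOL                             # watch the resilver onto the new disk
root@node-kronos:~# zpool replace ZPOOL wwn-0x...23b wwn-0x...05b  # only needed if autoreplace were off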

While that was happening I let syncoid get caught up with snapshots and then tried out the 'monitor-snapshots' option:

root@node-kronos:~# sanoid --monitor-snapshots
CRIT: VAULT/remote-pool/AVALON/Archive newest hourly snapshot is 18d 10h 49m 34s old (should be < 2d 12h 0m 0s),
CRIT: VAULT/remote-pool/AVALON/REDACTED newest hourly snapshot is 26d 8h 49m 32s old (should be < 2d 12h 0m 0s)

The "backup" template in sanoid.conf perfectly suits my use case though I will modify the alert periods to accomodate a weekly replication rather than a daily one.

Very cool. When I build the monitoring feature into a master script I will call it "Checkoid" since its purpose is to check on the health of the server, the pool, and the snapshots. If replication fails, if a drive fails, or if the server fails to start, I want to know about it, especially since it will only run when I'm not home and without any action on my part.
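The real script will be its own post, but a minimal sketch of the idea might look like this (the Gotify URL and token are placeholders, not my real instance; since sanoid's monitor options are written as Nagios-style checks, a non-zero exit code means WARN or CRIT):

#!/bin/bash
# checkoid (sketch): check pool health and snapshot freshness,
# push a Gotify notification if either check is not OK.

GOTIFY_URL="https://gotify.example.com/message?token=CHANGEME"   # placeholder

notify() {
    curl -s -F "title=KRONOS checkoid" -F "message=$1" -F "priority=8" "$GOTIFY_URL" > /dev/null
}

health=$(sanoid --monitor-health)
[ $? -ne 0 ] && notify "Pool health: $health"

snaps=$(sanoid --monitor-snapshots)
[ $? -ne 0 ] && notify "Snapshots: $snaps"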

Health Checks and Gotifications

I use Gotify in my free Google Cloud instance to handle all my notifications. It's a fantastic program that works very well with almost zero resource usage. In the KRONOS crontab:

45 01 * * * curl https://hc-ping.com/ABC123-qre3-49r9-r963-ABC123
50 01 * * * /sbin/shutdown -h now

00:40 - wakeonlan 00:25:90:XX:XX:XX sent manually from my desktop (takes about 3 minutes before the server is "up")
00:45 - Healthchecks reports a successful ping
00:50 - the server shuts down automatically

My healthcheck only reports that the server started and doesn't yet consider the status of the 'monitor-health' command; closing that gap is exactly what "Checkoid" is for.
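As a first step (sketched here, not yet in the crontab, and assuming the packaged /usr/sbin/sanoid path), the ping could simply be gated on the health check so that a degraded pool turns into a missed ping that Healthchecks will alert on:

45 01 * * * /usr/sbin/sanoid --monitor-health && curl -fsS https://hc-ping.com/ABC123-qre3-49r9-r963-ABC123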

Thoughts? Leave a comment