Installing Sanoid on KRONOS; Experimenting with 'sanoid --monitor-snapshots' and adding basic health checks
Sanoid, Syncoid, and "Checkoid"
I have tried many different ZFS snapshot and replication tools, including znapzend and various pruning scripts. In the end, it's hard to beat the simplicity and efficacy of Sanoid. I'm a shill for excellent free software, and these Perl scripts are exactly my cup of tea.
With remote booting on KRONOS sorted, I am now looking at the actual data transfer automation. I won't describe the process here since my own notes are mostly copied directly from Jim Salter's GitHub.
Based on this post I can simply use:
sanoid --monitor-snapshots
for snapshot health, and
sanoid --monitor-health
for pool health.
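As I understand it, both commands behave like Nagios-style check plugins: they print a one-line status and exit 0 for OK, 1 for WARN, and 2 for CRIT, which makes them easy to script around. A quick sketch (the echo lines are just mine, for poking at the exit codes):

sanoid --monitor-health
echo "monitor-health exit code: $?"
sanoid --monitor-snapshots
echo "monitor-snapshots exit code: $?"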
When I ran 'monitor-health' I discovered that the pool was degraded! A missing drive. Not marked as faulted though, just "REMOVED".
I keep a table of all 16 drives and their serial numbers since I can't physically tell which is which without removing them (and pulling the wrong drive from an already-degraded RAIDZ1 VDEV would fault the whole pool).
root@node-kronos:~# lsblk -o PATH,SERIAL | grep -E '/dev/sd' | while read path serial; do wwn_link=$(ls -l /dev/disk/by-id | grep "wwn-0x.*$(basename $path)" | awk '{print $9}'); echo "| ${serial:0:8} | $path | $wwn_link |" | grep Z1P00ZDF; done
Result:
| Z1P00ZDF | /dev/sdo | wwn-0x5000c50033dcf23b |
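Dropping the final grep gives the full serial-to-device-to-WWN table for all 16 drives, which is essentially what I keep on file. A sketch of the same loop, with an end anchor on the grep so partition symlinks don't sneak in:

lsblk -o PATH,SERIAL | grep -E '/dev/sd' | while read path serial; do
    # match only the whole-disk wwn symlink for this device
    wwn_link=$(ls -l /dev/disk/by-id | grep "wwn-0x.*$(basename "$path")$" | awk '{print $9}')
    echo "| ${serial:0:8} | $path | $wwn_link |"
done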
With the physical position confirmed and the new WWN noted, I can run
zpool replace ZPOOL wwn-0x...23b wwn-0x...05b
... except that I didn't need to! zpool status showed that the replacement was already in progress! I had forgotten that I set
zpool set autoreplace=on ZPOOL
for this pool.
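For the record, confirming that property and keeping an eye on the resilver looks like this (ZPOOL again standing in for the real pool name):

zpool get autoreplace ZPOOL
zpool set autoreplace=on ZPOOL
zpool status -v ZPOOL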
While that was happening I let syncoid get caught up with snapshots and then tried out the 'monitor-snapshots' option:
root@node-kronos:~# sanoid --monitor-snapshots
CRIT: VAULT/remote-pool/AVALON/Archive newest hourly snapshot is 18d 10h 49m 34s old (should be < 2d 12h 0m 0s), CRIT: VAULT/remote-pool/AVALON/REDACTED newest hourly snapshot is 26d 8h 49m 32s old (should be < 2d 12h 0m 0s)
The "backup" template in sanoid.conf perfectly suits my use case though I will modify the alert periods to accomodate a weekly replication rather than a daily one.
Very cool. When I build the monitoring feature into a master script I will call it "Checkoid", since its purpose is to check on the health of the server, the pool, and the snapshots. If replication fails, if a drive fails, or if the server fails to start, I want to know about it, especially since it will only run when I'm not home and without any action on my part.
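A rough sketch of what "Checkoid" could end up looking like. Nothing here is final: the Gotify URL and token are placeholders, and the script just forwards whatever the two sanoid checks report:

#!/bin/bash
# "Checkoid" sketch: run sanoid's monitoring checks and push any problem to Gotify.
# GOTIFY_URL and GOTIFY_TOKEN are placeholders for my real instance and app token.
GOTIFY_URL="https://gotify.example.com"
GOTIFY_TOKEN="replace-me"

notify() {
    # Gotify's /message endpoint takes title/message/priority as form fields
    curl -s -o /dev/null "$GOTIFY_URL/message?token=$GOTIFY_TOKEN" \
        -F "title=KRONOS Checkoid" -F "message=$1" -F "priority=8"
}

health=$(sanoid --monitor-health); health_rc=$?
snaps=$(sanoid --monitor-snapshots); snaps_rc=$?

[ "$health_rc" -ne 0 ] && notify "Pool health: $health"
[ "$snaps_rc" -ne 0 ] && notify "Snapshots: $snaps"

exit 0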
Health Checks and Gotifications
I use Gotify in my free Google Cloud instance to handle all my notifications. It's a fantastic program that works very well with almost zero resource usage. In the KRONOS crontab:
45 01 * * * curl https://hc-ping.com/ABC123-qre3-49r9-r963-ABC123
50 01 * * * /sbin/shutdown -h now
So the nightly sequence is:
00:40 - wakeonlan 00:25:90:XX:XX:XX sent manually from my desktop (takes about 3 minutes before the server is "up")
00:45 - Healthchecks reports a successful ping
00:50 - the server shuts down automatically
My healthcheck only reports that the server started; it doesn't yet take the status of the "monitor-health" command into account. Closing that gap is exactly the point of the "Checkoid" idea discussed above.
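A small sketch of how the 01:45 ping could be made health-aware. Healthchecks accepts a /fail suffix on the ping URL to signal a failed run, so the idea is to ping the normal URL only when sanoid --monitor-health exits cleanly (the URL is the same placeholder as in the crontab above, and this script would replace the bare curl there):

#!/bin/bash
# sketch: report pool health through the existing Healthchecks check
URL="https://hc-ping.com/ABC123-qre3-49r9-r963-ABC123"
if sanoid --monitor-health >/dev/null 2>&1; then
    curl -fsS --retry 3 "$URL" >/dev/null
else
    # hitting the /fail endpoint flips the check to failed immediately
    curl -fsS --retry 3 "$URL/fail" >/dev/null
fi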