As part of my backup pipeline, I use rclone once per week to sync all my files from Seafile to an offsite location through the handy crypt backend. I recently switched to Backblaze B2 as my offsite storage provider, but learned only after uploading all my files that large files (anything over 200M, as per the docs) are not stored with a hash on upload. The crypt backend can still detect corruption once a file has been downloaded, but Backblaze itself has no way of verifying file integrity on download or of detecting corruption at rest. This can lead to a frustrating issue where downloading a file that was corrupted on upload is retried over and over (with unexpected EOF errors) until all attempts are exhausted and the download fails.
Fortunately, rclone lets you enable hashing on large files by using the --b2-upload-cutoff flag to raise the maximum hashed file size above the 200M default. Unfortunately, I only started using this flag about halfway through my initial upload, so I couldn’t just delete everything with a --min-size filter unless I wanted to upload everything over 200M again. Additionally, rclone doesn’t have a filter for deleting or re-syncing unhashed files (as of v1.59 at least), so I came up with the method below.
First, to obtain a list of files missing hashes, I ran
rclone sha1sum --fast-list -P backblaze:/ --output-file hashes.txt
which generated a file listing every file on the remote along with its hash (or blank spaces if no hash was attached). Note that I ran this against the remote underlying the crypt backend, not the crypt remote itself.
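Next, I ran
grep '^ ' hashes.txt | tr -d ' ' > filestodelete.txt
to generate a list of files to delete on the remote. Lines for files without a hash begin with blank padding, so matching a leading space selects only those; a plain grep " " would match every line, since the hash and the file name are themselves separated by spaces. Crypted file and directory names have no spaces, so we can safely remove them all here.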
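The filter can be tried on made-up sample lines before pointing it at a real hashes.txt. This is only a sketch: the function name unhashed_names, the sample hash, and the sample paths are invented for illustration. It strips just the leading padding with sed instead of removing every space, which may be the safer choice if names can contain spaces (for example on a non-crypt remote).

```shell
# Read "hash  name" lines on stdin and print only the names whose
# hash column is blank (i.e. lines that start with a space).
unhashed_names() {
    grep '^ ' | sed 's/^ *//'
}

# Made-up sample input: one hashed file, one with blank padding
# in place of the 40-character SHA-1.
{
    printf '%s  %s\n' 'da39a3ee5e6b4b0d3255bfef95601890afd80709' 'abc123/def456'
    printf '%40s  %s\n' '' 'ghi789/jkl012'
} | unhashed_names
# prints: ghi789/jkl012
```

On a real listing this would be used as unhashed_names < hashes.txt > filestodelete.txt.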
Finally, after running the file through wc to verify how many files I was deleting, I ran
while read -r nohash; do rclone delete "backblaze:/$nohash"; done < filestodelete.txt
to delete all files missing hashes.
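Since the deletion is destructive, it may be worth previewing it first. The helper below is a hypothetical sketch (for_each_listed is not part of rclone): it runs whatever command it is given once per listed file, so the same loop can be used with rclone's global --dry-run flag before deleting for real. The backblaze:/ prefix is the remote name used above.

```shell
# for_each_listed LISTFILE COMMAND [ARGS...]
# Runs COMMAND once per non-empty line of LISTFILE, appending
# "backblaze:/<line>" as the final argument.
for_each_listed() {
    list_file=$1
    shift
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        # stdin is redirected so the command cannot swallow the list
        "$@" "backblaze:/$name" </dev/null
    done < "$list_file"
}

# Preview first, then delete for real:
#   for_each_listed filestodelete.txt rclone delete --dry-run
#   for_each_listed filestodelete.txt rclone delete
```

Using IFS= and read -r keeps backslashes and leading whitespace in a name intact, which should not matter for crypted names but costs nothing.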
With those files deleted, another run of rclone sync with the increased upload cutoff flag replaced all the files missing hashes with properly hashed ones.
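To confirm the re-sync worked, one option is to regenerate the hash listing and count the entries that still have no hash. A minimal sketch, assuming the listing was written to a file again with rclone sha1sum; count_unhashed and the file name hashes-after.txt are hypothetical:

```shell
# Print how many lines of a sha1sum-style listing have a blank hash.
count_unhashed() {
    # grep -c still prints 0 when nothing matches, but exits non-zero,
    # so the exit status is ignored here.
    grep -c '^ ' "$1" || true
}

# Example, after re-running the listing into a new file:
#   rclone sha1sum --fast-list backblaze:/ --output-file hashes-after.txt
#   count_unhashed hashes-after.txt
```

A result of 0 would suggest every file on the remote now carries a hash.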