Verifying Remote Data

Hi all,

I am about 13TB into a ~26TB backup to Google Cloud (G Suite). I have been backing up for about a month straight; it was interrupted a few times, but it appears to resume OK and pick up where it left off. Today I had to reboot my Win10 PC, so I cancelled the backup and, after the reboot, thought I would try a test restore before restarting my backup.

I’m attempting to restore a single 1GB file to the desktop folder and it has been at ‘Verifying Remote Data’ for about two hours now; is this normal behaviour? Checking in Task Manager, there is very little CPU and network usage by the Duplicati process. All block and volume sizes are at the defaults (I had poor results when I changed these the last time and ended up deleting the config and starting fresh). I’m on Duplicati Beta 2.0.3.3 on a Windows 10 desktop, backing up from an HP G7 server with hardware RAID 6; Duplicati is using a mapped drive as the source.

There isn’t much detail in the logs besides this:

Jul 30, 2018 5:57 PM: removing file listed as Deleting: duplicati-b902833e5799b4296a3edaec680de177f.dblock.zip.aes
Jul 30, 2018 5:54 PM: removing file listed as Deleting: duplicati-b87ccbb25046a43d8b98f24c9982eef31.dblock.zip.aes
Jul 30, 2018 5:50 PM: removing file listed as Deleting: duplicati-bb7b801643f0047c385e92866d1acf5b4.dblock.zip.aes
Jul 30, 2018 5:42 PM: ExecuteReader: SELECT DISTINCT "Name", "State" FROM "Remotevolume" WHERE "Name" IN (SELECT "Name" FROM "Remotevolume" WHERE "State" IN ("Deleted", "Deleting")) AND NOT "State" IN ("Deleted", "Deleting") took 00:00:03.287
Jul 30, 2018 5:42 PM: Starting - ExecuteReader: SELECT DISTINCT "Name", "State" FROM "Remotevolume" WHERE "Name" IN (SELECT "Name" FROM "Remotevolume" WHERE "State" IN ("Deleted", "Deleting")) AND NOT "State" IN ("Deleted", "Deleting")
Jul 30, 2018 5:41 PM: RemoteOperationList took 01:40:04.943
Jul 30, 2018 5:41 PM: Backend event: List - Completed: (591.28 KB)

Hi @tjs4ever, welcome to the forum!

A few users have reported issues with “Verifying Remote Data” taking a long time, but I’ve never been able to replicate it. If I’m thinking correctly, usually all it does is get a list of your backend files (which seems to have happened at 5:41) and then compare that list to what it has locally in the database. If they match, it moves on to the next step of whatever process you’ve started.
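In rough terms the comparison is just “what the backend says it has” versus “what the local database says should be there”. A minimal Python sketch of that idea (names and structures are made up for illustration - this is not Duplicati’s actual code):

```python
# Illustrative only: compare a backend file listing against the local
# database's expectations. All names here are hypothetical.

def verify_remote(remote_listing, db_records):
    """remote_listing: {filename: size} from the backend List call.
    db_records:        {filename: expected_size} from the local database."""
    missing    = [f for f in db_records if f not in remote_listing]
    extra      = [f for f in remote_listing if f not in db_records]
    wrong_size = [f for f in db_records
                  if f in remote_listing and remote_listing[f] != db_records[f]]
    return missing, extra, wrong_size

# If all three lists come back empty, the verification passes and the
# restore (or backup) moves on to its next step.
```

The comparison itself should be quick; in your log the slow part looks like the List call itself (01:40:04).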

Note that when restoring from an incomplete backup (meaning one that hasn’t had at least one full backup finished) you’re working in a bit of a hybrid state. Normally when a backup is done, a dlist file is stored on the destination that contains the paths of everything that was backed up in that run.

Since your initial backup hasn’t finished, no dlist file has been created yet so Duplicati should be using the local database to simulate the dlist file contents.
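Conceptually that simulated dlist is just the list of paths recorded so far, pulled from the local database instead of from a dlist file on the destination. A rough sketch of the idea (the table and column names are illustrative, not necessarily Duplicati’s real schema):

```python
# Illustrative sketch: build the "would-be dlist" path list from the local
# database when no dlist file has been uploaded yet.
import sqlite3

def simulated_dlist(db_path):
    con = sqlite3.connect(db_path)
    # Hypothetical query: every path recorded so far for the unfinished backup.
    rows = con.execute('SELECT "Path" FROM "File"').fetchall()
    con.close()
    return [path for (path,) in rows]
```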

I’d suggest cancelling that test restore and trying another one with a much smaller file - just to see if the “Verifying” step takes a long time again.

Thanks for the prompt reply @JonMikelV !

I have resumed my backup for the time being, will try again in about a month when the backup completes.

When the files are listed, is it the volumes as stored in the cloud? If it’s a matter of IOPS, would increasing the volume size to, say, 1GB speed up the process 20x? My Duplicati DB is stored locally on an SSD (it is 25GB in size and rising), but it didn’t seem to be used at all during the attempt to restore. Offhand I have about 300,000 files totalling just under 25.5TB - is this beyond the normal usage of Duplicati?

I’m just a little worried due to the time invested in running a full backup; I was surprised to see so little CPU and network activity (less than 1Mbps and 1% CPU).

Appreciate any insights

The cloud files list is just a list of what’s in the cloud folder as if you browsed it - so it would be a bunch of “duplicati-*” files that have dlist, dindex, or dblock in the names.

Initial backups are always slow because every 100KB (default) block of every file has to be read, hashed, compressed, encrypted, uploaded, and recorded in the local database.
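As a very rough sketch of that per-block loop (illustrative Python assuming the 100KB default block size - not Duplicati’s actual code):

```python
# Illustration of the per-block work during an initial backup.
import hashlib

BLOCK_SIZE = 100 * 1024  # assuming the 100KB default block size
seen_hashes = set()      # stands in for the block-hash lookups in the local database

def process_file(path):
    new_blocks = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).digest()
            if digest not in seen_hashes:   # "have we seen this block before?"
                seen_hashes.add(digest)     # record it (a database insert in reality)
                new_blocks += 1             # then compress/encrypt/upload it
    return new_blocks
```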

While within normal usage limits of Duplicati, 25.5TB of source data means there will end up being hundreds of millions of block hash rows (roughly 270 million at the 100KB default) stored in the local database. Again, during the initial backup, writing all of those takes a while. And what’s worse, as more hashes are written, the reads to check whether a hash already exists get slower and slower. (There is an update coming to help optimize this, but I don’t know when it will happen.)
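Back-of-envelope, assuming the 100KB default block size and little deduplication across your data:

```python
# Rough estimate only.
source_bytes = 25.5 * 1024**4   # ~25.5TB of source data
block_bytes  = 100 * 1024       # assumed default block size
print(round(source_bytes / block_bytes / 1e6), "million block hash rows")  # ~274
```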

Once the initial backup is done, the database settles down to mostly reads and things run quite a bit faster.

As for a 1GB upload volume (dblock) size, that is something people have done - but usually only for local backups and when source files themselves are multiple gigs in size.

The main reason it’s probably not a good idea for you is that when you go to restore a file, no matter its size, the entire dblock file has to be downloaded so the individual blocks of the file can be pulled out and reassembled. With a 1GB dblock size, that means restoring a 10MB JPG would have to start with a 1GB download. Ouch!

Continuing with that example, a 10MB JPG would itself likely have been chopped into about 100 blocks. If all the blocks happened to have been stored in a single dblock, that’s great - only 1GB needs to be downloaded. But if you made a change to that JPG, it’s likely the updates would be stored in a different dblock file, so now you’ve got 2GB of downloads just to restore that 10MB JPG.
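Roughly, the arithmetic for that example (illustrative numbers only, again assuming the 100KB default block size):

```python
# Restore cost for a small file with a big dblock size.
jpg_bytes    = 10 * 1024**2    # the 10MB JPG
block_bytes  = 100 * 1024      # assumed default block size
dblock_bytes = 1 * 1024**3     # the hypothetical 1GB upload volume

blocks = -(-jpg_bytes // block_bytes)   # ceiling division -> 103 blocks
print(blocks, "blocks in the file")
# Best case: all blocks live in one dblock -> 1GB downloaded.
# Blocks spread across two dblocks after an edit -> 2GB downloaded
# to restore a 10MB file.
print(2 * dblock_bytes // 1024**3, "GB in the two-dblock case")
```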

And there’s similar stuff: if you’re using a retention policy to thin out older versions, the compacting process also downloads multiple 1GB dblock files so they can be recompressed and re-uploaded as fewer dblock files.

Sorry - I’ve probably gone on a little too much about that process. :blush:

Note that you can change the dblock (upload volume) size any time you want, and new files created will use the new dblock size. Duplicati is fine with mixed dblock sizes on the destination. So if you want to stop the backup (I’d suggest using “stop after current upload”) and change from the default 50MB to something bigger just to see what happens, that should work just fine.

I’m not sure if this is related, but I have found that the size of the local database can have a large impact on the performance. See my comments starting here. It appears that the query performed when verifying the consistency of the local database may scale poorly with large local databases.

I have a new issue since I tried to resume my backup: ‘Verifying backend data’ for a few hours, followed by:

Found remote files reported as duplicates, either the backend module is broken or you need to manually remove the extra copies.
The following files were found multiple times: duplicati-ide4b10972c214aa3b82ea961f7c1769f.dindex.zip.aes, duplicati-i0ca558bd793e47cdb6dd86edfafec4bb.dindex.zip.aes, duplicati-b6499979082874461854a4a1b2ecc72e4.dblock.zip.aes, duplicati-b4d9bc8a8aad34049bb1e46c31ea55720.dblock.zip.aes

I logged into GDrive, did a search, and there were two of each of these files. I deleted BOTH copies of each file, ran the backup again, and got the same message. I ran a repair on my DB, which completed OK, but any backup attempt gives the same result, even though these files are no longer in the remote folder.

What gives?

Slow down there! Deleting the wrong file from the destination can cause you to lose some of your backup versions!

Specifically, you should almost NEVER delete a duplicati-*.dblock.zip.aes file, because dblock files are where the actual backed up data is stored. The dlist and dindex files can be rebuilt from your source computer, but dblock files can’t.

I don’t suppose you can “undelete” at least one set of those files on your Google Drive, can you?

@JonMikelV

It would have been possible before I emptied the recycle bin; the files are gone forever. This has been such a roller-coaster. I had about 20TB uploaded previously before the backup became corrupted and unusable; I blew away my config completely, downgraded from Canary to Beta, and started fresh. I’m doing one final attempt, and if it fails I will look into another software solution. I should know in the next 30-40 minutes, as these list commands consistently take 1hr 40mins.

I appreciate the support.

Sorry you’re having such a rough time. Part of why Duplicati is still in beta is that it works fine for a majority of users but for the few where it doesn’t work well, it can really be a pain - and we can’t consistently figure out why it happens to them.

If you end up having more issues and look elsewhere, consider Duplicacy - it has similar functionality (deduplication, delta versions, etc.), though it is a bit more trusting of destinations than Duplicati is, and the GUI-based version is not free.

Thanks for the support and insights,

I am doing some testing with other software - not going to name drop, but it is an rsync-like solution. I don’t really need backup and versioning per se, so an encrypted sync of my files to the cloud is good enough for disaster recovery purposes.

kind regards,
TJ

It makes perfect sense to select a tool based on what one is trying to do. Possibly that will entirely remove the long file-list delay. I’m not as confident about the duplicate file problem, because that’s somewhat invited by Google Drive. Some programs, e.g. rclone, even created a specialty dedupe command, which perhaps Duplicati can do someday. Even better would be if duplicates could be prevented by:

Uploading using a pregenerated ID
(longer article)

I definitely agree with @ts678, go with the tool that works best for you! :smiley:

While I personally use Duplicati because I want the versioning, there are plenty of people I’ve helped who just want a sync tool where Syncthing proved perfectly adequate.

If you do find another tool you like, please let us know, in case other users find themselves in the same situation as you. We’d rather have people doing backups with ANY tool than not having backups at all!

Yeah, I can definitely attest to the importance of working backups. I’ve got a Veeam solution up and running, but both my prod data and backups are physically in the same rack (home setup).

My interest in G Suite storage is purely an experiment with disaster recovery (fire, flood, theft, etc.).

I’m trying out rclone and so far so good. Unlike Duplicati there is no local DB and no long verification process; when there is an interruption, or when I exceed my daily upload limit with Google Drive, it just retries indefinitely instead of throwing a fit; and there is fine-grained control over the desired upload speed (for example 8.6MB/s).

I’m probably going to stick with rclone for the foreseeable future. It isn’t as accessible as some of the other solutions since there is no native GUI or scheduler, but 3rd-party add-ons exist.

Thanks for letting us know what ended up working out for you. :slight_smile: