Backup "to go" file size seems completely wrong?

wolffstarr · November 4, 2017, 4:19pm

Greetings!

Just starting out with Duplicati, backing up to OneDrive. I’ve got two separate jobs set up, one for my “Backups and Photos” directories, the other for everything else. This is from an OpenMediaVault NAS, local filesystem is ZFS. I am only working with the “Backups and Photos” at the moment.

The Photos portion is fairly straightforward and almost entirely photos with a video here or there. The Backups directory is the target for my UrBackup server that I use for full-system images and incremental backups for all our PCs. When I set up the backup job, it was with a new OneDrive account, and I ran into the “5GB free tier” issue, despite not starting a free trial; that’s been cleared up and it’s uploading cleanly.

The problem is, it’s completed the file count (which looks right; output of find . -type f | wc -l comes within shouting distance) and the file size it’s found is 770GB. On disk, the Backups directory is 227GB and the Photos directory is 232GB. I have compression turned on in ZFS; as expected Photos is at a compress ratio of 1.00. Backups is at 1.31, which would at worst case bring the 227GB to 300GB. So somehow, Duplicati is seeing an extra 250GB of file size.

I thought it might be running both backup jobs combined for some reason, but the other job is well over 400GB in size (and backing up to a different OneDrive account, actually). Does anyone know why it’s happening? UrBackup uses a ton of symlinks; is it possible that Duplicati is counting each symlinked file as a whole file? The Advanced option seemed to indicate the default is to just record that it exists and where it points to, so I wouldn’t have expected that.

Thanks for any help you folks can provide!

JonMikelV · November 5, 2017, 12:43am

Hello @wolffstarr, thanks for using Duplicati!

If I’m understanding the issue correctly, you have known 227G of Source content but Duplicati is reporting “Source: 250G” in the backup job summary?

Since Duplicati works at the file level (as exposed by your file system) the underlying format (in your case ZFS) and any versioning (such as with COW) or compression should not affect it at all.

Your understanding of the default symlink functionality is correct: just record the symlink “link” - don’t follow it to the actual target file contents. I could be wrong, but I think symlinks are still actual files - they just happen to point to somewhere else. So in that sense, Duplicati is treating them as files - and they will have a size.

Assuming this is the actual source of your issue and assuming a symlink is a 1k file (total uneducated guess there) then your extra 23G would mean…hmm…let’s see here…roughly 24 million symlinks. Does that seem possible to you?

I suppose one way to test this would be to change your job to have --symlink-policy=ignore and see if the numbers line up correctly (then change it back after the test backup is done).

wolffstarr · November 5, 2017, 1:14am

Not… quite. The basics are correct, the amounts are not.

Backup job has 2 folders, “Backups” and “Photos”. Backups is 227GB in size, Photos is 232GB in size, so total on-disk size is 459GB.

The bar at the top of the Home screen on the Duplicati web UI is reporting well over 760GB in total for the backup job. So I’ve got a difference of about 300GB.

Now, the Backups folder is somewhere north of 3 million files when I do find . -type f | wc -l from the backups directory, which is supposed to exclude symlinks. That said, 3 million symlinks at 100 bytes each would only add up to about 300MB, nowhere near 300GB, but I will try setting symlinks to ignore and restarting the backup anyway.
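
A quick sanity check of the numbers in this post, as a sketch only: the 760GB, 227GB and 232GB figures come from the thread, while the 3 million count and the 100 bytes per symlink are rough guesses (decimal gigabytes assumed).

```python
# Back-of-the-envelope check; counts and per-link size are guesses from the thread.
GB = 1000 ** 3

def symlink_overhead_bytes(count: int, bytes_per_link: int) -> int:
    """Total size if every symlink were counted as a small file."""
    return count * bytes_per_link

reported = 760 * GB            # total shown in the Duplicati web UI
on_disk = (227 + 232) * GB     # Backups + Photos as measured on disk
gap = reported - on_disk

overhead = symlink_overhead_bytes(3_000_000, 100)

print(f"unexplained gap: {gap / GB:.0f} GB")
print(f"3M symlinks at 100 bytes each: {overhead / GB:.1f} GB")
```

Under these assumptions the symlink entries account for well under 1GB, so they would not explain a gap of roughly 300GB by themselves.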

wolffstarr · November 5, 2017, 5:27am

Okay, so I stopped the backup job, edited the configuration to set symlinks to “ignore”, and restarted. Same thing; 750-ish gigs of data, not counting what’s already in OneDrive I assume.

Just to make sure that it isn’t something odd with the way my NAS is set up, I went ahead and started the other backup job for all the other things that I’d also set up. Once I started that and the count completed, I got the expected 35k or so files with a file size total of 449 gigs.

The problem seems to be either something with the way UrBackup is creating/storing files, or that the sheer number of files is causing something to choke. Unfortunately, they’re fairly evenly spread out, so I can’t even exclude a certain directory to cut down on the total. I may just pull the Backups directory out of it and shift the Photos directory to the other backup job, but that would require rethinking my entire backup strategy at least somewhat.

davegold · November 5, 2017, 2:50pm

Let’s verify that your config is correct:
- Open the job in the GUI, select Configuration --> Export.
- Use the “As Command Line” option.
- Scrub the output so we don’t get your passphrase etc.

-Dave

wolffstarr · November 5, 2017, 3:16pm

Here you go:

mono "/usr/lib/duplicati/Duplicati.CommandLine.exe" backup "onedrive://Duplicati?authid=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" "/srv/trinity/Photos/" "/srv/trinity/Backups/" --backup-name="Backups and Photos" --dbpath="/root/.config/Duplicati/YLWNVPFYBO.sqlite" --encryption-module="aes" --compression-module="zip" --dblock-size="50mb" --keep-time="1Y" --passphrase="XXXXXXXXXXXXXXXX" --symlink-policy="Ignore" --disable-module="console-password-input"

If you want, I can provide the second backup that is showing the correct amount.

Thanks!

Billium28 · November 9, 2017, 7:56pm

Been on the fence about abandoning Duplicati because every time I restart my PC the “to go” amount starts over. It is now at 1.25TB and when it started this morning it was 1.59TB. Hell no did I upload that much in 6 hours.

AND TO TOP IT OFF, when I had it running all last week I got down to ~0.9TB which is where I should be.

Tomorrow it will be back to 1.59TB. Backblaze says I have 600GB uploaded which fits with the 0.9TB.

Why do I get such terrible reporting? And I want to know if it is reuploading the same stuff.

JonMikelV · November 9, 2017, 8:21pm

The issue is that it’s unclear what the “to go” number really means: it is the number of files (and the size of those files) that Duplicati still has to look at to see whether or not they have changed, not the amount left to upload. At present the GUI doesn’t show anywhere the number of files actually changed or being uploaded.

It shouldn’t be re-uploading the same stuff. Duplicati can tell whether or not a file has changed and, if so, what parts (at least down to a 100k chunk, with default settings). It will only upload as many 100k chunks as needed to make sure all CHANGES are backed up.
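
To illustrate the idea of uploading only changed chunks, here is a minimal sketch. It is not Duplicati’s actual code: the fixed 100 KiB block size mirrors the “100k chunk” mentioned above, and the use of SHA-256 and the function names are assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 100 * 1024  # assumed to mirror the "100k chunk" default mentioned above

def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each fixed-size block of the data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

def blocks_to_upload(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Indexes of blocks in `new` whose hash was not seen in `old`."""
    known = set(block_hashes(old, block_size))
    return [
        index
        for index, digest in enumerate(block_hashes(new, block_size))
        if digest not in known
    ]

old_version = bytes(300 * 1024)          # three blocks of zeros
new_version = bytearray(old_version)
new_version[150 * 1024] = 1              # change one byte inside the second block

print(blocks_to_upload(old_version, bytes(new_version)))  # [1]
```

Only the block containing the modified byte is reported, which is why a large “to go” figure during scanning does not imply the same amount of upload traffic.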

This is a known shortcoming in the interface and some updates have been proposed (though not yet implemented) including my favorite mentioned here:

kenkendk · November 10, 2017, 1:10pm

If you have 3M files, that might be a long list, but you can perhaps try this:

mono "/usr/lib/duplicati/Duplicati.CommandLine.exe" test-filters "/srv/trinity/Photos/" --symlink-policy=ignore "*" > filelist.txt

This will give you all the files that Duplicati finds. It calls the same code as the backup process. Maybe you can see something in there that is off.

Niels_Hoogenhout · November 10, 2017, 8:08pm

I assume that it’s not uploading everything again but just checking all your data for changes and whether it was already uploaded. Since you have quite a large amount of data, this can take some time to process, especially if you have a lot of small files.

So it’s normal that it starts over again with the same amount to go. And it explains why it goes down so fast. FYI, some other tools like CrashPlan check your data continuously in the background; Duplicati doesn’t. That’s why it takes more time to check this when a backup starts.

I hope this makes it clearer for you.

wolffstarr · November 12, 2017, 12:20am

Well, I fear the actual file is going to be useless, but it might give me some clues. Photos (which is the one you used in the command) has about 15k files in it. The text file that the command generated was 7.4MB in size. If one makes the (possibly invalid) assumption that 15k files = 7.4MB, then 30k files = 14.8MB, 300k = 148MB, and 3m = 1.48GB, then something strange is definitely afoot at the Circle K. Because the file is still being generated for Backups, and I’m at 6GB (and climbing) currently.

I would’ve expected less size for more files, as things that aren’t specific to the file list itself would not be duplicated.
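
The extrapolation above can be written out as a small sketch. The 15k files, 7.4MB and 6GB figures are from this post; the assumption that list size grows linearly with file count is the same (possibly invalid) one made above, and the function name is just for illustration.

```python
MB = 1000 ** 2
GB = 1000 ** 3

def extrapolate_list_size(sample_files: int, sample_bytes: float, target_files: int) -> float:
    """Scale the size of a file list linearly with the number of files."""
    return sample_bytes / sample_files * target_files

# Photos: about 15k files produced a 7.4MB list.
expected = extrapolate_list_size(15_000, 7.4 * MB, 3_000_000)

# Backups: the list had already reached 6GB and was still growing.
observed_so_far = 6 * GB
ratio = observed_so_far / expected

print(f"expected for 3M files: {expected / GB:.2f} GB")
print(f"observed so far is {ratio:.1f}x the estimate")
```

Even at 6GB the output is already about four times the linear estimate, which suggests the scan is listing far more paths than the roughly 3 million files counted by find.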

Either way, I think the end result is going to be using Duplicati to back up everything except the UrBackup stuff. Since I’m mostly using it for system images, I can’t exactly use Duplicati to replace it, so not sure what I’m going to do.

JonMikelV · November 12, 2017, 4:41am

It sounds to me like there might be a circular reference floating around, despite the ignore symbolic link setting.

Party on, dude!

wolffstarr · November 12, 2017, 4:54am

You could say that, yes. I declined to cancel the command, just to see what would happen. I now have a text file at 37GB and still climbing. Which is… yeah. That’s nuts. I don’t even want to think about opening it.

EDIT: Just for giggles, I canceled the command, and compressed the 37GB file with gzip. Compressed size is 647MB.

JonMikelV · November 12, 2017, 3:08pm

Yeah, that sounds a bit suspect.

@kenkendk, would it be difficult to analyze the database (or test filters result) and generate a “recursive loop probability” score?

kenkendk · November 30, 2017, 10:06pm

@JonMikelV Yes, it is possible, but it is a bit of a memory hog, so I only do it for hard-links, but maybe there is a kind-of hardlink that Duplicati does not detect.

Basically, for any path, you need to query all past paths to see if the new path is a subpath, so it runs in O(n^2) time (can be fixed a bit) and requires that you store all paths in memory (alternatively, that you query the database for each path).
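
A minimal sketch of the naive check described here, not Duplicati’s actual code. It assumes the input is a list of already-resolved paths in the order a scanner meets them; the function name and the example paths are hypothetical.

```python
from pathlib import PurePosixPath

def find_subpath_hits(resolved_paths):
    """Flag each path that equals or lies beneath a previously seen path.

    Every new path is compared against all earlier ones, so the total work
    grows quadratically, and every path is kept in memory.
    """
    seen = []
    hits = []
    for raw in resolved_paths:
        path = PurePosixPath(raw)
        for earlier in seen:                      # linear scan per new path
            if path == earlier or earlier in path.parents:
                hits.append((str(earlier), str(path)))
        seen.append(path)                         # nothing is ever discarded
    return hits

# Hypothetical resolved targets, in scan order.
targets = [
    "/data/backups/clients",
    "/data/photos",
    "/data/backups/clients/pc1/current",
]
print(find_subpath_hits(targets))
```

The inner loop over `seen` is where the quadratic cost comes from; a sorted structure or a trie could reduce it, at the price of more bookkeeping.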

ffsb · July 31, 2019, 1:54pm

…about using duplicati to complement urbackup…
IMHO,
the problem is that Duplicati doesn’t handle hardlinks as hardlinks… it either duplicates the files (and you would run out of disk space if you restored your UrBackup) or doesn’t create the hardlinks at all (and your UrBackup index would point to non-existent links and look incomplete)
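
If each hard-linked name is counted as a separate full-size file, the apparent size of a tree can be far larger than what it occupies on disk. The sketch below measures both figures for a directory. It is a standalone illustration, not part of Duplicati or UrBackup; the function name and the demo file names are made up.

```python
import os
import stat
import tempfile

def apparent_and_unique_size(root):
    """Return (sum over every file name, sum over distinct inodes) in bytes."""
    apparent = 0
    unique = 0
    seen_inodes = set()
    for dirpath, _dirnames, filenames in os.walk(root):  # symlinked dirs are not followed
        for name in filenames:
            info = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(info.st_mode):
                continue                                  # skip symlinks and special files
            apparent += info.st_size
            key = (info.st_dev, info.st_ino)              # identifies the underlying file
            if key not in seen_inodes:
                seen_inodes.add(key)
                unique += info.st_size
    return apparent, unique

# Demo: one 4096-byte file plus a hard link to it.
with tempfile.TemporaryDirectory() as root:
    original = os.path.join(root, "image.vhd")
    with open(original, "wb") as handle:
        handle.write(b"x" * 4096)
    os.link(original, os.path.join(root, "image_link.vhd"))
    sizes = apparent_and_unique_size(root)

print(sizes)  # (8192, 4096)
```

Running the function on a real backup directory and comparing the two numbers would show how much of a size discrepancy could come from hard links.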