The issue I’ve noticed is that when new data is added to a backup set and a job is run to upload it (especially if there’s a lot), the backup engine isn’t especially good at figuring out in advance how much of the pending job is already uploaded. I’m not sure it’s even possible (and forgive me in advance for saying “but CrashPlan…”), but it’s a bit frustrating to add 10 GB of data to my backup set of ~400 GB and, for the hour or so it takes to process the new 10 GB, have it say “200 GB to go” because it didn’t notice that 190 of those are already uploaded.
That’s not to say anything is actually broken - I’ve verified over and over that it DOES successfully skip the things it needs to skip - but for someone like me who likes to micromanage these things, it’s a bit frustrating never really knowing how much more there is to upload (or which file is currently being processed).
In the same vein (and perhaps more plausible to implement up front), I’ve realized I’d really love to see some indicator of the total volume of data uploaded so far this session - the same data that’s shown in the log output at the end, but without having to wait for some large backup job to complete to see it.
Would it be implausible to work up some sort of two-pass approach that does effectively the same work (and the same amount of work), but reprioritizes processing when a backup starts so that files that can be skipped are discounted from all of the totals right away, up front, instead of sitting in the queue and looking like extra bulk? From my experience with (…sorry…) CrashPlan, this is superficially what it seems to do when running a backup job, and it always seemed like a satisfying approach to me.
Well, as a developer I’m used to knowing the back-end workings of the product, being familiar with its technical details (and limitations), and having to resist the urge to roll my eyes when a client asks for something that’s physically impossible or goes against the very basics of how we do things… and now, of course, the shoe is on the other foot. So forgive me for inevitably inventing things that I’m sure don’t accurately represent the way the engine actually works.
So in my imagination, what happens when I run a backup where 99% of the data is already in the backup set is this: the parser (by whatever means) scans the current local files in the backup set, determines that everything from 1% through 49% is already in the backup set and can be skipped, arrives at the file(s) between 50% and 51% which are new or changed, creates and uploads their block and index files, then zooms through everything from 52% through 100%.
I can only make assumptions about what Duplicati does in the background when it finds new or changed files that need to be uploaded. But what I’m picturing is that, instead of stopping and uploading them immediately, it could add them to a queue of files to be added to the backup set - in my mind that’s as simple as a text list or a list of IDs or whatever you use to identify individual files in the local filesystem. Each such file would be skipped for the moment (and its size NOT yet deducted from the “total remaining”), letting the engine process through everything that’s already been uploaded, as it normally does.
Then, when it reaches the end of the file list as it currently would, instead of jumping to “verifying” it would simply start again at the beginning of the “new/changed files queue” it just assembled. The “total size remaining” should already be accurate at that point: the initial total size minus the size of the already-uploaded files equals the size of the new/changed files. And everyone’s confusion would be reduced by that much more, in various use cases (“I stopped my 50 GB upload and restarted halfway through and it’s still saying I have 50 GB to upload”, or the one that applies to me most often, “I added 1 file to a 200 GB backup and it says I have 150 GB to upload”, etc.).
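To make the idea concrete, here’s a minimal sketch of the first (scan-only) pass I’m imagining. This is purely illustrative and not Duplicati’s actual engine code: `scan_pass`, `sizes`, and the `already_uploaded` predicate are all hypothetical names, and in reality the predicate would presumably consult the local block/index database rather than a simple lookup.

```python
def scan_pass(paths, sizes, already_uploaded):
    """Pass 1: walk the file list without uploading anything.
    Files already in the backup set are deducted from the running
    total immediately; new/changed files are queued for pass 2."""
    remaining = sum(sizes[p] for p in paths)
    queue = []
    for path in paths:
        if already_uploaded(path):
            remaining -= sizes[path]   # skipped: drop from "to go" right away
        else:
            queue.append(path)         # deferred: its size stays in the total
    # After this pass, `remaining` is exactly the bytes still to upload,
    # so the progress display is honest before any upload even starts.
    return remaining, queue

# Hypothetical example: 190 of 200 units already uploaded.
sizes = {"old1": 100, "old2": 90, "new": 10}
remaining, queue = scan_pass(list(sizes), sizes, lambda p: p.startswith("old"))
# remaining -> 10, queue -> ["new"]
```

Pass 2 would then just drain `queue`, doing the actual block/index uploads, with “total remaining” already matching the real work left.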
As a simple first step, maybe changing the current text from “to go” to something more specific like “to be scanned” or “to be indexed” would help avoid the “why is it backing up all 50 TB of my data again!!!” questions.
That progress bar is kind of tall - perhaps it could be split into a top and bottom half, with the top showing “to be scanned” while the bottom shows “to be backed up” (which I would expect to fluctuate both left AND right until the “to be scanned” bar gets to 100%).