Backup states/phases - information reported to users about what their backup is doing is not helpful

I accidentally posted this to backup state/phases · Issue #3429 · duplicati/duplicati · GitHub instead of posting it here. It didn’t get much response there so I’m reposting here now that I realise my mistake. I’ve put the reply history from there below as well.

Environment info

  • Duplicati version : 2.0.3.3_beta_2018-04-02
  • Operating system : Windows 10
  • Backend : external USB drive.

Description

I’ve searched for a list of backup states/backup phases but can’t find anything. Two that I’ve seen so far that have puzzled me are:
Backup_Begin
and
Backup_Delete

Backup_Begin is particularly puzzling. The web gui interprets that as “starting” but how is it possible that it is starting when it appears to be doing nothing? It has been in this state for more than 10 minutes now. If I use duplicati_client/duc it reports this information to me:
Progress:
BackendAction: List
BackendFileProgress: 0
BackendFileSize: 0
BackendIsBlocking: false
BackendPath: null
BackendSpeed: -1
BackupID: ‘130’
CurrentFilename: null
CurrentFileoffset: 0
CurrentFilesize: 0
OverallProgress: 0.0
Phase: Backup_Begin
ProcessedFileCount: 0
ProcessedFileSize: 0
StillCounting: false
TaskID: 13
TotalFileCount: 0
TotalFileSize: 0
During this time it is clearly not counting files because there still is no count, and it would tell me if it were counting files. It is not checking anything on the USB drive, because the drive light is not flashing. What is it doing?
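For anyone watching this through a script rather than the UI, the snapshot above suggests a simple client-side heuristic. This is only a sketch: the field names are copied from the duc dump above, but the "stalled vs. counting" logic is my own guess, not anything Duplicati itself provides:

```python
import json

# Progress snapshot as reported through the server API (field names copied
# from the duplicati_client dump above; values are illustrative).
snapshot = json.loads("""{
    "BackendAction": "List",
    "Phase": "Backup_Begin",
    "StillCounting": false,
    "ProcessedFileCount": 0,
    "TotalFileCount": 0,
    "OverallProgress": 0.0
}""")

def describe(p: dict) -> str:
    """Rough heuristic for what a progress snapshot implies."""
    if p["StillCounting"]:
        return "counting files"
    if p["Phase"] == "Backup_Begin" and p["TotalFileCount"] == 0:
        # All counters at zero: server-side work (e.g. database queries)
        # that the progress API simply does not surface.
        return "busy before the file scan (no progress reported)"
    return f"{p['Phase']} ({p['OverallProgress']:.0%})"

print(describe(snapshot))  # busy before the file scan (no progress reported)
```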

I did not notice it doing this long backup begin phase until after I turned on exclusion of system files in the backup - does it have anything to do with that?

Backup_Delete - this one seems particularly horrifying until you assume that it is NOT actually deleting the backup you just completed, but is really just deleting files in the backup that are no longer required - at least that is what I assume is happening. It would be great if it were called something less scary!

Is there a list of backup states somewhere that describes what each state means?

Thanks again,
Stephen

Further update from @Stephen
After 15 minutes it finally began counting files and then very quickly changed state to Backup_PreBackupVerify - so what was it actually doing for the preceding 15 minutes?

Reply from @Pectojin
On my machine it will spend a bit of time listing the backend files before it starts counting local files. I believe this is just a verification that everything looks OK before starting the backup itself.

    Progress:
      Backend:
        Action: List
      State: Backup_Begin
      Task ID: 40

Is it possible it’s having trouble listing the files on the “backend”? How many files are on the USB? And do you have any noticeable delay in listing the files if you open the USB manually in Explorer?

By the way, I’ve noticed in the client it will start saying Counting files: true much earlier than in the Web UI. So it’s possible it’s counting files before the 15 minutes pass, although 15 minutes still seems long without an update in the UI.

Reply from @Stephen
@Pectojin, thanks for your reply. There are 4054 files on the USB drive and it took about 10 seconds to list every single one of them. But during the Backup_Begin phase it wasn’t even accessing the USB drive, so I don’t think it is related to that. I’d really like to know what it is doing in that time - it needs to communicate better! :slight_smile:

As far as counting files goes, I was checking using your duc application to confirm if it was reporting counting files before the webui, but the whole time it was in the backup_begin state it was not.

This seems more like internal code detail (in that sense, starting at GitHub was a good idea). Do these states show up anywhere in official Duplicati software? Regardless, I found some source code that may help slightly.

https://github.com/duplicati/duplicati/blob/master/Duplicati/Library/Main/OperationPhase.cs
https://github.com/duplicati/duplicati/blob/master/Duplicati/Server/webroot/ngax/scripts/services/ServerStatus.js

Regarding the “doing nothing” look, do you use a --snapshot-policy setting? For me, VSS causes a pause in action seen for example by MENU --> About --> Show log --> Live, even at the usually wordy Profiling setting.

Thanks for your reply @ts678

I’m not sure that this question is so much about internal code detail as looking for an explanation of states being reported to users. So your links did take me to this list of states:
‘Backup_Begin’: gettextCatalog.getString(‘Starting backup …’),
‘Backup_PreBackupVerify’: gettextCatalog.getString(‘Verifying backend data …’),
‘Backup_PostBackupTest’: gettextCatalog.getString(‘Verifying remote data …’),
‘Backup_PreviousBackupFinalize’: gettextCatalog.getString(‘Completing previous backup …’),
‘Backup_ProcessingFiles’: null,
‘Backup_Finalize’: gettextCatalog.getString(‘Completing backup …’),
‘Backup_WaitForUpload’: gettextCatalog.getString(‘Waiting for upload to finish …’),
‘Backup_Delete’: gettextCatalog.getString(‘Deleting unwanted files …’),
‘Backup_Compact’: gettextCatalog.getString(‘Compacting remote data …’),
‘Backup_VerificationUpload’: gettextCatalog.getString(‘Uploading verification file …’),
‘Backup_PostBackupVerify’: gettextCatalog.getString(‘Verifying backend data …’),
‘Backup_Complete’: gettextCatalog.getString(‘Backup complete!’),
‘Restore_Begin’: gettextCatalog.getString(‘Starting restore …’),
‘Restore_RecreateDatabase’: gettextCatalog.getString(‘Rebuilding local database …’),
‘Restore_PreRestoreVerify’: gettextCatalog.getString(‘Verifying remote data …’),
‘Restore_CreateFileList’: gettextCatalog.getString(‘Building list of files to restore …’),
‘Restore_CreateTargetFolders’: gettextCatalog.getString(‘Creating target folders …’),
‘Restore_ScanForExistingFiles’: gettextCatalog.getString(‘Scanning existing files …’),
‘Restore_ScanForLocalBlocks’: gettextCatalog.getString(‘Scanning for local blocks …’),
‘Restore_PatchWithLocalBlocks’: gettextCatalog.getString(‘Patching files with local blocks …’),
‘Restore_DownloadingRemoteFiles’: gettextCatalog.getString(‘Downloading files …’),
‘Restore_PostRestoreVerify’: gettextCatalog.getString(‘Verifying restored files …’),
‘Restore_Complete’: gettextCatalog.getString(‘Restore complete!’),
‘Recreate_Running’: gettextCatalog.getString(‘Recreating database …’),
‘Repair_Running’: gettextCatalog.getString(‘Repairing database …’),
‘Verify_Running’: gettextCatalog.getString(‘Verifying files …’),
‘BugReport_Running’: gettextCatalog.getString(‘Creating bug report …’),
‘Delete_Listing’: gettextCatalog.getString(‘Listing remote files …’),
‘Delete_Deleting’: gettextCatalog.getString(‘Deleting remote files …’),
‘PurgeFiles_Begin’: gettextCatalog.getString(‘Listing remote files for purge …’),
‘PurgeFiles_Process’: gettextCatalog.getString(‘Purging files …’),
‘PurgeFiles_Compact’: gettextCatalog.getString(‘Compacting remote data …’),
‘PurgeFiles_Complete’: gettextCatalog.getString(‘Purging files complete!’),

Which is great, but it doesn’t tell me what it really means. The first state:
‘Backup_Begin’ was in my original message and is displayed to users as ‘Starting backup …’, as I explained in my original question. But as a user, I expect “starting” to be something that a program does for a few seconds, not 15 minutes. So after those few seconds have passed, I’d really like to know what it is doing, because it just looks to me like the backup tool has failed until, suddenly, 15 minutes later, it starts doing something.
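Mechanically, that list is just a lookup from phase code to display string, with null meaning the UI substitutes live progress instead. A minimal Python rendering of the same idea (entries trimmed from the list above; the fall-back-to-raw-code behaviour is my assumption about how a client such as duplicati_client might handle unmapped phases):

```python
# Sketch of the web UI's phase-to-message lookup, trimmed to a few
# entries copied from ServerStatus.js. None marks phases where the UI
# shows live progress instead of a fixed string.
PHASE_MESSAGES = {
    "Backup_Begin": "Starting backup …",
    "Backup_PreBackupVerify": "Verifying backend data …",
    "Backup_ProcessingFiles": None,
    "Backup_Delete": "Deleting unwanted files …",
    "Backup_Complete": "Backup complete!",
}

def display_phase(phase: str) -> str:
    """Return user-facing text for a phase; fall back to the raw code."""
    message = PHASE_MESSAGES.get(phase)
    return message if message is not None else phase

print(display_phase("Backup_Begin"))            # Starting backup …
print(display_phase("Backup_ProcessingFiles"))  # Backup_ProcessingFiles
```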

To answer your question about official Duplicati software - it appears that the official Duplicati server reports these states through its API to other applications, such as duplicati_client/duc.

I have never set a --snapshot-policy setting. So unless it is something enabled in the server by default, then no, I’m not using one.

I’ll test that profiling option and report back here with an update. I am in the middle of a backup now, so now that I’ve seen what the profiling option does I realise that I need to do that at the time it appears to be doing nothing in order to work out what it is really doing.

Thanks!

The points here were that you might get little response, because almost nobody has seen them, and that third-party use of an internal API is likely to be rough around the edges. If you’d said Duplicati should better explain what the more user-friendly web UI messages mean, then I’d agree with you. :wink:. That’s not done yet, AFAIK.

The --snapshot-policy should be off by default. You can double-check this on the job’s Commandline UI option, or by an export. That was my only hope based on my own experience. My VSS takes 50 seconds, but reports are that some can take much longer. I’m not sure how widespread slow ones are. If they’re widespread, a new state reflecting that particular wait might be useful. The same might apply if whatever is causing your slow spot is common. I agree that it’s bad UI practice to leave things just sitting without explanation or some sort of progress display.

@ts678, I’ve exported my command line and there is no reference to snapshots so I think I’m fine. Based on your advice, I’ve updated the title of my post. Thanks.

I will post information from the first 15 minutes of my backup, till it visibly does something, soon.

I will try to get more details on the next backup run. Unfortunately, for the second time in a row, I’ve been called away from my computer at just about the time that things should have been getting interesting… So I’m posting my interim observations now anyway:

The key for me in what I’ve collected below is that there was a 9 minute period in the log where NOTHING was being reported, and therefore presumably nothing was happening. What was it doing that whole 9 minutes?

*** 11:51 lots was happening, but no data was being transferred (presumably because there were so few changes since last time, it was just doing lots of checking between the source and the backup list, and therefore had no changes to write to the storage media)
*** at this point I was called away from my computer
27 Oct 2018 11:30: Starting - ExecuteReader: SELECT DISTINCT “Fileset”.“ID”, “Fileset”.“Timestamp” FROM “Fileset”, “RemoteVolume” WHERE “RemoteVolume”.“ID” = “Fileset”.“VolumeID” AND “Fileset”.“ID” IN (SELECT “FilesetID” FROM “FilesetEntry”) AND (“RemoteVolume”.“State” = “Uploading” OR “RemoteVolume”.“State” = “Temporary”)
27 Oct 2018 11:30: ExecuteScalarInt64: SELECT COUNT(*) FROM “File” WHERE “BlocksetID” != ? AND “BlocksetID” != ? AND NOT “BlocksetID” IN (SELECT “BlocksetID” FROM “BlocksetEntry”) took 00:00:25.885
27 Oct 2018 11:30: Starting - ExecuteScalarInt64: SELECT COUNT(*) FROM “File” WHERE “BlocksetID” != ? AND “BlocksetID” != ? AND NOT “BlocksetID” IN (SELECT “BlocksetID” FROM “BlocksetEntry”)
27 Oct 2018 11:30: ExecuteScalarInt64: SELECT COUNT(*) FROM (SELECT * FROM (SELECT “N”.“BlocksetID”, ((“N”.“BlockCount” + 3200 - 1) / 3200) AS “BlocklistHashCountExpected”, CASE WHEN “G”.“BlocklistHashCount” IS NULL THEN 0 ELSE “G”.“BlocklistHashCount” END AS “BlocklistHashCountActual” FROM (SELECT “BlocksetID”, COUNT(*) AS “BlockCount” FROM “BlocksetEntry” GROUP BY “BlocksetID”) “N” LEFT OUTER JOIN (SELECT “BlocksetID”, COUNT(*) AS “BlocklistHashCount” FROM “BlocklistHash” GROUP BY “BlocksetID”) “G” ON “N”.“BlocksetID” = “G”.“BlocksetID” WHERE “N”.“BlockCount” > 1) WHERE “BlocklistHashCountExpected” != “BlocklistHashCountActual”) took 00:00:01.552
27 Oct 2018 11:30: Starting - ExecuteScalarInt64: SELECT COUNT(*) FROM (SELECT * FROM (SELECT “N”.“BlocksetID”, ((“N”.“BlockCount” + 3200 - 1) / 3200) AS “BlocklistHashCountExpected”, CASE WHEN “G”.“BlocklistHashCount” IS NULL THEN 0 ELSE “G”.“BlocklistHashCount” END AS “BlocklistHashCountActual” FROM (SELECT “BlocksetID”, COUNT(*) AS “BlockCount” FROM “BlocksetEntry” GROUP BY “BlocksetID”) “N” LEFT OUTER JOIN (SELECT “BlocksetID”, COUNT(*) AS “BlocklistHashCount” FROM “BlocklistHash” GROUP BY “BlocksetID”) “G” ON “N”.“BlocksetID” = “G”.“BlocksetID” WHERE “N”.“BlockCount” > 1) WHERE “BlocklistHashCountExpected” != “BlocklistHashCountActual”)
27 Oct 2018 11:30: ExecuteScalarInt64: SELECT Count(*) FROM (SELECT DISTINCT “BlocksetID”, “Index” FROM “BlocklistHash”) took 00:00:00.005
27 Oct 2018 11:30: Starting - ExecuteScalarInt64: SELECT Count(*) FROM (SELECT DISTINCT “BlocksetID”, “Index” FROM “BlocklistHash”)
27 Oct 2018 11:30: ExecuteScalarInt64: SELECT Count(*) FROM “BlocklistHash” took 00:00:01.589
27 Oct 2018 11:30: Starting - ExecuteScalarInt64: SELECT Count(*) FROM “BlocklistHash”
27 Oct 2018 11:30: ExecuteReader: SELECT “CalcLen”, “Length”, “A”.“BlocksetID”, “File”.“Path” FROM (
SELECT
“A”.“ID” AS “BlocksetID”,
IFNULL(“B”.“CalcLen”, 0) AS “CalcLen”,
“A”.“Length”
FROM
“Blockset” A
LEFT OUTER JOIN
(
SELECT
“BlocksetEntry”.“BlocksetID”,
SUM(“Block”.“Size”) AS “CalcLen”
FROM
“BlocksetEntry”
LEFT OUTER JOIN
“Block”
ON
“Block”.“ID” = “BlocksetEntry”.“BlockID”
GROUP BY “BlocksetEntry”.“BlocksetID”
) B
ON
“A”.“ID” = “B”.“BlocksetID”

) A, “File” WHERE “A”.“BlocksetID” = “File”.“BlocksetID” AND “A”.“CalcLen” != “A”.“Length” took 00:00:24.012
27 Oct 2018 11:29: Starting - ExecuteReader: SELECT “CalcLen”, “Length”, “A”.“BlocksetID”, “File”.“Path” FROM (
SELECT
“A”.“ID” AS “BlocksetID”,
IFNULL(“B”.“CalcLen”, 0) AS “CalcLen”,
“A”.“Length”
FROM
“Blockset” A
LEFT OUTER JOIN
(
SELECT
“BlocksetEntry”.“BlocksetID”,
SUM(“Block”.“Size”) AS “CalcLen”
FROM
“BlocksetEntry”
LEFT OUTER JOIN
“Block”
ON
“Block”.“ID” = “BlocksetEntry”.“BlockID”
GROUP BY “BlocksetEntry”.“BlocksetID”
) B
ON
“A”.“ID” = “B”.“BlocksetID”

) A, “File” WHERE “A”.“BlocksetID” = “File”.“BlocksetID” AND “A”.“CalcLen” != “A”.“Length”
27 Oct 2018 11:29: ExecuteReader: SELECT “Key”, “Value” FROM “Configuration” took 00:00:00.000
27 Oct 2018 11:29: Starting - ExecuteReader: SELECT “Key”, “Value” FROM “Configuration”
27 Oct 2018 11:29: ExecuteReader: SELECT “Key”, “Value” FROM “Configuration” took 00:00:00.016
27 Oct 2018 11:29: Starting - ExecuteReader: SELECT “Key”, “Value” FROM “Configuration”
27 Oct 2018 11:29: ExecuteScalarInt64: SELECT COUNT(*) FROM “Block” WHERE “Size” > ? took 00:08:58.471
> *** why did 9 minutes pass where nothing was happening?

27 Oct 2018 11:20: Starting - ExecuteScalarInt64: SELECT COUNT(*) FROM “Block” WHERE “Size” > ?
27 Oct 2018 11:20: ExecuteReader: SELECT “Key”, “Value” FROM “Configuration” took 00:00:00.000
27 Oct 2018 11:20: Starting - ExecuteReader: SELECT “Key”, “Value” FROM “Configuration”
27 Oct 2018 11:20: ExecuteReader: SELECT “Key”, “Value” FROM “Configuration” took 00:00:00.064
27 Oct 2018 11:20: Starting - ExecuteReader: SELECT “Key”, “Value” FROM “Configuration”
*** don’t know what it is doing next

*** till here it has been updating its records of what it is doing
27 Oct 2018 11:20: ExecuteScalarInt64: INSERT INTO “Operation” (“Description”, “Timestamp”) VALUES (?, ?); SELECT last_insert_rowid(); took 00:00:00.484
27 Oct 2018 11:20: Starting - ExecuteScalarInt64: INSERT INTO “Operation” (“Description”, “Timestamp”) VALUES (?, ?); SELECT last_insert_rowid();

*** till here it has been checking the paths that I’ve asked it to backup to make sure it doesn’t backup things twice
27 Oct 2018 11:20: Removing source “C:\Users\smw.STEVEW10HP360\Videos” because it is a subfolder of “C:\Users\smw.STEVEW10HP360”, and using it as an include filter
27 Oct 2018 11:20: Removing source “C:\Users\smw.STEVEW10HP360\Pictures” because it is a subfolder of “C:\Users\smw.STEVEW10HP360”, and using it as an include filter
27 Oct 2018 11:20: Removing source “C:\Users\smw.STEVEW10HP360\Music” because it is a subfolder of “C:\Users\smw.STEVEW10HP360”, and using it as an include filter
27 Oct 2018 11:20: Removing source “C:\Users\smw.STEVEW10HP360\Documents” because it is a subfolder of “C:\Users\smw.STEVEW10HP360”, and using it as an include filter
27 Oct 2018 11:20: Removing source “C:\Users\smw.STEVEW10HP360\Desktop” because it is a subfolder of “C:\Users\smw.STEVEW10HP360”, and using it as an include filter
27 Oct 2018 11:20: Removing source “C:\Users\smw.STEVEW10HP360\AppData\Roaming” because it is a subfolder of “C:\Users\smw.STEVEW10HP360”, and using it as an include filter
27 Oct 2018 11:20: Starting - Running Backup
*** Backup started

TL;DR The database query shown was happening to check block sizes. There might have been other factors slowing it down; however, analyzing a seeming performance problem will require investigation, if you’re willing.

Are you familiar with SQLite database performance? I’ve done very little myself, but your question was asked between the start and timed end of a Structured Query Language (SQL) statement that’s running and taking possibly longer than it should (depending on your backup size and other factors). Seeing that is helpful to go deeper to (try to) identify the source code running it, to possibly address your question about what it’s doing.

https://github.com/duplicati/duplicati/blob/v2.0.3.3-2.0.3.3_beta_2018-04-02/Duplicati/Library/Main/Database/LocalDatabase.cs#L723 is my guess, in a routine named GetBlocksLargerThan(), which is then called from https://github.com/duplicati/duplicati/blob/3ec3dc649953c3de2d9595428ba4189f6c86aa4d/Duplicati/Library/Main/Utility.cs#L169 as a check to see whether it finds any blocks that are too large, and if so try to say that:

“You have attempted to change the block-size on an existing backup, which is not supported. Please configure a new clean backup if you want to change the block-size.”

I guess if one were to assign a few words to report to the user, it might say something like “Checking block sizes”, however that doesn’t do much to explain why it’s doing that for so long. That’s a performance question going way beyond Duplicati to everything else that might be happening on your system at the time of the slow spot. Yours sounds reliably slow, which is good, because mine jumps all around, perhaps based on system loading, and perhaps especially on disk loading, because this query “seems” like it’s primarily doing I/O. Task Manager’s “Performance” tab can show all such loads. There’s got to be some reason why it’s slow. How big is this backup? How big is the local database for this backup? You can click “Database” for the job to see the database path.
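If you want to try reproducing the timing yourself, one low-effort approach is to run the logged query through Python’s built-in sqlite3 module. The sketch below uses a toy in-memory Block table so it runs anywhere; the table shape, the 102400 parameter (100KB, which I believe is the default block size), and the idea of swapping in the real job database path are all my assumptions:

```python
import sqlite3
import time

# Toy in-memory stand-in for the Duplicati job database; replace
# ":memory:" with the real database path (job -> Database) to test for real.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "Block" ("ID" INTEGER PRIMARY KEY, "Size" INTEGER)')
conn.executemany('INSERT INTO "Block" ("Size") VALUES (?)',
                 [(102400,)] * 1000 + [(204800,)] * 3 + [(4096,)] * 10)
conn.commit()

# The query from the profiling log, with ? bound to the configured
# block size in bytes (102400 bytes = 100KB).
start = time.monotonic()
count = conn.execute(
    'SELECT COUNT(*) FROM "Block" WHERE "Size" > ?', (102400,)
).fetchone()[0]
elapsed = time.monotonic() - start

print(f"{count} oversized blocks; query took {elapsed:.3f}s")
```

If the real database also takes minutes for this one statement, that points at the query (or the disk it lives on) rather than anything else Duplicati is doing.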

Looking up GetBlocksLargerThan in the forum posts did find one other slow case here, with some discussion about different ways to do the SQL query. I think optimization is one reason why timing of operations is done.

Stuck on “Starting …” might sound all too familiar to you. It had slow queries, much work, and no tidy ending. This was on a VM (raising some more questions). Is yours directly on a computer that’s typically plenty fast?

Heading in a completely different direction, your observation on system file exclusion might be a CPU issue.
Filter group has incredibly bad performance
Filter groups code is extremely slow #3395
I’m not familiar with this code (I’m just a user like most people trying to help on this forum) so I’ll stop there…

Thanks @ts678!

Yes, I am willing to spend some time investigating this, but I’ve never done anything with database performance and have very little experience with SQL/SQLite. However, I’m going to have to come back to this in a week or so: I’m in the middle of deploying Duplicati for about 100 users right now, so I don’t have time to dig in yet, and I’m out of the office for the next week in any case.

As I’m able, I will work through the things you’ve raised in your very helpful message. Just a few notes for now:

  • “Checking block sizes” doesn’t really tell an end user much. It might be informative for a programmer, but what we should really be telling end users is the answer to “why is it checking block sizes?” I’m not sure why it does that check but, guessing from your comments, I wonder if it is “checking existing backup integrity”? I say that because a discrepancy in block sizes would compromise the integrity of the backup. Something along those lines might be a more helpful message to a user at this point.
  • I’m using a Core i5 computer. It is pretty fast, though I think my email client can sometimes bog it down a bit, as I think I need to clean out my historic emails! I’m not using a VM.

Thanks!

Thanks for being willing to help ferret out this issue!

When you get back to it I’d suggest starting by manually running that slow SQL against your database and see if it always takes 9 minutes.

The ? in the SQL should be your block size in bytes, I think. So if you’re using the default 100KB (102400 byte) block size, the SQL would look like

SELECT COUNT(*) FROM "Block" WHERE "Size" > 102400

If you get constantly slow times, we can try potentially faster options like:

SELECT COUNT("Size") FROM "Block" WHERE "Size" > 102400

Or maybe a more correct version of the following, since the current code doesn’t handle the scenario where a user might try a SMALLER block size (of course, I’m guessing on whether or not this is “correct”):

SELECT COUNT(*) FROM "Block" WHERE "Size" != 102400

A more aggressive rewrite using an EXISTS clause might perform better; if not, then adding an index should certainly speed things up.
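To make those two suggestions concrete, here’s a sketch against a toy Block table showing both an index on Size and an EXISTS-style probe. The index name, and the idea that Duplicati’s caller only needs a yes/no answer, are my assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "Block" ("ID" INTEGER PRIMARY KEY, "Size" INTEGER)')
conn.executemany('INSERT INTO "Block" ("Size") VALUES (?)',
                 [(102400,)] * 100 + [(204800,)] * 2)

# An index on Size lets SQLite answer the range predicate from the
# index alone instead of scanning the whole table.
conn.execute('CREATE INDEX "BlockSizeIndex" ON "Block" ("Size")')

count = conn.execute(
    'SELECT COUNT(*) FROM "Block" WHERE "Size" > ?', (102400,)
).fetchone()[0]

# EXISTS stops at the first matching row - enough if the caller only
# needs to know whether ANY oversized block exists at all.
any_oversized = conn.execute(
    'SELECT EXISTS(SELECT 1 FROM "Block" WHERE "Size" > ?)', (102400,)
).fetchone()[0]

print(count, bool(any_oversized))  # 2 True
```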

I’m available to help guide, but don’t have the time (or slow database) to do testing myself. :frowning: