Calculating backup use on an S3 destination, and cleaning it up

I have a backup with a Wasabi S3 destination and I’m trying to figure out why it’s taking up so much space. The job shows the following:

Source:
80.896 GiB

Backup:
1.049 TiB / 20 Versions

Retention: 1W:1D,1M:1W,1Y:1M

I have another job with basically the same source files and it shows:

Source:
20.945 GiB

Backup:
80.082 GiB / 28 Versions

Retention: 1W:1D,1M:1W,2Y:1M

How can I find out why the Wasabi backup is taking up so much more space for a lower retention, and then how best to clean it up? I know the source is larger, which I also need to work out as they should be the same, but surely that isn’t enough to make the backup 1TB vs 80GB.

I’ve had issues in the past with this particular backup so it’s quite possible there is leftover data being stored, but I’m hoping I can remove just the excess rather than simply starting again.

How can that be if one is four times the size of the other?
Also, if the source was ever reduced, the older backups were made by the older configuration.

Even if “basically the same” means the same type of files, the nature of the changes matters.
Duplicati uploads changes, so big changes → big upload.
The job log’s Complete log can show the BytesUploaded.

Sometimes some of this is from Compact (look in job log).
If Compact is turned off (or made less aggressive), old waste will be kept around.
Check the Advanced options if you suspect bad settings.

Log at Verbose level will give some info about compacting:

2025-01-28 07:27:25 -05 - [Verbose-Duplicati.Library.Main.Database.LocalDeleteDatabase-WastedSpaceVolumes]: Found 31 volume(s) with a total of 11.73% wasted space (1.73 GiB of 14.72 GiB)

--threshold (Integer): The maximum wasted space in percent
As files are changed, some data stored at the remote destination may not
be required. This option controls how much wasted space the destination
can contain before being reclaimed. This value is a percentage used on
each volume and the total storage.
* default value: 25

In that example, 11.73% is below the default 25%, so wasted space alone won’t trigger a compact, and there are other reasons for compact. Be especially sure to not be using

--no-auto-compact (Boolean): Disable automatic compacting
If a large number of small files are detected during a backup, or wasted
space is found after deleting backups, the remote data will be compacted.
Use this option to disable such automatic compacting and only compact
when running the compact command.
* default value: false
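
If you want to see where any waste actually sits, the job’s local database can be queried (browse a copy, to be safe). This is only a sketch: it assumes the DeletedBlock and Remotevolume tables are what track no-longer-needed blocks and the remote files, which may not match the internals exactly.

-- Rough wasted space per remote volume, biggest share first (assumptions above)
SELECT rv."Name",
       rv."Size" AS "VolumeBytes",
       SUM(db."Size") AS "WastedBytes",
       ROUND(100.0 * SUM(db."Size") / rv."Size", 2) AS "WastedPct"
FROM "DeletedBlock" db
JOIN "Remotevolume" rv ON rv."ID" = db."VolumeID"
WHERE rv."Size" > 0
GROUP BY rv."ID"
ORDER BY "WastedPct" DESC
LIMIT 20;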

Thank you for the advice. I re-ran the backup with verbose logging and the compact function does seem to be enabled, but it’s only basing it on a very small percentage of the files in the S3 folder:

2025-03-05 07:57:01 +01 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-StartCheck]: Start checking if backups can be removed
2025-03-05 07:57:01 +01 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-FramesAndIntervals]: Time frames and intervals pairs: 7.00:00:00 / 1.00:00:00, 31.00:00:00 / 7.00:00:00, 365.00:00:00 / 31.00:00:00
2025-03-05 07:57:01 +01 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-BackupList]: Backups to consider: 05/03/2025 05:07:04, 04/03/2025 05:06:56, 03/03/2025 03:06:39, 01/03/2025 03:06:38, 27/02/2025 03:06:36, 20/02/2025 03:06:41, 13/02/2025 03:06:09, 05/02/2025 03:06:02, 29/01/2025 03:04:49, 28/12/2024 03:12:56, 26/11/2024 03:23:26, 30/10/2024 06:36:09, 25/10/2024 12:56:49, 10/09/2024 06:07:18, 01/09/2024 06:01:54, 04/08/2024 06:03:23, 03/07/2024 06:06:21, 01/06/2024 06:03:00, 29/04/2024 06:16:49, 13/04/2024 06:26:39, 22/03/2024 06:19:37
2025-03-05 07:57:01 +01 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-BackupsToDelete]: Backups outside of all time frames and thus getting deleted: 
2025-03-05 07:57:01 +01 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-AllBackupsToDelete]: All backups to delete: 
2025-03-05 07:57:50 +01 - [Information-Duplicati.Library.Main.Operation.DeleteHandler-DeleteResults]: No remote filesets were deleted
2025-03-05 07:59:21 +01 - [Verbose-Duplicati.Library.Main.Database.LocalDeleteDatabase-FullyDeletableCount]: Found 0 fully deletable volume(s)
2025-03-05 07:59:21 +01 - [Verbose-Duplicati.Library.Main.Database.LocalDeleteDatabase-SmallVolumeCount]: Found 10 small volumes(s) with a total size of 79.213 MiB
2025-03-05 07:59:21 +01 - [Verbose-Duplicati.Library.Main.Database.LocalDeleteDatabase-WastedSpaceVolumes]: Found 1335 volume(s) with a total of 10.31% wasted space (121.570 GiB of 1.152 TiB)

Do you think I should try a Recreate of the database, perhaps with verbose logging if I can figure that out? When I check the job log the values seem to match what I see on Wasabi so I’m not sure if that would even have an effect:

      "UnknownFileSize": 0,
      "UnknownFileCount": 0,
      "KnownFileCount": 33370,
      "KnownFileSize": 1162292137522,

Looks like this needs updating on the documentation site to avoid a deprecation warning: 2025-03-05 07:55:21 +01 - [Warning-Duplicati.Library.Main.Controller-DeprecatedOption]: The option --log-level has been deprecated: Use the options --log-file-log-level and --console-log-level instead.

Maybe because only a very small percentage exceeds the waste threshold?

Do you think the above explains it well enough? I’m not the ultimate expert, but it looks right to me.

Generally, drastic measures should come after safer ones such as those I suggested previously.

Please look through your logs, or post some or all if you prefer.

This could use clarification. Did you take unusual recovery steps that might have bad effects?

EDIT 1:

Please answer the other questions I asked too.

EDIT 2:

The Configuring logging section in the new manual does look wrong. Maybe the developer could correct it.

--log-file=<path to logfile>
--log-level=<loglevel>

I thought that’s what I did, including just the parts about the compact, so I now attach the full verbose log from an earlier run: Verbose Log - Pastebin.com

This was a backup where, during testing of the canary builds last summer, some of the remote files became corrupted, so after detecting which ones I removed them and allowed Duplicati to sort itself out. It’s all on the forum somewhere.

Regarding the “same source files”, 40GB vs 80GB of source files is a slightly smaller ratio than 121GB vs 1.1TB, but I’m not sure how it managed to get so large.

By job log, I mean Show log on the job’s home screen.
In the nicer top part, there is sometimes a section on Compact.
At the bottom is a harder-to-read Complete log for the details.

The first thing I wanted was to see the extent of change per version.
20 versions of a heavily changing 80 GB source could certainly reach 1 TB.
Possibly not likely here, but I have no way of knowing your usage.

These are ways to explore the history; the current verbose log won’t.
We’re looking at a broad size question, and a verbose log is mostly details.

Maybe 40 means 20?

Since the sentence talks about source files, does “it” mean that the 80 is the oddly high value?

Yeah, the 40 should be 20, just a typo. The 80GB is not especially high; it was mainly the ratio to 1.1TB that got me concerned, as I’m trying to reduce my monthly costs and store the least I can at Wasabi.

Perhaps the extra sources really are that fluid. It’s daily backups of my Proxmox hosts’ operating system settings, as I didn’t want to install a full Duplicati server on each one, especially back in the bad old days of “mono”. I’m using UrBackup for those to keep it to just agents on the hosts, and I back up what it labels the current folder each day. As they are literally the same files each day, I didn’t think it would cause so many versions. Perhaps I need to rethink these backups.

I thought I would see what would happen if I set the compact threshold down to 10% and it’s currently busy manipulating the remote files, with lots of downloads and uploads.

You’d have to look. If you’re really into deep analysis, there’s a tool. I haven’t run it yet though.

Duplicati BackupExplorer

For a simpler log-based approach:

BytesUploaded is mostly changes. Until a compact gets triggered, the uploads sit at the Destination.

KnownFileSize will probably fluctuate: up as uploads happen, then down when Compact kicks in.
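
If you’d rather cross-check from the database side, something like the query below should land close to KnownFileSize. It’s a hedged sketch that assumes the usual Remotevolume table and state names.

-- Remote files and bytes the local database knows about, per remote file type
SELECT "Type", COUNT(*) AS "Files", SUM("Size") AS "Bytes"
FROM "Remotevolume"
WHERE "State" NOT IN ('Deleting', 'Deleted')
GROUP BY "Type";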

Thanks, I’ll take a look at that tool and report anything interesting.

I’m not certain I follow the chain and don’t know UrBackup file output formats, but backing up backups might find huge changes from one to the next (unlike if backing up original sources).

You’ve now got multiple ways to check.

UrBackup was something I was testing to replace Veeam as it can do image backups, but I wasn’t completely happy with it at the time, I think because of the way it handled Hyper-V. I didn’t completely abandon UrBackup, though, as it can also do file-based backups, so I used it for my Proxmox hosts to back up some configuration files in case of disaster. UrBackup stores files uncompressed, so I can let Duplicati take care of that, but it creates “generational” backups using separate folders and file links. An example for one of my hosts looks like this:

Each folder holds just the changed files, with the rest being links to the base backup. At least that’s how I understand it all works. I also back up the UrBackup database folder, which is about 1GB in size.

As you can see here, they don’t use up much space:

I can definitely see where some of the space is going, but somehow it still feels quite high. The compact I ran at 10% did complete and reduced the overall storage at Wasabi down to around 950GB - and incidentally triggered an alert for “Unusual Egress Activity”, which was expected, and it’s nice to know they monitor such things.

Didn’t work; I think it’s because the database format has changed since the last update to the tool. I will log an issue on GitHub and see if the developer is still maintaining it. I really want to see what it can generate from my backups.

In the meantime, you can get DB Browser for SQLite (a.k.a. sqlitebrowser) to look around.
The database can be opened read-only, and for even more safety, browse a copy, not the original.

Simple analysis is pretty easy. For example, you can find the largest known file with a single click.
Sometimes sparse files can blow things up. You could also use du or find on the Source area.
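
Roughly the same single-click idea in SQL, as a sketch assuming a File view with a Path column joined to Blockset lengths (same tables as the example below):

-- Largest files Duplicati knows about, across all versions.
-- A path can repeat if its contents changed between versions.
SELECT "File"."Path", "Blockset"."Length"
FROM "File"
JOIN "Blockset" ON "Blockset"."ID" = "File"."BlocksetID"
ORDER BY "Blockset"."Length" DESC
LIMIT 10;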

Finding the size of a given backup version is a little more, but not awful. An example of SQL:

SELECT sum("Blockset"."Length")
FROM "FilesetEntry"
JOIN "File" ON "File"."ID" = "FilesetEntry"."FileID"
JOIN "Blockset" ON "Blockset"."ID" = "File"."BlocksetID"
WHERE "FilesetEntry"."FilesetID" = 562

The 562 is an ID from my Fileset table. Yours will be different. Timestamp is UNIX epoch time.
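
If you’d rather see every version at once instead of plugging in one FilesetID at a time, a sketch along the same lines (same schema assumptions as above):

-- Logical size of each backup version, oldest first
SELECT "Fileset"."ID",
       datetime("Fileset"."Timestamp", 'unixepoch') AS "BackupTime",
       SUM("Blockset"."Length") AS "Bytes"
FROM "Fileset"
JOIN "FilesetEntry" ON "FilesetEntry"."FilesetID" = "Fileset"."ID"
JOIN "File" ON "File"."ID" = "FilesetEntry"."FileID"
JOIN "Blockset" ON "Blockset"."ID" = "File"."BlocksetID"
GROUP BY "Fileset"."ID"
ORDER BY "Fileset"."Timestamp";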

EDIT:

I’m wishing that the FIND command could give a total in addition to its individual file sizes.
Still, one can find big files by using it and looking for some string like " GB)" in the output.

It is a bit hard to figure out, but there is a community tool to do just that:

Already suggested, tried, failed, issue filed on it (it doesn’t like the DB version number).

Other suggestions have not been tried yet (or at least not as posted), but are cruder.
I was tempted to try to write SQL to compare blockset similarities. This backup is big enough (and I don’t know the blocksize) that any block calculation might take a while.
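
In case it helps, this is the rough shape of what I had in mind. It’s only a sketch: it assumes the usual Block and BlocksetEntry tables, uses placeholder FilesetID values, and could run for a long while on a backup this size.

-- Bytes of data blocks one version references that an older version does not
WITH newer AS (
  SELECT DISTINCT be."BlockID"
  FROM "FilesetEntry" fe
  JOIN "File" f ON f."ID" = fe."FileID"
  JOIN "BlocksetEntry" be ON be."BlocksetID" = f."BlocksetID"
  WHERE fe."FilesetID" = 2   -- placeholder: a newer ID from your Fileset table
),
older AS (
  SELECT DISTINCT be."BlockID"
  FROM "FilesetEntry" fe
  JOIN "File" f ON f."ID" = fe."FileID"
  JOIN "BlocksetEntry" be ON be."BlocksetID" = f."BlocksetID"
  WHERE fe."FilesetID" = 1   -- placeholder: an older ID from your Fileset table
)
SELECT SUM(b."Size") AS "BytesOnlyInNewer"
FROM "Block" b
WHERE b."ID" IN (SELECT "BlockID" FROM newer)
  AND b."ID" NOT IN (SELECT "BlockID" FROM older);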

I ran your example SQL against each fileset in the database, and it seems one of them stands out quite a bit:

36	80197200592
41	116567712463
47	78053358331
75	86945171176
108	77247919733
139	86078208257
154	93356244250
161	96444699877
170	86850846374
173	86861577985
175	86779132430
179	90403428253
181	86937216560
186	89566286076
189	89204442430

How do I figure out what’s in fileset 41 taking up the extra space?

Block size is 100MB, although I think I increased this from 50MB many months ago

It stands out visually because it’s in the low three digits of GB instead of the high two digits.

From the OP this should be either 20 or 28, but there are fewer here. Was that expected? Regardless, the highest ID here is the newest, so version 0 on (e.g.) the Restore dropdown.
You can check the date there against the Timestamp run through epochconverter.com.

After mapping an ID to its externally used version number, you can do a find as described above.
The COMPARE command might also be helpful to show changes between versions.
If output gets truncated, you can add the full-result option to get all of the output.
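
If the command line route is awkward, a rough SQL analogue of that comparison, using 41 and the 36 just before it from your list (same schema assumptions as the earlier query, plus a Path column on the File view):

-- Entries in fileset 41 whose contents are new or changed relative to 36,
-- largest first. Identical content reuses the same Blockset row, so matching
-- Path plus BlocksetID is treated as unchanged.
SELECT f41."Path", b41."Length"
FROM "FilesetEntry" fe41
JOIN "File" f41 ON f41."ID" = fe41."FileID"
JOIN "Blockset" b41 ON b41."ID" = f41."BlocksetID"
WHERE fe41."FilesetID" = 41
  AND NOT EXISTS (
    SELECT 1
    FROM "FilesetEntry" fe36
    JOIN "File" f36 ON f36."ID" = fe36."FileID"
    WHERE fe36."FilesetID" = 36
      AND f36."Path" = f41."Path"
      AND f36."BlocksetID" = f41."BlocksetID")
ORDER BY b41."Length" DESC
LIMIT 25;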

I don’t know which backup this is, but none of the versions alone adds up to the OP’s 1 TB. Earlier there was the idea of large version-to-version change, with little evidence either way so far.

If the developer has an idea beyond compact and this sort of check, it’d be welcome too.

Yes, expected: I really needed to cut this backup down as it was a big part of the doubling of my storage fees at Wasabi, so I reduced the retention from 1W:1D,1M:1W,1Y:1M to 1W:1D,1M:1W,6M:1M, which halved the amount stored. Of course this will take a month or two to actually reduce my costs, but the total is down to 646GB now.

I’m currently going through the results of comparing each fileset with the one that follows, trying to see if anything sticks out. Will let you know if anything does, although I did already see some “caching” folders that I added to the exclusions.