Zip-compression-zip64 clarification

JonMikelV · April 17, 2018, 3:21am

The --zip-compression-zip64 setting is described as follows:

--zip-compression-zip64
The zip64 format is required for files larger than 4GiB, use this flag to toggle it
Default value: “False”

Does this mean it’s needed for SOURCE files larger than 4GB or DESTINATION (dblock / Upload volume size) files larger than 4GB? Also, is there a reason it’s NOT enabled by default?

Either way, we should consider updating the parameter description to clarify.

kenkendk · April 17, 2018, 7:37am

That option is fairly technical… It is only required if a single file inside the zip archive is greater than 4GiB uncompressed. It does not matter how large the zip file is, so you can store many 2GiB files and have a 4TiB zip file without the flag.

It was added because a user reported issues with the dlist exceeding 4GiB (i.e. there were a lot of files).

Enabling the flag by default stores a bit of extra header information and makes the zip file use the Zip64 specification. IIRC it adds around 16 bytes per file (i.e. block from Duplicati) in the zip archive. I think that overhead is unwanted for most users, but I would not object to making it default-on.

JonMikelV · April 17, 2018, 3:19pm

So roughly we’re looking at something likely between 8 KB and 10 KB extra per dlist zip file if it’s enabled by default?

On one hand I’d say that small a size difference is likely to essentially disappear when file system block sizes start coming into play, but then again it’s not like there are tons of reports about the issue.

If it’s not already there, maybe it would be enough to update the error associated with attempting 4GB+ files in non-zip64 archives to suggest to the user that the parameter should be added.

Hmm… now that I type that out it seems easier just to default it to on.

kees-z · April 17, 2018, 7:16pm

The size of the DLIST files is not my biggest concern. We’re talking about very rare scenarios. I read this on Wikipedia:

The original .ZIP format had a 4 GiB limit on various things (uncompressed size of a file, compressed size of a file and total size of the archive), as well as a limit of 65535 entries in a ZIP archive. In version 4.5 of the specification (which is not the same as v4.5 of any particular tool), PKWARE introduced the “ZIP64” format extensions to get around these limitations, increasing the limitation to 16 EiB (2^64 bytes). In essence, it uses a “normal” central directory entry for a file, followed by an optional “zip64” directory entry, which has the larger fields.
The File Explorer in Windows XP does not support ZIP64, but the Explorer in Windows Vista does. Likewise, some extension libraries support ZIP64, such as DotNetZip, QuaZIP and IO::Compress::Zip in Perl. Python’s built-in zipfile supports it since 2.5 and defaults to it since 3.4. OpenJDK’s built-in java.util.zip supports ZIP64 from version Java 7. Android Java API support ZIP64 since Android 6.0. Mac OS Sierra contains a broken implementation of creating ZIP64 archives using the graphical Archive Utility.

So the limitations are:

1. Size of a file inside the .ZIP archive (normal and compressed size)
2. Number of files inside one .ZIP archive
3. File size of the .ZIP archive

I guess only limitations 2 and 3 could apply to Duplicati in some rare situations, because the block size varies from a few KB to one or more MB.

Limitation 3 can only apply to the DLIST files. DBLOCK files have a fixed size of 50 MB by default, and in some situations the user defines a size of 1 or 2 GB. More than 4 GB is not advisable, because restore and compact operations would take way too much bandwidth.
Assume there are a lot of nested subfolders in the backup source and the average source file name is about 150 characters long (including path). To exceed the 4 GB limitation, you need to back up about 20 million source files. I assume this is a very rare situation (millions of files, very long paths).

Limitation 2 can be a problem when using a large archive size and a small block size. If we choose the largest size for a standard ZIP file (4GB), using the default block size (100 KB) and assuming a compression ratio of 50%, there could be more than 80000 blocks in a single archive, which exceeds the 65535 limitation. Also a very rare scenario, but not impossible.
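
As a rough check of that estimate, the arithmetic can be sketched in Python. The 4 GiB archive size, 100 KiB block size and 50% compression ratio are the figures from the paragraph above; the function name is made up for illustration, and per-entry zip header overhead is ignored, so treat the result as an approximation.

```python
ZIP_MAX_ENTRIES = 65535  # entry-count limit of the original zip format


def blocks_per_archive(archive_bytes, block_bytes, compression_ratio):
    """Approximate number of compressed blocks that fit in one archive."""
    compressed_block_bytes = block_bytes * compression_ratio
    return int(archive_bytes // compressed_block_bytes)


# 4 GiB archive, 100 KiB blocks, blocks compress to half their size
count = blocks_per_archive(4 * 1024**3, 100 * 1024, 0.5)
print(count, count > ZIP_MAX_ENTRIES)
```

With binary units this gives a little under 84,000 blocks; with decimal units (4 GB, 100 KB) it gives 80,000. Either way the count is above 65535.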

I thought I remembered that Zip64 files can’t be opened by Windows Explorer, so you would need a third party tool (WinZip, 7-Zip) to open the files. That would be a drawback for defaulting to Zip64.
But this is fixed in Vista, and all OSes that can run Duplicati seem to support Zip64 natively.

Long story short, I can’t think of any reason to not use Zip64 by default, but there is no urgent need to do this, because the limitations do not apply to Duplicati backup files, except for some very rare situations.

lankanmon · April 17, 2018, 7:36pm

I think this is a relevant issue though, as I myself have had this issue before, and had @JonMikelV ask me to enable that setting by default. The main issue that I have is that my dataset is quite old (long term backup) and I am not quite sure about the sizes of the files contained within (there are 100s of folders and subfolders). I have resorted to using that flag on every one of my backups as I do not want to risk it happening again (say I add a large file in the future). Initially, I had my backup running for days before it errored out. Is there any way for Duplicati to detect such an issue early on and inform the user to use that flag (maybe during initial indexing)? That would save a lot of wasted time.

kees-z · April 17, 2018, 7:55pm

As far as I know, the size of any source file is irrelevant in relation to the limitations of Zip files.
Before data is archived, Duplicati breaks the source files into small chunks of 100 KB (or another value that is set by the user, but never more than a few MB).
So backing up a 20GB source file results in a Zip file containing a bunch of 100 KB files. The large source file has already been split up in small files before it is stored in the Zip file.
The only limitations are the size of the upload volume (not more than 4 GB), the number of block files in one upload volume (not more than 65535) and the size of the file list in the DLIST file (not more than 4 GB).
On the other hand, I don’t think defaulting to Zip64 can do any harm to existing backups, apart from potential issues when combining Zip and Zip64 in existing backups.

samw · April 18, 2018, 8:43am

No, it’s only if you want to set your DBLOCK files at 4GB or higher. Source files are broken down and stored in a zip.

kenkendk · April 18, 2018, 7:26pm

Actually, only limitation (1) applies for the use in Duplicati.

Since Duplicati stores blocks, it can handle approximately --dblock-size=4GiB, and as mentioned elsewhere it does not matter what the source file sizes are, as they are split into blocks.

Limitations (2) and (3) are handled automatically, because the zip library automatically detects that one of the limits is exceeded and writes a zip64 header (at the end of the file). In other words, the zip archive is automatically upgraded to zip64 if required.
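
The same kind of automatic upgrade can be observed with Python’s built-in zipfile module. This is only a stand-in to illustrate the idea, not the zip library Duplicati uses; the entry names and helper functions are made up for the example. It writes more than 65535 empty entries and then looks for the ZIP64 end-of-central-directory signature near the end of the archive.

```python
import io
import zipfile

# Signature of the ZIP64 end-of-central-directory record ("PK\x06\x06")
ZIP64_EOCD_SIGNATURE = b"PK\x06\x06"


def build_archive(entry_count):
    """Build an in-memory zip with `entry_count` empty, uncompressed entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED,
                         allowZip64=True) as zf:
        for i in range(entry_count):
            zf.writestr(f"block-{i:06d}", b"")
    return buf.getvalue()


def has_zip64_end_record(data):
    """Check the tail of the archive for a ZIP64 end record."""
    return ZIP64_EOCD_SIGNATURE in data[-128:]


small = build_archive(10)       # well under the 65535-entry limit
large = build_archive(70_000)   # over the limit

print(has_zip64_end_record(small), has_zip64_end_record(large))
```

Nothing is decided up front here: the writer only adds the ZIP64 records when it closes the archive and sees that the entry count no longer fits the classic end record.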

Unfortunately this cannot be done for the individual streams (aka files in the archive) because the header is written before the content, thus we do not know in advance if the stream will need to store larger values. If the stream is bigger than 4GiB, we need to go back and expand the stream header, essentially “moving” 4GiB of data (expensive operation). The zip library now throws an exception if the limit is exceeded (old versions just overflowed the counter, writing broken zip archives!).

We could handle this by catching the exception, enabling the zip64 flag, and retrying the operation. It would take some time to do the retry, but it is not difficult to implement if we want it.
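
The catch-and-retry idea might look roughly like the following Python sketch. Everything in it is hypothetical: write_entry is a stand-in for a zip writer that throws when a non-zip64 entry grows past its limit, and the tiny limit in the usage lines exists only so the example runs quickly. It is not Duplicati code.

```python
class EntryTooLargeError(Exception):
    """Raised by the stand-in writer when a non-zip64 entry outgrows its limit."""


def write_entry(make_chunks, use_zip64, limit):
    """Hypothetical stand-in for streaming one entry into a zip archive.

    Returns the number of bytes written. Without zip64, exceeding `limit`
    aborts the write, mimicking a library that throws instead of overflowing.
    """
    written = 0
    for chunk in make_chunks():
        written += len(chunk)
        if not use_zip64 and written > limit:
            raise EntryTooLargeError(f"entry exceeds {limit} bytes")
    return written


def write_with_retry(make_chunks, limit=2**32 - 1):
    """Try a plain entry first; on overflow, redo the whole write with zip64.

    Returns (bytes_written, zip64_was_needed).
    """
    try:
        return write_entry(make_chunks, use_zip64=False, limit=limit), False
    except EntryTooLargeError:
        # Everything streamed so far is discarded and produced again.
        return write_entry(make_chunks, use_zip64=True, limit=limit), True


calls = []


def make_chunks():
    calls.append(1)  # count how often the stream has to be produced
    return iter([b"x" * 8, b"y" * 8])


print(write_with_retry(make_chunks, limit=10), len(calls))
```

The call counter shows where the cost comes from: when the limit is hit, the source stream has to be generated a second time from the start.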

I don’t think so. The list of files is streamed into the archive to avoid storing it all in memory. We would need to build the entire file in memory (I guess 4GiB+ in your case), look at the size, and then activate zip64.

Pectojin · April 18, 2018, 7:34pm

If the zip64 module is only limited by (1), which only breaks with --dblock-size=>4GB, then we should just have a check on zip64 being enabled if dblock size is over 4GB.

Writing slow zip64 retry logic into the backup operation would lead to some people having slower backups without knowing why, which isn’t very nice.

kenkendk · April 18, 2018, 8:24pm

Yes, we could do that. I am not aware of any user having that large a block size though. The only place I have seen the problem is when the dlist file (which mostly comprises a single .json file) exceeds 4GiB, which is harder to detect in advance.

We could perhaps do something like sum(path_len + const_chars + hash_size + (hash_size * no_of_blocklist_hashes)). The const_chars is basically the overhead in json ({"path":"...","size":...,"hash":"..."}). Not sure how accurate we could make it, but maybe just add a buffer, so that people with more than 2GiB, or some other metric, have it turned on.
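
A minimal Python sketch of that estimate follows. The constants are assumptions for illustration: HASH_SIZE assumes base64-encoded SHA-256 hashes (44 characters), CONST_CHARS counts only the JSON skeleton shown above, and the digits of the size value are ignored, so the real dlist format may well differ.

```python
# JSON skeleton per entry; the digits of the size value are not counted
CONST_CHARS = len('{"path":"","size":,"hash":""}')
HASH_SIZE = 44              # assumption: base64-encoded SHA-256
ZIP64_BUFFER = 2 * 1024**3  # the 2 GiB safety buffer suggested above


def estimate_entry_bytes(path_len, blocklist_hashes):
    """path_len + const_chars + hash_size + hash_size * no_of_blocklist_hashes"""
    return path_len + CONST_CHARS + HASH_SIZE + HASH_SIZE * blocklist_hashes


def estimate_dlist_bytes(entries):
    """Sum the estimate over (path_len, blocklist_hashes) pairs."""
    return sum(estimate_entry_bytes(p, n) for p, n in entries)


def should_enable_zip64(entries):
    """Turn zip64 on once the estimate passes the safety buffer."""
    return estimate_dlist_bytes(entries) > ZIP64_BUFFER


# 20 million files with 150-character paths and no blocklist hashes
uniform_total = 20_000_000 * estimate_entry_bytes(150, 0)
print(uniform_total, uniform_total > 2**32)
```

Under these assumptions, 20 million files with 150-character paths come out slightly above 4 GiB, which is in line with the earlier back-of-the-envelope figure in this thread.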

… Or… just turn it on for the dlist file? Minimal overhead (I think there are 2-3 files in the archive), maximum effect?

Pectojin · April 18, 2018, 9:08pm

I think it’s fair enough to add the logic for the dlist. That’s one file per backup.

Edit: By the way, won’t we know the dlist file size before we write to the zip archive?

kenkendk · April 19, 2018, 9:08am

Nope, because the dlist file is essentially a compressed json file. The size of the json file depends on the contents (path lengths, sizes, hashes, etc).

kees-z · April 19, 2018, 9:47am

Cool! That makes the need for Zip64 even more rare; the only threshold is the size of the contents of the DLIST file.

Isn’t the size of the JSON file inside the Zip archive a limitation? Does a JSON of 5 GB that is compressed to 3 GB fit in a standard Zip archive? According to Wikipedia, also the uncompressed size of a file inside the archive must be <= 4 GB.
If that’s the case, Duplicati could switch to Zip64 if the (uncompressed) JSON exceeds 4 GB.

Aside from that, could any issues arise if Zip and Zip64 are mixed up at the backend?

Pectojin · April 19, 2018, 1:37pm

But we know precompressed size of the json, right?

Just make a quick check on the json before compressing and use zip64 on anything over 4GB. Should be fairly cheap to check. It’s already such an edge case that it won’t matter if it sometimes adds the header unnecessarily.

kenkendk · April 19, 2018, 8:14pm

Nope. The json is streamed directly into the zip archive, so it is not stored in memory. We can approximate it like I described, but it is not accurate, and prone to breaking.

Yes, if either compressed or uncompressed size is larger than 4GiB, it must use zip64.
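
The rule in the previous sentence can be written as a small Python check. The function name is made up, and the threshold assumes the classic 32-bit size fields, where the largest value (0xFFFFFFFF) is reserved as the marker meaning “look in the zip64 fields”.

```python
ZIP32_SIZE_MARKER = 0xFFFFFFFF  # largest 32-bit value, reserved as the zip64 marker


def entry_needs_zip64(compressed_size, uncompressed_size):
    """True if either size no longer fits a classic 32-bit size field."""
    return (compressed_size >= ZIP32_SIZE_MARKER
            or uncompressed_size >= ZIP32_SIZE_MARKER)


# The case asked about: a 5 GB json that compresses to 3 GB
print(entry_needs_zip64(3 * 10**9, 5 * 10**9))
```

The compressed size alone would fit, but the uncompressed size does not, so the entry still needs zip64.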

Theoretically there could be a zip tool that has trouble with a mix of zip and zip64 streams, but those I tested handle it nicely (it would be a weird implementation that fails). All tools also appear to handle a zip64 central header and zip(32) streams without problems.