Alternative FTP failure?

BlueBlock · August 18, 2019, 6:51pm

While testing I’ve found that the AlternativeFTP/FluentFTP is failing on the verify portion within the PutAsync method.

I’m not sure if the issue is with FluentFTP or the FTP server which I have setup locally using FileZilla Server.

What happens in PutAsync is that after a file is uploaded a listing of the ftp server is pulled. Except that the listing does not always contain the just uploaded file. Retries will occur but will all eventually fail for the same reason.

I can resolve the problem by just placing Thread.Sleep(3000) in between the upload and the verify portion.

Has anyone encountered this issue or inconsistent backups with AlternativeFTP?

ts678 · August 18, 2019, 8:22pm

Error: uploaded but the returned size was 0 was a recent forum report, which I found by looking for –disable-upload-verify, which I suggested after looking at the code. FTP has it too though. I don’t know why it’s better. One course of action might be to see if the newer FluentFTP code does better. It’s urgently needed anyway, because the current one breaks on current usage going to UNIX-based servers (changes the “line endings”).

EDIT:

FTP problem described here and it was one of the reasons there’s no beta yet – didn’t want this regression. Whether or not the latest FluentFTP actually fixes it is not 100% clear, so the earlier the tests of it, the better.

BlueBlock · August 18, 2019, 9:48pm

Thanks for the quick research! I’ll test ftp to see if it behaves like the aftp backend.

BlueBlock · August 18, 2019, 9:52pm

Is the alternativeftp supposed to replace the ftp backend once it works or are both to remain?

ts678 · August 19, 2019, 1:05am

AFAIK both are to remain. Sometimes one works and the other doesn’t (can go either way) so choice is good, plus yanking away things that currently exist and are used is not a good thing to do to the user base. Stats at:

https://usage-reporter.duplicati.com/

FTP usage is pretty high (despite all of FTP’s inherent flaws) and for whatever reason higher than Alternative.

You might have noticed that the report I cited used Filezilla Server too, and I was asking for info from its log…

BlueBlock · August 19, 2019, 10:27am

I just ran a simple console app using FluentFTP and FileZillaServer to test the interaction between them.

I was able to transfer with no problem. I’m thinking the issue not with those but Duplicati is somehow ending the transfer before it is finished since adding my 3 second delay also makes Duplicati transfer the files correctly.

With the other thread about the same type of hash mismatch but for SMB I’m now wondering if there is an overall issue of backend transfers being stopped earlier at times.

I need to look into it further.

BlueBlock · August 19, 2019, 9:23pm

(I’ve created a WIP PR in place for DEV discussion and at least a workaround, if not a permanent fix; PR3866)

I need assistance from other devs on this.

I can only replicate the problem when not in debug mode. This is AFTP against a local FileZillaServer.

So far I’ve tracked it to line 30 in BlockVolumeWriter.cs and m_compression will be null when the failure occurs.

The path of the code is that SpillCollectorProcess.cs line 116 will call AddBlock() in BlockVolumeWriter.cs and the m_compression will be null.

When run in debug mode or I place a 3 second delay on after the aftp PutAsync upload, then m_compression will not be null when/if the AddBlock() method gets called, depending on the spill.

p.s. I’m still trying to understand the whole ‘spill’ thing going on. If anyone is able to explain it I’d appreciate it.

Secondly, and maybe just as important a clue, before getting to the issue above I first was hitting an exception with a temp file missing. In TempFile.cs I commented out the .Delete(m_path); down in the Dispose and I was able to get to the exception outlined above. so there seems to be an issue with the temp files getting removed too early… possibly these issues are both realted to something ending too early and releasing an object(s).

ts678 · August 23, 2019, 5:02pm

I’m still trying to understand it too. Let me suggest some things which may or may not be actually correct.

# CoCoL - Concurrent Communications Library from the Duplicati lead developer is used lots in Duplicati:

github.com

duplicati/duplicati/blob/240ea225872cc24336363fde9ef64bec82ff8c76/Duplicati/Library/Main/Operation/Backup/Channels.cs#L31-L34


      
          /// <summary>

          /// When the backup completes, all in-progress archives are sent from the <see cref="DataBlockProcessor"/> to the <see cref="SpillCollectorProcess"/>

          /// </summary>

          public static readonly ChannelMarkerWrapper<SpillVolumeRequest> SpillPickup = new ChannelMarkerWrapper<SpillVolumeRequest>(new ChannelNameAttribute("SpillPickup"));

github.com

duplicati/duplicati/blob/240ea225872cc24336363fde9ef64bec82ff8c76/Duplicati/Library/Main/Operation/Backup/SpillCollectorProcess.cs#L39-L54


      
          /// <summary>

          /// This process just waits until all block processes are terminated

          /// and collects the non-written volumes.

          /// All remaining volumes are re-packed into one or more filled

          /// volumes and uploaded

          /// </summary>

          internal static class SpillCollectorProcess

          {

              public static Task Run(Options options, BackupDatabase database, ITaskReader taskreader)

              {

                  return AutomationExtensions.RunTask(

                  new

                  {

                      Input = Channels.SpillPickup.ForRead,

                      Output = Channels.BackendRequest.ForWrite,

                  },

github.com

duplicati/duplicati/blob/240ea225872cc24336363fde9ef64bec82ff8c76/Duplicati/Library/Main/Operation/Backup/DataBlockProcessor.cs#L128-L132


      
          // If we have collected data, merge all pending volumes into a single volume

          if (blockvolume != null && blockvolume.SourceSize > 0)

          {

              await self.SpillPickup.WriteAsync(new SpillVolumeRequest(blockvolume, indexvolume));

          }

The answer to why such a complex scheme was added (first beta was 2.0.4.5) is probably concurrency.
Lots of work gets parallelized, work doesn’t stop nice and even, so something must tidy up the leftovers:

github.com

duplicati/duplicati/blob/240ea225872cc24336363fde9ef64bec82ff8c76/Duplicati/Library/Main/Operation/BackupHandler.cs#L195-L202


      
          // Spawn additional block hashers

          .Union(

              Enumerable.Range(0, options.ConcurrencyBlockHashers - 1).Select(x => Backup.StreamBlockSplitter.Run(options, database, taskreader))

          )

          // Spawn additional compressors

          .Union(

              Enumerable.Range(0, options.ConcurrencyCompressors - 1).Select(x => Backup.DataBlockProcessor.Run(database, options, taskreader))

          )

I once began trying to trace the flow of things through channels and code, but haven’t had time in awhile…

If anybody knows of a good design document for that (also seeking one for transactions, please point to it).

BlueBlock · August 23, 2019, 5:06pm

Thanks. I’ve spent hours tracing the flow to try and understand how things are interacting. I’m thinking the AFTP issue is now resolved and that the errors in the spill portion were just a result of the AFTP backend not completing. I’m going to keep the ‘spill’ portion in mind though if we see issues with things like data corruption.

kenkendk · October 19, 2019, 8:55pm

The idea with spill is to allow concurrent compression, with libraries that does not support concurrent compression (i.e. zip files).

A large portion of the CPU time is used to compress data (it is inherently compute intensive). A simple fix would be to run multiple compressors in individual threads (as the files are independently compressed in a zip archive), but the zip library does not support it. We could add it, but it is not so easy to do, as the zip files need each compressed stream written sequentially.

To solve this, Duplicati simply compresses to multiple independent zip files. This has the upside that it works with any compression tool. Whenever a new block is added to the output, whichever zip file is ready first, will grab the block and add it to the output. Once the zip file is too large, it is sent to the uploader, and a new zip file is started.

However, once all blocks are uploaded, we have a situation where multiple zip files contains some blocks, but none are filled completely (or they would be uploaded). These partially filled zip volumes are sent to the spill collector process, which simply merges the partial files, producing as few zip files as possible (fill up until the size limit is reached).

tl;dr: The spill process kicks in at the end of the backup, and merges all partially filled compressed volumes.

I assume you mean database transactions. I do not have a document for it, but the strategy is that everything runs in a transaction (particularly useful for --dry-run). Before a volume starts uploading, the transaction is committed, such that the next run will not discover a “new” file on the backend. (This could happen if the upload starts, and Duplicati is hard-killed. The database will then be in the previous state, but the backend will have a partial file which does not exist in the database).

I wrote CoCoL as an academic project, and personally find that it solves concurrency better than any other multi-processing approach I have seen. I realize that it is somewhat foreign to other developers though, so here is a Duplicati-centric description of CoCoL:

The idea is to have a process which is more-or-less a method that runs independently (just a C# async Task). To remove all race-conditions, a process does not share any variables or memory with any other process, and only communicates with other processes through channels.

While a process is nothing special in C#, a channel is more complex in its implementation.

Communication over a channel will only take place when there is both a reader and a writer waiting (aka rendevous-style, or barrier-like communication). A process can request to read or write multiple channels, and exactly one communication will happen, guaranteed to preserve the order of requests.

In other words, each channel has a queue for readers, and a queue for writers. When both queues contain one entry, communication happens.

This kind of communication is intuitive (just a read/write call) but removes non-determinism (including race conditions).

The reason for using AutomationExtensions.RunTask(...) is that this call automatically calls Join and Leave on channels. The effect of this is that the channel closes (throwing exceptions) as soon as the last reader or writer calls Leave (i.e. the channel has zero readers or zero writers, meaning that no further communication can happen).

In Duplicati, this is used to enabled concurrency for hashing and compressing.
Roughly the following happens:

A process traverses the file system and writes paths to a channel
A process reads paths and checks if the metadata has changed; all changed paths are written to another channel
Multiple processes read paths and computes block hashes; new block hashes (and data) are written to a channel
Multiple processes read blocks and compresses blocks; filled volumes are written as upload requests
A process caps the number of active uploads
A process performs the spill pickup (as described above)
A process performs the remote operations (upload, download, list)

Due to the synchronous nature of the channels, there are never races, and no process can flood the system. As soon as the upload limiter is maxed out, it will not read its channel, automatically blocking the writers, rippling all the way back to the process traversing the filesystem. Similarly, once the filesystem traversal process is done, it closes the channel, causing a cascading close of channels and controlled shutdown of the processes without data loss.

This makes is trivial to maximize and balance the system resources as we simply fire up enough processes and they will automatically block the previous processes as needed, no matter if the disk, CPU, memory or network link is the bottleneck.

If anyone is interested in a more academic description of CoCoL, these are the two papers I wrote about it:

ts678 · October 21, 2019, 5:27pm

Feature/aftp update #3956 PR discussion follows, expanding on the summary below responding to:

Issues to address to get out of beta

WIP AFTP backend data upload corruption fix #3866 is the PR from earlier in current topic that hit my FTP server SIZE bug. The following comment concerns me in terms of getting the transfer times right.

Currently AFTP will seemingly transfer a file via async ftp and return ‘success’ even though it seems the transfer is still proceeding.

seems pretty rude behavior, but I can’t find any open or closed FluentFTP issues, so I wonder if there’s a chance it’s Duplicati code. There was a PR comment on the 5 second try, which is basically my worry even though the new PR splits the polling into a configurable timed delay and then hardcoded poll time.

I’d note that ignoring _listVerify was new with the first POC PR, and I’m glad to see the option returned.

The WIP POC PR is also where FluentFTP GetFileSize replaced Duplicati List (of everything I guess), followed by a search and size check for the file of interest. It seems better to ask for the file of interest, which FTP SIZE command can do, but with the gotcha that it’s the transfer size, opening a window for broken FTP servers to compute it wrong (e.g. as ASCII transfer). FTP LIST command can get a single file, however the server output format may vary. MLST was defined in 2007 in RFC 3659 to solve that. FluentFTP does know how to handle MLST, but it looks like it’s by GetObjectInfo instead of GetListing, and there’s a chance that an OLD server won’t have it. I don’t know if FluentFTP runs LIST as fallback.

github.com

robinrodricks/FluentFTP/blob/c4989d69a72d161350fbcf554f727da442dac1cd/FluentFTP/Helpers/FtpListItem.cs#L25-L27


      
          		/// NOTE TO USER : You should not need to construct this class manually except in advanced cases. Typically constructed by GetListing().
          		/// </summary>
          		public FtpListItem(string record, string name, long size, bool isDir, ref DateTime lastModifiedTime) {

and Duplicai does call GetListing in its List operation. I’m not sure why GetListing can’t do as well as GetFileSize at getting the file size on the remote (which should be OK, regardless of transfer mode).

github.com

robinrodricks/FluentFTP/blob/c4989d69a72d161350fbcf554f727da442dac1cd/FluentFTP/Client/FtpClient_HighLevelUpload.cs#L540-L541


      
          		/// <returns>If true then the file was uploaded, false otherwise.</returns>
          		public async Task<bool> UploadAsync(Stream fileStream, string remotePath, FtpExists existsMode = FtpExists.Overwrite, bool createRemoteDir = false, IProgress<FtpProgress> progress = null, CancellationToken token = default(CancellationToken)) {

File Transfer High-level API:
**Upload** () - Uploads a Stream or byte[] to the server. Returns true if succeeded, false if failed or file does not exist.

FluentFTP/FluentFTP.Examples/AsyncConnect.cs unfortunately seems to ignore the return code, yet I can’t spot anything wrong with how Duplicati does it. I’m not real familiar with C# async coding though.

EDIT:

Polling 4 times per second is probably making some poor confused FTP server read through the whole file that often trying to figure out the size after (mis)translation. I think I read that some servers refuse to compute the SIZE unless you’re in BINARY mode so a different confused server might hurt us that way. Listing a single file seems like the fastest poll, but better than polling might be to understand early finish.

kenkendk · October 22, 2019, 8:41am

Yes, that was my thought as well. I did not change it because I think the motivation for using SIZE is to avoid returning a complete file list, which can be really slow for low-powered NAS with lots of files.

The “normal” FTP backend is using LIST for the verification.

I think relying on yet-another extension to solve the problem is very risky, given the non-standard implementations and varying FTP server support.

Yes, you are probably correct. The code does wait 1/4th of a second between requests, so it will not hammer away, but wait for the server to respond before sending a new request almost immediately. Not ideal though.

This could perhaps be one of the clues as to why it returns early, it does not wait for a reply:

github.com

robinrodricks/FluentFTP/blob/c4989d69a72d161350fbcf554f727da442dac1cd/FluentFTP/Client/FtpClient_HighLevelUpload.cs#L1010


      
          				// send progress reports
          				if (progress != null) {
          					progress.Report(new FtpProgress(100.0, 0, TimeSpan.FromSeconds(0)));
          				}
          
          				// disconnect FTP stream before exiting
          				upStream.Dispose();
          
          				// FIX : if this is not added, there appears to be "stale data" on the socket
          				// listen for a success/failure reply
          				if (!m_threadSafeDataChannels) {
          					FtpReply status = await GetReplyAsync(token);
          
          					// Fix #353: if server sends 550 the transfer was received but could not be confirmed by the server
          					if (status.Code != null && status.Code != "" && status.Code.StartsWith("5")) {
          						return false;
          					}
          				}
          
          				return true;
          			}

The Duplicati code sets the thread-safe option, and this has been on since the very first commit.

This makes the library create a new client (the source of the ASCII bug previously fixed), and returns the stream abstraction from that:

github.com

robinrodricks/FluentFTP/blob/2845a26bfb361d5586c0de078e06516b159582eb/FluentFTP/Client/FtpClient_LowLevel.cs#L1436-L1467


      
          		public virtual async Task<Stream> OpenWriteAsync(string path, FtpDataType type, bool checkIfFileExists, CancellationToken token = default(CancellationToken)) {

          			// verify args

          			if (path.IsBlank()) {

          				throw new ArgumentException("Required parameter is null or blank.", "path");

          			}

          

          			LogFunc(nameof(OpenWriteAsync), new object[] {path, type});

          

          			FtpClient client = null;

          			FtpDataStream stream = null;

          			long length = 0;

          

          			if (m_threadSafeDataChannels) {

          				client = CloneConnection();

          				client.CopyStateFlags(this);

          				await client.ConnectAsync(token);

          				await client.SetWorkingDirectoryAsync(await GetWorkingDirectoryAsync(token), token);

          			}

          			else {

          				client = this;

This file has been truncated. show original

The call only returns the stream instance, not the client instance. Sadly, this makes it impossible to call client.GetReplyAsync() on the upload instance, so we cannot ensure that the upload has completed fully.

The real fix appears to be to disable the feature. I have looked through the code, and it looks like it only has an effect on uploads/downloads, where it creates a separate client to do the transfer. Based on this, it seems to be a feature that is meant to allow users to start multiple parallel transfers, and make it appear to happen over the same connection. From the readme:

What does EnableThreadSafeDataConnections do?

EnableThreadSafeDataConnections is an older feature built by the original author. If true, it opens a new FTP client instance (and reconnects to the server) every time you try to upload/download a file. It used to be the default setting, but it affects performance terribly so I disabled it and found many issues were solved as well as performance was restored. I believe if devs want multi-threaded uploading they should just start a new BackgroundWorker and create/use FtpClient within that thread. Try that if you want concurrent uploading, it should work fine.

This matches my investigation. Duplicati uses a single thread (well, a single Task anyway) to control each backend instance, so it would create multiple clients if we get around to do parallel uploads anyway (as proposed).

After discovering this, I suggest that we disable the flag and remove the retry logic, making it similar to what the normal FTP backend does. I can do the PR, but I do not have a good setup for testing.

@ts678 and @BlueBlock would you be able to test a fix that uses this approach?

kenkendk · October 22, 2019, 8:56am

I updated the PR with the changes: Feature/aftp update by kenkendk · Pull Request #3956 · duplicati/duplicati · GitHub
I kept the delay code in there, in case there is a brief confusion state in the FTP server where it returns the size from a “recently active” connection, even though it has been reported completed.

ts678 · October 22, 2019, 1:26pm

TL;DR Revised PR approach looks canary ready. Maybe room for improvement, but lots of unknowns.

and probably unavoidably slow. Perhaps both should someday change to LIST just the file of interest (more below). SIZE is probably slow only if a confused NAS decides to read the whole file to compute transfer size. Proper protocol by my reading would realize it’s BINARY mode and just send file size as returned by the OS, but one never knows what silly things a program is going to do. Fastest answer is asking for and receiving file size as returned by the OS, but using MLST or LIST as described below.

If you look at the 2007 RFC 3659 - Extensions to FTP Table of Contents you can see that it added SIZE, MLST, and other extensions, so it’s not clear why one would be less supported than another. RFC 2389 - Feature negotiation mechanism for the File Transfer Protocol has a way for client to know what’s there.

FTP Commands and Extensions is the official IANA registry of extensions, citing RFCs. SIZE and MLST are listed as “optional”, which is another hole that either scheme could fall into. FluentFTP can fall back MLSD to LIST (which is mandatory). LIST possibly became mandatory in RFC 1123 - Requirements for Internet Hosts – Application and Support - 4.1.2.13 Minimum Implementation in 1989, so may be safest, given an assumption that the server actually implemented the single file capability required by RFC 959:

         LIST (LIST)

            This command causes a list to be sent from the server to the
            passive DTP.  If the pathname specifies a directory or other
            group of files, the server should transfer a list of files
            in the specified directory.  If the pathname specifies a
            file then the server should send current information on the
            file.  A null argument implies the user's current working or
            default directory.  The data transfer is over the data
            connection in type ASCII or type EBCDIC.  (The user must
            ensure that the TYPE is appropriately ASCII or EBCDIC).
            Since the information on a file may vary widely from system
            to system, this information may be hard to use automatically
            in a program, but may be quite useful to a human user.

From above, one can see that the format of LIST is already variable, so one would hope the server can handle the different cases (no guarantees – it might just handle only two of the three possible cases…). There’s also a client-side risk, as FluentFTP source seems to not realize LIST can do a single file, but it possibly doesn’t need to know the three-case processing because it’s somewhat transparent to a client.

What we really want is not MLSD but MLST, and GetObjectInfo can do that if the server has MLST, and internally falls back to LIST otherwise, so we still rely on server’s LIST of non-directory working properly.

Problems after upgrade to 2.0.4.21 was our discovery and discussion of the ASCII failure seen before, and we wound up fixing the library, but maybe now it’s time to try dropping use of its deprecated option despite some concerns you expressed there. My concern would be that I don’t understand the threads that might come through during something like exception processing, and that’s always tough to test…

If I read this correctly, this revision still makes --disable-upload-verify work again (good), while adding --aftp-upload-delay as completely optional and doing nothing unless set to guard against possible but predictable (compared to transfer) delay (good), now turning off EnableThreadSafeDataConnections (we’ll see, and the comment could be misconstrued – we do support parallel uploads in our own way).

There’s no longer a four-times-per-second poll for completion now that early return is resolved (good).

It still uses GetFileSize which was discussed at length above, wondering if a LIST-based way is safer. Good news is that if SIZE breaks, there’s at least one way out by not not doing the upload verify at all. Giving this a try is OK by me even though there might be a better way (never 100% clear what works).

Thanks so much for working on this. I assume it’s tested some, but there’s no substitute for field tests.

kenkendk · October 22, 2019, 8:49pm

FluentFTP appears to be using that. I have seen FTP libraries that pattern match on the reported service name, and assign a feature set based on that due to broken implementations

Looking at the FluentFTP implementation, you are quite right this is what we want. Interesting to note that it does not use the LIST filename but just the plain LIST on the directory for fallback, even though both should be equally supported. This is a strange decision, given that GetFileSize() uses the LIST filename command.

I wanted as little chance of messing up, as I was under the impression that we should roll out quickly. We are now taking some time to investigate, so I am not so worried.

You are correct, but int this case there are some other parts to consider:

FluentFTP is not at all thread safe, even though it can create multiple connections for upload/download
Duplicati does not attempt multiple concurrent operations on a single backend instance
The changed flag ONLY changes the upload and download behavior

I was worried before I searched the source for the m_threadSafeDataChannels. Now that I have seen what it does, I am not worried. I think the library should remove the flag, because it is misleading in suggesting that it can handle multiple threads (and since it never checks the result of uploads/downloads it can really hide errors ).

For (2) I know that none of the other backends are written with multithreading in mind. Any kind of future parallel uploading should be done by creating multiple backend instances, allowing each to be “single threaded”, or a major rewrite of every backend is required.

The story of FTP implementations

I can change it to something like this:

if (_listVerify)
{
    long remoteSize;

    // check remote file size; matching file size indicates completion
    if (ftpClient.HasFeature(FtpCapability.MLSD))
    {
        var r = await ftpClient.GetObjectInfoAsync(remotePath);
        remoteSize = r == null ? -1 : r.Size;
    }
    else
    {
        remoteSize = await ftpClient.GetFileSizeAsync(remotePath, cancelToken);
    }

    if (streamLen != remoteSize)
    {
        throw new UserInformationException(Strings.ListVerifySizeFailure(remotename, remoteSize, streamLen), "AftpListVerifySizeFailure");
    }
}

But this again assumes that HasFeature works correctly, and that the server otherwise supports the LIST filename command. Ideally, it should also be possible to select the bare LIST command, but adding yet-another-user-option for this edge case seems a bit over the top.

Given the relatively small user base, I am suggesting that we stick with the current PR, and give more effort if users report issues. We do not know weird implementation details yet, so rather than spend time fighting ghosts, we can deal with issues as they occur.

The current PR allows disabling the safeguard if the server is non-compliant, and also allows the upload delay if it turns out that anyone actually still need it. The server has to be pretty broken not to work under these conditions.

ts678 · October 22, 2019, 9:05pm

FluentFTP does something based on server seen, at least as implied by comments like this:

FluentFTP 20.0.0

New: FTP Server software detection (PureFTPd, VsFTPd, ProFTPD, FileZilla, OpenVMS, WindowsCE, WuFTPd)

I think putting the current PR code out is fine, and hopefully all will work. If not, options exist.

Thanks for more looking and explaining.