File name encoding

adji · October 12, 2021, 7:23am

I’m getting the following warning: “[Warning-Duplicati.Library.Main.Operation.Backup.FileEnumerationProcess-FileAccessError]: Error reported while accessing file: /…/Cablevisi�n_28072008.pdf”

When I list the file with ls -b I get: “Cablevisi\363n_30112008.pdf”.

Any idea how to fix it?

PS: It happens with some directories too, with the warning: [Warning-Duplicati.Library.Main.Operation.Backup.FileEnumerationProcess-PathProcessingError]: Failed to process path:

ts678 · October 14, 2021, 5:43pm

Welcome to the forum @adji

Thanks for showing the octal value of the probably offending character. Does anything display it as ó?
If so, you probably have some Latin-1 (ISO 8859-1) characters instead of Unicode in UTF-8 encoding.
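
To illustrate that guess, here is a minimal Python sketch (nothing from Duplicati itself, just the byte that ls -b printed). It shows that the single byte octal 363 decodes to ó as Latin-1, is not valid UTF-8 on its own, and that the UTF-8 form of the same letter is two bytes:

```python
# The byte that `ls -b` shows as \363 (octal) is 0xF3 in hex, 243 in decimal.
raw = bytes([0o363])

# Interpreted as Latin-1 (ISO 8859-1), that single byte is the letter ó.
as_latin1 = raw.decode("latin-1")

# Interpreted as UTF-8, the lone byte is invalid, so strict decoding fails.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# The UTF-8 encoding of the same letter is the two bytes 0xC3 0xB3.
as_utf8_bytes = as_latin1.encode("utf-8")

print(as_latin1, valid_utf8, as_utf8_bytes)
```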

Someone with experience with character set conversions would be helpful. If not, I wonder if you can manually fix a small number of issues if a GUI file browser can let you delete the character and retype.
Maybe the retyped character will be Unicode even if it happens to display the same. You can check the result with ls -b.

If you have to convert lots of names, you might look into whether convmv can help. Note that renames might impact references to the files from scripts and such which rely on names and not appearances.
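
For a rough idea of what such a bulk conversion involves, here is a Python sketch. It is not how convmv is implemented, and the function names (fix_name, plan_renames, apply_renames) are made up for illustration. It assumes the bad names really are Latin-1, which is only a guess, and it defaults to a dry run so nothing is renamed until you ask for it:

```python
import os

def fix_name(name, source_encoding="latin-1"):
    """Return the UTF-8 form of a raw byte name, or None if it is already valid UTF-8."""
    try:
        name.decode("utf-8")
        return None  # already valid UTF-8 (this includes plain ASCII)
    except UnicodeDecodeError:
        # Assumption: the name was written in source_encoding.
        return name.decode(source_encoding).encode("utf-8")

def plan_renames(directory, source_encoding="latin-1"):
    """List (old, new) byte-name pairs for one directory, without touching anything."""
    plan = []
    # Passing a bytes path makes os.listdir return raw byte names.
    for name in os.listdir(os.fsencode(directory)):
        fixed = fix_name(name, source_encoding)
        if fixed is not None:
            plan.append((name, fixed))
    return plan

def apply_renames(directory, plan, dry_run=True):
    """Print each planned rename; only rename when dry_run is False."""
    base = os.fsencode(directory)
    for old, new in plan:
        target = os.path.join(base, new)
        print("mv", old, new)
        if dry_run:
            continue
        if os.path.exists(target):
            print("skipped, target already exists:", new)
            continue
        os.rename(os.path.join(base, old), target)
```

Calling apply_renames(".", plan_renames(".")) only prints the plan. Passing dry_run=False performs the renames, so the note about scripts that depend on the old names applies here as well.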

adji · October 14, 2021, 6:58pm

Yes, find shows it as ó, so it is probably a Latin-1 character (my current LANG=en_US.UTF-8).
I tried convmv, and it printed:

mv "./Cablevisi�n_30112008.pdf"	"./Cablevisión_30112008.pdf" (y/n) y

ls -b *30112008*
Cablevisi\363n_30112008.pdf

I will check what happens with the next backup.

Do you know where the problem is inside Duplicati? It must handle filenames as “binaries” and cope with all the strange encoding possibilities, or something like that, right?

adji · October 14, 2021, 7:01pm

I ran convmv again (the first run was in dry-run mode), and now I get: Cablevisión_30112008.pdf

ts678 · October 14, 2021, 7:36pm

I don’t think it handles names as raw bytes, I don’t think that’s unusual, and I’m not sure coping with every possible encoding is even possible (any experts around?).
A character like octal 363 means very different things in different code sets. Latin-1 is just one of them…
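
As a small illustration of that point, here is a Python sketch that decodes the same byte under a handful of legacy 8-bit encodings. The list of encodings is just an arbitrary sample I picked, not anything Duplicati consults:

```python
# The single byte that `ls -b` printed as \363.
raw = bytes([0o363])

# An arbitrary sample of legacy 8-bit encodings that Python ships codecs for.
encodings = ["latin-1", "cp1252", "cp437", "cp1251", "koi8-r", "iso8859-7"]

# Decode the same byte under each encoding and show the resulting character.
decoded = {enc: raw.decode(enc) for enc in encodings}
for enc, ch in decoded.items():
    print(f"{enc:10} -> {ch!r} (U+{ord(ch):04X})")
```

The byte alone carries no hint of which interpretation was intended, which is why a program cannot reliably guess.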

Once you venture beyond the ASCII range, and into values decimal 128 and up (high half of byte values):

Extended ASCII makes it very difficult to cope with whatever was intended by that particular binary value.

Windows code page shows more of the pain of the what-is-this-thing question. For some very old uses:

Older Character Sets

Systems these days avoid such problems (somewhat) by using standardized character representations:

Unicode – The World Standard for Text and Emoji

I don’t consider this a Duplicati problem. Its underlying software such as mono and SQLite use Unicode.

Unicode in Microsoft Windows shows Windows NT went Unicode 28 years ago. .NET is a Microsoft tool, and mono follows it. The Unicode article in Wikipedia explains that Linux (for example) went that way too:

UTF-8 uses one to four bytes per code point and, being compact for Latin scripts and ASCII-compatible, provides the de facto standard encoding for interchange of Unicode text. It is used by FreeBSD and most recent Linux distributions as a direct replacement for legacy encodings in general text handling.
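
The "one to four bytes" and "ASCII-compatible" parts of that quote can be checked with a short Python sketch. The sample characters are my own picks:

```python
# One sample character for each UTF-8 length: ASCII letter, accented Latin
# letter, euro sign, and an emoji outside the Basic Multilingual Plane.
samples = ["A", "\u00f3", "\u20ac", "\U0001F600"]

# Number of bytes each character takes when encoded as UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} {ch!r}: {n} byte(s) -> {ch.encode('utf-8')!r}")

# ASCII text is byte-for-byte identical in UTF-8.
ascii_is_unchanged = "backup.pdf".encode("utf-8") == "backup.pdf".encode("ascii")
```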