Das Werkstatt

Posted: **Tue Dec 20, 2011 9:31 pm**

Again...

I'm facing a pile of files with strange characters in their filenames, which stand in for German umlauts (öäüß) and other special characters.

I've verified that I can create files with umlauts manually:

Code: Select all

touch öäüß.txt

...and their characters are represented correctly.

This means the terminal is not merely displaying the names incorrectly; the names themselves were altered by some copy step along the chain.

So, I'm planning to rename them.
But of course, not manually.

Here's the list of equivalents I've figured out so far (garbled pair on the left, intended character on the right):

Code: Select all

├╢ ö
├╝ ü
├ñ ä
├ä Ä
┬┤ '

Unfortunately, I haven't managed to find an encoding table on the internet that shows what kind of substitution happened here. Google seems to filter out the box-drawing characters (which are not actually ASCII) when I search for them.
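For what it's worth, the pairs in the list look like the two UTF-8 bytes of each umlaut displayed in the old DOS code page 437: ö is C3 B6 in UTF-8, and CP437 shows those two bytes as ├ and ╢. That code page is an assumption on my part, nothing in the files says so. A minimal sketch to check it with iconv (`ungarble` is just a name made up for this example, and the script has to be saved as UTF-8):

```shell
# Map each character of a garbled string back to its single CP437 byte.
# If the CP437 assumption is right, those bytes are the original UTF-8 text.
ungarble() {
    printf '%s' "$1" | iconv -f UTF-8 -t CP437
}

# Run the pairs from the list through it.
for garbled in '├╢' '├╝' '├ñ' '├ä' '┬┤'; do
    printf '%s -> %s\n' "$garbled" "$(ungarble "$garbled")"
done
```

If the assumption holds, the first four lines show ö, ü, ä and Ä. The last pair comes out as the acute accent ´ (U+00B4) rather than a plain ASCII apostrophe.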

Setting the shell's language to German UTF-8 in bash:

Code: Select all

export LANG=de_DE.UTF-8

causes the filenames to show 5-6 question marks instead of the 2 strange characters.
This is because the locale for German UTF-8 has not been built yet. Build it by reconfiguring the "locales" package:

Code: Select all

sudo dpkg-reconfigure locales

There you can select which locales (and therefore which encodings) your system supports.

Now, with "de_DE.UTF-8" as the language in the shell, I get the same characters as with "en_US.UTF-8", so I assume:

The current encoding of the filenames *is* already UTF-8.
Unfortunately, I therefore suspect that the original UTF-8 names were at some point read as a single-byte encoding (hence two characters for each umlaut), and that those two characters were then converted to Unicode and stored as UTF-8 again.
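To illustrate that suspicion, the damage can be reproduced on purpose: take the UTF-8 bytes of a correct name, pretend they are single-byte text, and convert that to UTF-8. A rough sketch, assuming the single-byte encoding was CP437 (a guess based on the box-drawing characters) and using a made-up helper name `garble`:

```shell
# Treat every byte of the (UTF-8) input as one CP437 character
# and store the result as UTF-8 again, i.e. double-encode it.
garble() {
    printf '%s' "$1" | iconv -f CP437 -t UTF-8
}

garble 'öäüß.txt'; echo
```

With the script saved as UTF-8, this should print ├╢├ñ├╝├ƒ.txt, which matches the kind of names I'm seeing.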

Accordingly, when I set the display encoding to a single-byte encoding (ISO-8859-1):

Code: Select all

export LANG=de_DE.ISO-8859-1

the number of characters shown for each special character increases from 2 to somewhere between 4 and 6 per "to-be umlaut".
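Those counts would fit the double-encoding theory: a box-drawing character takes 3 bytes in UTF-8, so a garbled pair is 5 or 6 bytes long, and a single-byte display shows one character per byte. A small sketch to count the bytes (`byte_len` is a hypothetical helper; the script must be saved as UTF-8):

```shell
# Print how many bytes a string occupies as stored in this script.
byte_len() {
    # The arithmetic expansion strips the padding some wc versions add.
    echo $(( $(printf '%s' "$1" | wc -c) ))
}

byte_len 'ö'     # 2 bytes: a correctly encoded umlaut
byte_len '├╢'    # 6 bytes: two 3-byte box-drawing characters
byte_len '├ñ'    # 5 bytes: one 3-byte and one 2-byte character
```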

*sigh*
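A possible way out, as a sketch only: convert each name from UTF-8 to the suspected single-byte encoding and keep the result if it is valid UTF-8. This assumes the intermediate encoding was CP437, which I have not confirmed; `fix_name` and `rename_garbled` are names invented for this example. Try it on a copy of the files first.

```shell
# Undo one round of double encoding; fail if the result is not valid UTF-8.
fix_name() {
    raw=$(printf '%s' "$1" | iconv -f UTF-8 -t CP437 2>/dev/null) || return 1
    # Correctly named files (e.g. a real "ö") turn into invalid UTF-8 here,
    # so this second pass rejects them instead of breaking them.
    printf '%s' "$raw" | iconv -f UTF-8 -t UTF-8 2>/dev/null
}

# Rename every garbled entry directly inside the given directory.
rename_garbled() {
    dir=$1
    for path in "$dir"/*; do
        [ -e "$path" ] || continue            # empty directory: glob stayed literal
        old=${path##*/}
        new=$(fix_name "$old") || continue    # not repairable under this assumption
        [ -n "$new" ] || continue
        [ "$new" = "$old" ] && continue       # plain ASCII names stay as they are
        [ -e "$dir/$new" ] && continue        # never overwrite an existing file
        mv -- "$path" "$dir/$new"
    done
}
```

Calling `rename_garbled /path/to/copy` handles one directory level; subdirectories would need a loop over `find` output on top of this.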

[REFERENCES]
http://www.mastblau.com/2009-01-20/word ... umstellen/ (Other encoding equivalent listings)

Mass renaming filenames with wrong encoding
