
Mass renaming filenames with wrong encoding

Posted: Tue Dec 20, 2011 9:31 pm
by ^rooker
Again...

I'm facing a pile of files with strange characters in their filenames, which stand in for German umlauts (öäüß) and other special characters.

I've verified that I can create files with umlauts manually:

Code:

touch öäüß.txt
...and their characters are represented correctly.

This means the problem is not just a display issue: the filenames themselves were altered by some copy step along the chain :(
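One way to double-check this is to look at the raw bytes of a filename instead of trusting the terminal. A minimal sketch, assuming `od` is available; `name_bytes` is just a helper name made up for this example:

```shell
# name_bytes (hypothetical helper): print the bytes of a string as hex,
# independent of how the terminal renders them.
name_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

# A correctly stored "ö" is the two UTF-8 bytes c3 b6.
# Octal escapes are used so the example does not depend on the script's encoding.
name_bytes "$(printf '\303\266').txt"    # prints c3b62e747874

# Inspect every name in the current directory:
for f in ./*; do
    [ -e "$f" ] || continue
    printf '%s  ' "${f#./}"
    name_bytes "${f#./}"
done
```

If a name that should contain a single umlaut shows six bytes in its place (for example e2949ce295a2), the wrong characters really are stored on disk.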

So, I'm planning to rename them.
But of course, not manually ;)

Here's the list of equivalents I've figured out so far:

Code:

├╢ ö
├╝ ü
├ñ ä
├ä Ä
┬┤ '
Unfortunately, I haven't managed to find an encoding table on the internet that shows what kind of substitution happened here. The box-drawing characters in these sequences seem to be filtered out by Google.
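The pairs in the list above are consistent with the UTF-8 bytes of each character being read as the old DOS code page 437 (CP437), which is where box-drawing characters like these live. That is an inference from the table, not something the copy chain itself confirms. A small sketch to reproduce it, assuming an `iconv` that knows CP437 (`garble` is a made-up helper name):

```shell
# garble (hypothetical helper): take the bytes of a string and read them
# as CP437, printing the result as UTF-8.
garble() {
    printf '%s' "$1" | iconv -f CP437 -t UTF-8
    echo
}

# Octal escapes spell out the UTF-8 bytes of each original character.
garble "$(printf '\303\266')"    # ö (c3 b6) prints ├╢
garble "$(printf '\303\274')"    # ü (c3 bc) prints ├╝
garble "$(printf '\303\244')"    # ä (c3 a4) prints ├ñ
garble "$(printf '\303\204')"    # Ä (c3 84) prints ├ä
garble "$(printf '\302\264')"    # ´ (c2 b4) prints ┬┤
```

Under this assumption the last entry would be the acute accent ´ (U+00B4) rather than a plain ASCII apostrophe, and ß (c3 9f) would show up as ├ƒ.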

Setting the shell locale to German UTF-8 in bash:

Code:

export LANG=de_DE.UTF-8
causes the filenames to show 5-6 question marks instead of the 2 strange characters.
This is because the locale for German UTF-8 has not been built. To build it, reconfigure the "locales" package:

Code:

sudo dpkg-reconfigure locales
There you can select which locales (and therefore which encodings) your system supports.
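To check whether a locale has actually been generated before exporting it, the output of `locale -a` can be searched. A rough sketch, assuming a glibc system where `locale -a` lists names such as `de_DE.utf8`; `norm_locale` and `has_locale` are helper names invented here. Note that locale names take the form `language_TERRITORY.codeset`, with a dot before the codeset:

```shell
# norm_locale (hypothetical helper): lower-case a locale name and drop hyphens,
# so that "de_DE.UTF-8" and "de_DE.utf8" compare equal.
norm_locale() {
    printf '%s\n' "$1" | tr 'A-Z' 'a-z' | tr -d '-'
}

# has_locale (hypothetical helper): succeed if `locale -a` lists the locale.
has_locale() {
    locale -a 2>/dev/null | tr 'A-Z' 'a-z' | tr -d '-' \
        | grep -qxF "$(norm_locale "$1")"
}

if has_locale de_DE.UTF-8; then
    export LANG=de_DE.UTF-8
else
    echo "de_DE.UTF-8 is not generated yet; run: sudo dpkg-reconfigure locales" >&2
fi
```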

Now, with "de_DE.UTF-8" as language in the shell, I get the identical characters as with "en_US.UTF-8", so I assume:

The current encoding of the filenames *is* already UTF-8.
Unfortunately, that makes me suspect that the original UTF-8 bytes were at some point interpreted as a single-byte encoding (hence the 2-chars-for-one), and that those 2 characters were then stored again as UTF-8.
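If that suspicion is right, the damage can be undone by reversing the last step: convert the name from UTF-8 back to the single-byte encoding and treat the resulting bytes as UTF-8 again. A dry-run sketch, assuming the single-byte encoding was CP437 (an assumption; the box-drawing characters point that way) and a glibc `iconv` that fails on unconvertible or invalid input. `fix_name` is a hypothetical helper:

```shell
# fix_name (hypothetical helper): reverse "UTF-8 bytes read as CP437".
# Prints the repaired name; fails if the name does not fit that pattern.
fix_name() {
    _fixed=$(printf '%s' "$1" | iconv -f UTF-8 -t CP437 2>/dev/null) || return 1
    # Reject results that are not valid UTF-8, e.g. names that were fine already.
    printf '%s' "$_fixed" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 || return 1
    printf '%s\n' "$_fixed"
}

# Dry run over the current directory: only print what would be renamed.
for f in ./*; do
    [ -e "$f" ] || continue
    old=${f#./}
    new=$(fix_name "$old") || continue    # not convertible: leave it alone
    [ "$new" != "$old" ] || continue      # plain ASCII name: nothing to do
    echo mv -n -- "$f" "./$new"           # remove "echo" to really rename
done
```

Keeping the `echo` in place first makes it possible to review the proposed names. `mv -n` (do not overwrite) is not POSIX but is available in GNU and BSD `mv`.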

Consistent with that, when I set the display encoding to a single-byte encoding (ISO-8859-1):

Code:

export LANG=de_DE.ISO-8859-1
the number of characters shown for each special character increases from 2 to somewhere between 4 and 6 per "to-be umlaut".

*sigh* :?



[REFERENCES]
http://www.mastblau.com/2009-01-20/word ... umstellen/ (Other encoding equivalent listings)