Mass renaming filenames with wrong encoding

Post by ^rooker »

Again...

I'm facing a pile of files whose names contain strange characters where German umlauts (öäüß) and other special characters should be.

I've verified that I can create files with umlauts manually:

Code: Select all

touch öäüß.txt
...and their characters are represented correctly.

This means the problem is not a display issue: the filenames themselves were altered during some copy step in the chain :(

So, I'm planning to rename them.
But of course, not manually ;)

Here's the list of equivalents I've figured out so far:

Code: Select all

├╢ ö
├╝ ü
├ñ ä
├ä Ä
┬┤ '
Unfortunately, I haven't managed to find an encoding table on the internet that shows what kind of substitution happened here. The box-drawing (frame) characters seem to be filtered out by Google when I search for them.
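For what it's worth, all five pairs in the list above are consistent with one specific substitution: the UTF-8 bytes of the intended character read as code page 437 (the old DOS code page). This is an inference from the listed pairs only, not something confirmed for the whole pile of files. It also assumes that the `'` in the last row stands for the acute accent `´` (U+00B4). A minimal Python check, with `pairs` copied from the list and `garble` being a helper name made up for this sketch:

```python
# Pairs from the list above: garbled form -> intended character.
# Assumption: the "'" in the last row stands for the acute accent U+00B4.
pairs = {
    "├╢": "ö",
    "├╝": "ü",
    "├ñ": "ä",
    "├ä": "Ä",
    "┬┤": "\u00b4",
}


def garble(text):
    """Read the UTF-8 bytes of `text` as if they were CP437."""
    return text.encode("utf-8").decode("cp437")


for garbled, intended in pairs.items():
    # Prints True when the CP437 hypothesis reproduces the garbled form.
    print(repr(intended), "->", repr(garbled), garble(intended) == garbled)
```

If a pair from your own files prints False here, some other single-byte code page is probably involved and the codec name needs to be swapped.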

Setting the language to German Unicode in bash:

Code: Select all

export LANG=de_DE.UTF-8
causes the filenames to show 5-6 question marks instead of the 2 strange characters.
This is because the locale for German UTF-8 has not been built. Build it by reconfiguring the "locales" package:

Code: Select all

sudo dpkg-reconfigure locales
There you can select which locales (and therefore which encodings) your system supports.

Now, with "de_DE.UTF-8" as language in the shell, I get the identical characters as with "en_US.UTF-8", so I assume:

The current encoding of the filenames *is* already UTF-8.
Unfortunately, I therefore suspect that the original UTF-8 bytes were interpreted as a single-byte encoding (hence two characters for one), and that those two characters were then re-interpreted as Unicode and stored as UTF-8 again.
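If that suspicion holds, the damage can be undone by running the chain backwards: encode the garbled name in the single-byte encoding, then decode the resulting bytes as UTF-8. Here is a rough Python sketch, assuming the single-byte encoding is CP437. That is a guess which fits the pairs listed earlier, so swap in another codec if it doesn't fit your files. `repair_name` and `rename_all` are names made up for this sketch, and `rename_all` only prints what it would do unless called with `dry_run=False`:

```python
import os


def repair_name(name, codec="cp437"):
    """Undo 'UTF-8 bytes read as `codec`, then saved as UTF-8'.

    Returns the name unchanged if it does not round-trip cleanly,
    e.g. because it already contains correct umlauts.
    """
    try:
        return name.encode(codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name


def rename_all(directory, codec="cp437", dry_run=True):
    """Repair every entry name in `directory`; return (old, new) pairs."""
    changes = []
    for old in sorted(os.listdir(directory)):
        new = repair_name(old, codec)
        if new == old:
            continue  # plain ASCII or not repairable: leave alone
        src = os.path.join(directory, old)
        dst = os.path.join(directory, new)
        if os.path.exists(dst):
            print("skip (target exists):", old)
            continue
        print(old, "->", new)
        if not dry_run:
            os.rename(src, dst)
        changes.append((old, new))
    return changes
```

Run it once with the default `dry_run=True` and read the printed list before letting it rename anything; it only handles a single directory, not subdirectories.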

Consistent with that, when setting the display encoding to a single-byte encoding (ISO-8859-1):

Code: Select all

export LANG=de_DE.ISO-8859-1
the number of characters shown for each special character increases from 2 to between 4 and 6 per "to-be umlaut".
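Those counts are roughly what one would expect if each garbled pair is itself stored as UTF-8: a box-drawing character takes three bytes and a letter like `ñ` or `ä` takes two, so a pair occupies five or six bytes, and a single-byte display shows at most one character per byte. A quick Python check, using the garbled pairs from the list earlier in this post:

```python
# Garbled pairs from the list earlier in this post.
garbled_pairs = ["├╢", "├╝", "├ñ", "├ä", "┬┤"]

for pair in garbled_pairs:
    raw = pair.encode("utf-8")
    # One hex group per byte; a single-byte locale treats each byte
    # as a separate character.
    print(pair, len(raw), raw.hex(" "))
```

Every pair comes out at 5 or 6 bytes, which matches the 5-6 question marks seen before the locale was built. Why ISO-8859-1 sometimes shows only 4 characters is not explained by this check; possibly bytes in the 0x80-0x9F range are not displayed as visible characters.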

*sigh* :?



[REFERENCES]
http://www.mastblau.com/2009-01-20/word ... umstellen/ (Other encoding equivalent listings)