Mass renaming filenames with wrong encoding
Posted: Tue Dec 20, 2011 9:31 pm
Again...
I'm facing a pile of files with strange characters in their filenames where German umlauts (öäüß) and other special characters should be.
I've verified that I can create files with umlauts manually:
Code:
touch öäüß.txt
...and their characters are displayed correctly.
This means the problem is not a display issue: the filenames themselves were altered by some copy step along the way.

So I'm planning to rename them, but of course not manually.

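Here's the list of equivalents I've figured out so far (garbled characters on the left, intended character on the right):
Code:
├╢ = ö
├╝ = ü
├ñ = ä
├ä = Ä
┬┤ = '

Unfortunately, I haven't managed to find an encoding table on the internet that shows what kind of substitution happened here. The box-drawing (frame) characters involved seem to be filtered out by Google.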
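The pairs in the list above are consistent with UTF-8 bytes having been read as code page 437: ö is 0xC3 0xB6 in UTF-8, and those two bytes are ├ and ╢ in CP437. By the same reading, ┬┤ would be the acute accent ´ (0xC2 0xB4) rather than a plain apostrophe. This is an inference from the listed pairs only, not something confirmed for every file. Under that assumption, a minimal sketch of the reverse mapping with `iconv` could look like this; `fix_name` and `rename_dry_run` are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Sketch only. Assumption: the garbled names are UTF-8 text that resulted
# from reading the original UTF-8 bytes as CP437.

# fix_name (hypothetical helper): print the repaired form of one name.
# Converting UTF-8 -> CP437 turns each garbled character back into the
# single byte it stood for; those bytes are the original UTF-8 name.
fix_name() {
    local out
    out=$(printf '%s' "$1" | iconv -f UTF-8 -t CP437 2>/dev/null) || return 1
    # Accept the result only if it is itself valid UTF-8. This rejects
    # names that were already correct (e.g. a real "ö").
    printf '%s' "$out" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 || return 1
    printf '%s' "$out"
}

# rename_dry_run (hypothetical helper): list what would be renamed in a
# directory, without touching any file.
rename_dry_run() {
    local dir=${1:-.} path old new
    for path in "$dir"/*; do
        [ -e "$path" ] || continue
        old=${path##*/}
        new=$(fix_name "$old") || continue
        if [ -n "$new" ] && [ "$new" != "$old" ]; then
            printf '%s -> %s\n' "$old" "$new"
            # To rename for real, something like:
            #   mv -n -- "$dir/$old" "$dir/$new"
        fi
    done
    return 0
}
```

Running `rename_dry_run /path/to/files` first and reading the proposed names before enabling the `mv` line seems the safer order, since a wrong guess about the encoding would garble the names further.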
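Setting the language to German Unicode in bash:
Code:
export LANG=de_DE.UTF-8   # the codeset follows a dot; "@" is reserved for modifiers such as @euro
causes the filenames to show 5-6 question marks instead of 2 strange characters.
This is because the locale for German UTF-8 has not been built yet. Build it by reconfiguring the "locales" package:
Code:
sudo dpkg-reconfigure locales
There you can select which encodings your system supports.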
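Whether a locale has actually been generated can be checked before exporting `LANG`. The sketch below assumes a glibc system, where `locale -a` prints one locale per line and typically spells the codeset as `utf8`. `locale_in_list` is a hypothetical helper that reads such a list from standard input, so it can be tried without changing the system:

```shell
#!/usr/bin/env bash
# locale_in_list (hypothetical helper): succeed if the locale named in $1
# appears in the list on stdin. Case and hyphens are ignored, because
# `locale -a` usually prints "de_DE.utf8" for "de_DE.UTF-8".
locale_in_list() {
    local want
    want=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -d '-')
    tr '[:upper:]' '[:lower:]' | tr -d '-' | grep -qxF -- "$want"
}

# Usage on a real system:
#   locale -a | locale_in_list de_DE.UTF-8 && export LANG=de_DE.UTF-8
```

If the check fails, the shell falls back to the "C" locale, in which `ls` prints one question mark per non-ASCII byte.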
Now, with "de_DE.UTF-8" as the language in the shell, I get exactly the same characters as with "en_US.UTF-8", so I assume:
The current encoding of the filenames *is* already UTF-8.
Unfortunately, I therefore suspect that the original UTF-8 names were at some point read as a single-byte encoding (hence two characters for each umlaut), and that those two characters were then treated as Unicode and stored as UTF-8 again.
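This suspicion can be tried out on a single character. The sketch below assumes code page 437 as the single-byte encoding, which is an assumption drawn from the list of equivalents in this post and not a confirmed fact. `garble` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# garble (hypothetical helper): take correct UTF-8 text, pretend its bytes
# are CP437 characters, and store those characters as UTF-8 again.
garble() {
    printf '%s' "$1" | iconv -f CP437 -t UTF-8
}

# "ö" is the two bytes C3 B6 in UTF-8, so two characters come out.
garble 'ö'; echo                 # prints ├╢
# Hex dump of the result: the two visible characters occupy six bytes.
garble 'ö' | od -An -tx1
```

If the output matches the garbled names on disk, the double-encoding theory holds for that character and the conversion can be run in the opposite direction to repair it.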
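Consequently, when I set the display encoding to a single-byte encoding (ISO-8859-1):
Code:
export LANG=de_DE.ISO-8859-1   # the codeset follows a dot, not "@"
the number of characters shown per "to-be umlaut" increases from 2 to between 4 and 6.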
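A single-byte encoding such as ISO-8859-1 shows one position per stored byte, so it can help to count the bytes behind each garbled pair. A small sketch, with `byte_len` as a hypothetical helper:

```shell
#!/usr/bin/env bash
# byte_len (hypothetical helper): number of bytes in a string, as opposed
# to the number of characters a UTF-8 terminal would display.
byte_len() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

# Each box-drawing character takes 3 bytes in UTF-8, each accented Latin
# letter takes 2, so the pairs from the list come to 5 or 6 bytes.
for pair in '├╢' '├╝' '├ñ' '├ä'; do
    printf '%s: %s bytes\n' "$pair" "$(byte_len "$pair")"
done
```

Bytes in the range 0x80-0x9F are control codes in ISO-8859-1 and may not be displayed at all, which could explain seeing fewer positions than bytes.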
*sigh*

[REFERENCES]
http://www.mastblau.com/2009-01-20/word ... umstellen/ (Other encoding equivalent listings)