Best way to convert text files between character sets?





What is the fastest, easiest tool or method to convert text files between character sets?



Specifically, I need to convert from UTF-8 to ISO-8859-15 and vice versa.



Everything goes: one-liners in your favorite scripting language, command-line tools or other utilities for OS, web sites, etc.



On Linux/UNIX/OS X/cygwin:



GNU iconv, suggested by Troels Arvin, is best used as a filter. It seems to be universally available. Example:


$ iconv -f UTF-8 -t ISO-8859-15 in.txt > out.txt
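For a scripting-language take on the same conversion, here is a minimal Python sketch that uses only the standard library. The function name convert_file and its defaults are illustrative assumptions, not part of iconv. Like iconv without extra flags, it fails with an error when a character has no equivalent in the target charset.

```python
def convert_file(src, dst, from_enc="utf-8", to_enc="iso-8859-15"):
    """Re-encode the text file src into dst (hypothetical helper name)."""
    # newline="" keeps the line endings exactly as they are in the source.
    with open(src, "r", encoding=from_enc, newline="") as fin, \
         open(dst, "w", encoding=to_enc, newline="") as fout:
        for line in fin:
            fout.write(line)
```

Calling convert_file("in.txt", "out.txt") mirrors the iconv command above; swapping the two encoding arguments converts in the other direction.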



As pointed out by Ben, there is an online converter using iconv.



GNU recode (manual), suggested by Cheekysoft, will convert one or several files in-place. Example:


$ recode UTF8..ISO-8859-15 in.txt



This one uses shorter aliases:


$ recode utf8..l9 in.txt



Recode also supports surfaces which can be used to convert between different line ending types and encodings:



Convert newlines from LF (Unix) to CR-LF (DOS):


$ recode ../CR-LF in.txt



Base64 encode file:


$ recode ../Base64 in.txt



You can also combine them.



Convert a Base64-encoded UTF-8 file with Unix line endings to a Base64-encoded Latin-1 file with DOS line endings:


$ recode utf8/Base64..l1/CR-LF/Base64 file.txt
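To make the layering of charset and surfaces in that last command explicit, here is a rough Python sketch of the same steps. The function name is made up for illustration, and the output is a single unwrapped Base64 line, which may differ from how recode lays out its Base64 output.

```python
import base64


def base64_utf8_to_base64_latin1_crlf(data: bytes) -> bytes:
    """Illustrative stand-in for: recode utf8/Base64..l1/CR-LF/Base64"""
    # 1. Remove the Base64 surface and decode the UTF-8 payload.
    text = base64.b64decode(data).decode("utf-8")
    # 2. Normalise line endings to CR-LF (existing CR-LF pairs are left as one).
    text = text.replace("\r\n", "\n").replace("\n", "\r\n")
    # 3. Encode as Latin-1 and apply the Base64 surface again.
    return base64.b64encode(text.encode("iso-8859-1"))
```

Each surface is just one more wrapping step around the charset conversion, which is why recode lets you chain them with slashes.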



On Windows with PowerShell (Jay Bazuzi):



PS C:\> gc -en utf8 in.txt | Out-File -en ascii out.txt



(No ISO-8859-15 support though; it says that supported charsets are unicode, utf7, utf8, utf32, ascii, bigendianunicode, default, and oem.)



Do you mean ISO-8859-1 support? Using "String" does this, e.g. for the reverse direction:


gc -en string in.txt | Out-File -en utf8 out.txt



Note: The possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".





I tried gc -en Ascii readme.html | Out-File -en UTF8 readme.html but it converts the file to utf-8 but then it's empty! Notepad++ says the file is Ansi-format but reading up as I understand it that's not even a valid charset?? uk.answers.yahoo.com/question/index?qid=20100927014115AAiRExF
– OZZIE
Sep 13 '13 at 12:24







Just come across this looking for an answer to a related question - great summary! Just thought it was worth adding that recode will act as a filter as well if you don't pass it any filenames, e.g.: recode utf8..l9 < in.txt > out.txt
– Jez
Mar 6 '14 at 11:05








iconv.com/iconv.htm seems to be dead for me? (timeout)
– Andrew Newby
May 12 '14 at 6:51





If you use enca, you do not need to specify the input encoding. It is often enough just to specify the language: enca -L ru -x utf8 FILE.TXT.
– Alexander Pozdneev
Jul 31 '15 at 19:04







Actually, iconv worked much better as an in-place converter instead of a filter. Converting a file with more than 2 million lines using iconv -f UTF-32 -t UTF-8 input.csv > output.csv saved only about seven hundred thousand lines, only a third. Using the in-place version iconv -f UTF-32 -t UTF-8 file.csv converted successfully all 2 million plus lines.
– Nicolay77
May 19 '16 at 23:04






15 Answers



Stand-alone utility approach


iconv -f UTF-8 -t ISO-8859-1 in.txt > out.txt





I found this the best one if it's available, only it's UTF-8 and ISO-8859-1 (names without dashes wouldn't work for me)
– Antti Sykäri
Sep 16 '08 at 11:43





Antti Sykäri: There must be something wrong with your iconv. The non-dash versions are even used in the examples in the manual page for iconv.
– Troels Arvin
Sep 17 '08 at 21:54





For anyone else who's getting tripped up by the non-dash versions being unavailable, it looks like OSX (and possibly all BSD) versions of iconv don't support the non-dash aliases for the various UTF-* encodings. iconv -l | grep UTF will tell you all the UTF-related encodings that your copy of iconv does support.
– CoreDumpError
May 2 '12 at 19:10







Don't know the encoding of your input file? Use chardet in.txt to generate a best guess. The result can be used as ENCODING in iconv -f ENCODING.
– Stew
Sep 16 '14 at 16:45
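If chardet is not installed, a much cruder guess is possible with trial decoding. The sketch below is not how chardet works; it simply returns the first encoding from a candidate list that decodes the bytes without error. The function name and the candidate list are assumptions for illustration. ISO-8859-15 accepts every byte value, so it has to come last.

```python
def guess_encoding(data: bytes, candidates=("ascii", "utf-8", "iso-8859-15")):
    """Return the first candidate that decodes data cleanly, or None."""
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return enc
    return None
```

The result can be passed to iconv -f in the same way as chardet's output, with the caveat that a clean decode only proves the bytes are valid in that encoding, not that it is the right one.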







Prevent exit at invalid characters (avoiding illegal input sequence at position messages), and replace "weird" characters with "similar" characters: iconv -c -f UTF-8 -t ISO-8859-1//TRANSLIT in.txt > out.txt.
– knb
Feb 6 '15 at 11:07
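The idea behind -c and //TRANSLIT (approximate what you can, drop the rest) can be sketched in Python as well. This is only an approximation built on Unicode compatibility decomposition; iconv uses its own transliteration tables, so the output may differ from what iconv produces. The function name is hypothetical.

```python
import unicodedata


def to_latin1_lossy(text: str) -> bytes:
    """Encode to ISO-8859-1, approximating or dropping what does not fit."""
    out = []
    for ch in text:
        try:
            out.append(ch.encode("iso-8859-1"))
        except UnicodeEncodeError:
            # Try the compatibility decomposition (for example the "fi"
            # ligature becomes "fi"), then drop whatever still cannot be
            # encoded, similar in spirit to iconv -c.
            approx = unicodedata.normalize("NFKD", ch)
            out.append(approx.encode("iso-8859-1", errors="ignore"))
    return b"".join(out)
```

Characters with no decomposition that fits Latin-1, such as the euro sign, are silently removed here, so use this only when losing data is acceptable.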






Try VIM



If you have vim you can use this:





Not tested for every encoding.



The cool part about this is that you don't have to know the source encoding


vim +"set nobomb | set fenc=utf8 | x" filename.txt



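Be aware that this command modifies the file directly.

Explanation:

+ : used by vim to run a command right after opening the file; it is usually used to open a file at a specific line: vim +14 file.txt
| : separator between multiple commands (like ; in bash)
set nobomb : no UTF-8 BOM
set fenc=utf8 : set the new file encoding to UTF-8
x : save and close the file
filename.txt : path to the file
" : the quotes are needed because of the pipes (otherwise the shell would interpret them)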





Quite cool, but somewhat slow. Is there a way to change this to convert a number of files at once (thus saving on vim's initialization costs)?
– DomQ
Apr 25 '16 at 8:20





Thanks, this works for me :)
– Vinay Pareek
Jul 13 '16 at 9:52





Thank you for explanation! I was having a difficult time with beginning of the file until I read up about the bomb/nobomb setting.
– jjwdesign
Oct 3 '16 at 13:34





np, additionally you can view the BOM if you use vim -b or head file.txt|cat -e
– Boop
Oct 3 '16 at 13:38
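For those who only want to check for or remove the BOM without opening vim, here is a small Python sketch. The function name is illustrative; codecs.BOM_UTF8 is the standard library constant for the three bytes EF BB BF.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

This only touches the first three bytes, which corresponds to what set nobomb changes when the file is written back.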







Shouldn't your command use fenc=utf-8 with the hyphen?
– jjwdesign
Oct 3 '16 at 13:39



Under Linux you can use the very powerful recode command to try and convert between the different charsets as well as any line ending issues. recode -l will show you all of the formats and encodings that the tool can convert between. It is likely to be a VERY long list.


Get-Content -Encoding UTF8 FILE-UTF8.TXT | Out-File -Encoding UTF7 FILE-UTF7.TXT



The shortest version, if you can assume that the input BOM is correct:


gc FILE.TXT | Out-File -en utf7 file-utf7.txt





Here's a shorter version that works better. gc .\file-utf8.txt | sc -en utf7 .\file-utf7.txt
– Larry Battle
Jul 15 '12 at 6:16







@LarryBattle: How does Set-Content work better than Out-File?
– Jay Bazuzi
Jul 15 '12 at 19:30








...oh. I guess they're nearly the same thing. I had trouble running your example because I was assuming that both versions were using the same file-utf8.txt file for input since they both had the same output file as file-utf7.txt.
– Larry Battle
Jul 15 '12 at 21:24








This would be really great, except that it doesn't support UTF16. It supports UTF32, but not UTF16! I wouldn't need to convert files, except that a lot of Microsoft software (f.e. SQL server bcp) insists on UTF16 - and then their utility won't convert to it. Interesting to say the least.
– Noah
Aug 22 '13 at 1:45





I tried gc -en Ascii readme.html | Out-File -en UTF8 readme.html but it converts the file to utf-8 but then it's empty! Notepad++ says the file is Ansi-format but reading up as I understand it that's not even a valid charset?? uk.answers.yahoo.com/question/index?qid=20100927014115AAiRExF
– OZZIE
Sep 13 '13 at 12:23





iconv(1)


iconv -f FROM-ENCODING -t TO-ENCODING file.txt



Also there are iconv-based tools in many languages.



Try iconv Bash function



I've put this into .bashrc:




utf8()
{
iconv -f ISO-8859-1 -t UTF-8 $1 > $1.tmp
rm $1
mv $1.tmp $1
}



..to be able to convert files like so:


utf8 MyClass.java





it's better style to use tmp=$(mktemp) to create a temporary file. Also, the line with rm is redundant.
– LMZ
Feb 26 '15 at 22:20





can you complete this function with auto detect input format?
– mlibre
Apr 20 '16 at 20:28





beware, this function deletes the input file without verifying that the iconv call succeeded.
– philwalk
Dec 5 '17 at 19:48
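The concerns raised in these comments (use a proper temporary file, never lose the input when the conversion fails) can be addressed as in the following Python sketch. The function name and the default source encoding are assumptions that mirror the utf8() function above. Note that the replaced file gets the permissions of the temporary file, not those of the original.

```python
import os
import tempfile


def to_utf8_in_place(path, from_enc="iso-8859-1"):
    """Convert path to UTF-8, replacing it only if the conversion succeeded."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so that the final
    # rename stays on the same filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as fout, \
             open(path, "r", encoding=from_enc, newline="") as fin:
            for line in fin:
                fout.write(line)
        os.replace(tmp, path)  # the original is untouched until this point
    except BaseException:
        os.unlink(tmp)  # clean up and leave the input file as it was
        raise
```

Usage would be to_utf8_in_place("MyClass.java"), the counterpart of utf8 MyClass.java.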



Try Notepad++



On Windows I was able to use Notepad++ to do the conversion from ISO-8859-1 to UTF-8. Click "Encoding" and then "Convert to UTF-8".



Oneliner using find, with automatic detection



The character encoding of all matching text files gets detected automatically and all matching text files are converted to utf-8 encoding:




$ find . -type f -iname '*.txt' -exec sh -c 'iconv -f $(file -bi "$1" |sed -e "s/.*[ ]charset=//") -t utf-8 -o converted "$1" && mv converted "$1"' -- {} \;



To perform these steps, a subshell sh is used with -exec, running a one-liner with the -c flag, and passing the filename as the positional argument "$1" with -- {}. In between, the utf-8 output file is temporarily named converted.





Whereby file -bi means:





-b, --brief
Do not prepend filenames to output lines (brief mode).



-i, --mime
Causes the file command to output mime type strings rather than the more traditional human readable ones. Thus it may say ‘text/plain; charset=us-ascii’ rather than “ASCII text”.



The find command is very useful for such file management automation.
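Where find and file are not available, the directory walk can be sketched in Python. Unlike the one-liner above, this version does not detect the encoding; it assumes every matching file uses the same known source encoding. The function name and parameters are illustrative.

```python
import os


def convert_tree(root, from_enc, to_enc="utf-8", suffix=".txt"):
    """Re-encode every file under root whose name ends with suffix (any case)."""
    converted = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            # Finish decoding and encoding before the file is reopened for
            # writing, so a failure leaves the original bytes in place.
            with open(path, "rb") as f:
                data = f.read().decode(from_enc).encode(to_enc)
            with open(path, "wb") as f:
                f.write(data)
            converted.append(path)
    return converted
```

The returned list of paths makes it easy to log what was changed, which the find one-liner does not report.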







I had to adapt this solution a bit to work on Mac OS X, at least at my version. find . -type f -iname *.txt -exec sh -c 'iconv -f $(file -b --mime-encoding "$1" | awk "{print toupper(\$0)}") -t UTF-8 > converted "$1" && mv converted "$1"' -- {} \;
– Brian J. Miller
Jan 20 '17 at 20:07







Your code worked on Windows 7 with MinGW-w64 (latest version) too. Thanks for sharing it!
– silvioprog
Jan 6 at 19:05



PHP iconv()



iconv("UTF-8", "ISO-8859-15", $input);







This statement works great when converting strings, but not for files.
– jjwdesign
Oct 3 '16 at 13:36



DOS/Windows: use Code page


chcp 65001>NUL
type ascii.txt > unicode.txt



The chcp command changes the active code page. Code page 65001 is Microsoft's name for UTF-8. Once the code page is set, the output generated by the commands that follow uses that code page.





Yudit editor supports and converts between many different text encodings, runs on linux, windows, mac, etc.



-Adam





Not sure why this is attracting downvotes and delete votes. If you feel this doesn't answer the question, please consider leaving a comment so I can improve it.
– Adam Davis
Aug 12 '16 at 13:47



To write Java properties files, I normally use this on Linux (Mint and Ubuntu distributions):


$ native2ascii filename.properties



For example:


$ cat test.properties
first=Execução número um
second=Execução número dois

$ native2ascii test.properties
first=Execu\u00e7\u00e3o n\u00famero um
second=Execu\u00e7\u00e3o n\u00famero dois



PS: I wrote "Execution number one/two" in Portuguese to force special characters.



In my case, the first time I ran it I received this message:


$ native2ascii teste.txt
The program 'native2ascii' can be found in the following packages:
* gcj-5-jdk
* openjdk-8-jdk-headless
* gcj-4.8-jdk
* gcj-4.9-jdk
Try: sudo apt install <selected package>



After I installed the first option (gcj-5-jdk), the problem was solved.



I hope this helps someone.





installing the Java Development Kit just to have a converter is kind of an overkill... but good if already using the JDK or having it installed
– Carlos Heuberger
Sep 5 '17 at 8:18
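If installing a JDK just for this is too much, the escaping itself is simple enough to sketch in Python. This covers only the basic direction shown above (non-ASCII characters to \uXXXX escapes) and none of the options of the real native2ascii tool. The function name is made up for illustration.

```python
def escape_non_ascii(text: str) -> str:
    """Escape non-ASCII characters as Java-style Unicode escapes."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            # Go through UTF-16 so that characters outside the Basic
            # Multilingual Plane come out as a surrogate pair, as Java expects.
            units = ch.encode("utf-16-be")
            for i in range(0, len(units), 2):
                out.append("\\u%04x" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)
```

Applied to the first line of test.properties above, it produces the same escaped text as the native2ascii output shown.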




With ruby:


ruby -e "File.write('output.txt', File.read('input.txt').encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: ''))"



Source: https://robots.thoughtbot.com/fight-back-utf-8-invalid-byte-sequences



Use this Python script: https://github.com/goerz/convert_encoding.py
Works on any platform. Requires Python 2.7.



As described in "How do I correct the character encoding of a file?", Synalyze It! lets you easily convert on OS X between all encodings supported by the ICU library.



Additionally you can display some bytes of a file translated to Unicode from all the encodings to see quickly which is the right one for your file.



