Best way to convert text files between character sets?
What is the fastest, easiest tool or method to convert text files between character sets?
Specifically, I need to convert from UTF-8 to ISO-8859-15 and vice versa.
Everything goes: one-liners in your favorite scripting language, command-line tools, or other utilities for your OS, web sites, etc.
On Linux/UNIX/OS X/cygwin:
Gnu iconv suggested by Troels Arvin is best used as a filter. It seems to be universally available. Example:
$ iconv -f UTF-8 -t ISO-8859-15 in.txt > out.txt
As pointed out by Ben, there is an online converter using iconv.
Gnu recode (manual) suggested by Cheekysoft will convert one or several files in-place. Example:
$ recode UTF8..ISO-8859-15 in.txt
This one uses shorter aliases:
$ recode utf8..l9 in.txt
Recode also supports surfaces which can be used to convert between different line ending types and encodings:
Convert newlines from LF (Unix) to CR-LF (DOS):
$ recode ../CR-LF in.txt
Base64 encode file:
$ recode ../Base64 in.txt
You can also combine them.
Convert a Base64 encoded UTF8 file with Unix line endings to Base64 encoded Latin 1 file with Dos line endings:
$ recode utf8/Base64..l1/CR-LF/Base64 file.txt
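A quick way to sanity-check any of these conversions is a round trip back to the original encoding and a byte comparison. A sketch using iconv (file names are illustrative):

```shell
# Round-trip check: UTF-8 -> ISO-8859-15 -> UTF-8 should reproduce the
# original bytes exactly, as long as every character exists in Latin-9.
printf 'caf\303\251 \342\202\254\n' > original.txt   # "café €" in UTF-8
iconv -f UTF-8 -t ISO-8859-15 original.txt > latin9.txt
iconv -f ISO-8859-15 -t UTF-8 latin9.txt > roundtrip.txt
cmp original.txt roundtrip.txt && echo "round-trip OK"
```

If `cmp` reports a difference, some character in the file is not representable in the target charset.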
On Windows with Powershell (Jay Bazuzi):
PS C:> gc -en utf8 in.txt | Out-File -en ascii out.txt
(No ISO-8859-15 support though; it says that supported charsets are unicode, utf7, utf8, utf32, ascii, bigendianunicode, default, and oem.)
Do you mean ISO-8859-1 support? Using "String" does this, e.g. for the reverse direction:
gc -en string in.txt | Out-File -en utf8 out.txt
Note: The possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".
gc -en Ascii readme.html | Out-File -en UTF8 readme.html
Just came across this looking for an answer to a related question - great summary! Just thought it was worth adding that recode will act as a filter as well if you don't pass it any filenames, e.g.: recode utf8..l9 < in.txt > out.txt
– Jez
Mar 6 '14 at 11:05
iconv.com/iconv.htm seems to be dead for me? (timeout)
– Andrew Newby
May 12 '14 at 6:51
If you use enca, you do not need to specify the input encoding. It is often enough just to specify the language: enca -L ru -x utf8 FILE.TXT. – Alexander Pozdneev
Jul 31 '15 at 19:04
Actually, iconv worked much better as an in-place converter instead of a filter. Converting a file with more than 2 million lines using
iconv -f UTF-32 -t UTF-8 input.csv > output.csv
saved only about seven hundred thousand lines, only a third. Using the in-place version iconv -f UTF-32 -t UTF-8 file.csv converted all 2 million plus lines successfully. – Nicolay77
May 19 '16 at 23:04
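As a side note on the filter-vs-file question above: GNU iconv can write the output file itself with -o instead of relying on a shell redirect. A sketch, assuming GNU iconv (file names are illustrative):

```shell
# Let iconv manage the output file directly rather than redirecting
# stdout; -o is a GNU iconv option and may not exist on BSD/macOS iconv.
printf 'caf\303\251\n' > in.txt
iconv -f UTF-8 -t ISO-8859-15 -o out.txt in.txt
```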
15 Answers
Stand-alone utility approach
iconv -f UTF-8 -t ISO-8859-1 in.txt > out.txt
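Since iconv converts one file at a time, a whole directory takes a small wrapper loop. A hedged sketch (the glob and both encodings are illustrative; the temp-file-then-move step keeps a failed conversion from clobbering the original):

```shell
# Convert every .txt file in the current directory from ISO-8859-1 to
# UTF-8 in place; mv only runs if iconv succeeded.
for f in *.txt; do
  iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.utf8" && mv "$f.utf8" "$f"
done
```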
I found this the best one if it's available; note it's UTF-8 and ISO-8859-1 here (the names without dashes wouldn't work for me)
– Antti Sykäri
Sep 16 '08 at 11:43
Antti Sykäri: There must be something wrong with your iconv. The non-dash versions are even used in the examples in the manual page for iconv.
– Troels Arvin
Sep 17 '08 at 21:54
For anyone else who's getting tripped up by the non-dash versions being unavailable, it looks like OSX (and possibly all BSD) versions of iconv don't support the non-dash aliases for the various UTF-* encodings.
iconv -l | grep UTF
will tell you all the UTF-related encodings that your copy of iconv does support. – CoreDumpError
May 2 '12 at 19:10
Don't know the encoding of your input file? Use
chardet in.txt
to generate a best guess. The result can be used as ENCODING in iconv -f ENCODING. – Stew
Sep 16 '14 at 16:45
Prevent exit at invalid characters (avoiding
illegal input sequence at position
messages), and replace "weird" characters with "similar" characters: iconv -c -f UTF-8 -t ISO-8859-1//TRANSLIT in.txt > out.txt. – knb
Feb 6 '15 at 11:07
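To illustrate what -c and //TRANSLIT do, here is a small sketch; the exact transliteration output varies between iconv implementations and locales, so no particular result is guaranteed:

```shell
# -c silently drops characters that cannot be converted; //TRANSLIT asks
# iconv to approximate them instead (e.g. GNU iconv may turn "€" into
# "EUR"). What survives depends on the iconv build and the locale.
printf '\342\202\254 caf\303\251\n' | iconv -c -f UTF-8 -t ASCII//TRANSLIT
```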
Try VIM
If you have vim you can use this:
(Not tested for every encoding.)
The cool part about this is that you don't have to know the source encoding:
vim +"set nobomb | set fenc=utf8 | x" filename.txt
Be aware that this command modifies the file directly.
Explanation:
+ : used by vim to enter a command directly when opening a file, as usually done to open a file at a specific line: vim +14 file.txt
| : separator of multiple commands (like ; in bash)
set nobomb : no UTF-8 BOM
set fenc=utf8 : set the new encoding to UTF-8
x : save and close the file
filename.txt : path to the file
" : the quotes are needed because of the pipes (otherwise vim would treat them as shell pipes)
Quite cool, but somewhat slow. Is there a way to change this to convert a number of files at once (thus saving on vim's initialization costs)?
– DomQ
Apr 25 '16 at 8:20
thanks, this works for me :)
– Vinay Pareek
Jul 13 '16 at 9:52
Thank you for explanation! I was having a difficult time with beginning of the file until I read up about the bomb/nobomb setting.
– jjwdesign
Oct 3 '16 at 13:34
np, additionally you can view the BOM if you use vim -b or head file.txt | cat -e
– Boop
Oct 3 '16 at 13:38
Shouldn't your command use fenc=utf-8 with the hyphen?
– jjwdesign
Oct 3 '16 at 13:39
Under Linux you can use the very powerful recode command to convert between different charsets, as well as fix line ending issues. recode -l will show you all of the formats and encodings that the tool can convert between. It is likely to be a VERY long list.
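Since that list is so long, a grep over it is the practical way to find a charset name. A sketch, assuming recode is installed (it often isn't by default, e.g. apt install recode):

```shell
# List only the Latin-related charsets recode knows about; skip quietly
# if recode is not installed on this machine.
command -v recode >/dev/null && recode -l | grep -i 'latin'
```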
Get-Content -Encoding UTF8 FILE-UTF8.TXT | Out-File -Encoding UTF7 FILE-UTF7.TXT
The shortest version, if you can assume that the input BOM is correct:
gc FILE.TXT | Out-File -en utf7 file-utf7.txt
Here's a shorter version that works better.
gc .\file-utf8.txt | sc -en utf7 .\file-utf7.txt
– Larry Battle
Jul 15 '12 at 6:16
@LarryBattle: How does
Set-Content
work better than Out-File? – Jay Bazuzi
Jul 15 '12 at 19:30
...oh. I guess they're nearly the same thing. I had trouble running your example because I was assuming that both versions were using the same
file-utf8.txt
file for input, since they both had the same output file as file-utf7.txt. – Larry Battle
Jul 15 '12 at 21:24
This would be really great, except that it doesn't support UTF-16. It supports UTF-32, but not UTF-16! I wouldn't need to convert files, except that a lot of Microsoft software (e.g. SQL Server bcp) insists on UTF-16 - and then their utility won't convert to it. Interesting to say the least.
– Noah
Aug 22 '13 at 1:45
I tried
gc -en Ascii readme.html | Out-File -en UTF8 readme.html
but it converts the file to UTF-8 but then it's empty! Notepad++ says the file is Ansi-format but reading up as I understand it that's not even a valid charset?? uk.answers.yahoo.com/question/index?qid=20100927014115AAiRExF – OZZIE
Sep 13 '13 at 12:23
iconv(1)
iconv -f FROM-ENCODING -t TO-ENCODING file.txt
Also there are iconv-based tools in many languages.
Try iconv Bash function
I've put this into .bashrc:
utf8()
{
iconv -f ISO-8859-1 -t UTF-8 $1 > $1.tmp
rm $1
mv $1.tmp $1
}
...to be able to convert files like so:
utf8 MyClass.java
it's better style to use tmp=$(mktemp) to create a temporary file. Also, the line with rm is redundant.
– LMZ
Feb 26 '15 at 22:20
can you complete this function with auto detect input format?
– mlibre
Apr 20 '16 at 20:28
beware, this function deletes the input file without verifying that the iconv call succeeded.
– philwalk
Dec 5 '17 at 19:48
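Picking up the caveats raised in the comments above (unquoted argument, redundant rm, no check that iconv succeeded), a safer variant might look like this. This is a sketch, not the original author's code:

```shell
# Convert a file to UTF-8 in place; the original is only replaced
# when iconv exits successfully, and the temp file is cleaned up
# otherwise.
utf8() {
  local tmp
  tmp=$(mktemp) || return 1
  if iconv -f ISO-8859-1 -t UTF-8 "$1" > "$tmp"; then
    mv "$tmp" "$1"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Usage is the same as before: utf8 MyClass.java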
Try Notepad++
On Windows I was able to use Notepad++ to do the conversion from ISO-8859-1 to UTF-8. Click "Encoding" and then "Convert to UTF-8".
Oneliner using find, with automatic detection
The character encoding of all matching text files gets detected automatically and all matching text files are converted to utf-8
encoding:
utf-8
$ find . -type f -iname "*.txt" -exec sh -c 'iconv -f $(file -bi "$1" | sed -e "s/.*[ ]charset=//") -t utf-8 -o converted "$1" && mv converted "$1"' -- {} \;
To perform these steps, a sub shell sh
is used with -exec
, running a one-liner with the -c
flag, and passing the filename as the positional argument "$1"
with -- {}
. In between, the utf-8
output file is temporarily named converted
.
Whereby file -bi means:
-b, --brief
Do not prepend filenames to output lines (brief mode).
-i, --mime
Causes the file command to output mime type strings rather than the more traditional human readable ones. Thus it may say ‘text/plain; charset=us-ascii’ rather than “ASCII text”.
The find
command is very useful for such file management automation.
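The same detect-and-convert idea can also be written as a plain loop, which is easier to debug than the -exec one-liner. A sketch (the file pattern and target encoding are illustrative; files whose encoding file cannot name, e.g. binary, are left untouched):

```shell
# Detect each file's encoding with file(1), then convert to UTF-8,
# replacing the original only when the conversion succeeded.
find . -type f -name '*.txt' | while IFS= read -r f; do
  enc=$(file -b --mime-encoding "$f")
  if iconv -f "$enc" -t UTF-8 "$f" > "$f.tmp"; then
    mv "$f.tmp" "$f"
  else
    rm -f "$f.tmp"   # unsupported or undetectable encoding; keep original
  fi
done
```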
I had to adapt this solution a bit to work on Mac OS X, at least at my version.
find . -type f -iname "*.txt" -exec sh -c 'iconv -f $(file -b --mime-encoding "$1" | awk "{print toupper(\$0)}") -t UTF-8 "$1" > converted && mv converted "$1"' -- {} \;
– Brian J. Miller
Jan 20 '17 at 20:07
Your code worked on Windows 7 with MinGW-w64 (latest version) too. Thanks for sharing it!
– silvioprog
Jan 6 at 19:05
PHP iconv()
iconv("UTF-8", "ISO-8859-15", $input);
This statement works great when converting strings, but not for files.
– jjwdesign
Oct 3 '16 at 13:36
DOS/Windows: use Code page
chcp 65001>NUL
type ascii.txt > unicode.txt
The chcp command can be used to change the code page. Code page 65001 is the Microsoft name for UTF-8. After setting the code page, the output generated by the following commands will be in that code page.
The Yudit editor supports and converts between many different text encodings, and runs on Linux, Windows, Mac, etc.
-Adam
Not sure why this is attracting downvotes and delete votes. If you feel this doesn't answer the question, please consider leaving a comment so I can improve it.
– Adam Davis
Aug 12 '16 at 13:47
To write properties files (Java) I normally use this on Linux (Mint and Ubuntu distributions):
$ native2ascii filename.properties
For example:
$ cat test.properties
first=Execução número um
second=Execução número dois
$ native2ascii test.properties
first=Execu\u00e7\u00e3o n\u00famero um
second=Execu\u00e7\u00e3o n\u00famero dois
PS: I wrote "execution number one/two" in Portuguese to force special characters.
In my case, on the first execution I received this message:
$ native2ascii teste.txt
The program 'native2ascii' can be found in the following packages:
* gcj-5-jdk
* openjdk-8-jdk-headless
* gcj-4.8-jdk
* gcj-4.9-jdk
Try: sudo apt install <selected package>
When I installed the first option (gcj-5-jdk) the problem was solved.
I hope this helps someone.
installing the Java Development Kit just to have a converter is kind of overkill... but good if you're already using the JDK or have it installed
– Carlos Heuberger
Sep 5 '17 at 8:18
With ruby:
ruby -e "File.write('output.txt', File.read('input.txt').encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: ''))"
Source: https://robots.thoughtbot.com/fight-back-utf-8-invalid-byte-sequences
Use this Python script: https://github.com/goerz/convert_encoding.py
Works on any platform. Requires Python 2.7.
As described in How do I correct the character encoding of a file?, Synalyze It! lets you easily convert on OS X between all encodings supported by the ICU library.
Additionally you can display some bytes of a file translated to Unicode from all the encodings to see quickly which is the right one for your file.