Best way to convert text files between character sets?





What is the fastest, easiest tool or method to convert text files between character sets?



Specifically, I need to convert from UTF-8 to ISO-8859-15 and vice versa.



Everything goes: one-liners in your favorite scripting language, command-line tools or other OS utilities, web sites, etc.



On Linux/UNIX/OS X/cygwin:



GNU iconv, suggested by Troels Arvin, is best used as a filter. It seems to be universally available. Example:


$ iconv -f UTF-8 -t ISO-8859-15 in.txt > out.txt



As pointed out by Ben, there is an online converter using iconv.



GNU recode (manual), suggested by Cheekysoft, will convert one or several files in place. Example:


$ recode UTF8..ISO-8859-15 in.txt



This one uses shorter aliases:


$ recode utf8..l9 in.txt



Recode also supports surfaces, which can be used to convert between different line-ending types and encodings:



Convert newlines from LF (Unix) to CR-LF (DOS):


$ recode ../CR-LF in.txt



Base64 encode file:


$ recode ../Base64 in.txt



You can also combine them.



Convert a Base64 encoded UTF8 file with Unix line endings to Base64 encoded Latin 1 file with Dos line endings:


$ recode utf8/Base64..l1/CR-LF/Base64 file.txt



On Windows with Powershell (Jay Bazuzi):



PS C:\> gc -en utf8 in.txt | Out-File -en ascii out.txt





(No ISO-8859-15 support though; it says that supported charsets are unicode, utf7, utf8, utf32, ascii, bigendianunicode, default, and oem.)



Do you mean ISO-8859-1 support? Using "String" does this, e.g. for the reverse direction:


gc -en string in.txt | Out-File -en utf8 out.txt



Note: The possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".





I tried gc -en Ascii readme.html | Out-File -en UTF8 readme.html but it converts the file to utf-8 but then it's empty! Notepad++ says the file is Ansi-format but reading up as I understand it that's not even a valid charset?? uk.answers.yahoo.com/question/index?qid=20100927014115AAiRExF
– OZZIE
Sep 13 '13 at 12:24







Just come across this looking for an answer to a related question - great summary! Just thought it was worth adding that recode will act as a filter as well if you don't pass it any filenames, e.g.: recode utf8..l9 < in.txt > out.txt
– Jez
Mar 6 '14 at 11:05








iconv.com/iconv.htm seems to be dead for me? (timeout)
– Andrew Newby
May 12 '14 at 6:51





If you use enca, you do not need to specify the input encoding. It is often enough just to specify the language: enca -L ru -x utf8 FILE.TXT.
– Alexander Pozdneev
Jul 31 '15 at 19:04







Actually, iconv worked much better as an in-place converter than as a filter. Converting a file with more than 2 million lines using iconv -f UTF-32 -t UTF-8 input.csv > output.csv saved only about seven hundred thousand lines, only a third. Using the in-place version iconv -f UTF-32 -t UTF-8 file.csv successfully converted all 2 million plus lines.
– Nicolay77
May 19 '16 at 23:04






15 Answers



Stand-alone utility approach


iconv -f UTF-8 -t ISO-8859-1 in.txt > out.txt





I found this the best one if it's available, only it's UTF-8 and ISO-8859-1 (names without dashes wouldn't work for me)
– Antti Sykäri
Sep 16 '08 at 11:43





Antti Sykäri: There must be something wrong with your iconv. The non-dash versions are even used in the examples in the manual page for iconv.
– Troels Arvin
Sep 17 '08 at 21:54





For anyone else who's getting tripped up by the non-dash versions being unavailable, it looks like OSX (and possibly all BSD) versions of iconv don't support the non-dash aliases for the various UTF-* encodings. iconv -l | grep UTF will tell you all the UTF-related encodings that your copy of iconv does support.
– CoreDumpError
May 2 '12 at 19:10







Don't know the encoding of your input file? Use chardet in.txt to generate a best guess. The result can be used as ENCODING in iconv -f ENCODING.
– Stew
Sep 16 '14 at 16:45


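Building on the chardet tip above, the two steps can be combined in a small sketch. It assumes the chardet command-line tool is installed and prints output in the form "FILE: ENCODING with confidence N" (some installs name the tool chardetect instead; check your version's output format first):

```shell
# Detect the encoding with chardet, then feed the guess to iconv.
# The awk field assumes output like "in.txt: utf-8 with confidence 0.99".
enc=$(chardet in.txt | awk '{print $2}')
iconv -f "$enc" -t UTF-8 in.txt > out.txt
```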





Prevent exit at invalid characters (avoiding "illegal input sequence at position" messages), and replace "weird" characters with "similar" characters: iconv -c -f UTF-8 -t ISO-8859-1//TRANSLIT in.txt > out.txt.
– knb
Feb 6 '15 at 11:07






Try VIM



If you have vim you can use this:





Not tested for every encoding.



The cool part about this is that you don't have to know the source encoding


vim +"set nobomb | set fenc=utf8 | x" filename.txt



Be aware that this command modifies the file directly.

What each part of the command does:

+ : tells vim to run a command right after opening the file (usually used to open a file at a specific line, e.g. vim +14 file.txt)
| : separator between multiple commands (like ; in a shell)
set nobomb : write the file without a byte order mark
set fenc=utf8 : set the new file encoding to UTF-8
x : save and close the file
filename.txt : path to the file
" : the quotes are needed because of the pipes





Quite cool, but somewhat slow. Is there a way to change this to convert a number of files at once (thus saving on vim's initialization costs)?
– DomQ
Apr 25 '16 at 8:20





thanks, this works for me :)
– Vinay Pareek
Jul 13 '16 at 9:52





Thank you for the explanation! I was having a difficult time with the beginning of the file until I read up about the bomb/nobomb setting.
– jjwdesign
Oct 3 '16 at 13:34





np, additionally you can view the BOM if you use vim -b or head file.txt | cat -e
– Boop
Oct 3 '16 at 13:38







Shouldn't your command use fenc=utf-8 with the hyphen?
– jjwdesign
Oct 3 '16 at 13:39



Under Linux you can use the very powerful recode command to convert between different charsets, as well as sort out line-ending issues. recode -l will show you all of the formats and encodings that the tool can convert between. It is likely to be a VERY long list.
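Since that list is long, it helps to filter it. For example, to find the aliases recode knows for ISO-8859-15 (a sketch; assumes recode is installed):

```shell
# Search recode's long list of known charsets for the Latin-9 entry
recode -l | grep -i 8859-15
```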


Get-Content -Encoding UTF8 FILE-UTF8.TXT | Out-File -Encoding UTF7 FILE-UTF7.TXT



The shortest version, if you can assume that the input BOM is correct:


gc FILE.TXT | Out-File -en utf7 file-utf7.txt





Here's a shorter version that works better. gc .\file-utf8.txt | sc -en utf7 .\file-utf7.txt
– Larry Battle
Jul 15 '12 at 6:16







@LarryBattle: How does Set-Content work better than Out-File?
– Jay Bazuzi
Jul 15 '12 at 19:30








...oh. I guess they're nearly the same thing. I had trouble running your example because I was assuming that both versions were using the same file-utf8.txt file for input, since they both had the same output file, file-utf7.txt.
– Larry Battle
Jul 15 '12 at 21:24








This would be really great, except that it doesn't support UTF16. It supports UTF32, but not UTF16! I wouldn't need to convert files, except that a lot of Microsoft software (f.e. SQL server bcp) insists on UTF16 - and then their utility won't convert to it. Interesting to say the least.
– Noah
Aug 22 '13 at 1:45








iconv(1)


iconv -f FROM-ENCODING -t TO-ENCODING file.txt



Also there are iconv-based tools in many languages.



Try iconv Bash function



I've put this into .bashrc:




utf8()
{
    iconv -f ISO-8859-1 -t UTF-8 "$1" > "$1.tmp"
    rm "$1"
    mv "$1.tmp" "$1"
}



...to be able to convert files like so:


utf8 MyClass.java





it's better style to use tmp=$(mktemp) to create a temporary file. Also, the line with rm is redundant.
– LMZ
Feb 26 '15 at 22:20





can you extend this function to auto-detect the input format?
– mlibre
Apr 20 '16 at 20:28





beware, this function deletes the input file without verifying that the iconv call succeeded.
– philwalk
Dec 5 '17 at 19:48
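Taking those comments into account, here is a variant of the function (a sketch, not the original answer's code) that uses mktemp and only replaces the original file if the iconv call succeeds:

```shell
utf8() {
    # Create a temporary file safely instead of hard-coding "$1.tmp"
    tmp=$(mktemp) || return 1
    if iconv -f ISO-8859-1 -t UTF-8 "$1" > "$tmp"; then
        # Only overwrite the input once the conversion has succeeded
        mv "$tmp" "$1"
    else
        rm -f "$tmp"
        return 1
    fi
}
```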



Try Notepad++



On Windows I was able to use Notepad++ to do the conversion from ISO-8859-1 to UTF-8. Click "Encoding" and then "Convert to UTF-8".





One-liner using find, with automatic detection



The character encoding of all matching text files gets detected automatically and all matching text files are converted to utf-8 encoding:




$ find . -type f -iname "*.txt" -exec sh -c 'iconv -f $(file -bi "$1" | sed -e "s/.*[ ]charset=//") -t utf-8 -o converted "$1" && mv converted "$1"' -- {} \;



To perform these steps, a sub shell sh is used with -exec, running a one-liner with the -c flag, and passing the filename as the positional argument "$1" with -- {}. In between, the utf-8 output file is temporarily named converted.





Whereby file -bi means:





-b, --brief
Do not prepend filenames to output lines (brief mode).



-i, --mime
Causes the file command to output MIME type strings rather than the more traditional human-readable ones. Thus it may say "text/plain; charset=us-ascii" rather than "ASCII text".
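The detection step can be tried by hand before running the full one-liner (a sketch; assumes the file command supports -bi as described above and that iconv accepts the charset name it reports):

```shell
# Extract only the charset portion of file's MIME output, e.g. "utf-8" or "us-ascii",
# then use it as the source encoding for iconv
enc=$(file -bi in.txt | sed -e 's/.*charset=//')
iconv -f "$enc" -t utf-8 in.txt > out.txt
```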



The find command is very useful for such file management automation.







I had to adapt this solution a bit to work on Mac OS X, at least on my version: find . -type f -iname "*.txt" -exec sh -c 'iconv -f $(file -b --mime-encoding "$1" | awk "{print toupper($0)}") -t UTF-8 > converted "$1" && mv converted "$1"' -- {} \;
– Brian J. Miller
Jan 20 '17 at 20:07







Your code worked on Windows 7 with MinGW-w64 (latest version) too. Thanks for sharing it!
– silvioprog
Jan 6 at 19:05



PHP iconv()



iconv("UTF-8", "ISO-8859-15", $input);







This statement works great when converting strings, but not for files.
– jjwdesign
Oct 3 '16 at 13:36



DOS/Windows: use Code page


chcp 65001>NUL
type ascii.txt > unicode.txt



The chcp command can be used to change the active code page. Code page 65001 is Microsoft's name for UTF-8. After setting the code page, the output generated by subsequent commands will use that code page.





The Yudit editor supports and converts between many different text encodings; it runs on Linux, Windows, Mac, etc.



-Adam





Not sure why this is attracting downvotes and delete votes. If you feel this doesn't answer the question, please consider leaving a comment so I can improve it.
– Adam Davis
Aug 12 '16 at 13:47



To write properties files (Java), I normally use this on Linux (Mint and Ubuntu distributions):


$ native2ascii filename.properties



For example:


$ cat test.properties
first=Execução número um
second=Execução número dois

$ native2ascii test.properties
first=Execu\u00e7\u00e3o n\u00famero um
second=Execu\u00e7\u00e3o n\u00famero dois



PS: I wrote "Execution number one/two" in Portuguese to force special characters.



In my case, on the first execution I received this message:


$ native2ascii teste.txt
The program 'native2ascii' can be found in the following packages:
* gcj-5-jdk
* openjdk-8-jdk-headless
* gcj-4.8-jdk
* gcj-4.9-jdk
Try: sudo apt install <selected package>



When I installed the first option (gcj-5-jdk), the problem was solved.



I hope this helps someone.





installing the Java Development Kit just to have a converter is kind of overkill... but good if you're already using the JDK or have it installed
– Carlos Heuberger
Sep 5 '17 at 8:18




With Ruby:


ruby -e "File.write('output.txt', File.read('input.txt').encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: ''))"



Source: https://robots.thoughtbot.com/fight-back-utf-8-invalid-byte-sequences
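A roughly similar cleanup is possible with iconv alone (a sketch, not from the answer above): the -c flag makes iconv silently drop input sequences that are invalid in the source encoding instead of aborting:

```shell
# Strip byte sequences that are not valid UTF-8
# (note: -c drops the offending bytes rather than replacing them)
iconv -f UTF-8 -t UTF-8 -c input.txt > output.txt
```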



Use this Python script: https://github.com/goerz/convert_encoding.py
Works on any platform. Requires Python 2.7.



As described in How do I correct the character encoding of a file?, Synalyze It! lets you easily convert between all encodings supported by the ICU library on OS X.



Additionally, you can display some bytes of a file translated to Unicode from all the encodings, to see quickly which one is right for your file.




