Iconv ascii utf 8

An array arrives from the geolocation module in ASCII encoding.
joxi.net/8An0aqxuqEl0Jm
The site is in UTF-8.


mbstring currently implements the following filters for detecting encodings. If the byte sequence in the source string does not match any of the listed encodings, encoding detection fails.
UTF-8, UTF-7, ASCII, EUC-JP, SJIS, eucJP-win, SJIS-win, JIS, ISO-2022-JP

I’m trying to transcode a bunch of files from US-ASCII to UTF-8.

For that, I’m using iconv:

Thing is, my original files are US-ASCII encoded, so the conversion doesn’t happen. Apparently this is because ASCII is a subset of UTF-8.

There’s no need for the textfile to appear otherwise until non-ascii characters are introduced

True. If I introduce a non-ASCII character in the file and save it, let’s say with Eclipse, the file encoding (charset) is switched to UTF-8.

In my case, I’d like to force iconv to transcode the files to UTF-8 anyway, whether or not they contain non-ASCII characters.

Note: The reason is that my PHP code (non-ASCII files...) is dealing with some non-ASCII strings, which causes the strings not to be interpreted correctly (French):

Il était une fois. l’homme série animée mythique d’Albert

Barillé (Procidis), 1ère

EDIT

  • US-ASCII — is — a subset of UTF-8 (see Ned’s answer below)
  • Meaning that US-ASCII files are actually encoded in UTF-8
  • My problem came from somewhere else


ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" would be exactly the same bytes. There’s no difference between them, so there’s no need to do anything.

It looks like your problem is that the files are not actually ASCII. You need to determine what encoding they are using, and transcode them properly.

  • file only guesses at the file encoding and may be wrong (especially in cases where special characters only appear late in large files).
  • you can use hexdump to look at bytes of non-7-bit-ascii text and compare against code tables for common encodings (iso-8859-*, utf-8) to decide for yourself what the encoding is.
  • iconv will use whatever input/output encoding you specify regardless of what the contents of the file are. If you specify the wrong input encoding the output will be garbled.
  • even after running iconv , file may not report any change due to the limited way in which file attempts to guess at the encoding. For a specific example, see my long answer.
  • 7-bit ascii (aka us-ascii) is identical at a byte level to utf-8 and the 8-bit ascii extensions (iso-8859-*). So if your file only has 7-bit characters, then you can call it utf-8, iso-8859-* or us-ascii because at a byte level they are all identical. It only makes sense to talk about utf-8 and other encodings (in this context) once your file has characters outside the 7-bit ascii range.
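The last point can be checked by comparing bytes directly. The following is only a minimal sketch in PHP: the sample string is made up, and it assumes the iconv extension is loaded and knows the charset name US-ASCII.

```php
<?php
// Hypothetical sample: nothing but 7-bit ASCII characters.
$ascii = "plain 7-bit text";

// "Converting" US-ASCII to UTF-8 should hand back the very same bytes.
$utf8 = iconv('US-ASCII', 'UTF-8', $ascii);

var_dump($utf8 === $ascii);                    // identical strings
var_dump(bin2hex($utf8) === bin2hex($ascii));  // identical at the byte level
```

If both checks report true, there was nothing for the conversion to change, which is why the file still looks like us-ascii afterwards.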

I ran into this today and came across your question. Perhaps I can add a little more information to help other people who run into this issue.

First, the term ASCII is overloaded, and that leads to confusion.

7-bit ASCII only includes 128 characters (00-7F or 0-127 in decimal). 7-bit ASCII is also referred to as US-ASCII.

UTF-8 encoding uses the same encoding as 7-bit ASCII for its first 128 characters. So a text file that only contains characters from that range of the first 128 characters will be identical at a byte level whether encoded with UTF-8 or 7-bit ASCII.

The term extended ascii (or high ascii) refers to eight-bit or larger character encodings that include the standard seven-bit ASCII characters, plus additional characters.

ISO-8859-1 (aka "ISO Latin 1") is a specific 8-bit ASCII extension standard that covers most characters for Western Europe. There are other ISO standards for Eastern European languages and Cyrillic languages. ISO-8859-1 includes characters like Ö, é, ñ and ß for German and Spanish. "Extension" means that ISO-8859-1 includes the 7-bit ASCII standard and adds characters to it by using the 8th bit. So for the first 128 characters, it is equivalent at a byte level to ASCII and UTF-8 encoded files. However, when you start dealing with characters beyond the first 128, you are no longer UTF-8 equivalent at the byte level, and you must do a conversion if you want your "extended ascii" file to be UTF-8 encoded.

One lesson I learned today is that we can’t trust file to always give correct interpretation of a file’s character encoding.

The command tells only what the file looks like, not what it is (in the case where file looks at the content). It is easy to fool the program by putting a magic number into a file the content of which does not match it. Thus the command is not usable as a security tool other than in specific situations.

file looks for magic numbers in the file that hint at the type, but these can be wrong, no guarantee of correctness. file also tries to guess the character encoding by looking at the bytes in the file. Basically file has a series of tests that helps it guess at the file type and encoding.

My file is a large CSV file. file reports this file as us-ascii encoded, which is WRONG.

My file has umlauts in it (ie Ö). The first non-7-bit-ascii doesn’t show up until over 100k lines into the file. I suspect this is why file doesn’t realize the file encoding isn’t US-ASCII.

I’m on a mac, so using PCRE’s grep . With gnu grep you could use the -P option. Alternatively on a mac, one could install coreutils (via homebrew or other) in order to get gnu grep.


I haven’t dug into the source-code of file , and the man page doesn’t discuss the text encoding detection in detail, but I am guessing file doesn’t look at the whole file before guessing encoding.

Whatever my file’s encoding is, these non-7-bit-ASCII characters break stuff. My German CSV file is ; -separated and extracting a single column doesn’t work.

Note the cut error and that my "tmp" file has only 102320 lines with the first special character on line 102321.

Let’s take a look at how these non-ASCII characters are encoded. I dump the first non-7-bit-ascii into hexdump , do a little formatting, remove the newlines ( 0a ) and take just the first few.

Another way. I know the first non-7-bit-ASCII char is at position 85 on line 102321. I grab that line and tell hexdump to take the two bytes starting at position 85. You can see the special (non-7-bit-ASCII) character represented by a ".", and the next byte is "M", so this is a single-byte character encoding.

In both cases, we see the special character is represented by d6 . Since this character is an Ö which is a German letter, I am guessing that ISO-8859-1 should include this. Sure enough you can see "d6" is a match (https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout).

Important question: how do I know this character is an Ö without being sure of the file encoding? The answer is context. I opened the file, read the text and then determined what character it is supposed to be. If I open it in vim it displays as an Ö because vim does a better job of guessing the character encoding (in this case) than file does.

So, my file seems to be ISO-8859-1. In theory I should check the rest of the non-7-bit-ASCII characters to make sure ISO-8859-1 is a good fit. There is nothing that forces a program to only use a single encoding when writing a file to disk (other than good manners).

I’ll skip the check and move on to conversion step.

Hmm. file still tells me this file is US-ASCII even after conversion. Let’s check with hexdump again.

Definitely a change. Note that we have two bytes of non-7-bit-ASCII (represented by the "." on the right) and the hex code for the two bytes is now c3 96 . If we take a look, seems we have UTF-8 now (c3 96 is the right encoding of Ö in UTF-8) http://www.utf8-chartable.de/
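The same d6 to c3 96 change can be reproduced on a single byte. This is a small PHP sketch rather than the commands used above; it assumes the iconv extension is available.

```php
<?php
// One byte in ISO-8859-1: 0xD6 is the letter Ö.
$latin1 = "\xD6";

// Tell iconv what the input really is and ask for UTF-8.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

echo bin2hex($latin1), PHP_EOL; // d6
echo bin2hex($utf8), PHP_EOL;   // c396
```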

But file still reports our file as us-ascii ? Well, I think this goes back to the point about file not looking at the whole file and the fact that the first non-7-bit-ASCII characters don’t occur until deep in the file.

I’ll use sed to stick a Ö at the beginning of the file and see what happens.

Cool, we have an umlaut. Note the encoding though is c3 96 (utf-8). Hmm.

Checking our other umlauts in the same file again:

ISO-8859-1. Oops! Just goes to show how easy it is to get the encodings screwed up.

Let’s try converting our new test file with the umlaut at the front and see what happens.

Oops. That first umlaut that was UTF-8 was interpreted as ISO-8859-1 since that is what we told iconv . The second umlaut is correctly converted from d6 to c3 96 .
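What happens to the umlaut that was already UTF-8 can be shown in isolation. A minimal PHP sketch, assuming the iconv extension: the two UTF-8 bytes are each treated as one ISO-8859-1 character and encoded again.

```php
<?php
// Ö already encoded as UTF-8 (two bytes: c3 96).
$alreadyUtf8 = "\xC3\x96";

// Wrong assumption: treat those bytes as ISO-8859-1 and convert again.
$doubleEncoded = iconv('ISO-8859-1', 'UTF-8', $alreadyUtf8);

echo bin2hex($doubleEncoded), PHP_EOL; // c383c296
```

The byte c3 becomes c3 83 and the byte 96 becomes c2 96, so one character turns into two and the text is garbled.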

I’ll try again, this time I will use vim to do the Ö insertion instead of sed . vim seemed to detect the encoding better (as "latin1" aka ISO-8859-1) so perhaps it will insert the new Ö with a consistent encoding.

Looks good. Looks like ISO-8859-1 for new and old umlauts.

Boom! Moral of the story. Don’t trust file to always guess your encoding right. Easy to mix encodings within the same file. When in doubt, look at the hex.
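"Look at the hex" can also be done from PHP. The helper below is a hypothetical sketch (the function name and the sample bytes are made up for illustration): it reports the offset and hex value of the first byte outside the 7-bit ASCII range.

```php
<?php
/**
 * Return the offset and hex value of the first byte outside 7-bit ASCII,
 * or null if every byte is below 0x80.
 */
function first_non_ascii(string $data): ?array
{
    // No /u modifier: the pattern is matched byte by byte.
    if (preg_match('/[\x80-\xFF]/', $data, $m, PREG_OFFSET_CAPTURE) !== 1) {
        return null;
    }
    return ['offset' => $m[0][1], 'hex' => bin2hex($m[0][0])];
}

// Hypothetical sample: "K", the byte 0xD6, then "LN".
var_dump(first_non_ascii("K\xD6LN")); // offset 1, hex d6
var_dump(first_non_ascii("plain"));   // NULL
```

For a real file, the contents could be passed in with file_get_contents(); a lone d6 would then hint at a single-byte encoding such as ISO-8859-1, while c3 followed by 96 would hint at UTF-8.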

A hack (also prone to failure) that would address this specific limitation of file when dealing with large files would be to shorten the file to make sure that special characters appear early in the file so file is more likely to find them.

Christos Zoulas updated file to make the amount of bytes looked at configurable. One day turn-around on the feature request, awesome!

The feature was released in file version 5.26.

Looking at more of a large file before making a guess about encoding takes time. However it is nice to have the option for specific use-cases where a better guess may outweigh additional time/io.

Use the following option:

. should do the trick if you want to force file to look at the whole file before making a guess. Of course this only works if you have file 5.26 or newer.

I haven’t built/tested the latest releases yet. Most of my machines currently have file 5.04 (2010). Hopefully someday this release will make it down from upstream.

(PHP 4 >= 4.0.5, PHP 5, PHP 7)

iconv - Convert a string to the requested character encoding

Description

Converts the character set of the string str from in_charset to out_charset.

Parameters

in_charset
The input charset.

out_charset
The output charset.

If you append the string //TRANSLIT to out_charset, transliteration is activated. This means that when a character can’t be represented in the target charset, it is replaced by one or several characters that look most similar to it. If you append the string //IGNORE, characters that cannot be represented in the target charset are discarded. If neither is given, an E_NOTICE error is generated and the function returns FALSE.

Whether and how //TRANSLIT works depends on the system’s iconv() implementation (ICONV_IMPL). Some implementations are known to simply ignore //TRANSLIT, so the conversion is likely to fail for characters that are invalid in out_charset.

str
The string to be converted.

Return Values

Returns the converted string or FALSE on failure.
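Because failure is signalled by FALSE, callers usually compare the result with ===. The wrapper below is a hypothetical sketch (the function name is made up); whether a given character actually makes the conversion fail depends on the system's iconv implementation.

```php
<?php
/**
 * Convert UTF-8 to ISO-8859-1, returning null instead of FALSE
 * when iconv() reports a failure.
 */
function to_latin1_or_null(string $utf8): ?string
{
    // @ hides the notice; the return value is checked instead.
    $out = @iconv('UTF-8', 'ISO-8859-1', $utf8);
    return $out === false ? null : $out;
}

var_dump(to_latin1_or_null("abc"));                 // unchanged ASCII
var_dump(bin2hex(to_latin1_or_null("\xC3\xA9")));   // é becomes the single byte e9
var_dump(to_latin1_or_null("\xE2\x82\xAC"));        // €: typically null, as ISO-8859-1 has no Euro sign
```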

Changelog

Version Description
5.4.0 Since this version, the function returns FALSE on illegal characters unless //IGNORE is specified in the output charset. Before that, it returned part of the string.

Examples

Example #1 iconv() example

<?php
$text = "This is the Euro symbol '€'.";

echo 'Original string : ', $text, PHP_EOL;
echo 'With TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL;
echo 'With IGNORE : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $text), PHP_EOL;
echo 'Plain conversion : ', iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;
?>

The above example will output something similar to:

User Contributed Notes 39 notes

The "//ignore" option doesn’t work with recent versions of the iconv library. So if you’re having trouble with that option, you aren’t alone.

That means you can’t currently use this function to filter invalid characters. Instead it silently fails and returns an empty string (or you’ll get a notice but only if you have E_NOTICE enabled).

This has been a known bug with a known solution since at least 2009, but no one seems to be willing to fix it (PHP must pass the -c option to iconv). It’s still broken as of the latest release 5.4.3.

ini_set('mbstring.substitute_character', "none");
$text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');

That will strip invalid characters from UTF-8 strings (so that you can insert it into a database, etc.). Instead of "none" you can also use the value 32 if you want it to insert spaces in place of the invalid characters.

Please note that iconv('UTF-8', 'ASCII//TRANSLIT', ...) doesn’t work properly when locale category LC_CTYPE is set to C or POSIX. You must choose another locale otherwise all non-ASCII characters will be replaced with question marks. This is at least true with glibc 2.5.

Example:
<?php
setlocale(LC_CTYPE, 'POSIX');
echo iconv('UTF-8', 'ASCII//TRANSLIT', "Žluťoučký kůň\n");
// ?lu?ou?k? k??

setlocale(LC_CTYPE, 'cs_CZ');
echo iconv('UTF-8', 'ASCII//TRANSLIT', "Žluťoučký kůň\n");
// Zlutoucky kun
?>

Interestingly, setting different target locales results in different, yet appropriate, transliterations. For example:

//some German
$utf8_sentence = 'Weiß, Goldmann, Göbel, Weiss, Göthe, Goethe und Götz';

//UK
setlocale(LC_ALL, 'en_GB');

//transliterate
$trans_sentence = iconv('UTF-8', 'ASCII//TRANSLIT', $utf8_sentence);

//gives [Weiss, Goldmann, Gobel, Weiss, Gothe, Goethe und Gotz]
//which is our original string flattened into 7-bit ASCII as
//an English speaker would do it (ie. simply remove the umlauts)
echo $trans_sentence . PHP_EOL;

//Germany
setlocale(LC_ALL, 'de_DE');

$trans_sentence = iconv('UTF-8', 'ASCII//TRANSLIT', $utf8_sentence);

//gives [Weiss, Goldmann, Goebel, Weiss, Goethe, Goethe und Goetz]
//which is exactly how a German would transliterate those
//umlauted characters if forced to use 7-bit ASCII!
//(because really ä = ae, ö = oe and ü = ue)
echo $trans_sentence . PHP_EOL;

To test different combinations of conversions between charsets (when we know neither the source charset nor a convenient destination charset), here is an example:

<?php
$tab = array("UTF-8", "ASCII", "Windows-1252", "ISO-8859-15", "ISO-8859-1", "ISO-8859-6", "CP1256");
$chain = "";
foreach ($tab as $i)
{
    foreach ($tab as $j)
    {
        $chain .= " $i$j " . iconv($i, $j, "$my_string");
    }
}

echo $chain;
?>

After displaying the result, use the $i$j pair whose output displays correctly.
NB: you can add other charsets to $tab to test other cases.

If you are getting question marks in your iconv output when transliterating, be sure to 'setlocale' to something your system supports.

Some PHP CMSs default setlocale to 'C', which can be a problem.

Use the "locale -a" command to list the locales your system supports.

<?php
setlocale(LC_CTYPE, 'en_AU.utf8');
$str = iconv('UTF-8', 'ASCII//TRANSLIT', "Côte d'Ivoire");
?>

Like many other people, I have encountered massive problems when using iconv() to convert between encodings (from UTF-8 to ISO-8859-15 in my case), especially on large strings.

The main problem here is that when your string contains illegal UTF-8 characters, there is no really straightforward way to handle those. iconv() simply (and silently!) terminates the string when encountering the problematic characters (also if using //IGNORE), returning a clipped string. The

<?php
$newstring = html_entity_decode(htmlentities($oldstring, ENT_QUOTES, 'UTF-8'), ENT_QUOTES, 'ISO-8859-15');

?>

workaround suggested here and elsewhere will also break when encountering illegal characters, at least dropping a useful note ("htmlentities(): Invalid multibyte sequence in argument in ...")

I have found a lot of hints, suggestions and alternative methods (it’s scary and in my opinion no good sign how many ways PHP natively provides to convert the encoding of strings), but none of them really worked, except for this one:

$newstring = mb_convert_encoding($oldstring, 'ISO-8859-15', 'UTF-8');

There may be situations when a new version of a web site, all in UTF-8, has to display some old data remaining in the database with ISO-8859-1 accents. The problem is iconv("ISO-8859-1", "UTF-8", $string) should not be applied if $string is already UTF-8 encoded.

I use this function, which doesn’t need any extension:

function convert_utf8($string) {
    if (strlen(utf8_decode($string)) == strlen($string)) {
        // $string is not UTF-8
        return iconv("ISO-8859-1", "UTF-8", $string);
    } else {
        // already UTF-8
        return $string;
    }
}

I have not tested it extensively, hope it may help.
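The same idea can be sketched with an explicit validity check. This is an alternative, not the function from the note above: the name ensure_utf8 is made up, it assumes the mbstring and iconv extensions are loaded, and it assumes that anything which is not valid UTF-8 is ISO-8859-1.

```php
<?php
/**
 * Return $s unchanged if it is already valid UTF-8,
 * otherwise treat it as ISO-8859-1 and convert it.
 */
function ensure_utf8(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already UTF-8 (pure ASCII also lands here)
    }
    return iconv('ISO-8859-1', 'UTF-8', $s);
}

echo bin2hex(ensure_utf8("\xD6")), PHP_EOL;     // c396: converted
echo bin2hex(ensure_utf8("\xC3\x96")), PHP_EOL; // c396: left alone
```

The check is a heuristic: a short ISO-8859-1 string can happen to be a valid UTF-8 byte sequence, in which case it is left unconverted.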

For those who have trouble displaying UCS-2 data in a browser, here’s a simple function that converts UCS-2 to HTML Unicode entities:

<?php
function ucs2html($str) {
    $str = trim($str); // if you are reading from file
    $len = strlen($str);
    $html = '';
    for ($i = 0; $i < $len; $i += 2)
        $html .= '&#' . hexdec(dechex(ord($str[$i + 1])) .
            sprintf("%02s", dechex(ord($str[$i])))) . ';';
    return($html);
}
?>

In my case, I had to change:
<?php
setlocale(LC_CTYPE, 'cs_CZ');
?>
to
<?php
setlocale(LC_CTYPE, 'cs_CZ.UTF-8');
?>
Otherwise it returns question marks.

When I asked my Linux system for its locale (with the locale command), it returned "cs_CZ.UTF-8", so the two may be related.

iconv (GNU libc) 2.6.1
glibc 2.3.6

Here is how to convert UCS-2 numbers to UTF-8 numbers in hex:

<?php
function ucs2toutf8($str)
{
    $utf8 = "";
    for ($i = 0; $i < strlen($str); $i += 4)
    {
        $substring1 = $str[$i] . $str[$i + 1];
        $substring2 = $str[$i + 2] . $str[$i + 3];

        if ($substring1 == "00")
        {
            $byte1 = "";
            $byte2 = $substring2;
        }
        else
        {
            $substring = $substring1 . $substring2;
            // Two-byte UTF-8 form; intdiv() keeps the lead byte an integer
            $byte1 = dechex(192 + intdiv(hexdec($substring), 64));
            $byte2 = dechex(128 + (hexdec($substring) % 64));
        }
        $utf8 .= $byte1 . $byte2;
    }
    return $utf8;
}

echo strtoupper(ucs2toutf8("06450631062D0020"));

?>

Input:
06450631062D
Output:
D985D8B1D8AD

I have used iconv to convert from cp1251 into UTF-8. I spent a day investigating why a string with the Russian capital 'Р' (sounds similar to 'r') at the end cannot be inserted into a database.

The problem is not in iconv. 'Р' in cp1251 is chr(208), and 'Р' in UTF-8 is chr(208).chr(160). chr(160) is one of the space symbols that match '\s' in a regex, so it can be taken by a greedy '+' or '*' operator. In that case, you lose 'Р' in your string.

For example, 'ГР ' (Russian, UTF-8). Function preg_match. Regex is '(.+?)[\s]*'. Then '(.+?)' matches 'Г'.chr(208) and '[\s]*' matches chr(160).' '.

Although it is not a bug in iconv, it looks very much like one. That’s why I put this comment here.

Here is how to convert UTF-8 numbers to UCS-2 numbers in hex:

function utf8toucs2($str)
{
    $ucs2 = "";
    for ($i = 0; $i < strlen($str); $i += 2)
    {
        $substring1 = $str[$i] . $str[$i + 1];

        if (hexdec($substring1) < 128)
            $results = "00" . $substring1;
        else
        {
            // Two-byte sequence: read the continuation byte as well
            $substring2 = $str[$i + 2] . $str[$i + 3];
            $results = dechex((hexdec($substring1) - 192) * 64 + (hexdec($substring2) - 128));
            // Left-pad to four hex digits
            $results = str_pad($results, 4, "0", STR_PAD_LEFT);
            $i += 2;
        }
        $ucs2 .= $results;
    }
    return $ucs2;
}

echo strtoupper(utf8toucs2("D985D8B1D8AD")) . "\n";
echo strtoupper(utf8toucs2("456725")) . "\n";

I just found out today that the Windows and *NIX versions of PHP use different iconv libraries and are not very consistent with each other.

Here is a repost of my earlier code that now works on more systems. It converts as much as possible and replaces the rest with question marks:

if (!function_exists('utf8_to_ascii')) {
    setlocale(LC_CTYPE, 'en_AU.utf8');
    if (@iconv("UTF-8", "ASCII//IGNORE//TRANSLIT", 'é') === false) {
        // PHP is probably using the glibc library (*NIX)
        function utf8_to_ascii($text) {
            return iconv("UTF-8", "ASCII//TRANSLIT", $text);
        }
    }
    else {
        // PHP is probably using the libiconv library (Windows)
        function utf8_to_ascii($text) {
            if (is_string($text)) {
                // Includes combinations of characters that present as a single glyph
                $text = preg_replace_callback('/\X/u', __FUNCTION__, $text);
            }
            elseif (is_array($text) && count($text) == 1 && is_string($text[0])) {
                // IGNORE characters that can't be TRANSLITerated to ASCII
                $text = iconv("UTF-8", "ASCII//IGNORE//TRANSLIT", $text[0]);
                // The documentation says that iconv() returns false on failure but it returns ''
                if ($text === '' || !is_string($text)) {
                    $text = '?';
                }
                elseif (preg_match('/\w/', $text)) { // If the text contains any letters...
                    $text = preg_replace('/\W+/', '', $text); // ...then remove all non-letters
                }
            }
            else { // $text was not a string
                $text = '';
            }
            return $text;
        }
    }
}

I don’t know whether it’s a feature or not, but it works for me (PHP 5.0.4)

I tested it converting from windows-1251 (stored in DB) to UTF-8 (which I use for web pages).
BTW, I convert each array I fetch from the DB with array_walk_recursive.

Here is an example how to convert windows-1251 (windows) or cp1251(Linux/Unix) encoded string to UTF-8 encoding.

function cp1251_utf8($sInput)
{
    $sOutput = "";

    for ($i = 0; $i < strlen($sInput); $i++)
    {
        $iAscii = ord($sInput[$i]);

Be aware that iconv in PHP uses the system’s implementations of locales and languages; what works under Linux normally doesn’t under Windows.

Also, you may notice that on recent versions of Linux (Debian, Ubuntu, CentOS, etc.) the //TRANSLIT option doesn’t work, since most distros don’t include the intl packages (for example, php5-intl and icuxx, where xx is a number, in Debian) by default. This is because the intl package conflicts with another package needed for international DNS resolution.

The problem is that the configuration depends on the sysadmin of the machine where you’re hosted, so iconv is pretty much useless by default, depending on what configuration is used by your distro or the machine’s admin.

iconv with //IGNORE works as expected: it will skip the character if this one does not exist in the $out_charset encoding.

If a character is missing from the $in_charset encoding (eg byte \x81 from CP1252 encoding), then iconv will return an error, whether with //IGNORE or not.

For transcoding values in an Excel generated CSV the following seems to work:

<?php
$value = iconv('Windows-1252', 'UTF-8//TRANSLIT', $value);
?>

Note an important difference between iconv() and mb_convert_encoding() — if you’re working with strings, as opposed to files, you most likely want mb_convert_encoding() and not iconv(), because iconv() will add a byte-order marker to the beginning of (for example) a UTF-32 string when converting from e.g. ISO-8859-1, which can throw off all your subsequent calculations and operations on the resulting string.

In other words, iconv() appears to be intended for use when converting the contents of files — whereas mb_convert_encoding() is intended for use when juggling strings internally, e.g. strings that aren’t being read/written to/from files, but exchanged with some other media.
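One way to sidestep the byte-order marker while staying with iconv() is to name the byte order explicitly. A minimal sketch, assuming the iconv extension knows the charset name UTF-32BE; whether the generic name UTF-32 prepends a marker depends on the underlying implementation.

```php
<?php
$latin1 = "a";

// Explicit byte order: four bytes per character and no marker in front.
$utf32be = iconv('ISO-8859-1', 'UTF-32BE', $latin1);
echo strlen($utf32be), ' ', bin2hex($utf32be), PHP_EOL; // 4 00000061

// Generic name: the implementation may prepend a byte-order marker,
// so the length can be larger than four bytes here.
$utf32 = iconv('ISO-8859-1', 'UTF-32', $latin1);
echo strlen($utf32), ' ', bin2hex($utf32), PHP_EOL;
```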

‘" to the output.
This function will strip out these extra characters:
setlocale(LC_ALL, 'en_US.UTF8');
function clearUTF($s)
{
    $r = '';
    $s1 = @iconv('UTF-8', 'ASCII//TRANSLIT', $s);
    $j = 0;
    for ($i = 0; $i < strlen($s1); $i++) {
        $ch1 = $s1[$i];
        $ch2 = @mb_substr($s, $j++, 1, 'UTF-8');
if ( strstr ( ‘`^

function detectUTF8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )+%xs', $string);
}

function cp1251_utf8( $sInput )
{
$sOutput = "";