UTF-8 not recognized, largish file

Postby BillKat » Tue Jan 20, 2009 6:45 am

Hello all,

I have mysqldump files which are typically about 60MB. The source database is fully-utf-8, and the dump file on the server does contain utf-8 chars like beta, gamma etc, verified by viewing or editing the files there. The UNIX 'file' command describes the file as 'UTF-8 English Unicode text, with very long lines'.

A shell script adds these lines to the very top of the dump file:
charset=utf-8
encoding=utf-8

I did this after reading on this forum that UE only looks for UTF-8 in roughly the first 10 KB of a file.

I copy the file down to my Windows machine and edit the file in UE; but UE sees the filetype as 'UNIX' and doesn't display the betas etc properly.
It's fine with smaller utf-8 mysqldump files originating from the same database.

My UE config has:
utf-8 detection turned on (inc utf16 for now)
UNIX/MAC detection set to auto convert to DOS

I'm now stuck for ideas ... any help appreciated, cheers all.
BillKat
Newbie
 
Posts: 2
Joined: Tue Jan 20, 2009 6:24 am

Re: UTF-8 not recognized, largish file

Postby Mofi » Tue Jan 20, 2009 9:02 am

I think the reason is that in UE v14.20.1.1006 the search for the UTF-8 charset declaration is no longer a simple search for charset=utf-8 as in previous versions of UE, as some tests have shown me. If I create a file containing only this string, UltraEdit no longer interprets it as a UTF-8 encoded file. But if I create a new file with the following line:

<meta http-equiv="content-type" content="text/html; charset=utf-8">

UltraEdit v14.20.1.1006 loads that file as UTF-8 file. Further tests let me think that UltraEdit now uses a regular expression search.

<meta charset=utf-8> was also recognized as valid UTF-8 character set declaration. So the regular expression in UltraEdit syntax is maybe something like <*meta*charset=utf-8*> for HTML and <*?xml*encoding="utf-8*> for XML.
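The guessed patterns can be approximated with ordinary regular expressions. The following is only a sketch of that guess in Python; the names META_CHARSET, XML_ENCODING and declares_utf8 are made up for illustration, and UltraEdit's real detection logic is not documented in this topic.

```python
import re

# Hypothetical approximations of the patterns guessed above. They are NOT
# taken from UltraEdit; they only mimic the behaviour observed in the tests.
META_CHARSET = re.compile(
    rb"<[^>]*meta[^>]*charset=[\"']?utf-8[^>]*>", re.IGNORECASE
)
XML_ENCODING = re.compile(
    rb"<\?xml[^>]*encoding=[\"']utf-8[\"'][^>]*>", re.IGNORECASE
)


def declares_utf8(head: bytes) -> bool:
    """Return True if the leading bytes hold an HTML or XML UTF-8 declaration."""
    return bool(META_CHARSET.search(head) or XML_ENCODING.search(head))
```

With these patterns a bare charset=utf-8 line, as written by the shell script in the first post, does not count as a declaration, which matches the behaviour described above.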

Note that you can also specify the encoding explicitly in the File - Open dialog using the Open As option, available since UE v14.10.0.

Finally, rather than adding the HTML or XML charset declarations at the top of the file with your script, I suggest adding the UTF-8 BOM (Byte Order Mark). That declares the file as UTF-8 without a doubt. You have to insert the three bytes EF BB BF (hex), which display as ï»¿ in code page 1252, at the very top of the file.
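For a dump of 60 MB the BOM can be prepended without loading the whole file into memory. This is a minimal sketch in Python rather than shell; the function name prepend_utf8_bom is an assumption for illustration, not something from the original script.

```python
import codecs
import os
import shutil
import tempfile


def prepend_utf8_bom(path: str) -> bool:
    """Add the UTF-8 BOM (EF BB BF) to `path` unless it is already there.

    Returns True if the file was changed. The content is streamed through
    a temporary file in the same directory, so a large dump is never held
    in memory at once.
    """
    with open(path, "rb") as src:
        if src.read(3) == codecs.BOM_UTF8:
            return False  # already marked, nothing to do
        src.seek(0)
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
        with os.fdopen(fd, "wb") as dst:
            dst.write(codecs.BOM_UTF8)
            shutil.copyfileobj(src, dst)
    # Replace the original only after both files are closed.
    os.replace(tmp_path, path)
    return True
```

Calling it a second time on the same file does nothing, so it is safe to run on every download.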
Mofi
Grand Master
 
Posts: 4055
Joined: Thu Jul 29, 2004 11:00 pm
Location: Vienna

Re: UTF-8 not recognized, largish file

Postby BillKat » Tue Jan 20, 2009 10:18 am

Thanks Mofi, useful stuff there. I will have a go and post back, in case it helps anyone else out in the future.
Cheers.

edit:
OK, quick tests - the 'File Open as utf8' works fine - never even thought to look there, after years of "right-click > open with UE" or drag & drop ... ((slaps head))
And the BOM inserting works like a charm too. Excellent.

So thanks a lot Mofi! Sorted.
BillKat
Newbie
 
Posts: 2
Joined: Tue Jan 20, 2009 6:24 am

Re: UTF-8 not recognized, largish file

Postby Zababa » Thu Feb 19, 2009 2:31 pm

Hi, I use UE 14.20.1.1008 in MS Windows XP SP 3.

I have a 17 MB large UTF-8 file without BOM with just a dozen or so non-ASCII characters in it somewhere near the end of the file. I have enabled automatic Unicode recognition in the advanced configuration.

When I drag and drop the file to UE, it gets recognized as "UNIX" (in the status line), i.e. a plain ASCII file with UNIX line ends. When I then search for a character like "œ" I get a match at the two bytes "Ĺ“" well, yes, the match is where the "œ" is supposed to be. That leads me to the idea that the file is still not messed up, and I save it as another file and I explicitly say it shall be UTF-8 (I tried both BOM and NOBOM here). UE does certainly some hard work with Unicode because the file being saved temporarily has about the double size before it shrinks again to some 17 MB.

But alas, the newly saved file IS messed up (even if I don't drag and drop it but open with open dialogue and select encoding 65001). In there I cannot find any matches for "œ" but just for the two (now multibyte) characters "Ĺ" and "“"

The only thing I can do to avoid this is to open the file via the open dialog instead of by drag and drop. But that is really not user friendly. I mean, it still takes a lot of time to open a 17 MB text file. I think UE reads it from the first character to the last before it displays it. Why can't a few multibyte characters be enough for UE to detect it as UTF-8?

Why can't UE have a configuration option to assume all opened files are UTF-8 (or any other encoding)? Then I could even disable the autodetect feature (which is useless for a file like this one). :x

I wish UE would soon make UTF-8 the default, or even better: add an option in the configuration to select an encoding which shall be assumed when opening files.
Zababa
Basic User
 
Posts: 23
Joined: Wed Oct 03, 2007 11:00 pm
Location: Leipzig

Re: UTF-8 not recognized, largish file

Postby Mofi » Fri Feb 20, 2009 7:28 am

Okay, I will try to answer your questions, although they are already answered in other UTF-8 related topics. But before reading further, carefully read Unicode text and Unicode files in UltraEdit/UEStudio to get the basic understanding of encoding which it looks like you don't have.

Why can't a few multibyte characters be enough for UE to detect it as UTF-8?

UltraEdit searches for byte sequences which could be interpreted as UTF-8 character codes only in the first 9 KB (UE v11.20a) or 64 KB (UE v14.00b) of a file. Why not in the complete file? Because that would make UltraEdit extremely slow when opening any file while the setting Auto detect UTF-8 files is enabled. Always scanning the complete file for a byte sequence which could be interpreted as a UTF-8 character code would mean reading all bytes of a file before displaying it. That is not a good idea for files of several MB, and of course a very bad idea for files of hundreds of MB or even GB.
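The effect of scanning only the head of a file can be sketched in a few lines of Python. This is an illustration of the idea, not UltraEdit's code: the function name head_looks_like_utf8 is made up, and the 64 KB default is simply the figure quoted above.

```python
import codecs


def head_looks_like_utf8(data: bytes, limit: int = 64 * 1024) -> bool:
    """Guess UTF-8 from the first `limit` bytes only.

    Returns True if the head contains at least one non-ASCII byte and
    everything in it is valid UTF-8.
    """
    head = data[:limit]
    if head.isascii():
        return False  # only ASCII seen: no evidence of UTF-8 in the head
    # An incremental decoder with final=False tolerates a multibyte
    # sequence that is cut off at the end of the scanned window.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(head, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

A file whose only non-ASCII characters sit behind the scanned window is therefore reported as not UTF-8, which is exactly the situation with the 17 MB file.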

Also, how can UltraEdit be sure that the byte sequence E2 80 9C (hex codes) should really be interpreted as the UTF-8 code for the character “ and not as the string â€œ using code page 1252? Can you answer that question if I give you a file with these 3 bytes? How can you know what I meant with these 3 bytes? Maybe I'm Russian and the same 3 bytes mean вЂњ (code page 1251), or I'm Greek and they mean something different again in a Greek code page. Do you understand the problem? There must be a rule telling a program that reads the bytes E2 80 9C how to interpret them.
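The ambiguity is easy to reproduce. The sketch below decodes the same 3 bytes with three of Python's standard codecs; the helper name interpretations is only for illustration.

```python
RAW = bytes.fromhex("E2 80 9C")


def interpretations(raw: bytes) -> dict:
    """Decode the same bytes as UTF-8, code page 1252 and code page 1251."""
    return {name: raw.decode(name) for name in ("utf-8", "cp1252", "cp1251")}


# interpretations(RAW) gives one character for "utf-8" (U+201C) and a
# different three-character string for each of the two code pages.
```

None of the three results is wrong from the decoder's point of view, so the bytes alone cannot tell a program which one was meant.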

That's the reason why organizations like the International Organization for Standardization (ISO) or the Unicode Consortium exist. They define standards. Without standards our high tech world can't exist. Unicode is a standard - see About the Unicode Standard.

So what is the real problem? The real problem is that the program which created the 17 MB file you open encodes characters with UTF-8 byte sequences, but has not declared the file as a UTF-8 file with a UTF-8 BOM. If your file is an HTML, XHTML or XML file, then it does not need a BOM, but then it must have a declaration for the UTF-8 encoding at the top of the file. That your file has neither a BOM nor a standardized character encoding declaration means your program ignores all the standards.

UTF-8 is really a special encoding standard. It was defined because many programs can only handle ASCII files and don't support the Unicode standard. With UTF-8 it is possible to encode non ASCII characters in ASCII files and therefore keep the files with the non ASCII characters readable for programs not supporting the Unicode standard. Many interpreters like PHP and Perl are (or were), for example, not capable of correctly interpreting UTF-16 files. They can interpret only ASCII files and ASCII strings, and they don't know about the special meaning of 00 00 FE FF (UTF-32, big-endian), FF FE 00 00 (UTF-32, little-endian), FE FF (UTF-16, big-endian), FF FE (UTF-16, little-endian) and EF BB BF (UTF-8) at the top of a text file, and therefore often break with an error if a BOM exists. That is one reason why a special declaration for the encoding using only ASCII characters was standardized for HTML, XHTML and XML: the document writers can use non ASCII characters, the interpreters which are not compatible with the Unicode standard can still interpret the files, and the browsers supporting the standards know which encoding is used for the file and can interpret and display the byte stream correctly.
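The five signatures listed above can be checked with the BOM constants from Python's codecs module. This is a small sketch; detect_bom is a made-up name. The only subtle point is the order: the UTF-32 little-endian BOM starts with the same two bytes as the UTF-16 little-endian BOM, so the longer signature has to be tested first.

```python
import codecs

# Longer signatures first: FF FE 00 00 (UTF-32 LE) begins with FF FE (UTF-16 LE).
_BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def detect_bom(head: bytes):
    """Return the encoding named by a BOM at the start of `head`, or None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

Only the first 4 bytes of a file are needed, so this check costs nothing even for very large files.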


Okay, back to your problem. UltraEdit does not scan the whole file for UTF-8 byte sequences for the reasons described above. So your 17 MB file is opened in ASCII mode. If you now save the file as UTF-8, the bytes of the existing UTF-8 byte sequences are themselves encoded with UTF-8. So the character œ, already present in the file as the 2 bytes C5 93 and interpreted with your code page as Ĺ“, is saved as the 5 bytes C4 B9 E2 80 9C, and now you have garbage. The only solutions are to use the special encoding option in the file open dialog, or to insert the 3 bytes of the UTF-8 BOM, save the file as ASCII as loaded, close it and open it again.
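The double encoding can be reproduced and, in simple cases, undone. The sketch below assumes the active code page was Windows-1250 (cp1250), because that is the code page in which the bytes C5 93 display as Ĺ“; the function names are illustrative only.

```python
def double_encode(text: str, codepage: str = "cp1250") -> bytes:
    """Simulate saving as UTF-8 a UTF-8 file that was loaded as ANSI."""
    utf8 = text.encode("utf-8")          # bytes on disk, e.g. C5 93 for U+0153
    misread = utf8.decode(codepage)      # what the editor believes it loaded
    return misread.encode("utf-8")       # each misread character encoded again


def repair(damaged: bytes, codepage: str = "cp1250") -> str:
    """Reverse the double encoding, assuming `codepage` was the one in use."""
    return damaged.decode("utf-8").encode(codepage).decode("utf-8")
```

The repair only works if the code page is guessed correctly and every original byte has a mapping in it; otherwise one of the conversions raises an error instead of returning text.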

I think I don't have to explain why UltraEdit converts a whole file detected as UTF-8 into UTF-16 LE, which takes time on larger files. Most characters in a UTF-8 file are encoded with a single byte, others with 2 bytes, some with 3 or 4 bytes. That is not very good for a program which not only displays the content but also allows modifying it with dozens of functions. Converting the UTF-8 file to UTF-16 LE results in a fixed 2 bytes per character for nearly all characters in everyday use (everything in the Basic Multilingual Plane). That makes it efficient to handle the bytes of the characters. Also, in all programming languages I know there is only the choice between single byte character arrays for strings and double byte Unicode arrays. As already written above, UTF-8 is really something special.
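The difference in width is easy to measure. A minimal Python sketch follows; widths is a made-up helper name.

```python
def widths(ch: str) -> tuple:
    """Return (bytes in UTF-8, bytes in UTF-16 LE) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# ASCII letter, Greek beta, left double quotation mark:
# UTF-8 needs 1, 2 and 3 bytes, while UTF-16 LE needs 2 bytes for each.
```

Characters outside the Basic Multilingual Plane, such as most emoji, take 4 bytes in both encodings, so the fixed width holds only for the commonly used range.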

Why can't UE have a configuration option to assume all opened files are UTF-8 (or any other encoding)?

That's a suggestion for an enhancement which you can send by email to IDM. But the real problem is the program which created the 17 MB file using UTF-8 encoding without marking the file as a UTF-8 encoded file. If all programs creating UTF-8 files were compatible with the Unicode standard and wrote the encoding information into the file as required by the standards, then all other programs which are already really compatible with the Unicode standard would have no problems reading those files.

Added on 2009-11-09: I have found an undocumented setting in uedit32.exe of v11.10c and later. By manually adding to uedit32.ini

[Settings]
Force UTF-8=1


you can force all non Unicode files (not UTF-16 files) to be read and saved as UTF-8 encoded files. New files are nevertheless created and saved as either Unicode (UTF-16 LE) or ASCII/ANSI files, unless, with UE v16.00 and later, the default Encoding Type is set to Create new files as UTF-8. So this special setting only affects files that already have a name. However, with UE < v16.00, creating a new file in ASCII/ANSI, saving it with a name, closing it and re-opening it results in a file encoded in UTF-8. Be careful with this setting: even real ANSI files are loaded as UTF-8 encoded files, causing all ANSI characters to be interpreted wrongly.
Mofi
Grand Master
 
Posts: 4055
Joined: Thu Jul 29, 2004 11:00 pm
Location: Vienna

Re: UTF-8 not recognized, largish file

Postby Zababa » Thu Mar 26, 2009 4:27 am

Hello Mofi,

thank you for your thorough answer.

Mofi wrote:How can you know what I meant with these 3 bytes? Maybe I'm Russian and the same 3 bytes mean вЂњ, or I'm Greek and they mean something different again. Do you understand the problem? There must be a rule for a program which reads the bytes E2 80 9C how to interpret it.

I completely understand that UE can't tell what it is. It knows many more encodings than 99.99% of its users, and seen as bytes it really is ambiguous. But on the other hand, we don't live in the early 90's anymore. How many of today's text documents are written in traditional encodings*? (I know that we need the ISO standards for editing legacy texts which were saved before the Unicode era, there's no question about it.) It's just a matter of probability. How big is the probability that a Greek or Russian will want to open a text in the respective ISO encoding today? I bet that for all texts they write they use some kind of UTF (unless they write in Notepad, which unfortunately still features some traditional encoding as the default).

So, what I was complaining about was that none of the UE developers considers the falling frequency of handling ISO-encoded texts, and the user still needs to convince UE pretty hard that he really would like to do things in Unicode.

If all programs creating UTF-8 files would be compatible with the Unicode standard and would write the encoding information into the file as required by the standards, then all other programs which are already really compatible to the Unicode standard would have no problems reading those files.

What standard are you talking about? The UTF-8 BOM is deprecated by the Unicode Consortium itself** and is rather seen as a quirky thing. The BOM of UTF-8 BOM is superfluous (and is no real BOM anyway) because UTF-8 has a strictly defined byte order. However, UTF-8 BOM is predominantly used on the Windows platform as an explicit indicator of UTF-8 because many programs, including UE, are reluctant to embrace UTF-8 (NOBOM) as the new encoding standard. (I know that UTF-16 or UTF-32 (whatever endian) are even better from the programmer's point of view, but most users complain about them being uneconomical for Latin-based scripts.) On Linux and Mac, UTF-8 (NOBOM) is no big deal and it's usually the default to save text files in. There, if you encounter a UTF-8 BOM file in some kind of workflow, tools and utilities freak out because they don't expect such a thing as a BOM in a UTF-8 file (that's their ignorance and flaw, I know). They assume that if it has nothing then it's UTF-8 (NOBOM). They recognize the BOMs of UTF-16 and -32. They assume that if it's some kind of legacy encoding, users will tell them explicitly.

So I just mean UE could behave in our modern unicoded times the same way: Be (by default) prepared to open (and save) some kind of UTF. If not, the user will tell you.

And even better, as you will probably argue, there might be some users or work periods where you have to deal predominantly with legacy encodings. For this UE should have an option in its configuration where you could set up an encoding which UE would assume when opening files and another encoding for saving files. (These two independent encoding defaults can get very handy if you want to convert dozens of files from one encoding in another.)
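Outside the editor, the "one encoding for opening, another for saving" idea is simple to sketch. The Python below is an illustration under assumptions: convert_file and its parameters are made-up names, and the source encoding must be known in advance, because nothing here tries to detect it.

```python
def convert_file(src_path, dst_path, src_encoding, dst_encoding,
                 chunk_chars=1 << 16):
    """Re-encode a text file from `src_encoding` to `dst_encoding`.

    newline="" on both sides keeps the original line endings untouched.
    The file is processed in chunks, so large files are fine too.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)
```

Looping over a list of file names with, for example, "cp1250" as the source and "utf-8" as the target would cover the batch conversion case mentioned above.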

My post you reacted to wasn't meant predominantly as a help request with a problem, it was more like a sigh wondering why UE is still so ISO-oriented. I know you did not cause these problems and cannot solve them. You are just somebody who knows where these problems originate, and you can explain the inner workings of UE perfectly to rather ignorant users (as I am). I will suggest the default encoding options to IDM.

----------
* assuming you reconsider ASCII as UTF-8 NOBOM
** "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature." (cited from Section 2.13, Special Characters and Noncharacters, (Unicode 5.1))
Zababa
Basic User
 
Posts: 23
Joined: Wed Oct 03, 2007 11:00 pm
Location: Leipzig

Re: UTF-8 not recognized, largish file

Postby pietzcker » Thu Mar 26, 2009 5:16 am

I'd like to hope that UTF-8 is the de facto standard today, but I kind of doubt it. If I save a CSV file in Excel or a TXT or HTML file in MS Word, they will be written in the local encoding, not in UTF-8. Python, my favourite programming language, has just made the jump to Unicode with Python 3, but the file handling routines still expect the local standard encoding unless specified otherwise. I guess it'll take a few years until the old encodings are dropped...
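In Python 3, "specified otherwise" means passing the encoding to open(). A short sketch follows; read_text_utf8 is a made-up helper name, while the "utf-8-sig" codec is part of the standard library and accepts UTF-8 with or without a BOM.

```python
def read_text_utf8(path: str) -> str:
    """Read a UTF-8 text file regardless of the platform's default encoding.

    The "utf-8-sig" codec decodes UTF-8 and silently drops a leading BOM
    if one is present, so both flavours of file give the same string.
    """
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()
```

Without the encoding argument the same call would fall back to the local code page on a typical Windows installation of that time.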
pietzcker
Master
 
Posts: 241
Joined: Sun Aug 22, 2004 11:00 pm

Re: UTF-8 not recognized, largish file

Postby Zababa » Thu Mar 26, 2009 5:45 am

pietzcker wrote:If I save a CSV file in Excel or a TXT or HTML file in MS Word, they will be written in the local encoding, not in UTF-8.

That's a shame. It's more or less dependent on the platform's default. I think most users never bother with any encoding. They just want to save it, open it and expect things to be alright. They never imagine there is more than one way to encode the text. In these cases much depends on the default setting. The traditional encodings are not the best default choice. That is something program developers have to bear in mind. As long as there are programs around for which UTF-8 (or 16 or 32) is the unexpected setting, users like me will get upset.

pietzcker wrote:Python, my favourite programming language, has just made the jump to Unicode with Python 3, but the file handling routines still expect the local standard encoding unless specified otherwise.

I know. I love Python 3. It's the only language where I can give my functions Czech names. (-: It handles Unicode files neatly (although I have to tell it)

pietzcker wrote:I guess it'll take a few years until the old encodings are dropped...

I hope not. It's one of the reasons I am thinking of switching and not use Windows anymore (or at least as little as possible).
Sorry for sliding off topic.
Zababa
Basic User
 
Posts: 23
Joined: Wed Oct 03, 2007 11:00 pm
Location: Leipzig

Re: UTF-8 not recognized, largish file

Postby Mofi » Fri Mar 27, 2009 10:56 am

Well, Zababa you are absolutely right with the UTF-8 BOM declared as deprecated in the meantime.

But on Windows platforms it will surely take more than 5 years until most text files are no longer written using a code page but using a Unicode encoding. There are more than 30 years of computer history with single byte coded text files, which you can't get rid of in 2 years, not on a platform as widely used as Windows.

For example I'm mainly a programmer. I have never, really never seen a C or C++ source file encoded with UTF-8. UltraEdit is heavily used by programmers. It would be very dangerous for many program sources if any non ASCII character is suddenly encoded with UTF-8 by default. Any non ASCII character in a NULL terminated C string would suddenly produce unexpected results or buffer overflows.

But I agree that from the text writer's point of view it would be really helpful to be able to specify the default encoding for all files or for files with a defined extension.

So from the text writer's point of view it would be good if the current option Create new file as Unicode at Configuration - Editor - New File Creation were converted into a radio button option like:

Create new file as:
  • ASCII/ANSI file
  • Unicode UTF-8 file
  • Unicode UTF-16 LE file
And at Configuration - File Handling - Unicode/UTF-8 Detection an additional option, for example with name "Load files with following extensions as UTF-8" with an edit field could be offered to specify file extensions of files which are loaded as UTF-8 if none of the enabled Unicode detections already detect the Unicode encoding. The file extensions can be separated with a space and the * as wildcard for all files should be possible too like the File Extensions = list in the wordfile for a syntax highlighting language definition.

Of course the script and macro environment must then be also enhanced for being able to detect from within a script or macro which encoding a new file has and to be able to convert the encoding also from/to UTF-8. Currently there is no script or macro command to make UTF-8 conversions in any direction or detect a file as being encoded in UTF-8. Otherwise public scripts/macros working fine for user A could produce garbage for user B.

But don't expect that I suggest such enhancements. You have to do it. Although you maybe can't believe it I don't use UTF-8 or any other Unicode encoding for my daily work although I edit daily many text files. So I'm not really interested in such new options.


Mofi
Grand Master
 
Posts: 4055
Joined: Thu Jul 29, 2004 11:00 pm
Location: Vienna

Re: UTF-8 not recognized, largish file

Postby sfqfirst » Tue Jan 12, 2010 2:51 am

Hi, I am comparing two files that differ only in the 3 bytes "EF BB BF" at the top of the file: one has them, the other does not.
But "Hex Edit" automatically adds "EF BB BF" at the top of the file, so I can never see the difference between them.
How can I make UE leave every part of the file unchanged when using the "Hex Edit" view?

I have tried every value of the "Unicode/UTF-8 Detection" options, but it did not work.
Thanks.
sfqfirst
Newbie
 
Posts: 5
Joined: Fri Dec 11, 2009 3:01 am
Location: Beijing

Re: UTF-8 not recognized, largish file

Postby Mofi » Tue Jan 12, 2010 3:55 am

sfqfirst, which version of UltraEdit do you use?

If I open a UTF-8 file without BOM with UE v15.20.0.1022 and switch to hex edit mode UltraEdit does not add the 3 BOM bytes. It shows the content as really saved on hard disk. If I open a UTF-8 file with BOM with UE v15.20.0.1022 and switch to hex edit mode the BOM bytes are displayed at top of the file.

You can use in the File - Open dialog the option Open as binary to open any file directly in hex editing mode. This option exists since version 14.10 of UltraEdit.

You can also try to use File - Revert to Saved after switching to hex edit mode. Without testing (because not needed with v15.20.0.1022) you then should see the bytes of the file as stored on hard disk.
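If the goal is only to confirm that two files differ in nothing but the BOM, the raw bytes can also be compared outside the editor. This is a minimal Python sketch; both function names are made up, and it reads each file completely, which is fine for files of moderate size.

```python
import codecs


def strip_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM (EF BB BF), if any."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def differ_only_by_bom(path_a: str, path_b: str) -> bool:
    """True if the files are not identical but match once a BOM is ignored."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        data_a, data_b = a.read(), b.read()
    return data_a != data_b and strip_bom(data_a) == strip_bom(data_b)
```

Because the files are opened in binary mode, nothing is converted or added on the way, so the result reflects the bytes as stored on disk.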
Mofi
Grand Master
 
Posts: 4055
Joined: Thu Jul 29, 2004 11:00 pm
Location: Vienna

Re: UTF-8 not recognized, largish file

Postby sfqfirst » Tue Jan 12, 2010 4:36 am

Thank you. I got it as you said.
I am very glad to see your answer is so fast, although we are in different countries.
I used UltraEdit 15.10.
sfqfirst
Newbie
 
Posts: 5
Joined: Fri Dec 11, 2009 3:01 am
Location: Beijing

