Okay, I will try to answer your questions, although they have already been answered in other UTF-8 related topics. But before reading further, read carefully Unicode text and Unicode files in UltraEdit/UEStudio to get the basic understanding of encodings which it looks like you don't yet have.

"Why can't a few multibyte characters be enough for UE to detect a file as UTF-8?"
UltraEdit searches for byte sequences which could be interpreted as UTF-8 character codes only in the first 9 KB (UE v11.20a) or 64 KB (UE v14.00b) of a file. Why not in the complete file? Because that would make UltraEdit extremely slow on opening any file when the setting Auto detect UTF-8 files is enabled. Always scanning the complete file for a byte sequence which could be interpreted as a UTF-8 character code would actually require reading all bytes of a file before displaying it. Not a very good idea for files with several MB, and of course a very bad idea for files with hundreds of MB or even GB.
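Here is a minimal sketch in Python of what such a heuristic could look like. This is only my illustration, not UltraEdit's actual code; the function name and the simplified byte patterns are my own.

    import re

    # Simplified patterns for valid 2-, 3- and 4-byte UTF-8 sequences.
    # (Overlong and surrogate encodings are not rejected; this is only
    # a rough heuristic for illustration.)
    UTF8_SEQUENCE = re.compile(
        rb'[\xC2-\xDF][\x80-\xBF]'        # 2-byte sequence
        rb'|[\xE0-\xEF][\x80-\xBF]{2}'    # 3-byte sequence
        rb'|[\xF0-\xF4][\x80-\xBF]{3}'    # 4-byte sequence
    )

    def looks_like_utf8(path, limit=64 * 1024):
        # Examine only the first `limit` bytes, like UE's 9 KB / 64 KB window.
        with open(path, 'rb') as f:
            head = f.read(limit)
        return UTF8_SEQUENCE.search(head) is not None

A file whose first multibyte character sits beyond that window is therefore not detected as UTF-8, which is exactly what happens with your 17 MB file.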
Also, how can UltraEdit be sure that the byte sequence E2 80 9C (hex codes) should really be interpreted as the UTF-8 character code for the character “ and not as the string â€œ using codepage 1252? Can you answer that question if I give you a file with these 3 bytes? How can you know what I meant with these 3 bytes? Maybe I'm a Russian and the same 3 bytes mean вЂњ, or I'm a Greek and the same 3 bytes mean β€. Do you understand the problem? There must be a rule telling a program which reads the bytes E2 80 9C how to interpret them.
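You can reproduce this ambiguity yourself. Here is a small Python demonstration (my illustration) decoding the same three bytes with different codepages:

    data = bytes([0xE2, 0x80, 0x9C])
    print(data.decode('utf-8'))    # “   (left double quotation mark)
    print(data.decode('cp1252'))   # â€œ (Western European codepage)
    print(data.decode('cp1251'))   # вЂњ (Cyrillic codepage)

The bytes are identical in all three cases; only the decoding rule applied to them differs.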
That's the reason why organizations like the International Organization for Standardization (ISO) or the Unicode Consortium exist. They define standards. Without standards our high tech world can't exist. Unicode is a standard - see About the Unicode Standard.
So what is the real problem? The real problem is that the program which created the 17 MB file you open encodes characters with UTF-8 byte sequences, but has not declared the file as a UTF-8 file with a UTF-8 BOM. If your file is an HTML, XHTML or XML file, then it does not need a BOM, but then it must have at the top of the file a declaration for the UTF-8 encoding. That your file has no BOM and no standardized character encoding declaration means your program ignores all the standards.
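For reference, these standardized declarations look as follows, and both belong at the top of the file:

    XML / XHTML:  <?xml version="1.0" encoding="UTF-8"?>
    HTML:         <meta http-equiv="Content-Type" content="text/html; charset=utf-8">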
UTF-8 is really a special encoding standard. It was defined because many programs can only handle ASCII files and don't support the Unicode standard. With UTF-8 it is possible to encode non-ASCII characters in ASCII files and therefore keep files with non-ASCII characters readable for programs not supporting the Unicode standard. Many interpreters like PHP and Perl are (or were), for example, not capable of correctly interpreting UTF-16 files. They can interpret only ASCII files and ASCII strings, they don't know about the special meaning of 00 00 FE FF (UTF-32, big-endian), FF FE 00 00 (UTF-32, little-endian), FE FF (UTF-16, big-endian), FF FE (UTF-16, little-endian) and EF BB BF (UTF-8) at the top of a text file, and therefore often break with an error if a BOM exists. That is one reason why for HTML, XHTML and XML a special declaration for the encoding using only ASCII characters was standardized - the document writers can use non-ASCII characters, the interpreters not compatible with the Unicode standard can still interpret the files, and the browsers supporting the standards know which encoding is used for the file and can interpret and display the byte stream correctly.
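These BOM byte sequences can be verified, for example, with Python's codecs module, which exposes them as constants (my illustration):

    import codecs

    # Requires Python 3.8+ for the separator argument of bytes.hex().
    print(codecs.BOM_UTF32_BE.hex(' '))  # 00 00 fe ff
    print(codecs.BOM_UTF32_LE.hex(' '))  # ff fe 00 00
    print(codecs.BOM_UTF16_BE.hex(' '))  # fe ff
    print(codecs.BOM_UTF16_LE.hex(' '))  # ff fe
    print(codecs.BOM_UTF8.hex(' '))      # ef bb bf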
Okay, back to your problem. UltraEdit does not scan the whole file for UTF-8 byte sequences, for the reasons described above. So your 17 MB file is opened in ASCII mode. If you now save the file as UTF-8, the bytes of the existing UTF-8 byte sequences are themselves encoded once more with UTF-8. So the character œ, already present in the file as the 2 bytes C5 93 and interpreted with your codepage as Ĺ“, is saved as the 5 bytes C4 B9 E2 80 9C, and now you have garbage. The only solution is to use the special file open option in the file open dialog, or to insert the 3 bytes of the UTF-8 BOM, save the file as ASCII as loaded, close it and open it again.
I think I don't have to explain why UltraEdit converts a whole file detected as UTF-8 into UTF-16 LE, which needs time on larger files. Most characters in a UTF-8 file are encoded with a single byte, others with 2 bytes, some with 3 bytes. That is not very good for a program which does not only display the content, but also allows modifying it with dozens of functions. Converting the UTF-8 file to UTF-16 LE results in a fixed number of bytes per character (at least for all characters of the Basic Multilingual Plane). That makes it efficient to handle the bytes of the characters. Also, in all programming languages I know there is only the choice to use single byte character arrays for strings or double byte Unicode arrays. As already written above, UTF-8 is really something special.
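A quick Python comparison of the byte counts per character (my illustration):

    for ch in ('a', 'ä', '€'):
        utf8_len = len(ch.encode('utf-8'))       # 1, 2 or 3 bytes
        utf16_len = len(ch.encode('utf-16-le'))  # always 2 bytes here
        print(ch, utf8_len, utf16_len)
    # a 1 2
    # ä 2 2
    # € 3 2

With a fixed 2 bytes per character the editor can compute character positions directly from byte offsets, which is what makes editing operations cheap.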
"Why can't UE have a configuration option to assume all opened files are UTF-8 (or any other encoding)?"

That's a suggestion for an enhancement which you can send by email to IDM. But the real problem is the program which created the 17 MB file using UTF-8 encoding without marking the file as a UTF-8 encoded file. If all programs creating UTF-8 files were compatible with the Unicode standard and wrote the encoding information into the file as required by the standards, then all other programs which are already really compatible with the Unicode standard would have no problems reading those files.

Added on 2009-11-09:
I have found an undocumented setting in uedit32.exe of v11.10c and later. By manually adding it to the [Settings] section of uedit32.ini you can force all non-Unicode files (not UTF-16 files) to be read/saved as UTF-8 encoded files. But new files are nevertheless created and saved either as Unicode (UTF-16 LE) or ASCII/ANSI files, except with UE v16.00 and later when the default Encoding Type is set to Create new files as UTF-8. So this special setting is only for already named files. However, creating a new file as ASCII/ANSI with UE < v16.00, saving it with a name, closing it and re-opening it results in a new file encoded in UTF-8. Be careful with that setting: even real ANSI files are loaded with this setting as UTF-8 encoded files, causing all non-ASCII (ANSI) characters to be interpreted wrongly.