Unicode text and Unicode files in UltraEdit/UEStudio
While UltraEdit and UEStudio include handling for Unicode files and characters, you do need to make sure that the editor is configured properly to handle the display of the Unicode data. In this tutorial, we'll cover some of the basics of Unicode-encoded data and how to view and manipulate it in UltraEdit.
To understand how Unicode works, you need to first understand how encoding works. Any text file that you open and edit in UltraEdit is displayed using an encoding. In the simplest terms, an encoding is the rule by which the raw bytes of a file (the values you see in hex mode) are interpreted and displayed in the editor as readable text, which you can then manipulate using your keyboard. Since we know that everything on our computer is composed of 0's and 1's (think The Matrix), you can visualize how encoding works by looking over the following diagram.
Unicode strives to map most of the world's written characters to a single encoding set. This allows you to view Chinese scripts, English alphanumeric characters, Russian and Arabic text all within the same file without having to change the encoding (code page) for each specific text. Prior to Unicode, you would probably have needed to select a different code page (encoding) to see each script, and most of the scripts would not have been viewable at the same time (or at all).
Tim Bray, in his article "On the Goodness of Unicode", explains Unicode in simple terms:
The basics of Unicode are actually pretty simple. It defines a large (and steadily growing) number of characters - just over 100,000 last time I checked. Each character gets a name and a code point, for example LATIN CAPITAL LETTER A is 0041 and TIBETAN SYLLABLE OM is 0F00. Unicode includes a table of useful character properties such as "this is lower case" or "this is a number" or "this is a punctuation mark".
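If you'd like to check the names, code points, and properties from the quote yourself, here is a minimal sketch using Python's standard unicodedata module. The helper name describe is our own and purely illustrative; it has nothing to do with UltraEdit.

```python
import unicodedata


def describe(ch):
    """Return the code point, official name, and general category of one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch)


if __name__ == "__main__":
    # "A" and the Tibetan syllable OM, the two examples from the quote.
    for ch in ("A", "\u0F00"):
        print(describe(ch))

    # Character properties: "this is lower case", "this is a number", and so on.
    print("a".islower())               # lower case letter
    print(unicodedata.category("5"))   # Nd = decimal digit
    print(unicodedata.category("!"))   # Po = other punctuation
```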
So, with this knowledge in mind, an updated diagram for how Unicode encoding works is shown below:
Every encoding works the same way as shown in the above diagram, but the same bytes will (usually) be displayed as different characters depending on which encoding is applied. Unicode is a very robust system that can represent most of the world's written languages in use today.
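The idea that one set of bytes produces different text under different encodings can be sketched in a few lines of Python. This is only an illustration of the principle, not of how UltraEdit works internally; the function name interpret and the choice of code pages 1252 and 1251 are our own assumptions for the example.

```python
def interpret(raw, encodings):
    """Decode the same bytes under each encoding.

    Bytes that are invalid in a given encoding become U+FFFD instead of
    raising an error.
    """
    return {enc: raw.decode(enc, errors="replace") for enc in encodings}


if __name__ == "__main__":
    raw = bytes([0xC3, 0xA9])  # the two bytes that UTF-8 uses for "é"
    for enc, text in interpret(raw, ["utf-8", "cp1252", "cp1251"]).items():
        print(f"{enc:>7}: {text}")
```

The bytes never change; only the interpretation does. Read as UTF-8 they are one character, while a Western European or Cyrillic code page shows two unrelated characters.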
Storing every Unicode character as a fixed-width 16-bit or 32-bit value requires many extra, often unnecessary bytes. Plain English text stored as UTF-16, for example, carries a zero byte alongside every letter. To conserve space (and reduce file size), a variable-width encoding of Unicode called UTF-8 (Unicode Transformation Format, 8-bit) was developed. UTF-8 covers exactly the same Unicode character set, but it stores each character in one to four bytes, and plain ASCII characters take only a single byte. There are other Unicode encodings such as UTF-16, UTF-32, and UTF-7, but UTF-8 is the most popular and widely used Unicode format today. Most websites and many SQL databases are encoded in UTF-8, and UltraEdit and UEStudio support this format as well.
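To make the space trade-off concrete, the following sketch counts how many bytes the same text occupies in three Unicode encodings. The helper encoded_sizes is a hypothetical name of ours, and the "-le" codec variants are used so that no BOM is counted.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16, and UTF-32 (no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


if __name__ == "__main__":
    for sample in ("Hello", "日本語"):
        print(sample, encoded_sizes(sample))

    # In UTF-16, every ASCII letter is followed by a zero byte.
    print("A".encode("utf-16-le").hex(" "))
```

For mostly-ASCII text, UTF-8 is the smallest of the three. For text made up largely of East Asian characters, UTF-16 can be more compact, which is one reason the other encodings still exist.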
For more information on Unicode, we recommend that you read over the following articles:
And of course, be sure to visit the official Unicode site for more detailed information and Unicode updates.
If you have Unicode files that you'd like to open in UltraEdit, you'll need to make sure you set UltraEdit to detect and display Unicode. You can do this by going to Advanced -> Configuration -> File Handling -> Unicode/UTF-8 Detection.
You'll want to make sure that at least the first two options here are checked.
Configuring UltraEdit to detect Unicode is only half of what you need. You'll still need to make sure you're using a font that displays Unicode characters. Many Windows fonts support Unicode characters; you can also install Unicode fonts from the Windows installation CD.
You can change your font by going to View -> Set Font.
You may want to copy and paste Unicode data from an external source to a new file in UltraEdit. Perhaps you've already tried this only to find that the data is displayed as garbage characters, question marks, or something completely different than what you're expecting. This is because new files in UltraEdit are by default created with ASCII encoding, not Unicode/UTF-8. Refer to our diagram above; the hex data is correct for the Unicode characters, but because the encoding has not been set properly, the result is incorrect.
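The question marks come from characters that the file's encoding simply cannot store. The sketch below imitates that effect in Python by round-tripping text through a target encoding. It is an analogy under our own assumptions (the function name paste_into is hypothetical), not a description of UltraEdit's paste routine.

```python
def paste_into(text, file_encoding):
    """Store text in file_encoding and read it back.

    Characters the encoding cannot represent are replaced with "?".
    """
    stored = text.encode(file_encoding, errors="replace")
    return stored.decode(file_encoding)


if __name__ == "__main__":
    sample = "日本語 text"
    print(paste_into(sample, "ascii"))  # non-ASCII characters are lost
    print(paste_into(sample, "utf-8"))  # everything survives
```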
To set the correct encoding for the new file, before actually pasting in the Unicode data, go to File -> Conversions and select ASCII to UTF-8. The conversion is instantaneous, and you will see this reflected in the status bar.
Then paste in your data. You should see your Unicode text!
(Hint: If you plan on working a lot with Unicode text, you may want to go to Advanced -> Configuration -> Editor -> New File Creation, and select the option "Create new files as Unicode". This way you don't need to convert the file before pasting/typing Unicode data.)
A byte order mark (BOM for short) is a short sequence of bytes at the very beginning of a file that is used as a "flag" or "signature" for the encoding and/or byte order that should be used for the file. With UTF-8 encoded data, this is normally hex bytes EF BB BF. For UTF-16 data, the BOM also tells the editor whether the data is in big endian (FE FF) or little endian (FF FE) format. Big endian simply means that the most significant byte of each value is stored first, while little endian stores it last. BOMs are not always essential for displaying Unicode data, but they can save developers headaches when writing and building applications.
If a file contains a BOM, but the application handling the file does not detect or respect it, then the BOM will actually be rendered as part of the text -- usually as junk characters such as "ï»¿" for a UTF-8 BOM or "ÿþ" for a little endian UTF-16 BOM (what the otherwise-invisible BOM bytes look like when read with a Western code page -- again, it all comes back to encoding!).
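Here is a minimal sketch of how a program might check the first bytes of a file for a BOM, and of what those bytes look like when an application ignores them. The BOM constants come from Python's standard codecs module; the function name sniff_bom and the use of code page 1252 for the "junk" rendering are our own assumptions.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 little endian BOM
# (FF FE 00 00) begins with the UTF-16 little endian BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding name, BOM length) or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


if __name__ == "__main__":
    print(sniff_bom(b"\xef\xbb\xbfhello"))
    print(sniff_bom(b"hello"))

    # What the BOM bytes look like when read as code page 1252 text.
    print(codecs.BOM_UTF8.decode("cp1252"))
    print(codecs.BOM_UTF16_LE.decode("cp1252"))
```

A file without a BOM is not necessarily non-Unicode; a check like this only tells you that no signature is present.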
If you're opening files in UltraEdit and seeing these "junk" characters at the beginning of the file, this means you have not set the above-mentioned Unicode detection options properly. Conversely, if you're saving Unicode files that others are opening with other programs that show these junk characters, then the other programs are either unable or not configured to properly handle BOMs and Unicode data.
More information on BOMs and the different endians/UTF formats is available on the official Unicode website.
If you'd like to globally configure UltraEdit to save all UTF-8 files with BOMs, you can set this by going to Advanced -> Configuration -> File Handling -> Save. The first two options here, "Write UTF-8 BOM header to all UTF-8 files when saved" and "Write UTF-8 BOM on new files created within this program (if above is not set)" should be checked. Conversely, if you do NOT want the BOMs, make sure these are NOT checked.
You can also save UTF-8 files with BOMs on a per-file basis. In the File -> Save As dialog, there are several options in the "Format" drop-down list box for Unicode formatting with and without BOMs.
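To see what the BOM option actually changes in the saved bytes, here is a small sketch using Python's built-in "utf-8" and "utf-8-sig" codecs. The function name save_bytes is hypothetical, and the code only mirrors the effect of the setting; it does not reproduce UltraEdit's save logic.

```python
def save_bytes(text, write_bom):
    """Encode text as UTF-8, optionally prefixed with the EF BB BF signature."""
    return text.encode("utf-8-sig" if write_bom else "utf-8")


if __name__ == "__main__":
    with_bom = save_bytes("hi", write_bom=True)
    without_bom = save_bytes("hi", write_bom=False)
    print(with_bom.hex(" "))
    print(without_bom.hex(" "))

    # A BOM-aware reader strips the signature; a plain UTF-8 reader keeps it
    # as an invisible U+FEFF character at the start of the text.
    print(repr(with_bom.decode("utf-8-sig")))
    print(repr(with_bom.decode("utf-8")))
```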
UltraEdit does provide a way for you to convert Unicode-based files back to regular ASCII files. This is a very simple process, but comes with a very important caveat:
Make sure you set your codepage to match the data you want to convert before you do the conversion!
Remember, it all comes back to encoding. Unicode is the holy grail of encoding in that it can encode and decode virtually any code or script system such as Chinese, Arabic, Russian, and more. ASCII, however, cannot; ASCII relies upon the user to set the correct code page in order to interpret the data properly. The encoding must be set properly prior to the conversion or your data will not display correctly and you may corrupt your file. For instance, if you have a Unicode file containing Japanese characters that you'd like to convert to standard ASCII, you'd need to go to View -> Set Code Page and select code page 932 (or some other Japanese code page from this menu). To actually convert the file, go to File -> Conversions -> Unicode to ASCII or UTF-8 to ASCII.
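The importance of choosing the right code page before converting can be illustrated with a short Python sketch. It converts the same Japanese text to two different code pages: 932, which can hold it, and 1252, which cannot. The function name to_code_page is our own, and Python's "cp932" and "cp1252" codecs stand in for the code pages you would select in the editor.

```python
def to_code_page(text, code_page):
    """Convert Unicode text to a legacy code page.

    Characters the code page cannot represent are replaced with "?",
    which is exactly the kind of data loss the caveat warns about.
    """
    return text.encode(code_page, errors="replace")


if __name__ == "__main__":
    japanese = "日本語"

    good = to_code_page(japanese, "cp932")   # Japanese code page
    bad = to_code_page(japanese, "cp1252")   # Western European code page

    print(good.decode("cp932"))   # round-trips cleanly
    print(bad.decode("cp1252"))   # the original characters are gone
```

Once the characters have been replaced, no later change of code page brings them back, so it is worth keeping a copy of the Unicode original.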
If you'd like to see the extensive character support of a Unicode font, you can access the font's Character Map by (in Windows) going to Start -> Run, typing "charmap", pressing OK, then selecting the Unicode font. Unicode is a very complex system with thousands of characters, but it has been set up and refined to be easily accessed and used by anyone. Unicode is a solution that can help you reach global audiences with its robust character encoding whether you're a programmer, web developer, or a technical writer.