Short utf-8 charset declaration in HTML5 header

Post by shawnAnderson » Wed Apr 03, 2013 1:23 pm

I have an HTML5 template I use to start each new web page.
I always use <meta charset="utf-8"> in the HTML head.

I did this on a new page, and uploaded it to the server. I validated it using the W3C HTML5 validator, but it gave me an error
saying the page wasn't UTF-8, but instead was windows-1252.

Where does the charset specification come from?
Does it come from the file that UE creates, or does the web server designate it?

I've never run into this problem before.

thanks
shawnAnderson

Re: Short utf-8 charset declaration in HTML5 header

Post by Mofi » Thu Apr 04, 2013 12:41 am

The charset declaration comes from you, and you have to make sure that the characters in the file are really encoded as declared at the top of the HTML5 file. The status bar at the bottom of the UltraEdit main window shows which encoding UltraEdit is currently using for a file. UTF-8 (new status bar in UE v19.00) or U8- (basic status bar in UE v19.00 and all previous versions of UE) indicates that the file is UTF-8 encoded. If the status bar shows only the line terminator type (DOS, UNIX, MAC) or an ANSI code page (new status bar in UE v19.00), the file is ANSI encoded.

The article Character encodings on the W3C website explains how the character set, that is the encoding, should be declared in an HTML, XHTML or XML file.

UltraEdit detects a UTF-8 encoded file by any of the following:

  • UTF-8 BOM at beginning of a file (not recommended for HTML files)
  • One of the following four strings is found at top of the file (within the first 1024 bytes):
    charset=UTF-8, charset=utf-8, encoding="UTF-8, encoding="utf-8
  • Within the first 64 KB at least one byte sequence is found which looks like a UTF-8 character encoding sequence.
As described in HTML 5.1 Nightly - Specifying the document's character encoding, the short character set declaration you use is valid in HTML5. But because charset="utf-8 is not yet recognized by UltraEdit, the HTML5 file is opened as an ASCII/ANSI file if there is no UTF-8 byte sequence within the first 64 KB.
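To illustrate why the short declaration slips through, here is a minimal Python sketch of the three detection rules listed above. It is only an approximation written from the description in this post, not UltraEdit's actual code; the function names and return strings are made up for the example.

```python
import codecs

# The four literal strings quoted in the list above.
DECLARATIONS = (b'charset=UTF-8', b'charset=utf-8',
                b'encoding="UTF-8', b'encoding="utf-8')


def has_multibyte_utf8(data: bytes) -> bool:
    """Return True if data contains at least one valid multi-byte UTF-8 sequence."""
    i = 0
    while i < len(data):
        lead = data[i]
        if 0xC2 <= lead <= 0xDF:
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF4:
            length = 4
        else:
            # ASCII byte, stray continuation byte, or invalid lead byte.
            i += 1
            continue
        try:
            data[i:i + length].decode("utf-8")
            return True
        except UnicodeDecodeError:
            i += 1
    return False


def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: apply the three rules in the order listed."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 (BOM)"
    if any(decl in data[:1024] for decl in DECLARATIONS):
        return "utf-8 (declaration)"
    if has_multibyte_utf8(data[:64 * 1024]):
        return "utf-8 (byte sequence)"
    return "ansi"


# The short HTML5 form has a quote right after "charset=", so none of the
# four strings match, and a pure-ASCII page has no multi-byte sequence.
short_form = b'<meta charset="utf-8">\n<p>plain ascii</p>'
long_form = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
print(guess_encoding(short_form))
print(guess_encoding(long_form))
```

With these rules, a template that contains only ASCII characters and the short declaration falls through to the ANSI case, which matches the behavior described above.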

Entering a character with a code value greater than 127 then results in that character being stored in an encoding that does not match the character set declaration at the top of the HTML5 file.
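The mismatch can be seen at the byte level. The following Python sketch assumes the ANSI code page is windows-1252 (the one the validator reported) and uses a made-up sample line; `is_valid_utf8` is a helper defined only for this example.

```python
# Sample page content with one character above code value 127 (é).
text = '<meta charset="utf-8"><p>café</p>'

ansi_bytes = text.encode("cp1252")  # what an ANSI (windows-1252) save produces
utf8_bytes = text.encode("utf-8")   # what the declaration promises


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# In windows-1252 the é is the single byte 0xE9; in UTF-8 it is 0xC3 0xA9.
print(ansi_bytes[-6:], is_valid_utf8(ansi_bytes))
print(utf8_bytes[-7:], is_valid_utf8(utf8_bytes))
```

A lone 0xE9 byte followed by ASCII is not a valid UTF-8 sequence, so a file saved this way no longer matches its own charset declaration.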

Solution:

  • Select Create new files as UTF-8 at Advanced - Configuration - Editor - New File Creation.
  • At Advanced - Configuration - File Handling - Save, uncheck
    Write UTF-8 BOM header to all UTF-8 files when saved
    and
    Write UTF-8 BOM on new files created within this program
  • While UltraEdit is not running, open %appdata%\IDMComp\UltraEdit\uedit32.ini with Notepad, add the line Force UTF-8=1 to the group [Settings], and save the modified INI file.
New files are now encoded in UTF-8 by default, as required for your HTML5 files, and all files not detected as UTF-16 encoded are always interpreted as UTF-8 encoded files.
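For anyone who would rather script the INI change than edit it by hand, here is a small Python sketch of the text transformation only. It works on a string and does not read or write uedit32.ini; the function name is invented for the example, and the assumption that the file is plain text with a [Settings] header line is mine. Back up the real file before trying anything like this.

```python
def add_force_utf8(ini_text: str) -> str:
    """Return ini_text with 'Force UTF-8=1' added under [Settings].

    Simplification: if any line already starts with 'Force UTF-8=',
    the text is returned unchanged, whichever group that line is in.
    """
    lines = ini_text.splitlines()
    if any(line.strip().lower().startswith("force utf-8=") for line in lines):
        return ini_text

    out = []
    inserted = False
    for line in lines:
        out.append(line)
        # Insert directly below the first [Settings] header.
        if not inserted and line.strip().lower() == "[settings]":
            out.append("Force UTF-8=1")
            inserted = True

    if not inserted:
        # No [Settings] group found: append one at the end.
        out += ["[Settings]", "Force UTF-8=1"]
    return "\n".join(out) + "\n"


sample = "[Settings]\nSomeOtherKey=0\n"  # made-up INI content
print(add_force_utf8(sample))
```

The sketch joins lines with "\n"; when writing a real INI file on Windows, the original line endings and file encoding would need to be preserved.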

If you need to open an ASCII/ANSI encoded file, such as an UltraEdit script file, use the Open As option with ASCII selected in the File Open dialog to override the Force UTF-8=1 setting for that file.

I have sent an enhancement request to IDM support by email asking for HTML5 character set declarations to be supported as well. It would be best if you did the same, so that the request count is already 2. The more users request an enhancement, the higher its priority for being implemented.

Re: Short utf-8 charset declaration in HTML5 header

Post by shawnAnderson » Thu Apr 04, 2013 9:45 am

A very complete answer, thanks.

I will submit the request.
thanks
shawnAnderson