File Formats and encoding

Post by CWBillow » Mon Jul 09, 2007 7:43 pm

What's the difference between ANSI and DOS encoding?

Regards,
Chuck Billow

Re: File Formats and encoding

Post by Mofi » Tue Jul 10, 2007 6:32 am

If by DOS you mean the OEM character set, then the main difference lies in the upper 128 characters of the code page (byte values 128 to 255). The lower 128 characters are identical in the ANSI and OEM character sets: they are the ASCII characters (ignoring the control codes).

If you write only in English you will not see any difference for normal text.

Go to the Wikipedia page "Code page", open the article about OEM code page 437, and open the article about ANSI code page 1252 in a second tab or window. Compare the characters and you will see the difference. The Wikipedia pages also explain code pages very well and are worth reading.

The OEM code page 437 is often used for creating small drawings with characters in a text file because it contains lots of "graphic" characters.
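
For anyone who prefers to see the comparison in code rather than on Wikipedia, here is a minimal Python sketch. It assumes Python's built-in "cp437" and "cp1252" codecs are acceptable stand-ins for the OEM and ANSI code pages discussed above; the function names are made up for this illustration and have nothing to do with UltraEdit.

```python
def lower_half_identical(codec_a="cp437", codec_b="cp1252"):
    """Check that byte values 0..127 decode to the same characters in both codecs."""
    raw = bytes(range(0x80))
    return raw.decode(codec_a) == raw.decode(codec_b)


def upper_half_differences(codec_a="cp437", codec_b="cp1252"):
    """Return {byte value: (char_a, char_b)} for bytes 128..255 that decode differently.

    errors="replace" is used because Python's cp1252 codec leaves a few
    byte values undefined and would otherwise raise UnicodeDecodeError.
    """
    diffs = {}
    for value in range(0x80, 0x100):
        raw = bytes([value])
        char_a = raw.decode(codec_a, errors="replace")
        char_b = raw.decode(codec_b, errors="replace")
        if char_a != char_b:
            diffs[value] = (char_a, char_b)
    return diffs
```

As a usage example, the three bytes C9 CD BB (hex) decode to the box-drawing characters ╔═╗ with "cp437", but to ÉÍ» with "cp1252", which is exactly the kind of garbage you see when a character drawing is opened with the wrong code page.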

Re: File Formats and encoding

Post by CWBillow » Tue Jul 10, 2007 7:09 am

Mofi:

And nobody has thought to put all these in one code page, why? It would be nice, clean and certainly useful. I guess that answers why not, huh?

Thanks for the help.

Regards,
Chuck Billow

Re: File Formats and encoding

Post by Mofi » Tue Jul 10, 2007 1:38 pm

CWBillow wrote: And nobody has thought to put all these in one code page, why?

Because a single-byte code page can contain only 256 characters. One byte has 8 bits, so one byte can encode only 2^8 = 256 different characters. That is the reason why multiple code pages exist.
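
A small Python sketch of that limit, again assuming the built-in "cp437" and "cp1252" codecs as stand-ins for the OEM and ANSI code pages (the helper name `encodable` is just for this example): a character that has no slot among the 256 byte values of a code page simply cannot be stored with it.

```python
# One byte has 8 bits, so it can take 2**8 distinct values.
BYTE_VALUES = 2 ** 8


def encodable(text, codec):
    """Return True if every character of text has a byte value in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False
```

For example, the box-drawing character ╔ is encodable in "cp437" but not in "cp1252", while the euro sign € is encodable in "cp1252" but not in "cp437". Neither code page has room for both.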

As the Wikipedia page I pointed to also mentions, there is now Unicode, which encodes the characters of all languages as well as graphic, mathematical and symbol characters.

But most text files are still single-byte encoded files and not Unicode files. Also, most fonts contain only certain code pages, or even only parts of a code page, and not the full Unicode character table. So there is always a problem when a text file (1 byte per character) must be converted to a Unicode file (in UTF-16, 2 bytes for most characters) or vice versa.

UTF-8 and ASCII-escaped Unicode files are a mixture of the two approaches: they can use the full range of Unicode characters, but still encode the most frequently needed ASCII characters with a single byte each. Only the non-ASCII characters are encoded with special sequences (multi-byte sequences in UTF-8, escape sequences in ASCII-escaped Unicode). That reduces the file size a lot, which is why UTF-8 is used heavily for web pages: it supports all characters, but for many (especially European) languages the HTML files are only slightly larger than when encoded in ANSI.
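
To make the size argument concrete, here is a short Python sketch that counts the bytes the same text needs in three encodings. It assumes "cp1252" for ANSI and "utf-16-le" (no byte order mark) for a 2-byte Unicode file; the function name `encoded_sizes` is hypothetical.

```python
def encoded_sizes(text):
    """Return the number of bytes text occupies in ANSI (cp1252), UTF-8 and UTF-16."""
    return {
        # Characters missing from cp1252 are replaced by "?" instead of raising.
        "cp1252": len(text.encode("cp1252", errors="replace")),
        "utf-8": len(text.encode("utf-8")),
        # Little-endian UTF-16 without a byte order mark.
        "utf-16-le": len(text.encode("utf-16-le")),
    }
```

For the German word "Grüße" (5 characters) this gives 5 bytes in cp1252, 7 bytes in UTF-8 (ü and ß take 2 bytes each) and 10 bytes in UTF-16. For pure ASCII text, UTF-8 is exactly as large as ANSI, while UTF-16 is twice as large.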

Re: File Formats and encoding

Post by CWBillow » Tue Jul 10, 2007 4:31 pm

Now I AM worried: It's BEGINNING to make sense!

Thanks,
Chuck Billow

Re: File Formats and encoding

Post by Peter » Wed Jul 11, 2007 2:15 pm

Hello

I have a problem with two kinds of files, both created by other software. Both have plain "text content", but UE opens one of them without comment (maybe a DOS file?), while on opening the other file UE always asks "Convert to DOS?"

What is the difference? The code page behind the files?

Best regards

Peter

Re: File Formats and encoding

Post by Mofi » Wed Jul 11, 2007 2:54 pm

No, this message has nothing to do with the character set in a code page.

This message is shown when the file does not use the DOS (Windows) line termination: carriage return (hex 0D) + line feed (hex 0A), which are written as \r and \n in many programming languages.

Your "non-DOS file" is most likely a Unix file, which uses only the line feed (\n) as line termination, as is standard for Unix/Linux operating systems. On classic Mac systems (before Mac OS X), only the carriage return (\r) is used as line termination.

The handling of a file with non-DOS line terminations can be configured at Configuration - File Handling - DOS/UNIX/MAC Handling. Read the help page for this dialog for further details.
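
If you want to check outside the editor which kind of line termination a file uses, a rough Python sketch could look like this. It is only an illustration of the three conventions described above, not how UltraEdit itself detects them, and the function names are invented for this example.

```python
def count_line_endings(data):
    """Count DOS (CR+LF), Unix (LF only) and Mac (CR only) line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CR+LF pair contains one CR and one LF, so subtract the pairs
    # to get the line endings that consist of a single byte only.
    lone_lf = data.count(b"\n") - crlf
    lone_cr = data.count(b"\r") - crlf
    return {"dos": crlf, "unix": lone_lf, "mac": lone_cr}


def to_dos(data):
    """Convert all line endings to DOS style (CR+LF)."""
    # Normalize everything to LF first so existing CR+LF pairs are not doubled.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", b"\r\n")
```

Read the file in binary mode, for example with `open(path, "rb").read()`, so that Python does not translate the line endings before you count them. A file that reports only "unix" line endings is the kind for which UE asks "Convert to DOS?".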

Re: File Formats and encoding

Post by Peter » Wed Jul 11, 2007 7:09 pm

Thanks Mofi.

Peter