Bug in Base64 Encoding/Decoding for Unicode text

Postby tuyen » Mon Apr 26, 2010 10:08 pm

There's a bug with the built-in Base64 encoding/decoding routines in UltraEdit 16.0.1038. I assume the bug also exists in previous versions, but I haven't tested it.

The encoding/decoding routines work fine when performed on ASCII plain text. But if you try encoding some Unicode text, and then try to decode it, the bug becomes apparent.
tuyen
Basic User
Basic User
 
Posts: 11
Joined: Wed Apr 29, 2009 9:02 pm

Re: Bug in Base64 Encoding/Decoding for Unicode text

Postby Mofi » Tue Apr 27, 2010 12:34 am

Can you explain the problem in more detail, with a step-by-step list of how to reproduce it? I can't reproduce it using UE v16.00.0.1038.

For testing, I took one of my ANSI HTML files (6468 bytes), encoded it, and decoded it: no problem. Next I converted the file to Unicode (UTF-16 LE) and saved it. I encoded and decoded it again: no problem. I added some characters that really need 2 bytes in Unicode (German umlauts) and saved the Unicode file. I encoded and decoded it: no problem. Then I encoded it once more, copied the encoded string to a new ANSI file, and decoded it there: also the correct result.

Of course, encoding Unicode text that contains characters not available in the active code page and then decoding it in an ANSI file results in wrong characters. But that is not a problem of the Base64 encoding/decoding routines. It is the general problem of Unicode to ANSI conversion for characters missing from the ANSI code page. The encoded data stream does not contain any information about the type of the input data stream. Email programs solve that problem by adding information about the original data (file name, content type, etc.) as plain text above the encoded data stream so that the data can be decoded correctly.
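
As a minimal sketch of the point about missing type information, assuming Python's standard base64 module and Windows-1252 as a stand-in for the ANSI code page (both are assumptions for illustration, not anything UltraEdit is known to use), the same decoded bytes read differently depending on which encoding the reader picks:

```python
import base64

# Base64 payload produced from UTF-16 LE text. The payload itself
# carries no hint about the original text encoding.
payload = base64.b64encode("привет".encode("utf-16-le"))

# Base64 decoding only gives back raw bytes.
raw = base64.b64decode(payload)

# Reader that knows the original encoding recovers the text.
as_utf16 = raw.decode("utf-16-le")

# Reader that assumes a single-byte "ANSI" code page (cp1252 here,
# chosen only as an example) gets different characters.
as_ansi = raw.decode("cp1252", errors="replace")
```

Here the Base64 step is lossless in both cases; only the interpretation of the decoded bytes differs.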
Mofi
Grand Master
Grand Master
 
Posts: 4055
Joined: Thu Jul 29, 2004 11:00 pm
Location: Vienna

Re: Bug in Base64 Encoding/Decoding for Unicode text

Postby tuyen » Tue Apr 27, 2010 1:32 am

Mofi wrote: Can you explain the problem in more detail, with a step-by-step list of how to reproduce it? I can't reproduce it using UE v16.00.0.1038.


1) Create a new file in UltraEdit. Then go to the File Menu ---> Conversions ---> ASCII to Unicode
2) Type the word "hello" into the editor
3) Highlight the word with your keyboard or mouse
4) Edit Menu ---> Encode Base64
5) Now highlight the encoded text
6) Edit Menu ---> Decode Base64

The decoded text will be exactly the same as what you started with. So as you can see, normal ASCII characters are encoded correctly.


Now let's try encoding something with non-ASCII characters.

1) Erase all the text in your editor window
2) Copy and paste the word "привет" into your editor window
3) Highlight the word with your keyboard or mouse
4) Edit Menu ---> Encode Base64
5) Now highlight the encoded text
6) Edit Menu ---> Decode Base64

As you can see, the decoded text is not what you started with, and therefore we can see that the problem is with the handling of Base64 encoding of non-ASCII text.
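
For comparison, here is a minimal sketch of the round trip those steps are expected to produce, written with Python's standard base64 module. The helper name base64_round_trip and the choice of UTF-16 LE as the default are assumptions for illustration; this says nothing about how UltraEdit implements it.

```python
import base64

def base64_round_trip(text: str, encoding: str = "utf-16-le") -> str:
    """Encode text to bytes, Base64-encode them, then reverse both steps."""
    raw = text.encode(encoding)            # text -> bytes in the chosen encoding
    encoded = base64.b64encode(raw)        # bytes -> Base64 (ASCII-only) bytes
    decoded = base64.b64decode(encoded)    # Base64 -> the original bytes
    return decoded.decode(encoding)        # bytes -> text

# Both the ASCII word and the Cyrillic word should survive unchanged.
for word in ("hello", "привет"):
    assert base64_round_trip(word) == word
```

A correct implementation returns the starting text for both words, because Base64 operates on the bytes and does not care which characters they represent.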

Re: Bug in Base64 Encoding/Decoding for Unicode text

Postby Mofi » Tue Apr 27, 2010 9:23 am

Okay, with those details I could reproduce the problem using UE v16.00.0.1029. Please report this issue by email to IDM support.

Re: Bug in Base64 Encoding/Decoding for Unicode text

Postby tuyen » Thu Apr 29, 2010 5:39 pm

Done. And I received a response from IDM to let me know that they're able to reproduce the problem, and they're working on fixing it.

Re: Bug in Base64 Encoding/Decoding for Unicode text

Postby tuyen » Sat Dec 11, 2010 4:28 pm

Seriously?
After almost 8 months and several new version updates, this simple bug still hasn't been fixed.

Re: Bug in Base64 Encoding/Decoding for Unicode text

Postby Mofi » Sun Mar 13, 2011 11:07 am

I looked at this issue with UE v17.00, and the word привет still does not come back after encoding and decoding it with Base64.

In UTF-16 LE, привет is 3F 04 40 04 38 04 32 04 35 04 42 04 in hexadecimal. Omitting the high bytes (value 04) leaves bytes that read as ?@825B in ASCII. Using Encode Base64 on the UTF-16 string привет results in P0A4MjVC. Using Encode Base64 on the ASCII string ?@825B also results in P0A4MjVC.
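
The byte values and the P0A4MjVC result can be checked with a short sketch using Python's standard base64 module. The variable names are illustrative only, and the slice that drops the high bytes merely imitates the observed output; it is not a claim about what UltraEdit does internally.

```python
import base64

word = "привет"

# Full UTF-16 LE byte sequence: two bytes per character here.
utf16 = word.encode("utf-16-le")
hex_dump = " ".join(f"{b:02X}" for b in utf16)

# Keep only every other byte, i.e. the low byte of each code unit.
low_bytes = utf16[0::2]

# Base64 of the low bytes only versus Base64 of all the bytes.
low_b64 = base64.b64encode(low_bytes).decode("ascii")
full_b64 = base64.b64encode(utf16).decode("ascii")

print(hex_dump)
print(low_bytes.decode("ascii"), low_b64)
print(full_b64)
```

The Base64 string for the full byte sequence is twice as long as the one for the low bytes, so an encoder that returns P0A4MjVC for the UTF-16 text has discarded half of the input before encoding.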

So I wanted to know more about Base64 encoding and read (not entirely) the Wikipedia article about Base64. According to this article, Base64 encoding is only for ASCII strings (single-byte strings). For encoding UTF-16 characters, UTF-7 encoding is needed, which is also called modified Base64.

Now the question is which Base64 encoding is actually implemented in UltraEdit: the standard Base64 or the modified Base64? It looks like standard Base64.

Of course, standard Base64 encoding can be used for binary files. So the UTF-16 characters could be read as a byte array, as in hex edit mode, which means encoding UTF-16 characters with standard Base64 is also possible. But why was UTF-7 encoding introduced if standard Base64 can be used simply by reading the UTF-16 strings as a binary data stream?

I don't want to read all the RFCs to get an answer to that question, because for my work with UltraEdit that is totally unimportant. But perhaps my post explains why Base64 encoding is not working for UTF-16 characters in UltraEdit.

Re: Bug in Base64 Encoding/Decoding for Unicode text

Postby tuyen » Sat May 21, 2011 2:36 pm

Mofi wrote: I don't want to read all the RFCs to get an answer to that question, because for my work with UltraEdit that is totally unimportant. But perhaps my post explains why Base64 encoding is not working for UTF-16 characters in UltraEdit.


You're over-complicating this issue. It has nothing to do with "standard" or "modified" Base64 encoding, and reading every RFC in the world will not help.
The bug in UltraEdit is the result of an improper pointer operation in the string handling routine. The bug is present not only for UTF-16 strings, but also when you specifically tell UltraEdit to produce UTF-8 (on the File menu).

I use Base64 encoding in some of my applications which process string values, and the original unmodified Base64 routines from 20 years ago work perfectly on both UTF-8 and UTF-16 strings. The key thing to making it work properly is to account for the correct character size/length (and therefore the correct pointer operation) when referencing them. When I started encoding Unicode text in my programs a couple of years ago, I had the exact same problem which is currently in UltraEdit. I fixed it without reading any RFCs. The only thing I did was some good old-fashioned debugging and some trial-and-error.
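
To illustrate the general kind of size/length mistake described here, below is a hypothetical sketch in Python. It is not UltraEdit's code and not the routine from my own programs; the function names are made up, and the "wrong" variant only simulates handing the encoder a character count where a byte count is required.

```python
import base64

def encode_text(text: str, encoding: str) -> str:
    """Correct: convert to bytes first, then encode every byte."""
    data = text.encode(encoding)
    return base64.b64encode(data).decode("ascii")

def encode_text_wrong_length(text: str, encoding: str) -> str:
    """Hypothetical bug: the character count is used as the byte count."""
    data = text.encode(encoding)
    # len(text) counts characters, not bytes, so multi-byte text is cut short.
    return base64.b64encode(data[:len(text)]).decode("ascii")

def decode_text(b64: str, encoding: str) -> str:
    """Reverse step: Base64 -> bytes -> text."""
    return base64.b64decode(b64).decode(encoding, errors="replace")

good = decode_text(encode_text("привет", "utf-16-le"), "utf-16-le")
bad = decode_text(encode_text_wrong_length("привет", "utf-16-le"), "utf-16-le")
print(good == "привет", len(bad))
```

With pure ASCII text in a single-byte encoding the two variants agree, which is why this class of bug tends to stay hidden until non-ASCII or multi-byte text is encoded.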

I had hoped it would be fixed in UltraEdit by now, but I guess the programmers are too busy doing much more important things like replacing icons and implementing a new serial number registration system. :roll: