Unicode data corrupted when sent to an array

Postby thesleeve » Fri Feb 17, 2012 12:01 pm

Hi everyone,

This is my first post on an UltraEdit forum. I'm writing a script to take a long excerpt of Unicode text (in Japanese), break it into sentences, and load each sentence as a string into an array. I will process the array later.

Everything works just fine, up until the point when I check the contents of the array. It seems that some of the characters are getting corrupted. I'm guessing there's some sort of formatting issue, but I really don't know how to solve the problem.

First, here's my JavaScript code:

Code: Select all
// Ask the user how each sentence entry ends (typically, this is the Japanese period character)
var strEntryTerminator = UltraEdit.getString("What ends each sentence?",1);

// Report what the user inputted to the debug window
UltraEdit.outputWindow.write("Entry terminator is" + strEntryTerminator);

// Use DOS-style line terminator for Windows Notepad Unicode .txt files
var lineTerminator = "\r\n";

// Establish our search string for the loop condition
UltraEdit.activeDocument.top();
UltraEdit.activeDocument.findReplace.mode=0; //Replace all in current file
UltraEdit.activeDocument.findReplace.replaceAll=true; //Replace all instances

// Remove all line terminators in the file, making the data one continuous line of text
UltraEdit.activeDocument.findReplace.replace(lineTerminator, "");

// Remove multiple spaces (up to ten) so that there is a maximum of one space between text
var SpaceDeletion = 1;
while (SpaceDeletion < 10) {
  UltraEdit.activeDocument.findReplace.replace("  ", " ");
  SpaceDeletion++;
}

// Replace a period plus a space with a period (removing leading spaces from entries)
UltraEdit.activeDocument.findReplace.replace(strEntryTerminator + " ", strEntryTerminator);

// Replace a period with a period plus a terminator, which will put each sentence on its own line
UltraEdit.activeDocument.findReplace.replace(strEntryTerminator, strEntryTerminator + lineTerminator);

// Select all data in the document
UltraEdit.activeDocument.selectAll();

// Selection becomes variable
var mySelection = UltraEdit.activeDocument.selection;

// Split lines at lineTerminator and load them into an array
var resultArr = mySelection.split(lineTerminator);

// Display total number of records in debug window
var resultLength = resultArr.length;
UltraEdit.outputWindow.write(resultLength + " total entries");

// Write array values in debug window
for (var i = 0; i < resultArr.length; i++) {
  UltraEdit.outputWindow.write("Value: " + i + " \"" + resultArr[i]);
}
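
The same clean-up steps can also be written against a plain string, which makes them easy to check outside the editor. This is only a minimal sketch: `splitSentences` is a hypothetical helper name, it assumes the text has already reached a JavaScript string intact, and unlike the script above it collapses any run of spaces in one pass and drops the trailing empty entry.

```javascript
// Hypothetical helper mirroring the script's steps on a plain string.
function splitSentences(text, terminator) {
  // Remove all DOS line terminators, making the data one continuous line
  var oneLine = text.split("\r\n").join("");
  // Collapse every run of two or more spaces into a single space
  oneLine = oneLine.replace(/ {2,}/g, " ");
  // Remove the space that follows a terminator (leading space of an entry)
  oneLine = oneLine.split(terminator + " ").join(terminator);

  var parts = oneLine.split(terminator);
  var sentences = [];
  for (var i = 0; i < parts.length; i++) {
    if (i < parts.length - 1) {
      // Every piece except the last was followed by a terminator
      sentences.push(parts[i] + terminator);
    } else if (parts[i] !== "") {
      // Trailing text that has no terminator is kept as-is
      sentences.push(parts[i]);
    }
  }
  return sentences;
}
```

The terminator is treated as a literal string rather than a regular expression, so a character such as "." needs no escaping.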


Everything seems to work just fine. However, some characters are corrupted in the process.

For example, if my input text is this:
Code: Select all
ブラックホール(英語:black hole)とは、きわめて高密度で大質量で、きわめて強い重力のために、物質だけでなく光さえも脱出できない天体のこと[1]。
きわめて強い重力のために光さえも抜け出せなくなった時空の領域、とされている。
「ブラック・ホール」(黒い穴)という名は、アメリカの物理学者ジョン・ホイーラーが1967年にこうした天体を呼ぶために編み出した[2]。
それ以前は「collapsar[3] コラプサー」(崩壊した星)などと呼ばれていた。


...the output debug window shows this:

Code: Select all
Running script: C:\Program Files\IDM Computer Solutions\UltraEdit\scripts\JapaneseDocumentToSRS.js
========================================================================================================
Entry terminator is縲・
5 total entries
Value: 0 "・スu・ス・ス・スb・スN・スz・ス[・ス・ス・スi・スp・ス・スFblack hole・スj・スニは、・ス・ス・ス・ス゚て搾ソス・ス・ス・スx・スナ大質・スハで、・ス・ス・ス・ス゚て具ソス・ス・ス・スd・スヘのゑソス・ス゚に、・ス・ス・ス・ス・ス・ス・ス・ス・スナなゑソス・ス・ス・ス・ス・ス・ス・ス・スE・スo・スナゑソス・スネゑソス・スV・スフのゑソス・ス・ス[1]・スB
Value: 1 "・ス・ス・ス・ス゚て具ソス・ス・ス・スd・スヘのゑソス・ス゚に鯉ソス・ス・ス・ス・ス・ス・ス・ス・ス・ス・スo・ス・ス・スネゑソス・スネゑソス・ス・ス・ス・ス・ス・スフ領茨ソスA・スニゑソス・ス・ストゑソス・ス・スB
Value: 2 "縲後ヶ繝ゥ繝・け繝サ繝帙・繝ォ縲搾シ磯サ偵>遨エ・峨→縺・≧蜷阪・縲√い繝。繝ェ繧ォ縺ョ迚ゥ逅・ュヲ閠・ず繝ァ繝ウ繝サ繝帙う繝シ繝ゥ繝シ縺・967蟷エ縺ォ縺薙≧縺励◆螟ゥ菴薙r蜻シ縺カ縺溘a縺ォ邱ィ縺ソ蜃コ縺励◆[2]縲・
Value: 3 "・ス・ス・ス・スネ前・スヘ「collapsar[3] ・スR・ス・ス・スv・スT・ス[・スv・スi・ス・ス・スオゑソス・ス・ス・スj・スネどと呼ばゑソストゑソス・ス・ス・スB
Value: 4 "


Any ideas?
thesleeve
Newbie
 
Posts: 4
Joined: Fri Feb 17, 2012 11:50 am

Re: Unicode data corrupted when sent to an array

Postby thesleeve » Fri Feb 17, 2012 1:27 pm

OK, I've greatly narrowed down the cause of the problem.

I created a very simple script just to see if the data is being stored correctly as a string.
Code: Select all
var mySelection = UltraEdit.activeDocument.selection;
UltraEdit.outputWindow.write(typeof(mySelection) + mySelection);


This in essence takes the highlighted text in the document and displays to the user the type of data and the value of the data.

When I highlight あいうえお, the output window shows:
Code: Select all
stringあいうえお


Success! No problem there.
However, when I try some other characters... なにぬねの,
the output window shows something like:
Code: Select all
string化ã


Obviously, some of the characters are encoding correctly, and some are not. I have no idea what is happening.

Any clues?

Re: Unicode data corrupted when sent to an array

Postby thesleeve » Fri Feb 17, 2012 1:35 pm

My current guess is that the document uses one Unicode encoding while the JavaScript interface uses a different one, and that somewhere in the data transfer bits are being truncated. That would push some characters "out of range", so they get corrupted in the process.

Any idea how we can tell UltraEdit's scripting interface how it should handle and store string data? That would probably solve the problem.
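
The kind of mismatch suspected here can be reproduced outside UltraEdit. The following is a minimal sketch, assuming Node.js (`Buffer` is a Node API, not part of UltraEdit scripting); both function names are hypothetical, and it only illustrates what happens when UTF-8 bytes are read back with a one-byte-per-character encoding. It does not claim that this is the conversion UltraEdit actually performs.

```javascript
// Assumption: runs under Node.js, where Buffer is available globally.

// Encode the text as UTF-8 bytes, then read every byte back as one
// Latin-1 character. Each Japanese character becomes three garbage characters.
function misdecodeUtf8AsLatin1(text) {
  return Buffer.from(text, "utf8").toString("latin1");
}

// Reverse the mistake: turn each garbage character back into its byte
// and decode the byte sequence as UTF-8 again.
function repairLatin1AsUtf8(garbled) {
  return Buffer.from(garbled, "latin1").toString("utf8");
}

var original = "なにぬねの";
var garbled = misdecodeUtf8AsLatin1(original);
console.log(garbled.length);              // 15 characters instead of 5
console.log(repairLatin1AsUtf8(garbled)); // なにぬねの
```

Because no byte is lost in this particular round trip, the text can be recovered; once an unmappable byte has been replaced by a substitute character, that is no longer possible.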

Re: Unicode data corrupted when sent to an array

Postby Mofi » Sat Feb 18, 2012 12:38 pm

UltraEdit converts all Unicode file formats to UTF-16 Little Endian on load. So every character of any Unicode file is held in UltraEdit's memory with 2 bytes per character or, to be more precise, as an unsigned 16-bit value (unsigned short int). String variables are managed entirely by the JavaScript engine. Unfortunately, the documentation of the String object on the Mozilla Developer Network is rather thin regarding Unicode strings.
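
To make the 16-bit storage concrete: JavaScript's standard `charCodeAt` returns exactly such an unsigned 16-bit value for each position in a string. The following is only a minimal sketch in plain JavaScript; `utf16leBytes` and `hexUnits` are hypothetical helper names, and nothing in it is UltraEdit-specific or known to reflect how UltraEdit hands strings to a script.

```javascript
// Returns the UTF-16 Little Endian bytes of a string: two bytes per
// 16-bit code unit, low byte first.
function utf16leBytes(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i); // unsigned 16-bit value, 0..65535
    bytes.push(unit & 0xFF);      // low byte
    bytes.push(unit >> 8);        // high byte
  }
  return bytes;
}

// Formats each code unit as four hex digits, which is handy for checking
// whether a string variable still holds the expected characters.
function hexUnits(str) {
  var out = [];
  for (var i = 0; i < str.length; i++) {
    var hex = str.charCodeAt(i).toString(16).toUpperCase();
    while (hex.length < 4) {
      hex = "0" + hex;
    }
    out.push(hex);
  }
  return out.join(" ");
}

// hexUnits("なに") returns "306A 306B"
```

Writing the result of `hexUnits` to the output window instead of the string itself would show whether the corruption happens when the string is read from the document or only when it is displayed.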

More about Unicode support in JavaScript can be found at Values, variables, and literals - Unicode, but there is nothing String related.

As you can read on the very technical page Mozilla internal string guide for C++ programmers, JavaScript supports 8-bit (ANSI) and 16-bit (Unicode) strings. But how to use the Unicode variant in scripts is not explained on that page.

I have never needed to write a script for myself that works on a Unicode file. I have tried several times, for people asking here, to find out how to deal with Unicode strings in JavaScript scripts, but the results were poor. The only script function where I have had success working with Unicode strings was the HexCopy function, which is written for working on binary data streams and also produces the correct result for Unicode strings with 16-bit values for every character. But this function is of no use for modifying text in Unicode encoded text files.

In summary: I don't have any idea how to reformat a Unicode text file with a JavaScript script when string variables must be used too.


Correction: I suddenly had an idea how to work with Unicode strings within a script, see Script or macro to identify unicode codepage data. But I still don't know how to get Unicode strings into string variables without conversion to ANSI strings.

However, perhaps you can use the user clipboards and replaces, as I did in the referenced topic, to work on a Unicode file. I looked at your code and I think it is possible to write it using a user clipboard and normal replaces plus 1 UltraEdit regular expression replace.

Code: Select all
if (UltraEdit.document.length > 0)
{
   UltraEdit.insertMode();
   UltraEdit.columnModeOff();
   UltraEdit.activeDocument.hexOff();
   UltraEdit.activeDocument.top();

   // Ask the user how each sentence entry ends (typically, this is the Japanese period character)
   // Entered character is inserted at top of the file and cut to user clipboard 9.
   UltraEdit.getString("What ends each sentence?",0);
   UltraEdit.selectClipboard(9);
   UltraEdit.activeDocument.selectToTop();
   UltraEdit.activeDocument.cut();

   // Use DOS-style line terminator for Windows Notepad Unicode .txt files
   var sLineTerminator = "^p";

   // Define all properties for the replace commands below.
   UltraEdit.activeDocument.findReplace.mode=0;
   UltraEdit.activeDocument.findReplace.matchCase=true;
   UltraEdit.activeDocument.findReplace.matchWord=false;
   UltraEdit.activeDocument.findReplace.regExp=false;
   UltraEdit.activeDocument.findReplace.searchDown=true;
   UltraEdit.activeDocument.findReplace.searchInColumn=false;
   UltraEdit.activeDocument.findReplace.preserveCase=false;
   UltraEdit.activeDocument.findReplace.replaceAll=true;
   UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
   UltraEdit.ueReOn();

   // Remove all line terminators in the file, making the data one continuous line of text
   UltraEdit.activeDocument.findReplace.replace(sLineTerminator, "");

   // Collapse every run of two or more spaces into a single space (UltraEdit regular expression)
   UltraEdit.activeDocument.findReplace.regExp=true;
   UltraEdit.activeDocument.findReplace.replace("  +", " ");

   // Replace a period plus a space with a period (removing leading spaces from entries)
   UltraEdit.activeDocument.findReplace.regExp=false;
   UltraEdit.activeDocument.findReplace.replace("^c ", "^c");

   // Replace a period with a period plus a terminator, which will put each sentence on its own line
   UltraEdit.activeDocument.findReplace.replace("^c", "^c" + sLineTerminator);

   UltraEdit.clearClipboard();
   UltraEdit.selectClipboard(0);
}
Mofi
Grand Master
 
Posts: 3936
Joined: Thu Jul 29, 2004 11:00 pm
Location: Vienna

Re: Unicode data corrupted when sent to an array

Postby thesleeve » Tue Feb 21, 2012 11:31 am

Thanks, Mofi.

That's extremely helpful. I had figured that the JavaScript engine was basically encoding the strings in its own way, but I was hoping to have some control over that process in order to work around this issue. Oh well, it looks like that's all abstracted away from the user and it's not possible to influence how strings are stored in memory.

Your idea of using the user clipboard is great! Thanks very much for putting in the time to write up that example script. I'm going to give this a shot and see if it works to solve my problem.

Thanks again!
TheSleeve
