Replace UTF-8 character code sequences by ASCII characters with script


Post by JPluim » Thu Sep 23, 2010 4:10 am

Hi all,

I'm fairly new to JavaScript, and after spending a lot of time on this I have given up.

I have a file with about 100000 lines that contains pairs of characters which have to be replaced.
For example, Ã¢ has to become a, and Ã© has to become e. There are a few of these combinations; they all start with Ã except for one, which starts with Ä.

For some reason I'm not able to Find/Replace two characters such as Ã©. This code, for example, doesn't work:

var abc = "Ã©";
var def = "e";
UltraEdit.activeDocument.findReplace.replace(abc, def);

I also tried to write code that finds Ã, selects the character to the right of it, and then checks which combination it is. Maybe it's because I'm not that good with JavaScript, but that didn't work out either.

I'm using UltraEdit version 16.00.0.1036 on Windows XP Professional Service Pack 3.

Thanks a lot in advance!

Joost
JPluim
Newbie
 
Posts: 2
Joined: Thu Sep 23, 2010 3:57 am

Re: Replace UTF-8 character code sequences by ASCII characters with script

Post by Mofi » Thu Sep 23, 2010 7:04 am

Replacing UTF-8 character code sequences with a script is tricky because either the script file itself is interpreted as a UTF-8 encoded file, or the search string alone is read as a single UTF-8 character instead of 2 ANSI characters.
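
To illustrate where these pairs come from, here is a minimal sketch that computes the two single-byte characters a 2-byte UTF-8 sequence turns into when its bytes are displayed one by one. The helper name utf8PairAsAnsi is hypothetical and not part of UltraEdit, and the sketch assumes the bytes are shown as ISO-8859-1 (Latin-1). Windows-1252 shows different characters for bytes 0x80 to 0x9F, so a trail byte in that range will look different in an ANSI file on Windows.

```javascript
// Sketch only: utf8PairAsAnsi is a hypothetical helper, not an UltraEdit function.
// It encodes one character from U+0080..U+07FF as UTF-8 (always 2 bytes in that
// range) and returns the 2 bytes as 2 separate Latin-1 characters.
function utf8PairAsAnsi(ch) {
  var cp = ch.charCodeAt(0);
  if (cp < 0x80 || cp > 0x7FF) {
    throw new Error("only U+0080..U+07FF are encoded with 2 bytes in UTF-8");
  }
  var lead = 0xC0 | (cp >> 6);     // 110xxxxx: upper 5 bits of the code point
  var trail = 0x80 | (cp & 0x3F);  // 10xxxxxx: lower 6 bits of the code point
  return String.fromCharCode(lead, trail);
}

// \u escapes keep this example independent of the encoding of the script file.
var pairForE = utf8PairAsAnsi("\u00E9"); // bytes C3 A9
var pairForA = utf8PairAsAnsi("\u00E2"); // bytes C3 A2
```

For U+00E9 (é) this gives the bytes C3 A9, which is the pair Ã© from the question, and for U+00E2 (â) it gives C3 A2, the pair Ã¢. Characters from U+0100 to U+013F get the lead byte C4, which would explain the one combination that starts with Ä.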

A solution is to run the search with a regular expression engine and put the engine's escape character before the second character, so that the code sequence is no longer interpreted as one UTF-8 character. For example, the following script works with the Perl engine, whose escape character is the backslash. The backslash appears twice in the string variable below because it is also the escape character in JavaScript strings.

UltraEdit.activeDocument.findReplace.mode=0;
UltraEdit.activeDocument.findReplace.matchCase=true;
UltraEdit.activeDocument.findReplace.matchWord=false;
UltraEdit.activeDocument.findReplace.regExp=true;
UltraEdit.activeDocument.findReplace.searchAscii=false;
UltraEdit.activeDocument.findReplace.searchDown=true;
UltraEdit.activeDocument.findReplace.searchInColumn=false;
UltraEdit.activeDocument.findReplace.preserveCase=false;
UltraEdit.activeDocument.findReplace.replaceAll=false;
UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;

var abc = "Ã\\©";
var def = "e";
UltraEdit.perlReOn();
UltraEdit.activeDocument.findReplace.replace(abc,def);


Alternatively, you can use the UltraEdit regular expression engine, which uses the character ^ as its escape character.

var abc = "Ã^©";
var def = "e";
UltraEdit.ueReOn();
UltraEdit.activeDocument.findReplace.replace(abc,def);
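
Since the file contains several of these combinations, the same trick can be applied in a loop. The following is only a sketch under some assumptions: the replacements table and the helper escapeSecondChar are hypothetical names, the table lists just the two pairs mentioned in the question, and replaceAll is set to true on the assumption that every occurrence in the file should be replaced. Writing the pairs as \u escapes is also an assumption that has not been tested in UltraEdit; literal characters as in the scripts above can be used instead.

```javascript
// Hypothetical table of [ANSI pair, ASCII replacement]; extend it with the
// other combinations that occur in the file.
var replacements = [
  ["\u00C3\u00A2", "a"], // Ã¢ -> a
  ["\u00C3\u00A9", "e"]  // Ã© -> e
];

// Hypothetical helper: put the regex engine's escape character between the
// first and the second character of the pair.
function escapeSecondChar(pair, escapeChar) {
  return pair.charAt(0) + escapeChar + pair.charAt(1);
}

// The guard lets the helper above be tried outside of UltraEdit as well.
if (typeof UltraEdit !== "undefined") {
  UltraEdit.perlReOn();           // Perl engine, escape character is the backslash
  UltraEdit.activeDocument.top(); // start searching at the top of the file
  UltraEdit.activeDocument.findReplace.mode = 0;
  UltraEdit.activeDocument.findReplace.matchCase = true;
  UltraEdit.activeDocument.findReplace.regExp = true;
  UltraEdit.activeDocument.findReplace.searchDown = true;
  UltraEdit.activeDocument.findReplace.replaceAll = true; // assumption: replace every occurrence
  for (var i = 0; i < replacements.length; i++) {
    var search = escapeSecondChar(replacements[i][0], "\\");
    UltraEdit.activeDocument.findReplace.replace(search, replacements[i][1]);
  }
}
```

For the UltraEdit engine, the same loop should work with UltraEdit.ueReOn() and "^" passed as the escape character.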
Mofi
Grand Master
 
Posts: 4064
Joined: Thu Jul 29, 2004 11:00 pm
Location: Vienna

Re: Replace UTF-8 character code sequences by ASCII characters with script

Post by JPluim » Mon Sep 27, 2010 6:37 am

Thanks a lot for this answer! I'll go and give it a go in my script.

Thanks a lot!
JPluim

