Using Perl regular expression replace fails on UTF-8 documents (fixed)

Post by Khuri » Sun Nov 22, 2009 8:09 am

Hello!

I've recently started digging into the UTF-8 format (thanks to this thread, which was a great read), and everything seems to work fine except regular expressions.

I run a script when saving documents that determines whether the file is one of my own script documents, i.e. whether it includes a particular header string stating when the file was created and last edited.
The script works fine on ASCII/ANSI documents; however, the regex replace fails on UTF-8 documents and I fail to see why.

The header part looks like this:
Code:
// File written by Jochen "Khuri" Höhmann <mailadress>
// Copyright 2009
//
// File        : test.php
// Begin       : 2009.11.22 13:45:32
// Last Update : 2009.11.22 13:47:54


This header (and some possible variations) is added using templates.
Now when I save a document, the following script is executed.

Code:
var cline = UltraEdit.activeDocument.currentLineNum;
var crow = UltraEdit.activeDocument.currentColumnNum;
if (typeof(UltraEdit.activeDocumentIdx) == "undefined") crow++;
UltraEdit.activeDocument.top();
UltraEdit.activeDocument.findReplace.mode = 0;
var is_privdoc = UltraEdit.activeDocument.findReplace.find("File written by Jochen \"Khuri\" Höhmann <mailadress>");
if(is_privdoc == true) {
   var time = new Date();
   var currdate = time.getFullYear()+'.'+(((time.getMonth() +1).toString().length > 1) ? (time.getMonth() +1) : '0'+(time.getMonth() +1))+'.'+((time.getDate().toString().length > 1) ? time.getDate() : '0'+time.getDate())+' '+((time.getHours().toString().length > 1) ? time.getHours() : '0'+time.getHours())+':'+((time.getMinutes().toString().length > 1) ? time.getMinutes() : '0'+time.getMinutes())+':'+((time.getSeconds().toString().length > 1) ? time.getSeconds() : '0'+time.getSeconds());
   UltraEdit.perlReOn();
   UltraEdit.activeDocument.findReplace.regExp = true;
   if(UltraEdit.activeDocument.findReplace.find("// Copyright "+time.getFullYear()) == false) {
      UltraEdit.activeDocument.findReplace.replace("\\/\\/ Copyright (?:\\d{4})","// Copyright "+time.getFullYear());
   }
   if(UltraEdit.activeDocument.findReplace.replace("\\/\\/ Last Update : [\n\r]","// Last Update : "+currdate+"\n") == false) {
      UltraEdit.activeDocument.findReplace.replace("\\/\\/ Last Update : (?:\\d{4}\\.\\d{2}\\.\\d{2} \\d{2}:\\d{2}:\\d{2})","// Last Update : "+currdate);
   }
}
UltraEdit.activeDocument.gotoLine(cline,crow);
UltraEdit.save();


Both lines
Code:
UltraEdit.activeDocument.findReplace.replace("\\/\\/ Copyright (?:\\d{4})","// Copyright "+time.getFullYear());
and
UltraEdit.activeDocument.findReplace.replace("\\/\\/ Last Update : (?:\\d{4}\\.\\d{2}\\.\\d{2} \\d{2}:\\d{2}:\\d{2})","// Last Update : "+currdate);

fail to work on UTF-8 documents.
I assume it might have to do with the way digits are handled in UTF-8 files? Yet so far I have failed to figure out what's wrong here...
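
To make clear what the two replaces are supposed to produce, here is a minimal sketch in plain JavaScript. It uses JavaScript's built-in regex engine via String.replace, not UltraEdit's Perl engine, so it only illustrates the intended result and does not reproduce the problem. The sample year 2008 and the fixed currdate value are made up for the illustration.

```javascript
// Header text modeled on the template above; the year 2008 is a
// hypothetical stale value so that the first replace has something to change.
var header = [
  "// Copyright 2008",
  "//",
  "// File        : test.php",
  "// Begin       : 2009.11.22 13:45:32",
  "// Last Update : 2009.11.22 13:47:54"
].join("\n");

// Fixed values for a reproducible result; the real script derives them from new Date().
var year = "2009";
var currdate = "2009.11.25 13:54:00";

// Same patterns as in the script, written as JavaScript regex literals.
var updated = header
  .replace(/\/\/ Copyright (?:\d{4})/, "// Copyright " + year)
  .replace(/\/\/ Last Update : (?:\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2})/,
           "// Last Update : " + currdate);

console.log(updated);
```

Only the Copyright and Last Update lines should change; the File and Begin lines stay as they are.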

Any help is greatly appreciated. Thanks in advance! :)

Re: Using Perl regular expression replace fails on UTF-8 documents (fixed)

Post by pietzcker » Sun Nov 22, 2009 3:27 pm

I don't use the scripting engine of UE, so I'm not sure what's going on.

Two things to consider, though. First, there is no need to escape the forward slash. It has no special meaning in a regex when the pattern is passed as a JavaScript string; it is only special in a JavaScript regex literal, where it delimits the pattern. I don't think this is breaking your regex, but try that first.

Second, try [0-9] instead of \d and see if that changes things. If it does, then there might well be something wrong with the regex engine in UE's JavaScript implementation. Bego is the JavaScript expert around here, so I'm curious about what he thinks about this.
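
Both points can be checked outside of UE with a small sketch in plain JavaScript. Note the assumption: this runs against JavaScript's own RegExp object, not against the Perl engine that UltraEdit's findReplace uses, so it only shows that the pattern variants are equivalent in principle. The variable names are made up for the illustration.

```javascript
// Sample line taken from the header in the first post.
var line = "// Copyright 2009";

// Pattern passed as a string: "/" needs no escaping,
// but every backslash must be doubled inside the string literal.
var fromString = new RegExp("// Copyright \\d{4}");

// The escaped slashes from the original script are harmless, just unnecessary.
var escapedString = new RegExp("\\/\\/ Copyright \\d{4}");

// Regex literal: here "/" delimits the pattern and therefore must be escaped.
var fromLiteral = /\/\/ Copyright \d{4}/;

// Explicit character class in place of \d.
var withClass = new RegExp("// Copyright [0-9]{4}");

var results = [fromString, escapedString, fromLiteral, withClass].map(function (re) {
  return re.test(line);
});
console.log(results); // all four entries are true
```

If the [0-9] variant behaves differently from \d inside UE, that would point at the engine rather than at the pattern.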

Re: Using Perl regular expression replace fails on UTF-8 documents (fixed)

Post by Mofi » Mon Nov 23, 2009 2:37 am

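Something strange is going on here which I must analyze further. It looks like the replace command finds the strings but does not replace them correctly in Unicode files. In the meantime, you can use this script, which works around the problem by finding each string with a regular expression and then overwriting the found string with the write command.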

Code:
var cline = UltraEdit.activeDocument.currentLineNum;
var crow = UltraEdit.activeDocument.currentColumnNum;
if (typeof(UltraEdit.activeDocumentIdx) == "undefined") crow++;
UltraEdit.activeDocument.top();
UltraEdit.perlReOn();
UltraEdit.activeDocument.findReplace.mode=0;
UltraEdit.activeDocument.findReplace.matchCase=false;
UltraEdit.activeDocument.findReplace.matchWord=false;
UltraEdit.activeDocument.findReplace.regExp=false;
UltraEdit.activeDocument.findReplace.searchAscii=false;
UltraEdit.activeDocument.findReplace.searchDown=true;
UltraEdit.activeDocument.findReplace.searchInColumn=false;
var is_privdoc = UltraEdit.activeDocument.findReplace.find("File written by Jochen \"Khuri\" Höhmann <mailadress>");
if(is_privdoc == true) {
   var time = new Date();
   var currdate = time.getFullYear()+'.'+(((time.getMonth() +1).toString().length > 1) ? (time.getMonth() +1) : '0'+(time.getMonth() +1))+'.'+((time.getDate().toString().length > 1) ? time.getDate() : '0'+time.getDate())+' '+((time.getHours().toString().length > 1) ? time.getHours() : '0'+time.getHours())+':'+((time.getMinutes().toString().length > 1) ? time.getMinutes() : '0'+time.getMinutes())+':'+((time.getSeconds().toString().length > 1) ? time.getSeconds() : '0'+time.getSeconds());
   var curryear = currdate.substr(0,4);
   UltraEdit.activeDocument.findReplace.regExp=true;
   UltraEdit.activeDocument.findReplace.preserveCase=false;
   UltraEdit.activeDocument.findReplace.replaceAll=false;
   UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
   if(UltraEdit.activeDocument.findReplace.find("// Copyright "+curryear) == false) {
      UltraEdit.activeDocument.findReplace.find("// Copyright \\d{4}");
      UltraEdit.activeDocument.write("// Copyright "+curryear);
   }
   if(UltraEdit.activeDocument.findReplace.replace("// Last Update : [\\n\\r]","// Last Update : "+currdate+"\n") == false) {
      UltraEdit.activeDocument.findReplace.find("// Last Update : \\d{4}\\.\\d{2}\\.\\d{2} \\d{2}:\\d{2}:\\d{2}");
      UltraEdit.activeDocument.write("// Last Update : "+currdate);
   }
}
UltraEdit.activeDocument.gotoLine(cline,crow);
UltraEdit.save();

Edit: After analyzing the problem in more depth, I reported it by email and have already received a reply that IDM support could reproduce the issue.
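
As a side note, the long expression that builds currdate in the script can be split into two small helpers. This is only a sketch in plain JavaScript; pad2 and formatStamp are hypothetical names, not part of the UltraEdit scripting API, and I have deliberately avoided newer language features since I cannot say which of them UE's script engine supports.

```javascript
// Left-pad a number below 10 with a zero, e.g. 7 -> "07".
function pad2(n) {
  return (n < 10 ? "0" : "") + n;
}

// Format a Date as "YYYY.MM.DD hh:mm:ss" in local time,
// the same layout the header template uses.
function formatStamp(d) {
  return d.getFullYear() + "." + pad2(d.getMonth() + 1) + "." + pad2(d.getDate()) +
    " " + pad2(d.getHours()) + ":" + pad2(d.getMinutes()) + ":" + pad2(d.getSeconds());
}

// Fixed sample date; months are zero-based, so 10 means November.
var sample = new Date(2009, 10, 22, 13, 47, 54);
console.log(formatStamp(sample)); // 2009.11.22 13:47:54
```

In the script, var currdate = formatStamp(new Date()); would then take the place of the one-line expression, and currdate.substr(0,4) still yields the year.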

Re: Using Perl regular expression replace fails on UTF-8 documents (fixed)

Post by Khuri » Wed Nov 25, 2009 1:54 pm

Whew, kind of glad it's a bug then and not me. I spent quite some time trying it over and over again, but as you stated, Mofi, the normal UE replace function is not working either. I have to keep that in mind, as I use regular expressions quite often. So let's hope IDM fixes this soon.

Anyhow, thanks for your replacement script, Mofi, it works fine :)

Re: Using Perl regular expression replace fails on UTF-8 documents (fixed)

Post by Mofi » Sun Jun 02, 2013 7:30 am

While the original script by Khuri still failed to make the replacements when executed with UE v18.20.0.1028, it works with UE v19.00.0.1026 and later versions.

In UE v19.00 the Perl regular expression engine inside UltraEdit was updated, and it looks like this update fixed many Perl regular expression Find and Replace issues like this one.

PS: The first public release of UE v19.00 was v19.00.0.1022, which does not make the replacements 100% correctly. With the next hotfix release, v19.00.0.1026, the original script by Khuri produces the same correct output as my script with the workaround.