Workaround sought for missing variable negative lookbehind

Postby fvgfvg » Tue Mar 26, 2013 7:08 pm

I'm trying to write a script to identify footnotes in a Ventura text file. Such footnotes have the format:

text text <$Ffootnote footnote footnote > text text

That looks straightforward enough - except that Ventura has various other control characters of the type <I>, <I*>, <->, <N>, <> and so on that get in the way:

text <$F footnote <I>title<I*> foot<->note <N> footnote<> footnote > text

After some effort, using Regex Buddy, I constructed the following regex, which matches the above footnote just fine:

Code: Select all
<\$F([\s\S]+?)(?<!(<|< |<  |<I|<I\*|<CR|<N|<-))>

Trouble is, UE's 'Perl-style' regex doesn't seem to support those negative lookbehinds that I need to ignore those gratuitous control characters, and I can't figure out an alternative. I'm flummoxed. Would anyone know of a workaround? (Regex Buddy declares the above regex valid for Java, but for Perl it warns: "Perl does not support variable repetition inside lookbehind")

I'd be most grateful for a bit of help here...
best,
fvgfvg
fvgfvg
Newbie
 
Posts: 8
Joined: Thu Apr 26, 2012 9:49 am

Re: Workaround sought for missing variable negative lookbehind

Postby Mofi » Wed Mar 27, 2013 4:12 am

Yes, in a Perl lookbehind expression the length of the matched string must be fixed. An OR of alternatives with different lengths inside a single lookbehind expression is therefore not supported.

But it is possible to simply specify multiple lookbehinds one after another, each with a fixed length, and therefore you can use:

Code: Select all
<\$F(?:[\s\S]+?>)(?<!<>)(?<!< >)(?<!<  >)(?<!<I>)(?<!<I\*>)(?<!<CR>)(?<!<N>)(?<!<->)
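The chaining idea can be tried outside UltraEdit as well. The following is only a minimal sketch using JavaScript's own RegExp engine (an assumption: it needs a runtime with lookbehind support, such as a recent Node.js, and that engine is not the Perl engine built into UltraEdit). The sample line is the footnote from the first post, and the capturing group is placed as in the later posts of this thread.

```javascript
// Each negative lookbehind has a fixed length and is tested right after a
// candidate ">". If any of them fails, the lazy quantifier extends the match
// to the next ">" and the checks are repeated there.
const footnote = /<\$F([\s\S]+?)>(?<!<>)(?<!< >)(?<!<  >)(?<!<I>)(?<!<I\*>)(?<!<CR>)(?<!<N>)(?<!<->)/g;

// Sample line taken from the first post of this thread.
const sample = "text <$F footnote <I>title<I*> foot<->note <N> footnote<> footnote > text";

// Collect the content of capturing group 1 for every footnote found.
const matches = [...sample.matchAll(footnote)].map((m) => m[1]);

console.log(matches.length); // 1
console.log(JSON.stringify(matches[0]));
```

The single match skips the control codes <I>, <I*>, <->, <N> and <> and ends only at the final ">" that is not part of any of them.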
Mofi
Grand Master
 
Posts: 4049
Joined: Thu Jul 29, 2004 11:00 pm
Location: Vienna

Re: Workaround sought for missing variable negative lookbehind

Postby fvgfvg » Thu Mar 28, 2013 8:15 am

Mofi, thank you. I've been trying this out. In RB (RegexBuddy) it works perfectly well. The trouble is that when I insert it into my UE script it doesn't work. RB provides different options: a copy 'as is', or as a 'perl-style' string:
'as is':
Code: Select all
<\$F((?:[\s\S]+?))>(?<!<>)(?<!< >)(?<!<  >)(?<!<I>)(?<!<I\*>)(?<!<CR>)(?<!<N>)(?<!<->)

'perl-style':
Code: Select all
'<\$F((?:[\s\S]+?))>(?<!<>)(?<!< >)(?<!<  >)(?<!<I>)(?<!<I\*>)(?<!<CR>)(?<!<N>)(?<!<->)'

The same for the 'replace' string:
'as is':
Code: Select all
<<${1}>>

perl-style:
Code: Select all
'<<${1}>>'

The trouble is that whatever I do, I can't get my script to work. Have I got the find/replace line wrong?
Code: Select all
if (UltraEdit.document.length > 0) {
   UltraEdit.insertMode();
   UltraEdit.columnModeOff();
   UltraEdit.activeDocument.hexOff();
     //UltraEdit.ueReOn();     // UltraEdit
     //UltraEdit.unixReOn();   // Unix
   UltraEdit.perlReOn();       // Perl

   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.mode=0;
   UltraEdit.activeDocument.findReplace.matchCase=true;
   UltraEdit.activeDocument.findReplace.matchWord=false;
   UltraEdit.activeDocument.findReplace.regExp=true;
   UltraEdit.activeDocument.findReplace.searchDown=true;
   UltraEdit.activeDocument.findReplace.searchInColumn=false;
   UltraEdit.activeDocument.findReplace.preserveCase=false;
   UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
   UltraEdit.activeDocument.findReplace.replaceAll=true;
  //////////  formatting commands
   UltraEdit.activeDocument.top();       
   UltraEdit.activeDocument.findReplace.replace(" << ","<<");
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.replace(" >>",">>");
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.replace("<<","<$F");
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.replace(">>","[]>");
   UltraEdit.activeDocument.top();     
   UltraEdit.activeDocument.findReplace.replace("@N = ","@n=");
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.replace("@EN = ","@en=");
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.replace("@n=","@NOTE = ");
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.replace("@en=","@ENDNOTE = ");
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.replace("@h3=","@HEAD3 = ");
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.replace("@h4=","@HEAD4 = ");
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.replace("@h5=","@HEAD5 = ");
   UltraEdit.activeDocument.top();   
   UltraEdit.activeDocument.findReplace.replace("  "," ");   
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.replace("open italics","<I>");   
   UltraEdit.activeDocument.top();   
   UltraEdit.activeDocument.findReplace.replace("close italics","<I\*>");   
   UltraEdit.activeDocument.top();       
   UltraEdit.activeDocument.findReplace.replace("<I> ","<I>");   
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.replace("open bracket","(");   
   UltraEdit.activeDocument.top();     
   UltraEdit.activeDocument.findReplace.replace("close bracket",")");   
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.replace("open quote","\"");   
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.replace("close quote","\"");   
   UltraEdit.activeDocument.top();
// Replace <$F....> with <<....>> 
   UltraEdit.activeDocument.findReplace.replace("<\$F((?:[\s\S]+?))>(?<!<>)(?<!< >)(?<!<  >)(?<!<I>)(?<!<I\*>)(?<!<CR>)(?<!<N>)(?<!<->)","<<${1}>>");   
   UltraEdit.activeDocument.top();   
   
}    // <---END OF "IF" IN LINE 1! (DON'T DELETE)


The 'sand-box' text that I use is this:
Code: Select all

Back-references 1 contains the full FN, without the '<$F' and the '>'.
The negative lookbehind (creating back-reference 2) is there to discard the
 <>, < >, <  >, <->, <I>, <I*>, <N>, <CR>


this is text this is text this is text<$Fendnote endnote endnote>
this is text this is text this is text this is text this is text
this is<$Fendnote must span new line plus blanks lines


 

 endnote> text this is text this is text
this is text<$Fendnote <>This is a <I>title<I*> and this is another <I>Title


<I*>


 endnote<N> end<->note> this is text this is text
this is text this is text this is text this is text


this is text this is text this is
text this is text this is text this is text this is text this is text<$Fendnote endnote <CR>

endnote> this is text this is text this is text this is text t
his is text this is<$Fendnote <N> <I>title <I*>end<->note <>endnote> text this is text this is
text this is text<$Fendnote <CR>
endnote endnote> this is text this is text this is text this is
 text this is text this is text this is text this is text this is text this is text
this is text this is text this is text this



best,
fvg
fvgfvg
Newbie
 
Posts: 8
Joined: Thu Apr 26, 2012 9:49 am

Re: Workaround sought for missing variable negative lookbehind

Postby Mofi » Thu Mar 28, 2013 8:55 am

fvgfvg wrote:Have I got the find/replace line wrong?

Yes, you have the find string in the script wrong. See point 2 in List of UltraEdit / UEStudio script commands and most common mistakes: Backslashes in strings are not escaped.

The search string I provided, adapted to a tagged regular expression for the replace, must be inserted into an UltraEdit script for the command UltraEdit.activeDocument.findReplace.replace as:

"<\\$F([\\s\\S]+?)>(?<!<>)(?<!< >)(?<!<  >)(?<!<I>)(?<!<I\\*>)(?<!<CR>)(?<!<N>)(?<!<->)"

The replace string usually used in Perl syntax is <<\1>>, which must be written in the script as <<\\1>>.

So the entire line in the script is:

Code: Select all
UltraEdit.activeDocument.findReplace.replace("<\\$F([\\s\\S]+?)>(?<!<>)(?<!< >)(?<!<  >)(?<!<I>)(?<!<I\\*>)(?<!<CR>)(?<!<N>)(?<!<->)","<<\\1>>");
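Why the doubling matters can be seen with plain JavaScript strings, independent of UltraEdit. This is just an illustrative sketch with a shortened pattern; the variable names are made up for the example.

```javascript
// With single backslashes the JavaScript interpreter consumes them itself:
// "\$" becomes "$", "\s" becomes "s" and "\S" becomes "S".
const unescaped = "<\$F([\s\S]+?)>";

// With doubled backslashes one backslash each survives in the string value,
// which is what a regular expression engine needs to receive.
const escaped = "<\\$F([\\s\\S]+?)>";

console.log(unescaped); // <$F([sS]+?)>
console.log(escaped);   // <\$F([\s\S]+?)>
```

So with the unescaped variant the search engine never sees \$, \s or \S, only the literal characters $, s and S.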


The following line is also not 100% correct:
Code: Select all
UltraEdit.activeDocument.findReplace.replace("close italics","<I\*>");

The JavaScript interpreter, like the Perl regular expression engine, treats a backslash as an escape character. It therefore passes the replace string "<I*>" and not "<I\*>" to the Perl regular expression engine of UltraEdit. But that does not matter here, as the asterisk has no special meaning in the replace string and therefore does not need to be escaped with a backslash at all.

Therefore this line should be:
Code: Select all
UltraEdit.activeDocument.findReplace.replace("close italics","<I*>");
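
When a pattern is copied 'as is' from a tool such as RegexBuddy, the backslash doubling can also be done mechanically instead of by hand. The helper below is a hypothetical sketch (toScriptLiteral is not an UltraEdit or RegexBuddy function); it only produces the text of a double-quoted JavaScript string literal, which can then be pasted into a script.

```javascript
// Hypothetical helper: turn a raw pattern into the source text of a
// double-quoted JavaScript string literal by escaping backslashes and quotes.
function toScriptLiteral(pattern) {
  return '"' + pattern.replace(/\\/g, "\\\\").replace(/"/g, '\\"') + '"';
}

// String.raw keeps the backslashes exactly as typed, like an 'as is' copy.
const asIs = String.raw`<\$F([\s\S]+?)>(?<!<I\*>)`;

// Prints the literal with every backslash doubled.
console.log(toScriptLiteral(asIs));
```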
Mofi
Grand Master
 
Posts: 4049
Joined: Thu Jul 29, 2004 11:00 pm
Location: Vienna

Re: Workaround sought for missing variable negative lookbehind

Postby fvgfvg » Thu Mar 28, 2013 1:02 pm

Thank you Mofi... It's working like a charm. Don't know what I'd do without you. (Looks like the RegexBuddy 'copy' command doesn't know about the particular Perl 'flavour' of UE scripts..)
Best,
fvgfvg
fvgfvg
Newbie
 
Posts: 8
Joined: Thu Apr 26, 2012 9:49 am

