by Mofi » Fri Aug 07, 2009 4:48 am
To clarify what each solution in the various posts does, I have now edited all posts in this topic and explained in more detail what the various regular expressions do.
ridgerunner,
I added working regexp search strings for the UltraEdit and Unix engines to Jane's post. They do the same as the Perl regexp string [ \t]{2,}|\t posted by Jane.
From what I have observed over the last years, the Unix and UltraEdit engines work internally with the same code. I guess the Unix regexp strings are simply converted to UltraEdit syntax internally before that function is called. It looks like the UltraEdit engine uses the Microsoft syntax (as in MS Word or MS Excel), and the Unix engine was introduced only to help UltraEdit users familiar with the Perl syntax used on Unix machines to use regular expressions in UltraEdit more easily. With the introduction of the Perl compatible engine, the full power of that search engine is also available in UltraEdit.
A disadvantage of the Perl engine, which is very powerful and makes very complex search and replace operations possible, is one it shares with all powerful programs that can do lots of things: it contains bugs and therefore sometimes does not produce the expected results. Fixing a bug in such complex functions often means that something which worked well before no longer works correctly afterwards. I think most programmers know what I'm talking about here.
The legacy engine (UE/Unix) is not as powerful as the Perl compatible engine, and of course it also has some bugs and limitations. But after years of practice with the legacy engine I know quite well how it works internally, and I know most of its bugs and limitations. So I can often quickly find a working solution for most search/replace tasks even with the limited legacy search engine (UE/Unix).
ridgerunner,
the non-working UE regular expression search string you posted is, from a user's point of view, definitely a bug. But for me it was not surprising that it does not work, because I know how the OR expression works in the legacy engine, and it works completely differently from the Perl engine. How to explain the different methods of the engines? Let us look at a simple example. Take the following line:
test1 test2
Now run a UE regexp search with the search string ^{test1^}^{test^} (a silly search string, but good for explaining the different methods). You would expect the search to select first test1 and then test from test2. But that does not happen; the UltraEdit engine selects just test both times. I don't really know what happens internally, but it looks like the UltraEdit engine evaluates character by character whether the string built so far matches one of the 2 OR arguments. So it first checks whether t can match one of the 2 expressions. That is the case here for both. So the engine takes the next character, builds the string te and evaluates this string again against both expressions. Two steps further, the evaluated string test is a 100% match of the second expression and the UE engine exits because a string was found. So the UltraEdit engine treats both arguments of an OR expression with exactly the same level of importance.
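To make that guess concrete, here is a small Python sketch of the behavior as I described it. It is only a toy model of what I observe, not UltraEdit's actual code: the function names first_complete_match and find_all are made up for this illustration, and the alternatives are written in Python/Perl syntax instead of UltraEdit syntax.

```python
import re

def first_complete_match(alternatives, text, start=0):
    """Toy model: grow the candidate one character at a time and stop as
    soon as ANY alternative matches the candidate completely."""
    for begin in range(start, len(text)):
        for end in range(begin + 1, len(text) + 1):
            candidate = text[begin:end]
            if any(re.fullmatch(alt, candidate) for alt in alternatives):
                return begin, end
    return None

def find_all(alternatives, text):
    """Collect every match the toy model finds, scanning left to right."""
    found, pos = [], 0
    while True:
        span = first_complete_match(alternatives, text, pos)
        if span is None:
            return found
        found.append(text[span[0]:span[1]])
        pos = span[1]

# Both arguments have equal rank, so the shorter complete match is found first.
print(find_all(["test1", "test"], "test1 test2"))  # ['test', 'test']
```

In this model the order of the two arguments does not matter at all; only which argument is completely matched first while the candidate string grows.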
Now let us do the same with the Perl compatible engine by searching for (test1|test). This engine selects first test1 and then test from test2, which is what everyone expects. Why? Again, I don't really know how the Perl engine works, but it looks like it works as follows. It takes the t and checks whether it is matched by any argument of the OR expression. In this example that is true for both arguments. So it remembers the string position and now evaluates just the first expression against the string (byte stream). If it matches, it returns the matching string. If it did not match, the engine would rewind the byte stream to the position of character t, evaluate the byte stream from this position with the second OR argument, and return that matching string if it matched.
So the UltraEdit engine always evaluates a string with both arguments of the OR expression at the same time, while the Perl engine evaluates it with one argument after the other. The main difference is that the UltraEdit engine avoids looking back on the byte stream, while the Perl engine is designed to go back on the byte stream and evaluate again from that position. Avoiding looking back makes the UltraEdit engine fast, but limits its capabilities. Supporting looking back gives the Perl engine its power, but can make simple searches slower. You can also observe this by modifying the line to
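The same ordered behavior can be reproduced outside of UltraEdit. The following sketch uses Python's re module, which is also a backtracking engine with Perl-style alternation. It is not the engine built into UltraEdit, so take it only as an illustration of the principle; the variable names are mine.

```python
import re

line = "test1 test2"

# Alternatives are tried from left to right at each position.
perl_order = [m.group(0) for m in re.finditer(r"(test1|test)", line)]
swapped = [m.group(0) for m in re.finditer(r"(test|test1)", line)]

print(perl_order)  # ['test1', 'test']
print(swapped)     # ['test', 'test']
```

With the arguments swapped, test already succeeds at the first position, so the engine never needs to go back and try test1.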
test12 test2
and searching for ^{test1^}^{test[0-9]+^} or (test1|test[0-9]+). Both engines now select only test1 and not test12. Independent of the search engine, it is never a good idea to use an OR expression in which 2 (or more) of the arguments start with an expression matching the same characters, as done here twice. With the Perl engine the leftmost matching argument always "wins"; with the UltraEdit engine the argument that first returns a 100% match always "wins". These different working methods must be taken into account whenever the arguments of an OR expression match the same substring at the start of a matching string.
Now you hopefully understand why the UE expression ^{[ ^t][ ^t]+^}^{^t^} always selects only the first tab when a whitespace string starts with a tab character, while the Perl expression [ \t]{2,}|\t works. If the Perl regexp \t|[ \t]{2,} were used, it wouldn't work either.
If you look at the UltraEdit and Unix regexp strings I posted in Jane's post, you will see that I avoided the problem of identical starting substrings for both arguments of the OR expression. The first argument of the OR expression matches only strings starting with a tab character followed by 0 or more occurrences of tabs or spaces. The second argument matches only strings starting with a space followed by 1 or more occurrences of tabs or spaces.
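For the Perl-style side of this, a quick check with Python's re module (again only a stand-in for a backtracking engine with ordered alternation, not UltraEdit itself) shows how much the argument order matters when both arguments start with the same characters:

```python
import re

line = "test12 test2"

# First argument succeeds at once, so test[0-9]+ is never tried at position 0.
as_posted = [m.group(0) for m in re.finditer(r"test1|test[0-9]+", line)]
# With the more general argument on the left, the whole test12 is selected.
reordered = [m.group(0) for m in re.finditer(r"test[0-9]+|test1", line)]

print(as_posted)  # ['test1', 'test2']
print(reordered)  # ['test12', 'test2']
```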
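The effect of the argument order on the two Perl expressions can be sketched with Python's re module, used here only as an illustration of a Perl-style engine. The sample string is made up: two tabs followed by two spaces between a and b.

```python
import re

text = "a\t\t  b"  # whitespace run: tab, tab, space, space

# Jane's order: the "2 or more" argument is tried first and takes the whole run.
works = re.findall(r"[ \t]{2,}|\t", text)
# Swapped order: the single tab argument succeeds first and splits the run.
fails = re.findall(r"\t|[ \t]{2,}", text)

print(works)  # ['\t\t  ']
print(fails)  # ['\t', '\t', '  ']
```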
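The idea of giving each argument a different first character can be tried out in Python as well. Note that the pattern below is my translation of the description above into Python/Perl syntax, not the UltraEdit or Unix string itself, and the sample strings are made up for the check.

```python
import re

# Argument 1: tab followed by 0 or more tabs/spaces.
# Argument 2: space followed by 1 or more tabs/spaces.
disjoint = r"\t[ \t]*| [ \t]+"
disjoint_swapped = r" [ \t]+|\t[ \t]*"
janes = r"[ \t]{2,}|\t"

samples = ["a\t\t  b", "a  \tb", "a b", "a\tb", "a \tb"]

for sample in samples:
    expected = re.findall(janes, sample)
    # No two arguments can start at the same character, so order is irrelevant.
    same = (re.findall(disjoint, sample) == expected
            and re.findall(disjoint_swapped, sample) == expected)
    print(repr(sample), expected, same)
```

Because a match can begin with a tab or with a space but never with both, only one argument can ever apply at a given position, and it no longer matters whether the engine ranks the arguments or treats them equally.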