Deleting blank lines using Perl regular expressions

Postby Captain » Wed Mar 22, 2006 9:43 am

Hi

With UltraEdit regular expressions I could delete blank lines with the following:

Replace: "^p$" (without the quotes)
With "" (without the quotes - i.e. nothing).

Now using Perl regular expressions I expected the following to work:

Replace: "^\s*$" (without the quotes)
With "" (without the quotes - i.e. nothing).

But it does not. Any suggestions?

Re: Deleting blank lines using Perl regular expressions

Postby mjcarman » Wed Mar 22, 2006 1:55 pm

Uncheck "Match Whole Word Only." I always do this for regexp searches as it seems to cause problems otherwise.

Re: Deleting blank lines using Perl regular expressions

Postby Captain » Wed Mar 22, 2006 1:59 pm

Thanks for the reply, mjcarman, but "Match Whole Word Only" is not checked. I have confirmed that ^\s*$ should pick up blank lines, so there is nothing wrong with the regular expression syntax.

Re: Deleting blank lines using Perl regular expressions

Postby mjcarman » Wed Mar 22, 2006 5:36 pm

Interesting, "^\s*$" is working for me, although it seems to leave some cruft at the end (presumably because the file size has changed).

Re: Deleting blank lines using Perl regular expressions

Postby scallanh » Wed Mar 22, 2006 7:18 pm

Captain wrote: With UltraEdit regular expressions I could delete blank lines with the following:

Replace: "^p$" (without the quotes)
With "" (without the quotes - i.e. nothing).

Now using Perl regular expressions I expected the following to work:

Replace: "^\s*$" (without the quotes)
With "" (without the quotes - i.e. nothing).

But it does not. Any suggestions?

Your regular expression matches the contents of the lines. To completely remove the lines (and not just their contents) you need to replace the carriage returns and/or line feeds that terminate these lines.

Code:
^\s*[\r\n]+


That will match all blank lines except if the line is the very last line in the file.
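
The difference between matching a line's contents and matching its terminator can be tried outside UltraEdit. Below is a minimal sketch in Python's `re` module, used here only as a stand-in Perl-style engine; the function names are made up for illustration, and UltraEdit's own engine may behave differently.

```python
import re

def blank_contents_only(text):
    # Matches only the spaces/tabs on a blank line, not its line terminator,
    # so the (now empty) line itself stays in the text.
    return re.sub(r'^[ \t]*$', '', text, flags=re.MULTILINE)

def remove_blank_lines(text):
    # Also consumes the CR/LF characters that terminate the blank line(s),
    # so the lines disappear entirely.
    return re.sub(r'^\s*[\r\n]+', '', text, flags=re.MULTILINE)

sample = "a\n  \nb\n"
print(repr(blank_contents_only(sample)))   # 'a\n\nb\n'  (empty line remains)
print(repr(remove_blank_lines(sample)))    # 'a\nb\n'
print(repr(remove_blank_lines("a\n\n  ")))  # 'a\n  '  (blank last line survives)
```

The last call shows the caveat mentioned above: a blank last line with no terminator after it is not matched.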

Re: Deleting blank lines using Perl regular expressions

Postby Captain » Thu Mar 23, 2006 6:30 am

Hi Scallanh

Your suggestion still does not delete blank lines, i.e. lines containing nothing at all, not even spaces, which in Perl regex are represented by ^$. I am not sure whether my INI settings are affecting this operation.

Thanks

Re: Deleting blank lines using Perl regular expressions

Postby Bego » Thu Mar 23, 2006 7:28 am

Same with me ....
It sometimes seems to find only every 2nd line.
I could not work out the pattern in what UE does here.
In a block of 20 truly empty lines, UE jumps to just before the block and then to the end of the block of empty lines.
example:

before is empty as 5 lines below




text again




Starting at the top, the 1st time UE jumps to the word "is" !?!?
Going to the top again, it then always jumps to the empty line after "before..." and, when searching again (F3), to the line after "text again".

Dunno what to do .... ;-)

Regards, Bego (UE 12.00+1)

Re: Deleting blank lines using Perl regular expressions

Postby Mofi » Thu Mar 23, 2006 9:07 am

Multiple blank lines cannot be removed with a single search and replace. I use this macro to delete all blank lines:

Top
TrimTrailingSpaces
Loop
Find "^p^p"
Replace All "^p"
IfNotFound
ExitLoop
EndIf
EndLoop
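
Why the macro needs a loop can be sketched in Python. This is only an illustration of the same idea, not UltraEdit macro code: it assumes LF line endings, and `collapse_blank_lines` is a made-up name. A single non-overlapping replace of two newlines with one only halves a run of newlines, so the replace is repeated until nothing is found.

```python
import re

def collapse_blank_lines(text):
    # Rough equivalent of TrimTrailingSpaces: drop spaces/tabs at line ends.
    text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
    # Equivalent of the Find "^p^p" / Replace All "^p" loop.
    while '\n\n' in text:
        text = text.replace('\n\n', '\n')
    return text

one_pass = "a\n\n\n\nb".replace('\n\n', '\n')
print(repr(one_pass))                            # 'a\n\nb' (a blank line survives)
print(repr(collapse_blank_lines("a\n\n\n\nb")))  # 'a\nb'
```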

Re: Deleting blank lines using Perl regular expressions

Postby Mofi » Tue Mar 28, 2006 3:02 pm

My macro for removing blank lines can be extended to remove blank lines, or lines containing only white-space characters, in the current selection only. This macro needs the macro option "Continue if a Find with Replace not found" to be enabled.

IfSel
UnixReOff
Loop
Find RegExp "%[ ^t]+$"
Replace All SelectText ""
IfNotFound
ExitLoop
EndIf
EndLoop
Loop
Find "^p^p"
Replace All SelectText "^p"
IfNotFound
ExitLoop
EndIf
EndLoop
UnixReOn or PerlReOn
Else
Top
TrimTrailingSpaces
Loop
Find "^p^p"
Replace All "^p"
IfNotFound
ExitLoop
EndIf
EndLoop
EndIf

Insert UnixReOn or PerlReOn as shown above if you do not use regular expressions in UltraEdit style by default. Macro command UnixReOff sets the regular expression option to UltraEdit style.
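
The selection-only branch can be sketched in Python as well. This is an assumption-laden illustration, not a translation of the macro: the selection is modelled as hypothetical `start`/`end` character offsets, LF line endings are assumed, and the start of the selection is treated as the start of a line.

```python
import re

def collapse(text, trim_all_trailing):
    if trim_all_trailing:
        # Whole-file branch: like TrimTrailingSpaces.
        text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
    else:
        # Selection branch: like "%[ ^t]+$", which empties whitespace-only lines.
        text = re.sub(r'^[ \t]+$', '', text, flags=re.MULTILINE)
    # Like the Find "^p^p" / Replace "^p" loop.
    while '\n\n' in text:
        text = text.replace('\n\n', '\n')
    return text

def remove_blank_lines(text, start=0, end=0):
    if start != end:  # IfSel: only touch the selected span
        return text[:start] + collapse(text[start:end], False) + text[end:]
    return collapse(text, True)

text = "a\n\n\nb\n\t\n\nc"
print(repr(remove_blank_lines(text, 4, 10)))  # 'a\n\n\nb\nc'
print(repr(remove_blank_lines(text)))         # 'a\nb\nc'
```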

Re: Deleting blank lines using Perl regular expressions

Postby cjard » Tue Apr 11, 2006 8:33 am

Captain wrote: With UltraEdit regular expressions I could delete blank lines with the following:

Replace: "^p$" (without the quotes)
With "" (without the quotes - i.e. nothing).

Now using Perl regular expressions I expected the following to work:

Replace: "^\s*$" (without the quotes)
With "" (without the quotes - i.e. nothing).

But it does not. Any suggestions?

You know that ^ and $ match the start and the end of the input, and I THINK that UE hence splits the document into an array of lines and feeds them each into the matcher as lines of input with a start and end.

..hence you CANNOT match multiple lines with ^ and $


Instead you must make UEdit shove the whole file into the matcher by not using ^ $. Just get rid of them.

At this point I must mention another little-understood fact, and it's a VERY IMPORTANT thing you need to know about UE: it operates ALL regex in PESSIMISTIC mode (more commonly called lazy or non-greedy) by default. This is NOT usual. GREEDY is the default mode, but people don't understand greedy if they are used to MS DOS * wildcard matching.


FIND foo.*bar IN foothingbarfoothingbar REPLACE WITH foobaz

with greedy:
-> outputs foobaz

with pessimistic:
-> outputs foobazfoobaz


Pessimistic reads the input like a human, one character at a time, and it keeps going while the match is true. It reads foo, then starts reading anything and looking for bar, so it finds FOOTHINGBAR

Greedy, on the other hand, reads the whole line, then splits it out until it matches. Hence the first foo and the last bar are matches, and the .* matches the thingbarfoothing in between. This is counter-intuitive to most humans, and I think that is why UE doesn't operate in this mode. Remember it though!
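
The foo.*bar example can be reproduced in Python's `re` module, where `.*` is greedy and `.*?` is the pessimistic (lazy) form. This is a sketch for comparison only; it says nothing about how UltraEdit's engine is implemented.

```python
import re

text = "foothingbarfoothingbar"

greedy = re.sub(r'foo.*bar', 'foobaz', text)   # first foo up to the LAST bar
lazy = re.sub(r'foo.*?bar', 'foobaz', text)    # each foo up to the NEXT bar

print(greedy)  # foobaz
print(lazy)    # foobazfoobaz
```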


The pattern [\r\n]+ will match a new line in the whole input (considering all lines as a stream).
Why is it not [\r\n]* ?
Because that means 0 or more newlines. So everything that is not a newline character and is also a nothing character will match. What's a nothing character? It's the character in between two real characters
ABCDEF <-there are 5 nothings in here between A and B, B and C ...

Try it.. search for [\r\n]* and see the cursor jump between all the letters. This wouldn't happen in GREEDY because the whole input would be swallowed, then spat back until a long sequence of \r\n was found, then matching would continue from there.
If you don't understand this, here's an analogy:

You and me are standing in a room and I say "shout the word BANG when I say the word zero, one or two okay? Each time you say bang I'll restart counting"
In greedy, I count down from 10, 9, 8, 7...
10 9 8 7 6 5 4 3 2 BANG
10 9 8 7 6 5 4 3 2 BANG
10 9 8 7 6 5 4 3 2 BANG

I get to 2 and you say BANG.. for a 10 character input, you matched when I got to 2. That's greedy.. we match 2 characters which is a long input (longer than 0 anyways).

Now go pessimistic:
0 BANG
0 BANG
0 BANG

You're matching instantly, and that's why [\r\n]* in pessimistic mode (UE default) matches the nothing character in between words - because it is valid to say that the nothing character is indeed true for "zero or more occurrences of a newline".. i.e. not a newline (but not a word character either).
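
The "nothing character" effect can be checked in Python's `re` module. Note that Python's quantifiers are greedy by default, and `*` still matches the empty string at positions where there is no newline, so in this engine the effect comes from `*` allowing zero occurrences. Whether UltraEdit behaves the same way is not verified here.

```python
import re

text = "AB\n\nC"

# '*' allows zero occurrences, so it matches the empty string before "A".
first_star = re.match(r'[\r\n]*', text)
print(first_star.span())                 # (0, 0)

# '+' needs at least one newline character, so only the real run matches.
print(re.findall(r'[\r\n]+', text))      # ['\n\n']
```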


Now you must define what a blank line is. What if there are 10 whitespace characters on that line? I'll assume there are.

So what's our newline matcher? [\r\n]+ (one or more of). Or we could force the matcher into greedy mode. Thing is, I have no idea how, because with Perl [abc]* is greedy, [abc]*? is pessimistic and [abc]*+ is possessive (eats the whole input and never spits back, rarely used). So if * behaves pessimistically in UE, I have no idea what Mr Mead did to the engine, because it's broken Perl syntax.

So anyways, we have [\r\n]+ for a newline, now we want to match any number of whitespace \s*
Remember that in pessimistic this match will run until it succeeds but if you don't put anything else onto the end of the expression then you'll get 0 spaces matched!
Why?

<NEWLINE> <NEWLINE>

The first newline will be matched and 0 or more whitespace.. so the pessimistic starts from 0 and finds a match! Yes! 0 occurrences of space is an ok match! So it just matches the first newline it finds.

So now we put another newline in our regex:

[\r\n]+\s*[\r\n]+

Now it will keep going matching up to 10 spaces before it finds a whole newline.
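
A sketch of this step in Python's `re` module, writing every quantifier in its lazy form (`+?`, `*?`) to imitate the pessimistic behaviour described here. That UltraEdit works this way is the post's assumption, not something this example proves.

```python
import re

text = "\n   \n"   # newline, three spaces, newline

# Nothing after \s*?: the lazy match stops as early as possible.
short = re.search(r'[\r\n]+?\s*?', text).group()

# A trailing newline class forces \s*? to expand across the spaces.
full = re.search(r'[\r\n]+?\s*?[\r\n]+?', text).group()

print(repr(short))  # '\n'
print(repr(full))   # '\n   \n'
```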

Now you can see for yourself with this file:

Code:
"this is an
example input
text with a
blank line
now:

and the text
continues"

The newline after colon, and any spaces on that blank line, and the newline on the end of it will be found.

Now just replace with one newline.. ^p in uedit syntax.
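
Putting it together, here is a minimal Python sketch of the final find and replace on the sample file above. Python's `\n` stands in for UltraEdit's ^p, LF line endings are assumed, and `squeeze_blank_lines` is a hypothetical helper name.

```python
import re

def squeeze_blank_lines(text):
    # Newline(s), optional whitespace, newline(s) -> a single newline.
    return re.sub(r'[\r\n]+\s*[\r\n]+', '\n', text)

sample = (
    "this is an\nexample input\ntext with a\nblank line\nnow:\n"
    "   \n"
    "and the text\ncontinues"
)
print(squeeze_blank_lines(sample))
```

Single newlines between ordinary lines are left alone because the pattern needs a second newline after the optional whitespace.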

OK, tutorial over. I hope this kills many questions in this thread, and remember that UE works in pessimistic mode, and that's NOT the Perl default!


--

OK, so to summarise this, I'm making assumptions about the way UE is working because I can't see the code for the app, but here's a summary of my guesses:

Using ^$ to match start and end of input causes each line to become input, rather than the whole document. I.e. the doc is split into lines then each line is matched. These metacharacters represent the "nothing character" before the start of the line and after the end of the line. I'm not sure if UEdit adds the CRLF back onto the line after it uses it to split the document into lines, but before it feeds into the matcher.

UEdit operates in pessimistic mode by default. Normally in Perl you would say foo.*?bar to match fooMatchedInputbar in this string: sampleTextfooMatchedInputbarSampleText, but in UE it's sufficient to say foo.*bar
This is NOT Perl syntax! I don't know how to get the matcher out of this mode (maybe in UE Perl .*? means greedy and .* means pessimistic, I don't know).

So, try to work around the second point there. Working in pessimistic when you're used to greedy can cause WEIRD stuff to happen, as I've discussed in this article. If you're aware of it, you might be able to avoid it! :)