Deleting lines containing more than specified number of words within a tag


Post by greg1234 » Thu May 10, 2012 12:27 pm

Hi,
I'm trying to find and delete lines that contain more than a specified number of words.
For example, I would like to delete lines containing more than 5 words (separated by spaces) between the tags <tag> and </tag> from a file like this:

<tag>this is first sentence</tag><tag2>....</tag2>
<tag>this is the second sentence</tag><tag2>....</tag2>
<tag>this is the very last sentence</tag><tag2>....</tag2>

In the above example the second and third line should be deleted.

I was able to find and delete lines containing a fixed number of words (3 in this example) using the pattern:

<tag>^([0-9a-zA-Z/()]+^) ^([0-9a-zA-Z/()]+^) ^([0-9a-zA-Z/()]+^)</tag><tag2>....</tag2>

But how can I do this for lines containing, for example, three or more words?
Please advise,
Greg
greg1234
Newbie
 
Posts: 4
Joined: Thu May 10, 2012 11:34 am

Re: Deleting lines containing more than specified number of words within a tag

Post by Mofi » Fri May 11, 2012 1:12 am

That can be done only with the Perl regular expression engine. The search string to use is ^.*<tag>(?:\<\w+ *){6,}</tag>.*\r\n
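
For readers who want to try this expression outside UltraEdit, here is a minimal Python sketch. It assumes Python's re module, which has no \< start-of-word anchor, so \b directly before \w+ stands in for it. The function name delete_long_lines and its min_words parameter are made up for this illustration, and the input is assumed to use CRLF line endings.

```python
import re

def delete_long_lines(text: str, min_words: int = 6) -> str:
    """Remove each CRLF-terminated line whose <tag>...</tag> holds min_words or more words."""
    # \b before \w+ is an adaptation: Python's re lacks the \< (start of word) anchor.
    pattern = re.compile(
        r"^.*<tag>(?:\b\w+ *){%d,}</tag>.*\r\n" % min_words,
        re.MULTILINE,
    )
    return pattern.sub("", text)

sample = (
    "<tag>this is first sentence</tag><tag2>....</tag2>\r\n"
    "<tag>this is the second sentence</tag><tag2>....</tag2>\r\n"
    "<tag>this is the very last sentence</tag><tag2>....</tag2>\r\n"
)

print(delete_long_lines(sample))     # six or more words: only the third line is removed
print(delete_long_lines(sample, 5))  # five or more words: the second and third lines are removed
```

The number in the quantifier is the smallest word count that triggers deletion, so it has to be one higher than the largest count that should be kept.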
Mofi
Grand Master
 
Posts: 3937
Joined: Thu Jul 29, 2004 11:00 pm
Location: Vienna

Re: Deleting lines containing more than specified number of words within a tag

Post by greg1234 » Fri May 11, 2012 6:24 am

Works great.
Thanks a lot!

Re: Deleting lines containing more than specified number of words within a tag

Post by greg1234 » Thu May 17, 2012 11:06 am

Hi,

This expression works great, but now I'm also trying to match lines containing non-alphanumeric characters such as: - ' , / % # .
Examples of lines that I'd like to match:

<tag>this can't be so simple</tag><tag2>....</tag2>
<tag>this is sentence with comma, but without dot</tag><tag2>....</tag2>
<tag>this is sentence with comma and dot.</tag><tag2>....</tag2>

  1. Is it possible to modify this expression so that words containing the - symbol are treated as single words, and commas or other symbols are simply ignored in the search?
  2. Or would it be simpler to modify the expression on the assumption that a word is anything separated by spaces?
Thanks,
Greg

Re: Deleting lines containing more than specified number of words within a tag

Post by Mofi » Thu May 17, 2012 12:31 pm

I quickly found the expression ^.*<tag> *(?:\<\w[^\s<]+ *){6,}</tag>.*\r\n

But I was not happy with it because every "word" must start with a word character. Therefore it does not work for something like

<tag>'this' 'is' 'an' 'example' 'with' '6' 'words'</tag><tag2>....</tag2>
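
This shortcoming can be reproduced with a small Python sketch. It is an adaptation, not the UltraEdit search itself: Python's re module has no \< anchor, so \b directly before \w is assumed to be an equivalent stand-in. The variable names are made up for this illustration.

```python
import re

# \b before \w replaces UltraEdit's \< (start of word), which Python's re does not support.
intermediate = re.compile(
    r"^.*<tag> *(?:\b\w[^\s<]+ *){6,}</tag>.*\r\n",
    re.MULTILINE,
)

plain = "<tag>this is sentence with comma, but without dot</tag><tag2>....</tag2>\r\n"
quoted = "<tag>'this' 'is' 'an' 'example' 'with' '6' 'words'</tag><tag2>....</tag2>\r\n"

# The comma inside a word is accepted, so the plain line is found.
print(intermediate.search(plain) is not None)
# Every word starts with an apostrophe, so the quoted line is missed.
print(intermediate.search(quoted) is not None)
```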

I needed more than 30 minutes to find the hopefully ultimate solution producing correct results:

^.*<tag>\s*(?:[^\s<]+\s+){5,}[^\s<]+\s*</tag>.*\r\n

This expression matches entire lines containing at least 6 whitespace-separated strings within <tag>...</tag>. The value 5 in the expression is not a mistake: the group matches 5 or more strings that are each followed by whitespace, and one more string must follow, which gives at least 6 strings. As \s also matches newline characters, there can now even be line breaks within <tag>...</tag>, as in the following example:

<tag>this is the very last sentence
with a line break</tag><tag2>....</tag2>

In this case both lines are completely matched by the expression.
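
The behavior described here, including the line break inside <tag>...</tag>, can be checked with a short Python sketch. It assumes Python's re module with the MULTILINE flag and CRLF line endings; the expression is used unchanged, and the variable names and sample text are made up for this illustration.

```python
import re

final = re.compile(
    r"^.*<tag>\s*(?:[^\s<]+\s+){5,}[^\s<]+\s*</tag>.*\r\n",
    re.MULTILINE,
)

text = (
    "<tag>this is first sentence</tag><tag2>....</tag2>\r\n"          # 4 strings, kept
    "<tag>this can't be so simple</tag><tag2>....</tag2>\r\n"         # 5 strings, kept
    "<tag>'this' 'is' 'an' 'example' 'with' '6' 'words'</tag><tag2>....</tag2>\r\n"  # 7, deleted
    "<tag>this is the very last sentence\r\n"                         # 10 strings across
    "with a line break</tag><tag2>....</tag2>\r\n"                    # two lines, both deleted
    "<tag>short one</tag><tag2>....</tag2>\r\n"                       # 2 strings, kept
)

cleaned = final.sub("", text)
print(cleaned)
```

Only the lines with 4, 5 and 2 strings remain in cleaned.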

Please note that no other tag and no character < is allowed within <tag>...</tag>, unless the < is encoded as &lt; as HTML requires. The regular expression does not match lines containing such a tag or character, so they are not deleted.

Explanation of the expression above:

^ ... start the search at beginning of a line.

.* ... matches 0 or more occurrences of any character except newline characters. This expression matches everything up to next string which is <tag> if the current line contains that string at all.

\s* ... matches 0 or more occurrences of whitespace characters. There could be a space, tab or line break after the string <tag>. Whitespace characters are at least the horizontal tab character (0x09), line-feed (0x0A), the vertical tab character (0x0B, very rare in text files), the form-feed character (0x0C), carriage return (0x0D), the space character (0x20), the non-breaking space character (0xA0), and perhaps also other whitespace characters from the Unicode table (not tested by me).

(?:...) ... groups an expression. Usually everything in round brackets is also marked (tagged) so that it can be back-referenced in the search or replace string. ?: immediately after the opening round bracket tells the Perl engine not to mark the string found by the expression inside the round brackets, as the group is used here only to apply the following quantifier.

[^\s<]+ ... is a negative character class definition. It matches all characters 1 or more times except whitespace characters and left angle bracket. In other words this expression matches a string surrounded by whitespace characters not containing character <.

\s+ ... 1 or more whitespace characters must follow next, not a left angle bracket.

{5,} ... means the preceding group, a non-whitespace string followed by whitespace, must match 5 or more times.

[^\s<]+ ... the same expression as before. After at least 5 non-whitespace strings, each followed by whitespace, there must be one more non-whitespace string.

\s* ... 0 or more whitespace characters can follow before the fixed string </tag>. So after word 6, 7, 8, ... there may be no whitespace at all, with </tag> following immediately.

.*\r\n ... match 0 or more occurrences of any character up to carriage return and line-feed and match these 2 newline characters too.
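
The parts explained above can also be written out one per line. The following is a minimal sketch that assumes Python's re module with the VERBOSE flag, which ignores whitespace in the pattern and allows comments. The constant MIN_WORDS and the sample strings are made up for this illustration; the quantifier is built as MIN_WORDS - 1 to show why 5 stands for at least 6 strings.

```python
import re

MIN_WORDS = 6  # hypothetical name: smallest number of strings that makes a line match

verbose = re.compile(
    r"""
    ^                # start at the beginning of a line
    .*               # any characters up to ...
    <tag>            # ... the opening tag
    \s*              # optional whitespace after the opening tag
    (?:              # non-capturing group, used only for the repetition
        [^\s<]+      #   one string containing no whitespace and no left angle bracket
        \s+          #   followed by at least one whitespace character
    ){%d,}           # repeated MIN_WORDS - 1 or more times
    [^\s<]+          # one more string, which gives MIN_WORDS strings in total
    \s*              # optional whitespace before ...
    </tag>           # ... the closing tag
    .*\r\n           # rest of the line including carriage return and line-feed
    """ % (MIN_WORDS - 1),
    re.MULTILINE | re.VERBOSE,
)

five = "<tag>one two three four five</tag>\r\n"
six = "<tag>one two three four five six</tag>\r\n"

print(verbose.search(five) is not None)  # five strings are one too few
print(verbose.search(six) is not None)   # six strings are enough
```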

Re: Deleting lines containing more than specified number of words within a tag

Post by greg1234 » Fri May 18, 2012 1:54 am

Hi Mofi,
You are a master. I really, really appreciate your help and your time!
That expression works perfectly. Now, I'll take my time to understand what it does.
Thanks a lot.
Greg

Re: Deleting lines containing more than specified number of words within a tag

Post by Mofi » Fri May 18, 2012 9:52 am

I have added an explanation to my previous post for the regular expression search string.

