by Mofi » Thu May 17, 2012 12:31 pm
I quickly found the expression ^.*<tag> *(?:\<\w[^\s<]+ *){6,}</tag>.*\r\n
But I was not happy with it because every "word" must start with a word character. Therefore it does not work for something like
<tag>'this' 'is' 'an' 'example' 'with' '6' 'words'</tag><tag2>....</tag2>
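The failure on quoted strings can be reproduced outside the editor. The following is only a sketch in Python, whose `re` module is a different engine from the Perl engine discussed here. It has no `\<` (start of word) token, so `\b` before `\w` is substituted as an assumption, and the `plain` sample line is made up for illustration.

```python
import re

# Assumption: \b followed by \w stands in for \< (start of word),
# which Python's re module does not support.
FIRST_TRY = re.compile(
    r"^.*<tag> *(?:\b\w[^\s<]+ *){6,}</tag>.*\r\n", re.MULTILINE
)

# Hypothetical sample line with plain words.
plain = "<tag>this is an example with six words</tag><tag2>....</tag2>\r\n"
# The quoted example from the post.
quoted = "<tag>'this' 'is' 'an' 'example' 'with' '6' 'words'</tag><tag2>....</tag2>\r\n"

plain_hit = FIRST_TRY.search(plain) is not None
quoted_hit = FIRST_TRY.search(quoted) is not None

print(plain_hit)   # True: every string starts with a word character
print(quoted_hit)  # False: each string starts with ' which is not a word character
```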
It took me more than 30 minutes to find what is hopefully the final solution producing correct results:
^.*<tag>\s*(?:[^\s<]+\s+){5,}[^\s<]+\s*</tag>.*\r\n
This expression matches entire lines containing at least 6 whitespace-separated strings within <tag>...</tag>. The value 5 in the expression is not a mistake: the group matches at least 5 strings, each followed by whitespace, and one more string must follow, which gives at least 6 strings. As \s also matches newline characters, there can now even be line breaks within <tag>...</tag>, as in the following example:
<tag>this is the very last sentence
with a line break</tag><tag2>....</tag2>
In this case both lines are completely matched by the expression.
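As a rough check of this behaviour, the expression can be run against a few lines in Python. This is a sketch under the assumption that Python's `re` module treats this particular expression like the Perl engine of the editor does. The first line of the sample text is hypothetical, and DOS line endings (`\r\n`) are assumed.

```python
import re

FINAL = re.compile(
    r"^.*<tag>\s*(?:[^\s<]+\s+){5,}[^\s<]+\s*</tag>.*\r\n", re.MULTILINE
)

text = (
    "<tag>only five words in here</tag>\r\n"  # hypothetical line, 5 strings only
    "<tag>'this' 'is' 'an' 'example' 'with' '6' 'words'</tag><tag2>....</tag2>\r\n"
    "<tag>this is the very last sentence\r\n"
    "with a line break</tag><tag2>....</tag2>\r\n"
)

# Each match is the complete matched text including its line terminator(s).
matches = [m.group(0) for m in FINAL.finditer(text)]

for found in matches:
    print(repr(found))
```

The line with only 5 strings is skipped, the quoted line is matched as a whole, and the last match spans both lines of the tag containing the line break.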
Please note that within <tag>...</tag> no other tag and no character < not encoded as &lt; (as HTML requires) is allowed, because the regular expression ignores such lines.
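This limitation is easy to see with three made-up lines. The sketch below again uses Python's `re` module as a stand-in for the editor's Perl engine, which is an assumption, and all three sample lines are hypothetical.

```python
import re

FINAL = re.compile(
    r"^.*<tag>\s*(?:[^\s<]+\s+){5,}[^\s<]+\s*</tag>.*\r\n", re.MULTILINE
)

# Hypothetical lines, each with more than 6 strings inside <tag>...</tag>.
nested = "<tag>one two <b>three</b> four five six seven</tag>\r\n"
raw_lt = "<tag>one two three < four five six seven</tag>\r\n"
encoded = "<tag>one two three &lt; four five six seven</tag>\r\n"

nested_hit = FINAL.search(nested) is not None
raw_lt_hit = FINAL.search(raw_lt) is not None
encoded_hit = FINAL.search(encoded) is not None

print(nested_hit)   # False: the inner <b> tag stops [^\s<]+
print(raw_lt_hit)   # False: a bare < can be neither a string nor whitespace
print(encoded_hit)  # True: &lt; contains no < and counts as a normal string
```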
Explanation of the expression above:
^ ... start the search at beginning of a line.
.* ... matches 0 or more occurrences of any character except newline characters. This expression matches everything up to next string which is <tag> if the current line contains that string at all.
\s* ... matches 0 or more occurrences of whitespace characters. There could be a space, tab or line break after the string <tag>. Whitespace characters are at least the horizontal tab character (0x09), line-feed (0x0A), the vertical tab character (0x0B, very rare in text files), the form-feed character (0x0C), carriage return (0x0D), the space character (0x20), the non-breaking space character (0xA0), and perhaps also other whitespace characters from the Unicode table (not tested by me).
(?:...) ... groups an expression. Usually everything in round brackets is also marked (tagged) so that it can be back-referenced in the search or replace string. ?: immediately after the opening round bracket tells the Perl engine not to mark the string found by the expression inside the round brackets, as here the group exists only so that the following quantifier can be applied to it.
[^\s<]+ ... is a negated character class definition. It matches 1 or more characters which are neither whitespace characters nor a left angle bracket. In other words, this expression matches a string not containing whitespace or the character <.
\s+ ... 1 or more whitespace characters must follow next, and not a left angle bracket.
{5,} ... means the preceding group, a non-whitespace string followed by whitespace, must match 5 or more times.
[^\s<]+ ... now an already well-known expression. After at least 5 non-whitespace strings, each followed by whitespace, there must be one more non-whitespace string.
\s* ... 0 or more whitespace characters can follow before the next fixed string </tag>. So it is allowed that after word 6, 7, 8, ... there is no whitespace and </tag> follows immediately.
.*\r\n ... match 0 or more occurrences of any character up to carriage return and line-feed and match these 2 newline characters too.
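The parts explained above can be put together and checked one by one. This is only a sketch in Python and not the editor's Perl engine, so the whitespace check in particular describes Python's `\s` and nothing more. The helper `line_with` and the words `w1`, `w2`, ... are hypothetical.

```python
import re

# One whitespace-separated string: no whitespace, no left angle bracket.
WORD = r"[^\s<]+"
# Assembled in the order of the explanation above.
PATTERN = r"^.*<tag>\s*(?:" + WORD + r"\s+){5,}" + WORD + r"\s*</tag>.*\r\n"
FINAL = re.compile(PATTERN, re.MULTILINE)


def line_with(count):
    """Return one CRLF-terminated line with `count` made-up words in <tag>."""
    words = " ".join(f"w{i}" for i in range(1, count + 1))
    return f"<tag>{words}</tag>\r\n"


# 5 times "string + whitespace" plus 1 final string gives 6 strings minimum.
results = {n: FINAL.search(line_with(n)) is not None for n in (4, 5, 6, 7)}
print(results)  # {4: False, 5: False, 6: True, 7: True}

# (?:...) only groups, so nothing is tagged for back references.
capture_groups = FINAL.groups
print(capture_groups)  # 0

# Characters listed above as whitespace, tested against Python's \s.
candidates = ["\x09", "\x0a", "\x0b", "\x0c", "\x0d", "\x20", "\xa0"]
all_whitespace = all(re.fullmatch(r"\s", ch) is not None for ch in candidates)
print(all_whitespace)  # True
```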