Filter lines in large file based on variable criteria

Postby cgcamal » Mon Oct 06, 2008 12:47 am

Hi guys, I'm really new to this and I hope somebody can help me.

I have a document with lists that contain many of the 33 different "Types" of products, but not all of them.
I need a kind of "filter" that extracts only the number located between the strings "LIST:NUMBER=" and ",TYPES="
for every line that contains a specific "Type" I look up.

The large text file (about 256 MB and roughly 2.5 million lines) follows this pattern:

*******************************************************************************************************
SALE:NUMBER=12345678910:TYPE=XXXXX
LIST:NUMBER=12345678910,TYPES=Type1-1&Type2-10&Type4-2&Type5-1&...&Type31-1&Type32-0&Type33-0

SALE:NUMBER=56734520957:TYPE=XXXXX
LIST:NUMBER=56734520957,TYPES=Type1-1&Type3-1&Type4-2&Type5-1&...&Type31-1&Type32-0&Type33-0

SALE:NUMBER=77834002759:TYPE=XXXXX
LIST:NUMBER=77834002759,TYPES=Type1-1&Type2-10&Type4-2&Type5-1&...&Type31-1&Type32-0
.
.
more or less 2 million lines after
.
.
SALE:NUMBER=23111109385:TYPE=XXXXX
LIST:NUMBER=23111109385,TYPES=Type1-1&Type2-10&Type3-1&Type4-2&Type5-1&...&Type31-1&Type32-0&Type33-0
*******************************************************************************************************
What I need, by example:

Example 1:
If I want to filter for "Type2-10", the answer would be, in a new file, as follows:

************************************
12345678910
77834002759
.
.
.
23111109385
************************************
Example 2:
If I want to filter for "Type3-1" and "Type4-2", the answer would be, in a new file, as follows:

************************************
56734520957
.
.
.
23111109385
************************************
I made a macro that does the filtering, but it copies the complete line for every match and not only the number
between the strings as described above.

Questions:
1) I don't know how to tell the macro to extract into a new file only the numbers between the strings explained above for every match found.

2) On the other hand, I've used the following commands to make the lookup flexible, but something is wrong because the same data is not always pasted. I think it has something to do with the clipboard, but I don't know how to fix it.


GetString "A filter over which Type?",
CutAppend
Find "^c"
NewFile
Paste


The complete macro I have at the moment:

InsertMode
ColumnModeOn
HexOff
UnixReOff
GotoLine 1 1
GetString "A filter over which Type?",
CutAppend
Find "^c"
NewFile
Paste
SaveAs "C:\Documents and Settings\My documents\Filter\^c List.TXT"


Thanks in advance.

Best regards.

Re: Filter lines in large file based on variable criteria

Postby pietzcker » Mon Oct 06, 2008 2:52 am

Hi,

a few thoughts from me:

- This looks more like a job for a grep tool, not a text editor. With a tool like PowerGREP, this would be a 30 second job.
- This can certainly be done with UE anyway. I'm not sure whether macros can do the job, since I don't think you can dynamically construct a regex search string from the clipboard; Mofi can answer that question. I'd suggest using UE's scripting engine, which shouldn't have a problem with that.
- You could:

first delete all blank lines
Search for Perl regex ^[ \t]*\r\n
Replace with nothing.

then delete all lines that start with "SALE:"
Search for ^SALE:.*\r\n
Replace with nothing.

then delete all the lines that don't contain your filter term, one by one:
Search for ^LIST:NUMBER=(?:(?!Type2-10).)*$\r\n
Replace with nothing, repeating this once for each filter term.

Finally, clean up, removing everything but the number:
Search for ^LIST:NUMBER=([^,]+),.*
Replace with \1

Then save under a different filename.
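
For anyone who wants to try the same deletion approach outside UltraEdit, the four replace steps above can be sketched in Python roughly as follows. This is only an illustration: the function name `filter_by_deletion` and the shortened `SAMPLE` data are made up, filter terms are escaped so they are matched literally, and `\n` stands in for `\r\n` after the line endings have been normalised.

```python
import re

# Shortened, made-up sample in the layout shown in the first post.
SAMPLE = (
    "SALE:NUMBER=12345678910:TYPE=XXXXX\r\n"
    "LIST:NUMBER=12345678910,TYPES=Type1-1&Type2-10&Type4-2\r\n"
    "\r\n"
    "SALE:NUMBER=56734520957:TYPE=XXXXX\r\n"
    "LIST:NUMBER=56734520957,TYPES=Type1-1&Type3-1&Type4-2\r\n"
)


def filter_by_deletion(text: str, terms: list[str]) -> str:
    """Apply the four replace steps in order and return what is left."""
    # Normalise line endings and make sure the last line is terminated.
    text = text.replace("\r\n", "\n")
    if text and not text.endswith("\n"):
        text += "\n"
    # Step 1: delete blank lines.
    text = re.sub(r"^[ \t]*\n", "", text, flags=re.MULTILINE)
    # Step 2: delete all lines that start with "SALE:".
    text = re.sub(r"^SALE:.*\n", "", text, flags=re.MULTILINE)
    # Step 3: once per filter term, delete LIST lines that do not contain it.
    for term in terms:
        pattern = r"^LIST:NUMBER=(?:(?!" + re.escape(term) + r").)*\n"
        text = re.sub(pattern, "", text, flags=re.MULTILINE)
    # Step 4: clean up, keeping only the number.
    return re.sub(r"^LIST:NUMBER=([^,]+),.*", r"\1", text, flags=re.MULTILINE)


if __name__ == "__main__":
    print(filter_by_deletion(SAMPLE, ["Type3-1", "Type4-2"]), end="")
```

Because step 3 runs once per term, passing several terms keeps only the lines that contain all of them.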

As I said before, PowerGREP would do this in half a minute, including the definition of the search...

Cheers,
Tim

Re: Filter lines in large file based on variable criteria

Postby Mofi » Mon Oct 06, 2008 3:07 am

Here is the macro which does the job. You have to enter only the type numbers without the word "Type". For example, if you enter 2-10 you will get the result of example 1. The string you enter is interpreted as a regular expression in UltraEdit syntax, so you can for example use an OR expression like ^{3-1^}^{4-2^} to get the list numbers of the lines which contain "Type3-1" or "Type4-2" (= result of example 2). The file name of the saved data will then not look very nice, though. Please also note that the UltraEdit regular expression engine supports only 2 arguments for the OR expression, so something like ^{3-1^}^{4-2^}^{2-10^} is not possible with 1 macro execution.

The macro property "Continue if a Find with Replace not found" or "Continue if search string not found" must be checked for this macro.

InsertMode
ColumnModeOff
HexOff
UnixReOff
Bottom
IfColNumGt 1
InsertLine
IfColNumGt 1
DeleteToStartofLine
EndIf
EndIf
Top
NewFile
Clipboard 9
GetString "A filter over which Type?"
SelectToTop
Cut
NextWindow
Clipboard 8
ClearClipboard
Loop
Clipboard 9
Find RegExp "LIST:NUMBER=[0-9]+,TYPES*Type^c*^p"
IfNotFound
ExitLoop
EndIf
Clipboard 8
CopyAppend
EndLoop
Top
PreviousWindow
Clipboard 8
Paste
ClearClipboard
Top
Find RegExp "LIST:NUMBER=^([0-9]+^)*$"
Replace All "^1"
Clipboard 9
SaveAs "C:\Documents and Settings\My documents\Filter\Type^c List.TXT"
ClearClipboard
Clipboard 0

The method Tim suggested, deleting everything that is not of interest with regular expression replaces, would be much faster, but we don't know what lines your source files really contain.
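
For comparison, the two stages of the macro (collect the matching lines, then reduce them to the number) can be sketched in Python, where an OR expression is not limited to 2 arguments. This is a rough illustration and not a translation of the macro: the names `lines_for_types` and `numbers_only` and the shortened `SAMPLE_LINES` are made up, the type numbers are escaped rather than interpreted as a regular expression, and the `(?!\d)` guard is an addition that keeps an input like 2-1 from also matching "Type2-10".

```python
import re

# Shortened, made-up LIST lines in the layout shown in the first post.
SAMPLE_LINES = [
    "LIST:NUMBER=12345678910,TYPES=Type1-1&Type2-10&Type4-2&Type5-1",
    "LIST:NUMBER=56734520957,TYPES=Type1-1&Type3-1&Type4-2&Type5-1",
    "LIST:NUMBER=77834002759,TYPES=Type1-1&Type2-10&Type4-2&Type5-1",
    "LIST:NUMBER=23111109385,TYPES=Type1-1&Type2-10&Type3-1&Type4-2&Type5-1",
]


def lines_for_types(lines: list[str], type_ids: list[str]) -> list[str]:
    """Stage 1: keep the LIST lines that contain at least one of the type ids."""
    # Any number of alternatives can be joined with "|".
    alternatives = "|".join(re.escape(type_id) for type_id in type_ids)
    # (?!\d) is not part of the macro's search string; see the note above.
    wanted = re.compile(
        r"LIST:NUMBER=[0-9]+,TYPES.*Type(?:" + alternatives + r")(?!\d)"
    )
    return [line for line in lines if wanted.search(line)]


def numbers_only(lines: list[str]) -> list[str]:
    """Stage 2: reduce each collected line to its number."""
    return [re.sub(r"LIST:NUMBER=([0-9]+).*$", r"\1", line) for line in lines]


if __name__ == "__main__":
    for number in numbers_only(lines_for_types(SAMPLE_LINES, ["3-1", "2-10"])):
        print(number)
```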

Re: Filter lines in large file based on variable criteria

Postby cgcamal » Mon Oct 06, 2008 5:12 pm

Hi Tim and Mofi,

Many thanks for answering my question.

I tested your suggestions and they run perfectly, but the loop task executes very slowly (40 min). I've made a macro that only extracts the number between the strings with the expression you gave me,


Find RegExp "LIST:NUMBER=^([0-9]+^)*$"

and it works very fast, very nice! But that is only the second part of what I need.

After a lot of trial and error, I found that UltraEdit does the filtering very fast, in no more than 30 seconds, with the following steps:


1) Open Find (Ctrl+F) with the option "List Lines Containing String" selected.
2) Enter the string that I want to find in every line of the document.

Now UltraEdit answers with a new window named "Lines containing find string:"
with 5 options: "Close, Goto, Bookmark All, Clipboard and Refresh".

3) In this window, click on the Clipboard option.
4) NewFile
5) Paste
6) Run a macro with the command:
Find RegExp "LIST:NUMBER=^([0-9]+^)*$"

7) It's done!


But steps 1-5 were applied without variables like ^c, so now my problem is:

1) How to use variables with GetString "".
2) How to force the Find function (Ctrl+F) to select the option "List Lines Containing String".
3) How to copy the filtered lines with the Clipboard option,

because when I record the steps, the macro doesn't show intermediate steps 2 and 3 and
looks like this:
Find "Whatever"
NewFile
Paste

Could you please tell me how to fix this?

Thanks very much again.

Re: Filter lines in large file based on variable criteria

Postby Mofi » Tue Oct 07, 2008 3:39 am

My loop is the replacement for a Find with List Lines Containing String followed by copying the results to the clipboard. The List Lines Containing String option requires user interaction and therefore cannot be run automatically from within a script or macro. What makes the macro so slow on your very large file is scrolling and displaying the content. A tool like PowerGREP, as Tim suggested, would do this much faster because it does not have to display the content during execution.

Here is another UltraEdit macro solution which should be much faster because there is no scrolling in the source file. But it is very important that only your source file is open and no other file, or the macro will not produce the correct result. It runs a Find In Files on all open files to get the lines of interest, with the results written to an edit window which unfortunately scrolls. After the data has been collected in the results window, all lines and data of no interest are deleted with regular expression replaces (this assumes an English UE with default settings for the output format of Find In Files).

If you are 100% sure that the last line of your source file always ends with a line termination, remove the red colored code (the block from Bottom to the second EndIf, which adds a missing line termination) to make the macro faster. You can also remove this code when the last line is certainly never a data line of interest.

I hope you use the latest version of UltraEdit because previous versions had several problems with the macro command FindInFiles.

InsertMode
ColumnModeOff
HexOff
UnixReOff
Bottom
IfColNumGt 1
InsertLine
IfColNumGt 1
DeleteToStartofLine
EndIf
EndIf

NewFile
Clipboard 9
GetString "A filter over which Type?"
SelectToTop
Cut
CloseFile NoSave
FindInFiles RegExp OpenFiles "" "" "LIST:NUMBER=[0-9]+,TYPES*Type^c"
Top
Find MatchCase "----------------------------------------^p"
Replace All ""
Find RegExp "Search complete, found?+^p"
Replace All ""
Find RegExp "%F[oui]+nd 'LIST:NUMBER?+^p"
Replace All ""
Find RegExp "%*LIST:NUMBER=^([0-9]+^)*$"
Replace All "^1"
SaveAs "C:\Documents and Settings\My documents\Filter\Type^c List.TXT"
ClearClipboard
Clipboard 0
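
Outside UltraEdit, the same idea (collect only the lines of interest without displaying anything) can be sketched in Python by streaming the file line by line, so that a 256 MB source never has to be held in memory or rendered. This is a rough sketch under stated assumptions: `filter_file` and `demo` are made-up names, the file is assumed to be plain ASCII, the sample data is shortened and made up, and the type is compared against the whole "&"-separated entries rather than searched as a regular expression.

```python
import re
import tempfile
from pathlib import Path

# One LIST line: group 1 is the number, group 2 the "&"-separated type entries.
LIST_LINE = re.compile(r"^LIST:NUMBER=([0-9]+),TYPES=(.*)$")


def filter_file(source: Path, target: Path, type_name: str) -> int:
    """Stream source line by line, write matching numbers to target, return the count."""
    count = 0
    with open(source, encoding="ascii", errors="replace") as src, \
            open(target, "w", encoding="ascii") as dst:
        for line in src:
            match = LIST_LINE.match(line.rstrip("\r\n"))
            # Comparing whole entries keeps "Type2-1" from matching "Type2-10".
            if match and type_name in match.group(2).split("&"):
                dst.write(match.group(1) + "\n")
                count += 1
    return count


def demo() -> tuple[int, str]:
    """Run filter_file on a tiny made-up sample in a temporary directory."""
    sample = (
        "SALE:NUMBER=12345678910:TYPE=XXXXX\n"
        "LIST:NUMBER=12345678910,TYPES=Type1-1&Type2-10&Type4-2\n"
        "\n"
        "SALE:NUMBER=56734520957:TYPE=XXXXX\n"
        "LIST:NUMBER=56734520957,TYPES=Type1-1&Type3-1&Type4-2\n"
        "\n"
        "SALE:NUMBER=77834002759:TYPE=XXXXX\n"
        # The last line deliberately has no line termination.
        "LIST:NUMBER=77834002759,TYPES=Type1-1&Type2-10&Type4-2"
    )
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "source.txt"
        target = Path(tmp) / "Type2-10 List.TXT"
        source.write_text(sample, encoding="ascii")
        count = filter_file(source, target, "Type2-10")
        return count, target.read_text(encoding="ascii")


if __name__ == "__main__":
    print(demo())
```

Reading line by line also makes the missing line termination on the last line harmless, because each line is stripped before it is matched.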

Re: Filter lines in large file based on variable criteria

Postby cgcamal » Wed Oct 08, 2008 1:42 am

Hi guys, thanks very much, really!

I've tested both solutions, the first procedure given by pietzcker (Perl style) and Mofi's last one (UltraEdit style).

Both run much faster than the other macros I've used for this task, and both take a similar time,
from 3 to 4 minutes.

Thank you very much, because I've learned a lot from your examples.

One more question:

For example, you gave me the solution for searching for occurrences of Type1-1 OR Type2-2 with "^{1-1^}^{2-2^}", but

how can I search for Type1 AND Type2 occurring at the same time, in Perl, UltraEdit or Unix style?
In this case Type1 and Type2 must both be in the string. Which symbol do I have to use to represent the logical operation "AND"?

Thanks in advance.

Best regards

Re: Filter lines in large file based on variable criteria

Postby Mofi » Wed Oct 08, 2008 4:22 am

In UltraEdit syntax the search string is 1-1?+Type4-2 where ?+ means any character 1 or more times except a new line character.

In Perl/Unix syntax the search string is 1-1.+Type4-2 where .+ means any character 1 or more times except a new line character.

But it is now important that you specify the types in the order in which they occur in the lines. For example, 4-2.+Type1-1 will not find any line.

Re: Filter lines in large file based on variable criteria

Postby cgcamal » Thu Oct 09, 2008 1:12 am

Hi Mofi,

As I said before, I'm new to this subject, and the only thing I can do is smile :D when I see this little macro with the regex you gave me do exactly what I wanted.

This is a very useful tool that I didn't know about.

Many thanks for helping me get started with this wonderful subject.

PS: (Mofi) The AND expression you gave me passed the test; obviously it works :!:


Best Regards from Honduras.

Filter lines in large file based on variable criteria with Regex

Postby cgcamal » Sat Oct 11, 2008 4:20 pm

Hi everybody,

I asked this question in the macro section, and they helped me a lot; it worked great (see above).

But now I want to learn how to do it with a script.

How can I extract, using only one regex (into a new file or in the same one), the numbers between "LIST:NUMBER=" and ",TYPES="
for every line that matches the search? (The search strings would be of the form "TypeX-X".)

The regex could be an if-then expression, I think, but I have no idea how to do it.
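
One way to combine the filter and the extraction in a single regex is a capturing group for the number followed by the rest of the line up to the wanted type; no if-then expression is needed for that. The sketch below shows the idea in Python rather than in UltraEdit's scripting engine: `extract_numbers` and `SAMPLE_TEXT` are made-up names, the sample data is shortened, and the `\b` word boundaries are an assumption added so that "Type2-1" does not match "Type2-10".

```python
import re

# Shortened, made-up sample in the layout shown in the first post.
SAMPLE_TEXT = (
    "SALE:NUMBER=12345678910:TYPE=XXXXX\n"
    "LIST:NUMBER=12345678910,TYPES=Type1-1&Type2-10&Type4-2\n"
    "\n"
    "SALE:NUMBER=56734520957:TYPE=XXXXX\n"
    "LIST:NUMBER=56734520957,TYPES=Type1-1&Type3-1&Type4-2\n"
    "\n"
    "SALE:NUMBER=77834002759:TYPE=XXXXX\n"
    "LIST:NUMBER=77834002759,TYPES=Type1-1&Type2-10&Type4-2\n"
)


def extract_numbers(text: str, type_name: str) -> list[str]:
    """Return the numbers of all LIST lines that contain type_name."""
    pattern = re.compile(
        # Group 1 captures the number; ".*" cannot cross a line break,
        # so the type must be found on the same line.
        r"^LIST:NUMBER=([0-9]+),TYPES=.*\b" + re.escape(type_name) + r"\b",
        re.MULTILINE,
    )
    # With exactly one group, findall returns the captured numbers directly.
    return pattern.findall(text)


if __name__ == "__main__":
    print("\n".join(extract_numbers(SAMPLE_TEXT, "Type2-10")))
```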

Many thanks in advance.

Best Regards

