Removing dupe lines (again, sorry!)

Post by dodued » Sat Apr 09, 2011 4:47 am

I have the following case:

NEW FILE
blabla blabla blable ItemID=94374042 blabla blablublibla blable other_data
blabla blabla blable ItemID=91087082 blabla blabla blbleblaable other_data
blabla blabla blable ItemID=92415300 blabla blabla blablingble other_data
bplofla blabla blable ItemID=91584918 blabplbangofla blabla blable other_data
blabla blabla blable ItemID=95484087 blabla blapowbla blaplofble other_data
bhahalabla blabla blable ItemID=93881915 blabla blabla blable other_data
blablabli blabla blable ItemID=93391409 blabla blabla blable other_data
blabla blblublzabla blable ItemID=94508261 blabla blabandla blable other_data

OLD FILE 1
blabplofla blabla blable ItemID=95709167 splashlabla blabla blable other_data
blabla blabla blable ItemID=94889375 blabla blabbingla blable other_data
blabla blabla blable ItemID=91087082 blabla blabla blable other_data
bpifflabla blabla blable ItemID=93989584 blabla blabla blable other_data
blabla blabla blable ItemID=91930654 blabla blabla bangblable other_data
blabla blabla blable ItemID=93621288 blabla blabla blasockble other_data
blabla blabla blable ItemID=96507582 blabla blabla blable other_data
blablabli blabla blable ItemID=92221673 blabla blabla blable other_data
blabla blablabluble blable ItemID=93391409 blabla blabla blable other_data
blabla bllololoaei blable ItemID=93775797 blabla blabla blable other_data

OLD FILE 2
blabplofla blabla blable ItemID=91424876 blabplofla blabla blable other_data
blabplofla blabla blable ItemID=93272698 blabplofla bingblabla blable other_data
blabfla blabla blable ItemID=94407207 blabplofla blabla blable other_data
bplofla blabla blable ItemID=91584918 blabplbangofla blabla blable other_data
blabbliplo blabla blable ItemID=95498779 blabplofla blabla blable other_data
blabploblu blabla blable ItemID=91634932 blabplofboffla blabla blable other_data
blabplofla blabla blable ItemID=90366946 blabplofla blabla blwowable other_data
bleepbplofla blabla blable ItemID=92169269 blabplofla blabla blable other_data

I need to remove the lines from the NEW file that contain an "ItemID" already present in one or another of the OLD files.
The rest of the line's content is of no importance.
There are only a few OLD files at the moment, all in the same folder and guaranteed not to hold dupes, about 1500 lines in all.
Eventually they should grow in size and number to a total of about 30,000 lines or more. I don't mind waiting a few seconds for the dupe-check to complete.
Question: what happens to the execution of the macro if the NEW file consists solely of duplicate ItemIDs? At the end of my work this might occur.
I have not been able to get either ReplInFiles or FindInFiles to give what I need, but that's probably due to my non-knowledge (commonly called ignorance).

Thank you for your help!!
DoduEd

Re: Removing dupe lines (again, sorry!)

Post by Mofi » Sat Apr 09, 2011 8:01 am

Here is a macro that deletes the duplicate lines. It needs the macro property "Continue if search string not found" checked. In the macro you have to adjust the directory containing the old files and the file type specification so that FindInFiles finds all lines with ItemID=[0-9]+ in all the old files in this directory. Open the new file, and only this file, before executing the macro. The new file must be stored in a different folder or have a different file extension. In other words, the FindInFiles command must not find the lines with ItemID=[0-9]+ in the new file; otherwise the new file is empty after running the macro. The new file must also use DOS line terminators.

InsertMode
ColumnModeOff
HexOff
Bottom
IfColNumGt 1
InsertLine
EndIf
Top
UltraEditReOn
FindInFiles MatchCase RegExp "C:\Temp\" "*.txt" "ItemID=[0-9]+"
Top
UnicodeToASCII
Loop 0
Find MatchCase RegExp "ItemID=[0-9]+"
IfFound
Find MatchCase RegExp AllFiles "%*^s*^p"
Replace All ""
Else
ExitLoop
EndIf
EndLoop
CloseFile NoSave


How does it work? First, the macro makes sure that the last line of the open file has a line termination, just to be safe. The FindInFiles command then copies all lines containing ItemID=[0-9]+ from all the old files in the specified directory into a new file, the results file of the search. This file is converted to ASCII. Next, from the top of the results file to the bottom, every ItemID=[0-9]+ is selected with an UltraEdit regular expression search in a loop until no more are found. For every ItemID found, an UltraEdit regular expression Replace All across all open files deletes every line containing the selected ItemID in both the new file and the active results file. After the loop finishes, the results file is closed without saving, and the result is a new file that no longer contains any ItemID already present in one of the old files.
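For anyone who wants to cross-check the result outside UltraEdit, the same idea can be sketched in Python. This is not the macro and not an UltraEdit API, only an illustration of the logic described above: collect every ItemID from the old files, then keep only those lines of the new file whose ItemID is unknown. The function names, the "*.txt" default and the latin-1 encoding are assumptions made for the example. One difference worth noting: the macro deletes every line that contains the selected "ItemID=..." text as a substring, while this sketch compares complete ID numbers.

```python
import re
from pathlib import Path

# Same pattern the macro searches for, with the digits captured.
ITEM_ID = re.compile(r"ItemID=([0-9]+)")


def collect_old_ids(old_dir, pattern="*.txt"):
    """Return the set of all ItemID numbers found in the old files.

    old_dir and pattern are hypothetical parameters standing in for the
    directory and file type given to FindInFiles in the macro.
    """
    ids = set()
    for path in Path(old_dir).glob(pattern):
        # latin-1 is an assumption; it decodes any byte sequence.
        with open(path, encoding="latin-1") as handle:
            for line in handle:
                ids.update(ITEM_ID.findall(line))
    return ids


def remove_known_lines(new_lines, old_ids):
    """Keep only the lines whose ItemID is not in old_ids.

    Lines without any ItemID are kept unchanged.
    """
    kept = []
    for line in new_lines:
        found = ITEM_ID.findall(line)
        if any(item in old_ids for item in found):
            continue  # ItemID already present in an old file
        kept.append(line)
    return kept
```

With the sample data from the first post, the lines with ItemID 91087082, 91584918 and 93391409 would be dropped from the NEW file and the other five kept. If every line of the NEW file is a duplicate, this sketch simply returns an empty list.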

Re: Removing dupe lines (again, sorry!)

Post by dodued » Sun Apr 10, 2011 5:06 pm

Perfect...
Thank you for your time and knowledge!
This is solved, hurrah!
DoduEd

