Eliminating duplicates based on strings left of the equal sign

Post by boromir » Tue Oct 20, 2009 10:16 pm

Hello,
I have checked through the duplicates macros but cannot find one that meets my requirements.
I have a huge file in which the data is made up of 2 columns of irregular length separated by a =. It so happens that the left hand column has duplicates but the right hand column corresponding to these duplicates does not have similar words.
An example will explain:
ravi=rvI
ravi=roI
What I would like to do is to separate out the duplicates and store them in a separate file without deleting them. In other words the macro should create two files:
non-dupes and dupes
Is there any chance of a macro for this? I have huge files running to over a hundred thousand entries. Please help.
Doc

Re: Eliminating duplicates based on strings left of the equal sign

Post by Mofi » Wed Oct 21, 2009 1:03 am

If I have understood you correctly, you want the data in your file split into 2 other files. The first one should contain the lines with the first occurrence of each string left of the equal sign, and the second one should contain all other lines whose string left of the equal sign has already been found once.

For this task you need 2 macros. The macros are based on the macro I posted at "How do I remove duplicate lines?". The macro property "Continue if search string not found" must be checked for both macros. The macro property for the Cancel dialog should be unchecked for both macros.

The macros are designed to work on lines with DOS line endings because ^p is used in the search strings. If your file is a Unix file opened without conversion to DOS, you have to replace every ^p with ^n in the macro sources to get the macros working correctly.
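
If it is not clear which line endings the file uses, a quick check outside the editor can settle whether ^p or ^n is the right choice. The following is only a minimal sketch in Python, not part of the macros; the function name count_line_endings is hypothetical and the sample bytes are made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count DOS (CRLF), Unix (bare LF) and old Mac (bare CR) line endings."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF not preceded by CR
    cr = data.count(b"\r") - crlf   # CR not followed by LF
    return {"dos": crlf, "unix": lf, "mac": cr}


# Illustrative sample with DOS line endings (assumption, not real data).
sample = b"ravi=rvI\r\nravi=roI\r\n"
print(count_line_endings(sample))  # {'dos': 2, 'unix': 0, 'mac': 0}
```

For a real file, the bytes would be read in binary mode, for example with open(path, "rb").read(), so that no line ending conversion takes place before counting.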

The first macro uses command SelectLine which is faster, but requires that the source file is opened without any word wrap enabled.

It is important that your huge source file is opened with use of a temporary file, because the macro modifies this source file but then closes it without saving the changes (and reopens it). This works only if the modifications are not permanent, which is the case only when the source file is opened with use of a temporary file.

First click on Macro - Edit Macro, then on button New Macro, and enter FindDuplicates as the macro name. The name of the first macro is important, including the case of the letters, because the second macro calls it by this name. After setting the properties as written above, click OK and replace the existing lines with the following macro code:

Loop
Clipboard 9
Find MatchCase "^c"
IfFound
SelectLine
Clipboard 8
CutAppend
Else
ExitLoop
EndIf
EndLoop

Next click again on button New Macro and confirm that you want to save the modifications of the just created macro. For the second macro the name does not matter; use for example SplitDupsFile. After setting the properties as written above, click OK and replace the existing lines with the following macro code:

InsertMode
ColumnModeOff
HexOff
UnixReOff
Bottom
IfColNum 1
Else
"
"
EndIf
Top
Find RegExp "%^([~^p]^)"
Replace All "#MOFI_RULES#^1"
Clipboard 7
ClearClipboard
Clipboard 8
ClearClipboard
Clipboard 9
Loop
Find MatchCase RegExp "%#MOFI_RULES#*="
IfNotFound
ExitLoop
EndIf
Copy
Clipboard 7
CutAppend
Find RegExp "?++^p"
CutAppend
PlayMacro 1 "FindDuplicates"
Top
EndLoop
CopyFilePath
CloseFile NoSave
Open "^c"
ClearClipboard
NewFile
Clipboard 7
Paste
ClearClipboard
Top
Find MatchCase RegExp "%#MOFI_RULES#"
Replace All ""
NewFile
Clipboard 8
Paste
ClearClipboard
Top
Find MatchCase RegExp "%#MOFI_RULES#"
Replace All ""
IfNotFound
"NO DUPLICATES :-)
"
EndIf
Clipboard 0

After closing the edit macro dialog with button Close and confirming to save the modifications of the just created second macro, open your file if it is not already open and run the second macro once. It will take very long on your huge file, but in the end the source file is closed without saving the changes and reopened, and you should get 2 new files. The first (left) one contains the first occurrence of each line with a unique string left of the equal sign, and the second (right) one contains the lines with the duplicates.

For example the source file contains:

ravi=rvI
ravi=roI
test1=test
test2=1
test2=3
test2=2


The first new file contains after macro execution:

ravi=rvI
test1=test
test2=1


The second new file contains after macro execution:

ravi=roI
test2=3
test2=2
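
For comparison, the same split can be sketched outside UltraEdit. This is not the macro method above, only a minimal illustration in Python of the rule "first occurrence of a string left of the equal sign goes to one list, every later occurrence to the other". The function name split_first_and_dupes is hypothetical, the comparison is case-sensitive like the MatchCase searches, and a line without an equal sign is treated as its own key, which is an assumption and not something the macros do.

```python
def split_first_and_dupes(lines):
    """Split lines into first occurrences and later duplicates,
    keyed by the text left of the first '='."""
    seen = set()
    firsts, dupes = [], []
    for line in lines:
        # Text left of the first '='; the whole line if there is no '='.
        key = line.split("=", 1)[0]
        if key in seen:
            dupes.append(line)    # key already found once: duplicate
        else:
            seen.add(key)
            firsts.append(line)   # first time this key is seen
    return firsts, dupes


# The example source file from above, one list entry per line.
source = ["ravi=rvI", "ravi=roI", "test1=test", "test2=1", "test2=3", "test2=2"]
firsts, dupes = split_first_and_dupes(source)
print(firsts)  # ['ravi=rvI', 'test1=test', 'test2=1']
print(dupes)   # ['ravi=roI', 'test2=3', 'test2=2']
```

Because every line is looked at only once and the keys are kept in a set, such a sketch makes a single pass over the data, and the original order of the lines is kept in both results.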

Re: Eliminating duplicates based on strings left of the equal sign

Post by boromir » Wed Oct 21, 2009 5:00 am

Mofi,
You have saved my life. I revisited the forum not expecting a reply and to my surprise I found the answer. Many thanks. I tested it on a file of 200 words and it works fast and is accurate.
The actual file has around 264663 records, so I'll leave it running tonight and check the result tomorrow.
Many thanks once more

Boromir