How do I remove duplicate lines?

Help with writing and playing macros

Postby hamilton » Mon Dec 05, 2005 10:33 am

I have a problem that I hope someone can help me with :)
I have a large file in which some lines have duplicate (or more) values for the same "primary key". The "key" is always in the same position in the line.
Is it possible to make a macro that deletes the lines that have duplicate values, and if so, how?
My file has more than 10.0000 lines and I would hate to do this manually :evil:
I also have to keep the file as it is, so I can't import it into Excel.

I have UE32 ver 11.00b.
hamilton
Basic User
Posts: 10
Joined: Wed Jul 27, 2005 11:00 pm

Re: How do I remove duplicate lines?

Postby Mofi » Mon Dec 05, 2005 11:54 am

The following macro should do the job, but only if no line contains UltraEdit style regular expression characters such as +[]^%$ ... See the UltraEdit help on regular expressions in UltraEdit style. Unix style cannot be used here, because ^c is not available in a Unix style search.

InsertMode
ColumnModeOff
HexOff
UnixReOff
Bottom
IfColNum 1
Else
"
"
EndIf
Top
Clipboard 9
Loop
IfEof
ExitLoop
EndIf
Key END
IfColNumGt 1
StartSelect
Key HOME
Cut
EndSelect
Find RegExp "%^c^p"
Replace All ""
Paste
EndIf
Key DOWN ARROW
EndLoop
ClearClipboard
Clipboard 0
Top
UnixReOn

Remove the last command, UnixReOn (shown in red in the original post), if you use UltraEdit style regular expressions by default instead of Unix style.
For UltraEdit v11.10c and lower see Advanced - Configuration - Find - Unix style Regular Expressions.
For UltraEdit v11.20 and higher see Advanced - Configuration - Searching - Unix style Regular Expressions.
The macro commands UnixReOn/UnixReOff modify this setting.

I have an idea for doing it without a regular expression search, but it is much trickier and I have no time right now to develop that macro set (it cannot be done with a single macro).
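
For readers who want to see what the macro above does: it visits each non-empty line, cuts it to clipboard 9, deletes every following line that matches the cut text, and pastes the line back. Below is a minimal Python sketch of that logic, not a translation of the macro. The function name remove_duplicate_lines is made up for illustration, and the comparison here is exact, whereas the macro's Find has no MatchCase parameter and so presumably ignores case.

```python
def remove_duplicate_lines(lines):
    """Keep the first occurrence of each non-empty line, drop later repeats.

    Mirrors the macro's loop: take the current line, delete all identical
    lines below it, then move down one line. Empty lines are left alone,
    as in the macro (it only acts when the line has content).
    """
    result = list(lines)
    i = 0
    while i < len(result):
        current = result[i]
        if current != "":
            # Equivalent of the "Replace All" step: remove later exact matches.
            result[i + 1:] = [line for line in result[i + 1:] if line != current]
        i += 1
    return result


if __name__ == "__main__":
    sample = ["a", "b", "a", "", "c", "b", ""]
    print(remove_duplicate_lines(sample))
```

Like the macro, this compares every line against all lines below it, so the run time grows quadratically with the number of lines.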
Mofi
Grand Master
Posts: 4069
Joined: Thu Jul 29, 2004 11:00 pm
Location: Vienna

Re: How do I remove duplicate lines?

Postby hamilton » Mon Dec 05, 2005 1:33 pm

Thank you :D I don't see how the macro works, but it actually does :D This will do the job.

Thanks from Norway

Re: How do I remove duplicate lines?

Postby Bego » Mon Dec 05, 2005 3:06 pm

Hi Norwegian guy, hello Mofi,

I added some functionality to Mofi's good macro:
It opens a new tab and lists the duplicated lines there, because I want to know which lines occurred two or more times.
Then I sort them (you can remove the sort line if you want).

Check it out. :D

Regards, Bego

Edited by Mofi: Macro source code removed - see below for the improved version.
Bego
Master
Posts: 357
Joined: Wed Nov 24, 2004 12:00 am
Location: Germany

Re: How do I remove duplicate lines?

Postby hamilton » Mon Dec 05, 2005 6:51 pm

This macro gets better and better. I'm glad there are some helpful people out there :)

Thank you

Re: How do I remove duplicate lines?

Postby Mofi » Fri Dec 09, 2005 12:54 pm

Thanks Bego for the idea to collect the duplicate lines as additional info and for the information that IfFound and IfNotFound can also be used after a replace. That was new to me although I have written dozens of UltraEdit macros. Even an experienced user like me can learn from others. Thanks again.

I have modified the macro again. It now also works for files whose lines contain UltraEdit style regular expression characters, and it does not need a second macro as I first thought would be necessary. It could now also be converted to a macro with Unix style regular expressions instead of UltraEdit style. Only 5 simple regular expressions must be changed for Unix style.

The replace command that removes the duplicate lines is now case-sensitive. Remove the MatchCase parameter if it should ignore case.

The collection of the duplicate lines is done now with clipboard 8, which improves execution speed a lot. The duplicate lines are sorted. If someone wants this macro without collecting the duplicate line info, remove the red colored lines.

This macro is now added to my private collection of useful macros - see sticky forum topic Macro examples and reference for beginners and experts which contains a macro file with the macros DelDupInfo+ (macro below) and DelDupInfo- (macro below without the red lines).

The macro property Continue if a Find with Replace not found or Continue if search string not found must be checked for this macro.


InsertMode
ColumnModeOff
HexOff
UnixReOff
Bottom
IfColNum 1
Else
"
"
EndIf
Top
Find RegExp "%^([~^p]^)"
Replace All "#MOFI_RULES#^1"
Clipboard 8
ClearClipboard

Clipboard 9
Loop
Find RegExp "%#MOFI_RULES#*$"
IfNotFound
ExitLoop
EndIf
Cut
Find MatchCase "^c^p"
Replace All ""
IfFound
Paste
Find Up "#MOFI_RULES#"
Key HOME
Clipboard 8
Find RegExp "%#MOFI_RULES#*^p"
CopyAppend
EndSelect
Key HOME
Clipboard 9
Else

Paste
Key DOWN ARROW
Key HOME
EndIf
EndLoop
ClearClipboard
Top
Find RegExp "%#MOFI_RULES#"
Replace All ""
NewFile
Clipboard 8
Paste
ClearClipboard
Top
Find RegExp "%#MOFI_RULES#"
Replace All ""
IfNotFound
"NO DUPLICATES :-)
"
Else
SortAsc 1 -1 0 0 0 0 0 0
EndIf
NextWindow

Clipboard 0


Add UnixReOn or PerlReOn (v12+ of UE) at the end of the macro if you do not use UltraEdit style regular expressions by default - see search configuration. Macro command UnixReOff sets the regular expression option to UltraEdit style.
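
The overall behaviour of the macro above (remove later repeats of each non-empty line and collect every line that had repeats into a sorted report) can be sketched in Python as follows. This is an illustration under assumptions, not the macro's implementation: dedupe_with_report and its match_case parameter are hypothetical names, the marker string and the clipboards are replaced by ordinary data structures, and Python's default sort order may differ from what SortAsc produces.

```python
def dedupe_with_report(lines, match_case=True):
    """Return (kept_lines, sorted_report_of_lines_that_had_duplicates).

    match_case=True corresponds to the MatchCase parameter in the macro;
    match_case=False ignores case when comparing lines.
    Empty lines are never treated as duplicates, as in the macro.
    """
    def key(s):
        return s if match_case else s.lower()

    first_seen = {}   # comparison key -> first occurrence of the line
    reported = set()  # keys already added to the report
    kept = []
    report = []
    for line in lines:
        if line == "":
            kept.append(line)
            continue
        k = key(line)
        if k not in first_seen:
            first_seen[k] = line
            kept.append(line)
        elif k not in reported:
            # Each duplicated line is reported once, like CopyAppend to clipboard 8.
            reported.add(k)
            report.append(first_seen[k])
    return kept, sorted(report)


if __name__ == "__main__":
    kept, report = dedupe_with_report(["b", "a", "", "b", "A", "a", ""])
    print(kept)
    print(report if report else ["NO DUPLICATES :-)"])
```

Using a dictionary lookup instead of a search per line is a deliberate swap: it gives the same result for exact matches but avoids scanning the rest of the file for every line.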


Edit info: Some comments added - see below!

This macro will not work on Unix files opened in Unix mode. Convert the file to DOS, either temporarily (on file load) or permanently, before running the macro (^p matches CRLF!).
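
The reason can be illustrated outside UltraEdit: a search for "line text followed by CR+LF" finds nothing in data that uses LF only. The Python sketch below shows the mismatch and a conversion to DOS line endings; to_dos_line_endings is a hypothetical helper, not an UltraEdit function.

```python
def to_dos_line_endings(data: bytes) -> bytes:
    """Normalize LF, CR and CRLF line endings to CRLF."""
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", b"\r\n")


unix_text = b"alpha\nbeta\nalpha\n"
needle = b"alpha\r\n"  # line text followed by CR+LF

hits_before = unix_text.count(needle)                      # no match in LF-only data
hits_after = to_dos_line_endings(unix_text).count(needle)  # both lines match after conversion

if __name__ == "__main__":
    print(hits_before, hits_after)
```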

The macro is designed to remove duplicate lines only if a line matches another line 100%. If two lines look identical on screen but differ in their trailing spaces, neither line is removed or reported. If you want to ignore trailing spaces and can delete them, add the command TrimTrailingSpaces at the top of the macro after the command Top.
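
A small Python sketch of the effect of trimming before comparing. The helper name trim_trailing_spaces is chosen for illustration, and stripping tabs as well as spaces is an assumption about what TrimTrailingSpaces removes.

```python
def trim_trailing_spaces(lines):
    """Remove trailing spaces and tabs from every line (assumed trim behaviour)."""
    return [line.rstrip(" \t") for line in lines]


a = "Line example"
b = "Line example "  # same text plus one trailing space

same_before = a == b  # False: the lines differ by the trailing space
trimmed = trim_trailing_spaces([a, b])
same_after = trimmed[0] == trimmed[1]  # True: identical after trimming

if __name__ == "__main__":
    print(same_before, same_after)
```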

2007-11-01: The macro has been completely rewritten because it damaged the file when soft-wrapped lines were present. The new macro also works for a file with soft-wrapped lines. IfEof has also been eliminated so that the macro works on Unicode files too, independent of the version of UltraEdit. IfEof works for Unicode files only since UE v13.20.

Re: How do I remove duplicate lines?

Postby Mofi » Fri Dec 09, 2005 3:24 pm

Bego: Thanks again for this interesting information (deleted). I will take this into consideration for future macros (see improved version of the macro above).

The "#MOFI_RULES#" string is used as a replacement for the regular expression character % (start of line) so that lines like these, which differ only in preceding and trailing spaces, are handled correctly without a regular expression:

Line example
 Line example
 Line example
Another Line example


Running the macro on these 4 lines should change nothing. The third line contains a trailing space, the second line does not! Select lines 2 and 3 and you will see the difference.
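
Why a start-of-line marker is needed for a plain-text search can be shown with these 4 lines in Python. This is only a sketch of the idea: the line content is taken from the example above (with the trailing space on line 3), and the counting is done with ordinary string search rather than UltraEdit's Find.

```python
MARKER = "#MOFI_RULES#"  # sentinel string from the post; assumed absent from the file

text = "Line example\n Line example\n Line example \nAnother Line example\n"
target = "Line example\n"  # first line plus its line ending

# Plain search: the target also matches the tail of lines 2 and 4,
# so a "replace all" would damage those lines.
plain_hits = text.count(target)

# Prefix every non-empty line with the marker, then search for marker + target.
# Only a line that starts with exactly this text can match now.
marked = "".join(
    (MARKER + line if line else line) + "\n" for line in text.splitlines()
)
marked_hits = marked.count(MARKER + target)

if __name__ == "__main__":
    print(plain_hits, marked_hits)
```

With the marker in place only the first line itself matches, which is the behaviour a % anchor would give in a regular expression search.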

PS: Hopefully nobody uses this macro on a file with lines which already contain the "#MOFI_RULES#" string. That could lead to wrong results.

Re: How do I remove duplicate lines?

Postby Mofi » Thu Nov 01, 2007 4:12 pm

Hi sas2000,

thanks for your uedit32.ini. With your configuration I was able to reproduce the problem and find the reason why the macro failed and created a wrong output (= damaged file).

The problem was the soft-wrapping you have enabled. I normally keep soft-wrapping of lines turned off. I usually enable it only when editing HTML files, and never while running macros or scripts.

I did not know how many macro commands depend on whether wrapping mode is on or off. Key HOME, Key END and SelectLine, which I used in the previous version of this macro to select a line with or without its line ending, always act on the currently displayed line, which is not the entire real line if the line is soft-wrapped. As a result, the previous macro worked perfectly until it reached the first soft-wrapped line.
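
To picture the failure, here is a crude Python simulation. It assumes fixed-width wrapping, whereas UltraEdit wraps at word boundaries, and the names display_rows and the width of 12 are hypothetical. The point is only that a HOME-to-END selection covering one display row does not cover the whole real line.

```python
def display_rows(line, width):
    """Split one real line into fixed-width display rows (crude soft wrap)."""
    if not line:
        return [""]
    return [line[i:i + width] for i in range(0, len(line), width)]


real_line = "key=" + "v" * 26          # one real line, 30 characters
rows = display_rows(real_line, 12)     # shown as three rows when wrapped at 12 columns

selected = rows[0]                     # what a HOME..END selection would cover
covers_whole_line = selected == real_line

if __name__ == "__main__":
    print(len(rows), covers_whole_line)
```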

Additionally, you have the option Replace All is From Top of File active, as you can see in the Replace dialog, which makes the output even worse.

I have completely rewritten the macro to get correct output(s) also when soft-wrapped lines exist.

I have already deleted all of our previous posts. You can delete now the zip archives and files on your website.

Now that I have seen what happens when a macro that is not designed for active word-wrap mode is run on soft-wrapped lines, I also have to update the macros DelDupInfo- and DelDupInfo+ in my macro collection and add many notes to my macro reference. But first I have to find out which macro commands work differently depending on whether word-wrap mode is on or off. That will take some time.

sas2000, many thanks!

Re: How do I remove duplicate lines?

Postby sas2000 » Thu Nov 01, 2007 6:09 pm

2Mofi

It works ok now :D.

Having read your macro, I think no such commands exist, but do you know of any macro commands to switch soft-wrapping and Replace All is From Top of File on or off? My knowledge of macros is quite limited, and this would help me avoid problems in my own macros. I've tried:

SoftWrapOff
WrapOff
WrapWordOff
WordWrapOff

but none of them work. Can you help me?

Thanks. :!:
sas2000
Newbie
Posts: 9
Joined: Sat Aug 05, 2006 11:00 pm

Re: How do I remove duplicate lines?

Postby Mofi » Fri Nov 02, 2007 6:48 am

Except for the active regular expression engine, none of the configuration settings can be changed by a macro or script. I have already written twice to IDM support that the replace option Replace All is From Top of File should be temporarily disabled internally while a macro or script is running, to make the output predictable. For scripts this is the case since UE v13.20, for macros since UE v13.20a.

