regular expression to remove duplicates and keep the order

Postby estado3 » Mon Jan 07, 2008 12:09 pm

I do not want to use the sort/advanced sort function of UltraEdit because I do not want to lose the line order. Is there a regular expression to remove duplicates, or any other way to remove duplicates while keeping the order/sequence the same?

Postby Mofi » Mon Jan 07, 2008 12:35 pm

What about the topics "How do I remove duplicate lines?" or "Special Case: Remove Duplicates TOTALLY, not just one"?

The macro property "Continue if a Find with Replace not found" or "Continue if search string not found" must be checked for this macro.

Postby estado3 » Mon Jan 07, 2008 5:37 pm

I do not want to remove every occurrence; I would like to keep one copy of each duplicated line. Also, the lines containing the duplicates contain $.
How would I amend the macros to do this?

Postby Mofi » Tue Jan 08, 2008 8:08 am

You have not read "How do I remove duplicate lines?", or at least you have not read it carefully enough. The first occurrence of a duplicate line remains, and it does not matter if the lines contain regular expression characters, because the improved macro uses non-regular-expression searches/replaces.

The final macro version as posted in the linked topic is also ready to use in the macro file in the ZIP archive you can download from "Macro examples and reference for beginners and experts". Macro DelDupLineInfo+ deletes the duplicates and creates a report; macro DelDupLineInfo- deletes just the duplicate lines without creating a report.

Please read more carefully next time. I don't like writing the same thing twice.

Postby estado3 » Tue Jan 08, 2008 12:42 pm

I have tried both DelDupLineInfo+ and DelDupLineInfo-. They are both exceptionally slow: after an hour the macro was only through 1,000 lines, and the file has over 300,000 lines!

Is there a way to simply randomise the order after an ascending/descending sort? It seems I may be forced to use that option.

Postby pietzcker » Tue Jan 08, 2008 2:55 pm

This sounds more like a job for a Perl or Python script. After all, what you're asking for may be conceptually trivial, but it is computationally demanding: every single line has to be checked against all following lines (up to 300,000), and then every duplicate has to be removed. How long are your lines?
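
To put a rough number on that: if every line is compared with all following lines and there are no duplicates at all, n lines need n(n-1)/2 comparisons. A small sketch of the arithmetic (the function name is only illustrative):

```python
def worst_case_comparisons(n_lines):
    """Number of line pairs when each line is compared with every later line."""
    return n_lines * (n_lines - 1) // 2


# Worst case for a 300,000-line file without any duplicates.
print(worst_case_comparisons(300000))  # 44999850000
```

That is about 45 billion string comparisons in the worst case, which is why a pairwise approach gets slow on files of this size.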

Postby estado3 » Tue Jan 08, 2008 3:25 pm

An average of 70 characters.

Postby pietzcker » Tue Jan 08, 2008 4:01 pm

OK, this is a quick-and-dirty program with hardly any error checks, and it won't handle Unicode files. But I've tried it on a 66,000-line XML file, which was reduced to 12,000 lines within one minute on my laptop. The more duplicates there are, the longer it takes. It works with Python 2.5; I haven't tested it with other versions.

Code:
# -*- coding: iso-8859-1 -*-
# Removes duplicate lines, keeping the first occurrence of each line
# and the original line order.

# Put the input file (here called test.txt - rename as required)
# in the same directory as the script.
in_file = open("test.txt", "r")
lines = in_file.readlines()
in_file.close()

counter = 0
while counter < len(lines):
    testline = lines[counter]
    while True:
        try:
            # Look for the next copy of testline after the current position.
            x = lines.index(testline, counter + 1)
        except ValueError:
            break  # no further duplicates of this line
        lines.pop(x)
    counter += 1

out_file = open("output.txt", "w")  # overwrites output.txt if it exists
out_file.writelines(lines)
out_file.close()
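
For comparison, here is a sketch of a different technique from the program above: instead of searching the rest of the list for every line, it remembers the lines already seen in a set, so each line is looked at only once. It assumes the whole file fits in memory and Python 2.6 or later (for the with statement); the function names dedupe_keep_order and dedupe_file are made up for this example, and it has not been timed against the 300,000-line file from this thread.

```python
def dedupe_keep_order(lines):
    """Return lines with later duplicates dropped; first occurrences keep their order."""
    seen = set()      # lines encountered so far
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


def dedupe_file(in_path, out_path):
    """Read in_path and write its de-duplicated lines to out_path (overwrites it)."""
    with open(in_path, "r") as src:
        lines = src.readlines()
    with open(out_path, "w") as dst:
        dst.writelines(dedupe_keep_order(lines))
```

With the file names used above, the call would be dedupe_file("test.txt", "output.txt"). Like the original program, this compares whole lines including the line ending, so a last line without a trailing newline will not match an otherwise identical earlier line.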


HTH,
Tim