Removing lines with partial duplicate content with spaces

Find, replace, find in files, replace in files, regular expressions

Removing lines with partial duplicate content with spaces

Postby ScottDaMan » Tue May 08, 2007 9:27 pm

I feel like I am making a duplicate topic on the subject but I have done extensive searching and I cannot figure out how to get this to work.

The closest topic I could find that is related (and has a macro posted by Mofi) that doesn't work for me is: Delete line with partially duplicate content

I'm working with a city/state CSV that has multiple duplicate cities that need to be removed. Here's a sampling:

Hamilton,AK,Wade Hampton (CA),62.8961111,-163.8941667
Hamilton Acres,AK,Fairbanks North Star,64.8469444,-147.6705556
Hammerhead,AK,Aleutians West (CA),51.8486111,-176.6416667
Happy Valley,AK,Kenai Peninsula,59.9472222,-151.7322222
Happy Valley,AK,Aleutians West (CA),51.8472222,-176.6541667
Harding Lake,AK,Fairbanks North Star,64.4216667,-146.8508333
Hawk Inlet,AK,Skagway-Hoonah-Angoon (CA),58.1258333,-134.7530556
Haycock,AK,Nome (CA),65.2097222,-161.1655556
Healy,AK,Denali,63.8569444,-148.9661111
Herendeen Bay,AK,Aleutians East,55.7680556,-160.6858333
Herring Cove,AK,Ketchikan Gateway,0,0
Herring Cove,AK,Ketchikan Gateway,55.325,-131.5222222
Highland Park,AK,Fairbanks North Star,64.7569444,-147.3605556
Hobart Bay,AK,Skagway-Hoonah-Angoon (CA),57.4658333,-133.3533333

In this case, I need these removed:
Happy Valley,AK,Kenai Peninsula,59.9472222,-151.7322222
Herring Cove,AK,Ketchikan Gateway,0,0

They are duplicates. We aren't worried so much about county and geo locations. But anything with city,state that matches exactly, I'd like removed so that there is only one.

Thanks.
User avatar
ScottDaMan
Newbie
 
Posts: 3
Joined: Mon May 07, 2007 11:00 pm

Re: Removing lines with partial duplicate content with space

Postby pietzcker » Wed May 09, 2007 5:47 am

Again, this seems to call for regular expressions. First, you need to activate Perl style regexps in the options dialog (search/regular expressions). Then use the replace command (activating regular expressions in the replace dialog)

Code: Select all
Search string:
^([\w\s]+?,\w+?)(,.+)(\r?\n\1.*?)+$

Replace string:
\1\2

This should find two or more adjacent lines that start with the same city/state combination. It works on your example; if the rest of the data are consistent, it should work with them, too (hopefully).

HTH,
Tim
User avatar
pietzcker
Master
Master
 
Posts: 241
Joined: Sun Aug 22, 2004 11:00 pm

Re: Removing lines with partial duplicate content with spaces

Postby ScottDaMan » Wed May 09, 2007 7:17 am

This code consistently errors out my application after setting the perl regexp option then running a replace all on this large 163,000 line file.
User avatar
ScottDaMan
Newbie
 
Posts: 3
Joined: Mon May 07, 2007 11:00 pm

Re: Removing lines with partial duplicate content with spaces

Postby Bego » Wed May 09, 2007 7:19 am

Scott: Did you update to 13.00a+2 ?

Very well done Tim

Just 2 minor comments:
- the data has to be sorted
- if there are more than 2 "same lines", you have to run the replace more than 1 time.

One question: Why is the os-neutral \n not enough? I see we need \r?\n to get it going
but I don't really see why \n alone does not work.

rds Bego
User avatar
Bego
Master
Master
 
Posts: 357
Joined: Wed Nov 24, 2004 12:00 am
Location: Germany

Re: Removing lines with partial duplicate content with space

Postby pietzcker » Wed May 09, 2007 1:14 pm

You're right:

The data has to be sorted. Or at least duplicate lines have to be adjacent.

About running it more than once: Yes, I see this behavior in UE, but it shouldn't be that way. The + in (\r?\n\1.*?)+$ should catch any multiple adjacent lines that start with the same two CSV fields. It seems that in UE the $ overrides the greediness of the + :( In RegexBuddy (where I "crafted" the regex) and in Python I get the expected behavior.

About \r? --> I use that because newline handling differs between applications. UE regex seems to require \r\n, Python only accepts \n (or \r?\n). I have even found now that (in UE) I have to use \r\n without the ? to make it work right. Otherwise, the replace will introduce a surplus \r (0D 0D 0A). I don't know why that should happen...seems like a(nother) bug of UE's regex engine.

I'm still waiting for them to fix positive lookaround...

If UE still crashes on your large file after update to 13.00a+2 I'd suggest Python (six line program):

Code: Select all
# -*- coding: iso-8859-1 -*-
import re
result = re.sub(r"(?m)^([\w\s]+?,\w+?)(,.+)(\n\1.*?)+$", r"\1\2", open("input.csv","r").read())
output = open("output.csv","w")
output.write(result)
output.close()

In Perl, you could probably do that in one line :)
User avatar
pietzcker
Master
Master
 
Posts: 241
Joined: Sun Aug 22, 2004 11:00 pm

Re: Removing lines with partial duplicate content with space

Postby pietzcker » Wed May 09, 2007 6:28 pm

I've just checked a few things. First, Perl regexes in UE do require \r\n if you want to match a newline in a DOS file. I don't think that that's an error; from what I have gathered, other regex implementations, especially Perl, will handle newlines the same way.

Then, I think I've located the bug that leads to \r being duplicated on a replace. Normally, . (dot) doesn't match a newline, so under Windows, it also shouldn't match \r. In UE, it does match \r under certain circumstances.

If the regex ^(\w+)(.+)(\r?\n\1.*)+$ (find all adjacent lines that start with the same word) is applied to
Code: Select all
One line
One more line
One line again
Another line
Another line too
Another different line

and told to replace with \1\2 (i.e., replace with just the first of those lines), it should result in
Code: Select all
One line
Another line

but instead, you get:
Code: Select all
One line\r
One line again
Another line\r
Another different line

So \2 contains the \r that the underlying expression (.+) shouldn't have matched and it finds only pairs of lines even though it should have found all three.

The first bug can be worked around by using \r\n instead of \r?\n, but the second one remains. Well, I'll write to IDM...

Tim
User avatar
pietzcker
Master
Master
 
Posts: 241
Joined: Sun Aug 22, 2004 11:00 pm

Re: Removing lines with partial duplicate content with space

Postby ScottDaMan » Thu May 10, 2007 9:54 pm

Thanks guys, I upgraded my software and pietzcker's code worked great (after heavy pause, lol).

Thankfully, I had already alphabetized the entire list by state then city in excel (after splitting it to get around the 655++ line limit). I also do not see any duplicates listed more than twice so it all worked out.

This was a huge huge huge time saver. Thanks for helping me remove 10,000+ junk lines.

Update: I had to run this through the CSV like 4 times before I finally got the search string not found error.

Final tally: 14,410 lines of junk removed dumping 0.77MB of junk text.

Thanks again!
User avatar
ScottDaMan
Newbie
 
Posts: 3
Joined: Mon May 07, 2007 11:00 pm

Re: Removing lines with partial duplicate content with space

Postby Bego » Wed May 16, 2007 5:19 pm

Hi Tim !

Good job. Pls keep me/us up to date on that issue.

:-) Bego
User avatar
Bego
Master
Master
 
Posts: 357
Joined: Wed Nov 24, 2004 12:00 am
Location: Germany

Re: Removing lines with partial duplicate content with space

Postby mjcarman » Thu Jan 17, 2008 3:38 pm

pietzcker wrote:Perl regexes in UE do require \r\n if you want to match a newline in a DOS file. I don't think that that's an error; from what I have gathered, other regex implementations, especially Perl, will handle newlines the same way.


Well, yes and no. In Perl, "\n" is a little bit magical. It's a placeholder for a logical (platform) newline. Most of the time -- including in regexes -- you can pretend that "\n" is LF on Unix and CRLF on Windows. Internally "\n" is really a single character (which happens to be LF). The magic is provided by a layer that translates between internal and platform notation when reading from or writing to a file.

UE embeds a third-party engine to provide the Perl-compatible regexes. Without the IO layer magic -- which UE isn't providing -- "\n" only matches LF. I'm going to send UE a request to make "\n" match the newline character(s) for the file being searched. That will make Perl-compatible regexes more Perl compatible. :D
User avatar
mjcarman
Power User
Power User
 
Posts: 125
Joined: Thu Feb 10, 2005 12:00 am

Re: Removing lines with partial duplicate content with spaces

Postby pietzcker » Tue Apr 29, 2008 2:03 am

Good news: I have just received a test version of UE 14.00a+2, and the bug that would introduce erroneous \r characters in the above replace operation seems to have been fixed. Also, positive lookaround is working correctly in that version. The limitation of greedy quantifiers that won't span newlines if they can avoid it is still there, though. This means that if you have three or more lines in a row to be replaced, you'll have to run the replace operation a couple of times (until you get the error message "search string not found!"). As of April 29th, V14.00a+2 isn't available for download yet, but it should be soon.
User avatar
pietzcker
Master
Master
 
Posts: 241
Joined: Sun Aug 22, 2004 11:00 pm


Return to Find/Replace/Regular Expressions