Sort on duplicate value contained in following row


Postby pjoyce » Wed Jan 16, 2008 4:25 pm

I'm not sure how to easily describe this request but I'm trying to find an easy way to identify/highlight duplicate values that appear after/before the current value.

In the example below, both 62500388 and 62500394 appear more than once in the first column, so I would like all rows with those values highlighted (perhaps with a prefix so the data could be re-sorted).

62500386;62500386
62500387;62500387
62500388;62500217
62500388;62500388
62500388;62500587
62500391;62500389
62500391;62500391
62500392;62500392
62500393;62500393
62500394;62073500
62500394;62500394
62500395;62073600

Notes:
- The values in both the first and second columns are not fixed length
- A check for duplicates is only needed based on the values in the first column

If anyone has any suggestions, I'd appreciate the help. Thanks in advance.

Re: Sort on duplicate value contained in following row

Postby mjcarman » Wed Jan 16, 2008 8:56 pm

There's no way to make UE highlight them (as in syntax highlighting). You could use find or find/replace to detect/mark them though. The following uses Perl regular expressions. (So make sure you have the "Regular Expressions" box checked in the Find dialog and have selected Perl-compatible regexps in the configuration.)

To find them search for ^(\d+);\d+\s+\1

To mark them search for ^((\d+);\d+\s+)\2
and replace with *$1*$2

Replace the '*' chars with whatever (literal) text you want to use to mark them.
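
For anyone who wants to try the "mark" pattern outside UE, here is a minimal Python sketch. The names `SAMPLE`, `PAIR` and `mark_pairs` are made up for illustration, and it assumes Python's `re` module treats this pattern the same way UE's Perl engine does.

```python
import re

# A few rows from the original question; the first column is the key.
SAMPLE = """62500386;62500386
62500387;62500387
62500388;62500217
62500388;62500388
62500388;62500587
62500391;62500389
62500391;62500391
62500392;62500392
"""

# Same pattern as the "mark" search above.
# re.MULTILINE lets ^ match at the start of every line.
PAIR = re.compile(r"^((\d+);\d+\s+)\2", re.MULTILINE)

def mark_pairs(text, marker="*"):
    # Mirrors the *$1*$2 replacement; a callable is used so the
    # marker text never needs escaping.
    return PAIR.sub(lambda m: marker + m.group(1) + marker + m.group(2), text)

marked = mark_pairs(SAMPLE)
print(marked)
```

In this sketch the first two 62500388 rows and both 62500391 rows get the marker, while the third 62500388 row does not, because each match consumes only a pair of rows. Also, since the values are not fixed length, the backreference could match a key that is merely a prefix of the next row's key (for example 6250 followed by 62500), so the hits are worth a quick check.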

Re: Sort on duplicate value contained in following row

Postby pietzcker » Wed Jan 16, 2008 9:26 pm

It would have been a good idea to read the sticky first and answer the questions there. What I'd need to know: What do you want to do with the results? Do you want to delete lines that start with identical numbers? What UE version are you using?

The following Perl style regex (UE >= V12) will find all adjacent lines that start with the same characters (up to the first ; ):

^([^;]+);([^\r\n]+\r\n)(?:\1;[^\r\n]+\r\n)+

It looks a little strange but it works in UE (it's a workaround for the bug described in viewtopic.php?t=4683)

You could then replace with \1;\2 in order to remove the duplicates.

But maybe you want something else done?
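
As a rough check outside UE, the same search and replace can be sketched in Python. The names `SAMPLE`, `RUN` and `keep_first_of_each_run` are hypothetical, and the sample uses CRLF line endings because the pattern matches `\r\n` literally; a file with Unix line endings would need the pattern adjusted.

```python
import re

# Rows with Windows (CRLF) line endings, as the pattern expects.
SAMPLE = (
    "62500387;62500387\r\n"
    "62500388;62500217\r\n"
    "62500388;62500388\r\n"
    "62500388;62500587\r\n"
    "62500391;62500389\r\n"
    "62500391;62500391\r\n"
    "62500392;62500392\r\n"
)

# Group 1: key up to the first ';'. Group 2: rest of the first line.
# The non-capturing group swallows every following line with the same key.
RUN = re.compile(r"^([^;]+);([^\r\n]+\r\n)(?:\1;[^\r\n]+\r\n)+", re.MULTILINE)

def keep_first_of_each_run(text):
    # Same replacement as \1;\2 above: only the first line of each run survives.
    return RUN.sub(r"\1;\2", text)

deduped = keep_first_of_each_run(SAMPLE)
print(deduped)
```

Unlike a pairwise match, this handles runs of any length, so the three 62500388 rows collapse to one.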

Re: Sort on duplicate value contained in following row

Postby pjoyce » Wed Jan 16, 2008 9:46 pm

mjcarman and pietzcker: Many thanks for both of your responses. I'm at home right now but will try your suggestions when I get to work tomorrow.

pietzcker: Apologies for not reading the sticky, I should have done so. Ultimately the duplicate values just need to be found among the 400,000 lines of data and then used elsewhere. So finding the duplicates and adding a prefix of some kind would be a big help.

Again, I will try both suggestions tomorrow. Thanks for your help, I appreciate it.

Re: Sort on duplicate value contained in following row

Postby mjcarman » Thu Jan 17, 2008 3:18 pm

pietzcker's use of "\r\n" is more robust than my use of "\s" (though it doesn't appear to matter for your data).

pietzcker, I had missed that other topic. I'm glad to see a way to match a newline. It would be nice if "\n" just worked, but it doesn't surprise me that it doesn't. I had tried "\r\f" and even "\015\012", but oddly neither works. I hadn't thought to try "\r\n".

Re: Sort on duplicate value contained in following row

Postby pietzcker » Fri Jan 18, 2008 7:34 am

pjoyce wrote: So finding the duplicates and adding a prefix of some kind would be a big help.


OK. I guess this is something for a macro.

I'm not very good at UE macros, and their behavior often puzzles me. What I have found to work (in UE V13) is:

InsertMode
ColumnModeOff
HexOff
PerlReOn
Top
Find RegExp "^([^;#]+);([^\r\n]+\r\n)(?:\1;[^\r\n]+\r\n)+"
Find RegExp "^"
Replace All SelectText "###"

The macro property "Continue with macro after search and replace not found" must be unchecked, and the macro must be "run multiple times", checking the option "run until end of file".

I had first tried to write a macro that would only have to be run once, using a loop. However, it didn't work. What I had written was:

InsertMode
ColumnModeOff
HexOff
PerlReOn
Top
Loop
Find RegExp "^([^;#]+);([^\r\n]+\r\n)(?:\1;[^\r\n]+\r\n)+"
Find RegExp "^"
Replace All SelectText "###"
Key HOME
EndLoop

But somehow, the loop only runs once, and I have no idea why.

Re: Sort on duplicate value contained in following row

Postby Mofi » Fri Jan 18, 2008 8:29 am

Very interesting. Your loop macro should work with the macro property "Continue if search string not found" unchecked. I added 3 lines to make the macro independent of this property, and suddenly the loop works. I will run further tests. Maybe IDM has built into UltraEdit that a loop without a number and without an ExitLoop command runs only once, to avoid an endless loop when the property "Continue if search string not found" is set. I will send an email to IDM and ask for clarification on this issue.

InsertMode
ColumnModeOff
HexOff
PerlReOn
Top
Loop
Find RegExp "^([^;#]+);([^\r\n]+\r\n)(?:\1;[^\r\n]+\r\n)+"
IfNotFound
ExitLoop
EndIf

Find RegExp "^"
Replace All SelectText "###"
Key HOME
EndLoop

I received the answer from IDM on 2008-01-18 (but edited this post 2 days later). There really is a simple protection mechanism against endless loops: a loop without a loop number and without an ExitLoop command is always executed only once. I have added this new information to my macro reference file.
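
For comparison outside UltraEdit, the result the macros in this thread aim for (a ### prefix on every row of a run that shares its first-column value) can be sketched in Python. This is not a translation of the macro, just one way to get the same marking; `prefix_duplicate_runs` and `ROWS` are hypothetical names, and the sketch assumes rows with equal keys are adjacent, as in the sample data.

```python
from itertools import groupby

def prefix_duplicate_runs(lines, marker="###"):
    """Prefix every line whose first-column value also appears on a neighbouring line."""
    out = []
    # groupby collects consecutive lines that share the text before the first ';'.
    for _, run in groupby(lines, key=lambda line: line.split(";", 1)[0]):
        run = list(run)
        # Runs of two or more lines are duplicates, whatever their length.
        out.extend(marker + line if len(run) > 1 else line for line in run)
    return out

ROWS = [
    "62500387;62500387",
    "62500388;62500217",
    "62500388;62500388",
    "62500388;62500587",
    "62500393;62500393",
    "62500394;62073500",
    "62500394;62500394",
    "62500395;62073600",
]

result = prefix_duplicate_runs(ROWS)
print("\n".join(result))
```

Because whole runs are grouped before marking, triplicates and longer repetitions are prefixed in a single pass, with no loop over repeated searches.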

Re: Sort on duplicate value contained in following row

Postby pietzcker » Fri Jan 18, 2008 9:58 am

Thanks Mofi! Sounds like a reasonable explanation.

Re: Sort on duplicate value contained in following row

Postby mjcarman » Fri Jan 18, 2008 2:12 pm

pietzcker wrote: I guess this is something for a macro.

You're making it more complicated than it needs to be. The search/replace pair in my first post already adds a marker prefix.

Re: Sort on duplicate value contained in following row

Postby pietzcker » Fri Jan 18, 2008 6:22 pm

Well, yes, but it only works for duplicate lines, not for triplicates and higher repetitions.

