Find lines in CSV file with less or more than X tabs within a line

Find, replace, find in files, replace in files, regular expressions

Find lines in CSV file with less or more than X tabs within a line

Postby mystikal2k2a » Thu Jun 21, 2012 9:50 am

Hi

I am new to UltraEdit macros and I'm getting familiar with the syntax but struggling to find a count like function?

I have a file containing 1.8 million records with 329 coloums. The records in the file are all tab delimited.

Each record should have 328 tab delimiters but I know some records contain more than this number.

I want a macro that will count the number of tab delimiters and locate me all the records that have more than 328 tab delimiters inside the record.

Any help would be much appreciated.

Thank you
mystikal2k2a
Newbie
 
Posts: 2
Joined: Thu Jun 21, 2012 9:43 am

Re: Find lines in CSV file with less or more than X tabs within a line

Postby Mofi » Thu Jun 21, 2012 2:54 pm

There is no need for a macro. Just press Ctrl+F to open the Find dialog. Enter or copy/paste as string to find ^(?:.*?\t){329} and enable the option Regular Expressions: Perl.

Press button Advanced to open the advanced find options pane and make sure the Perl regular expression engine is selected as required. Check additionally the option List Lines Containing String. Run the Find with click on button Next.

A window opens listing all lines with at least 329 tab characters. You can copy this list to clipboard and paste it into a new file or double click on the listed lines to jump to this line for making corrections.

I tested the expression before posting with UE v18.10.0.1014 on a quickly created file with 4 lines with two lines having 5 tabs, one having 6 tabs and one having 7 tabs and using ^(?:.*?\t){6} The list contained the 2 lines with more than 5 tabs.

Explanation of the expresssion:

^ ... start every search at beginning of a line.

(?:...) ... is a non marking group. Grouping is supported only by the Perl regular expression engine. ?: immediately after opening round bracket tells the Perl regular expression engine not to mark (or tag) the string found by the expression inside the round brackets for being back referenced in search or replace string with $1 respectively \1. Such special modifiers as ?: immediately after ( are documented not in help of UltraEdit about Perl regular expression as the explanation of all those modifiers fills books and are for experts only.

.*?\t ... find 0 or more characters of any value except line ending characters followed by a tab character. The question mark after the asterisk makes the expression non greedy (= match as less characters as possible to get true for this expression).

{6} ... the previous expression which is the non marking group finding something or nothingtab must match 6 times. If the line contains less tabs, the result of the find is false for this line.
User avatar
Mofi
Grand Master
Grand Master
 
Posts: 4049
Joined: Thu Jul 29, 2004 11:00 pm
Location: Vienna

Re: Find lines in CSV file with less or more than X tabs within a line

Postby mystikal2k2a » Fri Jun 22, 2012 9:21 am

Hi

Thank you for your reply and explanination. I am beginning to understand regular expression syntax.

I did follow your instructions and I am using 17.30.0.1014 (trial version at the moment).

One last question please - I have found that my problem in the file is actually to do with a record that contains less than 328 delimiters. How can find the rows with this problem instead?

Thanks
mystikal2k2a
Newbie
 
Posts: 2
Joined: Thu Jun 21, 2012 9:43 am

Re: Find lines in CSV file with less or more than X tabs within a line

Postby Mofi » Fri Jun 22, 2012 10:02 am

Take a look on topic How to find line breaks in fields of a CSV file and remove it? which contains most likely the solution for your problem. I'm quite sure that the CSV file contains line breaks in data fields and the exporting application has not double quoted those fields or the importing application is poor coded and does not support line breaks in field strings.

The negative search for finding lines with less than x tabs is much more complicated. Based on what pietzcker posted at How to delete all lines NOT containing specific word or string or expression? the expression to use to find and select lines with less than 328 tabs is:

^(?:(?!(?:.*?\t){328,}).)*$\r\n

Please make sure that Find What field does not contain other invisible characters like spaces and line breaks when using this expression. A common mistake on copying an expression from the browser window and pasting into Find What field is that the line termination is copied too.
User avatar
Mofi
Grand Master
Grand Master
 
Posts: 4049
Joined: Thu Jul 29, 2004 11:00 pm
Location: Vienna

Re: Find lines in CSV file with less or more than X tabs within a line

Postby onceayear » Thu Jul 19, 2012 9:46 am

Are you still looking for a solution to this?

I have an approach that is within UltraEdit, that will allow you to determine the line-number on which the number of tabs is a number that is an exception. The steps will be:

Make a copied work-file

Change all tabs to, say, tildes (all 09 becomes 7e) (this is just for visibility)

Delete (Find, and replace-with-nothing) all characters that are not tildes or line-delimiters; not a burdensome task if all you have are digits and decimal points; more of a bother if you do have letters also, but grit your teeth and you will be through it in about 12 minutes; perl at _this_ step would speed things up well, e.g.:

Using Perl for the Engine (not Unix, not UltraEdit):

In the "Find What:" box, put:
[A-Z]
In the "Replace With:" box, have nothing; and then Replace All
... gets rid of all uppercase-letters

In the "Find What:" box, put:
[a-z]
In the "Replace With:" box, have nothing; and then Replace All
... gets rid of all lowercase-letters

... etc. Onward until you get rid of (besides tildes and line-delimiters) whatever is in there.

Then do massive reductions of tildes, say, 50-tildes-plus-line-delimiter becomes line-delimiter; do this 6 times and you will have 1.8 million instances of:

^p and 28-tildes and ^p

and other instances of

^p and not-28-tildes and ^p

Count the instances of ^p and 28-tildes and ^p (call this Y) and then, where X is the number of lines in the file,

X - Y is the number of exceptional lines you must locate.

Then simply do a few searches for instances such as

^p and 29-tildes and ^p

or whatever variations-of-N for "300+N" you might want to check for, until any and all exceptional lines have been located.
onceayear
Newbie
 
Posts: 1
Joined: Fri Mar 09, 2012 3:03 pm

Re: Find lines in CSV file with less or more than X tabs within a line

Postby Mofi » Fri Jul 20, 2012 1:07 am

onceayear, some hints for you.

If option Match Case is not enabled in Find/Replace dialog, [A-Z] is equal [a-z] is equal [A-Za-z].

And it is possible to search with a negative character class. In Unix/Perl a search for all characters except ~ and line terminators is possible with [^~\r\n]. The character ^ immediately after [ means NOT. So this little expression means: find a character which is NOT a tilde and NOT a carriage return and NOT a line-feed.

But all that work as you have described is not really necessary. There are just 2 Perl regular expression Finds needed to find all lines with more than 328 tabs or less than 328 tabs.

Perl regular expression to find lines with more than 328 tabs in a CSV file: ^(?:[^\t\r\n]*\t){329,}.*\r\n

Perl regular expression to find lines with less than 328 tabs in a CSV file: ^(?!(?:[^\t\r\n]*\t){328}).*\r\n

If the CSV file uses a different character as separator, just the 4 \t in the 2 regular expression strings above must be replaced by the separator character.
User avatar
Mofi
Grand Master
Grand Master
 
Posts: 4049
Joined: Thu Jul 29, 2004 11:00 pm
Location: Vienna


Return to Find/Replace/Regular Expressions