Find lines containing a number stored in a list in another file


Postby yusufnohh » Mon Apr 22, 2013 1:34 am

Hi Mofi.
Thanks for the guidance so far.
Assume I have 50 million records with the field names shown below:

ACCOUNT|MOBILENO|NAME|ADD1|ADD2|ADD3|CITY|ZIP|STATE|NEWID|OTHERID|OLDID|REGDATE|STATUS|GENDER|BIRTHDATE|RACE|EMAIL
10100801|0192168465|MRS RAJESWARY A/P BHASKARAN|NO 42|KG PINANG||HULU BERNAM|35900|PERAK|871119085122||||18-SEP-09|Active|Female|19-NOV-87|Indian|
10100841|0192311363|CIK ZANEZA BINTI MOHAMMED ZAIKE|NO 16 JALAN 40 DESA JAYA KEPONG|||KUALA LUMPUR|51200|W PERSEKUTUAN KUALA LUMPUR|771002145314|||A3848390|18-SEP-09|Active|Female|02-OCT-77|Malay|
10102691|0193176085|MR MOHD ZAREMDEEN BIN MOHD ZAMAN|NO 44A 1ST FLOOR JLN TUN MOHD FUAD|SATU TMN TUN DR ISMAIL||KUALA LUMPUR|60000|W PERSEKUTUAN KUALA LUMPUR|701008035171|||A1618225|18-SEP-09|Active|Male|08-OCT-70|Malay|
10103091|0135333198|LOH KIEN SENG|102 JLN AMAN JELAPANG|||IPOH|30020|PERAK|700218085551|||A1557532|18-SEP-09|Active|Male|18-FEB-70|Chinese|
10104261|0133920594|PUAN HUSNA BINTI OSMAN|KUARTERS KLINIK KESIHATAN TEKEK|KAMPUNG TEKEK, PULAU TIOMAN||KUALA ROMPIN|26800|PAHANG|860111335584||||18-SEP-09|Active|Female|11-JAN-86|Malay|
10104911|0165342333| ENCIK.AZHARI BIN ABDULLAH|SLIM PANTAS ENTERPRISE|ESSO FILLING STATION|JALAN BESAR|SLIM RIVER|35800|PERAK|630627085615|||7076892|18-SEP-09|Active|Male|27-JUN-63|Malay|
10100631|0126360984| PUAN NOR HAYATI BINTI MUSA|434|JALAN MARGOSA 15|TAMAN BUKIT MARGOSA|AMPANGAN|70400|NEGERI SEMBILAN|650728055464|||A0208356|18-SEP-09|Active|Female|28-JUL-65|Malay|
10102841|0132878737| ENCIK MUHD SYAHAMIRUL EDININ BIN ABD HAMID|C-2-1 DANAU VILLA APT.|JALAN 5/23E TAMAN DANAU KOTA||KL|53200|W PERSEKUTUAN KUALA LUMPUR|721014125411|||A2308960|18-SEP-09|Active|Male|14-OCT-72|Malay|

What is the best method to search for many records (more than 10k) using the NEWID number as the unique key?
For example, for each ID in the list below I want to extract the complete record with all the fields shown above.

851021086081
730421045333
850828016469
781115065635
770630115226
800204065259
881218035189
790905035304
620225015308
730704035311
840501055627
870418295109
641016015136
870401055512
590225105958
600923055106
620912125998
760805055095
870506015314
890423145920
571010055531
800211025570
751003065190
691031025055
860203055125
790218085043
740713085280
610711085740
850328085775
880608016210
880817025308
800215085255

thanks ..cheers

Re: Find lines containing a number stored in a list in another file

Postby Mofi » Mon Apr 22, 2013 2:40 am

I offer 2 solutions:

1. Make a copy of the large file and run a Perl regular expression Replace All from the top of the file with the search string ^(?:[^\r\n|]*\|){9}([^\r\n|]+)\|.*$ and the replace string \1. This deletes everything from each line except the value of field 10, NEWID. For an explanation of the search string see "How to delete lines in CSV file if a certain value is found in defined data field?"

2. Use the script FindStringsToNewFileExtended.js with the search string \n(?:[^\r\n|]*\|){9}([^\r\n|]+) and $1 as the output format string. This solution ignores the first line because no line-feed precedes it, but that should be no problem as the first line is the header line. As explained in the readme of this script collection, you have to execute the script several times to get the final output because it processes only 800,000 lines per execution. With 50 million lines it is most likely better to use the first solution, as starting the script manually that many times is no fun.
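To see what the search string from solution 1 captures, here is a minimal sketch in plain JavaScript, run outside UltraEdit. The helper name extractTenthField and the sample lines are made up for illustration; only the regular expression is taken from the post.

```javascript
// The search string from solution 1: skip nine "field|" groups,
// then capture the non-empty tenth field.
const TENTH_FIELD = /^(?:[^\r\n|]*\|){9}([^\r\n|]+)\|.*$/;

// Hypothetical helper: returns the tenth field of a line,
// or null if the line has no non-empty tenth field followed by "|".
function extractTenthField(line) {
  const match = TENTH_FIELD.exec(line);
  return match ? match[1] : null;
}

// Made-up sample lines with the ID in the tenth field.
const withId = "a|b|c|d|e|f|g|h|i|123456789012|k|l";
const withoutId = "a|b|c|d|e|f|g|h|i||k|l";

console.log(extractTenthField(withId));    // prints 123456789012
console.log(extractTenthField(withoutId)); // prints null
```

Lines whose tenth field is empty do not match at all, so a Replace All would leave such lines unchanged in the copy.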

Re: Find lines containing a number stored in a list in another file

Postby yusufnohh » Mon Apr 22, 2013 3:23 am

Hi Mofi.
Thanks for the info. Sorry, I'm not sure if you got me right.

I have a database of 50 million records with the field names below:

ACCOUNT|MOBILENO|NAME|ADD1|ADD2|ADD3|CITY|ZIP|STATE|NEWID|BUSREG|OTHERID|OLDID|REGDATE|STATUS|GENDER|BIRTHDATE|RACE|EMAIL
10100801|0192168465|MRS RAJESWARY A/P BHASKARAN|NO 42|KG PINANG||HULU BERNAM|35900|PERAK|871119085122||||18-SEP-09|Active|Female|19-NOV-87|Indian|
10100841|0192311363|CIK ZANEZA BINTI MOHAMMED ZAIKE|NO 16 JALAN 40 DESA JAYA KEPONG|||KUALA LUMPUR|51200|W PERSEKUTUAN KUALA LUMPUR|771002145314|||A3848390|18-SEP-09|Active|Female|02-OCT-77|Malay|
10102451|0177703799| MR.AZHAR BIN ATAN|NO 57 SELASIH 2|TMN PASIR PUTIH||PASIR GUDANG|81700|JOHOR|650507015987|||A0117302|18-SEP-09|Active|Male|07-MAY-65|Malay|
10102571|0169659889| MR.SOO YEOW LOONG|LOT 623 JLN TELOK|BUNUT||BANTING|42700|SELANGOR|790408105973||||18-SEP-09|Active|Male|08-APR-79|Chinese|
10102691|0193176085|MR MOHD ZAREMDEEN BIN MOHD ZAMAN|NO 44A 1ST FLOOR JLN TUN MOHD FUAD|SATU TMN TUN DR ISMAIL||KUALA LUMPUR|60000|W PERSEKUTUAN KUALA LUMPUR|701008035171|||A1618225|18-SEP-09|Active|Male|08-OCT-70|Malay|
10102781|0195906523|MR ISMAIL BIN AWANG|11 PESARA KELEBANG JAYA 12, TAMAN KELEBANG JAYA|||CHEMOR|31200|PERAK|511203075163|||4211178|18-SEP-09|Active|Male|03-DEC-51|Malay|
10102821|0133991147|MS SITI NOORBAYA BINTI MOHD YUNUS|NO. 1,|JALAN PJU 1A/21,|ARA DAMANSARA,|PETALING JAYA|47301|SELANGOR|750308115238||||18-SEP-09|Active|Female|08-MAR-75|Malay|
10103091|0135333198|LOH KIEN SENG|102 JLN AMAN JELAPANG|||IPOH|30020|PERAK|700218085551|||A1557532|18-SEP-09|Active|Male|18-FEB-70|Chinese|
10104261|0133920594|PUAN HUSNA BINTI OSMAN|KUARTERS KLINIK KESIHATAN TEKEK|KAMPUNG TEKEK, PULAU TIOMAN||KUALA ROMPIN|26800|PAHANG|860111335584||||18-SEP-09|Active|Female|11-JAN-86|Malay|
10104911|0165342333| ENCIK.AZHARI BIN ABDULLAH|SLIM PANTAS ENTERPRISE|ESSO FILLING STATION|JALAN BESAR|SLIM RIVER|35800|PERAK|630627085615|||7076892|18-SEP-09|Active|Male|27-JUN-63|Malay|
10100631|0126360984| PUAN NOR HAYATI BINTI MUSA|434|JALAN MARGOSA 15|TAMAN BUKIT MARGOSA|AMPANGAN|70400|NEGERI SEMBILAN|650728055464|||A0208356|18-SEP-09|Active|Female|28-JUL-65|Malay|
10101341|0192823387|ENCIK MOHAMMAD MIZAN BIN MOHAMMAD ARIF|3Q EQUESTRAIN SG SERAI|KUANG||RAWANG|48050|SELANGOR|870303565105||||18-SEP-09|Active|Male|03-MAR-87|Malay|
10102271|0135333688|MR MUHUZAI BIN MUSTAFA|DM 138|KAMPUNG TELUK BARU|TANJUNG LUMPUR|KUANTAN|26060|PAHANG|821014065429||||18-SEP-09|Active|Male|14-OCT-82|Malay|
10102311|0132407812|MS KHAIRUNNISA BINTI RAMLI|NO 304 LRG ANGGERIK 9|BDR SUNGGALA||PORT DICKSON|71050|NEGERI SEMBILAN|881011055692||||18-SEP-09|Active|Female|11-OCT-88|Malay|
10102791|0137987794|MR MD BALYA HIDIR BIN MD SALEH|PTD 1032 TAMAN MAS SURIA PESERAI|||BATU PAHAT|83000|JOHOR|831216016325||||18-SEP-09|Active|Male|16-DEC-83|Malay|
10102841|0132878737| ENCIK MUHD SYAHAMIRUL EDININ BIN ABD HAMID|C-2-1 DANAU VILLA APT.|JALAN 5/23E TAMAN DANAU KOTA||KL|53200|W PERSEKUTUAN KUALA LUMPUR|721014125411|||A2308960|18-SEP-09|Active|Male|14-OCT-72|Malay|
10103001|0133999479|MR DOL FATAH BIN ABDUL WAHAB|NO 6 JALAN 14/1 FASA 5|TAMAN CHERAS JAYA||CHERAS|43200|SELANGOR|790512036023||||18-SEP-09|Active|Male|12-MAY-79|Malay|
10103251|0195898406|ENCIK MOHD FAKRULRAZI BIN ABD KADIR|NO 20-C LORONG KENANGA|KAMPUNG BARU||KUALA NERANG|06300|KEDAH|850513025635||||18-SEP-09|Active|Male|13-MAY-85|Malay|
10103531|0194092982|CIK SITI ZAHIDAH BINTI AYOB|102 JLN AU2A/14|TAMAN SRI KERAMAT||KUALA LUMPUR|54200|W PERSEKUTUAN KUALA LUMPUR|821128045020||||19-SEP-09|Active ... @gmail.com
10100411|0199447300|MR KUMARASAMY A/L RAJAGOPAL|NO 7 TAMAN SRI MAKMUR|||JERANTUT|27000|PAHANG|710426065685|||A1970518|18-SEP-09|Active|Male|26-APR-71|Indian|
10100441|0193855300|MR KUMARASAMY A/L RAJAGOPAL|NO 7 TAMAN SRI MAKMUR|||JERANTUT|27000|PAHANG|710426065685|||A1970518|18-SEP-09|Active|Male|26-APR-71|Indian|
10100591|0192630066| MADAM.KHOR KIM SAM|31 BU 11/4|BANDAR UTAMA||PETALING JAYA|47800|SELANGOR|690907086078|||A1376748|18-SEP-09|Active|Female|07-SEP-69|Chinese|
10100931|0122042936| MR.LOW CHAN WENG|81-6-5|RESOURCE SPRINGS|JALAN AYER PANAS|SETAPAK|53200|W PERSEKUTUAN KUALA LUMPUR|570825105929|||5365094|18-SEP-09|Active|Male|25-AUG-57|Chinese|
10101141|0199315703|MR WAN AZHAR BIN WAN YUSOFF|NO 16 JALAN IM 2/89|BANDAR INDERA MAHKOTA||KUANTAN|25200|PAHANG|670701035963|||A0723015|18-SEP-09|Active|Male|01-JUL-67|Malay|
10101801|0122543735| MR.FU POH SING|3-12D|JALAN DESA 2/2|DESA AMAN PURI|KEPONG|52100|W PERSEKUTUAN KUALA LUMPUR|640529107357|||7308347|18-SEP-09|Active|Male|29-MAY-64|Chinese|
10102651|0122852585| MADAM.RUSLINA BINTI ABU HASSAN|NO 27|JALAN SG 10/12|TAMAN SERI GOMBAK|BATU CAVES|68100|SELANGOR|610418095034|||6172441|18-SEP-09|Active|Female|18-APR-61|Malay|

Now I want to run a multiple search: match 100k NEWID values against the 50 million records and get all the relevant fields.

Sorry about my explanation. I hope you understand what I mean. Thanks

Cheers

Re: Find lines containing a number stored in a list in another file

Postby Mofi » Tue Apr 23, 2013 1:12 am

Let me try to explain what I understood.

You have a text file opened in UltraEdit which contains a list of ID numbers line by line.

You want to search a directory tree for CSV files which use the character | as separator and contain one or more of these ID numbers.

In a new file you want every line from any of those CSV files whose tenth field value is an ID number listed in the opened file.

The output file should contain only the found lines, with no other information such as the file or the line number where each line was found.

You are currently using UltraEdit v??.??.??.????

Is that the description for the task to do?

Re: Find lines containing a number stored in a list in another file

Postby yusufnohh » Tue Apr 23, 2013 9:53 am

Hi Mofi,

I'm using UltraEdit Professional Text/HEX Editor Version 19.00.0.1022

The task:

I have a CSV/text file open in UltraEdit with the fields MOBILE_NO|NAME|NEW_ID|OLD_ID|OTHER_ID|ADD1|ADD2|ADD3|CITY|STATE|POSTCODE, as below:

MOBILE_NO|NAME|NEW_IC|OLD_ID|OTHER_IC|ADD1|ADD2|ADD3|CITY|STATE|POSTCODE


0198183831|CHUNG MIANG POH|450226135249|K570346||LOT 886 JALAN GUBAH BINTAWA|||KUCHING|SARAWAK|93450
0198331936|NGU TAI HONG|630205135635|K0000380||P O BOX 60925|||TAWAU|SABAH|91019
0195896368|MR LIM SIEW SENG|561031075233|5111429||G 12,TAMAN SEGAR JAYA,BAGAN LALANG,|||BUTTERWORTH|PULAU PINANG|13400
0195191722|ENCIK BASRIZAL BIN CHE BAHAROM|770708026101||T720988|NO 51 JALAN PONDOK TG BEDIL|SUNGAI BARU GUNUNG||ALOR SETAR|KEDAH|05150
0198528442|DATU BASRUN BIN DATU MANSOR|560830125079|H0065240||LOT 11 TAMAN PARK|PUTATAN||KOTA KINABALU|SABAH|88100
0132284857|MR MOHAMAD LUKHMAN NOOR HAKIM BIN JAAFAR|871216105121|||LOT 3482 JALAN MERBAU|KAMPUNG MELAYU SUBANG||SHAH ALAM|SELANGOR|40150
0123656579| MR MUSTAPA BABA|620628045085|||453-1 KM 6 KAMPUNG DUYONG MELAKA|||MELAKA|MELAKA|75460
0196888721|ENCIK KHAIRUL AFIF BIN KAMRIN|870805305047|||NO 3|JALAN USJ 3A/1||SUBANG JAYA|SELANGOR|47610
0192863660|MR MOHD MIZUAR BIN MOHD YUSOF|770918055955|A3754477||NO 384 RUMAH RAKYAT PANCHOR PAROI|||SEREMBAN|NEGERI SEMBILAN|70400
0199157131|LUKEMA MERI BIN SALLEH@ ABDUL LATIF|660326115371|A0369041||JKR 229 KUARTERS KERAJAAN|JALAN SULTAN MAHMUD|BATU BURUK|KUALA TERENGGANU|TERENGGANU|20400
0198049008|MR CHUA SENG NYEP|650729105013|A0196788||TB 10553 LORONG 7/2|TAMAN MEGAH JAYA JALAN APAS BATU 3 1/2||TAWAU|SABAH|91000

In a second file I have 100k ID numbers without any other details. How do I search for them in the data above so that I get all the details for each ID?

ID numbers without details:
450226135249
630205135635
561031075233


After search and mapping
0198183831|CHUNG MIANG POH|450226135249|K570346||LOT 886 JALAN GUBAH BINTAWA|||KUCHING|SARAWAK|93450
0198331936|NGU TAI HONG|630205135635|K0000380||P O BOX 60925|||TAWAU|SABAH|91019
0195896368|MR LIM SIEW SENG|561031075233|5111429||G 12,TAMAN SEGAR JAYA,BAGAN LALANG,|||BUTTERWORTH|PULAU PINANG|13400

Thanks. Cheers

Re: Find lines containing a number stored in a list in another file

Postby Mofi » Wed Apr 24, 2013 2:28 am

Your input data keeps changing. Here is a script which searches for lines containing one of the listed IDs, based on the data in your last post.

Open the large CSV file with all the data as the first file (leftmost on the open file tabs bar).

Open the file with the IDs, one per line, as the second file.

Open, or create and save, the script file as the third file. It must be the active file when you use Scripting - Run Active Script.

The script file can also be added to the list of scripts and executed from the menu or the Script List view.

if (UltraEdit.document.length > 1)  // Are at least two files opened?
{
   // Define environment for this script.
   UltraEdit.insertMode();
   UltraEdit.columnModeOff();
   UltraEdit.perlReOn();

   var CsvFile = UltraEdit.document[0];   // First file (most left) must be the large CSV file.
   var ListFile = UltraEdit.document[1];  // Second file must be the file with the list of IDs.

   // Load all IDs into an array of strings.
   ListFile.selectAll();
   var asIDs = ListFile.selection.split("\r\n");
   ListFile.top();
   // Remove last string if it is an empty string because the list file ended with a line termination.
   if (asIDs[asIDs.length-1] == "") asIDs.pop();

   // Open output window for showing progress.
   UltraEdit.outputWindow.clear();
   UltraEdit.outputWindow.showWindow(true);

   // Define the parameters for the multiple Perl regular expression finds.
   CsvFile.findReplace.mode=0;
   CsvFile.findReplace.matchCase=true;
   CsvFile.findReplace.matchWord=false;
   CsvFile.findReplace.regExp=true;
   CsvFile.findReplace.searchDown=true;
   CsvFile.findReplace.searchInColumn=false;

   // Use user clipboard 9 for collecting the found data.
   UltraEdit.selectClipboard(9);
   UltraEdit.clearClipboard();

   CsvFile.top();
   var nFoundCount = 0;

   for (var nID = 0; nID < asIDs.length; nID++)
   {
      var sSearch = "^(?:[^\\r\\n|]*\\|){2}" + asIDs[nID] + "\\|.+$";
      if (CsvFile.findReplace.find(sSearch))
      {
         UltraEdit.clipboardContent += CsvFile.selection + "\r\n";
         CsvFile.top();
         UltraEdit.outputWindow.write(asIDs[nID]+" found.");
         nFoundCount++;
      }
      else UltraEdit.outputWindow.write(asIDs[nID]+" not found.");
   }

   // Output found lines into a new file if anything was found at all.
   if (nFoundCount)
   {
      // Create a new file for the results.
      UltraEdit.newFile();
      UltraEdit.activeDocument.unixMacToDos();
      UltraEdit.activeDocument.paste();
      UltraEdit.clearClipboard();
      UltraEdit.activeDocument.top();
   }
   UltraEdit.selectClipboard(0);  // Select Windows clipboard.
   // Display a short summary message prompt.
   UltraEdit.messageBox("Found "+nFoundCount+" ID"+(nFoundCount!=1 ? "s":"")+" of "+asIDs.length+" ID"+(asIDs.length!=1 ? "s.":"."));
}

Re: Find lines containing a number stored in a list in another file

Postby yusufnohh » Thu Apr 25, 2013 2:29 am

Hi Mofi

I followed your instructions. I opened the first file with 30 million records, then I opened the second file with 6 million ID numbers only. I opened a third file, copied the script into it and ran the active script. I also executed it from the script menu.
The ID numbers in the second file were highlighted in blue, but nothing else happened. Did I do it correctly? Please advise. Cheers

Re: Find lines containing a number stored in a list in another file

Postby Mofi » Thu Apr 25, 2013 4:41 am

What I did to verify that the script works:

  1. I started UltraEdit resulting in having only an empty new file open.
  2. I copied the block from your previous post with the data and pasted them into this new file.
  3. I pressed Ctrl+N to open one more new file, copied the 3 lines with the IDs from your previous post and pasted them into the second new file.
  4. I created a third new file, saved it as Test.js, wrote the script code and saved the modifications with Ctrl+S.
  5. So there are now 3 files opened: first one is a new file with the input data, second file is a new file with the 3 IDs, third file is Test.js which is the active file.
  6. Now I executed Scripting - Run Active Script and the script produced a new file with same output as you posted.

That is exactly what you should do first too. Check if the script does what you requested in your previous post.

The script will fail to find anything if the data in the CSV file is not formatted exactly as you specified in your previous post.

Please note that UltraEdit will most likely need several minutes to finish on your huge file with 30 million lines, especially when the IDs cannot be found at all because the CSV file is formatted differently than posted.

If the list file really contains 6 million IDs, the script will most likely need hours to finish. It could also easily run out of memory, as every string found and selected during script execution is copied to memory twice. A 32-bit application like UltraEdit can use only about 2 GB of memory. When no more memory is available, the script terminates or, worse, UltraEdit crashes. It would be better to divide the huge list of IDs into smaller parts, for example 50,000 IDs per script run, to avoid the out of memory situation.
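Dividing the ID list into parts can itself be scripted. The following is only a sketch in plain JavaScript, not an UltraEdit script: the function name splitIntoChunks is hypothetical, and each returned part would still have to be saved as its own list file before running the search script on it.

```javascript
// Hypothetical helper: split an array of IDs into parts of at most
// chunkSize entries each, keeping the original order.
function splitIntoChunks(ids, chunkSize) {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const chunks = [];
  for (let start = 0; start < ids.length; start += chunkSize) {
    chunks.push(ids.slice(start, start + chunkSize));
  }
  return chunks;
}

// Example with a tiny list; for the real task chunkSize would be 50000.
const parts = splitIntoChunks(["450226135249", "630205135635", "561031075233"], 2);
console.log(parts.length); // prints 2
```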

And please install the latest hotfix of UltraEdit, as v19.00.0.1022 was the first public build of UE v19.00, which has an updated Perl regular expression engine in comparison to v18.20. There were some bugs in the Perl regexp engine implementation, and most of them are fixed in the currently latest release, v19.00.0.1031. I tested the script with build 1031 of UE v19.00.

Re: Find lines containing a number stored in a list in another file

Postby yusufnohh » Sat Apr 27, 2013 2:05 am

Hi Mofi.
It works well with the data I posted.
When I run it with other data an error occurs, as shown below:

Running script: J:\SERVER FILE\Script\test.js
========================================================================================================
An error occurred on line 13:
��
Script failed.



I tried the script with the ID column as the 8th column.

When I run the script, the message says no ID was found.

In the first data the ID column was the 3rd column, and it worked well.

Should I move the ID column to the 3rd column in all files?

Thanks

Re: Find lines containing a number stored in a list in another file

Postby Mofi » Sun Apr 28, 2013 4:14 am

The error is caused by an out of memory situation. As I already wrote, limit the number of IDs in the list to 50,000, 100,000 or 200,000 per script run.

The script contains the line

var sSearch = "^(?:[^\\r\\n|]*\\|){2}" + asIDs[nID] + "\\|.+$";

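Change the number from 2 to 7 and the IDs are searched in the eighth data column. The number is always the column number minus one, because it counts the fields to skip before the ID.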
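If the ID column differs from file to file, the number can be computed instead of edited by hand. This is a minimal sketch; the function name buildSearchString and its column parameter are hypothetical additions, while the resulting string has the same form as the one in the script. It is tested here with the standard JavaScript RegExp, not with UltraEdit's Perl engine.

```javascript
// Hypothetical helper: build the search string for an ID in the given
// 1-based column. The quantifier skips the (column - 1) fields before it.
function buildSearchString(id, column) {
  return "^(?:[^\\r\\n|]*\\|){" + (column - 1) + "}" + id + "\\|.+$";
}

// Column 3 gives exactly the string used in the script.
console.log(buildSearchString("450226135249", 3));
// prints ^(?:[^\r\n|]*\|){2}450226135249\|.+$

// Made-up line with the ID in the eighth column.
const line = "a|b|c|d|e|f|g|999|x";
console.log(new RegExp(buildSearchString("999", 8)).test(line)); // prints true
console.log(new RegExp(buildSearchString("999", 3)).test(line)); // prints false
```

Note that the search string requires a | and at least one more character after the ID, so it does not match when the ID is the last field of the line.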

Re: Find lines containing a number stored in a list in another file

Postby yusufnohh » Mon Apr 29, 2013 8:59 am

Hi Mofi..

You were right. The script takes a long time to run. I tried it with 50k IDs per script run, and after almost 10 hours it is still running.
I have 6 million records to sort out. Is there any other way to do this task?
Also, how do I combine data? I have hundreds of folders of data with the field names jumbled up: the fields are in different columns from one folder to the next. How do I rearrange the data so it is uniform?
What is the maximum number of records I can open in UltraEdit? Is it possible to open 30 million records?

Cheers

Re: Find lines containing a number stored in a list in another file

Postby Mofi » Wed May 01, 2013 9:56 am

UltraEdit can be used to edit files of any size. But for large and huge files it is strongly recommended to configure UltraEdit for working with large files as explained in the IDM power tip Large file text editor.

In general, editing large files can be done efficiently only in one of two ways: either memory usage is reduced to a minimum, or, the total opposite, everything is loaded into memory and all modifications are done in memory. Loading everything into memory requires, for huge files, a computer with 4, 8, 16 or even more GB of RAM and of course a 64-bit application which can really make use of so much RAM. Although computers nowadays have more and more RAM, a really large contiguous block of free RAM is nevertheless often rare. UltraEdit as a 32-bit application cannot make use of more than 2 GB of RAM at all.

The problem here is that you are not using UltraEdit for editing the huge file. What you want is to extract/copy data from a huge file based on a list which also contains a very large number of strings. That's a task a text editor is not designed for. This task requires special handling of the data to use as little memory as possible while still processing the data efficiently.

Have you ever calculated how many string (not integer) compares the program has to do for your task in the worst case? 30,000,000 x 6,000,000, which is 180,000,000,000,000 string compares. It is no wonder that this takes very long, especially as the script has to run complex Perl regular expression finds and not only simple string compares.

A more efficient way to do this task would be:

  1. Load one ID string after the other into memory and keep in memory only the ID converted to an unsigned integer, not the string. So what is held in memory is not a list of ID strings, but a list of ID integers.
  2. Next, in a loop, load one line after the other from the huge data file.
  3. For every line, extract the ID string and convert it to an unsigned integer as well.
  4. In a second, inner loop, compare the ID from the loaded line against every number in the ID number list.
  5. If there is a match, append the line still in memory to an output buffer. When this output buffer contains for example 1000 lines, write it to the output file and release the memory used for the found lines.
  6. If the IDs are unique, so that no 2 lines can contain the same ID, it would be best to remove the matching ID from the ID number list so that the number of integer compares for the next line is reduced by one.
  7. The main loop continues by releasing the memory used for the current line if it does not contain an ID of interest, and loads the next line. So we are back at step 3.
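
The steps above can be sketched in plain JavaScript, run outside UltraEdit (for example with Node.js). This is only an illustration under assumptions, not a tested solution for 30 million lines: the function name filterLinesByIds and its parameters are hypothetical, and it deliberately replaces the inner loop of step 4 with a Set lookup, which needs one lookup per line instead of one compare per listed ID. Converting the IDs to numbers assumes at most 15 digits and that leading zeros are not significant.

```javascript
// Hypothetical sketch: yield every line whose field in the given 1-based
// column is one of the wanted IDs. "lines" can be any iterable of strings,
// so the caller can feed it line by line from a file reader.
function* filterLinesByIds(lines, idStrings, column, idsAreUnique) {
  // Step 1: keep the wanted IDs in memory as numbers, not as strings.
  const wanted = new Set();
  for (const idString of idStrings) {
    const trimmed = idString.trim();
    if (/^\d+$/.test(trimmed)) wanted.add(Number(trimmed));
  }

  // Steps 2 and 7: process one line at a time.
  for (const line of lines) {
    // Step 3: extract the ID field and convert it to a number.
    const field = line.split("|")[column - 1];
    if (field === undefined || !/^\d+$/.test(field)) continue;
    const id = Number(field);

    // Step 4, replaced by a Set lookup instead of an inner loop.
    if (wanted.has(id)) {
      yield line;                           // Step 5: hand the line to the caller.
      if (idsAreUnique) wanted.delete(id);  // Step 6: shrink the list.
    }
  }
}

// Made-up data with the ID in the third column.
const data = ["M|N|ID|X", "a|b|111|x", "a|b|222|y", "a|b|111|z"];
for (const found of filterLinesByIds(data, ["111", "333"], 3, false)) {
  console.log(found);
}
```

The generator leaves buffering and writing of the output file (step 5) to the caller, so only the lines currently being processed need to be in memory.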

In general the method written here could be coded as an UltraEdit script, but in practice it will not work. The reason is that UltraEdit copies every selected text accessed through the document property selection into memory as a JavaScript string object and keeps this string object until the script terminates, at which point all memory used during script execution is released. That is of course bad memory management, and I reported it a few weeks ago when I detected this behavior while writing a script for another user who was also searching for lots of data in a huge file.

As every selected text is kept in memory until the script terminates, it is inevitable that sooner or later an out of memory situation occurs when running a data extraction task on a very large or huge file. So UltraEdit scripts are at the moment not really useful for such special tasks involving a very large amount of data. I hope the IDM developers soon improve the memory management for selected text by removing the string object of the current selection from memory as soon as the selection is canceled or replaced by another selection.

I have an idea how your task of getting 6 million lines from a file with 30 million lines could be done by modifying a copy of the file with an UltraEdit macro. UltraEdit macros do not copy strings to memory except when using clipboard(s) or ^s in Finds/Replaces (and that is not kept in memory until macro termination), so this approach would make it possible to achieve the task without running out of memory. But UltraEdit macros do not support variables and conditions for doing something line by line. So although the macro could do the job, it would take very long and would stress your hard disk extremely, as all string compares are done with heavy access to the file data on the hard disk.

As even the optimal solution described in the numbered list will take very long to finish, it would definitely be best to write a C++/C# application for this task, especially because the job could be done very well with parallel threads using all cores of the CPU. A C++/C# application using worker threads for comparing the ID numbers against the ID from a loaded line could do the task much quicker than any UE script or macro ever could.

Another way to do what you want is to use a database application like Access, MySQL, ... Database applications are optimized for such tasks.

If you have multiple CSV files with different data structures, you need to reformat the files, for example with tagged regular expression replaces, so that all CSV files finally have the same data structure. Then it would be possible to merge them. I wrote a script for merging CSV files which can be downloaded from the extra downloads page. But that script is written for merging a set of small CSV files into a large CSV file. It is not written for merging several large CSV files into a huge CSV file.
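
As an alternative to tagged regular expression replaces, the columns can be rearranged by field name when every file has a header line. The following plain JavaScript sketch (not an UltraEdit script) shows the idea; the function name reorderToLayout is hypothetical, and it assumes that the same field has the same name in every file and that no field value contains the | character.

```javascript
// Hypothetical sketch: rearrange the columns of one CSV file (an array of
// lines, header first) into the order given by targetFields.
// Fields missing in the source file become empty values.
function reorderToLayout(lines, targetFields) {
  const sourceFields = lines[0].split("|");
  // For each target field, the index of that field in the source (-1 if absent).
  const sourceIndex = targetFields.map((name) => sourceFields.indexOf(name));

  const output = [targetFields.join("|")];
  for (const line of lines.slice(1)) {
    const values = line.split("|");
    const reordered = sourceIndex.map((i) =>
      i >= 0 && i < values.length ? values[i] : ""
    );
    output.push(reordered.join("|"));
  }
  return output;
}

// Made-up example: move ID to the front and add an empty ZIP column.
const result = reorderToLayout(["NAME|ID|CITY", "Ann|1|X"], ["ID", "NAME", "ZIP"]);
console.log(result.join("\n"));
// prints:
// ID|NAME|ZIP
// 1|Ann|
```

Files whose headers use different names for the same field, like NEW_ID and NEW_IC in the posts above, would first need their header names made consistent.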