Splitting based on content

Postby mikekiwi » Thu Oct 05, 2006 6:15 am

I have a huge file containing the sales of a shop over the past 6 months. One of the fields is the exact sales date.

I'll show a simplified example below, where the last field is the sales date. What I would now like to do is create a separate file per sales date.
There is no identifier that marks the start or end of a day, and the number of rows varies from day to day.

Excel has a "Subtotals" option with an "At each change in" setting, which computes statistics whenever the value of a chosen column changes. That kind of break on a changing value is what I'm looking for...

Thanks for your help!


"artist 1 ";"Title"; 1; 09,99;20060828
"artist 2 ";"Title"; 1; 09,99;20060828
"artist 1 ";"Title"; 1; 05,99;20060829
"artist 2 ";"Title"; 1; 06,99;20060829
"artist 3 ";"Title"; 1; 03,99;20060829
"artist 1 ";"Title"; 1; 03,99;20060830

Re: Splitting based on content

Postby Mofi » Thu Oct 05, 2006 7:06 am

Should be no problem. The following macro works for your example. The macro property Continue if a Find with Replace not found must be checked for this macro.

Because of the focus issue after closing a file described at Problem with Previous Window/Tab Command, make sure you have only your CSV file open or that it is the rightmost file in the open file tabs order.

I don't have time right now to explain the macro, but I think it's not too difficult to understand.

InsertMode
ColumnModeOff
HexOff
Bottom
IfColNum 1
Else
"
"
EndIf
Top
Clipboard 9
Key END
StartSelect
Find Up Select ";"
Key RIGHT ARROW
Copy
EndSelect
Key RIGHT ARROW
Loop
Find "^c"
IfFound
Key LEFT ARROW
Else
Key HOME
IfColNumGt 1
Key HOME
EndIf
Key DOWN ARROW
SelectToTop
Clipboard 8
Cut
NewFile
Paste
Top
Clipboard 9
Paste
".csv"
SelectToTop
Cut
SaveAs "^c"
CloseFile
IfEof
ExitLoop
Else
Key END
StartSelect
Find Up Select ";"
Key RIGHT ARROW
Copy
EndSelect
Key RIGHT ARROW
EndIf
EndIf
EndLoop
CopyFilePath
CloseFile NoSave
Open "^c"
ClearClipboard
Clipboard 8
ClearClipboard
Clipboard 0

Re: Splitting based on content

Postby mikekiwi » Thu Oct 05, 2006 7:55 am

Wow....magic! :D

Works great... in my tests so far I've had only one file error (couldn't write to disk), but apart from that it works like a charm on the 50 MB test file I created.

For the real thing (400 MB) it will take the PC a night of hard work, I think, but that's way better than doing the manual copy and paste over and over again myself...

One little question: this macro is based on the date being the last field (which is perfect for this file); what if there were, let's say, two fields after the date? How should I adapt the macro?
Two more "find up select ";"" ???

Thanks an awful lot, mofi!

Re: Splitting based on content

Postby Mofi » Thu Oct 05, 2006 11:04 am

Wow ... 400 MB! That's very important information. My first macro modifies the source file and, when finished, restores it by closing it without saving and reopening it. That's okay for normal files, but not for a 400 MB file, which is hopefully opened without a temp file, so that all changes are permanent.

I have changed some lines in the macro to work now without modifying the source file. This should increase the speed of the macro a lot.

The macro contains some additional commands to make sure it works independently of the configuration options Home Key Always Goto Column 1 and Bookmark column with line (the second currently exists only in UEStudio 6.00+ and will be available in UE in the next major release). It's also independent of the current regular expression engine because it is regex free.

How it works:

First the macro verifies if the last line of the source file is terminated with EOL character(s). This is important because the macro contains Key DOWN ARROW with IfEof and this would produce an endless loop if the last line of the file is not terminated because Key DOWN ARROW does not work and so end of file is never reached. That's the only possible modification of the source file!
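
For readers who want to run the same check outside UltraEdit, here is a minimal Python sketch of this first step. The function names and the default CRLF terminator are assumptions made for illustration; they are not part of the macro.

```python
import os


def ends_with_newline(path: str) -> bool:
    """Return True if the file is empty or its last byte is LF or CR."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            # Empty file: nothing to terminate (the macro also does nothing here).
            return True
        f.seek(-1, os.SEEK_END)
        return f.read(1) in (b"\n", b"\r")


def ensure_trailing_newline(path: str, eol: bytes = b"\r\n") -> bool:
    """Append eol if the last line is unterminated; return True if the file changed."""
    if ends_with_newline(path):
        return False
    with open(path, "ab") as f:
        f.write(eol)
    return True
```

Like the macro, this is the only step that may modify the source file, and it only ever appends a line terminator.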

The macro now works with 2 bookmarks. So next it clears every existing bookmark, if there are any.

Next it selects the date string in the first line and copies it to user clipboard 9. The three lines Find Up ";" / Find Up ";" / Key LEFT ARROW before StartSelect (shown in red in the original post) demonstrate how to select the date string when it is not at the end of a line, here with two fields after it.

Find Up Select ";" selects from the current cursor position up to and including the found string. Because the ';' should not be part of the file name, select mode is started before this special find, so Key RIGHT ARROW is executed in select mode and shrinks the selection by the ';'. Find Select "" is the same as holding the SHIFT key while pressing the Find Next button in the find dialog.
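
The same field selection can be sketched in Python by counting fields from the end of the line. The helper name field_from_end and its fields_after parameter are hypothetical, and, like the macro, the sketch treats every ';' as a delimiter, including one inside a quoted field.

```python
def field_from_end(line: str, fields_after: int = 0, sep: str = ";") -> str:
    """Return the field that has `fields_after` fields to its right.

    fields_after=0 picks the last field (date at end of line);
    fields_after=2 picks the field followed by two more fields.
    """
    parts = line.rstrip("\r\n").split(sep)
    return parts[-1 - fields_after]


sample = '"artist 1 ";"Title"; 1; 09,99;20060828\n'
date_at_end = field_from_end(sample)
date_before_two = field_from_end("x;20060828;a;b\n", fields_after=2)
```

Here fields_after=2 corresponds to the two extra Find Up ";" steps in the macro.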

Key RIGHT ARROW moves the cursor once right to make sure, that the just copied string is not found again in the following loop (not really needed but more secure).

The main loop always searches for the current date string in clipboard 9.

If it is found again, unselect the found string and move the cursor once left before continuing the search. Well, this is not really needed, but it's safer.

If the date string in clipboard 9 is not found, the cursor is in the last line with this date string. Set the cursor now to the start of the next line and bookmark this line. That would fail at the last line of the file if it were not terminated with CRLF (or only LF or only CR, depending on the file format and current edit mode).

Next clipboard 8 is selected and from current cursor position till previous bookmark everything is selected (same as pressing Shift+F2 for Search - Next Bookmark with selecting).

Copy the selected block into clipboard 8, clear the bookmark here and move the cursor down to the remaining bookmark where the next date block starts.

Then open a new file, paste the block, move the cursor to the top, insert the date string there and insert (=append) ".csv" to get the file name. Select the file name, cut it from the file, save the new file as "date string.csv" and close it.

Back in the source file check if end of file is reached. If so, exit the loop. If not, again select in the already bookmarked line the new date string, copy it to clipboard 9, set cursor to a new position in the current line where the date string cannot be found again and continue the loop.

After the loop clear the remaining bookmark at end of the file, clear the 2 used clipboards to free RAM and switch back to the windows clipboard.
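
As a rough cross-check of the loop described above, here is a Python sketch that streams the source file and writes one file per run of equal date keys. It is not a translation of the macro: the names split_by_key and date_key are made up for illustration, the date is assumed to be the last field, and out_dir is assumed to exist and to start out empty.

```python
import os
from itertools import groupby


def date_key(line: bytes) -> str:
    """Key of a row: the text after the last ';' (assumes the date is the last field)."""
    return line.rstrip(b"\r\n").rsplit(b";", 1)[-1].decode("ascii")


def split_by_key(src_path, out_dir, key=date_key, ext=".csv"):
    """Write each run of consecutive rows with the same key to out_dir/<key><ext>."""
    written = []
    seen = set()
    with open(src_path, "rb") as src:
        # groupby only groups adjacent rows, so the file is never loaded as a whole.
        for k, rows in groupby(src, key=key):
            out_path = os.path.join(out_dir, k + ext)
            # If a key shows up again later in the file, append instead of overwriting.
            mode = "ab" if k in seen else "wb"
            with open(out_path, mode) as out:
                out.writelines(rows)
            if k not in seen:
                seen.add(k)
                written.append(out_path)
    return written
```

Reading in binary mode keeps the original line terminators untouched, so DOS and UNIX files are copied byte for byte.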

Once again: The macro property Continue if a Find with Replace not found must be checked for this macro. And because of the focus issue after closing a file described at Problem with Previous Window/Tab Command, make sure you have only your CSV file open or that it is the rightmost file in the open file tabs order.

InsertMode
ColumnModeOff
HexOff
Bottom
IfColNum 1
Else
"
"
EndIf
Loop
GotoBookMark
IfEof
ExitLoop
Else
ToggleBookmark
Bottom
EndIf
EndLoop
Top
ToggleBookmark
Clipboard 9
Key END
Find Up ";"
Find Up ";"
Key LEFT ARROW

StartSelect
Find Up Select ";"
Key RIGHT ARROW
Copy
EndSelect
Key RIGHT ARROW
Loop
Find "^c"
IfFound
Key LEFT ARROW
Else
Key HOME
IfColNumGt 1
Key HOME
EndIf
Key DOWN ARROW
ToggleBookmark
Clipboard 8
GotoBookMarkSelect
Copy
EndSelect
ToggleBookmark
GotoBookMark
NewFile
Paste
Top
Clipboard 9
Paste
".csv"
SelectToTop
Cut
SaveAs "^c"
CloseFile
IfEof
ExitLoop
Else
Key END
Find Up ";"
Find Up ";"
Key LEFT ARROW

StartSelect
Find Up Select ";"
Key RIGHT ARROW
Copy
EndSelect
Key RIGHT ARROW
EndIf
EndIf
EndLoop
ToggleBookmark
ClearClipboard
Clipboard 8
ClearClipboard
Clipboard 0

Re: Splitting based on content

Postby mikekiwi » Fri Oct 06, 2006 8:44 am

Great explanation, really good to understand what makes the differences and how it works. It indeed works quite a bit faster now, but I still have some file save errors once in every 3 or 4 saved files.

I get the message "File Error" and after that "File/device maybe readonly, or open for write by another application"; then I get the chance to manually save the new Tab.
Any idea what can cause this error? There is plenty of room on the disc and as the other files are saved o.k., I can't imagine it has something to do with access rights...

Thanks a lot once again!

Michael

Re: Splitting based on content

Postby Mofi » Fri Oct 06, 2006 9:37 am

Looks like the file name in clipboard 9 is sometimes not a valid file name. I once noticed on a very slow computer (Pentium 166 MHz) that the command Top after a big Paste had not finished before the macro continued with the next command, so the next command was executed somewhere in the middle of the file. Such a synchronization problem in your macro would produce a very long, invalid file name and also partly destroyed file content!

Replace the section

Top
Clipboard 9
Paste
".csv"
SelectToTop
Cut


in your macro with


Clipboard 9
Paste
".csv"
StartSelect
Key HOME
Cut
EndSelect


With this modification the file name is created at the end of the new file instead of at the top, so UltraEdit does not have to move the cursor up to the top of the file. This works for your macro because the last line of the new file is always terminated with EOL character(s), so the file name is created on a blank line at the bottom of the file.

And I really hope that the last 2 columns of your CSV file never contain an escaped semicolon. Column text is enclosed in double quotes, so according to the CSV standard a ';' inside double-quoted column text should not be interpreted as a delimiter. The macro does not handle such exceptions.
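
To illustrate the quoting issue outside UltraEdit: Python's standard csv module treats a ';' inside double quotes as part of the field when it is given delimiter=";". The helper name csv_field_from_end below is hypothetical; this is only a sketch of how such rows could be parsed, not something the macro does.

```python
import csv
import io


def csv_field_from_end(line: str, fields_after: int = 0) -> str:
    """Parse one ';'-separated row with CSV quoting and return a field counted from the end."""
    row = next(csv.reader(io.StringIO(line), delimiter=";", quotechar='"'))
    return row[-1 - fields_after]


# The quoted field "a;b" contains a semicolon that is not a delimiter.
tricky = '20060828;"a;b";"c"'
quoted_aware = csv_field_from_end(tricky, fields_after=2)
# A plain split miscounts the fields on the same row.
naive = tricky.split(";")[-3]
```

With the tricky row above, the quote-aware parse sees three fields, while the plain split sees four and picks the wrong one.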

Re: Splitting based on content

Postby mikekiwi » Fri Oct 06, 2006 11:50 am

Well, now I can give you a very big "Muchos Gracias" from Holland! The file is just split-up into 198 separate parts without any error!

I just concatenated those parts into a new total file to check for missing records; the size is almost equal. I'm missing just three records, and the UW File Compare is already doing its very best to determine which ones are missing (I'm going to add them manually).

Thanks a lot. UE was already one of my favorite "couldn't do without" programs; this kind of options and support only makes that feeling stronger.
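
A record count check like this can also be done without concatenating the parts. The following Python sketch compares the number of lines in the source with the total over all split files; the function names and the glob pattern argument are assumptions for illustration, and the pattern must not match the source file itself.

```python
import glob


def count_lines(path: str) -> int:
    """Count lines by streaming the file, so large files do not need to fit in memory."""
    count = 0
    with open(path, "rb") as f:
        for _ in f:
            count += 1
    return count


def compare_counts(src_path: str, parts_pattern: str):
    """Return (lines in source, total lines over all files matching parts_pattern)."""
    total = sum(count_lines(p) for p in glob.glob(parts_pattern))
    return count_lines(src_path), total
```

If the two numbers differ, the difference is the number of records to look for; the sketch does not say which ones they are.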

Have a nice weekend Mofi!

Re: Splitting based on content

Postby LeoSchambach » Wed Oct 11, 2006 11:49 am

Adapted above mentioned code. Works fine :lol: for record length up to 2600.

With record lengths above 4500 only the first 12 records are ok, the rest give files (with correct names) of 0 bytes.

Already changed the configuration: maximum columns before line wraps to 20000, otherwise the line gets wrapped.

Here's my code (modified by Mofi - see posts below):

InsertMode
ColumnModeOff
HexOff
Bottom
IfColNum 1
Else
"
"
EndIf
Loop
GotoBookMark
IfEof
ExitLoop
Else
ToggleBookmark
Bottom
EndIf
EndLoop
Top
Find "field string to break line"
Replace All "field string to break line^p#!?#"

ToggleBookmark
Clipboard 9
Key HOME
Find "SOURCE1</fieldLabel><fieldvalue"
Key RIGHT ARROW
StartSelect
Find Select "_OMRA"
Copy
EndSelect
Key RIGHT ARROW
Loop
Find "^c"
IfFound
Key LEFT ARROW
Else
Key HOME
IfColNumGt 1
Key HOME
EndIf
Key DOWN ARROW
ToggleBookmark
Clipboard 8
GotoBookMarkSelect
Copy
EndSelect
ToggleBookmark
GotoBookMark
NewFile
Paste
Key UP ARROW
Key END
IfColNum 1
"Nothing selected or copied to clipboard! Macro execution stopped!

Check position in source file and content of active clipboard 8 and also of clipboard 9."
ExitMacro
Else
Key HOME
IfColNumGt 1
Key HOME
EndIf
Key DOWN ARROW
EndIf

Clipboard 9
Paste
".xml"
StartSelect
Key HOME
Cut
EndSelect
GetValue "Continue macro execution (0/1) ?"
Key LEFT ARROW
IfCharIs "0"
Key DEL
ExitMacro
Else
Key DEL
EndIf

Top
Find "^p#!?#"
Replace All ""

SaveAs "^c"
CloseFile
IfEof
ExitLoop
Else
Key HOME
Find "SOURCE1</fieldLabel><fieldvalue"
Key RIGHT ARROW
StartSelect
Find Select "_OMRA"
Copy
EndSelect
Key RIGHT ARROW
EndIf
EndIf
EndLoop
ToggleBookmark
ClearClipboard
Clipboard 8
ClearClipboard
Clipboard 0
Top
Find "^p#!?#"
Replace All ""

Re: Splitting based on content

Postby Mofi » Wed Oct 11, 2006 2:54 pm

You forgot to mention which version of UltraEdit you have.

A maximum columns number before wrap of 20,000 is possible since v10.10. Prior versions have the limit 4096.
The maximum bytes in clipboard or selected in a search with ^c or ^s is 30,000 since v9.20. You hopefully do not break this limit.

Your macro looks good. I could not see any mistake. So maybe UltraEdit really has a bug when the old 4096 limit is crossed.

Insert the following code after the 2 macro commands NewFile and Paste, before Clipboard 9:

Key UP ARROW
Key END
IfColNum 1
"Nothing selected or copied to clipboard! Macro execution stopped!

Check position in source file and content of active clipboard 8 and also of clipboard 9."
ExitMacro
Else
Key HOME
IfColNumGt 1
Key HOME
EndIf
Key DOWN ARROW
EndIf

With this additional code the macro will exit if nothing was pasted into the new file. Maybe you can see in the source file why.

Re: Splitting based on content

Postby LeoSchambach » Thu Oct 12, 2006 6:08 am

Version of UltraEdit is 10.10c.

Changed the max limit to 9000 characters => 62 get processed, rest with 0 bytes length. Macro doesn't stop when 0 bytes files are written.

Re: Splitting based on content

Postby Mofi » Thu Oct 12, 2006 7:39 am

This is getting more and more suspicious. The maximum columns before line wraps setting has an influence on how many blocks are successfully saved to a file? Sounds like a problem of v10.10c. In the history of UltraEdit v12.10b I found the following line for UltraEdit v11.10b:

Fixed heap corruption in undo buffer, specifically search/replace operations on files with long lines


Well, the undo buffer is not used here, but who knows!

Is your XML file an UTF-8 or UTF-16 file (Unicode editing) - see status bar at bottom of the UE window?

There are known issues with Unicode editing.

In your initial post I have merged my first debugging suggestion (the block from Key UP ARROW through its closing EndIf after Paste, originally shown in gray), which you had inserted correctly, with a new one (the GetValue block, originally shown in red). The new code now asks before every file whether to continue, so you can look at what the new file contains before it is saved and the macro continues.

But I think you are debugging here a bug of UltraEdit v10.10c.

Maybe you can break up the long lines with a search and replace in the source file at the top of the macro and undo it in every new file before saving. This depends on the content of your file. I have inserted a suggestion into the macro code in your initial post (the Find / Replace All pairs using the #!?# marker, originally shown in green).
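
The break-and-undo idea can be sketched in a few lines of Python. The marker #!?# is the one used in the macro code; the function names, the CRLF default and the example break string are assumptions for illustration, and the sketch assumes the marker never occurs in the real data.

```python
MARKER = "#!?#"


def break_lines(text: str, after: str, eol: str = "\r\n") -> str:
    """Insert a line break plus marker after every occurrence of `after`."""
    return text.replace(after, after + eol + MARKER)


def rejoin_lines(text: str, eol: str = "\r\n") -> str:
    """Undo break_lines by removing every line break that is followed by the marker."""
    return text.replace(eol + MARKER, "")


original = "<a>x</a><b>y</b>\r\n"
broken = break_lines(original, "</a>")
restored = rejoin_lines(broken)
```

Because only line breaks followed by the marker are removed, the real line terminators of the file survive the round trip.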

Re: Splitting based on content

Postby LeoSchambach » Thu Oct 12, 2006 8:48 am

Sorry, no luck at all.
When loading the file I get the question: Do you want to convert file xxx to DOS format. If I don't convert, the text UNIX is shown at the bottom; if I convert, the text DOS is shown.

When I run the macro with the question (0/1) to answer, it runs fine.
Leaving this question out, the same error occurs, even when splitting the line into two parts (and bringing the max record length in the config back to 3000).

I think I have run out of options now, so I will pass the files to a Unix box and do the splitting over there, as getting a new release of UltraEdit will take months.

Thanks for your effort !

Re: Splitting based on content

Postby Mofi » Thu Oct 12, 2006 11:10 am

Okay, no Unicode, only ASCII files with UNIX line endings. For avoiding troubles with Unix files on Windows you should set the config option Automatically convert to DOS format at General - Load/Save/Conversions - Unix/Mac file detection/conversion in the configuration dialog AND additionally Save file as input format (UNIX/MAC/DOS). With these settings you always edit in (WIN)DOS mode (good for copying and pasting with other applications), but save the file always in the same mode as it should be - UNIX or DOS.

When it runs fine with the question, it looks like a timing problem after the paste. You could try to help UE synchronize by inserting the following in place of the first debugging block (Key UP ARROW through its closing EndIf, originally shown in gray):

Top
Bottom
" "
Key BACKSPACE

Maybe this helps. If not, UE is shareware. You can download and install latest version and test the macro with v12.10b. You should only rename your existing UltraEdit program directory and also create a backup of the uedit32.* files in the Windows directory before you install temporarily the latest version 12.10b with the same target directory name as your v10.10c has had before rename.

After your test you can delete the program directory of the new version and delete the *.mfg, *.pfg, *.tfg and the uedit32.* files in the Windows directory, restore your uedit32.* backups and rename the UltraEdit program directory back. Then you have your registered version restored.

Re: Splitting based on content

Postby LeoSchambach » Fri Oct 13, 2006 7:04 am

Downloaded the latest version (12.20). Altered some settings in the config (pre version 11 bookmark style), but it still doesn't work.
Already did the job on a Unix box.
So I will stop investigating this now.
Thanks for the effort !

Re: Splitting based on content

Postby mikekiwi » Tue Nov 28, 2006 2:24 pm

Mofi and/or others of course :lol: ,

the macro did its job very well, and I'm still very happy with it. However, I tried to adjust it as I needed the same principle but with some other selection...

I now wanted to make the split based on the first eight characters of every line. That's the date field and using those positions I planned to make 300 files out of 1 total file (2.5 million records, 450 mb in size).
It seemed to work o.k., but halfway it stopped working, made several bookmarks within one date-selection and then created one huge file for all dates coming after that one.

Any idea what went wrong? I can't find any discrepancy in the file itself and I'm now in doubt if the splits that were made, are o.k.
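
For comparison, here is a minimal Python sketch of grouping rows by their first eight characters. The names prefix_key and group_by_prefix are hypothetical. The sketch compares only the start of each line, whereas a plain text search for the date string would also match the same digits elsewhere in a line; whether that difference explains the behaviour described above is not established here.

```python
from itertools import groupby


def prefix_key(line: str, width: int = 8) -> str:
    """Key of a row: its first `width` characters (the date field at the start of the line)."""
    return line[:width]


def group_by_prefix(lines, width: int = 8):
    """Return (key, rows) pairs for each run of consecutive rows sharing the same prefix."""
    return [(k, list(rows)) for k, rows in groupby(lines, key=lambda l: prefix_key(l, width))]


rows = ["20060828;a;20060829\n", "20060828;b\n", "20060829;c\n"]
groups = group_by_prefix(rows)
```

In the sample rows, the first line contains the next day's date in a later field, but it still lands in the group of its own leading date.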

Here's the code I used, can someone please check this one for mistakes?

Thanks once again,
Michael

InsertMode
ColumnModeOff
HexOff
Bottom
IfColNum 1
Else
"
"
EndIf
Loop
GotoBookMark
IfEof
ExitLoop
Else
ToggleBookmark
Bottom
EndIf
EndLoop
Top
ToggleBookmark
Clipboard 9
Key HOME
StartSelect
Key RIGHT ARROW
Key RIGHT ARROW
Key RIGHT ARROW
Key RIGHT ARROW
Key RIGHT ARROW
Key RIGHT ARROW
Key RIGHT ARROW
Key RIGHT ARROW
Copy
EndSelect
Loop
Find "^c"
IfFound
Key LEFT ARROW
Else
Key HOME
IfColNumGt 1
Key HOME
EndIf
Key DOWN ARROW
ToggleBookmark
Clipboard 8
GotoBookMarkSelect
Copy
EndSelect
ToggleBookmark
GotoBookMark
NewFile
Paste
Clipboard 9
Paste
".txt"
StartSelect
Key HOME
Cut
EndSelect
SaveAs "^c"
CloseFile
IfEof
ExitLoop
Else
Key HOME
StartSelect
Key RIGHT ARROW
Key RIGHT ARROW
Key RIGHT ARROW
Key RIGHT ARROW
Key RIGHT ARROW
Key RIGHT ARROW
Key RIGHT ARROW
Key RIGHT ARROW
Copy
EndSelect
Key RIGHT ARROW
EndIf
EndIf
EndLoop
ToggleBookmark
ClearClipboard
Clipboard 8
ClearClipboard
Clipboard 0
