Cleaning a SQL dump

Help with writing and playing macros

Cleaning a SQL dump

Postby RinGo » Sun Dec 31, 2006 6:58 pm

Hello,

I have sql dump that must to be "cleaned", below is a sample of the text.

Original Text (before edit)

Code: Select all
', 0, -1, 0, 0);
INSERT INTO `thread_id` VALUES(9, 'Bim Bousman', 'Hello', '2003-04-18 12:27:00', '<table border=0 cellpadding=2 cellspacing="0"><tr><td>
<pre>Hello All,

Just want said hello to everyone.

Bim Bousman
Honolulu
************************
From: &lt;<a href=forum/postID=ewrjw_-O7iw4x2sTK0kw8a1-OvSC8_P9qOb9_i80z0HJ4hjYAhmPsSusItNqv7EPSk9b0Sh-PLjtXhEVLViUO8qZnPpALg"></a>&gt;
To: &lt;<a href="forum/postID=ewrjw_-O7iw4x2sTK0kw8a1-OvSC8_P9qOb9_i80z0HJ4hjYAhmPsSusItNqv7EPSk9b0Sh-PLjtXhEVLViUO8qZnPpALg"></a>&gt;
Sent: Thursday, April 17, 2003 9:50 PM
Subject: Hello


&gt;
&gt;
&gt;
&gt; Your advertisement here.Click here for more information.
&gt;
&gt;
&gt;<br></pre>

', 0, -1, 0, 0);
INSERT INTO `thread_id` VALUES(10, 'Ferromundo', 'Hello','2003-04-20 20:11:00', '<table border=0 cellpadding=2 cellspacing="0"><tr><td>
<table bgColor="#ffffff"><tr><td>
<div><font face="Arial" size="2">Welcome and enjoy your time.

Thanks
Ferromundo

*********
Avertisement
here
Click mor
e details.
*********
', 0, -1, 0, 0);
INSERT INTO `thread_id` VALUES(11, 'Stanley', 'Hello','2003-04-21 16:22:00', '<table border=0 cellpadding=2 cellspacing="0"><tr><td>
<pre>Hi Ferromundo,

We miss you.When
you can
do war again.Call me after
work hours.

regards
Stanley
=========
Advertisement here
Click more
details.
=========


Then After Edit, I want it like below.

Code: Select all
', 0, -1, 0, 0);
INSERT INTO `thread_id` VALUES(9, 'Bim Bousman', 'Hello', '2003-04-18 12:27:00', '
Hello All,

Just want said hello to everyone.

Bim Bousman
Honolulu
', 0, -1, 0, 0);
INSERT INTO `thread_id` VALUES(10, 'Ferromundo', 'Hello','2003-04-20 20:11:00', '
Welcome and enjoy your time.

Thanks
Ferromundo
', 0, -1, 0, 0);
INSERT INTO `thread_id` VALUES(11, 'Stanley', 'Hello','2003-04-21 16:22:00', '
Hi Ferromundo,

We miss you.When you can do war again.Call me after work hours.

regards
Stanley


Because this is SQL dump, there have many sensitive character on it (like ',; etc).
So anybody here knows how to do that, I try find and replace one by one every character but there to many different word on it.

Or if interested I am offering this job to ultraedit developer or other members here who expert using ultraedit.

I have a lot sql dump must to be "cleaned" so this job NOT FREE, I will paid for it.

Regards,
Ringo
User avatar
RinGo
Newbie
 
Posts: 1
Joined: Sun Dec 31, 2006 12:00 am

Re: Cleaning a SQL dump

Postby Mofi » Tue Jan 02, 2007 4:56 pm

Here is a macro which does the job for your example. The macro is designed to run on a file with MS-DOS line terminations. If your SQL dump is in Unix format opened in Unix mode without temporary conversion to DOS (which is never a good setting on Windows) you must replace every ^p (means CRLF - carriage return and line-feed) with ^n (means LF only).

First the macro trims all trailing spaces on the whole file to be able to work usefully with ^p.

The first loop finds '<table border=0 with or without preceding space(s) and if found expands the selection to the position where after the character > a character is following which is not the character < or a line termination. Additionally the second find in first loop also selects 0 or more occurences of spaces and tabs after last >. The selection is reduced by 1 character - the character after last > if there are no spaces/tabs following - and then deletes the selection. If now the current character is either a space or a tab, delete this character too. Because it is important and deleted before a space with a following ' and a line termination is inserted. So this loop should delete the HTML tags before the email content.

The second loop finds 1 or more line terminations followed by at least 5 * or =. If such a string is found, it is first replaced by a single line termination, before everything to next SQL entry or end of file is also selected and then deleted. This loop should delete the advertisements in the emails.

The third loop finds with an extremly strange but because of some bugs only working regular expression in UltraEdit style 3 successive lines where none of the 3 lines start with a '. Such a block is expanded to end of the paragraph without including the last character of the paragraph and then converted with ReturnToWrap to a single line. Don't know if this is really needed, but your example has such a modified email content line.

The macro property Continue if a Find with Replace not found must be checked for this macro.

InsertMode
ColumnModeOff
HexOff
UnixReOff
Top
TrimTrailingSpaces
Loop
Find RegExp " ++'<table border=0"
IfNotFound
ExitLoop
EndIf
StartSelect
Find RegExp ">[~<^p][ ^t]++"
Key LEFT ARROW
Delete
EndSelect
IfCharIs 32
Delete
EndIf
IfCharIs 9
Delete
EndIf
" '
"
EndLoop
Top
Loop
Find RegExp "[^p]+^{^*^*^*^*^*^}^{=====^}"
IfNotFound
ExitLoop
EndIf
"
"
StartSelect
Find "', 0, -1, 0, 0);"
IfSel
Key HOME
Else
SelectToBottom
EndIf
Delete
EndLoop
Top
Loop
Find RegExp "%[~'^p][~^p]+^p[~'^p][~^p]+^p[~'^p]"
IfNotFound
ExitLoop
EndIf
StartSelect
Find "^p^p"
Key UP ARROW
Key END
Key LEFT ARROW
ReturnToWrap
EndSelect
EndLoop
Top

Add UnixReOn or PerlReOn (v12+ of UE) at the end of the macro if you do not use UltraEdit style regular expressions by default - see search configuration. Macro command UnixReOff sets the regular expression option to UltraEdit style.
User avatar
Mofi
Grand Master
Grand Master
 
Posts: 4042
Joined: Thu Jul 29, 2004 11:00 pm
Location: Vienna

Re: Cleaning a SQL dump

Postby UltraBoG » Thu Mar 22, 2007 8:59 pm

A really quick and dirty way to strip the HTML markups from the text is to display the file in a browser, then "select all" the displayed text in the browser window (Ctrl-A), and then copy/paste the selection into an open and ready UltraEdit window.

Usually this works for me. Your mileage may vary.

"share and enjoy"
User avatar
UltraBoG
Newbie
 
Posts: 4
Joined: Thu Nov 09, 2006 12:00 am


Return to Macros