Search and Replace to remove node from XML file


Post by steven_reid » Fri Oct 17, 2008 8:24 am

Hi All,

I have some large XML files, from 80 MB to 250 MB, that I need to extract some data from. Because of their size, I cannot open the files successfully to convert the data.
A number of nodes are not required for this 'current' exercise, so I am thinking that a way around the problem is to remove these nodes.

The basic format is as follows:

<adviser>
<clients>
<client>
<title>
<key>MR</key>
<description>Mr</description>
</title>
<firstName>Anthony</firstName>
<surname>Winkler</surname>
<preferredName>Tony</preferredName>
<dateOfBirth>1982-11-01 12:00:00.0</dateOfBirth>
<gender>M</gender>
<notes>
<note>
<id>250664</id>
<noteTopic>
<refType>
<fieldLocation>
<key>CLIENT_NOTES</key>
<description>Notes</description>
</fieldLocation>
<key>NOTE_TOPIC</key>
<description>Topic</description>
</refType>
<key>ADVICE</key>
<description>Advice</description>
<id>4417</id>
</noteTopic>
<description>Advice</description>
<creationDate>2008-04-17 12:00:00.0</creationDate>
<noteText>Advice scopes and strategies: &lt;br /&gt;1. Income protection / Salary continuance - Income protection: ACCEPTED&lt;br /&gt;2. Life insurance - Consolidate your Debts: ACCEPTED&lt;br /&gt;3. TPD - Consolidate your Debts: ACCEPTED&lt;br /&gt;4. Trauma - Consolidate your Debts: ACCEPTED&lt;br /&gt;&lt;br /&gt;Advice recommendations: ACCEPTED ALL RECOMMENDATIONS&lt;br /&gt;&lt;br /&gt;Notes:
</noteText>
<isStandard>true</isStandard>
<attachments>
<attachmentsItem>
<id>161159</id>
<attachmentType>
<key>SERVER</key>
<description>Server file system</description>
</attachmentType>
<fileName>Std-SoA-17-Dec-07_5667_27001_205336.doc</fileName>
<fileSize>190</fileSize>
<creationDate>2008-04-17 02:27:25.0</creationDate>
</attachmentsItem>
</attachments>
</note>
</notes>
</client>
</clients>
</adviser>


What I would like to do is a search and replace that removes the whole notes node, from <notes> to </notes>, leaving:

<adviser>
<clients>
<client>
<title>
<key>MR</key>
<description>Mr</description>
</title>
<firstName>Anthony</firstName>
<surname>Winkler</surname>
<preferredName>Tony</preferredName>
<dateOfBirth>1982-11-01 12:00:00.0</dateOfBirth>
<gender>M</gender>
</client>
</clients>
</adviser>

Any ideas?

Thanks in advance

Steve
steven_reid
Newbie
 
Posts: 3
Joined: Fri Oct 17, 2008 7:48 am

Re: Search and Replace to remove node from XML file

Post by Jane » Sat Oct 18, 2008 2:07 am

Using the Perl regex engine, search for:

(?s)<notes>.+</notes>\r\n

and replace with nothing (leave the replace field empty).

That should do the trick.
Normally, if the file contained more than one <notes> ... </notes> block, this pattern would be greedy and match everything from the first <notes> to the last </notes>. Because of a bug in UltraEdit's multiline support, however, it acts lazy and gives the result you want.

Works for me using UE ver 13.20+2.
There may be multiline support in ver 14, but you have not indicated what version you are using.
Jane
Jane
Basic User
Posts: 22
Joined: Sat Aug 05, 2006 11:00 pm
Location: Canada

Re: Search and Replace to remove node from XML file

Post by steven_reid » Sat Oct 18, 2008 4:45 pm

Hi Jane,

Thanks heaps for that!!
I am using 14.20 and it is picking up the selected node.

Are you saying it 'could' just pick up from the first <notes> to the last </notes> in the file?

There are multiple sets of the <notes> node in the file, and within the <notes> node there can be carriage returns e.g.
<adviser>
<clients>
<client>
<notes>
<note>stuff in here
can be many lines
<br>can be any html source code

</note>
<note>
</note>
</notes>
</client>
<client>
<notes>
<note>
</note>
<note>
</note>
</notes>
</client>
<client>
<notes>
<note>
</note>
</notes>
</client>
</clients>
</adviser>


Thanks again
Steve

Re: Search and Replace to remove node from XML file

Post by pietzcker » Sun Oct 19, 2008 10:07 am

Theoretically, the way the regex is written now, it should match everything from the first <notes> to the last </notes>, because + is a greedy quantifier. Because of a bug in UE's regex engine, the + loses its greediness when multiple lines are involved. So at the moment it works, but if IDM (or Boost, who provide the regex library) fix this bug, it won't work anymore. The "lazy" version of the search regex would be (?s)<notes>.+?</notes>\r\n. This should always work but might be a little slower.
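
To make the greedy/lazy difference concrete, here is a minimal sketch outside UltraEdit, using Python's re module instead of UE's Boost-based engine. The sample text and variable names are made up for illustration; only the two patterns come from this thread.

```python
import re

# Hypothetical sample: two clients, each with its own <notes> block,
# using \r\n line endings as the patterns in this thread expect.
sample = (
    "<client>\r\n"
    "<gender>M</gender>\r\n"
    "<notes>\r\n<note>first</note>\r\n</notes>\r\n"
    "</client>\r\n"
    "<client>\r\n"
    "<gender>F</gender>\r\n"
    "<notes>\r\n<note>second</note>\r\n</notes>\r\n"
    "</client>\r\n"
)

# Greedy: .+ runs to the LAST </notes> in the text.
greedy = re.compile(r"(?s)<notes>.+</notes>\r\n")
# Lazy: .+? stops at the FIRST </notes> after each <notes>.
lazy = re.compile(r"(?s)<notes>.+?</notes>\r\n")

greedy_result = greedy.sub("", sample)
lazy_result = lazy.sub("", sample)

print(repr(greedy_result))  # the second client's data is swallowed too
print(repr(lazy_result))    # only the two <notes> blocks are removed
```

In an engine without the bug, the greedy pattern also deletes everything between the two blocks, which is why the lazy form is the safer choice.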

The moment you really run into trouble is if <notes> tags can be nested. Regular expressions are not able to deal with arbitrarily nested structures.
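
If nesting ever did occur, one possible workaround (a sketch, not something UltraEdit's search and replace does) is to find the tags with a regex but track the nesting depth in code. The function name strip_notes is hypothetical, and the sketch assumes the tags appear exactly as <notes> and </notes>, without attributes, and that the whole text fits in memory.

```python
import re

# Matches only the bare opening and closing tags (assumption: no attributes).
TAG = re.compile(r"</?notes>")


def strip_notes(text: str) -> str:
    """Remove every outermost <notes>...</notes> span, nested or not."""
    kept = []
    depth = 0  # how many <notes> elements are currently open
    pos = 0    # start of the text that has not been copied yet
    for match in TAG.finditer(text):
        if match.group() == "<notes>":
            if depth == 0:
                # Keep everything before the outermost opening tag.
                kept.append(text[pos:match.start()])
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:
                # Resume copying after the matching closing tag.
                pos = match.end()
    if depth != 0:
        raise ValueError("unclosed <notes> element")
    kept.append(text[pos:])
    return "".join(kept)


nested = "<a><notes><note><notes>x</notes></note></notes><b/></a>"
print(strip_notes(nested))
# The lazy regex alone stops at the first </notes> and leaves debris behind.
print(re.sub(r"(?s)<notes>.+?</notes>", "", nested))
```

The counter is the piece a plain regular expression cannot provide: it remembers how many opening tags are still waiting for their close.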
pietzcker
Master
Posts: 241
Joined: Sun Aug 22, 2004 11:00 pm

Re: Search and Replace to remove node from XML file

Post by steven_reid » Sun Oct 19, 2008 5:46 pm

Thanks for the explanation.

The <notes></notes> elements can't be nested (luckily :-) )

Re: Search and Replace to remove node from XML file

Post by Jane » Thu Oct 23, 2008 2:21 pm

Thanks for explaining, Tim. I should have included the lazy .+?, but I find that in long searches it tends to be a bit slower due to backtracking. However, my advice, which depends on a bug in UltraEdit to get faster results, is probably not the best long-term advice.
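
For files of this size, another option that avoids regex backtracking altogether is a line-by-line filter outside the editor. The following is only a sketch: it assumes <notes> and </notes> each sit alone on their own line, as in the samples above, and that the blocks are not nested. The function names, the UTF-8 encoding, and the file paths are all hypothetical.

```python
import io
from typing import Iterable, Iterator


def drop_notes_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield every line except those from a <notes> line through its </notes> line."""
    skipping = False
    for line in lines:
        tag = line.strip()
        if skipping:
            # Inside a notes block: discard lines until the closing tag.
            if tag == "</notes>":
                skipping = False
        elif tag == "<notes>":
            skipping = True
        else:
            yield line


def filter_file(src_path: str, dst_path: str) -> None:
    """Stream src_path to dst_path without the notes blocks (paths are hypothetical)."""
    # newline="" leaves the original line endings untouched on read and write.
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.writelines(drop_notes_lines(src))


# Small in-memory demonstration.
demo = io.StringIO(
    "<client>\n<gender>M</gender>\n<notes>\n<note>\ntext\n</note>\n</notes>\n</client>\n"
)
cleaned = "".join(drop_notes_lines(demo))
print(cleaned)
```

Because only one line is held in memory at a time, the size of the file should not matter much with this approach.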

