XML text cleaning by regular expressions

Post by KOFN » Thu Dec 20, 2007 9:13 pm

Hello!

Please help me solve this problem:
how can I clean an XML file of all the data outside the tags?
I am sure there are thousands of ways to do it, but so far I have failed.

Let's see how it can be done using regular expressions in UltraEdit.

Here is an example of the XML file structure I have:

BEFORE THE CLEANING:

text to clean
<wordA>
<wordB>

text to clean
<wordX>useful text X</wordX> no need to clean this
text to clean

text to clean
text to clean
<wordY>useful text Y</wordY>
no need to clean this <wordZ>useful text Z</wordZ>
text to clean
</wordC>


DESIRED RESULT (after cleaning):

<wordA>
<wordB>
<wordX>useful text X</wordX>
no need to clean this
<wordY>useful text Y</wordY>
no need to clean this <wordZ>useful text Z</wordZ>
</wordC>



In the example above:
* wordA, wordB, wordC, wordX, wordY, wordZ are any words.
* "useful text X" is any text
* "no need to clean this" is any text

I cannot be sure that every useful line begins with "<". A useful line may start with spaces or even junk text, which I do not need to clean out.


Here is one idea:
remove all the lines which do not contain <*>

The following expression removes all the lines that do contain tags. I want to do exactly the opposite:
Find: "%*<*>*^p"
Replace with: ""

Is it possible with regular expressions?




P.S. I use UltraEdit 11.10+1, but if you have suggestions for other ways to solve this, I would be interested to see them. Maybe this kind of cleaning is a common feature in some other software? (Suggestions for macros are also welcome and appreciated.)

P.S. 2: It is not my priority, but I'm curious - is it somehow possible to obtain this with regular expressions:
<wordA>
<wordB>
<wordX>useful text X</wordX>
<wordY>useful text Y</wordY>
<wordZ>useful text Z</wordZ>
</wordC>


Thank you!

Re: XML text cleaning by regular expressions

Post by Mofi » Fri Dec 21, 2007 9:12 am

For your first need, deleting all lines which do not contain <*>, I suggest the following macro. It works only for files with DOS line endings because of ^p. It also deletes all blank lines (with a dirty trick).

The macro property "Continue if a Find with Replace not found" must be checked for this macro.

InsertMode
ColumnModeOff
HexOff
UnixReOff
Bottom
IfColNumGt 1
"
"
EndIf
Top
Find RegExp "%#"
Replace All "MaRkErChAr"
Find RegExp "%^(*<*>^)"
Replace All "#^1"
Loop
Find RegExp "%[~#]*^p"
Replace All ""
IfNotFound
ExitLoop
EndIf
EndLoop
Find RegExp "%#"
Replace All ""
Find MatchCase RegExp "%MaRkErChAr"
Replace All ""
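
For readers who want to try the idea outside the macro environment, here is a minimal sketch of the same filtering step in plain JavaScript. The function name keepTagLines and the sample text are made up for illustration; the test /<.*>/ is only a rough equivalent of the UltraEdit pattern <*>, and unlike the macro the sketch accepts both DOS and Unix line endings.

```javascript
// Hypothetical helper, not part of the macro above: keep only the lines that
// contain something shaped like a tag, i.e. "<", any characters, then ">".
function keepTagLines(text) {
  return text
    .split(/\r?\n/)                                   // DOS or Unix line endings
    .filter(function (line) { return /<.*>/.test(line); })
    .join("\r\n");                                    // blank lines are dropped too
}

// Shortened version of the sample from the first post.
var sample = [
  "text to clean",
  "<wordA>",
  "",
  "<wordX>useful text X</wordX> no need to clean this",
  "text to clean",
  "</wordC>"
].join("\r\n");

console.log(keepTagLines(sample));
```

Like the macro, this keeps the whole line as soon as it contains one tag, so text before or after the tag on that line survives.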

If you also want the text before the first tag and the text after the last tag on each line deleted (your P.S. 2), that's no problem. Simply append the following 4 lines to the macro above and you will get it.

Find RegExp "%[~<^p]+<"
Replace All "<"
Find RegExp ">[~<>^p]+$"
Replace All ">"
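
The two replaces can be sketched in plain JavaScript as follows. This is only an illustration under assumptions: the function name stripOutsideTags is made up, and the two patterns are hand translations of the UltraEdit expressions applied one line at a time.

```javascript
// Hypothetical helper: per line, drop the text in front of the first "<"
// and the text behind the last ">".
function stripOutsideTags(text) {
  return text
    .split(/\r?\n/)
    .map(function (line) {
      return line
        .replace(/^[^<]+</, "<")     // counterpart of %[~<^p]+<
        .replace(/>[^<>]+$/, ">");   // counterpart of >[~<>^p]+$
    })
    .join("\r\n");
}

console.log(stripOutsideTags(
  "<wordX>useful text X</wordX> no need to clean this\r\n" +
  "no need to clean this <wordZ>useful text Z</wordZ>"
));
```

Text between two tags on the same line is not touched by either pattern, so the content of an element such as "useful text X" is kept.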

Re: XML text cleaning by regular expressions

Post by jorrasdk » Fri Dec 21, 2007 9:53 am

I will start by apologizing, because the solution I suggest only works in UE 13 and above. But you did write "...but please if you have any other suggestions about different methods to solve this, it will be interesting to see..." :-)

In UE 13 the JavaScript environment supports ECMAScript for XML (E4X), which makes it possible to work with the XML tree. But first the original example must at least have balanced tags:

<wordA/>
<wordB>
  text to clean
  <wordX>useful text X</wordX> no need to clean this
text to clean
 
text to clean
text to clean
<wordY>useful text Y</wordY>
no need to clean this <wordZ>useful text Z</wordZ>
  text to clean
</wordB>
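
Whether a file has balanced tags can be checked roughly before running an XML parser on it. The following is a minimal sketch in plain JavaScript, not E4X; the function name tagsAreBalanced is made up, and the pattern is a simplification that ignores comments, CDATA sections, processing instructions and ">" inside attribute values.

```javascript
// Hypothetical helper: true if every opening tag is closed by a matching
// closing tag in the right order. Self-closing tags like <wordA/> are skipped.
function tagsAreBalanced(text) {
  var tagPattern = /<(\/?)([A-Za-z_][\w.\-]*)[^<>]*?(\/?)>/g;
  var stack = [];
  var match;
  while ((match = tagPattern.exec(text)) !== null) {
    var isClosing = match[1] === "/";
    var name = match[2];
    var isSelfClosing = match[3] === "/";
    if (isSelfClosing) {
      continue;                       // nothing to open or close
    }
    if (isClosing) {
      if (stack.pop() !== name) {     // wrong or missing opening tag
        return false;
      }
    } else {
      stack.push(name);
    }
  }
  return stack.length === 0;          // leftover entries are unclosed tags
}

// Structure of the original example, then of the balanced variant.
console.log(tagsAreBalanced("<wordA>\n<wordB>\n<wordX>x</wordX>\n</wordC>"));
console.log(tagsAreBalanced("<wordA/>\n<wordB>\n<wordX>x</wordX>\n</wordB>"));
```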


And now the script. I hope I have put enough comments in the script to explain what happens:

// Misc options for global XML object:
// http://developer.mozilla.org/en/docs/E4X_Tutorial:The_global_XML_object
XML.ignoreComments = false;
XML.ignoreProcessingInstructions = false;
XML.ignoreWhitespace = true;
XML.prettyPrinting = true;
XML.prettyIndent = 2;

// Select the entire XML document
UltraEdit.activeDocument.selectAll();

// Assuming no root tag, we supply one:
var dirtyXML = "<cleanXMLroot>"+UltraEdit.activeDocument.selection+"</cleanXMLroot>";

// Try and create a XML object
try {
  var xml = new XML( dirtyXML );
 
  // run through all xml nodes at this level:
  traverseSubnodes(xml);
 
  // Write xml back with the now deleted text nodes
  // Note: toXMLString() is invoked on the XMLList just below the
  //       artificial root tag (cleanXMLroot): xml.*
  UltraEdit.activeDocument.write( xml.*.toXMLString() );

}
catch (exc) {
  // Unselect text
  UltraEdit.activeDocument.top();

  // Write XML error text
  UltraEdit.messageBox(exc.toString(),"XML error");
}


function traverseSubnodes(xmlNode) {

  // Obtain xmlNodes just below the input node as a XMLList object
  var subNodes = xmlNode.*;

  // First run through all nodes and delete text nodes at this level.
  // "var i" keeps the loop index local to each (recursive) call.
  for (var i in subNodes) {
    if(subNodes[i].nodeKind()=="text") {
      delete subNodes[i];
    }
  }

  // Next: Go deeper in the xml tree for nodes that are complex type:
  for (var i in subNodes) {
    if(subNodes[i].nodeKind()=="element") {

      // Yup: This one is complex = children
      if(subNodes[i].hasComplexContent()) {
        // go deeper
        traverseSubnodes(subNodes[i]);
      }
    }
  }
}


The script will produce the following output:

<wordA/>
<wordB>
  <wordX>useful text X</wordX>
  <wordY>useful text Y</wordY>
  <wordZ>useful text Z</wordZ>
</wordB>