Friday, June 26, 2015

Node XML Parsing

XML. Why doesn't it stand for eXtinct Markup Language? It's one of those unavoidable unpleasantries of software development. Sooner or later, you're going to have to deal with it.

Since starting to work with Node, I've been able to ignore XML for quite a while. Most JavaScripty things use the much nicer JSON format for their data. However, this past week I finally had to break down and get my hands dirty with XML in Node. It wasn't that bad...

I resorted to this because I wanted to pre-process the SVG images that I upload in the Amphibian.com editor. When I create SVG images with Inkscape, they never have viewBox attributes - which are absolutely necessary for correct display in Internet Explorer. I got tired of adding them manually before uploading and decided that the system should just do it for me. Because the SVG image format is just XML, all I need to do is parse the document and look for the viewBox attribute. If it's not there, I can create it using the values of the width and height attributes.

There are two main ways of dealing with XML. One is using a SAX parser. SAX parsing is basically event-driven and is most useful when you need efficient read-only access to XML documents. The other way is using a DOM parser. DOM parsers build Document Object Models out of the XML and therefore enable random read/write access at the cost of having to store the entire document in memory. Because I need to alter the XML documents, I selected a DOM parsing approach.

I am using the xmldom package. I selected it primarily because it is a pure JavaScript implementation. Some of the XML parser packages for Node depend on compiled native libraries, which makes life difficult when you run on multiple platforms like I do.

Here is some example code that does the same thing my web app does - parse an SVG, look for some attributes, potentially add a missing one, and output the SVG. This example reads the SVG from a file instead of from a web form upload.

var xmldom = require("xmldom");
var fs = require("fs");

var DOMParser = xmldom.DOMParser;
var XMLSerializer = xmldom.XMLSerializer;

var svgString = fs.readFileSync("frog.svg", { encoding: "utf8" });

var svgDoc = new DOMParser().parseFromString(svgString, "image/svg+xml");
var root = svgDoc.documentElement;

var svgWidth = root.getAttribute("width");
var svgHeight = root.getAttribute("height");
var svgBox = root.getAttribute("viewBox");

if (svgBox === "") {
    root.setAttribute("viewBox", 
       "0 0 " + svgWidth + " " + svgHeight);
}

console.log(new XMLSerializer().serializeToString(svgDoc));


And this is an example of the SVG XML in the file frog.svg. I cut out most of it to save space here - the only part this example cares about is the root element.

<svg xmlns="http://www.w3.org/2000/svg" width="276" height="281">
    ...
</svg>

Let's break down the steps here. First, require the xmldom module on line 1. That is used to get the DOMParser and XMLSerializer constructors on lines 4 and 5. The DOMParser is obviously the parser, but the XMLSerializer is what you need if you intend to dump an XML DOM back out to a string.

Line 7 reads the contents of the SVG file into a string. DOMParser MUST have a string to function, NOT a buffer. Specifying the encoding when reading forces the return value from readFileSync to be a string, but other steps may be needed if you are getting the SVG via another mechanism.
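For example, an upload handler might hand you a Buffer rather than a string. Here is a minimal sketch of one way to normalize the input before parsing. The helper name toXmlString is made up for this example, and it assumes the data is UTF-8, which may not be true for every SVG you come across.

```javascript
// Hypothetical helper, not part of xmldom: turn a Buffer (or anything else)
// into the string that DOMParser expects. Assumes UTF-8 encoded data.
function toXmlString(input) {
    var text = Buffer.isBuffer(input) ? input.toString("utf8") : String(input);

    // Some editors put a byte order mark at the start of UTF-8 files.
    // It is not part of the markup, so drop it before parsing.
    return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}
```

The result can then be passed to parseFromString just like svgString in the example above.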

Line 9 parses the SVG string into a document. In DOM language, the root node, svg in this case, is known as the documentElement. I set that to a variable named root merely for convenience on line 10.

Lines 12, 13, and 14 are where I read the values of the various attributes. The interesting thing about the xmldom package is how it handles missing attributes. Note that on line 16 I have to check to see if the svgBox variable is the empty string instead of null. If the attribute is not present in the document, getting it returns an empty string. I found this to be counter-intuitive as a null would seem to make more sense. But it is what it is...
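If you would rather not depend on what getAttribute returns for a missing attribute, one option is to ask the element directly with hasAttribute, which is part of the standard DOM Element interface. Here is a minimal sketch of that idea. The helper names toUserUnits and ensureViewBox are made up for this example and are not part of xmldom. The sketch also assumes width and height are plain numbers or px values, and it leaves the element alone otherwise.

```javascript
// Hypothetical helpers, not part of xmldom.

// Convert an attribute value such as "276" or "276px" to a number.
// Returns null for anything else (missing, "100%", "5cm", ...).
function toUserUnits(value) {
    var match = /^\s*(\d*\.?\d+)(px)?\s*$/.exec(value || "");
    return match ? parseFloat(match[1]) : null;
}

// Add a viewBox to the given root element if it does not already have one.
// Returns true only when the attribute was actually added.
function ensureViewBox(root) {
    if (root.hasAttribute("viewBox")) {
        return false;
    }

    var width = toUserUnits(root.getAttribute("width"));
    var height = toUserUnits(root.getAttribute("height"));
    if (width === null || height === null) {
        // Not enough information to build a sensible viewBox.
        return false;
    }

    root.setAttribute("viewBox", "0 0 " + width + " " + height);
    return true;
}
```

With the example above, this would be called as ensureViewBox(svgDoc.documentElement) before serializing.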

Lines 17 and 18 call setAttribute on the root element to add the viewBox if it was not found. Line 21 uses the XMLSerializer to create a string out of the (possibly modified) SVG DOM and logs it to the screen. If you try it yourself, you should see that the viewBox attribute is in fact added if not present in the original document.

So I processed a little XML with Node and it wasn't a completely terrible experience. It was actually just as nice as any Java XML DOM parsers I've used recently. Probably nicer. The best part is now I can save myself all that time spent manually adding in viewBox attributes to frog pictures. While we're on that subject, read and share today's comic:

Amphibian.com comic for 26 June 2015