Parsing XML with JavaScript

2015-08-29

I’ve recently started a new side project that involves parsing content out of XML documents using browser JavaScript.

Originally, I was just using the standard browser APIs to dig data out of the XML, but this got very tedious very quickly, and handling new edge cases [1] became more and more complex.

So, I decided to see what solutions existed for converting these XML document trees to JavaScript objects. As it turns out, there’s not much. The best tool I found (and one I’ll look to in future Node projects) is feedparser. However, feedparser relies on Node’s stream system, which doesn’t exist in the browser.

So, in the fine NIH [2] tradition of JavaScript developers everywhere, I wrote something to do the job myself. Hopefully someone else can find it useful at some point.

It’s only got one dependency at the moment (Lodash), although I might refactor that out in the future. For now, it makes a few things much easier.

But that’s enough preamble, let’s get to the code, such as it is right now:

// flattens an object (recursively!), similarly to Array#flatten
// e.g. flatten({ a: { b: { c: "hello!" } } }); // => "hello!"
function flatten(object) {
  var check = _.isPlainObject(object) && _.size(object) === 1;
  return check ? flatten(_.values(object)[0]) : object;
}

function parse(xml) {
  var data = {};

  var isText = xml.nodeType === 3,
      isElement = xml.nodeType === 1,
      body = xml.textContent && xml.textContent.trim(),
      hasChildren = xml.children && xml.children.length,
      hasAttributes = xml.attributes && xml.attributes.length;

  // if it's text just return it
  if (isText) { return xml.nodeValue.trim(); }

  // if it doesn't have any children or attributes, just return the contents
  if (!hasChildren && !hasAttributes) { return body; }

  // if it doesn't have children but _does_ have body content, we'll use that
  if (!hasChildren && body.length) { data.text = body; }

  // if it's an element with attributes, add them to data.attributes
  if (isElement && hasAttributes) {
    // xml.attributes is an array-like NamedNodeMap, so each value is an Attr
    data.attributes = _.reduce(xml.attributes, function(obj, attr) {
      obj[attr.name] = attr.value;
      return obj;
    }, {});
  }

  // recursively call #parse over children, adding results to data
  _.each(xml.children, function(child) {
    var name = child.nodeName;

    // if we've not come across a child with this nodeName, add it as an object
    // and return here
    if (!_.has(data, name)) {
      data[name] = parse(child);
      return;
    }

    // if we've encountered a second instance of the same nodeName, make our
    // representation of it an array
    if (!_.isArray(data[name])) { data[name] = [data[name]]; }

    // and finally, append the new child
    data[name].push(parse(child));
  });

  // if we can, let's fold some attributes into the body
  _.each(data.attributes, function(value, key) {
    if (data[key] != null) { return; }
    data[key] = value;
    delete data.attributes[key];
  });

  // if data.attributes is now empty, get rid of it
  if (_.isEmpty(data.attributes)) { delete data.attributes; }

  // simplify to reduce number of final leaf nodes and return
  return flatten(data);
}
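
As an aside on that Lodash dependency: flatten is probably the easiest piece to pull free of it. Here’s a rough sketch of a dependency-free version. The isPlainObject function below is a hypothetical stand-in for _.isPlainObject and only approximates it (it counts an object as plain when its prototype is Object.prototype or null), so treat it as an assumption rather than a drop-in replacement.

```javascript
// Hypothetical stand-in for _.isPlainObject (an approximation, not Lodash's
// exact logic): true for objects whose prototype is Object.prototype or null.
function isPlainObject(value) {
  if (value === null || typeof value !== "object") { return false; }
  var proto = Object.getPrototypeOf(value);
  return proto === Object.prototype || proto === null;
}

// Same idea as the Lodash-based flatten: keep unwrapping while the value is
// a plain object with exactly one key, otherwise return it untouched.
function flatten(object) {
  if (!isPlainObject(object)) { return object; }
  var keys = Object.keys(object);
  return keys.length === 1 ? flatten(object[keys[0]]) : object;
}

flatten({ a: { b: { c: "hello!" } } }); // => "hello!"
flatten({ a: 1, b: 2 });                // => { a: 1, b: 2 } (two keys, left alone)
```

Arrays and strings fail the plain-object check, so they come back as-is, which matches how the original treats them.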

And here’s an example of what it looks like in use, given the following example XML:

<data>
  <title>Hello, world</title>

  <description>
    A bit of example XML for testing
  </description>

  <attr item="attr" href="http://google.com" />

  <item>
    <header>title</header>
    <desc>desc</desc>
    <xml:namespaced>test</xml:namespaced>
  </item>

  <item>
    <header>another title</header>
    <desc>desc 2</desc>
    <xml:namespaced>text</xml:namespaced>
  </item>
</data>

In this case I’m fetching the XML to be parsed over the network, but it doesn’t really matter where it comes from, so long as you can get it to the browser as a string.

Anyways! Here’s the usage, from XML string to JS object:

// get your XML in a text format
var xmlText = getXMLString();

// use the DOMParser browser API to convert text to a Document
var XML = new DOMParser().parseFromString(xmlText, "text/xml");

// and then use #parse to convert it to a JS object
var obj = parse(XML);
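
One caveat with DOMParser: when the XML is malformed, parseFromString generally doesn’t throw. Instead it hands back a Document containing a <parsererror> element, which parse would happily turn into an object. A guard along these lines can catch that early. It’s only a sketch: toDocument is a hypothetical name, and the exact contents of the error element differ between browsers.

```javascript
// Hypothetical helper (browser only, relies on the global DOMParser):
// converts an XML string to a Document, throwing if the parser reported
// an error instead of silently returning an error document.
function toDocument(xmlText) {
  var doc = new DOMParser().parseFromString(xmlText, "text/xml");
  var errors = doc.getElementsByTagName("parsererror");

  if (errors.length > 0) {
    throw new Error("Could not parse XML: " + errors[0].textContent.trim());
  }

  return doc;
}

// Intended usage, assuming the parse function from earlier:
//   var obj = parse(toDocument(xmlText));
```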

Fairly straightforward. And in return, you get delicious JavaScript objects like the following:

{
  "title": "Hello, world",
  "description": "A bit of example XML for testing",
  "attr": {
    "item": "attr",
    "href": "http://google.com"
  },
  "item": [
    {
      "header": "title",
      "desc": "desc",
      "xml:namespaced": "test"
    },
    {
      "header": "another title",
      "desc": "desc 2",
      "xml:namespaced": "text"
    }
  ]
}
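
One thing to watch for in output like this: because parse only switches to an array on the second sibling with the same name, a document with a single <item> would give you a plain object under item rather than a one-element array. A small helper on the consuming side can smooth that over. This is only a sketch, and asArray is a made-up name for illustration; it isn’t part of the code above.

```javascript
// Hypothetical helper (not part of parse): always hand back an array,
// whether parse produced nothing, a single value, or an array of values.
function asArray(value) {
  if (value == null) { return []; }
  return Array.isArray(value) ? value : [value];
}

// With one <item>, parse would yield an object; with several, an array.
var single = { item: { header: "title", desc: "desc" } };
var many = { item: [{ header: "title" }, { header: "another title" }] };

var headers = asArray(single.item).concat(asArray(many.item)).map(function(item) {
  return item.header;
});
// headers is ["title", "title", "another title"]
```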

  1. What’s that? You mean XML documents aren’t standardized at all? What a surprise!

  2. Not Invented Here syndrome