Tips on using and transforming XML. I primarily use the XML::LibXML and XML::LibXSLT Perl modules for my XML processing. I usually recommend against XML::Simple, as I have wasted, and have seen others waste, too much time fiddling around with the resulting data structures. Consider stepping up from XML::Simple to XML::LibXML.
- Older XML parsers may not support the Byte Order Mark (BOM), nor will Unix operating systems that expect scripts to begin with #! rather than a BOM. sample-file-encodings lists text files in various encodings.
- Encoding problems will result unless all the software involved agrees on the encoding. Sites should standardize on a suitable encoding (or a short list of encodings, such as ASCII and UTF-8, since the number of supported encodings must be limited) and convert any foreign data appropriately before working with it.
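As a rough illustration of handling foreign data before working with it, the following Python sketch strips a leading BOM and re-encodes the payload as UTF-8. The helper names split_bom and to_utf8 are hypothetical, and the assumption that data without a BOM is already UTF-8 is mine; a real site would substitute whatever encodings it has standardized on.

```python
import codecs

# Known BOMs, UTF-32 first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def split_bom(data: bytes):
    """Return (encoding, payload); encoding is None when no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


def to_utf8(data: bytes, fallback: str = "utf-8") -> bytes:
    """Decode using the BOM's encoding (or the assumed fallback), emit UTF-8 without a BOM."""
    encoding, payload = split_bom(data)
    return payload.decode(encoding or fallback).encode("utf-8")


# A script that a BOM would otherwise break: the #! must be the first two bytes.
script = to_utf8(codecs.BOM_UTF8 + b"#!/bin/sh\n")
```

Data that fails to decode raises UnicodeDecodeError, which is arguably what one wants: better to reject foreign data early than to let a mangled document further into the workflow.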
XML will not suit all workflows: alternatives range from JSON to Markdown to YAML and many others, depending on the need.
XPath
Consult the XML Path Language (XPath) specification to learn XPath. XSLT makes heavy use of XPath. Not all XPath implementations support the full range of features found in the XPath specification. HTML::Selector::XPath can convert CSS2 selectors to an equivalent XPath expression. Find a tool with which to experiment, such as xpath-tester, which is used in the examples below, along with xpath-position-example.xml (shortened to x-p-e.xml in the commands):
<xml> x
<one a="a1">1</one>
<two a="a2">2</two>
<three> 33 <!-- context node in examples -->
<first a="f1">1st</first>
<second a="f2">2nd</second>
<third a="f3">3rd</third>
<fourth/>
<fifth a="f5">5th</fifth>
</three>
<four a="a4">4</four>
<five a="a5">5</five>
</xml>
Examples
- position() - this function depends on the context node and the direction of the axis. Given a context node of //three in the example XML:
  $ xpath-tester x-p-e.xml 'preceding::*[position()=1]/text()' //three
  2
  $ xpath-tester x-p-e.xml 'preceding::*[position()=last()]/text()' //three
  1
  $ xpath-tester x-p-e.xml 'following::*[position()=1]/text()' //three
  4
  $ xpath-tester x-p-e.xml 'following::*[position()=last()]/text()' //three
  5
  $ xpath-tester x-p-e.xml 'descendant::*[position()=1]/text()' //three
  1st
  $ xpath-tester x-p-e.xml 'descendant::*[position()=last()]/text()' //three
  5th
  Note that on a reverse axis (ancestor, preceding, and the like), the position numbering runs from 1 for the nearest ancestor or preceding element out to last() for the farthest element of that axis. This means instead of saying “current element, minus two,” one instead says “the second element of a reverse axis.”
  The position() function can also be used in XSLT expressions, such as <a href="slide{position()-1}.html"> in multipages.xsl.
- contains() - the contains() function offers substring searches.
  - On text nodes:
    $ xpath-tester x-p-e.xml '//*[contains(text(), "th")]'
    5th
    $ xpath-tester x-p-e.xml '//*[contains(., "th")]'
    …
  - Or on attributes:
    $ xpath-tester x-p-e.xml '//@*[contains(., "2")]'
    a2f2
    $ xpath-tester x-p-e.xml '//*[contains(@*, "2")]'
    22nd
  Obtaining the value of an attribute node differs from obtaining element data: one cannot simply append /text(), but must either wrap the expression in string(…) or rely on software outside of XPath to extract the attribute’s value. This difference may influence the design of an XML format.
- Whether to end an expression with /text() depends on the desired result; software such as XML::LibXML::XPathContext can handle either form, though the calling code may need to differ depending on the expression, or check whether it received a node or a string. In XSLT, test whether an element contains text with the XPath string-length function, not with a test on text().
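The split between element text and attribute values shows up outside of XPath as well. Below is a minimal sketch using Python's standard xml.etree.ElementTree, which is my choice for illustration rather than the XML::LibXML tooling used elsewhere on this page, and which supports only a small XPath subset. The document is a cut-down copy of the example XML.

```python
import xml.etree.ElementTree as ET

DOC = """<xml>
<two a="a2">2</two>
<three><second a="f2">2nd</second><fourth/></three>
</xml>"""

root = ET.fromstring(DOC)

# Element data: select the element node, then ask the node for its text.
second = root.find("./three/second")
element_text = second.text

# Attribute data: this XPath subset cannot return attribute nodes, so
# select the owning elements and read the value in the calling code.
attr_values = [elem.get("a") for elem in root.findall(".//*[@a]")]

# An empty element yields None rather than a string, so the calling
# code has to check what it actually received.
fourth_text = root.find("./three/fourth").text
```

Here element_text is "2nd", attr_values is ["a2", "f2"], and fourth_text is None: three slightly different code paths for what looks, in the XML, like the same kind of data.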
Excessively complicated XPath expressions may simply not work, or may be hard to debug, understand, and support. This is akin to using several regular expressions instead of a single long mess, or several lines of code instead of cramming too much into a single expression. Keep XPath expressions simple, and handle logic, iteration, and recursion in XSLT or other software outside of XPath.
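One way to move logic out of XPath is to select broadly and do the rest in ordinary code. The Python sketch below rebuilds the preceding axis by hand for the //three context node, which also makes the reverse-axis numbering of position() concrete. The preceding function is a hypothetical helper, the document is a trimmed copy of the example XML, and xml.etree.ElementTree stands in for whatever XPath software is actually in use.

```python
import xml.etree.ElementTree as ET

DOC = """<xml> x
<one a="a1">1</one>
<two a="a2">2</two>
<three> 33
<first a="f1">1st</first>
<second a="f2">2nd</second>
</three>
<four a="a4">4</four>
<five a="a5">5</five>
</xml>"""


def preceding(root, context):
    """Elements before context in document order, minus its ancestors, nearest first."""
    ordered = list(root.iter())  # every element, in document order
    parent_of = {child: parent for parent in ordered for child in parent}
    ancestors = set()
    node = context
    while node in parent_of:  # walk up to the root
        node = parent_of[node]
        ancestors.add(node)
    before = ordered[: ordered.index(context)]
    # Reversing gives reverse-axis order: index 0 is what position()=1 selects.
    return [elem for elem in reversed(before) if elem not in ancestors]


root = ET.fromstring(DOC)
axis = preceding(root, root.find("three"))
nearest = axis[0].text    # the position()=1 element
farthest = axis[-1].text  # the position()=last() element
```

The list comes back as two, then one, so nearest is "2" and farthest is "1", in line with the preceding:: results in the position() examples.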
xmlns
XML namespaces (xmlns) can be troublesome to deal with, notably default namespaces that declare no namespace prefix (undec.xml):
<undec xmlns="http://example.org/undec/1.0/">
<a>a</a>
<b>b</b>
</undec>
A namespace declared without a prefix requires custom registration in the XPath software being used, such as via the registerNs method of XML::LibXML::XPathContext, and elements in that namespace must then be written with the registered prefix in expressions. Some expressions will still match elements in a custom namespace (//* due to the nature of *), though this should not be relied on instead of properly declaring the xmlns.
$ xpath-tester undec.xml '//u:a'
a
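The same prefix registration can be sketched in Python's standard xml.etree.ElementTree, where the mapping is passed alongside the expression instead of being registered on a context object. The prefix u is an arbitrary local choice; only the namespace URI has to match the document.

```python
import xml.etree.ElementTree as ET

DOC = """<undec xmlns="http://example.org/undec/1.0/">
<a>a</a>
<b>b</b>
</undec>"""

# Prefix-to-URI mapping; the prefix exists only on the query side.
NS = {"u": "http://example.org/undec/1.0/"}

root = ET.fromstring(DOC)

unprefixed = root.findall(".//a")      # no match: <a> lives in the namespace
prefixed = root.findall(".//u:a", NS)  # matches via the mapped prefix
wildcard = root.findall(".//*")        # matches a and b regardless of namespace
```

The unprefixed query returns an empty list, the prefixed one finds the single a element, and the wildcard finds both children, which is the "works by accident" case that should not be relied on.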
XSLT
The XSL Transformations (XSLT) specification details the XSLT language. The XSLT stylesheets for this website are available online. Tricks include different handling of acronym elements depending on whether or not the acronym has already been shown on the page, and more. XSLT makes heavy use of XPath expressions.