Like many other Specialised Information Services (FID), we use XSLT to map XML metadata from data providers to our data model, an extended version of the RDF/XML-based Europeana Data Model (EDM). In the FID Performing Arts (FID DK), we currently receive data from 22 data providers in 6 official metadata standards such as MARC21, EAD and LIDO, as well as in 10 individual formats that result from working with database systems like MS Access or FAUST DB. As most metadata is already delivered in XML, it made sense to use XSLT, a language whose main purpose is the transformation of XML documents.

In this post, I will share my experiences with using XSLT in GLAM projects like the FIDs, with a focus on writing cleaner XSLT code and on optimization tips for transforming XML documents. I assume that the reader has worked with XSLT before or has at least some knowledge of XML technologies like XPath or XQuery.

Cleaner code and improved performance in XSLT

The following sections provide some random tips that I gathered throughout the years of working with XSLT. They are by no means exhaustive and you might want to read the official XSLT 3.0 specification as well.

Let tools support your work

In general, it is beneficial to make use of appropriate XML editors and databases for XML data as they can make work much more convenient in regard to optimizations, detecting bottlenecks, finding and debugging errors. If you are working with XSLT a lot, you can either try to find an XML plugin for your favorite open source text editor (e.g. XML-Validation and XML-Completion in Kate or the XML-Tools Plugin in Notepad++) or consider buying a commercial license for oXygen XML Editor or XMLSpy. It is also possible to combine the usage of XML editors like oXygen with open source XML databases like BaseX or eXist-db (see Integrating oXygen or Using oXygen with eXist-db). With XML databases you get easy and fast access to certain aspects of your data by using the query language XQuery. In some XML databases like BaseX other serializations like CSV or JSON are supported as input formats in case you receive other data than XML but nevertheless want a consistent data basis in one place.

(Fig. 1: Invocation Tree View and Hotspots View in oXygen XML Editor.)

If you want to improve performance and have access to oXygen XML Editor, I recommend using the XSLT Debugger with Performance Profiling switched on. The “Invocation Tree View” shows which instructions were processed and how long they took, and the “Hotspots View” helps to identify bottlenecks, which can help you rethink your code and improve performance considerably.

Working with large XML documents

When you receive data from GLAM institutions, it often comes as one huge file with thousands of records, which is not ideal for use with XSLT. Traditionally, the whole XML document is parsed into main memory for an XSL transformation, and processing it efficiently needs even more space, so large inputs consume a lot of RAM. To prevent that, split larger files into smaller XML documents beforehand and compile the stylesheet once before running it against the smaller documents. If transformations are very complex, you might want to split them up as well and perform them in several stages.

Since XSLT 3.0, there is also xsl:source-document (formerly known as xsl:stream), although streaming is only supported in Saxon’s Enterprise Edition. It allows you to stream XML documents that are too large to keep in RAM or to transform an XML input that is a stream itself. Keep in mind, though, that stream processing restricts the expressions and constructions you can use in the stylesheet and is slower than the traditional way (see Jakub Malý: Parallel XSLT Processing of Large Documents. In: XML Prague 2015 Conference Proceedings, p. 15). It therefore makes sense to use it only when necessary.
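
As a rough sketch of how streaming can also help with splitting a large file, the stylesheet below writes every record of a big document into its own file. The file name all-records.xml, the records/record structure and the output file names are assumptions made for this illustration, and it needs a processor that supports streaming.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Entry point when the processor is started without a source document
       (e.g. Saxon's -it option) -->
  <xsl:template name="xsl:initial-template">
    <!-- Hypothetical input: a records element with many record children -->
    <xsl:source-document href="all-records.xml" streamable="yes">
      <xsl:for-each select="records/record">
        <!-- Write each record to its own file: record-1.xml, record-2.xml, ... -->
        <xsl:result-document href="record-{position()}.xml">
          <xsl:copy-of select="."/>
        </xsl:result-document>
      </xsl:for-each>
    </xsl:source-document>
  </xsl:template>

</xsl:stylesheet>
```

Here xsl:for-each is used on purpose: the body reads each record exactly once, which keeps the construct within the restrictions that streaming imposes.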

Choose the newest possible XSLT version

Before you start working on XSLT stylesheets, it is advisable to check which version of XSLT you are using and, if possible, to use the current version (XSLT 3.0 at the time of writing). It supports the current XPath version (XPath 3.1 at the time of writing) and gives you many more built-in functions and options that help you write cleaner and shorter code. If you are bound to XSLT 1.0 because your programming language only supports that version, consider running a newer XSLT version with a suitable Saxon processor outside of that language. In Python, for example, it is possible to call Saxon via a subprocess call:

import subprocess

# Each argument must be its own list item; check=True raises an error if Saxon fails
subprocess.run(["java", "-jar", "path/to/saxon.jar", "-s:input.xml", "-xsl:sheet.xslt", "-o:output.xml"], check=True)

If you are re-using an existing XSLT sheet in XSLT 1.0 or 2.0, check if it is possible to tweak it to a newer version. Extensions to XSLT 1.0 like EXSLT are usually no longer needed as most functions are already included in versions XSLT 2.0 and higher. When you are working with XSLT from within BaseX, make sure to add Saxon 9+ into BaseX’s classpath under lib/custom/ if you want to use XSLT 3.0.

Re-use and create modules

If you keep functions abstract, you can modularize your code, create helper functions and re-use them in other contexts. Additional stylesheets with helper functions can be pulled in with xsl:import or xsl:include; the difference is that declarations brought in with xsl:import have lower precedence, so the importing stylesheet can override them. Please bear in mind that xsl:import and xsl:include might not work as expected if you convert your data within BaseX or other XML databases.

When I can’t re-use existing mappings to EDM, I try to keep new stylesheets for the FID as standard-specific and as little data-provider-specific as possible. That way the stylesheets are more re-usable in other FID projects like the FID African Studies. Data-provider-specific steps, as well as processing steps like date normalization, matching or validation checks that are the same for all our data providers, are modularized in their own stylesheets. So if the data model changes, not all of the mappings need to be touched individually.

Write readable and less verbose code

As mentioned above, using built-in functions is certainly an important aspect of keeping XSLT code smaller and cleaner. The same goes for using variables to prevent repeated evaluations of complex transformations. There is also some syntactic sugar that came with support of the higher XPath version in XSLT 3.0 such as concatenation with ||

$textA || ' ' || $textB

instead of

concat($textA, ' ', $textB)

as well as using let and for constructions known from XQuery, function chaining with => (e.g. $textA => lower-case() => tokenize(";")) or mapping a function to each item in a sequence via ! (e.g. ('FIDs','are','cool') ! string-length(.)).

Furthermore, built-in template rules and modes can be quite the game changer. Declaring a mode with on-no-match="shallow-copy" (via xsl:mode) can replace the classic identity template: nodes without an explicit template rule are copied unchanged, while all other nodes are processed according to the template rules that match them.
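
A minimal sketch of this idea could look as follows. The element names internalNote, title and heading are made up for the illustration; only the xsl:mode declaration is the part that matters.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Any node without a matching template rule is copied as is -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Hypothetical rule: drop internalNote elements entirely -->
  <xsl:template match="internalNote"/>

  <!-- Hypothetical rule: rename title to heading, keep attributes and content -->
  <xsl:template match="title">
    <heading>
      <xsl:apply-templates select="@* | node()"/>
    </heading>
  </xsl:template>

</xsl:stylesheet>
```

Everything that is not an internalNote or a title passes through unchanged, so the stylesheet only has to describe the differences between input and output.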

Finding out that there is no need to use xsl:element or xsl:attribute if names are not dynamic reduced my lines of code a lot. When using XSLT 3.0 with the standard attribute expand-text="yes", it is even possible to use text value templates with {}. In particular, the verbosity of inserting variables into texts or attributes with xsl:value-of can be reduced significantly, e.g. from:

<xsl:element name="dc:description">
    <xsl:attribute name="xml:lang">
      <xsl:value-of select="@language"/>
    </xsl:attribute>
    <xsl:value-of select="concat('Printed inscription: ', normalize-space($inscription))"/>
</xsl:element>

to

<dc:description xml:lang="{@language}">Printed inscription: {normalize-space($inscription)}</dc:description>

Avoid certain XPath expressions

One of the most important things when it comes to performance in XSLT is to avoid // wherever you can. It is the most expensive expression, as it visits every single descendant in the XML source tree. Similarly, reconsider whether expressions with the preceding, following, preceding-sibling or following-sibling axes, or with the descendant or ancestor axes, are really necessary, as these can lead to unnecessary document traversals. Try to work relative to the context of your current node instead.

Avoid xsl:for-each and named templates… and accept that XSLT is a declarative language ;-)

I know it’s nice to have control over the flow, but XSLT is not an imperative programming language. So let the XSLT processor do its work for you and let it determine which template to invoke. Using xsl:for-each a lot is considered bad style in XSLT. As variables are immutable in XSLT, it won’t give you the incrementable loop counter you are thinking of anyway. I am not saying it’s harmful to use xsl:for-each – I also still use it sometimes – but it’s rarely needed as every xsl:for-each like

<xsl:template match="book">
  <xsl:for-each select="title">
    <dc:title>{normalize-space(.)}</dc:title>
  </xsl:for-each>
  <xsl:for-each select="creator">
    <dc:creator>{local:getName(.)}</dc:creator>
  </xsl:for-each>
</xsl:template>

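can typically be replaced by xsl:apply-templates, which gives you cleaner code and a lot more flexibility. Note that xsl:apply-templates without a select attribute processes all child nodes in document order; add one (e.g. select="title, creator") if you only want certain children or need a fixed order: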

<xsl:template match="book">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="title">
  <dc:title>{normalize-space(.)}</dc:title>
</xsl:template>

<xsl:template match="creator">
  <dc:creator>{local:getName(.)}</dc:creator>
</xsl:template>

The same applies to named templates, which can be called with parameters in order to control certain behavior. They are a great way to modularize your code. However, if you need a utility function, try xsl:function instead, because functions can be called from within XPath expressions and can therefore be composed or chained.
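
To illustrate, here is one way a helper like local:getName could be written as an xsl:function. The body is purely an assumption for this sketch (it turns "Last, First" into "First Last"), and the namespace URI urn:example:local is a placeholder.

```xml
<xsl:stylesheet version="3.0" expand-text="yes"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:local="urn:example:local"
    exclude-result-prefixes="xs local">

  <!-- Hypothetical helper: turns "Last, First" into "First Last" -->
  <xsl:function name="local:getName" as="xs:string">
    <xsl:param name="raw" as="xs:string?"/>
    <xsl:variable name="parts" select="tokenize($raw, ',') ! normalize-space(.)"/>
    <xsl:sequence select="string-join(reverse($parts), ' ')"/>
  </xsl:function>

  <xsl:template match="book">
    <!-- The function is mapped over all creators and the result is chained
         into string-join, all within one XPath expression -->
    <dc:creator>{(creator ! local:getName(.)) => string-join('; ')}</dc:creator>
  </xsl:template>

</xsl:stylesheet>
```

A named template could produce the same names, but its result could not be used directly inside an XPath expression like this.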

Doing without xsl:for-each or named templates might lead to conflicting template rules. In that case, have a look at the priority attribute and at template modes: they let you control which template rule wins when several match, and let you process the same nodes several times with different results.
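
The following sketch shows both mechanisms side by side. The type attribute, the sort-key mode and the sortKey element are invented for the example; the target properties are just one plausible choice.

```xml
<xsl:stylesheet version="3.0" expand-text="yes"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/">

  <xsl:template match="book">
    <!-- Process the same title nodes twice: default mode, then sort-key mode -->
    <xsl:apply-templates select="title"/>
    <xsl:apply-templates select="title" mode="sort-key"/>
  </xsl:template>

  <!-- Both rules below match a title that has type="sub" and xml:lang.
       Their default priorities are equal, so the explicit priorities decide. -->
  <xsl:template match="title[@type = 'sub']" priority="2">
    <dcterms:alternative>{normalize-space(.)}</dcterms:alternative>
  </xsl:template>

  <xsl:template match="title[@xml:lang]" priority="1">
    <dc:title xml:lang="{@xml:lang}">{normalize-space(.)}</dc:title>
  </xsl:template>

  <!-- Fallback for all other titles (lower default priority) -->
  <xsl:template match="title">
    <dc:title>{normalize-space(.)}</dc:title>
  </xsl:template>

  <!-- Same nodes, different result in another mode -->
  <xsl:template match="title" mode="sort-key">
    <sortKey>{lower-case(normalize-space(.))}</sortKey>
  </xsl:template>

</xsl:stylesheet>
```

Without the explicit priority values, a subtitle with a language attribute would match two rules of equal rank, which the processor reports as a conflict or resolves by picking the last rule.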

Use Maps

When your code includes endless xsl:choose-constructions with many cases of xsl:when that match against a certain string value, check if you can use data structures like Maps instead which can be compared to dictionaries in Python.

<xsl:variable name="langMap" select="map{'engl':'eng', 'portug':'por', 'schwed':'swe', 'tschech':'cze'}"/>

<xsl:template match="Language">
  <dc:language>{$langMap(.)}</dc:language>
</xsl:template>

Above you see a very short example, but Maps really pay off when you have a whole dictionary of key-value pairs. With a Map you can access a value quickly via its key, whereas xsl:choose tests its xsl:when branches one after another until one of them is satisfied or, in the worst case, falls through to xsl:otherwise.

Further reading

If you are interested in reading more about special pearls of XSLT 3.0 like higher order or anonymous functions, I can recommend e.g. Pearls of XSLT and XPATH 3.0 Design by Roger Costello, Why You Should Be Using XSLT 3.0 by Kurt Cagle or What’s new in XSLT 3.0 and XPath 3.1? by David J. Birnbaum.