Introduction

When you install the open source discovery system VuFind, follow the basic configuration steps and feed it with library records, it works well out of the box and provides you with faceted search results and the possibility to browse through your data, besides many other features. The easiest way to achieve this is to load the standard interchange format for library records, i.e. MARC21, into the included Apache Solr-based search index. The FID Performing Arts has been using VuFind since 2015, but as mentioned in an earlier post, we receive a vast amount of metadata from performing arts museums and archives in standards other than MARC21, such as EAD, METS/MODS and LIDO, as well as other individual data formats that result from database systems like MS Access or FAUST DB. In order to meet the special requirements of performing arts metadata and to work with a consistent data set during data aggregation, we decided to map all received data into an extended version of the universal and flexible Europeana Data Model (EDM).

VuFind provides the possibility to index metadata standards other than MARC21 and even supports several standards next to each other in the same VuFind instance, with multiple search backends, so-called Record Drivers and combined searches. Nevertheless, we decided to use only one standard within VuFind, so that we can make adjustments to the data during preprocessing, mapping and enrichment and keep the data more independent of the software used to display it. This two-part blog post series describes the steps you need to take to support a new RDF/XML-based metadata standard like EDM with a Solr search backend in VuFind. You can find all the code in our fork of the main repository on GitHub.

Following the path of data in VuFind

There is much to learn about VuFind and Solr, and we won’t be able to give a thorough introduction to all details and features here. Please note that there is great documentation, a book and video tutorials for VuFind, as well as the Solr documentation, if you want to learn everything from scratch. Also, this post was written with VuFind 8.0 and Solr 8 in mind, and the necessary steps might differ slightly in other versions.

Since VuFind is open source and very modular, you can often adapt it to your needs simply by changing configurations, without even touching any code. For our use case, however, we need to make changes to the code in order to support the new metadata standard, so we need to understand the path the data takes within VuFind.

(Fig. 1: The path of the data within VuFind: records are imported via XSLT into two Solr cores and wrapped into objects for display.)

For our purpose, we still want to use the power of Solr as a search backend (for external backends consider this page) and re-use as much of VuFind’s existing code as possible. So we first need to map the records to Solr’s XML document format, index them into Solr by using VuFind’s import helpers and make adjustments to the provided Solr schema. This is covered in part I of this blog post. In part II, we will describe the configuration changes that tell VuFind how to cope with the new record format. The implementation of new RecordDrivers allows us to interact with the indexed data and to choose the correct templates for displaying the records with the support of the so-called RecordDataFormatter. Finally, other features like recommendations and record export are discussed.

Indexing EDM-XML data into Solr

Solr cores and index schemas

Solr holds multiple cores, each of which takes documents consisting of sets of key-value pairs to which Solr applies faceting and filtering rules. Without going further into the details of EDM, the important thing to note here is that EDM consists of classes of interlinked entities. As an example, we can describe that a person created a resource like a costume that was part of a performance event, which in turn is a realization of a work. We use two Solr cores to represent these different types of data:

  • The biblio core stores title data, which in the case of EDM is the cultural heritage object (edm:ProvidedCHO) like a playbill or photograph, in combination with the information from the aggregation (ore:Aggregation) such as the data provider, rights statements and links to digital copies. We also store records of the authority type work in the biblio core, as they usually have very similar properties to textual resources in library records.
  • The authority core stores authority data, which in the case of EDM comprises the classes person (foaf:Person), corporate body (foaf:Organization), event (edm:Event), place (edm:Place), time period (edm:Timespan) and concept (skos:Concept); we are currently only implementing the first three of these classes in VuFind, though (a shortened example document is sketched below).
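
As an illustration, a person document for the authority core might look like the following sketch. It is not an excerpt from our index: the identifier merely follows the pattern of the title records shown further below, and heading is, to our knowledge, the field that VuFind’s stock authority schema offers for the preferred form of the name.

<add>
  <doc>
    <field name="id">THULB_ThHStAW_person_00001234</field>
    <field name="record_format">edm</field>
    <field name="heading">Shakespeare, William</field>
    ...
  </doc>
</add>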

Both Solr cores have their respective schema.xml files in VuFind’s solr/ folders that define the record fields, their types and the filtering rules that apply. You can find the explanation of the current versions here and here. So after harvesting, preprocessing, aggregating and enriching our data in the data pipeline, we map it from EDM-XML to Solr XML documents like the following, which conform to these schema files:

<add>
  <doc>
    <field name="id">THULB_ThHStAW_archivesource_00008644</field>
    <field name="record_format">edm</field>
    <field name="title">Der Widerspänstigen Zähmung; Dämmerung</field>
    <field name="format">Playbill</field>
    <field name="date_span">[1889-01-30T00:00:00Z TO 1889-01-30T23:59:59Z]</field>
    <field name="event">Der Widerspänstigen Zähmung</field>
    <field name="event_id">THULB_ThHStAW_performance_00003738</field>
    <field name="work">The Taming of the Shrew</field>
    <field name="work_id">THULB_ThHStAW_opus_00000136</field>
    ...
  </doc>
  <doc>
    ...
  </doc>
  ...
</add>

The most important field from the schema that we need to populate in our Solr XML documents is id, which provides Solr with a unique identifier for each record. Solr will overwrite records with the same id, so be aware of that when re-indexing. Another important field is record_format, which tells VuFind what format this record has and which RecordDriver to choose. This is especially helpful if you have records coming from different data pools with different formats. We also populate the fullrecord field because it helps us display data that won’t be part of the search but is only relevant for display. By re-using as many of the fields in the given schemas as possible, we keep the code maintainable with only small changes needed during updates and can re-use many VuFind methods and templates, like the ones for displaying hierarchical records, from the main code base.

<field name="event" type="textProper" indexed="true" stored="true" multiValued="true"/>
<field name="event_id" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="work" type="textProper" indexed="true" stored="true" multiValued="true"/>
<field name="work_id" type="string" indexed="true" stored="true" multiValued="true"/>

For EDM, we needed, among other fields, a date span field with a date range to be able to represent time periods like festivals. As shown above, we further added the multivalued fields event and work to the schema.xml of the biblio core, following the way authors are represented in VuFind, to easily find events and works related to the given resource. Additionally, to support EDM’s linked data approach, we added _id fields so that persons, events and works can be searched not only by string labels but also by an identifier, to be sure to find the correct related entity. This is useful for persons or events sharing the same name, which happens quite a lot over the centuries of performances based on well-known works. For that, we followed the same approach as VuFind does with author_role, where the order of the entries is important for knowing which ones belong to each other. Note that the id fields are not only indexed but also stored in Solr, because we want to retrieve these ids for display and other functionalities. If you make changes to the schema files later on, don’t forget to re-index your records.
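
For reference, a field like date_span can be backed by Solr’s DateRangeField. The snippet below is only a minimal sketch under that assumption, not a copy of our schema, and the fieldType name daterange is an illustrative choice:

<fieldType name="daterange" class="solr.DateRangeField"/>
<field name="date_span" type="daterange" indexed="true" stored="true" multiValued="true"/>

With such a field, a range query like date_span:[1889-01-01T00:00:00Z TO 1889-12-31T23:59:59Z] would match all resources from 1889, including the playbill from the example document above.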

XSLT Import

VuFind provides a mechanism for indexing XML records into Solr, so we can re-use the existing batch-import-xsl.sh, which first maps records via XSLT into Solr’s XML format and then imports the record batches into a Solr core. However, we cannot re-use the custom methods for formatting and other features that come with the included SolrMarc package. We are not re-using one of the existing XSLT scripts either, because PHP only supports XSLT 1.0, which makes the whole process a bit inconvenient if you have to do most of the mapping yourself. Instead, we map the records outside of VuFind with the power of XQuery and XSLT 3.0 (see also this blog post on XSLT) and only use a tiny XSLT 1.0 script within VuFind. We can use it for both title and authority records, because it mainly does an identity transformation and outputs the given XML as is:

<xsl:template match="@* | node()">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="/">
  <xsl:apply-templates select="@* | node()"/>
</xsl:template>

The only other template in this XSLT script uses a VuFind function to populate the fullrecord field with the raw XML input as text. Another interesting function in this context, though we currently don’t make use of it, is VuFind::mapString, which you can use to map strings by means of a configuration file from VuFind’s translation_maps folder.

    <xsl:template match="rdf:RDF">
        <xsl:copy-of select="php:function('VuFind::xmlAsText', .)"/>
    </xsl:template>
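
The result is a single stored field carrying the escaped source record as text, roughly like this heavily trimmed sketch:

<field name="fullrecord">&lt;rdf:RDF ...&gt;&lt;edm:ProvidedCHO rdf:about="..."&gt;...&lt;/edm:ProvidedCHO&gt; ... &lt;/rdf:RDF&gt;</field>

As mentioned above, this raw XML is what lets us display properties of a record that are not indexed in separate Solr fields.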

Custom files and overrides of existing VuFind files are placed in the local/ folder. So we add the XSLT as local/import/xsl/edm.xsl and also add local/import/edm.properties, which defines the XSLT to use during the import with batch-import-xsl.sh and which PHP functions may be called. In the properties file, you can also define parameters for your XSLT (see the sketch after the listing).

[General]
xslt = edm.xsl
php_function[] = xmlAsText
custom_class[] = VuFind
truncate_custom_class = true
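
If the XSLT declares parameters via <xsl:param>, their values can be set in the properties file as well, in our understanding via a [Parameters] section as used in the example properties files shipped with VuFind. The parameter name and value below are purely illustrative:

[Parameters]
institution = "FID Performing Arts"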

Then you can start the indexing with:

$ ./batch-import-xsl.sh <path-to-your-data> edm.properties

It is recommended to use the batch import mechanism with collections of records, as this is much faster than importing single records. For testing and troubleshooting, you can also do single imports with import-xsl.sh, which can help you identify problematic records even if it is slower, or use the --test-only flag.
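
Assuming the single-record script takes the same arguments as its batch counterpart, such a test run might look like this; the --test-only switch should show you the transformed document without updating the index:

$ ./import-xsl.sh --test-only <path-to-a-single-record> edm.properties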

To be continued

Now that your records are indexed in Solr, we will continue next month with part II of this series, in which we will learn how to make the new format visible in VuFind.