Introduction

Since 2004, the Bielefeld Academic Search Engine (BASE) has offered an aggregated metadata search over scientific publications. By indexing well over 300 million records (60% of which are open access1) from almost 10,000 international repositories, it provides researchers with invaluable access to publications. In addition to a VuFind-based discovery system, BASE makes the aggregated data available via both a live search API and an OAI-PMH interface to enable re-use by 3rd party services.2 Because BASE is a general search engine for scientific documents, the indexed publications are not restricted to any academic discipline or field of research. And while BASE retains available subject terms and – to some extent – even tries to automatically assign Dewey Decimal Classification to a subset, it’s not always straightforward to find resources that pertain to a specific field. That’s where 3rd party services can come into play and try to cater to the needs of specific areas of research. To provide such a service, however, a method of selecting a subject specific subset is required.3

In what follows, I give an outline of some techniques and methods that can be applied to retrieve a subject specific subset from BASE. In particular, the approach we chose for the Lin|gu|is|tik-Portal – a web portal for linguistics maintained by the FID Linguistik – is detailed. The description is restricted to relatively conventional methods; modern approaches, e.g. from the area of machine learning, are not covered.

In order to follow most of the examples in this post, access to both interfaces of BASE is required. All shown examples can be found in the following GitHub repository: https://github.com/ubffm/ublabs-base-examples.

Goals & Challenges

Optimally, retrieving a “subject specific subset” would mean that one ends up with a sizable selection of records that is pertinent and exhaustive and that contains nothing leading away from the subject. Formally inadequate records should also be left out.

However, knowledge is endlessly connected and academic fields are highly intertwined and hard to delineate – if they can be delineated at all. In linguistics, for example, areas of research like psycho- or sociolinguistics address topics relevant to both linguistics and psychology or sociology, respectively. And terms like “morphology” can be found in the jargon of several disciplines. These are just two prominent examples.

So it’s not entirely clear how to actually define beforehand what is and is not relevant in concise terms and in a way that can be implemented as software. Therefore, I will describe several approaches, each having different characteristics. This way, you can choose those that work for a specific subject and combine them to arrive at a satisfying approximation.

Concepts for Selection

Concept 1: Repository Based Selection

Core idea: Finding repositories that exclusively deal with subject X via the search API in order to selectively harvest them over OAI-PMH.

As mentioned before, BASE gathers data from different repositories and aggregates it. These repositories constitute BASE’s basic organizational principle and are also made transparent to users of both the web and the machine-readable interfaces.4 Thus, one simple way to come up with a selection is to rely on this categorization into repositories. The general idea is to identify repositories that are completely – or to a significant extent – dedicated to the discipline in question. This section describes ways of using the BASE search API to find dedicated repositories. The result is a list of such repositories that can be fed into an OAI-PMH harvesting process. This latter step is a topic for a later post in this series.

Given the nature of BASE as a general, discipline-agnostic search engine, the criteria for repositories are rather broad.5 As a result, the included repositories range from publication servers of single institutes and subject specific services via organizational repositories of large universities to big international metadata providers such as Crossref, DOAJ, DataCite or CiteSeerX. Another type frequently encountered is repositories provided by open access journals. Nonetheless, it is possible to find repositories that could be relevant. Each repository in BASE has two human-readable names (English and local language), and these names are a natural starting point.

A first simple idea is to look for repositories that, judging by their name, might be dedicated to linguistics. This can be done interactively via the interactive browser mentioned above. Since BASE is quite international, it is worth trying translations of the search term as well.6

While the interactive browser is very nice for some first insights, a more thorough approach is to use the live search API, which provides versatile access to the BASE index via simple HTTP GET requests. It comprises three methods (“functions”):

  • List all repositories (ListRepositories)
  • Get a detailed description (“profile”) of a specific repository (ListProfile)
  • Issue a query, using Solr query syntax (PerformSearch)

Each method allows for additional parameters. For the purpose of the current text, only the first two methods are used and parameters will be described as they are needed. For a general description, please refer to the official documentation.
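For readers who prefer Python over the shell, the request URLs for these methods can be assembled with the standard library alone. The following is only a minimal sketch: the endpoint, the func values and the target parameter are taken from the examples in this post, whereas the helper name build_url is my own invention and not part of any BASE client.

```python
from urllib.parse import urlencode

# Endpoint as used in the shell examples of this post.
BASE_API = "https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi"


def build_url(func, **params):
    """Return the GET URL for one API method plus optional parameters.

    build_url is a hypothetical helper for illustration only.
    """
    query = {"func": func, **params}
    return BASE_API + "?" + urlencode(query)


if __name__ == "__main__":
    # The two methods used in this post.
    print(build_url("ListRepositories"))
    print(build_url("ListProfile", target="ftunivwarsawklf"))
```

Keeping URL construction in one place makes it easy to add further parameters later without scattering string concatenation over a script.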

A rather hacky first session using some UNIX shell tools7 might look something like this:

$ curl -s 'https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi?func=ListRepositories' \
     | xmllint --format - \
     | head
<?xml version="1.0" encoding="UTF-8"?>
<collection>
  <collection_name>all</collection_name>
  <list_repositories>
    <repository>
      <activation_date>2022-05-31</activation_date>
      <name_en>Institutional repository of HM Hochschule München University of Applied Sciences</name_en>
      <name>Publikationsserver der Hochschule München</name>
      <internal_name>fthmuenchen</internal_name>
    </repository>

$ curl -s 'https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi?func=ListRepositories' \
    | xmlstarlet sel -t -v '/collection/list_repositories/repository/name/text()' \
    | egrep -i '(l[ia]ng[uv]|sprach)'

Advanced Linguistics (E-Journal)
NaUKMA Research Papers - Linguistics
Linguistic Data and NLP Tools CLARIN
$ curl -s 'https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi?func=ListRepositories' \
    | xmlstarlet sel -t -v '/collection/list_repositories/repository/name/text()' \
    | egrep -i '(l[ia]ng[uv]|sprach)' \
    | wc -l
57

(Full example: https://github.com/ubffm/ublabs-base-examples/blob/main/example1.txt)

The hidden assumption behind this is, of course, that a repository whose title hints at linguistics actually is about linguistics. This works as an approximation but often also yields some false positives.

The output of the first command in the example above illustrates the raw XML output of the API call (JSON can also be obtained by appending &format=json to the URL). Apart from the <name> field that is used for this survey, <internal_name> is the most relevant field, as it encodes the internal identifier of the repository. It is also worth mentioning that some information displayed in the interactive browser, such as the country or the number of documents, is not included in this listing.
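The same filtering can be sketched in Python with the standard library's XML parser. The sample document below is an abridged, hand-assembled response that reuses two records quoted in this post; the function name matching_repositories is hypothetical. Unlike a filter on <name> alone, this sketch checks both name fields, which matters for repositories whose local name does not contain the term.

```python
import re
import xml.etree.ElementTree as ET

# Same pattern as in the shell session: lingu/langu/lingv/langv or "sprach".
PATTERN = re.compile(r"(l[ia]ng[uv]|sprach)", re.IGNORECASE)

# Abridged sample in the shape of a ListRepositories response (assumed layout,
# assembled from the records shown in this post).
SAMPLE_XML = """<collection>
  <collection_name>all</collection_name>
  <list_repositories>
    <repository>
      <activation_date>2022-05-31</activation_date>
      <name_en>Institutional repository of HM Hochschule München University of Applied Sciences</name_en>
      <name>Publikationsserver der Hochschule München</name>
      <internal_name>fthmuenchen</internal_name>
    </repository>
    <repository>
      <activation_date>2009-02-05</activation_date>
      <name_en>Digital Library of the Formal Linguistics Department at the University of Warsaw</name_en>
      <name>Biblioteka Cyfrowa KLF UW</name>
      <internal_name>ftunivwarsawklf</internal_name>
    </repository>
  </list_repositories>
</collection>"""


def matching_repositories(xml_text, pattern=PATTERN):
    """Return internal names of repositories whose local or English name matches."""
    root = ET.fromstring(xml_text)
    hits = []
    for repo in root.iterfind("./list_repositories/repository"):
        names = (
            repo.findtext("name", default=""),
            repo.findtext("name_en", default=""),
        )
        if any(pattern.search(name) for name in names):
            hits.append(repo.findtext("internal_name"))
    return hits


if __name__ == "__main__":
    print(matching_repositories(SAMPLE_XML))
```

In this sample, only the Warsaw repository is selected, and only because of its English name.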

Using the ListProfile method, details about a specific repository can be retrieved:

$ curl -s 'https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi?func=ListProfile&target=ftunivwarsawklf' | xmllint --format -

<?xml version="1.0" encoding="UTF-8"?>
<repository>
  <activation_date>2009-02-05</activation_date>
  <country>pl</country>
  <name>Biblioteka Cyfrowa KLF UW</name>
  <name_en>Digital Library of the Formal Linguistics Department at the University of Warsaw</name_en>
  <num_non_oa_records>0</num_non_oa_records>
  <num_oa_cc_records>28</num_oa_cc_records>
  <num_oa_pd_records>0</num_oa_pd_records>
  <num_oa_records>28</num_oa_records>
  <num_records>403</num_records>
</repository>

Besides the already known facts about a particular repository, this call also returns additional information such as the country, the number of available documents and a characterization of their access status. Taken together, this makes it possible to select, e.g., German repositories that have 100 or more documents, at least 80% of which are open access, and that have something related to linguistics in their name.
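To make such criteria concrete, here is a small sketch that turns a profile like the one above into a dictionary and checks it against thresholds. The function names parse_profile and meets_criteria are hypothetical, and I assume that all fields starting with num_ contain plain integers, as they do in the sample output.

```python
import xml.etree.ElementTree as ET

# Profile as returned for ftunivwarsawklf in the example above.
PROFILE_XML = """<repository>
  <activation_date>2009-02-05</activation_date>
  <country>pl</country>
  <name>Biblioteka Cyfrowa KLF UW</name>
  <name_en>Digital Library of the Formal Linguistics Department at the University of Warsaw</name_en>
  <num_non_oa_records>0</num_non_oa_records>
  <num_oa_cc_records>28</num_oa_cc_records>
  <num_oa_pd_records>0</num_oa_pd_records>
  <num_oa_records>28</num_oa_records>
  <num_records>403</num_records>
</repository>"""


def parse_profile(xml_text):
    """Convert a ListProfile response into a dict; num_* fields become ints."""
    profile = {}
    for child in ET.fromstring(xml_text):
        text = (child.text or "").strip()
        is_count = child.tag.startswith("num_") and text.isdigit()
        profile[child.tag] = int(text) if is_count else text
    return profile


def meets_criteria(profile, country, min_records, min_oa_ratio):
    """Check country, total size and open access share of one repository."""
    total = profile.get("num_records", 0)
    if total == 0 or total < min_records:
        return False
    if profile.get("country") != country:
        return False
    return profile.get("num_oa_records", 0) / total >= min_oa_ratio


if __name__ == "__main__":
    profile = parse_profile(PROFILE_XML)
    # 28 of 403 records are open access, so an 80% threshold is not met.
    print(meets_criteria(profile, country="pl", min_records=100, min_oa_ratio=0.8))
```

Checking for an empty repository first avoids a division by zero when computing the open access share.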

While using the shell is always an option, it might be more comfortable to switch to something like Python. The accompanying GitHub repository contains some simple helper functions for the BASE API to get started quickly.

#!/usr/bin/env python3
import re
import base_api

REPO_PATTERN = re.compile(r"(l[ia]ng[uv]|sprach)", re.IGNORECASE)


def has_relevant_title(repo, pattern=REPO_PATTERN):
    """Check all title fields of a repository repo against the provided pattern."""
    return bool(
        re.findall(pattern, repo["name"]) or re.findall(pattern, repo["name_en"])
    )


def is_oa(repo, ratio=0.0, minimum=0):
    """Check the open access status of a given repo.

    ratio: minimum ratio of oa titles (float)
    minimum: minimum number of documents total."""
    if not repo["num_records"]:
        # An empty repository has no meaningful ratio; avoid dividing by zero.
        return False
    oa_ratio = repo["num_oa_records"] / repo["num_records"]
    return oa_ratio >= ratio and repo["num_records"] >= minimum


def main():
    repos = list(base_api.iter_list_repositories(coll="de"))
    print("Repos total:", len(repos))
    profiles = [
        base_api.list_profile(repo.get("internal_name"), wait=1)
        for repo in repos
        if has_relevant_title(repo)
    ]
    print("Relevant:", len(profiles))
    print("Result:")
    for profile in profiles:
        if is_oa(profile, ratio=0.8, minimum=100):
            print(profile["internal_name"])


if __name__ == "__main__":
    main()

(See also: https://github.com/ubffm/ublabs-base-examples/blob/main/example2.py)

$ ./example2.py
Repos total: 525
Relevant: 6
Result:
ftinstdeusprache
ftlangscipr

The main result of this script is a list of internal repository names that can, among other things, be used for further selective harvesting. The selection criteria were deliberately chosen to keep the result list and the strain on the BASE API minimal. In a more realistic scenario, you would choose more general criteria and probably implement some form of caching.
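One simple form of caching is to store each profile as a JSON file and to call the API only when no file exists yet. The following is a sketch under assumptions, not the approach of the accompanying repository: cached_profile and its fetch argument are hypothetical names, profiles are assumed to be JSON-serializable dictionaries, and internal names are assumed to be safe as file names.

```python
import json
import tempfile
from pathlib import Path


def cached_profile(internal_name, fetch, cache_dir):
    """Return a profile dict, calling fetch(internal_name) only on a cache miss.

    fetch is any callable that returns a JSON-serializable dict.
    """
    path = Path(cache_dir) / f"{internal_name}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    profile = fetch(internal_name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(profile), encoding="utf-8")
    return profile


if __name__ == "__main__":
    requests_made = []

    def fake_fetch(name):
        # Stand-in for a real API call, so the sketch runs offline.
        requests_made.append(name)
        return {"internal_name": name, "num_records": 403}

    with tempfile.TemporaryDirectory() as tmp:
        cached_profile("ftunivwarsawklf", fake_fetch, tmp)
        cached_profile("ftunivwarsawklf", fake_fetch, tmp)
    print("API calls:", len(requests_made))
```

In the script above, fetch could be a thin wrapper around the helper module's list_profile function. A cache like this never expires, so entries would have to be deleted or timestamped if repository statistics are expected to change.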

Repositories in BASE are not static. New repositories are added constantly, while others are removed, renamed, split etc. Therefore, if a repository based selection method is used, these scenarios have to be kept in mind. The BASE API can also help to handle them.
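A rough way to keep track of such changes is to compare lists of internal names between two runs. The sketch below assumes that you store your selection and the full list seen during the previous run; review_selection is a hypothetical name, and "ftexamplenew" is a made-up identifier used only for illustration.

```python
def review_selection(selected, previously_seen, current):
    """Compare repository lists between two runs.

    selected: internal names currently harvested
    previously_seen: all internal names returned by the previous run
    current: all internal names returned by the current run
    """
    selected = set(selected)
    previously_seen = set(previously_seen)
    current = set(current)
    return {
        # Selected repositories that disappeared (removed, renamed, split ...).
        "vanished": sorted(selected - current),
        # Repositories not seen before, i.e. candidates for a relevance check.
        "new": sorted(current - previously_seen),
    }


if __name__ == "__main__":
    report = review_selection(
        selected=["ftinstdeusprache", "ftlangscipr"],
        previously_seen=["ftinstdeusprache", "ftlangscipr", "fthmuenchen"],
        current=["ftinstdeusprache", "fthmuenchen", "ftexamplenew"],
    )
    print(report)
```

Note that a renamed repository shows up as one vanished and one new entry, so both lists still need a manual look.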

Conclusion

The BASE search API provides an effective first tool to identify relevant repositories. In addition, it can be used as a means of managing repository lists in production. Taken by itself, however, the described approach is not overly satisfying. In a brief test with terms pertaining to all 7 FIDs hosted at the UB Frankfurt, “linguistics” yielded the best results by far. And even these are vanishingly small compared to what’s available. Nonetheless, it can be a first building block.

In the next part of this post series, more powerful methods to improve on this will be described.


  1. “You can access the full texts of about 60% of the indexed documents for free (Open Access).” Source https://www.base-search.net/about/en/index.php↩︎

  2. To use either of the two interfaces offered by BASE, an informal registration is required: https://www.base-search.net/about/en/contact.php ↩︎

  3. At first sight, it might seem easier to just import BASE as a whole into a 3rd party index. However, this might not be optimal for a few reasons: First, there is the sheer size of BASE. A full import would result in several million irrelevant records in your search index. At best, these stay “dormant” because no one would ever query for them. But even then they might be a technical burden. Second, though connected to the first point, having a solid selection can prove beneficial for subject specific services like synonym replacements, ontological search and various types of enrichment. ↩︎

  4. In addition to that, an interactive browser is available. ↩︎

  5. BASE’s website states the following inclusion criteria for a repository:

    1. The source has to contain academic content.
    2. At least some documents from the source are available as open access (full texts free of charge, without registration).
    3. The metadata of the comment’s (SIC!) are provided via a valid OAI-PMH interface.

    Cf.: https://www.base-search.net/about/en/faq.php#requirements

     ↩︎
  6. A crude but effective method to get translations of technical terms quickly is to look them up in your preferred language version of Wikipedia and then check the titles of the corresponding articles in other languages in the left sidebar. Another idea is to check the Wikidata entry https://www.wikidata.org/wiki/Q8162 ↩︎

  7. These examples require cURL, XMLStarlet and xmllint ↩︎