Warning: This page has not been updated in over a year and may be outdated or deprecated.

This is an old revision of the document!


Open Archives (OAI), Open Access (OA) and Open Data Sources

This page is designed to list Open Archives that provide metadata mostly from Open Access repositories, journals, databases or other services that provide access to the full-text of their contents, and Open Data sources that might be useful to add to a VuFind instance. Feel free to add new sources as you find them.

Authority Data

OCLC FAST

You can download OCLC's subject authority records in MARC-XML format (easy to import into VuFind's authority module) here. Note that you will need to rename the files from *.marcxml to *.xml in order for SolrMarc to recognize the file format correctly. Once you have done this, make sure that you have the FAST properties import configurations in your import directory (if you have a version of VuFind earlier than 1.4, you can download the necessary files here – all filenames begin with “marc_auth_fast”).

From your VuFind directory you can then import each file as follows:

./import-marc-auth.sh [.xml filename] [properties file for that FAST dataset]

Both the XML filename and the corresponding properties filename are required!

Note: replace “./import-marc-auth.sh” with “import-marc-auth.bat” when using Windows.
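
The renaming step described above can be scripted. A minimal sketch, assuming the downloaded FAST files all sit in one directory (the function name and the directory in the usage comment are hypothetical):

```shell
# Rename every *.marcxml file in a directory to *.xml so that SolrMarc
# recognizes the format. Other files in the directory are left alone.
rename_marcxml() (
    dir=${1:-.}
    for f in "$dir"/*.marcxml; do
        [ -e "$f" ] || continue      # no match: the glob stays literal
        mv -- "$f" "${f%.marcxml}.xml"
    done
)

# Usage (hypothetical download directory):
#   rename_marcxml ./fast-download
```

After renaming, each .xml file still has to be imported together with its matching properties file.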

2012 MeSH/MARC21

U.S. National Library of Medicine authority records (MeSH), MARC21 format, 2012 version are available to download upon completion of an online memorandum of understanding (easy and fast to import into VuFind's authority module): here – look for the link “Download full file [153MB]”.

A conversion from .bin (ISO2709) into .mrc is required, easily done with Terry Reese's MarcEdit; use MARC Tools > Function: MarcBreaker and then in MARC Editor > “Compile File into MARC”.

A local marc_mesh.properties file must be created to map the authority fields used in MeSH correctly (use marc_auth.properties as a model).

The following field mappings should do the job:

id = script(getFirstNormalizedLCCN.bsh), getFirstNormalizedLCCN("001")
source = "U.S. National Library of Medicine authority records (MeSH)"
heading = 100abcdegjqt:110abcdegjqt:111abcdegjqt:130abcdegjqt:150abvxyz:151a:155avxyz:180vxyz:181vxyz:182vxyz:185vxyz
use_for = 400abcdegjqt:410abcdegjqt:411abcdegjqt:430abcdegjqt:450abvxyz:451a:455avxyz:480vxyz:481vxyz:482vxyz:485vxyz
see_also = 500abcdegjqt:510abcdegjqt:511abcdegjqt:530abcdegjqt:550abvxyz:551a:555avxyz:580vxyz:581vxyz:582vxyz:585vxyz
scope_note = custom, getAllSubfields(667:680:688, " ")

According to NLM, this “full file contains all terms with 26,581 descriptor records, 83 qualifier records, and 593,280 descriptor/qualifier combinations, for a total of 619,944 records”.

Bibliographic Data

PubMed Central OAI service (PMC-OAI)

“(…) provides access to metadata of all items in the PubMed Central (PMC) archive, as well as to the full text of a subset of these items”; full info here.

Suggested entry in ./harvest/oai.ini:

[pubmed]
url = http://www.pubmedcentral.nih.gov/oai/oai.cgi
;set = "pmc-open"
metadataPrefix = oai_dc
idSearch[] = "/^oai:pubmedcentral.nih.gov:/"
idReplace[] = "pubmed-"
idSearch[] = "/\//"
idReplace[] = "-"
injectId = "identifier"
injectDate = "datestamp"

Set “metadataPrefix” to the format that matches the level of detail you need: oai_dc provides basic metadata that is easier to parse with XSLT; pmc or pmc_fm provide richer metadata but are more complex to parse.
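
The idSearch/idReplace pairs in the [pubmed] entry rewrite each OAI identifier into a record ID. A rough sketch of their effect using sed, assuming the pairs are applied in the order listed and that every match of a pattern is replaced; the function name and the sample identifiers are made up for illustration:

```shell
# Mimic the two idSearch/idReplace pairs from the [pubmed] section:
#   1. replace the "oai:pubmedcentral.nih.gov:" prefix with "pubmed-"
#   2. replace every remaining "/" with "-"
pubmed_id() {
    printf '%s\n' "$1" |
        sed -e 's/^oai:pubmedcentral.nih.gov:/pubmed-/' -e 's#/#-#g'
}

pubmed_id 'oai:pubmedcentral.nih.gov:13900'   # prints pubmed-13900
```

Trying a few real identifiers from a harvest this way is a quick check that the resulting IDs contain no slashes before the records are indexed.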

NDLTD: Networked Digital Library of Theses and Dissertations

“Networked Digital Library of Theses and Dissertations (NDLTD), an international organization dedicated to promoting the adoption, creation, use, dissemination, and preservation of electronic theses and dissertations (ETDs)”

Suggested entry in ./harvest/oai.ini (place it under a section header of your choosing, for example [NDLTD]):

url = http://union.ndltd.org/OAI-PMH/ 
metadataPrefix = oai_dc
idSearch[] = "/^oai:union.ndltd.org:/"
idReplace[] = "NDLTD-"
idSearch[] = "/\//"
idReplace[] = "-"
idSearch[] = "/:/"
idReplace[] = "-"
injectId = "identifier"
injectDate = "datestamp"

To import records into your VuFind installation, check out the two necessary files from VuFind trunk with svn:

https://vufind.svn.sourceforge.net/svnroot/vufind/trunk/import/ndltd.properties
https://vufind.svn.sourceforge.net/svnroot/vufind/trunk/import/xsl/ndltd.xsl

and make any adjustments you need.

For a deeper understanding of the XSL transformation done in ndltd.xsl, please refer to these VuFind JIRA tickets:

http://vufind.org/jira/browse/VUFIND-501
http://vufind.org/jira/browse/VUFIND-493
http://vufind.org/jira/browse/VUFIND-499

DOAJ - Directory of Open Access Journals: Articles

Many people are unaware that DOAJ not only provides a directory of OA journals but also harvests their articles, which in turn can be harvested from DOAJ. Better yet, the articles are available in the oai_doaj metadata format (very similar to NLM's and much richer than standard oai_dc). This allows something like

Journal Title: BMGN Low Countries Historical Review    vol: 85    issue: 1

The first field value (container_title) is searchable, so it retrieves all the articles from this journal; the other two act as filters within the journal (all articles published in volume 85, or in the first issue of that volume).
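
At the Solr level, this search-plus-filters pattern could look roughly like the sketch below. It assumes the volume and issue are indexed in fields named container_volume and container_issue (check solr/biblio/conf/schema.xml in your installation); the function name is hypothetical, and titles containing double quotes are not escaped:

```shell
# Print one Solr parameter per line: a query on the journal title plus
# two filter queries narrowing the results to one volume and one issue.
journal_filter_params() {
    printf 'q=container_title:"%s"\n' "$1"
    printf 'fq=container_volume:"%s"\n' "$2"   # assumed field name
    printf 'fq=container_issue:"%s"\n' "$3"    # assumed field name
}

journal_filter_params 'BMGN Low Countries Historical Review' 85 1
```

Each printed line can be passed to curl -G with a separate --data-urlencode option against the select handler of your own Solr biblio core.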

You can also harvest just the journal records indexed in DOAJ; this retrieves information about the journals rather than their articles.

Suggested entry in ./harvest/oai.ini:

[DOAJart]
url = https://doaj.org/oai.article
metadataPrefix = oai_doaj
idSearch[] = "/oai:doaj.org\/article:/"
idReplace[] = "doaj-art-"
injectId = "identifier"
injectDate = "datestamp" 

A format definition for the VuFind harvester is already available (see ./import/doaj.properties and ./import/xsl/doaj.xsl).
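
Harvesting and importing can be run back to back. A minimal sketch, assuming the stock scripts harvest_oai.php and batch-import-xsl.sh in VuFind's harvest directory (verify the names in your release) and that it is run from that directory; the run helper and the DRY_RUN variable are hypothetical conveniences, not part of VuFind:

```shell
# Run a command, or only print it when DRY_RUN is set to a non-empty value.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Harvest one oai.ini section, then import the harvested files using the
# given properties file. The import is skipped if the harvest fails.
harvest_and_import() {
    run php harvest_oai.php "$1" &&
        run ./batch-import-xsl.sh "$1" "$2"
}

# Usage: DRY_RUN=1 harvest_and_import DOAJart doaj.properties
```

Setting DRY_RUN first shows the two commands without touching the index, which is a cheap way to check the section and file names.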

This service was first discussed in the VUFIND-543 JIRA ticket.

:!: The DOAJ format changed significantly between VuFind versions 3 and 4. If your harvest is not working, you may need to update your XSLT. See pull request #944 for details.

:!: If you are using a VuFind version earlier than 4.0 or a VuFindHarvest version earlier than 2.3.0, a bug affects the harvested XML files and you will probably need to adjust them; this command may help:

sed -i "s/xmlns:oai_doaj/ xmlns:oai_doaj/" local/harvest/DOAJart/*.xml
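
The -i option behaves differently in GNU and BSD/macOS sed, so the command above may fail outside Linux. A portable sketch of the same substitution that writes to a temporary file instead; the function name is hypothetical:

```shell
# Apply the same substitution as the sed -i command, without relying on -i.
# Each file is rewritten through a temporary copy placed next to it.
fix_doaj_xmlns() (
    for f in "$@"; do
        sed 's/xmlns:oai_doaj/ xmlns:oai_doaj/' "$f" > "$f.tmp" &&
            mv -- "$f.tmp" "$f"
    done
)

# Usage: fix_doaj_xmlns local/harvest/DOAJart/*.xml
```

Like the original command, this inserts a space even where one is already present; the extra whitespace between attributes is harmless in XML.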

InTech Open (e-Books)

“InTech is a pioneer and world's largest multidisciplinary open access publisher of books covering the fields of Science, Technology and Medicine. Since 2004, InTech has collaborated with more than 70 000 authors and published 1827 books and 12 journals with the aim of providing free online access to high-quality research and helping leading academics to make their work visible and accessible to diverse new audiences around the world.”; full info here.

Suggested entry in ./harvest/oai.ini:

[InTech]
url =  http://www.intechopen.com/oai/
metadataPrefix = oai_dc
idSearch[] = "/oai:intechopen.com:/"
idReplace[] = "InTech-"
injectId = "identifier"
injectDate = "datestamp"

Suggested InTech.properties:

; XSLT Import Settings for InTech Open (e-Books) OAI XML files
[General]
; REQUIRED: Name of XSLT file to apply.  Path is relative to the import/xsl directory
; of the VuFind installation.
xslt = InTech.xsl
; OPTIONAL: PHP function(s) to register for use within XSLT file.  You may repeat
; this line to register multiple PHP functions.
php_function[] = utf8_encode
; OPTIONAL: PHP class filled with public static functions for use by the XSLT file.
; The class name must match the filename, and the file must exist in the import/xsl
; directory of the VuFind installation.  You may repeat this line to load multiple
; custom classes.
custom_class[] = VuFind

; XSLT parameters -- any key/value pairs set here will be passed as parameters to
; the XSLT file, allowing local values to be set without modifying XSLT code.
[Parameters]
institution = "My University"
collection = "e-books"
building = ""

Full use of the data provided: InTech.xsl

!! Warning: contains some “hacks” to give direct access to the PDF of the entire e-book, to distinguish e-book chapters from journal articles (<dc:relation>ISBN:0</dc:relation> and other values), etc. Use at your own risk, as they are by no means endorsed or supported in any way by InTech !!

<!-- available fields are defined in solr/biblio/conf/schema.xml -->
<!-- Adapted by / Author: Filipe M S Bento <filben@gmail.com; fsb@ua.pt> -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:php="http://php.net/xsl"
    xmlns:xlink="http://www.w3.org/2001/XMLSchema-instance">
    <xsl:output method="xml" indent="yes" encoding="utf-8"/>
    <xsl:param name="institution">My University</xsl:param>
    <xsl:param name="collection">InTech</xsl:param>
    <xsl:param name="building"></xsl:param>
    <xsl:param name="urlPrefix">http</xsl:param>
	<xsl:template match="oai_dc:dc">
        <add>
            <doc>
                <!-- ID -->
                <!-- Important: This relies on an <identifier> tag being injected by the OAI-PMH harvester. -->
                <field name="id">
                    <xsl:value-of select="//identifier"/>
                </field>

                <!-- RECORDTYPE -->
                <field name="recordtype">ebook</field>

                <!-- FULLRECORD -->
                <!-- disabled for now; records are so large that they cause memory problems!
                <field name="fullrecord">
                    <xsl:copy-of select="php:function('VuFind::xmlAsText', //oai_dc:dc)"/>
                </field>
                  -->

                <!-- ALLFIELDS -->
                <field name="allfields">
                    <xsl:value-of select="normalize-space(string(//oai_dc:dc))"/>
                </field>

                <!-- INSTITUTION -->
                <field name="institution">
                    <xsl:value-of select="$institution" />
                </field>

                <!-- COLLECTION -->
                <field name="collection">
                    <xsl:value-of select="$collection" />
                </field>

                <!-- building -->
                <field name="building">
                    <xsl:value-of select="$building" />
                </field>

                <!-- LANGUAGE -->
                <xsl:if test="//dc:language">
                    <xsl:for-each select="//dc:language">
                        <xsl:if test="string-length() > 0">
                            <field name="language">
                                <!--
                                <xsl:value-of select="php:function('VuFind::mapString', normalize-space(string(.)), 'language_map_oai_utf8.properties')"/>
                                -->
                                <xsl:value-of select="php:function('VuFind::mapString', normalize-space(string(.)), 'language_map_iso639-1.properties')"/>
                            </field>
                        </xsl:if>
                    </xsl:for-each>
                </xsl:if>

                
                <!-- FORMAT / TYPE -->
                					
				
				<xsl:choose>
					<xsl:when test="(//dc:type = 00) and (//dc:relation != 'ISBN:0') and (//dc:description = '1') " >
									<field name="format">Book</field>
					</xsl:when>
					<xsl:when test="(contains(//dc:identifier, 'articles/show/title'))">
									<field name="format">Book Part</field>
					</xsl:when>


					<xsl:when test="//dc:relation = 'ISBN:0'">
									<field name="format">Article</field>
					</xsl:when>
					
				
					<xsl:otherwise>
						<field name="format">Book Part</field>
					</xsl:otherwise>
			
				</xsl:choose>
				

				<field name="format">Online</field>
				

                <!-- SUBJECT -->
                <xsl:if test="//dc:subject">
                    <xsl:for-each select="//dc:subject">
                        <xsl:if test="string-length() > 0">
                            <field name="topic">
                                <xsl:value-of select="normalize-space()"/>
                            </field>
                        </xsl:if>
                    </xsl:for-each>
                </xsl:if>

			<!-- DESCRIPTION -->
			
                <xsl:if test="(contains(//dc:identifier, 'articles/show/title') and (//dc:description != '1'))">
					<field name="description"><xsl:if test="//dc:subject">&lt;b&gt;<xsl:value-of select="//dc:subject"/>&lt;/b&gt; </xsl:if> <xsl:if test="//dc:type &gt; 00">, Chap. <xsl:value-of select="//dc:type"/></xsl:if><xsl:if test="//dc:description">:&lt;br&gt;<xsl:value-of select="//dc:description"/></xsl:if></field>
				</xsl:if>					    								 
				<xsl:if test="((//dc:type &lt; 1) and (//dc:description != '1'))">
					<xsl:if test="//dc:description"><field name="description"><xsl:value-of select="//dc:description"/></field></xsl:if>
				</xsl:if>	
				
				

                <!-- ADVISOR / CONTRIBUTOR -->
                <xsl:if test="//dc:contributor[normalize-space()]">
                    <field name="author2">
                        <xsl:value-of select="php:function('VuFind::InverteNome', string(//dc:contributor[normalize-space()]))" />
                    </field>
                </xsl:if>


                <!-- AUTHOR -->
                <xsl:if test="//dc:creator">
                    <xsl:for-each select="//dc:creator">
                        <xsl:if test="normalize-space()">
                            <field name="author">
                                <xsl:value-of select="php:function('VuFind::InverteNome', normalize-space())"/>
                            </field>
                            <!-- use first author value for sorting -->
                            <xsl:if test="position()=1">
                                <field name="author_sort">
                                     <xsl:value-of select="php:function('VuFind::InverteNome', normalize-space())"/>
                                </field>
                            </xsl:if>
                        </xsl:if>
                    </xsl:for-each>
                </xsl:if>

                <!-- TITLE -->
                <xsl:if test="//dc:title[normalize-space()]">
                    <field name="title">
                        <xsl:value-of select="//dc:title[normalize-space()]"/>
                    </field>
                    <field name="title_short">
                        <xsl:value-of select="//dc:title[normalize-space()]"/>
                    </field>
                    <field name="title_full">
                        <xsl:value-of select="//dc:title[normalize-space()]"/>
                    </field>
                    <field name="title_sort">
                        <xsl:value-of select="php:function('VuFind::stripArticles', string(//dc:title[normalize-space()]))"/>
                    </field>
                </xsl:if>

                <!-- PUBLISHER -->
                <xsl:if test="//dc:publisher[normalize-space()]">
                    <field name="publisher">
                        <xsl:value-of select="//dc:publisher[normalize-space()]"/>
                    </field>
                </xsl:if>

                <!-- PUBLISHDATE -->
                <xsl:if test="//dc:date">
                    <field name="publishDate">
                        <xsl:value-of select="substring(//dc:date, 1, 4)"/>
                    </field>
                        <field name="publishDateSort">
                        <xsl:value-of select="substring(//dc:date, 1, 4)"/>
                    </field>
                </xsl:if>

				
				<!-- ISBN -->
                <xsl:if test="//dc:relation">
					<xsl:if test="(contains(//dc:relation,'ISBN:'))">
						<xsl:if test="//dc:relation != 'ISBN:0'">
							<field name="isbn">
								<xsl:value-of select="substring(//dc:relation, 6, 30)"/>
							</field>
						</xsl:if>
				    </xsl:if>
                </xsl:if>
				
					
				
				<!-- PDF URL of the book chapter-->
				<xsl:if test="//dc:source">							
					  <xsl:if test="contains(//dc:source,'://')">
						   <field name="url"><xsl:value-of select="//dc:source" />
					       </field>
					   </xsl:if>
				</xsl:if>
				
				
				<!-- PDF URL of the entire e-book-->
				<xsl:if test="(contains(//dc:identifier, 'articles/show/title') and (//dc:description != '1'))">
					<xsl:if test="//dc:relation">							
							   <field name="url">http://www.intechopen.com/download/books/books_isbn/<xsl:value-of select="substring(//dc:relation, 6, 30)"/>
							   </field>
						  
					</xsl:if>	
				</xsl:if>	
				
				<!-- container_title => Book Title = dc:subject | only if it has <dc:relation>ISBN:0</dc:relation> = journal article -->
<!--
				<xsl:if test="//dc:relation = 'ISBN:0'">
					<xsl:if test="//dc:subject">
				        <field name="container_title">
						   <xsl:value-of select="//dc:subject"/> 
						</field>		
					</xsl:if>	
				</xsl:if>	
-->			   
            </doc>
        </add>
    </xsl:template>
</xsl:stylesheet>

Note: this .xsl calls a new function, “VuFind::InverteNome” (“InverteNome” is Portuguese for “invert name”). The function builds the inverted form of an author's name when it detects that the name is not already inverted, and it must be added to ./import/xsl/vufind.php (for your convenience, all the code comments have been translated from Portuguese to English :) ):

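public static function InverteNome($in)
    {
        $in = trim($in);

        // Already inverted (e.g. "Bento, Filipe Manuel dos Santos"): keep as-is.
        if (strpos($in, ',') !== false) {
            return $in;
        }

        // Split at the last run of whitespace: everything before it is the
        // forenames, the final word is the last name. array_pad() avoids an
        // undefined-offset notice when the author has only one name.
        list($fnames, $lname) = array_pad(
            preg_split('/\s+(?=[^\s]+$)/', $in, 2),
            2,
            null
        );

        if (is_null($lname)) { // author has only one name
            $text = $fnames;
        } else {
            $text = "$lname, $fnames";
        }

        /** Alternative: keep only the last and first names

        list( $fname, $mname, $lname ) = explode( ' ', $in, 3 );
        if ( is_null($lname) ) // author has only two names
        {
            $lastname = $mname;
        }
        else
        {
            $lname = explode( ' ', $lname );
            $size = sizeof($lname);
            $lastname = $lname[$size-1];
        }

        $text = "$lastname, $fname";

        **/

        return $text;
    }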
    

Extra: Set Up Change Tracking (optional) -- All sources

If you need to track record change dates (see Tracking Record Changes for details), a few extra steps are required (source: eprints):

  • Uncomment the injectDate line in the oai.ini file section.
  • Add these lines to the .properties file:
track_changes = 1
solr_core = "biblio"
  • Add these lines to the .xsl file:

First, after the other parameter declarations:

<xsl:param name="track_changes">1</xsl:param>
<xsl:param name="solr_core">biblio</xsl:param>

Further down, among the other field population code:

<xsl:if test="$track_changes != 0">
    <field name="first_indexed">
        <xsl:value-of select="php:function('VuFind::getFirstIndexed', $solr_core, string(//identifier), string(//datestamp))"/>
    </field>
    <field name="last_indexed">
        <xsl:value-of select="php:function('VuFind::getLastIndexed', $solr_core, string(//identifier), string(//datestamp))"/>
    </field>
</xsl:if>
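
The .properties step above can be automated for several sources. A minimal sketch, assuming the two settings belong in the [Parameters] section and that this section is the last one in the file, so appending at the end places them correctly; the function name is hypothetical:

```shell
# Append the change-tracking settings to a .properties file unless a line
# starting with the same key is already present.
enable_tracking() (
    file=$1
    # Make sure the file ends with a newline before appending.
    [ -z "$(tail -c 1 "$file")" ] || printf '\n' >> "$file"
    grep -q '^track_changes' "$file" || printf 'track_changes = 1\n' >> "$file"
    grep -q '^solr_core' "$file" || printf 'solr_core = "biblio"\n' >> "$file"
)

# Usage: enable_tracking import/InTech.properties
```

Running it twice leaves the file unchanged the second time, so it is safe to apply to every properties file in a loop.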

Shared Index

There has been some discussion about building a shared VuFind index of open content. This is an ambitious project that is currently just in the idea stage. Feel free to comment on the JIRA ticket if you have thoughts on the subject.

Further sources of common interest

A special request: please tell us which sources you know of and would like to see analyzed, so that XSLT files can possibly be written to import their records into VuFind. Suitable sources implement OAI-PMH, offer records for bulk download, or publish a sitemap.xml; other websites may also work if they are crawled and a sitemap.xml is generated with dedicated software or a service such as http://www.xml-sitemaps.com.

Please list them (with an informational URL, if possible) as a comment on the dedicated JIRA ticket mentioned above. Thanks!

Related projects, possible sources of data

Bibliographic and/or usage|circulation data

LOBID: Linking Open Bibliographic Data

This service, courtesy of the North Rhine-Westphalian Library Service Center, provides shareable bibliographic data in linked-data format (see also this page). As of this writing, VuFind does not include tools to ingest this data, but it may be worth investigating in the future.

LibraryCloud

“LibraryCloud is an open, multi-library data service that aggregates and delivers library metadata. We hope it will serve as a platform for the development of Web applications that help all library users (including scholars and re-searchers) find and understand materials.”

“(LibraryCloud) It's a metadata server. It gathers up metadata - information about information - from libraries, museums, and other participating institutions, and makes that metadata available to any application that wants to use it.”

… more info here.

SPLURGE: Scholars Portal Library Usage-Based Recommendation Generation Engine

“Amazon.ca has a “customers who bought this item also bought” feature that recommends things to you that you might be interested in. LibraryThing has it too: the recommendations for What's Bred in the Bone by Robertson Davies include books by Margaret Laurence, Carol Shields, Michael Ondaatje, Peter Ackroyd, John Fowles, and David Lodge, as well as other Davies works.

Library catalogues don't have any such feature, but they should. And libraries are sitting on the circulation and usage data that makes it possible. (BiblioCommons does have a Similar Titles feature, but it's a closed commercial product aimed at public libraries, and anyway the titles are added by hand.)

SPLURGE will collect usage data from OCUL members and build a recommendation engine that can be integrated into any member's catalogue. The code will be made available under the GNU Public License and the data will be made available under an open data license.”

Our thanks to William Denton (Toronto, Canada) for letting us know about this project, and for the information he shared about it on the vufind-tech mailing list.

… more info here.

Linked Data, Linked Open Data (LOD)

Linked Open Data (LOD) may be considered part of the Open Data movement, which aims to make data freely available to everyone. It can be described as a “recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF” (source), and it has recently gained prominence at many (digital) library and archive events and in related discussions.

The presentations and discussions at the recent Tech Trifecta series of conferences (held at Villanova University's Falvey Memorial Library, home of VuFind), particularly at VuFind Summit 2012, showed that LOD is currently a hot topic that interests many institutions. This section presents some selected LOD resources (information, presentations, etc.). Please feel free to add or correct entries as needed.

Linked Open Data: The Essentials - Semantic Web Company (PDF);

LinkingOpenData - W3C SWEO (Semantic Web Education and Outreach) Community Project;

Linked Data - Connect Distributed Data across the Web;

LODLAM - Linked Open Data in Libraries, Archives & Museums;

Europeana Linked Open Data (LOD) | promotional video;

Linked Open Data publication guide (PDF), “report on how to select appropriate encoding strategies for producing Linked Open Data (LOD) enabled bibliographical data” (LODE-BD Recommendations 2.0).

Presentations in Conferences and Seminars

7th IGeLU (The International Group of Ex Libris Users) conference - Zurich, Switzerland, 11 – 13 September 2012

Sharing and Aggregating Social Metadata

A pragmatic use of LOD within VuFind would be as a feasible, lightweight alternative to a shared/common index of open content (mentioned above; please refer to the VUFIND-570 JIRA ticket).

By implementing mechanisms for exposing and harvesting social metadata, VuFind installations would be able not only to share their own UGC (User Generated Content / social metadata) but also to collect social metadata from specific VuFind installations. Please refer to the minutes of the 2012-11-13 developers call for some initial thoughts about this approach.

Related information / projects:

indexing/open_data_sources.1646159002.txt.gz · Last modified: 2022/03/01 18:23 by demiankatz