Hacking VuFind into German Libraries

On June 29 and 30, 2009, some German library developers gathered at Verbundzentrale des GBV in Göttingen for a VuFind hack session. The general goals were:

doing an out of the box VuFind installation for a “typical” German university library
exchanging use cases and experiences
finding out whether there are major blockers or minor technical problems hindering the spread of VuFind (especially in German library settings)
addressing some of these issues with at least proof of concept solutions

Participants were: Matthias Lange, Claus Weiland, Uwe Reh, Erik Altmann, Hannah Ullrich, Wolfgang Uhmann, Jakob Voss, Silvia Munding, Inga Overkamp and Till Kinstler (behind the camera).

Installation

We started the first day by installing VuFind r1194 (checked out from the SVN repository) on a Solaris 10 machine as a warm-up. Some notes taken during installation:

The install script does not run properly on Solaris 10 (e.g. it uses “tar -xzf”, which Solaris tar does not support; use “gtar -xzf” instead, or a pipe through gunzip and tar). Idea: How about a web-based installation?
On the server, XSLT support was not compiled into PHP. It would be nice if the install script checked for all required components (like XSLT support). Recompiling PHP with “--with-xsl” fixed that.
You need to modify the same paths in web/conf/config.ini and web/.htaccess. Maybe this could be done automatically?
Set VUFIND_HOME if you are not installing VuFind in /usr/local/vufind.
You frequently run into missing write access for the Apache user on the web/interface/cache/ and web/interface/compile/ directories used by Smarty. A descriptive error message instead of a blank page would be nice; without a look into the Apache error log you are lost…
We had to change line 20 of httpd-vufind.conf to RewriteRule ^([A-Z][^/]+)/(.+)$ because the original regular expression matched too much, so style sheets and images could not be loaded (their paths were rewritten to /index.php?module=…&action=…). I have never seen that on any Linux system on which I installed VuFind; maybe it is specific to the Apache version (2.2.11) we used?
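To illustrate what the narrowed rewrite pattern does, here is a minimal Python sketch. The pattern is the one quoted in the note above; the request paths are hypothetical examples, and Python's re module only stands in for Apache's regex engine (for a pattern this simple the two should agree).

```python
import re

# Pattern taken from the modified RewriteRule quoted above.
MODULE_ACTION = re.compile(r"^([A-Z][^/]+)/(.+)$")


def rewrite_target(path):
    """Return (module, action) if the path would be rewritten, else None."""
    match = MODULE_ACTION.match(path)
    return match.groups() if match else None


# Hypothetical request paths, relative to the VuFind web directory.
for path in ["Search/Results", "Record/12345/Holdings",
             "css/styles.css", "images/logo.png"]:
    print(path, "->", rewrite_target(path))
```

Because the first path segment must start with an upper-case letter, lower-case static paths such as css/ or images/ fall through untouched, while module/action style URLs are still captured.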

Data import

We then indexed data from Universitätsbibliothek Ilmenau, a typical German university library we had chosen more or less at random. The record dump we used was UTF-8 encoded MARC21 with some local specialties in 9XX fields. It was taken from the Pica CBS (central metadata store) of GBV using standard dumping procedures. The dump had 464083 records. Indexing with the VuFind standard configuration went nicely (no errors) at about 550 records/s.

We then ran some searches and noticed problems with German umlauts: e.g. searching for “göttingen” did not return the expected hits. Some analysis of the phenomenon:

it was not a charset problem in our data or a charset handling problem while indexing (searching “göttingen” directly in Solr returned expected hits)
Apache received g%C3%B6ttingen; Jetty (from VuFind) got g%E3%B6ttingen, similar when searching for ß: %C3%9F in Apache, %E3%9F in Jetty
a cross-check with Apache, VuFind and Jetty/Solr on an Ubuntu x86-64 box confirmed that the problem only occurred when Jetty/Solr were running on our Solaris 10 box (regardless of where VuFind was running)
with latest Solr nightly build: same problem

Although we spent some time investigating the issue, we found no solution. In the end we guessed that Java or Jetty on the Solaris machine was to blame. Maybe it's just a Jetty configuration setting we had wrong (though we did not change the VuFind default config). Has anybody else seen that problem? Any solution?
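The byte values noted above allow one observation, sketched here in Python. %C3%B6 is the correct UTF-8 encoding of “ö”, while %E3%B6 is not valid UTF-8 at all. The two lead bytes differ in a single bit, and 0xC3/0xE3 happen to be the upper- and lower-case forms of the same letter in ISO-8859-1. One possible explanation (a hypothesis only, not something verified at the workshop) is that some component read the UTF-8 bytes as ISO-8859-1 and lower-cased them. The helper name lowercase_as_latin1 is made up for this illustration.

```python
from urllib.parse import quote


def lowercase_as_latin1(raw):
    """Simulate a component that misreads UTF-8 bytes as ISO-8859-1
    and lower-cases the resulting text (hypothetical failure mode)."""
    return raw.decode("latin-1").lower().encode("latin-1")


utf8_bytes = "göttingen".encode("utf-8")
print(quote(utf8_bytes))                 # g%C3%B6ttingen (seen in Apache)

mangled = lowercase_as_latin1(utf8_bytes)
print(quote(mangled))                    # g%E3%B6ttingen (seen in Jetty)

# The same transformation reproduces the values reported for "ß".
print(quote(lowercase_as_latin1("ß".encode("utf-8"))))   # %E3%9F

# The mangled bytes can no longer be decoded as UTF-8.
try:
    mangled.decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```

If this guess is right, the platform default charset or locale of the Java process on the Solaris box would be the first thing to compare with the Ubuntu setup.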

"Availability check" with DAIA

One goal we wanted to achieve was implementing DAIA (Document Availability Information API) in VuFind, at least as a proof of concept. Uwe set up a DAIA interface for the UB Ilmenau circulation module (the OUS module of the Pica LBS 3 ILS), so we had a target to send DAIA requests to. After some source exploration and learning about the current driver model in VuFind, we finally had live availability data from the Ilmenau library in VuFind through a DAIA driver (we tried different approaches). Jakob added some notes on that:

We created a new library driver to access an ILS via DAIA, but it needs some tweaks: the hierarchical DAIA structure must be converted into a flat table structure with properties 'availability', 'status', 'location', 'reserve', 'callnumber', 'duedate', 'number' etc. The current holdings view cannot display all DAIA information, so it would be better to replace it.
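The flattening step could look roughly like the following Python sketch. It is not the driver written at the workshop (VuFind drivers are PHP); the nested input is a simplified, assumed rendering of a DAIA response (document, item, available/unavailable services), and the mapping of DAIA properties to the flat fields is a guess for illustration.

```python
def flatten_daia(response):
    """Turn a nested DAIA-like response into one flat row per item."""
    rows = []
    for document in response.get("document", []):
        for number, item in enumerate(document.get("item", []), start=1):
            # An item counts as available if the "loan" service is offered.
            loan_ok = any(service.get("service") == "loan"
                          for service in item.get("available", []))
            # First expected return date of an unavailable loan service, if any.
            due = next((service["expected"]
                        for service in item.get("unavailable", [])
                        if service.get("service") == "loan"
                        and "expected" in service), "")
            rows.append({
                "id": document.get("id", ""),
                "availability": loan_ok,
                "status": "available" if loan_ok else "not available",
                "location": item.get("storage", {}).get("content", ""),
                "reserve": "N",  # assumption: no reserve information is mapped
                "callnumber": item.get("label", ""),
                "duedate": due,
                "number": number,
            })
    return rows


# Made-up sample data in the assumed structure.
sample = {"document": [{
    "id": "example:ppn:123",
    "item": [
        {"label": "A 100", "storage": {"content": "Reading room"},
         "available": [{"service": "loan"}]},
        {"label": "A 100 (2)", "storage": {"content": "Stacks"},
         "unavailable": [{"service": "loan", "expected": "2009-07-15"}]},
    ],
}]}
rows = flatten_daia(sample)
for row in rows:
    print(row)
```

Everything DAIA can express beyond these columns (several services per item, limitations, queues) is dropped in such a table, which is the limitation of the holdings view mentioned above.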

DAIA aims to be an ILS-independent format for exchanging availability data. Currently, DAIA interfaces are implemented only for Pica/OCLC LBS 3 systems, so this implementation might be a first step towards a “Pica driver” in VuFind. (Pica/OCLC LBS 3, “Pica” for short, is still a common ILS in Germany, with some hundred academic libraries using it. Current LBS 3 installations usually already use some modules of the successor LBS 4, e.g. the OPC4 module, which also provides some basic interfaces for accessing ILS data.)

Data

VuFind is currently tightly bound to MARCish data. Indexing is done using solrmarc, which is, as the name suggests, a tool for processing MARC records. The “full title view”/“details view” is built from fields taken from a stored MARC record. So, at the moment, there is no way to use data in formats other than MARC in (out-of-the-box) VuFind.

Libraries in Germany currently do not use MARC or MARCish data formats. The official data interchange format is still MAB2 (there is a resolution of the German library standardisation board to switch to MARC21 as the official interchange format that dates back to 2004, but implementation takes time and there is no widespread use of MARC21 in the wild yet). The internal data formats used in library systems are often proprietary (like Pica+ in the widely used OCLC/Pica LBS systems) or even MAB2-based (e.g. in German Aleph systems). German libraries make heavy use of linked records to model things like “multi-volume works” (in German: “mehrbändige Werke”), serials, journal articles, “hierarchical works”, links to authority file records, local and item data (call number, local subject headings…) etc. Formats like Pica+ support that quite well.

Those (somewhat) complex data structures need to be mapped to MARC21 and the flat Solr index structure to use them in VuFind. Verbundzentrale des GBV (as well as other libraries and library service centers) maintains mapping tables and conversion routines to export data in MARC21 format (e.g. for exchange with WorldCat). We tried out how these MARC21 record dumps can be used in VuFind. We used the already mentioned dump of Ilmenau university library data and some example records taken from a dump of UB Freiburg's catalog, which comes from a different source. The Ilmenau data had local fields (like call number and item location) in MARC21 9XX fields, while the Freiburg data was split into two files, one holding “title data”, the other holding “local data” in complete MARC21 records, linked by an identifier.

As mentioned above, the Ilmenau data could be indexed without problems. We added Solr index fields for the additional data in 9XX fields, so we could search that content. With the Freiburg data the two files need to be merged before indexing. We coded a proof of concept routine using marc4j that would take local data out of the “local data file” and merge it into the matching MARC records in the “title data file”. That routine may be integrated into a localized Indexer class in solrmarc.
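The merge step for the Freiburg files can be sketched as follows. This is not the marc4j routine from the workshop: records are modelled as plain Python dictionaries, the key names id, title_id and fields are hypothetical, and the field tags in the sample are made up. The actual linking identifier in the Freiburg data is not specified here.

```python
from collections import defaultdict


def merge_local_data(title_records, local_records):
    """Append the fields of each local record to the title record it links to."""
    local_fields = defaultdict(list)
    for local in local_records:
        # Group local fields by the identifier of the title they belong to.
        local_fields[local["title_id"]].extend(local["fields"])

    merged = []
    for title in title_records:
        merged.append({
            "id": title["id"],
            # Title records without local data are passed through unchanged.
            "fields": title["fields"] + local_fields.get(title["id"], []),
        })
    return merged


# Made-up sample records.
titles = [
    {"id": "t1", "fields": [("245", "Example title one")]},
    {"id": "t2", "fields": [("245", "Example title two")]},
]
locals_ = [
    {"title_id": "t1", "fields": [("852", "Call number X 1")]},
    {"title_id": "t1", "fields": [("852", "Call number X 2")]},
]
merged = merge_local_data(titles, locals_)
for record in merged:
    print(record)
```

Indexing the local file first and then streaming the title file keeps the merge to a single pass over each file, provided the local data fits in memory.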

Further exploration of the Ilmenau data in VuFind showed some issues (no surprise): there is no out-of-the-box support for the way serials, multi-volume works or links to authority records are handled in our MARC data. We thought about solutions for several findings (like handling authority data, expanding links from article to journal records etc.) and concluded that the issues we saw could be solved: some on the indexing side (e.g. by expanding links, merging, cutting or marking records), some on the searching or display side (e.g. by expanding links, expanding search terms…), most on either side. As the differences between the Ilmenau and Freiburg data showed, how those specialties are represented in MARC21 depends heavily on the conversion routines applied, and there are local aspects in this data. So very likely there is no universal solution for handling all those issues in VuFind or solrmarc. We briefly discussed whether interfaces or modules could make these things configurable in VuFind/solrmarc, and concluded that a lot needs to be coded individually.
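As an illustration of “expanding links” on the indexing side, the following Python sketch copies the title of a linked journal record into the index document of each article. The record layout and the names parent_id and container_title are hypothetical; real data would carry such links in whatever MARC fields the conversion routine produces.

```python
def expand_links(records):
    """Build flat index documents, copying the linked parent's title into each child."""
    by_id = {record["id"]: record for record in records}
    docs = []
    for record in records:
        doc = {"id": record["id"], "title": record["title"]}
        parent = by_id.get(record.get("parent_id"))
        if parent is not None:
            # Make the article findable by the title of the journal it appeared in.
            doc["container_title"] = parent["title"]
        docs.append(doc)
    return docs


# Made-up sample records.
records = [
    {"id": "j1", "title": "Example Journal"},
    {"id": "a1", "title": "Example Article", "parent_id": "j1"},
]
docs = expand_links(records)
for doc in docs:
    print(doc)
```

The trade-off is the usual one for a flat index: the copied title has to be refreshed whenever the parent record changes.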

Further findings

Solr is currently not a good solution for real-time indexing (that may change, at least for special use cases: http://wiki.apache.org/solr/RealtimeSearch), because adds, updates and deletes need to be “committed” before they become “visible”. But <commit/> is a rather expensive operation in Solr: depending on index size it takes significant time (up to minutes), and after a <commit/> the internal Solr caches that speed up searches need to be warmed again, which takes additional time before searching is fast again. So calling <commit/> too often is not recommended, because it slows Solr down. Still, (near) real-time updates are a feature requested by some libraries/librarians, though use cases for end-user interfaces where they really matter seem rather scarce…
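A common way to live with expensive commits is to batch updates and commit only every so often. The Python sketch below shows the idea under stated assumptions: send is a hypothetical callable that would post an XML update message to Solr's update handler (here it just collects messages in a list), and the thresholds are arbitrary example values.

```python
import time


class CommitBatcher:
    """Forward adds immediately, but issue <commit/> only per batch."""

    def __init__(self, send, max_docs=1000, max_age_seconds=300.0,
                 clock=time.monotonic):
        self._send = send
        self._max_docs = max_docs
        self._max_age = max_age_seconds
        self._clock = clock
        self._pending = 0
        self._first_pending_at = None

    def add(self, doc_xml):
        self._send("<add>" + doc_xml + "</add>")
        if self._first_pending_at is None:
            self._first_pending_at = self._clock()
        self._pending += 1
        age = self._clock() - self._first_pending_at
        # Commit when the batch is full or its oldest update is too old.
        if self._pending >= self._max_docs or age >= self._max_age:
            self.commit()

    def commit(self):
        if self._pending:
            self._send("<commit/>")
            self._pending = 0
            self._first_pending_at = None


sent = []  # stands in for HTTP posts to Solr
batcher = CommitBatcher(sent.append, max_docs=3)
for i in range(7):
    batcher.add('<doc><field name="id">%d</field></doc>' % i)
batcher.commit()  # flush the remainder
print(sent.count("<commit/>"))  # 3
```

Note that the age limit is only checked when a new update arrives, so a real setup would also need a timer (or a final explicit commit, as here) to flush a quiet batch.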

Some conclusions

This is my (Till) personal view. Feel free to discuss it or add your own.

I think we found no major technical issue that makes VuFind unusable in German library settings. The strong binding to MARC or MARCish data is a hurdle, because there is very little native MARC data in the German library wilderness yet (that may change, but within years rather than weeks). Looking at some real-world data converted to MARC21 with current procedures, we hardly found any issue that could not be solved at indexing, search or display time (or at either). But without making that effort, VuFind is not usable out of the box for a German library yet. So the question is whether libraries are willing to invest resources, effort and know-how into VuFind (I doubt it is any different with other NGCs, no matter whether open source or commercial products). The problem is that few resources are available in German libraries to put into such a project (almost all participants of the workshop work on VuFind in their leisure time).

The second big issue, besides the questions around data, is the lack of drivers for common library systems in Germany (like the OCLC/Pica LBS systems). Or, to put it another way, the lack of (easy to implement, open) interfaces/APIs in those systems. With the DAIA driver we saw how easy it can be to make VuFind and a legacy ILS talk to each other. It is worth asking whether ILS vendors are responsible for providing such interfaces, but there seems to be little demand for them from customers yet. So it may be left to libraries to hack those interfaces themselves.

One way to decrease “individual costs” with all that is collaborative problem solving by getting involved in an open source project. That's exactly what VuFind, solrmarc (and other OSS) are about. On the other hand, there are local aspects, especially around data, that can only be solved locally. Some local customization is always necessary. VuFind (like other library software) will very likely never be a “product” you can download, run Setup.exe and have your wonderOPAC ready by some magic. It is all the more important for libraries to have know-how in the internet technologies VuFind uses, to contribute it, and to develop it further in the communities of open source projects.
