Details
Type: Improvement
Status: Resolved
Priority: Trivial
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.3
Component/s: OAI
Environment: Some OAI-PMH sources send records that contain no URL pointing to the full text of a document, or even to a page confirming that the document (or part of it) exists somewhere. One or more dc:identifier elements may be present, but none of them holds a URL to the document itself.
Description
This one was a bit of a nightmare, because no <xsl:variable> or <xsl:param> element can be used as a flag within the XSLT (version "1.0", that is).
If one only wants records that the user can access in the library or online, a way to exclude the other records is to bypass the whole transformation and inject an empty record into Solr, or at least a record without the "id" field.
After fully harvesting NDLTD over the past few days, I noticed that many records did not make it into the Solr index. For some of them, I believe the reason is that they have no dc:identifier, or that none of their identifiers is a URL.
The initial code I was using was:
<xsl:for-each select="//dc:identifier">
  <xsl:choose>
    <xsl:when test="contains(., '://')">
      <field name="id"><xsl:value-of select="//identifier"/></field>
    </xsl:when>
  </xsl:choose>
</xsl:for-each>
However, as described in http://vufind.org/jira/browse/VUFIND-499, duplicated "description" fields in the output of the XSL transformation prevent a record from being indexed by Solr, because that field is not multiValued. I reckon the same mechanism is part of the reason these records are not being indexed (even though the import says "Successfully imported /usr/local/vufind/harvest/NDLTD/ .....xml...").
For that to happen, it is enough for a record to have more than one URL in repeated dc:identifier elements (in the harvested XML file). Even when the values are identical, two "id" fields are a no-go for Solr (at least according to the tests I have done).
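One way to guarantee at most one "id" field, even when several dc:identifier elements hold URLs, is to select only the first matching node. The following is a minimal sketch, not the stylesheet shipped with VuFind; it assumes the standard OAI-DC namespace URIs and that using the first URL as the identifier is acceptable (that choice is an assumption made only for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="oai_dc dc">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="oai_dc:dc">
    <add>
      <doc>
        <!-- The parentheses matter: (...)[1] keeps only the first matching
             node in document order, so this loop runs at most once and
             writes at most one "id" field. -->
        <xsl:for-each select="(//dc:identifier[contains(., '://')])[1]">
          <field name="id"><xsl:value-of select="."/></field>
        </xsl:for-each>
      </doc>
    </add>
  </xsl:template>
</xsl:stylesheet>
```

Without the parentheses, //dc:identifier[...][1] would apply the position test per parent element instead of to the whole result, which is not the same thing.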
So until late last night I was fighting with <xsl:variable> and <xsl:param> to see whether one of them could be used as a flag that can be read outside the context where it is set, something like:
<xsl:for-each select="//dc:identifier">
  <xsl:if test="contains(., '://')">
    <xsl:variable name="hasURL">yup</xsl:variable>
  </xsl:if>
</xsl:for-each>
<xsl:if test="$hasURL = 'yup'">
  <field name="id">NDLTD-<xsl:value-of select="substring-after(//identifier,'union.ndltd.org-')"/></field>
</xsl:if>
or many similar variations of the test expression (all kinds of them, I would say).
Placing a <xsl:copy-of select="$hasURL"/> shows the correct value only inside the <xsl:if test="contains(., '://')"> element, so the variable cannot be used as a flag outside it (<xsl:stylesheet version="2.0"> offers a more complete set of elements and tests; 1.0 is very limited).
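The reason is scoping: in XSLT an <xsl:variable> is visible only to its following siblings and their descendants, and it cannot be reassigned once bound. A flag therefore has to be computed in one expression at the level where it will be read. The sketch below illustrates this under the assumption that the standard OAI-DC namespace URIs are in use; the un-namespaced //identifier element and the 'union.ndltd.org-' prefix are taken from the snippet above, and nothing here is claimed to be VuFind's own stylesheet:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="oai_dc dc">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="oai_dc:dc">
    <!-- Bound once at template level, so every following sibling
         instruction in this template can read it. The predicate is
         evaluated per dc:identifier node; boolean() is true when at
         least one node matches. -->
    <xsl:variable name="hasURL"
        select="boolean(//dc:identifier[contains(., '://')])"/>
    <add>
      <doc>
        <xsl:if test="$hasURL">
          <field name="id">NDLTD-<xsl:value-of
              select="substring-after(//identifier, 'union.ndltd.org-')"/></field>
        </xsl:if>
      </doc>
    </add>
  </xsl:template>
</xsl:stylesheet>
```

Because the variable holds a boolean, the test is simply $hasURL rather than a string comparison against a marker value.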
So, getting to the point that matters, I found a very simple way to do it:
(....)
<xsl:template match="oai_dc:dc">
  <xsl:if test="*[contains(//dc:identifier, '://')]">
    <add>
      <doc>
        (....)
      </doc>
    </add>
  </xsl:if>
</xsl:template>
</xsl:stylesheet>
That even allows several URLs to be added to "url" fields, as usual (and leaves no "dead-end"/"cul-de-sac" records). It is as simple as that. Googling did not help much (it actually made things worse) until I found this pearl in a very different situation and could extract the idea from it.
Again, please do share tips on how to do it better if you have any thoughts about it (and if it interests you).
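One caveat about the test expression: in XPath 1.0, contains() converts a node-set argument to the string value of its first node only, so contains(//dc:identifier, '://') inspects just the first dc:identifier in the document. A record whose first identifier is not a URL but whose second one is would be skipped. A possible variant that checks every identifier is sketched below; it assumes the standard OAI-DC namespace URIs and that dc:identifier elements are direct children of oai_dc:dc, and the "url" field output is only an illustration of what the elided part of the template might contain:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="oai_dc dc">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="oai_dc:dc">
    <!-- The predicate runs once per dc:identifier, so the record is kept
         when ANY identifier contains '://', not only the first one. -->
    <xsl:if test="dc:identifier[contains(., '://')]">
      <add>
        <doc>
          <!-- "url" can take several values, so every URL-like identifier
               is written out. -->
          <xsl:for-each select="dc:identifier[contains(., '://')]">
            <field name="url"><xsl:value-of select="."/></field>
          </xsl:for-each>
        </doc>
      </add>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Records without any URL-like identifier produce no <add> element at all, which matches the goal of keeping them out of the index.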
All the best from Aveiro, Portugal,
Filipe
One related issue which might help here -- right now, the XSLT import tool doesn't really know whether or not the record was successfully added to Solr; it only checks whether the POST succeeded. I'll try to find some time to see if there's a way to get better feedback so that we can report errors when the POST succeeds but the actual contents result in an error. If nothing else, we could do a lookup to see if a record with the POSTed id value exists after the operation is completed, though that's not the ideal solution since it would probably slow things down.