[#VUFIND-425] Add support for schema.org microdata on full record page

[VUFIND-425] Add support for schema.org microdata on full record page Created: 26/Jul/11 Updated: 06/Oct/13 Resolved: 04/Oct/13
Status:	Resolved
Project:	VuFind®
Components:	Record
Affects versions:	Wishlist
Fix versions:	2.2

Type:	New Feature
Priority:	Minor
Reporter:	Eoghan Ó Carragáin
Assignee:	Unassigned
Resolution:	Fixed
Votes:	
Labels:	None
Remaining Estimate:	Not Specified
Time Spent:	Not Specified
Original estimate:	Not Specified
Environment:	All
Attachments:	schemaOrg.patch

Description

Schema.org markup would improve the parsing of the full-record page by the major search engines - Google, Yahoo & Bing. See Eric Hellman's fun blog post for more info: http://go-to-hellman.blogspot.com/2011/07/spoonfeeding-library-data-to-search.html

For the most part this only requires trivial alterations to view.tpl, core.tpl and extended.tpl. Correctly marking up some elements would require alterations to the record drivers (e.g. publication details such as publisher, place of publication and date of publication are currently assigned to the template as a single variable). Also, the record's format must be mapped to schema.org types; the attached *demo* patch does this by adding a new method to IndexRecord.php.
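A minimal sketch of what such a format-to-type mapping might look like. It is written in Python purely for illustration (VuFind is PHP, and the attached demo patch is not reproduced here). The format labels and the fallback to CreativeWork are assumptions, not the contents of the patch.

```python
# Hypothetical mapping from catalogue format labels to schema.org types.
# The labels on the left are illustrative; they are not taken from the patch.
SCHEMA_ORG_TYPES = {
    "Book": "http://schema.org/Book",
    "Map": "http://schema.org/Map",
    "Music Recording": "http://schema.org/MusicRecording",
}

# Assumed fallback for formats that have no more specific type.
DEFAULT_TYPE = "http://schema.org/CreativeWork"


def schema_org_type(record_format: str) -> str:
    """Return the schema.org type URI for a format label."""
    return SCHEMA_ORG_TYPES.get(record_format, DEFAULT_TYPE)
```

Keeping the mapping in one table means unmapped formats still get a valid, if generic, type.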

I can't see an easy way to make this markup configurable/optional (as with the use of sitemaps.org) & I'm not sure if better SEO is desirable for all libraries. It might be worth discussing options for adding more semantic structure to VuFind 2.0 in an upcoming developers' call (e.g. Linked Data, schema.org, RDFa, OpenGraph, etc.)?

Comments

Comment by Eoghan Ó Carragáin [ 26/Jul/11 ]

A thread on schema.org & libraries from the W3C Library Linked Data group:
http://lists.w3.org/Archives/Public/public-lld/2011Jun/0004.html

Comment by Tulie Amichal [ 27/Jul/11 ]

Eoghan,

Is the data parsed well by the Google snippet tool? http://www.google.com/webmasters/tools/richsnippets I attempted to add data but it was never parsed well.

Comment by Eoghan Ó Carragáin [ 27/Jul/11 ]

Yes, I think rich snippets is one of the applications Google have for it. From what I've read, schema.org will be Google's preferred way of parsing additional semantics from webpages from now on (and presumably this goes for Bing & Yahoo too). Was the previous markup for rich snippets RDFa? I think it is possible to do a hybrid schema.org/RDFa markup, but would have to read up on it.

p.s. I haven't actually tried this yet, so I don't know how well the parsing works.

Comment by Ronan McHugh [ 09/Jul/12 ]

Hi Demian, We would be interested in implementing this for VF2. Would you be interested in committing such a patch to the trunk?

Comment by Demian Katz [ 09/Jul/12 ]

In theory I would like to see this functionality in the trunk. The problem is I don't want to force people to use it if they don't want to, and I don't see an obvious way of making it configurable. While I understand that there are certain advantages to mixing semantics directly into the markup, it makes code reuse and modularity more difficult. There's no obvious way to "plug in" microdata without adding an extra layer of abstraction on top of templates or putting lots of ugly if statements everywhere. It becomes another component that needs to get replicated between all the themes when changes are made. In general, it adds noise to templates and may make local customizations less convenient.

It may very well be worth those inconveniences if the VuFind community decides there are best practices for microdata use that the majority of users should follow in order to gain benefits like search engine visibility... but if there isn't a clear across-the-board benefit, the concerns above may mean it's not worth the price of trunk inclusion yet. I guess a starting point is to post a patch and discuss from there... and of course if you have ideas of how to include this with more flexibility and less noise, I'm open to ideas!

Comment by Ronan McHugh [ 10/Jul/12 ]

Hi Demian, I was thinking that the way to accomplish this would be via a view helper which could handle the messy ifness in one place. So for example in the record core template where we now have
<th><?=$this->transEsc('Main Author')?>: </th>

We could instead have <th class="<?=$this->schema()->getClass('Main Author')?>"><?=$this->transEsc('Main Author')?>: </th>

The schema helper could then return microdata read in from an ini file, such that the same class could apply different schema standards or none, depending on the user's preferences.

I guess this is already quite a lot of extra logic for the template files, but I don't see any cleaner way to do it immediately. If I do a generic patch, I guess it could be documented and linked to from the wiki without necessarily being committed.
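A rough sketch of the helper idea, in Python for illustration only (a real VuFind view helper would be PHP). The ini layout, the section names and the `SchemaHelper` / `get_attributes` names are all hypothetical. Unlike the `getClass` call proposed above, this variant returns a whole attribute string rather than a class name, which is an assumption made here so that the "none" setting can emit nothing at all.

```python
import configparser
from html import escape

# Hypothetical ini content: [General] picks a standard, and each standard
# has its own section mapping field labels to attribute strings.
EXAMPLE_INI = """
[General]
standard = microdata

[microdata]
Main Author = itemprop="author"

[none]
"""


class SchemaHelper:
    """Looks up markup attributes for a field label from ini settings."""

    def __init__(self, ini_text):
        parser = configparser.ConfigParser()
        parser.optionxform = str  # keep field labels case-sensitive
        parser.read_string(ini_text)
        standard = parser.get("General", "standard", fallback="none")
        has_section = parser.has_section(standard)
        self._attrs = dict(parser[standard]) if has_section else {}

    def get_attributes(self, field):
        """Return the attribute string for a field, or '' if none is set."""
        return self._attrs.get(field, "")


def render_label_cell(helper, field):
    """Render a <th> cell, adding attributes only when configured."""
    attrs = helper.get_attributes(field)
    attr_text = f" {attrs}" if attrs else ""
    return f"<th{attr_text}>{escape(field)}: </th>"
```

With `standard = none` the same template call produces a plain `<th>`, so the conditional logic stays inside the helper.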

Comment by Demian Katz [ 10/Jul/12 ]

If you can find a way to do it with a view helper, that does seem like a smart approach. I was just concerned that there might be too many complexities to easily encapsulate in a view helper (i.e. what if a microdata format needs to apply classes, but the template also needs to apply other classes? what if a microdata format uses an HTML5 custom attribute? different formats might have different requirements that would be hard to abstract away into a single helper call). But if you think you can make this work with a helper, a patch speaks louder than my vague theoretical concerns.

Comment by Eoghan Ó Carragáin [ 22/May/13 ]

At the risk of introducing more vague theory without another patch ...

I think we may have been over-thinking the templating issue. If you look at schema.org in the Worldcat schema.org implementation (e.g. [1][2][3]), the encoded information is presented in a different part of the page from the main body. In other words we could have a configuration setting to enable schema.org on VuFind record pages which would simply include a new "Structured Data" section as a block (and possibly modify the <body /> tag slightly; see below). This would duplicate a lot of the information presented in the normal record page template but gets around the code reuse/modularity issues.

We'd probably need to give the semantics more thought though. For example in the Worldcat html view (http://www.worldcat.org/title/zen-mind-beginners-mind/oclc/136259) the <body /> element uses schema.org to identify it as a http://schema.org/WebPage about a resource identified by http://www.worldcat.org/oclc/136259: <body id="worldcat" typeof="http://schema.org/WebPage" property="http://schema.org.about" resource="http://www.worldcat.org/oclc/136259">.

The "Linked Data" section on the http://www.worldcat.org/title/zen-mind-beginners-mind/oclc/136259 HTML page marks up information about the book itself: <div resource="http://www.worldcat.org/oclc/136259" typeof="http://schema.org/Book">

Putting http://www.worldcat.org/oclc/136259 in a browser automatically redirects to http://www.worldcat.org/title/zen-mind-beginners-mind/oclc/136259. Similarly, requesting http://www.worldcat.org/oclc/136259 with curl (either without a header or specifically requesting application/rdf+xml) gives an HTTP/1.1 303 See Other with a location of /title/zen-mind-beginners-mind/oclc/136259.

We should probably accommodate this kind of content negotiation if including schema.org as a configurable option in VuFind. We could:
-- define another module for non-information resources, i.e. http://yourdomain/Data/[RecordId] which automatically redirects to /Record/[RecordId].
-- alternatively, it might be sufficient to distinguish the non-information resource from the webpage by appending a fragment identifier, e.g. /Record/[RecordId]#thing
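The first option could be sketched roughly as follows, in Python for illustration only. The `/Data/` and `/Record/` paths come from the comment above. The `resolve` function, its return shape and the 200 response for every other path are simplifying assumptions, not VuFind routing code.

```python
import re

# Matches /Data/<RecordId>, the proposed path for the "thing" itself.
DATA_PATH = re.compile(r"^/Data/(?P<record_id>[^/]+)$")


def resolve(path):
    """Return (status, headers) for a request path.

    A request for the non-information resource is answered with
    303 See Other pointing at the record page that describes it.
    """
    match = DATA_PATH.match(path)
    if match:
        location = f"/Record/{match.group('record_id')}"
        return 303, {"Location": location}
    # Assumption: every other path is served normally.
    return 200, {}
```

The fragment-identifier option needs no server-side handling, since the fragment is never sent to the server.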

Adding this type of content negotiation is a step towards more formal linked data & could be re-used if VUFIND-500 is tackled. However, is it potentially a bad idea for lots of VuFind instances to mint URIs for things if there isn't a strong organisational commitment to maintain those URIs? Where the record contains an OCLC number there would be no need to mint a local URI for the book as we could reuse the OCLC URI. Not all records have OCLC numbers though.

It would be worth consulting the schema.org bibextend group [4], but maybe we should discuss on a developers call in the first instance to see if there is interest.

[1] http://www.worldcat.org/oclc/136259
[2] http://www.worldcat.org/title/zen-mind-beginners-mind/oclc/136259
[3] http://rdf-translator.appspot.com/convert/rdfa/xml/html/http%3A%2F%2Fwww.worldcat.org%2Foclc%2F136259
[4] http://www.w3.org/community/schemabibex/

Comment by Demian Katz [ 22/May/13 ]

This definitely makes sense as a conservative approach to the problem.

One possible approach to the "minting URIs" issue could be to offer another configuration option that only displays microdata for records that have OCLC numbers -- if you can't link to WorldCat, don't bother trying. This would connect VuFind into the data web without establishing any new permanent URIs. (Not sure if it would really make sense to do this -- but just a thought).

This is definitely worth discussing further on the developers call. Might you be able to attend in the near future?

Comment by Eoghan Ó Carragáin [ 22/May/13 ]

I think it would make sense to allow linking only to Worldcat as an option. In fact we could have a minimalist option not to link to non-informational resources at all. After all, the schema.org examples don't: http://schema.org/Book

Option 1: Standard schema.org markup, i.e. <body itemscope itemtype="http://schema.org/WebPage"> and a "Structured Data" section wrapped in <div itemscope itemtype="http://schema.org/Book"> (or other type as appropriate). Offers the benefit of exposing structured data to search engines without having to worry about maintaining URIs etc.

Option 2: As above with links to external URIs (OCLC, ISBN, etc.) where available in the record. Offers more integration with web of data but still a lightweight option in terms of maintenance/commitment.

Option 3: As above with local URIs for the things and content negotiation, etc. This could possibly make use of the Authority core for additional local URIs etc.

The last option would need more thought but 1 and 2 seem pretty worthwhile in their own right. I'm not around for the call on the 28th of May but can attend the next one after that.
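As a hedged illustration of how Options 1 and 2 differ in output, the sketch below (Python, for illustration only) renders a "Structured Data" block and adds an external identifier only when the record has an OCLC number. The `render_book` function is hypothetical, and using the microdata `itemid` attribute to carry the WorldCat URI is an assumption about one possible way to express the link.

```python
from html import escape


def render_book(title, oclc_number=None):
    """Render a minimal schema.org Book block as microdata.

    Without an OCLC number this is Option 1 (no URIs at all).
    With one, the WorldCat URI is attached via itemid (Option 2).
    """
    attrs = 'itemscope itemtype="http://schema.org/Book"'
    if oclc_number:
        uri = f"http://www.worldcat.org/oclc/{escape(str(oclc_number))}"
        attrs += f' itemid="{uri}"'
    name = f'<span itemprop="name">{escape(title)}</span>'
    return f"<div {attrs}>{name}</div>"
```

Records without an OCLC number simply fall back to the Option 1 output, so no local URIs are minted.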

Comment by Demian Katz [ 22/May/13 ]

Sounds good -- we'll talk more on June 11th!

Comment by Eoghan Ó Carragáin [ 12/Jun/13 ]

Two new developments which may be relevant:

1) Schema.org now accepts json-ld serialization: http://blog.schema.org/2013/06/schemaorg-and-json-ld.html

2) Worldcat is now doing full-blown content negotiation: http://dataliberate.com/2013/06/content-negotiation-for-worldcat/

Comment by Dan Scott [ 05/Aug/13 ]

I've been working on schema.org integration in Evergreen for the past year and a half, and recently have been active on the W3C Schema BibExtend community group and the public-vocabs list.

I'm very interested in helping teach VuFind how to generate structured data, and I think the best option would be to support at least a minimal level of structured data by default. I would be happy to contribute thought and code towards this effort.

Comment by Demian Katz [ 06/Aug/13 ]

Thanks, Dan! Since Eoghan has been the person putting the most thought into this so far on the VuFind end, perhaps the two of you should compare notes and come up with a plan to move this forward. I'll be happy to consult as needed, but I have less of a clear vision on exactly where to begin -- certainly Eoghan's options 1 and 2 listed above seem like worthwhile starting points.

Comment by Eoghan Ó Carragáin [ 08/Aug/13 ]

Thanks, Dan! It would be great to get the benefit of your experience on this.

I don't think this will be complex to code once we decide the minimal level of useful structured data. Do Options 1, 2, 3 above make sense, & which one do you think we should aim for to begin with? Is there some background on the Evergreen integration we could read?

Feel free to follow up directly: eoghan.ocarragain@gmail.com

Cheers.

Comment by Dan Scott [ 13/Aug/13 ]

Hi Eoghan & Demian: Sorry for the delay in replying; I was getting VuFind 2 up and running (naturally, using the Evergreen integration!) on my laptop today so I could do some actual local development :)

Option 1 (albeit without the WebPage type--that seemed redundant) is pretty much what we're doing with Evergreen currently, and it is (IMO) a great first step. I would recommend RDFa Lite over microdata as the markup approach here, for what it's worth.

Option 2 makes perfect sense to me. I think some further low-hanging fruit here (at least for MARC-based records) might be in recognizing org codes in the $0 subfield for authorized headings and linking out to the pertinent URIs, where available. For example, something like a 700 $a Bach, Johann Sebastian $d 1685-1750 $0 (DLC)n79021425 then link out to http://id.loc.gov/authorities/names/n79021425 (of course in practice we might not find many bib records that actually record the $0).
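The $0 example above can be sketched as a small lookup, in Python for illustration only. The `authority_uri` function is hypothetical, and it deliberately handles only the shape shown in the example: a (DLC) prefix followed by "n" and digits. Anything else returns None rather than guessing at a URI.

```python
import re

# Matches the example shape only: "(DLC)" followed by n + digits.
DLC_NAME = re.compile(r"^\(DLC\)\s*(n\d+)$")


def authority_uri(subfield_0):
    """Map a $0 value such as '(DLC)n79021425' to an id.loc.gov URI.

    Returns None when the value does not match the expected shape.
    """
    match = DLC_NAME.match(subfield_0.strip())
    if match is None:
        return None
    return f"http://id.loc.gov/authorities/names/{match.group(1)}"
```

Returning None for unrecognised org codes lets the template fall back to plain text when no link is available.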

Option 3 is a stretch goal, but given that you have already taught VuFind about local authority support, parts of it might be much easier to achieve in the short term.

Given that the complete set of metadata isn't displayed on a VuFind page all at once, but that users are required to click through the tabs and reload the page to display description / table of contents etc, we might be in a situation where offering up supplementary JSON-LD to augment those tabs makes sense. I'll be honest; I'm not a fan of the way that OCLC rolled out structured data to WorldCat, and although there are some tepid claims of support for JSON-LD on behalf of the schema.org search engines, it seems likely that that support is oriented towards JSON-LD stored in <script> tags (as in the way that Gmail is making actions visible) and that pages that stuff JSON-LD under a hidden tab might not benefit as much. However, we shall do what we can!
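For reference, JSON-LD embedded in a <script> tag might be produced along these lines. This is a sketch in Python for illustration only; the `json_ld_script` function and the choice of fields are assumptions, not part of any VuFind branch.

```python
import json


def json_ld_script(title, description=None):
    """Serialise a minimal schema.org Book as a JSON-LD <script> element."""
    data = {
        "@context": "http://schema.org",
        "@type": "Book",
        "name": title,
    }
    if description:
        data["description"] = description
    payload = json.dumps(data, ensure_ascii=False)
    # Escape "</" so record text cannot close the <script> element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'
```

Because the block is separate from the visible markup, it could carry fields from tabs that are not currently displayed.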

So, to answer your question Eoghan: I think we should begin with 1 & 2. I'll be happy to work up some branches towards that end!

Comment by Dan Scott [ 19/Aug/13 ]

Okay, I've been plugging away at enhancing the schema.org structured data in Evergreen and have put together a comparable version for VuFind.

I've pushed two commits on top of current master to the branch at https://github.com/dbs/vufind/tree/structured_data which add:

Bibliographic record mappings (schema.org = VuFind):

* type = based on Eoghan's patch, but putting RDFa's support for multiple types to good use
* name = the entire title (rather than trying to guess at what may or may not be the more correct shorter title)
* description = summary
* keywords = subjects
* author = authors[main]
* creator = authors[corporate]
* contributor = authors[secondary]
* bookEdition = edition

Bibliographic holdings (schema.org = VuFind) based on http://schema.org/Offer:

* availability = reserve or availability
* seller = location (expected to be a library)
* serialNumber = barcode
* sku = callnumber
* businessFunction = http://purl.org/goodrelations/v1#LeaseOut
* itemOffered = the described bibliographic record

These mappings are based on the emerging recommendation from the W3C Schema Bib Extend community group to use Offer to represent individual holdings. The goal is to have library holdings appear in the same context as online sales for a given item (recording, book, etc), with the businessFunction indicating that the library items are in fact not for sale, but instead for borrowing. In addition, by standardizing how we express holdings in schema.org, it makes it more possible for other systems such as meta-discovery layers or browser plugins to detect holdings, rather than requiring individual screen scraping techniques for each catalogue and discovery layer.
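The holdings mapping listed above can be written out as a small data transformation. This sketch is in Python for illustration only and does not reproduce the code in the branch. The input keys (`location`, `barcode`, `callnumber`) follow the VuFind names in the list, while the `holding_to_offer` function and the dictionary output are assumptions. The availability property is left out because the list does not spell out how reserve and availability are combined.

```python
LEASE_OUT = "http://purl.org/goodrelations/v1#LeaseOut"


def holding_to_offer(holding, record_uri):
    """Map one holdings row to schema.org Offer properties.

    holding    -- dict with 'location', 'barcode' and 'callnumber' keys
    record_uri -- identifier of the bibliographic record being offered
    """
    return {
        "@type": "Offer",
        "seller": holding["location"],       # expected to be a library
        "serialNumber": holding["barcode"],
        "sku": holding["callnumber"],
        "businessFunction": LEASE_OUT,       # lent out, not sold
        "itemOffered": record_uri,
    }
```

A consumer such as a meta-discovery layer could then treat every catalogue's holdings the same way, which is the point of standardising on Offer.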

Comment by Dan Scott [ 19/Aug/13 ]

I should note that you can find an example record with the markup as extracted from my local system at http://stuff.coffeecode.net/schema.org/vufind/holdings.html (might be useful for running RDFa validators / Rich Snippets tools against if you don't want to apply the branch locally).

Comment by Dan Scott [ 22/Aug/13 ]

And I've pushed one more commit, after doing some work with the Koha community and getting some advice from the RDFa community, to change from nesting of properties within <a href> elements (which by default creates a new chained "item" and ends up attaching the property to the wrong item), to wrapping the <a href> elements with the desired properties.

As promised on the developer call, here are some tools that you can use to validate the RDFa / schema.org properties:

* RDFa Tools: http://rdfa.info/tools
  ** The Ruby parser accepts a URL or copy/paste input. Note that the Ruby parser is rather particular about expecting <!DOCTYPE html> in line 1, or it will complain; you can manually set it to Input Format: rdfa to overcome that
  ** The Python parser accepts a URL or copy/paste input
  ** RDFa Play lets you copy/paste input and generate pretty visuals; good if you want to try and tweak
* Google Rich Snippets: http://www.google.com/webmasters/tools/richsnippets

Comment by Dan Scott [ 02/Oct/13 ]

I just tried merging the branch in question with master (post-bootstrap merge) and it applied cleanly without any conflicts. I have also confirmed that the output is still as desired. Huzzah!

So, you could just merge https://github.com/dbs/vufind/tree/structured_data - or if you prefer to cherry-pick from a rebased branch, you could pull in https://github.com/dbs/vufind/tree/bootstrap_rdfa ; the results will be identical.

Comment by Demian Katz [ 03/Oct/13 ]

I don't see any changes to the bootstrap theme in either of the branches you reference. Are the RDFa changes still only in blueprint, or am I missing something? Let me know if you need more information (or work) from me!

Comment by Dan Scott [ 03/Oct/13 ]

Heh, I _thought_ things didn't look very different; for whatever reason, I believed that when bootstrap was merged to master that it would also become the new default theme. Okay, so the takeaway for me is that each theme needs to be taught about RDFa separately.

Now that I've switched over to the bootstrap theme in config.ini I'll put together a branch that actually addresses bootstrap. Sorry for the noise!

Comment by Demian Katz [ 03/Oct/13 ]

Not a problem!

Comment by Dan Scott [ 03/Oct/13 ]

Okay, _now_ https://github.com/dbs/vufind/tree/bootstrap_rdfa contains the changes for the bootstrap theme that parallel the blueprint theme exactly. Thanks for the patience :)

A note in passing: with elements like "$publications" containing just a single string with the publisher, place, and date, I can't be as fine-grained as I would like to be in breaking out the elements. Perhaps at a later point we can look into ways to keep things simple for the simple cases while enabling the templates to get granular; for example, if getPublicationDetails() returned an array of objects instead of strings, where the __toString() method of the object handles the trimming and cleanup (and thus everything keeps working as expected for customized templates), but the publisher / place / date are all available as member variables for those templates that want to mark each one up separately inline.
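The object-with-string-conversion idea can be sketched as follows. This is Python for illustration only, with `__str__` standing in for PHP's `__toString()`. The `PublicationDetails` class, its field names and the sample values in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicationDetails:
    """One publication statement, usable as a string or field by field."""

    place: str = ""
    publisher: str = ""
    date: str = ""

    def __str__(self):
        # Trim each part and skip empty ones, so existing templates that
        # print the object get one clean string as before.
        parts = (self.place, self.publisher, self.date)
        return " ".join(p.strip() for p in parts if p.strip())
```

A template that only prints the value keeps working, for example `str(PublicationDetails("Dublin :", "Example Press,", "2013."))`, while a template that wants granular markup can read `.place`, `.publisher` and `.date` separately.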

Comment by Dan Scott [ 03/Oct/13 ]

Forgot to mention that an example bootstrap record page is visible at http://stuff.coffeecode.net/schema.org/vufind/vufind_bootstrap.html for poking and prodding. Although it looks funny to humans due to the missing CSS/JS/images/etc :)

Comment by Chris Hallberg [ 03/Oct/13 ]

The bootstrap stuff looks good! I just want to be sure that 'vocab' and 'property' are the correct attributes, since the article that this ticket opened with (http://go-to-hellman.blogspot.com/2011/07/spoonfeeding-library-data-to-search.html) and all the references for microdata I've seen use 'itemscope' and 'itemprop'.

If this is correct, the schema.org matching is excellent and I'd be excited to have this!

Comment by Dan Scott [ 03/Oct/13 ]

Chris: Yes, I opted for RDFa instead of microdata, so "vocab", "typeof", and "property" are the desired terms instead of "itemscope", "itemtype", and "itemprop" (the microdata equivalents, roughly).

schema.org processors like Google, Yahoo/Microsoft, and Yandex handle both RDFa and microdata just fine, but RDFa arguably offers a little more flexibility for future expansion.
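To make the rough attribute correspondence concrete, the sketch below (Python, for illustration only) renders the same name property in either syntax. The `render_name` function is hypothetical, and the correspondence is approximate, as noted above.

```python
from html import escape


def render_name(title, syntax="rdfa"):
    """Render a Book name in RDFa Lite or microdata.

    RDFa Lite:  vocab + typeof        and property
    microdata:  itemscope + itemtype  and itemprop
    """
    text = escape(title)
    if syntax == "rdfa":
        return (
            '<div vocab="http://schema.org/" typeof="Book">'
            f'<span property="name">{text}</span></div>'
        )
    if syntax == "microdata":
        return (
            '<div itemscope itemtype="http://schema.org/Book">'
            f'<span itemprop="name">{text}</span></div>'
        )
    raise ValueError(f"unknown syntax: {syntax}")
```

Note that RDFa declares the vocabulary once and uses a short type name, whereas microdata puts the full type URL in itemtype.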

Comment by Demian Katz [ 04/Oct/13 ]

Thanks for all the work and the suggestions, Dan.

I've merged your code into master and made a few small adjustments.

I've also implemented your suggestion regarding publication details -- see VUFIND-911. If you would now like to open a new pull request with additional markup, please feel free.

I'm closing this ticket since we now have a good baseline in place; please feel free to open new tickets addressing more specific improvements/fixes.

Comment by Chris Hallberg [ 04/Oct/13 ]

Dan: Excellent! Looks really good.

Comment by Eoghan Ó Carragáin [ 06/Oct/13 ]

This is great - Dan, thanks very much for taking this on & getting everything into master so quickly! Hopefully having VuFind as another example to play with will help with the great work being done by the schemabibex group. Cheers!

