VuFind

Fix highlighting functionality.

Details

  • Type: Improvement
  • Status: Resolved
  • Priority: Minor
  • Resolution: Fixed
  • Affects Version/s: None
  • Fix Version/s: 1.1
  • Component/s: Search
  • Labels: None

Description

VuFind has some legacy highlighting code that doesn't actually do anything. We should clean things up and make highlighting a functional, configurable option. Some key steps/notes:

1.) There is no reason to attempt to do highlighting via PHP. We should rely on Solr (and Summon) to do this for us.

2.) We need to be careful about entity encoding. We should use highlighting tokens rather than inline HTML -- this way we can first encode HTML entities in a string and then replace the highlighting tokens with appropriate HTML.

3.) We need to be careful about string cropping -- existing code has some spots where highlighting is combined with string truncation to set a limit on string length. This will need to be revised so that we can truncate strings without creating mismatched HTML tags!

4.) We need to be careful about highlighting tags in search URLs -- for example, we currently use the same string to display author names AND generate author search links. If there are highlighting tags in the Solr response, we need to strip these out of the author name before building a search URL.

5.) Highlighting is achieved with a CSS class, but this class is not currently defined in any of the CSS files. We should define highlighting behavior (probably just font-weight: bold;) and then disable highlighting by default in the config files. This will maintain current behavior while allowing users to turn on highlighting without being forced to edit CSS files.
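
Points 2 through 4 can be illustrated with a small, language-neutral sketch (written in Python here, although VuFind itself is PHP). Everything in it is an assumption made for illustration: the token strings, the "highlight" CSS class name, the function names, and the author URL pattern are hypothetical and are not taken from the VuFind code base. The idea is to keep marker tokens in the raw string, truncate and strip at the token level, and only convert tokens to HTML after entity encoding.

```python
import html
from urllib.parse import quote_plus

# Hypothetical marker tokens: any strings unlikely to occur in real data would do.
START_TOKEN = "{{{{START_HILITE}}}}"
END_TOKEN = "{{{{END_HILITE}}}}"


def render_highlighted(text, css_class="highlight"):
    """Encode HTML entities first, then swap the tokens for real markup."""
    escaped = html.escape(text)  # the tokens contain no characters that get escaped
    return (escaped
            .replace(START_TOKEN, '<span class="%s">' % css_class)
            .replace(END_TOKEN, "</span>"))


def strip_tokens(text):
    """Remove highlighting tokens, e.g. before building a search URL."""
    return text.replace(START_TOKEN, "").replace(END_TOKEN, "")


def author_search_url(author, base="/Author/Home"):
    """Build an author link from a possibly highlighted name (URL shape is assumed)."""
    return "%s?author=%s" % (base, quote_plus(strip_tokens(author)))


def truncate_highlighted(text, limit):
    """Keep at most `limit` visible characters without leaving a span open."""
    out, visible, span_open, i = [], 0, False, 0
    while i < len(text) and visible < limit:
        if text.startswith(START_TOKEN, i):
            out.append(START_TOKEN)
            span_open = True
            i += len(START_TOKEN)
        elif text.startswith(END_TOKEN, i):
            out.append(END_TOKEN)
            span_open = False
            i += len(END_TOKEN)
        else:
            out.append(text[i])
            visible += 1
            i += 1
    if span_open:
        out.append(END_TOKEN)  # close the span that the cut interrupted
    return "".join(out)
```

Because truncation and stripping operate on tokens rather than on HTML, the order of operations stays simple: truncate or strip first, then call render_highlighted once at display time.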

Activity

Demian Katz added a comment -
The WorldCat module poses a challenge here -- the WorldCat API does not provide a highlighting feature, so perhaps we do need to do some PHP-side highlighting for consistency with Solr/Summon searches. However, it may make more sense to do the highlighting in the Worldcat class rather than in the Smarty plug-in so that the behavior can be more easily configurable.
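
For a backend without native highlighting, application-side highlighting could be sketched roughly as follows (in Python for illustration, although VuFind itself is PHP). This is only a sketch under assumptions: the token strings and the function name are hypothetical, and the matching is a plain case-insensitive substring match rather than whatever the existing code does. It emits the same kind of marker tokens a search backend might return, so the display layer can treat all sources alike.

```python
import re

# Hypothetical marker tokens, chosen only for this example.
START_TOKEN = "{{{{START_HILITE}}}}"
END_TOKEN = "{{{{END_HILITE}}}}"


def highlight_terms(text, terms):
    """Wrap each case-insensitive occurrence of a search term in marker tokens."""
    terms = [t for t in terms if t]
    if not terms:
        return text
    # Longest terms first so that overlapping alternatives prefer the longer match.
    alternatives = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in alternatives), re.IGNORECASE)
    return pattern.sub(lambda m: START_TOKEN + m.group(0) + END_TOKEN, text)
```

Note that this sketch does not enforce word boundaries, so a short term can match inside a longer word; a real implementation would likely need to make that behavior configurable.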
Demian Katz added a comment -
Another complication to think about -- in search results, it would be useful to show snippets when highlighted text shows up in a field that is not normally displayed (e.g. table of contents or full text). This poses two challenges:

1.) We have to figure out which fields are covered by normal template display and which need to be added as extra information. This should ideally be handled by the template, since different themes may display different fields... but that may not be practical. We should think carefully about how we pass highlighting information from Solr through PHP to the template so that we can implement smart behavior without overly confusing code.

2.) You can only highlight stored fields -- for something like full text, storing could greatly increase index size, not to mention causing memory problems for PHP when it receives giant chunks of text back in the Solr response. It would be nice to simply exclude long fields from the field list returned by Solr, but there is no easy way to do that -- you can specify which fields are included, but you can't say "all except these." Hard-coding a list of Solr fields into the code would make extending the index more difficult. I can't think of an easy solution to this one -- might be something we have to defer until Solr gets more functionality.
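
To make the field-list constraint concrete, here is a rough sketch of assembling Solr request parameters (in Python for illustration). The parameter names fl, hl, hl.fl, hl.simple.pre and hl.simple.post are standard Solr request parameters; the function name, the token strings, and the field names in the usage note are assumptions for this example only. The returned-field list has to be spelled out explicitly, which is exactly the "include only" limitation described above.

```python
from urllib.parse import urlencode

# Hypothetical marker tokens that Solr is asked to place around matches.
START_TOKEN = "{{{{START_HILITE}}}}"
END_TOKEN = "{{{{END_HILITE}}}}"


def build_highlight_params(query, return_fields, highlight_fields):
    """Assemble Solr parameters with an explicit field list and highlighting."""
    return {
        "q": query,
        # Solr lets us list fields to include; there is no "all except these".
        "fl": ",".join(return_fields),
        "hl": "true",
        "hl.fl": ",".join(highlight_fields),
        # Ask for tokens instead of inline HTML so escaping can happen later.
        "hl.simple.pre": START_TOKEN,
        "hl.simple.post": END_TOKEN,
    }


def to_query_string(params):
    """Encode the parameters for use in a request URL."""
    return urlencode(params)
```

For example, build_highlight_params("whale", ["id", "title", "author"], ["title", "contents"]) requests three returned fields and highlighting on two; the field names are made up and would need to match the actual index schema.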
Demian Katz added a comment (edited) -
As Jeffrey Barnett suggested, it may be helpful to view snippets and highlighting as separate but complementary functions. Some users may want snippets without highlighting, others the reverse. We should allow this, but make sure that the two features work together well when both are active.
Demian Katz added a comment (edited) -
I believe that the attached patch addresses most of the issues discussed in previous comments. It uses Solr's highlighting mechanism in place of PHP-based highlighting, but it does not break the existing PHP-based highlighting used by the WorldCat module. It should not produce bad HTML output, since Solr internally handles truncation and tokens are used to inject HTML highlighting after proper escaping of the rest of the field. Snippet and highlighting behavior can be independently toggled via searches.ini settings. Snippet selection is handled by the record driver, so priorities and exclusions can be set independently for different types of records. Highlighted text and non-highlighted text are handled as separate values in the templates, so there is no need to strip text when using things like author name in URL construction. The only known problem that I have not addressed here is the matter of including snippets from the full text without risking running out of memory -- I'm not sure of an easy solution to that problem, but I don't think anything about this solution should restrict our ability to work on that problem in the future.

Once this Solr-based solution has been approved and committed, I will also update the Summon module to use the Summon API's native highlighting in place of PHP-based highlighting. It should be fairly similar to this solution.
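
The snippet selection described above (priorities and exclusions decided per record type, with an independent on/off switch) might look roughly like the following sketch, written in Python for illustration. It is not the code from the patch: the function name, the parameter names, and the field names in the usage note are hypothetical, and the only assumption about input is that highlighting data arrives as a mapping from field name to a list of fragments, which is how Solr groups highlighting results per document.

```python
def pick_snippet(highlighting, preferred=(), excluded=(), snippets_enabled=True):
    """Return (field, fragment) for the best snippet, or None if there is none.

    highlighting     -- mapping of field name to a list of highlighted fragments
    preferred        -- fields to try first, in priority order
    excluded         -- fields never used as snippets (e.g. already displayed)
    snippets_enabled -- stands in for a configuration toggle
    """
    if not snippets_enabled:
        return None
    # First pass: honor the priority order.
    for field in preferred:
        if field not in excluded and highlighting.get(field):
            return field, highlighting[field][0]
    # Second pass: fall back to any other field that has a fragment.
    for field, fragments in highlighting.items():
        if field not in excluded and field not in preferred and fragments:
            return field, fragments[0]
    return None
```

A record type that already shows title and author in its result template could pass excluded=("title", "author") so that only matches in otherwise hidden fields produce a snippet, while a different record type could supply its own lists.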
Demian Katz added a comment -
Patch committed as r3192; next step: improved Summon highlighting.
Demian Katz added a comment -
Added better Summon highlighting/snippet behavior consistent with Solr functionality as of r3198. No improvements are currently possible for the WorldCat module, so this has been left as-is.
