[VUFIND-454] Index websites with VuFind Created: 23/Sep/11  Updated: 19/Aug/13  Resolved: 26/Jun/13

Status: Resolved
Project: VuFind®
Components: Search
Affects versions: None
Fix versions: 2.1

Type: New Feature Priority: Major
Reporter: Demian Katz Assignee: Demian Katz
Resolution: Fixed Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: TRUNK_tika_xslt_25-09-12.patch, WebResults.php, WebResults.tpl, blueprint-searchbox.patch, result.tpl, websearch-solr34.patch, websearch-solr34b.patch, websearch.patch, website-for-vufind-1-2.patch

 Description   
The attached patch adds a new Solr core to VuFind which can be used for indexing your local website with the help of Aperture.

How to Use:

1.) Make sure you have Aperture installed and configured -- see http://vufind.org/wiki/aperture

2.) Edit web/conf/webcrawl.ini and list the sitemap.xml files you wish to harvest. (Currently, this module requires sitemap files in order to harvest content.)

3.) Customize import/xsl/VuFindSitemap.php if you want to create special rules for generating facet values based on URLs or metadata within pages.

4.) Run import/webcrawl.php. This may take quite some time.

5.) When crawling is done, go to http://vufind_server/vufind/Web/Results -- you can enter a search in the box here.
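
As a rough illustration of what step 2 relies on, the sketch below pulls page URLs out of a sitemap.xml document using the standard sitemaps.org namespace. It is not the code from the attached patch; the function name extract_sitemap_urls and the sample URLs are made up for this example.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(sitemap_xml: str) -> List[str]:
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Hypothetical sample data, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/</loc></url>
  <url><loc>http://example.org/about</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE):
        print(url)
```

A site that publishes no sitemap.xml gives a harvester like this nothing to iterate over, which is why the module currently requires one.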

General notes:

1.) This patch was generated against the trunk, r4284. It should be compatible with VuFind 1.1 or 1.2 with minor modifications -- most notably the need to add a protected update() wrapper method to web/sys/Solr.php (to allow access to the private _update from child classes).

2.) Crawling is currently done by brute force -- we pull down all pages and index them, then delete pages that were indexed prior to the start of the process (to eliminate obsolete/missing pages). This works fine for small or medium websites, but it probably won't scale indefinitely -- we may eventually want to build a more intelligent crawler tool.

3.) The search form at the Web/Results URL is obviously very crude -- you are expected to customize this page to make it more useful. You might instead wish to create some custom code that integrates the web search option into VuFind's main search box (this has been done at Villanova -- see https://library.villanova.edu/Find for an example).
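
The brute-force strategy from note 2 can be sketched as follows. This is a toy model under stated assumptions, not the patch's implementation: InMemoryIndex stands in for the Solr core, and the names recrawl, fetch and now are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


class InMemoryIndex:
    """Toy stand-in for a Solr core: maps URL -> content and index time."""

    def __init__(self) -> None:
        self.docs: Dict[str, dict] = {}

    def add(self, url: str, content: str, indexed_at) -> None:
        self.docs[url] = {"content": content, "last_indexed": indexed_at}

    def delete_indexed_before(self, cutoff) -> List[str]:
        """Remove and return every URL whose last index time precedes cutoff."""
        stale = [u for u, d in self.docs.items() if d["last_indexed"] < cutoff]
        for url in stale:
            del self.docs[url]
        return stale


def recrawl(index: InMemoryIndex, urls: Iterable[str],
            fetch: Callable[[str], str], now: Callable[[], int]) -> List[str]:
    """Reindex every URL, then drop documents this run did not touch."""
    started = now()  # remember when the crawl began
    for url in urls:
        index.add(url, fetch(url), now())
    # Anything still carrying an older timestamp was not seen in this run.
    return index.delete_indexed_before(started)


if __name__ == "__main__":
    tick = iter(range(1, 100)).__next__  # fake clock for the demo
    pages = {"/a": "A", "/b": "B"}
    index = InMemoryIndex()
    recrawl(index, ["/a", "/b"], pages.get, tick)
    # "/b" is no longer listed on the second run, so it is removed.
    print(recrawl(index, ["/a"], pages.get, tick))
```

Every page is fetched on every run, so the cost grows linearly with the size of the site, which is the scaling concern raised in note 2.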

 Comments   
Comment by Demian Katz [ 23/Sep/11 ]
Attaching website-for-vufind-1-2.patch, a modified version of the patch that will apply cleanly to VuFind 1.2. Continue to use the previous version of the patch if you are working with a more recent trunk.
Comment by Demian Katz [ 21/Oct/11 ]
Attaching websearch-solr34.patch, an updated version of the patch reflecting the trunk's recent upgrade to Solr 3.4.
Comment by Filipe MS Bento [ 29/May/12 ]
Hi Demian!

Following up on our e-mails about sitemaps: since this is not in the VuFind trunk, how safe is it to apply websearch-solr34.patch over a VF 1.3 release? (I ask because of this: "It should be compatible with VuFind 1.1 or 1.2 with minor modifications".) Or was this already included in 1.3?

Anyway, fulltext.ini is there, but none of the other files are, from a quick browse.

I have /usr/local/aperture-1.6.0/bin ready to go (webcrawler.sh), but should I try to merge manually / create the new files (if so, could you share a tarball of them) or just go ahead and apply this last patch?

Many thanks,

Filipe

PS: this functionality is not being dropped in any way, right? :)
Comment by Demian Katz [ 30/May/12 ]
I believe that websearch-solr34.patch should work with release 1.3 (it was created against a pre-1.3 trunk). If you have any problems, it may be necessary to edit the new Solr configuration files to use 3.5 version numbers instead of 3.4 because we have updated Solr since the patch was created; however, I don't think a slightly outdated version number in the configuration should prevent things from functioning.

If you run into any more complicated problems applying the patch, let me know and I'll try to find some time to update the patch for 1.3 and/or the current trunk.
Comment by Filipe MS Bento [ 30/May/12 ]
Thanks, Demian!

Going to try it out.

Regarding Solr 3.4 > 3.5, it should just be a matter of changing

   <luceneMatchVersion>LUCENE_34</luceneMatchVersion>

to

   <luceneMatchVersion>LUCENE_35</luceneMatchVersion>

in ./solr/website/conf/solrconfig.xml

as the other solrconfig.xml files have, right?

Many thanks, once again,

Filipe
Comment by Demian Katz [ 30/May/12 ]
Exactly right, Filipe.
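
For anyone who prefers to script the version bump described above, here is a minimal sketch. It assumes the file contains a single luceneMatchVersion element; the helper name set_lucene_match_version is made up for this example and is not part of VuFind or Solr.

```python
import re


def set_lucene_match_version(solrconfig_text: str, version: str) -> str:
    """Replace the value of the first <luceneMatchVersion> element."""
    pattern = re.compile(r"(<luceneMatchVersion>)[^<]*(</luceneMatchVersion>)")
    updated, count = pattern.subn(
        lambda m: m.group(1) + version + m.group(2),
        solrconfig_text,
        count=1,
    )
    if count == 0:
        raise ValueError("no <luceneMatchVersion> element found")
    return updated


if __name__ == "__main__":
    before = "   <luceneMatchVersion>LUCENE_34</luceneMatchVersion>"
    print(set_lucene_match_version(before, "LUCENE_35"))
```

To apply it, read ./solr/website/conf/solrconfig.xml, pass the text through the function, and write the result back, keeping a backup copy of the original file.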
Comment by Filipe MS Bento [ 30/May/12 ]

Demian,

Just a remark: the patch doesn't create "protwords.txt" in ./solr/website/conf/, even though ./solr/website/conf/schema.xml refers to it. The file can be empty, but without it Solr just doesn't start (well, unless one includes

 <abortOnConfigurationError>false</abortOnConfigurationError>

in solr.xml).

Everything else went OK (I haven't yet tried indexing pages from an actual site, but that should be OK too).

Thanks,

Filipe
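
A quick way to catch this kind of problem before starting Solr is to list the word-list files that schema.xml refers to and check that each one exists. The sketch below is only an illustration: it does a plain regex scan for the words, protected and synonyms attributes (so it does not skip commented-out XML), and the helper names are made up.

```python
import re
from pathlib import Path
from typing import List

# Attribute names that Solr analysis factories commonly use for word lists.
FILE_ATTRS = ("words", "protected", "synonyms")


def referenced_files(schema_text: str) -> List[str]:
    """Collect file names referenced via words=/protected=/synonyms=."""
    pattern = re.compile(r'\b(?:%s)="([^"]+)"' % "|".join(FILE_ATTRS))
    seen: List[str] = []
    for name in pattern.findall(schema_text):
        if name not in seen:  # keep first-seen order, drop duplicates
            seen.append(name)
    return seen


def missing_files(schema_text: str, conf_dir: Path) -> List[str]:
    """Return the referenced files that do not exist in conf_dir."""
    return [n for n in referenced_files(schema_text)
            if not (conf_dir / n).exists()]


if __name__ == "__main__":
    conf_dir = Path("./solr/website/conf")
    schema = conf_dir / "schema.xml"
    if schema.exists():
        for name in missing_files(schema.read_text(), conf_dir):
            print("missing:", name)
```

Creating an empty file for each reported name, for example with Path.touch(), is enough to satisfy the reference.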
Comment by Ronan McHugh [ 25/Sep/12 ]
VuFind Sitemap updated to enable use of Tika instead of Aperture. Significant refactoring to enable sharing of methods between Tika and Aperture parsing. It also now inherits from Vufind.php.
UPDATE - edited to replace explicit references to static class with self::
Comment by Demian Katz [ 06/Dec/12 ]
When moving forward with the development of this patch, we need to remember to update the YAML file to comply with VUFIND-710 and VUFIND-712.
Comment by Demian Katz [ 07/Feb/13 ]
Attaching mobile version of Web/result.tpl courtesy of Nathan Tallman.
Comment by Demian Katz [ 18/Apr/13 ]
Updated patch against later post-1.4 trunk (websearch-solr34b.patch).
Comment by Nathan Tallman [ 03/May/13 ]
Service to provide website recommendations when searching catalog. (web/sys/Recommend/WebResults.php)
Comment by Nathan Tallman [ 03/May/13 ]
Blueprint template for website recommendations when searching catalog. (web/interface/themes/blueprint/Search/Recommend/WebResults.tpl)
Comment by Nathan Tallman [ 03/May/13 ]
When implementing WebResults.php and WebResults.tpl, you need to add "default_side_recommend[] = WebResults" to web/conf/searches.ini. All code was written for VuFind 1.3.
Comment by Demian Katz [ 08/May/13 ]
The attached blueprint-searchbox.patch can be used to integrate the web search into the main search type drop-down in the Blueprint theme.
Comment by Demian Katz [ 26/Jun/13 ]
Most of the functionality from this ticket has now been ported to VuFind 2. The only thing I have left out is the "web search in main search options" patch, which is a hack that needs to be addressed in a more flexible way (VUFIND-107 covers some related territory).

Here are the key commits:

Basic indexing/search functionality - https://github.com/vufind-org/vufind/commit/bf82a02b39c83af971a4372527bfc06f767a0e61

Web crawler utility - https://github.com/vufind-org/vufind/commit/933146f6dfc180b5f2abba852a1b4b5d058facc7

WebResults recommendation module - https://github.com/vufind-org/vufind/commit/20fa198499efa29cae69925ff971931c911584b5

There is still room to refine the default schema and configuration; I may do more work on this in the near future. However, the functionality of this ticket is now implemented to a point where I feel comfortable marking this as resolved.