VuFind API Documentation

WebCrawlCommand extends Command
in package

Console command: web crawler

Tags
category

VuFind

author

Demian Katz <demian.katz@villanova.edu>

license

http://opensource.org/licenses/gpl-2.0.php GNU General Public License

link

Wiki

Table of Contents

$bypassCacheExpiration  : bool
Should we bypass cache expiration?
$config  : Config
$importer  : Importer
$solr  : Writer
__construct()  : mixed
Constructor
configure()  : void
Configure the command.
downloadFile()  : string
Download a URL to a temporary file.
execute()  : int
Run the command.
getTransformCachePath()  : string|null
Given a URL, get the transform cache path (or null if the cache is disabled).
harvestSitemap()  : bool
Process a sitemap URL, either harvesting its contents directly or recursively reading in child sitemaps.
indexFromTransformCache()  : bool
Check the cache and configuration to see if the provided URL can be loaded from cache, and load it to Solr if possible.
readFromTransformCache()  : string|null
Fetch transform cache data for the specified URL; return null if the cache is disabled, the data is expired, or something goes wrong.
removeTempFile()  : void
Remove a temporary file.
updateLastIndexed()  : string
Update the last_indexed dates in a cached XML document to the current time so reindexing cached documents works correctly.
updateTransformCache()  : bool
Update the transform cache (if activated). Returns true if the cache was updated, false otherwise.

Properties

$bypassCacheExpiration

Should we bypass cache expiration?

protected bool $bypassCacheExpiration = false

Methods

__construct()

Constructor

public __construct(Importer $importer, Writer $solr, Config $config[, string|null $name = null ]) : mixed
Parameters
$importer : Importer

XSLT importer

$solr : Writer

Solr writer

$config : Config

Configuration from webcrawl.ini

$name : string|null = null

The name of the command; passing null means it must be set in configure()

Return values
mixed

configure()

Configure the command.

protected configure() : void
Return values
void

downloadFile()

Download a URL to a temporary file.

protected downloadFile(string $url) : string
Parameters
$url : string

URL to download

Return values
string

Filename of downloaded content
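
The download-then-clean-up pattern behind downloadFile() and removeTempFile() can be sketched as follows. This is a minimal illustration rather than the VuFind implementation: the function name sketchDownloadToTempFile() and the 'sitemap' filename prefix are hypothetical, and it assumes the URL is readable with PHP's stream wrappers (remote URLs need allow_url_fopen).

```php
<?php
// Hypothetical stand-alone helper; not the actual VuFind method.
function sketchDownloadToTempFile(string $url): string
{
    // Create a uniquely named, empty file in the system temp directory.
    $file = tempnam(sys_get_temp_dir(), 'sitemap');
    if ($file === false) {
        throw new RuntimeException('Could not create a temporary file');
    }
    // file_get_contents() accepts local paths as well as URLs.
    $data = @file_get_contents($url);
    if ($data === false) {
        unlink($file);
        throw new RuntimeException("Could not read $url");
    }
    file_put_contents($file, $data);
    return $file;
}

// Usage: a local file stands in for a remote URL so the sketch runs offline.
$source = tempnam(sys_get_temp_dir(), 'src');
file_put_contents($source, '<urlset/>');
$copy = sketchDownloadToTempFile($source);
echo file_get_contents($copy), PHP_EOL;
// Remove the temporary files once processing is finished.
unlink($copy);
unlink($source);
```

Returning a filename rather than the content lets the caller hand the file to other tools, at the cost of having to delete it afterwards.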

execute()

Run the command.

protected execute(InputInterface $input, OutputInterface $output) : int
Parameters
$input : InputInterface

Input object

$output : OutputInterface

Output object

Return values
int

0 for success

getTransformCachePath()

Given a URL, get the transform cache path (or null if the cache is disabled).

protected getTransformCachePath(string $url) : string|null
Parameters
$url : string

URL to cache

Return values
string|null
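
As an illustration of the string|null contract, the sketch below maps a URL to a cache file path and returns null when no cache directory is configured. The function name, the explicit $cacheDir parameter, and the choice of an MD5 hash as the filename are assumptions made for this example; the page above does not say how the real method derives the path.

```php
<?php
// Hypothetical helper: $cacheDir stands in for a setting read from webcrawl.ini.
function sketchTransformCachePath(?string $cacheDir, string $url): ?string
{
    // No directory configured means the cache is disabled.
    if ($cacheDir === null || $cacheDir === '') {
        return null;
    }
    // Hash the URL to get a fixed-length, filesystem-safe file name.
    return rtrim($cacheDir, '/') . '/' . md5($url);
}

// Usage
var_dump(sketchTransformCachePath(null, 'https://example.org/about'));
echo sketchTransformCachePath('/tmp/cache/', 'https://example.org/about'), PHP_EOL;
```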

harvestSitemap()

Process a sitemap URL, either harvesting its contents directly or recursively reading in child sitemaps.

protected harvestSitemap(OutputInterface $output, string $url[, bool $verbose = false ][, string $index = 'SolrWeb' ][, bool $testMode = false ]) : bool
Parameters
$output : OutputInterface

Output object

$url : string

URL of sitemap to read.

$verbose : bool = false

Are we in verbose mode?

$index : string = 'SolrWeb'

Solr index to update

$testMode : bool = false

Are we in test mode?

Return values
bool

True on success, false on error.
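
The "harvest directly or recurse into child sitemaps" behaviour can be sketched with SimpleXML. This is a simplified illustration, not the VuFind code: sketchHarvestSitemap(), the $fetch callback, and the $pages collector are hypothetical, and where the real command would index each page into Solr, the sketch only records each page URL with its lastmod value.

```php
<?php
/**
 * Hypothetical sketch: walk a sitemap, recursing into child sitemaps.
 *
 * @param callable $fetch function(string $url): string|false returning XML
 * @param string   $url   Sitemap URL
 * @param array    $pages Collected [page URL => lastmod] pairs (by reference)
 *
 * @return bool True on success, false on error.
 */
function sketchHarvestSitemap(callable $fetch, string $url, array &$pages): bool
{
    $xml = $fetch($url);
    if ($xml === false) {
        return false;
    }
    // Suppress libxml warnings so malformed XML simply yields false.
    $previous = libxml_use_internal_errors(true);
    $doc = simplexml_load_string($xml);
    libxml_use_internal_errors($previous);
    if ($doc === false) {
        return false;
    }
    $ok = true;
    // A <sitemapindex> lists child sitemaps: recurse into each one.
    foreach ($doc->sitemap as $child) {
        $ok = sketchHarvestSitemap($fetch, trim((string)$child->loc), $pages) && $ok;
    }
    // A <urlset> lists pages: record each URL and its last modification date.
    foreach ($doc->url as $entry) {
        $pages[trim((string)$entry->loc)] = trim((string)$entry->lastmod);
    }
    return $ok;
}

// Usage: an in-memory map stands in for real HTTP requests.
$docs = [
    'https://example.org/sitemap-index.xml' =>
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        . '<sitemap><loc>https://example.org/sitemap-1.xml</loc></sitemap>'
        . '</sitemapindex>',
    'https://example.org/sitemap-1.xml' =>
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        . '<url><loc>https://example.org/about</loc>'
        . '<lastmod>2024-01-15</lastmod></url>'
        . '</urlset>',
];
$fetch = fn (string $u) => $docs[$u] ?? false;
$pages = [];
$ok = sketchHarvestSitemap($fetch, 'https://example.org/sitemap-index.xml', $pages);
var_dump($ok);
print_r($pages);
```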

indexFromTransformCache()

Check the cache and configuration to see if the provided URL can be loaded from cache, and load it to Solr if possible.

protected indexFromTransformCache(OutputInterface $output, string $url, string $lastMod[, bool $verbose = false ][, string $index = 'SolrWeb' ][, bool $testMode = false ]) : bool
Parameters
$output : OutputInterface

Output object

$url : string

URL to look up in the transform cache.

$lastMod : string

Last modification date of URL.

$verbose : bool = false

Are we in verbose mode?

$index : string = 'SolrWeb'

Solr index to update

$testMode : bool = false

Are we in test mode?

Return values
bool

True if loaded from cache, false if not.

readFromTransformCache()

Fetch transform cache data for the specified URL; return null if the cache is disabled, the data is expired, or something goes wrong.

protected readFromTransformCache(OutputInterface $output, string $url, string $lastMod, bool $verbose) : string|null
Parameters
$output : OutputInterface

Output object

$url : string

URL whose cached transform data should be read.

$lastMod : string

Last modification date of URL.

$verbose : bool

Are we in verbose mode?

Return values
string|null
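
The three null cases named above (cache disabled, data expired, something went wrong) can be sketched as a single guard sequence. The expiry rule used here, comparing the sitemap's last modification date with the cache file's modification time, is an assumption for illustration, as are the function name and the explicit $cachePath and $bypassExpiration parameters.

```php
<?php
// Hypothetical helper; $cachePath is null when the cache is disabled.
function sketchReadFromCache(
    ?string $cachePath,
    string $lastMod,
    bool $bypassExpiration = false
): ?string {
    // Cache disabled or entry not present.
    if ($cachePath === null || !is_file($cachePath)) {
        return null;
    }
    if (!$bypassExpiration) {
        $lastModTime = strtotime($lastMod);
        // In this sketch an unparseable date counts as "something went wrong".
        if ($lastModTime === false) {
            return null;
        }
        clearstatcache(true, $cachePath);
        // Expired: the page changed after the cache entry was written.
        if ($lastModTime > filemtime($cachePath)) {
            return null;
        }
    }
    $data = @file_get_contents($cachePath);
    return $data === false ? null : $data;
}

// Usage: build a cache entry dated 2024-01-01 and query it.
$entry = tempnam(sys_get_temp_dir(), 'cache');
file_put_contents($entry, '<add/>');
touch($entry, strtotime('2024-01-01T00:00:00Z'));
var_dump(sketchReadFromCache($entry, '2023-06-01T00:00:00Z'));       // cache is newer
var_dump(sketchReadFromCache($entry, '2024-06-01T00:00:00Z'));       // page is newer
var_dump(sketchReadFromCache($entry, '2024-06-01T00:00:00Z', true)); // expiry bypassed
unlink($entry);
```

The $bypassExpiration flag plays the role that the $bypassCacheExpiration property is described as having: when it is set, a cached entry is used regardless of its age.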

removeTempFile()

Remove a temporary file.

protected removeTempFile(string $file) : void
Parameters
$file : string

Name of file to delete

Return values
void

updateLastIndexed()

Update the last_indexed dates in a cached XML document to the current time so reindexing cached documents works correctly.

protected updateLastIndexed(string $xml) : string
Parameters
$xml : string

XML to update

Return values
string
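
One way to refresh last_indexed values in cached XML is a targeted regular-expression replacement, sketched below. It assumes the cached document is Solr update XML containing `<field name="last_indexed">` elements and that an ISO 8601 UTC timestamp is the desired format; the function name and the optional $now parameter (added so the output is reproducible) are hypothetical.

```php
<?php
// Hypothetical helper; not the actual VuFind method.
function sketchUpdateLastIndexed(string $xml, ?string $now = null): string
{
    // Default to the current UTC time, e.g. 2024-05-06T07:08:09Z.
    $now = $now ?? gmdate('Y-m-d\TH:i:s\Z');
    // A callback avoids any special handling of "$" in the replacement text.
    $updated = preg_replace_callback(
        '|<field name="last_indexed">[^<]*</field>|',
        fn (array $match) => '<field name="last_indexed">' . $now . '</field>',
        $xml
    );
    // preg_replace_callback() returns null on error; fall back to the input.
    return $updated ?? $xml;
}

// Usage
$cached = '<add><doc><field name="id">1</field>'
    . '<field name="last_indexed">2020-01-01T00:00:00Z</field></doc></add>';
echo sketchUpdateLastIndexed($cached, '2024-05-06T07:08:09Z'), PHP_EOL;
// prints <add><doc><field name="id">1</field><field name="last_indexed">2024-05-06T07:08:09Z</field></doc></add>
```

Only the last_indexed fields change; every other field in the cached document is passed through untouched.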

updateTransformCache()

Update the transform cache (if activated). Returns true if the cache was updated, false otherwise.

protected updateTransformCache(string $url, string $result) : bool
Parameters
$url : string

URL to use for cache key

$result : string

Result of transforming the URL

Return values
bool
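
The true/false contract of a cache update can be sketched as below. The function name and the explicit $cachePath parameter are hypothetical (a null path stands for a disabled cache), and creating the cache directory on demand is a choice made for this example rather than documented behaviour.

```php
<?php
// Hypothetical helper; $cachePath is null when the cache is disabled.
function sketchUpdateCache(?string $cachePath, string $result): bool
{
    // Nothing to do when the cache is not activated.
    if ($cachePath === null) {
        return false;
    }
    // Create the cache directory on first use.
    $dir = dirname($cachePath);
    if (!is_dir($dir) && !@mkdir($dir, 0777, true)) {
        return false;
    }
    // file_put_contents() returns false on failure, a byte count on success.
    return @file_put_contents($cachePath, $result) !== false;
}

// Usage
$dir = sys_get_temp_dir() . '/webcrawl-sketch-' . uniqid();
$path = $dir . '/' . md5('https://example.org/about');
var_dump(sketchUpdateCache(null, '<add/>'));
var_dump(sketchUpdateCache($path, '<add/>'));
unlink($path);
rmdir($dir);
```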
