WebCrawlCommand
extends Command
Console command: web crawler
Table of Contents
- $bypassCacheExpiration : bool
- Should we bypass cache expiration?
- $config : Config
- $importer : Importer
- $solr : Writer
- __construct() : mixed
- Constructor
- configure() : void
- Configure the command.
- downloadFile() : string
- Download a URL to a temporary file.
- execute() : int
- Run the command.
- getTransformCachePath() : string|null
- Given a URL, get the transform cache path (or null if the cache is disabled).
- harvestSitemap() : bool
- Process a sitemap URL, either harvesting its contents directly or recursively reading in child sitemaps.
- indexFromTransformCache() : bool
- Check the cache and configuration to see if the provided URL can be loaded from cache, and load it to Solr if possible.
- readFromTransformCache() : string|null
- Fetch transform cache data for the specified URL; return null if the cache is disabled, the data is expired, or something goes wrong.
- removeTempFile() : void
- Remove a temporary file.
- updateLastIndexed() : string
- Update the last_indexed dates in a cached XML document to the current time so reindexing cached documents works correctly.
- updateTransformCache() : bool
- Update the transform cache (if activated). Returns true if the cache was updated, false otherwise.
Properties
$bypassCacheExpiration
Should we bypass cache expiration?
protected bool $bypassCacheExpiration = false
$config
protected Config $config
$importer
protected Importer $importer
$solr
protected Writer $solr
Methods
__construct()
Constructor
public __construct(Importer $importer, Writer $solr, Config $config[, string|null $name = null ]) : mixed
Parameters
- $importer : Importer
-
XSLT importer
- $solr : Writer
-
Solr writer
- $config : Config
-
Configuration from webcrawl.ini
- $name : string|null = null
-
The name of the command; passing null means it must be set in configure()
Return values
mixed
configure()
Configure the command.
protected configure() : void
Return values
void
downloadFile()
Download a URL to a temporary file.
protected downloadFile(string $url) : string
Parameters
- $url : string
-
URL to download
Return values
string —Filename of downloaded content
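A minimal sketch of downloading a URL to a temporary file, for illustration only. The function name `downloadToTempFile` and the use of `copy()` are assumptions; the command's own implementation is not shown on this page and may use a different HTTP client.

```php
<?php
/**
 * Hypothetical helper: copy the content at $url into a new temporary file.
 *
 * @param string $url URL to download
 *
 * @return string Filename of downloaded content
 */
function downloadToTempFile(string $url): string
{
    // Reserve a uniquely named file in the system temp directory.
    $file = tempnam(sys_get_temp_dir(), 'crawl');
    if ($file === false) {
        throw new RuntimeException('Could not create a temporary file');
    }
    // copy() accepts any source supported by PHP's stream wrappers.
    if (!@copy($url, $file)) {
        unlink($file);
        throw new RuntimeException("Could not download $url");
    }
    return $file;
}
```

The caller owns the returned file and should delete it when finished, which is the role removeTempFile() plays in this class.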
execute()
Run the command.
protected execute(InputInterface $input, OutputInterface $output) : int
Parameters
- $input : InputInterface
-
Input object
- $output : OutputInterface
-
Output object
Return values
int —0 for success
getTransformCachePath()
Given a URL, get the transform cache path (or null if the cache is disabled).
protected getTransformCachePath(string $url) : string|null
Parameters
- $url : string
-
URL to cache
Return values
string|null
harvestSitemap()
Process a sitemap URL, either harvesting its contents directly or recursively reading in child sitemaps.
protected harvestSitemap(OutputInterface $output, string $url[, bool $verbose = false ][, string $index = 'SolrWeb' ][, bool $testMode = false ]) : bool
Parameters
- $output : OutputInterface
-
Output object
- $url : string
-
URL of sitemap to read.
- $verbose : bool = false
-
Are we in verbose mode?
- $index : string = 'SolrWeb'
-
Solr index to update
- $testMode : bool = false
-
Are we in test mode?
Return values
bool —True on success, false on error.
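The recursive behaviour described for harvestSitemap() can be sketched as follows. This is an illustration under stated assumptions, not the command's actual code: `collectSitemapPages` and the `$fetch` callable are hypothetical, and the sketch only collects page URLs rather than transforming and indexing them. It relies on the standard sitemap format, where a `<sitemapindex>` contains `<sitemap>` entries and a `<urlset>` contains `<url>` entries.

```php
<?php
/**
 * Hypothetical helper: gather page URLs from a sitemap, following child
 * sitemaps recursively.
 *
 * @param callable $fetch Assumed loader: takes a URL, returns XML or null
 * @param string   $url   URL of sitemap to read
 * @param string[] $pages Accumulator for page URLs (filled by reference)
 *
 * @return bool True on success, false on error
 */
function collectSitemapPages(callable $fetch, string $url, array &$pages): bool
{
    $xml = $fetch($url);
    if ($xml === null) {
        return false;
    }
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return false;
    }
    $ok = true;
    // A <sitemapindex> lists child sitemaps: recurse into each one.
    foreach ($doc->sitemap as $child) {
        $ok = collectSitemapPages($fetch, trim((string)$child->loc), $pages) && $ok;
    }
    // A <urlset> lists pages: record each location.
    foreach ($doc->url as $entry) {
        $pages[] = trim((string)$entry->loc);
    }
    return $ok;
}

// Usage with in-memory documents standing in for real HTTP requests.
$ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$sampleDocs = [
    'http://example.org/index.xml' =>
        "<sitemapindex xmlns=\"$ns\"><sitemap><loc>http://example.org/a.xml</loc></sitemap></sitemapindex>",
    'http://example.org/a.xml' =>
        "<urlset xmlns=\"$ns\"><url><loc>http://example.org/page1</loc></url></urlset>",
];
$sampleFetch = fn (string $u): ?string => $sampleDocs[$u] ?? null;
$samplePages = [];
collectSitemapPages($sampleFetch, 'http://example.org/index.xml', $samplePages);
```

A failure in one child sitemap is reported through the return value but does not stop the remaining children from being read; whether the real command behaves the same way is not stated here.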
indexFromTransformCache()
Check the cache and configuration to see if the provided URL can be loaded from cache, and load it to Solr if possible.
protected indexFromTransformCache(OutputInterface $output, string $url, string $lastMod[, bool $verbose = false ][, string $index = 'SolrWeb' ][, bool $testMode = false ]) : bool
Parameters
- $output : OutputInterface
-
Output object
- $url : string
-
URL to load from the transform cache.
- $lastMod : string
-
Last modification date of URL.
- $verbose : bool = false
-
Are we in verbose mode?
- $index : string = 'SolrWeb'
-
Solr index to update
- $testMode : bool = false
-
Are we in test mode?
Return values
bool —True if loaded from cache, false if not.
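The cache check can be pictured with the sketch below. It assumes that a cached entry counts as expired when the URL's last modification date is newer than the cache file's modification time, and that $bypassCacheExpiration skips that comparison. Both the rule and the helper name `isCacheUsable` are assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper: decide whether a cached transform may be reused.
 *
 * @param string|null $cacheFile        Cache path, or null if the cache is disabled
 * @param string      $lastMod          Last modification date of URL
 * @param bool        $bypassExpiration Should we bypass cache expiration?
 *
 * @return bool True if the cached copy can be used
 */
function isCacheUsable(?string $cacheFile, string $lastMod, bool $bypassExpiration = false): bool
{
    // Cache disabled (null path) or nothing stored for this URL yet.
    if ($cacheFile === null || !is_file($cacheFile)) {
        return false;
    }
    if ($bypassExpiration) {
        return true;
    }
    $lastModEpoch = strtotime($lastMod);
    if ($lastModEpoch === false) {
        // Unparseable date: treat the entry as expired to be safe.
        return false;
    }
    // Assumed rule: usable only if the page has not changed since caching.
    return $lastModEpoch <= filemtime($cacheFile);
}
```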
readFromTransformCache()
Fetch transform cache data for the specified URL; return null if the cache is disabled, the data is expired, or something goes wrong.
protected readFromTransformCache(OutputInterface $output, string $url, string $lastMod, bool $verbose) : string|null
Parameters
- $output : OutputInterface
-
Output object
- $url : string
-
URL to look up in the transform cache.
- $lastMod : string
-
Last modification date of URL.
- $verbose : bool
-
Are we in verbose mode?
Return values
string|null
removeTempFile()
Remove a temporary file.
protected removeTempFile(string $file) : void
Parameters
- $file : string
-
Name of file to delete
Return values
void
updateLastIndexed()
Update the last_indexed dates in a cached XML document to the current time so reindexing cached documents works correctly.
protected updateLastIndexed(string $xml) : string
Parameters
- $xml : string
-
XML to update
Return values
string
updateTransformCache()
Update the transform cache (if activated). Returns true if the cache was updated, false otherwise.
protected updateTransformCache(string $url, string $result) : bool
Parameters
- $url : string
-
URL to use for cache key
- $result : string
-
Result of transforming the URL
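One way the URL could serve as a cache key is sketched below. It assumes a configured cache directory and file names derived from an MD5 hash of the URL; the helper names `transformCachePath` and `writeTransformCache`, the `$cacheDir` parameter, and the hashing scheme are all hypothetical and may not match the command's actual layout.

```php
<?php
/**
 * Hypothetical helper: map a URL to a cache file path.
 *
 * @param string|null $cacheDir Cache directory, or null if the cache is disabled
 * @param string      $url      URL to use for cache key
 *
 * @return string|null Cache path, or null if the cache is disabled
 */
function transformCachePath(?string $cacheDir, string $url): ?string
{
    if ($cacheDir === null || $cacheDir === '') {
        return null;
    }
    // Assumption: hashing the URL yields a filesystem-safe, fixed-length name.
    return rtrim($cacheDir, '/') . '/' . md5($url);
}

/**
 * Hypothetical helper: store a transform result under the URL's cache key.
 *
 * @return bool True if the cache was updated, false otherwise
 */
function writeTransformCache(?string $cacheDir, string $url, string $result): bool
{
    $path = transformCachePath($cacheDir, $url);
    if ($path === null) {
        return false;
    }
    return file_put_contents($path, $result) !== false;
}
```

Keeping the path calculation in one function means the read and write sides cannot disagree about where an entry lives.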