[VUFIND-630] Wikipedia module crashes for certain articles on Windows platform
Created: 20/Jul/12  Updated: 06/Aug/13  Resolved: 10/May/13

Status: Resolved
Project: VuFind®
Components: Author
Affects versions: 1.3
Fix versions: 2.0

Type: Bug
Priority: Minor
Reporter: Ronan McHugh
Assignee: Demian Katz
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified
Environment: Windows XP, XAMPP for Windows

Attachments: File test.php    

 Description   
When retrieving and parsing the Wikipedia article for James_Joyce, the Wikipedia module crashes on a Windows platform. The guilty code is:

preg_match_all("/".$open.$recursive_match.$close."/Us", $body2, $new_matches);

The error returned is not a PHP error but a connection error: "The connection to the server was reset while the page was loading."

We speculate that this has something to do with the Apache heap size under Windows, but we don't know.

Attached is a test script which demonstrates this issue. There are two possible inputs for the preg_match_all: $body1 is an article head which is parsed correctly, $body2 is the one which causes the error. Just change the variable in the preg_match_all statement to see the difference.

 Comments   
Comment by Ronan McHugh [ 20/Jul/12 ]
The bug seems related to the following section of code. Try running the preg_match_all on a new variable, defined as follows:

$body3 = "[[File:Revolutionary Joyce Better Contrast.jpg|thumb|alt=Half-length portrait of man in his thirties. He looks to his right so that his face is in profile. He has a mustache, a thin beard, and medium-length hair slicked back, and wears a pince-nez and a plain dark greatcoat, looking vaguely like a Russian revolutionary.|[[File:James Joyce signature.svg|200px]]<br /><center>Joyce in [[Zurich]], {{circa|1918}}</center>]]";

So there is obviously something in this extract that is causing the regex problems. If we can identify this, we might be able to sanitise the input somewhat before parsing.
Comment by Demian Katz [ 14/Jan/13 ]
Your heap theory is correct. This thread describes the reason for the problem in more detail:

http://stackoverflow.com/questions/7620910/regexp-in-preg-match-function-returning-browser-error

I am able to reproduce the problem using your test script in Windows under Apache 2.2.15 and PHP 5.3.14. The same problem is not reproducible using command-line PHP 5.3.14 under Windows. This is because CLI PHP has more heap by default than PHP-in-Apache.

I reconfigured Apache to give each thread a larger stack (8388608 bytes, i.e. 8 MB) by adding this to httpd.conf:

<IfModule mpm_winnt_module>
   ThreadStackSize 8388608
</IfModule>

This solved the problem -- no more regex failures in test.php.
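
Where httpd.conf cannot be changed, a PHP-side guard might turn the connection reset into an ordinary, detectable failure. The PHP manual notes that a high pcre.recursion_limit can exhaust the process stack and crash PHP, so lowering it around the risky call is one thing to try. The following is only a sketch under stated assumptions: guardedMatchAll and the limit of 500 are hypothetical, not part of VuFind, and whether a given limit is low enough depends on the stack size and on the PCRE version (newer builds with JIT enabled may not honour this setting).

```php
<?php
/**
 * Run preg_match_all() with a temporarily lowered pcre.recursion_limit.
 * Hypothetical helper for illustration; not part of VuFind.
 *
 * @param string $pattern        Regular expression to apply
 * @param string $subject        Text to search
 * @param int    $recursionLimit Assumed-safe limit; tune for your platform
 *
 * @return array|false Matches on success, false if PCRE reported an error
 */
function guardedMatchAll($pattern, $subject, $recursionLimit = 500)
{
    // ini_set() returns the previous value, or false if the change failed.
    $previous = ini_set('pcre.recursion_limit', (string) $recursionLimit);

    $matches = array();
    $result = preg_match_all($pattern, $subject, $matches);
    // Capture the error code before doing anything else.
    $errorCode = preg_last_error();

    // Restore the original limit so other code is unaffected.
    if ($previous !== false) {
        ini_set('pcre.recursion_limit', $previous);
    }

    if ($result === false || $errorCode !== PREG_NO_ERROR) {
        return false;
    }
    return $matches;
}

// Usage: a simple, non-recursive pattern on a short input.
$found = guardedMatchAll('/\[\[([^\[\]]+)\]\]/', 'Joyce in [[Zurich]]');
```

A caller that receives false could then skip the Wikipedia content for that article rather than losing the whole page.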

So there are three possible resolutions to this ticket:

1.) Close the ticket with no action; assume that users having problems can manually adjust their ThreadStackSize setting, and hope that the default settings under Windows eventually become more reasonable.

2.) Add a higher ThreadStackSize setting to httpd-vufind.conf, so that the default VuFind configuration has more memory available under Windows (presumably the <IfModule> statement above will prevent this from breaking anything under other platforms).

3.) Find a less memory-intensive solution to the problem currently being solved by recursive regex matching.

Obviously #3 is the nicest solution -- but I don't have time to invest in trying to optimize this area of the code. I don't really like option #2, since it may cause more problems than it solves. Thus I favor #1 for the moment, though I'm not especially happy about it.
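
To give a rough idea of what option #3 could look like: a minimal sketch, assuming the goal of the recursive regex is to find balanced [[...]] blocks that may contain nested [[...]] links. It replaces the recursive pattern with a single linear scan and a depth counter, so memory use does not grow with nesting depth. The function name extractTopLevelLinks is hypothetical, and this is not the existing module code or a drop-in replacement for it.

```php
<?php
/**
 * Return every top-level [[...]] span in $text, nested links included,
 * using a depth counter instead of a recursive regular expression.
 * Hypothetical helper for illustration; not VuFind's actual code.
 *
 * @param string $text Wiki markup to scan
 *
 * @return array List of top-level [[...]] spans, in order of appearance
 */
function extractTopLevelLinks($text)
{
    $spans = array();
    $depth = 0;   // current nesting level of [[ ]]
    $start = 0;   // offset of the current top-level [[
    $length = strlen($text);

    for ($i = 0; $i < $length - 1; $i++) {
        $pair = substr($text, $i, 2);
        if ($pair === '[[') {
            if ($depth === 0) {
                $start = $i;
            }
            $depth++;
            $i++; // skip the second bracket of the pair
        } elseif ($pair === ']]' && $depth > 0) {
            $depth--;
            if ($depth === 0) {
                $spans[] = substr($text, $start, $i + 2 - $start);
            }
            $i++; // skip the second bracket of the pair
        }
    }
    return $spans;
}

// Usage: one nested block followed by one simple link.
$sample = 'a [[File:x.jpg|thumb|[[File:sig.svg|200px]] caption [[Zurich]]]] b [[Dublin]]';
$spans = extractTopLevelLinks($sample);
```

The scan works on bytes, which is safe for UTF-8 input because the bracket characters are single-byte ASCII. Unbalanced markup is simply left unmatched rather than causing deep backtracking.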

Any other thoughts?
Comment by Eoghan Ó Carragáin [ 15/Jan/13 ]
How about #1b: Close the ticket but reference VUFIND-630 in the Windows install documentation and in the [Content] section of config.ini?

Is it something that could be presented as an option/warning to Windows users as part of the automated 2.x install?

Cheers

Comment by Demian Katz [ 15/Jan/13 ]
That sounds like a reasonable option to me -- I'll try to find time to implement it after the next dev call unless there is any dissent there.

On a related note, I wonder if Wikipedia should be off by default so that administrators would be more likely to notice the warning when going in to turn it on. Since it's a somewhat controversial feature, I'm not sure if the current "on by default" is the best choice -- though we should get some more input before making such a change.
Comment by Demian Katz [ 23/Jan/13 ]
Comments on a different but seemingly related issue have been removed; see VUFIND-739 for more details.
Comment by Demian Katz [ 10/May/13 ]
Note added to config here:

https://github.com/vufind-org/vufind/commit/11018575124f40e621248d485ad604d18a54c4c3

Also put a "general notes" appendix on the Windows install pages in the wiki containing a similar comment.
Generated at Sat Apr 27 04:05:15 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100251-rev:4690f9fa025ccb713885a7f8212eefdeb0c508be.