Large histories memory error #8
From [email protected] on July 08, 2011 16:43:50
Not sure if this is a bug, and if it's the same bug, but anyway: while trying to download http://it.wikihow.com/index.php?title=Discussioni_template:temp&action=history :
^CTraceback (most recent call last):
The script was downloading at full bandwidth (1+ MiB/s) and reached almost 1 GiB of memory consumption after that "Waiting 20 seconds and reloading". That page history is a monster, full of gigabytes of spam, but it's probably not sane to store all of that data in RAM, as the script seems to do.
From [email protected] on July 13, 2011 05:39:08
Another example, similar to the first one but a bit different, because apparently the download of the page didn't even start (there's no reported revision download progress in chunks of 1000): http://p.defau.lt/?dZddltkd5YcV5zYjMcWvXA
It seems to be caused by the horribly huge history of this page, the next one after the last downloaded page: http://wiki.guildwars.com/index.php?title=ArenaNet:Guild_Wars_2_suggestions/Scratchpad&action=history (7829 revisions, about 1900 MiB).
From [email protected] on July 17, 2011 00:51:57
As a workaround, I edited the titles list and moved the problematic titles to the end, to postpone the download of those histories and watch it more carefully. In one case I got the MemoryError even though Python had reached less than 1 GiB of RAM and almost 2 additional GiB of RAM were available; in another case the page history is 1.7 GiB when downloaded with Special:Export in a browser, and I don't know how much of it the script had downloaded. (Looking around a bit, it seems it may be normal to get a MemoryError at about 1 GiB of memory, whatever the amount of free memory.)
From [email protected] on November 10, 2013 01:23:41
Again urbandead:
Traceback (most recent call last):
From [email protected] on February 16, 2014 06:01:35
Other examples are http://dota2.gamepedia.com/ (Template:Dictionary/defindex, 13k revisions) and http://wowpedia.org/ (Patch mirrors, 7k revisions).
Happens with titles as well on huge wikis, for instance wikihow (4M pages). |
I always have to guess which wikis failed with a MemoryError and manually track down the title at fault.
I'm now trying the code from latest master to complete the dump of uesp.net and hopefully others. Let's see if it gets any better.
I wonder if we have to use something like NumPy arrays ( http://docs.scipy.org/doc/numpy/user/basics.creation.html ). Is there some XML library which accepts invalid XML files as input and can try to repair them, merge them, etc.?
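One existing option for that last question, not something this thread settled on: lxml ships a recovering parser which accepts malformed XML and keeps whatever it can parse. A minimal sketch, where the file name is just a placeholder:

```python
from lxml import etree

# recover=True makes lxml skip/repair broken markup instead of raising;
# huge_tree=True lifts the default safety limits for very large documents.
parser = etree.XMLParser(recover=True, huge_tree=True)
tree = etree.parse('broken-history.xml', parser)  # placeholder file name

# Count whatever <revision> elements survived, regardless of XML namespace.
print(len(tree.findall('.//{*}revision')))
```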
Issue #8: avoid MemoryError fatal on big histories, remove sha1 for Wikia
Mako suggests, in addition to the use of generators for title listing and revision scanning, to read the XML from the end during resume, also with a generator: https://stackoverflow.com/a/23646049/4145951 The maximum memory consumed would then be whatever it takes to store in memory the largest request we make to Special:Export (which we can reduce by requesting fewer revisions, perhaps in a try/except).
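A minimal sketch of that idea, assuming the partial dump is Special:Export-style XML with one tag per line; the function names are illustrative, not the actual dumpgenerator.py code. The generator reads fixed-size chunks from the end of the file, so resuming only ever holds one chunk in memory:

```python
import os

def reverse_lines(path, buf_size=8192):
    """Yield the lines of a file from last to first, reading fixed-size
    chunks from the end so memory use stays bounded by buf_size."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        leftover = b''
        while position > 0:
            read_size = min(buf_size, position)
            position -= read_size
            f.seek(position)
            chunk = f.read(read_size) + leftover
            lines = chunk.split(b'\n')
            leftover = lines[0]  # possibly incomplete line; prepend to next chunk
            for line in reversed(lines[1:]):
                yield line.decode('utf-8', errors='replace')
        if leftover:
            yield leftover.decode('utf-8', errors='replace')

def last_downloaded_title(xml_path):
    """Find the last <title> written to a partial dump without loading it all."""
    for line in reverse_lines(xml_path):
        if '<title>' in line:
            return line.split('<title>')[1].split('</title>')[0]
    return None
```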
Is this a history too big to be downloaded at once, or a bug in #228?
Either way, one thing I still have to do: a try/except to catch MemoryError and retry with fewer revisions. And avoid that xml2.split over huge XML files.
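A rough sketch of that graduated response, assuming a hypothetical fetch_chunk(title, limit) helper that returns the Special:Export XML for up to limit revisions (this is not the actual dumpgenerator.py code):

```python
def fetch_page_history(fetch_chunk, title, limit=1000):
    """Retry a page with progressively smaller Special:Export requests
    whenever building the response blows up with MemoryError."""
    while limit >= 1:
        try:
            return fetch_chunk(title, limit)
        except MemoryError:
            limit //= 2  # halve the number of revisions per request and retry
    raise MemoryError("even a single-revision request failed for %r" % title)
```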
Mostly fixed with d4fd745
And once that's fixed:
StringIO is not smart enough :(
Funny example works :)
IIRC this still failed some wiki downloads, which I left rotting since then. For now I'll focus on another round of generic/small wiki downloads, avoiding the huge ones (not enough disk space anyway), so I won't be able to test thoroughly.
I just had this while trying to download a decently large wiki:
edit:
I'm currently going through and trying to add some memory fixes in certain places to fix these issues, let it download the larger pages, and hopefully make it more stable. Once I have a decent patch I'll submit a PR!
Please also check that all the revisions are actually being downloaded. (One possible method: at the end of the process, run a simple grep-like count of the page and revision tags and compare it to the statistics reported by the wiki.)
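A minimal sketch of that check, assuming the finished dump is a plain Special:Export XML file (the file name is a placeholder): stream it line by line, count the page and revision tags, and compare the totals with the wiki's Special:Statistics.

```python
def count_pages_and_revisions(dump_path):
    """Stream the dump line by line so the check itself uses constant memory."""
    pages = revisions = 0
    with open(dump_path, 'r', encoding='utf-8', errors='replace') as f:
        for line in f:
            pages += line.count('<page>')
            revisions += line.count('</revision>')
    return pages, revisions

pages, revisions = count_pages_and_revisions('wiki-history.xml')  # placeholder name
print('pages: %d, revisions: %d' % (pages, revisions))
```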
A giant page on which this can be tested: https://wiki.thingsandstuff.org/index.php?title=Audio&offset=&limit=5000&action=history . The graduated response seemed to work:
We have one MemoryError on resume:
From [email protected] on April 16, 2011 23:35:03
RationalWiki:SPOV 18 edits
RationalWiki:Saloon Bar 1 edits
RationalWiki:Saloon Bar/Drink counter/Archive 1 2 edits
RationalWiki:Saloon bar 2000 edits
RationalWiki:Saloon bar 3000 edits
RationalWiki:Saloon bar 4000 edits
RationalWiki:Saloon bar 5000 edits
RationalWiki:Saloon bar 6000 edits
RationalWiki:Saloon bar 7000 edits
RationalWiki:Saloon bar 8000 edits
RationalWiki:Saloon bar 9000 edits
RationalWiki:Saloon bar 10000 edits
RationalWiki:Saloon bar 11000 edits
RationalWiki:Saloon bar 12000 edits
Traceback (most recent call last):
File "dumpgenerator.py", line 878, in
f.close()
File "dumpgenerator.py", line 785, in main
xmltitles = re.findall(r'<title>([^<]+)</title>', l) #weird if found more than 1, but maybe
File "dumpgenerator.py", line 335, in generateXMLDump
if c % 10 == 0:
File "dumpgenerator.py", line 279, in getXMLPage
xml = xml.split('')[0]+xml2.split('\n')[1]
MemoryError
Original issue: http://code.google.com/p/wikiteam/issues/detail?id=8