Large histories memory error #8

Open
emijrp opened this issue Jun 25, 2014 · 22 comments

@emijrp
Member

emijrp commented Jun 25, 2014

From [email protected] on April 16, 2011 23:35:03

RationalWiki:SPOV 18 edits
RationalWiki:Saloon Bar 1 edits
RationalWiki:Saloon Bar/Drink counter/Archive 1 2 edits
RationalWiki:Saloon bar 2000 edits
RationalWiki:Saloon bar 3000 edits
RationalWiki:Saloon bar 4000 edits
RationalWiki:Saloon bar 5000 edits
RationalWiki:Saloon bar 6000 edits
RationalWiki:Saloon bar 7000 edits
RationalWiki:Saloon bar 8000 edits
RationalWiki:Saloon bar 9000 edits
RationalWiki:Saloon bar 10000 edits
RationalWiki:Saloon bar 11000 edits
RationalWiki:Saloon bar 12000 edits
Traceback (most recent call last):
  File "dumpgenerator.py", line 878, in
    f.close()
  File "dumpgenerator.py", line 785, in main
    xmltitles = re.findall(r'<title>([^<]+)</title>', l) #weird if found more than 1, but maybe
  File "dumpgenerator.py", line 335, in generateXMLDump
    if c % 10 == 0:
  File "dumpgenerator.py", line 279, in getXMLPage
    xml = xml.split('')[0]+xml2.split('\n')[1]
MemoryError

Original issue: http://code.google.com/p/wikiteam/issues/detail?id=8

@emijrp
Member Author

emijrp commented Jun 25, 2014

From [email protected] on July 08, 2011 16:43:50

Not sure if this is a bug and if it's the same bug, but anyway: while trying to download http://it.wikihow.com/index.php?title=Discussioni_template:temp&action=history :

XML for "Discussioni_template:temp" is wrong. Waiting 20 seconds and reloading...

^CTraceback (most recent call last):
  File "../../dumpgenerator.py", line 941, in
    main()
  File "../../dumpgenerator.py", line 906, in main
    generateXMLDump(config=config, titles=titles)
  File "../../dumpgenerator.py", line 383, in generateXMLDump
    xml = getXMLPage(config=config, title=title)
  File "../../dumpgenerator.py", line 292, in getXMLPage
    xml = getXMLPageCore(headers=headers, params=params, config=config)
  File "../../dumpgenerator.py", line 268, in getXMLPageCore
    xml = f.read()
  File "/usr/lib/python2.7/socket.py", line 359, in read
    return buf.getvalue()

The script was downloading at full bandwidth (1+ MiB/s) and reached almost 1 GiB of memory consumption after that "Waiting 20 seconds and reloading". That page history is a monster, full of gigabytes of spam, but it's probably not sane to store all the data in RAM the way the script seems to do.
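
A minimal sketch of what streaming to disk could look like, assuming requests; the URL, parameters and file name are illustrative, not the script's actual ones:

    import requests

    def stream_export(index_url, title, outpath, session=None):
        # Stream a Special:Export response to disk in 1 MiB chunks instead of
        # reading the whole history into memory with f.read().
        session = session or requests.Session()
        params = {'title': 'Special:Export', 'pages': title,
                  'history': '1', 'action': 'submit'}
        r = session.post(index_url, data=params, stream=True)
        with open(outpath, 'ab') as out:
            for chunk in r.iter_content(chunk_size=1 << 20):
                out.write(chunk)
        r.close()

Memory use would then be bounded by the chunk size rather than by the size of the page history.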

@emijrp
Member Author

emijrp commented Jun 25, 2014

From [email protected] on July 13, 2011 05:39:08

Another example, similar to the first one but a bit different, because apparently the download of the page didn't even start (there's no reported revision download progress in chunks of 1000): http://p.defau.lt/?dZddltkd5YcV5zYjMcWvXA . It seems to be caused by the horribly huge history of this page, the next one after the last downloaded title: http://wiki.guildwars.com/index.php?title=ArenaNet:Guild_Wars_2_suggestions/Scratchpad&action=history (7829 revisions, about 1900 MiB).

@emijrp
Member Author

emijrp commented Jun 25, 2014

From [email protected] on July 17, 2011 00:51:57

As a workaround, I edited the titles list and moved the problematic titles to the end, to postpone the download of those histories and watch them more carefully. In one case I got the MemoryError even though Python was using less than 1 GiB of RAM and almost 2 more GiB were still free; in another case the page history is 1.7 GiB when downloaded with Special:Export in a browser, and I don't know how much the script had downloaded. (Looking around a bit, it seems it can be normal to get a MemoryError at about 1 GiB of memory, whatever the amount of free memory.)
They all look like this: http://p.defau.lt/?3JuOkvmlwDGqi_1A30V6qQ .

@emijrp
Member Author

emijrp commented Jun 25, 2014

From [email protected] on November 10, 2013 01:23:41

Again urbandead:

Traceback (most recent call last):
  File "dumpgenerator.py", line 1205, in
    main()
  File "dumpgenerator.py", line 1196, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 1056, in resumePreviousDump
    generateXMLDump(config=config, titles=titles, start=lastxmltitle)
  File "dumpgenerator.py", line 457, in generateXMLDump
    xml = getXMLPage(config=config, title=title)
  File "dumpgenerator.py", line 389, in getXMLPage
    xml = xml.split('')[0] + ' ' + (''.join(xml2.split('')[1:]))
MemoryError

@emijrp
Member Author

emijrp commented Jun 25, 2014

From [email protected] on February 16, 2014 06:01:35

Other examples are http://dota2.gamepedia.com/ (Template:Dictionary/defindex , 13k revisions) and http://wowpedia.org/ (Patch mirrors, 7k revisions).

@nemobis
Member

nemobis commented Jul 5, 2014

Happens with titles as well on huge wikis, for instance wikihow (4M pages).

@nemobis
Member

nemobis commented Aug 18, 2014

I always have to guess which wikis failed with a MemoryError and manually track down the title at fault.

  • It would be useful if the MemoryError were caught, at least to log something to error.log (very easy? see the sketch below).
  • When the problem is in the downloading, perhaps requests would help? If the error is in the XML manipulation, though, I'm not sure what to do.
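
A rough sketch of the first point, with fetch_history() as a placeholder for whatever actually retrieves a page history:

    import datetime

    def dump_titles(titles, fetch_history, errorlog='error.log'):
        # Catch MemoryError per title, log the culprit to error.log and keep
        # going, instead of letting the whole dump die without a trace.
        for title in titles:
            try:
                for chunk in fetch_history(title):
                    yield chunk
            except MemoryError:
                with open(errorlog, 'a') as log:
                    log.write('%s: MemoryError while downloading "%s"\n'
                              % (datetime.datetime.utcnow().isoformat(), title))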

@nemobis
Member

nemobis commented Aug 21, 2014

I'm now trying the code from the latest master to complete the dump of uesp.net and hopefully others. Let's see if it gets any better.

@nemobis
Member

nemobis commented Feb 2, 2015

I wonder if we have to use something like http://docs.scipy.org/doc/numpy/user/basics.creation.html

Is there some XML library that accepts invalid XML files as input and can try to repair them, merge them, etc.?
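
For the repairing part, one candidate might be lxml's recovering parser, which keeps parsing past malformed markup; a minimal sketch, with an illustrative file name:

    from lxml import etree

    # recover=True tells libxml2 to parse past malformed markup;
    # huge_tree=True lifts the default size limits for very large documents.
    parser = etree.XMLParser(recover=True, huge_tree=True)
    tree = etree.parse('truncated-history.xml', parser)
    pages = [el for el in tree.getroot().iter()
             if isinstance(el.tag, str)
             and (el.tag == 'page' or el.tag.endswith('}page'))]
    print('%d <page> elements survived' % len(pages))

For a multi-GiB dump, building the whole tree would of course defeat the purpose, so this would only be practical on the broken tail or on per-page chunks.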

PiRSquared17 added a commit that referenced this issue Feb 10, 2015
Issue #8: avoid MemoryError fatal on big histories, remove sha1 for Wikia
@nemobis
Member

nemobis commented Feb 10, 2015

Mako suggests, in addition to using generators for title listing and revision scanning, reading the XML from the end during resume, also with a generator: https://stackoverflow.com/a/23646049/4145951

The maximum memory consumed would then be whatever it takes to store in memory the largest request we make to Special:Export (which we can reduce by lowering the number of revisions per request, perhaps in a try/except).
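
Something along the lines of the linked answer, as a sketch: walk the dump backwards in blocks and yield lines from the end, so finding the resume point doesn't require loading the whole XML:

    import os

    def reverse_lines(path, blocksize=1 << 20):
        # Yield the lines of a (possibly huge) file from last to first,
        # reading it backwards one block at a time.
        with open(path, 'rb') as f:
            f.seek(0, os.SEEK_END)
            pos = f.tell()
            tail = b''
            while pos > 0:
                size = min(blocksize, pos)
                pos -= size
                f.seek(pos)
                block = f.read(size) + tail
                lines = block.split(b'\n')
                tail = lines.pop(0)  # possibly incomplete; completed by the next block
                for line in reversed(lines):
                    yield line
            yield tail

Scanning that backwards for the last complete </page> (or the last <title>) would give the resume point without holding more than one block in memory.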

@nemobis nemobis self-assigned this Feb 10, 2015
@nemobis
Member

nemobis commented Mar 10, 2015

Is this a history too big to be downloaded at once, or a bug in #228 ?

Removing the last chunk of past XML dump: it is probably incomplete
Cannot allocate memory
Cannot allocate memory
Last 124518 lines removed.
    Creepypasta Wiki:Chat/Logs/8 December 2014, 78 edits
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2005, in <module>
    main()
  File "./dumpgenerator.py", line 1995, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 1649, in resumePreviousDump
    session=other['session'])
  File "./dumpgenerator.py", line 684, in generateXMLDump
    for xml in getXMLPage(config=config, title=title, session=session):
  File "./dumpgenerator.py", line 612, in getXMLPage
    yield '  <revision>' + ('<revision>'.join(xml2.split('<revision>')[1:]))
MemoryError
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
$ wc -l creepypastawikiacom-20150207-history.xml
165624557 creepypastawikiacom-20150207-history.xml

Either way, one thing I still have to do: a try/except to catch MemoryError and retry with fewer revisions. And avoid that xml2.split() over huge XML files.
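
A sketch of the retry idea, with fetch_chunk() standing in for the actual Special:Export request; its limit/offset parameters mirror Special:Export's own:

    def get_history(title, fetch_chunk, limit=1000):
        # Ask for progressively smaller chunks of the history whenever a
        # MemoryError is hit, instead of failing the whole title.
        offset = None  # timestamp of the last revision seen, for paging
        while True:
            try:
                chunk, offset, done = fetch_chunk(title, limit=limit, offset=offset)
            except MemoryError:
                if limit <= 1:
                    raise  # even a single revision does not fit: give up
                limit //= 2  # halve the request size and retry
                continue
            yield chunk
            if done:
                break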

@nemobis nemobis added this to the 0.3 milestone Mar 30, 2015
@nemobis
Member

nemobis commented Mar 30, 2015

Mostly fixed with d4fd745
I think we are no longer doing .split() on any huge XML, which was our biggest issue. There are still some places where a MemoryError can occur; perhaps they should be logged by wrapping the entire main() in a try/except.

@nemobis
Member

nemobis commented Mar 30, 2015

  File "./dumpgenerator.py", line 2020, in <module>
    main()
  File "./dumpgenerator.py", line 2010, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 1664, in resumePreviousDump
    session=other['session'])
  File "./dumpgenerator.py", line 687, in generateXMLDump
    for xml in getXMLPage(config=config, title=title, session=session):
  File "./dumpgenerator.py", line 574, in getXMLPage
    xml = re.sub(r'\n\s*<sha1>\w+</sha1>\s*\n', r'\n', xml)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
MemoryError
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.

@nemobis
Member

nemobis commented Mar 30, 2015

And once that's fixed:

Traceback (most recent call last):
  File "./dumpgenerator.py", line 2030, in <module>
    main()
  File "./dumpgenerator.py", line 2020, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 1674, in resumePreviousDump
    session=other['session'])
  File "./dumpgenerator.py", line 697, in generateXMLDump
    for xml in getXMLPage(config=config, title=title, session=session):
  File "./dumpgenerator.py", line 587, in getXMLPage
    yield xml.split("</page>")[0]
MemoryError
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
#########################################################################

@nemobis
Member

nemobis commented Mar 30, 2015

StringIO is not smart enough :(

  File "./dumpgenerator.py", line 2022, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 1676, in resumePreviousDump
    session=other['session'])
  File "./dumpgenerator.py", line 699, in generateXMLDump
    for xml in getXMLPage(config=config, title=title, session=session):
  File "./dumpgenerator.py", line 582, in getXMLPage
    for line in readxml:
  File "/usr/lib/python2.7/StringIO.py", line 76, in next
    r = self.readline()
  File "/usr/lib/python2.7/StringIO.py", line 164, in readline
    r = self.buf[self.pos:newpos]
MemoryError
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
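
An alternative to buffering the whole response in a StringIO would be to let requests stream the body line by line; iter_lines() is standard requests API, while the URL and parameters here are only illustrative:

    import requests

    r = requests.get('https://wiki.example.org/index.php',
                     params={'title': 'Special:Export', 'pages': 'Some_page',
                             'history': '1'},
                     stream=True)
    # Iterate over decoded lines as they arrive instead of holding the
    # full multi-GiB body in memory.
    for line in r.iter_lines(decode_unicode=True):
        if line and '<title>' in line:
            print(line.strip())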

@nemobis
Member

nemobis commented Mar 30, 2015

Funny example works :)

Analysing http://glee.wikia.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "Blaine-Kurt Relationship"
Retrieving the XML for every page from "Blaine-Kurt Relationship"
The page's history exceeds our memory, halving limit.
The page's history exceeds our memory, halving limit.
The page's history exceeds our memory, halving limit.
    Blaine-Kurt Relationship, 3483 edits

@nemobis
Member

nemobis commented Jul 5, 2015

IIRC this still failed some wiki downloads, which I have left rotting since then. For now I'll focus on another round of generic/small wiki downloads, avoiding the huge ones (not enough disk space anyway), so I won't be able to test thoroughly.

@DanielOaks
Contributor

I just had this while trying to download a decently large wiki:

Traceback (most recent call last):
  File "./dumpgenerator.py", line 2054, in <module>
    main()
  File "./dumpgenerator.py", line 2046, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1621, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 695, in generateXMLDump
    for xml in getXMLPage(config=config, title=title, session=session):
  File "./dumpgenerator.py", line 629, in getXMLPage
    xml2 = xml2.split("</page>")[0]
MemoryError
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.

edit:
Got this error as well, after that one was fixed:

Traceback (most recent call last):
  File "./dumpgenerator.py", line 2054, in <module>
    main()
  File "./dumpgenerator.py", line 2046, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1621, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 695, in generateXMLDump
    for xml in getXMLPage(config=config, title=title, session=session):
  File "./dumpgenerator.py", line 604, in getXMLPage
    params=params, config=config, session=session)
  File "./dumpgenerator.py", line 542, in getXMLPageCore
    xml = fixBOM(r)
  File "./dumpgenerator.py", line 1566, in fixBOM
    if request.text.startswith(u'\ufeff'):
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 786, in text
    content = str(self.content, encoding, errors='replace')
MemoryError
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.

I'm currently going through and trying to add some memory fixes in certain places to address these issues, let the script download the larger pages, and hopefully make it more stable. Once I have a decent patch I'll submit a PR!

@nemobis
Member

nemobis commented Oct 22, 2015

Please also check that all the revisions are actually being downloaded. (One possible method: at the end of the process, run a simple grep that counts the <page> and <revision> tags, and compare that to the statistics on the wiki.)
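
A sketch of that check in Python rather than grep, with the dump path and API URL as placeholders; siprop=statistics returns the wiki's own page and edit counts:

    import requests

    def count_tags(dumppath):
        # Stream through the dump counting opening tags, without loading it whole.
        pages = revisions = 0
        with open(dumppath) as dump:
            for line in dump:
                pages += line.count('<page>')
                revisions += line.count('<revision>')
        return pages, revisions

    def wiki_statistics(api_url):
        r = requests.get(api_url, params={'action': 'query', 'meta': 'siteinfo',
                                          'siprop': 'statistics', 'format': 'json'})
        return r.json()['query']['statistics']  # includes 'pages' and 'edits'

    pages, revisions = count_tags('wiki-history.xml')
    stats = wiki_statistics('https://wiki.example.org/api.php')
    print('dump: %d pages, %d revisions; wiki reports %d pages, %d edits'
          % (pages, revisions, stats['pages'], stats['edits']))

The numbers will not match exactly (deleted pages, edits made after the dump started), but a large gap would flag missing revisions.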

@nemobis
Member

nemobis commented Feb 7, 2020

The current solution is to use --xmlrevisions (#311); I closed #282.
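
For reference, the idea behind --xmlrevisions is to page through the history via the API instead of Special:Export; a rough sketch using the standard prop=revisions continuation (error handling and the option's actual logic omitted):

    import requests

    def iter_revisions(api_url, title, session=None):
        # Fetch a page's revisions in small batches, following the API's
        # continuation parameters, so no single response has to hold the
        # whole history.
        session = session or requests.Session()
        params = {'action': 'query', 'prop': 'revisions', 'titles': title,
                  'rvlimit': 50, 'rvprop': 'ids|timestamp|user|comment|content',
                  'format': 'json', 'continue': ''}
        while True:
            data = session.get(api_url, params=params).json()
            for page in data['query']['pages'].values():
                for rev in page.get('revisions', []):
                    yield rev
            if 'continue' not in data:
                break
            params.update(data['continue'])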

@nemobis
Member

nemobis commented Feb 9, 2020

A giant page on which this can be tested: https://wiki.thingsandstuff.org/index.php?title=Audio&offset=&limit=5000&action=history . The graduated response seemed to work:

Analysing https://wiki.thingsandstuff.org/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "Audio"
https://wiki.thingsandstuff.org/api.php
Retrieving the XML for every page from "Audio"
Removing the last chunk of past XML dump: it is probably incomplete.
The page's history exceeds our memory, halving limit.
The page's history exceeds our memory, halving limit.
The page's history exceeds our memory, halving limit.
    Audio, 5017 edits

@nemobis nemobis modified the milestones: 0.3, 0.4 Feb 10, 2020
@nemobis
Member

nemobis commented Mar 2, 2020

We have one MemoryError on resume:

Analysing http://wiki.urbandead.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "NecroWatch/Procedures"
http://wiki.urbandead.com/api.php
Retrieving the XML for every page from "NecroWatch/Procedures"
Removing the last chunk of past XML dump: it is probably incomplete.
nSleeping... 0 seconds...
    NecroWatch/Procedures, 2 edits
Sleeping... 0 seconds...
Traceback (most recent call last):
  File "dumpgenerator.py", line 2528, in <module>
    main()
  File "dumpgenerator.py", line 2518, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 2165, in resumePreviousDump
    session=other['session'])
  File "dumpgenerator.py", line 764, in generateXMLDump
    for xml in getXMLPage(config=config, title=title, session=session):
  File "dumpgenerator.py", line 616, in getXMLPage
    xml = re.sub(r'\n\s*<sha1>\w+</sha1>\s*\n', r'\n', xml)
  File "/usr/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
MemoryError

gausie pushed a commit to gausie/wikiteam that referenced this issue Nov 18, 2023