Large histories memory error #8

Open
emijrp opened this issue Jun 25, 2014 · 22 comments

@emijrp
Member

emijrp commented Jun 25, 2014

From [email protected] on April 16, 2011 23:35:03

RationalWiki:SPOV 18 edits
RationalWiki:Saloon Bar 1 edits
RationalWiki:Saloon Bar/Drink counter/Archive 1 2 edits
RationalWiki:Saloon bar 2000 edits
RationalWiki:Saloon bar 3000 edits
RationalWiki:Saloon bar 4000 edits
RationalWiki:Saloon bar 5000 edits
RationalWiki:Saloon bar 6000 edits
RationalWiki:Saloon bar 7000 edits
RationalWiki:Saloon bar 8000 edits
RationalWiki:Saloon bar 9000 edits
RationalWiki:Saloon bar 10000 edits
RationalWiki:Saloon bar 11000 edits
RationalWiki:Saloon bar 12000 edits
Traceback (most recent call last):
  File "dumpgenerator.py", line 878, in
    f.close()
  File "dumpgenerator.py", line 785, in main
    xmltitles = re.findall(r'<title>([^<]+)</title>', l) #weird if found more than 1, but maybe
  File "dumpgenerator.py", line 335, in generateXMLDump
    if c % 10 == 0:
  File "dumpgenerator.py", line 279, in getXMLPage
    xml = xml.split('')[0]+xml2.split('\n')[1]
MemoryError

Original issue: http://code.google.com/p/wikiteam/issues/detail?id=8

@emijrp
Member Author

emijrp commented Jun 25, 2014

From [email protected] on July 08, 2011 16:43:50

Not sure if this is a bug and if it's the same bug, but anyway: while trying to download http://it.wikihow.com/index.php?title=Discussioni_template:temp&action=history :

XML for "Discussioni_template:temp" is wrong. Waiting 20 seconds and reloading...

^CTraceback (most recent call last):
  File "../../dumpgenerator.py", line 941, in
    main()
  File "../../dumpgenerator.py", line 906, in main
    generateXMLDump(config=config, titles=titles)
  File "../../dumpgenerator.py", line 383, in generateXMLDump
    xml = getXMLPage(config=config, title=title)
  File "../../dumpgenerator.py", line 292, in getXMLPage
    xml = getXMLPageCore(headers=headers, params=params, config=config)
  File "../../dumpgenerator.py", line 268, in getXMLPageCore
    xml = f.read()
  File "/usr/lib/python2.7/socket.py", line 359, in read
    return buf.getvalue()

The script was downloading at full bandwidth (1+ MiB/s) and reached almost 1 GiB of memory consumption after that "Waiting 20 seconds and reloading". That page history is a monster, full of gigabytes of spam, but it's probably not sane to store all the data in RAM the way the script seems to do.
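
A minimal sketch of what streaming to disk could look like, assuming requests; the URL, parameters and file name are illustrative, not the script's actual ones:

    import requests

    def stream_export(index_url, title, outpath, session=None):
        # Stream a Special:Export response to disk in 1 MiB chunks instead of
        # reading the whole history into memory with f.read().
        session = session or requests.Session()
        params = {'title': 'Special:Export', 'pages': title,
                  'history': '1', 'action': 'submit'}
        r = session.post(index_url, data=params, stream=True)
        with open(outpath, 'ab') as out:
            for chunk in r.iter_content(chunk_size=1 << 20):
                out.write(chunk)
        r.close()

Memory use would then be bounded by the chunk size rather than by the size of the page history.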

@emijrp
Member Author

emijrp commented Jun 25, 2014

From [email protected] on July 13, 2011 05:39:08

Another example, similar to the first one but a bit different, because apparently the download of the page didn't even start (there's no reported revision download progress in chunks of 1000): http://p.defau.lt/?dZddltkd5YcV5zYjMcWvXA . It seems to be caused by the horribly huge history of this page, the next one after the last downloaded title: http://wiki.guildwars.com/index.php?title=ArenaNet:Guild_Wars_2_suggestions/Scratchpad&action=history (7829 revisions, about 1900 MiB).

@emijrp
Member Author

emijrp commented Jun 25, 2014

From [email protected] on July 17, 2011 00:51:57

As a workaround, I edited the titles list and moved the problematic titles to the end, to postpone the download of those histories and watch them more carefully. In one case I got the MemoryError even though Python was using less than 1 GiB of RAM and almost 2 more GiB were still free; in another case the page history is 1.7 GiB when downloaded with Special:Export in a browser, and I don't know how much the script had downloaded. (Looking around a bit, it seems it can be normal to get a MemoryError at about 1 GiB of memory, whatever the amount of free memory.)
They all look like this: http://p.defau.lt/?3JuOkvmlwDGqi_1A30V6qQ .

@emijrp
Member Author

emijrp commented Jun 25, 2014

From [email protected] on November 10, 2013 01:23:41

Again urbandead:

Traceback (most recent call last):
  File "dumpgenerator.py", line 1205, in
    main()
  File "dumpgenerator.py", line 1196, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 1056, in resumePreviousDump
    generateXMLDump(config=config, titles=titles, start=lastxmltitle)
  File "dumpgenerator.py", line 457, in generateXMLDump
    xml = getXMLPage(config=config, title=title)
  File "dumpgenerator.py", line 389, in getXMLPage
    xml = xml.split('')[0] + ' ' + (''.join(xml2.split('')[1:]))
MemoryError

@emijrp
Member Author

emijrp commented Jun 25, 2014

From [email protected] on February 16, 2014 06:01:35

Other examples are http://dota2.gamepedia.com/ (Template:Dictionary/defindex , 13k revisions) and http://wowpedia.org/ (Patch mirrors, 7k revisions).

@nemobis
Member

nemobis commented Jul 5, 2014

Happens with titles as well on huge wikis, for instance wikihow (4M pages).

@nemobis
Member

nemobis commented Aug 18, 2014

I always have to guess which wikis failed with a MemoryError and manually track down the title at fault.

  • It would be useful if the MemoryError were caught, at least to log something to error.log (very easy? see the sketch below).
  • When the problem is in the downloading, perhaps requests would help? If the error is in the XML manipulation, though, I'm not sure what to do.
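
A rough sketch of the first point, with fetch_history() as a placeholder for whatever actually retrieves a page history:

    import datetime

    def dump_titles(titles, fetch_history, errorlog='error.log'):
        # Catch MemoryError per title, log the culprit to error.log and keep
        # going, instead of letting the whole dump die without a trace.
        for title in titles:
            try:
                for chunk in fetch_history(title):
                    yield chunk
            except MemoryError:
                with open(errorlog, 'a') as log:
                    log.write('%s: MemoryError while downloading "%s"\n'
                              % (datetime.datetime.utcnow().isoformat(), title))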

@nemobis
Member

nemobis commented Aug 21, 2014

I'm now trying the code from the latest master to complete the dump of uesp.net and hopefully others. Let's see if it gets any better.

@nemobis
Member

nemobis commented Feb 2, 2015

I wonder if we have to use something like http://docs.scipy.org/doc/numpy/user/basics.creation.html

Is there some XML library that accepts invalid XML files as input and can try to repair them, merge them, etc.?
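
For the repairing part, one candidate might be lxml's recovering parser, which keeps parsing past malformed markup; a minimal sketch, with an illustrative file name:

    from lxml import etree

    # recover=True tells libxml2 to parse past malformed markup;
    # huge_tree=True lifts the default size limits for very large documents.
    parser = etree.XMLParser(recover=True, huge_tree=True)
    tree = etree.parse('truncated-history.xml', parser)
    pages = [el for el in tree.getroot().iter()
             if isinstance(el.tag, str)
             and (el.tag == 'page' or el.tag.endswith('}page'))]
    print('%d <page> elements survived' % len(pages))

For a multi-GiB dump, building the whole tree would of course defeat the purpose, so this would only be practical on the broken tail or on per-page chunks.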

PiRSquared17 added a commit that referenced this issue Feb 10, 2015
Issue #8: avoid MemoryError fatal on big histories, remove sha1 for Wikia
@nemobis
Member

nemobis commented Feb 10, 2015

Mako suggests, in addition to using generators for title listing and revision scanning, reading the XML from the end during resume, also with a generator: https://stackoverflow.com/a/23646049/4145951

The maximum memory consumed would then be whatever it takes to store in memory the largest request we make to Special:Export (which we can reduce by lowering the number of revisions per request, perhaps in a try/except).
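
Something along the lines of the linked answer, as a sketch: walk the dump backwards in blocks and yield lines from the end, so finding the resume point doesn't require loading the whole XML:

    import os

    def reverse_lines(path, blocksize=1 << 20):
        # Yield the lines of a (possibly huge) file from last to first,
        # reading it backwards one block at a time.
        with open(path, 'rb') as f:
            f.seek(0, os.SEEK_END)
            pos = f.tell()
            tail = b''
            while pos > 0:
                size = min(blocksize, pos)
                pos -= size
                f.seek(pos)
                block = f.read(size) + tail
                lines = block.split(b'\n')
                tail = lines.pop(0)  # possibly incomplete; completed by the next block
                for line in reversed(lines):
                    yield line
            yield tail

Scanning that backwards for the last complete </page> (or the last <title>) would give the resume point without holding more than one block in memory.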

@nemobis nemobis self-assigned this Feb 10, 2015
@nemobis
Member

nemobis commented Mar 10, 2015

Is this a history too big to be downloaded at once, or a bug in #228 ?

Removing the last chunk of past XML dump: it is probably incomplete
Cannot allocate memory
Cannot allocate memory
Last 124518 lines removed.
    Creepypasta Wiki:Chat/Logs/8 December 2014, 78 edits
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2005, in <module>
    main()
  File "./dumpgenerator.py", line 1995, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 1649, in resumePreviousDump
    session=other['session'])
  File "./dumpgenerator.py", line 684, in generateXMLDump
    for xml in getXMLPage(config=config, title=title, session=session):
  File "./dumpgenerator.py", line 612, in getXMLPage
    yield '  <revision>' + ('<revision>'.join(xml2.split('<revision>')[1:]))
MemoryError
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
$ wc -l creepypastawikiacom-20150207-history.xml
165624557 creepypastawikiacom-20150207-history.xml

Either way, one thing I still have to do: a try/except to catch MemoryError and retry with fewer revisions. And avoid that xml2.split() over huge XML files.
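
A sketch of the retry idea, with fetch_chunk() standing in for the actual Special:Export request; its limit/offset parameters mirror Special:Export's own:

    def get_history(title, fetch_chunk, limit=1000):
        # Ask for progressively smaller chunks of the history whenever a
        # MemoryError is hit, instead of failing the whole title.
        offset = None  # timestamp of the last revision seen, for paging
        while True:
            try:
                chunk, offset, done = fetch_chunk(title, limit=limit, offset=offset)
            except MemoryError:
                if limit <= 1:
                    raise  # even a single revision does not fit: give up
                limit //= 2  # halve the request size and retry
                continue
            yield chunk
            if done:
                break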

@nemobis nemobis added this to the 0.3 milestone Mar 30, 2015
@nemobis
Member

nemobis commented Mar 30, 2015

Mostly fixed with d4fd745
I think we are no longer doing .split() on any huge XML, which was our biggest issue. There are still some places where a MemoryError can occur; perhaps they should be logged by wrapping the entire main() in a try/except.

@nemobis
Member

nemobis commented Mar 30, 2015

  File "./dumpgenerator.py", line 2020, in <module>
    main()
  File "./dumpgenerator.py", line 2010, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 1664, in resumePreviousDump
    session=other['session'])
  File "./dumpgenerator.py", line 687, in generateXMLDump
    for xml in getXMLPage(config=config, title=title, session=session):
  File "./dumpgenerator.py", line 574, in getXMLPage
    xml = re.sub(r'\n\s*<sha1>\w+</sha1>\s*\n', r'\n', xml)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
MemoryError
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.

@nemobis
Member

nemobis commented Mar 30, 2015

And once that's fixed:

Traceback (most recent call last):
  File "./dumpgenerator.py", line 2030, in <module>
    main()
  File "./dumpgenerator.py", line 2020, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 1674, in resumePreviousDump
    session=other['session'])
  File "./dumpgenerator.py", line 697, in generateXMLDump
    for xml in getXMLPage(config=config, title=title, session=session):
  File "./dumpgenerator.py", line 587, in getXMLPage
    yield xml.split("</page>")[0]
MemoryError
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
#########################################################################

@nemobis
Member

nemobis commented Mar 30, 2015

StringIO is not smart enough :(

  File "./dumpgenerator.py", line 2022, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 1676, in resumePreviousDump
    session=other['session'])
  File "./dumpgenerator.py", line 699, in generateXMLDump
    for xml in getXMLPage(config=config, title=title, session=session):
  File "./dumpgenerator.py", line 582, in getXMLPage
    for line in readxml:
  File "/usr/lib/python2.7/StringIO.py", line 76, in next
    r = self.readline()
  File "/usr/lib/python2.7/StringIO.py", line 164, in readline
    r = self.buf[self.pos:newpos]
MemoryError
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
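
An alternative to buffering the whole response in a StringIO would be to let requests stream the body line by line; iter_lines() is standard requests API, while the URL and parameters here are only illustrative:

    import requests

    r = requests.get('https://wiki.example.org/index.php',
                     params={'title': 'Special:Export', 'pages': 'Some_page',
                             'history': '1'},
                     stream=True)
    # Iterate over decoded lines as they arrive instead of holding the
    # full multi-GiB body in memory.
    for line in r.iter_lines(decode_unicode=True):
        if line and '<title>' in line:
            print(line.strip())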

@nemobis
Member

nemobis commented Mar 30, 2015

Funny example works :)

Analysing http://glee.wikia.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "Blaine-Kurt Relationship"
Retrieving the XML for every page from "Blaine-Kurt Relationship"
The page's history exceeds our memory, halving limit.
The page's history exceeds our memory, halving limit.
The page's history exceeds our memory, halving limit.
    Blaine-Kurt Relationship, 3483 edits

@nemobis
Member

nemobis commented Jul 5, 2015

IIRC this still failed some wiki downloads, which I have left rotting since then. For now I'll focus on another round of generic/small wiki downloads, avoiding the huge ones (not enough disk space anyway), so I won't be able to test thoroughly.

@DanielOaks
Contributor

I just had this while trying to download a decently large wiki:

Traceback (most recent call last):
  File "./dumpgenerator.py", line 2054, in <module>
    main()
  File "./dumpgenerator.py", line 2046, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1621, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 695, in generateXMLDump
    for xml in getXMLPage(config=config, title=title, session=session):
  File "./dumpgenerator.py", line 629, in getXMLPage
    xml2 = xml2.split("</page>")[0]
MemoryError
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.

edit:
Got this error as well, after that one was fixed:

Traceback (most recent call last):
  File "./dumpgenerator.py", line 2054, in <module>
    main()
  File "./dumpgenerator.py", line 2046, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1621, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 695, in generateXMLDump
    for xml in getXMLPage(config=config, title=title, session=session):
  File "./dumpgenerator.py", line 604, in getXMLPage
    params=params, config=config, session=session)
  File "./dumpgenerator.py", line 542, in getXMLPageCore
    xml = fixBOM(r)
  File "./dumpgenerator.py", line 1566, in fixBOM
    if request.text.startswith(u'\ufeff'):
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 786, in text
    content = str(self.content, encoding, errors='replace')
MemoryError
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.

I'm currently going through and trying to add some memory fixes in certain places to address these issues, let the script download the larger pages, and hopefully make it more stable. Once I have a decent patch I'll submit a PR!

@nemobis
Member

nemobis commented Oct 22, 2015

Please also check that all the revisions are actually being downloaded. (One possible method: at the end of the process, run a simple grep that counts the <page> and <revision> tags, and compare that to the statistics on the wiki.)
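
A sketch of that check in Python rather than grep, with the dump path and API URL as placeholders; siprop=statistics returns the wiki's own page and edit counts:

    import requests

    def count_tags(dumppath):
        # Stream through the dump counting opening tags, without loading it whole.
        pages = revisions = 0
        with open(dumppath) as dump:
            for line in dump:
                pages += line.count('<page>')
                revisions += line.count('<revision>')
        return pages, revisions

    def wiki_statistics(api_url):
        r = requests.get(api_url, params={'action': 'query', 'meta': 'siteinfo',
                                          'siprop': 'statistics', 'format': 'json'})
        return r.json()['query']['statistics']  # includes 'pages' and 'edits'

    pages, revisions = count_tags('wiki-history.xml')
    stats = wiki_statistics('https://wiki.example.org/api.php')
    print('dump: %d pages, %d revisions; wiki reports %d pages, %d edits'
          % (pages, revisions, stats['pages'], stats['edits']))

The numbers will not match exactly (deleted pages, edits made after the dump started), but a large gap would flag missing revisions.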

@nemobis
Member

nemobis commented Feb 7, 2020

The current solution is to use --xmlrevisions (#311); I closed #282.
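
For reference, the idea behind --xmlrevisions is to page through the history via the API instead of Special:Export; a rough sketch using the standard prop=revisions continuation (error handling and the option's actual logic omitted):

    import requests

    def iter_revisions(api_url, title, session=None):
        # Fetch a page's revisions in small batches, following the API's
        # continuation parameters, so no single response has to hold the
        # whole history.
        session = session or requests.Session()
        params = {'action': 'query', 'prop': 'revisions', 'titles': title,
                  'rvlimit': 50, 'rvprop': 'ids|timestamp|user|comment|content',
                  'format': 'json', 'continue': ''}
        while True:
            data = session.get(api_url, params=params).json()
            for page in data['query']['pages'].values():
                for rev in page.get('revisions', []):
                    yield rev
            if 'continue' not in data:
                break
            params.update(data['continue'])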

@nemobis
Member

nemobis commented Feb 9, 2020

A giant page on which this can be tested: https://wiki.thingsandstuff.org/index.php?title=Audio&offset=&limit=5000&action=history . The graduated response seemed to work:

Analysing https://wiki.thingsandstuff.org/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "Audio"
https://wiki.thingsandstuff.org/api.php
Retrieving the XML for every page from "Audio"
Removing the last chunk of past XML dump: it is probably incomplete.
The page's history exceeds our memory, halving limit.
The page's history exceeds our memory, halving limit.
The page's history exceeds our memory, halving limit.
    Audio, 5017 edits

@nemobis nemobis modified the milestones: 0.3, 0.4 Feb 10, 2020
@nemobis
Member

nemobis commented Mar 2, 2020

We have one MemoryError on resume:

Analysing http://wiki.urbandead.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "NecroWatch/Procedures"
http://wiki.urbandead.com/api.php
Retrieving the XML for every page from "NecroWatch/Procedures"
Removing the last chunk of past XML dump: it is probably incomplete.
nSleeping... 0 seconds...
    NecroWatch/Procedures, 2 edits
Sleeping... 0 seconds...
Traceback (most recent call last):
  File "dumpgenerator.py", line 2528, in <module>
    main()
  File "dumpgenerator.py", line 2518, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 2165, in resumePreviousDump
    session=other['session'])
  File "dumpgenerator.py", line 764, in generateXMLDump
    for xml in getXMLPage(config=config, title=title, session=session):
  File "dumpgenerator.py", line 616, in getXMLPage
    xml = re.sub(r'\n\s*<sha1>\w+</sha1>\s*\n', r'\n', xml)
  File "/usr/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
MemoryError

gausie pushed a commit to gausie/wikiteam that referenced this issue Nov 18, 2023