ZeroDivisionError when training with zero-length data #49

Open
haywhisksoftware opened this issue Feb 4, 2014 · 4 comments

@haywhisksoftware

(Minor bug.)
I installed scrapely from pip this morning.

This is a wacky edge case, but I think you could raise a more constructive error.

(Who wants to extract a zero-length string from a document? It's a bit like a magician pulling some atmosphere out of a hat: it's always going to be there...)

Check it out:

In [97]: from scrapely import Scraper

In [98]: s = Scraper()

In [99]: s.train('http://www.google.com', {'image': u''})
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
/home/username/myfolder/<ipython-input-99-233d0ac90e7f> in <module>()
----> 1 s.train('http://www.google.com', {'image': u''})

/usr/local/lib/python2.7/dist-packages/scrapely/__init__.pyc in train(self, url, data, encoding)
     44     def train(self, url, data, encoding=None):
     45         page = url_to_page(url, encoding)
---> 46         self.train_from_htmlpage(page, data)
     47 
     48     def scrape(self, url, encoding=None):

/usr/local/lib/python2.7/dist-packages/scrapely/__init__.pyc in train_from_htmlpage(self, htmlpage, data)
     39                 if isinstance(value, str):
     40                     value = value.decode(htmlpage.encoding or 'utf-8')
---> 41                 tm.annotate(field, best_match(value))
     42         self.add_template(tm.get_template())
     43 

/usr/local/lib/python2.7/dist-packages/scrapely/template.pyc in annotate(self, field, score_func, best_match)
     31 
     32         """
---> 33         indexes = self.select(score_func)
     34         if not indexes:
     35             raise FragmentNotFound("Fragment not found annotating %r using: %s" % 

/usr/local/lib/python2.7/dist-packages/scrapely/template.pyc in select(self, score_func)
     46         matches = []
     47         for i, fragment in enumerate(htmlpage.parsed_body):
---> 48             score = score_func(fragment, htmlpage)
     49             if score:
     50                 matches.append((score, i))

/usr/local/lib/python2.7/dist-packages/scrapely/template.pyc in func(fragment, page)
     95         fdata = page.fragment_data(fragment).strip()
     96         if text in fdata:
---> 97             return float(len(text)) / len(fdata) - (1e-6 * fragment.start)
     98         else:
     99             return 0.0

ZeroDivisionError: float division by zero
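
The failure in the traceback above can be reproduced without scrapely or a network call. The sketch below is a minimal stand-in, not scrapely code: `score` copies only the expression from line 97 of the traceback, and `check_training_value` is a hypothetical helper showing one way a more constructive error could be raised before scoring starts. The key point is that the empty string is a substring of every string, including the empty string, so `text in fdata` is true even when `len(fdata)` is 0.

```python
def score(text, fdata, start):
    """Stand-in for the scoring expression shown in the traceback.

    `fdata` plays the role of the stripped fragment text and `start`
    the role of `fragment.start`. This is not scrapely's own function.
    """
    if text in fdata:
        # With text == '' and fdata == '' this is 0.0 / 0.
        return float(len(text)) / len(fdata) - (1e-6 * start)
    return 0.0


def check_training_value(field, value):
    """Hypothetical validation helper (an assumption, not part of scrapely).

    Rejects empty or whitespace-only training values with a message
    that names the offending field.
    """
    if not value.strip():
        raise ValueError("cannot train field %r with empty text" % (field,))
    return value


if __name__ == "__main__":
    try:
        score(u"", u"", 0)
    except ZeroDivisionError as exc:
        print("reproduced: %s" % exc)

    try:
        check_training_value("image", u"")
    except ValueError as exc:
        print("constructive error: %s" % exc)
```

Whether validation belongs in `train` or in the scoring function is a design choice the thread leaves open; the helper only illustrates the kind of message the report asks for.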
@pablohoffman changed the title from "Train with zero-length expected text -> ZeroDivisionError" to "ZeroDivisionError when training with zero-length data" on Apr 25, 2014
@ironmaniiith

This is the reason for the error.

return float(len(text)) / len(fdata) - (1e-6 * fragment.start)

If the returned score is inversely proportional to the length of fdata, can we just write this?

def func(fragment, page):
    fdata = page.fragment_data(fragment).strip()
    if text in fdata:
        # Guard the division: an empty fragment gets the highest possible score.
        if not len(fdata):
            return float("inf")
        return float(len(text)) / len(fdata) - (1e-6 * fragment.start)
    else:
        return 0.0
return func
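
For anyone who wants to try the guard proposed above without installing scrapely, here is a runnable sketch. `Fragment` and `FakePage` are stand-ins invented for this example (assumptions, not scrapely classes); only the body of `func` follows the snippet in this comment, and the enclosing `best_match(text)` name is taken from the traceback.

```python
from collections import namedtuple

# Stand-in for a parsed fragment: only the offsets used by the scorer.
Fragment = namedtuple("Fragment", ["start", "end"])


class FakePage(object):
    """Stand-in page object exposing the one method the scorer calls."""

    def __init__(self, body):
        self.body = body

    def fragment_data(self, fragment):
        return self.body[fragment.start:fragment.end]


def best_match(text):
    """Return a scoring function for `text`, with the proposed guard."""
    def func(fragment, page):
        fdata = page.fragment_data(fragment).strip()
        if text in fdata:
            # Proposed guard: avoid dividing by zero on an empty fragment.
            if not len(fdata):
                return float("inf")
            return float(len(text)) / len(fdata) - (1e-6 * fragment.start)
        else:
            return 0.0
    return func


if __name__ == "__main__":
    page = FakePage(u"<p>  </p>hello")
    blank = Fragment(3, 5)    # two spaces, empty after strip()
    word = Fragment(9, 14)    # "hello"
    print(best_match(u"")(blank, page))
    print(best_match(u"hello")(word, page))
```

Returning `float("inf")` makes an empty fragment the best possible match for empty text. Returning `0.0` instead would treat it as no match at all; the thread does not settle which behaviour is preferable.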

@moneypython

This isn't a wacky edge-case at all.

I got the same error using actual data and had to patch it.

@marekyggdrasil
Contributor

Same here: I reproduced this error using regular, non-empty data.

marekyggdrasil added a commit to marekyggdrasil/scrapely that referenced this issue Nov 27, 2019
ruairif added a commit that referenced this issue Nov 28, 2019
patch for issue #49 and fixed Travis tests
@marekyggdrasil
Contributor

The patch has been merged; I believe this issue can be closed.
