[FIX] Statistics.util.stats: Fix negative #nans for sparse #1659
Merged
Conversation
Computing the number of nans for sparse matrices was broken and sometimes returned negative numbers. The variable `non_zero` contains the number of defined values in each column. So, to calculate the number of undefined values (nans) in each column, we have to subtract `non_zero` from the number of values in the column, i.e. `X.shape[0]` and not `X.shape[1]`. The bug was noticed when the number of defined values in some column was larger than the number of features, and consequently the number of nans was negative. This caused the density in OWTable to exceed 100%.
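The arithmetic above can be sketched in plain Python. This is only an illustration, not the code changed in this PR: the helper `count_undefined_per_column` and the list `stored_cols` are hypothetical names, and the sparse matrix is represented simply by its shape plus the column index of every stored entry.

```python
def count_undefined_per_column(shape, stored_cols):
    """Count the entries missing from each column of a sparse matrix.

    shape       -- (n_rows, n_cols), playing the role of X.shape
    stored_cols -- column index of every stored (defined) entry (assumed layout)
    """
    n_rows, n_cols = shape
    # non_zero[j] is the number of defined values in column j.
    non_zero = [0] * n_cols
    for j in stored_cols:
        non_zero[j] += 1
    # Each column holds n_rows values, so subtract from shape[0], not shape[1].
    return [n_rows - k for k in non_zero]


# 5 rows, 2 columns: column 0 is fully defined, column 1 has one defined value.
shape = (5, 2)
stored_cols = [0, 0, 0, 0, 0, 1]

correct = count_undefined_per_column(shape, stored_cols)
# The old computation subtracted from shape[1] (the number of columns) instead.
buggy = [shape[1] - stored_cols.count(j) for j in range(shape[1])]

print(correct)  # [0, 4]
print(buggy)    # [-3, 1]
```

With more rows than columns, a well-filled column holds more defined values than there are features, which is exactly when the old formula goes negative.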
Current coverage is 88.73% (diff: 100%)
@@             master   #1659   diff @@
==========================================
  Files            78      78
  Lines          8150    8150
  Methods           0       0
  Messages          0       0
  Branches          0       0
==========================================
  Hits           7232    7232
  Misses          918     918
  Partials          0       0
lanzagar
requested changes
Oct 14, 2016
Contributor
lanzagar
left a comment
If the wrong computation was unnoticed, that probably means there is no test for nan counting. Could you add a simple one?
Contributor
Author
@lanzagar IMHO we already have a test for this, the one that I corrected in this PR; it was just incorrect. For sparse data, "nans" are actually values not present in the sparse matrix and the [...]
Contributor
You are right, the (not very clear) test that was already there (but broken) does test this too. I guess it is enough to cover the basics now that it is fixed.
lanzagar
approved these changes
Oct 17, 2016