[WIP][ENH] Scatterplot, HeatMap, TreeGraph, ConfusionMatrix and Unsupervised widgets: Output Flagged Data #1655
Conversation
|
I haven't yet checked the content of the PR (obviously), but before we merge it we should perhaps talk about the signal name again. I don't think that the verb "to flag" appears in the dictionary with that meaning. Even the use of word "flag" for some kind of marker is IMHO limited to computer science, so the name "flagged data" wouldn't mean anything to an outsider. What is wrong with "marked"? Vesna, sorry if this will require some (hopefully trivial) refactoring. |
Current coverage is 89.38% (diff: 100%)@@ master #1655 diff @@
==========================================
Files 79 79
Lines 8589 8593 +4
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 7677 7681 +4
Misses 912 912
Partials 0 0
|
|
No worries.. I've intentionally made this PS 'small' (only four widgets are modified). |
cd69ccc to
cafa07c
Compare
| if name not in names: | ||
| return name | ||
| counts = [int(re.match(r"(" + name + " )(\d{1,}$)", n).group(2)) | ||
| for n in names if re.match(r"(" + name + " )(\d{1,}$)", n)] + [1] |
There was a problem hiding this comment.
Something like this:
counts = max((int(mo.group(2))
for mo in re.finditer(r"(" + name + " )(\d{1,}$)", n)),
default=0)There was a problem hiding this comment.
Matter of taste: I'd write r"({})(\d{{1,}}$".format(name)
| if subsets: | ||
| return self.instances[np.unique(np.hstack(subsets))] | ||
|
|
||
| def get_indices(self, nodes): |
There was a problem hiding this comment.
get_instances could call this function. Don't forget to handle the case when get_indices returns None.
| self.assertEqual(len(flagged), len(self.zoo)) | ||
| self.assertEqual(0, np.sum([i[FLAGGED_FEATURE_NAME] for i in flagged])) | ||
|
|
||
| def test_cascade_flagged_tables(self): |
There was a problem hiding this comment.
I could be too smart by half and replace your function that uses regular expressions with one that uses the first unoccupied name. This test wouldn't fail because there are no "holes". Can you add some code to this test to remove the second meta and then "flag" the table again, so my smart idea would fail?
| import numpy as np | ||
| from Orange.data import Table, Domain, DiscreteVariable | ||
|
|
||
| FLAGGED_SIGNAL_NAME = "Flagged Data" |
There was a problem hiding this comment.
Thinking about it again, I started liking the constant because it will indeed make us stick to the same name in all widgets.
The name is not perfect, though. Not only will we change "flagged" to something else, but it also suggests that it is a name of a flagged signal. ANNOTATED_DATA_SIGNAL_NAME = "Data" is better (than ANNOTATED_SIGNAL_NAME) but awfully long. Think about it...
There was a problem hiding this comment.
If nothing else, the module belongs within Orange.widgets because it includes a name of the signal, hence it is related to widgets. :)
| self.assertEqual(0, np.sum([i[FLAGGED_FEATURE_NAME] for i in flagged])) | ||
|
|
||
| # select data points | ||
| points = random.sample(range(0, len(self.iris)), 20) |
There was a problem hiding this comment.
I prefer deterministic tests. Randomly choose some indices instead of choosing some indices at random. (https://xkcd.com/221/).
|
|
||
| # check selected data output | ||
| selected = self.get_output("Data") | ||
| self.assertEqual(len(selected), len(points)) |
There was a problem hiding this comment.
What about testing that the correct instances were chosen? There may be a better way to do it, but I somewhere used something like np.testing.assert_almost_equal(selected.X, self.iris.X[points]).
| self.assertEqual(0, np.sum([i[FLAGGED_FEATURE_NAME] for i in flagged])) | ||
|
|
||
| # select data points | ||
| points = random.sample(range(0, len(self.iris)), 20) |
|
Github is showing some of my comments as outdated, although you haven't made any further commits. Please check those, too. Apart from these minor suggestions, I like the PR, in particular your factoring out of the parts of the tests. It would be even greater if you could simulate, say, selection action, but I know this is probably too hard. Please tell me/us when you make the changes, so we don't wait too long with merging, since rebasing dozens of widgets is not that much fun. |
|
The comments are outdated because the code was moved to a Mixin in one of the following commits (441b15a), since it was very similar for all widgets. I could have rebased, but wanted to keep the code in case someone didn't like the Mixin idea. |
Since there are only clusters in Selected Data, 'Other' should be removed from its domain. The value is still present in Flagged Data domain.
|
Done. |
| domain = Domain(data.domain.attributes, data.domain.class_vars, metas) | ||
| annotated = np.zeros((len(data), 1)) | ||
| if selected_indices is not None: | ||
| annotated[selected_indices] = 1 |
There was a problem hiding this comment.
Should this be 0 or 1? If nothing is selected, all instances have to have Selected=No, no?
There was a problem hiding this comment.
Should this be 0 or 1? If nothing is selected, all instances have to have Selected=No, no?
I'm stupid. Please ignore.
| selected = [i for i, t in enumerate(zip( | ||
| self.widget.results.actual, self.widget.results.predicted[0])) | ||
| if t in indices] | ||
| self.selected_indices = self.widget.results.row_indices[selected] |
There was a problem hiding this comment.
Would it be nicer if _select_data returned selected_indices instead of (ab?)using the instance's attributes for semi-global data storage?
| self.same_input_output_domain = True | ||
| self.selected_indices = [] | ||
|
|
||
| def test_outputs(self): |
There was a problem hiding this comment.
I like the way you factor out the tests.
| def _compare_selected_annotated_domains(self, selected, annotated): | ||
| selected_vars = selected.domain.variables + selected.domain.metas | ||
| annotated_vars = annotated.domain.variables + annotated.domain.metas | ||
| self.assertTrue(all((var in annotated_vars for var in selected_vars))) |
There was a problem hiding this comment.
This tests whether annotated.domain.variables + annotated.domain.metas are a subset (<=) of selected.domain.variables + selected.domain.metas. Doing it explicitly, using sets, would be more obvious and easier to read, I guess.
| if not selected_indices: | ||
| self.send("Selected Data", None) | ||
| self.send("Other Data", None) | ||
| annotated_data = create_annotated_table(items, selected_indices) \ |
There was a problem hiding this comment.
Not really an issue, but since you're going to make another commit anyway, can you replace selected_indices with [], so that it's obvious this call will always select all (or no) instances.
| unselected_data = data[~mask] | ||
| if self.append_clusters: | ||
| def remove_other_value(vars_): | ||
| vars_ = [var for var in vars_] |
janezd
left a comment
There was a problem hiding this comment.
I went through all changes and widgets. Since none of my suggestions are substantial, I'm approving the request, but you can still follow them if you decide so.
|
I will fix the suggested in the next PR. |
No description provided.