-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Polish GetColumn on IDataView #2738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2738 +/- ##
==========================================
+ Coverage 71.65% 71.65% +<.01%
==========================================
Files 807 807
Lines 142337 142374 +37
Branches 16117 16120 +3
==========================================
+ Hits 101986 102024 +38
+ Misses 35916 35914 -2
- Partials 4435 4436 +1
|
|
||
if (!data.Schema.TryGetColumnIndex(columnName, out int col)) | ||
throw env.ExceptSchemaMismatch(nameof(columnName), "input", columnName); | ||
if (!data.Schema.TryGetColumnIndex(column.Name, out int colIndex)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!data.Schema.TryGetColumnIndex(column.Name, out int colIndex) [](start = 12, length = 65)
With this implementation I don't see any reason to have DataViewSchema.Column column
instead of just string columnName
.
So imagine I'm doing following:
I have column A
of type string. I modify that column to vector<float>
with same name A
.
Then I apply ToKey to A
, but preserve name.
In old code, I can say: give me A
and i would get key A.
with your code, I can inspect IDataview
and I can find 3 A
columns, two are hidden and one is not.
I pick A
which is vector<float>
and pass to your function.
I would get exception, right?
What's the point of this new function, if it does same thing, but in more obfuscated way by forcing me to get DataViewSchema.Column
first instead of just name?
#Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this?
var colIndex = column.Index;
var colType = column.Type;
var colName = column.Name;
// Use column index as the principle address of the specified input column and check if that address in data contains
// the column indicated.
if (data.Schema[colIndex].Name != colName || data.Schema[colIndex].Type != colType)
throw Contracts.ExceptParam(nameof(column), string.Format("column with name {0}, type {1}, and index {2} cannot be found in {3}",
colName, colType, colIndex, nameof(data)));
We use input column.Index
to retrieve information from data
. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -79,7 +79,7 @@ private void IntermediateData(string dataPath) | |||
// The same extension method also applies to the dynamic-typed data, except you have to | |||
// specify the column name and type: | |||
var dynamicData = transformedData.AsDynamic; | |||
var sameFeatureColumns = dynamicData.GetColumn<string[]>(mlContext, "AllFeatures") | |||
var sameFeatureColumns = dynamicData.GetColumn<string[]>(dynamicData.Schema["AllFeatures"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't forget to update cookbook! both of them! #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. #Resolved
/// <param name="columnName">The name of the column to extract.</param> | ||
public static IEnumerable<T> GetColumn<T>(this IDataView data, IHostEnvironment env, string columnName) | ||
/// <param name="column">The column to be extracted.</param> | ||
public static IEnumerable<T> GetColumn<T>(this IDataView data, DataViewSchema.Column column) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
column [](start = 93, length = 6)
I would prefer to have string version of this method as well, but I have impression Tom doesn't like that idea (not sure why).
No blocker, just fishing for explanation. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe for consistency with static APIs? Also, users might retrive column name from IDataView.Schema
so accepting a Schema.Column
directly looks fine maybe. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to have string version of this method as well, but I have impression Tom doesn't like that idea (not sure why).
I think this is how religious doctrine can get established in some cases. :) I do not believe I have ever said anything of the kind w.r.t. this @Ivanidzo4ka.
But seriously, I think the core interfaces of IDataView
itself and its accessors ought to work over Schema.Column
. It's just safer, and I think we ought to make our core interfaces as principled as this. The conveniences we layer on top of those core interfaces though, are another matter entirely. A convenience like this however, surely benefits from a string
. I think it's quite fine to have both; in the one case, you get some degree of "safety." In the other, convenience. This is clearly a case where convenience is the intent, and insisting on a column is inconvenient. So we should at least have a string overload as an option, and if we had to choose between one or the other (we don't) I have a clear preference for string
.
In reply to: 260585414 [](ancestors = 260585414)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String overloading is added in Iteration 7.
In reply to: 260860073 [](ancestors = 260860073,260585414)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
d205480
to
9af95ab
Compare
var colType = column.Type; | ||
var colName = column.Name; | ||
|
||
// Use column index as the principle address of the specified input column and check if that address in data contains |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this logic is to make sure the column
(type DataViewSchema.Column) is indeed from the same data
(type IDataView). Can we add a test that exercises this ? Perhaps by passing a column
from a different dataview, and making sure it fails.
I see some tests which exercise the type (given the same dataview), but none which exercises the new logic added above #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wschin !
To fix #2473, we need to
Constracts
.DataViewSchema.Column
in GetColumn's argument list.Review steps:
DataViewSchema.Column
are done in the 3rd iteration.