Skip to content

Polish GetColumn on IDataView #2738

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Feb 27, 2019
Merged

Conversation

wschin
Copy link
Member

@wschin wschin commented Feb 26, 2019

To fix #2473, we need to

  • Remove IHostEnviroment from GetColumn's argument list. Note that our strategy is using Constracts.
  • Replace column name with DataViewSchema.Column in GetColumn's argument list.

Review steps:

  1. Start with the 2nd iteration because some changes in the 1st iteration got discarded.
  2. Removal of IHostEnviroment happens in the 2nd iteration.
  3. Uses of DataViewSchema.Column are done in the 3rd iteration.

@wschin wschin self-assigned this Feb 26, 2019
@codecov
Copy link

codecov bot commented Feb 26, 2019

Codecov Report

Merging #2738 into master will increase coverage by <.01%.
The diff coverage is 96.73%.

@@            Coverage Diff             @@
##           master    #2738      +/-   ##
==========================================
+ Coverage   71.65%   71.65%   +<.01%     
==========================================
  Files         807      807              
  Lines      142337   142374      +37     
  Branches    16117    16120       +3     
==========================================
+ Hits       101986   102024      +38     
+ Misses      35916    35914       -2     
- Partials     4435     4436       +1
Flag Coverage Δ
#Debug 71.65% <96.73%> (ø) ⬆️
#production 67.9% <100%> (-0.01%) ⬇️
#test 85.85% <95.94%> (+0.02%) ⬆️
Impacted Files Coverage Δ
...rosoft.ML.StaticPipelineTesting/StaticPipeTests.cs 95.26% <100%> (ø) ⬆️
src/Microsoft.ML.Data/Utilities/ColumnCursor.cs 92% <100%> (+2.41%) ⬆️
test/Microsoft.ML.Tests/RangeFilterTests.cs 100% <100%> (ø) ⬆️
...est/Microsoft.ML.Tests/Scenarios/GetColumnTests.cs 100% <100%> (ø) ⬆️
...ft.ML.Tests/Scenarios/Api/Estimators/Visibility.cs 100% <100%> (ø) ⬆️
...crosoft.ML.Tests/Transformers/ValueMappingTests.cs 100% <100%> (ø) ⬆️
...sts/TrainerEstimators/SymSgdClassificationTests.cs 100% <100%> (ø) ⬆️
...s/Scenarios/Api/CookbookSamples/CookbookSamples.cs 90.9% <100%> (ø) ⬆️
test/Microsoft.ML.Tests/Scenarios/Api/TestApi.cs 97.6% <100%> (ø) ⬆️
test/Microsoft.ML.Tests/CachingTests.cs 100% <100%> (ø) ⬆️
... and 5 more

@wschin wschin changed the title [WIP] Polish GetColumn on IDataView Polish GetColumn on IDataView Feb 26, 2019

if (!data.Schema.TryGetColumnIndex(columnName, out int col))
throw env.ExceptSchemaMismatch(nameof(columnName), "input", columnName);
if (!data.Schema.TryGetColumnIndex(column.Name, out int colIndex))
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (!data.Schema.TryGetColumnIndex(column.Name, out int colIndex) [](start = 12, length = 65)

With this implementation I don't see any reason to have DataViewSchema.Column column instead of just string columnName.

So imagine I'm doing following:
I have column A of type string. I modify that column to vector<float> with same name A.
Then I apply ToKey to A, but preserve name.

In old code, I can say: give me A and i would get key A.
with your code, I can inspect IDataview and I can find 3 A columns, two are hidden and one is not.
I pick A which is vector<float> and pass to your function.

I would get exception, right?
What's the point of this new function, if it does same thing, but in more obfuscated way by forcing me to get DataViewSchema.Column first instead of just name?
#Closed

Copy link
Member Author

@wschin wschin Feb 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this?

            var colIndex = column.Index;
            var colType = column.Type;
            var colName = column.Name;

            // Use column index as the principle address of the specified input column and check if that address in data contains
            // the column indicated.
            if (data.Schema[colIndex].Name != colName || data.Schema[colIndex].Type != colType)
                throw Contracts.ExceptParam(nameof(column), string.Format("column with name {0}, type {1}, and index {2} cannot be found in {3}",
                    colName, colType, colIndex, nameof(data)));

We use input column.Index to retrieve information from data. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.


In reply to: 260491233 [](ancestors = 260491233)

@wschin wschin requested a review from Ivanidzo4ka February 26, 2019 23:26
@@ -79,7 +79,7 @@ private void IntermediateData(string dataPath)
// The same extension method also applies to the dynamic-typed data, except you have to
// specify the column name and type:
var dynamicData = transformedData.AsDynamic;
var sameFeatureColumns = dynamicData.GetColumn<string[]>(mlContext, "AllFeatures")
var sameFeatureColumns = dynamicData.GetColumn<string[]>(dynamicData.Schema["AllFeatures"])
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't forget to update cookbook! both of them! #Resolved

Copy link
Member Author

@wschin wschin Feb 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. #Resolved

/// <param name="columnName">The name of the column to extract.</param>
public static IEnumerable<T> GetColumn<T>(this IDataView data, IHostEnvironment env, string columnName)
/// <param name="column">The column to be extracted.</param>
public static IEnumerable<T> GetColumn<T>(this IDataView data, DataViewSchema.Column column)
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

column [](start = 93, length = 6)

I would prefer to have string version of this method as well, but I have impression Tom doesn't like that idea (not sure why).
No blocker, just fishing for explanation. #Resolved

Copy link
Member Author

@wschin wschin Feb 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for consistency with static APIs? Also, users might retrive column name from IDataView.Schema so accepting a Schema.Column directly looks fine maybe. #Resolved

Copy link
Contributor

@TomFinley TomFinley Feb 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have string version of this method as well, but I have impression Tom doesn't like that idea (not sure why).

I think this is how religious doctrine can get established in some cases. :) I do not believe I have ever said anything of the kind w.r.t. this @Ivanidzo4ka.

But seriously, I think the core interfaces of IDataView itself and its accessors ought to work over Schema.Column. It's just safer, and I think we ought to make our core interfaces as principled as this. The conveniences we layer on top of those core interfaces though, are another matter entirely. A convenience like this however, surely benefits from a string. I think it's quite fine to have both; in the one case, you get some degree of "safety." In the other, convenience. This is clearly a case where convenience is the intent, and insisting on a column is inconvenient. So we should at least have a string overload as an option, and if we had to choose between one or the other (we don't) I have a clear preference for string.


In reply to: 260585414 [](ancestors = 260585414)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String overloading is added in Iteration 7.


In reply to: 260860073 [](ancestors = 260860073,260585414)

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

var colType = column.Type;
var colName = column.Name;

// Use column index as the principle address of the specified input column and check if that address in data contains
Copy link
Member

@abgoswam abgoswam Feb 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this logic is to make sure the column  (type DataViewSchema.Column)  is indeed from the same data  (type IDataView). Can we add a test that exercises this ? Perhaps by passing a column from a different dataview, and making sure it fails.

I see some tests which exercise the type (given the same dataview), but none which exercises the new logic added above #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in Iteration 6.


In reply to: 260851354 [](ancestors = 260851354)

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wschin !

@wschin wschin merged commit a0edc5c into dotnet:master Feb 27, 2019
@wschin wschin deleted the polish-get-column branch February 27, 2019 18:42
@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GetColumn method is extension for IDataView but not mlContext.
4 participants