Automatic conversion of data in the pipeline #3060

Closed
singlis opened this issue Mar 21, 2019 · 3 comments
Labels
enhancement New feature or request

Comments

@singlis
Member

singlis commented Mar 21, 2019

Issue

When providing data to a pipeline, we expect the user to know which data type the trainer requires. This is painful for end users: they must not only know which types to convert to, but also add extra steps to the pipeline to perform the conversions.

The example from #3037 demonstrates the issue. The pipeline takes integer values for the Label and Features and passes them to the SDCA trainer. Because the data is integer-based, the pipeline uses ConvertType to convert from int to float, followed by Concatenate to produce a vector type (the example is in F#, but the same applies to C#):

        let mlContext = MLContext()
        EstimatorChain()
           .Append(mlContext.Transforms.Conversion.ConvertType("Features", "Price", DataKind.Single))
           .Append(mlContext.Transforms.Conversion.ConvertType("Label", "Area", DataKind.Single))
           .Append(mlContext.Transforms.Concatenate("Features", "Features"))
           .AppendCacheCheckpoint(mlContext)
           .Append(mlContext.Regression.Trainers.StochasticDualCoordinateAscent("Label", "Features"))
           , mlContext

Without the conversions, the user hits an exception: first that the Label is expected to be of type float, and then that Features is expected to be a vector of floats.
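The failure described above can be sketched as a schema check that compares the type a trainer expects against the type a column actually has. This is only an illustration: `ColumnType` and `checkColumn` are hypothetical names invented for this sketch, not ML.NET types or APIs, and the real exception text will differ.

```fsharp
// Hypothetical column-type model; these names are illustrative, not ML.NET types.
type ColumnType =
    | Int32Scalar
    | SingleScalar
    | SingleVector

/// Returns Ok () when a column has the type the trainer expects,
/// otherwise an Error describing the mismatch.
let checkColumn (role: string) (expected: ColumnType) (actual: ColumnType) =
    if actual = expected then
        Ok ()
    else
        Error (sprintf "%s: expected %A but got %A" role expected actual)

// Raw integer columns are rejected...
let rawLabel = checkColumn "Label" SingleScalar Int32Scalar
let rawFeatures = checkColumn "Features" SingleVector Int32Scalar

// ...while explicitly converted columns pass.
let convertedLabel = checkColumn "Label" SingleScalar SingleScalar
let convertedFeatures = checkColumn "Features" SingleVector SingleVector
```

In this sketch the check reports the mismatch and leaves the fix to the caller, which is why the original pipeline needs the explicit ConvertType and Concatenate steps.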

Suggestion

We should hide these details from the user as this would make the pipeline easier to load and simplify a user's pipeline. Taking the example above, if you were to remove the conversion steps, it would look something like this:

   let trainer = mlContext.Regression.Trainers.StochasticDualCoordinateAscent("Area", "Price")

cc @glebuk for any additional input

@TomFinley
Contributor

TomFinley commented Mar 22, 2019

"We should hide these details from the user as this would make the pipeline easier to load and simplify a user's pipeline."

This is not, I think, correct. Brevity and simplicity are not the same thing. An API that is predictable is an API that has the "right" simplicity, and the best way to be predictable is to do what the user tells us to do. An API with loads of implicit "helpful" behavior is an API that is ultimately impossible to predict in any meaningful fashion, and so impossible to understand.

We have a long history of people trying to add helpful, "harmless" operations for the user -- whether it's type conversion, auto-caching, auto-normalization, or auto-calibration. I even had to write a section about it. It, of course, ultimately blows up in our faces, again and again. I even wrote a section on type conversion specifically, as being especially insidious.

https://github.com/dotnet/machinelearning/blob/db4ecc0135baffc8930201a85fb3e9a101cdfa46/docs/code/IDataViewImplementation.md#getters-must-fail-for-invalid-types
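One way to illustrate the predictability argument is to compare a getter that fails on a type mismatch with one that converts silently. The sketch below is an illustration made up for this thread, not code from the linked document: `Column`, `getSingles`, and `getSinglesLenient` are hypothetical names, and the precision-loss case is just one example of how an implicit conversion can change data without the caller noticing.

```fsharp
// Hypothetical stand-in for a typed column; this is not the IDataView API.
type Column =
    | Int64Column of int64[]
    | SingleColumn of float32[]

/// Strict getter: returns float32 data only when the column really holds
/// float32 values, and fails loudly otherwise.
let getSingles (column: Column) : float32[] =
    match column with
    | SingleColumn values -> values
    | Int64Column _ ->
        invalidArg "column" "expected Single values but the column holds Int64"

/// "Helpful" getter: converts silently, so the caller never learns that
/// the data was changed on the way through.
let getSinglesLenient (column: Column) : float32[] =
    match column with
    | SingleColumn values -> values
    | Int64Column values -> values |> Array.map float32

// 16777217 = 2^24 + 1 cannot be represented exactly as a 32-bit float.
let ids = Int64Column [| 16777216L; 16777217L |]
let converted = getSinglesLenient ids

// Both entries now compare equal: two distinct inputs have collapsed into one.
let collapsed = (converted.[0] = converted.[1])
```

With the strict getter, the same input raises an `ArgumentException`, so the user has to decide explicitly whether a lossy conversion is acceptable.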

Close?

@isaacabraham

You can solve this by having the trainer itself take in data directly rather than a data view. Then you can type the method signature to make clear what the inputs for label and features need to be.
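A minimal sketch of that idea, assuming a toy trainer that just predicts the mean label: `Example` and `trainMeanRegressor` are hypothetical names for illustration only and do not exist in ML.NET. The signature is the point here, since it states the required types and leaves the conversion to the caller.

```fsharp
// Hypothetical typed training example; the record and function names are
// illustrative and do not exist in ML.NET.
type Example = { Label: float32; Features: float32[] }

/// Toy "trainer" that predicts the mean label. The signature only accepts
/// float32 labels and float32[] features, so a mismatch is a compile error.
let trainMeanRegressor (examples: Example list) : float32[] -> float32 =
    let mean = examples |> List.averageBy (fun e -> e.Label)
    fun _features -> mean

// Integer source data must be converted by the caller, visibly, before training.
let rawRows = [ (100, 50); (200, 70) ]   // (Price, Area) as ints

let examples =
    rawRows
    |> List.map (fun (price, area) ->
        { Label = float32 area; Features = [| float32 price |] })

let model = trainMeanRegressor examples
let prediction = model [| 150.0f |]
```

Passing `rawRows` directly to `trainMeanRegressor` would not compile, so the type requirement surfaces at build time rather than as a runtime exception.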

@najeeb-kazmi
Member

Closing as per @TomFinley's explanation.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022

4 participants