Description
Issue
When providing data to a pipeline, we expect the user to know the data type the trainer is expecting. This is a painful experience for end users: they not only need to know which data types to convert to, but also have to add more steps to their pipeline to accommodate the conversion.
The example from #3037 demonstrates this issue: the pipeline takes integer values for the Label and Features and passes them to the SDCA trainer. Because the data is integer-based, the pipeline uses ConvertType to convert from int to float, followed by a Concatenate to generate a vector type (note this is in F#, but the same applies to C#):
let mlContext = MLContext()
EstimatorChain()
    // Convert the integer Price and Area columns to float (Single).
    .Append(mlContext.Transforms.Conversion.ConvertType("Features", "Price", DataKind.Single))
    .Append(mlContext.Transforms.Conversion.ConvertType("Label", "Area", DataKind.Single))
    // Turn the scalar Features column into a vector of floats.
    .Append(mlContext.Transforms.Concatenate("Features", "Features"))
    .AppendCacheCheckpoint(mlContext)
    .Append(mlContext.Regression.Trainers.StochasticDualCoordinateAscent("Label", "Features"))
, mlContext
Without these conversions, the user hits an exception saying that Label is expected to be of type float, followed by one saying that Features is expected to be a vector of floats.
Suggestion
We should hide these details from the user, as this would make it easier to load data into a pipeline and would simplify the pipeline itself. Taking the example above, if you were to remove the conversion steps, the trainer would look something like this:
let trainer = mlContext.Regression.Trainers.StochasticDualCoordinateAscent("Area", "Price")
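To make the idea a bit more concrete, here is a rough sketch of the decision the trainer could make on the user's behalf. Everything in it is hypothetical: ColumnType, Step, planLabel and planFeatures are illustrative names and not part of ML.NET. It only models which conversion steps would be needed for a given input column type, assuming the trainer wants a scalar float Label and a vector-of-float Features column.

```fsharp
// Hypothetical model of column types; these are NOT ML.NET types.
type ColumnType =
    | Scalar of kind: string   // e.g. Scalar "Int32" or Scalar "Single"
    | Vector of kind: string   // e.g. Vector "Single"

// Hypothetical description of a step the trainer could insert for the user.
type Step =
    | ConvertToSingle of column: string
    | ConcatenateToVector of column: string

/// Steps needed to turn a label column into a scalar Single.
let planLabel (name: string) (columnType: ColumnType) : Step list =
    match columnType with
    | Scalar "Single" -> []                      // already the expected type
    | Scalar _ -> [ ConvertToSingle name ]       // e.g. int -> float
    | Vector _ -> failwithf "Label column '%s' must be a scalar" name

/// Steps needed to turn a feature column into a vector of Single.
let planFeatures (name: string) (columnType: ColumnType) : Step list =
    match columnType with
    | Vector "Single" -> []                      // already the expected type
    | Vector _ -> [ ConvertToSingle name ]
    | Scalar "Single" -> [ ConcatenateToVector name ]
    | Scalar _ -> [ ConvertToSingle name; ConcatenateToVector name ]

// Integer Area (label) and Price (features), as in the example above.
let steps =
    planLabel "Area" (Scalar "Int32") @ planFeatures "Price" (Scalar "Int32")

printfn "%A" steps
```

For the integer Area and Price columns, the plan contains the same ConvertType and Concatenate steps the user currently has to write by hand, which is the work this issue proposes moving out of the user's pipeline.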
cc @glebuk for any additional input