Description
Working with the latest F# 4.5 and .NET Standard, I'm having huge problems trying to do even the most basic explorations with the latest ML .NET. Is there a sample showing an absolutely basic use case: an in-memory dataset and a simple ML algorithm?
I'm talking about something as simple as an example from, e.g., scikit-learn. The following hello world is 7 lines of code, and if you leave out the data loading and just focus on the ML side of things - which is exactly what I want to do - it comes down to these three lines of code.
model = linear_model.LinearRegression()
model.fit(sqfeet, price)
model.predict( pd.DataFrame([1750]))
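For reference, here is a minimal sketch of what those three lines compute for a single feature, written in plain Python so it runs without scikit-learn installed. It uses the closed-form least-squares solution, which is an assumption on my part rather than a description of scikit-learn's internals, and the `fit` and `predict` helpers are names made up for illustration. The numbers are the same ten area/price observations given as an F# list further down.

```python
# Ordinary least squares for one feature, in closed form.
# `fit` and `predict` are illustrative helpers, not scikit-learn functions.

sqfeet = [1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000]
price = [119000, 126000, 133000, 150000, 161000,
         163000, 169000, 182000, 201000, 209000]


def fit(xs, ys):
    """Return (slope, intercept) minimising squared error of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalised covariance of x and y, and unnormalised variance of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Evaluate the fitted line at x."""
    slope, intercept = model
    return slope * x + intercept


model = fit(sqfeet, price)
print(round(predict(model, 1750), 2))  # 181166.67
```

Note that plain ints go in and a number comes out; no explicit type conversion is asked of the caller.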
Let's try and port this into F#. Here's the source data as a simple F# list.
type Observation = { Area:int; Price:int }
let data =
    [ { Area = 1100; Price = 119000 }
      { Area = 1200; Price = 126000 }
      { Area = 1300; Price = 133000 }
      { Area = 1400; Price = 150000 }
      { Area = 1500; Price = 161000 }
      { Area = 1600; Price = 163000 }
      { Area = 1700; Price = 169000 }
      { Area = 1800; Price = 182000 }
      { Area = 1900; Price = 201000 }
      { Area = 2000; Price = 209000 } ]
I've spent a good few hours fighting with the API to try and get some - any - results. I can't figure it out.
Issues I've encountered:
- Discoverability. The API is pretty large and not (in my personal opinion) easy to navigate your way around. The namespaces need to be reworked so that the most obvious types are easy and obvious to get to.
- F# scripts are a pain because of the "occasional" reliance on native DLLs. However, you can work around this (or fall back to console applications if needed).
- Error messages are painful - I4, R4 etc. etc. - most people will not know what these are.
- Vector types - it seems that in order to "use" data with a trainer you need to "convert" data from e.g. float32 into a "vector" of float32. There's no explanation of what a "vector" in the context of ML .NET is, nor how to create one. Is it a .NET type? How do I create it? More than that, why as a developer should I have to care about it? I just want to give some of my data to the library as quickly and easily as possible.
- Why do I need to convert from ints or floats into float32s to do some machine learning? Again, this raises the barrier to entry. This is an internal implementation detail of ML .NET, it's nothing that should be forced on the developer.
- Why do I need the MLContext? What does it do? Does it store some "hidden state"? What? Why?
I managed to overcome some issues by randomly fumbling around with some existing samples until I got something that seemed to not error any more:
let estimator, mlContext =
    let mlContext = MLContext(Nullable 1)
    let trainer = mlContext.Regression.Trainers.StochasticDualCoordinateAscent(DefaultColumnNames.Label, "Features")
    EstimatorChain()
        .Append(mlContext.Transforms.Conversion.ConvertType(Transforms.TypeConvertingEstimator.ColumnOptions("ConvertedArea", DataKind.Single, "Area")))
        .Append(mlContext.Transforms.CopyColumns(DefaultColumnNames.Label, "ConvertedArea"))
        .Append(mlContext.Transforms.Conversion.ConvertType(Transforms.TypeConvertingEstimator.ColumnOptions("ConvertedPrice", DataKind.Single, "Price")))
        .Append(mlContext.Transforms.Concatenate("Features", "ConvertedPrice"))
        .AppendCacheCheckpoint(mlContext)
        .Append(trainer), mlContext
Next, I try to fit my data to this model:
let dv = mlContext.Data.LoadFromEnumerable(data)
let trained = estimator.Fit(dv)
This returns, but then calls to CreatePredictionEngine fail with the error System.ArgumentOutOfRangeException: Could not find input column 'Area':
type PredictionInput = { Price : int }
[<CLIMutable>]
type PredictionOutput = { Area : int }
let z = trained.CreatePredictionEngine<PredictionInput, PredictionOutput>(mlContext)
z.Predict { Price = 1000 }
To get to this stage has taken 4-8 hours of effort (including spending 30-45 minutes with your team personally :-)). I don't consider myself a complete beginner when it comes to .NET / F# / machine learning - if it takes this long to get up and running, most people will simply not bother and go to scikit-learn, breeze or whatever else is out there.
I would love to see a simple API that looked something like this:
let model = Trainers.Regression.StochasticDualCoordinateAscent.fit(data, "Area", "Price")
let prediction = model.Predict(1234)
or
let model = Trainers.Regression.StochasticDualCoordinateAscent.fit(data, (fun d -> d.Area), (fun d -> d.Price))
let prediction = model.Predict(1234)
etc. etc.
I get that there are more complicated scenarios - but I feel that this library should really be starting from the lowest common denominator and working from there. At the moment it seems to be the other way around.