Getting started with ML .NET with in-memory data is *painful*. #3037

Closed
@isaacabraham

Working with the latest F# 4.5 and .NET Standard, I'm having huge problems trying to do even the most basic explorations with the latest ML .NET. Is there an absolutely basic example for an in-memory dataset using a simple ML algorithm?

I'm talking about something as simple as an example from e.g. scikit-learn: the following hello world is 7 lines of code, and if you leave out the data loading side of things and just focus on the ML side of things - which is exactly what I want to do - it's the following three lines of code.

model = linear_model.LinearRegression()
model.fit(sqfeet, price)
model.predict( pd.DataFrame([1750]))

Let's try and port this into F#. Here's the source data as a simple F# list.

type Observation = { Area:int; Price:int }
let data =
    [ { Area = 1100; Price = 119000 }
      { Area = 1200; Price = 126000 }
      { Area = 1300; Price = 133000 }
      { Area = 1400; Price = 150000 }
      { Area = 1500; Price = 161000 }
      { Area = 1600; Price = 163000 }
      { Area = 1700; Price = 169000 }
      { Area = 1800; Price = 182000 }
      { Area = 1900; Price = 201000 }
      { Area = 2000; Price = 209000 } ]

I've spent a good few hours fighting with the API to try and get some - any - results. I can't figure it out.

Issues I've encountered:

  1. Discoverability. The API is pretty large and not (in my personal opinion) easy to navigate your way around. The namespaces need to be reworked so that the most obvious types are easy and obvious to get to.
  2. F# scripts are a pain because of the "occasional" reliance on native DLLs. However, you can work around this (or fall back to console applications if needed).
  3. Error messages are painful - I4, R4 etc. etc. - most people will not know what these are.
  4. Vector types - it seems that in order to "use" data with a trainer you need to "convert" data from e.g. float32 into a "vector" of float32. There's no explanation of what a "vector" in the context of ML .NET is, nor how to create one. Is it a .NET type? How do I create it? More than that, why as a developer should I have to care about it? I just want to give some of my data to the library as quickly and easily as possible.
  5. Why do I need to convert from ints or floats into float32s to do some machine learning? Again, this raises the barrier to entry. This is an internal implementation detail of ML .NET, it's nothing that should be forced on the developer.
  6. Why do I need the MLContext? What does it do? Does it store some "hidden state"? What? Why?
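
For readers hitting items 3 and 5, a small sketch in plain F# (no ML .NET calls) may make the vocabulary more concrete. It assumes that the short codes in the error messages are data-kind abbreviations (I4 for a 32-bit integer, R4 for a 32-bit float, R8 for a 64-bit float), and that converting the records to float32 up front is an acceptable alternative to conversion steps inside the pipeline. `ObservationF32`, `describeKind` and `toFloat32` are hypothetical names introduced here for illustration; none of them is part of ML .NET.

```fsharp
// Plain F# only; nothing in this sketch calls into ML .NET.
type Observation = { Area: int; Price: int }

// Hypothetical mirror of Observation whose fields are already float32.
[<CLIMutable>]
type ObservationF32 = { Area: float32; Price: float32 }

// Assumed meaning of the kind codes that show up in error messages.
let describeKind (code: string) =
    match code with
    | "I4" -> "32-bit signed integer (F# int)"
    | "R4" -> "32-bit float (F# float32)"
    | "R8" -> "64-bit float (F# float)"
    | other -> sprintf "unrecognised kind code '%s'" other

// Convert one record by hand instead of inside the pipeline.
// Field names are qualified because both record types share them.
let toFloat32 (o: Observation) : ObservationF32 =
    { ObservationF32.Area = float32 o.Area
      Price = float32 o.Price }

let sample = { Observation.Area = 1100; Price = 119000 }
let converted = toFloat32 sample

printfn "%A" converted
printfn "%s" (describeKind "R4")
```

Mapping a whole list is then `data |> List.map toFloat32`. Whether this removes the need for the type-conversion steps depends on the ML .NET version in use, so treat it as a starting point rather than a fix.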

I managed to overcome some issues by randomly fumbling around with some existing samples until I got something that seemed to not error any more:

let estimator, mlContext =
    let mlContext = MLContext(Nullable 1)

    let trainer = mlContext.Regression.Trainers.StochasticDualCoordinateAscent(DefaultColumnNames.Label, "Features")

    EstimatorChain()
        .Append(mlContext.Transforms.Conversion.ConvertType(Transforms.TypeConvertingEstimator.ColumnOptions("ConvertedArea", DataKind.Single, "Area")))
        .Append(mlContext.Transforms.CopyColumns(DefaultColumnNames.Label, "ConvertedArea"))
        .Append(mlContext.Transforms.Conversion.ConvertType(Transforms.TypeConvertingEstimator.ColumnOptions("ConvertedPrice", DataKind.Single, "Price")))
        .Append(mlContext.Transforms.Concatenate("Features", "ConvertedPrice"))
        .AppendCacheCheckpoint(mlContext)
        .Append(trainer), mlContext

Next, I try to fit my data to this model:

let dv = mlContext.Data.LoadFromEnumerable(data)
let trained = estimator.Fit(dv)

This returns, but then calls to CreatePredictionEngine fail with the error System.ArgumentOutOfRangeException: Could not find input column 'Area':

type PredictionInput = { Price : int }
[<CLIMutable>]
type PredictionOutput = { Area : int }

let z = trained.CreatePredictionEngine<PredictionInput, PredictionOutput>(mlContext)

z.Predict { Price = 1000 }

Getting to this stage has taken 4-8 hours of effort (including spending 30-45 minutes with your team personally :-)). I don't consider myself a complete beginner when it comes to .NET / F# / machine learning - if it takes this long to get up and running, most people will simply not bother and go to scikit-learn, breeze or whatever else is out there.

I would love to see a simple API that looked something like this:

let model = Trainers.Regression.StochasticDualCoordinateAscent.fit(data, "Area", "Price")
let prediction = model.Predict(1234)

or

let model = Trainers.Regression.StochasticDualCoordinateAscent.fit(data, (fun d -> d.Area), (fun d -> d.Price))
let prediction = model.Predict(1234)

etc. etc.
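
To make the shape of that wish concrete, here is a minimal sketch of a selector-based `fit` in plain F#. It is not ML .NET's API and it does not use the SDCA trainer: `SimpleRegression.fit` and `LinearModel` are hypothetical names, and the fitting is closed-form ordinary least squares on a single feature, swapped in only so the example is self-contained.

```fsharp
// Hypothetical helper, NOT part of ML .NET. Ordinary least squares on one
// feature stands in for the SDCA trainer mentioned above.
type Observation = { Area: int; Price: int }

type LinearModel =
    { Slope: float
      Intercept: float }
    member m.Predict(x: float) = m.Intercept + m.Slope * x

module SimpleRegression =
    /// Fits y = intercept + slope * x; the selectors pick feature and label.
    /// Assumes a non-empty list with at least two distinct feature values.
    let fit (feature: 'T -> float) (label: 'T -> float) (rows: 'T list) : LinearModel =
        let xs = rows |> List.map feature
        let ys = rows |> List.map label
        let meanX = List.average xs
        let meanY = List.average ys
        // Covariance and variance sums (unnormalised; the factor cancels).
        let sxy = List.map2 (fun x y -> (x - meanX) * (y - meanY)) xs ys |> List.sum
        let sxx = xs |> List.sumBy (fun x -> (x - meanX) ** 2.0)
        let slope = sxy / sxx
        { Slope = slope; Intercept = meanY - slope * meanX }

let data =
    [ { Area = 1100; Price = 119000 }
      { Area = 1200; Price = 126000 }
      { Area = 1300; Price = 133000 }
      { Area = 1400; Price = 150000 }
      { Area = 1500; Price = 161000 }
      { Area = 1600; Price = 163000 }
      { Area = 1700; Price = 169000 }
      { Area = 1800; Price = 182000 }
      { Area = 1900; Price = 201000 }
      { Area = 2000; Price = 209000 } ]

// Two selectors and the data: that is the whole call site.
let model =
    data |> SimpleRegression.fit (fun d -> float d.Area) (fun d -> float d.Price)

let prediction = model.Predict(1750.0)
printfn "%.2f" prediction // prints 181166.67
```

The point of the sketch is the call site: the caller hands over an ordinary F# list and two lambdas, and the int-to-float conversion happens inside the selectors rather than as separate pipeline steps.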

I get that there are more complicated scenarios - but I feel that this library should really be starting from the lowest common denominator and working from there. At the moment it seems to be the other way around.

Labels: API (Issues pertaining the friendly API), F# (Support of F# language)