Getting started with ML .NET with in-memory data is *painful*.

Working with F# 4.5 and .NET Standard, I'm having huge problems doing even the most basic exploration with the latest ML .NET. Is there a sample anywhere that shows an absolutely basic in-memory dataset being used with a simple ML algorithm?

I'm talking about something as simple as this [scikit-learn](https://mcalglobal.com/2018/02/22/machine-learning-hello-world-using-python/) example. That hello world is 7 lines of code, and if you leave out the data loading and focus only on the ML side of things - which is exactly what I want to do - it comes down to the following *three lines of code*.

```python
model = linear_model.LinearRegression()
model.fit(sqfeet, price)
model.predict( pd.DataFrame([1750]))
```

Let's try to port this to F#. Here's the source data as a simple F# list.

```fsharp
type Observation = { Area:int; Price:int }
let data =
    [ { Area = 1100; Price = 119000 }
      { Area = 1200; Price = 126000 }
      { Area = 1300; Price = 133000 }
      { Area = 1400; Price = 150000 }
      { Area = 1500; Price = 161000 }
      { Area = 1600; Price = 163000 }
      { Area = 1700; Price = 169000 }
      { Area = 1800; Price = 182000 }
      { Area = 1900; Price = 201000 }
      { Area = 2000; Price = 209000 } ]
```

I've spent a good few hours fighting with the API to try and get some - any - results. **I can't figure it out**.

Issues I've encountered:

1. Discoverability. The API is pretty large and not (in my personal opinion) easy to navigate your way around. The namespaces need to be reworked so that the most obvious types are easy and obvious to get to.
1. F# scripts are a pain because of the "occasional" reliance on native DLLs. However, you can work around this (or fall back to console applications if needed).
1. Error messages are painful - `I4`, `R4` etc. etc. - most people will not know what these are.
1. Vector types - it seems that in order to "use" data with a trainer you need to "convert" it from e.g. `float32` into a "vector" of `float32`. There's no explanation of what a "vector" is in the context of ML .NET. Is it a .NET type? How do I create one? More than that, why should I as a developer have to care about it? I just want to give some of my data to the library as quickly and easily as possible.
1. Why do I need to convert from ints or floats into float32s to do some machine learning? Again, this raises the barrier to entry. This is an internal implementation detail of ML .NET, it's nothing that should be forced on the developer.
1. Why do I need the `MLContext`? What does it do? Does it store some "hidden state"? What? Why?
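
To show what points 4 and 5 amount to in practice, here is the conversion done by hand in plain F#, with no ML .NET types involved. This is only a sketch: it assumes a "vector" is nothing more than a fixed-length array of `float32`, which is my guess rather than anything I found documented. `ObservationF32` and `toFloat32` are names I made up for illustration, and I've picked `Area` as the feature and `Price` as the label to match the scikit-learn example.

```fsharp
type Observation = { Area: int; Price: int }

// Hypothetical float32 mirror of Observation.
// Assumption: a "vector" is just a fixed-length float32 array (here of length 1).
type ObservationF32 = { Features: float32[]; Label: float32 }

// Convert one int-based row into the float32 shape.
let toFloat32 (o: Observation) : ObservationF32 =
    { Features = [| float32 o.Area |]
      Label = float32 o.Price }

let converted =
    [ { Area = 1100; Price = 119000 }
      { Area = 1200; Price = 126000 } ]
    |> List.map toFloat32
```

It's four lines of mapping, but it's four lines I'd rather the library did for me.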

I managed to overcome some issues by randomly fumbling around with some existing samples until I got something that seemed to not error any more:

```fsharp
let estimator, mlContext =
    let mlContext = MLContext(Nullable 1)

    let trainer = mlContext.Regression.Trainers.StochasticDualCoordinateAscent(DefaultColumnNames.Label, "Features")

    EstimatorChain()
        .Append(mlContext.Transforms.Conversion.ConvertType(Transforms.TypeConvertingEstimator.ColumnOptions("ConvertedArea", DataKind.Single, "Area")))
        .Append(mlContext.Transforms.CopyColumns(DefaultColumnNames.Label, "ConvertedArea"))
        .Append(mlContext.Transforms.Conversion.ConvertType(Transforms.TypeConvertingEstimator.ColumnOptions("ConvertedPrice", DataKind.Single, "Price")))
        .Append(mlContext.Transforms.Concatenate("Features", "ConvertedPrice"))
        .AppendCacheCheckpoint(mlContext)
        .Append(trainer), mlContext
```

Next, I try to fit this model to my data:

```fsharp
let dv = mlContext.Data.LoadFromEnumerable(data)
let trained = estimator.Fit(dv)
```

This returns, but then calls to `CreatePredictionEngine` fail with the error `System.ArgumentOutOfRangeException: Could not find input column 'Area'`:

```fsharp
type PredictionInput = { Price : int }
[<CLIMutable>]
type PredictionOutput = { Area : int }

let z = trained.CreatePredictionEngine<PredictionInput, PredictionOutput>(mlContext)

z.Predict { Price = 1000 }
```

Getting to this stage has taken 4-8 hours of effort (including spending 30-45 minutes with your team personally :-)). I don't consider myself a complete beginner when it comes to .NET / F# / machine learning - if it takes this long to get up and running, most people will simply not bother and will go to scikit-learn, breeze or whatever else is out there.

I would love to see a *simple* API that looked something like this:

```fsharp
let model = Trainers.Regression.StochasticDualCoordinateAscent.fit(data, "Area", "Price")
let prediction = model.Predict(1234)
```

or

```fsharp
let model = Trainers.Regression.StochasticDualCoordinateAscent.fit(data, (fun d -> d.Area), (fun d -> d.Price))
let prediction = model.Predict(1234)
```

etc. etc.
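
To make the second shape concrete, here is a rough sketch of the calling experience I have in mind, written in plain F# with no ML .NET dependency. Everything in it is hypothetical: `SimpleRegression.fit` and `LinearModel` are made-up names, and the fit is a closed-form ordinary least squares line, not stochastic dual coordinate ascent. The point is the shape of the API, not the algorithm.

```fsharp
type Observation = { Area: int; Price: int }

let data =
    [ { Area = 1100; Price = 119000 }
      { Area = 1200; Price = 126000 }
      { Area = 1300; Price = 133000 }
      { Area = 1400; Price = 150000 }
      { Area = 1500; Price = 161000 }
      { Area = 1600; Price = 163000 }
      { Area = 1700; Price = 169000 }
      { Area = 1800; Price = 182000 }
      { Area = 1900; Price = 201000 }
      { Area = 2000; Price = 209000 } ]

/// Hypothetical model type: a fitted line y = Intercept + Slope * x.
type LinearModel =
    { Intercept: float
      Slope: float }
    member m.Predict(x: float) = m.Intercept + m.Slope * x

module SimpleRegression =
    /// Fits y = a + b * x by ordinary least squares (closed form).
    /// This stands in for a real trainer; it is not what ML .NET does.
    let fit (feature: 'T -> float) (label: 'T -> float) (rows: 'T list) : LinearModel =
        let xs = rows |> List.map feature
        let ys = rows |> List.map label
        let meanX = List.average xs
        let meanY = List.average ys
        let sxy = List.zip xs ys |> List.sumBy (fun (x, y) -> (x - meanX) * (y - meanY))
        let sxx = xs |> List.sumBy (fun x -> (x - meanX) * (x - meanX))
        let slope = sxy / sxx
        { Intercept = meanY - slope * meanX; Slope = slope }

// The int -> float conversion happens inside the selectors, not in a pipeline.
let model =
    data |> SimpleRegression.fit (fun d -> float d.Area) (fun d -> float d.Price)

let prediction = model.Predict(1750.0)
printfn "%.2f" prediction // prints 181166.67 for this data
```

No context object, no column names, no vector types: the selectors say which field is the feature and which is the label, and that's all the caller has to know.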

I get that there are more complicated scenarios - but I feel that this library should really be starting from the lowest common denominator and working from there. At the moment it seems to be the other way around.

Getting started with ML .NET with in-memory data is painful. #3037
