Getting started with ML .NET with in-memory data is *painful*. #3037
Hi @isaacabraham, Thank you very much for your feedback, this is all useful information that is good to hear. I am sorry to hear about the frustrations working with ML.Net in F#. Ideally this is not the experience we want users to undergo to learn and use the library. First, here is the code to unblock your scenario:

    open System
    open Microsoft.ML
    open Microsoft.ML.Data

    [<CLIMutable>]
    type Prediction = {
        [<ColumnName("Score")>] Area:single
    }

    type Observation = { Area:int; Price:int}

    let data =
        [ { Area = 1100; Price = 119000 }
          { Area = 1200; Price = 126000 }
          { Area = 1300; Price = 133000 }
          { Area = 1400; Price = 150000 }
          { Area = 1500; Price = 161000 }
          { Area = 1600; Price = 163000 }
          { Area = 1700; Price = 169000 }
          { Area = 1800; Price = 182000 }
          { Area = 1900; Price = 201000 }
          { Area = 2000; Price = 209000 } ]

    [<EntryPoint>]
    let main argv =
        let estimator, mlContext =
            let mlContext = MLContext()
            EstimatorChain()
                .Append(mlContext.Transforms.Conversion.ConvertType("Features", "Price", DataKind.Single))
                .Append(mlContext.Transforms.Conversion.ConvertType("Label", "Area", DataKind.Single))
                .Append(mlContext.Transforms.Concatenate("Features", "Features"))
                .AppendCacheCheckpoint(mlContext)
                .Append(mlContext.Regression.Trainers.StochasticDualCoordinateAscent("Label", "Features"))
            , mlContext
        let data1 = mlContext.Data.LoadFromEnumerable<Observation>(data)
        let transformer = estimator.Fit(data1)
        let predictor = mlContext.Model.CreatePredictionEngine(transformer)
        let prediction:Prediction = predictor.Predict({Area=0; Price = 209000})
        printf "Prediction results %f" prediction.Area
        0 // return an integer exit code

As you mentioned, the SDCA trainer expects the Label to be of type float and Features to be a vector of floats. Therefore we are having to convert. In order to get the vector of floats for Features, a For Prediction, we were able to use

For the issues you mentioned:
4 & 5) The fact that you have to convert and have an understanding of what the trainer is expecting (in addition to our vector type) is painful. Ideally the conversion should happen behind the scenes and not require the user to have knowledge about what a trainer is expecting. Having an automatic conversion would simplify the pipeline and get closer to having a simpler API. I filed issue #3060 to address this.
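To make the conversion concrete, here is a minimal sketch in plain F# of doing it on the in-memory records before they are loaded, so that the fields are already single (float32). The names ObservationF and toSingles are invented for this illustration, and whether the two ConvertType steps can then be dropped from the pipeline is an assumption that depends on the ML.NET version in use.

```fsharp
// Integer-typed record, as in the code above.
type Observation = { Area: int; Price: int }

// Hypothetical float32 variant (name invented for this sketch).
type ObservationF = { Area: single; Price: single }

let data : Observation list =
    [ { Observation.Area = 1100; Price = 119000 }
      { Observation.Area = 1200; Price = 126000 } ]

// Convert every field to single (System.Single) up front.
let toSingles (rows: Observation list) : ObservationF list =
    rows
    |> List.map (fun (o: Observation) ->
        { ObservationF.Area = single o.Area; Price = single o.Price })

let converted = toSingles data
printfn "%A" converted
```

This only moves the conversion out of the pipeline; the Features column would still need to be built as a vector, as the Concatenate step does above.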
In addition to making the API simpler there is the matter that it took many hours to get to a solution. Part of this can be resolved through examples (which I believe you found) that are located here: These do contain F# examples, but we also have simpler samples that are in C# only. These end up on the docs site: Would it be helpful if these examples were also written in F#?
Hi @singlis . Thanks for the really detailed reply. Let me address all your points:
As another example of "magic", in your solution (and I eventually figured this out myself), you call the prediction field "Score". There's no way to know this from the type system or the code - it's just "secret" knowledge that isn't clearly explained. At worst you should get an error message if you try to call predict without this field, or better yet this should be encoded into the type system somehow.
Regarding the samples - indeed, I went through several of them, both C# and F# (I come from a C# background so no problem there). But there wasn't an example that I could find that went through the absolute bare minimum as I wanted to do, and identified each individual step. I went through all the examples in the samples repo e.g. Iris, Taxi etc. etc. - but it was basically hit and miss until I randomly stumbled upon the right combination of transforms etc. to get all the way through.
Thinking about it, this issue isn't so much about F# as about getting up and running with the API - people coming from C# will have mostly the same issues as I've had, I believe. With that, I'm changing the title of this issue.
Thank you for the clarification @isaacabraham - We are always working to improve the API. Removing MLContext, IDataView, etc. is not something that can happen immediately but your feedback can impact future changes. There is simplicity in being able to pass the data directly to the trainer - but I also understand the value of having an explicit pipeline.

In the meantime, there are immediate actions that we can take to help the user with the learning curve of the API and hopefully reduce the amount of time it takes to get something working. One of these is better documentation and more examples, therefore I have created the following issues to help with this:

#3127 - to address knowing the input/output types of a transformer. This addresses the issue you mentioned with Scoring; this should not be secret knowledge and should have further explanation about it.

#3100 - to address missing F# examples. I am going to set up an initial folder structure where we can add F# examples. You are more than welcome to contribute to this; ideally I am trying to mirror what we have now in C#.

There are additional issues being filed to help with the structure of the documentation: to know what API to use, how it works, if it's a trainer what type of trainer, etc.

As for IDataView - IDataView is the basis for how we exchange data within our pipeline. Since it is an integral part of ML.Net, you will not be able to learn ML.Net without learning about IDataView. It would be like learning C# without learning about IEnumerable. We have extracted IDataView into its own assembly Microsoft.ML.DataView and it has no dependencies on ML.Net. The thought here is that it can be used for other purposes outside of ML.Net. For example, a graphing application could take in an IDataView and be able to plot a chart -- this data could come from ML.Net or some other library that implements IDataView.
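As a rough illustration of the IEnumerable analogy, the sketch below wraps an in-memory F# list in an IDataView, lists the columns of its schema, and reads the rows back as records. It is written as an F# script and assumes a 1.x release of the Microsoft.ML NuGet package, in which LoadFromEnumerable and CreateEnumerable are available on mlContext.Data; the exact calls may differ in the preview version discussed in this thread.

```fsharp
#r "nuget: Microsoft.ML"
open Microsoft.ML

// CLIMutable gives the record a default constructor and setters,
// which CreateEnumerable needs in order to materialise rows.
[<CLIMutable>]
type Observation = { Area: int; Price: int }

let mlContext = MLContext()

let rows =
    [ { Area = 1100; Price = 119000 }
      { Area = 1200; Price = 126000 } ]

// Wrap the in-memory list in an IDataView.
let view = mlContext.Data.LoadFromEnumerable<Observation>(rows)

// The schema describes each column's name and type.
for column in view.Schema do
    printfn "%s : %O" column.Name column.Type

// Read the rows back out of the IDataView as F# records.
let roundTripped =
    mlContext.Data.CreateEnumerable<Observation>(view, reuseRowObject = false)
    |> Seq.toList

printfn "%A" roundTripped
```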
Although more F# is always a good thing, in this case I'm not sure that this will necessarily solve the issue - there are quite a few that Don (and others, including myself) have done for the samples, and most people that write F# know C# already and can map the two across. The issues I've been encountering are shared across both C# and F# (and VB .NET) - it's the API itself that I think is the problem.

I get your point regarding IDataView - having a data framing library on .NET is a good thing (although there are already a couple out there such as Deedle...), and if you're fixed on making this a mandatory element of ML .NET - i.e. in order to use ML .NET, people have to know about IDataView - then I think you should ensure that things fall into the pit of success. By this I mean users not having to refer to reams of documentation to learn the API, but the API itself being self-explanatory - currently, there's simply (again, IMHO) too much knowledge required of the developer rather than the API guiding the user into doing the right thing through things like types (they can be really handy in cases like this :-)).

As an alternative, look at the scikit-learn example for what I mean by a simple API that is obvious - you can see from the example what is happening, there's no need for comments, and it's just a few lines.

Hopefully this isn't coming across as a rant, but rather as constructive feedback. I'm really excited by the idea of having a first-class ML library on .NET, and although it's not quite there yet I'm hopeful that ML .NET will get there soon.
Minor note: looks like this thread and #2726 reach the same conclusion about documentation.
Let me close this issue as most of our APIs have in-memory samples at https://github.com/dotnet/machinelearning/tree/master/docs/samples/Microsoft.ML.Samples/Dynamic. Feel free to reopen. :)
Working with the latest F# 4.5 and .NET Standard I'm having huge problems trying to do even the most basic explorations with the latest ML .NET. Is there any example showing an absolutely basic example for an in-memory dataset using a simple ML algorithm?
I'm talking about something as simple as an example from e.g. scikit-learn e.g. the following hello world is 7 lines of code, and if you leave out the data loading side of things and just focus on the ML side of things - which is exactly what I want to do - it's the following three lines of code.
Let's try and port this into F#. Here's the source data as a simple F# list.
I've spent a good few hours fighting with the API to try and get some - any - results. I can't figure it out.
Issues I've encountered:
I4, R4 etc. etc. - most people will not know what these are.
float32 into a "vector" of float32. There's no explanation of what a "vector" in the context of ML .NET is, nor how to create one. Is it a .NET type? How do I create it? More than that, why as a developer should I have to care about it? I just want to give some of my data to the library as quickly and easily as possible.
MLContext? What does it do? Does it store some "hidden state"? What? Why?
I managed to overcome some issues by randomly fumbling around with some existing samples until I got something that seemed to not error any more:
Next. I try to fit my data to this model:
This returns, but then calls to CreatePredictionEngine fail with the error System.ArgumentOutOfRangeException: Could not find input column 'Area':
To get to this stage has taken 4-8 hours of effort (including spending 30-45 minutes with your team personally :-)). I don't consider myself a complete beginner when it comes to .NET / F# / machine learning - if it takes this long to get up and running, most people will simply not bother and go to scikit-learn, breeze or whatever else is out there.
I would love to see a simple API that looked something like this:
or
etc. etc.
I get that there are more complicated scenarios - but I feel that this library should really be starting from the lowest common denominator and working from there. At the moment it seems to be the other way around.
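To give a feel for the "lowest common denominator" shape being asked for, here is a hypothetical sketch in plain F#: a fit function that takes the in-memory list directly and a predict function that takes a value. It is not ML .NET code and not the snippet originally posted in this issue; SimpleRegression, LinearModel, fit and predict are names invented for this illustration, and the model is an ordinary least-squares line rather than SDCA.

```fsharp
type Observation = { Area: int; Price: int }

// Hypothetical model type: a straight line y = Slope * x + Intercept.
type LinearModel = { Slope: float; Intercept: float }

module SimpleRegression =
    /// Fits a line by ordinary least squares, given functions that
    /// pick the feature (x) and the label (y) out of each row.
    let fit (x: 'T -> float) (y: 'T -> float) (rows: 'T list) : LinearModel =
        let xs = rows |> List.map x
        let ys = rows |> List.map y
        let meanX = List.average xs
        let meanY = List.average ys
        let sxy =
            List.zip xs ys
            |> List.sumBy (fun (a, b) -> (a - meanX) * (b - meanY))
        let sxx = xs |> List.sumBy (fun a -> (a - meanX) * (a - meanX))
        let slope = sxy / sxx
        { Slope = slope; Intercept = meanY - slope * meanX }

    /// Evaluates the fitted line at a single feature value.
    let predict (model: LinearModel) (value: float) : float =
        model.Slope * value + model.Intercept

let data =
    [ { Area = 1100; Price = 119000 }
      { Area = 1200; Price = 126000 }
      { Area = 1300; Price = 133000 }
      { Area = 1400; Price = 150000 }
      { Area = 1500; Price = 161000 } ]

// The whole "ML side of things": fit, then predict.
let model =
    data
    |> SimpleRegression.fit
        (fun (o: Observation) -> float o.Area)
        (fun (o: Observation) -> float o.Price)

let predicted = SimpleRegression.predict model 1600.0
printfn "Predicted price for area 1600: %.0f" predicted
```

The point of the sketch is the call site: the record fields are named directly, so no column-name strings, type conversions, or vector types are visible to the caller.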