-
Notifications
You must be signed in to change notification settings - Fork 1.9k
ML Context to create them all #1252
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/// </summary> | ||
/// <param name="catalog">The catalog.</param> | ||
/// <param name="args">The arguments to text reader, describing the data schema.</param> | ||
/// <param name="dataSample">The optional data sample</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data sample, or dataSource?
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <param name="catalog">The catalog.</param> | ||
/// <param name="args">The arguments to text reader, describing the data schema.</param> | ||
/// <param name="dataSample">The optional data sample</param> | ||
/// <returns></returns> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[](start = 12, length = 19)
empty returns #Resolved
public void Save(ITransformer model, Stream stream) => model.SaveTo(Environment, stream); | ||
|
||
/// <summary> | ||
/// Load the model from the stream. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
model [](start = 21, length = 5)
If we are saving and loading an ITransformer, should we call the class TransformerOperationsCatalog? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the name should reflect the user-facing concept, rather than the implementation fact that 'trained model is a transformer'.
In reply to: 224950106 [](ancestors = 224950106)
@@ -403,7 +372,7 @@ internal MulticlassClassificationTrainers(MulticlassClassificationContext ctx) | |||
/// the top-K accuracy, that is, the accuracy assuming we consider an example with the correct class within | |||
/// the top-K values as being stored "correctly."</param> | |||
/// <returns>The evaluation results for these calibrated outputs.</returns> | |||
public MultiClassClassifierEvaluator.Result Evaluate(IDataView data, string label, string score = DefaultColumnNames.Score, | |||
public MultiClassClassifierEvaluator.Result Evaluate(IDataView data, string label = DefaultColumnNames.Label, string score = DefaultColumnNames.Score, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string label = DefaultColumnNames.Label [](start = 77, length = 39)
would assigning defaults here make it error prone. This allows the users to trip and provide a name for the label column in the estimators, and leave it to the default name here? We have not assigned defaults in the estimators. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, all the trainer extensions also have default names for Label
and Features
. This is why I made this change as well.
In reply to: 224950306 [](ancestors = 224950306)
/// </summary> | ||
/// <param name="catalog">The transform catalog</param> | ||
/// <param name="columnName">The column name</param> | ||
/// <param name="mode">The normalization mode</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The normalization mod [](start = 31, length = 21)
cref maybe #Resolved
namespace Microsoft.ML.Runtime | ||
{ | ||
/// <summary> | ||
/// Similar to training context, a transform context is an object serving as a 'catalog' of available transforms. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like catalog. #Resolved
/// <param name="weights">The optional example weights.</param> | ||
/// <param name="clustersCount">The number of clusters to use for KMeans.</param> | ||
/// <param name="advancedSettings">Algorithm advanced settings.</param> | ||
public static KMeansPlusPlusTrainer KMeans(this ClusteringContext.ClusteringTrainers ctx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KMeansPlusPlusTrainer [](start = 22, length = 21)
shall we add returns to all those new members, to distinguish them from the other extensions in the documentation? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using Microsoft.ML.StaticPipe.Runtime; | ||
using System; | ||
|
||
namespace Microsoft.ML.Trainers | ||
namespace Microsoft.ML.StaticPipe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StaticPipe [](start = 23, length = 10)
why change namespace? I believe we decided to keep the trainers inside Microsoft.ML.Trainers.
The methods that return trainers iside the new catalog classes are being placed under Microsoft.ML.
@GalOshri what do you think? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this sweeping change, because otherwise we are seeing 2 methods per trainer in the intellisense by default.
I think for now we want users to 'opt in' to the static API by using StaticPipe
, and have the dynamic API available right off the bat
In reply to: 224950522 [](ancestors = 224950522)
/// <param name="conc">Concurrency level. Set to 1 to run single-threaded. Set to 0 to pick automatically.</param> | ||
public MLContext(int? seed = null, int conc = 0) | ||
{ | ||
_env = new LocalEnvironment(seed, conc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have assumed that we just rename LocalEnvironment
to MLContext
. Why do we have both? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably not bundle this change right now, but I'll write a REVIEW and a GitHib issue
In reply to: 225262172 [](ancestors = 225262172,225250207)
/// <summary> | ||
/// Data processing operations. | ||
/// </summary> | ||
public TransformsCatalog Transform { get; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know the other properties are not plural, but for some reason the singular Transform
sticks out to me. It feels like it should be Transforms
. Even the class name TransformsCatalog
has plural. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I like it the way it is. Maybe other commenters can swing the vote?
In reply to: 225254551 [](ancestors = 225254551)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has to be in one of the folders :) It is mostly used for composing pipelines, so I picked In reply to: 429941509 [](ancestors = 429941509) Refers to: src/Microsoft.ML.Data/Training/MLContext.cs:1 in 26913ac. [](commit_id = 26913ac, deletion_comment = False) |
What's wrong with the folder: https://github.com/dotnet/machinelearning/tree/master/src/Microsoft.ML.Data ? In general, the .NET Core team follows folders-namespace alignment, so it is easier to find files. Since this type is in the root of the namespace: In reply to: 429953480 [](ancestors = 429953480,429941509) Refers to: src/Microsoft.ML.Data/Training/MLContext.cs:1 in 26913ac. [](commit_id = 26913ac, deletion_comment = False) |
docs/code/MlNetCookBook.md
Outdated
@@ -293,11 +288,11 @@ In the file above, the last column (12th) is label that we predict, and all the | |||
```csharp | |||
// Create a new environment for ML.NET operations. It can be used for exception tracking and logging, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
environment [](start = 16, length = 11)
Would be a good idea to replace with the above comments:
// Create a new context for ML.NET operations. It can be used for exception tracking and logging,
// as a catalog of available operations and as the source of randomness. #Resolved
docs/code/MlNetCookBook.md
Outdated
@@ -449,13 +437,13 @@ The prediction code now looks as follows: | |||
```csharp | |||
// Create a new environment for ML.NET operations. It can be used for exception tracking and logging, | |||
// as well as the source of randomness. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, would be a good idea to replace with the above comments:
// Create a new context for ML.NET operations. It can be used for exception tracking and logging,
// as a catalog of available operations and as the source of randomness. #Resolved
docs/code/MlNetCookBook.md
Outdated
@@ -797,10 +776,10 @@ Sentiment SentimentText | |||
```csharp | |||
// Create a new environment for ML.NET operations. It can be used for exception tracking and logging, | |||
// as well as the source of randomness. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be a good idea to replace with the above comments:
// Create a new context for ML.NET operations. It can be used for exception tracking and logging,
// as a catalog of available operations and as the source of randomness. #Resolved
@@ -345,7 +337,7 @@ Assuming the example above was used to train the model, here's how you calculate | |||
var testData = reader.Read(new MultiFileSource(testDataPath)); | |||
// Calculate metrics of the model on the test data. | |||
// We are using the 'regression' context object here to perform evaluation. | |||
var metrics = regression.Evaluate(model.Transform(testData), label: r => r.Target, score: r => r.Prediction); | |||
var metrics = mlContext.Regression.Evaluate(model.Transform(testData), label: r => r.Target, score: r => r.Prediction); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this sentence still correct? #Resolved
/// </summary> | ||
public sealed class MLContext : IHostEnvironment | ||
{ | ||
// REVIEW: consider making LocalEnvironment and MLContext the same class instead of encapsulation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't fully understand our plan here with LocalEnvironment
and ConsoleEnvironment
. How could I ever use a ConsoleEnvironment
with this new MLContext
class?
And if the answer is "you can't - you'd use ConsoleEnvironment directly", I think that is a pretty poor answer because then my whole API paradigm changes. I can no longer use BinaryClassificationContext
, Transforms
, etc. from the MLContext
. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about the answer: "you can't use ConsoleEnvironment because it is internal and only used by MAML"?
In reply to: 225986842 [](ancestors = 225986842)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be fine with me. For now, I would think it's completely fine to have our custom console printing internal to the commandline tool
In reply to: 225989180 [](ancestors = 225989180,225986842)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#1284 #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
using (var stream = File.Create(modelPath)) | ||
{ | ||
// Saving and loading happens to 'dynamic' models, so the static typing is lost in the process. | ||
model.AsDynamic.SaveTo(env, stream); | ||
mlContext.Model.Save(model.AsDynamic, stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on making this so I don't have to say model.AsDynamic
? Can we open an issue to add that overload? #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new TextLoader.Column("Target", DataKind.R4, 10) | ||
}, | ||
// Default separator is tab, but we need a comma. | ||
s => s.Separator = ","); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This is probably not exactly related to MLContext
, but since you wrote this new test, I'll ask the question):
Do we like having s =>
here? Why isn't the separator just an argument in the TextReader()
method? #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern we want to have is that we add the most important options as ctor options, and the rest 'advanced' inside the delegate.
The other pattern is that we want the extensions to replicate the signatures of constructors, to avoid any mapping logic in strange places.
Because of that, I'd rather change the signature of TextLoader
ctor to add 'hasHeader' and 'separator', but we will need to do it in a separate PR.
Also 'separator' should not really be a string :)
In reply to: 226106086 [](ancestors = 226106086)
@@ -128,7 +128,7 @@ public Arguments() | |||
public static IDataView Create(IHostEnvironment env, IDataView input, string name, | |||
string source = null, OutputKind outputKind = CategoricalEstimator.Defaults.OutKind) | |||
{ | |||
return new CategoricalEstimator(env, name, source, outputKind).Fit(input).Transform(input) as IDataView; | |||
return new CategoricalEstimator(env, source, name, outputKind).Fit(input).Transform(input) as IDataView; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should consider renaming and re-ordering the parameters to this method as well. name
and source
don't really help say what they are. Also, if new CategoricalEstimator
has it one way, and CategoricalTransform.Create
has it the other way, that is going to lead to confusion. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using System; | ||
using System.Collections.Generic; | ||
|
||
namespace Microsoft.ML |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) Since this is in the Microsoft.ML
namespace, my preference would be for the file to not be under the Text
folder. I think namespaces and folders should align. #Resolved
/// <summary> | ||
/// FastTree <see cref="TrainContextBase"/> extension methods. | ||
/// </summary> | ||
public static partial class RegressionTrainers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There already is a Microsoft.ML.RegressionTrainers
class in the StandardLearners and LightGBM assemblies. We shouldn't have public types with the same namespace and name across multiple assemblies. It is only going to cause headaches for our users. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This applies to all the other extension classes as well.
In reply to: 226109180 [](ancestors = 226109180)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this was intentionally done by Senja for some documentation reasons?
In reply to: 226109272 [](ancestors = 226109272,226109180)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me keep it as is for now, and I'll sync with her and understand why it was done. For me, it would've been much easier to group by trainer, into FastTreeTrainerCatalog
or something
In reply to: 226109747 [](ancestors = 226109747,226109272,226109180)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me keep it as is for now
I don't know .... this is a pretty strong "don't" rule in general. If we are going to fix this very soon (like the next few days), I'm OK with it. But please log an issue. We can't ship with conflicting types across our assemblies. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <summary> | ||
/// Extensions for normalizer operations. | ||
/// </summary> | ||
public static class NormalizerCatalog |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the use cases for NormalizerCatalog
? I don't see any usages of it in this PR.
I see it is a static class, while all the other "catalog" classes are instance members of MLContext. Should it be designed the same as say TransformsCatalog
?
Also, I don't see any tests using this class. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I get it - this provides extension methods to the TransformsCatalog
. How about we name these "Extension" only classes differently then? It is confusing in my opinion that NormalizerCatalog
and TransformsCatalog
are named similarly, but are designed differently.
In reply to: 226110284 [](ancestors = 226110284)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a set of extensions to TransformsCatalog
. I guess I will need to rename it
In reply to: 226110284 [](ancestors = 226110284)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -1320,6 +1337,8 @@ public void Save(ModelSaveContext ctx) | |||
|
|||
public IDataView Read(IMultiStreamSource source) => new BoundLoader(this, source); | |||
|
|||
public IDataView Read(string path) => Read(new MultiFileSource(path)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixes #1098 .
Adds a couple extensions for transforms, and almost all trainers.
Adds text loading and saving, model loading.