Add DataFrameMapper feature metadata and y-value support. #54
Adds optional 'y_feature' to DataFrameMapper and 'extract_y' method, equivalent to 'transform', to extract y value arrays from input dataframe. Updates 'fit' method to accept optional 'y' values.
Adding extract_* interfaces to DataFrameMapper to support extraction of X and y ndarrays for downstream sklearn components.
Add 'feature_indices_' to DataFrameMapper, tracking indices of features in the output feature array. Update class docstring to sklearn standard format.
Adding DataFramePipeline to support extraction of fitting targets via input DataFrameMapper. 'fit', 'fit_predict', 'fit_transform', and 'score' updated to extract fitting targets from input DataFrame via DataFrameMapper 'extract_y' interface. Pipeline requires first step in pipeline to be a DataFrameMapper instance.
Add 'X_*' properties to DataFrameMapper, allowing association of source column metadata with output feature indices. Fix 'feature_indicies_' property definition. Add initial example of DataFrameMapper and DataFramePipeline.
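To make the 'feature_indices_' idea from the commits above concrete, here is a minimal, dependency-free sketch. It is not the code from this pull request: the helper name `compute_feature_indices` and the (column, width) input format are hypothetical, and it assumes only that each source column expands to a known number of output columns.

```python
def compute_feature_indices(feature_widths):
    """Map each source column to the [start, stop) range it occupies
    in the output feature array.

    feature_widths: list of (column_name, n_output_columns) pairs,
    in the order the (hypothetical) mapper emits them.
    """
    indices = {}
    start = 0
    for name, width in feature_widths:
        # Each column's block begins where the previous block ended.
        indices[name] = (start, start + width)
        start += width
    return indices


# Illustrative widths: 'species' is assumed to be one-hot encoded into 3 columns.
widths = [("sepal_length", 1), ("species", 3), ("petal_width", 1)]
feature_indices = compute_feature_indices(widths)
```

With a mapping like this, per-feature metadata from an estimator (for example, feature importances) could be sliced back to the source column that produced it.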
Thanks a lot @asford, I will review the PR shortly.
Hi @asford. I checked your code and I believe storing the However I feel the code complexity added to be able to use the mapper to extract values for Could you submit a PR with just the Thanks!
On partitioning the pull request, no problem. I opened this after realizing my customizations addressed a few open issues, so it has a few features mixed together. I'll break out the
@dukebody My goal for the y-value extraction feature was to support syntactic sugar closer to R's high level interfaces. For example (from http://koaning.github.io/html/patsy.html):
This style of interface clearly captures the feature and target columns in a single object, instead of spreading the model definition across multiple objects. This is a standard way of handling symbolic model definition (in R formula, patsy, etc.) because it supports a clear separation of concerns. I agree that the current interface ( Would you be open to a revised version of the y-feature components of this pull request with a refined interface? For example, cross-validation with y features:

formulas = [
    DataFrameMapper(["sepal_length", "sepal_width", "petal_length"], "species"),
    DataFrameMapper(["sepal_length", "sepal_width", "petal_length", "petal_width"], "species"),
]
for formula in formulas:
    pipeline = make_dataframe_pipeline([formula, RandomForestClassifier()])
    print(confusion_matrix(pipeline.extract_y(iris), pipeline.fit(iris).predict(iris)))
    print(cross_val_score(*pipeline.estimator_Xy(iris), scoring="accuracy", cv=5))

Cross-validation without y features:

formulas = [
    DataFrameMapper(["sepal_length", "sepal_width", "petal_length"]),
    DataFrameMapper(["sepal_length", "sepal_width", "petal_length", "petal_width"]),
]
for formula in formulas:
    pipeline = make_dataframe_pipeline([formula, RandomForestClassifier()])
    # LabelEncoder.fit_transform returns an ndarray, so it is passed directly
    print(confusion_matrix(
        LabelEncoder().fit_transform(iris['species']),
        pipeline.fit(
            iris, LabelEncoder().fit_transform(iris['species'])
        ).predict(iris)
    ))
    print(cross_val_score(
        pipeline, iris, LabelEncoder().fit_transform(iris['species']),
        scoring="accuracy", cv=5
    ))
@asford I agree that the "formula-like" use case is interesting, but I'd prefer to implement this without making the I was thinking it might be possible to do something similar by creating a custom pipeline class that accepts two Pseudo-code:
On a separate note, your commit also introduces a new interface for selecting features without applying any transformation:
While it is not consistent with the
I've had a chance to read through your alternate implementation and think about the y-feature component of the DataFrameMapper, and I agree that y-feature extraction is best handled at the pipeline level. In this pull's current implementation the pipeline contains:
In the updated interface, the pipeline would contain:
I will rework this pull request to match this interface over the next few days to work out the API. Would you mind holding off on #56 until we can iterate on this change? I'm not entirely convinced that storing
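As a rough illustration of what handling y extraction at the pipeline level could look like, here is a small dependency-free sketch. It is not the interface proposed in this thread or the one eventually implemented: `ColumnExtractor`, `MajorityClassifier`, and `FramePipeline` are hypothetical names, and a plain dict of lists stands in for a DataFrame.

```python
class ColumnExtractor:
    """Hypothetical stand-in for a mapper: pulls named columns out of a frame."""

    def __init__(self, columns):
        self.columns = list(columns)

    def transform(self, frame):
        # frame is a dict of equal-length lists, standing in for a DataFrame.
        n_rows = len(frame[self.columns[0]])
        return [[frame[c][i] for c in self.columns] for i in range(n_rows)]


class MajorityClassifier:
    """Toy final estimator: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class FramePipeline:
    """Hypothetical pipeline that owns both X and y extraction,
    so the extractors themselves know nothing about targets."""

    def __init__(self, x_extractor, y_extractor, estimator):
        self.x_extractor = x_extractor
        self.y_extractor = y_extractor
        self.estimator = estimator

    def extract_X(self, frame):
        return self.x_extractor.transform(frame)

    def extract_y(self, frame):
        # The y extractor selects a single column; flatten it to a 1-D list.
        return [row[0] for row in self.y_extractor.transform(frame)]

    def fit(self, frame):
        self.estimator.fit(self.extract_X(frame), self.extract_y(frame))
        return self

    def predict(self, frame):
        return self.estimator.predict(self.extract_X(frame))


frame = {
    "sepal_length": [5.1, 4.9, 6.3],
    "species": ["setosa", "setosa", "virginica"],
}
pipeline = FramePipeline(
    ColumnExtractor(["sepal_length"]),
    ColumnExtractor(["species"]),
    MajorityClassifier(),
)
predictions = pipeline.fit(frame).predict(frame)
```

The point of the design is that feature selection and target selection are two separate extractors, and only the pipeline decides which one feeds `X` and which feeds `y`.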
This pull request:
- DataFrameMapper object with (optional) support for y-value extraction from input frames in fit* methods. Ref discussion in Pandas In, Pandas Out? .inverse_transform() method #41. @naught101 @dukebody
- features -> X metadata properties, to support association of estimator metadata to source columns. Ref Track which DataFrame Column corresponds to which Array Column(s) after Transform #13. @sveitser @dukebody
- Pipeline subclass, DataFramePipeline, to support transparent extraction of y values during fitting via an input DataFrameMapper.

Example Notebook
I'm absolutely open to code-review and discussion of the proposed interfaces before merging.
TODO
- y_feature column may be present in both input X and y frame inputs.
- y inputs.
- Series & DataFrame support?
- _dataframe_mapper and _final_estimator properties in DataFramePipeline to public properties?
- X and y values?