Pandas In, Pandas Out? .inverse_transform() method #41

Open
naught101 opened this issue Oct 22, 2015 · 31 comments

Comments

@naught101

It would be really nice to have the ability to put pandas dataframes into sklearn pipelines, and to have equivalent pandas dataframes returned afterwards. I think that this module would be the place for that - probably all that would be required is a .inverse_transform method on the DataFrameMapper.

Would something like this be wanted in this module? I can make a pull request, if so.

Before I do, why is all the code in __init__.py? Seems like it'll get hard to maintain after a while...

@dukebody
Collaborator

Hi Naught101!

You can already put pandas dataframes into sklearn pipelines. Just create a pipeline where the first step is the DataFrameMapper.

Regarding the proposal "to have equivalent dataframes returned afterwards", do you mean making the pipeline return a pandas DataFrame? Sklearn pipelines usually return numpy arrays containing either classification probabilities for each class (predict_proba), class predictions, or regression values. How could you inverse transform that with the initial DataFrameMapper? The output and the input have different shapes and call for different transforms.

I believe you can do the indexing thing you proposed at scikit-learn/scikit-learn#5523 (comment) by wrapping the numpy array output in a DataFrame and passing as its index the index of the original DataFrame that went into the pipeline. Am I wrong?
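A minimal sketch of the wrapping described above, assuming a small made-up dataframe and a plain scikit-learn pipeline (the column names, data, and choice of LogisticRegression are illustrative and not taken from this issue). The pipeline returns a numpy array, and the original index is reattached afterwards:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data with a non-default index (hypothetical, not from the issue).
df = pd.DataFrame(
    {"height": [1.0, 2.0, 3.0, 4.0], "weight": [4.0, 3.0, 2.0, 1.0]},
    index=["a", "b", "c", "d"],
)
y = [0, 0, 1, 1]

clf = LogisticRegression()
pipe = make_pipeline(StandardScaler(), clf)
pipe.fit(df, y)

# predict_proba returns a numpy array with one column per class.
proba = pipe.predict_proba(df)

# Wrap the array in a DataFrame, reusing the index of the input dataframe
# and labelling the columns with the fitted classes.
proba_df = pd.DataFrame(proba, index=df.index, columns=clf.classes_)
```

This only restores the row labels; it does not undo any transformation, which is the part that needs an inverse_transform.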

Regarding the reason why all the code is in __init__.py, I guess it is because it was a very small module at first and it didn't make a lot of sense to scatter the code across multiple files, although clearly we would need to go that way for clarity if the codebase grows.

One issue we have is that the original maintainer of the package (paulgb) is no longer working on it at all, and the second maintainer (Cal Paterson) has been quite unresponsive in the last few months as well. So it's becoming hard to get new code into this repo, and harder to get it into a release. :(

@naught101
Author

Aha.. I wasn't thinking clearly, but now I can: DataFrameMappers can also be useful for generating the y value passed to a fit method. The inverse_transform would then be useful to get back a suitable dataframe. But yes, this would be a different DataFrameMapper to the one used for X.

I guess that would all happen outside the pipeline though..
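A rough sketch of that y-side workflow, done outside the pipeline with a plain scikit-learn LabelEncoder rather than a DataFrameMapper (the data, column names, and the DecisionTreeClassifier are assumptions made only for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: one numeric feature and a string-valued target column.
df = pd.DataFrame(
    {"x": [0.0, 1.0, 2.0, 3.0], "label": ["no", "no", "yes", "yes"]},
    index=[10, 11, 12, 13],
)

# Encode y outside the pipeline with its own transformer.
y_encoder = LabelEncoder()
y = y_encoder.fit_transform(df["label"])

model = DecisionTreeClassifier(random_state=0).fit(df[["x"]], y)

# Predictions come back as integer codes; inverse_transform maps them back
# to the original labels, and the input index is reattached by hand.
pred_codes = model.predict(df[["x"]])
pred = pd.Series(
    y_encoder.inverse_transform(pred_codes), index=df.index, name="label"
)
```

A mapper-level inverse_transform would play the role that y_encoder.inverse_transform plays here, but for several target columns at once.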

Has anyone working on the code asked @paulgb for push access?


@dukebody
Collaborator

@calpaterson got write access to this repo, but he's not answering my mails. :S

@naught101
Author

Hrm. Is there any reason you couldn't expand the current behaviour to also map the y dataframe? e.g. the call would be mapper = DataFrameMapper(X_features = [(blah...)], y_features = [(blergh)]), and then .fit(), .transform() and .predict() all call whichever transforms are relevant on X and/or y.

@dukebody
Collaborator

Sounds reasonable. Could you come up with some examples where this y transformation would be useful?

@dukebody
Collaborator

dukebody commented Nov 2, 2015

@naught101 I have write access to this repo now, so we can work this out if you come up with useful use cases. :)

@dukebody
Collaborator

dukebody commented Nov 2, 2015

@naught101 you might want something similar to what is discussed in #13 ?

@naught101
Author

Yeah, I suspect that #13 is a prerequisite for this issue..

@ethanluoyc

Say the transformed dataframe has exactly the same shape as the dataframe before the transformation. Can we pass in the columns to regenerate the predicted results in a DataFrame format?

@dukebody
Collaborator

dukebody commented Nov 8, 2015

@ethanluoyc Could you provide a code example of how that feature would work? Not the implementation, but how one would use it.

@ethanluoyc

I am working on something basketball-related, so I will give an example from that. Say I have this dataframe:

[screenshot 2015-11-08 22 02 01]

After the conversion I will get something like this:

[screenshot 2015-11-08 22 03 52]

The conversion basically does a substitution based on the position of the keyword (the name) that I have in a text string, for example:

"Jumpball: (Zydrunas Ilgauskas)\PN vs. (Kendrick Perkins)\PN ((Mo Williams)\PN gains possession)"

So the two dataframes actually have the same shape. I don't know whether I can do such an inverse transformation.

I checked out #13 and I think the approach can work. However, while consulting the sklearn documentation I stumbled over the active_features_ attribute, so I decided to look into this in more detail once I figure out what the active_features_ attribute does.

@dukebody
Collaborator

dukebody commented Nov 8, 2015

I believe we can do the inverse transformation if:
* We track which array columns correspond to each dataframe column.
* Every transformer used has an inverse_transform(X) method.

It shouldn't be too hard to do. Any takers? :)
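The two conditions above can be sketched as follows. This is not the sklearn-pandas implementation: fit_transform_blocks, inverse_transform_blocks, and the layout dictionary are hypothetical helpers written for illustration, assuming one transformer per column, dense outputs, and made-up data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, LabelEncoder


def fit_transform_blocks(df, features):
    """Fit one transformer per column and record the output columns it produced."""
    blocks, layout, start = [], {}, 0
    for column, transformer in features:
        out = np.asarray(transformer.fit_transform(df[column]))
        was_1d = out.ndim == 1
        if was_1d:
            out = out.reshape(-1, 1)  # a 1-d output becomes a single column
        stop = start + out.shape[1]
        layout[column] = (slice(start, stop), was_1d)
        blocks.append(out)
        start = stop
    return np.hstack(blocks), layout


def inverse_transform_blocks(array, features, layout, index=None):
    """Run each transformer's inverse_transform on its own block of columns."""
    restored = {}
    for column, transformer in features:
        cols, was_1d = layout[column]
        block = array[:, cols]
        if was_1d:
            block = block.ravel()  # hand back the 1-d shape the transformer produced
        restored[column] = np.asarray(transformer.inverse_transform(block)).ravel()
    return pd.DataFrame(restored, index=index)


# Illustrative data (hypothetical): one column is label-encoded into a single
# output column, the other is binarized into three output columns.
df = pd.DataFrame({"answer": list("ynyyn"), "grade": list("abcab")})
features = [("answer", LabelEncoder()), ("grade", LabelBinarizer())]

transformed, layout = fit_transform_blocks(df, features)
restored = inverse_transform_blocks(transformed, features, layout, index=df.index)
```

The layout dictionary is the bookkeeping from the first bullet; the second bullet shows up as the unconditional call to transformer.inverse_transform, which would fail for any transformer that lacks that method.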

@Yevgnen

Yevgnen commented Mar 3, 2017

Can sklearn-pandas inverse_transform the transformed data right now?

@dukebody
Collaborator

dukebody commented Mar 3, 2017

No, it can't right now.

@dukebody
Collaborator

The last attempt to do this was #56, but it stalled waiting for input from another dev. Perhaps we can pick it up again?

@devforfu
Collaborator

devforfu commented Oct 21, 2017

Am I right that this feature should be something like:

df = pd.DataFrame({'colA': list('ynyyn'), 'colB': list('abcab')})
mapper = DataFrameMapper([
    ('colA', [LabelEncoder()]),
    ('colB', [LabelEncoder()]),
])
transformed = mapper.fit_transform(df)
restored = mapper.inverse_transform(transformed)

Where transformed will be something like:

np.array([[1, 0],
          [0, 1],
          [1, 2],
          [1, 0],
          [0, 1]])

And, restored is the original dataframe:

colA colB
   y    a
   n    b
   y    c
   y    a
   n    b

So, basically, the DataFrameMapper will be able to "roll back" the result into the original dataframe, like sklearn transformers do?

@dukebody
Collaborator

@devforfu yes, this is what I understand.

To do so we need to keep track of which columns correspond to which features in the transformed output, and then run each transformer's inverse on its block.

@erikjandevries

erikjandevries commented Nov 7, 2017

Hi all, I've worked on a fork to create a solution for this problem. It passes the test

def test_inverse_transform_multicolumn():
    df = pd.DataFrame({'colA': list('ynyyn'), 'colB': list('abcab'), 'colC': list('sttts')})
    mapper = DataFrameMapper([
        ('colA', LabelEncoder()),
        ('colB', LabelBinarizer()),
        ('colC', LabelEncoder()),
    ])

    transformed = mapper.fit_transform(df)
    restored = mapper.inverse_transform(transformed)

    assert isinstance(restored, pd.DataFrame)    
    assert restored.equals(df)

which includes a LabelBinarizer that generates multiple columns. So far I'm assuming the mapper takes a pandas data frame and outputs a numpy array; I'm not yet dealing with self.input_df.

I'd like to improve this solution: I've now included an extra self.transformed_cols_ to keep track of mapped columns, but that should ideally be integrated with self.transformed_names_. However, I haven't yet checked the implications of modifying the latter, so I've simply added the parameter for now.

What would be the next steps? I've no idea if somebody else is already working on this, but I'm assuming I'll update my solution, commit it to my fork and then click on 'pull request' in my forked repository on GitHub? Do I need to keep anything else in mind?

@devforfu
Collaborator

devforfu commented Nov 12, 2017

@erikjandevries I guess you only need to run tox to see if all tests pass. Probably, add a couple more tests to see if your implementation correctly handles other cases, e.g. several transformers, like:

mapper = DataFrameMapper([
    ('colA', [CategoricalImputer(), LabelEncoder()]),
    ('colB', [Imputer(), StandardScaler()]),
    # other transformers
])

Or maybe any other edge cases.
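For the several-transformers case, the inverse has to undo the steps in reverse order. A small sketch using scikit-learn's own Pipeline as a stand-in for a chained column (how sklearn-pandas chains a list of transformers internally is not shown here, and the two scalers and data are chosen only because both are invertible):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative single-column data (hypothetical).
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Two chained transformers applied to the same column.
chain = make_pipeline(StandardScaler(), MinMaxScaler())
transformed = chain.fit_transform(X)

# Pipeline.inverse_transform undoes the steps in reverse order:
# first MinMaxScaler, then StandardScaler.
restored = chain.inverse_transform(transformed)
```

Chains that include an imputer are a harder edge case, since filling in missing values generally cannot be undone, so tests for those would need to decide what "restored" should mean.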

Then, if everything is fine, you could make a pull request and wait for a review from the repo owners, as well as a response from Circle CI, which would show whether your implementation has any issues.

@Whamp

Whamp commented Jul 6, 2018

Interested to see if there's been any progress on this issue. It seems like a pretty major limitation not to be able to recover the original data after transformation.

@Whamp

Whamp commented Jul 10, 2018

Is there any issue with @erikjandevries's code here? It looks fine to me but hasn't been accepted.

https://github.com/scikit-learn-contrib/sklearn-pandas/pull/133/commits/1b4edd9e9a7de56a25259b288150d06ece9701fd

@dukebody
Collaborator

dukebody commented Jul 11, 2018 via email

@erikjandevries

I'm sorry to say I've also been very busy. If I'm not mistaken, the problem with my code was that I created a new variable self.transformed_cols_ where I should have used the existing self.transformed_names_.
I did this because I wasn't sure what I might break otherwise, or how to use the transformed names variable... It's been a long time, and I think I found another way around the problem I was dealing with at the time, but perhaps the update could still be useful.

#133

@devforfu
Collaborator

@dukebody I usually track the sklearn_pandas repository changes and pull requests and use it in my daily tasks, so I could work on this if nobody else decides to take this responsibility.

@dukebody
Collaborator

dukebody commented Aug 5, 2018

@devforfu Thanks! I've sent you an invite to become a collaborator with write access to this repo, so you can merge stuff. Do you have an account on PyPI so I can give you access to publish new releases there?

@devforfu
Collaborator

devforfu commented Aug 6, 2018

@dukebody Sure, not a problem! Yes, I've created one, the username is devforfu.

@dukebody
Collaborator

@devforfu Added you to PyPI. I guess you should have received some kind of notification about it.

Can you take care of managing the next release after working out the existing PRs?

@devforfu
Collaborator

@dukebody Yes, the notification was received.

Ok, sure, will do as soon as I finalize the pending changes.

@AlanGanem

Hello guys. Any update on this issue?

@sxooler

sxooler commented May 13, 2020

I am joining @AlanGanem: Is there any update? I can see some updates in #133 and #182 , but it's already been more than 1 year and nothing was approved and merged.

@GitHunter0

I am joining @AlanGanem: Is there any update? I can see some updates in #133 and #182 , but it's already been more than 1 year and nothing was approved and merged.

Yes, it is a pity, this would be a very useful feature.

10 participants