Allow me to opt out of replacing missing values (missing_value_replacement=None) #938

Open
npatki opened this issue Feb 6, 2025 · 0 comments
Labels
feature request Request for a new feature

Comments

Contributor

npatki commented Feb 6, 2025

Problem Description

A number of the RDT transformers have a missing_value_replacement parameter that describes how to replace missing values in the forward transform. There are a variety of strategies that include replacement with a random value ('random'), replacing with the mean of the data ('mean'), etc. This was done under the assumption that the downstream software does not accept missing values.

However, I may have use cases where my downstream software does handle missing values. In such cases, I do not want the RDTs to do anything to the missing values, so we should offer missing_value_replacement=None as an option.

Note that we used to have this option, but it is now listed as deprecated for all the transformers. We should un-deprecate it because there are valid use cases that need it.

Expected behavior

Reinstate the None option for the missing_value_replacement parameter, wherever it exists. If None is passed in, then the forward transform should not replace the missing values with anything; it should just pass them along.

  • Note that the default value for missing_value_replacement should not change; in most cases, it is 'random'. The user would need to explicitly pass in missing_value_replacement=None to access this new functionality.
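
To make the requested forward-transform behavior concrete, here is a minimal sketch in plain Python. It is not RDT code: `forward_transform` and `is_missing` are hypothetical helpers, missing values are represented as `None` or NaN, and only the 'mean', numeric, and None settings are covered.

```python
import math
import statistics


def is_missing(value):
    """Treat None and NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def forward_transform(values, missing_value_replacement):
    """Hypothetical stand-in for a transformer's forward transform."""
    if missing_value_replacement is None:
        # Requested behavior: pass missing values along untouched.
        return list(values)

    if missing_value_replacement == "mean":
        # Mean of the observed (non-missing) values only.
        observed = [v for v in values if not is_missing(v)]
        fill = statistics.fmean(observed)
    elif isinstance(missing_value_replacement, (int, float)):
        fill = missing_value_replacement
    else:
        raise ValueError("strategy not covered in this sketch")

    return [fill if is_missing(v) else v for v in values]


column = [1.0, None, 3.0]
print(forward_transform(column, None))    # missing value is preserved
print(forward_transform(column, "mean"))  # missing value is filled
```

The caller has to pass None explicitly, which matches the point above that existing defaults should stay as they are.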

For the reverse transform, things can be a bit trickier. Ideally, the missing_value_generation parameter tells us how to recreate missing values. But if the passed-in data already contains missing values, this can interfere with generation. Here's what we can do:

If missing_value_generation is not None and the passed-in data contains some null values, then don't do anything to the missing values. Instead, show the user a warning such as:

Warning: The 'missing_value_generation' parameter is set to 'from_column' but the data already contains 
missing values. Missing value generation will be skipped.
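
The reverse-transform check described above could look roughly like the following. This is a sketch under stated assumptions, not RDT's implementation: `should_skip_generation` and `is_missing` are hypothetical names, missing values are `None` or NaN, and the warning text interpolates whatever value the parameter currently holds.

```python
import math
import warnings


def is_missing(value):
    """Treat None and NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def should_skip_generation(values, missing_value_generation):
    """Return True when missing value generation should be skipped.

    Hypothetical helper: a reverse transform would call this first and,
    on True, return the data with its existing missing values untouched.
    """
    has_missing = any(is_missing(v) for v in values)
    if missing_value_generation is not None and has_missing:
        warnings.warn(
            "The 'missing_value_generation' parameter is set to "
            f"'{missing_value_generation}' but the data already contains "
            "missing values. Missing value generation will be skipped."
        )
        return True
    # Either there is nothing to generate (setting is None) or the data is
    # complete, so the usual generation logic can run.
    return False


print(should_skip_generation([1.0, None, 3.0], "from_column"))  # warns
print(should_skip_generation([1.0, 2.0, 3.0], "from_column"))   # no warning
```

The issue does not say what should happen when missing_value_generation is None and the data contains missing values; the sketch assumes the values are simply left in place without a warning.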

Additional context

This change should probably be made in a base class. It would ultimately affect the following transformers:

  • numerical: ClusterBasedNormalizer, FloatFormatter, GaussianNormalizer, XGaussianNormalizer
  • categorical: BinaryEncoder
  • datetime: OptimizedTimestampEncoder, UnixTimestampEncoder
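
As a rough illustration of why a base class is a natural home for this change, the sketch below puts the opt-out in one place so every subclass inherits it. All names here (`MissingValueBase`, `ConstantFill`, `_fill`) are hypothetical and do not reflect RDT's actual class hierarchy.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


class MissingValueBase:
    """Hypothetical base class that owns the missing value handling."""

    def __init__(self, missing_value_replacement):
        self.missing_value_replacement = missing_value_replacement

    def transform(self, values):
        if self.missing_value_replacement is None:
            # Opt-out handled once here, for every subclass.
            return list(values)
        return self._fill(values)

    def _fill(self, values):
        raise NotImplementedError


class ConstantFill(MissingValueBase):
    """Hypothetical subclass that fills missing values with a constant."""

    def _fill(self, values):
        fill = self.missing_value_replacement
        return [fill if is_missing(v) else v for v in values]


print(ConstantFill(None).transform([1.0, None]))  # missing value kept
print(ConstantFill(0.0).transform([1.0, None]))   # missing value filled
```

Subclasses only describe how to fill; whether to fill at all is decided in the shared `transform` method.
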
npatki added the feature request label Feb 6, 2025