Log transform for US$ variables #13

Open
flaxter opened this issue Nov 2, 2016 · 6 comments

@flaxter
Contributor

flaxter commented Nov 2, 2016

Here are the variables I think we should log transform, all representing income/wages/etc.

VERSIONS = {
...
    'log_transform_feats': '''INTP OIP PAP RETP SEMP SSIP SSP WAGP PERNP
                            PINCP'''.split(),

Only issue is that some of these variables can be negative (for losses). So I guess the transformation for those should be x = log(x - min(x)) or something?

Once we figure that out it should be easy to put this into get_dummies.

@djsutherland
Owner

I think in my pre-pummeler attempt at this I did sign(x) * log(x + 1*sign(x)) or something. log(x - min(x)) isn't shaped very nicely if min(x) is, say, -915,729,293.
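To make the shape problem concrete, here is a rough sketch (not pummeler code; `shift_log` is a made-up helper name) of how `log(x - min(x))` treats two ordinary incomes when the minimum is as extreme as the figure above, compared with a minimum close to zero:

```python
import math

def shift_log(x, x_min):
    """log(x - min(x)); hypothetical helper, only defined for x > x_min."""
    return math.log(x - x_min)

# Gap between an income of 0 and an income of 100,000 under two different minima.
extreme_min = -915_729_293   # the extreme value mentioned above
mild_min = -1                # assumed small minimum, chosen only for contrast

gap_extreme = shift_log(100_000, extreme_min) - shift_log(0, extreme_min)
gap_mild = shift_log(100_000, mild_min) - shift_log(0, mild_min)

print(f"gap with extreme min: {gap_extreme:.6f}")
print(f"gap with mild min:    {gap_mild:.6f}")
```

With the extreme minimum, the two incomes end up almost on top of each other, so nearly all of the useful range is squashed into a sliver.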

@flaxter
Contributor Author

flaxter commented Nov 2, 2016

don't understand... 1+sign(x)?

I just looked through the codebook more carefully. Most (all?) of these are truncated below ("Rounded & bottom-coded") so I think something like my solution actually makes sense. Sure, it won't be a normal distribution, but if we're featurizing using KDE then it'll just have a weird bump in the lower tail. Of course my solution doesn't work when x = min(x) so I guess now I'm proposing:

log(x - min(x) + 1)

@djsutherland
Owner

I was a little off before: what I want is sign(x) * log( |x| + 1 ), which maintains both sign information and magnitude information. Doing log(x - min(x) + 1) is weird because it conflates very-negative incomes with slightly-negative incomes, while the amount that moderate incomes are conflated depends on what the min is.
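A minimal sketch of the two candidates side by side, assuming plain Python floats rather than whatever array type the real pipeline uses (`signed_log` and `shifted_log` are illustrative names, not existing functions):

```python
import math

def signed_log(x):
    """sign(x) * log(|x| + 1): keeps the sign and compresses the magnitude."""
    return math.copysign(math.log1p(abs(x)), x)

def shifted_log(x, x_min):
    """log(x - min(x) + 1): the shift-by-minimum proposal."""
    return math.log(x - x_min + 1)

# How far apart do incomes of 1,000 and 10,000 land?
# For the shifted log the answer depends on the minimum (both minima are made up).
gap_small_min = shifted_log(10_000, -10_000) - shifted_log(1_000, -10_000)
gap_large_min = shifted_log(10_000, -1_000_000) - shifted_log(1_000, -1_000_000)

# For the signed log the gap is fixed, whatever the minimum of the data is.
gap_signed = signed_log(10_000) - signed_log(1_000)

print(gap_small_min, gap_large_min, gap_signed)
```

The signed version is also odd-symmetric, so a loss of 5 and a gain of 5 map to values of equal magnitude and opposite sign, and zero stays at zero.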

@flaxter
Contributor Author

flaxter commented Nov 3, 2016

OK, finally went through case-by-case using the sampled data. Here are the only two monetary variables that I found that can actually be negative:

  • INTP (Interest, dividends, and net rental income) has a bunch of true zeros ("None"). Only 0.2% were negative.
  • SEMP (Self-employment income) is the same as INTP, with even more true zeros. Again only 0.2% were negative (correlated with INTP?)

So maybe we just do categorical variables for whether INTP/SEMP are non-zero? But I still don't know what transform to use for positive / negative. Here are our two proposals, neither looks great:

[Images: semp, intp]
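One way to combine the two ideas, sketched under the assumption that we want both a non-zero flag and a transformed value for each of INTP/SEMP (`featurize_income` is a hypothetical name, not something in the codebase):

```python
import math

def featurize_income(x):
    """Return (is_nonzero, signed_log) for one income value.

    is_nonzero is the categorical flag for "has any such income at all";
    signed_log is sign(x) * log(|x| + 1), so losses stay negative.
    """
    is_nonzero = int(x != 0)
    return is_nonzero, math.copysign(math.log1p(abs(x)), x)

# Made-up example values: a true zero, a small loss, and a gain.
for value in (0, -250, 12_000):
    print(value, featurize_income(value))
```

The flag absorbs the big spike of true zeros, so the transformed value only has to describe people who actually report that kind of income.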

@flaxter
Contributor Author

flaxter commented Nov 3, 2016

Update: forgot about PERNP, which can also be negative. Or have true zeros (no earnings)?

Also, what does this codebook entry mean?

RACNUM = Number of major race groups represented
1..6 .Race groups

@djsutherland
Owner

IIRC RACNUM is the flag for how many racial groups the person has indicated, with RAC1P the first race, RAC2P the second, etc.
