Log transform for US$ variables #13

flaxter · 2016-11-02T21:35:29Z

Here are the variables I think we should log transform, all representing income/wages/etc.

VERSIONS = {
...
    'log_transform_feats': '''INTP OIP PAP RETP SEMP SSIP SSP WAGP PERNP
                            PINCP'''.split(),

Only issue is that some of these variables can be negative (for losses). So I guess the transformation for those should be x = log(x - min(x)) or something?

Once we figure that out it should be easy to put this into get_dummies.

djsutherland · 2016-11-02T21:38:18Z

I think in my pre-pummeler attempt at this I did sign(x) * log(x + 1*sign(x)) or something. log(x - min(x)) isn't shaped very nicely if min(x) is, say, -915,729,293.

flaxter · 2016-11-02T22:04:11Z

don't understand... 1+sign(x)?

I just looked through the codebook more carefully. Most (all?) of these are truncated below ("Rounded & bottom-coded") so I think something like my solution actually makes sense. Sure, it won't be a normal distribution, but if we're featurizing using KDE then it'll just have a weird bump in the lower tail. Of course my solution doesn't work when x = min(x), so I guess now I'm proposing:

log(x - min(x) + 1)

djsutherland · 2016-11-02T22:23:01Z

I was a little off before: what I want is sign(x) * log( |x| + 1 ), which maintains both sign information and magnitude information. Doing log(x - min(x) + 1) is weird because it conflates very-negative incomes with slightly-negative incomes, while the amount that moderate incomes are conflated depends on what the min is.
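To make the difference between the two proposals concrete, here is a minimal sketch in plain Python. The function names `signed_log` and `shifted_log` and the dollar amounts are made up for illustration; this is not code from pummeler.

```python
import math

def signed_log(x):
    # sign(x) * log(|x| + 1): keeps the sign, compresses the magnitude,
    # and maps 0 to 0.
    return math.copysign(math.log1p(abs(x)), x)

def shifted_log(x, x_min):
    # log(x - min(x) + 1): the result depends on what the minimum is.
    return math.log1p(x - x_min)

# Made-up dollar amounts, including two losses.
incomes = [-10000, -100, 0, 100, 50000]
x_min = min(incomes)

signed = [signed_log(v) for v in incomes]
shifted = [shifted_log(v, x_min) for v in incomes]

for v, s, t in zip(incomes, signed, shifted):
    print(f"{v:>7}  signed={s:8.3f}  shifted={t:8.3f}")
```

With these values, the shifted version places -100 and +100 almost on top of each other because both are far from the minimum, while the signed version keeps them well apart and symmetric around zero.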

flaxter · 2016-11-03T08:45:03Z

OK, finally went through case-by-case using the sampled data. Here are the only two monetary variables that I found that can actually be negative:

INTP (Interest, dividends, and net rental income) has a bunch of true zeros ("None"). Only 0.2% were negative.
SEMP (Self-employment income) is same as INTP, with even more true zeros. Again only 0.2% were negative (correlated with INTP?)

So maybe we just do categorical variables for whether INTP/SEMP are non-zero? But I still don't know what transform to use for positive / negative. Here are our two proposals, neither looks great:
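For the non-zero indicator idea, a minimal sketch in plain Python pairs a flag for "not a true zero" with the signed log magnitude. The helper `featurize` and the sample INTP values are hypothetical, and how this would hook into get_dummies is left open.

```python
import math

def featurize(value):
    """Return (nonzero_flag, signed_log) for one dollar amount."""
    nonzero = int(value != 0)
    # sign(x) * log(|x| + 1), so a true zero stays at exactly 0.
    slog = math.copysign(math.log1p(abs(value)), value)
    return nonzero, slog

# Made-up INTP values: two true zeros, one gain, one loss.
intp = [0, 250, -40, 0]
features = [featurize(v) for v in intp]

for v, (flag, slog) in zip(intp, features):
    print(f"{v:>5}  nonzero={flag}  signed_log={slog:7.3f}")
```

The flag lets a model treat the large mass of true zeros separately, while the second value still separates losses from gains.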

flaxter · 2016-11-03T09:34:11Z

Update: forgot about PERNP, which can also be negative. Or have true zeros (no earnings)?

Also, what does this codebook entry mean?

RACNUM = Number of major race groups represented
1..6 .Race groups

djsutherland · 2016-11-03T09:45:36Z

IIRC RACNUM is the flag for how many racial groups the person has indicated, with RAC1P the first race, RAC2P the second, etc.
