Merge pull request #25 from dukebody/keep-columns-no-transformation-issue-19

calpaterson · calpaterson · commit c374c98d2e1f · 2015-06-01T20:27:43.000+01:00
Add documentation example selecting column but not applying any transformer to it
diff --git a/README.rst b/README.rst
@@ -73,7 +73,7 @@ Test the Transformation
 
 We can use the ``fit_transform`` shortcut to both fit the model and see what transformed data looks like. In this and the other examples, output is rounded to two digits with ``np.round`` to account for rounding errors on different hardware::
 
-    >>> np.round(mapper.fit_transform(data), 2)
+    >>> np.round(mapper.fit_transform(data.copy()), 2)
     array([[ 1.  ,  0.  ,  0.  ,  0.21],
            [ 0.  ,  1.  ,  0.  ,  1.88],
            [ 0.  ,  1.  ,  0.  , -0.63],
@@ -102,7 +102,7 @@ Transformations may require multiple input columns. In these cases, the column n
     
 Now running ``fit_transform`` will run PCA on the ``children`` and ``salary`` columns and return the first principal component::
 
-    >>> np.round(mapper2.fit_transform(data), 1)
+    >>> np.round(mapper2.fit_transform(data.copy()), 1)
     array([[ 47.6],
            [-18.4],
            [  1.6],
@@ -112,6 +112,25 @@ Now running ``fit_transform`` will run PCA on the ``children`` and ``salary`` co
            [ -6.4],
            [-15.4]])
 
+Columns that don't need any transformation
+******************************************
+
+Only columns that are listed in the ``DataFrameMapper`` are kept. To keep a column without applying any transformation to it, use ``None`` as its transformer::
+
+    >>> mapper3 = DataFrameMapper([
+    ...     ('pet', sklearn.preprocessing.LabelBinarizer()),
+    ...     ('children', None)
+    ... ])
+    >>> np.round(mapper3.fit_transform(data.copy()))
+    array([[ 1.,  0.,  0.,  4.],
+           [ 0.,  1.,  0.,  6.],
+           [ 0.,  1.,  0.,  3.],
+           [ 0.,  0.,  1.,  3.],
+           [ 1.,  0.,  0.,  2.],
+           [ 0.,  1.,  0.,  3.],
+           [ 1.,  0.,  0.,  5.],
+           [ 0.,  0.,  1.,  4.]])
+
 Cross-Validation
 ----------------
 
@@ -122,7 +141,7 @@ To get around this, sklearn-pandas provides a wrapper on sklearn's ``cross_val_s
     >>> pipe = sklearn.pipeline.Pipeline([
     ...     ('featurize', mapper),
     ...     ('lm', sklearn.linear_model.LinearRegression())])
-    >>> np.round(cross_val_score(pipe, data, data.salary, 'r2'), 2)
+    >>> np.round(cross_val_score(pipe, data.copy(), data.salary, 'r2'), 2)
     array([ -1.09,  -5.3 , -15.38])
 
 Sklearn-pandas' ``cross_val_score`` function provides exactly the same interface as sklearn's function of the same name.
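
Reviewer note: the new "Columns that don't need any transformation" section describes two rules, namely that unlisted columns are dropped and that a ``None`` transformer passes a column through unchanged. The sketch below illustrates those two rules in plain Python. It is not sklearn-pandas' implementation: ``apply_mapping``, ``one_hot``, and the sample rows are hypothetical names and data invented for this illustration.

```python
# Illustrative sketch only: NOT the sklearn-pandas implementation.
# apply_mapping, one_hot and the sample rows are hypothetical.

def one_hot(values):
    """Encode each value as a 0/1 indicator row over the sorted distinct values."""
    classes = sorted(set(values))
    return [[1 if v == c else 0 for c in classes] for v in values]


def apply_mapping(rows, features):
    """Build an output matrix from (column, transformer) pairs.

    rows: list of dicts, one per record.
    features: list of (column_name, transformer) pairs, where transformer is
    either None (keep the column as-is) or a callable that takes the list of
    column values and returns one output row (a list) per record.
    Columns that do not appear in `features` are dropped.
    """
    blocks = []
    for column, transformer in features:
        values = [row[column] for row in rows]
        if transformer is None:
            # Pass-through: one output column holding the untouched values.
            block = [[v] for v in values]
        else:
            block = transformer(values)
        blocks.append(block)
    # Concatenate the per-feature blocks horizontally, record by record.
    return [sum((block[i] for block in blocks), []) for i in range(len(rows))]


# Made-up sample data; 'salary' is not listed below, so it is dropped.
rows = [
    {'pet': 'cat', 'children': 4, 'salary': 90},
    {'pet': 'dog', 'children': 6, 'salary': 24},
    {'pet': 'fish', 'children': 3, 'salary': 42},
]

result = apply_mapping(rows, [('pet', one_hot), ('children', None)])
for line in result:
    print(line)
```

Each output row here holds the three indicator columns for ``pet`` followed by the raw ``children`` value, which mirrors the column layout of the ``mapper3`` example in the README.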