@@ -27,8 +27,8 @@ Categorical
2727 `Categorical ` data in `Series ` and `DataFrame ` is new.
2828
2929
30- This is a short introduction to pandas ` Categorical ` type, including a short comparison with R's
31- `factor `.
30+ This is an introduction to the pandas :class:`pandas.Categorical` type, including a short comparison
31+ with R's `factor`.
3232
3333`Categoricals ` are a pandas data type, which correspond to categorical variables in
3434statistics: a variable, which can take on only a limited, and usually fixed,
@@ -108,7 +108,7 @@ By using some special functions:
108108 creation time. Use `levels ` to change the levels after creation time.
109109
110110To get back to the original Series or `numpy ` array, use ``Series.astype(original_dtype) `` or
111- ``Categorical.get_values( ) ``:
111+ ``np.asarray(categorical)``:
112112
113113.. ipython :: python
114114
@@ -118,7 +118,33 @@ To get back to the original Series or `numpy` array, use ``Series.astype(origina
118118 s2
119119 s3 = s2.astype(' string' )
120120 s3
121- s2.cat.get_values()
121+ np.asarray(s2.cat)
122+
123+ If you already have `codes` and `levels`, you can use the :func:`~pandas.Categorical.from_codes`
124+ constructor to skip the factorize step that the normal constructor performs:
125+
126+ .. ipython :: python
127+
128+ splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
129+ pd.Categorical.from_codes(splitter, levels=["train", "test"])
130+
131+ Description
132+ -----------
133+
134+ Using ``.describe() `` on a ``Categorical(...) `` or a ``Series(Categorical(...)) `` will show
135+ different output.
136+
137+
138+ As part of a `DataFrame` or as a `Series`, the output is similar to that of a `Series` of type
139+ ``string``. Calling ``Categorical.describe()`` shows the frequencies for each level, with NA for
140+ unused levels.
141+
142+ .. ipython :: python
143+
144+ cat = pd.Categorical(["a","c","c",np.nan], levels=["b","a","c",np.nan])
145+ df = pd.DataFrame({"cat":cat, "s":["a","c","c",np.nan]})
146+ df.describe()
147+ cat.describe()
122148
123149 Working with levels
124150-------------------
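The relationship between `codes` and `levels` that ``from_codes`` relies on can be sketched in plain Python. This is only an illustration of the idea, not how pandas implements it: the helper names ``factorize_sketch`` and ``from_codes_sketch`` are hypothetical, and using ``-1`` as the code for a missing value is an assumption of this sketch.

```python
def factorize_sketch(values):
    """Split values into integer codes and a sorted list of unique levels."""
    levels = sorted({v for v in values if v is not None})
    position = {level: i for i, level in enumerate(levels)}
    # Missing values (None here) get the placeholder code -1 (an assumption).
    codes = [position.get(v, -1) for v in values]
    return codes, levels


def from_codes_sketch(codes, levels):
    """Rebuild the values from precomputed codes, skipping the factorize step."""
    return [levels[c] if c >= 0 else None for c in codes]


codes, levels = factorize_sketch(["a", "c", "c", None])
print(codes, levels)
print(from_codes_sketch([0, 1, 1, 0, 1], ["train", "test"]))
```

When the codes are already available, as with the random ``splitter`` array above, only the second step is needed, which is the work that a ``from_codes``-style constructor saves.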
@@ -153,7 +179,8 @@ It's also possible to pass in the levels in a specific order:
153179
154180 .. note ::
155181
156- Passing in a `levels ` argument implies ``ordered=True ``.
182+ Passing in a `levels` argument implies ``ordered=True``. You can override that by
183+ passing in an explicit ``ordered=False``.
157184
158185Any value omitted in the levels argument will be replaced by `np.nan `:
159186
@@ -178,8 +205,7 @@ Renaming levels is done by assigning new values to the ``Category.levels`` or
178205
179206 .. note ::
180207
181- I contrast to R's `factor ` function, a `Categorical ` can have levels of other types than
182- string.
208+ In contrast to R's `factor`, a `Categorical` can have levels of types other than string.
183209
184210Levels must be unique or a `ValueError ` is raised:
185211
@@ -190,14 +216,16 @@ Levels must be unique or a `ValueError` is raised:
190216 except ValueError as e:
191217 print (" ValueError: " + str (e))
192218
193- Appending a level can be done by assigning a levels list longer than the current levels:
219+ Appending levels can be done by assigning a levels list longer than the current levels:
194220
195221.. ipython :: python
196222
197223 s.cat.levels = [1 ,2 ,3 ,4 ]
198224 s.cat.levels
199225 s
200226
227+ .. note ::
228+ Adding levels in other positions can be done with ``.reorder_levels(<levels_including_new>) ``.
201229
202230Removing a level is also possible, but only the last level(s) can be removed by assigning a
203231shorter list than current levels. Values which are omitted are replaced by `np.nan `.
@@ -236,8 +264,8 @@ Ordered or not...
236264-----------------
237265
238266If a `Categorical` is ordered (``cat.ordered == True``), then the order of the levels has a
239- meaning and certain operations are possible. If the the categorical is unordered,
240- a ` TypeError ` is raised.
267+ meaning and certain operations are possible. If the categorical is unordered, a ` TypeError ` is
268+ raised.
241269
242270.. ipython :: python
243271
@@ -268,7 +296,8 @@ This is even true for strings and numeric data:
268296 print (s.min(), s.max())
269297
270298 Reordering the levels is possible via the ``Categorical.reorder_levels(new_levels) `` or
271- ``Series.cat.reorder_levels(new_levels) `` methods:
299+ ``Series.cat.reorder_levels(new_levels) `` methods. All old levels must be included in the new
300+ levels.
272301
273302.. ipython :: python
274303
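The requirement that all old levels be included in the new levels can be illustrated with a small plain-Python sketch. It assumes a categorical is stored as integer codes that index into a list of levels; ``reorder_levels_sketch`` is a hypothetical helper written for this illustration and is not part of pandas.

```python
def reorder_levels_sketch(codes, old_levels, new_levels):
    """Re-express codes against new_levels; the encoded values stay the same."""
    missing = [level for level in old_levels if level not in new_levels]
    if missing:
        raise ValueError("new levels must include all old levels, missing: %r" % missing)
    new_position = {level: i for i, level in enumerate(new_levels)}
    # Negative codes stand for missing values in this sketch and are kept as -1.
    return [new_position[old_levels[c]] if c >= 0 else -1 for c in codes]


old_levels = ["a", "b", "d"]
codes = [0, 1, 2]  # encodes the values "a", "b", "d"
new_levels = ["a", "b", "c", "d"]
new_codes = reorder_levels_sketch(codes, old_levels, new_levels)
values = [new_levels[c] for c in new_codes]
print(new_codes, values)
```

If an old level were dropped, some existing code would have nothing to point at, which is why the sketch raises an error in that case.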
@@ -287,6 +316,15 @@ Reordering the levels is possible via the ``Categorical.reorder_levels(new_level
287316 way values are sorted is different afterwards, but not that individual values in the
288317 `Series ` are changed.
289318
319+ You can also add new levels with :func: `Categorical.reorder_levels `, as long as you include all
320+ old levels:
321+
322+ .. ipython :: python
323+
324+ s3 = pd.Series(pd.Categorical([" a" ," b" ," d" ]))
325+ s3.cat.reorder_levels(["a","b","c","d"])
326+ s3
327+
290328
291329Operations
292330----------
@@ -317,8 +355,8 @@ The mode:
317355.. note::
318356
319357 Numeric operations like `` + `` , `` - `` , `` * `` , `` / `` and operations based on them (e.g.
320- ``Categorical .median() ``, which would need to compute the mean between two values if the
321- length of an array is even) do not work and raise a `TypeError `.
358+ `` .median()`` , which would need to compute the mean between two values if the length of an
359+ array is even) do not work and raise a `TypeError ` .
322360
323361`Series` methods like `Series.value_counts()` will use all levels, even if some levels are not
324362present in the data:
@@ -353,7 +391,7 @@ Pivot tables:
353391Data munging
354392------------
355393
356- The optimized pandas data access methods ``.loc ``, ``.iloc `` ``ix `` ``.at ``, and``.iat``,
394+ The optimized pandas data access methods ``.loc``, ``.iloc``, ``.ix``, ``.at``, and ``.iat``
357395work as normal, the only difference is the return type (for getting) and
358396that only values already in the levels can be assigned.
359397
@@ -393,7 +431,7 @@ of length "1".
393431 df.at[" h" ," cats" ] # returns a string
394432
395433.. note::
396- Note that this is a difference to R's `factor ` function, where ``factor(c(1,2,3))[1] ``
434+ This is a difference to R's `factor` function, where ``factor(c(1,2,3))[1]``
397435 returns a single value `factor` .
398436
399437To get a single value `Series` of type `` category`` pass in a single value list :
@@ -455,7 +493,9 @@ but the levels of these `Categoricals` need to be the same:
455493 cat = pd.Categorical([" a" ," b" ], levels = [" a" ," b" ])
456494 vals = [1 ,2 ]
457495 df = pd.DataFrame({" cats" :cat, " vals" :vals})
458- pd.concat([df,df])
496+ res = pd.concat([df,df])
497+ res
498+ res.dtypes
459499
460500 df_different = df.copy()
461501 df_different[" cats" ].cat.levels = [" a" ," b" ," c" ]
@@ -501,27 +541,34 @@ store does not yet work.
501541
502542
503543Writing to a csv file will convert the data, effectively removing any information about the
504- `Categorical ` (` levels ` and ordering). So if you read back the csv file you have to convert the
505- relevant columns back to `category ` and assign the right ` levels ` and level ordering.
544+ `Categorical` (levels and ordering). So if you read back the csv file you have to convert the
545+ relevant columns back to `category` and assign the right levels and level ordering.
506546
507547.. ipython:: python
508548 :suppress:
509549
510550 from pandas.compat import StringIO
511- csv_file = StringIO
551+ csv_file = StringIO()
512552
513553.. ipython:: python
514554
515- s = pd.Series(pd.Categorical([' a' , ' b' , ' b' , ' a' , ' a' , ' c' ], levels = [' a' ,' b' ,' c' ,' d' ]))
555+ s = pd.Series(pd.Categorical(['a', 'b', 'b', 'a', 'a', 'd']))
556+ # rename the levels
557+ s.cat.levels = ["very good", "good", "bad"]
558+ # add new levels at the end
559+ s.cat.levels = list(s.cat.levels) + ["medium", "very bad"]
560+ # reorder the levels
561+ s.cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
516562 df = pd.DataFrame({" s" :s, " vals" :[1 ,2 ,3 ,4 ,5 ,6 ]})
517563 df.to_csv(csv_file)
518564 df2 = pd.read_csv(StringIO(csv_file.getvalue()))
519- df2.dtype
565+ df2.dtypes
520566 df2["s"]
521567 # Redo the category
522568 df2["s"] = df2["s"].astype("category")
523- df2[" vals" ].cat.levels = [' a' ,' b' ,' c' ,' d' ]
524- df2.dtype
569+ df2["s"].cat.levels = list(df2["s"].cat.levels) + ["medium", "very bad"]
570+ df2["s"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
571+ df2.dtypes
525572 df2["s"]
526573
527574
@@ -576,8 +623,8 @@ object and not as a low level `numpy` array dtype. This leads to some problems.
576623 dtype == np.str_
577624 np.str_ == dtype
578625
579- Using `` numpy ` ` functions on a `Series ` of type ``category `` should not work as `Categoricals `
580- are not numeric data (even in the case that levels is numeric).
626+ Using `numpy` functions on a `Series` of type `` category`` should not work as `Categoricals`
627+ are not numeric data (even in the case that `` . levels`` is numeric).
581628
582629.. ipython:: python
583630
@@ -612,36 +659,40 @@ means that changes to the `Series` will in most cases change the original `Categ
612659Use `` copy=True `` to prevent such a behaviour:
613660
614661.. ipython:: python
662+
615663 cat = pd.Categorical([1 ,2 ,3 ,10 ], levels = [1 ,2 ,3 ,4 ,10 ])
616664 s = pd.Series(cat, name = " cat" , copy = True )
617665 cat
618666 s.iloc[0 :2 ] = 10
619667 cat
620668
621669.. note::
622- This also happens in some cases when you supply a `numpy ` array: using an int array
623- (e.g. ``np.array([1,2,3,4]) ``) will exhibit the same behaviour, but using a string
624- array (e.g. ``np.array(["a","b","c","a"]) ``) will not.
670+ This also happens in some cases when you supply a `numpy` array instead of a `Categorical`:
671+ using an int array (e.g. ``np.array([1,2,3,4])``) will exhibit the same behaviour, but using
672+ a string array (e.g. ``np.array(["a","b","c","a"])``) will not.
625673
626674
627675Danger of confusion
628676~~~~~~~~~~~~~~~~~~~
629677
630- Both `Series ` and `Categorical ` have a method ``.reorder_levels() `` . For Series of type
631- ``category `` this means that there is some danger to confuse both methods.
678+ Both `Series` and `Categorical` have a method ``.reorder_levels()``, but the two do different
679+ things. For a Series of type ``category`` this means there is some danger of confusing the two.
632680
633681.. ipython:: python
634682
635683 s = pd.Series(pd.Categorical([1 ,2 ,3 ,4 ]))
684+ print (s.cat.levels)
636685 # wrong and raises an error:
637686 try :
638687 s.reorder_levels([4 ,3 ,2 ,1 ])
639688 except Exception as e:
640689 print (" Exception: " + str (e))
641690 # right
642- print (s.cat.levels)
643- print ([4 ,3 ,2 ,1 ])
644691 s.cat.reorder_levels([4 ,3 ,2 ,1 ])
692+ print (s.cat.levels)
693+
694+ See also the API documentation for :func:`pandas.Series.reorder_levels` and
695+ :func:`pandas.Categorical.reorder_levels`.
645696
646697Old style constructor usage
647698~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -665,8 +716,8 @@ In the default case (``compat=False``) the first argument is interpreted as valu
665716
666717.. warning::
667718 Using Categorical with precomputed level_codes and levels is deprecated and a `FutureWarning `
668- is raised. Please change your code to use one of the proper constructor modes instead of
669- adding ``compat=False ``.
719+ is raised. Please change your code to use the :func:`~pandas.Categorical.from_codes`
720+ constructor instead of adding ``compat=False``.
670721
671722No categorical index
672723~~~~~~~~~~~~~~~~~~~~
@@ -682,9 +733,13 @@ ordering of the levels:
682733 values = [4 ,2 ,3 ,1 ]
683734 df = pd.DataFrame({" strings" :strings, " values" :values}, index = cats)
684735 df.index
685- # This should sort by levels but doesn't !
736+ # This should sort by levels but does not, as there is no CategoricalIndex!
686737 df.sort_index()
687738
739+ .. note::
740+ This could change if a `CategoricalIndex` is implemented (see
741+ https://github.com/pydata/pandas/issues/7629)
742+
688743dtype in apply
689744~~~~~~~~~~~~~~
690745