Using the category Data Type in Pandas

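Pandas has a special data type called category. It represents a categorical variable and is typically used for statistical classification: gender, blood type, class, grade, and so on. It is somewhat like an enum in Java. This article walks through how to use category in detail.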

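Creating a category

Creating from a Series

Passing dtype="category" when constructing a Series creates a categorical Series. A category has two parts: whether it is ordered, and the literal category values: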

    # The sessions in this article assume:
    #   import pandas as pd
    #   import numpy as np

    In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

    In [2]: s
    Out[2]:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']
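
As a small illustration of how a categorical Series stores its data, the sketch below looks at the category labels and at the integer codes that point into them. The variable names are illustrative and not part of the original example.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The distinct labels are stored once
categories = s.cat.categories.tolist()  # ['a', 'b', 'c']

# Each row holds a small integer pointing into the labels
codes = s.cat.codes.tolist()            # [0, 1, 2, 0]

print(categories)
print(codes)
```

Storing each label once is why category tends to pay off on columns with few distinct values.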

An existing Series in a DataFrame can be converted to category:

    In [3]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

    In [4]: df["B"] = df["A"].astype("category")

    In [5]: df["B"]
    Out[5]:
    0    a
    1    b
    2    c
    3    a
    Name: B, dtype: category
    Categories (3, object): ['a', 'b', 'c']

You can also build a pandas.Categorical first and pass it to Series. Values that are not listed in categories become NaN:

    In [10]: raw_cat = pd.Categorical(
       ....:     ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
       ....: )
       ....:

    In [11]: s = pd.Series(raw_cat)

    In [12]: s
    Out[12]:
    0    NaN
    1      b
    2      c
    3    NaN
    dtype: category
    Categories (3, object): ['b', 'c', 'd']
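
To make the NaN behaviour above easier to see, here is a minimal sketch that checks which rows fell outside the declared categories. It reuses the same data; the names missing and codes are illustrative.

```python
import pandas as pd

raw_cat = pd.Categorical(
    ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
)
s = pd.Series(raw_cat)

# "a" is not a declared category, so those rows are missing
missing = s.isna().tolist()   # [True, False, False, True]

# Missing rows get the code -1; "d" is declared but never used
codes = s.cat.codes.tolist()  # [-1, 0, 1, -1]

print(missing)
print(codes)
```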

Creating from a DataFrame

dtype="category" can also be passed when constructing a DataFrame:

    In [17]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

    In [18]: df.dtypes
    Out[18]:
    A    category
    B    category
    dtype: object

Columns A and B are both categorical, and each column's categories are inferred from its own values:

    In [19]: df["A"]
    Out[19]:
    0    a
    1    b
    2    c
    3    a
    Name: A, dtype: category
    Categories (3, object): ['a', 'b', 'c']

    In [20]: df["B"]
    Out[20]:
    0    b
    1    c
    2    c
    3    d
    Name: B, dtype: category
    Categories (3, object): ['b', 'c', 'd']

Alternatively, df.astype("category") converts every column of a DataFrame to category:

    In [21]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

    In [22]: df_cat = df.astype("category")

    In [23]: df_cat.dtypes
    Out[23]:
    A    category
    B    category
    dtype: object

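Controlling creation

By default, a category created with dtype='category' behaves as follows:

1. Categories are inferred from the data.

2. Categories are unordered.

Both defaults can be overridden by creating a CategoricalDtype explicitly: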

    In [26]: from pandas.api.types import CategoricalDtype

    In [27]: s = pd.Series(["a", "b", "c", "a"])

    In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)

    In [29]: s_cat = s.astype(cat_type)

    In [30]: s_cat
    Out[30]:
    0    NaN
    1      b
    2      c
    3    NaN
    dtype: category
    Categories (3, object): ['b' < 'c' < 'd']

The same CategoricalDtype can be applied to a DataFrame, in which case every column shares the same categories:

    In [31]: from pandas.api.types import CategoricalDtype

    In [32]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

    In [33]: cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)

    In [34]: df_cat = df.astype(cat_type)

    In [35]: df_cat["A"]
    Out[35]:
    0    a
    1    b
    2    c
    3    a
    Name: A, dtype: category
    Categories (4, object): ['a' < 'b' < 'c' < 'd']

    In [36]: df_cat["B"]
    Out[36]:
    0    b
    1    c
    2    c
    3    d
    Name: B, dtype: category
    Categories (4, object): ['a' < 'b' < 'c' < 'd']
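
The difference between inferred and explicitly shared categories can be checked by comparing column dtypes. The following is a minimal sketch on the same data; the names inferred, shared, inferred_same and shared_same are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

# Each column infers its own categories: A -> a, b, c and B -> b, c, d
inferred = df.astype("category")

# Every column uses the same explicit categories a, b, c, d
shared = df.astype(CategoricalDtype(categories=list("abcd"), ordered=True))

inferred_same = inferred["A"].dtype == inferred["B"].dtype  # False
shared_same = shared["A"].dtype == shared["B"].dtype        # True

print(inferred_same, shared_same)
```

Sharing one dtype matters when columns are later compared with each other, because ordered comparisons between categoricals require identical categories.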

Converting back to the original type

Use Series.astype(original_dtype) or np.asarray(categorical) to convert a category back to its original type:

    In [39]: s = pd.Series(["a", "b", "c", "a"])

    In [40]: s
    Out[40]:
    0    a
    1    b
    2    c
    3    a
    dtype: object

    In [41]: s2 = s.astype("category")

    In [42]: s2
    Out[42]:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']

    In [43]: s2.astype(str)
    Out[43]:
    0    a
    1    b
    2    c
    3    a
    dtype: object

    In [44]: np.asarray(s2)
    Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)
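
A quick way to confirm that the conversion is lossless is a round trip. This is a minimal sketch; it passes original.dtype to astype so that the result matches whatever dtype the source Series had, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

original = pd.Series(["a", "b", "c", "a"])
as_cat = original.astype("category")

# Back to the dtype the data started with
restored = as_cat.astype(original.dtype)

# Or drop down to a plain NumPy array of the values
as_array = np.asarray(as_cat)

print(restored.equals(original))
print(as_array.tolist())
```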

Working with categories

Accessing category attributes

Categorical data has two attributes, categories and ordered. They are available through s.cat.categories and s.cat.ordered:

    In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

    In [58]: s.cat.categories
    Out[58]: Index(['a', 'b', 'c'], dtype='object')

    In [59]: s.cat.ordered
    Out[59]: False

Categories can also be passed in a specific order, which is then kept instead of the sorted order of the values:

    In [60]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))

    In [61]: s.cat.categories
    Out[61]: Index(['c', 'b', 'a'], dtype='object')

    In [62]: s.cat.ordered
    Out[62]: False

Renaming categories

Categories are renamed with rename_categories. Older code assigned a new list to s.cat.categories directly, but that setter was deprecated in pandas 1.5 and removed in 2.0, so the example below uses the method instead:

    In [67]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

    In [68]: s
    Out[68]:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']

    In [69]: s = s.cat.rename_categories(["Group %s" % g for g in s.cat.categories])

    In [70]: s
    Out[70]:
    0    Group a
    1    Group b
    2    Group c
    3    Group a
    dtype: category
    Categories (3, object): ['Group a', 'Group b', 'Group c']

rename_categories accepts a list of new names of any type, for example integers:

    In [71]: s = s.cat.rename_categories([1, 2, 3])

    In [72]: s
    Out[72]:
    0    1
    1    2
    2    3
    3    1
    dtype: category
    Categories (3, int64): [1, 2, 3]

Or pass a dict-like object:

    # You can also pass a dict-like object to map the renaming
    In [73]: s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})

    In [74]: s
    Out[74]:
    0    x
    1    y
    2    z
    3    x
    dtype: category
    Categories (3, object): ['x', 'y', 'z']
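
Besides lists and dicts, rename_categories also accepts a callable that is applied to each category, and a dict only needs to mention the categories that change. The sketch below shows both on a fresh Series; the names upper and partial are illustrative.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# A callable is applied to every category label
upper = s.cat.rename_categories(lambda label: label.upper())

# A dict may rename only some labels; the rest pass through unchanged
partial = s.cat.rename_categories({"a": "alpha"})

print(upper.tolist())                   # ['A', 'B', 'C', 'A']
print(partial.cat.categories.tolist())  # ['alpha', 'b', 'c']
```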

Adding categories with add_categories

add_categories appends new categories to the existing ones:

    In [77]: s = s.cat.add_categories([4])

    In [78]: s.cat.categories
    Out[78]: Index(['x', 'y', 'z', 4], dtype='object')

    In [79]: s
    Out[79]:
    0    x
    1    y
    2    z
    3    x
    dtype: category
    Categories (4, object): ['x', 'y', 'z', 4]
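
A common reason to add a category is to assign a value that the Series does not know yet, since pandas rejects assignments of undeclared categories. The following is a minimal sketch; the label "w" is a made-up example value.

```python
import pandas as pd

s = pd.Series(["x", "y", "z", "x"], dtype="category")

# Declare the new label first, then assign it
s = s.cat.add_categories(["w"])
s.iloc[0] = "w"

print(s.tolist())                 # ['w', 'y', 'z', 'x']
print(s.cat.categories.tolist())  # ['x', 'y', 'z', 'w']
```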

Removing categories with remove_categories

    In [80]: s = s.cat.remove_categories([4])

    In [81]: s
    Out[81]:
    0    x
    1    y
    2    z
    3    x
    dtype: category
    Categories (3, object): ['x', 'y', 'z']
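
The example above removes a category that no row uses. When a removed category is in use, the affected rows become NaN, as this minimal sketch shows (the name trimmed is illustrative):

```python
import pandas as pd

s = pd.Series(["x", "y", "z", "x"], dtype="category")

# Rows that held the removed category "x" turn into NaN
trimmed = s.cat.remove_categories(["x"])

print(trimmed.isna().tolist())          # [True, False, False, True]
print(trimmed.cat.categories.tolist())  # ['y', 'z']
```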

Removing unused categories

    In [82]: s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))

    In [83]: s
    Out[83]:
    0    a
    1    b
    2    a
    dtype: category
    Categories (4, object): ['a', 'b', 'c', 'd']

    In [84]: s.cat.remove_unused_categories()
    Out[84]:
    0    a
    1    b
    2    a
    dtype: category
    Categories (2, object): ['a', 'b']
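
Unused categories often appear after filtering, because selecting rows does not shrink the set of categories. A minimal sketch of that situation, with illustrative variable names:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Filtering drops the rows but keeps "c" as a declared category
filtered = s[s != "c"]
before = filtered.cat.categories.tolist()  # ['a', 'b', 'c']

# Dropping unused categories brings the dtype in line with the data
cleaned = filtered.cat.remove_unused_categories()
after = cleaned.cat.categories.tolist()    # ['a', 'b']

print(before, after)
```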

Setting categories

set_categories() adds and removes categories in one step. Values whose category is no longer listed become NaN:

    In [85]: s = pd.Series(["one", "two", "four", "-"], dtype="category")

    In [86]: s
    Out[86]:
    0     one
    1     two
    2    four
    3       -
    dtype: category
    Categories (4, object): ['-', 'four', 'one', 'two']

    In [87]: s = s.cat.set_categories(["one", "two", "three", "four"])

    In [88]: s
    Out[88]:
    0     one
    1     two
    2    four
    3     NaN
    dtype: category
    Categories (4, object): ['one', 'two', 'three', 'four']

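Sorting categories

If a category is created with ordered=True, the order of its categories is meaningful: sort_values() sorts by category order, and min() and max() are available (they raise TypeError on unordered data):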

    In [91]: s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))

    In [92]: s.sort_values(inplace=True)

    In [93]: s
    Out[93]:
    0    a
    3    a
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a' < 'b' < 'c']

    In [94]: s.min(), s.max()
    Out[94]: ('a', 'c')
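
In the example above the category order happens to match alphabetical order. The sketch below uses a made-up size scale, where the logical order differs from the lexical one, to show that sorting and min/max follow the declared order.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical scale: small < medium < large (not alphabetical)
size_type = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes = pd.Series(["large", "small", "medium", "small"]).astype(size_type)

sorted_sizes = sizes.sort_values().tolist()
smallest, largest = sizes.min(), sizes.max()

print(sorted_sizes)       # ['small', 'small', 'medium', 'large']
print(smallest, largest)  # small large
```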

Use as_ordered() or as_unordered() to mark the categories as ordered or unordered:

    In [95]: s.cat.as_ordered()
    Out[95]:
    0    a
    3    a
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a' < 'b' < 'c']

    In [96]: s.cat.as_unordered()
    Out[96]:
    0    a
    3    a
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a', 'b', 'c']

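Reordering

Categorical.reorder_categories() changes the order of the existing categories. The new list must contain exactly the same categories as the old one: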

    In [103]: s = pd.Series([1, 2, 3, 1], dtype="category")

    In [104]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)

    In [105]: s
    Out[105]:
    0    1
    1    2
    2    3
    3    1
    dtype: category
    Categories (3, int64): [2 < 3 < 1]

Multi-column sorting

sort_values supports sorting by multiple columns:

    In [109]: dfs = pd.DataFrame(
       .....:     {
       .....:         "A": pd.Categorical(
       .....:             list("bbeebbaa"),
       .....:             categories=["e", "a", "b"],
       .....:             ordered=True,
       .....:         ),
       .....:         "B": [1, 2, 1, 2, 2, 1, 2, 1],
       .....:     }
       .....: )
       .....:

    In [110]: dfs.sort_values(by=["A", "B"])
    Out[110]:
       A  B
    2  e  1
    3  e  2
    7  a  1
    6  a  2
    0  b  1
    5  b  1
    1  b  2
    4  b  2

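Comparisons

If ordered=True was set at creation, categorical data supports all of ==, !=, >, >=, <, and <=. Order comparisons work against a scalar or against another categorical with the same categories and order; == and != also work on unordered data.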

    In [113]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

    In [114]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))

    In [115]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

    In [119]: cat > cat_base
    Out[119]:
    0     True
    1    False
    2    False
    dtype: bool

    In [120]: cat > 2
    Out[120]:
    0     True
    1    False
    2    False
    dtype: bool
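
The session above defines cat_base2 without using it. The sketch below shows, under the same definitions, the two cases where a comparison is rejected: categoricals with different categories, and order comparisons on unordered data. The names mismatch_error, unordered and unordered_error are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

# Equality against a scalar always works
equal_to_two = (cat == 2).tolist()  # [False, True, False]

# cat_base2 only has the category 2, so the categories differ
try:
    cat > cat_base2
    mismatch_error = None
except TypeError as exc:
    mismatch_error = str(exc)

# Order comparisons need ordered=True
unordered = pd.Series(["a", "b"], dtype="category")
try:
    unordered > "a"
    unordered_error = None
except TypeError as exc:
    unordered_error = str(exc)

print(equal_to_two)
print(mismatch_error)
print(unordered_error)
```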

Other operations

A categorical Series is still a Series, so most Series operations work on it, for example Series.min(), Series.max(), and Series.mode().

value_counts (unused categories appear with a count of 0):

    In [131]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))

    In [132]: s.value_counts()
    Out[132]:
    c    2
    a    1
    b    1
    d    0
    dtype: int64

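DataFrame.sum() (this example relies on the level argument, which was deprecated in pandas 1.3 and removed in 2.0, so it only runs on older versions):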

    In [133]: columns = pd.Categorical(
       .....:     ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
       .....: )
       .....:

    In [134]: df = pd.DataFrame(
       .....:     data=[[1, 2, 3], [4, 5, 6]],
       .....:     columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
       .....: )
       .....:

    In [135]: df.sum(axis=1, level=1)
    Out[135]:
       One  Two  Three
    0    3    3      0
    1    9    6      0
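
On pandas versions where sum() no longer takes a level argument, one way to get the same table is to transpose, group by the column level, and transpose back. This is a sketch of that workaround rather than the original author's code; the name result is illustrative.

```python
import pandas as pd

columns = pd.Categorical(
    ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
)
df = pd.DataFrame(
    data=[[1, 2, 3], [4, 5, 6]],
    columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
)

# Group the transposed frame by the categorical level;
# observed=False keeps the unused category "Three" in the result
result = df.T.groupby(level=1, observed=False).sum().T

print(result)
```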

Groupby (observed=False is passed explicitly so that unused categories show up in the result regardless of the pandas version's default):

    In [136]: cats = pd.Categorical(
       .....:     ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
       .....: )
       .....:

    In [137]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

    In [138]: df.groupby("cats", observed=False).mean()
    Out[138]:
          values
    cats
    a        1.0
    b        2.0
    c        4.0
    d        NaN

    In [139]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

    In [140]: df2 = pd.DataFrame(
       .....:     {
       .....:         "cats": cats2,
       .....:         "B": ["c", "d", "c", "d"],
       .....:         "values": [1, 2, 3, 4],
       .....:     }
       .....: )
       .....:

    In [141]: df2.groupby(["cats", "B"], observed=False).mean()
    Out[141]:
            values
    cats B
    a    c     1.0
         d     2.0
    b    c     3.0
         d     4.0
    c    c     NaN
         d     NaN
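
Whether unused categories appear in a groupby result is controlled by the observed parameter. The sketch below runs the first grouping both ways on the same data; the names by_all and by_seen are illustrative.

```python
import pandas as pd

cats = pd.Categorical(
    ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
)
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

# observed=False: every declared category gets a row, "d" is NaN
by_all = df.groupby("cats", observed=False)["values"].mean()

# observed=True: only categories that occur in the data
by_seen = df.groupby("cats", observed=True)["values"].mean()

print(by_all.index.tolist())   # ['a', 'b', 'c', 'd']
print(by_seen.index.tolist())  # ['a', 'b', 'c']
```

Passing observed explicitly avoids depending on the default, which differs between pandas releases.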

Pivot tables:

    In [142]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

    In [143]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})

    In [144]: pd.pivot_table(df, values="values", index=["A", "B"])
    Out[144]:
         values
    A B
    a c       1
      d       2
    b c       3
      d       4
