[Machine Learning] Data Preprocessing: Converting Categorical Data to Numeric Values

python · 搞代码 · 2022-01-09

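When doing data analysis in Python, the first step is data preprocessing.

Sometimes you have to deal with non-numeric, categorical data. This post covers how to handle it.

As far as I know, there are roughly three approaches:

1. Use LabelEncoder for a quick conversion;

2. Use a mapping to map each category to a numeric value, although this approach has limited applicability;

3. Use the get_dummies method.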
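One likely reason the mapping approach has limited applicability is that the dictionary must list every category by hand. The following is a minimal sketch, not taken from the original post: the sample series and the extra category `'S'` are assumptions made for illustration. It shows that `Series.map` turns any category missing from the dictionary into `NaN`.

```python
import pandas as pd

# Hypothetical sample data; 'S' is deliberately missing from the mapping.
sizes = pd.Series(['M', 'L', 'XL', 'S'])
size_mapping = {'XL': 3, 'L': 2, 'M': 1}

# map() looks each value up in the dict; unknown keys become NaN.
encoded = sizes.map(size_mapping)
print(encoded)

# Collect the categories that were not covered by the mapping.
unmapped = sizes[encoded.isna()].unique().tolist()
print(unmapped)
```

Checking for unmapped values like this before training is a cheap way to catch categories the dictionary forgot.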

```python
import numpy as np
import pandas as pd
from io import StringIO
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

csv_data = '''A,B,C,D
1,2,3,4
5,6,,8
0,11,12,'''

df = pd.read_csv(StringIO(csv_data))
print(df)
# Count the missing values in each column
print(df.isnull().sum())
print(df.values)

# Drop rows with missing values (dropna returns a new DataFrame)
print(df.dropna())
print('after', df)  # df itself is unchanged

# Fill missing values with the column mean.
# The original post used sklearn.preprocessing.Imputer(..., axis=0), which was
# removed in scikit-learn 0.22; SimpleImputer always works column-wise.
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr.fit(df.values)  # fit computes the mean of each column
imputed_data = imr.transform(df.values)  # transform fills in the missing values
print(imputed_data)

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'classlabel']
print(df)

# size is ordinal, so map it to integers by hand
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)
print(df)

# Iterate over a Series
for idx, label in enumerate(df['classlabel']):
    print(idx, label)

# 1. Encode quickly with LabelEncoder. This is not suitable for color,
#    because the integer codes would look as if the colors had an order.
class_le = LabelEncoder()
color_le = LabelEncoder()
df['classlabel'] = class_le.fit_transform(df['classlabel'].values)
# df['color'] = color_le.fit_transform(df['color'].values)
print(df)

# 2. Use a mapping dict to convert class labels to integers.
#    Note: classlabel was already encoded in step 1, so here the mapping
#    is simply {0: 0, 1: 1}.
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
df['classlabel'] = df['classlabel'].map(class_mapping)
print('2,', df)

# 3. For features where method 1 is unsuitable,
#    create new dummy (one-hot) features instead.
pf = pd.get_dummies(df[['color']])
df = pd.concat([df, pf], axis=1)
df.drop(['color'], axis=1, inplace=True)
print(df)
```
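The comment on method 1 says LabelEncoder is a poor fit for `color` because the codes look ordered. The sketch below tries to make that concrete on a standalone series (the variable names are illustrative, not from the original post). It compares the integer codes from `LabelEncoder` with the indicator columns from `pd.get_dummies`, and also shows `inverse_transform`, which recovers the original labels from the codes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Standalone copy of the color column used in the post.
colors = pd.Series(['green', 'red', 'blue'], name='color')

# LabelEncoder sorts the classes and numbers them 0..n-1,
# so the codes imply blue < green < red even though colors have no order.
color_le = LabelEncoder()
codes = color_le.fit_transform(colors.tolist())
code_of = {str(label): code for code, label in enumerate(color_le.classes_)}
print(code_of)

# inverse_transform maps the integer codes back to the original labels.
restored = color_le.inverse_transform(codes)
print(list(restored))

# get_dummies creates one indicator column per category instead,
# so no artificial ordering is introduced.
dummies = pd.get_dummies(colors, prefix='color')
print(dummies)
```

Each row of `dummies` has exactly one active indicator, which is why this encoding is usually preferred for nominal features such as color.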

 
