Tag archive: pandas

Apply multiple functions to multiple groupby columns

Question: Apply multiple functions to multiple groupby columns

The docs show how to apply multiple functions on a groupby object at a time using a dict with the output column names as the keys:

In [563]: grouped['D'].agg({'result1' : np.sum,
   .....:                   'result2' : np.mean})
   .....:
Out[563]: 
      result2   result1
A                      
bar -0.579846 -1.739537
foo -0.280588 -1.402938

However, this only works on a Series groupby object. And when a dict is similarly passed to a groupby DataFrame, it expects the keys to be the column names that the function will be applied to.
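For illustration, here is a minimal sketch of that column-keyed form (the frame and column names are made up for the example):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                   'C': [1, 2, 3, 4],
                   'D': [10, 20, 30, 40]})
grouped = df.groupby('A')

# With a DataFrame groupby, the dict keys must be existing column names,
# and each value is the function applied to that column.
print(grouped.agg({'C': np.sum, 'D': np.mean}))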

What I want to do is apply multiple functions to several columns (but certain columns will be operated on multiple times). Also, some functions will depend on other columns in the groupby object (like sumif functions). My current solution is to go column by column, doing something like the code above and using lambdas for functions that depend on other rows. But this is taking a long time (I think it takes a long time to iterate through a groupby object). I’ll have to change it so that I iterate through the whole groupby object in a single run, but I’m wondering if there’s a built-in way in pandas to do this somewhat cleanly.

For example, I’ve tried something like

grouped.agg({'C_sum' : lambda x: x['C'].sum(),
             'C_std': lambda x: x['C'].std(),
             'D_sum' : lambda x: x['D'].sum(),
             'D_sumifC3': lambda x: x['D'][x['C'] == 3].sum(), ...})

but as expected I get a KeyError (since the keys have to be a column if agg is called from a DataFrame).

Is there any built-in way to do what I’d like to do, or a possibility that this functionality may be added, or will I just need to iterate through the groupby manually?

Thanks


Answer 0

The second half of the currently accepted answer is outdated and has two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix.

If you desire to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let’s use a dataframe similar to the one from above:

df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df

          a         b         c         d  group
0  0.418500  0.030955  0.874869  0.145641      0
1  0.446069  0.901153  0.095052  0.487040      0
2  0.843026  0.936169  0.926090  0.041722      1
3  0.635846  0.439175  0.828787  0.714123      1

A dictionary mapping column names to aggregation functions is still a perfectly good way to perform an aggregation:

df.groupby('group').agg({'a':['sum', 'max'], 
                         'b':'mean', 
                         'c':'sum', 
                         'd': lambda x: x.max() - x.min()})

              a                   b         c         d
            sum       max      mean       sum  <lambda>
group                                                  
0      0.864569  0.446069  0.466054  0.969921  0.341399
1      1.478872  0.843026  0.687672  1.754877  0.672401

If you don’t like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:

def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('group').agg({'a':['sum', 'max'], 
                         'b':'mean', 
                         'c':'sum', 
                         'd': max_min})

              a                   b         c             d
            sum       max      mean       sum Max minus Min
group                                                      
0      0.864569  0.446069  0.466054  0.969921      0.341399
1      1.478872  0.843026  0.687672  1.754877      0.672401

Using apply and returning a Series

Now, if you have multiple columns that need to interact together, then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply, the entire group is passed into the function as a DataFrame.

I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)

         a_sum     a_max    b_mean  c_d_prodsum
group                                           
0      0.864569  0.446069  0.466054     0.173711
1      1.478872  0.843026  0.687672     0.630494

If you are in love with MultiIndexes, you can still return a Series with one like this:

def f_mi(x):
    d = []
    d.append(x['a'].sum())
    d.append(x['a'].max())
    d.append(x['b'].mean())
    d.append((x['c'] * x['d']).sum())
    return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                               ['sum', 'max', 'mean', 'prodsum']])

df.groupby('group').apply(f_mi)

              a                   b       c_d
            sum       max      mean   prodsum
group                                        
0      0.864569  0.446069  0.466054  0.173711
1      1.478872  0.843026  0.687672  0.630494

Answer 1

For the first part you can pass a dict of column names for keys and a list of functions for the values:

In [28]: df
Out[28]:
          A         B         C         D         E  GRP
0  0.395670  0.219560  0.600644  0.613445  0.242893    0
1  0.323911  0.464584  0.107215  0.204072  0.927325    0
2  0.321358  0.076037  0.166946  0.439661  0.914612    1
3  0.133466  0.447946  0.014815  0.130781  0.268290    1

In [26]: f = {'A':['sum','mean'], 'B':['prod']}

In [27]: df.groupby('GRP').agg(f)
Out[27]:
            A                   B
          sum      mean      prod
GRP
0    0.719580  0.359790  0.102004
1    0.454824  0.227412  0.034060

UPDATE 1:

Because the aggregate function works on Series, references to the other column names are lost. To get around this, you can reference the full dataframe and index it using the group indices within the lambda function.

Here’s a hacky workaround:

In [67]: f = {'A':['sum','mean'], 'B':['prod'], 'D': lambda g: df.loc[g.index].E.sum()}

In [69]: df.groupby('GRP').agg(f)
Out[69]:
            A                   B         D
          sum      mean      prod  <lambda>
GRP
0    0.719580  0.359790  0.102004  1.170219
1    0.454824  0.227412  0.034060  1.182901

Here, the resultant ‘D’ column is made up of the summed ‘E’ values.

UPDATE 2:

Here’s a method that I think will do everything you ask. First make a custom lambda function. Below, g references the group. When aggregating, g will be a Series. Passing g.index to df.loc[] selects the current group from df. I then test if column C is less than 0.5. The returned boolean series is passed to g[], which selects only those rows meeting the criteria.

In [95]: cust = lambda g: g[df.loc[g.index]['C'] < 0.5].sum()

In [96]: f = {'A':['sum','mean'], 'B':['prod'], 'D': {'my name': cust}}

In [97]: df.groupby('GRP').agg(f)
Out[97]:
            A                   B         D
          sum      mean      prod   my name
GRP
0    0.719580  0.359790  0.102004  0.204072
1    0.454824  0.227412  0.034060  0.570441

Answer 2

As an alternative (mostly on aesthetics) to Ted Petrou’s answer, I found I preferred a slightly more compact listing. Please don’t consider accepting it; it’s just a much-more-detailed comment on Ted’s answer, plus code/data. Python/pandas is not my first/best, but I found this to read well:

df.groupby('group') \
  .apply(lambda x: pd.Series({
      'a_sum'       : x['a'].sum(),
      'a_max'       : x['a'].max(),
      'b_mean'      : x['b'].mean(),
      'c_d_prodsum' : (x['c'] * x['d']).sum()
  })
)

          a_sum     a_max    b_mean  c_d_prodsum
group                                           
0      0.530559  0.374540  0.553354     0.488525
1      1.433558  0.832443  0.460206     0.053313

I find it more reminiscent of dplyr pipes and data.table chained commands. Not to say they’re better, just more familiar to me. (I certainly recognize the power and, for many, the preference of using more formalized def functions for these types of operations. This is just an alternative, not necessarily better.)


I generated data in the same manner as Ted; I’ll add a seed for reproducibility.

import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df

          a         b         c         d  group
0  0.374540  0.950714  0.731994  0.598658      0
1  0.156019  0.155995  0.058084  0.866176      0
2  0.601115  0.708073  0.020584  0.969910      1
3  0.832443  0.212339  0.181825  0.183405      1

Answer 3

Pandas >= 0.25.0, named aggregations

Since pandas version 0.25.0, we are moving away from dictionary-based aggregation and renaming, and moving towards named aggregation, which accepts a tuple. Now we can simultaneously aggregate and rename to a more informative column name:

Example:

df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]

          a         b         c         d  group
0  0.521279  0.914988  0.054057  0.125668      0
1  0.426058  0.828890  0.784093  0.446211      0
2  0.363136  0.843751  0.184967  0.467351      1
3  0.241012  0.470053  0.358018  0.525032      1

Apply GroupBy.agg with named aggregation:

df.groupby('group').agg(
             a_sum=('a', 'sum'),
             a_mean=('a', 'mean'),
             b_mean=('b', 'mean'),
             c_sum=('c', 'sum'),
             d_range=('d', lambda x: x.max() - x.min())
)

          a_sum    a_mean    b_mean     c_sum   d_range
group                                                  
0      0.947337  0.473668  0.871939  0.838150  0.320543
1      0.604149  0.302074  0.656902  0.542985  0.057681

Answer 4

New in version 0.25.0.

To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where

  • The keywords are the output column names
  • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields [‘column’, ‘aggfunc’] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.
    In [79]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
       ....:                         'height': [9.1, 6.0, 9.5, 34.0],
       ....:                         'weight': [7.9, 7.5, 9.9, 198.0]})
       ....: 

    In [80]: animals
    Out[80]: 
      kind  height  weight
    0  cat     9.1     7.9
    1  dog     6.0     7.5
    2  cat     9.5     9.9
    3  dog    34.0   198.0

    In [81]: animals.groupby("kind").agg(
       ....:     min_height=pd.NamedAgg(column='height', aggfunc='min'),
       ....:     max_height=pd.NamedAgg(column='height', aggfunc='max'),
       ....:     average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
       ....: )
       ....: 
    Out[81]: 
          min_height  max_height  average_weight
    kind                                        
    cat          9.1         9.5            8.90
    dog          6.0        34.0          102.75

pandas.NamedAgg is just a namedtuple. Plain tuples are allowed as well.

    In [82]: animals.groupby("kind").agg(
       ....:     min_height=('height', 'min'),
       ....:     max_height=('height', 'max'),
       ....:     average_weight=('weight', np.mean),
       ....: )
       ....: 
    Out[82]: 
          min_height  max_height  average_weight
    kind                                        
    cat          9.1         9.5            8.90
    dog          6.0        34.0          102.75

Additional keyword arguments are not passed through to the aggregation functions. Only pairs of (column, aggfunc) should be passed as **kwargs. If your aggregation functions require additional arguments, partially apply them with functools.partial().
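As a hedged sketch of that functools.partial pattern (the 0.75 quantile cutoff here is just an illustration):

import functools
import pandas as pd

animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                        'height': [9.1, 6.0, 9.5, 34.0]})

# quantile takes an extra argument q, so partially apply it first
q75 = functools.partial(pd.Series.quantile, q=0.75)

animals.groupby("kind").agg(height_q75=('height', q75))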

Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.

    In [84]: animals.groupby("kind").height.agg(
       ....:     min_height='min',
       ....:     max_height='max',
       ....: )
       ....: 
    Out[84]: 
          min_height  max_height
    kind                        
    cat          9.1         9.5
    dog          6.0        34.0

Answer 5

Ted’s answer is amazing. I ended up using a smaller version of that in case anyone is interested. Useful when you are looking for one aggregation that depends on values from multiple columns:

create a dataframe

df=pd.DataFrame({'a': [1,2,3,4,5,6], 'b': [1,1,0,1,1,0], 'c': ['x','x','y','y','z','z']})


   a  b  c
0  1  1  x
1  2  1  x
2  3  0  y
3  4  1  y
4  5  1  z
5  6  0  z

grouping and aggregating with apply (using multiple columns)

df.groupby('c').apply(lambda x: x['a'][(x['a']>1) & (x['b']==1)].mean())

c
x    2.0
y    4.0
z    5.0

grouping and aggregating with aggregate (using multiple columns)

I like this approach since I can still use aggregate. Perhaps people will let me know why apply is needed for getting at multiple columns when doing aggregations on groups.

It seems obvious now, but as long as you don’t select the column of interest directly after the groupby, you will have access to all the columns of the dataframe from within your aggregation function.

only access to the selected column

df.groupby('c')['a'].aggregate(lambda x: x[x>1].mean())

access to all columns since selection is after all the magic

df.groupby('c').aggregate(lambda x: x[(x['a']>1) & (x['b']==1)].mean())['a']

or similarly

df.groupby('c').aggregate(lambda x: x['a'][(x['a']>1) & (x['b']==1)].mean())

I hope this helps.


Get column index from column name in python pandas

Question: Get column index from column name in python pandas

In R when you need to retrieve a column index based on the name of the column you could do

idx <- which(names(my_data)==my_colum_name)

Is there a way to do the same with pandas dataframes?


Answer 0

Sure, you can use .get_loc():

In [45]: df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})

In [46]: df.columns
Out[46]: Index([apple, orange, pear], dtype=object)

In [47]: df.columns.get_loc("pear")
Out[47]: 2

although to be honest I don’t often need this myself. Usually access by name does what I want it to (df["pear"], df[["apple", "orange"]], or maybe df.columns.isin(["orange", "pear"])), although I can definitely see cases where you’d want the index number.
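One such case, as a small sketch continuing the frame above: positional selection with iloc, which takes integer locations rather than labels:

# same Series as df["pear"], but selected by integer position
df.iloc[:, df.columns.get_loc("pear")]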


Answer 1

Here is a solution through list comprehension. cols is the list of columns to get index for:

[df.columns.get_loc(c) for c in cols if c in df]

Answer 2

DSM’s solution works, but if you wanted a direct equivalent to R’s which, you could do (df.columns == name).nonzero().
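A quick sketch of what that returns (a tuple with one array of matching positions per dimension); the frame is made up here, and with columns ordered apple, orange, pear as in the answer above, the result is (array([2]),):

import pandas as pd

df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
print((df.columns == "pear").nonzero())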


Answer 3

When you might be looking to find multiple column matches, a vectorized solution using the searchsorted method could be used. Thus, with df as the dataframe and query_cols as the column names to be searched for, an implementation would be –

import numpy as np

def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]

Sample run –

In [162]: df
Out[162]: 
   apple  banana  pear  orange  peach
0      8       3     4       4      2
1      4       4     3       0      1
2      1       2     6       8      1

In [163]: column_index(df, ['peach', 'banana', 'apple'])
Out[163]: array([4, 1, 0])

Answer 4

In case you want the column name from the column location (the reverse of the OP’s question), you can use:

>>> df.columns.get_values()[location]

Using @DSM Example:

>>> df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})

>>> df.columns

Index(['apple', 'orange', 'pear'], dtype='object')

>>> df.columns.get_values()[1]

'orange'

Other ways:

df.iloc[:,1].name

df.columns[location] #(thanks to @roobie-nuby for pointing that out in comments.) 

Answer 5

How about this:

df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
out = np.argwhere(df.columns.isin(['apple', 'orange'])).ravel()
print(out)
[1 2]

Extracting just month and year separately from pandas datetime column

Question: Extracting just month and year separately from pandas datetime column

I have a Dataframe, df, with the following column:

df['ArrivalDate'] =
...
936   2012-12-31
938   2012-12-29
965   2012-12-31
966   2012-12-31
967   2012-12-31
968   2012-12-31
969   2012-12-31
970   2012-12-29
971   2012-12-31
972   2012-12-29
973   2012-12-29
...

The elements of the column are pandas.tslib.Timestamp.

I want to just include the year and month. I thought there would be a simple way to do it, but I can’t figure it out.

Here’s what I’ve tried:

df['ArrivalDate'].resample('M', how = 'mean')

I got the following error:

Only valid with DatetimeIndex or PeriodIndex 

Then I tried:

df['ArrivalDate'].apply(lambda(x):x[:-2])

I got the following error:

'Timestamp' object has no attribute '__getitem__' 

Any suggestions?

Edit: I sort of figured it out.

df.index = df['ArrivalDate']

Then, I can resample another column using the index.

But I’d still like a method for reconfiguring the entire column. Any ideas?


Answer 0

If you want new columns showing year and month separately you can do this:

df['year'] = pd.DatetimeIndex(df['ArrivalDate']).year
df['month'] = pd.DatetimeIndex(df['ArrivalDate']).month

or…

df['year'] = df['ArrivalDate'].dt.year
df['month'] = df['ArrivalDate'].dt.month

Then you can combine them or work with them just as they are.
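For example, a minimal sketch of combining them afterwards into one sortable key (the year_month column name is made up):

import pandas as pd

df = pd.DataFrame({'ArrivalDate': pd.to_datetime(['2012-12-31', '2012-12-29'])})
df['year'] = df['ArrivalDate'].dt.year
df['month'] = df['ArrivalDate'].dt.month

# combine into strings like '2012-12'
df['year_month'] = df['year'].astype(str) + '-' + df['month'].astype(str).str.zfill(2)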


Answer 1

Best way found!!

The df['date_column'] has to be in datetime format.

df['month_year'] = df['date_column'].dt.to_period('M')

You can also use D for day, 2M for two months, etc., for different sampling intervals; and if you have time-series data with timestamps, you can go for granular sampling intervals such as 45Min for 45-minute or 15Min for 15-minute sampling.
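As a small sketch of what the period conversion produces (output shown as comments):

import pandas as pd

df = pd.DataFrame({'date_column': pd.to_datetime(['2012-12-31', '2012-12-29'])})
df['month_year'] = df['date_column'].dt.to_period('M')
print(df['month_year'])
# 0    2012-12
# 1    2012-12
# Name: month_year, dtype: period[M]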


Answer 2

You can directly access the year and month attributes, or request a datetime.datetime:

In [15]: t = pandas.tslib.Timestamp.now()

In [16]: t
Out[16]: Timestamp('2014-08-05 14:49:39.643701', tz=None)

In [17]: t.to_pydatetime() #datetime method is deprecated
Out[17]: datetime.datetime(2014, 8, 5, 14, 49, 39, 643701)

In [18]: t.day
Out[18]: 5

In [19]: t.month
Out[19]: 8

In [20]: t.year
Out[20]: 2014

One way to combine year and month is to make an integer encoding them, such as: 201408 for August, 2014. Along a whole column, you could do this as:

df['YearMonth'] = df['ArrivalDate'].map(lambda x: 100*x.year + x.month)

or many variants thereof.

I’m not a big fan of doing this, though, since it makes date alignment and arithmetic painful later and especially painful for others who come upon your code or data without this same convention. A better way is to choose a day-of-month convention, such as final non-US-holiday weekday, or first day, etc., and leave the data in a date/time format with the chosen date convention.

The calendar module is useful for obtaining the number value of certain days such as the final weekday. Then you could do something like:

import calendar
import datetime
df['AdjustedDateToEndOfMonth'] = df['ArrivalDate'].map(
    lambda x: datetime.datetime(
        x.year,
        x.month,
        max(calendar.monthcalendar(x.year, x.month)[-1][:5])
    )
)

If you happen to be looking for a way to solve the simpler problem of just formatting the datetime column into some stringified representation, for that you can just make use of the strftime function from the datetime.datetime class, like this:

In [5]: df
Out[5]: 
            date_time
0 2014-10-17 22:00:03

In [6]: df.date_time
Out[6]: 
0   2014-10-17 22:00:03
Name: date_time, dtype: datetime64[ns]

In [7]: df.date_time.map(lambda x: x.strftime('%Y-%m-%d'))
Out[7]: 
0    2014-10-17
Name: date_time, dtype: object

Answer 3

If you want a unique month-year pair, using apply is pretty sleek.

df['mnth_yr'] = df['date_column'].apply(lambda x: x.strftime('%B-%Y')) 

Outputs month-year in one column.

Don’t forget to change the format to datetime first (I generally forget):

df['date_column'] = pd.to_datetime(df['date_column'])

Answer 4

Extracting the year, say, from ['2018-03-04']:

df['Year'] = pd.DatetimeIndex(df['date']).year  

df['Year'] creates a new column. If you want to extract the month instead, just use .month.
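That is, a one-line sketch for the month (assuming the same df['date'] column):

df['Month'] = pd.DatetimeIndex(df['date']).month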


Answer 5

You can first convert your date strings with pandas.to_datetime, which gives you access to all of the numpy datetime and timedelta facilities. For example:

df['ArrivalDate'] = pandas.to_datetime(df['ArrivalDate'])
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')

Answer 6

Thanks to jaknap32, I wanted to aggregate the results according to Year and Month, so this worked:

df_join['YearMonth'] = df_join['timestamp'].apply(lambda x:x.strftime('%Y%m'))

Output was neat:

0    201108
1    201108
2    201108

Answer 7

@KieranPC’s solution is the correct approach for Pandas, but is not easily extensible for arbitrary attributes. For this, you can use getattr within a generator comprehension and combine using pd.concat:

# input data
list_of_dates = ['2012-12-31', '2012-12-29', '2012-12-30']
df = pd.DataFrame({'ArrivalDate': pd.to_datetime(list_of_dates)})

# define list of attributes required    
L = ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter']

# define generator expression of series, one for each attribute
date_gen = (getattr(df['ArrivalDate'].dt, i).rename(i) for i in L)

# concatenate results and join to original dataframe
df = df.join(pd.concat(date_gen, axis=1))

print(df)

  ArrivalDate  year  month  day  dayofweek  dayofyear  weekofyear  quarter
0  2012-12-31  2012     12   31          0        366           1        4
1  2012-12-29  2012     12   29          5        364          52        4
2  2012-12-30  2012     12   30          6        365          52        4

Answer 8

df['year_month']=df.datetime_column.apply(lambda x: str(x)[:7])

This worked fine for me. I didn’t think pandas would interpret the resultant string date as a date, but when I did the plot, it knew my agenda very well and the year_month strings were ordered properly… gotta love pandas!


Answer 9

There are two steps to extract the year for the whole dataframe without using apply.

Step 1

Convert the column to datetime:

df['ArrivalDate']=pd.to_datetime(df['ArrivalDate'], format='%Y-%m-%d')

Step 2

Extract the year or the month using the DatetimeIndex() method:

 pd.DatetimeIndex(df['ArrivalDate']).year

Answer 10

SINGLE LINE: Adding a column with ‘year-month’ pairs (pd.to_datetime first changes the column dtype to datetime before the operation):

df['yyyy-mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y-%m')


Accordingly for an extra ‘year’ or ‘month’ column:

df['yyyy'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y')

df['mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%m')


Dropping infinite values from dataframes in pandas?

Question: Dropping infinite values from dataframes in pandas?

What is the quickest/simplest way to drop nan and inf/-inf values from a pandas DataFrame without resetting mode.use_inf_as_null? I’d like to be able to use the subset and how arguments of dropna, except with inf values considered missing, like:

df.dropna(subset=["col1", "col2"], how="all", with_inf=True)

is this possible? Is there a way to tell dropna to include inf in its definition of missing values?


Answer 0

The simplest way would be to first replace infs with NaN:

df.replace([np.inf, -np.inf], np.nan)

and then use the dropna:

df.replace([np.inf, -np.inf], np.nan).dropna(subset=["col1", "col2"], how="all")

For example:

In [11]: df = pd.DataFrame([1, 2, np.inf, -np.inf])

In [12]: df.replace([np.inf, -np.inf], np.nan)
Out[12]:
    0
0   1
1   2
2 NaN
3 NaN

The same method would work for a Series.
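For instance, a minimal sketch on a Series (output shown as comments):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.inf, -np.inf])
print(s.replace([np.inf, -np.inf], np.nan).dropna())
# 0    1.0
# 1    2.0
# dtype: float64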


Answer 1

With option context, this is possible without permanently setting use_inf_as_na. For example:

with pd.option_context('mode.use_inf_as_na', True):
    df = df.dropna(subset=['col1', 'col2'], how='all')

Of course it can be set to treat inf as NaN permanently with

pd.set_option('use_inf_as_na', True)

For older versions, replace use_inf_as_na with use_inf_as_null.


Answer 2

Here is another method using .loc to replace inf with nan on a Series:

s.loc[(~np.isfinite(s)) & s.notnull()] = np.nan

So, in response to the original question:

df = pd.DataFrame(np.ones((3, 3)), columns=list('ABC'))

for i in range(3): 
    df.iat[i, i] = np.inf

df
          A         B         C
0       inf  1.000000  1.000000
1  1.000000       inf  1.000000
2  1.000000  1.000000       inf

df.sum()
A    inf
B    inf
C    inf
dtype: float64

df.apply(lambda s: s[np.isfinite(s)].dropna()).sum()
A    2
B    2
C    2
dtype: float64

Answer 3

Use (fast and simple):

df = df[np.isfinite(df).all(1)]

This answer is based on DougR’s answer in another question. Here is some example code:

import pandas as pd
import numpy as np
df=pd.DataFrame([1,2,3,np.nan,4,np.inf,5,-np.inf,6])
print('Input:\n',df,sep='')
df = df[np.isfinite(df).all(1)]
print('\nDropped:\n',df,sep='')

Result:

Input:
    0
0  1.0000
1  2.0000
2  3.0000
3     NaN
4  4.0000
5     inf
6  5.0000
7    -inf
8  6.0000

Dropped:
     0
0  1.0
1  2.0
2  3.0
4  4.0
6  5.0
8  6.0

Answer 4

Yet another solution would be to use the isin method. Use it to determine whether each value is infinite or missing and then chain the all method to determine if all the values in the rows are infinite or missing.

Finally, use the negation of that result to select the rows that don’t have all infinite or missing values via boolean indexing.

all_inf_or_nan = df.isin([np.inf, -np.inf, np.nan]).all(axis='columns')
df[~all_inf_or_nan]

Answer 5

The above solution will modify the infs that are not in the target columns. To remedy that,

lst = [np.inf, -np.inf]
to_replace = {v: lst for v in ['col1', 'col2']}
df.replace(to_replace, np.nan)

Answer 6

You can use pd.DataFrame.mask with np.isinf. You should first ensure your dataframe series are all of type float. Then use dropna with your existing logic.

print(df)

       col1      col2
0 -0.441406       inf
1 -0.321105      -inf
2 -0.412857  2.223047
3 -0.356610  2.513048

df = df.mask(np.isinf(df))

print(df)

       col1      col2
0 -0.441406       NaN
1 -0.321105       NaN
2 -0.412857  2.223047
3 -0.356610  2.513048
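Then, as a sketch of the final chain with the question’s subset/how arguments:

df = df.mask(np.isinf(df)).dropna(subset=['col1', 'col2'], how='all')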

Replace NaN with blank/empty string in pandas

Question: Replace NaN with blank/empty string in pandas

I have a Pandas Dataframe as shown below:

    1    2       3
 0  a  NaN    read
 1  b    l  unread
 2  c  NaN    read

I want to replace the NaN values with an empty string so that it looks like this:

    1    2       3
 0  a   ""    read
 1  b    l  unread
 2  c   ""    read

Answer 0

import numpy as np
df1 = df.replace(np.nan, '', regex=True)

This might help. It will replace all NaNs with an empty string.


Answer 1

df = df.fillna('')

or just

df.fillna('', inplace=True)

This will fill na’s (e.g. NaN’s) with ''.

If you want to fill a single column, you can use:

df.column1 = df.column1.fillna('')

One can use df['column1'] instead of df.column1.


Answer 2

If you are reading the dataframe from a file (say CSV or Excel) then use:

  • pd.read_csv(path, na_filter=False)
  • pd.read_excel(path, na_filter=False)

This will automatically consider the empty fields as empty strings ''


If you already have the dataframe

  • df = df.replace(np.nan, '', regex=True)
  • df = df.fillna('')

Answer 3

Use a formatter if you only want to format it so that it renders nicely when printed. Just use df.to_string(..., formatters=...) to define custom string formatting, without needlessly modifying your DataFrame or wasting memory:

df = pd.DataFrame({
    'A': ['a', 'b', 'c'],
    'B': [np.nan, 1, np.nan],
    'C': ['read', 'unread', 'read']})
print(df.to_string(
    formatters={'B': lambda x: '' if pd.isnull(x) else '{:.0f}'.format(x)}))

To get:

   A B       C
0  a      read
1  b 1  unread
2  c      read

Answer 4

Try this, with inplace=True:

import numpy as np
df.replace(np.NaN, ' ', inplace=True)

Answer 5

Using keep_default_na=False should help you:

df = pd.read_csv(filename, keep_default_na=False)

Answer 6

If you are converting the DataFrame to JSON, NaN will give an error, so the best solution in this use case is to replace NaN with None.
Here is how:

df1 = df.where((pd.notnull(df)), None)
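A small sketch of why this matters when serializing (the frame here is made up; output shown as a comment):

import json
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan]})
df1 = df.where(pd.notnull(df), None)

# json.dumps would otherwise emit the invalid token NaN; None becomes null
print(json.dumps(df1.to_dict()))  # {"a": {"0": 1.0, "1": null}}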

Answer 7

I tried with one column of string values with nan.

To remove the nan and fill the empty string:

df.columnname.replace(np.nan,'',regex = True)

To remove the nan and fill some values:

df.columnname.replace(np.nan,'value',regex = True)

I also tried df.iloc, but it needs the index of the column, so you have to look at the table again; the above method simply saves a step.


Label encoding across multiple columns in scikit-learn

Question: Label encoding across multiple columns in scikit-learn

I’m trying to use scikit-learn’s LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I’d rather just have one big LabelEncoder object that works across all my columns of data.

Throwing the entire DataFrame into LabelEncoder creates the below error. Please bear in mind that I’m using dummy data here; in actuality I’m dealing with about 50 columns of string labeled data, so need a solution that doesn’t reference any columns by name.

import pandas
from sklearn import preprocessing 

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'], 
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 
                 'New_York']
})

le = preprocessing.LabelEncoder()

le.fit(df)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
    y = column_or_1d(y, warn=True)
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (6, 3)

Any thoughts on how to get around this problem?


Answer 0

You can easily do this, though:

df.apply(LabelEncoder().fit_transform)

EDIT2:

In scikit-learn 0.20, the recommended way is

OneHotEncoder().fit_transform(df)

as the OneHotEncoder now supports string input. Applying OneHotEncoder only to certain columns is possible with the ColumnTransformer.
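For instance, a hedged sketch of restricting the encoding to two of the question’s columns (the transformer name 'onehot' is arbitrary):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode only 'pets' and 'location'; pass the remaining columns through as-is
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['pets', 'location'])],
    remainder='passthrough')
ct.fit_transform(df)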

EDIT:

Since this answer is over a year ago, and generated many upvotes (including a bounty), I should probably extend this further.

For inverse_transform and transform, you have to do a little bit of a hack.

from collections import defaultdict
d = defaultdict(LabelEncoder)

With this, you now retain all columns LabelEncoder as dictionary.

# Encoding the variable
fit = df.apply(lambda x: d[x.name].fit_transform(x))

# Inverse the encoded
fit.apply(lambda x: d[x.name].inverse_transform(x))

# Using the dictionary to label future data
df.apply(lambda x: d[x.name].transform(x))

Answer 1

As mentioned by larsmans, LabelEncoder() only takes a 1-d array as an argument. That said, it is quite easy to roll your own label encoder that operates on multiple columns of your choosing, and returns a transformed dataframe. My code here is based in part on Zac Stewart’s excellent blog post found here.

Creating a custom encoder involves simply creating a class that responds to the fit(), transform(), and fit_transform() methods. In your case, a good start might be something like this:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

# Create some toy data in a Pandas dataframe
fruit_data = pd.DataFrame({
    'fruit':  ['apple','orange','pear','orange'],
    'color':  ['red','orange','green','green'],
    'weight': [5,6,3,4]
})

class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.iteritems():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)

Suppose we want to encode our two categorical attributes (fruit and color), while leaving the numeric attribute weight alone. We could do this as follows:

MultiColumnLabelEncoder(columns = ['fruit','color']).fit_transform(fruit_data)

Which transforms our fruit_data dataset from

    color   fruit  weight
0     red   apple       5
1  orange  orange       6
2   green    pear       3
3   green  orange       4

to

   color  fruit  weight
0      2      0       5
1      1      1       6
2      0      2       3
3      0      1       4

Passing it a dataframe consisting entirely of categorical variables and omitting the columns parameter will result in every column being encoded (which I believe is what you were originally looking for):

MultiColumnLabelEncoder().fit_transform(fruit_data.drop('weight',axis=1))

This transforms

    color   fruit
0     red   apple
1  orange  orange
2   green    pear
3   green  orange

to

   color  fruit
0      2      0
1      1      1
2      0      2
3      0      1

Note that it’ll probably choke when it tries to encode attributes that are already numeric (add some code to handle this if you like).

Another nice feature about this is that we can use this custom transformer in a pipeline:

encoding_pipeline = Pipeline([
    ('encoding',MultiColumnLabelEncoder(columns=['fruit','color']))
    # add more pipeline steps as needed
])
encoding_pipeline.fit_transform(fruit_data)

Answer 2

Since scikit-learn 0.20 you can use sklearn.compose.ColumnTransformer and sklearn.preprocessing.OneHotEncoder:

If you only have categorical variables, OneHotEncoder directly:

from sklearn.preprocessing import OneHotEncoder

OneHotEncoder(handle_unknown='ignore').fit_transform(df)

If you have heterogeneously typed features:

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder

categorical_columns = ['pets', 'owner', 'location']
numerical_columns = ['age', 'weight', 'height']
column_trans = make_column_transformer(
    # note: more recent scikit-learn versions expect (transformer, columns) order instead
    (categorical_columns, OneHotEncoder(handle_unknown='ignore')),
    (numerical_columns, RobustScaler()))
column_trans.fit_transform(df)

More options in the documentation: http://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data
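
For reference, a toy dataframe matching those column names (hypothetical sample data, just so the snippet above can be run end to end) could be built like this:

import pandas as pd

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'monkey'],
    'owner': ['Champ', 'Ron', 'Brick'],
    'location': ['San_Diego', 'New_York', 'New_York'],
    'age': [2, 5, 3],
    'weight': [4.0, 12.5, 8.2],
    'height': [30, 60, 45],
})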


Answer 3

We don’t need a LabelEncoder.

You can convert the columns to categoricals and then get their codes. I used a dictionary comprehension below to apply this process to every column and wrap the result back into a dataframe of the same shape with identical indices and column names.

>>> pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
   location  owner  pets
0         1      1     0
1         0      2     1
2         0      0     0
3         1      1     2
4         1      3     1
5         0      2     1

To create a mapping dictionary, you can just enumerate the categories using a dictionary comprehension:

>>> {col: {n: cat for n, cat in enumerate(df[col].astype('category').cat.categories)} 
     for col in df}

{'location': {0: 'New_York', 1: 'San_Diego'},
 'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
 'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}
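
A quick round-trip sketch tying the two pieces together (same df as above): encode every column to codes, then map each column's codes back through its own dictionary:

encoded = pd.DataFrame({col: df[col].astype('category').cat.codes for col in df},
                       index=df.index)
mapping = {col: dict(enumerate(df[col].astype('category').cat.categories))
           for col in df}
# invert: each column's codes are looked up in that column's mapping
decoded = encoded.apply(lambda s: s.map(mapping[s.name]))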

Answer 4

This does not directly answer your question (for which Naputipulu Jon and PriceHardman have fantastic replies).

However, for the purpose of a few classification tasks etc. you could use

pandas.get_dummies(input_df) 

This takes a dataframe with categorical data as input and returns a dataframe with binary values: each variable's values are encoded into column names in the resulting dataframe.
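
For example, with a small frame like the ones used elsewhere on this page (hypothetical sample data):

import pandas as pd

df = pd.DataFrame({'pets': ['cat', 'dog', 'cat'],
                   'owner': ['Champ', 'Ron', 'Brick']})
pd.get_dummies(df)
# one 0/1 indicator column per category:
# pets_cat, pets_dog, owner_Brick, owner_Champ, owner_Ron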


Answer 5

Assuming you are simply trying to get a sklearn.preprocessing.LabelEncoder() object that can be used to represent your columns, all you have to do is:

le = LabelEncoder()
le.fit(df.columns)

In the above code you will have a unique number corresponding to each column. More precisely, you will have a 1:1 mapping of df.columns to le.transform(df.columns.get_values()). To get a column’s encoding, simply pass it to le.transform(...). As an example, the following will get the encoding for each column:

le.transform(df.columns.get_values())

Assuming you want to create a sklearn.preprocessing.LabelEncoder() object for all of your row labels you can do the following:

le.fit([y for x in df.get_values() for y in x])

In this case, you most likely have non-unique row labels (as shown in your question). To see what classes the encoder created you can do le.classes_. You’ll note that this should have the same elements as in set(y for x in df.get_values() for y in x). Once again to convert a row label to an encoded label use le.transform(...). As an example, if you want to retrieve the label for the first column in the df.columns array and the first row, you could do this:

le.transform([df.get_value(0, df.columns[0])])

The question you had in your comment is a bit more complicated, but can still be accomplished:

le.fit([str(z) for z in set((x[0], y) for x in df.iteritems() for y in x[1])])

The above code does the following:

  1. Make a unique combination of all of the pairs of (column, row)
  2. Represent each pair as a string version of the tuple. This is a workaround to overcome the LabelEncoder class not supporting tuples as a class name.
  3. Fits the new items to the LabelEncoder.

Now to use this new model it’s a bit more complicated. Assuming we want to extract the representation for the same item we looked up in the previous example (the first column in df.columns and the first row), we can do this:

le.transform([str((df.columns[0], df.get_value(0, df.columns[0])))])

Remember that each lookup is now a string representation of a tuple that contains the (column, row).


Answer 6

No, LabelEncoder does not do this. It takes 1-d arrays of class labels and produces 1-d arrays. It’s designed to handle class labels in classification problems, not arbitrary data, and any attempt to force it into other uses will require code to transform the actual problem to the problem it solves (and the solution back to the original space).
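
A minimal illustration of that 1-d contract (df as in the other answers):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(df['pets'])  # fine: a single column is 1-d
# le.fit_transform(df)                # fails: a multi-column DataFrame is 2-d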


Answer 7

This is a year and a half after the fact, but I, too, needed to be able to .transform() multiple pandas dataframe columns at once (and to be able to .inverse_transform() them as well). This expands upon the excellent suggestion of @PriceHardman above:

import numpy as np
from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder(LabelEncoder):
    """
    Wraps sklearn LabelEncoder functionality for use on multiple columns of a
    pandas dataframe.

    """
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, dframe):
        """
        Fit label encoder to pandas columns.

        Access individual column classes via indexing `self.all_classes_`

        Access individual column encoders via indexing
        `self.all_encoders_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                # fit LabelEncoder to get `classes_` for the column
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                # append this column's encoder
                self.all_encoders_[idx] = le
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                self.all_encoders_[idx] = le
        return self

    def fit_transform(self, dframe):
        """
        Fit label encoder and return encoded labels.

        Access individual column classes via indexing
        `self.all_classes_`

        Access individual column encoders via indexing
        `self.all_encoders_`

        Access individual column encoded labels via indexing
        `self.all_labels_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            self.all_labels_ = np.ndarray(shape=self.columns.shape,
                                          dtype=object)
            for idx, column in enumerate(self.columns):
                # instantiate LabelEncoder
                le = LabelEncoder()
                # fit and transform labels in the column
                dframe.loc[:, column] =\
                    le.fit_transform(dframe.loc[:, column].values)
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                self.all_encoders_[idx] = le
                self.all_labels_[idx] = le
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            self.all_labels_ = np.ndarray(shape=self.columns.shape,
                                          dtype=object)
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                dframe.loc[:, column] = le.fit_transform(
                        dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                self.all_encoders_[idx] = le
        return dframe.loc[:, self.columns].values

    def transform(self, dframe):
        """
        Transform labels to normalized encoding.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[
                    idx].transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .transform(dframe.loc[:, column].values)
        return dframe.loc[:, self.columns].values

    def inverse_transform(self, dframe):
        """
        Transform labels back to original encoding.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        return dframe.loc[:, self.columns].values

Example:

If df and df_copy are mixed-type pandas dataframes, you can apply the MultiColumnLabelEncoder() to the dtype=object columns in the following way:

# get `object` columns
df_object_columns = df.iloc[:, :].select_dtypes(include=['object']).columns
df_copy_object_columns = df_copy.iloc[:, :].select_dtypes(include=['object']).columns

# instantiate `MultiColumnLabelEncoder`
mcle = MultiColumnLabelEncoder(columns=df_object_columns)

# fit to `df` data
mcle.fit(df)

# transform the `df` data
mcle.transform(df)

# returns output like below
array([[1, 0, 0, ..., 1, 1, 0],
       [0, 5, 1, ..., 1, 1, 2],
       [1, 1, 1, ..., 1, 1, 2],
       ..., 
       [3, 5, 1, ..., 1, 1, 2],

# transform `df_copy` data
mcle.transform(df_copy)

# returns output like below (assuming the respective columns 
# of `df_copy` contain the same unique values as that particular 
# column in `df`)
array([[1, 0, 0, ..., 1, 1, 0],
       [0, 5, 1, ..., 1, 1, 2],
       [1, 1, 1, ..., 1, 1, 2],
       ..., 
       [3, 5, 1, ..., 1, 1, 2],

# inverse `df` data
mcle.inverse_transform(df)

# outputs data like below
array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
       ..., 
       ['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)

# inverse `df_copy` data
mcle.inverse_transform(df_copy)

# outputs data like below
array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
       ..., 
       ['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)

You can access individual column classes, column labels, and column encoders used to fit each column via indexing:

mcle.all_classes_
mcle.all_encoders_
mcle.all_labels_


Answer 8

Following up on the comments raised on the solution of @PriceHardman I would propose the following version of the class:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import LabelEncoder

class LabelEncodingColoumns(BaseEstimator, TransformerMixin):
    def __init__(self, cols=None):
        pdu._is_cols_input_valid(cols)
        self.cols = cols
        self.les = {col: LabelEncoder() for col in cols}
        self._is_fitted = False

    def transform(self, df, **transform_params):
        """
        Scaling ``cols`` of ``df`` using the fitting

        Parameters
        ----------
        df : DataFrame
            DataFrame to be preprocessed
        """
        if not self._is_fitted:
            raise NotFittedError("Fitting was not performed")
        pdu._is_cols_subset_of_df_cols(self.cols, df)

        df = df.copy()

        label_enc_dict = {}
        for col in self.cols:
            label_enc_dict[col] = self.les[col].transform(df[col])

        labelenc_cols = pd.DataFrame(label_enc_dict,
            # The index of the resulting DataFrame should be assigned and
            # equal to the one of the original DataFrame. Otherwise, upon
            # concatenation NaNs will be introduced.
            index=df.index
        )

        for col in self.cols:
            df[col] = labelenc_cols[col]
        return df

    def fit(self, df, y=None, **fit_params):
        """
        Fitting the preprocessing

        Parameters
        ----------
        df : DataFrame
            Data to use for fitting.
            In many cases, should be ``X_train``.
        """
        pdu._is_cols_subset_of_df_cols(self.cols, df)
        for col in self.cols:
            self.les[col].fit(df[col])
        self._is_fitted = True
        return self

This class fits the encoder on the training set and uses the fitted version when transforming. Initial version of the code can be found here.
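
The pdu module above refers to the author's own helper utilities from the linked code; as rough, hypothetical stand-ins (collected in a module imported as pdu, only so the class can run self-contained), they might look like:

def _is_cols_input_valid(cols):
    # hypothetical stand-in: require a non-empty list of column names
    if not cols:
        raise ValueError("cols must be a non-empty list of column names")

def _is_cols_subset_of_df_cols(cols, df):
    # hypothetical stand-in: every requested column must exist in df
    missing = set(cols) - set(df.columns)
    if missing:
        raise ValueError("columns not in DataFrame: %s" % sorted(missing))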


Answer 9

A short way to label-encode multiple columns with a dict() of per-column encoders:

from sklearn.preprocessing import LabelEncoder
le_dict = {col: LabelEncoder() for col in columns}
for col in columns:
    df[col] = le_dict[col].fit_transform(df[col])

and you can use this le_dict to label-encode the corresponding column of any other dataframe:

le_dict[col].transform(df_another[col])
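
Since the fitted encoders are kept in le_dict, decoding is the mirror image (a small sketch, assuming the encoded values were assigned back as above):

for col in columns:
    df[col] = le_dict[col].inverse_transform(df[col])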

Answer 10

It is possible to do this all in pandas directly and is well-suited for a unique ability of the replace method.

First, let’s make a dictionary of dictionaries mapping the columns and their values to their new replacement values.

transform_dict = {}
for col in df.columns:
    cats = pd.Categorical(df[col]).categories
    d = {}
    for i, cat in enumerate(cats):
        d[cat] = i
    transform_dict[col] = d

transform_dict
{'location': {'New_York': 0, 'San_Diego': 1},
 'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
 'pets': {'cat': 0, 'dog': 1, 'monkey': 2}}

Since this will always be a one to one mapping, we can invert the inner dictionary to get a mapping of the new values back to the original.

inverse_transform_dict = {}
for col, d in transform_dict.items():
    inverse_transform_dict[col] = {v:k for k, v in d.items()}

inverse_transform_dict
{'location': {0: 'New_York', 1: 'San_Diego'},
 'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
 'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}

Now, we can use the unique ability of the replace method to take a nested dictionary of dictionaries, using the outer keys as the columns and the inner keys as the values we would like to replace.

df.replace(transform_dict)
   location  owner  pets
0         1      1     0
1         0      2     1
2         0      0     0
3         1      1     2
4         1      3     1
5         0      2     1

We can easily go back to the original by again chaining the replace method

df.replace(transform_dict).replace(inverse_transform_dict)
    location     owner    pets
0  San_Diego     Champ     cat
1   New_York       Ron     dog
2   New_York     Brick     cat
3  San_Diego     Champ  monkey
4  San_Diego  Veronica     dog
5   New_York       Ron     dog

Answer 11

After lots of search and experimentation with some answers here and elsewhere, I think your answer is here:

pd.DataFrame(columns=df.columns, data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape))

This will preserve category names across columns:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame([['A','B','C','D','E','F','G','I','K','H'],
                   ['A','E','H','F','G','I','K','','',''],
                   ['A','C','I','F','H','G','','','','']], 
                  columns=['A1', 'A2', 'A3','A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10'])

pd.DataFrame(columns=df.columns, data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape))

    A1  A2  A3  A4  A5  A6  A7  A8  A9  A10
0   1   2   3   4   5   6   7   9   10  8
1   1   5   8   6   7   9   10  0   0   0
2   1   3   9   6   8   7   0   0   0   0

Answer 12

I checked the source code (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/label.py) of LabelEncoder. It is based on a set of numpy transformations, one of which is np.unique(), and that function only takes 1-d array input (correct me if I am wrong).

The very rough idea: first identify which columns need a LabelEncoder, then loop through each column.

import pandas as pd

def cat_var(df): 
    """Identify categorical features. 

    Parameters
    ----------
    df: original df after missing operations 

    Returns
    -------
    cat_var_df: summary df with col index and col name for all categorical vars
    """
    col_type = df.dtypes
    col_names = list(df)

    cat_var_index = [i for i, x in enumerate(col_type) if x=='object']
    cat_var_name = [x for i, x in enumerate(col_names) if i in cat_var_index]

    cat_var_df = pd.DataFrame({'cat_ind': cat_var_index, 
                               'cat_name': cat_var_name})

    return cat_var_df



from sklearn.preprocessing import LabelEncoder 

def column_encoder(df, cat_var_list):
    """Encoding categorical feature in the dataframe

    Parameters
    ----------
    df: input dataframe 
    cat_var_list: categorical feature index and name, from cat_var function

    Return
    ------
    df: new dataframe where categorical features are encoded
    label_list: classes_ attribute for all encoded features 
    """

    label_list = []
    cat_list = cat_var_list.loc[:, 'cat_name']  # cat_var_list comes from cat_var(df)

    for index, cat_feature in enumerate(cat_list): 

        le = LabelEncoder()

        le.fit(df.loc[:, cat_feature])    
        label_list.append(list(le.classes_))

        df.loc[:, cat_feature] = le.transform(df.loc[:, cat_feature])

    return df, label_list

The returned df would be the one after encoding, and label_list will show you what all those values mean in the corresponding column. This is a snippet from a data processing script I wrote for work. Let me know if you think there could be any further improvement.

EDIT: Just want to mention here that the methods above work best with data frames that contain no missing values. I'm not sure how they behave on data frames that contain missing data (I ran a missing-value handling procedure before executing the methods above).


Answer 13

Label encoding a single column (and inverse-transforming it) is easy; here is how to do it when there are multiple columns in Python:

from sklearn import preprocessing

def stringtocategory(dataset):
    '''
    @author puja.sharma
    @see The function label-encodes the object type columns and returns both
         the label-encoded data and the inverse transform of that data
    @param dataset dataframe whose columns have to be label encoded
    @return label-encoded data and the inverse transform of the label-encoded data
    '''
    data_original = dataset[:]
    data_transformed = dataset[:]
    for y in dataset.columns:
        # check the dtype of the column: object type contains strings or chars
        if dataset[y].dtype == object:
            print("The string type features are: " + y)
            le = preprocessing.LabelEncoder()
            le.fit(dataset[y].unique())
            # label-encoded data
            data_transformed[y] = le.transform(dataset[y])
            # inverse label transform of the data
            data_original[y] = le.inverse_transform(data_transformed[y])
    return data_transformed, data_original

Answer 14

If you have both numerical and categorical data in a dataframe, you can use the following; here X is a dataframe with both categorical and numerical variables:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

for i in range(0,X.shape[1]):
    if X.dtypes[i]=='object':
        X[X.columns[i]] = le.fit_transform(X[X.columns[i]])

Note: This technique is good if you are not interested in converting them back.
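
If you do need to convert them back later, a small sketch that keeps one fitted encoder per column (same X and preprocessing import as above) would be:

encoders = {}
for col in X.columns[X.dtypes == 'object']:
    encoders[col] = preprocessing.LabelEncoder()
    X[col] = encoders[col].fit_transform(X[col])

# later, to restore the original labels of a column:
# X[col] = encoders[col].inverse_transform(X[col])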


Answer 15

Using Neuraxle

TLDR; here you can use the FlattenForEach wrapper class to simply transform your df, like: FlattenForEach(LabelEncoder(), then_unflatten=True).fit_transform(df).

With this method, your label encoder will be able to fit and transform within a regular scikit-learn Pipeline. Let’s simply import:

from sklearn.preprocessing import LabelEncoder
from neuraxle.steps.column_transformer import ColumnTransformer
from neuraxle.steps.loop import FlattenForEach

Same shared encoder for columns:

Here is how one shared LabelEncoder will be applied on all the data to encode it:

    p = FlattenForEach(LabelEncoder(), then_unflatten=True)

Result:

    p, predicted_output = p.fit_transform(df.values)
    expected_output = np.array([
        [6, 7, 6, 8, 7, 7],
        [1, 3, 0, 1, 5, 3],
        [4, 2, 2, 4, 4, 2]
    ]).transpose()
    assert np.array_equal(predicted_output, expected_output)

Different encoders per column:

And here is how a first standalone LabelEncoder will be applied to pets, while a second will be shared by the columns owner and location. So, to be precise, we here have a mix of different and shared label encoders:

    p = ColumnTransformer([
        # A different encoder will be used for column 0 with name "pets":
        (0, FlattenForEach(LabelEncoder(), then_unflatten=True)),
        # A shared encoder will be used for column 1 and 2, "owner" and "location":
        ([1, 2], FlattenForEach(LabelEncoder(), then_unflatten=True)),
    ], n_dimension=2)

Result:

    p, predicted_output = p.fit_transform(df.values)
    expected_output = np.array([
        [0, 1, 0, 2, 1, 1],
        [1, 3, 0, 1, 5, 3],
        [4, 2, 2, 4, 4, 2]
    ]).transpose()
    assert np.array_equal(predicted_output, expected_output)

Answer 16

Mainly based on @Alexander's answer, but I had to make some changes:

cols_need_mapped = ['col1', 'col2']

mapper = {col: {cat: n for n, cat in enumerate(df[col].astype('category').cat.categories)} 
     for col in df[cols_need_mapped]}

for c in cols_need_mapped:
    df[c] = df[c].map(mapper[c])

Then, to re-use it in the future, you can just save the mapper to a JSON document, read it back in when you need it, and use the .map() function as above.


Answer 17

The problem is the shape of the data (a pandas DataFrame) you are passing to the fit function: you've got to pass a 1-d list.


Answer 18

import pandas as pd
from sklearn.preprocessing import LabelEncoder

train=pd.read_csv('.../train.csv')

#X=train.loc[:,['waterpoint_type_group','status','waterpoint_type','source_class']].values
# Create a label encoder object 
def MultiLabelEncoder(columnlist,dataframe):
    for i in columnlist:

        labelencoder_X=LabelEncoder()
        dataframe[i]=labelencoder_X.fit_transform(dataframe[i])
columnlist=['waterpoint_type_group','status','waterpoint_type','source_class','source_type']
MultiLabelEncoder(columnlist,train)

Here I am reading a CSV from a location, and to the function I pass the list of columns I want to label-encode along with the dataframe I want to apply it to.


Answer 19

How about this?

from sklearn.preprocessing import LabelEncoder

def MultiColumnLabelEncode(choice, columns, X, LabelEncoders=None):
    # X is a 2-d object array (e.g. df.values); `columns` holds the column
    # positions to encode. On 'encode' the fitted encoders are returned so
    # they can be passed back in for a later 'decode'.
    if choice == 'encode':
        LabelEncoders = [LabelEncoder() for _ in columns]
        for i, col in enumerate(columns):
            X[:, col] = LabelEncoders[i].fit_transform(X[:, col])
        return LabelEncoders
    elif choice == 'decode':
        for i, col in enumerate(columns):
            X[:, col] = LabelEncoders[i].inverse_transform(X[:, col].astype(int))
    else:
        print('Please select correct parameter "choice". Available parameters: encode/decode')

It is not the most efficient, however it works and it is super simple.
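
A usage sketch (hypothetical data; note the encoders returned by 'encode' are passed back in for 'decode'):

X = df.values  # 2-d object array from a dataframe of strings
encoders = MultiColumnLabelEncode('encode', [0, 2], X)
MultiColumnLabelEncode('decode', [0, 2], X, encoders)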


Update a pandas dataframe while iterating row by row

Question: Update a pandas dataframe while iterating row by row

I have a pandas data frame that looks like this (it's a pretty big one):

           date      exer exp     ifor         mat  
1092  2014-03-17  American   M  528.205  2014-04-19 
1093  2014-03-17  American   M  528.205  2014-04-19 
1094  2014-03-17  American   M  528.205  2014-04-19 
1095  2014-03-17  American   M  528.205  2014-04-19    
1096  2014-03-17  American   M  528.205  2014-05-17 

Now I would like to iterate row by row, and as I go through each row, the value of ifor in each row can change depending on some conditions, and I need to look up another dataframe.

Now, how do I update this as I iterate? I tried a few things, and none of them worked.

for i, row in df.iterrows():
    if <something>:
        row['ifor'] = x
    else:
        row['ifor'] = y

    df.ix[i]['ifor'] = x

None of these approaches seem to work. I don’t see the values updated in the dataframe.


Answer 0

You can assign values in the loop using df.set_value:

for i, row in df.iterrows():
    ifor_val = something
    if <condition>:
        ifor_val = something_else
    df.set_value(i,'ifor',ifor_val)

If you don’t need the row values you could simply iterate over the indices of df, but I kept the original for-loop in case you need the row value for something not shown here.

update

df.set_value() has been deprecated since version 0.21.0; you can use df.at[] instead:

for i, row in df.iterrows():
    ifor_val = something
    if <condition>:
        ifor_val = something_else
    df.at[i,'ifor'] = ifor_val

Answer 1

A pandas DataFrame object should be thought of as a Series of Series. In other words, you should think of it in terms of columns. The reason why this is important is that when you use pd.DataFrame.iterrows you are iterating through rows as Series. But these are not the Series that the data frame is storing, so they are new Series that are created for you while you iterate. That implies that when you attempt to assign to them, those edits won't end up reflected in the original data frame.

Ok, now that that is out of the way: What do we do?

Suggestions prior to this post include:

  1. pd.DataFrame.set_value is deprecated as of Pandas version 0.21
  2. pd.DataFrame.ix is deprecated
  3. pd.DataFrame.loc is fine but can work on array indexers and you can do better

My recommendation
Use pd.DataFrame.at

for i in df.index:
    if <something>:
        df.at[i, 'ifor'] = x
    else:
        df.at[i, 'ifor'] = y

You can even change this to:

for i in df.index:
    df.at[i, 'ifor'] = x if <something> else y

Response to comment

and what if I need to use the value of the previous row for the if condition?

for i in range(1, len(df) + 1):
    j = df.columns.get_loc('ifor')
    if <something>:
        df.iat[i - 1, j] = x
    else:
        df.iat[i - 1, j] = y

Answer 2

A method you can use is itertuples(): it iterates over DataFrame rows as namedtuples, with the index value as the first element of the tuple, and it is much, much faster than iterrows(). With itertuples(), each row carries its Index in the DataFrame, and you can use at (or loc) to set the value.

for row in df.itertuples():
    if <something>:
        df.at[row.Index, 'ifor'] = x
    else:
        df.at[row.Index, 'ifor'] = y

    # equivalently, with loc:
    # df.loc[row.Index, 'ifor'] = x

In most cases, itertuples() is faster than iat or at.

Thanks @SantiStSupery, using .at is much faster than loc.


Answer 3

You should assign the value with df.ix[i, 'ifor'] = x or df.loc[i, 'ifor'] = x instead of df.ix[i]['ifor'] = x.

Otherwise you are working on a view, and should get a warning:

-c:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_index,col_indexer] = value instead

But certainly, the loop should probably be replaced by some vectorized algorithm to make full use of the DataFrame, as @Phillip Cloud suggested.


Answer 4

Well, if you are going to iterate anyway, why not use the simplest method of all, df['Column'].values[i]:

df['Column'] = ''

for i in range(len(df)):
    df['Column'].values[i] = something/update/new_value

Or, if you want to compare the new values with the old ones or the like, why not store them in a list and assign it at the end:

mylist, df['Column'] = [], ''

for <condition>:
    mylist.append(something/update/new_value)

df['Column'] = mylist

Answer 5

for i, row in df.iterrows():
    if <something>:
        df.at[i, 'ifor'] = x
    else:
        df.at[i, 'ifor'] = y

Answer 6

It's better to use a lambda function with df.apply():

df["ifor"] = df.apply(lambda x: {value} if {condition} else x["ifor"], axis=1)

Answer 7

Increment the MAX number from a column. For example:

df1 = [sort_ID, Column1,Column2]
print(df1)

My output:

Sort_ID Column1 Column2
12         a    e
45         b    f
65         c    g
78         d    h

MAX = df1['Sort_ID'].max() #This returns my Max Number 

Now, I need to create a column in df2 and fill it with values that increment the MAX.

Sort_ID Column1 Column2
79      a1       e1
80      b1       f1
81      c1       g1
82      d1       h1

Note: df2 will initially contain only Column1 and Column2; the Sort_ID column needs to be created, incrementing from the MAX of df1.
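
A minimal sketch of what is being described, assuming df2 already holds Column1 and Column2:

MAX = df1['Sort_ID'].max()
df2.insert(0, 'Sort_ID', list(range(MAX + 1, MAX + 1 + len(df2))))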


How do I show the full (non-truncated) dataframe information in HTML when converting from a pandas dataframe to HTML?

Question: How do I show the full (non-truncated) dataframe information in HTML when converting from a pandas dataframe to HTML?

I converted a pandas dataframe to an html output using the DataFrame.to_html function. When I save this to a separate html file, the file shows truncated output.

For example, in my TEXT column,

df.head(1) will show

The film was an excellent effort…

instead of

The film was an excellent effort in deconstructing the complex social sentiments that prevailed during this period.

This rendition is fine in the case of a screen-friendly format of a massive pandas dataframe, but I need an html file that will show complete tabular data contained in the dataframe, that is, something that will show the latter text element rather than the former text snippet.

How would I be able to show the complete, non-truncated text data for each element in my TEXT column in the html version of the information? I would imagine that the html table would have to display long cells to show the complete data, but as far as I understand, only column-width parameters can be passed into the DataFrame.to_html function.


Answer 0

display.max_colwidth选项设置为-1

pd.set_option('display.max_colwidth', -1)

set_option docs

例如,在iPython中,我们看到信息被截断为50个字符。多余的部分省略:

在此处输入图片说明

如果设置该display.max_colwidth选项,则信息将完整显示:

在此处输入图片说明

Set the display.max_colwidth option to -1:

pd.set_option('display.max_colwidth', -1)

set_option docs

For example, in iPython, we see that the information is truncated to 50 characters. Anything in excess is ellipsized:

[screenshot: truncated result]

If you set the display.max_colwidth option, the information will be displayed fully:

[screenshot: non-truncated result]


Answer 1

pd.set_option('display.max_columns', None)  

Passing None as the second argument tells pandas to show all columns in full.


Answer 2

While pd.set_option('display.max_columns', None) sets the maximum number of columns shown, the option pd.set_option('display.max_colwidth', -1) sets the maximum width of each single field.

For my purposes I wrote a small helper function to fully print huge data frames without affecting the rest of the code; it also reformats float numbers and sets the virtual display width. You may adapt it for your use cases.

def print_full(x):
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', 2000)
    pd.set_option('display.float_format', '{:20,.2f}'.format)
    pd.set_option('display.max_colwidth', None)
    print(x)
    pd.reset_option('display.max_rows')
    pd.reset_option('display.max_columns')
    pd.reset_option('display.width')
    pd.reset_option('display.float_format')
    pd.reset_option('display.max_colwidth')
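
As a design note, the same effect can be had without the manual set/reset pairs by using pandas' option_context manager, which restores the options when the block exits; a minimal sketch with a hypothetical wide frame:

import pandas as pd

df = pd.DataFrame({'TEXT': ['x' * 200]})  # hypothetical wide frame
with pd.option_context('display.max_rows', None, 'display.max_colwidth', None):
    print(df)  # options revert automatically when the block exits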

Answer 3


For those looking to do this in dask: I could not find a similar option in dask, but if I simply set the pandas option in the same notebook, it works for dask too.

import pandas as pd
import dask.dataframe as dd
pd.set_option('display.max_colwidth', -1)  # Sets no-truncation for pandas output, and it carries over to dask as well; not sure why, but it works

train_data = dd.read_csv('./data/train.csv')    
train_data.head(5)

Answer 4


The following code results in the warning below:

pd.set_option('display.max_colwidth', -1)

FutureWarning: Passing a negative integer is deprecated in version 1.0 and will not be supported in future version. Instead, use None to not limit the column width.

Instead, use:

pd.set_option('display.max_colwidth', None)

This accomplishes the task and complies with versions of pandas following version 1.0.


Appending to an empty DataFrame in pandas?

Question: Appending to an empty DataFrame in pandas?


Is it possible to append to an empty data frame that doesn’t contain any indices or columns?

I have tried to do this, but keep getting an empty dataframe at the end.

e.g.

df = pd.DataFrame()
data = ['some kind of data here' --> I have checked the type already, and it is a dataframe]
df.append(data)

The result looks like this:

Empty DataFrame
Columns: []
Index: []

Answer 0


That should work:

>>> df = pd.DataFrame()
>>> data = pd.DataFrame({"A": range(3)})
>>> df.append(data)
   A
0  0
1  1
2  2

But the append doesn’t happen in-place, so you’ll have to store the output if you want it:

>>> df
Empty DataFrame
Columns: []
Index: []
>>> df = df.append(data)
>>> df
   A
0  0
1  1
2  2

Answer 1


And if you want to add a row, you can use a dictionary:

df = pd.DataFrame()
df = df.append({'name': 'Zed', 'age': 9, 'height': 2}, ignore_index=True)

which gives you:

   age  height name
0    9       2  Zed
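
Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a modern equivalent of this dictionary-row idiom, sketched here, wraps the dict in a one-row frame and concatenates:

import pandas as pd

df = pd.DataFrame()
row = pd.DataFrame([{'name': 'Zed', 'age': 9, 'height': 2}])
df = pd.concat([df, row], ignore_index=True)  # concat returns a new frame
print(df)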

Answer 2


You can concat the data in this way:

InfoDF = pd.DataFrame()
tempDF = pd.DataFrame(rows,columns=['id','min_date'])

InfoDF = pd.concat([InfoDF,tempDF])
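
As a design note, concatenating inside a loop copies the accumulated data on every iteration; the usual idiom is to collect the pieces in a list and call pd.concat once. A sketch with hypothetical chunks:

import pandas as pd

pieces = []
for i in range(3):  # hypothetical chunks arriving one at a time
    pieces.append(pd.DataFrame({'id': [i], 'min_date': ['2020-01-0%d' % (i + 1)]}))

InfoDF = pd.concat(pieces, ignore_index=True)
print(InfoDF)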

Plotting a correlation matrix with pandas

Question: Plotting a correlation matrix with pandas


I have a data set with a huge number of features, so analysing the correlation matrix has become very difficult. I want to plot the correlation matrix that we get using the dataframe.corr() function from the pandas library. Is there any built-in function provided by the pandas library to plot this matrix?


Answer 0


You can use pyplot.matshow() from matplotlib:

import matplotlib.pyplot as plt

plt.matshow(dataframe.corr())
plt.show()

Edit:

In the comments there was a request for how to change the axis tick labels. Here’s a deluxe version that is drawn on a bigger figure size, has axis labels to match the dataframe, and a colorbar legend to interpret the color scale.

I’m including how to adjust the size and rotation of the labels, and I’m using a figure ratio that makes the colorbar and the main figure come out the same height.

f = plt.figure(figsize=(19, 15))
plt.matshow(df.corr(), fignum=f.number)
plt.xticks(range(df.shape[1]), df.columns, fontsize=14, rotation=45)
plt.yticks(range(df.shape[1]), df.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16);

correlation plot example


Answer 1


If your main goal is to visualize the correlation matrix, rather than creating a plot per se, the convenient pandas styling options are a viable built-in solution:

import pandas as pd
import numpy as np

rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(10, 10))
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')
# 'RdBu_r' & 'BrBG' are other good diverging colormaps

[image: styled correlation matrix]

Note that this needs to be in a backend that supports rendering HTML, such as the JupyterLab Notebook. (The automatic light text on dark backgrounds is from an existing PR and not the latest released version, pandas 0.23).


Styling

You can easily limit the digit precision:

corr.style.background_gradient(cmap='coolwarm').set_precision(2)

[image: matrix with two-digit precision]

Or get rid of the digits altogether if you prefer the matrix without annotations:

corr.style.background_gradient(cmap='coolwarm').set_properties(**{'font-size': '0pt'})

[image: matrix without digit annotations]

The styling documentation also includes instructions for more advanced styles, such as how to change the display of the cell the mouse pointer is hovering over. To save the output you could return the HTML by appending the render() method and then write it to a file (or just take a screenshot for less formal purposes).


Time comparison

In my testing, style.background_gradient() was 4x faster than plt.matshow() and 120x faster than sns.heatmap() with a 10×10 matrix. Unfortunately it doesn’t scale as well as plt.matshow(): the two take about the same time for a 100×100 matrix, and plt.matshow() is 10x faster for a 1000×1000 matrix.


Saving

There are a few possible ways to save the stylized dataframe:

  • Return the HTML by appending the render() method and then write the output to a file.
  • Save as an .xlsx file with conditional formatting by appending the to_excel() method.
  • Combine with imgkit to save a bitmap
  • Take a screenshot (for less formal purposes).

Update for pandas >= 0.24

By setting axis=None, it is now possible to compute the colors based on the entire matrix rather than per column or per row:

corr.style.background_gradient(cmap='coolwarm', axis=None)

[image: gradient computed over the entire matrix]
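
In newer pandas (1.3 and later), Styler also offers a to_html() method that supersedes render(); a minimal sketch of saving the styled matrix, with a hypothetical output path:

import numpy as np
import pandas as pd

rs = np.random.RandomState(0)
corr = pd.DataFrame(rs.rand(10, 10)).corr()
html = corr.style.background_gradient(cmap='coolwarm', axis=None).to_html()
with open('corr.html', 'w') as f:  # hypothetical path
    f.write(html)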


Answer 2


Try this function, which also displays variable names for the correlation matrix:

import matplotlib.pyplot as plt

def plot_corr(df, size=10):
    '''Function plots a graphical correlation matrix for each pair of columns in the dataframe.

    Input:
        df: pandas DataFrame
        size: vertical and horizontal size of the plot'''

    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns)
    plt.yticks(range(len(corr.columns)), corr.columns)
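
A hypothetical call, assuming df is a DataFrame of numeric columns:

plot_corr(df, size=8)
plt.show()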

Answer 3


Seaborn’s heatmap version:

import seaborn as sns
corr = dataframe.corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
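
A common refinement, reusing corr from above, is to annotate the cells and pin the color scale to the full correlation range; the parameter choices here are illustrative:

sns.heatmap(corr, annot=True, fmt='.2f', vmin=-1, vmax=1, cmap='coolwarm')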

Answer 4


You can observe the relation between features either by drawing a heat map from seaborn or a scatter matrix from pandas.

Scatter Matrix:

pd.plotting.scatter_matrix(dataframe, alpha=0.3, figsize=(14, 8), diagonal='kde')  # was pd.scatter_matrix in older pandas

If you want to visualize each feature’s skewness as well – use seaborn pairplots.

sns.pairplot(dataframe)

Sns Heatmap:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(10, 8))
corr = dataframe.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool),  # np.bool was removed in modern NumPy
            cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)

The output will be a correlation map of the features. i.e. see the below example.

[image: correlation heatmap of the features]

The correlation between grocery and detergents is high. Similarly:

Products With High Correlation:
  1. Grocery and Detergents.
Products With Medium Correlation:
  1. Milk and Grocery
  2. Milk and Detergents_Paper
Products With Low Correlation:
  1. Milk and Deli
  2. Frozen and Fresh.
  3. Frozen and Deli.

From pairplots: you can observe the same set of relations from pairplots or the scatter matrix, but from these we can also tell whether the data is normally distributed or not.

[image: pairplot of the same data]

Note: the pairplot above is drawn from the same data that was used for the heatmap.


Answer 5


You can use the imshow() method from matplotlib:

import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')

plt.imshow(X.corr(), cmap=plt.cm.Reds, interpolation='nearest')
plt.colorbar()
tick_marks = [i for i in range(len(X.columns))]
plt.xticks(tick_marks, X.columns, rotation='vertical')
plt.yticks(tick_marks, X.columns)
plt.show()

Answer 6


If your dataframe is df, you can simply use:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 10))
sns.heatmap(df.corr(), annot=True)

Answer 7


statsmodels graphics also gives a nice view of the correlation matrix:

import statsmodels.api as sm
import matplotlib.pyplot as plt

corr = dataframe.corr()
sm.graphics.plot_corr(corr, xnames=list(corr.columns))
plt.show()

Answer 8


For completeness, the simplest solution I know with seaborn as of late 2019, if one is using Jupyter:

import seaborn as sns
sns.heatmap(dataframe.corr())

Answer 9


Along with the other methods, it is also good to have a pairplot, which gives a scatter plot for all pairs:

import pandas as pd
import numpy as np
import seaborn as sns
rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(10, 10))
sns.pairplot(df)

Answer 10


Form the correlation matrix; in my case zdf is the dataframe on which I need to compute the correlation matrix.

corrMatrix =zdf.corr()
corrMatrix.to_csv('sm_zscaled_correlation_matrix.csv');
html = corrMatrix.style.background_gradient(cmap='RdBu').set_precision(2).render()

# Writing the output to a html file.
with open('test.html', 'w') as f:
    print('<!DOCTYPE html><html lang="en"><head><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><title>Document</title></head><style>table{word-break: break-all;}</style><body>' + html + '</body></html>', file=f)

Then we can take a screenshot, or convert the HTML to an image file.
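
A sketch of that HTML-to-image step using imgkit, which was also mentioned in the saving list of Answer 1 (it requires the wkhtmltoimage binary to be installed; the input name follows the snippet above, and the output name is hypothetical):

import imgkit

imgkit.from_file('test.html', 'correlation_matrix.png')  # output name hypothetical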