标签归档:pandas

熊猫轴是什么意思?

问题:熊猫轴是什么意思?

这是我的生成数据框的代码:

import pandas as pd
import numpy as np

dff = pd.DataFrame(np.random.randn(1,2),columns=list('AB'))

然后我得到了数据框:

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+---------
|      0     | 0.626386| 1.52325|
+------------+---------+--------+

当我输入命令时:

dff.mean(axis=1)

我有 :

0    1.074821
dtype: float64

根据熊猫的参考,axis = 1代表列,我希望命令的结果是

A    0.626386
B    1.523255
dtype: float64

所以这是我的问题:大熊猫轴是什么意思?

Here is my code to generate a dataframe:

import pandas as pd
import numpy as np

dff = pd.DataFrame(np.random.randn(1,2),columns=list('AB'))

then I got the dataframe:

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+---------
|      0     | 0.626386| 1.52325|
+------------+---------+--------+

When I type the commmand :

dff.mean(axis=1)

I got :

0    1.074821
dtype: float64

According to the reference of pandas, axis=1 stands for columns and I expect the result of the command to be

A    0.626386
B    1.523255
dtype: float64

So here is my question: what does axis in pandas mean?


回答 0

它指定轴沿其的装置被计算的。默认情况下axis=0。这与显式指定numpy.mean时的用法一致(默认情况下为,轴== None,该值将计算扁平化数组的平均值),沿(即以熊猫为索引)和沿。为了更加清楚起见,可以选择指定(代替)或(代替)。axisnumpy.meanaxis=0axis=1axis='index'axis=0axis='columns'axis=1

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+---------
|      0     | 0.626386| 1.52325|----axis=1----->
+------------+---------+--------+
             |         |
             | axis=0  |
                      

It specifies the axis along which the means are computed. By default axis=0. This is consistent with the numpy.mean usage when axis is specified explicitly (in numpy.mean, axis==None by default, which computes the mean value over the flattened array) , in which axis=0 along the rows (namely, index in pandas), and axis=1 along the columns. For added clarity, one may choose to specify axis='index' (instead of axis=0) or axis='columns' (instead of axis=1).

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+---------
|      0     | 0.626386| 1.52325|----axis=1----->
+------------+---------+--------+
             |         |
             | axis=0  |
             ↓         ↓

回答 1

这些答案确实有助于解释这一点,但是对于非程序员(例如像我这样在数据科学类中首次学习Python的人)来说,它仍然不是很直观。我仍然发现在行和列中使用术语“沿”或“对于每个”令人困惑。

对我来说更有意义的是这样说:

  • 轴0将作用于每个COLUMN中的所有ROWS
  • 轴1将作用于每个行中的所有列

因此,轴0的均值将是每一列中所有行的均值,轴1的均值将是每一行中所有列的均值。

最终,这是与@zhangxaochen和@Michael所说的相同的事情,但是以一种更易于我内部化的方式。

These answers do help explain this, but it still isn’t perfectly intuitive for a non-programmer (i.e. someone like me who is learning Python for the first time in context of data science coursework). I still find using the terms “along” or “for each” wrt to rows and columns to be confusing.

What makes more sense to me is to say it this way:

  • Axis 0 will act on all the ROWS in each COLUMN
  • Axis 1 will act on all the COLUMNS in each ROW

So a mean on axis 0 will be the mean of all the rows in each column, and a mean on axis 1 will be a mean of all the columns in each row.

Ultimately this is saying the same thing as @zhangxaochen and @Michael, but in a way that is easier for me to internalize.


回答 2

让我们想象一下(您会永远记住),

在熊猫:

  1. axis = 0表示沿“索引”。这是逐行操作

假设要对dataframe1和dataframe2执行concat()操作,我们将dataframe1并从dataframe1中取出第一行并放入新的DF,然后从dataframe1中取出另一行并放入新的DF中,重复此过程直到我们到达dataframe1的底部。然后,我们对dataframe2执行相同的过程。

基本上,将dataframe2堆叠在dataframe1之上,反之亦然。

例如在桌子或地板上堆书

  1. axis = 1表示沿“列”。这是列操作。

假设要对dataframe1和dataframe2执行concat()操作,我们将取出dataframe1的第一个完整列(又名1st系列)并放入新的DF中,然后取出dataframe1的第二列并与之相邻(横向) ),我们必须重复此操作,直到完成所有列。然后,我们在dataframe2上重复相同的过程。基本上, 横向堆叠dataframe2。

例如在书架上整理书籍。

更重要的是,与矩阵相比,数组是更好的表示嵌套n维结构的表示形式!因此,下面的内容可以帮助您更加直观地了解将轴推广到多个维度时轴如何发挥重要作用。另外,您实际上可以打印/写入/绘制/可视化任何n维数组,但是在3维以上的纸张上以矩阵表示形式(3维)进行写入或可视化是不可能的。

Let’s visualize (you gonna remember always),

In Pandas:

  1. axis=0 means along “indexes”. It’s a row-wise operation.

Suppose, to perform concat() operation on dataframe1 & dataframe2, we will take dataframe1 & take out 1st row from dataframe1 and place into the new DF, then we take out another row from dataframe1 and put into new DF, we repeat this process until we reach to the bottom of dataframe1. Then, we do the same process for dataframe2.

Basically, stacking dataframe2 on top of dataframe1 or vice a versa.

E.g making a pile of books on a table or floor

  1. axis=1 means along “columns”. It’s a column-wise operation.

Suppose, to perform concat() operation on dataframe1 & dataframe2, we will take out the 1st complete column(a.k.a 1st series) of dataframe1 and place into new DF, then we take out the second column of dataframe1 and keep adjacent to it (sideways), we have to repeat this operation until all columns are finished. Then, we repeat the same process on dataframe2. Basically, stacking dataframe2 sideways.

E.g arranging books on a bookshelf.

More to it, since arrays are better representations to represent a nested n-dimensional structure compared to matrices! so below can help you more to visualize how axis plays an important role when you generalize to more than one dimension. Also, you can actually print/write/draw/visualize any n-dim array but, writing or visualizing the same in a matrix representation(3-dim) is impossible on a paper more than 3-dimensions.


回答 3

axis指向数组的维,在pd.DataFrames 的情况下axis=0是向下的维,axis=1而向右的维。

示例:考虑一个ndarraywith shape (3,5,7)

a = np.ones((3,5,7))

a是3维的ndarray,即具有3个轴(“轴”是“轴”的复数)。的配置a看起来像3片面包,每片面包的尺寸为5 x 7。a[0,:,:]将引用第0个切片,a[1,:,:]将引用第1 个切片,依此类推。

a.sum(axis=0)sum()沿的第0轴应用a。您将添加所有切片,最后得到一个形状的切片(5,7)

a.sum(axis=0) 相当于

b = np.zeros((5,7))
for i in range(5):
    for j in range(7):
        b[i,j] += a[:,i,j].sum()

b并且a.sum(axis=0)将两者看起来像这样

array([[ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.]])

在中pd.DataFrame,轴的工作方式与在numpy.arrays中相同:axis=0将对sum()每列应用或任何其他归约函数。

注意:在@zhangxaochen的答案中,我发现“沿行”和“沿列”这两个短语有些混乱。axis=0应该指“每列”和axis=1“每行”。

axis refers to the dimension of the array, in the case of pd.DataFrames axis=0 is the dimension that points downwards and axis=1 the one that points to the right.

Example: Think of an ndarray with shape (3,5,7).

a = np.ones((3,5,7))

a is a 3 dimensional ndarray, i.e. it has 3 axes (“axes” is plural of “axis”). The configuration of a will look like 3 slices of bread where each slice is of dimension 5-by-7. a[0,:,:] will refer to the 0-th slice, a[1,:,:] will refer to the 1-st slice etc.

a.sum(axis=0) will apply sum() along the 0-th axis of a. You will add all the slices and end up with one slice of shape (5,7).

a.sum(axis=0) is equivalent to

b = np.zeros((5,7))
for i in range(5):
    for j in range(7):
        b[i,j] += a[:,i,j].sum()

b and a.sum(axis=0) will both look like this

array([[ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.]])

In a pd.DataFrame, axes work the same way as in numpy.arrays: axis=0 will apply sum() or any other reduction function for each column.

N.B. In @zhangxaochen’s answer, I find the phrases “along the rows” and “along the columns” slightly confusing. axis=0 should refer to “along each column”, and axis=1 “along each row”.


回答 4

对我而言,最容易理解的方法是谈论您是针对每一列(axis = 0)还是每一行(axis = 1)计算统计信息。如果您计算统计量,请说一个平均值,axis = 0您将获得每一列的统计量。因此,如果每个观察值都是一行,并且每个变量都在列中,则将获得每个变量的均值。如果设置,axis = 1则将为每一行计算统计信息。在我们的示例中,您将获得所有变量中每个观察值的平均值(也许您需要相关度量的平均值)。

axis = 0:按列=按列=沿行

axis = 1:按行=按行=沿列

The easiest way for me to understand is to talk about whether you are calculating a statistic for each column (axis = 0) or each row (axis = 1). If you calculate a statistic, say a mean, with axis = 0 you will get that statistic for each column. So if each observation is a row and each variable is in a column, you would get the mean of each variable. If you set axis = 1 then you will calculate your statistic for each row. In our example, you would get the mean for each observation across all of your variables (perhaps you want the average of related measures).

axis = 0: by column = column-wise = along the rows

axis = 1: by row = row-wise = along the columns


回答 5

让我们看一下Wiki中的表格。这是国际货币基金组织对2010年至2019年前十个国家的GDP估算。

1.第1轴将对所有列的每一行起作用
如果您要计算十年(2010-2019年)中每个国家的平均(平均)GDP,则需要做df.mean(axis=1)。例如,如果您要计算2010年至2019年美国的平均GDP,df.loc['United States','2010':'2019'].mean(axis=1)

2.轴0将对所有行的每一列起作用
如果我想计算所有国家每个年份的平均(平均)GDP,则需要做df.mean(axis=0)。例如,如果您要计算美国,中国,日本,德国和印度的2015年平均GDP,df.loc['United States':'India','2015'].mean(axis=0)

请注意:上面的代码仅在将“国家(或从属地区)”列设置为“索引”后才能使用set_index方法。

Let’s look at the table from Wiki. This is an IMF estimate of GDP from 2010 to 2019 for top ten countries.

1. Axis 1 will act for each row on all the columns
If you want to calculate the average (mean) GDP for EACH countries over the decade (2010-2019), you need to do, df.mean(axis=1). For example, if you want to calculate mean GDP of United States from 2010 to 2019, df.loc['United States','2010':'2019'].mean(axis=1)

2. Axis 0 will act for each column on all the rows
If I want to calculate the average (mean) GDP for EACH year for all countries, you need to do, df.mean(axis=0). For example, if you want to calculate mean GDP of the year 2015 for United States, China, Japan, Germany and India, df.loc['United States':'India','2015'].mean(axis=0)

Note: The above code will work only after setting “Country(or dependent territory)” column as the Index, using set_index method.


回答 6

从编程角度来看,轴是形状元组中的位置。这是一个例子:

import numpy as np

a=np.arange(120).reshape(2,3,4,5)

a.shape
Out[3]: (2, 3, 4, 5)

np.sum(a,axis=0).shape
Out[4]: (3, 4, 5)

np.sum(a,axis=1).shape
Out[5]: (2, 4, 5)

np.sum(a,axis=2).shape
Out[6]: (2, 3, 5)

np.sum(a,axis=3).shape
Out[7]: (2, 3, 4)

轴上的均值将导致该尺寸被删除。

参考原始问题,dff形状为(1,2)。使用axis = 1会将形状更改为(1,)。

Axis in view of programming is the position in the shape tuple. Here is an example:

import numpy as np

a=np.arange(120).reshape(2,3,4,5)

a.shape
Out[3]: (2, 3, 4, 5)

np.sum(a,axis=0).shape
Out[4]: (3, 4, 5)

np.sum(a,axis=1).shape
Out[5]: (2, 4, 5)

np.sum(a,axis=2).shape
Out[6]: (2, 3, 5)

np.sum(a,axis=3).shape
Out[7]: (2, 3, 4)

Mean on the axis will cause that dimension to be removed.

Referring to the original question, the dff shape is (1,2). Using axis=1 will change the shape to (1,).


回答 7

熊猫的设计者韦斯·麦金尼(Wes McKinney)过去经常从事金融数据工作。将列视为股票名称,将索引视为每日价格。然后,您可以猜测axis=0此财务数据的默认行为(即)。axis=1可以简单地认为是“另一个方向”。

例如,统计功能,如mean()sum()describe()count()都默认为列明智的,因为它更有意义,做他们每个股票。sort_index(by=)也默认为列。fillna(method='ffill')将沿着列填充,因为它是相同的库存。dropna()默认为行,因为您可能只想放弃当天的价格,而不是丢弃该股票的所有价格。

类似地,方括号索引是指各列,因为选择股票而不是选择一天更为普遍。

The designer of pandas, Wes McKinney, used to work intensively on finance data. Think of columns as stock names and index as daily prices. You can then guess what the default behavior is (i.e., axis=0) with respect to this finance data. axis=1 can be simply thought as ‘the other direction’.

For example, the statistics functions, such as mean(), sum(), describe(), count() all default to column-wise because it makes more sense to do them for each stock. sort_index(by=) also defaults to column. fillna(method='ffill') will fill along column because it is the same stock. dropna() defaults to row because you probably just want to discard the price on that day instead of throw away all prices of that stock.

Similarly, the square brackets indexing refers to the columns since it’s more common to pick a stock instead of picking a day.


回答 8

记住轴1(列)和轴0(行)的简单方法之一就是您期望的输出。

  • 如果您期望每行的输出都使用axis =’columns’,
  • 另一方面,如果要为每列输出,请使用axis =’rows’。

one of easy ways to remember axis 1 (columns), vs axis 0 (rows) is the output you expect.

  • if you expect an output for each row you use axis=’columns’,
  • on the other hand if you want an output for each column you use axis=’rows’.

回答 9

axis=正确使用的问题是在两种主要情况下的使用:

  1. 用于计算累积值重新排列(例如排序)数据。
  2. 用于操纵(“播放”)实体(例如dataframe)。

该答案背后的主要思想是,为了避免混淆,我们选择数字名称来指定特定的轴,以更清晰,直观和描述性的方式为准。

熊猫基于NumPy,后者基于数学,尤其是基于n维矩阵。这是3维空间中数学中轴名称的常用图像:

此图片仅用于存储轴的序号

  • 0 对于x轴,
  • 1 y轴
  • 2 对于z轴。

z轴是只对面板 ; 对于数据帧,我们将把兴趣限制在带有x轴(,垂直)y轴(,水平)的绿色二维基本平面01

所有这些都是数字作为axis=参数的潜在值。

轴的名称'index'(您可以使用别名'rows')和'columns',对于此说明,这些名称与(轴的)序数之间的关系并不重要,因为每个人都知道“行”“列”是什么意思(我想,这里的每个人都知道熊猫中“索引”一词的含义)。

现在,我的建议:

  1. 如果要计算累加值,则可以从沿轴0(或沿轴1)定位的值(使用axis=0axis=1)计算得出。

    同样,如果要重新排列值,请使用轴的轴编号沿着该轴的编号将放置数据以进行重新排列(例如,用于排序)。

  2. 如果要操作(例如连接实体(例如,数据框),请使用axis='index'(同义词:)axis='rows'axis='columns'指定结果更改 – 分别为索引)或
    (对于串联,您将分别获得更长的索引(=更多的行)更多的列。)

The problem with using axis= properly is for its use for 2 main different cases:

  1. For computing an accumulated value, or rearranging (e. g. sorting) data.
  2. For manipulating (“playing” with) entities (e. g. dataframes).

The main idea behind this answer is that for avoiding the confusion, we select either a number, or a name for specifying the particular axis, whichever is more clear, intuitive, and descriptive.

Pandas is based on NumPy, which is based on mathematics, particularly on n-dimensional matrices. Here is an image for common use of axes’ names in math in the 3-dimensional space:

This picture is for memorizing the axes’ ordinal numbers only:

  • 0 for x-axis,
  • 1 for y-axis, and
  • 2 for z-axis.

The z-axis is only for panels; for dataframes we will restrict our interest to the green-colored, 2-dimensional basic plane with x-axis (0, vertical), and y-axis (1, horizontal).

It’s all for numbers as potential values of axis= parameter.

The names of axes are 'index' (you may use the alias 'rows') and 'columns', and for this explanation it is NOT important the relation between these names and ordinal numbers (of axes), as everybody knows what the words “rows” and “columns” mean (and everybody here — I suppose — knows what the word “index” in pandas means).

And now, my recommendation:

  1. If you want to compute an accumulated value, you may compute it from values located along axis 0 (or along axis 1) — use axis=0 (or axis=1).

    Similarly, if you want to rearrange values, use the axis number of the axis, along which are located data for rearranging (e.g. for sorting).

  2. If you want to manipulate (e.g. concatenate) entities (e.g. dataframes) — use axis='index' (synonym: axis='rows') or axis='columns' to specify the resulting changeindex (rows) or columns, respectively.
    (For concatenating, you will obtain either a longer index (= more rows), or more columns, respectively.)


回答 10

这是基于@Safak的答案。理解pandas / numpy中轴的最好方法是创建3d数组,并检查3个不同轴上sum函数的结果。

 a = np.ones((3,5,7))

一个将是:

    array([[[1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.]],

   [[1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.]],

   [[1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.]]])

现在检查沿每个轴的数组元素的总和:

 x0 = np.sum(a,axis=0)
 x1 = np.sum(a,axis=1)
 x2 = np.sum(a,axis=2)

将为您提供以下结果:

   x0 :
   array([[3., 3., 3., 3., 3., 3., 3.],
        [3., 3., 3., 3., 3., 3., 3.],
        [3., 3., 3., 3., 3., 3., 3.],
        [3., 3., 3., 3., 3., 3., 3.],
        [3., 3., 3., 3., 3., 3., 3.]])

   x1 : 
   array([[5., 5., 5., 5., 5., 5., 5.],
   [5., 5., 5., 5., 5., 5., 5.],
   [5., 5., 5., 5., 5., 5., 5.]])

  x2 :
   array([[7., 7., 7., 7., 7.],
        [7., 7., 7., 7., 7.],
        [7., 7., 7., 7., 7.]])

This is based on @Safak’s answer. The best way to understand the axes in pandas/numpy is to create a 3d array and check the result of the sum function along the 3 different axes.

 a = np.ones((3,5,7))

a will be:

    array([[[1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.]],

   [[1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.]],

   [[1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.]]])

Now check out the sum of elements of the array along each of the axes:

 x0 = np.sum(a,axis=0)
 x1 = np.sum(a,axis=1)
 x2 = np.sum(a,axis=2)

will give you the following results:

   x0 :
   array([[3., 3., 3., 3., 3., 3., 3.],
        [3., 3., 3., 3., 3., 3., 3.],
        [3., 3., 3., 3., 3., 3., 3.],
        [3., 3., 3., 3., 3., 3., 3.],
        [3., 3., 3., 3., 3., 3., 3.]])

   x1 : 
   array([[5., 5., 5., 5., 5., 5., 5.],
   [5., 5., 5., 5., 5., 5., 5.],
   [5., 5., 5., 5., 5., 5., 5.]])

  x2 :
   array([[7., 7., 7., 7., 7.],
        [7., 7., 7., 7., 7.],
        [7., 7., 7., 7., 7.]])

回答 11

我这样理解:

假设您的操作需要在数据框中从左向右/从右向左遍历,则显然是在合并列,即。您正在各种列上进行操作。这是轴= 1

df = pd.DataFrame(np.arange(12).reshape(3,4),columns=['A', 'B', 'C', 'D'])
print(df)
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11 

df.mean(axis=1)

0    1.5
1    5.5
2    9.5
dtype: float64

df.drop(['A','B'],axis=1,inplace=True)

    C   D
0   2   3
1   6   7
2  10  11

需要注意的是,我们正在对列进行操作

同样,如果您的操作需要在数据框中从上到下/下到上遍历,则您正在合并行。这是axis = 0

I understand this way :

Say if your operation requires traversing from left to right/right to left in a dataframe, you are apparently merging columns ie. you are operating on various columns. This is axis =1

Example

df = pd.DataFrame(np.arange(12).reshape(3,4),columns=['A', 'B', 'C', 'D'])
print(df)
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11 

df.mean(axis=1)

0    1.5
1    5.5
2    9.5
dtype: float64

df.drop(['A','B'],axis=1,inplace=True)

    C   D
0   2   3
1   6   7
2  10  11

Point to note here is we are operating on columns

Similarly, if your operation requires traversing from top to bottom/bottom to top in a dataframe, you are merging rows. This is axis=0.


回答 12

轴= 0表示上向下轴= 1表示左到右

sums[key] = lang_sets[key].iloc[:,1:].sum(axis=0)

给定的示例是对==键中的所有数据求和。

axis = 0 means up to down axis = 1 means left to right

sums[key] = lang_sets[key].iloc[:,1:].sum(axis=0)

Given example is taking sum of all the data in column == key.


回答 13

我的想法是:Axis = n,其中n = 0、1等,意味着矩阵沿该轴折叠(折叠)。因此,在2D矩阵中,当您沿0(行)折叠时,您实际上一次只对一列进行操作。对于高阶矩阵也是如此。

这与对矩阵中维的常规引用不同,其中0->行和1->列。对于N维数组中的其他维类似。

My thinking : Axis = n, where n = 0, 1, etc. means that the matrix is collapsed (folded) along that axis. So in a 2D matrix, when you collapse along 0 (rows), you are really operating on one column at a time. Similarly for higher order matrices.

This is not the same as the normal reference to a dimension in a matrix, where 0 -> row and 1 -> column. Similarly for other dimensions in an N dimension array.


回答 14

我是熊猫的新手。但这是我理解熊猫轴的方式:


恒定 变化 方向


0列向下|


1行列向右->


因此,要计算列的均值,该特定列应为常数,但其下的行可以更改(变化),因此轴= 0。

类似地,要计算一行的平均值,该特定行是恒定的,但它可以遍历不同的列(变化),轴= 1。

I’m a newbie to pandas. But this is how I understand axis in pandas:


Axis Constant Varying Direction


0 Column Row Downwards |


1 Row Column Towards Right –>


So to compute mean of a column, that particular column should be constant but the rows under that can change (varying) so it is axis=0.

Similarly, to compute mean of a row, that particular row is constant but it can traverse through different columns (varying), axis=1.


回答 15

我认为还有另一种理解方式。

对于np.array,如果要消除列,则使用axis = 1; 如果要消除行,则使用axis = 0。

np.mean(np.array(np.ones(shape=(3,5,10))),axis = 0).shape # (5,10)
np.mean(np.array(np.ones(shape=(3,5,10))),axis = 1).shape # (3,10)
np.mean(np.array(np.ones(shape=(3,5,10))),axis = (0,1)).shape # (10,)

对于熊猫对象,axis = 0代表按行操作,axis = 1代表按列操作。这与numpy定义不同,我们可以检查numpy.docpandas.doc中的定义

I think there is an another way to understand it.

For a np.array,if we want eliminate columns we use axis = 1; if we want eliminate rows, we use axis = 0.

np.mean(np.array(np.ones(shape=(3,5,10))),axis = 0).shape # (5,10)
np.mean(np.array(np.ones(shape=(3,5,10))),axis = 1).shape # (3,10)
np.mean(np.array(np.ones(shape=(3,5,10))),axis = (0,1)).shape # (10,)

For pandas object, axis = 0 stands for row-wise operation and axis = 1 stands for column-wise operation. This is different from numpy by definition, we can check definitions from numpy.doc and pandas.doc


回答 16

我将明确避免使用“按行排列”或“沿列排列”,因为人们可能以完全错误的方式解释它们。

打个比方。直观地,您希望pandas.DataFrame.drop(axis='column')从N列中删除一列,并为您提供(N-1)列。因此,您暂时无需关注行(并从英语词典中删除单词“ row”。)反之亦然,它drop(axis='row')适用于行。

同样,sum(axis='column')可以处理多列,并为您提供1列。同样,sum(axis='row')结果为1行。这与其最简单的定义形式一致,即将数字列表简化为单个数字。

通常,使用axis=column,您可以看到列,在列上工作并获取列。忘记行。

使用axis=row,更改视角并处理行。

0和1只是’row’和’column’的别名。这是矩阵索引的惯例。

I will explicitly avoid using ‘row-wise’ or ‘along the columns’, since people may interpret them in exactly the wrong way.

Analogy first. Intuitively, you would expect that pandas.DataFrame.drop(axis='column') drops a column from N columns and gives you (N – 1) columns. So you can pay NO attention to rows for now (and remove word ‘row’ from your English dictionary.) Vice versa, drop(axis='row') works on rows.

In the same way, sum(axis='column') works on multiple columns and gives you 1 column. Similarly, sum(axis='row') results in 1 row. This is consistent with its simplest form of definition, reducing a list of numbers to a single number.

In general, with axis=column, you see columns, work on columns, and get columns. Forget rows.

With axis=row, change perspective and work on rows.

0 and 1 are just aliases for ‘row’ and ‘column’. It’s the convention of matrix indexing.


回答 17

我也一直在试图找出最后一个小时的轴。以上所有答案中的语言以及文档均无济于事。

要回答我现在所理解的问题,在Pandas中,axis = 1或0表示在应用功能时要保持哪些轴头恒定。

注意:当我说标题时,我指的是索引名称

扩展您的示例:

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+---------
|      X     | 0.626386| 1.52325|
+------------+---------+--------+
|      Y     | 0.626386| 1.52325|
+------------+---------+--------+

对于axis = 1 = columns:我们保持列标题不变,并通过更改数据应用均值函数。为了演示,我们将列标题保持不变:

+------------+---------+--------+
|            |  A      |  B     |

现在我们填充一组A和B值,然后求平均值

|            | 0.626386| 1.52325|  

然后我们填充下一组A和B值并找到平均值

|            | 0.626386| 1.52325|

类似地,对于axis = rows,我们保持行标题不变,并不断更改数据:为了演示,首先修复行标题:

+------------+
|      X     |
+------------+
|      Y     |
+------------+

现在填充第一组X和Y值,然后求平均值

+------------+---------+
|      X     | 0.626386
+------------+---------+
|      Y     | 0.626386
+------------+---------+

然后填充下一组X和Y值,然后求平均值:

+------------+---------+
|      X     | 1.52325 |
+------------+---------+
|      Y     | 1.52325 |
+------------+---------+

综上所述,

当axis = columns时,可以固定列标题并更改数据,这些数据将来自不同的行。

当axis = rows时,将修复行标题并更改数据,这些数据将来自不同的列。

I have been trying to figure out the axis for the last hour as well. The language in all the above answers, and also the documentation is not at all helpful.

To answer the question as I understand it now, in Pandas, axis = 1 or 0 means which axis headers do you want to keep constant when applying the function.

Note: When I say headers, I mean index names

Expanding your example:

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+---------
|      X     | 0.626386| 1.52325|
+------------+---------+--------+
|      Y     | 0.626386| 1.52325|
+------------+---------+--------+

For axis=1=columns : We keep columns headers constant and apply the mean function by changing data. To demonstrate, we keep the columns headers constant as:

+------------+---------+--------+
|            |  A      |  B     |

Now we populate one set of A and B values and then find the mean

|            | 0.626386| 1.52325|  

Then we populate next set of A and B values and find the mean

|            | 0.626386| 1.52325|

Similarly, for axis=rows, we keep row headers constant, and keep changing the data: To demonstrate, first fix the row headers:

+------------+
|      X     |
+------------+
|      Y     |
+------------+

Now populate first set of X and Y values and then find the mean

+------------+---------+
|      X     | 0.626386
+------------+---------+
|      Y     | 0.626386
+------------+---------+

Then populate the next set of X and Y values and then find the mean:

+------------+---------+
|      X     | 1.52325 |
+------------+---------+
|      Y     | 1.52325 |
+------------+---------+

In summary,

When axis=columns, you fix the column headers and change data, which will come from the different rows.

When axis=rows, you fix the row headers and change data, which will come from the different columns.


回答 18

axis = 1,它将明智地求和行,keepdims = True将保持二维。希望对您有帮助。

axis=1 ,It will give the sum row wise,keepdims=True will maintain the 2D dimension. Hope it helps you.


回答 19

这里的许多答案对我有很大帮助!

如果您对axisPython和MARGIN R中(例如在apply函数中),则可能会发现我写过一篇有趣的博客文章:https : //accio.github.io/programming/2020/05/ 19 / numpy-pandas-axis.html

在本质上:

  • 有趣的是,与二维数组相比,使用三维数组更容易理解它们的行为。
  • 在Python包中 numpypandas,sum的axis参数实际上指定numpy,以计算所有可以以array [0,0,…,i,…,0]形式获取的值的平均值所有可能的值。在i的位置固定的情况下重复此过程,其他维度的索引则一个接一个地变化(从最右边的元素开始)。结果是一个n-1维数组。
  • 在R中,MARGINS参数使apply函数计算可以以array [,…,i,…,]的形式获取的所有值的平均值,其中i遍历所有可能的值。迭代完所有i值后,不再重复该过程。因此,结果是一个简单的向量。

Many answers here helped me a lot!

In case you get confused by the different behaviours of axis in Python and MARGIN in R (like in the apply function), you may find a blog post that I wrote of interest: https://accio.github.io/programming/2020/05/19/numpy-pandas-axis.html.

In essence:

  • Their behaviours are, intriguingly, easier to understand with three-dimensional array than with two-dimensional arrays.
  • In Python packages numpy and pandas, the axis parameter in sum actually specifies numpy to calculate the mean of all values that can be fetched in the form of array[0, 0, …, i, …, 0] where i iterates through all possible values. The process is repeated with the position of i fixed and the indices of other dimensions vary one after the other (from the most far-right element). The result is a n-1-dimensional array.
  • In R, the MARGINS parameter let the apply function calculate the mean of all values that can be fetched in the form of array[, … , i, … ,] where i iterates through all possible values. The process is not repeated when all i values have been iterated. Therefore, the result is a simple vector.

回答 20

数组设计为具有所谓的axis = 0,垂直排列的行相对于axis = 1,水平排列的列。轴是指数组的尺寸。

Arrays are designed with so-called axis=0 and rows positioned vertically versus axis=1 and columns positioned horizontally. Axis refers to the dimension of the array.


如何在熊猫中获取数据框的列切片

问题:如何在熊猫中获取数据框的列切片

我从CSV文件加载了一些机器学习数据。前两列是观测值,其余两列是要素。

目前,我执行以下操作:

data = pandas.read_csv('mydata.csv')

它给出了类似的东西:

data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))

我想两个dataframes切片此数据框:包含列一个ab和包含一个列cde

不可能写这样的东西

observations = data[:'c']
features = data['c':]

我不确定最好的方法是什么。我需要一个pd.Panel吗?

顺便说一下,我发现数据帧索引非常不一致:data['a']允许,但data[0]不允许。另一方面,data['a':]不允许,但允许data[0:]。是否有实际原因?如果列是由Int索引的,这确实令人困惑,因为data[0] != data[0:1]

I load some machine learning data from a CSV file. The first 2 columns are observations and the remaining columns are features.

Currently, I do the following:

data = pandas.read_csv('mydata.csv')

which gives something like:

data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))

I’d like to slice this dataframe in two dataframes: one containing the columns a and b and one containing the columns c, d and e.

It is not possible to write something like

observations = data[:'c']
features = data['c':]

I’m not sure what the best method is. Do I need a pd.Panel?

By the way, I find dataframe indexing pretty inconsistent: data['a'] is permitted, but data[0] is not. On the other side, data['a':] is not permitted but data[0:] is. Is there a practical reason for this? This is really confusing if columns are indexed by Int, given that data[0] != data[0:1]


回答 0

2017年答案-熊猫0.20:.ix已弃用。使用.loc

请参阅文档中弃用

.loc使用基于标签的索引来选择行和列。标签是索引或列的值。切片.loc包含最后一个元素。

假设我们有以下的列的数据框中:
foobarquzantcatsatdat

# selects all rows and all columns beginning at 'foo' up to and including 'sat'
df.loc[:, 'foo':'sat']
# foo bar quz ant cat sat

.loc接受与Python列表对行和列所做的相同的切片表示法。切片符号为start:stop:step

# slice from 'foo' to 'cat' by every 2nd column
df.loc[:, 'foo':'cat':2]
# foo quz cat

# slice from the beginning to 'bar'
df.loc[:, :'bar']
# foo bar

# slice from 'quz' to the end by 3
df.loc[:, 'quz'::3]
# quz sat

# attempt from 'sat' to 'bar'
df.loc[:, 'sat':'bar']
# no columns returned

# slice from 'sat' to 'bar'
df.loc[:, 'sat':'bar':-1]
sat cat ant quz bar

# slice notation is syntatic sugar for the slice function
# slice from 'quz' to the end by 2 with slice function
df.loc[:, slice('quz',None, 2)]
# quz cat dat

# select specific columns with a list
# select columns foo, bar and dat
df.loc[:, ['foo','bar','dat']]
# foo bar dat

您可以按行和列进行切片。举例来说,如果你有5列的标签vwxyz

# slice from 'w' to 'y' and 'foo' to 'ant' by 3
df.loc['w':'y', 'foo':'ant':3]
#    foo ant
# w
# x
# y

2017 Answer – pandas 0.20: .ix is deprecated. Use .loc

See the deprecation in the docs

.loc uses label based indexing to select both rows and columns. The labels being the values of the index or the columns. Slicing with .loc includes the last element.

Let’s assume we have a DataFrame with the following columns:
foo, bar, quz, ant, cat, sat, dat.

# selects all rows and all columns beginning at 'foo' up to and including 'sat'
df.loc[:, 'foo':'sat']
# foo bar quz ant cat sat

.loc accepts the same slice notation that Python lists do for both row and columns. Slice notation being start:stop:step

# slice from 'foo' to 'cat' by every 2nd column
df.loc[:, 'foo':'cat':2]
# foo quz cat

# slice from the beginning to 'bar'
df.loc[:, :'bar']
# foo bar

# slice from 'quz' to the end by 3
df.loc[:, 'quz'::3]
# quz sat

# attempt from 'sat' to 'bar'
df.loc[:, 'sat':'bar']
# no columns returned

# slice from 'sat' to 'bar'
df.loc[:, 'sat':'bar':-1]
sat cat ant quz bar

# slice notation is syntatic sugar for the slice function
# slice from 'quz' to the end by 2 with slice function
df.loc[:, slice('quz',None, 2)]
# quz cat dat

# select specific columns with a list
# select columns foo, bar and dat
df.loc[:, ['foo','bar','dat']]
# foo bar dat

You can slice by rows and columns. For instance, if you have 5 rows with labels v, w, x, y, z

# slice from 'w' to 'y' and 'foo' to 'ant' by 3
df.loc['w':'y', 'foo':'ant':3]
#    foo ant
# w
# x
# y

回答 1

注意: .ix自Pandas v0.20起已弃用。您应该改用.loc.iloc,视情况而定。

您要访问的是DataFrame.ix索引。这有点令人困惑(我同意熊猫索引有时会令人困惑!),但是以下内容似乎可以满足您的要求:

>>> df = DataFrame(np.random.rand(4,5), columns = list('abcde'))
>>> df.ix[:,'b':]
      b         c         d         e
0  0.418762  0.042369  0.869203  0.972314
1  0.991058  0.510228  0.594784  0.534366
2  0.407472  0.259811  0.396664  0.894202
3  0.726168  0.139531  0.324932  0.906575

其中.ix [row slice,column slice]是正在解释的内容。有关熊猫索引的更多信息,请访问:http : //pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-advanced

Note: .ix has been deprecated since Pandas v0.20. You should instead use .loc or .iloc, as appropriate.

The DataFrame.ix index is what you want to be accessing. It’s a little confusing (I agree that Pandas indexing is perplexing at times!), but the following seems to do what you want:

>>> df = DataFrame(np.random.rand(4,5), columns = list('abcde'))
>>> df.ix[:,'b':]
      b         c         d         e
0  0.418762  0.042369  0.869203  0.972314
1  0.991058  0.510228  0.594784  0.534366
2  0.407472  0.259811  0.396664  0.894202
3  0.726168  0.139531  0.324932  0.906575

where .ix[row slice, column slice] is what is being interpreted. More on Pandas indexing here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-advanced


回答 2

让我们以seaborn包中的钛酸数据集为例

# Load dataset (pip install seaborn)
>> import seaborn.apionly as sns
>> titanic = sns.load_dataset('titanic')

使用列名

>> titanic.loc[:,['sex','age','fare']]

使用列索引

>> titanic.iloc[:,[2,3,6]]

使用ix(版本低于.20的Pandas版本)

>> titanic.ix[:,[‘sex’,’age’,’fare’]]

要么

>> titanic.ix[:,[2,3,6]]

使用重新索引方法

>> titanic.reindex(columns=['sex','age','fare'])

Lets use the titanic dataset from the seaborn package as an example

# Load dataset (pip install seaborn)
>> import seaborn.apionly as sns
>> titanic = sns.load_dataset('titanic')

using the column names

>> titanic.loc[:,['sex','age','fare']]

using the column indices

>> titanic.iloc[:,[2,3,6]]

using ix (Older than Pandas <.20 version)

>> titanic.ix[:,[‘sex’,’age’,’fare’]]

or

>> titanic.ix[:,[2,3,6]]

using the reindex method

>> titanic.reindex(columns=['sex','age','fare'])

回答 3

另外,给定一个DataFrame

数据

如您的示例所示,如果您只想提取列a和d(即第一列和第四列),则需要从熊猫数据框中获取iloc方法,并且可以非常有效地使用它。您只需要知道要提取的列的索引即可。例如:

>>> data.iloc[:,[0,3]]

会给你

          a         d
0  0.883283  0.100975
1  0.614313  0.221731
2  0.438963  0.224361
3  0.466078  0.703347
4  0.955285  0.114033
5  0.268443  0.416996
6  0.613241  0.327548
7  0.370784  0.359159
8  0.692708  0.659410
9  0.806624  0.875476

Also, Given a DataFrame

data

as in your example, if you would like to extract column a and d only (e.i. the 1st and the 4th column), iloc mothod from the pandas dataframe is what you need and could be used very effectively. All you need to know is the index of the columns you would like to extract. For example:

>>> data.iloc[:,[0,3]]

will give you

          a         d
0  0.883283  0.100975
1  0.614313  0.221731
2  0.438963  0.224361
3  0.466078  0.703347
4  0.955285  0.114033
5  0.268443  0.416996
6  0.613241  0.327548
7  0.370784  0.359159
8  0.692708  0.659410
9  0.806624  0.875476

回答 4

您可以DataFrame通过引用列表中每一列的名称来沿a的列进行切片,如下所示:

data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
data_ab = data[list('ab')]
data_cde = data[list('cde')]

You can slice along the columns of a DataFrame by referring to the names of each column in a list, like so:

data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
data_ab = data[list('ab')]
data_cde = data[list('cde')]

回答 5

如果您来这里是想对两个范围的列进行切片并将它们组合在一起(例如我),则可以执行以下操作

op = df[list(df.columns[0:899]) + list(df.columns[3593:])]
print op

这将创建一个具有前900列和(所有)列> 3593的新数据框(假设您的数据集中有4000列)。

And if you came here looking for slicing two ranges of columns and combining them together (like me) you can do something like

op = df[list(df.columns[0:899]) + list(df.columns[3593:])]
print op

This will create a new dataframe with first 900 columns and (all) columns > 3593 (assuming you have some 4000 columns in your data set).


回答 6

这是您可以使用不同方法进行选择性列切片的方法,包括基于选择性标签,基于索引和基于选择性范围的列切片。

In [37]: import pandas as pd    
In [38]: import numpy as np
In [43]: df = pd.DataFrame(np.random.rand(4,7), columns = list('abcdefg'))

In [44]: df
Out[44]: 
          a         b         c         d         e         f         g
0  0.409038  0.745497  0.890767  0.945890  0.014655  0.458070  0.786633
1  0.570642  0.181552  0.794599  0.036340  0.907011  0.655237  0.735268
2  0.568440  0.501638  0.186635  0.441445  0.703312  0.187447  0.604305
3  0.679125  0.642817  0.697628  0.391686  0.698381  0.936899  0.101806

In [45]: df.loc[:, ["a", "b", "c"]] ## label based selective column slicing 
Out[45]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

In [46]: df.loc[:, "a":"c"] ## label based column ranges slicing 
Out[46]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

In [47]: df.iloc[:, 0:3] ## index based column ranges slicing 
Out[47]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

### with 2 different column ranges, index based slicing: 
In [49]: df[df.columns[0:1].tolist() + df.columns[1:3].tolist()]
Out[49]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

Here’s how you could use different methods to do selective column slicing, including selective label based, index based and the selective ranges based column slicing.

In [37]: import pandas as pd    
In [38]: import numpy as np
In [43]: df = pd.DataFrame(np.random.rand(4,7), columns = list('abcdefg'))

In [44]: df
Out[44]: 
          a         b         c         d         e         f         g
0  0.409038  0.745497  0.890767  0.945890  0.014655  0.458070  0.786633
1  0.570642  0.181552  0.794599  0.036340  0.907011  0.655237  0.735268
2  0.568440  0.501638  0.186635  0.441445  0.703312  0.187447  0.604305
3  0.679125  0.642817  0.697628  0.391686  0.698381  0.936899  0.101806

In [45]: df.loc[:, ["a", "b", "c"]] ## label based selective column slicing 
Out[45]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

In [46]: df.loc[:, "a":"c"] ## label based column ranges slicing 
Out[46]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

In [47]: df.iloc[:, 0:3] ## index based column ranges slicing 
Out[47]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

### with 2 different column ranges, index based slicing: 
In [49]: df[df.columns[0:1].tolist() + df.columns[1:3].tolist()]
Out[49]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

回答 7

相当于

 >>> print(df2.loc[140:160,['Relevance','Title']])
 >>> print(df2.ix[140:160,[3,7]])

Its equivalent

 >>> print(df2.loc[140:160,['Relevance','Title']])
 >>> print(df2.ix[140:160,[3,7]])

回答 8

如果数据框如下所示:

group         name      count
fruit         apple     90
fruit         banana    150
fruit         orange    130
vegetable     broccoli  80
vegetable     kale      70
vegetable     lettuce   125

和输出可能像

   group    name  count
0  fruit   apple     90
1  fruit  banana    150
2  fruit  orange    130

如果您使用逻辑运算符np.logical_not

df[np.logical_not(df['group'] == 'vegetable')]

更多关于

https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.logic.html

其他逻辑运算符

  1. logical_and(x1,x2,/ [,out,where,…])计算x1和x2元素的真值。

  2. logical_or(x1,x2,/ [,out,where,cast,…])计算x1或x2元素的真值。

  3. logical_not(x,/ [,out,where,cast,…])计算非x元素值的真值。
  4. logical_xor(x1,x2,/ [,out,where,..])按元素计算x1 XOR x2的真值。

if Data frame look like that:

group         name      count
fruit         apple     90
fruit         banana    150
fruit         orange    130
vegetable     broccoli  80
vegetable     kale      70
vegetable     lettuce   125

and OUTPUT could be like

   group    name  count
0  fruit   apple     90
1  fruit  banana    150
2  fruit  orange    130

if you use logical operator np.logical_not

df[np.logical_not(df['group'] == 'vegetable')]

more about

https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.logic.html

other logical operators

  1. logical_and(x1, x2, /[, out, where, …]) Compute the truth value of x1 AND x2 element-wise.

  2. logical_or(x1, x2, /[, out, where, casting, …]) Compute the truth value of x1 OR x2 element-wise.

  3. logical_not(x, /[, out, where, casting, …]) Compute the truth value of NOT x element-wise.
  4. logical_xor(x1, x2, /[, out, where, ..]) Compute the truth value of x1 XOR x2, element-wise.

回答 9

假设您需要所有行,则从DataFrame获取列子集的另一种方法是:
data[['a','b']]data[['c','d','e']]
如果要使用数字列索引,可以执行:
data[data.columns[:2]]data[data.columns[2:]]

Another way to get a subset of columns from your DataFrame, assuming you want all the rows, would be to do:
data[['a','b']] and data[['c','d','e']]
If you want to use numerical column indexes you can do:
data[data.columns[:2]] and data[data.columns[2:]]


如何将熊猫系列或索引转换为Numpy数组?

问题:如何将熊猫系列或索引转换为Numpy数组?

您是否知道如何以NumPy数组或python列表的形式获取DataFrame的索引或列?

Do you know how to get the index or column of a DataFrame as a NumPy array or python list?


回答 0

要获取NumPy数组,应使用以下values属性:

In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c']); df
   A  B
a  1  4
b  2  5
c  3  6

In [2]: df.index.values
Out[2]: array(['a', 'b', 'c'], dtype=object)

这样可以访问数据的存储方式,因此无需进行转换。
注意:此属性也可用于其他许多熊猫的对象。

In [3]: df['A'].values
Out[3]: Out[16]: array([1, 2, 3])

要将索引作为列表获取,请调用tolist

In [4]: df.index.tolist()
Out[4]: ['a', 'b', 'c']

同样,对于列。

To get a NumPy array, you should use the values attribute:

In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c']); df
   A  B
a  1  4
b  2  5
c  3  6

In [2]: df.index.values
Out[2]: array(['a', 'b', 'c'], dtype=object)

This accesses how the data is already stored, so there’s no need for a conversion.
Note: This attribute is also available for many other pandas’ objects.

In [3]: df['A'].values
Out[3]: Out[16]: array([1, 2, 3])

To get the index as a list, call tolist:

In [4]: df.index.tolist()
Out[4]: ['a', 'b', 'c']

And similarly, for columns.


回答 1

您可以使用df.index访问索引对象,然后使用来获取列表中的值df.index.tolist()。同样,您可以将其df['col'].tolist()用于Series。

You can use df.index to access the index object and then get the values in a list using df.index.tolist(). Similarly, you can use df['col'].tolist() for Series.


回答 2

熊猫> = 0.24

.values不赞成使用这些方法,而推荐使用这些方法!

从v0.24.0开始,我们将有两个崭新的品牌,从获得与NumPy阵列的优选方法IndexSeriesDataFrame对象:他们是to_numpy().array。关于用法,文档提到:

我们尚未删除或弃用Series.valuesDataFrame.values,但我们强烈建议您使用.array.to_numpy()代替。

有关更多信息,请参见v0.24.0发行说明的这一部分


to_numpy() 方法

df.index.to_numpy()
# array(['a', 'b'], dtype=object)

df['A'].to_numpy()
#  array([1, 4])

默认情况下,返回一个视图。所做的任何修改都会影响原件。

v = df.index.to_numpy()
v[0] = -1

df
    A  B
-1  1  2
b   4  5

如果您需要副本,请使用to_numpy(copy=True);

v = df.index.to_numpy(copy=True)
v[-1] = -123

df
   A  B
a  1  2
b  4  5

请注意,此功能也适用于DataFrames(而不适用于.array)。


array属性
此属性返回一个ExtensionArray支持索引/系列的对象。

pd.__version__
# '0.24.0rc1'

# Setup.
df = pd.DataFrame([[1, 2], [4, 5]], columns=['A', 'B'], index=['a', 'b'])
df

   A  B
a  1  2
b  4  5

df.index.array    
# <PandasArray>
# ['a', 'b']
# Length: 2, dtype: object

df['A'].array
# <PandasArray>
# [1, 4]
# Length: 2, dtype: int64

在这里,可以使用来获取列表list

list(df.index.array)
# ['a', 'b']

list(df['A'].array)
# [1, 4]

或者,直接调用.tolist()

df.index.tolist()
# ['a', 'b']

df['A'].tolist()
# [1, 4]

关于返回的内容,文档中提到,

对于由常规NumPy数组支持的SeriesIndexSeries.array 将返回一个new arrays.PandasArray,它是一个薄的(无副本)包装numpy.ndarrayarrays.PandasArray本身并不是特别有用,但它确实提供了与pandas或第三方库中定义的任何扩展数组相同的接口。

因此,总而言之,.array将返回

  1. 现有ExtensionArray的索引/系列支持,或
  2. 如果有支持该系列的NumPy数组,则将新ExtensionArray对象创建为基础数组上的精简包装。

添加两种新方法的原理
这些功能是在GitHub两个问题GH19954GH23623下进行讨论的结果而添加的

具体来说,文档提到了基本原理:

[…] .values目前尚不清楚返回的值是实际数组,它的某种转换还是熊猫自定义数组之一(如Categorical)。例如,使用PeriodIndex,每次都会.values 生成一个新ndarray的周期对象。[…]

这两个功能旨在提高API的一致性,这是朝正确方向迈出的重要一步。

最后,.values不会在当前版本中弃用,但我希望这种情况将来可能会发生,因此,我敦促用户尽快迁移到较新的API。

pandas >= 0.24

Deprecate your usage of .values in favour of these methods!

From v0.24.0 onwards, we will have two brand spanking new, preferred methods for obtaining NumPy arrays from Index, Series, and DataFrame objects: they are to_numpy(), and .array. Regarding usage, the docs mention:

We haven’t removed or deprecated Series.values or DataFrame.values, but we highly recommend and using .array or .to_numpy() instead.

See this section of the v0.24.0 release notes for more information.


to_numpy() Method

df.index.to_numpy()
# array(['a', 'b'], dtype=object)

df['A'].to_numpy()
#  array([1, 4])

By default, a view is returned. Any modifications made will affect the original.

v = df.index.to_numpy()
v[0] = -1

df
    A  B
-1  1  2
b   4  5

If you need a copy instead, use to_numpy(copy=True);

v = df.index.to_numpy(copy=True)
v[-1] = -123

df
   A  B
a  1  2
b  4  5

Note that this function also works for DataFrames (while .array does not).


array Attribute
This attribute returns an ExtensionArray object that backs the Index/Series.

pd.__version__
# '0.24.0rc1'

# Setup.
df = pd.DataFrame([[1, 2], [4, 5]], columns=['A', 'B'], index=['a', 'b'])
df

   A  B
a  1  2
b  4  5

df.index.array    
# <PandasArray>
# ['a', 'b']
# Length: 2, dtype: object

df['A'].array
# <PandasArray>
# [1, 4]
# Length: 2, dtype: int64

From here, it is possible to get a list using list:

list(df.index.array)
# ['a', 'b']

list(df['A'].array)
# [1, 4]

or, just directly call .tolist():

df.index.tolist()
# ['a', 'b']

df['A'].tolist()
# [1, 4]

Regarding what is returned, the docs mention,

For Series and Indexes backed by normal NumPy arrays, Series.array will return a new arrays.PandasArray, which is a thin (no-copy) wrapper around a numpy.ndarray. arrays.PandasArray isn’t especially useful on its own, but it does provide the same interface as any extension array defined in pandas or by a third-party library.

So, to summarise, .array will return either

  1. The existing ExtensionArray backing the Index/Series, or
  2. If there is a NumPy array backing the series, a new ExtensionArray object is created as a thin wrapper over the underlying array.

Rationale for adding TWO new methods
These functions were added as a result of discussions under two GitHub issues GH19954 and GH23623.

Specifically, the docs mention the rationale:

[…] with .values it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like Categorical). For example, with PeriodIndex, .values generates a new ndarray of period objects each time. […]

These two functions aim to improve the consistency of the API, which is a major step in the right direction.

Lastly, .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API, as soon as you can.


回答 3

如果要处理多索引数据框,则可能只对提取多索引一个名称的列感兴趣。您可以这样做

df.index.get_level_values('name_sub_index')

并且当然name_sub_index必须是FrozenList df.index.names

If you are dealing with a multi-index dataframe, you may be interested in extracting only the column of one name of the multi-index. You can do this as

df.index.get_level_values('name_sub_index')

and of course name_sub_index must be an element of the FrozenList df.index.names


回答 4

从pandas v0.13开始,您还可以使用get_values

df.index.get_values()

Since pandas v0.13 you can also use get_values:

df.index.get_values()

回答 5

我将大熊猫转换dataframelist,然后使用基本list.index()。像这样:

dd = list(zone[0]) #Where zone[0] is some specific column of the table
idx = dd.index(filename[i])

您将索引值设为idx

I converted the pandas dataframe to list and then used the basic list.index(). Something like this:

dd = list(zone[0]) #Where zone[0] is some specific column of the table
idx = dd.index(filename[i])

You have you index value as idx.


回答 6

最近执行此操作的方法是使用.to_numpy()函数。

如果我的数据框的价格为“价格”列,则可以按以下方式进行转换:

priceArray = df['price'].to_numpy()

您还可以将数据类型(例如float或object)作为函数的参数传递

A more recent way to do this is to use the .to_numpy() function.

If I have a dataframe with a column ‘price’, I can convert it as follows:

priceArray = df['price'].to_numpy()

You can also pass the data type, such as float or object, as an argument of the function


回答 7

以下是将dataframe列转换为numpy数组的简单方法。

df = pd.DataFrame(somedict) 
ytrain = df['label']
ytrain_numpy = np.array([x for x in ytrain['label']])

ytrain_numpy是一个numpy数组。

我尝试过,to.numpy()但是给了我以下错误: TypeError:在使用线性SVC进行二进制相关性分类时,不支持类型转换:(dtype(’O’),)。to.numpy()正在将dataFrame转换为numpy数组,但是内部元素的数据类型为list,因此会观察到上述错误。

Below is a simple way to convert dataframe column into numpy array.

df = pd.DataFrame(somedict) 
ytrain = df['label']
ytrain_numpy = np.array([x for x in ytrain['label']])

ytrain_numpy is a numpy array.

I tried with to.numpy() but it gave me the below error: TypeError: no supported conversion for types: (dtype(‘O’),) while doing Binary Relevance classfication using Linear SVC. to.numpy() was converting the dataFrame into numpy array but the inner element’s data type was list because of which the above error was observed.


如何将空列添加到数据框?

问题:如何将空列添加到数据框?

向熊猫DataFrame对象添加空列的最简单方法是什么?我偶然发现的最好的东西是

df['foo'] = df.apply(lambda _: '', axis=1)

有没有那么不合常理的方法?

What’s the easiest way to add an empty column to a pandas DataFrame object? The best I’ve stumbled upon is something like

df['foo'] = df.apply(lambda _: '', axis=1)

Is there a less perverse method?


回答 0

如果我理解正确,则应填写作业:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> df
   A  B
0  1  2
1  2  3
2  3  4
>>> df["C"] = ""
>>> df["D"] = np.nan
>>> df
   A  B C   D
0  1  2   NaN
1  2  3   NaN
2  3  4   NaN

If I understand correctly, assignment should fill:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> df
   A  B
0  1  2
1  2  3
2  3  4
>>> df["C"] = ""
>>> df["D"] = np.nan
>>> df
   A  B C   D
0  1  2   NaN
1  2  3   NaN
2  3  4   NaN

回答 1

为了增加DSM的答案并以这个相关问题为基础,我将该方法分为两种情况:

  • 添加单个列:只需将空值分配给新列,例如 df['C'] = np.nan

  • 添加多个列:我建议使用.reindex(columns=[...]) pandas方法将新列添加到数据框的列索引中。这也适用于使用添加多个新行.reindex(rows=[...])。请注意,较新版本的Pandas(v> 0.20)允许您指定axis关键字,而不是显式分配给columnsrows

这是添加多列的示例:

mydf = mydf.reindex(columns = mydf.columns.tolist() + ['newcol1','newcol2'])

要么

mydf = mydf.reindex(mydf.columns.tolist() + ['newcol1','newcol2'], axis=1)  # version > 0.20.0

您还可以始终将新的(空)数据框连接到现有数据框,但这对我来说并不像pythonic那样:)

To add to DSM’s answer and building on this associated question, I’d split the approach into two cases:

  • Adding a single column: Just assign empty values to the new columns, e.g. df['C'] = np.nan

  • Adding multiple columns: I’d suggest using the .reindex(columns=[...]) method of pandas to add the new columns to the dataframe’s column index. This also works for adding multiple new rows with .reindex(rows=[...]). Note that newer versions of Pandas (v>0.20) allow you to specify an axis keyword rather than explicitly assigning to columns or rows.

Here is an example adding multiple columns:

mydf = mydf.reindex(columns = mydf.columns.tolist() + ['newcol1','newcol2'])

or

mydf = mydf.reindex(mydf.columns.tolist() + ['newcol1','newcol2'], axis=1)  # version > 0.20.0

You can also always concatenate a new (empty) dataframe to the existing dataframe, but that doesn’t feel as pythonic to me :)


回答 2

一个更简单的解决方案是:

df = df.reindex(columns = header_list)                

其中“ header_list”是要显示的标题的列表。

列表中包含的,在数据​​框中尚未找到的所有标头都将添加以下空白单元格。

因此,如果

header_list = ['a','b','c', 'd']

然后将c和d添加为具有空白单元格的列

an even simpler solution is:

df = df.reindex(columns = header_list)                

where “header_list” is a list of the headers you want to appear.

any header included in the list that is not found already in the dataframe will be added with blank cells below.

so if

header_list = ['a','b','c', 'd']

then c and d will be added as columns with blank cells


回答 3

以开始v0.16.0DF.assign()可用于为分配新列(单/多DF。这些列在末尾按字母顺序插入DF

与您希望直接对返回的数据帧执行一系列链接操作的情况相比,与简单分配相比,这变得很有优势。

考虑DF@DSM演示的相同示例:

df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
df
Out[18]:
   A  B
0  1  2
1  2  3
2  3  4

df.assign(C="",D=np.nan)
Out[21]:
   A  B C   D
0  1  2   NaN
1  2  3   NaN
2  3  4   NaN

请注意,这将返回一个包含所有先前列以及新创建列的副本。为了对原件DF进行相应的修改,请像:df = df.assign(...)一样使用它,因为它inplace当前不支持操作。

Starting with v0.16.0, DF.assign() could be used to assign new columns (single/multiple) to a DF. These columns get inserted in alphabetical order at the end of the DF.

This becomes advantageous compared to simple assignment in cases wherein you want to perform a series of chained operations directly on the returned dataframe.

Consider the same DF sample demonstrated by @DSM:

df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
df
Out[18]:
   A  B
0  1  2
1  2  3
2  3  4

df.assign(C="",D=np.nan)
Out[21]:
   A  B C   D
0  1  2   NaN
1  2  3   NaN
2  3  4   NaN

Note that this returns a copy with all the previous columns along with the newly created ones. In order for the original DF to be modified accordingly, use it like : df = df.assign(...) as it does not support inplace operation currently.


回答 4

我喜欢:

df['new'] = pd.Series(dtype='your_required_dtype')

如果数据框为空,则此解决方案可确保不NaN添加仅包含新行的内容。

如果dtype未指定,则较新的Pandas版本会产生DeprecationWarning

I like:

df['new'] = pd.Series(dtype='your_required_dtype')

If you have an empty dataframe, this solution makes sure that no new row containing only NaN is added.

Specifying dtype is not strictly necessary, however newer Pandas versions produce a DeprecationWarning if not specified.


回答 5

如果要从列表中添加列名

df=pd.DataFrame()
a=['col1','col2','col3','col4']
for i in a:
    df[i]=np.nan

if you want to add column name from a list

df=pd.DataFrame()
a=['col1','col2','col3','col4']
for i in a:
    df[i]=np.nan

回答 6

@emunsing的答案对于添加多个列真的很酷,但是我无法在python 2.7中使用它。相反,我发现这可行:

mydf = mydf.reindex(columns = np.append( mydf.columns.values, ['newcol1','newcol2'])

@emunsing’s answer is really cool for adding multiple columns, but I couldn’t get it to work for me in python 2.7. Instead, I found this works:

mydf = mydf.reindex(columns = np.append( mydf.columns.values, ['newcol1','newcol2'])

回答 7

下面的代码解决了“如何向现有数据帧中添加n个空列”的问题。为了将针对类似问题的解决方案集中在一个地方,我在这里添加它。

方法1(使用1-64的列名创建64个其他列)

m = list(range(1,65,1)) 
dd=pd.DataFrame(columns=m)
df.join(dd).replace(np.nan,'') #df is the dataframe that already exists

方法2(使用1-64的列名称创建64个其他列)

df.reindex(df.columns.tolist() + list(range(1,65,1)), axis=1).replace(np.nan,'')

The below code address the question “How do I add n number of empty columns to my existing dataframe”. In the interest of keeping solutions to similar problems in one place, I am adding it here.

Approach 1 (to create 64 additional columns with column names from 1-64)

m = list(range(1,65,1)) 
dd=pd.DataFrame(columns=m)
df.join(dd).replace(np.nan,'') #df is the dataframe that already exists

Approach 2 (to create 64 additional columns with column names from 1-64)

df.reindex(df.columns.tolist() + list(range(1,65,1)), axis=1).replace(np.nan,'')

回答 8

你可以做

df['column'] = None #This works. This will create a new column with None type
df.column = None #This will work only when the column is already present in the dataframe 

You can do

df['column'] = None #This works. This will create a new column with None type
df.column = None #This will work only when the column is already present in the dataframe 

回答 9

可以用来df.insert(index_to_insert_at, column_header, init_value)在特定索引处插入新列。

cost_tbl.insert(1, "col_name", "") 

上面的语句将在第一列之后插入一个空列。

One can use df.insert(index_to_insert_at, column_header, init_value) to insert new column at a specific index.

cost_tbl.insert(1, "col_name", "") 

The above statement would insert an empty Column after the first column.


如何将熊猫数据添加到现有的csv文件中?

问题:如何将熊猫数据添加到现有的csv文件中?

我想知道是否可以使用pandas to_csv()函数将数据框添加到现有的csv文件中。csv文件与加载的数据具有相同的结构。

I want to know if it is possible to use the pandas to_csv() function to add a dataframe to an existing csv file. The csv file has the same structure as the loaded data.


回答 0

您可以在pandas to_csv函数中指定python写入模式。对于追加,它是“ a”。

在您的情况下:

df.to_csv('my_csv.csv', mode='a', header=False)

默认模式为“ w”。

You can specify a python write mode in the pandas to_csv function. For append it is ‘a’.

In your case:

df.to_csv('my_csv.csv', mode='a', header=False)

The default mode is ‘w’.


回答 1

您可以通过在追加模式下打开文件追加到csv :

with open('my_csv.csv', 'a') as f:
    df.to_csv(f, header=False)

如果这是您的csv,请执行以下操作foo.csv

,A,B,C
0,1,2,3
1,4,5,6

如果您阅读了该内容,然后附加,例如df + 6

In [1]: df = pd.read_csv('foo.csv', index_col=0)

In [2]: df
Out[2]:
   A  B  C
0  1  2  3
1  4  5  6

In [3]: df + 6
Out[3]:
    A   B   C
0   7   8   9
1  10  11  12

In [4]: with open('foo.csv', 'a') as f:
             (df + 6).to_csv(f, header=False)

foo.csv 变成:

,A,B,C
0,1,2,3
1,4,5,6
0,7,8,9
1,10,11,12

You can append to a csv by opening the file in append mode:

with open('my_csv.csv', 'a') as f:
    df.to_csv(f, header=False)

If this was your csv, foo.csv:

,A,B,C
0,1,2,3
1,4,5,6

If you read that and then append, for example, df + 6:

In [1]: df = pd.read_csv('foo.csv', index_col=0)

In [2]: df
Out[2]:
   A  B  C
0  1  2  3
1  4  5  6

In [3]: df + 6
Out[3]:
    A   B   C
0   7   8   9
1  10  11  12

In [4]: with open('foo.csv', 'a') as f:
             (df + 6).to_csv(f, header=False)

foo.csv becomes:

,A,B,C
0,1,2,3
1,4,5,6
0,7,8,9
1,10,11,12

回答 2

with open(filename, 'a') as f:
    df.to_csv(f, header=f.tell()==0)
  • 除非存在,否则创建文件,否则追加
  • 如果正在创建文件,则添加标题,否则跳过它
with open(filename, 'a') as f:
    df.to_csv(f, header=f.tell()==0)
  • Create file unless exists, otherwise append
  • Add header if file is being created, otherwise skip it

回答 3

我在一些标头检查保护措施中使用了一个辅助功能,以处理所有问题:

def appendDFToCSV_void(df, csvFilePath, sep=","):
    import os
    if not os.path.isfile(csvFilePath):
        df.to_csv(csvFilePath, mode='a', index=False, sep=sep)
    elif len(df.columns) != len(pd.read_csv(csvFilePath, nrows=1, sep=sep).columns):
        raise Exception("Columns do not match!! Dataframe has " + str(len(df.columns)) + " columns. CSV file has " + str(len(pd.read_csv(csvFilePath, nrows=1, sep=sep).columns)) + " columns.")
    elif not (df.columns == pd.read_csv(csvFilePath, nrows=1, sep=sep).columns).all():
        raise Exception("Columns and column order of dataframe and csv file do not match!!")
    else:
        df.to_csv(csvFilePath, mode='a', index=False, sep=sep, header=False)

A little helper function I use with some header checking safeguards to handle it all:

def appendDFToCSV_void(df, csvFilePath, sep=","):
    import os
    if not os.path.isfile(csvFilePath):
        df.to_csv(csvFilePath, mode='a', index=False, sep=sep)
    elif len(df.columns) != len(pd.read_csv(csvFilePath, nrows=1, sep=sep).columns):
        raise Exception("Columns do not match!! Dataframe has " + str(len(df.columns)) + " columns. CSV file has " + str(len(pd.read_csv(csvFilePath, nrows=1, sep=sep).columns)) + " columns.")
    elif not (df.columns == pd.read_csv(csvFilePath, nrows=1, sep=sep).columns).all():
        raise Exception("Columns and column order of dataframe and csv file do not match!!")
    else:
        df.to_csv(csvFilePath, mode='a', index=False, sep=sep, header=False)

回答 4

最初从pyspark数据帧开始-给定pyspark数据帧中的架构/列类型,我遇到类型转换错误(转换为pandas df然后附加到csv时)

通过将每个df中的所有列都强制为string类型,然后将其附加到csv来解决此问题,如下所示:

with open('testAppend.csv', 'a') as f:
    df2.toPandas().astype(str).to_csv(f, header=False)

Initially starting with a pyspark dataframes – I got type conversion errors (when converting to pandas df’s and then appending to csv) given the schema/column types in my pyspark dataframes

Solved the problem by forcing all columns in each df to be of type string and then appending this to csv as follows:

with open('testAppend.csv', 'a') as f:
    df2.toPandas().astype(str).to_csv(f, header=False)

回答 5

晚了一点,但是如果您多次打开和关闭文件或记录数据,统计信息等,您也可以使用上下文管理器。

from contextlib import contextmanager
import pandas as pd
@contextmanager
def open_file(path, mode):
     file_to=open(path,mode)
     yield file_to
     file_to.close()


##later
saved_df=pd.DataFrame(data)
with open_file('yourcsv.csv','r') as infile:
      saved_df.to_csv('yourcsv.csv',mode='a',header=False)`

A bit late to the party but you can also use a context manager, if you’re opening and closing your file multiple times, or logging data, statistics, etc.

from contextlib import contextmanager
import pandas as pd
@contextmanager
def open_file(path, mode):
     file_to=open(path,mode)
     yield file_to
     file_to.close()


##later
saved_df=pd.DataFrame(data)
with open_file('yourcsv.csv','r') as infile:
      saved_df.to_csv('yourcsv.csv',mode='a',header=False)`

如何从熊猫数据框中删除行列表?

问题:如何从熊猫数据框中删除行列表?

我有一个数据框df:

>>> df
                  sales  discount  net_sales    cogs
STK_ID RPT_Date                                     
600141 20060331   2.709       NaN      2.709   2.245
       20060630   6.590       NaN      6.590   5.291
       20060930  10.103       NaN     10.103   7.981
       20061231  15.915       NaN     15.915  12.686
       20070331   3.196       NaN      3.196   2.710
       20070630   7.907       NaN      7.907   6.459

然后,我想删除具有列表中指示的某些序列号的行,假设此时留在这里[1,2,4],

                  sales  discount  net_sales    cogs
STK_ID RPT_Date                                     
600141 20060331   2.709       NaN      2.709   2.245
       20061231  15.915       NaN     15.915  12.686
       20070630   7.907       NaN      7.907   6.459

如何或什么功能可以做到这一点?

I have a dataframe df :

>>> df
                  sales  discount  net_sales    cogs
STK_ID RPT_Date                                     
600141 20060331   2.709       NaN      2.709   2.245
       20060630   6.590       NaN      6.590   5.291
       20060930  10.103       NaN     10.103   7.981
       20061231  15.915       NaN     15.915  12.686
       20070331   3.196       NaN      3.196   2.710
       20070630   7.907       NaN      7.907   6.459

Then I want to drop rows with certain sequence numbers which indicated in a list, suppose here is [1,2,4], then left:

                  sales  discount  net_sales    cogs
STK_ID RPT_Date                                     
600141 20060331   2.709       NaN      2.709   2.245
       20061231  15.915       NaN     15.915  12.686
       20070630   7.907       NaN      7.907   6.459

How or what function can do that ?


回答 0

使用DataFrame.drop并将其传递给一系列索引标签:

In [65]: df
Out[65]: 
       one  two
one      1    4
two      2    3
three    3    2
four     4    1


In [66]: df.drop(df.index[[1,3]])
Out[66]: 
       one  two
one      1    4
three    3    2

Use DataFrame.drop and pass it a Series of index labels:

In [65]: df
Out[65]: 
       one  two
one      1    4
two      2    3
three    3    2
four     4    1


In [66]: df.drop(df.index[[1,3]])
Out[66]: 
       one  two
one      1    4
three    3    2

回答 1

请注意,当您要插入时,使用“ inplace”命令可能很重要。

df.drop(df.index[[1,3]], inplace=True)

因为您的原始问题没有返回任何内容,所以应使用此命令。 http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.drop.html

Note that it may be important to use the “inplace” command when you want to do the drop in line.

df.drop(df.index[[1,3]], inplace=True)

Because your original question is not returning anything, this command should be used. http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.drop.html


回答 2

如果DataFrame很大,并且要删除的行数也很大,那么按索引df.drop(df.index[])进行简单的删除将花费太多时间。

在我的情况下,我有一个带的浮点数的多索引DataFrame 100M rows x 3 cols,我需要从中删除10k行。与直觉相反,我发现最快的方法是take其余行。

让我们indexes_to_drop作为要放置的位置索引数组([1, 2, 4]在问题中)。

indexes_to_keep = set(range(df.shape[0])) - set(indexes_to_drop)
df_sliced = df.take(list(indexes_to_keep))

就我而言,这花费了20.5s,而简单的df.drop花费5min 27s了很多内存。所得的DataFrame是相同的。

If the DataFrame is huge, and the number of rows to drop is large as well, then simple drop by index df.drop(df.index[]) takes too much time.

In my case, I have a multi-indexed DataFrame of floats with 100M rows x 3 cols, and I need to remove 10k rows from it. The fastest method I found is, quite counterintuitively, to take the remaining rows.

Let indexes_to_drop be an array of positional indexes to drop ([1, 2, 4] in the question).

indexes_to_keep = set(range(df.shape[0])) - set(indexes_to_drop)
df_sliced = df.take(list(indexes_to_keep))

In my case this took 20.5s, while the simple df.drop took 5min 27s and consumed a lot of memory. The resulting DataFrame is the same.


回答 3

您还可以传递给DataFrame.drop标签本身(而不是索引标签系列):

In[17]: df
Out[17]: 
            a         b         c         d         e
one  0.456558 -2.536432  0.216279 -1.305855 -0.121635
two -1.015127 -0.445133  1.867681  2.179392  0.518801

In[18]: df.drop('one')
Out[18]: 
            a         b         c         d         e
two -1.015127 -0.445133  1.867681  2.179392  0.518801

等效于:

In[19]: df.drop(df.index[[0]])
Out[19]: 
            a         b         c         d         e
two -1.015127 -0.445133  1.867681  2.179392  0.518801

You can also pass to DataFrame.drop the label itself (instead of Series of index labels):

In[17]: df
Out[17]: 
            a         b         c         d         e
one  0.456558 -2.536432  0.216279 -1.305855 -0.121635
two -1.015127 -0.445133  1.867681  2.179392  0.518801

In[18]: df.drop('one')
Out[18]: 
            a         b         c         d         e
two -1.015127 -0.445133  1.867681  2.179392  0.518801

Which is equivalent to:

In[19]: df.drop(df.index[[0]])
Out[19]: 
            a         b         c         d         e
two -1.015127 -0.445133  1.867681  2.179392  0.518801

回答 4

我以一种简单的方式解决了这一问题-仅需两个步骤。

步骤1:首先形成包含不需要的行/数据的数据框。

步骤2:使用此不需要的数据框的索引从原始数据框删除行。

例:

假设您有一个数据框df,其中包括“ Age”的整数列,该列是整数。现在假设您要删除所有以“年龄”为负数的行。

步骤1:df_age_negative = df [df [‘Age’] <0]

步骤2:df = df.drop(df_age_negative.index,axis = 0)

希望这会更简单并且对您有所帮助。

I solved this in a simpler way – just in 2 steps.

  1. Make a dataframe with unwanted rows/data.

  2. Use the index of this unwanted dataframe to drop the rows from the original dataframe.

Example:
Suppose you have a dataframe df which as many columns including ‘Age’ which is an integer. Now let’s say you want to drop all the rows with ‘Age’ as negative number.

df_age_negative = df[ df['Age'] < 0 ] # Step 1
df = df.drop(df_age_negative.index, axis=0) # Step 2

Hope this is much simpler and helps you.


回答 5

如果要删除具有index的行x,我将执行以下操作:

df = df[df.index != x]

如果我想删除多个索引(例如,这些索引在list中unwanted_indices),则可以执行以下操作:

desired_indices = [i for i in len(df.index) if i not in unwanted_indices]
desired_df = df.iloc[desired_indices]

If I want to drop a row which has let’s say index x, I would do the following:

df = df[df.index != x]

If I would want to drop multiple indices (say these indices are in the list unwanted_indices), I would do:

desired_indices = [i for i in len(df.index) if i not in unwanted_indices]
desired_df = df.iloc[desired_indices]

回答 6

我想展示一些具体的例子。假设您在某些行中有许多重复的条目。如果您有字符串条目,则可以轻松地使用字符串方法来查找所有要删除的索引。

ind_drop = df[df['column_of_strings'].apply(lambda x: x.startswith('Keyword'))].index

现在使用它们的索引删除这些行

new_df = df.drop(ind_drop)

Here is a bit specific example, I would like to show. Say you have many duplicate entries in some of your rows. If you have string entries you could easily use string methods to find all indexes to drop.

ind_drop = df[df['column_of_strings'].apply(lambda x: x.startswith('Keyword'))].index

And now to drop those rows using their indexes

new_df = df.drop(ind_drop)

回答 7

在对@ theodros-zelleke的答案的评论中,@ j-jones询问了如果索引不是唯一的怎么办。我不得不处理这种情况。我要做的是在我叫drop()la 之前重命名索引中的重复项:

dropped_indexes = <determine-indexes-to-drop>
df.index = rename_duplicates(df.index)
df.drop(df.index[dropped_indexes], inplace=True)

rename_duplicates()我定义的函数在哪里,它通过了index元素并重命名了重复项。我使用了与pd.read_csv()在列上相同的重命名模式,即,"%s.%d" % (name, count)其中name行的名称和count它以前发生过的次数。

In a comment to @theodros-zelleke’s answer, @j-jones asked about what to do if the index is not unique. I had to deal with such a situation. What I did was to rename the duplicates in the index before I called drop(), a la:

dropped_indexes = <determine-indexes-to-drop>
df.index = rename_duplicates(df.index)
df.drop(df.index[dropped_indexes], inplace=True)

where rename_duplicates() is a function I defined that went through the elements of index and renamed the duplicates. I used the same renaming pattern as pd.read_csv() uses on columns, i.e., "%s.%d" % (name, count), where name is the name of the row and count is how many times it has occurred previously.


回答 8

如上所述,从布尔值确定索引,例如

df[df['column'].isin(values)].index

与使用此方法确定索引相比,可能会占用更多的内存

pd.Index(np.where(df['column'].isin(values))[0])

像这样应用

df.drop(pd.Index(np.where(df['column'].isin(values))[0]), inplace = True)

当处理大数据帧和有限的内存时,此方法很有用。

Determining the index from the boolean as described above e.g.

df[df['column'].isin(values)].index

can be more memory intensive than determining the index using this method

pd.Index(np.where(df['column'].isin(values))[0])

applied like so

df.drop(pd.Index(np.where(df['column'].isin(values))[0]), inplace = True)

This method is useful when dealing with large dataframes and limited memory.


回答 9

仅使用索引arg删除行:

df.drop(index = 2, inplace = True)

对于多行:

df.drop(index=[1,3], inplace = True)

Use only the Index arg to drop row:-

df.drop(index = 2, inplace = True)

For multiple rows:-

df.drop(index=[1,3], inplace = True)

回答 10

考虑一个示例数据框

df =     
index    column1
0           00
1           10
2           20
3           30

我们要删除第二和第三索引行。

方法1:

df = df.drop(df.index[2,3])
 or 
df.drop(df.index[2,3],inplace=True)
print(df)

df =     
index    column1
0           00
3           30

 #This approach removes the rows as we wanted but the index remains unordered

方法2

df.drop(df.index[2,3],inplace=True,ignore_index=True)
print(df)
df =     
index    column1
0           00
1           30
#This approach removes the rows as we wanted and resets the index. 

Consider an example dataframe

df =     
index    column1
0           00
1           10
2           20
3           30

we want to drop 2nd and 3rd index rows.

Approach 1:

df = df.drop(df.index[2,3])
 or 
df.drop(df.index[2,3],inplace=True)
print(df)

df =     
index    column1
0           00
3           30

 #This approach removes the rows as we wanted but the index remains unordered

Approach 2

df.drop(df.index[2,3],inplace=True,ignore_index=True)
print(df)
df =     
index    column1
0           00
1           30
#This approach removes the rows as we wanted and resets the index. 

熊猫read_csv low_memory和dtype选项

问题:熊猫read_csv low_memory和dtype选项

打电话时

df = pd.read_csv('somefile.csv')

我得到:

/Users/josh/anaconda/envs/py27/lib/python2.7/site-packages/pandas/io/parsers.py:1130:DtypeWarning:列(4,5,7,16)具有混合类型。在导入时指定dtype选项,或将low_memory = False设置为false。

为什么dtype选项与关联low_memory,为什么使它False有助于解决此问题?

When calling

df = pd.read_csv('somefile.csv')

I get:

/Users/josh/anaconda/envs/py27/lib/python2.7/site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype option on import or set low_memory=False.

Why is the dtype option related to low_memory, and why would making it False help with this problem?


回答 0

不推荐使用的low_memory选项

low_memory选项未正确弃用,但应该正确使用,因为它实际上没有做任何不同的事情[ 来源 ]

收到此low_memory警告的原因是因为猜测每列的dtypes非常需要内存。熊猫尝试通过分析每列中的数据来确定要设置的dtype。

Dtype猜测(非常糟糕)

一旦读取了整个文件,熊猫便只能确定列应具有的dtype。这意味着在读取整个文件之前,无法真正解析任何内容,除非您冒着在读取最后一个值时不得不更改该列的dtype的风险。

考虑一个文件的示例,该文件具有一个名为user_id的列。它包含1000万行,其中user_id始终是数字。由于熊猫不能只知道数字,因此它可能会一直保留为原始字符串,直到它读取了整个文件。

指定dtypes(应该总是这样做)

dtype={'user_id': int}

pd.read_csv()呼叫将使大熊猫知道它开始读取文件时,认为这是唯一的整数。

还值得注意的是,如果文件的最后一行将被"foobar"写入user_id列中,那么如果指定了上面的dtype,则加载将崩溃。

定义dtypes时会中断的中断数据示例

import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO


csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""
sio = StringIO(csvdata)
pd.read_csv(sio, dtype={"user_id": int, "username": "string"})

ValueError: invalid literal for long() with base 10: 'foobar'

dtypes通常是一个numpy的东西,请在这里阅读有关它们的更多信息:http ://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

存在哪些dtype?

我们可以访问numpy dtypes:float,int,bool,timedelta64 [ns]和datetime64 [ns]。请注意,numpy日期/时间dtypes 识别时区。

熊猫通过自己的方式扩展了这套dtypes:

‘datetime64 [ns,]’这是一个时区感知的时间戳。

‘category’本质上是一个枚举(以整数键表示的字符串以保存

‘period []’不要与timedelta混淆,这些对象实际上是固定在特定时间段的

“稀疏”,“ Sparse [int]”,“ Sparse [float]”用于稀疏数据或“其中有很多漏洞的数据”,而不是在数据框中保存NaN或None,它忽略了对象,从而节省了空间。

“间隔”本身是一个主题,但其主要用途是用于索引。在这里查看更多

与numpy变体不同,“ Int8”,“ Int16”,“ Int32”,“ Int64”,“ UInt8”,“ UInt16”,“ UInt32”,“ UInt64”都是可为空的熊猫特定整数。

‘string’是用于处理字符串数据的特定dtype,可访问.str系列中的属性。

‘boolean’类似于numpy’bool’,但它也支持丢失数据。

在此处阅读完整的参考:

熊猫DType参考

陷阱,注意事项,笔记

设置dtype=object将使上面的警告静音,但不会使其更有效地使用内存,仅在有任何处理时才有效。

设置dtype=unicode不会做任何事情,因为对于numpy,a unicode表示为object

转换器的使用

@sparrow正确指出了转换器的用法,以避免在遇到'foobar'指定为的列时遇到大熊猫int。我想补充一点,转换器在熊猫中使用时确实很笨重且效率低下,应该作为最后的手段使用。这是因为read_csv进程是单个进程。

CSV文件可以逐行处理,因此可以通过简单地将文件切成段并运行多个进程来由多个转换器并行更有效地进行处理,而这是熊猫所不支持的。但这是一个不同的故事。

The deprecated low_memory option

The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently[source]

The reason you get this low_memory warning is because guessing dtypes for each column is very memory demanding. Pandas tries to determine what dtype to set by analyzing the data in each column.

Dtype Guessing (very bad)

Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read unless you risk having to change the dtype of that column when you read the last value.

Consider the example of one file which has a column called user_id. It contains 10 million rows where the user_id is always numbers. Since pandas cannot know it is only numbers, it will probably keep it as the original strings until it has read the whole file.

Specifying dtypes (should always be done)

adding

dtype={'user_id': int}

to the pd.read_csv() call will make pandas know when it starts reading the file, that this is only integers.

Also worth noting is that if the last line in the file would have "foobar" written in the user_id column, the loading would crash if the above dtype was specified.

Example of broken data that breaks when dtypes are defined

import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO


csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""
sio = StringIO(csvdata)
pd.read_csv(sio, dtype={"user_id": int, "username": "string"})

ValueError: invalid literal for long() with base 10: 'foobar'

dtypes are typically a numpy thing, read more about them here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

What dtypes exists?

We have access to numpy dtypes: float, int, bool, timedelta64[ns] and datetime64[ns]. Note that the numpy date/time dtypes are not time zone aware.

Pandas extends this set of dtypes with its own:

‘datetime64[ns, ]’ Which is a time zone aware timestamp.

‘category’ which is essentially an enum (strings represented by integer keys to save

‘period[]’ Not to be confused with a timedelta, these objects are actually anchored to specific time periods

‘Sparse’, ‘Sparse[int]’, ‘Sparse[float]’ is for sparse data or ‘Data that has a lot of holes in it’ Instead of saving the NaN or None in the dataframe it omits the objects, saving space.

‘Interval’ is a topic of its own but its main use is for indexing. See more here

‘Int8’, ‘Int16’, ‘Int32’, ‘Int64’, ‘UInt8’, ‘UInt16’, ‘UInt32’, ‘UInt64’ are all pandas specific integers that are nullable, unlike the numpy variant.

‘string’ is a specific dtype for working with string data and gives access to the .str attribute on the series.

‘boolean’ is like the numpy ‘bool’ but it also supports missing data.

Read the complete reference here:

Pandas dtype reference

Gotchas, caveats, notes

Setting dtype=object will silence the above warning, but will not make it more memory efficient, only process efficient if anything.

Setting dtype=unicode will not do anything, since to numpy, a unicode is represented as object.

Usage of converters

@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the read_csv process is a single process.

CSV files can be processed line by line and thus can be processed by multiple converters in parallel more efficiently by simply cutting the file into segments and running multiple processes, something that pandas does not support. But this is a different story.


回答 1

尝试:

dashboard_df = pd.read_csv(p_file, sep=',', error_bad_lines=False, index_col=False, dtype='unicode')

根据熊猫文件:

dtype:类型名称或列的字典->类型

至于low_memory,默认情况下为True 尚未记录。我认为这无关紧要。该错误消息是通用的,因此无论如何您都无需弄混low_memory。希望这会有所帮助,如果您还有其他问题,请告诉我

Try:

dashboard_df = pd.read_csv(p_file, sep=',', error_bad_lines=False, index_col=False, dtype='unicode')

According to the pandas documentation:

dtype : Type name or dict of column -> type

As for low_memory, it’s True by default and isn’t yet documented. I don’t think its relevant though. The error message is generic, so you shouldn’t need to mess with low_memory anyway. Hope this helps and let me know if you have further problems


回答 2

df = pd.read_csv('somefile.csv', low_memory=False)

这应该可以解决问题。从CSV读取180万行时,出现了完全相同的错误。

df = pd.read_csv('somefile.csv', low_memory=False)

This should solve the issue. I got exactly the same error, when reading 1.8M rows from a CSV.


回答 3

如firelynx先前所述,如果显式指定了dtype并且存在与该dtype不兼容的混合数据,则加载将崩溃。我使用像这样的转换器作为变通方法来更改具有不兼容数据类型的值,以便仍然可以加载数据。

def conv(val):
    if not val:
        return 0    
    try:
        return np.float64(val)
    except:        
        return np.float64(0)

df = pd.read_csv(csv_file,converters={'COL_A':conv,'COL_B':conv})

As mentioned earlier by firelynx if dtype is explicitly specified and there is mixed data that is not compatible with that dtype then loading will crash. I used a converter like this as a workaround to change the values with incompatible data type so that the data could still be loaded.

def conv(val):
    if not val:
        return 0    
    try:
        return np.float64(val)
    except:        
        return np.float64(0)

df = pd.read_csv(csv_file,converters={'COL_A':conv,'COL_B':conv})

回答 4

我有一个约400MB的文件类似的问题。设置low_memory=False对我有用。首先做一些简单的事情,我将检查您的数据帧不大于系统内存,重新启动,清除RAM,然后再继续。如果您仍然遇到错误,则值得确保您的.csv文件正常,请在Excel中快速查看并确保没有明显的损坏。原始数据损坏可能会给企业造成严重破坏。

I had a similar issue with a ~400MB file. Setting low_memory=False did the trick for me. Do the simple things first,I would check that your dataframe isn’t bigger than your system memory, reboot, clear the RAM before proceeding. If you’re still running into errors, its worth making sure your .csv file is ok, take a quick look in Excel and make sure there’s no obvious corruption. Broken original data can wreak havoc…


回答 5

处理巨大的csv文件(600万行)时,我遇到了类似的问题。我遇到了三个问题:1.文件包含奇怪的字符(使用编码修复)2.未指定数据类型(使用dtype属性修复)3.使用上述方法,我仍然面临与file_format相关的问题,即根据文件名定义(使用try ..固定,..除外)

df = pd.read_csv(csv_file,sep=';', encoding = 'ISO-8859-1',
                 names=['permission','owner_name','group_name','size','ctime','mtime','atime','filename','full_filename'],
                 dtype={'permission':str,'owner_name':str,'group_name':str,'size':str,'ctime':object,'mtime':object,'atime':object,'filename':str,'full_filename':str,'first_date':object,'last_date':object})

try:
    df['file_format'] = [Path(f).suffix[1:] for f in df.filename.tolist()]
except:
    df['file_format'] = ''

I was facing a similar issue when processing a huge csv file (6 million rows). I had three issues: 1. the file contained strange characters (fixed using encoding) 2. the datatype was not specified (fixed using dtype property) 3. Using the above I still faced an issue which was related with the file_format that could not be defined based on the filename (fixed using try .. except..)

df = pd.read_csv(csv_file,sep=';', encoding = 'ISO-8859-1',
                 names=['permission','owner_name','group_name','size','ctime','mtime','atime','filename','full_filename'],
                 dtype={'permission':str,'owner_name':str,'group_name':str,'size':str,'ctime':object,'mtime':object,'atime':object,'filename':str,'full_filename':str,'first_date':object,'last_date':object})

try:
    df['file_format'] = [Path(f).suffix[1:] for f in df.filename.tolist()]
except:
    df['file_format'] = ''

回答 6

它在low_memory = False导入DataFrame时对我有用。这就是对我有用的所有更改:

df = pd.read_csv('export4_16.csv',low_memory=False)

It worked for me with low_memory = False while importing a DataFrame. That is all the change that worked for me:

df = pd.read_csv('export4_16.csv',low_memory=False)

如何使用熊猫存储数据框

问题:如何使用熊猫存储数据框

现在,CSV每次运行脚本时,我都会导入一个相当大的数据框。是否有一个很好的解决方案,可以使数据帧在两次运行之间保持持续可用,因此我不必花费所有时间等待脚本运行?

Right now I’m importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs so I don’t have to spend all that time waiting for the script to run?


回答 0

最简单的方法是使用以下方法将其腌制to_pickle

df.to_pickle(file_name)  # where to save it, usually as a .pkl

然后您可以使用以下方法将其加载回:

df = pd.read_pickle(file_name)

注意:在0.11.1 save和之前,load这样做是唯一的方法(现在已弃用它们,to_pickleread_pickle分别赞成和)。


另一个流行的选择是使用HDF5pytables),它为大型数据集提供了非常快速的访问时间:

store = HDFStore('store.h5')

store['df'] = df  # save it
store['df']  # load it

食谱中讨论了更高级的策略。


从0.13开始,还有msgpack,它可能对于互操作性更好,作为JSON的更快替代品,或者如果您有python对象/文本繁重的数据(请参阅此问题)。

The easiest way is to pickle it using to_pickle:

df.to_pickle(file_name)  # where to save it, usually as a .pkl

Then you can load it back using:

df = pd.read_pickle(file_name)

Note: before 0.11.1 save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).


Another popular choice is to use HDF5 (pytables) which offers very fast access times for large datasets:

store = HDFStore('store.h5')

store['df'] = df  # save it
store['df']  # load it

More advanced strategies are discussed in the cookbook.


Since 0.13 there’s also msgpack which may be be better for interoperability, as a faster alternative to JSON, or if you have python object/text-heavy data (see this question).


回答 1

尽管已经有了一些答案,但是我发现它们之间进行了很好的比较,他们尝试了几种方法来序列化Pandas DataFrame:有效存储Pandas DataFrames

他们比较:

  • pickle:原始ASCII数据格式
  • cPickle,一个C库
  • pickle-p2:使用较新的二进制格式
  • json:standardlib json库
  • json-no-index:类似于json,但没有索引
  • msgpack:二进制JSON替代
  • CSV
  • hdfstore:HDF5存储格式

在他们的实验中,他们使用分别测试的两列来序列化1,000,000行的DataFrame:一个带有文本数据,另一个带有数字。他们的免责声明说:

您不应相信以下内容会泛化您的数据。您应该查看自己的数据并自己运行基准测试

他们所参考的测试源代码可在线获得。由于此代码无法直接运行,因此我做了一些小的更改,您可以在此处进行更改:serialize.py, 我得到以下结果:

他们还提到,通过将文本数据转换为分类数据,序列化要快得多。在他们的测试中大约快10倍(另请参见测试代码)。

编辑:腌制时间比CSV更长,可以通过使用的数据格式来解释。默认情况下,pickle使用可打印的ASCII表示形式,该表示形式会生成更大的数据集。从图中可以看出,使用较新的二进制数据格式(版本2 pickle-p2)的pickle的加载时间要短得多。

其他一些参考:

Although there are already some answers I found a nice comparison in which they tried several ways to serialize Pandas DataFrames: Efficiently Store Pandas DataFrames.

They compare:

  • pickle: original ASCII data format
  • cPickle, a C library
  • pickle-p2: uses the newer binary format
  • json: standardlib json library
  • json-no-index: like json, but without index
  • msgpack: binary JSON alternative
  • CSV
  • hdfstore: HDF5 storage format

In their experiment, they serialize a DataFrame of 1,000,000 rows with the two columns tested separately: one with text data, the other with numbers. Their disclaimer says:

You should not trust that what follows generalizes to your data. You should look at your own data and run benchmarks yourself

The source code for the test which they refer to is available online. Since this code did not work directly I made some minor changes, which you can get here: serialize.py I got the following results:

They also mention that with the conversion of text data to categorical data the serialization is much faster. In their test about 10 times as fast (also see the test code).

Edit: The higher times for pickle than CSV can be explained by the data format used. By default pickle uses a printable ASCII representation, which generates larger data sets. As can be seen from the graph however, pickle using the newer binary data format (version 2, pickle-p2) has much lower load times.

Some other references:


回答 2

如果我理解正确,那么您已经在使用,pandas.read_csv()但是想加快开发过程,这样就不必在每次编辑脚本时都加载文件,对吗?我有一些建议:

  1. pandas.read_csv(..., nrows=1000)在进行开发时,您只能加载CSV文件的一部分,而仅用于加载表的最高位

  2. 使用ipython进行交互式会话,以便在编辑和重新加载脚本时将pandas表保留在内存中。

  3. 将csv转换为HDF5表

  4. 更新了用法,DataFrame.to_feather()pd.read_feather()以R兼容的羽毛二进制格式存储了数据,该格式超级快(在我手中,比pandas.to_pickle()数字数据要快一些,而字符串数据要快得多)。

您可能也对stackoverflow上的这个答案感兴趣。

If I understand correctly, you’re already using pandas.read_csv() but would like to speed up the development process so that you don’t have to load the file in every time you edit your script, is that right? I have a few recommendations:

  1. you could load in only part of the CSV file using pandas.read_csv(..., nrows=1000) to only load the top bit of the table, while you’re doing the development

  2. use ipython for an interactive session, such that you keep the pandas table in memory as you edit and reload your script.

  3. convert the csv to an HDF5 table

  4. updated use DataFrame.to_feather() and pd.read_feather() to store data in the R-compatible feather binary format that is super fast (in my hands, slightly faster than pandas.to_pickle() on numeric data and much faster on string data).

You might also be interested in this answer on stackoverflow.


回答 3

泡菜很好!

import pandas as pd
df.to_pickle('123.pkl')    #to save the dataframe, df to 123.pkl
df1 = pd.read_pickle('123.pkl') #to load 123.pkl back to the dataframe df

Pickle works good!

import pandas as pd
df.to_pickle('123.pkl')    #to save the dataframe, df to 123.pkl
df1 = pd.read_pickle('123.pkl') #to load 123.pkl back to the dataframe df

回答 4

您可以使用羽毛格式文件。非常快。

df.to_feather('filename.ft')

You can use feather format file. It is extremely fast.

df.to_feather('filename.ft')

回答 5

熊猫数据框具有to_pickle对保存数据框有用的功能:

import pandas as pd

a = pd.DataFrame({'A':[0,1,0,1,0],'B':[True, True, False, False, False]})
print a
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

a.to_pickle('my_file.pkl')

b = pd.read_pickle('my_file.pkl')
print b
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

Pandas DataFrames have the to_pickle function which is useful for saving a DataFrame:

import pandas as pd

a = pd.DataFrame({'A':[0,1,0,1,0],'B':[True, True, False, False, False]})
print a
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

a.to_pickle('my_file.pkl')

b = pd.read_pickle('my_file.pkl')
print b
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

回答 6

如前所述,有不同的选项和文件格式(HDF5JSONCSVparquetSQL)来存储数据帧。但是,pickle不是一流的公民(取决于您的设置),因为:

  1. pickle是潜在的安全风险。形成picklePython文档

警告pickle模块对于错误或恶意构建的数据并不安全。切勿挑剔从不可信或未经身份验证的来源收到的数据。

  1. pickle是慢的。在此处此处找到基准。

根据您的设置/用法,两个限制均不适用,但我不建议您pickle将其作为熊猫数据框的默认持久性。

As already mentioned there are different options and file formats (HDF5, JSON, CSV, parquet, SQL) to store a data frame. However, pickle is not a first-class citizen (depending on your setup), because:

  1. pickle is a potential security risk. Form the Python documentation for pickle:

Warning The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

  1. pickle is slow. Find here and here benchmarks.

Depending on your setup/usage both limitations do not apply, but I would not recommend pickle as the default persistence for pandas data frames.


回答 7

数字数据的文件格式非常快

我更喜欢使用numpy文件,因为它们快速且易于使用。这是一个简单的基准,用于保存和加载具有1百万点的1列的数据框。

import numpy as np
import pandas as pd

num_dict = {'voltage': np.random.rand(1000000)}
num_df = pd.DataFrame(num_dict)

使用ipython的%%timeit魔术功能

%%timeit
with open('num.npy', 'wb') as np_file:
    np.save(np_file, num_df)

输出是

100 loops, best of 3: 5.97 ms per loop

将数据加载回数据框

%%timeit
with open('num.npy', 'rb') as np_file:
    data = np.load(np_file)

data_df = pd.DataFrame(data)

输出是

100 loops, best of 3: 5.12 ms per loop

不错!

缺点

如果您使用python 2保存numpy文件,然后尝试使用python 3打开(反之亦然),则会出现问题。

Numpy file formats are pretty fast for numerical data

I prefer to use numpy files since they’re fast and easy to work with. Here’s a simple benchmark for saving and loading a dataframe with 1 column of 1million points.

import numpy as np
import pandas as pd

num_dict = {'voltage': np.random.rand(1000000)}
num_df = pd.DataFrame(num_dict)

using ipython’s %%timeit magic function

%%timeit
with open('num.npy', 'wb') as np_file:
    np.save(np_file, num_df)

the output is

100 loops, best of 3: 5.97 ms per loop

to load the data back into a dataframe

%%timeit
with open('num.npy', 'rb') as np_file:
    data = np.load(np_file)

data_df = pd.DataFrame(data)

the output is

100 loops, best of 3: 5.12 ms per loop

NOT BAD!

CONS

There’s a problem if you save the numpy file using python 2 and then try opening using python 3 (or vice versa).


回答 8

https://docs.python.org/3/library/pickle.html

泡菜协议格式:

协议版本0是原始的“人类可读”协议,并且与Python的早期版本向后兼容。

协议版本1是旧的二进制格式,也与Python的早期版本兼容。

协议版本2是在Python 2.3中引入的。它提供了新型类的更有效的酸洗。有关协议2带来的改进的信息,请参阅PEP 307。

协议版本3是在Python 3.0中添加的。它具有对字节对象的显式支持,并且不能被Python 2.x取消选择。这是默认协议,当需要与其他Python 3版本兼容时,建议使用该协议。

协议版本4是在Python 3.4中添加的。它增加了对超大型对象的支持,腌制更多种类的对象以及一些数据格式优化。有关协议4带来的改进的信息,请参阅PEP 3154。

https://docs.python.org/3/library/pickle.html

The pickle protocol formats:

Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.

Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.

Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.

Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol, and the recommended protocol when compatibility with other Python 3 versions is required.

Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. Refer to PEP 3154 for information about improvements brought by protocol 4.


回答 9

pyarrow跨版本的兼容性

总体上已经转向了pyarrow / feather(来自pandas / msgpack的弃用警告)。但是,我对规范中具有瞬时特性的 pyarrow提出了挑战,使用pyarrow 0.15.1序列化的数据无法使用0.16.0 ARROW-7961进行反序列化。我正在使用序列化来使用Redis,因此必须使用二进制编码。

我已经重新测试了各种选项(使用jupyter笔记本)

import sys, pickle, zlib, warnings, io
class foocls:
    def pyarrow(out): return pa.serialize(out).to_buffer().to_pybytes()
    def msgpack(out): return out.to_msgpack()
    def pickle(out): return pickle.dumps(out)
    def feather(out): return out.to_feather(io.BytesIO())
    def parquet(out): return out.to_parquet(io.BytesIO())

warnings.filterwarnings("ignore")
for c in foocls.__dict__.values():
    sbreak = True
    try:
        c(out)
        print(c.__name__, "before serialization", sys.getsizeof(out))
        print(c.__name__, sys.getsizeof(c(out)))
        %timeit -n 50 c(out)
        print(c.__name__, "zlib", sys.getsizeof(zlib.compress(c(out))))
        %timeit -n 50 zlib.compress(c(out))
    except TypeError as e:
        if "not callable" in str(e): sbreak = False
        else: raise
    except (ValueError) as e: print(c.__name__, "ERROR", e)
    finally: 
        if sbreak: print("=+=" * 30)        
warnings.filterwarnings("default")

对于我的数据框具有以下结果(在outjupyter变量中)

pyarrow before serialization 533366
pyarrow 120805
1.03 ms ± 43.9 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
pyarrow zlib 20517
2.78 ms ± 81.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
msgpack before serialization 533366
msgpack 109039
1.74 ms ± 72.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
msgpack zlib 16639
3.05 ms ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
pickle before serialization 533366
pickle 142121
733 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
pickle zlib 29477
3.81 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
feather ERROR feather does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
parquet ERROR Nested column branch had multiple children: struct<x: double, y: double>
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=

羽毛和镶木地板不适用于我的数据框。我将继续使用pyarrow。但是,我将补充泡菜(无压缩)。写入高速缓存时,存储pyarrow和pickle序列化表格。如果从pypy反序列化失败,则从缓存回退到泡菜。

pyarrow compatibility across versions

Overall move has been to pyarrow/feather (deprecation warnings from pandas/msgpack). However I have a challenge with pyarrow with transient in specification Data serialized with pyarrow 0.15.1 cannot be deserialized with 0.16.0 ARROW-7961. I’m using serialization to use redis so have to use a binary encoding.

I’ve retested various options (using jupyter notebook)

import sys, pickle, zlib, warnings, io
class foocls:
    def pyarrow(out): return pa.serialize(out).to_buffer().to_pybytes()
    def msgpack(out): return out.to_msgpack()
    def pickle(out): return pickle.dumps(out)
    def feather(out): return out.to_feather(io.BytesIO())
    def parquet(out): return out.to_parquet(io.BytesIO())

warnings.filterwarnings("ignore")
for c in foocls.__dict__.values():
    sbreak = True
    try:
        c(out)
        print(c.__name__, "before serialization", sys.getsizeof(out))
        print(c.__name__, sys.getsizeof(c(out)))
        %timeit -n 50 c(out)
        print(c.__name__, "zlib", sys.getsizeof(zlib.compress(c(out))))
        %timeit -n 50 zlib.compress(c(out))
    except TypeError as e:
        if "not callable" in str(e): sbreak = False
        else: raise
    except (ValueError) as e: print(c.__name__, "ERROR", e)
    finally: 
        if sbreak: print("=+=" * 30)        
warnings.filterwarnings("default")

With following results for my data frame (in out jupyter variable)

pyarrow before serialization 533366
pyarrow 120805
1.03 ms ± 43.9 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
pyarrow zlib 20517
2.78 ms ± 81.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
msgpack before serialization 533366
msgpack 109039
1.74 ms ± 72.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
msgpack zlib 16639
3.05 ms ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
pickle before serialization 533366
pickle 142121
733 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
pickle zlib 29477
3.81 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
feather ERROR feather does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
parquet ERROR Nested column branch had multiple children: struct<x: double, y: double>
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=

feather and parquet do not work for my data frame. I’m going to continue using pyarrow. However I will supplement with pickle (no compression). When writing to cache store pyarrow and pickle serialised forms. When reading from cache fallback to pickle if pyarrow deserialisation fails.


回答 10

格式取决于您的用例

  • 在笔记本会话之间保存DataFrame- 羽毛,如果您习惯于腌制 -也可以。
  • 保存数据帧在尽可能小的文件大小- 镶木地板pickle.gz(检查什么最好为您的数据)
  • 保存一个非常大的DataFrame(10+百万行)-HDF
  • 能够读取另一个平台上(而不是Python),不支持其他格式的数据- CSVcsv.gz,检查是否镶木支持
  • 能够用眼睛查看/使用Excel / Google表格/ Git diff- CSV
  • 保存占用几乎所有RAM的DataFrame- CSV

该视频中有熊猫文件格式的比较。

The format depends on your use-case

  • Save DataFrame between notebook sessions – feather, if you’re used to pickle – also ok.
  • Save DataFrame in smallest possible file size – parquet or pickle.gz (check what’s better for your data)
  • Save a very big DataFrame (10+ millions of rows) – hdf
  • Be able to read the data on another platform (not Python) that doesn’t support other formats – csv, csv.gz, check if parquet is supported
  • Be able to review with your eyes / using Excel / Google Sheets / Git diff – csv
  • Save a DataFrame that takes almost all the RAM – csv

Comparison of the pandas file formats are in this video.


用字典重新映射熊猫列中的值

问题:用字典重新映射熊猫列中的值

我有一本字典,看起来像这样: di = {1: "A", 2: "B"}

我想将其应用于类似于以下内容的数据框的“ col1”列:

     col1   col2
0       w      a
1       1      2
2       2    NaN

要得到:

     col1   col2
0       w      a
1       A      2
2       B    NaN

我怎样才能最好地做到这一点?由于某种原因,与此相关的谷歌搜索术语仅向我显示了有关如何根据字典创建列的链接,反之亦然:-/

I have a dictionary which looks like this: di = {1: "A", 2: "B"}

I would like to apply it to the “col1” column of a dataframe similar to:

     col1   col2
0       w      a
1       1      2
2       2    NaN

to get:

     col1   col2
0       w      a
1       A      2
2       B    NaN

How can I best do this? For some reason googling terms relating to this only shows me links about how to make columns from dicts and vice-versa :-/


回答 0

您可以使用.replace。例如:

>>> df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}})
>>> di = {1: "A", 2: "B"}
>>> df
  col1 col2
0    w    a
1    1    2
2    2  NaN
>>> df.replace({"col1": di})
  col1 col2
0    w    a
1    A    2
2    B  NaN

或直接在上Series,即df["col1"].replace(di, inplace=True)

You can use .replace. For example:

>>> df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}})
>>> di = {1: "A", 2: "B"}
>>> df
  col1 col2
0    w    a
1    1    2
2    2  NaN
>>> df.replace({"col1": di})
  col1 col2
0    w    a
1    A    2
2    B  NaN

or directly on the Series, i.e. df["col1"].replace(di, inplace=True).


回答 1

map 可以比 replace

如果您的字典有多个键,使用map速度可能比快得多replace。此方法有两种版本,具体取决于字典是否详尽地映射所有可能的值(以及是否要让不匹配项保留其值或将其转换为NaN):

详尽的映射

在这种情况下,表格非常简单:

df['col1'].map(di)       # note: if the dictionary does not exhaustively map all
                         # entries then non-matched entries are changed to NaNs

尽管map最常用函数作为参数,但也可以选择字典或系列: Pandas.series.map的文档

非穷举映射

如果您有一个非详尽的映射,并且希望保留现有变量用于非匹配,则可以添加fillna

df['col1'].map(di).fillna(df['col1'])

如@jpp的答案在这里: 通过字典有效地替换熊猫系列中的值

基准测试

在pandas 0.23.1版中使用以下数据:

di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H" }
df = pd.DataFrame({ 'col1': np.random.choice( range(1,9), 100000 ) })

并进行测试时%timeit,它的map速度大约比速度快10倍replace

请注意,您的加速map会随数据而变化。最大的提速似乎是使用大词典和详尽的替换方法。有关更广泛的基准测试和讨论,请参见@jpp答案(上面链接)。

map can be much faster than replace

If your dictionary has more than a couple of keys, using map can be much faster than replace. There are two versions of this approach, depending on whether your dictionary exhaustively maps all possible values (and also whether you want non-matches to keep their values or be converted to NaNs):

Exhaustive Mapping

In this case, the form is very simple:

df['col1'].map(di)       # note: if the dictionary does not exhaustively map all
                         # entries then non-matched entries are changed to NaNs

Although map most commonly takes a function as its argument, it can alternatively take a dictionary or series: Documentation for Pandas.series.map

Non-Exhaustive Mapping

If you have a non-exhaustive mapping and wish to retain the existing variables for non-matches, you can add fillna:

df['col1'].map(di).fillna(df['col1'])

as in @jpp’s answer here: Replace values in a pandas series via dictionary efficiently

Benchmarks

Using the following data with pandas version 0.23.1:

di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H" }
df = pd.DataFrame({ 'col1': np.random.choice( range(1,9), 100000 ) })

and testing with %timeit, it appears that map is approximately 10x faster than replace.

Note that your speedup with map will vary with your data. The largest speedup appears to be with large dictionaries and exhaustive replaces. See @jpp answer (linked above) for more extensive benchmarks and discussion.


回答 2

您的问题有点含糊。至少有三种解释:

  1. 中的键di引用索引值
  2. 中的键是didf['col1']
  3. 中的键di指的是索引位置(不是OP的问题,而是为了娱乐而抛出的。)

以下是每种情况的解决方案。


情况1: 如果的键di旨在引用索引值,则可以使用以下update方法:

df['col1'].update(pd.Series(di))

例如,

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1':['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1,2,0])
#   col1 col2
# 1    w    a
# 2   10   30
# 0   20  NaN

di = {0: "A", 2: "B"}

# The value at the 0-index is mapped to 'A', the value at the 2-index is mapped to 'B'
df['col1'].update(pd.Series(di))
print(df)

Yield

  col1 col2
1    w    a
2    B   30
0    A  NaN

我已经修改了您原始帖子中的值,因此操作更清晰update。注意输入中的键如何di与索引值关联。索引值的顺序(即索引位置)无关紧要。


情况2: 如果其中的键di引用df['col1']值,则@DanAllan和@DSM显示如何通过以下方法实现此目的replace

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1':['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1,2,0])
print(df)
#   col1 col2
# 1    w    a
# 2   10   30
# 0   20  NaN

di = {10: "A", 20: "B"}

# The values 10 and 20 are replaced by 'A' and 'B'
df['col1'].replace(di, inplace=True)
print(df)

Yield

  col1 col2
1    w    a
2    A   30
0    B  NaN

注意如何在这种情况下,在键di改为匹配df['col1']


情况3: 如果其中的键di引用了索引位置,则可以使用

df['col1'].put(di.keys(), di.values())

以来

df = pd.DataFrame({'col1':['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1,2,0])
di = {0: "A", 2: "B"}

# The values at the 0 and 2 index locations are replaced by 'A' and 'B'
df['col1'].put(di.keys(), di.values())
print(df)

Yield

  col1 col2
1    A    a
2   10   30
0    B  NaN

在这里,第一行和第三行被更改了,因为其中的键di02,使用Python基于0的索引对其进行索引,它们指向第一位置和第三位置。

There is a bit of ambiguity in your question. There are at least three two interpretations:

  1. the keys in di refer to index values
  2. the keys in di refer to df['col1'] values
  3. the keys in di refer to index locations (not the OP’s question, but thrown in for fun.)

Below is a solution for each case.


Case 1: If the keys of di are meant to refer to index values, then you could use the update method:

df['col1'].update(pd.Series(di))

For example,

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1':['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1,2,0])
#   col1 col2
# 1    w    a
# 2   10   30
# 0   20  NaN

di = {0: "A", 2: "B"}

# The value at the 0-index is mapped to 'A', the value at the 2-index is mapped to 'B'
df['col1'].update(pd.Series(di))
print(df)

yields

  col1 col2
1    w    a
2    B   30
0    A  NaN

I’ve modified the values from your original post so it is clearer what update is doing. Note how the keys in di are associated with index values. The order of the index values — that is, the index locations — does not matter.


Case 2: If the keys in di refer to df['col1'] values, then @DanAllan and @DSM show how to achieve this with replace:

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1':['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1,2,0])
print(df)
#   col1 col2
# 1    w    a
# 2   10   30
# 0   20  NaN

di = {10: "A", 20: "B"}

# The values 10 and 20 are replaced by 'A' and 'B'
df['col1'].replace(di, inplace=True)
print(df)

yields

  col1 col2
1    w    a
2    A   30
0    B  NaN

Note how in this case the keys in di were changed to match values in df['col1'].


Case 3: If the keys in di refer to index locations, then you could use

df['col1'].put(di.keys(), di.values())

since

df = pd.DataFrame({'col1':['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1,2,0])
di = {0: "A", 2: "B"}

# The values at the 0 and 2 index locations are replaced by 'A' and 'B'
df['col1'].put(di.keys(), di.values())
print(df)

yields

  col1 col2
1    A    a
2   10   30
0    B  NaN

Here, the first and third rows were altered, because the keys in di are 0 and 2, which with Python’s 0-based indexing refer to the first and third locations.


回答 3

如果您有多个列要在数据数据帧中重新映射,则添加到此问题:

def remap(data,dict_labels):
    """
    This function take in a dictionnary of labels : dict_labels 
    and replace the values (previously labelencode) into the string.

    ex: dict_labels = {{'col1':{1:'A',2:'B'}}

    """
    for field,values in dict_labels.items():
        print("I am remapping %s"%field)
        data.replace({field:values},inplace=True)
    print("DONE")

    return data

希望它对某人有用。

干杯

Adding to this question if you ever have more than one columns to remap in a data dataframe:

def remap(data,dict_labels):
    """
    This function take in a dictionnary of labels : dict_labels 
    and replace the values (previously labelencode) into the string.

    ex: dict_labels = {{'col1':{1:'A',2:'B'}}

    """
    for field,values in dict_labels.items():
        print("I am remapping %s"%field)
        data.replace({field:values},inplace=True)
    print("DONE")

    return data

Hope it can be useful to someone.

Cheers


回答 4

DSM已经接受了答案,但是编码似乎并不适合所有人。这是与当前版本的熊猫一起使用的版本(截至8/2018为0.23.4):

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 2, 3, 1],
            'col2': ['negative', 'positive', 'neutral', 'neutral', 'positive']})

conversion_dict = {'negative': -1, 'neutral': 0, 'positive': 1}
df['converted_column'] = df['col2'].replace(conversion_dict)

print(df.head())

您会看到它看起来像:

   col1      col2  converted_column
0     1  negative                -1
1     2  positive                 1
2     2   neutral                 0
3     3   neutral                 0
4     1  positive                 1

pandas.DataFrame.replace的文档在这里

DSM has the accepted answer, but the coding doesn’t seem to work for everyone. Here is one that works with the current version of pandas (0.23.4 as of 8/2018):

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 2, 3, 1],
            'col2': ['negative', 'positive', 'neutral', 'neutral', 'positive']})

conversion_dict = {'negative': -1, 'neutral': 0, 'positive': 1}
df['converted_column'] = df['col2'].replace(conversion_dict)

print(df.head())

You’ll see it looks like:

   col1      col2  converted_column
0     1  negative                -1
1     2  positive                 1
2     2   neutral                 0
3     3   neutral                 0
4     1  positive                 1

The docs for pandas.DataFrame.replace are here.


回答 5

或做apply

df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))

演示:

>>> df['col1']=df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))
>>> df
  col1 col2
0    w    a
1    1    2
2    2  NaN
>>> 

Or do apply:

df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))

Demo:

>>> df['col1']=df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))
>>> df
  col1 col2
0    w    a
1    1    2
2    2  NaN
>>> 

回答 6

给定map的速度比替换(@JohnE的解决方案)要快,因此在打算将特定值映射到的非穷举映射时,您NaN需要格外小心。在这种情况下,正确的方法需要在您mask使用Series时执行.fillna,否则撤消到的映射NaN

import pandas as pd
import numpy as np

d = {'m': 'Male', 'f': 'Female', 'missing': np.NaN}
df = pd.DataFrame({'gender': ['m', 'f', 'missing', 'Male', 'U']})

keep_nan = [k for k,v in d.items() if pd.isnull(v)]
s = df['gender']

df['mapped'] = s.map(d).fillna(s.mask(s.isin(keep_nan)))

    gender  mapped
0        m    Male
1        f  Female
2  missing     NaN
3     Male    Male
4        U       U

Given map is faster than replace (@JohnE’s solution) you need to be careful with Non-Exhaustive mappings where you intend to map specific values to NaN. The proper method in this case requires that you mask the Series when you .fillna, else you undo the mapping to NaN.

import pandas as pd
import numpy as np

d = {'m': 'Male', 'f': 'Female', 'missing': np.NaN}
df = pd.DataFrame({'gender': ['m', 'f', 'missing', 'Male', 'U']})

keep_nan = [k for k,v in d.items() if pd.isnull(v)]
s = df['gender']

df['mapped'] = s.map(d).fillna(s.mask(s.isin(keep_nan)))

    gender  mapped
0        m    Male
1        f  Female
2  missing     NaN
3     Male    Male
4        U       U

回答 7

一个很好的完整解决方案,可以保留您的类标签的地图:

labels = features['col1'].unique()
labels_dict = dict(zip(labels, range(len(labels))))
features = features.replace({"col1": labels_dict})

这样,您可以随时从labels_dict引用原始类标签。

A nice complete solution that keeps a map of your class labels:

labels = features['col1'].unique()
labels_dict = dict(zip(labels, range(len(labels))))
features = features.replace({"col1": labels_dict})

This way, you can at any point refer to the original class label from labels_dict.


回答 8

作为对Nico Coallier(适用于多列)和U10-Forward(使用应用方式的方法)的建议的扩展,并将其概括为一个单一的行:

df.loc[:,['col1','col2']].transform(lambda x: x.map(lambda x: {1: "A", 2: "B"}.get(x,x))

.transform()每个列按顺序处理。.apply()与之相反,将DataFrame中聚集的列传递给该列。

因此,您可以应用Series方法map()

最后,由于U10,我发现了此行为,您可以在.get()表达式中使用整个Series。除非我误解了它的行为,并且它按顺序而不是按位处理序列。您在映射字典中未提及的值
.get(x,x)帐户,否则该.map()方法将被视为Nan

As an extension to what have been proposed by Nico Coallier (apply to multiple columns) and U10-Forward(using apply style of methods), and summarising it into a one-liner I propose:

df.loc[:,['col1','col2']].transform(lambda x: x.map(lambda x: {1: "A", 2: "B"}.get(x,x))

The .transform() processes each column as a series. Contrary to .apply()which passes the columns aggregated in a DataFrame.

Consequently you can apply the Series method map().

Finally, and I discovered this behaviour thanks to U10, you can use the whole Series in the .get() expression. Unless I have misunderstood its behaviour and it processes sequentially the series instead of bitwisely.
The .get(x,x)accounts for the values you did not mention in your mapping dictionary which would be considered as Nan otherwise by the .map() method


回答 9

一种更本地的熊猫方法是应用如下替换函数:

def multiple_replace(dict, text):
  # Create a regular expression  from the dictionary keys
  regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))

  # For each match, look-up corresponding value in dictionary
  return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text) 

定义函数后,可以将其应用于数据框。

di = {1: "A", 2: "B"}
df['col1'] = df.apply(lambda row: multiple_replace(di, row['col1']), axis=1)

A more native pandas approach is to apply a replace function as below:

def multiple_replace(dict, text):
  # Create a regular expression  from the dictionary keys
  regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))

  # For each match, look-up corresponding value in dictionary
  return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text) 

Once you defined the function, you can apply it to your dataframe.

di = {1: "A", 2: "B"}
df['col1'] = df.apply(lambda row: multiple_replace(di, row['col1']), axis=1)

熊猫根据其他列的值创建新列/逐行应用多列的功能

问题:熊猫根据其他列的值创建新列/逐行应用多列的功能

我想申请我的自定义函数(它使用的if-else梯)这六个列(ERI_HispanicERI_AmerInd_AKNatvERI_AsianERI_Black_Afr.AmerERI_HI_PacIslERI_White我的数据帧的每一行中)。

我尝试了与其他问题不同的方法,但似乎仍然找不到适合我问题的正确答案。关键在于,如果该人被视为西班牙裔,就不能被视为其他任何人。即使他们在另一个种族栏中的得分为“ 1”,他们仍然被视为西班牙裔,而不是两个或两个以上的种族。同样,如果所有ERI列的总和大于1,则将它们计为两个或多个种族,并且不能计为唯一的种族(西班牙裔除外)。希望这是有道理的。任何帮助将不胜感激。

这几乎就像在每行中进行一个for循环一样,如果每条记录都符合条件,则将它们添加到一个列表中并从原始列表中删除。

从下面的数据框中,我需要根据以下SQL规范来计算新列:

===================================================== =======

IF [ERI_Hispanic] = 1 THEN RETURN Hispanic
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN Two or More
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN A/I AK Native
ELSE IF [ERI_Asian] = 1 THEN RETURN Asian
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN Black/AA
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN White

评论:如果西班牙裔ERI标志为True(1),则该雇员被分类为“西班牙裔”

注释:如果多个非西班牙裔ERI标志为真,则返回“两个或更多”

======================数据帧===========================

     lname          fname       rno_cd  eri_afr_amer    eri_asian   eri_hawaiian    eri_hispanic    eri_nat_amer    eri_white   rno_defined
0    MOST           JEFF        E       0               0           0               0               0               1           White
1    CRUISE         TOM         E       0               0           0               1               0               0           White
2    DEPP           JOHNNY              0               0           0               0               0               1           Unknown
3    DICAP          LEO                 0               0           0               0               0               1           Unknown
4    BRANDO         MARLON      E       0               0           0               0               0               0           White
5    HANKS          TOM         0                       0           0               0               0               1           Unknown
6    DENIRO         ROBERT      E       0               1           0               0               0               1           White
7    PACINO         AL          E       0               0           0               0               0               1           White
8    WILLIAMS       ROBIN       E       0               0           1               0               0               0           White
9    EASTWOOD       CLINT       E       0               0           0               0               0               1           White

I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr.Amer, ERI_HI_PacIsl, ERI_White) in each row of my dataframe.

I’ve tried different methods from other questions but still can’t seem to find the right answer for my problem. The critical piece of this is that if the person is counted as Hispanic they can’t be counted as anything else. Even if they have a “1” in another ethnicity column they still are counted as Hispanic not two or more races. Similarly, if the sum of all the ERI columns is greater than 1 they are counted as two or more races and can’t be counted as a unique ethnicity(except for Hispanic). Hopefully this makes sense. Any help will be greatly appreciated.

Its almost like doing a for loop through each row and if each record meets a criterion they are added to one list and eliminated from the original.

From the dataframe below I need to calculate a new column based on the following spec in SQL:

========================= CRITERIA ===============================

IF [ERI_Hispanic] = 1 THEN RETURN “Hispanic”
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN “Two or More”
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN “A/I AK Native”
ELSE IF [ERI_Asian] = 1 THEN RETURN “Asian”
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN “Black/AA”
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN “Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN “White”

Comment: If the ERI Flag for Hispanic is True (1), the employee is classified as “Hispanic”

Comment: If more than 1 non-Hispanic ERI Flag is true, return “Two or More”

====================== DATAFRAME ===========================

     lname          fname       rno_cd  eri_afr_amer    eri_asian   eri_hawaiian    eri_hispanic    eri_nat_amer    eri_white   rno_defined
0    MOST           JEFF        E       0               0           0               0               0               1           White
1    CRUISE         TOM         E       0               0           0               1               0               0           White
2    DEPP           JOHNNY              0               0           0               0               0               1           Unknown
3    DICAP          LEO                 0               0           0               0               0               1           Unknown
4    BRANDO         MARLON      E       0               0           0               0               0               0           White
5    HANKS          TOM         0                       0           0               0               0               1           Unknown
6    DENIRO         ROBERT      E       0               1           0               0               0               1           White
7    PACINO         AL          E       0               0           0               0               0               1           White
8    WILLIAMS       ROBIN       E       0               0           1               0               0               0           White
9    EASTWOOD       CLINT       E       0               0           0               0               0               1           White

回答 0

好的,执行此步骤有两个步骤-首先是编写一个可以执行所需翻译的函数-我已根据您的伪代码将一个示例放在一起:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

您可能想要解决这个问题,但这似乎可以解决问题-请注意,进入函数的参数被视为标记为“行”的Series对象。

接下来,在熊猫中使用apply函数来应用该函数-例如

df.apply (lambda row: label_race(row), axis=1)

请注意axis = 1说明符,这意味着应用程序是在行而不是列级别完成的。结果在这里:

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White

如果您对这些结果感到满意,请再次运行它,将结果保存到原始数据框中的新列中。

df['race_label'] = df.apply (lambda row: label_race(row), axis=1)

结果数据框如下所示(向右滚动以查看新列):

      lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label
0      MOST    JEFF      E             0          0             0              0             0          1       White         White
1    CRUISE     TOM      E             0          0             0              1             0          0       White      Hispanic
2      DEPP  JOHNNY    NaN             0          0             0              0             0          1     Unknown         White
3     DICAP     LEO    NaN             0          0             0              0             0          1     Unknown         White
4    BRANDO  MARLON      E             0          0             0              0             0          0       White         Other
5     HANKS     TOM    NaN             0          0             0              0             0          1     Unknown         White
6    DENIRO  ROBERT      E             0          1             0              0             0          1       White   Two Or More
7    PACINO      AL      E             0          0             0              0             0          1       White         White
8  WILLIAMS   ROBIN      E             0          0             1              0             0          0       White  Haw/Pac Isl.
9  EASTWOOD   CLINT      E             0          0             0              0             0          1       White         White

OK, two steps to this – first is to write a function that does the translation you want – I’ve put an example together based on your pseudo-code:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

You may want to go over this, but it seems to do the trick – notice that the parameter going into the function is considered to be a Series object labelled “row”.

Next, use the apply function in pandas to apply the function – e.g.

df.apply (lambda row: label_race(row), axis=1)

Note the axis=1 specifier, that means that the application is done at a row, rather than a column level. The results are here:

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White

If you’re happy with those results, then run it again, saving the results into a new column in your original dataframe.

df['race_label'] = df.apply (lambda row: label_race(row), axis=1)

The resultant dataframe looks like this (scroll to the right to see the new column):

      lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label
0      MOST    JEFF      E             0          0             0              0             0          1       White         White
1    CRUISE     TOM      E             0          0             0              1             0          0       White      Hispanic
2      DEPP  JOHNNY    NaN             0          0             0              0             0          1     Unknown         White
3     DICAP     LEO    NaN             0          0             0              0             0          1     Unknown         White
4    BRANDO  MARLON      E             0          0             0              0             0          0       White         Other
5     HANKS     TOM    NaN             0          0             0              0             0          1     Unknown         White
6    DENIRO  ROBERT      E             0          1             0              0             0          1       White   Two Or More
7    PACINO      AL      E             0          0             0              0             0          1       White         White
8  WILLIAMS   ROBIN      E             0          0             1              0             0          0       White  Haw/Pac Isl.
9  EASTWOOD   CLINT      E             0          0             0              0             0          1       White         White

回答 1

由于这是Google针对“来自其他人的熊猫专栏”的第一个结果,因此下面是一个简单的示例:

import pandas as pd

# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
#    a  b
# 0  1  3
# 1  2  4

# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0    4
# 1    6

# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
#    a  b  c
# 0  1  3  4
# 1  2  4  6

如果得到了,SettingWithCopyWarning您也可以通过以下方式进行操作:

fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'

资料来源:https : //stackoverflow.com/a/12555510/243392

如果列名包含空格,则可以使用如下语法:

df = df.assign(**{'some column name': col.values})

这是applyAssign的文档。

Since this is the first Google result for ‘pandas new column from others’, here’s a simple example:

import pandas as pd

# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
#    a  b
# 0  1  3
# 1  2  4

# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0    4
# 1    6

# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
#    a  b  c
# 0  1  3  4
# 1  2  4  6

If you get the SettingWithCopyWarning you can do it this way also:

fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'

Source: https://stackoverflow.com/a/12555510/243392

And if your column name includes spaces you can use syntax like this:

df = df.assign(**{'some column name': col.values})

And here’s the documentation for apply, and assign.


回答 2

上面的答案是完全正确的,但是存在矢量化的解决方案,形式为numpy.select。这使您可以定义条件,然后为这些条件定义输出,这比使用apply以下命令更有效:


首先,定义条件:

conditions = [
    df['eri_hispanic'] == 1,
    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    df['eri_nat_amer'] == 1,
    df['eri_asian'] == 1,
    df['eri_afr_amer'] == 1,
    df['eri_hawaiian'] == 1,
    df['eri_white'] == 1,
]

现在,定义相应的输出:

outputs = [
    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]

最后,使用numpy.select

res = np.select(conditions, outputs, 'Other')
pd.Series(res)

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object

为什么要numpy.select用完apply?以下是一些性能检查:

df = pd.concat([df]*1000)

In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %%timeit
    ...: conditions = [
    ...:     df['eri_hispanic'] == 1,
    ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    ...:     df['eri_nat_amer'] == 1,
    ...:     df['eri_asian'] == 1,
    ...:     df['eri_afr_amer'] == 1,
    ...:     df['eri_hawaiian'] == 1,
    ...:     df['eri_white'] == 1,
    ...: ]
    ...:
    ...: outputs = [
    ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
    ...: ]
    ...:
    ...: np.select(conditions, outputs, 'Other')
    ...:
    ...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

使用numpy.select给了我们极大地提高了性能,并且随着数据增长的差异只会增加。

The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:


First, define conditions:

conditions = [
    df['eri_hispanic'] == 1,
    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    df['eri_nat_amer'] == 1,
    df['eri_asian'] == 1,
    df['eri_afr_amer'] == 1,
    df['eri_hawaiian'] == 1,
    df['eri_white'] == 1,
]

Now, define the corresponding outputs:

outputs = [
    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]

Finally, using numpy.select:

res = np.select(conditions, outputs, 'Other')
pd.Series(res)

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object

Why should numpy.select be used over apply? Here are some performance checks:

df = pd.concat([df]*1000)

In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %%timeit
    ...: conditions = [
    ...:     df['eri_hispanic'] == 1,
    ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    ...:     df['eri_nat_amer'] == 1,
    ...:     df['eri_asian'] == 1,
    ...:     df['eri_afr_amer'] == 1,
    ...:     df['eri_hawaiian'] == 1,
    ...:     df['eri_white'] == 1,
    ...: ]
    ...:
    ...: outputs = [
    ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
    ...: ]
    ...:
    ...: np.select(conditions, outputs, 'Other')
    ...:
    ...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using numpy.select gives us vastly improved performance, and the discrepancy will only increase as the data grows.


回答 3

.apply()以函数作为第一个参数;这样传递label_race函数:

df['race_label'] = df.apply(label_race, axis=1)

您无需使一个lambda函数即可传递函数。

.apply() takes in a function as the first parameter; pass in the label_race function as so:

df['race_label'] = df.apply(label_race, axis=1)

You don’t need to make a lambda function to pass in a function.


回答 4

尝试这个,

df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)

O / P:

     lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian  \
0      MOST    JEFF      E             0          0             0   
1    CRUISE     TOM      E             0          0             0   
2      DEPP  JOHNNY    NaN             0          0             0   
3     DICAP     LEO    NaN             0          0             0   
4    BRANDO  MARLON      E             0          0             0   
5     HANKS     TOM    NaN             0          0             0   
6    DENIRO  ROBERT      E             0          1             0   
7    PACINO      AL      E             0          0             0   
8  WILLIAMS   ROBIN      E             0          0             1   
9  EASTWOOD   CLINT      E             0          0             0   

   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label  
0             0             0          1       White         White  
1             1             0          0       White      Hispanic  
2             0             0          1     Unknown         White  
3             0             0          1     Unknown         White  
4             0             0          0       White         Other  
5             0             0          1     Unknown         White  
6             0             0          1       White   Two Or More  
7             0             0          1       White         White  
8             0             0          0       White  Haw/Pac Isl.  
9             0             0          1       White         White 

使用.loc代替apply

它改善了向量化。

.loc 以简单的方式工作,根据条件屏蔽行,将值应用于冻结行。

有关更多详细信息,请访问.loc docs

性能指标:

接受的答案:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)

%timeit df.apply(lambda row: label_race(row), axis=1)

每个循环1.15 s±46.5 ms(平均±标准偏差,共7次运行,每个循环1次)

我的建议答案:

def label_race(df):
    df.loc[df['eri_white']==1,'race_label'] = 'White'
    df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
    df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
    df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
    df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
    df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
    df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
    df['race_label'].fillna('Other', inplace=True)
df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)

%timeit label_race(df)

每个循环24.7 ms±1.7 ms(平均±标准偏差,运行7次,每个循环10个)

try this,

df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)

O/P:

     lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian  \
0      MOST    JEFF      E             0          0             0   
1    CRUISE     TOM      E             0          0             0   
2      DEPP  JOHNNY    NaN             0          0             0   
3     DICAP     LEO    NaN             0          0             0   
4    BRANDO  MARLON      E             0          0             0   
5     HANKS     TOM    NaN             0          0             0   
6    DENIRO  ROBERT      E             0          1             0   
7    PACINO      AL      E             0          0             0   
8  WILLIAMS   ROBIN      E             0          0             1   
9  EASTWOOD   CLINT      E             0          0             0   

   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label  
0             0             0          1       White         White  
1             1             0          0       White      Hispanic  
2             0             0          1     Unknown         White  
3             0             0          1     Unknown         White  
4             0             0          0       White         Other  
5             0             0          1     Unknown         White  
6             0             0          1       White   Two Or More  
7             0             0          1       White         White  
8             0             0          0       White  Haw/Pac Isl.  
9             0             0          1       White         White 

use .loc instead of apply.

it improves vectorization.

.loc works in simple manner, mask rows based on the condition, apply values to the freeze rows.

for more details visit, .loc docs

Performance metrics:

Accepted Answer:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)

%timeit df.apply(lambda row: label_race(row), axis=1)

1.15 s ± 46.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My Proposed Answer:

def label_race(df):
    df.loc[df['eri_white']==1,'race_label'] = 'White'
    df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
    df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
    df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
    df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
    df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
    df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
    df['race_label'].fillna('Other', inplace=True)
df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)

%timeit label_race(df)

24.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)