Tag archive: dataframe

How to check if a column exists in a pandas DataFrame

Question: How to check if a column exists in a pandas DataFrame

Is there a way to check if a column exists in a Pandas DataFrame?

Suppose that I have the following DataFrame:

>>> import pandas as pd
>>> from random import randint
>>> df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
                       'B': [randint(1, 9)*10 for x in range(10)],
                       'C': [randint(1, 9)*100 for x in range(10)]})
>>> df
   A   B    C
0  3  40  100
1  6  30  200
2  7  70  800
3  3  50  200
4  7  50  400
5  4  10  400
6  3  70  500
7  8  30  200
8  3  40  800
9  6  60  200

and I want to calculate df['sum'] = df['A'] + df['C']

But first I want to check if df['A'] exists, and if not, I want to calculate df['sum'] = df['B'] + df['C'] instead.


Answer 0

This will work:

if 'A' in df:

But for clarity, I’d probably write it as:

if 'A' in df.columns:
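
Applied to the original question, a minimal sketch of the branching (using the df built above):

if 'A' in df.columns:
    df['sum'] = df['A'] + df['C']
else:
    df['sum'] = df['B'] + df['C']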

Answer 1

To check if one or more columns all exist, you can use set.issubset, as in:

if set(['A','C']).issubset(df.columns):
   df['sum'] = df['A'] + df['C']                

As @brianpck points out in a comment, the set can alternatively be constructed with curly braces instead of set([]):

if {'A', 'C'}.issubset(df.columns):

See this question for a discussion of the curly-braces syntax.

Or, you can use a list comprehension, as in:

if all([item in df.columns for item in ['A','C']]):

Answer 2

Just to suggest another way without using if statements, you can use the get() method for DataFrames. For performing the sum based on the question:

df['sum'] = df.get('A', df['B']) + df['C']

The DataFrame get method behaves like get on a Python dictionary.
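
A minimal sketch of the fallback behavior (the frame here is made up, with no column 'A'):

import pandas as pd

df = pd.DataFrame({'B': [10, 20], 'C': [100, 200]})  # no column 'A'
df['sum'] = df.get('A', df['B']) + df['C']           # get() falls back to df['B']
print(df['sum'].tolist())                            # [110, 220]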


What does axis mean in pandas?

Question: What does axis mean in pandas?

Here is my code to generate a dataframe:

import pandas as pd
import numpy as np

dff = pd.DataFrame(np.random.randn(1,2),columns=list('AB'))

then I got the dataframe:

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+---------
|      0     | 0.626386| 1.52325|
+------------+---------+--------+

When I type the command:

dff.mean(axis=1)

I got :

0    1.074821
dtype: float64

According to the reference of pandas, axis=1 stands for columns and I expect the result of the command to be

A    0.626386
B    1.523255
dtype: float64

So here is my question: what does axis in pandas mean?


Answer 0

It specifies the axis along which the means are computed. By default axis=0. This is consistent with the numpy.mean usage when axis is specified explicitly (in numpy.mean, axis==None by default, which computes the mean value over the flattened array), in which axis=0 runs along the rows (namely, the index in pandas), and axis=1 runs along the columns. For added clarity, one may choose to specify axis='index' (instead of axis=0) or axis='columns' (instead of axis=1).

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+---------
|      0     | 0.626386| 1.52325|----axis=1----->
+------------+---------+--------+
             |         |
             | axis=0  |
             ↓         ↓
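
A minimal sketch of both directions (values made up for illustration):

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

df.mean(axis=0)   # one mean per column: A -> 1.5, B -> 15.0
df.mean(axis=1)   # one mean per row:   0 -> 5.5, 1 -> 11.0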

Answer 1

These answers do help explain this, but it still isn’t perfectly intuitive for a non-programmer (i.e. someone like me who is learning Python for the first time in the context of data science coursework). I still find using the terms “along” or “for each” with respect to rows and columns confusing.

What makes more sense to me is to say it this way:

  • Axis 0 will act on all the ROWS in each COLUMN
  • Axis 1 will act on all the COLUMNS in each ROW

So a mean on axis 0 will be the mean of all the rows in each column, and a mean on axis 1 will be a mean of all the columns in each row.

Ultimately this is saying the same thing as @zhangxaochen and @Michael, but in a way that is easier for me to internalize.


Answer 2

Let’s visualize (you gonna remember always),

In Pandas:

  1. axis=0 means along “indexes”. It’s a row-wise operation.

Suppose, to perform a concat() operation on dataframe1 & dataframe2, we take out the 1st row from dataframe1 and place it into the new DF, then we take out another row from dataframe1 and put it into the new DF; we repeat this process until we reach the bottom of dataframe1. Then, we do the same process for dataframe2.

Basically, stacking dataframe2 on top of dataframe1, or vice versa.

E.g. making a pile of books on a table or floor.

  2. axis=1 means along “columns”. It’s a column-wise operation.

Suppose, to perform a concat() operation on dataframe1 & dataframe2, we take out the 1st complete column (a.k.a. the 1st series) of dataframe1 and place it into the new DF, then we take out the second column of dataframe1 and keep it adjacent (sideways); we repeat this operation until all columns are finished. Then, we repeat the same process on dataframe2. Basically, stacking dataframe2 sideways.

E.g. arranging books on a bookshelf.

Moreover, arrays are better representations of nested n-dimensional structures than matrices, so the below can help you visualize how axis plays an important role when you generalize to more than one dimension. Also, you can actually print/write/draw/visualize any n-dim array, but writing or visualizing the same in a matrix representation (3-dim) is impossible on paper beyond 3 dimensions.
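
A minimal concat sketch of the two directions described above (frames made up for illustration):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

pd.concat([df1, df2], axis=0)   # pile of books: 4 rows, 1 column
pd.concat([df1, df2], axis=1)   # bookshelf: 2 rows, 2 columns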


Answer 3

axis refers to the dimension of the array, in the case of pd.DataFrames axis=0 is the dimension that points downwards and axis=1 the one that points to the right.

Example: Think of an ndarray with shape (3,5,7).

a = np.ones((3,5,7))

a is a 3 dimensional ndarray, i.e. it has 3 axes (“axes” is plural of “axis”). The configuration of a will look like 3 slices of bread where each slice is of dimension 5-by-7. a[0,:,:] will refer to the 0-th slice, a[1,:,:] will refer to the 1-st slice etc.

a.sum(axis=0) will apply sum() along the 0-th axis of a. You will add all the slices and end up with one slice of shape (5,7).

a.sum(axis=0) is equivalent to

b = np.zeros((5,7))
for i in range(5):
    for j in range(7):
        b[i,j] += a[:,i,j].sum()

b and a.sum(axis=0) will both look like this

array([[ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.]])

In a pd.DataFrame, axes work the same way as in numpy.arrays: axis=0 will apply sum() or any other reduction function for each column.

N.B. In @zhangxaochen’s answer, I find the phrases “along the rows” and “along the columns” slightly confusing. axis=0 should refer to “along each column”, and axis=1 “along each row”.


Answer 4

The easiest way for me to understand is to talk about whether you are calculating a statistic for each column (axis = 0) or each row (axis = 1). If you calculate a statistic, say a mean, with axis = 0 you will get that statistic for each column. So if each observation is a row and each variable is in a column, you would get the mean of each variable. If you set axis = 1 then you will calculate your statistic for each row. In our example, you would get the mean for each observation across all of your variables (perhaps you want the average of related measures).

axis = 0: by column = column-wise = along the rows

axis = 1: by row = row-wise = along the columns


Answer 5

Let’s look at the table from Wiki. This is an IMF estimate of GDP from 2010 to 2019 for top ten countries.

1. Axis 1 will act for each row on all the columns
If you want to calculate the average (mean) GDP for EACH country over the decade (2010-2019), you need to do df.mean(axis=1). For example, if you want to calculate the mean GDP of the United States from 2010 to 2019: df.loc['United States','2010':'2019'].mean() (selecting a single row label already yields a Series, so no axis argument is needed there).

2. Axis 0 will act for each column on all the rows
If you want to calculate the average (mean) GDP for EACH year for all countries, you need to do df.mean(axis=0). For example, if you want to calculate the mean GDP of the year 2015 for United States, China, Japan, Germany and India: df.loc['United States':'India','2015'].mean(axis=0)

Note: The above code will work only after setting “Country(or dependent territory)” column as the Index, using set_index method.
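
A minimal sketch of both calls with made-up numbers (the real figures come from the Wiki table):

import pandas as pd

gdp = pd.DataFrame({'2015': [18.2, 11.1], '2016': [18.7, 11.2]},
                   index=['United States', 'China'])  # hypothetical GDP values

gdp.mean(axis=1)   # axis 1: mean per country, across the years
gdp.mean(axis=0)   # axis 0: mean per year, across the countries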


Answer 6

From a programming point of view, the axis is the position in the shape tuple. Here is an example:

import numpy as np

a=np.arange(120).reshape(2,3,4,5)

a.shape
Out[3]: (2, 3, 4, 5)

np.sum(a,axis=0).shape
Out[4]: (3, 4, 5)

np.sum(a,axis=1).shape
Out[5]: (2, 4, 5)

np.sum(a,axis=2).shape
Out[6]: (2, 3, 5)

np.sum(a,axis=3).shape
Out[7]: (2, 3, 4)

Mean on the axis will cause that dimension to be removed.

Referring to the original question, the dff shape is (1,2). Using axis=1 will change the shape to (1,).


Answer 7

The designer of pandas, Wes McKinney, used to work intensively on finance data. Think of columns as stock names and index as daily prices. You can then guess what the default behavior is (i.e., axis=0) with respect to this finance data. axis=1 can be simply thought as ‘the other direction’.

For example, the statistics functions, such as mean(), sum(), describe(), count() all default to column-wise because it makes more sense to do them for each stock. sort_index(by=) also defaults to column. fillna(method='ffill') will fill along column because it is the same stock. dropna() defaults to row because you probably just want to discard the price on that day instead of throw away all prices of that stock.

Similarly, the square brackets indexing refers to the columns since it’s more common to pick a stock instead of picking a day.


Answer 8

One of the easy ways to remember axis=1 (columns) vs axis=0 (rows) is by the output you expect (see the sketch after this list):

  • if you expect an output for each row, you use axis='columns';
  • on the other hand, if you want an output for each column, you use axis='rows'.
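
A minimal sketch of the mnemonic (assuming a numeric DataFrame df; 'rows' and 'columns' are string aliases pandas accepts for 0 and 1):

df.sum(axis='columns')   # one value per row
df.sum(axis='rows')      # one value per column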

Answer 9

The difficulty with using axis= properly comes from its use in 2 main different cases:

  1. For computing an accumulated value, or rearranging (e. g. sorting) data.
  2. For manipulating (“playing” with) entities (e. g. dataframes).

The main idea behind this answer is that for avoiding the confusion, we select either a number, or a name for specifying the particular axis, whichever is more clear, intuitive, and descriptive.

Pandas is based on NumPy, which is based on mathematics, particularly on n-dimensional matrices. Here is an image for common use of axes’ names in math in the 3-dimensional space:

This picture is for memorizing the axes’ ordinal numbers only:

  • 0 for x-axis,
  • 1 for y-axis, and
  • 2 for z-axis.

The z-axis is only for panels; for dataframes we will restrict our interest to the green-colored, 2-dimensional basic plane with x-axis (0, vertical), and y-axis (1, horizontal).

It’s all for numbers as potential values of axis= parameter.

The names of the axes are 'index' (you may use the alias 'rows') and 'columns', and for this explanation the relation between these names and the ordinal numbers (of the axes) is NOT important, as everybody knows what the words “rows” and “columns” mean (and everybody here, I suppose, knows what the word “index” in pandas means).

And now, my recommendation:

  1. If you want to compute an accumulated value, you may compute it from values located along axis 0 (or along axis 1) — use axis=0 (or axis=1).

    Similarly, if you want to rearrange values, use the axis number of the axis, along which are located data for rearranging (e.g. for sorting).

  2. If you want to manipulate (e.g. concatenate) entities (e.g. dataframes), use axis='index' (synonym: axis='rows') or axis='columns' to specify the resulting change: index (rows) or columns, respectively.
    (For concatenating, you will obtain either a longer index (= more rows), or more columns, respectively.)


Answer 10

This is based on @Safak’s answer. The best way to understand the axes in pandas/numpy is to create a 3d array and check the result of the sum function along the 3 different axes.

 a = np.ones((3,5,7))

a will be:

array([[[1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1.]],

       [[1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1.]],

       [[1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1.]]])

Now check out the sum of elements of the array along each of the axes:

 x0 = np.sum(a,axis=0)
 x1 = np.sum(a,axis=1)
 x2 = np.sum(a,axis=2)

will give you the following results:

x0 :
array([[3., 3., 3., 3., 3., 3., 3.],
       [3., 3., 3., 3., 3., 3., 3.],
       [3., 3., 3., 3., 3., 3., 3.],
       [3., 3., 3., 3., 3., 3., 3.],
       [3., 3., 3., 3., 3., 3., 3.]])

x1 :
array([[5., 5., 5., 5., 5., 5., 5.],
       [5., 5., 5., 5., 5., 5., 5.],
       [5., 5., 5., 5., 5., 5., 5.]])

x2 :
array([[7., 7., 7., 7., 7.],
       [7., 7., 7., 7., 7.],
       [7., 7., 7., 7., 7.]])

Answer 11

I understand it this way:

If your operation requires traversing from left to right (or right to left) in a dataframe, you are apparently merging columns, i.e. you are operating on various columns. This is axis=1.

Example

df = pd.DataFrame(np.arange(12).reshape(3,4),columns=['A', 'B', 'C', 'D'])
print(df)
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11 

df.mean(axis=1)

0    1.5
1    5.5
2    9.5
dtype: float64

df.drop(['A','B'],axis=1,inplace=True)

    C   D
0   2   3
1   6   7
2  10  11

The point to note here is that we are operating on columns.

Similarly, if your operation requires traversing from top to bottom (or bottom to top) in a dataframe, you are merging rows. This is axis=0.


Answer 12

axis=0 means top to bottom; axis=1 means left to right.

sums[key] = lang_sets[key].iloc[:,1:].sum(axis=0)

The given example takes the sum of all the data in each column, for the subset stored under key.


Answer 13

My thinking: axis=n, where n = 0, 1, etc., means that the matrix is collapsed (folded) along that axis. So in a 2D matrix, when you collapse along 0 (rows), you are really operating on one column at a time. Similarly for higher-order matrices.

This is not the same as the normal reference to a dimension in a matrix, where 0 -> row and 1 -> column. Similarly for the other dimensions in an N-dimensional array.


Answer 14

I’m a newbie to pandas. But this is how I understand axis in pandas:


+------+----------+---------+--------------------+
| Axis | Constant | Varying | Direction          |
+------+----------+---------+--------------------+
|  0   | Column   | Row     | Downwards ↓        |
|  1   | Row      | Column  | Towards right ->   |
+------+----------+---------+--------------------+


So to compute the mean of a column, that particular column is held constant while the rows under it change (vary), so it is axis=0.

Similarly, to compute the mean of a row, that particular row is held constant while it traverses through different columns (varying): axis=1.


Answer 15

I think there is an another way to understand it.

For a np.array, if we want to eliminate columns, we use axis=1; if we want to eliminate rows, we use axis=0.

np.mean(np.array(np.ones(shape=(3,5,10))),axis = 0).shape # (5,10)
np.mean(np.array(np.ones(shape=(3,5,10))),axis = 1).shape # (3,10)
np.mean(np.array(np.ones(shape=(3,5,10))),axis = (0,1)).shape # (10,)

For a pandas object, axis=0 stands for a row-wise operation and axis=1 stands for a column-wise operation. This differs from the numpy definition; we can compare the definitions in the numpy docs and the pandas docs.


Answer 16

I will explicitly avoid using 'row-wise' or 'along the columns', since people may interpret them in exactly the wrong way.

Analogy first. Intuitively, you would expect that pandas.DataFrame.drop(axis='columns') drops a column from N columns and gives you (N - 1) columns. So you can pay NO attention to rows for now (and remove the word 'row' from your English dictionary). Vice versa, drop(axis='rows') works on rows.

In the same way, sum(axis='columns') works on multiple columns and gives you 1 column. Similarly, sum(axis='rows') results in 1 row. This is consistent with its simplest form of definition, reducing a list of numbers to a single number.

In general, with axis='columns', you see columns, work on columns, and get columns. Forget rows.

With axis='rows', change perspective and work on rows.

0 and 1 are just aliases for 'rows' and 'columns'. It’s the convention of matrix indexing.
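
A minimal sketch of the analogy (assuming a DataFrame df with a numeric column 'A' among others):

df.drop('A', axis='columns')   # N columns in, N - 1 columns out
df.sum(axis='columns')         # many columns in, one column (a Series) out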


Answer 17

I had been trying to figure out the axis for the last hour as well. The language in all the above answers, and also the documentation, was not at all helpful.

To answer the question as I understand it now: in pandas, axis=1 or axis=0 indicates which axis headers you want to keep constant when applying the function.

Note: When I say headers, I mean index names

Expanding your example:

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+---------
|      X     | 0.626386| 1.52325|
+------------+---------+--------+
|      Y     | 0.626386| 1.52325|
+------------+---------+--------+

For axis=1 (columns): we keep the column headers constant and apply the mean function by changing the data. To demonstrate, we keep the column headers constant as:

+------------+---------+--------+
|            |  A      |  B     |

Now we populate one set of A and B values and then find the mean

|            | 0.626386| 1.52325|  

Then we populate next set of A and B values and find the mean

|            | 0.626386| 1.52325|

Similarly, for axis=0 (rows), we keep the row headers constant and keep changing the data. To demonstrate, first fix the row headers:

+------------+
|      X     |
+------------+
|      Y     |
+------------+

Now populate first set of X and Y values and then find the mean

+------------+---------+
|      X     | 0.626386
+------------+---------+
|      Y     | 0.626386
+------------+---------+

Then populate the next set of X and Y values and then find the mean:

+------------+---------+
|      X     | 1.52325 |
+------------+---------+
|      Y     | 1.52325 |
+------------+---------+

In summary,

When axis=columns, you fix the column headers and change data, which will come from the different rows.

When axis=rows, you fix the row headers and change data, which will come from the different columns.


Answer 18

With axis=1 it will give the sum row-wise; keepdims=True will maintain the 2D dimension. Hope it helps you.
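
A minimal numpy sketch of that remark:

import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

a.sum(axis=1)                    # array([ 3, 12]), shape (2,)
a.sum(axis=1, keepdims=True)     # array([[ 3], [12]]), shape (2, 1)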


Answer 19

Many answers here helped me a lot!

In case you get confused by the different behaviours of axis in Python and MARGIN in R (like in the apply function), you may find a blog post that I wrote of interest: https://accio.github.io/programming/2020/05/19/numpy-pandas-axis.html.

In essence:

  • Their behaviours are, intriguingly, easier to understand with three-dimensional arrays than with two-dimensional arrays.
  • In the Python packages numpy and pandas, the axis parameter of a reduction such as mean specifies that numpy should calculate the mean of all values that can be fetched in the form array[0, 0, …, i, …, 0], where i iterates through all possible values. The process is repeated with the position of i fixed while the indices of the other dimensions vary one after the other (starting from the right-most element). The result is an (n-1)-dimensional array.
  • In R, the MARGIN parameter lets the apply function calculate the mean of all values that can be fetched in the form array[, …, i, …, ], where i iterates through all possible values. The process is not repeated once all i values have been iterated; therefore, the result is a simple vector.

Answer 20

Arrays are designed so that axis=0 runs along the rows, positioned vertically, while axis=1 runs along the columns, positioned horizontally. Axis refers to the dimension of the array.


How to take column slices of a DataFrame in pandas

Question: How to take column slices of a DataFrame in pandas

I load some machine learning data from a CSV file. The first 2 columns are observations and the remaining columns are features.

Currently, I do the following:

data = pandas.read_csv('mydata.csv')

which gives something like:

data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))

I’d like to slice this dataframe into two dataframes: one containing the columns a and b, and one containing the columns c, d and e.

It is not possible to write something like

observations = data[:'c']
features = data['c':]

I’m not sure what the best method is. Do I need a pd.Panel?

By the way, I find dataframe indexing pretty inconsistent: data['a'] is permitted, but data[0] is not. On the other side, data['a':] is not permitted but data[0:] is. Is there a practical reason for this? This is really confusing if columns are indexed by Int, given that data[0] != data[0:1]


Answer 0

2017 Answer – pandas 0.20: .ix is deprecated. Use .loc

See the deprecation in the docs

.loc uses label-based indexing to select both rows and columns. The labels are the values of the index or the columns. Slicing with .loc includes the last element.

Let’s assume we have a DataFrame with the following columns:
foo, bar, quz, ant, cat, sat, dat.

# selects all rows and all columns beginning at 'foo' up to and including 'sat'
df.loc[:, 'foo':'sat']
# foo bar quz ant cat sat

.loc accepts the same slice notation that Python lists do, for both rows and columns. The slice notation is start:stop:step.

# slice from 'foo' to 'cat' by every 2nd column
df.loc[:, 'foo':'cat':2]
# foo quz cat

# slice from the beginning to 'bar'
df.loc[:, :'bar']
# foo bar

# slice from 'quz' to the end by 3
df.loc[:, 'quz'::3]
# quz sat

# attempt from 'sat' to 'bar'
df.loc[:, 'sat':'bar']
# no columns returned

# slice from 'sat' to 'bar'
df.loc[:, 'sat':'bar':-1]
# sat cat ant quz bar

# slice notation is syntactic sugar for the slice function
# slice from 'quz' to the end by 2 with slice function
df.loc[:, slice('quz',None, 2)]
# quz cat dat

# select specific columns with a list
# select columns foo, bar and dat
df.loc[:, ['foo','bar','dat']]
# foo bar dat

You can slice by rows and columns. For instance, if you have 5 rows with labels v, w, x, y, z

# slice from 'w' to 'y' and 'foo' to 'ant' by 3
df.loc['w':'y', 'foo':'ant':3]
#    foo ant
# w
# x
# y

Answer 1

Note: .ix has been deprecated since Pandas v0.20. You should instead use .loc or .iloc, as appropriate.

The DataFrame.ix index is what you want to be accessing. It’s a little confusing (I agree that Pandas indexing is perplexing at times!), but the following seems to do what you want:

>>> df = DataFrame(np.random.rand(4,5), columns = list('abcde'))
>>> df.ix[:,'b':]
      b         c         d         e
0  0.418762  0.042369  0.869203  0.972314
1  0.991058  0.510228  0.594784  0.534366
2  0.407472  0.259811  0.396664  0.894202
3  0.726168  0.139531  0.324932  0.906575

where .ix[row slice, column slice] is what is being interpreted. More on Pandas indexing here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-advanced
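
Since .ix is gone from modern pandas, a sketch of the same selection with the recommended indexers:

df.loc[:, 'b':]   # label-based: columns from 'b' to the end
df.iloc[:, 1:]    # position-based: columns from position 1 to the end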


Answer 2

Let’s use the titanic dataset from the seaborn package as an example:

# Load dataset (pip install seaborn)
>> import seaborn as sns   # seaborn.apionly was removed in newer versions of seaborn
>> titanic = sns.load_dataset('titanic')

using the column names

>> titanic.loc[:,['sex','age','fare']]

using the column indices

>> titanic.iloc[:,[2,3,6]]

using ix (pandas versions older than 0.20)

>> titanic.ix[:,['sex','age','fare']]

or

>> titanic.ix[:,[2,3,6]]

using the reindex method

>> titanic.reindex(columns=['sex','age','fare'])

Answer 3

Also, given a DataFrame

data

as in your example, if you would like to extract columns a and d only (i.e. the 1st and the 4th columns), the iloc method of the pandas dataframe is what you need and can be used very effectively. All you need to know are the indices of the columns you would like to extract. For example:

>>> data.iloc[:,[0,3]]

will give you

          a         d
0  0.883283  0.100975
1  0.614313  0.221731
2  0.438963  0.224361
3  0.466078  0.703347
4  0.955285  0.114033
5  0.268443  0.416996
6  0.613241  0.327548
7  0.370784  0.359159
8  0.692708  0.659410
9  0.806624  0.875476

Answer 4

You can slice along the columns of a DataFrame by referring to the names of each column in a list, like so:

data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
data_ab = data[list('ab')]
data_cde = data[list('cde')]

Answer 5

And if you came here looking to slice two ranges of columns and combine them together (like me), you can do something like

op = df[list(df.columns[0:899]) + list(df.columns[3593:])]
print(op)

This will create a new dataframe with the first 899 columns and (all) columns from 3593 onward (assuming you have some 4000 columns in your data set).
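
An alternative sketch using numpy's np.r_ helper to combine the two positional ranges in a single iloc call:

import numpy as np

op = df.iloc[:, np.r_[0:899, 3593:len(df.columns)]]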


Answer 6

Here’s how you can use different methods to do selective column slicing, including label-based, index-based, and range-based column slicing.

In [37]: import pandas as pd    
In [38]: import numpy as np
In [43]: df = pd.DataFrame(np.random.rand(4,7), columns = list('abcdefg'))

In [44]: df
Out[44]: 
          a         b         c         d         e         f         g
0  0.409038  0.745497  0.890767  0.945890  0.014655  0.458070  0.786633
1  0.570642  0.181552  0.794599  0.036340  0.907011  0.655237  0.735268
2  0.568440  0.501638  0.186635  0.441445  0.703312  0.187447  0.604305
3  0.679125  0.642817  0.697628  0.391686  0.698381  0.936899  0.101806

In [45]: df.loc[:, ["a", "b", "c"]] ## label based selective column slicing 
Out[45]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

In [46]: df.loc[:, "a":"c"] ## label based column ranges slicing 
Out[46]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

In [47]: df.iloc[:, 0:3] ## index based column ranges slicing 
Out[47]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

### with 2 different column ranges, index based slicing: 
In [49]: df[df.columns[0:1].tolist() + df.columns[1:3].tolist()]
Out[49]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

Answer 7

These two are equivalent:

 >>> print(df2.loc[140:160,['Relevance','Title']])
 >>> print(df2.ix[140:160,[3,7]])

Answer 8

If the data frame looks like this:

group         name      count
fruit         apple     90
fruit         banana    150
fruit         orange    130
vegetable     broccoli  80
vegetable     kale      70
vegetable     lettuce   125

and the desired output looks like:

   group    name  count
0  fruit   apple     90
1  fruit  banana    150
2  fruit  orange    130

you can use the logical operator np.logical_not:

df[np.logical_not(df['group'] == 'vegetable')]

More about this:

https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.logic.html

Other logical operators:

  1. logical_and(x1, x2, /[, out, where, …]) Compute the truth value of x1 AND x2 element-wise.

  2. logical_or(x1, x2, /[, out, where, casting, …]) Compute the truth value of x1 OR x2 element-wise.

  3. logical_not(x, /[, out, where, casting, …]) Compute the truth value of NOT x element-wise.
  4. logical_xor(x1, x2, /[, out, where, ..]) Compute the truth value of x1 XOR x2, element-wise.

Answer 9

Another way to get a subset of columns from your DataFrame, assuming you want all the rows, would be to do:
data[['a','b']] and data[['c','d','e']]
If you want to use numerical column indexes you can do:
data[data.columns[:2]] and data[data.columns[2:]]


How to add pandas data to an existing csv file?

Question: How to add pandas data to an existing csv file?

I want to know if it is possible to use the pandas to_csv() function to add a dataframe to an existing csv file. The csv file has the same structure as the loaded data.


Answer 0

You can specify a python write mode in the pandas to_csv function. For append it is 'a'.

In your case:

df.to_csv('my_csv.csv', mode='a', header=False)

The default mode is 'w'.
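
A minimal sketch of the full round trip (file name made up):

import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2]})
df.to_csv('my_csv.csv', mode='w')                # first write: header included
df.to_csv('my_csv.csv', mode='a', header=False)  # later appends: header skipped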


Answer 1

You can append to a csv by opening the file in append mode:

with open('my_csv.csv', 'a') as f:
    df.to_csv(f, header=False)

If this was your csv, foo.csv:

,A,B,C
0,1,2,3
1,4,5,6

If you read that and then append, for example, df + 6:

In [1]: df = pd.read_csv('foo.csv', index_col=0)

In [2]: df
Out[2]:
   A  B  C
0  1  2  3
1  4  5  6

In [3]: df + 6
Out[3]:
    A   B   C
0   7   8   9
1  10  11  12

In [4]: with open('foo.csv', 'a') as f:
             (df + 6).to_csv(f, header=False)

foo.csv becomes:

,A,B,C
0,1,2,3
1,4,5,6
0,7,8,9
1,10,11,12

Answer 2

with open(filename, 'a') as f:
    df.to_csv(f, header=f.tell()==0)
  • Create file unless exists, otherwise append
  • Add header if file is being created, otherwise skip it

Answer 3

A little helper function I use with some header checking safeguards to handle it all:

def appendDFToCSV_void(df, csvFilePath, sep=","):
    import os
    import pandas as pd  # needed for the pd.read_csv header checks below
    if not os.path.isfile(csvFilePath):
        df.to_csv(csvFilePath, mode='a', index=False, sep=sep)
    elif len(df.columns) != len(pd.read_csv(csvFilePath, nrows=1, sep=sep).columns):
        raise Exception("Columns do not match!! Dataframe has " + str(len(df.columns)) + " columns. CSV file has " + str(len(pd.read_csv(csvFilePath, nrows=1, sep=sep).columns)) + " columns.")
    elif not (df.columns == pd.read_csv(csvFilePath, nrows=1, sep=sep).columns).all():
        raise Exception("Columns and column order of dataframe and csv file do not match!!")
    else:
        df.to_csv(csvFilePath, mode='a', index=False, sep=sep, header=False)

Answer 4

Initially starting with pyspark dataframes, I got type conversion errors (when converting to pandas dfs and then appending to csv), given the schema/column types in my pyspark dataframes.

Solved the problem by forcing all columns in each df to be of type string and then appending this to csv as follows:

with open('testAppend.csv', 'a') as f:
    df2.toPandas().astype(str).to_csv(f, header=False)

Answer 5

A bit late to the party but you can also use a context manager, if you’re opening and closing your file multiple times, or logging data, statistics, etc.

from contextlib import contextmanager
import pandas as pd

@contextmanager
def open_file(path, mode):
    file_to = open(path, mode)
    yield file_to
    file_to.close()


## later
saved_df = pd.DataFrame(data)
with open_file('yourcsv.csv', 'a') as outfile:   # open once in append mode
    saved_df.to_csv(outfile, header=False)       # write to the yielded handle

Pandas read_csv low_memory and dtype options

Question: Pandas read_csv low_memory and dtype options

When calling

df = pd.read_csv('somefile.csv')

I get:

/Users/josh/anaconda/envs/py27/lib/python2.7/site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype option on import or set low_memory=False.

Why is the dtype option related to low_memory, and why would making it False help with this problem?


Answer 0

The deprecated low_memory option

The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently [source].

The reason you get this low_memory warning is because guessing dtypes for each column is very memory demanding. Pandas tries to determine what dtype to set by analyzing the data in each column.

Dtype Guessing (very bad)

Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read unless you risk having to change the dtype of that column when you read the last value.

Consider the example of one file which has a column called user_id. It contains 10 million rows where the user_id is always numbers. Since pandas cannot know it is only numbers, it will probably keep it as the original strings until it has read the whole file.

Specifying dtypes (should always be done)

adding

dtype={'user_id': int}

to the pd.read_csv() call will let pandas know, when it starts reading the file, that this column contains only integers.

Also worth noting is that if the last line in the file would have "foobar" written in the user_id column, the loading would crash if the above dtype was specified.

Example of broken data that breaks when dtypes are defined

import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO


csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""
sio = StringIO(csvdata)
pd.read_csv(sio, dtype={"user_id": int, "username": "string"})

ValueError: invalid literal for long() with base 10: 'foobar'

dtypes are typically a numpy thing, read more about them here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

What dtypes exists?

We have access to numpy dtypes: float, int, bool, timedelta64[ns] and datetime64[ns]. Note that the numpy date/time dtypes are not time zone aware.

Pandas extends this set of dtypes with its own:

‘datetime64[ns, <tz>]’, which is a time zone aware timestamp.

‘category’, which is essentially an enum (strings represented by integer keys to save space).

‘period[<freq>]’, not to be confused with a timedelta; these objects are actually anchored to specific time periods.

‘Sparse’, ‘Sparse[int]’, ‘Sparse[float]’ are for sparse data, or ‘data that has a lot of holes in it’. Instead of saving the NaN or None in the dataframe it omits the objects, saving space.

‘Interval’ is a topic of its own but its main use is for indexing. See more here

‘Int8’, ‘Int16’, ‘Int32’, ‘Int64’, ‘UInt8’, ‘UInt16’, ‘UInt32’, ‘UInt64’ are all pandas specific integers that are nullable, unlike the numpy variant.

‘string’ is a specific dtype for working with string data and gives access to the .str attribute on the series.

‘boolean’ is like the numpy ‘bool’ but it also supports missing data.

Read the complete reference here:

Pandas dtype reference

Gotchas, caveats, notes

Setting dtype=object will silence the above warning, but will not make it more memory efficient, only process efficient if anything.

Setting dtype=unicode will not do anything, since to numpy, a unicode is represented as object.

Usage of converters

@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the read_csv process is a single process.

CSV files can be processed line by line and thus can be processed by multiple converters in parallel more efficiently by simply cutting the file into segments and running multiple processes, something that pandas does not support. But this is a different story.


Answer 1

Try:

dashboard_df = pd.read_csv(p_file, sep=',', error_bad_lines=False, index_col=False, dtype='unicode')

According to the pandas documentation:

dtype : Type name or dict of column -> type

As for low_memory, it’s True by default and isn’t yet documented. I don’t think it’s relevant though. The error message is generic, so you shouldn’t need to mess with low_memory anyway. Hope this helps, and let me know if you have further problems.


Answer 2

df = pd.read_csv('somefile.csv', low_memory=False)

This should solve the issue. I got exactly the same error, when reading 1.8M rows from a CSV.


Answer 3

As mentioned earlier by firelynx, if dtype is explicitly specified and there is mixed data that is not compatible with that dtype, then loading will crash. I used a converter like this as a workaround to change the values with an incompatible data type so that the data could still be loaded.

def conv(val):
    if not val:
        return 0    
    try:
        return np.float64(val)
    except:        
        return np.float64(0)

df = pd.read_csv(csv_file,converters={'COL_A':conv,'COL_B':conv})

Answer 4

I had a similar issue with a ~400MB file. Setting low_memory=False did the trick for me. Do the simple things first: check that your dataframe isn’t bigger than your system memory, reboot, and clear the RAM before proceeding. If you’re still running into errors, it’s worth making sure your .csv file is OK; take a quick look in Excel and make sure there’s no obvious corruption. Broken original data can wreak havoc…


Answer 5

I was facing a similar issue when processing a huge csv file (6 million rows). I had three issues:

  1. the file contained strange characters (fixed by specifying the encoding)

  2. the datatype was not specified (fixed using the dtype property)

  3. using the above I still faced an issue related to the file_format, which could not be defined based on the filename (fixed using try .. except..)

df = pd.read_csv(csv_file,sep=';', encoding = 'ISO-8859-1',
                 names=['permission','owner_name','group_name','size','ctime','mtime','atime','filename','full_filename'],
                 dtype={'permission':str,'owner_name':str,'group_name':str,'size':str,'ctime':object,'mtime':object,'atime':object,'filename':str,'full_filename':str,'first_date':object,'last_date':object})

try:
    df['file_format'] = [Path(f).suffix[1:] for f in df.filename.tolist()]
except:
    df['file_format'] = ''

Answer 6

It worked for me with low_memory = False while importing a DataFrame. That was the only change I needed:

df = pd.read_csv('export4_16.csv',low_memory=False)

How to store a DataFrame using pandas

Question: How to store a DataFrame using pandas

Right now I’m importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs so I don’t have to spend all that time waiting for the script to run?


Answer 0

The easiest way is to pickle it using to_pickle:

df.to_pickle(file_name)  # where to save it, usually as a .pkl

Then you can load it back using:

df = pd.read_pickle(file_name)

Note: before 0.11.1 save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).


Another popular choice is to use HDF5 (pytables) which offers very fast access times for large datasets:

store = pd.HDFStore('store.h5')  # HDFStore is in the pandas namespace

store['df'] = df  # save it
store['df']  # load it

More advanced strategies are discussed in the cookbook.


Since 0.13 there’s also msgpack which may be better for interoperability, as a faster alternative to JSON, or if you have python object/text-heavy data (see this question).
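
As a small aside (my addition, not part of the original answer): to_pickle/read_pickle can also compress transparently, since compression='infer' is the default and the format is picked up from the file extension. A minimal sketch (filename hypothetical):

import pandas as pd

df = pd.DataFrame({'a': range(1000)})

df.to_pickle('df.pkl.gz')              # gzip-compressed on the way out
df_back = pd.read_pickle('df.pkl.gz')  # decompressed on the way back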


Answer 1

Although there are already some answers, I found a nice comparison in which they tried several ways to serialize Pandas DataFrames: Efficiently Store Pandas DataFrames.

They compare:

  • pickle: original ASCII data format
  • cPickle, a C library
  • pickle-p2: uses the newer binary format
  • json: standardlib json library
  • json-no-index: like json, but without index
  • msgpack: binary JSON alternative
  • CSV
  • hdfstore: HDF5 storage format

In their experiment, they serialize a DataFrame of 1,000,000 rows with the two columns tested separately: one with text data, the other with numbers. Their disclaimer says:

You should not trust that what follows generalizes to your data. You should look at your own data and run benchmarks yourself.

The source code for the test which they refer to is available online. Since this code did not work directly, I made some minor changes, which you can get here: serialize.py. I got the following results:

They also mention that with the conversion of text data to categorical data the serialization is much faster. In their test about 10 times as fast (also see the test code).

Edit: The higher times for pickle than CSV can be explained by the data format used. By default pickle uses a printable ASCII representation, which generates larger data sets. As can be seen from the graph however, pickle using the newer binary data format (version 2, pickle-p2) has much lower load times.

Some other references:


Answer 2

If I understand correctly, you’re already using pandas.read_csv() but would like to speed up the development process so that you don’t have to load the file in every time you edit your script, is that right? I have a few recommendations:

  1. you could load in only part of the CSV file using pandas.read_csv(..., nrows=1000) to only load the top bit of the table, while you’re doing the development

  2. use ipython for an interactive session, such that you keep the pandas table in memory as you edit and reload your script.

  3. convert the csv to an HDF5 table

  4. updated: use DataFrame.to_feather() and pd.read_feather() to store data in the R-compatible feather binary format that is super fast (in my hands, slightly faster than pandas.to_pickle() on numeric data and much faster on string data).

You might also be interested in this answer on stackoverflow.
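
A minimal sketch of the feather round trip from point 4 (my addition; the filename is hypothetical and pyarrow is assumed to be installed):

import pandas as pd

df = pd.DataFrame({'a': range(1000), 'b': list('xy') * 500})

df.to_feather('df.feather')            # requires a default RangeIndex
df_back = pd.read_feather('df.feather')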


Answer 3

Pickle works well!

import pandas as pd
df.to_pickle('123.pkl')    #to save the dataframe, df to 123.pkl
df1 = pd.read_pickle('123.pkl') #to load 123.pkl back to the dataframe df

Answer 4

You can use the feather file format. It is extremely fast.

df.to_feather('filename.ft')

Answer 5

Pandas DataFrames have the to_pickle function which is useful for saving a DataFrame:

import pandas as pd

a = pd.DataFrame({'A':[0,1,0,1,0],'B':[True, True, False, False, False]})
print(a)
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

a.to_pickle('my_file.pkl')

b = pd.read_pickle('my_file.pkl')
print(b)
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

Answer 6

As already mentioned there are different options and file formats (HDF5, JSON, CSV, parquet, SQL) to store a data frame. However, pickle is not a first-class citizen (depending on your setup), because:

  1. pickle is a potential security risk. From the Python documentation for pickle:

Warning The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

  2. pickle is slow. Benchmarks can be found here and here.

Depending on your setup/usage both limitations do not apply, but I would not recommend pickle as the default persistence for pandas data frames.


Answer 7

Numpy file formats are pretty fast for numerical data

I prefer to use numpy files since they’re fast and easy to work with. Here’s a simple benchmark for saving and loading a dataframe with 1 column of 1 million points.

import numpy as np
import pandas as pd

num_dict = {'voltage': np.random.rand(1000000)}
num_df = pd.DataFrame(num_dict)

using ipython’s %%timeit magic function

%%timeit
with open('num.npy', 'wb') as np_file:
    np.save(np_file, num_df)

the output is

100 loops, best of 3: 5.97 ms per loop

to load the data back into a dataframe

%%timeit
with open('num.npy', 'rb') as np_file:
    data = np.load(np_file)

data_df = pd.DataFrame(data)

the output is

100 loops, best of 3: 5.12 ms per loop

NOT BAD!

CONS

There’s a problem if you save the numpy file using python 2 and then try opening using python 3 (or vice versa).
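
Also worth noting (my addition, not from the original answer): np.save stores only the raw values, so the index and column names are lost on the round trip. A minimal sketch of reattaching them by hand:

import numpy as np
import pandas as pd

num_df = pd.DataFrame({'voltage': np.random.rand(1000)})

np.save('num.npy', num_df.to_numpy())  # only the values are written

data = np.load('num.npy')
restored = pd.DataFrame(data, columns=['voltage'])  # reattach the lost metadata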


Answer 8

https://docs.python.org/3/library/pickle.html

The pickle protocol formats:

Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.

Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.

Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.

Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol, and the recommended protocol when compatibility with other Python 3 versions is required.

Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. Refer to PEP 3154 for information about improvements brought by protocol 4.
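
For pandas this matters when pickles must cross Python versions: in recent pandas versions to_pickle accepts a protocol argument. A minimal sketch (filename hypothetical):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# pin an older protocol so the file stays readable on older Python versions
df.to_pickle('df.pkl', protocol=4)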


Answer 9

pyarrow compatibility across versions

Overall the move has been to pyarrow/feather (given the deprecation warnings from pandas for msgpack). However, I have a challenge with pyarrow being transient in its specification: data serialized with pyarrow 0.15.1 cannot be deserialized with 0.16.0 (ARROW-7961). I’m using serialization for redis, so I have to use a binary encoding.

I’ve retested various options (using jupyter notebook)

import sys, pickle, zlib, warnings, io
import pyarrow as pa  # needed for pa.serialize below; `out` is the DataFrame under test
class foocls:
    def pyarrow(out): return pa.serialize(out).to_buffer().to_pybytes()
    def msgpack(out): return out.to_msgpack()
    def pickle(out): return pickle.dumps(out)
    def feather(out): return out.to_feather(io.BytesIO())
    def parquet(out): return out.to_parquet(io.BytesIO())

warnings.filterwarnings("ignore")
for c in foocls.__dict__.values():
    sbreak = True
    try:
        c(out)
        print(c.__name__, "before serialization", sys.getsizeof(out))
        print(c.__name__, sys.getsizeof(c(out)))
        %timeit -n 50 c(out)
        print(c.__name__, "zlib", sys.getsizeof(zlib.compress(c(out))))
        %timeit -n 50 zlib.compress(c(out))
    except TypeError as e:
        if "not callable" in str(e): sbreak = False
        else: raise
    except (ValueError) as e: print(c.__name__, "ERROR", e)
    finally: 
        if sbreak: print("=+=" * 30)        
warnings.filterwarnings("default")

With the following results for my data frame (in the out jupyter variable):

pyarrow before serialization 533366
pyarrow 120805
1.03 ms ± 43.9 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
pyarrow zlib 20517
2.78 ms ± 81.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
msgpack before serialization 533366
msgpack 109039
1.74 ms ± 72.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
msgpack zlib 16639
3.05 ms ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
pickle before serialization 533366
pickle 142121
733 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
pickle zlib 29477
3.81 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
feather ERROR feather does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
parquet ERROR Nested column branch had multiple children: struct<x: double, y: double>
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=

feather and parquet do not work for my data frame, so I’m going to continue using pyarrow. However, I will supplement it with pickle (no compression): when writing to the cache I store both the pyarrow and pickle serialised forms, and when reading from the cache I fall back to pickle if pyarrow deserialisation fails.


Answer 10

The format depends on your use-case

  • Save DataFrame between notebook sessions – feather, if you’re used to pickle – also ok.
  • Save DataFrame in smallest possible file size – parquet or pickle.gz (check what’s better for your data)
  • Save a very big DataFrame (10+ millions of rows) – hdf
  • Be able to read the data on another platform (not Python) that doesn’t support other formats – csv, csv.gz, check if parquet is supported
  • Be able to review with your eyes / using Excel / Google Sheets / Git diff – csv
  • Save a DataFrame that takes almost all the RAM – csv

A comparison of the pandas file formats is in this video.
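
To make the “smallest possible file size” bullet concrete, a minimal sketch using parquet with gzip compression (my addition; filenames hypothetical, and pyarrow or fastparquet is assumed to be installed):

import pandas as pd

df = pd.DataFrame({'a': range(1000)})

df.to_parquet('df.parquet', compression='gzip')  # trades write speed for size
df_back = pd.read_parquet('df.parquet')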


Pandas conditional creation of a Series/DataFrame column

Question: Pandas conditional creation of a Series/DataFrame column

I have a dataframe along the lines of the below:

    Type       Set
1    A          Z
2    B          Z           
3    B          X
4    C          Y

I want to add another column to the dataframe (or generate a series) of the same length as the dataframe (equal number of records/rows) which sets a colour 'green' if Set == 'Z' and 'red' if Set equals anything else.

What’s the best way to do this?


Answer 0

If you only have two choices to select from:

df['color'] = np.where(df['Set']=='Z', 'green', 'red')

For example,

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)

yields

  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red

If you have more than two conditions then use np.select. For example, if you want color to be

  • yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
  • otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
  • otherwise purple when (df['Type'] == 'B')
  • otherwise black,

then use

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)

which yields

  Set Type   color
0   Z    A  yellow
1   Z    B    blue
2   X    B  purple
3   Y    C   black

Answer 1

List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.

Example list comprehension:

df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]

%timeit tests:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop
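
Worth noting (my addition): the timings above use a 4-row frame, so they mostly measure per-call overhead. A sketch of rerunning the comparison at a more representative size, where the relative ranking may change:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Set': np.random.choice(list('ZXY'), size=100_000)})

%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set'] == 'Z', 'green', 'red')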

Answer 2

Another way in which this could be achieved is

df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

Answer 3

Here’s yet another way to skin this cat, using a dictionary to map new values onto the keys in the list:

import pandas as pd

def map_values(row, values_dict):
    return values_dict[row]

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})

df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))

What’s it look like:

df
Out[2]: 
  INDICATOR  VALUE  NEW_VALUE
0         A     10          1
1         B      9          2
2         C      8          3
3         D      7          4

This approach can be very powerful when you have many if/else-type statements to make (i.e. many unique values to replace).

And of course you could always do this:

df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)

But that approach is more than three times as slow as the apply approach from above, on my machine.

And you could also do this, using dict.get:

df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]

Answer 4

The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.

Simple example using just the “Set” column:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Example with more colours and more columns taken into account:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    elif row["Type"] == "C":
        return "blue"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C   blue

Edit (21/06/2019): Using plydata

It is also possible to use plydata to do this kind of thing (this seems even slower than using assign and apply, though).

from plydata import define, if_else

Simple if_else:

df = define(df, color=if_else('Set=="Z"', '"red"', '"green"'))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Nested if_else:

df = define(df, color=if_else(
    'Set=="Z"',
    '"red"',
    if_else('Type=="C"', '"green"', '"blue"')))

print(df)                            
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B   blue
3   Y    C  green

Answer 5

Maybe this has been made possible by newer updates of Pandas (tested with pandas=1.0.5), but I think the following is the shortest and maybe the best answer for the question so far. You can use the .loc method and use one condition or several depending on your need.

Code Summary:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[(df['Set']=="Z"), 'Color'] = "green"

#practice!
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

Explanation:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))

# df so far: 
  Type Set  
0    A   Z 
1    B   Z 
2    B   X 
3    C   Y

add a ‘color’ column and set all values to “red”

df['Color'] = "red"

Apply your single condition:

df.loc[(df['Set']=="Z"), 'Color'] = "green"


# df: 
  Type Set  Color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

or multiple conditions if you want:

df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

You can read on Pandas logical operators and conditional selection here: Logical operators for boolean indexing in Pandas


Answer 6

A one-liner with the .apply() method is the following:

df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')

After that, df data frame looks like this:

>>> print(df)
  Type Set  color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

Answer 7

If you’re working with massive data, a memoized approach would be best:

# First create a dictionary of manually stored values
color_dict = {'Z':'red'}

# Second, build a dictionary of "other" values
color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()}

# Next, merge the two
color_dict.update(color_dict_other)

# Finally, map it to your column
df['color'] = df['Set'].map(color_dict)

This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when: data_size > 10**4 & n_distinct < data_size/4

E.g., memoize in a case of 10,000 rows with 2,500 or fewer distinct values.
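
A sketch of how one might test that rule of thumb before committing to the memoized path (my addition; the data here is synthetic):

import pandas as pd

df = pd.DataFrame({'Set': list('ZZXY') * 5000})  # 20,000 rows, 3 distinct values
color_dict = {'Z': 'red', 'X': 'green', 'Y': 'green'}

data_size = len(df)
n_distinct = df['Set'].nunique()

if data_size > 10**4 and n_distinct < data_size / 4:
    df['color'] = df['Set'].map(color_dict)  # memoized path pays off here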


Convert a Python dictionary to a DataFrame

Question: Convert a Python dictionary to a DataFrame

I have a Python dictionary like the following:

{u'2012-06-08': 388,
 u'2012-06-09': 388,
 u'2012-06-10': 388,
 u'2012-06-11': 389,
 u'2012-06-12': 389,
 u'2012-06-13': 389,
 u'2012-06-14': 389,
 u'2012-06-15': 389,
 u'2012-06-16': 389,
 u'2012-06-17': 389,
 u'2012-06-18': 390,
 u'2012-06-19': 390,
 u'2012-06-20': 390,
 u'2012-06-21': 390,
 u'2012-06-22': 390,
 u'2012-06-23': 390,
 u'2012-06-24': 390,
 u'2012-06-25': 391,
 u'2012-06-26': 391,
 u'2012-06-27': 391,
 u'2012-06-28': 391,
 u'2012-06-29': 391,
 u'2012-06-30': 391,
 u'2012-07-01': 391,
 u'2012-07-02': 392,
 u'2012-07-03': 392,
 u'2012-07-04': 392,
 u'2012-07-05': 392,
 u'2012-07-06': 392}

The keys are Unicode dates and the values are integers. I would like to convert this into a pandas dataframe by having the dates and their corresponding values as two separate columns. Example: col1: Dates col2: DateValue (the dates are still Unicode and datevalues are still integers)

     Date         DateValue
0    2012-07-01    391
1    2012-07-02    392
2    2012-07-03    392
.    2012-07-04    392
.    ...           ...
.    ...           ...

Any help in this direction would be much appreciated. I am unable to find resources on the pandas docs to help me with this.

I know one solution might be to convert each key-value pair in this dict, into a dict so the entire structure becomes a dict of dicts, and then we can add each row individually to the dataframe. But I want to know if there is an easier way and a more direct way to do this.

So far I have tried converting the dict into a series object but this doesn’t seem to maintain the relationship between the columns:

s  = Series(my_dict,index=my_dict.keys())

Answer 0

The error here arises from calling the DataFrame constructor with scalar values (it expects values to be a list/dict/…, i.e. to have multiple columns):

pd.DataFrame(d)
ValueError: If using all scalar values, you must pass an index

You could take the items from the dictionary (i.e. the key-value pairs):

In [11]: pd.DataFrame(d.items())  # or list(d.items()) in python 3
Out[11]:
             0    1
0   2012-07-02  392
1   2012-07-06  392
2   2012-06-29  391
3   2012-06-28  391
...

In [12]: pd.DataFrame(d.items(), columns=['Date', 'DateValue'])
Out[12]:
          Date  DateValue
0   2012-07-02        392
1   2012-07-06        392
2   2012-06-29        391

But I think it makes more sense to pass the Series constructor:

In [21]: s = pd.Series(d, name='DateValue')
Out[21]:
2012-06-08    388
2012-06-09    388
2012-06-10    388

In [22]: s.index.name = 'Date'

In [23]: s.reset_index()
Out[23]:
          Date  DateValue
0   2012-06-08        388
1   2012-06-09        388
2   2012-06-10        388

Answer 1

When converting a dictionary into a pandas dataframe where you want the keys to be the columns of said dataframe and the values to be the row values, you can simply put brackets around the dictionary like this:

>>> dict_ = {'key 1': 'value 1', 'key 2': 'value 2', 'key 3': 'value 3'}
>>> pd.DataFrame([dict_])

    key 1     key 2     key 3
0   value 1   value 2   value 3

It’s saved me some headaches so I hope it helps someone out there!

EDIT: In the pandas docs one option for the data parameter in the DataFrame constructor is a list of dictionaries. Here we’re passing a list with one dictionary in it.


Answer 2

As explained in another answer, using pandas.DataFrame() directly here will not act as you think.

What you can do is use pandas.DataFrame.from_dict with orient='index':

In[7]: pandas.DataFrame.from_dict({u'2012-06-08': 388,
 u'2012-06-09': 388,
 u'2012-06-10': 388,
 u'2012-06-11': 389,
 u'2012-06-12': 389,
 .....
 u'2012-07-05': 392,
 u'2012-07-06': 392}, orient='index', columns=['foo'])
Out[7]: 
            foo
2012-06-08  388
2012-06-09  388
2012-06-10  388
2012-06-11  389
2012-06-12  389
........
2012-07-05  392
2012-07-06  392

Answer 3

Pass the items of the dictionary to the DataFrame constructor, and give the column names. After that parse the Date column to get Timestamp values.

Note the difference between python 2.x and 3.x:

In python 2.x:

df = pd.DataFrame(data.items(), columns=['Date', 'DateValue'])
df['Date'] = pd.to_datetime(df['Date'])

In Python 3.x: (requiring an additional ‘list’)

df = pd.DataFrame(list(data.items()), columns=['Date', 'DateValue'])
df['Date'] = pd.to_datetime(df['Date'])

Answer 4

p.s. in particular, I’ve found Row-Oriented examples helpful; since often that how records are stored externally.

https://pbpython.com/pandas-list-dict.html


Answer 5

Pandas has a built-in function for converting a dict to a data frame:

pd.DataFrame.from_dict(dictionaryObject, orient='index')

For your data you can convert it like below:

import pandas as pd
your_dict={u'2012-06-08': 388,
 u'2012-06-09': 388,
 u'2012-06-10': 388,
 u'2012-06-11': 389,
 u'2012-06-12': 389,
 u'2012-06-13': 389,
 u'2012-06-14': 389,
 u'2012-06-15': 389,
 u'2012-06-16': 389,
 u'2012-06-17': 389,
 u'2012-06-18': 390,
 u'2012-06-19': 390,
 u'2012-06-20': 390,
 u'2012-06-21': 390,
 u'2012-06-22': 390,
 u'2012-06-23': 390,
 u'2012-06-24': 390,
 u'2012-06-25': 391,
 u'2012-06-26': 391,
 u'2012-06-27': 391,
 u'2012-06-28': 391,
 u'2012-06-29': 391,
 u'2012-06-30': 391,
 u'2012-07-01': 391,
 u'2012-07-02': 392,
 u'2012-07-03': 392,
 u'2012-07-04': 392,
 u'2012-07-05': 392,
 u'2012-07-06': 392}

your_df_from_dict=pd.DataFrame.from_dict(your_dict,orient='index')
print(your_df_from_dict)

Answer 6

pd.DataFrame({'date' : dict_dates.keys() , 'date_value' : dict_dates.values() })

Answer 7

You can also just pass the keys and values of the dictionary to the new dataframe, like so:

import pandas as pd

myDict = {<the_dict_from_your_example>}
df = pd.DataFrame()
df['Date'] = myDict.keys()
df['DateValue'] = myDict.values()

Answer 8

In my case I wanted the keys and values of a dict to be the columns and values of the DataFrame. So the only thing that worked for me was:

import numpy as np
import pandas as pd

data = {'adjust_power': 'y', 'af_policy_r_submix_prio_adjust': '[null]', 'af_rf_info': '[null]', 'bat_ac': '3500', 'bat_capacity': '75'} 

columns = list(data.keys())
values = list(data.values())
arr_len = len(values)

pd.DataFrame(np.array(values, dtype=object).reshape(1, arr_len), columns=columns)

Answer 9

This is what worked for me, since I wanted to have a separate index column

df = pd.DataFrame.from_dict(some_dict, orient="index").reset_index()
df.columns = ['A', 'B']

Answer 10

This accepts a dict as argument and returns a dataframe with the keys of the dict as the index and the values as a column.

def dict_to_df(d):
    df=pd.DataFrame(d.items())
    df.set_index(0, inplace=True)
    return df
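
For illustration (my addition), calling it on a slice of the question’s dict; note that the value column keeps its default label 1, and on older pandas you may need pd.DataFrame(list(d.items())) inside the function:

d = {u'2012-06-08': 388, u'2012-06-09': 388, u'2012-06-10': 388}

df = dict_to_df(d)
#               1
# 0
# 2012-06-08  388
# 2012-06-09  388
# 2012-06-10  388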

Answer 11

This is how it worked for me :

df= pd.DataFrame([d.keys(), d.values()]).T
df.columns= ['keys', 'values']  # call them whatever you like

I hope this helps


Answer 12

d = {'Date': list(yourDict.keys()),'Date_Values': list(yourDict.values())}
df = pandas.DataFrame(data=d)

If you don’t encapsulate yourDict.keys() inside of list(), then you will end up with all of your keys and values being placed in every row of every column. Like this:

                                                Date  \
0  (2012-06-08, 2012-06-09, 2012-06-10, 2012-06-1...
1  (2012-06-08, 2012-06-09, 2012-06-10, 2012-06-1...
2  (2012-06-08, 2012-06-09, 2012-06-10, 2012-06-1...
3  (2012-06-08, 2012-06-09, 2012-06-10, 2012-06-1...
4  (2012-06-08, 2012-06-09, 2012-06-10, 2012-06-1...

But by adding list(), the result looks like this:

         Date  Date_Values
0  2012-06-08          388
1  2012-06-09          388
2  2012-06-10          388
3  2012-06-11          389
4  2012-06-12          389
...


Answer 13

I have run into this several times and have an example dictionary that I created from a function get_max_path(), which returns the sample dictionary:

{2: 0.3097502930247044, 3: 0.4413177909384636, 4: 0.5197224051562838, 5: 0.5717654946470984, 6: 0.6063959031223476, 7: 0.6365209824708223, 8: 0.655918861281035, 9: 0.680844386645206}

To convert this to a dataframe, I ran the following:

df = pd.DataFrame.from_dict(get_max_path(2), orient = 'index').reset_index()

Returns a simple two column dataframe with a separate index:

   index         0
0      2  0.309750
1      3  0.441318

Just rename the columns using df.rename(columns={'index': 'Column1', 0: 'Column2'}, inplace=True)


Answer 14

I think that you can make some changes in your data format when you create the dictionary; then you can easily convert it to a DataFrame:

input:

a={'Dates':['2012-06-08','2012-06-10'],'Date_value':[388,389]}

output:

{'Date_value': [388, 389], 'Dates': ['2012-06-08', '2012-06-10']}

input:

aframe = pd.DataFrame(a)

output: will be your DataFrame

You just need to do some text editing somewhere like Sublime or maybe Excel.


How to check whether a pandas DataFrame is empty?

Question: How to check whether a pandas DataFrame is empty?

How to check whether a pandas DataFrame is empty? In my case I want to print some message in terminal if the DataFrame is empty.


Answer 0

You can use the attribute df.empty to check whether it’s empty or not:

if df.empty:
    print('DataFrame is empty!')

Source: Pandas Documentation


Answer 1

I use the len function. It’s much faster than empty. len(df.index) is even faster.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD'))

def empty(df):
    return df.empty

def lenz(df):
    return len(df) == 0

def lenzi(df):
    return len(df.index) == 0

'''
%timeit empty(df)
%timeit lenz(df)
%timeit lenzi(df)

10000 loops, best of 3: 13.9 µs per loop
100000 loops, best of 3: 2.34 µs per loop
1000000 loops, best of 3: 695 ns per loop

len on index seems to be faster
'''

Answer 2

I prefer going the long route. These are the checks I follow to avoid using a try-except clause –

  1. check that the variable is not None
  2. then check that it’s a DataFrame, and
  3. make sure it’s not empty

Here, DATA is the suspect variable –

DATA is not None and isinstance(DATA, pd.DataFrame) and not DATA.empty
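
Wrapped as a small helper for readability, a sketch (my addition) mirroring the three checks above:

import pandas as pd

def has_data(obj):
    # None guard, type guard, then the emptiness check
    return obj is not None and isinstance(obj, pd.DataFrame) and not obj.empty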

Answer 3

To see if a dataframe is empty, I argue that one should test for the length of a dataframe’s columns index:

if len(df.columns) == 0:

Reason:

According to the Pandas Reference API, there is a distinction between:

  • an empty dataframe with 0 rows and 0 columns
  • an empty dataframe with rows containing NaN hence at least 1 column

Arguably, they are not the same. The other answers are imprecise in that df.empty, len(df), or len(df.index) make no distinction and report that the index has length 0 and empty is True in both cases.

Examples

Example 1: An empty dataframe with 0 rows and 0 columns

In [1]: import pandas as pd
        df1 = pd.DataFrame()
        df1
Out[1]: Empty DataFrame
        Columns: []
        Index: []

In [2]: len(df1.index)  # or len(df1)
Out[2]: 0

In [3]: df1.empty
Out[3]: True

Example 2: A dataframe which is emptied to 0 rows but still retains n columns

In [4]: df2 = pd.DataFrame({'AA' : [1, 2, 3], 'BB' : [11, 22, 33]})
        df2
Out[4]:    AA  BB
        0   1  11
        1   2  22
        2   3  33

In [5]: df2 = df2[df2['AA'] == 5]
        df2
Out[5]: Empty DataFrame
        Columns: [AA, BB]
        Index: []

In [6]: len(df2.index)  # or len(df2)
Out[6]: 0

In [7]: df2.empty
Out[7]: True

Now, building on the previous examples, in which the index is 0 and empty is True: reading the length of the columns index for the first loaded dataframe df1 returns 0 columns, proving that it is indeed empty.

In [8]: len(df1.columns)
Out[8]: 0

In [9]: len(df2.columns)
Out[9]: 2

Critically, while the second dataframe df2 contains no data, it is not completely empty, because it still reports the empty columns that persist.

Why it matters

Let’s add a new column to these dataframes to understand the implications:

# As expected, the empty column displays 1 series
In [10]: df1['CC'] = [111, 222, 333]
         df1
Out[10]:    CC
         0 111
         1 222
         2 333
In [11]: len(df1.columns)
Out[11]: 1

# Note the persisting series with rows containing `NaN` values in df2
In [12]: df2['CC'] = [111, 222, 333]
         df2
Out[12]:    AA  BB   CC
         0 NaN NaN  111
         1 NaN NaN  222
         2 NaN NaN  333
In [13]: len(df2.columns)
Out[13]: 3

It is evident that the original columns in df2 have re-surfaced. Therefore, it is prudent to instead read the length of the columns index with len(pandas.core.frame.DataFrame.columns) to see if a dataframe is empty.

Practical solution

# New dataframe df
In [1]: df = pd.DataFrame({'AA' : [1, 2, 3], 'BB' : [11, 22, 33]})
        df
Out[1]:    AA  BB
        0   1  11
        1   2  22
        2   3  33

# This data manipulation approach results in an empty df
# because of a subset of values that are not available (`NaN`)
In [2]: df = df[df['AA'] == 5]
        df
Out[2]: Empty DataFrame
        Columns: [AA, BB]
        Index: []

# NOTE: the df is empty, BUT the columns are persistent
In [3]: len(df.columns)
Out[3]: 2

# And accordingly, the other answers on this page
In [4]: len(df.index)  # or len(df)
Out[4]: 0

In [5]: df.empty
Out[5]: True
# SOLUTION: conditionally check for empty columns
In [6]: if len(df.columns) != 0:  # <--- here
            # Do something, e.g. 
            # drop any columns containing rows with `NaN`
            # to make the df really empty
            df = df.dropna(how='all', axis=1)
        df
Out[6]: Empty DataFrame
        Columns: []
        Index: []

# Testing shows it is indeed empty now
In [7]: len(df.columns)
Out[7]: 0

Adding a new data series works as expected without the re-surfacing of empty columns (factually, without any series that contained rows with only NaN):

In [8]: df['CC'] = [111, 222, 333]
         df
Out[8]:    CC
         0 111
         1 222
         2 333
In [9]: len(df.columns)
Out[9]: 1

Answer 4

1) If a DataFrame has got NaN and non-null values and you want to find out whether the DataFrame is empty or not, then try this code.

2) When can this situation happen? It happens when a single function is used to plot more than one DataFrame, passed as parameters. In such a situation the function tries to plot the data even when a DataFrame is empty, and thus plots an empty figure! It would make more sense to simply display a 'DataFrame has no data' message.

3) Why? If a DataFrame is empty (i.e. contains no data at all; mind you, a DataFrame with NaN values is considered non-empty), then it is desirable not to plot it but to put out a message instead. Suppose we have two DataFrames, df1 and df2. The function myfunc takes any DataFrame (df1 and df2 in this case) and prints a message if the DataFrame is empty (instead of plotting):

df1                     df2
col1 col2           col1 col2 
NaN   2              NaN  NaN 
2     NaN            NaN  NaN  

and the function:

def myfunc(df):
    if (df.count().sum()) > 0:  # count the total number of non-NaN values; equal to 0 if the DataFrame is empty
        print('not empty')
        df.plot(kind='barh')
    else:
        # display a message instead of plotting if it is empty
        print('empty')

Constructing a pandas DataFrame from values in variables gives "ValueError: If using all scalar values, you must pass an index"

Question: Constructing a pandas DataFrame from values in variables gives "ValueError: If using all scalar values, you must pass an index"

This may be a simple question, but I cannot figure out how to do this. Let’s say that I have two variables as follows.

a = 2
b = 3

I want to construct a DataFrame from this:

df2 = pd.DataFrame({'A':a,'B':b})

This generates an error:

ValueError: If using all scalar values, you must pass an index

I tried this also:

df2 = (pd.DataFrame({'a':a,'b':b})).reset_index()

This gives the same error message.


Answer 0

The error message says that if you’re passing scalar values, you have to pass an index. So you can either not use scalar values for the columns — e.g. use a list:

>>> df = pd.DataFrame({'A': [a], 'B': [b]})
>>> df
   A  B
0  2  3

or use scalar values and pass an index:

>>> df = pd.DataFrame({'A': a, 'B': b}, index=[0])
>>> df
   A  B
0  2  3

Answer 1

You can also use pd.DataFrame.from_records which is more convenient when you already have the dictionary in hand:

df = pd.DataFrame.from_records([{ 'A':a,'B':b }])

You can also set the index, if you want, by:

df = pd.DataFrame.from_records([{ 'A':a,'B':b }], index='A')

Answer 2

You need to create a pandas series first. The second step is to convert the pandas series to a pandas dataframe.

import pandas as pd
data = {'a': 1, 'b': 2}
pd.Series(data).to_frame()

You can even provide a column name.

pd.Series(data).to_frame('ColumnName')

Answer 3

You may try wrapping your dictionary into a list:

my_dict = {'A':1,'B':2}

pd.DataFrame([my_dict])

   A  B
0  1  2

Answer 4

Maybe Series would provide all the functions you need:

pd.Series({'A':a,'B':b})

A DataFrame can be thought of as a collection of Series, hence you can:

  • Concatenate multiple Series into one data frame (as described here )

  • Add a Series variable into existing data frame ( example here )
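
A minimal sketch of the first bullet (my addition), concatenating two Series into a data frame:

import pandas as pd

s1 = pd.Series({'A': 2, 'B': 3}, name='first')
s2 = pd.Series({'A': 5, 'B': 7}, name='second')

df = pd.concat([s1, s2], axis=1).T  # axis=1 makes them columns; .T flips to rows
#         A  B
# first   2  3
# second  5  7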


Answer 5

You need to provide iterables as the values for the Pandas DataFrame columns:

df2 = pd.DataFrame({'A':[a],'B':[b]})

Answer 6

I had the same problem with numpy arrays and the solution is to flatten them:

data = {
    'b': array1.flatten(),
    'a': array2.flatten(),
}

df = pd.DataFrame(data)

Answer 7

If you intend to convert a dictionary of scalars, you have to include an index:

import pandas as pd

alphabets = {'A': 'a', 'B': 'b'}
index = [0]
alphabets_df = pd.DataFrame(alphabets, index=index)
print(alphabets_df)

Although an index is not required for a dictionary of lists, the same idea extends to that case:

planets = {'planet': ['earth', 'mars', 'jupiter'], 'length_of_day': ['1', '1.03', '0.414']}
index = [0, 1, 2]
planets_df = pd.DataFrame(planets, index=index)
print(planets_df)

Of course, for the dictionary of lists, you can build the dataframe without an index:

planets_df = pd.DataFrame(planets)
print(planets_df)

Answer 8

You could try:

df2 = pd.DataFrame.from_dict({'a':a,'b':b}, orient = 'index')

From the documentation on the ‘orient’ argument: If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
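
With the question's scalars this yields the keys as rows; transpose with .T if you want them back as columns:

>>> import pandas as pd
>>> a, b = 2, 3
>>> df2 = pd.DataFrame.from_dict({'a': a, 'b': b}, orient='index')
>>> df2
   0
a  2
b  3
>>> df2.T
   a  b
0  2  3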


Answer 9

Pandas magic at work; all the usual logic goes out the window.

The error message "ValueError: If using all scalar values, you must pass an index" says you must pass an index.

This does not necessarily mean that passing an index makes pandas do what you want it to do.

When you pass an index, pandas will treat your dictionary keys as column names and the values as what the column should contain for each of the values in the index.

a = 2
b = 3
df2 = pd.DataFrame({'A':a,'B':b}, index=[1])

    A   B
1   2   3

Passing a larger index:

df2 = pd.DataFrame({'A':a,'B':b}, index=[1, 2, 3, 4])

    A   B
1   2   3
2   2   3
3   2   3
4   2   3

An index is usually generated automatically by a dataframe when none is given. However, pandas does not know how many rows of 2 and 3 you want. You can, however, be more explicit about it:

df2 = pd.DataFrame({'A':[a]*4,'B':[b]*4})
df2

    A   B
0   2   3
1   2   3
2   2   3
3   2   3

The default index is 0-based, though.

I would recommend always passing a dictionary of lists to the dataframe constructor when creating dataframes. It's easier for other developers to read. Pandas has a lot of caveats; don't make other developers have to be experts in all of them just to read your code.


Answer 10

The input does not have to be a list of records; it can be a single dictionary as well:

pd.DataFrame.from_records({'a':1,'b':2}, index=[0])
   a  b
0  1  2

Which seems to be equivalent to:

pd.DataFrame({'a':1,'b':2}, index=[0])
   a  b
0  1  2

Answer 11

This is because a DataFrame has two intuitive dimensions – the columns and the rows.

You are only specifying the columns using the dictionary keys.

If you only want to specify one dimensional data, use a Series!
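
For the question's two scalars, that looks like:

>>> import pandas as pd
>>> a, b = 2, 3
>>> pd.Series({'A': a, 'B': b})
A    2
B    3
dtype: int64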


Answer 12

Convert the dictionary to a DataFrame (col_dict below stands for your existing dictionary):

import pandas as pd

# Keys land in the 'index' column, values in 'new_col'
col_dict_df = pd.Series(col_dict).to_frame('new_col').reset_index()

Then rename the columns:

col_dict_df.columns = ['col1', 'col2']

Answer 13

If you have a dictionary d, you can turn it into a pandas data frame with the following line of code (wrapping the dict views in list() keeps it working across pandas versions):

pd.DataFrame({"key": list(d.keys()), "value": list(d.values())})

Answer 14

Just pass the dict inside a list:

a = 2
b = 3
df2 = pd.DataFrame([{'A':a,'B':b}])