标签归档:pandas

分配熊猫数据框列dtypes

问题:分配熊猫数据框列dtypes

我想在中设置dtype多列的s pd.Dataframe(我有一个文件必须手动解析为列表列表,因为该文件不适合pd.read_csv

import pandas as pd
print pd.DataFrame([['a','1'],['b','2']],
                   dtype={'x':'object','y':'int'},
                   columns=['x','y'])

我懂了

ValueError: entry not a 2- or 3- tuple

我可以设置它们的唯一方法是循环遍历每个列变量并使用进行重铸astype

dtypes = {'x':'object','y':'int'}
mydata = pd.DataFrame([['a','1'],['b','2']],
                      columns=['x','y'])
for c in mydata.columns:
    mydata[c] = mydata[c].astype(dtypes[c])
print mydata['y'].dtype   #=> int64

有没有更好的办法?

I want to set the dtypes of multiple columns in pd.Dataframe (I have a file that I’ve had to manually parse into a list of lists, as the file was not amenable for pd.read_csv)

import pandas as pd
print pd.DataFrame([['a','1'],['b','2']],
                   dtype={'x':'object','y':'int'},
                   columns=['x','y'])

I get

ValueError: entry not a 2- or 3- tuple

The only way I can set them is by looping through each column variable and recasting with astype.

dtypes = {'x':'object','y':'int'}
mydata = pd.DataFrame([['a','1'],['b','2']],
                      columns=['x','y'])
for c in mydata.columns:
    mydata[c] = mydata[c].astype(dtypes[c])
print mydata['y'].dtype   #=> int64

Is there a better way?


回答 0

从0.17开始,您必须使用显式转换:

pd.to_datetime, pd.to_timedelta and pd.to_numeric

(如下所述,convert_objects在0.17中已不再弃用“魔术” )

df = pd.DataFrame({'x': {0: 'a', 1: 'b'}, 'y': {0: '1', 1: '2'}, 'z': {0: '2018-05-01', 1: '2018-05-02'}})

df.dtypes

x    object
y    object
z    object
dtype: object

df

   x  y           z
0  a  1  2018-05-01
1  b  2  2018-05-02

您可以将它们应用于要转换的每一列:

df["y"] = pd.to_numeric(df["y"])
df["z"] = pd.to_datetime(df["z"])    
df

   x  y          z
0  a  1 2018-05-01
1  b  2 2018-05-02

df.dtypes

x            object
y             int64
z    datetime64[ns]
dtype: object

并确认dtype已更新。


大熊猫0.12-0.16的旧/建议答案:您可以convert_objects用来推断更好的dtypes:

In [21]: df
Out[21]: 
   x  y
0  a  1
1  b  2

In [22]: df.dtypes
Out[22]: 
x    object
y    object
dtype: object

In [23]: df.convert_objects(convert_numeric=True)
Out[23]: 
   x  y
0  a  1
1  b  2

In [24]: df.convert_objects(convert_numeric=True).dtypes
Out[24]: 
x    object
y     int64
dtype: object

魔法!(遗憾地看到它过时了。)

Since 0.17, you have to use the explicit conversions:

pd.to_datetime, pd.to_timedelta and pd.to_numeric

(As mentioned below, no more “magic”, convert_objects has been deprecated in 0.17)

df = pd.DataFrame({'x': {0: 'a', 1: 'b'}, 'y': {0: '1', 1: '2'}, 'z': {0: '2018-05-01', 1: '2018-05-02'}})

df.dtypes

x    object
y    object
z    object
dtype: object

df

   x  y           z
0  a  1  2018-05-01
1  b  2  2018-05-02

You can apply these to each column you want to convert:

df["y"] = pd.to_numeric(df["y"])
df["z"] = pd.to_datetime(df["z"])    
df

   x  y          z
0  a  1 2018-05-01
1  b  2 2018-05-02

df.dtypes

x            object
y             int64
z    datetime64[ns]
dtype: object

and confirm the dtype is updated.


OLD/DEPRECATED ANSWER for pandas 0.12 – 0.16: You can use convert_objects to infer better dtypes:

In [21]: df
Out[21]: 
   x  y
0  a  1
1  b  2

In [22]: df.dtypes
Out[22]: 
x    object
y    object
dtype: object

In [23]: df.convert_objects(convert_numeric=True)
Out[23]: 
   x  y
0  a  1
1  b  2

In [24]: df.convert_objects(convert_numeric=True).dtypes
Out[24]: 
x    object
y     int64
dtype: object

Magic! (Sad to see it deprecated.)


回答 1

对于那些来自Google(例如我)的人:

convert_objects 从0.17开始不推荐使用-如果您使用它,则会收到类似以下的警告:

FutureWarning: convert_objects is deprecated.  Use the data-type specific converters 
pd.to_datetime, pd.to_timedelta and pd.to_numeric.

您应该执行以下操作:

For those coming from Google (etc.) such as myself:

convert_objects has been deprecated since 0.17 – if you use it, you get a warning like this one:

FutureWarning: convert_objects is deprecated.  Use the data-type specific converters 
pd.to_datetime, pd.to_timedelta and pd.to_numeric.

You should do something like the following:


回答 2

您可以使用pandas显式设置类型,DataFrame.astype(dtype, copy=True, raise_on_error=True, **kwargs)并使用想要的dtypes传递字典dtype

这是一个例子:

import pandas as pd
wheel_number = 5
car_name = 'jeep'
minutes_spent = 4.5

# set the columns
data_columns = ['wheel_number', 'car_name', 'minutes_spent']

# create an empty dataframe
data_df = pd.DataFrame(columns = data_columns)
df_temp = pd.DataFrame([[wheel_number, car_name, minutes_spent]],columns = data_columns)
data_df = data_df.append(df_temp, ignore_index=True) 

In [11]: data_df.dtypes
Out[11]:
wheel_number     float64
car_name          object
minutes_spent    float64
dtype: object

data_df = data_df.astype(dtype= {"wheel_number":"int64",
        "car_name":"object","minutes_spent":"float64"})

现在您可以看到它已更改

In [18]: data_df.dtypes
Out[18]:
wheel_number       int64
car_name          object
minutes_spent    float64

you can set the types explicitly with pandas DataFrame.astype(dtype, copy=True, raise_on_error=True, **kwargs) and pass in a dictionary with the dtypes you want to dtype

here’s an example:

import pandas as pd
wheel_number = 5
car_name = 'jeep'
minutes_spent = 4.5

# set the columns
data_columns = ['wheel_number', 'car_name', 'minutes_spent']

# create an empty dataframe
data_df = pd.DataFrame(columns = data_columns)
df_temp = pd.DataFrame([[wheel_number, car_name, minutes_spent]],columns = data_columns)
data_df = data_df.append(df_temp, ignore_index=True) 

In [11]: data_df.dtypes
Out[11]:
wheel_number     float64
car_name          object
minutes_spent    float64
dtype: object

data_df = data_df.astype(dtype= {"wheel_number":"int64",
        "car_name":"object","minutes_spent":"float64"})

now you can see that it’s changed

In [18]: data_df.dtypes
Out[18]:
wheel_number       int64
car_name          object
minutes_spent    float64

回答 3

设置列类型的另一种方法是,首先使用所需的类型构造一个numpy记录数组,将其填充,然后将其传递给DataFrame构造函数。

import pandas as pd
import numpy as np    

x = np.empty((10,), dtype=[('x', np.uint8), ('y', np.float64)])
df = pd.DataFrame(x)

df.dtypes ->

x      uint8
y    float64

Another way to set the column types is to first construct a numpy record array with your desired types, fill it out and then pass it to a DataFrame constructor.

import pandas as pd
import numpy as np    

x = np.empty((10,), dtype=[('x', np.uint8), ('y', np.float64)])
df = pd.DataFrame(x)

df.dtypes ->

x      uint8
y    float64

回答 4

面临类似的问题。在我的情况下,我需要手动解析来自Cisco日志的1000个文件。

为了灵活处理字段和类型,我已经使用StringIO + read_cvs成功地进行了测试,它确实接受dtype规范的要求。

我通常将每个文件(5k-20k行)放入缓冲区并动态创建dtype字典。

最终,我将这些数据帧连接(通过分类…感谢0.19)到一个大数据帧中,然后将其转储到hdf5中。

这些东西

import pandas as pd
import io 

output = io.StringIO()
output.write('A,1,20,31\n')
output.write('B,2,21,32\n')
output.write('C,3,22,33\n')
output.write('D,4,23,34\n')

output.seek(0)


df=pd.read_csv(output, header=None,
        names=["A","B","C","D"],
        dtype={"A":"category","B":"float32","C":"int32","D":"float64"},
        sep=","
       )

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
A    5 non-null category
B    5 non-null float32
C    5 non-null int32
D    5 non-null float64
dtypes: category(1), float32(1), float64(1), int32(1)
memory usage: 205.0 bytes
None

不是很pythonic ….但是能胜任

希望能帮助到你。

杰西

facing similar problem to you. In my case I have 1000’s of files from cisco logs that I need to parse manually.

In order to be flexible with fields and types I have successfully tested using StringIO + read_cvs which indeed does accept a dict for the dtype specification.

I usually get each of the files ( 5k-20k lines) into a buffer and create the dtype dictionaries dynamically.

Eventually I concatenate ( with categorical… thanks to 0.19) these dataframes into a large data frame that I dump into hdf5.

Something along these lines

import pandas as pd
import io 

output = io.StringIO()
output.write('A,1,20,31\n')
output.write('B,2,21,32\n')
output.write('C,3,22,33\n')
output.write('D,4,23,34\n')

output.seek(0)


df=pd.read_csv(output, header=None,
        names=["A","B","C","D"],
        dtype={"A":"category","B":"float32","C":"int32","D":"float64"},
        sep=","
       )

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
A    5 non-null category
B    5 non-null float32
C    5 non-null int32
D    5 non-null float64
dtypes: category(1), float32(1), float64(1), int32(1)
memory usage: 205.0 bytes
None

Not very pythonic…. but does the job

Hope it helps.

JC


回答 5

最好使用键入的np.arrays,然后将数据和列名作为字典传递。

import numpy as np
import pandas as pd
# Feature: np arrays are 1: efficient, 2: can be pre-sized
x = np.array(['a', 'b'], dtype=object)
y = np.array([ 1 ,  2 ], dtype=np.int32)
df = pd.DataFrame({
   'x' : x,    # Feature: column name is near data array
   'y' : y,
   }
 )

You’re better off using typed np.arrays, and then pass the data and column names as a dictionary.

import numpy as np
import pandas as pd
# Feature: np arrays are 1: efficient, 2: can be pre-sized
x = np.array(['a', 'b'], dtype=object)
y = np.array([ 1 ,  2 ], dtype=np.int32)
df = pd.DataFrame({
   'x' : x,    # Feature: column name is near data array
   'y' : y,
   }
 )

我何时应该在代码中使用pandas apply()?

问题:我何时应该在代码中使用pandas apply()?

我已经看到许多有关使用Pandas方法的堆栈溢出问题的答案apply。我还看到用户在他们的下面发表评论,说“ apply缓慢,应避免使用”。

我已经阅读了许多有关性能的文章,这些文章解释apply得很慢。我还在文档中看到了关于免除apply传递UDF的便捷功能的免责声明(现在似乎找不到)。因此,普遍的共识是,apply应尽可能避免。但是,这引起了以下问题:

  1. 如果apply太糟糕了,那为什么在API中呢?
  2. 我应该如何以及何时使代码apply免费?
  3. 在任何情况下apply都有良好的情况(比其他可能的解决方案更好)吗?

I have seen many answers posted to questions on Stack Overflow involving the use of the Pandas method apply. I have also seen users commenting under them saying that “apply is slow, and should be avoided”.

I have read many articles on the topic of performance that explain apply is slow. I have also seen a disclaimer in the docs about how apply is simply a convenience function for passing UDFs (can’t seem to find that now). So, the general consensus is that apply should be avoided if possible. However, this raises the following questions:

  1. If apply is so bad, then why is it in the API?
  2. How and when should I make my code apply-free?
  3. Are there ever any situations where apply is good (better than other possible solutions)?

回答 0

apply,您不需要的便利功能

我们首先在OP中逐一解决问题。

如果应用是如此糟糕,那么为什么要在API中使用它呢?

DataFrame.applySeries.apply是分别在DataFrame和Series对象上定义的便捷函数apply接受任何在DataFrame上应用转换/聚合的用户定义函数。apply实际上是完成任何现有熊猫功能无法完成的灵丹妙药。

一些事情apply可以做:

  • 在DataFrame或Series上运行任何用户定义的函数
  • 在DataFrame上按行(axis=1)或按列()应用函数axis=0
  • 应用功能时执行索引对齐
  • 使用用户定义的函数执行汇总(但是,我们通常更喜欢aggtransform在这种情况下)
  • 执行逐元素转换
  • 将汇总结果广播到原始行(请参阅result_type参数)。
  • 接受位置/关键字参数以传递给用户定义的函数。

…其他 有关更多信息,请参见文档中的行或列函数应用程序

那么,具有所有这些功能,为什么apply不好?这是因为apply 缓慢的。Pandas对功能的性质不做任何假设,因此在必要时将您的功能迭代地应用于每个行/列。此外,处理上述所有情况均意味着apply每次迭代都会产生一些重大开销。此外,apply会消耗更多的内存,这对于内存受限的应用程序是一个挑战。

在极少数情况下,apply适合使用(以下更多内容)。如果不确定是否应该使用apply,则可能不应该使用。


让我们解决下一个问题。

如何当我应该让我的代码申请-免费?

重新说明一下,这是一些常见的情况,在这些情况下您将希望摆脱对的任何调用apply

数值数据

如果您正在使用数字数据,则可能已经有一个矢量化的cython函数可以完全实现您要执行的操作(如果没有,请在Stack Overflow上提问或在GitHub上打开功能请求)。

对比一下apply简单加法运算的性能。

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df

   A   B
0  9  12
1  4   7
2  2   5
3  1   4

df.apply(np.sum)

A    16
B    28
dtype: int64

df.sum()

A    16
B    28
dtype: int64

在性能方面,没有任何可比的,被cythonized的等效物要快得多。不需要图表,因为即使对于玩具数据,差异也很明显。

%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 µs ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

即使您启用带有raw参数的原始数组传递,它的速度仍然是原来的两倍。

%timeit df.apply(np.sum, raw=True)
840 µs ± 691 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

另一个例子:

df.apply(lambda x: x.max() - x.min())

A    8
B    8
dtype: int64

df.max() - df.min()

A    8
B    8
dtype: int64

%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()

2.43 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

通常,如果可能寻找向量化的替代方案。

字符串/正则表达式

在大多数情况下,Pandas提供“矢量化”字符串函数,但是在极少数情况下,这些函数不会…“应用”,可以这么说。

一个常见的问题是检查同一行的另一列中是否存在一列中的值。

df = pd.DataFrame({
    'Name': ['mickey', 'donald', 'minnie'],
    'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse'],
    'Value': [20, 10, 86]})
df

     Name  Value                       Title
0  mickey     20                  wonderland
1  donald     10  welcome to donald's castle
2  minnie     86      Minnie mouse clubhouse

这应该返回第二行和第三行,因为“唐纳德”和“米妮”出现在它们各自的“标题”列中。

使用apply,这将使用

df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)

0    False
1     True
2     True
dtype: bool

df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

但是,使用列表推导存在更好的解决方案。

df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

这里要注意的是apply,由于开销较低,因此迭代例程的运行速度比快。如果您需要处理NaN和无效的dtype,则可以使用自定义函数在此基础上进行构建,然后再使用列表推导中的参数进行调用。

有关何时应该将列表理解视为一个不错的选择的更多信息,请参见我的文章:对于熊猫循环-我何时应该关心?

注意
日期和日期时间操作也具有矢量化版本。因此,例如,您应该更喜欢pd.to_datetime(df['date'])df['date'].apply(pd.to_datetime)

docs上阅读更多内容 。

一个常见的陷阱:列表的爆炸列

s = pd.Series([[1, 2]] * 3)
s

0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object

人们很想使用apply(pd.Series)。就性能而言,这太可怕了。

s.apply(pd.Series)

   0  1
0  1  2
1  1  2
2  1  2

更好的选择是列出该列并将其传递给pd.DataFrame。

pd.DataFrame(s.tolist())

   0  1
0  1  2
1  1  2
2  1  2

%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())

2.65 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 µs ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

最后,

有什么情况 apply 是好的吗?

Apply是一项便利功能,因此在某些情况下开销可以忽略不计,可以原谅。它实际上取决于函数被调用多少次。

为系列矢量化的函数,但不是数据帧的函数
如果要对多列应用字符串操作该怎么办?如果要将多列转换为日期时间怎么办?这些函数仅针对系列进行矢量化处理,因此必须将它们应用于要转换/操作的每一列。

df = pd.DataFrame(
         pd.date_range('2018-12-31','2019-01-31', freq='2D').date.astype(str).reshape(-1, 2), 
         columns=['date1', 'date2'])
df

       date1      date2
0 2018-12-31 2019-01-02
1 2019-01-04 2019-01-06
2 2019-01-08 2019-01-10
3 2019-01-12 2019-01-14
4 2019-01-16 2019-01-18
5 2019-01-20 2019-01-22
6 2019-01-24 2019-01-26
7 2019-01-28 2019-01-30

df.dtypes

date1    object
date2    object
dtype: object

这是以下情况的可接受案例apply

df.apply(pd.to_datetime, errors='coerce').dtypes

date1    datetime64[ns]
date2    datetime64[ns]
dtype: object

请注意,这对于stack还是有意义的,或者仅使用显式循环。所有这些选项都比使用稍微快一点apply,但是差异很小,可以原谅。

%timeit df.apply(pd.to_datetime, errors='coerce')
%timeit pd.to_datetime(df.stack(), errors='coerce').unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors='coerce') for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce')

5.49 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

您可以对其他操作(例如字符串操作或转换为类别)进行类似的设置。

u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype(category))

伏/秒

u = pd.concat([df[c].str.contains(...) for c in df], axis=1)
v = df.copy()
for c in df:
    v[c] = df[c].astype(category)

等等…

将Series转换为strastypevsapply

这似乎是API的特质。与使用相比,apply用于将Series中的整数转换为字符串的方法具有可比性(有时更快)astype

使用该perfplot库绘制该图。

import perfplot

perfplot.show(
    setup=lambda n: pd.Series(np.random.randint(0, n, n)),
    kernels=[
        lambda s: s.astype(str),
        lambda s: s.apply(str)
    ],
    labels=['astype', 'apply'],
    n_range=[2**k for k in range(1, 20)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=lambda x, y: (x == y).all())

使用浮点数时,我看到的astype速度始终与一样快,或略快于apply。因此,这与测试中的数据是整数类型有关。

GroupBy 链式转换操作

GroupBy.apply到目前为止尚未进行讨论,但是GroupBy.apply它也是一个迭代便利函数,用于处理现有GroupBy函数未处理的任何事情。

一个常见的要求是执行GroupBy,然后执行两个主要操作,例如“滞后的累积量”:

df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df

   A   B
0  a  12
1  a   7
2  b   5
3  c   4
4  c   5
5  c   4
6  d   3
7  d   2
8  e   1
9  e  10

您需要在此处进行两个连续的groupby调用:

df.groupby('A').B.cumsum().groupby(df.A).shift()

0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

使用apply,您可以将其缩短为一个电话。

df.groupby('A').B.apply(lambda x: x.cumsum().shift())

0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

量化性能非常困难,因为它取决于数据。但是总的来说,apply如果目标是减少groupby通话,这是一个可以接受的解决方案(因为groupby它也很昂贵)。


其他注意事项

除了上述注意事项外,还值得一提的是apply在第一行(或列)上执行两次。这样做是为了确定该功能是否有任何副作用。如果不是,则apply可能能够使用快速路径来评估结果,否则将退回到缓慢的实施方式。

df = pd.DataFrame({
    'A': [1, 2],
    'B': ['x', 'y']
})

def func(x):
    print(x['A'])
    return x

df.apply(func, axis=1)

# 1
# 1
# 2
   A  B
0  1  x
1  2  y

GroupBy.apply小于0.25的熊猫版本中也可以看到此行为(已固定为0.25,有关更多信息请参见此处。)

apply, the Convenience Function you Never Needed

We start by addressing the questions in the OP, one by one.

“If apply is so bad, then why is it in the API?”

DataFrame.apply and Series.apply are convenience functions defined on DataFrame and Series object respectively. apply accepts any user defined function that applies a transformation/aggregation on a DataFrame. apply is effectively a silver bullet that does whatever any existing pandas function cannot do.

Some of the things apply can do:

  • Run any user-defined function on a DataFrame or Series
  • Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
  • Perform index alignment while applying the function
  • Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
  • Perform element-wise transformations
  • Broadcast aggregated results to original rows (see the result_type argument).
  • Accept positional/keyword arguments to pass to the user-defined functions.

…Among others. For more information, see Row or Column-wise Function Application in the documentation.

So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, and so iteratively applies your function to each row/column as necessary. Additionally, handling all of the situations above means apply incurs some major overhead at each iteration. Further, apply consumes a lot more memory, which is a challenge for memory bounded applications.

There are very few situations where apply is appropriate to use (more on that below). If you’re not sure whether you should be using apply, you probably shouldn’t.



Let’s address the next question.

“How and when should I make my code apply-free?”

To rephrase, here are some common situations where you will want to get rid of any calls to apply.

Numeric Data

If you’re working with numeric data, there is likely already a vectorized cython function that does exactly what you’re trying to do (if not, please either ask a question on Stack Overflow or open a feature request on GitHub).

Contrast the performance of apply for a simple addition operation.

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df

   A   B
0  9  12
1  4   7
2  2   5
3  1   4

<!- ->

df.apply(np.sum)

A    16
B    28
dtype: int64

df.sum()

A    16
B    28
dtype: int64

Performance wise, there’s no comparison, the cythonized equivalent is much faster. There’s no need for a graph, because the difference is obvious even for toy data.

%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 µs ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Even if you enable passing raw arrays with the raw argument, it’s still twice as slow.

%timeit df.apply(np.sum, raw=True)
840 µs ± 691 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Another example:

df.apply(lambda x: x.max() - x.min())

A    8
B    8
dtype: int64

df.max() - df.min()

A    8
B    8
dtype: int64

%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()

2.43 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In general, seek out vectorized alternatives if possible.


String/Regex

Pandas provides “vectorized” string functions in most situations, but there are rare cases where those functions do not… “apply”, so to speak.

A common problem is to check whether a value in a column is present in another column of the same row.

df = pd.DataFrame({
    'Name': ['mickey', 'donald', 'minnie'],
    'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse'],
    'Value': [20, 10, 86]})
df

     Name  Value                       Title
0  mickey     20                  wonderland
1  donald     10  welcome to donald's castle
2  minnie     86      Minnie mouse clubhouse

This should return the row second and third row, since “donald” and “minnie” are present in their respective “Title” columns.

Using apply, this would be done using

df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)

0    False
1     True
2     True
dtype: bool
 
df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

However, a better solution exists using list comprehensions.

df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

<!- ->

%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The thing to note here is that iterative routines happen to be faster than apply, because of the lower overhead. If you need to handle NaNs and invalid dtypes, you can build on this using a custom function you can then call with arguments inside the list comprehension.

For more information on when list comprehensions should be considered a good option, see my writeup: Are for-loops in pandas really bad? When should I care?.

Note
Date and datetime operations also have vectorized versions. So, for example, you should prefer pd.to_datetime(df['date']), over, say, df['date'].apply(pd.to_datetime).

Read more at the docs.


A Common Pitfall: Exploding Columns of Lists

s = pd.Series([[1, 2]] * 3)
s

0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object

People are tempted to use apply(pd.Series). This is horrible in terms of performance.

s.apply(pd.Series)

   0  1
0  1  2
1  1  2
2  1  2

A better option is to listify the column and pass it to pd.DataFrame.

pd.DataFrame(s.tolist())

   0  1
0  1  2
1  1  2
2  1  2

<!- ->

%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())

2.65 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 µs ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Lastly,

“Are there any situations where apply is good?”

Apply is a convenience function, so there are situations where the overhead is negligible enough to forgive. It really depends on how many times the function is called.

Functions that are Vectorized for Series, but not DataFrames
What if you want to apply a string operation on multiple columns? What if you want to convert multiple columns to datetime? These functions are vectorized for Series only, so they must be applied over each column that you want to convert/operate on.

df = pd.DataFrame(
         pd.date_range('2018-12-31','2019-01-31', freq='2D').date.astype(str).reshape(-1, 2), 
         columns=['date1', 'date2'])
df

       date1      date2
0 2018-12-31 2019-01-02
1 2019-01-04 2019-01-06
2 2019-01-08 2019-01-10
3 2019-01-12 2019-01-14
4 2019-01-16 2019-01-18
5 2019-01-20 2019-01-22
6 2019-01-24 2019-01-26
7 2019-01-28 2019-01-30

df.dtypes

date1    object
date2    object
dtype: object
    

This is an admissible case for apply:

df.apply(pd.to_datetime, errors='coerce').dtypes

date1    datetime64[ns]
date2    datetime64[ns]
dtype: object

Note that it would also make sense to stack, or just use an explicit loop. All these options are slightly faster than using apply, but the difference is small enough to forgive.

%timeit df.apply(pd.to_datetime, errors='coerce')
%timeit pd.to_datetime(df.stack(), errors='coerce').unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors='coerce') for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce')

5.49 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You can make a similar case for other operations such as string operations, or conversion to category.

u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype(category))

v/s

u = pd.concat([df[c].str.contains(...) for c in df], axis=1)
v = df.copy()
for c in df:
    v[c] = df[c].astype(category)

And so on…


Converting Series to str: astype versus apply

This seems like an idiosyncrasy of the API. Using apply to convert integers in a Series to string is comparable (and sometimes faster) than using astype.

The graph was plotted using the perfplot library.

import perfplot

perfplot.show(
    setup=lambda n: pd.Series(np.random.randint(0, n, n)),
    kernels=[
        lambda s: s.astype(str),
        lambda s: s.apply(str)
    ],
    labels=['astype', 'apply'],
    n_range=[2**k for k in range(1, 20)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=lambda x, y: (x == y).all())

With floats, I see the astype is consistently as fast as, or slightly faster than apply. So this has to do with the fact that the data in the test is integer type.


GroupBy operations with chained transformations

GroupBy.apply has not been discussed until now, but GroupBy.apply is also an iterative convenience function to handle anything that the existing GroupBy functions do not.

One common requirement is to perform a GroupBy and then two prime operations such as a “lagged cumsum”:

df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df

   A   B
0  a  12
1  a   7
2  b   5
3  c   4
4  c   5
5  c   4
6  d   3
7  d   2
8  e   1
9  e  10

<!- ->

You’d need two successive groupby calls here:

df.groupby('A').B.cumsum().groupby(df.A).shift()
 
0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

Using apply, you can shorten this to a a single call.

df.groupby('A').B.apply(lambda x: x.cumsum().shift())

0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

It is very hard to quantify the performance because it depends on the data. But in general, apply is an acceptable solution if the goal is to reduce a groupby call (because groupby is also quite expensive).



Other Caveats

Aside from the caveats mentioned above, it is also worth mentioning that apply operates on the first row (or column) twice. This is done to determine whether the function has any side effects. If not, apply may be able to use a fast-path for evaluating the result, else it falls back to a slow implementation.

df = pd.DataFrame({
    'A': [1, 2],
    'B': ['x', 'y']
})

def func(x):
    print(x['A'])
    return x

df.apply(func, axis=1)

# 1
# 1
# 2
   A  B
0  1  x
1  2  y

This behaviour is also seen in GroupBy.apply on pandas versions <0.25 (it was fixed for 0.25, see here for more information.)


回答 1

并非所有人apply都一样

下图建议何时考虑apply1。绿色意味着高效。红色避免。

其中一些是直观的:pd.Series.apply是Python级的逐行循环,同上是pd.DataFrame.apply逐行(axis=1)。这些的滥用是广泛的。另一篇文章更深入地探讨了它们。流行的解决方案是使用矢量化方法,列表推导(假定数据干净)或有效的工具pd.DataFrame(例如构造函数)(例如避免使用apply(pd.Series))。

如果使用pd.DataFrame.apply逐行方式,则指定raw=True(如果可能)通常是有益的。在这个阶段,numba通常是一个更好的选择。

GroupBy.apply:普遍偏爱

groupby避免重复操作apply会损害性能。GroupBy.apply只要您在自定义函数中使用的方法本身是矢量化的,通常在这里就可以了。有时,没有适用于希望应用的逐组聚合的本地Pandas方法。在这种情况下,对于少数apply具有自定义功能的组可能仍会提供合理的性能。

pd.DataFrame.apply 专栏式:混合袋

pd.DataFrame.apply按列(axis=0)是一个有趣的情况。对于少量的行而不是大量的列,几乎总是很昂贵的。对于相对于列的大量行(更常见的情况),使用以下命令有时可能看到显着的性能改进apply

# Python 3.7, Pandas 0.23.4
np.random.seed(0)
df = pd.DataFrame(np.random.random((10**7, 3)))     # Scenario_1, many rows
df = pd.DataFrame(np.random.random((10**4, 10**3))) # Scenario_2, many columns

                                               # Scenario_1  | Scenario_2
%timeit df.sum()                               # 800 ms      | 109 ms
%timeit df.apply(pd.Series.sum)                # 568 ms      | 325 ms

%timeit df.max() - df.min()                    # 1.63 s      | 314 ms
%timeit df.apply(lambda x: x.max() - x.min())  # 838 ms      | 473 ms

%timeit df.mean()                              # 108 ms      | 94.4 ms
%timeit df.apply(pd.Series.mean)               # 276 ms      | 233 ms

1有exceptions,但通常很少或很少。几个例子:

  1. df['col'].apply(str)可能略胜一筹df['col'].astype(str)
  2. df.apply(pd.to_datetime)与常规for循环相比,对字符串进行处理无法很好地适应行缩放。

Not all applys are alike

The below chart suggests when to consider apply1. Green means possibly efficient; red avoid.

Some of this is intuitive: pd.Series.apply is a Python-level row-wise loop, ditto pd.DataFrame.apply row-wise (axis=1). The misuses of these are many and wide-ranging. The other post deals with them in more depth. Popular solutions are to use vectorised methods, list comprehensions (assumes clean data), or efficient tools such as the pd.DataFrame constructor (e.g. to avoid apply(pd.Series)).

If you are using pd.DataFrame.apply row-wise, specifying raw=True (where possible) is often beneficial. At this stage, numba is usually a better choice.

GroupBy.apply: generally favoured

Repeating groupby operations to avoid apply will hurt performance. GroupBy.apply is usually fine here, provided the methods you use in your custom function are themselves vectorised. Sometimes there is no native Pandas method for a groupwise aggregation you wish to apply. In this case, for a small number of groups apply with a custom function may still offer reasonable performance.

pd.DataFrame.apply column-wise: a mixed bag

pd.DataFrame.apply column-wise (axis=0) is an interesting case. For a small number of rows versus a large number of columns, it’s almost always expensive. For a large number of rows relative to columns, the more common case, you may sometimes see significant performance improvements using apply:

# Python 3.7, Pandas 0.23.4
np.random.seed(0)
df = pd.DataFrame(np.random.random((10**7, 3)))     # Scenario_1, many rows
df = pd.DataFrame(np.random.random((10**4, 10**3))) # Scenario_2, many columns

                                               # Scenario_1  | Scenario_2
%timeit df.sum()                               # 800 ms      | 109 ms
%timeit df.apply(pd.Series.sum)                # 568 ms      | 325 ms

%timeit df.max() - df.min()                    # 1.63 s      | 314 ms
%timeit df.apply(lambda x: x.max() - x.min())  # 838 ms      | 473 ms

%timeit df.mean()                              # 108 ms      | 94.4 ms
%timeit df.apply(pd.Series.mean)               # 276 ms      | 233 ms

1 There are exceptions, but these are usually marginal or uncommon. A couple of examples:

  1. df['col'].apply(str) may slightly outperform df['col'].astype(str).
  2. df.apply(pd.to_datetime) working on strings doesn’t scale well with rows versus a regular for loop.

回答 2

对于axis=1(即按行函数),则可以使用以下函数代替apply。我想知道为什么这不是pandas行为。(未经复合索引测试,但确实比快得多apply

def faster_df_apply(df, func):
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        row_dict = {f:v for f,v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)

For axis=1 (i.e. row-wise functions) then you can just use the following function in lieu of apply. I wonder why this isn’t the pandas behavior. (Untested with compound indexes, but it does appear to be much faster than apply)

def faster_df_apply(df, func):
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        row_dict = {f:v for f,v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)

回答 3

有没有什么情况apply是好的?是的,有时。

任务:解码Unicode字符串。

import numpy as np
import pandas as pd
import unidecode

s = pd.Series(['mañana','Ceñía'])
s.head()
0    mañana
1     Ceñía


s.apply(unidecode.unidecode)
0    manana
1     Cenia

更新
我绝不是提倡使用apply,只是考虑到NumPy无法解决上述情况,因此它可能是一个很好的选择pandas apply。但是由于@jpp的提醒,我忘记了普通的ol列表理解。

Are there ever any situations where apply is good? Yes, sometimes.

Task: decode Unicode strings.

import numpy as np
import pandas as pd
import unidecode

s = pd.Series(['mañana','Ceñía'])
s.head()
0    mañana
1     Ceñía


s.apply(unidecode.unidecode)
0    manana
1     Cenia

Update
I was by no means advocating for the use of apply, just thinking since the NumPy cannot deal with the above situation, it could have been a good candidate for pandas apply. But I was forgetting the plain ol list comprehension thanks to the reminder by @jpp.


熊猫read_xml()方法测试策略

问题:熊猫read_xml()方法测试策略

当前,pandas I / O工具没有维护read_xml()方法,而相应的工具to_xml()。但是,read_json证明可以为数据帧导入和read_html标记格式实现树状结构。

如果大熊猫团队会考虑这样一个read_xml为未来大熊猫版本的方法,他们会追求什么实现:使用内置的解析xml.etree.ElementTreeiterfind()iterparse()功能或第三方模块,lxml其XPath 1.0和XSLT 1.0的方法呢?

以下是我在简单,扁平,以元素为中心的XML输入上针对四种方法类型的测试运行。所有这些都针对root的任何第二级子级进行了通用解析,并且每种方法都应产生完全相同的pandas数据帧。除最后一次调用外pd.Dataframe(),所有其他功能都在词典列表中。XSLT方法将XML转换为CSV,以便StringIO()在中进行转换pd.read_csv()

问题 (多部分)

  • 性能:您如何解释由于iterparse迭代解析文件而通常建议对较大文件使用的速度较慢的速度?部分原因是由于if逻辑检查吗?

  • 内存:CPU内存是否与I / O调用中的时间相关?XSLT和XPath 1.0在较大的XML文档中往往无法很好地扩展,因为必须在内存中读取整个文件才能进行解析。

  • 策略:词典列表是Dataframe()呼叫的最佳策略吗?请参阅以下有趣的答案:生成器版本和iterwalk用户定义版本。两个上载列表到数据帧。

输入数据(Stack Overflow当前的年度最大用户,其中包括我们的熊猫朋友)

<?xml version="1.0" encoding="utf-8"?>
<stackoverflow>
  <topusers>
    <user>Gordon Linoff</user>
    <link>http://www.stackoverflow.com//users/1144035/gordon-linoff</link>
    <location>New York, United States</location>
    <year_rep>5,985</year_rep>
    <total_rep>499,408</total_rep>
    <tag1>sql</tag1>
    <tag2>sql-server</tag2>
    <tag3>mysql</tag3>
  </topusers>
  <topusers>
    <user>Günter Zöchbauer</user>
    <link>http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer</link>
    <location>Linz, Austria</location>
    <year_rep>5,835</year_rep>
    <total_rep>154,439</total_rep>
    <tag1>angular2</tag1>
    <tag2>typescript</tag2>
    <tag3>javascript</tag3>
  </topusers>
  <topusers>
    <user>jezrael</user>
    <link>http://www.stackoverflow.com//users/2901002/jezrael</link>
    <location>Bratislava, Slovakia</location>
    <year_rep>5,740</year_rep>
    <total_rep>83,237</total_rep>
    <tag1>pandas</tag1>
    <tag2>python</tag2>
    <tag3>dataframe</tag3>
  </topusers>
  <topusers>
    <user>VonC</user>
    <link>http://www.stackoverflow.com//users/6309/vonc</link>
    <location>France</location>
    <year_rep>5,577</year_rep>
    <total_rep>651,397</total_rep>
    <tag1>git</tag1>
    <tag2>github</tag2>
    <tag3>docker</tag3>
  </topusers>
  <topusers>
    <user>Martijn Pieters</user>
    <link>http://www.stackoverflow.com//users/100297/martijn-pieters</link>
    <location>Cambridge, United Kingdom</location>
    <year_rep>5,337</year_rep>
    <total_rep>525,176</total_rep>
    <tag1>python</tag1>
    <tag2>python-3.x</tag2>
    <tag3>python-2.7</tag3>
  </topusers>
  <topusers>
    <user>T.J. Crowder</user>
    <link>http://www.stackoverflow.com//users/157247/t-j-crowder</link>
    <location>United Kingdom</location>
    <year_rep>5,258</year_rep>
    <total_rep>508,310</total_rep>
    <tag1>javascript</tag1>
    <tag2>jquery</tag2>
    <tag3>java</tag3>
  </topusers>
  <topusers>
    <user>akrun</user>
    <link>http://www.stackoverflow.com//users/3732271/akrun</link>
    <location></location>
    <year_rep>5,188</year_rep>
    <total_rep>229,553</total_rep>
    <tag1>r</tag1>
    <tag2>dplyr</tag2>
    <tag3>dataframe</tag3>
  </topusers>
  <topusers>
    <user>Wiktor Stribi?ew</user>
    <link>http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew</link>
    <location>Warsaw, Poland</location>
    <year_rep>4,948</year_rep>
    <total_rep>158,134</total_rep>
    <tag1>regex</tag1>
    <tag2>javascript</tag2>
    <tag3>c#</tag3>
  </topusers>
  <topusers>
    <user>Darin Dimitrov</user>
    <link>http://www.stackoverflow.com//users/29407/darin-dimitrov</link>
    <location>Sofia, Bulgaria</location>
    <year_rep>4,936</year_rep>
    <total_rep>709,683</total_rep>
    <tag1>c#</tag1>
    <tag2>asp.net-mvc</tag2>
    <tag3>asp.net-mvc-3</tag3>
  </topusers>
  <topusers>
    <user>Eric Duminil</user>
    <link>http://www.stackoverflow.com//users/6419007/eric-duminil</link>
    <location></location>
    <year_rep>4,854</year_rep>
    <total_rep>12,557</total_rep>
    <tag1>ruby</tag1>
    <tag2>ruby-on-rails</tag2>
    <tag3>arrays</tag3>
  </topusers>
  <topusers>
    <user>alecxe</user>
    <link>http://www.stackoverflow.com//users/771848/alecxe</link>
    <location>New York, United States</location>
    <year_rep>4,723</year_rep>
    <total_rep>233,368</total_rep>
    <tag1>python</tag1>
    <tag2>selenium</tag2>
    <tag3>protractor</tag3>
  </topusers>
  <topusers>
    <user>Jean-François Fabre</user>
    <link>http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre</link>
    <location>Toulouse, France</location>
    <year_rep>4,526</year_rep>
    <total_rep>30,027</total_rep>
    <tag1>python</tag1>
    <tag2>python-3.x</tag2>
    <tag3>python-2.7</tag3>
  </topusers>
  <topusers>
    <user>piRSquared</user>
    <link>http://www.stackoverflow.com//users/2336654/pirsquared</link>
    <location>Bellevue, WA, United States</location>
    <year_rep>4,482</year_rep>
    <total_rep>41,183</total_rep>
    <tag1>pandas</tag1>
    <tag2>python</tag2>
    <tag3>dataframe</tag3>
  </topusers>
  <topusers>
    <user>CommonsWare</user>
    <link>http://www.stackoverflow.com//users/115145/commonsware</link>
    <location>Who Wants to Know?</location>
    <year_rep>4,475</year_rep>
    <total_rep>616,135</total_rep>
    <tag1>android</tag1>
    <tag2>java</tag2>
    <tag3>android-intent</tag3>
  </topusers>
  <topusers>
    <user>Quentin</user>
    <link>http://www.stackoverflow.com//users/19068/quentin</link>
    <location>United Kingdom</location>
    <year_rep>4,464</year_rep>
    <total_rep>509,365</total_rep>
    <tag1>javascript</tag1>
    <tag2>html</tag2>
    <tag3>css</tag3>
  </topusers>
  <topusers>
    <user>Jon Skeet</user>
    <link>http://www.stackoverflow.com//users/22656/jon-skeet</link>
    <location>Reading, United Kingdom</location>
    <year_rep>4,348</year_rep>
    <total_rep>921,690</total_rep>
    <tag1>c#</tag1>
    <tag2>java</tag2>
    <tag3>.net</tag3>
  </topusers>
  <topusers>
    <user>Felix Kling</user>
    <link>http://www.stackoverflow.com//users/218196/felix-kling</link>
    <location>Sunnyvale, CA</location>
    <year_rep>4,324</year_rep>
    <total_rep>411,535</total_rep>
    <tag1>javascript</tag1>
    <tag2>jquery</tag2>
    <tag3>asynchronous</tag3>
  </topusers>
  <topusers>
    <user>matt</user>
    <link>http://www.stackoverflow.com//users/341994/matt</link>
    <location></location>
    <year_rep>4,313</year_rep>
    <total_rep>220,515</total_rep>
    <tag1>swift</tag1>
    <tag2>ios</tag2>
    <tag3>xcode</tag3>
  </topusers>
  <topusers>
    <user>Psidom</user>
    <link>http://www.stackoverflow.com//users/4983450/psidom</link>
    <location>Atlanta, GA, United States</location>
    <year_rep>4,236</year_rep>
    <total_rep>36,950</total_rep>
    <tag1>python</tag1>
    <tag2>pandas</tag2>
    <tag3>r</tag3>
  </topusers>
  <topusers>
    <user>Martin R</user>
    <link>http://www.stackoverflow.com//users/1187415/martin-r</link>
    <location>Germany</location>
    <year_rep>4,195</year_rep>
    <total_rep>269,380</total_rep>
    <tag1>swift</tag1>
    <tag2>ios</tag2>
    <tag3>swift3</tag3>
  </topusers>
  <topusers>
    <user>Barmar</user>
    <link>http://www.stackoverflow.com//users/1491895/barmar</link>
    <location>Arlington, MA</location>
    <year_rep>4,179</year_rep>
    <total_rep>289,989</total_rep>
    <tag1>javascript</tag1>
    <tag2>php</tag2>
    <tag3>jquery</tag3>
  </topusers>
  <topusers>
    <user>Alexey Mezenin</user>
    <link>http://www.stackoverflow.com//users/1227923/alexey-mezenin</link>
    <location>??????</location>
    <year_rep>4,142</year_rep>
    <total_rep>31,602</total_rep>
    <tag1>laravel</tag1>
    <tag2>php</tag2>
    <tag3>laravel-5.3</tag3>
  </topusers>
  <topusers>
    <user>BalusC</user>
    <link>http://www.stackoverflow.com//users/157882/balusc</link>
    <location>Amsterdam, Netherlands</location>
    <year_rep>4,046</year_rep>
    <total_rep>703,046</total_rep>
    <tag1>java</tag1>
    <tag2>jsf</tag2>
    <tag3>servlets</tag3>
  </topusers>
  <topusers>
    <user>GurV</user>
    <link>http://www.stackoverflow.com//users/6348498/gurv</link>
    <location></location>
    <year_rep>4,016</year_rep>
    <total_rep>7,932</total_rep>
    <tag1>sql</tag1>
    <tag2>mysql</tag2>
    <tag3>sql-server</tag3>
  </topusers>
  <topusers>
    <user>Nina Scholz</user>
    <link>http://www.stackoverflow.com//users/1447675/nina-scholz</link>
    <location>Berlin, Deutschland</location>
    <year_rep>3,950</year_rep>
    <total_rep>61,135</total_rep>
    <tag1>javascript</tag1>
    <tag2>arrays</tag2>
    <tag3>object</tag3>
  </topusers>
  <topusers>
    <user>JB Nizet</user>
    <link>http://www.stackoverflow.com//users/571407/jb-nizet</link>
    <location>Saint-Etienne, France</location>
    <year_rep>3,923</year_rep>
    <total_rep>418,780</total_rep>
    <tag1>java</tag1>
    <tag2>hibernate</tag2>
    <tag3>java-8</tag3>
  </topusers>
  <topusers>
    <user>Frank van Puffelen</user>
    <link>http://www.stackoverflow.com//users/209103/frank-van-puffelen</link>
    <location>San Francisco, CA</location>
    <year_rep>3,920</year_rep>
    <total_rep>86,520</total_rep>
    <tag1>firebase</tag1>
    <tag2>firebase-database</tag2>
    <tag3>android</tag3>
  </topusers>
  <topusers>
    <user>dasblinkenlight</user>
    <link>http://www.stackoverflow.com//users/335858/dasblinkenlight</link>
    <location>United States</location>
    <year_rep>3,886</year_rep>
    <total_rep>475,813</total_rep>
    <tag1>c#</tag1>
    <tag2>java</tag2>
    <tag3>c++</tag3>
  </topusers>
  <topusers>
    <user>Tim Biegeleisen</user>
    <link>http://www.stackoverflow.com//users/1863229/tim-biegeleisen</link>
    <location>Singapore</location>
    <year_rep>3,814</year_rep>
    <total_rep>77,211</total_rep>
    <tag1>sql</tag1>
    <tag2>mysql</tag2>
    <tag3>java</tag3>
  </topusers>
  <topusers>
    <user>Greg Hewgill</user>
    <link>http://www.stackoverflow.com//users/893/greg-hewgill</link>
    <location>Christchurch, New Zealand</location>
    <year_rep>3,796</year_rep>
    <total_rep>529,137</total_rep>
    <tag1>git</tag1>
    <tag2>python</tag2>
    <tag3>git-pull</tag3>
  </topusers>
  <topusers>
    <user>unutbu</user>
    <link>http://www.stackoverflow.com//users/190597/unutbu</link>
    <location></location>
    <year_rep>3,735</year_rep>
    <total_rep>401,595</total_rep>
    <tag1>python</tag1>
    <tag2>pandas</tag2>
    <tag3>numpy</tag3>
  </topusers>
  <topusers>
    <user>Hans Passant</user>
    <link>http://www.stackoverflow.com//users/17034/hans-passant</link>
    <location>Madison, WI</location>
    <year_rep>3,688</year_rep>
    <total_rep>672,118</total_rep>
    <tag1>c#</tag1>
    <tag2>.net</tag2>
    <tag3>winforms</tag3>
  </topusers>
  <topusers>
    <user>Jonathan Leffler</user>
    <link>http://www.stackoverflow.com//users/15168/jonathan-leffler</link>
    <location>California, USA</location>
    <year_rep>3,649</year_rep>
    <total_rep>455,157</total_rep>
    <tag1>c</tag1>
    <tag2>bash</tag2>
    <tag3>unix</tag3>
  </topusers>
  <topusers>
    <user>paxdiablo</user>
    <link>http://www.stackoverflow.com//users/14860/paxdiablo</link>
    <location></location>
    <year_rep>3,636</year_rep>
    <total_rep>507,043</total_rep>
    <tag1>c</tag1>
    <tag2>c++</tag2>
    <tag3>bash</tag3>
  </topusers>
  <topusers>
    <user>Pranav C Balan</user>
    <link>http://www.stackoverflow.com//users/3037257/pranav-c-balan</link>
    <location>Ramanthali, Kannur, Kerala, India</location>
    <year_rep>3,604</year_rep>
    <total_rep>64,476</total_rep>
    <tag1>javascript</tag1>
    <tag2>jquery</tag2>
    <tag3>html</tag3>
  </topusers>
  <topusers>
    <user>Suragch</user>
    <link>http://www.stackoverflow.com//users/3681880/suragch</link>
    <location>Hohhot, China</location>
    <year_rep>3,580</year_rep>
    <total_rep>71,032</total_rep>
    <tag1>swift</tag1>
    <tag2>ios</tag2>
    <tag3>android</tag3>
  </topusers>
</stackoverflow>

Python方法

import xml.etree.ElementTree as et
import pandas as pd
from io import StringIO
from lxml import etree as lxet

def read_xml_iterfind():
    tree = et.parse('Input.xml')

    data = []
    inner = {}
    for el in tree.iterfind('./*'):
        for i in el.iterfind('*'):
            inner[i.tag] = i.text
        data.append(inner)
        inner = {}

    df = pd.DataFrame(data)

def read_xml_iterparse():
    data = []
    inner = {}
    i = 1
    for (ev, el) in et.iterparse(path):
        if i <= 2:
           first_tag = el.tag

        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)            
            inner = {}

        if el.text is not None and len(el.text.strip()) > 0:
            inner[el.tag] = el.text
    i += 1

    df = pd.DataFrame(data)    

def read_xml_lxml_xpath():     
    tree = lxet.parse('Input.xml')

    data = []
    inner = {}
    for el in tree.xpath('/*/*'):
        for i in el:
            inner[i.tag] = i.text
        data.append(inner)
        inner = {}

    df = pd.DataFrame(data)

def read_xml_lxml_xsl():     
    xml = lxet.parse('Input.xml')

    xslstr = '''
    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output version="1.0" encoding="UTF-8" indent="yes"  method="text"/>
        <xsl:strip-space elements="*"/>

        <!-- HEADERS -->
        <xsl:template match = "/*">
            <xsl:for-each select="*[1]/*">
              <xsl:value-of select="local-name()" />
                <xsl:choose>
                   <xsl:when test="position() != last()">
                      <xsl:text>,</xsl:text>
                   </xsl:when>
                   <xsl:otherwise>
                      <xsl:text>&#xa;</xsl:text>
                   </xsl:otherwise>                              
                </xsl:choose>   
            </xsl:for-each>
            <xsl:apply-templates/>
        </xsl:template>

        <!-- DATA ROWS (COMMA-SEPARATED) -->
        <xsl:template match="/*/*" priority="2">    
            <xsl:for-each select="*">
              <xsl:if test="position() = 1">
                   <xsl:text>&quot;</xsl:text>
              </xsl:if>
              <xsl:value-of select="." />
                <xsl:choose>
                   <xsl:when test="position() != last()">
                      <xsl:text>&quot;,&quot;</xsl:text>
                   </xsl:when>
                   <xsl:otherwise>
                      <xsl:text>&quot;&#xa;</xsl:text>
                   </xsl:otherwise>                              
                </xsl:choose>
            </xsl:for-each>
        </xsl:template>

    </xsl:transform>
    '''
    xsl = lxet.fromstring(xslstr)

    transform = lxet.XSLT(xsl)
    newdom = transform(xml)

    df = pd.read_csv(StringIO(str(newdom)))

时序 (当前的XML和XML的子级是25倍(即900条StackOverflow用户记录)

# SHORTER FILE
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterfind()'
100 loops, best of 3: 3.87 msec per loop

python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterparse()'
100 loops, best of 3: 5.5 msec per loop

python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()'
100 loops, best of 3: 3.86 msec per loop

python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()'
100 loops, best of 3: 5.68 msec per loop

# LARGER FILE
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterfind()'
100 loops, best of 3: 36 msec per loop

python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterparse()'
100 loops, best of 3: 78.9 msec per loop

python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()'
100 loops, best of 3: 32.7 msec per loop

python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()'
100 loops, best of 3: 51.4 msec per loop

Currently, pandas I/O tools does not maintain a read_xml() method and the counterpart to_xml(). However, read_json proves tree-like structures can be implemented for dataframe import and read_html for markup formats.

If the pandas team does consider such a read_xml method for a future pandas version, what implementation would they pursue: parsing with built-in xml.etree.ElementTree with its iterfind() or iterparse() functions or the third-party module, lxml with its XPath 1.0 and XSLT 1.0 methods?

Below are my test runs for four method types on a simple, flat, element-centric XML input. All are set up for generalized parsing for any second level children of root and each method should yield exact same pandas dataframe. All but the last calls pd.Dataframe() on list of dictionaries. The XSLT method transforms XML to CSV for casted StringIO() in pd.read_csv().

Question (multi-part)

  • PERFORMANCE: How do you explain the slower iterparse often recommended for larger files as file is iteratively parsed? Is it partly due to the if logic checks?

  • MEMORY: Do CPU memory correlate with timings in I/O calls? XSLT and XPath 1.0 tend not to scale well with larger XML documents as entire file must be read in memory to be parsed.

  • STRATEGY: Is list of dictionaries an optimal strategy for Dataframe() call? See these interesting answers: generator version and a iterwalk user-defined version. Both upcast lists to dataframe.

Input Data (Stack Overflow’s current top users by year of which our pandas friends are included)

<?xml version="1.0" encoding="utf-8"?>
<stackoverflow>
  <topusers>
    <user>Gordon Linoff</user>
    <link>http://www.stackoverflow.com//users/1144035/gordon-linoff</link>
    <location>New York, United States</location>
    <year_rep>5,985</year_rep>
    <total_rep>499,408</total_rep>
    <tag1>sql</tag1>
    <tag2>sql-server</tag2>
    <tag3>mysql</tag3>
  </topusers>
  <topusers>
    <user>Günter Zöchbauer</user>
    <link>http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer</link>
    <location>Linz, Austria</location>
    <year_rep>5,835</year_rep>
    <total_rep>154,439</total_rep>
    <tag1>angular2</tag1>
    <tag2>typescript</tag2>
    <tag3>javascript</tag3>
  </topusers>
  <topusers>
    <user>jezrael</user>
    <link>http://www.stackoverflow.com//users/2901002/jezrael</link>
    <location>Bratislava, Slovakia</location>
    <year_rep>5,740</year_rep>
    <total_rep>83,237</total_rep>
    <tag1>pandas</tag1>
    <tag2>python</tag2>
    <tag3>dataframe</tag3>
  </topusers>
  <topusers>
    <user>VonC</user>
    <link>http://www.stackoverflow.com//users/6309/vonc</link>
    <location>France</location>
    <year_rep>5,577</year_rep>
    <total_rep>651,397</total_rep>
    <tag1>git</tag1>
    <tag2>github</tag2>
    <tag3>docker</tag3>
  </topusers>
  <topusers>
    <user>Martijn Pieters</user>
    <link>http://www.stackoverflow.com//users/100297/martijn-pieters</link>
    <location>Cambridge, United Kingdom</location>
    <year_rep>5,337</year_rep>
    <total_rep>525,176</total_rep>
    <tag1>python</tag1>
    <tag2>python-3.x</tag2>
    <tag3>python-2.7</tag3>
  </topusers>
  <topusers>
    <user>T.J. Crowder</user>
    <link>http://www.stackoverflow.com//users/157247/t-j-crowder</link>
    <location>United Kingdom</location>
    <year_rep>5,258</year_rep>
    <total_rep>508,310</total_rep>
    <tag1>javascript</tag1>
    <tag2>jquery</tag2>
    <tag3>java</tag3>
  </topusers>
  <topusers>
    <user>akrun</user>
    <link>http://www.stackoverflow.com//users/3732271/akrun</link>
    <location></location>
    <year_rep>5,188</year_rep>
    <total_rep>229,553</total_rep>
    <tag1>r</tag1>
    <tag2>dplyr</tag2>
    <tag3>dataframe</tag3>
  </topusers>
  <topusers>
    <user>Wiktor Stribi?ew</user>
    <link>http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew</link>
    <location>Warsaw, Poland</location>
    <year_rep>4,948</year_rep>
    <total_rep>158,134</total_rep>
    <tag1>regex</tag1>
    <tag2>javascript</tag2>
    <tag3>c#</tag3>
  </topusers>
  <topusers>
    <user>Darin Dimitrov</user>
    <link>http://www.stackoverflow.com//users/29407/darin-dimitrov</link>
    <location>Sofia, Bulgaria</location>
    <year_rep>4,936</year_rep>
    <total_rep>709,683</total_rep>
    <tag1>c#</tag1>
    <tag2>asp.net-mvc</tag2>
    <tag3>asp.net-mvc-3</tag3>
  </topusers>
  <topusers>
    <user>Eric Duminil</user>
    <link>http://www.stackoverflow.com//users/6419007/eric-duminil</link>
    <location></location>
    <year_rep>4,854</year_rep>
    <total_rep>12,557</total_rep>
    <tag1>ruby</tag1>
    <tag2>ruby-on-rails</tag2>
    <tag3>arrays</tag3>
  </topusers>
  <topusers>
    <user>alecxe</user>
    <link>http://www.stackoverflow.com//users/771848/alecxe</link>
    <location>New York, United States</location>
    <year_rep>4,723</year_rep>
    <total_rep>233,368</total_rep>
    <tag1>python</tag1>
    <tag2>selenium</tag2>
    <tag3>protractor</tag3>
  </topusers>
  <topusers>
    <user>Jean-François Fabre</user>
    <link>http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre</link>
    <location>Toulouse, France</location>
    <year_rep>4,526</year_rep>
    <total_rep>30,027</total_rep>
    <tag1>python</tag1>
    <tag2>python-3.x</tag2>
    <tag3>python-2.7</tag3>
  </topusers>
  <topusers>
    <user>piRSquared</user>
    <link>http://www.stackoverflow.com//users/2336654/pirsquared</link>
    <location>Bellevue, WA, United States</location>
    <year_rep>4,482</year_rep>
    <total_rep>41,183</total_rep>
    <tag1>pandas</tag1>
    <tag2>python</tag2>
    <tag3>dataframe</tag3>
  </topusers>
  <topusers>
    <user>CommonsWare</user>
    <link>http://www.stackoverflow.com//users/115145/commonsware</link>
    <location>Who Wants to Know?</location>
    <year_rep>4,475</year_rep>
    <total_rep>616,135</total_rep>
    <tag1>android</tag1>
    <tag2>java</tag2>
    <tag3>android-intent</tag3>
  </topusers>
  <topusers>
    <user>Quentin</user>
    <link>http://www.stackoverflow.com//users/19068/quentin</link>
    <location>United Kingdom</location>
    <year_rep>4,464</year_rep>
    <total_rep>509,365</total_rep>
    <tag1>javascript</tag1>
    <tag2>html</tag2>
    <tag3>css</tag3>
  </topusers>
  <topusers>
    <user>Jon Skeet</user>
    <link>http://www.stackoverflow.com//users/22656/jon-skeet</link>
    <location>Reading, United Kingdom</location>
    <year_rep>4,348</year_rep>
    <total_rep>921,690</total_rep>
    <tag1>c#</tag1>
    <tag2>java</tag2>
    <tag3>.net</tag3>
  </topusers>
  <topusers>
    <user>Felix Kling</user>
    <link>http://www.stackoverflow.com//users/218196/felix-kling</link>
    <location>Sunnyvale, CA</location>
    <year_rep>4,324</year_rep>
    <total_rep>411,535</total_rep>
    <tag1>javascript</tag1>
    <tag2>jquery</tag2>
    <tag3>asynchronous</tag3>
  </topusers>
  <topusers>
    <user>matt</user>
    <link>http://www.stackoverflow.com//users/341994/matt</link>
    <location></location>
    <year_rep>4,313</year_rep>
    <total_rep>220,515</total_rep>
    <tag1>swift</tag1>
    <tag2>ios</tag2>
    <tag3>xcode</tag3>
  </topusers>
  <topusers>
    <user>Psidom</user>
    <link>http://www.stackoverflow.com//users/4983450/psidom</link>
    <location>Atlanta, GA, United States</location>
    <year_rep>4,236</year_rep>
    <total_rep>36,950</total_rep>
    <tag1>python</tag1>
    <tag2>pandas</tag2>
    <tag3>r</tag3>
  </topusers>
  <topusers>
    <user>Martin R</user>
    <link>http://www.stackoverflow.com//users/1187415/martin-r</link>
    <location>Germany</location>
    <year_rep>4,195</year_rep>
    <total_rep>269,380</total_rep>
    <tag1>swift</tag1>
    <tag2>ios</tag2>
    <tag3>swift3</tag3>
  </topusers>
  <topusers>
    <user>Barmar</user>
    <link>http://www.stackoverflow.com//users/1491895/barmar</link>
    <location>Arlington, MA</location>
    <year_rep>4,179</year_rep>
    <total_rep>289,989</total_rep>
    <tag1>javascript</tag1>
    <tag2>php</tag2>
    <tag3>jquery</tag3>
  </topusers>
  <topusers>
    <user>Alexey Mezenin</user>
    <link>http://www.stackoverflow.com//users/1227923/alexey-mezenin</link>
    <location>??????</location>
    <year_rep>4,142</year_rep>
    <total_rep>31,602</total_rep>
    <tag1>laravel</tag1>
    <tag2>php</tag2>
    <tag3>laravel-5.3</tag3>
  </topusers>
  <topusers>
    <user>BalusC</user>
    <link>http://www.stackoverflow.com//users/157882/balusc</link>
    <location>Amsterdam, Netherlands</location>
    <year_rep>4,046</year_rep>
    <total_rep>703,046</total_rep>
    <tag1>java</tag1>
    <tag2>jsf</tag2>
    <tag3>servlets</tag3>
  </topusers>
  <topusers>
    <user>GurV</user>
    <link>http://www.stackoverflow.com//users/6348498/gurv</link>
    <location></location>
    <year_rep>4,016</year_rep>
    <total_rep>7,932</total_rep>
    <tag1>sql</tag1>
    <tag2>mysql</tag2>
    <tag3>sql-server</tag3>
  </topusers>
  <topusers>
    <user>Nina Scholz</user>
    <link>http://www.stackoverflow.com//users/1447675/nina-scholz</link>
    <location>Berlin, Deutschland</location>
    <year_rep>3,950</year_rep>
    <total_rep>61,135</total_rep>
    <tag1>javascript</tag1>
    <tag2>arrays</tag2>
    <tag3>object</tag3>
  </topusers>
  <topusers>
    <user>JB Nizet</user>
    <link>http://www.stackoverflow.com//users/571407/jb-nizet</link>
    <location>Saint-Etienne, France</location>
    <year_rep>3,923</year_rep>
    <total_rep>418,780</total_rep>
    <tag1>java</tag1>
    <tag2>hibernate</tag2>
    <tag3>java-8</tag3>
  </topusers>
  <topusers>
    <user>Frank van Puffelen</user>
    <link>http://www.stackoverflow.com//users/209103/frank-van-puffelen</link>
    <location>San Francisco, CA</location>
    <year_rep>3,920</year_rep>
    <total_rep>86,520</total_rep>
    <tag1>firebase</tag1>
    <tag2>firebase-database</tag2>
    <tag3>android</tag3>
  </topusers>
  <topusers>
    <user>dasblinkenlight</user>
    <link>http://www.stackoverflow.com//users/335858/dasblinkenlight</link>
    <location>United States</location>
    <year_rep>3,886</year_rep>
    <total_rep>475,813</total_rep>
    <tag1>c#</tag1>
    <tag2>java</tag2>
    <tag3>c++</tag3>
  </topusers>
  <topusers>
    <user>Tim Biegeleisen</user>
    <link>http://www.stackoverflow.com//users/1863229/tim-biegeleisen</link>
    <location>Singapore</location>
    <year_rep>3,814</year_rep>
    <total_rep>77,211</total_rep>
    <tag1>sql</tag1>
    <tag2>mysql</tag2>
    <tag3>java</tag3>
  </topusers>
  <topusers>
    <user>Greg Hewgill</user>
    <link>http://www.stackoverflow.com//users/893/greg-hewgill</link>
    <location>Christchurch, New Zealand</location>
    <year_rep>3,796</year_rep>
    <total_rep>529,137</total_rep>
    <tag1>git</tag1>
    <tag2>python</tag2>
    <tag3>git-pull</tag3>
  </topusers>
  <topusers>
    <user>unutbu</user>
    <link>http://www.stackoverflow.com//users/190597/unutbu</link>
    <location></location>
    <year_rep>3,735</year_rep>
    <total_rep>401,595</total_rep>
    <tag1>python</tag1>
    <tag2>pandas</tag2>
    <tag3>numpy</tag3>
  </topusers>
  <topusers>
    <user>Hans Passant</user>
    <link>http://www.stackoverflow.com//users/17034/hans-passant</link>
    <location>Madison, WI</location>
    <year_rep>3,688</year_rep>
    <total_rep>672,118</total_rep>
    <tag1>c#</tag1>
    <tag2>.net</tag2>
    <tag3>winforms</tag3>
  </topusers>
  <topusers>
    <user>Jonathan Leffler</user>
    <link>http://www.stackoverflow.com//users/15168/jonathan-leffler</link>
    <location>California, USA</location>
    <year_rep>3,649</year_rep>
    <total_rep>455,157</total_rep>
    <tag1>c</tag1>
    <tag2>bash</tag2>
    <tag3>unix</tag3>
  </topusers>
  <topusers>
    <user>paxdiablo</user>
    <link>http://www.stackoverflow.com//users/14860/paxdiablo</link>
    <location></location>
    <year_rep>3,636</year_rep>
    <total_rep>507,043</total_rep>
    <tag1>c</tag1>
    <tag2>c++</tag2>
    <tag3>bash</tag3>
  </topusers>
  <topusers>
    <user>Pranav C Balan</user>
    <link>http://www.stackoverflow.com//users/3037257/pranav-c-balan</link>
    <location>Ramanthali, Kannur, Kerala, India</location>
    <year_rep>3,604</year_rep>
    <total_rep>64,476</total_rep>
    <tag1>javascript</tag1>
    <tag2>jquery</tag2>
    <tag3>html</tag3>
  </topusers>
  <topusers>
    <user>Suragch</user>
    <link>http://www.stackoverflow.com//users/3681880/suragch</link>
    <location>Hohhot, China</location>
    <year_rep>3,580</year_rep>
    <total_rep>71,032</total_rep>
    <tag1>swift</tag1>
    <tag2>ios</tag2>
    <tag3>android</tag3>
  </topusers>
</stackoverflow>

Python Methods

import xml.etree.ElementTree as et
import pandas as pd
from io import StringIO
from lxml import etree as lxet

def read_xml_iterfind():
    tree = et.parse('Input.xml')

    data = []
    inner = {}
    for el in tree.iterfind('./*'):
        for i in el.iterfind('*'):
            inner[i.tag] = i.text
        data.append(inner)
        inner = {}

    df = pd.DataFrame(data)

def read_xml_iterparse():
    data = []
    inner = {}
    i = 1
    for (ev, el) in et.iterparse(path):
        if i <= 2:
           first_tag = el.tag

        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)            
            inner = {}

        if el.text is not None and len(el.text.strip()) > 0:
            inner[el.tag] = el.text
    i += 1

    df = pd.DataFrame(data)    

def read_xml_lxml_xpath():     
    tree = lxet.parse('Input.xml')

    data = []
    inner = {}
    for el in tree.xpath('/*/*'):
        for i in el:
            inner[i.tag] = i.text
        data.append(inner)
        inner = {}

    df = pd.DataFrame(data)

def read_xml_lxml_xsl():     
    xml = lxet.parse('Input.xml')

    xslstr = '''
    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output version="1.0" encoding="UTF-8" indent="yes"  method="text"/>
        <xsl:strip-space elements="*"/>

        <!-- HEADERS -->
        <xsl:template match = "/*">
            <xsl:for-each select="*[1]/*">
              <xsl:value-of select="local-name()" />
                <xsl:choose>
                   <xsl:when test="position() != last()">
                      <xsl:text>,</xsl:text>
                   </xsl:when>
                   <xsl:otherwise>
                      <xsl:text>&#xa;</xsl:text>
                   </xsl:otherwise>                              
                </xsl:choose>   
            </xsl:for-each>
            <xsl:apply-templates/>
        </xsl:template>

        <!-- DATA ROWS (COMMA-SEPARATED) -->
        <xsl:template match="/*/*" priority="2">    
            <xsl:for-each select="*">
              <xsl:if test="position() = 1">
                   <xsl:text>&quot;</xsl:text>
              </xsl:if>
              <xsl:value-of select="." />
                <xsl:choose>
                   <xsl:when test="position() != last()">
                      <xsl:text>&quot;,&quot;</xsl:text>
                   </xsl:when>
                   <xsl:otherwise>
                      <xsl:text>&quot;&#xa;</xsl:text>
                   </xsl:otherwise>                              
                </xsl:choose>
            </xsl:for-each>
        </xsl:template>

    </xsl:transform>
    '''
    xsl = lxet.fromstring(xslstr)

    transform = lxet.XSLT(xsl)
    newdom = transform(xml)

    df = pd.read_csv(StringIO(str(newdom)))

Timings (with current XML and XML with 25 times the children (i.e., 900 StackOverflow user records)

# SHORTER FILE
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterfind()'
100 loops, best of 3: 3.87 msec per loop

python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterparse()'
100 loops, best of 3: 5.5 msec per loop

python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()'
100 loops, best of 3: 3.86 msec per loop

python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()'
100 loops, best of 3: 5.68 msec per loop

# LARGER FILE
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterfind()'
100 loops, best of 3: 36 msec per loop

python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterparse()'
100 loops, best of 3: 78.9 msec per loop

python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()'
100 loops, best of 3: 32.7 msec per loop

python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()'
100 loops, best of 3: 51.4 msec per loop

回答 0

性能:如何解释由于迭代解析文件而通常建议对较大文件使用的较慢iterparse?部分原因是由于if逻辑检查?

我认为更多的python代码会使它变慢,因为每次都会评估python代码。您是否尝试过像pypy这样的JIT编译器?

如果仅删除i并使用first_tag,它似乎会快很多,所以是的,部分原因在于if逻辑检查:

def read_xml_iterparse2(path):
    data = []
    inner = {}
    first_tag = None
    for (ev, el) in et.iterparse(path):
        if not first_tag:
           first_tag = el.tag

        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)            
            inner = {}

        if el.text is not None and len(el.text.strip()) > 0:
            inner[el.tag] = el.text

    df = pd.DataFrame(data)    

%timeit read_xml_iterparse(path)
# 10 loops, best of 5: 33 ms per loop
%timeit read_xml_iterparse2(path)
# 10 loops, best of 5: 23 ms per loop

我不确定我是否了解上次if检查的目的,但也不确定为什么您会丢失仅空白元素。持续删除最后一个可以if节省一点时间:

def read_xml_iterparse3(path):
    data = []
    inner = {}
    first_tag = None
    for (ev, el) in et.iterparse(path):
        if not first_tag:
           first_tag = el.tag

        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)            
            inner = {}

        inner[el.tag] = el.text

    df = pd.DataFrame(data)    

%timeit read_xml_iterparse(path)
# 10 loops, best of 5: 34.4 ms per loop
%timeit read_xml_iterparse2(path)
# 10 loops, best of 5: 24.5 ms per loop
%timeit read_xml_iterparse3(path)
# 10 loops, best of 5: 20.9 ms per loop

现在,无论是否进行了这些性能改进,您的iterparse版本似乎都会产生一个更大的数据框。这似乎是一个有效的快速版本:

def read_xml_iterparse5(path):
    data = []
    inner = {}
    for (ev, el) in et.iterparse(path):
        # /ending parents trigger a new row, and in our case .text is \n followed by spaces.  it would be more reliable to pass 'topusers' to our read_xml_iterparse5 as the .tag to check
        if el.text and el.text[0] == '\n':
            # ignore /stackoverflow
            if inner:
                data.append(inner)
                inner = {}
        else:
            inner[el.tag] = el.text

    return pd.DataFrame(data)    

print(read_xml_iterfind(path).shape)
# (900, 8)
print(read_xml_iterparse(path).shape)
# (7050, 8)
print(read_xml_lxml_xpath(path).shape)
# (900, 8)
print(read_xml_lxml_xsl(path).shape)
# (900, 8)
print(read_xml_iterparse5(path).shape)
# (900, 8)
%timeit read_xml_iterparse5(path)
# 10 loops, best of 5: 20.6 ms per loop

内存:CPU内存是否与I / O调用中的时间相关?XSLT和XPath 1.0在较大的XML文档中往往无法很好地扩展,因为必须在内存中读取整个文件才能进行解析。

我不能完全确定“ I / O调用”是什么意思,但是如果您的文档足够小以适合缓存,那么一切都会更快,因为它不会从缓存中逐出其他项目。

策略:词典列表是否是Dataframe()调用的最佳策略?请参阅以下有趣的答案:生成器版本和iterwalk用户定义的版本。两个上载列表到数据帧。

列表使用的内存较少,因此根据您拥有的列数,它可能会产生明显的不同。当然,这然后要求您的XML标记具有一致的顺序,看起来确实如此。该DataFrame()调用也将需要做的工作更少,因为它不必在每一行的dict中查找键,以弄清楚哪一列是什么值。

PERFORMANCE: How do you explain the slower iterparse often recommended for larger files as file is iteratively parsed? Is it partly due to the if logic checks?

I would assume that more python code would make it slower, as the python code is evaluated every time. Have you tried a JIT compiler like pypy?

If I remove i and use first_tag only, it seems to be quite a bit faster, so yes it is partly due to the if logic checks:

def read_xml_iterparse2(path):
    data = []
    inner = {}
    first_tag = None
    for (ev, el) in et.iterparse(path):
        if not first_tag:
           first_tag = el.tag

        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)            
            inner = {}

        if el.text is not None and len(el.text.strip()) > 0:
            inner[el.tag] = el.text

    df = pd.DataFrame(data)    

%timeit read_xml_iterparse(path)
# 10 loops, best of 5: 33 ms per loop
%timeit read_xml_iterparse2(path)
# 10 loops, best of 5: 23 ms per loop

I wasn’t sure I understood the purpose of the last if check, but I’m also not sure why you would want to lose whitespace-only elements. Removing the last if consistently shaves off a little bit of time:

def read_xml_iterparse3(path):
    data = []
    inner = {}
    first_tag = None
    for (ev, el) in et.iterparse(path):
        if not first_tag:
           first_tag = el.tag

        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)            
            inner = {}

        inner[el.tag] = el.text

    df = pd.DataFrame(data)    

%timeit read_xml_iterparse(path)
# 10 loops, best of 5: 34.4 ms per loop
%timeit read_xml_iterparse2(path)
# 10 loops, best of 5: 24.5 ms per loop
%timeit read_xml_iterparse3(path)
# 10 loops, best of 5: 20.9 ms per loop

Now, with or without those performance improvements, your iterparse version seems to produce an extra-large dataframe. Here seems to be a working, fast version:

def read_xml_iterparse5(path):
    data = []
    inner = {}
    for (ev, el) in et.iterparse(path):
        # /ending parents trigger a new row, and in our case .text is \n followed by spaces.  it would be more reliable to pass 'topusers' to our read_xml_iterparse5 as the .tag to check
        if el.text and el.text[0] == '\n':
            # ignore /stackoverflow
            if inner:
                data.append(inner)
                inner = {}
        else:
            inner[el.tag] = el.text

    return pd.DataFrame(data)    

print(read_xml_iterfind(path).shape)
# (900, 8)
print(read_xml_iterparse(path).shape)
# (7050, 8)
print(read_xml_lxml_xpath(path).shape)
# (900, 8)
print(read_xml_lxml_xsl(path).shape)
# (900, 8)
print(read_xml_iterparse5(path).shape)
# (900, 8)
%timeit read_xml_iterparse5(path)
# 10 loops, best of 5: 20.6 ms per loop

MEMORY: Do CPU memory correlate with timings in I/O calls? XSLT and XPath 1.0 tend not to scale well with larger XML documents as entire file must be read in memory to be parsed.

I’m not totally sure what you mean by “I/O calls” but if your document is small enough to fit in cache, then everything will be much faster as it won’t evict many other items from the cache.

STRATEGY: Is list of dictionaries an optimal strategy for Dataframe() call? See these interesting answers: generator version and a iterwalk user-defined version. Both upcast lists to dataframe.

The lists use less memory, so depending on how many columns you have, it could make a noticeable difference. Of course, this then requires your XML tags to be in a consistent order, which they do appear to be. The DataFrame() call would also need to do less work, as it doesn’t have to lookup keys in the dict on every row, to figure out what column if for what value.


根据另一个列熊猫数据框提取列值

问题:根据另一个列熊猫数据框提取列值

我有点被困在提取一个变量对另一个变量的条件值上。例如,以下数据框:

A  B
p1 1
p1 2
p3 3
p2 4

我如何获得Awhen 的价值B=3?每当我提取的值时A,我都会得到一个对象,而不是字符串。

I am kind of getting stuck on extracting value of one variable conditioning on another variable. For example, the following dataframe:

A  B
p1 1
p1 2
p3 3
p2 4

How can I get the value of A when B=3? Every time when I extracted the value of A, I got an object, not a string.


回答 0

您可以loc用来获取满足条件的序列,然后iloc获取第一个元素:

In [2]: df
Out[2]:
    A  B
0  p1  1
1  p1  2
2  p3  3
3  p2  4

In [3]: df.loc[df['B'] == 3, 'A']
Out[3]:
2    p3
Name: A, dtype: object

In [4]: df.loc[df['B'] == 3, 'A'].iloc[0]
Out[4]: 'p3'

You could use loc to get series which satisfying your condition and then iloc to get first element:

In [2]: df
Out[2]:
    A  B
0  p1  1
1  p1  2
2  p3  3
3  p2  4

In [3]: df.loc[df['B'] == 3, 'A']
Out[3]:
2    p3
Name: A, dtype: object

In [4]: df.loc[df['B'] == 3, 'A'].iloc[0]
Out[4]: 'p3'

回答 1

您可以尝试query,输入更少:

df.query('B==3')['A']

You can try query, which is less typing:

df.query('B==3')['A']

回答 2

df[df['B']==3]['A'],假设df是您的pandas.DataFrame。

df[df['B']==3]['A'], assuming df is your pandas.DataFrame.


回答 3

使用df[df['B']==3]['A'].values如果你只是想项目本身没有括号

Use df[df['B']==3]['A'].values if you just want item itself without the brackets


回答 4

male_avgtip=(tips_data.loc[tips_data['sex'] == 'Male', 'tip']).mean()

我还为我的任务进行了这种clause和提取操作。

male_avgtip=(tips_data.loc[tips_data['sex'] == 'Male', 'tip']).mean()

I have also worked on this clausing and extraction operations for my assignment.


Python Pandas仅合并某些列

问题:Python Pandas仅合并某些列

是否可以仅合并一些列?我有一个带有x,y,z和df2列的DataFrame df1,其中x,a,b,c,d,e,f等列。

我想在x上合并两个DataFrame,但是我只想合并df2.a,df2.b列-而不是整个DataFrame。

结果将是具有x,y,z,a,b的DataFrame。

我可以合并然后删除不需要的列,但是似乎有更好的方法。

Is it possible to only merge some columns? I have a DataFrame df1 with columns x, y, z, and df2 with columns x, a ,b, c, d, e, f, etc.

I want to merge the two DataFrames on x, but I only want to merge columns df2.a, df2.b – not the entire DataFrame.

The result would be a DataFrame with x, y, z, a, b.

I could merge then delete the unwanted columns, but it seems like there is a better method.


回答 0

您可以合并sub-DataFrame(仅包含那些列):

df2[list('xab')]  # df2 but only with columns x, a, and b

df1.merge(df2[list('xab')])

You could merge the sub-DataFrame (with just those columns):

df2[list('xab')]  # df2 but only with columns x, a, and b

df1.merge(df2[list('xab')])

回答 1

您想使用两个括号,因此,如果要执行VLOOKUP动作,请执行以下操作:

df = pd.merge(df,df2[['Key_Column','Target_Column']],on='Key_Column', how='left')

这将为您提供原始df中的所有内容,并在df2中添加您想要加入的相应列。

You want to use TWO brackets, so if you are doing a VLOOKUP sort of action:

df = pd.merge(df,df2[['Key_Column','Target_Column']],on='Key_Column', how='left')

This will give you everything in the original df + add that one corresponding column in df2 that you want to join.


回答 2

如果要从目标数据框中删除列,但联接需要该列,则可以执行以下操作:

df1 = df1.merge(df2[['a', 'b', 'key1']], how = 'left',
                left_on = 'key2', right_on = 'key1').drop('key1')

.drop('key1')部分将防止“ key1”保留在结果数据帧中,尽管它首先需要加入。

If you want to drop column(s) from the target data frame, but the column(s) are required for the join, you can do the following:

df1 = df1.merge(df2[['a', 'b', 'key1']], how = 'left',
                left_on = 'key2', right_on = 'key1').drop('key1')

The .drop('key1') part will prevent ‘key1’ from being kept in the resulting data frame, despite it being required to join in the first place.


回答 3

您可以使用.loc选择所有行的特定列,然后将其拉出。下面是一个示例:

pandas.merge(dataframe1, dataframe2.iloc[:, [0:5]], how='left', on='key')

在此示例中,您要合并dataframe1和dataframe2。您已选择对“键”进行外部左连接。但是,对于dataframe2,您指定.iloc了允许您以数字格式指定想要的行和列的方法。使用:,选择所有行,但[0:5]选择前5列。您可以使用.loc按名称指定,但是如果您使用长列名称,则.iloc可能会更好。

You can use .loc to select the specific columns with all rows and then pull that. An example is below:

pandas.merge(dataframe1, dataframe2.iloc[:, [0:5]], how='left', on='key')

In this example, you are merging dataframe1 and dataframe2. You have chosen to do an outer left join on ‘key’. However, for dataframe2 you have specified .iloc which allows you to specific the rows and columns you want in a numerical format. Using :, your selecting all rows, but [0:5] selects the first 5 columns. You could use .loc to specify by name, but if your dealing with long column names, then .iloc may be better.


回答 4

这是为了合并两个表中的选定列。

如果table_1包含t1_a,t1_b,t1_c..,id,..t1_ztable_2包含t2_a, t2_b, t2_c..., id,..t2_z列,并且最终表中仅需要t1_a,id和t2_a,则

mergedCSV = table_1[['t1_a','id']].merge(table_2[['t2_a','id']], on = 'id',how = 'left')
# save resulting output file    
mergedCSV.to_csv('output.csv',index = False)

This is to merge selected columns from two tables.

If table_1 contains t1_a,t1_b,t1_c..,id,..t1_z columns, and table_2 contains t2_a, t2_b, t2_c..., id,..t2_z columns, and only t1_a, id, t2_a are required in the final table, then

mergedCSV = table_1[['t1_a','id']].merge(table_2[['t2_a','id']], on = 'id',how = 'left')
# save resulting output file    
mergedCSV.to_csv('output.csv',index = False)

如何在熊猫中更改日期时间格式

问题:如何在熊猫中更改日期时间格式

我的数据框有一个DOB列(示例格式1/1/2016),默认情况下该列会转换为dtype’object’熊猫:DOB object

使用将日期转换为日期格式df['DOB'] = pd.to_datetime(df['DOB']),日期将转换为:2016-01-26,日期dtype为:DOB datetime64[ns]

现在,我想将此日期格式转换为01/26/2016任何其他通用日期格式或。我该怎么做?

无论我尝试哪种方法,它始终以2016-01-26格式显示日期。

My dataframe has a DOB column (example format 1/1/2016) which by default gets converted to pandas dtype ‘object’: DOB object

Converting this to date format with df['DOB'] = pd.to_datetime(df['DOB']), the date gets converted to: 2016-01-26 and its dtype is: DOB datetime64[ns].

Now I want to convert this date format to 01/26/2016 or in any other general date formats. How do I do it?

Whatever the method I try, it always shows the date in 2016-01-26 format.


回答 0

dt.strftime如果需要转换datetime为其他格式,可以使用(但请注意,dtype列的则为objectstring)):

import pandas as pd

df = pd.DataFrame({'DOB': {0: '26/1/2016', 1: '26/1/2016'}})
print (df)
         DOB
0  26/1/2016 
1  26/1/2016

df['DOB'] = pd.to_datetime(df.DOB)
print (df)
         DOB
0 2016-01-26
1 2016-01-26

df['DOB1'] = df['DOB'].dt.strftime('%m/%d/%Y')
print (df)
         DOB        DOB1
0 2016-01-26  01/26/2016
1 2016-01-26  01/26/2016

You can use dt.strftime if you need to convert datetime to other formats (but note that then dtype of column will be object (string)):

import pandas as pd

df = pd.DataFrame({'DOB': {0: '26/1/2016', 1: '26/1/2016'}})
print (df)
         DOB
0  26/1/2016 
1  26/1/2016

df['DOB'] = pd.to_datetime(df.DOB)
print (df)
         DOB
0 2016-01-26
1 2016-01-26

df['DOB1'] = df['DOB'].dt.strftime('%m/%d/%Y')
print (df)
         DOB        DOB1
0 2016-01-26  01/26/2016
1 2016-01-26  01/26/2016

回答 1

更改格式但不更改类型:

df['date'] = pd.to_datetime(df["date"].dt.strftime('%Y-%m'))

Changing the format but not changing the type:

df['date'] = pd.to_datetime(df["date"].dt.strftime('%Y-%m'))

回答 2

下面的代码对我有用,而不是上一个-试试看!

df['DOB']=pd.to_datetime(df['DOB'].astype(str), format='%m/%d/%Y')

The below code worked for me instead of the previous one – try it out !

df['DOB']=pd.to_datetime(df['DOB'].astype(str), format='%m/%d/%Y')

回答 3

与第一个答案相比,我建议先使用dt.strftime(),然后再使用pd.to_datetime()。这样,它将仍然导致datetime数据类型。

例如,

import pandas as pd

df = pd.DataFrame({'DOB': {0: '26/1/2016 ', 1: '26/1/2016 '})
print(df.dtypes)

df['DOB1'] = df['DOB'].dt.strftime('%m/%d/%Y')
print(df.dtypes)

df['DOB1'] = pd.to_datetime(df['DOB1'])
print(df.dtypes)

Compared to the first answer, I will recommend to use dt.strftime() first, then pd.to_datetime(). In this way, it will still result in the datetime data type.

For example,

import pandas as pd

df = pd.DataFrame({'DOB': {0: '26/1/2016 ', 1: '26/1/2016 '})
print(df.dtypes)

df['DOB1'] = df['DOB'].dt.strftime('%m/%d/%Y')
print(df.dtypes)

df['DOB1'] = pd.to_datetime(df['DOB1'])
print(df.dtypes)

回答 4

两者之间有区别

  • 数据帧单元的内容(二进制值)和
  • 它对我们(人类)的演示(展示)。

所以问题是:如何在不更改数据/数据类型本身的情况下达到我的数据的适当表示

答案是:

  • 如果您使用Jupyter笔记本显示数据,或者
  • 如果您想以HTML文件的形式进行演示(即使准备了许多多余的属性idclass属性来进行进一步的 CSS样式设置,则可以使用也可以不使用它们),

使用样式样式不会更改数据框列的数据/数据类型。

现在,我向您展示如何在Jupyter笔记本中找到它-有关HTML文件形式的演示文稿,请参阅问题末尾的注释。

我将假设您的列DOB 已经具有该类型datetime64(您已表明知道如何访问它)。我准备了一个简单的数据框(只有一列),向您展示了一些基本样式:

  • 没有样式:

       df
          DOB
0  2019-07-03
1  2019-08-03
2  2019-09-03
3  2019-10-03
  • 样式为mm/dd/yyyy

       df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")})
          DOB
0  07/03/2019
1  08/03/2019
2  09/03/2019
3  10/03/2019
  • 样式为dd-mm-yyyy

       df.style.format({"DOB": lambda t: t.strftime("%d-%m-%Y")}) 
          DOB
0  03-07-2019
1  03-08-2019
2  03-09-2019
3  03-10-2019

小心!
返回的对象不是数据框-它是类的对象Styler,因此请勿将其分配回df

不要这样做:

df = df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")})    # Don´t do this!

(每个数据框都可以通过其.style属性访问其Styler对象,我们更改了该df.style对象,而不是数据框本身。)


问题和解答:

  • 问: 为什么在Jupyter笔记本单元格中用作最后一条命令的Styler对象(或返回它的表达式)显示您的(样式化)表,而不显示Styler对象本身?

  • 答:因为每个Styler对象都有一个回调方法._repr_html_(),该方法返回用于呈现数据框的HTML代码(作为漂亮的HTML表)。

    Jupyter Notebook IDE 自动调用此方法以呈现具有此方法的对象。


注意:

您不需要Jupyter笔记本进行样式设置(即,在不更改数据/数据类型的情况下很好地输出数据框)。

render()如果您想使用HTML代码获取字符串(例如,用于将格式化的数据帧发布到Web上,或仅以HTML格式显示表格),则Styler对象也具有一种方法:

df_styler = df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")})
HTML_string = df_styler.render()

There is a difference between

  • the content of a dataframe cell (a binary value) and
  • its presentation (displaying it) for us, humans.

So the question is: How to reach the appropriate presentation of my datas without changing the data / data types themselves?

Here is the answer:

  • If you use the Jupyter notebook for displaying your dataframe, or
  • if you want to reach a presentation in the form of an HTML file (even with many prepared superfluous id and class attributes for further CSS styling — you may or you may not use them),

use styling. Styling don’t change data / data types of columns of your dataframe.

Now I show you how to reach it in the Jupyter notebook — for a presentation in the form of HTML file see the note near the end of the question.

I will suppose that your column DOB already has the type datetime64 (you shown that you know how to reach it). I prepared a simple dataframe (with only one column) to show you some basic styling:

  • Not styled:

       df
    
          DOB
0  2019-07-03
1  2019-08-03
2  2019-09-03
3  2019-10-03
  • Styling it as mm/dd/yyyy:

       df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")})
    
          DOB
0  07/03/2019
1  08/03/2019
2  09/03/2019
3  10/03/2019
  • Styling it as dd-mm-yyyy:

       df.style.format({"DOB": lambda t: t.strftime("%d-%m-%Y")}) 
    
          DOB
0  03-07-2019
1  03-08-2019
2  03-09-2019
3  03-10-2019

Be careful!
The returning object is NOT a dataframe — it is an object of the class Styler, so don’t assign it back to df:

Don´t do this:

df = df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")})    # Don´t do this!

(Every dataframe has its Styler object accessible by its .style property, and we changed this df.style object, not the dataframe itself.)


Questions and Answers:

  • Q: Why your Styler object (or an expression returning it) used as the last command in a Jupyter notebook cell displays your (styled) table, and not the Styler object itself?

  • A: Because every Styler object has a callback method ._repr_html_() which returns an HTML code for rendering your dataframe (as a nice HTML table).

    Jupyter Notebook IDE calls this method automatically to render objects which have it.


Note:

You don’t need the Jupyter notebook for styling (i.e. for nice outputting a dataframe without changing its data / data types).

A Styler object has a method render(), too, if you want to obtain a string with the HTML code (e.g. for publishing your formatted dataframe to the Web, or simply present your table in the HTML format):

df_styler = df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")})
HTML_string = df_styler.render()

回答 5

下面的代码更改为“ datetime”类型,并以给定的格式字符串格式化。效果很好!

df['DOB']=pd.to_datetime(df['DOB'].dt.strftime('%m/%d/%Y'))

Below code changes to ‘datetime’ type and also formats in the given format string. Works well!

df['DOB']=pd.to_datetime(df['DOB'].dt.strftime('%m/%d/%Y'))

回答 6

您可以尝试将日期格式转换为DD-MM-YYYY:

df['DOB'] = pd.to_datetime(df['DOB'], dayfirst = True)

You can try this it’ll convert the date format to DD-MM-YYYY:

df['DOB'] = pd.to_datetime(df['DOB'], dayfirst = True)

了解inplace = True

问题:了解inplace = True

pandas库中多次出现改变就地等物体的方式与下面的语句一个选项…

df.dropna(axis='index', how='all', inplace=True)

我很好奇返回的内容以及inplace=True传递时与传递对象时如何处理该对象inplace=False

所有操作self何时都在修改inplace=True?何时inplace=False立即创建一个新对象,例如new_df = self然后new_df返回?

In the pandas library many times there is an option to change the object inplace such as with the following statement…

df.dropna(axis='index', how='all', inplace=True)

I am curious what is being returned as well as how the object is handled when inplace=True is passed vs. when inplace=False.

Are all operations modifying self when inplace=True? And when inplace=False is a new object created immediately such as new_df = self and then new_df is returned?


回答 0

如果inplace=True通过,该数据被重命名到位(它没有返回值),所以你会使用:

df.an_operation(inplace=True)

inplace=False传递(这是默认值,所以没有必要),执行操作,并返回该对象的副本,所以你会使用:

df = df.an_operation(inplace=False) 

When inplace=True is passed, the data is renamed in place (it returns nothing), so you’d use:

df.an_operation(inplace=True)

When inplace=False is passed (this is the default value, so isn’t necessary), performs the operation and returns a copy of the object, so you’d use:

df = df.an_operation(inplace=False) 

回答 1

在大熊猫中,inplace = True是否有害?

TLDR;是的,是的。

  • inplace,顾名思义,通常不会阻止创建副本,并且(几乎)从不提供任何性能优势
  • inplace 不适用于方法链接
  • inplace 对于初学者来说是一个常见的陷阱,因此删除此选项将简化API

我不建议设置此参数,因为它的作用很小。请参阅此GitHub问题,其中建议在inplaceapi范围内弃用该参数。

一个普遍的误解是,使用inplace=True会导致更高效或更优化的代码。实际上,使用绝对没有性能上的好处inplace=True。无论是就地和外的地方版本创建数据的副本无论如何,与就地版本自动分配拷贝回来。

inplace=True对于初学者来说是一个常见的陷阱。例如,它可以触发SettingWithCopyWarning

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})

df2 = df[df['a'] > 1]
df2['b'].replace({'x': 'abc'}, inplace=True)
# SettingWithCopyWarning: 
# A value is trying to be set on a copy of a slice from a DataFrame

使用inplace=True 可能会或可能不会在DataFrame列上调用函数。当涉及链式索引时,尤其如此。

似乎上述问题还不够,inplace=True阻碍了方法链接。对比一下

result = df.some_function1().reset_index().some_function2()

相对于

temp = df.some_function1()
temp.reset_index(inplace=True)
result = temp.some_function2()

前者有助于更好的代码组织和可读性。


另一个支持的说法是,set_axis最近更改了API,以便将inplace默认值从True切换为False。参见GH27600。出色的开发人员!

In pandas, is inplace = True considered harmful, or not?

TLDR; Yes, yes it is.

  • inplace, contrary to what the name implies, often does not prevent copies from being created, and (almost) never offers any performance benefits
  • inplace does not work with method chaining
  • inplace is a common pitfall for beginners, so removing this option will simplify the API

I don’t advise setting this parameter as it serves little purpose. See this GitHub issue which proposes the inplace argument be deprecated api-wide.

It is a common misconception that using inplace=True will lead to more efficient or optimized code. In reality, there are absolutely no performance benefits to using inplace=True. Both the in-place and out-of-place versions create a copy of the data anyway, with the in-place version automatically assigning the copy back.

inplace=True is a common pitfall for beginners. For example, it can trigger the SettingWithCopyWarning:

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})

df2 = df[df['a'] > 1]
df2['b'].replace({'x': 'abc'}, inplace=True)
# SettingWithCopyWarning: 
# A value is trying to be set on a copy of a slice from a DataFrame

Calling a function on a DataFrame column with inplace=True may or may not work. This is especially true when chained indexing is involved.

As if the problems described above aren’t enough, inplace=True also hinders method chaining. Contrast the working of

result = df.some_function1().reset_index().some_function2()

As opposed to

temp = df.some_function1()
temp.reset_index(inplace=True)
result = temp.some_function2()

The former lends itself to better code organization and readability.


Another supporting claim is that the API for set_axis was recently changed such that inplace default value was switched from True to False. See GH27600. Great job devs!


回答 2

我使用它的方式是

# Have to assign back to dataframe (because it is a new copy)
df = df.some_operation(inplace=False) 

要么

# No need to assign back to dataframe (because it is on the same copy)
df.some_operation(inplace=True)

结论:

 if inplace is False
      Assign to a new variable;
 else
      No need to assign

The way I use it is

# Have to assign back to dataframe (because it is a new copy)
df = df.some_operation(inplace=False) 

Or

# No need to assign back to dataframe (because it is on the same copy)
df.some_operation(inplace=True)

CONCLUSION:

 if inplace is False
      Assign to a new variable;
 else
      No need to assign

回答 3

inplace参数:

df.dropna(axis='index', how='all', inplace=True)

Pandas与一般的手段:

1.熊猫创建原始数据的副本

2. …对它进行一些计算

3. …将结果分配给原始数据。

4. …删除副本。

正如你在我的答案其余阅读下面的进一步,我们还可以有充分的理由来使用此参数即inplace operations,但如果能,我们应该避免,因为它产生更多的问题,如:

1.您的代码将更难调试(实际上,SettingwithCopyWarning表示警告您可能出现的问题)

2.与方法链冲突


因此,甚至在某些情况下我们应该使用它吗?

绝对可以。如果我们使用熊猫或任何工具处理庞大的数据集,我们很容易面对这样的情况,即一些大数据会消耗我们的整个内存。为了避免这种不良影响,我们可以使用诸如方法链接之类的一些技术:

(
    wine.rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
    .query("alcohol > 14 and color_filter == 1")
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
    .loc[:, ["alcohol", "ci", "hue"]]
)

这使我们的代码更紧凑(尽管也更难以解释和调试),并且由于链接的方法与另一种方法的返回值一起使用而占用的内存更少,因此仅产生输入数据的一个副本。我们可以清楚地看到,执行此操作后,我们将有2倍的原始数据内存消耗。

或者我们可以使用inplace参数(尽管也更难解释和调试),我们的内存消耗将是原始数据的2倍,但是此操作后的内存消耗仍然是原始数据的1倍,如果有人每次使用庞大的数据集时都确切知道这可能是一个原始数据,大收益。


定论:

避免使用inplace参数,除非您不使用大量数据,并且在仍然使用它的情况下要注意其可能的问题。

The inplace parameter:

df.dropna(axis='index', how='all', inplace=True)

in Pandas and in general means:

1. Pandas creates a copy of the original data

2. … does some computation on it

3. … assigns the results to the original data.

4. … deletes the copy.

As you can read in the rest of my answer’s further below, we still can have good reason to use this parameter i.e. the inplace operations, but we should avoid it if we can, as it generate more issues, as:

1. Your code will be harder to debug (Actually SettingwithCopyWarning stands for warning you to this possible problem)

2. Conflict with method chaining


So there is even case when we should use it yet?

Definitely yes. If we use pandas or any tool for handeling huge dataset, we can easily face the situation, where some big data can consume our entire memory. To avoid this unwanted effect we can use some technics like method chaining:

(
    wine.rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
    .query("alcohol > 14 and color_filter == 1")
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
    .loc[:, ["alcohol", "ci", "hue"]]
)

which make our code more compact (though harder to interpret and debug too) and consumes less memory as the chained methods works with the other method’s returned values, thus resulting in only one copy of the input data. We can see clearly, that we will have 2 x original data memory consumption after this operations.

Or we can use inplace parameter (though harder to interpret and debug too) our memory consumption will be 2 x original data, but our memory consumption after this operation remains 1 x original data, which if somebody whenever worked with huge datasets exactly knows can be a big benefit.


Final conclusion:

Avoid using inplace parameter unless you don’t work with huge data and be aware of its possible issues in case of still using of it.


回答 4

将其保存到相同的变量

data["column01"].where(data["column01"]< 5, inplace=True)

将其保存到单独的变量

data["column02"] = data["column01"].where(data["column1"]< 5)

但是,您始终可以覆盖变量

data["column01"] = data["column01"].where(data["column1"]< 5)

仅供参考:默认 inplace = False

Save it to the same variable

data["column01"].where(data["column01"]< 5, inplace=True)

Save it to a separate variable

data["column02"] = data["column01"].where(data["column1"]< 5)

But, you can always overwrite the variable

data["column01"] = data["column01"].where(data["column1"]< 5)

FYI: In default inplace = False


回答 5

当尝试使用函数对Pandas数据框进行更改时,如果要将更改提交到数据框,则使用“ inplace = True”。因此,以下代码中的第一行将“ df”中第一列的名称更改为“ Grades”。如果要查看生成的数据库,我们需要调用数据库。

df.rename(columns={0: 'Grades'}, inplace=True)
df

当我们不想提交更改而只打印结果数据库时,我们使用’inplace = False’(这也是默认值)。因此,实际上是在不更改原始数据库的情况下打印具有已提交更改的原始数据库的副本。

为了更清楚一点,以下代码执行相同的操作:

#Code 1
df.rename(columns={0: 'Grades'}, inplace=True)
#Code 2
df=df.rename(columns={0: 'Grades'}, inplace=False}

When trying to make changes to a Pandas dataframe using a function, we use ‘inplace=True’ if we want to commit the changes to the dataframe. Therefore, the first line in the following code changes the name of the first column in ‘df’ to ‘Grades’. We need to call the database if we want to see the resulting database.

df.rename(columns={0: 'Grades'}, inplace=True)
df

We use ‘inplace=False’ (this is also the default value) when we don’t want to commit the changes but just print the resulting database. So, in effect a copy of the original database with the committed changes is printed without altering the original database.

Just to be more clear, the following codes do the same thing:

#Code 1
df.rename(columns={0: 'Grades'}, inplace=True)
#Code 2
df=df.rename(columns={0: 'Grades'}, inplace=False}

回答 6

inplace=True 是否使用取决于您是否要更改原始df。

df.drop_duplicates()

将仅查看丢弃的值,而不会对df进行任何更改

df.drop_duplicates(inplace  = True)

将删除值并更改df。

希望这可以帮助。:)

inplace=True is used depending if you want to make changes to the original df or not.

df.drop_duplicates()

will only make a view of dropped values but not make any changes to df

df.drop_duplicates(inplace  = True)

will drop values and make changes to df.

Hope this helps.:)


回答 7

inplace=True使功能不纯。它更改原始数据框并返回无。在这种情况下,您会中断DSL链。由于大多数数据框功能都返回一个新的数据框,因此可以方便地使用DSL。喜欢

df.sort_values().rename().to_csv()

inplace=True返回值为None的函数调用,并且DSL链断开。例如

df.sort_values(inplace=True).rename().to_csv()

会抛出 NoneType object has no attribute 'rename'

与python的内置排序和排序类似。lst.sort()返回Nonesorted(lst)返回一个新列表。

通常,inplace=True除非有特殊原因,否则请勿使用。当您必须编写类似的重新分配代码时df = df.sort_values(),请尝试将函数调用附加到DSL链中,例如

df = pd.read_csv().sort_values()...

inplace=True makes the function impure. It changes the original dataframe and returns None. In that case, You breaks the DSL chain. Because most of dataframe functions return a new dataframe, you can use the DSL conveniently. Like

df.sort_values().rename().to_csv()

Function call with inplace=True returns None and DSL chain is broken. For example

df.sort_values(inplace=True).rename().to_csv()

will throw NoneType object has no attribute 'rename'

Something similar with python’s build-in sort and sorted. lst.sort() returns None and sorted(lst) returns a new list.

Generally, do not use inplace=True unless you have specific reason of doing so. When you have to write reassignment code like df = df.sort_values(), try attaching the function call in the DSL chain, e.g.

df = pd.read_csv().sort_values()...

回答 8

就我在大熊猫方面的经验而言,我想回答一下。

‘inplace = True’参数代表数据帧必须永久更改,例如。

    df.dropna(axis='index', how='all', inplace=True)

更改相同的数据框(因为这只大熊猫在索引中找到NaN条目并将其删除)。如果我们尝试

    df.dropna(axis='index', how='all')

熊猫显示了我们进行了更改的数据框,但不会修改原始数据框“ df”。

As Far my experience in pandas I would like to answer.

The ‘inplace=True’ argument stands for the data frame has to make changes permanent eg.

    df.dropna(axis='index', how='all', inplace=True)

changes the same dataframe (as this pandas find NaN entries in index and drops them). If we try

    df.dropna(axis='index', how='all')

pandas shows the dataframe with changes we make but will not modify the original dataframe ‘df’.


回答 9

如果您不使用inplace = True或使用inplace = False,则基本上可以得到一个副本。

因此,例如:

testdf.sort_values(inplace=True, by='volume', ascending=False)

会改变结构,数据按降序排列。

然后:

testdf2 = testdf.sort_values( by='volume', ascending=True)

将使testdf2成为副本。值将全部相同,但排序将颠倒,您将拥有一个独立的对象。

然后在另一列中,说出LongMA,您可以:

testdf2.LongMA = testdf2.LongMA -1

testdf中的LongMA列将保留原始值,而testdf2列将减少值。

随着计算链的增长以及数据帧的副本具有其自己的生命周期,跟踪差异至关重要。

If you don’t use inplace=True or you use inplace=False you basically get back a copy.

So for instance:

testdf.sort_values(inplace=True, by='volume', ascending=False)

will alter the structure with the data sorted in descending order.

then:

testdf2 = testdf.sort_values( by='volume', ascending=True)

will make testdf2 a copy. the values will all be the same but the sort will be reversed and you will have an independent object.

then given another column, say LongMA and you do:

testdf2.LongMA = testdf2.LongMA -1

the LongMA column in testdf will have the original values and testdf2 will have the decrimented values.

It is important to keep track of the difference as the chain of calculations grows and the copies of dataframes have their own lifecycle.


回答 10

是的,在Pandas中,我们有很多具有参数的函数,inplace但默认情况下将其分配给False

因此,当您执行df.dropna(axis='index', how='all', inplace=False)此操作时,它认为您不想更改orignial DataFrame,因此它为您创建具有所需更改的新副本

但是,当您将inplace参数更改为True

然后,这等效于明确地说,我不需要新的副本,DataFrame而是对给定的内容进行更改DataFrame

这迫使Python解释器不要创建新的DataFrame

但是您也可以inplace通过将结果重新分配给原始DataFrame来避免使用参数

df = df.dropna(axis='index', how='all')

Yes, in Pandas we have many functions has the parameter inplace but by default it is assigned to False.

So, when you do df.dropna(axis='index', how='all', inplace=False) it thinks that you do not want to change the orignial DataFrame, therefore it instead creates a new copy for you with the required changes.

But, when you change the inplace parameter to True

Then it is equivalent to explicitly say that I do not want a new copy of the DataFrame instead do the changes on the given DataFrame

This forces the Python interpreter to not to create a new DataFrame

But you can also avoid using the inplace parameter by reassigning the result to the orignal DataFrame

df = df.dropna(axis='index', how='all')


读取压缩文件作为Pandas DataFrame

问题:读取压缩文件作为Pandas DataFrame

我正在尝试解压缩csv文件并将其传递到熊猫中,以便我可以处理该文件。
到目前为止,我尝试过的代码是:

import requests, zipfile, StringIO
r = requests.get('http://data.octo.dc.gov/feeds/crime_incidents/archive/crime_incidents_2013_CSV.zip')
z = zipfile.ZipFile(StringIO.StringIO(r.content))
crime2013 = pandas.read_csv(z.read('crime_incidents_2013_CSV.csv'))

在最后一行之后,尽管python能够获取文件,但在错误末尾出现“不存在”。

有人可以告诉我我做错了什么吗?

I’m trying to unzip a csv file and pass it into pandas so I can work on the file.
The code I have tried so far is:

import requests, zipfile, StringIO
r = requests.get('http://data.octo.dc.gov/feeds/crime_incidents/archive/crime_incidents_2013_CSV.zip')
z = zipfile.ZipFile(StringIO.StringIO(r.content))
crime2013 = pandas.read_csv(z.read('crime_incidents_2013_CSV.csv'))

After the last line, although python is able to get the file, I get a “does not exist” at the end of the error.

Can someone tell me what I’m doing incorrectly?


回答 0

如果要将压缩文件或tar.gz文件读入pandas数据帧,则这些read_csv方法包括此特定实现。

df = pd.read_csv('filename.zip')

或长格式:

df = pd.read_csv('filename.zip', compression='zip', header=0, sep=',', quotechar='"')

docs中压缩参数的说明:

压缩:{‘infer’,’gzip’,’bz2’,’zip’,’xz’,无},默认为’infer’用于对磁盘数据进行实时解压缩。如果’infer’和filepath_or_buffer与路径类似,则从以下扩展名检测压缩:’.gz’,’。bz2’,’。zip’或’.xz’(否则不进行解压缩)。如果使用“ zip”,则ZIP文件必须仅包含一个要读取的数据文件。设置为“无”将不进行解压缩。

0.18.1版中的新功能:支持“ zip”和“ xz”压缩。

If you want to read a zipped or a tar.gz file into pandas dataframe, the read_csv methods includes this particular implementation.

df = pd.read_csv('filename.zip')

Or the long form:

df = pd.read_csv('filename.zip', compression='zip', header=0, sep=',', quotechar='"')

Description of the compression argument from the docs:

compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’ For on-the-fly decompression of on-disk data. If ‘infer’ and filepath_or_buffer is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise no decompression). If using ‘zip’, the ZIP file must contain only one data file to be read in. Set to None for no decompression.

New in version 0.18.1: support for ‘zip’ and ‘xz’ compression.


回答 1

我认为您想要openZipFile,它返回一个类似文件的对象,而不是read

In [11]: crime2013 = pd.read_csv(z.open('crime_incidents_2013_CSV.csv'))

In [12]: crime2013
Out[12]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24567 entries, 0 to 24566
Data columns (total 15 columns):
CCN                            24567  non-null values
REPORTDATETIME                 24567  non-null values
SHIFT                          24567  non-null values
OFFENSE                        24567  non-null values
METHOD                         24567  non-null values
LASTMODIFIEDDATE               24567  non-null values
BLOCKSITEADDRESS               24567  non-null values
BLOCKXCOORD                    24567  non-null values
BLOCKYCOORD                    24567  non-null values
WARD                           24563  non-null values
ANC                            24567  non-null values
DISTRICT                       24567  non-null values
PSA                            24567  non-null values
NEIGHBORHOODCLUSTER            24263  non-null values
BUSINESSIMPROVEMENTDISTRICT    3613  non-null values
dtypes: float64(4), int64(1), object(10)

I think you want to open the ZipFile, which returns a file-like object, rather than read:

In [11]: crime2013 = pd.read_csv(z.open('crime_incidents_2013_CSV.csv'))

In [12]: crime2013
Out[12]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24567 entries, 0 to 24566
Data columns (total 15 columns):
CCN                            24567  non-null values
REPORTDATETIME                 24567  non-null values
SHIFT                          24567  non-null values
OFFENSE                        24567  non-null values
METHOD                         24567  non-null values
LASTMODIFIEDDATE               24567  non-null values
BLOCKSITEADDRESS               24567  non-null values
BLOCKXCOORD                    24567  non-null values
BLOCKYCOORD                    24567  non-null values
WARD                           24563  non-null values
ANC                            24567  non-null values
DISTRICT                       24567  non-null values
PSA                            24567  non-null values
NEIGHBORHOODCLUSTER            24263  non-null values
BUSINESSIMPROVEMENTDISTRICT    3613  non-null values
dtypes: float64(4), int64(1), object(10)

回答 2

似乎您甚至不必再指定压缩。以下代码段将文件名中的数据加载到df中。

import pandas as pd
df = pd.read_csv('filename.zip')

(当然,如果它们与默认值不同,则需要指定分隔符,标头等。)

It seems you don’t even have to specify the compression any more. The following snippet loads the data from filename.zip into df.

import pandas as pd
df = pd.read_csv('filename.zip')

(Of course you will need to specify separator, header, etc. if they are different from the defaults.)


回答 3

对于“ zip ”文件,您可以使用import zipfile并且您的代码将仅通过以下几行工作:

import zipfile
import pandas as pd
with zipfile.ZipFile("Crime_Incidents_in_2013.zip") as z:
   with z.open("Crime_Incidents_in_2013.csv") as f:
      train = pd.read_csv(f, header=0, delimiter="\t")
      print(train.head())    # print the first 5 rows

结果将是:

X,Y,CCN,REPORT_DAT,SHIFT,METHOD,OFFENSE,BLOCK,XBLOCK,YBLOCK,WARD,ANC,DISTRICT,PSA,NEIGHBORHOOD_CLUSTER,BLOCK_GROUP,CENSUS_TRACT,VOTING_PRECINCT,XCOORD,YCOORD,LATITUDE,LONGITUDE,BID,START_DATE,END_DATE,OBJECTID
0  -77.054968548763071,38.899775938598317,0925135...                                                                                                                                                               
1  -76.967309569035052,38.872119553647011,1003352...                                                                                                                                                               
2  -76.996184958456539,38.927921847721443,1101010...                                                                                                                                                               
3  -76.943077541353617,38.883686046653935,1104551...                                                                                                                                                               
4  -76.939209158039446,38.892278093281632,1125028...

For “zip” files, you can use import zipfile and your code will be working simply with these lines:

import zipfile
import pandas as pd
with zipfile.ZipFile("Crime_Incidents_in_2013.zip") as z:
   with z.open("Crime_Incidents_in_2013.csv") as f:
      train = pd.read_csv(f, header=0, delimiter="\t")
      print(train.head())    # print the first 5 rows

And the result will be:

X,Y,CCN,REPORT_DAT,SHIFT,METHOD,OFFENSE,BLOCK,XBLOCK,YBLOCK,WARD,ANC,DISTRICT,PSA,NEIGHBORHOOD_CLUSTER,BLOCK_GROUP,CENSUS_TRACT,VOTING_PRECINCT,XCOORD,YCOORD,LATITUDE,LONGITUDE,BID,START_DATE,END_DATE,OBJECTID
0  -77.054968548763071,38.899775938598317,0925135...                                                                                                                                                               
1  -76.967309569035052,38.872119553647011,1003352...                                                                                                                                                               
2  -76.996184958456539,38.927921847721443,1101010...                                                                                                                                                               
3  -76.943077541353617,38.883686046653935,1104551...                                                                                                                                                               
4  -76.939209158039446,38.892278093281632,1125028...

回答 4

https://www.kaggle.com/jboysen/quick-gz-pandas-tutorial

请点击此链接。

import pandas as pd
traffic_station_df = pd.read_csv('C:\\Folders\\Jupiter_Feed.txt.gz', compression='gzip',
                                 header=1, sep='\t', quotechar='"')

#traffic_station_df['Address'] = 'address'

#traffic_station_df.append(traffic_station_df)
print(traffic_station_df)

https://www.kaggle.com/jboysen/quick-gz-pandas-tutorial

Please follow this link.

import pandas as pd
traffic_station_df = pd.read_csv('C:\\Folders\\Jupiter_Feed.txt.gz', compression='gzip',
                                 header=1, sep='\t', quotechar='"')

#traffic_station_df['Address'] = 'address'

#traffic_station_df.append(traffic_station_df)
print(traffic_station_df)

FutureWarning:逐元素比较失败;返回标量,但将来将执行元素比较

问题:FutureWarning:逐元素比较失败;返回标量,但将来将执行元素比较

0.19.1在Python 3上使用Pandas 。我在这些代码行上收到警告。我正在尝试获取一个包含所有Peter在column处存在string的行号的列表Unnamed: 5

df = pd.read_excel(xls_path)
myRows = df[df['Unnamed: 5'] == 'Peter'].index.tolist()

它产生一个警告:

"\Python36\lib\site-packages\pandas\core\ops.py:792: FutureWarning: elementwise 
comparison failed; returning scalar, but in the future will perform 
elementwise comparison 
result = getattr(x, name)(y)"

这是什么FutureFarning,由于它似乎起作用,因此我应该忽略它。

I am using Pandas 0.19.1 on Python 3. I am getting a warning on these lines of code. I’m trying to get a list that contains all the row numbers where string Peter is present at column Unnamed: 5.

df = pd.read_excel(xls_path)
myRows = df[df['Unnamed: 5'] == 'Peter'].index.tolist()

It produces a Warning:

"\Python36\lib\site-packages\pandas\core\ops.py:792: FutureWarning: elementwise 
comparison failed; returning scalar, but in the future will perform 
elementwise comparison 
result = getattr(x, name)(y)"

What is this FutureWarning and should I ignore it since it seems to work.


回答 0

此FutureWarning并非来自Pandas,而是来自numpy,并且该错误也影响了matplotlib和其他漏洞,以下是在更接近问题根源的位置重现警告的方法:

import numpy as np
print(np.__version__)   # Numpy version '1.12.0'
'x' in np.arange(5)       #Future warning thrown here

FutureWarning: elementwise comparison failed; returning scalar instead, but in the 
future will perform elementwise comparison
False

使用double equals运算符重现此错误的另一种方法:

import numpy as np
np.arange(5) == np.arange(5).astype(str)    #FutureWarning thrown here

受此FutureWarning影响的Matplotlib示例在其颤动图实施下:https ://matplotlib.org/examples/pylab_examples/quiver_demo.html

这里发生了什么?

当您将字符串与numpy的数字类型进行比较时,Numpy和本机python之间会发生什么分歧。请注意,左操作数是python的草皮,是原始字符串,中间操作是python的草皮,而右操作数是numpy的草皮。您应该返回Python样式的Scalar还是Numpy样式的ndarray布尔值?Numpy说布尔的ndarray,Pythonic开发人员不同意。经典对峙。

如果数组中存在item,应该是元素比较还是标量?

如果您的代码或库使用in==运算符将python字符串与numpy ndarrays比较,则它们不兼容,因此,当您尝试使用它时,它将返回标量,但仅在现在。警告表示将来这种行为可能会改变,因此,如果python / numpy决定采用Numpy样式,则您的代码会全程吐槽。

提交的错误报告:

Numpy和Python处于僵持状态,目前操作返回标量,但将来可能会改变。

https://github.com/numpy/numpy/issues/6784

https://github.com/pandas-dev/pandas/issues/7830

两种解决方法:

无论您锁定Python和numpy的版本,忽略这些警告并期望行为不改变,或转换的左侧和右侧的操作数==,并in从一个numpy的类型或原始数值Python类型。

全局禁止警告:

import warnings
import numpy as np
warnings.simplefilter(action='ignore', category=FutureWarning)
print('x' in np.arange(5))   #returns False, without Warning

逐行抑制警告。

import warnings
import numpy as np

with warnings.catch_warnings():
    warnings.simplefilter(action='ignore', category=FutureWarning)
    print('x' in np.arange(2))   #returns False, warning is suppressed

print('x' in np.arange(10))   #returns False, Throws FutureWarning

只需按名称隐藏警告,然后在其旁边添加一个大声注释,提及python和numpy的当前版本,并说此代码很脆弱,并且需要这些版本,并在此处添加了链接。踢罐子的路。

TLDR: pandas是绝地武士;numpy是小屋 并且python是银河帝国。 https://youtu.be/OZczsiCfQQk?t=3

This FutureWarning isn’t from Pandas, it is from numpy and the bug also affects matplotlib and others, here’s how to reproduce the warning nearer to the source of the trouble:

import numpy as np
print(np.__version__)   # Numpy version '1.12.0'
'x' in np.arange(5)       #Future warning thrown here

FutureWarning: elementwise comparison failed; returning scalar instead, but in the 
future will perform elementwise comparison
False

Another way to reproduce this bug using the double equals operator:

import numpy as np
np.arange(5) == np.arange(5).astype(str)    #FutureWarning thrown here

An example of Matplotlib affected by this FutureWarning under their quiver plot implementation: https://matplotlib.org/examples/pylab_examples/quiver_demo.html

What’s going on here?

There is a disagreement between Numpy and native python on what should happen when you compare a strings to numpy’s numeric types. Notice the left operand is python’s turf, a primitive string, and the middle operation is python’s turf, but the right operand is numpy’s turf. Should you return a Python style Scalar or a Numpy style ndarray of Boolean? Numpy says ndarray of bool, Pythonic developers disagree. Classic standoff.

Should it be elementwise comparison or Scalar if item exists in the array?

If your code or library is using the in or == operators to compare python string to numpy ndarrays, they aren’t compatible, so when if you try it, it returns a scalar, but only for now. The Warning indicates that in the future this behavior might change so your code pukes all over the carpet if python/numpy decide to do adopt Numpy style.

Submitted Bug reports:

Numpy and Python are in a standoff, for now the operation returns a scalar, but in the future it may change.

https://github.com/numpy/numpy/issues/6784

https://github.com/pandas-dev/pandas/issues/7830

Two workaround solutions:

Either lockdown your version of python and numpy, ignore the warnings and expect the behavior to not change, or convert both left and right operands of == and in to be from a numpy type or primitive python numeric type.

Suppress the warning globally:

import warnings
import numpy as np
warnings.simplefilter(action='ignore', category=FutureWarning)
print('x' in np.arange(5))   #returns False, without Warning

Suppress the warning on a line by line basis.

import warnings
import numpy as np

with warnings.catch_warnings():
    warnings.simplefilter(action='ignore', category=FutureWarning)
    print('x' in np.arange(2))   #returns False, warning is suppressed

print('x' in np.arange(10))   #returns False, Throws FutureWarning

Just suppress the warning by name, then put a loud comment next to it mentioning the current version of python and numpy, saying this code is brittle and requires these versions and put a link to here. Kick the can down the road.

TLDR: pandas are Jedi; numpy are the hutts; and python is the galactic empire. https://youtu.be/OZczsiCfQQk?t=3


回答 1

当我尝试将index_col读取文件设置为Panda的数据帧时,出现相同的错误:

df = pd.read_csv('my_file.tsv', sep='\t', header=0, index_col=['0'])  ## or same with the following
df = pd.read_csv('my_file.tsv', sep='\t', header=0, index_col=[0])

我以前从未遇到过这样的错误。我仍然试图找出背后的原因(使用@Eric Leschinski的解释和其他解释)。

无论如何,在我找出原因之前,以下方法可以立即解决该问题:

df = pd.read_csv('my_file.tsv', sep='\t', header=0)  ## not setting the index_col
df.set_index(['0'], inplace=True)

一旦弄清这种行为的原因,我将立即更新。

I get the same error when I try to set the index_col reading a file into a Panda‘s data-frame:

df = pd.read_csv('my_file.tsv', sep='\t', header=0, index_col=['0'])  ## or same with the following
df = pd.read_csv('my_file.tsv', sep='\t', header=0, index_col=[0])

I have never encountered such an error previously. I still am trying to figure out the reason behind this (using @Eric Leschinski explanation and others).

Anyhow, the following approach solves the problem for now until I figure the reason out:

df = pd.read_csv('my_file.tsv', sep='\t', header=0)  ## not setting the index_col
df.set_index(['0'], inplace=True)

I will update this as soon as I figure out the reason for such behavior.


回答 2

我对同一条警告消息的体验是由TypeError引起的。

TypeError:类型比较无效

因此,您可能要检查 Unnamed: 5

for x in df['Unnamed: 5']:
  print(type(x))  # are they 'str' ?

这是我可以复制警告消息的方法:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 2), columns=['num1', 'num2'])
df['num3'] = 3
df.loc[df['num3'] == '3', 'num3'] = 4  # TypeError and the Warning
df.loc[df['num3'] == 3, 'num3'] = 4  # No Error

希望能帮助到你。

My experience to the same warning message was caused by TypeError.

TypeError: invalid type comparison

So, you may want to check the data type of the Unnamed: 5

for x in df['Unnamed: 5']:
  print(type(x))  # are they 'str' ?

Here is how I can replicate the warning message:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 2), columns=['num1', 'num2'])
df['num3'] = 3
df.loc[df['num3'] == '3', 'num3'] = 4  # TypeError and the Warning
df.loc[df['num3'] == 3, 'num3'] = 4  # No Error

Hope it helps.


回答 3

无法击败Eric Leschinski的详细答案,但这是针对我认为尚未提及的原始问题的快速解决方法-将字符串放在列表中并使用.isin而不是==

例如:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Peter", "Joe"], "Number": [1, 2]})

# Raises warning using == to compare different types:
df.loc[df["Number"] == "2", "Number"]

# No warning using .isin:
df.loc[df["Number"].isin(["2"]), "Number"]

Can’t beat Eric Leschinski’s awesomely detailed answer, but here’s a quick workaround to the original question that I don’t think has been mentioned yet – put the string in a list and use .isin instead of ==

For example:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Peter", "Joe"], "Number": [1, 2]})

# Raises warning using == to compare different types:
df.loc[df["Number"] == "2", "Number"]

# No warning using .isin:
df.loc[df["Number"].isin(["2"]), "Number"]

回答 4

一个快速的解决方法是使用numpy.core.defchararray。我也遇到了同样的警告消息,并且能够使用上述模块来解决它。

import numpy.core.defchararray as npd
resultdataset = npd.equal(dataset1, dataset2)

A quick workaround for this is to use numpy.core.defchararray. I also faced the same warning message and was able to resolve it using above module.

import numpy.core.defchararray as npd
resultdataset = npd.equal(dataset1, dataset2)

回答 5

埃里克(Eric)的回答很有帮助,说明了麻烦来自将Pandas系列(包含NumPy数组)与Python字符串进行比较。不幸的是,他的两个解决方法都只是抑制了警告。

要首先编写不会引起警告的代码,请显式地将字符串与Series的每个元素进行比较,并为每个元素获取单独的布尔值。例如,您可以使用map和匿名函数。

myRows = df[df['Unnamed: 5'].map( lambda x: x == 'Peter' )].index.tolist()

Eric’s answer helpfully explains that the trouble comes from comparing a Pandas Series (containing a NumPy array) to a Python string. Unfortunately, his two workarounds both just suppress the warning.

To write code that doesn’t cause the warning in the first place, explicitly compare your string to each element of the Series and get a separate bool for each. For example, you could use map and an anonymous function.

myRows = df[df['Unnamed: 5'].map( lambda x: x == 'Peter' )].index.tolist()

回答 6

如果数组不太大或数组不太多,则可以通过将其左侧强制==为字符串来摆脱困境:

myRows = df[str(df['Unnamed: 5']) == 'Peter'].index.tolist()

但这如果df['Unnamed: 5']是字符串则要慢约1.5倍,如果df['Unnamed: 5']是小的numpy数组(长度= 10)则要慢25-30倍,如果是长度为100的numpy数组则要慢150-160倍(时间超过500次试验) 。

a = linspace(0, 5, 10)
b = linspace(0, 50, 100)
n = 500
string1 = 'Peter'
string2 = 'blargh'
times_a = zeros(n)
times_str_a = zeros(n)
times_s = zeros(n)
times_str_s = zeros(n)
times_b = zeros(n)
times_str_b = zeros(n)
for i in range(n):
    t0 = time.time()
    tmp1 = a == string1
    t1 = time.time()
    tmp2 = str(a) == string1
    t2 = time.time()
    tmp3 = string2 == string1
    t3 = time.time()
    tmp4 = str(string2) == string1
    t4 = time.time()
    tmp5 = b == string1
    t5 = time.time()
    tmp6 = str(b) == string1
    t6 = time.time()
    times_a[i] = t1 - t0
    times_str_a[i] = t2 - t1
    times_s[i] = t3 - t2
    times_str_s[i] = t4 - t3
    times_b[i] = t5 - t4
    times_str_b[i] = t6 - t5
print('Small array:')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_a), mean(times_str_a)))
print('Ratio of time with/without string conversion: {}'.format(mean(times_str_a)/mean(times_a)))

print('\nBig array')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_b), mean(times_str_b)))
print(mean(times_str_b)/mean(times_b))

print('\nString')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_s), mean(times_str_s)))
print('Ratio of time with/without string conversion: {}'.format(mean(times_str_s)/mean(times_s)))

结果:

Small array:
Time to compare without str conversion: 6.58464431763e-06 s. With str conversion: 0.000173756599426 s
Ratio of time with/without string conversion: 26.3881526541

Big array
Time to compare without str conversion: 5.44309616089e-06 s. With str conversion: 0.000870866775513 s
159.99474375821288

String
Time to compare without str conversion: 5.89370727539e-07 s. With str conversion: 8.30173492432e-07 s
Ratio of time with/without string conversion: 1.40857605178

If your arrays aren’t too big or you don’t have too many of them, you might be able to get away with forcing the left hand side of == to be a string:

myRows = df[str(df['Unnamed: 5']) == 'Peter'].index.tolist()

But this is ~1.5 times slower if df['Unnamed: 5'] is a string, 25-30 times slower if df['Unnamed: 5'] is a small numpy array (length = 10), and 150-160 times slower if it’s a numpy array with length 100 (times averaged over 500 trials).

a = linspace(0, 5, 10)
b = linspace(0, 50, 100)
n = 500
string1 = 'Peter'
string2 = 'blargh'
times_a = zeros(n)
times_str_a = zeros(n)
times_s = zeros(n)
times_str_s = zeros(n)
times_b = zeros(n)
times_str_b = zeros(n)
for i in range(n):
    t0 = time.time()
    tmp1 = a == string1
    t1 = time.time()
    tmp2 = str(a) == string1
    t2 = time.time()
    tmp3 = string2 == string1
    t3 = time.time()
    tmp4 = str(string2) == string1
    t4 = time.time()
    tmp5 = b == string1
    t5 = time.time()
    tmp6 = str(b) == string1
    t6 = time.time()
    times_a[i] = t1 - t0
    times_str_a[i] = t2 - t1
    times_s[i] = t3 - t2
    times_str_s[i] = t4 - t3
    times_b[i] = t5 - t4
    times_str_b[i] = t6 - t5
print('Small array:')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_a), mean(times_str_a)))
print('Ratio of time with/without string conversion: {}'.format(mean(times_str_a)/mean(times_a)))

print('\nBig array')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_b), mean(times_str_b)))
print(mean(times_str_b)/mean(times_b))

print('\nString')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_s), mean(times_str_s)))
print('Ratio of time with/without string conversion: {}'.format(mean(times_str_s)/mean(times_s)))

Result:

Small array:
Time to compare without str conversion: 6.58464431763e-06 s. With str conversion: 0.000173756599426 s
Ratio of time with/without string conversion: 26.3881526541

Big array
Time to compare without str conversion: 5.44309616089e-06 s. With str conversion: 0.000870866775513 s
159.99474375821288

String
Time to compare without str conversion: 5.89370727539e-07 s. With str conversion: 8.30173492432e-07 s
Ratio of time with/without string conversion: 1.40857605178

回答 7

就我而言,发出警告的原因仅仅是布尔索引的常规类型-因为该系列只有np.nan。示范(熊猫1.0.3):

>>> import pandas as pd
>>> import numpy as np
>>> pd.Series([np.nan, 'Hi']) == 'Hi'
0    False
1     True
>>> pd.Series([np.nan, np.nan]) == 'Hi'
~/anaconda3/envs/ms3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py:255: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  res_values = method(rvalues)
0    False
1    False

我认为对于pandas 1.0,他们确实希望您使用'string'允许pd.NA值的新数据类型:

>>> pd.Series([pd.NA, pd.NA]) == 'Hi'
0    False
1    False
>>> pd.Series([np.nan, np.nan], dtype='string') == 'Hi'
0    <NA>
1    <NA>
>>> (pd.Series([np.nan, np.nan], dtype='string') == 'Hi').fillna(False)
0    False
1    False

不喜欢他们在何时开始使用布尔索引等日常功能。

In my case, the warning occurred because of just the regular type of boolean indexing — because the series had only np.nan. Demonstration (pandas 1.0.3):

>>> import pandas as pd
>>> import numpy as np
>>> pd.Series([np.nan, 'Hi']) == 'Hi'
0    False
1     True
>>> pd.Series([np.nan, np.nan]) == 'Hi'
~/anaconda3/envs/ms3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py:255: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  res_values = method(rvalues)
0    False
1    False

I think with pandas 1.0 they really want you to use the new 'string' datatype which allows for pd.NA values:

>>> pd.Series([pd.NA, pd.NA]) == 'Hi'
0    False
1    False
>>> pd.Series([np.nan, np.nan], dtype='string') == 'Hi'
0    <NA>
1    <NA>
>>> (pd.Series([np.nan, np.nan], dtype='string') == 'Hi').fillna(False)
0    False
1    False

Don’t love at which point they tinkered with every-day functionality such as boolean indexing.


回答 8

我收到此警告是因为我认为我的列包含空字符串,但是在检查时,它包含了np.nan!

if df['column'] == '':

将我的列更改为空字符串有帮助:)

I got this warning because I thought my column contained null strings, but on checking, it contained np.nan!

if df['column'] == '':

Changing my column to empty strings helped :)


回答 9

我已经比较了几种可能的方法,包括熊猫,几种numpy方法和列表理解方法。

首先,让我们从基线开始:

>>> import numpy as np
>>> import operator
>>> import pandas as pd

>>> x = [1, 2, 1, 2]
>>> %time count = np.sum(np.equal(1, x))
>>> print("Count {} using numpy equal with ints".format(count))
CPU times: user 52 µs, sys: 0 ns, total: 52 µs
Wall time: 56 µs
Count 2 using numpy equal with ints

因此,我们的基准是该计数应该正确2,并且我们应该大约50 us

现在,我们尝试使用朴素的方法:

>>> x = ['s', 'b', 's', 'b']
>>> %time count = np.sum(np.equal('s', x))
>>> print("Count {} using numpy equal".format(count))
CPU times: user 145 µs, sys: 24 µs, total: 169 µs
Wall time: 158 µs
Count NotImplemented using numpy equal
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  """Entry point for launching an IPython kernel.

在这里,我们得到了错误的答案(NotImplemented != 2),这花了我们很长时间,并且引发了警告。

因此,我们将尝试另一种幼稚的方法:

>>> %time count = np.sum(x == 's')
>>> print("Count {} using ==".format(count))
CPU times: user 46 µs, sys: 1 µs, total: 47 µs
Wall time: 50.1 µs
Count 0 using ==

同样,错误答案(0 != 2)。这更加隐蔽,因为没有后续警告(0可以像一样传递2)。

现在,让我们尝试一个列表理解:

>>> %time count = np.sum([operator.eq(_x, 's') for _x in x])
>>> print("Count {} using list comprehension".format(count))
CPU times: user 55 µs, sys: 1 µs, total: 56 µs
Wall time: 60.3 µs
Count 2 using list comprehension

我们在这里得到正确的答案,而且速度很快!

另一种可能性pandas

>>> y = pd.Series(x)
>>> %time count = np.sum(y == 's')
>>> print("Count {} using pandas ==".format(count))
CPU times: user 453 µs, sys: 31 µs, total: 484 µs
Wall time: 463 µs
Count 2 using pandas ==

慢,但是正确!

最后,我将使用的选项是:将numpy数组转换为object类型:

>>> x = np.array(['s', 'b', 's', 'b']).astype(object)
>>> %time count = np.sum(np.equal('s', x))
>>> print("Count {} using numpy equal".format(count))
CPU times: user 50 µs, sys: 1 µs, total: 51 µs
Wall time: 55.1 µs
Count 2 using numpy equal

快速正确!

I’ve compared a few of the methods possible for doing this, including pandas, several numpy methods, and a list comprehension method.

First, let’s start with a baseline:

>>> import numpy as np
>>> import operator
>>> import pandas as pd

>>> x = [1, 2, 1, 2]
>>> %time count = np.sum(np.equal(1, x))
>>> print("Count {} using numpy equal with ints".format(count))
CPU times: user 52 µs, sys: 0 ns, total: 52 µs
Wall time: 56 µs
Count 2 using numpy equal with ints

So, our baseline is that the count should be correct 2, and we should take about 50 us.

Now, we try the naive method:

>>> x = ['s', 'b', 's', 'b']
>>> %time count = np.sum(np.equal('s', x))
>>> print("Count {} using numpy equal".format(count))
CPU times: user 145 µs, sys: 24 µs, total: 169 µs
Wall time: 158 µs
Count NotImplemented using numpy equal
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  """Entry point for launching an IPython kernel.

And here, we get the wrong answer (NotImplemented != 2), it takes us a long time, and it throws the warning.

So we’ll try another naive method:

>>> %time count = np.sum(x == 's')
>>> print("Count {} using ==".format(count))
CPU times: user 46 µs, sys: 1 µs, total: 47 µs
Wall time: 50.1 µs
Count 0 using ==

Again, the wrong answer (0 != 2). This is even more insidious because there’s no subsequent warnings (0 can be passed around just like 2).

Now, let’s try a list comprehension:

>>> %time count = np.sum([operator.eq(_x, 's') for _x in x])
>>> print("Count {} using list comprehension".format(count))
CPU times: user 55 µs, sys: 1 µs, total: 56 µs
Wall time: 60.3 µs
Count 2 using list comprehension

We get the right answer here, and it’s pretty fast!

Another possibility, pandas:

>>> y = pd.Series(x)
>>> %time count = np.sum(y == 's')
>>> print("Count {} using pandas ==".format(count))
CPU times: user 453 µs, sys: 31 µs, total: 484 µs
Wall time: 463 µs
Count 2 using pandas ==

Slow, but correct!

And finally, the option I’m going to use: casting the numpy array to the object type:

>>> x = np.array(['s', 'b', 's', 'b']).astype(object)
>>> %time count = np.sum(np.equal('s', x))
>>> print("Count {} using numpy equal".format(count))
CPU times: user 50 µs, sys: 1 µs, total: 51 µs
Wall time: 55.1 µs
Count 2 using numpy equal

Fast and correct!


回答 10

我有导致错误的此代码:

for t in dfObj['time']:
  if type(t) == str:
    the_date = dateutil.parser.parse(t)
    loc_dt_int = int(the_date.timestamp())
    dfObj.loc[t == dfObj.time, 'time'] = loc_dt_int

我将其更改为:

for t in dfObj['time']:
  try:
    the_date = dateutil.parser.parse(t)
    loc_dt_int = int(the_date.timestamp())
    dfObj.loc[t == dfObj.time, 'time'] = loc_dt_int
  except Exception as e:
    print(e)
    continue

为了避免比较,它会发出警告-如上所述。我只需要避免这种异常,因为dfObj.loc在for循环中,也许有一种方法可以告诉它不要检查已更改的行。

I had this code which was causing the error:

for t in dfObj['time']:
  if type(t) == str:
    the_date = dateutil.parser.parse(t)
    loc_dt_int = int(the_date.timestamp())
    dfObj.loc[t == dfObj.time, 'time'] = loc_dt_int

I changed it to this:

for t in dfObj['time']:
  try:
    the_date = dateutil.parser.parse(t)
    loc_dt_int = int(the_date.timestamp())
    dfObj.loc[t == dfObj.time, 'time'] = loc_dt_int
  except Exception as e:
    print(e)
    continue

to avoid the comparison, which is throwing the warning – as stated above. I only had to avoid the exception because of dfObj.loc in the for loop, maybe there is a way to tell it not to check the rows it has already changed.


SQLAlchemy ORM转换为Pandas DataFrame

问题:SQLAlchemy ORM转换为Pandas DataFrame

这个话题已经有一段时间没有在这里或其他地方了。是否有将SQLAlchemy <Query object>转换为pandas DataFrame 的解决方案?

Pandas具有使用能力,pandas.read_sql但这需要使用原始SQL。我有两个避免发生这种情况的原因:1)我已经使用ORM拥有了一切(本身就是一个很好的理由),并且2)我正在使用python列表作为查询的一部分(例如:模型类.db.session.query(Item).filter(Item.symbol.in_(add_symbols)在哪里Item)并且add_symbols是列表)。这等效于SQL SELECT ... from ... WHERE ... IN

有什么可能吗?

This topic hasn’t been addressed in a while, here or elsewhere. Is there a solution converting a SQLAlchemy <Query object> to a pandas DataFrame?

Pandas has the capability to use pandas.read_sql but this requires use of raw SQL. I have two reasons for wanting to avoid it: 1) I already have everything using the ORM (a good reason in and of itself) and 2) I’m using python lists as part of the query (eg: .db.session.query(Item).filter(Item.symbol.in_(add_symbols) where Item is my model class and add_symbols is a list). This is the equivalent of SQL SELECT ... from ... WHERE ... IN.

Is anything possible?


回答 0

在大多数情况下,下面的代码应该有效:

df = pd.read_sql(query.statement, query.session.bind)

有关pandas.read_sql参数的更多信息,请参见文档。

Below should work in most cases:

df = pd.read_sql(query.statement, query.session.bind)

See pandas.read_sql documentation for more information on the parameters.


回答 1

为了让新手熊猫程序员更加清楚,这是一个具体示例,

pd.read_sql(session.query(Complaint).filter(Complaint.id == 2).statement,session.bind) 

在这里,我们从id = 2的投诉表(sqlalchemy模型为Complaint)中选择一个投诉

Just to make this more clear for novice pandas programmers, here is a concrete example,

pd.read_sql(session.query(Complaint).filter(Complaint.id == 2).statement,session.bind) 

Here we select a complaint from complaints table (sqlalchemy model is Complaint) with id = 2


回答 2

所选解决方案对我不起作用,因为我不断收到错误消息

AttributeError:’AnnotatedSelect’对象没有属性’lower’

我发现以下工作:

df = pd.read_sql_query(query.statement, engine)

The selected solution didn’t work for me, as I kept getting the error

AttributeError: ‘AnnotatedSelect’ object has no attribute ‘lower’

I found the following worked:

df = pd.read_sql_query(query.statement, engine)

回答 3

如果要使用参数编译查询并说方言特定的参数,请使用以下命令:

c = query.statement.compile(query.session.bind)
df = pandas.read_sql(c.string, query.session.bind, params=c.params)

If you want to compile a query with parameters and dialect specific arguments, use something like this:

c = query.statement.compile(query.session.bind)
df = pandas.read_sql(c.string, query.session.bind, params=c.params)

回答 4

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

engine = create_engine('postgresql://postgres:postgres@localhost:5432/DB', echo=False)
Base = declarative_base(bind=engine)
Session = sessionmaker(bind=engine)
session = Session()

conn = session.bind

class DailyTrendsTable(Base):

    __tablename__ = 'trends'
    __table_args__ = ({"schema": 'mf_analysis'})

    company_code = Column(DOUBLE_PRECISION, primary_key=True)
    rt_bullish_trending = Column(Integer)
    rt_bearish_trending = Column(Integer)
    rt_bullish_non_trending = Column(Integer)
    rt_bearish_non_trending = Column(Integer)
    gen_date = Column(Date, primary_key=True)

df_query = select([DailyTrendsTable])

df_data = pd.read_sql(rt_daily_query, con = conn)
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

engine = create_engine('postgresql://postgres:postgres@localhost:5432/DB', echo=False)
Base = declarative_base(bind=engine)
Session = sessionmaker(bind=engine)
session = Session()

conn = session.bind

class DailyTrendsTable(Base):

    __tablename__ = 'trends'
    __table_args__ = ({"schema": 'mf_analysis'})

    company_code = Column(DOUBLE_PRECISION, primary_key=True)
    rt_bullish_trending = Column(Integer)
    rt_bearish_trending = Column(Integer)
    rt_bullish_non_trending = Column(Integer)
    rt_bearish_non_trending = Column(Integer)
    gen_date = Column(Date, primary_key=True)

df_query = select([DailyTrendsTable])

df_data = pd.read_sql(rt_daily_query, con = conn)