Tag archive: pandas

Normalize data in pandas

Question: Normalize data in pandas

Suppose I have a pandas data frame df:

I want to calculate the column wise mean of a data frame.

This is easy:

df.apply(average) 

then the column wise range max(col) – min(col). This is easy again:

df.apply(max) - df.apply(min)

Now, for each element, I want to subtract its column’s mean and divide by its column’s range. I am not sure how to do that.

Any help/pointers are much appreciated.


Answer 0

In [92]: df
Out[92]:
           a         b          c         d
A  -0.488816  0.863769   4.325608 -4.721202
B -11.937097  2.993993 -12.916784 -1.086236
C  -5.569493  4.672679  -2.168464 -9.315900
D   8.892368  0.932785   4.535396  0.598124

In [93]: df_norm = (df - df.mean()) / (df.max() - df.min())

In [94]: df_norm
Out[94]:
          a         b         c         d
A  0.085789 -0.394348  0.337016 -0.109935
B -0.463830  0.164926 -0.650963  0.256714
C -0.158129  0.605652 -0.035090 -0.573389
D  0.536170 -0.376229  0.349037  0.426611

In [95]: df_norm.mean()
Out[95]:
a   -2.081668e-17
b    4.857226e-17
c    1.734723e-17
d   -1.040834e-17

In [96]: df_norm.max() - df_norm.min()
Out[96]:
a    1
b    1
c    1
d    1
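
The one-liner above works because pandas aligns the per-column Series returned by df.mean(), df.max() and df.min() with the columns of df. A minimal sketch on a small made-up frame (the column names and values are illustrative assumptions, not data from the answer):

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]})

col_mean = df.mean()             # one mean per column
col_range = df.max() - df.min()  # one range per column

# Subtraction and division broadcast the per-column Series across all rows.
df_norm = (df - col_mean) / col_range
print(df_norm)
```

A constant column has a range of zero, so its result would be NaN; that case would need separate handling.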

Answer 1

If you don’t mind importing the sklearn library, I would recommend the method described in this blog post.

import pandas as pd
from sklearn import preprocessing

data = {'score': [234,24,14,27,-74,46,73,-18,59,160]}
df = pd.DataFrame(data)
cols = df.columns  # take the column names from the DataFrame; a plain dict has no .columns
df

min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(df)
df_normalized = pd.DataFrame(np_scaled, columns = cols)
df_normalized

Answer 2

You can use apply for this, and it’s a bit neater:

import numpy as np
import pandas as pd

np.random.seed(1)

df = pd.DataFrame(np.random.randn(4,4)* 4 + 3)

          0         1         2         3
0  9.497381  0.552974  0.887313 -1.291874
1  6.461631 -6.206155  9.979247 -0.044828
2  4.276156  2.002518  8.848432 -5.240563
3  1.710331  1.463783  7.535078 -1.399565

df.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))

          0         1         2         3
0  0.515087  0.133967 -0.651699  0.135175
1  0.125241 -0.689446  0.348301  0.375188
2 -0.155414  0.310554  0.223925 -0.624812
3 -0.484913  0.244924  0.079473  0.114448

Also, it works nicely with groupby, if you select the relevant columns:

df['grp'] = ['A', 'A', 'B', 'B']

          0         1         2         3 grp
0  9.497381  0.552974  0.887313 -1.291874   A
1  6.461631 -6.206155  9.979247 -0.044828   A
2  4.276156  2.002518  8.848432 -5.240563   B
3  1.710331  1.463783  7.535078 -1.399565   B


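# x is a DataFrame here, so use its column-wise methods; group_keys=False keeps the original row index
df.groupby('grp', group_keys=False)[[0,1,2,3]].apply(lambda x: (x - x.mean()) / (x.max() - x.min()))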

     0    1    2    3
0  0.5  0.5 -0.5 -0.5
1 -0.5 -0.5  0.5  0.5
2  0.5  0.5  0.5 -0.5
3 -0.5 -0.5 -0.5  0.5

Answer 3

Slightly modified from: Python Pandas Dataframe: Normalize data between 0.01 and 0.99? Based on some of the comments, I thought it was relevant here (sorry if this is considered a repost…).

I wanted customized normalization because a regular percentile or z-score was not adequate. Sometimes I knew the feasible max and min of the population and wanted to define the range from those rather than from my sample, or to use a different midpoint. This is often useful when rescaling and normalizing data for neural nets, where you may want all inputs between 0 and 1 but some of your data needs to be scaled in a more customized way: percentiles and standard deviations assume your sample covers the population, and sometimes we know this isn’t true. It was also very useful for me when visualizing data in heatmaps. So I built a custom function (with extra steps in the code here to make it as readable as possible):

from numpy import mean, median  # used by center='avg' and center='median'

def NormData(s,low='min',center='mid',hi='max',insideout=False,shrinkfactor=0.):
    if low=='min':
        low=min(s)
    elif low=='abs':
        low=max(abs(min(s)),abs(max(s)))*-1.#sign(min(s))
    if hi=='max':
        hi=max(s)
    elif hi=='abs':
        hi=max(abs(min(s)),abs(max(s)))*1.#sign(max(s))

    if center=='mid':
        center=(max(s)+min(s))/2
    elif center=='avg':
        center=mean(s)
    elif center=='median':
        center=median(s)

    s2=[x-center for x in s]
    hi=hi-center
    low=low-center
    center=0.

    r=[]

    for x in s2:
        if x<low:
            r.append(0.)
        elif x>hi:
            r.append(1.)
        else:
            if x>=center:
                r.append((x-center)/(hi-center)*0.5+0.5)
            else:
                r.append((x-low)/(center-low)*0.5+0.)

    if insideout==True:
        ir=[(1.-abs(z-0.5)*2.) for z in r]
        r=ir

    rr =[x-(x-0.5)*shrinkfactor for x in r]    
    return rr

This takes a pandas Series, or even just a list, and normalizes it to your specified low, center, and high points. There is also a shrink factor, which lets you scale the data away from the endpoints 0 and 1 (I had to do this when combining colormaps in matplotlib: Single pcolormesh with more than one colormap using Matplotlib). You can likely see how the code works, but as an example, say you have the values [-5,2,10] in a sample and want to normalize based on a range of -7 to 7 (so anything above 7, such as our “10”, is effectively treated as a 7) with a midpoint of 1, then shrink the result to fit a 256-level RGB colormap:

#In[1]
NormData([-5,2,10],low=-7,center=1,hi=7,shrinkfactor=2./256)
#Out[1]
[0.1279296875, 0.5826822916666667, 0.99609375]

It can also turn your data inside out… this may seem odd, but I found it useful for heatmapping. Say you want a darker color for values closer to 0 rather than hi/low. You could heatmap based on normalized data where insideout=True:

#In[2]
NormData([-5,2,10],low=-7,center=1,hi=7,insideout=True,shrinkfactor=2./256)
#Out[2]
[0.251953125, 0.8307291666666666, 0.00390625]

So now “2” which is closest to the center, defined as “1” is the highest value.

Anyways, I thought my application was relevant if you’re looking to rescale data in other ways that could have useful applications to you.
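
For the case where low, center and hi are given as explicit numbers, the core of the mapping (low to 0, center to 0.5, hi to 1, with clipping outside the range) can also be sketched with np.interp. This is only an illustration: norm_piecewise is a hypothetical helper name, it assumes low < center < hi, and it does not cover the 'min'/'abs'/'avg' string options or insideout.

```python
import numpy as np

def norm_piecewise(values, low, center, hi, shrinkfactor=0.0):
    """Hypothetical helper: map low -> 0, center -> 0.5, hi -> 1, clipping outside [low, hi]."""
    # np.interp interpolates linearly between the given points and clips outside them.
    r = np.interp(values, [low, center, hi], [0.0, 0.5, 1.0])
    # Pull the results toward 0.5 by the shrink factor, as NormData does.
    return r - (r - 0.5) * shrinkfactor

result = norm_piecewise([-5, 2, 10], low=-7, center=1, hi=7, shrinkfactor=2.0 / 256)
print(result)
```

With the same inputs as the first NormData call above, this should reproduce the same three values.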


Answer 4

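This is how you do it column-wise (note that this rescales each column to the range [0, 1] by subtracting the column minimum, not the mean):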

[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]
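
The same min-max rescaling can be written without a loop and without modifying df in place. A minimal sketch, assuming every column is numeric (the frame below is made up for illustration):

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"x": [0.0, 5.0, 10.0], "y": [-1.0, 0.0, 3.0]})

# Each column is shifted by its minimum and divided by its range,
# producing a new DataFrame and leaving df untouched.
df_scaled = (df - df.min()) / (df.max() - df.min())
print(df_scaled)
```

Returning a new frame avoids relying on Series.update writing through to the parent DataFrame.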

Pandas convert dataframe to array of tuples

Question: Pandas convert dataframe to array of tuples

I have manipulated some data using pandas and now I want to carry out a batch save back to the database. This requires me to convert the dataframe into an array of tuples, with each tuple corresponding to a “row” of the dataframe.

My DataFrame looks something like:

In [182]: data_set
Out[182]: 
   index   data_date  data_1  data_2
0  14303  2012-02-17   24.75   25.03
1  12009  2012-02-16   25.00   25.07
2  11830  2012-02-15   24.99   25.15
3   6274  2012-02-14   24.68   25.05
4   2302  2012-02-13   24.62   24.77
5  14085  2012-02-10   24.38   24.61

I want to convert it to an array of tuples like:

[(datetime.date(2012,2,17),24.75,25.03),
(datetime.date(2012,2,16),25.00,25.07),
...etc. ]

Any suggestion on how I can efficiently do this?


Answer 0

How about:

subset = data_set[['data_date', 'data_1', 'data_2']]
tuples = [tuple(x) for x in subset.to_numpy()]

For pandas < 0.24, use:

tuples = [tuple(x) for x in subset.values]

Answer 1

list(data_set.itertuples(index=False))

As of pandas 0.17.1, the above will return a list of namedtuples.

If you want a list of ordinary tuples, pass name=None as an argument:

list(data_set.itertuples(index=False, name=None))
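
To make the difference between the two calls concrete, here is a small sketch on a made-up two-row frame (the data is illustrative; only the column names follow the question):

```python
import pandas as pd

# Illustrative subset of the question's columns.
data_set = pd.DataFrame({"data_1": [24.75, 25.00], "data_2": [25.03, 25.07]})

named = list(data_set.itertuples(index=False))             # namedtuples
plain = list(data_set.itertuples(index=False, name=None))  # ordinary tuples

first = named[0]
print(first.data_1)  # namedtuple fields are accessible by column name
print(plain)
```

Namedtuples are convenient when rows are read by field name; plain tuples are usually what a database driver expects for a batch insert.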

Answer 2

A generic way:

[tuple(x) for x in data_set.to_records(index=False)]

Answer 3

Motivation
Many data sets are large enough that we need to concern ourselves with speed and efficiency, so I offer this solution in that spirit. It also happens to be succinct.

For the sake of comparison, let’s drop the index column:

df = data_set.drop(columns='index')

Solution
I’ll propose the use of zip and map:

list(zip(*map(df.get, df)))

[('2012-02-17', 24.75, 25.03),
 ('2012-02-16', 25.0, 25.07),
 ('2012-02-15', 24.99, 25.15),
 ('2012-02-14', 24.68, 25.05),
 ('2012-02-13', 24.62, 24.77),
 ('2012-02-10', 24.38, 24.61)]

It happens to also be flexible if we wanted to deal with a specific subset of columns. We’ll assume the columns we’ve already displayed are the subset we want.

list(zip(*map(df.get, ['data_date', 'data_1', 'data_2'])))

[('2012-02-17', 24.75, 25.03),
 ('2012-02-16', 25.0, 25.07),
 ('2012-02-15', 24.99, 25.15),
 ('2012-02-14', 24.68, 25.05),
 ('2012-02-13', 24.62, 24.77),
 ('2012-02-10', 24.38, 24.61)]
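
To unpack what the zip/map one-liner does: iterating a DataFrame yields its column names, df.get turns each name into a column Series, and zip(*...) then pairs up the i-th element of every column into one row tuple. A step-by-step sketch on two made-up rows:

```python
import pandas as pd

# Two illustrative rows with the question's column names.
df = pd.DataFrame({
    "data_date": ["2012-02-17", "2012-02-16"],
    "data_1": [24.75, 25.00],
    "data_2": [25.03, 25.07],
})

columns = [df[name] for name in df]  # equivalent to map(df.get, df)
rows = list(zip(*columns))           # transpose columns into row tuples
print(rows)
```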

What is Quicker?

It turns out records is quickest, followed by zipmap and iter_tuples, which converge asymptotically.

I’ll use the simple_benchmark library, which I got from this post:

from simple_benchmark import BenchmarkBuilder
b = BenchmarkBuilder()

import pandas as pd
import numpy as np

def tuple_comp(df): return [tuple(x) for x in df.to_numpy()]
def iter_namedtuples(df): return list(df.itertuples(index=False))
def iter_tuples(df): return list(df.itertuples(index=False, name=None))
def records(df): return df.to_records(index=False).tolist()
def zipmap(df): return list(zip(*map(df.get, df)))

funcs = [tuple_comp, iter_namedtuples, iter_tuples, records, zipmap]
for func in funcs:
    b.add_function()(func)

def creator(n):
    return pd.DataFrame({"A": np.random.randint(n, size=n), "B": np.random.randint(n, size=n)})

@b.add_arguments('Rows in DataFrame')
def argument_provider():
    for n in (10 ** (np.arange(4, 11) / 2)).astype(int):
        yield n, creator(n)

r = b.run()

Check the results:

r.to_pandas_dataframe().pipe(lambda d: d.div(d.min(axis=1), axis=0))

        tuple_comp  iter_namedtuples  iter_tuples   records    zipmap
100       2.905662          6.626308     3.450741  1.469471  1.000000
316       4.612692          4.814433     2.375874  1.096352  1.000000
1000      6.513121          4.106426     1.958293  1.000000  1.316303
3162      8.446138          4.082161     1.808339  1.000000  1.533605
10000     8.424483          3.621461     1.651831  1.000000  1.558592
31622     7.813803          3.386592     1.586483  1.000000  1.515478
100000    7.050572          3.162426     1.499977  1.000000  1.480131

r.plot()


Answer 4

Here’s a vectorized approach (assuming the dataframe data_set is named df instead) that returns a list of tuples as shown:

>>> df.set_index(['data_date'])[['data_1', 'data_2']].to_records().tolist()

produces:

[(datetime.datetime(2012, 2, 17, 0, 0), 24.75, 25.03),
 (datetime.datetime(2012, 2, 16, 0, 0), 25.0, 25.07),
 (datetime.datetime(2012, 2, 15, 0, 0), 24.99, 25.15),
 (datetime.datetime(2012, 2, 14, 0, 0), 24.68, 25.05),
 (datetime.datetime(2012, 2, 13, 0, 0), 24.62, 24.77),
 (datetime.datetime(2012, 2, 10, 0, 0), 24.38, 24.61)]

The idea of setting the datetime column as the index is to help convert the Timestamp values to their datetime.datetime equivalents by making use of the convert_datetime64 argument of DataFrame.to_records, which does this for a dataframe with a DatetimeIndex. Note that this argument belongs to older pandas releases and has since been removed, so the output may differ on current versions.

This returns a recarray, which can then be converted to a list using .tolist().


A more general solution, depending on the use case, would be:

df.to_records().tolist()                              # Supply index=False to exclude index

Answer 5

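An efficient and easy way (note that the index is included unless you pass index=False, and the elements are NumPy records rather than plain tuples):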

list(data_set.to_records())

You can filter the columns you need before this call.


Answer 6

This answer doesn’t add any approaches that aren’t already discussed, but here are some speed results. I think this should resolve questions that came up in the comments. All of these look like they are O(n), based on these three data points.

TL;DR: tuples = list(df.itertuples(index=False, name=None)) and tuples = list(zip(*[df[c].values.tolist() for c in df])) are tied for the fastest.

I did a quick speed test on results for three suggestions here:

  1. The zip answer from @pirsquared: tuples = list(zip(*[df[c].values.tolist() for c in df]))
  2. The accepted answer from @wes-mckinney: tuples = [tuple(x) for x in df.values]
  3. The itertuples answer from @ksindi with the name=None suggestion from @Axel: tuples = list(df.itertuples(index=False, name=None))

from numpy import random
import pandas as pd


def create_random_df(n):
    return pd.DataFrame({"A": random.randint(n, size=n), "B": random.randint(n, size=n)})

Small size:

df = create_random_df(10000)
%timeit tuples = list(zip(*[df[c].values.tolist() for c in df]))
%timeit tuples = [tuple(x) for x in df.values]
%timeit tuples = list(df.itertuples(index=False, name=None))

Gives:

1.66 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
15.5 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.74 ms ± 75.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Larger:

df = create_random_df(1000000)
%timeit tuples = list(zip(*[df[c].values.tolist() for c in df]))
%timeit tuples = [tuple(x) for x in df.values]
%timeit tuples = list(df.itertuples(index=False, name=None))

Gives:

202 ms ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.52 s ± 98.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
209 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

As much patience as I have:

df = create_random_df(10000000)
%timeit tuples = list(zip(*[df[c].values.tolist() for c in df]))
%timeit tuples = [tuple(x) for x in df.values]
%timeit tuples = list(df.itertuples(index=False, name=None))

Gives:

1.78 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
15.4 s ± 222 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.68 s ± 96.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The zip version and the itertuples version are within each other’s confidence intervals. I suspect that they are doing the same thing under the hood.

These speed tests are probably irrelevant though. Pushing the limits of my computer’s memory doesn’t take a huge amount of time, and you really shouldn’t be doing this on a large data set. Working with those tuples after doing this will end up being really inefficient. It’s unlikely to be a major bottleneck in your code, so just stick with the version you think is most readable.


Answer 7

#try this one:

tuples = list(zip(data_set["data_date"], data_set["data_1"],data_set["data_2"]))
print (tuples)

Answer 8

Converting a DataFrame into a list of tuples:

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
print(df)
OUTPUT
   col1  col2
0     1     4
1     2     5
2     3     6

records = df.to_records(index=False)
result = list(records)
print(result)
OUTPUT
[(1, 4), (2, 5), (3, 6)]

Answer 9

A more pythonic way:

df = data_set[['data_date', 'data_1', 'data_2']]
list(map(tuple, df.values))  # in Python 3, map is lazy, so wrap it in list()

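What is the most efficient way of counting occurrences in pandas?

Question: What is the most efficient way of counting occurrences in pandas?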

I have a large (about 12M rows) dataframe df with say:

df.columns = ['word','documents','frequency']

So the following ran in a timely fashion:

word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']

However, this is taking an unexpectedly long time to run:

Occurrences_of_Words = word_grouping[['word']].count().reset_index()

What am I doing wrong here? Is there a better way to count occurrences in a large dataframe?

df.word.describe()

ran pretty well, so I really did not expect this Occurrences_of_Words dataframe to take very long to build.

ps: If the answer is obvious and you feel the need to penalize me for asking this question, please include the answer as well. thank you.


Answer 0

I think df['word'].value_counts() should serve. By skipping the groupby machinery, you’ll save some time. I’m not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)

In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you’ll do much better than that.
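
As a small illustration of the two routes mentioned above, the sketch below counts words with value_counts and with groupby(...).size() on a made-up frame (the data is illustrative; the column names follow the question):

```python
import pandas as pd

# Tiny stand-in for the 12M-row frame in the question.
df = pd.DataFrame({
    "word": ["a", "b", "a", "c", "a", "b"],
    "frequency": [1, 2, 3, 4, 5, 6],
})

counts = df["word"].value_counts()    # sorted by count, descending
sizes = df.groupby("word").size()     # sorted by group key
print(counts)
print(sizes)
```

Both give the same numbers; they differ in ordering and in how missing values are treated.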


Answer 1

When you want to count the frequency of categorical data in a column of a pandas DataFrame, use: df['Column_Name'].value_counts()

Source.


Answer 2

Just an addition to the previous answers. Let’s not forget that when dealing with real data there might be null values, so it’s useful to also include those in the count by using the option dropna=False (the default is True).

An example:

>>> df['Embarked'].value_counts(dropna=False)
S      644
C      168
Q       77
NaN      2
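
The effect of dropna is easy to check on a short made-up Series (the values below are illustrative, not the data behind the output above):

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value.
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"])

without_nan = embarked.value_counts()           # NaN is dropped by default
with_nan = embarked.value_counts(dropna=False)  # NaN gets its own row

print(without_nan.sum(), with_nan.sum())
```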

How to check the dtype of a column in python pandas

Question: How to check the dtype of a column in python pandas

I need to use different functions to treat numeric columns and string columns. What I am doing now is really dumb:

allc = list((agg.loc[:, (agg.dtypes==np.float64)|(agg.dtypes==np.int)]).columns)
for y in allc:
    treat_numeric(agg[y])    

allc = list((agg.loc[:, (agg.dtypes!=np.float64)&(agg.dtypes!=np.int)]).columns)
for y in allc:
    treat_str(agg[y])    

Is there a more elegant way to do this? E.g.

for y in agg.columns:
    if(dtype(agg[y]) == 'string'):
          treat_str(agg[y])
    elif(dtype(agg[y]) != 'string'):
          treat_numeric(agg[y])

Answer 0

You can access the data-type of a column with dtype:

for y in agg.columns:
    if(agg[y].dtype == np.float64 or agg[y].dtype == np.int64):
          treat_numeric(agg[y])
    else:
          treat_str(agg[y])

Answer 1

In pandas 0.20.2 you can do:

from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype

is_string_dtype(df['A'])
>>>> True

is_numeric_dtype(df['B'])
>>>> True

So your code becomes:

for y in agg.columns:
    if (is_string_dtype(agg[y])):
        treat_str(agg[y])
    elif (is_numeric_dtype(agg[y])):
        treat_numeric(agg[y])
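
A self-contained sketch of the same loop, recording a label per column instead of calling treat_str and treat_numeric (the frame and the "other" fallback are assumptions added for illustration):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype

# Illustrative frame with one text column and two numeric columns.
agg = pd.DataFrame({"name": ["x", "y"], "count": [1, 2], "price": [1.5, 2.5]})

kinds = {}
for col in agg.columns:
    if is_numeric_dtype(agg[col]):
        kinds[col] = "numeric"
    elif is_string_dtype(agg[col]):
        kinds[col] = "string"
    else:
        kinds[col] = "other"  # e.g. datetimes, which are neither

print(kinds)
```

The explicit fallback branch makes it visible when a column matches neither check, rather than silently skipping it.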

Answer 2

I know this is a bit of an old thread, but with pandas 0.19.2 you can do:

df.select_dtypes(include=['float64']).apply(your_function)
df.select_dtypes(exclude=['string','object']).apply(your_other_function)

http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.select_dtypes.html
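
As a hedged illustration of the same idea, the sketch below splits a made-up frame into numeric and non-numeric columns using the generic "number" selector rather than the specific dtype names used above:

```python
import pandas as pd

# Illustrative frame.
df = pd.DataFrame({"name": ["x", "y"], "count": [1, 2], "price": [1.5, 2.5]})

numeric_part = df.select_dtypes(include="number")  # ints and floats
other_part = df.select_dtypes(exclude="number")    # everything else

print(list(numeric_part.columns), list(other_part.columns))
```

Each part can then be passed to its own function with .apply, as in the answer.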


Answer 3

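The question title is general, but the author’s use case stated in the body of the question is specific, so any of the other answers may be used.

However, to fully answer the title question, it should be clarified that all of the approaches seem to fail in some cases and require some rework. I reviewed all of them (and some additional ones), roughly from least to most reliable (in my opinion):

1. Comparing types directly via == (accepted answer).

Despite the fact that this is the accepted answer and has the most upvotes, I think this method should not be used at all, because this approach is in fact discouraged in Python, as mentioned several times here.
But if one still wants to use it, one should be aware of some pandas-specific dtypes such as pd.CategoricalDtype, pd.PeriodDtype, or pd.IntervalDtype. Here one has to use an extra type( ) in order to recognize the dtype correctly: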

s = pd.Series([pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')])
s
s.dtype == pd.PeriodDtype   # Not working
type(s.dtype) == pd.PeriodDtype # working 

>>> 0    2002-03-01
>>> 1    2012-02-01
>>> dtype: period[D]
>>> False
>>> True

Another caveat here is that the type must be specified precisely:

s = pd.Series([1,2])
s
s.dtype == np.int64 # Working
s.dtype == np.int32 # Not working

>>> 0    1
>>> 1    2
>>> dtype: int64
>>> True
>>> False
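
Both caveats can be sidestepped with checks that do not depend on an exact dtype match. The sketch below is an assumption about what a more tolerant check could look like, using pandas.api.types.is_integer_dtype and isinstance on the dtype object:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_integer_dtype

s64 = pd.Series([1, 2], dtype="int64")
s32 = pd.Series([1, 2], dtype="int32")
periods = pd.Series([pd.Period("2002-03", "D"), pd.Period("2012-02-01", "D")])

checks = {
    "s64 is integer": is_integer_dtype(s64),
    "s32 is integer": is_integer_dtype(s32),  # matches any integer width
    "s32 == int64": s32.dtype == np.int64,    # exact comparison misses int32
    "periods is PeriodDtype": isinstance(periods.dtype, pd.PeriodDtype),
}
print(checks)
```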

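2. The isinstance() approach.

This method has not been mentioned in the answers so far.

So if comparing types directly is not a good idea, let’s try the built-in Python function for this purpose, namely isinstance().
It fails right at the start, because it assumes that we have some objects, whereas a pd.Series or pd.DataFrame may be used as just an empty container with a predefined dtype but no objects in it: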

s = pd.Series([], dtype=bool)
s

>>> Series([], dtype: bool)

But if one somehow overcomes this issue and wants to access each object, for example in the first row, and check its dtype like this:

df = pd.DataFrame({'int': [12, 2], 'dt': [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')]},
                  index = ['A', 'B'])
for col in df.columns:
    df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)

>>> (dtype('int64'), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')

It will be misleading in the case of mixed types of data in a single column:

df2 = pd.DataFrame({'data': [12, pd.Timestamp('2013-01-02')]},
                  index = ['A', 'B'])
for col in df2.columns:
    df2[col].dtype, 'is_int64 = %s' % isinstance(df2.loc['A', col], np.int64)

>>> (dtype('O'), 'is_int64 = False')

最后但并非最不重要的一点-此方法无法直接识别Categorydtype。如文档所述

从分类数据返回单个项目也将返回值,而不是长度为“ 1”的分类。

df['int'] = df['int'].astype('category')
for col in df.columns:
    df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)

>>> (CategoricalDtype(categories=[2, 12], ordered=False), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')

因此,这种方法几乎也不适用。

3. df.dtype.kind方法。

此方法可能与空方法一起使用,pd.Series或者pd.DataFrames还有其他问题。

首先-无法区分某些dtype:

df = pd.DataFrame({'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                   'str'  :['s1', 's2'],
                   'cat'  :[1, -1]})
df['cat'] = df['cat'].astype('category')
for col in df:
    # kind will define all columns as 'Object'
    print (df[col].dtype, df[col].dtype.kind)

>>> period[D] O
>>> object O
>>> category O

第二,实际上我仍然不清楚,它甚至在某些dtypes返回None

4. df.select_dtypes方法。

这几乎是我们想要的。此方法在pandas内部设计,因此可以处理前面提到的大多数极端情况-空的DataFrame,与numpy或特定于pandas的dtypes完全不同。与dtype这样的单个dtype一起使用时效果很好.select_dtypes('bool')。它甚至可以用于基于dtype选择列组:

test = pd.DataFrame({'bool' :[False, True], 'int64':[-1,2], 'int32':[-1,2],'float': [-2.5, 3.4],
                     'compl':np.array([1-1j, 5]),
                     'dt'   :[pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')],
                     'td'   :[pd.Timestamp('2012-03-02')- pd.Timestamp('2016-10-20'),
                              pd.Timestamp('2010-07-12')- pd.Timestamp('2000-11-10')],
                     'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                     'intrv':pd.arrays.IntervalArray([pd.Interval(0, 0.1), pd.Interval(1, 5)]),
                     'str'  :['s1', 's2'],
                     'cat'  :[1, -1],
                     'obj'  :[[1,2,3], [5435,35,-52,14]]
                    })
test['int32'] = test['int32'].astype(np.int32)
test['cat'] = test['cat'].astype('category')

就像文档中所述:

test.select_dtypes('number')

>>>     int64   int32   float   compl   td
>>> 0      -1      -1   -2.5    (1-1j)  -1693 days
>>> 1       2       2    3.4    (5+0j)   3531 days

在可能会认为这里我们看到的第一个意外结果(过去对我来说是:问题)- TimeDelta被包含在输出中DataFrame。但是,正如相反的回答,应该是这样,但是必须意识到这一点。请注意,bool跳过了dtype,这对于某些人来说也是不希望的,但这是由于boolnumber位于numpy dtype的不同“ 子树 ”中。如果是布尔型,我们可以test.select_dtypes(['bool'])在这里使用。

此方法的下一个限制是,对于当前版本的Pandas(0.24.2),此代码:test.select_dtypes('period')将引发NotImplementedError

另一件事是它无法将字符串与其他对象区分开:

test.select_dtypes('object')

>>>     str     obj
>>> 0    s1     [1, 2, 3]
>>> 1    s2     [5435, 35, -52, 14]

但这首先是- 在文档中已经提到。其次-不是此方法的问题,而是字符串存储在中的方式DataFrame。但是无论如何,这种情况必须进行一些后期处理。

5. df.api.types.is_XXX_dtype方法。

我猜想这是实现dtype识别(函数所在的模块的路径本身说)的最健壮和本机的方式。它几乎可以完美地工作,但是仍然至少有一个警告,并且仍然必须以某种方式区分字符串列

此外,这可能是主观的,但是与以下方法相比,该方法还具有更多的“人类可理解”的numberdtypes组处理.select_dtypes('number')

for col in test.columns:
    if pd.api.types.is_numeric_dtype(test[col]):
        print (test[col].dtype)

>>> bool
>>> int64
>>> int32
>>> float64
>>> complex128

timedeltabool包括在内。完善。

我的管道此时恰好利用了此功能,以及一些后期处理。

输出。

希望我能够论点的主要观点-所有讨论的方法可以使用,但只能pd.DataFrame.select_dtypes()pd.api.types.is_XXX_dtype必须真正视为适用的。

The question title is general, but the author's use case stated in the body of the question is specific, so any of the other answers may be used.

But to fully answer the title question, it should be made clear that every approach seems to fail in some cases and needs some rework. I reviewed all of them (and some additional ones) in order of increasing reliability (in my opinion):

1. Comparing types directly via == (accepted answer).

Although this is the accepted answer and has the most upvotes, I think this method should not be used at all, because this approach is discouraged in Python, as mentioned several times here.
But if one still wants to use it, one should be aware of pandas-specific dtypes such as pd.CategoricalDtype, pd.PeriodDtype, or pd.IntervalDtype. Here one has to add an extra type( ) call to recognize the dtype correctly:

s = pd.Series([pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')])
s
s.dtype == pd.PeriodDtype   # Not working
type(s.dtype) == pd.PeriodDtype # working 

>>> 0    2002-03-01
>>> 1    2012-02-01
>>> dtype: period[D]
>>> False
>>> True

Another caveat here is that the type must be specified precisely:

s = pd.Series([1,2])
s
s.dtype == np.int64 # Working
s.dtype == np.int32 # Not working

>>> 0    1
>>> 1    2
>>> dtype: int64
>>> True
>>> False
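
If the exact width does not matter, one way around this caveat is to test for the dtype family instead of one concrete type. This is only a sketch of that idea, using `pd.api.types.is_integer_dtype` and `np.issubdtype`; the series names are made up.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_integer_dtype

s64 = pd.Series([1, 2], dtype="int64")
s32 = pd.Series([1, 2], dtype="int32")

# Exact comparison: only the int64 series matches np.int64.
exact_match = (s64.dtype == np.int64, s32.dtype == np.int64)

# Family checks: both series count as integers.
family_match = (is_integer_dtype(s64), is_integer_dtype(s32))
numpy_family = (np.issubdtype(s64.dtype, np.integer),
                np.issubdtype(s32.dtype, np.integer))

print(exact_match, family_match, numpy_family)
```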

2. isinstance() approach.

This method has not been mentioned in answers so far.

So if comparing types directly is not a good idea, let's try the built-in Python function for this purpose, namely isinstance().
It fails right at the start, because it assumes that we have some objects, whereas a pd.Series or pd.DataFrame may be just an empty container with a predefined dtype and no objects in it:

s = pd.Series([], dtype=bool)
s

>>> Series([], dtype: bool)

But if one somehow overcomes this issue and wants to access each object, for example in the first row, and check its dtype like this:

df = pd.DataFrame({'int': [12, 2], 'dt': [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')]},
                  index = ['A', 'B'])
for col in df.columns:
    df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)

>>> (dtype('int64'), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')

This is misleading when a single column holds mixed types of data:

df2 = pd.DataFrame({'data': [12, pd.Timestamp('2013-01-02')]},
                  index = ['A', 'B'])
for col in df2.columns:
    df2[col].dtype, 'is_int64 = %s' % isinstance(df2.loc['A', col], np.int64)

>>> (dtype('O'), 'is_int64 = False')

And last but not least, this method cannot directly recognize the Category dtype. As stated in the docs:

Returning a single item from categorical data will also return the value, not a categorical of length “1”.

df['int'] = df['int'].astype('category')
for col in df.columns:
    df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)

>>> (CategoricalDtype(categories=[2, 12], ordered=False), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')

So this method is also almost inapplicable.

3. df.dtype.kind approach.

This method may still work with an empty pd.Series or pd.DataFrame, but it has other problems.

First, it cannot distinguish some dtypes:

df = pd.DataFrame({'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                   'str'  :['s1', 's2'],
                   'cat'  :[1, -1]})
df['cat'] = df['cat'].astype('category')
for col in df:
    # kind will define all columns as 'Object'
    print (df[col].dtype, df[col].dtype.kind)

>>> period[D] O
>>> object O
>>> category O

Second, and this is still unclear to me, it even returns None for some dtypes.

4. df.select_dtypes approach.

This is almost what we want. The method is built into pandas, so it handles most of the corner cases mentioned earlier: empty DataFrames, and telling numpy dtypes apart from pandas-specific ones. It works well with a single dtype, such as .select_dtypes('bool'), and it can even select groups of columns based on dtype:

test = pd.DataFrame({'bool' :[False, True], 'int64':[-1,2], 'int32':[-1,2],'float': [-2.5, 3.4],
                     'compl':np.array([1-1j, 5]),
                     'dt'   :[pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')],
                     'td'   :[pd.Timestamp('2012-03-02')- pd.Timestamp('2016-10-20'),
                              pd.Timestamp('2010-07-12')- pd.Timestamp('2000-11-10')],
                     'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                     'intrv':pd.arrays.IntervalArray([pd.Interval(0, 0.1), pd.Interval(1, 5)]),
                     'str'  :['s1', 's2'],
                     'cat'  :[1, -1],
                     'obj'  :[[1,2,3], [5435,35,-52,14]]
                    })
test['int32'] = test['int32'].astype(np.int32)
test['cat'] = test['cat'].astype('category')

Like so, as stated in the docs:

test.select_dtypes('number')

>>>     int64   int32   float   compl   td
>>> 0      -1      -1   -2.5    (1-1j)  -1693 days
>>> 1       2       2    3.4    (5+0j)   3531 days

One may think that we see the first unexpected result here (it used to be for me: question): TimeDelta is included in the output DataFrame. But, as was answered there, this is how it should be; one just has to be aware of it. Note that the bool dtype is skipped, which may also be undesired for some, but that is because bool and number are in different “subtrees” of the numpy dtype hierarchy. For bool, we may use test.select_dtypes(['bool']) here.

The next restriction of this method is that with the current version of pandas (0.24.2), test.select_dtypes('period') raises NotImplementedError.

Another thing is that it cannot distinguish strings from other objects:

test.select_dtypes('object')

>>>     str     obj
>>> 0    s1     [1, 2, 3]
>>> 1    s2     [5435, 35, -52, 14]

But this is, first, already mentioned in the docs. And second, it is not a problem of this method but rather of the way strings are stored in a DataFrame. Either way, this case needs some post-processing.

5. pd.api.types.is_XXX_dtype approach.

This one is, I suppose, intended to be the most robust and native way to recognize dtypes (the path of the module where the functions reside speaks for itself). It works almost perfectly, but it still has at least one caveat, and string columns still have to be distinguished somehow.

Besides, and this may be subjective, this approach also handles the group of number dtypes in a more ‘human-understandable’ way than .select_dtypes('number'):

for col in test.columns:
    if pd.api.types.is_numeric_dtype(test[col]):
        print (test[col].dtype)

>>> bool
>>> int64
>>> int32
>>> float64
>>> complex128

timedelta is not included and bool is. Perfect.

My pipeline currently uses exactly this functionality, plus a bit of manual post-processing.

Hope I was able to argue the main point: all of the discussed approaches may be used, but only pd.DataFrame.select_dtypes() and pd.api.types.is_XXX_dtype should really be considered applicable.
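
As a rough illustration of what such a pipeline step could look like, here is a sketch. The function name `classify_columns`, the group labels, and the string check (every value in an object column is a `str`) are my own assumptions, not part of pandas; only the `pd.api.types` calls are library functions.

```python
import pandas as pd
from pandas.api.types import (
    is_bool_dtype,
    is_datetime64_any_dtype,
    is_numeric_dtype,
)

def classify_columns(df):
    """Group column names by a coarse dtype family (hypothetical helper)."""
    groups = {"bool": [], "numeric": [], "datetime": [], "string": [], "other": []}
    for col in df.columns:
        s = df[col]
        if is_bool_dtype(s):
            # Checked first because is_numeric_dtype also accepts bool.
            groups["bool"].append(col)
        elif is_numeric_dtype(s):
            groups["numeric"].append(col)
        elif is_datetime64_any_dtype(s):
            groups["datetime"].append(col)
        elif (s.dtype == object and len(s) > 0
              and s.map(lambda v: isinstance(v, str)).all()):
            # Post-processing step: object column holding only str values.
            groups["string"].append(col)
        else:
            groups["other"].append(col)
    return groups

df = pd.DataFrame({
    "flag": [True, False],
    "n": [1, 2],
    "x": [0.5, 1.5],
    "when": pd.to_datetime(["2013-01-02", "2016-10-20"]),
    "delta": pd.to_timedelta([1, 2], unit="D"),
    "name": pd.Series(["s1", "s2"], dtype=object),
    "lists": [[1, 2], [3]],
})
groups = classify_columns(df)
print(groups)
```

The element-wise string check scans the whole column, so on large frames it may be worth sampling instead; that trade-off is left open here.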


回答 4

如果要将数据框列的类型标记为字符串,则可以执行以下操作:

df['A'].dtype.kind

一个例子:

In [8]: df = pd.DataFrame([[1,'a',1.2],[2,'b',2.3]])
In [9]: df[0].dtype.kind, df[1].dtype.kind, df[2].dtype.kind
Out[9]: ('i', 'O', 'f')

您的代码的答案:

for y in agg.columns:
    if agg[y].dtype.kind in ('f', 'i'):
        treat_numeric(agg[y])
    else:
        treat_str(agg[y])

If you want the type of a dataframe column as a one-character string code, you can use:

df['A'].dtype.kind

An example:

In [8]: df = pd.DataFrame([[1,'a',1.2],[2,'b',2.3]])
In [9]: df[0].dtype.kind, df[1].dtype.kind, df[2].dtype.kind
Out[9]: ('i', 'O', 'f')

The answer for your code:

for y in agg.columns:
    if agg[y].dtype.kind in ('f', 'i'):
        treat_numeric(agg[y])
    else:
        treat_str(agg[y])
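
One thing to keep in mind with this check is that numpy uses separate kind codes for unsigned integers ('u') and complex numbers ('c'), so testing only 'f' and 'i' would send those columns to the string branch. A slightly broader variant might look like the sketch below; `is_numeric_kind` is a made-up helper name.

```python
import pandas as pd

def is_numeric_kind(series):
    # numpy kind codes: 'i' signed int, 'u' unsigned int, 'f' float, 'c' complex
    return series.dtype.kind in "iufc"

df = pd.DataFrame({
    "i": pd.Series([1, 2], dtype="int64"),
    "u": pd.Series([1, 2], dtype="uint8"),
    "f": [1.2, 2.3],
    "s": pd.Series(["a", "b"], dtype=object),
})
numeric_cols = [c for c in df.columns if is_numeric_kind(df[c])]
print(numeric_cols)
```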

Note:


回答 5

漂亮地打印列数据类型

在例如从文件导入后检查数据类型

def printColumnInfo(df):
    template="%-8s %-30s %s"
    print(template % ("Type", "Column Name", "Example Value"))
    print("-"*53)
    for c in df.columns:
        print(template % (df[c].dtype, c, df[c].iloc[1]) )

说明性输出:

Type     Column Name                    Example Value
-----------------------------------------------------
int64    Age                            49
object   Attrition                      No
object   BusinessTravel                 Travel_Frequently
float64  DailyRate                      279.0

To pretty-print the column data types

For example, to check the data types after an import from a file:

def printColumnInfo(df):
    template="%-8s %-30s %s"
    print(template % ("Type", "Column Name", "Example Value"))
    print("-"*53)
    for c in df.columns:
        print(template % (df[c].dtype, c, df[c].iloc[1]) )

Illustrative output:

Type     Column Name                    Example Value
-----------------------------------------------------
int64    Age                            49
object   Attrition                      No
object   BusinessTravel                 Travel_Frequently
float64  DailyRate                      279.0

从列中的字符串中删除不需要的部分

问题:从列中的字符串中删除不需要的部分

我正在寻找一种有效的方法来从DataFrame列的字符串中删除不需要的部分。

数据如下:

    time    result
1    09:00   +52A
2    10:00   +62B
3    11:00   +44a
4    12:00   +30b
5    13:00   -110a

我需要将这些数据修剪为:

    time    result
1    09:00   52
2    10:00   62
3    11:00   44
4    12:00   30
5    13:00   110

我试过了.str.lstrip('+-')str.rstrip('aAbBcC'),但出现错误:

TypeError: wrapper() takes exactly 1 argument (2 given)

任何指针将不胜感激!

I am looking for an efficient way to remove unwanted parts from strings in a DataFrame column.

Data looks like:

    time    result
1    09:00   +52A
2    10:00   +62B
3    11:00   +44a
4    12:00   +30b
5    13:00   -110a

I need to trim these data to:

    time    result
1    09:00   52
2    10:00   62
3    11:00   44
4    12:00   30
5    13:00   110

I tried .str.lstrip('+-') and .str.rstrip('aAbBcC'), but got an error:

TypeError: wrapper() takes exactly 1 argument (2 given)

Any pointers would be greatly appreciated!


回答 0

data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))

回答 1

如何从列的字符串中删除不需要的部分?

在最初提出问题的6年后,pandas现在具有大量的“向量化”字符串函数,可以简洁地执行这些字符串操作操作。

该答案将探索其中的一些字符串函数,提出更快的替代方法,最后进行时序比较。


.str.replace

指定要匹配的子字符串/样式,以及要替换为的子字符串。

pd.__version__
# '0.24.1'

df    
    time result
1  09:00   +52A
2  10:00   +62B
3  11:00   +44a
4  12:00   +30b
5  13:00  -110a

df['result'] = df['result'].str.replace(r'\D', '', regex=True)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

如果您需要将结果转换为整数,则可以使用Series.astype

df['result'] = df['result'].str.replace(r'\D', '', regex=True).astype(int)

df.dtypes
time      object
result     int64
dtype: object

如果您不想df就地修改,请使用DataFrame.assign

df2 = df.assign(result=df['result'].str.replace(r'\D', '', regex=True))
df
# Unchanged

.str.extract

对于提取要保留的子字符串很有用。

df['result'] = df['result'].str.extract(r'(\d+)', expand=False)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

使用extract,必须指定至少一个捕获组。expand=False将返回带有第一个捕获组中捕获项目的系列。


.str.split.str.get

假设您所有的字符串都遵循这种一致的结构,则拆分工作有效。

# df['result'] = df['result'].str.split(r'\D').str[1]
df['result'] = df['result'].str.split(r'\D').str.get(1)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

如果您正在寻找一般的解决方案,则不建议这样做。


如果您对str 上述基于简洁和可读的访问器的解决方案感到满意,则可以在此处停止。但是,如果您对更快,性能更高的替代产品感兴趣,请继续阅读。


优化:列表理解

在某些情况下,列表理解应优于熊猫字符串函数。原因是因为字符串函数本来就很难向量化(从字面意义上来说),所以大多数字符串和正则表达式函数只是循环包装,开销更大。

我写的文章,熊猫中的for循环真的不好吗?我什么时候应该在意?,详细介绍。

str.replace选项可以使用重写re.sub

import re

# Pre-compile your regex pattern for more performance.
p = re.compile(r'\D')
df['result'] = [p.sub('', x) for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

str.extract示例可以使用列表理解用来重写re.search

p = re.compile(r'\d+')
df['result'] = [p.search(x)[0] for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

如果可能出现NaN或不匹配的情况,则您需要重新编写上面的内容以包含一些错误检查。我使用一个函数来做到这一点。

def try_extract(pattern, string):
    try:
        m = pattern.search(string)
        return m.group(0)
    except (TypeError, ValueError, AttributeError):
        return np.nan

p = re.compile(r'\d+')
df['result'] = [try_extract(p, x) for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

我们还可以使用列表推导来重写@eumiro和@MonkeyButter的答案:

df['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in df['result']]

和,

df['result'] = [x[1:-1] for x in df['result']]

适用于处理NaN等的相同规则。


性能比较

使用perfplot生成的图。完整的代码清单,供您参考。相关功能在下面列出。

这些比较中的一些比较不公平,因为它们利用了OP数据的结构,但从中得到了好处。需要注意的一件事是,每个列表理解功能都比其等效的pandas变体更快或更可比。

功能

def eumiro(df):
    return df.assign(
        result=df['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC')))

def coder375(df):
    return df.assign(
        result=df['result'].replace(r'\D', r'', regex=True))

def monkeybutter(df):
    return df.assign(result=df['result'].map(lambda x: x[1:-1]))

def wes(df):
    return df.assign(result=df['result'].str.lstrip('+-').str.rstrip('aAbBcC'))

def cs1(df):
    return df.assign(result=df['result'].str.replace(r'\D', '', regex=True))

def cs2_ted(df):
    # `str.extract` based solution, similar to @Ted Petrou's. so timing together.
    return df.assign(result=df['result'].str.extract(r'(\d+)', expand=False))

def cs1_listcomp(df):
    return df.assign(result=[p1.sub('', x) for x in df['result']])

def cs2_listcomp(df):
    return df.assign(result=[p2.search(x)[0] for x in df['result']])

def cs_eumiro_listcomp(df):
    return df.assign(
        result=[x.lstrip('+-').rstrip('aAbBcC') for x in df['result']])

def cs_mb_listcomp(df):
    return df.assign(result=[x[1:-1] for x in df['result']])

How do I remove unwanted parts from strings in a column?

6 years after the original question was posted, pandas now has a good number of “vectorised” string functions that can succinctly perform these string manipulation operations.

This answer will explore some of these string functions, suggest faster alternatives, and go into a timings comparison at the end.


.str.replace

Specify the substring/pattern to match, and the substring to replace it with.

pd.__version__
# '0.24.1'

df    
    time result
1  09:00   +52A
2  10:00   +62B
3  11:00   +44a
4  12:00   +30b
5  13:00  -110a

df['result'] = df['result'].str.replace(r'\D', '', regex=True)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

If you need the result converted to an integer, you can use Series.astype,

df['result'] = df['result'].str.replace(r'\D', '', regex=True).astype(int)

df.dtypes
time      object
result     int64
dtype: object

If you don’t want to modify df in-place, use DataFrame.assign:

df2 = df.assign(result=df['result'].str.replace(r'\D', '', regex=True))
df
# Unchanged

.str.extract

Useful for extracting the substring(s) you want to keep.

df['result'] = df['result'].str.extract(r'(\d+)', expand=False)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

With extract, it is necessary to specify at least one capture group. expand=False will return a Series with the captured items from the first capture group.
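
A small sketch of the difference between the two `expand` settings, on a made-up series that also contains a value with no digits (which comes back as NaN):

```python
import pandas as pd

s = pd.Series(["+52A", "-110a", "n/a"])

# expand=False with a single capture group returns a Series.
as_series = s.str.extract(r"(\d+)", expand=False)

# expand=True always returns a DataFrame, one column per capture group.
as_frame = s.str.extract(r"(\d+)", expand=True)

print(type(as_series).__name__, type(as_frame).__name__)
print(as_series.tolist())
```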


.str.split and .str.get

Splitting works assuming all your strings follow this consistent structure.

# df['result'] = df['result'].str.split(r'\D').str[1]
df['result'] = df['result'].str.split(r'\D').str.get(1)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

This is not recommended if you are looking for a general solution.


If you are satisfied with the succinct and readable str accessor-based solutions above, you can stop here. However, if you are interested in faster, more performant alternatives, keep reading.


Optimizing: List Comprehensions

In some circumstances, list comprehensions should be favoured over pandas string functions. The reason is that string functions are inherently hard to vectorize (in the true sense of the word), so most string and regex functions are only wrappers around loops, with more overhead.

My write-up, Are for-loops in pandas really bad? When should I care?, goes into greater detail.

The str.replace option can be re-written using re.sub:

import re

# Pre-compile your regex pattern for more performance.
p = re.compile(r'\D')
df['result'] = [p.sub('', x) for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

The str.extract example can be re-written using a list comprehension with re.search:

p = re.compile(r'\d+')
df['result'] = [p.search(x)[0] for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

If NaNs or no-matches are a possibility, you will need to re-write the above to include some error checking. I do this using a function.

def try_extract(pattern, string):
    try:
        m = pattern.search(string)
        return m.group(0)
    except (TypeError, ValueError, AttributeError):
        return np.nan

p = re.compile(r'\d+')
df['result'] = [try_extract(p, x) for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

We can also re-write @eumiro’s and @MonkeyButter’s answers using list comprehensions:

df['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in df['result']]

And,

df['result'] = [x[1:-1] for x in df['result']]

Same rules for handling NaNs, etc, apply.
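
For instance, one way to guard the strip-based comprehension against NaNs is to pass non-string values through untouched. This is just a sketch; `strip_affixes` is a hypothetical helper, and leaving non-strings unchanged is an assumption about what you want.

```python
import numpy as np
import pandas as pd

def strip_affixes(value):
    # Assumption: non-string entries (NaN, None, numbers) are returned unchanged.
    if isinstance(value, str):
        return value.lstrip('+-').rstrip('aAbBcC')
    return value

df = pd.DataFrame({'result': ['+52A', np.nan, '-110a']})
df['result'] = [strip_affixes(x) for x in df['result']]
print(df)
```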


Performance Comparison

Graphs generated using perfplot. Full code listing, for your reference. The relevant functions are listed below.

Some of these comparisons are unfair because they take advantage of the structure of OP’s data, but take from it what you will. One thing to note is that every list comprehension function is either faster than or comparable to its equivalent pandas variant.

Functions

def eumiro(df):
    return df.assign(
        result=df['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC')))

def coder375(df):
    return df.assign(
        result=df['result'].replace(r'\D', r'', regex=True))

def monkeybutter(df):
    return df.assign(result=df['result'].map(lambda x: x[1:-1]))

def wes(df):
    return df.assign(result=df['result'].str.lstrip('+-').str.rstrip('aAbBcC'))

def cs1(df):
    return df.assign(result=df['result'].str.replace(r'\D', '', regex=True))

def cs2_ted(df):
    # `str.extract` based solution, similar to @Ted Petrou's. so timing together.
    return df.assign(result=df['result'].str.extract(r'(\d+)', expand=False))

def cs1_listcomp(df):
    return df.assign(result=[p1.sub('', x) for x in df['result']])

def cs2_listcomp(df):
    return df.assign(result=[p2.search(x)[0] for x in df['result']])

def cs_eumiro_listcomp(df):
    return df.assign(
        result=[x.lstrip('+-').rstrip('aAbBcC') for x in df['result']])

def cs_mb_listcomp(df):
    return df.assign(result=[x[1:-1] for x in df['result']])
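
Note that cs1_listcomp and cs2_listcomp refer to p1 and p2, which are not defined in the listing above. Presumably they are the two pre-compiled patterns from the list comprehension section; the sketch below runs the two functions under that assumption.

```python
import re
import pandas as pd

# Assumed definitions: the listing above uses p1 and p2 without defining them.
p1 = re.compile(r'\D')    # matches any non-digit character
p2 = re.compile(r'\d+')   # matches a run of digits

def cs1_listcomp(df):
    return df.assign(result=[p1.sub('', x) for x in df['result']])

def cs2_listcomp(df):
    return df.assign(result=[p2.search(x)[0] for x in df['result']])

df = pd.DataFrame({
    'time': ['09:00', '10:00', '11:00', '12:00', '13:00'],
    'result': ['+52A', '+62B', '+44a', '+30b', '-110a'],
})
out1 = cs1_listcomp(df)
out2 = cs2_listcomp(df)
print(out1['result'].tolist())
```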

回答 2

我会使用熊猫替换功能,因为您可以使用正则表达式,所以它非常简单而强大。在下面,我使用正则表达式\ D删除所有非数字字符,但显然,使用正则表达式可以变得很有创意。

data['result'] = data['result'].replace(to_replace=r'\D', value='', regex=True)

I’d use the pandas replace function; it is very simple and powerful because you can use regex. Below I’m using the regex \D to remove any non-digit characters, but obviously you could get quite creative with regex.

data['result'] = data['result'].replace(to_replace=r'\D', value='', regex=True)

回答 3

在特定情况下,如果您知道要从数据框列中删除的位置数,则可以在lambda函数内使用字符串索引来摆脱这些部分:

最后符:

data['result'] = data['result'].map(lambda x: str(x)[:-1])

前两个字符:

data['result'] = data['result'].map(lambda x: str(x)[2:])

In the particular case where you know the number of positions that you want to remove from the dataframe column, you can use string indexing inside a lambda function to get rid of those parts:

Last character:

data['result'] = data['result'].map(lambda x: str(x)[:-1])

First two characters:

data['result'] = data['result'].map(lambda x: str(x)[2:])

回答 4

这里有一个错误:目前无法将参数传递给str.lstripstr.rstrip

http://github.com/pydata/pandas/issues/2411

编辑:2012-12-07这现在可以在dev分支上工作:

In [8]: df['result'].str.lstrip('+-').str.rstrip('aAbBcC')
Out[8]: 
1     52
2     62
3     44
4     30
5    110
Name: result

There’s a bug here: currently cannot pass arguments to str.lstrip and str.rstrip:

http://github.com/pydata/pandas/issues/2411

EDIT: 2012-12-07 this works now on the dev branch:

In [8]: df['result'].str.lstrip('+-').str.rstrip('aAbBcC')
Out[8]: 
1     52
2     62
3     44
4     30
5    110
Name: result

回答 5

一种非常简单的方法是使用该extract方法选择所有数字。只需为其提供'\d+'可提取任意数字的正则表达式即可。

df['result'] = df.result.str.extract(r'(\d+)', expand=True).astype(int)
df

    time  result
1  09:00      52
2  10:00      62
3  11:00      44
4  12:00      30
5  13:00     110

A very simple method would be to use the extract method to select all the digits. Simply supply it the regular expression '\d+' which extracts any number of digits.

df['result'] = df.result.str.extract(r'(\d+)', expand=True).astype(int)
df

    time  result
1  09:00      52
2  10:00      62
3  11:00      44
4  12:00      30
5  13:00     110

回答 6

对于这些类型的任务,我经常使用列表推导,因为它们通常更快。

进行这种操作的各种方法(例如,修改DataFrame中序列的每个元素)的性能可能存在很大差异。通常,列表理解可能是最快的-有关此任务,请参见下面的代码竞赛:

import pandas as pd
#Map
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
10000 loops, best of 3: 187 µs per loop
#List comprehension
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in data['result']]
10000 loops, best of 3: 117 µs per loop
#.str
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].str.lstrip('+-').str.rstrip('aAbBcC')
1000 loops, best of 3: 336 µs per loop

I often use list comprehensions for these types of tasks because they’re often faster.

There can be big differences in performance between the various methods for doing things like this (i.e. modifying every element of a series within a DataFrame). Often a list comprehension can be fastest – see code race below for this task:

import pandas as pd
#Map
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
10000 loops, best of 3: 187 µs per loop
#List comprehension
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in data['result']]
10000 loops, best of 3: 117 µs per loop
#.str
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].str.lstrip('+-').str.rstrip('aAbBcC')
1000 loops, best of 3: 336 µs per loop
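
The %timeit lines above need IPython. Outside of it, roughly the same comparison can be run with the standard timeit module, as sketched below. The function names are mine, and unlike the snippet above the candidates do not assign back to the frame, so every run sees the same unstripped input. Absolute numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

def make_data():
    return pd.DataFrame({
        'time': ['09:00', '10:00', '11:00', '12:00', '13:00'],
        'result': ['+52A', '+62B', '+44a', '+30b', '-110a'],
    })

def with_map(data):
    return data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC')).tolist()

def with_listcomp(data):
    return [x.lstrip('+-').rstrip('aAbBcC') for x in data['result']]

def with_str(data):
    return data['result'].str.lstrip('+-').str.rstrip('aAbBcC').tolist()

data = make_data()
candidates = {'map': with_map, 'listcomp': with_listcomp, 'str': with_str}
runs = 200

# Total seconds for `runs` calls of each candidate on the same input.
timings = {
    name: timeit.timeit(lambda f=func: f(data), number=runs)
    for name, func in candidates.items()
}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {seconds / runs * 1e6:8.1f} µs per call")
```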

回答 7

假设您的DF在数字之间也有那些多余的字符。

  result   time
0   +52A  09:00
1   +62B  10:00
2   +44a  11:00
3   +30b  12:00
4  -110a  13:00
5   3+b0  14:00

您可以尝试str.replace删除字符,不仅从开头和结尾,而且从中间删除。

DF['result'] = DF['result'].str.replace(r'[+\-abAB]', '', regex=True)

输出:

  result   time
0     52  09:00
1     62  10:00
2     44  11:00
3     30  12:00
4    110  13:00
5     30  14:00

Suppose your DF also has those extra characters in between the digits, as in the last entry.

  result   time
0   +52A  09:00
1   +62B  10:00
2   +44a  11:00
3   +30b  12:00
4  -110a  13:00
5   3+b0  14:00

You can try str.replace to remove characters not only from start and end but also from in between.

DF['result'] = DF['result'].str.replace(r'[+\-abAB]', '', regex=True)

Output:

  result   time
0     52  09:00
1     62  10:00
2     44  11:00
3     30  12:00
4    110  13:00
5     30  14:00

回答 8

使用正则表达式尝试:

import re
data['result'] = data['result'].map(lambda x: re.sub(r'[-+A-Za-z]', '', x))

Try this using a regular expression:

import re
data['result'] = data['result'].map(lambda x: re.sub(r'[-+A-Za-z]', '', x))

检查变量是否为数据框

问题:检查变量是否为数据框

当我的函数f用一个变量调用时,我想检查var是否是一个熊猫数据框:

def f(var):
    if var == pd.DataFrame():
        print "do stuff"

我想解决方案可能很简单,但即使

def f(var):
    if var.values != None:
        print "do stuff"

我无法使其按预期方式工作。

When my function f is called with a variable, I want to check whether var is a pandas dataframe:

def f(var):
    if var == pd.DataFrame():
        print "do stuff"

I guess the solution might be quite simple but even with

def f(var):
    if var.values != None:
        print "do stuff"

I can’t get it to work like expected.


回答 0

使用isinstance,没有别的:

if isinstance(x, pd.DataFrame):
    ... # do something

PEP8明确表示这isinstance是检查类型的首选方法

No:  type(x) is pd.DataFrame
No:  type(x) == pd.DataFrame
Yes: isinstance(x, pd.DataFrame)

而且甚至不用考虑

if obj.__class__.__name__ == 'DataFrame':
    expect_problems_some_day()

isinstance处理继承(请参见type()和isinstance()之间的区别?)。例如,它会告诉你,如果一个变量是一个字符串(strunicode),因为他们从派生basestring

if isinstance(obj, basestring):
    i_am_string(obj)

专门针对pandas DataFrame对象:

import pandas as pd
isinstance(var, pd.DataFrame)

Use isinstance, nothing else:

if isinstance(x, pd.DataFrame):
    ... # do something

PEP8 says explicitly that isinstance is the preferred way to check types

No:  type(x) is pd.DataFrame
No:  type(x) == pd.DataFrame
Yes: isinstance(x, pd.DataFrame)

And don’t even think about

if obj.__class__.__name__ == 'DataFrame':
    expect_problems_some_day()

isinstance handles inheritance (see What are the differences between type() and isinstance()?). For example, in Python 2 it will tell you whether a variable is a string (either str or unicode), because both derive from basestring:

if isinstance(obj, basestring):  # Python 2 only; Python 3 has no basestring, use str
    i_am_string(obj)

Specifically for pandas DataFrame objects:

import pandas as pd
isinstance(var, pd.DataFrame)
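
To see why the inheritance handling matters, consider a DataFrame subclass. In this sketch `TaggedFrame` and `is_dataframe` are made-up names used only for illustration:

```python
import pandas as pd

class TaggedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass, used only to illustrate inheritance."""

def is_dataframe(var):
    return isinstance(var, pd.DataFrame)

plain = pd.DataFrame({"a": [1, 2]})
tagged = TaggedFrame({"a": [1, 2]})

print(is_dataframe(plain))              # plain DataFrame
print(is_dataframe(tagged))             # subclass instance is accepted too
print(type(tagged) is pd.DataFrame)     # the exact-type check rejects the subclass
print(is_dataframe(pd.Series([1, 2])))  # a Series is not a DataFrame
```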

回答 1

使用内置isinstance()功能。

import pandas as pd

def f(var):
    if isinstance(var, pd.DataFrame):
        print("do stuff")

Use the built-in isinstance() function.

import pandas as pd

def f(var):
    if isinstance(var, pd.DataFrame):
        print("do stuff")

将熊猫数据框列表连接在一起

问题:将熊猫数据框列表连接在一起

我有一个熊猫数据框列表,我想将其合并为一个熊猫数据框。我正在使用Python 2.7.10和Pandas 0.16.2

我从以下位置创建了数据框列表:

import pandas as pd
dfs = []
sqlall = "select * from mytable"

for chunk in pd.read_sql_query(sqlall , cnxn, chunksize=10000):
    dfs.append(chunk)

这将返回数据帧列表

type(dfs[0])
Out[6]: pandas.core.frame.DataFrame

type(dfs)
Out[7]: list

len(dfs)
Out[8]: 408

这是一些示例数据

# sample dataframes
d1 = pd.DataFrame({'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]})
d2 = pd.DataFrame({'one' : [5., 6., 7., 8.], 'two' : [9., 10., 11., 12.]})
d3 = pd.DataFrame({'one' : [15., 16., 17., 18.], 'two' : [19., 10., 11., 12.]})

# list of dataframes
mydfs = [d1, d2, d3]

我想将d1d2和组合d3成一个熊猫数据框。另外,使用该chunksize选项时将大表直接读入数据框的方法将非常有帮助。

I have a list of Pandas dataframes that I would like to combine into one Pandas dataframe. I am using Python 2.7.10 and Pandas 0.16.2

I created the list of dataframes from:

import pandas as pd
dfs = []
sqlall = "select * from mytable"

for chunk in pd.read_sql_query(sqlall , cnxn, chunksize=10000):
    dfs.append(chunk)

This returns a list of dataframes

type(dfs[0])
Out[6]: pandas.core.frame.DataFrame

type(dfs)
Out[7]: list

len(dfs)
Out[8]: 408

Here is some sample data

# sample dataframes
d1 = pd.DataFrame({'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]})
d2 = pd.DataFrame({'one' : [5., 6., 7., 8.], 'two' : [9., 10., 11., 12.]})
d3 = pd.DataFrame({'one' : [15., 16., 17., 18.], 'two' : [19., 10., 11., 12.]})

# list of dataframes
mydfs = [d1, d2, d3]

I would like to combine d1, d2, and d3 into one pandas dataframe. Alternatively, a method of reading a large-ish table directly into a dataframe when using the chunksize option would be very helpful.


回答 0

鉴于所有数据框都具有相同的列,您可以简单地将concat它们:

import pandas as pd
df = pd.concat(list_of_dataframes)

Given that all the dataframes have the same columns, you can simply concat them:

import pandas as pd
df = pd.concat(list_of_dataframes)
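
Applied to the sample frames from the question, this could look like the sketch below. `ignore_index=True` is my addition so that the result gets a fresh 0..n-1 index instead of repeating each chunk's own index. The second half imitates the chunked read with an in-memory SQLite table; the table contents are made up, and the question's `cnxn` connection is replaced by `sqlite3` purely so the example is self-contained.

```python
import sqlite3
import pandas as pd

d1 = pd.DataFrame({'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]})
d2 = pd.DataFrame({'one': [5., 6., 7., 8.], 'two': [9., 10., 11., 12.]})
d3 = pd.DataFrame({'one': [15., 16., 17., 18.], 'two': [19., 10., 11., 12.]})
mydfs = [d1, d2, d3]

# Stack the frames vertically and renumber the rows.
combined = pd.concat(mydfs, ignore_index=True)

# Chunked read: collect the chunks, then concatenate them in one call.
conn = sqlite3.connect(':memory:')
pd.DataFrame({'one': range(10)}).to_sql('mytable', conn, index=False)
chunks = pd.read_sql_query('select * from mytable', conn, chunksize=4)
from_chunks = pd.concat(list(chunks), ignore_index=True)
conn.close()

print(combined.shape, from_chunks.shape)
```

Concatenating once at the end is generally cheaper than appending to a growing frame inside the loop.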

回答 1

如果数据帧的所有列都不相同,请尝试以下操作:

df = pd.DataFrame.from_dict(map(dict,df_list))

If the dataframes DO NOT all have the same columns try the following:

df = pd.DataFrame.from_dict(map(dict,df_list))
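
As an alternative worth checking first, pd.concat itself accepts frames whose columns differ: by default it takes the union of the columns and fills the missing cells with NaN. A minimal sketch with two made-up frames:

```python
import pandas as pd

a = pd.DataFrame({'one': [1, 2], 'two': [3, 4]})
b = pd.DataFrame({'one': [5], 'three': [6]})

# Union of columns; cells absent from one of the inputs become NaN.
combined = pd.concat([a, b], ignore_index=True, sort=False)
print(combined)
```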

回答 2

您也可以使用函数式编程来做到这一点:

from functools import reduce
reduce(lambda df1, df2: df1.merge(df2, how="outer"), mydfs)

You can also do it with functional programming:

from functools import reduce
reduce(lambda df1, df2: df1.merge(df2, how="outer"), mydfs)

Answer 3

concat also works nicely with a list comprehension that selects rows with loc from an existing dataframe:

df = pd.read_csv('./data.csv')  # DataFrame read from a CSV file with a "userID" column

review_ids = ['1', '2', '3']  # ID values to select from the DataFrame

# Select the rows whose userID matches each ID, then combine them

dfa = pd.concat([df.loc[df['userID'] == x] for x in review_ids])

Removing the index column in pandas when reading a CSV

Question: Removing the index column in pandas when reading a CSV

I have the following code which imports a CSV file. There are 3 columns and I want to set the first two of them to variables. When I set the second column to the variable “efficiency” the index column is also tacked on. How can I get rid of the index column?

df = pd.DataFrame.from_csv('Efficiency_Data.csv', header=0, parse_dates=False)
energy = df.index
efficiency = df.Efficiency
print efficiency

I tried using

del df['index']

after I set

energy = df.index

which I found in another post, but that results in "KeyError: 'index'".


Answer 0

DataFrames and Series always have an index. Although it displays alongside the column(s), it is not a column, which is why del df['index'] did not work.

If you want to replace the index with simple sequential numbers, use df.reset_index().

To get a sense for why the index is there and how it is used, see e.g. 10 minutes to Pandas.
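
A small sketch of the distinction, assuming made-up data in place of Efficiency_Data.csv (the values and the "Energy" index name are hypothetical):

```python
import pandas as pd

# Hypothetical data standing in for Efficiency_Data.csv
df = pd.DataFrame(
    {"Efficiency": [0.91, 0.87, 0.80]},
    index=pd.Index([10, 20, 30], name="Energy"),
)

# The index is not a column, so it cannot be deleted like one
assert "Energy" not in df.columns

# The plain values, without the index attached
efficiency_values = df["Efficiency"].to_numpy()

# Replace the index with 0, 1, 2, ... and discard the old labels
df_reset = df.reset_index(drop=True)
```

Without drop=True, reset_index moves the old index into a regular column instead of discarding it.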


Answer 1

When writing your DataFrame to a CSV file, include the argument index=False, for example:

df.to_csv(filename, index=False)

and to read from the CSV (read_csv is a top-level pandas function, and it takes index_col rather than index):

df = pd.read_csv(filename, index_col=False)

This should prevent the issue so you don’t need to fix it later.
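
A round trip can be sketched as follows, assuming an in-memory buffer in place of a real file and made-up column values:

```python
import io

import pandas as pd

df = pd.DataFrame({"Energy": [10, 20, 30], "Efficiency": [0.91, 0.87, 0.80]})

buffer = io.StringIO()          # stands in for a file on disk
df.to_csv(buffer, index=False)  # do not write the index as an extra column
buffer.seek(0)

df_back = pd.read_csv(buffer)   # a fresh 0..n-1 index is created on read
```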


Answer 2

df.reset_index(drop=True, inplace=True)


Answer 3

You can set one of the columns as an index in case it is an “id” for example. In this case the index column will be replaced by one of the columns you have chosen.

df.set_index('id', inplace=True)

Answer 4

If your problem is the same as mine, where you just want to reset the column headers to run from 0 to the number of columns, do:

df = pd.DataFrame(df.values)

EDIT:

Not a good idea if you have heterogeneous data types. Better to just use

df.columns = range(len(df.columns))

Answer 5

You can specify which column of your CSV file is the index with the index_col parameter of from_csv (pd.read_csv takes the same parameter, and from_csv was removed in pandas 1.0). If this doesn’t solve your problem, please provide an example of your data.


Answer 6

One thing that I do is df = df.reset_index() followed by df = df.drop(['index'], axis=1).


How to plot separate Pandas DataFrames as subplots?

Question: How to plot separate Pandas DataFrames as subplots?

I have a few Pandas DataFrames sharing the same value scale, but having different columns and indices. When invoking df.plot(), I get separate plot images. What I really want is to have them all in the same plot as subplots, but I have not been able to work out how and would highly appreciate some help.


Answer 0

You can manually create the subplots with matplotlib, and then plot the dataframes on a specific subplot using the ax keyword. For example for 4 subplots (2×2):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2, ncols=2)

df1.plot(ax=axes[0,0])
df2.plot(ax=axes[0,1])
...

Here axes is an array which holds the different subplot axes, and you can access one just by indexing axes.
If you want a shared x-axis, then you can provide sharex=True to plt.subplots.
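
A complete version of this idea might look like the sketch below. The random sample frames and the non-interactive Agg backend are assumptions made so that the example runs on its own; they are not part of the answer.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up sample data: four frames with two columns each
rng = np.random.default_rng(0)
frames = [pd.DataFrame(rng.random((10, 2)), columns=["A", "B"]) for _ in range(4)]

fig, axes = plt.subplots(nrows=2, ncols=2, sharex=True)

# axes is a 2x2 array; axes.flat walks it row by row, one subplot per frame
for ax, frame in zip(axes.flat, frames):
    frame.plot(ax=ax)
```

Iterating over axes.flat avoids keeping separate row and column counters when the number of frames matches the grid.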


Answer 1

You can see examples in the documentation demonstrating joris's answer. Also from the documentation, you can set subplots=True and layout=(rows, cols) within the pandas plot function:

df.plot(subplots=True, layout=(1,2))

You could also use fig.add_subplot() which takes subplot grid parameters such as 221, 222, 223, 224, etc. as described in the post here. Nice examples of plot on pandas data frame, including subplots, can be seen in this ipython notebook.


Answer 2

You can use the familiar Matplotlib style calling a figure and subplot, but you simply need to specify the current axis using plt.gca(). An example:

plt.figure(1)
plt.subplot(2,2,1)
df.A.plot() #no need to specify for first axis
plt.subplot(2,2,2)
df.B.plot(ax=plt.gca())
plt.subplot(2,2,3)
df.C.plot(ax=plt.gca())

etc…


Answer 3

You can plot multiple pandas data frames as subplots with matplotlib by using a simple trick: make a list of all the data frames, then plot them in a for loop.

Working code:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# dataframe sample data
df1 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df2 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df3 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df4 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df5 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df6 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
#define number of rows and columns for subplots
nrow=3
ncol=2
# make a list of all dataframes 
df_list = [df1 ,df2, df3, df4, df5, df6]
fig, axes = plt.subplots(nrow, ncol)
# plot counter
count = 0
for r in range(nrow):
    for c in range(ncol):
        df_list[count].plot(ax=axes[r, c])
        count += 1  # advance to the next dataframe

With this code you can plot subplots in any configuration. You only need to define the number of rows nrow and the number of columns ncol, and make a list df_list of the data frames you want to plot.


Answer 4

You can use this:

fig = plt.figure()
ax = fig.add_subplot(221)
plt.plot(x,y)

ax = fig.add_subplot(222)
plt.plot(x,z)
...

plt.show()

Answer 5

You may not need to use Pandas at all. Here’s a matplotlib plot of cat frequencies:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2*np.pi, 400)
y = np.sin(x**2)

f, axes = plt.subplots(2, 1)
for ax in axes:  # axes is a 1-D array holding the two subplot axes
    ax.plot(x, y)
    ax.set_title('cats')
plt.tight_layout()

Answer 6

Building on @joris's response above, if you have already established a reference to the subplot, you can use that reference as well. For example,

ax1 = plt.subplot2grid((50,100), (0, 0), colspan=20, rowspan=10)
...

df.plot.barh(ax=ax1, stacked=True)

Answer 7

How to create multiple plots from a dictionary of dataframes with long (tidy) data

  • Assumptions

    • There is a dictionary of multiple dataframes of tidy data
      • Created by reading in from files
      • Created by separating a single dataframe into multiple dataframes
    • The categories, cat, may overlap, but not every dataframe necessarily contains every value of cat
    • hue='cat'
  • Because the dataframes are being iterated through, there’s no guarantee that colors will be mapped the same way for each plot

    • A custom color map needs to be created from the unique 'cat' values for all the dataframes
    • Since the colors will be the same, place one legend to the side of the plots, instead of a legend in every plot

Imports and synthetic data

import pandas as pd
import numpy as np  # used for random data
import random  # used for random data
import matplotlib.pyplot as plt
from matplotlib.patches import Patch  # for custom legend
import seaborn as sns
import math  # math.ceil determines the number of subplot rows


# synthetic data
df_dict = dict()
for i in range(1, 7):
    np.random.seed(i)
    random.seed(i)
    data_length = 100
    data = {'cat': [random.choice(['A', 'B', 'C']) for _ in range(data_length)],
            'x': np.random.rand(data_length),
            'y': np.random.rand(data_length)}
    df_dict[i] = pd.DataFrame(data)


# display(df_dict[1].head())

  cat         x         y
0   A  0.417022  0.326645
1   C  0.720324  0.527058
2   A  0.000114  0.885942
3   B  0.302333  0.357270
4   A  0.146756  0.908535

Create color mappings and plot

# create color mapping based on all unique values of cat
unique_cat = {cat for v in df_dict.values() for cat in v.cat.unique()}  # get unique cats
colors = sns.color_palette('husl', n_colors=len(unique_cat))  # get a number of colors
cmap = dict(zip(unique_cat, colors))  # zip values to colors

# iterate through dictionary and plot
col_nums = 3  # how many plots per row
row_nums = math.ceil(len(df_dict) / col_nums)  # how many rows of plots
plt.figure(figsize=(10, 5))  # change the figure size as needed
for i, (k, v) in enumerate(df_dict.items(), 1):
    plt.subplot(row_nums, col_nums, i)  # create subplots
    p = sns.scatterplot(data=v, x='x', y='y', hue='cat', palette=cmap)
    p.legend_.remove()  # remove the individual plot legends
    plt.title(f'DataFrame: {k}')

plt.tight_layout()
# create legend from cmap
patches = [Patch(color=v, label=k) for k, v in cmap.items()]
# place legend outside of plot; change the right bbox value to move the legend up or down
plt.legend(handles=patches, bbox_to_anchor=(1.06, 1.2), loc='center left', borderaxespad=0)
plt.show()


Replacing Pandas or Numpy NaN with None to use with MysqlDB

Question: Replacing Pandas or Numpy NaN with None to use with MysqlDB

I am trying to write a Pandas dataframe (or can use a numpy array) to a MySQL database using MysqlDB. MysqlDB doesn’t seem to understand ‘nan’, and my database throws an error saying nan is not in the field list. I need to find a way to convert the ‘nan’ into a NoneType.

Any ideas?


Answer 0

@bogatron has it right: you can use where. It’s worth noting that you can do this natively in pandas:

df1 = df.where(pd.notnull(df), None)

Note: this changes the dtype of all columns to object.

Example:

In [1]: df = pd.DataFrame([1, np.nan])

In [2]: df
Out[2]: 
    0
0   1
1 NaN

In [3]: df1 = df.where(pd.notnull(df), None)

In [4]: df1
Out[4]: 
      0
0     1
1  None

Note: what you cannot do is recast the DataFrame's dtype to object with astype, so that it allows all data types, and then use the DataFrame fillna method with None. The closest replace call inserts the string 'None', not the None object:

df1 = df.astype(object).replace(np.nan, 'None')

Unfortunately neither this nor using replace works with None; see this (closed) issue.


As an aside, it’s worth noting that for most use cases you don’t need to replace NaN with None, see this question about the difference between NaN and None in pandas.

However, in this specific case it seems you do (at least at the time of this answer).


Answer 1

df = df.replace({np.nan: None})

Credit goes to this guy here on this Github issue.


Answer 2

You can replace nan with None in your numpy array:

>>> x = np.array([1, np.nan, 3])
>>> y = np.where(np.isnan(x), None, x)
>>> print(y)
[1.0 None 3.0]
>>> print(type(y[1]))
<class 'NoneType'>

Answer 3

After stumbling around, this worked for me:

df = df.astype(object).where(pd.notnull(df),None)
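
A minimal sketch of this approach end to end, assuming a small made-up frame; the final step builds row tuples of the kind a DB-API executemany() call takes, which is an assumption about how the data would be handed to MysqlDB.

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric and one string column, each with a missing value
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})

# Cast to object first so the columns can hold None, then replace missing values
df_none = df.astype(object).where(df.notna(), None)

# Rows as plain tuples, with None wherever a value was missing
rows = list(df_none.itertuples(index=False, name=None))
```

Casting to object first matters because a float column cannot store None; without the cast the replacement may be coerced back to NaN.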

Answer 4

Just an addition to @Andy Hayden’s answer:

Since DataFrame.mask is the opposite twin of DataFrame.where, they have exactly the same signature but opposite meanings:

  • DataFrame.where replaces values where the condition is False.
  • DataFrame.mask replaces values where the condition is True.

So in this question, using df.mask(df.isna(), other=None, inplace=True) might be more intuitive.


Answer 5

Another addition: be careful when replacing multiple values, since this can convert the column's type back from object to float. If you want to be certain that your None values won’t flip back to np.NaN, apply @andy-hayden’s suggestion of using pd.where. An illustration of how replace can still go ‘wrong’:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame({"a": [1, np.NAN, np.inf]})

In [4]: df
Out[4]:
     a
0  1.0
1  NaN
2  inf

In [5]: df.replace({np.NAN: None})
Out[5]:
      a
0     1
1  None
2   inf

In [6]: df.replace({np.NAN: None, np.inf: None})
Out[6]:
     a
0  1.0
1  NaN
2  NaN

In [7]: df.where((pd.notnull(df)), None).replace({np.inf: None})
Out[7]:
     a
0  1.0
1  NaN
2  NaN

Answer 6

Quite old, yet I stumbled upon the very same issue. Try doing this:

df['col_replaced'] = df['col_with_npnans'].apply(lambda x: None if np.isnan(x) else x)