Tag Archives: pandas

Cartesian product in pandas

Question: Cartesian product in pandas

I have two pandas dataframes:

from pandas import DataFrame
df1 = DataFrame({'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'col3':[5,6]})     

What is the best practice to get their cartesian product (of course without writing it explicitly like me)?

#df1, df2 cartesian product
df_cartesian = DataFrame({'col1':[1,2,1,2],'col2':[3,4,3,4],'col3':[5,5,6,6]})

Answer 0

If you have a key that is repeated for each row, then you can produce a cartesian product using merge (like you would in SQL).

from pandas import DataFrame, merge
df1 = DataFrame({'key':[1,1], 'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'key':[1,1], 'col3':[5,6]})

merge(df1, df2,on='key')[['col1', 'col2', 'col3']]

Output:

   col1  col2  col3
0     1     3     5
1     1     3     6
2     2     4     5
3     2     4     6

See here for the documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html#brief-primer-on-merge-methods-relational-algebra


Answer 1

Use pd.MultiIndex.from_product as an index in an otherwise empty dataframe, then reset its index, and you’re done.

a = [1, 2, 3]
b = ["a", "b", "c"]

index = pd.MultiIndex.from_product([a, b], names = ["a", "b"])

pd.DataFrame(index = index).reset_index()

out:

   a  b
0  1  a
1  1  b
2  1  c
3  2  a
4  2  b
5  2  c
6  3  a
7  3  b
8  3  c

Answer 2

This won’t win a code golf competition, and borrows from the previous answers – but clearly shows how the key is added, and how the join works. This creates 2 new data frames from lists, then adds the key to do the cartesian product on.

My use case was that I needed a list of all store IDs for each week in my list. So, I created a list of all the weeks I wanted to have, then a list of all the store IDs I wanted to map them against.

The merge I chose is a left merge, but it would be semantically the same as an inner merge in this setup. You can see this in the documentation on merging, which states that it does a Cartesian product if a key combination appears more than once in both tables – which is what we set up.

days = pd.DataFrame({'date':list_of_days})
stores = pd.DataFrame({'store_id':list_of_stores})
stores['key'] = 0
days['key'] = 0
days_and_stores = days.merge(stores, how='left', on = 'key')
days_and_stores.drop('key',1, inplace=True)

Answer 3

Minimal code needed for this one. Create a common ‘key’ to cartesian merge the two:

df1['key'] = 0
df2['key'] = 0

df_cartesian = df1.merge(df2, how='outer')

Answer 4

With method chaining:

product = (
    df1.assign(key=1)
    .merge(df2.assign(key=1), on="key")
    .drop("key", axis=1)
)

Answer 5

As an alternative, one can rely on the cartesian product provided by itertools: itertools.product, which avoids creating a temporary key or modifying the index:

import numpy as np 
import pandas as pd 
import itertools

def cartesian(df1, df2):
    rows = itertools.product(df1.iterrows(), df2.iterrows())

    df = pd.DataFrame(left.append(right) for (_, left), (_, right) in rows)
    return df.reset_index(drop=True)

Quick test:

In [46]: a = pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])

In [47]: b = pd.DataFrame(np.random.rand(5, 3), columns=["d", "e", "f"])    

In [48]: cartesian(a,b)
Out[48]:
           a         b         c         d         e         f
0   0.436480  0.068491  0.260292  0.991311  0.064167  0.715142
1   0.436480  0.068491  0.260292  0.101777  0.840464  0.760616
2   0.436480  0.068491  0.260292  0.655391  0.289537  0.391893
3   0.436480  0.068491  0.260292  0.383729  0.061811  0.773627
4   0.436480  0.068491  0.260292  0.575711  0.995151  0.804567
5   0.469578  0.052932  0.633394  0.991311  0.064167  0.715142
6   0.469578  0.052932  0.633394  0.101777  0.840464  0.760616
7   0.469578  0.052932  0.633394  0.655391  0.289537  0.391893
8   0.469578  0.052932  0.633394  0.383729  0.061811  0.773627
9   0.469578  0.052932  0.633394  0.575711  0.995151  0.804567
10  0.466813  0.224062  0.218994  0.991311  0.064167  0.715142
11  0.466813  0.224062  0.218994  0.101777  0.840464  0.760616
12  0.466813  0.224062  0.218994  0.655391  0.289537  0.391893
13  0.466813  0.224062  0.218994  0.383729  0.061811  0.773627
14  0.466813  0.224062  0.218994  0.575711  0.995151  0.804567
15  0.831365  0.273890  0.130410  0.991311  0.064167  0.715142
16  0.831365  0.273890  0.130410  0.101777  0.840464  0.760616
17  0.831365  0.273890  0.130410  0.655391  0.289537  0.391893
18  0.831365  0.273890  0.130410  0.383729  0.061811  0.773627
19  0.831365  0.273890  0.130410  0.575711  0.995151  0.804567
20  0.447640  0.848283  0.627224  0.991311  0.064167  0.715142
21  0.447640  0.848283  0.627224  0.101777  0.840464  0.760616
22  0.447640  0.848283  0.627224  0.655391  0.289537  0.391893
23  0.447640  0.848283  0.627224  0.383729  0.061811  0.773627
24  0.447640  0.848283  0.627224  0.575711  0.995151  0.804567

Answer 6

If you have no overlapping columns, don’t want to add one, and the indices of the data frames can be discarded, this may be easier:

df1.index[:] = df2.index[:] = 0
df_cartesian = df1.join(df2, how='outer')
df_cartesian.index[:] = range(len(df_cartesian))

Answer 7

Here is a helper function to perform a simple Cartesian product with two data frames. The internal logic handles using an internal key, and avoids mangling any columns that happen to be named “key” from either side.

import pandas as pd

def cartesian(df1, df2):
    """Determine Cartesian product of two data frames."""
    key = 'key'
    while key in df1.columns or key in df2.columns:
        key = '_' + key
    key_d = {key: 0}
    return pd.merge(
        df1.assign(**key_d), df2.assign(**key_d), on=key).drop(key, axis=1)

# Two data frames, where the first happens to have a 'key' column
df1 = pd.DataFrame({'number':[1, 2], 'key':[3, 4]})
df2 = pd.DataFrame({'digit': [5, 6]})
cartesian(df1, df2)

shows:

   number  key  digit
0       1    3      5
1       1    3      6
2       2    4      5
3       2    4      6

Answer 8

You could start by taking the Cartesian product of df1.col1 and df2.col3, then merge back to df1 to get col2.

Here’s a general Cartesian product function which takes a dictionary of lists:

def cartesian_product(d):
    index = pd.MultiIndex.from_product(d.values(), names=d.keys())
    return pd.DataFrame(index=index).reset_index()

Apply as:

res = cartesian_product({'col1': df1.col1, 'col3': df2.col3})
pd.merge(res, df1, on='col1')
#  col1 col3 col2
# 0   1    5    3
# 1   1    6    3
# 2   2    5    4
# 3   2    6    4

Answer 9

You can use numpy as it could be faster. Suppose you have two series as follows,

s1 = pd.Series(np.random.randn(100,))
s2 = pd.Series(np.random.randn(100,))

You just need,

pd.DataFrame(
    s1[:, None] @ s2[None, :], 
    index = s1.index, columns = s2.index
)

Answer 10

I find using pandas MultiIndex to be the best tool for the job. If you have a list of lists lists_list, call pd.MultiIndex.from_product(lists_list) and iterate over the result (or use it in DataFrame index).
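
For instance, a minimal sketch of that idea applied to the two frames from the question (and then merging back to recover col2, as in Answer 8 above) might look like this; the variable names are only illustrative:

import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame({'col3': [5, 6]})

# Build a MultiIndex over the value combinations, then flatten it back into columns.
idx = pd.MultiIndex.from_product([df1['col1'], df2['col3']], names=['col1', 'col3'])
pairs = pd.DataFrame(index=idx).reset_index()

# Merge back onto df1 to pick up col2.
df_cartesian = pairs.merge(df1, on='col1')[['col1', 'col2', 'col3']]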


Get the total of a pandas column

Question: Get the total of a pandas column

Target

I have a Pandas data frame, as shown below, with multiple columns and would like to get the total of column, MyColumn.


Data Frame df:

print df

           X           MyColumn  Y              Z   
0          A           84        13.0           69.0   
1          B           76         77.0          127.0   
2          C           28         69.0           16.0   
3          D           28         28.0           31.0   
4          E           19         20.0           85.0   
5          F           84        193.0           70.0   

My attempt:

I have attempted to get the sum of the column using groupby and .sum():

Total = df.groupby['MyColumn'].sum()

print Total

This causes the following error:

TypeError: 'instancemethod' object has no attribute '__getitem__'

Expected Output

I’d have expected the output to be as follows:

319

Or alternatively, I would like df to be edited with a new row entitled TOTAL containing the total:

           X           MyColumn  Y              Z   
0          A           84        13.0           69.0   
1          B           76         77.0          127.0   
2          C           28         69.0           16.0   
3          D           28         28.0           31.0   
4          E           19         20.0           85.0   
5          F           84        193.0           70.0   
TOTAL                  319

Answer 0

You should use sum:

Total = df['MyColumn'].sum()
print (Total)
319

Then you use loc with Series, in that case the index should be set as the same as the specific column you need to sum:

df.loc['Total'] = pd.Series(df['MyColumn'].sum(), index = ['MyColumn'])
print (df)
         X  MyColumn      Y      Z
0        A      84.0   13.0   69.0
1        B      76.0   77.0  127.0
2        C      28.0   69.0   16.0
3        D      28.0   28.0   31.0
4        E      19.0   20.0   85.0
5        F      84.0  193.0   70.0
Total  NaN     319.0    NaN    NaN

because if you pass a scalar, all the values in that row will be filled:

df.loc['Total'] = df['MyColumn'].sum()
print (df)
         X  MyColumn      Y      Z
0        A        84   13.0   69.0
1        B        76   77.0  127.0
2        C        28   69.0   16.0
3        D        28   28.0   31.0
4        E        19   20.0   85.0
5        F        84  193.0   70.0
Total  319       319  319.0  319.0

Two other solutions are with at and ix; see the applications below:

df.at['Total', 'MyColumn'] = df['MyColumn'].sum()
print (df)
         X  MyColumn      Y      Z
0        A      84.0   13.0   69.0
1        B      76.0   77.0  127.0
2        C      28.0   69.0   16.0
3        D      28.0   28.0   31.0
4        E      19.0   20.0   85.0
5        F      84.0  193.0   70.0
Total  NaN     319.0    NaN    NaN

df.ix['Total', 'MyColumn'] = df['MyColumn'].sum()
print (df)
         X  MyColumn      Y      Z
0        A      84.0   13.0   69.0
1        B      76.0   77.0  127.0
2        C      28.0   69.0   16.0
3        D      28.0   28.0   31.0
4        E      19.0   20.0   85.0
5        F      84.0  193.0   70.0
Total  NaN     319.0    NaN    NaN

Note: Since Pandas v0.20, ix has been deprecated. Use loc or iloc instead.


Answer 1

Another option you can go with here:

df.loc["Total", "MyColumn"] = df.MyColumn.sum()

#         X  MyColumn      Y       Z
#0        A     84.0    13.0    69.0
#1        B     76.0    77.0   127.0
#2        C     28.0    69.0    16.0
#3        D     28.0    28.0    31.0
#4        E     19.0    20.0    85.0
#5        F     84.0   193.0    70.0
#Total  NaN    319.0     NaN     NaN

You can also use append() method:

df.append(pd.DataFrame(df.MyColumn.sum(), index = ["Total"], columns=["MyColumn"]))


Update:

In case you need to append the sum for all numeric columns, you can do one of the following:

Use append to do this in a functional manner (doesn’t change the original data frame):

# select numeric columns and calculate the sums
sums = df.select_dtypes(pd.np.number).sum().rename('total')

# append sums to the data frame
df.append(sums)
#         X  MyColumn      Y      Z
#0        A      84.0   13.0   69.0
#1        B      76.0   77.0  127.0
#2        C      28.0   69.0   16.0
#3        D      28.0   28.0   31.0
#4        E      19.0   20.0   85.0
#5        F      84.0  193.0   70.0
#total  NaN     319.0  400.0  398.0

Use loc to mutate data frame in place:

df.loc['total'] = df.select_dtypes(pd.np.number).sum()
df
#         X  MyColumn      Y      Z
#0        A      84.0   13.0   69.0
#1        B      76.0   77.0  127.0
#2        C      28.0   69.0   16.0
#3        D      28.0   28.0   31.0
#4        E      19.0   20.0   85.0
#5        F      84.0  193.0   70.0
#total  NaN     638.0  800.0  796.0

Answer 2

Similar to getting the length of a dataframe, len(df), the following worked for pandas and blaze:

Total = sum(df['MyColumn'])

or alternatively

Total = sum(df.MyColumn)
print Total
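
One caveat worth noting with the built-in sum: it propagates NaN, whereas the pandas Series.sum skips missing values by default (skipna=True), so the two can disagree on columns that contain NaN. A small illustrative sketch:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan])
print(sum(s))   # nan  -- the Python built-in propagates the missing value
print(s.sum())  # 3.0  -- pandas skips NaN by default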

Answer 3

There are two ways to sum a column:

dataset = pd.read_csv("data.csv")

1: sum(dataset.Column_name)

2: dataset['Column_Name'].sum()

If there is any issue with this, please correct me.


Answer 4

As another option, you can do something like the following:

Group   Valuation   amount
    0   BKB Tube    156
    1   BKB Tube    143
    2   BKB Tube    67
    3   BAC Tube    176
    4   BAC Tube    39
    5   JDK Tube    75
    6   JDK Tube    35
    7   JDK Tube    155
    8   ETH Tube    38
    9   ETH Tube    56

You can use the script below for the above data:

import pandas as pd    
data = pd.read_csv("daata1.csv")
bytreatment = data.groupby('Group')
bytreatment['amount'].sum()

Append a list or Series as a row to a pandas DataFrame?

Question: Append a list or Series as a row to a pandas DataFrame?

So I have initialized an empty pandas DataFrame and I would like to iteratively append lists (or Series) as rows in this DataFrame. What is the best way of doing this?


Answer 0

Sometimes it’s easier to do all the appending outside of pandas, then, just create the DataFrame in one shot.

>>> import pandas as pd
>>> simple_list=[['a','b']]
>>> simple_list.append(['e','f'])
>>> df=pd.DataFrame(simple_list,columns=['col1','col2'])
   col1 col2
0    a    b
1    e    f

Answer 1

df = pd.DataFrame(columns=list("ABC"))
df.loc[len(df)] = [1,2,3]

Answer 2

Here’s a simple and dumb solution:

>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df = df.append({'foo':1, 'bar':2}, ignore_index=True)
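
Note that DataFrame.append was deprecated and later removed in pandas 2.0, so on recent versions the same one-row append can be sketched with pd.concat instead (a rough equivalent, not a drop-in from the answer above):

import pandas as pd

df = pd.DataFrame()
df = pd.concat([df, pd.DataFrame([{'foo': 1, 'bar': 2}])], ignore_index=True)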

Answer 3

Could you do something like this?

>>> import pandas as pd
>>> df = pd.DataFrame(columns=['col1', 'col2'])
>>> df = df.append(pd.Series(['a', 'b'], index=['col1','col2']), ignore_index=True)
>>> df = df.append(pd.Series(['d', 'e'], index=['col1','col2']), ignore_index=True) 
>>> df
  col1 col2
0    a    b
1    d    e

Does anyone have a more elegant solution?


Answer 4

Following onto Mike Chirico’s answer… if you want to append a list after the dataframe is already populated…

>>> list = [['f','g']]
>>> df = df.append(pd.DataFrame(list, columns=['col1','col2']),ignore_index=True)
>>> df
  col1 col2
0    a    b
1    d    e
2    f    g

Answer 5

If you want to add a Series and use the Series’ index as columns of the DataFrame, you only need to append the Series between brackets:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame()

In [3]: row=pd.Series([1,2,3],["A","B","C"])

In [4]: row
Out[4]: 
A    1
B    2
C    3
dtype: int64

In [5]: df.append([row],ignore_index=True)
Out[5]: 
   A  B  C
0  1  2  3

[1 rows x 3 columns]

Without ignore_index=True you don’t get a proper index.


Answer 6

Here’s a function that, given an already created dataframe, will append a list as a new row. This should probably have error catchers thrown in, but if you know exactly what you’re adding then it shouldn’t be an issue.

import pandas as pd
import numpy as np

def addRow(df,ls):
    """
    Given a dataframe and a list, append the list as a new row to the dataframe.

    :param df: <DataFrame> The original dataframe
    :param ls: <list> The new row to be added
    :return: <DataFrame> The dataframe with the newly appended row
    """

    numEl = len(ls)

    newRow = pd.DataFrame(np.array(ls).reshape(1,numEl), columns = list(df.columns))

    df = df.append(newRow, ignore_index=True)

    return df

Answer 7

Converting the list to a data frame within the append function works, also when applied in a loop

import pandas as pd
mylist = [1,2,3]
df = pd.DataFrame()
df = df.append(pd.DataFrame([mylist]))  # wrap the list so it becomes a single row

Answer 8

simply use loc:

>>> df
     A  B  C
one  1  2  3
>>> df.loc["two"] = [4,5,6]
>>> df
     A  B  C
one  1  2  3
two  4  5  6

Answer 9

As mentioned here – https://kite.com/python/answers/how-to-append-a-list-as-a-row-to-a-pandas-dataframe-in-python, you’ll need to first convert the list to a series then append the series to dataframe.

df = pd.DataFrame([[1, 2], [3, 4]], columns = ["a", "b"])
to_append = [5, 6]
a_series = pd.Series(to_append, index = df.columns)
df = df.append(a_series, ignore_index=True)

Answer 10

The simplest way:

my_list = [1,2,3,4,5]
df['new_column'] = pd.Series(my_list).values

Edit:

Don’t forget that the length of the new list should be the same as the length of the corresponding DataFrame.


Change one value based on another value in pandas

Question: Change one value based on another value in pandas

I’m trying to reprogram my Stata code into Python for speed improvements, and I was pointed in the direction of PANDAS. I am, however, having a hard time wrapping my head around how to process the data.

Let’s say I want to iterate over all values in the column head ‘ID.’ If that ID matches a specific number, then I want to change two corresponding values FirstName and LastName.

In Stata it looks like this:

replace FirstName = "Matt" if ID==103
replace LastName =  "Jones" if ID==103

So this replaces all values in FirstName that correspond with values of ID == 103 to Matt.

In PANDAS, I’m trying something like this

df = read_csv("test.csv")
for i in df['ID']:
    if i ==103:
          ...

Not sure where to go from here. Any ideas?


Answer 0

One option is to use Python’s slicing and indexing features to logically evaluate the places where your condition holds and overwrite the data there.

Assuming you can load your data directly into pandas with pandas.read_csv then the following code might be helpful for you.

import pandas
df = pandas.read_csv("test.csv")
df.loc[df.ID == 103, 'FirstName'] = "Matt"
df.loc[df.ID == 103, 'LastName'] = "Jones"

As mentioned in the comments, you can also do the assignment to both columns in one shot:

df.loc[df.ID == 103, ['FirstName', 'LastName']] = 'Matt', 'Jones'

Note that you’ll need pandas version 0.11 or newer to make use of loc for overwrite assignment operations.


Another way to do it is to use what is called chained assignment. The behavior of this is less stable and so it is not considered the best solution (it is explicitly discouraged in the docs), but it is useful to know about:

import pandas
df = pandas.read_csv("test.csv")
df['FirstName'][df.ID == 103] = "Matt"
df['LastName'][df.ID == 103] = "Jones"

Answer 1

You can use map; it can map values from a dictionary or even a custom function.

Suppose this is your df:

    ID First_Name Last_Name
0  103          a         b
1  104          c         d

Create the dicts:

fnames = {103: "Matt", 104: "Mr"}
lnames = {103: "Jones", 104: "X"}

And map:

df['First_Name'] = df['ID'].map(fnames)
df['Last_Name'] = df['ID'].map(lnames)

The result will be:

    ID First_Name Last_Name
0  103       Matt     Jones
1  104         Mr         X

Or use a custom function:

names = {103: ("Matt", "Jones"), 104: ("Mr", "X")}
df['First_Name'] = df['ID'].map(lambda x: names[x][0])
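
For completeness, the same dict of tuples can fill the second column too (a small sketch continuing the example above):

df['Last_Name'] = df['ID'].map(lambda x: names[x][1])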

Answer 2

The original question addresses a specific narrow use case. For those who need more generic answers here are some examples:

Creating a new column using data from other columns

Given the dataframe below:

import pandas as pd
import numpy as np

df = pd.DataFrame([['dog', 'hound', 5],
                   ['cat', 'ragdoll', 1]],
                  columns=['animal', 'type', 'age'])

In[1]:
Out[1]:
  animal     type  age
----------------------
0    dog    hound    5
1    cat  ragdoll    1

Below we are adding a new description column as a concatenation of other columns by using the + operation which is overridden for series. Fancy string formatting, f-strings etc won’t work here since the + applies to scalars and not ‘primitive’ values:

df['description'] = 'A ' + df.age.astype(str) + ' years old ' \
                    + df.type + ' ' + df.animal

In [2]: df
Out[2]:
  animal     type  age                description
-------------------------------------------------
0    dog    hound    5    A 5 years old hound dog
1    cat  ragdoll    1  A 1 years old ragdoll cat

We get 1 years for the cat (instead of 1 year) which we will be fixing below using conditionals.

Modifying an existing column with conditionals

Here we are replacing the original animal column with values from other columns, and using np.where to set a conditional substring based on the value of age:

# append 's' to 'age' if it's greater than 1
df.animal = df.animal + ", " + df.type + ", " + \
    df.age.astype(str) + " year" + np.where(df.age > 1, 's', '')

In [3]: df
Out[3]:
                 animal     type  age
-------------------------------------
0   dog, hound, 5 years    hound    5
1  cat, ragdoll, 1 year  ragdoll    1

Modifying multiple columns with conditionals

A more flexible approach is to call .apply() on an entire dataframe rather than on a single column:

def transform_row(r):
    r.animal = 'wild ' + r.type
    r.type = r.animal + ' creature'
    r.age = "{} year{}".format(r.age, r.age > 1 and 's' or '')
    return r

df.apply(transform_row, axis=1)

In[4]:
Out[4]:
         animal            type      age
----------------------------------------
0    wild hound    dog creature  5 years
1  wild ragdoll    cat creature   1 year

In the code above the transform_row(r) function takes a Series object representing a given row (indicated by axis=1, the default value of axis=0 will provide a Series object for each column). This simplifies processing since we can access the actual ‘primitive’ values in the row using the column names and have visibility of other cells in the given row/column.
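
One detail worth spelling out: apply returns a new DataFrame rather than modifying df in place, so assign the result back if you want to keep the transformed rows (a small usage sketch):

df = df.apply(transform_row, axis=1)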


Answer 3

This question might still be visited often enough that it’s worth offering an addendum to Mr Kassies’ answer. The dict built-in class can be sub-classed so that a default is returned for ‘missing’ keys. This mechanism works well for pandas. But see below.

In this way it’s possible to avoid key errors.

>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> class SurnameMap(dict):
...     def __missing__(self, key):
...         return ''
...     
>>> surnamemap = SurnameMap()
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap[x])
>>> df
    ID  Surname
0  101  Mohanty
1  201         
2  301    Drake
3  401         

The same thing can be done more simply in the following way. The use of the ‘default’ argument for the get method of a dict object makes it unnecessary to subclass a dict.

>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> surnamemap = {}
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap.get(x, ''))
>>> df
    ID  Surname
0  101  Mohanty
1  201         
2  301    Drake
3  401         

Pandas every nth row

Question: Pandas every nth row

Dataframe.resample() works only with timeseries data. I cannot find a way of getting every nth row from non-timeseries data. What is the best method?


Answer 0

I’d use iloc, which takes a row/column slice, both based on integer position and following normal python syntax.

df.iloc[::5, :]

Answer 1

Though @chrisb’s accepted answer does answer the question, I would like to add to it the following.

A simple method I use to get the nth data or drop the nth row is the following:

df1 = df[df.index % 3 != 0]  # Excludes every 3rd row starting from 0
df2 = df[df.index % 3 == 0]  # Selects every 3rd row starting from 0

This arithmetic based sampling has the ability to enable even more complex row-selections.

This assumes, of course, that you have an index column of ordered, consecutive, integers starting at 0.
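
If the index is not already a clean 0..n-1 range (for example, after filtering), one hedged way to make the modulo trick applicable is to reset it first:

df = df.reset_index(drop=True)  # restore an ordered, consecutive integer index
df2 = df[df.index % 3 == 0]     # now reliably selects every 3rd row starting from 0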


Answer 2

There is an even simpler solution to the accepted answer that involves directly invoking df.__getitem__.

df = pd.DataFrame('x', index=range(5), columns=list('abc'))
df

   a  b  c
0  x  x  x
1  x  x  x
2  x  x  x
3  x  x  x
4  x  x  x

For example, to get every 2 rows, you can do

df[::2]

   a  b  c
0  x  x  x
2  x  x  x
4  x  x  x

There’s also GroupBy.first/GroupBy.head, you group on the index:

df.index // 2
# Int64Index([0, 0, 1, 1, 2], dtype='int64')

df.groupby(df.index // 2).first()
# Alternatively,
# df.groupby(df.index // 2).head(1)

   a  b  c
0  x  x  x
1  x  x  x
2  x  x  x

The index is floor-divided by the stride (2, in this case). If the index is non-numeric, instead do

# df.groupby(np.arange(len(df)) // 2).first()
df.groupby(pd.RangeIndex(len(df)) // 2).first()

   a  b  c
0  x  x  x
1  x  x  x
2  x  x  x

Answer 3

I had a similar requirement, but I wanted the n’th item in a particular group. This is how I solved it.

groups = data.groupby(['group_key'])
selection = groups['index_col'].apply(lambda x: x % 3 == 0)
subset = data[selection]
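
As a rough usage sketch (the group_key and index_col column names are placeholders taken from the snippet above, not real data):

import pandas as pd

data = pd.DataFrame({'group_key': list('aaabbb'),
                     'index_col': [0, 1, 2, 0, 1, 2]})

groups = data.groupby(['group_key'])
selection = groups['index_col'].apply(lambda x: x % 3 == 0)  # per-group boolean Series
subset = data[selection]  # expected to align with data's index and keep index_col == 0 rows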

Convert a column in a pandas dataframe from int to string

Question: Convert a column in a pandas dataframe from int to string

I have a dataframe in pandas with mixed int and str data columns. I want to concatenate first the columns within the dataframe. To do that I have to convert an int column to str. I’ve tried to do as follows:

mtrx['X.3'] = mtrx.to_string(columns = ['X.3'])

or

mtrx['X.3'] = mtrx['X.3'].astype(str)

but in both cases it’s not working and I’m getting an error saying “cannot concatenate ‘str’ and ‘int’ objects”. Concatenating two str columns is working perfectly fine.


Answer 0

In [16]: df = DataFrame(np.arange(10).reshape(5,2),columns=list('AB'))

In [17]: df
Out[17]: 
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

In [18]: df.dtypes
Out[18]: 
A    int64
B    int64
dtype: object

Convert a series

In [19]: df['A'].apply(str)
Out[19]: 
0    0
1    2
2    4
3    6
4    8
Name: A, dtype: object

In [20]: df['A'].apply(str)[0]
Out[20]: '0'

Don’t forget to assign the result back:

df['A'] = df['A'].apply(str)

Convert the whole frame

In [21]: df.applymap(str)
Out[21]: 
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

In [22]: df.applymap(str).iloc[0,0]
Out[22]: '0'

df = df.applymap(str)

Answer 1

Change data type of DataFrame column:

To int:

df.column_name = df.column_name.astype(np.int64)

To str:

df.column_name = df.column_name.astype(str)


Answer 2

Warning: Both solutions given ( astype() and apply() ) do not preserve NULL values in either the nan or the None form.

import pandas as pd
import numpy as np

df = pd.DataFrame([None,'string',np.nan,42], index=[0,1,2,3], columns=['A'])

df1 = df['A'].astype(str)
df2 =  df['A'].apply(str)

print df.isnull()
print df1.isnull()
print df2.isnull()

I believe this is fixed by the implementation of to_string()
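
A small sketch of the pitfall, plus one hedged workaround that re-applies the missing values after the cast:

import numpy as np
import pandas as pd

df = pd.DataFrame([None, 'string', np.nan, 42], index=[0, 1, 2, 3], columns=['A'])

df['A'].astype(str)                           # None/NaN become the literal strings 'None'/'nan'
df['A'].astype(str).where(df['A'].notnull())  # cast to str, then put NaN back where data was missing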


Answer 3

Use the following code:

df.column_name = df.column_name.astype('str')

Answer 4

Just for an additional reference.

All of the above answers work when applied to a DataFrame column as a whole. But if you are using a lambda while creating or modifying a column, that approach won’t work, because inside the lambda the value is a plain int rather than a pandas Series. You have to use str(target_attribute) to turn it into a string. Please refer to the example below.

def add_zero_in_prefix(df):
    if(df['Hour']<10):
        return '0' + str(df['Hour'])

data['str_hr'] = data.apply(add_zero_in_prefix, axis=1)

Drop columns whose name contains a specific string from a pandas DataFrame

Question: Drop columns whose name contains a specific string from a pandas DataFrame

I have a pandas dataframe with the following column names:

Result1, Test1, Result2, Test2, Result3, Test3, etc…

I want to drop all the columns whose name contains the word “Test”. The number of such columns is not static but depends on a previous function.

How can I do that?


Answer 0

import pandas as pd

import numpy as np

array=np.random.random((2,4))

df=pd.DataFrame(array, columns=('Test1', 'toto', 'test2', 'riri'))

print df

      Test1      toto     test2      riri
0  0.923249  0.572528  0.845464  0.144891
1  0.020438  0.332540  0.144455  0.741412

cols = [c for c in df.columns if c.lower()[:4] != 'test']

df=df[cols]

print df
       toto      riri
0  0.572528  0.144891
1  0.332540  0.741412

Answer 1

Here is one way to do this:

df = df[df.columns.drop(list(df.filter(regex='Test')))]

Answer 2

Cheaper, Faster, and Idiomatic: str.contains

In recent versions of pandas, you can use string methods on the index and columns. Here, str.startswith seems like a good fit.

To remove all columns starting with a given substring:

df.columns.str.startswith('Test')
# array([ True, False, False, False])

df.loc[:,~df.columns.str.startswith('Test')]

  toto test2 riri
0    x     x    x
1    x     x    x

For case-insensitive matching, you can use regex-based matching with str.contains with an SOL anchor:

df.columns.str.contains('^test', case=False)
# array([ True, False,  True, False])

df.loc[:,~df.columns.str.contains('^test', case=False)] 

  toto riri
0    x    x
1    x    x

If mixed types are a possibility, specify na=False as well.
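
For example, a hedged one-liner for that mixed-type case might be:

df.loc[:, ~df.columns.str.contains('^test', case=False, na=False)]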


Answer 3

You can keep just the columns you DO want using ‘filter’:

import pandas as pd
import numpy as np

data2 = [{'test2': 1, 'result1': 2}, {'test': 5, 'result34': 10, 'c': 20}]

df = pd.DataFrame(data2)

df

    c   result1     result34    test    test2
0   NaN     2.0     NaN     NaN     1.0
1   20.0    NaN     10.0    5.0     NaN

Now filter

df.filter(like='result',axis=1)

Get..

   result1  result34
0   2.0     NaN
1   NaN     10.0

Answer 4

This can be done neatly in one line with:

df = df.drop(df.filter(regex='Test').columns, axis=1)

Answer 5

Use the DataFrame.select method:

In [38]: df = DataFrame({'Test1': randn(10), 'Test2': randn(10), 'awesome': randn(10)})

In [39]: df.select(lambda x: not re.search('Test\d+', x), axis=1)
Out[39]:
   awesome
0    1.215
1    1.247
2    0.142
3    0.169
4    0.137
5   -0.971
6    0.736
7    0.214
8    0.111
9   -0.214

Answer 6

This method does everything in place. Many of the other answers create copies and are not as efficient:

df.drop(df.columns[df.columns.str.contains('Test')], axis=1, inplace=True)


Answer 7

Don’t drop. Catch the opposite of what you want.

df = df.filter(regex='^((?!badword).)*$').columns

Answer 8

The shortest way to do this is:

resdf = df.filter(like='Test',axis=1)

Answer 9

Solution when dropping a list of column names containing regex. I prefer this approach because I’m frequently editing the drop list. Uses a negative filter regex for the drop list.

drop_column_names = ['A','B.+','C.*']
drop_columns_regex = '^(?!(?:'+'|'.join(drop_column_names)+')$)'
print('Dropping columns:',', '.join([c for c in df.columns if re.search(drop_columns_regex,c)]))
df = df.filter(regex=drop_columns_regex,axis=1)

Are for-loops in pandas really bad? When should I care?

Question: Are for-loops in pandas really bad? When should I care?

Are for loops really “bad”? If not, in what situation(s) would they be better than using a more conventional “vectorized” approach?1

I am familiar with the concept of “vectorization”, and how pandas employs vectorized techniques to speed up computation. Vectorized functions broadcast operations over the entire series or DataFrame to achieve speedups much greater than conventionally iterating over the data.

However, I am quite surprised to see a lot of code (including from answers on Stack Overflow) offering solutions to problems that involve looping through data using for loops and list comprehensions. The documentation and API say that loops are “bad”, and that one should “never” iterate over arrays, series, or DataFrames. So, how come I sometimes see users suggesting loop-based solutions?


1 – While it is true that the question sounds somewhat broad, the truth is that there are very specific situations when for loops are usually better than conventionally iterating over data. This post aims to capture this for posterity.


Answer 0

TLDR; No, for loops are not blanket "bad", at least, not always. It is probably more accurate to say that some vectorized operations are slower than iterating, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to the vectorized pandas functions:

  1. When your data is small (…depending on what you are doing),
  2. When dealing with object/mixed dtypes,
  3. When using the str/regex accessor functions.

Let's examine these situations individually.


Iteration v/s Vectorization on Small Data

Pandas follows a "convention over configuration" approach in its API design. This means the same API has been fitted to cater to a broad range of data and use cases.

When a pandas function is called, the following things (among others) must internally be handled by the function, to ensure it works correctly:

  1. Index/axis alignment
  2. Handling mixed data types
  3. Handling missing data

Almost every function has to deal with these to varying extents, and this carries overhead. The overhead is smaller for numeric functions (for example, Series.add), and more pronounced for string functions (for example, Series.str.replace).

for loops, on the other hand, are faster than you think. What's even better is that list comprehensions (which create lists through for loops) are even faster, as they are optimized iterative mechanisms for list creation.

List comprehensions follow the pattern

[f(x) for x in seq]

where seq is a pandas Series or DataFrame column. Or, when operating over multiple columns,

[f(x, y) for x, y in zip(seq1, seq2)]

where seq1 and seq2 are columns.

Numeric Comparison
Consider a simple boolean indexing operation. The list comprehension method has been timed against Series.ne (!=) and query. Here are the functions:

# Boolean indexing with Numeric value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

For simplicity, the perfplot package has been used in this post to run all of the timeit tests. The timings for the operations above are shown below:

The list comprehension outperforms query for moderately sized N, and even outperforms the vectorized not-equals comparison for small N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.

Note
It is worth mentioning that much of the benefit of list comprehensions comes from not having to worry about index alignment, but this means that if your code depends on index alignment, this approach will break. In some cases, vectorized operations over the underlying NumPy arrays can be considered as bringing in the "best of both worlds", allowing vectorization without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as

df[df.A.values != df.B.values]

which outperforms both the pandas and list comprehension equivalents:

NumPy vectorization is out of the scope of this post, but it is definitely worth considering if performance matters.

Value Counts
Taking another example – this time, with another vanilla python construct that is faster than a for loop: collections.Counter. A common requirement is to compute the value counts and return the result as a dictionary. This is compared against value_counts, np.unique, and Counter:

# Value Counts comparison.
ser.value_counts(sort=False).to_dict()           # value_counts
dict(zip(*np.unique(ser, return_counts=True)))   # np.unique
Counter(ser)                                     # Counter

The results are more pronounced: Counter wins out over both vectorized methods for a larger range of small N (~3500).

Note
More trivia (courtesy @user2357112): Counter is implemented with a C accelerator, so while it still has to work with python objects instead of the underlying C datatypes, it is still faster than a for loop. Python power!

Of course, the take away here is that performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still do not give you the performance you need, there is always cython and numba. Let's add this test into the mix.

from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]

    return np.array(result)

df[get_mask(df.A.values, df.B.values)] # numba

Numba offers JIT compilation of loopy python code into very powerful vectorized code. Understanding how to make numba work involves a learning curve.


Operations with Mixed/object dtypes

String-based Comparison
Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.

# Boolean indexing with string value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.

Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.

When it comes to operations on mutable/complex objects, there is no comparison. List comprehensions outperform all operations involving dicts and lists.

Accessing Dictionary Value(s) by Key
Here are timings for two operations that extract a value from a column of dictionaries: map and a list comprehension. The setup is in the Appendix, under the heading "Code Snippets".

# Dictionary value extraction.
ser.map(operator.itemgetter('value'))     # map
pd.Series([x.get('value') for x in ser])  # list comprehension


Positional List Indexing
Here are timings for 3 operations that extract the 0th element from a column of lists (handling exceptions): map, the str.get accessor method, and a list comprehension:

# List positional indexing. 
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan

ser.map(get_0th)                                          # map
ser.str[0]                                                # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser])  # list comp
pd.Series([get_0th(x) for x in ser])                      # list comp safe

Note
If the index matters, you will want to do:

pd.Series([...], index=ser.index)

when reconstructing the series.

List Flattening
A final example is flattening lists. This is another common problem, and it demonstrates just how powerful pure python is here.

# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True)  # stack
pd.Series(list(chain.from_iterable(ser.tolist())))         # itertools.chain
pd.Series([y for x in ser for y in x])                     # nested list comp

itertools.chain.from_iterable 和嵌套列表推导式都是纯 Python 结构,伸缩性都远好于 stack 方案。

这些计时有力地说明:pandas 并不适合处理混合 dtype,您也许应该避免用它来做这类事情。数据应尽可能以标量值(整数/浮点数/字符串)的形式存放在各自独立的列中。

最后,这些方案的适用性在很大程度上取决于您的数据。因此,最好的做法是先在您的数据上测试这些操作,再决定采用哪种方案。请注意,我没有对 apply 计时,因为那会把图拉得变形(是的,它就是那么慢)。


正则表达式操作和 .str 访问器方法

Pandas 可以对字符串列应用正则表达式操作,如 str.contains、str.extract 和 str.extractall,以及其他“向量化”的字符串操作(例如 str.split、str.find、str.translate 等)。这些函数比列表推导式慢,更多是为了使用方便而存在。

用 re.compile 预编译正则表达式模式,然后遍历数据,通常要快得多(另请参阅“使用 Python 的 re.compile 是否值得?”)。与 str.contains 等效的列表推导式大致如下:

p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])

要么,

ser2 = ser[[bool(p.search(x)) for x in ser]]

如果您需要处理NaN,则可以执行以下操作

ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]

与 str.extract(不带组)等效的列表推导式大致如下:

df['col2'] = [p.search(x).group(0) for x in df['col']]

如果您需要处理不匹配和NaN,则可以使用自定义函数(速度更快!):

def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df['col2'] = [matcher(x) for x in df['col']]

matcher 函数非常容易扩展。根据需要,它可以改为返回每个捕获组组成的列表:只需查询匹配对象的 group 或 groups 属性即可。

对于str.extractall,请更改p.searchp.findall

字符串提取
考虑一个简单的过滤操作。目标是提取紧跟在大写字母后面的 4 位数字。

# Extracting strings.
p = re.compile(r'(?<=[A-Z])(\d{4})')
def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False)   #  str.extract
pd.Series([matcher(x) for x in ser])                  #  list comprehension

更多示例
完全公开-我是以下列出的这些帖子的作者(部分或全部)。


结论

如上面的示例所示,在处理行数很少的 DataFrame、混合数据类型和正则表达式时,迭代更有优势。

您获得的提速取决于您的数据和问题,因此里程可能会有所不同。最好的办法是仔细运行测试,看看是否值得付出努力。

“向量化”功能的优点在于其简单性和可读性,因此,如果性能不是很关键,则绝对应该首选这些功能。

另外需要注意的是,某些字符串操作所面临的约束反而更适合使用 NumPy。下面是两个例子,其中精心设计的 NumPy 向量化胜过 Python:

此外,有时只针对底层数组(通过 .values)而不是 Series 或 DataFrame 进行操作,就能在大多数常见场景下获得足够可观的加速(参见上文“数值比较”一节中的注意)。例如,df[df.A.values != df.B.values] 会比 df[df.A != df.B] 立刻表现出性能提升。.values 并非在所有情况下都适用,但这是一个值得了解的实用技巧。

如上所述,由您决定这些解决方案是否值得实施。


附录:代码段

import perfplot  
import operator 
import pandas as pd
import numpy as np
import re

from collections import Counter
from itertools import chain

# Boolean indexing with Numeric value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B']),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp', 'numba'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N'
)

# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=['value_counts', 'np.unique', 'Counter'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=lambda x, y: dict(x) == dict(y)
)

# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B'], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Dictionary value extraction.
ser1 = pd.Series([{'key': 'abc', 'value': 123}, {'key': 'xyz', 'value': 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter('value')),
        lambda ser: pd.Series([x.get('value') for x in ser]),
    ],
    labels=['map', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# List positional indexing. 
ser2 = pd.Series([['a', 'b', 'c'], [1, 2], []])        
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=['map', 'str accessor', 'list comprehension', 'list comp safe'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Nested list flattening.
ser3 = pd.Series([['a', 'b', 'c'], ['d', 'e'], ['f', 'g']])
perfplot.show(
    setup=lambda n: pd.concat([ser3] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=['stack', 'itertools.chain', 'nested list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',    
    equality_check=None

)

# Extracting strings.
ser4 = pd.Series(['foo xyz', 'test A1234', 'D3345 xtz'])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=['str.extract', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

TLDR; No, for loops are not blanket “bad”, at least, not always. It is probably more accurate to say that some vectorized operations are slower than iterating, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:

  1. When your data is small (…depending on what you’re doing),
  2. When dealing with object/mixed dtypes
  3. When using the str/regex accessor functions

Let’s examine these situations individually.


Iteration v/s Vectorization on Small Data

Pandas follows a “Convention Over Configuration” approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.

When a pandas function is called, the following things (among others) must internally be handled by the function to ensure things work:

  1. Index/axis alignment
  2. Handling mixed datatypes
  3. Handling missing data

Almost every function will have to deal with these to varying extents, and this presents an overhead. The overhead is less for numeric functions (for example, Series.add), while it is more pronounced for string functions (for example, Series.str.replace).

for loops, on the other hand, are faster than you think. What’s even better is that list comprehensions (which create lists through for loops) are even faster, as they are optimized iterative mechanisms for list creation.

List comprehensions follow the pattern

[f(x) for x in seq]

Where seq is a pandas series or DataFrame column. Or, when operating over multiple columns,

[f(x, y) for x, y in zip(seq1, seq2)]

Where seq1 and seq2 are columns.
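As a quick, self-contained illustration (the toy frame and column names here are made up, not from the original post), the two patterns look like this:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Single-column pattern: f(x) applied to each element of one column.
squares = [x ** 2 for x in df['A']]                  # [1, 4, 9]

# Two-column pattern: f(x, y) applied pairwise via zip.
sums = [x + y for x, y in zip(df['A'], df['B'])]     # [11, 22, 33]

# Wrap back into a Series if you need pandas semantics again.
result = pd.Series(sums, index=df.index)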

Numeric Comparison
Consider a simple boolean indexing operation. The list comprehension method has been timed against Series.ne (!=) and query. Here are the functions:

# Boolean indexing with Numeric value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

For simplicity, I have used the perfplot package to run all the timeit tests in this post. The timings for the operations above are below:

The list comprehension outperforms query for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.

Note
It is worth mentioning that much of the benefit of list comprehension come from not having to worry about the index alignment, but this means that if your code is dependent on indexing alignment, this will break. In some cases, vectorised operations over the underlying NumPy arrays can be considered as bringing in the “best of both worlds”, allowing for vectorisation without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as

df[df.A.values != df.B.values]

Which outperforms both the pandas and list comprehension equivalents:

NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.

Value Counts
Taking another example – this time, with another vanilla python construct that is faster than a for loop – collections.Counter. A common requirement is to compute the value counts and return the result as a dictionary. This is done with value_counts, np.unique, and Counter:

# Value Counts comparison.
ser.value_counts(sort=False).to_dict()           # value_counts
dict(zip(*np.unique(ser, return_counts=True)))   # np.unique
Counter(ser)                                     # Counter

The results are more pronounced: Counter wins out over both vectorized methods across a fairly large range of small N (up to ~3500).

Note
More trivia (courtesy @user2357112). The Counter is implemented with a C accelerator, so while it still has to work with python objects instead of the underlying C datatypes, it is still faster than a for loop. Python power!

Of course, the take away from here is that the performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don’t give you the performance you need, there is always cython and numba. Let’s add this test into the mix.

from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]

    return np.array(result)

df[get_mask(df.A.values, df.B.values)] # numba

Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.
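One possible variation (my own sketch, not part of the original answer): pre-allocating a NumPy boolean array instead of a Python list keeps everything in numba-friendly types, which tends to play more nicely with nopython/parallel mode:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def get_mask_np(x, y):
    # Allocate the output up front as a NumPy boolean array.
    result = np.empty(len(x), dtype=np.bool_)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]
    return result

# Usage mirrors the version above: df[get_mask_np(df.A.values, df.B.values)]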


Operations with Mixed/object dtypes

String-based Comparison
Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.

# Boolean indexing with string value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.

Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.

When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.

Accessing Dictionary Value(s) by Key
Here are timings for two operations that extract a value from a column of dictionaries: map and the list comprehension. The setup is in the Appendix, under the heading “Code Snippets”.

# Dictionary value extraction.
ser.map(operator.itemgetter('value'))     # map
pd.Series([x.get('value') for x in ser])  # list comprehension
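If some of the dictionaries might be missing the key, or some rows are not dicts at all, a hedged variant of the list comprehension with a default avoids raising ('value' is just the illustrative key from the appendix setup):

pd.Series(
    [x.get('value', np.nan) if isinstance(x, dict) else np.nan for x in ser],
    index=ser.index   # keep the original index, as noted further down
)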

Positional List Indexing
Timings for 3 operations that extract the 0th element from a column of lists (handling exceptions): map, the str accessor method, and the list comprehension:

# List positional indexing. 
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan

ser.map(get_0th)                                          # map
ser.str[0]                                                # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser])  # list comp
pd.Series([get_0th(x) for x in ser])                      # list comp safe

Note
If the index matters, you would want to do:

pd.Series([...], index=ser.index)

When reconstructing the series.

List Flattening
A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.

# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True)  # stack
pd.Series(list(chain.from_iterable(ser.tolist())))         # itertools.chain
pd.Series([y for x in ser for y in x])                     # nested list comp

Both itertools.chain.from_iterable and the nested list comprehension are pure python constructs, and scale much better than the stack solution.
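As a side note not timed above: on pandas 0.25 or newer, Series.explode offers a vectorized alternative for this, although for object-dtype lists the pure-python constructs above may still be competitive:

ser.explode().reset_index(drop=True)   # one row per list element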

These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.

Lastly, the applicability of these solutions depends heavily on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed apply on these solutions, because it would skew the graph (yes, it’s that slow).


Regex Operations, and .str Accessor Methods

Pandas can apply regex operations such as str.contains, str.extract, and str.extractall, as well as other “vectorized” string operations (such as str.split, str.find, str.translate, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be more convenience functions than anything else.

It is usually much faster to pre-compile a regex pattern and iterate over your data with re.compile (also see Is it worth using Python’s re.compile?). The list comp equivalent to str.contains looks something like this:

p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])

Or,

ser2 = ser[[bool(p.search(x)) for x in ser]]

If you need to handle NaNs, you can do something like

ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]

The list comp equivalent to str.extract (without groups) will look something like:

df['col2'] = [p.search(x).group(0) for x in df['col']]

If you need to handle no-matches and NaNs, you can use a custom function (still faster!):

def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df['col2'] = [matcher(x) for x in df['col']]

The matcher function is very extensible. It can be fitted to return a list for each capture group, as needed. Just query the group or groups attribute of the match object.

For str.extractall, change p.search to p.findall.
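A rough sketch of what that findall-based equivalent might look like (same hypothetical column names as above; each row gets a list of matches, empty where nothing matched):

df['col2'] = [p.findall(str(x)) for x in df['col']]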

String Extraction
Consider a simple filtering operation. The idea is to extract a 4-digit number if it is preceded by an upper case letter.

# Extracting strings.
p = re.compile(r'(?<=[A-Z])(\d{4})')
def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False)   #  str.extract
pd.Series([matcher(x) for x in ser])                  #  list comprehension

More Examples
Full disclosure – I am the author (in part or whole) of these posts listed below.


Conclusion

As shown in the examples above, iteration shines when working with DataFrames that have a small number of rows, mixed datatypes, and regular expressions.

The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.

The “vectorized” functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.

Another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms python:

Additionally, sometimes just operating on the underlying arrays via .values as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the Note in the Numeric Comparison section above). So, for example df[df.A.values != df.B.values] would show instant performance boosts over df[df.A != df.B]. Using .values may not be appropriate in every situation, but it is a useful hack to know.

As mentioned above, it’s up to you to decide whether these solutions are worth the trouble of implementing.


Appendix: Code Snippets

import perfplot  
import operator 
import pandas as pd
import numpy as np
import re

from collections import Counter
from itertools import chain

# Boolean indexing with Numeric value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B']),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp', 'numba'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N'
)

# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=['value_counts', 'np.unique', 'Counter'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=lambda x, y: dict(x) == dict(y)
)

# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B'], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Dictionary value extraction.
ser1 = pd.Series([{'key': 'abc', 'value': 123}, {'key': 'xyz', 'value': 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter('value')),
        lambda ser: pd.Series([x.get('value') for x in ser]),
    ],
    labels=['map', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# List positional indexing. 
ser2 = pd.Series([['a', 'b', 'c'], [1, 2], []])        
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=['map', 'str accessor', 'list comprehension', 'list comp safe'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Nested list flattening.
ser3 = pd.Series([['a', 'b', 'c'], ['d', 'e'], ['f', 'g']])
perfplot.show(
    setup=lambda n: pd.concat([ser3] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=['stack', 'itertools.chain', 'nested list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',    
    equality_check=None

)

# Extracting strings.
ser4 = pd.Series(['foo xyz', 'test A1234', 'D3345 xtz'])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=['str.extract', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

回答 1

简而言之

  • for循环+ iterrows非常慢。大约1k行的开销并不重要,但超过10k行的开销却很明显。
  • for 循环 + itertuples 比 iterrows 或 apply 快得多。
  • 向量化通常比 itertuples 快得多。

基准测试

In short

  • for loop + iterrows is extremely slow. Overhead isn’t significant on ~1k rows, but noticeable on 10k+ rows.
  • for loop + itertuples is much faster than iterrows or apply.
  • vectorization is usually much faster than itertuples

Benchmark
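The original post backs these points up with benchmark plots that are not reproduced here. A minimal sketch of how such a benchmark could be set up with timeit (the row count and the toy frame are my own assumptions) might look like:

import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.rand(10_000), 'b': np.random.rand(10_000)})

def with_iterrows():
    # Row-wise iteration; each row is materialized as a Series (slow).
    return sum(row['a'] * row['b'] for _, row in df.iterrows())

def with_itertuples():
    # Namedtuple iteration; much lighter-weight than iterrows.
    return sum(t.a * t.b for t in df.itertuples(index=False))

def vectorized():
    # Column-wise arithmetic handled by pandas/NumPy.
    return (df['a'] * df['b']).sum()

for fn in (with_iterrows, with_itertuples, vectorized):
    print(fn.__name__, timeit.timeit(fn, number=10))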


像Qlik中那样在pandas数据框中的列中计算唯一值?

问题:像Qlik中那样在pandas数据框中的列中计算唯一值?

如果我有这样的表:

df = pd.DataFrame({
         'hID': [101, 102, 103, 101, 102, 104, 105, 101],
         'dID': [10, 11, 12, 10, 11, 10, 12, 10],
         'uID': ['James', 'Henry', 'Abe', 'James', 'Henry', 'Brian', 'Claude', 'James'],
         'mID': ['A', 'B', 'A', 'B', 'A', 'A', 'A', 'C']
})

在 Qlik 中我可以用 count(distinct hID) 得到 5 个唯一的 hID。在 Python 中用 pandas 数据框(或者 numpy 数组)该怎么做?同样,如果在 Qlik 中执行 count(hID),我会得到 8。在 pandas 中等效的做法是什么?

If I have a table like this:

df = pd.DataFrame({
         'hID': [101, 102, 103, 101, 102, 104, 105, 101],
         'dID': [10, 11, 12, 10, 11, 10, 12, 10],
         'uID': ['James', 'Henry', 'Abe', 'James', 'Henry', 'Brian', 'Claude', 'James'],
         'mID': ['A', 'B', 'A', 'B', 'A', 'A', 'A', 'C']
})

I can do count(distinct hID) in Qlik to come up with count of 5 for unique hID. How do I do that in python using a pandas dataframe? Or maybe a numpy array? Similarly, if were to do count(hID) I will get 8 in Qlik. What is the equivalent way to do it in pandas?


回答 0

计算不同的值,使用nunique

df['hID'].nunique()
5

仅计算非空值,请使用count

df['hID'].count()
8

计算包括空值在内的总值,请使用size属性:

df['hID'].size
8

编辑添加条件

使用布尔索引:

df.loc[df['mID']=='A','hID'].agg(['nunique','count','size'])

或使用query

df.query('mID == "A"')['hID'].agg(['nunique','count','size'])

输出:

nunique    5
count      5
size       5
Name: hID, dtype: int64

Count distinct values, use nunique:

df['hID'].nunique()
5

Count only non-null values, use count:

df['hID'].count()
8

Count total values including null values, use the size attribute:

df['hID'].size
8

Edit to add condition

Use boolean indexing:

df.loc[df['mID']=='A','hID'].agg(['nunique','count','size'])

OR using query:

df.query('mID == "A"')['hID'].agg(['nunique','count','size'])

Output:

nunique    5
count      5
size       5
Name: hID, dtype: int64
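If you also want the Qlik-style count distinct broken down by a dimension (a small extra sketch, reusing the df from the question), groupby combines naturally with nunique:

df.groupby('mID')['hID'].nunique()

mID
A    5
B    2
C    1
Name: hID, dtype: int64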

回答 1

如果我假设data是您数据框的名称,则可以执行以下操作:

data['race'].value_counts()

这将向您显示各个不同的元素及其出现的次数。

If I assume data is the name of your dataframe, you can do :

data['race'].value_counts()

this will show you the distinct elements and their number of occurrences.
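Applied to the hID column from the question’s df, for instance, the per-value counts sum to the 8 of count(hID), while the number of rows in the result is the 5 of count(distinct hID) (the order among tied counts may vary):

df['hID'].value_counts()

101    3
102    2
103    1
104    1
105    1
Name: hID, dtype: int64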


回答 2

或获取每一列的唯一值数量:

df.nunique()

dID    3
hID    5
mID    3
uID    5
dtype: int64

新进 pandas 0.20.0 pd.DataFrame.agg

df.agg(['count', 'size', 'nunique'])

         dID  hID  mID  uID
count      8    8    8    8
size       8    8    8    8
nunique    3    5    3    5

您始终可以在 groupby 内部使用 agg。最后我用了 stack,因为我更喜欢这种展示方式。

df.groupby('mID').agg(['count', 'size', 'nunique']).stack()


             dID  hID  uID
mID                       
A   count      5    5    5
    size       5    5    5
    nunique    3    5    5
B   count      2    2    2
    size       2    2    2
    nunique    2    2    2
C   count      1    1    1
    size       1    1    1
    nunique    1    1    1

Or get the number of unique values for each column:

df.nunique()

dID    3
hID    5
mID    3
uID    5
dtype: int64

New in pandas 0.20.0 pd.DataFrame.agg

df.agg(['count', 'size', 'nunique'])

         dID  hID  mID  uID
count      8    8    8    8
size       8    8    8    8
nunique    3    5    3    5

You’ve always been able to do an agg within a groupby. I used stack at the end because I like the presentation better.

df.groupby('mID').agg(['count', 'size', 'nunique']).stack()


             dID  hID  uID
mID                       
A   count      5    5    5
    size       5    5    5
    nunique    3    5    5
B   count      2    2    2
    size       2    2    2
    nunique    2    2    2
C   count      1    1    1
    size       1    1    1
    nunique    1    1    1

回答 3

您可以在 pandas 中使用 nunique:

df.hID.nunique()
# 5

You can use nunique in pandas:

df.hID.nunique()
# 5

回答 4

要计算数据框 df 的 hID 列中的唯一值个数,请使用:

len(df.hID.unique())

To count unique values in column, say hID of dataframe df, use:

len(df.hID.unique())

回答 5

您可以结合 len 函数来使用 unique 属性:

len(df['hID'].unique())
# 5

You can use the unique property together with the len function:

len(df['hID'].unique())
# 5


熊猫中的dtype(’O’)是什么?

问题:熊猫中的dtype(’O’)是什么?

我在 pandas 中有一个数据框,我想弄清楚它的值是什么类型。我不确定 'Test' 这一列的类型。但是,当我运行 myFrame['Test'].dtype 时,得到的是:

dtype('O')

这是什么意思?

I have a dataframe in pandas and I’m trying to figure out what the types of its values are. I am unsure what the type is of column 'Test'. However, when I run myFrame['Test'].dtype, I get;

dtype('O')

What does this mean?


回答 0

它的意思是:

'O'     (Python) objects

来源

第一个字符指定数据的种类,其余字符指定每个项目占用的字节数(Unicode 除外,它被解释为字符数)。项目大小必须与某个现有类型相对应,否则将引发错误。支持的种类有:

'b'       boolean
'i'       (signed) integer
'u'       unsigned integer
'f'       floating-point
'c'       complex-floating point
'O'       (Python) objects
'S', 'a'  (byte-)string
'U'       Unicode
'V'       raw data (void)

如果需要检查具体的 type,另一个答案会有所帮助。

It means:

'O'     (Python) objects

Source.

The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. The item size must correspond to an existing type, or an error will be raised. The supported kinds are:

'b'       boolean
'i'       (signed) integer
'u'       unsigned integer
'f'       floating-point
'c'       complex-floating point
'O'       (Python) objects
'S', 'a'  (byte-)string
'U'       Unicode
'V'       raw data (void)

Another answer helps if you need to check the concrete types.
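If you want to see what concretely lives inside such an object column, a quick sketch (reusing the question’s myFrame) is to map type over it:

myFrame['Test'].map(type).value_counts()   # e.g. mostly <class 'str'>, depending on your data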


回答 1

当您在数据框内看到 dtype('O') 时,这意味着是 pandas 字符串。

dtype 是什么?

它是属于 pandas 还是 numpy,还是两者兼有,抑或别的什么?如果我们查看 pandas 的代码:

df = pd.DataFrame({'float': [1.0],
                    'int': [1],
                    'datetime': [pd.Timestamp('20180310')],
                    'string': ['foo']})
print(df)
print(df['float'].dtype,df['int'].dtype,df['datetime'].dtype,df['string'].dtype)
df['string'].dtype

它将输出如下:

   float  int   datetime string    
0    1.0    1 2018-03-10    foo
---
float64 int64 datetime64[ns] object
---
dtype('O')

您可以将最后一个结果解释为 pandas 的 dtype('O'),即 pandas 对象,它是 Python 的字符串类型,对应于 NumPy 的 string_ 或 unicode_ 类型。

Pandas dtype    Python type     NumPy type          Usage
object          str             string_, unicode_   Text

就像堂吉诃德骑在驴背上一样,Pandas 构建在 NumPy 之上;NumPy 了解您系统的底层架构,并为此使用 numpy.dtype 类。

数据类型对象是 numpy.dtype 类的实例,它更精确地描述了数据类型,包括:

  • 数据类型(整数,浮点数,Python对象等)
  • 数据大小(例如整数中有多少个字节)
  • 数据的字节顺序(小端或大端)
  • 如果数据类型是结构化的,则为其他数据类型的集合(例如,描述由整数和浮点数组成的数组项)
  • 该结构的“字段”的名称是什么
  • 每个字段的数据类型是什么
  • 每个字段占用存储块的哪一部分
  • 如果数据类型是子数组,则其形状和数据类型是什么

在这个问题的上下文中,dtype 既属于 pandas 也属于 numpy;具体来说,dtype('O') 意味着我们期望的是字符串。


这是一些测试用的代码,并带有解释:如果我们将数据集作为字典

import pandas as pd
import numpy as np
from pandas import Timestamp

data={'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'date': {0: Timestamp('2018-12-12 00:00:00'), 1: Timestamp('2018-12-12 00:00:00'), 2: Timestamp('2018-12-12 00:00:00'), 3: Timestamp('2018-12-12 00:00:00'), 4: Timestamp('2018-12-12 00:00:00')}, 'role': {0: 'Support', 1: 'Marketing', 2: 'Business Development', 3: 'Sales', 4: 'Engineering'}, 'num': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567}, 'fnum': {0: 3.14, 1: 2.14, 2: -0.14, 3: 41.3, 4: 3.14}}
df = pd.DataFrame.from_dict(data) #now we have a dataframe

print(df)
print(df.dtypes)

最后几行将检查数据框并记录输出:

   id       date                  role  num   fnum
0   1 2018-12-12               Support  123   3.14
1   2 2018-12-12             Marketing  234   2.14
2   3 2018-12-12  Business Development  345  -0.14
3   4 2018-12-12                 Sales  456  41.30
4   5 2018-12-12           Engineering  567   3.14
id               int64
date    datetime64[ns]
role            object
num              int64
fnum           float64
dtype: object

各种不同 dtypes

df.iloc[1,:] = np.nan
df.iloc[2,:] = None

但是,如果我们尝试设置 np.nan 或 None,这并不会影响原始列的 dtype。输出将如下所示:

print(df)
print(df.dtypes)

    id       date         role    num   fnum
0  1.0 2018-12-12      Support  123.0   3.14
1  NaN        NaT          NaN    NaN    NaN
2  NaN        NaT         None    NaN    NaN
3  4.0 2018-12-12        Sales  456.0  41.30
4  5.0 2018-12-12  Engineering  567.0   3.14
id             float64
date    datetime64[ns]
role            object
num            float64
fnum           float64
dtype: object

因此,np.nan 或 None 不会更改列的 dtype,除非我们把该列的所有行都设为 np.nan 或 None。在那种情况下,列会分别变为 float64 或 object。

您也可以尝试设置单行:

df.iloc[3,:] = 0 # will convert datetime to object only
df.iloc[4,:] = '' # will convert all columns to object

这里需要注意的是,如果我们在非字符串列中设置字符串,它将变成string或object dtype

When you see dtype('O') inside dataframe this means Pandas string.

What is dtype?

Something that belongs to pandas or numpy, or both, or something else? If we examine pandas code:

df = pd.DataFrame({'float': [1.0],
                    'int': [1],
                    'datetime': [pd.Timestamp('20180310')],
                    'string': ['foo']})
print(df)
print(df['float'].dtype,df['int'].dtype,df['datetime'].dtype,df['string'].dtype)
df['string'].dtype

It will output like this:

   float  int   datetime string    
0    1.0    1 2018-03-10    foo
---
float64 int64 datetime64[ns] object
---
dtype('O')

You can interpret the last as Pandas dtype('O') or Pandas object which is Python type string, and this corresponds to Numpy string_, or unicode_ types.

Pandas dtype    Python type     NumPy type          Usage
object          str             string_, unicode_   Text

Like Don Quixote rides on a donkey, Pandas rides on NumPy, and NumPy understands the underlying architecture of your system and uses the numpy.dtype class for that.

A data type object is an instance of the numpy.dtype class that describes the data type more precisely, including:

  • Type of the data (integer, float, Python object, etc.)
  • Size of the data (how many bytes is in e.g. the integer)
  • Byte order of the data (little-endian or big-endian)
  • If the data type is structured, an aggregate of other data types, (e.g., describing an array item consisting of an integer and a float)
  • What are the names of the “fields” of the structure
  • What is the data-type of each field
  • Which part of the memory block each field takes
  • If the data type is a sub-array, what is its shape and data type

In the context of this question, dtype belongs to both pandas and numpy, and in particular dtype('O') means we expect the string.


Here is some code for testing with explanation: If we have the dataset as dictionary

import pandas as pd
import numpy as np
from pandas import Timestamp

data={'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'date': {0: Timestamp('2018-12-12 00:00:00'), 1: Timestamp('2018-12-12 00:00:00'), 2: Timestamp('2018-12-12 00:00:00'), 3: Timestamp('2018-12-12 00:00:00'), 4: Timestamp('2018-12-12 00:00:00')}, 'role': {0: 'Support', 1: 'Marketing', 2: 'Business Development', 3: 'Sales', 4: 'Engineering'}, 'num': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567}, 'fnum': {0: 3.14, 1: 2.14, 2: -0.14, 3: 41.3, 4: 3.14}}
df = pd.DataFrame.from_dict(data) #now we have a dataframe

print(df)
print(df.dtypes)

The last lines will examine the dataframe and note the output:

   id       date                  role  num   fnum
0   1 2018-12-12               Support  123   3.14
1   2 2018-12-12             Marketing  234   2.14
2   3 2018-12-12  Business Development  345  -0.14
3   4 2018-12-12                 Sales  456  41.30
4   5 2018-12-12           Engineering  567   3.14
id               int64
date    datetime64[ns]
role            object
num              int64
fnum           float64
dtype: object

All kind of different dtypes

df.iloc[1,:] = np.nan
df.iloc[2,:] = None

But if we try to set np.nan or None this will not affect the original column dtype. The output will be like this:

print(df)
print(df.dtypes)

    id       date         role    num   fnum
0  1.0 2018-12-12      Support  123.0   3.14
1  NaN        NaT          NaN    NaN    NaN
2  NaN        NaT         None    NaN    NaN
3  4.0 2018-12-12        Sales  456.0  41.30
4  5.0 2018-12-12  Engineering  567.0   3.14
id             float64
date    datetime64[ns]
role            object
num            float64
fnum           float64
dtype: object

So np.nan or None will not change the column’s dtype, unless we set all of the column’s rows to np.nan or None. In that case the column will become float64 or object respectively.

You may try also setting single rows:

df.iloc[3,:] = 0 # will convert datetime to object only
df.iloc[4,:] = '' # will convert all columns to object

And note here: if we set a string inside a non-string column, the column will become string or object dtype.
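Conversely, an object column that really holds numbers can usually be pushed back to a concrete dtype; a small sketch using pd.to_numeric (errors='coerce' turns anything unparseable into NaN):

df['num'] = pd.to_numeric(df['num'], errors='coerce')   # object -> float64, NaN where parsing fails
print(df['num'].dtype)                                   # float64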


回答 2

它的意思是“一个python对象”,即不是numpy支持的内置标量类型之一。

np.array([object()]).dtype
=> dtype('O')

It means “a python object”, i.e. not one of the builtin scalar types supported by numpy.

np.array([object()]).dtype
=> dtype('O')
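For contrast, a quick illustration (assuming the usual import numpy as np / import pandas as pd): plain NumPy gives strings a fixed-width unicode dtype, while pandas stores them as objects:

np.array(['foo', 'bar']).dtype    # dtype('<U3')  -- fixed-width unicode in NumPy
pd.Series(['foo', 'bar']).dtype   # dtype('O')    -- python objects in pandas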

回答 3

“ O”代表对象

#Loading a csv file as a dataframe
import pandas as pd 
train_df = pd.read_csv('train.csv')
col_name = 'Name of Employee'

#Checking the datatype of column name
train_df[col_name].dtype

#Instead try printing the same thing
print train_df[col_name].dtype

第一行返回: dtype('O')

带有print语句的行返回以下内容: object

‘O’ stands for object.

#Loading a csv file as a dataframe
import pandas as pd 
train_df = pd.read_csv('train.csv')
col_name = 'Name of Employee'

#Checking the datatype of column name
train_df[col_name].dtype

#Instead try printing the same thing
print train_df[col_name].dtype

The first line returns: dtype('O')

The line with the print statement returns the following: object