Tag archive: pandas

Drop rows from a dataframe based on a "not in" condition [duplicate]

Question: Drop rows from a dataframe based on a "not in" condition [duplicate]


I want to drop rows from a pandas dataframe when the value of the date column is in a list of dates. The following code doesn’t work:

a=['2015-01-01' , '2015-02-01']

df=df[df.datecolumn not in a]

I get the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().


Answer 0


You can use pandas.DataFrame.isin.

pandas.DataFrame.isin will return Boolean values depending on whether each element is inside the list a or not. You then invert this with the ~ to convert True to False and vice versa.

import pandas as pd

a = ['2015-01-01' , '2015-02-01']

df = pd.DataFrame(data={'date':['2015-01-01' , '2015-02-01', '2015-03-01' , '2015-04-01', '2015-05-01' , '2015-06-01']})

print(df)
#         date
#0  2015-01-01
#1  2015-02-01
#2  2015-03-01
#3  2015-04-01
#4  2015-05-01
#5  2015-06-01

df = df[~df['date'].isin(a)]

print(df)
#         date
#2  2015-03-01
#3  2015-04-01
#4  2015-05-01
#5  2015-06-01

Answer 1


You can use Series.isin:

df = df[~df.datecolumn.isin(a)]

While the error message suggests that all() or any() can be used, they are useful only when you want to reduce the result to a single Boolean value. That, however, is not what you are trying to do here: you want to test the membership of every value in the Series against the external list and keep the results intact (i.e., a Boolean Series, which is then used to slice the original DataFrame).

You can read more about this in the Gotchas.
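As a quick illustration (with made-up data), isin() keeps the element-wise Boolean Series intact, while any() and all() collapse it into a single Boolean, which is why they are no help here:

import pandas as pd

s = pd.Series(['2015-01-01', '2015-03-01'])
mask = s.isin(['2015-01-01', '2015-02-01'])
print(mask)        # element-wise Boolean Series: True, False
print(mask.any())  # a single Boolean: True
print(mask.all())  # a single Boolean: False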


Python Pandas: replace NaN in one column with the value from the corresponding row of a second column

Question: Python Pandas: replace NaN in one column with the value from the corresponding row of a second column


I am working with this Pandas DataFrame in Python.

File    heat    Farheit Temp_Rating
   1    YesQ         75         N/A
   1    NoR         115         N/A
   1    YesA         63         N/A
   1    NoT          83          41
   1    NoY         100          80
   1    YesZ         56          12
   2    YesQ        111         N/A
   2    NoR          60         N/A
   2    YesA         19         N/A
   2    NoT         106          77
   2    NoY          45          21
   2    YesZ         40          54
   3    YesQ         84         N/A
   3    NoR          67         N/A
   3    YesA         94         N/A
   3    NoT          68          39
   3    NoY          63          46
   3    YesZ         34          81

I need to replace all NaNs in the Temp_Rating column with the value from the Farheit column.

This is what I need:

File        heat    Temp_Rating
   1        YesQ             75
   1         NoR            115
   1        YesA             63
   1        YesQ             41
   1         NoR             80
   1        YesA             12
   2        YesQ            111
   2         NoR             60
   2        YesA             19
   2         NoT             77
   2         NoY             21
   2        YesZ             54
   3        YesQ             84
   3         NoR             67
   3        YesA             94
   3         NoT             39
   3         NoY             46
   3        YesZ             81

If I do a Boolean selection, I can pick out only one of these columns at a time. The problem is if I then try to join them, I am not able to do this while preserving the correct order.

How can I only find Temp_Rating rows with the NaNs and replace them with the value in the same row of the Farheit column?


Answer 0


Assuming your DataFrame is in df:

df.Temp_Rating.fillna(df.Farheit, inplace=True)
del df['Farheit']
df.columns = 'File heat Observations'.split()

First replace any NaN values with the corresponding value of df.Farheit, then delete the 'Farheit' column and rename the remaining columns. The resulting DataFrame matches the desired output shown in the question, with the last column labeled Observations.


Answer 1


The above-mentioned solutions did not work for me. The method I used was:

df.loc[df['foo'].isnull(),'foo'] = df['bar']

Answer 2


Another way to solve this problem:

import pandas as pd
import numpy as np

ts_df = pd.DataFrame([[1,"YesQ",75,],[1,"NoR",115,],[1,"NoT",63,13],[2,"YesT",43,71]],columns=['File','heat','Farheit','Temp'])


def fx(x):
    if np.isnan(x['Temp']):
        return x['Farheit']
    else:
        return x['Temp']
print(1,ts_df)
ts_df['Temp']=ts_df.apply(lambda x : fx(x),axis=1)

print(2,ts_df)

returns:

(1,    File  heat  Farheit  Temp                                                                                    
0     1  YesQ       75   NaN                                                                                        
1     1   NoR      115   NaN                                                                                        
2     1   NoT       63  13.0                                                                                        
3     2  YesT       43  71.0)                                                                                       
(2,    File  heat  Farheit   Temp                                                                                   
0     1  YesQ       75   75.0                                                                                       
1     1   NoR      115  115.0
2     1   NoT       63   13.0
3     2  YesT       43   71.0)

Move a column by name to the front of a table in pandas

Question: Move a column by name to the front of a table in pandas


Here is my df:

                             Net   Upper   Lower  Mid  Zsore
Answer option                                                
More than once a day          0%   0.22%  -0.12%   2    65 
Once a day                    0%   0.32%  -0.19%   3    45
Several times a week          2%   2.45%   1.10%   4    78
Once a week                   1%   1.63%  -0.40%   6    65

How can I move a column by name ("Mid") to the front of the table (index 0)? This is what the result should look like:

                             Mid   Upper   Lower  Net  Zsore
Answer option                                                
More than once a day          2   0.22%  -0.12%   0%    65 
Once a day                    3   0.32%  -0.19%   0%    45
Several times a week          4   2.45%   1.10%   2%    78
Once a week                   6   1.63%  -0.40%   1%    65

My current code moves the column by index using df.columns.tolist() but I’d like to shift it by name.


Answer 0


We can use ix to reorder by passing a list:

In [27]:
# get a list of columns
cols = list(df)
# move the column to head of list using index, pop and insert
cols.insert(0, cols.pop(cols.index('Mid')))
cols
Out[27]:
['Mid', 'Net', 'Upper', 'Lower', 'Zsore']
In [28]:
# use ix to reorder
df = df.ix[:, cols]
df
Out[28]:
                      Mid Net  Upper   Lower  Zsore
Answer_option                                      
More_than_once_a_day    2  0%  0.22%  -0.12%     65
Once_a_day              3  0%  0.32%  -0.19%     45
Several_times_a_week    4  2%  2.45%   1.10%     78
Once_a_week             6  1%  1.63%  -0.40%     65

Another method is to take a reference to the column and reinsert it at the front:

In [39]:
mid = df['Mid']
df.drop(labels=['Mid'], axis=1,inplace = True)
df.insert(0, 'Mid', mid)
df
Out[39]:
                      Mid Net  Upper   Lower  Zsore
Answer_option                                      
More_than_once_a_day    2  0%  0.22%  -0.12%     65
Once_a_day              3  0%  0.32%  -0.19%     45
Several_times_a_week    4  2%  2.45%   1.10%     78
Once_a_week             6  1%  1.63%  -0.40%     65

You can also use loc to achieve the same result, as ix is deprecated from pandas 0.20.0 onwards:

df = df.loc[:, cols]

Answer 1


Maybe I’m missing something, but a lot of these answers seem overly complicated. You should be able to just set the columns within a single list:

Column to the front:

df = df[ ['Mid'] + [ col for col in df.columns if col != 'Mid' ] ]

Or if instead, you want to move it to the back:

df = df[ [ col for col in df.columns if col != 'Mid' ] + ['Mid'] ]

Or if you wanted to move more than one column:

cols_to_move = ['Mid', 'Zsore']
df           = df[ cols_to_move + [ col for col in df.columns if col not in cols_to_move ] ]

Answer 2


You can use the df.reindex() function in pandas. Here df is:

                      Net  Upper   Lower  Mid  Zsore
Answer option                                      
More than once a day  0%  0.22%  -0.12%    2     65
Once a day            0%  0.32%  -0.19%    3     45
Several times a week  2%  2.45%   1.10%    4     78
Once a week           1%  1.63%  -0.40%    6     65

Define a list of column names:

cols = df.columns.tolist()
cols
Out[13]: ['Net', 'Upper', 'Lower', 'Mid', 'Zsore']

Move the column name to wherever you want:

cols.insert(0, cols.pop(cols.index('Mid')))
cols
Out[16]: ['Mid', 'Net', 'Upper', 'Lower', 'Zsore']

Then use the df.reindex() function to reorder:

df = df.reindex(columns= cols)

The output df is:

                      Mid  Upper   Lower Net  Zsore
Answer option                                      
More than once a day    2  0.22%  -0.12%  0%     65
Once a day              3  0.32%  -0.19%  0%     45
Several times a week    4  2.45%   1.10%  2%     78
Once a week             6  1.63%  -0.40%  1%     65

Answer 3


I prefer this solution:

col = df.pop("Mid")
df.insert(0, col.name, col)

It’s simpler to read and faster than other suggested answers.

def move_column_inplace(df, col, pos):
    col = df.pop(col)
    df.insert(pos, col.name, col)

Performance assessment:

For this test, the currently last column is moved to the front in each repetition. In-place methods generally perform better. While citynorman’s solution can be made in-place, Ed Chum’s method based on .loc and sachinnm’s method based on reindex cannot.

While other methods are generic, citynorman’s solution is limited to pos=0. I didn’t observe any performance difference between df.loc[cols] and df[cols], which is why I didn’t include some other suggestions.

I tested with python 3.6.8 and pandas 0.24.2 on a MacBook Pro (Mid 2015).

import numpy as np
import pandas as pd

n_cols = 11
df = pd.DataFrame(np.random.randn(200000, n_cols),
                  columns=range(n_cols))

def move_column_inplace(df, col, pos):
    col = df.pop(col)
    df.insert(pos, col.name, col)

def move_to_front_normanius_inplace(df, col):
    move_column_inplace(df, col, 0)
    return df

def move_to_front_chum(df, col):
    cols = list(df)
    cols.insert(0, cols.pop(cols.index(col)))
    return df.loc[:, cols]

def move_to_front_chum_inplace(df, col):
    col = df[col]
    df.drop(col.name, axis=1, inplace=True)
    df.insert(0, col.name, col)
    return df

def move_to_front_elpastor(df, col):
    cols = [col] + [ c for c in df.columns if c!=col ]
    return df[cols] # or df.loc[cols]

def move_to_front_sachinmm(df, col):
    cols = df.columns.tolist()
    cols.insert(0, cols.pop(cols.index(col)))
    df = df.reindex(columns=cols, copy=False)
    return df

def move_to_front_citynorman_inplace(df, col):
    # This approach exploits the fact that reset_index() moves the index
    # to the first position of the data frame.
    df.set_index(col, inplace=True)
    df.reset_index(inplace=True)
    return df

def test(method, df):
    col = np.random.randint(0, n_cols)
    method(df, col)

col = np.random.randint(0, n_cols)
ret_mine = move_to_front_normanius_inplace(df.copy(), col)
ret_chum1 = move_to_front_chum(df.copy(), col)
ret_chum2 = move_to_front_chum_inplace(df.copy(), col)
ret_elpas = move_to_front_elpastor(df.copy(), col)
ret_sach = move_to_front_sachinmm(df.copy(), col)
ret_city = move_to_front_citynorman_inplace(df.copy(), col)

# Assert equivalence of solutions.
assert(ret_mine.equals(ret_chum1))
assert(ret_mine.equals(ret_chum2))
assert(ret_mine.equals(ret_elpas))
assert(ret_mine.equals(ret_sach))
assert(ret_mine.equals(ret_city))

Results:

# For n_cols = 11:
%timeit test(move_to_front_normanius_inplace, df)
# 1.05 ms ± 42.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit test(move_to_front_citynorman_inplace, df)
# 1.68 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit test(move_to_front_sachinmm, df)
# 3.24 ms ± 96.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test(move_to_front_chum, df)
# 3.84 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test(move_to_front_elpastor, df)
# 3.85 ms ± 58.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test(move_to_front_chum_inplace, df)
# 9.67 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# For n_cols = 31:
%timeit test(move_to_front_normanius_inplace, df)
# 1.26 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test(move_to_front_citynorman_inplace, df)
# 1.95 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test(move_to_front_sachinmm, df)
# 10.7 ms ± 348 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test(move_to_front_chum, df)
# 11.5 ms ± 869 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test(move_to_front_elpastor, df)
# 11.4 ms ± 598 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test(move_to_front_chum_inplace, df)
# 31.4 ms ± 1.89 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Answer 4


I didn’t like how I had to explicitly specify all the other columns in the other solutions, so this worked best for me. Though it might be slow for large dataframes?

df = df.set_index('Mid').reset_index()


Answer 5


Here is a generic set of code that I frequently use to rearrange the position of columns. You may find it useful.

cols = df.columns.tolist()
n = int(cols.index('Mid'))
cols = [cols[n]] + cols[:n] + cols[n+1:]
df = df[cols]

Answer 6

要重新排列DataFrame的行,只需使用如下列表即可。

df = df[['Mid', 'Net', 'Upper', 'Lower', 'Zsore']]

这使得以后阅读代码时所做的工作非常明显。也可以使用:

df.columns
Out[1]: Index(['Net', 'Upper', 'Lower', 'Mid', 'Zsore'], dtype='object')

然后剪切并粘贴以重新排序。


对于具有许多列的DataFrame,将列的列表存储在变量中,然后将所需的列弹出到列表的前面。这是一个例子:

cols = [str(col_name) for col_name in range(1001)]
data = np.random.rand(10,1001)
df = pd.DataFrame(data=data, columns=cols)

mv_col = cols.pop(cols.index('77'))
df = df[[mv_col] + cols]

现在df.columns有。

Index(['77', '0', '1', '2', '3', '4', '5', '6', '7', '8',
       ...
       '991', '992', '993', '994', '995', '996', '997', '998', '999', '1000'],
      dtype='object', length=1001)

To reorder the columns of a DataFrame, just use a list as follows.

df = df[['Mid', 'Net', 'Upper', 'Lower', 'Zsore']]

This makes it very obvious what was done when reading the code later. You can also use:

df.columns
Out[1]: Index(['Net', 'Upper', 'Lower', 'Mid', 'Zsore'], dtype='object')

Then cut and paste to reorder.


For a DataFrame with many columns, store the list of columns in a variable and pop the desired column to the front of the list. Here is an example:

cols = [str(col_name) for col_name in range(1001)]
data = np.random.rand(10,1001)
df = pd.DataFrame(data=data, columns=cols)

mv_col = cols.pop(cols.index('77'))
df = df[[mv_col] + cols]

Now df.columns yields:

Index(['77', '0', '1', '2', '3', '4', '5', '6', '7', '8',
       ...
       '991', '992', '993', '994', '995', '996', '997', '998', '999', '1000'],
      dtype='object', length=1001)

Answer 7


Here is a very simple answer to this.

Don’t forget the two sets of brackets (()) around the column names; otherwise, it’ll give you an error.


# here you can add below line and it should work 
df = df[list(('Mid','Upper', 'Lower', 'Net','Zsore'))]
df

                             Mid   Upper   Lower  Net  Zsore
Answer option                                                
More than once a day          2   0.22%  -0.12%   0%    65 
Once a day                    3   0.32%  -0.19%   0%    45
Several times a week          4   2.45%   1.10%   2%    78
Once a week                   6   1.63%  -0.40%   1%    65

Answer 8


The simplest thing you can try is:

df = df[['Mid', 'Upper', 'Lower', 'Net', 'Zsore']]

Is there a way in Pandas to use the previous row value in dataframe.apply when the previous value is also calculated in the apply?

Question: Is there a way in Pandas to use the previous row value in dataframe.apply when the previous value is also calculated in the apply?


I have the following dataframe:

 Index_Date    A    B    C    D
 ===============================
 2015-01-31    10   10   NaN   10
 2015-02-01     2    3   NaN   22
 2015-02-02    10   60   NaN  280
 2015-02-03    10  100   NaN  250

Required:

 Index_Date    A    B     C    D
 ================================
 2015-01-31    10   10    10   10
 2015-02-01     2    3    23   22
 2015-02-02    10   60   290  280
 2015-02-03    10  100  3000  250

Column C is derived for 2015-01-31 by taking the value of D.

Then I need to use the value of C for 2015-01-31, multiply it by the value of A on 2015-02-01, and add B.

I have attempted an apply and a shift using an if-else, but this gives a key error.


Answer 0


First, create the derived value:

df.loc[0, 'C'] = df.loc[0, 'D']

Then iterate through the remaining rows and fill the calculated values:

for i in range(1, len(df)):
    df.loc[i, 'C'] = df.loc[i-1, 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']


  Index_Date   A   B    C    D
0 2015-01-31  10  10   10   10
1 2015-02-01   2   3   23   22
2 2015-02-02  10  60  290  280

Answer 1


Given a column of numbers:

lst = []
cols = ['A']
for a in range(100, 105):
    lst.append([a])
df = pd.DataFrame(lst, columns=cols, index=range(5))
df

    A
0   100
1   101
2   102
3   103
4   104

You can reference the previous row with shift:

df['Change'] = df.A - df.A.shift(1)
df

    A   Change
0   100 NaN
1   101 1.0
2   102 1.0
3   103 1.0
4   104 1.0

Answer 2


numba

For recursive calculations which are not vectorisable, numba, which uses JIT-compilation and works with lower level objects, often yields large performance improvements. You need only define a regular for loop and use the decorator @njit or (for older versions) @jit(nopython=True):

For a reasonably sized dataframe, this gives a ~30x performance improvement over a regular for loop:

import numpy as np
from numba import jit

@jit(nopython=True)
def calculator_nb(a, b, d):
    res = np.empty(d.shape)
    res[0] = d[0]
    for i in range(1, res.shape[0]):
        res[i] = res[i-1] * a[i] + b[i]
    return res

df['C'] = calculator_nb(*df[list('ABD')].values.T)

n = 10**5
df = pd.concat([df]*n, ignore_index=True)

# benchmarking on Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3, Numba 0.30.1
# calculator() is same as calculator_nb() but without @jit decorator
%timeit calculator_nb(*df[list('ABD')].values.T)  # 14.1 ms per loop
%timeit calculator(*df[list('ABD')].values.T)     # 444 ms per loop

Answer 3


Applying the recursive function on numpy arrays will be faster than the current answer.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.repeat(np.arange(1, 6), 3).reshape(5, 3), columns=['A', 'B', 'D'])
new = [df.D.values[0]]
for i in range(1, len(df.index)):
    new.append(new[i-1]*df.A.values[i]+df.B.values[i])
df['C'] = new

Output

      A  B  D    C
   0  1  1  1    1
   1  2  2  2    4
   2  3  3  3   15
   3  4  4  4   64
   4  5  5  5  325

Answer 4


Although it has been a while since this question was asked, I will post my answer hoping it helps somebody.

Disclaimer: I know this solution is not standard, but I think it works well.

import pandas as pd
import numpy as np

data = np.array([[10, 2, 10, 10],
                 [10, 3, 60, 100],
                 [np.nan] * 4,
                 [10, 22, 280, 250]]).T
idx = pd.date_range('20150131', end='20150203')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df
               A    B     C    D
 =================================
 2015-01-31    10   10    NaN  10
 2015-02-01    2    3     NaN  22 
 2015-02-02    10   60    NaN  280
 2015-02-03    10   100   NaN  250

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
df
               A    B     C     D
 =================================
 2015-01-31    10   10    10    10
 2015-02-01    2    3     23    22 
 2015-02-02    10   60    290   280
 2015-02-03    10   100   3000  250

So basically we use apply from pandas with the help of a global variable that keeps track of the previously calculated value.


Time comparison with a for loop:

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

df.loc['2015-01-31', 'C'] = df.loc['2015-01-31', 'D']

%%timeit
for i in df.loc['2015-02-01':].index.date:
    df.loc[i, 'C'] = df.loc[(i - pd.DateOffset(days=1)).date(), 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']

3.2 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value

%%timeit
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)

1.82 s ± 64.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So the apply version takes about 0.57 times as long, i.e. it is roughly 1.75x faster on average.


Answer 5


In general, the key to avoiding an explicit loop would be to join (merge) 2 instances of the dataframe on rowindex-1==rowindex.

Then you would have a big dataframe containing rows r and r-1, on which you could run a df.apply() function (a minimal sketch follows below).
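A minimal sketch of this idea (hypothetical data; note it covers fixed-lag dependencies such as reading r-1's original values, not the fully recursive calculation from the question):

import pandas as pd

df = pd.DataFrame({'A': [10, 2, 10, 10], 'B': [10, 3, 60, 100]})

# Pair each row r with row r-1 by shifting and concatenating.
prev = df.shift(1).add_suffix('_prev')
wide = pd.concat([df, prev], axis=1)

# A row-wise apply can now see both r and r-1.
wide['A_plus_prev_B'] = wide.apply(
    lambda row: row['A'] + (row['B_prev'] if pd.notna(row['B_prev']) else 0),
    axis=1)
print(wide)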

However the overhead of creating the large dataset may offset the benefits of parallel processing…

HTH Martin


How to plot two columns of a pandas dataframe using points?

Question: How to plot two columns of a pandas dataframe using points?


I have a pandas data frame and would like to plot values from one column versus the values from another column. Fortunately, there is a plot method associated with data frames that seems to do what I need:

df.plot(x='col_name_1', y='col_name_2')

Unfortunately, it looks like among the plot styles (listed here after the kind parameter) there are no points. I can use lines or bars or even density, but not points. Is there a workaround that can help solve this problem?


Answer 0


You can specify the style of the plotted line when calling df.plot:

df.plot(x='col_name_1', y='col_name_2', style='o')

The style argument can also be a dict or list, e.g.:

import numpy as np
import pandas as pd

d = {'one' : np.random.rand(10),
     'two' : np.random.rand(10)}

df = pd.DataFrame(d)

df.plot(style=['o','rx'])

All the accepted style formats are listed in the documentation of matplotlib.pyplot.plot.


Answer 1


For this (and most plotting) I would not rely on the Pandas wrappers to matplotlib. Instead, just use matplotlib directly:

import matplotlib.pyplot as plt
plt.scatter(df['col_name_1'], df['col_name_2'])
plt.show() # Depending on whether you use IPython or interactive mode, etc.

and remember that you can access a NumPy array of the column’s values with df.col_name_1.values for example.

I ran into trouble using this with Pandas default plotting in the case of a column of Timestamp values with millisecond precision. In trying to convert the objects to datetime64 type, I also discovered a nasty issue: "Pandas gives incorrect result when asking if Timestamp column values have attr astype".


Answer 2


Pandas uses matplotlib as a library for basic plots. The easiest way in your case will be using the following:

import pandas as pd
import numpy as np

#creating sample data 
sample_data={'col_name_1':np.random.rand(20),
      'col_name_2': np.random.rand(20)}
df= pd.DataFrame(sample_data)
df.plot(x='col_name_1', y='col_name_2', style='o')

However, I would recommend using seaborn as an alternative solution if you want more customized plots without going down to the basic level of matplotlib. In this case the solution will be the following:

import pandas as pd
import seaborn as sns
import numpy as np

#creating sample data 
sample_data={'col_name_1':np.random.rand(20),
      'col_name_2': np.random.rand(20)}
df= pd.DataFrame(sample_data)
sns.scatterplot(x="col_name_1", y="col_name_2", data=df)


Answer 3


In recent pandas versions you can use the df.plot.scatter function directly:

df = pd.DataFrame([[5.1, 3.5, 0], [4.9, 3.0, 0], [7.0, 3.2, 1],
                   [6.4, 3.2, 1], [5.9, 3.0, 2]],
                  columns=['length', 'width', 'species'])
ax1 = df.plot.scatter(x='length',
                      y='width',
                      c='DarkBlue')

https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.plot.scatter.html


Strings in a DataFrame, but dtype is object

Question: Strings in a DataFrame, but dtype is object


Why does Pandas tell me that I have objects, although every item in the selected column is a string, even after explicit conversion?

This is my DataFrame:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56992 entries, 0 to 56991
Data columns (total 7 columns):
id            56992  non-null values
attr1         56992  non-null values
attr2         56992  non-null values
attr3         56992  non-null values
attr4         56992  non-null values
attr5         56992  non-null values
attr6         56992  non-null values
dtypes: int64(2), object(5)

Five of them are dtype object. I explicitly convert those objects to strings:

for c in df.columns:
    if df[c].dtype == object:
        print "convert ", df[c].name, " to string"
        df[c] = df[c].astype(str)

Then, df["attr2"] still has dtype object, although type(df["attr2"].ix[0]) reveals str, which is correct.

Pandas distinguishes between int64, float64, and object. What is the logic behind this when there is no dtype str? Why is a str covered by object?


Answer 0


The object dtype comes from NumPy; it describes the type of element in an ndarray. Every element in an ndarray must have the same size in bytes. For int64 and float64, that is 8 bytes. But for strings, the length is not fixed. So instead of saving the bytes of the strings in the ndarray directly, Pandas uses an object ndarray, which saves pointers to objects; because of this, the dtype of this kind of ndarray is object.

Here is an example:

  • the int64 array contains 4 int64 values.
  • the object array contains 4 pointers to 3 string objects.
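As a small illustration of the difference (this snippet is mine, not from the original answer):

import numpy as np

arr_int = np.array([3, 0, 1], dtype=np.int64)           # values stored inline
arr_obj = np.array(['hello', 'i', 'am'], dtype=object)  # pointers to Python objects

print(arr_int.dtype, arr_int.itemsize)  # int64 8  -> each slot holds the 8-byte value
print(arr_obj.dtype, arr_obj.itemsize)  # object 8 -> each slot holds a pointer (on 64-bit)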


Answer 1


The accepted answer is good. Just wanted to provide an answer which referenced the documentation. The documentation says:

Pandas uses the object dtype for storing strings.

As the leading comment says “Don’t worry about it; it’s supposed to be like this.” (Although the accepted answer did a great job explaining the “why”; strings are variable-length)

But for strings, the length of the string is not fixed.


Answer 2


@HYRY’s answer is great. I just want to provide a little more context.

Arrays store data as contiguous, fixed-size memory blocks. The combination of these properties is what makes arrays lightning fast for data access. For example, consider how your computer might store an array of 32-bit integers, [3,0,1].

If you ask your computer to fetch the 3rd element in the array, it’ll start at the beginning and then jump across 64 bits to get to the 3rd element. Knowing exactly how many bits to jump across is what makes arrays fast.

Now consider the sequence of strings ['hello', 'i', 'am', 'a', 'banana']. Strings are objects that vary in size, so if you tried to store them in contiguous memory blocks, it’d end up looking like this.

Now your computer doesn’t have a fast way to access a randomly requested element. The key to overcoming this is to use pointers. Basically, store each string in some random memory location, and fill the array with the memory address of each string. (Memory addresses are just integers.) So now, things look like this

Now, if you ask your computer to fetch the 3rd element, just as before, it can jump across 64 bits (assuming the memory addresses are 32-bit integers) and then make one extra step to go fetch the string.

The challenge for NumPy is that there’s no guarantee the pointers are actually pointing to strings. That’s why it reports the dtype as ‘object’.

Shamelessly gonna plug my own blog article where I originally discussed this.
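As a short, illustrative addition (not from the original answer): NumPy can also store strings inline with a fixed-width dtype, but then every slot must be padded to the longest string, which is another reason pandas prefers pointer-based object arrays:

import numpy as np

fixed = np.array(['hello', 'i', 'am', 'a', 'banana'])                # fixed-width '<U6'
boxed = np.array(['hello', 'i', 'am', 'a', 'banana'], dtype=object)  # pointers

print(fixed.dtype, fixed.itemsize)  # <U6 24 -> every slot padded to 6 chars (4 bytes each)
print(boxed.dtype, boxed.itemsize)  # object 8 -> pointer-sized slots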


Answer 3


As of version 1.0.0 (January 2020), pandas has introduced, as an experimental feature, first-class support for string types through pandas.StringDtype.

While you’ll still be seeing object by default, the new type can be used by specifying a dtype of pd.StringDtype or simply 'string':

>>> pd.Series(['abc', None, 'def'])
0     abc
1    None
2     def
dtype: object
>>> pd.Series(['abc', None, 'def'], dtype=pd.StringDtype())
0     abc
1    <NA>
2     def
dtype: string
>>> pd.Series(['abc', None, 'def']).astype('string')
0     abc
1    <NA>
2     def
dtype: string

How to make pandas DataFrame column headers all lowercase?

Question: How to make pandas DataFrame column headers all lowercase?


I want to make all column headers in my pandas data frame lower case.

Example

If I have:

data =

  country country isocode  year     XRAT          tcgdp
0  Canada             CAN  2001  1.54876   924909.44207
1  Canada             CAN  2002  1.56932   957299.91586
2  Canada             CAN  2003  1.40105  1016902.00180
....

I would like to change XRAT to xrat by doing something like:

data.headers.lowercase()

So that I get:

  country country isocode  year     xrat          tcgdp
0  Canada             CAN  2001  1.54876   924909.44207
1  Canada             CAN  2002  1.56932   957299.91586
2  Canada             CAN  2003  1.40105  1016902.00180
3  Canada             CAN  2004  1.30102  1096000.35500
....

I will not know the names of each column header ahead of time.


Answer 0


You can do it like this:

data.columns = map(str.lower, data.columns)

or

data.columns = [x.lower() for x in data.columns]

example:

>>> data = pd.DataFrame({'A':range(3), 'B':range(3,0,-1), 'C':list('abc')})
>>> data
   A  B  C
0  0  3  a
1  1  2  b
2  2  1  c
>>> data.columns = map(str.lower, data.columns)
>>> data
   a  b  c
0  0  3  a
1  1  2  b
2  2  1  c

Answer 1


You could do it easily with str.lower for columns:

df.columns = df.columns.str.lower()

Example:

In [63]: df
Out[63]: 
  country country isocode  year     XRAT         tcgdp
0  Canada             CAN  2001  1.54876  9.249094e+05
1  Canada             CAN  2002  1.56932  9.572999e+05
2  Canada             CAN  2003  1.40105  1.016902e+06

In [64]: df.columns = df.columns.str.lower()

In [65]: df
Out[65]: 
  country country isocode  year     xrat         tcgdp
0  Canada             CAN  2001  1.54876  9.249094e+05
1  Canada             CAN  2002  1.56932  9.572999e+05
2  Canada             CAN  2003  1.40105  1.016902e+06

Answer 2


If you want to do the rename using a chained method call, you can use

data.rename(
    columns=unicode.lower
)

(Python 2)

or

data.rename(
    columns=str.lower
)

(Python 3)


Answer 3


Here is a simple way: data.columns = data.columns.str.lower()


Answer 4

df.columns = df.columns.str.lower()

is the easiest, but it will give an error if some headers are numeric.

If you have numeric headers, then use this:

df.columns = [str(x).lower() for x in df.columns]

List the highest correlation pairs from a large correlation matrix in Pandas?

Question: List the highest correlation pairs from a large correlation matrix in Pandas?


How do you find the top correlations in a correlation matrix with Pandas? There are many answers on how to do this with R (Show correlations as an ordered list, not as a large matrix or Efficient way to get highly correlated pairs from large data set in Python or R), but I am wondering how to do it with pandas. In my case the matrix is 4460×4460, so I can’t do it visually.


Answer 0


You can use DataFrame.values to get a NumPy array of the data and then use NumPy functions such as argsort() to get the most correlated pairs (a sketch of that route follows at the end of this answer).

But if you want to do this in pandas, you can unstack and sort the DataFrame:

import pandas as pd
import numpy as np

shape = (50, 4460)

data = np.random.normal(size=shape)

data[:, 1000] += data[:, 2000]

df = pd.DataFrame(data)

c = df.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")

print so[-4470:-4460]

Here is the output:

2192  1522    0.636198
1522  2192    0.636198
3677  2027    0.641817
2027  3677    0.641817
242   130     0.646760
130   242     0.646760
1171  2733    0.670048
2733  1171    0.670048
1000  2000    0.742340
2000  1000    0.742340
dtype: float64
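For completeness, here is a minimal sketch of the argsort() route mentioned at the top (my own addition, reusing the df from the snippet above):

import numpy as np

c = df.corr().abs().values
np.fill_diagonal(c, 0)            # ignore self-correlations
order = np.argsort(c, axis=None)  # ascending order over the flattened matrix
rows, cols = np.unravel_index(order[-4:], c.shape)
# Each pair appears twice because the matrix is symmetric.
print(list(zip(rows, cols)))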

Answer 1


@HYRY’s answer is perfect. Just building on that answer by adding a bit more logic to avoid duplicate and self-correlations, and to sort properly:

import pandas as pd
d = {'x1': [1, 4, 4, 5, 6], 
     'x2': [0, 0, 8, 2, 4], 
     'x3': [2, 8, 8, 10, 12], 
     'x4': [-1, -4, -4, -4, -5]}
df = pd.DataFrame(data = d)
print("Data Frame")
print(df)
print()

print("Correlation Matrix")
print(df.corr())
print()

def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

print("Top Absolute Correlations")
print(get_top_abs_correlations(df, 3))

That gives the following output:

Data Frame
   x1  x2  x3  x4
0   1   0   2  -1
1   4   0   8  -4
2   4   8   8  -4
3   5   2  10  -4
4   6   4  12  -5

Correlation Matrix
          x1        x2        x3        x4
x1  1.000000  0.399298  1.000000 -0.969248
x2  0.399298  1.000000  0.399298 -0.472866
x3  1.000000  0.399298  1.000000 -0.969248
x4 -0.969248 -0.472866 -0.969248  1.000000

Top Absolute Correlations
x1  x3    1.000000
x3  x4    0.969248
x1  x4    0.969248
dtype: float64

Answer 2

A few-line solution without redundant variable pairs:

corr_matrix = df.corr().abs()

# the matrix is symmetric, so extract the upper triangle without the diagonal (k=1)

sol = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                  .stack()
                  .sort_values(ascending=False))

# the first element of the sol Series is the pair with the biggest correlation

Then you can iterate through the variable-pair names (which form the pandas.Series MultiIndex) and their values like this:

for index, value in sol.items():
    # do some stuff
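
As an illustration (not part of the original answer), that loop could print every pair whose absolute correlation exceeds some threshold:

for (col_a, col_b), value in sol.items():
    if value > 0.9:
        print(col_a, col_b, value)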

Answer 3

Combining some features of @HYRY and @arun’s answers, you can print the top correlations for dataframe df in a single line using:

df.corr().unstack().sort_values().drop_duplicates()

Note: one downside is that if two different variables happen to share an identical correlation value (for example a 1.0 correlation that is not a variable with itself), the drop_duplicates() step will remove one of them, as the example below illustrates.
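
A quick sketch of that behaviour, reusing the small df built in Answer 1 above (x1 and x3 are perfectly correlated, and x1 and x3 both have a -0.969248 correlation with x4):

import pandas as pd
d = {'x1': [1, 4, 4, 5, 6],
     'x2': [0, 0, 8, 2, 4],
     'x3': [2, 8, 8, 10, 12],
     'x4': [-1, -4, -4, -4, -5]}
df = pd.DataFrame(data=d)

print(df.corr().unstack().sort_values().drop_duplicates())
# the duplicated -0.969248 entries (x1-x4 and x3-x4) collapse into one row,
# and the perfect x1-x3 correlation merges with the 1.0 self-correlations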


Answer 4

Use the code below to view the correlations in descending order.

# See the correlations in descending order

corr = df.corr() # df is the pandas dataframe
c1 = corr.abs().unstack()
c1.sort_values(ascending=False)

Answer 5

You can do this graphically with the simple code below, substituting your own data.

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr()

kot = corr[corr >= .9]
plt.figure(figsize=(12, 8))
sns.heatmap(kot, cmap="Greens")


Answer 6

Lots of good answers here. The easiest way I found was a combination of some of the answers above (here corr is the df.corr() matrix):

corr = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
# unstack() yields a Series, so sort_values() needs no 'by' argument
corr = corr.unstack().sort_values(ascending=False).dropna()

Answer 7

Use itertools.combinations to get all unique correlations from pandas' own correlation matrix .corr(), generate a list of lists and feed it back into a DataFrame in order to use .sort_values(). Set ascending=True to display the lowest correlations on top.

corrank takes a DataFrame as argument because it requires .corr().

import itertools
import pandas as pd

def corrank(X: pd.DataFrame):
    c = X.corr()
    df = pd.DataFrame([[(i, j), c.loc[i, j]] for i, j in itertools.combinations(c, 2)],
                      columns=['pairs', 'corr'])
    print(df.sort_values(by='corr', ascending=False))

corrank(X)  # prints a descending list of correlation pairs (max on top)

Answer 8

I didn’t want to unstack or over-complicate this issue, since I just wanted to drop some highly correlated features as part of a feature selection phase.

So I ended up with the following simplified solution:

# map features to their absolute correlation values
corr = features.corr().abs()

# set equality (self correlation) as zero
corr[corr == 1] = 0

# of each feature, find the max correlation
# and sort the resulting array in ascending order
corr_cols = corr.max().sort_values(ascending=False)

# display the highly correlated features
display(corr_cols[corr_cols > 0.8])

In this case, if you want to drop correlated features, you can map through the filtered corr_cols array and remove the odd-indexed (or even-indexed) ones, as sketched below.
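
A hypothetical sketch of that dropping step, assuming features is the predictor DataFrame from the code above (both members of a highly correlated pair share the same max value, so they end up adjacent after sorting, and taking every other index keeps one column per pair):

high_corr = corr_cols[corr_cols > 0.8]
to_drop = high_corr.index[::2]          # the even-indexed ones; use [1::2] for the odd ones
features_reduced = features.drop(columns=to_drop)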


Answer 9

I tried some of the solutions here but then came up with my own. I hope this might be useful to the next person, so I'm sharing it here:

def sort_correlation_matrix(correlation_matrix):
    # rank all other variables by their absolute correlation with the first column
    cor = correlation_matrix.abs()
    top_col = cor[cor.columns[0]][1:]
    top_col = top_col.sort_values(ascending=False)
    # reorder both the columns and the rows to match that ranking
    ordered_columns = [cor.columns[0]] + top_col.index.tolist()
    return correlation_matrix[ordered_columns].reindex(ordered_columns)

Answer 10

This improves on @MiFi's code. It orders by absolute value without excluding the negative values.

def top_correlation(df, n):
    corr_matrix = df.corr()
    correlation = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                   .stack()
                   .sort_values(ascending=False))
    correlation = pd.DataFrame(correlation).reset_index()
    correlation.columns = ["Variable_1", "Variable_2", "Correlation"]
    correlation = correlation.reindex(
        correlation.Correlation.abs().sort_values(ascending=False).index
    ).reset_index(drop=True)
    return correlation.head(n)

top_correlation(ANYDATA, 10)

Answer 11

The following function should do the trick. This implementation

  • Removes self correlations
  • Removes duplicates
  • Enables the selection of top N highest correlated features

and it is also configurable, so that you can keep both the self correlations as well as the duplicates. You can also report as many feature pairs as you wish.


def get_feature_correlation(df, top_n=None, corr_method='spearman',
                            remove_duplicates=True, remove_self_correlations=True):
    """
    Compute the feature correlation and sort feature pairs based on their correlation

    :param df: The dataframe with the predictor variables
    :type df: pandas.core.frame.DataFrame
    :param top_n: Top N feature pairs to be reported (if None, all of the pairs will be returned)
    :param corr_method: Correlation computation method
    :type corr_method: str
    :param remove_duplicates: Indicates whether duplicate features must be removed
    :type remove_duplicates: bool
    :param remove_self_correlations: Indicates whether self correlations will be removed
    :type remove_self_correlations: bool

    :return: pandas.core.frame.DataFrame
    """
    corr_matrix_abs = df.corr(method=corr_method).abs()
    corr_matrix_abs_us = corr_matrix_abs.unstack()
    sorted_correlated_features = corr_matrix_abs_us \
        .sort_values(kind="quicksort", ascending=False) \
        .reset_index()

    # Remove comparisons of the same feature
    if remove_self_correlations:
        sorted_correlated_features = sorted_correlated_features[
            (sorted_correlated_features.level_0 != sorted_correlated_features.level_1)
        ]

    # Remove duplicates
    if remove_duplicates:
        sorted_correlated_features = sorted_correlated_features.iloc[:-2:2]

    # Create meaningful names for the columns
    sorted_correlated_features.columns = ['Feature 1', 'Feature 2', 'Correlation (abs)']

    if top_n:
        return sorted_correlated_features[:top_n]

    return sorted_correlated_features


Answer 12

I liked Addison Klinke's post the most, as it is the simplest, but I used Wojciech Moszczyńsk's suggestion for filtering and charting, extending the filter to avoid taking absolute values. So, given a large correlation matrix: filter it, chart it, and then flatten it.

Created, Filtered and Charted

import matplotlib.pyplot as plt
import seaborn as sn

dfCorr = df.corr()
filteredDf = dfCorr[((dfCorr >= .5) | (dfCorr <= -.5)) & (dfCorr != 1.000)]
plt.figure(figsize=(30, 10))
sn.heatmap(filteredDf, annot=True, cmap="Reds")
plt.show()

Function

In the end, I created a small function to create the correlation matrix, filter it, and then flatten it. As an idea, it could easily be extended, e.g., asymmetric upper and lower bounds, etc.

def corrFilter(x: pd.DataFrame, bound: float):
    xCorr = x.corr()
    xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr !=1.000)]
    xFlattened = xFiltered.unstack().sort_values().drop_duplicates()
    return xFlattened

corrFilter(df, .7)


Pandas merge - how to avoid duplicating columns

Question: Pandas merge - how to avoid duplicating columns

I am attempting a merge between two data frames. Each data frame has two index levels (date, cusip). In the columns, some columns match between the two (currency, adj date) for example.

What is the best way to merge these by index, but to not take two copies of currency and adj date.

Each data frame is 90 columns, so I am trying to avoid writing everything out by hand.

df:                 currency  adj_date   data_col1 ...
date        cusip
2012-01-01  XSDP      USD      2012-01-03   0.45
...

df2:                currency  adj_date   data_col2 ...
date        cusip
2012-01-01  XSDP      USD      2012-01-03   0.45
...

If I do:

dfNew = merge(df, df2, left_index=True, right_index=True, how='outer')

I get

dfNew:              currency_x  adj_date_x   data_col2 ... currency_y adj_date_y
date        cusip
2012-01-01  XSDP      USD      2012-01-03   0.45             USD         2012-01-03

Thank you! …


Answer 0

You can work out the columns that are only in one DataFrame and use this to select a subset of columns in the merge.

cols_to_use = df2.columns.difference(df.columns)

Then perform the merge (note this is an index object but it has a handy tolist() method).

dfNew = pd.merge(df, df2[cols_to_use], left_index=True, right_index=True, how='outer')

This will avoid any columns clashing in the merge.
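
For instance, with the two frames from the question (column names assumed as shown there), the difference would look roughly like:

cols_to_use = df2.columns.difference(df.columns)
print(cols_to_use.tolist())  # e.g. ['data_col2']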


Answer 1

I use the suffixes option in .merge():

dfNew = df.merge(df2, left_index=True, right_index=True,
                 how='outer', suffixes=('', '_y'))
dfNew.drop(dfNew.filter(regex='_y$').columns.tolist(),axis=1, inplace=True)

Thanks @ijoseph


Answer 2

Building on @rprog’s answer, you can combine the various pieces of the suffix & filter step into one line using a negative regex:

dfNew = df.merge(df2, left_index=True, right_index=True,
             how='outer', suffixes=('', '_DROP')).filter(regex='^(?!.*_DROP)')

Or using df.join:

dfNew = df.join(df2, lsuffix="DROP").filter(regex="^(?!.*DROP)")

The regex here keeps any column whose name does not contain the "DROP" marker (the negative lookahead ^(?!.*DROP) matches anywhere in the name, not just at the end), so just make sure to use a suffix that doesn't already appear among the columns.


Answer 3

I'm fairly new to Pandas, but I wanted to achieve the same thing: automatically avoiding column names with _x or _y and removing duplicate data. I finally did it by using this answer and this one from Stack Overflow.

sales.csv

    city;state;units
    Mendocino;CA;1
    Denver;CO;4
    Austin;TX;2

revenue.csv

    branch_id;city;revenue;state_id
    10;Austin;100;TX
    20;Austin;83;TX
    30;Austin;4;TX
    47;Austin;200;TX
    20;Denver;83;CO
    30;Springfield;4;I

merge.py

import pandas

def drop_y(df):
    # list comprehension of the cols that end with '_y'
    to_drop = [x for x in df if x.endswith('_y')]
    df.drop(to_drop, axis=1, inplace=True)


sales = pandas.read_csv('data/sales.csv', delimiter=';')
revenue = pandas.read_csv('data/revenue.csv', delimiter=';')

result = pandas.merge(sales, revenue,  how='inner', left_on=['state'], right_on=['state_id'], suffixes=('', '_y'))
drop_y(result)
result.to_csv('results/output.csv', index=True, index_label='id', sep=';')

When executing the merge command, I replace the _x suffix with an empty string, and then I can remove the columns ending with _y.

output.csv

    id;city;state;units;branch_id;revenue;state_id
    0;Denver;CO;4;20;83;CO
    1;Austin;TX;2;10;100;TX
    2;Austin;TX;2;20;83;TX
    3;Austin;TX;2;30;4;TX
    4;Austin;TX;2;47;200;TX

Answer 4

This is a bit of a workaround, but I have written a function that deals with the extra columns:

import pandas as pd

def merge_fix_cols(df_company, df_product, uniqueID):
    df_merged = pd.merge(df_company,
                         df_product,
                         how='left', left_on=uniqueID, right_on=uniqueID)
    # drop the duplicated right-hand columns first...
    to_drop = [col for col in df_merged if col.endswith('_y')]
    df_merged.drop(to_drop, axis=1, inplace=True)
    # ...then strip the '_x' suffix; note that rstrip('_x') would remove a
    # *set* of characters rather than the suffix, so slice it off instead
    df_merged.columns = [col[:-2] if col.endswith('_x') else col
                         for col in df_merged.columns]
    return df_merged

Seems to work well with my merges!


Python Pandas: How to read only the first n rows of a CSV file?

Question: Python Pandas: How to read only the first n rows of a CSV file?

I have a very large data set and I can't afford to read the entire data set in. So, I'm thinking of reading only one chunk of it to train on, but I have no idea how to do it. Any thoughts will be appreciated.


Answer 0

If you only want to read the first 999,999 (non-header) rows:

read_csv(..., nrows=999999)

If you only want to read rows 1,000,000 … 1,999,999

read_csv(..., skiprows=1000000, nrows=999999)

nrows : int, default None
    Number of rows of file to read. Useful for reading pieces of large files.

skiprows : list-like or integer
    Row numbers to skip (0-indexed) or number of rows to skip (int) at the start of the file.

and for large files, you'll probably also want to use chunksize:

chunksize : int, default None
    Return TextFileReader object for iteration.

pandas.io.parsers.read_csv documentation
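
A minimal sketch of the chunksize idea (the file name and the process() function are placeholders, not from the answer above):

import pandas as pd

reader = pd.read_csv('big_file.csv', chunksize=100000)  # a TextFileReader
for chunk in reader:       # each chunk is a DataFrame of up to 100,000 rows
    process(chunk)         # e.g. train incrementally on each chunk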