Tag archive: dataframe

pd.read_excel() on multiple worksheets of the same workbook with Pandas

Question: pd.read_excel() on multiple worksheets of the same workbook with Pandas

I have a large spreadsheet file (.xlsx) that I’m processing using python pandas. It happens that I need data from two tabs in that large file. One of the tabs has a ton of data and the other is just a few square cells.

When I use pd.read_excel() on any worksheet, it looks to me like the whole file is loaded (not just the worksheet I’m interested in). So when I use the method twice (once for each sheet), I effectively have to suffer the whole workbook being read in twice (even though we’re only using the specified sheet).

Am I using it wrong or is it just limited in this way?

Thank you!


Answer 0

Try pd.ExcelFile:

xls = pd.ExcelFile('path_to_file.xls')
df1 = pd.read_excel(xls, 'Sheet1')
df2 = pd.read_excel(xls, 'Sheet2')

As noted by @HaPsantran, the entire Excel file is read in during the ExcelFile() call (there doesn’t appear to be a way around this). This merely saves you from having to read the same file in each time you want to access a new sheet.

Note that the sheet_name argument to pd.read_excel() can be the name of a sheet (as above), an integer specifying the sheet number (e.g. 0, 1, etc.), a list of sheet names or indices, or None. If a list is provided, it returns a dictionary where the keys are the sheet names/indices and the values are the data frames. The default is to simply return the first sheet (i.e., sheet_name=0).

If None is specified, all sheets are returned, as a {sheet_name:dataframe} dictionary.
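
For instance, a minimal sketch of the list and None forms (file and sheet names are placeholders):

import pandas as pd

# a list of names returns a {name: DataFrame} dictionary
sheets = pd.read_excel('path_to_file.xls', sheet_name=['Sheet1', 'Sheet2'])
df1, df2 = sheets['Sheet1'], sheets['Sheet2']

# None returns every sheet, keyed by sheet name
all_sheets = pd.read_excel('path_to_file.xls', sheet_name=None)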


Answer 1

There are a few options:

Read all sheets directly into an ordered dictionary.

import pandas as pd

# for pandas version >= 0.21.0
sheet_to_df_map = pd.read_excel(file_name, sheet_name=None)

# for pandas version < 0.21.0
sheet_to_df_map = pd.read_excel(file_name, sheetname=None)

Read the first sheet directly into dataframe

df = pd.read_excel('excel_file_path.xls')
# this will read the first sheet into df

Read the Excel file and get a list of sheets. Then choose and load the sheets.

xls = pd.ExcelFile('excel_file_path.xls')

# Now you can list all sheets in the file
xls.sheet_names
# ['house', 'house_extra', ...]

# to read just one sheet to dataframe (reusing the parsed ExcelFile):
df = pd.read_excel(xls, sheet_name="house")

Read all sheets and store them in a dictionary. Same as the first option, but more explicit.

# to read all sheets to a map
sheet_to_df_map = {}
for sheet_name in xls.sheet_names:
    sheet_to_df_map[sheet_name] = xls.parse(sheet_name)
    # you can also use sheet_index [0,1,2..] instead of sheet name.

Thanks to @ihightower for pointing out the way to read all sheets, and to @toto_tico for pointing out the version issue.

sheetname: string, int, mixed list of strings/ints, or None, default 0. Deprecated since version 0.21.0: use sheet_name instead. Source Link
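
As a quick illustration of the first option, the returned mapping can be walked like any dict (a sketch; the file name is a placeholder):

import pandas as pd

sheet_to_df_map = pd.read_excel('excel_file_path.xls', sheet_name=None)
for name, frame in sheet_to_df_map.items():
    print(name, frame.shape)  # each value is an ordinary DataFrame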


Answer 2

You can also use the index for the sheet:

xls = pd.ExcelFile('path_to_file.xls')
sheet1 = xls.parse(0)

will give the first worksheet. For the second worksheet:

sheet2 = xls.parse(1)

Answer 3

You could also specify the sheet name as a parameter:

data_file = pd.read_excel('path_to_file.xls', sheet_name="sheet_name")

will load only the sheet "sheet_name".


Answer 4

pd.read_excel('filename.xlsx') 

reads the first sheet of the workbook by default,

pd.read_excel('filename.xlsx', sheet_name = 'sheetname') 

reads the specified sheet of the workbook, and

pd.read_excel('filename.xlsx', sheet_name = None) 

reads all the worksheets from the Excel file into an OrderedDict of DataFrames, i.e. every worksheet becomes a DataFrame collected under its sheet name.
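
Building on that, the per-sheet frames can be stacked into a single frame (a sketch, not part of the original answer; the file name is a placeholder):

import pandas as pd

all_sheets = pd.read_excel('filename.xlsx', sheet_name=None)  # {sheet_name: DataFrame}
combined = pd.concat(all_sheets, names=['sheet', 'row'])      # MultiIndex of (sheet, row)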


Answer 5

Yes, unfortunately it will always load the full file. If you're doing this repeatedly, it's probably best to extract the sheets to separate CSVs and then load them separately. You can automate that process with d6tstack, which also adds features like checking whether all the columns are equal across all sheets or multiple Excel files.

import d6tstack
c = d6tstack.convert_xls.XLStoCSVMultiSheet('multisheet.xlsx')
c.convert_all() # ['multisheet-Sheet1.csv','multisheet-Sheet2.csv']

See d6tstack Excel examples


Answer 6

If you have saved the Excel file in the same folder as your Python program (relative path), then you just need to pass the sheet name along with the file name.

Example:

import pandas as pd
import matplotlib.pyplot as plt  # needed for the plotting calls below

data = pd.read_excel("wt_vs_ht.xlsx", "Sheet2")
print(data)
x = data.Height
y = data.Weight
plt.plot(x, y, 'x')
plt.show()

Compare two DataFrames and output their differences side-by-side

Question: Compare two DataFrames and output their differences side-by-side

I am trying to highlight exactly what changed between two dataframes.

Suppose I have two Python Pandas dataframes:

"StudentRoster Jan-1":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                Graduated
113  Zoe    4.12                     True       

"StudentRoster Jan-2":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                Graduated
113  Zoe    4.12                     False                On vacation

My goal is to output an HTML table that:

  1. Identifies rows that have changed (could be int, float, boolean, string)
  2. Outputs rows with same, OLD and NEW values (ideally into an HTML table) so the consumer can clearly see what changed between two dataframes:

    "StudentRoster Difference Jan-1 - Jan-2":  
    id   Name   score                    isEnrolled           Comment
    112  Nick   was 1.11| now 1.21       False                Graduated
    113  Zoe    4.12                     was True | now False was "" | now   "On   vacation"
    

I suppose I could do a row by row and column by column comparison, but is there an easier way?


Answer 0

The first part is similar to Constantine's: you can get a boolean of which rows differ*:

In [21]: ne = (df1 != df2).any(1)

In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool

Then we can see which entries have changed:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

Here the first index level is the row label and the second is the column that changed.

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation

* Note: it’s important that df1 and df2 share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels using df1.index & df2.index, but I think I’ll leave that as an exercise.
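
One possible version of that exercise (a sketch, not part of the answer): intersect the indexes first and compare only the shared rows.

common = df1.index.intersection(df2.index)
ne = (df1.loc[common] != df2.loc[common]).any(axis=1)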


Answer 1

Highlighting the difference between two DataFrames

It is possible to use the DataFrame style property to highlight the background color of the cells where there is a difference.

Using the example data from the original question

The first step is to concatenate the DataFrames horizontally with the concat function and distinguish each frame with the keys parameter:

df_all = pd.concat([df.set_index('id'), df2.set_index('id')], 
                   axis='columns', keys=['First', 'Second'])
df_all

It’s probably easier to swap the column levels and put the same column names next to each other:

df_final = df_all.swaplevel(axis='columns')[df.columns[1:]]
df_final

Now, it's much easier to spot the differences in the frames. But we can go further and use the style property to highlight the cells that are different. We define a custom function to do this, which you can see in this part of the documentation.

def highlight_diff(data, color='yellow'):
    attr = 'background-color: {}'.format(color)
    other = data.xs('First', axis='columns', level=-1)
    return pd.DataFrame(np.where(data.ne(other, level=0), attr, ''),
                        index=data.index, columns=data.columns)

df_final.style.apply(highlight_diff, axis=None)

This will highlight cells that both have missing values. You can either fill them or provide extra logic so that they don’t get highlighted.
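
One option for the extra logic (an assumption, not part of the answer) is to fill the missing values with a sentinel before styling, so that two NaNs compare as equal:

df_final.fillna('').style.apply(highlight_diff, axis=None)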


Answer 2

This answer simply extends @Andy Hayden’s, making it resilient to when numeric fields are nan, and wrapping it up into a function.

import pandas as pd
import numpy as np


def diff_pd(df1, df2):
    """Identify differences between two pandas DataFrames"""
    assert (df1.columns == df2.columns).all(), \
        "DataFrame column names are different"
    if any(df1.dtypes != df2.dtypes):
        # dtypes differ; try to convert df2's columns to match df1's
        df2 = df2.astype(df1.dtypes)
    if df1.equals(df2):
        return None
    else:
        # need to account for np.nan != np.nan returning True
        diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
        ne_stacked = diff_mask.stack()
        changed = ne_stacked[ne_stacked]
        changed.index.names = ['id', 'col']
        difference_locations = np.where(diff_mask)
        changed_from = df1.values[difference_locations]
        changed_to = df2.values[difference_locations]
        return pd.DataFrame({'from': changed_from, 'to': changed_to},
                            index=changed.index)

So with your data (slightly edited to have a NaN in the score column):

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)
df1 = pd.read_table(DF1, sep=r'\s+', index_col='id')
df2 = pd.read_table(DF2, sep=r'\s+', index_col='id')
diff_pd(df1, df2)

Output:

                from           to
id  col                          
112 score       1.11         1.21
113 isEnrolled  True        False
    Comment           On vacation
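
Since the original goal was an HTML table, the returned frame can be rendered directly with standard pandas (a sketch):

html = diff_pd(df1, df2).to_html()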

Answer 3

import pandas as pd
import io

texts = ['''\
id   Name   score                    isEnrolled                        Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                           Graduated
113  Zoe    4.12                     True       ''',

         '''\
id   Name   score                    isEnrolled                        Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                           Graduated
113  Zoe    4.12                     False                         On vacation''']


df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,21,20])
df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,21,20])
df = pd.concat([df1,df2]) 

print(df)
#     id  Name  score isEnrolled               Comment
# 0  111  Jack   2.17       True  He was late to class
# 1  112  Nick   1.11      False             Graduated
# 2  113   Zoe   4.12       True                   NaN
# 0  111  Jack   2.17       True  He was late to class
# 1  112  Nick   1.21      False             Graduated
# 2  113   Zoe   4.12      False           On vacation

df.set_index(['id', 'Name'], inplace=True)
print(df)
#           score isEnrolled               Comment
# id  Name                                        
# 111 Jack   2.17       True  He was late to class
# 112 Nick   1.11      False             Graduated
# 113 Zoe    4.12       True                   NaN
# 111 Jack   2.17       True  He was late to class
# 112 Nick   1.21      False             Graduated
# 113 Zoe    4.12      False           On vacation

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

changes = df.groupby(level=['id', 'Name']).agg(report_diff)
print(changes)

prints

                score    isEnrolled               Comment
id  Name                                                 
111 Jack         2.17          True  He was late to class
112 Nick  1.11 | 1.21         False             Graduated
113 Zoe          4.12  True | False     nan | On vacation

Answer 4

I have faced this issue, but found an answer before finding this post:

Based on unutbu’s answer, load your data…

import pandas as pd
import io

texts = ['''\
id   Name   score                    isEnrolled                       Date
111  Jack                            True              2013-05-01 12:00:00
112  Nick   1.11                     False             2013-05-12 15:05:23
     Zoe    4.12                     True                                  ''',

         '''\
id   Name   score                    isEnrolled                       Date
111  Jack   2.17                     True              2013-05-01 12:00:00
112  Nick   1.21                     False                                
     Zoe    4.12                     False             2013-05-01 12:00:00''']


df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,17,20], parse_dates=[4])
df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,17,20], parse_dates=[4])

…define your diff function…

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

Then you can simply use a Panel to conclude (note that Panel has since been removed from pandas; Answer 7 below adapts this approach to a MultiIndex):

my_panel = pd.Panel(dict(df1=df1,df2=df2))
print my_panel.apply(report_diff, axis=0)

#          id  Name        score    isEnrolled                       Date
#0        111  Jack   nan | 2.17          True        2013-05-01 12:00:00
#1        112  Nick  1.11 | 1.21         False  2013-05-12 15:05:23 | NaT
#2  nan | nan   Zoe         4.12  True | False  NaT | 2013-05-01 12:00:00

By the way, if you’re in IPython Notebook, you may like to use a colored diff function to give colors depending whether cells are different, equal or left/right null :

from IPython.display import HTML
pd.options.display.max_colwidth = 500  # You need this, otherwise pandas
#                          will limit your HTML strings to 50 characters

def report_diff(x):
    if x[0]==x[1]:
        return unicode(x[0].__str__())
    elif pd.isnull(x[0]) and pd.isnull(x[1]):
        return u'<table style="background-color:#00ff00;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', 'nan')
    elif pd.isnull(x[0]) and ~pd.isnull(x[1]):
        return u'<table style="background-color:#ffff00;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', x[1])
    elif ~pd.isnull(x[0]) and pd.isnull(x[1]):
        return u'<table style="background-color:#0000ff;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0],'nan')
    else:
        return u'<table style="background-color:#ff0000;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0], x[1])

HTML(my_panel.apply(report_diff, axis=0).to_html(escape=False))

Answer 5

If your two dataframes have the same ids in them, then finding out what changed is actually pretty easy. Just doing frame1 != frame2 will give you a boolean DataFrame where each True is data that has changed. From that, you could easily get the index of each changed row by doing changedids = frame1.index[np.any(frame1 != frame2,axis=1)].
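
Spelled out as a runnable fragment (a sketch; assumes numpy is imported and the frames share index and columns):

import numpy as np

changedids = frame1.index[np.any(frame1 != frame2, axis=1)]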


Answer 6

A different approach using concat and drop_duplicates:

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)

df1 = pd.read_table(DF1, sep=r'\s+', index_col='id')
df2 = pd.read_table(DF2, sep=r'\s+', index_col='id')
#%%
dictionary = {1: df1, 2: df2}
df = pd.concat(dictionary)
df.drop_duplicates(keep=False)

Output:

       Name  score isEnrolled      Comment
  id                                      
1 112  Nick   1.11      False    Graduated
  113   Zoe    NaN       True             
2 112  Nick   1.21      False    Graduated
  113   Zoe    NaN      False  On vacation
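
Note that the result interleaves the two versions of each changed row. Sorting on the id level pairs them up (an optional tweak, not part of the original answer):

df.drop_duplicates(keep=False).sort_index(level='id')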

Answer 7

After fiddling around with @journois's answer, I was able to get it to work using a MultiIndex instead of Panel, due to Panel's deprecation.

First, create some dummy data:

df1 = pd.DataFrame({
    'id': ['111', '222', '333', '444', '555'],
    'let': ['a', 'b', 'c', 'd', 'e'],
    'num': ['1', '2', '3', '4', '5']
})
df2 = pd.DataFrame({
    'id': ['111', '222', '333', '444', '666'],
    'let': ['a', 'b', 'c', 'D', 'f'],
    'num': ['1', '2', 'Three', '4', '6'],
})

Then, define your diff function; in this case I'll use report_diff from his answer, unchanged:

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

Then, I’m going to concatenate the data into a MultiIndex dataframe:

df_all = pd.concat(
    [df1.set_index('id'), df2.set_index('id')], 
    axis='columns', 
    keys=['df1', 'df2'],
    join='outer'
)
df_final = df_all.swaplevel(axis='columns')[df1.columns[1:]]

And finally I’m going to apply the report_diff down each column group:

df_final.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))

This outputs:

         let        num
111        a          1
222        b          2
333        c  3 | Three
444    d | D          4
555  e | nan    5 | nan
666  nan | f    nan | 6

And that is all!


Answer 8

Extending @cge's answer to make the result more readable:

a[a != b][np.any(a != b, axis=1)].join(
    pd.DataFrame('a<->b', index=a.index, columns=['a<=>b'])
).join(
    b[a != b][np.any(a != b, axis=1)],
    rsuffix='_b', how='outer'
).fillna('')

Full demonstration example:

import numpy as np, pandas as pd

a = pd.DataFrame(np.random.randn(7,3), columns=list('ABC'))
b = a.copy()
b.iloc[0,2] = np.nan
b.iloc[1,0] = 7
b.iloc[3,1] = 77
b.iloc[4,2] = 777

a[a != b][np.any(a != b, axis=1)].join(
    pd.DataFrame('a<->b', index=a.index, columns=['a<=>b'])
).join(
    b[a != b][np.any(a != b, axis=1)],
    rsuffix='_b', how='outer'
).fillna('')

Answer 9

Here is another way using select and merge:

In [6]: # first lets create some dummy dataframes with some column(s) different
   ...: df1 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': range(20,25)})
   ...: df2 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': [20] + list(range(101,105))})


In [7]: df1
Out[7]:
   a   b   c
0 -5  10  20
1 -4  11  21
2 -3  12  22
3 -2  13  23
4 -1  14  24


In [8]: df2
Out[8]:
   a   b    c
0 -5  10   20
1 -4  11  101
2 -3  12  102
3 -2  13  103
4 -1  14  104


In [10]: # make a condition over the columns you want to compare
    ...: condition = df1['c'] != df2['c']
    ...:
    ...: # select rows from each dataframe where the condition holds
    ...: diff1 = df1[condition]
    ...: diff2 = df2[condition]


In [11]: # merge the selected rows (dataframes) with some suffixes (optional)
    ...: diff1.merge(diff2, on=['a','b'], suffixes=('_before', '_after'))
Out[11]:
   a   b  c_before  c_after
0 -4  11        21      101
1 -3  12        22      102
2 -2  13        23      103
3 -1  14        24      104



Answer 10

pandas >= 1.1: DataFrame.compare

With pandas 1.1, you could essentially replicate Ted Petrou’s output with a single function call. Example taken from the docs:

pd.__version__
# '1.1.0'

df1.compare(df2)

  score       isEnrolled       Comment             
   self other       self other    self        other
1  1.11  1.21        NaN   NaN     NaN          NaN
2   NaN   NaN        1.0   0.0     NaN  On vacation

Here, “self” refers to the LHS dataFrame, while “other” is the RHS DataFrame. By default, equal values are replaced with NaNs so you can focus on just the diffs. If you want to show values that are equal as well, use

df1.compare(df2, keep_equal=True, keep_shape=True) 

  score       isEnrolled           Comment             
   self other       self  other       self        other
1  1.11  1.21      False  False  Graduated    Graduated
2  4.12  4.12       True  False        NaN  On vacation

You can also change the axis of comparison using align_axis:

df1.compare(df2, align_axis='index')

         score  isEnrolled      Comment
1 self    1.11         NaN          NaN
  other   1.21         NaN          NaN
2 self     NaN         1.0          NaN
  other    NaN         0.0  On vacation

This compares values row-wise, instead of column-wise.
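
Keep in mind that DataFrame.compare only accepts identically-labeled frames; if the row or column labels might differ, align them first (a sketch):

left, right = df1.align(df2, join='inner')
diff = left.compare(right)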


Answer 11

A function that finds the asymmetric difference between two data frames is implemented below (based on set difference for pandas). Gist: https://gist.github.com/oneryalcin/68cf25f536a25e65f0b3c84f9c118e03

def diff_df(df1, df2, how="left"):
    """
      Find Difference of rows for given two dataframes
      this function is not symmetric, means
            diff(x, y) != diff(y, x)
      however
            diff(x, y, how='left') == diff(y, x, how='right')

      Ref: https://stackoverflow.com/questions/18180763/set-difference-for-pandas/40209800#40209800
    """
    if (df1.columns != df2.columns).any():
        raise ValueError("Two dataframe columns must match")

    if df1.equals(df2):
        return None
    elif how == 'right':
        return pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
    elif how == 'left':
        return pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
    else:
        raise ValueError('how parameter supports only "left" or "right" keywords')

Example:

df1 = pd.DataFrame(d1)
Out[1]: 
                Comment  Name  isEnrolled  score
0  He was late to class  Jack        True   2.17
1             Graduated  Nick       False   1.11
2                         Zoe        True   4.12


df2 = pd.DataFrame(d2)

Out[2]: 
                Comment  Name  isEnrolled  score
0  He was late to class  Jack        True   2.17
1           On vacation   Zoe        True   4.12

diff_df(df1, df2)
Out[3]: 
     Comment  Name  isEnrolled  score
1  Graduated  Nick       False   1.11
2              Zoe        True   4.12

diff_df(df2, df1)
Out[4]: 
       Comment Name  isEnrolled  score
1  On vacation  Zoe        True   4.12

# This gives the same result as above
diff_df(df1, df2, how='right')
Out[22]: 
       Comment Name  isEnrolled  score
1  On vacation  Zoe        True   4.12
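
If a symmetric difference is needed (rows that changed on either side), the two directions can be concatenated (an addition, not part of the gist; it assumes the frames actually differ, since diff_df returns None for identical frames):

sym_diff = pd.concat([diff_df(df1, df2), diff_df(df2, df1)])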

Answer 12

import pandas as pd
import numpy as np

df = pd.read_excel('D:\\HARISH\\DATA SCIENCE\\1 MY Training\\SAMPLE DATA & projs\\CRICKET DATA\\IPL PLAYER LIST\\IPL PLAYER LIST _ harish.xlsx')


df1 = srh = df[df['TEAM'].str.contains("SRH")]
df2 = csk = df[df['TEAM'].str.contains("CSK")]

srh = srh.iloc[:,0:2]
csk = csk.iloc[:,0:2]

csk = csk.reset_index(drop=True)
csk

srh = srh.reset_index(drop=True)
srh

new = pd.concat([srh, csk], axis=1)

new.head()
               PLAYER     TYPE  ...           PLAYER         TYPE
0        David Warner  Batsman  ...         MS Dhoni      Captain
1  Bhuvaneshwar Kumar   Bowler  ...  Ravindra Jadeja  All-Rounder
2       Manish Pandey  Batsman  ...     Suresh Raina  All-Rounder
3   Rashid Khan Arman   Bowler  ...     Kedar Jadhav  All-Rounder
4      Shikhar Dhawan  Batsman  ...     Dwayne Bravo  All-Rounder

Merge two dataframes by index

Question: Merge two dataframes by index

I have the following dataframes:

> df1
  id begin conditional confidence discoveryTechnique  
0 278    56       false        0.0                  1   
1 421    18       false        0.0                  1 

> df2
   concept 
0  A  
1  B
   

How do I merge on the indices to get:

  id begin conditional confidence discoveryTechnique   concept 
0 278    56       false        0.0                  1  A 
1 421    18       false        0.0                  1  B

I ask because it is my understanding that merge(), i.e. df1.merge(df2), uses columns to do the matching. In fact, doing this I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4618, in merge
    copy=copy, indicator=indicator)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 58, in merge
    copy=copy, indicator=indicator)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 491, in __init__
    self._validate_specification()
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 812, in _validate_specification
    raise MergeError('No common columns to perform merge on')
pandas.tools.merge.MergeError: No common columns to perform merge on

Is it bad practice to merge on index? Is it impossible? If so, how can I shift the index into a new column called “index”?


Answer 0

Use merge, which is an inner join by default:

pd.merge(df1, df2, left_index=True, right_index=True)

Or join, which is a left join by default:

df1.join(df2)

Or concat, which is an outer join by default:

pd.concat([df1, df2], axis=1)

Samples:

df1 = pd.DataFrame({'a':range(6),
                    'b':[5,3,6,9,2,4]}, index=list('abcdef'))

print (df1)
   a  b
a  0  5
b  1  3
c  2  6
d  3  9
e  4  2
f  5  4

df2 = pd.DataFrame({'c':range(4),
                    'd':[10,20,30, 40]}, index=list('abhi'))

print (df2)
   c   d
a  0  10
b  1  20
h  2  30
i  3  40

#default inner join
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
print (df3)
   a  b  c   d
a  0  5  0  10
b  1  3  1  20

#default left join
df4 = df1.join(df2)
print (df4)
   a  b    c     d
a  0  5  0.0  10.0
b  1  3  1.0  20.0
c  2  6  NaN   NaN
d  3  9  NaN   NaN
e  4  2  NaN   NaN
f  5  4  NaN   NaN

#default outer join
df5 = pd.concat([df1, df2], axis=1)
print (df5)
     a    b    c     d
a  0.0  5.0  0.0  10.0
b  1.0  3.0  1.0  20.0
c  2.0  6.0  NaN   NaN
d  3.0  9.0  NaN   NaN
e  4.0  2.0  NaN   NaN
f  5.0  4.0  NaN   NaN
h  NaN  NaN  2.0  30.0
i  NaN  NaN  3.0  40.0
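
Note that merge can reproduce the other join types as well through its how parameter, so the outer result above can also be obtained with (a sketch):

pd.merge(df1, df2, left_index=True, right_index=True, how='outer')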

Answer 1

You can use concat([df1, df2, …], axis=1) in order to concatenate two or more DataFrames aligned by their indexes:

pd.concat([df1, df2, df3, ...], axis=1)

or merge for concatenating by custom fields / indexes:

# join by _common_ columns: `col1`, `col3`
pd.merge(df1, df2, on=['col1','col3'])

# join by: `df1.col1 == df2.index`
pd.merge(df1, df2, left_on='col1', right_index=True)

or join for joining by index:

 df1.join(df2)

Answer 2

By default:
join is a column-wise left join
pd.merge is a column-wise inner join
pd.concat is a row-wise outer join

pd.concat:
takes Iterable arguments. Thus, it cannot take DataFrames directly (use [df,df2])
Dimensions of DataFrame should match along axis

join and pd.merge:
can take DataFrame arguments


Answer 3

A silly bug that got me: the joins failed because index dtypes differed. This was not obvious as both tables were pivot tables of the same original table. After reset_index, the indices looked identical in Jupyter. It only came to light when saving to Excel…

Fixed with: df1[['key']] = df1[['key']].apply(pd.to_numeric)

Hopefully this saves somebody an hour!
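
A quick pre-join sanity check along those lines (a sketch):

print(df1.index.dtype, df2.index.dtype)  # mismatched index dtypes make the join match nothing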


Answer 4

If you want to join two dataframes in pandas you can simply use the available methods like merge or concat. For example, if I have two dataframes df1 and df2, I can join them by:

newdataframe = pd.merge(df1, df2, left_index=True, right_index=True)

Filter a pandas DataFrame by date

Question: Filter a pandas DataFrame by date

I have a Pandas DataFrame with a ‘date’ column. Now I need to filter out all rows in the DataFrame that have dates outside of the next two months. Essentially, I only need to retain the rows that are within the next two months.

What is the best way to achieve this?


Answer 0

If the date column is the index, then use .loc for label-based indexing or .iloc for positional indexing.

For example:

df.loc['2014-01-01':'2014-02-01']

See details here http://pandas.pydata.org/pandas-docs/stable/dsintro.html#indexing-selection

If the column is not the index you have two choices:

  1. Make it the index (either temporarily or permanently if it’s time-series data)
  2. df[(df['date'] > '2013-01-01') & (df['date'] < '2013-02-01')]

See here for the general explanation

Note: .ix is deprecated.
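
Applied to the original question (keep only rows within the next two months), option 2 might look like this (a sketch; it assumes the column is named 'date' and already has a datetime dtype):

import pandas as pd

today = pd.Timestamp.today().normalize()
mask = (df['date'] >= today) & (df['date'] < today + pd.DateOffset(months=2))
df = df[mask]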


Answer 1

The previous answer is not correct in my experience; you can't pass it a simple string, it needs to be a datetime object. So:

import datetime 
df.loc[datetime.date(year=2014,month=1,day=1):datetime.date(year=2014,month=2,day=1)]

Answer 2

And if your dates are standardized via the datetime package, you can simply use:

df[(df['date']>datetime.date(2016,1,1)) & (df['date']<datetime.date(2016,3,1))]  

For standardizing your date strings with the datetime package, you can use this function:

import datetime
# e.g. parse a date string into a datetime object (format string shown as an example)
datetime.datetime.strptime('2016-01-01', '%Y-%m-%d')

Answer 3

If your datetime column has the Pandas datetime type (e.g. datetime64[ns]), you need a pd.Timestamp object for proper filtering, for example:

from datetime import date

import pandas as pd

value_to_check = pd.Timestamp(date.today().year, 1, 1)
filter_mask = df['date_column'] < value_to_check
filtered_df = df[filter_mask]

Answer 4

If the dates are in the index then simply:

df['20160101':'20160301']

Answer 5

You can use pd.Timestamp to perform a query and a local reference

import pandas as pd
import numpy as np
from datetime import datetime  # needed for datetime.now() below

df = pd.DataFrame()
ts = pd.Timestamp

df['date'] = np.array(np.arange(10) + datetime.now().timestamp(), dtype='M8[s]')

print(df)
print(df.query('date > @ts("20190515T071320")'))

with the output

                 date
0 2019-05-15 07:13:16
1 2019-05-15 07:13:17
2 2019-05-15 07:13:18
3 2019-05-15 07:13:19
4 2019-05-15 07:13:20
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25


                 date
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25

Have a look at the pandas documentation for DataFrame.query, specifically the mention of local variables referenced using the @ prefix. In this case we reference pd.Timestamp via the local alias ts to be able to supply a timestamp string.


Answer 6

So when loading the csv data file, we need to set the date column as the index, as shown below, in order to filter the data by a range of dates. This was not needed for the now-deprecated pd.DataFrame.from_csv().

If you just want to show the data for two months from Jan to Feb, e.g. 2020-01-01 to 2020-02-29, you can do so:

import pandas as pd
mydata = pd.read_csv('mydata.csv', index_col='date', parse_dates=True)  # or its index number, e.g. index_col=[0]; parse_dates gives a DatetimeIndex
mydata['2020-01-01':'2020-02-29']  # will pull all the columns
# if just one column is needed, e.g. Cost, use .loc:
mydata.loc['2020-01-01':'2020-02-29', 'Cost']

This has been tested working for Python 3.7. Hope you will find this useful.


Answer 7

How about using pyjanitor?

It has cool features.

After pip install pyjanitor

import janitor

df_filtered = df.filter_date(your_date_column_name, start_date, end_date)

回答 8

按日期过滤数据框的最简短方法:假设您的日期列是 datetime64[ns] 类型。

# filter by single day
df = df[df['date'].dt.strftime('%Y-%m-%d') == '2014-01-01']

# filter by single month
df = df[df['date'].dt.strftime('%Y-%m') == '2014-01']

# filter by single year
df = df[df['date'].dt.strftime('%Y') == '2014']

The shortest way to filter your dataframe by date: let's suppose your date column is of type datetime64[ns].

# filter by single day
df = df[df['date'].dt.strftime('%Y-%m-%d') == '2014-01-01']

# filter by single month
df = df[df['date'].dt.strftime('%Y-%m') == '2014-01']

# filter by single year
df = df[df['date'].dt.strftime('%Y') == '2014']
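
A performance note (not from the original answer): strftime converts every value to a string, which can be slow on large frames. Comparing the .dt components directly avoids the string round-trip; a minimal sketch, assuming the same datetime64[ns] 'date' column:

# filter by single year, without string formatting
df = df[df['date'].dt.year == 2014]

# filter by a single month of a given year
df = df[(df['date'].dt.year == 2014) & (df['date'].dt.month == 1)]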

回答 9

我尚未被允许发表任何评论,所以如果有人可以阅读所有评论并达到目的,我将写一个答案。

如果数据集的索引是日期时间,并且您只想按(例如)个月筛选,则可以执行以下操作:

df.loc[df.index.month == 3]

这将为您筛选出数据集中三月份的数据。

I’m not allowed to write any comments yet, so I’ll write an answer, if somebody will read all of them and reach this one.

If the index of the dataset is a datetime and you want to filter that just by (for example) months, you can do following:

df.loc[df.index.month == 3]

That will filter the dataset for you by March.
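
The same pattern composes with other DatetimeIndex attributes; a small sketch (remember to parenthesise each condition, as discussed in the boolean-indexing question later in this archive):

# March of a specific year only
df.loc[(df.index.year == 2014) & (df.index.month == 3)]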


回答 10

您可以通过执行以下操作来选择时间范围:df.loc['start_date':'end_date']

You could just select the time range by doing: df.loc['start_date':'end_date']


回答 11

如果您已经使用pd.to_datetime将字符串转换为日期格式,则可以使用:

df = df[(df['Date'] > "2018-01-01") & (df['Date'] < "2019-07-01")]

If you have already converted the string to a date format using pd.to_datetime you can just use:

df = df[(df['Date'] > "2018-01-01") & (df['Date'] < "2019-07-01")]
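
For completeness, a sketch of the conversion step this answer presupposes, plus an equivalent using Series.between (hedged: between is inclusive on both ends by default, unlike the strict comparisons above):

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
df = df[df['Date'].between("2018-01-01", "2019-07-01")]  # inclusive bounds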


将Pandas Multi-Index转换为列

问题:将Pandas Multi-Index转换为列

我有一个具有2个索引级别的数据框:

                         value
Trial    measurement
    1              0        13
                   1         3
                   2         4
    2              0       NaN
                   1        12
    3              0        34 

我想变成这样:

Trial    measurement       value

    1              0        13
    1              1         3
    1              2         4
    2              0       NaN
    2              1        12
    3              0        34 

我怎样才能最好地做到这一点?

我需要这样做是因为我想按照此处的指示汇总数据,但是如果将它们用作索引,则无法选择这样的列。

I have a dataframe with 2 index levels:

                         value
Trial    measurement
    1              0        13
                   1         3
                   2         4
    2              0       NaN
                   1        12
    3              0        34 

Which I want to turn into this:

Trial    measurement       value

    1              0        13
    1              1         3
    1              2         4
    2              0       NaN
    2              1        12
    3              0        34 

How can I best do this?

I need this because I want to aggregate the data as instructed here, but I can’t select my columns like that if they are in use as indices.


回答 0

reset_index() 是 pandas DataFrame 的一个方法,它会把索引值转移到 DataFrame 中作为列。该参数的默认设置为 drop=False(将索引值保留为列)。

您只需在 DataFrame 名称后添加 .reset_index(inplace=True):

df.reset_index(inplace=True)  

The reset_index() is a pandas DataFrame method that will transfer index values into the DataFrame as columns. The default setting for the parameter is drop=False (which will keep the index values as columns).

All you have to do add .reset_index(inplace=True) after the name of the DataFrame:

df.reset_index(inplace=True)  
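
A minimal, self-contained sketch that reconstructs the frame from the question and flattens it (values copied from the question):

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (3, 0)],
    names=['Trial', 'measurement'])
df = pd.DataFrame({'value': [13, 3, 4, np.nan, 12, 34]}, index=idx)

df.reset_index(inplace=True)  # 'Trial' and 'measurement' become regular columns
print(df)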

回答 1

这并不真的适用于您的情况,但可能有助于其他人(例如 5 分钟前的我)了解。如果多重索引的各级具有相同的名称,如下所示:

                         value
Trial        Trial
    1              0        13
                   1         3
                   2         4
    2              0       NaN
                   1        12
    3              0        34 

df.reset_index(inplace=True) 将会失败,因为创建的列不能具有相同的名称。

因此,您需要用 df.index = df.index.set_names(['Trial', 'measurement']) 重命名多重索引,得到:

                           value
Trial    measurement       

    1              0        13
    1              1         3
    1              2         4
    2              0       NaN
    2              1        12
    3              0        34 

然后df.reset_index(inplace=True)将像魅力一样工作。

我是在按年和月对一个名为 live_date 的 datetime 列(不是索引)进行分组之后遇到这个问题的,这意味着年和月都被命名为 live_date。

This doesn’t really apply to your case but could be helpful for others (like myself 5 minutes ago) to know. If one’s multiindex levels have the same name like this:

                         value
Trial        Trial
    1              0        13
                   1         3
                   2         4
    2              0       NaN
                   1        12
    3              0        34 

df.reset_index(inplace=True) will fail, because the columns that are created cannot have the same names.

So then you need to rename the multiindex with df.index = df.index.set_names(['Trial', 'measurement']) to get:

                           value
Trial    measurement       

    1              0        13
    1              1         3
    1              2         4
    2              0       NaN
    2              1        12
    3              0        34 

And then df.reset_index(inplace=True) will work like a charm.

I encountered this problem after grouping by year and month on a datetime column (not index) called live_date, which meant that both year and month were named live_date.


回答 2

正如 @cs95 在评论中提到的,要仅重置某一个级别,请使用:

df.reset_index(level=[...])

这样可以避免在重置后必须重新定义所需的索引。

As @cs95 mentioned in a comment, to drop only one level, use:

df.reset_index(level=[...])

This avoids having to redefine your desired index after reset.
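
For example, against the frame from the question, a small sketch that moves only the inner 'measurement' level into a column while keeping 'Trial' as the index:

df = df.reset_index(level=['measurement'])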


如何通过键访问 pandas groupby 对象中的分组数据

问题:如何通过键访问 pandas groupby 对象中的分组数据

如何通过键访问 groupby 对象中对应的分组数据帧?

通过以下groupby:

import numpy as np
import pandas as pd

rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
                   'B': rand.randn(6),
                   'C': rand.randint(0, 20, 6)})
gb = df.groupby(['A'])

我可以遍历它来获取密钥和组:

In [11]: for k, gp in gb:
             print('key=' + str(k))
             print(gp)
key=bar
     A         B   C
1  bar -0.611756  18
3  bar -1.072969  10
5  bar -2.301539  18
key=foo
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

我希望能够通过其键访问组:

In [12]: gb['foo']
Out[12]:  
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

但是当我尝试 gb[('foo',)] 时,我得到的是一个奇怪的 pandas.core.groupby.DataFrameGroupBy 对象,它似乎没有任何对应于我想要的 DataFrame 的方法。

我能想到的最好的是:

In [13]: def gb_df_key(gb, key, orig_df):
             ix = gb.indices[key]
             return orig_df.iloc[ix]

         gb_df_key(gb, 'foo', df)
Out[13]:
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14  

但是,考虑到这些事情上熊猫通常是多么好,这有点令人讨厌。
这样做的内置方式是什么?

How do I access the corresponding groupby dataframe in a groupby object by the key?

With the following groupby:

import numpy as np
import pandas as pd

rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
                   'B': rand.randn(6),
                   'C': rand.randint(0, 20, 6)})
gb = df.groupby(['A'])

I can iterate through it to get the keys and groups:

In [11]: for k, gp in gb:
             print('key=' + str(k))
             print(gp)
key=bar
     A         B   C
1  bar -0.611756  18
3  bar -1.072969  10
5  bar -2.301539  18
key=foo
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

I would like to be able to access a group by its key:

In [12]: gb['foo']
Out[12]:  
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

But when I try doing that with gb[('foo',)] I get this weird pandas.core.groupby.DataFrameGroupBy object thing which doesn’t seem to have any methods that correspond to the DataFrame I want.

The best I could think of is:

In [13]: def gb_df_key(gb, key, orig_df):
             ix = gb.indices[key]
             return orig_df.iloc[ix]

         gb_df_key(gb, 'foo', df)
Out[13]:
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14  

but this is kind of nasty, considering how nice pandas usually is at these things.
What’s the built-in way of doing this?


回答 0

您可以使用 get_group 方法:

In [21]: gb.get_group('foo')
Out[21]: 
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

注意:这不需要为每个组创建中间字典或每个子数据帧的副本,因此与用 dict(iter(gb)) 创建朴素字典相比,内存效率要高得多。这是因为它使用了 groupby 对象中已有的数据结构。


您可以使用groupby切片选择不同的列:

In [22]: gb[["A", "B"]].get_group("foo")
Out[22]:
     A         B
0  foo  1.624345
2  foo -0.528172
4  foo  0.865408

In [23]: gb["C"].get_group("foo")
Out[23]:
0     5
2    11
4    14
Name: C, dtype: int64

You can use the get_group method:

In [21]: gb.get_group('foo')
Out[21]: 
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

Note: This doesn’t require creating an intermediary dictionary / copy of every subdataframe for every group, so will be much more memory-efficient than creating the naive dictionary with dict(iter(gb)). This is because it uses data-structures already available in the groupby object.


You can select different columns using the groupby slicing:

In [22]: gb[["A", "B"]].get_group("foo")
Out[22]:
     A         B
0  foo  1.624345
2  foo -0.528172
4  foo  0.865408

In [23]: gb["C"].get_group("foo")
Out[23]:
0     5
2    11
4    14
Name: C, dtype: int64
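
When grouping by more than one column, get_group expects the key as a tuple; a sketch against the question's df (the ('foo', 5) key matches row 0 in the output above):

gb2 = df.groupby(['A', 'C'])
gb2.get_group(('foo', 5))   # rows where A == 'foo' and C == 5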

回答 1

Wes McKinney(熊猫的作者)在《Python for Data Analysis》一书中提供了以下配方:

groups = dict(list(gb))

它返回一个字典,其键是您的组标签,其值是DataFrames,即

groups['foo']

将产生您想要的东西:

     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

Wes McKinney (pandas’ author) in Python for Data Analysis provides the following recipe:

groups = dict(list(gb))

which returns a dictionary whose keys are your group labels and whose values are DataFrames, i.e.

groups['foo']

will yield what you are looking for:

     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

回答 2

而不是

gb.get_group('foo')

我更喜欢使用 gb.groups

df.loc[gb.groups['foo']]

因为这样您也可以选择多个列。例如:

df.loc[gb.groups['foo'],('A','B')]

Rather than

gb.get_group('foo')

I prefer using gb.groups

df.loc[gb.groups['foo']]

Because in this way you can choose multiple columns as well. for example:

df.loc[gb.groups['foo'],('A','B')]

回答 3

gb = df.groupby(['A'])

gb_groups = gb.groups

如果只想查看部分分组,请先执行 gb_groups.keys(),然后把所需的键放进下面的 key_list 中。

gb_groups.keys()

key_list = [key1, key2, key3]  # 所需的键(占位符)

for key, values in gb_groups.items():
    if key in key_list:
        print(df.loc[values], "\n")
gb = df.groupby(['A'])

gb_groups = gb.groups

If you are looking for selective groupby objects then do gb_groups.keys(), and put the desired keys into the following key_list.

gb_groups.keys()

key_list = [key1, key2, key3]  # the keys you want (placeholders)

for key, values in gb_groups.items():
    if key in key_list:
        print(df.loc[values], "\n")

回答 4

我正在寻找对GroupBy obj的几个成员进行抽样的方法-必须解决发布的问题才能完成此任务。

创建分组对象

grouped = df.groupby('some_key')

选择N个数据框并获取其索引

import random

sampled_df_i = random.sample(list(grouped.indices), N)

获取这些分组

df_list  = map(lambda df_i: grouped.get_group(df_i), sampled_df_i)

可选-将所有内容重新转换为单个dataframe对象

sampled_df = pd.concat(df_list, axis=0, join='outer')

I was looking for a way to sample a few members of the GroupBy obj – had to address the posted question to get this done.

create groupby object

grouped = df.groupby('some_key')

pick N dataframes and grab their indices

import random

sampled_df_i = random.sample(list(grouped.indices), N)

grab the groups

df_list  = map(lambda df_i: grouped.get_group(df_i), sampled_df_i)

optionally – turn it all back into a single dataframe object

sampled_df = pd.concat(df_list, axis=0, join='outer')

熊猫:对 DataFrame 的给定列按行求和

问题:熊猫:对 DataFrame 的给定列按行求和

我有以下DataFrame:

In [1]:

import pandas as pd
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df
Out [1]:
   a  b   c  d
0  1  2  dd  5
1  2  3  ee  9
2  3  4  ff  1

我想增加一列 'e',它是列 'a'、'b' 和 'd' 的总和。

浏览各个论坛后,我以为这样会起作用:

df['e'] = df[['a','b','d']].map(sum)

但事实并非如此。

我想知道以列名列表 ['a','b','d'] 和 df 作为输入时,适当的操作是什么。

I have the following DataFrame:

In [1]:

import pandas as pd
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df
Out [1]:
   a  b   c  d
0  1  2  dd  5
1  2  3  ee  9
2  3  4  ff  1

I would like to add a column 'e' which is the sum of column 'a', 'b' and 'd'.

Going across forums, I thought something like this would work:

df['e'] = df[['a','b','d']].map(sum)

But it didn’t.

I would like to know the appropriate operation with the list of columns ['a','b','d'] and df as inputs.


回答 0

您可以直接调用 sum 并设置参数 axis=1 来对行求和,这将忽略任何非数字列:

In [91]:

df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
   a  b   c  d   e
0  1  2  dd  5   8
1  2  3  ee  9  14
2  3  4  ff  1   8

如果您只想汇总特定的列,则可以创建列的列表并删除不感兴趣的列:

In [98]:

col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:

df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
   a  b   c  d  e
0  1  2  dd  5  3
1  2  3  ee  9  5
2  3  4  ff  1  7

You can just sum and set param axis=1 to sum the rows, this will ignore non-numeric columns:

In [91]:

df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
   a  b   c  d   e
0  1  2  dd  5   8
1  2  3  ee  9  14
2  3  4  ff  1   8

If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:

In [98]:

col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:

df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
   a  b   c  d  e
0  1  2  dd  5  3
1  2  3  ee  9  5
2  3  4  ff  1  7
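
One caveat worth adding (not from the original answer): on recent pandas releases, summing across rows that mix numbers and strings can raise a TypeError instead of silently skipping the strings; passing numeric_only=True keeps the skip-non-numeric behaviour shown above:

df['e'] = df.sum(axis=1, numeric_only=True)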

回答 1

如果您只需要汇总几列,则可以编写:

df['e'] = df['a'] + df['b'] + df['d']

这将创建e具有以下值的新列:

   a  b   c  d   e
0  1  2  dd  5   8
1  2  3  ee  9  14
2  3  4  ff  1   8

对于较长的列列表,首选EdChum的答案。

If you have just a few columns to sum, you can write:

df['e'] = df['a'] + df['b'] + df['d']

This creates new column e with the values:

   a  b   c  d   e
0  1  2  dd  5   8
1  2  3  ee  9  14
2  3  4  ff  1   8

For longer lists of columns, EdChum’s answer is preferred.


回答 2

创建要添加的列名列表。

df['total']=df.loc[:,list_name].sum(axis=1)

如果只需要某些行的总和,请在 ':' 的位置用切片指定这些行。

Create a list of column names you want to add up.

df['total']=df.loc[:,list_name].sum(axis=1)

If you want the sum for certain rows, specify the rows using a slice in place of ':'


回答 3

这是使用iloc选择要累加的列的更简单方法:

df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)

生成:

   a  b   c  d   e  f  g   h
0  1  2  dd  5   8  3  3   6
1  2  3  ee  9  14  5  5  11
2  3  4  ff  1   8  7  7   4

我找不到一种将范围和特定列结合起来的方法,例如:

df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)

This is a simpler way using iloc to select which columns to sum:

df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)

Produces:

   a  b   c  d   e  f  g   h
0  1  2  dd  5   8  3  3   6
1  2  3  ee  9  14  5  5  11
2  3  4  ff  1   8  7  7   4

I can’t find a way to combine a range and specific columns that works e.g. something like:

df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
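
A way to combine a range with specific columns that does work (a sketch using NumPy's np.r_ index helper, which concatenates slices and scalars into one integer index array; this is not part of the original answer):

import numpy as np

df['i'] = df.iloc[:, np.r_[0:2, 3]].sum(axis=1)  # columns 0, 1 and 3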

回答 4

当列按顺序连续排列时,以下语法对我有帮助:

awards_frame.values[:,1:4].sum(axis=1)

The following syntax helped me when I have columns in sequence:

awards_frame.values[:,1:4].sum(axis=1)

回答 5

您只需将数据框传递给以下函数即可

def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame

范例

我有一个数据框(awards_frame)如下:

(原帖此处为 awards_frame 数据框的截图)

…并且我想创建一个新列,显示每一行的奖励总和

用法

我只是通过我的awards_frame进入功能,同时指定名称的新列的,和列表将被归纳列名:

sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])

结果

(原帖此处为新增 award_sum 列后的结果截图)

You can simply pass your dataframe into the following function:

def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame

Example:

I have a dataframe (awards_frame) as follows:

(screenshot of the awards_frame dataframe in the original post)

…and I want to create a new column that shows the sum of awards for each row:

Usage:

I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:

sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])

Result:

(screenshot of the resulting frame in the original post)


回答 6

这里最简短、最简单的方法是使用

    df = df.eval('e = a + b + d')

The shortest and simplest way here is to use

    df = df.eval('e = a + b + d')
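
Alternatively, eval can modify the frame in place; the assignment above is needed because inplace defaults to False:

    df.eval('e = a + b + d', inplace=True)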

python dataframe pandas 使用 int 删除列

问题:python dataframe pandas 使用 int 删除列

我知道要删除列,您可以使用 df.drop('column name', axis=1)。有没有一种方法可以使用数字索引而不是列名来删除列?

I understand that to drop a column you use df.drop('column name', axis=1). Is there a way to drop a column using a numerical index instead of the column name?


回答 0

您可以像这样删除索引 i 处的列:

df.drop(df.columns[i], axis=1)

如果列中有重复的名称,结果可能会很奇怪;此时您可以先把要删除的列重命名为一个新名称。或者,您可以像这样重新赋值 DataFrame:

df = df.iloc[:, [j for j, c in enumerate(df.columns) if j != i]]

You can delete column on i index like this:

df.drop(df.columns[i], axis=1)

It could work strange, if you have duplicate names in columns, so to do this you can rename column you want to delete column by new name. Or you can reassign DataFrame like this:

df = df.iloc[:, [j for j, c in enumerate(df.columns) if j != i]]

回答 1

像这样删除多列:

cols = [1,2,4,5,12]
df.drop(df.columns[cols],axis=1,inplace=True)

inplace=True 用于直接在数据框本身中进行更改,而不是在数据框的副本上删除列。如果您需要保持原数据框不变,请使用:

df_after_dropping = df.drop(df.columns[cols],axis=1)

Drop multiple columns like this:

cols = [1,2,4,5,12]
df.drop(df.columns[cols],axis=1,inplace=True)

inplace=True is used to make the changes in the dataframe itself without doing the column dropping on a copy of the data frame. If you need to keep your original intact, use:

df_after_dropping = df.drop(df.columns[cols],axis=1)
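
Equivalently, drop also accepts a columns= keyword (available since pandas 0.21), which reads a little more explicitly; a small sketch reusing the cols list from above:

cols = [1,2,4,5,12]
df = df.drop(columns=df.columns[cols])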

回答 2

如果存在多个同名的列,那么到目前为止给出的解决方案会把这些同名列全部删除,而这可能并不是想要的结果。如果想删除重复列、只保留其中一个实例,就可能遇到这种情况。下面的示例说明了这种情况:

# make a df with duplicate columns 'x'
df = pd.DataFrame({'x': range(5) , 'x':range(5), 'y':range(6, 11)}, columns = ['x', 'x', 'y']) 


df
Out[495]: 
   x  x   y
0  0  0   6
1  1  1   7
2  2  2   8
3  3  3   9
4  4  4  10

# attempting to drop the first column according to the solution offered so far     
df.drop(df.columns[0], axis = 1) 
   y
0  6
1  7
2  8
3  9
4  10

如您所见,两个 x 列均被删除。替代解决方案:

column_numbers = [x for x in range(df.shape[1])]  # list of columns' integer indices

column_numbers.remove(0) #removing column integer index 0
df.iloc[:, column_numbers] #return all columns except the 0th column

   x  y
0  0  6
1  1  7
2  2  8
3  3  9
4  4  10

如您所见,这确实只删除了第 0 列(第一个 'x')。

If there are multiple columns with identical names, the solutions given here so far will remove all of the columns, which may not be what one is looking for. This may be the case if one is trying to remove duplicate columns except one instance. The example below clarifies this situation:

# make a df with duplicate columns 'x'
df = pd.DataFrame({'x': range(5) , 'x':range(5), 'y':range(6, 11)}, columns = ['x', 'x', 'y']) 


df
Out[495]: 
   x  x   y
0  0  0   6
1  1  1   7
2  2  2   8
3  3  3   9
4  4  4  10

# attempting to drop the first column according to the solution offered so far     
df.drop(df.columns[0], axis = 1) 
   y
0  6
1  7
2  8
3  9
4  10

As you can see, both Xs columns were dropped. Alternative solution:

column_numbers = [x for x in range(df.shape[1])]  # list of columns' integer indices

column_numbers.remove(0) #removing column integer index 0
df.iloc[:, column_numbers] #return all columns except the 0th column

   x  y
0  0  6
1  1  7
2  2  8
3  3  9
4  4  10

As you can see, this truly removed only the 0th column (first ‘x’).


回答 3

您需要根据列在数据框中的位置来标识它们。例如,如果要删除(drop)第 2、3 和 5 列,写法是:

df.drop(df.columns[[2,3,5]], axis = 1)

You need to identify the columns based on their position in dataframe. For example, if you want to drop (del) column number 2,3 and 5, it will be,

df.drop(df.columns[[2,3,5]], axis = 1)

回答 4

如果您有两个同名的列,一种简单的方法是像这样手动重命名列:

df.columns = ['column1', 'column2', 'column3']

然后,您可以根据需要通过列索引进行删除,如下所示:

df.drop(df.columns[1], axis=1, inplace=True)

df.columns[1] 指向索引为 1 的列,上面的 drop 调用会将其删除。

请记住,轴1 =列,轴0 =行。

If you have two columns with the same name, one simple way is to manually rename the columns like this:

df.columns = ['column1', 'column2', 'column3']

Then you can drop via column index as you requested, like this:

df.drop(df.columns[1], axis=1, inplace=True)

df.columns[1] refers to the column at index 1, which the drop call above removes.

Remember axis 1 = columns and axis 0 = rows.


回答 5

如果您真的想使用整数(但是为什么呢?),则可以构建一个字典。

col_dict = {x: col for x, col in enumerate(df.columns)}

然后 df = df.drop(col_dict[0], axis=1) 将按预期工作

编辑:您可以将其放入为您执行此操作的函数中,尽管这样,每次调用它时都会创建字典

def drop_col_n(df, col_n_to_drop):
    col_dict = {x: col for x, col in enumerate(df.columns)}
    return df.drop(col_dict[col_n_to_drop], axis=1)

df = drop_col_n(df, 2)

if you really want to do it with integers (but why?), then you could build a dictionary.

col_dict = {x: col for x, col in enumerate(df.columns)}

then df = df.drop(col_dict[0], axis=1) will work as desired

edit: you can put it in a function that does that for you, though this way it creates the dictionary every time you call it

def drop_col_n(df, col_n_to_drop):
    col_dict = {x: col for x, col in enumerate(df.columns)}
    return df.drop(col_dict[col_n_to_drop], axis=1)

df = drop_col_n(df, 2)

回答 6

您可以使用以下行删除前两列(或不需要的任何列):

df.drop([df.columns[0], df.columns[1]], axis=1)

参考

You can use the following line to drop the first two columns (or any column you don’t need):

df.drop([df.columns[0], df.columns[1]], axis=1)

Reference


回答 7

由于可以有多个具有相同名称的列,我们应该首先重命名这些列。这是解决方案的代码。

df.columns=list(range(0,len(df.columns)))
df.drop(columns=[1,2])#drop second and third columns

Since there can be multiple columns with the same name, we should first rename the columns. Here is code for the solution.

df.columns=list(range(0,len(df.columns)))
df.drop(columns=[1,2])#drop second and third columns

如何摆脱熊猫DataFrame中的“Unnamed: 0”列?

问题:如何摆脱熊猫DataFrame中的“Unnamed: 0”列?

我遇到一种情况:有时当我从 csv 读取到 df 时,会得到一个不需要的、类似索引的列,名为 Unnamed: 0。

file.csv

,A,B,C
0,1,2,3
1,4,5,6
2,7,8,9

CSV读取与此:

pd.read_csv('file.csv')

   Unnamed: 0  A  B  C
0           0  1  2  3
1           1  4  5  6
2           2  7  8  9

这很烦人!有谁知道如何摆脱这一点?

I have a situation wherein sometimes when I read a csv from df I get an unwanted index-like column named unnamed:0.

file.csv

,A,B,C
0,1,2,3
1,4,5,6
2,7,8,9

The CSV is read with this:

pd.read_csv('file.csv')

   Unnamed: 0  A  B  C
0           0  1  2  3
1           1  4  5  6
2           2  7  8  9

This is very annoying! Does anyone have an idea on how to get rid of this?


回答 0

它是索引列;写出时传递 index=False 即可不写出它,请参阅文档。

例:

In [37]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
pd.read_csv(io.StringIO(df.to_csv()))

Out[37]:
   Unnamed: 0         a         b         c
0           0  0.109066 -1.112704 -0.545209
1           1  0.447114  1.525341  0.317252
2           2  0.507495  0.137863  0.886283
3           3  1.452867  1.888363  1.168101
4           4  0.901371 -0.704805  0.088335

与之比较:

In [38]:
pd.read_csv(io.StringIO(df.to_csv(index=False)))

Out[38]:
          a         b         c
0  0.109066 -1.112704 -0.545209
1  0.447114  1.525341  0.317252
2  0.507495  0.137863  0.886283
3  1.452867  1.888363  1.168101
4  0.901371 -0.704805  0.088335

您还可以通过传递 index_col=0 来告诉 read_csv 第一列是索引列:

In [40]:
pd.read_csv(io.StringIO(df.to_csv()), index_col=0)

Out[40]:
          a         b         c
0  0.109066 -1.112704 -0.545209
1  0.447114  1.525341  0.317252
2  0.507495  0.137863  0.886283
3  1.452867  1.888363  1.168101
4  0.901371 -0.704805  0.088335

It’s the index column, pass index=False to not write it out, see the docs

Example:

In [37]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
pd.read_csv(io.StringIO(df.to_csv()))

Out[37]:
   Unnamed: 0         a         b         c
0           0  0.109066 -1.112704 -0.545209
1           1  0.447114  1.525341  0.317252
2           2  0.507495  0.137863  0.886283
3           3  1.452867  1.888363  1.168101
4           4  0.901371 -0.704805  0.088335

compare with:

In [38]:
pd.read_csv(io.StringIO(df.to_csv(index=False)))

Out[38]:
          a         b         c
0  0.109066 -1.112704 -0.545209
1  0.447114  1.525341  0.317252
2  0.507495  0.137863  0.886283
3  1.452867  1.888363  1.168101
4  0.901371 -0.704805  0.088335

You could also optionally tell read_csv that the first column is the index column by passing index_col=0:

In [40]:
pd.read_csv(io.StringIO(df.to_csv()), index_col=0)

Out[40]:
          a         b         c
0  0.109066 -1.112704 -0.545209
1  0.447114  1.525341  0.317252
2  0.507495  0.137863  0.886283
3  1.452867  1.888363  1.168101
4  0.901371 -0.704805  0.088335

回答 1

出现此问题的最可能原因是,您的 CSV 在保存时把 RangeIndex(通常没有名称)也一并写入了。此修复实际上需要在保存 DataFrame 时完成,但这并不总是可行。

避免问题:带 index_col 参数的 read_csv

依我看,最简单的解决方案是将未命名的列作为索引读取。向 pd.read_csv 指定 index_col=[0] 参数,它会把第一列读取为索引。

df = pd.DataFrame('x', index=range(5), columns=list('abc'))
df

   a  b  c
0  x  x  x
1  x  x  x
2  x  x  x
3  x  x  x
4  x  x  x

# Save DataFrame to CSV.
df.to_csv('file.csv')

pd.read_csv('file.csv')

   Unnamed: 0  a  b  c
0           0  x  x  x
1           1  x  x  x
2           2  x  x  x
3           3  x  x  x
4           4  x  x  x

# Now try this again, with the extra argument.
pd.read_csv('file.csv', index_col=[0])

   a  b  c
0  x  x  x
1  x  x  x
2  x  x  x
3  x  x  x
4  x  x  x

注意:如果 DataFrame 本来就没有有意义的索引,则可以在创建输出 CSV 时使用 index=False,从一开始就避免这种情况。

df.to_csv('file.csv', index=False)

但是如上所述,这并不总是一种选择。


权宜之计:用 str.match 过滤

如果您无法修改读取/写入 CSV 文件的代码,则可以用 str.match 过滤掉该列:

df 

   Unnamed: 0  a  b  c
0           0  x  x  x
1           1  x  x  x
2           2  x  x  x
3           3  x  x  x
4           4  x  x  x

df.columns
# Index(['Unnamed: 0', 'a', 'b', 'c'], dtype='object')

df.columns.str.match('Unnamed')
# array([ True, False, False, False])

df.loc[:, ~df.columns.str.match('Unnamed')]

   a  b  c
0  x  x  x
1  x  x  x
2  x  x  x
3  x  x  x
4  x  x  x

This issue most likely manifests because your CSV was saved along with its RangeIndex (which usually doesn’t have a name). The fix would actually need to be done when saving the DataFrame, but this isn’t always an option.

Avoiding the Problem: read_csv with index_col argument

IMO, the simplest solution would be to read the unnamed column as the index. Specify an index_col=[0] argument to pd.read_csv, this reads in the first column as the index.

df = pd.DataFrame('x', index=range(5), columns=list('abc'))
df

   a  b  c
0  x  x  x
1  x  x  x
2  x  x  x
3  x  x  x
4  x  x  x

# Save DataFrame to CSV.
df.to_csv('file.csv')

pd.read_csv('file.csv')

   Unnamed: 0  a  b  c
0           0  x  x  x
1           1  x  x  x
2           2  x  x  x
3           3  x  x  x
4           4  x  x  x

# Now try this again, with the extra argument.
pd.read_csv('file.csv', index_col=[0])

   a  b  c
0  x  x  x
1  x  x  x
2  x  x  x
3  x  x  x
4  x  x  x

Note
You could have avoided this in the first place by using index=False when creating the output CSV, if your DataFrame does not have an index to begin with.

df.to_csv('file.csv', index=False)

But as mentioned above, this isn’t always an option.


Stopgap Solution: Filtering with str.match

If you cannot modify the code to read/write the CSV file, you can just remove the column by filtering with str.match:

df 

   Unnamed: 0  a  b  c
0           0  x  x  x
1           1  x  x  x
2           2  x  x  x
3           3  x  x  x
4           4  x  x  x

df.columns
# Index(['Unnamed: 0', 'a', 'b', 'c'], dtype='object')

df.columns.str.match('Unnamed')
# array([ True, False, False, False])

df.loc[:, ~df.columns.str.match('Unnamed')]

   a  b  c
0  x  x  x
1  x  x  x
2  x  x  x
3  x  x  x
4  x  x  x
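
Another stopgap, not from the original answer (a sketch, assuming the unwanted header literally starts with 'Unnamed'): filter the column out at read time by passing a callable to usecols, which read_csv evaluates against each column name:

df = pd.read_csv('file.csv', usecols=lambda c: not c.startswith('Unnamed'))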

回答 2

另一种可能导致这种情况的原因是,数据被不正确地写入了 csv,导致每一行都以逗号结尾。当您尝试把数据读入 df 时,这会在数据末尾多出一个名为 Unnamed: x 的未命名列。

Another case that this might be happening is if your data was improperly written to your csv to have each row end with a comma. This will leave you with an unnamed column Unnamed: x at the end of your data when you try to read it into a df.


回答 3

要删除所有未命名(Unnamed)列,您还可以使用正则表达式,例如 df.drop(df.filter(regex="Unname"),axis=1, inplace=True)

To get rid of all Unnamed columns, you can also use regex such as df.drop(df.filter(regex="Unname"),axis=1, inplace=True)


回答 4

只需使用以下命令删除该列: del df['column_name']

Simply delete that column using: del df['column_name']


熊猫中布尔索引的逻辑运算符

问题:熊猫中布尔索引的逻辑运算符

我正在Pandas中使用布尔值索引。问题是为什么声明:

a[(a['some_column']==some_number) & (a['some_other_column']==some_other_number)]

工作正常而

a[(a['some_column']==some_number) and (a['some_other_column']==some_other_number)]

却报错退出?

例:

a=pd.DataFrame({'x':[1,1],'y':[10,20]})

In: a[(a['x']==1)&(a['y']==10)]
Out:    x   y
     0  1  10

In: a[(a['x']==1) and (a['y']==10)]
Out: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I’m working with boolean index in Pandas. The question is why the statement:

a[(a['some_column']==some_number) & (a['some_other_column']==some_other_number)]

works fine whereas

a[(a['some_column']==some_number) and (a['some_other_column']==some_other_number)]

exits with error?

Example:

a=pd.DataFrame({'x':[1,1],'y':[10,20]})

In: a[(a['x']==1)&(a['y']==10)]
Out:    x   y
     0  1  10

In: a[(a['x']==1) and (a['y']==10)]
Out: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

回答 0

当你说

(a['x']==1) and (a['y']==10)

您是在隐式地要求 Python 将 (a['x']==1) 和 (a['y']==10) 转换为布尔值。

NumPy 数组(长度大于 1)和 Pandas 对象(例如 Series)没有布尔值;换句话说,它们会引发

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

当被用作布尔值时。这是因为尚不清楚它何时应为 True 或 False。有些用户可能认为只要长度非零它就为 True,就像 Python 列表那样;另一些人可能希望只有当所有元素都为 True 时它才为 True;还有人可能希望只要任一元素为 True 它就为 True。

由于存在如此多相互冲突的期望,NumPy 和 Pandas 的设计者拒绝猜测,而是选择引发 ValueError。

相反,你必须明确地调用 empty()、all() 或 any() 方法来表明你期望的行为。

但是,在这种情况下,您似乎不希望布尔值求值,而是希望按元素进行逻辑与。这就是&二进制运算符执行的操作:

(a['x']==1) & (a['y']==10)

返回一个布尔数组。


顺便说一句,正如 alexpmil 所指出的,括号是强制性的,因为 & 的运算符优先级高于 ==。如果没有括号,a['x']==1 & a['y']==10 将被求值为 a['x'] == (1 & a['y']) == 10,这又等价于链式比较 (a['x'] == (1 & a['y'])) and ((1 & a['y']) == 10)。那是 Series and Series 形式的表达式,对两个 Series 使用 and 会再次触发与上述相同的 ValueError。这就是括号是强制性的原因。

When you say

(a['x']==1) and (a['y']==10)

You are implicitly asking Python to convert (a['x']==1) and (a['y']==10) to boolean values.

NumPy arrays (of length greater than 1) and Pandas objects such as Series do not have a boolean value — in other words, they raise

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

when used as a boolean value. That’s because it’s unclear when it should be True or False. Some users might assume they are True if they have non-zero length, like a Python list. Others might desire for it to be True only if all its elements are True. Others might want it to be True if any of its elements are True.

Because there are so many conflicting expectations, the designers of NumPy and Pandas refuse to guess, and instead raise a ValueError.

Instead, you must be explicit, by calling the empty(), all() or any() method to indicate which behavior you desire.

In this case, however, it looks like you do not want boolean evaluation, you want element-wise logical-and. That is what the & binary operator performs:

(a['x']==1) & (a['y']==10)

returns a boolean array.


By the way, as alexpmil notes, the parentheses are mandatory since & has a higher operator precedence than ==. Without the parentheses, a['x']==1 & a['y']==10 would be evaluated as a['x'] == (1 & a['y']) == 10 which would in turn be equivalent to the chained comparison (a['x'] == (1 & a['y'])) and ((1 & a['y']) == 10). That is an expression of the form Series and Series. The use of and with two Series would again trigger the same ValueError as above. That’s why the parentheses are mandatory.


回答 1

TLDR:在熊猫中,逻辑运算符是 &、| 和 ~,并且括号 (...) 非常重要!

Python 的 and、or 和 not 逻辑运算符是为标量设计的。因此,Pandas 必须更进一步,重载了按位运算符,以实现这些功能的向量化(逐元素)版本。

因此,Python 中的以下写法(exp1 和 exp2 是求值结果为布尔值的表达式)……

exp1 and exp2              # Logical AND
exp1 or exp2               # Logical OR
not exp1                   # Logical NOT

…将转换为…

exp1 & exp2                # Element-wise logical AND
exp1 | exp2                # Element-wise logical OR
~exp1                      # Element-wise logical NOT

(这便是 pandas 中的对应写法。)

如果在执行逻辑运算的过程中得到ValueError,则需要使用括号进行分组:

(exp1) op (exp2)

例如,

(df['col1'] == x) & (df['col2'] == y) 

等等。


布尔索引:一个常见的操作是通过逻辑条件计算布尔掩码来过滤数据。熊猫提供了三种运算符:& 用于逻辑与,| 用于逻辑或,~ 用于逻辑非。

请考虑以下设置:

np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (5, 3)), columns=list('ABC'))
df

   A  B  C
0  5  0  3
1  3  7  9
2  3  5  2
3  4  7  6
4  8  8  1

逻辑与

对于上面的 df,假设您想返回 A < 5 且 B > 5 的所有行。这是通过分别计算每个条件的掩码,再对它们做逻辑与来完成的。

重载的按位&运算符
在继续之前,请注意文档的此特定摘录,其中指出

另一个常见的操作是使用布尔向量来过滤数据。运算符是:| 表示 or,& 表示 and,~ 表示 not。这些必须用括号分组,因为默认情况下 Python 会把 df.A > 2 & df.B < 3 这样的表达式求值为 df.A > (2 & df.B) < 3,而期望的求值顺序是 (df.A > 2) & (df.B < 3)。

因此,考虑到这一点,可以使用按位运算符实现按元素逻辑与&

df['A'] < 5

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df['B'] > 5

0    False
1     True
2    False
3     True
4     True
Name: B, dtype: bool

(df['A'] < 5) & (df['B'] > 5)

0    False
1     True
2    False
3     True
4    False
dtype: bool

接下来的过滤步骤很简单,

df[(df['A'] < 5) & (df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

括号用于覆盖默认的优先级顺序:按位运算符比比较运算符 < 和 > 具有更高的优先级。请参阅 Python 文档中的“运算符优先级”部分。

如果不使用括号,则表达式的计算不正确。例如,如果您不小心尝试了诸如

df['A'] < 5 & df['B'] > 5

它被解析为

df['A'] < (5 & df['B']) > 5

变成

df['A'] < something_you_dont_want > 5

变成了(请参阅 Python 文档中有关链式比较的部分),

(df['A'] < something_you_dont_want) and (something_you_dont_want > 5)

变成

# Both operands are Series...
something_else_you_dont_want1 and something_else_you_dont_want2

哪个抛出

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

所以,不要犯那个错误!1

避免括号分组
解决方法实际上非常简单。大多数运算符都有对应的 DataFrame 绑定方法。如果用这些函数而不是比较运算符来构建各个掩码,就不再需要用括号分组来指定求值顺序:

df['A'].lt(5)

0     True
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df['B'].gt(5)

0    False
1     True
2    False
3     True
4     True
Name: B, dtype: bool

df['A'].lt(5) & df['B'].gt(5)

0    False
1     True
2    False
3     True
4    False
dtype: bool

请参阅“灵活比较”部分。总而言之,我们有:

╒════╤════════════╤════════════╕
│    │ Operator   │ Function   │
╞════╪════════════╪════════════╡
│  0 │ >          │ gt         │
├────┼────────────┼────────────┤
│  1 │ >=         │ ge         │
├────┼────────────┼────────────┤
│  2 │ <          │ lt         │
├────┼────────────┼────────────┤
│  3 │ <=         │ le         │
├────┼────────────┼────────────┤
│  4 │ ==         │ eq         │
├────┼────────────┼────────────┤
│  5 │ !=         │ ne         │
╘════╧════════════╧════════════╛

避免括号的另一种方法是使用DataFrame.query(或eval):

df.query('A < 5 and B > 5')

   A  B  C
1  3  7  9
3  4  7  6

我已经广泛地记录queryeval使用pd.eval动态表达评价大熊猫()

operator.and_
允许您以函数式的方式执行此操作。其内部调用 Series.__and__,对应于按位运算符。

import operator 

operator.and_(df['A'] < 5, df['B'] > 5)
# Same as,
# (df['A'] < 5).__and__(df['B'] > 5) 

0    False
1     True
2    False
3     True
4    False
dtype: bool

df[operator.and_(df['A'] < 5, df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

您通常不需要此功能,但了解它很有用。

概括:np.logical_and(以及 logical_and.reduce)
另一种替代方法是使用 np.logical_and,它也不需要括号分组:

np.logical_and(df['A'] < 5, df['B'] > 5)

0    False
1     True
2    False
3     True
4    False
Name: A, dtype: bool

df[np.logical_and(df['A'] < 5, df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

np.logical_and 是一个 ufunc(通用函数),大多数 ufunc 都有 reduce 方法。这意味着当有多个掩码需要做与运算时,用 logical_and 更容易泛化。例如,要用 & 对掩码 m1、m2 和 m3 做与运算,您必须写:

m1 & m2 & m3

但是,一个更简单的选择是

np.logical_and.reduce([m1, m2, m3])

这很强大,因为它使您可以使用更复杂的逻辑在此基础上构建(例如,在列表理解中动态生成掩码并添加所有掩码):

import operator

cols = ['A', 'B']
ops = [np.less, np.greater]
values = [5, 5]

m = np.logical_and.reduce([op(df[c], v) for op, c, v in zip(ops, cols, values)])
m 
# array([False,  True, False,  True, False])

df[m]
   A  B  C
1  3  7  9
3  4  7  6

1 - 我知道我在这一点上反复强调,但请耐心。这是一个非常非常常见的初学者错误,必须解释得非常透彻。


逻辑或

对于df上述内容,假设您想返回A == 3或B == 7的所有行。

按位重载 |

df['A'] == 3

0    False
1     True
2     True
3    False
4    False
Name: A, dtype: bool

df['B'] == 7

0    False
1     True
2    False
3     True
4    False
Name: B, dtype: bool

(df['A'] == 3) | (df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
dtype: bool

df[(df['A'] == 3) | (df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

如果您还没有阅读过,请先阅读上面的“逻辑与”部分,那里的所有注意事项在此处同样适用。

或者,可以使用

df[df['A'].eq(3) | df['B'].eq(7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

operator.or_
在后台调用Series.__or__

operator.or_(df['A'] == 3, df['B'] == 7)
# Same as,
# (df['A'] == 3).__or__(df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
dtype: bool

df[operator.or_(df['A'] == 3, df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

np.logical_or
对于两个条件,请使用logical_or

np.logical_or(df['A'] == 3, df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df[np.logical_or(df['A'] == 3, df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

对于多个口罩,请使用logical_or.reduce

np.logical_or.reduce([df['A'] == 3, df['B'] == 7])
# array([False,  True,  True,  True, False])

df[np.logical_or.reduce([df['A'] == 3, df['B'] == 7])]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

逻辑非

给定一个掩码,例如

mask = pd.Series([True, True, False])

如果您需要反转每个布尔值(以使最终结果为[False, False, True]),则可以使用以下任何方法。

按位 ~

~mask

0    False
1    False
2     True
dtype: bool

同样,表达式需要加上括号。

~(df['A'] == 3)

0     True
1    False
2    False
3     True
4     True
Name: A, dtype: bool

这在内部调用

mask.__invert__()

0    False
1    False
2     True
dtype: bool

但是不要直接使用它。

operator.inv
其内部调用 Series 的 __invert__。

operator.inv(mask)

0    False
1    False
2     True
dtype: bool

np.logical_not
这是numpy的变体。

np.logical_not(mask)

0    False
1    False
2     True
dtype: bool

注意,np.logical_and 可以换成 np.bitwise_and,logical_or 换成 bitwise_or,logical_not 换成 invert。

TLDR; Logical Operators in Pandas are &, | and ~, and parentheses (...) is important!

Python’s and, or and not logical operators are designed to work with scalars. So Pandas had to do one better and override the bitwise operators to achieve vectorized (element-wise) version of this functionality.

So the following in python (exp1 and exp2 are expressions which evaluate to a boolean result)…

exp1 and exp2              # Logical AND
exp1 or exp2               # Logical OR
not exp1                   # Logical NOT

…will translate to…

exp1 & exp2                # Element-wise logical AND
exp1 | exp2                # Element-wise logical OR
~exp1                      # Element-wise logical NOT

for pandas.

If in the process of performing logical operation you get a ValueError, then you need to use parentheses for grouping:

(exp1) op (exp2)

For example,

(df['col1'] == x) & (df['col2'] == y) 

And so on.


Boolean Indexing: A common operation is to compute boolean masks through logical conditions to filter the data. Pandas provides three operators: & for logical AND, | for logical OR, and ~ for logical NOT.

Consider the following setup:

np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (5, 3)), columns=list('ABC'))
df

   A  B  C
0  5  0  3
1  3  7  9
2  3  5  2
3  4  7  6
4  8  8  1

Logical AND

For df above, say you’d like to return all rows where A < 5 and B > 5. This is done by computing masks for each condition separately, and ANDing them.

Overloaded Bitwise & Operator
Before continuing, please take note of this particular excerpt of the docs, which state

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df.A > 2 & df.B < 3 as df.A > (2 & df.B) < 3, while the desired evaluation order is (df.A > 2) & (df.B < 3).

So, with this in mind, element wise logical AND can be implemented with the bitwise operator &:

df['A'] < 5

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df['B'] > 5

0    False
1     True
2    False
3     True
4     True
Name: B, dtype: bool

(df['A'] < 5) & (df['B'] > 5)

0    False
1     True
2    False
3     True
4    False
dtype: bool

And the subsequent filtering step is simply,

df[(df['A'] < 5) & (df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

The parentheses are used to override the default precedence order of bitwise operators, which have higher precedence over the conditional operators < and >. See the section of Operator Precedence in the python docs.

If you do not use parentheses, the expression is evaluated incorrectly. For example, if you accidentally attempt something such as

df['A'] < 5 & df['B'] > 5

It is parsed as

df['A'] < (5 & df['B']) > 5

Which becomes,

df['A'] < something_you_dont_want > 5

Which becomes (see the python docs on chained operator comparison),

(df['A'] < something_you_dont_want) and (something_you_dont_want > 5)

Which becomes,

# Both operands are Series...
something_else_you_dont_want1 and something_else_you_dont_want2

Which throws

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

So, don’t make that mistake!1

Avoiding Parentheses Grouping
The fix is actually quite simple. Most operators have a corresponding bound method for DataFrames. If the individual masks are built up using functions instead of conditional operators, you will no longer need to group by parens to specify evaluation order:

df['A'].lt(5)

0     True
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df['B'].gt(5)

0    False
1     True
2    False
3     True
4     True
Name: B, dtype: bool

df['A'].lt(5) & df['B'].gt(5)

0    False
1     True
2    False
3     True
4    False
dtype: bool

See the section on Flexible Comparisons. To summarise, we have

╒════╤════════════╤════════════╕
│    │ Operator   │ Function   │
╞════╪════════════╪════════════╡
│  0 │ >          │ gt         │
├────┼────────────┼────────────┤
│  1 │ >=         │ ge         │
├────┼────────────┼────────────┤
│  2 │ <          │ lt         │
├────┼────────────┼────────────┤
│  3 │ <=         │ le         │
├────┼────────────┼────────────┤
│  4 │ ==         │ eq         │
├────┼────────────┼────────────┤
│  5 │ !=         │ ne         │
╘════╧════════════╧════════════╛

Another option for avoiding parentheses is to use DataFrame.query (or eval):

df.query('A < 5 and B > 5')

   A  B  C
1  3  7  9
3  4  7  6

I have extensively documented query and eval in Dynamic Expression Evaluation in pandas using pd.eval().

operator.and_
Allows you to perform this operation in a functional manner. Internally calls Series.__and__ which corresponds to the bitwise operator.

import operator 

operator.and_(df['A'] < 5, df['B'] > 5)
# Same as,
# (df['A'] < 5).__and__(df['B'] > 5) 

0    False
1     True
2    False
3     True
4    False
dtype: bool

df[operator.and_(df['A'] < 5, df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

You won’t usually need this, but it is useful to know.

Generalizing: np.logical_and (and logical_and.reduce)
Another alternative is using np.logical_and, which also does not need parentheses grouping:

np.logical_and(df['A'] < 5, df['B'] > 5)

0    False
1     True
2    False
3     True
4    False
Name: A, dtype: bool

df[np.logical_and(df['A'] < 5, df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

np.logical_and is a ufunc (Universal Functions), and most ufuncs have a reduce method. This means it is easier to generalise with logical_and if you have multiple masks to AND. For example, to AND masks m1 and m2 and m3 with &, you would have to do

m1 & m2 & m3

However, an easier option is

np.logical_and.reduce([m1, m2, m3])

This is powerful, because it lets you build on top of this with more complex logic (for example, dynamically generating masks in a list comprehension and adding all of them):

import operator

cols = ['A', 'B']
ops = [np.less, np.greater]
values = [5, 5]

m = np.logical_and.reduce([op(df[c], v) for op, c, v in zip(ops, cols, values)])
m 
# array([False,  True, False,  True, False])

df[m]
   A  B  C
1  3  7  9
3  4  7  6

1 – I know I’m harping on this point, but please bear with me. This is a very, very common beginner’s mistake, and must be explained very thoroughly.


Logical OR

For the df above, say you’d like to return all rows where A == 3 or B == 7.

Overloaded Bitwise |

df['A'] == 3

0    False
1     True
2     True
3    False
4    False
Name: A, dtype: bool

df['B'] == 7

0    False
1     True
2    False
3     True
4    False
Name: B, dtype: bool

(df['A'] == 3) | (df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
dtype: bool

df[(df['A'] == 3) | (df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

If you haven’t yet, please also read the section on Logical AND above, all caveats apply here.

Alternatively, this operation can be specified with

df[df['A'].eq(3) | df['B'].eq(7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

operator.or_
Calls Series.__or__ under the hood.

operator.or_(df['A'] == 3, df['B'] == 7)
# Same as,
# (df['A'] == 3).__or__(df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
dtype: bool

df[operator.or_(df['A'] == 3, df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

np.logical_or
For two conditions, use logical_or:

np.logical_or(df['A'] == 3, df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df[np.logical_or(df['A'] == 3, df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

For multiple masks, use logical_or.reduce:

np.logical_or.reduce([df['A'] == 3, df['B'] == 7])
# array([False,  True,  True,  True, False])

df[np.logical_or.reduce([df['A'] == 3, df['B'] == 7])]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

Logical NOT

Given a mask, such as

mask = pd.Series([True, True, False])

If you need to invert every boolean value (so that the end result is [False, False, True]), then you can use any of the methods below.

Bitwise ~

~mask

0    False
1    False
2     True
dtype: bool

Again, expressions need to be parenthesised.

~(df['A'] == 3)

0     True
1    False
2    False
3     True
4     True
Name: A, dtype: bool

This internally calls

mask.__invert__()

0    False
1    False
2     True
dtype: bool

But don’t use it directly.

operator.inv
Internally calls __invert__ on the Series.

operator.inv(mask)

0    False
1    False
2     True
dtype: bool

np.logical_not
This is the numpy variant.

np.logical_not(mask)

0    False
1    False
2     True
dtype: bool

Note, np.logical_and can be substituted for np.bitwise_and, logical_or with bitwise_or, and logical_not with invert.


回答 2

熊猫中布尔索引的逻辑运算符

重要的是要认识到:你不能在 pandas.Series 或 pandas.DataFrame 上使用任何 Python 逻辑运算符(and、or 或 not);同样,你也不能在含有多个元素的 numpy.array 上使用它们。之所以不能使用,是因为它们会隐式地对操作数调用 bool,而这些数据结构认为数组的布尔值是不明确的,因此会引发异常:

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.array([1,2,3])
>>> s = pd.Series([1,2,3])
>>> df = pd.DataFrame([1,2,3])
>>> bool(arr)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> bool(s)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> bool(df)
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

我在对“Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()”这一问答的回答中更详尽地讨论了这一点。

NumPy 的逻辑函数

然而,NumPy 提供了与这些运算符等价的逐元素操作函数,它们可用于 numpy.array、pandas.Series、pandas.DataFrame 或任何其他(符合规范的)numpy.array 子类:

因此,从本质上讲,应该使用以下函数(假设 df1 和 df2 是 pandas DataFrame):

np.logical_and(df1, df2)
np.logical_or(df1, df2)
np.logical_not(df1)
np.logical_xor(df1, df2)

布尔的按位函数和按位运算符

但是,如果您有布尔型的 NumPy 数组、pandas Series 或 pandas DataFrame,也可以使用逐元素的按位函数(对于布尔值,它们与逻辑函数是不可区分的,至少应该如此):

通常直接使用运算符。但是,当与比较运算符组合使用时,必须记住把比较括在括号中,因为按位运算符的优先级高于比较运算符:

(df1 < 10) | (df2 > 10)  # instead of the wrong df1 < 10 | df2 > 10

这可能很烦人,因为 Python 逻辑运算符的优先级比比较运算符低,所以您通常可以写 a < 10 and b > 10(其中 a 和 b 是简单整数)而不需要括号。

逻辑和按位运算之间的差异(非布尔值)

需要特别强调的是,位和逻辑运算仅对布尔NumPy数组(以及布尔Series和DataFrame)是等效的。如果这些不包含布尔值,则这些操作将给出不同的结果。我将包括使用NumPy数组的示例,但对于pandas数据结构,结果将相似:

>>> import numpy as np
>>> a1 = np.array([0, 0, 1, 1])
>>> a2 = np.array([0, 1, 0, 1])

>>> np.logical_and(a1, a2)
array([False, False, False,  True])
>>> np.bitwise_and(a1, a2)
array([0, 0, 0, 1], dtype=int32)

而且由于NumPy(和类似的pandas)对boolean(布尔或“掩码”索引数组)和integer(Index数组)索引所做的操作不同,因此索引的结果也将不同:

>>> a3 = np.array([1, 2, 3, 4])

>>> a3[np.logical_and(a1, a2)]
array([4])
>>> a3[np.bitwise_and(a1, a2)]
array([1, 1, 1, 2])

汇总表

Logical operator | NumPy logical function | NumPy bitwise function | Bitwise operator
-------------------------------------------------------------------------------------
       and       |  np.logical_and        | np.bitwise_and         |        &
-------------------------------------------------------------------------------------
       or        |  np.logical_or         | np.bitwise_or          |        |
-------------------------------------------------------------------------------------
                 |  np.logical_xor        | np.bitwise_xor         |        ^
-------------------------------------------------------------------------------------
       not       |  np.logical_not        | np.invert              |        ~

其中,逻辑运算符不适用于 NumPy 数组、pandas Series 和 pandas DataFrame。其余的可以在这些数据结构(以及普通 Python 对象)上使用,并且是逐元素运算的。但请小心对纯 Python bool 做按位取反:在这种情况下布尔值会被解释为整数(例如 ~False 返回 -1,~True 返回 -2)。

Logical operators for boolean indexing in Pandas

It’s important to realize that you cannot use any of the Python logical operators (and, or or not) on pandas.Series or pandas.DataFrames (similarly you cannot use them on numpy.arrays with more than one element). The reason why you cannot use those is because they implicitly call bool on their operands which throws an Exception because these data structures decided that the boolean of an array is ambiguous:

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.array([1,2,3])
>>> s = pd.Series([1,2,3])
>>> df = pd.DataFrame([1,2,3])
>>> bool(arr)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> bool(s)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> bool(df)
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I did cover this more extensively in my answer to the “Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()” Q+A.

NumPy's logical functions

However NumPy provides element-wise operating equivalents to these operators as functions that can be used on numpy.array, pandas.Series, pandas.DataFrame, or any other (conforming) numpy.array subclass:

So, essentially, one should use (assuming df1 and df2 are pandas DataFrames):

np.logical_and(df1, df2)
np.logical_or(df1, df2)
np.logical_not(df1)
np.logical_xor(df1, df2)

Bitwise functions and bitwise operators for booleans

However in case you have boolean NumPy array, pandas Series, or pandas DataFrames you could also use the element-wise bitwise functions (for booleans they are – or at least should be – indistinguishable from the logical functions):

Typically the operators are used. However when combined with comparison operators one has to remember to wrap the comparison in parenthesis because the bitwise operators have a higher precedence than the comparison operators:

(df1 < 10) | (df2 > 10)  # instead of the wrong df1 < 10 | df2 > 10

This may be irritating because the Python logical operators have a lower precedence than the comparison operators, so you normally write a < 10 and b > 10 (where a and b are for example simple integers) and don’t need the parentheses.

Differences between logical and bitwise operations (on non-booleans)

It is really important to stress that bit and logical operations are only equivalent for boolean NumPy arrays (and boolean Series & DataFrames). If these don’t contain booleans then the operations will give different results. I’ll include examples using NumPy arrays but the results will be similar for the pandas data structures:

>>> import numpy as np
>>> a1 = np.array([0, 0, 1, 1])
>>> a2 = np.array([0, 1, 0, 1])

>>> np.logical_and(a1, a2)
array([False, False, False,  True])
>>> np.bitwise_and(a1, a2)
array([0, 0, 0, 1], dtype=int32)

And since NumPy (and similarly pandas) does different things for boolean (Boolean or “mask” index arrays) and integer (Index arrays) indices the results of indexing will be also be different:

>>> a3 = np.array([1, 2, 3, 4])

>>> a3[np.logical_and(a1, a2)]
array([4])
>>> a3[np.bitwise_and(a1, a2)]
array([1, 1, 1, 2])

Summary table

Logical operator | NumPy logical function | NumPy bitwise function | Bitwise operator
-------------------------------------------------------------------------------------
       and       |  np.logical_and        | np.bitwise_and         |        &
-------------------------------------------------------------------------------------
       or        |  np.logical_or         | np.bitwise_or          |        |
-------------------------------------------------------------------------------------
                 |  np.logical_xor        | np.bitwise_xor         |        ^
-------------------------------------------------------------------------------------
       not       |  np.logical_not        | np.invert              |        ~

Where the logical operator does not work for NumPy arrays, pandas Series, and pandas DataFrames. The others work on these data structures (and plain Python objects) and work element-wise. However be careful with the bitwise invert on plain Python bools because the bool will be interpreted as integers in this context (for example ~False returns -1 and ~True returns -2).