Tag archive: dataframe

Python Pandas: filtering out NaN from a data selection of a column of strings

Question: Python Pandas: filtering out NaN from a data selection of a column of strings

Without using groupby, how would I filter out data without NaN?

Let's say I have a matrix where customers will fill in 'N/A', 'n/a' or any of their variations, and others leave it blank:

import pandas as pd
import numpy as np


df = pd.DataFrame({'movie': ['thg', 'thg', 'mol', 'mol', 'lob', 'lob'],
                  'rating': [3., 4., 5., np.nan, np.nan, np.nan],
                  'name': ['John', np.nan, 'N/A', 'Graham', np.nan, np.nan]})

nbs = df['name'].str.extract('^(N/A|NA|na|n/a)')
nms=df[(df['name'] != nbs) ]

output:

>>> nms
  movie    name  rating
0   thg    John       3
1   thg     NaN       4
3   mol  Graham     NaN
4   lob     NaN     NaN
5   lob     NaN     NaN

How would I filter out NaN values so I can get results to work with like this:

  movie    name  rating
0   thg    John       3
3   mol  Graham     NaN

I am guessing I need something like ~np.isnan, but the tilde does not work with strings.


Answer 0

Just drop them:

nms.dropna(thresh=2)

this will drop all rows where there are at least two non-NaN.

Then you could drop the rows where name is NaN:

In [87]:

nms
Out[87]:
  movie    name  rating
0   thg    John       3
1   thg     NaN       4
3   mol  Graham     NaN
4   lob     NaN     NaN
5   lob     NaN     NaN

[5 rows x 3 columns]
In [89]:

nms = nms.dropna(thresh=2)
In [90]:

nms[nms.name.notnull()]
Out[90]:
  movie    name  rating
0   thg    John       3
3   mol  Graham     NaN

[2 rows x 3 columns]

EDIT

Actually, looking at what you originally want, you can do just this without the dropna call:

nms[nms.name.notnull()]

UPDATE

Looking at this question 3 years later, there is a mistake: the thresh arg looks for at least n non-NaN values, so in fact the output should be:

In [4]:
nms.dropna(thresh=2)

Out[4]:
  movie    name  rating
0   thg    John     3.0
1   thg     NaN     4.0
3   mol  Graham     NaN

It’s possible that I was either mistaken 3 years ago or that the version of pandas I was running had a bug, both scenarios are entirely possible.
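
To make the corrected thresh semantics concrete, here is a minimal, self-contained sketch that rebuilds the nms frame from above (output shown as comments):

import numpy as np
import pandas as pd

nms = pd.DataFrame({'movie': ['thg', 'thg', 'mol', 'lob', 'lob'],
                    'name': ['John', np.nan, 'Graham', np.nan, np.nan],
                    'rating': [3., 4., np.nan, np.nan, np.nan]},
                   index=[0, 1, 3, 4, 5])

# thresh=2 keeps a row only if it has at least two non-NaN values
print(nms.dropna(thresh=2))
#   movie    name  rating
# 0   thg    John     3.0
# 1   thg     NaN     4.0
# 3   mol  Graham     NaN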


Answer 1

Simplest of all solutions:

filtered_df = df[df['name'].notnull()]

Thus, it keeps only the rows that do not have a NaN value in the 'name' column.

For multiple columns:

filtered_df = df[df[['name', 'country', 'region']].notnull().all(1)]
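
Going back to the original question, the literal 'N/A'-style strings and the real NaN values can also be handled in one pass: first normalize the strings to NaN, then apply the same notnull filter. A minimal sketch (the regex and the set of spellings it covers are my assumptions, not part of the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame({'movie': ['thg', 'thg', 'mol', 'mol', 'lob', 'lob'],
                   'rating': [3., 4., 5., np.nan, np.nan, np.nan],
                   'name': ['John', np.nan, 'N/A', 'Graham', np.nan, np.nan]})

# Turn any case-insensitive variant of 'NA'/'N/A' into a real NaN first
df['name'] = df['name'].replace(r'(?i)^n/?a$', np.nan, regex=True)

print(df[df['name'].notnull()])
#   movie  rating    name
# 0   thg     3.0    John
# 3   mol     NaN  Graham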

Answer 2

df = pd.DataFrame({'movie': ['thg', 'thg', 'mol', 'mol', 'lob', 'lob'],
                   'rating': [3., 4., 5., np.nan, np.nan, np.nan],
                   'name': ['John', 'James', np.nan, np.nan, np.nan, np.nan]})

# Keep only the rows with no NaN in any column, one column at a time
for col in df.columns:
    df = df[~pd.isnull(df[col])]

Answer 3

# Drop rows that have NaN in either of the listed columns
df.dropna(subset=['columnName1', 'columnName2'])

Add a column to a dataframe with a constant value

Question: Add a column to a dataframe with a constant value

I have an existing dataframe which I need to add an additional column to which will contain the same value for every row.

Existing df:

Date, Open, High, Low, Close
01-01-2015, 565, 600, 400, 450

New df:

Name, Date, Open, High, Low, Close
abc, 01-01-2015, 565, 600, 400, 450

I know how to append an existing series / dataframe column. But this is a different situation, because all I need is to add the ‘Name’ column and set every row to the same value, in this case ‘abc’.


Answer 0

df['Name']='abc' will add the new column and set all rows to that value:

In [79]:

df
Out[79]:
         Date, Open, High,  Low,  Close
0  01-01-2015,  565,  600,  400,    450
In [80]:

df['Name'] = 'abc'
df
Out[80]:
         Date, Open, High,  Low,  Close Name
0  01-01-2015,  565,  600,  400,    450  abc

Answer 1

You can use insert to specify where you want the new column to be. In this case, I use 0 to place the new column at the left.

df.insert(0, 'Name', 'abc')

  Name        Date  Open  High  Low  Close
0  abc  01-01-2015   565   600  400    450

Answer 2

A single-liner works:

df['Name'] = 'abc'

This creates a Name column and sets all rows to the value abc.


Answer 3

Summing up what the others have suggested, and adding a third way

You can:

  • assign(**kwargs):

    df.assign(Name='abc')
    
  • access the new column series (it will be created) and set it:

    df['Name'] = 'abc'
    
  • insert(loc, column, value, allow_duplicates=False)

    df.insert(0, 'Name', 'abc')
    

    where the argument loc ( 0 <= loc <= len(columns) ) allows you to insert the column where you want.

    ‘loc’ gives you the index that your column will be at after the insertion. For example, the code above inserts the column Name as the 0-th column, i.e. it will be inserted before the first column, becoming the new first column. (Indexing starts from 0).

All these methods allow you to add a new column from a Series as well (just substitute the ‘abc’ default argument above with the series).
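
To illustrate that last point, a minimal sketch (sample values assumed) that supplies a Series instead of the scalar 'abc':

import pandas as pd

df = pd.DataFrame({'Date': ['01-01-2015', '02-01-2015'],
                   'Open': [565, 570], 'Close': [450, 455]})

# Any Series aligned on the index works in place of the scalar
names = pd.Series(['abc', 'xyz'], index=df.index)

print(df.assign(Name=names))   # or: df['Name'] = names, or df.insert(0, 'Name', names)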


How do I retrieve the number of columns in a pandas dataframe?

Question: How do I retrieve the number of columns in a pandas dataframe?

How do you programmatically retrieve the number of columns in a pandas dataframe? I was hoping for something like:

df.num_columns

Answer 0

Like so:

import pandas as pd
df = pd.DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})

len(df.columns)
3

Answer 1

Alternative:

df.shape[1]

(df.shape[0] is the number of rows)


Answer 2

If the variable holding the dataframe is called df, then:

len(df.columns)

gives the number of columns.

And for those who want the number of rows:

len(df.index)

For a tuple containing the number of both rows and columns:

df.shape
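
Putting the three options side by side, a small sketch (reusing the fruit frame from Answer 0):

import pandas as pd

df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})

print(len(df.columns))  # 3       -> number of columns
print(len(df.index))    # 3       -> number of rows
print(df.shape)         # (3, 3)  -> (rows, columns) in one tuple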

Answer 3

This worked for me: len(list(df)).


Answer 4

The df.info() function will give you a result something like the one below (here the data was loaded with pandas' read_csv method, either without the sep parameter or with sep=',').

raw_data = pd.read_csv("a1:\aa2/aaa3/data.csv")
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5144 entries, 0 to 5143
Columns: 145 entries, R_fighter to R_age

Answer 5

There are multiple options to get the column count and column information. Let's check them on a small sample frame:

import numpy as np
import pandas as pd

local_df = pd.DataFrame(np.random.randint(1, 12, size=(2, 6)),
                        columns=['a', 'b', 'c', 'd', 'e', 'f'])

  1. local_df.shape[1] -> the shape attribute returns a (rows, columns) tuple, so index 1 is the column count.

  2. local_df.info() -> the info method returns detailed information about the data frame and its columns, such as the column count, the data type of each column, the non-null value counts, and the memory usage of the data frame.

  3. len(local_df.columns) -> the columns attribute returns an Index object of the data frame's columns, and the len function returns the total number of available columns.

  4. local_df.head(0) -> the head method with parameter 0 returns the first row of the df, which is actually nothing but the header.

And just for loop fun (assuming the number of columns is not more than 10), counting the columns by iterating over the frame:

li_count = 0
for x in local_df:
    li_count = li_count + 1
print(li_count)


Renaming specific columns in pandas

Question: Renaming specific columns in pandas

I’ve got a dataframe called data. How would I rename only one column header? For example, gdp to log(gdp)?

data =
    y  gdp  cap
0   1    2    5
1   2    3    9
2   8    7    2
3   3    4    7
4   6    7    7
5   4    8    3
6   8    2    8
7   9    9   10
8   6    6    4
9  10   10    7

Answer 0

data.rename(columns={'gdp':'log(gdp)'}, inplace=True)

rename accepts a dict as a param for columns, so you just pass a dict with a single entry.


Answer 1

A much faster implementation would be to use a list comprehension if you need to rename a single column:

df.columns = ['log(gdp)' if x=='gdp' else x for x in df.columns]

If the need arises to rename multiple columns, either use conditional expressions like:

df.columns = ['log(gdp)' if x=='gdp' else 'cap_mod' if x=='cap' else x for x in df.columns]

Or, construct a mapping with a dictionary and perform the list comprehension with its get operation, setting the default value to the old name:

col_dict = {'gdp': 'log(gdp)', 'cap': 'cap_mod'}   ## key→old name, value→new name

df.columns = [col_dict.get(x, x) for x in df.columns]

Timings:

%%timeit
df.rename(columns={'gdp':'log(gdp)'}, inplace=True)
10000 loops, best of 3: 168 µs per loop

%%timeit
df.columns = ['log(gdp)' if x=='gdp' else x for x in df.columns]
10000 loops, best of 3: 58.5 µs per loop

Answer 2

How do I rename a specific column in pandas?

From v0.24+, to rename one (or more) columns at a time:

  • DataFrame.rename() with axis=1 or axis='columns', passing a dict or a function.
  • Index.str.replace for string/regex-based replacement.

If you need to rename ALL columns at once:

  • DataFrame.set_axis() method with axis=1. Pass a list-like sequence. Options are available for in-place modification as well.

rename with axis=1

df = pd.DataFrame('x', columns=['y', 'gdp', 'cap'], index=range(5))
df

   y gdp cap
0  x   x   x
1  x   x   x
2  x   x   x
3  x   x   x
4  x   x   x

With 0.21+, you can now specify an axis parameter with rename:

df.rename({'gdp':'log(gdp)'}, axis=1)
# df.rename({'gdp':'log(gdp)'}, axis='columns')
    
   y log(gdp) cap
0  x        x   x
1  x        x   x
2  x        x   x
3  x        x   x
4  x        x   x

(Note that rename is not in-place by default, so you will need to assign the result back.)

This addition has been made to improve consistency with the rest of the API. The new axis argument is analogous to the columns parameter—they do the same thing.

df.rename(columns={'gdp': 'log(gdp)'})

   y log(gdp) cap
0  x        x   x
1  x        x   x
2  x        x   x
3  x        x   x
4  x        x   x

rename also accepts a callback that is called once for each column.

df.rename(lambda x: x[0], axis=1)
# df.rename(lambda x: x[0], axis='columns')

   y  g  c
0  x  x  x
1  x  x  x
2  x  x  x
3  x  x  x
4  x  x  x

For this specific scenario, you would want to use

df.rename(lambda x: 'log(gdp)' if x == 'gdp' else x, axis=1)

Index.str.replace

Similar to the replace method of strings in Python, pandas Index and Series (object dtype only) define a ("vectorized") str.replace method for string- and regex-based replacement.

df.columns = df.columns.str.replace('gdp', 'log(gdp)')
df
 
   y log(gdp) cap
0  x        x   x
1  x        x   x
2  x        x   x
3  x        x   x
4  x        x   x

The advantage of this over the other methods is that str.replace supports regex (enabled by default). See the docs for more information.
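
One portability note (my addition, not part of the original answer): newer pandas versions changed the default of str.replace to regex=False, so it is safer to pass the flag explicitly and to anchor the pattern so only the exact name matches. A minimal sketch:

import pandas as pd

df = pd.DataFrame('x', columns=['y', 'gdp', 'cap'], index=range(5))

# Explicit regex=True plus anchors: only the exact name 'gdp' is rewritten,
# and the call behaves the same across pandas versions.
df.columns = df.columns.str.replace(r'^gdp$', 'log(gdp)', regex=True)
print(list(df.columns))  # ['y', 'log(gdp)', 'cap']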


Passing a list to set_axis with axis=1

Call set_axis with a list of header(s). The list must be equal in length to the columns/index size. set_axis mutates the original DataFrame by default, but you can specify inplace=False to return a modified copy.

df.set_axis(['cap', 'log(gdp)', 'y'], axis=1, inplace=False)
# df.set_axis(['cap', 'log(gdp)', 'y'], axis='columns', inplace=False)

  cap log(gdp)  y
0   x        x  x
1   x        x  x
2   x        x  x
3   x        x  x
4   x        x  x

Note: In future releases, inplace will default to True.

Method Chaining
Why choose set_axis when we already have an efficient way of assigning columns with df.columns = ...? As shown by Ted Petrou in this answer (https://stackoverflow.com/a/46912050/4909087), set_axis is useful when trying to chain methods.

Compare

# new for pandas 0.21+ (the some_method* calls are placeholders;
# wrapping in parentheses makes the chain valid Python)
(df.some_method1()
   .some_method2()
   .set_axis(columns, axis=1, inplace=False)
   .some_method3())

Versus

# old way
df1 = (df.some_method1()
         .some_method2())
df1.columns = columns
df1.some_method3()

The former is more natural and free flowing syntax.


Answer 3

There are at least five different ways to rename specific columns in pandas, and I have listed them below along with links to the original answers. I also timed these methods and found them to perform about the same (though YMMV depending on your data set and scenario). The test case below is to rename columns A M N Z to A2 M2 N2 Z2 in a dataframe with columns A to Z containing a million rows.

# Import required modules
import numpy as np
import pandas as pd
import timeit

# Create sample data
df = pd.DataFrame(np.random.randint(0,9999,size=(1000000, 26)), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

# Standard way - https://stackoverflow.com/a/19758398/452587
def method_1():
    df_renamed = df.rename(columns={'A': 'A2', 'M': 'M2', 'N': 'N2', 'Z': 'Z2'})

# Lambda function - https://stackoverflow.com/a/16770353/452587
def method_2():
    df_renamed = df.rename(columns=lambda x: x + '2' if x in ['A', 'M', 'N', 'Z'] else x)

# Mapping function - https://stackoverflow.com/a/19758398/452587
def rename_some(x):
    if x=='A' or x=='M' or x=='N' or x=='Z':
        return x + '2'
    return x
def method_3():
    df_renamed = df.rename(columns=rename_some)

# Dictionary comprehension - https://stackoverflow.com/a/58143182/452587
def method_4():
    df_renamed = df.rename(columns={col: col + '2' for col in df.columns[
        np.asarray([i for i, col in enumerate(df.columns) if 'A' in col or 'M' in col or 'N' in col or 'Z' in col])
    ]})

# Dictionary comprehension - https://stackoverflow.com/a/38101084/452587
def method_5():
    df_renamed = df.rename(columns=dict(zip(df[['A', 'M', 'N', 'Z']], ['A2', 'M2', 'N2', 'Z2'])))

print('Method 1:', timeit.timeit(method_1, number=10))
print('Method 2:', timeit.timeit(method_2, number=10))
print('Method 3:', timeit.timeit(method_3, number=10))
print('Method 4:', timeit.timeit(method_4, number=10))
print('Method 5:', timeit.timeit(method_5, number=10))

Output:

Method 1: 3.650640267
Method 2: 3.163998427
Method 3: 2.998530871
Method 4: 2.9918436889999995
Method 5: 3.2436501520000007

Use the method that is most intuitive to you and easiest for you to implement in your application.


Pandas: how do I use the apply() function for a single column?

Question: Pandas: how do I use the apply() function for a single column?

I have a pandas data frame with two columns. I need to change the values of the first column without affecting the second one, and get back the whole data frame with just the first column's values changed. How can I do that using apply in pandas?


Answer 0

Given a sample dataframe df as:

a,b
1,2
2,3
3,4
4,5

what you want is:

df['a'] = df['a'].apply(lambda x: x + 1)

that returns:

   a  b
0  2  2
1  3  3
2  4  4
3  5  5

Answer 1

For a single column, it is better to use map(), like this:

df = pd.DataFrame([{'a': 15, 'b': 15, 'c': 5}, {'a': 20, 'b': 10, 'c': 7}, {'a': 25, 'b': 30, 'c': 9}])

    a   b  c
0  15  15  5
1  20  10  7
2  25  30  9



df['a'] = df['a'].map(lambda a: a / 2.)

      a   b  c
0   7.5  15  5
1  10.0  10  7
2  12.5  30  9

Answer 2

You don’t need a function at all. You can work on a whole column directly.

Example data:

>>> df = pd.DataFrame({'a': [100, 1000], 'b': [200, 2000], 'c': [300, 3000]})
>>> df

      a     b     c
0   100   200   300
1  1000  2000  3000

Halve all the values in column a:

>>> df.a = df.a / 2
>>> df

     a     b     c
0   50   200   300
1  500  2000  3000

Answer 3

Although the given responses are correct, they modify the initial data frame, which is not always desirable (and, given the OP asked for examples “using apply“, it might be they wanted a version that returns a new data frame, as apply does).

This is possible using assign: it is valid to assign to existing columns, as the documentation states (emphasis is mine):

Assign new columns to a DataFrame.

Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.

In short:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([{'a': 15, 'b': 15, 'c': 5}, {'a': 20, 'b': 10, 'c': 7}, {'a': 25, 'b': 30, 'c': 9}])

In [3]: df.assign(a=lambda df: df.a / 2)
Out[3]: 
      a   b  c
0   7.5  15  5
1  10.0  10  7
2  12.5  30  9

In [4]: df
Out[4]: 
    a   b  c
0  15  15  5
1  20  10  7
2  25  30  9

Note that the function will be passed the whole dataframe, not only the column you want to modify, so you will need to make sure you select the right column in your lambda.
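
To make that concrete, a minimal sketch (hypothetical column names) showing why the callable must select the column itself:

import pandas as pd

df = pd.DataFrame({'a': [15, 20, 25], 'b': [15, 10, 30]})

# assign passes the *whole* frame to the callable, so pick out 'a' inside it;
# returning frame / 2 would produce a whole DataFrame, not a single column.
def halve_a(frame):
    return frame['a'] / 2

print(df.assign(a=halve_a))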


Answer 4

If you are really concerned about the execution speed of your apply function and you have a huge dataset to work on, you could use swifter for faster execution. Here is an example of swifter on a pandas dataframe:

import pandas as pd
import swifter

def fnc(m):
    return m*3+4

df = pd.DataFrame({"m": [1,2,3,4,5,6], "c": [1,1,1,1,1,1], "x":[5,3,6,2,6,1]})

# apply a self created function to a single column in pandas
df["y"] = df.m.swifter.apply(fnc)

This will enable all your CPU cores to compute the result, hence it will be much faster than the normal apply function. Try it and let me know if it becomes useful for you.


Answer 5

Let me try a more complex computation using datetime, taking nulls and empty strings into account. I am subtracting 30 years from a datetime column, using the apply method with a lambda and converting the datetime format along the way. The clause if x != '' else x takes care of all empty strings and nulls accordingly.

import datetime  # needed for strptime and timedelta below

df['Date'] = df['Date'].fillna('')
df['Date'] = df['Date'].apply(lambda x : ((datetime.datetime.strptime(str(x), '%m/%d/%Y') - datetime.timedelta(days=30*365)).strftime('%Y%m%d')) if x != '' else x)
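
For context, a minimal end-to-end sketch with hypothetical dates (the 30*365-day offset deliberately ignores leap days, as in the snippet above):

import datetime
import pandas as pd

df = pd.DataFrame({'Date': ['01/15/2020', None, '07/04/1996']})

df['Date'] = df['Date'].fillna('')
df['Date'] = df['Date'].apply(
    lambda x: ((datetime.datetime.strptime(str(x), '%m/%d/%Y')
                - datetime.timedelta(days=30*365)).strftime('%Y%m%d'))
    if x != '' else x)

print(df)  # dates shifted back roughly 30 years; the empty slot stays empty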

Pandas index column title or name

Question: Pandas index column title or name

How do I get the index column name in python pandas? Here’s an example dataframe:

             Column 1
Index Title          
Apples              1
Oranges             2
Puppies             3
Ducks               4  

What I’m trying to do is get/set the dataframe index title. Here is what I tried:

import pandas as pd
data = {'Column 1'     : [1., 2., 3., 4.],
        'Index Title'  : ["Apples", "Oranges", "Puppies", "Ducks"]}
df = pd.DataFrame(data)
df.index = df["Index Title"]
del df["Index Title"]
print df

Anyone know how to do this?


Answer 0

You can just get/set the index via its name property

In [7]: df.index.name
Out[7]: 'Index Title'

In [8]: df.index.name = 'foo'

In [9]: df.index.name
Out[9]: 'foo'

In [10]: df
Out[10]: 
         Column 1
foo              
Apples          1
Oranges         2
Puppies         3
Ducks           4

Answer 1

You can use rename_axis; to remove the name, set it to None:

d = {'Index Title': ['Apples', 'Oranges', 'Puppies', 'Ducks'],'Column 1': [1.0, 2.0, 3.0, 4.0]}
df = pd.DataFrame(d).set_index('Index Title')
print (df)
             Column 1
Index Title          
Apples            1.0
Oranges           2.0
Puppies           3.0
Ducks             4.0

print (df.index.name)
Index Title

print (df.columns.name)
None

The new functionality works well in method chains.

df = df.rename_axis('foo')
print (df)
         Column 1
foo              
Apples        1.0
Oranges       2.0
Puppies       3.0
Ducks         4.0

You can also rename column names with parameter axis:

d = {'Index Title': ['Apples', 'Oranges', 'Puppies', 'Ducks'],'Column 1': [1.0, 2.0, 3.0, 4.0]}
df = pd.DataFrame(d).set_index('Index Title').rename_axis('Col Name', axis=1)
print (df)
Col Name     Column 1
Index Title          
Apples            1.0
Oranges           2.0
Puppies           3.0
Ducks             4.0

print (df.index.name)
Index Title

print (df.columns.name)
Col Name
print df.rename_axis('foo').rename_axis("bar", axis="columns")
bar      Column 1
foo              
Apples        1.0
Oranges       2.0
Puppies       3.0
Ducks         4.0

print df.rename_axis('foo').rename_axis("bar", axis=1)
bar      Column 1
foo              
Apples        1.0
Oranges       2.0
Puppies       3.0
Ducks         4.0

From pandas version 0.24.0+ it is possible to use the parameters index and columns:

df = df.rename_axis(index='foo', columns="bar")
print (df)
bar      Column 1
foo              
Apples        1.0
Oranges       2.0
Puppies       3.0
Ducks         4.0

Removing the index and columns names means setting them to None:

df = df.rename_axis(index=None, columns=None)
print (df)
         Column 1
Apples        1.0
Oranges       2.0
Puppies       3.0
Ducks         4.0

If there is a MultiIndex in the index only:

mux = pd.MultiIndex.from_arrays([['Apples', 'Oranges', 'Puppies', 'Ducks'],
                                  list('abcd')], 
                                  names=['index name 1','index name 1'])


df = pd.DataFrame(np.random.randint(10, size=(4,6)), 
                  index=mux, 
                  columns=list('ABCDEF')).rename_axis('col name', axis=1)
print (df)
col name                   A  B  C  D  E  F
index name 1 index name 1                  
Apples       a             5  4  0  5  2  2
Oranges      b             5  8  2  5  9  9
Puppies      c             7  6  0  7  8  3
Ducks        d             6  5  0  1  6  0

print (df.index.name)
None

print (df.columns.name)
col name

print (df.index.names)
['index name 1', 'index name 1']

print (df.columns.names)
['col name']

df1 = df.rename_axis(('foo','bar'))
print (df1)
col name     A  B  C  D  E  F
foo     bar                  
Apples  a    5  4  0  5  2  2
Oranges b    5  8  2  5  9  9
Puppies c    7  6  0  7  8  3
Ducks   d    6  5  0  1  6  0

df2 = df.rename_axis('baz', axis=1)
print (df2)
baz                        A  B  C  D  E  F
index name 1 index name 1                  
Apples       a             5  4  0  5  2  2
Oranges      b             5  8  2  5  9  9
Puppies      c             7  6  0  7  8  3
Ducks        d             6  5  0  1  6  0

df2 = df.rename_axis(index=('foo','bar'), columns='baz')
print (df2)
baz          A  B  C  D  E  F
foo     bar                  
Apples  a    5  4  0  5  2  2
Oranges b    5  8  2  5  9  9
Puppies c    7  6  0  7  8  3
Ducks   d    6  5  0  1  6  0

Removing the index and columns names means setting them to None:

df2 = df.rename_axis(index=(None,None), columns=None)
print (df2)

           A  B  C  D  E  F
Apples  a  6  9  9  5  4  6
Oranges b  2  6  7  4  3  5
Puppies c  6  3  6  3  5  1
Ducks   d  4  9  1  3  0  5

For a MultiIndex in both the index and the columns, it is necessary to work with .names instead of .name, and to set the values with a list or tuple:

mux1 = pd.MultiIndex.from_arrays([['Apples', 'Oranges', 'Puppies', 'Ducks'],
                                  list('abcd')], 
                                  names=['index name 1','index name 1'])


mux2 = pd.MultiIndex.from_product([list('ABC'),
                                  list('XY')], 
                                  names=['col name 1','col name 2'])

df = pd.DataFrame(np.random.randint(10, size=(4,6)), index=mux1, columns=mux2)
print (df)
col name 1                 A     B     C   
col name 2                 X  Y  X  Y  X  Y
index name 1 index name 1                  
Apples       a             2  9  4  7  0  3
Oranges      b             9  0  6  0  9  4
Puppies      c             2  4  6  1  4  4
Ducks        d             6  6  7  1  2  8

The plural forms are necessary for checking/setting the values:

print (df.index.name)
None

print (df.columns.name)
None

print (df.index.names)
['index name 1', 'index name 1']

print (df.columns.names)
['col name 1', 'col name 2']

df1 = df.rename_axis(('foo','bar'))
print (df1)
col name 1   A     B     C   
col name 2   X  Y  X  Y  X  Y
foo     bar                  
Apples  a    2  9  4  7  0  3
Oranges b    9  0  6  0  9  4
Puppies c    2  4  6  1  4  4
Ducks   d    6  6  7  1  2  8

df2 = df.rename_axis(('baz','bak'), axis=1)
print (df2)
baz                        A     B     C   
bak                        X  Y  X  Y  X  Y
index name 1 index name 1                  
Apples       a             2  9  4  7  0  3
Oranges      b             9  0  6  0  9  4
Puppies      c             2  4  6  1  4  4
Ducks        d             6  6  7  1  2  8

df2 = df.rename_axis(index=('foo','bar'), columns=('baz','bak'))
print (df2)
baz          A     B     C   
bak          X  Y  X  Y  X  Y
foo     bar                  
Apples  a    2  9  4  7  0  3
Oranges b    9  0  6  0  9  4
Puppies c    2  4  6  1  4  4
Ducks   d    6  6  7  1  2  8

Removing the index and columns names means setting them to None:

df2 = df.rename_axis(index=(None,None), columns=(None,None))
print (df2)

           A     B     C   
           X  Y  X  Y  X  Y
Apples  a  2  0  2  5  2  0
Oranges b  1  7  5  5  4  8
Puppies c  2  4  6  3  6  5
Ducks   d  9  6  3  9  7  0

And @Jeff's solution:

df.index.names = ['foo','bar']
df.columns.names = ['baz','bak']
print (df)

baz          A     B     C   
bak          X  Y  X  Y  X  Y
foo     bar                  
Apples  a    3  4  7  3  3  3
Oranges b    1  2  5  8  1  0
Puppies c    9  6  3  9  6  3
Ducks   d    3  2  1  0  1  0

Answer 2

df.index.name should do the trick.

Python has a dir function that lets you query object attributes; dir(df.index) was helpful here.
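
For instance, a quick sketch of that kind of introspection (sample frame assumed):

import pandas as pd

df = pd.DataFrame({'Column 1': [1.0, 2.0]},
                  index=pd.Index(['Apples', 'Oranges'], name='Index Title'))

# dir lists all attributes; filtering narrows it to the name-related ones
print([attr for attr in dir(df.index) if 'name' in attr.lower()])
print(df.index.name)  # 'Index Title'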


Answer 3

Use df.index.rename('foo', inplace=True) to set the index name.

It seems this API has been available since pandas 0.13.


Answer 4

If you do not want to create a new row but simply want to put the name in the empty cell, then use:

df.columns.name = 'foo'

Otherwise use:

df.index.name = 'foo'
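
A minimal sketch contrasting the two placements (sample data assumed):

import pandas as pd

df = pd.DataFrame({'Column 1': [1, 2]}, index=['Apples', 'Oranges'])

df.columns.name = 'foo'   # 'foo' lands in the otherwise-empty cell above the index
print(df)

df.columns.name = None
df.index.name = 'foo'     # 'foo' now labels the index on its own header row
print(df)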

Answer 5

df.columns.values also gives us the column names.


Answer 6

The solution for multi-indexes is inside jezrael’s cyclopedic answer, but it took me a while to find it, so I am posting a new answer:

df.index.names gives the names of a multi-index (as a FrozenList).


Answer 7

To just get the index column names, df.index.names will work for both a single Index and a MultiIndex as of the most recent version of pandas.

As someone who found this while trying to find the best way to get a list of index names + column names, I would have found this answer useful:

names = list(filter(None, df.index.names + df.columns.values.tolist()))

This works for no index, a single-column Index, or a MultiIndex. It avoids calling reset_index(), which has an unnecessary performance hit for such a simple operation. I’m surprised there isn’t a built-in method for this (that I’ve come across). I guess I run into needing this more often because I’m shuttling data from databases where the dataframe index maps to a primary/unique key, but is really just another column to me.
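
A minimal sketch of that one-liner in action (sample frame assumed):

import pandas as pd

df = pd.DataFrame({'Column 1': [1.0, 2.0]},
                  index=pd.Index(['Apples', 'Oranges'], name='Index Title'))

# index.names is a FrozenList; '+' concatenates, and filter(None, ...) drops unnamed levels
names = list(filter(None, df.index.names + df.columns.values.tolist()))
print(names)  # ['Index Title', 'Column 1']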


Answer 8

Setting the index name can also be accomplished at creation:

pd.DataFrame(data={'age': [10,20,30], 'height': [100, 170, 175]}, index=pd.Series(['a', 'b', 'c'], name='Tag'))

Why doesn't my Pandas 'apply' function reference multiple columns? [closed]

Question: Why doesn't my Pandas 'apply' function reference multiple columns? [closed]

I have some problems with the Pandas apply function when using multiple columns with the following dataframe:

df = DataFrame ({'a' : np.random.randn(6),
                 'b' : ['foo', 'bar'] * 3,
                 'c' : np.random.randn(6)})

and the following function

def my_test(a, b):
    return a % b

When I try to apply this function with:

df['Value'] = df.apply(lambda row: my_test(row[a], row[c]), axis=1)

I get the error message:

NameError: ("global name 'a' is not defined", u'occurred at index 0')

I do not understand this message, I defined the name properly.

I would highly appreciate any help on this issue

Update

Thanks for your help. I did indeed make some syntax mistakes in the code; the index should be put in quotes (''). However, I still get the same issue using a more complex function such as:

def my_test(a):
    cum_diff = 0
    for ix in df.index():
        cum_diff = cum_diff + (a - df['a'][ix])
    return cum_diff 

Answer 0

Seems you forgot the '' of your string.

In [43]: df['Value'] = df.apply(lambda row: my_test(row['a'], row['c']), axis=1)

In [44]: df
Out[44]:
                    a    b         c     Value
          0 -1.674308  foo  0.343801  0.044698
          1 -2.163236  bar -2.046438 -0.116798
          2 -0.199115  foo -0.458050 -0.199115
          3  0.918646  bar -0.007185 -0.001006
          4  1.336830  foo  0.534292  0.268245
          5  0.976844  bar -0.773630 -0.570417

BTW, in my opinion, the following way is more elegant:

In [53]: def my_test2(row):
....:     return row['a'] % row['c']
....:     

In [54]: df['Value'] = df.apply(my_test2, axis=1)

Answer 1

If you just want to compute (column a) % (column b), you don’t need apply, just do it directly:

In [7]: df['a'] % df['c']                                                                                                                                                        
Out[7]: 
0   -1.132022                                                                                                                                                                    
1   -0.939493                                                                                                                                                                    
2    0.201931                                                                                                                                                                    
3    0.511374                                                                                                                                                                    
4   -0.694647                                                                                                                                                                    
5   -0.023486                                                                                                                                                                    
Name: a

Answer 2

Let’s say we want to apply a function add5 to columns ‘a’ and ‘b’ of DataFrame df

def add5(x):
    return x+5

df[['a', 'b']].apply(add5)
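
Note that apply here returns a new frame rather than modifying df in place; a minimal sketch (sample data assumed) that assigns the result back:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

def add5(x):
    return x + 5

# Assign back to keep the change; column 'c' is left untouched
df[['a', 'b']] = df[['a', 'b']].apply(add5)
print(df)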

Answer 3

All of the suggestions above work, but if you want your computations to be more efficient, you should take advantage of numpy vector operations (as pointed out here).

import pandas as pd
import numpy as np


df = pd.DataFrame ({'a' : np.random.randn(6),
             'b' : ['foo', 'bar'] * 3,
             'c' : np.random.randn(6)})

Example 1: looping with pandas.apply():

%%timeit
def my_test2(row):
    return row['a'] % row['c']

df['Value'] = df.apply(my_test2, axis=1)

The slowest run took 7.49 times longer than the fastest. This could mean that an intermediate result is being cached. 1000 loops, best of 3: 481 µs per loop

Example 2: vectorize using pandas Series operations:

%%timeit
df['a'] % df['c']

The slowest run took 458.85 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 70.9 µs per loop

Example 3: vectorize using numpy arrays:

%%timeit
df['a'].values % df['c'].values

The slowest run took 7.98 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 6.39 µs per loop

So vectorizing using numpy arrays improved the speed by almost two orders of magnitude.


Answer 4

This is the same as the previous solution, but I have defined the function inside df.apply itself:

df['Value'] = df.apply(lambda row: row['a']%row['c'], axis=1)

Answer 5

Here is a comparison of all three approaches discussed above.

Using values

%timeit df['value'] = df['a'].values % df['c'].values

139 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Without values

%timeit df['value'] = df['a']%df['c'] 

216 µs ± 1.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Apply function

%timeit df['Value'] = df.apply(lambda row: row['a']%row['c'], axis=1)

474 µs ± 5.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Pandas: get rows which are not in another dataframe

Question: Pandas: get rows which are not in another dataframe

I have two pandas data frames which have some rows in common.

Suppose dataframe2 is a subset of dataframe1.

How can I get the rows of dataframe1 which are not in dataframe2?

df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]}) 
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})

Answer 0

One method would be to store the result of an inner merge from both dfs; then we can simply select the rows where one column's values are not in this common set:

In [119]:

common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
   col1  col2
0     1    10
1     2    11
2     3    12
Out[119]:
   col1  col2
3     4    13
4     5    14

EDIT

Another method, as you’ve found, is to use isin, which will produce NaN rows that you can drop:

In [138]:

df1[~df1.isin(df2)].dropna()
Out[138]:
   col1  col2
3     4    13
4     5    14

However, if df2 does not contain the same rows in the same positions, then this won’t work:

df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})

will produce the entire df:

In [140]:

df1[~df1.isin(df2)].dropna()
Out[140]:
   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14

Answer 1

The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.

First, we need to modify the original DataFrame to add the row with data [3, 10].

df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 
                           'col2' : [10, 11, 12, 13, 14, 10]}) 
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
                           'col2' : [10, 11, 12]})

df1

   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14
5     3    10

df2

   col1  col2
0     1    10
1     2    11
2     3    12

Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly 1 row of df2. Use the parameter indicator to return an extra column indicating which table the row was from.

df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'], 
                   how='left', indicator=True)
df_all

   col1  col2     _merge
0     1    10       both
1     2    11       both
2     3    12       both
3     4    13  left_only
4     5    14  left_only
5     3    10  left_only

Create a boolean condition:

df_all['_merge'] == 'left_only'

0    False
1    False
2    False
3     True
4     True
5     True
Name: _merge, dtype: bool
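
To turn the boolean condition into the final answer, filter df_all with it and drop the helper column; a minimal sketch continuing from df_all above:

result = df_all[df_all['_merge'] == 'left_only'].drop('_merge', axis=1)
print(result)
#    col1  col2
# 3     4    13
# 4     5    14
# 5     3    10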

Why other solutions are wrong

A few solutions make the same mistake – they only check that each value is independently in each column, not together in the same row. Adding the last row, which is unique but has the values from both columns of df2, exposes the mistake:

common = df1.merge(df2,on=['col1','col2'])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

This solution gets the same wrong result:

df1.isin(df2.to_dict('l')).all(1)
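
To see the failure concretely (assuming the six-row df1 and the df2 defined earlier in this answer): each value is tested against its column's list independently, so row 5, whose values 3 and 10 both occur somewhere in df2 but never together, is wrongly flagged as present:

df1.isin(df2.to_dict('l')).all(1)
# 0     True
# 1     True
# 2     True
# 3    False
# 4    False
# 5     True   <- wrong: (3, 10) is not a row of df2
# dtype: bool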

回答 2

假设两个数据框中的索引是一致的(不考虑实际的列值):

df1[~df1.index.isin(df2.index)]

Assuming that the indexes are consistent in the dataframes (not taking into account the actual col values):

df1[~df1.index.isin(df2.index)]

回答 3

如前所述,isin 要求列和索引都相同才能匹配。如果只需按行内容匹配,获得过滤掩码的一种方法是将各行转换为 (Multi)Index:

In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Out[79]:
   col1  col2
1     2    11
4     5    14
5     3    10

如果需要考虑索引,set_index 提供关键字参数 append,可将列附加到现有索引之后。如果两个数据框的列没有对齐,可以用明确的列规范替换 list(df.columns) 来对齐数据。

pandas.MultiIndex.from_tuples(df<N>.to_records(index = False).tolist())

也可以用它来创建索引,不过我怀疑这样做并不会更高效。

As already hinted at, isin requires columns and indices to be the same for a match. If the match should only be on row contents, one way to get the mask for filtering the rows present is to convert the rows to a (Multi)Index:

In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Out[79]:
   col1  col2
1     2    11
4     5    14
5     3    10

If index should be taken into account, set_index has keyword argument append to append columns to existing index. If columns do not line up, list(df.columns) can be replaced with column specifications to align the data.

pandas.MultiIndex.from_tuples(df<N>.to_records(index = False).tolist())

could alternatively be used to create the indices, though I doubt this is more efficient.
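
For reference, a hedged sketch of that alternative, reusing the df1 and df2 from In [77] and In [78] (no claim about relative performance):

import pandas as pd

# Build a MultiIndex from each frame's row tuples
idx1 = pd.MultiIndex.from_tuples(df1.to_records(index=False).tolist())
idx2 = pd.MultiIndex.from_tuples(df2.to_records(index=False).tolist())

# Keep the rows of df1 whose (col1, col2) tuple does not appear in df2
df1.loc[~idx1.isin(idx2)]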


回答 4

假设您有两个包含多个字段(列名)的数据框 df_1 和 df_2,并且您想基于某些字段(例如 field_x、field_y)找出 df_1 中那些不在 df_2 中的条目,请执行以下步骤。

步骤1. 将列 key1 和 key2 分别添加到 df_1 和 df_2。

步骤2. 如下所示合并两个数据框。field_x 和 field_y 是我们需要的列。

步骤3. 仅从 df_1 中选择 key1 不等于 key2 的那些行。

步骤4. 删除 key1 和 key2。

这种方法将解决您的问题,即使使用大数据集也可以快速运行。我已经尝试将其用于具有超过1,000,000行的数据帧。

df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how = 'left')
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1','key2'], axis=1)

Suppose you have two dataframes, df_1 and df_2, having multiple fields (column names), and you want to find only those entries in df_1 that are not in df_2 on the basis of some fields (e.g. field_x, field_y). Follow these steps.

Step 1. Add a column key1 and key2 to df_1 and df_2 respectively.

Step 2. Merge the dataframes as shown below. field_x and field_y are our desired columns.

Step 3. Select only those rows from df_1 where key1 is not equal to key2.

Step 4. Drop key1 and key2.

This method will solve your problem and works fast even with big data sets. I have tried it for dataframes with more than 1,000,000 rows.

df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how = 'left')
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1','key2'], axis=1)
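
A runnable illustration of the four steps, with hypothetical frames using the field names field_x and field_y:

import pandas as pd

df_1 = pd.DataFrame({'field_x': [1, 2, 3, 4], 'field_y': [10, 11, 12, 13]})
df_2 = pd.DataFrame({'field_x': [1, 2], 'field_y': [10, 11]})

df_1['key1'] = 1
df_2['key2'] = 1
merged = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how='left')
# key2 is NaN for rows of df_1 that found no partner in df_2
result = merged[merged['key2'] != merged['key1']].drop(['key1', 'key2'], axis=1)
print(result)
#    field_x  field_y
# 2        3       12
# 3        4       13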

回答 5

有点晚了,但值得看一下 pd.merge 的 “indicator” 参数。

参见另一个问题中的示例:比较 PandaS DataFrames 并返回第一个中缺失的行

A bit late, but it might be worth checking the “indicator” parameter of pd.merge.

See this other question for an example: Compare PandaS DataFrames and return rows that are missing from the first one
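
A minimal sketch of the indicator approach (the same idea as in the answer above, with the example frames from this question):

import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})

merged = df1.merge(df2, on=['col1', 'col2'], how='left', indicator=True)
print(merged[merged['_merge'] == 'left_only'][['col1', 'col2']])
#    col1  col2
# 3     4    13
# 4     5    14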


回答 6

您可以使用isin(dict)方法进行操作:

In [74]: df1[~df1.isin(df2.to_dict('l')).all(1)]
Out[74]:
   col1  col2
3     4    13
4     5    14

说明:

In [75]: df2.to_dict('l')
Out[75]: {'col1': [1, 2, 3], 'col2': [10, 11, 12]}

In [76]: df1.isin(df2.to_dict('l'))
Out[76]:
    col1   col2
0   True   True
1   True   True
2   True   True
3  False  False
4  False  False

In [77]: df1.isin(df2.to_dict('l')).all(1)
Out[77]:
0     True
1     True
2     True
3    False
4    False
dtype: bool

you can do it using isin(dict) method:

In [74]: df1[~df1.isin(df2.to_dict('l')).all(1)]
Out[74]:
   col1  col2
3     4    13
4     5    14

Explanation:

In [75]: df2.to_dict('l')
Out[75]: {'col1': [1, 2, 3], 'col2': [10, 11, 12]}

In [76]: df1.isin(df2.to_dict('l'))
Out[76]:
    col1   col2
0   True   True
1   True   True
2   True   True
3  False  False
4  False  False

In [77]: df1.isin(df2.to_dict('l')).all(1)
Out[77]:
0     True
1     True
2     True
3    False
4    False
dtype: bool

回答 7

您也可以对 df1 和 df2 进行 concat:

x = pd.concat([df1, df2])

然后删除所有重复项:

y = x.drop_duplicates(keep=False, inplace=False)

You can also concat df1, df2:

x = pd.concat([df1, df2])

and then remove all duplicates:

y = x.drop_duplicates(keep=False, inplace=False)
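
Note that this assumes every row of df2 also appears in df1; a row unique to df2 would survive the drop as well and show up in the result. A quick sketch with the frames from this question:

import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})

x = pd.concat([df1, df2])
# keep=False removes every occurrence of a duplicated row,
# leaving only the rows that appear in exactly one frame
y = x.drop_duplicates(keep=False)
print(y)
#    col1  col2
# 3     4    13
# 4     5    14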

回答 8

这个怎么样:

df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 
                               'col2' : [10, 11, 12, 13, 14]}) 
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 
                               'col2' : [10, 11, 12]})
records_df2 = set([tuple(row) for row in df2.values])
in_df2_mask = np.array([tuple(row) in records_df2 for row in df1.values])
result = df1[~in_df2_mask]

How about this:

df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 
                               'col2' : [10, 11, 12, 13, 14]}) 
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 
                               'col2' : [10, 11, 12]})
records_df2 = set([tuple(row) for row in df2.values])
in_df2_mask = np.array([tuple(row) in records_df2 for row in df1.values])
result = df1[~in_df2_mask]

回答 9

这是解决此问题的另一种方法:

df1[~df1.index.isin(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]

要么:

df1.loc[df1.index.difference(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]

Here is another way of solving this:

df1[~df1.index.isin(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]

Or:

df1.loc[df1.index.difference(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]

回答 10

我的方法是添加一个只存在于其中一个数据框的新列,并用这一列来决定是否保留某个条目:

df_2['Empt'] = 1  # 该列仅存在于 df_2 中
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how='outer')
df_1['Empt'].fillna(0, inplace=True)

这使得 df_1 中的每个条目都带有一个标记:如果它仅存在于 df_1,则为 0;如果同时存在于两个数据框中,则为 1。然后您可以用它来筛选出所需的行:

answer = df_1[df_1['Empt'] == 0]

My way of doing this involves adding a new column that is unique to one dataframe and using this to choose whether to keep an entry:

df_2['Empt'] = 1  # a column that exists only in df_2
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how='outer')
df_1['Empt'].fillna(0, inplace=True)

This makes it so every entry in df_1 has a code: 0 if it is unique to df_1, 1 if it is in both dataframes. You then use this to restrict to what you want:

answer = df_1[df_1['Empt'] == 0]

回答 11

使用 merge 函数提取不相同的行(此处假设 same 是用于比较的第二个 DataFrame):

df = df.merge(same.drop_duplicates(), on=['col1','col2'], 
               how='left', indicator=True)

将不相同的行保存为 CSV:

df[df['_merge'] == 'left_only'].to_csv('output.csv')

Extract the dissimilar rows using the merge function (here same is assumed to be the second DataFrame being compared):

df = df.merge(same.drop_duplicates(), on=['col1','col2'], 
               how='left', indicator=True)

Save the dissimilar rows to a CSV:

df[df['_merge'] == 'left_only'].to_csv('output.csv')

将DataFrame列类型从字符串转换为日期时间,格式为 dd/mm/yyyy

问题:将DataFrame列类型从字符串转换为日期时间,格式为 dd/mm/yyyy

如何将字符串类型的 DataFrame 列(dd/mm/yyyy 格式)转换为日期时间?

How can I convert a DataFrame column of strings (in dd/mm/yyyy format) to datetimes?


回答 0

最简单的方法是使用to_datetime

df['col'] = pd.to_datetime(df['col'])

它还为欧洲的日期格式提供了 dayfirst 参数(但请注意,该参数并不严格)。

实际效果如下:

In [11]: pd.to_datetime(pd.Series(['05/23/2005']))
Out[11]:
0   2005-05-23 00:00:00
dtype: datetime64[ns]

您可以传递特定的格式:

In [12]: pd.to_datetime(pd.Series(['05/23/2005']), format="%m/%d/%Y")
Out[12]:
0   2005-05-23
dtype: datetime64[ns]

The easiest way is to use to_datetime:

df['col'] = pd.to_datetime(df['col'])

It also offers a dayfirst argument for European times (but beware this isn’t strict).

Here it is in action:

In [11]: pd.to_datetime(pd.Series(['05/23/2005']))
Out[11]:
0   2005-05-23 00:00:00
dtype: datetime64[ns]

You can pass a specific format:

In [12]: pd.to_datetime(pd.Series(['05/23/2005']), format="%m/%d/%Y")
Out[12]:
0   2005-05-23
dtype: datetime64[ns]
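
For the dd/mm/yyyy strings in the question, a short sketch (hypothetical values) using dayfirst or an explicit format:

import pandas as pd

s = pd.Series(['23/05/2005', '01/02/2006'])

# dayfirst is a hint, not strict: ambiguous dates may still parse month-first
pd.to_datetime(s, dayfirst=True)

# An explicit format is strict and unambiguous
pd.to_datetime(s, format='%d/%m/%Y')
# 0   2005-05-23
# 1   2006-02-01
# dtype: datetime64[ns]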

回答 1

如果您的日期列是格式为’2017-01-01’的字符串,则可以使用pandas astype将其转换为日期时间。

df['date'] = df['date'].astype('datetime64[ns]')

或者,如果您想要“天”精度而不是纳秒,可使用 datetime64[D]。

print(type(df_launath['date'].iloc[0]))

将输出

<class 'pandas._libs.tslib.Timestamp'>,与使用 pandas.to_datetime 时相同。

您可以尝试 '%Y-%m-%d' 之外的其他格式,但至少这种格式是可行的。

If your date column is a string of the format ‘2017-01-01’ you can use pandas astype to convert it to datetime.

df['date'] = df['date'].astype('datetime64[ns]')

or use datetime64[D] if you want Day precision and not nanoseconds

print(type(df_launath['date'].iloc[0]))

yields

<class 'pandas._libs.tslib.Timestamp'> the same as when you use pandas.to_datetime

You can try it with other formats than ‘%Y-%m-%d’, but at least this works.


回答 2

如果要指定复杂的格式,可以使用以下内容:

df['date_col'] =  pd.to_datetime(df['date_col'], format='%d/%m/%Y')

有关 format 的更多详细信息,请参见此处:

You can use the following if you want to specify tricky formats:

df['date_col'] =  pd.to_datetime(df['date_col'], format='%d/%m/%Y')

More details on format here:


回答 3

如果您的日期混合了多种格式,别忘了设置 infer_datetime_format=True,可以省事不少:

df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)

资料来源:pd.to_datetime

或者,如果您想要定制的方法:

from datetime import datetime

def autoconvert_datetime(value):
    formats = ['%m/%d/%Y', '%m-%d-%y']  # formats to try
    result_format = '%d-%m-%Y'  # output format
    for dt_format in formats:
        try:
            dt_obj = datetime.strptime(value, dt_format)
            return dt_obj.strftime(result_format)
        except Exception as e:  # throws exception when format doesn't match
            pass
    return value  # let it be if it doesn't match

df['date'] = df['date'].apply(autoconvert_datetime)

If you have a mixture of formats in your date, don’t forget to set infer_datetime_format=True to make life easier

df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)

Source: pd.to_datetime

or if you want a customized approach:

from datetime import datetime

def autoconvert_datetime(value):
    formats = ['%m/%d/%Y', '%m-%d-%y']  # formats to try
    result_format = '%d-%m-%Y'  # output format
    for dt_format in formats:
        try:
            dt_obj = datetime.strptime(value, dt_format)
            return dt_obj.strftime(result_format)
        except Exception as e:  # throws exception when format doesn't match
            pass
    return value  # let it be if it doesn't match

df['date'] = df['date'].apply(autoconvert_datetime)
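
A quick usage sketch of autoconvert_datetime with hypothetical mixed inputs:

import pandas as pd

df = pd.DataFrame({'date': ['05/23/2005', '12-31-19', 'not a date']})
df['date'] = df['date'].apply(autoconvert_datetime)
print(df)
#          date
# 0  23-05-2005
# 1  31-12-2019
# 2  not a date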

在熊猫中将两个系列组合到一个DataFrame中

问题:在熊猫中将两个系列组合到一个DataFrame中

我有两个 Series:s1 和 s2,它们具有相同的(非连续)索引。如何将 s1 和 s2 合并为一个 DataFrame 的两列,并将其中一个的索引保留为第三列?

I have two Series s1 and s2 with the same (non-consecutive) indices. How do I combine s1 and s2 to being two columns in a DataFrame and keep one of the indices as a third column?


回答 0

我认为 concat 是实现这一点的好方法。如果 Series 具有 name 属性,则会将其用作列名(否则只是简单地为列编号):

In [1]: s1 = pd.Series([1, 2], index=['A', 'B'], name='s1')

In [2]: s2 = pd.Series([3, 4], index=['A', 'B'], name='s2')

In [3]: pd.concat([s1, s2], axis=1)
Out[3]:
   s1  s2
A   1   3
B   2   4

In [4]: pd.concat([s1, s2], axis=1).reset_index()
Out[4]:
  index  s1  s2
0     A   1   3
1     B   2   4

注意:此方法可以扩展到 2 个以上的 Series。

I think concat is a nice way to do this. If they are present it uses the name attributes of the Series as the columns (otherwise it simply numbers them):

In [1]: s1 = pd.Series([1, 2], index=['A', 'B'], name='s1')

In [2]: s2 = pd.Series([3, 4], index=['A', 'B'], name='s2')

In [3]: pd.concat([s1, s2], axis=1)
Out[3]:
   s1  s2
A   1   3
B   2   4

In [4]: pd.concat([s1, s2], axis=1).reset_index()
Out[4]:
  index  s1  s2
0     A   1   3
1     B   2   4

Note: This extends to more than 2 Series.
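
For example, with a hypothetical third Series:

In [5]: s3 = pd.Series([5, 6], index=['A', 'B'], name='s3')

In [6]: pd.concat([s1, s2, s3], axis=1)
Out[6]:
   s1  s2  s3
A   1   3   5
B   2   4   6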


回答 1

如果两者的索引相同,为什么不直接使用 .to_frame 呢?

>= v0.23

a.to_frame().join(b)

< v0.23

a.to_frame().join(b.to_frame())

Why don’t you just use .to_frame if both have the same indexes?

>= v0.23

a.to_frame().join(b)

< v0.23

a.to_frame().join(b.to_frame())

回答 2

Pandas 会自动对齐这些传入的 Series 并创建联合索引,这里两者恰好相同。reset_index 会将索引移为一列。

In [2]: s1 = Series(randn(5),index=[1,2,4,5,6])

In [4]: s2 = Series(randn(5),index=[1,2,4,5,6])

In [8]: DataFrame(dict(s1 = s1, s2 = s2)).reset_index()
Out[8]: 
   index        s1        s2
0      1 -0.176143  0.128635
1      2 -1.286470  0.908497
2      4 -0.995881  0.528050
3      5  0.402241  0.458870
4      6  0.380457  0.072251

Pandas will automatically align these passed-in series and create the joint index. They happen to be the same here. reset_index moves the index to a column.

In [2]: s1 = Series(randn(5),index=[1,2,4,5,6])

In [4]: s2 = Series(randn(5),index=[1,2,4,5,6])

In [8]: DataFrame(dict(s1 = s1, s2 = s2)).reset_index()
Out[8]: 
   index        s1        s2
0      1 -0.176143  0.128635
1      2 -1.286470  0.908497
2      4 -0.995881  0.528050
3      5  0.402241  0.458870
4      6  0.380457  0.072251

回答 3

示例代码:

a = pd.Series([1,2,3,4], index=[7,2,8,9])
b = pd.Series([5,6,7,8], index=[7,2,8,9])
data = pd.DataFrame({'a': a,'b':b, 'idx_col':a.index})

Pandas 允许您从一个 dict 创建 DataFrame,其中以 Series 作为值、以列名作为键。当发现某个值是 Series 时,它会将该 Series 的索引用作 DataFrame 索引的一部分。这种数据对齐是 Pandas 的主要优势之一。因此,除非您有其他需求,新创建的 DataFrame 会包含重复的数据。在上述示例中,data['idx_col'] 与 data.index 拥有相同的数据。

Example code:

a = pd.Series([1,2,3,4], index=[7,2,8,9])
b = pd.Series([5,6,7,8], index=[7,2,8,9])
data = pd.DataFrame({'a': a,'b':b, 'idx_col':a.index})

Pandas allows you to create a DataFrame from a dict with Series as the values and the column names as the keys. When it finds a Series as a value, it uses the Series index as part of the DataFrame index. This data alignment is one of the main perks of Pandas. Consequently, unless you have other needs, the freshly created DataFrame has duplicated value. In the above example, data['idx_col'] has the same data as data.index.
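
A small sketch of that alignment behavior with deliberately misaligned, hypothetical indices:

import pandas as pd

a = pd.Series([1, 2], index=['x', 'y'])
b = pd.Series([3, 4], index=['y', 'z'])

# The DataFrame index is the union of the Series indices;
# positions missing from either Series are filled with NaN
print(pd.DataFrame({'a': a, 'b': b}))
#      a    b
# x  1.0  NaN
# y  2.0  3.0
# z  NaN  4.0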


回答 4

如果我可以回答这个问题。

将 Series 转换为 DataFrame 的基本原理在于理解:

1. 从概念上讲,数据框中的每一列都是一个 Series。

2. 而且,每个列名都是映射到一个 Series 的键名。

如果牢记以上两个概念,就可以想到许多将 Series 转换为 DataFrame 的方法。一个简单的解决方案如下:

在这里创建两个系列

import pandas as pd

series_1 = pd.Series(list(range(10)))

series_2 = pd.Series(list(range(20,30)))

使用所需的列名创建一个空的数据框

df = pd.DataFrame(columns = ['Column_name#1', 'Column_name#2'])

使用映射概念将序列值放入数据框内

df['Column_name#1'] = series_1

df['Column_name#2'] = series_2

立即检查结果

df.head(5)

If I may answer this.

The fundamentals behind converting a series to a data frame are to understand that

1. At a conceptual level, every column in a data frame is a series.

2. And, every column name is a key name that maps to a series.

If you keep the above two concepts in mind, you can think of many ways to convert a series to a data frame. One easy solution is like this:

Create two series here

import pandas as pd

series_1 = pd.Series(list(range(10)))

series_2 = pd.Series(list(range(20,30)))

Create an empty data frame with just desired column names

df = pd.DataFrame(columns = ['Column_name#1', 'Column_name#2'])

Put series value inside data frame using mapping concept

df['Column_name#1'] = series_1

df['Column_name#2'] = series_2

Check results now

df.head(5)

回答 5

不确定我是否完全理解您的问题,但这是您想做的吗?

pd.DataFrame(data=dict(s1=s1, s2=s2), index=s1.index)

(这里甚至不需要 index=s1.index)

Not sure I fully understand your question, but is this what you want to do?

pd.DataFrame(data=dict(s1=s1, s2=s2), index=s1.index)

(index=s1.index is not even necessary here)


回答 6

基于 join() 的解决方案的简化版:

df = a.to_frame().join(b)

A simplification of the solution based on join():

df = a.to_frame().join(b)

回答 7

我用 pandas 将 numpy 数组(或 Series)转换为 DataFrame,然后以 'prediction' 为键添加额外的一列。如果您需要把 DataFrame 转换回列表,请使用 values.tolist():

output = pd.DataFrame(X_test)
output['prediction'] = y_pred

output_list = output.values.tolist()

I used pandas to convert my numpy array or series to a dataframe, then added the additional column by key as ‘prediction’. If you need the dataframe converted back to a list, use values.tolist():

output = pd.DataFrame(X_test)
output['prediction'] = y_pred

output_list = output.values.tolist()