标签归档:dataframe

Python Pandas将列表插入单元格

问题:Python Pandas将列表插入单元格

我有一个列表“ abc”和一个数据框“ df”:

abc = ['foo', 'bar']
df =
    A  B
0  12  NaN
1  23  NaN

我想将列表插入单元格1B中,所以我想要这个结果:

    A  B
0  12  NaN
1  23  ['foo', 'bar']

我能做到吗?

1)如果我使用这个:

df.ix[1,'B'] = abc

我收到以下错误消息:

ValueError: Must have equal len keys and value when setting with an iterable

因为它尝试将列表(具有两个元素)插入行/列而不插入单元格。

2)如果我使用这个:

df.ix[1,'B'] = [abc]

然后插入一个只有一个元素的列表,即“ abc”列表( [['foo', 'bar']])。

3)如果我使用这个:

df.ix[1,'B'] = ', '.join(abc)

然后插入一个字符串:( foo, bar),但不插入列表。

4)如果我使用这个:

df.ix[1,'B'] = [', '.join(abc)]

然后插入一个列表,但只有一个元素(['foo, bar']),但没有两个我想要的元素(['foo', 'bar'])。

感谢帮助!


编辑

我的新数据框和旧列表:

abc = ['foo', 'bar']
df2 =
    A    B         C
0  12  NaN      'bla'
1  23  NaN  'bla bla'

另一个数据框:

df3 =
    A    B         C                    D
0  12  NaN      'bla'  ['item1', 'item2']
1  23  NaN  'bla bla'        [11, 12, 13]

我想将“ abc”列表插入df2.loc[1,'B']和/或df3.loc[1,'B']

如果数据框仅包含具有整数值和/或NaN值和/或列表值的列,则将列表插入到单元格中的效果很好。如果数据框仅包含具有字符串值和/或NaN值和/或列表值的列,则将列表插入到单元格中的效果很好。但是,如果数据框具有包含整数和字符串值的列以及其他列,那么如果我使用此错误消息,则会出现错误消息:df2.loc[1,'B'] = abcdf3.loc[1,'B'] = abc

另一个数据框:

df4 =
          A     B
0      'bla'  NaN
1  'bla bla'  NaN

这些插入可以完美地工作:df.loc[1,'B'] = abcdf4.loc[1,'B'] = abc

I have a list ‘abc’ and a dataframe ‘df’:

abc = ['foo', 'bar']
df =
    A  B
0  12  NaN
1  23  NaN

I want to insert the list into cell 1B, so I want this result:

    A  B
0  12  NaN
1  23  ['foo', 'bar']

Ho can I do that?

1) If I use this:

df.ix[1,'B'] = abc

I get the following error message:

ValueError: Must have equal len keys and value when setting with an iterable

because it tries to insert the list (that has two elements) into a row / column but not into a cell.

2) If I use this:

df.ix[1,'B'] = [abc]

then it inserts a list that has only one element that is the ‘abc’ list ( [['foo', 'bar']] ).

3) If I use this:

df.ix[1,'B'] = ', '.join(abc)

then it inserts a string: ( foo, bar ) but not a list.

4) If I use this:

df.ix[1,'B'] = [', '.join(abc)]

then it inserts a list but it has only one element ( ['foo, bar'] ) but not two as I want ( ['foo', 'bar'] ).

Thanks for help!


EDIT

My new dataframe and the old list:

abc = ['foo', 'bar']
df2 =
    A    B         C
0  12  NaN      'bla'
1  23  NaN  'bla bla'

Another dataframe:

df3 =
    A    B         C                    D
0  12  NaN      'bla'  ['item1', 'item2']
1  23  NaN  'bla bla'        [11, 12, 13]

I want insert the ‘abc’ list into df2.loc[1,'B'] and/or df3.loc[1,'B'].

If the dataframe has columns only with integer values and/or NaN values and/or list values then inserting a list into a cell works perfectly. If the dataframe has columns only with string values and/or NaN values and/or list values then inserting a list into a cell works perfectly. But if the dataframe has columns with integer and string values and other columns then the error message appears if I use this: df2.loc[1,'B'] = abc or df3.loc[1,'B'] = abc.

Another dataframe:

df4 =
          A     B
0      'bla'  NaN
1  'bla bla'  NaN

These inserts work perfectly: df.loc[1,'B'] = abc or df4.loc[1,'B'] = abc.


回答 0

由于自0.21.0版set_value开始不推荐使用,因此您现在应该使用at。它可以插入一个列表的小区没有抚养ValueErrorloc一样。我认为这是因为at 总是引用单个值,而loc可以引用值以及行和列。

df = pd.DataFrame(data={'A': [1, 2, 3], 'B': ['x', 'y', 'z']})

df.at[1, 'B'] = ['m', 'n']

df =
    A   B
0   1   x
1   2   [m, n]
2   3   z

您还需要确保要插入的具有dtype=object。例如

>>> df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [1,2,3]})
>>> df.dtypes
A    int64
B    int64
dtype: object

>>> df.at[1, 'B'] = [1, 2, 3]
ValueError: setting an array element with a sequence

>>> df['B'] = df['B'].astype('object')
>>> df.at[1, 'B'] = [1, 2, 3]
>>> df
   A          B
0  1          1
1  2  [1, 2, 3]
2  3          3

Since set_value has been deprecated since version 0.21.0, you should now use at. It can insert a list into a cell without raising a ValueError as loc does. I think this is because at always refers to a single value, while loc can refer to values as well as rows and columns.

df = pd.DataFrame(data={'A': [1, 2, 3], 'B': ['x', 'y', 'z']})

df.at[1, 'B'] = ['m', 'n']

df =
    A   B
0   1   x
1   2   [m, n]
2   3   z

You also need to make sure the column you are inserting into has dtype=object. For example

>>> df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [1,2,3]})
>>> df.dtypes
A    int64
B    int64
dtype: object

>>> df.at[1, 'B'] = [1, 2, 3]
ValueError: setting an array element with a sequence

>>> df['B'] = df['B'].astype('object')
>>> df.at[1, 'B'] = [1, 2, 3]
>>> df
   A          B
0  1          1
1  2  [1, 2, 3]
2  3          3

回答 1

df3.set_value(1, 'B', abc)适用于任何数据框。注意列“ B”的数据类型。例如。不能将列表插入浮点列,在这种情况下df['B'] = df['B'].astype(object)可以提供帮助。

df3.set_value(1, 'B', abc) works for any dataframe. Take care of the data type of column ‘B’. Eg. a list can not be inserted into a float column, at that case df['B'] = df['B'].astype(object) can help.


回答 2

熊猫> = 0.21

set_value已不推荐使用。 现在,您可以使用DataFrame.at按标签DataFrame.iat设置和按整数位置设置。

使用at/ 设置单元格值iat

# Setup
df = pd.DataFrame({'A': [12, 23], 'B': [['a', 'b'], ['c', 'd']]})
df

    A       B
0  12  [a, b]
1  23  [c, d]

df.dtypes

A     int64
B    object
dtype: object

如果要将“ B”第二行中的值设置为一些新列表,请使用DataFrane.at

df.at[1, 'B'] = ['m', 'n']
df

    A       B
0  12  [a, b]
1  23  [m, n]

您也可以使用 DataFrame.iat

df.iat[1, df.columns.get_loc('B')] = ['m', 'n']
df

    A       B
0  12  [a, b]
1  23  [m, n]

如果得到了ValueError: setting an array element with a sequence怎么办?

我将尝试通过以下方式重现该内容:

df

    A   B
0  12 NaN
1  23 NaN

df.dtypes

A      int64
B    float64
dtype: object

df.at[1, 'B'] = ['m', 'n']
# ValueError: setting an array element with a sequence.

这是因为您的对象是float64dtype,而列表是objects,所以那里不匹配。在这种情况下,您要做的是先将列转换为对象。

df['B'] = df['B'].astype(object)
df.dtypes

A     int64
B    object
dtype: object

然后,它起作用:

df.at[1, 'B'] = ['m', 'n']
df

    A       B
0  12     NaN
1  23  [m, n]

可能,但是哈基

更古怪的是,我发现DataFrame.loc如果传递嵌套列表,您可以破解以实现相似的目的。

df.loc[1, 'B'] = [['m'], ['n'], ['o'], ['p']]
df

    A             B
0  12        [a, b]
1  23  [m, n, o, p]

您可以在这里阅读更多有关其工作原理的信息。

Pandas >= 0.21

set_value has been deprecated. You can now use DataFrame.at to set by label, and DataFrame.iat to set by integer position.

Setting Cell Values with at/iat

# Setup
df = pd.DataFrame({'A': [12, 23], 'B': [['a', 'b'], ['c', 'd']]})
df

    A       B
0  12  [a, b]
1  23  [c, d]

df.dtypes

A     int64
B    object
dtype: object

If you want to set a value in second row of the “B” to some new list, use DataFrane.at:

df.at[1, 'B'] = ['m', 'n']
df

    A       B
0  12  [a, b]
1  23  [m, n]

You can also set by integer position using DataFrame.iat

df.iat[1, df.columns.get_loc('B')] = ['m', 'n']
df

    A       B
0  12  [a, b]
1  23  [m, n]

What if I get ValueError: setting an array element with a sequence?

I’ll try to reproduce this with:

df

    A   B
0  12 NaN
1  23 NaN

df.dtypes

A      int64
B    float64
dtype: object

df.at[1, 'B'] = ['m', 'n']
# ValueError: setting an array element with a sequence.

This is because of a your object is of float64 dtype, whereas lists are objects, so there’s a mismatch there. What you would have to do in this situation is to convert the column to object first.

df['B'] = df['B'].astype(object)
df.dtypes

A     int64
B    object
dtype: object

Then, it works:

df.at[1, 'B'] = ['m', 'n']
df

    A       B
0  12     NaN
1  23  [m, n]

Possible, But Hacky

Even more wacky, I’ve found you can hack through DataFrame.loc to achieve something similar if you pass nested lists.

df.loc[1, 'B'] = [['m'], ['n'], ['o'], ['p']]
df

    A             B
0  12        [a, b]
1  23  [m, n, o, p]

You can read more about why this works here.


回答 3

如本篇文章中提到的熊猫:如何在数据框中存储列表?; 数据帧中的dtype可能会影响结果,以及调用数据帧或不将其分配给它。

As mentionned in this post pandas: how to store a list in a dataframe?; the dtypes in the dataframe may influence the results, as well as calling a dataframe or not to be assigned to.


回答 4

快速解决

只需将列表括在新列表中,就像在下面的数据框中对col2所做的那样。它起作用的原因是python获取(列表的)外部列表,并将其转换为列,就好像它包含普通标量项目一样,在我们的例子中是列表,而不是普通标量。

mydict={'col1':[1,2,3],'col2':[[1, 4], [2, 5], [3, 6]]}
data=pd.DataFrame(mydict)
data


   col1     col2
0   1       [1, 4]
1   2       [2, 5]
2   3       [3, 6]

Quick work around

Simply enclose the list within a new list, as done for col2 in the data frame below. The reason it works is that python takes the outer list (of lists) and converts it into a column as if it were containing normal scalar items, which is lists in our case and not normal scalars.

mydict={'col1':[1,2,3],'col2':[[1, 4], [2, 5], [3, 6]]}
data=pd.DataFrame(mydict)
data


   col1     col2
0   1       [1, 4]
1   2       [2, 5]
2   3       [3, 6]

回答 5

也得到

ValueError: Must have equal len keys and value when setting with an iterable

在我的情况下,使用.at而不是.loc并没有任何区别,但是强制使用dataframe列的数据类型可以解决问题:

df['B'] = df['B'].astype(object)

然后,我可以将列表,numpy数组和所有类型的东西设置为数据帧中的单个单元格值。

Also getting

ValueError: Must have equal len keys and value when setting with an iterable,

using .at rather than .loc did not make any difference in my case, but enforcing the datatype of the dataframe column did the trick:

df['B'] = df['B'].astype(object)

Then I could set lists, numpy array and all sorts of things as single cell values in my dataframes.


熊猫:选择所有名称以X开头的列的最佳方法

问题:熊猫:选择所有名称以X开头的列的最佳方法

我有一个DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0],})

我想在以开头的列中选择1的值foo.。除了以下以外,还有更好的方法吗:

df2 = df[(df['foo.aa'] == 1)|
(df['foo.fighters'] == 1)|
(df['foo.bars'] == 1)|
(df['foo.fox'] == 1)|
(df['foo.manchu'] == 1)
]

类似于写类似的东西:

df2= df[df.STARTS_WITH_FOO == 1]

答案应打印出如下所示的DataFrame:

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

[4 rows x 7 columns]

I have a DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0],})

I want to select values of 1 in columns starting with foo.. Is there a better way to do it other than:

df2 = df[(df['foo.aa'] == 1)|
(df['foo.fighters'] == 1)|
(df['foo.bars'] == 1)|
(df['foo.fox'] == 1)|
(df['foo.manchu'] == 1)
]

Something similar to writing something like:

df2= df[df.STARTS_WITH_FOO == 1]

The answer should print out a DataFrame like this:

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

[4 rows x 7 columns]

回答 0

只需执行列表推导即可创建您的列:

In [28]:

filter_col = [col for col in df if col.startswith('foo')]
filter_col
Out[28]:
['foo.aa', 'foo.bars', 'foo.fighters', 'foo.fox', 'foo.manchu']
In [29]:

df[filter_col]
Out[29]:
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

另一种方法是从列创建序列,并使用向量化str方法startswith

In [33]:

df[df.columns[pd.Series(df.columns).str.startswith('foo')]]
Out[33]:
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

为了实现您想要的目标,您需要添加以下内容以过滤不符合您的==1条件的值:

In [36]:

df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]]==1]
Out[36]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      NaN       1       NaN           NaN      NaN        NaN     NaN
1      NaN     NaN       NaN             1      NaN        NaN     NaN
2      NaN     NaN       NaN           NaN        1        NaN     NaN
3      NaN     NaN       NaN           NaN      NaN        NaN     NaN
4      NaN     NaN       NaN           NaN      NaN        NaN     NaN
5      NaN     NaN         1           NaN      NaN        NaN     NaN

编辑

看到您想要复杂的答案后,确定为:

In [72]:

df.loc[df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]] == 1].dropna(how='all', axis=0).index]
Out[72]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

Just perform a list comprehension to create your columns:

In [28]:

filter_col = [col for col in df if col.startswith('foo')]
filter_col
Out[28]:
['foo.aa', 'foo.bars', 'foo.fighters', 'foo.fox', 'foo.manchu']
In [29]:

df[filter_col]
Out[29]:
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

Another method is to create a series from the columns and use the vectorised str method startswith:

In [33]:

df[df.columns[pd.Series(df.columns).str.startswith('foo')]]
Out[33]:
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

In order to achieve what you want you need to add the following to filter the values that don’t meet your ==1 criteria:

In [36]:

df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]]==1]
Out[36]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      NaN       1       NaN           NaN      NaN        NaN     NaN
1      NaN     NaN       NaN             1      NaN        NaN     NaN
2      NaN     NaN       NaN           NaN        1        NaN     NaN
3      NaN     NaN       NaN           NaN      NaN        NaN     NaN
4      NaN     NaN       NaN           NaN      NaN        NaN     NaN
5      NaN     NaN         1           NaN      NaN        NaN     NaN

EDIT

OK after seeing what you want the convoluted answer is this:

In [72]:

df.loc[df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]] == 1].dropna(how='all', axis=0).index]
Out[72]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

回答 1

既然熊猫的索引支持字符串操作,那么可以说选择以’foo’开头的列的最简单最好的方法就是:

df.loc[:, df.columns.str.startswith('foo')]

或者,您可以使用过滤列(或行)标签df.filter()。要指定正则表达式以匹配以开头的名称foo.

>>> df.filter(regex=r'^foo\.', axis=1)
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

要仅选择所需的行(包含1)和列,可以使用loc,使用filter(或任何其他方法)选择列,使用any

>>> df.loc[(df == 1).any(axis=1), df.filter(regex=r'^foo\.', axis=1).columns]
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0

Now that pandas’ indexes support string operations, arguably the simplest and best way to select columns beginning with ‘foo’ is just:

df.loc[:, df.columns.str.startswith('foo')]

Alternatively, you can filter column (or row) labels with df.filter(). To specify a regular expression to match the names beginning with foo.:

>>> df.filter(regex=r'^foo\.', axis=1)
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

To select only the required rows (containing a 1) and the columns, you can use loc, selecting the columns using filter (or any other method) and the rows using any:

>>> df.loc[(df == 1).any(axis=1), df.filter(regex=r'^foo\.', axis=1).columns]
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0

回答 2

最简单的方法是直接在列名上使用str,不需要 pd.Series

df.loc[:,df.columns.str.startswith("foo")]

The simplest way is to use str directly on column names, there is no need for pd.Series

df.loc[:,df.columns.str.startswith("foo")]



回答 3

根据@EdChum的答案,您可以尝试以下解决方案:

df[df.columns[pd.Series(df.columns).str.contains("foo")]]

万一并非您要选择的所有列都以开头,这将非常有用foo。此方法选择包含子字符串的所有列,foo并且可以将其放置在列名称的任何位置。

本质上,我替换.startswith().contains()

Based on @EdChum’s answer, you can try the following solution:

df[df.columns[pd.Series(df.columns).str.contains("foo")]]

This will be really helpful in case not all the columns you want to select start with foo. This method selects all the columns that contain the substring foo and it could be placed in at any point of a column’s name.

In essence, I replaced .startswith() with .contains().


回答 4

我的解决方案。性能可能会变慢:

a = pd.concat(df[df[c] == 1] for c in df.columns if c.startswith('foo'))
a.sort_index()


   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

My solution. It may be slower on performance:

a = pd.concat(df[df[c] == 1] for c in df.columns if c.startswith('foo'))
a.sort_index()


   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

回答 5

选择所需条目的另一种方法是使用map

df.loc[(df == 1).any(axis=1), df.columns.map(lambda x: x.startswith('foo'))]

这将为您提供包含的行的所有列1

   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0

行选择是通过做

(df == 1).any(axis=1)

如@ajcr的答案,它为您提供:

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

表示该行34不包含1和不会被选中。

选择是使用布尔索引完成的,如下所示:

df.columns.map(lambda x: x.startswith('foo'))

在上面的示例中,此返回

array([False,  True,  True,  True,  True,  True, False], dtype=bool)

因此,如果某列不是以开头fooFalse则返回该列,因此未选择该列。

如果您只想返回包含1-的所有行(如您期望的输出所示),则只需执行

df.loc[(df == 1).any(axis=1)]

哪个返回

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

Another option for the selection of the desired entries is to use map:

df.loc[(df == 1).any(axis=1), df.columns.map(lambda x: x.startswith('foo'))]

which gives you all the columns for rows that contain a 1:

   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0

The row selection is done by

(df == 1).any(axis=1)

as in @ajcr’s answer which gives you:

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

meaning that row 3 and 4 do not contain a 1 and won’t be selected.

The selection of the columns is done using Boolean indexing like this:

df.columns.map(lambda x: x.startswith('foo'))

In the example above this returns

array([False,  True,  True,  True,  True,  True, False], dtype=bool)

So, if a column does not start with foo, False is returned and the column is therefore not selected.

If you just want to return all rows that contain a 1 – as your desired output suggests – you can simply do

df.loc[(df == 1).any(axis=1)]

which returns

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

回答 6

您可以在此处尝试使用正则表达式来过滤以“ foo”开头的列

df.filter(regex='^foo*')

如果您需要在列中包含字符串foo,则

df.filter(regex='foo*')

将是适当的。

下一步,您可以使用

df[df.filter(regex='^foo*').values==1]

过滤掉“ foo *”列的值之一为1的行。

You can try the regex here to filter out the columns starting with “foo”

df.filter(regex='^foo*')

If you need to have the string foo in your column then

df.filter(regex='foo*')

would be appropriate.

For the next step, you can use

df[df.filter(regex='^foo*').values==1]

to filter out the rows where one of the values of ‘foo*’ column is 1.


回答 7

就我而言,我需要一个前缀列表

colsToScale=["production", "test", "development"]
dc[dc.columns[dc.columns.str.startswith(tuple(colsToScale))]]

In my case I needed a list of prefixes

colsToScale=["production", "test", "development"]
dc[dc.columns[dc.columns.str.startswith(tuple(colsToScale))]]

如何从Pandas DataFrame获取值而不是索引和对象类型

问题:如何从Pandas DataFrame获取值而不是索引和对象类型

说我有以下DataFrame

字母编号
A 1
B 2
C 3
4天

可以通过以下代码获得

import pandas as pd

letters=pd.Series(('A', 'B', 'C', 'D'))
numbers=pd.Series((1, 2, 3, 4))
keys=('Letters', 'Numbers')
df=pd.concat((letters, numbers), axis=1, keys=keys)

现在,我想从“字母”列中获取值C。

命令行

df[df.Letters=='C'].Letters

将返回

2℃
名称:字母,dtype:对象

我怎样才能只获得C值而不是整个两行输出?

Say I have the following DataFrame

Letter    Number
A          1
B          2
C          3
D          4

Which can be obtained through the following code

import pandas as pd

letters=pd.Series(('A', 'B', 'C', 'D'))
numbers=pd.Series((1, 2, 3, 4))
keys=('Letters', 'Numbers')
df=pd.concat((letters, numbers), axis=1, keys=keys)

Now I want to get the value C from the column Letters.

The command line

df[df.Letters=='C'].Letters

will return

2    C
Name: Letters, dtype: object

How can I get only the value C and not the whole two line output?


回答 0

df[df.Letters=='C'].Letters.item()

这将返回从该选择返回的索引/系列中的第一个元素。在这种情况下,该值始终是第一个元素。

编辑:

或者,您可以运行loc()并以这种方式访问​​第一个元素。这比较短,这是我过去实现它的方式。

df[df.Letters=='C'].Letters.item()

This returns the first element in the Index/Series returned from that selection. In this case, the value is always the first element.

EDIT:

Or you can run a loc() and access the first element that way. This was shorter and is the way I have implemented it in the past.


回答 1

使用values属性将值作为np数组返回,然后使用[0]获取第一个值:

In [4]:
df.loc[df.Letters=='C','Letters'].values[0]

Out[4]:
'C'

编辑

我个人更喜欢使用下标运算符访问列:

df.loc[df['Letters'] == 'C', 'Letters'].values[0]

这样可以避免列名中可以包含空格或破折号的问题-,这意味着使用进行访问.

Use the values attribute to return the values as a np array and then use [0] to get the first value:

In [4]:
df.loc[df.Letters=='C','Letters'].values[0]

Out[4]:
'C'

EDIT

I personally prefer to access the columns using subscript operators:

df.loc[df['Letters'] == 'C', 'Letters'].values[0]

This avoids issues where the column names can have spaces or dashes - which mean that accessing using ..


回答 2

import pandas as pd

dataset = pd.read_csv("data.csv")
values = list(x for x in dataset["column name"])

>>> values[0]
'item_0'

编辑:

实际上,您可以像对任何旧数组一样索引数据集。

import pandas as pd

dataset = pd.read_csv("data.csv")
first_value = dataset["column name"][0]

>>> print(first_value)
'item_0'
import pandas as pd

dataset = pd.read_csv("data.csv")
values = list(x for x in dataset["column name"])

>>> values[0]
'item_0'

edit:

actually, you can just index the dataset like any old array.

import pandas as pd

dataset = pd.read_csv("data.csv")
first_value = dataset["column name"][0]

>>> print(first_value)
'item_0'

如何使用iPython中的pandas库读取.xlsx文件?

问题:如何使用iPython中的pandas库读取.xlsx文件?

我想使用python的Pandas库读取.xlsx文件,并将数据移植到postgreSQL表中。

到目前为止,我所能做的就是:

import pandas as pd
data = pd.ExcelFile("*File Name*")

现在,我知道该步骤已成功执行,但是我想知道如何解析已读取的excel文件,以便可以了解excel中的数据如何映射到变量数据中的数据。
我了解到,如果我没有记错的话,数据就是一个Dataframe对象。因此,我如何解析此dataframe对象以逐行提取每一行。

I want to read a .xlsx file using the Pandas Library of python and port the data to a postgreSQL table.

All I could do up until now is:

import pandas as pd
data = pd.ExcelFile("*File Name*")

Now I know that the step got executed successfully, but I want to know how i can parse the excel file that has been read so that I can understand how the data in the excel maps to the data in the variable data.
I learnt that data is a Dataframe object if I’m not wrong. So How do i parse this dataframe object to extract each line row by row.


回答 0

我通常会DataFrame为每个工作表创建一个包含的字典:

xl_file = pd.ExcelFile(file_name)

dfs = {sheet_name: xl_file.parse(sheet_name) 
          for sheet_name in xl_file.sheet_names}

更新:在pandas 0.21.0+版本中,您可以通过传递sheet_name=Noneread_excel

dfs = pd.read_excel(file_name, sheet_name=None)

在0.20及sheetname更低版本中,它是而不是sheet_name(现在已弃用,而转而支持上述内容):

dfs = pd.read_excel(file_name, sheetname=None)

I usually create a dictionary containing a DataFrame for every sheet:

xl_file = pd.ExcelFile(file_name)

dfs = {sheet_name: xl_file.parse(sheet_name) 
          for sheet_name in xl_file.sheet_names}

Update: In pandas version 0.21.0+ you will get this behavior more cleanly by passing sheet_name=None to read_excel:

dfs = pd.read_excel(file_name, sheet_name=None)

In 0.20 and prior, this was sheetname rather than sheet_name (this is now deprecated in favor of the above):

dfs = pd.read_excel(file_name, sheetname=None)

回答 1

from pandas import read_excel
# find your sheet name at the bottom left of your excel file and assign 
# it to my_sheet 
my_sheet = 'Sheet1' # change it to your sheet name
file_name = 'products_and_categories.xlsx' # change it to the name of your excel file
df = read_excel(file_name, sheet_name = my_sheet)
print(df.head()) # shows headers with top 5 rows
from pandas import read_excel
# find your sheet name at the bottom left of your excel file and assign 
# it to my_sheet 
my_sheet = 'Sheet1' # change it to your sheet name
file_name = 'products_and_categories.xlsx' # change it to the name of your excel file
df = read_excel(file_name, sheet_name = my_sheet)
print(df.head()) # shows headers with top 5 rows

回答 2

DataFrame的read_excel方法类似于read_csv方法:

dfs = pd.read_excel(xlsx_file, sheetname="sheet1")


Help on function read_excel in module pandas.io.excel:

read_excel(io, sheetname=0, header=0, skiprows=None, skip_footer=0, index_col=None, names=None, parse_cols=None, parse_dates=False, date_parser=None, na_values=None, thousands=None, convert_float=True, has_index_names=None, converters=None, true_values=None, false_values=None, engine=None, squeeze=False, **kwds)
    Read an Excel table into a pandas DataFrame

    Parameters
    ----------
    io : string, path object (pathlib.Path or py._path.local.LocalPath),
        file-like object, pandas ExcelFile, or xlrd workbook.
        The string could be a URL. Valid URL schemes include http, ftp, s3,
        and file. For file URLs, a host is expected. For instance, a local
        file could be file://localhost/path/to/workbook.xlsx
    sheetname : string, int, mixed list of strings/ints, or None, default 0

        Strings are used for sheet names, Integers are used in zero-indexed
        sheet positions.

        Lists of strings/integers are used to request multiple sheets.

        Specify None to get all sheets.

        str|int -> DataFrame is returned.
        list|None -> Dict of DataFrames is returned, with keys representing
        sheets.

        Available Cases

        * Defaults to 0 -> 1st sheet as a DataFrame
        * 1 -> 2nd sheet as a DataFrame
        * "Sheet1" -> 1st sheet as a DataFrame
        * [0,1,"Sheet5"] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames
        * None -> All sheets as a dictionary of DataFrames

    header : int, list of ints, default 0
        Row (0-indexed) to use for the column labels of the parsed
        DataFrame. If a list of integers is passed those row positions will
        be combined into a ``MultiIndex``
    skiprows : list-like
        Rows to skip at the beginning (0-indexed)
    skip_footer : int, default 0
        Rows at the end to skip (0-indexed)
    index_col : int, list of ints, default None
        Column (0-indexed) to use as the row labels of the DataFrame.
        Pass None if there is no such column.  If a list is passed,
        those columns will be combined into a ``MultiIndex``
    names : array-like, default None
        List of column names to use. If file contains no header row,
        then you should explicitly pass header=None
    converters : dict, default None
        Dict of functions for converting values in certain columns. Keys can
        either be integers or column labels, values are functions that take one
        input argument, the Excel cell content, and return the transformed
        content.
    true_values : list, default None
        Values to consider as True

        .. versionadded:: 0.19.0

    false_values : list, default None
        Values to consider as False

        .. versionadded:: 0.19.0

    parse_cols : int or list, default None
        * If None then parse all columns,
        * If int then indicates last column to be parsed
        * If list of ints then indicates list of column numbers to be parsed
        * If string then indicates comma separated list of column names and
          column ranges (e.g. "A:E" or "A,C,E:F")
    squeeze : boolean, default False
        If the parsed data only contains one column then return a Series
    na_values : scalar, str, list-like, or dict, default None
        Additional strings to recognize as NA/NaN. If dict passed, specific
        per-column NA values. By default the following values are interpreted
        as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
    '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'.
    thousands : str, default None
        Thousands separator for parsing string columns to numeric.  Note that
        this parameter is only necessary for columns stored as TEXT in Excel,
        any numeric columns will automatically be parsed, regardless of display
        format.
    keep_default_na : bool, default True
        If na_values are specified and keep_default_na is False the default NaN
        values are overridden, otherwise they're appended to.
    verbose : boolean, default False
        Indicate number of NA values placed in non-numeric columns
    engine: string, default None
        If io is not a buffer or path, this must be set to identify io.
        Acceptable values are None or xlrd
    convert_float : boolean, default True
        convert integral floats to int (i.e., 1.0 --> 1). If False, all numeric
        data will be read in as floats: Excel stores all numbers as floats
        internally
    has_index_names : boolean, default None
        DEPRECATED: for version 0.17+ index names will be automatically
        inferred based on index_col.  To read Excel output from 0.16.2 and
        prior that had saved index names, use True.

    Returns
    -------
    parsed : DataFrame or Dict of DataFrames
        DataFrame from the passed in Excel file.  See notes in sheetname
        argument for more information on when a Dict of Dataframes is returned.

DataFrame’s read_excel method is like read_csv method:

dfs = pd.read_excel(xlsx_file, sheetname="sheet1")


Help on function read_excel in module pandas.io.excel:

read_excel(io, sheetname=0, header=0, skiprows=None, skip_footer=0, index_col=None, names=None, parse_cols=None, parse_dates=False, date_parser=None, na_values=None, thousands=None, convert_float=True, has_index_names=None, converters=None, true_values=None, false_values=None, engine=None, squeeze=False, **kwds)
    Read an Excel table into a pandas DataFrame

    Parameters
    ----------
    io : string, path object (pathlib.Path or py._path.local.LocalPath),
        file-like object, pandas ExcelFile, or xlrd workbook.
        The string could be a URL. Valid URL schemes include http, ftp, s3,
        and file. For file URLs, a host is expected. For instance, a local
        file could be file://localhost/path/to/workbook.xlsx
    sheetname : string, int, mixed list of strings/ints, or None, default 0

        Strings are used for sheet names, Integers are used in zero-indexed
        sheet positions.

        Lists of strings/integers are used to request multiple sheets.

        Specify None to get all sheets.

        str|int -> DataFrame is returned.
        list|None -> Dict of DataFrames is returned, with keys representing
        sheets.

        Available Cases

        * Defaults to 0 -> 1st sheet as a DataFrame
        * 1 -> 2nd sheet as a DataFrame
        * "Sheet1" -> 1st sheet as a DataFrame
        * [0,1,"Sheet5"] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames
        * None -> All sheets as a dictionary of DataFrames

    header : int, list of ints, default 0
        Row (0-indexed) to use for the column labels of the parsed
        DataFrame. If a list of integers is passed those row positions will
        be combined into a ``MultiIndex``
    skiprows : list-like
        Rows to skip at the beginning (0-indexed)
    skip_footer : int, default 0
        Rows at the end to skip (0-indexed)
    index_col : int, list of ints, default None
        Column (0-indexed) to use as the row labels of the DataFrame.
        Pass None if there is no such column.  If a list is passed,
        those columns will be combined into a ``MultiIndex``
    names : array-like, default None
        List of column names to use. If file contains no header row,
        then you should explicitly pass header=None
    converters : dict, default None
        Dict of functions for converting values in certain columns. Keys can
        either be integers or column labels, values are functions that take one
        input argument, the Excel cell content, and return the transformed
        content.
    true_values : list, default None
        Values to consider as True

        .. versionadded:: 0.19.0

    false_values : list, default None
        Values to consider as False

        .. versionadded:: 0.19.0

    parse_cols : int or list, default None
        * If None then parse all columns,
        * If int then indicates last column to be parsed
        * If list of ints then indicates list of column numbers to be parsed
        * If string then indicates comma separated list of column names and
          column ranges (e.g. "A:E" or "A,C,E:F")
    squeeze : boolean, default False
        If the parsed data only contains one column then return a Series
    na_values : scalar, str, list-like, or dict, default None
        Additional strings to recognize as NA/NaN. If dict passed, specific
        per-column NA values. By default the following values are interpreted
        as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
    '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'.
    thousands : str, default None
        Thousands separator for parsing string columns to numeric.  Note that
        this parameter is only necessary for columns stored as TEXT in Excel,
        any numeric columns will automatically be parsed, regardless of display
        format.
    keep_default_na : bool, default True
        If na_values are specified and keep_default_na is False the default NaN
        values are overridden, otherwise they're appended to.
    verbose : boolean, default False
        Indicate number of NA values placed in non-numeric columns
    engine: string, default None
        If io is not a buffer or path, this must be set to identify io.
        Acceptable values are None or xlrd
    convert_float : boolean, default True
        convert integral floats to int (i.e., 1.0 --> 1). If False, all numeric
        data will be read in as floats: Excel stores all numbers as floats
        internally
    has_index_names : boolean, default None
        DEPRECATED: for version 0.17+ index names will be automatically
        inferred based on index_col.  To read Excel output from 0.16.2 and
        prior that had saved index names, use True.

    Returns
    -------
    parsed : DataFrame or Dict of DataFrames
        DataFrame from the passed in Excel file.  See notes in sheetname
        argument for more information on when a Dict of Dataframes is returned.

回答 3

如果您不知道或无法打开excel文件以签入ubuntu(在我的情况下为Python 3.6.7,ubuntu 18.04),则可以使用参数index_col(index_col = 0)来代替工作表名称。第一张)

import pandas as pd
file_name = 'some_data_file.xlsx' 
df = pd.read_excel(file_name, index_col=0)
print(df.head()) # print the first 5 rows

Instead of using a sheet name, in case you don’t know or can’t open the excel file to check in ubuntu (in my case, Python 3.6.7, ubuntu 18.04), I use the parameter index_col (index_col=0 for the first sheet)

import pandas as pd
file_name = 'some_data_file.xlsx' 
df = pd.read_excel(file_name, index_col=0)
print(df.head()) # print the first 5 rows

回答 4

将电子表格文件名分配给 file

加载电子表格

打印工作表名称

通过名称将表加载到DataFrame中:df1

file = 'example.xlsx'
xl = pd.ExcelFile(file)
print(xl.sheet_names)
df1 = xl.parse('Sheet1')

Assign spreadsheet filename to file

Load spreadsheet

Print the sheet names

Load a sheet into a DataFrame by name: df1

file = 'example.xlsx'
xl = pd.ExcelFile(file)
print(xl.sheet_names)
df1 = xl.parse('Sheet1')

回答 5

如果在使用read_excel()函数打开的文件上使用open(),请确保将其添加rb到打开函数中,以避免编码错误

If you use read_excel() on a file opened using the function open(), make sure to add rb to the open function to avoid encoding errors


创建零填充的熊猫数据框

问题:创建零填充的熊猫数据框

创建给定大小的零填充熊猫数据框的最佳方法是什么?

我用过了:

zero_data = np.zeros(shape=(len(data),len(feature_list)))
d = pd.DataFrame(zero_data, columns=feature_list)

有更好的方法吗?

What is the best way to create a zero-filled pandas data frame of a given size?

I have used:

zero_data = np.zeros(shape=(len(data),len(feature_list)))
d = pd.DataFrame(zero_data, columns=feature_list)

Is there a better way to do it?


回答 0

您可以尝试以下方法:

d = pd.DataFrame(0, index=np.arange(len(data)), columns=feature_list)

You can try this:

d = pd.DataFrame(0, index=np.arange(len(data)), columns=feature_list)

回答 1

我认为最好用numpy做到这一点

import numpy as np
import pandas as pd
d = pd.DataFrame(np.zeros((N_rows, N_cols)))

It’s best to do this with numpy in my opinion

import numpy as np
import pandas as pd
d = pd.DataFrame(np.zeros((N_rows, N_cols)))

回答 2

类似于@Shravan,但不使用numpy:

  height = 10
  width = 20
  df_0 = pd.DataFrame(0, index=range(height), columns=range(width))

然后,您可以使用它做任何您想做的事情:

post_instantiation_fcn = lambda x: str(x)
df_ready_for_whatever = df_0.applymap(post_instantiation_fcn)

Similar to @Shravan, but without the use of numpy:

  height = 10
  width = 20
  df_0 = pd.DataFrame(0, index=range(height), columns=range(width))

Then you can do whatever you want with it:

post_instantiation_fcn = lambda x: str(x)
df_ready_for_whatever = df_0.applymap(post_instantiation_fcn)

回答 3

如果您希望新数据框具有与现有数据框相同的索引和列,则可以将现有数据框乘以零:

df_zeros = df * 0

If you would like the new data frame to have the same index and columns as an existing data frame, you can just multiply the existing data frame by zero:

df_zeros = df * 0

回答 4

如果您已经有一个数据框,这是最快的方法:

In [1]: columns = ["col{}".format(i) for i in range(10)]
In [2]: orig_df = pd.DataFrame(np.ones((10, 10)), columns=columns)
In [3]: %timeit d = pd.DataFrame(np.zeros_like(orig_df), index=orig_df.index, columns=orig_df.columns)
10000 loops, best of 3: 60.2 µs per loop

相比于:

In [4]: %timeit d = pd.DataFrame(0, index = np.arange(10), columns=columns)
10000 loops, best of 3: 110 µs per loop

In [5]: temp = np.zeros((10, 10))
In [6]: %timeit d = pd.DataFrame(temp, columns=columns)
10000 loops, best of 3: 95.7 µs per loop

If you already have a dataframe, this is the fastest way:

In [1]: columns = ["col{}".format(i) for i in range(10)]
In [2]: orig_df = pd.DataFrame(np.ones((10, 10)), columns=columns)
In [3]: %timeit d = pd.DataFrame(np.zeros_like(orig_df), index=orig_df.index, columns=orig_df.columns)
10000 loops, best of 3: 60.2 µs per loop

Compare to:

In [4]: %timeit d = pd.DataFrame(0, index = np.arange(10), columns=columns)
10000 loops, best of 3: 110 µs per loop

In [5]: temp = np.zeros((10, 10))
In [6]: %timeit d = pd.DataFrame(temp, columns=columns)
10000 loops, best of 3: 95.7 µs per loop

回答 5

假设有一个模板DataFrame,要在此处填充零值进行复制…

如果您的数据集中没有NaN,那么乘以零可能会更快:

In [19]: columns = ["col{}".format(i) for i in xrange(3000)]                                                                                       

In [20]: indices = xrange(2000)

In [21]: orig_df = pd.DataFrame(42.0, index=indices, columns=columns)

In [22]: %timeit d = pd.DataFrame(np.zeros_like(orig_df), index=orig_df.index, columns=orig_df.columns)
100 loops, best of 3: 12.6 ms per loop

In [23]: %timeit d = orig_df * 0.0
100 loops, best of 3: 7.17 ms per loop

改进取决于DataFrame的大小,但从未发现它会变慢。

只是为了它:

In [24]: %timeit d = orig_df * 0.0 + 1.0
100 loops, best of 3: 13.6 ms per loop

In [25]: %timeit d = pd.eval('orig_df * 0.0 + 1.0')
100 loops, best of 3: 8.36 ms per loop

但:

In [24]: %timeit d = orig_df.copy()
10 loops, best of 3: 24 ms per loop

编辑!!!

假设您有一个使用float64的框架,那么这将是最快的!通过将0.0替换为所需的填充编号,它还可以生成任何值。

In [23]: %timeit d = pd.eval('orig_df > 1.7976931348623157e+308 + 0.0')
100 loops, best of 3: 3.68 ms per loop

根据口味的不同,可以从外部定义nan,并做出通用的解决方案,而与特定的浮点类型无关:

In [39]: nan = np.nan
In [40]: %timeit d = pd.eval('orig_df > nan + 0.0')
100 loops, best of 3: 4.39 ms per loop

Assuming having a template DataFrame, which one would like to copy with zero values filled here…

If you have no NaNs in your data set, multiplying by zero can be significantly faster:

In [19]: columns = ["col{}".format(i) for i in xrange(3000)]                                                                                       

In [20]: indices = xrange(2000)

In [21]: orig_df = pd.DataFrame(42.0, index=indices, columns=columns)

In [22]: %timeit d = pd.DataFrame(np.zeros_like(orig_df), index=orig_df.index, columns=orig_df.columns)
100 loops, best of 3: 12.6 ms per loop

In [23]: %timeit d = orig_df * 0.0
100 loops, best of 3: 7.17 ms per loop

Improvement depends on DataFrame size, but never found it slower.

And just for the heck of it:

In [24]: %timeit d = orig_df * 0.0 + 1.0
100 loops, best of 3: 13.6 ms per loop

In [25]: %timeit d = pd.eval('orig_df * 0.0 + 1.0')
100 loops, best of 3: 8.36 ms per loop

But:

In [24]: %timeit d = orig_df.copy()
10 loops, best of 3: 24 ms per loop

EDIT!!!

Assuming you have a frame using float64, this will be the fastest by a huge margin! It is also able to generate any value by replacing 0.0 to the desired fill number.

In [23]: %timeit d = pd.eval('orig_df > 1.7976931348623157e+308 + 0.0')
100 loops, best of 3: 3.68 ms per loop

Depending on taste, one can externally define nan, and do a general solution, irrespective of the particular float type:

In [39]: nan = np.nan
In [40]: %timeit d = pd.eval('orig_df > nan + 0.0')
100 loops, best of 3: 4.39 ms per loop

使用熊猫比较两列

问题:使用熊猫比较两列

以此为起点:

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

Out[8]: 
  one  two three
0   10  1.2   4.2
1   15  70   0.03
2    8   5     0

我想if在熊猫中使用类似声明的内容。

if df['one'] >= df['two'] and df['one'] <= df['three']:
    df['que'] = df['one']

基本上,通过if语句检查每一行,然后创建新列。

文档说要使用,.all但没有示例…

Using this as a starting point:

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

Out[8]: 
  one  two three
0   10  1.2   4.2
1   15  70   0.03
2    8   5     0

I want to use something like an if statement within pandas.

if df['one'] >= df['two'] and df['one'] <= df['three']:
    df['que'] = df['one']

Basically, check each row via the if statement, create new column.

The docs say to use .all but there is no example…


回答 0

您可以使用np.where。如果cond是布尔数组,A并且B是数组,则

C = np.where(cond, A, B)

将C定义为等于A哪里cond为True,B哪里cond为False。

import numpy as np
import pandas as pd

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

df['que'] = np.where((df['one'] >= df['two']) & (df['one'] <= df['three'])
                     , df['one'], np.nan)

Yield

  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03  NaN
2   8    5     0  NaN

如果您有多个条件,则可以使用np.select代替。例如,如果你想df['que']等于df['two']df['one'] < df['two'],则

conditions = [
    (df['one'] >= df['two']) & (df['one'] <= df['three']), 
    df['one'] < df['two']]

choices = [df['one'], df['two']]

df['que'] = np.select(conditions, choices, default=np.nan)

Yield

  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03   70
2   8    5     0  NaN

如果我们可以假设df['one'] >= df['two']whendf['one'] < df['two']为False,那么条件和选择可以简化为

conditions = [
    df['one'] < df['two'],
    df['one'] <= df['three']]

choices = [df['two'], df['one']]

(如果包含df['one']df['two']包含NaN,则该假设可能不正确。)


注意

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

用字符串值定义一个DataFrame。由于它们看起来是数字,因此最好将这些字符串转换为浮点数:

df2 = df.astype(float)

但是,这会改变结果,因为字符串会逐个字符进行比较,而浮点数会进行数字比较。

In [61]: '10' <= '4.2'
Out[61]: True

In [62]: 10 <= 4.2
Out[62]: False

You could use np.where. If cond is a boolean array, and A and B are arrays, then

C = np.where(cond, A, B)

defines C to be equal to A where cond is True, and B where cond is False.

import numpy as np
import pandas as pd

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

df['que'] = np.where((df['one'] >= df['two']) & (df['one'] <= df['three'])
                     , df['one'], np.nan)

yields

  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03  NaN
2   8    5     0  NaN

If you have more than one condition, then you could use np.select instead. For example, if you wish df['que'] to equal df['two'] when df['one'] < df['two'], then

conditions = [
    (df['one'] >= df['two']) & (df['one'] <= df['three']), 
    df['one'] < df['two']]

choices = [df['one'], df['two']]

df['que'] = np.select(conditions, choices, default=np.nan)

yields

  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03   70
2   8    5     0  NaN

If we can assume that df['one'] >= df['two'] when df['one'] < df['two'] is False, then the conditions and choices could be simplified to

conditions = [
    df['one'] < df['two'],
    df['one'] <= df['three']]

choices = [df['two'], df['one']]

(The assumption may not be true if df['one'] or df['two'] contain NaNs.)


Note that

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

defines a DataFrame with string values. Since they look numeric, you might be better off converting those strings to floats:

df2 = df.astype(float)

This changes the results, however, since strings compare character-by-character, while floats are compared numerically.

In [61]: '10' <= '4.2'
Out[61]: True

In [62]: 10 <= 4.2
Out[62]: False

回答 1

您可以将其.equals用于列或整个数据框。

df['col1'].equals(df['col2'])

如果它们相等,则该语句将返回Trueelse False

You can use .equals for columns or entire dataframes.

df['col1'].equals(df['col2'])

If they’re equal, that statement will return True, else False.


回答 2

您可以使用apply()并执行类似的操作

df['que'] = df.apply(lambda x : x['one'] if x['one'] >= x['two'] and x['one'] <= x['three'] else "", axis=1)

或者如果您不想使用lambda

def que(x):
    if x['one'] >= x['two'] and x['one'] <= x['three']:
        return x['one']
    return ''
df['que'] = df.apply(que, axis=1)

You could use apply() and do something like this

df['que'] = df.apply(lambda x : x['one'] if x['one'] >= x['two'] and x['one'] <= x['three'] else "", axis=1)

or if you prefer not to use a lambda

def que(x):
    if x['one'] >= x['two'] and x['one'] <= x['three']:
        return x['one']
    return ''
df['que'] = df.apply(que, axis=1)

回答 3

一种方法是使用布尔序列对列进行索引df['one']。这将为您提供一个新列,其中的True条目与相同的行具有相同的值,df['one']并且这些False值为NaN

布尔级数仅由您的if语句给出(尽管必须使用&代替and):

>>> df['que'] = df['one'][(df['one'] >= df['two']) & (df['one'] <= df['three'])]
>>> df
    one two three   que
0   10  1.2 4.2      10
1   15  70  0.03    NaN
2   8   5   0       NaN

如果您希望将这些NaN值替换为其他值,则可以使用fillna新列上的方法que。我用的0不是这里的空字符串:

>>> df['que'] = df['que'].fillna(0)
>>> df
    one two three   que
0   10  1.2   4.2    10
1   15   70  0.03     0
2    8    5     0     0

One way is to use a Boolean series to index the column df['one']. This gives you a new column where the True entries have the same value as the same row as df['one'] and the False values are NaN.

The Boolean series is just given by your if statement (although it is necessary to use & instead of and):

>>> df['que'] = df['one'][(df['one'] >= df['two']) & (df['one'] <= df['three'])]
>>> df
    one two three   que
0   10  1.2 4.2      10
1   15  70  0.03    NaN
2   8   5   0       NaN

If you want the NaN values to be replaced by other values, you can use the fillna method on the new column que. I’ve used 0 instead of the empty string here:

>>> df['que'] = df['que'].fillna(0)
>>> df
    one two three   que
0   10  1.2   4.2    10
1   15   70  0.03     0
2    8    5     0     0

回答 4

将每个条件括在括号中,然后使用&运算符组合条件:

df.loc[(df['one'] >= df['two']) & (df['one'] <= df['three']), 'que'] = df['one']

您可以仅使用~(“ not”运算符)来反转匹配项来填充不匹配的行:

df.loc[~ ((df['one'] >= df['two']) & (df['one'] <= df['three'])), 'que'] = ''

您需要使用&~而不是and和,not因为&~运算符是逐个元素地工作的。

最终结果:

df
Out[8]: 
  one  two three que
0  10  1.2   4.2  10
1  15   70  0.03    
2   8    5     0  

Wrap each individual condition in parentheses, and then use the & operator to combine the conditions:

df.loc[(df['one'] >= df['two']) & (df['one'] <= df['three']), 'que'] = df['one']

You can fill the non-matching rows by just using ~ (the “not” operator) to invert the match:

df.loc[~ ((df['one'] >= df['two']) & (df['one'] <= df['three'])), 'que'] = ''

You need to use & and ~ rather than and and not because the & and ~ operators work element-by-element.

The final result:

df
Out[8]: 
  one  two three que
0  10  1.2   4.2  10
1  15   70  0.03    
2   8    5     0  

回答 5

使用np.select,如果你必须从数据帧和输出特定的选择在不同的列中选中多个条件

conditions=[(condition1),(condition2)]
choices=["choice1","chocie2"]

df["new column"]=np.select=(condtion,choice,default=)

注意:没有条件,没有选择项应该匹配,如果对于两个不同的条件您有相同的选择,请重复选择文本

Use np.select if you have multiple conditions to be checked from the dataframe and output a specific choice in a different column

conditions=[(condition1),(condition2)]
choices=["choice1","chocie2"]

df["new column"]=np.select=(condtion,choice,default=)

Note: No of conditions and no of choices should match, repeat text in choice if for two different conditions you have same choices


回答 6

我认为最接近OP直觉的是内联if语句:

df['que'] = (df['one'] if ((df['one'] >= df['two']) and (df['one'] <= df['three'])) 

I think the closest to the OP’s intuition is an inline if statement:

df['que'] = (df['one'] if ((df['one'] >= df['two']) and (df['one'] <= df['three'])) 

Python Pandas-查找两个数据框之间的差异

问题:Python Pandas-查找两个数据框之间的差异

我有两个数据帧df1和df2,其中df2是df1的子集。我如何获得一个新的数据帧(df3),这是两个数据帧之间的区别?

换句话说,一个数据帧具有df1中所有不在df2中的行/列?

在此处输入图片说明

I have two data frames df1 and df2, where df2 is a subset of df1. How do I get a new data frame (df3) which is the difference between the two data frames?

In other word, a data frame that has all the rows/columns in df1 that are not in df2?

enter image description here


回答 0

通过使用 drop_duplicates

pd.concat([df1,df2]).drop_duplicates(keep=False)

Update :

Above method only working for those dataframes they do not have duplicate itself, For example

df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})

它将输出如下所示,这是错误的

错误的输出:

pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]: 
   A  B
1  2  3

正确的输出

Out[656]: 
   A  B
1  2  3
2  3  4
3  3  4

如何实现呢?

方法1:isin与一起使用tuple

df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]: 
   A  B
1  2  3
2  3  4
3  3  4

方法2:mergeindicator

df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]: 
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only

By using drop_duplicates

pd.concat([df1,df2]).drop_duplicates(keep=False)

Update :

Above method only working for those dataframes they do not have duplicate itself, For example

df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})

It will output like below , which is wrong

Wrong Output :

pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]: 
   A  B
1  2  3

Correct Output

Out[656]: 
   A  B
1  2  3
2  3  4
3  3  4

How to achieve that?

Method 1: Using isin with tuple

df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]: 
   A  B
1  2  3
2  3  4
3  3  4

Method 2: merge with indicator

df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]: 
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only

回答 1

对于行,请尝试以下操作,Name联合索引列在哪里(可以是多个公共列的列表,也可以指定left_onright_on):

m = df1.merge(df2, on='Name', how='outer', suffixes=['', '_'], indicator=True)

indicator=True设置非常有用,因为它添加了一个名为的列_merge,并且在df1和之间进行了所有更改df2,分为3种可能的类型:“ left_only”,“ right_only”或“ both”。

对于列,请尝试以下操作:

set(df1.columns).symmetric_difference(df2.columns)

For rows, try this, where Name is the joint index column (can be a list for multiple common columns, or specify left_on and right_on):

m = df1.merge(df2, on='Name', how='outer', suffixes=['', '_'], indicator=True)

The indicator=True setting is useful as it adds a column called _merge, with all changes between df1 and df2, categorized into 3 possible kinds: “left_only”, “right_only” or “both”.

For columns, try this:

set(df1.columns).symmetric_difference(df2.columns)

回答 2

接受的答案方法1将不适用于内部具有NaN的数据帧,例如pd.np.nan != pd.np.nan。我不确定这是否是最好的方法,但是可以避免

df1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))]

Accepted answer Method 1 will not work for data frames with NaNs inside, as pd.np.nan != pd.np.nan. I am not sure if this is the best way, but it can be avoided by

df1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))]

回答 3

edit2,我想出了一个无需设置索引的新解决方案

newdf=pd.concat([df1,df2]).drop_duplicates(keep=False)

好吧,我发现最高投票的答案已经包含了我所想的。是的,我们只能在每两个df中没有重复的情况下使用此代码。


我有一个棘手的方法。首先,我们将“名称”设置为问题给出的两个数据框的索引。由于我们在两个df中具有相同的“名称”,因此我们可以从“较大” df中删除“较小” df的索引。这是代码。

df1.set_index('Name',inplace=True)
df2.set_index('Name',inplace=True)
newdf=df1.drop(df2.index)

edit2, I figured out a new solution without the need of setting index

newdf=pd.concat[df1,df2].drop_duplicates(keep=False)

okay i found the answer of hightest vote already contain what i have figured out .Yes, we can only use this code on condition that there are no duplicates in each two dfs.


I have a tricky method.First we set ’Name’ as the index of two dataframe given by the question.Since we have same ’Name’ in two dfs,we can just drop the ’smaller’ df’s index from the ‘bigger’ df. Here is the code.

df1.set_index('Name',inplace=True)
df2.set_index('Name',inplace=True)
newdf=df1.drop(df2.index)

回答 4

import pandas as pd
# given
df1 = pd.DataFrame({'Name':['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
    'Age':[23,45,12,34,27,44,28,39,40]})
df2 = pd.DataFrame({'Name':['John','Smith','Wale','Tom','Menda','Yuswa',],
    'Age':[23,12,34,44,28,40]})

# find elements in df1 that are not in df2
df_1notin2 = df1[~(df1['Name'].isin(df2['Name']) & df1['Age'].isin(df2['Age']))].reset_index(drop=True)

# output:
print('df1\n', df1)
print('df2\n', df2)
print('df_1notin2\n', df_1notin2)

# df1
#     Age   Name
# 0   23   John
# 1   45   Mike
# 2   12  Smith
# 3   34   Wale
# 4   27  Marry
# 5   44    Tom
# 6   28  Menda
# 7   39   Bolt
# 8   40  Yuswa
# df2
#     Age   Name
# 0   23   John
# 1   12  Smith
# 2   34   Wale
# 3   44    Tom
# 4   28  Menda
# 5   40  Yuswa
# df_1notin2
#     Age   Name
# 0   45   Mike
# 1   27  Marry
# 2   39   Bolt
import pandas as pd
# given
df1 = pd.DataFrame({'Name':['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
    'Age':[23,45,12,34,27,44,28,39,40]})
df2 = pd.DataFrame({'Name':['John','Smith','Wale','Tom','Menda','Yuswa',],
    'Age':[23,12,34,44,28,40]})

# find elements in df1 that are not in df2
df_1notin2 = df1[~(df1['Name'].isin(df2['Name']) & df1['Age'].isin(df2['Age']))].reset_index(drop=True)

# output:
print('df1\n', df1)
print('df2\n', df2)
print('df_1notin2\n', df_1notin2)

# df1
#     Age   Name
# 0   23   John
# 1   45   Mike
# 2   12  Smith
# 3   34   Wale
# 4   27  Marry
# 5   44    Tom
# 6   28  Menda
# 7   39   Bolt
# 8   40  Yuswa
# df2
#     Age   Name
# 0   23   John
# 1   12  Smith
# 2   34   Wale
# 3   44    Tom
# 4   28  Menda
# 5   40  Yuswa
# df_1notin2
#     Age   Name
# 0   45   Mike
# 1   27  Marry
# 2   39   Bolt

回答 5

也许是一种简单的单行代码,具有相同或不同的列名。即使df2 [‘Name2’]包含重复值也可以使用。

newDf = df1.set_index('Name1')
           .drop(df2['Name2'], errors='ignore')
           .reset_index(drop=False)

Perhaps a simpler one-liner, with identical or different column names. Worked even when df2[‘Name2’] contained duplicate values.

newDf = df1.set_index('Name1')
           .drop(df2['Name2'], errors='ignore')
           .reset_index(drop=False)

回答 6

正如这里 提到的

df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]

是正确的解决方案,但是如果

df1=pd.DataFrame({'A':[1],'B':[2]})
df2=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})

在这种情况下,上述解决方案将提供 Empty DataFrame,而应该concat在从每个datframe中删除重复项之后使用method。

采用 concate with drop_duplicates

df1=df1.drop_duplicates(keep="first") 
df2=df2.drop_duplicates(keep="first") 
pd.concat([df1,df2]).drop_duplicates(keep=False)

As mentioned here that

df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]

is correct solution but it will produce wrong output if

df1=pd.DataFrame({'A':[1],'B':[2]})
df2=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})

In that case above solution will give Empty DataFrame, instead you should use concat method after removing duplicates from each datframe.

Use concate with drop_duplicates

df1=df1.drop_duplicates(keep="first") 
df2=df2.drop_duplicates(keep="first") 
pd.concat([df1,df2]).drop_duplicates(keep=False)

回答 7

@liangli解决方案的细微变化,不需要更改现有数据帧的索引:

newdf = df1.drop(df1.join(df2.set_index('Name').index))

A slight variation of the nice @liangli’s solution that does not require to change the index of existing dataframes:

newdf = df1.drop(df1.join(df2.set_index('Name').index))

回答 8

通过索引查找差异。假设df1是df2的子集,并且在进行子集设置时将索引结转

df1.loc[set(df1.index).symmetric_difference(set(df2.index))].dropna()

# Example

df1 = pd.DataFrame({"gender":np.random.choice(['m','f'],size=5), "subject":np.random.choice(["bio","phy","chem"],size=5)}, index = [1,2,3,4,5])

df2 =  df1.loc[[1,3,5]]

df1

 gender subject
1      f     bio
2      m    chem
3      f     phy
4      m     bio
5      f     bio

df2

  gender subject
1      f     bio
3      f     phy
5      f     bio

df3 = df1.loc[set(df1.index).symmetric_difference(set(df2.index))].dropna()

df3

  gender subject
2      m    chem
4      m     bio

Finding difference by index. Assuming df1 is a subset of df2 and the indexes are carried forward when subsetting

df1.loc[set(df1.index).symmetric_difference(set(df2.index))].dropna()

# Example

df1 = pd.DataFrame({"gender":np.random.choice(['m','f'],size=5), "subject":np.random.choice(["bio","phy","chem"],size=5)}, index = [1,2,3,4,5])

df2 =  df1.loc[[1,3,5]]

df1

 gender subject
1      f     bio
2      m    chem
3      f     phy
4      m     bio
5      f     bio

df2

  gender subject
1      f     bio
3      f     phy
5      f     bio

df3 = df1.loc[set(df1.index).symmetric_difference(set(df2.index))].dropna()

df3

  gender subject
2      m    chem
4      m     bio


回答 9

除了可接受的答案之外,我还想提出一个更宽泛的解决方案,该解决方案可以找到带有/的两个数据帧的2D集合差异(它们可能对于两个数据帧都不相同)。同样,该方法还允许为数据框比较设置元素的公差(使用)indexcolumnsfloatnp.isclose


import numpy as np
import pandas as pd

def get_dataframe_setdiff2d(df_new: pd.DataFrame, 
                            df_old: pd.DataFrame, 
                            rtol=1e-03, atol=1e-05) -> pd.DataFrame:
    """Returns set difference of two pandas DataFrames"""

    union_index = np.union1d(df_new.index, df_old.index)
    union_columns = np.union1d(df_new.columns, df_old.columns)

    new = df_new.reindex(index=union_index, columns=union_columns)
    old = df_old.reindex(index=union_index, columns=union_columns)

    mask_diff = ~np.isclose(new, old, rtol, atol)

    df_bool = pd.DataFrame(mask_diff, union_index, union_columns)

    df_diff = pd.concat([new[df_bool].stack(),
                         old[df_bool].stack()], axis=1)

    df_diff.columns = ["New", "Old"]

    return df_diff

例:

In [1]

df1 = pd.DataFrame({'A':[2,1,2],'C':[2,1,2]})
df2 = pd.DataFrame({'A':[1,1],'B':[1,1]})

print("df1:\n", df1, "\n")

print("df2:\n", df2, "\n")

diff = get_dataframe_setdiff2d(df1, df2)

print("diff:\n", diff, "\n")
Out [1]

df1:
   A  C
0  2  2
1  1  1
2  2  2 

df2:
   A  B
0  1  1
1  1  1 

diff:
     New  Old
0 A  2.0  1.0
  B  NaN  1.0
  C  2.0  NaN
1 B  NaN  1.0
  C  1.0  NaN
2 A  2.0  NaN
  C  2.0  NaN 

In addition to accepted answer, I would like to propose one more wider solution that can find a 2D set difference of two dataframes with any index/columns (they might not coincide for both datarames). Also method allows to setup tolerance for float elements for dataframe comparison (it uses np.isclose)


import numpy as np
import pandas as pd

def get_dataframe_setdiff2d(df_new: pd.DataFrame, 
                            df_old: pd.DataFrame, 
                            rtol=1e-03, atol=1e-05) -> pd.DataFrame:
    """Returns set difference of two pandas DataFrames"""

    union_index = np.union1d(df_new.index, df_old.index)
    union_columns = np.union1d(df_new.columns, df_old.columns)

    new = df_new.reindex(index=union_index, columns=union_columns)
    old = df_old.reindex(index=union_index, columns=union_columns)

    mask_diff = ~np.isclose(new, old, rtol, atol)

    df_bool = pd.DataFrame(mask_diff, union_index, union_columns)

    df_diff = pd.concat([new[df_bool].stack(),
                         old[df_bool].stack()], axis=1)

    df_diff.columns = ["New", "Old"]

    return df_diff

Example:

In [1]

df1 = pd.DataFrame({'A':[2,1,2],'C':[2,1,2]})
df2 = pd.DataFrame({'A':[1,1],'B':[1,1]})

print("df1:\n", df1, "\n")

print("df2:\n", df2, "\n")

diff = get_dataframe_setdiff2d(df1, df2)

print("diff:\n", diff, "\n")
Out [1]

df1:
   A  C
0  2  2
1  1  1
2  2  2 

df2:
   A  B
0  1  1
1  1  1 

diff:
     New  Old
0 A  2.0  1.0
  B  NaN  1.0
  C  2.0  NaN
1 B  NaN  1.0
  C  1.0  NaN
2 A  2.0  NaN
  C  2.0  NaN 

在Jupyter Python Notebook中显示所有数据框列

问题:在Jupyter Python Notebook中显示所有数据框列

我想在Jupyter Notebook的数据框中显示所有列。Jupyter显示一些列,并在最后一列中添加点,如下图所示:

Juputer屏幕截图

如何显示所有列?

I want to show all columns in a dataframe in a Jupyter Notebook. Jupyter shows some of the columns and adds dots to the last columns like in the following picture:

Juputer Screenshot

How can I display all columns?


回答 0

尝试如下显示max_columns设置:

import pandas as pd
from IPython.display import display

df = pd.read_csv("some_data.csv")
pd.options.display.max_columns = None
display(df)

要么

pd.set_option('display.max_columns', None)

编辑:熊猫0.11.0向后

不建议使用此功能,但在低于0.11.0的Pandas版本中,该max_columns设置指定如下:

pd.set_printoptions(max_columns=500)

Try the display max_columns setting as follows:

import pandas as pd
from IPython.display import display

df = pd.read_csv("some_data.csv")
pd.options.display.max_columns = None
display(df)

Or

pd.set_option('display.max_columns', None)

Edit: Pandas 0.11.0 backwards

This is deprecated but in versions of Pandas older than 0.11.0 the max_columns setting is specified as follows:

pd.set_printoptions(max_columns=500)

回答 1

我知道这个问题有点老了,但是以下内容在运行pandas 0.22.0和Python 3的Jupyter Notebook中为我工作:

import pandas as pd
pd.set_option('display.max_columns', <number of columns>)

您也可以对行执行相同的操作:

pd.set_option('display.max_rows', <number of rows>)

这样可以节省导入IPython的时间,并且pandas.set_option文档中还有更多选项:https ://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html

I know this question is a little old but the following worked for me in a Jupyter Notebook running pandas 0.22.0 and Python 3:

import pandas as pd
pd.set_option('display.max_columns', <number of columns>)

You can do the same for the rows too:

pd.set_option('display.max_rows', <number of rows>)

This saves importing IPython, and there are more options in the pandas.set_option documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html


回答 2

适用于大型(但不是太大)DataFrame的Python 3.x

也许是因为我有较旧版本的熊猫,但在Jupyter笔记本上,这对我有用

import pandas as pd
from IPython.core.display import HTML

df=pd.read_pickle('Data1')
display(HTML(df.to_html()))

Python 3.x for large (but not too large) DataFrames

Maybe because I have an older version of pandas but on Jupyter notebook this work for me

import pandas as pd
from IPython.core.display import HTML

df=pd.read_pickle('Data1')
display(HTML(df.to_html()))

回答 3

我建议在上下文管理器中设置显示选项,以使其仅影响单个输出。如果您还想打印“漂亮”的html版本,我将定义一个函数并df使用force_show_all(df)以下命令显示数据框:

from IPython.core.display import display, HTML

def force_show_all(df):
    with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.width', None):
        display(HTML(df.to_html()))

正如其他人提到的那样,请谨慎地仅在合理大小的数据帧上调用它。

I recommend setting the display options inside a context manager so that it only affects a single output. If you also want to print a “pretty” html-version, I would define a function and display the dataframe df using force_show_all(df):

from IPython.core.display import display, HTML

def force_show_all(df):
    with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.width', None):
        display(HTML(df.to_html()))

As others have mentioned, be cautious to only call this on a reasonably-sized dataframe.


回答 4

您可以使用pandas.set_option()作为列,您可以指定以下任何选项

pd.set_option("display.max_rows", 200)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 200)

对于完整的打印列,您可以这样使用

import pandas as pd
pd.set_option('display.max_colwidth', -1)
print(words.head())

在此处输入图片说明

you can use pandas.set_option(), for column, you can specify any of these options

pd.set_option("display.max_rows", 200)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 200)

For full print column, you can use like this

import pandas as pd
pd.set_option('display.max_colwidth', -1)
print(words.head())

enter image description here


回答 5

如果要显示所有行设置如下

pd.options.display.max_rows = None

如果要显示所有列,例如波纹管

pd.options.display.max_columns = None

If you want to show all the rows set like bellow

pd.options.display.max_rows = None

If you want to show all columns set like bellow

pd.options.display.max_columns = None

‘DataFrame’对象没有属性’sort’

问题:’DataFrame’对象没有属性’sort’

我在这里遇到一些问题,在我的python包中,我已经安装了numpy,但是我仍然遇到此错误‘DataFrame’对象没有属性’sort’

任何人都可以给我一些想法。

这是我的代码:

final.loc[-1] =['', 'P','Actual']
final.index = final.index + 1  # shifting index
final = final.sort()
final.columns=[final.columns,final.iloc[0]]
final = final.iloc[1:].reset_index(drop=True)
final.columns.names = (None, None)

I face some problem here, in my python package I have install numpy, but I still have this error ‘DataFrame’ object has no attribute ‘sort’

Anyone can give me some idea..

This is my code :

final.loc[-1] =['', 'P','Actual']
final.index = final.index + 1  # shifting index
final = final.sort()
final.columns=[final.columns,final.iloc[0]]
final = final.iloc[1:].reset_index(drop=True)
final.columns.names = (None, None)

回答 0

sort() 不推荐使用DataFrames,而采用以下任何一种方法:

sort()在Pandas中已弃用(但仍可用)版本0.17(2015-10-09),并引入sort_values()sort_index()。它已从0.20版(2017-05-05)的Pandas中删除。

sort() was deprecated for DataFrames in favor of either:

sort() was deprecated (but still available) in Pandas with release 0.17 (2015-10-09) with the introduction of sort_values() and sort_index(). It was removed from Pandas with release 0.20 (2017-05-05).


回答 1

熊猫排序101

sort已经在v0.20替换DataFrame.sort_valuesDataFrame.sort_index。除此之外,我们还有argsort

以下是一些常见的排序用例,以及如何使用当前API中的排序功能解决它们。首先,设置。

# Setup
np.random.seed(0)
df = pd.DataFrame({'A': list('accab'), 'B': np.random.choice(10, 5)})    
df                                                                                                                                        
   A  B
0  a  7
1  c  9
2  c  3
3  a  5
4  b  2

按单列排序

例如,要按df列“ A” 排序,请使用sort_values单个列名:

df.sort_values(by='A')

   A  B
0  a  7
3  a  5
4  b  2
1  c  9
2  c  3

如果您需要新的RangeIndex,请使用DataFrame.reset_index

按多列排序

例如,通过排序两个关口“A”和“B”中df,你可以通过一个列表sort_values

df.sort_values(by=['A', 'B'])

   A  B
3  a  5
0  a  7
4  b  2
2  c  3
1  c  9

按DataFrame索引排序

df2 = df.sample(frac=1)
df2

   A  B
1  c  9
0  a  7
2  c  3
3  a  5
4  b  2

您可以使用sort_index

df2.sort_index()

   A  B
0  a  7
1  c  9
2  c  3
3  a  5
4  b  2

df.equals(df2)                                                                                                                            
# False
df.equals(df2.sort_index())                                                                                                               
# True

以下是一些可比较的方法及其性能:

%timeit df2.sort_index()                                                                                                                  
%timeit df2.iloc[df2.index.argsort()]                                                                                                     
%timeit df2.reindex(np.sort(df2.index))                                                                                                   

605 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
610 µs ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
581 µs ± 7.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

按指数列表排序

例如,

idx = df2.index.argsort()
idx
# array([0, 7, 2, 3, 9, 4, 5, 6, 8, 1])

这个“排序”问题实际上是一个简单的索引问题。仅传递整数标签即可iloc

df.iloc[idx]

   A  B
1  c  9
0  a  7
2  c  3
3  a  5
4  b  2

Pandas Sorting 101

sort has been replaced in v0.20 by DataFrame.sort_values and DataFrame.sort_index. Aside from this, we also have argsort.

Here are some common use cases in sorting, and how to solve them using the sorting functions in the current API. First, the setup.

# Setup
np.random.seed(0)
df = pd.DataFrame({'A': list('accab'), 'B': np.random.choice(10, 5)})    
df                                                                                                                                        
   A  B
0  a  7
1  c  9
2  c  3
3  a  5
4  b  2

Sort by Single Column

For example, to sort df by column “A”, use sort_values with a single column name:

df.sort_values(by='A')

   A  B
0  a  7
3  a  5
4  b  2
1  c  9
2  c  3

If you need a fresh RangeIndex, use DataFrame.reset_index.

Sort by Multiple Columns

For example, to sort by both col “A” and “B” in df, you can pass a list to sort_values:

df.sort_values(by=['A', 'B'])

   A  B
3  a  5
0  a  7
4  b  2
2  c  3
1  c  9

Sort By DataFrame Index

df2 = df.sample(frac=1)
df2

   A  B
1  c  9
0  a  7
2  c  3
3  a  5
4  b  2

You can do this using sort_index:

df2.sort_index()

   A  B
0  a  7
1  c  9
2  c  3
3  a  5
4  b  2

df.equals(df2)                                                                                                                            
# False
df.equals(df2.sort_index())                                                                                                               
# True

Here are some comparable methods with their performance:

%timeit df2.sort_index()                                                                                                                  
%timeit df2.iloc[df2.index.argsort()]                                                                                                     
%timeit df2.reindex(np.sort(df2.index))                                                                                                   

605 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
610 µs ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
581 µs ± 7.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Sort by List of Indices

For example,

idx = df2.index.argsort()
idx
# array([0, 7, 2, 3, 9, 4, 5, 6, 8, 1])

This “sorting” problem is actually a simple indexing problem. Just passing integer labels to iloc will do.

df.iloc[idx]

   A  B
1  c  9
0  a  7
2  c  3
3  a  5
4  b  2

如何在Pandas DataFrame中移动列

问题:如何在Pandas DataFrame中移动列

我想在Pandas中移动一列DataFrame,但是我无法在不重写整个DF的情况下从文档中找到一种方法来做到这一点。有人知道怎么做吗?数据框:

##    x1   x2
##0  206  214
##1  226  234
##2  245  253
##3  265  272
##4  283  291

所需的输出:

##    x1   x2
##0  206  nan
##1  226  214
##2  245  234
##3  265  253
##4  283  272
##5  nan  291

I would like to shift a column in a Pandas DataFrame, but I haven’t been able to find a method to do it from the documentation without rewriting the whole DF. Does anyone know how to do it? DataFrame:

##    x1   x2
##0  206  214
##1  226  234
##2  245  253
##3  265  272
##4  283  291

Desired output:

##    x1   x2
##0  206  nan
##1  226  214
##2  245  234
##3  265  253
##4  283  272
##5  nan  291

回答 0

In [18]: a
Out[18]: 
   x1  x2
0   0   5
1   1   6
2   2   7
3   3   8
4   4   9

In [19]: a.x2 = a.x2.shift(1)

In [20]: a
Out[20]: 
   x1  x2
0   0 NaN
1   1   5
2   2   6
3   3   7
4   4   8
In [18]: a
Out[18]: 
   x1  x2
0   0   5
1   1   6
2   2   7
3   3   8
4   4   9

In [19]: a.x2 = a.x2.shift(1)

In [20]: a
Out[20]: 
   x1  x2
0   0 NaN
1   1   5
2   2   6
3   3   7
4   4   8

回答 1

您需要在df.shift这里使用。
df.shift(i)将整个数据帧i向下移动一个单位。

因此,对于i = 1

输入:

    x1   x2  
0  206  214  
1  226  234  
2  245  253  
3  265  272    
4  283  291

输出:

    x1   x2
0  Nan  Nan   
1  206  214  
2  226  234  
3  245  253  
4  265  272 

因此,运行此脚本以获取预期的输出:

import pandas as pd

df = pd.DataFrame({'x1': ['206', '226', '245',' 265', '283'],
                   'x2': ['214', '234', '253', '272', '291']})

print(df)
df['x2'] = df['x2'].shift(1)
print(df)

You need to use df.shift here.
df.shift(i) shifts the entire dataframe by i units down.

So, for i = 1:

Input:

    x1   x2  
0  206  214  
1  226  234  
2  245  253  
3  265  272    
4  283  291

Output:

    x1   x2
0  Nan  Nan   
1  206  214  
2  226  234  
3  245  253  
4  265  272 

So, run this script to get the expected output:

import pandas as pd

df = pd.DataFrame({'x1': ['206', '226', '245',' 265', '283'],
                   'x2': ['214', '234', '253', '272', '291']})

print(df)
df['x2'] = df['x2'].shift(1)
print(df)

回答 2

让我们通过以下示例定义数据框:

>>> df = pd.DataFrame([[206, 214], [226, 234], [245, 253], [265, 272], [283, 291]], 
    columns=[1, 2])
>>> df
     1    2
0  206  214
1  226  234
2  245  253
3  265  272
4  283  291

然后您可以通过操作第二列的索引

>>> df[2].index = df[2].index+1

最后重新组合单列

>>> pd.concat([df[1], df[2]], axis=1)
       1      2
0  206.0    NaN
1  226.0  214.0
2  245.0  234.0
3  265.0  253.0
4  283.0  272.0
5    NaN  291.0

也许不快,但简单易读。考虑为列名和所需的实际移位设置变量。

编辑:通常可以通过df[2].shift(1)已发布的方式进行转移,但是这会切断结转。

Lets define the dataframe from your example by

>>> df = pd.DataFrame([[206, 214], [226, 234], [245, 253], [265, 272], [283, 291]], 
    columns=[1, 2])
>>> df
     1    2
0  206  214
1  226  234
2  245  253
3  265  272
4  283  291

Then you could manipulate the index of the second column by

>>> df[2].index = df[2].index+1

and finally re-combine the single columns

>>> pd.concat([df[1], df[2]], axis=1)
       1      2
0  206.0    NaN
1  226.0  214.0
2  245.0  234.0
3  265.0  253.0
4  283.0  272.0
5    NaN  291.0

Perhaps not fast but simple to read. Consider setting variables for the column names and the actual shift required.

Edit: Generally shifting is possible by df[2].shift(1) as already posted however would that cut-off the carryover.


回答 3

如果你不想失去你的列转移过去的数据帧的结束,只是首先附加所需数量:

    offset = 5
    DF = DF.append([np.nan for x in range(offset)])
    DF = DF.shift(periods=offset)
    DF = DF.reset_index() #Only works if sequential index

If you don’t want to lose the columns you shift past the end of your dataframe, simply append the required number first:

    offset = 5
    DF = DF.append([np.nan for x in range(offset)])
    DF = DF.shift(periods=offset)
    DF = DF.reset_index() #Only works if sequential index

回答 4

我想进口

import pandas as pd
import numpy as np

首先NaN, NaN,...在DataFrame(df)的末尾添加新行。

s1 = df.iloc[0]    # copy 1st row to a new Series s1
s1[:] = np.NaN     # set all values to NaN
df2 = df.append(s1, ignore_index=True)  # add s1 to the end of df

它将创建新的DF df2。也许有一种更优雅的方式,但这可行。

现在您可以移动它:

df2.x2 = df2.x2.shift(1)  # shift what you want

I suppose imports

import pandas as pd
import numpy as np

First append new row with NaN, NaN,... at the end of DataFrame (df).

s1 = df.iloc[0]    # copy 1st row to a new Series s1
s1[:] = np.NaN     # set all values to NaN
df2 = df.append(s1, ignore_index=True)  # add s1 to the end of df

It will create new DF df2. Maybe there is more elegant way but this works.

Now you can shift it:

df2.x2 = df2.x2.shift(1)  # shift what you want

回答 5

尝试回答一个个人问题,并且与您在Pandas Doc上发现的问题类似,我认为可以回答这个问题:

DataFrame.shift(周期= 1,频率=无,轴= 0)按所需的周期数移动索引,并具有可选的时间频率

笔记

如果指定了freq,则索引值会移位,但数据不会重新对齐。也就是说,如果您想在移位时扩展索引并保留原始数据,请使用freq。

希望对以后的问题有所帮助。

Trying to answer a personal problem and similar to yours I found on Pandas Doc what I think would answer this question:

DataFrame.shift(periods=1, freq=None, axis=0) Shift index by desired number of periods with an optional time freq

Notes

If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.

Hope to help future questions in this matter.


回答 6

这是我的方法:

df_ext = pd.DataFrame(index=pd.date_range(df.index[-1], periods=8, closed='right'))
df2 = pd.concat([df, df_ext], axis=0, sort=True)
df2["forecast"] = df2["some column"].shift(7)

基本上,我正在生成具有所需索引的空数据框,然后将它们连接在一起。但是我真的很想将此作为熊猫的标准功能,因此我提出了对熊猫的增强功能

This is how I do it:

df_ext = pd.DataFrame(index=pd.date_range(df.index[-1], periods=8, closed='right'))
df2 = pd.concat([df, df_ext], axis=0, sort=True)
df2["forecast"] = df2["some column"].shift(7)

Basically I am generating an empty dataframe with the desired index and then just concatenate them together. But I would really like to see this as a standard feature in pandas so I have proposed an enhancement to pandas.