Tag Archives: pandas

How to "select distinct" across multiple DataFrame columns in pandas?

Question: How to "select distinct" across multiple DataFrame columns in pandas?

I'm looking for a way to do the equivalent of the SQL

SELECT DISTINCT col1, col2 FROM dataframe_table

The pandas sql comparison doesn’t have anything about distinct.

.unique() only works for a single column, so I suppose I could concat the columns, or put them in a list/tuple and compare that way, but this seems like something pandas should do in a more native way.

Am I missing something obvious, or is there no way to do this?


Answer 0

You can use the drop_duplicates method to get the unique rows in a DataFrame:

In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})

In [30]: df
Out[30]:
   a  b
0  1  3
1  2  4
2  1  3
3  2  5

In [32]: df.drop_duplicates()
Out[32]:
   a  b
0  1  3
1  2  4
3  2  5

You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.
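For example, a minimal sketch of the subset usage for the asker's SELECT DISTINCT case (col1/col2 taken from the question):

# decide uniqueness by col1/col2 only, but keep all columns:
df.drop_duplicates(subset=['col1', 'col2'])

# SQL-style SELECT DISTINCT col1, col2: restrict to those columns first
df[['col1', 'col2']].drop_duplicates()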


Answer 1

I’ve tried different solutions. First was:

a_df=np.unique(df[['col1','col2']], axis=0)

It works well for non-object data. Another way to do this, and to avoid the error for object column types, is to apply drop_duplicates():

a_df=df.drop_duplicates(['col1','col2'])[['col1','col2']]

You can also use SQL to do this, but it ran very slowly in my case:

from pandasql import sqldf
q="""SELECT DISTINCT col1, col2 FROM df;"""
pysqldf = lambda q: sqldf(q, globals())
a_df = pysqldf(q)

Answer 2

There is no unique method for a DataFrame. If the number of unique values for each column were the same, the following would work: df.apply(pd.Series.unique). But if not, you will get an error. Another approach is to store the values in a dict keyed on the column name:

In [111]:
df = pd.DataFrame({'a':[0,1,2,2,4], 'b':[1,1,1,2,2]})
d={}
for col in df:
    d[col] = df[col].unique()
d

Out[111]:
{'a': array([0, 1, 2, 4], dtype=int64), 'b': array([1, 2], dtype=int64)}

Answer 3

To solve a similar problem, I’m using groupby:

print(f"Distinct entries: {len(df.groupby(['col1', 'col2']))}")

Whether that’s appropriate will depend on what you want to do with the result, though (in my case, I just wanted the equivalent of COUNT DISTINCT as shown).
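If you only need the count, GroupBy.ngroups gives the same number without iterating over the groups; a small sketch:

n_distinct = df.groupby(['col1', 'col2']).ngroups
# same count (for non-null keys) as:
n_distinct = len(df[['col1', 'col2']].drop_duplicates())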


Answer 4

I think drop_duplicates will sometimes not be so useful, depending on the dataframe.

I found this:

[in] df['col_1'].unique()
[out] array(['A', 'B', 'C'], dtype=object)

And it works for me!

https://riptutorial.com/pandas/example/26077/select-distinct-rows-across-dataframe


Answer 5

You can take the sets of the columns and just subtract the smaller set from the larger set:

distinct_values = set(df['a'])-set(df['b'])

'DataFrame' object has no attribute 'sort'

Question: 'DataFrame' object has no attribute 'sort'

I'm facing a problem here: in my Python environment I have installed numpy, but I still get this error: 'DataFrame' object has no attribute 'sort'.

Can anyone give me some ideas?

This is my code :

final.loc[-1] =['', 'P','Actual']
final.index = final.index + 1  # shifting index
final = final.sort()
final.columns=[final.columns,final.iloc[0]]
final = final.iloc[1:].reset_index(drop=True)
final.columns.names = (None, None)

Answer 0

sort() was deprecated for DataFrames in favor of sort_values() and sort_index(). It was deprecated (but still available) starting with pandas 0.17 (2015-10-09), which introduced the two replacements, and it was removed entirely in pandas 0.20 (2017-05-05).
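Applied to the asker's snippet, the fix is a one-line change (a sketch, assuming the intent was to sort by the shifted index):

final = final.sort_index()                   # replaces final.sort()
# or, to sort by column values instead (column name is a placeholder):
# final = final.sort_values(by='some_column')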


Answer 1

Pandas Sorting 101

sort has been replaced in v0.20 by DataFrame.sort_values and DataFrame.sort_index. Aside from this, we also have argsort.

Here are some common use cases in sorting, and how to solve them using the sorting functions in the current API. First, the setup.

# Setup
np.random.seed(0)
df = pd.DataFrame({'A': list('accab'), 'B': np.random.choice(10, 5)})    
df                                                                                                                                        
   A  B
0  a  7
1  c  9
2  c  3
3  a  5
4  b  2

Sort by Single Column

For example, to sort df by column “A”, use sort_values with a single column name:

df.sort_values(by='A')

   A  B
0  a  7
3  a  5
4  b  2
1  c  9
2  c  3

If you need a fresh RangeIndex, use DataFrame.reset_index.
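For example, applied to the result above:

df.sort_values(by='A').reset_index(drop=True)

   A  B
0  a  7
1  a  5
2  b  2
3  c  9
4  c  3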

Sort by Multiple Columns

For example, to sort by both col “A” and “B” in df, you can pass a list to sort_values:

df.sort_values(by=['A', 'B'])

   A  B
3  a  5
0  a  7
4  b  2
2  c  3
1  c  9

Sort By DataFrame Index

df2 = df.sample(frac=1)
df2

   A  B
1  c  9
0  a  7
2  c  3
3  a  5
4  b  2

You can do this using sort_index:

df2.sort_index()

   A  B
0  a  7
1  c  9
2  c  3
3  a  5
4  b  2

df.equals(df2)                                                                                                                            
# False
df.equals(df2.sort_index())                                                                                                               
# True

Here are some comparable methods with their performance:

%timeit df2.sort_index()                                                                                                                  
%timeit df2.iloc[df2.index.argsort()]                                                                                                     
%timeit df2.reindex(np.sort(df2.index))                                                                                                   

605 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
610 µs ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
581 µs ± 7.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Sort by List of Indices

For example,

idx = df2.index.argsort()
idx
# array([1, 0, 2, 3, 4])

This "sorting" problem is actually a simple indexing problem: just pass the integer positions to iloc.

df.iloc[idx]

   A  B
1  c  9
0  a  7
2  c  3
3  a  5
4  b  2
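One more case not shown above, descending order, is just the ascending flag (a brief addition, not part of the original answer):

df.sort_values(by='B', ascending=False)

   A  B
1  c  9
0  a  7
3  a  5
2  c  3
4  b  2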

Prepend a level to a pandas MultiIndex

Question: Prepend a level to a pandas MultiIndex

I have a DataFrame with a MultiIndex created after some grouping:

import numpy as np
import pandas as p
from numpy.random import randn

df = p.DataFrame({
    'A' : ['a1', 'a1', 'a2', 'a3']
  , 'B' : ['b1', 'b2', 'b3', 'b4']
  , 'Vals' : randn(4)
}).groupby(['A', 'B']).sum()

df

Output>            Vals
Output> A  B           
Output> a1 b1 -1.632460
Output>    b2  0.596027
Output> a2 b3 -0.619130
Output> a3 b4 -0.002009

How do I prepend a level to the MultiIndex so that I turn it into something like:

Output>                       Vals
Output> FirstLevel A  B           
Output> Foo        a1 b1 -1.632460
Output>               b2  0.596027
Output>            a2 b3 -0.619130
Output>            a3 b4 -0.002009

Answer 0

A nice way to do this in one line is using pandas.concat():

import pandas as pd

pd.concat([df], keys=['Foo'], names=['Firstlevel'])

An even shorter way:

pd.concat({'Foo': df}, names=['Firstlevel'])

This can be generalized to many data frames, see the docs.
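A sketch of that generalization: each key becomes a value of the new outer level, one per frame (df2 here is a hypothetical second frame with the same columns):

pd.concat({'Foo': df, 'Bar': df2}, names=['Firstlevel'])
# equivalently:
pd.concat([df, df2], keys=['Foo', 'Bar'], names=['Firstlevel'])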


Answer 1

You can first add it as a normal column and then append it to the current index, so:

df['Firstlevel'] = 'Foo'
df.set_index('Firstlevel', append=True, inplace=True)

And change the order if needed with:

df.reorder_levels(['Firstlevel', 'A', 'B'])

Which results in:

                      Vals
Firstlevel A  B           
Foo        a1 b1  0.871563
              b2  0.494001
           a2 b3 -0.167811
           a3 b4 -1.353409

Answer 2

I think this is a more general solution:

import pandas

# Convert index to dataframe
old_idx = df.index.to_frame()

# Insert new level at specified location
# (new_level_values may be a scalar, which broadcasts, or a list-like
# with one value per row)
old_idx.insert(0, 'new_level_name', new_level_values)

# Convert back to MultiIndex
df.index = pandas.MultiIndex.from_frame(old_idx)

Some advantages over the other answers:

  • The new level can be added at any location, not just the top.
  • It is purely a manipulation on the index and doesn’t require manipulating the data, like the concatenation trick.
  • It doesn’t require adding a column as an intermediate step, which can break multi-level column indexes.
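A minimal usage sketch of this recipe, reproducing the 'Foo' example from the question (requires pandas >= 0.24 for MultiIndex.from_frame):

import pandas as pd

old_idx = df.index.to_frame()
old_idx.insert(0, 'FirstLevel', 'Foo')   # scalar broadcasts to every row
df.index = pd.MultiIndex.from_frame(old_idx)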

Answer 3

I made a little function out of cxrodgers's answer, which IMHO is the best solution since it works purely on an index, independent of any data frame or series.

There is one fix I added: the to_frame() method will invent new names for index levels that don’t have one. As such the new index will have names that don’t exist in the old index. I added some code to revert this name-change.

Below is the code, I’ve used it myself for a while and it seems to work fine. If you find any issues or edge cases, I’d be much obliged to adjust my answer.

from typing import Any

import pandas as pd

def _handle_insert_loc(loc: int, n: int) -> int:
    """
    Computes the insert index from the right if loc is negative for a given size of n.
    """
    return n + loc + 1 if loc < 0 else loc


def add_index_level(old_index: pd.Index, value: Any, name: str = None, loc: int = 0) -> pd.MultiIndex:
    """
    Expand a (multi)index by adding a level to it.

    :param old_index: The index to expand
    :param name: The name of the new index level
    :param value: Scalar or list-like, the values of the new index level
    :param loc: Where to insert the level in the index, 0 is at the front, negative values count back from the rear end
    :return: A new multi-index with the new level added
    """
    loc = _handle_insert_loc(loc, len(old_index.names))
    old_index_df = old_index.to_frame()
    old_index_df.insert(loc, name, value)
    new_index_names = list(old_index.names)  # sometimes new index level names are invented when converting to a df,
    new_index_names.insert(loc, name)        # here the original names are reconstructed
    new_index = pd.MultiIndex.from_frame(old_index_df, names=new_index_names)
    return new_index

It passed the following unittest code:

import unittest

import numpy as np
import pandas as pd

class TestPandaStuff(unittest.TestCase):

    def test_add_index_level(self):
        df = pd.DataFrame(data=np.random.normal(size=(6, 3)))
        i1 = add_index_level(df.index, "foo")

        # it does not invent new index names where there are missing
        self.assertEqual([None, None], i1.names)

        # the new level values are added
        self.assertTrue(np.all(i1.get_level_values(0) == "foo"))
        self.assertTrue(np.all(i1.get_level_values(1) == df.index))

        # it does not invent new index names where there are missing
        i2 = add_index_level(i1, ["x", "y"]*3, name="xy", loc=2)
        i3 = add_index_level(i2, ["a", "b", "c"]*2, name="abc", loc=-1)
        self.assertEqual([None, None, "xy", "abc"], i3.names)

        # the new level values are added
        self.assertTrue(np.all(i3.get_level_values(0) == "foo"))
        self.assertTrue(np.all(i3.get_level_values(1) == df.index))
        self.assertTrue(np.all(i3.get_level_values(2) == ["x", "y"]*3))
        self.assertTrue(np.all(i3.get_level_values(3) == ["a", "b", "c"]*2))

        # df.index = i3
        # print()
        # print(df)

Answer 4

How about building it from scratch with pandas.MultiIndex.from_tuples?

df.index = p.MultiIndex.from_tuples(
    [(nl, A, B) for nl, (A, B) in
        zip(['Foo'] * len(df), df.index)],
    names=['FirstLevel', 'A', 'B'])

Similarly to cxrodger’s solution, this is a flexible method and avoids modifying the underlying array for the dataframe.


Pandas: create two new columns in a dataframe with values calculated from a pre-existing column

Question: Pandas: create two new columns in a dataframe with values calculated from a pre-existing column

I am working with the pandas library and I want to add two new columns to a dataframe df with n columns (n > 0).
These new columns result from the application of a function to one of the columns in the dataframe.

The function to apply is like:

def calculate(x):
    ...operate...
    return z, y

One method for creating a new column for a function returning only a value is:

df['new_col'] = df['column_A'].map(a_function)

So, what I want, and tried unsuccessfully (*), is something like:

(df['new_col_zetas'], df['new_col_ys']) = df['column_A'].map(calculate)

What the best way to accomplish this could be ? I scanned the documentation with no clue.

(*) df['column_A'].map(calculate) returns a pandas Series with each item consisting of a tuple (z, y). Trying to assign this to two dataframe columns produces a ValueError.


Answer 0

I’d just use zip:

In [1]: from pandas import *

In [2]: def calculate(x):
   ...:     return x*2, x*3
   ...: 

In [3]: df = DataFrame({'a': [1,2,3], 'b': [2,3,4]})

In [4]: df
Out[4]: 
   a  b
0  1  2
1  2  3
2  3  4

In [5]: df["A1"], df["A2"] = zip(*df["a"].map(calculate))

In [6]: df
Out[6]: 
   a  b  A1  A2
0  1  2   2   3
1  2  3   4   6
2  3  4   6   9

Answer 1

The top answer is flawed in my opinion. Hopefully, no one is mass importing all of pandas into their namespace with from pandas import *. Also, the map method should be reserved for those times when passing it a dictionary or Series. It can take a function but this is what apply is used for.

So, if you must use the above approach, I would write it like this

df["A1"], df["A2"] = zip(*df["a"].apply(calculate))

There’s actually no reason to use zip here. You can simply do this:

df["A1"], df["A2"] = calculate(df['a'])

This second method is also much faster on larger DataFrames

df = pd.DataFrame({'a': [1,2,3] * 100000, 'b': [2,3,4] * 100000})

DataFrame created with 300,000 rows

%timeit df["A1"], df["A2"] = calculate(df['a'])
2.65 ms ± 92.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df["A1"], df["A2"] = zip(*df["a"].apply(calculate))
159 ms ± 5.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

60x faster than zip


In general, avoid using apply

Apply is generally not much faster than iterating over a Python list. Let’s test the performance of a for-loop to do the same thing as above

%%timeit
A1, A2 = [], []
for val in df['a']:
    A1.append(val**2)
    A2.append(val**3)

df['A1'] = A1
df['A2'] = A2

298 ms ± 7.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So this is twice as slow, which isn't a terrible performance regression, but if we cythonize the above we get much better performance. Assuming you are using IPython:

%load_ext cython

%%cython
cpdef power(vals):
    A1, A2 = [], []
    cdef double val
    for val in vals:
        A1.append(val**2)
        A2.append(val**3)

    return A1, A2

%timeit df['A1'], df['A2'] = power(df['a'])
72.7 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Directly assigning without apply

You can get even greater speed improvements if you use the direct vectorized operations.

%timeit df['A1'], df['A2'] = df['a'] ** 2, df['a'] ** 3
5.13 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This takes advantage of NumPy’s extremely fast vectorized operations instead of our loops. We now have a 30x speedup over the original.


The simplest speed test with apply

The above example should clearly show how slow apply can be, but just so it's extra clear, let's look at the most basic example: squaring a Series of 10 million numbers with and without apply.

s = pd.Series(np.random.rand(10000000))
calc = lambda x: x ** 2  # assumption: calc squares its input; it was not defined in the original

%timeit s.apply(calc)
3.3 s ± 57.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Without apply is 50x faster

%timeit s ** 2
66 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Can pandas plot a histogram of dates?

Question: Can pandas plot a histogram of dates?

I've taken my Series and coerced it to a datetime column of dtype=datetime64[ns] (though I only need day resolution… not sure how to change that).

import pandas as pd
df = pd.read_csv('somefile.csv')
column = df['date']
column = pd.to_datetime(column, coerce=True)

but plotting doesn’t work:

ipdb> column.plot(kind='hist')
*** TypeError: ufunc add cannot use operands with types dtype('<M8[ns]') and dtype('float64')

I’d like to plot a histogram that just shows the count of dates by week, month, or year.

Surely there is a way to do this in pandas?


Answer 0

Given this df:

        date
0 2001-08-10
1 2002-08-31
2 2003-08-29
3 2006-06-21
4 2002-03-27
5 2003-07-14
6 2004-06-15
7 2003-08-14
8 2003-07-29

and, if it’s not already the case:

df["date"] = df["date"].astype("datetime64")

To show the count of dates by month:

df.groupby(df["date"].dt.month).count().plot(kind="bar")

.dt allows you to access the datetime properties.

Which will give you:

[bar chart: date counts grouped by month]

You can replace month by year, day, etc..

If you want to distinguish year and month for instance, just do:

df.groupby([df["date"].dt.year, df["date"].dt.month]).count().plot(kind="bar")

Which gives:

[bar chart: date counts grouped by year and month]

Was that what you wanted? Is this clear?

Hope this helps!
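A related sketch (not from the original answer): dt.to_period("M") keeps year and month together on one chronological axis:

df.groupby(df["date"].dt.to_period("M")).size().plot(kind="bar")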


Answer 1

I think resample might be what you are looking for. In your case, do:

df.set_index('date', inplace=True)
# for '1M' for 1 month; '1W' for 1 week; check documentation on offset alias
df.resample('1M', how='count')

It is only doing the counting and not the plot, so you then have to make your own plots.

See the pandas resample documentation for more details.

I have run into similar problems myself. Hope this helps.
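Note that the how keyword was later removed from resample; in newer pandas versions the equivalent is:

df.resample('1M').size()    # row counts per month
# or .count() for per-column counts of non-null values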


Answer 2

Rendered example

[rendered bar chart: count of datetimes by hour of the day]

Example Code

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Create random datetime object."""

# core modules
from datetime import datetime
import random

# 3rd party modules
import pandas as pd
import matplotlib.pyplot as plt


def visualize(df, column_name='start_date', color='#494949', title=''):
    """
    Visualize a dataframe with a date column.

    Parameters
    ----------
    df : Pandas dataframe
    column_name : str
        Column to visualize
    color : str
    title : str
    """
    plt.figure(figsize=(20, 10))
    ax = (df[column_name].groupby(df[column_name].dt.hour)
                         .count()).plot(kind="bar", color=color)
    ax.set_facecolor('#eeeeee')
    ax.set_xlabel("hour of the day")
    ax.set_ylabel("count")
    ax.set_title(title)
    plt.show()


def create_random_datetime(from_date, to_date, rand_type='uniform'):
    """
    Create random date within timeframe.

    Parameters
    ----------
    from_date : datetime object
    to_date : datetime object
    rand_type : {'uniform'}

    Examples
    --------
    >>> random.seed(28041990)
    >>> create_random_datetime(datetime(1990, 4, 28), datetime(2000, 12, 31))
    datetime.datetime(1998, 12, 13, 23, 38, 0, 121628)
    >>> create_random_datetime(datetime(1990, 4, 28), datetime(2000, 12, 31))
    datetime.datetime(2000, 3, 19, 19, 24, 31, 193940)
    """
    delta = to_date - from_date
    if rand_type == 'uniform':
        rand = random.random()
    else:
        raise NotImplementedError('Unknown random mode \'{}\''
                                  .format(rand_type))
    return from_date + rand * delta


def create_df(n=1000):
    """Create a Pandas dataframe with datetime objects."""
    from_date = datetime(1990, 4, 28)
    to_date = datetime(2000, 12, 31)
    sales = [create_random_datetime(from_date, to_date) for _ in range(n)]
    df = pd.DataFrame({'start_date': sales})
    return df


if __name__ == '__main__':
    import doctest
    doctest.testmod()
    df = create_df()
    visualize(df)

Answer 3

I was able to work around this by (1) plotting with matplotlib instead of using the dataframe directly and (2) using the values attribute. See example:

import matplotlib.pyplot as plt

ax = plt.gca()
ax.hist(column.values)

It doesn't work if I don't use values, but I don't know why it works with them.


Answer 4

Here is a solution for when you just want a histogram like you would expect. This doesn't use groupby; it converts the datetime values to integers and changes the labels on the plot. Some improvement could be made to move the tick labels to even locations. With this approach, a kernel density estimation plot (and any other plot) is also possible.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame({"datetime": pd.to_datetime(np.random.randint(1582800000000000000, 1583500000000000000, 100, dtype=np.int64))})
fig, ax = plt.subplots()
df["datetime"].astype(np.int64).plot.hist(ax=ax)
labels = ax.get_xticks().tolist()
labels = pd.to_datetime(labels)
ax.set_xticklabels(labels, rotation=90)
plt.show()

Datetime histogram


Answer 5

I think you can use this code to solve the problem; it converts the date type to int (and back to datetime):

df['date'] = df['date'].astype(int)
df['date'] = pd.to_datetime(df['date'], unit='s')

To get the date only, you can add this code:

pd.DatetimeIndex(df.date).normalize()
df['date'] = pd.DatetimeIndex(df.date).normalize()

Answer 6

I was just having trouble with this as well. I imagine that since you’re working with dates you want to preserve chronological ordering (like I did.)

The workaround then is

import matplotlib.pyplot as plt    
counts = df['date'].value_counts(sort=False)
plt.bar(counts.index,counts)
plt.show()

If anyone knows of a better way, please speak up.

EDIT: for jean above, here’s a sample of the data [I randomly sampled from the full dataset, hence the trivial histogram data.]

print dates
type(dates),type(dates[0])
dates.hist()
plt.show()

Output:

0    2001-07-10
1    2002-05-31
2    2003-08-29
3    2006-06-21
4    2002-03-27
5    2003-07-14
6    2004-06-15
7    2002-01-17
Name: Date, dtype: object
<class 'pandas.core.series.Series'> <type 'datetime.date'>

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-f39e334eece0> in <module>()
      2 print dates
      3 print type(dates),type(dates[0])
----> 4 dates.hist()
      5 plt.show()

/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.pyc in hist_series(self, by, ax, grid, xlabelsize, xrot, ylabelsize, yrot, figsize, bins, **kwds)
   2570         values = self.dropna().values
   2571 
-> 2572         ax.hist(values, bins=bins, **kwds)
   2573         ax.grid(grid)
   2574         axes = np.array([ax])

/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.pyc in hist(self, x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, **kwargs)
   5620             for xi in x:
   5621                 if len(xi) > 0:
-> 5622                     xmin = min(xmin, xi.min())
   5623                     xmax = max(xmax, xi.max())
   5624             bin_range = (xmin, xmax)

TypeError: can't compare datetime.date to float

Answer 7

All of these answers seem overly complex; at least with 'modern' pandas it's two lines.

df.set_index('date', inplace=True)
df.resample('M').size().plot.bar()

How to shift a column in a Pandas DataFrame

Question: How to shift a column in a Pandas DataFrame

I would like to shift a column in a Pandas DataFrame, but I haven’t been able to find a method to do it from the documentation without rewriting the whole DF. Does anyone know how to do it? DataFrame:

##    x1   x2
##0  206  214
##1  226  234
##2  245  253
##3  265  272
##4  283  291

Desired output:

##    x1   x2
##0  206  nan
##1  226  214
##2  245  234
##3  265  253
##4  283  272
##5  nan  291

Answer 0

In [18]: a
Out[18]: 
   x1  x2
0   0   5
1   1   6
2   2   7
3   3   8
4   4   9

In [19]: a.x2 = a.x2.shift(1)

In [20]: a
Out[20]: 
   x1  x2
0   0 NaN
1   1   5
2   2   6
3   3   7
4   4   8

Answer 1

You need to use df.shift here.
df.shift(i) shifts the entire dataframe by i units down.

So, for i = 1:

Input:

    x1   x2  
0  206  214  
1  226  234  
2  245  253  
3  265  272    
4  283  291

Output:

    x1   x2
0  NaN  NaN
1  206  214  
2  226  234  
3  245  253  
4  265  272 

So, run this script to get the expected output:

import pandas as pd

df = pd.DataFrame({'x1': ['206', '226', '245',' 265', '283'],
                   'x2': ['214', '234', '253', '272', '291']})

print(df)
df['x2'] = df['x2'].shift(1)
print(df)

Answer 2

Let's define the dataframe from your example:

>>> df = pd.DataFrame([[206, 214], [226, 234], [245, 253], [265, 272], [283, 291]], 
    columns=[1, 2])
>>> df
     1    2
0  206  214
1  226  234
2  245  253
3  265  272
4  283  291

Then you could manipulate the index of the second column by

>>> df[2].index = df[2].index+1

and finally re-combine the single columns

>>> pd.concat([df[1], df[2]], axis=1)
       1      2
0  206.0    NaN
1  226.0  214.0
2  245.0  234.0
3  265.0  253.0
4  283.0  272.0
5    NaN  291.0

Perhaps not fast but simple to read. Consider setting variables for the column names and the actual shift required.

Edit: Shifting is generally possible with df[2].shift(1), as already posted, but that would cut off the carry-over past the end.


Answer 3

If you don't want to lose the values you shift past the end of your dataframe, simply append the required number of rows first:

import numpy as np

offset = 5
DF = DF.append([np.nan for x in range(offset)])
DF = DF.shift(periods=offset)
DF = DF.reset_index()  # only works with a sequential index

Answer 4

Assuming the imports:

import pandas as pd
import numpy as np

First append a new row of NaN, NaN, ... at the end of the DataFrame (df):

s1 = df.iloc[0]    # copy 1st row to a new Series s1
s1[:] = np.NaN     # set all values to NaN
df2 = df.append(s1, ignore_index=True)  # add s1 to the end of df

It will create a new DataFrame df2. Maybe there is a more elegant way, but this works.

Now you can shift it:

df2.x2 = df2.x2.shift(1)  # shift what you want

Answer 5

While trying to solve a problem of my own, similar to yours, I found the following in the pandas docs, which I think answers this question:

DataFrame.shift(periods=1, freq=None, axis=0) Shift index by desired number of periods with an optional time freq

Notes

If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.

Hope this helps future readers with this question.


Answer 6

This is how I do it:

df_ext = pd.DataFrame(index=pd.date_range(df.index[-1], periods=8, closed='right'))
df2 = pd.concat([df, df_ext], axis=0, sort=True)
df2["forecast"] = df2["some column"].shift(7)

Basically I am generating an empty dataframe with the desired index and then just concatenating the two together. But I would really like to see this as a standard feature in pandas, so I have proposed an enhancement to pandas.


Selecting a pandas column by position

Question: Selecting a pandas column by position

I’m simply trying to access named pandas columns by an integer.

You can select a row by location using df.ix[3].

But how to select a column by integer?

My dataframe:

df=pandas.DataFrame({'a':np.random.rand(5), 'b':np.random.rand(5)})

Answer 0

Two approaches that come to mind:

>>> df
          A         B         C         D
0  0.424634  1.716633  0.282734  2.086944
1 -1.325816  2.056277  2.583704 -0.776403
2  1.457809 -0.407279 -1.560583 -1.316246
3 -0.757134 -1.321025  1.325853 -2.513373
4  1.366180 -1.265185 -2.184617  0.881514
>>> df.iloc[:, 2]
0    0.282734
1    2.583704
2   -1.560583
3    1.325853
4   -2.184617
Name: C
>>> df[df.columns[2]]
0    0.282734
1    2.583704
2   -1.560583
3    1.325853
4   -2.184617
Name: C

Edit: The original answer suggested the use of df.ix[:,2] but this function is now deprecated. Users should switch to df.iloc[:,2].


Answer 1

You can also use df.icol(n) to access a column by integer.

Update: icol is deprecated and the same functionality can be achieved by:

df.iloc[:, n]  # to access the column at the nth position

Answer 2

You could use the label-based .loc or the position-based .iloc method to do column slicing, including column ranges:

In [50]: import pandas as pd

In [51]: import numpy as np

In [52]: df = pd.DataFrame(np.random.rand(4,4), columns = list('abcd'))

In [53]: df
Out[53]: 
          a         b         c         d
0  0.806811  0.187630  0.978159  0.317261
1  0.738792  0.862661  0.580592  0.010177
2  0.224633  0.342579  0.214512  0.375147
3  0.875262  0.151867  0.071244  0.893735

In [54]: df.loc[:, ["a", "b", "d"]] ### Selective columns based slicing
Out[54]: 
          a         b         d
0  0.806811  0.187630  0.317261
1  0.738792  0.862661  0.010177
2  0.224633  0.342579  0.375147
3  0.875262  0.151867  0.893735

In [55]: df.loc[:, "a":"c"] ### Selective label based column ranges slicing
Out[55]: 
          a         b         c
0  0.806811  0.187630  0.978159
1  0.738792  0.862661  0.580592
2  0.224633  0.342579  0.214512
3  0.875262  0.151867  0.071244

In [56]: df.iloc[:, 0:3] ### Selective index based column ranges slicing
Out[56]: 
          a         b         c
0  0.806811  0.187630  0.978159
1  0.738792  0.862661  0.580592
2  0.224633  0.342579  0.214512
3  0.875262  0.151867  0.071244

Answer 3

You can access multiple columns by passing a list of column indices to dataFrame.ix.

For example:

>>> df = pandas.DataFrame({
             'a': np.random.rand(5),
             'b': np.random.rand(5),
             'c': np.random.rand(5),
             'd': np.random.rand(5)
         })

>>> df
          a         b         c         d
0  0.705718  0.414073  0.007040  0.889579
1  0.198005  0.520747  0.827818  0.366271
2  0.974552  0.667484  0.056246  0.524306
3  0.512126  0.775926  0.837896  0.955200
4  0.793203  0.686405  0.401596  0.544421

>>> df.ix[:,[1,3]]
          b         d
0  0.414073  0.889579
1  0.520747  0.366271
2  0.667484  0.524306
3  0.775926  0.955200
4  0.686405  0.544421
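Note that .ix was deprecated and later removed; the position-based equivalent today is .iloc with a list of integer positions:

df.iloc[:, [1, 3]]   # columns 'b' and 'd' by position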

Answer 4

The method .transpose() converts columns to rows and rows to columns, hence you could even write

df.transpose().ix[3]

A way to read the first few lines of a pandas dataframe

Question: A way to read the first few lines of a pandas dataframe

Is there a built-in way to use read_csv to read only the first n lines of a file without knowing the length of the lines ahead of time? I have a large file that takes a long time to read, and occasionally only want to use the first, say, 20 lines to get a sample of it (and prefer not to load the full thing and take the head of it).

If I knew the total number of lines I could do something like footer_lines = total_lines - n and pass this to the skipfooter keyword arg. My current solution is to manually grab the first n lines with python and StringIO it to pandas:

import pandas as pd
from StringIO import StringIO

n = 20
with open('big_file.csv', 'r') as f:
    head = ''.join(f.readlines(n))

df = pd.read_csv(StringIO(head))

It’s not that bad, but is there a more concise, ‘pandasic’ (?) way to do it with keywords or something?


Answer 0

I think you can use the nrows parameter. From the docs:

nrows : int, default None

    Number of rows of file to read. Useful for reading pieces of large files

which seems to work. Using one of the standard large test files (988504479 bytes, 5344499 lines):

In [1]: import pandas as pd

In [2]: time z = pd.read_csv("P00000001-ALL.csv", nrows=20)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s

In [3]: len(z)
Out[3]: 20

In [4]: time z = pd.read_csv("P00000001-ALL.csv")
CPU times: user 27.63 s, sys: 1.92 s, total: 29.55 s
Wall time: 30.23 s
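A related option (not from the original answer): if you want a sample without loading the whole file and without knowing its length, chunked reading also works:

import pandas as pd

# iterate over the file in chunks of 20 rows; take only the first chunk
reader = pd.read_csv("big_file.csv", chunksize=20)
df = next(iter(reader))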

Return multiple columns from pandas apply()

Question: Return multiple columns from pandas apply()

I have a pandas DataFrame, df_test. It contains a column ‘size’ which represents size in bytes. I’ve calculated KB, MB, and GB using the following code:

df_test = pd.DataFrame([
    {'dir': '/Users/uname1', 'size': 994933},
    {'dir': '/Users/uname2', 'size': 109338711},
])

df_test['size_kb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0, grouping=True) + ' KB')
df_test['size_mb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0 ** 2, grouping=True) + ' MB')
df_test['size_gb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0 ** 3, grouping=True) + ' GB')

df_test


             dir       size       size_kb   size_mb size_gb
0  /Users/uname1     994933      971.6 KB    0.9 MB  0.0 GB
1  /Users/uname2  109338711  106,776.1 KB  104.3 MB  0.1 GB

[2 rows x 5 columns]

I've run this over 120,000 rows, and it takes about 2.97 seconds per column × 3 = ~9 seconds according to %timeit.

Is there any way I can make this faster? For example, instead of returning one column at a time from apply and running it 3 times, can I return all three columns in one pass to insert back into the original dataframe?

The other questions I’ve found all want to take multiple values and return a single value. I want to take a single value and return multiple columns.


Answer 0

This is an old question, but for completeness, you can return a Series from the applied function that contains the new data, preventing the need to iterate three times. Passing axis=1 to the apply function applies the function sizes to each row of the dataframe, returning a series to add to a new dataframe. This series, s, contains the new values, as well as the original data.

def sizes(s):
    s['size_kb'] = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
    s['size_mb'] = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    s['size_gb'] = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return s

df_test = df_test.append(rows_list)  # rows_list: extra rows from the asker's own context
df_test = df_test.apply(sizes, axis=1)

Answer 1

Using apply with zip is about 3 times faster than the Series approach: building and returning a full Series per row is much more expensive than returning a plain tuple.

def sizes(s):    
    return locale.format("%.1f", s / 1024.0, grouping=True) + ' KB', \
        locale.format("%.1f", s / 1024.0 ** 2, grouping=True) + ' MB', \
        locale.format("%.1f", s / 1024.0 ** 3, grouping=True) + ' GB'
df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes))

Test results are:

Separate df.apply(): 

    100 loops, best of 3: 1.43 ms per loop

Return Series: 

    100 loops, best of 3: 2.61 ms per loop

Return tuple:

    1000 loops, best of 3: 819 µs per loop

Answer 2

Some of the current replies work fine, but I want to offer another, maybe more “pandifyed” option. This works for me with the current pandas 0.23 (not sure if it will work in previous versions):

import locale
import pandas as pd

locale.setlocale(locale.LC_ALL, '')  # so grouping=True can insert thousands separators

df_test = pd.DataFrame([
  {'dir': '/Users/uname1', 'size': 994933},
  {'dir': '/Users/uname2', 'size': 109338711},
])

def sizes(s):
  a = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
  b = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
  c = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
  return a, b, c

df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes, axis=1, result_type="expand")

Note that the trick is the result_type parameter of apply, which expands the result into a DataFrame that can be assigned directly to new/existing columns.


Answer 3

Just another readable way. This code adds three new columns and their values, returning a Series without using extra parameters in the apply function.

def sizes(s):
    val_kb = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
    val_mb = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    val_gb = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return pd.Series([val_kb, val_mb, val_gb], index=['size_kb', 'size_mb', 'size_gb'])

df[['size_kb', 'size_mb', 'size_gb']] = df.apply(sizes, axis=1)

A general example from: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)

#foo  bar
#0    1    2
#1    1    2
#2    1    2

Answer 4

Really cool answers! Thanks Jesse and jaumebonet! Just a few observations regarding:

  • zip(* ...
  • ... result_type="expand")

Although expand is somewhat more elegant (more "pandified"), zip is at least 2x faster. On the simple example below, I got a 4x speedup.

import pandas as pd

dat = [ [i, 10*i] for i in range(1000)]

df = pd.DataFrame(dat, columns = ["a","b"])

def add_and_sub(row):
    add = row["a"] + row["b"]
    sub = row["a"] - row["b"]
    return add, sub

df[["add", "sub"]] = df.apply(add_and_sub, axis=1, result_type="expand")
# versus
df["add"], df["sub"] = zip(*df.apply(add_and_sub, axis=1))

Answer 5

The performance of the top answers varies significantly; Jesse & famaral42 have already discussed this, but it is worth sharing a fair comparison between the top answers and elaborating on a subtle but important detail of Jesse’s answer: what is passed in to the function also affects performance.

(Python 3.7.4, Pandas 1.0.3)

import pandas as pd
import locale
import timeit


def create_new_df_test():
    df_test = pd.DataFrame([
      {'dir': '/Users/uname1', 'size': 994933},
      {'dir': '/Users/uname2', 'size': 109338711},
    ])
    return df_test


def sizes_pass_series_return_series(series):
    series['size_kb'] = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
    series['size_mb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    series['size_gb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return series


def sizes_pass_series_return_tuple(series):
    a = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
    b = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    c = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return a, b, c


def sizes_pass_value_return_tuple(value):
    a = locale.format_string("%.1f", value / 1024.0, grouping=True) + ' KB'
    b = locale.format_string("%.1f", value / 1024.0 ** 2, grouping=True) + ' MB'
    c = locale.format_string("%.1f", value / 1024.0 ** 3, grouping=True) + ' GB'
    return a, b, c

Here are the results:

# 1 - Accepted (Nels11 Answer) - (pass series, return series):
9.82 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 2 - Pandafied (jaumebonet Answer) - (pass series, return tuple):
2.34 ms ± 48.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 3 - Tuples (pass series, return tuple then zip):
1.36 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# 4 - Tuples (Jesse Answer) - (pass value, return tuple then zip):
752 µs ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Notice how returning tuples is the fastest method, but what is passed in as an argument also affects performance. The difference in the code is subtle, but the performance improvement is significant.

Test #4 (passing in a single value) is twice as fast as test #3 (passing in a series), even though the operation performed is ostensibly identical: row-wise DataFrame.apply has to build a Series object for every row, while Series.apply hands the function a plain scalar.

But there’s more…

# 1a - Accepted (Nels11 Answer) - (pass series, return series, new columns exist):
3.23 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 2a - Pandafied (jaumebonet Answer) - (pass series, return tuple, new columns exist):
2.31 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 3a - Tuples (pass series, return tuple then zip, new columns exist):
1.36 ms ± 58.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# 4a - Tuples (Jesse Answer) - (pass value, return tuple then zip, new columns exist):
694 µs ± 3.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In some cases (#1a and #4a), applying the function to a DataFrame in which the output columns already exist is faster than creating them from the function.

Here is the code for running the tests:

# Paste and run the following in ipython console. It will not work if you run it from a .py file.
print('\nAccepted Answer (pass series, return series, new columns dont exist):')
df_test = create_new_df_test()
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)
print('Accepted Answer (pass series, return series, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)

print('\nPandafied (pass series, return tuple, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")
print('Pandafied (pass series, return tuple, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")

print('\nTuples (pass series, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))
print('Tuples (pass series, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))

print('\nTuples (pass value, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
print('Tuples (pass value, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))

Answer 6

I believe pandas 1.1 breaks the behavior suggested in the top answer here.

import pandas as pd
def test_func(row):
    row['c'] = str(row['a']) + str(row['b'])
    row['d'] = row['a'] + 1
    return row

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['i', 'j', 'k']})
df.apply(test_func, axis=1)

The above code, run on pandas 1.1.0, returns:

   a  b   c  d
0  1  i  1i  2
1  1  i  1i  2
2  1  i  1i  2

While in pandas 1.0.5 it returned:

   a   b    c  d
0  1   i   1i  2
1  2   j   2j  3
2  3   k   3k  4

Which I think is what you’d expect.

I’m not sure how the release notes explain this behavior; however, as explained here, avoiding mutation of the original rows by copying them restores the old behavior, i.e.:

def test_func(row):
    row = row.copy()   #  <---- Avoid mutating the original reference
    row['c'] = str(row['a']) + str(row['b'])
    row['d'] = row['a'] + 1
    return row
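
With the copy in place, a quick check (pandas >= 1.1, using the same df as above) shows the per-row results coming back as expected:

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['i', 'j', 'k']})
print(df.apply(test_func, axis=1))
#    a  b   c  d
# 0  1  i  1i  2
# 1  2  j  2j  3
# 2  3  k  3k  4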

Answer 7

Generally, to return multiple values, this is what I do

import numpy as np
import pandas as pd

def gimmeMultiple(group):
    # pack several values into a single ndarray
    return np.array([[1, 2]])

def gimmeMultipleDf(group):
    # return the values as a one-row DataFrame with named columns
    return pd.DataFrame(np.array([[1, 2]]), columns=['x1', 'x2'])

df['size'].astype(int).apply(gimmeMultiple)
df['size'].astype(int).apply(gimmeMultipleDf)

Returning a dataframe definitely has its perks, but it is not always required. You can look at what apply() returns and experiment with the functions ;)


Answer 8

This gives a new dataframe with two columns taken from the original one.

import pandas as pd
df = ...
df_with_two_columns = df.apply(lambda row: pd.Series([row['column_1'], row['column_2']], index=['column_1', 'column_2']), axis=1)

Numpy isnan() fails on an array of floats (from a pandas dataframe apply)

Question: Numpy isnan() fails on an array of floats (from a pandas dataframe apply)

I have an array of floats (some normal numbers, some nans) that is coming out of an apply on a pandas dataframe.

For some reason, numpy.isnan is failing on this array. However, as shown below, each element is a float, numpy.isnan runs correctly on each element, and the type of the variable is definitely a numpy array.

What’s going on?!

set([type(x) for x in tester])
Out[59]: {float}

tester
Out[60]: 
array([-0.7000000000000001, nan, nan, nan, nan, nan, nan, nan, nan, nan,
   nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
   nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
   nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
   nan, nan], dtype=object)

set([type(x) for x in tester])
Out[61]: {float}

np.isnan(tester)
Traceback (most recent call last):

File "<ipython-input-62-e3638605b43c>", line 1, in <module>
np.isnan(tester)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

set([np.isnan(x) for x in tester])
Out[65]: {False, True}

type(tester)
Out[66]: numpy.ndarray

Answer 0

np.isnan can be applied to NumPy arrays of native dtype (such as np.float64):

In [99]: np.isnan(np.array([np.nan, 0], dtype=np.float64))
Out[99]: array([ True, False], dtype=bool)

but raises TypeError when applied to object arrays:

In [96]: np.isnan(np.array([np.nan, 0], dtype=object))
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Since you have Pandas, you could use pd.isnull instead — it can accept NumPy arrays of object or native dtypes:

In [97]: pd.isnull(np.array([np.nan, 0], dtype=float))
Out[97]: array([ True, False], dtype=bool)

In [98]: pd.isnull(np.array([np.nan, 0], dtype=object))
Out[98]: array([ True, False], dtype=bool)

Note that None is also considered a null value in object arrays.
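
A quick illustration:

import numpy as np
import pandas as pd

arr = np.array([None, np.nan, 0], dtype=object)
print(pd.isnull(arr))  # -> [ True  True False]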


Answer 1

A great substitute for np.isnan() and pd.isnull() is

for i in range(0, a.shape[0]):
    if a[i] != a[i]:
        # a[i] is nan -- do something here
        pass

since only nan is not equal to itself.
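
The same trick also works without an explicit loop, since elementwise comparison is defined for object arrays as well (a minimal sketch):

import numpy as np

a = np.array([-0.7, np.nan, np.nan], dtype=object)
mask = a != a  # True exactly where an element is NaN
print(mask)    # -> [False True True]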


Answer 2

On top of @unutbu’s answer, you could coerce the pandas object array to a native float64 type, something along the lines of:

import pandas as pd
pd.to_numeric(df['tester'], errors='coerce')

Specify errors='coerce' to force strings that can’t be parsed to a numeric value to become NaN. The column type will then be float64, and the isnan check should work.
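
A complete sketch, assuming tester is the object array from the question:

import numpy as np
import pandas as pd

tester = np.array([-0.7, np.nan, np.nan], dtype=object)
as_float = pd.to_numeric(pd.Series(tester), errors='coerce')  # dtype: float64
print(np.isnan(as_float.to_numpy()))  # -> [False  True  True]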


Answer 3

Make sure you import the csv file using pandas:

import pandas as pd

condition = pd.isnull(data[i][j])
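
A fuller sketch; the file name 'data.csv' is hypothetical, and the single cell is checked by position with iloc:

import pandas as pd

data = pd.read_csv('data.csv')           # hypothetical file name
mask = data.isnull()                     # boolean DataFrame, True where a cell is null
condition = pd.isnull(data.iloc[0, 0])   # check a single cell by position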