Tag Archives: dataframe

Making a heatmap from a pandas DataFrame

Question: Making a heatmap from a pandas DataFrame

I have a dataframe generated with Python's Pandas package. How can I generate a heatmap from this DataFrame?

import numpy as np 
from pandas import *

Index= ['aaa','bbb','ccc','ddd','eee']
Cols = ['A', 'B', 'C','D']
df = DataFrame(abs(np.random.randn(5, 4)), index= Index, columns=Cols)

>>> df
          A         B         C         D
aaa  2.431645  1.248688  0.267648  0.613826
bbb  0.809296  1.671020  1.564420  0.347662
ccc  1.501939  1.126518  0.702019  1.596048
ddd  0.137160  0.147368  1.504663  0.202822
eee  0.134540  3.708104  0.309097  1.641090
>>> 

Answer 0

You want matplotlib.pcolor:

import numpy as np 
from pandas import DataFrame
import matplotlib.pyplot as plt

index = ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
columns = ['A', 'B', 'C', 'D']
df = DataFrame(abs(np.random.randn(5, 4)), index=index, columns=columns)

plt.pcolor(df)
plt.yticks(np.arange(0.5, len(df.index), 1), df.index)
plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns)
plt.show()

This gives:


Answer 1

For people looking at this today, I would recommend the Seaborn heatmap() as documented here.

The example above would be done as follows:

import numpy as np 
from pandas import DataFrame
import seaborn as sns
%matplotlib inline

Index= ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
Cols = ['A', 'B', 'C', 'D']
df = DataFrame(abs(np.random.randn(5, 4)), index=Index, columns=Cols)

sns.heatmap(df, annot=True)

Here %matplotlib is an IPython magic function, for those who are unfamiliar with it.


Answer 2

If you don't need a plot per se, and you're simply interested in adding color to represent the values in a table format, you can use the style.background_gradient() method of the pandas data frame. This method colorizes the HTML table that is displayed when viewing pandas data frames in e.g. a JupyterLab notebook, and the result is similar to using "conditional formatting" in spreadsheet software:

import numpy as np 
import pandas as pd


index= ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
cols = ['A', 'B', 'C', 'D']
df = pd.DataFrame(abs(np.random.randn(5, 4)), index=index, columns=cols)
df.style.background_gradient(cmap='Blues')

For detailed usage, please see the more elaborate answer I provided on the same topic previously and the styling section of the pandas documentation.


Answer 3

Useful sns.heatmap api is here. Check out the parameters, there are a good number of them. Example:

import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline

idx = ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
cols = list('ABCD')
df = pd.DataFrame(abs(np.random.randn(5, 4)), index=idx, columns=cols)

# _r reverses the normal order of the color map 'RdYlGn'
sns.heatmap(df, cmap='RdYlGn_r', linewidths=0.5, annot=True)


Answer 4

If you want an interactive heatmap from a Pandas DataFrame and you are running a Jupyter notebook, you can try the interactive Clustergrammer-Widget; see the interactive notebook on NBViewer here and the documentation here.

And for larger datasets you can try the in-development Clustergrammer2 WebGL widget (example notebook here)


Answer 5

Please note that the authors of seaborn only want seaborn.heatmap to work with categorical dataframes. It’s not general.

If your index and columns are numeric and/or datetime values, this code will serve you well.

Matplotlib heat-mapping function pcolormesh requires bins instead of indices, so there is some fancy code to build bins from your dataframe indices (even if your index isn’t evenly spaced!).

The rest is simply np.meshgrid and plt.pcolormesh.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def conv_index_to_bins(index):
    """Calculate bins to contain the index values.
    The start and end bin boundaries are linearly extrapolated from 
    the two first and last values. The middle bin boundaries are 
    midpoints.

    Example 1: [0, 1] -> [-0.5, 0.5, 1.5]
    Example 2: [0, 1, 4] -> [-0.5, 0.5, 2.5, 5.5]
    Example 3: [4, 1, 0] -> [5.5, 2.5, 0.5, -0.5]"""
    assert index.is_monotonic_increasing or index.is_monotonic_decreasing

    # the beginning and end values are guessed from first and last two
    start = index[0] - (index[1]-index[0])/2
    end = index[-1] + (index[-1]-index[-2])/2

    # the middle values are the midpoints
    middle = pd.DataFrame({'m1': index[:-1], 'p1': index[1:]})
    middle = middle['m1'] + (middle['p1']-middle['m1'])/2

    if isinstance(index, pd.DatetimeIndex):
        idx = pd.DatetimeIndex(middle).union([start,end])
    elif isinstance(index, (pd.Float64Index,pd.RangeIndex,pd.Int64Index)):
        idx = pd.Float64Index(middle).union([start,end])
    else:
        print('Warning: guessing what to do with index type %s' % 
              type(index))
        idx = pd.Float64Index(middle).union([start,end])

    return idx.sort_values(ascending=index.is_monotonic_increasing)

def calc_df_mesh(df):
    """Calculate the two-dimensional bins to hold the index and 
    column values."""
    return np.meshgrid(conv_index_to_bins(df.index),
                       conv_index_to_bins(df.columns))

def heatmap(df):
    """Plot a heatmap of the dataframe values using the index and 
    columns"""
    X,Y = calc_df_mesh(df)
    c = plt.pcolormesh(X, Y, df.values.T)
    plt.colorbar(c)

Call it using heatmap(df), and see it using plt.show().
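As a rough usage sketch (assuming, as above, a pandas version old enough to still provide pd.Float64Index/pd.Int64Index, which were removed in pandas 2.0, and a dataframe whose index and columns are numeric):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical data; the unevenly spaced index is handled by conv_index_to_bins
df = pd.DataFrame(np.random.rand(4, 3),
                  index=[0.0, 1.0, 2.5, 5.0],
                  columns=[10, 20, 30])

heatmap(df)
plt.show()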


Insert a row into a pandas dataframe

Question: Insert a row into a pandas dataframe

I have a dataframe:

s1 = pd.Series([5, 6, 7])
s2 = pd.Series([7, 8, 9])

df = pd.DataFrame([list(s1), list(s2)],  columns =  ["A", "B", "C"])

   A  B  C
0  5  6  7
1  7  8  9

[2 rows x 3 columns]

and I need to add a first row [2, 3, 4] to get:

   A  B  C
0  2  3  4
1  5  6  7
2  7  8  9

I've tried the append() and concat() functions but can't find the right way to do it.

How to add/insert series to dataframe?


Answer 0

Just assign row to a particular index, using loc:

 df.loc[-1] = [2, 3, 4]  # adding a row
 df.index = df.index + 1  # shifting index
 df = df.sort_index()  # sorting by index

And you get, as desired:

    A  B  C
 0  2  3  4
 1  5  6  7
 2  7  8  9

See in Pandas documentation Indexing: Setting with enlargement.


Answer 1

Not sure how you were calling concat() but it should work as long as both objects are of the same type. Maybe the issue is that you need to cast your second vector to a dataframe? Using the df that you defined the following works for me:

df2 = pd.DataFrame([[2,3,4]], columns=['A','B','C'])
pd.concat([df2, df])

Answer 2

One way to achieve this is

>>> pd.DataFrame(np.array([[2, 3, 4]]), columns=['A', 'B', 'C']).append(df, ignore_index=True)
Out[330]: 
   A  B  C
0  2  3  4
1  5  6  7
2  7  8  9

Generally, it’s easiest to append dataframes, not series. In your case, since you want the new row to be “on top” (with starting id), and there is no function pd.prepend(), I first create the new dataframe and then append your old one.

ignore_index discards the old index of the appended dataframe and renumbers the result consecutively, so the original rows follow on from index 1 instead of starting over at index 0.

Typical Disclaimer: Cetero censeo … appending rows is a quite inefficient operation. If you care about performance and can somehow ensure to first create a dataframe with the correct (longer) index and then just inserting the additional row into the dataframe, you should definitely do that. See:

>>> index = np.array([0, 1, 2])
>>> df2 = pd.DataFrame(columns=['A', 'B', 'C'], index=index)
>>> df2.loc[0:1] = [list(s1), list(s2)]
>>> df2
Out[336]: 
     A    B    C
0    5    6    7
1    7    8    9
2  NaN  NaN  NaN
>>> df2 = pd.DataFrame(columns=['A', 'B', 'C'], index=index)
>>> df2.loc[1:] = [list(s1), list(s2)]

So far, we have what you had as df:

>>> df2
Out[339]: 
     A    B    C
0  NaN  NaN  NaN
1    5    6    7
2    7    8    9

But now you can easily insert the row as follows. Since the space was preallocated, this is more efficient.

>>> df2.loc[0] = np.array([2, 3, 4])
>>> df2
Out[341]: 
   A  B  C
0  2  3  4
1  5  6  7
2  7  8  9

Answer 3

I put together a short function that allows for a little more flexibility when inserting a row:

def insert_row(idx, df, df_insert):
    dfA = df.iloc[:idx, ]
    dfB = df.iloc[idx:, ]

    df = dfA.append(df_insert).append(dfB).reset_index(drop = True)

    return df

which could be further shortened to:

def insert_row(idx, df, df_insert):
    return df.iloc[:idx, ].append(df_insert).append(df.iloc[idx:, ]).reset_index(drop = True)

Then you could use something like:

df = insert_row(2, df, df_new)

where 2 is the index position in df where you want to insert df_new.


Answer 4

We can use numpy.insert. This has the advantage of flexibility. You only need to specify the index you want to insert to.

s1 = pd.Series([5, 6, 7])
s2 = pd.Series([7, 8, 9])

df = pd.DataFrame([list(s1), list(s2)],  columns =  ["A", "B", "C"])

pd.DataFrame(np.insert(df.values, 0, values=[2, 3, 4], axis=0))

    0   1   2
0   2   3   4
1   5   6   7
2   7   8   9

For np.insert(df.values, 0, values=[2, 3, 4], axis=0), 0 tells the function the place/index you want to place the new values.


Answer 5

This might seem overly simple, but it's incredible that a simple insert-new-row function isn't built in. I've read a lot about appending a new df to the original, but I'm wondering if this would be faster.

df.loc[0] = [row1data, blah...]
i = len(df)  # index label for the next row
df.loc[i] = [row2data, blah...]

Answer 6

Below would be the best way to insert a row into pandas dataframe without sorting and reseting an index:

import pandas as pd

df = pd.DataFrame(columns=['a','b','c'])

def insert(df, row):
    insert_loc = df.index.max()

    if pd.isna(insert_loc):
        df.loc[0] = row
    else:
        df.loc[insert_loc + 1] = row

insert(df,[2,3,4])
insert(df,[8,9,0])
print(df)

Answer 7

concat() seems to be a bit faster than last row insertion and reindexing. In case someone would wonder about the speed of two top approaches:

In [x]: %%timeit
     ...: df = pd.DataFrame(columns=['a','b'])
     ...: for i in range(10000):
     ...:     df.loc[-1] = [1,2]
     ...:     df.index = df.index + 1
     ...:     df = df.sort_index()

17.1 s ± 705 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [y]: %%timeit
     ...: df = pd.DataFrame(columns=['a', 'b'])
     ...: for i in range(10000):
     ...:     df = pd.concat([pd.DataFrame([[1,2]], columns=df.columns), df])

6.53 s ± 127 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Answer 8

It is pretty simple to add a row into a pandas DataFrame (a short sketch follows the steps below):

  1. Create a regular Python dictionary with the same columns names as your Dataframe;

  2. Use pandas.append() method and pass in the name of your dictionary, where .append() is a method on DataFrame instances;

  3. Add ignore_index=True right after your dictionary name.
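As a rough sketch of those steps (it assumes a pandas version where DataFrame.append() is still available; append() was deprecated in pandas 1.4 and removed in 2.0):

import pandas as pd

df = pd.DataFrame({"A": [5, 7], "B": [6, 8], "C": [7, 9]})

# 1. a plain dict whose keys match the column names
new_row = {"A": 2, "B": 3, "C": 4}

# 2. and 3. append it, letting pandas renumber the index
df = df.append(new_row, ignore_index=True)
print(df)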


Answer 9

You can simply append the row to the end of the DataFrame, and then adjust the index.

For instance:

df = df.append(pd.DataFrame([[2,3,4]],columns=df.columns),ignore_index=True)  # new row is appended at the end
df.index = (df.index + 1) % len(df)  # rotate the labels so the appended row gets index 0
df = df.sort_index()  # sorting by index moves it to the front

Or use concat as:

df = pd.concat([pd.DataFrame([[1,2,3,4,5,6]],columns=df.columns),df],ignore_index=True)

Answer 10

The simplest way to add a row to a pandas data frame is:

DataFrame.loc[ location of insertion ] = list( )

Example:

DF.loc[9] = ['Pepe', 33, 'Japan']

NB: the length of your list should match the number of columns in the data frame.


Filter a Pyspark dataframe column with None values

Question: Filter a Pyspark dataframe column with None values

I’m trying to filter a PySpark dataframe that has None as a row value:

df.select('dt_mvmt').distinct().collect()

[Row(dt_mvmt=u'2016-03-27'),
 Row(dt_mvmt=u'2016-03-28'),
 Row(dt_mvmt=u'2016-03-29'),
 Row(dt_mvmt=None),
 Row(dt_mvmt=u'2016-03-30'),
 Row(dt_mvmt=u'2016-03-31')]

and I can filter correctly with a string value:

df[df.dt_mvmt == '2016-03-31']
# some results here

but this fails:

df[df.dt_mvmt == None].count()
0
df[df.dt_mvmt != None].count()
0

But there are definitely values on each category. What’s going on?


Answer 0

You can use Column.isNull / Column.isNotNull:

df.where(col("dt_mvmt").isNull())

df.where(col("dt_mvmt").isNotNull())

If you want to simply drop NULL values you can use na.drop with subset argument:

df.na.drop(subset=["dt_mvmt"])

Equality based comparisons with NULL won’t work because in SQL NULL is undefined so any attempt to compare it with another value returns NULL:

sqlContext.sql("SELECT NULL = NULL").show()
## +-------------+
## |(NULL = NULL)|
## +-------------+
## |         null|
## +-------------+


sqlContext.sql("SELECT NULL != NULL").show()
## +-------------------+
## |(NOT (NULL = NULL))|
## +-------------------+
## |               null|
## +-------------------+

The only valid method to compare value with NULL is IS / IS NOT which are equivalent to the isNull / isNotNull method calls.


Answer 1

Try to just use isNotNull function.

df.filter(df.dt_mvmt.isNotNull()).count()

Answer 2

To obtain entries whose values in the dt_mvmt column are not null we have

df.filter("dt_mvmt is not NULL")

and for entries which are null we have

df.filter("dt_mvmt is NULL")

Answer 3

If you want to keep with the Pandas syntax, this worked for me.

df = df[df.dt_mvmt.isNotNull()]

Answer 4

if column = None

COLUMN_OLD_VALUE
----------------
None
1
None
100
20
------------------

Create a temp table from the data frame and use:

sqlContext.sql("select * from tempTable where column_old_value='None' ").show()

So use : column_old_value='None'


Answer 5

There are multiple ways you can remove/filter the null values from a column in a DataFrame.

Let's create a simple DataFrame with the code below (StringType and col need to be imported for the snippets that follow):

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

date = ['2016-03-27', '2016-03-28', '2016-03-29', None, '2016-03-30', '2016-03-31']
df = spark.createDataFrame(date, StringType())

Now you can try one of the approaches below to filter out the null values.

# Approach - 1
df.filter("value is not null").show()

# Approach - 2
df.filter(col("value").isNotNull()).show()

# Approach - 3
df.filter(df["value"].isNotNull()).show()

# Approach - 4
df.filter(df.value.isNotNull()).show()

# Approach - 5
df.na.drop(subset=["value"]).show()

# Approach - 6
df.dropna(subset=["value"]).show()

# Note: You can also use where function instead of a filter.

You can also check the section “Working with NULL Values” on my blog for more information.

I hope it helps.


Answer 6

PySpark provides various filtering options based on arithmetic, logical and other conditions. Presence of NULL values can hamper further processes. Removing them or statistically imputing them could be a choice.

Below set of code can be considered:

# Dataset is df
# Column name is dt_mvmt
# Before filtering make sure you have the right count of the dataset
df.count() # Some number

# Filter here
df = df.filter(df.dt_mvmt.isNotNull())

# Check the count to ensure there are NULL values present (This is important when dealing with large dataset)
df.count() # Count should be reduced if NULL values are present

Answer 7

I would also try:

df = df.dropna(subset=["dt_mvmt"])


Answer 8

If you want to filter out records having None value in column then see below example:

df=spark.createDataFrame([[123,"abc"],[234,"fre"],[345,None]],["a","b"])

Now filter out null value records:

df=df.filter(df.b.isNotNull())

df.show()

If you want to remove those records from DF then see below:

df1=df.na.drop(subset=['b'])

df1.show()

Answer 9

None/Null is a data type of the class NoneType in pyspark/python, so the comparisons below will not work, because they try to compare a NoneType object with a string object.

Wrong way of filtering:

df[df.dt_mvmt == None].count()  # 0
df[df.dt_mvmt != None].count()  # 0

Correct:

df = df.where(col("dt_mvmt").isNotNull())

which returns all records whose dt_mvmt is not None/Null.


Convert python pandas dataframe columns to dict keys and values

Question: Convert python pandas dataframe columns to dict keys and values

I have a pandas data frame with multiple columns and I would like to construct a dict from two columns: one as the dict’s keys and the other as the dict’s values. How can I do that?

Dataframe:

           area  count
co tp
DE Lake      10      7
Forest       20      5
FR Lake      30      2
Forest       40      3

I need to define area as key, count as value in dict. Thank you in advance.


Answer 0

If lakes is your DataFrame, you can do something like

area_dict = dict(zip(lakes.area, lakes.count))
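One caveat: count is also the name of a DataFrame method, so attribute access like lakes.count returns that bound method rather than the column. A safer variant of the same idea uses bracket indexing:

area_dict = dict(zip(lakes['area'], lakes['count']))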

Answer 1

With pandas it can be done as:

If lakes is your DataFrame:

area_dict = lakes.to_dict('records')
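Note that to_dict('records') returns a list with one dict per row (for example [{'area': 10, 'count': 7}, ...]) rather than a single area-to-count mapping. If the mapping itself is what you want, one option is:

area_dict = lakes.set_index('area')['count'].to_dict()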

Answer 2

You can also do this if you want to play around with pandas. However, I like punchagan’s way.

# replicating your dataframe
lake = pd.DataFrame({'co tp': ['DE Lake', 'Forest', 'FR Lake', 'Forest'], 
                 'area': [10, 20, 30, 40], 
                 'count': [7, 5, 2, 3]})
lake.set_index('co tp', inplace=True)

# to get key value using pandas
area_dict = lake.set_index('area').T.to_dict('records')[0]
print(area_dict)

output: {10: 7, 20: 5, 30: 2, 40: 3}

Convert a pandas dataframe to a series

Question: Convert a pandas dataframe to a series

I’m somewhat new to pandas. I have a pandas data frame that is 1 row by 23 columns.

I want to convert this into a series? I’m wondering what the most pythonic way to do this is?

I’ve tried pd.Series(myResults) but it complains ValueError: cannot copy sequence with size 23 to array axis with dimension 1. It’s not smart enough to realize it’s still a “vector” in math terms.

Thanks!


Answer 0

It’s not smart enough to realize it’s still a “vector” in math terms.

Say rather that it’s smart enough to recognize a difference in dimensionality. :-)

I think the simplest thing you can do is select that row positionally using iloc, which gives you a Series with the columns as the new index and the values as the values:

>>> df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])
>>> df
   a0  a1  a2  a3  a4
0   0   1   2   3   4
>>> df.iloc[0]
a0    0
a1    1
a2    2
a3    3
a4    4
Name: 0, dtype: int64
>>> type(_)
<class 'pandas.core.series.Series'>

Answer 1

You can transpose the single-row dataframe (which still results in a dataframe) and then squeeze the results into a series (the inverse of to_frame).

df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])

>>> df.T.squeeze()  # Or more simply, df.squeeze() for a single row dataframe.
a0    0
a1    1
a2    2
a3    3
a4    4
Name: 0, dtype: int64

Note: To accommodate the point raised by @IanS (even though it is not in the OP's question), test for the dataframe's size. I am assuming that df is a dataframe, but the edge cases are an empty dataframe, a dataframe of shape (1, 1), and a dataframe with more than one row, in which case the user should implement their desired functionality.

if df.empty:
    # Empty dataframe, so convert to empty Series.
    result = pd.Series()
elif df.shape == (1, 1):
    # DataFrame with one value, so convert to series with appropriate index.
    result = pd.Series(df.iat[0, 0], index=df.columns)
elif len(df) == 1:
    # Convert to series per OP's question.
    result = df.T.squeeze()
else:
    # Dataframe with multiple rows.  Implement desired behavior.
    pass

This can also be simplified along the lines of the answer provided by @themachinist.

if len(df) > 1:
    # Dataframe with multiple rows.  Implement desired behavior.
    pass
else:
    result = pd.Series() if df.empty else df.iloc[0, :]

Answer 2

You can retrieve the series through slicing your dataframe using one of these two methods:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html

import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randn(1,8))

series1=df.iloc[0,:]
type(series1)
pandas.core.series.Series

Answer 3

Another way –

Suppose myResult is the dataFrame that contains your data in the form of 1 col and 23 rows

# label your columns by passing a list of names
myResult.columns = ['firstCol']

# fetch the column in this way, which will return you a series
myResult = myResult['firstCol']

print(type(myResult))

In similar fashion, you can get series from Dataframe with multiple columns.


Answer 4

You can also use stack():

import pandas as pd

df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])

After you run df, then run:

df.stack()

You get the dataframe's values back as a series.


Answer 5

data = pd.DataFrame({"a":[1,2,3,34],"b":[5,6,7,8]})
new_data = pd.melt(data)
new_data.set_index("variable", inplace=True)

This gives a dataframe whose index holds the original column names, with all the data in the 'value' column.


Python Pandas: replace NaN in one column with the value from the corresponding row of a second column

Question: Python Pandas: replace NaN in one column with the value from the corresponding row of a second column

I am working with this Pandas DataFrame in Python.

File    heat    Farheit Temp_Rating
   1    YesQ         75         N/A
   1    NoR         115         N/A
   1    YesA         63         N/A
   1    NoT          83          41
   1    NoY         100          80
   1    YesZ         56          12
   2    YesQ        111         N/A
   2    NoR          60         N/A
   2    YesA         19         N/A
   2    NoT         106          77
   2    NoY          45          21
   2    YesZ         40          54
   3    YesQ         84         N/A
   3    NoR          67         N/A
   3    YesA         94         N/A
   3    NoT          68          39
   3    NoY          63          46
   3    YesZ         34          81

I need to replace all NaNs in the Temp_Rating column with the value from the Farheit column.

This is what I need:

File        heat    Temp_Rating
   1        YesQ             75
   1         NoR            115
   1        YesA             63
   1        YesQ             41
   1         NoR             80
   1        YesA             12
   2        YesQ            111
   2         NoR             60
   2        YesA             19
   2         NoT             77
   2         NoY             21
   2        YesZ             54
   3        YesQ             84
   3         NoR             67
   3        YesA             94
   3         NoT             39
   3         NoY             46
   3        YesZ             81

If I do a Boolean selection, I can pick out only one of these columns at a time. The problem is if I then try to join them, I am not able to do this while preserving the correct order.

How can I only find Temp_Rating rows with the NaNs and replace them with the value in the same row of the Farheit column?


Answer 0

Assuming your DataFrame is in df:

df.Temp_Rating.fillna(df.Farheit, inplace=True)
del df['Farheit']
df.columns = 'File heat Observations'.split()

First replace any NaN values with the corresponding value of df.Farheit. Delete the 'Farheit' column. Then rename the columns. Here’s the resulting DataFrame:


Answer 1

The above mentioned solutions did not work for me. The method I used was:

df.loc[df['foo'].isnull(),'foo'] = df['bar']

Answer 2

Another way to solve this problem:

import pandas as pd
import numpy as np

ts_df = pd.DataFrame([[1,"YesQ",75,],[1,"NoR",115,],[1,"NoT",63,13],[2,"YesT",43,71]],columns=['File','heat','Farheit','Temp'])


def fx(x):
    if np.isnan(x['Temp']):
        return x['Farheit']
    else:
        return x['Temp']
print(1,ts_df)
ts_df['Temp']=ts_df.apply(lambda x : fx(x),axis=1)

print(2,ts_df)

returns:

(1,    File  heat  Farheit  Temp                                                                                    
0     1  YesQ       75   NaN                                                                                        
1     1   NoR      115   NaN                                                                                        
2     1   NoT       63  13.0                                                                                        
3     2  YesT       43  71.0)                                                                                       
(2,    File  heat  Farheit   Temp                                                                                   
0     1  YesQ       75   75.0                                                                                       
1     1   NoR      115  115.0
2     1   NoT       63   13.0
3     2  YesT       43   71.0)

Move a column by name to the front of a pandas table

Question: Move a column by name to the front of a pandas table

Here is my df:

                             Net   Upper   Lower  Mid  Zsore
Answer option                                                
More than once a day          0%   0.22%  -0.12%   2    65 
Once a day                    0%   0.32%  -0.19%   3    45
Several times a week          2%   2.45%   1.10%   4    78
Once a week                   1%   1.63%  -0.40%   6    65

How can I move a column by name ("Mid") to the front of the table (index 0)? This is what the result should look like:

                             Mid   Upper   Lower  Net  Zsore
Answer option                                                
More than once a day          2   0.22%  -0.12%   0%    65 
Once a day                    3   0.32%  -0.19%   0%    45
Several times a week          4   2.45%   1.10%   2%    78
Once a week                   6   1.63%  -0.40%   1%    65

My current code moves the column by index using df.columns.tolist() but I’d like to shift it by name.


Answer 0
We can use ix to reorder by passing a list:

In [27]:
# get a list of columns
cols = list(df)
# move the column to head of list using index, pop and insert
cols.insert(0, cols.pop(cols.index('Mid')))
cols
Out[27]:
['Mid', 'Net', 'Upper', 'Lower', 'Zsore']
In [28]:
# use ix to reorder
df = df.ix[:, cols]
df
Out[28]:
                      Mid Net  Upper   Lower  Zsore
Answer_option                                      
More_than_once_a_day    2  0%  0.22%  -0.12%     65
Once_a_day              3  0%  0.32%  -0.19%     45
Several_times_a_week    4  2%  2.45%   1.10%     78
Once_a_week             6  1%  1.63%  -0.40%     65

Another method is to take a reference to the column and reinsert it at the front:

In [39]:
mid = df['Mid']
df.drop(labels=['Mid'], axis=1,inplace = True)
df.insert(0, 'Mid', mid)
df
Out[39]:
                      Mid Net  Upper   Lower  Zsore
Answer_option                                      
More_than_once_a_day    2  0%  0.22%  -0.12%     65
Once_a_day              3  0%  0.32%  -0.19%     45
Several_times_a_week    4  2%  2.45%   1.10%     78
Once_a_week             6  1%  1.63%  -0.40%     65

You can also use loc to achieve the same result, since ix is deprecated from pandas 0.20.0 onwards:

df = df.loc[:, cols]

Answer 1

Maybe I’m missing something, but a lot of these answers seem overly complicated. You should be able to just set the columns within a single list:

Column to the front:

df = df[ ['Mid'] + [ col for col in df.columns if col != 'Mid' ] ]

Or if instead, you want to move it to the back:

df = df[ [ col for col in df.columns if col != 'Mid' ] + ['Mid'] ]

Or if you wanted to move more than one column:

cols_to_move = ['Mid', 'Zsore']
df           = df[ cols_to_move + [ col for col in df.columns if col not in cols_to_move ] ]

Answer 2

You can use the df.reindex() function in pandas. df is

                      Net  Upper   Lower  Mid  Zsore
Answer option                                      
More than once a day  0%  0.22%  -0.12%    2     65
Once a day            0%  0.32%  -0.19%    3     45
Several times a week  2%  2.45%   1.10%    4     78
Once a week           1%  1.63%  -0.40%    6     65

Define a list of column names:

cols = df.columns.tolist()
cols
Out[13]: ['Net', 'Upper', 'Lower', 'Mid', 'Zsore']

Move the column name to wherever you want:

cols.insert(0, cols.pop(cols.index('Mid')))
cols
Out[16]: ['Mid', 'Net', 'Upper', 'Lower', 'Zsore']

Then use the df.reindex() function to reorder:

df = df.reindex(columns= cols)

The output of df is:

                      Mid  Upper   Lower Net  Zsore
Answer option                                      
More than once a day    2  0.22%  -0.12%  0%     65
Once a day              3  0.32%  -0.19%  0%     45
Several times a week    4  2.45%   1.10%  2%     78
Once a week             6  1.63%  -0.40%  1%     65

Answer 3

I prefer this solution:

col = df.pop("Mid")
df.insert(0, col.name, col)

It’s simpler to read and faster than other suggested answers.

def move_column_inplace(df, col, pos):
    col = df.pop(col)
    df.insert(pos, col.name, col)

Performance assessment:

For this test, the currently last column is moved to the front in each repetition. In-place methods generally perform better. While citynorman’s solution can be made in-place, Ed Chum’s method based on .loc and sachinnm’s method based on reindex cannot.

While other methods are generic, citynorman’s solution is limited to pos=0. I didn’t observe any performance difference between df.loc[cols] and df[cols], which is why I didn’t include some other suggestions.

I tested with python 3.6.8 and pandas 0.24.2 on a MacBook Pro (Mid 2015).

import numpy as np
import pandas as pd

n_cols = 11
df = pd.DataFrame(np.random.randn(200000, n_cols),
                  columns=range(n_cols))

def move_column_inplace(df, col, pos):
    col = df.pop(col)
    df.insert(pos, col.name, col)

def move_to_front_normanius_inplace(df, col):
    move_column_inplace(df, col, 0)
    return df

def move_to_front_chum(df, col):
    cols = list(df)
    cols.insert(0, cols.pop(cols.index(col)))
    return df.loc[:, cols]

def move_to_front_chum_inplace(df, col):
    col = df[col]
    df.drop(col.name, axis=1, inplace=True)
    df.insert(0, col.name, col)
    return df

def move_to_front_elpastor(df, col):
    cols = [col] + [ c for c in df.columns if c!=col ]
    return df[cols] # or df.loc[cols]

def move_to_front_sachinmm(df, col):
    cols = df.columns.tolist()
    cols.insert(0, cols.pop(cols.index(col)))
    df = df.reindex(columns=cols, copy=False)
    return df

def move_to_front_citynorman_inplace(df, col):
    # This approach exploits that reset_index() moves the index
    # at the first position of the data frame.
    df.set_index(col, inplace=True)
    df.reset_index(inplace=True)
    return df

def test(method, df):
    col = np.random.randint(0, n_cols)
    method(df, col)

col = np.random.randint(0, n_cols)
ret_mine = move_to_front_normanius_inplace(df.copy(), col)
ret_chum1 = move_to_front_chum(df.copy(), col)
ret_chum2 = move_to_front_chum_inplace(df.copy(), col)
ret_elpas = move_to_front_elpastor(df.copy(), col)
ret_sach = move_to_front_sachinmm(df.copy(), col)
ret_city = move_to_front_citynorman_inplace(df.copy(), col)

# Assert equivalence of solutions.
assert(ret_mine.equals(ret_chum1))
assert(ret_mine.equals(ret_chum2))
assert(ret_mine.equals(ret_elpas))
assert(ret_mine.equals(ret_sach))
assert(ret_mine.equals(ret_city))

Results:

# For n_cols = 11:
%timeit test(move_to_front_normanius_inplace, df)
# 1.05 ms ± 42.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit test(move_to_front_citynorman_inplace, df)
# 1.68 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit test(move_to_front_sachinmm, df)
# 3.24 ms ± 96.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test(move_to_front_chum, df)
# 3.84 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test(move_to_front_elpastor, df)
# 3.85 ms ± 58.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test(move_to_front_chum_inplace, df)
# 9.67 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# For n_cols = 31:
%timeit test(move_to_front_normanius_inplace, df)
# 1.26 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test(move_to_front_citynorman_inplace, df)
# 1.95 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test(move_to_front_sachinmm, df)
# 10.7 ms ± 348 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test(move_to_front_chum, df)
# 11.5 ms ± 869 µs per loop (mean ± std. dev. of 7 runs, 100 loops each
%timeit test(move_to_front_elpastor, df)
# 11.4 ms ± 598 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test(move_to_front_chum_inplace, df)
# 31.4 ms ± 1.89 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Answer 4

I didn’t like how I had to explicitly specify all the other column in the other solutions so this worked best for me. Though it might be slow for large dataframes…?

df = df.set_index('Mid').reset_index()


Answer 5

Here is a generic set of code that I frequently use to rearrange the position of columns. You may find it useful.

cols = df.columns.tolist()
n = int(cols.index('Mid'))
cols = [cols[n]] + cols[:n] + cols[n+1:]
df = df[cols]
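If you use this snippet often, it can be handy to wrap it in a small helper; a minimal sketch (the function name move_col_to_front is my own, not from the answer):

import pandas as pd

def move_col_to_front(df, col_name):
    # Reorder the columns so that col_name comes first, leaving the rest as-is.
    cols = df.columns.tolist()
    n = cols.index(col_name)
    cols = [cols[n]] + cols[:n] + cols[n+1:]
    return df[cols]

# Usage, assuming a frame that has a 'Mid' column:
# df = move_col_to_front(df, 'Mid')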

回答 6

要重新排列 DataFrame 的列,只需使用如下列表即可。

df = df[['Mid', 'Net', 'Upper', 'Lower', 'Zsore']]

这使得以后阅读代码时所做的工作非常明显。也可以使用:

df.columns
Out[1]: Index(['Net', 'Upper', 'Lower', 'Mid', 'Zsore'], dtype='object')

然后剪切并粘贴以重新排序。


对于具有许多列的DataFrame,将列的列表存储在变量中,然后将所需的列弹出到列表的前面。这是一个例子:

cols = [str(col_name) for col_name in range(1001)]
data = np.random.rand(10,1001)
df = pd.DataFrame(data=data, columns=cols)

mv_col = cols.pop(cols.index('77'))
df = df[[mv_col] + cols]

现在 df.columns 的内容如下:

Index(['77', '0', '1', '2', '3', '4', '5', '6', '7', '8',
       ...
       '991', '992', '993', '994', '995', '996', '997', '998', '999', '1000'],
      dtype='object', length=1001)

To reorder the columns of a DataFrame, just use a list as follows.

df = df[['Mid', 'Net', 'Upper', 'Lower', 'Zsore']]

This makes it very obvious what was done when reading the code later. Also use:

df.columns
Out[1]: Index(['Net', 'Upper', 'Lower', 'Mid', 'Zsore'], dtype='object')

Then cut and paste to reorder.


For a DataFrame with many columns, store the list of columns in a variable and pop the desired column to the front of the list. Here is an example:

cols = [str(col_name) for col_name in range(1001)]
data = np.random.rand(10,1001)
df = pd.DataFrame(data=data, columns=cols)

mv_col = cols.pop(cols.index('77'))
df = df[[mv_col] + cols]

Now df.columns gives:

Index(['77', '0', '1', '2', '3', '4', '5', '6', '7', '8',
       ...
       '991', '992', '993', '994', '995', '996', '997', '998', '999', '1000'],
      dtype='object', length=1001)

回答 7

这是一个非常简单的答案。

不要忘了在列名外面加上两层括号 (( )),否则会报错。


# here you can add below line and it should work 
df = df[list(('Mid','Upper', 'Lower', 'Net','Zsore'))]
df

                             Mid   Upper   Lower  Net  Zsore
Answer option                                                
More than once a day          2   0.22%  -0.12%   0%    65 
Once a day                    3   0.32%  -0.19%   0%    45
Several times a week          4   2.45%   1.10%   2%    78
Once a week                   6   1.63%  -0.40%   1%    65

Here is a very simple answer to this.

Don’t forget the two (()) brackets around the column names. Otherwise, it’ll give you an error.


# here you can add below line and it should work 
df = df[list(('Mid','Upper', 'Lower', 'Net','Zsore'))]
df

                             Mid   Upper   Lower  Net  Zsore
Answer option                                                
More than once a day          2   0.22%  -0.12%   0%    65 
Once a day                    3   0.32%  -0.19%   0%    45
Several times a week          4   2.45%   1.10%   2%    78
Once a week                   6   1.63%  -0.40%   1%    65

回答 8

您可以尝试的最简单的方法是:

df = df[['Mid', 'Upper', 'Lower', 'Net', 'Zsore']]

The simplest thing you can try is:

df = df[['Mid', 'Upper', 'Lower', 'Net', 'Zsore']]

当在apply中也计算出先前值时,Pandas中有没有一种方法可以使用dataframe.apply中的先前行值?

问题:当在apply中也计算出先前值时,Pandas中有没有一种方法可以使用dataframe.apply中的先前行值?

我有以下数据框:

 Index_Date    A    B    C    D
 ===============================
 2015-01-31    10   10   NaN  10
 2015-02-01     2    3   NaN  22
 2015-02-02    10   60   NaN  280
 2015-02-03    10  100   NaN  250

要求:

 Index_Date    A    B    C    D
 ===============================
 2015-01-31    10   10   10   10
 2015-02-01     2    3   23   22
 2015-02-02    10   60   290  280
 2015-02-03    10   100  3000 250

C 列在 2015-01-31 的值直接取自 D 列的值。

然后,我需要用 2015-01-31 的 C 值乘以 2015-02-01 的 A 值,再加上 B。

我尝试过结合 if/else 使用 apply 和 shift,但这会产生 key error。

I have the following dataframe:

 Index_Date    A    B    C    D
 ===============================
 2015-01-31    10   10   NaN  10
 2015-02-01     2    3   NaN  22
 2015-02-02    10   60   NaN  280
 2015-02-03    10  100   NaN  250

Require:

 Index_Date    A    B    C    D
 ===============================
 2015-01-31    10   10   10   10
 2015-02-01     2    3   23   22
 2015-02-02    10   60   290  280
 2015-02-03    10   100  3000 250

Column C is derived for 2015-01-31 by taking value of D.

Then I need to use the value of C for 2015-01-31 and multiply by the value of A on 2015-02-01 and add B.

I have attempted an apply and a shift using an if else, but this gives a key error.
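Restated as a recurrence (my paraphrase, not part of the original question), the value to fill in is:

C[0] = D[0]
C[i] = C[i-1] * A[i] + B[i]   # for every subsequent row i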


回答 0

首先,创建派生值:

df.loc[0, 'C'] = df.loc[0, 'D']

然后遍历其余行并填充计算出的值:

for i in range(1, len(df)):
    df.loc[i, 'C'] = df.loc[i-1, 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']


  Index_Date   A   B    C    D
0 2015-01-31  10  10   10   10
1 2015-02-01   2   3   23   22
2 2015-02-02  10  60  290  280

First, create the derived value:

df.loc[0, 'C'] = df.loc[0, 'D']

Then iterate through the remaining rows and fill the calculated values:

for i in range(1, len(df)):
    df.loc[i, 'C'] = df.loc[i-1, 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']


  Index_Date   A   B    C    D
0 2015-01-31  10  10   10   10
1 2015-02-01   2   3   23   22
2 2015-02-02  10  60  290  280

回答 1

给定一列数字:

lst = []
cols = ['A']
for a in range(100, 105):
    lst.append([a])
df = pd.DataFrame(lst, columns=cols, index=range(5))
df

    A
0   100
1   101
2   102
3   103
4   104

您可以使用shift引用上一行:

df['Change'] = df.A - df.A.shift(1)
df

    A   Change
0   100 NaN
1   101 1.0
2   102 1.0
3   103 1.0
4   104 1.0

Given a column of numbers:

lst = []
cols = ['A']
for a in range(100, 105):
    lst.append([a])
df = pd.DataFrame(lst, columns=cols, index=range(5))
df

    A
0   100
1   101
2   102
3   103
4   104

You can reference the previous row with shift:

df['Change'] = df.A - df.A.shift(1)
df

    A   Change
0   100 NaN
1   101 1.0
2   102 1.0
3   103 1.0
4   104 1.0

回答 2

numba

对于不可矢量化的递归计算,numba(使用 JIT 编译,并直接操作更低层的对象)通常会带来较大的性能改进。您只需要定义一个常规的 for 循环,并使用装饰器 @njit 或(对于旧版本)@jit(nopython=True):

对于合理大小的数据帧,与常规for循环相比,这可以将性能提高约30倍:

from numba import jit

@jit(nopython=True)
def calculator_nb(a, b, d):
    res = np.empty(d.shape)
    res[0] = d[0]
    for i in range(1, res.shape[0]):
        res[i] = res[i-1] * a[i] + b[i]
    return res

df['C'] = calculator_nb(*df[list('ABD')].values.T)

n = 10**5
df = pd.concat([df]*n, ignore_index=True)

# benchmarking on Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3, Numba 0.30.1
# calculator() is same as calculator_nb() but without @jit decorator
%timeit calculator_nb(*df[list('ABD')].values.T)  # 14.1 ms per loop
%timeit calculator(*df[list('ABD')].values.T)     # 444 ms per loop

numba

For recursive calculations which are not vectorisable, numba, which uses JIT-compilation and works with lower level objects, often yields large performance improvements. You need only define a regular for loop and use the decorator @njit or (for older versions) @jit(nopython=True):

For a reasonable size dataframe, this gives a ~30x performance improvement versus a regular for loop:

from numba import jit

@jit(nopython=True)
def calculator_nb(a, b, d):
    res = np.empty(d.shape)
    res[0] = d[0]
    for i in range(1, res.shape[0]):
        res[i] = res[i-1] * a[i] + b[i]
    return res

df['C'] = calculator_nb(*df[list('ABD')].values.T)

n = 10**5
df = pd.concat([df]*n, ignore_index=True)

# benchmarking on Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3, Numba 0.30.1
# calculator() is same as calculator_nb() but without @jit decorator
%timeit calculator_nb(*df[list('ABD')].values.T)  # 14.1 ms per loop
%timeit calculator(*df[list('ABD')].values.T)     # 444 ms per loop
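The benchmark comment refers to calculator(), which is not shown in the answer; a plain-Python reconstruction (my sketch: the same body as calculator_nb(), just without the @jit decorator) would be:

import numpy as np

def calculator(a, b, d):
    # Identical logic to calculator_nb(), but interpreted rather than JIT-compiled.
    res = np.empty(d.shape)
    res[0] = d[0]
    for i in range(1, res.shape[0]):
        res[i] = res[i-1] * a[i] + b[i]
    return res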

回答 3

在numpy数组上应用递归函数将比当前答案更快。

df = pd.DataFrame(np.repeat(np.arange(1, 6), 3).reshape(5, 3), columns=['A', 'B', 'D'])
new = [df.D.values[0]]
for i in range(1, len(df.index)):
    new.append(new[i-1]*df.A.values[i]+df.B.values[i])
df['C'] = new

输出

      A  B  D    C
   0  1  1  1    1
   1  2  2  2    4
   2  3  3  3   15
   3  4  4  4   64
   4  5  5  5  325

Applying the recursive function on numpy arrays will be faster than the current answer.

df = pd.DataFrame(np.repeat(np.arange(1, 6), 3).reshape(5, 3), columns=['A', 'B', 'D'])
new = [df.D.values[0]]
for i in range(1, len(df.index)):
    new.append(new[i-1]*df.A.values[i]+df.B.values[i])
df['C'] = new

Output

      A  B  D    C
   0  1  1  1    1
   1  2  2  2    4
   2  3  3  3   15
   3  4  4  4   64
   4  5  5  5  325

回答 4

尽管问这个问题已经有一段时间了,但我还是会发表我的答案,希望对大家有所帮助。

免责声明:我知道此解决方案不是标准的,但我认为它很好用。

import pandas as pd
import numpy as np

data = np.array([[10, 2, 10, 10],
                 [10, 3, 60, 100],
                 [np.nan] * 4,
                 [10, 22, 280, 250]]).T
idx = pd.date_range('20150131', end='20150203')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df
               A    B     C    D
 =================================
 2015-01-31    10   10    NaN  10
 2015-02-01    2    3     NaN  22 
 2015-02-02    10   60    NaN  280
 2015-02-03    10   100   NaN  250

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
df
               A    B     C     D
 =================================
 2015-01-31    10   10    10    10
 2015-02-01    2    3     23    22 
 2015-02-02    10   60    290   280
 2015-02-03    10   100   3000  250

因此,基本上,我们借助 pandas 的 apply 和一个全局变量来跟踪先前计算出的值。


for循环时间比较:

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

df.loc['2015-01-31', 'C'] = df.loc['2015-01-31', 'D']

%%timeit
for i in df.loc['2015-02-01':].index.date:
    df.loc[i, 'C'] = df.loc[(i - pd.DateOffset(days=1)).date(), 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']

每个循环3.2 s±114毫秒(平均±标准偏差,共运行7次,每个循环1次)

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value

%%timeit
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)

每个循环1.82 s±64.4 ms(平均±标准偏差,共7次运行,每个循环1次)

因此 apply 版本平均耗时约为 for 循环版本的 0.57 倍,也就是大约快 1.75 倍。

Although it has been a while since this question was asked, I will post my answer hoping it helps somebody.

Disclaimer: I know this solution is not standard, but I think it works well.

import pandas as pd
import numpy as np

data = np.array([[10, 2, 10, 10],
                 [10, 3, 60, 100],
                 [np.nan] * 4,
                 [10, 22, 280, 250]]).T
idx = pd.date_range('20150131', end='20150203')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df
               A    B     C    D
 =================================
 2015-01-31    10   10    NaN  10
 2015-02-01    2    3     NaN  22 
 2015-02-02    10   60    NaN  280
 2015-02-03    10   100   NaN  250

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
df
               A    B     C     D
 =================================
 2015-01-31    10   10    10    10
 2015-02-01    2    3     23    22 
 2015-02-02    10   60    290   280
 2015-02-03    10   100   3000  250

So basically we use apply from pandas with the help of a global variable that keeps track of the previously calculated value.


Time comparison with a for loop:

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

df.loc['2015-01-31', 'C'] = df.loc['2015-01-31', 'D']

%%timeit
for i in df.loc['2015-02-01':].index.date:
    df.loc[i, 'C'] = df.loc[(i - pd.DateOffset(days=1)).date(), 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']

3.2 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value

%%timeit
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)

1.82 s ± 64.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So the apply version takes about 0.57 times as long as the loop on average, i.e. it is roughly 1.75 times faster.


回答 5

通常,避免显式循环的关键是在rowindex-1 == rowindex上联接(合并)数据框的2个实例。

然后,您将拥有一个包含r和r-1行的大数据框,可以在其中执行df.apply()函数。

但是,创建大型数据集的开销可能抵消了并行处理的好处。

马丁

In general, the key to avoiding an explicit loop would be to join (merge) 2 instances of the dataframe on rowindex-1==rowindex.

Then you would have a big dataframe containing rows of r and r-1, from where you could do a df.apply() function.

However the overhead of creating the large dataset may offset the benefits of parallel processing…

HTH Martin
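A minimal sketch of the self-join idea described above (column names are made up; note that this only exposes the previous raw values of each row, not previously computed results, so it does not by itself solve the recursive C column from the question):

import pandas as pd

df = pd.DataFrame({'A': [10, 2, 10, 10],
                   'B': [10, 3, 60, 100]})

# Align every row with its predecessor; shift(1) plays the role of the
# rowindex-1 == rowindex merge described above.
prev = df.shift(1).add_suffix('_prev')
merged = df.join(prev)

# Each row of `merged` now carries both r and r-1, so a row-wise apply
# can look at both at once (the first row sees NaN for the _prev columns).
merged['example'] = merged.apply(lambda row: row['A'] * row['B_prev'], axis=1)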


如何使用点绘制熊猫数据框的两列?

问题:如何使用点绘制熊猫数据框的两列?

我有一个 pandas 数据框,想绘制其中一列的值与另一列的值的关系。幸运的是,数据帧自带一个 plot 方法,似乎可以满足我的需求:

df.plot(x='col_name_1', y='col_name_2')

不幸的是,在可用的绘图样式中(见文档中 kind 参数列出的选项)似乎没有散点。我可以使用折线图、柱状图甚至密度图,但不能使用散点。有没有变通的方法可以解决这个问题?

I have a pandas data frame and would like to plot values from one column versus the values from another column. Fortunately, there is plot method associated with the data-frames that seems to do what I need:

df.plot(x='col_name_1', y='col_name_2')

Unfortunately, it looks like among the plot styles (listed here after the kind parameter) there are no points. I can use lines or bars or even density, but not points. Is there a workaround that can help to solve this problem?


回答 0

调用 df.plot 时,您可以通过 style 参数指定所绘制点/线的样式:

df.plot(x='col_name_1', y='col_name_2', style='o')

style参数也可以是一个dict或者list,如:

import numpy as np
import pandas as pd

d = {'one' : np.random.rand(10),
     'two' : np.random.rand(10)}

df = pd.DataFrame(d)

df.plot(style=['o','rx'])

所有可接受的样式格式都列在 matplotlib.pyplot.plot 的文档中。

You can specify the style of the plotted line when calling df.plot:

df.plot(x='col_name_1', y='col_name_2', style='o')

The style argument can also be a dict or list, e.g.:

import numpy as np
import pandas as pd

d = {'one' : np.random.rand(10),
     'two' : np.random.rand(10)}

df = pd.DataFrame(d)

df.plot(style=['o','rx'])

All the accepted style formats are listed in the documentation of matplotlib.pyplot.plot.
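Since the answer notes that style also accepts a dict, here is a minimal sketch mapping column names to style strings (reusing the same made-up frame as above):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

d = {'one': np.random.rand(10),
     'two': np.random.rand(10)}
df = pd.DataFrame(d)

# Map each column name to its own marker style instead of passing a list.
df.plot(style={'one': 'o', 'two': 'rx'})
plt.show()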


回答 1

对于这个需求(以及大多数绘图需求),我不会依赖 Pandas 对 matplotlib 的封装,而是直接使用 matplotlib:

import matplotlib.pyplot as plt
plt.scatter(df['col_name_1'], df['col_name_2'])
plt.show() # Depending on whether you use IPython or interactive mode, etc.

并且请记住,您可以通过例如 df.col_name_1.values 来获取某列值对应的 NumPy 数组。

在一列具有毫秒精度的 Timestamp 值的情况下,我在使用 Pandas 默认绘图时遇到了麻烦。在尝试将这些对象转换为 datetime64 类型时,我还发现了一个令人讨厌的问题:<Pandas 在询问 Timestamp 列值是否具有 attr astype 时给出了错误的结果>。

For this (and most plotting) I would not rely on the Pandas wrappers to matplotlib. Instead, just use matplotlib directly:

import matplotlib.pyplot as plt
plt.scatter(df['col_name_1'], df['col_name_2'])
plt.show() # Depending on whether you use IPython or interactive mode, etc.

and remember that you can access a NumPy array of the column’s values with df.col_name_1.values for example.

I ran into trouble using this with Pandas default plotting in the case of a column of Timestamp values with millisecond precision. In trying to convert the objects to datetime64 type, I also discovered a nasty issue: < Pandas gives incorrect result when asking if Timestamp column values have attr astype >.


回答 2

Pandas 使用 matplotlib 作为基本绘图库。在您的情况下,最简单的做法如下:

import pandas as pd
import numpy as np

#creating sample data 
sample_data={'col_name_1':np.random.rand(20),
      'col_name_2': np.random.rand(20)}
df= pd.DataFrame(sample_data)
df.plot(x='col_name_1', y='col_name_2', style='o')

但是,如果您想在不深入 matplotlib 底层的情况下获得更多自定义的图,我建议使用 seaborn 作为替代方案。这种情况下的解决方案如下:

import pandas as pd
import seaborn as sns
import numpy as np

#creating sample data 
sample_data={'col_name_1':np.random.rand(20),
      'col_name_2': np.random.rand(20)}
df= pd.DataFrame(sample_data)
sns.scatterplot(x="col_name_1", y="col_name_2", data=df)

Pandas uses matplotlib as a library for basic plots. The easiest way in your case will be the following:

import pandas as pd
import numpy as np

#creating sample data 
sample_data={'col_name_1':np.random.rand(20),
      'col_name_2': np.random.rand(20)}
df= pd.DataFrame(sample_data)
df.plot(x='col_name_1', y='col_name_2', style='o')

However, I would recommend using seaborn as an alternative solution if you want more customized plots without going down to the basic level of matplotlib. In this case the solution will be the following:

import pandas as pd
import seaborn as sns
import numpy as np

#creating sample data 
sample_data={'col_name_1':np.random.rand(20),
      'col_name_2': np.random.rand(20)}
df= pd.DataFrame(sample_data)
sns.scatterplot(x="col_name_1", y="col_name_2", data=df)


回答 3

现在,在最新的熊猫中,您可以直接使用df.plot.scatter函数

df = pd.DataFrame([[5.1, 3.5, 0], [4.9, 3.0, 0], [7.0, 3.2, 1],
                   [6.4, 3.2, 1], [5.9, 3.0, 2]],
                  columns=['length', 'width', 'species'])
ax1 = df.plot.scatter(x='length',
                      y='width',
                      c='DarkBlue')

https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.plot.scatter.html

Now, in the latest pandas, you can directly use the df.plot.scatter function:

df = pd.DataFrame([[5.1, 3.5, 0], [4.9, 3.0, 0], [7.0, 3.2, 1],
                   [6.4, 3.2, 1], [5.9, 3.0, 2]],
                  columns=['length', 'width', 'species'])
ax1 = df.plot.scatter(x='length',
                      y='width',
                      c='DarkBlue')

https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.plot.scatter.html


如何使Pandas DataFrame列标题全部小写?

问题:如何使Pandas DataFrame列标题全部小写?

我想使我的pandas数据框中的所有列标题都小写

如果我有:

data =

  country country isocode  year     XRAT          tcgdp
0  Canada             CAN  2001  1.54876   924909.44207
1  Canada             CAN  2002  1.56932   957299.91586
2  Canada             CAN  2003  1.40105  1016902.00180
....

我想通过执行以下操作将XRAT更改为xrat:

data.headers.lowercase()

这样我得到:

  country country isocode  year     xrat          tcgdp
0  Canada             CAN  2001  1.54876   924909.44207
1  Canada             CAN  2002  1.56932   957299.91586
2  Canada             CAN  2003  1.40105  1016902.00180
3  Canada             CAN  2004  1.30102  1096000.35500
....

我不会提前知道每个列标题的名称。

I want to make all column headers in my pandas data frame lower case

Example

If I have:

data =

  country country isocode  year     XRAT          tcgdp
0  Canada             CAN  2001  1.54876   924909.44207
1  Canada             CAN  2002  1.56932   957299.91586
2  Canada             CAN  2003  1.40105  1016902.00180
....

I would like to change XRAT to xrat by doing something like:

data.headers.lowercase()

So that I get:

  country country isocode  year     xrat          tcgdp
0  Canada             CAN  2001  1.54876   924909.44207
1  Canada             CAN  2002  1.56932   957299.91586
2  Canada             CAN  2003  1.40105  1016902.00180
3  Canada             CAN  2004  1.30102  1096000.35500
....

I will not know the names of each column header ahead of time.


回答 0

您可以这样做:

data.columns = map(str.lower, data.columns)

要么

data.columns = [x.lower() for x in data.columns]

例:

>>> data = pd.DataFrame({'A':range(3), 'B':range(3,0,-1), 'C':list('abc')})
>>> data
   A  B  C
0  0  3  a
1  1  2  b
2  2  1  c
>>> data.columns = map(str.lower, data.columns)
>>> data
   a  b  c
0  0  3  a
1  1  2  b
2  2  1  c

You can do it like this:

data.columns = map(str.lower, data.columns)

or

data.columns = [x.lower() for x in data.columns]

example:

>>> data = pd.DataFrame({'A':range(3), 'B':range(3,0,-1), 'C':list('abc')})
>>> data
   A  B  C
0  0  3  a
1  1  2  b
2  2  1  c
>>> data.columns = map(str.lower, data.columns)
>>> data
   a  b  c
0  0  3  a
1  1  2  b
2  2  1  c

回答 1

您可以对 columns 使用 str.lower 轻松实现:

df.columns = df.columns.str.lower()

例:

In [63]: df
Out[63]: 
  country country isocode  year     XRAT         tcgdp
0  Canada             CAN  2001  1.54876  9.249094e+05
1  Canada             CAN  2002  1.56932  9.572999e+05
2  Canada             CAN  2003  1.40105  1.016902e+06

In [64]: df.columns = df.columns.str.lower()

In [65]: df
Out[65]: 
  country country isocode  year     xrat         tcgdp
0  Canada             CAN  2001  1.54876  9.249094e+05
1  Canada             CAN  2002  1.56932  9.572999e+05
2  Canada             CAN  2003  1.40105  1.016902e+06

You could do it easily with str.lower for columns:

df.columns = df.columns.str.lower()

Example:

In [63]: df
Out[63]: 
  country country isocode  year     XRAT         tcgdp
0  Canada             CAN  2001  1.54876  9.249094e+05
1  Canada             CAN  2002  1.56932  9.572999e+05
2  Canada             CAN  2003  1.40105  1.016902e+06

In [64]: df.columns = df.columns.str.lower()

In [65]: df
Out[65]: 
  country country isocode  year     xrat         tcgdp
0  Canada             CAN  2001  1.54876  9.249094e+05
1  Canada             CAN  2002  1.56932  9.572999e+05
2  Canada             CAN  2003  1.40105  1.016902e+06

回答 2

如果要通过链式方法调用进行重命名,则可以使用

data.rename(
    columns=unicode.lower
)

(Python 2)

要么

data.rename(
    columns=str.lower
)

(Python 3)

If you want to do the rename using a chained method call, you can use

data.rename(
    columns=unicode.lower
)

(Python 2)

or

data.rename(
    columns=str.lower
)

(Python 3)
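For example, the rename can sit in the middle of a longer method chain; a minimal sketch (the frame and the assign step are made up for illustration):

import pandas as pd

df = (
    pd.DataFrame({'Country': ['Canada'], 'XRAT': [1.54876]})
      .rename(columns=str.lower)   # lowercase the headers mid-chain
      .assign(year=2001)           # keep chaining further operations
)
print(list(df.columns))  # ['country', 'xrat', 'year']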


回答 3

这是一个简单的方法: data.columns = data.columns.str.lower()

Here is a simple way: data.columns = data.columns.str.lower()


回答 4

df.columns = df.columns.str.lower()

是最简单的方法,但是如果某些标头为数字,则会出现错误

如果您有数字标题,请使用以下代码:

df.columns = [str(x).lower() for x in df.columns]
df.columns = df.columns.str.lower()

is the easiest but will give an error if some headers are numeric

if you have numeric headers then use this:

df.columns = [str(x).lower() for x in df.columns]
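A minimal sketch of the mixed-header case (the frame below is made up; the exact behaviour of .str.lower() on non-string labels varies between pandas versions, so casting to str first is the safer route):

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['Name', 2020, 'CITY'])

# Casting every label to str before lowercasing works regardless of label type.
df.columns = [str(x).lower() for x in df.columns]
print(list(df.columns))  # ['name', '2020', 'city']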