Tag Archives: dataframe

Normalize columns of a pandas dataframe

Question: Normalize columns of a pandas dataframe

I have a dataframe in pandas where each column has different value range. For example:

df:

A     B   C
1000  10  0.5
765   5   0.35
800   7   0.09

Any idea how I can normalize the columns of this dataframe where each value is between 0 and 1?

My desired output is:

A     B    C
1     1    1
0.765 0.5  0.7
0.8   0.7  0.18(which is 0.09/0.5)

Answer 0

You can use the package sklearn and its associated preprocessing utilities to normalize the data.

import pandas as pd
from sklearn import preprocessing

x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)

For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.
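
Note that building a new DataFrame from the scaled NumPy array drops the original column names and index. A minimal sketch of preserving them, assuming df holds the example data above:

import pandas as pd
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df.values)
# pass the original labels back in explicitly
df_scaled = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)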


Answer 1

One easy way using Pandas (here I use mean normalization):

normalized_df = (df - df.mean()) / df.std()

To use min-max normalization:

normalized_df = (df - df.min()) / (df.max() - df.min())

Edit: To address some concerns, note that Pandas automatically applies the function column-wise in the code above.
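
As a quick check on the question's sample data (note that min-max scaling maps each column's minimum to 0, which differs from the divide-by-max output the question shows):

import pandas as pd

df = pd.DataFrame({'A': [1000, 765, 800],
                   'B': [10, 5, 7],
                   'C': [0.5, 0.35, 0.09]})
normalized_df = (df - df.min()) / (df.max() - df.min())
print(normalized_df)
#           A    B         C
# 0  1.000000  1.0  1.000000
# 1  0.000000  0.0  0.634146
# 2  0.148936  0.4  0.000000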


Answer 2

Based on this post: https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range

You can do the following:

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

You don't need to worry about whether your values are negative or positive. And the values should be nicely spread out between 0 and 1.
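
For instance, a column containing negative values still lands in [0, 1]. A quick check with made-up data, using the normalize function defined above:

import pandas as pd

df = pd.DataFrame({'A': [-10, 0, 10]})
print(normalize(df))
#      A
# 0  0.0
# 1  0.5
# 2  1.0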


Answer 3

Your problem is actually a simple transform acting on the columns:

def f(s):
    return s/s.max()

frame.apply(f, axis=0)

Or even more terse:

frame.apply(lambda x: x/x.max(), axis=0)
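
Applied to the question's data, this reproduces the desired output exactly, since each value is simply divided by its column's maximum:

import pandas as pd

frame = pd.DataFrame({'A': [1000, 765, 800],
                      'B': [10, 5, 7],
                      'C': [0.5, 0.35, 0.09]})
print(frame.apply(lambda x: x/x.max(), axis=0))
#        A    B     C
# 0  1.000  1.0  1.00
# 1  0.765  0.5  0.70
# 2  0.800  0.7  0.18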

Answer 4

If you like using the sklearn package, you can keep the column and index names by using pandas loc like so:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
scaled_values = scaler.fit_transform(df) 
df.loc[:,:] = scaled_values

Answer 5

Simple is Beautiful:

df["A"] = df["A"] / df["A"].max()
df["B"] = df["B"] / df["B"].max()
df["C"] = df["C"] / df["C"].max()

Answer 6

You can create a list of columns that you want to normalize:

from sklearn.preprocessing import MinMaxScaler

# the scaler was assumed to already exist in the original snippet; define it explicitly
min_max_scaler = MinMaxScaler()
column_names_to_normalize = ['A', 'E', 'G', 'sadasdsd', 'lol']
x = df[column_names_to_normalize].values
x_scaled = min_max_scaler.fit_transform(x)
df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index=df.index)
df[column_names_to_normalize] = df_temp

Your Pandas DataFrame is now normalized only in the columns you want.


However, if you want the opposite and would rather select the list of columns you DON'T want to normalize, you can simply create a list of all columns and remove the non-desired ones:

column_names_to_not_normalize = ['B', 'J', 'K']
column_names_to_normalize = [x for x in list(df) if x not in column_names_to_not_normalize ]

Answer 7

I think that a better way to do that in pandas is just

import numpy as np

df = df/df.max().astype(np.float64)

Edit: if negative numbers are present in your data frame, you should use instead

df = df/df.loc[df.abs().idxmax()].astype(np.float64)
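
As a hedged aside, the same intent (divide each column by its largest absolute value so results stay in [-1, 1]) can also be written by dividing by a Series, which pandas aligns against the columns. A minimal sketch with made-up data, not the answer's exact code:

import pandas as pd

df = pd.DataFrame({'A': [-4, 2, 1], 'B': [1, -2, 4]})
print(df / df.abs().max())
#       A     B
# 0 -1.00  0.25
# 1  0.50 -0.50
# 2  0.25  1.00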

Answer 8

The solutions given by Sandman and Praveen are very good. The only problem is that if you have categorical variables in other columns of your data frame, this method will need some adjustments.

My solution to this type of issue is following:

from sklearn import preprocessing

# gather the numeric columns side by side (axis=1), scale them,
# then put them back next to the categorical column
x = pd.concat([df.Numerical1, df.Numerical2, df.Numerical3], axis=1)
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
x_new = pd.DataFrame(x_scaled, columns=x.columns, index=df.index)
df = pd.concat([df.Categoricals, x_new], axis=1)

Answer 9

Examples of different standardizations in Python.

For reference, see this Wikipedia article: https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation

Example Data

import pandas as pd
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
print(df)
   A    B  C
0  1  100  a
1  2  300  b
2  3  500  c

Normalization using pandas (Gives unbiased estimates)

When normalizing we simply subtract the mean and divide by standard deviation.

df.iloc[:,0:-1] = df.iloc[:,0:-1].apply(lambda x: (x-x.mean())/ x.std(), axis=0)
print(df)
     A    B  C
0 -1.0 -1.0  a
1  0.0  0.0  b
2  1.0  1.0  c

Normalization using sklearn (Gives biased estimates, different from pandas)

If you do the same thing with sklearn you will get DIFFERENT output!

import pandas as pd

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
df.iloc[:,0:-1] = scaler.fit_transform(df.iloc[:,0:-1].to_numpy())
print(df)
          A         B  C
0 -1.224745 -1.224745  a
1  0.000000  0.000000  b
2  1.224745  1.224745  c

Do sklearn's biased estimates make machine learning less powerful?

NO.

The official documentation of sklearn.preprocessing.scale states that using a biased estimator is UNLIKELY to affect the performance of machine learning algorithms, so we can safely use it.

From official documentation:

We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.
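
A quick numeric check of the two conventions on a hypothetical three-value sample; the ratio 1.0/0.8165 is exactly the 1.224745 factor visible in the outputs above:

import numpy as np

x = np.array([1, 2, 3])
print(np.std(x, ddof=0))  # ~0.8165, biased (what sklearn's StandardScaler uses)
print(np.std(x, ddof=1))  # 1.0, unbiased (what pandas' std() uses by default)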

What about MinMax Scaling?

There is no standard deviation calculation in MinMax scaling, so the result is the same in both pandas and scikit-learn.

import pandas as pd
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
             })
(df - df.min()) / (df.max() - df.min())
     A    B
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0


# Using sklearn
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
arr_scaled = scaler.fit_transform(df) 

print(arr_scaled)
[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]

df_scaled = pd.DataFrame(arr_scaled, columns=df.columns,index=df.index)
print(df_scaled)
     A    B
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0

Answer 10

You might want some of the columns to be normalized while the others stay unchanged, as in regression tasks where the data labels or categorical columns are left alone. So I suggest this pythonic way (it's a combination of @shg's and @Cina's answers):

features_to_normalize = ['A', 'B', 'C']
# could be ['A','B'] 

df[features_to_normalize] = df[features_to_normalize].apply(lambda x:(x-x.min()) / (x.max()-x.min()))

Answer 11

It is only simple mathematics. The answer should be as simple as below.

normed_df = (df - df.min()) / (df.max() - df.min())

Answer 12

import numpy as np
import pandas as pd

def normalize(x):
    # scale the column so that its absolute values sum to 1 (L1 norm)
    return x / np.linalg.norm(x, ord=1)

data = data.apply(normalize)

From the pandas documentation, the DataFrame structure can apply an operation (function) to itself.

DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)

Applies function along input axis of DataFrame. Objects passed to functions are Series objects having index either the DataFrame’s index (axis=0) or the columns (axis=1). Return type depends on whether passed function aggregates, or the reduce argument if the DataFrame is empty.

You can apply a custom function to operate on the DataFrame.


Answer 13

The following function calculates the Z score:

def standardization(dataset):
  """ Standardization of numeric fields, where all values will have mean of zero 
  and standard deviation of one. (z-score)

  Args:
    dataset: A `Pandas.Dataframe` 
  """
  dtypes = list(zip(dataset.dtypes.index, map(str, dataset.dtypes)))
  # Normalize numeric columns.
  for column, dtype in dtypes:
      if dtype == 'float32':
          dataset[column] -= dataset[column].mean()
          dataset[column] /= dataset[column].std()
  return dataset

Answer 14

This is how you do it column-wise using a list comprehension:

[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]

Answer 15

You can simply use the pandas.DataFrame.transform function in this way:

df.transform(lambda x: x/x.max())

Answer 16

df_normalized = df / df.max(axis=0)

Answer 17

You can do this in one line

DF_test = DF_test.sub(DF_test.mean(axis=0), axis=1)/DF_test.mean(axis=0)

It takes the mean of each column and subtracts it from every row (the mean of a particular column is subtracted only from that column's rows), then divides by that mean. What we finally get is the normalized data set.


Answer 18

Pandas does column-wise normalization by default. Try the code below.

X = pd.read_csv('.\\data.csv')
X = (X - X.min()) / (X.max() - X.min())

The output values will be in the range of 0 to 1.


Get column index from column name in python pandas

Question: Get column index from column name in python pandas

In R, when you need to retrieve a column index based on the name of the column, you can do

idx <- which(names(my_data)==my_colum_name)

Is there a way to do the same with pandas dataframes?


Answer 0

Sure, you can use .get_loc():

In [45]: df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})

In [46]: df.columns
Out[46]: Index([apple, orange, pear], dtype=object)

In [47]: df.columns.get_loc("pear")
Out[47]: 2

although to be honest I don’t often need this myself. Usually access by name does what I want it to (df["pear"], df[["apple", "orange"]], or maybe df.columns.isin(["orange", "pear"])), although I can definitely see cases where you’d want the index number.


Answer 1

Here is a solution using a list comprehension. cols is the list of columns to get indices for:

[df.columns.get_loc(c) for c in cols if c in df]
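
For example, with the frame from the accepted answer (the cols list here is hypothetical):

import pandas as pd

df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
cols = ['pear', 'orange', 'grape']  # 'grape' is not a column, so it is skipped
print([df.columns.get_loc(c) for c in cols if c in df])
# -> [0, 2] with modern pandas, where columns keep insertion order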

Answer 2

DSM's solution works, but if you wanted a direct equivalent to R's which, you could do (df.columns == name).nonzero().
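
A minimal sketch of that (the reported positions depend on your pandas version's column ordering):

import pandas as pd

df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
positions = (df.columns == 'pear').nonzero()[0]
print(positions)  # array of positions where the name matches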


Answer 3

When you are looking for multiple column matches, a vectorized solution using the searchsorted method can be used. Thus, with df as the dataframe and query_cols as the column names to be searched for, an implementation would be:

def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols,query_cols,sorter=sidx)]

Sample run –

In [162]: df
Out[162]: 
   apple  banana  pear  orange  peach
0      8       3     4       4      2
1      4       4     3       0      1
2      1       2     6       8      1

In [163]: column_index(df, ['peach', 'banana', 'apple'])
Out[163]: array([4, 1, 0])

Answer 4

In case you want the column name from the column location (the other way around from the OP's question), you can use:

>>> df.columns.get_values()[location]

Using @DSM's example:

>>> df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})

>>> df.columns

Index(['apple', 'orange', 'pear'], dtype='object')

>>> df.columns.get_values()[1]

'orange'

Other ways:

df.iloc[:,1].name

df.columns[location] #(thanks to @roobie-nuby for pointing that out in comments.) 

Answer 5

How about this:

df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
out = np.argwhere(df.columns.isin(['apple', 'orange'])).ravel()
print(out)
[1 2]

Replace NaN with blank/empty string in pandas

Question: Replace NaN with blank/empty string in pandas

I have a Pandas Dataframe as shown below:

    1    2       3
 0  a  NaN    read
 1  b    l  unread
 2  c  NaN    read

I want to replace the NaN values with an empty string so that it looks like this:

    1    2       3
 0  a   ""    read
 1  b    l  unread
 2  c   ""    read

Answer 0

import numpy as np
df1 = df.replace(np.nan, '', regex=True)

This might help. It will replace all NaNs with an empty string.


Answer 1

df = df.fillna('')

or just

df.fillna('', inplace=True)

This will fill na’s (e.g. NaN’s) with ''.

If you want to fill a single column, you can use:

df.column1 = df.column1.fillna('')

One can use df['column1'] instead of df.column1.


Answer 2

If you are reading the dataframe from a file (say CSV or Excel) then use :

  • pd.read_csv(path, na_filter=False)
  • pd.read_excel(path, na_filter=False)

This will automatically consider the empty fields as empty strings ''


If you already have the dataframe

  • df = df.replace(np.nan, '', regex=True)
  • df = df.fillna('')

Answer 3

Use a formatter if you only want to format it so that it renders nicely when printed. Just use df.to_string(..., formatters=...) to define custom string formatting, without needlessly modifying your DataFrame or wasting memory:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'c'],
    'B': [np.nan, 1, np.nan],
    'C': ['read', 'unread', 'read']})
print(df.to_string(
    formatters={'B': lambda x: '' if pd.isnull(x) else '{:.0f}'.format(x)}))

To get:

   A B       C
0  a      read
1  b 1  unread
2  c      read

Answer 4

Try this, adding inplace=True:

import numpy as np
df.replace(np.NaN, ' ', inplace=True)

Answer 5

Using keep_default_na=False should help you:

df = pd.read_csv(filename, keep_default_na=False)

Answer 6

If you are converting the DataFrame to JSON, NaN will give an error, so the best solution in this use case is to replace NaN with None.
Here is how:

df1 = df.where((pd.notnull(df)), None)

Answer 7

I tried this with a column of string values containing nan.

To remove the nan and fill the empty string:

df.columnname.replace(np.nan,'',regex = True)

To remove the nan and fill some values:

df.columnname.replace(np.nan,'value',regex = True)

I also tried df.iloc, but it needs the index of the column, so you have to look at the table again. The method above simply saves one step.


Update a pandas dataframe while iterating row by row

Question: Update a pandas dataframe while iterating row by row

I have a pandas data frame that looks like this (it's a pretty big one):

           date      exer exp     ifor         mat  
1092  2014-03-17  American   M  528.205  2014-04-19 
1093  2014-03-17  American   M  528.205  2014-04-19 
1094  2014-03-17  American   M  528.205  2014-04-19 
1095  2014-03-17  American   M  528.205  2014-04-19    
1096  2014-03-17  American   M  528.205  2014-05-17 

now I would like to iterate row by row and as I go through each row, the value of ifor in each row can change depending on some conditions and I need to lookup another dataframe.

Now, how do I update this as I iterate? I tried a few things and none of them worked.

for i, row in df.iterrows():
    if <something>:
        row['ifor'] = x
    else:
        row['ifor'] = y

    df.ix[i]['ifor'] = x

None of these approaches seem to work. I don’t see the values updated in the dataframe.


Answer 0

You can assign values in the loop using df.set_value:

for i, row in df.iterrows():
    ifor_val = something
    if <condition>:
        ifor_val = something_else
    df.set_value(i,'ifor',ifor_val)

If you don’t need the row values you could simply iterate over the indices of df, but I kept the original for-loop in case you need the row value for something not shown here.

update

df.set_value() has been deprecated as of version 0.21.0; you can use df.at instead:

for i, row in df.iterrows():
    ifor_val = something
    if <condition>:
        ifor_val = something_else
    df.at[i,'ifor'] = ifor_val

Answer 1

A Pandas DataFrame object should be thought of as a Series of Series. In other words, you should think of it in terms of columns. The reason this is important is that when you use pd.DataFrame.iterrows you are iterating through rows as Series. But these are not the Series that the data frame is storing; they are new Series created for you while you iterate. That implies that when you attempt to assign to them, those edits won't end up reflected in the original data frame.

Ok, now that that is out of the way: What do we do?

Suggestions prior to this post include:

  1. pd.DataFrame.set_value is deprecated as of Pandas version 0.21
  2. pd.DataFrame.ix is deprecated
  3. pd.DataFrame.loc is fine but can work on array indexers and you can do better

My recommendation
Use pd.DataFrame.at

for i in df.index:
    if <something>:
        df.at[i, 'ifor'] = x
    else:
        df.at[i, 'ifor'] = y

You can even change this to:

for i in df.index:
    df.at[i, 'ifor'] = x if <something> else y

Response to comment

and what if I need to use the value of the previous row for the if condition?

for i in range(1, len(df) + 1):
    j = df.columns.get_loc('ifor')
    if <something>:
        df.iat[i - 1, j] = x
    else:
        df.iat[i - 1, j] = y
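
Concretely, a hedged sketch of that pattern, reading the value already written for the previous row while filling the current one (x, y and the condition on prev are placeholders, as elsewhere in this answer):

j = df.columns.get_loc('ifor')
for i in range(1, len(df)):
    prev = df.iat[i - 1, j]  # value already written for the previous row
    df.iat[i, j] = x if prev > 0 else y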

Answer 2

A method you can use is itertuples(), which iterates over DataFrame rows as namedtuples, with the index value as the first element of the tuple. It is much, much faster than iterrows(). With itertuples(), each row contains its Index in the DataFrame, and you can use at or loc to set the value.

for row in df.itertuples():
    if <something>:
        df.at[row.Index, 'ifor'] = x
    else:
        df.at[row.Index, 'ifor'] = y

    # or, equivalently, using loc:
    df.loc[row.Index, 'ifor'] = x

In most cases, itertuples() is faster than iat or at.

Thanks @SantiStSupery, using .at is much faster than loc.


Answer 3

You should assign the value with df.ix[i, 'exp']=X or df.loc[i, 'exp']=X instead of df.ix[i]['ifor'] = x.

Otherwise you are working on a view, and you should get a warning:

-c:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_index,col_indexer] = value instead

But certainly, the loop should probably be replaced by some vectorized algorithm to make full use of the DataFrame, as @Phillip Cloud suggested.


Answer 4

Well, if you are going to iterate anyhow, why not use the simplest method of all, df['Column'].values[i]:

df['Column'] = ''

for i in range(len(df)):
    df['Column'].values[i] = something/update/new_value

Or, if you want to compare the new values with the old ones or anything like that, why not store them in a list as you go and assign the list to the column at the end?

mylist, df['Column'] = [], ''

for <condition>:
    mylist.append(something/update/new_value)

df['Column'] = mylist

Answer 5

for i, row in df.iterrows():
    if <something>:
        df.at[i, 'ifor'] = x
    else:
        df.at[i, 'ifor'] = y

Answer 6

It's better to use lambda functions with df.apply():

df["ifor"] = df.apply(lambda x: {value} if {condition} else x["ifor"], axis=1)

Answer 7

Increment the MAX number from a column. For example:

df1 = [sort_ID, Column1,Column2]
print(df1)

My output :

Sort_ID Column1 Column2
12         a    e
45         b    f
65         c    g
78         d    h

MAX = df1['Sort_ID'].max() #This returns my Max Number 

Now, I need to create a column in df2 and fill its values by incrementing from the MAX.

Sort_ID Column1 Column2
79      a1       e1
80      b1       f1
81      c1       g1
82      d1       h1

Note: df2 will initially contain only Column1 and Column2. We need the Sort_ID column to be created, incrementing from the MAX of df1; see the sketch below.
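
A minimal sketch of that step, using the df1 and df2 described above:

MAX = df1['Sort_ID'].max()
# hand out new IDs starting just past the current maximum
df2['Sort_ID'] = range(MAX + 1, MAX + 1 + len(df2))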


What is the difference between join and merge in pandas?

Question: What is the difference between join and merge in pandas?

Suppose I have two DataFrames like so:

left = pd.DataFrame({'key1': ['foo', 'bar'], 'lval': [1, 2]})

right = pd.DataFrame({'key2': ['foo', 'bar'], 'rval': [4, 5]})

I want to merge them, so I try something like this:

pd.merge(left, right, left_on='key1', right_on='key2')

And I’m happy

    key1    lval    key2    rval
0   foo     1       foo     4
1   bar     2       bar     5

But I'm trying to use the join method, which I've been led to believe is pretty similar.

left.join(right, on=['key1', 'key2'])

And I get this:

//anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in _validate_specification(self)
    406             if self.right_index:
    407                 if not ((len(self.left_on) == self.right.index.nlevels)):
--> 408                     raise AssertionError()
    409                 self.right_on = [None] * n
    410         elif self.right_on is not None:

AssertionError: 

What am I missing?


Answer 0

I always use join on indices:

import pandas as pd
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]}).set_index('key')
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]}).set_index('key')
left.join(right, lsuffix='_l', rsuffix='_r')

     val_l  val_r
key            
foo      1      4
bar      2      5

The same functionality can be had by using merge on the columns as follows:

left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]})
left.merge(right, on=('key'), suffixes=('_l', '_r'))

   key  val_l  val_r
0  foo      1      4
1  bar      2      5

Answer 1

pandas.merge() is the underlying function used for all merge/join behavior.

DataFrames provide the pandas.DataFrame.merge() and pandas.DataFrame.join() methods as a convenient way to access the capabilities of pandas.merge(). For example, df1.merge(right=df2, ...) is equivalent to pandas.merge(left=df1, right=df2, ...).

These are the main differences between df.join() and df.merge():

  1. lookup on right table: df1.join(df2) always joins via the index of df2, but df1.merge(df2) can join to one or more columns of df2 (default) or to the index of df2 (with right_index=True).
  2. lookup on left table: by default, df1.join(df2) uses the index of df1 and df1.merge(df2) uses column(s) of df1. That can be overridden by specifying df1.join(df2, on=key_or_keys) or df1.merge(df2, left_index=True).
  3. left vs inner join: df1.join(df2) does a left join by default (keeps all rows of df1), but df.merge does an inner join by default (returns only matching rows of df1 and df2).

So, the generic approach is to use pandas.merge(df1, df2) or df1.merge(df2). But for a number of common situations (keeping all rows of df1 and joining to an index in df2), you can save some typing by using df1.join(df2) instead.

Some notes on these issues from the documentation at http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging:

merge is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join.

The related DataFrame.join method, uses merge internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior for merge). If you are joining on index, you may wish to use DataFrame.join to save yourself some typing.

These two function calls are completely equivalent:

left.join(right, on=key_or_keys)
pd.merge(left, right, left_on=key_or_keys, right_index=True, how='left', sort=False)
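
A quick check of that equivalence on the question's data (with right keyed by key2, since join matches against the right frame's index):

import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key2': ['foo', 'bar'], 'rval': [4, 5]}).set_index('key2')

a = left.join(right, on='key1')
b = pd.merge(left, right, left_on='key1', right_index=True,
             how='left', sort=False)
print(a.equals(b))  # True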

Answer 2

I believe that join() is just a convenience method. Try df1.merge(df2) instead, which allows you to specify left_on and right_on:

In [30]: left.merge(right, left_on="key1", right_on="key2")
Out[30]: 
  key1  lval key2  rval
0  foo     1  foo     4
1  bar     2  bar     5

Answer 3

From this documentation

pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:

merge(left, right, how='inner', on=None, left_on=None, right_on=None,
      left_index=False, right_index=False, sort=True,
      suffixes=('_x', '_y'), copy=True, indicator=False)

And :

DataFrame.join is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame. Here is a very basic example: The data alignment here is on the indexes (row labels). This same behavior can be achieved using merge plus additional arguments instructing it to use the indexes:

result = pd.merge(left, right, left_index=True, right_index=True,
how='outer')

Answer 4

One of the differences is that merge creates a new index, while join keeps the left-side index. That can have big consequences for your later transformations if you wrongly assume that your index isn't changed by merge.

For example:

import pandas as pd

df1 = pd.DataFrame({'org_index': [101, 102, 103, 104],
                    'date': [201801, 201801, 201802, 201802],
                    'val': [1, 2, 3, 4]}, index=[101, 102, 103, 104])
df1

       date  org_index  val
101  201801        101    1
102  201801        102    2
103  201802        103    3
104  201802        104    4

df2 = pd.DataFrame({'date': [201801, 201802], 'dateval': ['A', 'B']}).set_index('date')
df2

       dateval
date          
201801       A
201802       B

df1.merge(df2, on='date')

     date  org_index  val dateval
0  201801        101    1       A
1  201801        102    2       A
2  201802        103    3       B
3  201802        104    4       B

df1.join(df2, on='date')
       date  org_index  val dateval
101  201801        101    1       A
102  201801        102    2       A
103  201802        103    3       B
104  201802        104    4       B

Answer 5

  • Join: joins on the index by default (if the frames share any column name, it will throw an error in default mode because you have not defined lsuffix or rsuffix)
df_1.join(df_2)
  • Merge: joins on same-named columns by default (if there is no common column name, it will throw an error in default mode)
df_1.merge(df_2)
  • The on parameter has a different meaning in the two cases:
df_1.merge(df_2, on='column_1')

df_1.join(df_2, on='column_1')  # this will throw an error
df_1.join(df_2.set_index('column_1'), on='column_1')

Answer 6

To put it analogously to SQL, "pandas merge is an outer/inner join and pandas join is a natural join". Hence, when you use merge in pandas, you want to specify which kind of SQL-ish join you want, whereas when you use pandas join, you really want a matching column label (or index) to ensure it joins.


Pandas group by and sum

Question: Pandas group by and sum

I am using this data frame:

Fruit   Date      Name  Number
Apples  10/6/2016 Bob    7
Apples  10/6/2016 Bob    8
Apples  10/6/2016 Mike   9
Apples  10/7/2016 Steve 10
Apples  10/7/2016 Bob    1
Oranges 10/7/2016 Bob    2
Oranges 10/6/2016 Tom   15
Oranges 10/6/2016 Mike  57
Oranges 10/6/2016 Bob   65
Oranges 10/7/2016 Tony   1
Grapes  10/7/2016 Bob    1
Grapes  10/7/2016 Tom   87
Grapes  10/7/2016 Bob   22
Grapes  10/7/2016 Bob   12
Grapes  10/7/2016 Tony  15

I want to aggregate this by name and then by fruit to get a total number of fruit per name.

Bob, Apples, 16 (for example)

I tried grouping by Name and Fruit, but how do I get the total number of fruit?


Answer 0

Use GroupBy.sum:

df.groupby(['Fruit','Name']).sum()

Out[31]: 
               Number
Fruit   Name         
Apples  Bob        16
        Mike        9
        Steve      10
Grapes  Bob        35
        Tom        87
        Tony       15
Oranges Bob        67
        Mike       57
        Tom        15
        Tony        1

Answer 1

You can also use the agg function:

df.groupby(['Name', 'Fruit'])['Number'].agg('sum')

Answer 2

If you want to keep the original columns Fruit and Name, use reset_index(). Otherwise Fruit and Name will become part of the index.

df.groupby(['Fruit','Name'])['Number'].sum().reset_index()

Fruit   Name       Number
Apples  Bob        16
Apples  Mike        9
Apples  Steve      10
Grapes  Bob        35
Grapes  Tom        87
Grapes  Tony       15
Oranges Bob        67
Oranges Mike       57
Oranges Tom        15
Oranges Tony        1

As seen in the other answers:

df.groupby(['Fruit','Name'])['Number'].sum()

               Number
Fruit   Name         
Apples  Bob        16
        Mike        9
        Steve      10
Grapes  Bob        35
        Tom        87
        Tony       15
Oranges Bob        67
        Mike       57
        Tom        15
        Tony        1

Answer 3

Both the other answers accomplish what you want.

You can use the pivot functionality to arrange the data in a nice table

df.groupby(['Fruit','Name'],as_index = False).sum().pivot('Fruit','Name').fillna(0)



Name    Bob     Mike    Steve   Tom    Tony
Fruit                   
Apples  16.0    9.0     10.0    0.0     0.0
Grapes  35.0    0.0     0.0     87.0    15.0
Oranges 67.0    57.0    0.0     15.0    1.0

Answer 4

df.groupby(['Fruit','Name'])['Number'].sum()

You can select different columns to sum numbers.


Answer 5

You can set the groupby columns as the index and then use sum with the level argument:

df.set_index(['Fruit','Name']).sum(level=[0,1])
Out[175]: 
               Number
Fruit   Name         
Apples  Bob        16
        Mike        9
        Steve      10
Oranges Bob        67
        Tom        15
        Mike       57
        Tony        1
Grapes  Bob        35
        Tom        87
        Tony       15

Answer 6

A variation on the .agg() function; it provides the ability to (1) keep the result as a DataFrame, (2) apply averages, counts, summations, etc., and (3) group by multiple columns while maintaining legibility.

df.groupby(['att1', 'att2']).agg({'att1': "count", 'att3': "sum",'att4': 'mean'})

Using your values:

df.groupby(['Name', 'Fruit']).agg({'Number': "sum"})

Split (explode) pandas dataframe string entry to separate rows

Question: Split (explode) pandas dataframe string entry to separate rows

I have a pandas dataframe in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (assume that the CSV fields are clean and need only be split on ','). For example, a should become b:

In [7]: a
Out[7]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

In [8]: b
Out[8]: 
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

So far, I have tried various simple functions, but the .apply method seems to only accept one row as return value when it is used on an axis, and I can’t get .transform to work. Any suggestions would be much appreciated!

Example data:

from pandas import DataFrame
import numpy as np
a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
               {'var1': 'd,e,f', 'var2': 2}])
b = DataFrame([{'var1': 'a', 'var2': 1},
               {'var1': 'b', 'var2': 1},
               {'var1': 'c', 'var2': 1},
               {'var1': 'd', 'var2': 2},
               {'var1': 'e', 'var2': 2},
               {'var1': 'f', 'var2': 2}])

I know this won’t work because we lose DataFrame meta-data by going through numpy, but it should give you a sense of what I tried to do:

def fun(row):
    letters = row['var1']
    letters = letters.split(',')
    out = np.array([row] * len(letters))
    out['var1'] = letters
a['idx'] = range(a.shape[0])
z = a.groupby('idx')
z.transform(fun)

Answer 0

How about something like this:

In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))              
                    for _, row in a.iterrows()]).reset_index()
Out[55]: 
  index  0
0     a  1
1     b  1
2     c  1
3     d  2
4     e  2
5     f  2

Then you just have to rename the columns.
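
For reference, pandas 0.25 and later ship a built-in DataFrame.explode that handles this directly once the strings are split into lists. A sketch using the question's frame a:

b = (a.assign(var1=a.var1.str.split(','))
       .explode('var1')
       .reset_index(drop=True))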


Answer 1

UPDATE 2: a more generic vectorized function, which will work for multiple normal and multiple list columns

def explode(df, lst_cols, fill_value='', preserve_index=False):
    # make sure `lst_cols` is list-alike
    if (lst_cols is not None
        and len(lst_cols) > 0
        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # preserve original index values    
    idx = np.repeat(df.index.values, lens)
    # create "exploded" DF
    res = (pd.DataFrame({
                col:np.repeat(df[col].values, lens)
                for col in idx_cols},
                index=idx)
             .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                            for col in lst_cols}))
    # append those rows that have empty lists
    if (lens == 0).any():
        # at least one list in cells is empty
        res = (res.append(df.loc[lens==0, idx_cols], sort=False)
                  .fillna(fill_value))
    # revert the original index order
    res = res.sort_index()
    # reset index if requested
    if not preserve_index:        
        res = res.reset_index(drop=True)
    return res

Demo:

Multiple list columns – all list columns must have the same # of elements in each row:

In [134]: df
Out[134]:
   aaa  myid        num          text
0   10     1  [1, 2, 3]  [aa, bb, cc]
1   11     2         []            []
2   12     3     [1, 2]      [cc, dd]
3   13     4         []            []

In [135]: explode(df, ['num','text'], fill_value='')
Out[135]:
   aaa  myid num text
0   10     1   1   aa
1   10     1   2   bb
2   10     1   3   cc
3   11     2
4   12     3   1   cc
5   12     3   2   dd
6   13     4

preserving original index values:

In [136]: explode(df, ['num','text'], fill_value='', preserve_index=True)
Out[136]:
   aaa  myid num text
0   10     1   1   aa
0   10     1   2   bb
0   10     1   3   cc
1   11     2
2   12     3   1   cc
2   12     3   2   dd
3   13     4

Setup:

df = pd.DataFrame({
 'aaa': {0: 10, 1: 11, 2: 12, 3: 13},
 'myid': {0: 1, 1: 2, 2: 3, 3: 4},
 'num': {0: [1, 2, 3], 1: [], 2: [1, 2], 3: []},
 'text': {0: ['aa', 'bb', 'cc'], 1: [], 2: ['cc', 'dd'], 3: []}
})

CSV column:

In [46]: df
Out[46]:
        var1  var2 var3
0      a,b,c     1   XX
1  d,e,f,x,y     2   ZZ

In [47]: explode(df.assign(var1=df.var1.str.split(',')), 'var1')
Out[47]:
  var1  var2 var3
0    a     1   XX
1    b     1   XX
2    c     1   XX
3    d     2   ZZ
4    e     2   ZZ
5    f     2   ZZ
6    x     2   ZZ
7    y     2   ZZ

Using this little trick we can convert a CSV-like column into a list column:

In [48]: df.assign(var1=df.var1.str.split(','))
Out[48]:
              var1  var2 var3
0        [a, b, c]     1   XX
1  [d, e, f, x, y]     2   ZZ

UPDATE: generic vectorized approach (will work also for multiple columns):

Original DF:

In [177]: df
Out[177]:
        var1  var2 var3
0      a,b,c     1   XX
1  d,e,f,x,y     2   ZZ

Solution:

first let’s convert CSV strings to lists:

In [178]: lst_col = 'var1' 

In [179]: x = df.assign(**{lst_col:df[lst_col].str.split(',')})

In [180]: x
Out[180]:
              var1  var2 var3
0        [a, b, c]     1   XX
1  [d, e, f, x, y]     2   ZZ

Now we can do this:

In [181]: pd.DataFrame({
     ...:     col:np.repeat(x[col].values, x[lst_col].str.len())
     ...:     for col in x.columns.difference([lst_col])
     ...: }).assign(**{lst_col:np.concatenate(x[lst_col].values)})[x.columns.tolist()]
     ...:
Out[181]:
  var1  var2 var3
0    a     1   XX
1    b     1   XX
2    c     1   XX
3    d     2   ZZ
4    e     2   ZZ
5    f     2   ZZ
6    x     2   ZZ
7    y     2   ZZ

OLD answer:

Inspired by @AFinkelstein's solution, I wanted to make it a bit more generalized, so it could be applied to a DF with more than two columns, while being almost as fast as AFinkelstein's solution:

In [2]: df = pd.DataFrame(
   ...:    [{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'},
   ...:     {'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}]
   ...: )

In [3]: df
Out[3]:
        var1  var2 var3
0      a,b,c     1   XX
1  d,e,f,x,y     2   ZZ

In [4]: (df.set_index(df.columns.drop('var1',1).tolist())
   ...:    .var1.str.split(',', expand=True)
   ...:    .stack()
   ...:    .reset_index()
   ...:    .rename(columns={0:'var1'})
   ...:    .loc[:, df.columns]
   ...: )
Out[4]:
  var1  var2 var3
0    a     1   XX
1    b     1   XX
2    c     1   XX
3    d     2   ZZ
4    e     2   ZZ
5    f     2   ZZ
6    x     2   ZZ
7    y     2   ZZ

回答 2

经过艰苦的实验,寻找比已接受答案更快的方法之后,我做出了下面的方案。在我试过的数据集上,它的运行速度快了约100倍。

如果有人知道使这种方式更优雅的方法,请务必修改我的代码。如果没有将要保留的其他列设置为索引,然后重设索引并重命名这些列,我找不到一种可行的方法,但是我想还有其他方法可以解决。

from pandas import DataFrame

b = DataFrame(a.var1.str.split(',').tolist(), index=a.var2).stack()
b = b.reset_index()[[0, 'var2']] # var1 variable is currently labeled 0
b.columns = ['var1', 'var2'] # renaming var1

After painful experimentation to find something faster than the accepted answer, I got this to work. It ran around 100x faster on the dataset I tried it on.

If someone knows a way to make this more elegant, by all means please modify my code. I couldn’t find a way that works without setting the other columns you want to keep as the index and then resetting the index and re-naming the columns, but I’d imagine there’s something else that works.

from pandas import DataFrame

b = DataFrame(a.var1.str.split(',').tolist(), index=a.var2).stack()
b = b.reset_index()[[0, 'var2']] # var1 variable is currently labeled 0
b.columns = ['var1', 'var2'] # renaming var1

回答 3

这是我为这项常见任务编写的函数。它比 Series/stack 方法更高效。列的顺序和名称会被保留。

def tidy_split(df, column, sep='|', keep=False):
    """
    Split the values of a column and expand so the new DataFrame has one split
    value per row. Filters rows where the column is missing.

    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as its own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df

使用这个函数,原始问题就变得很简单:

tidy_split(a, 'var1', sep=',')

Here’s a function I wrote for this common task. It’s more efficient than the Series/stack methods. Column order and names are retained.

def tidy_split(df, column, sep='|', keep=False):
    """
    Split the values of a column and expand so the new DataFrame has one split
    value per row. Filters rows where the column is missing.

    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as its own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df

With this function, the original question is as simple as:

tidy_split(a, 'var1', sep=',')

回答 4

熊猫 >= 0.25

Series 和 DataFrame 定义了一个 .explode() 方法,可以把列表炸开为单独的行。请参阅文档中"Exploding a list-like column"一节。

由于您有一个逗号分隔的字符串列,请先按逗号拆分字符串得到元素列表,然后在该列上调用 explode。

df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})
df
    var1  var2
0  a,b,c     1
1  d,e,f     2

df.assign(var1=df['var1'].str.split(',')).explode('var1')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

请注意,explode仅适用于单列(目前)。


NaN 和空列表会得到应有的处理,您无需大费周章就能得到正确的结果。

df = pd.DataFrame({'var1': ['d,e,f', '', np.nan], 'var2': [1, 2, 3]})
df
    var1  var2
0  d,e,f     1
1            2
2    NaN     3

df['var1'].str.split(',')

0    [d, e, f]
1           []
2          NaN

df.assign(var1=df['var1'].str.split(',')).explode('var1')

  var1  var2
0    d     1
0    e     1
0    f     1
1          2  # empty list entry becomes empty string after exploding 
2  NaN     3  # NaN left un-touched

与基于 ravel + repeat 的解决方案(它们完全忽略空列表,并且遇到 NaN 时会出错)相比,这是一个重大优势。

Pandas >= 0.25

Series and DataFrame methods define a .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column.

Since you have a list of comma separated strings, split the string on comma to get a list of elements, then call explode on that column.

df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})
df
    var1  var2
0  a,b,c     1
1  d,e,f     2

df.assign(var1=df['var1'].str.split(',')).explode('var1')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

Note that explode only works on a single column (for now).


NaNs and empty lists get the treatment they deserve without you having to jump through hoops to get it right.

df = pd.DataFrame({'var1': ['d,e,f', '', np.nan], 'var2': [1, 2, 3]})
df
    var1  var2
0  d,e,f     1
1            2
2    NaN     3

df['var1'].str.split(',')

0    [d, e, f]
1           []
2          NaN

df.assign(var1=df['var1'].str.split(',')).explode('var1')

  var1  var2
0    d     1
0    e     1
0    f     1
1          2  # empty list entry becomes empty string after exploding 
2  NaN     3  # NaN left un-touched

This is a serious advantage over ravel + repeat -based solutions (which ignore empty lists completely, and choke on NaNs).
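If those placeholder rows are unwanted afterwards, a small follow-up sketch (reusing the frame from above) that drops both the NaN and the empty-string entries:

import numpy as np
import pandas as pd

df = pd.DataFrame({'var1': ['d,e,f', '', np.nan], 'var2': [1, 2, 3]})
out = df.assign(var1=df['var1'].str.split(',')).explode('var1')

# Keep only rows whose exploded value is a real, non-empty string.
out = out[out['var1'].notna() & out['var1'].ne('')]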


回答 5

类似的问题:熊猫:如何将一列中的文本分成多行?

您可以这样做:

>> a=pd.DataFrame({"var1":"a,b,c d,e,f".split(),"var2":[1,2]})
>> s = a.var1.str.split(",").apply(pd.Series, 1).stack()
>> s.index = s.index.droplevel(-1)
>> del a['var1']
>> a.join(s)
   var2 var1
0     1    a
0     1    b
0     1    c
1     2    d
1     2    e
1     2    f

Similar question as: pandas: How do I split text in a column into multiple rows?

You could do:

>> a=pd.DataFrame({"var1":"a,b,c d,e,f".split(),"var2":[1,2]})
>> s = a.var1.str.split(",").apply(pd.Series, 1).stack()
>> s.index = s.index.droplevel(-1)
>> del a['var1']
>> a.join(s)
   var2 var1
0     1    a
0     1    b
0     1    c
1     2    d
1     2    e
1     2    f

回答 6

TL; DR

import pandas as pd
import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

def explode_list(df, col):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.len())
    return df.iloc[i].assign(**{col: np.concatenate(s)})

示范

explode_str(a, 'var1', ',')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

让我们创建一个具有列表的新数据框 d

d = a.assign(var1=lambda d: d.var1.str.split(','))

explode_list(d, 'var1')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

一般性说明

我将使用 np.arange 配合 repeat 来生成数据框的索引位置,以便与 iloc 一起使用。

常见问题(FAQ)

我为什么不使用loc

因为索引可能不是唯一的,而使用 loc 会返回与所查询索引匹配的每一行。

你为什么不使用 values属性并对它进行切片?

调用 values 时,如果整个数据帧位于一个内聚的"块"中,Pandas 将返回该"块"数组的视图;否则 Pandas 将不得不拼凑出一个新数组。拼凑时,该数组必须具有统一的 dtype,这通常意味着返回 dtype 为 object 的数组。通过使用 iloc 而不是对 values 属性进行切片,我就不必处理这些问题。

你为什么用 assign

当我用与要炸开的列相同的列名调用 assign 时,会覆盖现有列,并保持它在数据帧中的位置。

为什么索引值重复?

由于在重复的位置上使用了 iloc,所得索引会呈现相同的重复模式:列表或字符串的每个元素重复一次。
可以使用 reset_index(drop=True) 将其重置。


对于字符串

我不想过早地拆分字符串,所以我统计 sep 参数出现的次数,并假设如果进行拆分,结果列表的长度会比分隔符的数量多一。

然后我用 sep 将所有字符串 join 成一个大字符串,再一次性 split。

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

对于列表

与字符串的做法类似,只是我不需要统计 sep 的出现次数,因为该列已经拆分好了。

我使用 Numpy 的 concatenate 将这些列表拼接在一起。

import pandas as pd
import numpy as np

def explode_list(df, col):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.len())
    return df.iloc[i].assign(**{col: np.concatenate(s)})

TL;DR

import pandas as pd
import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

def explode_list(df, col):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.len())
    return df.iloc[i].assign(**{col: np.concatenate(s)})

Demonstration

explode_str(a, 'var1', ',')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

Let’s create a new dataframe d that has lists

d = a.assign(var1=lambda d: d.var1.str.split(','))

explode_list(d, 'var1')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

General Comments

I’ll use np.arange with repeat to produce dataframe index positions that I can use with iloc.

FAQ

Why don’t I use loc?

Because the index may not be unique and using loc will return every row that matches a queried index.

Why don’t you use the values attribute and slice that?

When calling values, if the entirety of the the dataframe is in one cohesive “block”, Pandas will return a view of the array that is the “block”. Otherwise Pandas will have to cobble together a new array. When cobbling, that array must be of a uniform dtype. Often that means returning an array with dtype that is object. By using iloc instead of slicing the values attribute, I alleviate myself from having to deal with that.

Why do you use assign?

When I use assign using the same column name that I’m exploding, I overwrite the existing column and maintain its position in the dataframe.

Why are the index values repeat?

By virtue of using iloc on repeated positions, the resulting index shows the same repeated pattern. One repeat for each element the list or string.
This can be reset with reset_index(drop=True)


For Strings

I don’t want to have to split the strings prematurely. So instead I count the occurrences of the sep argument assuming that if I were to split, the length of the resulting list would be one more than the number of separators.

I then use that sep to join the strings then split.

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

For Lists

Similar as for strings, except I don't need to count occurrences of sep because it's already split.

I use Numpy’s concatenate to jam the lists together.

import pandas as pd
import numpy as np

def explode_list(df, col):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.len())
    return df.iloc[i].assign(**{col: np.concatenate(s)})
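As noted in the FAQ above, the repeated index can be flattened afterwards. A usage sketch on the question's frame, with explode_str as defined above:

import pandas as pd

a = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
                  {'var1': 'd,e,f', 'var2': 2}])

# reset_index(drop=True) turns the repeated 0,0,0,1,1,1 index into 0..5.
result = explode_str(a, 'var1', ',').reset_index(drop=True)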


回答 7

有可能在不更改数据帧结构的情况下拆分和爆炸数据帧

拆分和扩展特定列的数据

输入:

    var1    var2
0   a,b,c   1
1   d,e,f   2



#Get the indexes which are repetitive with the split
temp = df['var1'].str.split(',')
df = df.reindex(df.index.repeat(temp.apply(len)))


df['var1'] = np.hstack(temp)

出:

    var1    var2
0   a   1
0   b   1
0   c   1
1   d   2
1   e   2
1   f   2

编辑1

拆分和扩展多列的行

Filename    RGB                                             RGB_type
0   A   [[0, 1650, 6, 39], [0, 1691, 1, 59], [50, 1402...   [r, g, b]
1   B   [[0, 1423, 16, 38], [0, 1445, 16, 46], [0, 141...   [r, g, b]

根据参考列重新索引并将列值信息与堆栈对齐

df = df.reindex(df.index.repeat(df['RGB_type'].apply(len)))
df = df.groupby('Filename').apply(lambda x:x.apply(lambda y: pd.Series(y.iloc[0])))
df.reset_index(drop=True).ffill()

出:

                Filename    RGB_type    Top 1 colour    Top 1 frequency Top 2 colour    Top 2 frequency
    Filename                            
 A  0       A   r   0   1650    6   39
    1       A   g   0   1691    1   59
    2       A   b   50  1402    49  187
 B  0       B   r   0   1423    16  38
    1       B   g   0   1445    16  46
    2       B   b   0   1419    16  39

There is a possibility to split and explode the dataframe without changing the structure of dataframe

Split and expand data of specific columns

Input:

    var1    var2
0   a,b,c   1
1   d,e,f   2



#Split the CSV-like column into lists, then explode one element per row
df['var1'] = df['var1'].str.split(',')
df = df.explode('var1')

Out:

    var1    var2
0   a   1
0   b   1
0   c   1
1   d   2
1   e   2
1   f   2

Edit-1

Split and Expand of rows for Multiple columns

Filename    RGB                                             RGB_type
0   A   [[0, 1650, 6, 39], [0, 1691, 1, 59], [50, 1402...   [r, g, b]
1   B   [[0, 1423, 16, 38], [0, 1445, 16, 46], [0, 141...   [r, g, b]

Re indexing based on the reference column and aligning the column value information with stack

df = df.reindex(df.index.repeat(df['RGB_type'].apply(len)))
df = df.groupby('Filename').apply(lambda x:x.apply(lambda y: pd.Series(y.iloc[0])))
df.reset_index(drop=True).ffill()

Out:

                Filename    RGB_type    Top 1 colour    Top 1 frequency Top 2 colour    Top 2 frequency
    Filename                            
 A  0       A   r   0   1650    6   39
    1       A   g   0   1691    1   59
    2       A   b   50  1402    49  187
 B  0       B   r   0   1423    16  38
    1       B   g   0   1445    16  46
    2       B   b   0   1419    16  39

回答 8

我想出了一种针对具有任意列数的数据框的解决方案(尽管仍然一次只分隔一个列的条目)。

import pandas

def splitDataFrameList(df,target_column,separator):
    ''' df = dataframe to split,
    target_column = the column containing the values to split
    separator = the symbol used to perform the split

    returns: a dataframe with each entry for the target column separated, with each element moved into a new row. 
    The values in the other columns are duplicated across the newly divided rows.
    '''
    def splitListToRows(row,row_accumulator,target_column,separator):
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
    new_df = pandas.DataFrame(new_rows)
    return new_df

I came up with a solution for dataframes with arbitrary numbers of columns (while still only separating one column’s entries at a time).

import pandas

def splitDataFrameList(df,target_column,separator):
    ''' df = dataframe to split,
    target_column = the column containing the values to split
    separator = the symbol used to perform the split

    returns: a dataframe with each entry for the target column separated, with each element moved into a new row. 
    The values in the other columns are duplicated across the newly divided rows.
    '''
    def splitListToRows(row,row_accumulator,target_column,separator):
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
    new_df = pandas.DataFrame(new_rows)
    return new_df
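For completeness, a short usage sketch on the question's sample frame (a is the two-row frame from the question; the column names are as given there):

import pandas

a = pandas.DataFrame([{'var1': 'a,b,c', 'var2': 1},
                      {'var1': 'd,e,f', 'var2': 2}])

# One output row per comma-separated element of var1; var2 is duplicated.
new_df = splitDataFrameList(a, 'var1', ',')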

回答 9

这是一个相当直接的方法:使用 pandas str 访问器的 split 方法,然后使用 NumPy 将每一行展平为单个数组。

通过用 np.repeat 将未拆分的列重复正确的次数,来取得对应的值。

var1 = df.var1.str.split(',', expand=True).values.ravel()
var2 = np.repeat(df.var2.values, len(var1) // len(df))

pd.DataFrame({'var1': var1,
              'var2': var2})

  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

Here is a fairly straightforward method that uses the split method from the pandas str accessor and then uses NumPy to flatten each row into a single array.

The corresponding values are retrieved by repeating the non-split column the correct number of times with np.repeat.

var1 = df.var1.str.split(',', expand=True).values.ravel()
var2 = np.repeat(df.var2.values, len(var1) // len(df))

pd.DataFrame({'var1': var1,
              'var2': var2})

  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

回答 10

我在用各种方式爆炸列表时一直遇到内存不足的问题,因此我准备了一些基准测试来帮助我决定给哪些答案点赞。我测试了五种情况,它们的列表长度与列表数量的比例各不相同。结果分享如下:

时间:(越少越好)

[图表:各方案运行时间对比]

峰值内存使用情况:(越少越好)

[图表:各方案峰值内存对比]

结论

  • @MaxU的答案(更新2),代号 concatenate,在几乎每种情况下都能提供最佳速度,同时保持较低的峰值内存使用率,
  • 如果您需要处理具有相对较小列表的许多行,并且可以接受更高的峰值内存,请参见@DMulligan的答案(代号 stack),
  • 已接受的@Chang的答案适用于行数少但列表很大的数据帧。

完整的细节(功能和基准代码)在GitHub gist中。请注意,基准测试问题已得到简化,并且不包括将字符串拆分为列表-大多数解决方案都以类似的方式执行。

I have been struggling with out-of-memory experience using various way to explode my lists so I prepared some benchmarks to help me decide which answers to upvote. I tested five scenarios with varying proportions of the list length to the number of lists. Sharing the results below:

Time: (less is better)

[chart: runtime comparison of the solutions]

Peak memory usage: (less is better)

[chart: peak memory comparison of the solutions]

Conclusions:

  • @MaxU's answer (update 2), codename concatenate, offers the best speed in almost every case, while keeping the peak memory usage low,
  • see @DMulligan’s answer (codename stack) if you need to process lots of rows with relatively small lists and can afford increased peak memory,
  • the accepted @Chang’s answer works well for data frames that have a few rows but very large lists.

Full details (functions and benchmarking code) are in this GitHub gist. Please note that the benchmark problem was simplified and did not include splitting of strings into the list – which most solutions performed in a similar fashion.
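The gist holds the real functions and benchmark code; below is only a minimal sketch of the kind of timing harness involved (sizes and labels are illustrative), reusing the explode function defined in MaxU's answer above:

import time

import pandas as pd

# Hypothetical workload: many rows with short lists.
df = pd.DataFrame({'var1': ['a,b,c'] * 100000, 'var2': range(100000)})
df = df.assign(var1=df['var1'].str.split(','))

def bench(label, fn, reps=3):
    # Report the best of a few runs to reduce noise.
    times = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    print('%s: %.3fs' % (label, min(times)))

bench('concatenate', lambda: explode(df, 'var1'))
bench('built-in explode', lambda: df.explode('var1'))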


回答 11

基于出色的@DMulligan 的解决方案,这是一个通用的矢量化(无循环)函数,它将数据帧的一列拆分为多行,并将其合并回原始数据帧。它还使用了来自这个答案的很棒的通用函数 change_column_order。

def change_column_order(df, col_name, index):
    cols = df.columns.tolist()
    cols.remove(col_name)
    cols.insert(index, col_name)
    return df[cols]

def split_df(dataframe, col_name, sep):
    orig_col_index = dataframe.columns.tolist().index(col_name)
    orig_index_name = dataframe.index.name
    orig_columns = dataframe.columns
    dataframe = dataframe.reset_index()  # we need a natural 0-based index for proper merge
    index_col_name = (set(dataframe.columns) - set(orig_columns)).pop()
    df_split = pd.DataFrame(
        pd.DataFrame(dataframe[col_name].str.split(sep).tolist())
        .stack().reset_index(level=1, drop=1), columns=[col_name])
    df = dataframe.drop(col_name, axis=1)
    df = pd.merge(df, df_split, left_index=True, right_index=True, how='inner')
    df = df.set_index(index_col_name)
    df.index.name = orig_index_name
    # merge adds the column to the last place, so we need to move it back
    return change_column_order(df, col_name, orig_col_index)

例:

df = pd.DataFrame([['a:b', 1, 4], ['c:d', 2, 5], ['e:f:g:h', 3, 6]], 
                  columns=['Name', 'A', 'B'], index=[10, 12, 13])
df
        Name    A   B
    10   a:b     1   4
    12   c:d     2   5
    13   e:f:g:h 3   6

split_df(df, 'Name', ':')
    Name    A   B
10   a       1   4
10   b       1   4
12   c       2   5
12   d       2   5
13   e       3   6
13   f       3   6    
13   g       3   6    
13   h       3   6    

请注意,它保留了列的原始索引和顺序。它也适用于具有非顺序索引的数据帧。

Based on the excellent @DMulligan’s solution, here is a generic vectorized (no loops) function which splits a column of a dataframe into multiple rows, and merges it back to the original dataframe. It also uses a great generic change_column_order function from this answer.

def change_column_order(df, col_name, index):
    cols = df.columns.tolist()
    cols.remove(col_name)
    cols.insert(index, col_name)
    return df[cols]

def split_df(dataframe, col_name, sep):
    orig_col_index = dataframe.columns.tolist().index(col_name)
    orig_index_name = dataframe.index.name
    orig_columns = dataframe.columns
    dataframe = dataframe.reset_index()  # we need a natural 0-based index for proper merge
    index_col_name = (set(dataframe.columns) - set(orig_columns)).pop()
    df_split = pd.DataFrame(
        pd.DataFrame(dataframe[col_name].str.split(sep).tolist())
        .stack().reset_index(level=1, drop=1), columns=[col_name])
    df = dataframe.drop(col_name, axis=1)
    df = pd.merge(df, df_split, left_index=True, right_index=True, how='inner')
    df = df.set_index(index_col_name)
    df.index.name = orig_index_name
    # merge adds the column to the last place, so we need to move it back
    return change_column_order(df, col_name, orig_col_index)

Example:

df = pd.DataFrame([['a:b', 1, 4], ['c:d', 2, 5], ['e:f:g:h', 3, 6]], 
                  columns=['Name', 'A', 'B'], index=[10, 12, 13])
df
        Name    A   B
    10   a:b     1   4
    12   c:d     2   5
    13   e:f:g:h 3   6

split_df(df, 'Name', ':')
    Name    A   B
10   a       1   4
10   b       1   4
12   c       2   5
12   d       2   5
13   e       3   6
13   f       3   6    
13   g       3   6    
13   h       3   6    

Note that it preserves the original index and order of the columns. It also works with dataframes which have non-sequential index.


回答 12

字符串函数 split 可以接受一个可选的布尔参数 expand。

这是使用此参数的解决方案:

(a.var1
  .str.split(",",expand=True)
  .set_index(a.var2)
  .stack()
  .reset_index(level=1, drop=True)
  .reset_index()
  .rename(columns={0:"var1"}))

The string function split can take an optional boolean argument 'expand'.

Here is a solution using this argument:

(a.var1
  .str.split(",",expand=True)
  .set_index(a.var2)
  .stack()
  .reset_index(level=1, drop=True)
  .reset_index()
  .rename(columns={0:"var1"}))
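For reference, a runnable sketch of the same expression on the sample data (a is the two-row frame from the question):

import pandas as pd

a = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})

b = (a.var1
      .str.split(',', expand=True)
      .set_index(a.var2)       # keep var2 alongside while stacking
      .stack()
      .reset_index(level=1, drop=True)
      .reset_index()
      .rename(columns={0: 'var1'}))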

回答 13

只是使用了上面 jiln 的出色答案,但需要扩展它以拆分多列。想着分享一下。

import pandas as pd

def splitDataFrameList(df, target_columns, separator):
    ''' df = dataframe to split,
    target_columns = the columns containing the values to split
    separator = the symbol used to perform the split

    returns: a dataframe with each entry for the target columns separated, with each element moved into a new row. 
    The values in the other columns are duplicated across the newly divided rows.
    '''
    def splitListToRows(row, row_accumulator, target_columns, separator):
        split_rows = []
        for target_column in target_columns:
            split_rows.append(row[target_column].split(separator))
        # Separate for multiple columns
        for i in range(len(split_rows[0])):
            new_row = row.to_dict()
            for j in range(len(split_rows)):
                new_row[target_columns[j]] = split_rows[j][i]
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows, axis=1, args=(new_rows, target_columns, separator))
    new_df = pd.DataFrame(new_rows)
    return new_df

Just used jiln's excellent answer from above, but needed to expand it to split multiple columns. Thought I would share.

import pandas as pd

def splitDataFrameList(df, target_columns, separator):
    ''' df = dataframe to split,
    target_columns = the columns containing the values to split
    separator = the symbol used to perform the split

    returns: a dataframe with each entry for the target columns separated, with each element moved into a new row. 
    The values in the other columns are duplicated across the newly divided rows.
    '''
    def splitListToRows(row, row_accumulator, target_columns, separator):
        split_rows = []
        for target_column in target_columns:
            split_rows.append(row[target_column].split(separator))
        # Separate for multiple columns
        for i in range(len(split_rows[0])):
            new_row = row.to_dict()
            for j in range(len(split_rows)):
                new_row[target_columns[j]] = split_rows[j][i]
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows, axis=1, args=(new_rows, target_columns, separator))
    new_df = pd.DataFrame(new_rows)
    return new_df

回答 14

为 MaxU 的答案增加了 MultiIndex 支持

def explode(df, lst_cols, fill_value='', preserve_index=False):
    """
    usage:
        In [134]: df
        Out[134]:
           aaa  myid        num          text
        0   10     1  [1, 2, 3]  [aa, bb, cc]
        1   11     2         []            []
        2   12     3     [1, 2]      [cc, dd]
        3   13     4         []            []

        In [135]: explode(df, ['num','text'], fill_value='')
        Out[135]:
           aaa  myid num text
        0   10     1   1   aa
        1   10     1   2   bb
        2   10     1   3   cc
        3   11     2
        4   12     3   1   cc
        5   12     3   2   dd
        6   13     4
    """
    # make sure `lst_cols` is list-alike
    if (lst_cols is not None
        and len(lst_cols) > 0
        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # preserve original index values    
    idx = np.repeat(df.index.values, lens)
    res = (pd.DataFrame({
                col:np.repeat(df[col].values, lens)
                for col in idx_cols},
                index=idx)
             .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                            for col in lst_cols}))
    # append those rows that have empty lists
    if (lens == 0).any():
        # at least one list in cells is empty
        res = (res.append(df.loc[lens==0, idx_cols], sort=False)
                  .fillna(fill_value))
    # revert the original index order
    res = res.sort_index()
    # reset index if requested
    if not preserve_index:        
        res = res.reset_index(drop=True)

    # if the original index is a MultiIndex, rebuild it from the index tuples
    if isinstance(df.index, pd.MultiIndex):
        res = res.reindex(
            index=pd.MultiIndex.from_tuples(
                res.index,
                names=df.index.names
            )
        )
    return res

upgraded MaxU’s answer with MultiIndex support

def explode(df, lst_cols, fill_value='', preserve_index=False):
    """
    usage:
        In [134]: df
        Out[134]:
           aaa  myid        num          text
        0   10     1  [1, 2, 3]  [aa, bb, cc]
        1   11     2         []            []
        2   12     3     [1, 2]      [cc, dd]
        3   13     4         []            []

        In [135]: explode(df, ['num','text'], fill_value='')
        Out[135]:
           aaa  myid num text
        0   10     1   1   aa
        1   10     1   2   bb
        2   10     1   3   cc
        3   11     2
        4   12     3   1   cc
        5   12     3   2   dd
        6   13     4
    """
    # make sure `lst_cols` is list-alike
    if (lst_cols is not None
        and len(lst_cols) > 0
        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # preserve original index values    
    idx = np.repeat(df.index.values, lens)
    res = (pd.DataFrame({
                col:np.repeat(df[col].values, lens)
                for col in idx_cols},
                index=idx)
             .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                            for col in lst_cols}))
    # append those rows that have empty lists
    if (lens == 0).any():
        # at least one list in cells is empty
        res = (res.append(df.loc[lens==0, idx_cols], sort=False)
                  .fillna(fill_value))
    # revert the original index order
    res = res.sort_index()
    # reset index if requested
    if not preserve_index:        
        res = res.reset_index(drop=True)

    # if the original index is a MultiIndex, rebuild it from the index tuples
    if isinstance(df.index, pd.MultiIndex):
        res = res.reindex(
            index=pd.MultiIndex.from_tuples(
                res.index,
                names=df.index.names
            )
        )
    return res

回答 15

单行代码:使用 split(___, expand=True) 以及 reset_index() 的 level 和 name 参数:

>>> b = a.var1.str.split(',', expand=True).set_index(a.var2).stack().reset_index(level=0, name='var1')
>>> b
   var2 var1
0     1    a
1     1    b
2     1    c
0     2    d
1     2    e
2     2    f

如果您需要b看起来像问题中的样子,还可以执行以下操作:

>>> b = b.reset_index(drop=True)[['var1', 'var2']]
>>> b
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

One-liner using split(___, expand=True) and the level and name arguments to reset_index():

>>> b = a.var1.str.split(',', expand=True).set_index(a.var2).stack().reset_index(level=0, name='var1')
>>> b
   var2 var1
0     1    a
1     1    b
2     1    c
0     2    d
1     2    e
2     2    f

If you need b to look exactly like in the question, you can additionally do:

>>> b = b.reset_index(drop=True)[['var1', 'var2']]
>>> b
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

回答 16

我针对此问题提出了以下解决方案:

from pandas import DataFrame

def iter_var1(d):
    for _, row in d.iterrows():
        for v in row["var1"].split(","):
            yield (v, row["var2"])

new_a = DataFrame.from_records([i for i in iter_var1(a)],
        columns=["var1", "var2"])

I have come up with the following solution to this problem:

from pandas import DataFrame

def iter_var1(d):
    for _, row in d.iterrows():
        for v in row["var1"].split(","):
            yield (v, row["var2"])

new_a = DataFrame.from_records([i for i in iter_var1(a)],
        columns=["var1", "var2"])

回答 17

使用python复制包的另一种解决方案

import copy
new_observations = list()
def pandas_explode(df, column_to_explode):
    new_observations = list()
    for row in df.to_dict(orient='records'):
        explode_values = row[column_to_explode]
        del row[column_to_explode]
        if type(explode_values) is list or type(explode_values) is tuple:
            for explode_value in explode_values:
                new_observation = copy.deepcopy(row)
                new_observation[column_to_explode] = explode_value
                new_observations.append(new_observation) 
        else:
            new_observation = copy.deepcopy(row)
            new_observation[column_to_explode] = explode_values
            new_observations.append(new_observation) 
    return_df = pd.DataFrame(new_observations)
    return return_df

df = pandas_explode(df, column_name)

Another solution that uses python copy package

import copy
new_observations = list()
def pandas_explode(df, column_to_explode):
    new_observations = list()
    for row in df.to_dict(orient='records'):
        explode_values = row[column_to_explode]
        del row[column_to_explode]
        if type(explode_values) is list or type(explode_values) is tuple:
            for explode_value in explode_values:
                new_observation = copy.deepcopy(row)
                new_observation[column_to_explode] = explode_value
                new_observations.append(new_observation) 
        else:
            new_observation = copy.deepcopy(row)
            new_observation[column_to_explode] = explode_values
            new_observations.append(new_observation) 
    return_df = pd.DataFrame(new_observations)
    return return_df

df = pandas_explode(df, column_name)

回答 18

这里有很多答案,但令我惊讶的是,没有人提到内置的熊猫 explode 函数。查看以下链接: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html#pandas.DataFrame.explode

由于某种原因,我无法访问该功能,因此我使用了以下代码:

import pandas_explode
pandas_explode.patch()
df_zlp_people_cnt3 = df_zlp_people_cnt2.explode('people')

[图片:示例数据]

以上是我的数据示例。如您所见,“ 人员”列中有一系列人员,而我正试图使其爆炸。我给的代码适用于列表类型数据。因此,请尝试将以逗号分隔的文本数据转换为列表格式。另外,由于我的代码使用内置函数,因此它比自定义/应用函数快得多。

注意:您可能需要使用pip安装pandas_explode。

There are a lot of answers here but I’m surprised no one has mentioned the built in pandas explode function. Check out the link below: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html#pandas.DataFrame.explode

For some reason I was unable to access that function, so I used the below code:

import pandas_explode
pandas_explode.patch()
df_zlp_people_cnt3 = df_zlp_people_cnt2.explode('people')

[image: sample data]

Above is a sample of my data. As you can see the people column had series of people, and I was trying to explode it. The code I have given works for list type data. So try to get your comma separated text data into list format. Also since my code uses built in functions, it is much faster than custom/apply functions.

Note: You may need to install pandas_explode with pip.
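If the built-in method is in fact available (pandas >= 0.25), the same result should be reachable without the patch package. A sketch, with a made-up stand-in for the frame from the screenshot (df_zlp_people_cnt2 holding comma-separated strings in 'people'):

import pandas as pd

df_zlp_people_cnt2 = pd.DataFrame({'people': ['ann,bob', 'carol'], 'cnt': [2, 1]})

# Convert the CSV-like column to lists first, then explode.
df_zlp_people_cnt2['people'] = df_zlp_people_cnt2['people'].str.split(',')
df_zlp_people_cnt3 = df_zlp_people_cnt2.explode('people')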


回答 19

我有一个类似的问题,我的解决方案是先将数据框转换为字典列表,然后进行转换。这是函数:

import copy
import re
import pandas as pd

def separate_row(df, column_name):
    ls = []
    for row_dict in df.to_dict('records'):
        for word in re.split(',', row_dict[column_name]):
            row = copy.deepcopy(row_dict)
            row[column_name] = word
            ls.append(row)
    return pd.DataFrame(ls)

例:

>>> from pandas import DataFrame
>>> import numpy as np
>>> a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
               {'var1': 'd,e,f', 'var2': 2}])
>>> a
    var1  var2
0  a,b,c     1
1  d,e,f     2
>>> separate_row(a, "var1")
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

您也可以稍微更改功能以支持分隔列表类型的行。

I had a similar problem, my solution was converting the dataframe to a list of dictionaries first, then do the transition. Here is the function:

import re
import pandas as pd

def separate_row(df, column_name):
    ls = []
    for row_dict in df.to_dict('records'):
        for word in re.split(',', row_dict[column_name]):
            row = row_dict.copy()
            row[column_name]=word
            ls.append(row)
    return pd.DataFrame(ls)

Example:

>>> from pandas import DataFrame
>>> import numpy as np
>>> a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
               {'var1': 'd,e,f', 'var2': 2}])
>>> a
    var1  var2
0  a,b,c     1
1  d,e,f     2
>>> separate_row(a, "var1")
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

You can also change the function a bit to support separating list type rows.
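For example, a minimal sketch of such a variant (the name separate_row_lists is made up here) for cells that already hold lists:

import pandas as pd

def separate_row_lists(df, column_name):
    # Same idea as separate_row, but iterates the list cells directly.
    ls = []
    for row_dict in df.to_dict('records'):
        for item in row_dict[column_name]:
            row = row_dict.copy()
            row[column_name] = item
            ls.append(row)
    return pd.DataFrame(ls)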


如何将一列分为两列?

问题:如何将一列分为两列?

我有一个带有一列(字符串)的数据框,我想将其拆分为两列(字符串),其中一列标题为 'fips',另一列为 'row'。

我的数据框df如下所示:

          row
0    00000 UNITED STATES
1    01000 ALABAMA
2    01001 Autauga County, AL
3    01003 Baldwin County, AL
4    01005 Barbour County, AL

我不知道如何使用 df.row.str[:] 来达到拆分 row 单元格的目的。我可以用 df['fips'] = hello 来添加一个新列,并用 hello 填充它。有任何想法吗?

         fips       row
0    00000 UNITED STATES
1    01000 ALABAMA 
2    01001 Autauga County, AL
3    01003 Baldwin County, AL
4    01005 Barbour County, AL

I have a data frame with one (string) column and I'd like to split it into two (string) columns, with one column header as 'fips' and the other 'row'.

My dataframe df looks like this:

          row
0    00000 UNITED STATES
1    01000 ALABAMA
2    01001 Autauga County, AL
3    01003 Baldwin County, AL
4    01005 Barbour County, AL

I do not know how to use df.row.str[:] to achieve my goal of splitting the row cell. I can use df['fips'] = hello to add a new column and populate it with hello. Any ideas?

         fips       row
0    00000 UNITED STATES
1    01000 ALABAMA 
2    01001 Autauga County, AL
3    01003 Baldwin County, AL
4    01005 Barbour County, AL

回答 0

也许有更好的方法,但这是一种方法:

                            row
    0       00000 UNITED STATES
    1             01000 ALABAMA
    2  01001 Autauga County, AL
    3  01003 Baldwin County, AL
    4  01005 Barbour County, AL
df = pd.DataFrame(df.row.str.split(' ',1).tolist(),
                                 columns = ['flips','row'])
   flips                 row
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

There might be a better way, but this here’s one approach:

                            row
    0       00000 UNITED STATES
    1             01000 ALABAMA
    2  01001 Autauga County, AL
    3  01003 Baldwin County, AL
    4  01005 Barbour County, AL
df = pd.DataFrame(df.row.str.split(' ',1).tolist(),
                                 columns = ['flips','row'])
   flips                 row
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

回答 1

TL; DR版本:

对于以下简单情况:

  • 我有一个带有定界符的文本列,我想要两列

最简单的解决方案是:

df[['A', 'B']] = df['AB'].str.split(' ', 1, expand=True)

如果字符串的拆分次数不一致,并且希望用 None 替换缺失值,则必须使用 expand=True。

请注意,无论哪种情况,都不需要 .tolist() 方法,也不需要 zip()。

详细地:

安迪·海登(Andy Hayden)的解决方案最能证明该str.extract()方法的强大功能。

但是对于在已知分隔符上的简单拆分(例如,用破折号拆分或通过空格拆分),该.str.split()方法就足够了1。它对字符串的一列(系列)进行操作,并返回列表的一列(系列):

>>> import pandas as pd
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})
>>> df

      AB
0  A1-B1
1  A2-B2
>>> df['AB_split'] = df['AB'].str.split('-')
>>> df

      AB  AB_split
0  A1-B1  [A1, B1]
1  A2-B2  [A2, B2]

1:如果不确定 .str.split() 的前两个参数的作用,我建议查阅该方法的纯 Python 版本的文档。

但是你如何去做:

  • 包含两个元素的列表的列

至:

  • 两列,每列包含列表的相应元素?

好吧,我们需要仔细看看列的 .str 属性。

这是一个神奇的对象,它汇集了把列中每个元素当作字符串来处理的方法,并尽可能高效地将相应方法应用到每个元素上:

>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df

   U
0  A
1  B
2  C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df

   U  L
0  A  a
1  B  b
2  C  c

但是它也有一个“索引”接口,用于通过其索引获取字符串的每个元素:

>>> df['AB'].str[0]

0    A
1    A
Name: AB, dtype: object

>>> df['AB'].str[1]

0    1
1    2
Name: AB, dtype: object

当然,只要元素可以被索引,.str 的这个索引接口并不真正在乎它索引的每个元素是否真的是字符串,因此:

>>> df['AB'].str.split('-', 1).str[0]

0    A1
1    A2
Name: AB, dtype: object

>>> df['AB'].str.split('-', 1).str[1]

0    B1
1    B2
Name: AB, dtype: object

然后,只需利用 Python 对可迭代对象的元组拆包即可:

>>> df['A'], df['B'] = df['AB'].str.split('-', 1).str
>>> df

      AB  AB_split   A   B
0  A1-B1  [A1, B1]  A1  B1
1  A2-B2  [A2, B2]  A2  B2

当然,从拆分一列字符串中获取一个DataFrame非常有用,以至于该.str.split()方法可以通过expand=True参数为您做到这一点:

>>> df['AB'].str.split('-', 1, expand=True)

    0   1
0  A1  B1
1  A2  B2

因此,完成我们想要的工作的另一种方法是:

>>> df = df[['AB']]
>>> df

      AB
0  A1-B1
1  A2-B2

>>> df.join(df['AB'].str.split('-', 1, expand=True).rename(columns={0:'A', 1:'B'}))

      AB   A   B
0  A1-B1  A1  B1
1  A2-B2  A2  B2

expand=True版本虽然较长,但与元组拆包方法相比具有明显的优势。元组解压缩不能很好地处理不同长度的拆分:

>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2', 'A3-B3-C3']})
>>> df
         AB
0     A1-B1
1     A2-B2
2  A3-B3-C3
>>> df['A'], df['B'], df['C'] = df['AB'].str.split('-')
Traceback (most recent call last):
  [...]    
ValueError: Length of values does not match length of index
>>> 

但是 expand=True 可以很好地处理这种情况:对于没有足够"拆分"的列,会放入 None:

>>> df.join(
...     df['AB'].str.split('-', expand=True).rename(
...         columns={0:'A', 1:'B', 2:'C'}
...     )
... )
         AB   A   B     C
0     A1-B1  A1  B1  None
1     A2-B2  A2  B2  None
2  A3-B3-C3  A3  B3    C3

TL;DR version:

For the simple case of:

  • I have a text column with a delimiter and I want two columns

The simplest solution is:

df[['A', 'B']] = df['AB'].str.split(' ', 1, expand=True)

You must use expand=True if your strings have a non-uniform number of splits and you want None to replace the missing values.

Notice how, in either case, the .tolist() method is not necessary. Neither is zip().

In detail:

Andy Hayden’s solution is most excellent in demonstrating the power of the str.extract() method.

But for a simple split over a known separator (like, splitting by dashes, or splitting by whitespace), the .str.split() method is enough1. It operates on a column (Series) of strings, and returns a column (Series) of lists:

>>> import pandas as pd
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})
>>> df

      AB
0  A1-B1
1  A2-B2
>>> df['AB_split'] = df['AB'].str.split('-')
>>> df

      AB  AB_split
0  A1-B1  [A1, B1]
1  A2-B2  [A2, B2]

1: If you’re unsure what the first two parameters of .str.split() do, I recommend the docs for the plain Python version of the method.

But how do you go from:

  • a column containing two-element lists

to:

  • two columns, each containing the respective element of the lists?

Well, we need to take a closer look at the .str attribute of a column.

It's a magical object that is used to collect methods that treat each element in a column as a string, and then apply the respective method to each element as efficiently as possible:

>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df

   U
0  A
1  B
2  C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df

   U  L
0  A  a
1  B  b
2  C  c

But it also has an “indexing” interface for getting each element of a string by its index:

>>> df['AB'].str[0]

0    A
1    A
Name: AB, dtype: object

>>> df['AB'].str[1]

0    1
1    2
Name: AB, dtype: object

Of course, this indexing interface of .str doesn’t really care if each element it’s indexing is actually a string, as long as it can be indexed, so:

>>> df['AB'].str.split('-', 1).str[0]

0    A1
1    A2
Name: AB, dtype: object

>>> df['AB'].str.split('-', 1).str[1]

0    B1
1    B2
Name: AB, dtype: object

Then, it’s a simple matter of taking advantage of the Python tuple unpacking of iterables to do

>>> df['A'], df['B'] = df['AB'].str.split('-', 1).str
>>> df

      AB  AB_split   A   B
0  A1-B1  [A1, B1]  A1  B1
1  A2-B2  [A2, B2]  A2  B2

Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:

>>> df['AB'].str.split('-', 1, expand=True)

    0   1
0  A1  B1
1  A2  B2

So, another way of accomplishing what we wanted is to do:

>>> df = df[['AB']]
>>> df

      AB
0  A1-B1
1  A2-B2

>>> df.join(df['AB'].str.split('-', 1, expand=True).rename(columns={0:'A', 1:'B'}))

      AB   A   B
0  A1-B1  A1  B1
1  A2-B2  A2  B2

The expand=True version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn’t deal well with splits of different lengths:

>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2', 'A3-B3-C3']})
>>> df
         AB
0     A1-B1
1     A2-B2
2  A3-B3-C3
>>> df['A'], df['B'], df['C'] = df['AB'].str.split('-')
Traceback (most recent call last):
  [...]    
ValueError: Length of values does not match length of index
>>> 

But expand=True handles it nicely by placing None in the columns for which there aren’t enough “splits”:

>>> df.join(
...     df['AB'].str.split('-', expand=True).rename(
...         columns={0:'A', 1:'B', 2:'C'}
...     )
... )
         AB   A   B     C
0     A1-B1  A1  B1  None
1     A2-B2  A2  B2  None
2  A3-B3-C3  A3  B3    C3

回答 2

您可以使用正则表达式模式整齐地提取不同部分:

In [11]: df.row.str.extract('(?P<fips>\d{5})((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))')
Out[11]: 
    fips                    1           state           county state_code
0  00000        UNITED STATES   UNITED STATES              NaN        NaN
1  01000              ALABAMA         ALABAMA              NaN        NaN
2  01001   Autauga County, AL             NaN   Autauga County         AL
3  01003   Baldwin County, AL             NaN   Baldwin County         AL
4  01005   Barbour County, AL             NaN   Barbour County         AL

[5 rows x 5 columns]

要解释有点长的正则表达式:

(?P<fips>\d{5})
  • 匹配五个数字(\d)并命名"fips"

下一部分:

((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))

|)做以下两件事之一:

(?P<state>[A-Z ]*$)
  • 匹配任意数量(*)的大写字母或空格([A-Z ]),并在到达字符串末尾($)之前将其命名为 "state",

要么

(?P<county>.*?), (?P<state_code>[A-Z]{2}$))
  • 匹配其他任何内容(.*),然后
  • 一个逗号和一个空格,然后
  • 匹配字符串末尾($)的两个字母的 state_code。

在示例中:
请注意,前两行命中的是"州"(county 和 state_code 列留为 NaN),而后三行命中的是 county 和 state_code(state 列留为 NaN)。

You can extract the different parts out quite neatly using a regex pattern:

In [11]: df.row.str.extract('(?P<fips>\d{5})((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))')
Out[11]: 
    fips                    1           state           county state_code
0  00000        UNITED STATES   UNITED STATES              NaN        NaN
1  01000              ALABAMA         ALABAMA              NaN        NaN
2  01001   Autauga County, AL             NaN   Autauga County         AL
3  01003   Baldwin County, AL             NaN   Baldwin County         AL
4  01005   Barbour County, AL             NaN   Barbour County         AL

[5 rows x 5 columns]

To explain the somewhat long regex:

(?P<fips>\d{5})
  • Matches the five digits (\d) and names them "fips".

The next part:

((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))

Does either (|) one of two things:

(?P<state>[A-Z ]*$)
  • Matches any number (*) of capital letters or spaces ([A-Z ]) and names this "state" before the end of the string ($),

or

(?P<county>.*?), (?P<state_code>[A-Z]{2}$))
  • matches anything else (.*) then
  • a comma and a space then
  • matches the two-letter state_code before the end of the string ($).

In the example:
Note that the first two rows hit the “state” (leaving NaN in the county and state_code columns), whilst the last three hit the county, state_code (leaving NaN in the state column).
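If only the named groups are wanted, a small follow-up sketch (assuming df is the frame from the question) that drops the unnamed outer group, which extract returns under the integer label 1:

import pandas as pd

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA',
                           '01001 Autauga County, AL']})

out = df.row.str.extract(
    r'(?P<fips>\d{5})((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))')
out = out.drop(columns=[1])  # keep only fips/state/county/state_code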


回答 3

df[['fips', 'row']] = df['row'].str.split(' ', n=1, expand=True)
df[['fips', 'row']] = df['row'].str.split(' ', n=1, expand=True)

回答 4

如果您不想创建新的数据框,或者您的数据框具有比仅要拆分的列更多的列,则可以:

df["flips"], df["row_name"] = zip(*df["row"].str.split().tolist())
del df["row"]  

If you don’t want to create a new dataframe, or if your dataframe has more columns than just the ones you want to split, you could:

df["flips"], df["row_name"] = zip(*df["row"].str.split().tolist())
del df["row"]  
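One caveat worth noting as a sketch: the question's rows contain more than two whitespace-separated tokens, and zip truncates to the shortest split, so limiting the split to the first space seems safer:

import pandas as pd

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01001 Autauga County, AL']})

# n=1 keeps every split at exactly two elements, so zip no longer truncates.
df["flips"], df["row_name"] = zip(*df["row"].str.split(n=1).tolist())
del df["row"]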

回答 5

您可以使用 str.split 按空格(默认分隔符)拆分,配合参数 expand=True 得到 DataFrame,再将其赋值给新列:

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA', 
                           '01001 Autauga County, AL', '01003 Baldwin County, AL', 
                           '01005 Barbour County, AL']})
print (df)
                        row
0       00000 UNITED STATES
1             01000 ALABAMA
2  01001 Autauga County, AL
3  01003 Baldwin County, AL
4  01005 Barbour County, AL



df[['a','b']] = df['row'].str.split(n=1, expand=True)
print (df)
                        row      a                   b
0       00000 UNITED STATES  00000       UNITED STATES
1             01000 ALABAMA  01000             ALABAMA
2  01001 Autauga County, AL  01001  Autauga County, AL
3  01003 Baldwin County, AL  01003  Baldwin County, AL
4  01005 Barbour County, AL  01005  Barbour County, AL

如果需要删除原始列,可以改用 DataFrame.pop:

df[['a','b']] = df.pop('row').str.split(n=1, expand=True)
print (df)
       a                   b
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

这与下面的写法相同:

df[['a','b']] = df['row'].str.split(n=1, expand=True)
df = df.drop('row', axis=1)
print (df)

       a                   b
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

如果出现错误:

#remove n=1 for split by all whitespaces
df[['a','b']] = df['row'].str.split(expand=True)

ValueError:列的长度必须与键的长度相同

您可以检查一下:它返回的是 4 列的 DataFrame,而不仅仅是 2 列:

print (df['row'].str.split(expand=True))
       0        1        2     3
0  00000   UNITED   STATES  None
1  01000  ALABAMA     None  None
2  01001  Autauga  County,    AL
3  01003  Baldwin  County,    AL
4  01005  Barbour  County,    AL

那么解决方案就是用 join 追加新的 DataFrame:

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA', 
                           '01001 Autauga County, AL', '01003 Baldwin County, AL', 
                           '01005 Barbour County, AL'],
                    'a':range(5)})
print (df)
   a                       row
0  0       00000 UNITED STATES
1  1             01000 ALABAMA
2  2  01001 Autauga County, AL
3  3  01003 Baldwin County, AL
4  4  01005 Barbour County, AL

df = df.join(df['row'].str.split(expand=True))
print (df)

   a                       row      0        1        2     3
0  0       00000 UNITED STATES  00000   UNITED   STATES  None
1  1             01000 ALABAMA  01000  ALABAMA     None  None
2  2  01001 Autauga County, AL  01001  Autauga  County,    AL
3  3  01003 Baldwin County, AL  01003  Baldwin  County,    AL
4  4  01005 Barbour County, AL  01005  Barbour  County,    AL

同时删除原始列(如果还有其他列的话):

df = df.join(df.pop('row').str.split(expand=True))
print (df)
   a      0        1        2     3
0  0  00000   UNITED   STATES  None
1  1  01000  ALABAMA     None  None
2  2  01001  Autauga  County,    AL
3  3  01003  Baldwin  County,    AL
4  4  01005  Barbour  County,    AL   

You can use str.split by whitespace (the default separator) with parameter expand=True to get a DataFrame, and assign it to new columns:

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA', 
                           '01001 Autauga County, AL', '01003 Baldwin County, AL', 
                           '01005 Barbour County, AL']})
print (df)
                        row
0       00000 UNITED STATES
1             01000 ALABAMA
2  01001 Autauga County, AL
3  01003 Baldwin County, AL
4  01005 Barbour County, AL



df[['a','b']] = df['row'].str.split(n=1, expand=True)
print (df)
                        row      a                   b
0       00000 UNITED STATES  00000       UNITED STATES
1             01000 ALABAMA  01000             ALABAMA
2  01001 Autauga County, AL  01001  Autauga County, AL
3  01003 Baldwin County, AL  01003  Baldwin County, AL
4  01005 Barbour County, AL  01005  Barbour County, AL

A modification, if you need to remove the original column, with DataFrame.pop:

df[['a','b']] = df.pop('row').str.split(n=1, expand=True)
print (df)
       a                   b
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

Which is the same as:

df[['a','b']] = df['row'].str.split(n=1, expand=True)
df = df.drop('row', axis=1)
print (df)

       a                   b
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

If you get an error:

#remove n=1 for split by all whitespaces
df[['a','b']] = df['row'].str.split(expand=True)

ValueError: Columns must be same length as key

You can check it: it returns a 4-column DataFrame, not only 2:

print (df['row'].str.split(expand=True))
       0        1        2     3
0  00000   UNITED   STATES  None
1  01000  ALABAMA     None  None
2  01001  Autauga  County,    AL
3  01003  Baldwin  County,    AL
4  01005  Barbour  County,    AL

Then the solution is to append the new DataFrame with join:

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA', 
                           '01001 Autauga County, AL', '01003 Baldwin County, AL', 
                           '01005 Barbour County, AL'],
                    'a':range(5)})
print (df)
   a                       row
0  0       00000 UNITED STATES
1  1             01000 ALABAMA
2  2  01001 Autauga County, AL
3  3  01003 Baldwin County, AL
4  4  01005 Barbour County, AL

df = df.join(df['row'].str.split(expand=True))
print (df)

   a                       row      0        1        2     3
0  0       00000 UNITED STATES  00000   UNITED   STATES  None
1  1             01000 ALABAMA  01000  ALABAMA     None  None
2  2  01001 Autauga County, AL  01001  Autauga  County,    AL
3  3  01003 Baldwin County, AL  01003  Baldwin  County,    AL
4  4  01005 Barbour County, AL  01005  Barbour  County,    AL

With removal of the original column (if there are other columns as well):

df = df.join(df.pop('row').str.split(expand=True))
print (df)
   a      0        1        2     3
0  0  00000   UNITED   STATES  None
1  1  01000  ALABAMA     None  None
2  2  01001  Autauga  County,    AL
3  3  01003  Baldwin  County,    AL
4  4  01005  Barbour  County,    AL   

回答 6

如果要基于定界符将字符串分为两列以上,则可以省略“ maximum splits”参数。
您可以使用:

df['column_name'].str.split('/', expand=True)

这将自动创建与您的任何初始字符串中包含的最大字段数一样多的列。

If you want to split a string into more than two columns based on a delimiter you can omit the ‘maximum splits’ parameter.
You can use:

df['column_name'].str.split('/', expand=True)

This will automatically create as many columns as the maximum number of fields included in any of your initial strings.
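For instance, a small sketch with made-up data where rows split into different numbers of fields:

import pandas as pd

df = pd.DataFrame({'column_name': ['a/b', 'a/b/c', 'a']})

parts = df['column_name'].str.split('/', expand=True)
# parts has three columns; shorter rows are padded with None:
#    0     1     2
# 0  a     b  None
# 1  a     b     c
# 2  a  None  None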


回答 7

很惊讶居然还没有人提到这个。如果您只需要拆分成两部分,我强烈推荐……

Series.str.partition

partition 在分隔符上执行一次拆分,通常表现出色。

df['row'].str.partition(' ')[[0, 2]]

       0                   2
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

如果您需要重命名这些列,

df['row'].str.partition(' ')[[0, 2]].rename({0: 'fips', 2: 'row'}, axis=1)

    fips                 row
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

如果您需要将其连接回原始数据框,请使用 join 或 concat:

df.join(df['row'].str.partition(' ')[[0, 2]])

pd.concat([df, df['row'].str.partition(' ')[[0, 2]]], axis=1)

                        row      0                   2
0       00000 UNITED STATES  00000       UNITED STATES
1             01000 ALABAMA  01000             ALABAMA
2  01001 Autauga County, AL  01001  Autauga County, AL
3  01003 Baldwin County, AL  01003  Baldwin County, AL
4  01005 Barbour County, AL  01005  Barbour County, AL

Surprised I haven’t seen this one yet. If you only need two splits, I highly recommend. . .

Series.str.partition

partition performs one split on the separator, and is generally quite performant.

df['row'].str.partition(' ')[[0, 2]]

       0                   2
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

If you need to rename the columns,

df['row'].str.partition(' ')[[0, 2]].rename({0: 'fips', 2: 'row'}, axis=1)

    fips                 row
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

If you need to join this back to the original, use join or concat:

df.join(df['row'].str.partition(' ')[[0, 2]])

pd.concat([df, df['row'].str.partition(' ')[[0, 2]]], axis=1)

                        row      0                   2
0       00000 UNITED STATES  00000       UNITED STATES
1             01000 ALABAMA  01000             ALABAMA
2  01001 Autauga County, AL  01001  Autauga County, AL
3  01003 Baldwin County, AL  01003  Baldwin County, AL
4  01005 Barbour County, AL  01005  Barbour County, AL

回答 8

我更喜欢先取出相应的熊猫系列(即我需要的列),使用apply函数将列内容拆分为多个系列,然后将生成的列连接到现有的数据帧。当然,最后应删除源列。

例如

 col1 = df["<col_name>"].apply(<function>)
 col2 = ...
 df = df.join(col1.to_frame(name="<name1>"))
 df = df.join(col2.to_frame(name="<name2>"))
 df = df.drop(["<col_name>"], axis=1)

用于拆分由两个单词组成的字符串的函数应该类似这样:

lambda x: x.split(" ")[0] # for the first element
lambda x: x.split(" ")[-1] # for the last element

I prefer exporting the corresponding pandas series (i.e. the columns I need), using the apply function to split the column content into multiple series and then join the generated columns to the existing DataFrame. Of course, the source column should be removed.

e.g.

 col1 = df["<col_name>"].apply(<function>)
 col2 = ...
 df = df.join(col1.to_frame(name="<name1>"))
 df = df.join(col2.to_frame(name="<name2>"))
 df = df.drop(["<col_name>"], axis=1)

To split a two-word string, the functions should look something like this:

lambda x: x.split(" ")[0] # for the first element
lambda x: x.split(" ")[-1] # for the last element
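
A concrete sketch of this pattern, with the placeholders filled in (the sample data and new column names here are hypothetical):

import pandas as pd

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA']})

col1 = df['row'].apply(lambda x: x.split(' ')[0])   # first element
col2 = df['row'].apply(lambda x: x.split(' ')[-1])  # last element
df = df.join(col1.to_frame(name='fips'))
df = df.join(col2.to_frame(name='state'))
df = df.drop(['row'], axis=1)  # remove the source column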

回答 9

我看到还没有人用过切片(slice)方法,所以在这里贡献一下我的做法。

df["<col_name>"].str.slice(stop=5)
df["<col_name>"].str.slice(start=6)

此方法将创建两个新列。

I saw that no one had used the slice method, so here I put my 2 cents here.

df["<col_name>"].str.slice(stop=5)
df["<col_name>"].str.slice(start=6)

This method will create two new columns.
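
Note that the two expressions above only return Series; to actually attach them as columns you still need to assign, e.g. (the new column names are hypothetical):

df['left_part'] = df["<col_name>"].str.slice(stop=5)
df['right_part'] = df["<col_name>"].str.slice(start=6)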


回答 10

使用df.assign创建一个新的DF。参见http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

split = df_selected['name'].str.split(',', n=1, expand=True)
df_split = df_selected.assign(first_name=split[0], last_name=split[1])
df_split.drop(columns='name', inplace=True)

Use df.assign to create a new df. See http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

split = df_selected['name'].str.split(',', n=1, expand=True)
df_split = df_selected.assign(first_name=split[0], last_name=split[1])
df_split.drop(columns='name', inplace=True)

检测并排除熊猫数据框中的异常值

问题:检测并排除熊猫数据框中的异常值

我有一个只有几列的熊猫数据框。

现在我知道某些行是基于某个列值的离群值。

例如

“Vol”列的所有值都在12xx左右,而有一个值是4000(离群值)。

现在,我想排除Vol列中具有这类异常值的行。

因此,从本质上讲,我需要在数据帧上加一个过滤器,以便选出某一列的值与均值相差(例如)3个标准差以内的所有行。

有什么优雅的方法可以做到这一点?

I have a pandas data frame with few columns.

Now I know that certain rows are outliers based on a certain column value.

For instance

column ‘Vol’ has all values around 12xx and one value is 4000 (outlier).

Now I would like to exclude those rows that have a Vol value like this.

So, essentially I need to put a filter on the data frame such that we select all rows where the values of a certain column are within, say, 3 standard deviations from the mean.

What is an elegant way to achieve this?


回答 0

如果您的数据框中有多个列,并且想要删除至少一列中具有异常值的所有行,则以下表达式可以一次性完成。

df = pd.DataFrame(np.random.randn(100, 3))

from scipy import stats
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

描述:

  • 对于每列,首先要计算列中每个值相对于列均值和标准差的Z分数。
  • 然后取Z分数的绝对值,因为方向无关紧要,重要的是它是否低于阈值。
  • all(axis = 1)确保对于每一行,所有列均满足约束。
  • 最后,此条件的结果用于索引数据帧。

If you have multiple columns in your dataframe and would like to remove all rows that have outliers in at least one column, the following expression would do that in one shot.

df = pd.DataFrame(np.random.randn(100, 3))

from scipy import stats
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

description:

  • For each column, first it computes the Z-score of each value in the column, relative to the column mean and standard deviation.
  • Then it takes the absolute value of the Z-score, because the direction does not matter; all that matters is whether it is below the threshold.
  • all(axis=1) ensures that for each row, all columns satisfy the constraint.
  • Finally, the result of this condition is used to index the dataframe (see the step-by-step sketch below).
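
The same one-liner, unpacked step by step as a sketch, to make the intermediate results visible:

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.randn(100, 3))

z = stats.zscore(df)               # column-wise Z-scores, shape (100, 3)
abs_z = np.abs(z)                  # direction does not matter
row_ok = (abs_z < 3).all(axis=1)   # True where every column of the row passes
df_clean = df[row_ok]              # boolean indexing keeps the passing rows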

回答 1

像在numpy.array中那样使用布尔(boolean)索引

df = pd.DataFrame({'Data':np.random.normal(size=200)})
# example dataset of normally distributed data. 

df[np.abs(df.Data-df.Data.mean()) <= (3*df.Data.std())]
# keep only the ones that are within +3 to -3 standard deviations in the column 'Data'.

df[~(np.abs(df.Data-df.Data.mean()) > (3*df.Data.std()))]
# or if you prefer the other way around

对于系列,它类似于:

S = pd.Series(np.random.normal(size=200))
S[~((S-S.mean()).abs() > 3*S.std())]

Use boolean indexing as you would do in numpy.array

df = pd.DataFrame({'Data':np.random.normal(size=200)})
# example dataset of normally distributed data. 

df[np.abs(df.Data-df.Data.mean()) <= (3*df.Data.std())]
# keep only the ones that are within +3 to -3 standard deviations in the column 'Data'.

df[~(np.abs(df.Data-df.Data.mean()) > (3*df.Data.std()))]
# or if you prefer the other way around

For a series it is similar:

S = pd.Series(np.random.normal(size=200))
S[~((S-S.mean()).abs() > 3*S.std())]

回答 2

对于dataframe的每一列,您可以使用以下方法获得分位数:

q = df["col"].quantile(0.99)

然后过滤:

df[df["col"] < q]

如果需要删除上下限离群值,则将条件与AND语句结合使用:

q_low = df["col"].quantile(0.01)
q_hi  = df["col"].quantile(0.99)

df_filtered = df[(df["col"] < q_hi) & (df["col"] > q_low)]

For each of your dataframe columns, you could get the quantile with:

q = df["col"].quantile(0.99)

and then filter with:

df[df["col"] < q]

If you need to remove both lower and upper outliers, combine the conditions with an AND statement:

q_low = df["col"].quantile(0.01)
q_hi  = df["col"].quantile(0.99)

df_filtered = df[(df["col"] < q_hi) & (df["col"] > q_low)]

回答 3

此答案与@tanemaki提供的答案相似,但使用的是lambda表达式而不是scipy stats

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))

df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]

要过滤只有一个列(例如“ B”)在三个标准差以内的DataFrame:

df[((df.B - df.B.mean()) / df.B.std()).abs() < 3]

有关如何在滚动基础上应用此z评分的信息,请参见此处:将滚动z评分应用于熊猫数据框

This answer is similar to that provided by @tanemaki, but uses a lambda expression instead of scipy stats.

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))

df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]

To filter the DataFrame where only ONE column (e.g. ‘B’) is within three standard deviations:

df[((df.B - df.B.mean()) / df.B.std()).abs() < 3]

See here for how to apply this z-score on a rolling basis: Rolling Z-score applied to pandas dataframe


回答 4

#------------------------------------------------------------------------------
# accept a dataframe, remove outliers, return cleaned data in a new dataframe
# see http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
#------------------------------------------------------------------------------
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
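
A hedged usage sketch, applied to the Vol column from the question:

df_clean = remove_outlier(df, 'Vol')  # rows outside the IQR fences are dropped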

回答 5

对于数据框中的每个系列,您可以使用between和quantile删除异常值。

x = pd.Series(np.random.normal(size=200)) # with outliers
x = x[x.between(x.quantile(.25), x.quantile(.75))] # without outliers

For each series in the dataframe, you could use between and quantile to remove outliers.

x = pd.Series(np.random.normal(size=200)) # with outliers
x = x[x.between(x.quantile(.25), x.quantile(.75))] # without outliers
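
Note that the .25/.75 quantiles above keep only the middle half of the data; for outlier removal you would more typically use wider bounds, e.g.:

x = x[x.between(x.quantile(.01), x.quantile(.99))]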

回答 6

由于我还没有看到同时处理数值和非数值属性的答案,这里给出一个补充性答案。

您可能只想对数值属性删除离群值(分类变量几乎不可能是离群值)。

功能定义

我还扩展了@tanemaki的建议,以在还存在非数字属性时处理数据:

import numpy as np
from scipy import stats

def drop_numerical_outliers(df, z_thresh=3):
    # constraints will contain `True` or `False` per row, depending on whether
    # every numerical value in that row is below the Z-score threshold.
    constraints = df.select_dtypes(include=[np.number]) \
        .apply(lambda x: np.abs(stats.zscore(x)) < z_thresh) \
        .all(axis=1)
    # Drop (in place) the rows that were rejected
    df.drop(df.index[~constraints], inplace=True)

用法

drop_numerical_outliers(df)

例

想象一个数据集df,其中包含有关房屋的一些值:胡同、土地轮廓、售价……例如:数据文档

首先,您需要在散点图上可视化数据(z分数阈值=3):

# Plot data before dropping those greater than z-score 3. 
# The scatterAreaVsPrice function's definition has been removed for readability's sake.
scatterAreaVsPrice(df)

在此之前-Gr Liv Area与SalePrice

# Drop the outliers on every attributes
drop_numerical_outliers(train_df)

# Plot the result. All outliers were dropped. Note that the red points are not
# the same outliers from the first plot, but the new computed outliers based on the new data-frame.
scatterAreaVsPrice(train_df)

售后-Gr Liv地区对售价

Since I haven’t seen an answer that deals with both numerical and non-numerical attributes, here is a complementary answer.

You might want to drop the outliers only on numerical attributes (categorical variables can hardly be outliers).

Function definition

I have extended @tanemaki’s suggestion to handle data when non-numeric attributes are also present:

import numpy as np
from scipy import stats

def drop_numerical_outliers(df, z_thresh=3):
    # constraints will contain `True` or `False` per row, depending on whether
    # every numerical value in that row is below the Z-score threshold.
    constraints = df.select_dtypes(include=[np.number]) \
        .apply(lambda x: np.abs(stats.zscore(x)) < z_thresh) \
        .all(axis=1)
    # Drop (in place) the rows that were rejected
    df.drop(df.index[~constraints], inplace=True)

Usage

drop_numerical_outliers(df)

Example

Imagine a dataset df with some values about houses: alley, land contour, sale price, … E.g: Data Documentation

First, you want to visualise the data on a scatter graph (with z-score Thresh=3):

# Plot data before dropping those greater than z-score 3. 
# The scatterAreaVsPrice function's definition has been removed for readability's sake.
scatterAreaVsPrice(df)

Before - Gr Liv Area Versus SalePrice

# Drop the outliers on every attributes
drop_numerical_outliers(train_df)

# Plot the result. All outliers were dropped. Note that the red points are not
# the same outliers from the first plot, but the new computed outliers based on the new data-frame.
scatterAreaVsPrice(train_df)

After - Gr Liv Area Versus SalePrice


回答 7

scipy.stats提供了trim1()和trimboth()方法,可以按排名和给定的删除比例,用一行代码剔除离群值。

scipy.stats has methods trim1() and trimboth() to cut the outliers out in a single row, according to the ranking and an introduced percentage of removed values.
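
A minimal sketch of both functions; note they return plain NumPy arrays and do not guarantee the original element order:

import numpy as np
from scipy import stats

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

print(stats.trimboth(data, 0.1))             # cut 10% from each tail: drops 1 and 100
print(stats.trim1(data, 0.1, tail='right'))  # cut 10% from the right tail only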


回答 8

另一种选择是对数据进行变换,以减轻离群值的影响。您可以通过对数据进行winsorize(缩尾处理)来做到这一点。

import pandas as pd
from scipy.stats import mstats
%matplotlib inline

test_data = pd.Series(range(30))
test_data.plot()

原始数据

# Truncate values to the 5th and 95th percentiles
transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05])) 
transformed_test_data.plot()

Winsorized数据

Another option is to transform your data so that the effect of outliers is mitigated. You can do this by winsorizing your data.

import pandas as pd
from scipy.stats import mstats
%matplotlib inline

test_data = pd.Series(range(30))
test_data.plot()

Original data

# Truncate values to the 5th and 95th percentiles
transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05])) 
transformed_test_data.plot()

Winsorized data


回答 9

如果您喜欢方法链接,则可以为所有数字列获取布尔条件,如下所示:

df.sub(df.mean()).div(df.std()).abs().lt(3)

每列中的每个值都会根据其与均值的距离是否小于三个标准差,被转换为True/False。

If you like method chaining, you can get your boolean condition for all numeric columns like this:

df.sub(df.mean()).div(df.std()).abs().lt(3)

Each value of each column will be converted to True/False based on whether it is less than three standard deviations away from the mean or not.
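
To actually filter with this condition, keep the rows where it holds for every column:

df_clean = df[df.sub(df.mean()).div(df.std()).abs().lt(3).all(axis=1)]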


回答 10

您可以使用布尔掩码:

import pandas as pd

def remove_outliers(df, q=0.05):
    upper = df.quantile(1-q)
    lower = df.quantile(q)
    mask = (df < upper) & (df > lower)
    return mask

t = pd.DataFrame({'train': [1,1,2,3,4,5,6,7,8,9,9],
                  'y': [1,0,0,1,1,0,0,1,1,1,0]})

mask = remove_outliers(t['train'], 0.1)

print(t[mask])

输出:

   train  y
2      2  0
3      3  1
4      4  1
5      5  0
6      6  0
7      7  1
8      8  1

You can use boolean mask:

import pandas as pd

def remove_outliers(df, q=0.05):
    upper = df.quantile(1-q)
    lower = df.quantile(q)
    mask = (df < upper) & (df > lower)
    return mask

t = pd.DataFrame({'train': [1,1,2,3,4,5,6,7,8,9,9],
                  'y': [1,0,0,1,1,0,0,1,1,1,0]})

mask = remove_outliers(t['train'], 0.1)

print(t[mask])

output:

   train  y
2      2  0
3      3  1
4      4  1
5      5  0
6      6  0
7      7  1
8      8  1

回答 11

由于我正处于数据科学之旅的早期阶段,因此我使用以下代码处理异常值。

#Outlier Treatment

import numpy as np

def outlier_detect(df):
    for i in df.describe().columns:
        Q1=df.describe().at['25%',i]
        Q3=df.describe().at['75%',i]
        IQR=Q3 - Q1
        LTV=Q1 - 1.5 * IQR
        UTV=Q3 + 1.5 * IQR
        x=np.array(df[i])
        p=[]
        for j in x:
            if j < LTV or j>UTV:
                p.append(df[i].median())
            else:
                p.append(j)
        df[i]=p
    return df

Since I am in a very early stage of my data science journey, I am treating outliers with the code below.

#Outlier Treatment

import numpy as np

def outlier_detect(df):
    for i in df.describe().columns:
        Q1=df.describe().at['25%',i]
        Q3=df.describe().at['75%',i]
        IQR=Q3 - Q1
        LTV=Q1 - 1.5 * IQR
        UTV=Q3 + 1.5 * IQR
        x=np.array(df[i])
        p=[]
        for j in x:
            if j < LTV or j>UTV:
                p.append(df[i].median())
            else:
                p.append(j)
        df[i]=p
    return df

回答 12

获得第98和第2个百分位数作为离群值的限制

upper_limit = np.percentile(X_train.logerror.values, 98)
lower_limit = np.percentile(X_train.logerror.values, 2)

# Cap the outliers in the dataframe
data.loc[X_train['target'] > upper_limit, 'target'] = upper_limit
data.loc[X_train['target'] < lower_limit, 'target'] = lower_limit

Get the 98th and 2nd percentile as the limits of our outliers

upper_limit = np.percentile(X_train.logerror.values, 98)
lower_limit = np.percentile(X_train.logerror.values, 2)

# Cap the outliers in the dataframe
data.loc[X_train['target'] > upper_limit, 'target'] = upper_limit
data.loc[X_train['target'] < lower_limit, 'target'] = lower_limit
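
As a side note, the same capping can arguably be written more compactly with Series.clip (a sketch, assuming data['target'] and X_train['target'] refer to the same column as above):

data['target'] = X_train['target'].clip(lower_limit, upper_limit)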

回答 13

包含数据和2个组的完整示例如下:

导入:

from io import StringIO
import pandas as pd
#pandas config
pd.set_option('display.max_rows', 20)

具有2组的数据示例:G1:组1。G2:组2:

TESTDATA = StringIO("""G1;G2;Value
1;A;1.6
1;A;5.1
1;A;7.1
1;A;8.1

1;B;21.1
1;B;22.1
1;B;24.1
1;B;30.6

2;A;40.6
2;A;51.1
2;A;52.1
2;A;60.6

2;B;80.1
2;B;70.6
2;B;90.6
2;B;85.1
""")

将文本数据读取到pandas数据框:

df = pd.read_csv(TESTDATA, sep=";")

使用标准偏差定义离群值

stds = 1.0
outliers = df[['G1', 'G2', 'Value']].groupby(['G1','G2']).transform(
           lambda group: (group - group.mean()).abs().div(group.std())) > stds

定义过滤后的数据值和离群值:

dfv = df[outliers.Value == False]
dfo = df[outliers.Value == True]

打印结果:

print('\n'*5, 'All values with decimal 1 are non-outliers. On the other hand, all values with 6 in the decimal are.')
print('\nDef DATA:\n%s\n\nFiltered values with %s stds:\n%s\n\nOutliers:\n%s' % (df, stds, dfv, dfo))

A full example with data and 2 groups follows:

Imports:

from io import StringIO
import pandas as pd
#pandas config
pd.set_option('display.max_rows', 20)

Data example with 2 groups: G1:Group 1. G2: Group 2:

TESTDATA = StringIO("""G1;G2;Value
1;A;1.6
1;A;5.1
1;A;7.1
1;A;8.1

1;B;21.1
1;B;22.1
1;B;24.1
1;B;30.6

2;A;40.6
2;A;51.1
2;A;52.1
2;A;60.6

2;B;80.1
2;B;70.6
2;B;90.6
2;B;85.1
""")

Read text data to pandas dataframe:

df = pd.read_csv(TESTDATA, sep=";")

Define the outliers using standard deviations

stds = 1.0
outliers = df[['G1', 'G2', 'Value']].groupby(['G1','G2']).transform(
           lambda group: (group - group.mean()).abs().div(group.std())) > stds

Define filtered data values and the outliers:

dfv = df[outliers.Value == False]
dfo = df[outliers.Value == True]

Print the result:

print('\n'*5, 'All values with decimal 1 are non-outliers. On the other hand, all values with 6 in the decimal are.')
print('\nDef DATA:\n%s\n\nFiltered values with %s stds:\n%s\n\nOutliers:\n%s' % (df, stds, dfv, dfo))

回答 14

我删除异常值的功能

import numpy as np

def drop_outliers(df, field_name):
    distance = 1.5 * (np.percentile(df[field_name], 75) - np.percentile(df[field_name], 25))
    df.drop(df[df[field_name] > distance + np.percentile(df[field_name], 75)].index, inplace=True)
    df.drop(df[df[field_name] < np.percentile(df[field_name], 25) - distance].index, inplace=True)

My function for dropping outliers

import numpy as np

def drop_outliers(df, field_name):
    distance = 1.5 * (np.percentile(df[field_name], 75) - np.percentile(df[field_name], 25))
    df.drop(df[df[field_name] > distance + np.percentile(df[field_name], 75)].index, inplace=True)
    df.drop(df[df[field_name] < np.percentile(df[field_name], 25) - distance].index, inplace=True)
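
A hedged usage note: the function has no return value and drops the rows in place on the dataframe you pass in, e.g. with the Vol column from the question:

drop_outliers(df, 'Vol')  # df is modified in place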

回答 15

我更喜欢用clip(截断)而不是直接删除。下面的代码会将各列的值原地截断到第2和第98百分位。

df_list = list(df)
minPercentile = 0.02
maxPercentile = 0.98

for col in df_list:
    df[col] = df[col].clip(df[col].quantile(minPercentile), df[col].quantile(maxPercentile))

I prefer to clip rather than drop. The following will clip in place at the 2nd and 98th percentiles.

df_list = list(df)
minPercentile = 0.02
maxPercentile = 0.98

for col in df_list:
    df[col] = df[col].clip(df[col].quantile(minPercentile), df[col].quantile(maxPercentile))

回答 16

我认为直接删除离群值在统计上是错误的做法。这会使数据不同于原始数据,也会使数据分布变形。因此,最好的方法是对数据做对数变换,来减少或避免离群值的影响。下面的方法对我有效:

np.log(data.iloc[:, :])

I believe that deleting and dropping outliers is statistically wrong. It makes the data different from the original data and distorts its shape, so the best way is to reduce or avoid the effect of outliers by log-transforming the data. This worked for me:

np.log(data.iloc[:, :])
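
One caveat worth adding: np.log is only defined for strictly positive values. For data containing zeros, np.log1p is a common alternative (a sketch):

import numpy as np

transformed = np.log1p(data)  # log(1 + x); safe for zeros, still undefined for x <= -1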

将x和y标签添加到熊猫图

问题:将x和y标签添加到熊猫图

假设我有以下代码使用pandas绘制了一些非常简单的图形:

import pandas as pd
values = [[1, 2], [2, 5]]
df2 = pd.DataFrame(values, columns=['Type A', 'Type B'], 
                   index=['Index 1', 'Index 2'])
df2.plot(lw=2, colormap='jet', marker='.', markersize=10, 
         title='Video streaming dropout by category')

输出量

如何在保留使用特定颜色图能力的同时,轻松设置x和y标签?我注意到,pandas DataFrame的plot()包装函数并没有接受专门用于此的参数。

Suppose I have the following code that plots something very simple using pandas:

import pandas as pd
values = [[1, 2], [2, 5]]
df2 = pd.DataFrame(values, columns=['Type A', 'Type B'], 
                   index=['Index 1', 'Index 2'])
df2.plot(lw=2, colormap='jet', marker='.', markersize=10, 
         title='Video streaming dropout by category')

Output

How do I easily set x and y-labels while preserving my ability to use specific colormaps? I noticed that the plot() wrapper for pandas DataFrames doesn’t take any parameters specific for that.


回答 0

df.plot()函数返回一个matplotlib.axes.AxesSubplot对象。您可以在该对象上设置标签。

ax = df2.plot(lw=2, colormap='jet', marker='.', markersize=10, title='Video streaming dropout by category')
ax.set_xlabel("x label")
ax.set_ylabel("y label")

在此处输入图片说明

或者,更简洁地说:ax.set(xlabel="x label", ylabel="y label")

另外,如果索引有名称,x轴标签会自动设置为该索引名称。所以df2.index.name = 'x label'也可以。

The df.plot() function returns a matplotlib.axes.AxesSubplot object. You can set the labels on that object.

ax = df2.plot(lw=2, colormap='jet', marker='.', markersize=10, title='Video streaming dropout by category')
ax.set_xlabel("x label")
ax.set_ylabel("y label")

enter image description here

Or, more succinctly: ax.set(xlabel="x label", ylabel="y label").

Alternatively, the index x-axis label is automatically set to the Index name, if it has one. so df2.index.name = 'x label' would work too.


回答 1

您可以像这样使用它:

import matplotlib.pyplot as plt 
import pandas as pd

plt.figure()
values = [[1, 2], [2, 5]]
df2 = pd.DataFrame(values, columns=['Type A', 'Type B'], 
                   index=['Index 1', 'Index 2'])
df2.plot(lw=2, colormap='jet', marker='.', markersize=10,
         title='Video streaming dropout by category')
plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.show()

显然,您必须将字符串’xlabel’和’ylabel’替换为您想要的名称。

You can do it like this:

import matplotlib.pyplot as plt 
import pandas as pd

plt.figure()
values = [[1, 2], [2, 5]]
df2 = pd.DataFrame(values, columns=['Type A', 'Type B'], 
                   index=['Index 1', 'Index 2'])
df2.plot(lw=2, colormap='jet', marker='.', markersize=10,
         title='Video streaming dropout by category')
plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.show()

Obviously you have to replace the strings ‘xlabel’ and ‘ylabel’ with what you want them to be.


回答 2

如果您为DataFrame的列和索引添加标签,熊猫将自动提供适当的标签:

import pandas as pd
values = [[1, 2], [2, 5]]
df = pd.DataFrame(values, columns=['Type A', 'Type B'], 
                  index=['Index 1', 'Index 2'])
df.columns.name = 'Type'
df.index.name = 'Index'
df.plot(lw=2, colormap='jet', marker='.', markersize=10, 
        title='Video streaming dropout by category')

在此处输入图片说明

在这种情况下,您仍然需要手动提供y标签(例如,像其他答案所示,通过plt.ylabel)。

If you label the columns and index of your DataFrame, pandas will automatically supply appropriate labels:

import pandas as pd
values = [[1, 2], [2, 5]]
df = pd.DataFrame(values, columns=['Type A', 'Type B'], 
                  index=['Index 1', 'Index 2'])
df.columns.name = 'Type'
df.index.name = 'Index'
df.plot(lw=2, colormap='jet', marker='.', markersize=10, 
        title='Video streaming dropout by category')

enter image description here

In this case, you’ll still need to supply y-labels manually (e.g., via plt.ylabel as shown in the other answers).


回答 3

可以通过axes的set函数同时设置两个标签。请看示例:

import pandas as pd
import matplotlib.pyplot as plt
values = [[1,2], [2,5]]
df2 = pd.DataFrame(values, columns=['Type A', 'Type B'], index=['Index 1','Index 2'])
ax = df2.plot(lw=2,colormap='jet',marker='.',markersize=10,title='Video streaming dropout by category')
# set labels for both axes
ax.set(xlabel='x axis', ylabel='y axis')
plt.show()

在此处输入图片说明

It is possible to set both labels together with axis.set function. Look for the example:

import pandas as pd
import matplotlib.pyplot as plt
values = [[1,2], [2,5]]
df2 = pd.DataFrame(values, columns=['Type A', 'Type B'], index=['Index 1','Index 2'])
ax = df2.plot(lw=2,colormap='jet',marker='.',markersize=10,title='Video streaming dropout by category')
# set labels for both axes
ax.set(xlabel='x axis', ylabel='y axis')
plt.show()

enter image description here


回答 4

对于使用pandas.DataFrame.hist的情况:

axes = df.hist(bins=10)

请注意,您得到的是一个坐标轴(Axes)数组,而不是单个图。因此,要设置x标签,您需要执行类似这样的操作:

axes[0][0].set_xlabel("column A")

For cases where you use pandas.DataFrame.hist:

axes = df.hist(bins=10)

Note that you get an ARRAY of axes, rather than a single plot (and that naming the result plt would shadow pyplot). Thus to set the x label you will need to do something like this:

axes[0][0].set_xlabel("column A")
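
By contrast, calling hist on a single column (a Series) returns a single Axes, so the label can be set directly (a sketch, reusing the hypothetical Column_A):

ax = df['Column_A'].hist(bins=10)
ax.set_xlabel("column A")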

回答 5

那这样写如何……

import pandas as pd
import matplotlib.pyplot as plt

values = [[1,2], [2,5]]

df2 = pd.DataFrame(values, columns=['Type A', 'Type B'], index=['Index 1','Index 2'])

(df2.plot(lw=2,
          colormap='jet',
          marker='.',
          markersize=10,
          title='Video streaming dropout by category')
    .set(xlabel='x axis',
         ylabel='y axis'))

plt.show()

what about …

import pandas as pd
import matplotlib.pyplot as plt

values = [[1,2], [2,5]]

df2 = pd.DataFrame(values, columns=['Type A', 'Type B'], index=['Index 1','Index 2'])

(df2.plot(lw=2,
          colormap='jet',
          marker='.',
          markersize=10,
          title='Video streaming dropout by category')
    .set(xlabel='x axis',
         ylabel='y axis'))

plt.show()

回答 6

pandas使用matplotlib绘制基本的数据帧图。因此,如果您使用pandas进行基本绘图,则可以借助matplotlib进行自定义。不过,我在这里提出一种使用seaborn的替代方法,它可以对图进行更多自定义,而无需深入matplotlib的底层。

工作代码:

import pandas as pd
import seaborn as sns
values = [[1, 2], [2, 5]]
df2 = pd.DataFrame(values, columns=['Type A', 'Type B'], 
                   index=['Index 1', 'Index 2'])
ax = sns.lineplot(data=df2, markers=True)
ax.set(xlabel='xlabel', ylabel='ylabel', title='Video streaming dropout by category') 

在此处输入图片说明

pandas uses matplotlib for basic dataframe plots. So, if you are using pandas for basic plot you can use matplotlib for plot customization. However, I propose an alternative method here using seaborn which allows more customization of the plot while not going into the basic level of matplotlib.

Working Code:

import pandas as pd
import seaborn as sns
values = [[1, 2], [2, 5]]
df2 = pd.DataFrame(values, columns=['Type A', 'Type B'], 
                   index=['Index 1', 'Index 2'])
ax = sns.lineplot(data=df2, markers=True)
ax.set(xlabel='xlabel', ylabel='ylabel', title='Video streaming dropout by category') 

enter image description here