标准化熊猫数据框的列

问题:标准化熊猫数据框的列

我在熊猫中有一个数据框,其中每一列都有不同的值范围。例如:

df:

A     B   C
1000  10  0.5
765   5   0.35
800   7   0.09

知道如何将每个值介于0和1之间的数据框的列标准化吗?

我想要的输出是:

A     B    C
1     1    1
0.765 0.5  0.7
0.8   0.7  0.18(which is 0.09/0.5)

I have a dataframe in pandas where each column has different value range. For example:

df:

A     B   C
1000  10  0.5
765   5   0.35
800   7   0.09

Any idea how I can normalize the columns of this dataframe where each value is between 0 and 1?

My desired output is:

A     B    C
1     1    1
0.765 0.5  0.7
0.8   0.7  0.18(which is 0.09/0.5)

回答 0

您可以使用软件包sklearn及其关联的预处理实用程序来规范化数据。

import pandas as pd
from sklearn import preprocessing

x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)

有关更多信息,请参见有关预处理数据的scikit-learn 文档:将特征缩放到一定范围。

You can use the package sklearn and its associated preprocessing utilities to normalize the data.

import pandas as pd
from sklearn import preprocessing

x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)

For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.


回答 1

使用Pandas的一种简单方法:(这里我要使用均值归一化)

normalized_df=(df-df.mean())/df.std()

使用最小-最大规格化:

normalized_df=(df-df.min())/(df.max()-df.min())

编辑:要解决一些问题,需要说熊猫自动在上面的代码中应用了以列为单位的函数。

one easy way by using Pandas: (here I want to use mean normalization)

normalized_df=(df-df.mean())/df.std()

to use min-max normalization:

normalized_df=(df-df.min())/(df.max()-df.min())

Edit: To address some concerns, need to say that Pandas automatically applies colomn-wise function in the code above.


回答 2

根据这篇文章:https : //stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range

您可以执行以下操作:

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

您无需担心您的价值观是消极还是积极。并且这些值应在0到1之间很好地分布。

Based on this post: https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range

You can do the following:

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

You don’t need to stay worrying about whether your values are negative or positive. And the values should be nicely spread out between 0 and 1.


回答 3

您的问题实际上是作用在列上的简单转换:

def f(s):
    return s/s.max()

frame.apply(f, axis=0)

或更简洁:

   frame.apply(lambda x: x/x.max(), axis=0)

Your problem is actually a simple transform acting on the columns:

def f(s):
    return s/s.max()

frame.apply(f, axis=0)

Or even more terse:

   frame.apply(lambda x: x/x.max(), axis=0)

回答 4

如果您喜欢使用sklearn包,则可以通过使用pandas来保留列名和索引名,loc如下所示:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
scaled_values = scaler.fit_transform(df) 
df.loc[:,:] = scaled_values

If you like using the sklearn package, you can keep the column and index names by using pandas loc like so:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
scaled_values = scaler.fit_transform(df) 
df.loc[:,:] = scaled_values

回答 5

简单即美:

df["A"] = df["A"] / df["A"].max()
df["B"] = df["B"] / df["B"].max()
df["C"] = df["C"] / df["C"].max()

Simple is Beautiful:

df["A"] = df["A"] / df["A"].max()
df["B"] = df["B"] / df["B"].max()
df["C"] = df["C"] / df["C"].max()

回答 6

您可以创建要标准化的列的列表

column_names_to_normalize = ['A', 'E', 'G', 'sadasdsd', 'lol']
x = df[column_names_to_normalize].values
x_scaled = min_max_scaler.fit_transform(x)
df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index = df.index)
df[column_names_to_normalize] = df_temp

现在,您的Pandas Dataframe仅在您想要的列上进行了标准化


但是,如果你想的相反,选择列的列表不要想规范化,您可以简单地创建的所有列的列表,删除非期望的人

column_names_to_not_normalize = ['B', 'J', 'K']
column_names_to_normalize = [x for x in list(df) if x not in column_names_to_not_normalize ]

You can create a list of columns that you want to normalize

column_names_to_normalize = ['A', 'E', 'G', 'sadasdsd', 'lol']
x = df[column_names_to_normalize].values
x_scaled = min_max_scaler.fit_transform(x)
df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index = df.index)
df[column_names_to_normalize] = df_temp

Your Pandas Dataframe is now normalized only at the columns you want


However, if you want the opposite, select a list of columns that you DON’T want to normalize, you can simply create a list of all columns and remove that non desired ones

column_names_to_not_normalize = ['B', 'J', 'K']
column_names_to_normalize = [x for x in list(df) if x not in column_names_to_not_normalize ]

回答 7

我认为在熊猫中做到这一点的更好方法是

df = df/df.max().astype(np.float64)

编辑如果在数据框中出现负数,则应改用

df = df/df.loc[df.abs().idxmax()].astype(np.float64)

I think that a better way to do that in pandas is just

df = df/df.max().astype(np.float64)

Edit If in your data frame negative numbers are present you should use instead

df = df/df.loc[df.abs().idxmax()].astype(np.float64)

回答 8

Sandman和Praveen给出的解决方案非常好。唯一的问题是,如果数据框的其他列中有类别变量,则此方法将需要进行一些调整。

我针对此类问题的解决方案如下:

 from sklearn import preprocesing
 x = pd.concat([df.Numerical1, df.Numerical2,df.Numerical3])
 min_max_scaler = preprocessing.MinMaxScaler()
 x_scaled = min_max_scaler.fit_transform(x)
 x_new = pd.DataFrame(x_scaled)
 df = pd.concat([df.Categoricals,x_new])

The solution given by Sandman and Praveen is very well. The only problem with that if you have categorical variables in other columns of your data frame this method will need some adjustments.

My solution to this type of issue is following:

 from sklearn import preprocesing
 x = pd.concat([df.Numerical1, df.Numerical2,df.Numerical3])
 min_max_scaler = preprocessing.MinMaxScaler()
 x_scaled = min_max_scaler.fit_transform(x)
 x_new = pd.DataFrame(x_scaled)
 df = pd.concat([df.Categoricals,x_new])

回答 9

python中不同标准化的示例。

作为参考,请参阅以下维基百科文章: https //en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation

示例数据

import pandas as pd
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
print(df)
   A    B  C
0  1  100  a
1  2  300  b
2  3  500  c

使用熊猫进行归一化(给出无偏估计)

归一化时,我们只需减去平均值并除以标准差即可。

df.iloc[:,0:-1] = df.iloc[:,0:-1].apply(lambda x: (x-x.mean())/ x.std(), axis=0)
print(df)
     A    B  C
0 -1.0 -1.0  a
1  0.0  0.0  b
2  1.0  1.0  c

使用sklearn进行归一化(Gives有偏差的估计,与熊猫不同)

如果您做同样的事情,sklearn您将获得不同的输出!

import pandas as pd

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
df.iloc[:,0:-1] = scaler.fit_transform(df.iloc[:,0:-1].to_numpy())
print(df)
          A         B  C
0 -1.224745 -1.224745  a
1  0.000000  0.000000  b
2  1.224745  1.224745  c

偏向sklearn的估计是否会使机器学习功能降低?

没有。

sklearn.preprocessing.scale的官方文档指出,使用偏倚估计量会异常地影响机器学习算法的性能,因此我们可以安全地使用它们。

From official documentation:
We use a biased estimator for the standard deviation,
equivalent to numpy.std(x, ddof=0). 
Note that the choice of ddof is unlikely to affect model performance.

MinMax Scaling怎么样?

MinMax缩放中没有标准偏差计算。因此,在熊猫和scikit-learn中,结果都是相同的。

import pandas as pd
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
             })
(df - df.min()) / (df.max() - df.min())
     A    B
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0


# Using sklearn
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
arr_scaled = scaler.fit_transform(df) 

print(arr_scaled)
[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]

df_scaled = pd.DataFrame(arr_scaled, columns=df.columns,index=df.index)
print(df_scaled)
     A    B
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0

Example of different standardizations in python.

For reference look at this wikipedia article: Unbiased Estimation of Standard Deviation

Example Data

import pandas as pd
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
print(df)
   A    B  C
0  1  100  a
1  2  300  b
2  3  500  c

Normalization using pandas (Gives unbiased estimates)

When normalizing we simply subtract the mean and divide by standard deviation.

df.iloc[:,0:-1] = df.iloc[:,0:-1].apply(lambda x: (x-x.mean())/ x.std(), axis=0)
print(df)
     A    B  C
0 -1.0 -1.0  a
1  0.0  0.0  b
2  1.0  1.0  c

Normalization using sklearn (Gives biased estimates, different from pandas)

If you do the same thing with sklearn you will get DIFFERENT output!

import pandas as pd

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
df.iloc[:,0:-1] = scaler.fit_transform(df.iloc[:,0:-1].to_numpy())
print(df)
          A         B  C
0 -1.224745 -1.224745  a
1  0.000000  0.000000  b
2  1.224745  1.224745  c

Does Biased estimates of sklearn makes Machine Learning Less Powerful?

NO.

The official documentation of sklearn.preprocessing.scale states that using biased estimator is UNLIKELY to affect the performance of machine learning algorithms and we can safely use them.

From official documentation:

We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.

What about MinMax Scaling?

There is no Standard Deviation calculation in MinMax scaling. So the result is same in both pandas and scikit-learn.

import pandas as pd
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
             })
(df - df.min()) / (df.max() - df.min())
     A    B
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0


# Using sklearn
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
arr_scaled = scaler.fit_transform(df) 

print(arr_scaled)
[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]

df_scaled = pd.DataFrame(arr_scaled, columns=df.columns,index=df.index)
print(df_scaled)
     A    B
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0

回答 10

您可能希望对某些列进行规范化,而对其他列进行不变,例如某些回归任务,其中数据标签或分类列不变,因此,我建议您使用这种pythonic方式(这是@shg和@Cina答案的组合):

features_to_normalize = ['A', 'B', 'C']
# could be ['A','B'] 

df[features_to_normalize] = df[features_to_normalize].apply(lambda x:(x-x.min()) / (x.max()-x.min()))

You might want to have some of columns being normalized and the others be unchanged like some of regression tasks which data labels or categorical columns are unchanged So I suggest you this pythonic way (It’s a combination of @shg and @Cina answers ):

features_to_normalize = ['A', 'B', 'C']
# could be ['A','B'] 

df[features_to_normalize] = df[features_to_normalize].apply(lambda x:(x-x.min()) / (x.max()-x.min()))

回答 11

这只是简单的数学。答案应如下所示。

normed_df = (df - df.min()) / (df.max() - df.min())

It is only simple mathematics. The answer should as simple as below.

normed_df = (df - df.min()) / (df.max() - df.min())

回答 12

def normalize(x):
    try:
        x = x/np.linalg.norm(x,ord=1)
        return x
    except :
        raise
data = pd.DataFrame.apply(data,normalize)

从熊猫文件中,DataFrame结构可以对其自身应用操作(函数)。

DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)

沿DataFrame的输入轴应用功能。传递给函数的对象是Series对象,其索引为DataFrame的索引(axis = 0)或列(axis = 1)。返回类型取决于传递的函数是否聚合,或者取决于DataFrame为空时的reduce参数。

您可以应用自定义函数来操作DataFrame。

def normalize(x):
    try:
        x = x/np.linalg.norm(x,ord=1)
        return x
    except :
        raise
data = pd.DataFrame.apply(data,normalize)

From the document of pandas,DataFrame structure can apply an operation (function) to itself .

DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)

Applies function along input axis of DataFrame. Objects passed to functions are Series objects having index either the DataFrame’s index (axis=0) or the columns (axis=1). Return type depends on whether passed function aggregates, or the reduce argument if the DataFrame is empty.

You can apply a custom function to operate the DataFrame .


回答 13

以下函数计算Z分数:

def standardization(dataset):
  """ Standardization of numeric fields, where all values will have mean of zero 
  and standard deviation of one. (z-score)

  Args:
    dataset: A `Pandas.Dataframe` 
  """
  dtypes = list(zip(dataset.dtypes.index, map(str, dataset.dtypes)))
  # Normalize numeric columns.
  for column, dtype in dtypes:
      if dtype == 'float32':
          dataset[column] -= dataset[column].mean()
          dataset[column] /= dataset[column].std()
  return dataset

The following function calculates the Z score:

def standardization(dataset):
  """ Standardization of numeric fields, where all values will have mean of zero 
  and standard deviation of one. (z-score)

  Args:
    dataset: A `Pandas.Dataframe` 
  """
  dtypes = list(zip(dataset.dtypes.index, map(str, dataset.dtypes)))
  # Normalize numeric columns.
  for column, dtype in dtypes:
      if dtype == 'float32':
          dataset[column] -= dataset[column].mean()
          dataset[column] /= dataset[column].std()
  return dataset

回答 14

这是使用列表推导按列进行的方式:

[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]

This is how you do it column-wise using list comprehension:

[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]

回答 15

您可以通过这种方式简单地使用pandas.DataFrame.transform 1函数:

df.transform(lambda x: x/x.max())

You can simply use the pandas.DataFrame.transform1 function in this way:

df.transform(lambda x: x/x.max())

回答 16

df_normalized = df / df.max(axis=0)
df_normalized = df / df.max(axis=0)

回答 17

您可以一行完成

DF_test = DF_test.sub(DF_test.mean(axis=0), axis=1)/DF_test.mean(axis=0)

它对每一列取均值,然后从每一行中减去(均值)(特定列的均值仅从其行中减去)并仅除以均值。最后,我们得到的是标准化数据集。

You can do this in one line

DF_test = DF_test.sub(DF_test.mean(axis=0), axis=1)/DF_test.mean(axis=0)

it takes mean for each of the column and then subtracts it(mean) from every row(mean of particular column subtracts from its row only) and divide by mean only. Finally, we what we get is the normalized data set.


回答 18

熊猫默认情况下会按列进行规范化。试试下面的代码。

X= pd.read_csv('.\\data.csv')
X = (X-X.min())/(X.max()-X.min())

输出值将在0到1的范围内。

Pandas does column wise normalization by default. Try the code below.

X= pd.read_csv('.\\data.csv')
X = (X-X.min())/(X.max()-X.min())

The output values will be in range of 0 and 1.