Now I know that certain rows are outliers based on a certain column value.
For instance, column ‘Vol’ has all values around 12xx and one value is 4000 (an outlier).
I would like to exclude the rows whose ‘Vol’ value is like this.
So, essentially, I need to put a filter on the data frame that selects all rows where the values of a certain column are within, say, 3 standard deviations of the mean.
If you have multiple columns in your dataframe and would like to remove all rows that have an outlier in at least one column, a single expression can do that in one shot.
For each column, it first computes the Z-score of each value in the column, relative to the column mean and standard deviation.
Then it takes the absolute value of the Z-score, because the direction does not matter, only whether it is below the threshold.
all(axis=1) ensures that, for each row, all columns satisfy the constraint.
Finally, the result of this condition is used to index the dataframe.
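The expression itself does not appear in the text above. A minimal sketch that matches the description, assuming scipy.stats.zscore for the Z-scores and a made-up dataframe with one planted outlier (the column names and the seed are only for illustration), might look like this:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: three numeric columns, with one obvious outlier planted in 'Vol'.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["A", "B", "Vol"])
df.loc[0, "Vol"] = 40.0

# Absolute Z-score of every value, computed column by column (axis=0 is the default).
z_scores = np.abs(stats.zscore(df.to_numpy()))

# True for rows where every column is below the threshold of 3.
within_threshold = (z_scores < 3).all(axis=1)

# Index the dataframe with the boolean mask.
filtered = df[within_threshold]
print(len(df), len(filtered))
```

Note that stats.zscore uses the population standard deviation (ddof=0) by default, while pandas' std() uses ddof=1, so the cut-off can differ slightly from a filter built on df.std().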
Answer 1
Use boolean indexing as you would with a numpy.array:
import numpy as np
import pandas as pd
# Example dataset of normally distributed data.
df = pd.DataFrame({'Data': np.random.normal(size=200)})
# Keep only the rows that are within +3 to -3 standard deviations in the column 'Data'.
df[np.abs(df.Data - df.Data.mean()) <= (3 * df.Data.std())]
# Or, if you prefer the other way around:
df[~(np.abs(df.Data - df.Data.mean()) > (3 * df.Data.std()))]
For a series it is similar:
S = pd.Series(np.random.normal(size=200))
S[~((S - S.mean()).abs() > 3 * S.std())]
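For each series in the dataframe, you could use between and quantile to remove outliers. Note that with the 0.25 and 0.75 quantiles shown here, only the middle half of the values is kept.
x = pd.Series(np.random.normal(size=200)) # with outliers
x = x[x.between(x.quantile(.25), x.quantile(.75))] # keeps only values between the 1st and 3rd quartiles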
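A common, less aggressive variant (a different rule, not the one in the answer) keeps every value within 1.5 times the interquartile range of the quartiles. The sketch below assumes a made-up 'Vol'-like series similar to the one in the question; the multiplier k = 1.5 is a conventional choice, not something stated in the thread.

```python
import numpy as np
import pandas as pd

# Hypothetical series: twenty values around 12xx plus one value of 4000.
x = pd.Series(np.r_[1200.0 + np.arange(20), 4000.0])

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1   # interquartile range
k = 1.5         # assumed multiplier; widen it to drop fewer values

# Keep values inside [q1 - k*iqr, q3 + k*iqr].
kept = x[x.between(q1 - k * iqr, q3 + k * iqr)]
print(len(x), len(kept))
```

Here only the 4000 falls outside the fences, so the twenty values around 12xx all survive.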
Answer 6
Since I haven’t seen an answer that deals with both numerical and non-numerical attributes, here is a complementary answer.
You might want to drop the outliers only on numerical attributes (categorical variables can hardly be outliers).
Function definition
I have extended @tanemaki’s suggestion to handle data when non-numeric attributes are also present:
import numpy as np
from scipy import stats

def drop_numerical_outliers(df, z_thresh=3):
    # `constrains` holds True for each row whose numeric values all have an absolute z-score below the threshold.
    constrains = df.select_dtypes(include=[np.number]) \
        .apply(lambda x: np.abs(stats.zscore(x)) < z_thresh) \
        .all(axis=1)
    # Drop (in place) the rows set to be rejected
    df.drop(df.index[~constrains], inplace=True)
Usage
drop_numerical_outliers(df)
Example
Imagine a dataset df with some values about houses: alley, land contour, sale price, … E.g.: Data Documentation
First, you want to visualise the data on a scatter graph (with a z-score threshold of 3):
# Plot data before dropping those greater than z-score 3.
# The scatterAreaVsPrice function's definition has been removed for readability's sake.
scatterAreaVsPrice(df)
# Drop the outliers on every attribute
drop_numerical_outliers(df)
# Plot the result. All outliers were dropped. Note that the red points are not
# the same outliers from the first plot, but the newly computed outliers based on the new data frame.
scatterAreaVsPrice(df)
scipy.stats has the functions trim1() and trimboth() to cut the outliers out in a single line, according to the ranking of the values and a given proportion of values to remove.
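As a rough illustration of those two functions, here is a small sketch on made-up values (the array and the proportion 0.1 are assumptions for the example). Both functions cut by rank, so they remove a fixed share of the values whether or not those values are really outliers, and the returned array is not guaranteed to keep the original order.

```python
import numpy as np
from scipy import stats

# Hypothetical data: nineteen values around 12xx plus one value of 4000.
values = np.array([1200.0 + i for i in range(19)] + [4000.0])

# Slice off 10% of the values from each end (2 lowest and 2 highest here).
trimmed_both = stats.trimboth(values, 0.1)

# Slice off 10% of the values from the upper end only (2 highest here).
trimmed_right = stats.trim1(values, 0.1, tail='right')

print(len(values), len(trimmed_both), len(trimmed_right))
```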
Answer 8
Another option is to transform your data so that the effect of outliers is mitigated. You can do this by winsorizing your data.
import pandas as pd
from scipy.stats import mstats
%matplotlib inline
test_data = pd.Series(range(30))
test_data.plot()
# Clip values below the 5th and above the 95th percentile (they are replaced, not removed)
transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05]))
transformed_test_data.plot()
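To see what winsorizing does without a plot, the following sketch prints the values before and after. It assumes the same 30-value range and 5% limits as above; the variable names are only for illustration.

```python
import numpy as np
from scipy.stats import mstats

data = np.arange(30)

# With 5% limits on 30 values, the single lowest and single highest value
# are replaced by their nearest remaining neighbours.
clipped = np.asarray(mstats.winsorize(data, limits=[0.05, 0.05]))

print(data.min(), data.max())        # extremes before
print(clipped.min(), clipped.max())  # extremes after
print(len(data) == len(clipped))     # no rows are dropped
```

Unlike the filtering approaches, the number of observations stays the same, which matters if the rows must stay aligned with other columns.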
print('\n' * 5, 'All values with decimal 1 are non-outliers. On the other hand, all values with 6 in the decimal are.')
print('\nDef DATA:\n%s\n\nFiltered Values with %s stds:\n%s\n\nOutliers:\n%s' % (df, stds, dfv, dfo))
I believe deleting or dropping outliers is statistically wrong.
It makes the data differ from the original data.
It also leaves the data unequally shaped, so the best way is to reduce or avoid the effect of outliers by log-transforming the data.
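A minimal sketch of the log-transform idea, on a made-up 'Vol'-like series (the values are assumptions for illustration). It uses np.log1p, which assumes the data are non-negative; this is one possible choice, not necessarily the transform the author used.

```python
import numpy as np
import pandas as pd

# Hypothetical column: values around 12xx and one value of 4000.
vol = pd.Series([1210.0, 1225.0, 1198.0, 1240.0, 4000.0])

# log1p computes log(1 + x); it compresses large values much more than small ones.
log_vol = np.log1p(vol)

# Compare how far the largest value sits from the median, before and after.
print(vol.max() / vol.median())
print(log_vol.max() / log_vol.median())

# The transform is reversible, so no information is lost.
restored = np.expm1(log_vol)
```

All rows are kept, and the 4000 is pulled much closer to the rest of the values on the log scale.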
This worked for me: