## 1. Preparation

On Windows, open Cmd (Start > Run > CMD); on macOS, open Terminal (Command + Space, then type Terminal). Then run the following commands to install the dependencies:

```bash
pip install pandas
pip install numpy
pip install matplotlib
pip install seaborn
pip install scikit-learn
```

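## 2. Importing the Dataset

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
# NOTE: `flights` must be loaded into a DataFrame before the next steps;
# the loading statement is not included in this listing.
```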
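
The statement that loads the data into `flights` is not shown in this section. A minimal sketch, assuming the data is stored as a CSV file with a header row; the helper name `load_flights` and the path `flights.csv` are hypothetical and not taken from the article:

```python
import pandas as pd


def load_flights(path):
    """Read the flight records from a CSV file into a DataFrame.

    Assumption: the file is a CSV with a header row. For an Excel
    file, pd.read_excel(path) would be the counterpart.
    """
    return pd.read_csv(path)


# Hypothetical usage; replace the path with the location of your own copy.
# flights = load_flights('flights.csv')
```

Whatever the file format, the later steps only require that `flights` is a pandas DataFrame containing the columns used below.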

## 3. Exploratory Data Analysis

3.1 Cleaning missing data

`flights.info()`

```python
# Drop every row that contains a missing value, then re-check the non-null counts
flights.dropna(inplace=True)
flights.info()
```
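
Before dropping rows, it can help to see how many values are missing in each column. A small sketch on made-up data (the toy frame and its values are invented for illustration and are not part of the dataset):

```python
import pandas as pd

# Toy data standing in for `flights`; all values are invented.
toy = pd.DataFrame({
    'Airline': ['A', 'B', None, 'A'],
    'Total_Stops': ['1 stop', None, 'non-stop', '2 stops'],
    'Price': [3897, 7662, 13882, 6218],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value
cleaned = toy.dropna()
print(len(toy), '->', len(cleaned))
```

If only a handful of rows are affected, dropping them as above is usually harmless; if a whole column is mostly empty, dropping rows would discard most of the data.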

3.2 Distribution of airlines

```python
# Recent seaborn versions require x to be passed as a keyword argument
sns.countplot(x='Airline', data=flights)
plt.xticks(rotation=90)
plt.show()
```
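
The counts behind a count plot can also be read as numbers with `value_counts()`. A brief sketch on made-up airline labels (the labels are placeholders, not values from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({'Airline': ['A', 'B', 'A', 'C', 'A', 'B']})

# Absolute counts, most frequent category first
counts = toy['Airline'].value_counts()
# The same counts expressed as fractions of all rows
shares = toy['Airline'].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

The same call works for any of the categorical columns plotted in this section.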

3.3 Distribution of departure cities

```python
sns.countplot(x='Source', data=flights)
plt.xticks(rotation=90)
plt.show()
```

3.4 Distribution of the number of stops

```python
sns.countplot(x='Total_Stops', data=flights)
plt.xticks(rotation=90)
plt.show()
```

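3.5 How many records carry additional information

```python
plot = plt.figure()
sns.countplot(x='Additional_Info', data=flights)  # assumed column name; this call is missing in the original listing
plt.xticks(rotation=90)
plt.show()
```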

3.6 Time-based analysis

```python
# Parse the journey date and keep only the time of day of the departure
flights['Date_of_Journey'] = pd.to_datetime(flights['Date_of_Journey'])
flights['Dep_Time'] = pd.to_datetime(flights['Dep_Time'], format='%H:%M:%S').dt.time
```

```python
# Mean price by day of the week
flights['weekday'] = flights['Date_of_Journey'].dt.day_name()
sns.barplot(x='weekday', y='Price', data=flights)
plt.show()
```

```python
# Mean price by month
flights['month'] = flights['Date_of_Journey'].dt.month_name()
sns.barplot(x='month', y='Price', data=flights)
plt.show()
```

```python
# Reduce the departure time to its hour (0-23) and plot the mean price per hour
flights['Dep_Time'] = flights['Dep_Time'].apply(lambda x: x.hour)
flights['Dep_Time'] = pd.to_numeric(flights['Dep_Time'])
sns.barplot(x='Dep_Time', y='Price', data=flights)
plt.show()
```
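
The three time features used here (weekday, month, departure hour) can be checked with a group-by, which gives the same per-group mean prices that `sns.barplot` draws by default. A sketch on invented records (dates, times, and prices are made up):

```python
import pandas as pd

# Invented records with the same column names as in the steps above
toy = pd.DataFrame({
    'Date_of_Journey': ['2019-03-24', '2019-03-25', '2019-03-31'],
    'Dep_Time': ['22:20:00', '05:50:00', '09:25:00'],
    'Price': [4000, 6000, 5000],
})

dates = pd.to_datetime(toy['Date_of_Journey'], format='%Y-%m-%d')
toy['weekday'] = dates.dt.day_name()
toy['month'] = dates.dt.month_name()
toy['hour'] = pd.to_datetime(toy['Dep_Time'], format='%H:%M:%S').dt.hour

# Mean price per weekday, the quantity shown by the bar heights
mean_by_weekday = toy.groupby('weekday')['Price'].mean()
print(toy[['weekday', 'month', 'hour']])
print(mean_by_weekday)
```

Passing an explicit `format` to `pd.to_datetime` avoids ambiguity between day-first and month-first dates; the format string must match how the dates are actually written in the data.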

3.7 Dropping unused features

```python
flights.drop(['Route', 'Arrival_Time', 'Date_of_Journey'], axis=1, inplace=True)
```

## 4. Model Training

4.1 Data preprocessing

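```python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Assumption: `var_mod` is not defined in this listing; here it is taken to be every non-numeric column.
var_mod = flights.select_dtypes(exclude='number').columns
for i in var_mod:
    flights[i] = le.fit_transform(flights[i])
```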
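
To make the encoding step concrete, here is a small sketch of what `LabelEncoder` does to one column. The airline labels are placeholders invented for the example:

```python
from sklearn.preprocessing import LabelEncoder

# Placeholder category labels, not values from the dataset
airlines = ['Airline B', 'Airline A', 'Airline B', 'Airline C']

le = LabelEncoder()
codes = le.fit_transform(airlines)

classes = list(le.classes_)                     # categories in sorted order
restored = list(le.inverse_transform(codes))    # back to the original labels

print(classes)
print(list(codes))
print(restored)
```

Because the loop in the tutorial reuses a single encoder, only the mapping of the last column remains available afterwards. If the original labels are needed later, one option is to keep a separate encoder per column.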

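```python
flights.corr()

def outlier(df):
    # Compute the 1.5 * IQR fences for every numeric column
    for i in df.describe().columns:
        Q1 = df.describe().at['25%', i]
        Q3 = df.describe().at['75%', i]
        IQR = Q3 - Q1
        LE = Q1 - 1.5 * IQR
        UE = Q3 + 1.5 * IQR
        # Assumption: this listing computes LE/UE but does not show how they
        # are applied; values are clipped to the fences here.
        df[i] = df[i].clip(LE, UE)
    return df

flights = outlier(flights)
# Separate the features from the target
x = flights.drop('Price', axis=1)
y = flights['Price']
```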
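
The fences `LE` and `UE` follow the usual 1.5 × IQR rule. As a worked illustration on made-up numbers, clipping to the fences (one common way to apply them, assumed here rather than confirmed by the article) behaves like this:

```python
import pandas as pd

# Made-up values with one obvious outlier
prices = pd.Series([1, 2, 3, 4, 100])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # lower fence (LE)
upper = q3 + 1.5 * iqr   # upper fence (UE)

# Values outside [lower, upper] are replaced by the nearest fence
clipped = prices.clip(lower, upper)
print(lower, upper)
print(clipped.tolist())
```

Clipping keeps every row and only caps extreme values. Dropping the rows outside the fences is the alternative; which one suits the data better depends on whether the extremes are errors or genuine high prices.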

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=101)
```
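
The effect of `test_size` and `random_state` can be seen on a small synthetic array. This sketch uses generated numbers rather than the flight data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features
x = np.arange(200).reshape(100, 2)
y = np.arange(100)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=101)
print(x_train.shape, x_test.shape)

# Repeating the call with the same random_state yields the same split
_, x_test_again, _, _ = train_test_split(
    x, y, test_size=0.2, random_state=101)
same_split = np.array_equal(x_test, x_test_again)
print(same_split)
```

With `test_size=0.2`, 80 of the 100 rows go to training and 20 to testing. Fixing `random_state` matters when results are compared across runs.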

4.2 Model training and testing

```python
from sklearn.ensemble import RandomForestRegressor

# Random forest with 100 trees
rfr = RandomForestRegressor(n_estimators=100)
rfr.fit(x_train, y_train)
```

```python
# Horizontal bar chart of the feature importances, least important at the bottom
features = x.columns
importances = rfr.feature_importances_
indices = np.argsort(importances)
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()
```
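
As a sanity check on how to read `feature_importances_`, the sketch below fits a forest on synthetic data in which only the first feature drives the target. The data and model settings are assumptions made for this example:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on feature 0 only, plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 10 * X[:, 0] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = model.feature_importances_

# argsort is ascending, so the most important feature comes last
order = np.argsort(importances)
print(importances.round(3), order)
```

The importances are normalized to sum to 1, so each bar can be read as a share of the total. They are impurity-based and describe what the fitted model relied on, not a causal effect on price.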

```python
predictions = rfr.predict(x_test)
plt.scatter(y_test, predictions)  # points close to the diagonal are good predictions
plt.show()
```

4.3 Model evaluation

sklearn provides convenient functions for evaluating a model in its `metrics` module:

```python
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print('r2_score:', (metrics.r2_score(y_test, predictions)))
```

```
MAE: 1453.9350628905618
MSE: 4506308.3645551
RMSE: 2122.806718605135
r2_score: 0.7532074710409375
```
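
To show what these four numbers measure, the sketch below computes them by hand with NumPy on three made-up values (the arrays are illustrative, not taken from the model):

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

err = y_true - y_pred
mae = np.mean(np.abs(err))     # mean absolute error
mse = np.mean(err ** 2)        # mean squared error
rmse = np.sqrt(mse)            # root mean squared error, in the target's units
# R^2: 1 minus residual sum of squares over total sum of squares
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)
```

MAE and RMSE are in the same unit as the price, which makes them the easiest to interpret. RMSE penalizes large errors more heavily, which is why it is larger than MAE in the results reported above.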

```python
# Distribution of the residuals; histplot replaces the deprecated sns.distplot
sns.histplot(y_test - predictions, bins=50, kde=True)
plt.show()
```

Python实用宝典 (pythondict.com)