Speed Prediction¶
In this project, we use weather data to predict average traffic speed.

- We use weather data from openweathermap.org, which provides temperature, wind, humidity, and weather conditions.
- This notebook runs on Google Colab.

Some of the data preprocessing techniques are based on the online notebooks from tensorflow.org:
https://www.tensorflow.org/tutorials/structured_data/time_series
Other references:
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
Install and import dependencies¶
!pip install pactools
Requirement already satisfied: pactools in /usr/local/lib/python3.6/dist-packages (0.3.1)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from pactools) (1.4.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from pactools) (0.22.2.post1)
Requirement already satisfied: h5py in /usr/local/lib/python3.6/dist-packages (from pactools) (2.10.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from pactools) (1.18.5)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (from pactools) (3.2.2)
Requirement already satisfied: mne in /usr/local/lib/python3.6/dist-packages (from pactools) (0.21.2)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn->pactools) (0.17.0)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from h5py->pactools) (1.15.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->pactools) (1.3.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->pactools) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->pactools) (2.4.7)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->pactools) (2.8.1)
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer,OneHotEncoder,StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from pactools.grid_search import GridSearchCVProgressBar
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import GridSearchCV
from sklearn import svm
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
return f(*args, **kwds)
/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
return f(*args, **kwds)
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
Num GPUs Available: 1
Preprocess¶
Assign holiday, weekend, and time-of-day features¶
def time_desc(x: datetime):
    """Map the hour of a timestamp to a coarse time-of-day label."""
    Early_Morning = [4, 5, 6, 7]
    Morning = [8, 9, 10, 11]
    Afternoon = [12, 13, 14, 15]
    Evening = [16, 17, 18, 19]
    Night = [20, 21, 22, 23]
    Late_Night = [0, 1, 2, 3]
    if x.hour in Early_Morning:
        return 'early'
    elif x.hour in Morning:
        return 'morning'
    elif x.hour in Afternoon:
        return 'noon'
    elif x.hour in Evening:
        return 'afternoon'
    elif x.hour in Night:
        return 'night'
    else:
        return 'latenight'
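A quick check of the mapping (the example timestamps are arbitrary):

print(time_desc(datetime(2017, 1, 1, 5)))   # 'early'
print(time_desc(datetime(2017, 1, 1, 13)))  # 'noon'
print(time_desc(datetime(2017, 1, 1, 22)))  # 'night'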
def add_holiday_and_weekend(df: pd.DataFrame) -> pd.DataFrame:
    """
    Add holiday and weekend indicator columns to the dataset
    """
    new_df = df.copy()
    new_df['IsWeekend'] = new_df['date'].apply(lambda x: 0 if x.weekday() in [0, 1, 2, 3, 4] else 1)
    new_df['IsHoliday'] = new_df['date'].apply(lambda x: 1 if (x.date().strftime('%Y-%m-%d') in [
        '2017-01-02', '2017-01-28', '2017-01-30', '2017-01-31', '2017-04-04',
        '2017-04-14', '2017-04-15', '2017-04-17', '2017-05-01', '2017-05-03',
        '2017-05-30', '2017-07-01', '2017-10-02', '2017-10-05', '2017-10-28',
        '2017-12-25', '2017-12-26',
        '2018-01-01', '2018-02-16', '2018-02-17', '2018-02-19', '2018-03-30',
        '2018-03-31', '2018-04-02', '2018-04-05', '2018-05-01', '2018-05-22',
        '2018-06-18', '2018-07-02', '2018-09-25', '2018-10-01', '2018-10-17',
        '2018-12-25', '2018-12-26'
    ]) or (x.weekday() in [6]) else 0)
    return new_df
I am using Google Drive to store all the data, including the weather data, so please change the paths below to your own file paths.
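If the data is also kept on Google Drive, the drive can be mounted in Colab first; a minimal sketch (the mount point /content/drive is Colab's default):

# Mount Google Drive so the CSV paths below are accessible
from google.colab import drive
drive.mount('/content/drive')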
# change following two lines to your paths
train = pd.read_csv('/content/drive/MyDrive/courses/HKUST/MSBD5001/project/data/individual project/train.csv')
test = pd.read_csv('/content/drive/MyDrive/courses/HKUST/MSBD5001/project/data/individual project/test.csv')
train = train.drop('id', axis=1)
train['date'] = pd.to_datetime(train['date'])
train = add_holiday_and_weekend(train)
train['time desc']=train['date'].apply(time_desc)
train
| | date | speed | IsWeekend | IsHoliday | time desc |
|---|---|---|---|---|---|
| 0 | 2017-01-01 00:00:00 | 43.002930 | 1 | 1 | latenight |
| 1 | 2017-01-01 01:00:00 | 46.118696 | 1 | 1 | latenight |
| 2 | 2017-01-01 02:00:00 | 44.294158 | 1 | 1 | latenight |
| 3 | 2017-01-01 03:00:00 | 41.067468 | 1 | 1 | latenight |
| 4 | 2017-01-01 04:00:00 | 46.448653 | 1 | 1 | early |
| ... | ... | ... | ... | ... | ... |
| 14001 | 2018-12-31 12:00:00 | 19.865269 | 0 | 0 | noon |
| 14002 | 2018-12-31 15:00:00 | 17.820375 | 0 | 0 | noon |
| 14003 | 2018-12-31 16:00:00 | 12.501851 | 0 | 0 | afternoon |
| 14004 | 2018-12-31 18:00:00 | 15.979319 | 0 | 0 | afternoon |
| 14005 | 2018-12-31 20:00:00 | 40.594183 | 0 | 0 | night |
14006 rows × 5 columns
train.plot(x='date', y='speed', figsize=(20, 10))
<matplotlib.axes._subplots.AxesSubplot at 0x7f5321240400>
Merge weather¶
Pre-process weather data¶
from datetime import datetime
def k_to_c(x):
    # convert temperature from Kelvin to degrees Celsius
    return x - 273.15
We have two different weather sources:

- Weather from OpenWeather (hourly weather reports)
- Weather from the Hong Kong Observatory (daily weather reports)

Based on my experiments, I decided to use the Hong Kong Observatory's weather report, which gives higher accuracy.
# Change this path to yours
# weather = pd.read_csv("/content/drive/MyDrive/courses/HKUST/MSBD5001/project/data/individual project/hongkong_weather_1970-2020.csv")
# weather['dt_iso'] = weather['dt_iso'].apply(lambda x: x.replace('UTC', ''))
# weather['date'] = pd.to_datetime(weather['dt_iso']).dt.tz_convert("Asia/Hong_Kong").dt.tz_localize(None)
# # Transform unit
# weather['temp'] = weather['temp'].apply(k_to_c)
# weather['feels_like'] = weather['feels_like'].apply(k_to_c)
# weather['temp_min'] = weather['temp_min'].apply(k_to_c)
# weather['temp_max'] = weather['temp_max'].apply(k_to_c)
# weather = weather.drop(["dt_iso", "dt", "weather_icon", "rain_1h", "rain_3h", "snow_1h", "snow_3h", "sea_level", "grnd_level", "timezone", "lat", "lon"], axis=1)
# mask = (weather['date'] >= datetime(2017,1,1)) & (weather['date'] <= datetime(2019, 1, 1))
# weather = weather.loc[mask]
# weather
weather=pd.read_csv("/content/drive/MyDrive/courses/HKUST/MSBD5001/project/data/individual project/hong-kong-weather-histories.csv")
weather['date'] = pd.to_datetime(weather.year*10000+weather.month*100+weather.day,format='%Y%m%d')
weather.drop(['Unnamed: 0', 'year', 'month', 'day'],axis=1,inplace=True)
weather['TRainfall(mm)']=weather['TRainfall(mm)'].apply(lambda x:0 if x in ['-','Trace'] else x)
weather.head()
| | MPressure(hPa) | MaxTemp(℃) | MinTemp(℃) | MCloud(%) | TRainfall(mm) | #hRedVisi(h) | WindDirect(degrees) | MWindSpeed(km/h) | date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021.7 | 20.8 | 18.4 | 72 | 0 | 0 | 60 | 34.2 | 2017-01-01 |
| 1 | 1020.2 | 23.3 | 18.4 | 28 | 0 | 0 | 70 | 17.7 | 2017-01-02 |
| 2 | 1019.8 | 21.3 | 18.9 | 56 | 0 | 5 | 70 | 26.1 | 2017-01-03 |
| 3 | 1018.7 | 21.7 | 18.7 | 51 | 0 | 0 | 70 | 27.7 | 2017-01-04 |
| 4 | 1016.9 | 23.4 | 18.9 | 61 | 0 | 0 | 40 | 14.3 | 2017-01-05 |
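Note that replacing '-' and 'Trace' with 0 still leaves TRainfall(mm) as a string column (scikit-learn appears to coerce it to float later). If you prefer an explicit conversion, a small sketch that is not part of the original notebook:

# Make rainfall explicitly numeric; anything non-numeric becomes NaN and is filled with 0
weather['TRainfall(mm)'] = pd.to_numeric(weather['TRainfall(mm)'], errors='coerce').fillna(0)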
Merge¶
from pandas import DatetimeIndex
def merge_weather(df: pd.DataFrame, weather: pd.DataFrame, how='left') -> pd.DataFrame:
    '''
    Merge weather with data on the calendar date.
    '''
    new_df = df.copy()
    new_weather = weather.copy()
    # new_df['tmp_date'] = new_df['date']
    # new_weather['tmp_date'] = new_weather['date']
    new_df['tmp_date'] = new_df['date'].apply(lambda date: date.date())
    new_weather['tmp_date'] = new_weather['date'].apply(lambda date: date.date())
    new_training_data = new_df.merge(new_weather, on='tmp_date', how=how)
    new_training_data = new_training_data.drop(['tmp_date', 'date_y'], axis=1)
    new_training_data = new_training_data.rename(columns={'date_x': 'date'})
    new_training_data['hour'] = DatetimeIndex(new_training_data['date']).hour
    new_training_data['day'] = DatetimeIndex(new_training_data['date']).day
    new_training_data['month'] = DatetimeIndex(new_training_data['date']).month
    new_training_data['year'] = DatetimeIndex(new_training_data['date']).year
    new_training_data['weekday'] = DatetimeIndex(new_training_data['date']).weekday
    return new_training_data
new_training_data = merge_weather(train, weather)
new_training_data.sample(20)
| | date | speed | IsWeekend | IsHoliday | time desc | MPressure(hPa) | MaxTemp(℃) | MinTemp(℃) | MCloud(%) | TRainfall(mm) | #hRedVisi(h) | WindDirect(degrees) | MWindSpeed(km/h) | hour | day | month | year | weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10903 | 2018-05-27 00:00:00 | 45.356636 | 1 | 1 | latenight | 1008.9 | 33.4 | 26.9 | 63 | 3.4 | 0 | 230 | 23.7 | 0 | 27 | 5 | 2018 | 6 |
| 1201 | 2017-02-20 10:00:00 | 32.278186 | 0 | 0 | morning | 1013.9 | 25.5 | 18.3 | 72 | 0 | 3 | 30 | 12.8 | 10 | 20 | 2 | 2017 | 0 |
| 5038 | 2017-07-30 08:00:00 | 45.045055 | 1 | 1 | morning | 996.0 | 34.8 | 29.6 | 69 | 0 | 4 | 300 | 24.3 | 8 | 30 | 7 | 2017 | 6 |
| 1107 | 2017-02-16 12:00:00 | 21.056502 | 0 | 0 | noon | 1021.6 | 24.0 | 15.4 | 11 | 0 | 0 | 40 | 16.3 | 12 | 16 | 2 | 2017 | 3 |
| 5044 | 2017-07-30 14:00:00 | 37.517872 | 1 | 1 | noon | 996.0 | 34.8 | 29.6 | 69 | 0 | 4 | 300 | 24.3 | 14 | 30 | 7 | 2017 | 6 |
| 10121 | 2018-03-04 20:00:00 | 39.774170 | 1 | 1 | night | 1011.0 | 27.3 | 21.9 | 86 | 0 | 0 | 140 | 8.0 | 20 | 4 | 3 | 2018 | 6 |
| 8027 | 2017-01-12 21:00:00 | 43.835431 | 0 | 0 | night | 1015.5 | 20.3 | 16.9 | 93 | 0 | 15 | 70 | 28.0 | 21 | 12 | 1 | 2017 | 3 |
| 1336 | 2017-02-26 01:00:00 | 48.388743 | 1 | 1 | latenight | 1021.2 | 17.0 | 10.6 | 88 | 1.4 | 0 | 360 | 21.7 | 1 | 26 | 2 | 2017 | 6 |
| 436 | 2017-01-19 13:00:00 | 19.198023 | 0 | 0 | noon | 1020.1 | 24.1 | 18.7 | 76 | 0 | 8 | 60 | 12.8 | 13 | 19 | 1 | 2017 | 3 |
| 6499 | 2017-09-29 05:00:00 | 48.814754 | 0 | 0 | early | 1012.2 | 33.1 | 28.8 | 63 | 0 | 0 | 90 | 21.1 | 5 | 29 | 9 | 2017 | 4 |
| 317 | 2017-01-14 05:00:00 | 45.863401 | 1 | 0 | early | 1017.9 | 16.5 | 14.5 | 92 | 1.0 | 0 | 50 | 29.8 | 5 | 14 | 1 | 2017 | 5 |
| 788 | 2017-03-02 05:00:00 | 46.502654 | 0 | 0 | early | 1019.2 | 23.9 | 17.2 | 12 | 0 | 0 | 10 | 25.5 | 5 | 2 | 3 | 2017 | 3 |
| 4910 | 2017-07-25 00:00:00 | 48.773277 | 0 | 0 | latenight | 1005.1 | 33.1 | 27.7 | 55 | 0 | 0 | 100 | 15.0 | 0 | 25 | 7 | 2017 | 1 |
| 6698 | 2017-07-10 12:00:00 | 31.192037 | 0 | 0 | noon | 1008.5 | 32.1 | 27.5 | 83 | 0.6 | 0 | 190 | 14.6 | 12 | 10 | 7 | 2017 | 0 |
| 7041 | 2017-10-21 19:00:00 | 29.392348 | 1 | 0 | afternoon | 1012.1 | 27.2 | 21.6 | 33 | 0 | 0 | 360 | 33.3 | 19 | 21 | 10 | 2017 | 5 |
| 6451 | 2017-09-27 05:00:00 | 48.713259 | 0 | 0 | early | 1009.6 | 33.0 | 27.7 | 34 | 0 | 1 | 200 | 6.2 | 5 | 27 | 9 | 2017 | 2 |
| 779 | 2017-02-02 20:00:00 | 36.991496 | 0 | 0 | night | 1022.7 | 17.7 | 16.2 | 86 | 0 | 0 | 80 | 43.0 | 20 | 2 | 2 | 2017 | 3 |
| 8127 | 2017-06-12 01:00:00 | 50.491144 | 0 | 0 | latenight | 1001.9 | 30.0 | 25.3 | 80 | 37.7 | 0 | 80 | 53.5 | 1 | 12 | 6 | 2017 | 0 |
| 9211 | 2018-01-02 20:00:00 | 39.546576 | 0 | 0 | night | 1019.3 | 19.2 | 16.0 | 77 | 0 | 10 | 60 | 31.8 | 20 | 2 | 1 | 2018 | 1 |
| 5850 | 2017-02-09 04:00:00 | 47.510284 | 0 | 0 | early | 1020.2 | 16.8 | 11.1 | 72 | 0 | 0 | 10 | 39.9 | 4 | 9 | 2 | 2017 | 3 |
Plot data¶
plt.figure(figsize=(6,4))
sns.boxplot('speed',data=new_training_data,orient='h',palette="Set3",linewidth=2.5)
plt.show()
/usr/local/lib/python3.6/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
FutureWarning
Traffic speed¶
data_plot = new_training_data
data_plot['month'] = data_plot['date'].dt.month
data_plot
| | date | speed | IsWeekend | IsHoliday | time desc | MPressure(hPa) | MaxTemp(℃) | MinTemp(℃) | MCloud(%) | TRainfall(mm) | #hRedVisi(h) | WindDirect(degrees) | MWindSpeed(km/h) | hour | day | month | year | weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-01-01 00:00:00 | 43.002930 | 1 | 1 | latenight | 1021.7 | 20.8 | 18.4 | 72 | 0 | 0 | 60 | 34.2 | 0 | 1 | 1 | 2017 | 6 |
| 1 | 2017-01-01 01:00:00 | 46.118696 | 1 | 1 | latenight | 1021.7 | 20.8 | 18.4 | 72 | 0 | 0 | 60 | 34.2 | 1 | 1 | 1 | 2017 | 6 |
| 2 | 2017-01-01 02:00:00 | 44.294158 | 1 | 1 | latenight | 1021.7 | 20.8 | 18.4 | 72 | 0 | 0 | 60 | 34.2 | 2 | 1 | 1 | 2017 | 6 |
| 3 | 2017-01-01 03:00:00 | 41.067468 | 1 | 1 | latenight | 1021.7 | 20.8 | 18.4 | 72 | 0 | 0 | 60 | 34.2 | 3 | 1 | 1 | 2017 | 6 |
| 4 | 2017-01-01 04:00:00 | 46.448653 | 1 | 1 | early | 1021.7 | 20.8 | 18.4 | 72 | 0 | 0 | 60 | 34.2 | 4 | 1 | 1 | 2017 | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14001 | 2018-12-31 12:00:00 | 19.865269 | 0 | 0 | noon | 1027.0 | 15.6 | 11.8 | 77 | 0 | 0 | 360 | 26.8 | 12 | 31 | 12 | 2018 | 0 |
| 14002 | 2018-12-31 15:00:00 | 17.820375 | 0 | 0 | noon | 1027.0 | 15.6 | 11.8 | 77 | 0 | 0 | 360 | 26.8 | 15 | 31 | 12 | 2018 | 0 |
| 14003 | 2018-12-31 16:00:00 | 12.501851 | 0 | 0 | afternoon | 1027.0 | 15.6 | 11.8 | 77 | 0 | 0 | 360 | 26.8 | 16 | 31 | 12 | 2018 | 0 |
| 14004 | 2018-12-31 18:00:00 | 15.979319 | 0 | 0 | afternoon | 1027.0 | 15.6 | 11.8 | 77 | 0 | 0 | 360 | 26.8 | 18 | 31 | 12 | 2018 | 0 |
| 14005 | 2018-12-31 20:00:00 | 40.594183 | 0 | 0 | night | 1027.0 | 15.6 | 11.8 | 77 | 0 | 0 | 360 | 26.8 | 20 | 31 | 12 | 2018 | 0 |
14006 rows × 18 columns
tmp_data=new_training_data.groupby('month').aggregate({'speed':'mean'})
plt.figure(figsize=(8,6))
sns.lineplot(x=tmp_data.index,y=tmp_data.speed,data=tmp_data,palette="Set2")
plt.show()
plt.figure(figsize=(8,6))
sns.countplot(y='time desc',data=new_training_data,palette=["#7fcdbb","#edf8b1","#fc9272","#fee0d2","#bcbddc","#efedf5"])
plt.show()
new_training_data.hist(bins=50,figsize=(20,15))
plt.show()
Train¶
new_training_data.columns
Index(['date', 'speed', 'IsWeekend', 'IsHoliday', 'time desc',
'MPressure(hPa)', 'MaxTemp(℃)', 'MinTemp(℃)', 'MCloud(%)',
'TRainfall(mm)', '#hRedVisi(h)', 'WindDirect(degrees)',
'MWindSpeed(km/h)', 'hour', 'day', 'month', 'year', 'weekday'],
dtype='object')
# using open weather dataset
# cat_vars=['IsWeekend','IsHoliday','time desc']
# num_vars=['temp', 'pressure', 'wind_speed', 'humidity', "clouds_all", 'month', 'year', 'day', 'hour']
# using the Hong Kong Observatory dataset
cat_vars=['IsWeekend','IsHoliday','time desc']
num_vars=['year','month','day','hour','MaxTemp(℃)','MinTemp(℃)','MCloud(%)',
'TRainfall(mm)','#hRedVisi(h)','WindDirect(degrees)',
'MWindSpeed(km/h)']
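One thing to keep in mind: WindDirect(degrees) is a circular quantity, so min-max scaling treats 350° and 10° as far apart even though the directions are almost identical. A possible alternative (not used in this notebook) is a sine/cosine encoding:

import numpy as np

# Optional: represent wind direction as two smooth features instead of a raw angle
new_training_data['wind_sin'] = np.sin(np.deg2rad(new_training_data['WindDirect(degrees)']))
new_training_data['wind_cos'] = np.cos(np.deg2rad(new_training_data['WindDirect(degrees)']))
# 'wind_sin' and 'wind_cos' could then replace 'WindDirect(degrees)' in num_vars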
Transform Data¶
numeric_transformer=Pipeline(steps=[
('scaler', MinMaxScaler())])
categorical_transformer=Pipeline(steps=[
('oneHot',OneHotEncoder())])
preprocessor=ColumnTransformer(transformers=[
('num',numeric_transformer,num_vars),
('cat',categorical_transformer,cat_vars)])
scaled_data=preprocessor.fit_transform(new_training_data)
print(scaled_data.shape)
(14006, 21)
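The 21 columns are the 11 min-max-scaled numeric features plus the one-hot columns for the three categorical variables (2 + 2 + 6). To inspect the generated one-hot names, something like this should work with this scikit-learn version (a sketch, not part of the original notebook):

onehot = preprocessor.named_transformers_['cat'].named_steps['oneHot']
print(onehot.get_feature_names(cat_vars))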
y = new_training_data['speed']
# y = y.to_numpy()
# y = y.reshape(-1, 1)
# print(y.shape)
# print(y)
We could also scale the target speed (a sketch of this option is shown below), but here it is kept in its original units.
import numpy as np
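If the target were scaled, predictions would have to be mapped back with inverse_transform. A minimal sketch of that option (not applied in this notebook):

# Optional: scale the target, then invert the scaling after prediction
y_scaler = MinMaxScaler()
y_scaled = y_scaler.fit_transform(y.to_numpy().reshape(-1, 1))
# ... train on y_scaled, then:
# y_pred_original = y_scaler.inverse_transform(y_pred_scaled)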
Split data¶
X_train,X_test,y_train,y_test=train_test_split(scaled_data,y,test_size=0.15,random_state=42)
print(X_train)
print(y_train)
print(f"Train x shape: {X_train.shape}")
print(f"Train y shape: {y_train.shape}")
[[1. 0.09090909 0.53333333 ... 0. 0. 0. ]
[0. 0.54545455 0.66666667 ... 1. 0. 0. ]
[0. 0.63636364 0.43333333 ... 0. 0. 0. ]
...
[0. 0.63636364 0.43333333 ... 0. 0. 0. ]
[0. 0.45454545 0.03333333 ... 0. 0. 0. ]
[0. 0.81818182 1. ... 1. 0. 0. ]]
9443 46.813428
4822 15.018334
5391 48.414942
11099 19.663880
7854 10.389012
...
5191 24.637279
13418 35.306286
5390 46.950711
860 46.623319
7270 16.718626
Name: speed, Length: 11905, dtype: float64
Train x shape: (11905, 21)
Train y shape: (11905,)
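Since TimeSeriesSplit is used for cross-validation below, a chronological hold-out (the most recent 15% of rows, which are already sorted by date) would arguably match that evaluation setting better than a random split. A possible alternative, with illustrative variable names:

# Alternative: hold out the most recent 15% of the timeline instead of a random sample
split_idx = int(len(scaled_data) * 0.85)
X_train_t, X_test_t = scaled_data[:split_idx], scaled_data[split_idx:]
y_train_t, y_test_t = y.iloc[:split_idx], y.iloc[split_idx:]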
Training by using XGBRegressor¶
tscv = TimeSeriesSplit(n_splits=3)
model = XGBRegressor()
param_grid = {'nthread':[4,6,8],
'objective':['reg:squarederror'],
'learning_rate':[.03, 0.05, .07],
'max_depth':[5, 6, 7, 8],
'min_child_weight':[4],
'subsample':[0.7],
'colsample_bytree':[0.7],
'n_estimators':[500]}
xgb=GridSearchCV(estimator=model,param_grid=param_grid,cv=tscv,n_jobs=2, verbose=10)
xgb.fit(X_train,y_train)
Fitting 3 folds for each of 36 candidates, totalling 108 fits
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 1 tasks | elapsed: 37.4s
[Parallel(n_jobs=2)]: Done 4 tasks | elapsed: 51.0s
[Parallel(n_jobs=2)]: Done 9 tasks | elapsed: 1.0min
[Parallel(n_jobs=2)]: Done 14 tasks | elapsed: 2.6min
[Parallel(n_jobs=2)]: Done 21 tasks | elapsed: 5.8min
[Parallel(n_jobs=2)]: Done 28 tasks | elapsed: 8.1min
[Parallel(n_jobs=2)]: Done 37 tasks | elapsed: 10.0min
[Parallel(n_jobs=2)]: Done 46 tasks | elapsed: 11.5min
[Parallel(n_jobs=2)]: Done 57 tasks | elapsed: 16.0min
[Parallel(n_jobs=2)]: Done 68 tasks | elapsed: 18.1min
[Parallel(n_jobs=2)]: Done 81 tasks | elapsed: 20.3min
[Parallel(n_jobs=2)]: Done 94 tasks | elapsed: 25.2min
[Parallel(n_jobs=2)]: Done 108 out of 108 | elapsed: 29.2min finished
/usr/local/lib/python3.6/dist-packages/xgboost/sklearn.py:242: DeprecationWarning: The nthread parameter is deprecated as of version .6.Please use n_jobs instead.nthread is deprecated.
'nthread is deprecated.', DeprecationWarning)
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=3),
error_score=nan,
estimator=XGBRegressor(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=1, gamma=0,
importance_type='gain', learning_rate=0.1,
max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None,
n_estimators=100, n_jobs=1, nthread=None,
objectiv...
scale_pos_weight=1, seed=None, silent=None,
subsample=1, verbosity=1),
iid='deprecated', n_jobs=2,
param_grid={'colsample_bytree': [0.7],
'learning_rate': [0.03, 0.05, 0.07],
'max_depth': [5, 6, 7, 8], 'min_child_weight': [4],
'n_estimators': [500], 'nthread': [4, 6, 8],
'objective': ['reg:squarederror'],
'subsample': [0.7]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=10)
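Once the search finishes, the selected configuration can be inspected; a short sketch (with the default scorer, best_score_ is the mean cross-validated R² of the best candidate):

print(xgb.best_params_)
print(xgb.best_score_)  # mean cross-validated R^2 of the best parameter combination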
Training by using DNN¶
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.metrics import MeanSquaredError
from tensorflow.keras import regularizers
print(X_train.shape)
(11905, 58)
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
model = keras.Sequential(
[
layers.Dense(128, activation="relu",
input_shape=(X_train.shape[1],),),
layers.Dense(128, activation="relu"),
layers.Dropout(0.2),
layers.Dense(256, activation="relu"),
layers.Dense(512, activation="relu"),
layers.Dropout(0.2),
layers.Dense(64, activation="relu"),
layers.Dense(1),
]
)
opt = keras.optimizers.Adam(learning_rate=0.01)
model.compile(loss='mean_squared_error',
optimizer=opt, metrics=['mae', 'mse'])
history = model.fit(X_train, y_train, epochs=200, validation_split=0.1, callbacks=[early_stop],)
Epoch 1/200
335/335 [==============================] - 1s 3ms/step - loss: 88.5822 - mae: 6.5761 - mse: 88.5822 - val_loss: 36.1129 - val_mae: 4.6599 - val_mse: 36.1129
Epoch 2/200
335/335 [==============================] - 1s 3ms/step - loss: 38.0175 - mae: 4.6946 - mse: 38.0175 - val_loss: 30.1431 - val_mae: 4.1988 - val_mse: 30.1431
Epoch 3/200
335/335 [==============================] - 1s 3ms/step - loss: 33.4089 - mae: 4.3429 - mse: 33.4089 - val_loss: 34.9154 - val_mae: 4.8873 - val_mse: 34.9154
Epoch 4/200
335/335 [==============================] - 1s 3ms/step - loss: 28.9762 - mae: 4.0474 - mse: 28.9762 - val_loss: 36.5322 - val_mae: 4.8710 - val_mse: 36.5322
Epoch 5/200
335/335 [==============================] - 1s 3ms/step - loss: 26.6944 - mae: 3.9239 - mse: 26.6944 - val_loss: 36.9679 - val_mae: 5.0986 - val_mse: 36.9679
Epoch 6/200
335/335 [==============================] - 1s 3ms/step - loss: 22.6390 - mae: 3.5917 - mse: 22.6390 - val_loss: 25.5548 - val_mae: 3.9797 - val_mse: 25.5548
Epoch 7/200
335/335 [==============================] - 1s 3ms/step - loss: 21.3222 - mae: 3.4796 - mse: 21.3222 - val_loss: 24.1274 - val_mae: 3.8104 - val_mse: 24.1274
Epoch 8/200
335/335 [==============================] - 1s 3ms/step - loss: 21.2355 - mae: 3.4528 - mse: 21.2355 - val_loss: 23.0463 - val_mae: 3.6462 - val_mse: 23.0463
Epoch 9/200
335/335 [==============================] - 1s 3ms/step - loss: 19.8987 - mae: 3.3513 - mse: 19.8987 - val_loss: 23.3246 - val_mae: 3.5491 - val_mse: 23.3246
Epoch 10/200
335/335 [==============================] - 1s 3ms/step - loss: 20.1989 - mae: 3.3736 - mse: 20.1989 - val_loss: 21.5967 - val_mae: 3.5246 - val_mse: 21.5967
Epoch 11/200
335/335 [==============================] - 1s 3ms/step - loss: 17.1561 - mae: 3.1038 - mse: 17.1561 - val_loss: 21.4864 - val_mae: 3.3486 - val_mse: 21.4864
Epoch 12/200
335/335 [==============================] - 1s 3ms/step - loss: 18.4897 - mae: 3.2406 - mse: 18.4897 - val_loss: 23.1777 - val_mae: 3.6755 - val_mse: 23.1777
Epoch 13/200
335/335 [==============================] - 1s 3ms/step - loss: 16.9098 - mae: 3.1090 - mse: 16.9098 - val_loss: 19.4548 - val_mae: 3.1942 - val_mse: 19.4548
Epoch 14/200
335/335 [==============================] - 1s 3ms/step - loss: 17.2672 - mae: 3.1229 - mse: 17.2672 - val_loss: 20.6426 - val_mae: 3.4684 - val_mse: 20.6426
Epoch 15/200
335/335 [==============================] - 1s 3ms/step - loss: 17.6549 - mae: 3.1584 - mse: 17.6549 - val_loss: 22.3610 - val_mae: 3.4008 - val_mse: 22.3610
Epoch 16/200
335/335 [==============================] - 1s 3ms/step - loss: 16.9222 - mae: 3.1086 - mse: 16.9222 - val_loss: 20.2673 - val_mae: 3.2451 - val_mse: 20.2673
Epoch 17/200
335/335 [==============================] - 1s 3ms/step - loss: 15.9358 - mae: 3.0255 - mse: 15.9358 - val_loss: 20.4191 - val_mae: 3.2302 - val_mse: 20.4191
Epoch 18/200
335/335 [==============================] - 1s 3ms/step - loss: 16.7693 - mae: 3.0742 - mse: 16.7693 - val_loss: 21.7748 - val_mae: 3.6764 - val_mse: 21.7748
Epoch 19/200
335/335 [==============================] - 1s 3ms/step - loss: 16.3996 - mae: 3.0702 - mse: 16.3996 - val_loss: 19.8259 - val_mae: 3.1752 - val_mse: 19.8259
Epoch 20/200
335/335 [==============================] - 1s 3ms/step - loss: 16.5213 - mae: 3.0592 - mse: 16.5213 - val_loss: 21.4091 - val_mae: 3.6836 - val_mse: 21.4091
Epoch 21/200
335/335 [==============================] - 1s 3ms/step - loss: 15.9956 - mae: 3.0162 - mse: 15.9956 - val_loss: 19.6310 - val_mae: 3.2997 - val_mse: 19.6310
Epoch 22/200
335/335 [==============================] - 1s 3ms/step - loss: 16.0111 - mae: 3.0024 - mse: 16.0111 - val_loss: 20.6067 - val_mae: 3.3334 - val_mse: 20.6067
Epoch 23/200
335/335 [==============================] - 1s 3ms/step - loss: 15.1770 - mae: 2.9399 - mse: 15.1770 - val_loss: 22.1456 - val_mae: 3.6386 - val_mse: 22.1456
Epoch 24/200
335/335 [==============================] - 1s 3ms/step - loss: 14.6462 - mae: 2.8749 - mse: 14.6462 - val_loss: 18.6738 - val_mae: 3.2064 - val_mse: 18.6738
Epoch 25/200
335/335 [==============================] - 1s 3ms/step - loss: 13.9560 - mae: 2.8153 - mse: 13.9560 - val_loss: 21.1440 - val_mae: 3.5123 - val_mse: 21.1440
Epoch 26/200
335/335 [==============================] - 1s 3ms/step - loss: 14.0086 - mae: 2.8230 - mse: 14.0086 - val_loss: 19.5613 - val_mae: 3.2137 - val_mse: 19.5613
Epoch 27/200
335/335 [==============================] - 1s 3ms/step - loss: 14.3125 - mae: 2.8489 - mse: 14.3125 - val_loss: 19.7472 - val_mae: 3.3587 - val_mse: 19.7472
Epoch 28/200
335/335 [==============================] - 1s 3ms/step - loss: 14.1316 - mae: 2.8411 - mse: 14.1316 - val_loss: 19.7847 - val_mae: 3.3107 - val_mse: 19.7847
Epoch 29/200
335/335 [==============================] - 1s 3ms/step - loss: 13.7052 - mae: 2.7982 - mse: 13.7052 - val_loss: 20.4032 - val_mae: 3.4941 - val_mse: 20.4032
Epoch 30/200
335/335 [==============================] - 1s 3ms/step - loss: 13.0548 - mae: 2.7319 - mse: 13.0548 - val_loss: 18.7457 - val_mae: 3.1877 - val_mse: 18.7457
Epoch 31/200
335/335 [==============================] - 1s 3ms/step - loss: 13.0223 - mae: 2.7370 - mse: 13.0223 - val_loss: 19.7470 - val_mae: 3.3078 - val_mse: 19.7470
Epoch 32/200
335/335 [==============================] - 1s 3ms/step - loss: 13.7680 - mae: 2.7799 - mse: 13.7680 - val_loss: 31.8045 - val_mae: 4.6455 - val_mse: 31.8045
Epoch 33/200
335/335 [==============================] - 1s 3ms/step - loss: 16.9060 - mae: 3.0815 - mse: 16.9060 - val_loss: 20.7011 - val_mae: 3.5447 - val_mse: 20.7011
Epoch 34/200
335/335 [==============================] - 1s 3ms/step - loss: 13.4084 - mae: 2.7451 - mse: 13.4084 - val_loss: 19.2511 - val_mae: 3.1178 - val_mse: 19.2511
Epoch 35/200
335/335 [==============================] - 1s 3ms/step - loss: 12.4933 - mae: 2.6592 - mse: 12.4933 - val_loss: 20.6078 - val_mae: 3.3862 - val_mse: 20.6078
Epoch 36/200
335/335 [==============================] - 1s 3ms/step - loss: 12.3008 - mae: 2.6765 - mse: 12.3008 - val_loss: 19.3481 - val_mae: 3.1230 - val_mse: 19.3481
Epoch 37/200
335/335 [==============================] - 1s 3ms/step - loss: 11.9626 - mae: 2.6093 - mse: 11.9626 - val_loss: 21.1786 - val_mae: 3.2431 - val_mse: 21.1786
Epoch 38/200
335/335 [==============================] - 1s 3ms/step - loss: 12.3632 - mae: 2.6454 - mse: 12.3632 - val_loss: 22.4643 - val_mae: 3.6553 - val_mse: 22.4643
Epoch 39/200
335/335 [==============================] - 1s 3ms/step - loss: 12.7137 - mae: 2.6874 - mse: 12.7137 - val_loss: 22.2701 - val_mae: 3.5854 - val_mse: 22.2701
Epoch 40/200
335/335 [==============================] - 1s 3ms/step - loss: 11.7284 - mae: 2.5673 - mse: 11.7284 - val_loss: 21.5855 - val_mae: 3.4258 - val_mse: 21.5855
Epoch 41/200
335/335 [==============================] - 1s 3ms/step - loss: 12.7227 - mae: 2.6810 - mse: 12.7227 - val_loss: 23.4930 - val_mae: 3.9059 - val_mse: 23.4930
Epoch 42/200
335/335 [==============================] - 1s 3ms/step - loss: 12.4669 - mae: 2.6337 - mse: 12.4669 - val_loss: 22.7228 - val_mae: 3.7075 - val_mse: 22.7228
Epoch 43/200
335/335 [==============================] - 1s 3ms/step - loss: 12.0860 - mae: 2.6287 - mse: 12.0860 - val_loss: 20.9738 - val_mae: 3.2877 - val_mse: 20.9738
Epoch 44/200
335/335 [==============================] - 1s 3ms/step - loss: 11.8580 - mae: 2.5910 - mse: 11.8580 - val_loss: 21.4522 - val_mae: 3.3515 - val_mse: 21.4522
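Training stops here because val_loss has not improved for 20 epochs (the best validation loss was reached around epoch 24). Note that by default EarlyStopping keeps the weights from the final epoch rather than the best one; if the best weights are wanted, restore_best_weights=True can be set (a sketch of that option):

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20,
                                           restore_best_weights=True)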
Plot the training history:
def plot_history(history):
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch
    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Mean Abs Error [speed]')
    plt.plot(hist['epoch'], hist['mae'],
             label='Train Error')
    plt.plot(hist['epoch'], hist['val_mae'],
             label='Val Error')
    plt.legend()
    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Mean Square Error [$speed^2$]')
    plt.plot(hist['epoch'], hist['mse'],
             label='Train Error')
    plt.plot(hist['epoch'], hist['val_mse'],
             label='Val Error')
    plt.legend()
    plt.show()
plot_history(history)
Training by using Decision Tree¶
from sklearn import tree
# tree_clf = tree.DecisionTreeRegressor()
# tree_clf.fit(X_train, y_train)
tscv = TimeSeriesSplit(n_splits=3)
param_grid = {
'max_depth': range(1, 10),
'min_samples_split': range(1, 10),
'min_samples_leaf': range(1, 5)
}
tree_model = tree.DecisionTreeRegressor()
tree_reg = GridSearchCV(estimator=tree_model, param_grid=param_grid, cv=tscv, n_jobs=4, verbose=10)
tree_reg.fit(X_train, y_train)
Fitting 3 folds for each of 324 candidates, totalling 972 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 5 tasks | elapsed: 1.2s
[Parallel(n_jobs=4)]: Done 10 tasks | elapsed: 1.2s
[Parallel(n_jobs=4)]: Done 17 tasks | elapsed: 1.3s
[Parallel(n_jobs=4)]: Batch computation too fast (0.1742s.) Setting batch_size=2.
[Parallel(n_jobs=4)]: Done 24 tasks | elapsed: 1.3s
[Parallel(n_jobs=4)]: Batch computation too fast (0.0517s.) Setting batch_size=4.
[Parallel(n_jobs=4)]: Done 44 tasks | elapsed: 1.4s
[Parallel(n_jobs=4)]: Batch computation too fast (0.0866s.) Setting batch_size=8.
[Parallel(n_jobs=4)]: Batch computation too fast (0.1193s.) Setting batch_size=16.
[Parallel(n_jobs=4)]: Done 88 tasks | elapsed: 1.6s
[Parallel(n_jobs=4)]: Done 216 tasks | elapsed: 2.3s
[Parallel(n_jobs=4)]: Done 392 tasks | elapsed: 3.3s
[Parallel(n_jobs=4)]: Done 600 tasks | elapsed: 5.4s
[Parallel(n_jobs=4)]: Done 808 tasks | elapsed: 7.7s
[Parallel(n_jobs=4)]: Done 972 out of 972 | elapsed: 9.7s finished
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=3),
error_score=nan,
estimator=DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse',
max_depth=None, max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort='deprecated',
random_state=None,
splitter='best'),
iid='deprecated', n_jobs=4,
param_grid={'max_depth': range(1, 10),
'min_samples_leaf': range(1, 5),
'min_samples_split': range(1, 10)},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=10)
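One caveat with this grid: scikit-learn requires an integer min_samples_split to be at least 2, so the candidates with min_samples_split=1 presumably fail and are scored as NaN (error_score=nan). A safer grid would start that range at 2:

param_grid = {
    'max_depth': range(1, 10),
    'min_samples_split': range(2, 10),  # integer min_samples_split must be >= 2
    'min_samples_leaf': range(1, 5)
}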
Training by using Ridge Regression¶
from sklearn import linear_model
param_grid = {'alpha': range(1, 10)}
reg_model = linear_model.Ridge(alpha=.5)
ridge = GridSearchCV(estimator=reg_model, param_grid=param_grid, cv=tscv, n_jobs=4, verbose=10)
ridge.fit(X_train, y_train)
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=4)]: Batch computation too fast (0.0206s.) Setting batch_size=2.
[Parallel(n_jobs=4)]: Done 5 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Batch computation too fast (0.0476s.) Setting batch_size=4.
[Parallel(n_jobs=4)]: Done 12 tasks | elapsed: 0.1s
[Parallel(n_jobs=4)]: Done 20 out of 27 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=4)]: Done 23 out of 27 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=4)]: Done 27 out of 27 | elapsed: 0.1s finished
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=3),
error_score=nan,
estimator=Ridge(alpha=0.5, copy_X=True, fit_intercept=True,
max_iter=None, normalize=False, random_state=None,
solver='auto', tol=0.001),
iid='deprecated', n_jobs=4, param_grid={'alpha': range(1, 10)},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=10)
Training by using Random forest¶
model_forest = RandomForestRegressor()
param_grid2={'n_estimators':[10,50,100,1000],
'max_features':range(1, 4)
}
random_forest=GridSearchCV(estimator=model_forest,param_grid=param_grid2,cv=tscv,n_jobs=10, verbose=10)
random_forest.fit(X_train,y_train)
Fitting 3 folds for each of 12 candidates, totalling 36 fits
[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done 5 tasks | elapsed: 1.5s
[Parallel(n_jobs=10)]: Done 12 tasks | elapsed: 4.1s
[Parallel(n_jobs=10)]: Done 21 out of 36 | elapsed: 7.9s remaining: 5.7s
[Parallel(n_jobs=10)]: Done 25 out of 36 | elapsed: 10.6s remaining: 4.7s
[Parallel(n_jobs=10)]: Done 29 out of 36 | elapsed: 23.0s remaining: 5.6s
[Parallel(n_jobs=10)]: Done 33 out of 36 | elapsed: 35.6s remaining: 3.2s
[Parallel(n_jobs=10)]: Done 36 out of 36 | elapsed: 42.2s finished
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=3),
error_score=nan,
estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
criterion='mse', max_depth=None,
max_features='auto',
max_leaf_nodes=None,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=None,
oob_score=False, random_state=None,
verbose=0, warm_start=False),
iid='deprecated', n_jobs=10,
param_grid={'max_features': range(1, 4),
'n_estimators': [10, 50, 100, 1000]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=10)
Training by using SVM¶
model_svm = svm.SVR()
param_grid3={'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
'C': [1, 10, 100, 1000]}
svr=GridSearchCV(estimator=model_svm,n_jobs=10, verbose=10, param_grid=param_grid3)
svr.fit(X_train,y_train, )
Fitting 5 folds for each of 8 candidates, totalling 40 fits
[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done 5 tasks | elapsed: 48.8s
[Parallel(n_jobs=10)]: Done 12 tasks | elapsed: 1.6min
[Parallel(n_jobs=10)]: Done 21 tasks | elapsed: 2.4min
[Parallel(n_jobs=10)]: Done 26 out of 40 | elapsed: 2.4min remaining: 1.3min
[Parallel(n_jobs=10)]: Done 31 out of 40 | elapsed: 3.2min remaining: 56.1s
[Parallel(n_jobs=10)]: Done 36 out of 40 | elapsed: 3.2min remaining: 21.6s
[Parallel(n_jobs=10)]: Done 40 out of 40 | elapsed: 3.3min finished
GridSearchCV(cv=None, error_score=nan,
estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3,
epsilon=0.1, gamma='scale', kernel='rbf',
max_iter=-1, shrinking=True, tol=0.001,
verbose=False),
iid='deprecated', n_jobs=10,
param_grid={'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001],
'kernel': ['rbf']},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=10)
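Unlike the other searches, this one does not pass cv=tscv, so GridSearchCV falls back to the default 5-fold split (hence "Fitting 5 folds" above). For consistency with the time-series evaluation, the same splitter could be passed here as well:

svr = GridSearchCV(estimator=model_svm, param_grid=param_grid3,
                   cv=tscv, n_jobs=10, verbose=10)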
Training by using Neural network¶
from sklearn.neural_network import MLPRegressor
mlp_reg = MLPRegressor(max_iter=1000)
mlp_reg.fit(X_train, y_train)
MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(100,), learning_rate='constant',
learning_rate_init=0.001, max_fun=15000, max_iter=1000,
momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
power_t=0.5, random_state=None, shuffle=True, solver='adam',
tol=0.0001, validation_fraction=0.1, verbose=False,
warm_start=False)
Evaluation¶
y_pred = model.predict(X_test)
print(y_pred)
print(y_test)
[[18.308086]
[16.099003]
[43.96459 ]
...
[44.576817]
[10.564673]
[30.076843]]
6405 17.479723
10497 19.555973
3669 48.410243
747 24.168746
902 44.950639
...
11278 19.544291
2312 9.208905
6475 47.599270
6440 7.897116
10103 31.978004
Name: speed, Length: 2101, dtype: float64
y_pred2 = xgb.predict(X_test)
# y_pred3 = random_forest.predict(X_test)
# y_pred4 = svr.predict(X_test)
# y_pred5 = tree_reg.predict(X_test)
# y_pred6 = ridge.predict(X_test)
# y_pred7 = mlp_reg.predict(X_test)
MSE_DNN=mean_squared_error(y_pred,y_test)
MSE_rg=mean_squared_error(y_pred2,y_test)
# MSE_rf=mean_squared_error(y_pred3,y_test)
# MSE_tree=mean_squared_error(y_pred5,y_test)
# MSE_svr=mean_squared_error(y_pred4,y_test)
# MSE_ridge = mean_squared_error(y_pred6, y_test)
# MSE_mlp = mean_squared_error(y_pred7, y_test)
print(f"MSE for dnn is: {MSE_DNN}")
print('MSE for XGBoost is '+str(MSE_rg))
# print('MSE for RandomForest is '+str(MSE_rf))
# print(f"MSE for decision is: {MSE_tree}")
# print(f"MSE for svr is: {MSE_svr}")
# print(f"MSE for ridge is: {MSE_ridge}")
# print(f"MSE for neural network is: {MSE_mlp}")
MSE for dnn is: 17.01656963733734
MSE for XGBoost is 14.680223822593183
Based on these results, we found that XGBoost provides the most accurate predictions.
Make a prediction¶
test = pd.read_csv('/content/drive/MyDrive/courses/HKUST/MSBD5001/project/data/individual project/test.csv')
print(test.to_numpy().shape)
(3504, 2)
test['date'] = pd.to_datetime(test['date'])
test = add_holiday_and_weekend(test)
test['time desc']=test['date'].apply(time_desc)
print(test.shape)
(3504, 5)
new_test = merge_weather(test, weather, how='inner')
new_test = new_test.drop_duplicates(['date'], keep='last')
new_test
| | id | date | IsWeekend | IsHoliday | time desc | MPressure(hPa) | MaxTemp(℃) | MinTemp(℃) | MCloud(%) | TRainfall(mm) | #hRedVisi(h) | WindDirect(degrees) | MWindSpeed(km/h) | hour | day | month | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2018-01-01 02:00:00 | 0 | 1 | Late_Night | 1020.5 | 19.0 | 16.3 | 75 | 0 | 21 | 70 | 28.4 | 2 | 1 | 1 | 2018 |
| 1 | 1 | 2018-01-01 05:00:00 | 0 | 1 | Early_Morning | 1020.5 | 19.0 | 16.3 | 75 | 0 | 21 | 70 | 28.4 | 5 | 1 | 1 | 2018 |
| 2 | 2 | 2018-01-01 07:00:00 | 0 | 1 | Early_Morning | 1020.5 | 19.0 | 16.3 | 75 | 0 | 21 | 70 | 28.4 | 7 | 1 | 1 | 2018 |
| 3 | 3 | 2018-01-01 08:00:00 | 0 | 1 | Morning | 1020.5 | 19.0 | 16.3 | 75 | 0 | 21 | 70 | 28.4 | 8 | 1 | 1 | 2018 |
| 4 | 4 | 2018-01-01 10:00:00 | 0 | 1 | Morning | 1020.5 | 19.0 | 16.3 | 75 | 0 | 21 | 70 | 28.4 | 10 | 1 | 1 | 2018 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3499 | 3499 | 2018-12-31 17:00:00 | 0 | 0 | Evening | 1027.0 | 15.6 | 11.8 | 77 | 0 | 0 | 360 | 26.8 | 17 | 31 | 12 | 2018 |
| 3500 | 3500 | 2018-12-31 19:00:00 | 0 | 0 | Evening | 1027.0 | 15.6 | 11.8 | 77 | 0 | 0 | 360 | 26.8 | 19 | 31 | 12 | 2018 |
| 3501 | 3501 | 2018-12-31 21:00:00 | 0 | 0 | Night | 1027.0 | 15.6 | 11.8 | 77 | 0 | 0 | 360 | 26.8 | 21 | 31 | 12 | 2018 |
| 3502 | 3502 | 2018-12-31 22:00:00 | 0 | 0 | Night | 1027.0 | 15.6 | 11.8 | 77 | 0 | 0 | 360 | 26.8 | 22 | 31 | 12 | 2018 |
| 3503 | 3503 | 2018-12-31 23:00:00 | 0 | 0 | Night | 1027.0 | 15.6 | 11.8 | 77 | 0 | 0 | 360 | 26.8 | 23 | 31 | 12 | 2018 |
3504 rows × 17 columns
transformed_data = preprocessor.fit_transform(new_test)
print(transformed_data.shape)
(3504, 21)
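Note that fit_transform here refits the scaler and one-hot encoder on the test set; it happens to yield the same 21 columns, but it is generally safer to reuse the preprocessor fitted on the training data (with handle_unknown='ignore' on the encoder so unseen categories do not raise). A hedged sketch of that variant:

# Safer variant: reuse the training-fitted preprocessor instead of refitting on the test set
# (requires OneHotEncoder(handle_unknown='ignore') when the preprocessor is defined)
transformed_data = preprocessor.transform(new_test)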
pred = xgb.predict(transformed_data)
print(pred.shape)
(3504,)
submission = test.copy()
submission['speed'] = pred
submission = submission[['id', 'speed']]
# Write submission to file
submission.to_csv('submission_20720230.csv', index=False)