Speed Prediction

In this project, we will use weather data to predict average traffic speed.

  • We will use weather data from openweathermap.org, which provides temperature, wind, humidity, and weather conditions

  • This notebook runs on Google Colab

Some of the data preprocessing techniques are based on the official TensorFlow time-series tutorial:

https://www.tensorflow.org/tutorials/structured_data/time_series

Other references are:

https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

Install and import dependencies

!pip install pactools
Requirement already satisfied: pactools in /usr/local/lib/python3.6/dist-packages (0.3.1)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from pactools) (1.4.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from pactools) (0.22.2.post1)
Requirement already satisfied: h5py in /usr/local/lib/python3.6/dist-packages (from pactools) (2.10.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from pactools) (1.18.5)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (from pactools) (3.2.2)
Requirement already satisfied: mne in /usr/local/lib/python3.6/dist-packages (from pactools) (0.21.2)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn->pactools) (0.17.0)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from h5py->pactools) (1.15.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->pactools) (1.3.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->pactools) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->pactools) (2.4.7)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->pactools) (2.8.1)
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer,OneHotEncoder,StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from pactools.grid_search import GridSearchCVProgressBar
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import GridSearchCV
from sklearn import svm
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error
/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  return f(*args, **kwds)
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
Num GPUs Available:  1

Preprocess

Assign holiday and weekend flags to the table

def time_desc(x: datetime):
    """Map an hour of the day to a coarse time-of-day bucket.

    Note that the bucket labels do not line up with the list names:
    hours 12-15 (Afternoon) map to 'noon' and hours 16-19 (Evening)
    map to 'afternoon'.
    """
    Early_Morning = [4,5,6,7]
    Morning = [8,9,10,11]
    Afternoon = [12,13,14,15]
    Evening = [16,17,18,19]
    Night = [20,21,22,23]
    Late_Night = [0,1,2,3]


    if x.hour in Early_Morning:
        return 'early'
    elif x.hour in Morning:
        return 'morning'
    elif x.hour in Afternoon:
        return 'noon'
    elif x.hour in Evening:
        return 'afternoon'
    elif x.hour in Night:
        return 'night'
    else:
        return 'latenight'

def add_holiday_and_weekend(df: pd.DataFrame) -> pd.DataFrame:
    """
    Add holiday and weekend to the dataset
    """
    new_df = df.copy()
    new_df['IsWeekend'] = new_df['date'].apply(lambda x:0 if x.weekday() in [0,1,2,3,4] else 1)
    new_df['IsHoliday']=new_df['date'].apply(lambda x:1 if (x.date().strftime('%Y-%m-%d') in [
           '2017-01-02','2017-01-28','2017-01-30','2017-01-31','2017-04-04',
           '2017-04-14','2017-04-15','2017-04-17','2017-05-01','2017-05-03',
           '2017-05-30','2017-07-01','2017-10-02','2017-10-05','2017-10-28',
           '2017-12-25','2017-12-26',
           '2018-01-01','2018-02-16','2018-02-17','2018-02-19','2018-03-30',
           '2018-03-31','2018-04-02','2018-04-05','2018-05-01','2018-05-22',
           '2018-06-18','2018-07-02','2018-09-25','2018-10-01','2018-10-17',
           '2018-12-25','2018-12-26'
           ])
           or(x.weekday() in[6]) else 0)
    return new_df

I am using Google Drive to store all the data, including the weather data, so please change the file paths below to your own.
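If you are running this notebook on Colab with the data stored in Google Drive, the drive has to be mounted before the reads below can succeed. A minimal sketch (the /content/drive mount point matches the paths used in this notebook):

# Mount Google Drive in Colab so that the CSV paths below resolve
from google.colab import drive
drive.mount('/content/drive')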

# change the following two lines to your own paths
train = pd.read_csv('/content/drive/MyDrive/courses/HKUST/MSBD5001/project/data/individual project/train.csv')
test = pd.read_csv('/content/drive/MyDrive/courses/HKUST/MSBD5001/project/data/individual project/test.csv')

train = train.drop('id', axis=1)
train['date'] = pd.to_datetime(train['date'])
train = add_holiday_and_weekend(train)
train['time desc']=train['date'].apply(time_desc)

train
date speed IsWeekend IsHoliday time desc
0 2017-01-01 00:00:00 43.002930 1 1 latenight
1 2017-01-01 01:00:00 46.118696 1 1 latenight
2 2017-01-01 02:00:00 44.294158 1 1 latenight
3 2017-01-01 03:00:00 41.067468 1 1 latenight
4 2017-01-01 04:00:00 46.448653 1 1 early
... ... ... ... ... ...
14001 2018-12-31 12:00:00 19.865269 0 0 noon
14002 2018-12-31 15:00:00 17.820375 0 0 noon
14003 2018-12-31 16:00:00 12.501851 0 0 afternoon
14004 2018-12-31 18:00:00 15.979319 0 0 afternoon
14005 2018-12-31 20:00:00 40.594183 0 0 night

14006 rows × 5 columns

train.plot(x='date', y='speed', figsize=(20, 10))
<matplotlib.axes._subplots.AxesSubplot at 0x7f5321240400>

[figure: speed vs. date line plot]


Merge weather

Pre-process weather data

from datetime import datetime

def k_to_c(x):
    """Convert a temperature from Kelvin to Celsius."""
    return x - 273.15

We have two different weather sources:

  • OpenWeather (provides hourly weather reports)
  • Hong Kong Observatory (provides daily weather reports)

Based on my experiments, I decided to use the Hong Kong Observatory weather report, which yields higher accuracy.

# Change this path to yours
# weather = pd.read_csv("/content/drive/MyDrive/courses/HKUST/MSBD5001/project/data/individual project/hongkong_weather_1970-2020.csv")
# weather['dt_iso'] = weather['dt_iso'].apply(lambda x: x.replace('UTC', ''))
# weather['date'] = pd.to_datetime(weather['dt_iso']).dt.tz_convert("Asia/Hong_Kong").dt.tz_localize(None)

# # Transform unit
# weather['temp'] = weather['temp'].apply(k_to_c)
# weather['feels_like'] = weather['feels_like'].apply(k_to_c)
# weather['temp_min'] = weather['temp_min'].apply(k_to_c)
# weather['temp_max'] = weather['temp_max'].apply(k_to_c)

# weather = weather.drop(["dt_iso", "dt", "weather_icon", "rain_1h", "rain_3h", "snow_1h", "snow_3h", "sea_level", "grnd_level", "timezone", "lat", "lon"], axis=1)
# mask = (weather['date'] >= datetime(2017,1,1)) & (weather['date'] <= datetime(2019, 1, 1))
# weather = weather.loc[mask]

# weather
weather = pd.read_csv("/content/drive/MyDrive/courses/HKUST/MSBD5001/project/data/individual project/hong-kong-weather-histories.csv")

# Build a proper date column from the separate year/month/day columns
weather['date'] = pd.to_datetime(weather.year*10000 + weather.month*100 + weather.day, format='%Y%m%d')
weather.drop(['Unnamed: 0', 'year', 'month', 'day'], axis=1, inplace=True)
# '-' and 'Trace' denote no measurable rainfall; map them to 0
weather['TRainfall(mm)'] = weather['TRainfall(mm)'].apply(lambda x: 0 if x in ['-', 'Trace'] else x)
weather.head()
MPressure(hPa) MaxTemp(℃) MinTemp(℃) MCloud(%) TRainfall(mm) #hRedVisi(h) WindDirect(degrees) MWindSpeed(km/h) date
0 1021.7 20.8 18.4 72 0 0 60 34.2 2017-01-01
1 1020.2 23.3 18.4 28 0 0 70 17.7 2017-01-02
2 1019.8 21.3 18.9 56 0 5 70 26.1 2017-01-03
3 1018.7 21.7 18.7 51 0 0 70 27.7 2017-01-04
4 1016.9 23.4 18.9 61 0 0 40 14.3 2017-01-05

Merge

from pandas import DatetimeIndex
def merge_weather(df: pd.DataFrame, weather: pd.DataFrame, how='left') -> pd.DataFrame:
    '''
    Merge the weather table into the data. The weather report is daily,
    so merging on the calendar date broadcasts each day's weather to
    every hourly record of that day.
    '''
    new_df = df.copy()
    new_weather = weather.copy()

    new_df['tmp_date'] = new_df['date'].apply(lambda date: date.date())
    new_weather['tmp_date'] = new_weather['date'].apply(lambda date: date.date())

    new_training_data = new_df.merge(new_weather, on='tmp_date', how=how)
    new_training_data = new_training_data.drop(['tmp_date', 'date_y'], axis=1)
    new_training_data = new_training_data.rename(columns={'date_x': 'date'})

    # Expand the timestamp into separate calendar features
    new_training_data['hour'] = DatetimeIndex(new_training_data['date']).hour
    new_training_data['day'] = DatetimeIndex(new_training_data['date']).day
    new_training_data['month'] = DatetimeIndex(new_training_data['date']).month
    new_training_data['year'] = DatetimeIndex(new_training_data['date']).year
    new_training_data['weekday'] = DatetimeIndex(new_training_data['date']).weekday
    return new_training_data
new_training_data = merge_weather(train, weather)

new_training_data.sample(20)
date speed IsWeekend IsHoliday time desc MPressure(hPa) MaxTemp(℃) MinTemp(℃) MCloud(%) TRainfall(mm) #hRedVisi(h) WindDirect(degrees) MWindSpeed(km/h) hour day month year weekday
10903 2018-05-27 00:00:00 45.356636 1 1 latenight 1008.9 33.4 26.9 63 3.4 0 230 23.7 0 27 5 2018 6
1201 2017-02-20 10:00:00 32.278186 0 0 morning 1013.9 25.5 18.3 72 0 3 30 12.8 10 20 2 2017 0
5038 2017-07-30 08:00:00 45.045055 1 1 morning 996.0 34.8 29.6 69 0 4 300 24.3 8 30 7 2017 6
1107 2017-02-16 12:00:00 21.056502 0 0 noon 1021.6 24.0 15.4 11 0 0 40 16.3 12 16 2 2017 3
5044 2017-07-30 14:00:00 37.517872 1 1 noon 996.0 34.8 29.6 69 0 4 300 24.3 14 30 7 2017 6
10121 2018-03-04 20:00:00 39.774170 1 1 night 1011.0 27.3 21.9 86 0 0 140 8.0 20 4 3 2018 6
8027 2017-01-12 21:00:00 43.835431 0 0 night 1015.5 20.3 16.9 93 0 15 70 28.0 21 12 1 2017 3
1336 2017-02-26 01:00:00 48.388743 1 1 latenight 1021.2 17.0 10.6 88 1.4 0 360 21.7 1 26 2 2017 6
436 2017-01-19 13:00:00 19.198023 0 0 noon 1020.1 24.1 18.7 76 0 8 60 12.8 13 19 1 2017 3
6499 2017-09-29 05:00:00 48.814754 0 0 early 1012.2 33.1 28.8 63 0 0 90 21.1 5 29 9 2017 4
317 2017-01-14 05:00:00 45.863401 1 0 early 1017.9 16.5 14.5 92 1.0 0 50 29.8 5 14 1 2017 5
788 2017-03-02 05:00:00 46.502654 0 0 early 1019.2 23.9 17.2 12 0 0 10 25.5 5 2 3 2017 3
4910 2017-07-25 00:00:00 48.773277 0 0 latenight 1005.1 33.1 27.7 55 0 0 100 15.0 0 25 7 2017 1
6698 2017-07-10 12:00:00 31.192037 0 0 noon 1008.5 32.1 27.5 83 0.6 0 190 14.6 12 10 7 2017 0
7041 2017-10-21 19:00:00 29.392348 1 0 afternoon 1012.1 27.2 21.6 33 0 0 360 33.3 19 21 10 2017 5
6451 2017-09-27 05:00:00 48.713259 0 0 early 1009.6 33.0 27.7 34 0 1 200 6.2 5 27 9 2017 2
779 2017-02-02 20:00:00 36.991496 0 0 night 1022.7 17.7 16.2 86 0 0 80 43.0 20 2 2 2017 3
8127 2017-06-12 01:00:00 50.491144 0 0 latenight 1001.9 30.0 25.3 80 37.7 0 80 53.5 1 12 6 2017 0
9211 2018-01-02 20:00:00 39.546576 0 0 night 1019.3 19.2 16.0 77 0 10 60 31.8 20 2 1 2018 1
5850 2017-02-09 04:00:00 47.510284 0 0 early 1020.2 16.8 11.1 72 0 0 10 39.9 4 9 2 2017 3

Plot data

plt.figure(figsize=(6,4))
sns.boxplot(x='speed', data=new_training_data, orient='h', palette="Set3", linewidth=2.5)
plt.show()

[figure: horizontal boxplot of speed]

Traffic speed

data_plot = new_training_data
data_plot['month'] = data_plot['date'].dt.month
data_plot
date speed IsWeekend IsHoliday time desc MPressure(hPa) MaxTemp(℃) MinTemp(℃) MCloud(%) TRainfall(mm) #hRedVisi(h) WindDirect(degrees) MWindSpeed(km/h) hour day month year weekday
0 2017-01-01 00:00:00 43.002930 1 1 latenight 1021.7 20.8 18.4 72 0 0 60 34.2 0 1 1 2017 6
1 2017-01-01 01:00:00 46.118696 1 1 latenight 1021.7 20.8 18.4 72 0 0 60 34.2 1 1 1 2017 6
2 2017-01-01 02:00:00 44.294158 1 1 latenight 1021.7 20.8 18.4 72 0 0 60 34.2 2 1 1 2017 6
3 2017-01-01 03:00:00 41.067468 1 1 latenight 1021.7 20.8 18.4 72 0 0 60 34.2 3 1 1 2017 6
4 2017-01-01 04:00:00 46.448653 1 1 early 1021.7 20.8 18.4 72 0 0 60 34.2 4 1 1 2017 6
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14001 2018-12-31 12:00:00 19.865269 0 0 noon 1027.0 15.6 11.8 77 0 0 360 26.8 12 31 12 2018 0
14002 2018-12-31 15:00:00 17.820375 0 0 noon 1027.0 15.6 11.8 77 0 0 360 26.8 15 31 12 2018 0
14003 2018-12-31 16:00:00 12.501851 0 0 afternoon 1027.0 15.6 11.8 77 0 0 360 26.8 16 31 12 2018 0
14004 2018-12-31 18:00:00 15.979319 0 0 afternoon 1027.0 15.6 11.8 77 0 0 360 26.8 18 31 12 2018 0
14005 2018-12-31 20:00:00 40.594183 0 0 night 1027.0 15.6 11.8 77 0 0 360 26.8 20 31 12 2018 0

14006 rows × 18 columns

tmp_data=new_training_data.groupby('month').aggregate({'speed':'mean'})
plt.figure(figsize=(8,6))
sns.lineplot(x=tmp_data.index,y=tmp_data.speed,data=tmp_data,palette="Set2")
plt.show()

[figure: mean speed by month]

plt.figure(figsize=(8,6))
sns.countplot(y='time desc',data=new_training_data,palette=["#7fcdbb","#edf8b1","#fc9272","#fee0d2","#bcbddc","#efedf5"])
plt.show()

[figure: record counts per time-of-day bucket]

new_training_data.hist(bins=50,figsize=(20,15))
plt.show()

[figure: histograms of all numeric features]


Train

new_training_data.columns
Index(['date', 'speed', 'IsWeekend', 'IsHoliday', 'time desc',
       'MPressure(hPa)', 'MaxTemp(℃)', 'MinTemp(℃)', 'MCloud(%)',
       'TRainfall(mm)', '#hRedVisi(h)', 'WindDirect(degrees)',
       'MWindSpeed(km/h)', 'hour', 'day', 'month', 'year', 'weekday'],
      dtype='object')
# using the OpenWeather dataset
# cat_vars=['IsWeekend','IsHoliday','time desc']
# num_vars=['temp', 'pressure', 'wind_speed', 'humidity', "clouds_all", 'month', 'year', 'day', 'hour']

# using the Hong Kong Observatory dataset
cat_vars=['IsWeekend','IsHoliday','time desc']
num_vars=['year','month','day','hour','MaxTemp(℃)','MinTemp(℃)','MCloud(%)',
          'TRainfall(mm)','#hRedVisi(h)','WindDirect(degrees)',
          'MWindSpeed(km/h)']

Transform Data

numeric_transformer=Pipeline(steps=[
    ('scaler', MinMaxScaler())])
categorical_transformer=Pipeline(steps=[
    ('oneHot',OneHotEncoder())])

preprocessor=ColumnTransformer(transformers=[
    ('num',numeric_transformer,num_vars),
    ('cat',categorical_transformer,cat_vars)])

scaled_data=preprocessor.fit_transform(new_training_data)
print(scaled_data.shape)
(14006, 21)
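The 21 columns are the 11 scaled numeric features plus the one-hot encodings of the 3 categorical variables (2 + 2 + 6 columns). To sanity-check what the encoder produced, we can query the fitted transformer; a minimal sketch, assuming the scikit-learn 0.22 get_feature_names API shown in the pip output above:

# Inspect the one-hot encoded column names produced by the fitted preprocessor
print(preprocessor.named_transformers_['cat']['oneHot'].get_feature_names(cat_vars))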
y = new_training_data['speed']
# y = y.to_numpy()
# y = y.reshape(-1, 1)
# print(y.shape)
# print(y)

We considered scaling the target speed as well, but it is left unscaled here.

import numpy as np
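
For reference, a minimal sketch of how the target could be scaled and the predictions mapped back afterwards (hypothetical, not used in the runs below; y_pred_scaled stands for a prediction made on the scaled target):

# Hypothetical target scaling -- not used in the runs below
y_scaler = MinMaxScaler()
y_scaled = y_scaler.fit_transform(y.to_numpy().reshape(-1, 1))
# After predicting on the scaled target, invert the scaling:
# y_pred_original = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))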

Split data

X_train,X_test,y_train,y_test=train_test_split(scaled_data,y,test_size=0.15,random_state=42)
print(X_train)
print(y_train)
print(f"Train x shape: {X_train.shape}")
print(f"Train y shape: {y_train.shape}")
[[1.         0.09090909 0.53333333 ... 0.         0.         0.        ]
 [0.         0.54545455 0.66666667 ... 1.         0.         0.        ]
 [0.         0.63636364 0.43333333 ... 0.         0.         0.        ]
 ...
 [0.         0.63636364 0.43333333 ... 0.         0.         0.        ]
 [0.         0.45454545 0.03333333 ... 0.         0.         0.        ]
 [0.         0.81818182 1.         ... 1.         0.         0.        ]]
9443     46.813428
4822     15.018334
5391     48.414942
11099    19.663880
7854     10.389012
           ...    
5191     24.637279
13418    35.306286
5390     46.950711
860      46.623319
7270     16.718626
Name: speed, Length: 11905, dtype: float64
Train x shape: (11905, 21)
Train y shape: (11905,)

Training with XGBRegressor

tscv = TimeSeriesSplit(n_splits=3)
model = XGBRegressor()

param_grid = {'nthread':[4,6,8], 
              'objective':['reg:squarederror'],
              'learning_rate':[.03, 0.05, .07],
              'max_depth':[5, 6, 7, 8],
              'min_child_weight':[4],
              'subsample':[0.7],
              'colsample_bytree':[0.7],
              'n_estimators':[500]}

xgb=GridSearchCV(estimator=model,param_grid=param_grid,cv=tscv,n_jobs=2, verbose=10)
xgb.fit(X_train,y_train)
Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   1 tasks      | elapsed:   37.4s
[Parallel(n_jobs=2)]: Done   4 tasks      | elapsed:   51.0s
[Parallel(n_jobs=2)]: Done   9 tasks      | elapsed:  1.0min
[Parallel(n_jobs=2)]: Done  14 tasks      | elapsed:  2.6min
[Parallel(n_jobs=2)]: Done  21 tasks      | elapsed:  5.8min
[Parallel(n_jobs=2)]: Done  28 tasks      | elapsed:  8.1min
[Parallel(n_jobs=2)]: Done  37 tasks      | elapsed: 10.0min
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 11.5min
[Parallel(n_jobs=2)]: Done  57 tasks      | elapsed: 16.0min
[Parallel(n_jobs=2)]: Done  68 tasks      | elapsed: 18.1min
[Parallel(n_jobs=2)]: Done  81 tasks      | elapsed: 20.3min
[Parallel(n_jobs=2)]: Done  94 tasks      | elapsed: 25.2min
[Parallel(n_jobs=2)]: Done 108 out of 108 | elapsed: 29.2min finished
/usr/local/lib/python3.6/dist-packages/xgboost/sklearn.py:242: DeprecationWarning: The nthread parameter is deprecated as of version .6.Please use n_jobs instead.nthread is deprecated.
  'nthread is deprecated.', DeprecationWarning)





GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=3),
             error_score=nan,
             estimator=XGBRegressor(base_score=0.5, booster='gbtree',
                                    colsample_bylevel=1, colsample_bynode=1,
                                    colsample_bytree=1, gamma=0,
                                    importance_type='gain', learning_rate=0.1,
                                    max_delta_step=0, max_depth=3,
                                    min_child_weight=1, missing=None,
                                    n_estimators=100, n_jobs=1, nthread=None,
                                    objectiv...
                                    scale_pos_weight=1, seed=None, silent=None,
                                    subsample=1, verbosity=1),
             iid='deprecated', n_jobs=2,
             param_grid={'colsample_bytree': [0.7],
                         'learning_rate': [0.03, 0.05, 0.07],
                         'max_depth': [5, 6, 7, 8], 'min_child_weight': [4],
                         'n_estimators': [500], 'nthread': [4, 6, 8],
                         'objective': ['reg:squarederror'],
                         'subsample': [0.7]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=10)
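
To see which hyper-parameter combination the search selected, we can query the fitted GridSearchCV object (standard scikit-learn attributes):

# Best hyper-parameters and the corresponding cross-validation score
print(xgb.best_params_)
print(xgb.best_score_)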

Training with a DNN

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.metrics import MeanSquaredError
from tensorflow.keras import regularizers

print(X_train.shape)
(11905, 21)
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
model = keras.Sequential(
    [
        layers.Dense(128, activation="relu", input_shape=(X_train.shape[1],)),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(256, activation="relu"),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(64, activation="relu"),
        layers.Dense(1),
    ]
)

opt = keras.optimizers.Adam(learning_rate=0.01)

model.compile(loss='mean_squared_error',
              optimizer=opt, metrics=['mae', 'mse'])
history = model.fit(X_train, y_train, epochs=200, validation_split=0.1, callbacks=[early_stop],)
Epoch 1/200
335/335 [==============================] - 1s 3ms/step - loss: 88.5822 - mae: 6.5761 - mse: 88.5822 - val_loss: 36.1129 - val_mae: 4.6599 - val_mse: 36.1129
Epoch 2/200
335/335 [==============================] - 1s 3ms/step - loss: 38.0175 - mae: 4.6946 - mse: 38.0175 - val_loss: 30.1431 - val_mae: 4.1988 - val_mse: 30.1431
Epoch 3/200
335/335 [==============================] - 1s 3ms/step - loss: 33.4089 - mae: 4.3429 - mse: 33.4089 - val_loss: 34.9154 - val_mae: 4.8873 - val_mse: 34.9154
Epoch 4/200
335/335 [==============================] - 1s 3ms/step - loss: 28.9762 - mae: 4.0474 - mse: 28.9762 - val_loss: 36.5322 - val_mae: 4.8710 - val_mse: 36.5322
Epoch 5/200
335/335 [==============================] - 1s 3ms/step - loss: 26.6944 - mae: 3.9239 - mse: 26.6944 - val_loss: 36.9679 - val_mae: 5.0986 - val_mse: 36.9679
Epoch 6/200
335/335 [==============================] - 1s 3ms/step - loss: 22.6390 - mae: 3.5917 - mse: 22.6390 - val_loss: 25.5548 - val_mae: 3.9797 - val_mse: 25.5548
Epoch 7/200
335/335 [==============================] - 1s 3ms/step - loss: 21.3222 - mae: 3.4796 - mse: 21.3222 - val_loss: 24.1274 - val_mae: 3.8104 - val_mse: 24.1274
Epoch 8/200
335/335 [==============================] - 1s 3ms/step - loss: 21.2355 - mae: 3.4528 - mse: 21.2355 - val_loss: 23.0463 - val_mae: 3.6462 - val_mse: 23.0463
Epoch 9/200
335/335 [==============================] - 1s 3ms/step - loss: 19.8987 - mae: 3.3513 - mse: 19.8987 - val_loss: 23.3246 - val_mae: 3.5491 - val_mse: 23.3246
Epoch 10/200
335/335 [==============================] - 1s 3ms/step - loss: 20.1989 - mae: 3.3736 - mse: 20.1989 - val_loss: 21.5967 - val_mae: 3.5246 - val_mse: 21.5967
Epoch 11/200
335/335 [==============================] - 1s 3ms/step - loss: 17.1561 - mae: 3.1038 - mse: 17.1561 - val_loss: 21.4864 - val_mae: 3.3486 - val_mse: 21.4864
Epoch 12/200
335/335 [==============================] - 1s 3ms/step - loss: 18.4897 - mae: 3.2406 - mse: 18.4897 - val_loss: 23.1777 - val_mae: 3.6755 - val_mse: 23.1777
Epoch 13/200
335/335 [==============================] - 1s 3ms/step - loss: 16.9098 - mae: 3.1090 - mse: 16.9098 - val_loss: 19.4548 - val_mae: 3.1942 - val_mse: 19.4548
Epoch 14/200
335/335 [==============================] - 1s 3ms/step - loss: 17.2672 - mae: 3.1229 - mse: 17.2672 - val_loss: 20.6426 - val_mae: 3.4684 - val_mse: 20.6426
Epoch 15/200
335/335 [==============================] - 1s 3ms/step - loss: 17.6549 - mae: 3.1584 - mse: 17.6549 - val_loss: 22.3610 - val_mae: 3.4008 - val_mse: 22.3610
Epoch 16/200
335/335 [==============================] - 1s 3ms/step - loss: 16.9222 - mae: 3.1086 - mse: 16.9222 - val_loss: 20.2673 - val_mae: 3.2451 - val_mse: 20.2673
Epoch 17/200
335/335 [==============================] - 1s 3ms/step - loss: 15.9358 - mae: 3.0255 - mse: 15.9358 - val_loss: 20.4191 - val_mae: 3.2302 - val_mse: 20.4191
Epoch 18/200
335/335 [==============================] - 1s 3ms/step - loss: 16.7693 - mae: 3.0742 - mse: 16.7693 - val_loss: 21.7748 - val_mae: 3.6764 - val_mse: 21.7748
Epoch 19/200
335/335 [==============================] - 1s 3ms/step - loss: 16.3996 - mae: 3.0702 - mse: 16.3996 - val_loss: 19.8259 - val_mae: 3.1752 - val_mse: 19.8259
Epoch 20/200
335/335 [==============================] - 1s 3ms/step - loss: 16.5213 - mae: 3.0592 - mse: 16.5213 - val_loss: 21.4091 - val_mae: 3.6836 - val_mse: 21.4091
Epoch 21/200
335/335 [==============================] - 1s 3ms/step - loss: 15.9956 - mae: 3.0162 - mse: 15.9956 - val_loss: 19.6310 - val_mae: 3.2997 - val_mse: 19.6310
Epoch 22/200
335/335 [==============================] - 1s 3ms/step - loss: 16.0111 - mae: 3.0024 - mse: 16.0111 - val_loss: 20.6067 - val_mae: 3.3334 - val_mse: 20.6067
Epoch 23/200
335/335 [==============================] - 1s 3ms/step - loss: 15.1770 - mae: 2.9399 - mse: 15.1770 - val_loss: 22.1456 - val_mae: 3.6386 - val_mse: 22.1456
Epoch 24/200
335/335 [==============================] - 1s 3ms/step - loss: 14.6462 - mae: 2.8749 - mse: 14.6462 - val_loss: 18.6738 - val_mae: 3.2064 - val_mse: 18.6738
Epoch 25/200
335/335 [==============================] - 1s 3ms/step - loss: 13.9560 - mae: 2.8153 - mse: 13.9560 - val_loss: 21.1440 - val_mae: 3.5123 - val_mse: 21.1440
Epoch 26/200
335/335 [==============================] - 1s 3ms/step - loss: 14.0086 - mae: 2.8230 - mse: 14.0086 - val_loss: 19.5613 - val_mae: 3.2137 - val_mse: 19.5613
Epoch 27/200
335/335 [==============================] - 1s 3ms/step - loss: 14.3125 - mae: 2.8489 - mse: 14.3125 - val_loss: 19.7472 - val_mae: 3.3587 - val_mse: 19.7472
Epoch 28/200
335/335 [==============================] - 1s 3ms/step - loss: 14.1316 - mae: 2.8411 - mse: 14.1316 - val_loss: 19.7847 - val_mae: 3.3107 - val_mse: 19.7847
Epoch 29/200
335/335 [==============================] - 1s 3ms/step - loss: 13.7052 - mae: 2.7982 - mse: 13.7052 - val_loss: 20.4032 - val_mae: 3.4941 - val_mse: 20.4032
Epoch 30/200
335/335 [==============================] - 1s 3ms/step - loss: 13.0548 - mae: 2.7319 - mse: 13.0548 - val_loss: 18.7457 - val_mae: 3.1877 - val_mse: 18.7457
Epoch 31/200
335/335 [==============================] - 1s 3ms/step - loss: 13.0223 - mae: 2.7370 - mse: 13.0223 - val_loss: 19.7470 - val_mae: 3.3078 - val_mse: 19.7470
Epoch 32/200
335/335 [==============================] - 1s 3ms/step - loss: 13.7680 - mae: 2.7799 - mse: 13.7680 - val_loss: 31.8045 - val_mae: 4.6455 - val_mse: 31.8045
Epoch 33/200
335/335 [==============================] - 1s 3ms/step - loss: 16.9060 - mae: 3.0815 - mse: 16.9060 - val_loss: 20.7011 - val_mae: 3.5447 - val_mse: 20.7011
Epoch 34/200
335/335 [==============================] - 1s 3ms/step - loss: 13.4084 - mae: 2.7451 - mse: 13.4084 - val_loss: 19.2511 - val_mae: 3.1178 - val_mse: 19.2511
Epoch 35/200
335/335 [==============================] - 1s 3ms/step - loss: 12.4933 - mae: 2.6592 - mse: 12.4933 - val_loss: 20.6078 - val_mae: 3.3862 - val_mse: 20.6078
Epoch 36/200
335/335 [==============================] - 1s 3ms/step - loss: 12.3008 - mae: 2.6765 - mse: 12.3008 - val_loss: 19.3481 - val_mae: 3.1230 - val_mse: 19.3481
Epoch 37/200
335/335 [==============================] - 1s 3ms/step - loss: 11.9626 - mae: 2.6093 - mse: 11.9626 - val_loss: 21.1786 - val_mae: 3.2431 - val_mse: 21.1786
Epoch 38/200
335/335 [==============================] - 1s 3ms/step - loss: 12.3632 - mae: 2.6454 - mse: 12.3632 - val_loss: 22.4643 - val_mae: 3.6553 - val_mse: 22.4643
Epoch 39/200
335/335 [==============================] - 1s 3ms/step - loss: 12.7137 - mae: 2.6874 - mse: 12.7137 - val_loss: 22.2701 - val_mae: 3.5854 - val_mse: 22.2701
Epoch 40/200
335/335 [==============================] - 1s 3ms/step - loss: 11.7284 - mae: 2.5673 - mse: 11.7284 - val_loss: 21.5855 - val_mae: 3.4258 - val_mse: 21.5855
Epoch 41/200
335/335 [==============================] - 1s 3ms/step - loss: 12.7227 - mae: 2.6810 - mse: 12.7227 - val_loss: 23.4930 - val_mae: 3.9059 - val_mse: 23.4930
Epoch 42/200
335/335 [==============================] - 1s 3ms/step - loss: 12.4669 - mae: 2.6337 - mse: 12.4669 - val_loss: 22.7228 - val_mae: 3.7075 - val_mse: 22.7228
Epoch 43/200
335/335 [==============================] - 1s 3ms/step - loss: 12.0860 - mae: 2.6287 - mse: 12.0860 - val_loss: 20.9738 - val_mae: 3.2877 - val_mse: 20.9738
Epoch 44/200
335/335 [==============================] - 1s 3ms/step - loss: 11.8580 - mae: 2.5910 - mse: 11.8580 - val_loss: 21.4522 - val_mae: 3.3515 - val_mse: 21.4522

Plot training history

def plot_history(history):
  hist = pd.DataFrame(history.history)
  hist['epoch'] = history.epoch

  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Mean Abs Error [speed]')
  plt.plot(hist['epoch'], hist['mae'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mae'],
           label = 'Val Error')
  plt.legend()

  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Mean Square Error [$speed^2$]')
  plt.plot(hist['epoch'], hist['mse'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mse'],
           label = 'Val Error')
  plt.legend()
  plt.show()


plot_history(history)

[figure: training vs. validation MAE]

[figure: training vs. validation MSE]

Training with a Decision Tree

from sklearn import tree
# tree_clf = tree.DecisionTreeRegressor()
# tree_clf.fit(X_train, y_train)
tscv = TimeSeriesSplit(n_splits=3)
param_grid = {
    'max_depth': range(1, 10),
    # scikit-learn requires min_samples_split >= 2, so the candidates with
    # min_samples_split=1 fail to fit and are scored as NaN (error_score=nan)
    'min_samples_split': range(1, 10),
    'min_samples_leaf': range(1, 5)
}
tree_model = tree.DecisionTreeRegressor()
tree_reg = GridSearchCV(estimator=tree_model, param_grid=param_grid, cv=tscv, n_jobs=4, verbose=10)
tree_reg.fit(X_train, y_train)
Fitting 3 folds for each of 324 candidates, totalling 972 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    1.2s
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:    1.2s
[Parallel(n_jobs=4)]: Done  17 tasks      | elapsed:    1.3s
[Parallel(n_jobs=4)]: Batch computation too fast (0.1742s.) Setting batch_size=2.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    1.3s
[Parallel(n_jobs=4)]: Batch computation too fast (0.0517s.) Setting batch_size=4.
[Parallel(n_jobs=4)]: Done  44 tasks      | elapsed:    1.4s
[Parallel(n_jobs=4)]: Batch computation too fast (0.0866s.) Setting batch_size=8.
[Parallel(n_jobs=4)]: Batch computation too fast (0.1193s.) Setting batch_size=16.
[Parallel(n_jobs=4)]: Done  88 tasks      | elapsed:    1.6s
[Parallel(n_jobs=4)]: Done 216 tasks      | elapsed:    2.3s
[Parallel(n_jobs=4)]: Done 392 tasks      | elapsed:    3.3s
[Parallel(n_jobs=4)]: Done 600 tasks      | elapsed:    5.4s
[Parallel(n_jobs=4)]: Done 808 tasks      | elapsed:    7.7s
[Parallel(n_jobs=4)]: Done 972 out of 972 | elapsed:    9.7s finished





GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=3),
             error_score=nan,
             estimator=DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse',
                                             max_depth=None, max_features=None,
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             presort='deprecated',
                                             random_state=None,
                                             splitter='best'),
             iid='deprecated', n_jobs=4,
             param_grid={'max_depth': range(1, 10),
                         'min_samples_leaf': range(1, 5),
                         'min_samples_split': range(1, 10)},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=10)

Training with Ridge Regression

from sklearn import linear_model


param_grid = {'alpha': range(1, 10)}
# The alpha given here is just the initial value; the grid search overrides it
reg_model = linear_model.Ridge(alpha=.5)

ridge = GridSearchCV(estimator=reg_model, param_grid=param_grid, cv=tscv, n_jobs=4, verbose=10)

ridge.fit(X_train, y_train)
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=4)]: Batch computation too fast (0.0206s.) Setting batch_size=2.
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Batch computation too fast (0.0476s.) Setting batch_size=4.
[Parallel(n_jobs=4)]: Done  12 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done  20 out of  27 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  23 out of  27 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  27 out of  27 | elapsed:    0.1s finished





GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=3),
             error_score=nan,
             estimator=Ridge(alpha=0.5, copy_X=True, fit_intercept=True,
                             max_iter=None, normalize=False, random_state=None,
                             solver='auto', tol=0.001),
             iid='deprecated', n_jobs=4, param_grid={'alpha': range(1, 10)},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=10)

Training with a Random Forest

model_forest = RandomForestRegressor()

param_grid2={'n_estimators':[10,50,100,1000],
             'max_features':range(1, 4)
            }

random_forest=GridSearchCV(estimator=model_forest,param_grid=param_grid2,cv=tscv,n_jobs=10, verbose=10)
random_forest.fit(X_train,y_train)
Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   5 tasks      | elapsed:    1.5s
[Parallel(n_jobs=10)]: Done  12 tasks      | elapsed:    4.1s
[Parallel(n_jobs=10)]: Done  21 out of  36 | elapsed:    7.9s remaining:    5.7s
[Parallel(n_jobs=10)]: Done  25 out of  36 | elapsed:   10.6s remaining:    4.7s
[Parallel(n_jobs=10)]: Done  29 out of  36 | elapsed:   23.0s remaining:    5.6s
[Parallel(n_jobs=10)]: Done  33 out of  36 | elapsed:   35.6s remaining:    3.2s
[Parallel(n_jobs=10)]: Done  36 out of  36 | elapsed:   42.2s finished





GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=3),
             error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=100, n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_jobs=10,
             param_grid={'max_features': range(1, 4),
                         'n_estimators': [10, 50, 100, 1000]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=10)

Training with SVM

model_svm = svm.SVR()

param_grid3 = {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
               'C': [1, 10, 100, 1000]}

# No cv argument here, so the default 5-fold CV is used instead of TimeSeriesSplit
svr = GridSearchCV(estimator=model_svm, n_jobs=10, verbose=10, param_grid=param_grid3)
svr.fit(X_train, y_train)
Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   5 tasks      | elapsed:   48.8s
[Parallel(n_jobs=10)]: Done  12 tasks      | elapsed:  1.6min
[Parallel(n_jobs=10)]: Done  21 tasks      | elapsed:  2.4min
[Parallel(n_jobs=10)]: Done  26 out of  40 | elapsed:  2.4min remaining:  1.3min
[Parallel(n_jobs=10)]: Done  31 out of  40 | elapsed:  3.2min remaining:   56.1s
[Parallel(n_jobs=10)]: Done  36 out of  40 | elapsed:  3.2min remaining:   21.6s
[Parallel(n_jobs=10)]: Done  40 out of  40 | elapsed:  3.3min finished





GridSearchCV(cv=None, error_score=nan,
             estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3,
                           epsilon=0.1, gamma='scale', kernel='rbf',
                           max_iter=-1, shrinking=True, tol=0.001,
                           verbose=False),
             iid='deprecated', n_jobs=10,
             param_grid={'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001],
                         'kernel': ['rbf']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=10)

Training with a Neural Network (MLPRegressor)

from sklearn.neural_network import MLPRegressor

mlp_reg = MLPRegressor(max_iter=1000)
mlp_reg.fit(X_train, y_train)
MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(100,), learning_rate='constant',
             learning_rate_init=0.001, max_fun=15000, max_iter=1000,
             momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
             power_t=0.5, random_state=None, shuffle=True, solver='adam',
             tol=0.0001, validation_fraction=0.1, verbose=False,
             warm_start=False)

Evaluation

y_pred = model.predict(X_test)
print(y_pred)
print(y_test)
[[18.308086]
 [16.099003]
 [43.96459 ]
 ...
 [44.576817]
 [10.564673]
 [30.076843]]
6405     17.479723
10497    19.555973
3669     48.410243
747      24.168746
902      44.950639
           ...    
11278    19.544291
2312      9.208905
6475     47.599270
6440      7.897116
10103    31.978004
Name: speed, Length: 2101, dtype: float64
y_pred2 = xgb.predict(X_test)
# y_pred3 = random_forest.predict(X_test)
# y_pred4 = svr.predict(X_test)
# y_pred5 = tree_reg.predict(X_test)
# y_pred6 = ridge.predict(X_test)
# y_pred7 = mlp_reg.predict(X_test)
MSE_DNN = mean_squared_error(y_pred, y_test)
MSE_rg = mean_squared_error(y_pred2, y_test)
# MSE_rf=mean_squared_error(y_pred3,y_test)
# MSE_tree=mean_squared_error(y_pred5,y_test)
# MSE_svr=mean_squared_error(y_pred4,y_test)
# MSE_ridge = mean_squared_error(y_pred6, y_test)
# MSE_mlp = mean_squared_error(y_pred7, y_test)

print(f"MSE for dnn is: {MSE_DNN}")
print('MSE for XGBoost is '+str(MSE_rg))
# print('MSE for RandomForest is '+str(MSE_rf))
# print(f"MSE for decision is: {MSE_tree}")
# print(f"MSE for svr is: {MSE_svr}")
# print(f"MSE for ridge is: {MSE_ridge}")
# print(f"MSE for neural network is: {MSE_mlp}")
MSE for dnn is: 17.01656963733734
MSE for XGBoost is 14.680223822593183
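
Since the RMSE is on the same scale as the speed values themselves, it is easier to interpret than the MSE; for the XGBoost model:

# RMSE of the XGBoost model on the held-out test split
print(f"RMSE for XGBoost is: {np.sqrt(MSE_rg):.3f}")  # sqrt(14.68) ≈ 3.83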

Based on these observations, we found that XGBoost provides the most accurate results.

Make a prediction

test = pd.read_csv('/content/drive/MyDrive/courses/HKUST/MSBD5001/project/data/individual project/test.csv')
print(test.to_numpy().shape)
(3504, 2)
test['date'] = pd.to_datetime(test['date'])
test = add_holiday_and_weekend(test)
test['time desc']=test['date'].apply(time_desc)
print(test.shape)
(3504, 5)
new_test = merge_weather(test, weather, how='inner')
new_test = new_test.drop_duplicates(['date'], keep='last')
new_test
id date IsWeekend IsHoliday time desc MPressure(hPa) MaxTemp(℃) MinTemp(℃) MCloud(%) TRainfall(mm) #hRedVisi(h) WindDirect(degrees) MWindSpeed(km/h) hour day month year
0 0 2018-01-01 02:00:00 0 1 latenight 1020.5 19.0 16.3 75 0 21 70 28.4 2 1 1 2018
1 1 2018-01-01 05:00:00 0 1 early 1020.5 19.0 16.3 75 0 21 70 28.4 5 1 1 2018
2 2 2018-01-01 07:00:00 0 1 early 1020.5 19.0 16.3 75 0 21 70 28.4 7 1 1 2018
3 3 2018-01-01 08:00:00 0 1 morning 1020.5 19.0 16.3 75 0 21 70 28.4 8 1 1 2018
4 4 2018-01-01 10:00:00 0 1 morning 1020.5 19.0 16.3 75 0 21 70 28.4 10 1 1 2018
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3499 3499 2018-12-31 17:00:00 0 0 afternoon 1027.0 15.6 11.8 77 0 0 360 26.8 17 31 12 2018
3500 3500 2018-12-31 19:00:00 0 0 afternoon 1027.0 15.6 11.8 77 0 0 360 26.8 19 31 12 2018
3501 3501 2018-12-31 21:00:00 0 0 night 1027.0 15.6 11.8 77 0 0 360 26.8 21 31 12 2018
3502 3502 2018-12-31 22:00:00 0 0 night 1027.0 15.6 11.8 77 0 0 360 26.8 22 31 12 2018
3503 3503 2018-12-31 23:00:00 0 0 night 1027.0 15.6 11.8 77 0 0 360 26.8 23 31 12 2018

3504 rows × 17 columns

# Use transform (not fit_transform) so the test set is scaled and
# encoded with the statistics learned from the training set
transformed_data = preprocessor.transform(new_test)
print(transformed_data.shape)
(3504, 21)
pred = xgb.predict(transformed_data)
print(pred.shape)
(3504,)
submission = test.copy()
submission['speed'] = pred
submission = submission[['id', 'speed']]
# Write submission to file

submission.to_csv('submission_20720230.csv', index=False)