Multi-Dataset Time Series Modeling - Stationarity, Seasonality, and Forecasting Performance

Jun 25, 2024 | python data-science data-visualization data-analysis financial-modelling time-series pandas statsmodels forecasting monte-carlo-sim


A comprehensive comparative study of time series forecasting methods across six diverse datasets (Gold, S&P 500, FTSE, USDC/USDT, StackOverflow questions, and cinema tickets). This analysis evaluates 10+ forecasting techniques—from naive baselines to ARIMA/SARIMA models—with quantitative performance metrics (MSE, RMSE, MAE), explores stationarity testing, seasonality decomposition, Monte Carlo simulation, and implements a simulated trading strategy.

Abstract

This project presents a comprehensive comparative analysis of time series forecasting methodologies applied to six diverse datasets: financial markets (Gold, S&P 500, FTSE), cryptocurrency (USDC/USDT), and non-financial time series (StackOverflow questions, cinema ticket sales). The study systematically evaluates forecasting performance across multiple algorithmic approaches, from simple baseline methods to advanced stochastic models.

Methodology

Data preprocessing included stationarity testing using KPSS and Augmented Dickey-Fuller (ADF) tests, with first-order differencing applied to achieve stationarity across all datasets. Seasonality was investigated through manual decomposition, Fast Fourier Transform (FFT) analysis, and visual inspection of ACF/PACF plots. Ten forecasting techniques were implemented and compared: Naive Forecasting, Seasonal Forecasting, Average Forecasting, Average Difference Forecasting, Autoregression (AR), ARIMA with varying orders, SARIMA with multiple seasonal configurations, and a manual EMA-based ARIMA approach. Advanced applications included Monte Carlo simulation for uncertainty quantification and a simulated trading strategy using AR predictions.

Key Findings

Performance varied significantly by dataset characteristics. For financial data (Gold, S&P 500, FTSE, USDC/USDT close prices), Naive Forecasting consistently achieved the lowest error metrics, suggesting these series exhibit random walk behavior where the best prediction is simply the last observed value. For datasets with clear seasonal patterns, more sophisticated models excelled: SARIMA models achieved superior performance on cinema ticket sales, and Seasonal Forecasting performed best for USDC/USDT tradecount. The analysis revealed that first-order differencing (I=1) was necessary and sufficient for achieving stationarity across all datasets, while second-order differencing generally degraded performance.

The study demonstrates that model selection should be guided by dataset characteristics rather than model complexity. For random-walk financial series, simple methods often outperform sophisticated models, while seasonal data benefits from ARIMA/SARIMA frameworks. The Monte Carlo simulations successfully generated probabilistic forecasts, though the trading strategy implementation revealed the challenges of translating statistical predictions into profitable trades. FFT analysis confirmed strong weekly seasonality in ticket sales but minimal seasonal components in financial markets, validating the differential performance of seasonal models across datasets.

Library Initialization

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import os
import math

Datasets

For this assignment, I chose six datasets. Four of them are provided on the UOL website: Gold, S&P 500, FTSE and USDCUSDT. The other two were found on www.kaggle.com: one tracks the types of questions asked on StackOverflow over time, and the other covers cinema tickets.

Gold, S&P 500, FTSE Dataset

We will explore the columns of the Gold, S&P 500 and FTSE datasets and plot them.

sp500_data = pd.read_csv('Datasets/SP 500 04072014 2011.csv')
sp500_data.head()
Unnamed: 0 Date Time Bar# Bar Index Tick Range Open High Low Close
0 04/04/2014 21:06:00 501724/501724 0 0 1865.09 1865.09 1865.09 1865.09
1 04/04/2014 21:04:00 501723/501724 -1 0 1865.1 1865.1 1865.1 1865.1
2 04/04/2014 21:03:00 501722/501724 -2 2 1865.13 1865.13 1865.11 1865.11
3 04/04/2014 21:02:00 501721/501724 -3 4 1865.18 1865.18 1865.14 1865.14
4 04/04/2014 21:01:00 501720/501724 -4 6 1865.26 1865.26 1865.2 1865.2
sp500_data['Date']
0         04/04/2014
1         04/04/2014
2         04/04/2014
3         04/04/2014
4         04/04/2014
             ...
501719    11/03/2009
501720    11/03/2009
501721    11/03/2009
501722    11/03/2009
501723    11/03/2009
Name: Date, Length: 501724, dtype: object
sp500_data.describe()
Unnamed: 0 Bar Index Tick Range Open High Low Close
count 501724 501724 501724 501724 501724 501724
mean -250862 34.4442 1320.38 1320.55 1320.2 1320.38
std 144835 39.3128 257.653 257.644 257.662 257.653
min -501723 0 713.85 714.06 713.85 714.02
25% -376292 12 1124.89 1125.04 1124.73 1124.88
50% -250862 24 1306.71 1306.86 1306.54 1306.69
75% -125431 44 1460.38 1460.54 1460.26 1460.39
max 0 1833 1896.97 1897.28 1896.36 1897.04
gold_data = pd.read_csv('Datasets/Gold 07042014 2011.csv')
gold_data.head()
Unnamed: 0 Date Time Bar# Bar Index Tick Range Open High Low Close
0 07/04/2014 13:54:00 886801/886801 0 80 1298.6 1299 1298.2 1298.5
1 07/04/2014 13:53:00 886800/886801 -1 69 1299.04 1299.29 1298.6 1298.6
2 07/04/2014 13:52:00 886799/886801 -2 39 1299.07 1299.29 1298.9 1298.9
3 07/04/2014 13:51:00 886798/886801 -3 31 1299 1299.21 1298.9 1299.09
4 07/04/2014 13:50:00 886797/886801 -4 52 1298.8 1299.22 1298.7 1299.17
gold_data['Date']
0         07/04/2014
1         07/04/2014
2         07/04/2014
3         07/04/2014
4         07/04/2014
             ...
886796    26/10/2011
886797    26/10/2011
886798    26/10/2011
886799    26/10/2011
886800    26/10/2011
Name: Date, Length: 886801, dtype: object
gold_data.describe()
Unnamed: 0 Bar Index Tick Range Open High Low Close
count 886801 886801 886801 886801 886801 886801
mean -443400 89.6254 1525.99 1526.35 1525.46 1525.96
std 255998 50.5657 180.892 180.869 180.847 180.901
min -886800 0 1180.5 1183.7 1179.83 1180.85
25% -665100 60 1334.7 1335.05 1334.23 1334.66
50% -443400 80 1591.85 1592.19 1591.3 1591.83
75% -221700 106 1676.32 1676.67 1675.8 1676.3
max 0 4227 1802.52 1802.92 1802.2 1802.43
ftse_data = pd.read_csv('Datasets/FTSE 04072014 2011.csv')

ftse_data.head()
Unnamed: 0 Date Time Bar# Bar Index Tick Range Open High Low Close
0 07/04/2014 14:23:00 408806/408806 0 2 6642.94 6642.94 6642.92 6642.93
1 07/04/2014 14:22:00 408805/408806 -1 51 6643.44 6643.45 6642.94 6642.96
2 07/04/2014 14:21:00 408804/408806 -2 84 6643.27 6644.06 6643.22 6643.45
3 07/04/2014 14:20:00 408803/408806 -3 44 6643.39 6643.69 6643.25 6643.29
4 07/04/2014 14:19:00 408802/408806 -4 92 6643.1 6643.83 6642.91 6643.38
ftse_data['Date']
0         07/04/2014
1         07/04/2014
2         07/04/2014
3         07/04/2014
4         07/04/2014
             ...
408801    01/02/2011
408802    01/02/2011
408803    01/02/2011
408804    01/02/2011
408805    01/02/2011
Name: Date, Length: 408806, dtype: object
ftse_data.describe()
Unnamed: 0 Bar Index Tick Range Open High Low Close
count 408806 408806 408806 408806 408806 408806
mean -204402 178.419 6025.72 6026.61 6024.83 6025.72
std 118012 211.696 459.01 458.806 459.218 459.01
min -408805 0 4795.14 4798.8 4791.01 4795.12
25% -306604 78 5715.06 5715.99 5714.18 5715.07
50% -204402 130 5927.6 5928.33 5926.89 5927.61
75% -102201 218 6462.52 6463.37 6461.66 6462.52
max 0 13825 6873.93 6875.62 6871.38 6873.75

A quick load followed by the pandas .head() function gives us a quick view of all three datasets. They share the same columns, with date and time stored separately. The meaning of 'Bar#' and 'Bar Index' is unclear, but the tick range could be the difference between the high and low columns.

gold_data['High_Low_Difference'] = gold_data['High'] - gold_data['Low']
gold_data[['Tick Range','High_Low_Difference']].head()
Unnamed: 0 Tick Range High_Low_Difference
0 80 0.8
1 69 0.69
2 39 0.39
3 31 0.31
4 52 0.52

Based on the above result, the tick range is the difference between the high and low columns multiplied by 100 (for example, 0.8 × 100 = 80). It is not yet clear whether we will need this information later.
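The scale factor can be checked directly rather than read off by eye. The sketch below is a minimal illustration that rebuilds the first five Gold rows printed above by hand (the `sample` frame and the `scale_is_100` name are assumptions made for this example, not part of the notebook) and computes the ratio between the tick range and the high-low spread.

```python
import numpy as np
import pandas as pd

# First five Gold rows, copied from the .head() output above.
sample = pd.DataFrame({
    'Tick Range': [80, 69, 39, 31, 52],
    'High': [1299.00, 1299.29, 1299.29, 1299.21, 1299.22],
    'Low': [1298.20, 1298.60, 1298.90, 1298.90, 1298.70],
})

# A constant ratio between Tick Range and (High - Low) reveals the scale factor.
spread = sample['High'] - sample['Low']
ratio = sample['Tick Range'] / spread

# Allow for floating-point noise in the subtraction.
scale_is_100 = bool(np.allclose(ratio, 100, rtol=1e-3))
print(ratio.round(2).tolist(), scale_is_100)
```

On the full dataset the same ratio could be computed on `gold_data` directly, with rows where High equals Low excluded to avoid dividing by zero.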

We will now plot the columns open, close, high and low.

# Work on a copy so the exploratory Datetime column is not written into gold_data
gold_data_test = gold_data.copy()
gold_data_test['Datetime'] = pd.to_datetime(
    gold_data_test['Date'] + ' ' + gold_data_test['Time'],
    format='%d/%m/%Y %H:%M:%S'
)

plt.plot(gold_data_test['Datetime'][:100], gold_data_test['Open'][:100])
plt.plot(gold_data_test['Datetime'][:100], gold_data_test['Close'][:100])
plt.plot(gold_data_test['Datetime'][:100], gold_data_test['High'][:100])
plt.plot(gold_data_test['Datetime'][:100], gold_data_test['Low'][:100])
plt.legend(['Open', 'Close', 'High', 'Low'])
Visualization

As shown above, the 'High' column gives the highest price reached within the given minute and the 'Low' column gives the lowest. The high and low prices should be useful for other models, but for now we will use only the close price.

def plot_gold_sp_ftse(data, title):
    data['Datetime'] = pd.to_datetime(
        data['Date'] + ' ' + data['Time'],
        format='%d/%m/%Y %H:%M:%S'
    )

    plt.plot(data['Datetime'], data['Close'])
    plt.title(title)
    plt.show()

plt.rcParams['figure.figsize'] = [7, 5]
plot_gold_sp_ftse(gold_data, 'Gold')
plot_gold_sp_ftse(sp500_data, 'S&P 500')
plot_gold_sp_ftse(ftse_data, 'FTSE')
Visualization
Visualization
Visualization

The above are the normal graphs for

  • Gold price, from Oct 2011 - Apr 2014

  • S&P500 price, from Nov 2009 - Apr 2014

  • FTSE price, from Feb 2011 - Apr 2014.

We will try to explore the relationship between them. At a quick glance, Gold and S&P 500 appear to be inversely related, while FTSE appears somewhat correlated with S&P 500.
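One way to put a number on this visual impression is to compare daily returns on the dates the series share. The sketch below is only an illustration: `daily_close` and `close_correlation` are hypothetical helpers, the choice of the last bar of each day and of log returns is an assumption, and the demonstration runs on synthetic minute bars rather than on the real files.

```python
import numpy as np
import pandas as pd

def daily_close(data, name):
    """Collapse minute bars to one close per calendar day (last bar of each day)."""
    series = data.set_index('Datetime')['Close'].sort_index()
    return series.resample('D').last().dropna().rename(name)

def close_correlation(frames):
    """Correlation of daily log returns on the dates shared by all series."""
    daily = pd.concat(
        [daily_close(df, name) for name, df in frames.items()],
        axis=1, join='inner',
    )
    return np.log(daily).diff().dropna().corr()

# Synthetic stand-in: two series driven by opposite shocks (not real market data).
rng = np.random.default_rng(0)
idx = pd.date_range('2014-03-01', periods=30 * 24 * 60, freq='min')
shock = rng.normal(0, 0.001, len(idx))
noise = rng.normal(0, 0.0002, len(idx))
a = pd.DataFrame({'Datetime': idx, 'Close': 100 * np.exp(np.cumsum(shock))})
b = pd.DataFrame({'Datetime': idx, 'Close': 50 * np.exp(np.cumsum(-shock + noise))})

corr = close_correlation({'A': a, 'B': b})
print(corr.round(2))
```

With the real data, the call would look like `close_correlation({'Gold': gold_data, 'S&P 500': sp500_data, 'FTSE': ftse_data})` once each frame has its Datetime column. Correlating returns rather than raw prices avoids the spurious correlation that two trending price series tend to show.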

USDCUSDT Dataset

Next we will be looking into the USDCUSDT Dataset.

After using both methods, the result remains quite undetermined.

usdcusdt_data = pd.read_csv('Datasets/USDC-USDT.csv')

usdcusdt_data.head()
Unnamed: 0 unix date symbol open high low close Volume USDC Volume USDT tradecount
0 1635552000000 2021-10-30 00:00:00 USDC/USDT 1 1 0.9999 0.9999 740710 740678 277
1 1635465600000 2021-10-29 00:00:00 USDC/USDT 0.9995 1 0.9994 1 1.46553e+08 1.46532e+08 43366
2 1635379200000 2021-10-28 00:00:00 USDC/USDT 0.9994 0.9997 0.999 0.9995 2.97686e+08 2.97482e+08 58314
3 1635292800000 2021-10-27 00:00:00 USDC/USDT 0.9997 1 0.9976 0.9994 2.93017e+08 2.92868e+08 67437
4 1635206400000 2021-10-26 00:00:00 USDC/USDT 1.0003 1.0008 0.9993 0.9998 2.10106e+08 2.10147e+08 56204
usdcusdt_data['date']
0       2021-10-30 00:00:00
1       2021-10-29 00:00:00
2       2021-10-28 00:00:00
3       2021-10-27 00:00:00
4       2021-10-26 00:00:00
               ...
1046    2018-12-19 00:00:00
1047    2018-12-18 00:00:00
1048    2018-12-17 00:00:00
1049    2018-12-16 00:00:00
1050    2018-12-15 00:00:00
Name: date, Length: 1051, dtype: object
usdcusdt_data.describe()
Unnamed: 0 unix open high low close Volume USDC Volume USDT tradecount
count 1051 1051 1051 1051 1051 1051 1051 1051
mean 1.59019e+12 1.00001 1.011 0.998193 0.999991 5.50997e+07 5.50802e+07 29534.8
std 2.6226e+10 0.003651 0.281 0.004549 0.003629 7.75637e+07 7.75237e+07 21174.2
min 1.54483e+12 0.987 0.9918 0.9367 0.9866 542413 548254 277
25% 1.56751e+12 0.999 1 0.9975 0.999 7.135e+06 7.13665e+06 14527
50% 1.59019e+12 0.9997 1.0006 0.999 0.9997 1.47944e+07 1.47597e+07 24644
75% 1.61287e+12 1.0003 1.0022 0.9997 1.0003 9.38658e+07 9.38618e+07 39083
max 1.63555e+12 1.0318 10.11 1.0238 1.0318 7.76166e+08 7.75546e+08 196022

Based on the descriptions above, either unix or date can serve as the datetime column, as both provide the same datetime information. The symbol column is not useful in this context. Among the price columns, we will use the close price. The Volume USDC, Volume USDT and tradecount columns would certainly benefit other models, so let's explore how they look when plotted.
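The claim that unix and date carry the same information can be verified with a few rows. The sketch below copies the first three unix and date values from the .head() output above and assumes the unix column is in milliseconds, which the magnitude of the values suggests.

```python
import pandas as pd

# First three rows of the 'unix' and 'date' columns, copied from the output above.
unix_ms = pd.Series([1635552000000, 1635465600000, 1635379200000])
dates = pd.to_datetime(pd.Series([
    '2021-10-30 00:00:00',
    '2021-10-29 00:00:00',
    '2021-10-28 00:00:00',
]))

# Interpret the unix column as milliseconds since the epoch.
from_unix = pd.to_datetime(unix_ms, unit='ms')

columns_agree = bool((from_unix == dates).all())
print(columns_agree)
```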

usdcusdt_explore_columns = ['close', 'Volume USDC', 'Volume USDT', 'tradecount']
usdcusdt_data['date'] = pd.to_datetime(usdcusdt_data['date'])

for column in usdcusdt_explore_columns:
    plt.plot(usdcusdt_data['date'], usdcusdt_data[column])
    plt.title(column)
    plt.show()
Visualization
Visualization
Visualization
Visualization

With the columns visualized, we can easily pick any of them for further analysis when required.

For the 'close' price column, the variance is very high at the start and then decays roughly exponentially, while the mean appears to stay about the same throughout.

Both volume columns start at very low amounts and then increase, so the whole later portion of the graph shifts upwards along with the mean.

The tradecount column seems to have some seasonality, with some form of repeating waves.
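The observation about the close price (shrinking variance around a roughly constant mean) can be made measurable with rolling statistics. The sketch below is an illustration on synthetic data: `rolling_profile` is a hypothetical helper, the 30-day window is an arbitrary assumption, and the decaying noise only mimics the shape described above.

```python
import numpy as np
import pandas as pd

def rolling_profile(series, window=30):
    """Rolling mean and standard deviation of a date-indexed series."""
    ordered = series.sort_index()  # the raw file is stored newest-first
    return pd.DataFrame({
        'mean': ordered.rolling(window).mean(),
        'std': ordered.rolling(window).std(),
    }).dropna()

# Synthetic stand-in: noise around 1.0 whose amplitude decays exponentially.
rng = np.random.default_rng(2)
idx = pd.date_range('2018-12-15', periods=400, freq='D')
noise_scale = 0.01 * np.exp(-np.arange(400) / 100)
price = pd.Series(1 + rng.normal(0, 1, 400) * noise_scale, index=idx)

profile = rolling_profile(price)
early_std = profile['std'].iloc[0]
late_std = profile['std'].iloc[-1]
print(early_std > late_std)
```

For the real data, `usdcusdt_data.set_index('date')['close']` could be passed in once the date column has been converted with `pd.to_datetime`.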

Cinema Ticket Dataset

Although this dataset has a time component and is a time series, this is not immediately apparent, because each row records not only a date but also the ticket sale status of a specific cinema. Therefore, we first need to do some pre-processing.

cinemaTicket_data = pd.read_csv('Datasets/cinemaTicket_Ref.csv')

cinemaTicket_data.head()
Unnamed: 0 film_code cinema_code total_sales tickets_sold tickets_out show_time occu_perc ticket_price ticket_use capacity date month quarter day
0 1492 304 3900000 26 0 4 4.26 150000 26 610.329 2018-05-05 5 2 5
1 1492 352 3360000 42 0 5 8.08 80000 42 519.802 2018-05-05 5 2 5
2 1492 489 2560000 32 0 4 20 80000 32 160 2018-05-05 5 2 5
3 1492 429 1200000 12 0 1 11.01 100000 12 108.992 2018-05-05 5 2 5
4 1492 524 1200000 15 0 3 16.67 80000 15 89.982 2018-05-05 5 2 5

Let's say that for this project we would like to predict the best time to increase the discount, so that it encourages ticket sales during off-peak hours. We first need to explore which columns best support this analysis.

cinemaTicket_data.columns
Index(['film_code', 'cinema_code', 'total_sales', 'tickets_sold',
       'tickets_out', 'show_time', 'occu_perc', 'ticket_price', 'ticket_use',
       'capacity', 'date', 'month', 'quarter', 'day'],
      dtype='object')
cinemaTicket_data.describe()
Unnamed: 0 film_code cinema_code total_sales tickets_sold tickets_out show_time occu_perc ticket_price ticket_use capacity month quarter day
count 142524 142524 142524 142524 142524 142524 142399 142524 142524 142399 142524 142524 142524
mean 1518.99 320.378 1.23473e+07 140.138 0.237413 3.9321 19.966 81234.6 139.9 854.724 6.77685 2.63472 16.1126
std 36.1844 159.701 3.06549e+07 279.759 2.92321 3.05628 22.6534 33236.6 279.565 953.118 2.19584 0.809692 8.94947
min 1471 32 20000 1 0 1 0 483.871 -219 -2 2 1 1
25% 1485 181 1.26e+06 18 0 2 3.75 60000 18 276.994 5 2 8
50% 1498 324 3.72e+06 50 0 3 10.35 79454.2 50 525.714 7 3 16
75% 1556 474 1.11e+07 143 0 5 28.21 100000 143 1038.96 9 3 24
max 1589 637 1.26282e+09 8499 311 60 147.5 700000 8499 9692.1 11 4 31

Inspecting the columns along with the .head() and .describe() output, we note that both a film_code and a cinema_code are recorded. While these details might be useful for other analyses, we will explore only tickets_sold, which should be good enough to indicate the performance of a given cinema. Since there is also a date column, we can use it as the datetime column. There are other time-related columns as well, such as month, quarter and day. These might likewise be useful for other analyses, but this project uses only the date and tickets_sold columns.

Since the cinema codes identify individual cinemas, let's explore this aspect as well.

cinemaTicket_data['Datetime'] = pd.to_datetime(cinemaTicket_data['date'])
cinemaTicket_data
Unnamed: 0 film_code cinema_code total_sales tickets_sold tickets_out show_time occu_perc ticket_price ticket_use capacity date month quarter day Datetime
0 1492 304 3900000 26 0 4 4.26 150000.0 26 610.328638 2018-05-05 5 2 5 2018-05-05
1 1492 352 3360000 42 0 5 8.08 80000.0 42 519.801980 2018-05-05 5 2 5 2018-05-05
2 1492 489 2560000 32 0 4 20.00 80000.0 32 160.000000 2018-05-05 5 2 5 2018-05-05
3 1492 429 1200000 12 0 1 11.01 100000.0 12 108.991826 2018-05-05 5 2 5 2018-05-05
4 1492 524 1200000 15 0 3 16.67 80000.0 15 89.982004 2018-05-05 5 2 5 2018-05-05
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
142519 1569 495 1320000 22 0 2 3.86 60000.0 22 569.948187 2018-11-04 11 4 4 2018-11-04
142520 1569 474 1200000 15 0 1 65.22 80000.0 15 22.999080 2018-11-04 11 4 4 2018-11-04
142521 1569 524 1060000 8 0 3 9.20 132500.0 8 86.956522 2018-11-04 11 4 4 2018-11-04
142522 1569 529 600000 5 0 2 5.00 120000.0 5 100.000000 2018-11-04 11 4 4 2018-11-04
142523 1569 486 250000 5 0 1 1.79 50000.0 5 279.329609 2018-11-04 11 4 4 2018-11-04
cinemaTicket_data['show_time'].plot()
Visualization
cinemaTicket_data['show_time'].hist()
Visualization

After checking the show_time column, we cannot conclude what it represents. At first glance it looked as if it might indicate the hour of the showing, which would have let us find the hours with lower sales. That does not appear to be the case, so we will look only at which days have the highest and lowest sales, and see whether we can identify any overall trends.
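Finding the days with the highest and lowest sales amounts to a group-by on the day of the week. The sketch below is one possible way to do it, not the notebook's own method: `sales_by_weekday` is a hypothetical helper, and the demonstration uses a small synthetic frame with the same Datetime and tickets_sold column names.

```python
import numpy as np
import pandas as pd

def sales_by_weekday(data, date_col='Datetime', value_col='tickets_sold'):
    """Average daily total of value_col for each day of the week, highest first."""
    daily_total = data.groupby(date_col)[value_col].sum()
    by_weekday = daily_total.groupby(daily_total.index.day_name()).mean()
    return by_weekday.sort_values(ascending=False)

# Synthetic stand-in: two rows (cinemas) per day, with a Friday peak and a Monday dip.
rng = np.random.default_rng(1)
days = pd.date_range('2018-03-05', periods=56, freq='D').repeat(2)
weekday = days.dayofweek  # Monday = 0 ... Sunday = 6
tickets = 100 + 50 * (weekday == 4) - 30 * (weekday == 0) + rng.integers(-5, 6, len(days))
demo = pd.DataFrame({'Datetime': days, 'tickets_sold': tickets})

ranking = sales_by_weekday(demo)
print(ranking.round(1))
```

With the real data, `sales_by_weekday(cinemaTicket_data)` would give the corresponding ranking once the Datetime column has been created.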

# list of unique cinema
unique_cinemas = cinemaTicket_data['cinema_code'].unique()
unique_cinemas
array([304, 352, 489, 429, 524,  71, 163, 450,  51, 522,  43, 529,  82,
       344,  73, 485, 518, 448, 431,  72, 144, 456, 238, 312, 168, 254,
       214, 474, 445, 300, 362, 324, 452, 291, 479, 210, 428, 277,  56,
       253, 243,  39, 310, 201, 457, 191, 532, 167, 266, 204, 380, 169,
       162, 506, 537, 513, 467, 509, 165, 262, 486, 198, 508, 222,  98,
       230, 156, 181, 141, 528,  94, 350, 442, 556, 475, 142,  35,  89,
        34, 225, 182, 396,  50,  61, 487,  88, 338, 417, 194,  57, 285,
       187, 159, 184,  81, 207, 339, 326, 531, 505, 492, 299, 507, 316,
       333, 172, 526, 414, 115, 468, 490, 441, 430, 472, 511, 480, 470,
       496, 466, 381, 368, 498, 195, 546, 516, 425, 488, 535, 196, 453,
       321, 152, 390, 166, 247, 454, 464, 499, 460, 251, 481, 315, 307,
       120, 250, 533, 221, 248, 313, 164,  70, 180, 160, 495, 314, 415,
       174, 259, 471, 245,  83,  91, 365, 359, 286,  64, 426, 237, 536,
       397, 476, 503, 491, 517,  55, 170, 175,  62, 539, 541, 540, 484,
       514, 548, 432, 501, 447, 186, 477, 331, 515,  48,  33, 185, 155,
       455, 461, 534, 273, 504, 249,  52, 482, 323, 451, 341, 306, 413,
       292, 463, 223, 497, 525, 427, 231, 519, 183,  32, 553, 271, 276,
       154, 521, 502, 402, 158, 473, 465, 561, 328, 558, 562, 424, 512,
       560, 458, 637, 520, 557, 555, 543, 542, 565, 215, 376, 566],
      dtype=int64)
plt.rcParams['figure.figsize'] = [20, 12]
for cinema in unique_cinemas:

    temp = cinemaTicket_data[cinemaTicket_data['cinema_code'] == cinema]
    temp = temp[['Datetime','tickets_sold']]
    temp = temp.groupby(['Datetime']).sum()

    plt.plot(temp.index,temp['tickets_sold'])

# the legend list will be too long
# plt.legend(unique_cinemas)
Visualization

Plotting the total daily sales for every cinema does not seem like a good idea, especially because the cinemas sell very different numbers of tickets.

However, we can see that there is some form of repeating pattern, trend or seasonality in the data. Let's plot only 10 cinemas to make this visible. Later, we will take the average sales on a given day and use that as the cinema tickets dataset going forward.

plt.rcParams['figure.figsize'] = [20, 12]

for cinema in unique_cinemas[:10]:

    temp = cinemaTicket_data[cinemaTicket_data['cinema_code'] == cinema]
    temp = temp[['Datetime','tickets_sold']]
    temp = temp.groupby(['Datetime']).sum()

    plt.plot(temp.index,temp['tickets_sold'],label=cinema)
    plt.legend()
Visualization

Using only 10 cinemas, the patterns are now more obvious.

ticketSales_data = cinemaTicket_data[['Datetime', 'tickets_sold']]
ticketSales_data = ticketSales_data.groupby('Datetime').mean()
ticketSales_data
Datetime tickets_sold
2018-02-21 267.000000
2018-02-23 3.000000
2018-03-14 125.650000
2018-03-15 178.675325
2018-03-16 175.461017
... ...
2018-10-31 70.583704
2018-11-01 113.653521
2018-11-02 126.824561
2018-11-03 75.431177
2018-11-04 49.894737
ticketSales_data.plot()
Visualization

It seems that the first two entries were not properly maintained, or they could be outliers. Therefore, we will drop the first two entries.
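Besides their unusual values, the first two entries are also separated from the rest of the series in time. The sketch below rebuilds the first five dates from the output above and measures the gap between consecutive observations (the variable names are chosen for this example).

```python
import pandas as pd

# First five index values of ticketSales_data, copied from the output above.
dates = pd.to_datetime([
    '2018-02-21', '2018-02-23', '2018-03-14', '2018-03-15', '2018-03-16',
])

# Time elapsed since the previous observation; the first entry has no predecessor.
gaps = pd.Series(dates).diff()

largest_gap = gaps.max()
first_regular = dates[gaps.idxmax()]  # first date after the largest gap
print(largest_gap, first_regular.date())
```

The 19-day gap before 2018-03-14 supports starting the series from that date.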

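# Drop the first two rows; copy so that new columns can be added later without warnings
ticketSales_data = ticketSales_data.iloc[2:].copy()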
ticketSales_data.plot()
Visualization

Stackoverflow Dataset

This dataset is a little more direct: each column represents a topic being asked about. For this project, we look at the questions asked on the site over time, from 2009 to 2019.

mlStackoverflow_data = pd.read_csv('Datasets/MLTollsStackOverflow.csv')

mlStackoverflow_data.head()
Unnamed: 0 month nltk spacy stanford-nlp python r numpy scipy matlab machine-learning ... Plato Sympy Flair stanford-nlp.1 pyqt Nolearn Lasagne OCR Apache-spark-mlib azure-virtual-machine
0 09-Jan 0 0 0 631 8 6 2 19 8 ... 0 1 0 0 5 0 0 5 0 0
1 09-Feb 1 0 0 633 9 7 3 27 4 ... 0 0 0 0 5 0 0 11 0 0
2 09-Mar 0 0 0 766 4 4 2 24 3 ... 0 0 0 0 7 0 0 2 0 0
3 09-Apr 0 0 0 768 12 6 3 32 10 ... 0 0 0 0 11 0 0 5 0 0
4 09-May 1 0 0 1003 2 7 2 42 7 ... 0 0 0 0 10 0 0 3 0 0
mlStackoverflow_data.columns
Index(['month', 'nltk', 'spacy', 'stanford-nlp', 'python', 'r', 'numpy',
       'scipy', 'matlab', 'machine-learning', 'pandas', 'pytorch', 'keras',
       'nlp', 'apache-spark', 'hadoop', 'pyspark', 'python-3.x', 'tensorflow',
       'deep-learning', 'neural-network', 'lstm', 'time-series', 'pillow',
       'rasa', 'opencv', 'pipenv', 'seaborn', 'Dask', 'jupyter', 'AllenNLP',
       'Theano', 'plotly', 'scikit-learn', 'BeautifulSoup', 'scrapy', 'Gensim',
       'FastText', 'Pydot', 'Pybrain', 'Pytil', 'Pygame', 'Colab', 'Shogun',
       'KNIME', 'Apache', 'Gunicorn', 'Pygtk', 'Weka', 'Conda', 'Ray',
       'matlab.1', 'accord.net', 'regression', 'classification', 'correlation',
       'cluster-analysis', 'H2o', 'Mallet', 'Numba', 'Tableau', 'Trifacta',
       'PyArrow', 'Rasterio', 'Orange3', 'PyMC3', 'Opennn', 'Oryx', 'Istio',
       'Venes', 'Plotnine', 'Gluon', 'Plato', 'Sympy', 'Flair',
       'stanford-nlp.1', 'pyqt', 'Nolearn', 'Lasagne', 'OCR',
       'Apache-spark-mlib', 'azure-virtual-machine'],
      dtype='object')
mlStackoverflow_data['month']
0      09-Jan
1      09-Feb
2      09-Mar
3      09-Apr
4      09-May
        ...
127    19-Aug
128    19-Sep
129    19-Oct
130    19-Nov
131    19-Dec
Name: month, Length: 132, dtype: object

Questions on machine learning topics should have increased over these 10 years, especially given the recent explosion of interest in generative AI. Let's look at some relevant topics over the years.

mlStackoverflow_data['Datetime'] = pd.to_datetime(mlStackoverflow_data['month'], format='%y-%b')
mlStackoverflow_data.head()
Unnamed: 0 month nltk spacy stanford-nlp python r numpy scipy matlab machine-learning ... Sympy Flair stanford-nlp.1 pyqt Nolearn Lasagne OCR Apache-spark-mlib azure-virtual-machine Datetime
0 09-Jan 0 0 0 631 8 6 2 19 8 ... 1 0 0 5 0 0 5 0 0 2009-01-01
1 09-Feb 1 0 0 633 9 7 3 27 4 ... 0 0 0 5 0 0 11 0 0 2009-02-01
2 09-Mar 0 0 0 766 4 4 2 24 3 ... 0 0 0 7 0 0 2 0 0 2009-03-01
3 09-Apr 0 0 0 768 12 6 3 32 10 ... 0 0 0 11 0 0 5 0 0 2009-04-01
4 09-May 1 0 0 1003 2 7 2 42 7 ... 0 0 0 10 0 0 3 0 0 2009-05-01
relevant_topics = ['python', 'machine-learning', 'deep-learning', 'time-series', 'regression', 'pytorch', 'tensorflow', 'classification']

plt.rcParams['figure.figsize'] = [7,5]

for topic in relevant_topics:
    plt.plot(mlStackoverflow_data['Datetime'], mlStackoverflow_data[topic])
    plt.scatter(mlStackoverflow_data['Datetime'], mlStackoverflow_data[topic], s=10)
    plt.title(topic)
    plt.show()
Visualization
Visualization
Visualization
Visualization
Visualization
Visualization
Visualization
Visualization

Each of them shows different properties in its graph. For example, pytorch and tensorflow only appear around 2016, but judging by the slope they grow very quickly afterwards.

Multiplicative or Additive - Visual Identification

We will examine some of the datasets to see whether they are multiplicative or additive: the gold dataset, the python and machine-learning topics in the StackOverflow dataset, and the cinema dataset.

Based on the textbook that this course is based on, an additive time series is the sum of a seasonal component, a trend component and a residual component, while a multiplicative one is the product of all three, as shown in the equations below:

  • Additive:
    $y_{t} = S_{t} + T_{t} + R_{t}$

  • Multiplicative:
    $y_{t} = S_{t} \times T_{t} \times R_{t}$

Hence, to truly find out, we can decompose the time series into the components mentioned: seasonality, trend and residual.

However, the course lectures also mention that we can identify the type from the dispersion and shape of the differenced dataset, as shown below.
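The visual check can be complemented by a single number: if the size of the period-to-period change grows with the level of the series, the series behaves multiplicatively. The sketch below is a heuristic, not the course's method: `level_dispersion_rank_corr` is a hypothetical helper that computes a Spearman-style rank correlation between the level and the absolute difference, shown on two synthetic series.

```python
import numpy as np
import pandas as pd

def level_dispersion_rank_corr(series):
    """Rank correlation between the level and the absolute one-step difference."""
    s = pd.Series(np.asarray(series, dtype=float))
    pairs = pd.DataFrame({'level': s, 'abs_diff': s.diff().abs()}).dropna()
    # Pearson correlation of ranks is the Spearman rank correlation.
    return pairs.rank().corr().loc['level', 'abs_diff']

# Synthetic stand-ins: constant-size noise vs noise proportional to the level.
rng = np.random.default_rng(5)
t = np.arange(500)
additive_like = 100 + 0.5 * t + rng.normal(0, 1, 500)
multiplicative_like = np.exp(0.01 * t) * (1 + rng.normal(0, 0.1, 500))

add_corr = level_dispersion_rank_corr(additive_like)
mult_corr = level_dispersion_rank_corr(multiplicative_like)
print(round(add_corr, 2), round(mult_corr, 2))
```

A value near 0 matches a scatter plot with an even band of y values, while a clearly positive value matches the fan shape that widens as x increases.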

Gold Dataset - Close Price

gold_data['Close'].plot()
Visualization
gold_data['Diff'] = gold_data['Close'] - gold_data['Close'].shift()
gold_data.plot.scatter(x = 'Close', y = 'Diff')
Visualization

As observed, regardless of the value of x, the spread of y stays very consistent. The size of the price changes therefore does not depend on the price level, which indicates an additive series.

Stackoverflow Dataset - Python

While the chart appears to increase linearly over time, checking for additive versus multiplicative behaviour shows that it is multiplicative, as seen in the scatter plot below: at low x values there is very little change, while at the other end the x values show larger changes.

mlStackoverflow_data['python'].plot()
Visualization
mlStackoverflow_data['python-diff'] = mlStackoverflow_data['python'] - mlStackoverflow_data['python'].shift()
mlStackoverflow_data.plot.scatter(x='python', y = 'python-diff')
Visualization

This is a multiplicative dataset because the dispersion of the y values clearly grows as x increases.

mlStackoverflow_data['logged_python'] = np.log(mlStackoverflow_data['python'])
mlStackoverflow_data['logged_python'].plot()
Visualization

Even after logging, the chart still looks like it might be multiplicative.

mlStackoverflow_data['logged_python_diff'] = mlStackoverflow_data['logged_python'] - mlStackoverflow_data['logged_python'].shift()
mlStackoverflow_data.plot.scatter(x='logged_python', y='logged_python_diff')
Visualization

However, based on the scatter plot, the logged series now looks more additive than multiplicative, as the spread of the y values is more consistent as x increases.

Stackoverflow Dataset - Machine Learning

The machine-learning topic was chosen because of its steeper increase towards the end of the chart, which suggests that it is highly likely to be multiplicative.

mlStackoverflow_data['machine-learning'].plot()
Visualization
mlStackoverflow_data['machine-learning-diff'] = mlStackoverflow_data['machine-learning'] - mlStackoverflow_data['machine-learning'].shift()
mlStackoverflow_data.plot.scatter(x='machine-learning', y='machine-learning-diff')
Visualization

As guessed, it is a multiplicative series.

mlStackoverflow_data['logged_ml'] = np.log(mlStackoverflow_data['machine-learning'])
mlStackoverflow_data['logged_ml'].plot()
Visualization

Now we explore the logged chart. It looks more linear, which points more towards an additive series.

mlStackoverflow_data['logged_ml_diff'] = mlStackoverflow_data['logged_ml'] - mlStackoverflow_data['logged_ml'].shift()
mlStackoverflow_data.plot.scatter(x='logged_ml', y='logged_ml_diff')
Visualization

As shown, while there are still regions where the y values are more dispersed, the result is better than it was without logging.

Cinema Ticket Dataset - Tickets Sold

This dataset was chosen out of curiosity about whether a strongly seasonal dataset would be additive or multiplicative.

ticketSales_data['tickets_sold'].plot()
Visualization
ticketSales_data['tickets_sold_diff'] = ticketSales_data['tickets_sold'] - ticketSales_data['tickets_sold'].shift()
ticketSales_data.plot.scatter(x='tickets_sold', y ='tickets_sold_diff')
Visualization

Based on the scatter plot, it seems to be multiplicative. Let's try logging it.

ticketSales_data['logged_sales'] = np.log(ticketSales_data['tickets_sold'])
ticketSales_data['logged_sales'].plot()
Visualization
ticketSales_data['logged_sales_diff'] = ticketSales_data['logged_sales'] - ticketSales_data['logged_sales'].shift()
ticketSales_data.plot.scatter(x='logged_sales', y ='logged_sales_diff')
Visualization

Based on the result, the multiplicative behaviour has decreased and the series looks more additive, with the values dispersed more consistently compared to before.

Stationarity

Are any of the datasets stationary?

Visually, at first glance, none of the current datasets is stationary, as most of them show some form of trend, seasonality, or varying mean and variance. Therefore, let's apply methods to make them stationary and then verify the result. We will use the gold dataset and the machine-learning topic from the StackOverflow dataset.

gold_data['Close'].plot()
Visualization
mlStackoverflow_data['machine-learning'].plot()
Visualization

Random walk model or Simple Differencing

In many cases, taking the difference between the current period and the previous period is enough to make the time series stationary.
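In pandas, this first-order difference is available as a built-in. The short sketch below uses five close prices taken from the Gold .head() output (the `close` series is a stand-in built for this example) to show that `Series.diff()` gives the same result as subtracting the shifted series by hand.

```python
import pandas as pd

# Five close prices from the Gold .head() output, used as a small stand-in series.
close = pd.Series([1298.5, 1298.6, 1298.9, 1299.09, 1299.17])

manual = close - close.shift()   # value minus the previous value
builtin = close.diff()           # pandas shorthand for the same operation

same = manual.equals(builtin)    # equals() treats the leading NaN as matching
print(same)
```

The first element is always NaN because it has no previous value, which is why the differenced series are sliced with `[1:]` before testing.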

gold_data['Close_Diffrenced'] = gold_data['Close'] - gold_data['Close'].shift()
gold_data['Close_Diffrenced'].plot()
Visualization
mlStackoverflow_data['machine-learning_differenced'] = mlStackoverflow_data['machine-learning'] - mlStackoverflow_data['machine-learning'].shift()
mlStackoverflow_data['machine-learning_differenced'].plot()
Visualization

We can see visually that neither graph has a trend any more and both look more stationary than before, but there are still some things to note. The differenced gold data is close to the ideal stationary outcome: apart from a few outliers, its mean and variance are very steady. The differenced StackOverflow questions data, however, still shows variance that increases over time. Let's use some tests to check whether each series is stationary. If not, we shall apply more differencing.

KPSS and ADF Test

We will be using the KPSS and ADF test to see if both of the above are stationary data.

from statsmodels.tsa.stattools import kpss

# Referred the below for the function and the use of kpss
# https://www.machinelearningplus.com/time-series/kpss-test-for-stationarity/

def kpss_test(series):
    statistic, p_value, n_lags, critical_values = kpss(series)
    print('KPSS Stat: ', statistic)
    print('p-value: ', p_value)
    print('Number of lags:', n_lags)
    print('Critial Values:')

    for key, value in critical_values.items():
        print(f'   {key} : {value}')
    print(f'Result: The series is {"not " if p_value < 0.05 else ""}stationary')

print('Gold Data-------------------')
print('=====Before Differencing:')
kpss_test(gold_data['Close'])
print()
print('=====After Differencing:')
kpss_test(gold_data['Close_Diffrenced'][1:])
Gold Data-------------------
=====Before Differencing:
KPSS Stat:  133.7235404045421
p-value:  0.01
Number of lags: 518
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is not stationary

=====After Differencing:
KPSS Stat:  0.06604071713292024
p-value:  0.1
Number of lags: 44
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary
print('StackOverFlow ML Data-------------------')
print('=====Before Differencing:')
kpss_test(mlStackoverflow_data['machine-learning'])
print()
print('=====After Differencing:')
kpss_test(mlStackoverflow_data['machine-learning_differenced'][1:])
StackOverFlow ML Data-------------------
=====Before Differencing:
KPSS Stat:  1.8470519859412013
p-value:  0.01
Number of lags: 6
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is not stationary

=====After Differencing:
KPSS Stat:  0.29869005236660856
p-value:  0.1
Number of lags: 10
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary

According to the KPSS test, simple differencing is enough to make the StackOverflow questions series stationary. Let's see what the adfuller test says.

For the adfuller test, the null hypothesis is that the series has a unit root, so the closer the p-value is to 0, the stronger the evidence that there is no unit root and that the series is stationary. We will use a significance level of 0.05.

from statsmodels.tsa.stattools import adfuller

def adftest(series):
    res = adfuller(series)
    print('AdfTest Stat: ', res[0])
    print('p-value: ', res[1])
    print('Number of lags:', res[2])
    print('Number of observation:', res[3])
    print('Critial Values:')

    for key, value in res[4].items():
        print(f'   {key} : {value}')

    if res[1] < 0.05:
        print('The series is stationary')
    else:
        print('The series is not stationary')

adftest(mlStackoverflow_data['machine-learning'])
AdfTest Stat:  0.6666371271225812
p-value:  0.9891479891628309
Number of lags: 12
Number of observation: 119
Critial Values:
   1% : -3.4865346059036564
   5% : -2.8861509858476264
   10% : -2.579896092790057
The series is not stationary
adftest(mlStackoverflow_data['machine-learning_differenced'][1:])
AdfTest Stat:  -2.054928456415778
p-value:  0.2630197887870384
Number of lags: 11
Number of observation: 119
Critial Values:
   1% : -3.4865346059036564
   5% : -2.8861509858476264
   10% : -2.579896092790057
The series is not stationary

According to the adfuller test, simple differencing was not enough. The p-value has dropped from 0.99 to 0.26, but it is still above 0.05, so we can try to make the dataset more stationary.
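Because the two tests have opposite null hypotheses (KPSS assumes stationarity, ADF assumes a unit root), it can help to read them together. The helper below is a hypothetical sketch, not part of statsmodels; the name `combine_verdicts` and the 0.05 threshold are assumptions, and the p-values are the ones printed above for the machine learning questions series.

```python
def combine_verdicts(kpss_p, adf_p, alpha=0.05):
    """Combine KPSS and ADF p-values into one reading (hypothetical helper)."""
    kpss_stationary = kpss_p >= alpha   # KPSS null hypothesis: series is stationary
    adf_stationary = adf_p < alpha      # ADF null hypothesis: series has a unit root
    if kpss_stationary and adf_stationary:
        return "stationary"
    if not kpss_stationary and not adf_stationary:
        return "not stationary"
    return "tests disagree"

# Before differencing: KPSS p = 0.01, ADF p = 0.989
before = combine_verdicts(0.01, 0.9891479891628309)
# After first-order differencing: KPSS p = 0.1, ADF p = 0.263
after = combine_verdicts(0.1, 0.2630197887870384)
print(before)
print(after)
```

When the tests disagree, as they do after the first difference here, trying a further transformation is a reasonable next step.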

Second Order Differencing

firstDifference = mlStackoverflow_data['machine-learning'] - mlStackoverflow_data['machine-learning'].shift()

firstDifference
0        NaN
1       -4.0
2       -1.0
3        7.0
4       -3.0
       ...
127    -37.0
128    -45.0
129    212.0
130    168.0
131     65.0
Name: machine-learning, Length: 132, dtype: float64
# Lagged first difference: y(t-1) - y(t-2)
secondDifference = mlStackoverflow_data['machine-learning'].shift() - mlStackoverflow_data['machine-learning'].shift(2)
secondDifference
0        NaN
1        NaN
2       -4.0
3       -1.0
4        7.0
       ...
127     80.0
128    -37.0
129    -45.0
130    212.0
131    168.0
Name: machine-learning, Length: 132, dtype: float64
secondOrderDifference = firstDifference - secondDifference
secondOrderDifference
0        NaN
1        NaN
2        3.0
3        8.0
4      -10.0
       ...
127   -117.0
128     -8.0
129    257.0
130    -44.0
131   -103.0
Name: machine-learning, Length: 132, dtype: float64
plt.plot(firstDifference)
plt.plot(secondOrderDifference)
plt.show()
Visualization
adftest(secondOrderDifference[2:])
AdfTest Stat:  -6.611299198025336
p-value:  6.367085815585583e-09
Number of lags: 10
Number of observation: 119
Critial Values:
   1% : -3.4865346059036564
   5% : -2.8861509858476264
   10% : -2.579896092790057
The series is stationary

Visually, the second-order differenced series (the orange graph) appears to have a larger variance than the simple difference, with higher highs and lower lows. However, according to the adfuller test, it is stationary.
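The three-step construction above is equivalent to differencing twice. A small sketch, assuming a made-up series whose first differences are -4, -1, 7 and -3 (the starting level of 20 is arbitrary):

```python
import pandas as pd

# Hypothetical monthly counts, chosen so the first differences are -4, -1, 7, -3.
counts = pd.Series([20.0, 16.0, 15.0, 22.0, 19.0])

# Differencing twice gives the second-order difference.
second_order = counts.diff().diff()

# Expanded form: y_t - 2*y_{t-1} + y_{t-2}
expanded = counts - 2 * counts.shift(1) + counts.shift(2)

print(second_order.tolist())
print(expanded.tolist())
```

The first two entries are NaN because two earlier observations are needed, which is why the test above is run on `secondOrderDifference[2:]`.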

Seasonal Differencing

Seasonal differencing follows the formula below: from the current value, subtract the value observed m periods earlier, where m is called the lag. The higher the lag, the further into the past you reach. This is one way to remove seasonality.

$$y'_t = y_t - y_{t-m}$$

However, we need to loop through several candidate lags to find a suitable one. Another method of seasonal differencing is to define a season timeframe, compute the mean of each season, and subtract the respective season's mean from each observation.
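Both approaches can be sketched on synthetic data. Everything below is an illustrative assumption: a made-up weekly pattern repeated four times on top of a linear trend, with a known season length of 7.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a weekly pattern repeated over a linear trend.
weekly_pattern = np.array([5.0, 9.0, 7.0, 3.0, 1.0, 4.0, 6.0])
t = np.arange(28)
series = pd.Series(np.tile(weekly_pattern, 4) + 0.5 * t)

m = 7  # assumed season length

# Method 1: lagged subtraction, y_t - y_{t-m}.
seasonal_diff = series.diff(m).dropna()

# Method 2: subtract the mean of each position within the season.
position = t % m
seasonal_means = series.groupby(position).transform("mean")
mean_adjusted = series - seasonal_means

print(seasonal_diff.unique())          # the weekly pattern cancels out
print(mean_adjusted.iloc[:7].unique()) # weekly shape removed within a week
```

In this toy case the lagged subtraction leaves only the constant weekly increase of the trend, while the mean-based adjustment removes the weekly shape but keeps the trend as a step from one week to the next.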

lags = [3,5,10,20,50]

plt.rcParams['figure.figsize'] = [15,7]

for lag in lags:
    temp = mlStackoverflow_data['machine-learning'] - mlStackoverflow_data['machine-learning'].shift(lag)
    plt.plot(temp, label=str(lag) + " lag")

    print('Test for lag ' + str(lag) + "  =====================================")
    kpss_test(temp[lag:])
    print()
    adftest(temp[lag:])
    print()

plt.legend()
plt.show()
Test for lag 3  =====================================
KPSS Stat:  0.2898885946621235
p-value:  0.1
Number of lags: 4
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary

AdfTest Stat:  -2.25147303187454
p-value:  0.18809655506315798
Number of lags: 13
Number of observation: 115
Critial Values:
   1% : -3.4885349695076844
   5% : -2.887019521656941
   10% : -2.5803597920604915
The series is not stationary

Test for lag 5  =====================================
KPSS Stat:  0.27908581626858614
p-value:  0.1
Number of lags: 4
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary

AdfTest Stat:  -2.1942366663365926
p-value:  0.20834386047176706
Number of lags: 12
Number of observation: 114
Critial Values:
   1% : -3.489057523907491
   5% : -2.887246327182993
   10% : -2.5804808802708528
The series is not stationary

Test for lag 10  =====================================
KPSS Stat:  0.35328244736991793
p-value:  0.09729204854744916
Number of lags: 5
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary

AdfTest Stat:  -2.4158493770328726
p-value:  0.13731291632025688
Number of lags: 12
Number of observation: 109
Critial Values:
   1% : -3.49181775886872
   5% : -2.8884437992971588
   10% : -2.5811201893779985
The series is not stationary

Test for lag 20  =====================================
KPSS Stat:  0.5230701441606662
p-value:  0.0364706882521022
Number of lags: 5
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is not stationary

AdfTest Stat:  -2.157009917873902
p-value:  0.2222165170167978
Number of lags: 12
Number of observation: 99
Critial Values:
   1% : -3.498198082189098
   5% : -2.891208211860468
   10% : -2.5825959973472097
The series is not stationary

Test for lag 50  =====================================
KPSS Stat:  1.3102598299838486
p-value:  0.01
Number of lags: 5
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is not stationary

AdfTest Stat:  -0.5612811021960334
p-value:  0.8794931265993677
Number of lags: 12
Number of observation: 69
Critial Values:
   1% : -3.528889992207215
   5% : -2.9044395987933362
   10% : -2.589655654274312
The series is not stationary

Visualization

Surprisingly, a longer lag does not make the data more stationary. Even with a lag of 3, the adfuller test does not consider the time series stationary, and the KPSS results show that from lag 20 onwards the data is no longer stationary. We can also observe this visually, as the red and purple graphs (lags 20 and 50) gradually turn into upward-trending graphs.

Log and log differences

Let's see how taking the log of a time series, or its log differences, affects stationarity. We will continue using the StackOverflow machine learning dataset.

loggedMLtopic = np.log(mlStackoverflow_data['machine-learning'])
loggedMLtopic.plot()
Visualization

Taking the log has turned it into a steadily trending time series, so now we can try applying simple differencing.

differencedLoggedMLtopic = loggedMLtopic - loggedMLtopic.shift()
differencedLoggedMLtopic.plot()
Visualization
kpss_test(differencedLoggedMLtopic[1:])
KPSS Stat:  0.28055757025438893
p-value:  0.1
Number of lags: 55
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary
adftest(differencedLoggedMLtopic[1:])
AdfTest Stat:  -13.691197822513615
p-value:  1.342938733545792e-25
Number of lags: 1
Number of observation: 129
Critial Values:
   1% : -3.482087964046026
   5% : -2.8842185101614626
   10% : -2.578864381347275
The series is stationary

Based on both tests and on visual inspection, we have been able to make the machine learning topic time series stationary.
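One way to see why log differences help with variance that grows over time: when a series grows by a roughly constant percentage, its plain differences keep widening while its log differences stay flat. A minimal sketch on a synthetic series, assuming 5% growth per period (not the StackOverflow data):

```python
import numpy as np
import pandas as pd

# Synthetic series growing 5% per period, so changes widen as the level rises.
growth = pd.Series(100.0 * 1.05 ** np.arange(10))

plain_diff = growth.diff().dropna()
log_diff = np.log(growth).diff().dropna()

print(plain_diff.round(2).tolist())   # keeps increasing with the level
print(log_diff.round(4).tolist())     # roughly constant, close to log(1.05)
```

The log difference is approximately the percentage change per period, which is why it tends to stabilise series whose swings scale with their level.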

Seasonality

Seasonality refers to data that shows a repeating, predictable pattern over a fixed time frame.

Among the datasets we have, the cinema dataset clearly shows a repeating pattern with very noticeable peaks and troughs. Therefore, we will use this dataset for the seasonality adjustments.

Cinema Tickets

ticketSales_data.index.weekday
Int64Index([2, 3, 4, 5, 6, 0, 1, 2, 3, 4,
            ...
            4, 5, 6, 0, 1, 2, 3, 4, 5, 6],
           dtype='int64', name='Datetime', length=232)
ticketSales_data.index.weekday.unique()
Int64Index([2, 3, 4, 5, 6, 0, 1], dtype='int64', name='Datetime')
ticketSales_data = ticketSales_data.assign(day=ticketSales_data.index.weekday)
ticketSales_data
| Datetime | tickets_sold | day |
| --- | --- | --- |
| 2018-03-14 | 125.650000 | 2 |
| 2018-03-15 | 178.675325 | 3 |
| 2018-03-16 | 175.461017 | 4 |
| 2018-03-17 | 114.192192 | 5 |
| 2018-03-18 | 94.831956 | 6 |
| ... | ... | ... |
| 2018-10-31 | 70.583704 | 2 |
| 2018-11-01 | 113.653521 | 3 |
| 2018-11-02 | 126.824561 | 4 |
| 2018-11-03 | 75.431177 | 5 |
| 2018-11-04 | 49.894737 | 6 |
# The dataset does not cover a full calendar year, so we use the day of the week as the season.
# Get the list of weekdays that are involved
days = ticketSales_data.index.weekday.unique()

# Initializing season count and sum
seasonSum = [0] * len(days)
seasonCount = [0] * len(days)

for rowCount in range(len(ticketSales_data)):

    currentValue, currentDay = ticketSales_data.iloc[rowCount,[0,1]]

    seasonSum[int(currentDay)] += currentValue
    seasonCount[int(currentDay)] += 1

print(seasonCount)
print(seasonSum)
[32, 33, 32, 33, 34, 34, 34]
[2431.9228952621716, 8661.859799267784, 3712.1669864735404, 4930.762704784601, 5691.3748326994255, 3488.7014251378805, 2745.6502726189774]
# Season Mean
seasonMean = [0] * len(days)

for count in range(len(seasonCount)):
    seasonMean[count] = seasonSum[count] / seasonCount[count]

seasonMean
[75.99759047694286,
 262.48059997781166,
 116.00521832729814,
 149.4170516601394,
 167.39337743233605,
 102.60886544523179,
 80.7544197829111]
dataMean = np.mean(ticketSales_data['tickets_sold'].values)
dataMean
136.47602981139818
seasonOffsets = [x - dataMean for x in seasonMean]
seasonOffsets
[-60.47843933445532,
 126.00457016641349,
 -20.47081148410004,
 12.94102184874123,
 30.917347620937875,
 -33.86716436616639,
 -55.72161002848708]
adjustedTicketsSold = []
for count in range(len(ticketSales_data)):

    currentValue, currentDay = ticketSales_data.iloc[count,[0,1]]
    adjustedTicketsSold.append(currentValue - seasonOffsets[int(currentDay)])

adjustedTicketsSold
[146.12081148410005,
 165.73430282658344,
 144.54366932821466,
 148.0593565583586,
 150.55356595135208,
 ...
 91.05451518780374,
 100.71249927801934,
 95.90721378257089,
 109.29834181226921,
 105.61634687059234]
ticketSales_data = ticketSales_data.assign(adjustedTicketSale = adjustedTicketsSold)
plt.rcParams['figure.figsize'] = [12, 5]
ticketSales_data.iloc[:, [0,2]].plot()
Visualization

As we can observe, the orange plot is the adjusted ticket sales. The adjustment works: the peaks are now generally lower and the troughs generally higher. However, peaks and troughs are still present even after the adjustment.

secondAdjustmentCount = [0] * len(days)
secondAdjustmentSum = [0] * len(days)
secondAdjustmentMean = [0] * len(days)
secondAdjustmentValue = []

for count in range(len(ticketSales_data)):
    currentDay, currentValue = ticketSales_data.iloc[count, [1,2]]
    secondAdjustmentCount[int(currentDay)] += 1
    secondAdjustmentSum[int(currentDay)] += currentValue

print("Season Count:", secondAdjustmentCount)
print("Season Sum:", secondAdjustmentSum)

for count in range(len(secondAdjustmentSum)):
    secondAdjustmentMean[count] = secondAdjustmentSum[count] / secondAdjustmentCount[count]
print("Season Mean:", secondAdjustmentMean)

secondMean = np.mean(ticketSales_data['adjustedTicketSale'].values)
print("Adjusted Ticket Sale Mean:", secondMean)

secondAdjustmentOffsets = [(x - secondMean) for x in secondAdjustmentMean]
print(secondAdjustmentOffsets)

secondAdjustedTickets = []
for count in range(len(ticketSales_data)):
    currentValue, currentDay = ticketSales_data.iloc[count, [2,1]]
    secondAdjustedTickets.append(currentValue - secondAdjustmentOffsets[int(currentDay)])

ticketSales_data = ticketSales_data.assign(second_adjusted_ticket_sales = secondAdjustedTickets)
plt.rcParams['figure.figsize'] = [12, 5]
ticketSales_data.iloc[:, [2,3]].plot()
Visualization

It may seem surprising that, after the first adjustment, a second pass removes no further variance: the peaks are not lowered and the troughs are not raised. The reason is that each weekday's mean in the adjusted column already sits at the overall mean of that column, so the second set of offsets is essentially zero.
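The same weekday adjustment can be written with `groupby`, which also makes it easy to check that a second pass has nothing left to subtract. This is a sketch on synthetic sales; the weekday effects, noise level and the helper name `weekday_offsets` are assumptions for illustration, not values from the cinema dataset.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales with a weekday effect (illustrative values only).
rng = np.random.default_rng(0)
index = pd.date_range("2018-03-14", periods=70, freq="D")
weekdays = index.weekday.to_numpy()
weekday_effect = np.array([-60.0, 120.0, -20.0, 10.0, 30.0, -30.0, -50.0])
sales = pd.Series(130.0 + weekday_effect[weekdays] + rng.normal(0, 5, 70),
                  index=index)

def weekday_offsets(s, days):
    """Per-weekday mean minus the overall mean (hypothetical helper)."""
    return s.groupby(days).mean() - s.mean()

# First pass: subtract each weekday's offset from its observations.
first_offsets = weekday_offsets(sales, weekdays)
adjusted = sales - first_offsets.to_numpy()[weekdays]

# Second pass: the offsets of the adjusted series are effectively zero.
second_offsets = weekday_offsets(adjusted, weekdays)
print(first_offsets.round(2).tolist())
print(second_offsets.abs().max())
```

Subtracting a group's offset moves that group's mean onto the overall mean, and the overall mean itself does not change, so repeating the step is a no-op up to floating-point error.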

weeks = ticketSales_data.index.week.unique()
weeks
Int64Index([11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
            28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
            44],
           dtype='int64', name='Datetime')
ticketSales_data = ticketSales_data.assign(week=ticketSales_data.index.isocalendar().week)
ticketSales_data
| Datetime | tickets_sold | day | adjustedTicketSale | second_adjusted_ticket_sales | week |
| --- | --- | --- | --- | --- | --- |
| 2018-03-14 | 125.650000 | 2 | 146.120811 | 146.120811 | 11 |
| 2018-03-15 | 178.675325 | 3 | 165.734303 | 165.734303 | 11 |
| 2018-03-16 | 175.461017 | 4 | 144.543669 | 144.543669 | 11 |
| 2018-03-17 | 114.192192 | 5 | 148.059357 | 148.059357 | 11 |
| 2018-03-18 | 94.831956 | 6 | 150.553566 | 150.553566 | 11 |
| ... | ... | ... | ... | ... | ... |
| 2018-10-31 | 70.583704 | 2 | 91.054515 | 91.054515 | 44 |
| 2018-11-01 | 113.653521 | 3 | 100.712499 | 100.712499 | 44 |
| 2018-11-02 | 126.824561 | 4 | 95.907214 | 95.907214 | 44 |
| 2018-11-03 | 75.431177 | 5 | 109.298342 | 109.298342 | 44 |
| 2018-11-04 | 49.894737 | 6 | 105.616347 | 105.616347 | 44 |
weekAdjustmentCount = [0] * len(weeks)
weekAdjustmentSum = [0] * len(weeks)
weekAdjustmentMean = [0] * len(weeks)
weekAdjustmentValue = []

for count in range(len(ticketSales_data)):
    currentWeek, currentValue = ticketSales_data.iloc[count, [4,0]]
    weekAdjustmentCount[int(currentWeek)-11] += 1
    weekAdjustmentSum[int(currentWeek)-11] += currentValue

print("Season Count:", weekAdjustmentCount)
print("Season Sum:", weekAdjustmentSum)

for count in range(len(weekAdjustmentSum)):
    weekAdjustmentMean[count] = weekAdjustmentSum[count] / weekAdjustmentCount[count]
print("Season Mean:", weekAdjustmentMean)

# The weekly means above are computed on tickets_sold, so use the same column for the overall mean.
weekMean = np.mean(ticketSales_data['tickets_sold'].values)
print("Ticket Sale Mean:", weekMean)

weekAdjustmentOffsets = [(x - weekMean) for x in weekAdjustmentMean]
print(weekAdjustmentOffsets)

weekAdjustedTickets = []
for count in range(len(ticketSales_data)):
    currentWeek, currentValue = ticketSales_data.iloc[count, [4,0]]
    weekAdjustedTickets.append(currentValue - weekAdjustmentOffsets[int(currentWeek)-11])

ticketSales_data = ticketSales_data.assign(week_adjusted_ticket_sales = weekAdjustedTickets)
plt.rcParams['figure.figsize'] = [12, 5]
ticketSales_data[['tickets_sold','week_adjusted_ticket_sales']].plot()
Visualization
plt.rcParams['figure.figsize'] = [12, 5]

ticketSales_data[['tickets_sold','week_adjusted_ticket_sales','adjustedTicketSale']].plot()
Visualization

As can be seen, more of the seasonality was removed when adjusting by the mean of each day of the week than by the mean of each week. The week-adjusted ticket sales seem to be more stationary, but the peaks and troughs, while smaller, are still very prominent. Hence, the seasonality is still there.

USDCUSDT Tradecount

We can also explore the USDCUSDT tradecount column, which looks like it has some seasonality. We will apply the same steps as above.

for no, column in enumerate(usdcusdt_data.columns):
    print(str(no) + " " + column)
0 unix
1 date
2 symbol
3 open
4 high
5 low
6 close
7 Volume USDC
8 Volume USDT
9 tradecount
usdcusdt_data.iloc[2,1].weekday()
3
usdcusdtDayAdjustmentCount = [0] * 31
usdcusdtDayWeekAdjustmentCount = [0] * 7
usdusdtMonthAdjustmentCount = [0] * 12
usdcusdtDayAdjustmentSum = [0] * 31
usdcusdtDayWeekAdjustmentSum = [0] * 7
usdcusdtMonthAdjustmentSum = [0] * 12
usdcusdtDayAdjustmentMean = [0] * 31
usdcusdtDayWeekAdjustmentMean = [0] * 7
usdcusdtMonthAdjustmentMean = [0] * 12
usdcusdtCloseMean = usdcusdt_data['tradecount'].mean()

for rowCount in range(len(usdcusdt_data)):

    usdcusdtDayAdjustmentCount[usdcusdt_data.iloc[rowCount, 1].day-1] += 1
    usdcusdtDayAdjustmentSum[usdcusdt_data.iloc[rowCount, 1].day-1] += usdcusdt_data.iloc[rowCount, 9]

    usdcusdtDayWeekAdjustmentCount[usdcusdt_data.iloc[rowCount, 1].weekday()] += 1
    usdcusdtDayWeekAdjustmentSum[usdcusdt_data.iloc[rowCount, 1].weekday()] += usdcusdt_data.iloc[rowCount, 9]

    usdusdtMonthAdjustmentCount[usdcusdt_data.iloc[rowCount, 1].month-1] += 1
    usdcusdtMonthAdjustmentSum[usdcusdt_data.iloc[rowCount, 1].month-1] += usdcusdt_data.iloc[rowCount, 9]

for count in range(len(usdcusdtDayAdjustmentSum)):
    usdcusdtDayAdjustmentMean[count] = usdcusdtDayAdjustmentSum[count] / usdcusdtDayAdjustmentCount[count]

for count in range(len(usdcusdtDayWeekAdjustmentSum)):
    usdcusdtDayWeekAdjustmentMean[count] = usdcusdtDayWeekAdjustmentSum[count] / usdcusdtDayWeekAdjustmentCount[count]

for count in range(len(usdcusdtMonthAdjustmentSum)):
    usdcusdtMonthAdjustmentMean[count] = usdcusdtMonthAdjustmentSum[count] / usdusdtMonthAdjustmentCount[count]

dayAdjustmentOffset = [(x - usdcusdtCloseMean) for x in usdcusdtDayAdjustmentMean]
dayWeekAdjustmentOffset = [(x - usdcusdtCloseMean) for x in usdcusdtDayWeekAdjustmentMean]
monthAdjustmentOffset = [(x - usdcusdtCloseMean) for x in usdcusdtMonthAdjustmentMean]

dayAdjustedClose = []
dayWeekAdjustedClose = []
monthAdjustedClose = []

for rowCount in range(len(usdcusdt_data)):
    currentValue, currentDatetime = usdcusdt_data.iloc[rowCount, [9,1]]

    dayAdjustedClose.append(currentValue - dayAdjustmentOffset[currentDatetime.day - 1])
    dayWeekAdjustedClose.append(currentValue - dayWeekAdjustmentOffset[currentDatetime.weekday()])
    monthAdjustedClose.append(currentValue - monthAdjustmentOffset[currentDatetime.month - 1])

usdcusdt_data = usdcusdt_data.assign(Adjusted_tradecount_day = dayAdjustedClose)
usdcusdt_data = usdcusdt_data.assign(Adjusted_tradecount_week = dayWeekAdjustedClose)
usdcusdt_data = usdcusdt_data.assign(Adjusted_tradecount_month = monthAdjustedClose)

adjusted_col = ['Adjusted_tradecount_day', 'Adjusted_tradecount_week', 'Adjusted_tradecount_month']
plt.rcParams['figure.figsize'] = [12, 5]

for column in adjusted_col:
    plt.plot(usdcusdt_data['date'], usdcusdt_data['tradecount'])
    plt.plot(usdcusdt_data['date'], usdcusdt_data[column], alpha=0.8)
    plt.title(column)
    plt.show()
Visualization
Visualization
Visualization

As shown above, only the month-based adjustment moves the graph, and even then only slightly. In none of the three cases was the seasonality removed.

Looking for Seasonality using FFT

We will try to use the FFT to identify seasonality in some of the datasets.

pd.Series(np.fft.fft(ticketSalesData)).plot()
Visualization
pd.Series(np.abs(np.fft.fft(ticketSalesData))).plot()
Visualization
pd.Series(np.abs(np.fft.fft(ticketSalesData)[10:len(ticketSalesData)])).plot()
Visualization

Clearly there is seasonality in this dataset.
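To turn a spectrum like this into a season length, one option is to locate the strongest peak and convert its frequency into a period. The sketch below uses a synthetic daily series with a known 7-day cycle; subtracting the mean and using `np.fft.rfft` with `np.fft.rfftfreq` are choices made for this illustration and differ from the plotting code above.

```python
import numpy as np

# Synthetic daily series with a 7-day cycle plus noise (illustrative only).
rng = np.random.default_rng(1)
n = 210
t = np.arange(n)
signal = 100 + 30 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 5, n)

# Remove the mean so the zero-frequency term does not dominate the spectrum.
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(n, d=1.0)   # frequencies in cycles per day

# Strongest non-zero frequency, converted to a period in days.
peak = spectrum[1:].argmax() + 1    # skip the zero-frequency bin
dominant_period = 1.0 / freqs[peak]
print(round(dominant_period, 2))
```

A single sharp peak points to one dominant season; a flat or steadily decaying spectrum suggests there is no clear seasonality.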

Gold

pd.Series(np.abs(np.fft.fft(goldData['Close'])[10:len(ticketSalesData)])).plot()
Visualization
pd.Series(np.abs(np.fft.fft(goldData['Close'])[10:len(ticketSalesData)//3])).plot()
Visualization

For gold, there seems to be no seasonality in the dataset.

Machine Learning Questions

pd.Series(np.abs(np.fft.fft(mlStackoverflow_data['machine-learning'])[0:len(ticketSalesData)])).plot()
Visualization
pd.Series(np.abs(np.fft.fft(mlStackoverflow_data['machine-learning'])[100:])).plot()
Visualization

Honestly, I cannot tell whether the final stretch of values indicates that this dataset has a seasonal component.

USDCUSDT Tradecount

usdcusdt_data['tradecount']
0         277
1       43366
2       58314
3       67437
4       56204
        ...
1046     9876
1047     5308
1048     4360
1049     1185
1050     1054
Name: tradecount, Length: 1051, dtype: int64
pd.Series(np.abs(np.fft.fft(usdcusdt_data['tradecount']))).plot()
Visualization

Clearly this is not a seasonal dataset.

Forecasting

Naive Forecasting

Take the previous period's value as the prediction. Therefore, we only need to shift the column by one, as we did for the differencing. Some examples are shown below.

def naiveForecasting(series):

    temp = pd.DataFrame(series.rename('observed'))
    temp.insert(1, 'predicted', temp['observed'].shift())

    return temp
mlTopicNaiveForecast = naiveForecasting(mlStackoverflow_data['machine-learning'])
mlTopicNaiveForecast.plot()
Visualization
ticketSoldNaiveForecast = naiveForecasting(ticketSales_data['tickets_sold'])
ticketSoldNaiveForecast.plot()
Visualization
goldPriceNaiveForecast = naiveForecasting(gold_data['Close'])
goldPriceNaiveForecast.plot()
Visualization

Seasonal Forecasting

Define a season, then predict each value with the value from the previous season. It is similar to naive forecasting above, but with a longer time gap in between.

def seasonalForecasting(series, season):
    temp = pd.DataFrame(series.rename('observed'))
    temp.insert(1, 'predicted', temp['observed'].shift(season))

    return temp

Because each data point in the StackOverflow dataset represents a month, our season could be quarterly, half-yearly or yearly.

mlTopicSeasonalYearlyForecast = seasonalForecasting(mlStackoverflow_data['machine-learning'], 12)
mlTopicSeasonalYearlyForecast.plot()
Visualization
mlTopicSeasonalQuarterlyForecast = seasonalForecasting(mlStackoverflow_data['machine-learning'], 3)
mlTopicSeasonalQuarterlyForecast.plot()
Visualization
ticketSoldWeeklyForecast = seasonalForecasting(ticketSales_data['tickets_sold'], 7)
ticketSoldWeeklyForecast.plot()
Visualization
ticketSoldMonthlyForecast = seasonalForecasting(ticketSales_data['tickets_sold'], 30)
ticketSoldMonthlyForecast.plot()
Visualization
goldPriceDailyForecast = seasonalForecasting(gold_data['Close'], 1440)
goldPriceDailyForecast.plot()
Visualization
goldPriceWeeklyForecast = seasonalForecasting(gold_data['Close'], 10080)
goldPriceWeeklyForecast.plot()
Visualization

Average Forecasting

Use the average of the previous n observations as the prediction.

def averageForecasting(series, n):
    temp = pd.DataFrame(series.rename('observed'))
    temp.insert(1, 'predicted',
                1/n * (temp['observed'].cumsum().shift() - temp['observed'].cumsum().shift(n+1)))

    return temp

def averageForecastPlot(series, windows):
    for n in windows:
        temp = averageForecasting(series,n)
        plt.plot(temp['predicted'], label = 'n = ' + str(n))
    plt.plot(temp['observed'], label='observed', linewidth=2, alpha=0.3, linestyle='dashed')
    plt.legend()
    plt.show()
averageForecastPlot(mlStackoverflow_data['machine-learning'],[2,6,12,24])
Visualization
averageForecastPlot(ticketSales_data['tickets_sold'],[2,7,21,60])
Visualization
averageForecastPlot(gold_data['Close'],[1440,10080,43200])
Visualization

Based on the results shown, this method seems to act as a smoothing method rather than a predictor, especially with a longer time period.
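The smoothing behaviour is easier to see once the forecast is written as a shifted rolling mean. The sketch below compares the cumulative-sum form used in `averageForecasting` with `rolling(n).mean().shift()` on a few made-up values; the numbers and the window of 3 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([3.0, 5.0, 4.0, 8.0, 6.0, 7.0, 9.0])
n = 3

# Cumulative-sum form used in the text.
cumsum_forecast = (values.cumsum().shift() - values.cumsum().shift(n + 1)) / n

# Rolling-window form: mean of the previous n observations.
rolling_forecast = values.rolling(n).mean().shift()

comparison = pd.DataFrame({"cumsum": cumsum_forecast, "rolling": rolling_forecast})
print(comparison)
```

The two agree wherever both are defined. The cumulative-sum form starts one period later because it needs a cumulative sum from n + 1 periods back, which has little effect on long series.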

Average Differences Forecasting

Now we will explore using the average of the differences between consecutive periods as a way to predict.

def differenceAverage(series, p):
    temp = pd.DataFrame(series.rename('observed'))
    difference = temp['observed'] - temp['observed'].shift()

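    # prediction_of_differences at t is the mean of the p differences up to t-1.
    prediction_of_differences = (1/p) * (difference.cumsum().shift(1) - difference.cumsum().shift(p+1))
    # Note: the extra shift(1) below lags that mean once more, so the forecast for t
    # uses differences only up to t-2, added to the observed value at t-1.
    temp['predicted']= prediction_of_differences.shift(1) + temp['observed'].shift(1)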

    return temp

def differenceAveragePlot(series, windows):
    plt.plot(series, label='observed', linewidth=2, alpha=0.5, linestyle='dashed')
    for n in windows:
        plt.plot(differenceAverage(series, n)['predicted'], label='predicted, n = ' + str(n))

    plt.legend()
    plt.show()
differenceAveragePlot(mlStackoverflow_data['machine-learning'], [2,6,12,24])
Visualization
differenceAveragePlot(ticketSales_data['tickets_sold'], [2,6,12,24])
Visualization
differenceAveragePlot(gold_data['Close'][:100000], [1440,10080,43200])
Visualization

Naive Differences Forecasting

Compute the list of differences between each period and the previous one. Then, to predict the next value, take the most recent difference (current period minus previous period) and add it to the current value.

def naiveDifferenceForecasting(series):
    temp = pd.DataFrame(series.rename('observed'))
    differences = temp['observed'] - temp['observed'].shift()
    predictionsOfDifferences = differences.shift()
    temp['predicted'] = predictionsOfDifferences + temp['observed'].shift(1)

    return temp
naiveDifferenceForecasting(mlStackoverflow_data['machine-learning']).plot()
Visualization
naiveDifferenceForecasting(ticketSales_data['tickets_sold']).plot()
Visualization
naiveDifferenceForecasting(gold_data['Close'])[:300].plot()
Visualization

Ironically, a naive method that uses the previous change in value to make a prediction seems to work quite well. Of the three datasets used, the predictions for the machine learning questions and gold prices appear to over-predict the value, but they do at least follow the same major movements as the observed data. That said, this impression is based on visual inspection only.
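The over-prediction has a simple explanation: adding the last change to the last value is the same as extrapolating a straight line through the two most recent points, so the forecast overshoots whenever the direction reverses. A small sketch on made-up values:

```python
import pandas as pd

# Hypothetical observations with a reversal at index 3 (illustrative only).
observed = pd.Series([10.0, 12.0, 15.0, 14.0, 18.0])

last_change = observed.diff().shift()            # change seen one period ago
predicted = observed.shift() + last_change       # previous value plus that change

# Equivalent closed form: 2*y_{t-1} - y_{t-2}
closed_form = 2 * observed.shift(1) - observed.shift(2)

print(pd.DataFrame({"observed": observed, "predicted": predicted}))
```

At index 3 the series falls from 15 to 14, but the forecast continues the earlier rise and predicts 18.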

Evaluating the forecasts

We will now implement ways to evaluate the forecasts: MSE, RMSE and MAE, which are typically used for regression predictions.

Define Functions and Setting Table of Results

def mse(df):
    temp = df['observed'] - df['predicted']
    temp = temp**2
    temp = temp.dropna()

    return temp.sum() / temp.count()

def rmse(df):
    return (mse(df))**(1/2)

def mae(df):
    temp = abs(df['observed'] - df['predicted'])
    temp = temp.dropna()

    return temp.sum() / temp.count()

def evaluateErrors(df):
    return [mse(df), rmse(df), mae(df)]
tableOfResults = pd.DataFrame(columns = ['Data', 'Algorithm', 'MSE', 'RMSE', 'MAE'])
tableOfResults

# Function to fill the Table of Results
def appendingTOR(dataframe, dataset, algorithm, errors):
    dataframe.loc[len(dataframe)+1] = {'Data': dataset,'Algorithm': algorithm, 'MSE':errors[0],'RMSE':errors[1], 'MAE':errors[2]}
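As a quick sanity check of the three metrics, here is a worked example on four made-up points, written with NumPy rather than the helper functions above. The arrays are assumptions for illustration only.

```python
import numpy as np

observed = np.array([10.0, 12.0, 15.0, 14.0])
predicted = np.array([11.0, 12.0, 13.0, 18.0])

errors = observed - predicted          # [-1, 0, 2, -4]
mse_value = np.mean(errors ** 2)       # (1 + 0 + 4 + 16) / 4 = 5.25
rmse_value = np.sqrt(mse_value)        # square root of 5.25, about 2.29
mae_value = np.mean(np.abs(errors))    # (1 + 0 + 2 + 4) / 4 = 1.75

print(mse_value, rmse_value, mae_value)
```

RMSE is larger than MAE here because squaring gives the single large error of 4 more weight. The same effect is visible in the results tables, where RMSE is consistently above MAE.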

Naive Forecasting

# Re-setup some of the dataset above
sp500Data = sp500_data[['Datetime', 'Close']]
sp500Data = sp500Data.set_index('Datetime')

ftseData = ftse_data[['Datetime', 'Close']]
ftseData = ftseData.set_index('Datetime')

goldData = gold_data[['Datetime', 'Close']]
goldData = goldData.set_index('Datetime')

temp = usdcusdt_data[['date', 'close', 'tradecount']]
temp = temp.set_index(pd.to_datetime(temp['date']))
usdcusdtDataTradeCount = temp[['tradecount']]
usdcusdtDataClose = temp[['close']]

# Create Naive Forecast Predictions
sp500DataNaiveForecast = naiveForecasting(sp500Data['Close'])
ftseDataNaiveForecast = naiveForecasting(ftseData['Close'])
goldDataNaiveForecast = naiveForecasting(goldData['Close'])
usdcusdtDataCloseNaiveForecast = naiveForecasting(usdcusdtDataClose['close'])
usdcusdtTradeCountNaiveForecast = naiveForecasting(usdcusdtDataTradeCount['tradecount'])
pythonNaiveForecast = naiveForecasting(mlStackoverflow_data['python'])

# Appending to Result Table
naiveForecastingString = 'Naive Forecasting'

appendingTOR(tableOfResults, 'SP500', naiveForecastingString, evaluateErrors(sp500DataNaiveForecast))
appendingTOR(tableOfResults, 'FTSE', naiveForecastingString, evaluateErrors(ftseDataNaiveForecast))
appendingTOR(tableOfResults, 'Gold', naiveForecastingString, evaluateErrors(goldDataNaiveForecast))
appendingTOR(tableOfResults, 'USDCUSDT Close Price', naiveForecastingString, evaluateErrors(usdcusdtDataCloseNaiveForecast))
appendingTOR(tableOfResults, 'USDCUSDT Tradecount', naiveForecastingString, evaluateErrors(usdcusdtTradeCountNaiveForecast))
appendingTOR(tableOfResults, 'Python Questions', naiveForecastingString, evaluateErrors(pythonNaiveForecast))
appendingTOR(tableOfResults, 'Machine Learning Questions', naiveForecastingString, evaluateErrors(mlTopicNaiveForecast))
appendingTOR(tableOfResults, 'Ticket Sales', naiveForecastingString, evaluateErrors(ticketSoldNaiveForecast))

tableOfResults
| # | Data | Algorithm | MSE | RMSE | MAE |
| --- | --- | --- | --- | --- | --- |
| 1 | SP500 | Naive Forecasting | 0.315878 | 0.56203 | 0.325863 |
| 2 | FTSE | Naive Forecasting | 5.69646 | 2.38673 | 1.29783 |
| 3 | Gold | Naive Forecasting | 0.237289 | 0.487123 | 0.31245 |
| 4 | USDCUSDT Close Price | Naive Forecasting | 4.01536e-06 | 0.002004 | 0.001034 |
| 5 | USDCUSDT Tradecount | Naive Forecasting | 1.56376e+08 | 12505 | 7519.96 |
| 6 | Python Questions | Naive Forecasting | 1.00863e+06 | 1004.31 | 680.298 |
| 7 | Machine Learning Questions | Naive Forecasting | 3053.46 | 55.2581 | 35.4885 |
| 8 | Ticket Sales | Naive Forecasting | 12476 | 111.696 | 74.8537 |

Seasonal Forecasting

# Create Seasonal Forecasting Predictions
sp500SeasonalForecast60 = seasonalForecasting(sp500Data['Close'], 60)
sp500SeasonalForecast1440 = seasonalForecasting(sp500Data['Close'], 1440)
ftseSeasonalForecast60 = seasonalForecasting(ftseData['Close'], 60)
ftseSeasonalForecast1440 = seasonalForecasting(ftseData['Close'], 1440)
goldSeasonalForecast60 = seasonalForecasting(goldData['Close'], 60)
goldSeasonalForecast1440 = seasonalForecasting(goldData['Close'], 1440)
usdcusdtCloseSeasonalForecast = seasonalForecasting(usdcusdtDataClose['close'], 7)
usdcusdtTCSeasonalForecast = seasonalForecasting(usdcusdtDataTradeCount['tradecount'], 7)
pythonSeasonalForecast = seasonalForecasting(mlStackoverflow_data['python'], 12)

# Appending to Result Table
seasonalForecastingString = 'Seasonal Forecasting'

appendingTOR(tableOfResults, 'SP500', seasonalForecastingString + ' - 60', evaluateErrors(sp500SeasonalForecast60))
appendingTOR(tableOfResults, 'SP500', seasonalForecastingString + ' - 1440', evaluateErrors(sp500SeasonalForecast1440))
appendingTOR(tableOfResults, 'FTSE', seasonalForecastingString + ' - 60', evaluateErrors(ftseSeasonalForecast60) )
appendingTOR(tableOfResults, 'FTSE', seasonalForecastingString + ' - 1440', evaluateErrors(ftseSeasonalForecast1440))
appendingTOR(tableOfResults, 'Gold', seasonalForecastingString + ' - 60', evaluateErrors(goldSeasonalForecast60))
appendingTOR(tableOfResults, 'Gold', seasonalForecastingString + ' - 1440', evaluateErrors(goldSeasonalForecast1440))
appendingTOR(tableOfResults, 'USDCUSDT Close Price', seasonalForecastingString, evaluateErrors(usdcusdtCloseSeasonalForecast))
appendingTOR(tableOfResults, 'USDCUSDT Tradecount', seasonalForecastingString, evaluateErrors(usdcusdtTCSeasonalForecast))
appendingTOR(tableOfResults, 'Python Questions', seasonalForecastingString, evaluateErrors(pythonSeasonalForecast))
appendingTOR(tableOfResults, 'Machine Learning Questions', seasonalForecastingString, evaluateErrors(mlTopicSeasonalYearlyForecast))
appendingTOR(tableOfResults, 'Ticket Sales', seasonalForecastingString, evaluateErrors(ticketSoldWeeklyForecast))

tableOfResults
Unnamed: 0 Data Algorithm MSE RMSE MAE
1 SP500 Naive Forecasting 0.315878 0.56203 0.325863
2 FTSE Naive Forecasting 5.69646 2.38673 1.29783
3 Gold Naive Forecasting 0.237289 0.487123 0.31245
4 USDCUSDT Close Price Naive Forecasting 4.01536e-06 0.002004 0.001034
5 USDCUSDT Tradecount Naive Forecasting 1.56376e+08 12505 7519.96
6 Python Questions Naive Forecasting 1.00863e+06 1004.31 680.298
7 Machine Learning Questions Naive Forecasting 3053.46 55.2581 35.4885
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
9 SP500 Seasonal Forecasting - 60 24.0706 4.90618 3.19858
10 SP500 Seasonal Forecasting - 1440 578.648 24.0551 18.6964
11 FTSE Seasonal Forecasting - 60 358.944 18.9458 12.5549
12 FTSE Seasonal Forecasting - 1440 9305.27 96.4639 71.5358
13 Gold Seasonal Forecasting - 60 11.0817 3.32892 2.08104
14 Gold Seasonal Forecasting - 1440 297.39 17.245 12.2358
15 USDCUSDT Close Price Seasonal Forecasting 1.41478e-05 0.003761 0.00176
16 USDCUSDT Tradecount Seasonal Forecasting 2.76653e+08 16632.9 10558.5
17 Python Questions Seasonal Forecasting 4.97505e+06 2230.48 1924.32
18 Machine Learning Questions Seasonal Forecasting 14325.5 119.689 84.975
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187

Average Forecasting

# Create Average Forecasting Predictions

sp500AverageForecast60 = averageForecasting(sp500Data['Close'], 60)
sp500AverageForecast1440 = averageForecasting(sp500Data['Close'], 1440)
ftseAverageForecast60 = averageForecasting(ftseData['Close'], 60)
ftseAverageForecast1440 = averageForecasting(ftseData['Close'], 1440)
goldAverageForecast60 = averageForecasting(goldData['Close'], 60)
goldAverageForecast1440 = averageForecasting(goldData['Close'], 1440)
usdcusdtCloseAverageForecast = averageForecasting(usdcusdtDataClose['close'], 7)
usdcusdtTCAverageForecast = averageForecasting(usdcusdtDataTradeCount['tradecount'], 7)
pythonAverageForecast = averageForecasting(mlStackoverflow_data['python'], 12)
mlTopicAverageForecast = averageForecasting(mlStackoverflow_data['machine-learning'], 12)
ticketSalesAverageForecast = averageForecasting(ticketSales_data['tickets_sold'], 7)

# Appending to Result Table
averageForcastingString = 'Average Forecasting'

appendingTOR(tableOfResults, 'SP500', averageForcastingString + ' - 60', evaluateErrors(sp500AverageForecast60))
appendingTOR(tableOfResults, 'SP500', averageForcastingString + ' - 1440', evaluateErrors(sp500AverageForecast1440))
appendingTOR(tableOfResults, 'FTSE', averageForcastingString + ' - 60', evaluateErrors(ftseAverageForecast60))
appendingTOR(tableOfResults, 'FTSE', averageForcastingString + ' - 1440', evaluateErrors(ftseAverageForecast1440))
appendingTOR(tableOfResults, 'Gold', averageForcastingString + ' - 60', evaluateErrors(goldAverageForecast60))
appendingTOR(tableOfResults, 'Gold', averageForcastingString + ' - 1440', evaluateErrors(goldAverageForecast1440))
appendingTOR(tableOfResults, 'USDCUSDT Close Price', averageForcastingString, evaluateErrors(usdcusdtCloseAverageForecast))
appendingTOR(tableOfResults, 'USDCUSDT Tradecount', averageForcastingString, evaluateErrors(usdcusdtTCAverageForecast))
appendingTOR(tableOfResults, 'Python Questions', averageForcastingString, evaluateErrors(pythonAverageForecast))
appendingTOR(tableOfResults, 'Machine Learning Questions', averageForcastingString, evaluateErrors(mlTopicAverageForecast))
appendingTOR(tableOfResults, 'Ticket Sales', averageForcastingString, evaluateErrors(ticketSalesAverageForecast))

tableOfResults
Unnamed: 0 Data Algorithm MSE RMSE MAE
1 SP500 Naive Forecasting 0.315878 0.56203 0.325863
2 FTSE Naive Forecasting 5.69646 2.38673 1.29783
3 Gold Naive Forecasting 0.237289 0.487123 0.31245
4 USDCUSDT Close Price Naive Forecasting 4.01536e-06 0.002004 0.001034
5 USDCUSDT Tradecount Naive Forecasting 1.56376e+08 12505 7519.96
6 Python Questions Naive Forecasting 1.00863e+06 1004.31 680.298
7 Machine Learning Questions Naive Forecasting 3053.46 55.2581 35.4885
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
9 SP500 Seasonal Forecasting - 60 24.0706 4.90618 3.19858
10 SP500 Seasonal Forecasting - 1440 578.648 24.0551 18.6964
11 FTSE Seasonal Forecasting - 60 358.944 18.9458 12.5549
12 FTSE Seasonal Forecasting - 1440 9305.27 96.4639 71.5358
13 Gold Seasonal Forecasting - 60 11.0817 3.32892 2.08104
14 Gold Seasonal Forecasting - 1440 297.39 17.245 12.2358
15 USDCUSDT Close Price Seasonal Forecasting 1.41478e-05 0.003761 0.00176
16 USDCUSDT Tradecount Seasonal Forecasting 2.76653e+08 16632.9 10558.5
17 Python Questions Seasonal Forecasting 4.97505e+06 2230.48 1924.32
18 Machine Learning Questions Seasonal Forecasting 14325.5 119.689 84.975
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
20 SP500 Average Forecasting - 60 8.21104 2.86549 1.84062
21 SP500 Average Forecasting - 1440 195.085 13.9673 10.6287
22 FTSE Average Forecasting - 60 122.369 11.0621 7.24807
23 FTSE Average Forecasting - 1440 3105.82 55.7299 40.9431
24 Gold Average Forecasting - 60 3.79681 1.94854 1.22559
25 Gold Average Forecasting - 1440 96.5074 9.82382 6.8452
26 USDCUSDT Close Price Average Forecasting 6.51406e-06 0.002552 0.001187
27 USDCUSDT Tradecount Average Forecasting 1.44419e+08 12017.5 7727.27
28 Python Questions Average Forecasting 2.18019e+06 1476.55 1175.32
29 Machine Learning Questions Average Forecasting 7140.3 84.5003 55.785
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301

Average Differences

# Create Average Differences Prediction

sp500AverageDifferenceForecast60 = differenceAverage(sp500Data['Close'], 60)
sp500AverageDifferenceForecast1440 = differenceAverage(sp500Data['Close'], 1440)
ftseAverageDifferenceForecast60 = differenceAverage(ftseData['Close'], 60)
ftseAverageDifferenceForecast1440 = differenceAverage(ftseData['Close'], 1440)
goldAverageDifferenceForecast60 = differenceAverage(goldData['Close'], 60)
goldAverageDifferenceForecast1440 = differenceAverage(goldData['Close'], 1440)
usdcusdtCloseAverageDifferenceForecast = differenceAverage(usdcusdtDataClose['close'], 7)
usdcusdtTCAverageDifferenceForecast = differenceAverage(usdcusdtDataTradeCount['tradecount'], 7)
pythonAverageDifferenceForecast = differenceAverage(mlStackoverflow_data['python'], 12)
mlTopicAverageDifferenceForecast = differenceAverage(mlStackoverflow_data['machine-learning'], 12)
ticketSalesAverageDifferenceForecast = differenceAverage(ticketSales_data['tickets_sold'], 7)

# Appending to Result Table
averageDifferenceForecastingString = 'Average Difference Forecasting'

appendingTOR(tableOfResults, 'SP500', averageDifferenceForecastingString + ' - 60', evaluateErrors(sp500AverageDifferenceForecast60))
appendingTOR(tableOfResults, 'SP500', averageDifferenceForecastingString + ' - 1440', evaluateErrors(sp500AverageDifferenceForecast1440))
appendingTOR(tableOfResults, 'FTSE', averageDifferenceForecastingString + ' - 60', evaluateErrors(ftseAverageDifferenceForecast60))
appendingTOR(tableOfResults, 'FTSE', averageDifferenceForecastingString + ' - 1440', evaluateErrors(ftseAverageDifferenceForecast1440))
appendingTOR(tableOfResults, 'Gold', averageDifferenceForecastingString + ' - 60', evaluateErrors(goldAverageDifferenceForecast60))
appendingTOR(tableOfResults, 'Gold', averageDifferenceForecastingString + ' - 1440', evaluateErrors(goldAverageDifferenceForecast1440))
appendingTOR(tableOfResults, 'USDCUSDT Close Price', averageDifferenceForecastingString, evaluateErrors(usdcusdtCloseAverageDifferenceForecast))
appendingTOR(tableOfResults, 'USDCUSDT Tradecount', averageDifferenceForecastingString, evaluateErrors(usdcusdtTCAverageDifferenceForecast))
appendingTOR(tableOfResults, 'Python Questions', averageDifferenceForecastingString, evaluateErrors(pythonAverageDifferenceForecast))
appendingTOR(tableOfResults, 'Machine Learning Questions', averageDifferenceForecastingString, evaluateErrors(mlTopicAverageDifferenceForecast))
appendingTOR(tableOfResults, 'Ticket Sales', averageDifferenceForecastingString, evaluateErrors(ticketSalesAverageDifferenceForecast))

tableOfResults
Unnamed: 0 Data Algorithm MSE RMSE MAE
1 SP500 Naive Forecasting 0.315878 0.56203 0.325863
2 FTSE Naive Forecasting 5.69646 2.38673 1.29783
3 Gold Naive Forecasting 0.237289 0.487123 0.31245
4 USDCUSDT Close Price Naive Forecasting 4.01536e-06 0.002004 0.001034
5 USDCUSDT Tradecount Naive Forecasting 1.56376e+08 12505 7519.96
6 Python Questions Naive Forecasting 1.00863e+06 1004.31 680.298
7 Machine Learning Questions Naive Forecasting 3053.46 55.2581 35.4885
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
9 SP500 Seasonal Forecasting - 60 24.0706 4.90618 3.19858
10 SP500 Seasonal Forecasting - 1440 578.648 24.0551 18.6964
11 FTSE Seasonal Forecasting - 60 358.944 18.9458 12.5549
12 FTSE Seasonal Forecasting - 1440 9305.27 96.4639 71.5358
13 Gold Seasonal Forecasting - 60 11.0817 3.32892 2.08104
14 Gold Seasonal Forecasting - 1440 297.39 17.245 12.2358
15 USDCUSDT Close Price Seasonal Forecasting 1.41478e-05 0.003761 0.00176
16 USDCUSDT Tradecount Seasonal Forecasting 2.76653e+08 16632.9 10558.5
17 Python Questions Seasonal Forecasting 4.97505e+06 2230.48 1924.32
18 Machine Learning Questions Seasonal Forecasting 14325.5 119.689 84.975
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
20 SP500 Average Forecasting - 60 8.21104 2.86549 1.84062
21 SP500 Average Forecasting - 1440 195.085 13.9673 10.6287
22 FTSE Average Forecasting - 60 122.369 11.0621 7.24807
23 FTSE Average Forecasting - 1440 3105.82 55.7299 40.9431
24 Gold Average Forecasting - 60 3.79681 1.94854 1.22559
25 Gold Average Forecasting - 1440 96.5074 9.82382 6.8452
26 USDCUSDT Close Price Average Forecasting 6.51406e-06 0.002552 0.001187
27 USDCUSDT Tradecount Average Forecasting 1.44419e+08 12017.5 7727.27
28 Python Questions Average Forecasting 2.18019e+06 1476.55 1175.32
29 Machine Learning Questions Average Forecasting 7140.3 84.5003 55.785
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
31 SP500 Average Difference Forecasting - 60 0.322452 0.567849 0.332982
32 SP500 Average Difference Forecasting - 1440 0.316412 0.562505 0.326336
33 FTSE Average Difference Forecasting - 60 5.79875 2.40806 1.32178
34 FTSE Average Difference Forecasting - 1440 5.70723 2.38898 1.30033
35 Gold Average Difference Forecasting - 60 0.240423 0.49033 0.315994
36 Gold Average Difference Forecasting - 1440 0.237595 0.487437 0.313009
37 USDCUSDT Close Price Average Difference Forecasting 4.53094e-06 0.002129 0.001103
38 USDCUSDT Tradecount Average Difference Forecasting 1.64095e+08 12810 7724.69
39 Python Questions Average Difference Forecasting 1.1445e+06 1069.81 751.801
40 Machine Learning Questions Average Difference Forecasting 3399.79 58.3077 38.5563
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102

Naive Difference

# Create Naive Difference Prediction

sp500NaiveDifferenceForecast = naiveDifferenceForecasting(sp500Data['Close'])
ftseNaiveDifferenceForecast = naiveDifferenceForecasting(ftseData['Close'])
goldNaiveDifferenceForecast = naiveDifferenceForecasting(goldData['Close'])
usdcusdtCloseNaiveDifferenceForecast = naiveDifferenceForecasting(usdcusdtDataClose['close'])
usdcusdtTCNaiveDifferenceForecast = naiveDifferenceForecasting(usdcusdtDataTradeCount['tradecount'])
pythonNaiveDifferenceForecast = naiveDifferenceForecasting(mlStackoverflow_data['python'])
mlTopicNaiveDifferenceForecast = naiveDifferenceForecasting(mlStackoverflow_data['machine-learning'])
ticketSalesNaiveDifferenceForecast = naiveDifferenceForecasting(ticketSales_data['tickets_sold'])

# Appending to Result Table
naiveDifferenceForecastingString = 'Naive Difference Forecasting'

appendingTOR(tableOfResults, 'SP500', naiveDifferenceForecastingString, evaluateErrors(sp500NaiveDifferenceForecast))
appendingTOR(tableOfResults, 'FTSE', naiveDifferenceForecastingString, evaluateErrors(ftseNaiveDifferenceForecast))
appendingTOR(tableOfResults, 'Gold', naiveDifferenceForecastingString, evaluateErrors(goldNaiveDifferenceForecast))
appendingTOR(tableOfResults, 'USDCUSDT Close Price', naiveDifferenceForecastingString, evaluateErrors(usdcusdtCloseNaiveDifferenceForecast))
appendingTOR(tableOfResults, 'USDCUSDT Tradecount', naiveDifferenceForecastingString, evaluateErrors(usdcusdtTCNaiveDifferenceForecast))
appendingTOR(tableOfResults, 'Python Questions', naiveDifferenceForecastingString, evaluateErrors(pythonNaiveDifferenceForecast))
appendingTOR(tableOfResults, 'Machine Learning Questions', naiveDifferenceForecastingString, evaluateErrors(mlTopicNaiveDifferenceForecast))
appendingTOR(tableOfResults, 'Ticket Sales', naiveDifferenceForecastingString, evaluateErrors(ticketSalesNaiveDifferenceForecast))

tableOfResults
Unnamed: 0 Data Algorithm MSE RMSE MAE
1 SP500 Naive Forecasting 0.315878 0.56203 0.325863
2 FTSE Naive Forecasting 5.69646 2.38673 1.29783
3 Gold Naive Forecasting 0.237289 0.487123 0.31245
4 USDCUSDT Close Price Naive Forecasting 4.01536e-06 0.002004 0.001034
5 USDCUSDT Tradecount Naive Forecasting 1.56376e+08 12505 7519.96
6 Python Questions Naive Forecasting 1.00863e+06 1004.31 680.298
7 Machine Learning Questions Naive Forecasting 3053.46 55.2581 35.4885
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
9 SP500 Seasonal Forecasting - 60 24.0706 4.90618 3.19858
10 SP500 Seasonal Forecasting - 1440 578.648 24.0551 18.6964
11 FTSE Seasonal Forecasting - 60 358.944 18.9458 12.5549
12 FTSE Seasonal Forecasting - 1440 9305.27 96.4639 71.5358
13 Gold Seasonal Forecasting - 60 11.0817 3.32892 2.08104
14 Gold Seasonal Forecasting - 1440 297.39 17.245 12.2358
15 USDCUSDT Close Price Seasonal Forecasting 1.41478e-05 0.003761 0.00176
16 USDCUSDT Tradecount Seasonal Forecasting 2.76653e+08 16632.9 10558.5
17 Python Questions Seasonal Forecasting 4.97505e+06 2230.48 1924.32
18 Machine Learning Questions Seasonal Forecasting 14325.5 119.689 84.975
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
20 SP500 Average Forecasting - 60 8.21104 2.86549 1.84062
21 SP500 Average Forecasting - 1440 195.085 13.9673 10.6287
22 FTSE Average Forecasting - 60 122.369 11.0621 7.24807
23 FTSE Average Forecasting - 1440 3105.82 55.7299 40.9431
24 Gold Average Forecasting - 60 3.79681 1.94854 1.22559
25 Gold Average Forecasting - 1440 96.5074 9.82382 6.8452
26 USDCUSDT Close Price Average Forecasting 6.51406e-06 0.002552 0.001187
27 USDCUSDT Tradecount Average Forecasting 1.44419e+08 12017.5 7727.27
28 Python Questions Average Forecasting 2.18019e+06 1476.55 1175.32
29 Machine Learning Questions Average Forecasting 7140.3 84.5003 55.785
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
31 SP500 Average Difference Forecasting - 60 0.322452 0.567849 0.332982
32 SP500 Average Difference Forecasting - 1440 0.316412 0.562505 0.326336
33 FTSE Average Difference Forecasting - 60 5.79875 2.40806 1.32178
34 FTSE Average Difference Forecasting - 1440 5.70723 2.38898 1.30033
35 Gold Average Difference Forecasting - 60 0.240423 0.49033 0.315994
36 Gold Average Difference Forecasting - 1440 0.237595 0.487437 0.313009
37 USDCUSDT Close Price Average Difference Forecasting 4.53094e-06 0.002129 0.001103
38 USDCUSDT Tradecount Average Difference Forecasting 1.64095e+08 12810 7724.69
39 Python Questions Average Difference Forecasting 1.1445e+06 1069.81 751.801
40 Machine Learning Questions Average Difference Forecasting 3399.79 58.3077 38.5563
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102
42 SP500 Naive Difference Forecasting 0.553753 0.744146 0.456603
43 FTSE Naive Difference Forecasting 10.796 3.28573 1.81081
44 Gold Naive Difference Forecasting 0.521205 0.721946 0.480641
45 USDCUSDT Close Price Naive Difference Forecasting 9.13676e-06 0.003023 0.001659
46 USDCUSDT Tradecount Naive Difference Forecasting 4.15266e+08 20378.1 12089.8
47 Python Questions Naive Difference Forecasting 2.34416e+06 1531.07 1117.71
48 Machine Learning Questions Naive Difference Forecasting 6678.13 81.7198 51.9462
49 Ticket Sales Naive Difference Forecasting 37594.4 193.893 136.398

Analysis

Because MSE, RMSE, and MAE are scale-dependent, their raw values cannot be compared across datasets; they are only meaningful when comparing algorithms on the same dataset. For example, the errors on the USDCUSDT close price are the smallest in the table, but that does not mean Average Difference Forecasting on USDCUSDT is better than Naive Difference Forecasting on Ticket Sales. Let's start with S&P 500, Python Questions, and Ticket Sales and see what we can observe.

display(tableOfResults[tableOfResults['Data'] == 'SP500'].sort_values('MSE'),
        tableOfResults[tableOfResults['Data'] == 'Python Questions'].sort_values('MSE'),
        tableOfResults[tableOfResults['Data'] == 'Ticket Sales'].sort_values('MSE'),
)
Unnamed: 0 Data Algorithm MSE RMSE MAE
1 SP500 Naive Forecasting 0.315878 0.56203 0.325863
32 SP500 Average Difference Forecasting - 1440 0.316412 0.562505 0.326336
31 SP500 Average Difference Forecasting - 60 0.322453 0.567849 0.332982
42 SP500 Naive Difference Forecasting 0.553753 0.744146 0.456603
20 SP500 Average Forecasting - 60 8.21104 2.86549 1.84062
9 SP500 Seasonal Forecasting - 60 24.0706 4.90618 3.19858
21 SP500 Average Forecasting - 1440 195.085 13.9673 10.6287
10 SP500 Seasonal Forecasting - 1440 578.648 24.0551 18.6964
Unnamed: 0 Data Algorithm MSE RMSE MAE
6 Python Questions Naive Forecasting 1.00863e+06 1004.31 680.298
39 Python Questions Average Difference Forecasting 1.1445e+06 1069.81 751.801
28 Python Questions Average Forecasting 2.18019e+06 1476.55 1175.32
47 Python Questions Naive Difference Forecasting 2.34416e+06 1531.07 1117.71
17 Python Questions Seasonal Forecasting 4.97505e+06 2230.48 1924.32
Unnamed: 0 Data Algorithm MSE RMSE MAE
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102
49 Ticket Sales Naive Difference Forecasting 37594.4 193.893 136.398

The error values differ by orders of magnitude from one dataset to the next. Within each dataset, Naive Forecasting has the lowest MSE for SP500 and Python Questions, while Seasonal Forecasting leads for Ticket Sales. Let's list the two best algorithms for every dataset.

for dataset in tableOfResults['Data'].unique():
    display(tableOfResults[tableOfResults['Data'] == dataset].sort_values('MSE').head(2))
Unnamed: 0 Data Algorithm MSE RMSE MAE
1 SP500 Naive Forecasting 0.315878 0.56203 0.325863
32 SP500 Average Difference Forecasting - 1440 0.316412 0.562505 0.326336
Unnamed: 0 Data Algorithm MSE RMSE MAE
2 FTSE Naive Forecasting 5.69646 2.38673 1.29783
34 FTSE Average Difference Forecasting - 1440 5.70723 2.38898 1.30033
Unnamed: 0 Data Algorithm MSE RMSE MAE
3 Gold Naive Forecasting 0.237289 0.487123 0.31245
36 Gold Average Difference Forecasting - 1440 0.237595 0.487437 0.313009
Unnamed: 0 Data Algorithm MSE RMSE MAE
4 USDCUSDT Close Price Naive Forecasting 4e-06 0.002004 0.001034
37 USDCUSDT Close Price Average Difference Forecasting 5e-06 0.002129 0.001103
Unnamed: 0 Data Algorithm MSE RMSE MAE
27 USDCUSDT Tradecount Average Forecasting 1.44419e+08 12017.5 7727.27
5 USDCUSDT Tradecount Naive Forecasting 1.56376e+08 12505 7519.96
Unnamed: 0 Data Algorithm MSE RMSE MAE
6 Python Questions Naive Forecasting 1.00863e+06 1004.31 680.298
39 Python Questions Average Difference Forecasting 1.1445e+06 1069.81 751.801
Unnamed: 0 Data Algorithm MSE RMSE MAE
7 Machine Learning Questions Naive Forecasting 3053.46 55.2581 35.4885
40 Machine Learning Questions Average Difference Forecasting 3399.79 58.3077 38.5563
Unnamed: 0 Data Algorithm MSE RMSE MAE
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301

Naive Forecasting is the best choice so far for six of the eight series, with Average Difference Forecasting in second place. The exceptions are Ticket Sales, where Seasonal and Average Forecasting take the top two spots, and USDCUSDT Tradecount, where Average Forecasting has the lowest MSE and Naive Forecasting comes second. A likely explanation is that these two series have a more prominent repeating pattern that the other methods can exploit, while the remaining series behave more randomly.

Note also that Naive Forecasting and Average Difference Forecasting are the best algorithms only in terms of MSE, RMSE, and MAE. If we instead set trading or investment rules and evaluated performance differently, the ranking might change.
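The earlier point that raw MSE values cannot be compared across datasets can be made concrete by dividing each algorithm's MSE by the Naive Forecasting MSE on the same dataset. The sketch below is only an illustration: it copies a handful of MSE values from the results table into a small, separate DataFrame, and the column name `MSE vs naive` is an assumption introduced here, not part of the original notebook.

```python
import pandas as pd

# A few (Data, Algorithm, MSE) rows copied from the results table above.
rows = [
    ("SP500", "Naive Forecasting", 0.315878),
    ("SP500", "Average Difference Forecasting - 1440", 0.316412),
    ("SP500", "Average Forecasting - 60", 8.21104),
    ("Ticket Sales", "Naive Forecasting", 12476.0),
    ("Ticket Sales", "Seasonal Forecasting", 3922.4),
    ("Ticket Sales", "Average Forecasting", 6061.2),
]
results = pd.DataFrame(rows, columns=["Data", "Algorithm", "MSE"])

# MSE of the naive baseline, indexed by dataset name.
naive_mse = (
    results[results["Algorithm"] == "Naive Forecasting"]
    .set_index("Data")["MSE"]
)

# Hypothetical helper column: a ratio below 1 means the algorithm beats
# the naive baseline on that dataset; above 1 means it is worse.
results["MSE vs naive"] = results["MSE"] / results["Data"].map(naive_mse)

print(results.sort_values(["Data", "MSE vs naive"]))
```

Because the ratio is unitless, it shows at a glance that Seasonal Forecasting cuts the error well below the naive baseline on Ticket Sales, while on SP500 every other method sits at or above it.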

Autoregression Prediction

Before we can use autoregression, we need to make the datasets stationary and then inspect their autocorrelation (ACF) and partial autocorrelation (PACF).

Choosing Dataset

We have many datasets to choose from. We will use ticket sales (because it shows the most obvious seasonal changes), machine learning questions from StackOverflow (because it is multiplicative), and the SP500 and Gold datasets, as they are real-world financial data. Because the SP500 and Gold data are very large, we will only use 3 months of each.

We also need to split each dataset into training and testing sets. We will use 70% for training and 30% for testing.
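A 70/30 split of a time series is chronological: the first 70% of observations are used for training and the remaining 30% for testing, with no shuffling. The helper below is a minimal sketch of that idea; the function name `chronological_split` is hypothetical and does not appear in the original notebook, and the 231-element array only stands in for a series of that length.

```python
import numpy as np


def chronological_split(series, train_fraction=0.7):
    """Split a time-ordered series into (train, test) without shuffling."""
    split_point = int(len(series) * train_fraction)
    return series[:split_point], series[split_point:]


# Stand-in data: 231 time-ordered observations.
observations = np.arange(231)
train, test = chronological_split(observations)

print(len(observations), len(train), len(test))
```

Keeping the order intact matters because the test period must come strictly after the training period; otherwise the model would be evaluated on data that precedes what it was fitted on.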

Autocorrelation and Partial Autocorrelation

# Importing plot_acf and plot_pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

Gold Data

Making it stationary
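# Note: this assignment is an alias, so 'observed' is also added to goldData
stationaryGoldData = goldData
# First-order difference of the close price
stationaryGoldData['observed'] = goldData['Close'] - goldData['Close'].shift()
# Drop rows with NaN (such as the first row created by shift()) and the price column
stationaryGoldData = stationaryGoldData.dropna().drop('Close', axis=1)
# Remove Saturday rows (dayofweek == 5)
stationaryGoldData = stationaryGoldData[stationaryGoldData.index.dayofweek != 5]
# Keep the first 108,000 rows
stationaryGoldData = stationaryGoldData[:108000]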

kpss_test(stationaryGoldData)
print()
adftest(stationaryGoldData)
stationaryGoldData.plot()
Visualization
Setting the training and testing set
stationaryGoldDataLength = len(stationaryGoldData)
traniningStationaryGoldDataLength = int(len(stationaryGoldData) * 0.7)
testingStationaryGoldDataLength = stationaryGoldDataLength - traniningStationaryGoldDataLength

print(stationaryGoldDataLength, traniningStationaryGoldDataLength, testingStationaryGoldDataLength)
108000 75600 32400
ACF
plot_acf(stationaryGoldData[:traniningStationaryGoldDataLength],lags=60)
Visualization
PACF
plot_pacf(stationaryGoldData[:traniningStationaryGoldDataLength], lags = 60)
Visualization

S&P500

Making it stationary
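# Note: this assignment is an alias, so 'observed' is also added to sp500Data
stationarySP500Data = sp500Data
# First-order difference of the close price
stationarySP500Data['observed'] = sp500Data['Close'] - sp500Data['Close'].shift()
# Drop rows with NaN (such as the first row created by shift()) and the price column
stationarySP500Data = stationarySP500Data.dropna().drop('Close', axis=1)
# Remove Saturday rows (dayofweek == 5) and sort by timestamp
stationarySP500Data = stationarySP500Data[stationarySP500Data.index.dayofweek != 5].sort_index()
# Keep the first 36,000 rows
stationarySP500Data = stationarySP500Data[:36000]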

kpss_test(stationarySP500Data)
print()
adftest(stationarySP500Data)
stationarySP500Data.plot()
Visualization
Setting the training and testing set
stationarySP500DataLength = len(stationarySP500Data)
trainingStationarySP500Data = int(0.7 * stationarySP500DataLength)
testingStationarySP500Data = stationarySP500DataLength - trainingStationarySP500Data

print(stationarySP500DataLength, trainingStationarySP500Data, testingStationarySP500Data)
36000 25200 10800
ACF
plot_acf(stationarySP500Data[:trainingStationarySP500Data], lags=60)
plt.xticks([x for x in range(0,61,2)])
Visualization
PACF
plot_pacf(stationarySP500Data[:trainingStationarySP500Data], lags=60)
plt.xticks([x for x in range(0,61,2)])
Visualization

Machine Learning Questions

Because the dataset is multiplicative, we need to take its logarithm first and then difference it.
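The reason for taking the logarithm first is that a series growing by a roughly constant percentage has differences that grow with its level, whereas the differences of its logarithm stay flat. The following sketch shows this on a synthetic series that grows by 10% per step; the data is invented purely for illustration and is not taken from the StackOverflow dataset.

```python
import numpy as np

# Hypothetical multiplicative series: starts at 100 and grows 10% per step.
steps = np.arange(12)
series = 100.0 * 1.10 ** steps

# Plain first differences keep growing as the level rises.
raw_diff = np.diff(series)

# Differences of the log series are constant and equal to log(1.10).
log_diff = np.diff(np.log(series))

# The transform is invertible: cumulatively sum the log differences,
# add back the first log value, and exponentiate.
cumulative = np.concatenate([[0.0], np.cumsum(log_diff)])
recovered = np.exp(np.log(series[0]) + cumulative)

print(raw_diff[0], raw_diff[-1])
print(log_diff[0], log_diff[-1])
```

The last step is what allows forecasts made on the log-differenced scale to be mapped back to the original units.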

Making it Stationary
stationaryMLQuestions = mlStackoverflow_data[['month','machine-learning']]

# Setting the month as index and also removing the month column.
stationaryMLQuestions = stationaryMLQuestions.set_index(
    pd.to_datetime(stationaryMLQuestions['month'],format='%y-%b'
)).drop('month', axis=1)

stationaryMLQuestions = np.log(stationaryMLQuestions)
stationaryMLQuestions.plot()
Visualization
kpss_test(stationaryMLQuestions)
print()
adftest(stationaryMLQuestions)
KPSS Stat:  1.873246798025978
p-value:  0.01
Number of lags: 6
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is not stationary

AdfTest Stat:  -2.668195595640323
p-value:  0.07970535704695508
Number of lags: 2
Number of observation: 129
Critial Values:
   1% : -3.482087964046026
   5% : -2.8842185101614626
   10% : -2.578864381347275
The series is not stationary
stationaryMLQuestions['observed'] = stationaryMLQuestions['machine-learning'] - stationaryMLQuestions['machine-learning'].shift()
stationaryMLQuestions = stationaryMLQuestions.dropna().drop('machine-learning', axis=1)
stationaryMLQuestions.plot()
Visualization
kpss_test(stationaryMLQuestions)
print()
adftest(stationaryMLQuestions)
KPSS Stat:  0.28055757025438893
p-value:  0.1
Number of lags: 55
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary

AdfTest Stat:  -13.691197822513615
p-value:  1.342938733545792e-25
Number of lags: 1
Number of observation: 129
Critial Values:
   1% : -3.482087964046026
   5% : -2.8842185101614626
   10% : -2.578864381347275
The series is stationary
Setting the training and testing set
stationaryMLQuestionsLength = len(stationaryMLQuestions)
trainingStationaryMLQuestions = int(stationaryMLQuestionsLength * 0.7)
testingStationaryMLQuestions = stationaryMLQuestionsLength - trainingStationaryMLQuestions

print(stationaryMLQuestionsLength, trainingStationaryMLQuestions, testingStationaryMLQuestions )
131 91 40
ACF
plot_acf(stationaryMLQuestions[:trainingStationaryMLQuestions], lags=60)
plt.xticks([x for x in range(0,61,2)])
Visualization
PACF
plot_pacf(stationaryMLQuestions[:trainingStationaryMLQuestions], lags=44)
plt.xticks([x for x in range(0,45,2)])
Visualization

Cinema Tickets

Making it Stationary
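# Note: this assignment is an alias, so 'observed' is also added to ticketSales_data
stationaryTicketSales = ticketSales_data
# First-order difference of tickets sold
stationaryTicketSales['observed'] = stationaryTicketSales['tickets_sold'] - stationaryTicketSales['tickets_sold'].shift()
# Keep only the differenced series and drop the NaN created by shift()
stationaryTicketSales = stationaryTicketSales['observed'].dropna()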

kpss_test(stationaryTicketSales)
print()
adftest(stationaryTicketSales)
KPSS Stat:  0.14388225215810577
p-value:  0.1
Number of lags: 62
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary

AdfTest Stat:  -6.734609879307423
p-value:  3.2311867005854417e-09
Number of lags: 12
Number of observation: 218
Critial Values:
   1% : -3.460707667106296
   5% : -2.874891213486339
   10% : -2.573885987711472
The series is stationary
stationaryTicketSales.plot()
Visualization
Setting the training set and testing set
stationaryTicketSalesLength = len(stationaryTicketSales)
trainingStationaryTicketSales = int(stationaryTicketSalesLength * 0.7)
testingStationaryTicketSales = stationaryTicketSalesLength - trainingStationaryTicketSales

print(stationaryTicketSalesLength, trainingStationaryTicketSales, testingStationaryTicketSales)
231 161 70
ACF
plot_acf(stationaryTicketSales[:trainingStationaryTicketSales], lags=60)
plt.xticks([x for x in range(0,61,2)])
Visualization
PACF
plot_pacf(stationaryTicketSales[:trainingStationaryTicketSales], lags=60)
plt.xticks([x for x in range(0,61,2)])
Visualization

Prediction

We now move on to prediction. We will use the PACF plots to decide which lags to use for the autoregression.
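Reading lags off a PACF plot amounts to picking the lags whose partial autocorrelation falls outside the approximate 95% band of ±1.96/√n. The sketch below recomputes the PACF by hand with NumPy (the coefficient on the last lag of successive least-squares AR(k) fits) on a synthetic AR(1) series. The helper names `pacf_ols` and `significant_lags` are hypothetical, and this is an illustration of the idea rather than the procedure `plot_pacf` uses internally.

```python
import numpy as np


def pacf_ols(x, max_lag):
    """Partial autocorrelations for lags 1..max_lag via successive OLS AR(k) fits."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    values = []
    for k in range(1, max_lag + 1):
        target = x[k:]
        # Column j holds the series lagged by j steps (j = 1..k).
        lagged = np.column_stack([x[k - j:n - j] for j in range(1, k + 1)])
        coefs, *_ = np.linalg.lstsq(lagged, target, rcond=None)
        # The coefficient on the longest lag is the partial autocorrelation at lag k.
        values.append(coefs[-1])
    return np.array(values)


def significant_lags(x, max_lag):
    """Lags whose PACF lies outside the approximate 95% band."""
    band = 1.96 / np.sqrt(len(x))
    return [lag for lag, value in enumerate(pacf_ols(x, max_lag), start=1)
            if abs(value) > band]


# Synthetic AR(1) series with coefficient -0.5 (invented for illustration).
rng = np.random.default_rng(0)
n_obs = 5000
noise = rng.normal(size=n_obs)
series = np.zeros(n_obs)
for t in range(1, n_obs):
    series[t] = -0.5 * series[t - 1] + noise[t]

print(significant_lags(series, max_lag=10))
```

For an AR(1) process only lag 1 should stand out clearly, although with a 95% band an occasional higher lag can cross the threshold by chance.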

Gold

from statsmodels.tsa.ar_model import AutoReg
goldResults = AutoReg(stationaryGoldData[:traniningStationaryGoldDataLength], [1,2,3]).fit()
goldResults.summary()

AutoReg Model Results

Dep. Variable: observed No. Observations: 75600
Model: AutoReg(3) Log Likelihood -30274.411
Method: Conditional MLE S.D. of innovations 0.361
Date: Mon, 08 Jul 2024 AIC 60558.822
Time: 15:38:54 BIC 60604.988
Sample: 3 HQIC 60573.011
75600
coef std err z P>|z| [0.025 0.975]
const -0.0008 0.001 -0.617 0.537 -0.003 0.002
observed.L1 -0.1528 0.004 -42.018 0.000 -0.160 -0.146
observed.L2 -0.0332 0.004 -9.021 0.000 -0.040 -0.026
observed.L3 -0.0047 0.004 -1.282 0.200 -0.012 0.002

Roots

Real Imaginary Modulus Frequency
AR.1 -0.1166 -5.5814j 5.5827 -0.2533
AR.2 -0.1166 +5.5814j 5.5827 0.2533
AR.3 -6.8805 -0.0000j 6.8805 -0.5000
goldResults.params
const         -0.000811
observed.L1   -0.152820
observed.L2   -0.033173
observed.L3   -0.004663
dtype: float64
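To see what these coefficients mean, here is a single one-step AR(3) calculation done by hand: the predicted difference is the constant plus each coefficient times the corresponding lagged difference, and the predicted price is the previous close plus that difference. The coefficients are the ones printed above; the last close and the three recent differences are hypothetical numbers chosen only for illustration, and the sketch covers one step rather than the full test range.

```python
import numpy as np

# Coefficients as printed by goldResults.params above.
const = -0.000811
ar_coefs = np.array([-0.152820, -0.033173, -0.004663])  # lags 1, 2, 3

# Hypothetical recent history (not taken from the gold dataset).
last_close = 1250.00
recent_diffs = np.array([0.30, -0.20, 0.10])  # d[t-1], d[t-2], d[t-3]

# AR(3) on the differenced series: d[t] = const + sum(coef_i * d[t-i]).
predicted_diff = const + ar_coefs @ recent_diffs

# Undo the first-order differencing by adding the previous close.
predicted_close = last_close + predicted_diff

print(predicted_diff, predicted_close)
```

With a negative lag-1 coefficient, a positive move in the last step pulls the next predicted difference slightly below zero, so the predicted price stays close to the previous close.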
predictions = goldResults.predict(start=traniningStationaryGoldDataLength, end=stationaryGoldDataLength-1) + \
    gold_data['Close'].shift()[traniningStationaryGoldDataLength:stationaryGoldDataLength]

stationaryGoldDataResults = pd.DataFrame(goldData.reset_index()['Close'][traniningStationaryGoldDataLength:stationaryGoldDataLength].rename('observed'))
stationaryGoldDataResults['predicted'] = predictions
stationaryGoldDataResults.plot()
Visualization
appendingTOR(tableOfResults, 'Gold', 'Autoregression', evaluateErrors(stationaryGoldDataResults))
tableOfResults
Unnamed: 0 Data Algorithm MSE RMSE MAE
1 SP500 Naive Forecasting 0.315878 0.56203 0.325863
2 FTSE Naive Forecasting 5.69646 2.38673 1.29783
3 Gold Naive Forecasting 0.237289 0.487123 0.31245
4 USDCUSDT Close Price Naive Forecasting 4.01536e-06 0.002004 0.001034
5 USDCUSDT Tradecount Naive Forecasting 1.56376e+08 12505 7519.96
6 Python Questions Naive Forecasting 1.00863e+06 1004.31 680.298
7 Machine Learning Questions Naive Forecasting 3053.46 55.2581 35.4885
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
9 SP500 Seasonal Forecasting - 60 24.0706 4.90618 3.19858
10 SP500 Seasonal Forecasting - 1440 578.648 24.0551 18.6964
11 FTSE Seasonal Forecasting - 60 358.944 18.9458 12.5549
12 FTSE Seasonal Forecasting - 1440 9305.27 96.4639 71.5358
13 Gold Seasonal Forecasting - 60 11.0817 3.32892 2.08104
14 Gold Seasonal Forecasting - 1440 297.39 17.245 12.2358
15 USDCUSDT Close Price Seasonal Forecasting 1.41478e-05 0.003761 0.00176
16 USDCUSDT Tradecount Seasonal Forecasting 2.76653e+08 16632.9 10558.5
17 Python Questions Seasonal Forecasting 4.97505e+06 2230.48 1924.32
18 Machine Learning Questions Seasonal Forecasting 14325.5 119.689 84.975
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
20 SP500 Average Forecasting - 60 8.21104 2.86549 1.84062
21 SP500 Average Forecasting - 1440 195.085 13.9673 10.6287
22 FTSE Average Forecasting - 60 122.369 11.0621 7.24807
23 FTSE Average Forecasting - 1440 3105.82 55.7299 40.9431
24 Gold Average Forecasting - 60 3.79681 1.94854 1.22559
25 Gold Average Forecasting - 1440 96.5074 9.82382 6.8452
26 USDCUSDT Close Price Average Forecasting 6.51406e-06 0.002552 0.001187
27 USDCUSDT Tradecount Average Forecasting 1.44419e+08 12017.5 7727.27
28 Python Questions Average Forecasting 2.18019e+06 1476.55 1175.32
29 Machine Learning Questions Average Forecasting 7140.3 84.5003 55.785
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
31 SP500 Average Difference Forecasting - 60 0.322452 0.567849 0.332982
32 SP500 Average Difference Forecasting - 1440 0.316412 0.562505 0.326336
33 FTSE Average Difference Forecasting - 60 5.79875 2.40806 1.32178
34 FTSE Average Difference Forecasting - 1440 5.70723 2.38898 1.30033
35 Gold Average Difference Forecasting - 60 0.240423 0.49033 0.315994
36 Gold Average Difference Forecasting - 1440 0.237595 0.487437 0.313009
37 USDCUSDT Close Price Average Difference Forecasting 4.53094e-06 0.002129 0.001103
38 USDCUSDT Tradecount Average Difference Forecasting 1.64095e+08 12810 7724.69
39 Python Questions Average Difference Forecasting 1.1445e+06 1069.81 751.801
40 Machine Learning Questions Average Difference Forecasting 3399.79 58.3077 38.5563
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102
42 SP500 Naive Difference Forecasting 0.553753 0.744146 0.456603
43 FTSE Naive Difference Forecasting 10.796 3.28573 1.81081
44 Gold Naive Difference Forecasting 0.521205 0.721946 0.480641
45 USDCUSDT Close Price Naive Difference Forecasting 9.13676e-06 0.003023 0.001659
46 USDCUSDT Tradecount Naive Difference Forecasting 4.15266e+08 20378.1 12089.8
47 Python Questions Naive Difference Forecasting 2.34416e+06 1531.07 1117.71
48 Machine Learning Questions Naive Difference Forecasting 6678.13 81.7198 51.9462
49 Ticket Sales Naive Difference Forecasting 37594.4 193.893 136.398
50 Gold Autoregression 0.167187 0.408885 0.270593

S&P500

sp500Results = AutoReg(stationarySP500Data[:trainingStationarySP500Data],[1,2,3]).fit()
sp500Results.summary()

AutoReg Model Results

Dep. Variable: observed No. Observations: 25200
Model: AutoReg(3) Log Likelihood -26406.1
Method: Conditional MLE S.D. of innovations 0.69
Date: Mon, 08 Jul 2024 AIC 52822.1
Time: 15:38:55 BIC 52862.8
Sample: 3 HQIC 52835.3
25200
coef std err z P>|z| [0.025 0.975]
const -0.0081 0.004 -1.865 0.062 -0.017 0.000
observed.L1 0.0963 0.006 15.284 0.000 0.084 0.109
observed.L2 0.0258 0.006 4.074 0.000 0.013 0.038
observed.L3 -0.0102 0.006 -1.612 0.107 -0.023 0.002

Roots

Real Imaginary Modulus Frequency
AR.1 -4.4810 -0.0000j 4.4810 -0.5000
AR.2 3.5094 -3.1074j 4.6874 -0.1153
AR.3 3.5094 +3.1074j 4.6874 0.1153
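
The Roots table can be reproduced from the fitted lag coefficients. The sketch below is an illustration and not part of the original notebook: it hard-codes the rounded L1 to L3 coefficients from the summary above, and ar_roots is a hypothetical helper name. An AR process is stationary when every root of the lag polynomial 1 - phi_1 z - phi_2 z^2 - phi_3 z^3 lies outside the unit circle, i.e. has modulus greater than 1.

```python
import numpy as np

def ar_roots(lag_coeffs):
    """Roots of the lag polynomial 1 - phi_1 z - ... - phi_p z^p."""
    # np.roots expects coefficients from the highest power down to the constant.
    poly = [-c for c in reversed(lag_coeffs)] + [1.0]
    return np.roots(poly)

# Rounded L1..L3 coefficients copied from the S&P 500 summary above.
sp500_lags = [0.0963, 0.0258, -0.0102]
roots = ar_roots(sp500_lags)
moduli = np.abs(roots)

# Stationary if every root lies outside the unit circle.
is_stationary = bool(np.all(moduli > 1))
print(np.sort(moduli), is_stationary)
```

Because the coefficients are rounded to four decimals, the moduli only approximately match the values statsmodels reports.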
sp500Data['Close'].shift()[trainingStationarySP500Data: stationarySP500DataLength]
Datetime
2014-01-03 15:43:00    1837.18
2014-01-03 15:42:00    1837.49
2014-01-03 15:41:00    1837.61
2014-01-03 15:40:00    1837.14
2014-01-03 15:39:00    1836.72
                        ...
2013-11-20 20:39:00    1780.69
2013-11-20 20:38:00    1779.84
2013-11-20 20:37:00    1779.54
2013-11-20 20:36:00    1779.17
2013-11-20 20:35:00    1778.87
Name: Close, Length: 10800, dtype: float64
sp500Results.predict(start=trainingStationarySP500Data, end=stationarySP500DataLength-1)
c:\Users\Edwin Teoh\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:834: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
  return get_prediction_index(
c:\Users\Edwin Teoh\anaconda3\lib\site-packages\statsmodels\tsa\deterministic.py:302: UserWarning: Only PeriodIndexes, DatetimeIndexes with a frequency set, RangesIndexes, and Index with a unit increment support extending. The index is set will contain the position relative to the data length.
  fcast_index = self._extend_index(index, steps, forecast_index)
25200    0.005949
25201   -0.002961
25202   -0.009767
25203   -0.009188
25204   -0.009218
           ...
35995   -0.009133
35996   -0.009133
35997   -0.009133
35998   -0.009133
35999   -0.009133
Length: 10800, dtype: float64
predictions = sp500Results.predict(start=trainingStationarySP500Data, end=stationarySP500DataLength-1) + \
    sp500Data.reset_index()['Close'].shift()[trainingStationarySP500Data: stationarySP500DataLength]

stationarySP500DataResults = pd.DataFrame(
    sp500Data.reset_index()['Close'][trainingStationarySP500Data:stationarySP500DataLength].rename('observed'))

stationarySP500DataResults['predicted'] = predictions
stationarySP500DataResults.plot()
Visualization
appendingTOR(tableOfResults, 'SP500', 'Autoregression', evaluateErrors(stationarySP500DataResults))
tableOfResults
Unnamed: 0 Data Algorithm MSE RMSE MAE
...
51 SP500 Autoregression 0.181846 0.426434 0.236095

Machine Learning Questions

mlQuestionsResults = AutoReg(stationaryMLQuestions[:trainingStationaryMLQuestions], [1, 2, 7, 14, 28, 31]).fit()
mlQuestionsResults.summary()

AutoReg Model Results

Dep. Variable: observed No. Observations: 91
Model: Restr. AutoReg(31) Log Likelihood 24.827
Method: Conditional MLE S.D. of innovations 0.16
Date: Mon, 08 Jul 2024 AIC -33.654
Time: 15:38:55 BIC -16.899
Sample: 09-01-2011 HQIC -27.1
- 08-01-2016
coef std err z P>|z| [0.025 0.975]
const 0.0658 0.023 2.894 0.004 0.021 0.110
observed.L1 -0.4974 0.116 -4.289 0.000 -0.725 -0.270
observed.L2 -0.2309 0.124 -1.862 0.063 -0.474 0.012
observed.L7 0.2036 0.099 2.049 0.040 0.009 0.398
observed.L14 -0.0429 0.089 -0.483 0.629 -0.217 0.131
observed.L28 -0.0634 0.055 -1.158 0.247 -0.171 0.044
observed.L31 -0.0349 0.050 -0.695 0.487 -0.133 0.064

Roots

Real Imaginary Modulus Frequency
AR.1 -0.8987 -0.5721j 1.0654 -0.4098
AR.2 -0.8987 +0.5721j 1.0654 0.4098
AR.3 -1.0050 -0.3868j 1.0769 -0.4415
AR.4 -1.0050 +0.3868j 1.0769 0.4415
AR.5 -1.1138 -0.1904j 1.1300 -0.4731
AR.6 -1.1138 +0.1904j 1.1300 0.4731
AR.7 -1.2513 -0.0000j 1.2513 -0.5000
AR.8 -0.7733 -0.7617j 1.0854 -0.3762
AR.9 -0.7733 +0.7617j 1.0854 0.3762
AR.10 -0.5650 -0.9155j 1.0758 -0.3380
AR.11 -0.5650 +0.9155j 1.0758 0.3380
AR.12 -0.3691 -0.9883j 1.0550 -0.3069
AR.13 -0.3691 +0.9883j 1.0550 0.3069
AR.14 -0.1692 -1.0598j 1.0732 -0.2752
AR.15 -0.1692 +1.0598j 1.0732 0.2752
AR.16 0.0585 -1.1078j 1.1094 -0.2416
AR.17 0.0585 +1.1078j 1.1094 0.2416
AR.18 1.0930 -0.1180j 1.0994 -0.0171
AR.19 1.0930 +0.1180j 1.0994 0.0171
AR.20 1.0568 -0.3488j 1.1128 -0.0507
AR.21 1.0568 +0.3488j 1.1128 0.0507
AR.22 0.9596 -0.5937j 1.1284 -0.0882
AR.23 0.9596 +0.5937j 1.1284 0.0882
AR.24 0.8043 -0.7902j 1.1275 -0.1236
AR.25 0.8043 +0.7902j 1.1275 0.1236
AR.26 0.3333 -1.0889j 1.1387 -0.2027
AR.27 0.3333 +1.0889j 1.1387 0.2027
AR.28 0.6337 -1.0033j 1.1867 -0.1603
AR.29 0.6337 +1.0033j 1.1867 0.1603
AR.30 0.5806 -1.0452j 1.1956 -0.1693
AR.31 0.5806 +1.0452j 1.1956 0.1693
loggedMLtopic
0      2.079442
1      1.386294
2      1.098612
3      2.302585
4      1.945910
         ...
127    6.368187
128    6.287859
129    6.620073
130    6.822197
131    6.890609
Name: machine-learning, Length: 132, dtype: float64
# Predictions from AR
predictions = mlQuestionsResults.predict(start = trainingStationaryMLQuestions, end = stationaryMLQuestionsLength-1).reset_index()

# Setting the index to number
predictions['index'] = [x for x in range(trainingStationaryMLQuestions,stationaryMLQuestionsLength)]
predictions = predictions.set_index('index')

# Adding the previous logged value
predictions = predictions[0] + loggedMLtopic.shift()

# Exponent it back
predictions = np.exp(predictions)
predictions
0             NaN
1             NaN
2             NaN
3             NaN
4             NaN
          ...
127    642.365437
128    605.731054
129    561.555450
130    781.032565
131           NaN
Length: 132, dtype: float64
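
The back-transformation in the cell above (add the previous logged value, then exponentiate) can be checked on a toy series. This is a sketch under assumptions: the values and the names levels, log_diff and invert_log_diff are made up for illustration. The "predictions" here are the true log differences, so the reconstruction should recover the original values exactly, apart from the first row, which has no previous value.

```python
import numpy as np
import pandas as pd

def invert_log_diff(pred_log_diff, logged):
    """One-step forecast on the original scale: exp(predicted diff + previous log value)."""
    return np.exp(pred_log_diff + logged.shift())

levels = pd.Series([8.0, 4.0, 3.0, 10.0, 7.0])  # hypothetical monthly counts
logged = np.log(levels)
log_diff = logged.diff()                         # the series an AR model would be fitted on

# Feeding the true differences back in must reproduce the original levels.
reconstructed = invert_log_diff(log_diff, logged)
print(reconstructed)
```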
stationaryMLQuestionsResults = pd.DataFrame(
    mlStackoverflow_data['machine-learning'][trainingStationaryMLQuestions:stationaryMLQuestionsLength]
    .rename('observed'))
stationaryMLQuestionsResults['predicted'] = predictions
stationaryMLQuestionsResults.plot()
Visualization
appendingTOR(tableOfResults, 'Machine Learning Questions', 'Autoregression', evaluateErrors(stationaryMLQuestionsResults))
tableOfResults
Unnamed: 0 Data Algorithm MSE RMSE MAE
...
52 Machine Learning Questions Autoregression 9493.88 97.4365 80.8513

Cinema Tickets

ticketSalesResults = AutoReg(stationaryTicketSales[:trainingStationaryTicketSales], [1,2,3,6,7]).fit()
ticketSalesResults.summary()

AutoReg Model Results

Dep. Variable: observed No. Observations: 161
Model: Restr. AutoReg(7) Log Likelihood -861.251
Method: Conditional MLE S.D. of innovations 64.949
Date: Mon, 08 Jul 2024 AIC 1736.5
Time: 15:38:55 BIC 1757.76
Sample: 7 HQIC 1745.14
161
coef std err z P>|z| [0.025 0.975]
const 0.3138 5.235 0.060 0.952 -9.947 10.574
observed.L1 -0.4471 0.067 -6.711 0.000 -0.578 -0.317
observed.L2 -0.2981 0.068 -4.382 0.000 -0.431 -0.165
observed.L3 -0.1313 0.055 -2.382 0.017 -0.239 -0.023
observed.L6 -0.1062 0.053 -1.990 0.047 -0.211 -0.002
observed.L7 0.5693 0.062 9.164 0.000 0.448 0.691

Roots

Real Imaginary Modulus Frequency
AR.1 -0.9209 -0.4350j 1.0185 -0.4298
AR.2 -0.9209 +0.4350j 1.0185 0.4298
AR.3 -0.2433 -0.9990j 1.0282 -0.2880
AR.4 -0.2433 +0.9990j 1.0282 0.2880
AR.5 0.6337 -0.9394j 1.1331 -0.1555
AR.6 0.6337 +0.9394j 1.1331 0.1555
AR.7 1.2475 -0.0000j 1.2475 -0.0000
ticketSalesARResults = pd.DataFrame(
    ticketSales_data['tickets_sold'].reset_index()['tickets_sold'].rename('observed')
    [trainingStationaryTicketSales-1:stationaryTicketSalesLength]
)

ticketSalesARResults
Unnamed: 0 observed
160 115.308869
161 138.182362
162 71.504747
163 63.026772
164 53.210611
... ...
226 34.500000
227 70.583704
228 113.653521
229 126.824561
230 75.431177
ticketSalesARPredictions = ticketSalesResults.predict(start=trainingStationaryTicketSales, end=stationaryTicketSalesLength-1)
ticketSalesARPredictions
161    -33.435136
162      2.950482
163    -16.865140
164    141.340881
165   -127.283711
          ...
226     -4.337732
227     22.393172
228    -34.553267
229     25.892766
230     -5.222340
Length: 70, dtype: float64
ticketSalesARResults['predicted'] = ticketSalesARPredictions + ticketSalesARResults['observed'].shift()
ticketSalesARResults.plot()
Visualization
appendingTOR(tableOfResults, 'Ticket Sales', 'Autoregression', evaluateErrors(ticketSalesARResults))
tableOfResults
Unnamed: 0 Data Algorithm MSE RMSE MAE
1 SP500 Naive Forecasting 0.315878 0.56203 0.325863
2 FTSE Naive Forecasting 5.69646 2.38673 1.29783
3 Gold Naive Forecasting 0.237289 0.487123 0.31245
4 USDCUSDT Close Price Naive Forecasting 4.01536e-06 0.002004 0.001034
5 USDCUSDT Tradecount Naive Forecasting 1.56376e+08 12505 7519.96
6 Python Questions Naive Forecasting 1.00863e+06 1004.31 680.298
7 Machine Learning Questions Naive Forecasting 3053.46 55.2581 35.4885
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
9 SP500 Seasonal Forecasting - 60 24.0706 4.90618 3.19858
10 SP500 Seasonal Forecasting - 1440 578.648 24.0551 18.6964
11 FTSE Seasonal Forecasting - 60 358.944 18.9458 12.5549
12 FTSE Seasonal Forecasting - 1440 9305.27 96.4639 71.5358
13 Gold Seasonal Forecasting - 60 11.0817 3.32892 2.08104
14 Gold Seasonal Forecasting - 1440 297.39 17.245 12.2358
15 USDCUSDT Close Price Seasonal Forecasting 1.41478e-05 0.003761 0.00176
16 USDCUSDT Tradecount Seasonal Forecasting 2.76653e+08 16632.9 10558.5
17 Python Questions Seasonal Forecasting 4.97505e+06 2230.48 1924.32
18 Machine Learning Questions Seasonal Forecasting 14325.5 119.689 84.975
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
20 SP500 Average Forecasting - 60 8.21104 2.86549 1.84062
21 SP500 Average Forecasting - 1440 195.085 13.9673 10.6287
22 FTSE Average Forecasting - 60 122.369 11.0621 7.24807
23 FTSE Average Forecasting - 1440 3105.82 55.7299 40.9431
24 Gold Average Forecasting - 60 3.79681 1.94854 1.22559
25 Gold Average Forecasting - 1440 96.5074 9.82382 6.8452
26 USDCUSDT Close Price Average Forecasting 6.51406e-06 0.002552 0.001187
27 USDCUSDT Tradecount Average Forecasting 1.44419e+08 12017.5 7727.27
28 Python Questions Average Forecasting 2.18019e+06 1476.55 1175.32
29 Machine Learning Questions Average Forecasting 7140.3 84.5003 55.785
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
31 SP500 Average Difference Forecasting - 60 0.322452 0.567849 0.332982
32 SP500 Average Difference Forecasting - 1440 0.316412 0.562505 0.326336
33 FTSE Average Difference Forecasting - 60 5.79875 2.40806 1.32178
34 FTSE Average Difference Forecasting - 1440 5.70723 2.38898 1.30033
35 Gold Average Difference Forecasting - 60 0.240423 0.49033 0.315994
36 Gold Average Difference Forecasting - 1440 0.237595 0.487437 0.313009
37 USDCUSDT Close Price Average Difference Forecasting 4.53094e-06 0.002129 0.001103
38 USDCUSDT Tradecount Average Difference Forecasting 1.64095e+08 12810 7724.69
39 Python Questions Average Difference Forecasting 1.1445e+06 1069.81 751.801
40 Machine Learning Questions Average Difference Forecasting 3399.79 58.3077 38.5563
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102
42 SP500 Naive Difference Forecasting 0.553753 0.744146 0.456603
43 FTSE Naive Difference Forecasting 10.796 3.28573 1.81081
44 Gold Naive Difference Forecasting 0.521205 0.721946 0.480641
45 USDCUSDT Close Price Naive Difference Forecasting 9.13676e-06 0.003023 0.001659
46 USDCUSDT Tradecount Naive Difference Forecasting 4.15266e+08 20378.1 12089.8
47 Python Questions Naive Difference Forecasting 2.34416e+06 1531.07 1117.71
48 Machine Learning Questions Naive Difference Forecasting 6678.13 81.7198 51.9462
49 Ticket Sales Naive Difference Forecasting 37594.4 193.893 136.398
50 Gold Autoregression 0.167187 0.408885 0.270593
51 SP500 Autoregression 0.181846 0.426434 0.236095
52 Machine Learning Questions Autoregression 9493.88 97.4365 80.8513
53 Ticket Sales Autoregression 12195.6 110.434 78.7188

Observations

tableOfResults[tableOfResults['Data'] == 'Machine Learning Questions'].sort_values('MSE')
Unnamed: 0 Data Algorithm MSE RMSE MAE
7 Machine Learning Questions Naive Forecasting 3053.46 55.2581 35.4885
40 Machine Learning Questions Average Difference Forecasting 3399.79 58.3077 38.5563
48 Machine Learning Questions Naive Difference Forecasting 6678.13 81.7198 51.9462
29 Machine Learning Questions Average Forecasting 7140.3 84.5003 55.785
52 Machine Learning Questions Autoregression 9493.88 97.4365 80.8513
18 Machine Learning Questions Seasonal Forecasting 14325.5 119.689 84.975
tableOfResults[tableOfResults['Data'] == 'Gold'].sort_values('MSE')
Unnamed: 0 Data Algorithm MSE RMSE MAE
50 Gold Autoregression 0.167187 0.408885 0.270593
3 Gold Naive Forecasting 0.237289 0.487123 0.31245
36 Gold Average Difference Forecasting - 1440 0.237595 0.487437 0.313009
35 Gold Average Difference Forecasting - 60 0.240423 0.49033 0.315994
44 Gold Naive Difference Forecasting 0.521205 0.721946 0.480641
24 Gold Average Forecasting - 60 3.79681 1.94854 1.22559
13 Gold Seasonal Forecasting - 60 11.0817 3.32892 2.08104
25 Gold Average Forecasting - 1440 96.5074 9.82382 6.8452
14 Gold Seasonal Forecasting - 1440 297.39 17.245 12.2358
tableOfResults[tableOfResults['Data'] == 'SP500'].sort_values('MSE')
Unnamed: 0 Data Algorithm MSE RMSE MAE
51 SP500 Autoregression 0.181846 0.426434 0.236095
1 SP500 Naive Forecasting 0.315878 0.56203 0.325863
32 SP500 Average Difference Forecasting - 1440 0.316412 0.562505 0.326336
31 SP500 Average Difference Forecasting - 60 0.322453 0.567849 0.332982
42 SP500 Naive Difference Forecasting 0.553753 0.744146 0.456603
20 SP500 Average Forecasting - 60 8.21104 2.86549 1.84062
9 SP500 Seasonal Forecasting - 60 24.0706 4.90618 3.19858
21 SP500 Average Forecasting - 1440 195.085 13.9673 10.6287
10 SP500 Seasonal Forecasting - 1440 578.648 24.0551 18.6964
tableOfResults[tableOfResults['Data'] == 'Ticket Sales'].sort_values('MSE')
Unnamed: 0 Data Algorithm MSE RMSE MAE
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
53 Ticket Sales Autoregression 12195.6 110.434 78.7188
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102
49 Ticket Sales Naive Difference Forecasting 37594.4 193.893 136.398
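
The per-dataset sorting above can be condensed into a single lookup of the lowest-MSE algorithm for each dataset. The sketch below uses a handful of rows copied from the results table; the variable names results and best are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# A few rows copied from the results table above.
results = pd.DataFrame({
    "Data": ["Gold", "Gold", "SP500", "SP500", "Ticket Sales", "Ticket Sales"],
    "Algorithm": ["Naive Forecasting", "Autoregression",
                  "Naive Forecasting", "Autoregression",
                  "Seasonal Forecasting", "Autoregression"],
    "MSE": [0.237289, 0.167187, 0.315878, 0.181846, 3922.4, 12195.6],
})

# idxmin returns the row label of the smallest MSE within each dataset.
best = results.loc[results.groupby("Data")["MSE"].idxmin()].set_index("Data")
print(best)
```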

Simulated Trading

We now apply the train_ar and predict functions from the lectures to a simulated trading strategy. The implementation closely follows the lecture material.

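import math
import numpy as np

def train_ar(values, order):
    # Fit AR(order) by ordinary least squares: each row of the design matrix
    # holds `order` consecutive values (oldest first); the target is the next value.
    target_vector = np.array(values[order:])
    lagged_values = []

    for i in range(len(values) - order):
        lagged_values.append(values[i:i+order])
    design_matrix = np.array(lagged_values)
    # Normal equations: (X^T X)^-1 X^T y
    return np.linalg.inv(design_matrix.transpose() @ design_matrix) @ design_matrix.transpose() @ target_vector

def predict(coeffs, data):
    # One-step-ahead predictions; the first len(coeffs) entries have no full lag window.
    predicted_values = [math.nan] * len(coeffs)
    for i in range(len(coeffs), len(data)):
        predicted_values.append(np.dot(coeffs, data[i-len(coeffs):i]))
    return np.array(predicted_values)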
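
As a sanity check on this kind of least-squares AR fit, the sketch below fits a synthetic AR(2) series with known coefficients. It is an assumption-laden illustration and not the lecture code: train_ar_lstsq is a hypothetical variant that uses np.linalg.lstsq instead of an explicit matrix inverse, and the series is simulated. As in train_ar, the returned coefficients are ordered oldest lag first.

```python
import numpy as np

def train_ar_lstsq(values, order):
    """Least-squares AR fit; coefficients are ordered oldest lag first."""
    values = np.asarray(values, dtype=float)
    # Each row holds `order` consecutive values; the target is the value that follows.
    design = np.array([values[i:i + order] for i in range(len(values) - order)])
    target = values[order:]
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs

# Synthetic AR(2): x_t = 0.5 * x_{t-1} - 0.3 * x_{t-2} + noise
rng = np.random.default_rng(0)
x = np.zeros(5000)
for t in range(2, len(x)):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.normal(scale=0.1)

coeffs = train_ar_lstsq(x, 2)
print(coeffs)  # lag-2 coefficient first, lag-1 coefficient last
```

Solving the least-squares problem directly avoids forming and inverting the normal-equations matrix, which tends to be more robust when the lagged columns are strongly correlated.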
goldData['Close'].plot()
Visualization
simTradingGoldData = goldData[goldData.index.dayofweek != 5]['Close'].sort_index()[:108000]
simTradingGoldData = pd.DataFrame(simTradingGoldData)
simTradingGoldData['Previous Close'] = simTradingGoldData['Close'].shift()
simTradingGoldData['Difference'] = simTradingGoldData['Close'] - simTradingGoldData['Close'].shift()

simTradingGoldData
Datetime Close Previous Close Difference
2011-10-26 15:50:00 1711.27 nan nan
2011-10-26 15:51:00 1712.07 1711.27 0.80
2011-10-26 15:52:00 1712.13 1712.07 0.06
2011-10-26 15:53:00 1712.83 1712.13 0.70
2011-10-26 15:54:00 1713.87 1712.83 1.04
... ... ... ...
2012-02-15 01:19:00 1722.71 1723.55 -0.84
2012-02-15 01:20:00 1722.79 1722.71 0.08
2012-02-15 01:21:00 1722.90 1722.79 0.11
2012-02-15 01:22:00 1722.86 1722.90 -0.04
2012-02-15 01:23:00 1722.80 1722.86 -0.06
trainingSimTradingGoldData = simTradingGoldData[:int(len(simTradingGoldData)*0.7)]
trainingSimTradingGoldData
Datetime Close Previous Close Difference
2011-10-26 15:50:00 1711.27 nan nan
2011-10-26 15:51:00 1712.07 1711.27 0.80
2011-10-26 15:52:00 1712.13 1712.07 0.06
2011-10-26 15:53:00 1712.83 1712.13 0.70
2011-10-26 15:54:00 1713.87 1712.83 1.04
... ... ... ...
2012-01-12 17:32:00 1654.07 1653.40 0.67
2012-01-12 17:33:00 1653.85 1654.07 -0.22
2012-01-12 17:34:00 1653.38 1653.85 -0.47
2012-01-12 17:35:00 1653.46 1653.38 0.08
2012-01-12 17:36:00 1653.36 1653.46 -0.10
testingSimTradingGoldData = simTradingGoldData[int(len(simTradingGoldData)*0.7):]
testingSimTradingGoldData
Datetime Close Previous Close Difference
2012-01-12 17:37:00 1653.12 1653.36 -0.24
2012-01-12 17:38:00 1653.43 1653.12 0.31
2012-01-12 17:39:00 1653.20 1653.43 -0.23
2012-01-12 17:40:00 1653.31 1653.20 0.11
2012-01-12 17:41:00 1652.87 1653.31 -0.44
... ... ... ...
2012-02-15 01:19:00 1722.71 1723.55 -0.84
2012-02-15 01:20:00 1722.79 1722.71 0.08
2012-02-15 01:21:00 1722.90 1722.79 0.11
2012-02-15 01:22:00 1722.86 1722.90 -0.04
2012-02-15 01:23:00 1722.80 1722.86 -0.06
plot_pacf(trainingSimTradingGoldData['Difference'].dropna())
Visualization
coeffs = train_ar(trainingSimTradingGoldData['Difference'].dropna(), 2)
coeffs
array([-0.00372669, -0.05243032])
prediction = predict(coeffs, testingSimTradingGoldData['Difference'])
prediction
array([        nan,         nan, -0.01535899, ..., -0.001064  ,
       -0.00606547,  0.00168728])
simTradingGoldDataResult = pd.DataFrame(testingSimTradingGoldData['Close'].rename('observed'))
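# Note: prediction[i] forecasts Difference[i] = Close[i] - Close[i-1]; adding it to
# 'Previous Close' instead of 'Close' would keep the current price out of the forecast.
simTradingGoldDataResult['predicted'] = prediction + testingSimTradingGoldData['Close']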
simTradingGoldDataResult = simTradingGoldDataResult.dropna()
simTradingGoldDataResult.plot()
Visualization

We flag a holding signal only when the predicted price is more than 1.001 times the previous observed price. With a multiplier of 1.001 there are 152 signals in the test set; at 1.002 there are only 8, and at 1.003 only 4. That is why we use 1.001.

simTradingGoldDataResult['isHolding'] = simTradingGoldDataResult['predicted'] > simTradingGoldDataResult['observed'].shift() * 1.001
simTradingGoldDataResult[simTradingGoldDataResult['isHolding'] == True].count()
observed     152
predicted    152
isHolding    152
dtype: int64
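
The threshold comparison described above can be wrapped in a small helper that counts signals for several multipliers at once. This is a sketch on toy prices, not the gold data; count_signals and the stand-in predicted series are assumptions for illustration.

```python
import pandas as pd

def count_signals(predicted, observed, thresholds):
    """Number of rows where predicted > previous observed * threshold, per threshold."""
    previous = observed.shift()
    return {t: int((predicted > previous * t).sum()) for t in thresholds}

# Toy prices, not the gold data.
observed = pd.Series([100.0, 100.05, 100.20, 100.21, 100.45])
predicted = observed + 0.01  # stand-in for the model output

counts = count_signals(predicted, observed, [1.001, 1.002, 1.003])
print(counts)  # {1.001: 2, 1.002: 1, 1.003: 0}
```

The first row never produces a signal because it has no previous price to compare against.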
lastPrice = None
portfolioValue = 10000
values = []

for ts, (close, _, isHolding) in simTradingGoldDataResult.iterrows():
    if isHolding:
        portfolioValue *= close / lastPrice
    lastPrice = close
    values.append(portfolioValue)

simTradingGoldDataResult['Equity Value'] = values
offsets = (simTradingGoldDataResult['isHolding'] * (simTradingGoldDataResult['observed'].shift(-1) - simTradingGoldDataResult['observed']))
hasIncreased = ((simTradingGoldDataResult['observed'].shift(-1) > simTradingGoldDataResult['observed']) & simTradingGoldDataResult['isHolding']).astype(int)
hasDecreased = ((simTradingGoldDataResult['observed'].shift(-1) < simTradingGoldDataResult['observed']) & simTradingGoldDataResult['isHolding']).astype(int)

# Price chart: green lines mark holding steps followed by a rise, red lines those followed by a fall
plt.figure()
ax = simTradingGoldDataResult['observed'].plot()
for no, i in enumerate(hasIncreased):
    if i:
        plt.axvline(hasIncreased.index[no], color='g', linewidth=0.3)

for no, i in enumerate(hasDecreased):
    if i:
        plt.axvline(hasDecreased.index[no], color='r', linewidth=0.3)
plt.show()

# Same markers drawn on the equity curve
for no, i in enumerate(hasIncreased):
    if i:
        plt.axvline(hasIncreased.index[no], color='g', linewidth=0.3)

for no, i in enumerate(hasDecreased):
    if i:
        plt.axvline(hasDecreased.index[no], color='r', linewidth=0.3)
simTradingGoldDataResult['Equity Value'].plot()
Visualization
Visualization
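
The equity loop above can also be written in vectorised form. The sketch below is one possible formulation on toy data, not the notebook's code: equity_curve and its arguments are hypothetical names. Like the loop, it compounds the return close[t] / close[t-1] on every row flagged as holding and leaves the portfolio unchanged otherwise.

```python
import numpy as np
import pandas as pd

def equity_curve(close, is_holding, start_value=10_000.0):
    """Compound the one-step return close[t] / close[t-1] on rows flagged as holding."""
    growth = (close / close.shift()).where(is_holding, 1.0)
    # The first row has no previous price, so it contributes no growth.
    growth = growth.fillna(1.0)
    return start_value * growth.cumprod()

close = pd.Series([100.0, 101.0, 99.0, 102.0])
is_holding = pd.Series([False, True, False, True])
equity = equity_curve(close, is_holding)
print(equity.round(2).tolist())
```

Treating the first row as "no growth" also sidesteps the case where the loop would divide by a lastPrice that has not been set yet.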

Monte Carlo Simulations

We now apply Monte Carlo simulation to three datasets: the gold data (a real-world financial series), machine learning questions (for its multiplicative growth), and ticket sales (for its seasonality). We reuse the lag structure found from the PACF above and, as before, split each series into a training set and a testing set.

Gold

def mcRmse(predicted, actual):
    diffs = predicted - actual
    return math.sqrt(np.mean(diffs**2))
goldTrainSeries = goldData.reset_index()['Close'][:traniningStationaryGoldDataLength]
goldTrainSeries
0        1298.50
1        1298.60
2        1298.90
3        1299.09
4        1299.17
          ...
75595    1246.30
75596    1246.60
75597    1246.40
75598    1247.10
75599    1247.20
Name: Close, Length: 75600, dtype: float64
goldModel = AutoReg(goldTrainSeries, 3, trend='n').fit()
goldCoeffs = list(reversed(goldModel.params))
goldCoeffs
[0.032491603321171736, 0.12017858859369826, 0.8473291714654028]
goldStdDev = mcRmse(goldModel.predict(), goldTrainSeries)
goldStdDev
0.36114889408159995
numOfSamplesGold = 1000
goldTestSeries = goldData.reset_index()['Close'][traniningStationaryGoldDataLength:stationaryGoldDataLength]
goldTestSeries = goldTestSeries.reset_index()['Close']
initialGoldValues = np.array(goldTestSeries[:2500]).reshape(1, -1)
initialGoldValues

goldValues = np.broadcast_to(initialGoldValues, (numOfSamplesGold, 2500))
goldValues

goldTrajectoryLength = 1000
np.random.seed(123)

for i in range(goldTrajectoryLength):
    nextPrediction = goldValues[:, -3:] @ goldCoeffs
    nextColumn = nextPrediction + goldStdDev * np.random.randn(numOfSamplesGold)
    goldValues = np.append(goldValues, nextColumn.reshape(numOfSamplesGold, 1), axis = 1)

horizontalLines = [x for x in range(1220, 1281, 10)]

plt.plot(goldValues.transpose())
for line in horizontalLines:
    plt.axhline(y = line, linestyle = 'dashed', linewidth = 2, alpha = 0.5)
plt.plot(goldTestSeries[:3500], linewidth = 1.5, c = 'black')
plt.show()
Visualization

Now that we have produced 1000 sample paths of possible price movements, we can estimate the probabilities of the price reaching certain levels.

The black line in the plot shows the actual gold price movement.
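
The simulation loop above grows the array with np.append on every step. A minimal sketch of the same idea with a preallocated array is shown below. It rests on several assumptions: simulate_ar_paths is a hypothetical helper, the coefficients and noise scale are rounded from the outputs above, the three-value history is illustrative, and it draws from np.random.default_rng, so the random numbers differ from those produced by np.random.seed(123).

```python
import numpy as np

def simulate_ar_paths(history, coeffs, noise_std, n_paths, n_steps, seed=123):
    """Simulate AR paths; `coeffs` are ordered oldest lag first, as in goldCoeffs."""
    rng = np.random.default_rng(seed)
    coeffs = np.asarray(coeffs, dtype=float)
    order = len(coeffs)
    paths = np.empty((n_paths, order + n_steps))
    paths[:, :order] = history[-order:]        # every path starts from the same history
    for t in range(order, order + n_steps):
        mean = paths[:, t - order:t] @ coeffs  # deterministic AR part
        paths[:, t] = mean + noise_std * rng.standard_normal(n_paths)
    return paths[:, order:]                    # keep only the simulated steps

# Coefficients and residual scale rounded from the outputs above; the history is illustrative.
paths = simulate_ar_paths(
    history=np.array([1247.0, 1248.0, 1248.2]),
    coeffs=[0.0325, 0.1202, 0.8473],
    noise_std=0.36,
    n_paths=1000,
    n_steps=250,
)
print(paths.shape)  # (1000, 250)
```

Preallocating avoids copying the whole array on each step, which matters as the number of paths and steps grows.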

finalTimePeriod = nextColumn
finalTimePeriod
array([1248.81453198, 1234.48618073, 1260.43243186, 1249.37824546,
       1246.44648938, 1234.98845302, 1237.40624842, 1263.51763179,
       1249.32173509, 1237.21926607, 1262.02420909, 1241.23580835,
       ...
       1255.65131606, 1269.97946134, 1258.23988999, 1249.41256685])

We can now estimate the probability that the gold price lies within a given range at time 3500, using the function from the course materials.

def probBetween (arr, lower, higher):
    count = np.count_nonzero((arr > lower) & (arr < higher))
    return count/arr.size

The probability of the gold price ending between 1220 and 1280

probBetween(finalTimePeriod, 1220, 1280)
0.997

The probability of the gold price ending between 1215 and 1240

probBetween(finalTimePeriod, 1215, 1240)
0.231

The probability of the gold price ending between 1245 and 1255

probBetween(finalTimePeriod, 1245, 1255)
0.36

The probability of the gold price ending between 1250 and 1260

probBetween(finalTimePeriod, 1250, 1260)
0.296

The probability of the gold price ending between 1265 and 1280

probBetween(finalTimePeriod, 1265, 1280)
0.042

The probability of the gold price ending between 1250 and 1253

probBetween(finalTimePeriod, 1250, 1253)
0.101

The further into the future we try to predict, the harder it is to pin down the price. Because the cone of possible outcomes widens, the probability is spread over more price levels, and many of the ranges we test end up with broadly similar probabilities. The probability is lower the further a range lies from the first predicted value.
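
The widening cone can be made concrete by measuring the spread of simulated paths at several horizons. The sketch below makes a simplifying assumption that is not in the original notebook: it uses a pure random walk in place of the fitted AR(3) model (whose coefficients sum to almost exactly 1), with a noise scale rounded from goldStdDev and a starting price of 1248.2. For a random walk the spread grows with the square root of the horizon.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps, noise_std = 1000, 1000, 0.36  # noise scale rounded from goldStdDev

# Simplifying assumption: a pure random walk starting at the last observed price.
steps = noise_std * rng.standard_normal((n_paths, n_steps))
paths = 1248.2 + steps.cumsum(axis=1)

# Cross-sectional standard deviation of the paths at each horizon.
spread = paths.std(axis=0)
for horizon in (10, 100, 1000):
    print(horizon, round(float(spread[horizon - 1]), 2))
```

The absolute numbers will differ from the AR(3) simulation, but the pattern of a spread that keeps increasing with the horizon is the point of the illustration.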

Next, we will look at the probability that the price ends up above or below a specific value. We first check whether either direction is more likely, and then try some other values.

# Checking that we are indexing the correct columns
goldValues[:,2500][:5]
array([1247.77539526, 1248.52765975, 1248.26966692, 1247.62347288,
       1247.95850871])
goldValues[:,2499][:5]
array([1248.2, 1248.2, 1248.2, 1248.2, 1248.2])
period = 2499
lastPrice = goldValues[:,period][0]
lastPrice
1248.2
(lastPrice < goldValues[:,2500:]).sum() / (numOfSamplesGold*goldTrajectoryLength)
0.487236
(lastPrice > goldValues[:,2500:]).sum() / (numOfSamplesGold*goldTrajectoryLength)
0.512764

Based on the Monte Carlo result, 48.72% of the simulated values are higher than the last observed price and 51.28% are lower. Given this, let's assume we enter a short position and explore how the simulation evolves over time.

Suppose we only want to hold the position for a shorter term and exit by period 2750. What is the probability of hitting a take-profit at 1240, or a stop-loss at 1260? To be conservative, we also check the probability for a stop-loss at 1255.

takeProfit = 1240
np.any(takeProfit > goldValues[:,2500:2750], axis = 1).sum() / 1000
0.088
stopLoss = 1260
np.any(stopLoss < goldValues[:,2500:2750], axis = 1).sum() / 1000
0.008
conservativeStopLoss = 1255
np.any(conservativeStopLoss < goldValues[:,2500:2750], axis = 1).sum() / 1000
0.145

Given that the probability of reaching 1255 is higher than that of reaching the initial take-profit of 1240, we might want to move the take-profit closer to the last price to increase the chance of triggering it.

takeProfit = 1240.5
np.any(takeProfit > goldValues[:,2500:2750], axis = 1).sum() / 1000
0.109
takeProfit = 1241.5
np.any(takeProfit > goldValues[:,2500:2750], axis = 1).sum() / 1000
0.168

By raising the take-profit by 0.5 to 1240.5, the probability of triggering it increases to 0.109. Being more conservative, at 1241.5, the probability increases to 0.168.
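The checks above ask whether each level is touched at any point in the window, independently of the other level. For a position that carries both a take-profit and a stop-loss, it can also be useful to know which level is hit first. The following is a minimal sketch under the assumption that the simulated paths are stored as a 2D array of shape (samples, time), like `goldValues`. The helper `first_touch_probabilities` and the hand-made `demo` array are hypothetical and only serve to illustrate the idea for a short position.

```python
import numpy as np

def first_touch_probabilities(paths, take_profit, stop_loss):
    """For a short position: share of paths that hit the take-profit first,
    the stop-loss first, or neither level within the window."""
    hit_tp = paths < take_profit   # price fell below the take-profit
    hit_sl = paths > stop_loss     # price rose above the stop-loss
    n_steps = paths.shape[1]
    # Index of the first hit, or n_steps if the level is never touched
    first_tp = np.where(hit_tp.any(axis=1), hit_tp.argmax(axis=1), n_steps)
    first_sl = np.where(hit_sl.any(axis=1), hit_sl.argmax(axis=1), n_steps)
    tp_first = np.mean(first_tp < first_sl)
    sl_first = np.mean(first_sl < first_tp)
    return tp_first, sl_first, 1.0 - tp_first - sl_first

# Tiny hand-made example with three paths
demo = np.array([
    [1248.0, 1245.0, 1239.0, 1261.0],  # take-profit first, stop-loss later
    [1248.0, 1256.0, 1262.0, 1238.0],  # stop-loss first
    [1248.0, 1249.0, 1247.0, 1250.0],  # neither level touched
])
print(first_touch_probabilities(demo, take_profit=1240, stop_loss=1260))
```

Applied to the simulation, the call would look like `first_touch_probabilities(goldValues[:, 2500:2750], 1240, 1260)`.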

The following is the density plot for the gold values.

xTicks = [x for x in range(1190, 1311, 10)]
# Flatten all simulated values from period 2500 onwards into one array
goldValuesSumKDE = goldValues[:, 2500:].ravel()
pd.DataFrame(goldValuesSumKDE).plot.kde()
plt.axvline(lastPrice, linestyle = 'dashed', label = 'Last Price', c = 'black')
plt.xticks(xTicks)
for line in xTicks:
    plt.axvline(line, linestyle = 'dotted', c = 'b', alpha = 0.3)
plt.legend()
Visualization

The following is the density plot for periods 2500 to 2750, matching the scenario above. It visualizes the probability of each price range based on the Monte Carlo results. As the steep slope shows, a small increase from 1240 significantly increases the probability.

xTicks = [x for x in range(1190, 1311, 10)]
# Flatten the simulated values for periods 2500 to 2750 into one array
goldValuesSumKDE = goldValues[:, 2500:2750].ravel()
pd.DataFrame(goldValuesSumKDE).plot.kde()
plt.axvline(lastPrice, linestyle = 'dashed', label = 'Last Price', c = 'black')
plt.xticks(xTicks)
for line in xTicks:
    plt.axvline(line, linestyle = 'dotted', c = 'b', alpha = 0.3)
plt.legend()
Visualization

Cinema Tickets

Given the time constraints of the coursework, we will explore one more dataset before moving on.

ticketSalesTrainSeries = ticketSales_data['tickets_sold'].reset_index()['tickets_sold'][:trainingStationaryTicketSales]
ticketSalesTrainSeries
0      125.650000
1      178.675325
2      175.461017
3      114.192192
4       94.831956
          ...
156     73.299830
157     71.350000
158    275.887784
159    148.436834
160    115.308869
Name: tickets_sold, Length: 161, dtype: float64
ticketSalesModel = AutoReg(ticketSalesTrainSeries, 7, trend='n').fit()
ticketSalesCoeffs = list(reversed(ticketSalesModel.params))
ticketSalesCoeffs
[0.7426181566741853,
 -0.03903937266016291,
 -0.031596445533128875,
 0.036661173670043956,
 0.10106447839651159,
 0.0957974652301363,
 0.08677635276613055]
ticketsSalesStdDev = mcRmse(ticketSalesModel.predict(), ticketSalesTrainSeries)
ticketsSalesStdDev
62.78175334614299
numOfSamplesTicketSales = 71 - 14
def runMonteCarloTicketSale(multiplier):
    np.random.seed(123)
    ticketSalesTestSeries = ticketSales_data['tickets_sold'].reset_index()['tickets_sold'][trainingStationaryTicketSales:]
    ticketSalesTestSeries = ticketSalesTestSeries.reset_index()['tickets_sold']

    initialTicketSalesValues = np.array(ticketSalesTestSeries[:14]).reshape(1, -1)

    ticketSalesValues = np.broadcast_to(initialTicketSalesValues, (numOfSamplesTicketSales, 14))

    ticketSalesTrajectoryLength = (71 - 14)

    for i in range(ticketSalesTrajectoryLength):
        nextPrediction = ticketSalesValues[:, -7:] @ ticketSalesCoeffs
        nextColumn = nextPrediction + multiplier * np.random.randn(numOfSamplesTicketSales)
        ticketSalesValues = np.append(ticketSalesValues, nextColumn.reshape(numOfSamplesTicketSales, 1), axis = 1)

    horizontalLines = [x for x in range(-300, 501, 100)]

    plt.plot(ticketSalesValues.transpose())
    for line in horizontalLines:
        plt.axhline(y = line, linestyle = 'dashed', linewidth = 2, alpha = 0.5)
    plt.plot(ticketSalesTestSeries[:71], linewidth = 1.5, c = 'black')
    plt.yticks(horizontalLines)
    plt.show()
runMonteCarloTicketSale(ticketsSalesStdDev)
Visualization

After visualizing the Monte Carlo simulation, I realized that it might not be suitable for this dataset: the random noise produces negative predictions, which ticket sales can never reach. I then tried noise scales other than the standard deviation, as shown below. The function above was written after this realization.

We test several values to see how the results change.

runMonteCarloTicketSale(10)
Visualization
runMonteCarloTicketSale(30)
Visualization
runMonteCarloTicketSale(10 + np.random.randint(10, ticketsSalesStdDev))
Visualization

We even tried a randomly varying value in the last run, but in every case the predictions are too narrow and may shrink further over time. Next, we will try to avoid values of 0 and below by applying the abs() function to the noise.

def runMonteCarloTicketSale2(multiplier):
    np.random.seed(123)
    ticketSalesTestSeries = ticketSales_data['tickets_sold'].reset_index()['tickets_sold'][trainingStationaryTicketSales:]
    ticketSalesTestSeries = ticketSalesTestSeries.reset_index()['tickets_sold']

    initialTicketSalesValues = np.array(ticketSalesTestSeries[:14]).reshape(1, -1)

    ticketSalesValues = np.broadcast_to(initialTicketSalesValues, (numOfSamplesTicketSales, 14))

    ticketSalesTrajectoryLength = (71 - 14)

    for i in range(ticketSalesTrajectoryLength):
        nextPrediction = ticketSalesValues[:, -7:] @ ticketSalesCoeffs
        nextColumn = nextPrediction + abs(multiplier * np.random.randn(numOfSamplesTicketSales))
        ticketSalesValues = np.append(ticketSalesValues, nextColumn.reshape(numOfSamplesTicketSales, 1), axis = 1)

    horizontalLines = [x for x in range(-300, 501, 100)]

    plt.plot(ticketSalesValues.transpose())
    for line in horizontalLines:
        plt.axhline(y = line, linestyle = 'dashed', linewidth = 2, alpha = 0.5)
    plt.plot(ticketSalesTestSeries[:71], linewidth = 1.5, c = 'black')
    plt.yticks(horizontalLines)
    plt.show()
runMonteCarloTicketSale2(ticketsSalesStdDev)
Visualization

With only positive noise, the predictions drift upwards, which is not ideal either.
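The upward drift has a simple explanation: the absolute value of zero-mean normal noise has a mean of about 0.8 times its scale (sigma * sqrt(2 / pi)), so every simulated step receives a positive shock on average. The sketch below checks this numerically and shows one possible alternative, which is an assumption on my part and not something tested in this study: keep the symmetric noise and floor the simulated value at zero. The helper `clip_at_zero` is hypothetical, and clipping still distorts the distribution near zero.

```python
import numpy as np

rng = np.random.default_rng(123)
sigma = 62.78  # roughly the RMSE used as the noise scale for ticket sales
noise = sigma * rng.standard_normal(100_000)

# Folded (absolute) noise is no longer centred on zero
print(np.abs(noise).mean())          # empirical mean of |noise|
print(sigma * np.sqrt(2 / np.pi))    # theoretical mean of |noise|

def clip_at_zero(next_prediction, noise):
    """Add symmetric noise, then floor the simulated value at zero."""
    return np.maximum(next_prediction + noise, 0.0)

print(clip_at_zero(np.array([50.0, 120.0]), np.array([-80.0, 30.0])))
```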

Regardless of the parameter used, the model for this dataset seems to work well only for the first few periods. Monte Carlo simulation was expected to address this, but it does not. It is very useful as a baseline for understanding the possible and most probable outcomes, and for taking a probabilistic view of upcoming periods, but like many models it quickly becomes unreliable far into the future.

Moving Averages

def maSmoothing(series, n):
    answer = series - series
    for i in range(n):
        answer = answer + series.shift(i)
    return answer/n

def maSmoothingPlot(series, nList):
    temp = pd.Series(series, name='observed')
    temp = pd.DataFrame(temp)
    for i in nList:
        tempCol = 'smoothed ' + str(i)
        temp[tempCol] = maSmoothing(temp['observed'], i)

    return temp

Gold

maSmoothingPlot(goldData['Close'][:3000], [10,100,500]).plot()
Visualization

FTSE

maSmoothingPlot(ftseData['Close'][:100000], [100,3000,10000]).plot()
Visualization

S&P500

maSmoothingPlot(sp500Data['Close'][:100000], [500,5000,10000]).plot()
Visualization

USDCUSDT - Tradecount

maSmoothingPlot(usdcusdtDataTradeCount['tradecount'].sort_index(),[10,50,100]).plot()
Visualization

Cinema Tickets

maSmoothingPlot(ticketSales_data['tickets_sold'], [3,7,21]).plot()
Visualization

Machine Learning Questions

maSmoothingPlot(mlStackoverflow_data['machine-learning'], [3,7,20]).plot()
Visualization

A moving average is a simple, equally weighted way to get a sense of the trend in the data. One downside is that a longer window needs a correspondingly long stretch of data before the average becomes available, so we have to wait for the full window length. Many longer-term investing strategies use moving averages, usually for trend finding. Another downside is that we sacrifice some data at the front of the dataset to compute the average, so the smoothed series is missing that initial chunk.
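For reference, the same trailing average can be obtained with pandas' built-in rolling window, which also makes the missing leading values explicit. This is a small sketch on a toy series (the name `toy` and the window of 3 are illustrative assumptions), not a replacement for the helper defined above.

```python
import pandas as pd

toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Trailing 3-point average; the first window - 1 entries have no full window
smoothed = toy.rolling(window=3).mean()
print(smoothed.tolist())

# Number of leading values lost to the window
print(int(smoothed.isna().sum()))
```

The number of missing values is always the window length minus one, which is why the long windows used for FTSE and S&P500 lose thousands of leading points.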

Exponential Averaging

The following averaging method is interesting because it does not need to wait for a full window. However, we do need to set a parameter, alpha, which controls how smooth the line is. As before, we will explore this with all the datasets.

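def expSmoothing(series, alpha):
    # Seed with the first observation by position, so any index type works
    smoothedValue = series.iloc[0]
    results = []
    for value in series:
        # Blend the previous smoothed value with the new observation
        smoothedValue = (1 - alpha) * smoothedValue + alpha * value
        results.append(smoothedValue)
    return results

def expSmoothingPlot(series, alphaList):
    # Build a DataFrame with the observed series and one smoothed column per alpha
    temp = pd.Series(series, name='observed')
    temp = pd.DataFrame(temp)
    for alpha in alphaList:
        tempCol = 'smoothed ' + str(alpha)
        temp[tempCol] = expSmoothing(temp['observed'], alpha)
    return temp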

Gold

expSmoothingPlot(goldData['Close'][:3000], [0.9, 0.05, 0.01, 0.003]).plot()
Visualization

FTSE

expSmoothingPlot(ftseData['Close'][:100000], [0.0005, 0.0001]).plot()
Visualization

S&P500

expSmoothingPlot(sp500Data['Close'][:100000], [0.0024, 0.0005, 0.0001]).plot()
Visualization

USDCUSDT - Tradecount

expSmoothingPlot(usdcusdtDataTradeCount['tradecount'].sort_index(), [0.1, 0.01, 0.001]).plot()
Visualization

Cinema Tickets

expSmoothingPlot(ticketSales_data['tickets_sold'], [0.1, 0.05, 0.025, 0.01]).plot()
Visualization

Machine Learning Questions

expSmoothingPlot(mlStackoverflow_data['machine-learning'], [0.25, 0.1, 0.01]).plot()
Visualization

Based on the above, the appropriate alpha depends heavily on the size of the dataset. With an alpha of 0.1, the smoothed line follows one series very closely, while on another it is already close to a straight line.
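One way to make this dependence concrete is the rule of thumb that pandas uses for its `span` parameter, alpha = 2 / (span + 1), which maps a smoothing factor to a roughly equivalent moving-average window. The helper `alpha_to_span` below is a hypothetical name for this conversion. The sketch also shows that pandas' recursive EMA (`adjust=False`) follows the same update rule as `expSmoothing`.

```python
import pandas as pd

def alpha_to_span(alpha):
    """Approximate moving-average window equivalent to a smoothing factor."""
    return 2.0 / alpha - 1.0

for alpha in (0.1, 0.01, 0.001):
    print(alpha, alpha_to_span(alpha))

# Recursive EMA: s_t = (1 - alpha) * s_{t-1} + alpha * x_t, with s_0 = x_0
toy = pd.Series([10.0, 12.0, 11.0, 15.0])
print(toy.ewm(alpha=0.5, adjust=False).mean().tolist())
```

An alpha of 0.1 corresponds to a window of about 19 points, which is a sizeable share of a series with a few hundred observations but negligible for one with a hundred thousand.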

Removing Trend with EMA

A benefit of using an EMA is that the first few values of the dataset are not excluded, making it easier to remove the trend component. The remainder should then be stationary, which we can check with the KPSS and ADF tests. We will then explore whether this approach predicts better.

Given the limited time, we will explore only one dataset.

First, we need to see which value of alpha produces the most stationary-looking residual. We will decide visually, and only then check with the tests.

emaGoldData = expSmoothingPlot(goldData['Close'][:stationaryGoldDataLength], [0.001, 0.005, 0.0002])
emaGoldData.columns
Index(['observed', 'smoothed 0.001', 'smoothed 0.005', 'smoothed 0.0002'], dtype='object')
for i in range(1,4):
    currentCol = emaGoldData.columns[i]
    smoothed = emaGoldData[currentCol]
    plt.plot(emaGoldData['observed'] - smoothed, label = currentCol, alpha=0.7)
plt.legend()
Visualization

Visually, an alpha of 0.005 seems to be a good fit for this case, so we will use it.

emaGoldData['residual'] = emaGoldData['observed'] - emaGoldData['smoothed 0.005']
kpss_test(emaGoldData['residual'])
KPSS Stat:  0.4907302016778367
p-value:  0.04375445908156831
Number of lags: 192
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is not stationary
adftest(emaGoldData['residual'])
AdfTest Stat:  -16.70183985809975
p-value:  1.4590092692087397e-29
Number of lags: 50
Number of observation: 107949
Critial Values:
   1% : -3.430410579120753
   5% : -2.8615667750440084
   10% : -2.5667842514152195
The series is stationary

Visually, the data might look stationary already, but the two tests disagree: ADF indicates stationarity while KPSS rejects it at the 5% level. We therefore proceed with first-order differencing.
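The two tests have opposite null hypotheses (ADF assumes a unit root, KPSS assumes stationarity), so they can disagree. A common way to read the four possible combinations is sketched below. The helper `interpret_stationarity` and the 5% threshold are assumptions for illustration, and the p-values passed in are the ones reported in this section for the EMA residual and for its first difference.

```python
def interpret_stationarity(adf_p, kpss_p, level=0.05):
    """Combine ADF (null: unit root) and KPSS (null: stationary) p-values."""
    adf_stationary = adf_p < level      # ADF rejects the unit root
    kpss_stationary = kpss_p >= level   # KPSS fails to reject stationarity
    if adf_stationary and kpss_stationary:
        return "stationary"
    if not adf_stationary and not kpss_stationary:
        return "non-stationary"
    if adf_stationary:
        return "difference-stationary: try differencing"
    return "trend-stationary: try detrending"

# EMA residual: ADF p = 1.459e-29, KPSS p = 0.0438
print(interpret_stationarity(1.459e-29, 0.0438))
# First difference of the residual: ADF p = 0.0, KPSS p = 0.1
print(interpret_stationarity(0.0, 0.1))
```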

emaGoldData['residualDiff'] = emaGoldData['residual'] - emaGoldData['residual'].shift()
emaGoldData[['residual', 'residualDiff']].plot()
Visualization
kpss_test(emaGoldData['residualDiff'][1:])
KPSS Stat:  0.0010710001653877659
p-value:  0.1
Number of lags: 71
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary
adftest(emaGoldData['residualDiff'][1:])
AdfTest Stat:  -47.34320740713912
p-value:  0.0
Number of lags: 53
Number of observation: 107945
Critial Values:
   1% : -3.4304105813656207
   5% : -2.8615667760361956
   10% : -2.5667842519433277
The series is stationary

Now we will proceed with applying prediction methods.

plot_pacf(emaGoldData['residualDiff'][1:traniningStationaryGoldDataLength])
Visualization
Visualization

Only the 2nd period appears to be significant; nevertheless, let's use a larger number of lags.

emaGoldCoeffs = train_ar(emaGoldData['residualDiff'][1:traniningStationaryGoldDataLength],20)
emaGoldCoeffs
array([ 0.00133632, -0.00681158, -0.006989  , -0.00519191, -0.00436599,
       -0.00034325, -0.01157953, -0.00883011, -0.01158689, -0.00957072,
        0.01358625,  0.00297729, -0.00152516, -0.01193093, -0.01983785,
       -0.01051143, -0.02078453, -0.0104172 , -0.03703278, -0.15587259])
emaGoldResidualDiffPredict = predict(emaGoldCoeffs, emaGoldData['residualDiff'][traniningStationaryGoldDataLength:] )
emaGoldResidualDiffPredict = pd.DataFrame(emaGoldResidualDiffPredict)
emaGoldResidualDiffPredict = emaGoldResidualDiffPredict.set_index(pd.Series([x for x in range(traniningStationaryGoldDataLength,stationaryGoldDataLength)]))
emaGoldResidualDiffPredict.plot()
Visualization
emaGoldResidualPrediction = emaGoldResidualDiffPredict[0] + emaGoldData.reset_index()['residualDiff'][traniningStationaryGoldDataLength:].shift()
emaGoldResidualPrediction.plot()
Visualization
emaGoldTrendPredict = 2 * emaGoldData['smoothed 0.005'][traniningStationaryGoldDataLength:].shift(1) \
    - emaGoldData['smoothed 0.005'][traniningStationaryGoldDataLength:].shift(2)

emaGoldTrendPredict = pd.DataFrame(emaGoldTrendPredict.reset_index()['smoothed 0.005'].rename('trend prediction'))\
    .set_index(pd.Series([x for x in range(traniningStationaryGoldDataLength, stationaryGoldDataLength)]))

emaGoldTrendPredict.plot()
Visualization
emaGoldPrediction = emaGoldTrendPredict['trend prediction'] + emaGoldResidualPrediction

emaGoldResult = pd.DataFrame(emaGoldData.reset_index()['observed'][traniningStationaryGoldDataLength:])
emaGoldResult['predicted'] = emaGoldPrediction
emaGoldResult = emaGoldResult.dropna()
emaGoldResult.plot()
Visualization
evaluateErrors(emaGoldResult)
[8.310943507413699, 2.882870705982788, 1.9945175517858817]
appendingTOR(tableOfResults, 'Gold', 'Manual ARIMA',evaluateErrors(emaGoldResult))
tableOfResults
Unnamed: 0 Data Algorithm MSE RMSE MAE
1 SP500 Naive Forecasting 0.315878 0.56203 0.325863
2 FTSE Naive Forecasting 5.69646 2.38673 1.29783
3 Gold Naive Forecasting 0.237289 0.487123 0.31245
4 USDCUSDT Close Price Naive Forecasting 4.01536e-06 0.002004 0.001034
5 USDCUSDT Tradecount Naive Forecasting 1.56376e+08 12505 7519.96
6 Python Questions Naive Forecasting 1.00863e+06 1004.31 680.298
7 Machine Learning Questions Naive Forecasting 3053.46 55.2581 35.4885
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
9 SP500 Seasonal Forecasting - 60 24.0706 4.90618 3.19858
10 SP500 Seasonal Forecasting - 1440 578.648 24.0551 18.6964
11 FTSE Seasonal Forecasting - 60 358.944 18.9458 12.5549
12 FTSE Seasonal Forecasting - 1440 9305.27 96.4639 71.5358
13 Gold Seasonal Forecasting - 60 11.0817 3.32892 2.08104
14 Gold Seasonal Forecasting - 1440 297.39 17.245 12.2358
15 USDCUSDT Close Price Seasonal Forecasting 1.41478e-05 0.003761 0.00176
16 USDCUSDT Tradecount Seasonal Forecasting 2.76653e+08 16632.9 10558.5
17 Python Questions Seasonal Forecasting 4.97505e+06 2230.48 1924.32
18 Machine Learning Questions Seasonal Forecasting 14325.5 119.689 84.975
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
20 SP500 Average Forecasting - 60 8.21104 2.86549 1.84062
21 SP500 Average Forecasting - 1440 195.085 13.9673 10.6287
22 FTSE Average Forecasting - 60 122.369 11.0621 7.24807
23 FTSE Average Forecasting - 1440 3105.82 55.7299 40.9431
24 Gold Average Forecasting - 60 3.79681 1.94854 1.22559
25 Gold Average Forecasting - 1440 96.5074 9.82382 6.8452
26 USDCUSDT Close Price Average Forecasting 6.51406e-06 0.002552 0.001187
27 USDCUSDT Tradecount Average Forecasting 1.44419e+08 12017.5 7727.27
28 Python Questions Average Forecasting 2.18019e+06 1476.55 1175.32
29 Machine Learning Questions Average Forecasting 7140.3 84.5003 55.785
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
31 SP500 Average Difference Forecasting - 60 0.322452 0.567849 0.332982
32 SP500 Average Difference Forecasting - 1440 0.316412 0.562505 0.326336
33 FTSE Average Difference Forecasting - 60 5.79875 2.40806 1.32178
34 FTSE Average Difference Forecasting - 1440 5.70723 2.38898 1.30033
35 Gold Average Difference Forecasting - 60 0.240423 0.49033 0.315994
36 Gold Average Difference Forecasting - 1440 0.237595 0.487437 0.313009
37 USDCUSDT Close Price Average Difference Forecasting 4.53094e-06 0.002129 0.001103
38 USDCUSDT Tradecount Average Difference Forecasting 1.64095e+08 12810 7724.69
39 Python Questions Average Difference Forecasting 1.1445e+06 1069.81 751.801
40 Machine Learning Questions Average Difference Forecasting 3399.79 58.3077 38.5563
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102
42 SP500 Naive Difference Forecasting 0.553753 0.744146 0.456603
43 FTSE Naive Difference Forecasting 10.796 3.28573 1.81081
44 Gold Naive Difference Forecasting 0.521205 0.721946 0.480641
45 USDCUSDT Close Price Naive Difference Forecasting 9.13676e-06 0.003023 0.001659
46 USDCUSDT Tradecount Naive Difference Forecasting 4.15266e+08 20378.1 12089.8
47 Python Questions Naive Difference Forecasting 2.34416e+06 1531.07 1117.71
48 Machine Learning Questions Naive Difference Forecasting 6678.13 81.7198 51.9462
49 Ticket Sales Naive Difference Forecasting 37594.4 193.893 136.398
50 Gold Autoregression 0.167187 0.408885 0.270593
51 SP500 Autoregression 0.181846 0.426434 0.236095
52 Machine Learning Questions Autoregression 9493.88 97.4365 80.8513
53 Ticket Sales Autoregression 12195.6 110.434 78.7188
54 Gold Manual ARIMA 8.31094 2.88287 1.99452
tableOfResults[tableOfResults['Data'] == 'Gold'].sort_values('MSE')
Unnamed: 0 Data Algorithm MSE RMSE MAE
50 Gold Autoregression 0.167187 0.408885 0.270593
3 Gold Naive Forecasting 0.237289 0.487123 0.31245
36 Gold Average Difference Forecasting - 1440 0.237595 0.487437 0.313009
35 Gold Average Difference Forecasting - 60 0.240423 0.49033 0.315994
44 Gold Naive Difference Forecasting 0.521205 0.721946 0.480641
24 Gold Average Forecasting - 60 3.79681 1.94854 1.22559
54 Gold Manual ARIMA 8.31094 2.88287 1.99452
13 Gold Seasonal Forecasting - 60 11.0817 3.32892 2.08104
25 Gold Average Forecasting - 1440 96.5074 9.82382 6.8452
14 Gold Seasonal Forecasting - 1440 297.39 17.245 12.2358

So far, autoregression still outperforms it. Let's see whether other ARIMA parameters improve the outcome.

ARIMA

def exploringARIMA(data, ar, i, m):
    model = sm.tsa.arima.ARIMA(data, order = [ar,i,m], trend = 'n').fit()
    results = pd.DataFrame(data.rename('observed'))
    results['predicted'] = model.predict()
    results.plot()
    return results

Ticket Sales

ticketSalesData = ticketSales_data.reset_index()['tickets_sold']
ticketSalesData
0      125.650000
1      178.675325
2      175.461017
3      114.192192
4       94.831956
          ...
227     70.583704
228    113.653521
229    126.824561
230     75.431177
231     49.894737
Name: tickets_sold, Length: 232, dtype: float64

MA 2

ticketSalesMA2Model = exploringARIMA(ticketSalesData, 0, 0, 2)
Visualization
appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - MA2', evaluateErrors(ticketSalesMA2Model))
tableOfResults
Unnamed: 0 Data Algorithm MSE RMSE MAE
...
55 Ticket Sales ARIMA - MA2 14257.6 119.405 93.6058

MA 3

ticketSalesMA3Model = exploringARIMA(ticketSalesData, 0, 0, 3)
Visualization
appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - MA3', evaluateErrors(ticketSalesMA3Model))
tableOfResults
Unnamed: 0 Data Algorithm MSE RMSE MAE
...
55 Ticket Sales ARIMA - MA2 14257.6 119.405 93.6058
56 Ticket Sales ARIMA - MA3 9808.53 99.038 63.8928

MA 8 - 10

for i in range(8,11):
    model = exploringARIMA(ticketSalesData, 0, 0, i)
    appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - MA' + str(i), evaluateErrors(model))

tableOfResults[tableOfResults['Data']=='Ticket Sales'].sort_values('MSE')
Visualization
Visualization
Visualization
Unnamed: 0 Data Algorithm MSE RMSE MAE
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
59 Ticket Sales ARIMA - MA10 5362.12 73.2265 49.4972
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
58 Ticket Sales ARIMA - MA9 6219.51 78.8639 55.3295
57 Ticket Sales ARIMA - MA8 6453.79 80.3355 59.9724
56 Ticket Sales ARIMA - MA3 9808.53 99.038 63.8928
53 Ticket Sales Autoregression 12195.6 110.434 78.7188
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102
55 Ticket Sales ARIMA - MA2 14257.6 119.405 93.6058
49 Ticket Sales Naive Difference Forecasting 37594.4 193.893 136.398

Visually, we do not see much improvement from increasing the MA order, but the error metrics say otherwise. It is still not the best result, though, so let's try a higher order.

MA 20

ticketSalesMA20Model = exploringARIMA(ticketSalesData, 0, 0, 20)
appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - MA20', evaluateErrors(ticketSalesMA20Model))
tableOfResults[tableOfResults['Data']=='Ticket Sales'].sort_values('MSE')
Visualization
Unnamed: 0 Data Algorithm MSE RMSE MAE
60 Ticket Sales ARIMA - MA20 3774.48 61.4368 42.521
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
59 Ticket Sales ARIMA - MA10 5362.12 73.2265 49.4972
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
58 Ticket Sales ARIMA - MA9 6219.51 78.8639 55.3295
57 Ticket Sales ARIMA - MA8 6453.79 80.3355 59.9724
56 Ticket Sales ARIMA - MA3 9808.53 99.038 63.8928
53 Ticket Sales Autoregression 12195.6 110.434 78.7188
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102
55 Ticket Sales ARIMA - MA2 14257.6 119.405 93.6058
49 Ticket Sales Naive Difference Forecasting 37594.4 193.893 136.398

From the current observations on the ticket sales dataset, increasing the MA order lowers the errors.

Let's try using only I (differencing).

for i in [1, 2, 5, 10]:
    model = exploringARIMA(ticketSalesData, 0, i, 0)
    appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - I' + str(i), evaluateErrors(model))

tableOfResults[tableOfResults['Data']=='Ticket Sales'].sort_values('MSE')
Visualization
Visualization
Visualization
Visualization
Unnamed: 0 Data Algorithm MSE RMSE MAE
60 Ticket Sales ARIMA - MA20 3774.48 61.4368 42.521
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
59 Ticket Sales ARIMA - MA10 5362.11 73.2265 49.4972
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
58 Ticket Sales ARIMA - MA9 6219.51 78.8639 55.3295
57 Ticket Sales ARIMA - MA8 6453.8 80.3355 59.9724
56 Ticket Sales ARIMA - MA3 9808.53 99.038 63.8928
53 Ticket Sales Autoregression 12195.6 110.434 78.7188
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
61 Ticket Sales ARIMA - I1 12490.3 111.76 75.0726
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102
55 Ticket Sales ARIMA - MA2 14257.6 119.405 93.6058
62 Ticket Sales ARIMA - I2 37338.9 193.233 135.811
49 Ticket Sales Naive Difference Forecasting 37594.4 193.893 136.398
63 Ticket Sales ARIMA - I5 1.52389e+06 1234.46 926.465
64 Ticket Sales ARIMA - I10 1.08863e+09 32994.4 26180.1

As observed, I needs to be used sparingly: increasing the differencing order quickly inflates the predicted values, and with them the errors.
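One way to see why higher I values blow up: for an ARIMA(0, d, 0) model without a trend term, the one-step prediction error is the d-th difference of the series. That is consistent with I1 landing next to Naive Forecasting and I2 next to Naive Difference Forecasting in the table. The sketch below uses a synthetic noisy series (an assumption, not the ticket sales data) and hypothetical helper names to show how the MSE grows with d. For pure white noise, the variance of the d-th difference grows by the binomial factor C(2d, d).

```python
import random

def difference(series, d):
    """Apply first-order differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

def mse_of_pure_integration(series, d):
    """In-sample one-step MSE of an ARIMA(0, d, 0) model without trend.

    Its one-step error is the d-th difference of the series, so no fitting is needed.
    """
    errors = difference(series, d)
    return sum(e * e for e in errors) / len(errors)

random.seed(0)
# Synthetic stand-in for a noisy daily series (not the ticket sales data).
noisy_series = [100 + random.gauss(0, 10) for _ in range(2000)]

mse_by_d = {d: mse_of_pure_integration(noisy_series, d) for d in [1, 2, 5, 10]}
for d, mse in mse_by_d.items():
    print(f"d={d:>2}  MSE={mse:,.1f}")
```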

I1 MA10

ticketSalesI1MA10Model = exploringARIMA(ticketSalesData, 0, 1, 10)
appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - I1 MA10', evaluateErrors(ticketSalesI1MA10Model))
tableOfResults[tableOfResults['Data']=='Ticket Sales'].sort_values('MSE')
Visualization
Unnamed: 0 Data Algorithm MSE RMSE MAE
60 Ticket Sales ARIMA - MA20 3774.48 61.4368 42.521
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
65 Ticket Sales ARIMA - I1 MA10 4073.38 63.823 47.404
59 Ticket Sales ARIMA - MA10 5362.11 73.2265 49.4972
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
58 Ticket Sales ARIMA - MA9 6219.51 78.8639 55.3295
57 Ticket Sales ARIMA - MA8 6453.8 80.3355 59.9724
56 Ticket Sales ARIMA - MA3 9808.53 99.038 63.8928
53 Ticket Sales Autoregression 12195.6 110.434 78.7188
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
61 Ticket Sales ARIMA - I1 12490.3 111.76 75.0726
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102
55 Ticket Sales ARIMA - MA2 14257.6 119.405 93.6058
62 Ticket Sales ARIMA - I2 37338.9 193.233 135.811
49 Ticket Sales Naive Difference Forecasting 37594.4 193.893 136.398
63 Ticket Sales ARIMA - I5 1.52389e+06 1234.46 926.465
64 Ticket Sales ARIMA - I10 1.08863e+09 32994.4 26180.1
ticketSalesI1MA5Model = exploringARIMA(ticketSalesData, 0, 1, 5)
appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - I1 MA5', evaluateErrors(ticketSalesI1MA5Model))
tableOfResults[tableOfResults['Data']=='Ticket Sales'].sort_values('MSE')
Visualization
Unnamed: 0 Data Algorithm MSE RMSE MAE
60 Ticket Sales ARIMA - MA20 3774.48 61.4368 42.521
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
65 Ticket Sales ARIMA - I1 MA10 4073.38 63.823 47.404
59 Ticket Sales ARIMA - MA10 5362.11 73.2265 49.4972
66 Ticket Sales ARIMA - I1 MA5 6030.57 77.6567 61.9688
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
58 Ticket Sales ARIMA - MA9 6219.51 78.8639 55.3295
57 Ticket Sales ARIMA - MA8 6453.8 80.3355 59.9724
56 Ticket Sales ARIMA - MA3 9808.53 99.038 63.8928
53 Ticket Sales Autoregression 12195.6 110.434 78.7188
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
61 Ticket Sales ARIMA - I1 12490.3 111.76 75.0726
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102
55 Ticket Sales ARIMA - MA2 14257.6 119.405 93.6058
62 Ticket Sales ARIMA - I2 37338.9 193.233 135.811
49 Ticket Sales Naive Difference Forecasting 37594.4 193.893 136.398
63 Ticket Sales ARIMA - I5 1.52389e+06 1234.46 926.465
64 Ticket Sales ARIMA - I10 1.08863e+09 32994.4 26180.1
ticketSalesI1MA2Model = exploringARIMA(ticketSalesData, 0, 1, 2)
appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - I1 MA2', evaluateErrors(ticketSalesI1MA2Model))
tableOfResults[tableOfResults['Data']=='Ticket Sales'].sort_values('MSE')
Visualization
Unnamed: 0 Data Algorithm MSE RMSE MAE
60 Ticket Sales ARIMA - MA20 3774.48 61.4368 42.521
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
65 Ticket Sales ARIMA - I1 MA10 4073.38 63.823 47.404
59 Ticket Sales ARIMA - MA10 5362.11 73.2265 49.4972
66 Ticket Sales ARIMA - I1 MA5 6030.57 77.6567 61.9688
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
58 Ticket Sales ARIMA - MA9 6219.51 78.8639 55.3295
57 Ticket Sales ARIMA - MA8 6453.8 80.3355 59.9724
67 Ticket Sales ARIMA - I1 MA2 6726.3 82.014 62.7677
56 Ticket Sales ARIMA - MA3 9808.53 99.038 63.8928
53 Ticket Sales Autoregression 12195.6 110.434 78.7188
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
61 Ticket Sales ARIMA - I1 12490.3 111.76 75.0726
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102
55 Ticket Sales ARIMA - MA2 14257.6 119.405 93.6058
62 Ticket Sales ARIMA - I2 37338.9 193.233 135.811
49 Ticket Sales Naive Difference Forecasting 37594.4 193.893 136.398
63 Ticket Sales ARIMA - I5 1.52389e+06 1234.46 926.465
64 Ticket Sales ARIMA - I10 1.08863e+09 32994.4 26180.1
ticketSalesI2MA10Model = exploringARIMA(ticketSalesData, 0, 2, 10)
appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - I2 MA10', evaluateErrors(ticketSalesI2MA10Model))
tableOfResults[tableOfResults['Data']=='Ticket Sales'].sort_values('MSE')
Visualization
Unnamed: 0 Data Algorithm MSE RMSE MAE
60 Ticket Sales ARIMA - MA20 3774.48 61.4368 42.521
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
65 Ticket Sales ARIMA - I1 MA10 4073.38 63.823 47.404
68 Ticket Sales ARIMA - I2 MA10 5306.96 72.8489 54.751
59 Ticket Sales ARIMA - MA10 5362.11 73.2265 49.4972
66 Ticket Sales ARIMA - I1 MA5 6030.57 77.6567 61.9688
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
58 Ticket Sales ARIMA - MA9 6219.51 78.8639 55.3295
57 Ticket Sales ARIMA - MA8 6453.8 80.3355 59.9724
67 Ticket Sales ARIMA - I1 MA2 6726.3 82.014 62.7677
56 Ticket Sales ARIMA - MA3 9808.53 99.038 63.8928
53 Ticket Sales Autoregression 12195.6 110.434 78.7188
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
61 Ticket Sales ARIMA - I1 12490.3 111.76 75.0726
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102
55 Ticket Sales ARIMA - MA2 14257.6 119.405 93.6058
62 Ticket Sales ARIMA - I2 37338.9 193.233 135.811
49 Ticket Sales Naive Difference Forecasting 37594.4 193.893 136.398
63 Ticket Sales ARIMA - I5 1.52389e+06 1234.46 926.465
64 Ticket Sales ARIMA - I10 1.08863e+09 32994.4 26180.1

So far, MA20 is still the best-performing model (lowest MSE). However, a higher MA order combined with an I of 1 also performs well: I1 MA10 (MSE 4073.38) beats MA10 alone (MSE 5362.11).

Now let's explore ARIMA.

A7 I1 MA2,3,5,10

for i in [2,3,5,10]:
    model = exploringARIMA(ticketSalesData, 7, 1, i)
    appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - A7 I1 MA' + str(i), evaluateErrors(model))

tableOfResults[tableOfResults['Data']=='Ticket Sales'].sort_values('MSE')
Visualization
Unnamed: 0 Data Algorithm MSE RMSE MAE
72 Ticket Sales ARIMA - A7 I1 MA10 2805.43 52.9663 34.2589
71 Ticket Sales ARIMA - A7 I1 MA5 2835.07 53.2454 33.3404
70 Ticket Sales ARIMA - A7 I1 MA3 2875.15 53.6205 33.8151
69 Ticket Sales ARIMA - A7 I1 MA2 2900.55 53.8568 33.7684
60 Ticket Sales ARIMA - MA20 3774.48 61.4368 42.521
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
65 Ticket Sales ARIMA - I1 MA10 4073.38 63.823 47.404
68 Ticket Sales ARIMA - I2 MA10 5306.96 72.8489 54.751
59 Ticket Sales ARIMA - MA10 5362.11 73.2265 49.4972
66 Ticket Sales ARIMA - I1 MA5 6030.57 77.6567 61.9688
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
58 Ticket Sales ARIMA - MA9 6219.51 78.8639 55.3295
57 Ticket Sales ARIMA - MA8 6453.8 80.3355 59.9724
67 Ticket Sales ARIMA - I1 MA2 6726.3 82.014 62.7677
56 Ticket Sales ARIMA - MA3 9808.53 99.038 63.8928
53 Ticket Sales Autoregression 12195.6 110.434 78.7188
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
61 Ticket Sales ARIMA - I1 12490.3 111.76 75.0726
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102
55 Ticket Sales ARIMA - MA2 14257.6 119.405 93.6058
62 Ticket Sales ARIMA - I2 37338.9 193.233 135.811
49 Ticket Sales Naive Difference Forecasting 37594.4 193.893 136.398
63 Ticket Sales ARIMA - I5 1.52389e+06 1234.46 926.465
64 Ticket Sales ARIMA - I10 1.08863e+09 32994.4 26180.1

When the Autoregressive, Integrated and Moving Average terms are used together, the model performs best according to MSE, RMSE and MAE. We will now try a few iterations and parameters with the gold data.

Gold Data

arimaGoldData = goldData.reset_index()['Close'][:30000]
arimaGoldData.plot()
Visualization

MA Test

for i in [1,2,5,10]:
    model = exploringARIMA(arimaGoldData, 0,0,i)
    appendingTOR(tableOfResults, 'Gold', 'ARIMA - MA' + str(i), evaluateErrors(model))
Visualization
tableOfResults[tableOfResults['Data'] == 'Gold'].sort_values('MSE')
Unnamed: 0 Data Algorithm MSE RMSE MAE
50 Gold Autoregression 0.167187 0.408885 0.270593
3 Gold Naive Forecasting 0.237289 0.487123 0.31245
36 Gold Average Difference Forecasting - 1440 0.237595 0.487437 0.313009
35 Gold Average Difference Forecasting - 60 0.240423 0.49033 0.315994
44 Gold Naive Difference Forecasting 0.521205 0.721946 0.480641
24 Gold Average Forecasting - 60 3.79681 1.94854 1.22559
54 Gold Manual ARIMA 8.31094 2.88287 1.99452
13 Gold Seasonal Forecasting - 60 11.0817 3.32892 2.08104
25 Gold Average Forecasting - 1440 96.5074 9.82382 6.8452
14 Gold Seasonal Forecasting - 1440 297.39 17.245 12.2358
75 Gold ARIMA - MA5 155277 394.052 316.199
74 Gold ARIMA - MA2 194643 441.184 368.211
73 Gold ARIMA - MA1 439874 663.23 662.926
76 Gold ARIMA - MA10 1.51433e+20 1.23058e+10 1.21355e+10

As shown above, using MA alone does not perform well. These are actually the worst models in the table, especially since no pre-processing was done. Adding I, so that differencing is performed, should improve the performance.
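A rough intuition for this result: an MA(q) forecast is a constant plus a weighted sum of the last q residuals, so it stays anchored to a fixed level and cannot follow a price that wanders. The sketch below is a simplified stand-in rather than a fitted MA model. It compares a predictor pinned to the global mean with the naive last-value forecast on a synthetic random walk (an assumption, not the gold data).

```python
import random

random.seed(1)
# Synthetic random walk standing in for a price level (not the gold data).
walk = [1300.0]
for _ in range(5000):
    walk.append(walk[-1] + random.gauss(0, 0.5))

def mse(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

targets = walk[1:]
global_mean = sum(walk) / len(walk)
mean_anchored = [global_mean] * len(targets)  # level a mean-reverting forecast returns to
last_value = walk[:-1]                        # naive forecast: y[t-1]

mse_mean_anchored = mse(targets, mean_anchored)
mse_last_value = mse(targets, last_value)
print(mse_mean_anchored, mse_last_value)
```

Differencing removes the wandering level, which is why the I term is expected to help here.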

I Test

for i in [1,2,5,10]:
    model = exploringARIMA(arimaGoldData, 0,i,0)
    appendingTOR(tableOfResults, 'Gold', 'ARIMA - I' + str(i), evaluateErrors(model))
Visualization
tableOfResults[tableOfResults['Data'] == 'Gold'].sort_values('MSE')
Unnamed: 0 Data Algorithm MSE RMSE MAE
50 Gold Autoregression 0.167187 0.408885 0.270593
3 Gold Naive Forecasting 0.237289 0.487123 0.31245
36 Gold Average Difference Forecasting - 1440 0.237595 0.487437 0.313009
35 Gold Average Difference Forecasting - 60 0.240423 0.49033 0.315994
44 Gold Naive Difference Forecasting 0.521205 0.721946 0.480641
24 Gold Average Forecasting - 60 3.79681 1.94854 1.22559
54 Gold Manual ARIMA 8.31094 2.88287 1.99452
13 Gold Seasonal Forecasting - 60 11.0817 3.32892 2.08104
77 Gold ARIMA - I1 56.3362 7.50575 0.295501
78 Gold ARIMA - I2 70.5507 8.39945 0.459694
25 Gold Average Forecasting - 1440 96.5074 9.82382 6.8452
14 Gold Seasonal Forecasting - 1440 297.39 17.245 12.2358
79 Gold ARIMA - I5 350.547 18.7229 2.61938
80 Gold ARIMA - I10 12166.7 110.303 66.0953
75 Gold ARIMA - MA5 155277 394.052 316.199
74 Gold ARIMA - MA2 194643 441.184 368.211
73 Gold ARIMA - MA1 439874 663.23 662.926
76 Gold ARIMA - MA10 1.51433e+20 1.23058e+10 1.21355e+10

An I of 1 is the best-performing of the ARIMA variants so far.

Let's keep I set to 1 and see whether an MA of 1, 2 or 5 performs better.

I1 MA Test

for i in [1,2,5]:
    model = exploringARIMA(arimaGoldData, 0,1,i)
    appendingTOR(tableOfResults, 'Gold', 'ARIMA - I1 MA' + str(i), evaluateErrors(model))
Visualization
tableOfResults[tableOfResults['Data'] == 'Gold'].sort_values('MSE')
Unnamed: 0 Data Algorithm MSE RMSE MAE
50 Gold Autoregression 0.167187 0.408885 0.270593
3 Gold Naive Forecasting 0.237289 0.487123 0.31245
36 Gold Average Difference Forecasting - 1440 0.237595 0.487437 0.313009
35 Gold Average Difference Forecasting - 60 0.240423 0.49033 0.315994
44 Gold Naive Difference Forecasting 0.521205 0.721946 0.480641
24 Gold Average Forecasting - 60 3.79681 1.94854 1.22559
54 Gold Manual ARIMA 8.31094 2.88287 1.99452
13 Gold Seasonal Forecasting - 60 11.0817 3.32892 2.08104
83 Gold ARIMA - I1 MA5 56.3336 7.50557 0.29298
82 Gold ARIMA - I1 MA2 56.3336 7.50557 0.292994
81 Gold ARIMA - I1 MA1 56.3337 7.50558 0.293011
77 Gold ARIMA - I1 56.3362 7.50575 0.295501
78 Gold ARIMA - I2 70.5507 8.39945 0.459694
25 Gold Average Forecasting - 1440 96.5074 9.82382 6.8452
14 Gold Seasonal Forecasting - 1440 297.39 17.245 12.2358
79 Gold ARIMA - I5 350.547 18.7229 2.61938
80 Gold ARIMA - I10 12166.7 110.303 66.0953
75 Gold ARIMA - MA5 155277 394.052 316.199
74 Gold ARIMA - MA2 194643 441.184 368.211
73 Gold ARIMA - MA1 439874 663.23 662.926
76 Gold ARIMA - MA10 1.51433e+20 1.23058e+10 1.21355e+10

MA5 appears to perform best, but only by a very small margin. Next, we explore adding the autoregressive term.

AR I1 MA5 Test

plot_pacf(arimaGoldData)
Visualization

Based on the PACF plot, an AR order of up to 4 should be sufficient. However, we can also increase it to see how it performs.
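To make the PACF reading concrete, here is a minimal pure-Python sketch of the quantity a PACF plot shows, computed with the Durbin-Levinson recursion. The function names are hypothetical, and the series is a synthetic AR(1) process rather than the gold data. Lags whose partial autocorrelation falls outside the approximate 95% band of 1.96 / sqrt(n) are the usual candidates for the AR order.

```python
import random

def autocorrelations(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]

def partial_autocorrelations(series, max_lag):
    """PACF for lags 0..max_lag via the Durbin-Levinson recursion."""
    rho = autocorrelations(series, max_lag)
    pacf = [1.0]
    phi_prev = []  # coefficients of the AR(k-1) fit, lags 1..k-1
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_curr = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi_curr.append(phi_kk)
        pacf.append(phi_kk)
        phi_prev = phi_curr
    return pacf

random.seed(2)
# Synthetic AR(1) series with coefficient 0.7 (not the gold data).
ar1 = [0.0]
for _ in range(4999):
    ar1.append(0.7 * ar1[-1] + random.gauss(0, 1))

pacf_values = partial_autocorrelations(ar1, 5)
threshold = 1.96 / len(ar1) ** 0.5  # approximate 95% band drawn on PACF plots
significant_lags = [k for k in range(1, 6) if abs(pacf_values[k]) > threshold]
print(significant_lags)
```

For this synthetic series the lag-1 value sits near 0.7 and the higher lags are close to zero, which is the cut-off pattern one looks for when choosing the AR order.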

for i in [4, 8, 15, 20]:
    model = exploringARIMA(arimaGoldData, i,1,5)
    appendingTOR(tableOfResults, 'Gold', 'ARIMA - A' + str(i) + ' I1 MA5', evaluateErrors(model))
Visualization
tableOfResults[tableOfResults['Data'] == 'Gold'].sort_values('MSE')
Unnamed: 0 Data Algorithm MSE RMSE MAE
50 Gold Autoregression 0.167187 0.408885 0.270593
3 Gold Naive Forecasting 0.237289 0.487123 0.31245
36 Gold Average Difference Forecasting - 1440 0.237595 0.487437 0.313009
35 Gold Average Difference Forecasting - 60 0.240423 0.49033 0.315994
44 Gold Naive Difference Forecasting 0.521205 0.721946 0.480641
24 Gold Average Forecasting - 60 3.79681 1.94854 1.22559
54 Gold Manual ARIMA 8.31094 2.88287 1.99452
13 Gold Seasonal Forecasting - 60 11.0817 3.32892 2.08104
87 Gold ARIMA - A20 I1 MA5 56.3334 7.50556 0.293046
86 Gold ARIMA - A15 I1 MA5 56.3334 7.50556 0.293048
85 Gold ARIMA - A8 I1 MA5 56.3334 7.50556 0.293065
84 Gold ARIMA - A4 I1 MA5 56.3335 7.50556 0.293029
83 Gold ARIMA - I1 MA5 56.3336 7.50557 0.29298
82 Gold ARIMA - I1 MA2 56.3336 7.50557 0.292994
81 Gold ARIMA - I1 MA1 56.3337 7.50558 0.293011
77 Gold ARIMA - I1 56.3362 7.50575 0.295501
78 Gold ARIMA - I2 70.5507 8.39945 0.459694
25 Gold Average Forecasting - 1440 96.5074 9.82382 6.8452
14 Gold Seasonal Forecasting - 1440 297.39 17.245 12.2358
79 Gold ARIMA - I5 350.547 18.7229 2.61938
80 Gold ARIMA - I10 12166.7 110.303 66.0953
75 Gold ARIMA - MA5 155277 394.052 316.199
74 Gold ARIMA - MA2 194643 441.184 368.211
73 Gold ARIMA - MA1 439874 663.23 662.926
76 Gold ARIMA - MA10 1.51433e+20 1.23058e+10 1.21355e+10

While these are the best-performing ARIMA models so far, they still do not beat the simpler algorithms used earlier.
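One detail in the Gold table is worth a closer look: the differenced ARIMA rows have an MSE of about 56.3 but an MAE of about 0.29, which is close to the Naive Forecasting MAE. A gap that large between MSE and MAE usually means a few very large errors dominate the squared metric. One possible cause, stated here as an assumption that was not checked against the original run, is that the first in-sample prediction is 0, so the first error equals the full price level. The sketch below uses hypothetical numbers to show how a single such row can dominate MSE while barely moving MAE.

```python
def mse_and_mae(observed, predicted):
    errors = [o - p for o, p in zip(observed, predicted)]
    n = len(errors)
    return sum(e * e for e in errors) / n, sum(abs(e) for e in errors) / n

n = 30000
level = 1300.0  # hypothetical price level, for illustration only
observed = [level] * n
# Hypothetical predictions: off by 0.3 everywhere, except a first prediction of 0.
predicted = [0.0] + [level - 0.3] * (n - 1)

mse_all, mae_all = mse_and_mae(observed, predicted)
mse_trimmed, mae_trimmed = mse_and_mae(observed[1:], predicted[1:])
print(f"all rows:      MSE={mse_all:.3f}  MAE={mae_all:.3f}")
print(f"first dropped: MSE={mse_trimmed:.3f}  MAE={mae_trimmed:.3f}")
```

If that is what happened here, evaluating the errors from the second row onward would give a fairer comparison with the simpler algorithms.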

Autoreg only

autoregGold4 = exploringARIMA(arimaGoldData, 4, 0, 0)
autoregGold4[1:].plot()
Visualization
appendingTOR(tableOfResults, 'Gold', 'ARIMA - A4', evaluateErrors(autoregGold4))
tableOfResults[tableOfResults['Data'] == 'Gold'].sort_values('MSE')
Unnamed: 0 Data Algorithm MSE RMSE MAE
50 Gold Autoregression 0.167187 0.408885 0.270593
3 Gold Naive Forecasting 0.237289 0.487123 0.31245
36 Gold Average Difference Forecasting - 1440 0.237595 0.487437 0.313009
35 Gold Average Difference Forecasting - 60 0.240423 0.49033 0.315994
44 Gold Naive Difference Forecasting 0.521205 0.721946 0.480641
24 Gold Average Forecasting - 60 3.79681 1.94854 1.22559
54 Gold Manual ARIMA 8.31094 2.88287 1.99452
13 Gold Seasonal Forecasting - 60 11.0817 3.32892 2.08104
87 Gold ARIMA - A20 I1 MA5 56.3334 7.50556 0.293046
86 Gold ARIMA - A15 I1 MA5 56.3334 7.50556 0.293048
85 Gold ARIMA - A8 I1 MA5 56.3334 7.50556 0.293065
84 Gold ARIMA - A4 I1 MA5 56.3335 7.50556 0.293029
83 Gold ARIMA - I1 MA5 56.3336 7.50557 0.29298
82 Gold ARIMA - I1 MA2 56.3336 7.50557 0.292994
81 Gold ARIMA - I1 MA1 56.3337 7.50558 0.293011
77 Gold ARIMA - I1 56.3362 7.50575 0.295501
88 Gold ARIMA - A4 56.3762 7.50841 0.334332
78 Gold ARIMA - I2 70.5507 8.39945 0.459694
25 Gold Average Forecasting - 1440 96.5074 9.82382 6.8452
14 Gold Seasonal Forecasting - 1440 297.39 17.245 12.2358
79 Gold ARIMA - I5 350.547 18.7229 2.61938
80 Gold ARIMA - I10 12166.7 110.303 66.0953
75 Gold ARIMA - MA5 155277 394.052 316.199
74 Gold ARIMA - MA2 194643 441.184 368.211
73 Gold ARIMA - MA1 439874 663.23 662.926
76 Gold ARIMA - MA10 1.51433e+20 1.23058e+10 1.21355e+10

We might expect the AR(4) model to perform well. However, it did not, and it sits in the middle of the pack.

tableOfResults[tableOfResults['Data'] == 'Ticket Sales'].sort_values('MSE')
Unnamed: 0 Data Algorithm MSE RMSE MAE
72 Ticket Sales ARIMA - A7 I1 MA10 2805.43 52.9663 34.2589
71 Ticket Sales ARIMA - A7 I1 MA5 2835.07 53.2454 33.3404
70 Ticket Sales ARIMA - A7 I1 MA3 2875.15 53.6205 33.8151
69 Ticket Sales ARIMA - A7 I1 MA2 2900.55 53.8568 33.7684
60 Ticket Sales ARIMA - MA20 3774.48 61.4368 42.521
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
65 Ticket Sales ARIMA - I1 MA10 4073.38 63.823 47.404
68 Ticket Sales ARIMA - I2 MA10 5306.96 72.8489 54.751
59 Ticket Sales ARIMA - MA10 5362.11 73.2265 49.4972
66 Ticket Sales ARIMA - I1 MA5 6030.57 77.6567 61.9688
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
58 Ticket Sales ARIMA - MA9 6219.51 78.8639 55.3295
57 Ticket Sales ARIMA - MA8 6453.8 80.3355 59.9724
67 Ticket Sales ARIMA - I1 MA2 6726.3 82.014 62.7677
56 Ticket Sales ARIMA - MA3 9808.53 99.038 63.8928
53 Ticket Sales Autoregression 12195.6 110.434 78.7188
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
61 Ticket Sales ARIMA - I1 12490.3 111.76 75.0726
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102
55 Ticket Sales ARIMA - MA2 14257.6 119.405 93.6058
62 Ticket Sales ARIMA - I2 37338.9 193.233 135.811
49 Ticket Sales Naive Difference Forecasting 37594.4 193.893 136.398
63 Ticket Sales ARIMA - I5 1.52389e+06 1234.46 926.465
64 Ticket Sales ARIMA - I10 1.08863e+09 32994.4 26180.1

Seasonal ARIMA

We will again try SARIMA with the ticket sales data.

ticketSalesData.plot()
sm.tsa.arima.ARIMA(ticketSalesData, order=(5,1,5), seasonal_order=(7,1,5,7), trend = 'n').fit().predict().plot()
Visualization

As shown, the function accepts more parameters. We can add a seasonal component through seasonal_order. Its first three values play the same roles as the three values in order (autoregressive, differencing and moving-average orders), and the last value is the expected length of the season. If we expect weekly seasonality and each row is one day, as in this case, we can use 7. As the referenced site explains, order addresses the trend and seasonal_order addresses the seasonality.
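To illustrate what the seasonal differencing value does with a season length of 7, the sketch below subtracts from each observation the value seven rows earlier. The helper name and the daily counts are hypothetical and only meant to show the effect.

```python
def seasonal_difference(series, period):
    """Return y[t] - y[t - period] for every t >= period."""
    return [series[t] - series[t - period] for t in range(period, len(series))]

# Hypothetical daily counts: a fixed weekly pattern plus a drift of 1 per week.
weekly_pattern = [50, 40, 45, 60, 120, 200, 150]
daily = [weekly_pattern[t % 7] + t // 7 for t in range(28)]

deseasonalised = seasonal_difference(daily, 7)
print(deseasonalised)  # every entry is 1: the weekly pattern cancels, the drift remains
```

The weekly shape disappears entirely and only the week-over-week change is left, which is what the non-seasonal part of the model then has to explain.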

However, there are simply too many parameters to tune by hand. Therefore, we will run loops and append the results to the table, while keeping the order parameters fixed.

def exploringSARIMA(data, p, q, r, m):
    # (p, q, r, m) is passed as seasonal_order: seasonal AR order, seasonal
    # differencing, seasonal MA order and season length; order stays (5,1,5).
    # Fit on the data argument rather than the global ticketSalesData.
    model = sm.tsa.arima.ARIMA(data, order=(5,1,5), seasonal_order=(p,q,r,m), trend = 'n').fit()
    results = pd.DataFrame(data.rename('observed'))
    results['predicted'] = model.predict()
    results.plot()
    return results

for p in range(5, 10):
    for r in range(3, 7):
        model = exploringSARIMA(ticketSalesData, p, 1, r, 7)
        appendingTOR(tableOfResults, 'Ticket Sales',
                     'SARIMA - p' + str(p) + ' q' + str(1) + ' r' + str(r) + ' m' + str(7),
                     evaluateErrors(model))
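As a side note, the nested loops over the seasonal parameters could also be written as a flat grid, which makes it easier to add further values later. The sketch below only builds the labels and seasonal_order tuples and does not fit any model. `sarima_grid` is a hypothetical helper, and the label format mirrors the one used in the results table.

```python
from itertools import product

def sarima_grid(p_values, d_values, q_values, period):
    """Yield (label, seasonal_order) pairs for every parameter combination."""
    for p, d, q in product(p_values, d_values, q_values):
        # Same naming as the results table: 'q' is the seasonal differencing
        # value and 'r' the seasonal MA order.
        label = f"SARIMA - p{p} q{d} r{q} m{period}"
        yield label, (p, d, q, period)

grid = list(sarima_grid(range(5, 10), [1], range(3, 7), 7))
print(len(grid))   # 20 combinations
print(grid[0])     # ('SARIMA - p5 q1 r3 m7', (5, 1, 3, 7))
```

Each tuple could then be passed as the seasonal_order argument, and each label used as the Algorithm name when appending to the table.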
Visualization
tableOfResults[tableOfResults['Data'] == 'Ticket Sales'].sort_values('MSE')
Unnamed: 0 Data Algorithm MSE RMSE MAE
72 Ticket Sales ARIMA - A7 I1 MA10 2805.43 52.9663 34.2589
71 Ticket Sales ARIMA - A7 I1 MA5 2835.07 53.2454 33.3404
70 Ticket Sales ARIMA - A7 I1 MA3 2875.15 53.6205 33.8151
108 Ticket Sales SARIMA - p9 q1 r6 m7 2882.86 53.6923 33.7857
105 Ticket Sales SARIMA - p9 q1 r3 m7 2884.03 53.7032 34.02
106 Ticket Sales SARIMA - p9 q1 r4 m7 2890.3 53.7616 34.1163
107 Ticket Sales SARIMA - p9 q1 r5 m7 2895.94 53.814 33.693
103 Ticket Sales SARIMA - p8 q1 r5 m7 2900.32 53.8546 33.5066
69 Ticket Sales ARIMA - A7 I1 MA2 2900.55 53.8568 33.7684
102 Ticket Sales SARIMA - p8 q1 r4 m7 2907.21 53.9185 34.1786
100 Ticket Sales SARIMA - p7 q1 r6 m7 2920.49 54.0415 33.954
101 Ticket Sales SARIMA - p8 q1 r3 m7 2935.23 54.1777 34.1753
97 Ticket Sales SARIMA - p7 q1 r3 m7 2940.85 54.2296 33.917
104 Ticket Sales SARIMA - p8 q1 r6 m7 2941.32 54.234 34.4356
93 Ticket Sales SARIMA - p6 q1 r3 m7 2947.68 54.2926 33.6726
96 Ticket Sales SARIMA - p6 q1 r6 m7 2948.95 54.3043 33.8336
98 Ticket Sales SARIMA - p7 q1 r4 m7 2952.75 54.3392 33.7096
99 Ticket Sales SARIMA - p7 q1 r5 m7 2953.18 54.3432 34.2208
95 Ticket Sales SARIMA - p6 q1 r5 m7 2958.34 54.3906 34.2589
90 Ticket Sales SARIMA - p5 q1 r4 m7 2959.82 54.4042 33.2821
92 Ticket Sales SARIMA - p5 q1 r6 m7 2967.44 54.4742 34.2919
94 Ticket Sales SARIMA - p6 q1 r4 m7 2969.72 54.4952 33.8973
89 Ticket Sales SARIMA - p5 q1 r3 m7 2972.4 54.5197 33.3068
91 Ticket Sales SARIMA - p5 q1 r5 m7 2974.57 54.5396 33.2173
60 Ticket Sales ARIMA - MA20 3774.48 61.4368 42.521
19 Ticket Sales Seasonal Forecasting 3922.4 62.6291 37.0187
65 Ticket Sales ARIMA - I1 MA10 4073.38 63.823 47.404
68 Ticket Sales ARIMA - I2 MA10 5306.96 72.8489 54.751
59 Ticket Sales ARIMA - MA10 5362.11 73.2265 49.4972
66 Ticket Sales ARIMA - I1 MA5 6030.57 77.6567 61.9688
30 Ticket Sales Average Forecasting 6061.2 77.8537 59.7301
58 Ticket Sales ARIMA - MA9 6219.51 78.8639 55.3295
57 Ticket Sales ARIMA - MA8 6453.8 80.3355 59.9724
67 Ticket Sales ARIMA - I1 MA2 6726.3 82.014 62.7677
56 Ticket Sales ARIMA - MA3 9808.53 99.038 63.8928
53 Ticket Sales Autoregression 12195.6 110.434 78.7188
8 Ticket Sales Naive Forecasting 12476 111.696 74.8537
61 Ticket Sales ARIMA - I1 12490.3 111.76 75.0726
41 Ticket Sales Average Difference Forecasting 12936.3 113.738 76.6102
55 Ticket Sales ARIMA - MA2 14257.6 119.405 93.6058
62 Ticket Sales ARIMA - I2 37338.9 193.233 135.811
49 Ticket Sales Naive Difference Forecasting 37594.4 193.893 136.398
63 Ticket Sales ARIMA - I5 1.52389e+06 1234.46 926.465
64 Ticket Sales ARIMA - I10 1.08863e+09 32994.4 26180.1

In the final results, we can see that ARIMA (A7 I1 MA10) still performs best, with the SARIMA models not far behind. The margins between them are very small.

