Multi-Dataset Time Series Modeling - Stationarity, Seasonality, and Forecasting Performance

Jun 25, 2024 | python data-science data-visualization data-analysis financial-modelling time-series pandas statsmodels forecasting monte-carlo-sim

A comprehensive comparative study of time series forecasting methods across six diverse datasets (Gold, S&P 500, FTSE, USDC/USDT, StackOverflow questions, and cinema tickets). This analysis evaluates 10+ forecasting techniques—from naive baselines to ARIMA/SARIMA models—with quantitative performance metrics (MSE, RMSE, MAE), explores stationarity testing, seasonality decomposition, Monte Carlo simulation, and implements a simulated trading strategy.

Back to Projects

Abstract

This project presents a comprehensive comparative analysis of time series forecasting methodologies applied to six diverse datasets: financial markets (Gold, S&P 500, FTSE), cryptocurrency (USDC/USDT), and non-financial time series (StackOverflow questions, cinema ticket sales). The study systematically evaluates forecasting performance across multiple algorithmic approaches, from simple baseline methods to advanced stochastic models.

Methodology

Data preprocessing included stationarity testing using KPSS and Augmented Dickey-Fuller (ADF) tests, with first-order differencing applied to achieve stationarity across all datasets. Seasonality was investigated through manual decomposition, Fast Fourier Transform (FFT) analysis, and visual inspection of ACF/PACF plots. Ten forecasting techniques were implemented and compared: Naive Forecasting, Seasonal Forecasting, Average Forecasting, Average Difference Forecasting, Autoregression (AR), ARIMA with varying orders, SARIMA with multiple seasonal configurations, and a manual EMA-based ARIMA approach. Advanced applications included Monte Carlo simulation for uncertainty quantification and a simulated trading strategy using AR predictions.

Key Findings

Performance varied significantly by dataset characteristics. For financial data (Gold, S&P 500, FTSE, USDC/USDT close prices), Naive Forecasting consistently achieved the lowest error metrics, suggesting these series exhibit random walk behavior where the best prediction is simply the last observed value. For datasets with clear seasonal patterns, more sophisticated models excelled, such as SARIMA models achieved superior performance on cinema ticket sales and Seasonal Forecasting performed best for USDC/USDT tradecount. The analysis revealed that first-order differencing (I=1) was necessary and sufficient for achieving stationarity across all datasets, while second-order differencing generally degraded performance.

The study demonstrates that model selection should be guided by dataset characteristics rather than model complexity. For random-walk financial series, simple methods often outperform sophisticated models, while seasonal data benefits from ARIMA/SARIMA frameworks. The Monte Carlo simulations successfully generated probabilistic forecasts, though the trading strategy implementation revealed the challenges of translating statistical predictions into profitable trades. FFT analysis confirmed strong weekly seasonality in ticket sales but minimal seasonal components in financial markets, validating the differential performance of seasonal models across datasets.

Library Initialization

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import os
import math

Datasets

The dataset that I have chosen for this assignment is using 4 of the available dataset provided in the UOL website. Gold data, S&P 500 data, FTSE data and USDCUSDT data. The other 2 dataset were found on www.kaggle.com. One of them is about the types of questions on stackoverflow over time, and one is about cinema tickets.

Gold, S&P 500, FTSE Dataset

We will explore the Gold, S&P 500 and FTSE dataset columns and plotting them

sp500_data = pd.read_csv('Datasets/SP 500 04072014 2011.csv')
sp500_data.head()

Unnamed: 0	Date	Time	Bar#	Bar Index	Tick Range	Open	High	Low	Close
0	04/04/2014	21:06:00	501724/501724	0	0	1865.09	1865.09	1865.09	1865.09
1	04/04/2014	21:04:00	501723/501724	-1	0	1865.1	1865.1	1865.1	1865.1
2	04/04/2014	21:03:00	501722/501724	-2	2	1865.13	1865.13	1865.11	1865.11
3	04/04/2014	21:02:00	501721/501724	-3	4	1865.18	1865.18	1865.14	1865.14
4	04/04/2014	21:01:00	501720/501724	-4	6	1865.26	1865.26	1865.2	1865.2

sp500_data['Date']

0         04/04/2014
1         04/04/2014
2         04/04/2014
3         04/04/2014
4         04/04/2014
             ...
501719    11/03/2009
501720    11/03/2009
501721    11/03/2009
501722    11/03/2009
501723    11/03/2009
Name: Date, Length: 501724, dtype: object

sp500_data.describe()

Unnamed: 0	Bar Index	Tick Range	Open	High	Low	Close
count	501724	501724	501724	501724	501724	501724
mean	-250862	34.4442	1320.38	1320.55	1320.2	1320.38
std	144835	39.3128	257.653	257.644	257.662	257.653
min	-501723	0	713.85	714.06	713.85	714.02
25%	-376292	12	1124.89	1125.04	1124.73	1124.88
50%	-250862	24	1306.71	1306.86	1306.54	1306.69
75%	-125431	44	1460.38	1460.54	1460.26	1460.39
max	0	1833	1896.97	1897.28	1896.36	1897.04

gold_data = pd.read_csv('Datasets/Gold 07042014 2011.csv')
gold_data.head()

Unnamed: 0	Date	Time	Bar#	Bar Index	Tick Range	Open	High	Low	Close
0	07/04/2014	13:54:00	886801/886801	0	80	1298.6	1299	1298.2	1298.5
1	07/04/2014	13:53:00	886800/886801	-1	69	1299.04	1299.29	1298.6	1298.6
2	07/04/2014	13:52:00	886799/886801	-2	39	1299.07	1299.29	1298.9	1298.9
3	07/04/2014	13:51:00	886798/886801	-3	31	1299	1299.21	1298.9	1299.09
4	07/04/2014	13:50:00	886797/886801	-4	52	1298.8	1299.22	1298.7	1299.17

gold_data['Date']

0         07/04/2014
1         07/04/2014
2         07/04/2014
3         07/04/2014
4         07/04/2014
             ...
886796    26/10/2011
886797    26/10/2011
886798    26/10/2011
886799    26/10/2011
886800    26/10/2011
Name: Date, Length: 886801, dtype: object

gold_data.describe()

Unnamed: 0	Bar Index	Tick Range	Open	High	Low	Close
count	886801	886801	886801	886801	886801	886801
mean	-443400	89.6254	1525.99	1526.35	1525.46	1525.96
std	255998	50.5657	180.892	180.869	180.847	180.901
min	-886800	0	1180.5	1183.7	1179.83	1180.85
25%	-665100	60	1334.7	1335.05	1334.23	1334.66
50%	-443400	80	1591.85	1592.19	1591.3	1591.83
75%	-221700	106	1676.32	1676.67	1675.8	1676.3
max	0	4227	1802.52	1802.92	1802.2	1802.43

ftse_data = pd.read_csv('Datasets/FTSE 04072014 2011.csv')

ftse_data.head()

Unnamed: 0	Date	Time	Bar#	Bar Index	Tick Range	Open	High	Low	Close
0	07/04/2014	14:23:00	408806/408806	0	2	6642.94	6642.94	6642.92	6642.93
1	07/04/2014	14:22:00	408805/408806	-1	51	6643.44	6643.45	6642.94	6642.96
2	07/04/2014	14:21:00	408804/408806	-2	84	6643.27	6644.06	6643.22	6643.45
3	07/04/2014	14:20:00	408803/408806	-3	44	6643.39	6643.69	6643.25	6643.29
4	07/04/2014	14:19:00	408802/408806	-4	92	6643.1	6643.83	6642.91	6643.38

ftse_data['Date']

0         07/04/2014
1         07/04/2014
2         07/04/2014
3         07/04/2014
4         07/04/2014
             ...
408801    01/02/2011
408802    01/02/2011
408803    01/02/2011
408804    01/02/2011
408805    01/02/2011
Name: Date, Length: 408806, dtype: object

ftse_data.describe()

Unnamed: 0	Bar Index	Tick Range	Open	High	Low	Close
count	408806	408806	408806	408806	408806	408806
mean	-204402	178.419	6025.72	6026.61	6024.83	6025.72
std	118012	211.696	459.01	458.806	459.218	459.01
min	-408805	0	4795.14	4798.8	4791.01	4795.12
25%	-306604	78	5715.06	5715.99	5714.18	5715.07
50%	-204402	130	5927.6	5928.33	5926.89	5927.61
75%	-102201	218	6462.52	6463.37	6461.66	6462.52
max	0	13825	6873.93	6875.62	6871.38	6873.75

A quick load and using the .head() function in pandas, gives us a quick view on all 3 of the dataset. They have similar columns where there are date and time separtely. While it is unsure about the 'Bar#' and 'Bar Index', the tick range could be the difference between the high and low columns.

gold_data['High_Low_Difference'] = gold_data['High'] - gold_data['Low']
gold_data[['Tick Range','High_Low_Difference']].head()

Unnamed: 0	Tick Range	High_Low_Difference
0	80	0.8
1	69	0.69
2	39	0.39
3	31	0.31
4	52	0.52

Based on the above result, we know that tick range is the difference between the high and the low columns, multiplied by 10. Not sure if we will need to use the result and information later.

We will now plot the columns open, close, high and low.

gold_data_test = gold_data
gold_data_test['Datetime'] = pd.to_datetime(
    gold_data_test['Date'] + ' ' + gold_data_test['Time'],
    format='%d/%m/%Y %H:%M:%S'
)

plt.plot(gold_data_test['Datetime'][:100], gold_data_test['Open'][:100])
plt.plot(gold_data_test['Datetime'][:100], gold_data_test['Close'][:100])
plt.plot(gold_data_test['Datetime'][:100], gold_data_test['High'][:100])
plt.plot(gold_data_test['Datetime'][:100], gold_data_test['Low'][:100])
plt.legend(['Open', 'Close', 'High', 'Low'])

As shown above, column 'high' shows the highest price that it has reached at the given time and the column 'low' shows the the opposite. While the high and low prices should be useful when using other models, but as of now, we will just stick to using only the close price.

def plot_gold_sp_ftse(data, title):
    data['Datetime'] = pd.to_datetime(
        data['Date'] + ' ' + data['Time'],
        format='%d/%m/%Y %H:%M:%S'
    )

    plt.plot(data['Datetime'], data['Close'])
    plt.title(title)
    plt.show()

plt.rcParams['figure.figsize'] = [7, 5]
plot_gold_sp_ftse(gold_data, 'Gold')
plot_gold_sp_ftse(sp500_data, 'S&P 500')
plot_gold_sp_ftse(ftse_data, 'FTSE')

The above are the normal graphs for

Gold price, from Oct 2011 - Apr 2014
S&P500 price, from Nov 2009 - Apr 2014
FTSE price, from Feb 2011 - Apr 2014.

We will try to see if we can explore the relationship between them. Because just in a quick glance, we can see that Gold and S&P500 are inversely related, while FTSE is somewhat correlate to S&P500.

USDCUSDT Dataset

Next we will be looking into the USDCUSDT Dataset.

After using both methods, it seems to be quite quite undetermined.

usdcusdt_data = pd.read_csv('Datasets/USDC-USDT.csv')

usdcusdt_data.head()

Unnamed: 0	unix	date	symbol	open	high	low	close	Volume USDC	Volume USDT	tradecount
0	1635552000000	2021-10-30 00:00:00	USDC/USDT	1	1	0.9999	0.9999	740710	740678	277
1	1635465600000	2021-10-29 00:00:00	USDC/USDT	0.9995	1	0.9994	1	1.46553e+08	1.46532e+08	43366
2	1635379200000	2021-10-28 00:00:00	USDC/USDT	0.9994	0.9997	0.999	0.9995	2.97686e+08	2.97482e+08	58314
3	1635292800000	2021-10-27 00:00:00	USDC/USDT	0.9997	1	0.9976	0.9994	2.93017e+08	2.92868e+08	67437
4	1635206400000	2021-10-26 00:00:00	USDC/USDT	1.0003	1.0008	0.9993	0.9998	2.10106e+08	2.10147e+08	56204

usdcusdt_data['date']

0       2021-10-30 00:00:00
1       2021-10-29 00:00:00
2       2021-10-28 00:00:00
3       2021-10-27 00:00:00
4       2021-10-26 00:00:00
               ...
1046    2018-12-19 00:00:00
1047    2018-12-18 00:00:00
1048    2018-12-17 00:00:00
1049    2018-12-16 00:00:00
1050    2018-12-15 00:00:00
Name: date, Length: 1051, dtype: object

usdcusdt_data.describe()

Unnamed: 0	unix	open	high	low	close	Volume USDC	Volume USDT	tradecount
count	1051	1051	1051	1051	1051	1051	1051	1051
mean	1.59019e+12	1.00001	1.011	0.998193	0.999991	5.50997e+07	5.50802e+07	29534.8
std	2.6226e+10	0.003651	0.281	0.004549	0.003629	7.75637e+07	7.75237e+07	21174.2
min	1.54483e+12	0.987	0.9918	0.9367	0.9866	542413	548254	277
25%	1.56751e+12	0.999	1	0.9975	0.999	7.135e+06	7.13665e+06	14527
50%	1.59019e+12	0.9997	1.0006	0.999	0.9997	1.47944e+07	1.47597e+07	24644
75%	1.61287e+12	1.0003	1.0022	0.9997	1.0003	9.38658e+07	9.38618e+07	39083
max	1.63555e+12	1.0318	10.11	1.0238	1.0318	7.76166e+08	7.75546e+08	196022

Based on the data descriptions above, we can see that unix and date can be used for the datetime column, as both of the columns provide datetime information. Symbol column would not be useful in this context, among the price columns, we will be using the close price. As for the volume USDC, volume USDT, and tradecount, they will definitely benefit with other models, but we can explore how they look when we were to plot them

usdcusdt_explore_columns = ['close', 'Volume USDC', 'Volume USDT', 'tradecount']
usdcusdt_data['date'] = pd.to_datetime(usdcusdt_data['date'])

for column in usdcusdt_explore_columns:
    plt.plot(usdcusdt_data['date'], usdcusdt_data[column])
    plt.title(column)
    plt.show()

With them visualized, we can easily pick them to perform any form of analysis when required.

We can observe that the 'close' price column that at the start, the data has very high variance but exponentially decrease in its volatility, but the mean of the data might be the same overall.

For the volume, both had very low amounts and then increased. It can be seen that the entire portion of the graph shifted upwards with the mean.

Tradecount column data seem to have some seasonality to it, with some form of repeating waves.

Cinema Ticket Dataset

While this dataset consists of time component and it is time series, it is not directly apparent. This is because the dataset not only includes a datetime information, but also status of the ticket sale of the specific cinema. Therefore, we will need to first do some pre-processing.

cinemaTicket_data = pd.read_csv('Datasets/cinemaTicket_Ref.csv')

cinemaTicket_data.head()

Unnamed: 0	film_code	cinema_code	total_sales	tickets_sold	show_time	occu_perc	ticket_price	ticket_use	capacity	date	month	quarter	day
0	1492	304	3900000	26	4	4.26	150000	26	610.329	2018-05-05	5	2	5
1	1492	352	3360000	42	5	8.08	80000	42	519.802	2018-05-05	5	2	5
2	1492	489	2560000	32	4	20	80000	32	160	2018-05-05	5	2	5
3	1492	429	1200000	12	1	11.01	100000	12	108.992	2018-05-05	5	2	5
4	1492	524	1200000	15	3	16.67	80000	15	89.982	2018-05-05	5	2	5

Let's say for this project, we would like t be able to predict when is the best time to increase teh discount, so that it can encourage ticket sales during off peak hours. We will first need to explore which columns to be using that can best represent our analysis.

cinemaTicket_data.columns

Index(['film_code', 'cinema_code', 'total_sales', 'tickets_sold',
       'tickets_out', 'show_time', 'occu_perc', 'ticket_price', 'ticket_use',
       'capacity', 'date', 'month', 'quarter', 'day'],
      dtype='object')

cinemaTicket_data.describe()

Unnamed: 0	film_code	cinema_code	total_sales	tickets_sold	tickets_out	show_time	occu_perc	ticket_price	ticket_use	capacity	month	quarter	day
count	142524	142524	142524	142524	142524	142524	142399	142524	142524	142399	142524	142524	142524
mean	1518.99	320.378	1.23473e+07	140.138	0.237413	3.9321	19.966	81234.6	139.9	854.724	6.77685	2.63472	16.1126
std	36.1844	159.701	3.06549e+07	279.759	2.92321	3.05628	22.6534	33236.6	279.565	953.118	2.19584	0.809692	8.94947
min	1471	32	20000	1	0	1	0	483.871	-219	-2	2	1	1
25%	1485	181	1.26e+06	18	0	2	3.75	60000	18	276.994	5	2	8
50%	1498	324	3.72e+06	50	0	3	10.35	79454.2	50	525.714	7	3	16
75%	1556	474	1.11e+07	143	0	5	28.21	100000	143	1038.96	9	3	24
max	1589	637	1.26282e+09	8499	311	60	147.5	700000	8499	9692.1	11	4	31

When inspecting the columns, the .head() and .describe() output, we noted that there are film code identified, as well as cinema_code. While these details might be useful for other analysis, we might just explore ticket_sold, to which should be good enough to indicate the performance of a given cinema. Because there also got a date column, we can use it as the datetime column. There are also other time related columns, like month, quarter, and day. However, similar justification that it might be useful for other analysis, but for this project will use only the date column and ticket sold.

Then, it is also further noted that because there is cinema codes, identifying the cinema, let's explore this aspect.

cinemaTicket_data['Datetime'] = pd.to_datetime(cinemaTicket_data['date'])
cinemaTicket_data

Unnamed: 0	film_code	cinema_code	total_sales	tickets_sold	tickets_out	show_time	occu_perc	ticket_price	ticket_use	capacity	date	month	quarter	day	Datetime
0	1492	304	3900000	26	0	4	4.26	150000.0	26	610.328638	2018-05-05	5	2	5	2018-05-05
1	1492	352	3360000	42	0	5	8.08	80000.0	42	519.801980	2018-05-05	5	2	5	2018-05-05
2	1492	489	2560000	32	0	4	20.00	80000.0	32	160.000000	2018-05-05	5	2	5	2018-05-05
3	1492	429	1200000	12	0	1	11.01	100000.0	12	108.991826	2018-05-05	5	2	5	2018-05-05
4	1492	524	1200000	15	0	3	16.67	80000.0	15	89.982004	2018-05-05	5	2	5	2018-05-05
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
142519	1569	495	1320000	22	0	2	3.86	60000.0	22	569.948187	2018-11-04	11	4	4	2018-11-04
142520	1569	474	1200000	15	0	1	65.22	80000.0	15	22.999080	2018-11-04	11	4	4	2018-11-04
142521	1569	524	1060000	8	0	3	9.20	132500.0	8	86.956522	2018-11-04	11	4	4	2018-11-04
142522	1569	529	600000	5	0	2	5.00	120000.0	5	100.000000	2018-11-04	11	4	4	2018-11-04
142523	1569	486	250000	5	0	1	1.79	50000.0	5	279.329609	2018-11-04	11	4	4	2018-11-04

cinemaTicket_data['show_time'].plot()

cinemaTicket_data['show_time'].hist()

After checking the column showtime, we cannot conclude what does the column represent. At first glance, we might be able to use it to indicate the hours in datetime, and we might even be able to dig into finding out the hours that have lower sales. However, that would not be the case and we would look only into which day has the highest and lowest sale, and see if we are able to identify any trends overall.

# list of unique cinema
unique_cinemas = cinemaTicket_data['cinema_code'].unique()
unique_cinemas

array([304, 352, 489, 429, 524,  71, 163, 450,  51, 522,  43, 529,  82,
       344,  73, 485, 518, 448, 431,  72, 144, 456, 238, 312, 168, 254,
       214, 474, 445, 300, 362, 324, 452, 291, 479, 210, 428, 277,  56,
       253, 243,  39, 310, 201, 457, 191, 532, 167, 266, 204, 380, 169,
       162, 506, 537, 513, 467, 509, 165, 262, 486, 198, 508, 222,  98,
       230, 156, 181, 141, 528,  94, 350, 442, 556, 475, 142,  35,  89,
        34, 225, 182, 396,  50,  61, 487,  88, 338, 417, 194,  57, 285,
       187, 159, 184,  81, 207, 339, 326, 531, 505, 492, 299, 507, 316,
       333, 172, 526, 414, 115, 468, 490, 441, 430, 472, 511, 480, 470,
       496, 466, 381, 368, 498, 195, 546, 516, 425, 488, 535, 196, 453,
       321, 152, 390, 166, 247, 454, 464, 499, 460, 251, 481, 315, 307,
       120, 250, 533, 221, 248, 313, 164,  70, 180, 160, 495, 314, 415,
       174, 259, 471, 245,  83,  91, 365, 359, 286,  64, 426, 237, 536,
       397, 476, 503, 491, 517,  55, 170, 175,  62, 539, 541, 540, 484,
       514, 548, 432, 501, 447, 186, 477, 331, 515,  48,  33, 185, 155,
       455, 461, 534, 273, 504, 249,  52, 482, 323, 451, 341, 306, 413,
       292, 463, 223, 497, 525, 427, 231, 519, 183,  32, 553, 271, 276,
       154, 521, 502, 402, 158, 473, 465, 561, 328, 558, 562, 424, 512,
       560, 458, 637, 520, 557, 555, 543, 542, 565, 215, 376, 566],
      dtype=int64)

plt.rcParams['figure.figsize'] = [20, 12]
for cinema in unique_cinemas:

    temp = cinemaTicket_data[cinemaTicket_data['cinema_code'] == cinema]
    temp = temp[['Datetime','tickets_sold']]
    temp = temp.groupby(['Datetime']).sum()

    plt.plot(temp.index,temp['tickets_sold'])

# the legend list will be too long
# plt.legend(unique_cinemas)

By getting the total sale at the given day for each cinema, and then plotting it, does not seem like a good idea, especially since because they have varying number of ticket sale.

However, we are able to note that there some form of repeatable patterns, trends or seasonality in the data. Let's use only 10 cinemas and plot it for visualization. Furthermore, later we shall look into take the average sales on a given day, and use that as the dataset moving forward for the cinema tickets dataset.

plt.rcParams['figure.figsize'] = [20, 12]

for cinema in unique_cinemas[:10]:

    temp = cinemaTicket_data[cinemaTicket_data['cinema_code'] == cinema]
    temp = temp[['Datetime','tickets_sold']]
    temp = temp.groupby(['Datetime']).sum()

    plt.plot(temp.index,temp['tickets_sold'],label=cinema)
    plt.legend()

Using only 10 cinemas, the patterns are now more obvious.

ticketSales_data = cinemaTicket_data[['Datetime', 'tickets_sold']]
ticketSales_data = ticketSales_data.groupby('Datetime').mean()
ticketSales_data

('Unnamed: 0_level_0', 'Datetime')	('tickets_sold', 'Unnamed: 1_level_1')
2018-02-21	267.000000
2018-02-23	3.000000
2018-03-14	125.650000
2018-03-15	178.675325
2018-03-16	175.461017
...	...
2018-10-31	70.583704
2018-11-01	113.653521
2018-11-02	126.824561
2018-11-03	75.431177
2018-11-04	49.894737

ticketSales_data.plot()

It seems that the first two entries are not properly maintained, or it could be outliers. Therefore, we would be dropping the first two entries.

ticketSales_data = ticketSales_data[2:]
ticketSales_data.plot()

Stackoverflow Dataset

This dataset is a little more direct, where each columns signifies the topic that is being asked. For this project, we can look into the questions asked on the site overtime, from 2009 to 2019

mlStackoverflow_data = pd.read_csv('Datasets/MLTollsStackOverflow.csv')

mlStackoverflow_data.head()

Unnamed: 0	month	nltk	python	r	numpy	scipy	matlab	machine-learning	...	Sympy	pyqt	OCR
0	09-Jan	0	631	8	6	2	19	8	...	1	5	5
1	09-Feb	1	633	9	7	3	27	4	...	0	5	11
2	09-Mar	0	766	4	4	2	24	3	...	0	7	2
3	09-Apr	0	768	12	6	3	32	10	...	0	11	5
4	09-May	1	1003	2	7	2	42	7	...	0	10	3

mlStackoverflow_data.columns

Index(['month', 'nltk', 'spacy', 'stanford-nlp', 'python', 'r', 'numpy',
       'scipy', 'matlab', 'machine-learning', 'pandas', 'pytorch', 'keras',
       'nlp', 'apache-spark', 'hadoop', 'pyspark', 'python-3.x', 'tensorflow',
       'deep-learning', 'neural-network', 'lstm', 'time-series', 'pillow',
       'rasa', 'opencv', 'pipenv', 'seaborn', 'Dask', 'jupyter', 'AllenNLP',
       'Theano', 'plotly', 'scikit-learn', 'BeautifulSoup', 'scrapy', 'Gensim',
       'FastText', 'Pydot', 'Pybrain', 'Pytil', 'Pygame', 'Colab', 'Shogun',
       'KNIME', 'Apache', 'Gunicorn', 'Pygtk', 'Weka', 'Conda', 'Ray',
       'matlab.1', 'accord.net', 'regression', 'classification', 'correlation',
       'cluster-analysis', 'H2o', 'Mallet', 'Numba', 'Tableau', 'Trifacta',
       'PyArrow', 'Rasterio', 'Orange3', 'PyMC3', 'Opennn', 'Oryx', 'Istio',
       'Venes', 'Plotnine', 'Gluon', 'Plato', 'Sympy', 'Flair',
       'stanford-nlp.1', 'pyqt', 'Nolearn', 'Lasagne', 'OCR',
       'Apache-spark-mlib', 'azure-virtual-machine'],
      dtype='object')

mlStackoverflow_data['month']

0      09-Jan
1      09-Feb
2      09-Mar
3      09-Apr
4      09-May
        ...
127    19-Aug
128    19-Sep
129    19-Oct
130    19-Nov
131    19-Dec
Name: month, Length: 132, dtype: object

Given that the topics of machine learning should have increase over the 10 years, especially with the recent explosion of interest for generative AI. We will see some relevant topics over the years.

mlStackoverflow_data['Datetime'] = pd.to_datetime(mlStackoverflow_data['month'], format='%y-%b')
mlStackoverflow_data.head()

Unnamed: 0	month	nltk	python	r	numpy	scipy	matlab	machine-learning	...	Sympy	pyqt	OCR	Datetime
0	09-Jan	0	631	8	6	2	19	8	...	1	5	5	2009-01-01
1	09-Feb	1	633	9	7	3	27	4	...	0	5	11	2009-02-01
2	09-Mar	0	766	4	4	2	24	3	...	0	7	2	2009-03-01
3	09-Apr	0	768	12	6	3	32	10	...	0	11	5	2009-04-01
4	09-May	1	1003	2	7	2	42	7	...	0	10	3	2009-05-01

relevant_topics = ['python', 'machine-learning', 'deep-learning', 'time-series', 'regression', 'pytorch', 'tensorflow', 'classification']

plt.rcParams['figure.figsize'] = [7,5]

for topic in relevant_topics:
    plt.plot(mlStackoverflow_data['Datetime'], mlStackoverflow_data[topic])
    plt.scatter(mlStackoverflow_data['Datetime'], mlStackoverflow_data[topic], s=10)
    plt.title(topic)
    plt.show()

All of them, shows different properties in the graph. For example, while pytorch and tensorflow had beginnings around 2016, but they seem to increase a lot when observing the slope.

Multiplicative or Additive - Visual Identification

We will look into some dataset to see if there are multiplicative or additive. We will use the gold dataset, python and deep learning topic in the stackoverflow dataset and the cinema dataset to see if they are additive or multiplicative.

Based on the textbook that this course is based on, the difference between additive and multiplicative datsets are that one is the sum of seasonal component, trend component and residual component, and one is the product of all three. As shown in the equations below:

Additive:
$y_{t} = S_{t} + T_{t} + R_{t}$
Multiplicative:
$y_{t} = S_{t} \times T_{t} \times R_{t}$

Hence, to truly find out, we can decompose the time series into the components mentioned, seasonality, trend and residual.

However, based on the courses' lectures, it is also mentioned that we can identify based on the dispersion and shape of the differenced dataset, as shown below.

Gold Dataset - Close Price

gold_data['Close'].plot()

gold_data['Diff'] = gold_data['Close'] - gold_data['Close'].shift()
gold_data.plot.scatter(x = 'Close', y = 'Diff')

As observed, regardless or what value x, the y stays very consistent. Therefore, the changes between the price are not steep or drastic. Which represent an additive chart.

Stakeoverflow Dataset - Python

While the chart shows that it is linearly increase over time, when we check for additivity or multiplicity, turns out this is multiplicative, as shown in the scatter plot below. Where there are instance of x with very little change, while on the other end, the x values have higher change.

mlStackoverflow_data['python'].plot()

mlStackoverflow_data['python-diff'] = mlStackoverflow_data['python'] - mlStackoverflow_data['python'].shift()
mlStackoverflow_data.plot.scatter(x='python', y = 'python-diff')

This is a multiplicative dataset because there is an obvious dispersion of y values, as x increases.

mlStackoverflow_data['logged_python'] = np.log(mlStackoverflow_data['python'])
mlStackoverflow_data['logged_python'].plot()

Even after logging, the chart still looks like it might be multiplicative.

mlStackoverflow_data['logged_python_diff'] = mlStackoverflow_data['logged_python'] - mlStackoverflow_data['logged_python'].shift()
mlStackoverflow_data.plot.scatter(x='logged_python', y='logged_python_diff')

However, based on the result, the chart now looks more additive than it is multiplicative as the y values are now more consistent with the increase in x.

Stackoverflow Dataset - Machine Learning

Machine learning dataset was chosen because of its higher increase at the end of the chart. This shows that it is highly likely to be multiplicative.

mlStackoverflow_data['machine-learning'].plot()

mlStackoverflow_data['machine-learning-diff'] = mlStackoverflow_data['machine-learning'] - mlStackoverflow_data['machine-learning'].shift()
mlStackoverflow_data.plot.scatter(x='machine-learning', y='machine-learning-diff')

As guessed, it is a multiplicative chart.

mlStackoverflow_data['logged_ml'] = np.log(mlStackoverflow_data['machine-learning'])
mlStackoverflow_data['logged_ml'].plot()

Now we shall explore the logged chart. Now the charts looks more linear, which leads more to additive.

mlStackoverflow_data['logged_ml_diff'] = mlStackoverflow_data['logged_ml'] - mlStackoverflow_data['logged_ml'].shift()
mlStackoverflow_data.plot.scatter(x='logged_ml', y='logged_ml_diff')

As shown, while there are still areas of higher dispersed y values, it is better than it was without logging.

Cinema Ticket Dataset - Tickets Sold

This dataset was chosen because out of curiously, a strong seasonal dataset, would be additive or multiplicative.

ticketSales_data['tickets_sold'].plot()

ticketSales_data['tickets_sold_diff'] = ticketSales_data['tickets_sold'] - ticketSales_data['tickets_sold'].shift()
ticketSales_data.plot.scatter(x='tickets_sold', y ='tickets_sold_diff')

Based on the scatter plot, it seems like it is multiplicative, let's try logging it.

ticketSales_data['logged_sales'] = np.log(ticketSales_data['tickets_sold'])
ticketSales_data['logged_sales'].plot()

ticketSales_data['logged_sales_diff'] = ticketSales_data['logged_sales'] - ticketSales_data['logged_sales'].shift()
ticketSales_data.plot.scatter(x='logged_sales', y ='logged_sales_diff')

Based on the result, it looks like the multiplicity of has decreased and looks more additive with the values dispersed more consistently has compared to the one before.

Stationarity

See if any of data are stationary?

Visually, upon first glance, we can see that all of the current dataset, is not stationary, as most of them have some form trend, seasonality, varying means and variance. Therefore, let's implement ways to make them stationary and ensure that they are stationary. The dataset that we will be using is gold dataset and machine learning.

gold_data['Close'].plot()

mlStackoverflow_data['machine-learning'].plot()

Random walk model or Simple Differencing

For many cases, just by finding the difference between the current period and previous period, should be enough to make the time-series data to be stationary.

gold_data['Close_Diffrenced'] = gold_data['Close'] - gold_data['Close'].shift()
gold_data['Close_Diffrenced'].plot()

mlStackoverflow_data['machine-learning_differenced'] = mlStackoverflow_data['machine-learning'] - mlStackoverflow_data['machine-learning'].shift()
mlStackoverflow_data['machine-learning_differenced'].plot()

While we can visually see that both the graph has no more trend and has become more stationary than before, there are still some things to note. For the gold data, it is basically the ideal stationary data outcome, with some areas of outliers, but mostly, it is have a very steady mean and variance. However, even though we have use this method, the stackoverflow questions dataset still as an increase of variance overtime. Let's use some test to see if it is stationary or not. If not, we shall apply more differencing.

KPSS and ADF Test

We will be using the KPSS and ADF test to see if both of the above are stationary data.

from statsmodels.tsa.stattools import kpss

# Referred the below for the function and the use of kpss
# https://www.machinelearningplus.com/time-series/kpss-test-for-stationarity/

def kpss_test(series):
    statistic, p_value, n_lags, critical_values = kpss(series)
    print('KPSS Stat: ', statistic)
    print('p-value: ', p_value)
    print('Number of lags:', n_lags)
    print('Critial Values:')

    for key, value in critical_values.items():
        print(f'   {key} : {value}')
    print(f'Result: The series is {"not " if p_value < 0.05 else ""}stationary')

print('Gold Data-------------------')
print('=====Before Differencing:')
kpss_test(gold_data['Close'])
print()
print('=====After Differencing:')
kpss_test(gold_data['Close_Diffrenced'][1:])

Gold Data-------------------
=====Before Differencing:
KPSS Stat:  133.7235404045421
p-value:  0.01
Number of lags: 518
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is not stationary

=====After Differencing:
KPSS Stat:  0.06604071713292024
p-value:  0.1
Number of lags: 44
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary

print('StackOverFlow ML Data-------------------')
print('=====Before Differencing:')
kpss_test(mlStackoverflow_data['machine-learning'])
print()
print('=====After Differencing:')
kpss_test(mlStackoverflow_data['machine-learning_differenced'][1:])

StackOverFlow ML Data-------------------
=====Before Differencing:
KPSS Stat:  1.8470519859412013
p-value:  0.01
Number of lags: 6
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is not stationary

=====After Differencing:
KPSS Stat:  0.29869005236660856
p-value:  0.1
Number of lags: 10
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary

According to the KPSS test, even after the simple differencing for the the questions in stackoverflow, is enough to be stationary. Let's see for the adfuller test.

For the adfuller test, the closer the p-value it is to 0, the higher the likelihood for us to assume that there is no unit root in the time-series and that it is stationary. Assuming we were to use p-values of 0.05.

from statsmodels.tsa.stattools import adfuller

def adftest(series):
    res = adfuller(series)
    print('AdfTest Stat: ', res[0])
    print('p-value: ', res[1])
    print('Number of lags:', res[2])
    print('Number of observation:', res[3])
    print('Critial Values:')

    for key, value in res[4].items():
        print(f'   {key} : {value}')

    if res[1] < 0.05:
        print('The series is stationary')
    else:
        print('The series is not stationary')

adftest(mlStackoverflow_data['machine-learning'])

AdfTest Stat:  0.6666371271225812
p-value:  0.9891479891628309
Number of lags: 12
Number of observation: 119
Critial Values:
   1% : -3.4865346059036564
   5% : -2.8861509858476264
   10% : -2.579896092790057
The series is not stationary

adftest(mlStackoverflow_data['machine-learning_differenced'][1:])

AdfTest Stat:  -2.054928456415778
p-value:  0.2630197887870384
Number of lags: 11
Number of observation: 119
Critial Values:
   1% : -3.4865346059036564
   5% : -2.8861509858476264
   10% : -2.579896092790057
The series is not stationary

Using the adfuller test, the simple differencing was not enough. While the p-values have dropped from 0.98 to 0.26, we can still try to make the dataset more stationary.

Second Order Differencing

firstDifference = mlStackoverflow_data['machine-learning'] - mlStackoverflow_data['machine-learning'].shift()

firstDifference

0        NaN
1       -4.0
2       -1.0
3        7.0
4       -3.0
       ...
127    -37.0
128    -45.0
129    212.0
130    168.0
131     65.0
Name: machine-learning, Length: 132, dtype: float64

secondDifference = mlStackoverflow_data['machine-learning'].shift() - mlStackoverflow_data['machine-learning'].shift(2)
secondDifference

0        NaN
1        NaN
2       -4.0
3       -1.0
4        7.0
       ...
127     80.0
128    -37.0
129    -45.0
130    212.0
131    168.0
Name: machine-learning, Length: 132, dtype: float64

secondOrderDifference = firstDifference - secondDifference
secondOrderDifference

0        NaN
1        NaN
2        3.0
3        8.0
4      -10.0
       ...
127   -117.0
128     -8.0
129    257.0
130    -44.0
131   -103.0
Name: machine-learning, Length: 132, dtype: float64

plt.plot(firstDifference)
plt.plot(secondOrderDifference)
plt.show()

adftest(secondOrderDifference[2:])

AdfTest Stat:  -6.611299198025336
p-value:  6.367085815585583e-09
Number of lags: 10
Number of observation: 119
Critial Values:
   1% : -3.4865346059036564
   5% : -2.8861509858476264
   10% : -2.579896092790057
The series is stationary

We can see that visually, the difference between the simple difference and second order differencing, the second order differencing looked like the variance increased as we do see higher highs and lower low in the graph as compared to the simple differencing. The orange graph is the second order differencing. However, based on the adfuller test, it is stationary.

Seasonal Differencing

Seasonal differencing is based on the formula below, which is subtract by the current value with a value in the past $m$ observations, called lag. The higher the lag, the further in the past you subtract. This is one way to remove seasonality.

$y = y_{t} - y_{m}$

However, we will need to loop through some values to find this. Another method of seasonal differencing is to understand a season timeframe, then get the mean of each season, and apply subtraction according to respective season's mean.

lags = [3,5,10,20,50]

plt.rcParams['figure.figsize'] = [15,7]

for lag in lags:
    temp = mlStackoverflow_data['machine-learning'] - mlStackoverflow_data['machine-learning'].shift(lag)
    plt.plot(temp, label=str(lag) + " lag")

    print('Test for lag ' + str(lag) + "  =====================================")
    kpss_test(temp[lag:])
    print()
    adftest(temp[lag:])
    print()

plt.legend()
plt.show()

Code Output

Test for lag 3  =====================================
KPSS Stat:  0.2898885946621235
p-value:  0.1
Number of lags: 4
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary

AdfTest Stat:  -2.25147303187454
p-value:  0.18809655506315798
Number of lags: 13
Number of observation: 115
Critial Values:
   1% : -3.4885349695076844
   5% : -2.887019521656941
   10% : -2.5803597920604915
The series is not stationary

Test for lag 5  =====================================
KPSS Stat:  0.27908581626858614
p-value:  0.1
Number of lags: 4
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary

AdfTest Stat:  -2.1942366663365926
p-value:  0.20834386047176706
Number of lags: 12
Number of observation: 114
Critial Values:
   1% : -3.489057523907491
   5% : -2.887246327182993
   10% : -2.5804808802708528
The series is not stationary

Test for lag 10  =====================================
KPSS Stat:  0.35328244736991793
p-value:  0.09729204854744916
Number of lags: 5
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary

AdfTest Stat:  -2.4158493770328726
p-value:  0.13731291632025688
Number of lags: 12
Number of observation: 109
Critial Values:
   1% : -3.49181775886872
   5% : -2.8884437992971588
   10% : -2.5811201893779985
The series is not stationary

Test for lag 20  =====================================
KPSS Stat:  0.5230701441606662
p-value:  0.0364706882521022
Number of lags: 5
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is not stationary

AdfTest Stat:  -2.157009917873902
p-value:  0.2222165170167978
Number of lags: 12
Number of observation: 99
Critial Values:
   1% : -3.498198082189098
   5% : -2.891208211860468
   10% : -2.5825959973472097
The series is not stationary

Test for lag 50  =====================================
KPSS Stat:  1.3102598299838486
p-value:  0.01
Number of lags: 5
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is not stationary

AdfTest Stat:  -0.5612811021960334
p-value:  0.8794931265993677
Number of lags: 12
Number of observation: 69
Critial Values:
   1% : -3.528889992207215
   5% : -2.9044395987933362
   10% : -2.589655654274312
The series is not stationary

Surprisingly, the further you lag, does not mean that the data will become more stationary. As shown with the kpss and adfuller test, we can see that even using difference of lag 3, adfuller test already did not consider the time series as stationary. The KPSS result shows that from lag 20 onwards, the data is no longer stationary. We can also observed this because visually as the red and purple graph slowly becoming a upward trending graph.

Log and log differences

Let's see how does logging a time-series or getting the log differences, can effect the stationarity of a time-series. We will continue using the stackoverflow dataset on machine learning.

loggedMLtopic = np.log(mlStackoverflow_data['machine-learning'])
loggedMLtopic.plot()

It has made it into a trending time series, so now we can try apply simple differencing

differencedLoggedMLtopic = loggedMLtopic - loggedMLtopic.shift()
differencedLoggedMLtopic.plot()

kpss_test(differencedLoggedMLtopic[1:])

KPSS Stat:  0.28055757025438893
p-value:  0.1
Number of lags: 55
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary

adftest(differencedLoggedMLtopic[1:])

AdfTest Stat:  -13.691197822513615
p-value:  1.342938733545792e-25
Number of lags: 1
Number of observation: 129
Critial Values:
   1% : -3.482087964046026
   5% : -2.8842185101614626
   10% : -2.578864381347275
The series is stationary

Based on the both the test and visually, we have able to make the machine learning topic time series to be stationary.

Seasonality

Seasonality are data that has some form of repeating pattern and precipitability at a certain time frame.

Based on the current set of data we have, we can clearly see that the cinema dataset has repeatable pattern, very noticeable peak and trough. Therefore, we will look into this dataset for the seasonality adjustments.

Cinema Tickets

ticketSales_data.index.weekday

Int64Index([2, 3, 4, 5, 6, 0, 1, 2, 3, 4,
            ...
            4, 5, 6, 0, 1, 2, 3, 4, 5, 6],
           dtype='int64', name='Datetime', length=232)

ticketSales_data.index.weekday.unique()

Int64Index([2, 3, 4, 5, 6, 0, 1], dtype='int64', name='Datetime')

ticketSales_data = ticketSales_data.assign(day=ticketSales_data.index.weekday)
ticketSales_data

('Unnamed: 0_level_0', 'Datetime')	('tickets_sold', 'Unnamed: 1_level_1')	('day', 'Unnamed: 2_level_1')
2018-03-14	125.650000	2
2018-03-15	178.675325	3
2018-03-16	175.461017	4
2018-03-17	114.192192	5
2018-03-18	94.831956	6
...	...	...
2018-10-31	70.583704	2
2018-11-01	113.653521	3
2018-11-02	126.824561	4
2018-11-03	75.431177	5
2018-11-04	49.894737	6

# Because the timeframe in the dataset is not a full calendar year.
# Get the list of months that are involved
days = ticketSales_data.index.weekday.unique()

# Initializing season count and sum
seasonSum = [0] * len(days)
seasonCount = [0] * len(days)

for rowCount in range(len(ticketSales_data)):

    currentValue, currentDay = ticketSales_data.iloc[rowCount,[0,1]]

    seasonSum[int(currentDay)] += currentValue
    seasonCount[int(currentDay)] += 1

print(seasonCount)
print(seasonSum)

[32, 33, 32, 33, 34, 34, 34]
[2431.9228952621716, 8661.859799267784, 3712.1669864735404, 4930.762704784601, 5691.3748326994255, 3488.7014251378805, 2745.6502726189774]

# Season Mean
seasonMean = [0] * len(days)

for count in range(len(seasonCount)):
    seasonMean[count] = seasonSum[count] / seasonCount[count]

seasonMean

[75.99759047694286,
 262.48059997781166,
 116.00521832729814,
 149.4170516601394,
 167.39337743233605,
 102.60886544523179,
 80.7544197829111]

dataMean = np.mean(ticketSales_data['tickets_sold'].values)
dataMean

136.47602981139818

seasonOffsets = [x - dataMean for x in seasonMean]
seasonOffsets

[-60.47843933445532,
 126.00457016641349,
 -20.47081148410004,
 12.94102184874123,
 30.917347620937875,
 -33.86716436616639,
 -55.72161002848708]

adjustedTicketsSold = []
for count in range(len(ticketSales_data)):

    currentValue, currentDay = ticketSales_data.iloc[count,[0,1]]
    adjustedTicketsSold.append(currentValue - seasonOffsets[int(currentDay)])

adjustedTicketsSold

[146.12081148410005,
 165.73430282658344,
 144.54366932821466,
 148.0593565583586,
 150.55356595135208,
 ...
 91.05451518780374,
 100.71249927801934,
 95.90721378257089,
 109.29834181226921,
 105.61634687059234]

ticketSales_data = ticketSales_data.assign(adjustedTicketSale = adjustedTicketsSold)
plt.rcParams['figure.figsize'] = [12, 5]
ticketSales_data.iloc[:, [0,2]].plot()

As we can observe, the orange plot is the adjusted ticket sales. The adjustment works because we can see that the peaks are now lower, while the trough are now higher generally. However, we are still able to note that there are still peaks and troughs even after the adjustment.

secondAdjustmentCount = [0] * len(days)
secondAdjustmentSum = [0] * len(days)
secondAdjustmentMean = [0] * len(days)
secondAdjustmentValue = []

for count in range(len(ticketSales_data)):
    currentDay, currentValue = ticketSales_data.iloc[count, [1,2]]
    secondAdjustmentCount[int(currentDay)] += 1
    secondAdjustmentSum[int(currentDay)] += currentValue

print("Season Count:", secondAdjustmentCount)
print("Season Sum:", secondAdjustmentSum)

for count in range(len(secondAdjustmentSum)):
    secondAdjustmentMean[count] = secondAdjustmentSum[count] / secondAdjustmentCount[count]
print("Season Mean:", secondAdjustmentMean)

secondMean = np.mean(ticketSales_data['adjustedTicketSale'].values)
print("Adjusted Ticket Sale Mean:", secondMean)

secondAdjustmentOffsets = [(x - secondMean) for x in secondAdjustmentMean]
print(secondAdjustmentOffsets)

secondAdjustedTickets = []
for count in range(len(ticketSales_data)):
    currentValue, currentDay = ticketSales_data.iloc[count, [2,1]]
    secondAdjustedTickets.append(currentValue - secondAdjustmentOffsets[int(currentDay)])

ticketSales_data = ticketSales_data.assign(second_adjusted_ticket_sales = secondAdjustedTickets)
plt.rcParams['figure.figsize'] = [12, 5]
ticketSales_data.iloc[:, [2,3]].plot()

It is surprising to see that after the first adjustment, we are no longer able to remove more variances, lower the peaks and heighten the troughs, as the offsets list and the mean of the adjusted tickets sold column, are so close to each other.

weeks = ticketSales_data.index.week.unique()
weeks

Int64Index([11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
            28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
            44],
           dtype='int64', name='Datetime')

ticketSales_data = ticketSales_data.assign(week=ticketSales_data.index.isocalendar().week)
ticketSales_data

('Unnamed: 0_level_0', 'Datetime')	('tickets_sold', 'Unnamed: 1_level_1')	('day', 'Unnamed: 2_level_1')	('adjustedTicketSale', 'Unnamed: 3_level_1')	('second_adjusted_ticket_sales', 'Unnamed: 4_level_1')	('week', 'Unnamed: 5_level_1')
2018-03-14	125.650000	2	146.120811	146.120811	11
2018-03-15	178.675325	3	165.734303	165.734303	11
2018-03-16	175.461017	4	144.543669	144.543669	11
2018-03-17	114.192192	5	148.059357	148.059357	11
2018-03-18	94.831956	6	150.553566	150.553566	11
...	...	...	...	...	...
2018-10-31	70.583704	2	91.054515	91.054515	44
2018-11-01	113.653521	3	100.712499	100.712499	44
2018-11-02	126.824561	4	95.907214	95.907214	44
2018-11-03	75.431177	5	109.298342	109.298342	44
2018-11-04	49.894737	6	105.616347	105.616347	44

weekAdjustmentCount = [0] * len(weeks)
weekAdjustmentSum = [0] * len(weeks)
weekAdjustmentMean = [0] * len(weeks)
weekAdjustmentValue = []

for count in range(len(ticketSales_data)):
    currentWeek, currentValue = ticketSales_data.iloc[count, [4,0]]
    weekAdjustmentCount[int(currentWeek)-11] += 1
    weekAdjustmentSum[int(currentWeek)-11] += currentValue

print("Season Count:", weekAdjustmentCount)
print("Season Sum:", weekAdjustmentSum)

for count in range(len(weekAdjustmentSum)):
    weekAdjustmentMean[count] = weekAdjustmentSum[count] / weekAdjustmentCount[count]
print("Season Mean:", weekAdjustmentMean)

weekMean = np.mean(ticketSales_data['adjustedTicketSale'].values)
print("Adjusted Ticket Sale Mean:", secondMean)

weekAdjustmentOffsets = [(x - weekMean) for x in weekAdjustmentMean]
print(weekAdjustmentOffsets)

weekAdjustedTickets = []
for count in range(len(ticketSales_data)):
    currentWeek, currentValue = ticketSales_data.iloc[count, [4,0]]
    weekAdjustedTickets.append(currentValue - weekAdjustmentOffsets[int(currentWeek)-11])

ticketSales_data = ticketSales_data.assign(week_adjusted_ticket_sales = weekAdjustedTickets)
plt.rcParams['figure.figsize'] = [12, 5]
ticketSales_data[['tickets_sold','week_adjusted_ticket_sales']].plot()

plt.rcParams['figure.figsize'] = [12, 5]

ticketSales_data[['tickets_sold','week_adjusted_ticket_sales','adjustedTicketSale']].plot()

As it can be seen, the seasonality was more removed when we are using the day of the week trying adjust according to the mean of each day, rather than the mean of each week. The week adjusted ticket sales seem to more stationary, but the peaks and trough, while decreased, they are still very prominent. Hence, seasonality is still there.

USDCUSDT Tradecount

We can also explore on the USDCUSDT tradecount column which seem to look like that there are seasonality. We will applied the same steps as above.

for no, column in enumerate(usdcusdt_data.columns):
    print(str(no) + " " + column)

0 unix
1 date
2 symbol
3 open
4 high
5 low
6 close
7 Volume USDC
8 Volume USDT
9 tradecount

usdcusdt_data.iloc[2,1].weekday()

usdcusdtDayAdjustmentCount = [0] * 31
usdcusdtDayWeekAdjustmentCount = [0] * 7
usdusdtMonthAdjustmentCount = [0] * 12
usdcusdtDayAdjustmentSum = [0] * 31
usdcusdtDayWeekAdjustmentSum = [0] * 7
usdcusdtMonthAdjustmentSum = [0] * 12
usdcusdtDayAdjustmentMean = [0] * 31
usdcusdtDayWeekAdjustmentMean = [0] * 7
usdcusdtMonthAdjustmentMean = [0] * 12
usdcusdtCloseMean = usdcusdt_data['tradecount'].mean()

for rowCount in range(len(usdcusdt_data)):

    usdcusdtDayAdjustmentCount[usdcusdt_data.iloc[rowCount, 1].day-1] += 1
    usdcusdtDayAdjustmentSum[usdcusdt_data.iloc[rowCount, 1].day-1] += usdcusdt_data.iloc[rowCount, 9]

    usdcusdtDayWeekAdjustmentCount[usdcusdt_data.iloc[rowCount, 1].weekday()] += 1
    usdcusdtDayWeekAdjustmentSum[usdcusdt_data.iloc[rowCount, 1].weekday()] += usdcusdt_data.iloc[rowCount, 9]

    usdusdtMonthAdjustmentCount[usdcusdt_data.iloc[rowCount, 1].month-1] += 1
    usdcusdtMonthAdjustmentSum[usdcusdt_data.iloc[rowCount, 1].month-1] += usdcusdt_data.iloc[rowCount, 9]

for count in range(len(usdcusdtDayAdjustmentSum)):
    usdcusdtDayAdjustmentMean[count] = usdcusdtDayAdjustmentSum[count] / usdcusdtDayAdjustmentCount[count]

for count in range(len(usdcusdtDayWeekAdjustmentSum)):
    usdcusdtDayWeekAdjustmentMean[count] = usdcusdtDayWeekAdjustmentSum[count] / usdcusdtDayWeekAdjustmentCount[count]

for count in range(len(usdcusdtMonthAdjustmentSum)):
    usdcusdtMonthAdjustmentMean[count] = usdcusdtMonthAdjustmentSum[count] / usdusdtMonthAdjustmentCount[count]

dayAdjustmentOffset = [(x - usdcusdtCloseMean) for x in usdcusdtDayAdjustmentMean]
dayWeekAdjustmentOffset = [(x - usdcusdtCloseMean) for x in usdcusdtDayWeekAdjustmentMean]
monthAdjustmentOffset = [(x - usdcusdtCloseMean) for x in usdcusdtMonthAdjustmentMean]

dayAdjustedClose = []
dayWeekAdjustedClose = []
monthAdjustedClose = []

for rowCount in range(len(usdcusdt_data)):
    currentValue, currentDatetime = usdcusdt_data.iloc[rowCount, [9,1]]

    dayAdjustedClose.append(currentValue - dayAdjustmentOffset[currentDatetime.day - 1])
    dayWeekAdjustedClose.append(currentValue - dayWeekAdjustmentOffset[currentDatetime.weekday()])
    monthAdjustedClose.append(currentValue - monthAdjustmentOffset[currentDatetime.month - 1])

usdcusdt_data = usdcusdt_data.assign(Adjusted_tradecount_day = dayAdjustedClose)
usdcusdt_data = usdcusdt_data.assign(Adjusted_tradecount_week = dayWeekAdjustedClose)
usdcusdt_data = usdcusdt_data.assign(Adjusted_tradecount_month = monthAdjustedClose)

adjusted_col = ['Adjusted_tradecount_day', 'Adjusted_tradecount_week', 'Adjusted_tradecount_month']
plt.rcParams['figure.figsize'] = [12, 5]

for column in adjusted_col:
    plt.plot(usdcusdt_data['date'], usdcusdt_data['tradecount'])
    plt.plot(usdcusdt_data['date'], usdcusdt_data[column], alpha=0.8)
    plt.title(column)
    plt.show()

As shown above, only when using the month timeframe, where we can see that the graph moved only a little, and the seasonality of the graph was not removed.

Looking for Seasonality using FFT

We will try to use FFT to identify the seasonality in some of the dataset.

pd.Series(np.fft.fft(ticketSalesData)).plot()

pd.Series(np.abs(np.fft.fft(ticketSalesData))).plot()

pd.Series(np.abs(np.fft.fft(ticketSalesData)[10:len(ticketSalesData)])).plot()

Clearly there are seasonality in this dataset

Gold

pd.Series(np.abs(np.fft.fft(goldData['Close'])[10:len(ticketSalesData)])).plot()

pd.Series(np.abs(np.fft.fft(goldData['Close'])[10:len(ticketSalesData)//3])).plot()

For gold, there seem to be no seasonality in the dataset.

Machine Learning Questions

pd.Series(np.abs(np.fft.fft(mlStackoverflow_data['machine-learning'])[0:len(ticketSalesData)])).plot()

pd.Series(np.abs(np.fft.fft(mlStackoverflow_data['machine-learning'])[100:])).plot()

Honestly cannot tell if the final stretch of the values is considered to indicating that this dataset is a seasonal component

USDCUSDT Tradecount

usdcusdt_data['tradecount']

0         277
1       43366
2       58314
3       67437
4       56204
        ...
1046     9876
1047     5308
1048     4360
1049     1185
1050     1054
Name: tradecount, Length: 1051, dtype: int64

pd.Series(np.abs(np.fft.fft(usdcusdt_data['tradecount']))).plot()

Clearly this is not a seasonal dataset.

Forecasting

Naive Forecasting

Taking the previous period as a prediction. Therefore, we only need to shift the column by one, as we have done for the differencing. Some example shown at the bottom.

def naiveForecasting(series):

    temp = pd.DataFrame(series.rename('observed'))
    temp.insert(1, 'predicted', temp['observed'].shift())

    return temp

mlTopicNaiveForecast = naiveForecasting(mlStackoverflow_data['machine-learning'])
mlTopicNaiveForecast.plot()

ticketSoldNaiveForecast = naiveForecasting(ticketSales_data['tickets_sold'])
ticketSoldNaiveForecast.plot()

goldPriceNaiveForecast = naiveForecasting(gold_data['Close'])
goldPriceNaiveForecast.plot()

Seasonal Forecasting

Define a season and then we predict that next season with the previous season's value. It is similar to naive forecasting above, but with more time gap in between

def seasonalForecasting(series, season):
    temp = pd.DataFrame(series.rename('observed'))
    temp.insert(1, 'predicted', temp['observed'].shift(season))

    return temp

Because each data point represents a month, our season could be quarter, half yearly or yearly.

mlTopicSeasonalYearlyForecast = seasonalForecasting(mlStackoverflow_data['machine-learning'], 12)
mlTopicSeasonalYearlyForecast.plot()

mlTopicSeasonalQuarterlyForecast = seasonalForecasting(mlStackoverflow_data['machine-learning'], 3)
mlTopicSeasonalQuarterlyForecast.plot()

ticketSoldWeeklyForecast = seasonalForecasting(ticketSales_data['tickets_sold'], 7)
ticketSoldWeeklyForecast.plot()

ticketSoldMonthlyForecast = seasonalForecasting(ticketSales_data['tickets_sold'], 30)
ticketSoldMonthlyForecast.plot()

goldPriceDailyForecast = seasonalForecasting(gold_data['Close'], 1440)
goldPriceDailyForecast.plot()

goldPriceWeeklyForecast = seasonalForecasting(gold_data['Close'], 10080)
goldPriceWeeklyForecast.plot()

Average Forecasting

Using the average of the previous n of observation to predict.

def averageForecasting(series, n):
    temp = pd.DataFrame(series.rename('observed'))
    temp.insert(1, 'predicted',
                1/n * (temp['observed'].cumsum().shift() - temp['observed'].cumsum().shift(n+1)))

    return temp

def averageForecastPlot(series, list):
    for n in list:
        temp = averageForecasting(series,n)
        plt.plot(temp['predicted'], label = 'n = ' + str(n))
    plt.plot(temp['observed'], label='observed', linewidth=2, alpha=0.3, linestyle='dashed')
    plt.legend()
    plt.show()

averageForecastPlot(mlStackoverflow_data['machine-learning'],[2,6,12,24])

averageForecastPlot(ticketSales_data['tickets_sold'],[2,7,21,60])

averageForecastPlot(gold_data['Close'],[1440,10080,43200])

Based on the result shown, this method seem to be acting as a smoothing method rather than predictor, especially when it is with a longer time period.

Average Differences Forecasting

Now we will explore using average of the difference between timeframe, and use it as a way to predict

def differenceAverage(series, p):
    temp = pd.DataFrame(series.rename('observed'))
    difference = temp['observed'] - temp['observed'].shift()

    prediction_of_differences = (1/p) * (difference.cumsum().shift(1) - difference.cumsum().shift(p+1))
    temp['predicted']= prediction_of_differences.shift(1) + temp['observed'].shift(1)

    return temp

def differenceAveragePlot(series, list):
    plt.plot(series, label='observed', linewidth=2, alpha=0.5, linestyle='dashed')
    for n in list:
        plt.plot(differenceAverage(series, n)['predicted'], label='predicted, n = ' + str(n))

    plt.legend()
    plt.show()

differenceAveragePlot(mlStackoverflow_data['machine-learning'], [2,6,12,24])

differenceAveragePlot(ticketSales_data['tickets_sold'], [2,6,12,24])

differenceAveragePlot(gold_data['Close'][:100000], [1440,10080,43200])

Naive Differences Forecasting

Make list the difference between the current period and previous period. Then, to predict the next value of the next period, take the difference from the current period and the previous period, and use it to predict the next value.

def naiveDifferenceForecasting(series):
    temp = pd.DataFrame(series.rename('observed'))
    differences = temp['observed'] - temp['observed'].shift()
    predictionsOfDifferences = differences.shift()
    temp['predicted'] = predictionsOfDifferences + temp['observed'].shift(1)

    return temp

naiveDifferenceForecasting(mlStackoverflow_data['machine-learning']).plot()

naiveForecasting(ticketSales_data['tickets_sold']).plot()

naiveDifferenceForecasting(gold_data['Close'])[:300].plot()

Ironic, that using a naive method, of use the previous change of value, to make a prediction, seem to be actually good. While based on the three dataset used, the prediction for the machine learning questions and gold prices, the prediction while it seems that it over-predicts the value, it does seem to at least predict the major movement same as the observed. But that might also be because it seems to be the case visually.

Evaluating the forecasts

Will now implement the ways we can use to evaluate the forecasts. We will be implementing MSE, RMSE and MAE, which are typically used for regression predictions.

Define Functions and Setting Table of Results

def mse(df):
    temp = df['observed'] - df['predicted']
    temp = temp**2
    temp = temp.dropna()

    return temp.sum() / temp.count()

def rmse(df):
    return (mse(df))**(1/2)

def mae(df):
    temp = abs(df['observed'] - df['predicted'])
    temp = temp.dropna()

    return temp.sum() / temp.count()

def evaluateErrors(df):
    return [mse(df), rmse(df), mae(df)]

tableOfResults = pd.DataFrame(columns = ['Data', 'Algorithm', 'MSE', 'RMSE', 'MAE'])
tableOfResults

# Function to fill the Table of Results
def appendingTOR(dataframe, dataset, algorithm, list):
    dataframe.loc[len(dataframe)+1] = {'Data': dataset,'Algorithm': algorithm, 'MSE':list[0],'RMSE':list[1], 'MAE':list[2]}

Naive Forecasting

# Re-setup some of the dataset above
sp500Data = sp500_data[['Datetime', 'Close']]
sp500Data = sp500Data.set_index('Datetime')

ftseData = ftse_data[['Datetime', 'Close']]
ftseData = ftseData.set_index('Datetime')

goldData = gold_data[['Datetime', 'Close']]
goldData = goldData.set_index('Datetime')

temp = usdcusdt_data[['date', 'close', 'tradecount']]
temp = temp.set_index(pd.to_datetime(temp['date']))
usdcusdtDataTradeCount = temp[['tradecount']]
usdcusdtDataClose = temp[['close']]

# Create Naive Forcast Predictions
sp500DataNaiveForecast = naiveForecasting(sp500Data['Close'])
ftseDataNaiveForecast = naiveForecasting(ftseData['Close'])
goldDataNaiveForecast = naiveForecasting(goldData['Close'])
usdcusdtDataCloseNaiveForecast = naiveForecasting(usdcusdtDataClose['close'])
usdcusdtTradeCountNaiveForecast = naiveForecasting(usdcusdtDataTradeCount['tradecount'])
pythonNaiveForecast = naiveForecasting(mlStackoverflow_data['python'])

# Appending to Result Table
naiveForecastingString = 'Naive Forecasting'

appendingTOR(tableOfResults, 'SP500', naiveForecastingString, evaluateErrors(sp500DataNaiveForecast))
appendingTOR(tableOfResults, 'FTSE', naiveForecastingString, evaluateErrors(ftseDataNaiveForecast))
appendingTOR(tableOfResults, 'Gold', naiveForecastingString, evaluateErrors(goldDataNaiveForecast))
appendingTOR(tableOfResults, 'USDCUSDT Close Price', naiveForecastingString, evaluateErrors(usdcusdtDataCloseNaiveForecast))
appendingTOR(tableOfResults, 'USDCUSDT Tradecount', naiveForecastingString, evaluateErrors(usdcusdtTradeCountNaiveForecast))
appendingTOR(tableOfResults, 'Python Questions', naiveForecastingString, evaluateErrors(pythonNaiveForecast))
appendingTOR(tableOfResults, 'Machine Learning Questions', naiveForecastingString, evaluateErrors(mlTopicNaiveForecast))
appendingTOR(tableOfResults, 'Ticket Sales', naiveForecastingString, evaluateErrors(ticketSoldNaiveForecast))

tableOfResults

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
1	SP500	Naive Forecasting	0.315878	0.56203	0.325863
2	FTSE	Naive Forecasting	5.69646	2.38673	1.29783
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
4	USDCUSDT Close Price	Naive Forecasting	4.01536e-06	0.002004	0.001034
5	USDCUSDT Tradecount	Naive Forecasting	1.56376e+08	12505	7519.96
6	Python Questions	Naive Forecasting	1.00863e+06	1004.31	680.298
7	Machine Learning Questions	Naive Forecasting	3053.46	55.2581	35.4885
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537

Seasonal Forecasting

# Create Seasonal Forecasting Predictions
sp500SeasonalForecast60 = seasonalForecasting(sp500Data['Close'], 60)
sp500SeasonalForecast1440 = seasonalForecasting(sp500Data['Close'], 1440)
ftseSeasonalForecast60 = seasonalForecasting(ftseData['Close'], 60)
ftseSeasonalForecast1440 = seasonalForecasting(ftseData['Close'], 1440)
goldSeasonalForecast60 = seasonalForecasting(goldData['Close'], 60)
goldSeasonalForecast1440 = seasonalForecasting(goldData['Close'], 1440)
usdcusdtCloseSeasonalForecast = seasonalForecasting(usdcusdtDataClose['close'], 7)
usdcusdtTCSeasonalForecast = seasonalForecasting(usdcusdtDataTradeCount['tradecount'], 7)
pythonSeasonalForecast = seasonalForecasting(mlStackoverflow_data['python'], 12)

# Appending to Result Table
seasonalForecastingString = 'Seasonal Forecasting'

appendingTOR(tableOfResults, 'SP500', seasonalForecastingString + ' - 60', evaluateErrors(sp500SeasonalForecast60))
appendingTOR(tableOfResults, 'SP500', seasonalForecastingString + ' - 1440', evaluateErrors(sp500SeasonalForecast1440))
appendingTOR(tableOfResults, 'FTSE', seasonalForecastingString + ' - 60', evaluateErrors(ftseSeasonalForecast60) )
appendingTOR(tableOfResults, 'FTSE', seasonalForecastingString + ' - 1440', evaluateErrors(ftseSeasonalForecast1440))
appendingTOR(tableOfResults, 'Gold', seasonalForecastingString + ' - 60', evaluateErrors(goldSeasonalForecast60))
appendingTOR(tableOfResults, 'Gold', seasonalForecastingString + ' - 1440', evaluateErrors(goldSeasonalForecast1440))
appendingTOR(tableOfResults, 'USDCUSDT Close Price', seasonalForecastingString, evaluateErrors(usdcusdtCloseSeasonalForecast))
appendingTOR(tableOfResults, 'USDCUSDT Tradecount', seasonalForecastingString, evaluateErrors(usdcusdtTCSeasonalForecast))
appendingTOR(tableOfResults, 'Python Questions', seasonalForecastingString, evaluateErrors(pythonSeasonalForecast))
appendingTOR(tableOfResults, 'Machine Learning Questions', seasonalForecastingString, evaluateErrors(mlTopicSeasonalYearlyForecast))
appendingTOR(tableOfResults, 'Ticket Sales', seasonalForecastingString, evaluateErrors(ticketSoldWeeklyForecast))

tableOfResults

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
1	SP500	Naive Forecasting	0.315878	0.56203	0.325863
2	FTSE	Naive Forecasting	5.69646	2.38673	1.29783
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
4	USDCUSDT Close Price	Naive Forecasting	4.01536e-06	0.002004	0.001034
5	USDCUSDT Tradecount	Naive Forecasting	1.56376e+08	12505	7519.96
6	Python Questions	Naive Forecasting	1.00863e+06	1004.31	680.298
7	Machine Learning Questions	Naive Forecasting	3053.46	55.2581	35.4885
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
9	SP500	Seasonal Forecasting - 60	24.0706	4.90618	3.19858
10	SP500	Seasonal Forecasting - 1440	578.648	24.0551	18.6964
11	FTSE	Seasonal Forecasting - 60	358.944	18.9458	12.5549
12	FTSE	Seasonal Forecasting - 1440	9305.27	96.4639	71.5358
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358
15	USDCUSDT Close Price	Seasonal Forecasting	1.41478e-05	0.003761	0.00176
16	USDCUSDT Tradecount	Seasonal Forecasting	2.76653e+08	16632.9	10558.5
17	Python Questions	Seasonal Forecasting	4.97505e+06	2230.48	1924.32
18	Machine Learning Questions	Seasonal Forecasting	14325.5	119.689	84.975
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187

Average Forecasting

# Create Average Forecasting Predictions

sp500AverageForecast60 = averageForecasting(sp500Data['Close'], 60)
sp500AverageForecast1440 = averageForecasting(sp500Data['Close'], 1440)
ftseAverageForecast60 = averageForecasting(ftseData['Close'], 60)
ftseAverageForecast1440 = averageForecasting(ftseData['Close'], 1440)
goldAverageForecast60 = averageForecasting(goldData['Close'], 60)
goldAverageForecast1440 = averageForecasting(goldData['Close'], 1440)
usdcusdtCloseAverageForecast = averageForecasting(usdcusdtDataClose['close'], 7)
usdcusdtTCAverageForecast = averageForecasting(usdcusdtDataTradeCount['tradecount'], 7)
pythonAverageForecast = averageForecasting(mlStackoverflow_data['python'], 12)
mlTopicAverageForecast = averageForecasting(mlStackoverflow_data['machine-learning'], 12)
ticketSalesAverageForecast = averageForecasting(ticketSales_data['tickets_sold'], 7)

# Appending to Result Table
averageForcastingString = 'Average Forecasting'

appendingTOR(tableOfResults, 'SP500', averageForcastingString + ' - 60', evaluateErrors(sp500AverageForecast60))
appendingTOR(tableOfResults, 'SP500', averageForcastingString + ' - 1440', evaluateErrors(sp500AverageForecast1440))
appendingTOR(tableOfResults, 'FTSE', averageForcastingString + ' - 60', evaluateErrors(ftseAverageForecast60))
appendingTOR(tableOfResults, 'FTSE', averageForcastingString + ' - 1440', evaluateErrors(ftseAverageForecast1440))
appendingTOR(tableOfResults, 'Gold', averageForcastingString + ' - 60', evaluateErrors(goldAverageForecast60))
appendingTOR(tableOfResults, 'Gold', averageForcastingString + ' - 1440', evaluateErrors(goldAverageForecast1440))
appendingTOR(tableOfResults, 'USDCUSDT Close Price', averageForcastingString, evaluateErrors(usdcusdtCloseAverageForecast))
appendingTOR(tableOfResults, 'USDCUSDT Tradecount', averageForcastingString, evaluateErrors(usdcusdtTCAverageForecast))
appendingTOR(tableOfResults, 'Python Questions', averageForcastingString, evaluateErrors(pythonAverageForecast))
appendingTOR(tableOfResults, 'Machine Learning Questions', averageForcastingString, evaluateErrors(mlTopicAverageForecast))
appendingTOR(tableOfResults, 'Ticket Sales', averageForcastingString, evaluateErrors(ticketSalesAverageForecast))

tableOfResults

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
1	SP500	Naive Forecasting	0.315878	0.56203	0.325863
2	FTSE	Naive Forecasting	5.69646	2.38673	1.29783
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
4	USDCUSDT Close Price	Naive Forecasting	4.01536e-06	0.002004	0.001034
5	USDCUSDT Tradecount	Naive Forecasting	1.56376e+08	12505	7519.96
6	Python Questions	Naive Forecasting	1.00863e+06	1004.31	680.298
7	Machine Learning Questions	Naive Forecasting	3053.46	55.2581	35.4885
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
9	SP500	Seasonal Forecasting - 60	24.0706	4.90618	3.19858
10	SP500	Seasonal Forecasting - 1440	578.648	24.0551	18.6964
11	FTSE	Seasonal Forecasting - 60	358.944	18.9458	12.5549
12	FTSE	Seasonal Forecasting - 1440	9305.27	96.4639	71.5358
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358
15	USDCUSDT Close Price	Seasonal Forecasting	1.41478e-05	0.003761	0.00176
16	USDCUSDT Tradecount	Seasonal Forecasting	2.76653e+08	16632.9	10558.5
17	Python Questions	Seasonal Forecasting	4.97505e+06	2230.48	1924.32
18	Machine Learning Questions	Seasonal Forecasting	14325.5	119.689	84.975
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
20	SP500	Average Forecasting - 60	8.21104	2.86549	1.84062
21	SP500	Average Forecasting - 1440	195.085	13.9673	10.6287
22	FTSE	Average Forecasting - 60	122.369	11.0621	7.24807
23	FTSE	Average Forecasting - 1440	3105.82	55.7299	40.9431
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
26	USDCUSDT Close Price	Average Forecasting	6.51406e-06	0.002552	0.001187
27	USDCUSDT Tradecount	Average Forecasting	1.44419e+08	12017.5	7727.27
28	Python Questions	Average Forecasting	2.18019e+06	1476.55	1175.32
29	Machine Learning Questions	Average Forecasting	7140.3	84.5003	55.785
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301

Average Differences

# Create Average Differences Prediction

sp500AverageDifferenceForecast60 = differenceAverage(sp500Data['Close'], 60)
sp500AverageDifferenceForecast1440 = differenceAverage(sp500Data['Close'], 1440)
ftseAverageDifferenceForecast60 = differenceAverage(ftseData['Close'], 60)
ftseAverageDifferenceForecast1440 = differenceAverage(ftseData['Close'], 1440)
goldAverageDifferenceForecast60 = differenceAverage(goldData['Close'], 60)
goldAverageDifferenceForecast1440 = differenceAverage(goldData['Close'], 1440)
usdcusdtCloseAverageDifferenceForecast = differenceAverage(usdcusdtDataClose['close'], 7)
usdcusdtTCAverageDifferenceForecast = differenceAverage(usdcusdtDataTradeCount['tradecount'], 7)
pythonAverageDifferenceForecast = differenceAverage(mlStackoverflow_data['python'], 12)
mlTopicAverageDifferenceForecast = differenceAverage(mlStackoverflow_data['machine-learning'], 12)
ticketSalesAverageDifferenceForecast = differenceAverage(ticketSales_data['tickets_sold'], 7)

# Appending to Result Table
averageDifferenceForecastingString = 'Average Difference Forecasting'

appendingTOR(tableOfResults, 'SP500', averageDifferenceForecastingString + ' - 60', evaluateErrors(sp500AverageDifferenceForecast60))
appendingTOR(tableOfResults, 'SP500', averageDifferenceForecastingString + ' - 1440', evaluateErrors(sp500AverageDifferenceForecast1440))
appendingTOR(tableOfResults, 'FTSE', averageDifferenceForecastingString + ' - 60', evaluateErrors(ftseAverageDifferenceForecast60))
appendingTOR(tableOfResults, 'FTSE', averageDifferenceForecastingString + ' - 1440', evaluateErrors(ftseAverageDifferenceForecast1440))
appendingTOR(tableOfResults, 'Gold', averageDifferenceForecastingString + ' - 60', evaluateErrors(goldAverageDifferenceForecast60))
appendingTOR(tableOfResults, 'Gold', averageDifferenceForecastingString + ' - 1440', evaluateErrors(goldAverageDifferenceForecast1440))
appendingTOR(tableOfResults, 'USDCUSDT Close Price', averageDifferenceForecastingString, evaluateErrors(usdcusdtCloseAverageDifferenceForecast))
appendingTOR(tableOfResults, 'USDCUSDT Tradecount', averageDifferenceForecastingString, evaluateErrors(usdcusdtTCAverageDifferenceForecast))
appendingTOR(tableOfResults, 'Python Questions', averageDifferenceForecastingString, evaluateErrors(pythonAverageDifferenceForecast))
appendingTOR(tableOfResults, 'Machine Learning Questions', averageDifferenceForecastingString, evaluateErrors(mlTopicAverageDifferenceForecast))
appendingTOR(tableOfResults, 'Ticket Sales', averageDifferenceForecastingString, evaluateErrors(ticketSalesAverageDifferenceForecast))

tableOfResults

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
1	SP500	Naive Forecasting	0.315878	0.56203	0.325863
2	FTSE	Naive Forecasting	5.69646	2.38673	1.29783
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
4	USDCUSDT Close Price	Naive Forecasting	4.01536e-06	0.002004	0.001034
5	USDCUSDT Tradecount	Naive Forecasting	1.56376e+08	12505	7519.96
6	Python Questions	Naive Forecasting	1.00863e+06	1004.31	680.298
7	Machine Learning Questions	Naive Forecasting	3053.46	55.2581	35.4885
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
9	SP500	Seasonal Forecasting - 60	24.0706	4.90618	3.19858
10	SP500	Seasonal Forecasting - 1440	578.648	24.0551	18.6964
11	FTSE	Seasonal Forecasting - 60	358.944	18.9458	12.5549
12	FTSE	Seasonal Forecasting - 1440	9305.27	96.4639	71.5358
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358
15	USDCUSDT Close Price	Seasonal Forecasting	1.41478e-05	0.003761	0.00176
16	USDCUSDT Tradecount	Seasonal Forecasting	2.76653e+08	16632.9	10558.5
17	Python Questions	Seasonal Forecasting	4.97505e+06	2230.48	1924.32
18	Machine Learning Questions	Seasonal Forecasting	14325.5	119.689	84.975
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
20	SP500	Average Forecasting - 60	8.21104	2.86549	1.84062
21	SP500	Average Forecasting - 1440	195.085	13.9673	10.6287
22	FTSE	Average Forecasting - 60	122.369	11.0621	7.24807
23	FTSE	Average Forecasting - 1440	3105.82	55.7299	40.9431
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
26	USDCUSDT Close Price	Average Forecasting	6.51406e-06	0.002552	0.001187
27	USDCUSDT Tradecount	Average Forecasting	1.44419e+08	12017.5	7727.27
28	Python Questions	Average Forecasting	2.18019e+06	1476.55	1175.32
29	Machine Learning Questions	Average Forecasting	7140.3	84.5003	55.785
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
31	SP500	Average Difference Forecasting - 60	0.322452	0.567849	0.332982
32	SP500	Average Difference Forecasting - 1440	0.316412	0.562505	0.326336
33	FTSE	Average Difference Forecasting - 60	5.79875	2.40806	1.32178
34	FTSE	Average Difference Forecasting - 1440	5.70723	2.38898	1.30033
35	Gold	Average Difference Forecasting - 60	0.240423	0.49033	0.315994
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009
37	USDCUSDT Close Price	Average Difference Forecasting	4.53094e-06	0.002129	0.001103
38	USDCUSDT Tradecount	Average Difference Forecasting	1.64095e+08	12810	7724.69
39	Python Questions	Average Difference Forecasting	1.1445e+06	1069.81	751.801
40	Machine Learning Questions	Average Difference Forecasting	3399.79	58.3077	38.5563
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102

Naive Difference

# Create Naive Difference Prediction

sp500NaiveDifferenceForecast = naiveDifferenceForecasting(sp500Data['Close'])
ftseNaiveDifferenceForecast = naiveDifferenceForecasting(ftseData['Close'])
goldNaiveDifferenceForecast = naiveDifferenceForecasting(goldData['Close'])
usdcusdtCloseNaiveDifferenceForecast = naiveDifferenceForecasting(usdcusdtDataClose['close'])
usdcusdtTCNaiveDifferenceForecast = naiveDifferenceForecasting(usdcusdtDataTradeCount['tradecount'])
pythonNaiveDifferenceForecast = naiveDifferenceForecasting(mlStackoverflow_data['python'])
mlTopicNaiveDifferenceForecast = naiveDifferenceForecasting(mlStackoverflow_data['machine-learning'])
ticketSalesNaiveDifferenceForecast = naiveDifferenceForecasting(ticketSales_data['tickets_sold'])

# Appending to Result Table
naiveDifferenceForecastingString = 'Naive Difference Forecasting'

appendingTOR(tableOfResults, 'SP500', naiveDifferenceForecastingString, evaluateErrors(sp500NaiveDifferenceForecast))
appendingTOR(tableOfResults, 'FTSE', naiveDifferenceForecastingString, evaluateErrors(ftseNaiveDifferenceForecast))
appendingTOR(tableOfResults, 'Gold', naiveDifferenceForecastingString, evaluateErrors(goldNaiveDifferenceForecast))
appendingTOR(tableOfResults, 'USDCUSDT Close Price', naiveDifferenceForecastingString, evaluateErrors(usdcusdtCloseNaiveDifferenceForecast))
appendingTOR(tableOfResults, 'USDCUSDT Tradecount', naiveDifferenceForecastingString, evaluateErrors(usdcusdtTCNaiveDifferenceForecast))
appendingTOR(tableOfResults, 'Python Questions', naiveDifferenceForecastingString, evaluateErrors(pythonNaiveDifferenceForecast))
appendingTOR(tableOfResults, 'Machine Learning Questions', naiveDifferenceForecastingString, evaluateErrors(mlTopicNaiveDifferenceForecast))
appendingTOR(tableOfResults, 'Ticket Sales', naiveDifferenceForecastingString, evaluateErrors(ticketSalesNaiveDifferenceForecast))

tableOfResults

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
1	SP500	Naive Forecasting	0.315878	0.56203	0.325863
2	FTSE	Naive Forecasting	5.69646	2.38673	1.29783
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
4	USDCUSDT Close Price	Naive Forecasting	4.01536e-06	0.002004	0.001034
5	USDCUSDT Tradecount	Naive Forecasting	1.56376e+08	12505	7519.96
6	Python Questions	Naive Forecasting	1.00863e+06	1004.31	680.298
7	Machine Learning Questions	Naive Forecasting	3053.46	55.2581	35.4885
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
9	SP500	Seasonal Forecasting - 60	24.0706	4.90618	3.19858
10	SP500	Seasonal Forecasting - 1440	578.648	24.0551	18.6964
11	FTSE	Seasonal Forecasting - 60	358.944	18.9458	12.5549
12	FTSE	Seasonal Forecasting - 1440	9305.27	96.4639	71.5358
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358
15	USDCUSDT Close Price	Seasonal Forecasting	1.41478e-05	0.003761	0.00176
16	USDCUSDT Tradecount	Seasonal Forecasting	2.76653e+08	16632.9	10558.5
17	Python Questions	Seasonal Forecasting	4.97505e+06	2230.48	1924.32
18	Machine Learning Questions	Seasonal Forecasting	14325.5	119.689	84.975
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
20	SP500	Average Forecasting - 60	8.21104	2.86549	1.84062
21	SP500	Average Forecasting - 1440	195.085	13.9673	10.6287
22	FTSE	Average Forecasting - 60	122.369	11.0621	7.24807
23	FTSE	Average Forecasting - 1440	3105.82	55.7299	40.9431
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
26	USDCUSDT Close Price	Average Forecasting	6.51406e-06	0.002552	0.001187
27	USDCUSDT Tradecount	Average Forecasting	1.44419e+08	12017.5	7727.27
28	Python Questions	Average Forecasting	2.18019e+06	1476.55	1175.32
29	Machine Learning Questions	Average Forecasting	7140.3	84.5003	55.785
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
31	SP500	Average Difference Forecasting - 60	0.322452	0.567849	0.332982
32	SP500	Average Difference Forecasting - 1440	0.316412	0.562505	0.326336
33	FTSE	Average Difference Forecasting - 60	5.79875	2.40806	1.32178
34	FTSE	Average Difference Forecasting - 1440	5.70723	2.38898	1.30033
35	Gold	Average Difference Forecasting - 60	0.240423	0.49033	0.315994
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009
37	USDCUSDT Close Price	Average Difference Forecasting	4.53094e-06	0.002129	0.001103
38	USDCUSDT Tradecount	Average Difference Forecasting	1.64095e+08	12810	7724.69
39	Python Questions	Average Difference Forecasting	1.1445e+06	1069.81	751.801
40	Machine Learning Questions	Average Difference Forecasting	3399.79	58.3077	38.5563
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
42	SP500	Naive Difference Forecasting	0.553753	0.744146	0.456603
43	FTSE	Naive Difference Forecasting	10.796	3.28573	1.81081
44	Gold	Naive Difference Forecasting	0.521205	0.721946	0.480641
45	USDCUSDT Close Price	Naive Difference Forecasting	9.13676e-06	0.003023	0.001659
46	USDCUSDT Tradecount	Naive Difference Forecasting	4.15266e+08	20378.1	12089.8
47	Python Questions	Naive Difference Forecasting	2.34416e+06	1531.07	1117.71
48	Machine Learning Questions	Naive Difference Forecasting	6678.13	81.7198	51.9462
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398

Analysis

Because the evaluation metrics does not produce a standardize output, we cannot use the MSE, RMSE, and MAE value directly without context. It needs to be compared within the same dataset used. For example, the evaluations on USDCUSDT dataset are the lowest, but that does not mean that average difference algorithm on usdcusdt is better than naive difference forecasting on Ticket Sales. So let's pick S&P500, Python Questions and Ticket Sales, to compare first and see what observations can we find.

display(tableOfResults[tableOfResults['Data'] == 'SP500'].sort_values('MSE'),
        tableOfResults[tableOfResults['Data'] == 'Python Questions'].sort_values('MSE'),
        tableOfResults[tableOfResults['Data'] == 'Ticket Sales'].sort_values('MSE'),
)

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
1	SP500	Naive Forecasting	0.315878	0.56203	0.325863
32	SP500	Average Difference Forecasting - 1440	0.316412	0.562505	0.326336
31	SP500	Average Difference Forecasting - 60	0.322453	0.567849	0.332982
42	SP500	Naive Difference Forecasting	0.553753	0.744146	0.456603
20	SP500	Average Forecasting - 60	8.21104	2.86549	1.84062
9	SP500	Seasonal Forecasting - 60	24.0706	4.90618	3.19858
21	SP500	Average Forecasting - 1440	195.085	13.9673	10.6287
10	SP500	Seasonal Forecasting - 1440	578.648	24.0551	18.6964

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
6	Python Questions	Naive Forecasting	1.00863e+06	1004.31	680.298
39	Python Questions	Average Difference Forecasting	1.1445e+06	1069.81	751.801
28	Python Questions	Average Forecasting	2.18019e+06	1476.55	1175.32
47	Python Questions	Naive Difference Forecasting	2.34416e+06	1531.07	1117.71
17	Python Questions	Seasonal Forecasting	4.97505e+06	2230.48	1924.32

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398

We can see that all of there evaluation value are very different in terms of size and magnitude. Then, currently we can see that naive forecasting performing the best. Let's list all the best algorithm for the given dataset.

for dataset in tableOfResults['Data'].unique():
    display(tableOfResults[tableOfResults['Data'] == dataset].sort_values('MSE').head(2))

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
1	SP500	Naive Forecasting	0.315878	0.56203	0.325863
32	SP500	Average Difference Forecasting - 1440	0.316412	0.562505	0.326336

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
2	FTSE	Naive Forecasting	5.69646	2.38673	1.29783
34	FTSE	Average Difference Forecasting - 1440	5.70723	2.38898	1.30033

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
4	USDCUSDT Close Price	Naive Forecasting	4e-06	0.002004	0.001034
37	USDCUSDT Close Price	Average Difference Forecasting	5e-06	0.002129	0.001103

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
27	USDCUSDT Tradecount	Average Forecasting	1.44419e+08	12017.5	7727.27
5	USDCUSDT Tradecount	Naive Forecasting	1.56376e+08	12505	7519.96

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
6	Python Questions	Naive Forecasting	1.00863e+06	1004.31	680.298
39	Python Questions	Average Difference Forecasting	1.1445e+06	1069.81	751.801

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
7	Machine Learning Questions	Naive Forecasting	3053.46	55.2581	35.4885
40	Machine Learning Questions	Average Difference Forecasting	3399.79	58.3077	38.5563

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301

We can observe that Naive Forecasting seem to be the best choice at the moment, and the Average Difference in second. Only TicketSales and USDCUSDT Tradecount do not have both of the algorithm. This could be because the ticket sales and usdcusdt datasets, have more prominent repeatable pattern that other forecasting methods work better, while because the rest of the dataset are more random in nature.

Additionally, something to take note is that Naive Forecasting and Average Differences are currently the best algorithm only in the context of technical evaluation of MSE, RMSE and MAE. Because if we were to set trading/investment rules and different way to evaluate performance, the result might be different.

Autoregression Prediction

Before we can use autoregression prediction, first we will need to make the datasets stationary, as well as using autocorrelation and partial autocorrelation.

Choosing Dataset

We have a lot of datasets that we can use. We will use ticket sales (because it consists of more obvious seasonal changes), machine learning questions from stackoverflow (for its multiplicativeness), SP500 and Gold dataset as they are real world financial data. But because the SP500 and Gold data are very large, we will only use 3 months of data.

We will need need to also split the dataset into training and testing sets. We will use 70% as training and 30% training.

Autocorrelation and Partial Autocorrelation

# Importing plot_acf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

Gold Data

Making it stationary

stationaryGoldData = goldData
stationaryGoldData['observed'] = goldData['Close'] - goldData['Close'].shift()
stationaryGoldData = stationaryGoldData.dropna().drop('Close', axis=1)
stationaryGoldData = stationaryGoldData[stationaryGoldData.index.dayofweek != 5]
stationaryGoldData = stationaryGoldData[:108000]

kpss_test(stationaryGoldData)
print()
adftest(stationaryGoldData)
stationaryGoldData.plot()

Setting the training and testing set

stationaryGoldDataLength = len(stationaryGoldData)
traniningStationaryGoldDataLength = int(len(stationaryGoldData) * 0.7)
testingStationaryGoldDataLength = stationaryGoldDataLength - traniningStationaryGoldDataLength

print(stationaryGoldDataLength, traniningStationaryGoldDataLength, testingStationaryGoldDataLength)

108000 75600 32400

ACF

plot_acf(stationaryGoldData[:traniningStationaryGoldDataLength],lags=60)

PACF

plot_pacf(stationaryGoldData[:traniningStationaryGoldDataLength], lags = 60)

S&P500

Making it stationary

stationarySP500Data = sp500Data
stationarySP500Data['observed'] = sp500Data['Close'] - sp500Data['Close'].shift()
stationarySP500Data = stationarySP500Data.dropna().drop('Close', axis=1)
stationarySP500Data = stationarySP500Data[stationarySP500Data.index.dayofweek != 5].sort_index()
stationarySP500Data = stationarySP500Data[:36000]

kpss_test(stationarySP500Data)
print()
adftest(stationarySP500Data)
stationarySP500Data.plot()

Setting the training and testing set

stationarySP500DataLength = len(stationarySP500Data)
trainingStationarySP500Data = int(0.7 * stationarySP500DataLength)
testingStationarySP500Data = stationarySP500DataLength - trainingStationarySP500Data

print(stationarySP500DataLength, trainingStationarySP500Data, testingStationarySP500Data)

36000 25200 10800

ACF

plot_acf(stationarySP500Data[:trainingStationarySP500Data], lags=60)
plt.xticks([x for x in range(0,61,2)])

PACF

plot_pacf(stationarySP500Data[:trainingStationarySP500Data], lags=60)
plt.xticks([x for x in range(0,61,2)])

Machine Learning Questions

Because the dataset is multiplicative, we will need to log and get the difference of it.

Making it Stationary

stationaryMLQuestions = mlStackoverflow_data[['month','machine-learning']]

# Setting the month as index and also removing the month column.
stationaryMLQuestions = stationaryMLQuestions.set_index(
    pd.to_datetime(stationaryMLQuestions['month'],format='%y-%b'
)).drop('month', axis=1)

stationaryMLQuestions = np.log(stationaryMLQuestions)
stationaryMLQuestions.plot()

kpss_test(stationaryMLQuestions)
print()
adftest(stationaryMLQuestions)

KPSS Stat:  1.873246798025978
p-value:  0.01
Number of lags: 6
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is not stationary

AdfTest Stat:  -2.668195595640323
p-value:  0.07970535704695508
Number of lags: 2
Number of observation: 129
Critial Values:
   1% : -3.482087964046026
   5% : -2.8842185101614626
   10% : -2.578864381347275
The series is not stationary

stationaryMLQuestions['observed'] = stationaryMLQuestions['machine-learning'] - stationaryMLQuestions['machine-learning'].shift()
stationaryMLQuestions = stationaryMLQuestions.dropna().drop('machine-learning', axis=1)
stationaryMLQuestions.plot()

kpss_test(stationaryMLQuestions)
print()
adftest(stationaryMLQuestions)

KPSS Stat:  0.28055757025438893
p-value:  0.1
Number of lags: 55
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary

AdfTest Stat:  -13.691197822513615
p-value:  1.342938733545792e-25
Number of lags: 1
Number of observation: 129
Critial Values:
   1% : -3.482087964046026
   5% : -2.8842185101614626
   10% : -2.578864381347275
The series is stationary

Setting the training and testing set

stationaryMLQuestionsLength = len(stationaryMLQuestions)
trainingStationaryMLQuestions = int(stationaryMLQuestionsLength * 0.7)
testingStationaryMLQuestions = stationaryMLQuestionsLength - trainingStationaryMLQuestions

print(stationaryMLQuestionsLength, trainingStationaryMLQuestions, testingStationaryMLQuestions )

131 91 40

ACF

plot_acf(stationaryMLQuestions[:trainingStationaryMLQuestions], lags=60)
plt.xticks([x for x in range(0,61,2)])

PACF

plot_pacf(stationaryMLQuestions[:trainingStationaryMLQuestions], lags=44)
plt.xticks([x for x in range(0,45,2)])

Cinema Tickets

Making it Stationary

stationaryTicketSales = ticketSales_data
stationaryTicketSales['observed'] = stationaryTicketSales['tickets_sold'] - stationaryTicketSales['tickets_sold'].shift()
stationaryTicketSales = stationaryTicketSales['observed'].dropna()

kpss_test(stationaryTicketSales)
print()
adftest(stationaryTicketSales)

KPSS Stat:  0.14388225215810577
p-value:  0.1
Number of lags: 62
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary

AdfTest Stat:  -6.734609879307423
p-value:  3.2311867005854417e-09
Number of lags: 12
Number of observation: 218
Critial Values:
   1% : -3.460707667106296
   5% : -2.874891213486339
   10% : -2.573885987711472
The series is stationary

stationaryTicketSales.plot()

Setting the training set and testing set

stationaryTicketSalesLength = len(stationaryTicketSales)
trainingStationaryTicketSales = int(stationaryTicketSalesLength * 0.7)
testingStationaryTicketSales = stationaryTicketSalesLength - trainingStationaryTicketSales

print(stationaryTicketSalesLength, trainingStationaryTicketSales, testingStationaryTicketSales)

231 161 70

ACF

plot_acf(stationaryTicketSales[:trainingStationaryTicketSales], lags=60)
plt.xticks([x for x in range(0,60,2)])

PACF

plot_pacf(stationaryTicketSales[:trainingStationaryTicketSales], lags=60)
plt.xticks([x for x in range(0,61,2)])

Prediction

We will be moving on with the prediction portion. We will be using the outputs from the PACF to determine the lags/period that we will be using for the autoregression.

Gold

from statsmodels.tsa.ar_model import AutoReg

goldResults = AutoReg(stationaryGoldData[:traniningStationaryGoldDataLength], [1,2,3]).fit()
goldResults.summary()

AutoReg Model Results


Dep. Variable:	observed	No. Observations:	75600
Model:	AutoReg(3)	Log Likelihood	-30274.411
Method:	Conditional MLE	S.D. of innovations	0.361
Date:	Mon, 08 Jul 2024	AIC	60558.822
Time:	15:38:54	BIC	60604.988
Sample:	3	HQIC	60573.011
	75600

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	-0.0008	0.001	-0.617	0.537	-0.003	0.002
observed.L1	-0.1528	0.004	-42.018	0.000	-0.160	-0.146
observed.L2	-0.0332	0.004	-9.021	0.000	-0.040	-0.026
observed.L3	-0.0047	0.004	-1.282	0.200	-0.012	0.002

Roots

	Real	Imaginary	Modulus	Frequency
AR.1	-0.1166	-5.5814j	5.5827	-0.2533
AR.2	-0.1166	+5.5814j	5.5827	0.2533
AR.3	-6.8805	-0.0000j	6.8805	-0.5000

goldResults.params

const         -0.000811
observed.L1   -0.152820
observed.L2   -0.033173
observed.L3   -0.004663
dtype: float64

predictions = goldResults.predict(start=traniningStationaryGoldDataLength, end=stationaryGoldDataLength-1) + \
    gold_data['Close'].shift()[traniningStationaryGoldDataLength:stationaryGoldDataLength]

stationaryGoldDataResults = pd.DataFrame(goldData.reset_index()['Close'][traniningStationaryGoldDataLength:stationaryGoldDataLength].rename('observed'))
stationaryGoldDataResults['predicted'] = predictions
stationaryGoldDataResults.plot()

appendingTOR(tableOfResults, 'Gold', 'Autoregression', evaluateErrors(stationaryGoldDataResults))
tableOfResults

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
1	SP500	Naive Forecasting	0.315878	0.56203	0.325863
2	FTSE	Naive Forecasting	5.69646	2.38673	1.29783
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
4	USDCUSDT Close Price	Naive Forecasting	4.01536e-06	0.002004	0.001034
5	USDCUSDT Tradecount	Naive Forecasting	1.56376e+08	12505	7519.96
6	Python Questions	Naive Forecasting	1.00863e+06	1004.31	680.298
7	Machine Learning Questions	Naive Forecasting	3053.46	55.2581	35.4885
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
9	SP500	Seasonal Forecasting - 60	24.0706	4.90618	3.19858
10	SP500	Seasonal Forecasting - 1440	578.648	24.0551	18.6964
11	FTSE	Seasonal Forecasting - 60	358.944	18.9458	12.5549
12	FTSE	Seasonal Forecasting - 1440	9305.27	96.4639	71.5358
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358
15	USDCUSDT Close Price	Seasonal Forecasting	1.41478e-05	0.003761	0.00176
16	USDCUSDT Tradecount	Seasonal Forecasting	2.76653e+08	16632.9	10558.5
17	Python Questions	Seasonal Forecasting	4.97505e+06	2230.48	1924.32
18	Machine Learning Questions	Seasonal Forecasting	14325.5	119.689	84.975
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
20	SP500	Average Forecasting - 60	8.21104	2.86549	1.84062
21	SP500	Average Forecasting - 1440	195.085	13.9673	10.6287
22	FTSE	Average Forecasting - 60	122.369	11.0621	7.24807
23	FTSE	Average Forecasting - 1440	3105.82	55.7299	40.9431
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
26	USDCUSDT Close Price	Average Forecasting	6.51406e-06	0.002552	0.001187
27	USDCUSDT Tradecount	Average Forecasting	1.44419e+08	12017.5	7727.27
28	Python Questions	Average Forecasting	2.18019e+06	1476.55	1175.32
29	Machine Learning Questions	Average Forecasting	7140.3	84.5003	55.785
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
31	SP500	Average Difference Forecasting - 60	0.322452	0.567849	0.332982
32	SP500	Average Difference Forecasting - 1440	0.316412	0.562505	0.326336
33	FTSE	Average Difference Forecasting - 60	5.79875	2.40806	1.32178
34	FTSE	Average Difference Forecasting - 1440	5.70723	2.38898	1.30033
35	Gold	Average Difference Forecasting - 60	0.240423	0.49033	0.315994
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009
37	USDCUSDT Close Price	Average Difference Forecasting	4.53094e-06	0.002129	0.001103
38	USDCUSDT Tradecount	Average Difference Forecasting	1.64095e+08	12810	7724.69
39	Python Questions	Average Difference Forecasting	1.1445e+06	1069.81	751.801
40	Machine Learning Questions	Average Difference Forecasting	3399.79	58.3077	38.5563
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
42	SP500	Naive Difference Forecasting	0.553753	0.744146	0.456603
43	FTSE	Naive Difference Forecasting	10.796	3.28573	1.81081
44	Gold	Naive Difference Forecasting	0.521205	0.721946	0.480641
45	USDCUSDT Close Price	Naive Difference Forecasting	9.13676e-06	0.003023	0.001659
46	USDCUSDT Tradecount	Naive Difference Forecasting	4.15266e+08	20378.1	12089.8
47	Python Questions	Naive Difference Forecasting	2.34416e+06	1531.07	1117.71
48	Machine Learning Questions	Naive Difference Forecasting	6678.13	81.7198	51.9462
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398
50	Gold	Autoregression	0.167187	0.408885	0.270593

S&P500

sp500Results = AutoReg(stationarySP500Data[:trainingStationarySP500Data],[1,2,3]).fit()
sp500Results.summary()

AutoReg Model Results


Dep. Variable:	observed	No. Observations:	25200
Model:	AutoReg(3)	Log Likelihood	-26406.1
Method:	Conditional MLE	S.D. of innovations	0.69
Date:	Mon, 08 Jul 2024	AIC	52822.1
Time:	15:38:55	BIC	52862.8
Sample:	3	HQIC	52835.3
	25200

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	-0.0081	0.004	-1.865	0.062	-0.017	0.000
observed.L1	0.0963	0.006	15.284	0.000	0.084	0.109
observed.L2	0.0258	0.006	4.074	0.000	0.013	0.038
observed.L3	-0.0102	0.006	-1.612	0.107	-0.023	0.002

Roots

	Real	Imaginary	Modulus	Frequency
AR.1	-4.4810	-0.0000j	4.4810	-0.5000
AR.2	3.5094	-3.1074j	4.6874	-0.1153
AR.3	3.5094	+3.1074j	4.6874	0.1153

sp500Data['Close'].shift()[trainingStationarySP500Data: stationarySP500DataLength]

Datetime
2014-01-03 15:43:00    1837.18
2014-01-03 15:42:00    1837.49
2014-01-03 15:41:00    1837.61
2014-01-03 15:40:00    1837.14
2014-01-03 15:39:00    1836.72
                        ...
2013-11-20 20:39:00    1780.69
2013-11-20 20:38:00    1779.84
2013-11-20 20:37:00    1779.54
2013-11-20 20:36:00    1779.17
2013-11-20 20:35:00    1778.87
Name: Close, Length: 10800, dtype: float64

sp500Results.predict(start=trainingStationarySP500Data, end=stationarySP500DataLength-1)

c:\Users\Edwin Teoh\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:834: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
  return get_prediction_index(
c:\Users\Edwin Teoh\anaconda3\lib\site-packages\statsmodels\tsa\deterministic.py:302: UserWarning: Only PeriodIndexes, DatetimeIndexes with a frequency set, RangesIndexes, and Index with a unit increment support extending. The index is set will contain the position relative to the data length.
  fcast_index = self._extend_index(index, steps, forecast_index)

25200    0.005949
25201   -0.002961
25202   -0.009767
25203   -0.009188
25204   -0.009218
           ...
35995   -0.009133
35996   -0.009133
35997   -0.009133
35998   -0.009133
35999   -0.009133
Length: 10800, dtype: float64

predictions = sp500Results.predict(start=trainingStationarySP500Data, end=stationarySP500DataLength-1) + \
    sp500Data.reset_index()['Close'].shift()[trainingStationarySP500Data: stationarySP500DataLength]

stationarySP500DataResults = pd.DataFrame(
    sp500Data.reset_index()['Close'][trainingStationarySP500Data:stationarySP500DataLength].rename('observed'))

stationarySP500DataResults['predicted'] = predictions
stationarySP500DataResults.plot()

appendingTOR(tableOfResults, 'SP500', 'Autoregression', evaluateErrors(stationarySP500DataResults))
tableOfResults

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
1	SP500	Naive Forecasting	0.315878	0.56203	0.325863
2	FTSE	Naive Forecasting	5.69646	2.38673	1.29783
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
4	USDCUSDT Close Price	Naive Forecasting	4.01536e-06	0.002004	0.001034
5	USDCUSDT Tradecount	Naive Forecasting	1.56376e+08	12505	7519.96
6	Python Questions	Naive Forecasting	1.00863e+06	1004.31	680.298
7	Machine Learning Questions	Naive Forecasting	3053.46	55.2581	35.4885
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
9	SP500	Seasonal Forecasting - 60	24.0706	4.90618	3.19858
10	SP500	Seasonal Forecasting - 1440	578.648	24.0551	18.6964
11	FTSE	Seasonal Forecasting - 60	358.944	18.9458	12.5549
12	FTSE	Seasonal Forecasting - 1440	9305.27	96.4639	71.5358
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358
15	USDCUSDT Close Price	Seasonal Forecasting	1.41478e-05	0.003761	0.00176
16	USDCUSDT Tradecount	Seasonal Forecasting	2.76653e+08	16632.9	10558.5
17	Python Questions	Seasonal Forecasting	4.97505e+06	2230.48	1924.32
18	Machine Learning Questions	Seasonal Forecasting	14325.5	119.689	84.975
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
20	SP500	Average Forecasting - 60	8.21104	2.86549	1.84062
21	SP500	Average Forecasting - 1440	195.085	13.9673	10.6287
22	FTSE	Average Forecasting - 60	122.369	11.0621	7.24807
23	FTSE	Average Forecasting - 1440	3105.82	55.7299	40.9431
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
26	USDCUSDT Close Price	Average Forecasting	6.51406e-06	0.002552	0.001187
27	USDCUSDT Tradecount	Average Forecasting	1.44419e+08	12017.5	7727.27
28	Python Questions	Average Forecasting	2.18019e+06	1476.55	1175.32
29	Machine Learning Questions	Average Forecasting	7140.3	84.5003	55.785
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
31	SP500	Average Difference Forecasting - 60	0.322452	0.567849	0.332982
32	SP500	Average Difference Forecasting - 1440	0.316412	0.562505	0.326336
33	FTSE	Average Difference Forecasting - 60	5.79875	2.40806	1.32178
34	FTSE	Average Difference Forecasting - 1440	5.70723	2.38898	1.30033
35	Gold	Average Difference Forecasting - 60	0.240423	0.49033	0.315994
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009
37	USDCUSDT Close Price	Average Difference Forecasting	4.53094e-06	0.002129	0.001103
38	USDCUSDT Tradecount	Average Difference Forecasting	1.64095e+08	12810	7724.69
39	Python Questions	Average Difference Forecasting	1.1445e+06	1069.81	751.801
40	Machine Learning Questions	Average Difference Forecasting	3399.79	58.3077	38.5563
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
42	SP500	Naive Difference Forecasting	0.553753	0.744146	0.456603
43	FTSE	Naive Difference Forecasting	10.796	3.28573	1.81081
44	Gold	Naive Difference Forecasting	0.521205	0.721946	0.480641
45	USDCUSDT Close Price	Naive Difference Forecasting	9.13676e-06	0.003023	0.001659
46	USDCUSDT Tradecount	Naive Difference Forecasting	4.15266e+08	20378.1	12089.8
47	Python Questions	Naive Difference Forecasting	2.34416e+06	1531.07	1117.71
48	Machine Learning Questions	Naive Difference Forecasting	6678.13	81.7198	51.9462
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398
50	Gold	Autoregression	0.167187	0.408885	0.270593
51	SP500	Autoregression	0.181846	0.426434	0.236095

Machine Learning Questions

mlQuestionsResults = AutoReg(stationaryMLQuestions[:trainingStationaryMLQuestions], [1, 2, 7, 14, 28, 31]).fit()
mlQuestionsResults.summary()

AutoReg Model Results


Dep. Variable:	observed	No. Observations:	91
Model:	Restr. AutoReg(31)	Log Likelihood	24.827
Method:	Conditional MLE	S.D. of innovations	0.16
Date:	Mon, 08 Jul 2024	AIC	-33.654
Time:	15:38:55	BIC	-16.899
Sample:	09-01-2011	HQIC	-27.1
	- 08-01-2016

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	0.0658	0.023	2.894	0.004	0.021	0.110
observed.L1	-0.4974	0.116	-4.289	0.000	-0.725	-0.270
observed.L2	-0.2309	0.124	-1.862	0.063	-0.474	0.012
observed.L7	0.2036	0.099	2.049	0.040	0.009	0.398
observed.L14	-0.0429	0.089	-0.483	0.629	-0.217	0.131
observed.L28	-0.0634	0.055	-1.158	0.247	-0.171	0.044
observed.L31	-0.0349	0.050	-0.695	0.487	-0.133	0.064

Roots

	Real	Imaginary	Modulus	Frequency
AR.1	-0.8987	-0.5721j	1.0654	-0.4098
AR.2	-0.8987	+0.5721j	1.0654	0.4098
AR.3	-1.0050	-0.3868j	1.0769	-0.4415
AR.4	-1.0050	+0.3868j	1.0769	0.4415
AR.5	-1.1138	-0.1904j	1.1300	-0.4731
AR.6	-1.1138	+0.1904j	1.1300	0.4731
AR.7	-1.2513	-0.0000j	1.2513	-0.5000
AR.8	-0.7733	-0.7617j	1.0854	-0.3762
AR.9	-0.7733	+0.7617j	1.0854	0.3762
AR.10	-0.5650	-0.9155j	1.0758	-0.3380
AR.11	-0.5650	+0.9155j	1.0758	0.3380
AR.12	-0.3691	-0.9883j	1.0550	-0.3069
AR.13	-0.3691	+0.9883j	1.0550	0.3069
AR.14	-0.1692	-1.0598j	1.0732	-0.2752
AR.15	-0.1692	+1.0598j	1.0732	0.2752
AR.16	0.0585	-1.1078j	1.1094	-0.2416
AR.17	0.0585	+1.1078j	1.1094	0.2416
AR.18	1.0930	-0.1180j	1.0994	-0.0171
AR.19	1.0930	+0.1180j	1.0994	0.0171
AR.20	1.0568	-0.3488j	1.1128	-0.0507
AR.21	1.0568	+0.3488j	1.1128	0.0507
AR.22	0.9596	-0.5937j	1.1284	-0.0882
AR.23	0.9596	+0.5937j	1.1284	0.0882
AR.24	0.8043	-0.7902j	1.1275	-0.1236
AR.25	0.8043	+0.7902j	1.1275	0.1236
AR.26	0.3333	-1.0889j	1.1387	-0.2027
AR.27	0.3333	+1.0889j	1.1387	0.2027
AR.28	0.6337	-1.0033j	1.1867	-0.1603
AR.29	0.6337	+1.0033j	1.1867	0.1603
AR.30	0.5806	-1.0452j	1.1956	-0.1693
AR.31	0.5806	+1.0452j	1.1956	0.1693

loggedMLtopic

0      2.079442
1      1.386294
2      1.098612
3      2.302585
4      1.945910
         ...
127    6.368187
128    6.287859
129    6.620073
130    6.822197
131    6.890609
Name: machine-learning, Length: 132, dtype: float64

# Predictions from AR
predictions = mlQuestionsResults.predict(start = trainingStationaryMLQuestions, end = stationaryMLQuestionsLength-1).reset_index()

# Setting the index to number
predictions['index'] = [x for x in range(trainingStationaryMLQuestions,stationaryMLQuestionsLength)]
predictions = predictions.set_index('index')

# Adding the previous logged value
predictions = predictions[0] + loggedMLtopic.shift()

# Exponent it back
predictions = np.exp(predictions)
predictions

c:\Users\Edwin Teoh\anaconda3\lib\site-packages\statsmodels\tsa\deterministic.py:302: UserWarning: Only PeriodIndexes, DatetimeIndexes with a frequency set, RangesIndexes, and Index with a unit increment support extending. The index is set will contain the position relative to the data length.
  fcast_index = self._extend_index(index, steps, forecast_index)

0             NaN
1             NaN
2             NaN
3             NaN
4             NaN
          ...
127    642.365437
128    605.731054
129    561.555450
130    781.032565
131           NaN
Length: 132, dtype: float64

stationaryMLQuestionsResults = pd.DataFrame(
    mlStackoverflow_data['machine-learning'][trainingStationaryMLQuestions:stationaryMLQuestionsLength]
    .rename('observed'))
stationaryMLQuestionsResults['predicted'] = predictions
stationaryMLQuestionsResults.plot()

appendingTOR(tableOfResults, 'Machine Learning Questions', 'Autoregression', evaluateErrors(stationaryMLQuestionsResults))
tableOfResults

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
1	SP500	Naive Forecasting	0.315878	0.56203	0.325863
2	FTSE	Naive Forecasting	5.69646	2.38673	1.29783
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
4	USDCUSDT Close Price	Naive Forecasting	4.01536e-06	0.002004	0.001034
5	USDCUSDT Tradecount	Naive Forecasting	1.56376e+08	12505	7519.96
6	Python Questions	Naive Forecasting	1.00863e+06	1004.31	680.298
7	Machine Learning Questions	Naive Forecasting	3053.46	55.2581	35.4885
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
9	SP500	Seasonal Forecasting - 60	24.0706	4.90618	3.19858
10	SP500	Seasonal Forecasting - 1440	578.648	24.0551	18.6964
11	FTSE	Seasonal Forecasting - 60	358.944	18.9458	12.5549
12	FTSE	Seasonal Forecasting - 1440	9305.27	96.4639	71.5358
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358
15	USDCUSDT Close Price	Seasonal Forecasting	1.41478e-05	0.003761	0.00176
16	USDCUSDT Tradecount	Seasonal Forecasting	2.76653e+08	16632.9	10558.5
17	Python Questions	Seasonal Forecasting	4.97505e+06	2230.48	1924.32
18	Machine Learning Questions	Seasonal Forecasting	14325.5	119.689	84.975
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
20	SP500	Average Forecasting - 60	8.21104	2.86549	1.84062
21	SP500	Average Forecasting - 1440	195.085	13.9673	10.6287
22	FTSE	Average Forecasting - 60	122.369	11.0621	7.24807
23	FTSE	Average Forecasting - 1440	3105.82	55.7299	40.9431
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
26	USDCUSDT Close Price	Average Forecasting	6.51406e-06	0.002552	0.001187
27	USDCUSDT Tradecount	Average Forecasting	1.44419e+08	12017.5	7727.27
28	Python Questions	Average Forecasting	2.18019e+06	1476.55	1175.32
29	Machine Learning Questions	Average Forecasting	7140.3	84.5003	55.785
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
31	SP500	Average Difference Forecasting - 60	0.322452	0.567849	0.332982
32	SP500	Average Difference Forecasting - 1440	0.316412	0.562505	0.326336
33	FTSE	Average Difference Forecasting - 60	5.79875	2.40806	1.32178
34	FTSE	Average Difference Forecasting - 1440	5.70723	2.38898	1.30033
35	Gold	Average Difference Forecasting - 60	0.240423	0.49033	0.315994
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009
37	USDCUSDT Close Price	Average Difference Forecasting	4.53094e-06	0.002129	0.001103
38	USDCUSDT Tradecount	Average Difference Forecasting	1.64095e+08	12810	7724.69
39	Python Questions	Average Difference Forecasting	1.1445e+06	1069.81	751.801
40	Machine Learning Questions	Average Difference Forecasting	3399.79	58.3077	38.5563
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
42	SP500	Naive Difference Forecasting	0.553753	0.744146	0.456603
43	FTSE	Naive Difference Forecasting	10.796	3.28573	1.81081
44	Gold	Naive Difference Forecasting	0.521205	0.721946	0.480641
45	USDCUSDT Close Price	Naive Difference Forecasting	9.13676e-06	0.003023	0.001659
46	USDCUSDT Tradecount	Naive Difference Forecasting	4.15266e+08	20378.1	12089.8
47	Python Questions	Naive Difference Forecasting	2.34416e+06	1531.07	1117.71
48	Machine Learning Questions	Naive Difference Forecasting	6678.13	81.7198	51.9462
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398
50	Gold	Autoregression	0.167187	0.408885	0.270593
51	SP500	Autoregression	0.181846	0.426434	0.236095
52	Machine Learning Questions	Autoregression	9493.88	97.4365	80.8513

Cinema Tickets

ticketSalesResults = AutoReg(stationaryTicketSales[:trainingStationaryTicketSales], [1,2,3,6,7]).fit()
ticketSalesResults.summary()

AutoReg Model Results


Dep. Variable:	observed	No. Observations:	161
Model:	Restr. AutoReg(7)	Log Likelihood	-861.251
Method:	Conditional MLE	S.D. of innovations	64.949
Date:	Mon, 08 Jul 2024	AIC	1736.5
Time:	15:38:55	BIC	1757.76
Sample:	7	HQIC	1745.14
	161

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	0.3138	5.235	0.060	0.952	-9.947	10.574
observed.L1	-0.4471	0.067	-6.711	0.000	-0.578	-0.317
observed.L2	-0.2981	0.068	-4.382	0.000	-0.431	-0.165
observed.L3	-0.1313	0.055	-2.382	0.017	-0.239	-0.023
observed.L6	-0.1062	0.053	-1.990	0.047	-0.211	-0.002
observed.L7	0.5693	0.062	9.164	0.000	0.448	0.691

Roots

	Real	Imaginary	Modulus	Frequency
AR.1	-0.9209	-0.4350j	1.0185	-0.4298
AR.2	-0.9209	+0.4350j	1.0185	0.4298
AR.3	-0.2433	-0.9990j	1.0282	-0.2880
AR.4	-0.2433	+0.9990j	1.0282	0.2880
AR.5	0.6337	-0.9394j	1.1331	-0.1555
AR.6	0.6337	+0.9394j	1.1331	0.1555
AR.7	1.2475	-0.0000j	1.2475	-0.0000

ticketSalesARResults = pd.DataFrame(
    ticketSales_data['tickets_sold'].reset_index()['tickets_sold'].rename('observed')
    [trainingStationaryTicketSales-1:stationaryTicketSalesLength]
)

ticketSalesARResults

Unnamed: 0	observed
160	115.308869
161	138.182362
162	71.504747
163	63.026772
164	53.210611
...	...
226	34.500000
227	70.583704
228	113.653521
229	126.824561
230	75.431177

ticketSalesARPredictions = ticketSalesResults.predict(start=trainingStationaryTicketSales, end=stationaryTicketSalesLength-1)
ticketSalesARPredictions

c:\Users\Edwin Teoh\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:834: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
  return get_prediction_index(
c:\Users\Edwin Teoh\anaconda3\lib\site-packages\statsmodels\tsa\deterministic.py:302: UserWarning: Only PeriodIndexes, DatetimeIndexes with a frequency set, RangesIndexes, and Index with a unit increment support extending. The index is set will contain the position relative to the data length.
  fcast_index = self._extend_index(index, steps, forecast_index)

161    -33.435136
162      2.950482
163    -16.865140
164    141.340881
165   -127.283711
          ...
226     -4.337732
227     22.393172
228    -34.553267
229     25.892766
230     -5.222340
Length: 70, dtype: float64

ticketSalesARResults['predicted'] = ticketSalesARPredictions + ticketSalesARResults['observed'].shift()
ticketSalesARResults.plot()

appendingTOR(tableOfResults, 'Ticket Sales', 'Autoregression', evaluateErrors(ticketSalesARResults))
tableOfResults

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
1	SP500	Naive Forecasting	0.315878	0.56203	0.325863
2	FTSE	Naive Forecasting	5.69646	2.38673	1.29783
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
4	USDCUSDT Close Price	Naive Forecasting	4.01536e-06	0.002004	0.001034
5	USDCUSDT Tradecount	Naive Forecasting	1.56376e+08	12505	7519.96
6	Python Questions	Naive Forecasting	1.00863e+06	1004.31	680.298
7	Machine Learning Questions	Naive Forecasting	3053.46	55.2581	35.4885
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
9	SP500	Seasonal Forecasting - 60	24.0706	4.90618	3.19858
10	SP500	Seasonal Forecasting - 1440	578.648	24.0551	18.6964
11	FTSE	Seasonal Forecasting - 60	358.944	18.9458	12.5549
12	FTSE	Seasonal Forecasting - 1440	9305.27	96.4639	71.5358
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358
15	USDCUSDT Close Price	Seasonal Forecasting	1.41478e-05	0.003761	0.00176
16	USDCUSDT Tradecount	Seasonal Forecasting	2.76653e+08	16632.9	10558.5
17	Python Questions	Seasonal Forecasting	4.97505e+06	2230.48	1924.32
18	Machine Learning Questions	Seasonal Forecasting	14325.5	119.689	84.975
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
20	SP500	Average Forecasting - 60	8.21104	2.86549	1.84062
21	SP500	Average Forecasting - 1440	195.085	13.9673	10.6287
22	FTSE	Average Forecasting - 60	122.369	11.0621	7.24807
23	FTSE	Average Forecasting - 1440	3105.82	55.7299	40.9431
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
26	USDCUSDT Close Price	Average Forecasting	6.51406e-06	0.002552	0.001187
27	USDCUSDT Tradecount	Average Forecasting	1.44419e+08	12017.5	7727.27
28	Python Questions	Average Forecasting	2.18019e+06	1476.55	1175.32
29	Machine Learning Questions	Average Forecasting	7140.3	84.5003	55.785
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
31	SP500	Average Difference Forecasting - 60	0.322452	0.567849	0.332982
32	SP500	Average Difference Forecasting - 1440	0.316412	0.562505	0.326336
33	FTSE	Average Difference Forecasting - 60	5.79875	2.40806	1.32178
34	FTSE	Average Difference Forecasting - 1440	5.70723	2.38898	1.30033
35	Gold	Average Difference Forecasting - 60	0.240423	0.49033	0.315994
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009
37	USDCUSDT Close Price	Average Difference Forecasting	4.53094e-06	0.002129	0.001103
38	USDCUSDT Tradecount	Average Difference Forecasting	1.64095e+08	12810	7724.69
39	Python Questions	Average Difference Forecasting	1.1445e+06	1069.81	751.801
40	Machine Learning Questions	Average Difference Forecasting	3399.79	58.3077	38.5563
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
42	SP500	Naive Difference Forecasting	0.553753	0.744146	0.456603
43	FTSE	Naive Difference Forecasting	10.796	3.28573	1.81081
44	Gold	Naive Difference Forecasting	0.521205	0.721946	0.480641
45	USDCUSDT Close Price	Naive Difference Forecasting	9.13676e-06	0.003023	0.001659
46	USDCUSDT Tradecount	Naive Difference Forecasting	4.15266e+08	20378.1	12089.8
47	Python Questions	Naive Difference Forecasting	2.34416e+06	1531.07	1117.71
48	Machine Learning Questions	Naive Difference Forecasting	6678.13	81.7198	51.9462
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398
50	Gold	Autoregression	0.167187	0.408885	0.270593
51	SP500	Autoregression	0.181846	0.426434	0.236095
52	Machine Learning Questions	Autoregression	9493.88	97.4365	80.8513
53	Ticket Sales	Autoregression	12195.6	110.434	78.7188

Observations

tableOfResults[tableOfResults['Data'] == 'Machine Learning Questions'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
7	Machine Learning Questions	Naive Forecasting	3053.46	55.2581	35.4885
40	Machine Learning Questions	Average Difference Forecasting	3399.79	58.3077	38.5563
48	Machine Learning Questions	Naive Difference Forecasting	6678.13	81.7198	51.9462
29	Machine Learning Questions	Average Forecasting	7140.3	84.5003	55.785
52	Machine Learning Questions	Autoregression	9493.88	97.4365	80.8513
18	Machine Learning Questions	Seasonal Forecasting	14325.5	119.689	84.975

tableOfResults[tableOfResults['Data'] == 'Gold'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
50	Gold	Autoregression	0.167187	0.408885	0.270593
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009
35	Gold	Average Difference Forecasting - 60	0.240423	0.49033	0.315994
44	Gold	Naive Difference Forecasting	0.521205	0.721946	0.480641
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358

tableOfResults[tableOfResults['Data'] == 'SP500'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
51	SP500	Autoregression	0.181846	0.426434	0.236095
1	SP500	Naive Forecasting	0.315878	0.56203	0.325863
32	SP500	Average Difference Forecasting - 1440	0.316412	0.562505	0.326336
31	SP500	Average Difference Forecasting - 60	0.322453	0.567849	0.332982
42	SP500	Naive Difference Forecasting	0.553753	0.744146	0.456603
20	SP500	Average Forecasting - 60	8.21104	2.86549	1.84062
9	SP500	Seasonal Forecasting - 60	24.0706	4.90618	3.19858
21	SP500	Average Forecasting - 1440	195.085	13.9673	10.6287
10	SP500	Seasonal Forecasting - 1440	578.648	24.0551	18.6964

tableOfResults[tableOfResults['Data'] == 'Ticket Sales'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
53	Ticket Sales	Autoregression	12195.6	110.434	78.7188
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398

Simulated Trading

Let's try to apply the train_ar and predict from the lectures for the simulated trading. This will be heavily based off what is in the lectures.

def train_ar(values, order):
    target_vector = np.array(values[order:])
    lagged_values = []

    for i in range(len(values) - order):
        lagged_values.append(values[i:i+order])
    desgin_matrix = np.array(lagged_values)
    return np.linalg.inv(desgin_matrix.transpose() @ desgin_matrix) @ desgin_matrix.transpose() @ target_vector

def predict(coeffs, data):
    predicted_values = [math.nan] * len(coeffs)
    for i in range(len(coeffs), len(data)):
        predicted_values.append(np.dot(coeffs, data[i-len(coeffs):i]))
    return np.array(predicted_values)

goldData['Close'].plot()

simTradingGoldData = goldData[goldData.index.dayofweek != 5]['Close'].sort_index()[:108000]
simTradingGoldData = pd.DataFrame(simTradingGoldData)
simTradingGoldData['Previous Close'] = simTradingGoldData['Close'].shift()
simTradingGoldData['Difference'] = simTradingGoldData['Close'] - simTradingGoldData['Close'].shift()

simTradingGoldData

('Unnamed: 0_level_0', 'Datetime')	('Close', 'Unnamed: 1_level_1')	('Previous Close', 'Unnamed: 2_level_1')	('Difference', 'Unnamed: 3_level_1')
2011-10-26 15:50:00	1711.27	nan	nan
2011-10-26 15:51:00	1712.07	1711.27	0.80
2011-10-26 15:52:00	1712.13	1712.07	0.06
2011-10-26 15:53:00	1712.83	1712.13	0.70
2011-10-26 15:54:00	1713.87	1712.83	1.04
...	...	...	...
2012-02-15 01:19:00	1722.71	1723.55	-0.84
2012-02-15 01:20:00	1722.79	1722.71	0.08
2012-02-15 01:21:00	1722.90	1722.79	0.11
2012-02-15 01:22:00	1722.86	1722.90	-0.04
2012-02-15 01:23:00	1722.80	1722.86	-0.06

trainingSimTradingGoldData = simTradingGoldData[:int(len(simTradingGoldData)*0.7)]
trainingSimTradingGoldData

('Unnamed: 0_level_0', 'Datetime')	('Close', 'Unnamed: 1_level_1')	('Previous Close', 'Unnamed: 2_level_1')	('Difference', 'Unnamed: 3_level_1')
2011-10-26 15:50:00	1711.27	nan	nan
2011-10-26 15:51:00	1712.07	1711.27	0.80
2011-10-26 15:52:00	1712.13	1712.07	0.06
2011-10-26 15:53:00	1712.83	1712.13	0.70
2011-10-26 15:54:00	1713.87	1712.83	1.04
...	...	...	...
2012-01-12 17:32:00	1654.07	1653.40	0.67
2012-01-12 17:33:00	1653.85	1654.07	-0.22
2012-01-12 17:34:00	1653.38	1653.85	-0.47
2012-01-12 17:35:00	1653.46	1653.38	0.08
2012-01-12 17:36:00	1653.36	1653.46	-0.10

testingSimTradingGoldData = simTradingGoldData[int(len(simTradingGoldData)*0.7):]
testingSimTradingGoldData

('Unnamed: 0_level_0', 'Datetime')	('Close', 'Unnamed: 1_level_1')	('Previous Close', 'Unnamed: 2_level_1')	('Difference', 'Unnamed: 3_level_1')
2012-01-12 17:37:00	1653.12	1653.36	-0.24
2012-01-12 17:38:00	1653.43	1653.12	0.31
2012-01-12 17:39:00	1653.20	1653.43	-0.23
2012-01-12 17:40:00	1653.31	1653.20	0.11
2012-01-12 17:41:00	1652.87	1653.31	-0.44
...	...	...	...
2012-02-15 01:19:00	1722.71	1723.55	-0.84
2012-02-15 01:20:00	1722.79	1722.71	0.08
2012-02-15 01:21:00	1722.90	1722.79	0.11
2012-02-15 01:22:00	1722.86	1722.90	-0.04
2012-02-15 01:23:00	1722.80	1722.86	-0.06

plot_pacf(trainingSimTradingGoldData['Difference'].dropna())

coeffs = train_ar(trainingSimTradingGoldData['Difference'].dropna(), 2)
coeffs

array([-0.00372669, -0.05243032])

prediction = predict(coeffs, testingSimTradingGoldData['Difference'])
prediction

array([        nan,         nan, -0.01535899, ..., -0.001064  ,
       -0.00606547,  0.00168728])

simTradingGoldDataResult = pd.DataFrame(testingSimTradingGoldData['Close'].rename('observed'))
simTradingGoldDataResult['predicted'] = prediction + testingSimTradingGoldData['Close']
simTradingGoldDataResult = simTradingGoldDataResult.dropna()
simTradingGoldDataResult.plot()

We will be using only 1.001 times larger than the previous. This is because when the value is 1.001, there are only 152 instances of True, and then at 1.002, there is only 8, and 1.003, there is only left with 4. That is why we proceeded to use 1.001

simTradingGoldDataResult['isHolding'] = simTradingGoldDataResult['predicted'] > simTradingGoldDataResult['observed'].shift() * 1.001
simTradingGoldDataResult[simTradingGoldDataResult['isHolding'] == True].count()

observed     152
predicted    152
isHolding    152
dtype: int64

lastPrice = None
portfolioValue = 10000
values = []

for ts, (close, _, isHolding) in simTradingGoldDataResult.iterrows():
    if isHolding:
        portfolioValue *= close / lastPrice
    lastPrice = close
    values.append(portfolioValue)

simTradingGoldDataResult['Equity Value'] = values

offsets = (simTradingGoldDataResult['isHolding'] * (simTradingGoldDataResult['observed'].shift(-1) - simTradingGoldDataResult['observed']))
hasIncreased = ((simTradingGoldDataResult['observed'].shift(-1) > simTradingGoldDataResult['observed']) & simTradingGoldDataResult['isHolding']).astype(int)
hasDecreased = ((simTradingGoldDataResult['observed'].shift(-1) < simTradingGoldDataResult['observed']) & simTradingGoldDataResult['isHolding']).astype(int)

plt.figure()
ax = simTradingGoldDataResult['observed'].plot()
for no, i in enumerate(hasIncreased):
    if i:
        plt.axvline(hasIncreased.index[no], color='g', linewidth=0.3)

for no, i in enumerate(hasDecreased):
    if i:
        plt.axvline(hasIncreased.index[no], color='r', linewidth=0.3)
plt.show()

for no, i in enumerate(hasIncreased):
    if i:
        plt.axvline(hasIncreased.index[no], color='g', linewidth=0.3)

for no, i in enumerate(hasDecreased):
    if i:
        plt.axvline(hasIncreased.index[no], color='r', linewidth=0.3)
simTradingGoldDataResult['Equity Value'].plot()

Monte Carlo Simulations

We will now apply Monte Carlo to 3 datasets and explore. We will be using the gold data (for its real-worldness), machine learning questions (for its multiplicative graph), and ticket sales (for its seasonality). Furthermore, using the result found from PACF above. We can use the similar concept above, were we split the training and testing set.

Gold

def mcRmse(predicted, actual):
    diffs = predicted - actual
    return math.sqrt(np.mean(diffs**2))

goldTrainSeries = goldData.reset_index()['Close'][:traniningStationaryGoldDataLength]
goldTrainSeries

0        1298.50
1        1298.60
2        1298.90
3        1299.09
4        1299.17
          ...
75595    1246.30
75596    1246.60
75597    1246.40
75598    1247.10
75599    1247.20
Name: Close, Length: 75600, dtype: float64

goldModel = AutoReg(goldTrainSeries, 3, trend='n').fit()
goldCoeffs = list(reversed(goldModel.params))
goldCoeffs

[0.032491603321171736, 0.12017858859369826, 0.8473291714654028]

goldStdDev = mcRmse(goldModel.predict(), goldTrainSeries)
goldStdDev

0.36114889408159995

numOfSamplesGold = 1000
goldTestSeries = goldData.reset_index()['Close'][traniningStationaryGoldDataLength:stationaryGoldDataLength]
goldTestSeries = goldTestSeries.reset_index()['Close']
initialGoldValues = np.array(goldTestSeries[:2500]).reshape(1, -1)
initialGoldValues

goldValues = np.broadcast_to(initialGoldValues, (numOfSamplesGold, 2500))
goldValues

goldTrajectoryLength = 1000
np.random.seed(123)

for i in range(goldTrajectoryLength):
    nextPrediction = goldValues[:, -3:] @ goldCoeffs
    nextColumn = nextPrediction + goldStdDev * np.random.randn(numOfSamplesGold)
    goldValues = np.append(goldValues, nextColumn.reshape(numOfSamplesGold, 1), axis = 1)

horizontalLines = [x for x in range(1220, 1281, 10)]

plt.plot(goldValues.transpose())
for line in horizontalLines:
    plt.axhline(y = line, linestyle = 'dashed', linewidth = 2, alpha = 0.5)
plt.plot(goldTestSeries[:3500], linewidth = 1.5, c = 'black')
plt.show()

Now that we have produced a 1000 samples of possible price movements, we can now calculate the difference probabilities of it reaching certain prices and what not.

The black line shown in the plot, is to show the actual gold price move.

finalTimePeriod = nextColumn
finalTimePeriod

array([1248.81453198, 1234.48618073, 1260.43243186, 1249.37824546,
       1246.44648938, 1234.98845302, 1237.40624842, 1263.51763179,
       1249.32173509, 1237.21926607, 1262.02420909, 1241.23580835,
       ...
       1255.65131606, 1269.97946134, 1258.23988999, 1249.41256685])

We can look into what are the probabilities of the price of gold to reach a certain price at time 3500. Using the function from the materials given, we can calculate the probability.

def probBetween (arr, lower, higher):
    count = np.count_nonzero((arr > lower) & (arr < higher))
    return count/arr.size

The probability of the gold of price reaching between 1220 to 1280

probBetween(finalTimePeriod, 1220, 1280)

0.997

The probability of the gold of price reaching between 1215 and 1240

probBetween(finalTimePeriod, 1215, 1240)

0.231

The probability of the gold of price reaching between 1245 and 1255

probBetween(finalTimePeriod, 1245, 1255)

0.36

The probability of the gold of price reaching between 1245 and 1255

probBetween(finalTimePeriod, 1250, 1260)

0.296

The probability of the gold of price reaching between 1265 and 1280

probBetween(finalTimePeriod, 1265, 1280)

0.042

The probability of the gold of price reaching between 1250 and 1253

probBetween(finalTimePeriod, 1250, 1253)

0.101

The further away in the future we try to predict, the harder it is to determine the prices, and because the cone of possibility is larger, many prices that we set out to find, have generally the same probability. The lower it is further away from the first prediction value.

Next, we will look into the probability that the price will go beyond or lower than a specific value. Let's say we would like to find out if there are any higher probability of either direction, and then we will try some other values instead.

# Checking if calling the correct
goldValues[:,2500][:5]

array([1247.77539526, 1248.52765975, 1248.26966692, 1247.62347288,
       1247.95850871])

goldValues[:,2499][:5]

array([1248.2, 1248.2, 1248.2, 1248.2, 1248.2])

period = 2499
lastPrice = goldValues[:,period][0]
lastPrice

1248.2

(lastPrice < goldValues[:,2500:]).sum() / (numOfSamplesGold*goldTrajectoryLength)

0.487236

(lastPrice > goldValues[:,2500:]).sum() / (numOfSamplesGold*goldTrajectoryLength)

0.512764

Based on the monte carlo result, there are 45.23% of values that are larger the last point we were and 54.73% of them are lower. Because of that, let's assume we enter a short position, now lets explore further in time of our result.

Let's say we are now we would like to only hold on to the position only for a shorter term. At period 2750, we would like to exit the position. What are the probability of it to take profit on 1240, or stop loss on 1260, or let's be conservative and also find out the probability for it at 1255.

takeProfit = 1240
np.any(takeProfit > goldValues[:,2500:2750], axis = 1).sum() / 1000

0.088

stopLoss = 1260
np.any(stopLoss < goldValues[:,2500:2750], axis = 1).sum() / 1000

0.008

conservativeStopLoss = 1255
np.any(conservativeStopLoss < goldValues[:,2500:2750], axis = 1).sum() / 1000

0.145

Given that the chances for the price of 1255 is higher than the initial take profit of 1240, we might want to adjust it lower to increase our chances to trigger the take profit.

takeProfit = 1240.5
np.any(takeProfit > goldValues[:,2500:2750], axis = 1).sum() / 1000

0.109

takeProfit = 1241.5
np.any(takeProfit > goldValues[:,2500:2750], axis = 1).sum() / 1000

0.168

By increasing the stop loss by 0.5, we have increased the probability of to 0.144, and be more conservative, at 1241.5, the chances for us to increased to 0.205.

The following is the density plot for the gold values.

xTicks = [x for x in range(1190, 1311, 10)]
goldValuesSumKDE = []
for i in goldValues:
    [goldValuesSumKDE.append(x) for x in i[2500:]]
pd.DataFrame(goldValuesSumKDE).plot.kde()
plt.axvline(lastPrice, linestyle = 'dashed', label = 'Last Price', c = 'black')
plt.xticks(xTicks)
for line in xTicks:
    plt.axvline(line, linestyle = 'dotted', c = 'b', alpha = 0.3)
plt.legend()

The following is the density plot for period from 2500 to 2750, according to the scenario above. Here, we can visualize its probability of each given range of prices, based on the monte carlo results. As we can see that, a small increase from 1240, would significantly increase the probability as shown in the steep increase.

xTicks = [x for x in range(1190, 1311, 10)]
goldValuesSumKDE = []
for i in goldValues:
    [goldValuesSumKDE.append(x) for x in i[2500:2750]]
pd.DataFrame(goldValuesSumKDE).plot.kde()
plt.axvline(lastPrice, linestyle = 'dashed', label = 'Last Price', c = 'black')
plt.xticks(xTicks)
for line in xTicks:
    plt.axvline(line, linestyle = 'dotted', c = 'b', alpha = 0.3)
plt.legend()

Cinema Tickets

Given the time contraints on working on the coursework, we will explore one more datasaet before moving on.

ticketSalesTrainSeries = ticketSales_data['tickets_sold'].reset_index()['tickets_sold'][:trainingStationaryTicketSales]
ticketSalesTrainSeries

0      125.650000
1      178.675325
2      175.461017
3      114.192192
4       94.831956
          ...
156     73.299830
157     71.350000
158    275.887784
159    148.436834
160    115.308869
Name: tickets_sold, Length: 161, dtype: float64

ticketSalesModel = AutoReg(ticketSalesTrainSeries, 7, trend='n').fit()
ticketSalesCoeffs = list(reversed(ticketSalesModel.params))
ticketSalesCoeffs

[0.7426181566741853,
 -0.03903937266016291,
 -0.031596445533128875,
 0.036661173670043956,
 0.10106447839651159,
 0.0957974652301363,
 0.08677635276613055]

ticketsSalesStdDev = mcRmse(ticketSalesModel.predict(), ticketSalesTrainSeries)
ticketsSalesStdDev

62.78175334614299

numOfSamplesTicketSales = 71 - 14

def runMonteCarloTicketSale(multiplier):
    np.random.seed(123)
    ticketSalesTestSeries = ticketSales_data['tickets_sold'].reset_index()['tickets_sold'][trainingStationaryTicketSales:]
    ticketSalesTestSeries = ticketSalesTestSeries.reset_index()['tickets_sold']

    initialTicketSalesValues = np.array(ticketSalesTestSeries[:14]).reshape(1, -1)

    ticketSalesValues = np.broadcast_to(initialTicketSalesValues, (numOfSamplesTicketSales, 14))

    ticketSalesTrajectoryLength = (71 - 14)

    for i in range(ticketSalesTrajectoryLength):
        nextPrediction = ticketSalesValues[:, -7:] @ ticketSalesCoeffs
        nextColumn = nextPrediction + multiplier * np.random.randn(numOfSamplesTicketSales)
        ticketSalesValues = np.append(ticketSalesValues, nextColumn.reshape(numOfSamplesTicketSales, 1), axis = 1)

    horizontalLines = [x for x in range(-300, 501, 100)]

    plt.plot(ticketSalesValues.transpose())
    for line in horizontalLines:
        plt.axhline(y = line, linestyle = 'dashed', linewidth = 2, alpha = 0.5)
    plt.plot(ticketSalesTestSeries[:71], linewidth = 1.5, c = 'black')
    plt.yticks(horizontalLines)
    plt.show()

runMonteCarloTicketSale(ticketsSalesStdDev)

After visualizing the monte carlo simulation, realized that it might not be suitable for ths dataset because, the random nature of the monte carlo will include predictions of negative number, when this dataset will not ever reach so. Then, I thought what if we were to use other values rather than the standard deviations, such as the ones shown below. The above def/function was implemented after the realization.

Using several values and test to see how the results would be

runMonteCarloTicketSale(10)

runMonteCarloTicketSale(30)

runMonteCarloTicketSale(10 + np.random.randint(10, ticketsSalesStdDev))

We even have tried using a varying value with the last one, but they all seem to have predictions being too narrow and might shrink even more if further in time. So next we will try removing 0 and below, but using the abs() function

def runMonteCarloTicketSale2(multiplier):
    np.random.seed(123)
    ticketSalesTestSeries = ticketSales_data['tickets_sold'].reset_index()['tickets_sold'][trainingStationaryTicketSales:]
    ticketSalesTestSeries = ticketSalesTestSeries.reset_index()['tickets_sold']

    initialTicketSalesValues = np.array(ticketSalesTestSeries[:14]).reshape(1, -1)

    ticketSalesValues = np.broadcast_to(initialTicketSalesValues, (numOfSamplesTicketSales, 14))

    ticketSalesTrajectoryLength = (71 - 14)

    for i in range(ticketSalesTrajectoryLength):
        nextPrediction = ticketSalesValues[:, -7:] @ ticketSalesCoeffs
        nextColumn = nextPrediction + abs(multiplier * np.random.randn(numOfSamplesTicketSales))
        ticketSalesValues = np.append(ticketSalesValues, nextColumn.reshape(numOfSamplesTicketSales, 1), axis = 1)

    horizontalLines = [x for x in range(-300, 501, 100)]

    plt.plot(ticketSalesValues.transpose())
    for line in horizontalLines:
        plt.axhline(y = line, linestyle = 'dashed', linewidth = 2, alpha = 0.5)
    plt.plot(ticketSalesTestSeries[:71], linewidth = 1.5, c = 'black')
    plt.yticks(horizontalLines)
    plt.show()

runMonteCarloTicketSale2(ticketsSalesStdDev)

By only having positive numbers, it somehow affected the prediction to only trend upwards, which is also not ideal either.

Regardless of which parameter used, it does seem that the model, given for this dataset, work best only at the first few period. While monte carlo simulation was supposed to be able to solve this issue, but it does not. While it is very useful to be used as a baseline, to understand the possible and very probable outcomes, and to have a probabilistic view of the upcoming time frames, it quickly becomes unreliable too far in the future, similar to many models.

Moving Averages

def maSmoothing(series, n):
    answer = series - series
    for i in range(n):
        answer = answer + series.shift(i)
    return answer/n

def maSmoothingPlot(series, nList):
    temp = pd.Series(series, name='observed')
    temp = pd.DataFrame(temp)
    for i in nList:
        tempCol = 'smoothed ' + str(i)
        temp[tempCol] = maSmoothing(temp['observed'], i)

    return temp

Gold

maSmoothingPlot(goldData['Close'][:3000], [10,100,500]).plot()

FTSE

maSmoothingPlot(ftseData['Close'][:100000], [100,3000,10000]).plot()

S&P500

maSmoothingPlot(sp500Data['Close'][:100000], [500,5000,10000]).plot()

USDCUSDT - Tradecount

maSmoothingPlot(usdcusdtDataTradeCount['tradecount'].sort_index(),[10,50,100]).plot()

Cinema Tickets

maSmoothingPlot(ticketSales_data['tickets_sold'], [3,7,21]).plot()

Machine Learning Questions

maSmoothingPlot(mlStackoverflow_data['machine-learning'], [3,7,20]).plot()

Moving average is a very equal and simple way to get an sense of the trending within the data. The downside of moving averages is that if there is not enough data for a longer time frame period, we need to wait however the selected period is, for the data to finalize. While there are many longer term investing strategies that utilizes the moving average, they are usually implemented for trend finding. Another downside is that, we will need to sacrifice some data, usually in the front of the dataset, to be used for the calculation of the average. The dataset will suffer a little from that chunk of missing data

Exponential Averaging

The following averaging method is interesting because it does not need to wait for a given period. However, we do need set a parameter, which will affect the how smooth the line is. Similarly, we will explore this will all the datasets.

def expSmoothing (series, alpha):
    smoothedValue = series[0]
    results = []
    for value in series:
        smoothedValue = (1 - alpha) * smoothedValue + alpha * value
        results.append(smoothedValue)
    return results

def expSmoothingPlot (series, list):
    temp = pd.Series(series, name='observed')
    temp = pd.DataFrame(temp)
    for n in list:
        tempCol = 'smoothed ' + str(n)
        temp[tempCol] = expSmoothing(temp['observed'], n)
    return temp

Gold

expSmoothingPlot(goldData['Close'][:3000], [0.9, 0.05, 0.01, 0.003]).plot()

FTSE

expSmoothingPlot(ftseData['Close'][:100000], [0.0005, 0.0001]).plot()

S&P500

expSmoothingPlot(sp500Data['Close'][:100000], [0.0024, 0.0005, 0.0001]).plot()

USDCUSDT - Tradecount

expSmoothingPlot(usdcusdtDataTradeCount['tradecount'].sort_index(), [0.1, 0.01, 0.001]).plot()

Cinema Tickets

expSmoothingPlot(ticketSales_data['tickets_sold'], [0.1, 0.05, 0.025, 0.01]).plot()

Machine Learning Questions

expSmoothingPlot(mlStackoverflow_data['machine-learning'], [0.25, 0.1, 0.01]).plot()

Based on the above, we can see that the alpha number, really largely depends on the size of the dataset. 0.1 on one mimics the graph very closely while on the other, it is already near a straight line.

Removing Trend with EMA

The good thing about using EMA, is that the first few values of the dataset will not be excluded, making it easier for us to remove the trend portion of the dataset. Then, it should become stationary. We can also test the stationarity of the data with the kpss and adf test. We wil then explore if it predicts better this way.

We will only do one dataset to explore given the limited time.

First, we will need to see which value of alpha produces the most stationary graph. We will be decide it visually, only then we check with the tests

emaGoldData = expSmoothingPlot(goldData['Close'][:stationaryGoldDataLength], [0.001, 0.005, 0.0002])

emaGoldData.columns

Index(['observed', 'smoothed 0.001', 'smoothed 0.005', 'smoothed 0.0002'], dtype='object')

for i in range(1,4):
    currentCol = emaGoldData.columns[i]
    smoothed = emaGoldData[currentCol]
    plt.plot(emaGoldData['observed'] - smoothed, label = currentCol, alpha=0.7)
plt.legend()

Visually, we can see that the alpha value of 0.005, seem to be a good fit for this case. So we will be using it.

emaGoldData['residual'] = emaGoldData['observed'] - emaGoldData['smoothed 0.005']

kpss_test(emaGoldData['residual'])

KPSS Stat:  0.4907302016778367
p-value:  0.04375445908156831
Number of lags: 192
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is not stationary

adftest(emaGoldData['residual'])

AdfTest Stat:  -16.70183985809975
p-value:  1.4590092692087397e-29
Number of lags: 50
Number of observation: 107949
Critial Values:
   1% : -3.430410579120753
   5% : -2.8615667750440084
   10% : -2.5667842514152195
The series is stationary

Visually, it might look like the data is already stationary, but based on two tests, it is still not the case. Therefore, we can try to proceed with getting the difference.

emaGoldData['residualDiff'] = emaGoldData['residual'] - emaGoldData['residual'].shift()
emaGoldData[['residual', 'residualDiff']].plot()

kpss_test(emaGoldData['residualDiff'][1:])

KPSS Stat:  0.0010710001653877659
p-value:  0.1
Number of lags: 71
Critial Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary

adftest(emaGoldData['residualDiff'][1:])

AdfTest Stat:  -47.34320740713912
p-value:  0.0
Number of lags: 53
Number of observation: 107945
Critial Values:
   1% : -3.4304105813656207
   5% : -2.8615667760361956
   10% : -2.5667842519433277
The series is stationary

Now we will proceed with applying prediction methods.

plot_pacf(emaGoldData['residualDiff'][1:traniningStationaryGoldDataLength])

Seem like only the 2nd period is effective, however, let's just use a bigger scale period.

emaGoldCoeffs = train_ar(emaGoldData['residualDiff'][1:traniningStationaryGoldDataLength],20)
emaGoldCoeffs

array([ 0.00133632, -0.00681158, -0.006989  , -0.00519191, -0.00436599,
       -0.00034325, -0.01157953, -0.00883011, -0.01158689, -0.00957072,
        0.01358625,  0.00297729, -0.00152516, -0.01193093, -0.01983785,
       -0.01051143, -0.02078453, -0.0104172 , -0.03703278, -0.15587259])

emaGoldResidualDiffPredict = predict(emaGoldCoeffs, emaGoldData['residualDiff'][traniningStationaryGoldDataLength:] )
emaGoldResidualDiffPredict = pd.DataFrame(emaGoldResidualDiffPredict)
emaGoldResidualDiffPredict = emaGoldResidualDiffPredict.set_index(pd.Series([x for x in range(traniningStationaryGoldDataLength,stationaryGoldDataLength)]))
emaGoldResidualDiffPredict.plot()

emaGoldResidualPrediction = emaGoldResidualDiffPredict[0] + emaGoldData.reset_index()['residualDiff'][traniningStationaryGoldDataLength:].shift()
emaGoldResidualPrediction.plot()

emaGoldTrendPredict = 2 * emaGoldData['smoothed 0.005'][traniningStationaryGoldDataLength:].shift(1) \
    - emaGoldData['smoothed 0.005'][traniningStationaryGoldDataLength:].shift(2)

emaGoldTrendPredict = pd.DataFrame(emaGoldTrendPredict.reset_index()['smoothed 0.005'].rename('trend prediction'))\
    .set_index(pd.Series([x for x in range(traniningStationaryGoldDataLength, stationaryGoldDataLength)]))

emaGoldTrendPredict.plot()

emaGoldPrediction = emaGoldTrendPredict['trend prediction'] + emaGoldResidualPrediction

emaGoldResult = pd.DataFrame(emaGoldData.reset_index()['observed'][traniningStationaryGoldDataLength:])
emaGoldResult['predicted'] = emaGoldPrediction
emaGoldResult = emaGoldResult.dropna()
emaGoldResult.plot()

evaluateErrors(emaGoldResult)

[8.310943507413699, 2.882870705982788, 1.9945175517858817]

appendingTOR(tableOfResults, 'Gold', 'Manual ARIMA',evaluateErrors(emaGoldResult))
tableOfResults

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
1	SP500	Naive Forecasting	0.315878	0.56203	0.325863
2	FTSE	Naive Forecasting	5.69646	2.38673	1.29783
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
4	USDCUSDT Close Price	Naive Forecasting	4.01536e-06	0.002004	0.001034
5	USDCUSDT Tradecount	Naive Forecasting	1.56376e+08	12505	7519.96
6	Python Questions	Naive Forecasting	1.00863e+06	1004.31	680.298
7	Machine Learning Questions	Naive Forecasting	3053.46	55.2581	35.4885
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
9	SP500	Seasonal Forecasting - 60	24.0706	4.90618	3.19858
10	SP500	Seasonal Forecasting - 1440	578.648	24.0551	18.6964
11	FTSE	Seasonal Forecasting - 60	358.944	18.9458	12.5549
12	FTSE	Seasonal Forecasting - 1440	9305.27	96.4639	71.5358
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358
15	USDCUSDT Close Price	Seasonal Forecasting	1.41478e-05	0.003761	0.00176
16	USDCUSDT Tradecount	Seasonal Forecasting	2.76653e+08	16632.9	10558.5
17	Python Questions	Seasonal Forecasting	4.97505e+06	2230.48	1924.32
18	Machine Learning Questions	Seasonal Forecasting	14325.5	119.689	84.975
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
20	SP500	Average Forecasting - 60	8.21104	2.86549	1.84062
21	SP500	Average Forecasting - 1440	195.085	13.9673	10.6287
22	FTSE	Average Forecasting - 60	122.369	11.0621	7.24807
23	FTSE	Average Forecasting - 1440	3105.82	55.7299	40.9431
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
26	USDCUSDT Close Price	Average Forecasting	6.51406e-06	0.002552	0.001187
27	USDCUSDT Tradecount	Average Forecasting	1.44419e+08	12017.5	7727.27
28	Python Questions	Average Forecasting	2.18019e+06	1476.55	1175.32
29	Machine Learning Questions	Average Forecasting	7140.3	84.5003	55.785
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
31	SP500	Average Difference Forecasting - 60	0.322452	0.567849	0.332982
32	SP500	Average Difference Forecasting - 1440	0.316412	0.562505	0.326336
33	FTSE	Average Difference Forecasting - 60	5.79875	2.40806	1.32178
34	FTSE	Average Difference Forecasting - 1440	5.70723	2.38898	1.30033
35	Gold	Average Difference Forecasting - 60	0.240423	0.49033	0.315994
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009
37	USDCUSDT Close Price	Average Difference Forecasting	4.53094e-06	0.002129	0.001103
38	USDCUSDT Tradecount	Average Difference Forecasting	1.64095e+08	12810	7724.69
39	Python Questions	Average Difference Forecasting	1.1445e+06	1069.81	751.801
40	Machine Learning Questions	Average Difference Forecasting	3399.79	58.3077	38.5563
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
42	SP500	Naive Difference Forecasting	0.553753	0.744146	0.456603
43	FTSE	Naive Difference Forecasting	10.796	3.28573	1.81081
44	Gold	Naive Difference Forecasting	0.521205	0.721946	0.480641
45	USDCUSDT Close Price	Naive Difference Forecasting	9.13676e-06	0.003023	0.001659
46	USDCUSDT Tradecount	Naive Difference Forecasting	4.15266e+08	20378.1	12089.8
47	Python Questions	Naive Difference Forecasting	2.34416e+06	1531.07	1117.71
48	Machine Learning Questions	Naive Difference Forecasting	6678.13	81.7198	51.9462
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398
50	Gold	Autoregression	0.167187	0.408885	0.270593
51	SP500	Autoregression	0.181846	0.426434	0.236095
52	Machine Learning Questions	Autoregression	9493.88	97.4365	80.8513
53	Ticket Sales	Autoregression	12195.6	110.434	78.7188
54	Gold	Manual ARIMA	8.31094	2.88287	1.99452

tableOfResults[tableOfResults['Data'] == 'Gold'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
50	Gold	Autoregression	0.167187	0.408885	0.270593
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009
35	Gold	Average Difference Forecasting - 60	0.240423	0.49033	0.315994
44	Gold	Naive Difference Forecasting	0.521205	0.721946	0.480641
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
54	Gold	Manual ARIMA	8.31094	2.88287	1.99452
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358

As of now, the autoregression still outperforms it. Let's try to see if we use other ARIMA parameters, will it help improve the outcome.

ARIMA

def exploringARIMA(data, ar, i, m):
    model = sm.tsa.arima.ARIMA(data, order = [ar,i,m], trend = 'n').fit()
    results = pd.DataFrame(data.rename('observed'))
    results['predicted'] = model.predict()
    results.plot()
    return results

Ticket Sales

ticketSalesData = ticketSales_data.reset_index()['tickets_sold']
ticketSalesData

0      125.650000
1      178.675325
2      175.461017
3      114.192192
4       94.831956
          ...
227     70.583704
228    113.653521
229    126.824561
230     75.431177
231     49.894737
Name: tickets_sold, Length: 232, dtype: float64

MA 2

ticketSalesMA2Model = exploringARIMA(ticketSalesData, 0, 0, 2)

appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - MA2', evaluateErrors(ticketSalesMA2Model))
tableOfResults

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
1	SP500	Naive Forecasting	0.315878	0.56203	0.325863
2	FTSE	Naive Forecasting	5.69646	2.38673	1.29783
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
4	USDCUSDT Close Price	Naive Forecasting	4.01536e-06	0.002004	0.001034
5	USDCUSDT Tradecount	Naive Forecasting	1.56376e+08	12505	7519.96
6	Python Questions	Naive Forecasting	1.00863e+06	1004.31	680.298
7	Machine Learning Questions	Naive Forecasting	3053.46	55.2581	35.4885
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
9	SP500	Seasonal Forecasting - 60	24.0706	4.90618	3.19858
10	SP500	Seasonal Forecasting - 1440	578.648	24.0551	18.6964
11	FTSE	Seasonal Forecasting - 60	358.944	18.9458	12.5549
12	FTSE	Seasonal Forecasting - 1440	9305.27	96.4639	71.5358
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358
15	USDCUSDT Close Price	Seasonal Forecasting	1.41478e-05	0.003761	0.00176
16	USDCUSDT Tradecount	Seasonal Forecasting	2.76653e+08	16632.9	10558.5
17	Python Questions	Seasonal Forecasting	4.97505e+06	2230.48	1924.32
18	Machine Learning Questions	Seasonal Forecasting	14325.5	119.689	84.975
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
20	SP500	Average Forecasting - 60	8.21104	2.86549	1.84062
21	SP500	Average Forecasting - 1440	195.085	13.9673	10.6287
22	FTSE	Average Forecasting - 60	122.369	11.0621	7.24807
23	FTSE	Average Forecasting - 1440	3105.82	55.7299	40.9431
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
26	USDCUSDT Close Price	Average Forecasting	6.51406e-06	0.002552	0.001187
27	USDCUSDT Tradecount	Average Forecasting	1.44419e+08	12017.5	7727.27
28	Python Questions	Average Forecasting	2.18019e+06	1476.55	1175.32
29	Machine Learning Questions	Average Forecasting	7140.3	84.5003	55.785
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
31	SP500	Average Difference Forecasting - 60	0.322452	0.567849	0.332982
32	SP500	Average Difference Forecasting - 1440	0.316412	0.562505	0.326336
33	FTSE	Average Difference Forecasting - 60	5.79875	2.40806	1.32178
34	FTSE	Average Difference Forecasting - 1440	5.70723	2.38898	1.30033
35	Gold	Average Difference Forecasting - 60	0.240423	0.49033	0.315994
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009
37	USDCUSDT Close Price	Average Difference Forecasting	4.53094e-06	0.002129	0.001103
38	USDCUSDT Tradecount	Average Difference Forecasting	1.64095e+08	12810	7724.69
39	Python Questions	Average Difference Forecasting	1.1445e+06	1069.81	751.801
40	Machine Learning Questions	Average Difference Forecasting	3399.79	58.3077	38.5563
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
42	SP500	Naive Difference Forecasting	0.553753	0.744146	0.456603
43	FTSE	Naive Difference Forecasting	10.796	3.28573	1.81081
44	Gold	Naive Difference Forecasting	0.521205	0.721946	0.480641
45	USDCUSDT Close Price	Naive Difference Forecasting	9.13676e-06	0.003023	0.001659
46	USDCUSDT Tradecount	Naive Difference Forecasting	4.15266e+08	20378.1	12089.8
47	Python Questions	Naive Difference Forecasting	2.34416e+06	1531.07	1117.71
48	Machine Learning Questions	Naive Difference Forecasting	6678.13	81.7198	51.9462
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398
50	Gold	Autoregression	0.167187	0.408885	0.270593
51	SP500	Autoregression	0.181846	0.426434	0.236095
52	Machine Learning Questions	Autoregression	9493.88	97.4365	80.8513
53	Ticket Sales	Autoregression	12195.6	110.434	78.7188
54	Gold	Manual ARIMA	8.31094	2.88287	1.99452
55	Ticket Sales	ARIMA - MA2	14257.6	119.405	93.6058

MA 3

ticketSalesMA3Model = exploringARIMA(ticketSalesData, 0, 0, 3)

appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - MA3', evaluateErrors(ticketSalesMA3Model))
tableOfResults

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
1	SP500	Naive Forecasting	0.315878	0.56203	0.325863
2	FTSE	Naive Forecasting	5.69646	2.38673	1.29783
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
4	USDCUSDT Close Price	Naive Forecasting	4.01536e-06	0.002004	0.001034
5	USDCUSDT Tradecount	Naive Forecasting	1.56376e+08	12505	7519.96
6	Python Questions	Naive Forecasting	1.00863e+06	1004.31	680.298
7	Machine Learning Questions	Naive Forecasting	3053.46	55.2581	35.4885
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
9	SP500	Seasonal Forecasting - 60	24.0706	4.90618	3.19858
10	SP500	Seasonal Forecasting - 1440	578.648	24.0551	18.6964
11	FTSE	Seasonal Forecasting - 60	358.944	18.9458	12.5549
12	FTSE	Seasonal Forecasting - 1440	9305.27	96.4639	71.5358
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358
15	USDCUSDT Close Price	Seasonal Forecasting	1.41478e-05	0.003761	0.00176
16	USDCUSDT Tradecount	Seasonal Forecasting	2.76653e+08	16632.9	10558.5
17	Python Questions	Seasonal Forecasting	4.97505e+06	2230.48	1924.32
18	Machine Learning Questions	Seasonal Forecasting	14325.5	119.689	84.975
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
20	SP500	Average Forecasting - 60	8.21104	2.86549	1.84062
21	SP500	Average Forecasting - 1440	195.085	13.9673	10.6287
22	FTSE	Average Forecasting - 60	122.369	11.0621	7.24807
23	FTSE	Average Forecasting - 1440	3105.82	55.7299	40.9431
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
26	USDCUSDT Close Price	Average Forecasting	6.51406e-06	0.002552	0.001187
27	USDCUSDT Tradecount	Average Forecasting	1.44419e+08	12017.5	7727.27
28	Python Questions	Average Forecasting	2.18019e+06	1476.55	1175.32
29	Machine Learning Questions	Average Forecasting	7140.3	84.5003	55.785
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
31	SP500	Average Difference Forecasting - 60	0.322452	0.567849	0.332982
32	SP500	Average Difference Forecasting - 1440	0.316412	0.562505	0.326336
33	FTSE	Average Difference Forecasting - 60	5.79875	2.40806	1.32178
34	FTSE	Average Difference Forecasting - 1440	5.70723	2.38898	1.30033
35	Gold	Average Difference Forecasting - 60	0.240423	0.49033	0.315994
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009
37	USDCUSDT Close Price	Average Difference Forecasting	4.53094e-06	0.002129	0.001103
38	USDCUSDT Tradecount	Average Difference Forecasting	1.64095e+08	12810	7724.69
39	Python Questions	Average Difference Forecasting	1.1445e+06	1069.81	751.801
40	Machine Learning Questions	Average Difference Forecasting	3399.79	58.3077	38.5563
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
42	SP500	Naive Difference Forecasting	0.553753	0.744146	0.456603
43	FTSE	Naive Difference Forecasting	10.796	3.28573	1.81081
44	Gold	Naive Difference Forecasting	0.521205	0.721946	0.480641
45	USDCUSDT Close Price	Naive Difference Forecasting	9.13676e-06	0.003023	0.001659
46	USDCUSDT Tradecount	Naive Difference Forecasting	4.15266e+08	20378.1	12089.8
47	Python Questions	Naive Difference Forecasting	2.34416e+06	1531.07	1117.71
48	Machine Learning Questions	Naive Difference Forecasting	6678.13	81.7198	51.9462
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398
50	Gold	Autoregression	0.167187	0.408885	0.270593
51	SP500	Autoregression	0.181846	0.426434	0.236095
52	Machine Learning Questions	Autoregression	9493.88	97.4365	80.8513
53	Ticket Sales	Autoregression	12195.6	110.434	78.7188
54	Gold	Manual ARIMA	8.31094	2.88287	1.99452
55	Ticket Sales	ARIMA - MA2	14257.6	119.405	93.6058
56	Ticket Sales	ARIMA - MA3	9808.53	99.038	63.8928

MA 8 - 10

for i in range(8,11):
    model = exploringARIMA(ticketSalesData, 0, 0, i)
    appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - MA' + str(i), evaluateErrors(model))

tableOfResults[tableOfResults['Data']=='Ticket Sales'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
59	Ticket Sales	ARIMA - MA10	5362.12	73.2265	49.4972
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
58	Ticket Sales	ARIMA - MA9	6219.51	78.8639	55.3295
57	Ticket Sales	ARIMA - MA8	6453.79	80.3355	59.9724
56	Ticket Sales	ARIMA - MA3	9808.53	99.038	63.8928
53	Ticket Sales	Autoregression	12195.6	110.434	78.7188
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
55	Ticket Sales	ARIMA - MA2	14257.6	119.405	93.6058
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398

Visually we do not really see much improvement increasing the value of MA, however the results says otherwise. However it is still not the best. Let's try a higher number once again.

MA 20

ticketSalesMA20Model = exploringARIMA(ticketSalesData, 0, 0, 20)
appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - MA20', evaluateErrors(ticketSalesMA20Model))
tableOfResults[tableOfResults['Data']=='Ticket Sales'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
60	Ticket Sales	ARIMA - MA20	3774.48	61.4368	42.521
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
59	Ticket Sales	ARIMA - MA10	5362.12	73.2265	49.4972
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
58	Ticket Sales	ARIMA - MA9	6219.51	78.8639	55.3295
57	Ticket Sales	ARIMA - MA8	6453.79	80.3355	59.9724
56	Ticket Sales	ARIMA - MA3	9808.53	99.038	63.8928
53	Ticket Sales	Autoregression	12195.6	110.434	78.7188
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
55	Ticket Sales	ARIMA - MA2	14257.6	119.405	93.6058
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398

As of the current observation, based on the ticket sales dataset, we can see that increase the MA lowers the errors.

Let's try only using I

for i in [1,2,5,10  ]:
    model = exploringARIMA(ticketSalesData, 0, i, 0)
    appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - I' + str(i), evaluateErrors(model))

tableOfResults[tableOfResults['Data']=='Ticket Sales'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
60	Ticket Sales	ARIMA - MA20	3774.48	61.4368	42.521
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
59	Ticket Sales	ARIMA - MA10	5362.11	73.2265	49.4972
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
58	Ticket Sales	ARIMA - MA9	6219.51	78.8639	55.3295
57	Ticket Sales	ARIMA - MA8	6453.8	80.3355	59.9724
56	Ticket Sales	ARIMA - MA3	9808.53	99.038	63.8928
53	Ticket Sales	Autoregression	12195.6	110.434	78.7188
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
61	Ticket Sales	ARIMA - I1	12490.3	111.76	75.0726
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
55	Ticket Sales	ARIMA - MA2	14257.6	119.405	93.6058
62	Ticket Sales	ARIMA - I2	37338.9	193.233	135.811
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398
63	Ticket Sales	ARIMA - I5	1.52389e+06	1234.46	926.465
64	Ticket Sales	ARIMA - I10	1.08863e+09	32994.4	26180.1

As observed, I will need to be used sparingly as increase in I will quickly increase the predicted values.

I1 AM10

ticketSalesI1MA10Model = exploringARIMA(ticketSalesData, 0, 1, 10)
appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - I1 MA10', evaluateErrors(ticketSalesI1MA10Model))
tableOfResults[tableOfResults['Data']=='Ticket Sales'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
60	Ticket Sales	ARIMA - MA20	3774.48	61.4368	42.521
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
65	Ticket Sales	ARIMA - I1 MA10	4073.38	63.823	47.404
59	Ticket Sales	ARIMA - MA10	5362.11	73.2265	49.4972
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
58	Ticket Sales	ARIMA - MA9	6219.51	78.8639	55.3295
57	Ticket Sales	ARIMA - MA8	6453.8	80.3355	59.9724
56	Ticket Sales	ARIMA - MA3	9808.53	99.038	63.8928
53	Ticket Sales	Autoregression	12195.6	110.434	78.7188
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
61	Ticket Sales	ARIMA - I1	12490.3	111.76	75.0726
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
55	Ticket Sales	ARIMA - MA2	14257.6	119.405	93.6058
62	Ticket Sales	ARIMA - I2	37338.9	193.233	135.811
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398
63	Ticket Sales	ARIMA - I5	1.52389e+06	1234.46	926.465
64	Ticket Sales	ARIMA - I10	1.08863e+09	32994.4	26180.1

ticketSalesI1MA5Model = exploringARIMA(ticketSalesData, 0, 1, 5)
appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - I1 MA5', evaluateErrors(ticketSalesI1MA5Model))
tableOfResults[tableOfResults['Data']=='Ticket Sales'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
60	Ticket Sales	ARIMA - MA20	3774.48	61.4368	42.521
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
65	Ticket Sales	ARIMA - I1 MA10	4073.38	63.823	47.404
59	Ticket Sales	ARIMA - MA10	5362.11	73.2265	49.4972
66	Ticket Sales	ARIMA - I1 MA5	6030.57	77.6567	61.9688
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
58	Ticket Sales	ARIMA - MA9	6219.51	78.8639	55.3295
57	Ticket Sales	ARIMA - MA8	6453.8	80.3355	59.9724
56	Ticket Sales	ARIMA - MA3	9808.53	99.038	63.8928
53	Ticket Sales	Autoregression	12195.6	110.434	78.7188
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
61	Ticket Sales	ARIMA - I1	12490.3	111.76	75.0726
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
55	Ticket Sales	ARIMA - MA2	14257.6	119.405	93.6058
62	Ticket Sales	ARIMA - I2	37338.9	193.233	135.811
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398
63	Ticket Sales	ARIMA - I5	1.52389e+06	1234.46	926.465
64	Ticket Sales	ARIMA - I10	1.08863e+09	32994.4	26180.1

ticketSalesI1MA5Model = exploringARIMA(ticketSalesData, 0, 1, 2)
appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - I1 MA2', evaluateErrors(ticketSalesI1MA5Model))
tableOfResults[tableOfResults['Data']=='Ticket Sales'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
60	Ticket Sales	ARIMA - MA20	3774.48	61.4368	42.521
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
65	Ticket Sales	ARIMA - I1 MA10	4073.38	63.823	47.404
59	Ticket Sales	ARIMA - MA10	5362.11	73.2265	49.4972
66	Ticket Sales	ARIMA - I1 MA5	6030.57	77.6567	61.9688
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
58	Ticket Sales	ARIMA - MA9	6219.51	78.8639	55.3295
57	Ticket Sales	ARIMA - MA8	6453.8	80.3355	59.9724
67	Ticket Sales	ARIMA - I1 MA2	6726.3	82.014	62.7677
56	Ticket Sales	ARIMA - MA3	9808.53	99.038	63.8928
53	Ticket Sales	Autoregression	12195.6	110.434	78.7188
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
61	Ticket Sales	ARIMA - I1	12490.3	111.76	75.0726
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
55	Ticket Sales	ARIMA - MA2	14257.6	119.405	93.6058
62	Ticket Sales	ARIMA - I2	37338.9	193.233	135.811
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398
63	Ticket Sales	ARIMA - I5	1.52389e+06	1234.46	926.465
64	Ticket Sales	ARIMA - I10	1.08863e+09	32994.4	26180.1

ticketSalesI2MA10Model = exploringARIMA(ticketSalesData, 0, 2, 10)
appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - I2 MA10', evaluateErrors(ticketSalesI2MA10Model))
tableOfResults[tableOfResults['Data']=='Ticket Sales'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
60	Ticket Sales	ARIMA - MA20	3774.48	61.4368	42.521
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
65	Ticket Sales	ARIMA - I1 MA10	4073.38	63.823	47.404
68	Ticket Sales	ARIMA - I2 MA10	5306.96	72.8489	54.751
59	Ticket Sales	ARIMA - MA10	5362.11	73.2265	49.4972
66	Ticket Sales	ARIMA - I1 MA5	6030.57	77.6567	61.9688
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
58	Ticket Sales	ARIMA - MA9	6219.51	78.8639	55.3295
57	Ticket Sales	ARIMA - MA8	6453.8	80.3355	59.9724
67	Ticket Sales	ARIMA - I1 MA2	6726.3	82.014	62.7677
56	Ticket Sales	ARIMA - MA3	9808.53	99.038	63.8928
53	Ticket Sales	Autoregression	12195.6	110.434	78.7188
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
61	Ticket Sales	ARIMA - I1	12490.3	111.76	75.0726
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
55	Ticket Sales	ARIMA - MA2	14257.6	119.405	93.6058
62	Ticket Sales	ARIMA - I2	37338.9	193.233	135.811
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398
63	Ticket Sales	ARIMA - I5	1.52389e+06	1234.46	926.465
64	Ticket Sales	ARIMA - I10	1.08863e+09	32994.4	26180.1

As of the current observations, MA20 is still the highest at the moment. However, using higher value MA with I value of 1, is performing well

Now let's explore ARIMA.

A7 I1 MA2,3,5,10

for i in [2,3,5,10]:
    model = exploringARIMA(ticketSalesData, 7, 1, i)
    appendingTOR(tableOfResults, 'Ticket Sales', 'ARIMA - A7 I1 MA' + str(i), evaluateErrors(model))

tableOfResults[tableOfResults['Data']=='Ticket Sales'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
72	Ticket Sales	ARIMA - A7 I1 MA10	2805.43	52.9663	34.2589
71	Ticket Sales	ARIMA - A7 I1 MA5	2835.07	53.2454	33.3404
70	Ticket Sales	ARIMA - A7 I1 MA3	2875.15	53.6205	33.8151
69	Ticket Sales	ARIMA - A7 I1 MA2	2900.55	53.8568	33.7684
60	Ticket Sales	ARIMA - MA20	3774.48	61.4368	42.521
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
65	Ticket Sales	ARIMA - I1 MA10	4073.38	63.823	47.404
68	Ticket Sales	ARIMA - I2 MA10	5306.96	72.8489	54.751
59	Ticket Sales	ARIMA - MA10	5362.11	73.2265	49.4972
66	Ticket Sales	ARIMA - I1 MA5	6030.57	77.6567	61.9688
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
58	Ticket Sales	ARIMA - MA9	6219.51	78.8639	55.3295
57	Ticket Sales	ARIMA - MA8	6453.8	80.3355	59.9724
67	Ticket Sales	ARIMA - I1 MA2	6726.3	82.014	62.7677
56	Ticket Sales	ARIMA - MA3	9808.53	99.038	63.8928
53	Ticket Sales	Autoregression	12195.6	110.434	78.7188
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
61	Ticket Sales	ARIMA - I1	12490.3	111.76	75.0726
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
55	Ticket Sales	ARIMA - MA2	14257.6	119.405	93.6058
62	Ticket Sales	ARIMA - I2	37338.9	193.233	135.811
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398
63	Ticket Sales	ARIMA - I5	1.52389e+06	1234.46	926.465
64	Ticket Sales	ARIMA - I10	1.08863e+09	32994.4	26180.1

When using Autoregression, Intergrated and Moving Average, all together, it performed the best, according the RMSE, MSE and MAE. We now will try a few iterations and parameters with the gold data.

Gold Data

arimaGoldData = goldData.reset_index()['Close'][:30000]
arimaGoldData.plot()

MA Test

for i in [1,2,5,10]:
    model = exploringARIMA(arimaGoldData, 0,0,i)
    appendingTOR(tableOfResults, 'Gold', 'ARIMA - MA' + str(i), evaluateErrors(model))

tableOfResults[tableOfResults['Data'] == 'Gold'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
50	Gold	Autoregression	0.167187	0.408885	0.270593
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009
35	Gold	Average Difference Forecasting - 60	0.240423	0.49033	0.315994
44	Gold	Naive Difference Forecasting	0.521205	0.721946	0.480641
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
54	Gold	Manual ARIMA	8.31094	2.88287	1.99452
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358
75	Gold	ARIMA - MA5	155277	394.052	316.199
74	Gold	ARIMA - MA2	194643	441.184	368.211
73	Gold	ARIMA - MA1	439874	663.23	662.926
76	Gold	ARIMA - MA10	1.51433e+20	1.23058e+10	1.21355e+10

As shown above, using only the MA alone does not seem to be performing good, and actually the worst ones when compared to the other models here, especially when there is no pre-processing done. By having I, the differencing performed, the performance should improve

I Test

for i in [1,2,5,10]:
    model = exploringARIMA(arimaGoldData, 0,i,0)
    appendingTOR(tableOfResults, 'Gold', 'ARIMA - I' + str(i), evaluateErrors(model))

tableOfResults[tableOfResults['Data'] == 'Gold'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
50	Gold	Autoregression	0.167187	0.408885	0.270593
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009
35	Gold	Average Difference Forecasting - 60	0.240423	0.49033	0.315994
44	Gold	Naive Difference Forecasting	0.521205	0.721946	0.480641
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
54	Gold	Manual ARIMA	8.31094	2.88287	1.99452
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
77	Gold	ARIMA - I1	56.3362	7.50575	0.295501
78	Gold	ARIMA - I2	70.5507	8.39945	0.459694
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358
79	Gold	ARIMA - I5	350.547	18.7229	2.61938
80	Gold	ARIMA - I10	12166.7	110.303	66.0953
75	Gold	ARIMA - MA5	155277	394.052	316.199
74	Gold	ARIMA - MA2	194643	441.184	368.211
73	Gold	ARIMA - MA1	439874	663.23	662.926
76	Gold	ARIMA - MA10	1.51433e+20	1.23058e+10	1.21355e+10

The I of 1, is the best performing one so far.

Let's try having the I set to 1, and we will see if MA 1 2 or 5 will perform better.

I1 MA Test

for i in [1,2,5]:
    model = exploringARIMA(arimaGoldData, 0,1,i)
    appendingTOR(tableOfResults, 'Gold', 'ARIMA - I1 MA' + str(i), evaluateErrors(model))

tableOfResults[tableOfResults['Data'] == 'Gold'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
50	Gold	Autoregression	0.167187	0.408885	0.270593
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009
35	Gold	Average Difference Forecasting - 60	0.240423	0.49033	0.315994
44	Gold	Naive Difference Forecasting	0.521205	0.721946	0.480641
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
54	Gold	Manual ARIMA	8.31094	2.88287	1.99452
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
83	Gold	ARIMA - I1 MA5	56.3336	7.50557	0.29298
82	Gold	ARIMA - I1 MA2	56.3336	7.50557	0.292994
81	Gold	ARIMA - I1 MA1	56.3337	7.50558	0.293011
77	Gold	ARIMA - I1	56.3362	7.50575	0.295501
78	Gold	ARIMA - I2	70.5507	8.39945	0.459694
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358
79	Gold	ARIMA - I5	350.547	18.7229	2.61938
80	Gold	ARIMA - I10	12166.7	110.303	66.0953
75	Gold	ARIMA - MA5	155277	394.052	316.199
74	Gold	ARIMA - MA2	194643	441.184	368.211
73	Gold	ARIMA - MA1	439874	663.23	662.926
76	Gold	ARIMA - MA10	1.51433e+20	1.23058e+10	1.21355e+10

It seems that MA 5 is the better performing one by only a very small margin. Next we explore implementing the autoregression

AR I1 MA5 Test

plot_pacf(arimaGoldData)

Based on the PACF test, we can see that up to an AR 4 should be good. However, we can also increase it to see how it performs

for i in [4, 8, 15, 20]:
    model = exploringARIMA(arimaGoldData, i,1,5)
    appendingTOR(tableOfResults, 'Gold', 'ARIMA - A' + str(i) + ' I1 MA5', evaluateErrors(model))

tableOfResults[tableOfResults['Data'] == 'Gold'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
50	Gold	Autoregression	0.167187	0.408885	0.270593
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009
35	Gold	Average Difference Forecasting - 60	0.240423	0.49033	0.315994
44	Gold	Naive Difference Forecasting	0.521205	0.721946	0.480641
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
54	Gold	Manual ARIMA	8.31094	2.88287	1.99452
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
87	Gold	ARIMA - A20 I1 MA5	56.3334	7.50556	0.293046
86	Gold	ARIMA - A15 I1 MA5	56.3334	7.50556	0.293048
85	Gold	ARIMA - A8 I1 MA5	56.3334	7.50556	0.293065
84	Gold	ARIMA - A4 I1 MA5	56.3335	7.50556	0.293029
83	Gold	ARIMA - I1 MA5	56.3336	7.50557	0.29298
82	Gold	ARIMA - I1 MA2	56.3336	7.50557	0.292994
81	Gold	ARIMA - I1 MA1	56.3337	7.50558	0.293011
77	Gold	ARIMA - I1	56.3362	7.50575	0.295501
78	Gold	ARIMA - I2	70.5507	8.39945	0.459694
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358
79	Gold	ARIMA - I5	350.547	18.7229	2.61938
80	Gold	ARIMA - I10	12166.7	110.303	66.0953
75	Gold	ARIMA - MA5	155277	394.052	316.199
74	Gold	ARIMA - MA2	194643	441.184	368.211
73	Gold	ARIMA - MA1	439874	663.23	662.926
76	Gold	ARIMA - MA10	1.51433e+20	1.23058e+10	1.21355e+10

While it is the best performing ones so far, it still does beat the earlier simple algorithms that were used.

Autoreg only

autoregGold4 = exploringARIMA(arimaGoldData, 4, 0, 0)
autoregGold4[1:].plot()

appendingTOR(tableOfResults, 'Gold', 'ARIMA - A4', evaluateErrors(autoregGold4))
tableOfResults[tableOfResults['Data'] == 'Gold'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
50	Gold	Autoregression	0.167187	0.408885	0.270593
3	Gold	Naive Forecasting	0.237289	0.487123	0.31245
36	Gold	Average Difference Forecasting - 1440	0.237595	0.487437	0.313009
35	Gold	Average Difference Forecasting - 60	0.240423	0.49033	0.315994
44	Gold	Naive Difference Forecasting	0.521205	0.721946	0.480641
24	Gold	Average Forecasting - 60	3.79681	1.94854	1.22559
54	Gold	Manual ARIMA	8.31094	2.88287	1.99452
13	Gold	Seasonal Forecasting - 60	11.0817	3.32892	2.08104
87	Gold	ARIMA - A20 I1 MA5	56.3334	7.50556	0.293046
86	Gold	ARIMA - A15 I1 MA5	56.3334	7.50556	0.293048
85	Gold	ARIMA - A8 I1 MA5	56.3334	7.50556	0.293065
84	Gold	ARIMA - A4 I1 MA5	56.3335	7.50556	0.293029
83	Gold	ARIMA - I1 MA5	56.3336	7.50557	0.29298
82	Gold	ARIMA - I1 MA2	56.3336	7.50557	0.292994
81	Gold	ARIMA - I1 MA1	56.3337	7.50558	0.293011
77	Gold	ARIMA - I1	56.3362	7.50575	0.295501
88	Gold	ARIMA - A4	56.3762	7.50841	0.334332
78	Gold	ARIMA - I2	70.5507	8.39945	0.459694
25	Gold	Average Forecasting - 1440	96.5074	9.82382	6.8452
14	Gold	Seasonal Forecasting - 1440	297.39	17.245	12.2358
79	Gold	ARIMA - I5	350.547	18.7229	2.61938
80	Gold	ARIMA - I10	12166.7	110.303	66.0953
75	Gold	ARIMA - MA5	155277	394.052	316.199
74	Gold	ARIMA - MA2	194643	441.184	368.211
73	Gold	ARIMA - MA1	439874	663.23	662.926
76	Gold	ARIMA - MA10	1.51433e+20	1.23058e+10	1.21355e+10

We would think that the autoregression 4 model would perform good, however, it did not and it is the middle of the pack

tableOfResults[tableOfResults['Data'] == 'Ticket Sales'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
72	Ticket Sales	ARIMA - A7 I1 MA10	2805.43	52.9663	34.2589
71	Ticket Sales	ARIMA - A7 I1 MA5	2835.07	53.2454	33.3404
70	Ticket Sales	ARIMA - A7 I1 MA3	2875.15	53.6205	33.8151
69	Ticket Sales	ARIMA - A7 I1 MA2	2900.55	53.8568	33.7684
60	Ticket Sales	ARIMA - MA20	3774.48	61.4368	42.521
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
65	Ticket Sales	ARIMA - I1 MA10	4073.38	63.823	47.404
68	Ticket Sales	ARIMA - I2 MA10	5306.96	72.8489	54.751
59	Ticket Sales	ARIMA - MA10	5362.11	73.2265	49.4972
66	Ticket Sales	ARIMA - I1 MA5	6030.57	77.6567	61.9688
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
58	Ticket Sales	ARIMA - MA9	6219.51	78.8639	55.3295
57	Ticket Sales	ARIMA - MA8	6453.8	80.3355	59.9724
67	Ticket Sales	ARIMA - I1 MA2	6726.3	82.014	62.7677
56	Ticket Sales	ARIMA - MA3	9808.53	99.038	63.8928
53	Ticket Sales	Autoregression	12195.6	110.434	78.7188
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
61	Ticket Sales	ARIMA - I1	12490.3	111.76	75.0726
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
55	Ticket Sales	ARIMA - MA2	14257.6	119.405	93.6058
62	Ticket Sales	ARIMA - I2	37338.9	193.233	135.811
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398
63	Ticket Sales	ARIMA - I5	1.52389e+06	1234.46	926.465
64	Ticket Sales	ARIMA - I10	1.08863e+09	32994.4	26180.1

Seasonal ARIMA

We will again trying SARIMA with and the tickets sales

ticketSalesData.plot()
sm.tsa.arima.ARIMA(ticketSalesData, order=(5,1,5), seasonal_order=(7,1,5,7), trend = 'n').fit().predict().plot()

As shown, we can add more parameters given to the function. We can add a seasonal component to in, as shown in the seasonal_order. the first three supposed to be representing the same 3 values as the ones in order, and then the last value supposed to be representing the seasonality expectation. If we expect the seasons to be weekly, we can use 7, assume each row is one day, like this case. Seeing this site explains that the order was for trend, and seasonal order is to address the seasonality.

However, there are just too many parameters. Therefore, we will just be running loops and appending the results into the table. But we will be keeping the order parameters

def exploringSARIMA(data, p, q, r, m):
    model = sm.tsa.arima.ARIMA(ticketSalesData, order=(5,1,5), seasonal_order=(p,q,r,m), trend = 'n').fit()
    results = pd.DataFrame(data.rename('observed'))
    results['predicted'] = model.predict()
    results.plot()
    return results

for p in range(5, 10):
        for r in range(3,7):
                model = exploringSARIMA(ticketSalesData, p, 1, r, 7)
                appendingTOR(tableOfResults, 'Ticket Sales', 'SARIMA - p' +
                             str(p) + ' q' + str(1) + ' r' + str(r) + ' m' + str(7),
                             evaluateErrors(model)
                             )

tableOfResults[tableOfResults['Data'] == 'Ticket Sales'].sort_values('MSE')

Unnamed: 0	Data	Algorithm	MSE	RMSE	MAE
72	Ticket Sales	ARIMA - A7 I1 MA10	2805.43	52.9663	34.2589
71	Ticket Sales	ARIMA - A7 I1 MA5	2835.07	53.2454	33.3404
70	Ticket Sales	ARIMA - A7 I1 MA3	2875.15	53.6205	33.8151
108	Ticket Sales	SARIMA - p9 q1 r6 m7	2882.86	53.6923	33.7857
105	Ticket Sales	SARIMA - p9 q1 r3 m7	2884.03	53.7032	34.02
106	Ticket Sales	SARIMA - p9 q1 r4 m7	2890.3	53.7616	34.1163
107	Ticket Sales	SARIMA - p9 q1 r5 m7	2895.94	53.814	33.693
103	Ticket Sales	SARIMA - p8 q1 r5 m7	2900.32	53.8546	33.5066
69	Ticket Sales	ARIMA - A7 I1 MA2	2900.55	53.8568	33.7684
102	Ticket Sales	SARIMA - p8 q1 r4 m7	2907.21	53.9185	34.1786
100	Ticket Sales	SARIMA - p7 q1 r6 m7	2920.49	54.0415	33.954
101	Ticket Sales	SARIMA - p8 q1 r3 m7	2935.23	54.1777	34.1753
97	Ticket Sales	SARIMA - p7 q1 r3 m7	2940.85	54.2296	33.917
104	Ticket Sales	SARIMA - p8 q1 r6 m7	2941.32	54.234	34.4356
93	Ticket Sales	SARIMA - p6 q1 r3 m7	2947.68	54.2926	33.6726
96	Ticket Sales	SARIMA - p6 q1 r6 m7	2948.95	54.3043	33.8336
98	Ticket Sales	SARIMA - p7 q1 r4 m7	2952.75	54.3392	33.7096
99	Ticket Sales	SARIMA - p7 q1 r5 m7	2953.18	54.3432	34.2208
95	Ticket Sales	SARIMA - p6 q1 r5 m7	2958.34	54.3906	34.2589
90	Ticket Sales	SARIMA - p5 q1 r4 m7	2959.82	54.4042	33.2821
92	Ticket Sales	SARIMA - p5 q1 r6 m7	2967.44	54.4742	34.2919
94	Ticket Sales	SARIMA - p6 q1 r4 m7	2969.72	54.4952	33.8973
89	Ticket Sales	SARIMA - p5 q1 r3 m7	2972.4	54.5197	33.3068
91	Ticket Sales	SARIMA - p5 q1 r5 m7	2974.57	54.5396	33.2173
60	Ticket Sales	ARIMA - MA20	3774.48	61.4368	42.521
19	Ticket Sales	Seasonal Forecasting	3922.4	62.6291	37.0187
65	Ticket Sales	ARIMA - I1 MA10	4073.38	63.823	47.404
68	Ticket Sales	ARIMA - I2 MA10	5306.96	72.8489	54.751
59	Ticket Sales	ARIMA - MA10	5362.11	73.2265	49.4972
66	Ticket Sales	ARIMA - I1 MA5	6030.57	77.6567	61.9688
30	Ticket Sales	Average Forecasting	6061.2	77.8537	59.7301
58	Ticket Sales	ARIMA - MA9	6219.51	78.8639	55.3295
57	Ticket Sales	ARIMA - MA8	6453.8	80.3355	59.9724
67	Ticket Sales	ARIMA - I1 MA2	6726.3	82.014	62.7677
56	Ticket Sales	ARIMA - MA3	9808.53	99.038	63.8928
53	Ticket Sales	Autoregression	12195.6	110.434	78.7188
8	Ticket Sales	Naive Forecasting	12476	111.696	74.8537
61	Ticket Sales	ARIMA - I1	12490.3	111.76	75.0726
41	Ticket Sales	Average Difference Forecasting	12936.3	113.738	76.6102
55	Ticket Sales	ARIMA - MA2	14257.6	119.405	93.6058
62	Ticket Sales	ARIMA - I2	37338.9	193.233	135.811
49	Ticket Sales	Naive Difference Forecasting	37594.4	193.893	136.398
63	Ticket Sales	ARIMA - I5	1.52389e+06	1234.46	926.465
64	Ticket Sales	ARIMA - I10	1.08863e+09	32994.4	26180.1

The final result, we can see that the ARIMA still performs best and that all the SARIMA are not far behind. They are all very small margins apart only.

Back to Projects

Abstract
- Methodology
- Key Findings
Library Initialization
Datasets
Multiplicative or Additive - Visual Identification
Stationarity
Seasonality
Forecasting
Autoregression Prediction
Simulated Trading
Monte Carlo Simulations
- Gold
- Cinema Tickets
Moving Averages
Exponential Averaging
Removing Trend with EMA
ARIMA
Seasonal ARIMA

Multi-Dataset Time Series Modeling - Stationarity, Seasonality, and Forecasting Performance

Table of Contents