Capstone Project: Investment and Trading

Project Overview:

This project explores the idea of using AI to predict stock trends over different time frames. It combines methods to download and explore data for different tickers in order to get an idea of how a stock price will perform in the future. For that, it uses multiple ML models to conduct the predictions.
The program consists of multiple Python classes and a Jupyter notebook, which is divided into three main sections:

  • Data Preprocessing
    (Downloads the historical stock data and prepares it for analysis)
  • Analysis Datasets
    (Provides statistical figures and visualizations to explore the dataset)
  • Forecasting
    (Performs the predictions via the AI models)

Data Preprocessing:

To fetch data, the program uses the yfinance API, which queries stock information from Yahoo Finance. To start with, the user adds ticker symbols into the form field and specifies the time range for the historical data load.

Figure 1: Set parameters
Figure 2: Sample dataframe after data preparation
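
As a minimal sketch of this data load (the ticker symbols and the date range are illustrative values, and it is assumed that the closing price serves as the price column used later on):

import yfinance as yf

tickers = ["DAI.DE", "NVDA"]                      # illustrative ticker symbols
data = yf.download(tickers, start="2018-01-01", end="2022-12-31")

prices = data["Close"]                            # closing price per ticker
print(prices.tail())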

Data Exploration & Visualization:

To gain some insight into the given datasets, the section “Analysis Datasets” provides global statistics such as the mean daily return, the cumulative return (which describes the total return of a stock since the beginning of the record), the standard deviation, and the Sharpe ratio.
A lower standard deviation indicates that the stock carries less risk of high variability, in other words, lower volatility.
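
A minimal sketch of how these statistics can be computed with pandas (the function and label names are illustrative; a risk-free rate of zero and 252 trading days per year are assumed for the Sharpe ratio):

import numpy as np
import pandas as pd

def global_statistics(prices: pd.Series, risk_free_rate: float = 0.0) -> pd.Series:
    daily_returns = prices.pct_change().dropna()
    cumulative_return = prices.iloc[-1] / prices.iloc[0] - 1          # total return since record start
    sharpe_ratio = np.sqrt(252) * (daily_returns.mean() - risk_free_rate) / daily_returns.std()
    return pd.Series({
        "mean daily return": daily_returns.mean(),
        "cumulative return": cumulative_return,
        "std daily return": daily_returns.std(),
        "sharpe ratio": sharpe_ratio,
    })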

Figure 3: Global statistics
Figure 4: Stock trend of Daimler in total
Figure 5: Stock trend of Daimler — drilled
Figure 6: Distribution of daily returns
Figure 7: Price trend of Nvidia
Figure 8: Correlation matrix of features and price

Further data preparation:

Before I can start with the prediction of the stock price, the dataset needs to be further prepared.
Currently there is no y_train/y_test target on which a supervised model can be trained.
To create one, I take the “price” column and shift it by the number of days I want to forecast.
For instance, if I want to predict the price in five days, I make a copy of the price column and shift it by five, as shown in the figure below. Afterwards I adjust the index column; otherwise, the date index would be misleading.
Now I am able to perform the split into training and test datasets. In this project, the ratio between training and test is 75% to 25%.

Figure 9: Price shifting
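
A minimal sketch of this preparation step, assuming a DataFrame df with a “price” column and a date index (the function name and the way the trailing rows without a future price are dropped are illustrative):

import pandas as pd

def make_supervised(df: pd.DataFrame, forecast_days: int = 5, train_ratio: float = 0.75):
    data = df.copy()
    data["target"] = data["price"].shift(-forecast_days)   # price forecast_days ahead
    data = data.dropna(subset=["target"])                  # last rows have no future price

    split = int(len(data) * train_ratio)                   # 75% training, 25% test
    x_train, y_train = data[["price"]].iloc[:split], data["target"].iloc[:split]
    x_test, y_test = data[["price"]].iloc[split:], data["target"].iloc[split:]
    return x_train, x_test, y_train, y_test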

Metrics:

To benchmark the results of every ML algorithm, it is necessary to define some statistical criteria. For this project I chose three performance indicators, which I want to describe a bit more closely in this section.

Mean square error (MSE):
The first indicator I want to talk about is the mean square error (MSE). It measures the average squared difference between the predicted and the actual value. For example, if we imagine a two-dimensional plane with two dots, where the first dot is the expected value and the second dot is the predicted value, the MSE describes how closely the predicted dot ranges around the expected dot. The smaller the value, the better my predictions are.
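
A short sketch of how the indicators reported later on can be computed with scikit-learn, given the true values y_test and the model predictions y_pred:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse = mean_squared_error(y_test, y_pred)        # average squared error
rmse = np.sqrt(mse)                             # error in the unit of the price
mae = mean_absolute_error(y_test, y_pred)       # average absolute error
r2 = r2_score(y_test, y_pred)                   # share of variance explained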

Algorithm Techniques and Evaluation:

For the experiment, I decided to use three different machine learning (ML) algorithms, which all belong to the class of supervised learning:

  • Multi-layer Perceptron
  • LSTM
  • Linear Regression
Figure 10: Neural network
The Multi-layer Perceptron is set up via scikit-learn's MLPRegressor with the following final parameters:

from sklearn.neural_network import MLPRegressor

# final parameter setting of the MLP
max_iter = 1000
hls = 100
MLP = MLPRegressor(random_state=0, max_iter=max_iter, hidden_layer_sizes=(hls,),
                   activation='identity',
                   learning_rate='adaptive').fit(x_train_scaled, y_train)
The LSTM network is built with Keras. To make the snippet complete, the imports, the output layer, and the compile step are shown as well (a single target value trained with MSE loss is assumed):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

model = Sequential()
# add first LSTM layer
model.add(LSTM(units=100, return_sequences=True, input_shape=(x_train_data.shape[1], 1)))
model.add(Dropout(0.2))
# add second LSTM layer
model.add(LSTM(units=100, return_sequences=True))
model.add(Dropout(0.2))
# add third LSTM layer
model.add(LSTM(units=100, return_sequences=False))
model.add(Dropout(0.2))
# output layer and training configuration (assumed: one target value, MSE loss)
model.add(Dense(units=1))
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(x_train_data, y_train_data, batch_size=4, epochs=6, verbose=0)

Refinement:

In the section above, I have already provided the final parameter setting of each ML model. However, in this section I want to show how difficult it is to choose the right parameters with respect to performance. As an example I take the LSTM, since with this algorithm the parameters seem to have the biggest influence.
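
The comparison below was produced by re-training the LSTM with different settings. A minimal sketch of such a loop is shown here; build_lstm and scaler are hypothetical helpers (building the Keras model for a given number of units and undoing the price scaling) that stand in for the corresponding notebook code:

from sklearn.metrics import mean_squared_error, r2_score

# candidate settings, taken from the comparison below
settings = [(50, 20, 4), (50, 10, 2), (50, 1, 4), (100, 4, 6), (100, 4, 8), (100, 4, 10)]

for units, batch_size, epochs in settings:
    model = build_lstm(units=units, input_shape=(x_train_data.shape[1], 1))   # hypothetical helper
    model.fit(x_train_data, y_train_data, batch_size=batch_size, epochs=epochs, verbose=0)
    y_pred = scaler.inverse_transform(model.predict(x_test_data))             # undo the price scaling
    print(units, batch_size, epochs,
          "R2:", r2_score(y_test, y_pred),
          "MSE:", mean_squared_error(y_test, y_pred))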

The table below summarizes the tested LSTM settings and the resulting metrics on the test set:

units  batch  epochs  R2      MSE      RMSE    MAE     Test loss  Accuracy
50     20     4       0.8343  24.1887  4.9182  3.6720  0.00451    0.45%
50     10     2       0.8530  21.4643  4.6330  3.1555  0.00401    0.40%
50     1      4       0.9394  8.8533   2.9754  2.2794  0.00165    0.17%
100    8      8       0.9373  9.1503   3.0250  2.3500  0.00171    0.17%
100    2      4       0.9441  8.1586   2.8563  2.2534  0.00152    0.15%
100    4      6       0.9519  7.0229   2.6501  2.0527  0.00131    0.13%
100    4      8       0.9573  6.2317   2.4963  1.9182  0.00116    0.12%
100    4      10      0.9551  6.5538   2.5600  1.9878  0.00122    0.12%

Benchmark & Results:

Now let’s take a look at the results after running the last section of the notebook. For the first observation, we look at the Daimler stock on a seven-day forecast.

7 days out:
------------
Linear Regression
Linear Regression R2: 0.9599340051810165
Linear Regression MSE: 5.849530208769271
Linear Regression RMSE: 2.418580205155345
Linear Regression MAE: 1.8381742201585387
Accuracy: 0.026061776061776062
Multi-layer Perceptron
Multi-layer Perceptron R2: 0.9597540943486109
Multi-layer Perceptron MSE: 5.875796718656164
Multi-layer Perceptron RMSE: 2.4240042736464313
Multi-layer Perceptron MAE: 1.842234816874342
Accuracy: 0.019305019305019305
LSTM
LSTM R2: 0.9518996750746772
LSTM MSE: 7.022521341938096
LSTM RMSE: 2.6500040267777134
LSTM MAE: 2.050466546018151
Test loss: 0.0013106020633131266
Test accuracy: 0.0013106020633131266
Accuracy: 0.13%
Figure 11: Prediction Daimler stock
Figure 12: Prediction Daimler stock with linear regression
Figure 13: Prediction Daimler stock with MLP
Figure 14: Prediction Daimler stock with LSTM
Figure 15: Scatter plot linear regression
Figure 16: Scatter plot LSTM
14 days out:
------------
Linear Regression
Linear Regression R2: 0.9210534027157851
Linear Regression MSE: 11.205277229153344
Linear Regression RMSE: 3.34742845019178
Linear Regression MAE: 2.480697728061077
Accuracy: 0.02131782945736434
Multi-layer Perceptron
Multi-layer Perceptron R2: 0.9208279486779142
Multi-layer Perceptron MSE: 11.237277025011299
Multi-layer Perceptron RMSE: 3.3522048005769722
Multi-layer Perceptron MAE: 2.4830014230203714
Accuracy: 0.025193798449612403
LSTM
LSTM R2: 0.8293106867363559
LSTM MSE: 24.22677025948721
LSTM RMSE: 4.9220697129853015
LSTM MAE: 3.9718806714789814
Test loss: 0.004521401599049568
Test accuracy: 0.004521401599049568
Accuracy: 0.45%
30 days out:
------------
Linear Regression
Linear Regression R2: 0.8023495414619239
Linear Regression MSE: 26.06617012652934
Linear Regression RMSE: 5.105503905250621
Linear Regression MAE: 3.840367674747195
Accuracy: 0.014634146341463415
Multi-layer Perceptron
Multi-layer Perceptron R2: 0.8056262562367411
Multi-layer Perceptron MSE: 25.634036523560532
Multi-layer Perceptron RMSE: 5.06300666833064
Multi-layer Perceptron MAE: 3.8282281762928236
Accuracy: 0.007804878048780488
LSTM
LSTM R2: 0.794942949055266
LSTM MSE: 27.042952569422663
LSTM RMSE: 5.200283893156475
LSTM MAE: 3.942713516533084
Test loss: 0.005046984646469355
Test accuracy: 0.005046984646469355
Accuracy: 0.50%

For the second observation, we look at the Nvidia stock.

7 days out:
------------
Linear Regression
Linear Regression R2: 0.9840980476812651
Linear Regression MSE: 477.8649860102321
Linear Regression RMSE: 21.860123192933568
Linear Regression MAE: 15.751104514724204
Accuracy: 0.0019305019305019305

Multi-layer Perceptron
Multi-layer Perceptron R2: 0.982952112618063
Multi-layer Perceptron MSE: 512.3011503232476
Multi-layer Perceptron RMSE: 22.634070564599014
Multi-layer Perceptron MAE: 16.427553295187025
Accuracy: 0.0019305019305019305
#units=100
#batch=100
#epochs=10
LSTM
LSTM R2: 0.9526400502028002
LSTM MSE: 1423.200201689741
LSTM RMSE: 37.72532573338156
LSTM MAE: 28.309211559516577
Test loss: 0.0021751392632722855
Test accuracy: 0.0021751392632722855
Accuracy: 0.22%
#units=100
#batch=4
#epochs=6
LSTM R2: 0.7715236662166166
LSTM MSE: 6865.876457096025
LSTM RMSE: 82.86058446026087
LSTM MAE: 59.134434449773956
Test loss: 0.010493420995771885
Test accuracy: 0.010493420995771885
Accuracy: 1.05%
30 days out:
------------
Linear Regression
Linear Regression R2: 0.9260988131636365
Linear Regression MSE: 1967.4077523867877
Linear Regression RMSE: 44.355470377246455
Linear Regression MAE: 32.34013531018811
Accuracy: 0.002926829268292683

Multi-layer Perceptron
Multi-layer Perceptron R2: 0.9314364892804224
Multi-layer Perceptron MSE: 1825.3073907897913
Multi-layer Perceptron RMSE: 42.72361631217319
Multi-layer Perceptron MAE: 31.672436833157796
Accuracy: 0.001951219512195122
#units=100
#batch=100
#epochs=10
LSTM
LSTM R2: 0.8819063513993067
LSTM MSE: 3143.9056625586086
LSTM RMSE: 56.070541842919695
LSTM MAE: 41.608872088176454
Test loss: 0.006525109056383371
Test accuracy: 0.006525109056383371
Accuracy: 0.65%
#units=100
#batch=4
#epochs=6
LSTM R2: 0.6263458145324079
LSTM MSE: 9947.47408899509
LSTM RMSE: 99.73702466484094
LSTM MAE: 72.85733414812786
Test loss: 0.020645776763558388
Test accuracy: 0.020645776763558388
Accuracy: 2.06%
Figure 17: Nvidia prediction — total
Figure 18: Nvidia prediction — drilled

Justification:

All three models seem to perform very effectively when it comes to stock prediction. On the given Daimler stock, all algorithms achieve a very good R2 of ~95% and an average RMSE of ~2.5. Given that I have only used technical indicators as features, the setup of the algorithms seems quite good.
The fact that the LSTM suddenly performs differently when the dataset changes also indicates that the model settings depend on the given dataset. However, we cannot foresee how a stock will suddenly behave (for example the rise of the Nvidia stock). Because of this, and the difficulty of setting up a well-performing LSTM, I would rather tend to use LR or MLP because of their robustness on average.
An additional important factor is time. The training and the prediction with LSTM took at least four times longer than with LR or MLP. With respect to the results, it is hard to justify why LSTM should be the better algorithm.

Reflection:

With this project, I have conducted predictions of two different stocks using three different AI models over certain time windows. Observing the results, it seems that LR and MLP were always a bit better than LSTM. If I also take the time for training and prediction into account, I would tend to use LR or MLP.
By using two different datasets, I can also conclude that the results depend on the volatility of the stock trend. When it comes to LSTM, the volatility of a stock also seems to matter for the setup of the model parameters.
To predict a stock price, I have only used the price trend of past days and some statistical figures. Given that, all results look good.

Figure 19: Actual Price and predicted price

Improvement:

Although the performance of each model seemed to be quite impressive, I have not yet faced a real prediction. Thus, a new set of features is necessary. The yfinance API could help here a lot, since it provides much more market data/metadata and information about the companies that can be used to define new features. By using neural networks like MLP and LSTM, I see a lot of potential to improve the results by changing the model parameters.
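
As a sketch of where such features could come from (the selected fields are examples and can differ between tickers):

import yfinance as yf

ticker = yf.Ticker("NVDA")
info = ticker.info                      # company metadata, e.g. sector, margins, valuation ratios
print(info.get("sector"), info.get("trailingPE"))

financials = ticker.financials          # income-statement figures per reporting period
print(financials.head())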