Understanding How Regression Algorithms Work along with Evaluation Metrics
- Published
- Gaudhiwaa Hendrasto
Regression is a machine learning task that finds the relationship or pattern between independent variables (variables assumed to influence or cause changes in other variables) and dependent variables (variables whose values depend on the independent variables) in order to predict continuous values (values over an unbounded range). Machine learning tasks are generally grouped into three categories: regression, classification, and clustering. The table below summarizes them:
Concept | Definition | Algorithms | Applications | Evaluation |
---|---|---|---|---|
Regression | Finding the relationship or pattern between independent variables and dependent variables to predict continuous values. | Linear Regression, Ridge Regression, Lasso Regression, Polynomial Regression, Support Vector Regression (SVR), Prophet, Long Short Term Memory (LSTM), Gated Recurrent Unit (GRU) | Predicting house prices from features such as number of rooms and land area. Predicting CO2 emissions from engine type and car model. | Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), R-squared (R2). |
Classification | Sorting or grouping data into specific categories or classes based on certain features. | Decision Trees, Random Forest, Support Vector Machines (SVM), K-Nearest Neighbors (KNN) | Identifying gender from features such as height and weight. Classifying animal images based on body parts. | Accuracy, Precision, Recall, F1-Score, Area Under the ROC Curve (AUC-ROC). |
Clustering | Grouping data into clusters whose members are internally similar. | K-Means Clustering, Hierarchical Clustering (Agglomerative), DBSCAN, Gaussian Mixture Model (GMM) | Customer segmentation based on purchasing behavior. Text clustering by topic. | Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index, Adjusted Rand Index (ARI), Normalized Mutual Information (NMI). |
In this blog, we will delve into how several machine learning regression algorithms work, namely Linear Regression (and its variants), Support Vector Regression (SVR), Prophet, Long Short Term Memory (LSTM), Bidirectional Long Short Term Memory (BiLSTM), and Gated Recurrent Unit (GRU). We will also discuss evaluation metrics for regression models: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and R-squared (R2).
Table of Contents
- Regression Algorithms
- 1. Linear Regression
- 1.1 Multiple Linear Regression
- 1.2 Polynomial Linear Regression
- 1.3 Logistic Regression
- 2. Support Vector Regression (SVR)
- 3. Prophet
- 4. Long Short Term Memory (LSTM)
- 5. Bidirectional LSTM (BiLSTM)
- 6. Gated Recurrent Unit (GRU)
- Evaluation Metrics
- 1. Mean Absolute Error (MAE)
- 2. Mean Squared Error (MSE)
- 3. Root Mean Squared Error (RMSE)
- 4. Mean Absolute Percentage Error (MAPE)
- 5. R-squared (R2)
Regression Algorithms
Some commonly used algorithms for regression tasks include:
1. Linear Regression
Mathematically, linear regression seeks the best-fitting line that represents the relationship between the independent variable (usually denoted X) and the dependent variable (usually denoted Y). In simple linear regression, the equation of the regression line can be written as:

Y = θ₁ + θ₂X + ε

How Linear Regression Works (GeeksforGeeks, 2024)

Here Y is the dependent variable, X is the independent variable, θ₁ is the intercept, θ₂ is the regression coefficient measuring the slope of the regression line, and ε is the random error.
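As a quick illustration, here is a minimal sketch of fitting a simple linear regression with scikit-learn; the data is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: one independent variable X and a noisy dependent variable y
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # estimated intercept (theta_1) and slope (theta_2)
print(model.predict([[6.0]]))         # predicted Y for a new X
```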
Some other commonly used variants of linear regression are:
1.1 Multiple Linear Regression
Multiple linear regression involves two or more independent variables used to predict one dependent variable. The equation of multiple linear regression is:

Y = θ₁ + θ₂X₁ + θ₃X₂ + ⋯ + θₙ₊₁Xₙ + ε

How Multiple Linear Regression Works (Kargin, 2021)
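The same scikit-learn estimator handles the multivariate case. This sketch uses two hypothetical features (land area and number of rooms) to predict a house price; all numbers are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [land area in m^2, number of rooms]; targets are illustrative prices
X = np.array([[120, 3], [150, 4], [90, 2], [200, 5]])
y = np.array([300_000, 400_000, 220_000, 520_000])

model = LinearRegression().fit(X, y)
print(model.predict([[130, 3]]))  # price estimate for a new house
```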
1.2 Polynomial Linear Regression

How Polynomial Linear Regression Works (Herlambang, 2018)
Polynomial regression captures a non-linear relationship between the independent and dependent variables. The model is expanded by including higher-degree terms of the independent variable. An example polynomial regression equation is:

Y = θ₁ + θ₂X + θ₃X² + ε

where the X² term indicates that the relationship between X and Y is not a straight line.
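In scikit-learn, this amounts to expanding the features before fitting an ordinary linear model, as in this minimal sketch on synthetic quadratic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data following a quadratic curve
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 1.0 + 2.0 * X.ravel() + 0.5 * X.ravel() ** 2

# PolynomialFeatures adds the X^2 column; LinearRegression fits the coefficients
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[4.0]]))
```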
1.3 Logistic Regression

How Logistic Regression Works (saedsayad.com)
Logistic regression is a regression technique used when the dependent variable is binary or categorical, such as yes/no, 0/1, or specific classes. It is one of the most commonly used methods in classification modeling. Its formula is:

p = 1 / (1 + e^(−(θ₁ + θ₂X)))

This exponential (sigmoid) function ensures that the probabilities generated by the model always lie between 0 and 1.
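A minimal scikit-learn sketch, using toy height/weight features and a binary label (mirroring the gender example from the table above); the data is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: [height in cm, weight in kg] with binary labels (0/1)
X = np.array([[160, 55], [175, 80], [158, 50], [182, 90]])
y = np.array([0, 1, 0, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[170, 70]]))  # probabilities for class 0 and class 1
```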
2. Support Vector Regression (SVR)
Support Vector Regression (SVR) is a machine learning algorithm used in regression tasks that focuses on finding a hyperplane that predicts the target value as closely as possible. The image below illustrates how the SVR algorithm works.

SVR Working Mechanism (Alakh Sethi, 2024)
The decision boundaries lie at a distance a on either side of the hyperplane and can be called +a (upper decision boundary) and −a (lower decision boundary). Assuming the equation of the hyperplane is:

y = wx + b

where w is the weight vector determining the direction of the hyperplane, x is the input vector, and b is the bias, the decision boundary equations become:

wx + b = +a
wx + b = −a

Thus, every hyperplane that satisfies SVR must meet:

−a ≤ y − (wx + b) ≤ +a

It can be concluded that the SVR model seeks to meet this condition, keeping points close to the hyperplane within the decision boundaries (Trivusi, 2022). On the other hand, there are two methods that can be used in the SVR algorithm, namely linear and non-linear methods.
In the linear method, the formula used is:

y = Σᵢ (aᵢ − aᵢ*)(xᵢ · x) + b

where y is the output of the model (the prediction we want to produce), aᵢ and aᵢ* are the coefficients attached to each support vector point, xᵢ is the input feature vector of that support vector, x is the new input, and b is the bias determining the position of the hyperplane with respect to the zero point on the feature axis.
While in the non-linear method, the formula used is:

y = Σᵢ (aᵢ − aᵢ*)(φ(xᵢ) · φ(x)) + b

with y as the output of the model, aᵢ and aᵢ* as the coefficients attached to each transformed support vector point φ(xᵢ), φ(x) as the transformation function (feature mapping) that converts the input features x into a different feature space (for example, one with higher dimensions), and b as the bias determining the position of the hyperplane with respect to the zero point on the feature axis. In practice, the inner product φ(xᵢ) · φ(x) is computed via a kernel function K(xᵢ, x) rather than by mapping the features explicitly.
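A minimal scikit-learn sketch of both methods on synthetic data; the epsilon parameter plays the role of the ±a tube around the hyperplane, and all values are illustrative:

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic non-linear data
X = np.linspace(0, 6, 50).reshape(-1, 1)
y = np.sin(X).ravel()

linear_svr = SVR(kernel="linear", epsilon=0.1).fit(X, y)  # linear method
rbf_svr = SVR(kernel="rbf", epsilon=0.1).fit(X, y)        # non-linear (kernel) method

print(linear_svr.predict([[3.0]]), rbf_svr.predict([[3.0]]))
```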
3. Prophet
Prophet is a time series forecasting algorithm designed to address complex forecasting challenges. It was developed at Facebook to model daily, weekly, and yearly seasonal trends while taking holiday effects into account. In general, the equation used in the Prophet algorithm is:

y(t) = g(t) + s(t) + h(t) + εₜ
Here g(t) is the trend function representing non-periodic changes in the time series, s(t) represents periodic changes (such as daily, weekly, and yearly seasonality), and h(t) reflects the influence of holidays. The error term εₜ represents idiosyncratic changes (factors that are specific or unique to an event or data point) not accommodated by the model (Taylor and Letham, 2018). To project the non-periodic trend g(t), Prophet uses the logistic growth model:

g(t) = C(t) / (1 + exp(−(k + a(t)ᵀδ)(t − (m + a(t)ᵀγ))))

This model describes the growth of a phenomenon over time. C(t) is the carrying capacity, which may vary over time, k is the base growth rate, a(t) is a vector of indicators that switches the rate adjustments on and off over time, t is a specific point in time, and δ, m, and γ are the adjustment parameters (rate adjustments, offset, and offset corrections) that shape the growth curve.
In the stock or business domain, data often exhibits recurring seasonal behavior every week or year. For the periodic changes s(t) covering weekly or yearly seasonal trends, a Fourier series is used:

s(t) = Σₙ₌₁ᴺ (aₙ cos(2πnt / P) + bₙ sin(2πnt / P))

where P is the regular period of the time series (for example, P = 365.25 for yearly seasonality or P = 7 for weekly seasonality, in days).
The influence of holidays h(t) on a phenomenon is modeled with a matrix of holiday indicator regressors Z(t):

h(t) = Z(t)κ

with κ ~ Normal(0, ν²) (Taylor and Letham, 2018).
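A minimal sketch with the prophet package; the synthetic series, the capacity of 100, and the 30-day horizon are all illustrative choices:

```python
import numpy as np
import pandas as pd
from prophet import Prophet

# Synthetic daily series with a weekly pattern
ds = pd.date_range("2022-01-01", periods=365, freq="D")
y = 50 + 10 * np.sin(2 * np.pi * np.arange(365) / 7) + np.random.randn(365)
df = pd.DataFrame({"ds": ds, "y": y})
df["cap"] = 100  # capacity C(t), required when using logistic growth

m = Prophet(growth="logistic", weekly_seasonality=True)
m.fit(df)

future = m.make_future_dataframe(periods=30)
future["cap"] = 100
forecast = m.predict(future)
print(forecast[["ds", "trend", "yhat"]].tail())
```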
4. Long Short Term Memory (LSTM)
Long Short Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) that modifies the basic RNN architecture by adding memory cells (Masnawi, 2018). In an RNN, the output of the previous step is fed back as input to the current step. However, RNNs have difficulty retaining information stored in long-term memory, for example when predicting words that depend on distant context (Trivusi, 2022). The LSTM architecture is represented below.

LSTM Architecture (Syed & Ahmed, 2023)
An LSTM cell contains a forget gate, an input gate, and an output gate. First, the forget gate discards information that is no longer needed in the cell by evaluating the current input x[t] and the previous output s[t−1]. Next, the input gate stores useful information into the cell state, using a sigmoid function to filter which values are kept. Finally, the output gate extracts useful information from the current cell state and presents it as the output value (Trivusi, 2022).
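A minimal Keras sketch of an LSTM regressor on a toy sliding-window series; the window length of 10 and the layer size are arbitrary choices:

```python
import numpy as np
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

# Toy series: predict the next value from the previous 10
series = np.sin(np.linspace(0, 20, 200))
X = np.array([series[i:i + 10] for i in range(190)]).reshape(-1, 10, 1)
y = series[10:]

model = Sequential([
    LSTM(32, input_shape=(10, 1)),  # cell with forget, input, and output gates
    Dense(1),                       # single continuous output for regression
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[-1:]))        # prediction for the most recent window
```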
5. Bidirectional LSTM (BiLSTM)

BiLSTM Architecture (Cornegruta et al., 2016)
On the other hand, Bidirectional LSTM (BiLSTM) is an extension of the LSTM model in which two LSTM layers are applied to the input data (Siami-Namini et al., 2019). The architecture of the BiLSTM model can be seen in the figure above. The two layers are a forward layer and a backward layer, which process the information from past to future and from future to past respectively, depending on the specific task or model requirements.
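In Keras the model becomes bidirectional simply by wrapping the LSTM layer; this sketch can be trained on the same windowed data as the LSTM example above:

```python
from tensorflow.keras.layers import LSTM, Bidirectional, Dense
from tensorflow.keras.models import Sequential

model = Sequential([
    # Wraps the LSTM with a forward and a backward pass over each window
    Bidirectional(LSTM(32), input_shape=(10, 1)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```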
6. Gated Recurrent Unit (GRU)

GRU Architecture (Xin Wang et al., 2019)
The GRU neural network is a variation of the LSTM model that simplifies the LSTM structure by merging its three gates into two. The two GRU gates are an update gate and a reset gate (Xin Wang et al., 2019); the figure above shows the architecture.
The GRU model consists of an input layer, an output layer, and a hidden layer, where the hidden layer is made up of GRU neurons. The GRU is formulated as follows:

r_t = σ(W_r · [h_(t−1), x_t])
z_t = σ(W_z · [h_(t−1), x_t])
n_t = tanh(W · [r_t * h_(t−1), x_t])
h_t = (1 − z_t) * h_(t−1) + z_t * n_t
y_t = σ(W_o · h_t)

Here σ denotes the sigmoid function, r_t is the output of the reset gate at time t, W_r is the weight matrix for the reset gate, [h_(t−1), x_t] is the concatenation of the previous hidden state h_(t−1) and the input at time t, z_t is the output of the update gate, W_z is the weight matrix for the update gate, n_t is the new memory content at time t, W is its weight matrix, [r_t * h_(t−1), x_t] is the concatenation of the element-wise product of the reset gate and the previous hidden state with the input at time t, h_t is the updated hidden state at time t, y_t is the output at time t, and W_o is the weight matrix for the output.
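To make the equations concrete, here is a minimal NumPy sketch of a single GRU step; the bias terms are omitted, the weights are random, and the dimensions are arbitrary:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_r, W_z, W, W_o):
    """One GRU step following the equations above (no bias terms)."""
    concat = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ concat)                             # reset gate
    z_t = sigmoid(W_z @ concat)                             # update gate
    n_t = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # new memory content
    h_t = (1 - z_t) * h_prev + z_t * n_t                    # updated hidden state
    y_t = sigmoid(W_o @ h_t)                                # output
    return h_t, y_t

# Hidden size 4, input size 3, output size 1; all weights random for illustration
rng = np.random.default_rng(0)
W_r, W_z, W = (rng.normal(size=(4, 7)) for _ in range(3))
W_o = rng.normal(size=(1, 4))
h, y = gru_step(rng.normal(size=3), np.zeros(4), W_r, W_z, W, W_o)
print(h, y)
```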
Evaluation Metrics
Several evaluation metrics for regression tasks are described below; a worked example computing all five appears at the end of this section.
Note: the variable n is the number of samples in the data, yᵢ represents the actual value, and ŷᵢ represents the predicted value.
1. Mean Absolute Error (MAE)
MAE is calculated as the average absolute difference between the predicted values and the observed historical data (IBM, 2023). MAE provides information about the magnitude of prediction errors without considering their direction. The calculation used for MAE is as follows:

MAE = (1/n) Σᵢ₌₁ⁿ |yᵢ − ŷᵢ|
MAE is less sensitive to outliers than Mean Squared Error (MSE) or Root Mean Squared Error (RMSE). If our dataset is vulnerable to significant outliers, using MAE will provide a more stable evaluation.
Additionally, MAE provides results that are easier to interpret, as the results are directly in the same unit as the predicted variable. This makes it easier to understand for stakeholders who do not have a strong background in mathematics or statistics.
2. Mean Squared Error (MSE)
MSE is the average of the squared differences between the predicted values and the actual values. It provides information about the magnitude of prediction errors while giving more weight to larger errors. The calculation for MSE is as follows:

MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
If we want to emphasize larger errors in predictions, MSE is a better choice. Because MSE gives more weight to larger errors (in squared form), this means the model will be more sensitive to predictions far from the actual value.
3. Root Mean Squared Error (RMSE)
RMSE is the square root of MSE. It provides a more intuitive measure of error because it is expressed in the same unit as the predicted variable. The calculation for RMSE is as follows:

RMSE = √[(1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²]
What does it mean to be more intuitive? For example, if we are predicting house prices in dollars, then the MSE value will be measured in squared dollars (dollars^2). This may be difficult to understand intuitively because the unit is no longer in dollars, but in dollars squared. However, RMSE takes the square root of MSE, returning to the original unit, which is dollars. Therefore, RMSE provides error values that are easier to understand because they are measured in the same unit as the predicted variable, in this example, dollars.
So when we say that RMSE provides more intuitive error values, it means that the error values are easier to understand or interpret in the context of the problem being studied.
However, on the other hand, RMSE is very sensitive to outliers because the errors are squared before the square root is taken. If there are significant outliers in the data, RMSE can become very large, even if the majority of the model's predictions are quite good. In this case, alternative metrics such as Mean Absolute Error (MAE) may be more suitable because MAE is not as affected by outliers.
4. Mean Absolute Percentage Error (MAPE)
MAPE is the average absolute percentage difference between the predicted values and the observed data (IBM, 2023). MAPE is useful for understanding the size of errors relative to the actual magnitudes. The formula for calculating MAPE is as follows:

MAPE = (100% / n) Σᵢ₌₁ⁿ |(yᵢ − ŷᵢ) / yᵢ|
In a stock price prediction case I have worked on, stock prices could vary from a few dollars to hundreds or even thousands of dollars. Because Root Mean Squared Error (RMSE) measures errors in the same unit as the predicted variable, the RMSE value can be unintuitive to interpret when dealing with data that spans such a wide range of values.
In such cases, it is more advisable to use other more appropriate evaluation metrics, such as MAPE. MAPE measures relative errors as a percentage of the actual value, thus providing a better understanding of how accurate our predictions are relative to the different ranges of stock values.
5. R-squared (R2)
R-squared is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is computed as:

R² = 1 − Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / Σᵢ₌₁ⁿ (yᵢ − ȳ)²

where ȳ is the mean of the actual values. The value of R² typically ranges from 0 to 1, where 1 indicates a perfect fit of the model to the data (it can even be negative when a model fits worse than simply predicting the mean).
R2 can be used as one of the criteria to validate regression models. The higher the R2 value, the better the model fits the data and the more valid the model is.
In its use, R2 should be used in conjunction with other evaluation metrics such as Mean Squared Error (MSE) or Mean Absolute Error (MAE) to gain a more complete understanding of the performance of regression models.
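As promised above, here is a minimal sketch computing all five metrics with scikit-learn; the numbers are purely illustrative:

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

y_true = np.array([100.0, 150.0, 200.0, 250.0])  # actual values
y_pred = np.array([110.0, 145.0, 190.0, 260.0])  # model predictions

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                                          # back to original units
mape = mean_absolute_percentage_error(y_true, y_pred) * 100  # as a percentage
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f} MAPE={mape:.2f}% R2={r2:.3f}")
```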