Deciding which stocks to buy is a complex problem. Investors often rely on fundamental and technical analysis to select stocks, but these methods are time-consuming and inefficient. Advances in Artificial Intelligence, particularly Natural Language Processing (NLP) and forecasting techniques, could offer prospective investors an alternative. This research focuses on predicting stock prices using historical price data and news from 2018 to 2023.

In the preprocessing phase, news data will undergo removal of irrelevant topics (using BERTopic), case folding, and punctuation removal, while historical data will be normalized. News sentiment will be predicted using machine learning (SVM). The combined historical prices and news sentiment will then be modeled using machine learning (SVR and Prophet), deep learning (LSTM, BiLSTM, GRU, and LSTM+GRU), and ensemble learning (Random Forest and Bagging). After the modeling phase, the stock price prediction models will be evaluated using MAE (Mean Absolute Error) and MAPE (Mean Absolute Percentage Error).

Table of Contents

Background
Problem Formulation
Problem Limitations
Dataset
Methodology
Preprocessing of News Data
Prediction of News Sentiment
Preprocessing and EDA of Historical Stock Data
Stock Price Prediction using Historical Data
Stock Price Prediction using Historical Data and News Sentiment
Evaluation and Analysis

Background

Stock price movement is a highly complex issue influenced by many factors. These factors operate at two scales: macroeconomic (inflation rates, interest rates, trade balances, etc.) and microeconomic (company performance, industry performance, market sentiment, etc.). Both scales play a significant role in stock price movements.

At the macroeconomic scale, factors such as inflation and interest rates affect the Composite Stock Price Index (CSPI). An increase in inflation has a negative but statistically non-significant impact on the CSPI, while an increase in interest rates has a significant negative impact. Stable inflation and low interest rates tend to create a favorable environment for stock investment. Additionally, a country's trade balance (the difference between the value of its exports and imports) can affect stock prices at the macro level. A healthy trade balance can build investor confidence and support economic growth, which in turn can lift stock prices.

On the other hand, from a microeconomic perspective, if a company's performance is considered positive, investors are likely to be interested in buying shares of that company, leading to an increase in the stock's value. Company profits, sales growth, and specific industry factors can significantly influence the valuation of individual stocks. Moreover, there is a one-way causality where investor sentiment has a significant impact on stock market movements. Positive or negative perceptions can create significant buying or selling trends in the stock market, which, in turn, affects stock prices overall.

In the stock market, there are two main approaches to predict stock prices: fundamental and technical analysis. Fundamental analysis focuses on factors underlying the intrinsic value of a company. It involves evaluating the financial condition of the company, such as net profit, income, financial ratios, management, and external factors such as industry and economic conditions. Fundamental analysis aims to determine whether a stock is trading at a value commensurate with its actual performance and growth potential.

On the other hand, technical analysis focuses on analyzing historical price movements and trading volume to identify market patterns and trends. It involves using price charts, technical indicators, and other analytical tools. Technical analysts believe that historical trends and price patterns can provide clues about the direction of future price movements. This approach often does not consider the fundamental factors of the company and is more concentrated on market behavior and psychology.

Given the complexity of the factors influencing stock price movements, market participants are increasingly turning to modern approaches based on machine learning. Machine learning models can quickly process and analyze large volumes of historical data, identify complex patterns, and make predictions based on the available information. This allows investors and traders to use macroeconomic, microeconomic, fundamental, and technical data more efficiently. Machine learning can therefore be a valuable tool for improving the accuracy of stock price predictions.

Historical data is often the primary reference in machine learning models. This may include daily closing prices, daily opening prices, trading volumes, and other technical indicators. Historical data can help identify trends and patterns that may repeat, providing a basis for predicting how stocks may perform in the future. Machine learning models and statistical analysis are often used to analyze historical data efficiently.

On the other hand, company news can provide an overview of events happening in the company. News can include company financial reports, the macroeconomic impact on the company, industry events, or other developments that can affect the company's performance or the market as a whole. News sentiment analysis, evaluating whether the news is positive or negative, can also serve as a reference in predicting stock price movements. In this case, Natural Language Processing (NLP) techniques in machine learning are often used to process text data.

By combining historical data and news, analysts can obtain a more holistic picture of the factors influencing stock price movements. This approach leverages both the technical aspects of past price movements and the fundamental information contained in recent news. With the advancement of technology and artificial intelligence, the combination of historical data and news is becoming increasingly crucial in predicting the dynamic movements of the stock market.

Problem Formulation

Based on the outlined background, the problem formulation can be summarized as follows:

How is the process of collecting and preparing historical stock price and news datasets carried out?
How well do machine learning models (SVR and Prophet) and deep learning models (LSTM, BiLSTM, GRU, and LSTM+GRU) predict stock prices from historical data?
How does news sentiment affect stock price movements?

Problem Limitations

In conducting the research, problem limitations are established, covering the following:

Historical stock price and news data use stocks in the LQ45 in 2023, in the categories of telecommunications (EXCL, TBIG, TLKM, and TOWR), consumer cyclicals (AMRT, CPIN, GGRM, ICBP, INDF, and UNVR), and consumer non-cyclicals (ACES, MAPI, and SCMA). Stock price data is obtained from August 1, 2018, to August 1, 2023, using the yfinance API library.
The collected news data comes from CNBC (cnbcindonesia.com), Kontan (kontan.co.id), Bisnis (bisnis.com), Katadata (katadata.co.id), and Investor (investor.id), using the gnews API library.

Dataset

Collection of Historical Stock Price Data using the yfinance Library

The yfinance library is used to collect historical stock price data. The historical stock prices belong to the LQ45 stocks for the year 2023, covering three sectors: telecommunications, consumer-cyclicals, and consumer non-cyclicals. The stock ticker symbols used for the telecommunications category are: EXCL, TBIG, TLKM, and TOWR; for consumer-cyclicals: AMRT, CPIN, GGRM, ICBP, INDF, and UNVR; and for consumer non-cyclicals: ACES, MAPI, and SCMA. Stock price data is obtained from August 1, 2018, to August 1, 2023. The figure below presents an example visualization of the historical stock price data for TLKM obtained using the yfinance library.

The table below is an example of the historical data obtained using the yfinance library. The data has the following features: Date, Open (opening price), High (highest price), Low (lowest price), Close (closing price), Adj Close (closing price adjusted for corporate actions such as dividends, stock splits, and public offerings), and Volume (number of shares traded).

Date	Open	High	Low	Close	Adj Close	Volume
2018-08-01	3570.0	3600.0	3500.0	3550.0	2936.646240	190,720,900
2018-08-02	3550.0	3590.0	3450.0	3500.0	2895.285156	171,276,100
2018-08-03	3480.0	3490.0	3430.0	3460.0	2862.196289	224,005,400
2018-08-06	3480.0	3670.0	3480.0	3650.0	3019.368652	231,846,400
2018-08-07	3650.0	3670.0	3570.0	3580.0	2961.463135	111,024,900
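The collection step could look roughly like the sketch below. It assumes that tickers are passed to Yahoo Finance with the `.JK` suffix used for Indonesia Stock Exchange listings; the helper names `to_yahoo_symbol` and `download_history` are hypothetical and not taken from the original study.

```python
from datetime import date

# Tickers and date range as stated in the text.
TICKERS = ["EXCL", "TBIG", "TLKM", "TOWR", "AMRT", "CPIN", "GGRM",
           "ICBP", "INDF", "UNVR", "ACES", "MAPI", "SCMA"]
START, END = date(2018, 8, 1), date(2023, 8, 1)


def to_yahoo_symbol(ticker: str) -> str:
    """Append the '.JK' suffix (assumption: Yahoo Finance symbol for IDX listings)."""
    return f"{ticker.upper()}.JK"


def download_history(ticker: str):
    """Return a DataFrame of daily prices for one ticker (needs network access)."""
    import yfinance as yf  # imported lazily so the helpers above work without it

    # `end` is exclusive in yfinance, so the last row is the final trading day
    # before END. auto_adjust=False keeps separate Close and Adj Close columns.
    return yf.download(
        to_yahoo_symbol(ticker),
        start=START.isoformat(),
        end=END.isoformat(),
        auto_adjust=False,
    )
```

Calling `download_history("TLKM")` would then return one row per trading day with the columns described above.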

Collection of News Data using the gnews Library

News data is collected from CNBC (cnbcindonesia.com), Kontan (kontan.co.id), Bisnis (bisnis.com), Katadata (katadata.co.id), and Investor (investor.id) using the gnews library. News data is obtained from August 1, 2018, to August 1, 2023. For each stock ticker symbol, 400–1500 rows of news data are collected. It should be noted that there may not be news every day. The table below is an example of the obtained news data, with features such as published_date (publication date), title (news title), media (news source), and url (link).

index	published_date	title	media	url
0	Tue, 08 Aug 2023 07:00:00	Telkom Indonesia's (TLKM) Strategy to Improve Performance Until the End of 2023	Industri Kontan	https://news.google.com/rss/articles ...
1	Tue, 01 Aug 2023 07:00:00	Performance of Telecommunication Issuers Tends to Rise, Check Stock Analyst Recommendations	Investasi Kontan	https://news.google.com/rss/articles ...
2	Wed, 02 Aug 2023 07:00:00	Revenue & Stock Price Trend Up, These Telecommunication Stocks are Worth Buying	Investasi Kontan	https://news.google.com/rss/articles ...
21457	Mon, 20 Aug 2018 07:00:00	Challenges of Tax Revenue in Political Year 2019	CNBC Indonesia	https://news.google.com/rss/articles ...
21458	Tue, 07 Aug 2018 07:00:00	Just Launched Satellite, Telkom's Stock Released by Foreigners	CNBC Indonesia	https://news.google.com/rss/articles ...
21459	Wed, 08 Aug 2018 07:00:00	Red and White Satellite Launches into Space	CNBC Indonesia	https://news.google.com/rss/articles ...

Methodology

First, historical and news data will be collected. Next, data handling will be performed on historical data to address missing or inconsistent data, as well as normalization. For news data, irrelevant topics will be removed using topic modeling, followed by sentiment annotation. Then, news data will undergo preprocessing through case-folding and punctuation removal. In the model training phase, historical data will undergo train-test split. News data will undergo encoding, train-test split, and vectorization, followed by model training to obtain news sentiment. The resulting news sentiment will be incorporated into the model training process along with historical data. In the final stage, model evaluation measurements will be applied.

Preprocessing of News Data

BERTopic is used to detect news topics so that topics unrelated to stock price movements can be filtered out. This stage begins with a search for the parameters that give the best separation between topics. Candidate parameter sets are compared by their coherence score, which measures how closely related the words within each generated topic are. The parameters iterated over are top_n_words, n_gram_range, min_topic_size, and nr_topics. The resulting topic distribution for TLKM stock news is shown in the figure below.

Some topics are unrelated to stock price movements, such as 30nomor_cara_hangus (expiring phone numbers), 34_internet_100_paket (internet packages), 23_lowongan kerja_buka (job vacancies), and 5_xl_cara_indoesat transfer kuota/pulsa (quota and credit transfers). News items in these topics will be removed manually, as they are not relevant to stock price movements.
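A minimal sketch of the topic filtering step follows. The BERTopic parameter values shown are placeholders rather than the tuned values from the study, the `multilingual` language setting is an assumption for Indonesian headlines, and the helper names `fit_topics` and `drop_topics` are hypothetical. The topic ids in `IRRELEVANT_IDS` are read off the topic labels listed above.

```python
# Topic ids read off the irrelevant topic labels mentioned in the text.
IRRELEVANT_IDS = {5, 23, 30, 34}


def fit_topics(titles, top_n_words=10, n_gram_range=(1, 2),
               min_topic_size=10, nr_topics=None):
    """Fit BERTopic and return (model, topic id per title).

    The default parameter values here are placeholders, not tuned results.
    """
    from bertopic import BERTopic  # lazy import: heavy optional dependency

    model = BERTopic(
        language="multilingual",  # assumption: headlines are in Indonesian
        top_n_words=top_n_words,
        n_gram_range=n_gram_range,
        min_topic_size=min_topic_size,
        nr_topics=nr_topics,
    )
    topics, _probabilities = model.fit_transform(titles)
    return model, topics


def drop_topics(titles, topics, irrelevant_ids=IRRELEVANT_IDS):
    """Keep only the titles whose assigned topic id is not in irrelevant_ids."""
    return [title for title, topic in zip(titles, topics)
            if topic not in irrelevant_ids]
```

Because the irrelevant topics are chosen by hand, the set of ids would need to be reviewed again whenever the topic model is refitted, since topic numbering can change between runs.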

After removing irrelevant topics, the next step is manual sentiment annotation of news articles. The table below represents the annotation of stock news data. Stock news annotation consists of three categories: 1 (positive), 0 (neutral), and -1 (negative).

No.	Title	Sentiment
1	Telkom Indonesia's (TLKM) Strategy to Improve Performance Until the End of 2023	0
2	Revenue & Stock Price Trend Up, These Telecommunication Stocks are Worth Buying	1
3	TLKM and BBRI are the Largest, Pay Attention to Stocks Sold by Foreigners as IHSG Rebounds	-1
4	Telkomsel and Ericsson Forge Partnership to Strengthen 4G/5G Networks	1
5	Strong Signals of Telecommunication Issuers are Becoming More Visible	1

Annotated news data will then undergo further preprocessing. This preprocessing involves two steps: changing the text to lowercase (case-folding) and removing punctuation. The table below represents the preprocessing of some news headlines on TLKM stock.

No.	Original Text	Lowercase Change	Punctuation Removal
1	Telkom Indonesia's (TLKM) Strategy to Improve Performance Until the End of 2023	telkom indonesia's (tlkm) strategy to improve performance until the end of 2023	telkom indonesias tlkm strategy to improve performance until the end of 2023
2	Revenue & Stock Price Trend Up, These Telecommunication Stocks are Worth Buying	revenue & stock price trend up, these telecommunication stocks are worth buying	revenue stock price trend up these telecommunication stocks are worth buying
3	TLKM and BBRI are the Largest, Pay Attention to Stocks Sold by Foreigners as IHSG Rebounds	tlkm and bbri are the largest, pay attention to stocks sold by foreigners as ihsg rebounds	tlkm and bbri are the largest pay attention to stocks sold by foreigners as ihsg rebounds
4	Telkomsel and Ericsson Forge Partnership to Strengthen 4G/5G Networks	telkomsel and ericsson forge partnership to strengthen 4g/5g networks	telkomsel and ericsson forge partnership to strengthen 4g5g networks
5	Strong Signals of Telecommunication Issuers are Becoming More Visible	strong signals of telecommunication issuers are becoming more visible	strong signals of telecommunication issuers are becoming more visible
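The two preprocessing steps can be sketched with the Python standard library alone. This is one possible implementation, not necessarily the one used in the study; the function name `preprocess_title` is hypothetical, and collapsing repeated spaces after punctuation removal is an added assumption.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_title(title: str) -> str:
    """Apply case folding, then punctuation removal, to one headline."""
    lowered = title.lower()                            # case folding
    no_punctuation = lowered.translate(_PUNCTUATION_TABLE)
    # Removing a standalone symbol such as "&" leaves two spaces behind,
    # so collapse runs of whitespace into a single space.
    return " ".join(no_punctuation.split())
```

Note that `string.punctuation` includes the apostrophe and the slash, so possessives lose their apostrophe and a token such as `4G/5G` becomes `4g5g`.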

Prediction of News Sentiment

Annotated and cleaned news data will proceed to the sentiment prediction stage. Beforehand, sentiment annotations consisting of three categories will be transformed into positive and negative only. This is done to improve the accuracy of sentiment predictions, as texts containing neutral sentiments have significant similarity with texts containing positive sentiments. The process begins with encoding the target data, which consists of positive and negative sentiments, into categorical numeric data. Subsequently, a train-test split is performed with 80% training data and 20% testing data. A vectorizer is also applied to the news data to convert it into numeric representations that can be understood by machine learning models. After all feature engineering processes are complete, the news data will enter SVM and BERT models for training. The results of the news sentiment prediction will be used as input, along with historical data, for training models using machine learning algorithms (SVR, Prophet), deep learning (LSTM, BiLSTM, GRU), and ensemble learning (Random Forest, CatBoost).
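The sentiment pipeline described above (encoding, 80/20 split, vectorization, SVM training) might be sketched as follows with scikit-learn. Several details are assumptions that the text does not specify: folding neutral headlines into the positive class, using TF-IDF as the vectorizer, using a linear SVM, and stratifying the split. The function names are hypothetical.

```python
def to_binary(labels, neutral_as=1):
    """Collapse {-1, 0, 1} annotations to {0, 1}: 0 = negative, 1 = positive.

    neutral_as=1 folds neutral into positive (an assumption). With
    neutral_as=None, neutral items map to None so the caller can drop them.
    """
    mapping = {-1: 0, 1: 1, 0: neutral_as}
    return [mapping[label] for label in labels]


def train_sentiment_svm(titles, labels, seed=42):
    """Encode, split 80/20, vectorize, and fit an SVM.

    Returns (fitted pipeline, accuracy on the 20% test split).
    """
    # Lazy imports so to_binary() can be used without scikit-learn installed.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    y = to_binary(labels)
    x_train, x_test, y_train, y_test = train_test_split(
        titles, y, test_size=0.2, stratify=y, random_state=seed
    )
    # The pipeline fits the vectorizer on training text only, which avoids
    # leaking test-set vocabulary into the model.
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(x_train, y_train)
    return model, model.score(x_test, y_test)
```

The fitted pipeline's `predict` method can then label unseen headlines, and those predictions become the sentiment feature for the price models.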

Preprocessing and EDA of Historical Stock Data

EDA is conducted on historical stock data to understand its characteristics. For the period from August 1, 2018, to August 1, 2023, the data characteristics of TLKM stock are as follows:

	Open	High	Low	Close	Adj Close	Volume
Count	1235	1235	1235	1235	1235	1.235000e+03
Mean	3767	3809	3722	3765	3358	1.083151e+08
Std	470	468	470	471	520	7.324254e+07
Min	2550	2590	2450	2560	2197	0.000000e+00
25%	3390	3430	3350	3380	2969	6.569400e+07
50%	3820	3860	3780	3830	3267	9.136790e+07
75%	4110	4160	4060	4110	3816	1.286072e+08
Max	4850	4850	4720	4770	4558	1.155861e+09

Over the five-year period, the lowest Adj Close (closing price adjusted for corporate actions such as dividends and stock splits) of TLKM stock was 2,197 Rupiah and the highest was 4,558 Rupiah. The lowest price occurred on March 19, 2020, and the highest on August 24, 2022. There are 1,235 rows of stock price data; non-trading days such as holidays have no rows.

The data used for model training will be normalized to improve model performance. Normalization is performed using the Standard Scaler, which rescales each feature to zero mean and unit variance.
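Standard scaling replaces each value x with z = (x - mean) / std. The sketch below illustrates it in plain Python, under two assumptions not stated in the text: the series is split chronologically, and the scaler statistics are computed on the training portion only so that no information from the test period leaks into training. The function names are hypothetical.

```python
from statistics import fmean, pstdev


def chronological_split(values, train_fraction=0.8):
    """Split a time-ordered list into (train, test) without shuffling."""
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]


def fit_standard_scaler(train_values):
    """Return (mean, std) of the training data, using the population std."""
    mean = fmean(train_values)
    std = pstdev(train_values) or 1.0  # avoid division by zero on constant data
    return mean, std


def standardize(values, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(value - mean) / std for value in values]


# Example with the five TLKM closing prices from the sample table.
closes = [3550.0, 3500.0, 3460.0, 3650.0, 3580.0]
train, test = chronological_split(closes)
mean, std = fit_standard_scaler(train)
scaled_train = standardize(train, mean, std)
scaled_test = standardize(test, mean, std)  # reuses the training statistics
```

Predictions made on the scaled series are converted back to Rupiah with the inverse transform, `z * std + mean`.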

Stock Price Prediction using Historical Data

After preprocessing, models are trained using machine learning, deep learning, and ensemble learning algorithms, with a train-test split of 80% training data and 20% testing data. In the LSTM architecture, the first layer has 50 units with return_sequences=True, so it returns an output for every time step in the input sequence. The second LSTM layer has 50 units with return_sequences=False, so it returns only the output at the last time step. Two Dense (fully connected) layers follow, with 25 units and 1 unit, respectively. The model is compiled with the 'adam' optimizer and the 'mean_squared_error' loss function, then trained for 250 epochs with a batch size of 32. The figure below shows the LSTM prediction results using historical stock data.
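The LSTM architecture described in this section could be written in Keras roughly as below. The layer sizes, optimizer, loss, epochs, and batch size come from the text; the input shape (`window` time steps by `n_features` features) and the function name `build_lstm` are assumptions.

```python
# Hyperparameters stated in the text.
LSTM_UNITS = 50
DENSE_UNITS = 25
EPOCHS = 250
BATCH_SIZE = 32


def build_lstm(window: int, n_features: int = 1):
    """Build the stacked LSTM; `window` and `n_features` are assumed inputs."""
    from tensorflow.keras import Input, Sequential  # lazy import: heavy dependency
    from tensorflow.keras.layers import LSTM, Dense

    model = Sequential([
        Input(shape=(window, n_features)),
        LSTM(LSTM_UNITS, return_sequences=True),   # one output per time step
        LSTM(LSTM_UNITS, return_sequences=False),  # output at the last step only
        Dense(DENSE_UNITS),
        Dense(1),                                  # predicted (scaled) price
    ])
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model
```

Training would then be `model.fit(x_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE)`, where `x_train` has shape `(samples, window, n_features)`.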

For the Prophet algorithm, the parameters used are changepoint_prior_scale = 200, seasonality_prior_scale = 3, changepoint_range = 0.9, n_changepoints = 250, weekly_seasonality = True, daily_seasonality = True, and yearly_seasonality = True. The figure below shows the Prophet prediction results using historical stock data.

For SVR, the model is trained with the 'linear' kernel. The figure below shows the SVR prediction results using historical stock data.

Stock Price Prediction using Historical Data and News Sentiment

Historical data and news sentiment are then combined. Historical prices are continuous numeric values, while news sentiment is categorical, so the sentiment labels must be encoded numerically before they can be fed to the stock prediction model. The model is then trained with the two features as inputs.
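One way to combine the two sources is to aggregate the encoded sentiment per day and attach it to each price row. The sketch below makes several assumptions the text leaves open: sentiment is encoded as 0 (negative) or 1 (positive), multiple headlines on one day are averaged, and trading days with no news receive a fill value of 0.5. The function names are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def daily_sentiment(news):
    """Average the encoded sentiment of all headlines published on each date.

    `news` is an iterable of (date, label) pairs with label in {0, 1}.
    """
    labels_by_date = defaultdict(list)
    for published_date, label in news:
        labels_by_date[published_date].append(label)
    return {day: fmean(labels) for day, labels in labels_by_date.items()}


def add_sentiment(prices, news, fill=0.5):
    """Return (date, close, sentiment) rows; days without news get `fill`.

    `prices` is a chronological list of (date, close) pairs. The fill value
    is an assumption, chosen as the midpoint between the two classes.
    """
    sentiment = daily_sentiment(news)
    return [(day, close, sentiment.get(day, fill)) for day, close in prices]
```

News published on non-trading days is ignored by this sketch; carrying it forward to the next trading day would be an alternative design.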

In this study, three main scenarios are used. The first uses SVM and BERT, together with the SMOTE oversampling technique, for sentiment analysis. The second uses machine learning algorithms (SVR, Prophet), deep learning (LSTM, BiLSTM, GRU), and ensemble learning (Random Forest, CatBoost) for price prediction. The third varies the number of days (30, 60, and 90) used as input variables to the model. These scenarios are expected to improve the model's accuracy in predicting stock prices.
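If the third scenario is read as a lookback window (an interpretation, since the text does not define it precisely), each training sample consists of the previous 30, 60, or 90 days of features and the target is the next day's price. A minimal sketch, with the hypothetical helper `make_windows`:

```python
def make_windows(rows, window):
    """Turn chronological feature rows into (inputs, targets).

    Each input is `window` consecutive rows; its target is the first feature
    (assumed to be the price) of the row that immediately follows.
    """
    inputs, targets = [], []
    for end in range(window, len(rows)):
        inputs.append(rows[end - window:end])
        targets.append(rows[end][0])
    return inputs, targets


# Example: 100 days of (price, sentiment) rows with synthetic prices.
rows = [(float(day), 0.5) for day in range(100)]
samples_by_window = {w: len(make_windows(rows, w)[0]) for w in (30, 60, 90)}
```

Longer windows give the model more context per sample but leave fewer samples: a series of n rows yields n minus `window` samples.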

Evaluation and Analysis

Evaluation will be conducted on each model scenario, covering both historical data alone and the combination of historical data and news sentiment. The evaluation metrics used to measure model accuracy are MAE and MAPE. Evaluation is performed on both the test data and actual stock prices from outside the dataset period, to test whether the model can be relied upon for real stock price movements.
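For reference, the two metrics can be computed as in the sketch below. MAE is the mean of the absolute errors, in the same unit as the price; MAPE is the mean of the absolute errors relative to the actual values, expressed here as a percentage. The function names are illustrative.

```python
def mae(actual, predicted):
    """Mean Absolute Error, in the same unit as the prices."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent.

    Every actual value must be non-zero, which holds for stock prices.
    """
    ratios = (abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100 * sum(ratios) / len(actual)
```

Both metrics should be computed on prices in their original scale, so predictions made on standardized data need to be inverse-transformed first.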

This article will be updated afterward

As this research is still in the proposal stage, further explanations will be provided after the research is completed.