Predicted Stock Direction

Stock market predictions are challenging, but machine learning models can help identify trends by learning patterns in historical data.

In this article, we’ll walk through how we can use a decision tree classifier to predict stock price movements for Alphabet Inc. (GOOGL) using Python.

The process includes data collection, feature engineering, model training, and evaluation.

Objective: Predicting Stock Price Movements

The goal of this project is to build a model that can predict whether the stock price of Google (GOOGL) will go up (1) or down (0) in the next week.

We will use historical stock data, process it into features that reflect the stock’s behavior, and then apply a decision tree classifier to predict future movements.

Overview of the Approach

Data Collection: We start by downloading historical stock data.
Feature Engineering: We create new features that could help the model make better predictions.
Modeling: We use a decision tree classifier to make predictions.
Evaluation: We evaluate the model’s performance and visualize the results.

Why a Decision Tree?

A decision tree is a popular classification algorithm. It works by splitting the data based on different features, which makes it easy to interpret.

It’s a non-linear model, meaning it can handle complex relationships in the data. Some advantages of decision trees include:

Interpretability: Easy to understand and visualize.
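Non-linear modeling: Can handle non-linear relationships between the features and the target.
Little need for feature scaling: Splits depend only on the ordering of values within each feature, so inputs do not need to be standardized as they usually are for models such as logistic regression.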

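However, decision trees are prone to overfitting, especially when they grow deep. Constraining the tree, for example by limiting its depth, helps combat this. The baseline model in this walkthrough uses scikit-learn's default settings, so that constraint remains a tuning step.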
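
As a rough illustration of the overfitting risk, the sketch below fits an unconstrained tree and a constrained one to synthetic data whose labels are pure noise. The data, the variable names, and the particular `max_depth` and `min_samples_leaf` values are assumptions chosen for illustration; they are not taken from the GOOGL model.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): 400 samples, 5 random features, and labels that
# are independent of the features, so there is nothing real to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = rng.integers(0, 2, size=400)

# Ordered split: first 300 rows for training, last 100 for testing.
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

deep_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=20, random_state=42
).fit(X_train, y_train)

for name, tree in [("unconstrained", deep_tree), ("constrained", shallow_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, "
        f"train acc={tree.score(X_train, y_train):.2f}, "
        f"test acc={tree.score(X_test, y_test):.2f}"
    )
```

The unconstrained tree memorizes the training labels perfectly even though they carry no signal, which is the behavior that depth and leaf-size limits are meant to curb.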

Step-by-Step Code Walkthrough

You can find the complete code and notebook for this project on GitHub: View Repository. This includes all the preprocessing steps, model training, evaluation, and visualizations explained in this article.

1. Installing Dependencies

We first install the required Python libraries using pip. These libraries are crucial for data manipulation, model training, and evaluation.

Bash

pip install yfinance pandas numpy scikit-learn seaborn matplotlib

2. Importing Libraries

Here, we import the libraries for downloading the data (yfinance, which returns pandas DataFrames), machine learning (scikit-learn), and visualization (matplotlib, seaborn).

Python

import yfinance as yf
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

plt.style.use('dark_background')

3. Downloading the Data

We download historical stock data for Google from Yahoo Finance using the yfinance library. We choose a weekly interval and a time frame from 2010 to 2023.

Python

data = yf.download("GOOGL", start="2010-01-01", end="2023-12-31", auto_adjust=True, interval="1wk")
data.head()

The auto_adjust=True argument adjusts the Open, High, Low, and Close prices for corporate actions such as stock splits and dividends.

4. Preparing the Data

We flatten the multi-level column index that the download can produce and select only the relevant columns (Open, Close, Volume, Low, High).

Python

data.columns = data.columns.get_level_values(0)
data = data[['Open', 'Close', 'Volume', 'Low', 'High']]

# Plotting the closing price
plt.figure(figsize=(14,6))
plt.plot(data.index, data['Close'], label='Close Price', color='blue')
plt.title('Stock Closing Price Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Price (USD)')
plt.legend()
plt.grid(True)
plt.savefig('stock_closing_price.png')
plt.show()

Alphabet’s Closing Stock Price Chart
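
Depending on the yfinance version, a download can come back with two-level columns of the form (field, ticker), which is what the `get_level_values(0)` call in the preparation step flattens. The sketch below reproduces that on a hand-built frame rather than a real download; the numbers and the `raw` and `flat` variables are made up for illustration.

```python
import pandas as pd

# Hand-built stand-in for a yfinance download (assumption: two-level columns
# of the form (field, ticker); the values are invented).
columns = pd.MultiIndex.from_product([["Close", "Open"], ["GOOGL"]])
raw = pd.DataFrame(
    [[101.0, 100.0], [103.0, 101.5]],
    index=pd.to_datetime(["2023-01-02", "2023-01-09"]),
    columns=columns,
)

flat = raw.copy()
# Keep only the first level (the field name) and drop the ticker level.
flat.columns = flat.columns.get_level_values(0)

print(list(raw.columns))   # tuples of (field, ticker)
print(list(flat.columns))  # plain field names
```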

5. Creating the Labels

We create the target label: 1 if next week’s closing price is higher than this week’s, and 0 otherwise. The data is weekly, so shifting by one row looks one week ahead.

Python

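data["Direction"] = (data["Close"].shift(-1) > data["Close"]).astype(int)
# The last row has no following week; comparing against NaN yields False, so it
# would be labelled 0 and dropna() would not remove it. Drop it by position.
data = data.iloc[:-1].copy()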
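
The labelling rule can be checked on a toy series. The prices below are invented; the point is that the final row has no following bar, so its comparison against `NaN` evaluates to `False` and it would be labelled 0 unless that row is removed explicitly.

```python
import pandas as pd

# Invented weekly closes, for illustration only.
toy = pd.DataFrame({"Close": [10.0, 11.0, 10.5, 12.0]})

next_close = toy["Close"].shift(-1)  # the last value is NaN
toy["Direction"] = (next_close > toy["Close"]).astype(int)
print(toy)

# The last row compares NaN > 12.0, which is False, so the label column has
# no NaN for dropna() to catch. Dropping the row by position avoids a
# spurious "down" label.
labelled = toy.iloc[:-1].copy()
print(labelled["Direction"].tolist())
```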

6. Feature Engineering

Next, we calculate several technical indicators that might help predict the stock’s future movements. These include:

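Returns: 1-week and 5-week returns (the data is weekly, so the "d" suffix in the column names counts bars, not days).
Moving Averages: 5-week and 10-week moving averages, plus their ratio.
Volatility: The 5-week rolling standard deviation of the closing price.
Momentum: The difference between the current close and the close 10 weeks earlier.
RSI: Relative Strength Index over 14 weeks, a momentum indicator.
Range: Difference between high and low prices, plus the distance from the close to the high and to the low.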

Python

data['Return_1d'] = data['Close'].pct_change()
data['Return_5d'] = data['Close'].pct_change(5)
data['MA_5'] = data['Close'].rolling(window=5).mean()
data['MA_10'] = data['Close'].rolling(window=10).mean()
data['MA_ratio'] = data['MA_5'] / data['MA_10']
data['Volatility_5d'] = data['Close'].rolling(window=5).std()
data['Momentum_10'] = data['Close'] - data['Close'].shift(10)

def compute_rsi(series, period=14):
    delta = series.diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=period).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=period).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))

data['RSI_14'] = compute_rsi(data['Close'])
data['Range'] = data['High'] - data['Low']
data['Close_to_High'] = data['High'] - data['Close']
data['Close_to_Low'] = data['Close'] - data['Low']

These features capture different aspects of stock behavior, such as momentum, volatility, and market trends.
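
As a sanity check on the RSI helper, the sketch below reruns the same rolling-mean formula on four invented prices with a two-bar period, small enough to verify by hand. Note that this variant averages gains and losses with a simple moving average; Wilder's original RSI uses a smoothed average, so values may differ somewhat from those shown on charting platforms.

```python
import pandas as pd

def compute_rsi(series, period=14):
    """RSI from simple rolling means of gains and losses (same formula as above)."""
    delta = series.diff()
    gain = delta.where(delta > 0, 0).rolling(window=period).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=period).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))

# Invented prices: the bar-to-bar changes are +1, -1, +2.
prices = pd.Series([10.0, 11.0, 10.0, 12.0])
rsi = compute_rsi(prices, period=2)

# Window over changes (+1, -1): avg gain 0.5, avg loss 0.5 -> RS = 1 -> RSI = 50.
# Window over changes (-1, +2): avg gain 1.0, avg loss 0.5 -> RS = 2 -> RSI ~ 66.7.
print(rsi.round(1).tolist())
```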

7. Preparing Features and Target Variables

We select the features and target variable for model training. The target is Direction, and the features are the calculated indicators and stock data columns.

Python

feature_cols = [
    'Open', 'Close', 'Volume', 'Low', 'High',
    'Return_1d', 'Return_5d', 'MA_5', 'MA_10', 'MA_ratio',
    'Volatility_5d', 'Momentum_10', 'RSI_14', 'Range', 'Close_to_High', 'Close_to_Low'
]

data.dropna(inplace=True)

X = data[feature_cols]
y = data["Direction"]

8. Scaling the Data

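We standardize the features so that each has zero mean and unit variance. Decision trees do not need this step, since their splits depend only on the ordering of values within each feature, but scaling keeps the pipeline reusable for scale-sensitive models.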

Python

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
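
One caveat with the step above: fitting the scaler on the full dataset lets test-period means and variances influence the training features. A common alternative, sketched here on synthetic data, is to split chronologically first and fit the scaler on the training rows only. The sketch also checks that a tree trained on the scaled features agrees with one trained on the raw features, since standardization preserves the ordering of values within each feature. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix (assumption), with features on
# very different scales, as prices and volumes would be.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) * np.array([1.0, 50.0, 1e6, 1000.0])
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Split first, keeping row order, then fit the scaler on training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

raw_tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)
scaled_tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train_s, y_train)

agreement = np.mean(raw_tree.predict(X_test) == scaled_tree.predict(X_test_s))
print(f"Share of identical test predictions: {agreement:.3f}")
```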

9. Train/Test Split

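We split the data into training and testing sets, using the first 80% of the weeks for training and the most recent 20% for testing. Setting shuffle=False keeps the rows in chronological order, so the model is never trained on data from after the test period.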

Python

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, shuffle=False)
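
A minimal check of that ordering on a made-up weekly index is sketched below; the frame and its single dummy feature are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up weekly frame (assumption): 100 weeks, one dummy feature.
idx = pd.date_range("2020-01-06", periods=100, freq="W-MON")
X = pd.DataFrame({"feature": np.arange(100.0)}, index=idx)
y = pd.Series(np.arange(100) % 2, index=idx)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

# With shuffle=False the split is a simple cut: the first 80 rows, then the
# last 20, so every training date precedes every test date.
print(X_train.index.max(), "<", X_test.index.min())
print(len(X_train), len(X_test))
```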

10. Model Training

We fit a decision tree classifier with scikit-learn's default hyperparameters, which place no limit on the tree's depth. Setting random_state=42 makes the result reproducible.

Python

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

11. Making Predictions

We use the trained model to make predictions on the test data and evaluate the model’s performance.

Python

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Plaintext

              precision    recall  f1-score   support

           0       0.44      0.25      0.32        68
           1       0.51      0.71      0.60        76

    accuracy                           0.49       144
   macro avg       0.48      0.48      0.46       144
weighted avg       0.48      0.49      0.46       144
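
For context, the support column gives a naive baseline: always predicting the majority class (1, with 76 of 144 test weeks) would score about 0.53, so the reported 0.49 accuracy does not beat it. The arithmetic, using only the numbers from the report above:

```python
# Test-set class counts, read from the support column of the report above.
support = {0: 68, 1: 76}
reported_accuracy = 0.49

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Always-predict-majority accuracy: {baseline_accuracy:.2f}")
print(f"Model beats baseline: {reported_accuracy > baseline_accuracy}")
```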

12. Visualizing Predictions

We plot the predicted stock movements against the actual closing prices to visually assess the model’s performance.

Python

plt.figure(figsize=(14,6))
plt.plot(data.index[-len(y_test):], data["Close"][-len(y_test):], label='Close Price')
plt.plot(data.index[-len(y_test):][y_pred == 1], data["Close"][-len(y_test):][y_pred == 1], '^', markersize=10, color='g', label='Predicted Up')
plt.plot(data.index[-len(y_test):][y_pred == 0], data["Close"][-len(y_test):][y_pred == 0], 'v', markersize=10, color='r', label='Predicted Down')
plt.title("Predicted Market Direction vs Close Price")
plt.legend()
plt.show()

Predicted Stock Direction

13. Evaluating the Model with a Confusion Matrix

We use a confusion matrix to get a deeper understanding of the model’s performance.

Python

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Confusion Matrix
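
To read the heatmap: scikit-learn puts actual classes on the rows and predicted classes on the columns, so the diagonal holds the correct predictions. A toy example with invented labels shows how the recall for each class follows from the matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels, for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_hat = np.array([0, 1, 1, 1, 1, 1, 0, 1])

cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

# Recall per class = correct predictions in a row / row total.
recall_down = tn / (tn + fp)
recall_up = tp / (tp + fn)
print(cm)
print(f"recall(0)={recall_down:.2f}, recall(1)={recall_up:.2f}")
```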

14. Feature Importances

We plot the feature importances, showing which features had the most impact on the decision tree’s predictions.

Python

importances = model.feature_importances_
feat_names = X.columns
plt.barh(feat_names, importances)
plt.title("Feature Importances (Decision Tree)")
plt.xlabel("Importance")
plt.show()

Feature Importance
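
Impurity-based importances sum to 1 and only reflect how the fitted tree used each feature on the training data, so they are easier to read when sorted. A sketch on synthetic data follows, where the column names are placeholders and only the first column actually drives the label.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): 'signal' determines the label, the rest is noise.
rng = np.random.default_rng(7)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "noise_a", "noise_b"])
y = (X["signal"] > 0).astype(int)

model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Pair importances with column names and sort for a readable ranking.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

A sorted Series like this can be passed to `plt.barh(importances.index, importances.values)` in place of the unsorted arrays.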

Possible Improvements

Model Tuning: Decision trees can easily overfit the data. We can try tuning the tree by adjusting the maximum depth or using ensemble methods like Random Forest or Gradient Boosting.
Additional Features: We could experiment with more advanced technical indicators or use macroeconomic data (e.g., interest rates, inflation).
Alternative Models: We could explore other model families, such as Support Vector Machines (SVM) or neural networks, to see whether they improve accuracy.
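
As one way to approach the tuning idea, the sketch below searches over `max_depth` and `min_samples_leaf` with scikit-learn's `GridSearchCV`, using `TimeSeriesSplit` so that each validation fold comes after its training fold. The synthetic data and the grid values are assumptions, not tuned settings for GOOGL.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and labels (assumption).
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=400) > 0).astype(int)

# Hypothetical grid: a few depth limits and minimum leaf sizes.
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 10, 25],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),  # folds respect time order
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

The same pattern should carry over to ensemble models by swapping the estimator and the grid, though the best settings will depend on the data.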