04 May 2025 • 21:32
Predicted Stock Direction
Stock market predictions are challenging, but machine learning models can help identify trends by learning patterns in historical data.
In this article, we’ll walk through how we can use a decision tree classifier to predict stock price movements for Alphabet Inc. (GOOGL) using Python.
The process includes data collection, feature engineering, model training, and evaluation.
The goal of this project is to build a model that can predict whether the stock price of Google (GOOGL) will go up (1) or down (0) in the next week.
We will use historical stock data, process it into features that reflect the stock’s behavior, and then apply a decision tree classifier to predict future movements.
Data Collection: We start by downloading historical stock data.
Feature Engineering: We create new features that could help the model make better predictions.
Modeling: We use a decision tree classifier to make predictions.
Evaluation: We evaluate the model’s performance and visualize the results.
A decision tree is a popular classification algorithm. It works by splitting the data based on different features, which makes it easy to interpret.
It’s a non-linear model, meaning it can handle complex relationships in the data. Some advantages of decision trees include:
Interpretability: Easy to understand and visualize.
Non-linear modeling: Can handle non-linear relationships between features.
Low requirement for data scaling: Features don’t need to be scaled, unlike with algorithms such as logistic regression.
However, decision trees can be prone to overfitting, especially when they are too deep. Constraining the tree, for example by limiting its maximum depth, is the usual way to combat this.
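To make the overfitting risk concrete, here is a minimal sketch on synthetic data (not the GOOGL data). The noisy stand-in dataset and the values max_depth=3 and min_samples_leaf=10 are illustrative assumptions, not settings used in this project. It compares an unconstrained tree with a constrained one.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: 5 features, label only weakly tied to feature 0.
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=400) > 0).astype(int)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Default settings: the tree grows until every training sample is classified.
deep = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Constrained tree: illustrative limits on depth and leaf size.
shallow = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=10, random_state=42
).fit(X_train, y_train)

for name, tree in [("unconstrained", deep), ("constrained", shallow)]:
    print(
        f"{name}: depth={tree.get_depth()}, "
        f"train acc={tree.score(X_train, y_train):.2f}, "
        f"test acc={tree.score(X_test, y_test):.2f}"
    )
```

The unconstrained tree memorizes the training set perfectly, which says little about how it will do on unseen data. The gap between training and test accuracy is the thing to watch.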
You can find the complete code and notebook for this project on GitHub: View Repository. This includes all the preprocessing steps, model training, evaluation, and visualizations explained in this article.
We first install the required Python libraries using pip. These libraries are crucial for data manipulation, model training, and evaluation.
pip install yfinance pandas numpy scikit-learn seaborn matplotlib
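Here, we import the libraries for downloading data (yfinance), machine learning (scikit-learn), and visualization (matplotlib, seaborn). pandas and numpy are not imported directly; we work with them through the DataFrame that yfinance returns and the arrays that scikit-learn produces.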
import yfinance as yf
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
plt.style.use('dark_background')
We download historical stock data for Google from Yahoo Finance using the yfinance library. We choose a weekly interval and a time frame from 2010 to 2023.
data = yf.download("GOOGL", start="2010-01-01", end="2023-12-31", auto_adjust=True, interval="1wk")
data.head()
The auto_adjust=True argument returns prices that are already adjusted for corporate actions such as stock splits, so the Close column can be used directly.
We flatten the multi-level column index that the download can create and select only the relevant columns (Open, Close, Volume, Low, High).
data.columns = data.columns.get_level_values(0)
data = data[['Open', 'Close', 'Volume', 'Low', 'High']].copy()  # copy so new columns can be added without pandas warnings
# Plotting the closing price
plt.figure(figsize=(14,6))
plt.plot(data.index, data['Close'], label='Close Price', color='blue')
plt.title('Stock Closing Price Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Price (USD)')
plt.legend()
plt.grid(True)
plt.savefig('stock_closing_price.png')
plt.show()
Alphabet’s Closing Stock Price Chart
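As an offline check of the column-flattening step above, the sketch below builds a small DataFrame with two-level columns and applies the same get_level_values(0) call. The (field, ticker) layout and the numbers are assumptions made for illustration; the exact shape yfinance returns can differ between versions.

```python
import pandas as pd

# Hypothetical two-level columns in the shape (field, ticker); values are made up.
columns = pd.MultiIndex.from_product(
    [["Close", "Open"], ["GOOGL"]], names=["Price", "Ticker"]
)
toy = pd.DataFrame([[10.0, 9.5], [10.5, 10.1]], columns=columns)

# Keep only the first level (the field names) and discard the ticker level.
toy.columns = toy.columns.get_level_values(0)
flat_columns = list(toy.columns)
print(flat_columns)
```

The same call is harmless on an already flat column index, so the step is safe whether or not the download produced a multi-level index.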
We create the target label: 1 if next week’s closing price is higher than this week’s, and 0 otherwise.
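data["Direction"] = (data["Close"].shift(-1) > data["Close"]).astype(int)
# The last row has no following week; comparing against NaN yields False (label 0), so drop that row explicitly
data = data.iloc[:-1].copy()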
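A toy check of the labelling logic, using made-up prices, shows why the final row needs care: it has no following week, and a comparison against NaN evaluates to False, so it would be labelled 0 unless it is excluded explicitly.

```python
import pandas as pd

# Made-up weekly closes, for illustration only.
close = pd.Series([100.0, 102.0, 101.0, 105.0], name="Close")

next_close = close.shift(-1)                  # next week's close; NaN for the last row
direction = (next_close > close).astype(int)  # the NaN comparison becomes 0, not NaN

# Keep only rows that actually have a following week.
labels = direction[next_close.notna()]

print(direction.tolist())  # includes a label for the last row
print(labels.tolist())     # last row excluded
```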
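Next, we calculate several technical indicators that might help predict the stock’s future movements. Because the data is weekly, every window below is measured in weeks, even where a column name carries a d suffix. These include:
Returns: 1-week and 5-week returns.
Moving Averages: 5-week and 10-week moving averages.
Volatility: The 5-week rolling standard deviation of the closing price.
Momentum: The difference between the current closing price and the closing price 10 weeks earlier.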
RSI: Relative Strength Index, a momentum indicator.
Range: Difference between high and low prices.
data['Return_1d'] = data['Close'].pct_change()
data['Return_5d'] = data['Close'].pct_change(5)
data['MA_5'] = data['Close'].rolling(window=5).mean()
data['MA_10'] = data['Close'].rolling(window=10).mean()
data['MA_ratio'] = data['MA_5'] / data['MA_10']
data['Volatility_5d'] = data['Close'].rolling(window=5).std()
data['Momentum_10'] = data['Close'] - data['Close'].shift(10)
def compute_rsi(series, period=14):
    # RSI based on simple rolling means of gains and losses over `period` rows
    delta = series.diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=period).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=period).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))
data['RSI_14'] = compute_rsi(data['Close'])
data['Range'] = data['High'] - data['Low']
data['Close_to_High'] = data['High'] - data['Close']
data['Close_to_Low'] = data['Close'] - data['Low']
These features capture different aspects of stock behavior, such as momentum, volatility, and market trends.
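To sanity-check the RSI calculation, the sketch below repeats the same rolling-mean formula under a hypothetical name, rsi_simple, and applies it to two made-up series. A steadily rising series should sit at the top of the 0 to 100 range, and a series that alternates up and down by the same amount should sit in the middle.

```python
import pandas as pd

def rsi_simple(series, period=14):
    """RSI from simple rolling means of gains and losses (same formula as above)."""
    delta = series.diff()
    gain = delta.where(delta > 0, 0).rolling(window=period).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=period).mean()
    return 100 - (100 / (1 + gain / loss))

# Made-up inputs: one series that only rises, one that alternates by +1 / -1.
rising = pd.Series(range(1, 21), dtype=float)
alternating = pd.Series([10.0, 11.0] * 10)

rsi_rising = rsi_simple(rising).iloc[-1]
rsi_alternating = rsi_simple(alternating).iloc[-1]
print(rsi_rising, rsi_alternating)
```

Note that this variant uses plain rolling means. Many charting tools use Wilder’s exponential smoothing instead, so values may not match those shown on a trading platform.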
We select the features and target variable for model training. The target is Direction
, and the features are the calculated indicators and stock data columns.
feature_cols = [
    'Open', 'Close', 'Volume', 'Low', 'High',
    'Return_1d', 'Return_5d', 'MA_5', 'MA_10', 'MA_ratio',
    'Volatility_5d', 'Momentum_10', 'RSI_14', 'Range', 'Close_to_High', 'Close_to_Low'
]
data.dropna(inplace=True)
X = data[feature_cols]
y = data["Direction"]
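Decision trees split on feature thresholds, so they do not need scaled features; we standardize anyway so that the same pipeline can be reused with scale-sensitive models. We first split the data chronologically, using the first 80% for training and the last 20% for testing (shuffle=False keeps the time order), and then fit the scaler on the training set only so that no test-set statistics leak into training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean and std from the training weeks only
X_test = scaler.transform(X_test)  # reuse the training statistics on the test weeks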
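A single chronological split gives only one estimate of out-of-sample performance. One possible extension, not used in this project, is walk-forward validation with scikit-learn’s TimeSeriesSplit, where every fold trains on earlier weeks and tests on later ones. The row count below is a hypothetical placeholder.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_weeks = 100  # hypothetical number of weekly rows
X_demo = np.arange(n_weeks).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=4)
folds = list(splitter.split(X_demo))

for train_idx, test_idx in folds:
    # Training indices always precede test indices, so the model never sees the future.
    print(
        f"train weeks 0-{train_idx[-1]}, "
        f"test weeks {test_idx[0]}-{test_idx[-1]}"
    )
```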
We use a decision tree classifier with scikit-learn’s default hyperparameters, which place no limit on the tree’s depth. Setting random_state=42 makes the result reproducible.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
We use the trained model to make predictions on the test data and evaluate the model’s performance.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.44      0.25      0.32        68
           1       0.51      0.71      0.60        76

    accuracy                           0.49       144
   macro avg       0.48      0.48      0.46       144
weighted avg       0.48      0.49      0.46       144
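A useful reference point for these numbers is a naive baseline that always predicts “up”. The sketch below rebuilds the test labels from the support column of the report above (68 down weeks, 76 up weeks) and scores that baseline. It is an illustrative check, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Class counts taken from the "support" column of the classification report.
n_down, n_up = 68, 76
y_true = np.array([0] * n_down + [1] * n_up)

# Naive baseline: predict "up" (1) for every week.
always_up = np.ones_like(y_true)
baseline_accuracy = accuracy_score(y_true, always_up)

print(f"always-up baseline accuracy: {baseline_accuracy:.3f}")
```

The baseline scores about 0.53, so the tree’s 0.49 accuracy does not beat simply predicting “up” every week.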
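An accuracy of 0.49 on 144 test weeks is no better than a coin flip, and the model leans towards the “up” class (recall of 0.71 for class 1 against 0.25 for class 0). To see where the predictions fall, we plot the predicted stock movements against the actual closing prices.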
plt.figure(figsize=(14,6))
plt.plot(data.index[-len(y_test):], data["Close"][-len(y_test):], label='Close Price')
plt.plot(data.index[-len(y_test):][y_pred == 1], data["Close"][-len(y_test):][y_pred == 1], '^', markersize=10, color='g', label='Predicted Up')
plt.plot(data.index[-len(y_test):][y_pred == 0], data["Close"][-len(y_test):][y_pred == 0], 'v', markersize=10, color='r', label='Predicted Down')
plt.title("Predicted Market Direction vs Close Price")
plt.legend()
plt.show()
Predicted Stock Direction
We use a confusion matrix to get a deeper understanding of the model’s performance.
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Confusion Matrix
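The confusion matrix and the classification report describe the same predictions, so one can be checked against the other. The sketch below estimates the cell counts from the rounded recall and support values in the report and then recomputes precision and accuracy from them. The counts are an estimate derived from rounded figures, not values read from the plotted matrix.

```python
import numpy as np

# Rounded values copied from the classification report.
support = np.array([68, 76])     # actual down weeks, actual up weeks
recall = np.array([0.25, 0.71])  # recall for class 0 and class 1

# Correct predictions per class, rounded to whole weeks.
correct = np.rint(support * recall).astype(int)

# Rows are actual classes, columns are predicted classes (scikit-learn's layout).
cm_est = np.array([
    [correct[0], support[0] - correct[0]],
    [support[1] - correct[1], correct[1]],
])

precision = np.diag(cm_est) / cm_est.sum(axis=0)  # column-wise: correct / predicted
accuracy = np.trace(cm_est) / cm_est.sum()

print(cm_est)
print(np.round(precision, 2), round(accuracy, 2))
```

The recomputed precision and accuracy agree with the report after rounding, which is a quick way to confirm that the rows of the heatmap are the actual classes and the columns the predicted ones.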
We plot the feature importances, showing which features had the most impact on the decision tree’s predictions.
importances = model.feature_importances_
feat_names = X.columns
plt.barh(feat_names, importances)
plt.title("Feature Importances (Decision Tree)")
plt.xlabel("Importance")
plt.show()
Feature Importance
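The feature_importances_ attribute is computed from the training data, so it can overstate features that the tree has overfit to. One possible cross-check, not part of the original notebook, is permutation importance on held-out data. The sketch below uses synthetic data in which only the first feature carries signal; the data and the max_depth=3 setting are illustrative assumptions.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: the label depends on feature 0 only; features 1 and 2 are noise.
X = rng.normal(size=(600, 3))
y = (X[:, 0] > 0).astype(int)
X_train, X_test = X[:450], X[450:]
y_train, y_test = y[:450], y[450:]

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Shuffle one feature at a time on the held-out set and measure the accuracy drop.
result = permutation_importance(
    tree, X_test, y_test, n_repeats=10, random_state=42
)
ranking = np.argsort(result.importances_mean)[::-1]

for idx in ranking:
    print(f"feature {idx}: mean accuracy drop {result.importances_mean[idx]:.3f}")
```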
Model Tuning: Decision trees can easily overfit the data. We can try tuning the tree by adjusting the maximum depth or using ensemble methods like Random Forest or Gradient Boosting.
Additional Features: We could experiment with more advanced technical indicators or use macroeconomic data (e.g., interest rates, inflation).
Alternative Models: We could explore other model families such as Support Vector Machines (SVM) or neural networks, keeping in mind that a more complex model does not guarantee better accuracy on noisy financial data.
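As a starting point for the model tuning idea above, here is a minimal sketch of a grid search over tree depth and leaf size, scored with time-ordered cross-validation. It runs on synthetic data, and the grid values are illustrative assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix and the up/down labels.
X = rng.normal(size=(500, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 0).astype(int)

# Illustrative grid: None means no depth limit.
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 10, 25],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),  # each fold trains on the past, tests on the future
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

On the real data, X and y would be the training features and labels from earlier in the article, with the final test set kept aside until the search is finished.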