Tutorial 08: XGBoost Raw Candle ML Model Advanced

Table of Contents

What Does This Script Do?
Key Concepts
The ML Pipeline
Code Walkthrough
Feature Engineering Explained
Dependencies
Glossary
Full Code: Python to Pseudo-Code Translation

1. What Does This Script Do?

This is a complete machine learning pipeline that trains an XGBoost model to predict BTC's direction in 5-minute windows. Unlike earlier tutorials that used technical indicators (MACD, RSI), this approach feeds the model raw candle shapes - the actual body size, wick length, and volume of each 1-minute candle - and lets the AI figure out what matters.

Simple Analogy: In earlier tutorials, you told the computer WHAT to look for (e.g., "check if MACD is above signal"). In this tutorial, you give the computer RAW DATA (the actual candlestick shapes from the last 15 minutes) and say "YOU figure out what patterns matter." It's like the difference between giving someone a checklist vs. showing them 1000 examples and letting them learn.

2. Key Concepts

What is XGBoost?

XGBoost (eXtreme Gradient Boosting) is one of the most popular machine learning algorithms for structured data. It builds an ensemble of decision trees one after another, with each new tree correcting the errors left by the trees before it.

Decision Tree: A flowchart that makes decisions: "If body size > 0.05% AND volume is high, then predict UP"
Ensemble: Combining hundreds of trees together for better accuracy
Gradient Boosting: Each new tree focuses on the examples the previous trees got wrong
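To make "each new tree focuses on what the previous trees got wrong" concrete, here is a toy sketch of gradient boosting for squared error, written in plain Python with one-split trees ("stumps"). It is not how XGBoost is implemented (XGBoost adds regularization, second-order gradients, and much more); the function names and the tiny dataset are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a one-split tree: pick the threshold that best separates the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees, learning_rate=0.1):
    """Return a predict(x) function built from n_trees sequentially fitted stumps."""
    base = sum(ys) / len(ys)          # start from the average label
    trees = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        # Residuals are the mistakes of the ensemble so far
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Each tree only nudges the prediction by a small, scaled step
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data: one feature, noisy UP/DOWN labels
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

mse_base = mse(boost(xs, ys, n_trees=0), xs, ys)
mse_1 = mse(boost(xs, ys, n_trees=1), xs, ys)
mse_50 = mse(boost(xs, ys, n_trees=50), xs, ys)
print(mse_base, mse_1, mse_50)
```

The training error falls as trees are added, which is also why boosting can overfit if it runs for too long.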

Raw Candle Features (No Indicators!)

Instead of pre-computing MACD, RSI, or Bollinger Bands, this approach uses the raw shape of each candle:

Body size %: How big the candle body is (difference between open and close)
Upper wick %: How far above the body the price went (shows selling pressure)
Lower wick %: How far below the body the price dipped (shows buying pressure)
Volume ratio: How much volume compared to average (unusual activity?)

15 candles x 4 features = 60 raw features, plus aggregate stats = ~75 total features.
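As a rough illustration, the four per-candle features can be computed for a single candle as below. The formulas mirror the ones shown in build_features() in the Code Walkthrough; the Candle class, the sample prices, and the avg_volume argument are hypothetical stand-ins, and plain Python is used here instead of the vectorized arrays the script works with.

```python
from dataclasses import dataclass


@dataclass
class Candle:
    open: float
    high: float
    low: float
    close: float
    volume: float


def candle_features(c, avg_volume):
    """Return the four raw features for one candle, as percentages of the open."""
    body_top = max(c.open, c.close)
    body_bottom = min(c.open, c.close)
    return {
        "body_pct": (c.close - c.open) / c.open * 100,         # signed: + green, - red
        "upper_wick_pct": (c.high - body_top) / c.open * 100,
        "lower_wick_pct": (body_bottom - c.low) / c.open * 100,
        "vol_ratio": c.volume / avg_volume,                    # above 1 = busier than usual
    }


# Made-up candle: opens at 100.0, closes at 100.2, with wicks on both sides
candle = Candle(open=100.0, high=100.5, low=99.8, close=100.2, volume=30.0)
features = candle_features(candle, avg_volume=20.0)

LOOKBACK = 15
n_raw_features = LOOKBACK * len(features)  # 15 candles x 4 features
print(features, n_raw_features)
```

Expressing every measurement relative to the open price keeps the features comparable whether BTC trades at $20,000 or $100,000.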

Time-Series Split

Crucial: Unlike datasets whose rows are independent of one another, time-series data must be split in chronological order: train on older data, test on newer data. Never shuffle. Shuffling lets the model train on candles that come after the ones it is tested on, which amounts to "seeing the future" and is impossible in real trading.
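A small sketch of why shuffling is a problem. Integer timestamps stand in for consecutive 5-minute windows, and leaks_future() is a hypothetical helper that checks whether any training sample is newer than the oldest test sample.

```python
import random

timestamps = list(range(100))  # stand-in for 100 consecutive windows, oldest first
cut = int(len(timestamps) * 0.70)

# Chronological split: everything before the cut trains, everything after tests
chrono_train, chrono_test = timestamps[:cut], timestamps[cut:]

# Shuffled split: same sizes, but rows are mixed up before cutting
shuffled = timestamps[:]
random.Random(0).shuffle(shuffled)
shuf_train, shuf_test = shuffled[:cut], shuffled[cut:]


def leaks_future(train, test):
    """True if the training set contains data newer than the oldest test row."""
    return max(train) > min(test)


chrono_leaks = leaks_future(chrono_train, chrono_test)
shuffle_leaks = leaks_future(shuf_train, shuf_test)
print("chronological split leaks:", chrono_leaks)
print("shuffled split leaks:", shuffle_leaks)
```

With the chronological split the newest training row is always older than the oldest test row, so test accuracy reflects how the model would behave on data it could not have seen.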

3. The ML Pipeline

Step	Function	What Happens
1	`load_raw_data()`	Load 1-minute BTC/USD OHLCV data from CSV
2	`build_features()`	Extract ~75 features (60 raw candle values plus aggregates) per 5-min window
3	`time_series_split()`	Split into train (70%) / validation (15%) / test (15%)
4	`train_model()`	Train XGBoost classifier with early stopping
5	`evaluate_model()`	Test accuracy, P&L simulation, feature importance
6	`save_model()`	Save trained model to disk for later use

4. Code Walkthrough

1 build_features() - The Heart of the Pipeline

def build_features(df):
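    # Pre-compute raw candle metrics for every 1-min candle
    # (opens, highs, lows, closes, volumes hold the OHLCV columns of df)
    max_opens_closes = np.maximum(opens, closes)  # top of each candle body
    min_opens_closes = np.minimum(opens, closes)  # bottom of each candle body
    body_pct = (closes - opens) / opens * 100
    upper_wick_pct = (highs - max_opens_closes) / opens * 100
    lower_wick_pct = (min_opens_closes - lows) / opens * 100
    vol_ratio = volumes / rolling_mean_volume
    is_green = (closes > opens).astype(float)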

    # For each 5-minute window...
    for w in range(n_windows):
        # Label: 1=UP if last close >= first open
        label = 1 if last_close >= first_open else 0

        # Look back at 15 candles before this window
        for i in range(1, LOOKBACK + 1):
            row[f"candle_{i}_body"] = body_pct[idx]
            row[f"candle_{i}_upper_wick"] = upper_wick_pct[idx]
            row[f"candle_{i}_lower_wick"] = lower_wick_pct[idx]
            row[f"candle_{i}_vol"] = vol_ratio[idx]

        # Add aggregate features...
        # Green count, avg body size, wick ratio, streaks, etc.

What it does: This is the feature engineering step - the most important part of any ML project. It looks at the 15 one-minute candles before each 5-minute window and extracts 4 features per candle. Then it adds aggregate statistics (green candle counts, average sizes, streaks, etc.).
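The excerpt above leaves out how windows and lookback indices are laid out. The sketch below shows one plausible layout in plain Python: non-overlapping 5-candle windows, with candle_1 taken as the candle immediately before the window. That index mapping, the helper name build_rows(), and the synthetic prices are assumptions for illustration, not the script's confirmed logic. Only the body feature is recorded to keep it short.

```python
LOOKBACK = 15  # candles of history per window
WINDOW = 5     # candles per prediction window


def build_rows(opens, closes, body_pct):
    rows = []
    # Start at the first window that has a full 15-candle lookback behind it
    for start in range(LOOKBACK, len(opens) - WINDOW + 1, WINDOW):
        first_open = opens[start]
        last_close = closes[start + WINDOW - 1]
        row = {"label": 1 if last_close >= first_open else 0}
        for i in range(1, LOOKBACK + 1):
            idx = start - i  # assumption: candle_1 is the candle just before the window
            row[f"candle_{i}_body"] = body_pct[idx]
        rows.append(row)
    return rows


# Synthetic 30-minute price series that climbs steadily
opens = [100.0 + k for k in range(30)]
closes = [o + 0.5 for o in opens]
body_pct = [(c - o) / o * 100 for o, c in zip(opens, closes)]

rows = build_rows(opens, closes, body_pct)
print(len(rows), rows[0]["label"])
```

Every feature index is strictly smaller than the window's first index, so no information from inside the window being predicted reaches the features.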

2 train_model() - Training the AI

model = xgb.XGBClassifier(
    n_estimators=1000,      # Up to 1000 trees
    max_depth=4,             # Shallow trees (prevent overfitting)
    learning_rate=0.01,       # Small steps for better generalization
    subsample=0.7,            # Use 70% of data per tree
    colsample_bytree=0.5,    # Use 50% of features per tree
    min_child_weight=100,    # Minimum summed instance weight (Hessian) per leaf
    early_stopping_rounds=100, # Stop if no improvement for 100 rounds
)

model.fit(X_train, y_train,
    eval_set=[(X_train, y_train), (X_val, y_val)])

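What it does: Trains an XGBoost model with conservative hyperparameters (shallow trees, a small learning rate, row and feature subsampling) chosen to limit overfitting. Early stopping is key: when eval_set has more than one entry, XGBoost uses the last one (here the validation set) for early stopping, and training halts once that score has not improved for 100 rounds.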
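The early-stopping rule can be sketched independently of XGBoost. The helper below is hypothetical and simply walks a list of per-round validation losses with a patience counter; XGBoost's own bookkeeping may differ in detail, and the loss values are invented.

```python
def best_round_with_early_stopping(val_losses, patience):
    """Return (best_round, rounds_run) for a sequence of validation losses.

    Training stops once `patience` rounds pass without a new best loss.
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx + 1  # stop early
    return best_round, len(val_losses)        # ran out of rounds


# Invented validation losses: improving at first, then drifting upward (overfitting)
val_losses = [0.693, 0.690, 0.688, 0.687, 0.689, 0.690, 0.691, 0.692]
best_round, rounds_run = best_round_with_early_stopping(val_losses, patience=3)
print(best_round, rounds_run)
```

With the tutorial's settings the patience is 100 rounds and the ceiling is 1000 trees, so training ends at whichever comes first.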

3 evaluate_model() - Testing Performance

The evaluation produces several key outputs:

Accuracy: What % of predictions were correct (must exceed the 54% break-even rate implied by the $0.54 entry price to profit on Polymarket, before any fees)
Classification Report: Precision, recall, F1-score for UP and DOWN predictions
Confusion Matrix: Shows actual vs predicted counts
Polymarket P&L: Simulated profit using $0.54 entry price
Feature Importance: Which candle features mattered most
Confidence Threshold: How filtering by model confidence affects results
Time Breakdown: Win rate by session (Asia/Europe/US) and hour
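The P&L numbers used in the simulation follow from the entry price. The sketch below assumes a $10 stake per trade and that a winning share settles at $1 (both labeled in the code); it reproduces the $8.52 win and $10.00 loss used later in the pseudo-code, derives the break-even win rate, and tracks maximum drawdown on a made-up sequence of outcomes. Fees are ignored.

```python
ENTRY_PRICE = 0.54        # from the tutorial: price paid per share
STAKE = 10.00             # assumption: dollars risked per trade
PAYOUT_PER_SHARE = 1.00   # assumption: a winning share settles at $1

shares = STAKE / ENTRY_PRICE
WIN_PROFIT = shares * PAYOUT_PER_SHARE - STAKE          # about 8.52
LOSS_AMOUNT = STAKE                                     # the whole stake is lost
break_even = LOSS_AMOUNT / (WIN_PROFIT + LOSS_AMOUNT)   # win rate where P&L is zero


def simulate(outcomes):
    """outcomes: booleans, True = correct prediction.

    Returns (cumulative P&L after each trade, maximum peak-to-trough drawdown).
    """
    cumulative, total, peak, max_drawdown = [], 0.0, 0.0, 0.0
    for correct in outcomes:
        total += WIN_PROFIT if correct else -LOSS_AMOUNT
        cumulative.append(total)
        peak = max(peak, total)
        max_drawdown = max(max_drawdown, peak - total)
    return cumulative, max_drawdown


# Made-up outcome sequence: two wins, two losses, one win
cumulative, max_drawdown = simulate([True, True, False, False, True])
print(round(WIN_PROFIT, 2), round(break_even, 4), round(max_drawdown, 2))
```

Because a loss costs more than a win earns, the break-even win rate equals the entry price: a model that is right 54% of the time only breaks even.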

5. Feature Engineering Explained

Candle Anatomy:

            |       <- Upper Wick (price went up but got rejected)
            |
         ---|---    <- Top of body: Close for a green candle, Open for a red one
         |     |
         |Body |    <- The "real" move: Open to Close
         |     |
         ---|---    <- Bottom of body: Open for a green candle, Close for a red one
            |
            |       <- Lower Wick (price dipped but buyers stepped in)

Green candle (bullish): Close > Open. Body shows upward move.

Red candle (bearish): Close < Open. Body shows downward move.

Long upper wick: Sellers pushed price down from the high = selling pressure

Long lower wick: Buyers pushed price up from the low = buying pressure

6. Dependencies

pip install xgboost scikit-learn scipy pandas numpy termcolor

7. Glossary

Term	Meaning
XGBoost	eXtreme Gradient Boosting - a powerful ML algorithm for tabular data
Feature Engineering	Creating useful inputs for the model from raw data
Decision Tree	A model that makes predictions by following a series of yes/no questions
Overfitting	When a model memorizes training data but fails on new data
Early Stopping	Stopping training when validation performance stops improving
Validation Set	Data held out during training to check for overfitting
Feature Importance	Which input features the model relied on most
Confidence Threshold	Only trading when the model's prediction probability is high enough
Classification Report	Detailed accuracy metrics per class (UP vs DOWN)
Confusion Matrix	A table showing actual vs predicted classifications

8. Full Code: Python to Pseudo-Code Translation

build_features() - Extract Raw Candle Features

# --- PYTHON ---
def build_features(df):
    body_pct = (closes - opens) / opens * 100
    upper_wick_pct = (highs - max_oc) / opens * 100
    lower_wick_pct = (min_oc - lows) / opens * 100
    vol_ratio = volumes / rolling_mean_volume
    is_green = (closes > opens).astype(float)
    for w in range(n_windows):
        label = 1 if last_close >= first_open else 0
        for i in range(1, LOOKBACK + 1):
            row[f"candle_{i}_body"] = body_pct[idx]
            row[f"candle_{i}_upper_wick"] = upper_wick_pct[idx]
            row[f"candle_{i}_lower_wick"] = lower_wick_pct[idx]
            row[f"candle_{i}_vol"] = vol_ratio[idx]
        row["green_count_15"] = is_green[lb_slice].sum()
        row["avg_body_size_15"] = abs_body_15.mean()
        row["consecutive_same"] = streak
        row["return_skew_30"] = skew(ret_slice)
        row["hour"] = dt.hour
        row["session"] = session  # 0 = Asia, 1 = Europe, 2 = US

# --- PSEUDO-CODE ---
FUNCTION build_features(dataframe):
    PRE-COMPUTE for every 1-minute candle:
        body_pct = how big the candle body is (open to close) as a percentage
        upper_wick_pct = how far above the body the price reached (selling pressure)
        lower_wick_pct = how far below the body the price dipped (buying pressure)
        vol_ratio = current volume divided by 30-candle average volume
        is_green = 1 if close > open (bullish candle), 0 if red (bearish)

    FOR every group of 5 consecutive candles (a 5-minute window):
        DETERMINE the label:
            1 (UP) if the last close price >= first open price
            0 (DOWN) if the last close price < first open price

        LOOK BACK at the 15 candles BEFORE this window:
            FOR each of the 15 previous candles:
                RECORD its body size percentage
                RECORD its upper wick size percentage
                RECORD its lower wick size percentage
                RECORD its volume ratio (is volume unusual?)

        COMPUTE aggregate statistics:
            How many of the last 15 candles were green (bullish)?
            What's the average body size over 15 candles?
            What's the ratio of upper wicks to lower wicks (buy vs sell pressure)?
            Are candle bodies getting bigger or smaller recently?
            What's the longest streak of consecutive same-direction candles?
            What's the skewness of recent returns? (asymmetric distribution?)
            What hour of the day is it?
            What trading session? (Asia=0, Europe=1, US=2)
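A few of these aggregates can be sketched in plain Python. The helper name, the sample body sizes, the reading of "streak" as the run of same-colour candles ending at the most recent one, and especially the session hour boundaries are all assumptions made for illustration; the script's actual definitions are not shown in this tutorial.

```python
def aggregate_features(body_pct_15, hour):
    """body_pct_15: signed body sizes of the 15 lookback candles, oldest first."""
    is_green = [b > 0 for b in body_pct_15]

    # Assumed streak definition: how many of the most recent candles
    # share the colour of the latest candle
    streak = 0
    for green in reversed(is_green):
        if green != is_green[-1]:
            break
        streak += 1

    # Hypothetical session boundaries by UTC hour (not taken from the script)
    if hour < 8:
        session = 0   # Asia
    elif hour < 14:
        session = 1   # Europe
    else:
        session = 2   # US

    return {
        "green_count_15": sum(is_green),
        "avg_body_size_15": sum(abs(b) for b in body_pct_15) / len(body_pct_15),
        "consecutive_same": streak,
        "hour": hour,
        "session": session,
    }


# Made-up body sizes (in %) for 15 one-minute candles
bodies = [0.1, -0.2, 0.05, 0.3, -0.1, 0.2, 0.1, -0.05,
          0.0, 0.1, -0.3, 0.2, 0.1, 0.4, 0.2]
agg = aggregate_features(bodies, hour=15)
print(agg)
```

Aggregates like these give the trees a summary view (momentum, typical candle size, time of day) that would otherwise take many splits on individual candle features to approximate.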

time_series_split() - Split Data Chronologically

# --- PYTHON ---
def time_series_split(df):
    n = len(df)
    train_end = int(n * 0.70)
    val_end = int(n * 0.85)
    train = df.iloc[:train_end].copy()
    val = df.iloc[train_end:val_end].copy()
    test = df.iloc[val_end:].copy()
    return train, val, test

# --- PSEUDO-CODE ---
FUNCTION time_series_split(dataframe):
    COUNT total rows
    CALCULATE split points:
        First 70% = TRAINING data (the model learns from this)
        Next 15% = VALIDATION data (to check during training for overfitting)
        Last 15% = TEST data (final evaluation, never seen by model)

    IMPORTANT: DO NOT shuffle! Time must flow forward.
        Training = oldest data
        Validation = middle data
        Test = newest data

    RETURN the three split datasets

train_model() - Train the XGBoost Classifier

# --- PYTHON ---
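def train_model(train, val, feature_cols):
    # Split each frame into features and target (assumes the target column is named "label")
    X_train, y_train = train[feature_cols], train["label"]
    X_val, y_val = val[feature_cols], val["label"]
    model = xgb.XGBClassifier(
        n_estimators=1000, max_depth=4, learning_rate=0.01,
        subsample=0.7, colsample_bytree=0.5, min_child_weight=100,
        early_stopping_rounds=100)
    # The last eval_set entry (validation) drives early stopping
    model.fit(X_train, y_train,
              eval_set=[(X_train, y_train), (X_val, y_val)])
    return model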

# --- PSEUDO-CODE ---
FUNCTION train_model(training data, validation data, feature columns):
    CREATE an XGBoost classifier with these settings:
        n_estimators=1000: build up to 1000 decision trees
        max_depth=4: each tree can only be 4 levels deep (prevents overfitting)
        learning_rate=0.01: each tree's output is scaled by 0.01, so every tree makes only a small correction
        subsample=0.7: each tree only sees 70% of the data (prevents overfitting)
        colsample_bytree=0.5: each tree only sees 50% of features (prevents overfitting)
        min_child_weight=100: each leaf needs a summed instance weight (Hessian) of at least 100; for a binary classifier each sample contributes at most 0.25, so that means at least 400 samples per leaf (prevents overfitting)
        early_stopping_rounds=100: if no improvement for 100 rounds, stop training

    TRAIN the model:
        Feed it training data (features + labels)
        After each tree, CHECK performance on validation data
        KEEP the best version of the model

    RETURN the trained model

evaluate_model() - Test & Analyze Results

# --- PYTHON ---
def evaluate_model(model, train, val, test, feature_cols):
    # Accuracy on each split (assumes the target column is named "label")
    for name, split in [("TRAIN", train), ("VAL", val), ("TEST", test)]:
        X, y = split[feature_cols], split["label"]
        preds = model.predict(X)
        acc = accuracy_score(y, preds)
    # The loop ends on the TEST split, so X, y, and preds now refer to test data
    X_test, y_test = X, y
    report = classification_report(y_test, preds)
    cm = confusion_matrix(y_test, preds)
    correct = (preds == y_test)
    win_rate = correct.sum() / len(preds) * 100
    pnl_per_trade = np.where(correct, WIN_PROFIT, -LOSS_AMOUNT)
    cumulative_pnl = np.cumsum(pnl_per_trade)
    importance = model.get_booster().get_score(importance_type="gain")
    proba = model.predict_proba(X_test)
    max_proba = proba.max(axis=1)  # probability of the predicted class
    for thresh in [0.55, 0.60, 0.65, 0.70]:
        mask = max_proba >= thresh
        # ... filter and recalculate

# --- PSEUDO-CODE ---
FUNCTION evaluate_model(model, train, val, test, features):

    STEP 1: Measure accuracy on each dataset:
        ASK model to predict UP/DOWN for training data -> check accuracy
        ASK model to predict UP/DOWN for validation data -> check accuracy
        ASK model to predict UP/DOWN for test data -> check accuracy
        (Test accuracy is the most important - it's never-seen-before data)

    STEP 2: Print classification report:
        For both UP and DOWN predictions:
            Precision: when it says UP, how often is it right?
            Recall: of all actual UPs, how many did it catch?
            F1-score: balance of precision and recall

    STEP 3: Show confusion matrix:
        2x2 table: "Predicted UP vs DOWN" x "Actual UP vs DOWN"

    STEP 4: Simulate Polymarket P&L:
        FOR each test prediction:
            IF correct: ADD $8.52 profit
            IF wrong: SUBTRACT $10.00 loss
        CALCULATE cumulative P&L over time
        CALCULATE maximum drawdown (worst peak-to-trough decline)

    STEP 5: Feature importance:
        RANK all 75 features by how much they contributed to predictions
        SHOW top 20

    STEP 6: Confidence threshold analysis:
        FOR each threshold (55%, 60%, 65%, 70%):
            ONLY count predictions where the model is this confident
            CHECK: does filtering for high confidence improve win rate?
            TRADE-OFF: higher threshold = fewer trades but better quality

    STEP 7: Time breakdown:
        WIN RATE by trading session (Asia / Europe / US)
        WIN RATE by hour of the day
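Step 6 can be sketched in plain Python as follows. The helper threshold_report() and the probabilities and labels below are invented for illustration; in the script the same filtering is done with a boolean mask over the output of predict_proba.

```python
def threshold_report(proba_up, y_true, thresholds=(0.55, 0.60, 0.65, 0.70)):
    """For each threshold, return (number of trades kept, win rate in % or None)."""
    report = {}
    for thresh in thresholds:
        kept = []
        for p_up, actual in zip(proba_up, y_true):
            pred = 1 if p_up >= 0.5 else 0
            confidence = max(p_up, 1 - p_up)  # probability of the predicted class
            if confidence >= thresh:
                kept.append(pred == actual)
        win_rate = sum(kept) / len(kept) * 100 if kept else None
        report[thresh] = (len(kept), win_rate)
    return report


# Invented model outputs: probability of UP for eight windows, with the true labels
proba_up = [0.52, 0.58, 0.71, 0.44, 0.33, 0.62, 0.49, 0.80]
y_true = [1, 0, 1, 0, 0, 1, 1, 1]

report = threshold_report(proba_up, y_true)
for thresh, (n_trades, win_rate) in report.items():
    print(thresh, n_trades, win_rate)
```

Raising the threshold can only shrink the number of trades; whether the win rate actually improves depends on how well calibrated the model's probabilities are, which is exactly what this analysis is meant to reveal.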