Cross-Correlation in Stock Market Data

statistics
stock market
Python
cross-correlation
time series
Author

Jong-Hoon Kim

Published

March 14, 2026

1 Introduction

When we study multiple stock prices together, one of the first questions is: do they move together, and if so, by how much and with what delay? Cross-correlation answers exactly this.

Cross-correlation measures the similarity between two time series as a function of a lag applied to one of them. Unlike a simple correlation coefficient (which is a single number), cross-correlation gives us a function that reveals:

  • How strongly two stocks co-move
  • Whether one leads the other (a significant correlation at a positive lag suggests that stock A's returns today carry information about stock B's returns tomorrow)
  • How persistent that relationship is over time

This post covers cross-correlation from first principles, applied to real Korean stock data.


2 Setup

Code
# Install if needed:
# pip install FinanceDataReader pandas numpy matplotlib seaborn scipy statsmodels

import FinanceDataReader as fdr
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from scipy import stats
from statsmodels.tsa.stattools import ccf as sm_ccf
from statsmodels.tsa.stattools import grangercausalitytests

plt.rcParams['figure.dpi'] = 120
plt.rcParams['font.size'] = 11
sns.set_theme(style='whitegrid')

3 Downloading Korean Stock Data

We use FinanceDataReader, which supports KOSPI and KOSDAQ natively.

Code
# Major Korean blue-chip stocks
tickers = {
    '005930': 'Samsung Electronics',
    '000660': 'SK Hynix',
    '035420': 'NAVER',
    '035720': 'Kakao',
    '005380': 'Hyundai Motor',
    '005490': 'POSCO',
    '105560': 'KB Financial',
    '000270': 'Kia',
}

start = '2018-01-01'
end   = '2024-12-31'

prices = {}
for ticker, name in tickers.items():
    df = fdr.DataReader(ticker, start, end)
    prices[name] = df['Close']

prices = pd.DataFrame(prices)
prices.index = pd.to_datetime(prices.index)
prices = prices.dropna()

print(f"Shape: {prices.shape}")
print(f"Period: {prices.index[0].date()} to {prices.index[-1].date()}")
prices.head()
Shape: (1721, 8)
Period: 2018-01-02 to 2024-12-30
Samsung Electronics SK Hynix NAVER Kakao Hyundai Motor POSCO KB Financial Kia
Date
2018-01-02 51020 76600 177251 29405 149500 339000 63100 32800
2018-01-03 51620 77700 174447 29906 150500 357500 63100 32600
2018-01-04 51080 77100 178853 31311 146500 367500 63000 31550
2018-01-05 52120 79300 181857 31311 149000 368000 64100 31950
2018-01-08 52020 78200 190269 32014 151000 369500 66600 32400

4 Exploratory Data Analysis

4.1 Normalized Price Trajectories

Raw prices have very different scales. We normalize to 100 at the start to compare trajectories.

Code
normalized = prices / prices.iloc[0] * 100

fig, ax = plt.subplots(figsize=(12, 5))
for col in normalized.columns:
    ax.plot(normalized.index, normalized[col], label=col, linewidth=1.2)

ax.axhline(100, color='black', linewidth=0.8, linestyle='--', alpha=0.4)
ax.set_title('Normalized Closing Prices (Base = 100)', fontsize=13)
ax.set_ylabel('Price Index (Base = 100)')
ax.legend(loc='upper left', fontsize=8, ncol=2)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
plt.tight_layout()
plt.show()

4.2 Log Returns

Financial analysis works with log returns rather than raw prices, for two key reasons:

  1. Additivity: log returns aggregate over time (daily → monthly = sum of daily)
  2. Stationarity: prices trend upward; returns fluctuate around zero

\[r_t = \ln\left(\frac{P_t}{P_{t-1}}\right)\]
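The additivity property is easy to verify numerically. A minimal sketch with a hypothetical five-day price path (not the data above): the sum of daily log returns equals the log return over the whole period.

```python
import numpy as np

# Hypothetical price path, used only to illustrate additivity
prices = np.array([100.0, 102.0, 101.0, 105.0, 104.0])

daily_log_returns = np.log(prices[1:] / prices[:-1])
total_from_daily = daily_log_returns.sum()          # sum of daily log returns
total_direct = np.log(prices[-1] / prices[0])       # log return over the whole window

print(np.isclose(total_from_daily, total_direct))   # True
```

Simple returns do not aggregate this way, which is one reason log returns are preferred for time aggregation.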

Code
returns = np.log(prices / prices.shift(1)).dropna()

fig, axes = plt.subplots(2, 4, figsize=(14, 6), sharey=False)
axes = axes.flatten()

for i, col in enumerate(returns.columns):
    axes[i].hist(returns[col], bins=60, edgecolor='none', alpha=0.7, color='steelblue')
    axes[i].axvline(0, color='red', linewidth=0.8, linestyle='--')
    mu = returns[col].mean()
    sigma = returns[col].std()
    axes[i].set_title(f'{col}\nμ={mu:.4f}, σ={sigma:.4f}', fontsize=8)
    axes[i].set_xlabel('Log Return')

plt.suptitle('Distribution of Daily Log Returns', fontsize=13, y=1.01)
plt.tight_layout()
plt.show()

4.3 Descriptive Statistics of Returns

Code
from scipy.stats import skew, kurtosis, jarque_bera

stats_df = pd.DataFrame({
    'Mean':     returns.mean(),
    'Std':      returns.std(),
    'Skewness': returns.apply(skew),
    'Excess Kurtosis': returns.apply(lambda x: kurtosis(x, fisher=True)),
    'JB p-value': returns.apply(lambda x: jarque_bera(x)[1]),
    'Sharpe (ann.)': returns.mean() / returns.std() * np.sqrt(252),
}).round(4)

stats_df
Mean Std Skewness Excess Kurtosis JB p-value Sharpe (ann.)
Samsung Electronics 0.0000 0.0165 0.2042 2.6386 0.0 0.0234
SK Hynix 0.0005 0.0241 0.1520 1.5818 0.0 0.3139
NAVER 0.0001 0.0216 0.3951 2.4855 0.0 0.0491
Kakao 0.0002 0.0233 0.3858 2.2615 0.0 0.1036
Hyundai Motor 0.0002 0.0209 0.9135 6.9340 0.0 0.1540
POSCO -0.0002 0.0231 0.4915 4.6163 0.0 -0.1159
KB Financial 0.0002 0.0212 0.3717 4.9190 0.0 0.1190
Kia 0.0007 0.0220 0.1344 5.6250 0.0 0.4700

Interpretation note: Stock returns are almost never normal. Excess kurtosis > 0 (“fat tails”) means extreme moves are more common than a normal distribution predicts. The Jarque-Bera test formalizes this: p-value < 0.05 rejects normality.
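As a quick illustration of how the Jarque-Bera test behaves, here is a sketch on synthetic samples (not the stock data): a Gaussian sample versus a fat-tailed Student-t sample of the same size.

```python
import numpy as np
from scipy.stats import jarque_bera

rng = np.random.default_rng(42)

# Synthetic samples: Gaussian noise vs. fat-tailed Student-t draws
normal_sample = rng.normal(size=5000)
fat_tailed = rng.standard_t(df=3, size=5000)

_, p_normal = jarque_bera(normal_sample)
_, p_fat = jarque_bera(fat_tailed)

print(f"Gaussian sample: p = {p_normal:.3f}")   # usually well above 0.05
print(f"Student-t(3):    p = {p_fat:.3g}")      # essentially zero: normality rejected
```

The t-distributed sample's excess kurtosis drives the p-value to zero, which is the same pattern the table above shows for every stock.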


5 Cross-Correlation: Theory

5.1 Pearson Correlation (Lag = 0)

The familiar correlation coefficient between two return series \(X\) and \(Y\) is:

\[ \rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} \]

This is cross-correlation at lag 0 — it only captures contemporaneous co-movement.

5.2 Cross-Correlation Function (CCF)

The cross-correlation function generalizes this to arbitrary lags \(k\):

\[ \rho_{XY}(k) = \frac{\text{Cov}(X_t,\, Y_{t+k})}{\sigma_X \sigma_Y} \]

  • \(k > 0\): \(X\) today is correlated with \(Y\) in the future → \(X\) leads \(Y\)
  • \(k < 0\): \(X\) is correlated with \(Y\) in the past → \(Y\) leads \(X\)
  • \(k = 0\): contemporaneous correlation

The 95% confidence bounds for a white-noise null hypothesis are \(\pm 1.96 / \sqrt{n}\), where \(n\) is the sample size.
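These bounds are easy to sanity-check with simulated white noise of a length comparable to our sample: roughly 95% of the lagged correlations between two independent series should fall inside the band.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1721  # comparable to the price series above
x = rng.normal(size=n)
y = rng.normal(size=n)

conf = 1.96 / np.sqrt(n)  # 95% white-noise bound

# Sample cross-correlations at lags 1..30 between two independent series
ccf_vals = [np.corrcoef(x[:-k], y[k:])[0, 1] for k in range(1, 31)]
inside = np.mean(np.abs(ccf_vals) <= conf)

print(f"95% bound: +/-{conf:.4f}")
print(f"Fraction of lags inside the band: {inside:.2f}")  # about 0.95 on average
```

Any lag whose correlation exceeds the band is a candidate lead-lag effect; isolated excursions are expected by chance, so look for clusters of significant lags.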


6 Contemporaneous Correlation

6.1 Correlation Matrix Heatmap

Code
corr = returns.corr()

fig, ax = plt.subplots(figsize=(9, 7))
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)  # hide the redundant upper triangle

sns.heatmap(
    corr,
    mask=mask,
    annot=True,
    fmt='.2f',
    cmap='RdYlGn',
    vmin=-1, vmax=1,
    center=0,
    square=True,
    linewidths=0.5,
    ax=ax,
    annot_kws={'size': 9},
)
ax.set_title('Pairwise Correlation of Daily Log Returns\n(2018–2024)', fontsize=13)
plt.tight_layout()
plt.show()

6.2 Cluster Map

A clustermap reorders stocks by similarity, revealing natural groupings.

Code
sns.clustermap(
    corr,
    annot=True,
    fmt='.2f',
    cmap='RdYlGn',
    vmin=-1, vmax=1,
    center=0,
    figsize=(9, 8),
    annot_kws={'size': 8},
)
plt.suptitle('Clustered Correlation Matrix', y=1.02, fontsize=13)
plt.show()

6.3 Scatter Plot Matrix for Selected Pairs

Code
selected = ['Samsung Electronics', 'SK Hynix', 'NAVER', 'Kakao']
sub = returns[selected]

fig, axes = plt.subplots(4, 4, figsize=(11, 11))

for i, col_i in enumerate(selected):
    for j, col_j in enumerate(selected):
        ax = axes[i][j]
        if i == j:
            ax.hist(sub[col_i], bins=40, color='steelblue', alpha=0.7, edgecolor='none')
            ax.set_xlabel(col_i if i == 3 else '')
        else:
            x, y = sub[col_j], sub[col_i]
            ax.scatter(x, y, alpha=0.15, s=5, color='navy')
            # regression line
            m, b, r, p, _ = stats.linregress(x, y)
            xr = np.linspace(x.min(), x.max(), 100)
            ax.plot(xr, m * xr + b, color='red', linewidth=1)
            ax.set_title(f'r={r:.2f}', fontsize=8, pad=2)
        if j == 0:
            ax.set_ylabel(col_i, fontsize=8)
        if i == 3:
            ax.set_xlabel(col_j, fontsize=8)

plt.suptitle('Scatter Plot Matrix: Tech & Platform Stocks',
             fontsize=13, y=1.01)
plt.tight_layout()
plt.show()


7 Cross-Correlation Function (CCF)

The CCF tells us whether one stock leads or lags another. This is crucial for understanding information flow in markets.

7.1 CCF for a Single Pair

Code
def plot_ccf(x, y, name_x, name_y, max_lag=30, ax=None):
    """Plot cross-correlation function with confidence bounds."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(10, 4))

    n = len(x)
    conf = 1.96 / np.sqrt(n)
    lags = np.arange(-max_lag, max_lag + 1)

    ccf_vals = []
    for lag in lags:
        if lag == 0:
            ccf_vals.append(np.corrcoef(x, y)[0, 1])
        elif lag > 0:
            ccf_vals.append(np.corrcoef(x[:-lag], y[lag:])[0, 1])
        else:
            ccf_vals.append(np.corrcoef(x[-lag:], y[:lag])[0, 1])

    colors = ['tomato' if abs(v) > conf else 'steelblue' for v in ccf_vals]
    ax.bar(lags, ccf_vals, color=colors, width=0.6)
    ax.axhline(conf,  color='black', linestyle='--', linewidth=0.8, label='95% CI')
    ax.axhline(-conf, color='black', linestyle='--', linewidth=0.8)
    ax.axhline(0, color='black', linewidth=0.5)
    ax.axvline(0, color='gray', linewidth=0.8, linestyle=':')
    ax.set_xlabel('Lag (days)  — positive: X leads Y')
    ax.set_ylabel('Correlation')
    ax.set_title(f'CCF: {name_x} (X) vs {name_y} (Y)', fontsize=11)
    ax.legend(fontsize=8)
    return ccf_vals, lags


x = returns['Samsung Electronics'].values
y = returns['SK Hynix'].values

fig, ax = plt.subplots(figsize=(11, 4))
plot_ccf(x, y, 'Samsung', 'SK Hynix', max_lag=20, ax=ax)
plt.tight_layout()
plt.show()

Reading the CCF plot: Red bars exceed the 95% confidence band — these lags show statistically significant correlation. A red bar at a positive lag k means Samsung's return today is significantly correlated with SK Hynix's return k days later.
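To confirm this reading, here is a synthetic sanity check (not market data): Y is constructed to follow X by exactly three days, and the same lag scan used inside plot_ccf recovers that lag. The helper lagged_corr is defined just for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)

# Y copies X with a 3-day delay plus noise, so X leads Y by 3
y = np.roll(x, 3) + rng.normal(scale=0.5, size=n)
x, y = x[3:], y[3:]  # drop the wrapped-around samples

def lagged_corr(x, y, lag):
    """Correlation between X_t and Y_{t+lag}, same convention as plot_ccf."""
    if lag == 0:
        return np.corrcoef(x, y)[0, 1]
    if lag > 0:
        return np.corrcoef(x[:-lag], y[lag:])[0, 1]
    return np.corrcoef(x[-lag:], y[:lag])[0, 1]

ccf_vals = {lag: lagged_corr(x, y, lag) for lag in range(-10, 11)}
best = max(ccf_vals, key=lambda k: abs(ccf_vals[k]))
print(f"Peak at lag {best}, correlation {ccf_vals[best]:.2f}")  # peak at lag 3
```

The peak lands at lag +3 with a strong positive correlation, exactly the "X leads Y" signature described above.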

7.2 CCF for Multiple Pairs

Code
pairs = [
    ('Samsung Electronics', 'SK Hynix'),
    ('NAVER', 'Kakao'),
    ('Hyundai Motor', 'Kia'),
    ('Samsung Electronics', 'NAVER'),
]

fig, axes = plt.subplots(2, 2, figsize=(14, 8))
axes = axes.flatten()

for ax, (stock_a, stock_b) in zip(axes, pairs):
    x = returns[stock_a].values
    y = returns[stock_b].values
    plot_ccf(x, y, stock_a.split()[0], stock_b.split()[0], max_lag=15, ax=ax)

plt.suptitle('Cross-Correlation Functions: Selected Pairs', fontsize=13, y=1.01)
plt.tight_layout()
plt.show()


8 Rolling Cross-Correlation

The correlation between stocks is not stable over time — it changes with market regimes (bull, bear, crisis). Rolling correlation captures these dynamics.

Code
window = 60  # ~3 months of trading days

fig, axes = plt.subplots(3, 1, figsize=(13, 10), sharex=True)

pairs_rolling = [
    ('Samsung Electronics', 'SK Hynix', 'steelblue'),
    ('NAVER', 'Kakao', 'darkorange'),
    ('Hyundai Motor', 'Kia', 'forestgreen'),
]

for ax, (a, b, color) in zip(axes, pairs_rolling):
    roll_corr = returns[a].rolling(window).corr(returns[b])
    ax.plot(roll_corr.index, roll_corr, color=color, linewidth=1, label=f'{a.split()[0]} vs {b.split()[0]}')
    ax.axhline(0, color='black', linewidth=0.5, linestyle='--')
    ax.fill_between(roll_corr.index, roll_corr, 0, alpha=0.15, color=color)
    ax.set_ylim(-1, 1)
    ax.set_ylabel('Rolling Correlation')
    ax.legend(loc='lower right', fontsize=9)

    # Shade major market events: COVID-19 crash (early 2020), 2022 rate-hike cycle
    ax.axvspan(pd.Timestamp('2020-01-01'), pd.Timestamp('2020-06-30'), alpha=0.08, color='red')
    ax.axvspan(pd.Timestamp('2022-01-01'), pd.Timestamp('2022-12-31'), alpha=0.08, color='gray')

axes[0].set_title(f'{window}-Day Rolling Cross-Correlation', fontsize=13)
axes[-1].set_xlabel('Date')
axes[-1].xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
plt.tight_layout()
plt.show()

Key insight: Correlations spike toward 1.0 during market crises (COVID-19 crash in early 2020). This is the “correlation breakdown” problem — diversification fails precisely when you need it most.
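A stylized simulation reproduces this effect. The two assets below share a common factor whose loading jumps during an artificial "crisis" window (days 500 to 700); nothing here is real data, the point is only that regime-dependent factor exposure makes rolling correlation spike.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1000
common = rng.normal(size=n)  # shared market factor

# Factor loading jumps from 0.3 to 2.0 during the artificial crisis window
beta = np.where((np.arange(n) >= 500) & (np.arange(n) < 700), 2.0, 0.3)

a = beta * common + rng.normal(size=n)
b = beta * common + rng.normal(size=n)

roll = pd.Series(a).rolling(60).corr(pd.Series(b))
calm = roll.iloc[200:500].mean()     # windows before the crisis
crisis = roll.iloc[600:700].mean()   # windows fully inside the crisis

print(f"Calm-period rolling corr:   {calm:.2f}")
print(f"Crisis-period rolling corr: {crisis:.2f}")  # markedly higher
```

When the common factor dominates, idiosyncratic noise matters less and pairwise correlation is pushed toward 1 — the same mechanism behind the COVID-19 spike in the plot above.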


9 Sector-Level Analysis

9.1 Average Within-Sector vs. Cross-Sector Correlation

Code
sectors = {
    'Tech/Semi':   ['Samsung Electronics', 'SK Hynix'],
    'Platform':    ['NAVER', 'Kakao'],
    'Auto':        ['Hyundai Motor', 'Kia'],
    'Finance/Industry': ['KB Financial', 'POSCO'],
}

# Build sector label for each stock
stock_sector = {}
for sector, stocks in sectors.items():
    for s in stocks:
        stock_sector[s] = sector

# Annotated heatmap with sector groupings
ordered_stocks = [s for sector in sectors.values() for s in sector]
corr_ordered = returns[ordered_stocks].corr()

fig, ax = plt.subplots(figsize=(9, 7))
sns.heatmap(
    corr_ordered,
    annot=True,
    fmt='.2f',
    cmap='RdYlGn',
    vmin=-1, vmax=1,
    center=0,
    square=True,
    linewidths=0.5,
    ax=ax,
    annot_kws={'size': 9},
)

# Draw sector boundaries
boundaries = [0, 2, 4, 6, 8]
for b in boundaries:
    ax.axhline(b, color='black', linewidth=2)
    ax.axvline(b, color='black', linewidth=2)

ax.set_title('Correlation Matrix (Stocks Grouped by Sector)', fontsize=13)
plt.tight_layout()
plt.show()


10 Granger Causality

Granger causality tests whether past values of stock X help predict stock Y, beyond what Y’s own past predicts. It is not causality in a strict sense — it’s predictive precedence.

Null hypothesis: X does not Granger-cause Y
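Under the hood, grangercausalitytests compares a restricted regression of Y on its own lags to an unrestricted one that adds X's lags, using an F-test on the residual sums of squares (the ssr_ftest used below). A minimal one-lag sketch on synthetic data where X leads Y by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(size=n)
y = 0.5 * np.roll(x, 1) + rng.normal(size=n)  # X Granger-causes Y at lag 1
x, y = x[1:], y[1:]                            # drop the wrapped-around sample

Y = y[1:]                                      # Y_t
y_lag, x_lag = y[:-1], x[:-1]                  # Y_{t-1}, X_{t-1}

Xr = np.column_stack([np.ones(len(Y)), y_lag])          # restricted: own lag only
Xu = np.column_stack([Xr, x_lag])                       # unrestricted: + X's lag

def rss(X, Y):
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    return resid @ resid

rss_r, rss_u = rss(Xr, Y), rss(Xu, Y)
df_num, df_den = 1, len(Y) - Xu.shape[1]
F = (rss_r - rss_u) / df_num / (rss_u / df_den)
p_value = stats.f.sf(F, df_num, df_den)
print(f"F = {F:.1f}, p = {p_value:.2e}")  # tiny p: reject the null
```

Because adding X's lag sharply reduces the residual variance, the F statistic is large and the null of "no Granger causality" is rejected, which is exactly what a low ssr_ftest p-value means in the table below.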

Code
def granger_table(returns_df, stocks, max_lag=5):
    """Return a DataFrame of Granger causality p-values."""
    results = []
    for cause in stocks:
        for effect in stocks:
            if cause == effect:
                continue
            data = returns_df[[effect, cause]].dropna()
            test = grangercausalitytests(data, maxlag=max_lag, verbose=False)
            # use minimum p-value across lags (F-test)
            min_p = min(test[lag][0]['ssr_ftest'][1] for lag in range(1, max_lag + 1))
            best_lag = min(range(1, max_lag + 1), key=lambda l: test[l][0]['ssr_ftest'][1])
            results.append({
                'Cause': cause.split()[0],
                'Effect': effect.split()[0],
                'Min p-value': round(min_p, 4),
                'Best lag': best_lag,
                'Significant': 'Yes' if min_p < 0.05 else 'No',
            })
    return pd.DataFrame(results)


selected_stocks = ['Samsung Electronics', 'SK Hynix', 'NAVER', 'Kakao']
granger_df = granger_table(returns, selected_stocks, max_lag=5)
granger_df.sort_values('Min p-value')
Cause Effect Min p-value Best lag Significant
7 NAVER SK 0.0007 3 Yes
6 NAVER Samsung 0.0011 1 Yes
10 Kakao SK 0.0034 1 Yes
9 Kakao Samsung 0.0114 1 Yes
0 Samsung SK 0.1435 1 No
3 SK Samsung 0.1975 2 No
1 Samsung NAVER 0.2485 2 No
4 SK NAVER 0.3046 2 No
11 Kakao NAVER 0.3373 5 No
2 Samsung Kakao 0.6116 3 No
8 NAVER Kakao 0.6770 1 No
5 SK Kakao 0.8934 1 No
Code
# Visualize as a heatmap of p-values
pivot = granger_df.pivot(index='Cause', columns='Effect', values='Min p-value')

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    pivot,
    annot=True,
    fmt='.3f',
    cmap='RdYlGn_r',  # red = low p-value = significant
    vmin=0, vmax=0.1,
    linewidths=0.5,
    ax=ax,
    annot_kws={'size': 10},
)
ax.set_title('Granger Causality: Min p-value\n(Row = Cause, Column = Effect)', fontsize=11)
plt.tight_layout()
plt.show()

Caution: Granger causality in daily stock returns is hard to find due to market efficiency. Significant results are more common in intraday data or between related instruments (ETFs and their underlying stocks).


11 Key Takeaways

Concept                          | What it measures          | Limitation
---------------------------------|---------------------------|----------------------------
Pearson correlation              | Co-movement at lag 0      | Static; hides time dynamics
Cross-correlation function (CCF) | Co-movement at all lags   | Assumes stationarity
Rolling correlation              | Time-varying co-movement  | Window choice is arbitrary
Granger causality                | Predictive precedence     | Not structural causation

Empirical findings for Korean blue chips (2018–2024):

  1. Intra-sector correlations are high: Samsung–SK Hynix (semiconductors) and Hyundai–Kia (auto) move closely together, reflecting shared business cycles and news.
  2. Tech platform stocks (NAVER, Kakao) are moderately correlated but diverge more than the hardware pairs.
  3. Crisis periods compress all correlations toward 1 — the diversification benefit collapses exactly when markets fall.
  4. Lag effects are generally weak in daily data — consistent with the efficient market hypothesis.
  5. Rolling correlations are highly non-stationary — a static correlation matrix is a snapshot, not a permanent property.

12 Next Steps

  • Volatility cross-correlation: Do volatility spikes (GARCH residuals) spread across stocks faster than return co-movements?
  • High-frequency data: Lead-lag effects are much stronger at 1–5 minute intervals
  • Copulas: Model non-linear tail dependence beyond linear correlation
  • Network analysis: Build a correlation network graph and detect communities

13 References

  • Campbell, J. Y., Lo, A. W., & MacKinlay, A. C. (1997). The Econometrics of Financial Markets. Princeton University Press.
  • Cont, R. (2001). Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance, 1(2), 223–236.
  • Granger, C. W. J. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37(3), 424–438.