ETF Screening Process and Key Points Overview

    1. Basic Data Acquisition and Preliminary Filtering

Retrieve ETF list: Use get_all_securities(['etf']) to get all ETFs on the market, then keep those established before January 1, 2023 (start_date < 2023-01-01) to ensure sufficient historical data.
Exclude low-liquidity ETFs: Manually remove specific ETFs with very low average trading volume (e.g., 159003.XSHE China Merchants Fast Track ETF, 159005.XSHE Huatai Wealth Quick Money ETF, etc.; average volume ≤ 2.92kw, i.e. 29.2 million).
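
The filtering above can be sketched as follows. This is a minimal illustration, not the author's code: it assumes the frame returned by get_all_securities(['etf']) is indexed by ETF code and has a start_date column, and it substitutes a small hand-made frame so the snippet runs outside the data platform. The helper name filter_etf_list and the sample rows are hypothetical.

```python
import pandas as pd

def filter_etf_list(securities, cutoff="2023-01-01", exclude=()):
    """Keep ETFs listed before `cutoff` that are not manually excluded."""
    start = pd.to_datetime(securities["start_date"])
    listed_early = start < pd.Timestamp(cutoff)
    not_excluded = ~securities.index.isin(list(exclude))
    return securities[listed_early & not_excluded]

# Hypothetical stand-in for get_all_securities(['etf']).
# Codes ending in .TEST and all dates are placeholders, not real listing data.
securities = pd.DataFrame(
    {
        "display_name": ["ETF A", "ETF B", "ETF C", "ETF D"],
        "start_date": ["2012-05-28", "2019-03-01", "2023-06-15", "2014-10-20"],
    },
    index=["000001.TEST", "000002.TEST", "000003.TEST", "159003.XSHE"],
)

# Manual low-liquidity exclusions named in the text.
low_liquidity = ["159003.XSHE", "159005.XSHE"]

df = filter_etf_list(securities, exclude=low_liquidity)
print(df.index.tolist())
```

On the data platform, the hand-made frame would presumably be replaced by the real get_all_securities(['etf']) result.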

    2. Daily ETF Data and Return Calculation
      Data Range: Obtain closing prices for the 240 trading days up to and including the current date (today).
      Return Processing: Calculate daily returns (pchg = close.pct_change()) and assemble them into an ETF return matrix (the variable is named prices, but it holds returns; rows = trading days, columns = ETF codes).
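
A minimal sketch of the return matrix, assuming the closing prices are already arranged as a wide frame (one column per ETF code). The synthetic prices and the helper name build_return_matrix are illustrative assumptions; the result is stored in prices only to mirror the naming used in the text.

```python
import numpy as np
import pandas as pd

def build_return_matrix(closes):
    """Daily percentage returns; rows = trading days, columns = ETF codes."""
    returns = closes.sort_index().pct_change()
    return returns.iloc[1:]  # the first row has no previous close, so drop it

# Synthetic closes for 240 business days and three placeholder codes.
rng = np.random.default_rng(0)
days = pd.bdate_range("2024-01-01", periods=240)
closes = pd.DataFrame(
    100 * np.cumprod(1 + rng.normal(0, 0.01, size=(240, 3)), axis=0),
    index=days,
    columns=["AAA", "BBB", "CCC"],
)

prices = build_return_matrix(closes)  # holds returns despite the name
print(prices.shape)
```
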
    3. K-Means Clustering for Deduplication (Based on Similarity in Trends)
      Clustering Goal: Group ETFs with similar trends so that near-duplicates can be removed.
      Parameters: Set the number of clusters to n_clusters=30 (too few clusters would merge dissimilar ETFs) and run KMeans with random_state=42 for reproducibility.
      Within-Cluster Selection: Keep only the earliest established ETF in each cluster, because:
    • Earlier establishment → usually higher trading volume (better liquidity);
    • Earlier establishment → more historical data (better for model training).
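
The clustering and within-cluster selection might look like the sketch below. It assumes scikit-learn's KMeans, treats each ETF as one sample whose features are its daily returns, and fills missing returns with 0; none of these details are confirmed by the text. The helper earliest_per_cluster and the synthetic data are hypothetical, and n_clusters is set to 3 here only because the toy data has six ETFs.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def earliest_per_cluster(returns, start_dates, n_clusters=30, random_state=42):
    """Cluster ETFs on their return series and keep the oldest ETF per cluster."""
    X = returns.fillna(0).T.values  # one row per ETF, one column per trading day
    model = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=10)
    labels = model.fit_predict(X)
    info = pd.DataFrame({
        "cluster": pd.Series(labels, index=returns.columns),
        "start_date": pd.to_datetime(start_dates.reindex(returns.columns)),
    })
    # Sort by establishment date, then take the first row of every cluster.
    kept = info.sort_values("start_date").groupby("cluster").head(1)
    return info, kept.index.tolist()

# Synthetic returns: three pairs of near-identical ETFs (placeholder codes).
rng = np.random.default_rng(0)
days = pd.bdate_range("2024-01-01", periods=240)
factors = rng.normal(0, 0.02, size=(240, 3))
codes = ["A1", "A2", "B1", "B2", "C1", "C2"]
data = np.repeat(factors, 2, axis=1) + rng.normal(0, 0.001, size=(240, 6))
returns = pd.DataFrame(data, index=days, columns=codes)
start_dates = pd.Series(
    ["2010-01-04", "2015-01-05", "2011-01-04", "2016-01-04", "2012-01-04", "2017-01-03"],
    index=codes,
)

info, kept = earliest_per_cluster(returns, start_dates, n_clusters=3)
print(kept)
```
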
    4. Silhouette Score Evaluation of Clustering Effectiveness
      Calculate the silhouette score: approximately 0.4512 on a scale from -1 to 1 (a moderate level, indicating decent compactness and separation but room for improvement).
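
A hedged sketch of the evaluation, assuming scikit-learn's silhouette_score applied to the same ETF-by-day matrix that was clustered. The data is synthetic (three tight groups), so the scores printed here are unrelated to the 0.4512 reported above. Comparing a few candidate cluster counts is shown as one possible way to probe the "room for improvement"; it is not described in the original.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic matrix: 12 ETFs (rows) x 240 days (columns), in 3 tight groups.
rng = np.random.default_rng(1)
factors = rng.normal(0, 0.02, size=(3, 240))
X = np.repeat(factors, 4, axis=0) + rng.normal(0, 0.001, size=(12, 240))

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    # Mean silhouette over all samples; values lie in [-1, 1].
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(float(s), 4))
```
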
    5. Secondary Filtering Based on Correlation (Further Reduce Correlation)
      Correlation matrix: Compute the correlation matrix of ETF returns (corr = prices[df.code].corr()).
      High-correlation pairs: For each pair with correlation > 0.85, keep only the ETF with the earlier establishment date and remove the other (e.g., remove 159922.XSHE, 512100.XSHG, etc.).
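
One literal reading of this rule is sketched below: every pair above the threshold contributes its later-established member to a drop set. The helper drop_correlated, the tie-breaking choice, and the synthetic data are assumptions for illustration. Other readings (for example, re-checking correlations after each removal) would give different results on chains of correlated ETFs.

```python
import numpy as np
import pandas as pd

def drop_correlated(returns, start_dates, threshold=0.85):
    """For each pair with correlation above `threshold`, drop the newer ETF."""
    corr = returns.corr()
    start = pd.to_datetime(start_dates.reindex(corr.columns))
    codes = list(corr.columns)
    to_drop = set()
    for i, a in enumerate(codes):
        for b in codes[i + 1:]:
            if corr.loc[a, b] > threshold:
                # On equal dates the second code of the pair is dropped (arbitrary).
                to_drop.add(a if start[a] > start[b] else b)
    kept = [c for c in codes if c not in to_drop]
    return kept, sorted(to_drop)

# Synthetic returns: X and Y move together, Z is independent (placeholder codes).
rng = np.random.default_rng(2)
base = rng.normal(0, 0.01, size=240)
returns = pd.DataFrame({
    "X": base,
    "Y": base + rng.normal(0, 0.002, size=240),
    "Z": rng.normal(0, 0.01, size=240),
})
start_dates = pd.Series(["2015-06-01", "2012-03-01", "2018-09-03"], index=["X", "Y", "Z"])

kept, dropped = drop_correlated(returns, start_dates)
print(kept, dropped)
```
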
    6. Optional: Filter ETFs Established Later (Improve Data Quality)
      Threshold: Remove ETFs established after 2020 (e.g., 513060.XSHG Hang Seng Healthcare, 515790.XSHG Photovoltaic ETF, etc.) so that the remaining ETFs have a longer history, which is useful for model training.
    7. Notes and Additional Recommendations
      Special Handling for Treasury Bond ETFs: If the pool is used for model training, exclude 511010.XSHG Treasury Bond ETF. Its price path is nearly linear (similar to Yu’ebao) with minimal volatility, which can interfere with the model’s learning of volatility features and is unnecessary for prediction.
      Handling Downward-Trending ETFs: The results may include long-term declining ETFs (e.g., healthcare ETF, real estate ETF). Whether to exclude depends on strategy goals:
    • For stable returns, consider removing;
    • If the strategy performs well even with declining ETFs, it indicates robustness (but beware of look-ahead bias, the “future function” risk: there is no way to know in advance whether a declining ETF will reverse).
      Visualization for validation: Plot remaining ETFs’ price charts (e.g., since 2017) to manually verify if correlations and distributions meet expectations (low correlation, reasonable spread).
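
A minimal plotting sketch for this check, assuming matplotlib is available and that closing prices sit in a wide frame indexed by date. Rebasing every series to 1.0 at the start of the window is an added convenience so that ETFs at different price levels share one axis; it is not mentioned in the original. The helper plot_rebased and the synthetic prices are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt

def plot_rebased(closes, since="2017-01-01"):
    """Plot each ETF's close rebased to 1.0 at the first day on or after `since`."""
    window = closes.loc[since:]
    rebased = window / window.iloc[0]
    ax = rebased.plot(figsize=(10, 5), title="Remaining ETFs since " + since)
    ax.set_ylabel("Close (rebased to 1.0)")
    return rebased, ax

# Synthetic closes starting in 2016 for three placeholder codes.
rng = np.random.default_rng(3)
days = pd.bdate_range("2016-01-01", periods=600)
closes = pd.DataFrame(
    100 * np.cumprod(1 + rng.normal(0, 0.01, size=(600, 3)), axis=0),
    index=days,
    columns=["AAA", "BBB", "CCC"],
)

rebased, ax = plot_rebased(closes)
ax.figure.savefig("remaining_etfs.png")  # inspect the chart manually
```
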
      Final Filtering Summary:
      The four steps (initial filtering → clustering deduplication → secondary correlation filtering → optional establishment-date filtering) yield a pool of ETFs with good liquidity, low trend correlation, and sufficient historical data. The core goal is to provide diverse, high-quality underlying assets for strategies or models.