luismi-97/Clustering-Music

Apple Music Favourites Clustering

A Python-based unsupervised machine learning pipeline that automatically clusters your Apple Music library based on audio characteristics. Discover patterns in your music taste by segmenting your favourite tracks into meaningful groups.


Overview

This project takes your Apple Music library export and applies K-Means clustering to organize tracks based on their audio features (energy, tempo, danceability, etc.). The pipeline enriches your local library data with Spotify audio features, preprocesses the data, and visualizes clusters in 2D space.

Key Features

  • Automatic Library Parsing: Reads Apple Music Library.xml exports
  • Smart Track Filtering: Multiple strategies to select favourite tracks (loved, rated, most-played)
  • Spotify Enrichment: Pulls detailed audio features for each track via Spotify API
  • Intelligent Preprocessing: Handles missing data and normalizes features
  • K-Means Clustering: Automated optimal cluster selection using silhouette scores
  • Interactive Visualization: 2D cluster plots with hover information and radar charts for cluster profiles

Architecture

Data Flow:
┌──────────────────┐
│ Library.xml      │
│ (Apple Music)    │
└────────┬─────────┘
         │
         ▼
┌──────────────────────┐
│ INGEST               │
│ Parse XML, filter    │
│ by strategy          │
└────────┬─────────────┘
         │
         ▼
┌──────────────────────┐
│ ENRICH               │
│ Query Spotify API    │
│ fetch audio features │
│ (cached)             │
└────────┬─────────────┘
         │
         ▼
┌──────────────────────┐
│ PREPROCESS           │
│ Drop missing data,   │
│ fill NaN, normalize  │
└────────┬─────────────┘
         │
         ▼
┌──────────────────────┐
│ CLUSTERING           │
│ K-Means evaluation   │
│ silhouette scoring   │
└────────┬─────────────┘
         │
         ▼
┌──────────────────────┐
│ VISUALIZE            │
│ 2D scatter plots     │
│ cluster profiles     │
└──────────────────────┘

Project Structure

.
├── ingest.py           # Load and filter Apple Music library
├── enrich.py           # Spotify API integration & feature extraction
├── preprocess.py       # Data cleaning & normalization
├── kmeans.py           # Clustering & evaluation
├── plot.py             # Visualization (static & interactive)
├── pipeline.py         # Main orchestration script
├── config.py           # Global configuration & API credentials
├── Library.xml         # Your Apple Music library export
│
├── data/
│   ├── raw/            # Input data directory
│   └── cache/          # Cached Spotify features
│
└── outputs/
    ├── enriched_tracks.csv        # After Spotify enrichment
    ├── clustered_tracks.csv       # Final output with cluster labels
    ├── clusters.html              # Interactive scatter plot
    ├── clusters.png               # Static cluster visualization
    ├── cluster_profiles.png       # Radar chart of cluster characteristics
    └── k_selection.png            # Elbow & silhouette plots

Installation

Requirements

  • Python 3.8+
  • pip

Setup

  1. Clone/Download this repository

  2. Create a virtual environment (recommended)

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install pandas scikit-learn matplotlib plotly spotipy numpy

    Optional (for enhanced clustering visualization):

    pip install umap-learn
  4. Configure Spotify API credentials

    • Go to Spotify Developer Dashboard
    • Create an app and get your Client ID and Client Secret
    • Add them to config.py:
      SPOTIFY_CLIENT_ID = 'your_client_id_here'
      SPOTIFY_CLIENT_SECRET = 'your_client_secret_here'
  5. Export your Apple Music library

    • Open Apple Music on macOS
    • Select File → Library → Export Library…
    • Save as Library.xml in your project directory

Usage

Basic Usage

python pipeline.py --library Library.xml

With Options

# Cluster by highly-rated tracks
python pipeline.py --library Library.xml --strategy rated

# Skip Spotify enrichment (use cached data from previous run)
python pipeline.py --library Library.xml --no-enrich

# Manually set number of clusters
python pipeline.py --library Library.xml --k 8

# Run without generating visualizations
python pipeline.py --library Library.xml --no-plot

# Combine options
python pipeline.py --library Library.xml --strategy played --k 6 --no-enrich

Command-Line Arguments

Argument      Default    Options                     Description
--library     required   path                        Path to Apple Music Library.xml export
--strategy    loved      loved, rated, played, all   How to define favourite tracks
--k           auto       auto or integer             Number of clusters (auto = silhouette-based selection)
--no-enrich   False      flag                        Skip Spotify API calls (use cache)
--no-plot     False      flag                        Skip all visualizations
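
As an illustration of how the arguments above could be declared, here is a minimal argparse sketch. It is not taken from pipeline.py; the helper names build_parser and parse_k are hypothetical, and only the flags, defaults, and choices come from the table.

```python
import argparse


def parse_k(value: str):
    """Accept the literal 'auto' or an integer >= 2 (hypothetical helper)."""
    if value == "auto":
        return "auto"
    k = int(value)  # a ValueError here is reported by argparse as a usage error
    if k < 2:
        raise argparse.ArgumentTypeError("k must be at least 2")
    return k


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags listed in the table above."""
    parser = argparse.ArgumentParser(description="Cluster Apple Music favourites")
    parser.add_argument("--library", required=True,
                        help="Path to Apple Music Library.xml export")
    parser.add_argument("--strategy", default="loved",
                        choices=["loved", "rated", "played", "all"],
                        help="How to define favourite tracks")
    parser.add_argument("--k", default="auto", type=parse_k,
                        help="Number of clusters, or 'auto'")
    parser.add_argument("--no-enrich", action="store_true",
                        help="Skip Spotify API calls (use cache)")
    parser.add_argument("--no-plot", action="store_true",
                        help="Skip all visualizations")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Using store_true for the two --no-* flags gives them the False default shown in the table without any extra code.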

Configuration

Edit config.py to customize the pipeline:

Directories

DATA_DIR = ROOT / 'data' / 'raw'         # Input directory
CACHE_DIR = ROOT / 'data' / 'cache'      # Spotify feature cache
OUTPUT_DIR = ROOT / 'outputs'            # Results directory
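
The snippet above uses ROOT without showing its definition. One common choice, assumed here and not confirmed by the source, is the folder containing config.py, i.e. Path(__file__).resolve().parent. The sketch below builds the same three paths from any root and creates them if needed; make_dirs is a hypothetical helper.

```python
from pathlib import Path


def make_dirs(root: Path) -> dict:
    """Build the three pipeline directories under `root` and create them."""
    dirs = {
        "DATA_DIR": root / "data" / "raw",     # input directory
        "CACHE_DIR": root / "data" / "cache",  # Spotify feature cache
        "OUTPUT_DIR": root / "outputs",        # results directory
    }
    for path in dirs.values():
        # parents=True creates data/ as well; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
    return dirs


if __name__ == "__main__":
    # Assumption: ROOT is the directory that holds this file.
    ROOT = Path(__file__).resolve().parent
    print(make_dirs(ROOT))
```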

API Credentials

SPOTIFY_CLIENT_ID = 'your_id'
SPOTIFY_CLIENT_SECRET = 'your_secret'

Audio Features

FEATURES = [
    'tempo',              # BPM (0-210)
    'energy',             # 0-1 (intensity & activity)
    'valence',            # 0-1 (positivity/happiness)
    'danceability',       # 0-1 (rhythm regularity)
    'acousticness',       # 0-1 (non-electric)
    'instrumentalness',   # 0-1 (no vocals)
    'loudness',           # dB (-60 to 0)
    'speechiness',        # 0-1 (spoken words)
]

Clustering Parameters

K_RANGE = range(2, 15)              # Test k values 2-14
KMEANS_INIT = 'k-means++'           # Initialization method
KMEANS_N_INIT = 10                  # Number of re-runs
KMEANS_MAX_ITER = 300               # Max iterations
RANDOM_STATE = 42                   # Reproducibility
K_OVERRIDE = None                   # Force specific k (overrides auto-selection)

Dimensionality Reduction for Visualization

DIM_REDUCTION = 'umap'              # 'pca', 'umap', or 'tsne'
N_COMPONENTS = 2                    # 2D visualization

Module Reference

ingest.py

Parses Apple Music Library.xml and filters tracks.

Key Functions:

  • load_library(xml_path) → List[Track]

    • Loads all tracks from Library.xml
    • Excludes podcasts, audiobooks, and videos
  • filter_favourites(tracks, filter_by) → List[Track]

    • 'loved': Only tracks marked as loved
    • 'rated': Tracks with rating ≥ 50
    • 'played': Top 25% most-played tracks
    • 'all': All tracks
  • convert_dataframe(tracks) → pd.DataFrame
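
The loading and filtering steps above can be sketched with the standard-library plistlib, since Library.xml is a property list. This is an illustration, not the code in ingest.py: the function names are hypothetical, tracks are plain dicts, and the plist keys used ("Tracks", "Name", "Artist", "Play Count", "Rating", "Loved", "Podcast", "Has Video") are assumptions about the export format. Audiobook exclusion is omitted.

```python
import plistlib


def load_tracks_sketch(xml_path):
    """Read a Library.xml plist and return music tracks as plain dicts."""
    with open(xml_path, "rb") as fh:
        library = plistlib.load(fh)
    tracks = []
    for entry in library.get("Tracks", {}).values():
        # Assumed flags marking non-music entries in the export
        if entry.get("Podcast") or entry.get("Has Video"):
            continue
        tracks.append({
            "name": entry.get("Name", ""),
            "artist": entry.get("Artist", ""),
            "play_count": entry.get("Play Count", 0),
            "rating": entry.get("Rating", 0),
            "loved": bool(entry.get("Loved", False)),
        })
    return tracks


def filter_favourites_sketch(tracks, filter_by="loved"):
    """Apply one of the four strategies described above."""
    if filter_by == "loved":
        return [t for t in tracks if t["loved"]]
    if filter_by == "rated":
        return [t for t in tracks if t["rating"] >= 50]
    if filter_by == "played":
        # Top 25% by play count, keeping at least one track when any exist
        ranked = sorted(tracks, key=lambda t: t["play_count"], reverse=True)
        keep = max(1, len(ranked) // 4) if ranked else 0
        return ranked[:keep]
    if filter_by == "all":
        return list(tracks)
    raise ValueError(f"unknown strategy: {filter_by!r}")
```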

enrich.py

Fetches audio features from Spotify API with intelligent caching.

Key Functions:

  • spotify_client() → spotipy.Spotify

    • Initializes authenticated Spotify client
  • audio_features(name, artist, sp) → dict

    • Searches Spotify for track and extracts audio features
    • Returns cached result if available
  • enrich_df(df, delay=0.1) → pd.DataFrame

    • Enriches dataframe with Spotify features
    • Rate-limiting: 0.1s delay between API calls
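
The cache-first behaviour described for audio_features can be illustrated as follows. This is a sketch under assumptions, not the logic in enrich.py: the cache is a single JSON file, the key format is invented, and the Spotify lookup is replaced by a caller-supplied fetch function so the example runs without credentials.

```python
import json
from pathlib import Path


def cache_key(name: str, artist: str) -> str:
    """Normalise artist and title so small spelling differences share a key."""
    return f"{artist.strip().lower()}::{name.strip().lower()}"


def cached_features(name, artist, cache_path: Path, fetch):
    """Return features from the cache on a hit; call `fetch` only on a miss."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    key = cache_key(name, artist)
    if key in cache:
        # Cache hit: no API call is made
        return cache[key]
    # Cache miss: fetch once, then persist the result for later runs
    features = fetch(name, artist)
    cache[key] = features
    cache_path.write_text(json.dumps(cache))
    return features
```

Testing membership with `key in cache` rather than truthiness means a track with no Spotify match (stored as null) is also remembered and not searched for again.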

preprocess.py

Data cleaning and normalization.

Key Functions:

  • drop_missing(df, limit=0.5) → pd.DataFrame

    • Drops rows missing >50% of features
  • fill_missing(df) → pd.DataFrame

    • Imputes missing values with median
  • normalise(df) → (pd.DataFrame, MinMaxScaler)

    • Scales features to [0, 1] range
  • full_preprocess(df) → (pd.DataFrame, np.ndarray, MinMaxScaler)

    • Orchestrates all preprocessing steps
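
The three preprocessing steps can be sketched in a few lines of pandas and scikit-learn. This is an illustrative reading of the descriptions above, not the code in preprocess.py; the function name and the explicit features argument are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def preprocess_sketch(df: pd.DataFrame, features, limit: float = 0.5):
    """Drop sparse rows, impute with the median, then scale to [0, 1]."""
    # Share of missing feature values per row
    missing_share = df[features].isna().mean(axis=1)
    # Keep rows missing at most `limit` of their features
    kept = df.loc[missing_share <= limit].copy()
    # Fill remaining gaps with each column's median
    kept[features] = kept[features].fillna(kept[features].median())
    # Min-max scale every feature to the [0, 1] range
    scaler = MinMaxScaler()
    X = scaler.fit_transform(kept[features])
    kept[features] = X
    return kept, X, scaler


if __name__ == "__main__":
    demo = pd.DataFrame({
        "name": ["a", "b", "c", "d"],
        "tempo": [100.0, 150.0, None, 200.0],
        "energy": [0.2, None, None, 0.8],
    })
    cleaned, X, _ = preprocess_sketch(demo, ["tempo", "energy"])
    print(cleaned)
```

Scaling matters here because K-Means uses Euclidean distance: without it, tempo (in BPM) would dominate features that already lie in [0, 1].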

kmeans.py

K-Means clustering with automatic k-selection.

Key Functions:

  • k_range(X) → dict

    • Evaluates k values 2-14
    • Returns inertia & silhouette scores
  • best_k(results) → int

    • Selects k with highest silhouette score
  • fit_kmeans(X, k) → sklearn.cluster.KMeans

    • Trains K-Means model
  • assign_clusters(df, X, km) → pd.DataFrame

    • Adds cluster labels to dataframe
  • run_clustering(df, X, plot=True) → (pd.DataFrame, KMeans)

    • Main clustering pipeline
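
Silhouette-based k selection, as described above, can be sketched like this. The function names are hypothetical stand-ins for k_range and best_k; the KMeans settings mirror the values listed under Clustering Parameters, and the demo data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def evaluate_k(X, k_values, random_state=42):
    """Fit K-Means for each candidate k and record inertia and silhouette."""
    results = {}
    for k in k_values:
        km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    max_iter=300, random_state=random_state).fit(X)
        results[k] = {
            "inertia": km.inertia_,
            "silhouette": silhouette_score(X, km.labels_),
        }
    return results


def pick_best_k(results):
    """Return the k with the highest silhouette score."""
    return max(results, key=lambda k: results[k]["silhouette"])


if __name__ == "__main__":
    # Synthetic data: three tight, well-separated groups
    rng = np.random.default_rng(0)
    centers = np.array([[0, 0], [10, 0], [0, 10]])
    X = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])
    print(pick_best_k(evaluate_k(X, range(2, 7))))
```

Inertia always decreases as k grows, so it cannot be maximised or minimised directly; the silhouette score has an interior peak, which is why it is used for the automatic choice.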

plot.py

Visualization of clusters and cluster characteristics.

Key Functions:

  • plot_clusters(df, X, km, save=True)

    • Attempts interactive plot; falls back to static
    • Automatically reduces dimensions via PCA/UMAP/t-SNE
  • plot_iteractive(df, X, km, save=True)

    • Plotly scatter plot with hover information
    • HTML output for exploration
  • plot_static(df, X, km, save=True)

    • Matplotlib fallback
  • plot_clusters_profiles(df, km, save=True)

    • Radar chart showing cluster feature profiles
    • Compares characteristic patterns across clusters
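
The dimensionality reduction step with an optional UMAP dependency could look like the sketch below. It is an assumption about how such a fallback might be written, not the code in plot.py; reduce_2d is a hypothetical name, and the fixed t-SNE perplexity of 30 follows the value mentioned under Known Issues.

```python
import numpy as np
from sklearn.decomposition import PCA


def reduce_2d(X, method="umap", random_state=42):
    """Project X to 2D; fall back to PCA if umap-learn is not installed."""
    if method == "umap":
        try:
            import umap  # provided by the optional umap-learn package
            reducer = umap.UMAP(n_components=2, random_state=random_state)
            return reducer.fit_transform(X)
        except ImportError:
            method = "pca"  # degrade gracefully instead of failing
    if method == "tsne":
        from sklearn.manifold import TSNE
        # perplexity must be smaller than the number of samples
        return TSNE(n_components=2, perplexity=30,
                    random_state=random_state).fit_transform(X)
    if method == "pca":
        return PCA(n_components=2, random_state=random_state).fit_transform(X)
    raise ValueError(f"unknown method: {method!r}")


if __name__ == "__main__":
    X = np.random.default_rng(0).random((40, 8))
    print(reduce_2d(X, method="pca").shape)
```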

pipeline.py

Main orchestration script that runs the full pipeline.


Outputs

CSV Files

enriched_tracks.csv

  • Original track metadata + Spotify audio features
  • Columns: name, artist, album, genre, year, play_count, loved, rating, duration, + audio features

clustered_tracks.csv

  • Same as enriched + cluster column with integer cluster labels (0 to k-1)

Visualizations

clusters.html (Interactive)

  • Plotly scatter plot with 2D reduced features
  • Hover over points to see track name and artist
  • Color-coded by cluster

clusters.png (Static)

  • Matplotlib scatter plot fallback
  • Useful for static reports or sharing

cluster_profiles.png (Radar Chart)

  • Shows normalized mean feature values per cluster
  • Helps understand what makes each cluster distinct
  • Example: Cluster 0 might be "high-energy, high-danceability" while Cluster 1 is "acoustic, low-tempo"
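
The per-cluster means behind such a radar chart can be computed with a single groupby. This is a sketch: the column name cluster matches clustered_tracks.csv as described above, while the function name and the demo values are made up for illustration.

```python
import pandas as pd


def cluster_profiles_sketch(df: pd.DataFrame, features, label_col="cluster"):
    """Mean of each (already normalised) feature per cluster."""
    return df.groupby(label_col)[features].mean()


if __name__ == "__main__":
    demo = pd.DataFrame({
        "cluster": [0, 0, 1, 1],
        "energy": [0.9, 0.7, 0.2, 0.4],
        "acousticness": [0.1, 0.3, 0.8, 0.6],
    })
    # One row per cluster, one column per feature
    print(cluster_profiles_sketch(demo, ["energy", "acousticness"]))
```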

k_selection.png (Evaluation Metrics)

  • Left: Inertia vs k (elbow curve)
  • Right: Silhouette score vs k (peak = optimal k)

Known Issues & Future Work

Current Limitations

  • Metrics & Evaluation: Silhouette score is the only evaluation metric; Davies-Bouldin, Calinski-Harabasz, or domain-based validation are not yet implemented
  • Error Handling: Spotify API failures don't gracefully degrade; could build fallback feature extraction
  • Cache Logic Bug: enrich.py line 30 returns cached data only if NOT found (inverted logic)
  • Dimensionality Reduction: t-SNE perplexity hardcoded to 30; should adapt to dataset size
  • Apple Music Metadata: Genre/year often missing; enrichment relies entirely on Spotify matches
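
For the hardcoded t-SNE perplexity noted in the list above, one possible fix is to derive it from the number of tracks. The rule used here, roughly a third of the sample count capped at 30, is an arbitrary heuristic chosen for illustration and not something the project prescribes; the only hard constraint is that scikit-learn requires perplexity to be smaller than the number of samples.

```python
def adaptive_perplexity(n_samples: int, default: float = 30.0) -> float:
    """Pick a t-SNE perplexity that stays below n_samples.

    Heuristic (an assumption): about a third of the sample count,
    never below 1 and never above `default`.
    """
    if n_samples < 2:
        raise ValueError("t-SNE needs at least 2 samples")
    return float(min(default, max(1.0, (n_samples - 1) / 3)))


if __name__ == "__main__":
    for n in (4, 10, 400):
        print(n, adaptive_perplexity(n))
```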

Planned Improvements

  • Add clustering validation metrics (Davies-Bouldin Index, Calinski-Harabasz Index)
  • Implement cross-validation for robust cluster assignment
  • Improve Spotify matching (fuzzy string matching for artist/track names)
  • Support for Apple Music audio features directly (if API available)
  • Automated report generation (PDF with all plots + statistics)
  • Web interface for interactive exploration
  • Batch processing for large libraries (>5000 tracks)
  • Comparison of clustering algorithms (DBSCAN, Hierarchical)
  • Feature importance analysis (which features drive cluster separation?)
  • Handling for edge cases (instrumental tracks, remixes, live versions)

Troubleshooting

"File not found: Library.xml"

Ensure you've exported your Apple Music library and placed it in the project directory. See Installation step 5.

"Could not import umap"

UMAP is optional. Install it with:

pip install umap-learn

Or switch DIM_REDUCTION to 'pca' in config.py.

Spotify API Errors

  • Verify SPOTIFY_CLIENT_ID and SPOTIFY_CLIENT_SECRET are correct in config.py
  • Check Spotify Developer Dashboard for app status
  • If rate-limited, increase delay in enrich.py line 35
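
Instead of raising the fixed delay by hand, rate-limit errors can also be handled by retrying with a growing pause. The helper below is a generic sketch, not part of enrich.py: call_with_backoff is a hypothetical name, and it catches any Exception for brevity, whereas real code would narrow this to the client's rate-limit error.

```python
import time


def call_with_backoff(func, *args, retries=4, base_delay=0.1,
                      sleep=time.sleep, **kwargs):
    """Call func, doubling the wait after each failure (exponential backoff)."""
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Give up after the final attempt and surface the original error
            if attempt == retries - 1:
                raise
            # Waits base_delay, then 2x, 4x, ... before the next try
            sleep(base_delay * 2 ** attempt)
```

The sleep parameter exists only so the waiting can be swapped out, for example in tests; by default it is time.sleep.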

No Spotify Matches Found

Some tracks (especially from smaller or independent artists) may not be available on Spotify. Tracks without a match have no audio features and are dropped during preprocessing. To see how many were lost, check the console output after enrichment.

Memory Issues with Large Libraries

If clustering a very large library (>10k tracks):

  • Set K_RANGE = range(2, 8) to speed up evaluation
  • Use --no-plot to skip expensive dimensionality reduction
  • Consider filtering tracks: --strategy rated instead of --strategy all

Example Outputs

Console Output

---------- INGEST ----------
[ingest] | Loaded 2,345 tracks from Library.xml successfully.
[filter] | Filtered the list of tracks by loved giving 412 items.

---------- ENRICH ----------
[enrich] | Enriched 398 tracks, 14 had no Spotify match.

---------- PREPROCESS ----------
[preprocess] | Dropped 8 tracks with insufficient features; 390 tracks still left.
[preprocess] | Normalised 8 features.

---------- CLUSTER ----------
[clustering] | Evaluating list in [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
k =  2 | inertia = 245.3 | silhouette = 0.6421
k =  5 | inertia = 178.2 | silhouette = 0.7182
k =  8 | inertia = 156.9 | silhouette = 0.6954
[clustering] | Best k = 5, silhouette = 0.7182

---------- VISUALIZE ----------
[plotting] | Saved cluster plot -> outputs/clusters.html
[plotting] | Saved Cluster Profile -> outputs/cluster_profiles.png

Pipeline completed :)

Dependencies

Package        Version   Purpose
pandas         ≥1.3      Data manipulation
scikit-learn   ≥0.24     K-Means, PCA, preprocessing
matplotlib     ≥3.4      Static visualizations
plotly         ≥5.0      Interactive plots
spotipy        ≥2.19     Spotify API client
numpy          ≥1.20     Numerical operations
umap-learn     ≥0.5      Optional: advanced dimensionality reduction

Performance Notes

Typical runtime on a modern machine:

Step                 Duration   Scaling
Ingest               <1s        O(n) in library size (a single XML parse)
Enrich (first run)   5-30s      O(n) in number of tracks; rate-limited by Spotify API
Enrich (cached)      <1s        O(n) - just reads cache
Preprocess           <1s        O(n)
Clustering           1-5s       O(n × k × iterations) per candidate k
Visualization        2-10s      O(n) for PCA; UMAP and t-SNE scale worse

Total (first run): ~10-50s for 200-400 tracks

About

Built an end-to-end machine learning pipeline to classify and cluster music data extracted from XML-based playlists. Performed data preprocessing, feature engineering, and exploratory analysis to improve clustering quality and interpretability.
