A Python-based unsupervised machine learning pipeline that automatically clusters your Apple Music library based on audio characteristics. Discover patterns in your music taste by segmenting your favourite tracks into meaningful groups.
This project takes your Apple Music library export and applies K-Means clustering to organize tracks based on their audio features (energy, tempo, danceability, etc.). The pipeline enriches your local library data with Spotify audio features, preprocesses the data, and visualizes clusters in 2D space.
- Automatic Library Parsing: Reads Apple Music Library.xml exports
- Smart Track Filtering: Multiple strategies to select favourite tracks (loved, rated, most-played)
- Spotify Enrichment: Pulls detailed audio features for each track via Spotify API
- Intelligent Preprocessing: Handles missing data and normalizes features
- K-Means Clustering: Automated optimal cluster selection using silhouette scores
- Interactive Visualization: 2D cluster plots with hover information and radar charts for cluster profiles
Data Flow:
┌──────────────────┐
│ Library.xml │
│ (Apple Music) │
└────────┬─────────┘
│
▼
┌──────────────────────┐
│ INGEST │
│ Parse XML, filter │
│ by strategy │
└────────┬─────────────┘
│
▼
┌──────────────────────┐
│ ENRICH │
│ Query Spotify API │
│ fetch audio features │
│ (cached) │
└────────┬─────────────┘
│
▼
┌──────────────────────┐
│ PREPROCESS │
│ Drop missing data, │
│ fill NaN, normalize │
└────────┬─────────────┘
│
▼
┌──────────────────────┐
│ CLUSTERING │
│ K-Means evaluation │
│ silhouette scoring │
└────────┬─────────────┘
│
▼
┌──────────────────────┐
│ VISUALIZE │
│ 2D scatter plots │
│ cluster profiles │
└──────────────────────┘
.
├── ingest.py # Load and filter Apple Music library
├── enrich.py # Spotify API integration & feature extraction
├── preprocess.py # Data cleaning & normalization
├── kmeans.py # Clustering & evaluation
├── plot.py # Visualization (static & interactive)
├── pipeline.py # Main orchestration script
├── config.py # Global configuration & API credentials
├── Library.xml # Your Apple Music library export
│
├── data/
│ ├── raw/ # Input data directory
│ └── cache/ # Cached Spotify features
│
└── outputs/
    ├── enriched_tracks.csv # After Spotify enrichment
    ├── clustered_tracks.csv # Final output with cluster labels
    ├── clusters.html # Interactive scatter plot
    ├── clusters.png # Static cluster visualization
    ├── cluster_profiles.png # Radar chart of cluster characteristics
    └── k_selection.png # Elbow & silhouette plots
- Python 3.8+
- pip
1. Clone/Download this repository
2. Create a virtual environment (recommended)
   python -m venv venv
   source venv/bin/activate # On Windows: venv\Scripts\activate
3. Install dependencies
   pip install pandas scikit-learn matplotlib plotly spotipy numpy
   Optional (for enhanced clustering visualization):
   pip install umap-learn
4. Configure Spotify API credentials
   - Go to Spotify Developer Dashboard
   - Create an app and get your Client ID and Client Secret
   - Add them to config.py:
     SPOTIFY_CLIENT_ID = 'your_client_id_here'
     SPOTIFY_CLIENT_SECRET = 'your_client_secret_here'
5. Export your Apple Music library
   - Open Apple Music on macOS
   - Select File → Library → Export Library
   - Save as Library.xml in your project directory
python pipeline.py --library Library.xml
# Cluster by highly-rated tracks
python pipeline.py --library Library.xml --strategy rated
# Skip Spotify enrichment (use cached data from previous run)
python pipeline.py --library Library.xml --no-enrich
# Manually set number of clusters
python pipeline.py --library Library.xml --k 8
# Run without generating visualizations
python pipeline.py --library Library.xml --no-plot
# Combine options
python pipeline.py --library Library.xml --strategy played --k 6 --no-enrich

| Argument | Default | Options | Description |
|---|---|---|---|
| --library | required | path | Path to Apple Music Library.xml export |
| --strategy | loved | loved, rated, played, all | How to define favourite tracks |
| --k | auto | auto or integer | Number of clusters (auto = silhouette-based selection) |
| --no-enrich | False | flag | Skip Spotify API calls (use cache) |
| --no-plot | False | flag | Skip all visualizations |
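The argument table above could be wired up with argparse roughly as follows. This is a sketch under assumptions, not the actual contents of pipeline.py: the names build_parser and k_arg are hypothetical and exist only for illustration.

```python
import argparse


def k_arg(value: str):
    """Accept the literal 'auto' or an integer >= 2 for --k (hypothetical helper)."""
    if value == "auto":
        return "auto"
    k = int(value)  # a ValueError here is reported by argparse as a usage error
    if k < 2:
        raise argparse.ArgumentTypeError("k must be at least 2")
    return k


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Cluster an Apple Music library")
    parser.add_argument("--library", required=True, help="Path to Library.xml export")
    parser.add_argument("--strategy", default="loved",
                        choices=["loved", "rated", "played", "all"],
                        help="How to define favourite tracks")
    parser.add_argument("--k", type=k_arg, default="auto",
                        help="Number of clusters, or 'auto' for silhouette-based selection")
    parser.add_argument("--no-enrich", action="store_true", help="Skip Spotify API calls")
    parser.add_argument("--no-plot", action="store_true", help="Skip all visualizations")
    return parser


# Parse the "combine options" example from the usage section.
args = build_parser().parse_args(
    ["--library", "Library.xml", "--strategy", "played", "--k", "6", "--no-enrich"]
)
print(args.strategy, args.k, args.no_enrich, args.no_plot)
```

Using store_true for the two flags gives the documented default of False without any extra handling.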
Edit config.py to customize the pipeline:
DATA_DIR = ROOT / 'data' / 'raw' # Input directory
CACHE_DIR = ROOT / 'data' / 'cache' # Spotify feature cache
OUTPUT_DIR = ROOT / 'outputs' # Results directory

SPOTIFY_CLIENT_ID = 'your_id'
SPOTIFY_CLIENT_SECRET = 'your_secret'

FEATURES = [
    'tempo', # BPM (0-210)
    'energy', # 0-1 (intensity & activity)
    'valence', # 0-1 (positivity/happiness)
    'danceability', # 0-1 (rhythm regularity)
    'acousticness', # 0-1 (non-electric)
    'instrumentalness', # 0-1 (no vocals)
    'loudness', # dB (-60 to 0)
    'speechiness', # 0-1 (spoken words)
]

K_RANGE = range(2, 15) # Test k values 2-14
KMEANS_INIT = 'k-means++' # Initialization method
KMEANS_N_INIT = 10 # Number of re-runs
KMEANS_MAX_ITER = 300 # Max iterations
RANDOM_STATE = 42 # Reproducibility
K_OVERRIDE = None # Force specific k (overrides auto-selection)

DIM_REDUCTION = 'umap' # 'pca', 'umap', or 'tsne'
N_COMPONENTS = 2 # 2D visualization

ingest.py parses the Apple Music Library.xml export and filters tracks.
Key Functions:
- load_library(xml_path) → List[Track]
  - Loads all tracks from Library.xml
  - Excludes podcasts, audiobooks, and videos
- filter_favourites(tracks, filter_by) → List[Track]
  - 'loved': Only tracks marked as loved
  - 'rated': Tracks with rating ≥ 50
  - 'played': Top 25% most-played tracks
  - 'all': All tracks
- convert_dataframe(tracks) → pd.DataFrame
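A minimal sketch of the parse-and-filter step, assuming the export is a standard property list that plistlib can read. The dictionary keys used here ("Tracks", "Loved", "Rating", "Play Count", "Podcast", "Has Video") are assumptions about the export format and may differ between Apple Music versions. The function load_tracks is a hypothetical stand-in, not the project's own load_library, and it does not attempt the audiobook exclusion.

```python
import plistlib


def load_tracks(xml_bytes: bytes) -> list:
    """Parse a Library.xml payload and drop podcasts and videos."""
    library = plistlib.loads(xml_bytes)
    tracks = library.get("Tracks", {}).values()
    return [t for t in tracks if not t.get("Podcast") and not t.get("Has Video")]


def filter_favourites(tracks: list, filter_by: str = "loved") -> list:
    """Apply one of the four documented strategies."""
    if filter_by == "loved":
        return [t for t in tracks if t.get("Loved")]
    if filter_by == "rated":
        return [t for t in tracks if t.get("Rating", 0) >= 50]
    if filter_by == "played":
        # Top 25% by play count, keeping at least one track when any exist.
        ranked = sorted(tracks, key=lambda t: t.get("Play Count", 0), reverse=True)
        return ranked[: max(1, len(ranked) // 4)]
    return list(tracks)  # 'all'


# Build a tiny in-memory library instead of reading a real export.
sample = {"Tracks": {
    "1": {"Name": "A", "Artist": "X", "Loved": True, "Rating": 100, "Play Count": 40},
    "2": {"Name": "B", "Artist": "Y", "Rating": 40, "Play Count": 3},
    "3": {"Name": "C", "Artist": "Z", "Podcast": True, "Play Count": 99},
    "4": {"Name": "D", "Artist": "X", "Rating": 60, "Play Count": 12},
    "5": {"Name": "E", "Artist": "W", "Play Count": 1},
}}
tracks = load_tracks(plistlib.dumps(sample))
for strategy in ("loved", "rated", "played", "all"):
    names = sorted(t["Name"] for t in filter_favourites(tracks, strategy))
    print(strategy, names)
```

For a real export, the bytes would come from reading the file in binary mode before passing them to load_tracks.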
enrich.py fetches audio features from the Spotify API and caches the results.
Key Functions:
- spotify_client() → spotipy.Spotify
  - Initializes authenticated Spotify client
- audio_features(name, artist, sp) → dict
  - Searches Spotify for track and extracts audio features
  - Returns cached result if available
- enrich_df(df, delay=0.1) → pd.DataFrame
  - Enriches dataframe with Spotify features
  - Rate-limiting: 0.1s delay between API calls
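The "return cached result if available" behaviour can be sketched as a cache-first lookup. This is an illustration under assumptions, not the code in enrich.py: cached_features, cache_key, and the JSON file layout are hypothetical, and the fetch callable stands in for the real Spotify request so the example runs offline.

```python
import json
import tempfile
from pathlib import Path


def cache_key(name: str, artist: str) -> str:
    """Normalise the lookup key so case and stray spaces do not cause misses."""
    return f"{artist.strip().lower()}::{name.strip().lower()}"


def cached_features(name: str, artist: str, fetch, cache_path: Path) -> dict:
    """Return cached features on a hit; call fetch and store the result on a miss."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    key = cache_key(name, artist)
    if key in cache:  # hit: the API is not contacted
        return cache[key]
    features = fetch(name, artist)  # miss: query once, then persist
    cache[key] = features
    cache_path.write_text(json.dumps(cache))
    return features


calls = []


def fake_fetch(name, artist):
    """Stand-in for the Spotify call; records each invocation."""
    calls.append((name, artist))
    return {"energy": 0.8, "tempo": 120.0}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "features.json"
    first = cached_features("Song", "Artist", fake_fetch, path)
    second = cached_features(" song ", "ARTIST", fake_fetch, path)

print(first == second, len(calls))
```

The branch order matters: the lookup returns early only when the key is present, so a miss always falls through to the fetch.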
preprocess.py handles data cleaning and normalization.
Key Functions:
- drop_missing(df, limit=0.5) → pd.DataFrame
  - Drops rows missing >50% of features
- fill_missing(df) → pd.DataFrame
  - Imputes missing values with median
- normalise(df) → (pd.DataFrame, MinMaxScaler)
  - Scales features to [0, 1] range
- full_preprocess(df) → (pd.DataFrame, np.ndarray, MinMaxScaler)
  - Orchestrates all preprocessing steps
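The three preprocessing steps could look roughly like this. It is a sketch that reuses the documented function names for readability, but the bodies are assumptions rather than the project's code, and FEATURES is shortened to three columns to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

FEATURES = ["tempo", "energy", "valence"]  # shortened list, for illustration only


def drop_missing(df: pd.DataFrame, limit: float = 0.5) -> pd.DataFrame:
    """Drop rows where more than `limit` of the feature columns are NaN."""
    missing_share = df[FEATURES].isna().mean(axis=1)
    return df[missing_share <= limit].copy()


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Replace remaining NaNs with the per-column median."""
    out = df.copy()
    out[FEATURES] = out[FEATURES].fillna(out[FEATURES].median())
    return out


def normalise(df: pd.DataFrame):
    """Scale every feature column to the [0, 1] range."""
    scaler = MinMaxScaler()
    out = df.copy()
    out[FEATURES] = scaler.fit_transform(out[FEATURES])
    return out, scaler


raw = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "tempo": [60.0, 120.0, np.nan, 180.0],
    "energy": [0.2, np.nan, np.nan, 0.8],
    "valence": [0.1, 0.5, 0.9, np.nan],
})

clean, scaler = normalise(fill_missing(drop_missing(raw)))
X = clean[FEATURES].to_numpy()  # matrix handed to the clustering step
print(clean)
```

Row "c" is missing two of three features and is dropped; the single gaps in rows "b" and "d" are filled with medians computed after the drop.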
kmeans.py runs K-Means clustering with automatic k-selection.
Key Functions:
- k_range(X) → dict
  - Evaluates k values 2-14
  - Returns inertia & silhouette scores
- best_k(results) → int
  - Selects k with highest silhouette score
- fit_kmeans(X, k) → sklearn.cluster.KMeans
  - Trains K-Means model
- assign_clusters(df, X, km) → pd.DataFrame
  - Adds cluster labels to dataframe
- run_clustering(df, X, plot=True) → (pd.DataFrame, KMeans)
  - Main clustering pipeline
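Silhouette-based selection of k can be sketched with scikit-learn as follows. The function names evaluate_k and pick_best_k are hypothetical, and the synthetic blobs stand in for real audio features, so this illustrates the technique rather than reproducing kmeans.py.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def evaluate_k(X, k_values, random_state=42):
    """Fit K-Means for each k and record inertia and silhouette score."""
    results = {}
    for k in k_values:
        km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    max_iter=300, random_state=random_state).fit(X)
        results[k] = {
            "inertia": km.inertia_,
            "silhouette": silhouette_score(X, km.labels_),
        }
    return results


def pick_best_k(results):
    """Return the k with the highest silhouette score."""
    return max(results, key=lambda k: results[k]["silhouette"])


# Three well-separated synthetic groups, so the expected answer is k = 3.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=42)

results = evaluate_k(X, range(2, 7))
best = pick_best_k(results)
for k, scores in results.items():
    print(f"k = {k} | inertia = {scores['inertia']:.1f} | silhouette = {scores['silhouette']:.4f}")
print("best k:", best)
```

Inertia always falls as k grows, which is why the selection uses the silhouette peak rather than the lowest inertia.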
plot.py visualizes clusters and cluster characteristics.
Key Functions:
- plot_clusters(df, X, km, save=True)
  - Attempts interactive plot; falls back to static
  - Automatically reduces dimensions via PCA/UMAP/t-SNE
- plot_iteractive(df, X, km, save=True)
  - Plotly scatter plot with hover information
  - HTML output for exploration
- plot_static(df, X, km, save=True)
  - Matplotlib fallback
- plot_clusters_profiles(df, km, save=True)
  - Radar chart showing cluster feature profiles
  - Compares characteristic patterns across clusters
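The dimensionality-reduction step can be sketched as a function that prefers UMAP and falls back to PCA when the optional umap-learn package is not installed. The name reduce_2d is hypothetical and t-SNE is left out for brevity, so this is an illustration rather than the logic in plot.py.

```python
import numpy as np
from sklearn.decomposition import PCA


def reduce_2d(X: np.ndarray, method: str = "umap", random_state: int = 42) -> np.ndarray:
    """Project the feature matrix to two dimensions for plotting."""
    if method == "umap":
        try:
            import umap  # provided by the optional umap-learn package
            reducer = umap.UMAP(n_components=2, random_state=random_state)
            return reducer.fit_transform(X)
        except ImportError:
            method = "pca"  # fall back when umap-learn is not installed
    if method == "pca":
        return PCA(n_components=2, random_state=random_state).fit_transform(X)
    raise ValueError(f"unsupported method: {method}")


# Random data with 8 columns, mimicking the 8 audio features.
rng = np.random.default_rng(0)
X = rng.random((50, 8))
coords = reduce_2d(X, method="pca")
print(coords.shape)
```

The resulting two columns can be passed straight to a scatter plot, with the cluster label as the colour.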
pipeline.py is the main orchestration script that runs the full pipeline.
enriched_tracks.csv
- Original track metadata + Spotify audio features
- Columns: name, artist, album, genre, year, play_count, loved, rating, duration, + audio features
clustered_tracks.csv
- Same as enriched + cluster column with integer cluster labels (0 to k-1)
clusters.html (Interactive)
- Plotly scatter plot with 2D reduced features
- Hover over points to see track name and artist
- Color-coded by cluster
clusters.png (Static)
- Matplotlib scatter plot fallback
- Useful for static reports or sharing
cluster_profiles.png (Radar Chart)
- Shows normalized mean feature values per cluster
- Helps understand what makes each cluster distinct
- Example: Cluster 0 might be "high-energy, high-danceability" while Cluster 1 is "acoustic, low-tempo"
k_selection.png (Evaluation Metrics)
- Left: Inertia vs k (elbow curve)
- Right: Silhouette score vs k (peak = optimal k)
- Metrics & Evaluation: Silhouette score is the only evaluation metric; need Davies-Bouldin, Calinski-Harabasz, or domain-based validation
- Error Handling: Spotify API failures don't gracefully degrade; could build fallback feature extraction
- Cache Logic Bug: enrich.py line 30 returns cached data only if NOT found (inverted logic)
- Dimensionality Reduction: t-SNE perplexity hardcoded to 30; should adapt to dataset size
- Apple Music Metadata: Genre/year often missing; enrichment relies entirely on Spotify matches
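As an illustration of the first limitation above, scikit-learn already ships the two extra metrics named there. The sketch below scores a clustering with all three; score_clustering is a hypothetical helper and the synthetic blobs are not project data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def score_clustering(X, labels) -> dict:
    """Collect three internal validation metrics for one set of labels."""
    return {
        "silhouette": silhouette_score(X, labels),                # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    }


# Three well-separated synthetic groups; compare an underfit k with the true k.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=42)

scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = score_clustering(X, labels)
    print(k, {name: round(value, 3) for name, value in scores[k].items()})
```

When the three metrics disagree on the best k, that disagreement is itself a hint that the cluster structure is weak.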
- Add clustering validation metrics (Davies-Bouldin Index, Calinski-Harabasz Index)
- Implement cross-validation for robust cluster assignment
- Improve Spotify matching (fuzzy string matching for artist/track names)
- Support for Apple Music audio features directly (if API available)
- Automated report generation (PDF with all plots + statistics)
- Web interface for interactive exploration
- Batch processing for large libraries (>5000 tracks)
- Comparison of clustering algorithms (DBSCAN, Hierarchical)
- Feature importance analysis (which features drive cluster separation?)
- Handling for edge cases (instrumental tracks, remixes, live versions)
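The fuzzy-matching idea from the list above could start from the standard library's difflib. This is a sketch under assumptions: the helper names, the 0.6/0.4 weighting, and the 0.85 threshold are illustrative choices, not tuned values, and candidates stands in for Spotify search results.

```python
import re
from difflib import SequenceMatcher


def normalise_title(text: str) -> str:
    """Lowercase, drop bracketed suffixes such as '(Remastered 2011)', strip punctuation."""
    text = text.lower()
    text = re.sub(r"\s*[\(\[][^\)\]]*[\)\]]", "", text)
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(text.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalised strings."""
    return SequenceMatcher(None, normalise_title(a), normalise_title(b)).ratio()


def best_match(name: str, artist: str, candidates: list, threshold: float = 0.85):
    """Return the closest candidate, or None if nothing clears the threshold."""
    best, best_score = None, 0.0
    for cand in candidates:
        score = 0.6 * similarity(name, cand["name"]) + 0.4 * similarity(artist, cand["artist"])
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None


candidates = [
    {"name": "Bohemian Rhapsody", "artist": "Queen"},
    {"name": "Bohemian Like You", "artist": "The Dandy Warhols"},
]
hit = best_match("Bohemian Rhapsody (Remastered 2011)", "Queen", candidates)
miss = best_match("Totally Different Song", "Nobody", candidates)
print(hit, miss)
```

Stripping bracketed suffixes helps remasters match but also collapses live versions and remixes onto the studio track, and the ASCII-only filter discards accented characters, so both choices would need revisiting for the edge cases listed above.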
Ensure you've exported your Apple Music library and placed it in the project directory. See Installation step 5.
UMAP is optional. Install it with:
pip install umap-learn
Or switch DIM_REDUCTION to 'pca' in config.py.
- Verify SPOTIFY_CLIENT_ID and SPOTIFY_CLIENT_SECRET are correct in config.py
- Check Spotify Developer Dashboard for app status
- If rate-limited, increase delay in enrich.py line 35
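Besides raising the fixed delay, a retry with exponential backoff is one possible way to ride out rate limits. Note that this is a different technique from the fixed 0.1s delay the pipeline uses. The sketch is generic: RateLimited and call_with_backoff are hypothetical names, and a real integration would catch whatever exception the Spotify client raises for HTTP 429.

```python
import time


class RateLimited(Exception):
    """Stand-in for an HTTP 429 error (hypothetical; not a spotipy class)."""


def call_with_backoff(func, *args, retries=4, base_delay=0.1, sleep=time.sleep):
    """Call func, doubling the wait after each rate-limit error."""
    for attempt in range(retries):
        try:
            return func(*args)
        except RateLimited:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


attempts = {"count": 0}


def flaky_lookup(query):
    """Fails twice with a rate-limit error, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimited()
    return {"query": query, "energy": 0.7}


waits = []  # record the requested waits instead of actually sleeping
result = call_with_backoff(flaky_lookup, "some track", sleep=waits.append)
print(result, waits)
```

Passing the sleep function as a parameter keeps the example fast and makes the wait schedule easy to inspect.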
Some tracks (especially from smaller/independent artists) may not exist in Spotify. These will be dropped during preprocessing. To see how many were lost, check console output after enrichment.
If clustering a very large library (>10k tracks):
- Set K_RANGE = range(2, 8) to speed up evaluation
- Use --no-plot to skip expensive dimensionality reduction
- Consider filtering tracks: --strategy rated instead of --strategy all
---------- INGEST ----------
[ingest] | Loaded 2,345 tracks from Library.xml successfully.
[filter] | Filtered the list of tracks by loved giving 412 items.
---------- ENRICH ----------
[enrich] | Enriched 398 tracks, 14 had no Spotify match.
---------- PREPROCESS ----------
[preprocess] | Dropped 8 tracks with insufficient features; 390 tracks still left.
[preprocess] | Normalised 8 features.
---------- CLUSTER ----------
[clustering] | Evaluating list in [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
k = 2 | inertia = 245.3 | silhouette = 0.6421
k = 5 | inertia = 178.2 | silhouette = 0.7182
k = 8 | inertia = 156.9 | silhouette = 0.6954
[clustering] | Best k = 5, silhouette = 0.7182
---------- VISUALIZE ----------
[plotting] | Saved cluster plot -> outputs/clusters.html
[plotting] | Saved Cluster Profile -> outputs/cluster_profiles.png
Pipeline completed :)
| Package | Version | Purpose |
|---|---|---|
| pandas | ≥1.3 | Data manipulation |
| scikit-learn | ≥0.24 | K-Means, PCA, preprocessing |
| matplotlib | ≥3.4 | Static visualizations |
| plotly | ≥5.0 | Interactive plots |
| spotipy | ≥2.19 | Spotify API client |
| numpy | ≥1.20 | Numerical operations |
| umap-learn | ≥0.5 | Optional: advanced dimensionality reduction |
Typical runtime on a modern machine:
| Step | Duration | Scaling |
|---|---|---|
| Ingest | <1s | O(n) in library size; XML parsing is fast in practice |
| Enrich (first run) | 5-30s | O(n) per track; rate-limited by Spotify API |
| Enrich (cached) | <1s | O(n) - just reads cache |
| Preprocess | <1s | O(n) |
| Clustering | 1-5s | O(n × k × iterations) |
| Visualization | 2-10s | O(n) for PCA; UMAP and t-SNE scale super-linearly |
Total (first run): ~10-50s for 200-400 tracks