Apple Music Favourites Clustering

A Python-based unsupervised machine learning pipeline that automatically clusters your Apple Music library based on audio characteristics. Discover patterns in your music taste by segmenting your favourite tracks into meaningful groups.

Overview

This project takes your Apple Music library export and applies K-Means clustering to organize tracks based on their audio features (energy, tempo, danceability, etc.). The pipeline enriches your local library data with Spotify audio features, preprocesses the data, and visualizes clusters in 2D space.

Key Features

Automatic Library Parsing: Reads Apple Music Library.xml exports
Smart Track Filtering: Multiple strategies to select favourite tracks (loved, rated, most-played)
Spotify Enrichment: Pulls detailed audio features for each track via Spotify API
Intelligent Preprocessing: Handles missing data and normalizes features
K-Means Clustering: Automated optimal cluster selection using silhouette scores
Interactive Visualization: 2D cluster plots with hover information and radar charts for cluster profiles

Architecture

Data Flow:
┌──────────────────┐
│ Library.xml      │
│ (Apple Music)    │
└────────┬─────────┘
         │
         ▼
┌──────────────────────┐
│ INGEST               │
│ Parse XML, filter    │
│ by strategy          │
└────────┬─────────────┘
         │
         ▼
┌──────────────────────┐
│ ENRICH               │
│ Query Spotify API    │
│ fetch audio features │
│ (cached)             │
└────────┬─────────────┘
         │
         ▼
┌──────────────────────┐
│ PREPROCESS           │
│ Drop missing data,   │
│ fill NaN, normalize  │
└────────┬─────────────┘
         │
         ▼
┌──────────────────────┐
│ CLUSTERING           │
│ K-Means evaluation   │
│ silhouette scoring   │
└────────┬─────────────┘
         │
         ▼
┌──────────────────────┐
│ VISUALIZE            │
│ 2D scatter plots     │
│ cluster profiles     │
└──────────────────────┘

Project Structure

.
├── ingest.py           # Load and filter Apple Music library
├── enrich.py           # Spotify API integration & feature extraction
├── preprocess.py       # Data cleaning & normalization
├── kmeans.py           # Clustering & evaluation
├── plot.py             # Visualization (static & interactive)
├── pipeline.py         # Main orchestration script
├── config.py           # Global configuration & API credentials
├── Library.xml         # Your Apple Music library export
│
├── data/
│   ├── raw/            # Input data directory
│   └── cache/          # Cached Spotify features
│
└── outputs/
    ├── enriched_tracks.csv        # After Spotify enrichment
    ├── clustered_tracks.csv       # Final output with cluster labels
    ├── clusters.html              # Interactive scatter plot
    ├── clusters.png               # Static cluster visualization
    ├── cluster_profiles.png       # Radar chart of cluster characteristics
    └── k_selection.png            # Elbow & silhouette plots

Installation

Requirements

Python 3.8+
pip

Setup

Clone/Download this repository

Create a virtual environment (recommended)

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies

pip install pandas scikit-learn matplotlib plotly spotipy numpy

Optional (for enhanced clustering visualization):

pip install umap-learn

Configure Spotify API credentials
- Go to Spotify Developer Dashboard
- Create an app and get your Client ID and Client Secret
- Add them to config.py:
```
SPOTIFY_CLIENT_ID = 'your_client_id_here'
SPOTIFY_CLIENT_SECRET = 'your_client_secret_here'
```
Export your Apple Music library
- Open Apple Music on macOS
- Select File → Library → Export Library
- Save as Library.xml in your project directory

Usage

Basic Usage

python pipeline.py --library Library.xml

With Options

# Cluster by highly-rated tracks
python pipeline.py --library Library.xml --strategy rated

# Skip Spotify enrichment (use cached data from previous run)
python pipeline.py --library Library.xml --no-enrich

# Manually set number of clusters
python pipeline.py --library Library.xml --k 8

# Run without generating visualizations
python pipeline.py --library Library.xml --no-plot

# Combine options
python pipeline.py --library Library.xml --strategy played --k 6 --no-enrich

Command-Line Arguments

| Argument | Default | Options | Description |
| --- | --- | --- | --- |
| `--library` | required | path | Path to Apple Music Library.xml export |
| `--strategy` | `loved` | `loved`, `rated`, `played`, `all` | How to define favourite tracks |
| `--k` | `auto` | `auto` or integer | Number of clusters (auto = silhouette-based selection) |
| `--no-enrich` | False | flag | Skip Spotify API calls (use cache) |
| `--no-plot` | False | flag | Skip all visualizations |
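
The arguments above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the code in pipeline.py: the helper names `parse_k` and `build_parser` are hypothetical, and the rule that k must be at least 2 is an assumption based on the lower bound of K_RANGE.

```python
import argparse


def parse_k(value: str):
    """Accept the literal 'auto' or an integer for --k (hypothetical helper)."""
    if value == "auto":
        return value
    k = int(value)  # a ValueError here is reported by argparse as an invalid value
    if k < 2:
        raise argparse.ArgumentTypeError("k must be at least 2")
    return k


def build_parser() -> argparse.ArgumentParser:
    """Declare the five documented arguments with their documented defaults."""
    parser = argparse.ArgumentParser(description="Cluster Apple Music favourites")
    parser.add_argument("--library", required=True, help="Path to Library.xml")
    parser.add_argument("--strategy", default="loved",
                        choices=["loved", "rated", "played", "all"])
    parser.add_argument("--k", type=parse_k, default="auto")
    parser.add_argument("--no-enrich", action="store_true")
    parser.add_argument("--no-plot", action="store_true")
    return parser


# Equivalent to: python pipeline.py --library Library.xml --strategy played --k 6 --no-enrich
args = build_parser().parse_args(
    ["--library", "Library.xml", "--strategy", "played", "--k", "6", "--no-enrich"]
)
print(args.strategy, args.k, args.no_enrich, args.no_plot)
```

Note that `argparse` exposes `--no-enrich` and `--no-plot` as `args.no_enrich` and `args.no_plot`.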

Configuration

Edit config.py to customize the pipeline:

Directories

DATA_DIR = ROOT / 'data' / 'raw'         # Input directory
CACHE_DIR = ROOT / 'data' / 'cache'      # Spotify feature cache
OUTPUT_DIR = ROOT / 'outputs'            # Results directory

API Credentials

SPOTIFY_CLIENT_ID = 'your_id'
SPOTIFY_CLIENT_SECRET = 'your_secret'

Audio Features

FEATURES = [
    'tempo',              # BPM (0-210)
    'energy',             # 0-1 (intensity & activity)
    'valence',            # 0-1 (positivity/happiness)
    'danceability',       # 0-1 (rhythm regularity)
    'acousticness',       # 0-1 (non-electric)
    'instrumentalness',   # 0-1 (no vocals)
    'loudness',           # dB (-60 to 0)
    'speechiness',        # 0-1 (spoken words)
]

Clustering Parameters

K_RANGE = range(2, 15)              # Test k values 2-14
KMEANS_INIT = 'k-means++'           # Initialization method
KMEANS_N_INIT = 10                  # Number of re-runs
KMEANS_MAX_ITER = 300               # Max iterations
RANDOM_STATE = 42                   # Reproducibility
K_OVERRIDE = None                   # Force specific k (overrides auto-selection)

Dimensionality Reduction for Visualization

DIM_REDUCTION = 'umap'              # 'pca', 'umap', or 'tsne'
N_COMPONENTS = 2                    # 2D visualization

Module Reference

`ingest.py`

Parses Apple Music Library.xml and filters tracks.

Key Functions:

load_library(xml_path) → List[Track]
- Loads all tracks from Library.xml
- Excludes podcasts, audiobooks, and videos
filter_favourites(tracks, filter_by) → List[Track]
- 'loved': Only tracks marked as loved
- 'rated': Tracks with rating ≥ 50
- 'played': Top 25% most-played tracks
- 'all': All tracks
convert_dataframe(tracks) → pd.DataFrame
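
A minimal sketch of how loading and filtering might work, assuming the export is a standard property list readable with `plistlib`. The plist keys used here (`Tracks`, `Podcast`, `Has Video`, `Play Count`, `Loved`, `Rating`), the dictionary-based track representation, and the "top quarter by play count" rule are assumptions for illustration and may differ from ingest.py.

```python
import plistlib
import tempfile
from pathlib import Path


def load_library(xml_path):
    """Read the property list and keep music tracks only."""
    with open(xml_path, "rb") as fh:
        library = plistlib.load(fh)
    tracks = []
    for raw in library.get("Tracks", {}).values():
        # Assumed keys: skip podcast and video entries.
        if raw.get("Podcast") or raw.get("Has Video"):
            continue
        tracks.append({
            "name": raw.get("Name", ""),
            "artist": raw.get("Artist", ""),
            "play_count": raw.get("Play Count", 0),
            "loved": bool(raw.get("Loved", False)),
            "rating": raw.get("Rating", 0),
        })
    return tracks


def filter_favourites(tracks, filter_by="loved"):
    """Apply one of the four documented strategies."""
    if filter_by == "loved":
        return [t for t in tracks if t["loved"]]
    if filter_by == "rated":
        return [t for t in tracks if t["rating"] >= 50]
    if filter_by == "played":
        # One reading of "top 25%": the first quarter when sorted by play count.
        ranked = sorted(tracks, key=lambda t: t["play_count"], reverse=True)
        return ranked[: max(1, len(ranked) // 4)]
    if filter_by == "all":
        return list(tracks)
    raise ValueError(f"unknown strategy: {filter_by}")


# Tiny made-up library written to a temporary file so the example is self-contained.
demo = {"Tracks": {
    "1": {"Name": "A", "Artist": "X", "Play Count": 50, "Loved": True, "Rating": 100},
    "2": {"Name": "B", "Artist": "X", "Play Count": 10, "Rating": 60},
    "3": {"Name": "C", "Artist": "Y", "Play Count": 3, "Rating": 20},
    "4": {"Name": "D", "Artist": "Y", "Play Count": 1},
    "5": {"Name": "Show", "Artist": "Z", "Podcast": True},
}}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "Library.xml"
    with open(path, "wb") as fh:
        plistlib.dump(demo, fh)
    tracks = load_library(path)

for strategy in ("loved", "rated", "played", "all"):
    print(strategy, [t["name"] for t in filter_favourites(tracks, strategy)])
```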

`enrich.py`

Fetches audio features from Spotify API with intelligent caching.

Key Functions:

spotify_client() → spotipy.Spotify
- Initializes authenticated Spotify client
audio_features(name, artist, sp) → dict
- Searches Spotify for track and extracts audio features
- Returns cached result if available
enrich_df(df, delay=0.1) → pd.DataFrame
- Enriches dataframe with Spotify features
- Rate-limiting: 0.1s delay between API calls

`preprocess.py`

Data cleaning and normalization.

Key Functions:

drop_missing(df, limit=0.5) → pd.DataFrame
- Drops rows missing >50% of features
fill_missing(df) → pd.DataFrame
- Imputes missing values with median
normalise(df) → (pd.DataFrame, MinMaxScaler)
- Scales features to [0, 1] range
full_preprocess(df) → (pd.DataFrame, np.ndarray, MinMaxScaler)
- Orchestrates all preprocessing steps
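
The three steps can be sketched with pandas and scikit-learn as follows. This is a simplified illustration: the feature list is shortened to three columns, the sample data is made up, and the function bodies are assumptions about how preprocess.py might work rather than a copy of it.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

FEATURES = ["tempo", "energy", "valence"]  # shortened list for the example


def drop_missing(df, limit=0.5):
    """Drop rows where more than `limit` of the feature values are missing."""
    missing_share = df[FEATURES].isna().mean(axis=1)
    return df[missing_share <= limit].copy()


def fill_missing(df):
    """Replace remaining gaps with the per-column median."""
    out = df.copy()
    out[FEATURES] = out[FEATURES].fillna(out[FEATURES].median())
    return out


def normalise(df):
    """Scale every feature column to the [0, 1] range."""
    scaler = MinMaxScaler()
    out = df.copy()
    out[FEATURES] = scaler.fit_transform(out[FEATURES])
    return out, scaler


def full_preprocess(df):
    """Run the three steps and also return the feature matrix."""
    clean, scaler = normalise(fill_missing(drop_missing(df)))
    return clean, clean[FEATURES].to_numpy(), scaler


df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "tempo": [100.0, 140.0, np.nan, 120.0],
    "energy": [0.2, 0.8, np.nan, np.nan],
    "valence": [0.1, 0.9, 0.5, 0.5],
})
clean, X, scaler = full_preprocess(df)
print(clean)
```

In this sample, track "c" is missing two of three features and is dropped, while track "d" is missing only one and has its energy filled with the median of the remaining rows.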

`kmeans.py`

K-Means clustering with automatic k-selection.

Key Functions:

k_range(X) → dict
- Evaluates k values 2-14
- Returns inertia & silhouette scores
best_k(results) → int
- Selects k with highest silhouette score
fit_kmeans(X, k) → sklearn.cluster.KMeans
- Trains K-Means model
assign_clusters(df, X, km) → pd.DataFrame
- Adds cluster labels to dataframe
run_clustering(df, X, plot=True) → (pd.DataFrame, KMeans)
- Main clustering pipeline
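
Silhouette-based selection of k can be sketched as below, using the KMeans settings listed under Clustering Parameters. The function name `evaluate_k` and the synthetic three-blob data are illustrative assumptions; the real pipeline runs on the normalised audio features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def evaluate_k(X, k_values, random_state=42):
    """Fit K-Means for each k and record inertia and silhouette score."""
    results = {}
    for k in k_values:
        km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    max_iter=300, random_state=random_state).fit(X)
        results[k] = {
            "inertia": km.inertia_,
            "silhouette": silhouette_score(X, km.labels_),
        }
    return results


def best_k(results):
    """Pick the k with the highest silhouette score."""
    return max(results, key=lambda k: results[k]["silhouette"])


# Synthetic data: three well-separated groups, so the expected choice is clear.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=42)
results = evaluate_k(X, range(2, 7))
chosen = best_k(results)
for k, scores in results.items():
    print(f"k = {k} | inertia = {scores['inertia']:.1f} | silhouette = {scores['silhouette']:.4f}")
print("Best k =", chosen)
```

Inertia generally keeps falling as k grows, which is why the silhouette score, rather than inertia alone, drives the selection.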

`plot.py`

Visualization of clusters and cluster characteristics.

Key Functions:

plot_clusters(df, X, km, save=True)
- Attempts interactive plot; falls back to static
- Automatically reduces dimensions via PCA/UMAP/t-SNE
plot_iteractive(df, X, km, save=True)
- Plotly scatter plot with hover information
- HTML output for exploration
plot_static(df, X, km, save=True)
- Matplotlib fallback
plot_clusters_profiles(df, km, save=True)
- Radar chart showing cluster feature profiles
- Compares characteristic patterns across clusters
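
The dimensionality-reduction step with an optional UMAP dependency might look like the sketch below. The function name `reduce_2d` and the fallback-to-PCA behaviour are assumptions for illustration; t-SNE is omitted to keep the example short, and random data stands in for the feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA


def reduce_2d(X, method="pca", random_state=42):
    """Project the feature matrix to 2D for plotting."""
    if method == "umap":
        try:
            import umap  # provided by the optional umap-learn package
            reducer = umap.UMAP(n_components=2, random_state=random_state)
            return reducer.fit_transform(X)
        except ImportError:
            # Assumed behaviour: fall back to PCA when UMAP is not installed.
            pass
    return PCA(n_components=2, random_state=random_state).fit_transform(X)


rng = np.random.default_rng(0)
X = rng.random((40, 8))          # 40 tracks, 8 normalised features
coords = reduce_2d(X, method="pca")
print(coords.shape)
```

The resulting two columns can be used directly as the x and y coordinates of the scatter plot, coloured by cluster label.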

`pipeline.py`

Main orchestration script that runs the full pipeline.

Outputs

CSV Files

enriched_tracks.csv

Original track metadata + Spotify audio features
Columns: name, artist, album, genre, year, play_count, loved, rating, duration, + audio features

clustered_tracks.csv

Same as enriched_tracks.csv, plus a cluster column with integer cluster labels (0 to k-1)

Visualizations

clusters.html (Interactive)

Plotly scatter plot with 2D reduced features
Hover over points to see track name and artist
Color-coded by cluster

clusters.png (Static)

Matplotlib scatter plot fallback
Useful for static reports or sharing

cluster_profiles.png (Radar Chart)

Shows normalized mean feature values per cluster
Helps understand what makes each cluster distinct
Example: Cluster 0 might be "high-energy, high-danceability" while Cluster 1 is "acoustic, low-tempo"
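
The per-cluster means behind the radar chart can be computed with a pandas group-by. The sketch below uses made-up values and only three features; the column name `cluster` follows the description of clustered_tracks.csv.

```python
import pandas as pd

# Made-up, already normalised feature values for four tracks in two clusters.
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1],
    "energy": [0.9, 0.7, 0.2, 0.4],
    "danceability": [0.8, 0.6, 0.3, 0.1],
    "acousticness": [0.1, 0.1, 0.9, 0.7],
})

# One row per cluster, one column per feature: the values a radar chart would plot.
profiles = df.groupby("cluster").mean()
print(profiles.round(2))

# The strongest feature per cluster gives a quick, rough label.
dominant = profiles.idxmax(axis=1)
print(dominant.to_dict())
```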

k_selection.png (Evaluation Metrics)

Left: Inertia vs k (elbow curve)
Right: Silhouette score vs k (peak = optimal k)

Known Issues & Future Work

Current Limitations

Metrics & Evaluation: Silhouette score is the only evaluation metric; Davies-Bouldin, Calinski-Harabasz, or domain-based validation would give a fuller picture
Error Handling: Spotify API failures do not degrade gracefully; a fallback feature extraction path could be built
Cache Logic Bug: enrich.py line 30 returns cached data only if NOT found (inverted logic)
Dimensionality Reduction: t-SNE perplexity hardcoded to 30; should adapt to dataset size
Apple Music Metadata: Genre/year often missing; enrichment relies entirely on Spotify matches
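
As a rough illustration of the first limitation, the additional metrics are available in scikit-learn and could be computed alongside the silhouette score. This sketch is not part of the current pipeline; the function name `score_k` and the synthetic data are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)


def score_k(X, k, random_state=42):
    """Fit K-Means once and report three internal validation metrics."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
    return {
        "silhouette": silhouette_score(X, labels),                # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    }


# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=42)
scores = {k: score_k(X, k) for k in range(2, 7)}

best_by_db = min(scores, key=lambda k: scores[k]["davies_bouldin"])
best_by_ch = max(scores, key=lambda k: scores[k]["calinski_harabasz"])
print(best_by_db, best_by_ch)
```

On real audio features the three metrics may disagree, which is itself a useful signal that the cluster structure is weak.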

Planned Improvements

Troubleshooting

"File not found: Library.xml"

Ensure you've exported your Apple Music library and placed it in the project directory. See "Export your Apple Music library" under Setup.

"Could not import umap"

UMAP is optional. Install it with:

pip install umap-learn

Or switch DIM_REDUCTION to 'pca' in config.py.

Spotify API Errors

Verify SPOTIFY_CLIENT_ID and SPOTIFY_CLIENT_SECRET are correct in config.py
Check Spotify Developer Dashboard for app status
If rate-limited, increase delay in enrich.py line 35

No Spotify Matches Found

Some tracks (especially from smaller or independent artists) may not be available on Spotify. These will be dropped during preprocessing. To see how many were lost, check the console output after enrichment.

Memory Issues with Large Libraries

If clustering a very large library (>10k tracks):

Set K_RANGE = range(2, 8) to speed up evaluation
Use --no-plot to skip expensive dimensionality reduction
Consider filtering tracks: --strategy rated instead of --strategy all

Example Outputs

Console Output

---------- INGEST ----------
[ingest] | Loaded 2,345 tracks from Library.xml successfully.
[filter] | Filtered the list of tracks by loved giving 412 items.

---------- ENRICH ----------
[enrich] | Enriched 398 tracks, 14 had no Spotify match.

---------- PREPROCESS ----------
[preprocess] | Dropped 8 tracks with insufficient features; 390 tracks still left.
[preprocess] | Normalised 8 features.

---------- CLUSTER ----------
[clustering] | Evaluating list in [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
k =  2 | inertia = 245.3 | silhouette = 0.6421
k =  5 | inertia = 178.2 | silhouette = 0.7182
k =  8 | inertia = 156.9 | silhouette = 0.6954
[clustering] | Best k = 5, silhouette = 0.7182

---------- VISUALIZE ----------
[plotting] | Saved cluster plot -> outputs/clusters.html
[plotting] | Saved Cluster Profile -> outputs/cluster_profiles.png

Pipeline completed :)

Dependencies

| Package | Version | Purpose |
| --- | --- | --- |
| `pandas` | ≥1.3 | Data manipulation |
| `scikit-learn` | ≥0.24 | K-Means, PCA, preprocessing |
| `matplotlib` | ≥3.4 | Static visualizations |
| `plotly` | ≥5.0 | Interactive plots |
| `spotipy` | ≥2.19 | Spotify API client |
| `numpy` | ≥1.20 | Numerical operations |
| `umap-learn` | ≥0.5 | Optional: advanced dimensionality reduction |

Performance Notes

Typical runtime on a modern machine:

| Step | Duration | Scaling |
| --- | --- | --- |
| Ingest | <1s | O(n): a single pass over the library XML |
| Enrich (first run) | 5-30s | O(n): one lookup per track; rate-limited by Spotify API |
| Enrich (cached) | <1s | O(n): just reads cache |
| Preprocess | <1s | O(n) |
| Clustering | 1-5s | O(n × k × iterations) |
| Visualization | 2-10s | O(n) for dimensionality reduction |

Total (first run): ~10-50s for 200-400 tracks
