An unsupervised machine learning project that segments ~9,000 active credit card holders into distinct behavioural clusters β enabling banks to design targeted marketing strategies and personalised financial products.
A bank's marketing team needs to understand who their credit card customers really are β not just as account holders, but as distinct behavioural groups with different spending habits, payment patterns, and financial needs.
With 9,000 active customers and 18 behavioural variables, this project applies unsupervised ML techniques to:
- Engineer intelligent KPIs from raw transactional data
- Reduce dimensionality using Factor Analysis to remove multicollinearity
- Cluster customers into meaningful behavioural segments
- Profile each segment with detailed characteristics
- Recommend targeted marketing strategies for each cluster
| Stakeholder | Value Delivered |
|---|---|
| Marketing Team | Run personalised campaigns tailored to each customer segment |
| Product Team | Design credit products (cashback, travel, premium) for specific segments |
| Risk Team | Identify high-risk segments (heavy cash advance users, minimum payment payers) |
| Operations | Prioritise service improvements for high-value but low-satisfaction segments |
| Executive Leadership | Data-driven strategic insights on customer base composition |
According to Bain & Company, companies with excellent customer segmentation strategies see up to 10% profit increase over a 5-year period compared to those without.
- Source: Kaggle β Credit Card Dataset for Clustering
- File:
CC_GENERAL.csv - Size: ~9,000 active credit card holders
- Period: Last 6 months of behavioural data
- Granularity: Customer level with 18 behavioural variables
| Variable | Description |
|---|---|
CUST_ID |
Unique credit card holder ID |
BALANCE |
Monthly average balance (based on daily averages) |
BALANCE_FREQUENCY |
Ratio of last 12 months with a balance (0β1) |
PURCHASES |
Total purchase amount in last 12 months |
ONEOFF_PURCHASES |
Total one-off (single) purchase amount |
INSTALLMENTS_PURCHASES |
Total instalment purchase amount |
CASH_ADVANCE |
Total cash advance amount taken |
PURCHASES_FREQUENCY |
% of months with at least one purchase |
ONEOFF_PURCHASES_FREQUENCY |
Frequency of one-off purchases |
PURCHASES_INSTALLMENTS_FREQUENCY |
Frequency of instalment purchases |
CASH_ADVANCE_FREQUENCY |
Frequency of cash advances |
CASH_ADVANCE_TRX |
Average amount per cash advance transaction |
PURCHASES_TRX |
Average amount per purchase transaction |
CREDIT_LIMIT |
Customer's credit limit |
PAYMENTS |
Total payments made to reduce statement balance |
MINIMUM_PAYMENTS |
Total minimum payments made |
PRC_FULL_PAYMENT |
Percentage of months with full payment |
TENURE |
Number of months as a credit card customer |
Segmentation-of-Credit-Card-Customers/
βββ CC_GENERAL.csv # Raw customer behavioural dataset
βββ Credit_new_cust.xlsx # New customer profiles for segment assignment
βββ Profiling_output.csv # Cluster profiling results
βββ python_FA.xls # Factor Analysis output
βββ pandas_profiling.html # Auto-generated EDA report (open in browser)
βββ Segmentation of Credit Card Customers.ipynb # Full pipeline notebook
βββ LICENSE
βββ README.md
Raw transactional variables alone don't tell the full story. Intelligent KPIs are engineered to capture customer behaviour ratios that reveal true financial habits:
| Engineered KPI | Formula | Business Meaning |
|---|---|---|
monthly_avg_purchase |
PURCHASES / TENURE |
Average monthly spending rate |
monthly_avg_cash_advance |
CASH_ADVANCE / TENURE |
Monthly reliance on cash advances |
limit_usage_ratio |
BALANCE / CREDIT_LIMIT |
How much of credit limit is used |
payments_to_min_ratio |
PAYMENTS / MINIMUM_PAYMENTS |
Payment discipline (higher = better) |
purchase_type_ratio |
ONEOFF / INSTALLMENTS |
Spending style β impulsive vs planned |
cash_advance_ratio |
CASH_ADVANCE / PURCHASES |
Reliance on cash vs purchases |
- Generated full statistical profiling report using pandas-profiling (
pandas_profiling.html) - Analysed distributions of all 18 variables β most spending variables are highly right-skewed
- Identified significant missing values in
MINIMUM_PAYMENTSandCREDIT_LIMITβ imputed using median - Correlation analysis revealed multicollinearity between purchase frequency variables β motivating Factor Analysis
With 18 correlated variables, direct clustering leads to the Curse of Dimensionality. Factor Analysis groups correlated variables into latent factors that represent underlying behavioural dimensions:
# Factor Analysis for variable reduction
from factor_analyzer import FactorAnalyzer
fa = FactorAnalyzer(n_factors=optimal_factors, rotation='varimax')
fa.fit(scaled_data)
factor_scores = fa.transform(scaled_data)Output saved to python_FA.xls for detailed factor loading analysis.
Identified Latent Factors (example):
- Factor 1 β Purchase Activity: Groups purchase frequency, purchase amount, transaction count
- Factor 2 β Cash Advance Behaviour: Groups cash advance amount, frequency, transactions
- Factor 3 β Payment Discipline: Groups full payment %, payments to minimum ratio
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Scale factor scores
scaler = StandardScaler()
scaled_factors = scaler.fit_transform(factor_scores)
# Optimal k using Elbow Method
inertias = []
for k in range(2, 11):
km = KMeans(n_clusters=k, random_state=42)
km.fit(scaled_factors)
inertias.append(km.inertia_)Optimal number of clusters: 4 (determined via Elbow Method + Silhouette Score)
Each cluster is profiled on key variables to understand who they are and what they need:
| Cluster | Profile Name | Characteristics | Recommended Strategy |
|---|---|---|---|
| 0 | π€ Inactive / Low Engagers | Low balance, low purchases, low cash advance | Re-engagement campaigns, introductory cashback offers |
| 1 | π° High Spenders | High purchases, high credit limit, frequent transactions | Premium rewards, travel cards, loyalty programs |
| 2 | π§ Cash Advance Reliant | High cash advance, low purchases, high interest burden | Financial wellness programs, personal loan conversion |
| 3 | π¦ Instalment Shoppers | High instalment purchases, consistent payment behaviour | EMI offers, buy-now-pay-later products, retail partnerships |
Note: Exact cluster labels depend on the final model output in the notebook.
- ~30% of customers rely heavily on cash advances β a high-risk, high-interest segment that the bank should either convert to personal loans or offer financial planning support
- High spenders (Cluster 1) have the highest credit limits but also the highest balance-to-limit ratios β prime candidates for premium card upgrades
- Full payment behaviour (
PRC_FULL_PAYMENT) is the single strongest differentiator between financially healthy and at-risk customer segments - Instalment shoppers show the most consistent, predictable behaviour β lowest churn risk and best candidates for cross-sell
| Category | Tools |
|---|---|
| Language | Python 3.8+ |
| Data Processing | Pandas, NumPy |
| EDA | pandas-profiling |
| Visualisation | Matplotlib, Seaborn |
| Dimensionality Reduction | Factor Analyzer, PCA (Scikit-learn) |
| Clustering | Scikit-learn (K-Means) |
| Cluster Validation | Elbow Method, Silhouette Score |
| Output | Excel (openpyxl) for profiling reports |
| Notebook | Jupyter Notebook |
# Clone the repository
git clone https://github.com/vicky60629/Segmentation-of-Credit-Card-Customers.git
cd Segmentation-of-Credit-Card-Customers
# Install dependencies
pip install -r requirements.txt
# Launch the notebook
jupyter notebook "Segmentation of Credit Card Customers.ipynb"
# View the EDA report
open pandas_profiling.html # or double-click to open in browserWhat worked well:
- Factor Analysis before clustering significantly reduced noise and multicollinearity β clusters are more interpretable than raw K-Means on 18 variables
- Engineering KPI ratios (limit usage, payment discipline) revealed behavioural patterns invisible in raw variables
- Profiling output (
Profiling_output.csv) makes cluster insights shareable with non-technical stakeholders
Future enhancements:
- Try DBSCAN or Gaussian Mixture Models for non-spherical cluster shapes
- Add Hierarchical Clustering + Dendrogram for alternative segmentation view
- Build interactive Tableau / Plotly Dash dashboard for cluster exploration
- Apply RFM Analysis (Recency, Frequency, Monetary) alongside behavioural clustering
- Extend to real-time segment assignment for new customers using
Credit_new_cust.xlsxpipeline - Add SHAP / interpretability layer to explain individual customer segment assignment
- Track experiments with MLflow
Unlike supervised learning, there are no predefined labels for customer segments β we don't know upfront how many types of customers exist or what they look like. Unsupervised learning lets the data speak for itself, revealing natural groupings that a business analyst might never discover manually from 9,000 rows and 18 columns.
This is exactly the kind of analysis that drives real decisions at banks: product design, marketing spend allocation, and risk management β all informed by who customers actually are rather than assumptions.
Vicky Gupta β Data Engineering Analyst @ Accenture (4.5 years) | Aspiring Data Scientist
Passionate about unsupervised ML, customer analytics, and building data products that drive real business decisions. Experienced in PySpark, ETL pipelines, and end-to-end ML workflows.
π§ vg60629@gmail.com
This project is licensed under the MIT License β see the LICENSE file for details.
β Found this useful? Star the repository β it helps others in the community discover it!