South Africa has one of the highest unemployment rates in the world. Nearly 1 in 3 working-age South Africans is officially unemployed; by the expanded measure, the figure is closer to 1 in 2. Among youth aged 15-24, the rate exceeds 60%. Into this crisis, the government deploys billions of rands through development finance institutions with a constitutional mandate to create jobs and transform the economy. This project asks a simple question: is the money working, and is it reaching the people who need it most?
This project is a two-layer data science investigation of South African Development Finance Institution (DFI) funding:
| Layer | Audience | File |
|---|---|---|
| Analytical notebook | Data scientists, policy analysts, finance professionals | analysis.ipynb |
| Public Streamlit app | Journalists, civic advocates, general public | app.py + pages/ |
The analysis covers R69.5 billion of public development finance across two institutions: the Industrial Development Corporation (IDC) and the National Empowerment Fund (NEF).
Gauteng and KwaZulu-Natal, just 2 of 9 provinces, received 55.9% of NEF disbursements and 65.6% of all deals. The provincial disbursement Gini coefficient is 0.469, indicating moderate-to-high geographic concentration. The Northern Cape received 17.8% of the money through just 14 deals, almost entirely driven by two high-value, low-job-creation recipients.
The NEF deal-size Lorenz curve reveals a Gini of 0.56, approaching South Africa's own income inequality levels. The bottom 12% of deals (under R1m) share just 0.7% of funding. The top 0.5% of deals (over R50m) absorb 17% of all money.
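
The Gini coefficients quoted above (0.469 across provinces, 0.56 across deal sizes) can be computed from a Lorenz curve. The following is a minimal sketch, not necessarily the notebook's own code: the function names `lorenz_curve` and `gini` are hypothetical, the sample values are made up, and no small-sample correction is applied.

```python
import numpy as np

def lorenz_curve(values):
    """Return cumulative population share and cumulative value share."""
    x = np.sort(np.asarray(values, dtype=float))
    # Prepend 0 so the curve starts at the origin
    cum_value = np.concatenate([[0.0], np.cumsum(x) / x.sum()])
    cum_pop = np.linspace(0.0, 1.0, len(x) + 1)
    return cum_pop, cum_value

def gini(values):
    """Gini coefficient as 1 minus twice the area under the Lorenz curve."""
    cum_pop, cum_value = lorenz_curve(values)
    # Trapezoid rule over each step of the Lorenz curve
    area = np.sum((cum_value[1:] + cum_value[:-1]) / 2 * np.diff(cum_pop))
    return 1.0 - 2.0 * area

# Illustrative deal sizes in rand millions (made up, not NEF data)
sample_deals = [0.5, 0.8, 2.0, 5.0, 12.0, 60.0]
print(f"Gini: {gini(sample_deals):.3f}")
```

A value of 0 means every deal or province receives the same amount; values approaching 1 mean a single recipient absorbs nearly everything.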
Zero overlap exists between the top-10 disbursement recipients and the top-10 job creators. Khatu Industrial received R534M and created 26 jobs (R20.5M per job). Umnotho Maize received R9M and created 2,352 jobs (R3,827 per job). Both received public NEF money. The 5,366× gap is not a statistical outlier; it is a policy pattern.
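
The cost-per-job comparison is simple division. The sketch below reproduces it from the rounded amounts quoted above, so the ratio it prints only approximates the 5,366× figure, which presumably comes from unrounded source data.

```python
# Rounded figures quoted in the text: (rand disbursed, jobs created)
recipients = {
    "Khatu Industrial": (534_000_000, 26),
    "Umnotho Maize": (9_000_000, 2_352),
}

# Cost per job = amount disbursed / jobs created
cost_per_job = {
    name: disbursed / jobs for name, (disbursed, jobs) in recipients.items()
}
ratio = cost_per_job["Khatu Industrial"] / cost_per_job["Umnotho Maize"]

for name, cost in cost_per_job.items():
    print(f"{name}: R{cost:,.0f} per job")
print(f"Ratio: about {ratio:,.0f}x")
```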
The IDC, which deploys 18.3× more capital than the NEF, publishes no cost-per-job data, no provincial breakdown, and no sector attribution for 56.5% of its investment. The NEF tracks every job and every province. The absence of IDC job data is a policy choice, not a technical limitation.
Among named sectors, mining and metals account for 56.5% of attributed IDC investment (R16.2 billion). New Industries (biotech, solar, AI) received just 1.3% (R371 million). South Africa's economy was built on mining. The IDC, by this measure, is reinforcing that structure rather than transforming it.
- FY2023-24 is absent from the IDC dataset
- 56.5% of IDC investment has no sector attribution
- NEF grants + loans sum to R114.5M more than total disbursed (likely uncommitted facilities)
- Duplicate company entries appear in the NEF top-10 job creators list (KPML Group, Bibi Cash & Carry)
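
Two of the discrepancies above (duplicate company entries, and grants plus loans exceeding total disbursed) can be checked with pandas. This is a sketch on toy rows: the column names `company`, `disbursed`, `grants`, and `loans` are assumptions, and the real CSV schema may differ.

```python
import pandas as pd

# Toy rows with hypothetical column names; values are made up
nef = pd.DataFrame({
    "company": ["Alpha Foods", "alpha foods ", "Beta Mining", "Gamma Retail"],
    "disbursed": [10.0, 10.0, 50.0, 5.0],
    "grants": [2.0, 2.0, 0.0, 1.0],
    "loans": [8.0, 8.0, 55.0, 4.0],
})

# Normalise names so case and stray whitespace do not hide duplicates
nef["company_key"] = nef["company"].str.strip().str.lower()
duplicates = nef[nef.duplicated("company_key", keep=False)]

# Flag rows where grants + loans exceed the amount recorded as disbursed
nef["excess"] = nef["grants"] + nef["loans"] - nef["disbursed"]
over_committed = nef[nef["excess"] > 0]

print(duplicates[["company", "disbursed"]])
print(over_committed[["company", "excess"]])
```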
| Source | Institution | Description |
|---|---|---|
| IDC Funding Dashboard | Industrial Development Corporation | 852 deals, R65.9B, FY2017-2025 |
| Parliamentary Question PQ705 | National Empowerment Fund via dtic.gov.za | 392 companies, R3.6B, all 9 provinces |
| QLFS Q3 2025 | Stats SA | Official unemployment statistics |
Note: Both DFI datasets are structured from public-facing dashboard exports. Raw company-level transaction data is not publicly available for the IDC. The NEF data originates from a parliamentary question (PQ705, Mr RWC Chance, DA), answered via dtic.gov.za.
sa-idc-inequality/
│
├── app.py                          # Streamlit landing page
├── analysis.ipynb                  # Inequality analysis notebook (professional audience)
├── requirements.txt
├── README.md
│
├── notebooks/
│   ├── ml_predictor.ipynb          # Job Creation ROI Predictor: ML training notebook
│   └── anomaly_detection.ipynb     # NEF Anomaly Detection: Isolation Forest notebook
│
├── pages/
│   ├── 1_Crisis_Context.py         # Unemployment benchmarks + DFI mandate
│   ├── 2_Geographic_Inequality.py  # Provincial money vs jobs + cost per job
│   ├── 3_Deal_Size_Inequality.py   # Lorenz curve + bracket divergence + grants
│   ├── 4_Job_Efficiency.py         # Top-10 disbursed vs top-10 job creators
│   ├── 5_Sector_Concentration.py   # IDC sectors + fiscal trend + accountability gap
│   ├── 6_Job_ROI_Predictor.py      # ML-powered cost-per-job predictor
│   └── 7_Anomaly_Detection.py      # Isolation Forest anomaly flagging
│
└── data/
    ├── IDC_Funded_Businesses.csv
    └── NEF_Funded_Businesses.csv
# 1. Clone the repo
git clone https://github.com/your-org/sa-dfi-inequality.git
cd sa-dfi-inequality
# 2. Install dependencies
pip install -r requirements.txt
# 3. Run the Streamlit app
streamlit run app.py
# 4. Or open the notebook
jupyter notebook analysis.ipynb
analysis.ipynb is a linear narrative with headings per analytical lens, for data science and policy professionals.
| Section | What it covers |
|---|---|
| Setup | Libraries, colour palette, matplotlib dark theme |
| Data Loading | IDC and NEF DataFrames reconstructed from dashboard exports |
| Lens 1 | Unemployment crisis as the moral benchmark |
| Lens 2 | Geographic concentration: Gini, provincial divergence, cost-per-job |
| Lens 3 | Deal size inequality: Lorenz curve, grants vs loans discrepancy |
| Lens 4 | Job creation efficiency: scatter, top-10 comparison, the 5,366× gap |
| Lens 5 | IDC sector concentration: sector breakdown, fiscal year trend |
| Lens 6 | Cross-dataset synthesis: IDC vs NEF comparison, discrepancy register |
| Conclusions | 6 numbered findings handed off to app.py |
notebooks/ml_predictor.ipynb trains three ML models to predict cost-per-job from deal characteristics.
| Section | What it covers |
|---|---|
| Data Construction | 17 real anchors + 375 aggregate-derived records |
| EDA | Distribution analysis, log-log relationships |
| Feature Engineering | log_disbursed, bracket_ord, has_grant, province dummies |
| Model Training | Linear Regression · Random Forest · XGBoost with 5-fold CV |
| Model Comparison | R², RMSE, predicted vs actual charts |
| Feature Importance | Deal size dominates at >85%, and the policy implication |
| Conclusions | 5 findings including the Umnotho Maize outlier note |
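
The feature engineering and 5-fold cross-validation steps in the table can be sketched as follows. This is an illustration on synthetic data, not the notebook's code: the raw column names (`disbursed`, `grant_amount`, `province`, `jobs`), the bracket edges, and the log-scaled target are all assumptions. XGBoost is left out to keep the example's dependencies small; its scikit-learn-compatible `XGBRegressor` could be added to the `models` dict in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 200

# Synthetic deals; column names and values are made up for illustration
deals = pd.DataFrame({
    "disbursed": rng.lognormal(mean=15, sigma=1.5, size=n),
    "grant_amount": rng.choice([0.0, 50_000.0], size=n),
    "province": rng.choice(["Gauteng", "KwaZulu-Natal", "Limpopo"], size=n),
    "jobs": rng.integers(1, 200, size=n),
})

# Features named as in the table above
deals["log_disbursed"] = np.log10(deals["disbursed"])
# Hypothetical bracket edges in rand: <R1m, R1m-R10m, R10m-R50m, >R50m
edges = [0, 1e6, 1e7, 5e7, np.inf]
deals["bracket_ord"] = pd.cut(deals["disbursed"], bins=edges, labels=False)
deals["has_grant"] = (deals["grant_amount"] > 0).astype(int)
X = pd.get_dummies(
    deals[["log_disbursed", "bracket_ord", "has_grant", "province"]],
    columns=["province"],
    dtype=float,
)

# Assumed target: cost per job on a log scale, since it spans orders of magnitude
y = np.log10(deals["disbursed"] / deals["jobs"])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
# Mean R² across the five folds for each model
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in models.items()
}
print(scores)
```

Because the target here is built from `disbursed` itself, deal size will dominate any importance ranking on this synthetic data by construction; the sketch shows the mechanics only.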
notebooks/anomaly_detection.ipynb uses an Isolation Forest to flag statistical outliers, both red flags and positive outliers.
| Section | What it covers |
|---|---|
| EDA | Disbursed vs jobs in raw and log-log space |
| Isolation Forest | Setup, contamination=10%, 200 estimators |
| Score Distribution | Anomaly score histogram with real company annotations |
| Anomaly Map | Scatter chart: disbursed vs jobs coloured by anomaly status |
| Named Outliers | Real companies flagged with full metrics |
| Province Analysis | Anomaly rate concentration by province |
| Sensitivity Analysis | Stability across contamination rates 5%-20% |
| Conclusions | 5 findings including CK Mafutha as most anomalous record |
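
A minimal sketch of the Isolation Forest setup from the table (contamination of 10%, 200 estimators), plus a small sensitivity loop. The data is synthetic, and fitting on log-scaled disbursed and jobs values is an assumption suggested by the log-log EDA row, not a confirmed detail of the notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic disbursed amounts (rand) and job counts, for illustration only
disbursed = rng.lognormal(mean=15, sigma=1.2, size=300)
jobs = rng.integers(1, 300, size=300)

# Work in log space so very large deals do not dominate the distance scale
X = np.column_stack([np.log10(disbursed), np.log10(jobs)])

model = IsolationForest(n_estimators=200, contamination=0.10, random_state=42)
labels = model.fit_predict(X)    # -1 = anomaly, 1 = normal
scores = model.score_samples(X)  # lower = more anomalous
flagged = np.flatnonzero(labels == -1)
print(f"Flagged {len(flagged)} of {len(X)} records")

# Sensitivity: how many records are flagged as the contamination rate changes
flagged_by_rate = {}
for rate in (0.05, 0.10, 0.20):
    forest = IsolationForest(n_estimators=200, contamination=rate, random_state=42)
    flagged_by_rate[rate] = int((forest.fit_predict(X) == -1).sum())
print(flagged_by_rate)
```

Isolation Forest flags a record for being unusual, not for being bad, so a flagged deal can be either an expensive low-job recipient or an unusually efficient one.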
| Tool | Purpose |
|---|---|
| pandas | Data wrangling and analysis |
| numpy | Statistical calculations (Gini, Lorenz) |
| matplotlib | Notebook visualisations (dark theme) |
| plotly | Interactive Streamlit charts |
| streamlit | Public-facing web application |
| scikit-learn | Machine learning: Isolation Forest, Random Forest, Linear Regression |
| xgboost | Gradient boosting: XGBoost regressor |
The app is deployed on Streamlit Cloud. To deploy your own instance:
- Fork this repo
- Go to share.streamlit.io
- Connect your GitHub account
- Select this repo and set Main file to app.py
- Click Deploy; dependencies install automatically from requirements.txt
| # | Title | Notebook | Streamlit Page | Status |
|---|---|---|---|---|
| 3 | Funding Concentration & Inequality Analysis | analysis.ipynb | Pages 1-5 | Live |
| 2 | Job Creation ROI Predictor | notebooks/ml_predictor.ipynb | Page 6 | Live |
| 5 | NEF Anomaly Detection | notebooks/anomaly_detection.ipynb | Page 7 | Live |
- IDC data covers named sectors only; 56.5% of investment is unattributed
- NEF job figures are self-reported in a parliamentary response and have not been independently audited
- FY2023-24 IDC data is absent from the source, so trend analysis skips this year
- Provincial unemployment rates are not included in this dataset, so geographic efficiency analysis uses national averages
- The NEF dataset does not include time-series data, so no year-on-year comparison is possible
This is an open civic data project. If you have access to more granular IDC or NEF data, or can identify errors in the source parliamentary question, please open an issue or pull request.
The underlying dataset was compiled and made publicly available by @AfikaSoyamba on X, who built a database of 1,248 South African businesses funded by the IDC and NEF (856 from the IDC, 392 from the NEF), including every company name, amount, and province. This analysis would not exist without that work. Thank you.
MIT License. See LICENSE for details.
Data sourced from South African public records. Analysis and code © 2025 Lindiwe Songelwa.
Part of a data science portfolio targeting the South African mining and finance sectors. Built with Python · Streamlit · Plotly · Public data.