Reproduction of the analytical workflow from:
Ryder, C. et al. (2026). Refining near-infrared spectroscopy for collagen quantification: A new predictive model for archaeological bone. Journal of Archaeological Science, 185, 106448.
The notebook implements the full pipeline — from raw reflectance spectra to collagen yield prediction — using Python open-source libraries.
The CSV file contains spectral data for 176 bone samples:
- 140 Reference samples — used for model calibration and validation. Already K-means balanced from an original set of 319 samples (see paper Section 2.3).
- 36 Zafarraya samples — independent external validation set from Zafarraya Cave, Spain.
Columns:
- Sample ID, Country of Origin, Extraction Technique (ORAU / MPI)
- Collagen Yield (%)
- Reflectance at 350–2500 nm (2151 variables)
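
As a rough illustration of this column layout, the sketch below builds a small synthetic frame and slices a wavelength window from it. The column names (`Sample ID`, `Collagen Yield (%)`, and wavelengths stored as string headers `"350"` to `"2500"`) and the helper `select_range` are assumptions made for the example; check the actual CSV header before reusing it.

```python
import numpy as np
import pandas as pd

# Assumed layout: metadata columns followed by one column per nanometre.
wavelengths = np.arange(350, 2501)  # 350..2500 nm inclusive
rng = np.random.default_rng(0)

# Synthetic reflectance values stand in for the real spectra.
df = pd.DataFrame(
    rng.uniform(0.1, 0.9, size=(5, wavelengths.size)),
    columns=[str(w) for w in wavelengths],
)
df.insert(0, "Collagen Yield (%)", rng.uniform(0, 20, size=5))
df.insert(0, "Sample ID", [f"S{i}" for i in range(5)])


def select_range(frame, lo, hi):
    """Return the reflectance columns whose wavelength lies in [lo, hi] nm."""
    cols = [c for c in frame.columns if c.isdigit() and lo <= int(c) <= hi]
    return frame[cols]


spectra = select_range(df, 350, 2500)   # all reflectance variables
window = select_range(df, 2030, 2060)   # the preferred model's range
print(spectra.shape, window.shape)
```

At 1 nm spacing, the full 350–2500 nm range gives the 2151 variables listed above, and the 2030–2060 nm window gives 31.
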
The notebook reproduces the complete workflow in 13 sections:
| Section | Content |
|---|---|
| 1–2 | Imports and data loading |
| 3 | Exploratory data analysis (yield distribution, raw spectra) |
| 4 | Spectral preprocessing: reflectance → pseudo-absorbance → Savitzky-Golay 2nd derivative |
| 5 | PCA for outlier detection (full range and NIR range) |
| 6 | Stratified calibration / validation split (100 / 40 samples) |
| 7 | Helper functions |
| 8 | PLSR across 11 wavelength ranges (reproduces Table 5) |
| 9 | Random Forest regression — full NIR and 2030–2060 nm (reproduces Sections 3.1–3.2) |
| 10 | Variable importance plots |
| 11 | Combined LOO-CV models on all 140 samples |
| 12 | Zafarraya external validation with PVA consolidant detection (reproduces Tables 6–8) |
| 13 | Model persistence to JSON using OpenModels |
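
The preprocessing chain in section 4 can be sketched as follows. This is a minimal stand-in, not the notebook's code: it uses `scipy.signal.savgol_filter` rather than chemotools, and the window length and polynomial order are illustrative assumptions, not the values used in the paper.

```python
import numpy as np
from scipy.signal import savgol_filter


def preprocess(reflectance, window_length=11, polyorder=2):
    """Reflectance -> pseudo-absorbance -> Savitzky-Golay 2nd derivative.

    window_length and polyorder are placeholder values (assumptions).
    """
    reflectance = np.asarray(reflectance, dtype=float)
    # Pseudo-absorbance: A = log10(1 / R); clip to avoid taking log of zero.
    absorbance = np.log10(1.0 / np.clip(reflectance, 1e-6, None))
    # Second derivative along the wavelength axis (last axis).
    return savgol_filter(
        absorbance,
        window_length=window_length,
        polyorder=polyorder,
        deriv=2,
        axis=-1,
    )


# Synthetic example: 3 identical smooth spectra at 1 nm spacing, 350-2500 nm.
wl = np.arange(350, 2501)
R = (0.5 + 0.1 * np.sin(wl / 200.0)) * np.ones((3, 1))
d2 = preprocess(R)
print(d2.shape)
```

The derivative is taken per sample along the wavelength axis, so the output keeps the same shape as the input matrix.
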
The preferred model (2030–2060 nm, 1 PLSR factor) targets the 2045 nm absorption feature associated with the 2nd overtone of C=O stretching and N-H stretching in collagen.
- numpy
- pandas
- matplotlib
- scikit-learn
- chemotools
- openmodels

Install with:

    pip install -r requirements.txt

or manually:

    pip install numpy pandas matplotlib scikit-learn chemotools openmodels

- Clone the repository.
- Install dependencies (see above).
- Open `NIR_Collagen_Prediction.ipynb` in Jupyter Lab or Jupyter Notebook.
- Run all cells in order (`Kernel → Restart & Run All`).

The notebook is self-contained: all preprocessing, modelling, and evaluation steps run sequentially without external configuration.
The serialized models are also available as a REST API, so you can predict collagen yield directly from your own spectra without running the notebook.
Try it online — no installation needed:
https://collagen-prediction.fastapicloud.dev
To run it locally, download the pre-built executable for your platform from the Releases page and run it:

    # macOS / Linux
    chmod +x nir-collagen-api-macos   # or nir-collagen-api-linux
    ./nir-collagen-api-macos

    # Windows
    nir-collagen-api-windows.exe

Then open http://localhost:8000/docs to explore the endpoints interactively.
The API accepts spectra as JSON (single or batch) or as a CSV file upload, and supports both pseudo-absorbance and raw reflectance inputs.
PLSR model performance across wavelength ranges (100-sample calibration / 40-sample validation split):
| Range | Factors | Val R² | Val RMSE |
|---|---|---|---|
| 2030–2060 nm (preferred) | 1 | 0.876 | 1.78% |
| 2030–2060 + 2244–2300 nm | 3 | 0.890 | 1.67% |
| 2000–2300 nm | 3 | 0.883 | 1.73% |
| 780–2500 nm (full NIR) | 3 | 0.862 | 1.88% |
Combined models — Leave-One-Out CV (all 140 reference samples):
| Model | LOO-CV R² | LOO-CV RMSE |
|---|---|---|
| PLSR 2030–2060 nm | 0.883 | 1.62% |
| RF 2030–2060 nm | 0.894 | 1.54% |
| RF 780–2500 nm | 0.919 | 1.35% |
- The 2030–2060 nm range (1 PLSR factor, 31 features) delivers the best balance between parsimony and predictive accuracy — the preferred model for deployment.
- Restricting the spectral range to 2030–2060 nm avoids PVA consolidant absorption bands at 2135, 2250, and 2296 nm, making the model robust for museum collections.
- Random Forest (780–2500 nm) achieves the lowest LOO-CV RMSE (1.35%) but requires the full spectral range and is more sensitive to consolidant contamination.
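
The consolidant argument can be checked with a few lines of arithmetic. The sketch below tests which of the wavelength ranges listed in this README contain the PVA band positions quoted above. It compares band centres only and ignores band width, which is a simplification. The helper name `pva_bands_in_ranges` is hypothetical.

```python
# PVA consolidant absorption bands named in the text (nm).
PVA_BANDS_NM = (2135, 2250, 2296)


def pva_bands_in_ranges(ranges):
    """Return the PVA bands that fall inside any (lo, hi) window, inclusive."""
    return [b for b in PVA_BANDS_NM if any(lo <= b <= hi for lo, hi in ranges)]


# Wavelength ranges from the PLSR results table.
candidates = {
    "2030-2060 nm": [(2030, 2060)],
    "2030-2060 + 2244-2300 nm": [(2030, 2060), (2244, 2300)],
    "2000-2300 nm": [(2000, 2300)],
    "780-2500 nm": [(780, 2500)],
}

overlaps = {name: pva_bands_in_ranges(r) for name, r in candidates.items()}
for name, hits in overlaps.items():
    print(name, "->", hits or "no PVA bands")
```

Only the 2030–2060 nm window excludes all three band positions; every wider range in the table includes at least two of them.
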