Gnpd/NIR-Collagen-Prediction

NIR Spectroscopy for Collagen Quantification in Archaeological Bone

Reproduction of the analytical workflow from:

Ryder, C. et al. (2026). Refining near-infrared spectroscopy for collagen quantification: A new predictive model for archaeological bone. Journal of Archaeological Science, 185, 106448.

The notebook implements the full pipeline, from raw reflectance spectra to collagen yield prediction, using open-source Python libraries.


Contents

| File | Description | Open in Colab |
|---|---|---|
| NIR_Collagen_Prediction.ipynb | Main analysis notebook | Open In Colab |
| 1-s2.0-S0305440325002973-mmc2.csv | Supplementary data from the paper (176 samples, 2151 wavelengths) | |
| plsr_2045.json | Serialized PLSR pipeline: IntensityConversion → SG → RangeCut(2030–2060 nm) → PLSRegression | |
| rf_2045.json | Serialized Random Forest pipeline: IntensityConversion → SG → RangeCut(2030–2060 nm) → RandomForestRegressor | |

Dataset

The CSV file contains spectral data for 176 bone samples:

  • 140 Reference samples: used for model calibration and validation. This set was already K-means balanced from an original set of 319 samples (see Section 2.3 of the paper).
  • 36 Zafarraya samples: an independent external validation set from Zafarraya Cave, Spain.

Columns:

  • Sample ID, Country of Origin, Extraction Technique (ORAU / MPI)
  • Collagen Yield (%)
  • Reflectance at 350–2500 nm (2151 variables)
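
The column layout above can be sanity-checked with a short sketch. It assumes a 1 nm sampling step, which is consistent with 2151 variables between 350 and 2500 nm; the names `wavelengths` and `range_mask` are illustrative and not taken from the notebook.

```python
import numpy as np

# Assumption: one reflectance column per nanometre, 350..2500 nm inclusive.
wavelengths = np.arange(350, 2501)

def range_mask(wl, start_nm, stop_nm):
    """Boolean mask selecting wavelengths in [start_nm, stop_nm], inclusive."""
    return (wl >= start_nm) & (wl <= stop_nm)

nir = range_mask(wavelengths, 780, 2500)       # full NIR range
window = range_mask(wavelengths, 2030, 2060)   # preferred model window

print(wavelengths.size)    # 2151 spectral variables
print(int(window.sum()))   # 31 features in the 2030-2060 nm window
```

The same mask can be applied to the spectral columns of the CSV once they are loaded as a NumPy array with one row per sample.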

Notebook Overview

The notebook reproduces the complete workflow in 13 sections:

| Section | Content |
|---|---|
| 1–2 | Imports and data loading |
| 3 | Exploratory data analysis (yield distribution, raw spectra) |
| 4 | Spectral preprocessing: reflectance → pseudo-absorbance → Savitzky-Golay 2nd derivative |
| 5 | PCA for outlier detection (full range and NIR range) |
| 6 | Stratified calibration / validation split (100 / 40 samples) |
| 7 | Helper functions |
| 8 | PLSR across 11 wavelength ranges (reproduces Table 5) |
| 9 | Random Forest regression, full NIR and 2030–2060 nm (reproduces Sections 3.1–3.2) |
| 10 | Variable importance plots |
| 11 | Combined LOO-CV models on all 140 samples |
| 12 | Zafarraya external validation with PVA consolidant detection (reproduces Tables 6–8) |
| 13 | Model persistence to JSON using OpenModels |
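
One way to obtain a stratified 100 / 40 split on a continuous target (Section 6) is to bin the collagen yields and stratify on the bins. The sketch below is only an illustration of that idea: the quintile binning, the random seed, and the synthetic yields are assumptions, and the notebook may stratify differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the 140 reference collagen yields (%), not real data.
yields = rng.uniform(0.0, 20.0, size=140)
sample_idx = np.arange(yields.size)

# Assumption: stratify by yield quintile so both subsets span the yield range.
edges = np.quantile(yields, [0.2, 0.4, 0.6, 0.8])
strata = np.digitize(yields, edges)  # integer labels 0..4

cal_idx, val_idx = train_test_split(
    sample_idx, test_size=40, stratify=strata, random_state=0
)
print(len(cal_idx), len(val_idx))  # 100 40
```

Stratifying on bins keeps low-yield and high-yield bones represented in both the calibration and the validation subset.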

The preferred model (2030–2060 nm, 1 PLSR factor) targets the 2045 nm absorption feature associated with the 2nd overtone of C=O stretching and N-H stretching in collagen.
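
The preprocessing chain in front of the preferred model (reflectance → pseudo-absorbance → Savitzky-Golay 2nd derivative → 2030–2060 nm cut) can be sketched with NumPy and SciPy. This is a stand-in for the chemotools transformers used in the notebook, not a copy of them: the function name `preprocess`, the window length of 11, the polynomial order of 2, and the assumption that reflectance is a fraction between 0 and 1 are all placeholders.

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess(reflectance, wavelengths, window_length=11, polyorder=2,
               start_nm=2030, stop_nm=2060):
    """Reflectance -> pseudo-absorbance -> SG 2nd derivative -> range cut."""
    # Pseudo-absorbance: A = log10(1 / R); R must be strictly positive.
    absorbance = np.log10(1.0 / np.asarray(reflectance, dtype=float))
    # Savitzky-Golay second derivative along the wavelength axis.
    # window_length and polyorder are placeholders, not the paper's settings.
    d2 = savgol_filter(absorbance, window_length, polyorder, deriv=2, axis=-1)
    keep = (wavelengths >= start_nm) & (wavelengths <= stop_nm)
    return d2[..., keep]

wavelengths = np.arange(350, 2501)
# Synthetic spectra, not real data: two samples with a quadratic absorbance profile.
true_abs = np.array([1e-7, 2e-7])[:, None] * (wavelengths - 1400.0) ** 2
reflectance = 10.0 ** (-true_abs)

features = preprocess(reflectance, wavelengths)
print(features.shape)  # (2, 31)
```

Taking the derivative before cutting the range means the filter window near 2030 and 2060 nm is fed by real neighbouring wavelengths rather than by edge padding.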


Dependencies

numpy
pandas
matplotlib
scikit-learn
chemotools
openmodels

Install with:

pip install -r requirements.txt

or manually:

pip install numpy pandas matplotlib scikit-learn chemotools openmodels

Usage

  1. Clone the repository.
  2. Install dependencies (see above).
  3. Open NIR_Collagen_Prediction.ipynb in Jupyter Lab or Jupyter Notebook.
  4. Run all cells in order (Kernel → Restart & Run All).

The notebook is self-contained: all preprocessing, modelling, and evaluation steps run sequentially without external configuration.


Prediction API

The serialized models are also available as a REST API, so you can predict collagen yield directly from your own spectra without running the notebook.

Try it online — no installation needed:

https://collagen-prediction.fastapicloud.dev

Run it locally — download the pre-built executable for your platform from the Releases page and run it:

# macOS / Linux
chmod +x nir-collagen-api-macos   # or nir-collagen-api-linux
./nir-collagen-api-macos

# Windows
nir-collagen-api-windows.exe

Then open http://localhost:8000/docs to explore the endpoints interactively.

The API accepts spectra as JSON (single or batch) or as a CSV file upload, and supports both pseudo-absorbance and raw reflectance inputs.
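
As a rough illustration of a single-spectrum JSON request, the sketch below builds (but does not send) a POST request with the standard library. The endpoint path `/predict` and the field names `wavelengths`, `values`, and `input_type` are hypothetical placeholders; the real schema is the one shown on the interactive `/docs` page.

```python
import json
from urllib import request

# Everything below is hypothetical: the endpoint path and the JSON field
# names are placeholders. Check the interactive /docs page for the real schema.
BASE_URL = "http://localhost:8000"
ENDPOINT = "/predict"  # assumed path

def build_request(wavelengths, values, input_type="reflectance"):
    """Build (but do not send) a JSON POST request for one spectrum."""
    if len(wavelengths) != len(values):
        raise ValueError("wavelengths and values must have the same length")
    payload = {
        "wavelengths": list(wavelengths),
        "values": list(values),
        "input_type": input_type,  # "reflectance" or "absorbance" (assumed)
    }
    return request.Request(
        BASE_URL + ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder spectrum covering 2030-2060 nm with a flat reflectance of 0.5.
wl = list(range(2030, 2061))
req = build_request(wl, [0.5] * len(wl))
print(req.get_method(), req.full_url)
# To send it: request.urlopen(req), then parse the JSON response body.
```

Once the field names are adjusted to the documented schema, the same request can be pointed at the hosted URL instead of localhost.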


Key Results Reproduced

PLSR model performance across wavelength ranges (100-sample calibration / 40-sample validation split):

| Range | Factors | Val R² | Val RMSE |
|---|---|---|---|
| 2030–2060 nm (preferred) | 1 | 0.876 | 1.78% |
| 2030–2060 + 2244–2300 nm | 3 | 0.890 | 1.67% |
| 2000–2300 nm | 3 | 0.883 | 1.73% |
| 780–2500 nm (full NIR) | 3 | 0.862 | 1.88% |
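
For readers unfamiliar with the two metrics in the table, the sketch below computes them from measured and predicted yields. It assumes the standard definitions (R² as one minus the ratio of residual to total sum of squares, RMSE in the same units as collagen yield); the helper name `r2_and_rmse` and the four example values are made up for illustration.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) for measured vs. predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean(residuals ** 2))
    return r2, rmse

# Made-up yields (%) for illustration only.
measured = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 3.0, 7.0, 7.0]
r2, rmse = r2_and_rmse(measured, predicted)
print(round(r2, 3), round(rmse, 3))  # 0.8 1.0
```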

Combined models — Leave-One-Out CV (all 140 reference samples):

| Model | LOO-CV R² | LOO-CV RMSE |
|---|---|---|
| PLSR 2030–2060 nm | 0.883 | 1.62% |
| RF 2030–2060 nm | 0.894 | 1.54% |
| RF 780–2500 nm | 0.919 | 1.35% |

  • The 2030–2060 nm range (1 PLSR factor, 31 features) delivers the best balance between parsimony and predictive accuracy, making it the preferred model for deployment.
  • Restricting the spectral range to 2030–2060 nm avoids PVA consolidant absorption bands at 2135, 2250, and 2296 nm, making the model robust for museum collections.
  • Random Forest (780–2500 nm) achieves the lowest LOO-CV RMSE (1.35%) but requires the full spectral range and is more sensitive to consolidant contamination.
