NIR Spectroscopy for Collagen Quantification in Archaeological Bone

Reproduction of the analytical workflow from:

Ryder, C. et al. (2026). Refining near-infrared spectroscopy for collagen quantification: A new predictive model for archaeological bone. Journal of Archaeological Science, 185, 106448.

The notebook implements the full pipeline, from raw reflectance spectra to collagen yield prediction, using open-source Python libraries.

File	Description
`NIR_Collagen_Prediction.ipynb`	Main analysis notebook
`1-s2.0-S0305440325002973-mmc2.csv`	Supplementary data from the paper (176 samples, 2151 wavelengths)
`plsr_2045.json`	Serialized PLSR pipeline — IntensityConversion → SG → RangeCut(2030–2060 nm) → PLSRegression
`rf_2045.json`	Serialized Random Forest pipeline — IntensityConversion → SG → RangeCut(2030–2060 nm) → RandomForestRegressor

Dataset

The CSV file contains spectral data for 176 bone samples:

140 Reference samples: used for model calibration and validation. This set was already balanced with K-means from an original set of 319 samples (see Section 2.3 of the paper).
36 Zafarraya samples: an independent external validation set from Zafarraya Cave, Spain.

Columns:

Sample ID, Country of Origin, Extraction Technique (ORAU / MPI)
Collagen Yield (%)
Reflectance at 350–2500 nm (2151 variables)
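
As a rough illustration of how a table with this layout can be split into the two sample sets, the sketch below builds a synthetic frame with the same dimensions rather than reading the real CSV. The `Set` column and the wavelength column names (`"350"` to `"2500"`) are hypothetical; the actual file may label these differently.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the supplementary CSV. The "Set" column and the
# string wavelength headers are hypothetical, chosen only for illustration.
rng = np.random.default_rng(0)
wavelengths = np.arange(350, 2501)      # 350..2500 nm inclusive, 1 nm step
n_reference, n_zafarraya = 140, 36
n_total = n_reference + n_zafarraya

meta = pd.DataFrame({
    "Sample ID": [f"S{i:03d}" for i in range(n_total)],
    "Set": ["Reference"] * n_reference + ["Zafarraya"] * n_zafarraya,
    "Collagen Yield (%)": rng.uniform(0.0, 20.0, n_total),
})
spectra = pd.DataFrame(
    rng.uniform(0.1, 0.9, (n_total, wavelengths.size)),
    columns=[str(w) for w in wavelengths],
)
df = pd.concat([meta, spectra], axis=1)

# Spectral columns are the ones whose header parses as an integer wavelength.
spectral_cols = [c for c in df.columns if c.isdigit()]

reference = df[df["Set"] == "Reference"]
zafarraya = df[df["Set"] == "Zafarraya"]

X_ref = reference[spectral_cols].to_numpy()
y_ref = reference["Collagen Yield (%)"].to_numpy()
```

With the real file, the synthetic construction would be replaced by `pd.read_csv(...)` and the column names adjusted to match the CSV header.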

Notebook Overview

The notebook reproduces the complete workflow in 13 sections:

Section	Content
1–2	Imports and data loading
3	Exploratory data analysis (yield distribution, raw spectra)
4	Spectral preprocessing: reflectance → pseudo-absorbance → Savitzky-Golay 2nd derivative
5	PCA for outlier detection (full range and NIR range)
6	Stratified calibration / validation split (100 / 40 samples)
7	Helper functions
8	PLSR across 11 wavelength ranges (reproduces Table 5)
9	Random Forest regression — full NIR and 2030–2060 nm (reproduces Sections 3.1–3.2)
10	Variable importance plots
11	Combined LOO-CV models on all 140 samples
12	Zafarraya external validation with PVA consolidant detection (reproduces Tables 6–8)
13	Model persistence to JSON using OpenModels
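
The preprocessing chain in Section 4, followed by a range cut, can be sketched with NumPy and SciPy alone. This is not the notebook's code, which uses chemotools transformers. The Savitzky-Golay `window_length` and `polyorder` below are placeholder values, not the settings from the paper, and `preprocess` is a hypothetical helper name.

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess(reflectance, wavelengths, lo=2030, hi=2060,
               window_length=11, polyorder=2):
    """Reflectance -> pseudo-absorbance -> SG 2nd derivative -> range cut.

    window_length and polyorder are placeholders, not the paper's values.
    """
    reflectance = np.asarray(reflectance, dtype=float)
    # Pseudo-absorbance A = log10(1 / R); clip to avoid log of zero.
    absorbance = -np.log10(np.clip(reflectance, 1e-6, None))
    # Second derivative along the wavelength axis (one row per sample).
    d2 = savgol_filter(absorbance, window_length=window_length,
                       polyorder=polyorder, deriv=2, axis=1)
    # Keep only wavelengths inside [lo, hi] nm, inclusive.
    mask = (wavelengths >= lo) & (wavelengths <= hi)
    return d2[:, mask], wavelengths[mask]

# Synthetic reflectance for five samples on a 1 nm grid.
wavelengths = np.arange(350, 2501)
rng = np.random.default_rng(0)
R = rng.uniform(0.2, 0.8, size=(5, wavelengths.size))

X, kept = preprocess(R, wavelengths)   # X has one column per kept wavelength
```

On a 1 nm grid the 2030–2060 nm cut keeps 31 variables, which matches the feature count quoted for the preferred model.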

The preferred model (2030–2060 nm, 1 PLSR factor) targets the 2045 nm absorption feature associated with the 2nd overtone of C=O stretching and N-H stretching in collagen.

Dependencies

numpy
pandas
matplotlib
scikit-learn
chemotools
openmodels

Install with:

pip install -r requirements.txt

or manually:

pip install numpy pandas matplotlib scikit-learn chemotools openmodels

Usage

1. Clone the repository.
2. Install dependencies (see above).
3. Open NIR_Collagen_Prediction.ipynb in Jupyter Lab or Jupyter Notebook.
4. Run all cells in order (Kernel → Restart & Run All).

The notebook is self-contained: all preprocessing, modelling, and evaluation steps run sequentially without external configuration.

Prediction API

The serialized models are also available as a REST API, so you can predict collagen yield directly from your own spectra without running the notebook.

Try it online — no installation needed:

https://collagen-prediction.fastapicloud.dev

Run it locally — download the pre-built executable for your platform from the Releases page and run it:

# macOS / Linux
chmod +x nir-collagen-api-macos   # or nir-collagen-api-linux
./nir-collagen-api-macos

# Windows
nir-collagen-api-windows.exe

Then open http://localhost:8000/docs to explore the endpoints interactively.

The API accepts spectra as JSON (single or batch) or as a CSV file upload, and supports both pseudo-absorbance and raw reflectance inputs.

Key Results Reproduced

PLSR model performance across wavelength ranges (100-sample calibration / 40-sample validation split):

Range	Factors	Val R²	Val RMSE
2030–2060 nm (preferred)	1	0.876	1.78%
2030–2060 + 2244–2300 nm	3	0.890	1.67%
2000–2300 nm	3	0.883	1.73%
780–2500 nm (full NIR)	3	0.862	1.88%
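
The table above rests on a stratified 100 / 40 split. How the notebook stratifies a continuous target is not described here; one common approach, assumed in this sketch, is to bin collagen yield into quantiles and stratify on the bin labels. The five bins and the synthetic data are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
y = rng.uniform(0.0, 20.0, 140)      # synthetic collagen yields (%)
X = rng.normal(size=(140, 31))       # synthetic preprocessed spectra

# Bin the continuous target into quantiles so that both subsets cover the
# whole yield range; 5 bins is an arbitrary illustrative choice.
bins = pd.qcut(y, q=5, labels=False)

X_cal, X_val, y_cal, y_val, b_cal, b_val = train_test_split(
    X, y, bins, test_size=40, stratify=bins, random_state=0
)
```

Validation R² and RMSE are then computed on `X_val`, `y_val` only, with the model fitted on the calibration subset.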

Combined models — Leave-One-Out CV (all 140 reference samples):

Model	LOO-CV R²	LOO-CV RMSE
PLSR 2030–2060 nm	0.883	1.62%
RF 2030–2060 nm	0.894	1.54%
RF 780–2500 nm	0.919	1.35%

The 2030–2060 nm range (1 PLSR factor, 31 features) offers the best balance between parsimony and predictive accuracy and is the preferred model for deployment.
Restricting the spectral range to 2030–2060 nm avoids PVA consolidant absorption bands at 2135, 2250, and 2296 nm, making the model robust for museum collections.
Random Forest (780–2500 nm) achieves the lowest LOO-CV RMSE (1.35%) but requires the full spectral range and is more sensitive to consolidant contamination.
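
The overlap between each candidate range and the PVA bands can be checked with a few lines of Python. The sketch compares band centres only, using the values quoted in this README; it ignores band width and any smoothing window, so it is a simplification rather than a full contamination test. `pva_bands_in` is a hypothetical helper.

```python
# PVA consolidant band centres (nm) and candidate ranges, as quoted above.
PVA_BANDS_NM = (2135, 2250, 2296)
RANGES_NM = {
    "2030-2060": [(2030, 2060)],
    "2030-2060 + 2244-2300": [(2030, 2060), (2244, 2300)],
    "2000-2300": [(2000, 2300)],
    "780-2500": [(780, 2500)],
}

def pva_bands_in(segments):
    """Return the PVA band centres inside any (lo, hi) segment, inclusive."""
    return [b for b in PVA_BANDS_NM
            if any(lo <= b <= hi for lo, hi in segments)]

overlap = {name: pva_bands_in(segs) for name, segs in RANGES_NM.items()}
```

Only the 2030–2060 nm range contains none of the three band centres; the two-segment range picks up 2250 and 2296 nm, and the two wider ranges contain all three.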
