Master's Thesis: Estimating Protein in Barley Using ML on Multispectral Imaging

This repository contains the code developed during my master’s thesis in spring 2025, used to preprocess data, visualize results, build machine learning models, and generate final analyses. It also includes the improvements and bug fixes applied for the informal revised version of the thesis, which was submitted to my supervisor and project leader on 20.06.2025, the day after the thesis defence. The content of this README is mostly pulled from the master's thesis, while some of the results are pulled from the revised version. For more detailed explanations, see either document.

To replicate this study, follow these steps: 1) download this repository, 2) set up a Python environment by importing masteroppgave.yml into your package manager, 3) run min_masteroppgave_2_WP2-2-korrelasjoner.ipynb and min_masteroppgave_3_WP2-3-AI-modellering.ipynb.

Background

With the green revolution industrializing the agricultural sector, the global consumption of nitrogen fertilizer has increased approximately 100-fold over the past 100 years. Today, around 94 million metric tons of nitrogen are consumed globally in fertilizers for cereal crops (Ladha et al., 2005; N. Sandhu et al., 2021). Projections estimate the grain demand to increase by 50-70% by 2050, which could require a proportional increase in demand for nitrogen fertilizer unless something is done (Ladha et al., 2005). This high use of nitrogen fertilizer is underpinned by issues of nitrogen use inefficiency and a slow rate of genetic gain. Solving these issues will be steps toward more sustainable agriculture, reaping both economic and environmental benefits.

Research question

As Norwegian innovation in agriculture relies heavily on public research (“Policies for the Future of Farming and Food in Norway”, 2021), this thesis aims to advance the current state of research towards applicable solutions. Even though research exists on each technology and domain that this thesis will touch on, research on their combination for our specific purpose is lacking. Therefore, this thesis will combine the technologies of unmanned aerial vehicles, multi-spectral imaging and machine learning for remote sensing of nitrogen and protein in soil and barley.

This thesis is part of the broader research project called "Protein bar", which aims to achieve "increased protein production from Norwegian barley for animal feed (NFR 336315)" (Appendix A.1). More specifically, this thesis is part of work package 2 (WP2) which aims at "linking crop canopy data with spatial soil variability to better assess plant nitrogen status and fertilization needs through remote sensing" (Appendix A.2). Therefore, this thesis will first try to accomplish the two following tasks under WP2 before arriving at the third task which fulfills the thesis’ title. In total, it will accomplish the three following tasks:

WP2 2-2: Correlate and calibrate multispectral UAV images with soil/plant (N) sensor measurements, plant available N, total soil N and N fertilization strategy.
WP2 2-3: Use machine learning in combination with multispectral images of soil/plants to estimate the soil’s content of plant-available N and thus the need for N fertilization.
Use machine learning on multispectral imaging, to estimate protein level in barley.

Methodology

Data

In total, four types of data were acquired for this study: multi-spectral data, soil samples, crop samples and weather data. The multi-spectral data was captured using Unmanned Aerial Vehicles fitted with multi-spectral sensors, before being processed first into orthomosaics using Pix4DMapper and then into zonal statistics using QGIS. These zonal statistics are what can be found in this repository's "Data/QGIS" directory.
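To illustrate what a zonal statistic is, here is a minimal NumPy sketch that computes the median and standard deviation of one band per plot. It is not the QGIS workflow used in the thesis: the toy raster, the zone mask, the convention that zone id 0 means "outside every plot", and the function name `zonal_stats` are all assumptions made for illustration.

```python
import numpy as np


def zonal_stats(band: np.ndarray, zones: np.ndarray) -> dict:
    """Return the median and standard deviation of `band` for each zone id."""
    stats = {}
    for zone_id in np.unique(zones):
        if zone_id == 0:  # assumption: 0 marks pixels outside every plot
            continue
        pixels = band[zones == zone_id]
        stats[int(zone_id)] = {
            "median": float(np.median(pixels)),
            "std": float(np.std(pixels)),
        }
    return stats


# Toy 4x4 reflectance raster containing two plots (zone ids 1 and 2).
band = np.array([
    [0.10, 0.20, 0.50, 0.50],
    [0.30, 0.40, 0.50, 0.50],
    [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00],
])
zones = np.array([
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
])

result = zonal_stats(band, zones)
print(result)
```

Each plot ends up as one row of summary values per band, which is the tabular form the later correlation and modelling steps work on.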

As this is part of a publicly funded science project at a public university, we are committed to the FAIR principles and to making the data and results publicly accessible. As stated in the research project's data management plan: "Our intention is to make as much as possible of the generated data that are of high quality and valuable for long-term storage accessible for re-use". None of the data acquired contains sensitive personal data or raises security concerns. Long-term, the project's data will be stored at "NMBU Open Research Data", whose costs will be covered by the university library. In addition to this, I have received explicit permission from the project manager and supervisor to publish my thesis data in this repository.

Code setup

For this thesis, the preprocessing, data analysis and production of results were done by running Python (version 3.12.8) code on Anaconda, which is a powerful and robust cross-platform package and environment manager.

This thesis also used several Python packages. One of them is Scikit-learn (often shortened to sklearn) for machine learning, which provides state-of-the-art supervised and unsupervised algorithms for medium-scale problems. The module is open-source under the BSD license, has minimal dependencies and ensures solid implementations through unit tests with an 81% coverage (as of release 0.8) and analysis tools such as pep8 and pyflakes (Pedregosa et al., 2018). Another package used is Pandas, a modern Python library for efficiently handling tabular data that has been dubbed a “foundational layer for future statistical computing”. It is a very powerful and low-cost platform for data analysis that builds on NumPy and supports matplotlib (Vachiyatwala2, 2022). NumPy, which is also used, is the package containing the NumPy array, which is the standard way to represent numerical data in the Python world. It is a data structure that can have any dimensionality and more kinds of elements, such as booleans and dates, than the built-in python matrix. It is also more efficient and therefore ideally suited for high-performance numerical computation (van der Walt et al., 2011). For visualization, matplotlib.pyplot and seaborn were used.

A full list of Python packages and version numbers can be seen in the file masteroppgave.yml, which can easily be imported as an environment into most package managers. An overview of the main Python packages used in the thesis, with their version numbers and purpose, is shown below.

Package Name	Version	Purpose
pandas	2.2.3	Data manipulation and analysis
numpy	1.26.4	Representing numerical data and performing operations
matplotlib	3.9.2	Data visualization
seaborn	0.13.2	Additional data visualization
scikit-learn	1.5.1	Machine learning
mlxtend	0.23.1	Correlation matrix

Code structure

The full code is divided into four numbered files, each of which depends on the lower-numbered files. Each file therefore starts by running the lower-numbered files and capturing their output.

min_masteroppgave_0_imports.py
min_masteroppgave_1_import_og_databehandling.ipynb:

📥 1. Imports
📁 2. Reading data
🧹 3. Data cleaning
🔍 4. Data exploration and visualisation

min_masteroppgave_2_WP2-2-korrelasjoner.ipynb:

Averages over time
Correlation matrices
Correlation lists
Boxplots

min_masteroppgave_3_WP2-3-AI-modellering.ipynb:

🛠 5. Data preprocessing
🤖 6. Modelling
- Defining models and hyperparameter grids
- Cross-validation
- Plotting and table functions
- Testing ML
- ML nitrogen experiments (WP2-3)
- ML protein experiments

Model training

Both linear and non-linear ML models were used: Lasso Regression, Ridge Regression, ElasticNet Regression, Random Forest Regressor and Gradient Boosting Regressor. Lasso Regression was chosen due to its demonstrated higher predictive and computational performance than SVR and, by extension, also than ReLU and SELU. Because Ridge Regression is so similar, but with L2 instead of L1 regularization, it was also tested to see the effect of this regularization choice. ElasticNet Regression was also chosen, as it has been shown to achieve the benefits of both L1 and L2. Regarding the non-linear models, one reason for choosing Random Forest Regressor is its reputation as one of the most successful, best-performing and reliable off-the-shelf supervised learning algorithms in modern times. Another reason is its success in Ijaz’s thesis. Gradient Boosting Regressor was chosen because it is one of the most popular ensemble implementations, alongside Random Forest Regressor.

To find optimal hyperparameters, Grid Search was used, both because of its success over alternatives in the literature and because of its reproducibility. It was implemented using Sklearn’s GridSearchCV, which performs a one-loop cross-validation for each combination of hyperparameters. This lowers the risk that some combinations appear better than others due to random chance.
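As a minimal sketch of such a grid search, the snippet below tunes the `alpha` of a Lasso model with `GridSearchCV`. The synthetic data, the grid values and the plain 5-fold split are assumptions chosen for illustration and are not the grids or splits used in the thesis notebooks.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption: stands in for the spectral features).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=60)

# Hypothetical hyperparameter grid; every value is cross-validated.
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0]}

search = GridSearchCV(Lasso(max_iter=10_000), param_grid, cv=5, scoring="r2")
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
n_candidates = len(search.cv_results_["params"])
print(f"best alpha: {best_alpha}, mean CV R^2: {search.best_score_:.3f}")
```

Because the grid is fixed and exhaustive, rerunning the search on the same folds yields the same candidate ranking, which is the reproducibility argument made above.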

The K-fold cross-validation was chosen as it provides a good balance between predictive performance and computational performance. This was used both in the inner and the outer validation loop. The specific implementation used was Sklearn’s GroupKFold, which treats all samples belonging to the same time series together. This is in contrast to KFold which would treat each sample completely independently, causing information leakage, due to the inherent relation between samples belonging to the same time series.
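The grouping behaviour can be checked with a small sketch. The layout below (six plots, each observed on four dates) is hypothetical. The loop verifies that `GroupKFold` never places samples from the same time series in both the training and the test fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical layout: 6 plots, each observed on 4 dates (one time series per plot).
n_plots, n_dates = 6, 4
groups = np.repeat(np.arange(n_plots), n_dates)
X = np.arange(n_plots * n_dates, dtype=float).reshape(-1, 1)
y = np.zeros(n_plots * n_dates)

splitter = GroupKFold(n_splits=3)
overlaps = []
for train_idx, test_idx in splitter.split(X, y, groups=groups):
    # Count plots that appear on both sides of the split.
    shared = set(groups[train_idx]) & set(groups[test_idx])
    overlaps.append(len(shared))

print(overlaps)
```

A plain `KFold` on the same data would split by row rather than by plot, so observations from one plot could land on both sides of a split.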

All the ML model runs were performed across multiple random states, namely the values [0, 1, 2, 3, 4], to capture the random distribution of the model performances.
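A minimal sketch of repeating a run over these random states is shown below. How the seed enters the thesis code is not stated here, so seeding the model through `random_state` is an assumption, as are the synthetic data and the small forest size.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

RANDOM_STATES = [0, 1, 2, 3, 4]

# Synthetic data: 10 hypothetical plots with 4 observations each.
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 3))
y = X[:, 0] + rng.normal(scale=0.2, size=40)
groups = np.repeat(np.arange(10), 4)

scores = []
for state in RANDOM_STATES:
    # Assumption: the seed only affects the model, not the grouped folds.
    model = RandomForestRegressor(n_estimators=20, random_state=state)
    fold_scores = cross_val_score(
        model, X, y, groups=groups, cv=GroupKFold(n_splits=5), scoring="r2"
    )
    scores.append(float(fold_scores.mean()))

mean_score = float(np.mean(scores))
std_score = float(np.std(scores))
print(f"R^2 over seeds: {mean_score:.3f} +/- {std_score:.3f}")
```

Reporting the spread across seeds alongside the mean shows how much of a model's score depends on its random initialization.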

Results

First, looking at the soil Nitrogen, we can see that it gradually decreases during the season, as the crops absorb it into their canopy. This applies to all the fertilization groups.

By using NDVI as a proxy for nitrogen in the plant, we can see the inverse curve in the figure below, matching our expectation. The control group shows lower growth, however, as it lacked enough Nitrogen in the soil to grow to its potential.
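For reference, NDVI is computed from the near-infrared and red bands as (NIR - Red) / (NIR + Red). The sketch below applies this standard definition to a few made-up reflectance values. The function name `ndvi` and the choice to return NaN where both bands are zero are assumptions for illustration.

```python
import numpy as np


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    # Assumption: return NaN instead of dividing by zero when both bands are 0.
    return np.divide(nir - red, denom, out=np.full_like(denom, np.nan), where=denom != 0)


# Made-up reflectance values for three pixels.
values = ndvi([0.5, 0.3, 0.0], [0.1, 0.3, 0.0])
print(values)
```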

Analyzing correlations

In the following table we can see that the correlations drastically increase if we measure them individually for each fertilization group. This gives us valuable insight about the importance of accounting for fertilization strategy, due to the high degree of noise this variation adds to the data.

If we account for fertilization groups, we get much higher correlations.
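The effect can be reproduced on synthetic data. In the sketch below, each hypothetical fertilization group has the same within-group relationship between a feature and soil Nitrogen, but a different baseline level. The group names, offsets and column names are made up for illustration and are not the thesis data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical groups: identical slope, different baseline soil N level.
baselines = {"control": 0.0, "medium": 2.0, "high": -2.0}
frames = []
for group, offset in baselines.items():
    feature = rng.uniform(0.0, 1.0, size=50)
    soil_n = feature + offset + rng.normal(scale=0.05, size=50)
    frames.append(pd.DataFrame({"group": group, "feature": feature, "soil_n": soil_n}))
df = pd.concat(frames, ignore_index=True)

# Correlation over all samples, ignoring the fertilization group.
pooled_r = df["feature"].corr(df["soil_n"])

# Correlation computed separately within each fertilization group.
per_group_r = {
    name: part["feature"].corr(part["soil_n"]) for name, part in df.groupby("group")
}

print(f"pooled r = {pooled_r:.2f}")
print({name: round(r, 2) for name, r in per_group_r.items()})
```

The between-group differences in baseline act as noise in the pooled correlation, even though the relationship within every group is strong.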

To show which features correlate best with soil Nitrogen, the following box plot displays each feature's absolute correlation with soil Nitrogen when fertilization groups are accounted for. This reveals that CCCI is a poorly performing outlier among the vegetation indices.

Machine learning modelling

Before doing any other machine learning modelling, we first test the methodology of Ijaz's thesis, where he used flat cross-validation instead of nested cross-validation. The concern is that this methodology will lead to artificially high prediction scores, as cross-validation is being used for multiple purposes (both for finding optimal model complexity and for testing that model complexity's predictive performance). Therefore, below we see tables of prediction scores for flat cross-validation and nested cross-validation.
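The difference between the two schemes can be sketched as follows. This is a simplified illustration on synthetic data with a hypothetical Ridge grid, not the thesis pipeline. In the flat variant the same folds both pick the hyperparameter and report the score, while in the nested variant the outer test folds are never seen by the inner search.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic data: 12 hypothetical plots with 4 observations each.
rng = np.random.default_rng(0)
n_groups, per_group = 12, 4
groups = np.repeat(np.arange(n_groups), per_group)
X = rng.normal(size=(n_groups * per_group, 6))
y = X[:, 0] + rng.normal(scale=1.0, size=n_groups * per_group)

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}  # hypothetical grid

# Flat CV: the best mean fold score is reported directly as the performance.
flat_search = GridSearchCV(Ridge(), param_grid, cv=GroupKFold(n_splits=4), scoring="r2")
flat_search.fit(X, y, groups=groups)
flat_score = float(flat_search.best_score_)

# Nested CV: tune on the inner folds, score on the held-out outer fold.
outer_scores = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    inner = GridSearchCV(Ridge(), param_grid, cv=GroupKFold(n_splits=3), scoring="r2")
    inner.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
    outer_scores.append(inner.score(X[test_idx], y[test_idx]))
nested_score = float(np.mean(outer_scores))

print(f"flat R^2 = {flat_score:.3f}, nested R^2 = {nested_score:.3f}")
```

On a single small synthetic dataset the gap between the two numbers can go either way. The optimistic bias of the flat score is a tendency that shows up on average.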

Mean model performance using flat vs nested cross-validation for soil total N:

Flat cross-validation

Model	Bands	Bands + Indices	Indices
ElasticNet	0.1592	0.2164	0.1687
GradientBoosting	0.1210	0.1773	0.1737
Lasso	0.1482	0.2050	0.1683
RandomForest	0.1133	0.2226	0.0187
Ridge	0.1572	0.2173	0.1675

Nested cross-validation

Model	Bands	Bands + Indices	Indices
ElasticNet	0.0579	0.1572	0.1179
GradientBoosting	0.0438	0.0843	-0.0968
Lasso	0.0839	0.1565	0.1418
RandomForest	0.0914	0.2188	-0.0491
Ridge	0.1324	0.2173	0.1622

By taking the average score of the "Bands + Indices" column in each table, we see the confirmation of our suspicion. The flat cross-validation indeed results in a 24.5% falsely higher score. This validates our choice of using nested cross-validation for the rest of this thesis.
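The percentage can be reproduced from the "Bands + Indices" columns of the two tables above. The short calculation below uses only those reported values.

```python
# "Bands + Indices" scores copied from the flat and nested tables.
flat = {
    "ElasticNet": 0.2164,
    "GradientBoosting": 0.1773,
    "Lasso": 0.2050,
    "RandomForest": 0.2226,
    "Ridge": 0.2173,
}
nested = {
    "ElasticNet": 0.1572,
    "GradientBoosting": 0.0843,
    "Lasso": 0.1565,
    "RandomForest": 0.2188,
    "Ridge": 0.2173,
}

flat_mean = sum(flat.values()) / len(flat)
nested_mean = sum(nested.values()) / len(nested)

# Relative inflation of the flat score over the nested score, in percent.
inflation_pct = (flat_mean / nested_mean - 1.0) * 100.0

print(f"flat mean = {flat_mean:.4f}, nested mean = {nested_mean:.4f}")
print(f"inflation = {inflation_pct:.1f}%")
```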

In the following figures, we see the effect of restricting the training data to only portions of the time series. Here we see that it has opposite effects on the estimation of nitrogen vs protein.

Soil Nitrogen

Grain Protein
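Restricting a time series to a portion of the season amounts to filtering on the acquisition date. The pandas sketch below is an assumption-laden illustration: the column names, the dates and the heading date are hypothetical, and the cutoff of one week before heading is used only as an example.

```python
import pandas as pd

# Hypothetical long-format table: one row per plot and acquisition date.
df = pd.DataFrame({
    "plot_id": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2024-06-01", "2024-06-15", "2024-07-01"] * 2),
    "ndvi": [0.3, 0.6, 0.8, 0.2, 0.5, 0.7],
})

heading_date = pd.Timestamp("2024-06-25")  # hypothetical heading date
cutoff = heading_date - pd.Timedelta(weeks=1)

# Keep only observations captured up until the cutoff.
restricted = df[df["date"] <= cutoff]
print(restricted)
```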

In the following figures, we cumulatively add more and more feature types to the ML model's data to see which increases performance. Again, we see an interesting contrast between estimation of nitrogen and protein.

Soil Nitrogen

Grain Protein

Figure: Box plot, protein, cumulating feature types [revision]
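Cumulatively adding feature types means that each experiment uses its own feature type plus all previous ones. A minimal sketch of building such cumulative column sets is shown below. The feature type order follows the summary tables in this README, while the column names inside each type are hypothetical placeholders.

```python
from itertools import accumulate

# Feature types in the order they are added; column names are hypothetical.
feature_types = {
    "Bands": ["red_median", "nir_median"],
    "Indices": ["ndvi"],
    "TS": ["ndvi_slope"],
    "Weather": ["rainfall"],
    "Growth Time": ["growth_time"],
}

# Each step concatenates the new columns onto all previously added ones.
cumulative = accumulate(feature_types.values(), lambda acc, cols: acc + cols)
feature_sets = dict(zip(feature_types, cumulative))

for label, columns in feature_sets.items():
    print(f"{label}: {columns}")
```

Each entry of `feature_sets` can then be used to select the columns for one model run, so the score change between consecutive entries is attributable to the newly added feature type.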

Breaking it down by the individual models, below, we see that the "growth time" column and the time series statistics give predictive benefits in separate areas. The former is the most beneficial and only necessity for estimating soil Nitrogen, where the models Lasso, Random Forest and Gradient Boosting benefitted more from the "growth time" column than from the time series statistics, while seeing no additional benefit from the time series statistics. The opposite is true for grain protein content, where all models benefitted from the time series statistics, with no close-to-significant benefit from the "growth time" column. We also see that Lasso struggles a lot at predicting grain protein content when we add the time series statistics, due to it not handling noise very well.

Soil Nitrogen

Grain Protein

Figure: Box plot, protein, growth time vs time series statistics by model [revision]

Below we have summary tables of mean model performance when cumulatively adding feature types. Each column contains the feature type of all the previous columns.

Soil Nitrogen

Model	Bands	+ Indices	+ TS	+ Weather	+ Growth Time
ElasticNet	0.0579	0.1572	0.3211	0.2081	0.3024
GradientBoosting	0.0438	0.0843	0.2062	0.2625	0.3203
Lasso	0.0839	0.1565	0.1849	0.1645	0.2699
RandomForest	0.0914	0.2188	0.3070	0.4906	0.5478
Ridge	0.1324	0.2173	0.3906	0.4159	0.4728

Grain Protein

Model	Bands	+ Indices	+ TS	+ Weather	+ Growth Time
ElasticNet	0.0125	-0.0421	0.3915	0.3917	0.3917
GradientBoosting	0.1370	0.0599	0.6135	0.5719	0.6132
Lasso	0.0060	-0.0094	-0.2513	-0.2513	-0.2514
RandomForest	0.2444	0.1960	0.6011	0.6217	0.6261
Ridge	0.0328	0.0445	0.1688	0.1689	0.1687

The non-linear models outshine the linear models on estimating protein content.

Conclusions

After having accomplished all three tasks of the research question and fulfilled the thesis title, this thesis has arrived at the following conclusions.

Nitrogen uptake: The crops' Nitrogen uptake curve matches expectations, but the control group fails to grow up to its potential, due to limited Nitrogen supply in the soil.

Correlation analysis:

Correlations between spectral features and soil Nitrogen are heavily negatively affected by including multiple fertilization strategies without accounting for them. This is due to the high level of noise they add.
The standard deviations of Rededge and NIR were the best correlators with soil Nitrogen, among all spectral bands' stds and medians. This is despite Rededge’s median performing worst of all features. This highlights the positive potential of including data about the image patterns, in the prediction of soil Nitrogen.
CCCI is a poorly performing outlier among the vegetation indices, for correlating with soil Nitrogen.

ML modelling:

For soil Nitrogen, every single feature type tested contributes to increasing the predictive performance. For grain protein content, on the other hand, only time series statistics had any effect beyond what the spectral indices achieved.
On the highest achieving models, the "growth time" column and the time series statistics give predictive benefits in separate areas. The former is the most beneficial and only necessity for estimating soil Nitrogen, where the models Lasso, Random Forest and Gradient Boosting benefitted more from the "growth time" column than from the time series statistics, while seeing no additional benefit from the time series statistics. The opposite is true for grain protein content, where all models benefitted from the time series statistics, with no close-to-significant benefit from the "growth time" column.
Weather data increases the variance of the R^2 for soil Nitrogen. This is because the linear models were able to utilize the weather data to a far higher degree than the nonlinear models.
While the predictive performance on soil Nitrogen increases with the span of the time series it is trained on, the opposite seems to be true for grain protein content. Here, the predictive performance was highest when the models were only trained on data up until one week before heading.
The non-linear models outshine the linear models on estimating protein content. This is due to their unmatched ability to utilize the information in the time series statistics.
RandomForestRegressor was the best performer, both for estimating soil Nitrogen and grain protein content. In this study, it was able to achieve R^2 = 0.5478 for estimating soil Nitrogen and R^2 = 0.6261 for estimating grain protein content.
