Comparative Evaluation of Machine Learning Models for Predicting Cimbex Quadrimaculata Populatio

03 Apr 2026 | 4 min read | Journal Feed

GIST

by Yunus Güral

The high variability and nonlinear relationships among environmental variables (such as temperature, relative humidity, and altitude) in ecological datasets prevent classical statistical models from producing accurate predictions. This study aimed to compare the performance of AI-based machine learning methods in analyzing complex ecological data structures.

An agricultural dataset containing meteorological and vegetation variables was used as the representative case study. This dataset is based on population observations of Cimbex quadrimaculata in Diyarbakır (Eğil) and Elazığ (Keban) provinces in Türkiye between 2020 and 2022.

Three different modeling approaches (binary classification, multiclass classification, and regression) were applied to the same data. This three-approach design enabled a systematic comparison of model performance, generalizability, and explainability on the same dataset using different definitions of the target variable.
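As a minimal sketch of how one observed count could feed all three formulations, consider the Python below. The cutoff, the class edges, and the sample counts are hypothetical; the source does not report how the targets were actually defined.

```python
from bisect import bisect_left

def binary_target(count, presence_cutoff=0):
    """1 if the count exceeds the cutoff, else 0 (hypothetical rule)."""
    return 1 if count > presence_cutoff else 0

def multiclass_target(count, class_edges=(0, 10)):
    """Ordinal class index from hypothetical bin edges.

    With edges (0, 10): count <= 0 -> 0, 1..10 -> 1, > 10 -> 2.
    """
    return bisect_left(class_edges, count)

def regression_target(count):
    """The raw count, kept as a continuous value."""
    return float(count)

# Hypothetical counts, not values from the study.
counts = [0, 3, 10, 25]
targets = [
    (binary_target(c), multiclass_target(c), regression_target(c))
    for c in counts
]
```

Under these assumed rules the same four observations yield a presence/absence label, a three-level density class, and a numeric density, which is what allows the three model families to be compared on identical rows.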

For classification tasks, model performance was evaluated using accuracy, F1 score, and AUC under a stratified 10-fold cross-validation scheme. Regression models, on the other hand, were assessed within a nested cross-validation framework using R², root mean square error (RMSE), and mean absolute error (MAE).
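For reference, the six metrics named above can be written out in a few lines of plain Python. This is a sketch of the standard textbook definitions for the binary case, not the study's evaluation code, and the small input lists are made up for illustration.

```python
import math

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_binary(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def auc(y_true, scores):
    """Share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels and scores for illustration only.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]

# Made-up regression targets and predictions.
y = [2.0, 4.0, 6.0, 8.0]
y_hat = [3.0, 4.0, 5.0, 8.0]
```

Accuracy and F1 depend on the 0.5 threshold chosen here, whereas AUC is computed from the raw scores and is threshold-free, which is one reason the three are usually reported together.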

Clinical Editorial

Study Scope and Design

Objective framed as a comparative evaluation of AI-driven machine learning approaches for predicting population density of Cimbex quadrimaculata using ecological data with environmental predictors.
Data source comprises meteorological and vegetation variables collected for C. quadrimaculata observations in Diyarbakır (Eğil) and Elazığ (Keban), Türkiye, spanning 2020–2022.
A three-pronged modeling design was employed on the same dataset: binary classification, multiclass classification, and regression, enabling cross-formulation comparison of performance, generalizability, and explainability.

Data, targets, and modeling framework

The study defines distinct target formulations corresponding to the three modeling tasks, enabling assessment under stratified cross-validation and nested cross-validation schemes.
For classification tasks, model performance was assessed with accuracy, F1 score, and AUC under a stratified 10-fold cross-validation approach.
For regression tasks, evaluation utilized a nested cross-validation framework with metrics including R-squared, RMSE, and MAE.
Ensemble-based boosting algorithms were central to the assessment, including Gradient Boosting, XGBoost, and LightGBM, chosen for their capacity to model nonlinearities and interactions in complex ecological data.
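The two validation schemes described above can be sketched in plain Python. Everything here is an assumption made for illustration: the data are synthetic, the hyperparameter grid and fold counts are arbitrary, and a one-dimensional k-nearest-neighbour regressor stands in for the boosting models so that the example needs no third-party library.

```python
import random

def stratified_folds(labels, k, seed=0):
    """Deal indices into k folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    folds, pos = [[] for _ in range(k)], 0
    for members in by_class.values():
        rng.shuffle(members)
        for i in members:
            folds[pos % k].append(i)
            pos += 1
    return folds

def plain_folds(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def split(data, held):
    held_set = set(held)
    train = [row for i, row in enumerate(data) if i not in held_set]
    return train, [data[i] for i in held]

def knn_mse(train, test, k):
    """Mean squared error of a 1-D k-NN regressor (stand-in model)."""
    total = 0.0
    for x, y in test:
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        total += (sum(t for _, t in nearest) / len(nearest) - y) ** 2
    return total / len(test)

def nested_cv(data, grid=(1, 3, 5), outer=5, inner=3, seed=0):
    scores, chosen = [], []
    for f, held in enumerate(plain_folds(len(data), outer, seed)):
        train, test = split(data, held)

        def inner_error(k):
            # Inner loop: tune k using the outer-training rows only.
            folds = plain_folds(len(train), inner, seed + 1 + f)
            return sum(knn_mse(*split(train, h), k) for h in folds) / inner

        best_k = min(grid, key=inner_error)
        chosen.append(best_k)
        # Outer loop: score the tuned model on rows it never saw.
        scores.append(knn_mse(train, test, best_k))
    return scores, chosen

# Synthetic data: 30 negatives and 20 positives for the stratified split,
# and a noisy quadratic for the nested regression loop.
labels = [0] * 30 + [1] * 20
folds = stratified_folds(labels, 10)
rng = random.Random(42)
data = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 0.5)) for x in range(60)]
scores, chosen = nested_cv(data)
```

The point of the nesting is that the hyperparameter is selected without ever touching the outer test fold, so the outer scores are not inflated by the tuning step. With the assumed 30/20 class split, each of the ten stratified folds receives three negatives and two positives.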

Key findings on predictive performance

In binary classification, Gradient Boosting achieved high accuracy and strong AUC, supporting its effectiveness in this formulation.
XGBoost demonstrated strong overall predictive capability across the study, particularly in broader analyses.
In regression contexts, LightGBM and Random Forest showed notable performance with cross-validated R-squared around 0.73, indicating substantial explained variance.
SHAP analyses were employed to enhance model interpretability, consistently identifying temperature- and humidity-related variables as influential predictors across model types.
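SHAP attributions are based on Shapley values, and the idea can be illustrated by computing them exactly for a tiny model. The sketch below is a brute-force calculation over all feature coalitions, not the `shap` package and not the study's models; `toy_model`, its coefficients, and the feature values are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    names = list(x)
    n = len(names)

    def value(coalition):
        return model({f: (x[f] if f in coalition else baseline[f]) for f in names})

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_f = value(set(subset) | {f})
                without_f = value(set(subset))
                total += weight * (with_f - without_f)
        phi[f] = total
    return phi

def toy_model(row):
    # Hypothetical response with a temperature-humidity interaction;
    # altitude is deliberately unused.
    return (2.0 * row["temperature"]
            + 0.5 * row["humidity"]
            + 0.1 * row["temperature"] * row["humidity"])

# Hypothetical observation and reference point.
x = {"temperature": 30.0, "humidity": 40.0, "altitude": 900.0}
baseline = {"temperature": 20.0, "humidity": 60.0, "altitude": 800.0}
phi = shapley_values(toy_model, x, baseline)
```

The attributions sum to the difference between the prediction at `x` and at `baseline`, and a feature the model ignores (altitude here) receives zero. Enumerating coalitions grows exponentially with the number of features, which is why tree-specific and sampling-based approximations are used in practice.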

Contextual interpretation and methodological implications

The results underscore the capacity of ensemble-based learning methods to capture multidimensional, nonlinear, and interacting effects present in ecological datasets.
The study emphasizes generalizability and interpretability as critical considerations in model selection when handling environmental datasets with nonlinear relationships.

Limitations and reporting qualifiers

The provided content does not include explicit numeric confidence intervals beyond the listed metrics, and the source details do not elaborate on potential biases or data limitations.
It is not reported whether external validation beyond the internal cross-validation schemes was conducted.
Where such specifics are not reported, the findings are presented as stated without extrapolation.