Palmisani J., Pizzillo V., Di Gilio A., Tosoni E., Facchini L., Perbellini L., de Gennaro G., Santo A.
Department of Biosciences, Biotechnologies and Environment, University of Bari Aldo Moro, via Orabona 4, 70125 Bari, Italy
Lung Unit, P. Pederzoli Hospital, Via Monte Baldo, 24, 37019 Peschiera del Garda (VR), Italy
Predict srl, Viale Adriatico, c/o Fiera del Levante Pad. 105, 70132 Bari, Italy
University of Verona, Via dell’Artigliere 8, 37129 Verona, Italy
Introduction
Lung cancer (LC) is one of the leading causes of cancer-related death worldwide. The disease is generally diagnosed at an advanced stage, so surgical treatment is successful in less than 20% of cases. Alternative therapeutic options are chemotherapy, immunotherapy or molecular targeted drugs, sometimes complemented by radiotherapy. The wide-spectrum chemical characterization of human breath allows the identification of volatile biomarkers related to several diseases and appears to be a strategic approach for clinical diagnosis that could be integrated with conventional diagnostic techniques [1].
Study objective: The main objective of the study is to identify a lung cancer-related pattern of volatile organic compounds (VOCs) in human breath and to develop an advanced classification model for the early diagnosis of the disease and for the implementation of screening programs addressed to subjects at risk on the basis of genetic and predisposition factors. To obtain a high-performance and robust classification model, a Random Forest-based machine learning approach was applied to the collected dataset.
Experimental Methods
A prospective observational study was carried out by the research group of the Environmental Sustainability Laboratory of the Department of Biosciences, Biotechnologies and Environment of the University of Bari, in collaboration with the medical team of the ‘P. Pederzoli’ Hospital in Verona (Italy) and the University of Verona, after approval by the Ethics Reference Committee (study protocol no. 45355). The research was conducted in accordance with the principles embodied in the Declaration of Helsinki and with local statutory requirements. A total of 130 volunteers were enrolled at the Lung Unit of the ‘P. Pederzoli’ Hospital according to well-determined inclusion and exclusion criteria: 65 patients affected by LC (mean age 67 years) and 65 healthy controls (HCs) (mean age 62 years). The applied methodology is based on end-tidal breath sampling directly onto two-bed adsorbent cartridges (Biomonitoring steel tubes, Markes International) by means of the automated sampler Mistral (patented device, Predict srl). Ambient air (AA) samples were collected simultaneously at each sampling session. The collected samples were thermally desorbed (Unity-2, Markes International) and analyzed by Gas Chromatography/Mass Spectrometry (GC Agilent 7890/MS Agilent 5975) at the University of Bari, resulting in a dataset based on the abundances of the identified compounds. The experimental dataset was preliminarily processed with non-parametric tests, e.g., the Wilcoxon signed-rank test and the Kruskal-Wallis test (R software version 3.5.1), in order to discriminate between endogenous and exogenous VOCs and to highlight statistically significant differences in VOC composition between LC and HC breath samples (p-values ≤ 0.05). Feature selection and ranking were performed according to Random Forest importance measures, e.g., Mean Decrease Accuracy (MDA) and Mean Decrease Gini (MDG).
The Boruta algorithm (a Random Forest wrapper) was additionally applied to identify the features relevant to the discrimination between LC and HC breath samples and to minimize overfitting. The Random Forest model based on the top-ten ranked features was then developed with optimized values of the mtry (number of variables considered at each split) and ntree (number of trees in the forest) parameters and subsequently trained within a 5-fold cross-validation framework to ensure robustness in the evaluation of model performance [2]. A receiver operating characteristic (ROC) analysis was carried out, and the area under the ROC curve (AUC) was computed as a comprehensive index of the model's classification performance.
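The validation scheme described above can be illustrated with a minimal sketch in plain Python. It is not the authors' R pipeline: the helper names `stratified_folds` and `roc_auc` are hypothetical, the fold assignment is assumed to be stratified by class, and the AUC is computed through its rank interpretation (the probability that a randomly chosen LC sample receives a higher score than a randomly chosen HC sample, with ties counted as one half).

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds while preserving the class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute the AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Assumed label coding: 1 = LC, 0 = HC, mirroring the 65/65 study design.
labels = [1] * 65 + [0] * 65
folds = stratified_folds(labels, k=5)

# Toy scores (not study data) to show the AUC computation.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

With 65 samples per class and five folds, each held-out fold in this sketch contains 13 LC and 13 HC samples, so every fold is evaluated on a balanced subset.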
Results and Discussions
The developed Random Forest-based classification model, validated with 5-fold cross-validation, provided promising results, with a classification accuracy of 80%, a specificity of 84% and an AUC of 0.84, indicating a well-balanced, predictive and reliable model. A pattern of VOCs consisting of β-pinene, limonene, isoprene, heptanal, 2-butoxyethanol, pentadecane, pentane and tetradecane was found to be highly effective in discriminating between LC patients and HCs and was identified as potentially diagnostic for lung cancer. The obtained outcomes are in line with previous observational studies on LC patients. Potential metabolic pathways for the identified VOCs, e.g., promoted lipid peroxidation, can be hypothesized.
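For reference, the performance indices quoted above follow from the confusion matrix of the classifier. The sketch below uses toy counts, not the study's results, and assumes LC is treated as the positive class; the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # fraction of LC samples correctly flagged
        "specificity": tn / (tn + fp),  # fraction of HC samples correctly cleared
    }


# Toy counts for a balanced set of 10 positives and 10 negatives (illustrative only).
metrics = classification_metrics(tp=7, fp=2, tn=8, fn=3)
```

When the two classes have the same size, as in this toy example, accuracy equals the mean of sensitivity and specificity.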
Conclusions
This research highlights the potential of machine learning techniques, more specifically ensemble methods such as Random Forest, for developing a classification model that exhibits high efficacy in discriminating between LC patients and HCs.