Big Data/Machine Learning/AI
Evaluation of the Oversampling Method for Balancing Pediatric Chest Radiograph Dataset in Classification Using Convolutional Neural Network VGG-16
Tiago Almeida de Oliveira*, Giullber Valentim da Silva, Roberta Moreira Wichmann, Crysttian Arantes Paixão, Luciana de Queiroz Leal Gomes
X-ray images are widely used in medical diagnosis, particularly for detecting pneumonia. This study uses a Convolutional Neural Network (CNN), specifically the VGG-16 architecture, to classify pediatric chest radiographs according to the presence of pneumonia. Class imbalance in the data is addressed by oversampling with data augmentation, and transfer learning with pre-trained CNN weights is used to improve training efficiency and performance. The dataset comprises pediatric chest radiographs for pneumonia diagnosis, categorized into two classes: Pneumonia (4273 images) and Normal (1583 images). Prior to analysis, data augmentation techniques were applied, including random cropping, resizing, and adjustments in brightness and contrast. This process involved randomly selecting 845 images from the Normal class, equivalent to approximately 62.63% of the original images, to generate an additional 2526 images. Models trained on the imbalanced and balanced datasets were then compared. On the imbalanced dataset, accuracy was 93.26%, with precision values of 0.94 for Pneumonia and 0.83 for Normal. After balancing with data augmentation, accuracy improved to 94.23%, with recall rising to 0.92 for Normal and 0.96 for Pneumonia. Evaluation was based on the confusion matrix, from which precision, recall, F1-score, and accuracy were computed, complemented by graphical methods such as the ROC curve and the precision-recall curve. The computational analysis was carried out in Python 3.10.9, using TensorFlow for transfer learning and scikit-learn for metric calculations. The model trained on the balanced dataset showed a slight improvement, with an ROC AUC of 0.94 compared with 0.92 for the model trained on the imbalanced dataset. The study shows that VGG-16 with transfer learning can achieve strong classification results with minimal training.
Balancing the classes through data augmentation modestly improved model performance, underscoring the potential of deep learning applied to medical imaging for accurate pediatric pneumonia detection and for broader healthcare diagnostics.
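
To make the reported metrics concrete, the following minimal sketch shows how accuracy and per-class precision, recall, and F1-score follow from the four cells of a binary confusion matrix. It is written in plain Python for illustration and is not the scikit-learn code used in the study; the function name `binary_metrics` and the example counts are hypothetical and do not correspond to the study's results.

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive accuracy and per-class precision/recall/F1 from confusion-matrix counts.

    Pneumonia is treated as the positive class:
      tp = Pneumonia predicted as Pneumonia, fp = Normal predicted as Pneumonia,
      fn = Pneumonia predicted as Normal,    tn = Normal predicted as Normal.
    """
    def prf(hits, false_alarms, misses):
        # Guard against empty denominators so degenerate matrices do not crash.
        precision = hits / (hits + false_alarms) if hits + false_alarms else 0.0
        recall = hits / (hits + misses) if hits + misses else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return {"precision": precision, "recall": recall, "f1": f1}

    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "Pneumonia": prf(tp, fp, fn),
        # For the Normal class the roles swap: tn are its hits,
        # fn are its false alarms, and fp are its misses.
        "Normal": prf(tn, fn, fp),
    }


# Hypothetical counts chosen only to illustrate the calculation.
report = binary_metrics(tp=90, fp=10, fn=5, tn=45)
for name in ("Pneumonia", "Normal"):
    scores = report[name]
    print(name, {key: round(value, 3) for key, value in scores.items()})
print("accuracy", round(report["accuracy"], 3))
```

Reporting the metrics per class, rather than accuracy alone, matters for an imbalanced dataset: a model can reach high accuracy by favoring the majority class while precision or recall for the minority class remains low.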