This project was completed for CS6140 - Machine Learning during my Master's program at Northeastern University. I earned a 4.0 GPA in this course.
I collaborated with my classmates Aditya Shanmugham and Gugan Kathiresan on this project.
Abstract
Toxic comment classification is crucial for maintaining a peaceful online environment. This study compares the performance of machine learning algorithms and transformer architectures, specifically BERT, for multilingual toxic comment classification. The analysis focuses on contextual performance and finds that transformers offer superior classification compared to traditional machine learning models. Transformers, with their ability to handle multilingual data, prove to be a viable solution for real-world applications.
Introduction
The rise of social media platforms has led to an exponential increase in online comments. However, not all comments are constructive, and some contain toxic language. Detecting and classifying toxic comments is challenging but essential for promoting healthy online conversations. This project aims to develop an NLP model for toxic comment detection and to compare the performance of machine learning algorithms and transformers, with the goal of identifying the method that delivers contextually relevant, high-quality toxic comment classification.
Problem Statement
Previous research has focused on developing models using transfer learning techniques, but there is a need to extend these models to handle multilingual datasets. This research conducts a comprehensive study on various supervised learning algorithms to understand the complexity of multilingual datasets and determine the best model for toxic comment classification.
Methodology
A multilingual toxic comment dataset from Kaggle was used for this study. The comments were preprocessed to produce binary toxicity labels, and the text was normalized. The dataset was split into 70% for training and 30% for testing.
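A minimal preprocessing sketch is shown below. It assumes the Kaggle data is a CSV with a `comment_text` column and a `toxic` annotation column (hypothetical names, not necessarily the project's exact schema); it binarizes the label, applies light text normalization, and performs the 70/30 split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names for the Kaggle multilingual toxic comment data.
df = pd.read_csv("train.csv")

# Collapse the toxicity annotation into a binary label (toxic vs. non-toxic).
df["label"] = (df["toxic"] >= 0.5).astype(int)

# Light text normalization: lowercase and collapse whitespace.
df["comment_text"] = (
    df["comment_text"].str.lower().str.replace(r"\s+", " ", regex=True).str.strip()
)

# 70/30 train/test split, stratified to preserve the class ratio.
train_df, test_df = train_test_split(
    df, test_size=0.30, stratify=df["label"], random_state=42
)
```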
Theory
The proposed models, whether classical machine learning models or transformers, handle the feature extraction process. The dataset used is well annotated and preprocessed, making it suitable for transformers. It was also analyzed for class imbalances and biases.
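As a rough illustration of the imbalance check, the snippet below inspects the class distribution of the binary labels produced by the preprocessing sketch above and derives balancing weights that could be passed to downstream models; column names and the `train_df` frame are assumptions carried over from that sketch, not the project's exact code.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class distribution of the binary labels (column name assumed above).
print(train_df["label"].value_counts(normalize=True))

# Balanced class weights, usable if the toxic class turns out to be underrepresented.
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=train_df["label"]
)
print(dict(zip([0, 1], weights)))
```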
The three approaches compared in this project are:
Approach 1: Machine Learning Models (KNN, Random Forest, XGBoost)
Approach 2: Transformers with a feed-forward architecture
Approach 3: Stacked transformers
Approach 1: Machine Learning Models
The data was split into training and validation subsets using an 80/20 split. KNN, Random Forest, and XGBoost classifiers were trained on the training subset and evaluated on the validation subset using accuracy, precision, recall, and F1 score.
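A hedged sketch of this pipeline is given below. It assumes TF-IDF features (the write-up does not fix a specific vectorizer), reuses the `train_df` frame from the earlier preprocessing sketch, and uses illustrative hyperparameters rather than the project's tuned values.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

# 80/20 split of the training data into train/validation subsets.
X_train, X_val, y_train, y_val = train_test_split(
    train_df["comment_text"], train_df["label"],
    test_size=0.20, stratify=train_df["label"], random_state=42,
)

# TF-IDF features are an assumption here, not the project's fixed choice.
vectorizer = TfidfVectorizer(max_features=50_000)
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "XGBoost": XGBClassifier(n_estimators=200, eval_metric="logloss"),
}

# Train each model and report the four metrics on the validation subset.
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    preds = model.predict(X_val_vec)
    prec, rec, f1, _ = precision_recall_fscore_support(y_val, preds, average="binary")
    print(f"{name}: acc={accuracy_score(y_val, preds):.3f} "
          f"prec={prec:.3f} rec={rec:.3f} f1={f1:.3f}")
```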
Approach 2: Transformers with Feed-Forward Architecture
Transformers, specifically BERT, were employed for toxic comment classification. The dataset was tokenized and fed into the BERT model. The output was passed through a feed-forward neural network for classification. The model was trained on the training dataset and evaluated on the validation dataset.
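The sketch below illustrates one way to wire this up with the Hugging Face `transformers` library, assuming a multilingual BERT checkpoint and an untuned feed-forward head on the [CLS] representation; it is a plausible reconstruction, not the exact architecture used in the project.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Multilingual BERT checkpoint; the exact variant used in the project may differ.
MODEL_NAME = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

class BertFeedForwardClassifier(nn.Module):
    """BERT encoder followed by a small feed-forward head for binary classification."""

    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(MODEL_NAME)
        self.head = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, 2),  # toxic vs. non-toxic logits
        )

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.head(cls)

# Tokenize a small batch of comments and run one forward pass.
batch = tokenizer(
    ["example comment", "another comment"],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)
model = BertFeedForwardClassifier()
with torch.no_grad():
    logits = model(batch["input_ids"], batch["attention_mask"])  # shape: (2, 2)
```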
Approach 3: Stacked Transformers
Multiple transformers, including BERT, were stacked to improve classification performance. The output of one transformer was passed as input to the next transformer. The stacked transformers were trained and evaluated on the training and validation datasets, respectively.
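One plausible realization of this idea, sketched below, feeds the sequence output of a pretrained multilingual BERT into additional `nn.TransformerEncoder` layers before classification, so that the output of one transformer becomes the input of the next; the layer count, head count, and checkpoint name are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class StackedTransformerClassifier(nn.Module):
    """BERT followed by extra Transformer encoder layers: the sequence output of the
    pretrained encoder is passed as input to the next transformer block."""

    def __init__(self, model_name: str = "bert-base-multilingual-cased",
                 extra_layers: int = 2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        d_model = self.bert.config.hidden_size
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, dim_feedforward=4 * d_model,
            dropout=0.1, batch_first=True,
        )
        self.stack = nn.TransformerEncoder(layer, num_layers=extra_layers)
        self.classifier = nn.Linear(d_model, 2)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Feed the first transformer's output into the stacked encoder layers;
        # padding positions are masked out.
        stacked = self.stack(hidden, src_key_padding_mask=(attention_mask == 0))
        return self.classifier(stacked[:, 0])  # classify from the [CLS] position

# Example forward pass with dummy token IDs (real inputs come from the tokenizer).
ids = torch.randint(0, 1000, (2, 16))
mask = torch.ones_like(ids)
model = StackedTransformerClassifier()
with torch.no_grad():
    logits = model(ids, mask)  # shape: (batch_size, 2)
```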
Conclusion
The comparative analysis of machine learning models and transformers for toxic comment classification shows that transformers, especially stacked transformers, deliver higher-quality classification than traditional machine learning algorithms. Their ability to handle multilingual data makes them suitable for real-world applications. Future work includes deploying the proposed model for real-time inference and extending the analysis to more advanced transformer architectures.
A detailed report of the comparison is provided in the accompanying project document.