PriMera Scientific Engineering (ISSN: 2834-2550)

Research Article

Volume 9 Issue 1

Comparative Study of Machine Learning and Text Vectorization Techniques for Spam Detection

Sai Teja Mantha*

June 30, 2026

Abstract

Spam detection remains a critical challenge in natural language processing (NLP) and cybersecurity: more than 50% of global email traffic consists of unwanted messages. This study presents a comparative analysis of machine learning algorithms and text vectorization techniques for spam classification, evaluating seven machine learning models across four feature engineering approaches on multiple datasets comprising over 15,000 messages. The experimental results show that XGBoost achieves the highest overall performance, with 94.4% accuracy and 95.4% precision, and that ensemble methods consistently outperform traditional approaches by 5-7%. In contrast, the choice of text vectorization technique has little effect on performance (less than 0.3% difference in accuracy), with Bag of Words (BoW) performing marginally best at 87.9% accuracy. These findings indicate that, for spam detection systems, the choice of algorithm matters more than the complexity of the features, and they provide evidence-based guidance for practical deployment in cybersecurity applications. The study contributes insights into the superiority of ensemble methods and establishes benchmarks for spam detection research.

Keywords: Spam Detection; Machine Learning; Ensemble Methods; XGBoost; Random Forest; Text Vectorization; Cybersecurity; Natural Language Processing
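
To make the vectorization comparison concrete, the following minimal sketch builds Bag of Words and TF-IDF representations of a few toy messages in plain Python. It is an illustration under stated assumptions, not the study's pipeline: the messages, the whitespace tokenizer, and all function names are hypothetical, and the smoothed IDF formula is one common variant that the abstract does not confirm was used.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; real pipelines usually clean more.
    return text.lower().split()


def build_vocabulary(docs):
    # Sorted so that the column order of every vector is deterministic.
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def bow_vector(doc, vocab):
    # Bag of Words: raw count of each vocabulary term in the document.
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]


def tfidf_vectors(docs, vocab):
    # TF-IDF: term count scaled by a smoothed inverse document frequency,
    # idf(t) = ln((1 + N) / (1 + df(t))) + 1 (one common variant, assumed here).
    n_docs = len(docs)
    tokenized = [tokenize(doc) for doc in docs]
    doc_freq = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vectors


# Hypothetical toy corpus: two spam-like and two ordinary messages.
messages = [
    "win a free prize now",
    "free entry win cash",
    "are we meeting for lunch",
    "lunch at noon now",
]

vocab = build_vocabulary(messages)
bow = [bow_vector(m, vocab) for m in messages]
tfidf = tfidf_vectors(messages, vocab)
```

Both representations share the same vocabulary and sparsity pattern; TF-IDF only reweights the counts so that terms appearing in fewer messages (here "prize") score higher than terms shared across messages (here "free"). Either matrix could then be passed to a classifier.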
