PriMera Scientific Engineering (ISSN: 2834-2550)

Research Article

Volume 8 Issue 6

A Forensically Grounded Machine Learning Framework for Ransomware Family Classification via Behavioral Analysis

Bokolo Wanengimorte George*

May 31, 2026

Abstract

Ransomware remains a critical threat to modern information systems, driven in part by the growing prevalence of polymorphic and behaviorally evasive variants. This paper proposes a reproducible, forensically auditable machine learning pipeline for multi-class ransomware family classification based on dynamic behavioral features. Using the RanSAP-2022 dataset, which comprises ten ransomware families, the pipeline automates data ingestion, exploratory analysis, preprocessing, and model evaluation under deterministic controls. Exploratory data analysis reveals significant class imbalance, with the largest family accounting for 32.4% of samples and an imbalance ratio of 4.6:1 (Fig. 1), as well as 27% aggregate missingness concentrated in network-related features (Fig. 2). Feature correlation analysis (Fig. 3) identifies moderate behavioral coupling among file system, registry, and API activity features without severe multicollinearity. Six machine learning models are compared, with the macro-averaged F1-score as the primary metric. Boosting-based ensemble models perform best, with XGBoost attaining 96.1% accuracy and a macro-F1 score of 0.948 (Table 7). Confusion matrix analysis (Fig. 4) shows high precision and recall for majority families but reduced recall for minority families, which is attributed to behavioral overlap. These findings indicate that behavior-based ensemble learning, when embedded in a reproducible and leakage-resistant pipeline, provides both high predictive performance and forensic reliability for ransomware family classification.

Keywords: Ransomware classification; behavioral malware analysis; machine learning; reproducible pipelines; digital forensics
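
The abstract reports an imbalance ratio and uses macro-averaged F1 rather than accuracy as the primary metric. The following is a minimal sketch of why that choice matters under class imbalance. It is not the paper's implementation: the function names and the toy labels are hypothetical, and the imbalance ratio is assumed to follow the common definition of largest class count divided by smallest class count.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); scored as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: family "A" is the majority, family "B" the minority.
y_true = ["A"] * 8 + ["B"] * 2
# A degenerate classifier that always predicts the majority family.
y_pred = ["A"] * 10

print(imbalance_ratio(y_true))             # 4.0
print(accuracy(y_true, y_pred))            # 0.8
print(round(macro_f1(y_true, y_pred), 3))  # 0.444
```

In this toy case the classifier never identifies the minority family, yet accuracy still reads 0.8. Macro-F1 drops to about 0.444 because each family contributes equally to the average, so reduced recall on minority families stays visible in the headline metric.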
