SecureMind: A Framework for Benchmarking Large Language Models in Memory Bug Detection and Repair

Publication
2025 ACM SIGPLAN International Symposium on Memory Management (ISMM)

Abstract

Large language models (LLMs) hold great promise for automating software vulnerability detection and repair, but ensuring their correctness remains a challenge. While recent work has developed benchmarks for evaluating LLMs in bug detection and repair, existing studies rely on hand-crafted datasets that quickly become outdated. Moreover, a systematic evaluation of advanced reasoning-based LLMs that use chain-of-thought prompting for software security is lacking. We introduce SecureMind, an open-source framework for evaluating LLMs in vulnerability detection and repair, focusing on memory-related vulnerabilities. SecureMind provides a user-friendly Python interface for defining test plans, which automates data retrieval, preparation, and benchmarking across a wide range of metrics. Using SecureMind, we assess 10 representative LLMs, including 7 state-of-the-art reasoning models, on 16K test samples spanning 8 Common Weakness Enumeration (CWE) types related to memory safety violations. Our findings highlight the strengths and limitations of current LLMs in handling memory-related vulnerabilities. We hope that SecureMind and the insights it provides contribute to advancing LLM-based approaches for improving software security.
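Since the abstract describes a Python interface for defining test plans, the sketch below illustrates what such a plan might look like. It is a hypothetical illustration only: the module name (securemind), the TestPlan class, and all parameter and method names (models, cwes, tasks, metrics, run, summary) are assumptions, not the framework's actual API.

# Hypothetical sketch of a SecureMind test plan. All names below are
# illustrative assumptions; consult the framework for its real interface.
from securemind import TestPlan  # assumed import

plan = TestPlan(
    models=["gpt-4o", "deepseek-r1"],        # LLMs under evaluation
    cwes=["CWE-416", "CWE-787"],             # memory-safety CWE types, e.g. use-after-free, OOB write
    tasks=["detection", "repair"],           # benchmark both tasks
    metrics=["accuracy", "repair_success"],  # metrics to report
)

# Per the abstract, the framework automates data retrieval,
# preparation, and benchmarking once a plan is defined.
results = plan.run()
print(results.summary())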

Huanting Wang
PhD Student and Research Fellow
Zheng Wang
Professor of Intelligent Software Technology