Abstract
Large language models (LLMs) hold great promise for automating software vulnerability detection and repair, but ensuring the correctness of their outputs remains a challenge. While recent work has developed benchmarks for evaluating LLMs on bug detection and repair, existing studies rely on hand-crafted datasets that quickly become outdated. Moreover, a systematic evaluation of advanced reasoning-based LLMs that use chain-of-thought prompting for software security is still lacking. We introduce SecureMind, an open-source framework for evaluating LLMs on vulnerability detection and repair, with a focus on memory-related vulnerabilities. SecureMind provides a user-friendly Python interface for defining test plans and automates data retrieval, data preparation, and benchmarking across a wide range of metrics. Using SecureMind, we assess 10 representative LLMs, including 7 state-of-the-art reasoning models, on 16K test samples spanning 8 Common Weakness Enumeration (CWE) types related to memory safety violations. Our findings highlight the strengths and limitations of current LLMs in handling memory-related vulnerabilities. We hope that SecureMind and the insights it provides contribute to advancing LLM-based approaches to improving software security.
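For illustration only: the abstract mentions a Python interface for defining test plans but does not specify its API. The following is a minimal, self-contained sketch of what such a declarative test-plan definition could look like; all names here (TestPlan, run_plan, the example model identifiers, and the listed CWEs) are placeholder assumptions, not SecureMind's actual interface.

```python
# Hypothetical sketch of a declarative test-plan definition; the names below
# (TestPlan, run_plan, example models/CWEs) are assumptions, not SecureMind's API.
from dataclasses import dataclass, field
from typing import List


@dataclass
class TestPlan:
    """Declarative description of one evaluation run."""
    models: List[str]                      # LLMs to benchmark
    cwe_types: List[str]                   # memory-safety CWEs to cover
    task: str = "detection"                # "detection" or "repair"
    metrics: List[str] = field(default_factory=lambda: ["accuracy", "f1"])
    sample_limit: int = 16_000             # cap on retrieved test samples


def run_plan(plan: TestPlan) -> None:
    # A real framework would retrieve and prepare data, query each model,
    # and compute the requested metrics; this stub only echoes the plan.
    for model in plan.models:
        print(f"Evaluating {model} on {len(plan.cwe_types)} CWE types "
              f"({plan.task}, metrics={plan.metrics})")


if __name__ == "__main__":
    plan = TestPlan(
        models=["model-a", "model-b"],
        cwe_types=["CWE-121", "CWE-122", "CWE-416", "CWE-476"],
        task="repair",
    )
    run_plan(plan)
```

The design choice illustrated here is that a test plan is plain data: the user declares what to evaluate, and the framework handles data retrieval, preparation, and metric computation behind a single entry point.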