Found 189 repositories (showing 30)
OWASP-Benchmark
OWASP Benchmark is a test suite designed to verify the speed and accuracy of software vulnerability detection tools. A fully runnable web app written in Java, it supports analysis by Static (SAST), Dynamic (DAST), and Runtime (IAST) tools that support Java. The idea is that since it is fully runnable and all the vulnerabilities are actually exploitable, it’s a fair test for any kind of vulnerability detection tool. For more details on this project, please see the OWASP Benchmark Project home page.
bytedance
PatchEval: A New Benchmark for Evaluating LLMs on Patching Real-World Vulnerabilities
uiuc-kang-lab
CVE-Bench: A Benchmark for AI Agents’ Ability to Exploit Real-World Web Application Vulnerabilities
ossf-cve-benchmark
The OpenSSF CVE Benchmark consists of code and metadata for over 200 real-life CVEs, as well as tooling to analyze the vulnerable codebases using a variety of static analysis security testing (SAST) tools and to generate reports that evaluate those tools.
nccgroup
Whalescan is a vulnerability scanner for Windows containers; it performs several benchmark checks and also checks for CVEs and vulnerable packages on the container.
HTBridge
Created by High-Tech Bridge, the Purposefully Insecure and Vulnerable Android Application (PIVAA) replaces the outdated DIVA as a benchmark for mobile vulnerability scanners.
alt-research
Solidity/EVM smart contract security auditor — 104 vulnerability patterns, 8 tools, 100% CTF + EVMBench benchmark (120/120)
Hustcw
This is a benchmark for evaluating the vulnerability discovery ability of automated approaches, including Large Language Models (LLMs), deep learning methods, and static analyzers.
lucagioacchini
This repo contains the code for the penetration-testing benchmark for generative agents presented in the paper "AutoPenBench: Benchmarking Generative Agents for Penetration Testing". It also contains instructions to install, develop, and test new vulnerable containers to include in the benchmark.
timothee-chauvin
Future-proof vulnerability detection benchmark based on CVEs in open-source repos
Mobile-IoT-Security-Lab
The OWApp Benchmark: an OWASP-compliant Vulnerable Android App Dataset
yubol-bobo
This repo investigates LLMs' tendency to exhibit acquiescence bias in sequential QA interactions. Includes evaluation methods, datasets, benchmarks, and experiment code to assess and mitigate vulnerabilities in conversational consistency and robustness, offering a reproducible framework for future research.
Sweetaroo
A novel benchmark evaluating the deep vulnerability-detection capability of large language models
Troublor
Smart contract front-running vulnerability benchmark
CASTLE-Benchmark
The CASTLE Benchmark is a modern micro-benchmarking suite for testing static analyzers and LLMs on vulnerability detection.
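To make the scoring concrete, here is a minimal Python sketch of how a detection micro-benchmark of this kind is typically scored, assuming hypothetical labeled cases and per-tool verdicts (the names, fields, and cases are invented for illustration and are not taken from CASTLE):

    from dataclasses import dataclass

    @dataclass
    class Case:
        name: str
        vulnerable: bool  # ground-truth label for the micro-benchmark case
        flagged: bool     # whether the tool under test reported it

    def score(cases: list[Case]) -> dict[str, float]:
        tp = sum(c.vulnerable and c.flagged for c in cases)
        fp = sum(not c.vulnerable and c.flagged for c in cases)
        fn = sum(c.vulnerable and not c.flagged for c in cases)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return {"precision": precision, "recall": recall}

    cases = [
        Case("sqli_01", vulnerable=True, flagged=True),
        Case("safe_query_01", vulnerable=False, flagged=True),  # false positive
        Case("xss_01", vulnerable=True, flagged=False),         # missed finding
    ]
    print(score(cases))  # {'precision': 0.5, 'recall': 0.5}

Precision penalizes false positives and recall penalizes misses, which is why micro-benchmarks of this sort usually report both rather than a single accuracy number.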
a101e-lab
IoTVulBench is an open-source benchmark dataset for IoT security research, containing firmware-related vulnerabilities and the corresponding toolkits for building firmware emulations and verifying vulnerabilities.
OWASP-Benchmark
The OWASP Benchmark for Python is a test suite designed to verify the accuracy of Python software vulnerability detection tools. A fully runnable web app written in Python, it supports analysis by Static (SAST), Dynamic (DAST), and Runtime (IAST) tools that support Python. For more details, see the OWASP Benchmark Project home page.
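To make "fully runnable and actually exploitable" concrete, here is a minimal sketch of the kind of deliberately vulnerable endpoint such suites contain. It is invented for illustration using Flask and sqlite3, is not taken from the Benchmark itself, and assumes a local users.db with a users table:

    import sqlite3
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/user")
    def lookup_user():
        name = request.args.get("name", "")
        conn = sqlite3.connect("users.db")
        # Deliberately vulnerable: user input is concatenated into the SQL
        # string, so ?name=x' OR '1'='1 returns every row (CWE-89, SQLi).
        rows = conn.execute(
            "SELECT id, name FROM users WHERE name = '" + name + "'"
        ).fetchall()
        return {"rows": [list(r) for r in rows]}

Because the flaw is reachable over HTTP and genuinely exploitable, a SAST tool can flag the string concatenation statically, while DAST and IAST tools can confirm it at runtime; that is the property that lets a single suite score all three tool classes.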
FaroukDaboussi0
This project aims to fine-tune a pre-trained LLM using CTI-specific data and evaluate its performance with CTIBench, a benchmark designed for cybersecurity tasks. CTIBench helps assess how well the model performs on tasks like identifying threat actors, mapping attack techniques, and correlating vulnerabilities.
agentlisa
LISABench - Smart Contract Vulnerability Detection Benchmark
nross12
PEVuln: A Benchmark Dataset for Using Machine Learning to Detect Vulnerabilities in PE Malware
socsecresearch
No description available
satty-br
LOKI (Leverage Offensive Knowledge Intelligently) is an AI-powered tool that creates pull requests with realistic code and dependency vulnerabilities to test and benchmark SAST/SCA tools.
mikusher
Benchmark collection for analysis. The idea is to have a collection of projects in several languages, plus various SAST applications, to run scans and comparisons. Ultimately the intention is to reduce the number of false positives across the benchmark projects.
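One common way to cut false positives when comparing scanners is cross-tool agreement: a finding reported by several independent tools is more likely real than one reported by a single tool. Whether this repo takes that approach is not stated; the following is a hedged Python sketch with invented tool names and finding keys:

    from collections import Counter

    def triage(findings_by_tool: dict[str, set[str]], quorum: int = 2) -> set[str]:
        # Count how many independent scanners reported each finding.
        counts = Counter(f for tool in findings_by_tool.values() for f in tool)
        return {finding for finding, n in counts.items() if n >= quorum}

    findings = {
        "tool_a": {"app.py:42:sqli", "app.py:77:xss"},
        "tool_b": {"app.py:42:sqli"},
        "tool_c": {"util.py:9:path-traversal"},
    }
    print(triage(findings))  # {'app.py:42:sqli'}; single-tool hits go to manual review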
Cristian-Curaba
We introduce a benchmark for testing how well LLMs can find vulnerabilities in cryptographic protocols. By combining LLMs with symbolic reasoning tools like Tamarin, we aim to improve the efficiency and thoroughness of protocol analysis, paving the way for future AI-powered cybersecurity defenses.
Amitelazari
This is the #legalbugbounty standardization project. As I explain in my Enigma talk and my papers, the legal landscape of bug bounties is currently lacking: safe harbor is the exception, not the standard, and thousands upon thousands of hunters are put in legal harm's way. I've suggested that bug bounty legal terms, starting with safe harbor, could and should be standardized. Once standardization of bug bounty legal language is achieved, the bug bounty economy will become an alternate private legal regime in which white-hat hacking is celebrated through regulatory incentives. Standardization will start a race to the top over the quality of bug bounty terms.

This project, supported by CLTC, aims to achieve standardization of bug bounty legal terms across platforms, industries, and sponsors, in line with the DOJ framework and akin to the licenses employed by Creative Commons and the open-source industry. This will reduce the informational burden and increase hackers' awareness of terms (salience). It could also signal whether a particular platform or company conforms with the standard terms that are considered best practice. Finally, it could reduce the drafting costs of the platform or sponsoring program, as well as the transactional costs. While some organizations (such as governmental or financial organizations) might require adjustments, the legal concerns of bug bounty sponsors and platforms are generally similar and could be addressed in standardized language. Moreover, standardization should be used to ensure that hackers have authorized access to any third-party data or components implemented in the bug bounty administrator's product or network, and to facilitate coordinated disclosure of third-party vulnerabilities found (and ethically disclosed). Companies and platforms should coordinate to ensure that such clauses are included in all terms, fostering a best-practice mentality in the industry.

The benefits of standardizing the legal language of bug bounties/CVDs across industries and platforms, in light of the DOJ framework:
+ One language of safe harbor, akin to Creative Commons/open source
+ Create an industry standard that will serve as a benchmark and signal to hackers when companies don't adopt it
+ Reduce the informational burden and increase hackers' awareness of terms
+ Reduce transaction and drafting costs
+ Create a reputation system for legal terms

You must consult with a lawyer.

Disclaimer: this report does not constitute legal advice, and the author is not admitted to practice law in the U.S. The information contained herein is for general guidance on matters of interest only. The application and impact of laws can vary widely based on the specific facts involved. Given the changing nature of laws, terms, rules, and regulations, there may be delays, omissions, or inaccuracies in the information contained herein. Accordingly, the information is provided with the understanding that the author is not herein engaged in rendering legal or other professional advice and services. As such, it should not be used as a substitute for consultation with professional legal or other competent advisers. Before making any decision or taking any action, you should consult a professional. All information is provided "as is", with no guarantee of completeness, accuracy, timeliness, or of the results obtained from the use of this information, and without warranty of any kind, express or implied, including, but not limited to, warranties of performance, merchantability, and fitness for a particular purpose. In no event will the author be liable to you or anyone else for any decision made or action taken in reliance on the information herein, or for any consequential, special, or similar damages.
toxy4ny
Kidnapp-AI-Benchmark is a modular, extensible framework designed to systematically test and evaluate privacy leakage, data extraction, and adversarial vulnerabilities in large language models (LLMs) and other generative AI systems. Built for red teamers, penetration testers, and AI security researchers.
secure-software-engineering
Achilles - Benchmark for assessing OSS vulnerability scanners
getastra
HypeJab is a deliberately vulnerable web application intended for benchmarking automated scanners.
FuzzingLabs
Benchmarking 12 LLMs for vulnerability research
rapticore
A multi-LLM benchmark suite for evaluating security analysis and vulnerability detection capabilities across models from OpenAI, Anthropic, and Google.
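For a sense of the shape such a multi-model harness can take, here is a minimal Python sketch with the provider calls stubbed out as plain callables; the model names and the fake_model stub are placeholders, not real SDK calls from OpenAI, Anthropic, or Google:

    from typing import Callable

    def evaluate(models: dict[str, Callable[[str], str]],
                 snippet: str, expected_cwe: str) -> dict[str, bool]:
        # Send the same vulnerable snippet to every model and check
        # whether its answer names the expected CWE.
        prompt = f"Identify the vulnerability (CWE id) in:\n{snippet}"
        return {name: expected_cwe in ask(prompt) for name, ask in models.items()}

    def fake_model(prompt: str) -> str:
        # Stub standing in for a real provider client (illustration only).
        return "This looks like CWE-89 (SQL injection)."

    models = {"provider_a": fake_model, "provider_b": fake_model}
    print(evaluate(models, "cursor.execute('SELECT * FROM t WHERE id=' + uid)", "CWE-89"))

Keeping each provider behind a plain prompt-to-text callable is what makes the same benchmark cases reusable across different vendor SDKs.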