For most of the history of software security testing, fuzzing — the automated technique of feeding malformed or random inputs to software to trigger unexpected behavior, crashes, or security flaws — was the workhorse: unglamorous, resource-intensive, and brutally effective. Feed a program enough random garbage and eventually it will break in ways that reveal real vulnerabilities. The method is simple in concept, hard to tune in practice, and quietly credited with tens of thousands of bug discoveries across open-source projects. Then large language models entered the picture, and the question of whether AI would replace fuzzing, enhance it, or simply run alongside it became one of the more interesting arguments in applied security.
The honest answer heading into 2026 is that the two approaches are merging — and the convergence is producing results that neither method could achieve independently. This article looks at what that convergence actually looks like at the implementation level, where the seams still show, and why the same tooling now benefits both defenders and attackers.
How Traditional Fuzzing Actually Works
Fuzzing, at its core, is adversarial input testing at scale. A fuzzer generates large volumes of unexpected, malformed, or random data and feeds it into a target program while monitoring for crashes, assertion failures, or memory leaks. The goal is to find edge cases that developers did not anticipate and that normal testing would never surface.
The most widely used modern fuzzers — AFL++, libFuzzer, Honggfuzz — operate using coverage-guided mutation. They instrument the target binary to track which code paths execute during each test run, then prioritize mutations that reach previously untouched branches. This makes them dramatically more efficient than pure random fuzzing, because they learn from what they find and adapt their inputs accordingly.
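The loop is simple enough to sketch in plain Python. Below is a toy coverage-guided fuzzer — not a real fuzzer like AFL++; the `check_header` target, the branch labels, and the mutation operators are all invented for illustration — that keeps any mutated input reaching a branch no earlier input touched:

```python
import random

def check_header(data, hits):
    """Toy target: 'crashes' only on one specific nested input shape."""
    if len(data) >= 4:
        hits.add("len>=4")
        if data[:2] == b"PK":
            hits.add("magic")
            if data[2] == 0xFF:
                hits.add("flag")
                if data[3] == 0x00:
                    raise RuntimeError("simulated crash")

def mutate(data, rng):
    buf = bytearray(data)
    op = rng.randrange(3)
    if op == 0 and buf:                       # flip one byte
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    elif op == 1:                             # append a random byte
        buf.append(rng.randrange(256))
    elif buf:                                 # truncate at a random point
        del buf[rng.randrange(len(buf)):]
    return bytes(buf)

def fuzz(seed, iterations=200_000):
    rng = random.Random(1)                    # seeded for reproducibility
    corpus, seen = [seed], set()
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        hits = set()
        try:
            check_header(candidate, hits)
        except RuntimeError:
            return candidate                  # crashing input found
        if hits - seen:                       # new branch covered:
            seen |= hits                      # remember it, and keep the
            corpus.append(candidate)          # input for future mutation
    return None

crash = fuzz(b"PK\x00\x00")
print("crash input:", crash)
```

A purely random fuzzer would have to guess all four bytes at once; the coverage feedback lets this one climb toward the crash one branch at a time — the same intuition, at toy scale, that makes AFL++-style fuzzing effective.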
Google's OSS-Fuzz has been continuously testing open-source projects since 2016. As of May 2025, the platform had surfaced over 13,000 vulnerabilities and 50,000 bugs across roughly 1,343 projects — before AI augmentation began contributing significantly to those totals. Source: github.com/google/oss-fuzz
Where traditional fuzzing struggles is with context. A fuzzer does not understand the semantics of what it is testing. It cannot read a protocol specification, reason about state machines, or recognize that a particular sequence of valid-looking inputs will trigger a logic flaw three function calls later. It explores the surface area of a program efficiently, but it cannot think about what the surface area means.
A less-discussed limitation is the coverage ceiling problem. OSS-Fuzz requires human supervision to monitor project coverage and to write new harnesses for uncovered code. An analysis of GStreamer — enrolled in OSS-Fuzz for seven years — found that the project had only two active fuzzers and code coverage around 19%, compared to OpenSSL's 139 active fuzzers. That gap existed not because of technical failure but because fuzzing scales with the human expertise invested in it; it does not generate insight on its own. When no human is watching coverage gaps, the gaps persist indefinitely. Source: GitHub Security Lab, 2025
That limitation left an entire class of vulnerabilities effectively invisible to conventional fuzz campaigns — bugs that required understanding the program's intent, not just its structure.
What AI Brings to the Table
LLM-enhanced fuzzing addresses the context problem directly. When Google integrated large language model capabilities into OSS-Fuzz in August 2023, the initial focus was on automating one of the most time-consuming manual steps in any fuzzing workflow: writing fuzz targets. A fuzz target — often called a harness — is the bridge code that tells the fuzzer how to talk to the program under test: which inputs to feed, which APIs to invoke, which behaviors to monitor. Writing good targets requires deep familiarity with the codebase and takes significant expert time.
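A harness is usually only a few lines; the difficulty is knowing which lines. The sketch below mirrors libFuzzer's `LLVMFuzzerTestOneInput` contract in Python for readability — `parse_record` is an invented stand-in for the library under test, not a real API:

```python
def parse_record(data: bytes):
    """Stand-in for the library API under test (invented for this sketch)."""
    if not data.startswith(b"REC"):
        raise ValueError("bad magic")
    length = data[3] if len(data) > 3 else 0
    payload = data[4:4 + length]
    if len(payload) != length:      # a real parser bug might read past the end here
        raise ValueError("truncated payload")
    return payload

def fuzz_one_input(data: bytes) -> None:
    """The harness: decide which API to call and which failures count.

    Mirrors libFuzzer's LLVMFuzzerTestOneInput contract — take raw bytes,
    drive the API, and treat only unexpected failures as findings.
    """
    try:
        parse_record(data)
    except ValueError:
        pass    # expected rejection of malformed input, not a bug
    # any other exception (or, in C/C++, a sanitizer-detected memory error)
    # surfaces here as a crash for the fuzzer to record

# a fuzzer would call fuzz_one_input millions of times with mutated bytes
fuzz_one_input(b"REC\x02hi")
```

The judgment calls — which API is the right entry point, which exceptions are expected rejections versus genuine bugs — are exactly the codebase knowledge that makes harness authorship expensive, and exactly what the LLM is being asked to supply.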
By prompting an LLM with project-specific context — function signatures, type definitions, cross-references, existing unit tests — Google's framework can generate fuzz targets automatically. The results across 272 C/C++ projects were concrete: over 370,000 lines of new code coverage that existing human-written targets never reached. Within that expanded coverage, the AI-generated targets discovered 26 vulnerabilities that would not have been discoverable with the existing human-written targets. Source: Google Security Blog, November 2024
Google's open-source security team observed that sustained line coverage is no guarantee a function is free of bugs — different flags and configurations can trigger entirely different behaviors. Source: Google Security Blog, November 2024.
Among those 26 was CVE-2024-9143, an out-of-bounds read/write flaw in OpenSSL that had likely existed in the codebase for roughly two decades. The vulnerability was reachable only through code paths that human-written fuzz targets had never exercised — not because developers were careless, but because the target code was structurally difficult to reach and the connection between inputs and the vulnerable function was non-obvious.
The OSS-Fuzz AI framework currently automates four of the five core stages of a fuzzing workflow: drafting the initial fuzz target, compiling it and fixing compilation errors, running the target to resolve runtime issues, and continuously fuzzing while triaging any crashes for root cause. Google's stated roadmap adds a fifth stage — patch generation — which is where CodeMender enters the picture.
Big Sleep and the Agentic Angle
Google's separate Project Zero effort, called Big Sleep, takes a different angle. Rather than enhancing the fuzzing engine itself, Big Sleep deploys an LLM agent that mimics the reasoning workflow of a human security researcher — examining code, forming hypotheses about where bugs might exist, generating targeted inputs, and interpreting results. The project grew out of an earlier research framework called Project Naptime, which was designed to evaluate offensive security capabilities of large language models in a structured, verifiable environment.
In October 2024, Big Sleep found its first real-world vulnerability: an exploitable stack buffer underflow in SQLite, a widely used open-source database engine. The flaw was identified and reported to the SQLite developers, who fixed it the same day — before it appeared in any official release. Google published the findings in November 2024. OSS-Fuzz and SQLite's own testing infrastructure had not caught it; human researchers who attempted to rediscover the flaw with AFL++ were unable to do so even after 150 CPU hours of fuzzing. The significance was not just the finding; it was that Big Sleep's hypothesis-driven approach reached a bug that brute-force coverage could not. Source: Google Project Zero blog
In July 2025, Big Sleep found a second critical SQLite flaw, CVE-2025-6965, a memory corruption issue scored 7.2 CVSS (NIST assigns it 9.8 Critical under v3.1) affecting all SQLite versions prior to 3.50.2. This case was more operationally consequential: Google's threat intelligence team had identified artifacts suggesting that threat actors were already staging the vulnerability for exploitation. The flaw was, according to Google, known only to those actors at the time Big Sleep isolated it. Kent Walker, President of Global Affairs at Google and Alphabet, stated that Google was able to predict that a vulnerability was imminently going to be used and cut it off beforehand. Google described this as the first time an AI agent had been used to directly foil efforts to exploit a vulnerability in the wild. Source: The Hacker News, July 2025
Big Sleep's architecture is built around three distinct phases: deep codebase familiarization (studying architecture, historical vulnerabilities, and variant patterns), a continuous agentic reasoning loop that generates and tests hypotheses using real tools including a debugger and Python sandbox, and structured verification to confirm whether a crash is exploitable and what input triggers it. Most automated tools skip the first phase entirely.
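The three phases can be read as a control loop: familiarize, hypothesize, verify, repeat. The sketch below shows only that shape — the stubbed `propose` and `run_in_sandbox` functions, the canned hypotheses, and the toy oracle are all invented here for illustration; Big Sleep's actual implementation is not public in this form:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    location: str       # e.g. a function suspected of harboring a variant bug
    rationale: str
    test_input: bytes

def propose(history):
    """Stub for the LLM step: a real agent would prompt the model with
    code context plus prior results. Here it walks a canned list."""
    canned = [
        Hypothesis("seterr()", "variant of a patched off-by-one", b"A" * 64),
        Hypothesis("parse_int()", "sign handling near INT_MIN", b"-2147483648"),
    ]
    return canned[len(history)] if len(history) < len(canned) else None

def run_in_sandbox(h):
    """Stub for the verification step: a real agent runs the target under a
    debugger or sanitizer and checks for a crash. Here, a toy oracle."""
    return h.location == "parse_int()"

def agent_loop():
    history = []
    while True:
        h = propose(history)          # phase 2: reason, form a hypothesis
        if h is None:
            return None               # hypothesis space exhausted
        crashed = run_in_sandbox(h)   # phase 3: verify with real tools
        history.append((h, crashed))
        if crashed:
            return h                  # a confirmed, reproducible finding

finding = agent_loop()
print(finding.location if finding else "no finding")
```

The structural point is the `history` parameter: each failed hypothesis feeds back into the next prompt, which is what distinguishes an agentic loop from a one-shot "find the bug" query.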
The variant analysis use case is where Big Sleep diverges most clearly from fuzzing. A key motivating factor behind the project was the continued discovery in the wild of exploits for variants of previously patched vulnerabilities — a class of bugs that traditional fuzzing consistently misses. By starting with a known patched flaw as a hypothesis anchor, the model reduces ambiguity significantly. The reasoning becomes: this was a previous bug; there is probably another similar one nearby. That framing plays to current LLM strengths in a way that open-ended vulnerability discovery does not.
CodeMender: Closing the Patch Loop
Finding bugs at scale is useful only if the remediation pipeline can keep pace. In October 2025, Google DeepMind announced CodeMender, an autonomous AI agent designed to close that gap. Where Big Sleep finds vulnerabilities and OSS-Fuzz generates crash reports, CodeMender's job is to understand what caused the crash and produce a correct, regression-free patch without requiring a developer to interpret the root cause manually. It runs on Google's Gemini Deep Think reasoning models. Source: Google DeepMind blog, October 2025
CodeMender operates through a structured sequence. It uses a debugger, static analysis, dynamic analysis, differential testing, fuzzing, and SMT solvers to diagnose the root cause of a vulnerability — not just its surface symptoms. A heap buffer overflow report, for example, might actually indicate incorrect stack management of XML elements during parsing; CodeMender traces back through the execution to identify where the actual fault originates before touching any code. The agent then generates a patch and routes it through specialized critique agents that validate correctness, security implications, and style conformance before any human sees it.
Google DeepMind researchers Raluca Ada Popa and John "Four" Flynn have stated that as AI-powered discovery scales, human patch teams alone will not be able to keep pace with the volume of findings. Source: Google DeepMind blog, October 2025.
In its first six months, CodeMender upstreamed 72 security fixes to open-source projects, including codebases as large as 4.5 million lines. One of CodeMender's proactive capabilities involves applying bounds-safety annotations to existing C code — a technique that, if applied retroactively to libwebp, would have rendered CVE-2023-4863 (a heap buffer overflow used in a zero-click iOS exploit) unexploitable by eliminating the vulnerability class rather than patching the specific instance. All CodeMender patches are currently reviewed by human researchers before upstream submission.
The pipeline framing — AI finds the bug, AI generates the patch, humans review and approve — reflects where Google has explicitly stated it is heading. The OSS-Fuzz roadmap has always listed patch generation as its fifth and final automation stage. CodeMender is the working implementation of that stage.
Fuzzer vs. AI Agent: Side-by-Side
The two approaches differ across every meaningful operational dimension. The comparison below sets them side by side on the attributes that matter most when deciding which tool to reach for — and when to use both.

- Reproducibility: a fuzzer's crashing input reproduces on demand; an LLM agent may surface the same issue in only a fraction of runs.
- Context: a fuzzer explores code structure without understanding semantics; an agent reasons about intent, specifications, and state.
- Coverage style: fuzzers deliver breadth across reachable surface area; agents deliver depth on hypothesized hot spots, such as variants of previously patched bugs.
- Setup cost: fuzzers need harnesses, tuning, and compute infrastructure; agents need accurate codebase context and tightly scoped objectives to avoid wheel-spinning.
- Failure mode: fuzzers miss bugs that require semantic understanding; agents hallucinate, overstate severity, and produce false positives without human review.
Where AI Falls Short (and Fuzzing Still Wins)
AI-driven security testing is not a clean upgrade over traditional fuzzing. The limitations are real and matter operationally.
The most significant is non-determinism. When the same input crashes a program 100 times out of 100, debugging is straightforward. LLM-based systems do not offer that consistency. The same prompt might surface an issue 20 times out of 100 runs, making triage, reproduction, and root cause analysis substantially harder. Traditional fuzzers, for all their brute-force character, produce highly reproducible crash cases.
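The operational cost of that non-determinism is easy to quantify. If a prompt surfaces a genuine issue on a fraction p of runs, the chance of seeing it at least once in n runs is 1 − (1 − p)^n — so the 20-in-100 scenario above needs about 14 runs to reach 95% confidence, where a deterministic crash needs one. A quick check (plain Python; the rates are the article's illustrative figures, not measured benchmarks):

```python
import math

def runs_needed(p_per_run, confidence):
    """Smallest n such that 1 - (1 - p)^n >= confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_per_run))

# a finding that reproduces (near-)deterministically needs a single run
print(runs_needed(0.999, 0.95))   # -> 1
# a finding that surfaces 20 times out of 100 needs ~14 runs for 95% confidence
print(runs_needed(0.20, 0.95))    # -> 14
```

That 14x multiplier is a real budget line item: every probabilistic finding costs an order of magnitude more compute and triage time to confirm than a deterministic crash reproducer.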
A second limitation is hallucination — a model generating plausible-sounding but factually incorrect output. Without sufficient context about the target codebase, an LLM will invent details, producing fuzz targets that compile cleanly but exercise the wrong API surface, or that make assumptions about API behavior that do not hold. There have been documented cases where AI-generated fuzz targets began overwriting files on the host system because of how the target was authored and which payloads it passed to the function under test. Google's OSS-Fuzz team addressed this by building infrastructure to index projects and inject accurate, project-specific context into prompts, but the problem does not disappear entirely. It requires ongoing maintenance and validation.
A third limitation showed up clearly in a research exercise examining AI agent performance against realistic enterprise scenarios. In directed tasks with clear success criteria, agents performed well. In less structured, more realistic scenarios, performance degraded significantly. In one documented case, an AI agent made approximately 500 tool calls over an hour pursuing a hypothesis about a deserialization vulnerability — and found nothing. A human investigator, switching to comprehensive directory enumeration with a 350,000-path wordlist, found an exposed service with default credentials within minutes.
AI agents tend to produce false positives, overstate severity of minor findings, and struggle to distinguish meaningful access from interesting-looking noise — especially without clear success criteria defined in advance. Human review of AI-generated vulnerability reports remains essential. The Big Sleep team itself notes that a target-specific fuzzer is likely at least as effective as LLM-assisted analysis for many classes of bugs.
There is also a risk of a false sense of coverage, mirroring what OSS-Fuzz enrollment created for some projects. A security team that deploys an AI agent against a codebase and receives a report of findings may believe the areas not flagged are clean. They are not necessarily clean — they may simply sit outside the agent's effective reasoning range or hypothesis space for that run.
The most accurate description of where the field sits is that human direction combined with AI execution outperforms either approach alone — but "human direction" is doing significant work in that sentence. Humans still need to define scope, evaluate findings, confirm exploitability, and recognize when an agent is spinning its wheels.
The Attacker's Angle
Any capability that helps defenders find bugs faster also helps attackers find them faster. AI-enhanced fuzzing is a dual-use technology in the most literal sense.
Traditional fuzzing required significant setup: selecting and configuring a fuzzer, writing harnesses, managing infrastructure, interpreting results. The expertise barrier kept casual attackers out. AI-powered fuzzing lowers that barrier substantially — a pattern already visible in how threat actors are building entire AI-built attack frameworks from scratch in days rather than months. An attacker can now prompt an LLM to analyze a target's expected input format — a file type, a protocol, an API structure — and generate intelligent, semantically valid mutations that are far more likely to trigger edge cases than random garbage data.
The PDF example is instructive. A traditional fuzzer might generate millions of files immediately rejected by a PDF parser. An AI-guided fuzzer learns the internal structure of a valid PDF and produces thousands of subtly malformed versions: one with a slightly incorrect header length, another with an impossibly large embedded image field, a third with a recursive object reference. These inputs probe the specific corners where memory corruption vulnerabilities hide.
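A structure-aware mutator does exactly this: start from a syntactically valid skeleton and perturb one semantic field at a time. The sketch below is deliberately minimal — the skeleton is a stripped-down, PDF-shaped byte string, nowhere near a spec-complete file — but it shows how each variant stays valid enough to get past a parser's front door:

```python
def pdf_skeleton(version=b"1.7", obj_count=b"1"):
    """Minimal PDF-shaped byte string: header, one object, trailer."""
    return (b"%PDF-" + version + b"\n"
            b"1 0 obj << /Type /Catalog >> endobj\n"
            b"trailer << /Size " + obj_count + b" >>\n"
            b"%%EOF\n")

def structure_aware_variants():
    """Each variant is valid-looking except for one targeted field."""
    yield pdf_skeleton(version=b"1")             # truncated version field
    yield pdf_skeleton(version=b"9.9.9.9.9")     # overlong version field
    yield pdf_skeleton(obj_count=b"4294967296")  # /Size exceeding uint32
    yield pdf_skeleton(obj_count=b"-1")          # negative object count
    base = pdf_skeleton()
    yield base.replace(b"endobj", b"")           # unterminated object
    yield base + base[base.index(b"1 0 obj"):]   # duplicate object id

variants = list(structure_aware_variants())
# every variant still carries the magic bytes a parser checks first,
# so each one survives the front-door check and probes a deeper code path
print(all(v.startswith(b"%PDF-") for v in variants), len(variants))
```

Where a random mutator burns its budget on inputs rejected at the magic-byte check, every one of these reaches the length, count, and object-graph logic where the memory-corruption bugs actually live — and composing such variants is exactly the task LLMs have made cheap.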
Unit 42 researchers at Palo Alto Networks documented another angle: fuzzing AI systems themselves. Their tool, AdvJudge-Zero, applied automated fuzzing principles to AI "judge" models — the LLMs used to evaluate whether other AI outputs are harmful or policy-compliant. By iteratively probing the judge's decision boundary with innocent-looking formatting symbols and low-perplexity tokens, the tool identified input sequences that flipped block decisions to allow with high success rates. The same adversarial training approach used to find those weaknesses can, if applied defensively, reduce that success rate significantly. This offensive use of AI against AI systems is part of a broader shift in AI-powered cyber warfare that security teams now need to plan for explicitly.
The CVE count increase is part of this picture too. NIST reported a 32% increase in CVE submissions in 2024 alone, and acknowledged in March 2025 that its prior processing rate is no longer sufficient to keep pace with incoming volume. As of late 2025, roughly 44% of CVEs added in the preceding year remained in "awaiting analysis" status — meaning no CVSS score, no affected product list, and no actionable enrichment. That gap exists in part because discovery is outpacing the human capacity to analyze and prioritize findings, which is precisely the problem CodeMender was built to address on the defensive side. Attackers benefit from the same backlog, because unanalyzed entries are entries that organizations have not yet prioritized patching. Source: NIST NVD March 2025 update
Generative AI has materially lowered the expertise threshold for effective fuzzing campaigns. Organizations that have not recently fuzz-tested their externally facing software — particularly anything with a complex file parser, protocol handler, or AI decision-making component — should treat that as an elevated risk item, not a deferred one.
How to Evaluate AI-Assisted Fuzzing for Your Security Program
The choice between traditional fuzzing, LLM-enhanced fuzzing, and agentic vulnerability research is not binary — each approach fits different targets, budgets, and team capabilities. The following steps help you assess which combination makes sense before committing infrastructure or budget.
- Audit your current fuzzing coverage first. Before adding AI tooling, establish a baseline. Use a tool like OSS-Fuzz's Fuzz Introspector or your existing fuzzer's coverage reports to identify which code paths are tested and which are not. If you have projects with coverage below 30%, AI-generated target expansion is likely your highest-value move.
- Identify whether your bottleneck is harness authorship or compute. If your team lacks the time or expertise to write fuzz harnesses for every target, LLM-generated targets (via OSS-Fuzz AI or a similar framework) directly address that constraint. If harnesses already exist and coverage is solid, shifting budget to compute scale on existing fuzzers may produce more findings per dollar.
- Reserve agentic tools for variant analysis, not open-ended discovery. Big Sleep's own research team states that a target-specific fuzzer is likely at least as effective as LLM-assisted analysis for general bug classes. Where agentic approaches outperform is directed variant research — starting from a known patched CVE and asking the agent to search for related flaws nearby. Scope these sessions tightly.
- Build a triage process for probabilistic findings before you deploy. LLM-based discovery produces non-deterministic results. The same prompt may surface a genuine flaw only a fraction of the time. Your team needs a defined process — how many runs, what constitutes a confirmed finding, who confirms exploitability — before integrating AI-generated results into your vulnerability management pipeline.
- Track the NVD enrichment backlog against your patching cadence. Roughly 44% of CVEs added in 2025 lacked CVSS scores or actionable enrichment at the time of publication. If your patch prioritization relies solely on NVD data, you may be operating with a significant blind spot. Supplement with CISA KEV, VulnCheck, or commercial enrichment feeds for recently published CVEs.
- Treat AI safety components in your stack as fuzz targets. If your organization uses LLMs for content moderation, access control decisions, or policy enforcement, those systems are now a legitimate part of your attack surface. AdvJudge-Zero demonstrated that adversarial probing of AI judge models can flip block decisions at scale. Include AI components in scope for your next red team or penetration test.
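The decision logic in that checklist reduces to a few branches. The toy helper below encodes it for illustration — the field names are invented, and the 30% threshold is the rough guide from the first step above, not an industry standard:

```python
def recommend(project):
    """Toy decision helper mirroring the evaluation checklist above."""
    if project["coverage_pct"] < 30:
        # large untested surface: target generation is the highest-value move
        return "expand coverage: AI-generated fuzz targets"
    if not project["has_harnesses"]:
        # bottleneck is harness authorship, not compute
        return "bottleneck is harness authorship: LLM-generated targets"
    if project["patched_cve_to_probe"]:
        # agentic tools earn their keep on tightly scoped variant research
        return "scoped agentic variant analysis"
    # coverage is solid and harnesses exist: spend on compute, not novelty
    return "scale compute on existing fuzzers"

# e.g. a GStreamer-like project: seven years enrolled, ~19% coverage
print(recommend({"coverage_pct": 19, "has_harnesses": True,
                 "patched_cve_to_probe": None}))
```

The ordering matters: coverage gaps are checked first because, per the checklist, no amount of agentic sophistication compensates for code paths nothing is exercising at all.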
Key Takeaways
- AI and fuzzing are not competitors: The most effective current implementations use LLMs to generate better fuzz targets, triage crashes, and extend code coverage — not to replace the underlying fuzzing engine. Google's OSS-Fuzz is the most documented example of this hybrid working at scale, with 26 confirmed vulnerabilities found as of November 2024 that human-written targets would not have reached.
- Big Sleep proved agentic vulnerability research works in the real world: The SQLite discoveries — the October 2024 stack buffer underflow (reported in a development branch before any official release) and the July 2025 CVE-2025-6965 — demonstrate that hypothesis-driven LLM agents can find bugs ahead of exploitation, including cases where threat actors were already preparing to use the vulnerability. CVE-2025-6965 was the first documented case of an AI agent directly foiling an active exploitation attempt.
- CodeMender represents the next phase of the pipeline: Autonomous patch generation was the missing link between discovery and remediation. With 72 upstreamed fixes in its first six months and the ability to eliminate entire vulnerability classes via proactive annotation rather than individual bug fixes, it changes the math on how fast defenders can close windows of exposure. All patches remain under human review before upstream submission.
- Non-determinism is a real operational problem: LLM-based vulnerability discovery produces inconsistent results across runs. Security teams integrating AI-assisted fuzzing need processes to handle probabilistic findings — not just deterministic crash reproducers. Human confirmation of exploitability before escalation is not optional. The Big Sleep team itself notes that a target-specific fuzzer is likely at least as effective as LLM-assisted analysis for many bug classes.
- Attackers are adopting AI-enhanced fuzzing too: The same techniques that found a 20-year-old OpenSSL flaw are available to threat actors. Protocol-aware, semantically guided fuzzing requires significantly less expertise to set up than traditional campaigns. And AI systems themselves — including safety classifiers and content judges — are now fuzz targets in their own right, as demonstrated by Unit 42's AdvJudge-Zero research.
The trajectory is toward fuller automation: AI systems that write their own fuzz targets, triage their own crashes, generate patch suggestions, and flag findings for human review rather than requiring human involvement at every step. Google's stated roadmap for OSS-Fuzz includes automating patch generation as the final piece of that pipeline, and CodeMender is the working implementation of that fifth stage. When that loop closes at scale, the speed of vulnerability discovery and remediation will shift significantly — for both sides of the equation. The organizations best positioned for that shift are the ones that have already stopped treating fuzzing as a checkbox and started treating it as infrastructure.
OSS-Fuzz AI fuzzing (26 vulnerabilities, November 2024): Google Security Blog — Leveling Up Fuzzing: Finding More Vulnerabilities with AI
Big Sleep first SQLite finding (October 2024): Google Project Zero — From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code
CVE-2025-6965 (Big Sleep foils active exploitation, July 2025): The Hacker News — Google AI "Big Sleep" Stops Exploitation of Critical SQLite Vulnerability Before Hackers Act
CodeMender (October 2025): Google DeepMind — Introducing CodeMender: An AI Agent for Code Security
CVE-2024-9143 (OpenSSL, 20-year-old flaw): NVD — CVE-2024-9143 detail
NVD backlog and CVE growth (32% 2024, ongoing enrichment crisis): NIST — NVD March 2025 update
OSS-Fuzz project statistics (1,343 projects, 13,000+ vulnerabilities): GitHub — google/oss-fuzz
Coverage gaps in long-enrolled OSS-Fuzz projects (GStreamer, OpenSSL): GitHub Security Lab — Bugs that survive the heat of continuous fuzzing