17 bugs in 10 weeks from AI security scanning

Over the last several weeks, I’ve been receiving more security bug reports for Perfetto’s trace processor than I ever have before, all of them found by AI. And I’m very happy about it! These are bugs that would almost certainly not have been found a year ago and it feels good to close these loopholes even though trace processor is by no means security critical.

For years, security researchers concentrated their time on the highest-stakes targets: kernels, cryptography libraries, password managers. But there’s a lot of code out there which is security-relevant but not truly security-critical. In my experience, these sorts of projects didn’t draw much attention. Now systems in the long tail can get attention they would never have received before.

Why is this happening?

Trace processor is a project which sits squarely in that long tail. It’s a C++ library (yes, Rust would be the obvious choice today but it’s not practical to rewrite, see footnote 1) for processing recorded traces of various formats. These are typically traces you collected yourself or in your test infra and process offline so “untrusted input” isn’t much of a concern.

However, some people do process traces they didn’t collect themselves (e.g. user bug reports, automated collection from dogfood users). For those cases we’ve strongly recommended sandboxing trace processor (e.g. gvisor, sandbox2, or minijail) or, for even more sensitive use cases, a VM.

Beyond sandboxing, we mainly relied on fuzzing, run internally at Google, to catch issues proactively. We set the fuzzers up to pass in arbitrary trace bytes, as this is the main “attack surface”. They occasionally surfaced real, actionable bugs, but over time these became quite rare: the fuzzers had already found much of the low-hanging fruit, which we quickly fixed. The bugs that remained tended to live deep in the internals, reachable only with a very precisely crafted sequence of bytes that a fuzzer is unlikely to hit by mutation alone.
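For readers who haven’t set one up, a fuzz target of this kind can be very small. The sketch below assumes a libFuzzer-style harness (the post doesn’t say which fuzzing engine is used internally), and `ParseTrace` is a hypothetical toy parser standing in for trace processor’s real entry point:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a trace parser (not Perfetto's API): the input
// is a stream of records, each a 1-byte length followed by that many
// payload bytes. Returns the number of complete records, or -1 if the
// input is malformed.
int ParseTrace(const uint8_t* data, size_t size) {
  int records = 0;
  size_t pos = 0;
  while (pos < size) {
    size_t len = data[pos++];
    // Reject a truncated record instead of reading past the end.
    if (len > size - pos) return -1;
    pos += len;
    ++records;
  }
  return records;
}

// libFuzzer entry point: the engine calls this repeatedly with mutated
// byte buffers. Crashes and sanitizer reports are the "findings".
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  ParseTrace(data, size);
  return 0;  // libFuzzer expects 0 from normal runs
}
```

With clang, a target like this is typically built with `-fsanitize=fuzzer,address`. Mutation finds the shallow cases (like the truncated record above) quickly; it struggles when a bug only triggers after many fields line up exactly.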

Apart from this, there has rarely been any bandwidth or resources for a human, either a security expert or someone from my team, to spend lots of time finding security issues in trace processor. There were always other parts of Perfetto more worth spending security time on (e.g. the tracing service, on-device profilers) as they’re actively running in production systems.

All of this changed as of a couple of months ago. We started receiving bug reports filed by some central team which appears to be running AI-based security scanning against various projects throughout Google. Unfortunately, I have to be hand wavy about what exactly they’re doing as their work doesn’t appear to be public.

Starting in early April, we had a slow drip of 1 bug a week, but since the end of April this increased to a rate of several a week, with some days having 3 or 4 being opened in quick succession. This lasted until mid-May, at which point it started tapering back to 1-2 a week with some weeks having none.

I also want to say that the quality of the bugs is high. They’re well-described, often with the relevant attacker model already worked out and even minimal fixes proposed: basically everything I could ask for from a bug report. This matches what both curl and Linux kernel maintainers have noted about security bugs they’ve received, especially how sharply quality has improved in the last few months.

As I can only see the bugs that get filed against me, not the raw output of the AI scanner, I don’t know exactly how much triage happens upstream. My guess is there’s a human doing a light pass to drop obvious noise before reports reach client teams, but judging from the rate at which bugs are opened and the way they’re filed, I doubt anyone is deeply triaging each one.

In total, we’ve received 21 bugs (17 real issues and 4 not actionable), which can be broken down into the following categories:

  • 10 bounds checking: arbitrary trace data flowing into fixed-size buffers or unchecked array indices, leading to out-of-bounds reads or writes.
  • 5 use-after-free: back-pointers, pointer snapshots, or hashmap keys outliving the object they refer to.
  • 1 stack overflow: unbounded recursion when input is deeply nested.
  • 1 access control: not enforcing allowlists on some rare codepaths.
  • 4 closed as not actionable: either where the chance of exploit was purely hypothetical or where fixing would have required fundamental design changes which were not worth the tiny security risk.

All 17 real issues have been fixed, almost all shipping in Perfetto v56.0 (see footnote 2).

How it feels to get a report

How does receiving one of these reports actually feel? Well, not as bad as you’d think. Unlike a security-critical application like OpenSSL or curl, in trace processor, a security issue is very unlikely to be a P0 I have to drop everything to fix. Don’t get me wrong, it’s still a priority, but one where I have the luxury of taking a few days to figure out the right answer and can release fixes according to our normal schedule, instead of trying to rush out a CVE and get everyone to patch immediately.

Also thankfully, because the majority of the issues are mechanical, the fixes are generally quite straightforward. Take this PR, for example:

  • We build a key string into a fixed-size stack buffer while parsing some metadata.
  • The bounds check only runs in debug builds, and the metadata name comes straight from the trace. A long enough name would therefore write past the end of the buffer in release builds.
  • The fix is a simple matter of swapping the stack buffer for a std::string. The code path is very cold (only once or twice in a trace) so the extra heap allocation doesn’t matter.
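The shape of that bug and its fix can be sketched as follows. This is an illustration under assumptions, not the code from the PR: the function names, the 64-byte buffer and the "metadata." prefix are all made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Buggy shape (hypothetical): the only bounds check is an assert, which
// compiles to nothing when NDEBUG is defined, as in typical release builds.
std::string BuildKeyFixedBuffer(std::string_view name) {
  char buf[64];
  static constexpr char kPrefix[] = "metadata.";
  constexpr std::size_t kPrefixLen = sizeof(kPrefix) - 1;
  assert(kPrefixLen + name.size() <= sizeof(buf));  // debug builds only
  std::memcpy(buf, kPrefix, kPrefixLen);
  // In a release build, a long `name` from the trace overflows `buf` here.
  std::memcpy(buf + kPrefixLen, name.data(), name.size());
  return std::string(buf, kPrefixLen + name.size());
}

// Fixed shape: std::string grows as needed, so there is no capacity to
// get wrong. The cost is a possible heap allocation on a cold path.
std::string BuildKey(std::string_view name) {
  std::string key = "metadata.";
  key.append(name.data(), name.size());
  return key;
}
```

Both functions agree on short names; only the second stays well-defined when the name is attacker-controlled and arbitrarily long.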

In fact, these sorts of issues are so mechanical that I trust a coding agent to just fix them with minimal guidance: take the well-written report, feed it to the agent, and within ~10 minutes there’s a 10-20 line PR which fixes it. I review every line thoroughly and make sure I understand it, but these tasks are not difficult and firmly inside the “jagged frontier” of what AI can do.

I want to stress though that not every issue is mechanical or can be left to AI; a few reports actually point more to design problems than incorrect function implementations. This use-after-free is a good example:

  • The problem was that a state object held a back-pointer that could end up pointing to freed memory given certain data appearing in the trace.
  • The immediate dangling case was easy to patch by just having a callback which invalidated the back-pointer on free. But this is a horrible hack which makes the lifetimes of the objects involved impossible to reason about.
  • The real problem here is that you had a child object whose parent could go away before it, which really shouldn’t happen if this code is properly architected.
  • Fixing it properly meant restructuring the ownership model so the lifetime was correct by construction.
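As a rough illustration of what “correct by construction” can look like, here is one possible shape, which is not necessarily the one used in the actual fix. The parent owns its children by value and hands out ids rather than pointers, and the child stores no back-pointer at all. `Tracker`, `Sequence` and `Describe` are hypothetical names:

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// The parent owns every child, so a child cannot outlive it, and nothing
// holds a pointer that needs invalidating when an object is destroyed.
class Tracker {
 public:
  using SequenceId = std::size_t;

  explicit Tracker(std::string label) : label_(std::move(label)) {}

  // Callers get an id, never a pointer or reference to the child.
  SequenceId AddSequence(std::string name) {
    sequences_.push_back(Sequence{std::move(name)});
    return sequences_.size() - 1;
  }

  // Work that needs both parent and child state lives on the parent.
  // Returns nullopt for an unknown id instead of touching invalid memory.
  std::optional<std::string> Describe(SequenceId id) const {
    if (id >= sequences_.size()) return std::nullopt;
    return label_ + "/" + sequences_[id].name;
  }

 private:
  struct Sequence {
    std::string name;  // note: no Tracker* back-pointer
  };

  std::string label_;
  std::vector<Sequence> sequences_;  // destroyed together with the Tracker
};
```

A stale id now yields an empty result rather than a dangling dereference, and there is no invalidation callback whose ordering has to be reasoned about.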

The interesting thing was that this was a problem I was aware of and that I had been meaning to clean up for close to a year but never got round to: the security bug just gave me the push and justification to do it. This applied in a couple of other bugs as well and made me internalize that security issues can sometimes be correlated with deeper design flaws or hacky code so there are wider benefits to “security scanning” than just the direct bugs they find.

Will this last?

One thing I am wary of is how long this stream of bugs will keep up; I’m feeling good about it given it’s only been going on for a couple of months, but I can easily imagine that if this goes on for several more months, it might become mentally exhausting.

But my suspicion is that this will go to zero. Why? It’s to do with the pattern of how these bugs are being filed. Each part of the codebase seems to get a day or two of attention (and associated bugs) before the scanner moves on to a different part. Repeats are rare, and the pace of bugs has slowed, especially in the last couple of weeks: we had a lot more at the start of May (several a week) but now we’re down to 1-2 a week. There are a finite number of files, so my gut tells me the bugs will eventually run out.

An important consideration is whether we’ll add new bugs faster than the scanner can find old ones. My suspicion is no; the 17 real issues so far are from scanning across 9 years of development. Even if that number triples before things settle, the scanner is still working through years of accumulated code. And we wrote a lot more code, a lot faster, in the earlier years of the project, so the rate of new code being added now is lower than it once was.

The other question is whether new model releases will find more complex design issues rather than the simple issues we’re finding today. Those take significantly more time and effort to fix and so would be a lot more painful if we were to get many of those. I’m very unsure on this so we’ll just have to wait and see!

Where this leaves us

I feel Daniel Stenberg (curl maintainer) phrased it well in this post:

“Any project that has not scanned their source code with AI powered tooling will likely find huge number of flaws, bugs and possible vulnerabilities with this new generation of tools.”

This rings very true to me. More broadly, I think folks will have one of three experiences:

  1. Untrusted input + security critical (e.g. curl, kernel, OpenSSL): many complex reports, with a higher false positive rate than the other categories, because there’s a lot of attention on the project and much of the low hanging fruit would already have been picked in the critical codepaths. Though codepaths for lesser-used functionality (e.g. legacy drivers) could end up in category 2 instead.
  2. Untrusted input + not previously audited (e.g. trace processor): a wave of mechanical bugs at a manageable pace and low individual stress because the project is not on a security critical code path. This is where both Daniel and I expect AI security scanners to have the most impact.
  3. No untrusted input (internal tools, math libs, anything operating only on trusted data): you probably won’t notice this shift at all.

My own case sits squarely in that second bucket. But I don’t want to over-generalize from my experience, because there are three things that make this manageable for me that wouldn’t be true for everyone: a) I’m paid to maintain trace processor as part of my full time job; b) someone else is taking the effort to run the AI scans and discover the bugs in the first place; c) the reports appear to be lightly filtered by an upstream human reviewer, enough to strip obvious noise but probably not a deep triage.

To me, this points to a gap in the ecosystem: most open-source projects cannot afford to have a dedicated team doing security scanning for them, and telling a maintainer to stand up their own pipeline when their security risk is marginal will restrict this to only the most motivated projects. I would guess we’re going to see a lot more innovation in this space, including from the big AI labs.

All in all, I’m cautiously positive about my own experience: most of the bugs are mechanical, a few have nudged long-overdue design cleanups, and the pace is manageable. There’s plenty I don’t know about how this evolves: whether the pace holds, whether future models start finding harder design issues. So this should very much be treated as a snapshot, not a forecast!



  1. A common response I expect is: “if it’s parsing arbitrary binary data like traces, it should be written in Rust.” In a vacuum I agree, and if I were writing trace processor from scratch today, I would definitely use Rust. But switching to Rust is unfortunately quite impractical; the library is a significant amount of code and is embedded in hundreds of downstream tools, many in environments that don’t have a Rust toolchain. Asking all our embedders to start using Rust would be a significant burden and one I don’t want to impose. Not to mention that our team doesn’t actually have any Rust expertise, so reviewing this code at the standard I want from trace processor would be a significant productivity hit until folks got up to speed. And unlike others in the industry, I don’t feel comfortable just rewriting the whole project in one shot and calling it a day…

  2. The couple of remaining bugs were found after the v56.0 release was cut and are low-priority enough that it’s not worth rushing out a release for them. They have already been fixed on main and will ship in an upcoming point release.