Rendering 100k trace events faster with exponential search

We’ve recently been looking into optimizing the rendering performance of the Perfetto UI on large traces. We discovered an inefficiency in our data fetching logic which shows up especially when you’re very zoomed out.

In this case, there can be a lot of slices (spans) which are so small that they take up less than one pixel of width. So for each pixel, we need to figure out which event to draw. Over time we’ve come to the conclusion that the best thing to draw is the slice with the largest duration in that pixel.

We can break this into two sub-problems:

  1. What is the range of events which correspond to each pixel?
  2. What is the event with the maximum duration for that pixel?

We’re going to focus on 1) in this post as that’s where the slowdown was. 2) is fascinating but also surprisingly orthogonal. If you’re interested, I would suggest reading this excellent post from Tristan Hume explaining the basic algorithm we use.

So let’s formalize our setup for 1): we have a sorted array containing N (where N is O(100k)) int64 timestamps. The timestamps come directly from the Perfetto trace file so we cannot make any assumptions about their values or distribution. But generally speaking, traces tend to have a mix of dense clusters of events (i.e. lots of activity) and large gaps between these clusters (i.e. not much going on). So it makes sense to optimize for this sort of data pattern.

We then need to find the indexes corresponding to the “lower bound” of M regularly-spaced timestamps (A, A+S, A+2S, …, B), where each timestamp is a pixel boundary and S is the “time in the trace represented by one pixel”.

Data:      | | | · · | | | | | | | | | | · · · | | · · · · | | | | | | |
                         dense              sparse          dense

Queries:       ↓           ↓           ↓           ↓           ↓
               A          A+S        A+2S        A+3S          B
               │           │           │           │           │
               ▼           ▼           ▼           ▼           ▼
Result:      lb(A)      lb(A+S)    lb(A+2S)   lb(A+3S)      lb(B)

The obvious solution, and what we’ve been doing for the last few years, is to just binary search each of the M values independently, giving us O(M × log N). This is very fast when N is reasonable, but when N gets into O(100k) territory on high resolution screens (meaning M is also relatively large), you end up taking several milliseconds to do these queries.
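
In code, the old approach looks something like this (a simplified C++ sketch, not the actual Perfetto implementation; the function name and signature are mine):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // O(M × log N): one independent binary search per pixel boundary.
    std::vector<size_t> LowerBoundsPerPixel(const std::vector<int64_t>& ts,
                                            int64_t a, int64_t s, size_t m) {
      std::vector<size_t> result(m);
      for (size_t px = 0; px < m; ++px) {
        int64_t target = a + s * static_cast<int64_t>(px);
        // lower_bound: first element >= target, i.e. the first event
        // at or after this pixel boundary.
        auto it = std::lower_bound(ts.begin(), ts.end(), target);
        result[px] = static_cast<size_t>(it - ts.begin());
      }
      return result;
    }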

My first approach was to use “static search trees” to speed up the searches; ever since I came across this amazing post last year, I had been itching to find a place to use them. They worked great and definitely made things faster, but it was a lot of code, required hand-written AVX, and demanded a deep understanding of the technique, all of which I knew would be difficult to get through review. And I had this nagging feeling there was something much simpler I could do.

Eventually I had a key realization: I wasn’t using the fact that the queries themselves are sorted! By exploiting this, we can do much better and achieve close to O(M + log N) performance instead!

How? Using a cool technique called exponential search!

You start by binary searching for A just like before, giving you an index i. But for every timestamp after that, instead of binary searching, you do something which looks like a hybrid between linear search and binary search.

Found A at index i, now searching for A+S:

Index:   i     i+1   i+2   i+3   i+4   i+5   i+6   i+7   i+8
Data:   [A]   [  ]  [  ]  [  ]  [  ]  [A+S] [  ]  [  ]  [  ]
         ↑     ↑     ↑           ↑                       ↑
       start  +1    +2          +4                      +8
             <A+S  <A+S        <A+S                    >A+S
                                                     overshot!

                                └───────────────────────┘
                                    binary search here

Take A + S: at i+1, i+2, i+4, i+8, …, you check whether the timestamp at that position is greater than or equal to A + S. If not, you continue to the next probe. If it is, you stop and instead binary search between the previous position you checked and the current one.
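
Here is the full algorithm as a C++ sketch (again simplified, and not the exact code from the PR; the names and signature are mine):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Exploits the fact that the queries (A, A+S, A+2S, ...) are themselves
    // sorted: each search starts where the previous one ended and gallops
    // forward from there.
    std::vector<size_t> LowerBoundsExponential(const std::vector<int64_t>& ts,
                                               int64_t a, int64_t s, size_t m) {
      std::vector<size_t> result(m);
      if (m == 0) return result;

      const size_t n = ts.size();
      // Plain binary search for the very first query only.
      size_t i = static_cast<size_t>(
          std::lower_bound(ts.begin(), ts.end(), a) - ts.begin());
      result[0] = i;

      for (size_t px = 1; px < m; ++px) {
        int64_t target = a + s * static_cast<int64_t>(px);
        if (i >= n || ts[i] >= target) {
          // The previous answer already satisfies this query: the same
          // event straddles several pixel boundaries.
          result[px] = i;
          continue;
        }
        // Gallop: probe i+1, i+2, i+4, ... until we meet or pass the target.
        size_t step = 1;
        while (i + step < n && ts[i + step] < target) {
          step *= 2;
        }
        // The answer now lies in (i + step/2, i + step]; binary search
        // just that small range.
        size_t lo = i + step / 2 + 1;
        size_t hi = std::min(i + step + 1, n);
        i = static_cast<size_t>(
            std::lower_bound(ts.begin() + lo, ts.begin() + hi, target) -
            ts.begin());
        result[px] = i;
      }
      return result;
    }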

This is much faster than just binary searching, both theoretically and practically. Theoretically, notice how each search is confined to the gap between the previous answer and the current one, and we touch the memory in that gap at most twice: once during the exponential scan, once in the binary search afterwards. This is unlike the “binary search everything” approach, where the same piece of memory could be touched by all M searches if the data layout was pathological. This is where the O(M × log N) -> O(M + log N) improvement comes from.
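
To put rough numbers on it (illustrative, not measured): with N = 100,000 events and M = 2,000 pixel boundaries, independent binary searches cost about M × log2(N) ≈ 2,000 × 17 = 34,000 widely-scattered probes. With exponential search, each query costs roughly log2 of the gap it has to cover, which sums to at most M × log2(N/M) ≈ 2,000 × 6 = 12,000 probes, and most of those land close together in memory.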

Note that this argument also applies to static search trees, which have the same O(M × log N) complexity; exponential search turned out to beat them too!

Practically, the initial exponential scan is great for CPUs, which love scanning forward through memory instead of jumping around randomly like binary search does. And by drastically shrinking the range the final binary search has to cover, you significantly reduce the chance of cache misses.

A final micro-optimization: when the range you are binary searching over shrinks below 16 elements (a threshold we chose empirically), replace the binary search with a linear scan. At those sizes, the branching and looping of binary search is actually slower than brute-force checking everything (especially with the auto-vectorizer helping you out!).
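
As a sketch (the 16 comes from the empirical threshold above; tune it for your own data):

    #include <cstddef>
    #include <cstdint>

    // Below this many elements, a straight scan wins: the loop is
    // branch-predictable and easy for the compiler to vectorize.
    constexpr size_t kLinearScanThreshold = 16;

    // Drop-in replacement for the final binary search step: fall back to
    // a linear scan once the remaining range is small enough.
    size_t LowerBoundHybrid(const int64_t* ts, size_t lo, size_t hi,
                            int64_t target) {
      while (hi - lo > kLinearScanThreshold) {
        size_t mid = lo + (hi - lo) / 2;
        if (ts[mid] < target) {
          lo = mid + 1;
        } else {
          hi = mid;
        }
      }
      while (lo < hi && ts[lo] < target) {
        ++lo;
      }
      return lo;
    }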

We switched to exponential search in this Perfetto PR and in our microbenchmark (on real trace data!) we saw an improvement from 1.5ms to 180µs, a speedup of more than 8x! This, along with a bunch of other improvements we made, should help make the Perfetto UI faster and smoother on large traces!