Re: index prefetching - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: index prefetching |
Date | |
Msg-id | f2m3vqsnbnnmybnn3vyjy3t3t3ayw5yyatcbohs3adlzfxzsy2@wt32h37uyvpf Whole thread Raw |
In response to | Re: index prefetching (Tomas Vondra <tomas@vondra.me>) |
Responses |
Re: index prefetching
|
List | pgsql-hackers |
Hi, On 2025-07-18 19:44:51 +0200, Tomas Vondra wrote: > I agree tableam needs to have a say in this, so that it can interpret > the TIDs in a way that fits how it actually stores data. But I'm not > sure it should be responsible for calling index_batch_getnext(). Isn't > the batching mostly an "implementation" detail of the index AM? That's > how I was thinking about it, at least. I don't agree with that. For efficiency reasons alone table AMs should get a whole batch of TIDs at once. If you have an ordered indexscan that returns TIDs that are correlated with the table, we waste *tremendous* amount of cycles right now. Instead of locking the page, doing a HOT search for every tuple, and then unlocking the page, we lock and unlock the page for every single TID. The locking alone is a significant overhead (it's like 25% of the cycles or so), but what's worse, it reduces what out-of-order execution can do to hide cache-misses. Even leaving locking overhead and out-of-order execution aside, there's a good bit of constant overhead work in heap_hot_search_buffer() that can be avoided by doing the work all at once. Just to show how big that effect is, I hacked up a patch that holds the buffer lock from when the buffer is first pinned in heapam_index_fetch_tuple() until another buffer is pinned, or until the scan ends. That's totally not a valid change due to holding the lock for far too long, but it's a decent approximation of the gain of reducing the locking. This query SELECT * FROM lineitem ORDER BY l_orderkey OFFSET 10000000 LIMIT 1; speeds up by 28%. Of course that's an extreme case, but still. That likely undersells the gain, because the out-of-order benefits aren't really there due to all the other code that runs inbetween two heap_hot_search_buffer() calls. It obviously also doesn't show any of the amortization benefits. IMO the flow really should be something like this: IndexScan executor node -> table "index" scan using the passed in IndexScanDesc -> read stream doing readahead for all the required heap blocks -> table AM next page callback -> index scans returning batches I think the way that IndexOnlyScan works today (independent of this patch) really is a layering violation. It "knows" about the way the visibilitymap, which it really has no business accessing, that's a heap specific thing. It also knows too much about different formats that can be stored by indexes, but that's kind of a separate issue. Greetings, Andres Freund
pgsql-hackers by date: