Re: Adding skip scan (including MDAM style range skip scan) to nbtree - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Adding skip scan (including MDAM style range skip scan) to nbtree |
Date | |
Msg-id | CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com Whole thread Raw |
In response to | Re: Adding skip scan (including MDAM style range skip scan) to nbtree (Peter Geoghegan <pg@bowt.ie>) |
List | pgsql-hackers |
On Wed, Jul 24, 2024 at 5:14 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v4 Attached is v5, which splits the code from v4 patch into 2 pieces -- it becomes 0002-* and 0003-*. Certain refactoring work now appears under its own separate patch/commit -- see 0002-* (nothing new here, except the commit message/patch structure). The patch that actually adds skip scan (0003-* in this new version) has been further polished, though not in a way that I think is interesting enough to go into here. The interesting and notable change for v5 is the addition of the code in 0001-*. The new 0001-* patch is concerned with certain aspects of how _bt_advance_array_keys decides whether to start another primitive index scan (or to stick with the ongoing one for one more leaf page instead). This is a behavioral change, albeit a subtle one. It's also kinda independent of skip scan (more on why that is at the end). It's easiest to explain why 0001-* matters by way of an example. My example will show significantly more internal/root page accesses than seen on master, though only when 0002-* and 0003-* are applied, and 0001-* is omitted. When all 3 v5 patches are applied together, the total number of index pages accessed by the test query will match the master branch. It's important that skip scan never loses by much to the master branch, of course. Even when the details of the index/scan are inconvenient to the implementation, in whatever way. Setup: create table demo (int4 a, numeric b); create index demo_idx on demo (a, b); insert into demo select a, random() from generate_series(1, 10000) a, generate_series(1,5) five_rows_per_a_val; vacuum demo; We now have a btree index "demo_idx", which has two levels (a root page plus a leaf level). The root page contains several hundred pivot tuples, all of which have their "b" value truncated away (or have the value -inf, if you prefer), with just one prefix "a" column left in place. Naturally, every leaf page has a high key with its own separator key that matches one particular tuple that appears in the root page (except for the rightmost leaf page). So our leaf level scan will see lots of truncated leaf page high keys (all matching a corresponding root page tuple). Test query: select a from demo where b > 0.99; This is a query that really shouldn't be doing any skipping at all. We nevertheless still see a huge amount of skipping with this query, ocne 0001-* is omitted. Prior to 0001-*, a new primitive index scan is started whenever the scan reaches a "boundary" between adjoining leaf pages. That is, whenever _bt_advance_array_keys stopped on a high key pstate.finaltup. So without the new 0001-* work, the number of page accesses almost doubles (because we access the root page once per leaf page accessed, instead of just accessing it once for the whole scan). What skip scan should have been doing all along (and will do now) is to step forward to the next right sibling leaf page whenever it reaches a boundary between leaf pages. This should happen again and again, without our ever choosing to start a new primitive index scan instead (it shouldn't happen even once with this query). In other words, we ought to behave just like a full index scan would behave with this query -- which is exactly what we get on master. The scan will still nominally "use skip scan" even with this fix in place, but in practice, for this particular query/index, the scan won't ever actually decide to skip. So it at least "looks like" an index scan from the point of view of EXPLAIN (ANALYZE, BUFFERS). There is a separate question of how many CPU cycles we use to do all this, but for now my focus is on total pages accessed by the patch versus on master, especially for adversarial cases such as this. It should be noted that the skip scan patch never had any problems with this very similar query (same table as before): select a from demo where b < 0.01; The fact that we did the wrong thing for the first query, but the right thing for this second similar query, was solely due to certain accidental implementation details -- it had nothing to do with the fundamentals of the problem. You might even say that 0001-* makes the original "b > 0.99" case behave in the same manner as this similar "b < 0.01" case, which is justifiable on consistency grounds. Why wouldn't these two cases behave similarly? It's only logical. The underlying problem arguably has little to do with skip scan; whether we use a real SAOP array on "a" or a consed up skip array is incidental to the problem that my example highlights. As always, the underlying "array type" (skip vs SOAP) only matters to the lowest level code. And so technically, this is an existing issue on HEAD/master. You can see that for yourself by making the problematic query's qual "where a = any ('every possible a value') and b > 0.99" -- same problem on Postgres 17, without involving skip scan. To be sure, the underlying problem does become more practically relevant with the invention of skip arrays for skip scan, but 0001-* can still be treated as independent work. It can be committed well ahead of the other stuff IMV. The same is likely also true of the refactoring now done in 0002-* -- it does refactoring that makes sense, even without skip scan. And so I don't expect it to take all that long for it to be committable. -- Peter Geoghegan
Attachment
pgsql-hackers by date: