Re: Use of additional index columns in rows filtering - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Use of additional index columns in rows filtering
Msg-id: CAH2-WzkaP=gQp9uaY4Xrv74WetTttcU+C-tWbtrQLm+=CaUDdg@mail.gmail.com
In response to: Re: Use of additional index columns in rows filtering (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: Use of additional index columns in rows filtering
List: pgsql-hackers
On Fri, Jul 21, 2023 at 4:52 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> > (Actually, I'm glossing over a lot. The MDAM paper describes
> > techniques that'd make even the really challenging cases safe, through
> > a process of converting quals from conjunctive normal form into
> > disjunctive normal form. This is more or less the form that the state
> > machine implemented by _bt_advance_array_keys() produces already,
> > today. But even with all this practically all of the heavy lifting
> > takes place before the index scan even begins, during preprocessing --
> > so you still require surprisingly few changes to index scans
> > themselves.)
>
> Ah, OK. I was assuming the execution might be more complex.

Sort of. Execution of individual "primitive index scans" effectively
works the same way as it does already -- the preprocessing is required
to generate disjunctive primitive index scans that look like one big
index scan when combined (which is how native execution of SAOPs by
nbtree works today).

The challenge during execution of index scans (execution proper, not
preprocessing) comes from processing a "flattened" DNF representation
of your original quals efficiently. If you have (say) 3 SAOPs, then
the total number of distinct DNF quals is the cartesian product of the
3 arrays -- which is multiplicative. But, you can skip over the
single-value DNF quals quickly when they have no matches. Which isn't
all that hard.

We get some very useful invariants with these DNF quals: you can have
at most one individual DNF qual as a match for any individual index
tuple. Plus the quals are materialized in key space order, which is
ideally suited to processing by an ordered scan. So just as you can
use the array keys to skip over parts of the index when searching for
an index tuple, you can use an index tuple to skip over the arrays
when searching for the next relevant set of array keys. It works both
ways!

> But I was
> thinking more about the costing part - if you convert the clauses in
> some way, does that affect the reliability of estimates somehow?

Obviously, it doesn't affect the selectivity at all. That seems most
important (you kinda said the same thing yourself).

> If the
> conversion from AND to OR makes the list of clauses more complex, that
> might be an issue ...

That's definitely a concern. Even still, the biggest problem by far in
this general area is selectivity estimation. Which, in a way, can be
made a lot easier by this general approach.

Let's say we have the tenk1 table, with the same composite index as in
my example upthread (on "(two,four,twenty)"). Further suppose you have
a very simple query: "select count(*) from tenk1". On master (and with
the patch) that's going to give you an index-only scan on the
composite index (assuming it's the only index), which is quite
efficient despite being a full index scan -- 11 buffer hits.

This much more complicated count(*) query is where it gets interesting:

select
  count(*), two, four, twenty
from
  tenk1_dyn_saop
where
  two in (0, 1)
  and four in (1, 2, 3, 4)
  and twenty in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
group by two, four, twenty
order by two, four, twenty;

It's inherently very difficult to predict how selective this query
will be using basic statistics. But maybe it doesn't need to matter so
much, so often. The patch can execute this with an index-only scan +
GroupAggregate. What ends up happening is that the patch gets 9 buffer
hits -- so pretty close to 11.
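In case anyone wants to reproduce this: the buffer hit counts I'm
quoting are the ones reported by EXPLAIN (ANALYZE, BUFFERS). A rough
sketch of an equivalent setup (treating tenk1_dyn_saop as just a copy
of the standard regression test table tenk1, with the composite index):

-- Sketch only: tenk1_dyn_saop as a copy of the regression test table
CREATE TABLE tenk1_dyn_saop AS SELECT * FROM tenk1;
CREATE INDEX ON tenk1_dyn_saop (two, four, twenty);
-- Set the visibility map, so the index-only scan stays index-only
VACUUM ANALYZE tenk1_dyn_saop;

-- Reports "Buffers: shared hit=N" for each plan node
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*), two, four, twenty
FROM tenk1_dyn_saop
WHERE two IN (0, 1)
  AND four IN (1, 2, 3, 4)
  AND twenty IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
GROUP BY two, four, twenty
ORDER BY two, four, twenty;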
Master can use almost the same query plan (it uses quals but needs to
hashagg + sort). It ends up getting 245 buffer hits -- vastly more
than what we see for the full index scan case (and nothing to do with
the sort/an issue with a limit). That's nearly as many hits as you'd
get with a sequential scan. (BTW, I don't need to coax the query
planner to get this result on master.)

With the patch you can vary the predicate in whatever way, so that the
selectivity shifts up or down. Occasionally you'll get maybe one extra
buffer access relative to the base full index scan case, but overall
the patch makes the worst case look very much like a full index scan
(plus some relatively tiny CPU overhead). This is just common sense,
in a way; selectivities are always between 0.0 and 1.0. Why shouldn't
we be able to think about it like that?

> I wasn't really thinking about LIMIT, and I don't think it changes the
> overall behavior very much (sure, it's damn difficult to estimate for
> skewed data sets, but meh).
>
> The case I had in mind is something like this:
>
> CREATE TABLE t (a int, b int, c int);
> CREATE INDEX ON t (a);
> INSERT INTO t SELECT i, i, i FROM generate_series(1,1000000) s(i);
>
> SELECT * FROM t WHERE bad_qual(a) AND b < 1 AND c < 1 ORDER BY a;
>
> where bad_qual is expensive and matches almost all rows.

You must distinguish between quals that can become required scan keys
(or can terminate the scan), and all other quals. This is really
important for my pending SAOP patch, but I think it might be important
here too. I wonder if the best place to address the possibility of
such a regression is in the index AM itself.

Let's make your example a bit more concrete: let's assume that
bad_qual is a very expensive integer comparison, against a column that
has only one possible value. So now your example becomes:

CREATE TABLE t (a expensive_int, b int, c int);
CREATE INDEX ON t (a);
INSERT INTO t SELECT 42, i, i FROM generate_series(1,1000000) s(i);

SELECT * FROM t WHERE a in (7, 42) AND b < 1 AND c < 1 ORDER BY a;

(I'm using a SAOP here because the planner will more or less disregard
the ORDER BY if I make it "= 42" instead. Maybe that makes it simpler.)

Sure, you're getting a full index scan here, and you get all these
useless comparisons on "a" -- that's a real risk. But AFAICT there is
no real need for it. There is another nbtree patch that might help --
a patch that teaches nbtree's _bt_readpage function to skip useless
comparisons like this:

https://postgr.es/m/079c3f8e-3371-abe2-e93c-fc8a0ae3f571@garret.ru

In order for this kind of optimization to be possible at all, we must
be able to reason about "a" as a column whose values will always be in
key space order. That is, nbtree must recognize that "a" is the most
significant key column, not (say) a low-order column from a composite
index -- it's a required column in both directions. If _bt_readpage
can determine that the first tuple on a leaf page has the value "42",
and the high key has that same value, then we can skip all of the
comparisons of "a" for that page, right away (in fact we don't require
any comparisons). Now it doesn't matter that they're super expensive
comparisons (or it hardly matters).

It's natural to think of things like this _bt_readpage optimization as
something that makes existing types of plan shapes run faster. But you
can also think of them as things that make new and fundamentally
better plan shapes feasible, by making risky things much less risky.
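As an aside, the plan-shape difference with "= 42" is easy to see for
yourself, even without a hypothetical expensive_int type. A rough
sketch using plain int -- only the plan shapes matter here, not the
comparison cost:

DROP TABLE IF EXISTS t;
CREATE TABLE t (a int, b int, c int);
CREATE INDEX ON t (a);
INSERT INTO t SELECT 42, i, i FROM generate_series(1,1000000) s(i);
VACUUM ANALYZE t;

-- SAOP qual: the ORDER BY keeps the scan tied to the index's sort order
EXPLAIN SELECT * FROM t WHERE a in (7, 42) AND b < 1 AND c < 1 ORDER BY a;

-- Simple equality: every match has a = 42, so the planner more or less
-- disregards the ORDER BY
EXPLAIN SELECT * FROM t WHERE a = 42 AND b < 1 AND c < 1 ORDER BY a;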
> FWIW the "ORDER BY" is necessary, because otherwise we may not even
> build the index path (no index keys, no interesting pathkeys). It's just
> an opportunistic optimization - if already doing index scan, try doing
> this too. I wonder if we should relax that ...

I'm kinda doing the same thing with ordering in my own patch. In
general, even if the query really doesn't care about the index order,
there may be a lot of value in making the nbtree code understand that
this is an ordered index scan. That's what enables skipping, in all
its forms (skipping individual comparisons, skipping whole subsections
of the index, etc). I'm not saying that this is 100% problem free. But
it seems like a promising high level direction.

> In a way, focusing on the worst case does that by assuming the worst
> combination - which is fine, although it may choose the slower (but
> safer) approach in some cases.

I don't think that it has to be slower on average (even by a tiny
bit). It might just end up being slightly faster on average, and way
faster on occasion.
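Coming back to the point about the "ORDER BY" being necessary to build
the index path at all: that part is easy enough to see with the same
table t from above -- a sketch (the exact plan choice will of course
depend on costing):

-- No qual on the indexed column "a", so no index keys; without the
-- ORDER BY there are no interesting pathkeys either, and the index
-- path that could evaluate the filter may never be built
EXPLAIN SELECT * FROM t WHERE b < 1;

-- With the ORDER BY, the pathkeys make the index path worth building
EXPLAIN SELECT * FROM t WHERE b < 1 ORDER BY a;

--
Peter Geoghegan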