From: Peter Geoghegan
Subject: MDAM techniques and Index Skip Scan patch
Msg-id: CAH2-WzmUscvoxVkokHxP=uPTDjSi0tJkFpUPD-CeA35dvn-CMw@mail.gmail.com
List: pgsql-hackers

I returned to the 1995 paper "Efficient Search of Multidimensional
B-Trees" [1] as part of the process of reviewing v39 of the skip scan
patch, which was posted back in May. It's a great paper, and anybody
involved in the skip scan effort should read it thoroughly (if they
haven't already). It's easy to see why people get excited about skip
scan [2]. But there is a bigger picture here.

I don't necessarily expect to come away from this discussion with a
much better high-level architecture for the patch, or any kind of
deeper insight, or even a frame of reference for further discussion.
I just think that we ought to *try* to impose some order on this
stuff.

Like many difficult patches, the skip scan patch is not so much
troubled by problems with the implementation as it is troubled by
*ambiguity* about the design -- particularly about how skip scan
meshes with existing designs, as well as with future designs for
other MDAM techniques. I've started this thread to have a big
picture conversation about how to think about these things. Many
other MDAM techniques also seem highly appealing. Much of the MDAM
stuff is aimed at data warehousing use cases, while skip scan/loose
index scan is seen as more of an OLTP thing. But they are clearly
still related.

I'd also like to talk about another patch, one that ISTM had that
same quality -- it was also held back by high-level design
uncertainty. Back in 2018, Tom abandoned a patch that transformed a
star-schema style query (left outer joins against dimension tables,
with OR conditions across them) into an equivalent query that UNIONs
together 2 distinct queries [3][4].
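
To see the shape of that transformation, here is a rough SQL sketch
(the schema and column names are invented for illustration; this is
not the exact rewrite that the 2018 patch produced):

SELECT f.*
FROM fact f
LEFT JOIN dim1 d1 ON f.d1_id = d1.id
LEFT JOIN dim2 d2 ON f.d2_id = d2.id
WHERE d1.x = 42 OR d2.y = 43;

-- becomes, roughly:

SELECT f.*
FROM fact f
LEFT JOIN dim1 d1 ON f.d1_id = d1.id
WHERE d1.x = 42
UNION
SELECT f.*
FROM fact f
LEFT JOIN dim2 d2 ON f.d2_id = d2.id
WHERE d2.y = 43;

(A plain UNION isn't quite equivalent when the fact table can
contain duplicate rows, which is why, as discussed below, the patch
deduplicated on TIDs rather than on output columns.)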

Believe it or not, I am now reminded of that patch by the example of
"IN() Lists", from page 5 of the paper. We see this example SQL query:

SELECT date, item_class, store, SUM(total_sales)
FROM sales
WHERE date BETWEEN '06/01/95' AND '06/30/95' AND
      item_class IN (20, 35, 50) AND
      store IN (200, 250)
GROUP BY dept, date, item_class, store;

Granted, this SQL might not seem directly relevant to Tom's patch at
first -- there is no join for the optimizer to even try to
eliminate, which was the whole basis of Jim Nasby's original
complaint, the complaint that spurred Tom to write the patch in the
first place. But hear me out: there is still a fact table (the sales
table), with some dimensions (the 'D' in 'MDAM') appearing in the
predicate. Moreover, the paper uses this table (and this SQL query)
to drive its discussion of an optimization that transforms a
predicate with many ORs (which the paper explicitly says is
logically/semantically equivalent to the IN() lists from the query).
They transform the query into a set of disjuncts that can easily be
executed independently, with the results combined at the end (see
also "General OR Optimization" on page 6 of the paper).
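
Written out as SQL, the transformed access pattern looks something
like this -- though to be clear, the paper applies the
transformation inside the access method, not through query rewrite
(this is just my illustration):

SELECT date, item_class, store, total_sales
FROM sales
WHERE dept = 1 AND date = '06/04/95' AND item_class = 20 AND store = 200
UNION ALL
SELECT date, item_class, store, total_sales
FROM sales
WHERE dept = 1 AND date = '06/04/95' AND item_class = 20 AND store = 250
UNION ALL
-- ...one branch per remaining combination of dept, date,
-- item_class, and store...

UNION ALL suffices here (no deduplication step at all) precisely
because the disjuncts are constructed to never overlap.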

Also...I'm not entirely sure that the intended underlying "physical
plan" is truly free of join-like scans. If you squint just right, you
might see something that you could think of as a "physical join" (at
least very informally). The whole point of this particular "IN()
Lists" example is that we get to the following, for each distinct
"dept" and "date" in the table:

dept=1, date='06/04/95', item_class=20, store=200
dept=1, date='06/04/95', item_class=20, store=250
dept=1, date='06/04/95', item_class=35, store=200
dept=1, date='06/04/95', item_class=35, store=250
dept=1, date='06/04/95', item_class=50, store=200
dept=1, date='06/04/95', item_class=50, store=250

There are 2400 such accesses in total after transformation --
imagine additional lines like these for every distinct combination
of dept and date (only for those dates that actually had sales,
which they enumerate up-front), crossed with stores 200 and 250 and
item_class values 20, 35, and 50. Since each qualifying (dept, date)
pair expands to 3 item_class values * 2 stores = 6 accesses, 2400
accesses implies 400 such (dept, date) combinations. Even 2400 index
probes will be much faster than a full table scan, given that this
is a large fact table. The "sales" table is stored as a clustered
index whose keys are the columns "(dept, date, item_class, store)",
per the note at the top of page 4. The whole point is to avoid
having any secondary indexes on this fact table, without having to
settle for a full scan. We can just probe the primary key 2400 times
instead, following this transformation. No need for secondary
indexes.

The plan can be thought of as a DAG, at least informally. This is
also somewhat similar to what Tom was thinking about back in 2018.
Tom had to deduplicate rows during execution (IIRC using an ad-hoc
UNION-style approach that sorted on TIDs), whereas I think that the
paper's authors can get away with skipping that extra step. Page 7
says "MDAM removes duplicates before reading the data, so it does
not have to do any post read operations to accomplish duplicate
elimination (a common problem with OR optimization)".
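
To give a concrete SQL rendering of what "removes duplicates before
reading the data" means (my example, not the paper's): overlapping
disjuncts get merged into disjoint intervals up front, so that no
row can ever qualify under more than one disjunct. For example:

-- As written, a naive OR expansion would read rows with store
-- between 210 and 240 twice:
WHERE store BETWEEN 200 AND 240 OR store BETWEEN 210 AND 250

-- Interval merging produces a single disjoint range before any
-- data is read:
WHERE store BETWEEN 200 AND 250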

My general concern is that the skip scan patch may currently be
structured in a way that paints us into a corner, MDAM-wise.

Note that the MDAM paper treats skipping a prefix of columns as a
case where the prefix is handled by pretending that there is a
clause that looks like this: "WHERE dept BETWEEN -inf AND +inf" --
which is not so different from the real BETWEEN clause on "date" in
the original sales SQL query example that I have highlighted. We
don't tend to think of queries like this (like my sales query) as
being in any way related to skip scan, because we don't imagine that
there is any skipping going on. But maybe we should recognize the
similarities.
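
In other words (my rendering, using an imaginary -inf/+inf literal
syntax), MDAM conceptually treats the sales query as if it had been
written like this:

SELECT date, item_class, store, SUM(total_sales)
FROM sales
WHERE dept BETWEEN -inf AND +inf AND -- imaginary prefix clause
      date BETWEEN '06/01/95' AND '06/30/95' AND
      item_class IN (20, 35, 50) AND
      store IN (200, 250)
GROUP BY dept, date, item_class, store;

The skipped leading column then gets the same general treatment as
any real range predicate: enumerate the distinct values that
actually appear, and probe within each one.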

BTW, these imaginary -inf/+inf values seem to me to be just like the
sentinel values already used inside nbtree, for pivot tuples -- they
have explicit -inf values for truncated suffix key columns, and you
can think of a rightmost page as having a +inf high key, per the L&Y
paper. Wearing my B-Tree hat, I don't see much difference between
imaginary -inf/+inf values, and values from the BETWEEN "date" range
from the example SQL query. I have in the past wondered if
_bt_get_endpoint() should have been implemented that way -- we could
go through _bt_search() instead, and get rid of that code. All we need
is insertion scan keys that can explicitly contain the same -inf/+inf
sentinel values. Maybe this also allows us to get rid of
BTScanInsertData.nextkey semantics (not sure offhand).

Another more concrete concern about the patch series comes from the
backwards scan stuff, which is added by a later patch in the series,
"v39-0004-Extend-amskip-implementation-for-Btree.patch". It strikes
me as a bad thing that we cannot just do leaf-page-at-a-time
processing, without needing to hold a pin on the leaf page in most
cases. After all, ordinary backwards scans manage to avoid that
today, albeit by using trickery inside _bt_walk_left(). MDAM-style
"Maintenance of Index Order" (as described on page 8) seems like a
good goal for us here. I don't like the idea of doing ad-hoc
duplicate TID elimination inside nbtree, across calls made from the
executor (whether it's during backwards skip scans, or at any other
time). Not because it seems to go against the approach taken by the
MDAM paper (though it does); just because it seems kludgy. (I think
that Tom felt the same way about the TID deduplication stuff in his
own patch back in 2018, too.)

Open question: What does all of this MDAM business mean for
ScalarArrayOpExpr, if anything?
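
For reference, an IN() list like the ones in the sales query reaches
the planner as a ScalarArrayOpExpr; writing the qual as an array
comparison makes that explicit:

SELECT date, item_class, store, total_sales
FROM sales
WHERE store = ANY (ARRAY[200, 250]); -- same ScalarArrayOpExpr as IN()

nbtree already has native support for these (it performs one
primitive index scan per array element), which seems at least
adjacent to the paper's "IN() Lists" technique.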

I freely admit that I could easily be worrying over nothing here. But
if I am, I'd really like to know *why* that's the case.

[1] http://vldb.org/conf/1995/P710.PDF
[2] https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
[3] https://www.postgresql.org/message-id/flat/7f70bd5a-5d16-e05c-f0b4-2fdfc8873489%40BlueTreble.com
[4] https://www.postgresql.org/message-id/flat/14593.1517581614%40sss.pgh.pa.us#caf373b36332f25acb7673bff331c95e
--
Peter Geoghegan


