Re: Improve EXPLAIN output for multicolumn B-Tree Index - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: Improve EXPLAIN output for multicolumn B-Tree Index
Date
Msg-id CAEze2WgooitYVHO7tdVnewLM=hu8t4Eah2WYyAs25t5KCf7-oQ@mail.gmail.com
Whole thread Raw
In response to Re: Improve EXPLAIN output for multicolumn B-Tree Index  (Jelte Fennema-Nio <postgres@jeltef.nl>)
Responses Re: Improve EXPLAIN output for multicolumn B-Tree Index
List pgsql-hackers
On Mon, 24 Jun 2024 at 11:58, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
>
> +1 for the idea.
>
> On Mon, 24 Jun 2024 at 11:11, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > I think this is too easy to confuse with the pre-existing 'Filter'
> > condition, which you'll find on indexes with INCLUDE-d columns or
> > filters on non-index columns.
>
> Why not combine them? And both call them Filter? In a sense this
> filtering acts very similar to INCLUDE based filtering (for btrees at
> least).

It does not really behave similar: index scan keys (such as the
id3=101 scankey) don't require visibility checks in the btree code,
while the Filter condition _does_ require a visibility check, and
delegates the check to the table AM if the scan isn't Index-Only, or
if the VM didn't show all-visible during the check.

Furthermore, the index could use the scankey to improve the number of
keys to scan using "skip scans"; by realising during a forward scan
that if you've reached tuple (1, 2, 3) and looking for (1, _, 1) you
can skip forward to (1, 3, _), rather than having to go through tuples
(1, 2, 4), (1, 2, 5), ... (1, 2, n). This is not possible for
INCLUDE-d columns, because their datatypes and structure are opaque to
the index AM; the AM cannot assume anything about or do anything with
those values.

> Although I might be wrong about that, because when I try to
> confirm the same perf using the following script I do get quite
> different timings (maybe you have an idea what's going on here). But
> even if it does mean something slightly different perf wise, I think
> using Filter for both is unlikely to confuse anyone.

I don't want A to to be the plan, while showing B' to the user, as the
performance picture for the two may be completely different. And, as I
mentioned upthread, the differences between AMs in the (lack of)
meaning in index column order also makes it quite wrong to generally
separate prefixes equalities from the rest of the keys.

> Since, while
> allowed, it seems extremely unlikely in practice that someone will use
> the same column as part of the indexed columns and as part of the
> INCLUDE-d columns (why would you store the same info twice).

Yeah, people don't generally include the same index column more than
once in the same index.

> CREATE INDEX test_idx_include ON test(id1, id2) INCLUDE (id3);
> CREATE INDEX test_idx ON test(id1, id2, id3);
>
>                                                               QUERY PLAN
> ───────────────────────────────────────
>  Index Only Scan using test_idx_include on public.test
[...]
> Time: 7.139 ms
>                                                           QUERY PLAN
> ─────────────────────────────────────
>  Index Only Scan using test_idx on public.test  (cost=0.42..2591.77
[...]
> Time: 2.645 ms

As you can see, there's a huge difference in performance. Putting both
non-bound and "normal" filter clauses in the same Filter clause will
make it more difficult to explain performance issues based on only the
explain output.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: long-standing data loss bug in initial sync of logical replication
Next
From: Stepan Neretin
Date:
Subject: Re: sql/json miscellaneous issue