From: Tomas Vondra
Subject: PATCH: AM-specific statistics, with an example implementation for BRIN (WIP)
Msg-id: 7cfb21ed-ddd3-821b-a852-75fe42c64d49@enterprisedb.com
Hi,

A couple of days ago I posted a WIP patch [1] implementing "BRIN Sort",
i.e. a node producing sorted output for BRIN minmax indexes. One of the
challenges I mentioned in that thread is costing - that's not specific
to the proposed node, it's an issue affecting BRIN in general, because
block-range indexes are very different from regular indexes with
explicit tuple pointers.

I mentioned I have some ideas on how to improve this, and that I'd
start a separate thread to discuss them. So here we go ...


The traditional estimation problem is roughly this:

    Given a condition, how many rows will match it?

That is, given a table with X rows, we need to estimate how many rows
will match a WHERE condition, for example. And once we have the row
estimate, we can estimate the amount of I/O, cost for sorting, etc.
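
For illustration (assuming a simple table t with an integer column a),
the rows=... estimate shown by EXPLAIN is exactly this kind of guess:

  -- the planner estimates how many rows satisfy (a < 1000) from the
  -- statistics gathered by ANALYZE
  explain select * from t where a < 1000;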

We have built a fairly solid capability to calculate these estimates,
using per-column statistics, extended statistics, etc. The calculated
estimates are not always perfect, but in general this works well.

This affects all path types roughly equally, although some paths are
more sensitive to poor estimates than others (e.g. runtime may grow
exponentially with increasing rowcount).


BRIN indexes however add another layer to this - once we have estimated
the number of rows, we need to estimate the number of page ranges this
maps to. You may estimate the WHERE condition to match 1000 rows, but
then you need to decide if that's 1 page range, 1000 page ranges or
possibly even all the page ranges in the table.

It all depends on how "correlated" the data is with physical position in
the table. If you have perfectly correlated data, it may be enough to
scan a single page. If it's random, you may need to scan everything.

The existing costing uses the column correlation statistics, but sadly
that's rather insensitive to outlier values. If you have a sequential
table, and then set 1% of data to min/max (making the ranges very wide),
the correlation will remain very close to 1.0, but you'll have to scan
all the ranges (and the costing won't reflect that).
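
Here's a quick way to reproduce that (a sketch, with a made-up table
name): mostly sequential data, with ~1% of the rows pushed to extreme
values, which makes every page range very wide:

  create table seq (a int);

  insert into seq
  select (case when mod(i, 200) = 0 then 2000000000
               when mod(i, 200) = 1 then 0
               else i end)
    from generate_series(1, 100000) s(i);

  analyze seq;

  -- stays close to 1.0, even though the extreme values widen every
  -- BRIN range so much that most queries have to scan all of them
  select correlation from pg_stats
   where tablename = 'seq' and attname = 'a';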

The "BRIN sort" patch needs to estimate a different thing - given a page
range, how many other page ranges overlap with it? This is roughly the
amount of stuff we'll need to scan and sort in order to produce the
first row.

These are all things we can't currently estimate - we have some rough
heuristics, but it's pretty easy to confuse those.

Therefore, I propose to calculate a couple of new statistics for BRIN
indexes (assuming minmax indexes, unless mentioned otherwise):


1) average number of overlapping ranges
---------------------------------------

Given a range, how many other ranges does it overlap with? In a
perfectly sequential table this will be 0, so if you have a value you
know it'll match just one range. In a random table, it'll be pretty
close to the total number of page ranges.

This can be calculated by simply walking the ranges, sorted by minval
(see brin_minmax_count_overlaps).
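
In SQL terms, over a hypothetical ranges(blkno, minval, maxval) table
with one row per summarized page range, the statistic could be sketched
like this (brin_minmax_count_overlaps computes the same thing more
efficiently, by walking the ranges sorted by minval instead of doing a
quadratic self-join):

  -- for each range, count the other ranges whose [minval, maxval]
  -- interval intersects it, then average that over all ranges
  select avg(n_overlaps) as avg_overlaps
    from (select count(r2.blkno) as n_overlaps
            from ranges r1
            left join ranges r2
              on r2.blkno <> r1.blkno
             and r2.minval <= r1.maxval
             and r2.maxval >= r1.minval
           group by r1.blkno) s;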


2) average number of matching ranges for a value
------------------------------------------------

Given a value, how many ranges does it match? This can be calculated by
matching sampled rows to ranges (brin_minmax_match_tuples_to_ranges).
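
Using the same hypothetical ranges table, plus a sample(a) table
holding the sampled values, a sketch of the calculation:

  -- for each sampled value, count the ranges whose [minval, maxval]
  -- interval contains it, then average that over the whole sample
  select avg(n_matches) as avg_matches
    from (select (select count(*)
                    from ranges r
                   where s.a between r.minval and r.maxval) as n_matches
            from sample s) m;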

For minmax indexes this is somewhat complementary to the average number
of overlaps; the relationship is roughly this:

  avg(# of matching ranges) = 1 + avg(# of overlapping ranges)/2

The intuition is that a value always hits the one range it falls into
(hence the 1), and if that range is randomly overlapped by other
ranges, the value is likely to hit about half of those too.

The reason why we want to calculate both (1) and (2) is that for other
opclasses the relationship is not that simple. For bloom opclasses we
probably can't calculate overlaps at all (or at least not that easily),
so the average number of matches is all we have. For minmax-multi, the
overlaps will probably use only the min/max values, ignoring the "gaps",
but the matches should use the gaps.


3) a bunch of other simple statistics
-------------------------------------

These are the numbers of summarized / not-summarized ranges, and of
all_nulls and has_nulls ranges, which are useful to estimate IS NULL
conditions etc.
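
For example, only the all_nulls and has_nulls ranges can possibly
satisfy IS NULL (and not-summarized ranges have to be scanned anyway),
so one plausible sketch of the fraction of ranges to scan, with made-up
counter names, is:

  (n_all_nulls + n_has_nulls + n_not_summarized) / n_ranges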


The attached patch implements a PoC of this. There's a new GUC
(enable_indexam_stats) that can be used to enable/disable this (both
the ANALYZE and the costing part). By default it's "off", so make sure
to do

  SET enable_indexam_stats = true;

The statistics are stored in the pg_statistic catalog, in a new
staindexam column (of type bytea). The opclasses can implement a new
support procedure, similar to what we do for opclass options. There are
a couple of wrinkles (should be explained in XXX comments), but in
principle this works.
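
To check whether the statistics were actually built for an index,
something like this should work (a sketch; the exact catalog layout
follows the WIP patch and may change):

  -- t_a_idx is the BRIN index from the example further below
  select staattnum, staindexam is not null as has_am_stats
    from pg_statistic
   where starelid = 't_a_idx'::regclass;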

The brin_minmax_stats procedure implements this for minmax opclasses,
calculating the statistics mentioned above. I've been experimenting
with different ways to calculate some of them, and ANALYZE prints info
about the calculated values and timings (this can be disabled by
removing the STATS_CROSS_CHECK define).

Finally, brincostestimate() loads the statistics and uses them for
costing. At the moment it uses only the average number of overlaps.

Trivial example:

  create table t (a int) with (fillfactor = 10);

  insert into t
  select (case when mod(i,22) = 0 then 100000000
               when mod(i,22) = 1 then 0
               else i end)
    from generate_series(1,300000) s(i);

  create index on t using brin (a) with (pages_per_range = 1);

The table fits 22 rows per page, and the data is mostly sequential,
except that every page has both 0 and 100000000. The correlation however
remains fairly high:

  # select correlation from pg_stats where tablename = 't';
   correlation
  -------------
     0.8303595
  (1 row)

Now, let's do a simple query:

# explain (analyze, buffers, timing off) select * from t where a = 500;

                              QUERY PLAN
------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=154.00..254.92 rows=2 width=4)
                        (actual rows=1 loops=1)
   Recheck Cond: (a = 500)
   Rows Removed by Index Recheck: 299999
   Heap Blocks: lossy=13637
   Buffers: shared hit=13695
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..154.00 rows=26 width=0)
                                     (actual rows=136370 loops=1)
         Index Cond: (a = 500)
         Buffers: shared hit=58
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.173 ms
 Execution Time: 101.972 ms
(12 rows)

That's pretty poor, because brincostestimate() still thinks it'll be
enough to read one or two page ranges (because 1/0.83 = ~1.2).

Now, with the extra statistics:

SET enable_indexam_stats = true;
ANALYZE t;

# explain (analyze, buffers, timing off) select * from t where a = 500;

                               QUERY PLAN
----------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=157.41..17544.41 rows=2 width=4)
                        (actual rows=1 loops=1)
   Recheck Cond: (a = 500)
   Rows Removed by Index Recheck: 299999
   Heap Blocks: lossy=13637
   Buffers: shared hit=13695
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..157.41 rows=300000
                               width=0) (actual rows=136370 loops=1)
         Index Cond: (a = 500)
         Buffers: shared hit=58
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.230 ms
 Execution Time: 104.603 ms
(12 rows)

So in this case we realize we actually have to scan the whole table, all
~13637 ranges, and the cost reflects that.

Feel free to experiment with other data sets.


regards

[1]
https://www.postgresql.org/message-id/e70fa091-e338-1598-9de4-6d0ef6b693e2%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company