Re: Optimizer misses big in 10.4 with BRIN index - Mailing list pgsql-hackers

From: Arcadiy Ivanov
Subject: Re: Optimizer misses big in 10.4 with BRIN index
Msg-id: af24721c-8ad4-8edd-fd25-24818684b669@gmail.com
In response to: Re: Optimizer misses big in 10.4 with BRIN index (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
On 07/26/2018 07:27 AM, Tomas Vondra wrote:
> Arcadiy, can you provide plans with parallel query disabled? Or even
> better, produce a test case that reproduces this (using synthetic
> data, anonymized data or something like that, if needed).
I basically spent most of my time trying to create a reproducible case,
and I have to say I failed: I can reproduce this with a specific large
data set with a specific data distribution, but not with an artificial
one (a synthetic attempt is sketched below for reference). Unfortunately
the data is restricted, so I have to obfuscate the data structures. That
said, I am able to reproduce this sporadically on a local machine after
exporting from RDS 10.4; on some days it reproduces every time, on
others it's hard to reproduce at all. So here it goes, in sequential
order:
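
For reference, a minimal sketch of the sort of synthetic setup that
fails to reproduce the problem for me; all names and the generated
distribution are illustrative:

-- Same shape as the real table; names and values are placeholders.
CREATE TABLE schema0.table0_synth (
    field1 varchar(128) NOT NULL,
    field2 varchar(128) NOT NULL,
    field3 varchar(128) NOT NULL,
    field4 bigint NOT NULL,
    field5 bigint NOT NULL,
    field6 bigint NOT NULL,
    data   jsonb  NOT NULL
) WITH (fillfactor = 75);

-- Mostly-increasing epoch dates in the jsonb payload, roughly mimicking
-- the distribution shown below (~60 days across ~4.2M rows).
INSERT INTO schema0.table0_synth
SELECT md5(i::text), md5(i::text), md5(i::text), i, i, i,
       jsonb_build_object('date', 1527580800 + (i / 70000) * 86400)
FROM generate_series(1, 4195522) AS i;

CREATE INDEX date_synth_idx ON schema0.table0_synth
    USING brin (((data ->> 'date')::numeric)) WITH (autosummarize='true');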



** Data is imported into the database
======================
db0=# \d+ schema0.table0
                                          Table "schema0.table0"
   Column   |          Type          | Collation | Nullable | Default | Storage  | Stats target | Description
------------+------------------------+-----------+----------+---------+----------+--------------+-------------
 field1     | character varying(128) |           | not null |         | extended |              |
 field2     | character varying(128) |           | not null |         | extended |              |
 field3     | character varying(128) |           | not null |         | extended |              |
 field4     | bigint                 |           | not null |         | plain    |              |
 field5     | bigint                 |           | not null |         | plain    |              |
 field6     | bigint                 |           | not null |         | plain    |              |
 data       | jsonb                  |           | not null |         | extended |              |
Indexes:
    "date_idx" brin (((data ->> 'date'::text)::numeric)) WITH (autosummarize='true')
Options: autovacuum_vacuum_scale_factor=0.0, autovacuum_vacuum_threshold=5000,
         autovacuum_analyze_scale_factor=0.0, autovacuum_analyze_threshold=5000, fillfactor=75


** Data distribution in the table
======================
db0=# SELECT (data->>'date')::numeric as date, count(1)
      FROM schema0.table0 GROUP BY date ORDER BY date;
     date    |  count
------------+---------
  1527580800 |       5
  1527753600 |      72
  1527840000 |      36
  1528012800 |      18
  1528099200 |      72
  1528185600 |    7266
  1528272000 |     336
  1528358400 |     230
  1528444800 |      90
  1528704000 |      76
  1528790400 |       4
  1528876800 |    9060
  1528934400 |       6
  1528963200 |      33
  1529193600 |       6
  1529280000 |      18
  1529294400 |  541344
  1529380800 | 1113121
  1529467200 |  794082
  1529553600 |  604752
  1529812800 |  135252
  1529899200 |   63222
  1529985600 |   31134
  1530072000 |   25392
  1530158400 |      90
  1530417600 |      48
  1530504000 |     144
  1530518400 |       1
  1530590400 |    1958
  1530604800 |       1
  1530763200 |   48614
  1530849600 |       9
  1531022400 |       3
  1531108800 |       3
  1531195200 |       1
*** This is the range we're interested in
  1531281600 |    3713
*******
  1531627200 |     300
  1531713600 |    1113
  1531800000 |      30
  1531886400 |   36000
  1532232000 |   24900
  1532577600 |  409364
  1532664000 |    2050
  1532836800 |  334782
  1532923200 |    6771
(45 rows)

** No ANALYZE has been run after the import. Let's see how we do.
=======================
db0=# EXPLAIN (analyze, verbose, costs, buffers)
      SELECT ctid, (data->>'date')::numeric as date
      FROM schema0.table0
      WHERE (data->>'date')::numeric >= '1531267200'
        AND (data->>'date')::numeric <= '1531353600'
      ORDER BY ctid, (data->>'date')::numeric DESC
      LIMIT 1000;
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1102962.86..1103079.54 rows=1000 width=38) (actual time=7234.091..7234.697 rows=1000 loops=1)
   Output: ctid, (((data ->> 'date'::text))::numeric)
   Buffers: shared hit=9 read=350293
   ->  Gather Merge  (cost=1102962.86..1105002.57 rows=17482 width=38) (actual time=7234.089..7234.589 rows=1000 loops=1)
         Output: ctid, (((data ->> 'date'::text))::numeric)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=9 read=350293
         ->  Sort  (cost=1101962.84..1101984.69 rows=8741 width=38) (actual time=7227.938..7227.985 rows=964 loops=3)
               Output: ctid, (((data ->> 'date'::text))::numeric)
               Sort Key: table0.ctid, (((table0.data ->> 'date'::text))::numeric) DESC
               Sort Method: quicksort  Memory: 102kB
               Buffers: shared hit=144 read=1048851
               Worker 0: actual time=7223.770..7223.820 rows=1272 loops=1
                 Buffers: shared hit=68 read=348293
               Worker 1: actual time=7226.520..7226.589 rows=1299 loops=1
                 Buffers: shared hit=67 read=350265
               ->  Parallel Seq Scan on schema0.table0  (cost=0.00..1101390.58 rows=8741 width=38) (actual time=68.862..7227.366 rows=1238 loops=3)
                     Output: ctid, ((data ->> 'date'::text))::numeric
                     Filter: ((((table0.data ->> 'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->> 'date'::text))::numeric <= '1531353600'::numeric))
                     Rows Removed by Filter: 1397270
                     Buffers: shared hit=32 read=1048849
                     Worker 0: actual time=50.494..7223.107 rows=1272 loops=1
                       Buffers: shared hit=11 read=348293
                     Worker 1: actual time=53.408..7225.849 rows=1299 loops=1
                       Buffers: shared hit=12 read=350263
 Planning time: 2.096 ms
 Execution time: 7235.867 ms
(28 rows)
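
(Per Tomas's question about plans with parallel query disabled: I
haven't captured one here, but for reference it can be forced for a
session with the standard GUC.)

-- Disable parallel query so EXPLAIN produces a serial plan.
SET max_parallel_workers_per_gather = 0;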

** Ok: with no statistics, a Parallel Seq Scan is selected, which is expected.
=======================
db0=# VACUUM ANALYZE VERBOSE schema0.table0;
INFO:  vacuuming "schema0.table0"
INFO:  index "date_idx" now contains 8194 row versions in 45 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "table0": found 0 removable, 4195522 nonremovable row versions in
1048881 out of 1048881 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 1243
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 2.36 s, system: 4.84 s, elapsed: 7.48 s.
INFO:  vacuuming "pg_toast.pg_toast_7888880"
INFO:  index "pg_toast_7888880_index" now contains 0 row versions in 1 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "pg_toast_7888880": found 0 removable, 0 nonremovable row
versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 1243
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  analyzing "schema0.table0"
INFO:  "table0": scanned 30000 of 1048881 pages, containing 120000 live
rows and 0 dead rows; 30000 rows in sample, 4195524 estimated total rows
VACUUM
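
A quick cross-check on those 8194 index entries: with the default
pages_per_range of 128, that is exactly the number of complete ranges in
the table (the trailing partial range apparently remains unsummarized):

-- floor(1048881 heap pages / 128 pages per range) = 8194 complete ranges,
-- matching the "8194 row versions" VACUUM reports for date_idx.
SELECT floor(1048881 / 128.0) AS complete_ranges;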

** Statistics are collected. Let's explain again
=======================
db0=# EXPLAIN (analyze, verbose, costs, buffers)
      SELECT ctid, (data->>'date')::numeric as date
      FROM schema0.table0
      WHERE (data->>'date')::numeric >= '1531267200'
        AND (data->>'date')::numeric <= '1531353600'
      ORDER BY ctid, (data->>'date')::numeric DESC
      LIMIT 1000;
                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2393.45..2394.66 rows=482 width=38) (actual time=7498.674..7498.791 rows=1000 loops=1)
   Output: ctid, (((data ->> 'date'::text))::numeric)
   Buffers: shared hit=74 read=411764
   ->  Sort  (cost=2393.45..2394.66 rows=482 width=38) (actual time=7498.672..7498.731 rows=1000 loops=1)
         Output: ctid, (((data ->> 'date'::text))::numeric)
         Sort Key: table0.ctid, (((table0.data ->> 'date'::text))::numeric) DESC
         Sort Method: top-N heapsort  Memory: 95kB
         Buffers: shared hit=74 read=411764
         ->  Bitmap Heap Scan on schema0.table0  (cost=162.16..2371.97 rows=482 width=38) (actual time=58.322..7497.656 rows=3713 loops=1)
               Output: ctid, ((data ->> 'date'::text))::numeric
               Recheck Cond: ((((table0.data ->> 'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->> 'date'::text))::numeric <= '1531353600'::numeric))
               Rows Removed by Index Recheck: 1643390
               Heap Blocks: lossy=411776
               Buffers: shared hit=74 read=411764
               ->  Bitmap Index Scan on date_idx  (cost=0.00..162.03 rows=557 width=0) (actual time=40.243..40.243 rows=4117760 loops=1)
                     Index Cond: ((((table0.data ->> 'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->> 'date'::text))::numeric <= '1531353600'::numeric))
                     Buffers: shared hit=62
 Planning time: 1.388 ms
 Execution time: 7498.907 ms
(19 rows)
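
Note the arithmetic here: 411776 lossy heap blocks is exactly 3217 BRIN
ranges of 128 pages, i.e. roughly 40% of the table's ~8194 ranges, for a
3713-row result. If the pageinspect extension is available, the stored
per-range summaries can be examined directly; a sketch (page 2 is
illustrative; regular BRIN data pages sit after the metapage and revmap):

CREATE EXTENSION IF NOT EXISTS pageinspect;
-- Dump the range summaries stored on one BRIN page of date_idx.
SELECT itemoffset, blknum, value
FROM brin_page_items(get_raw_page('schema0.date_idx', 2), 'schema0.date_idx');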

** Wait... What??? "Rows Removed by Index Recheck: 1643390" - but the
data is almost sequential! Let's take a look at it.
=======================
db0=# SELECT ctid, (data->>'date')::numeric as date
      FROM schema0.table0
      WHERE (data->>'date')::numeric >= '1531267200'
        AND (data->>'date')::numeric <= '1531353600'
      ORDER BY ctid, (data->>'date')::numeric DESC
      LIMIT 1000;
    ctid    |    date
-----------+------------
  (4456,2)  | 1531281600
  (4468,2)  | 1531281600
  (9301,2)  | 1531281600
  (9310,3)  | 1531281600
...
  (9317,4)  | 1531281600
  (9325,3)  | 1531281600
  (9325,4)  | 1531281600
  (9326,1)  | 1531281600
  (9328,1)  | 1531281600
  (9328,2)  | 1531281600
  (9345,1)  | 1531281600
...
  (9365,2)  | 1531281600
  (9386,2)  | 1531281600
  (9386,3)  | 1531281600
  (9386,4)  | 1531281600
  (9418,4)  | 1531281600
...
  (9419,4)  | 1531281600
  (12921,2) | 1531281600
...
  (12962,2) | 1531281600
  (12983,1) | 1531281600
  (12983,2) | 1531281600
  (12983,3) | 1531281600
  (12983,4) | 1531281600
  (13028,3) | 1531281600
  (13028,4) | 1531281600
  (14113,1) | 1531281600
  (14113,2) | 1531281600
  (15297,1) | 1531281600
...
  (15317,2) | 1531281600
  (15350,1) | 1531281600
  (15350,2) | 1531281600
  (15350,3) | 1531281600
  (15350,4) | 1531281600
  (15386,3) | 1531281600
  (15386,4) | 1531281600
  (17620,3) | 1531281600
...
  (17648,4) | 1531281600
  (17652,1) | 1531281600
...
  (17653,2) | 1531281600
  (17658,1) | 1531281600
...
  (17659,2) | 1531281600
  (21229,2) | 1531281600
  (21243,1) | 1531281600
...
  (21246,4) | 1531281600
  (35794,2) | 1531281600
...
  (35798,1) | 1531281600
  (35827,4) | 1531281600
...
  (35887,2) | 1531281600
  (37006,3) | 1531281600
...
  (37080,1) | 1531281600
  (37157,2) | 1531281600
  (37157,3) | 1531281600
  (37157,4) | 1531281600
  (39475,1) | 1531281600
...
  (39552,4) | 1531281600
  (39623,4) | 1531281600
  (39653,1) | 1531281600
  (39653,2) | 1531281600
  (39653,3) | 1531281600
  (39653,4) | 1531281600
  (39674,1) | 1531281600
  (39674,2) | 1531281600
  (39674,3) | 1531281600
  (39674,4) | 1531281600
  (40829,3) | 1531281600
  (40829,4) | 1531281600
  (41946,4) | 1531281600
...
  (41962,2) | 1531281600
(1000 rows)
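
Gaps or not, this is a narrow band in BRIN terms: the first 1000 matches
all fall between pages 4456 and 41962, which is at most ~294 ranges of
128 pages, an order of magnitude fewer than the 3217 lossy ranges the
bitmap heap scan visited:

-- Number of 128-page BRIN ranges spanned by pages 4456..41962.
SELECT ceil((41962 - 4456 + 1) / 128.0) AS ranges_spanned;  -- 294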

** Ok, fine, there are large gaps. Let's make **absolutely sure** the
data is sequential.
=======================
db0=# CREATE INDEX date_ord_idx ON schema0.table0 (((table0.data ->> 'date')::numeric));
CREATE INDEX

** Not running ANALYZE on the new index; it shouldn't matter if the
stats are wrong. Expecting a scan and sort.
=======================
db0=# CLUSTER VERBOSE schema0.table0 USING date_ord_idx;
INFO:  clustering "schema0.table0" using sequential scan and sort
INFO:  "table0": found 0 removable, 4195522 nonremovable row versions in
1048881 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU: user: 873.12 s, system: 19.18 s, elapsed: 1019.36 s.
CLUSTER
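
Side note: CLUSTER rewrites the heap and rebuilds the indexes, so the
BRIN summaries should come out tight here. To rule out stale summaries
independently of a full rebuild, PostgreSQL 10 also allows
re-summarizing in place; the range number below is illustrative:

-- Summarize any page ranges that don't yet have an index entry...
SELECT brin_summarize_new_values('schema0.date_idx');
-- ...or drop a specific range's summary (range 0 here, illustratively)
-- so the next summarization pass rebuilds it.
SELECT brin_desummarize_range('schema0.date_idx', 0);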

** Ok, now we're clustered. Dropping the ordering b-tree index.
=======================
db0=# DROP INDEX schema0.date_ord_idx ;
DROP INDEX

** Let's explain again
=======================
db0=# EXPLAIN (analyze, verbose, costs, buffers)
      SELECT ctid, (data->>'date')::numeric as date
      FROM schema0.table0
      WHERE (data->>'date')::numeric >= '1531267200'
        AND (data->>'date')::numeric <= '1531353600'
      ORDER BY ctid, (data->>'date')::numeric DESC
      LIMIT 1000;
                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2393.45..2394.66 rows=482 width=38) (actual time=7339.584..7339.703 rows=1000 loops=1)
   Output: ctid, (((data ->> 'date'::text))::numeric)
   Buffers: shared hit=56 read=410752
   ->  Sort  (cost=2393.45..2394.66 rows=482 width=38) (actual time=7339.583..7339.634 rows=1000 loops=1)
         Output: ctid, (((data ->> 'date'::text))::numeric)
         Sort Key: table0.ctid, (((table0.data ->> 'date'::text))::numeric) DESC
         Sort Method: top-N heapsort  Memory: 95kB
         Buffers: shared hit=56 read=410752
         ->  Bitmap Heap Scan on schema0.table0  (cost=162.16..2371.97 rows=482 width=38) (actual time=59.252..7338.635 rows=3713 loops=1)
               Output: ctid, ((data ->> 'date'::text))::numeric
               Recheck Cond: ((((table0.data ->> 'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->> 'date'::text))::numeric <= '1531353600'::numeric))
               Rows Removed by Index Recheck: 1639294
               Heap Blocks: lossy=410752
               Buffers: shared hit=56 read=410752
               ->  Bitmap Index Scan on date_idx  (cost=0.00..162.03 rows=557 width=0) (actual time=24.545..24.545 rows=4107520 loops=1)
                     Index Cond: ((((table0.data ->> 'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->> 'date'::text))::numeric <= '1531353600'::numeric))
                     Buffers: shared hit=56
 Planning time: 0.515 ms
 Execution time: 7339.772 ms
(19 rows)

**  "Rows Removed by Index Recheck: 1639294" - OK, statistics are
pooched post-clustering. Let's reanalyze.
=======================
db0=# ANALYZE VERBOSE schema0.table0;
INFO:  analyzing "schema0.table0"
INFO:  "table0": scanned 30000 of 1048881 pages, containing 120000 live
rows and 0 dead rows; 30000 rows in sample, 4195524 estimated total rows
ANALYZE


**  Let's try this again...
=======================
db0=# EXPLAIN (analyze, verbose, costs, buffers)
      SELECT ctid, (data->>'date')::numeric as date
      FROM schema0.table0
      WHERE (data->>'date')::numeric >= '1531267200'
        AND (data->>'date')::numeric <= '1531353600'
      ORDER BY ctid, (data->>'date')::numeric DESC
      LIMIT 1000;
                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=5444.33..5445.71 rows=555 width=38) (actual time=6533.824..6533.927 rows=1000 loops=1)
   Output: ctid, (((data ->> 'date'::text))::numeric)
   Buffers: shared hit=410808
   ->  Sort  (cost=5444.33..5445.71 rows=555 width=38) (actual time=6533.822..6533.862 rows=1000 loops=1)
         Output: ctid, (((data ->> 'date'::text))::numeric)
         Sort Key: table0.ctid, (((table0.data ->> 'date'::text))::numeric) DESC
         Sort Method: top-N heapsort  Memory: 95kB
         Buffers: shared hit=410808
         ->  Bitmap Heap Scan on schema0.table0  (cost=162.22..5419.03 rows=555 width=38) (actual time=59.316..6532.913 rows=3713 loops=1)
               Output: ctid, ((data ->> 'date'::text))::numeric
               Recheck Cond: ((((table0.data ->> 'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->> 'date'::text))::numeric <= '1531353600'::numeric))
               Rows Removed by Index Recheck: 1639294
               Heap Blocks: lossy=410752
               Buffers: shared hit=410808
               ->  Bitmap Index Scan on date_idx  (cost=0.00..162.08 rows=1339 width=0) (actual time=30.859..30.859 rows=4107520 loops=1)
                     Index Cond: ((((table0.data ->> 'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->> 'date'::text))::numeric <= '1531353600'::numeric))
                     Buffers: shared hit=56
 Planning time: 1.100 ms
 Execution time: 6534.026 ms
(19 rows)

** This cannot be possible! The table has just been clustered on the
very expression the BRIN index is built on! Let's look at the data.
=======================
db0=# SELECT ctid, (data->>'date')::numeric as date
      FROM schema0.table0
      WHERE (data->>'date')::numeric >= '1531267200'
        AND (data->>'date')::numeric <= '1531353600'
      ORDER BY ctid, (data->>'date')::numeric DESC
      LIMIT 1000;
    ctid    |    date
-----------+------------
  (4155,2)  | 1531281600
...
  (4159,1)  | 1531281600
  (4188,4)  | 1531281600
  (4189,1)  | 1531281600
  (4189,2)  | 1531281600
  (4189,3)  | 1531281600
  (4189,4)  | 1531281600
  (4231,4)  | 1531281600
  (4247,3)  | 1531281600
  (4247,4)  | 1531281600
  (4248,1)  | 1531281600
  (4248,2)  | 1531281600
  (5367,3)  | 1531281600
...
  (5368,4)  | 1531281600
  (5376,3)  | 1531281600
...
  (5441,1)  | 1531281600
  (5518,2)  | 1531281600
  (5518,3)  | 1531281600
  (5518,4)  | 1531281600
  (7836,1)  | 1531281600
  (7836,2)  | 1531281600
  (7851,3)  | 1531281600
...
  (7913,4)  | 1531281600
  (7984,4)  | 1531281600
  (8014,1)  | 1531281600
  (8014,2)  | 1531281600
  (8014,3)  | 1531281600
  (8014,4)  | 1531281600
  (8035,1)  | 1531281600
  (8035,2)  | 1531281600
  (8035,3)  | 1531281600
  (8035,4)  | 1531281600
  (9190,3)  | 1531281600
  (9190,4)  | 1531281600
  (10307,4) | 1531281600
...
  (10349,3) | 1531281600
  (10371,3) | 1531281600
...
  (10392,2) | 1531281600
  (10445,4) | 1531281600
  (10446,1) | 1531281600
  (10446,2) | 1531281600
  (11580,3) | 1531281600
...
  (11586,2) | 1531281600
  (11595,3) | 1531281600
...
  (11598,2) | 1531281600
  (11600,1) | 1531281600
...
  (11601,2) | 1531281600
  (11609,1) | 1531281600
...
  (11621,2) | 1531281600
  (11621,3) | 1531281600
  (11664,3) | 1531281600
...
  (11684,3) | 1531281600
  (11726,2) | 1531281600
...
  (11770,4) | 1531281600
  (12891,2) | 1531281600
  (14057,1) | 1531281600
...
  (14058,2) | 1531281600
  (14064,3) | 1531281600
...
  (14071,3) | 1531281600
(1000 rows)
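
Same arithmetic as before: post-CLUSTER the first 1000 matches sit
between pages 4155 and 14071, at most ~78 ranges of 128 pages, yet the
bitmap heap scan above still visited 410752 / 128 = 3209 lossy ranges:

-- Number of 128-page BRIN ranges spanned by pages 4155..14071.
SELECT ceil((14071 - 4155 + 1) / 128.0) AS ranges_spanned;  -- 78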

****

The issue seems to involve both clustering and BRIN stats:

- Fillfactor on the table is not a factor: it can be 100 or 75, and I'm
  able to reproduce (and fail to reproduce) with both.
- Clustering on `date_ord_idx` can be done with or without ANALYZE; in
  some cases PG picks scan-and-sort, in others it uses the index
  regardless of stats.
- Sometimes the table ends up properly clustered afterwards, sometimes
  it doesn't. Sometimes clustering first on the unanalyzed
  `date_ord_idx` doesn't fix the problem, but analyzing and then
  clustering on `date_ord_idx` again does. Sometimes neither helps.
- A full VACUUM ANALYZE may or may not help, depending on the phase of
  the moon.
- When the BRIN index is actually healthy, the EXPLAIN shows a few
  hundred rows eliminated by the recheck instead of >1M, and the
  date-range queries take 50-70 ms.
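
For anyone poking at this, the remaining knobs I can think of look
roughly like this (untested here; the new index name is illustrative,
and 128 pages per range is the default):

-- Rebuild the BRIN index outright.
REINDEX INDEX schema0.date_idx;

-- Or trade index size for tighter summaries with smaller ranges.
CREATE INDEX date_idx_pp32 ON schema0.table0
    USING brin (((data ->> 'date')::numeric))
    WITH (pages_per_range = 32, autosummarize = 'true');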

If anybody can suggest where in the code to start looking, I'll be
extra grateful.

--
Arcadiy Ivanov
arcadiy@gmail.com | @arcivanov | https://ivanov.biz
https://github.com/arcivanov

