Re: Postgres picks suboptimal index after building of an extended statistics - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Postgres picks suboptimal index after building of an extended statistics
Date
Msg-id CAPpHfdt_AFo9SQkD2JooH3CTR9e+6fnEUZy_hGV5ZcEszHirCQ@mail.gmail.com
In response to Re: Postgres picks suboptimal index after building of an extended statistics  (Andrei Lepikhov <a.lepikhov@postgrespro.ru>)
Responses Re: Postgres picks suboptimal index after building of an extended statistics
List pgsql-hackers
Hi!

I'd like to get this subject off the ground.  The problem originally described in [1] obviously comes from wrong selectivity estimation.  "Dependencies" extended statistics lead to a significant selectivity miss: 24/1000 instead of 1/1000.  When the estimation is correct, the PostgreSQL optimizer is capable of choosing the appropriate unique index for the query.

Tom pointed out in [2] that this might be a problem of "Dependencies" extended statistics.  I've rechecked this.  The dataset consists of two parts.  The first part, defined in the CREATE TABLE statement, contains independent values.  The second part, defined in the INSERT statement, contains dependent values.  "Dependencies" extended statistics estimate the dependency degree as the fraction of rows containing dependent values.  According to this definition, they correctly estimate the dependency degree as about 0.5 for all the combinations.  So, the error in the estimate comes from the synergy of two factors: MCV estimation of the frequency of individual values, and spreading of the average dependency degree to those values, for which it is not relevant.  I don't think there is a way to fix "dependencies" extended statistics, because they work exactly as designed.  I have to note that instead of fixing "dependencies" extended statistics, you can just add multi-column MCV statistics, which would fix the problem.
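The spreading effect can be sketched roughly like this (a Python illustration, not PostgreSQL's actual code; the combining formula and the numbers are illustrative assumptions on my side):

```python
# Rough sketch (NOT PostgreSQL's actual implementation) of how a functional
# dependency x -> y with degree d combines per-column selectivities:
#     P(x, y) ~= P(x) * (d + (1 - d) * P(y))
# An average degree measured over the whole table is applied even to value
# combinations that are actually independent, inflating their estimate.
def dependency_selectivity(s_x, s_y, d):
    # s_x, s_y: per-column selectivities; d: dependency degree in [0, 1]
    return s_x * (d + (1 - d) * s_y)

# Illustrative numbers: independent values with frequency 1/10 each,
# but an average dependency degree of 0.5 over the table:
print(dependency_selectivity(0.1, 0.1, 0.5))  # ~0.055, vs. the true 0.01
```

For genuinely independent value combinations, the true selectivity is the plain product of the per-column selectivities, so applying the table-wide average degree overestimates them by a large factor.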

CREATE STATISTICS aestat(dependencies,ndistinct,mcv) ON x,y,z FROM a;
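For completeness, a minimal sketch of how one might set up a similar case and check the estimate (a hypothetical reconstruction modeled on the plan shown below, not the exact dataset from [1]; the table layout and value distribution are my guesses):

```sql
-- Hypothetical reconstruction: a table with a three-column primary key,
-- filled with mostly independent values plus a block of fully dependent
-- ones.  The actual test case is in [1].
CREATE TABLE a (x int, y int, z int, PRIMARY KEY (x, y, z));
INSERT INTO a SELECT i % 10, i / 10 % 10, i / 100
  FROM generate_series(0, 999) i;        -- independent part
INSERT INTO a SELECT i, i, i
  FROM generate_series(1000, 1999) i;    -- dependent part
-- After adding the multi-column statistics shown above:
ANALYZE a;
EXPLAIN SELECT * FROM a WHERE x = 1 AND y = 1 AND z = 1;
```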

Independently of how well our statistics work, it seems a pity that we can't fix this using the knowledge of unique constraints on the table.  Unlike statistics, which give us only more or less accurate estimates, a constraint is something we actually enforce and thus can rely on.  The patches provided by Andrei in [1], [3], and [4] fix this issue at the index scan path level.  As Tomas pointed out in [5], that could lead to an inconsistency between the number of rows used for unique index scan estimation and the value displayed in EXPLAIN (and used for other paths).  Even though this was debated in [6], I can confirm this inconsistency.  Thus, with the patch published in [4], I can see a 28-row estimate with a unique index scan.

                              QUERY PLAN
-----------------------------------------------------------------------
 Index Only Scan using a_pkey on a  (cost=0.28..8.30 rows=28 width=12)
   Index Cond: ((x = 1) AND (y = 1) AND (z = 1))
(2 rows)

Also, there is a set of patches [7], [8], and [9], which make the optimizer consider path selectivity alongside path costs during path selection.  I've rechecked that none of these patches resolves the original problem described in [1].  Also, I think they are quite tricky.  The model of our optimizer assumes that the paths in the list are different ways of getting the same result.  If we choose paths by their selectivity, that breaks this model.  I'm not saying this is impossible, but doing it would require significant rethinking of our optimizer model and possibly a revision of a significant part of it.  Anyway, if there is still interest in this, I think it should be moved into a separate thread to keep this thread focused on the problem described in [1].

Finally, I'd like to note that the issue described in [1] is mostly a selectivity estimation problem.  It could be solved by adding multi-column MCV statistics.  The patches published so far look more like hacks for particular use cases than appropriate solutions.  Using the knowledge of unique constraints during selectivity estimation [10] still looks promising to me.  Even though it's hard to implement and possibly implies some overhead, it fits the current model.  I also think unique constraints could probably be used in some way to improve estimates even when there is no full match.

Links.
1. https://www.postgresql.org/message-id/0ca4553c-1f34-12ba-9122-44199d1ced41%40postgrespro.ru
2. https://www.postgresql.org/message-id/3119052.1657231656%40sss.pgh.pa.us
3. https://www.postgresql.org/message-id/90a1d6ef-c777-b95d-9f77-0065ad4522df%40postgrespro.ru
4. https://www.postgresql.org/message-id/a5a18d86-c0e5-0ceb-9a18-be1beb2d2944%40postgrespro.ru
5. https://www.postgresql.org/message-id/f8044836-5d61-a4e0-af82-5821a2a1f0a7%40enterprisedb.com
6. https://www.postgresql.org/message-id/90a1d6ef-c777-b95d-9f77-0065ad4522df%40postgrespro.ru
7. https://www.postgresql.org/message-id/2df148b5-0bb8-f80b-ac03-251682fab585%40postgrespro.ru
8. https://www.postgresql.org/message-id/6fb43191-2df3-4791-b307-be754e648276%40postgrespro.ru
9. https://www.postgresql.org/message-id/154f786a-06a0-4fb1-b8a4-16c66149731b%40postgrespro.ru
10. https://www.postgresql.org/message-id/f8044836-5d61-a4e0-af82-5821a2a1f0a7%40enterprisedb.com

------
Regards,
Alexander Korotkov
