Home > mailing lists

Extended statistics for correlated columns, row estimates when values are not in MCVs list - Mailing list pgsql-general

From	Michael Lewis
Subject	Extended statistics for correlated columns, row estimates when values are not in MCVs list
Date	December 4, 2020 22:53:16
Msg-id	CAMcsB=y=3G_+s_zFYPu2-O6OMWOvOkb3t1MU=907yk5RC_RaYw@mail.gmail.com Whole thread
Responses	Re: Extended statistics for correlated columns, row estimates when values are not in MCVs list
List	pgsql-general

Tree view

On version 12.5, I have a query similar to the below where I am getting rather unfortunate row estimates and a sub-optimal index choice as a result.

SELECT
id
FROM
messages
WHERE
client_id = 1234
and site_id = 889977
and message_type_id in ( 59, 62, 102, 162 )
and sent_on > NOW() - INTERVAL '3 days';

I created extended statistics (all types) on client_id and site_id, analyzed the columns, and in pg_stats_ext.dependencies (4 = client_id, 7 = site_id) have:

{"4 => 7": 0.002997, "7 => 4": 0.896230}

When I check row estimates like below, I get results that indicate the columns are expected to be independent still.

explain SELECT id FROM messages; --889594304

explain SELECT id FROM messages where client_id = 1234; --133439

explain SELECT id FROM messages where site_id = 889977; --28800

explain SELECT id FROM messages where client_id = 1234 and site_id = 889977; --4

However, I pick a client & site ID pair which show up in the MCVs list, then I get the same estimate when querying for that site_id with or without including the client_id. That is great.

Is it reasonable to expect that if the correlation between two columns is rather high, then the optimizer might figure the columns are not independent and perhaps would give less weight to the value derived from independent column statistics? With table statistics, it is possible to set a static value or ratio for something like ndistinct. Any chance for something similar on dependency someday?

Perhaps I am expecting too much or have a poor understanding of what extended statistics can or someday might do. I deal with under estimates from correlations between client_id and sites or other similar dependent objects a fair bit and am hopeful to resolve some of those planning problems with extended stats, without the maintenance overhead of migrating everything to a single client per database to get more specific statistics.

pgsql-general by date:

From: Nicklas Avén
Date: 04 December 2020, 20:05:57
Subject: Re: Accessing Postgres Server and database from other Machine

From: Hemil Ruparel
Date: 05 December 2020, 05:23:13
Subject: Re: Accessing Postgres Server and database from other Machine

Extended statistics for correlated columns, row estimates when values are not in MCVs list - Mailing list pgsql-general

Previous

Next