Thread: Statistics on array values

Statistics on array values

From
Marco Colli
Date:
Hello!

Let's say that you have a simple query like the following on a large table (for a multi-tenant application):
SELECT "subscribers".* FROM "subscribers" WHERE "subscribers"."project_id" = 123 AND (tags @> ARRAY['de']::varchar[]);

If you run EXPLAIN ANALYZE you can see that stats are completely wrong. For example I get an expected count of 3,500 rows whereas the actual result is 20 rows. This also results in bad query plans...

In a previous discussion someone said that this wrong estimate is because @> uses a fixed selectivity of 0.001, **regardless of actual data**!!
Is that true? Is there any solution or any plan to improve this in future versions of PostgreSQL?

Finally, it would be useful if CREATE STATISTICS could tell PostgreSQL that there is a correlation between project_id and the tag values, but that is a further step. I can already create extended statistics, however they seem to have no positive effect on the estimates for the above case.
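For reference, what I tried was something along these lines (the statistics name is arbitrary; table and column names are the ones from the query above):

CREATE STATISTICS subscribers_project_tags ON project_id, tags FROM subscribers;
ANALYZE subscribers;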


Marco Colli

Re: Statistics on array values

From
Justin Pryzby
Date:
On Sun, Feb 02, 2020 at 03:18:19PM +0100, Marco Colli wrote:
> Hello!
> 
> Let's say that you have a simple query like the following on a large table
> (for a multi-tenant application):
> SELECT "subscribers".* FROM "subscribers" WHERE "subscribers"."project_id"
> = 123 AND (tags @> ARRAY['de']::varchar[]);
> 
> If you run EXPLAIN ANALYZE you can see that stats are completely wrong.
> For example I get an expected count of 3,500 rows whereas the actual
> result is 20 rows. This also results in bad query plans...

https://www.postgresql.org/message-id/CAMkU%3D1z%2BQijUWAYgeqeyw%2BAvD7adPgOmEnY%2BOcTw6qDVFtD7cQ%40mail.gmail.com
On Fri, Jan 10, 2020 at 12:12:52PM -0500, Jeff Janes wrote:
> Why is the estimate off by so much?  If you run a simple select, what the
> actual and expected number of rows WHERE project_id = 12345?  WHERE tags @>
> '{crt:2018_11}'?  Is one of those estimates way off reality, or is it only
> the conjunction which is deranged?

Could you respond to Jeff's inquiry?

Justin



Re: Statistics on array values

From
Marco Colli
Date:
> Is one of those estimates way off reality, or is it only the conjunction which is deranged?

The estimate is wrong *even with a single tag*, without the conjunction (e.g. expected 3500, actual 20). Then the conjunction can make the bias even worse...
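For example, testing the tag predicate on its own (names as in my original query):

EXPLAIN ANALYZE SELECT * FROM subscribers
WHERE tags @> ARRAY['de']::varchar[];
-- estimated ~3500 rows, actual 20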

On Sun, Feb 2, 2020 at 3:23 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Sun, Feb 02, 2020 at 03:18:19PM +0100, Marco Colli wrote:
> Hello!
>
> Let's say that you have a simple query like the following on a large table
> (for a multi-tenant application):
> SELECT "subscribers".* FROM "subscribers" WHERE "subscribers"."project_id"
> = 123 AND (tags @> ARRAY['de']::varchar[]);
>
> If you run EXPLAIN ANALYZE you can see that stats are completely wrong.
> For example I get an expected count of 3,500 rows whereas the actual
> result is 20 rows. This also results in bad query plans...

https://www.postgresql.org/message-id/CAMkU%3D1z%2BQijUWAYgeqeyw%2BAvD7adPgOmEnY%2BOcTw6qDVFtD7cQ%40mail.gmail.com
On Fri, Jan 10, 2020 at 12:12:52PM -0500, Jeff Janes wrote:
> Why is the estimate off by so much?  If you run a simple select, what the
> actual and expected number of rows WHERE project_id = 12345?  WHERE tags @>
> '{crt:2018_11}'?  Is one of those estimates way off reality, or is it only
> the conjunction which is deranged?

Could you respond to Jeff's inquiry?

Justin

Re: Statistics on array values

From
Tom Lane
Date:
Marco Colli <collimarco91@gmail.com> writes:
> Let's say that you have a simple query like the following on a large table
> (for a multi-tenant application):
> SELECT "subscribers".* FROM "subscribers" WHERE "subscribers"."project_id"
> = 123 AND (tags @> ARRAY['de']::varchar[]);

> If you run EXPLAIN ANALYZE you can see that stats are completely wrong.
> For example I get an expected count of 3,500 rows whereas the actual
> result is 20 rows. This also results in bad query plans...

> In a previous discussion someone said that this wrong estimate is because
> @> uses a fixed selectivity of 0.001, **regardless of actual data**!!
> Is that true?

Hasn't been true since 9.2.

You might get some insight from looking into the most_common_elems,
most_common_elem_freqs, and elem_count_histogram fields of the pg_stats
view.
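That is, something like:

SELECT most_common_elems, most_common_elem_freqs, elem_count_histogram
FROM pg_stats
WHERE tablename = 'subscribers' AND attname = 'tags';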

It seems likely to me that increasing the statistics target for this array
column would help.  IIRC, estimates for values that don't show up in
most_common_elems are going to depend on the lowest frequency that *does*
show up there ... so if you want better resolution for non-common values,
you need more entries.

            regards, tom lane



Re: Statistics on array values

From
Marco Colli
Date:
Thanks Tom for the clear explanation. 
Unfortunately I don't get actual improvements. I use PG 11 and I run the following commands:

ALTER TABLE subscriptions ALTER tags SET STATISTICS 1000; 
ANALYZE subscriptions;

However the bias remains pretty much the same (slightly worse after). Any idea?

On Sun, Feb 2, 2020 at 6:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Marco Colli <collimarco91@gmail.com> writes:
> Let's say that you have a simple query like the following on a large table
> (for a multi-tenant application):
> SELECT "subscribers".* FROM "subscribers" WHERE "subscribers"."project_id"
> = 123 AND (tags @> ARRAY['de']::varchar[]);

> If you run EXPLAIN ANALYZE you can see that stats are completely wrong.
> For example I get an expected count of 3,500 rows whereas the actual
> result is 20 rows. This also results in bad query plans...

> In a previous discussion someone said that this wrong estimate is because
> @> uses a fixed selectivity of 0.001, **regardless of actual data**!!
> Is that true?

Hasn't been true since 9.2.

You might get some insight from looking into the most_common_elems,
most_common_elem_freqs, and elem_count_histogram fields of the pg_stats
view.

It seems likely to me that increasing the statistics target for this array
column would help.  IIRC, estimates for values that don't show up in
most_common_elems are going to depend on the lowest frequency that *does*
show up there ... so if you want better resolution for non-common values,
you need more entries.

                        regards, tom lane

Re: Statistics on array values

From
Tom Lane
Date:
Marco Colli <collimarco91@gmail.com> writes:
> Unfortunately I don't get actual improvements. I use PG 11 and I run the
> following commands:
> ALTER TABLE subscriptions ALTER tags SET STATISTICS 1000;
> ANALYZE subscriptions;
> However the bias remains pretty much the same (slightly worse after). Any
> idea?

So what have you got in the pg_stats fields I asked about?
How big is this table anyway (how many rows, how many different tag
values)?
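For instance (assuming the table and column names from your original query):

SELECT count(*) FROM subscribers;

SELECT count(DISTINCT t.tag)
FROM (SELECT unnest(tags) AS tag FROM subscribers) AS t;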

            regards, tom lane



Re: Statistics on array values

From
Marco Colli
Date:
Sorry, I don't understand your exact question about pg_stats. In any case, I cannot make strict assumptions about the data, because it varies greatly from project to project (it's a SaaS) and over time. To give an idea: the table has some tens of millions of rows, each project has from a few thousand to a few million rows, and each project has its own tags that the customer can define (unlimited tags per row, but usually only 1 - 10 are actually used).

Il Dom 2 Feb 2020, 19:32 Tom Lane <tgl@sss.pgh.pa.us> ha scritto:
Marco Colli <collimarco91@gmail.com> writes:
> Unfortunately I don't get actual improvements. I use PG 11 and I run the
> following commands:
> ALTER TABLE subscriptions ALTER tags SET STATISTICS 1000;
> ANALYZE subscriptions;
> However the bias remains pretty much the same (slightly worse after). Any
> idea?

So what have you got in the pg_stats fields I asked about?
How big is this table anyway (how many rows, how many different tag
values)?

                        regards, tom lane