distinct estimate of a hard-coded VALUES list - Mailing list pgsql-hackers

From Jeff Janes
Subject distinct estimate of a hard-coded VALUES list
Msg-id CAMkU=1xHkyPa8VQgGcCNg3RMFFvVxUdOpus1gKcFuvVi0w6Acg@mail.gmail.com
Responses Re: distinct estimate of a hard-coded VALUES list  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I have a query which contains a where clause like:

 aid =ANY(VALUES (1),(45),(87), <6948 more>, (4999947))

for example:

perl -le 'print "explain (analyze) select sum(abalance) from pgbench_accounts where aid=ANY(VALUES "; print join ",", map "($_)", sort {$a<=>$b} map int(rand(5000000)), 1..6952; print ");"'  | psql


And it gives me an explain section like:

->  HashAggregate  (cost=104.28..106.28 rows=200 width=32) (actual time=15.171..30.859 rows=6931 loops=1)
      Group Key: "*VALUES*".column1
      ->  Values Scan on "*VALUES*"  (cost=0.00..86.90 rows=6952 width=32) (actual time=0.007..6.926 rows=6952 loops=1)

So even though it knows that 6952 values have been shoved in the bottom, it thinks only 200 are going to come out of the aggregation.  This seems like a really lousy estimate.  In more complex queries than the example one given it leads to poor planning choices.
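For what it's worth, the 200 appears to be the planner's generic fallback for an unknown number of distinct values, DEFAULT_NUM_DISTINCT in the PostgreSQL source:

    /* src/include/utils/selfuncs.h */
    #define DEFAULT_NUM_DISTINCT  200

So the estimate above looks like the "no statistics available" default being applied, rather than anything derived from the 6952-row VALUES list.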

Is the size of the input list not available to the planner at the point where it estimates the distinct size of the input list?  I'm assuming that if it is available to EXPLAIN then it is available to the planner.  Does it know how large the input list is, but just throw up its hands and use 200 as the distinct size anyway?

If I throw at the system a massively degenerate list, and it makes bad choices by assuming the list is distinct, I have the recourse of uniquifying the list myself before passing it in.  If I pass in a mostly distinct list, and it makes bad choices by assuming only 200 are distinct, I don't have much recourse aside from creating, populating, analyzing, and then using temporary tables. Which is annoying, and causes other problems like catalog bloat.
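For the record, the temp-table workaround looks roughly like this (a sketch; the table name is made up for illustration):

    CREATE TEMP TABLE target_aids (aid int);
    -- populate it, e.g. COPY target_aids (aid) FROM STDIN;
    ANALYZE target_aids;
    SELECT sum(abalance)
      FROM pgbench_accounts
      WHERE aid IN (SELECT aid FROM target_aids);

The ANALYZE step is what gives the planner a real ndistinct for the list, at the cost of the catalog churn mentioned above.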

Is this something in the planner that could be fixed in a future version, or is a temp table the only real solution here?

Cheers,

Jeff
