Thread: Why there are 30000 rows in the sample

Why there are 30000 rows in the sample

From: "Andrus"
Hi!

vacuumdb output:

vacuumdb: vacuuming database "mydb"
INFO:  analyzing "public.mytable"
INFO:  "mytable": scanned 2709 of 2709 pages, containing 10834 live rows and 0 dead rows; 10834 rows in sample, 10834
estimated
 
total rows

For tables with more than 30000 rows, it shows that there are 30000 rows in the sample.

postgresql.conf does not set the default_statistics_target value.
It contains

#default_statistics_target = 100    # range 1-10000

So I expect that there should be 100 rows in the sample.
Why does Postgres use 30000, or the number of rows in the table for smaller tables?

Is 30000 some magical value, and how can it be controlled?

Using Postgres 12 on Debian.

Andrus.




Re: Why there are 30000 rows in the sample

From: Julien Rouhaud
Hi,

On Sat, Apr 04, 2020 at 10:07:51AM +0300, Andrus wrote:
> Hi!
> 
> vacuumdb output:
> 
> vacuumdb: vacuuming database "mydb"
> INFO:  analyzing "public.mytable"
> INFO:  "mytable": scanned 2709 of 2709 pages, containing 10834 live rows and
> 0 dead rows; 10834 rows in sample, 10834 estimated total rows
> 
> For tables with more than 30000 rows, it shows that there are 30000 rows in the sample.
> 
> postgresql.conf does not set the default_statistics_target value.
> It contains
> 
> #default_statistics_target = 100    # range 1-10000
> 
> So I expect that there should be 100 rows in the sample.
> Why does Postgres use 30000, or the number of rows in the table for smaller tables?
> 
> Is 30000 some magical value, and how can it be controlled?


That's because default_statistics_target's unit isn't 1 row but 300 rows: ANALYZE samples up to 300 * default_statistics_target rows, so with the default of 100 you get the 30000 you're seeing. So yeah, the factor of 300 is a little bit of a magical value.

You can see more details about this value at
https://github.com/postgres/postgres/blob/master/src/backend/commands/analyze.c#L1723-L1756.
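As a quick illustration (using the "mytable" name from your output), you can check the 300 * default_statistics_target relationship yourself; the exact INFO line will look roughly like the one you already posted:

SHOW default_statistics_target;   -- 100 by default
-- => ANALYZE samples up to 300 * 100 = 30000 rows per table

SET default_statistics_target = 10;
ANALYZE VERBOSE mytable;
-- INFO:  "mytable": ... 3000 rows in sample ...
-- (assuming the table has at least 3000 rows)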



Re: Why there are 30000 rows in the sample

From: "Peter J. Holzer"
On 2020-04-04 10:07:51 +0300, Andrus wrote:
> vacuumdb: vacuuming database "mydb"
> INFO:  analyzing "public.mytable"
> INFO:  "mytable": scanned 2709 of 2709 pages, containing 10834 live rows and
> 0 dead rows; 10834 rows in sample, 10834 estimated total rows
>
> For tables with more than 30000 rows, it shows that there are 30000 rows in the sample.
>
> postgresql.conf does not set the default_statistics_target value.
> It contains
>
> #default_statistics_target = 100    # range 1-10000
>
> So I expect that there should be 100 rows in the sample.

The statistics target determines the size of gathered statistics: A
value of 100 means that Postgres should store (at most) the 100 most
frequent values and that the histogram should have 100 buckets.

Obviously you can't determine the 100 most frequent values if you only
sample 100 rows, and a histogram with 100 buckets based on only 100 rows
would be mostly empty.
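You can see both kinds of statistics in the pg_stats view after an ANALYZE (column name here is just a placeholder):

SELECT most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'mytable' AND attname = 'some_column';
-- most_common_vals/most_common_freqs hold up to <statistics target> entries;
-- histogram_bounds has up to <statistics target> + 1 bucket boundaries.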

So you need to sample more rows to compute those statistics. How much
more? As it turns out (see the link Julien posted), it depends a bit on
the size of the table, but not that much, so a factor of 300 is good
enough for a wide range of sizes.

> Why does Postgres use 30000, or the number of rows in the table for smaller tables?
>
> Is 30000 some magical value, and how can it be controlled?

The factor of 300 is a bit magical but well founded. You can't control
it (except by changing the source code and recompiling, of course, but
why would you want to?). The value 100 can be controlled either by
changing default_statistics_target or by changing the statistics target
of a specific column of a specific table (ALTER TABLE ... ALTER COLUMN
... SET STATISTICS ...).
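A minimal sketch of both options (table and column names are placeholders):

-- Raise the target for one column; takes effect at the next ANALYZE.
ALTER TABLE mytable ALTER COLUMN some_column SET STATISTICS 500;
ANALYZE mytable;

-- Or change the default for everything (postgresql.conf or per session):
SET default_statistics_target = 500;
ANALYZE;

With a target of 500, ANALYZE will sample up to 300 * 500 = 150000 rows.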

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
