Home > mailing lists

Fwd: Merging statistics from children instead of re-sampling everything - Mailing list pgsql-hackers

From	Damir Belyalov
Subject	Fwd: Merging statistics from children instead of re-sampling everything
Date	August 15, 2022 14:10:44
Msg-id	CALH1LgvFDa4SehS6wHwE0iZVLrdrm=tuOMkvt7VuUcZ+EoiwuQ@mail.gmail.com Whole thread Raw
In response to	Merging statistics from children instead of re-sampling everything (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List	pgsql-hackers

Tree view

3) stadistinct - This is quite problematic. We only have the per-child
estimates, and it's not clear if there's any overlap. For now I've just
summed it up, because that's safer / similar to what we do for gather
merge paths etc. Maybe we could improve this by estimating the overlap
somehow (e.g. from MCV lists / histograms). But honestly, I doubt the
estimates based on tiny sample of each child are any better. I suppose
we could introduce a column option, determining how to combine ndistinct
(similar to how we can override n_distinct itself).

4) MCV - It's trivial to build a new "parent" MCV list, although it may
be too large (in which case we cut it at statistics target, and copy the
remaining bits to the histogram)

I think there is one approach to solve the problem with calculating mcv and distinct statistics.
To do this, you need to calculate the density of the sample distribution and store it, for example, in some slot.
Then, when merging statistics, we will sum up the densities of all partitions as functions and get a new density.
According to the new density, you can find out which values are most common and which are distinct.

To calculate the partition densities, you can use the "Kernel density Estimation" - https://www.statsmodels.org/dev/examples/notebooks/generated/kernel_density html

The approach may be very inaccurate and difficult to implement, but solves the problem.

Regards,

Damir Belyalov

Postgres Professional

pgsql-hackers by date:

From: Thomas Munro
Date: 15 August 2022, 13:48:22
Subject: Re: Cleaning up historical portability baggage

From: Melih Mutlu
Date: 15 August 2022, 14:26:26
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

Fwd: Merging statistics from children instead of re-sampling everything - Mailing list pgsql-hackers

Previous

Next