Fwd: Merging statistics from children instead of re-sampling everything - Mailing list pgsql-hackers

From Damir Belyalov
Subject Fwd: Merging statistics from children instead of re-sampling everything
Date
Msg-id CALH1LgvFDa4SehS6wHwE0iZVLrdrm=tuOMkvt7VuUcZ+EoiwuQ@mail.gmail.com
Whole thread Raw
In response to Merging statistics from children instead of re-sampling everything  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
3) stadistinct - This is quite problematic. We only have the per-child
estimates, and it's not clear if there's any overlap. For now I've just
summed it up, because that's safer / similar to what we do for gather
merge paths etc. Maybe we could improve this by estimating the overlap
somehow (e.g. from MCV lists / histograms). But honestly, I doubt the
estimates based on tiny sample of each child are any better. I suppose
we could introduce a column option, determining how to combine ndistinct
(similar to how we can override n_distinct itself).

4) MCV - It's trivial to build a new "parent" MCV list, although it may
be too large (in which case we cut it at statistics target, and copy the
remaining bits to the histogram)

I think there is one approach to solve the problem with calculating mcv and distinct statistics.
To do this, you need to calculate the density of the sample distribution and store it, for example, in some slot.
Then, when merging statistics, we will sum up the densities of all partitions as functions and get a new density.
According to the new density, you can find out which values are most common and which are distinct.

To calculate the partition densities, you can use the "Kernel density Estimation" - https://www.statsmodels.org/dev/examples/notebooks/generated/kernel_density html

The approach may be very inaccurate and difficult to implement, but solves the problem.

Regards,
Damir Belyalov
Postgres Professional

 

pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: Cleaning up historical portability baggage
Next
From: Melih Mutlu
Date:
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication