Re: Merging statistics from children instead of re-sampling everything - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: Merging statistics from children instead of re-sampling everything
Date	February 11, 2022 01:37:26
Msg-id	4e86ae74-4e2c-b40f-4405-035d2f818e5d@enterprisedb.com
In response to	Re: Merging statistics from children instead of re-sampling everything (Andrey Lepikhov <a.lepikhov@postgrespro.ru>)
Responses	Re: Merging statistics from children instead of re-sampling everything
List	pgsql-hackers

On 2/10/22 12:50, Andrey Lepikhov wrote:
> On 21/1/2022 01:25, Tomas Vondra wrote:
>> But I don't have a very good idea what to do about statistics that we
>> can't really merge. For some types of statistics it's rather tricky to
>> reasonably merge the results - ndistinct is a simple example, although
>> we could work around that by building and merging hyperloglog counters.
>
> I think, as a first step on this way we can reduce a number of pulled
> tuples. We don't really needed to pull all tuples from a remote server.
> To construct a reservoir, we can pull only a tuple sample. Reservoir
> method needs only a few arguments to return a sample like you read
> tuples locally. Also, to get such parts of samples asynchronously, we
> can get size of each partition on a preliminary step of analysis.
> In my opinion, even this solution can reduce heaviness of a problem
> drastically.
> 

Oh, wow! I hadn't realized we fetch all the rows from foreign
(postgres_fdw) partitions. For local partitions we already sample,
because that uses the usual acquire function, with a reservoir
proportional to partition size. I had assumed we use tablesample to
fetch just a small fraction of rows from FDW partitions, and I agree
doing that would be a pretty huge benefit.
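
To illustrate the "reservoir proportional to partition size" idea, here is
a rough sketch in Python (not the actual C code). The function name, the
use of block counts as the size measure, and the numbers in the usage
example are all assumptions made for illustration only.

```python
def allocate_sample_rows(target_rows, partition_blocks):
    """Split a total sample budget across partitions by relative size.

    target_rows      -- total number of rows wanted in the sample
    partition_blocks -- dict: partition name -> size in blocks
                        (hypothetical input; any size estimate would do)
    """
    total_blocks = sum(partition_blocks.values())
    if total_blocks == 0:
        # Nothing to sample from: every partition gets an empty reservoir.
        return {name: 0 for name in partition_blocks}

    # Each partition gets a share of the budget proportional to its size.
    return {
        name: round(target_rows * blocks / total_blocks)
        for name, blocks in partition_blocks.items()
    }


if __name__ == "__main__":
    sizes = {"p1": 100, "p2": 300, "p3": 600}
    print(allocate_sample_rows(30000, sizes))
```

With this scheme a partition holding 10% of the blocks contributes roughly
10% of the sample, so small partitions are not over-represented.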

I actually tried hacking that together - there are a couple of problems
with that (e.g. determining what fraction to sample using
bernoulli/system), but in principle it seems quite doable. Some minor
changes to the FDW API may be necessary, not sure.
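
As a sketch of the "what fraction to sample" problem: given a target
number of rows and an estimate of the remote row count, the fraction is
just their ratio, clamped to 100%. The Python below is illustrative only.
The helper names are hypothetical, the row estimate is assumed to come
from somewhere (and may be stale), and this is not what postgres_fdw
does today.

```python
def sample_fraction(target_rows, est_rows):
    """Fraction of the table to sample so we expect ~target_rows rows."""
    if est_rows <= 0:
        # No usable estimate: fall back to reading everything.
        return 1.0
    return min(1.0, max(0.0, target_rows / est_rows))


def sample_query(table, target_rows, est_rows, method="bernoulli"):
    """Build a remote query using TABLESAMPLE (argument is a percentage)."""
    if method not in ("bernoulli", "system"):
        raise ValueError("unsupported sampling method: %s" % method)
    pct = 100.0 * sample_fraction(target_rows, est_rows)
    return "SELECT * FROM %s TABLESAMPLE %s (%.4f)" % (
        table, method.upper(), pct)


if __name__ == "__main__":
    # Hypothetical numbers: want ~30000 rows from ~3M remote rows.
    print(sample_query("part_1", 30000, 3000000))
```

Note that bernoulli only returns approximately the requested number of
rows, and a stale row estimate skews the fraction, so the local side
would probably still need to cap or re-sample what it receives.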

Not sure about the async execution - that seems way more complicated.
Sampling reduces the total cost, while async just parallelizes it.

That being said, this thread was not really about foreign partitions,
but about re-analyzing inheritance trees in general. And sampling
foreign partitions doesn't really solve that - we'll still do the
sampling over and over.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
