Re: postgres_fdw: using TABLESAMPLE to collect remote sample - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: postgres_fdw: using TABLESAMPLE to collect remote sample
Msg-id: 6da357b0-fdd3-095d-a06c-87666f4a5a8c@enterprisedb.com
In response to: Re: postgres_fdw: using TABLESAMPLE to collect remote sample (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: postgres_fdw: using TABLESAMPLE to collect remote sample (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

Pushed.

After thinking about it a bit more, I decided to rip out the 10%
sampling rate inflation. Stale reltuples may be an issue, but I
couldn't convince myself this would be an improvement overall, or that
this is the right place to deal with it.

Firstly, if reltuples is too low (i.e. when the table grew), everything
is "fine" - we sample more than targrows on the remote side and then
discard the excess locally. We end up with the same sample size as
without this patch - which is not necessarily the right sample size
(more about that in a minute), but it was good enough until now, and
improving that bit was not the ambition of this patch.
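
To make that concrete, here's a minimal standalone sketch of the
too-low case (illustrative numbers and names, not the committed code),
assuming the fraction we put into the remote TABLESAMPLE clause is
derived roughly as targrows / reltuples, capped at 100%:

    #include <stdio.h>

    int main(void)
    {
        double targrows = 30000;     /* rows ANALYZE wants in the sample */
        double reltuples = 100000;   /* stale estimate - the table grew */
        double actual_rows = 200000; /* what the remote table really holds */

        /* fraction used in the remote TABLESAMPLE clause, capped at 100% */
        double frac = targrows / reltuples;
        if (frac > 1.0)
            frac = 1.0;

        /* expected number of rows the remote sampling returns; anything
         * over targrows gets discarded by the local reservoir sampling */
        double expected = frac * actual_rows;

        printf("sample fraction: %.0f%%\n", frac * 100);
        printf("expected remote rows: %.0f (want %.0f)\n",
               expected, targrows);
        return 0;
    }

With these numbers we ask for a 30% sample of a table that actually
holds 200k rows, get ~60k back, and trim that to 30k locally - the same
sample size as before, just with more rows shipped than necessary.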

If reltuples is too high (e.g. right after a bulk DELETE) we may end up
sampling too few rows. If reltuples is 2x the actual row count, we'll
end up sampling only 15k rows instead of 30k. But it's unclear how much
we should adjust the value - by 10%, as the patch did (loosely based on
autovacuum_analyze_scale_factor being 10%)?

I'd argue that's too low to make any difference - cases that are only
~10% off should work fine (we don't have perfect stats anyway), and for
the really bad cases 10% is not going to make a difference.
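
Here's the same style of sketch for the over-estimate case, with the
10% inflation bolted on to show how little it buys (again made-up
numbers, not the committed code):

    #include <stdio.h>

    int main(void)
    {
        double targrows = 30000;
        double actual_rows = 100000;
        double reltuples = 200000;  /* 2x too high, e.g. after bulk DELETE */

        /* fraction used in the remote TABLESAMPLE clause */
        double frac = targrows / reltuples;

        /* rows we can expect back: half of what we wanted */
        printf("rows sampled: %.0f of %.0f wanted\n",
               frac * actual_rows, targrows);

        /* the inflation the patch no longer does: bump the rate by 10% */
        double inflated = frac * 1.10;
        printf("with 10%% inflation: %.0f\n", inflated * actual_rows);
        return 0;
    }

That gets us from 15k rows to ~16.5k - still nowhere near the 30k
target.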

And if reltuples is that far off, it'll probably affect even queries
planned on the remote node (directly or pushed down), so I guess the
right fix is to make sure reltuples is not actually off - make
autovacuum more aggressive or whatever.


A thing that bothers me is that calculating the sample size for
partitioned tables is a bit cyclic - we look at the number of blocks in
each child and divide the whole sample based on that. Block counts are
generally correlated with reltuples, but they may also be wrong - and
not necessarily in the same way. For example, you might delete 90% of a
table, which will make the table overrepresented in the sample (the
blocks stay around until vacuum truncates them), while its reltuples
may be quite accurate thanks to a recent vacuum - so fixing this by
blindly adjusting the sampling rate seems misguided. If this turns out
to be an issue in practice, we need to rethink
acquire_inherited_sample_rows as a whole (not just for foreign tables).
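
For illustration, a simplified standalone sketch of that
block-proportional division (loosely modeled on what
acquire_inherited_sample_rows does; names and numbers are made up):

    #include <stdio.h>

    int main(void)
    {
        double targrows = 30000;
        double childblocks[] = {800, 100, 100}; /* relpages per child */
        int nchildren = 3;
        double totalblocks = 0;

        for (int i = 0; i < nchildren; i++)
            totalblocks += childblocks[i];

        /*
         * Each child gets a share of the sample proportional to its
         * block count. A child that had 90% of its rows deleted keeps
         * its blocks until vacuum truncates them, so it stays
         * over-represented here even if its reltuples is perfectly
         * accurate - tweaking the sampling rate can't fix that.
         */
        for (int i = 0; i < nchildren; i++)
            printf("child %d: %.0f rows of the sample\n",
                   i, targrows * childblocks[i] / totalblocks);
        return 0;
    }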


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


