Re: postgres_fdw: using TABLESAMPLE to collect remote sample - Mailing list pgsql-hackers

From Sofia Kopikova
Subject Re: postgres_fdw: using TABLESAMPLE to collect remote sample
Date
Msg-id 1645016218.115729855@f745.i.mail.ru
Whole thread Raw
In response to postgres_fdw: using TABLESAMPLE to collect remote sample  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
Hello,
 
I find it a great idea to use TABLESAMPLE in postgres_fdw ANALYZE.
Let me offer you some ideas how to resolve you problems.
 
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> The other issue is which sampling method to use - we have SYSTEM and 
> BERNOULLI built in, and the tsm_system_rows as an extension (and _time,
> but that's not very useful here). I guess we'd use one of the built-in
> ones, because that'll work on more systems out of the box.
 
It’s hard to choose between speed and quality, but I think we need
SYSTEM method here. This patch is for speeding-up ANALYZE,
and SYSTEM method will faster than BERNOULLI on fraction
values to 50%.
 
> But that leads to the main issue - determining the fraction of rows to
> sample. We know how many rows we want to sample, but we have no idea how
> many rows there are in total. We can look at reltuples, but we can't be
> sure how accurate / up-to-date that value is.
>
> The patch just trusts it unless it's obviously bogus (-1, 0, etc.) and
> applies some simple sanity checks, but I wonder if we need to do more
> (e.g. look at relation size and adjust reltuples by current/relpages).
 
I found a query on Stackoverflow (it does similar thing to what
estimate_rel_size does) that may help with it. So that makes
tsm_system_rows unnecessary.
 
----------------------------------------------------------------------
SELECT
    (CASE WHEN relpages = 0 THEN float8 '0'
                ELSE reltuples / relpages END
    * (pg_relation_size(oid) / pg_catalog.current_setting('block_size')::int)
    )::bigint
FROM pg_class c
WHERE c.oid = 'tablename'::regclass;
----------------------------------------------------------------------
 
And one more refactoring note. New function deparseAnalyzeSampleSql
duplicates function deparseAnalyzeSql (is previous in file deparse.c)
except for the new last line. I guess it would be better to add new
parameter — double sample_frac — to existing function
deparseAnalyzeSql and use it as a condition for adding
"TABLESAMPLE SYSTEM..." to SQL query (set it to zero when
do_sample is false). Or you may also add do_sample as a parameter to
deparseAnalyzeSql, but as for me that’s redundantly.
 
 
--
Sofia Kopikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 
 
 

pgsql-hackers by date:

Previous
From: John Naylor
Date:
Subject: Re: minor code correction in typecmds.c
Next
From: Jim Mlodgenski
Date:
Subject: Re: support for CREATE MODULE