Re: postgres_fdw: using TABLESAMPLE to collect remote sample - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: postgres_fdw: using TABLESAMPLE to collect remote sample
Date
Msg-id 560c80fc-823e-418d-280e-e0c7c0dad59b@enterprisedb.com
Whole thread Raw
In response to Re: postgres_fdw: using TABLESAMPLE to collect remote sample  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: postgres_fdw: using TABLESAMPLE to collect remote sample
List pgsql-hackers
On 1/6/23 23:41, Tomas Vondra wrote:
> On 1/6/23 17:58, Tom Lane wrote:
>> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>>> The one difference is that I realized the relkind check does not
>>> actually say we can't do sampling - it just means we can't use
>>> TABLESAMPLE to do it. We could still use "random()" ...
>>
>>> Furthermore, I don't think we should silently disable sampling when the
>>> user explicitly requests TABLESAMPLE by specifying bernoulli/system for
>>> the table - IMHO it's less surprising to just fail in that case.
>>
>> Agreed on both points.  This patch looks good to me.
>>
> 
> Good, I'll get this committed.The "ORDER BY random()" idea is a possible
> improvement, can be discussed on it's own.
> 

Pushed.

>>> Of course, all relkinds that don't support TABLESAMPLE currently have
>>> reltuples value that will disable sampling anyway (e.g. views have -1).
>>> So we won't actually fallback to random() anyway, because we can't
>>> calculate the sample fraction.
>>> That's a bit annoying for foreign tables pointing at a view, which is a
>>> more likely use case than table pointing at a sequence.
>>
>> Right, that's a case worth being concerned about.
>>
>>> But I realized we could actually still do "random()" sampling:
>>>     SELECT * FROM t ORDER BY random() LIMIT $X;
>>
>> Hmm, interesting idea, but it would totally bollix our correlation
>> estimates.  Not sure that those are worth anything for remote views,
>> but still...
> 
> But isn't that an issue that we already have? I mean, if we do ANALYZE
> on a foreign table pointing to a view, we fetch all the results. But if
> the view does not have a well-defined ORDER BY, a trivial plan change
> may change the order of results and thus the correlation.
> 
> Actually, how is a correlation even defined for a view?
> 
> It's true this "ORDER BY random()" thing would make it less stable, as
> it would change the correlation on every run. Although - the calculated
> correlation is actually quite stable, because it's guaranteed to be
> pretty close to 0 because we make the order random.
> 
> Maybe that's actually better than the current state where it depends on
> the plan? Why not to look at the relkind and just set correlation to 0.0
> in such cases?
> 
> But if we want to restore that current behavior (where it depends on the
> actual query plan), we could do something like this:
> 
>    SELECT * FROM the_remote_view ORDER BY row_number() over ();
> 
> But yeah, this makes the remote sampling more expensive. Probably still
> a win because of network costs, but not great.
> 

I've been thinking about this a bit more, and I'll consider submitting a
patch for the next CF. IMHO it's probably better to accept correlation
being 0 in these cases - it's more representative of what we know about
the view / plan output (or rather lack of knowledge).

However, maybe views are not the best / most common example to think
about. I'd imagine it's much more common to reference a regular table,
but the table gets truncated / populated quickly, and/or the autovacuum
workers are busy so it takes time to update reltuples. But in this case
it's also quite simple to fix the correlation by just ordering by ctid
(which I guess we might do based on the relkind).

There's a minor issue with partitioned tables, with foreign partitions
pointing to remote views. This is kinda broken, because the sample size
for individual relations is determined based on relpages. But that's 0
for views, so these partitions get ignored when building the sample. But
that's a pre-existing issue.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Dean Rasheed
Date:
Subject: Re: MERGE ... WHEN NOT MATCHED BY SOURCE
Next
From: Tom Lane
Date:
Subject: Re: postgres_fdw: using TABLESAMPLE to collect remote sample