Re: postgres_fdw: using TABLESAMPLE to collect remote sample - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: postgres_fdw: using TABLESAMPLE to collect remote sample |
Date | |
Msg-id | 560c80fc-823e-418d-280e-e0c7c0dad59b@enterprisedb.com Whole thread Raw |
In response to | Re: postgres_fdw: using TABLESAMPLE to collect remote sample (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
Responses |
Re: postgres_fdw: using TABLESAMPLE to collect remote sample
|
List | pgsql-hackers |
On 1/6/23 23:41, Tomas Vondra wrote: > On 1/6/23 17:58, Tom Lane wrote: >> Tomas Vondra <tomas.vondra@enterprisedb.com> writes: >>> The one difference is that I realized the relkind check does not >>> actually say we can't do sampling - it just means we can't use >>> TABLESAMPLE to do it. We could still use "random()" ... >> >>> Furthermore, I don't think we should silently disable sampling when the >>> user explicitly requests TABLESAMPLE by specifying bernoulli/system for >>> the table - IMHO it's less surprising to just fail in that case. >> >> Agreed on both points. This patch looks good to me. >> > > Good, I'll get this committed.The "ORDER BY random()" idea is a possible > improvement, can be discussed on it's own. > Pushed. >>> Of course, all relkinds that don't support TABLESAMPLE currently have >>> reltuples value that will disable sampling anyway (e.g. views have -1). >>> So we won't actually fallback to random() anyway, because we can't >>> calculate the sample fraction. >>> That's a bit annoying for foreign tables pointing at a view, which is a >>> more likely use case than table pointing at a sequence. >> >> Right, that's a case worth being concerned about. >> >>> But I realized we could actually still do "random()" sampling: >>> SELECT * FROM t ORDER BY random() LIMIT $X; >> >> Hmm, interesting idea, but it would totally bollix our correlation >> estimates. Not sure that those are worth anything for remote views, >> but still... > > But isn't that an issue that we already have? I mean, if we do ANALYZE > on a foreign table pointing to a view, we fetch all the results. But if > the view does not have a well-defined ORDER BY, a trivial plan change > may change the order of results and thus the correlation. > > Actually, how is a correlation even defined for a view? > > It's true this "ORDER BY random()" thing would make it less stable, as > it would change the correlation on every run. Although - the calculated > correlation is actually quite stable, because it's guaranteed to be > pretty close to 0 because we make the order random. > > Maybe that's actually better than the current state where it depends on > the plan? Why not to look at the relkind and just set correlation to 0.0 > in such cases? > > But if we want to restore that current behavior (where it depends on the > actual query plan), we could do something like this: > > SELECT * FROM the_remote_view ORDER BY row_number() over (); > > But yeah, this makes the remote sampling more expensive. Probably still > a win because of network costs, but not great. > I've been thinking about this a bit more, and I'll consider submitting a patch for the next CF. IMHO it's probably better to accept correlation being 0 in these cases - it's more representative of what we know about the view / plan output (or rather lack of knowledge). However, maybe views are not the best / most common example to think about. I'd imagine it's much more common to reference a regular table, but the table gets truncated / populated quickly, and/or the autovacuum workers are busy so it takes time to update reltuples. But in this case it's also quite simple to fix the correlation by just ordering by ctid (which I guess we might do based on the relkind). There's a minor issue with partitioned tables, with foreign partitions pointing to remote views. This is kinda broken, because the sample size for individual relations is determined based on relpages. But that's 0 for views, so these partitions get ignored when building the sample. But that's a pre-existing issue. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: