From: Tom Lane
Subject: Re: TABLESAMPLE patch is really in pretty sad shape
Msg-id: 12064.1436803249@sss.pgh.pa.us
In response to: Re: TABLESAMPLE patch is really in pretty sad shape (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: TABLESAMPLE patch is really in pretty sad shape
           Re: TABLESAMPLE patch is really in pretty sad shape
List: pgsql-hackers
I wrote:
> TBH, I think the right thing to do at this point is to revert the entire
> patch and send it back for ground-up rework.  I think the high-level
> design is wrong in many ways and I have about zero confidence in most
> of the code details as well.

> I'll send a separate message about high-level issues,

And here's that.  I do *not* claim that this is a complete list of design
issues with the patch; it's just the things I happened to notice in the
amount of time I've spent so far (which is already way more than I wanted
to spend on TABLESAMPLE right now).


I'm not sure that we need an extensible sampling-method capability at all,
let alone that an API for that needs to be the primary focus of a patch.
Certainly the offered examples of extension modules aren't inspiring:
tsm_system_time is broken by design (more about that below) and nobody
would miss tsm_system_rows if it weren't there.  What is far more
important is to get sampling capability into indexscans, and designing
an API ahead of time seems like mostly a distraction from that.

I'd think seriously about tossing the entire executor-stage part of the
patch, creating a stateless (history-independent) sampling filter that
just accepts or rejects TIDs, and sticking calls to that into all the
table scan node types.  Once you had something like that working well
it might be worth considering whether to try to expose an API to
generalize it.  But even then it's not clear that we really need any
more options than true-Bernoulli and block-level sampling.
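To illustrate what I mean by "stateless": something on the order of the
sketch below, where the accept/reject verdict is a pure function of the
seed and the TID.  (The typedefs and the particular hash are stand-ins,
not server code; cutoff is just the sampling fraction scaled to 2^32.)

#include <stdint.h>
#include <stdbool.h>

/* Stand-ins for the server's typedefs, to keep the sketch self-contained. */
typedef uint32_t BlockNumber;
typedef uint16_t OffsetNumber;

/* Cheap 32-bit mixer (the MurmurHash3 finalizer); any decent hash would do. */
static uint32_t
mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6b;
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    h ^= h >> 16;
    return h;
}

/*
 * Stateless Bernoulli filter: accept a TID iff a hash of (seed, block,
 * offset) falls below the cutoff, e.g. cutoff = fraction * 4294967295.
 * Because the verdict depends on nothing but the arguments, the same
 * function can be called from a seqscan, an indexscan, or a bitmap heap
 * scan, and rescans necessarily reproduce the identical sample.
 */
static bool
sample_accepts_tid(uint32_t seed, BlockNumber blk, OffsetNumber off,
                   uint32_t cutoff)
{
    return mix32(seed ^ mix32(blk) ^ mix32((uint32_t) off)) < cutoff;
}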

The IBM paper I linked to in the other thread mentions that their
optimizer will sometimes choose to do Bernoulli sampling even if SYSTEM
was requested.  Probably that happens when it decides to do a simple
indexscan, because then there's no advantage to trying to cluster the
sampled rows.  But in the case of a bitmap scan, you could very easily do
either true Bernoulli or block-level sampling simply by adjusting the
rules about which bits you keep or clear in the bitmap (given that you
apply the filter between collection of the index bitmap and accessing the
heap, which seems natural).  The only case where a special scan type
really seems to be needed is if you want block-level sampling, the query
would otherwise use a seqscan, *and* the sampling percentage is pretty low
--- if you'd be reading every second or third block anyway, you're likely
better off with a plain seqscan so that the kernel sees sequential access
and does prefetching.  The current API doesn't seem to make it possible to
switch between seqscan and read-only-selected-blocks based on the sampling
percentage, but I think that could be an interesting optimization.
(Another bet that's been missed is having the samplescan logic request
prefetching when it is doing selective block reads.  The current API can't
support that AFAICS, since there's no expectation that nextblock calls
could be done asynchronously from nexttuple calls.)
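To make the bitmap-filtering point concrete, here's roughly what I have
in mind, reusing mix32() and the stand-in typedefs from the sketch above
(the PageBits struct is a toy, not the real tidbitmap representation):

typedef struct PageBits
{
    BlockNumber blockno;
    uint64_t    offsets;    /* bit N set => line pointer N is a candidate */
} PageBits;

/* SYSTEM-style (block-level) sampling: one verdict for the whole page. */
static void
filter_page_system(PageBits *page, uint32_t seed, uint32_t cutoff)
{
    if (mix32(seed ^ page->blockno) >= cutoff)
        page->offsets = 0;      /* clear every TID on this block */
}

/* BERNOULLI-style sampling: an independent verdict per TID. */
static void
filter_page_bernoulli(PageBits *page, uint32_t seed, uint32_t cutoff)
{
    for (int off = 0; off < 64; off++)
    {
        if (((page->offsets >> off) & 1) &&
            !sample_accepts_tid(seed, page->blockno,
                                (OffsetNumber) off, cutoff))
            page->offsets &= ~(UINT64_C(1) << off);
    }
}

Both behaviors fall out of where you flip bits in the bitmap; no special
scan node is required for either one.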

Another issue with the API as designed is the examinetuple() callback.
Allowing sample methods to see invisible tuples is quite possibly the
worst idea in the whole patch.  They can *not* do anything with such
tuples, or they'll totally break reproducibility: if the tuple is
invisible to your scan, it might well be or soon become invisible to
everybody, whereupon it would be subject to garbage collection at the
drop of a hat.  So if an invisible tuple affects the sample method's
behavior at all, repeated scans in the same query would not produce
identical results, which as mentioned before is required both by spec
and for minimally sane query behavior.  Moreover, examining the contents
of the tuple is unsafe (it might contain pointers to TOAST values that
no longer exist); and even if it were safe, what's the point?  Sampling
that pays attention to the data is the very definition of biased.  So
if we do re-introduce an API like the current one, I'd definitely lose
this bit and only allow sample methods to consider visible tuples.
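In code terms, the ordering I'd insist on is no more complicated than
this (tuple_is_visible_to_scan() and method_wants_tuple() are made-up
stand-ins for the visibility test and the method's per-tuple hook):

/* The sampling hook runs strictly after the visibility test, so a
 * sample method can never observe a dead tuple, let alone its contents. */
extern bool tuple_is_visible_to_scan(const void *tup);  /* stand-in */
extern bool method_wants_tuple(const void *tup);        /* hypothetical hook */

static bool
scan_emits_tuple(const void *tup)
{
    if (!tuple_is_visible_to_scan(tup))
        return false;       /* the method is never even consulted */
    return method_wants_tuple(tup);
}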

On the point of reproducibility: the tsm_system_time method is utterly
incapable of producing identical results across repeated scans, because
there is no reason to believe it would get exactly as far in the same
amount of time each time.  This might be all right across queries if
the method could refuse to work with REPEATABLE clauses (but there's
no provision for such a restriction in the current API).  But it's not
acceptable within a query.  Another problem with tsm_system_time is that
its cost/rowcount estimation is based on nothing more than wishful
thinking, and can never be based on anything more than wishful thinking,
because planner cost units are not time-based.  Adding a comment that
says we'll nonetheless pretend they are milliseconds isn't a solution.
So that sampling method really has to go away and never come back,
whatever we might ultimately salvage from this patch otherwise.

(I'm not exactly convinced that the system or tsm_system_rows methods
are adequately reproducible either, given that their sampling pattern
will change when the relation block count changes.  Perhaps that's
unavoidable, but it seems like it might be possible to define things
in such a way that adding blocks doesn't change which existing blocks
get sampled.)
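For what it's worth, the hash-based approach sketched earlier gets that
property for free: since the verdict below depends only on (seed,
blockno), growing the relation from N to N+k blocks can add newly
sampled blocks but can never change the verdict on blocks 0..N-1.  A
scheme that draws "k out of nblocks" blocks does not have this property,
because every block's fate depends on the total count.

/* Reusing mix32() from the first sketch: membership in the sample is
 * independent of the relation's current block count. */
static bool
block_in_sample(uint32_t seed, BlockNumber blockno, uint32_t cutoff)
{
    return mix32(seed ^ blockno) < cutoff;
}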

A more localized issue that I noticed is that nowhere is it documented
what kind of value the REPEATABLE parameter takes.  Digging in the code eventually
reveals that the value is assignment-coerced to an int4, which I find
rather problematic because a user might reasonably assume that the
parameter works like setseed's parameter (float8 in the range -1 to 1).
If he does, then he'll get no errors and identical samples from, say,
REPEATABLE(0.1) and REPEATABLE(0.2), which is bad.  On the other hand,
it looks like DB2 allows integer values, so implementing it just like
setseed might cause problems for people coming from DB2.  I'm inclined to
suggest that we should define the parameter as being any float8 value,
and obtain a seed from it with hashfloat8().  That way, no matter whether
users think that usable values are fractional or integral, they'll get
sane behavior with different supplied seeds almost always producing
different samples.
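Concretely, the coercion I'm suggesting would be about this much code (a
sketch against the existing fmgr machinery; seed_from_repeatable() is a
made-up name):

#include "postgres.h"
#include "access/hash.h"
#include "fmgr.h"

/*
 * Accept any float8 REPEATABLE value and hash it down to a seed, so that
 * REPEATABLE(0.1) and REPEATABLE(0.2) yield different samples, whereas
 * assignment-coercion to int4 rounds both to zero.
 */
static uint32
seed_from_repeatable(float8 value)
{
    return DatumGetUInt32(DirectFunctionCall1(hashfloat8,
                                              Float8GetDatum(value)));
}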
        regards, tom lane


