Re: TABLESAMPLE patch is really in pretty sad shape - Mailing list pgsql-hackers
From | Petr Jelinek |
---|---|
Subject | Re: TABLESAMPLE patch is really in pretty sad shape |
Date | |
Msg-id | 55A79BC3.30301@2ndquadrant.com Whole thread Raw |
In response to | Re: TABLESAMPLE patch is really in pretty sad shape (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: TABLESAMPLE patch is really in pretty sad shape
|
List | pgsql-hackers |
On 2015-07-13 15:39, Tom Lane wrote: > > Datum > tsm_system_rows_init(PG_FUNCTION_ARGS) > { > TableSampleDesc *tsdesc = (TableSampleDesc *) PG_GETARG_POINTER(0); > uint32 seed = PG_GETARG_UINT32(1); > int32 ntuples = PG_ARGISNULL(2) ? -1 : PG_GETARG_INT32(2); > > This is rather curious coding. Why should this function check only > one of its arguments for nullness? If it needs to defend against > any of them being null, it really needs to check all. But in point of > fact, this function is declared STRICT, which means it's a violation of > the function call protocol if the core code ever passes a null to it, > and so this test ought to be dead code. > > A similar pattern of ARGISNULL checks in declared-strict functions exists > in all the tablesample modules, not just this one, showing that this is an > overall design error not just a thinko here. My inclination would be to > make the core code enforce non-nullness of all tablesample arguments so as > to make it unnecessary to check strictness of the tsm*init functions, but > in any case it is Not Okay to just pass nulls willy-nilly to strict C > functions. > The reason for this ugliness came from having to have balance between modularity and following the SQL Standard error codes for BERNOULLI and SYSTEM, which came as issue during reviews (the original code did the checks before calling the sampling method's functions but produced just generic error code about wrong parameter). I considered it as okayish mainly because those functions are not SQL callable and the code which calls them knows how to handle it correctly, but I understand why you don't. I guess if we did what you proposed in another email in this thread - don't have the API exposed on SQL level, we could send the additional parameters as List * and leave the validation completely to the function. (And maybe don't allow NULLs at all) > Also, I find this coding pretty sloppy even without the strictness angle, > because the net effect of this way of dealing with nulls is that an > argument-must-not-be-null complaint is reported as "out of range", > which is both confusing and the wrong ERRCODE. > > if (ntuples < 1) > ereport(ERROR, > (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), > errmsg("invalid sample size"), > errhint("Sample size must be positive integer value."))); > > I don't find this to be good error message style. The secondary comment > is not a "hint", it's an ironclad statement of what you did wrong, so if > we wanted to phrase it like this it should be an errdetail not errhint. > But the whole thing is overly cute anyway because there is no reason at > all not to just say what we mean in the primary error message, eg > ereport(ERROR, > (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), > errmsg("sample size must be greater than zero"))); > Same as above, although now that I re-read the standard I am sure I misunderstand it the first time - it says: "If S is the null value or if S < 0 (zero) or if S > 100, then an exception condition is raised: data exception — invalid sample size." I took it as literal error message originally but they just mean status code by it. > While we're on the subject, what's the reason for disallowing a sample > size of zero? Seems like a reasonable edge case. Yes that's a bug. > > /* All blocks have been read, we're done */ > if (sampler->doneblocks > sampler->nblocks || > sampler->donetuples >= sampler->ntuples) > PG_RETURN_UINT32(InvalidBlockNumber); > > Okay, I lied, I *am* going to complain about this comment. Comments that > do not accurately describe the code they're attached to are worse than > useless. > That's copy-pasto from the tsm_system_time. > In short, I'd do something more like > > uint32 r; > > /* Safety check to avoid infinite loop or zero result for small n. */ > if (n <= 1) > return 1; > > /* > * This should only take 2 or 3 iterations as the probability of 2 numbers > * being relatively prime is ~61%; but just in case, we'll include a > * CHECK_FOR_INTERRUPTS in the loop. > */ > do { > CHECK_FOR_INTERRUPTS(); > r = (uint32) (sampler_random_fract(randstate) * n); > } while (r == 0 || gcd(r, n) > 1); > > Note however that this coding would result in an unpredictable number > of iterations of the RNG, which might not be such a good thing if we're > trying to achieve repeatability. It doesn't matter in the context of > this module since the RNG is not used after initialization, but it would > matter if we then went on to do Bernoulli-style sampling. (Possibly that > could be dodged by reinitializing the RNG after the initialization steps.) > Bernoulli-style sampling does not need this kind of code so it's not really an issue. That is unless you'd like to combine the linear probing and bernoulli of course, but I don't see any benefit in doing that. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
pgsql-hackers by date: