Re: TABLESAMPLE patch is really in pretty sad shape - Mailing list pgsql-hackers

From Petr Jelinek
Subject Re: TABLESAMPLE patch is really in pretty sad shape
Date
Msg-id 55A79BC3.30301@2ndquadrant.com
Whole thread Raw
In response to Re: TABLESAMPLE patch is really in pretty sad shape  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: TABLESAMPLE patch is really in pretty sad shape
List pgsql-hackers
On 2015-07-13 15:39, Tom Lane wrote:
>
> Datum
> tsm_system_rows_init(PG_FUNCTION_ARGS)
> {
>      TableSampleDesc *tsdesc = (TableSampleDesc *) PG_GETARG_POINTER(0);
>      uint32      seed = PG_GETARG_UINT32(1);
>      int32       ntuples = PG_ARGISNULL(2) ? -1 : PG_GETARG_INT32(2);
>
> This is rather curious coding.  Why should this function check only
> one of its arguments for nullness?  If it needs to defend against
> any of them being null, it really needs to check all.  But in point of
> fact, this function is declared STRICT, which means it's a violation of
> the function call protocol if the core code ever passes a null to it,
> and so this test ought to be dead code.
>
> A similar pattern of ARGISNULL checks in declared-strict functions exists
> in all the tablesample modules, not just this one, showing that this is an
> overall design error not just a thinko here.  My inclination would be to
> make the core code enforce non-nullness of all tablesample arguments so as
> to make it unnecessary to check strictness of the tsm*init functions, but
> in any case it is Not Okay to just pass nulls willy-nilly to strict C
> functions.
>

The reason for this ugliness came from having to have balance between 
modularity and following the SQL Standard error codes for BERNOULLI and 
SYSTEM, which came as issue during reviews (the original code did the 
checks before calling the sampling method's functions but produced just 
generic error code about wrong parameter). I considered it as okayish 
mainly because those functions are not SQL callable and the code which 
calls them knows how to handle it correctly, but I understand why you don't.

I guess if we did what you proposed in another email in this thread - 
don't have the API exposed on SQL level, we could send the additional 
parameters as List * and leave the validation completely to the 
function. (And maybe don't allow NULLs at all)

> Also, I find this coding pretty sloppy even without the strictness angle,
> because the net effect of this way of dealing with nulls is that an
> argument-must-not-be-null complaint is reported as "out of range",
> which is both confusing and the wrong ERRCODE.
>
>      if (ntuples < 1)
>          ereport(ERROR,
>                  (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
>                   errmsg("invalid sample size"),
>                   errhint("Sample size must be positive integer value.")));
>
> I don't find this to be good error message style.  The secondary comment
> is not a "hint", it's an ironclad statement of what you did wrong, so if
> we wanted to phrase it like this it should be an errdetail not errhint.
> But the whole thing is overly cute anyway because there is no reason at
> all not to just say what we mean in the primary error message, eg
>          ereport(ERROR,
>                  (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
>                   errmsg("sample size must be greater than zero")));
>

Same as above, although now that I re-read the standard I am sure I 
misunderstand it the first time - it says:
"If S is the null value or if S < 0 (zero) or if S > 100, then an 
exception condition is raised: data exception — invalid sample size."

I took it as literal error message originally but they just mean status 
code by it.

> While we're on the subject, what's the reason for disallowing a sample
> size of zero?  Seems like a reasonable edge case.

Yes that's a bug.

>
>      /* All blocks have been read, we're done */
>      if (sampler->doneblocks > sampler->nblocks ||
>          sampler->donetuples >= sampler->ntuples)
>          PG_RETURN_UINT32(InvalidBlockNumber);
>
> Okay, I lied, I *am* going to complain about this comment.  Comments that
> do not accurately describe the code they're attached to are worse than
> useless.
>

That's copy-pasto from the tsm_system_time.

> In short, I'd do something more like
>
>      uint32      r;
>
>      /* Safety check to avoid infinite loop or zero result for small n. */
>      if (n <= 1)
>          return 1;
>
>      /*
>       * This should only take 2 or 3 iterations as the probability of 2 numbers
>       * being relatively prime is ~61%; but just in case, we'll include a
>       * CHECK_FOR_INTERRUPTS in the loop.
>       */
>      do {
>          CHECK_FOR_INTERRUPTS();
>          r = (uint32) (sampler_random_fract(randstate) * n);
>      } while (r == 0 || gcd(r, n) > 1);
>
> Note however that this coding would result in an unpredictable number
> of iterations of the RNG, which might not be such a good thing if we're
> trying to achieve repeatability.  It doesn't matter in the context of
> this module since the RNG is not used after initialization, but it would
> matter if we then went on to do Bernoulli-style sampling.  (Possibly that
> could be dodged by reinitializing the RNG after the initialization steps.)
>

Bernoulli-style sampling does not need this kind of code so it's not 
really an issue. That is unless you'd like to combine the linear probing 
and bernoulli of course, but I don't see any benefit in doing that.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Kyotaro HORIGUCHI
Date:
Subject: Re: multivariate statistics / patch v7
Next
From: Amit Kapila
Date:
Subject: Re: optimizing vacuum truncation scans