Re: extend pgbench expressions with functions - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: extend pgbench expressions with functions
Date
Msg-id alpine.DEB.2.10.1511052018410.8244@sto
In response to Re: extend pgbench expressions with functions  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: extend pgbench expressions with functions  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hello Robert,

>>> 2. ddebug and idebug seem like a lousy idea to me.
>>
>> It was really useful to me for debugging and testing.
>
> That doesn't mean it belongs in the final patch.

I think it is useful when debugging a script, not just for debugging the 
evaluation code itself.

>>> 3. I'm perplexed by why you've replaced evaluateExpr() with evalInt()
>>> and evalDouble().
>>
>> Basically, the code is significantly shorter and elegant with this option.
>
> I find that pretty hard to swallow.

Hmmm, maybe it will be easier to swallow with some more explanation?

> If the backend took this approach,

Sure, I would never suggest doing anything like that in the backend!

> we'd have a separate evaluate function for every datatype, which would 
> make it completely impossible to support the creation of new datatypes.

In the backend the implementation is generic about types (a type can be 
added to the system dynamically), which is another level of abstraction 
entirely; it does not compare in any way.

> And I find this code to be quite difficult to follow.

It is really the function *you* wrote, and there is just one version for
int and one for double.

> What I think we should have is a type like PgBenchValue that represents 
> a value which may be an integer or a double or whatever else we want to 
> support, but not an expression - specifically a value. Then when you 
> invoke a function or an operator it gets passed one or more PgBenchValue 
> objects and can do whatever it wants with those, storing the result into 
> an output PgBenchValue. So add might look like this:

Sure, I do know what it looks like, and I want to avoid it, because it is 
just a lot of ugly code which is useless for pgbench's purposes.

> void
> add(PgBenchValue *result, PgBenchValue *x, PgBenchValue *y)
> {
>    if (x->type == PGBT_INTEGER && y->type == PGBT_INTEGER)
>    {
>        result->type = PGBT_INTEGER;
>        result->u.ival = x->u.ival + y->u.ival;
>        return;
>    }
>    if (x->type == PGBT_DOUBLE && y->type == PGBT_DOUBLE)
>    {
>        result->type = PGBT_DOUBLE;
>        result->u.dval = x->u.dval + y->u.dval;
>        return;
>    }
>    /* cross-type cases, if desired, go here */

Why reject 1 + 2.0? So the cross-type cases are really required for user 
sanity, which adds:

    if (x->type == PGBT_DOUBLE && y->type == PGBT_INTEGER)
    {
        result->type = PGBT_DOUBLE;
        result->u.dval = x->u.dval + y->u.ival;
        return;
    }
    if (x->type == PGBT_INTEGER && y->type == PGBT_DOUBLE)
    {
        result->type = PGBT_DOUBLE;
        result->u.dval = x->u.ival + y->u.dval;
        return;
    }

> }

For the '+' operator overloaded with conversions there are 4 cases 
(2 types ** 2 arguments) to handle. For all 5 binary operators (+ - * / %), 
that makes 20 cases to handle. Then every function has to deal with type 
conversion as well: number of types ** number of arguments cases per 
function. That is 2 cases for abs, 4 cases for random, 8 cases for each 
random_exp*, 8 for random_gaus*, and so on. Some thinking would be 
required for n-ary functions (min & max).
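
To make the counting concrete, here is what a second operator, say '*', 
might look like under that scheme (my illustration along the lines of the 
add() above, not code from any patch); the same four branches have to be 
spelled out again for every binary operator and function:

/* hypothetical '*' under the PgBenchValue scheme: 2 types ** 2 args = 4 branches */
void
mul(PgBenchValue *result, PgBenchValue *x, PgBenchValue *y)
{
    if (x->type == PGBT_INTEGER && y->type == PGBT_INTEGER)
    {
        result->type = PGBT_INTEGER;
        result->u.ival = x->u.ival * y->u.ival;
    }
    else if (x->type == PGBT_INTEGER && y->type == PGBT_DOUBLE)
    {
        result->type = PGBT_DOUBLE;
        result->u.dval = x->u.ival * y->u.dval;
    }
    else if (x->type == PGBT_DOUBLE && y->type == PGBT_INTEGER)
    {
        result->type = PGBT_DOUBLE;
        result->u.dval = x->u.dval * y->u.ival;
    }
    else    /* both PGBT_DOUBLE */
    {
        result->type = PGBT_DOUBLE;
        result->u.dval = x->u.dval * y->u.dval;
    }
}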

Basically, it can be done, there is no technical issue, it is just a matter 
of writing a *lot* of repetitive code, hundreds of lines of it. As I think 
it does not bring any value for pgbench's purposes, I used the other 
approach, which reduces the code size by avoiding the combinatorial 
"cross-type" conversions.

> The way you have it, the logic for every function and operator exists
> in two copies: one in the integer function, and the other in the
> double function.

Yep, but there are only 2 cases to handle, instead of 4 as in the above 
example; that is the point.

> As soon as we add another data type - strings, dates, whatever - we'd 
> need to have cases for all of these things in those functions as well,
> and every time we add a function, we have to update all the case 
> statements for every datatype's evaluation function.  That's just not a 
> scalable model.

On the contrary, it is reasonably scalable:

With the eval-per-type approach, for every added type one implements one 
eval function that handles all functions and operators, each eval function 
being roughly the same size as the current ones.
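
To illustrate, a very rough sketch of the shape of that code (simplified, 
with made-up field and constant names, not the actual patch code):

/* sketch only: one evaluator per type, each operator/function handled once */
static int64
evalInt(PgBenchExpr *expr)
{
    switch (expr->op)
    {
        case PGBENCH_ADD:
            return evalInt(expr->larg) + evalInt(expr->rarg);
        /* ... one case per operator and function ... */
    }
    return 0;       /* not reached */
}

static double
evalDouble(PgBenchExpr *expr)
{
    switch (expr->op)
    {
        case PGBENCH_ADD:
            return evalDouble(expr->larg) + evalDouble(expr->rarg);
        /* ... the same list of cases, for doubles: 2 cases per item, not 4 ... */
    }
    return 0.0;     /* not reached */
}

Adding an operator means adding one case in each eval function, and adding 
a type means adding one eval function; the cross-type combinations never 
appear anywhere.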

With the cross-type approach, the overloaded add function, which handles 2 
operands, would with 3 types ('post' + 'gres' -> 'postgres') have to deal 
with 3**2 = 9 type combinations instead of 4, basically more than doubling 
the code size. If you add dates on top of that, 4**2 = 16 cases just for 
one operator. No difficulty there, just a lot of lines...

Basically the code size complexity of the above approach is:
  #functions * (#types ** #args)

while for the approach in the submitted patch it is:
  #types * #functions

Now I obviously agree that the approach is not generic and should not be 
used in other contexts; it is just that it is simple and short, and it fits 
the bill for pgbench.

Note that I do not really envision adding more types for pgbench scripts. 
The double type is just a side effect of the non-uniform random functions. 
What I plan is to add a few functions, especially a pseudo-random 
permutation and/or a parametric hash, that would allow running more 
realistic tests with non-uniform distributions.

-- 
Fabien.


