Re: pgbench - add pseudo-random permutation function - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: pgbench - add pseudo-random permutation function
Date
Msg-id alpine.DEB.2.21.2002010916510.20752@pseudo
Whole thread Raw
In response to Re: pgbench - add pseudo-random permutation function  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-hackers
Hello Peter,

>>> This patch was marked as RFC on 2019-03-30, but since then there have
>>> been a couple more issues pointed out in a review by Thomas Munro, and
>>> it went through 2019-09 and 2019-11 without any attention. Is the RFC
>>> status still appropriate?
>> 
>> Thomas review was about comments/documentation wording and asking for
>> explanations, which I think I addressed, and the code did not actually
>> change, so I'm not sure that the "needs review" is really needed, but do
>> as you feel.
>
> I read the whole thread, I still don't know what this patch is supposed to 
> do.  I know what the words in the subject line mean, but I don't know how 
> this helps a pgbench user run better benchmarks.  I feel this is also the 
> sentiment expressed by others earlier in the thread.  You indicated that this 
> functionality makes sense to those who want this functionality, but so far 
> only two people, namely the patch author and the reviewer, have participated 
> in the discussion on the substance of this patch.  So either the feature is 
> extremely niche, or nobody understands it.  I think you ought to take about 
> three steps back and explain this in more basic terms, even just in email at 
> first so that we can then discuss what to put into the documentation.

Here is the motivation for this feature:

When you design a benchmark to test performance, you want to avoid 
unwanted correlation which may impact performance unduly, one way or 
another. For the default pgbench benchmarks, accounts are chosen 
uniformly, no problem. However, if you want to test a non uniform 
distribution, which is what most people would encounter in practice, 
things are different.

For instance, you would replace random by random_exp in the default 
benchmarks. If you do that, then all accounts with lower ids would come 
out more often. However this creates an unwanted correlation and 
performance effect: why frequent accounts would just happen to be those 
with small ids? which just happen to reside together in the first few 
pages of the table? In order to avoid these effects, you need to shuffle 
the chosen account ids, so that frequent accounts are not specifically 
those at the beginning of the table.

Basically, as soon as you have a non uniform random generator selecting 
step you want to add some shuffle, otherwise you are going to skew your 
benchmark measures. Nobody should use a non-uniform generator for 
selecting rows without some shuffling, ever. I have no doubt that many 
people do without realizing that they are skewing their performance data.

Once you realize that, you can try to invent your own shuffling, but 
frankly this is not as easy as it looks to have something non trivial 
which would not generate unwanted correlation. I had a lot of (very) bad 
design before coming up with the one in the patch: You want something 
fast, you want good steering, which are contradictory objectives. There is 
also the question of being able to change the shuffling for a given size, 
for instance to check that there is no unwanted effects, hence the 
seeding. Good seeded shuffling is what an encryption algorithm do, but for 
these it is somehow simpler because they would usually work on power of 
two sizes.

In conclusion, using a seeded shuffle step is needed as soon as you start 
using non random generators. Providing one in pgbench, a tool designed to 
run possibly non-uniform benchmarks against postgres, looks like common 
courtesy. Not providing one is helping people to design bad benchmarks, 
possibly without realizing it, so is outright thoughtlessness.

I have no doubt that the documentation should be improved to explain that 
in a concise but clear way.

-- 
Fabien.



pgsql-hackers by date:

Previous
From: Pierre Ducroquet
Date:
Subject: PATCH: add support for IN and @> in functional-dependency statistics use
Next
From: Dent John
Date:
Subject: Re: The flinfo->fn_extra question, from me this time.