Re: add modulo (%) operator to pgbench - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: add modulo (%) operator to pgbench
Date
Msg-id alpine.DEB.2.10.1409240916340.3672@sto
In response to Re: add modulo (%) operator to pgbench  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: add modulo (%) operator to pgbench
List pgsql-hackers
Hello Heikki,

>> If you reject it, you can also remove the gaussian and exponential random
>> distributions, which are nearly useless without a means to add a minimal
>> pseudo-random stage, for which "(x * something) % size" is a reasonable
>> approximation, hence the modulo submission.
>
> I'm confused. The gaussian and exponential patch was already committed.

Yes.

They are significant patches that involved significant work, and
which are only really useful with a complementary, trivial 10-line patch
which is being rejected without understanding why it is needed.

> Are you saying that it's not actually useful, and should be reverted? 
> That doesn't make sense to me, gaussian and exponential distributed 
> values seems quite useful to me in test scripts.

Yes and no.

Currently these distributions are achieved by mapping a continuous
function onto integers, so that neighboring integers get similar
numbers of draws, say with size=7:

  #draws     10  6  3  1  0  0  0   // some exponential distribution
  int drawn   0  1  2  3  4  5  6

Although having an exponential distribution of accesses on tuples is quite
reasonable, it is not realistic at all for neighboring values to be so
strongly correlated. You need some additional shuffling to get there.
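
To illustrate the correlation problem, here is a minimal Python sketch of mapping a continuous exponential draw onto integers. It is not pgbench's actual formula: the function name `exponential_int`, the `rate` parameter, the rejection of draws >= 1.0 and the seed are all assumptions made for the illustration.

```python
import random

def exponential_int(rng, size, rate):
    """Map a continuous exponential draw onto an integer in [0, size).

    Hypothetical helper: draws outside [0, 1) are rejected, the rest is
    scaled to the integer range, so small integers are drawn most often.
    """
    while True:
        x = rng.expovariate(rate)
        if x < 1.0:
            # min() guards against floating-point rounding at the upper edge
            return min(int(x * size), size - 1)

rng = random.Random(42)   # fixed seed so the run is repeatable
size = 7
counts = [0] * size
for _ in range(10000):
    counts[exponential_int(rng, size, rate=5.0)] += 1

# Neighboring integers end up with neighboring counts, decreasing from 0.
print(counts)
```

Whatever the exact formula, the hot values are always the first few integers, next to each other.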

> I don't understand what that pseudo-random stage you're talking about is. Can 
> you elaborate?

The pseudo random stage is just a way to scatter the values. A basic 
approach to achieve this is "i' = (i * large-prime) % size", if you have a 
modulo. For instance with prime=5 you may get something like:
  #draws     10  6  3  1  0  0  0
  int drawn   0  1  2  3  4  5  6   (i)
  scattered   0  5  3  1  6  4  2   (i' = 5 i % 7)

So the distribution becomes:
  #draws     10  1  0  3  0  6  0
  scattered   0  1  2  3  4  5  6

Which is more interesting from a testing perspective because it removes 
the neighboring value correlation.
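
The example above can be reproduced with a short Python sketch. The function name `scatter` is hypothetical; the draw counts, prime=5 and size=7 are the ones used in the example.

```python
def scatter(i, prime, size):
    """Scatter i over [0, size) with the mult-modulo scheme i' = (i * prime) % size."""
    return (i * prime) % size

draws = [10, 6, 3, 1, 0, 0, 0]   # draw counts per integer, from the example
size = len(draws)                # 7
prime = 5

# Where each drawn integer lands after scattering.
scattered = [scatter(i, prime, size) for i in range(size)]

# Draw counts re-indexed by the scattered value.
new_draws = [0] * size
for i, count in enumerate(draws):
    new_draws[scatter(i, prime, size)] += count

print(scattered)   # [0, 5, 3, 1, 6, 4, 2]
print(new_draws)   # [10, 1, 0, 3, 0, 6, 0]
```

Since 5 and 7 are coprime, the mapping is a permutation of 0..6, so the counts are only moved around and none are merged.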

A better approach is to use a hash function, "i' = hash(i) % size".
It skews the distribution some more, but the quality of the
shuffling is much better than with the mult-modulo version above.
Note that you need a modulo as well...
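
A minimal Python sketch of the hash variant follows. The choice of SHA-256 via `hashlib` and the name `hash_scatter` are assumptions for the illustration; any reasonable hash function would do.

```python
import hashlib

def hash_scatter(i, size):
    """Scatter i over [0, size) with i' = hash(i) % size.

    SHA-256 is used only because it is deterministic across runs;
    this is an assumption, not a recommendation of a particular hash.
    """
    digest = hashlib.sha256(str(i).encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % size

draws = [10, 6, 3, 1, 0, 0, 0]   # same draw counts as in the example above
size = len(draws)

targets = [hash_scatter(i, size) for i in range(size)]

# Unlike mult-modulo with a coprime multiplier, this is not a permutation:
# two inputs may hash to the same target, which merges their draw counts.
new_draws = [0] * size
for i, count in enumerate(draws):
    new_draws[targets[i]] += count

print(targets)
print(new_draws)
```

The possible collisions are the extra skew mentioned above: the total number of draws is preserved, but some values may receive the draws of several inputs while others receive none.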

I must say that I'm appalled by a decision process which leads to such
results, with significant patches passed, and the tiny complement to make
them really useful (I mean not on paper or on the feature list, but in
real life) rejected...

-- 
Fabien.


