Thread: pgbench randomness initialization

pgbench randomness initialization

From

Andres Freund

Date:

07 April 2016, 09:41:20

Hi,

pondering
http://archives.postgresql.org/message-id/CA%2BTgmoZJdA6K7-17K4A48rVB0UPR98HVuaNcfNNLrGsdb1uChg%40mail.gmail.com
et al I was wondering why it's a good idea for pgbench to doINSTR_TIME_SET_CURRENT(start_time);srandom((unsigned int)
INSTR_TIME_GET_MICROSEC(start_time));
to initialize randomness and thenfor (i = 0; i < nthreads; i++)    thread->random_state[0] = random();
thread->random_state[1]= random();    thread->random_state[2] = random();
 
to initialize the individual thread random state which is then used by
pg_erand48().

To me it seems better to instead initialize srandom() with a known value
(say, uh, 0). Or even better don't use random() at all, and fill a
global pg_erand48() with a known state; and use pg_erand48() to
initialize the thread states.

Obviously that doesn't make pgbench entirely reproducible, but it seems
a lot better than now. Individual threads would do work in a
reproducible order.

I see very little reason to have the current behaviour, or at the very
least not by default.

Greetings,

Andres Freund

Re: pgbench randomness initialization

From

Fabien COELHO

Date:

07 April 2016, 09:56:22

Hello Andres,

> et al I was wondering why it's a good idea for pgbench to do
>     INSTR_TIME_SET_CURRENT(start_time);
>     srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));
> to initialize randomness and then
>     for (i = 0; i < nthreads; i++)
>         thread->random_state[0] = random();
>         thread->random_state[1] = random();
>         thread->random_state[2] = random();
> to initialize the individual thread random state which is then used by
> pg_erand48().
>
> To me it seems better to instead initialize srandom() with a known value
> (say, uh, 0). Or even better don't use random() at all, and fill a
> global pg_erand48() with a known state; and use pg_erand48() to
> initialize the thread states.
>
> Obviously that doesn't make pgbench entirely reproducible, but it seems
> a lot better than now. Individual threads would do work in a
> reproducible order.
>
> I see very little reason to have the current behaviour, or at the very
> least not by default.

I think that it depends on what you want, which may vary:
 (1) "exactly" reproducible runs, but one run may hit a particular     steady state not representative of what happens
ingeneral.
 
 (2) runs which really vary from one to the next, so as     to have an idea about how much it may vary, what is the
performancestability.
 

Currently pgbench focusses on (2), which may or may not be fine depending 
on what you are doing. From a personal point of view I think that (2) is 
more significant to collect performance data, even if the results are more 
unstable: that simply reflects reality and its intrinsic variations, so 
I'm fine that as the default.

Now for those interested in (1) for some reason, I would suggest to rely a 
PGBENCH_RANDOM_SEED environment variable or --random-seed option which 
could be used to have a oxymoronic "deterministic randomness", if desired.
I do not think that it should be the default, though.

-- 
Fabien.

Re: pgbench randomness initialization

From

Andres Freund

Date:

07 April 2016, 10:02:24

On 2016-04-07 11:56:12 +0200, Fabien COELHO wrote:
>  (2) runs which really vary from one to the next, so as
>      to have an idea about how much it may vary, what is the
>      performance stability.

I don't think this POV makes all that much sense. If you do something
non-comparable, then the results aren't, uh, comparable. Which also
means there's a lower chance to reproduce observed problems.

> Currently pgbench focusses on (2), which may or may not be fine depending on
> what you are doing. From a personal point of view I think that (2) is more
> significant to collect performance data, even if the results are more
> unstable: that simply reflects reality and its intrinsic variations, so I'm
> fine that as the default.

Uh, and what's the benefit of that variability? pgbench isn't a reality
simulation tool, it's a benchmarking tool. And benchmarks with intrisinc
variability are bad benchmarks.

Greetings,

Andres Freund

Re: pgbench randomness initialization

From

Fabien COELHO

Date:

07 April 2016, 10:26:07

>>  (2) runs which really vary from one to the next, so as
>>      to have an idea about how much it may vary, what is the
>>      performance stability.
>
> I don't think this POV makes all that much sense. If you do something
> non-comparable, then the results aren't, uh, comparable. Which also
> means there's a lower chance to reproduce observed problems.

That also means that you are likely not to hit them if you always do the 
very same run...

Moreover, the Monte Carlo method requires randomness for its convergence 
result.

>> Currently pgbench focusses on (2), which may or may not be fine depending on
>> what you are doing. From a personal point of view I think that (2) is more
>> significant to collect performance data, even if the results are more
>> unstable: that simply reflects reality and its intrinsic variations, so I'm
>> fine that as the default.
>
> Uh, and what's the benefit of that variability? pgbench isn't a reality
> simulation tool, it's a benchmarking tool. And benchmarks with intrisinc
> variability are bad benchmarks.

From a statistical perspective, one run does not mean anything. If you do 
the exact same run over and over again, then all mathematical results 
about (slow) convergence towards the average are lost. This is like trying 
to survey a population by asking the questions to the same person over and 
over: the result will be biased.

Now when you develop, which is the use case you probably have in mind, you 
want to compare two pg version and check for the performance impact, so 
having the exact same run seems like a proxy to quickly check for that.

However, from a stastistical perspective this is just heresy: you may do a 
change which improves one given run at the expense of all possible others 
and you would not know it: Say for instance that there are two different 
behaviors depending on something, then you will check against one of them 
only.

So I have no mathematical doubt that changing the seed is the right 
default setting, thus I think that the current behavior is fine. However 
I'm okay if someone wants to control the randomness for some reason (maybe 
having "less sure" results, but quickly), so it could be allowed somehow.

-- 
Fabien.

Re: pgbench randomness initialization

From

Andres Freund

Date:

07 April 2016, 10:38:27

On 2016-04-07 12:25:58 +0200, Fabien COELHO wrote:
> 
> >> (2) runs which really vary from one to the next, so as
> >>     to have an idea about how much it may vary, what is the
> >>     performance stability.
> >
> >I don't think this POV makes all that much sense. If you do something
> >non-comparable, then the results aren't, uh, comparable. Which also
> >means there's a lower chance to reproduce observed problems.
> 
> That also means that you are likely not to hit them if you always do the
> very same run...

If you run the test for longer... Or explicitly iterate over IVs. At the
very least we need to make pgbench output the IV used, to have some
chance of repeating tests.


> >Uh, and what's the benefit of that variability? pgbench isn't a reality
> >simulation tool, it's a benchmarking tool. And benchmarks with intrisinc
> >variability are bad benchmarks.
> 
> From a statistical perspective, one run does not mean anything. If you do
> the exact same run over and over again, then all mathematical results about
> (slow) convergence towards the average are lost. This is like trying to
> survey a population by asking the questions to the same person over and
> over: the result will be biased.

That comparison pretty much invalidates any point you're making, it's
that bad.


> Now when you develop, which is the use case you probably have in mind, you
> want to compare two pg version and check for the performance impact, so
> having the exact same run seems like a proxy to quickly check for that.

It's not about "quickly" checking for something. If you look at the
results in thread mentioned in the OP, the order of operations
drastically and *PERSISTENTLY* changes the observations. Causing *days*
of work lost.


> However, from a stastistical perspective this is just heresy: you may do a
> change which improves one given run at the expense of all possible others
> and you would not know it: Say for instance that there are two different
> behaviors depending on something, then you will check against one of them
> only.

Meh. That assumes that we're doing a huge number of pgbench runs; but
usually people do maybe a handful. Tops. If you're trying to defend
against scenarios like that you need to design your tests so that you'll
encounter such problems by running longer.


> So I have no mathematical doubt that changing the seed is the right default
> setting, thus I think that the current behavior is fine. However I'm okay if
> someone wants to control the randomness for some reason (maybe having "less
> sure" results, but quickly), so it could be allowed somehow.

There might be some statistics arguments, but I think they're pretty
ignoring reality.

Greetings,

Andres Freund

Re: pgbench randomness initialization

From

Fabien COELHO

Date:

07 April 2016, 11:01:58

Hello Andres,

> If you run the test for longer... Or explicitly iterate over IVs. At the
> very least we need to make pgbench output the IV used, to have some
> chance of repeating tests.

Note that I'm not against providing a way to repeat tests "exactly", and I 
have suggested two means: environment variable and/or option.

> [...] That comparison pretty much invalidates any point you're making, 
> it's that bad.

At least it is simple, if simplistic.

Here is another one: I knew a financial institution which needed to 
evaluate the VAR of exotic financial products every night. They relied on 
MC for that. Alas, it was not converging quickly enough, results were 
unstable, so they took your advice: they froze the seed. Day after day the 
results were mostly the same, the VAR was stable one morning to the other, 
the management is happy, the risks were under control... That was in the 
mid 2000s:-)

>> However, from a stastistical perspective this is just heresy: you may do a
>> change which improves one given run at the expense of all possible others
>> and you would not know it: Say for instance that there are two different
>> behaviors depending on something, then you will check against one of them
>> only.
>
> Meh. That assumes that we're doing a huge number of pgbench runs;

A number of, not necessarily "huge". Or averaging a lot of intermediate 
values and having a hard look at the distribution, not just the final tps 
number.

> but usually people do maybe a handful. Tops. If you're trying to defend 
> against scenarios like that you need to design your tests so that you'll 
> encounter such problems by running longer.

People usually do a lot of things, does not mean that it is "right".

>> So I have no mathematical doubt that changing the seed is the right 
>> default setting, thus I think that the current behavior is fine. 
>> However I'm okay if someone wants to control the randomness for some 
>> reason (maybe having "less sure" results, but quickly), so it could be 
>> allowed somehow.
>
> There might be some statistics arguments,

Yep, there is.

> but I think they're pretty ignoring reality.

Hmmm. If reality wants to ignore mathematics, usually it looses, so this 
will not be with my blessing. Note that as a committer you do not need me 
to freeze the seed. I'm just providing an opinion backed by mathematical 
proofs.

-- 
Fabien.

Re: pgbench randomness initialization

From

Robert Haas

Date:

07 April 2016, 12:58:21

On Thu, Apr 7, 2016 at 5:56 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> I think that it depends on what you want, which may vary:
>
>  (1) "exactly" reproducible runs, but one run may hit a particular
>      steady state not representative of what happens in general.
>
>  (2) runs which really vary from one to the next, so as
>      to have an idea about how much it may vary, what is the
>      performance stability.
>
> Currently pgbench focusses on (2), which may or may not be fine depending on
> what you are doing. From a personal point of view I think that (2) is more
> significant to collect performance data, even if the results are more
> unstable: that simply reflects reality and its intrinsic variations, so I'm
> fine that as the default.
>
> Now for those interested in (1) for some reason, I would suggest to rely a
> PGBENCH_RANDOM_SEED environment variable or --random-seed option which could
> be used to have a oxymoronic "deterministic randomness", if desired.
> I do not think that it should be the default, though.

I agree entirely.  If performance is erratic, that's actually
something you want to discover during benchmarking.  If different
pgbench runs (of non-trivial length) are producing substantially
different results, then that's really a problem we need to fix, not
just adjust pgbench to cover it up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: pgbench randomness initialization

From

Andres Freund

Date:

07 April 2016, 13:15:38

On 2016-04-07 08:58:16 -0400, Robert Haas wrote:
> On Thu, Apr 7, 2016 at 5:56 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> > I think that it depends on what you want, which may vary:
> >
> >  (1) "exactly" reproducible runs, but one run may hit a particular
> >      steady state not representative of what happens in general.
> >
> >  (2) runs which really vary from one to the next, so as
> >      to have an idea about how much it may vary, what is the
> >      performance stability.
> >
> > Currently pgbench focusses on (2), which may or may not be fine depending on
> > what you are doing. From a personal point of view I think that (2) is more
> > significant to collect performance data, even if the results are more
> > unstable: that simply reflects reality and its intrinsic variations, so I'm
> > fine that as the default.
> >
> > Now for those interested in (1) for some reason, I would suggest to rely a
> > PGBENCH_RANDOM_SEED environment variable or --random-seed option which could
> > be used to have a oxymoronic "deterministic randomness", if desired.
> > I do not think that it should be the default, though.
> 
> I agree entirely.  If performance is erratic, that's actually
> something you want to discover during benchmarking.  If different
> pgbench runs (of non-trivial length) are producing substantially
> different results, then that's really a problem we need to fix, not
> just adjust pgbench to cover it up.

It's not about "covering it up"; it's about actually being able to take
action based on benchmark results, and about practically being able to
run benchmarks. The argument above means essentially that we need to run
a significant number of pgbench runs for *anything*, because running
them 3-5 times before/after just isn't meaningful enough.

It means that you can't separate between OS caused, and pgbench order
caused performance differences.

I agree that it's a horrid problem that we can get half the throughput
dependent on large machines, dependant on the ordering. But without
running queries in the same order before/after a patch there's no way to
validate whether $patch caused the problem. And no way to reliably
trigger problematic scenarios.

I also agree that it's important to be able to vary workloads. But if
you do so, you should do so in the same order, both pre/post a
patch. Afaics the prime use of pgbench is validation of the performance
effects of patches; therefore it should be usable for that, and it's
not.

Greetings,

Andres Freund

Re: pgbench randomness initialization

From

Robert Haas

Date:

07 April 2016, 13:33:12

On Thu, Apr 7, 2016 at 9:15 AM, Andres Freund <andres@anarazel.de> wrote:
> It's not about "covering it up"; it's about actually being able to take
> action based on benchmark results, and about practically being able to
> run benchmarks. The argument above means essentially that we need to run
> a significant number of pgbench runs for *anything*, because running
> them 3-5 times before/after just isn't meaningful enough.
>
> It means that you can't separate between OS caused, and pgbench order
> caused performance differences.

I'm not objecting to adding an option for this; but I think Fabien is
right that it shouldn't be the default.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: pgbench randomness initialization

From

Tom Lane

Date:

07 April 2016, 13:46:39

Andres Freund <andres@anarazel.de> writes:
> On 2016-04-07 12:25:58 +0200, Fabien COELHO wrote:
>> So I have no mathematical doubt that changing the seed is the right default
>> setting, thus I think that the current behavior is fine. However I'm okay if
>> someone wants to control the randomness for some reason (maybe having "less
>> sure" results, but quickly), so it could be allowed somehow.

> There might be some statistics arguments, but I think they're pretty
> ignoring reality.

Sorry, but I think Fabien is right and you are wrong.  There is no
point in having randomness in there at all if the thing is constrained
to generate the same "random" sequence every time.

I don't object to having an option to force the initial seed, but
it should not be the default, and it most certainly should not be
the only behavior.
        regards, tom lane

Re: pgbench randomness initialization

From

Andres Freund

Date:

07 April 2016, 13:52:03

On 2016-04-07 09:46:27 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2016-04-07 12:25:58 +0200, Fabien COELHO wrote:
> >> So I have no mathematical doubt that changing the seed is the right default
> >> setting, thus I think that the current behavior is fine. However I'm okay if
> >> someone wants to control the randomness for some reason (maybe having "less
> >> sure" results, but quickly), so it could be allowed somehow.
> 
> > There might be some statistics arguments, but I think they're pretty
> > ignoring reality.
> 
> Sorry, but I think Fabien is right and you are wrong.

Given that it's 3:1 so far, you might be right...

> There is no point in having randomness in there at all if the thing is
> constrained to generate the same "random" sequence every time.

but that argument seems pretty absurd. It's obviously different to query
for different rows over a run, rather than querying the same row again
and again, in all backends.  The reason we use randomness is to avoid
easily discernible patterns in querying. Without randomness, how would
you do that?

Greetings,

Andres Freund

Re: pgbench randomness initialization

From

Fabien COELHO

Date:

07 April 2016, 14:26:58

>> It means that you can't separate between OS caused, and pgbench order
>> caused performance differences.
>
> I'm not objecting to adding an option for this; but I think Fabien is
> right that it shouldn't be the default.

Yep.

Andres, attached is a simple POC with an option & environment variable 
(whereas I should rather have looked at the current checkpointer/vacuum 
issue which I have reproduced:-().

While testing it I had a funny pattern, something like:
  pgbench --random-seed=123 -M prepared -T 3 -P 1 -S  1.0: 600 tps  2.0: 600 tps  3.0: 600 tps

First rerun just after:
  pgbench --random-seed=123 -M prepared -T 3 -P 1 -S  1.0: 1800 tps  2.0: 600 tps  3.0: 600 tps

The first rerun hits the same pages, so the first 1800 transactions are 
run in one second, and then it is new pages which are loaded so the 
performance goes down.

Second rerun just after:
  pgbench --random-seed=123 -M prepared -T 3 -P 1 -S  1.0: 1800 tps  2.0: 1400 tps  3.0: 600 tps

The second redun hits the same 3000 transactions than the previous one in 
about 1.7 seconds, then goes back to 600 tps for new pages...

After more iterations the performance is 1800 tps during the 3 seconds.

This clearly illustrates that it should be used with caution.

-- 
Fabien.

Re: pgbench randomness initialization

From

Alvaro Herrera

Date:

07 April 2016, 15:51:45

Fabien COELHO wrote:

> While testing it I had a funny pattern, something like:
> 
>   pgbench --random-seed=123 -M prepared -T 3 -P 1 -S
>   1.0: 600 tps
>   2.0: 600 tps
>   3.0: 600 tps

The output should include the random seed used, whether it was passed
with --random-seed, environment variable or randomly determined.  That
way, the user that later wants to verify why a particular run caused
some particular misbehavior knows what seed to use to reproduce that
run.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] pgbench randomness initialization

From

Fabien COELHO

Date:

02 January 2018, 12:48:24

Hello Alvaro,

I revive this patch because controlling the seed is useful for tap testing 
pgbench.

> The output should include the random seed used, whether it was passed 
> with --random-seed, environment variable or randomly determined.  That 
> way, the user that later wants to verify why a particular run caused 
> some particular misbehavior knows what seed to use to reproduce that 
> run.

Yep.

Here is a new version which output use used seed when a seed is 
explicitely set with an option or from the environment.

However, the default (current) behavior remains silent, so no visible 
changes unless tinkering with it.

The patch also allows to use a "strong" random for seeding the PRNG, 
thanks to pg_strong_random().

The tests assume that stdlib random/srandom behavior is standard thus 
deterministic between platform.

-- 
Fabien.

Attachment

pgbench-seed-2.patch

Re: [HACKERS] pgbench randomness initialization

From

Fabien COELHO

Date:

02 January 2018, 13:57:35

> Here is a new version which output use used seed when a seed is explicitely 
> set with an option or from the environment.

It is even better without xml typos, with simpler coding and the doc in 
the right place... Sorry for the noise.

-- 
Fabien.

Attachment

pgbench-seed-3.patch

Re: Re: [HACKERS] pgbench randomness initialization

From

Chapman Flack

Date:

09 January 2018, 11:59:42

On 01/02/18 05:57, Fabien COELHO wrote:
>> Here is a new version which output use used seed when a seed is
>> explicitely set with an option or from the environment.

This is a simple patch that does what it says on the tin. I ran into
trouble with the pgbench TAP test *even before applying the patch*, but
only because I was doing a VPATH build as a user without 'write'
on the source tree (001_pgbench_with_server.pl tried to make pgbench
create log files there). Bad me. Oddly, that was the only test in the
whole tree to have such an issue, so here I add a pre-patch to fix that.
Now my review needs a review. :)

With that taken care of, the added tests all pass for me. I wonder, though:

> The tests assume that stdlib random/srandom behavior is standard thus
> deterministic between platform.

Is the behavior of srandom() and the system generator really so precisely
specified that seed 5432 will produce the same values hardcoded in the
tests on all platforms? It does on mine, but could it produce spurious
test failures for others? I guess the (or a) spec would be here:

http://pubs.opengroup.org/onlinepubs/7908799/xsh/initstate.html

It specifies a "non-linear additive feedback random-number generator
employing a default state array size of 31 long integers", but it does
not pin down parameters or claim only one candidate exists.

To have the test run pgbench twice with the same seed and compare the
results sounds safer.

This revised pgbench-seed-4.patch changes the code not at all, and
the TAP test only in whitespace. I did some wordsmithing of the doc,
which I hope was not presumptuous of me, only as a conversation starter.
I expanded the second sentence because on my first reading I wasn't quite
sure of its meaning. Once I had looked at the code, I could see that the
sentence was economical and precise already, but I tried to find a version
that would also have been clear to the me-before-I-looked.

The documentation doesn't say that --random-seed= (or PGBENCH_RANDOM_SEED=)
will have the same effect as 'time', and indeed, I really think it should
be unset (defaulting to 'time'), or 'time', or 'rand', or an integer,
or an error. The error, right now, says only "expecting an unsigned
integer"; it should also mention time and rand. Should 'rand' be something
that conveys more about its meaning, 'strong' perhaps?

The documentation doesn't mention the allowed range of the unsigned
integer (which I get, because 'unsigned int' is exactly the signature
for srandom, but somebody might read "unsigned integer" in a more
generic sense). I'm not sure what would be a better way to say it.
The base (only decimal, as now in the code) isn't specified either.

Maybe the documentation should mention that the output now includes the
random seed being used, so that (even without planning ahead) a session
can be re-run with the same seed if necessary. It could just say "an
unsigned integer in decimal, as it is shown in pgbench's output" or
something like that.

Something more may need to be said in the doc about reproducibility. I think
the real story is that a run can be reproduced if the number of clients is
equal to the number of threads. Although each thread has its own generator
state, each client does not (if there is more than one per thread), and as
soon as real select() delays start happening in CSTATE_WAIT_RESULT, the
clients dealt out to a given thread may not be hitting that thread's
generator in a deterministic order.

-Chap

Attachment

Re: pgbench randomness initialization

From

Chapman Flack

Date:

09 January 2018, 12:04:00

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

Initial review is the message this is in reply to.

The new status of this patch is: Waiting on Author

Re: Re: [HACKERS] pgbench randomness initialization

From

Fabien COELHO

Date:

09 January 2018, 16:42:31

Hello Chapman,

Thanks for the review,

>> The tests assume that stdlib random/srandom behavior is standard thus
>> deterministic between platform.
>
> Is the behavior of srandom() and the system generator really so precisely
> specified that seed 5432 will produce the same values hardcoded in the
> tests on all platforms? [...]

Good question.

I'm hoping that in practice it would be the same, or that their would be 
very few cases (eg linux vs windows vs macos...). I was counting on the 
the buildfarm to tell me if I'm wrong, and fix it if needed.

> To have the test run pgbench twice with the same seed and compare the
> results sounds safer.

Interesting idea. The test script would be more complicated to do that, 
though. I would prefer to bet on "random" determinism (:-) and resort to 
such a solution only if it is not the case. Or maybe just put back some 
"\d+" to keep it simple.

This is a debatable strategy.

> I did some wordsmithing of the doc, which I hope was not presumptuous of 
> me, only as a conversation starter. [...]

Thanks for the doc improvements.

> The documentation doesn't say that --random-seed= (or PGBENCH_RANDOM_SEED=)
> will have the same effect as 'time', and indeed, I really think it should
> be unset (defaulting to 'time'), or 'time', or 'rand', or an integer,
> or an error.

Ok, done.

> The error, right now, says only "expecting an unsigned integer"; it 
> should also mention time and rand.

Ok, done.

> Should 'rand' be something that conveys more about its meaning, 'strong' 
> perhaps?

Hmmm. "Random means random":-). I have no opinion about whether it would 
be better. ISTM that "strong" would require some explanations. I let is as 
"rand" for now.

> The documentation doesn't mention the allowed range of the unsigned
> integer (which I get, because 'unsigned int' is exactly the signature
> for srandom, but somebody might read "unsigned integer" in a more
> generic sense).

Ok. I extended so that it works with octal, decimal and hexadecimal, and 
updated the doc accordingly. I did not add range information though.

> I'm not sure what would be a better way to say it.
> The base (only decimal, as now in the code) isn't specified either.

Sure.

> Maybe the documentation should mention that the output now includes the 
> random seed being used, so that (even without planning ahead) [...]

It does so only if the seed is explicitely set, otherwise it keeps the 
previous behavior of being silent. I added a sentence about that.

> Something more may need to be said in the doc about reproducibility. I think
> the real story is that a run can be reproduced if the number of clients is
> equal to the number of threads.

Yes, good point. I tried to hide the issue under the "as far as random 
numbers are concerned". I've tried to improve this point in the doc.

> Although each thread has its own generator state, each client does not 
> (if there is more than one per thread), and as soon as real select() 
> delays start happening in CSTATE_WAIT_RESULT, the clients dealt out to a 
> given thread may not be hitting that thread's generator in a 
> deterministic order.

Yep. This may evolve, for instance the error handling patch needs to 
restart transactions so it adds a state into the client.

-- 
Fabien.

Attachment

pgbench-seed-5.patch

Re: Re: [HACKERS] pgbench randomness initialization

From

Fabien COELHO

Date:

10 January 2018, 13:00:24

> This is a simple patch that does what it says on the tin. I ran into
> trouble with the pgbench TAP test *even before applying the patch*, but
> only because I was doing a VPATH build as a user without 'write'
> on the source tree (001_pgbench_with_server.pl tried to make pgbench
> create log files there). Bad me. Oddly, that was the only test in the
> whole tree to have such an issue, so here I add a pre-patch to fix that.
> Now my review needs a review. :)

Yep. I find the multiple chdir solution a little bit too extreme.

ISTM that it should rather add the correct path to --log-prefix by 
prepending $node->basedir, like the pgbench function does for -f scripts.

See attached.

-- 
Fabien.

Attachment

pgbench-vpath-2.patch

Re: Re: [HACKERS] pgbench randomness initialization

From

Fabien COELHO

Date:

10 January 2018, 15:27:19

Here is a rebase, plus some more changes:

I have improved the error message to tell from where the value was 
provided.

I have removed the test to the exact values produced from the expression 
test run.

I have added a test which run from the same seed value several times
and checks that the output values are the same.

-- 
Fabien.

Attachment

pgbench-seed-6.patch

Re: Re: [HACKERS] pgbench randomness initialization

From

Fabien COELHO

Date:

19 January 2018, 01:29:59

> Here is a rebase, plus some more changes:
>
> I have improved the error message to tell from where the value was provided.
>
> I have removed the test to the exact values produced from the expression test 
> run.
>
> I have added a test which run from the same seed value several times
> and checks that the output values are the same.

This version adds a :random_seed script variable, for information.

-- 
Fabien.

Attachment

pgbench-seed-7.patch

Re: Re: [HACKERS] pgbench randomness initialization

From

Fabien COELHO

Date:

24 January 2018, 12:33:10

>> Here is a rebase, plus some more changes:
>> 
>> I have improved the error message to tell from where the value was 
>> provided.
>> 
>> I have removed the test to the exact values produced from the expression 
>> test run.
>> 
>> I have added a test which run from the same seed value several times
>> and checks that the output values are the same.
>
> This version adds a :random_seed script variable, for information.

Rebased.

-- 
Fabien.

Attachment

pgbench-seed-8.patch

Re: [HACKERS] pgbench randomness initialization

From

Tom Lane

Date:

01 March 2018, 02:01:55

Fabien COELHO <coelho@cri.ensmp.fr> writes:
>> This is a simple patch that does what it says on the tin. I ran into
>> trouble with the pgbench TAP test *even before applying the patch*, but
>> only because I was doing a VPATH build as a user without 'write'
>> on the source tree (001_pgbench_with_server.pl tried to make pgbench
>> create log files there). Bad me. Oddly, that was the only test in the
>> whole tree to have such an issue, so here I add a pre-patch to fix that.
>> Now my review needs a review. :)

> Yep. I find the multiple chdir solution a little bit too extreme.

> ISTM that it should rather add the correct path to --log-prefix by 
> prepending $node->basedir, like the pgbench function does for -f scripts.
> See attached.

Hm ... so I tried to replicate this problem, and failed to: the log files
get made under the VPATH build directory, as desired, even without this
patch.  Am I doing something wrong, or is this platform-dependent somehow?

            regards, tom lane

Re: [HACKERS] pgbench randomness initialization

From

Fabien COELHO

Date:

01 March 2018, 09:56:57

Hello Tom,

> Fabien COELHO <coelho@cri.ensmp.fr> writes:
>>> This is a simple patch that does what it says on the tin. I ran into
>>> trouble with the pgbench TAP test *even before applying the patch*, but
>>> only because I was doing a VPATH build as a user without 'write'
>>> on the source tree (001_pgbench_with_server.pl tried to make pgbench
>>> create log files there). Bad me. Oddly, that was the only test in the
>>> whole tree to have such an issue, so here I add a pre-patch to fix that.
>>> Now my review needs a review. :)
>
>> Yep. I find the multiple chdir solution a little bit too extreme.
>
>> ISTM that it should rather add the correct path to --log-prefix by
>> prepending $node->basedir, like the pgbench function does for -f scripts.
>> See attached.
>
> Hm ... so I tried to replicate this problem, and failed to: the log files
> get made under the VPATH build directory, as desired, even without this
> patch.  Am I doing something wrong, or is this platform-dependent somehow?

As I recall, it indeed works if the source directories are rw, but fails 
if they are ro because then the local mkdir fails. So you would have to do 
a vpath build with sources that are ro to get the issue the patch is 
fixing. Otherwise, the issue would have been cought earlier by the 
buildfarm, which I guess is doing vpath compilation and full validation.

-- 
Fabien.

Re: [HACKERS] pgbench randomness initialization

From

Tom Lane

Date:

02 March 2018, 22:50:46

Fabien COELHO <coelho@cri.ensmp.fr> writes:
>> Hm ... so I tried to replicate this problem, and failed to: the log files
>> get made under the VPATH build directory, as desired, even without this
>> patch.  Am I doing something wrong, or is this platform-dependent somehow?

> As I recall, it indeed works if the source directories are rw, but fails 
> if they are ro because then the local mkdir fails. So you would have to do 
> a vpath build with sources that are ro to get the issue the patch is 
> fixing.

Ah, right you are.  Apparently, if the source is rw, the temporary files
in question are made there but immediately deleted, so the bug isn't
obvious.

Fix verified and pushed.

            regards, tom lane

Re: pgbench randomness initialization

From

Chapman Flack

Date:

03 March 2018, 20:41:20

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            tested, passed

The patch 8 works and addresses the things I noticed earlier.

It needs s/explicitely/explicitly/ in the docs.

The parsing of the seed involves matters of taste, I guess: if it were a signed int, then
sscanf's built-in %i would do everything those three explicit hex/octal/decimal branches
do, but there's no unsigned version of %i. Then there's strtoul(..., base=0), which accepts
the same choice of bases, but there's no unsigned-int-width version of that. Maybe it
would still look cleaner to use strtoul and just check that the result fits in unsigned int?
As I began, it comes down to taste ... this code does work.

I am not sure about the "idem for :random_seed" part: Does this mean that a value
could be given with -Drandom_seed on the command line, and become the value
of :random_seed, possibly different from the value given to --random-seed?
Is that intended? (Perhaps it is; I'm merely asking.)

-Chap

The new status of this patch is: Waiting on Author

Re: pgbench randomness initialization

From

Fabien COELHO

Date:

05 March 2018, 15:01:39

Hello Chapman,

Here is v9.

> It needs s/explicitely/explicitly/ in the docs.

Done.

> The parsing of the seed involves matters of taste, I guess: if it were a 
> signed int, then sscanf's built-in %i would do everything those three 
> explicit hex/octal/decimal branches do, but there's no unsigned version 
> of %i. Then there's strtoul(..., base=0), which accepts the same choice 
> of bases, but there's no unsigned-int-width version of that. Maybe it 
> would still look cleaner to use strtoul and just check that the result 
> fits in unsigned int? As I began, it comes down to taste ... this code 
> does work.

I must admit that I'm not too happy with the result as well, so I dropped 
the octal/hexadecimal parsing.

> I am not sure about the "idem for :random_seed" part: Does this mean that a value
> could be given with -Drandom_seed on the command line, and become the value
> of :random_seed, possibly different from the value given to --random-seed?
> Is that intended? (Perhaps it is; I'm merely asking.)

The "idem" is about setting the variable but not overwritting it if it 
already exists. The intention is that :random_seed is the random seed, 
unless the user set it to something else in which case it is the user's 
value. I've improved the variable description in the doc to point out that 
the value may be overwritten with -D.

-- 
Fabien.

Attachment

pgbench-seed-9.patch

Re: Re: pgbench randomness initialization

From

Chapman Flack

Date:

16 March 2018, 06:20:56

I'm sorry, I must have missed your reply on the 5th somehow.

On 03/05/18 07:01, Fabien COELHO wrote:
> I must admit that I'm not too happy with the result as well, so I dropped
> the octal/hexadecimal parsing.

That seems perfectly reasonable to me; perfectly adequate to accept only
one base.

But now the documentation is back to its original state of silence on
what base or how many bases might be allowed. Could it just say
"or an unsigned decimal integer value"? Then no one will wonder.

> The "idem" is about setting the variable but not overwritting it if it
> already exists. The intention is that :random_seed is the random seed,
> unless the user set it to something else in which case it is the user's
> value. I've improved the variable description in the doc to point out that
> the value may be overwritten with -D.

Ok.

-Chap

Re: Re: pgbench randomness initialization

From

Fabien COELHO

Date:

16 March 2018, 16:58:19

> But now the documentation is back to its original state of silence on
> what base or how many bases might be allowed. Could it just say
> "or an unsigned decimal integer value"? Then no one will wonder.

Done in the attached.

Thanks for the reviews.

-- 
Fabien.

Attachment

pgbench-seed-10.patch

Re: pgbench randomness initialization

From

Chapman Flack

Date:

18 March 2018, 02:10:20

The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       tested, failed
Spec compliant:           not tested
Documentation:            tested, failed

This is a simple patch, includes documentation, includes and passes tests, and, in my rookie opinion, is ready for
committer.

The new status of this patch is: Ready for Committer

Re: pgbench randomness initialization

From

Teodor Sigaev

Date:

21 March 2018, 18:29:18

Patch isn't applyed cleanly anymore.

Fabien COELHO wrote:
> 
>> But now the documentation is back to its original state of silence on
>> what base or how many bases might be allowed. Could it just say
>> "or an unsigned decimal integer value"? Then no one will wonder.
> 
> Done in the attached.
> 
> Thanks for the reviews.
> 

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: pgbench randomness initialization

From

Fabien COELHO

Date:

22 March 2018, 10:40:29

> Patch isn't applyed cleanly anymore.

Indeed. Here is a rebase. All pgbench patches conflict about test cases.

-- 
Fabien.

Attachment

pgbench-seed-11.patch

Re: pgbench randomness initialization

From

Fabien COELHO

Date:

24 March 2018, 12:18:32

>> Patch isn't applyed cleanly anymore.
>
> Indeed. Here is a rebase. All pgbench patches conflict about test cases.

Patch v12, yet another rebase.

-- 
Fabien.

Attachment

pgbench-seed-12.patch

Re: pgbench randomness initialization

From

Teodor Sigaev

Date:

26 March 2018, 18:28:06

Thank you, pushed

Fabien COELHO wrote:
> 
>>> Patch isn't applyed cleanly anymore.
>>
>> Indeed. Here is a rebase. All pgbench patches conflict about test cases.
> 
> Patch v12, yet another rebase.
> 

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/