Thread: PATCH: pgbench - merging transaction logs

PATCH: pgbench - merging transaction logs

From
Tomas Vondra
Date:
Hi there,

attached is a patch implementing merging of pgbench logs. These logs are
written by each thread, so with N threads you get N files with names

    pgbench_log.PID
    pgbench_log.PID.1
    ...
    pgbench_log.PID.N

Before analyzing these logs, these files need to be combined. I usually
ended up writing ad-hoc scripts doing that, losing them, writing them
again, and so on, over and over again.

The other disadvantage of the external scripts is that you have to pass
all the info about the logs (whether the logs are aggregated, whether
there's throttling, etc.).

So integrating this into pgbench directly seems like a better approach,
and the attached patch implements that.

With '-m' or '--merge-logs' on the command-line, the logs are merged at
the end, using a simple 2-way merge to keep the log sorted by the time
field. It should work with all the other options that influence the log
format (--rate, --aggregate-interval and --latency-limit).

I'll add this to CF 2015-06.

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: PATCH: pgbench - merging transaction logs

From
Fabien COELHO
Date:
Sun, 15 Mar 2015 11:22:01 +0100 (CET)
Hello Tomas,

> attached is a patch implementing merging of pgbench logs. These logs are
> written by each thread, so with N threads you get N files with names
>
>    pgbench_log.PID
>    pgbench_log.PID.1
>    ...
>    pgbench_log.PID.N
>
> Before analyzing these logs, these files need to be combined. I usually
> ended up writing ad-hoc scripts doing that, losing them, writing them
> again, and so on, over and over again.

I've looked at the patch. Although I think that such a feature is somewhat 
desirable... I have two issues with it: ISTM that

(1) it does not belong to pgbench as such

(2) even if, the implementation is not right

About (1):

I think that this functionality as implemented does not belong to 
pgbench, and really belongs to an external script, which happens not to 
be readily available, which is a shame. PostgreSQL should probably provide 
such a script, or make it easy to find.

It could belong to pgbench if pgbench was *generating* the merged log file 
directly. However I'm not sure this can be done cleanly for the 
multi-process "thread emulation" (IN PASSING: which I think this should 
really get rid of because I do not know what system does not provide 
threads nowadays and would require to have a state of the art pgbench 
nevertheless, and this emulation significantly complexify the code by 
making things uselessly difficult and/or needed to be implemented twice or 
not be provided in some cases).

Another issue raised by your patch is that the log format may be improved, 
say by providing a one-field timestamp at the beginning of the line.


About (2):

In the implementation you reimplement a partial merge sort by parsing each 
line of each file and merging it with the current result over and over. 
ISTM that an implementation should read all files in parallel and merge 
them in one pass. The current IO complexity is in p²n where it should be 
simply pn... do not use it with a significant number of threads and many 
lines... Ok, the generated files are likely to be kept in cache, but 
nevertheless.
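
For illustration, a one-pass k-way merge over the per-thread files could 
look like the following sketch (not from the patch; all names are invented, 
and a heap could replace the linear scan should the thread count get large):

    /* Minimal sketch of a one-pass k-way merge of per-thread logs,
     * assuming each line starts with a timestamp.  Illustrative only,
     * not from the patch. */
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_LINE 1024

    typedef struct
    {
        FILE   *fp;
        char    line[MAX_LINE];
        int     eof;
    } LogStream;

    static void
    fetch(LogStream *s)
    {
        s->eof = (fgets(s->line, sizeof(s->line), s->fp) == NULL);
    }

    /* merge "nstreams" already-sorted files into "out" in one pass,
     * so each input line is read and written exactly once: O(p.n) IO */
    static void
    merge_logs(LogStream *streams, int nstreams, FILE *out)
    {
        int     i;

        for (i = 0; i < nstreams; i++)
            fetch(&streams[i]);

        for (;;)
        {
            int     best = -1;

            /* linear scan for the smallest leading timestamp */
            for (i = 0; i < nstreams; i++)
                if (!streams[i].eof &&
                    (best < 0 || strtod(streams[i].line, NULL) <
                                 strtod(streams[best].line, NULL)))
                    best = i;

            if (best < 0)
                break;          /* all streams exhausted */

            fputs(streams[best].line, out);
            fetch(&streams[best]);
        }
    }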

Also there is plenty of copy-paste when scanning the two files and then 
reprinting in all the different formats. The same logic is implemented 
twice, once for simple and once for aggregated logs. This means that 
updating or extending the log format later on would require modifying 
these scans and prints in many places.

For aggregates, some data in the output may be special values 
("NaN/-/..."); I am not sure how the implementation would behave in such 
cases. As lines that do not match are silently ignored, the merged result 
would just be meaningless... should it rather be an error? Try it with a 
low rate for instance.

It seems that system function calls are not tested for errors.

> The other disadvantage of the external scripts is that you have to pass
> all the info about the logs (whether the logs are aggregated, whether
> there's throttling, etc.).

I think that is another argument for a better format, with a timestamp 
first? Also, ISTM that it only needs to know whether it is merging 
aggregate or simple logs, no more; the other information can be inferred 
from the number of fields on the line.
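
(For instance, counting the whitespace-separated fields on a line is 
trivial; a purely illustrative sketch, the mapping from field counts to 
formats being an assumption:)

    /* Count whitespace-separated fields on a log line; a merge tool
     * could use this to guess simple vs aggregated formats.
     * Illustrative sketch only. */
    static int
    count_fields(const char *line)
    {
        int     n = 0;
        int     in_field = 0;

        for (; *line != '\0'; line++)
        {
            if (*line == ' ' || *line == '\t' || *line == '\n')
                in_field = 0;
            else if (!in_field)
            {
                in_field = 1;
                n++;
            }
        }
        return n;
    }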

> So integrating this into pgbench directly seems like a better approach,
> and the attached patch implements that.

You guessed that I disagree. Note that this is only my own opinion.

In summary, I think that:

(a) providing a clean script would be nice,

(b) if we could get rid of the "thread emulation", pgbench could generate 
the merged logs directly and simply, and the option could be provided 
then.

-- 
Fabien.

Re: PATCH: pgbench - merging transaction logs

From
Tomas Vondra
Date:
On 15.3.2015 11:22, Fabien COELHO wrote:
>
> I've looked at the patch. Although I think that such a feature is
> somewhat desirable... I have two issues with it: ISTM that
> 
> (1) it does not belong to pgbench as such
> 
> (2) even if, the implementation is not right
> 
> About (1):
> 
> I think that this functionality as implemented does not belong to 
> pgbench, and really belongs to an external script, which happens 
> not to be readily available, which is a shame. PostgreSQL should 
> probably provide such a script, or make it easy to find.

I disagree, for two main reasons.

Firstly, the fact that pgbench produces one file per thread is awkward.
It was implemented like this most likely because it was the simplest
solution after adding "-j" in 9.0, but I don't remember if I ever needed
a separate log from a single thread.

Secondly, a separate tool (even if provided with pgbench) would require
passing detailed info about the log format - whether it's aggregated or
not, throttling or not, ....

Those are the two main reasons why I think this should be implemented as
another option in pgbench.


> It could belong to pgbench if pgbench was *generating* the merged
> log file directly. However I'm not sure this can be done cleanly for
> the multi-process "thread emulation" (IN PASSING: which I think should
> really be gotten rid of, because I do not know what system does not
> provide threads nowadays and would nevertheless require a
> state-of-the-art pgbench, and this emulation significantly complicates
> the code by making things uselessly difficult and/or needing to be
> implemented twice or not provided in some cases).

I considered this approach, i.e. adding another 'collector' thread
receiving results from all the other threads, but decided not to do
that. That would require a fair amount of inter-process communication,
locking etc. and might affect the measurements (especially for the
workloads with extremely short transactions, like read-only pgbench).

The implemented approach, i.e. merging results collected by each thread
independently, after the benchmarking actually completed, is a better
solution. And the code is actually simpler.


> Another issue raised by your patch is that the log format may be 
> improved, say by providing a one-field timestamp at the beginning of 
> the line.

I don't see how this is relevant to this patch?


> About (2):
> 
> In the implementation you reimplement a partial merge sort by
> parsing each line of each file and merging it with the current result
> over and over. ISTM that an implementation should read all files in
> parallel and merge them in one pass. The current IO complexity is in
> p²n where it should be simply pn... do not use it with a significant
> number of threads and many lines... Ok, the generated files are
> likely to be kept in cache, but nevertheless.

Maybe. Implementing a 2-way merge sort was the simplest solution at the
moment, and I don't think this really matters because the I/O generated
by the benchmark itself is usually much higher than this.

The only case when this might make a difference is probably merging
large transaction logs (e.g. 100% sample from read-only test on small
dataset).

> Also there is plenty of copy-paste when scanning the two files and 
> then reprinting in all the different formats. The same logic is 
> implemented twice, once for simple and once for aggregated logs. This
> means that updating or extending the log format later on would
> require modifying these scans and prints in many places.

That is true, and I'll address that somehow (either moving the code to a
macro or a separate utility function).

> For aggregates, some data in the output may be special values 
> ("NaN/-/..."); I am not sure how the implementation would behave in
> such cases. As lines that do not match are silently ignored, the
> merged result would just be meaningless... should it rather be an
> error? Try it with a low rate for instance.

You mean the aggregated log? I can't think of a way to get such values
there - can you provide a plausible scenario for how that could happen?

> 
> It seems that system function calls are not tested for errors.

That's true, but that's how the rest of pgbench code is written.

> 
>> The other disadvantage of the external scripts is that you have to
>> pass all the info about the logs (whether the logs are aggregated,
>> whether there's throttling, etc.).
> 
> I think that is another argument for a better format, with a
> timestamp first? Also, ISTM that it only needs to know whether it is
> merging aggregate or simple logs, no more; the other information can be
> inferred from the number of fields on the line.

No, it's not. Even if you change the format like this, you still have no
idea whether the log is a per-transaction log (possibly with some
additional options) or an aggregated log.

There might be some auto-detection based on the number of fields, for
example, but considering how many options influence that, I consider it
really unreliable.

> 
>> So integrating this into pgbench directly seems like a better approach,
>> and the attached patch implements that.
> 
> You guessed that I disagree. Note that this is only my own opinion.
> 
> In summary, I think that:
> 
> (a) providing a clean script would be nice,
> 
> (b) if we could get rid of the "thread emulation", pgbench could 
> generate the merged logs directly and simply, and the option could
> be provided then.

That however is not the goal of this patch. The thread emulation is
there for a reason, and I certainly am not going to work on eliminating
it (not sure that's even possible).


-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: pgbench - merging transaction logs

From
Fabien COELHO
Date:
> Firstly, the fact that pgbench produces one file per thread is awkward.

I agree, but I think it is due to the multi process thread emulation: if 
you have real threads, you can do a simple fprintf, possibly with some 
mutex, and you're done. There is really nothing to do to implement this
feature.

> Secondly, a separate tool (even if provided with pgbench) would require
> passing detailed info about the log format - whether it's aggregated or
> not, throttling or not, ....

Hmmm.

>> It could belong to pgbench if pgbench was *generating* the merged
>> log file directly. [...]
>
> I considered this approach, i.e. adding another 'collector' thread
> receiving results from all the other threads, but decided not to do
> that. That would require a fair amount of inter-process communication,
> locking etc. and might affect the measurements

I agree that inter-process stuff should be avoided. This is not what I had 
in mind. I was thinking of "fprintf" on the same file handler by different 
threads.
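
(A minimal sketch of what I have in mind, with an explicit mutex in case 
fprintf is not thread-safe on some platform; the field layout and all 
names are invented for illustration:)

    #include <pthread.h>
    #include <stdio.h>

    static FILE *log_file;      /* opened once, shared by all threads */
    static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

    static void
    log_transaction(int client_id, double latency, long epoch, long usec)
    {
        pthread_mutex_lock(&log_lock);
        fprintf(log_file, "%d %.0f %ld %ld\n",
                client_id, latency, epoch, usec);
        pthread_mutex_unlock(&log_lock);
    }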

>> Another issue raised by your patch is that the log format may be
>> improved, say by providing a one-field timestamp at the beginning of
>> the line.
>
> I don't see how this is relevant to this patch?

For the simple log format, all the parsing needed would be for the 
timestamp, and the remainder would just be text to pass along, no need to 
%d %f... whatever.

>> The current IO complexity is in p²n where it should be simply pn...
>
> Maybe. Implementing a 2-way merge sort was the simplest solution at the
> moment, and I don't think this really matters because the I/O generated
> by the benchmark itself is usually much higher than this.

If you do not do a n-way merge, you could do a 2-way merge on a binary 
tree so that the IO complexity would be p.log(p).n (I think), and not p²n.

>> For aggregates, some data in the output may be special values
>> "NaN/-/...", [...]
> You mean the aggregated log? I can't think of a way to get there such
> values - can you provide a plausible scenario how that could happen?

Possibly I'm wrong. Ok, I tried it:
 sh> ./pgbench --rate=0.5 --aggregate-interval=1 --log
 sh> cat pgbench_log.12671
 1426445236 1 5034 25341156 5034 5034 687 471969 687 687
 1426445237 0 0 0 0 0 0 0 0 0
 1426445238 0 0 0 0 0 0 0 0 0
 1426445239 1 8708 75829264 8708 8708 2063 4255969 2063 2063
 ...

Good news, I could not generate strange values. Just that when there are 0 
transactions, some care should possibly be taken when combining values.

>> It seems that system function calls are not tested for errors.
>
> That's true, but that's how the rest of pgbench code is written.

Hmmm.... This is not entirely true, there are *some* checks if you look 
carefully:-)
   if ((fd = fopen(filename, "r")) == NULL) ...
   if ((fp = popen(command, "r")) == NULL) ...
   nsocks = select(...)
   if (nsocks < 0) ...

>>> So integrating this into pgbench directly seems like a better approach,
>>> and the attached patch implements that.
>>
>> You guessed that I disagree. Note that this is only my own opinion.
>>
>> In summary, I think that:
>>
>> (a) providing a clean script would be nice,
>>
>> (b) if we could get rid of the "thread emulation", pgbench could
>> generate the merged logs directly and simply, and the option could
>> be provided then.
>
> That however is not the goal of this patch.

Sure. My point is that the amount of code you write to implement this 
merge stuff is due to this feature. Without it, the patch would probably 
need 10 lines of code. Moreover, the way it is implemented requires 
scanning and reprinting, which means more work in many places to update 
the format later.

> The thread emulation is there for a reason,

My opinion is that it *was* there for a reason. Whether it makes sense 
today to still have it, maintain it, and have to write such heavy code for 
a trivial feature just because of it is another matter.

> and I certainly am not going to work on eliminating it
> (not sure that's even possible).

I wish it will be:-)

I would suggest this, which I would support: implement this feature the 
simple way (aka fprintf, maybe a mutex) when compiled with threads, and 
generate an error "feature not available with process-based thread 
emulation" when compiled with processes. This way we avoid a lot of heavy 
code to maintain in the future, and you still get the feature within 
pgbench. There are already some things which are not the same with thread 
emulation because it would have been tiring to implement them for very 
little benefit.

-- 
Fabien.

Re: PATCH: pgbench - merging transaction logs

From
Tomas Vondra
Date:
On 15.3.2015 20:35, Fabien COELHO wrote:
> 
>> Firstly, the fact that pgbench produces one file per thread is awkward.
> 
> I agree, but I think it is due to the multi process thread emulation: if
> you have real threads, you can do a simple fprintf, possibly with some
> mutex, and you're done. There is really nothing to do to implement this
> feature.

My fear was that this either requires explicit locking, or some implicit
locking - for example while fprintf() in glibc is thread-safe, I really
am not sure about Win32 for example. This might influence the results
for very short transactions - I haven't tried that, though, so this
might be a false assumption.

The way I implemented the merge is immune to this.

>> Secondly, a separate tool (even if provided with pgbench) would
>> require passing detailed info about the log format - whether it's
>> aggregated or not, throttling or not, ....
> 
> Hmmm.
> 
>>> It could belong to pgbench if pgbench was *generating* the merged
>>> log file directly. [...]
>>
>> I considered this approach, i.e. adding another 'collector' thread 
>> receiving results from all the other threads, but decided not to
>> do that. That would require a fair amount of inter-process
>> communication, locking etc. and might affect the measurements
> 
> I agree that inter-process stuff should be avoided. This is not what 
> I had in mind. I was thinking of "fprintf" on the same file handler 
> by different threads.

That still involves some sort of 'implicit' locking, no? And as I
mentioned, I'm not sure fprintf() is thread-safe on all the platforms we
support.


>>> Another issue raised by your patch is that the log format may be 
>>> improved, say by providing a one-field timestamp at the beginning
>>> of the line.
>>
>> I don't see how this is relevant to this patch?
> 
> For the simple log format, all the parsing needed would be for the 
> timestamp, and the remainder would just be text to pass along, no
> need to %d %f... whatever.

Oh, ok. Well, that's true, but I don't think that significantly changes
the overall complexity.

>>> The current IO complexity is in p²n where it should be simply pn...
>>
>> Maybe. Implementing a 2-way merge sort was the simplest solution at
>> the moment, and I don't think this really matters because the I/O
>> generated by the benchmark itself is usually much higher than this.
> 
> If you do not do a n-way merge, you could do a 2-way merge on a
> binary tree so that the IO complexity would be p.log(p).n (I think),
> and not p²n.

Yes, I could do that.

I still think this is probably overengineering, but not a big deal. I'll
leave it for later, though (this patch is in the 2015-06 CF anyway).

>>> For aggregates, some data in the output may be special values
>>> "NaN/-/...", [...]
>> You mean the aggregated log? I can't think of a way to get such values
>> there - can you provide a plausible scenario for how that could happen?
> 
> Possibly I'm wrong. Ok, I tried it:
> 
>  sh> ./pgbench --rate=0.5 --aggregate-interval=1 --log
>  sh> cat pgbench_log.12671
>  1426445236 1 5034 25341156 5034 5034 687 471969 687 687
>  1426445237 0 0 0 0 0 0 0 0 0
>  1426445238 0 0 0 0 0 0 0 0 0
>  1426445239 1 8708 75829264 8708 8708 2063 4255969 2063 2063
>  ...
> 
> Good news, I could not generate strange values. Just that when there
> are 0 transactions, some care should possibly be taken when combining 
> values.

Yeah, I wasn't able to come up with such a scenario, but wasn't sure.

> 
>>> It seems that system function calls are not tested for errors.
>>
>> That's true, but that's how the rest of pgbench code is written.
> 
> Hmmm.... This is not entirely true, there are *some* checks if you look
> carefully:-)
> 
>    if ((fd = fopen(filename, "r")) == NULL) ...
>    if ((fp = popen(command, "r")) == NULL) ...
>    nsocks = select(...)
>    if (nsocks < 0) ...

OK, there are a few checks ;-) But none on the fprintf calls, IIRC.

Anyway, I plan to refactor this part of the patch to get rid of the
copy'n'paste pieces, so I'll take care of this too.

>>> (b) if we could get rid of the "thread emulation", pgbench could
>>> generate the merged logs directly and simply, and the option could
>>> be provided then.
>>
>> That however is not the goal of this patch.
> 
> Sure. My point is that the amount of code you write to implement this
> merge stuff is due to this feature. Without it, the patch would 
> probably need 10 lines of code. Moreover, the way it is implemented 
> requires scanning and reprinting, which means more work in many 
> places to update the format later.

Possibly. But it's not written like that :-(


>> The thread emulation is there for a reason,
> 
> My opinion is that it *was* there for a reason. Whether it makes 
> sense today to still have it, maintain it, and have to write such 
> heavy code for a trivial feature just because of it is another 
> matter.

Possibly. I can't really decide that.

>> and I certainly am not going to work on eliminating it
>> (not sure that's even possible).
> 
> I wish it will be:-)
> 
> I would suggest this, which I would support: implement this feature the
> simple way (aka fprintf, maybe a mutex) when compiled with threads, and
> generate an error "feature not available with process-based thread
> emulation" when compiled with processes. This way we avoid a lot of
> heavy code to maintain in the future, and you still get the feature
> within pgbench. There are already some things which are not the same
> with thread emulation because it would have been tiring to implement
> them for very little benefit.

I really dislike features that are supported only in some cases (although
most deployments probably support proper threading these days).


-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: pgbench - merging transaction logs

From
Fabien COELHO
Date:
>> I agree that inter-process stuff should be avoided. This is not what
>> I had in mind. I was thinking of "fprintf" on the same file handler
>> by different threads.
>
> That still involves some sort of 'implicit' locking, no?

I fully agree that it would need locking, whether it is implicit if fprintf 
is thread safe, or explicit if not.

> And as I mentioned, I'm not sure fprintf() is thread-safe on all the 
> platforms we support.

I do not know, but the question is relevant! The answer must be thought 
through. Adding a MUTEX to be on the safe side may not be too big an issue.

>>>> Another issue raised by your patch is that the log format may be
>>>> improved, say by providing a one-field timestamp at the beginning
>>>> of the line.
>>>
>>> I don't see how this is relevant to this patch?
>>
>> For the simple log format, all the parsing needed would be for the
>> timestamp, and the remainder would just be text to pass along, no
>> need to %d %f... whatever.
>
> Oh, ok. Well, that's true, but I don't think that significantly changes
> the overall complexity.

For your current implementation, it would remove plenty of variables and 
conditions about options; just one scan and one print would remain, and no 
need to update the format later, so it would simplify the code greatly for 
the "simple" log.

>> I would suggest this, which I would support: implement this feature the
>> simple way (aka fprintf, maybe a mutex) when compiled with threads, and
>> generate an error "feature not available with process-based thread
>> emulation" when compiled with processes. This way we avoid a lot of
>> heavy code to maintain in the future, and you still get the feature
>> within pgbench. There are already some things which are not the same
>> with thread emulation because it would have been tiring to implement
>> them for very little benefit.
>
> I really dislike features that are supported only in some cases (although
> most deployments probably support proper threading these days).

I agree on the general principle, but not in this particular case, because 
for me thread emulation is more or less obsolete, useless and unused.

My view is that your code is on the too-heavy side, and adds a burden of 
later maintenance for a rather should-be-simple feature, really just so 
that it works for "thread emulation" that probably nobody uses or really 
needs. So I'm not enthusiastic at all about getting that into pgbench as is.

If the code were much simpler, these reservations would be dropped: it 
would be a small and useful feature with a small code and maintenance 
impact. And I do not see the point of heavy code to support what is, in my 
view, obsolete stuff.

-- 
Fabien.



Re: PATCH: pgbench - merging transaction logs

From
Robert Haas
Date:
On Sun, Mar 15, 2015 at 3:35 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>> Firstly, the fact that pgbench produces one file per thread is awkward.
>
> I agree, but I think it is due to the multi process thread emulation: if you
> have real threads, you can do a simple fprintf, possibly with some mutex,
> and you're done. There is really nothing to do to implement this
> feature.

I think that's probably not a good idea, because fprintf() might then
become a bottleneck.  I fixed a similar problem with random() in
commit 4af43ee3f165c8e4b332a7e680a44f4b7ba2d3c1.  Even though the
locking was happening inside libc, it was still locking, and it still
caused a contention problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: pgbench - merging transaction logs

From
Fabien COELHO
Date:
Hello,

>> I agree, but I think it is due to the multi process thread emulation: if you
>> have real threads, you can do a simple fprintf, possibly with some mutex,
>> and you're done. There is really nothing to do to implement this
>> feature.
>
> I think that's probably not a good idea, because fprintf() might then
> become a bottleneck.  I fixed a similar problem with random() in
> commit 4af43ee3f165c8e4b332a7e680a44f4b7ba2d3c1.  Even though the
> locking was happening inside libc, it was still locking, and it still
> caused a contention problem.

The fprintf we are talking about occurs at most once per pgbench 
transaction, possibly much less when aggregation is activated, and this 
transaction involves network exchanges and possibly disk writes on the 
server.

So I would have thought that the risk of contention because of such a lock 
would be very low in this case. If it really becomes a bottleneck, it 
means a lot of threads doing a lot of small transactions, and I would 
suggest that using the aggregated option would be the right answer to 
that.

-- 
Fabien.



Re: PATCH: pgbench - merging transaction logs

From
Robert Haas
Date:
On Tue, Mar 17, 2015 at 11:27 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> The fprintf we are talking about occurs at most once per pgbench
> transaction, possibly much less when aggregation is activated, and this
> transaction involves network exchanges and possibly disk writes on the
> server.

random() was occurring four times per transaction rather than once,
but OTOH I think fprintf() is probably a much heavier-weight
operation.  The way to know if there's a real problem here is to test
it, but I'd be pretty surprised if there isn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: pgbench - merging transaction logs

From
Fabien COELHO
Date:
Hello Robert,

>> The fprintf we are talking about occurs at most once per pgbench
>> transaction, possibly much less when aggregation is activated, and this
>> transaction involves network exchanges and possibly disk writes on the
>> server.
>
> random() was occurring four times per transaction rather than once,
> but OTOH I think fprintf() is probably a much heavier-weight
> operation.

Yes, sure.

My point is that if there are many threads and tremendous TPS, the 
*detailed* per-transaction log (aka simple log) is probably a bad choice 
anyway, and the aggregated version is the way to go.

Note that even without mutex fprintf may be considered a "heavy function" 
which is going to slow down the transaction rate significantly. That could 
be tested as well.

It is possible to reduce the lock time by preparing the string (which 
would mean introducing buffers) and just do a "fputs" under mutex. That 
would not reduce the print time anyway, and that may add malloc/free 
operations, though.
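
(A sketch of that variant, with a stack buffer so that no malloc/free is 
added; names invented for illustration:)

    #include <pthread.h>
    #include <stdio.h>

    static void
    log_transaction(FILE *log_file, pthread_mutex_t *log_lock,
                    int client_id, double latency, long epoch, long usec)
    {
        char    buf[128];       /* stack buffer, no malloc/free */

        /* format outside the lock... */
        snprintf(buf, sizeof(buf), "%d %.0f %ld %ld\n",
                 client_id, latency, epoch, usec);

        /* ...and hold the mutex only for the actual output */
        pthread_mutex_lock(log_lock);
        fputs(buf, log_file);
        pthread_mutex_unlock(log_lock);
    }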

> The way to know if there's a real problem here is to test it, but I'd be 
> pretty surprised if there isn't.

Indeed, I think I can contrive a simple example where it is, basically a
more or less empty or read only transaction (eg SELECT 1).

My opinion is that there is a tradeoff between code simplicity and later 
maintenance vs feature benefit.

If threads are assumed and fprintf is used, the feature is much simpler to 
implement, and the maintenance is lighter. The alternative implementation 
means reparsing the generated files over and over for merging their 
contents.

Also, I do not think that the detailed log provides much benefit with very 
fast transactions, where probably the aggregate is a much better choice 
anyway. If the user persists, she may generate a per-thread log and merge 
it later, in which case a merge script is needed, but I do not think that 
would be a bad thing.

Obviously, all that is only my opinion and is quite debatable.

-- 
Fabien.



Re: PATCH: pgbench - merging transaction logs

From
Tomas Vondra
Date:
Hi,

On 20.3.2015 13:43, Fabien COELHO wrote:
> 
> Hello Robert,
> 
>>> The fprintf we are talking about occurs at most once per pgbench 
>>> transaction, possibly much less when aggregation is activated,
>>> and this transaction involves network exchanges and possibly
>>> disk writes on the server.
>>
>> random() was occurring four times per transaction rather than
>> once, but OTOH I think fprintf() is probably a much heavier-weight 
>> operation.
> 
> Yes, sure.
> 
> My point is that if there are many threads and tremendous TPS, the 
> *detailed* per-transaction log (aka simple log) is probably a bad 
> choice anyway, and the aggregated version is the way to go.

I disagree with this reasoning. Can you provide numbers supporting it?

I do agree that fprintf is not cheap, actually when profiling pgbench
it's often the #1 item, but the impact on the measurements is actually
quite small. For example with a small database (scale 10) and read-only
30-second runs (single client), I get this:
   no logging: 18672 18792 18667 18518 18613 18547
 with logging: 18170 18093 18162 18273 18307 18234

So on average, that's 18634 vs. 18206, i.e. less than 2.5% difference.
And with more expensive transactions (larger scale, writes, ...) the
difference will be much smaller.

It's true that this might produce large logs, especially when the runs
are long, but that has nothing to do with fprintf. And it can be easily
fixed by either using a dedicated client machine, or by only sampling the
transaction log.

Introducing actual synchronization between the threads (by locking
inside fprintf) is however a completely different thing.

> Note that even without mutex fprintf may be considered a "heavy
> function" which is going to slow down the transaction rate
> significantly. That could be tested as well.
> 
> It is possible to reduce the lock time by preparing the string
> (which would mean introducing buffers) and just do a "fputs" under
> mutex. That would not reduce the print time anyway, and that may add
> malloc/free operations, though.

I seriously doubt fprintf does the string formatting while holding a lock
on the file. So by doing this you only simulate what fprintf() does
(assuming it's thread-safe on your platform) and gain nothing.

> 
>> The way to know if there's a real problem here is to test it, but 
>> I'd be pretty surprised if there isn't.
> 
> Indeed, I think I can contrive a simple example where it is,
> basically a more or less empty or read only transaction (eg SELECT
> 1).

That would be nice, because my quick testing suggests it's not the case.

> My opinion is that there is a tradeoff between code simplicity and 
> later maintenance vs feature benefit.
> 
> If threads are assumed and fprintf is used, the feature is much 
> simpler to implement, and the maintenance is lighter.

I think the "if threads are assumed" part makes this dead in water
unless someone wants to spend time on getting rid of the thread
emulation. Removing the code is quite simple, researching whether we can
do that will be difficult IMHO - I have no idea which of the supported
platorms require the emulation etc. And I envision endless discussions
about this.

> The alternative implementation means reparsing the generated files 
> over and over for merging their contents.

I agree that the current implementation is not particularly pretty, and
I plan to get rid of the copy&paste parts etc.

> Also, I do not think that the detailed log provides much benefit
> with very fast transactions, where probably the aggregate is a much
> better choice anyway. If the user persists, she may generate a
> per-thread log and merge it later, in which case a merge script is
> needed, but I do not think that would be a bad thing.

I disagree with this - I use transaction logs (either complete or
sampled) quite often. I also explained why I think a separate merge
script is awkward to use.

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: pgbench - merging transaction logs

From
Fabien COELHO
Date:
Hello Tomas,

>> My point is that if there are many threads and tremendous TPS, the
>> *detailed* per-transaction log (aka simple log) is probably a bad
>> choice anyway, and the aggregated version is the way to go.
>
> I disagree with this reasoning. Can you provide numbers supporting it?

I'm not sure which part of this reasoning you want me to support with 
numbers.

My first argument is qualitative: I'm trying to say that if you are doing 
100000 tps, probably the per-transaction detailed measure would be useless 
as such, so there is no point in logging them all. You would rather be 
interested in aggregate figures and should also be interested in extremal 
events (very slow queries) and their frequency.

The second point is that aggregated logs should be cheaper than a detailed 
log; it seems more or less self-evident to me.

> I do agree that fprintf is not cheap, actually when profiling pgbench
> it's often the #1 item, but the impact on the measurements is actually
> quite small. For example with a small database (scale 10) and read-only
> 30-second runs (single client), I get this:
>
>   no logging: 18672 18792 18667 18518 18613 18547
> with logging: 18170 18093 18162 18273 18307 18234
>
> So on average, that's 18634 vs. 18206, i.e. less than 2.5% difference.
> And with more expensive transactions (larger scale, writes, ...) the
> difference will be much smaller.

Ok. Great!

Let us take this as a worst-case figure and try some maths.

If fprintf takes p = 0.025 (1/40) of the time, then with 2 threads the 
collision probability would be about 1/40 and the delayed thread would be 
waiting for half this time on average, so the performance impact due to 
fprintf locking would be negligible (a 1/80 delay occurring in 1/40 cases 
=> 1/3200 time added to the computed average, if I'm not mistaken).

With t threads, I would guess that the collision probability in the first 
order is about 0.5 t (t-1) p, and the performance impact would be ~ (0.5 t 
(t-1) p * 0.5 p) (hmmm... this approximation, if not completely stupid, 
just holds for small p and not too large t).

With your worst-case figure and some rounding, it seems to look like:

  #threads    collision probability    performance impact
    2             1/40                    1/3200
    4             1/7                     1/533
    8             0.7                     < 0.01 (about 1%)

This suggests that for a pessimistic (ro load) fprintf overhead ratio there 
would be a small impact even with 8 threads doing 20000 tps each.

> It's true that this might produce large logs, especially when the runs
> are long, but that has nothing to do with fprintf. And it can be easily
> fixed by either using a dedicated client machine, or by only sampling
> the transaction log.

Sure. In which case locking/mutex is not an issue.

> Introducing actual synchronization between the threads (by locking
> inside fprintf) is however a completely different thing.

It is the issue I'm trying to discuss.

>> Note that even without mutex fprintf may be considered a "heavy
>> function" which is going to slow down the transaction rate
>> significantly. That could be tested as well.
>>
>> It is possible to reduce the lock time by preparing the string
>> (which would mean introducing buffers) and just do a "fputs" under
>> mutex. That would not reduce the print time anyway, and that may add
>> malloc/free operations, though.
>
> I seriously doubt fprintf does the string formatting while holding lock
> on the file.

I do not think that the lock is necessarily at the file (OS level), but at 
least it has to hold the FILE structure to manage buffering without 
interference from other threads during string processing and printing.

> So by doing this you only simulate what fprintf() does (assuming it's 
> thread-safe on your platform) and gain nothing.

Ok, if fprintf is thread safe, I fully agree that there is no need to add 
a mutex or lock! However I do not know how to test that, so adding a 
possibly useless mutex would at least be on the safe side.

>>> The way to know if there's a real problem here is to test it, but
>>> I'd be pretty surprised if there isn't.
>>
>> Indeed, I think I can contrive a simple example where it is,
>> basically a more or less empty or read only transaction (eg SELECT
>> 1).
>
> That would be nice, because my quick testing suggests it's not the case.

I do not understand your point. ISTM that your test and my maths suggest
that the performance impact of the fprintf approach is reasonably low.

>> My opinion is that there is a tradeoff between code simplicity and
>> later maintenance vs feature benefit.
>>
>> If threads are assumed and fprintf is used, the feature is much
>> simpler to implement, and the maintenance is lighter.
>
> I think the "if threads are assumed" part makes this dead in water
> unless someone wants to spend time on getting rid of the thread
> emulation.

My suggestion is to skip the feature under thread emulation, not 
necessarily to remove thread emulation, because of the argument you raise 
below.

> Removing the code is quite simple; researching whether we can do that 
> will be difficult IMHO - I have no idea which of the supported platforms 
> require the emulation etc.

I know of none, that is a start:-)

> And I envision endless discussions about this.

You are pessimistic, it may be a very quick and emphatic "NO":-)

>> The alternative implementation means reparsing the generated files
>> over and over for merging their contents.
>
> I agree that the current implementation is not particularly pretty, and
> I plan to get rid of the copy&paste parts etc.

That would be good.

>> Also, I do not think that the detailed log provides much benefit
>> with very fast transactions, where probably the aggregate is a much
>> better choice anyway. If the user persists, she may generate a
>> per-thread log and merge it later, in which case a merge script is
>> needed, but I do not think that would be a bad thing.
>
> I disagree with this - I use transaction logs (either complete or
> sampled) quite often. I also explained why I think a separate merge
> script is awkward to use.

Yes, I did read that, but for me the choice is between an "awkward" script 
outside or "awkward" code inside to maintain forever, hence my lack of 
enthusiasm and the search for a lighter solution.

Note that "sampled" would have no locking issue, because the overhead is 
very low.

Just for the record, I do not like simplistic samples because they would 
miss infrequent large events (e.g. if 1 in 40000 transactions takes 40000 
times longer to be processed, it accounts for half the total time and 
divides the performance by 2, yet you will not often catch it in your 
sample).

-- 
Fabien.



Re: PATCH: pgbench - merging transaction logs

From
Andres Freund
Date:
On 2015-03-21 10:37:05 +0100, Fabien COELHO wrote:
> 
> Hello Tomas,
> Let us take this as a worst-case figure and try some maths.
> 
> If fprintf takes p = 0.025 (1/40) of the time, then with 2 threads the
> collision probability would be about 1/40 and the delayed thread would be
> waiting for half this time on average, so the performance impact due to
> fprintf locking would be negligible (a 1/80 delay occurring in 1/40 cases
> => 1/3200 time added to the computed average, if I'm not mistaken).
> 
> With t threads, I would guess that the collision probability in the first
> order is about 0.5 t (t-1) p, and the performance impact would be ~ (0.5 t
> (t-1) p * 0.5 p) (hmmm... this approximation, if not completely stupid, just
> holds for small p and not too large t).
> 
> With your worst-case figure and some rounding, it seems to look like:
> 
>   #threads    collision probability    performance impact
>     2             1/40                    1/3200
>     4             1/7                     1/533
>     8             0.7                     < 0.01 (about 1%)
> 
> This suggests that for a pessimistic (ro load) fprintf overhead ratio there
> would be a small impact even with 8 threads doing 20000 tps each.

I think math like this mostly disregards hardware realities. You don't
actually need to have actual lock contention to notice overhead -
frequently acquiring an *uncontended* lock that resides in another
socket's cache and where the cacheline is dirty requires relatively
expensive cross cpu transfers.  That's all besides the overhead of doing
a lock operation itself. A lock; xaddl;, or whatever you end up using,
has a significant cost in itself. It implies a bus lock and cache flush,
which is far from free.

Additionally we're quite possibly talking about more than 8
threads. I've frequently used pgbench with hundreds of threads; for imo
good reasons.

That all said, it's far from guaranteed that there's an actual problem
here. If done right, i.e. the expensive formatting of the string is
separated from the locked output to the kernel, it might end up being
acceptable.

I wonder how bad using unbuffered write + O_APPEND would end up being;
without a lock.
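
(Something like the following sketch; whether small appends to a regular 
file interleave atomically is platform-dependent, so that would need to be 
verified, and the names are invented:)

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    static int log_fd;

    static void
    open_log(const char *path)
    {
        /* O_APPEND: the kernel positions every write at end-of-file */
        log_fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    }

    static void
    log_line(const char *buf)
    {
        /* each thread formats its own buffer, then issues one write() */
        (void) write(log_fd, buf, strlen(buf));
    }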

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: PATCH: pgbench - merging transaction logs

From
Andres Freund
Date:
On 2015-03-17 11:50:28 -0400, Robert Haas wrote:
> On Tue, Mar 17, 2015 at 11:27 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> > The fprintf we are talking about occurs at most once per pgbench
> > transaction, possibly much less when aggregation is activated, and this
> > transaction involves network exchanges and possibly disk writes on the
> > server.
> 
> random() was occurring four times per transaction rather than once,
> but OTOH I think fprintf() is probably a much heavier-weight
> operation.  The way to know if there's a real problem here is to test
> it, but I'd be pretty surprised if there isn't.

Well, fprintf() doesn't have to acquire the lock for the entirety of
its operation - just for the access to the stream buffer.

Note that posix 2001 *does* guarantee that FILE* style IO is thread
safe:
"All functions that reference (FILE *) objects, except those with names
ending in _unlocked, shall behave as if they use flockfile() and
funlockfile() internally to obtain ownership of these (FILE *) objects."

Hilariously, that tidbit is hidden in the documentation about
flockfile. Very, err, easy to find:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/flockfile.html
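
(So a group of stdio calls can even be made atomic explicitly; a 
hypothetical sketch:)

    #include <stdio.h>

    static void
    log_two_lines(FILE *f, const char *line1, const char *line2)
    {
        flockfile(f);           /* blocks other threads' stdio on f */
        fputs(line1, f);
        fputs(line2, f);        /* the two lines stay adjacent */
        funlockfile(f);
    }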

But I agree that we simply need to test this on a larger machine.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: PATCH: pgbench - merging transaction logs

From
didier
Date:
Hi,

On Sat, Mar 21, 2015 at 10:37 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

>>   no logging: 18672 18792 18667 18518 18613 18547
>> with logging: 18170 18093 18162 18273 18307 18234
>>
>> So on average, that's 18634 vs. 18206, i.e. less than 2.5% difference.
>> And with more expensive transactions (larger scale, writes, ...) the
>> difference will be much smaller.
>
>
> Ok. Great!
>
> Let us take this as a worst-case figure and try some maths.
>
> If fprintf takes p = 0.025 (1/40) of the time, then with 2 threads the
> collision probability would be about 1/40 and the delayed thread would be
> waiting for half this time on average, so the performance impact due to
> fprintf locking would be negligible (a 1/80 delay occurring in 1/40 cases
> => 1/3200 time added to the computed average, if I'm not mistaken).
If threads run more or less the same code with the same timing, after
a while they will lockstep on synchronization primitives and your
collision probability will be very close to 1.

Moreover, they will write to the same cache lines for every fprintf,
and this is very, very bad even without atomic operations.

Regards
Didier



Re: PATCH: pgbench - merging transaction logs

From
Fabien COELHO
Date:
Hello Andres,

>> With your worst-case figure and some rounding, it seems to look like:
>>
>>   #threads    collision probability    performance impact
>>     2             1/40                    1/3200
>>     4             1/7                     1/533
>>     8             0.7                     < 0.01 (about 1%)
>>
>> This suggests that for a pessimistic (ro load) fprintf overhead ratio there
>> would be a small impact even with 8 threads doing 20000 tps each.
>
> I think math like this mostly disregards hardware realities.

Hmmm. In my mind, doing the maths helps understand what may be going on.

Note that it does not preclude checking afterwards that it does indeed 
correspond to reality:-)

The key suggestion of the maths is that if p*t << 1 all is (seems) well.

> You don't actually need to have actual lock contention to notice 
> overhead.

The overhead assumed is 1/40 of the transaction time from Tomas' measures. 
Given the ~ 18000 tps (we are talking of an in-memory read-only load 
probably on the same host), transaction time for pgbench seems to be about 
0.06 ms, and fprintf seems to be about 0.0015 ms (1.5 µs).

> - frequently acquiring an *uncontended* lock that resides in another 
> socket's cache and where the cacheline is dirty requires relatively 
> expensive cross cpu transfers. That's all besides the overhead of doing 
> a lock operation itself. A lock; xaddl;, or whatever you end up using, 
> has a significant cost in itself. It implies a bus lock and cache flush, 
> which is far from free.

Ok, I did not assume an additional "lock cost". Do you have a figure? A 
quick googling suggested figures for "lightweight mutexes" around 100 ns, 
but the test conditions were unclear. If that is about right, adding that 
overhead does not change the above maths much.

> Additionally we're quite possibly talking about more than 8 threads. 
> I've frequently used pgbench with hundreds of threads; for imo good 
> reasons.

Good for you. I do not have access to a host on which this would make 
sense:-)

> That all said, it's far from guaranteed that there's an actual problem
> here. If done right, i.e. the expensive formatting of the string is
> separated from the locked output to the kernel, it might end up being
> acceptable.

That is what I would like to assess. Indeed, probably snprintf (to avoid 
mallocing anything) and then fputs/write/whatever would indeed help reduce 
the "contention" probability, if not the actual overhead.

-- 
Fabien.

Re: PATCH: pgbench - merging transaction logs

From
Fabien COELHO
Date:
Hello Didier,

>> If fprintf takes p = 0.025 (1/40) of the time, then with 2 threads the
>> collision probability would be about 1/40 and the delayed thread would be
>> waiting for half this time on average, so the performance impact due to
>> fprintf locking would be negligible (a 1/80 delay occurring in 1/40 cases
>> => 1/3200 time added to the computed average, if I'm not mistaken).

> If threads run more or less the same code with the same timing, after
> a while they will lockstep on synchronization primitives and your
> collision probability will be very close to 1.

I'm not sure I understand. If transaction times were really constant, then 
after a while the mutexes would be synchronised so as to avoid contention, 
i.e. the collision probability would be 0?

> Moreover, they will write to the same cache lines for every fprintf,
> and this is very, very bad even without atomic operations.

We're talking of transactions that involve network messages and possibly 
disk IOs on the server, so some cache issues within pgbench would not 
a priori be the main performance driver.

-- 
Fabien.



Re: PATCH: pgbench - merging transaction logs

From
Fabien COELHO
Date:
> Well, fprintf() doesn't have to acquire the lock for the entirety of
> its operation - just for the access to the stream buffer.

Yep. If it is implemented by appending stuff to the stream as the format 
is processed, this would still mean the whole time of its operation.

Hence preprocessing the string as you suggested may be desirable.

> Note that posix 2001 *does* guarantee that FILE* style IO is thread
> safe:
> "All functions that reference (FILE *) objects, except those with names
> ending in _unlocked, shall behave as if they use flockfile() and
> funlockfile() internally to obtain ownership of these (FILE *) objects."
>
> Hilariously that tidbit hidden in the documentation about
> flockfile. Very, err, easy to find:
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/flockfile.html

Thanks for the pointer!

-- 
Fabien.



Re: PATCH: pgbench - merging transaction logs

From
didier
Date:
Hi,

On Sat, Mar 21, 2015 at 8:42 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> Hello Didier,
>
>>> If fprintf takes p = 0.025 (1/40) of the time, then with 2 threads the
>>> collision probability would be about 1/40 and the delayed thread would be
>>> waiting for half this time on average, so the performance impact due to
>>> fprintf locking would be negligible (a 1/80 delay occurring in 1/40 cases
>>> => 1/3200 time added to the computed average, if I'm not mistaken).
Yes, but for a third thread (each on a physical core) it will be 1/40 +
1/40, and so on up to roughly 40/40 for 40 cores.

>
>
>> If threads run more or less the same code with the same timing, after
>> a while they will lockstep on synchronization primitives and your
>> collision probability will be very close to 1.
>
>
> I'm not sure I understand. If transaction times were really constant, then
> after a while the mutexes would be synchronised so as to avoid contention,
> i.e. the collision probability would be 0?
But they aren't constant, only close. It may or may not show up in this
case, but I've noticed that often the collision rate is a lot higher
than the probability would suggest; I'm not sure why.

>
>> Moreover, they will write to the same cache lines for every fprintf,
>> and this is very, very bad even without atomic operations.
>
>
> We're talking of transactions that involve network messages and possibly
> disk IOs on the server, so some cache issues within pgbench would not
> a priori be the main performance driver.
Sure, but:
- good measurement is hard, and adding locking in fprintf makes
its timing more noisy.

- it's against 'good practices' for scalable code. Trivial code can
show that the elapsed time for as few as four cores writing to the same
cache line in a loop, without locking or synchronization, is greater than
the elapsed time for running these four loops sequentially on one
core. If they write to different cache lines it scales linearly.

Regards
Didier



Re: PATCH: pgbench - merging transaction logs

From
Fabien COELHO
Date:
Hello,

> Yes but for a third thread (each on a physical core) it will be 1/40 +
> 1/40 and so on up to roughly 40/40 for 40 cores.

That is why I proposed a formula which depends on the number of threads.

> [...] But they aren't constant, only close. It may or may not show up in 
> this case, but I've noticed that often the collision rate is a lot higher 
> than the probability would suggest; I'm not sure why.

If so, I would suggest that the probability is wrong and try to 
understand why:-)

>>> Moreover, they will write to the same cache lines for every fprintf,
>>> and this is very, very bad even without atomic operations.
>>
>> We're talking of transactions that involve network messages and possibly
>> disk IOs on the server, so some cache issues within pgbench would not
>> a priori be the main performance driver.

> Sure, but:
> - good measurement is hard, and adding locking in fprintf makes
> its timing more noisy.

This really depends on the probability of lock collisions. If it is 
small enough, the impact would be negligible.

> - it's against 'good practices' for scalable code.
> Trivial code can show that the elapsed time for as few as four cores 
> writing to the same cache line in a loop, without locking or 
> synchronization, is greater than the elapsed time for running these 
> four loops sequentially on one core. If they write to different cache 
> lines it scales linearly.

I'm not arguing about general scalability principles, which may or may 
not be relevant to the case at hand.

I'm discussing whether the proposed feature can be implemented much more 
simply with a mutex instead of the current proposal, which is on the heavy 
side and thus induces more maintenance effort later.

Now I agree that if there is a mutex it must be held as short a time as 
possible and not hinder performance significantly for pertinent use cases. 
Note that the overhead evaluation by Tomas is pessimistic as it only 
involves read-only transactions for which all transaction details are 
logged. Note also that if you have 1000 cores to run pgbench and locking 
may be an issue, you could still use the per-thread logs.

The current discussion suggests that each thread should prepare the string 
off-lock (say with some sprintf) and then only lock when sending the 
string. This looks reasonable, but still needs to be validated (i.e. that 
the lock time would indeed be very small wrt the transaction time).

-- 
Fabien.



Re: PATCH: pgbench - merging transaction logs

From
Andres Freund
Date:
Hi,

Guys, I don't see this theoretical discussion going anywhere. I think
it's time to simply implement this and evaluate it on a bigger
machine. It can't take very long to implement: simply write to one
file instead of the multiple files as now. The POSIX-guaranteed fprintf
locking should already take care of the rest.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: PATCH: pgbench - merging transaction logs

From
Fabien COELHO
Date:
> Guys, I don't see this theoretical discussion going anywhere. I think
> it's time to simply implement this and evaluate it on a bigger
> machine.

Sure. I was kind of hoping that someone else would implement it, because 
I'm a reviewer on this one, and I do not have the bigger machine at hand 
either:-) If nothing happens and when I have some time, I'll do it.

> It can't take very long to implement: simply write to one
> file instead of the multiple files as now.

Sure.

> The posix guaranteed fprintf locking should already take care of the 
> rest.

I think I would try with snprintf to reduce locking.

Note that I'm first trying to evaluate the logging overhead, but I'm quite 
far from Tomas' figures at the moment; it looks more like 20% overhead (12 
threads, read-only load, runs at about 130000 tps with full logging, 160000 
without, repeated 5 times with similar results).

In such a case, a mutex would be a bad idea, but I must add that detailed 
logging is a plain bad idea as well!

-- 
Fabien.



Re: PATCH: pgbench - merging transaction logs

From
Andres Freund
Date:
On March 23, 2015 8:00:04 PM GMT+01:00, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>> Guys, I don't see this theoretical discussion going anywhere. I think
>> it's time to simply implement this and evaluate it on a bigger
>> machine.
>
>Sure. I was kind of hoping that someone else would implement it,
>because 
>I'm a reviewer on this one, and I do not have the bigger machine at
>hand 
>either:-) If nothing happens and when I have some time, I'll do it.

It wasn't specifically aimed at you...

>> It can't take very long to implement: simply write to one
>> file instead of the multiple files as now.
>
>Sure.
>
>> The posix guaranteed fprintf locking should already take care of the 
>> rest.
>
>I think I would try with snprintf to reduce locking.

Glibc's locking is around the streambuffer IIRC. So that's not likely to do much...


-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: PATCH: pgbench - merging transaction logs

From
Fabien COELHO
Date:
Hello Tomas,

> I do agree that fprintf is not cheap, actually when profiling pgbench
> it's often the #1 item, but the impact on the measurements is actually
> quite small. For example with a small database (scale 10) and read-only
> 30-second runs (single client), I get this:
>
>   no logging: 18672 18792 18667 18518 18613 18547
> with logging: 18170 18093 18162 18273 18307 18234
>
> So on average, that's 18634 vs. 18206, i.e. less than 2.5% difference.
> And with more expensive transactions (larger scale, writes, ...) the
> difference will be much smaller.

I did some testing with a scale 10 prepared "SELECT only" workload, 
200-second runs, plenty of them, with local socket connections on the 
largest host I have available:

  pgbench -P 10 -T 200 -S -M prepared -j $c -c $c ...

I think that this cpu-bound bench is kind of a worst possible case for the 
detailed per-transaction log.

I also implemented a quick and dirty version of a merged log based on 
sharing a file handle (append mode + sprintf + fputs), along the lines of 
the sketch below.
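
A minimal sketch of that quick and dirty version (invented names, error 
handling omitted):

    #include <stdio.h>

    /* one handle shared by all threads, opened once in append mode */
    static FILE *merged_log;

    static void
    merged_log_open(const char *name)
    {
        merged_log = fopen(name, "a");
    }

    static void
    merged_log_line(int client, double latency, long epoch, long usec)
    {
        char line[256];

        /* format off-lock, then emit with a single stream operation;
         * POSIX stream operations are atomic, so concurrent fputs
         * calls from different threads do not interleave */
        snprintf(line, sizeof(line), "%d %.0f %ld %ld\n",
                 client, latency, epoch, usec);
        fputs(line, merged_log);
    }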

The results are as follows:

 * 1 thread 33 runs median tps (average is consistent):
 - no logging:        22062
 - separate logging:  19360  (-12.2%)
 - merged logging:    19326  (-12.4%, not significant from previous)

Note that the impact of logging is much larger than with your tests.
The underlying fprintf comes from gnu libc 2.19.

The worst overhead I could trigger is with 12 threads:

 * 12 threads 35 runs median tps (average is consistent)
 - no logging:       155917
 - separate logging: 124695 (-20.0%)
 - merged logging:   119349 (-23.5%)

My conclusion from these figures is that although the direct merged 
logging approach adds some overhead, this overhead is small wrt detailed 
logging (it adds 3.5% to a 20% logging overhead) with 12 threads. Other 
tests, even with more threads, did not yield larger absolute or relative 
overheads. So although the direct merge approach does add overhead, it is 
a small addition to an already quite bad situation, which suggests that 
using the detailed log on a cpu-bound pgbench run is a bad idea to begin 
with.

For a more realistic test, I ran "simple updates" which involve actual 
disk writes. It ran at around 840 tps with 24 threads. The logging 
overhead seems to be under 1%, and there is no significant difference 
between separate and merged on the 20 runs.

So my overall conclusion is:

(1) The simple thread-shared file approach would save pgbench from 
post-processing merge-sort heavy code, for a reasonable cost.

(2) The feature would not be available for the thread-emulation with this 
approach, but I do not see this as a particular issue as I think that it 
is pretty much only dead code and a maintenance burden.

(3) Optimizing doLog from its current fprintf-based implementation may be 
a good thing.

-- 
Fabien.



Re: PATCH: pgbench - merging transaction logs

From
Tomas Vondra
Date:
On 28.3.2015 11:21, Fabien COELHO wrote:
> 
> Hello Tomas,
> 
>> I do agree that fprintf is not cheap, actually when profiling pgbench
>> it's often the #1 item, but the impact on the measurements is actually
>> quite small. For example with a small database (scale 10) and read-only
>> 30-second runs (single client), I get this:
>>
>>   no logging: 18672 18792 18667 18518 18613 18547
>> with logging: 18170 18093 18162 18273 18307 18234
>>
>> So on average, that's 18634 vs. 18206, i.e. less than 2.5% difference.
>> And with more expensive transactions (larger scale, writes, ...) the
>> difference will be much smaller.
> 
> I did some testing with a scale 10 prepared "SELECT only" workload,
> 200-second runs, plenty of them, with local socket connections on the
> largest host I have available:
> 
>   pgbench -P 10 -T 200 -S -M prepared -j $c -c $c ...
> 
> I think that this cpu-bound bench is kind of a worst possible case for
> the detailed per transaction log.
> 
> I also implemented a quick and dirty version for a merge log based on
> sharing a file handle (append mode + sprintf + fputs).
> 
> The results are as follows:
> 
>  * 1 thread 33 runs median tps (average is consistent):
>  - no logging:        22062
>  - separate logging:  19360  (-12.2%)
>  - merged logging:    19326  (-12.4%, not significant from previous)
> 

Interesting. What hardware is this? I wouldn't be surprised by this
behavior on a multi-socket system, where the cross-cpu transfers are
likely more expensive than on the single-socket system I used (sorry, I
forgot to mention what hardware I used).

> Note that the impact of logging is much larger than with your tests.
> The underlying fprintf comes from gnu libc 2.19.

I'm running glibc 2.19 too, on kernel 3.17 (might be important too).

> The worst overhead I could trigger is with 12 threads:
> 
>  * 12 threads 35 runs median tps (average is consistent)
>  - no logging:       155917
>  - separate logging: 124695 (-20.0%)
>  - merged logging:   119349 (-23.5%)
> 
> My conclusion from these figures is that although the direct merged 
> logging approach adds some overhead, this overhead is small wrt
> detailed logging (it adds 3.5% to a 20% logging overhead) with 12
> threads. Other tests, even with more threads, did not yield larger
> absolute or relative overheads. So although the direct merge approach
> does add overhead, it is a small addition to an already quite bad
> situation, which suggests that using the detailed log on a cpu-bound
> pgbench run is a bad idea to begin with.

Yeah, seems like that.

> For a more realistic test, I ran "simple updates" which involve actual
> disk writes. It ran at around 840 tps with 24 threads. The logging
> overhead seems to be under 1%, and there is no significant difference
> between separate and merged on the 20 runs.
> 
> So my overall conclusion is:
> 
> (1) The simple thread-shared file approach would save pgbench from 
> post-processing merge-sort heavy code, for a reasonable cost.

No it wouldn't - you're missing the fact that the proposed approach
(shared file + fprintf) only works with the raw transaction log.

It does not work with the aggregated log - the threads would have to
track the progress of the other threads somehow, in a very non-trivial
way (e.g. what if one of the threads executes a long query, and thus
does not send the results for a long time?). Another option would be to
update shared aggregated results, but that requires locking.

> (2) The feature would not be available for the thread-emulation with 
> this approach, but I do not see this as a particular issue as I
> think that it is pretty much only dead code and a maintenance burden.

I'm not willing to investigate that, nor am I willing to implement
another feature that works only sometimes (I've done that in the past,
and I find it a bad practice).

If someone else is willing to try to eliminate the thread emulation, I
won't object to that. But as I pointed out above, simple fprintf is not
going to work for the aggregated log - solving that will need more code
(e.g. maintaining aggregated results for all threads, requiring
additional locking etc).

> 
> (3) Optimizing doLog from its current fprintf-based implementation
> may be a good thing.

That's probably true. The simplest thing we can do right now is
buffering the data into larger chunks and writing those chunks. That
amortizes the costs of locking. Using O_APPEND, as suggested by Andres,
seems like a promising idea.

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: pgbench - merging transaction logs

From
Fabien COELHO
Date:
Hello Tomas,

>> The results are as follow:
>>
>>  * 1 thread 33 runs median tps (average is consistent):
>>  - no logging:        22062
>>  - separate logging:  19360  (-12.2%)
>>  - merged logging:    19326  (-12.4%, not significant from previous)
>
> Interesting. What hardware is this?

Dell PowerEdge R720 with 2 Intel Xeon CPU E5-2660 2.20GHz (8 cores and 16 
threads per processor, so 32 threads in total), running with Linux 3.13 
(Ubuntu trusty).

> I wouldn't be surprised by this behavior on a multi-socket system, [...]

There are 2 sockets.

>> So my overall conclusion is:
>>
>> (1) The simple thread-shared file approach would save pgbench from
>> post-processing merge-sort heavy code, for a reasonable cost.
>
> No it wouldn't - you're missing the fact that the proposed approach
> (shared file + fprintf) only works with the raw transaction log.
>
> It does not work with the aggregated log - the threads would have to
> track the progress of the other threads somehow, in a very non-trivial
> way (e.g. what if one of the threads executes a long query, and thus
> does not send the results for a long time?).

The counters are updated when the transaction is finished anyway?

> Another option would be to update shared aggregated results, but that 
> requires locking.

That is what I had in mind. ISTM that the locking impact would be much 
lower than for logging: the data is just locked for a counter update, and 
if counters are per-thread, a conflict may only occur when the data are 
gathered for actual logging, which would be rather infrequent. Even if the 
counters are shared, the locking would be for a short time, which implies 
a low conflict probability. So I do not see this one as a significant 
performance issue.

>> (2) The feature would not be available for the thread-emulation with
>> this approach, but I do not see this as a particular issue as I
>> think that it is pretty much only dead code and a maintenance burden.
>
> I'm not willing to investigate that, nor am I willing to implement
> another feature that works only sometimes (I've done that in the past,
> and I find it a bad practice).

Hmmm. Keeping an obsolete feature with significant impact on how other 
features can be implemented, so basically a maintenance burden, does not 
look like best practice *either*.

> If someone else is willing to try to eliminate the thread emulation, I
> won't object to that.

Hmmm. I'll try to trigger a discussion in another thread to test the idea.

> But as I pointed out above, simple fprintf is not going to work for the 
> aggregated log - solving that will need more code (e.g. maintaining 
> aggregated results for all threads, requiring additional locking etc).

The code for that is probably simple and short, and my wish is to try to 
avoid an external merge-sort post-processing, if possible, which is not 
especially cheap anyway, neither in code nor in time.

>> (3) Optimizing doLog from its current fprintf-based implementation
>> may be a good thing.
>
> That's probably true. The simplest thing we can do right now is
> buffering the data into larger chunks and writing those chunks.
> That amortizes the costs of locking.

If it is buffered in the process, that would mean more locking. If it is 
buffered per thread, that would result in out-of-order logs. Would that be 
an issue? It would be fine with me.

> Using O_APPEND, as suggested by Andres, seems like a promising idea.

I tried that with a shared file handle, but the impact seemed negligible. 
The figures I reported used it, btw.

I also tried to open the same file in append mode from all threads, with 
positive performance effects, but then the flush did not occur at line 
boundaries, so there was some mangling in the result.
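
A way around that mangling (just a sketch, untested here, with invented 
names) could be to bypass stdio and emit each line with a single write(2) 
on an O_APPEND file descriptor:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int log_fd;          /* hypothetical descriptor, O_APPEND */

    static void
    log_open(const char *name)
    {
        log_fd = open(name, O_WRONLY | O_CREAT | O_APPEND, 0644);
    }

    static void
    log_line(int client, double latency, long epoch, long usec)
    {
        char line[256];
        int  len;

        len = snprintf(line, sizeof(line), "%d %.0f %ld %ld\n",
                       client, latency, epoch, usec);
        /* one write per line: with O_APPEND the kernel picks the
         * offset atomically, and short writes to a regular file are
         * in practice appended whole, so lines are not cut mid-way */
        write(log_fd, line, len);
    }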

-- 
Fabien.



Re: PATCH: pgbench - merging transaction logs

From
Fabien COELHO
Date:
>> I also implemented a quick and dirty version for a merge log based on
>> sharing a file handle (append mode + sprintf + fputs).

I tried the "append + per-thread 2KB buffered sprintf + fputs when full"
approach, with the same number of runs (a sketch follows the figures
below). The logs are out of order by chunks; the overhead seems higher
with 1 thread, but there is no extra overhead with 12 threads.

>> The results are as follows:
>>
>>  * 1 thread 33 runs median tps (average is consistent):
>>  - no logging:        22062
>>  - separate logging:  19360  (-12.2%)
>>  - merged logging:    19326  (-12.4%, not significant from previous)
 - buf merged logging:   18751 (-15%, seems significant)

>> The worst overhead I could trigger is with 12 threads:
>>
>>  * 12 threads 35 runs median tps (average is consistent)
>>  - no logging:       155917
>>  - separate logging: 124695 (-20.0%)
>>  - merged logging:   119349 (-23.5%)
 - buf merged logging: 124914 (-19.9%, not significant from separate)
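
The buffered version looks roughly like this (a sketch with invented 
names; a line is assumed to fit in the buffer, and the remainder still 
has to be flushed when the thread ends):

    #include <stdio.h>
    #include <string.h>

    #define LOG_BUF_SIZE 2048

    /* hypothetical per-thread buffer, zero-initialized */
    typedef struct LogBuffer
    {
        char buf[LOG_BUF_SIZE];
        int  used;
    } LogBuffer;

    static FILE *merged_log;    /* shared handle, append mode */

    static void
    log_buffered(LogBuffer *lb, const char *line)
    {
        int len = strlen(line);

        /* flush the chunk when the next line would not fit; the single
         * fputs keeps the whole chunk contiguous in the shared file */
        if (lb->used + len >= LOG_BUF_SIZE)
        {
            fputs(lb->buf, merged_log);
            lb->used = 0;
        }
        memcpy(lb->buf + lb->used, line, len + 1);
        lb->used += len;
    }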

-- 
Fabien.



Re: PATCH: pgbench - merging transaction logs

From
Tomas Vondra
Date:
On 29.3.2015 10:58, Fabien COELHO wrote:

>>> So my overall conclusion is:
>>>
>>> (1) The simple thread-shared file approach would save pgbench from
>>> post-processing merge-sort heavy code, for a reasonable cost.
>>
>> No it wouldn't - you're missing the fact that the proposed approach
>> (shared file + fprintf) only works with the raw transaction log.
>>
>> It does not work with the aggregated log - the threads would have to
>> track the progress of the other threads somehow, in a very non-trivial
>> way (e.g. what if one of the threads executes a long query, and thus
>> does not send the results for a long time?).
> 
> The counters are updated when the transaction is finished anyway?

Yes, but the thread does not know it's time to write the results until
it completes the first transaction after the interval ends ...

Let's say the very first query in thread #1 takes a minute for some
reason, while the other threads process 100 transactions per second. So
before thread #1 can report 0 transactions for the first second, the
other threads already have results for 60 intervals.

I think there's no way to make this work except for somehow tracking the
timestamp of the last submitted results for each thread, and only
flushing results older than the minimum of those timestamps. But that's
not trivial - it certainly is more complicated than just writing into a
shared file descriptor.
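
To illustrate, a hypothetical sketch of that tracking (invented names,
and certainly not all of it):

    #include <pthread.h>

    #define MAX_THREADS 64          /* hypothetical bound */

    /* end of the last interval each thread has fully reported */
    static long last_reported[MAX_THREADS];
    static int  nthreads;
    static pthread_mutex_t report_mutex = PTHREAD_MUTEX_INITIALIZER;

    /*
     * Called when a thread finishes reporting the interval ending at
     * 'ts'; returns the timestamp up to which aggregated results are
     * complete for all threads and may be flushed in order.
     */
    static long
    advance_and_get_flushable(int thread, long ts)
    {
        long min_ts;
        int  i;

        pthread_mutex_lock(&report_mutex);
        last_reported[thread] = ts;
        min_ts = last_reported[0];
        for (i = 1; i < nthreads; i++)
            if (last_reported[i] < min_ts)
                min_ts = last_reported[i];
        pthread_mutex_unlock(&report_mutex);

        return min_ts;
    }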

>> Another option would be to update shared aggregated results, but
>> that requires locking.
> 
> That is what I had in mind. ISTM that the locking impact would be
> much lower than for logging, the data is just locked for a counter
> update, and if counters are per-thread, a conflict may only occur
> when the data are gathered for actual logging, which would be rather
> infrequent. Even if the counters are shared, the locking would be for
> small time that would imply a low conflict probability. So I do not
> see this one as a significant performance issue.

No, for the reasons I explained above (thread not sending results for a
long time, etc.).

Merging results for each transaction would not have this issue, but it
would also use the lock much more frequently, and that seems like a
pretty bad idea (especially for the workloads with short transactions
that you suggested are bad match for detailed log - this would make the
aggregated log bad too).

Also notice that all the threads will try to merge the data (and thus
acquire the lock) at almost the same time - this is especially true
for very short transactions. I would be surprised if this did not cause
issues on read-only workloads with large numbers of threads.

>>> (2) The feature would not be available for the thread-emulation with
>>> this approach, but I do not see this as a particular issue as I
>>> think that it is pretty much only dead code and a maintenance burden.
>>
>> I'm not willing to investigate that, nor am I willing to implement
>> another feature that works only sometimes (I've done that in the past,
>> and I find it a bad practice).
> 
> Hmmm. Keeping an obsolete feature with significant impact on how
> other features can be implemented, so basically a maintenance burden,
> does not look like best practice *either*.

That is not what I said, though.

If thread emulation is considered obsolete - fine, let's remove it. But
that's not really certain at this point, and until it's actually removed
I'm not particularly thrilled by adding a feature that does not work
with --disable-thread-safety.

>> If someone else is willing to try to eliminate the thread
>> emulation, I won't object to that.
> 
> Hmmm. I'll try to trigger a discussion in another thread to test the
> idea.

OK, good.

> 
>> But as I pointed out above, simple fprintf is not going to work
>> for the aggregated log - solving that will need more code (e.g. 
>> maintaining aggregated results for all threads, requiring
>> additional locking etc).
> 
> The code for that is probably simple and short, and my wish is to
> try to avoid an external merge sort post processing, if possible,
> which is not especially cheap anyway, neither in code nor in time.

First, I don't think it's as simple as you think.

Second, while the merge sort is not free, it happens *after* the pgbench
run has completed and does not interfere with it at all.


>>> (3) Optimizing doLog from its current fprintf-based implementation
>>> may be a good thing.
>>
>> That's probably true. The simplest thing we can do right now is 
>> buffering the data into larger chunks and writing those chunks. 
>> That amortizes the costs of locking.
> 
> If it is buffered in the process, that would mean more locking. If
> it is buffered per threads that would result in out of order logs.
> Would that be an issue? It would be fine with me.

You're mixing two things - optimizing doLog and writing into a shared
file. I was talking about optimizing doLog as it stands now, i.e. per
thread, and in that case it does not cause any out-of-order logs.

Also, if the lock for the shared buffer is cheaper than the lock
required for fprintf, it may still be an improvement.

>> Using O_APPEND, as suggested by Andres, seems like a promising idea.
> 
> I tried that with a shared file handle, but the impact seemed 
> negligeable. The figures I reported used it, btw.
> 
> I also tried to open the same file in append mode from all threads,
> with positive performance effects, but then the flush did not occur
> at line boundaries to there was some mangling in the result.

OK.


-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: pgbench - merging transaction logs

From
Fabien COELHO
Date:
Hello,

>> The counters are updated when the transaction is finished anyway?
>
> Yes, but the thread does not know it's time to write the results until
> it completes the first transaction after the interval ends ...
>
> Let's say the very first query in thread #1 takes a minute for some
> reason, while the other threads process 100 transactions per second. So
> before the thread #1 can report 0 transactions for the first second, the
> other threads already have results for the 60 intervals.
>
> I think there's no way to make this work except for somehow tracking
> timestamp of the last submitted results for each thread, and only
> flushing results older than the minimum of the timestamps. But that's
> not trivial - it certainly is more complicated than just writing into a
> shared file descriptor.

I agree that such an approach would be horrible for a very limited 
value. However I was suggesting that a transaction is counted only when it 
is finished, so the aggregated data is to be understood as referring to 
"finished transactions in the interval", and what is in progress would be 
counted in the next interval anyway.

> Merging results for each transaction would not have this issue, but it
> would also use the lock much more frequently, and that seems like a
> pretty bad idea (especially for the workloads with short transactions
> that you suggested are bad match for detailed log - this would make the
> aggregated log bad too).
>
> Also notice that all the threads will try to merge the data (and
> thus acquire the lock) at almost the same time - this is especially true
> for very short transactions. I would be surprised if this did not cause
> issues on read-only workloads with large numbers of threads.

ISTM that the aggregated version should fare better than the detailed log,
whatever is done: the key performance issue is that fprintf is slow; with 
the aggregated log these calls are infrequent, and only arithmetic remains 
in the critical section.
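
The critical section I have in mind would be about this small (a sketch,
with invented names; the mutex is assumed to be initialized elsewhere):

    #include <pthread.h>

    /* hypothetical shared aggregate for the current interval */
    typedef struct SharedAgg
    {
        pthread_mutex_t mutex;
        long    cnt;            /* finished transactions */
        double  sum_lat;        /* summed latencies, for the average */
        double  min_lat;
        double  max_lat;
    } SharedAgg;

    static void
    agg_add(SharedAgg *agg, double latency)
    {
        /* only counter arithmetic under the lock; the one fprintf per
         * interval happens elsewhere, outside the critical section */
        pthread_mutex_lock(&agg->mutex);
        agg->cnt++;
        agg->sum_lat += latency;
        if (latency < agg->min_lat)
            agg->min_lat = latency;
        if (latency > agg->max_lat)
            agg->max_lat = latency;
        pthread_mutex_unlock(&agg->mutex);
    }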

>>>> (2) The feature would not be available for the thread-emulation with
>>>> this approach, but I do not see this as a particular issue as I
>>>> think that it is pretty much only dead code and a maintenance burden.
>>>
>>> I'm not willing to investigate that, nor am I willing to implement
>>> another feature that works only sometimes (I've done that in the past,
>>> and I find it a bad practice).
>
> [...]

After the small discussion I triggered, I've submitted a patch to drop 
thread fork-emulation from pgbench.

> [...]
> Also, if the lock for the shared buffer is cheaper than the lock
> required for fprintf, it may still be an improvement.

Yep. "fprintf" does a lot of processing, so it is the main issue.

-- 
Fabien.



Re: PATCH: pgbench - merging transaction logs

From
Tomas Vondra
Date:
Hi,

On 05/02/15 15:30, Fabien COELHO wrote:
>
> Hello,
>
>>> The counters are updated when the transaction is finished anyway?
>>
>> Yes, but the thread does not know it's time to write the results until
>> it completes the first transaction after the interval ends ...
>>
>> Let's say the very first query in thread #1 takes a minute for some
>> reason, while the other threads process 100 transactions per second. So
>> before the thread #1 can report 0 transactions for the first second, the
>> other threads already have results for the 60 intervals.
>>
>> I think there's no way to make this work except for somehow tracking
>> timestamp of the last submitted results for each thread, and only
>> flushing results older than the minimum of the timestamps. But that's
>> not trivial - it certainly is more complicated than just writing into a
>> shared file descriptor.
>
> I agree that such an approach would be horrible for a very limited
> value. However I was suggesting that a transaction is counted only when
> it is finished, so the aggregated data is to be understood as referring
> to "finished transactions in the interval", and what is in progress
> would be counted in the next interval anyway.

That only works if every single transaction is immediately written into 
the shared buffer/file, but that would require acquiring a lock shared 
by all the threads. And that's not really free - for cases with many 
clients doing tiny transactions, this might be a big issue, for example.

That's why I suggested that each client uses a shared buffer for the 
results, and only submits the results once the interval is over. 
Submitting the results, however, happens on the first transaction from the 
next interval; if that transaction is long, the results are not submitted 
until it completes.

It might be done in the other direction, though - the "writer" thread 
might collect current results at the end of the interval.

>> Merging results for each transaction would not have this issue, but
>> it would also use the lock much more frequently, and that seems
>> like a pretty bad idea (especially for the workloads with short
>> transactions that you suggested are bad match for detailed log -
>> this would make the aggregated log bad too).
>>
>> Also notice that with all the threads will try to merge the data
>> (and thus acquire the lock) at almost the same time - this is
>> especially true for very short transactions. I would be surprised
>> if this did not cause issues on read-only workloads with large
>> numbers of threads.
>
> ISTM that the aggregated version should fare better than the
> detailed log, whatever is done: the key performance issue is that
> fprintf is slow; with the aggregated log these calls are infrequent,
> and only arithmetic remains in the critical section.

Probably.

>>>>> (2) The feature would not be available for the thread-emulation with
>>>>> this approach, but I do not see this as a particular issue as I
>>>>> think that it is pretty much only dead code and a maintenance burden.
>>>>
>>>> I'm not willing to investigate that, nor am I willing to implement
>>>> another feature that works only sometimes (I've done that in the past,
>>>> and I find it a bad practice).
>>
>> [...]
>
> After the small discussion I triggered, I've submitted a patch to drop
> thread fork-emulation from pgbench.

OK, good.

>
>> [...]
>> Also, if the lock for the shared buffer is cheaper than the lock
>> required for fprintf, it may still be an improvement.
>
> Yep. "fprintf" does a lot of processing, so it is the main issue.

The question is whether the processing happens while holding the lock, 
though. I don't think that's the case.


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: pgbench - merging transaction logs

From
Fabien COELHO
Date:
> It might be done in the other direction, though - the "writer" thread 
> might collect current results at the end of the interval.

Yep, you can indeed accumulate per thread and sum at the end of the 
interval, but a lock is still needed if you want exact figures.

ISTM that this is what is done for --progress, without even bothering to 
lock anything (even if transactions are fast and some inconsistencies 
arise, this is just for showing how things are faring on stderr, so no big 
deal).

>> "fprintf" does a lot of processing, so it is the main issue.
>
> The question is whether the processing happens while holding the lock, 
> though. I don't think that's the case.

Alas, I'm pretty nearly sure it is necessarily the case: there is no 
intermediate buffer in fprintf, the buffering is done through the file 
handle, so a lock must be acquired throughout format processing.

https://www.gnu.org/software/libc/manual/html_node/Streams-and-Threads.html#Streams-and-Threads

"The POSIX standard requires that by default the stream operations are 
atomic.  I.e., issuing two stream operations for the same stream in two 
threads at the same time will cause the operations to be executed as if 
they were issued sequentially. The buffer operations performed while 
reading or writing are protected from other uses of the same stream. To do 
this each stream has an internal lock object which has to be (implicitly) 
acquired before any work can be done. "

-- 
Fabien.