Thread: XLogInsert
In XLogInsert (src/backend/access/transam/xlog.c), the part that adds back-up
blocks into the rdata chain is described like this:

	/*
	 * Make additional rdata chain entries for the backup blocks, so that we
	 * don't need to special-case them in the write loop.  Note that we have
	 * now irrevocably changed the input rdata chain.

If I read the code correctly, the only thing that is irrevocable is that it
writes into rdt->next, and if it saved an old copy of rdt first, then it
could revoke the changes just by doing rdt_old->next=NULL.  If that were
done, then I think this code could be moved out of the section holding the
WALInsertLock.  It could probably be folded into the part of the code that
computes the CRC.

I don't think this would be a big performance win, as that part of the code
is pretty fast, but every bit helps in a highly contended lock, and I think
the code would be simpler as well.

Do you think it would be worth me giving this a try?

Cheers,

Jeff
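To make the "save and revoke" idea concrete, here is a minimal standalone
sketch; the struct and function names are simplified stand-ins invented for
illustration, not the actual xlog.c code:

#include <stddef.h>

typedef struct RDataSketch
{
    char                *data;
    size_t               len;
    struct RDataSketch  *next;
} RDataSketch;

/*
 * Append the backup-block entries to the caller's rdata chain, but return
 * the old tail so the caller can undo the append with tail->next = NULL.
 */
RDataSketch *
append_backup_blocks(RDataSketch *rdt, RDataSketch *bkp_entries)
{
    while (rdt->next != NULL)
        rdt = rdt->next;            /* find the current tail of the chain */

    rdt->next = bkp_entries;        /* the "irrevocable" write ...        */
    return rdt;                     /* ... revoked by setting next = NULL */
}

In XLogInsert itself the equivalent would be remembering the chain tail
before the backup-block entries are linked in, so the link can be cleared if
the insert has to be retried.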
Jeff Janes <jeff.janes@gmail.com> writes:
> If I read the code correctly, the only thing that is irrevocable is
> that it writes into rdt->next, and if it saved an old copy of rdt
> first, then it could revoke the changes just by doing
> rdt_old->next=NULL.  If that were done, then I think this code could
> be moved out of the section holding the WALInsertLock.

Hmm, I recall that the changes are ... or were ... more complex.
The tricky case I think is where we have to go back and redo the
block-backup decisions after discovering that the checkpoint REDO
pointer has just moved.

If you can get the work out of the WALInsertLock section for just a
few more instructions, it would definitely be worth doing.

			regards, tom lane
On Wed, Aug 19, 2009 at 9:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> If I read the code correctly, the only thing that is irrevocable is
>> that it writes into rdt->next, and if it saved an old copy of rdt
>> first, then it could revoke the changes just by doing
>> rdt_old->next=NULL.  If that were done, then I think this code could
>> be moved out of the section holding the WALInsertLock.
>
> Hmm, I recall that the changes are ... or were ... more complex.
> The tricky case I think is where we have to go back and redo the
> block-backup decisions after discovering that the checkpoint REDO
> pointer has just moved.
>
> If you can get the work out of the WALInsertLock section for just a
> few more instructions, it would definitely be worth doing.
I've attached a patch which removes the iteration over the blocks to be backed-up from the critical section of XLogInsert. Now those blocks are only looped over in one piece of code which both computes the CRC and builds the linked list, rather than having parallel loops.
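Roughly, that single piece of code has the following shape: one pass that
folds each backup block into the record CRC and links it onto the chain at
the same time. The sketch below is illustrative only, with simplified
stand-in types and a toy checksum in place of PostgreSQL's COMP_CRC32():

#include <stddef.h>

typedef struct RData
{
    const char   *data;
    size_t        len;
    struct RData *next;
} RData;

static unsigned int
fold_crc(unsigned int crc, const char *data, size_t len)
{
    /* toy stand-in for COMP_CRC32(), just to keep the sketch self-contained */
    while (len-- > 0)
        crc = (crc << 1) ^ (unsigned char) *data++;
    return crc;
}

/* Fold the backup blocks into the CRC and chain them in the same pass. */
unsigned int
crc_and_chain(RData *tail, RData bkp[], int nblocks, unsigned int crc)
{
    int     i;

    for (i = 0; i < nblocks; i++)
    {
        crc = fold_crc(crc, bkp[i].data, bkp[i].len);
        tail->next = &bkp[i];       /* build the linked list here too */
        tail = &bkp[i];
    }
    tail->next = NULL;
    return crc;
}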
I've used an elog statement (not shown in the patch) to demonstrate that the "goto begin;" after detecting the REDO race actually does get executed under a standard workload (pgbench -c10). Two to four out of the 10 backends execute that code path for each checkpoint on my single-CPU machine. By doing a kill -9 on a process to simulate a crash, during the period after the goto begin is exercised but before the precipitating checkpoint completes, I can force it to use the written WAL records in recovery. The database automatically recovers and the results are self-consistent.
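The retry being exercised here follows this general pattern (a standalone
sketch with hypothetical helper names, not the patch itself): prepare the
record against a cached copy of the REDO pointer outside the lock, and if
the shared value has moved once the lock is held, revoke the chain change
and start over:

#include <stdint.h>

typedef struct Chain
{
    struct Chain *next;
} Chain;

/* Hypothetical stand-ins for the real locking and WAL machinery. */
extern void     lock_wal_insert(void);
extern void     unlock_wal_insert(void);
extern uint64_t shared_redo_ptr(void);        /* value protected by the lock */
extern uint64_t prepare_record(Chain *tail);  /* CRC + backup-block entries,
                                               * done without the lock; returns
                                               * the REDO pointer it assumed */

void
insert_record(Chain *tail)
{
    uint64_t    assumed_redo;
    Chain      *saved_next;

begin:
    saved_next = tail->next;             /* remember so we can revoke       */
    assumed_redo = prepare_record(tail); /* may append backup-block entries */

    lock_wal_insert();
    if (shared_redo_ptr() != assumed_redo)
    {
        tail->next = saved_next;         /* revoke the chain changes        */
        unlock_wal_insert();
        goto begin;                      /* redo the backup-block decisions */
    }
    /* ... copy the record into the WAL buffers while holding the lock ... */
    unlock_wal_insert();
}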
I cannot imagine any other races, rare events, or action at a distance that could come into play with this code change, so I cannot think of anything else to test at the moment.
I could not detect a speed difference with pgbench, but as I cannot get pgbench to be XLogInsert-bound, that is not surprising. Using the only XLogInsert-bound test case I know of (parallel COPY into a skinny, unindexed table, using 8 parallel copies on a 4 x dual-core x86_64 and with fsync turned off, to approximately simulate an SSD, which I do not have), I get a speed improvement of 2-4% with the patch over unpatched head. Maybe with more CPUs the benefit would be greater.
That small improvement is probably not very attractive, however I think the patch makes the overall code a bit cleaner, so it may be warranted on that ground. Indeed, my motivation for working on this is that I kept beating my head against the complexity of the old code, and thought that simplifying it would make future work easier.
Cheers,
Jeff
A little bit of a reply to Jeff's email about WALInsertLock.

This patch instruments LWLocks; it is controlled with the following
#define's in lwlock.c:

LWLOCK_STATS
LWLOCK_TIMING_STATS

It is an upgrade of the current lwlock stats. When active, at backend exit,
it will display stats as shown below (here, we have a parallel COPY with 4
concurrent processes into the same table, on a 4-core machine). If the
(rather wide) sample output is mangled in your mail client, I've attached
it as a separate text file.

-------- Lock stats for PID 22403
PID    Lock  ShAcq  ShWait  ShWaitT  ShHeldT  ExAcq    ExWait  ExWaitT           ExHeldT          Name
22403  7     0      0       0.00     0.00     2500002  730338  24.02 ( 53.49 %)  7.25 ( 16.14 %)  WALInsert
22403  8     0      0       0.00     0.00     19501    73       3.48 (  7.75 %)  0.40 (  0.88 %)  WALWrite
-------- Lock stats for PID 22404
PID    Lock  ShAcq  ShWait  ShWaitT  ShHeldT  ExAcq    ExWait  ExWaitT           ExHeldT          Name
22404  7     0      0       0.00     0.00     2500002  724683  23.34 ( 51.59 %)  8.24 ( 18.20 %)  WALInsert
22404  8     0      0       0.00     0.00     19418    90       4.37 (  9.67 %)  0.44 (  0.97 %)  WALWrite
-------- Lock stats for PID 22402
PID    Lock  ShAcq  ShWait  ShWaitT  ShHeldT  ExAcq    ExWait  ExWaitT           ExHeldT          Name
22402  7     0      0       0.00     0.00     2500002  735958  24.06 ( 52.73 %)  8.05 ( 17.63 %)  WALInsert
22402  8     0      0       0.00     0.00     19154    97       4.21 (  9.22 %)  0.39 (  0.85 %)  WALWrite
-------- Lock stats for PID 22400
PID    Lock  ShAcq  ShWait  ShWaitT  ShHeldT  ExAcq    ExWait  ExWaitT           ExHeldT          Name
22400  7     0      0       0.00     0.00     2500002  736265  25.50 ( 55.59 %)  6.74 ( 14.70 %)  WALInsert
22400  8     0      0       0.00     0.00     19391    66       2.95 (  6.42 %)  0.39 (  0.85 %)  WALWrite

Here we see that PID 22400 spent:

  25.50 s waiting to get exclusive on WALInsert
   6.74 s while holding exclusive on WALInsert

The percentages represent the fraction of time relative to the backend
process' lifetime. Here, I've exited the processes right after committing
the transactions, but if you use psql and want accurate %, you'll need to
exit quickly after the query to benchmark. Here, for example, backends
spend more than 50% of their time waiting on WALInsert...
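For readers unfamiliar with this kind of instrumentation, the general shape
of per-lock timing like LWLOCK_TIMING_STATS is sketched below; this is not
the patch itself, just an illustration with stand-in types, covering only
exclusive acquisitions:

#include <sys/time.h>

typedef struct LockTiming
{
    long    ex_acquires;
    double  ex_wait_secs;       /* time spent waiting to acquire */
    double  ex_held_secs;       /* time spent holding the lock   */
    double  acquired_at;        /* timestamp of the last acquire */
} LockTiming;

static double
now_secs(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

void
timed_acquire(LockTiming *t, void (*do_acquire)(void))
{
    double  start = now_secs();

    do_acquire();               /* may block on the lock */
    t->acquired_at = now_secs();
    t->ex_wait_secs += t->acquired_at - start;
    t->ex_acquires++;
}

void
timed_release(LockTiming *t, void (*do_release)(void))
{
    t->ex_held_secs += now_secs() - t->acquired_at;
    do_release();
}

At backend exit the accumulated counters can then be printed per lock, which
is where the percentages in the report above come from.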
2009/9/14 Pierre Frédéric Caillaud <lists@peufeu.com>
Hi Pierre,
Have you looked at the total execution time with and without the LWLOCK_TIMING_STATS?
I've implemented something similar to this myself (only without attempting to make it portable and otherwise worthy of submitting as a general-interest patch). What I found is that attempting to time every "hold" substantially increased the overall run time (which I worry distorts the reported times; cue bad Heisenberg analogies). The problem is that gettimeofday is slow, and on some multi-processor systems it is a global point of serialization, making it even slower. I decided to time only the time spent waiting on a block, and not the time spent holding the lock. This way you only call gettimeofday twice if you actually need to block, and not at all if you immediately get the lock. This had a much smaller effect on runtime, and the info produced was sufficient for my purposes.
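A sketch of that cheaper approach, with hypothetical stand-in helpers rather
than the real LWLock API: try a non-blocking acquire first, and only call
gettimeofday on the slow path where the backend actually has to wait:

#include <stdbool.h>
#include <sys/time.h>

extern bool try_acquire(void);      /* stand-in: non-blocking attempt */
extern void acquire(void);          /* stand-in: blocking acquire     */

static double blocked_secs;         /* per-backend accumulator        */

void
acquire_timing_waits_only(void)
{
    struct timeval t0,
                   t1;

    if (try_acquire())
        return;                     /* fast path: no clock calls at all */

    gettimeofday(&t0, NULL);
    acquire();                      /* slow path: we really blocked     */
    gettimeofday(&t1, NULL);
    blocked_secs += (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
}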
Not that this changes your conclusion: with or without that distortion, I completely believe that WALInsertLock is the bottleneck for parallel bulk copy into unindexed tables. I just can't find anything else for which it is the primary bottleneck. I think the only real solution for bulk copy is to call XLogInsert less often. For example, it could build blocks in local memory, then when done copy each one into the shared buffers and toss the entire block into WAL in one call. Easier said than implemented, of course.
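As a purely conceptual sketch of that idea (every name here is a
hypothetical stand-in, not an existing PostgreSQL function):

#define BLCKSZ 8192                 /* PostgreSQL's default block size */

extern void fill_local_page(char *page);            /* stand-in: pack many rows */
extern void copy_into_shared_buffers(const char *page);   /* stand-in           */
extern void xlog_insert_full_page(const char *page);      /* stand-in           */

void
bulk_copy_one_page(void)
{
    char    page[BLCKSZ];

    fill_local_page(page);          /* append rows locally, no WAL yet  */
    copy_into_shared_buffers(page); /* publish the finished page        */
    xlog_insert_full_page(page);    /* one XLogInsert call per page,
                                     * instead of one per row           */
}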
Cheers,
Jeff
> Have you looked at the total execution time with and without the
> LWLOCK_TIMING_STATS?

It didn't show any significant overhead on the little COPY test I made. On
selects, it probably does (just like EXPLAIN ANALYZE), but I didn't test.
It is not meant to be always active; it's a #define, so I guess it would be
OK though.

I'm going to modify it according to your suggestions and repost it (why
didn't I do that first?...)

> Not that this changes your conclusion.  With or without that distortion I
> completely believe that WALInsertLock is the bottleneck of parallel bulk
> copy into unindexed tables.  I just can't find anything else it is a
> primary bottleneck on.  I think the only real solution for bulk copy is
> to call XLogInsert less often.  For example, it could build blocks in
> local memory, then when done copy it into the shared buffers and then
> toss the entire block into WAL in one call.  Easier said than
> implemented, of course.

Actually,
http://archives.postgresql.org/pgsql-hackers/2009-09/msg00806.php
On Sun, Sep 13, 2009 at 10:42 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> I've attached a patch which removes the iteration over the blocks to be
> backed-up from the critical section of XLogInsert.  Now those blocks are
> only looped over in one piece of code which both computes the CRC and
> builds the linked list, rather than having parallel loops.
>

ok, i started to review this one... what the patch is doing seems very
obvious and doesn't break the original logic AFAIUI (i could be wrong, this
seems like chinese to me ;)

i made some tests archiving wal and recovering... crashing the server and
restarting... every test was running the regression tests, don't know what
else to do, ideas?

i haven't made any performance tests but it should gain something :), maybe
someone can make those tests?  if not, i will mark the patch as ready for
committer some time in the next hours...

--
Regards,
Jaime Casanova
PostgreSQL support and training
Systems consulting and development
Guayaquil - Ecuador
Cel. +59387171157
Jaime Casanova <jcasanov@systemguards.com.ec> writes:
> i haven't made any performance tests but it should gain something :),
> maybe someone can make those tests?

The argument for changing this at all is only that it will improve
performance, so I'd like some independent confirmation that it does.

			regards, tom lane
Tom Lane wrote:
> Jaime Casanova <jcasanov@systemguards.com.ec> writes:
>> i haven't made any performance tests but it should gain something :),
>> maybe someone can make those tests?
>
> The argument for changing this at all is only that it will improve
> performance, so I'd like some independent confirmation that it does.

I've done a little review of this myself, and I'm not quite happy with how
this patch was delivered to us.  The bar for committing something that
touches the WAL is really high--it needs to be an unquestionable win to
touch that code.  The justification of "the patch makes the overall code a
bit cleaner" is a hard sell on something that's hard to debug (triggering
bad WAL situations at will isn't easy) and critical to the system.  If
there's a clear performance improvement, that helps justify why it's worth
working on.

Here's the original performance justification:

"Using the only XLogInsert-bound test case I know of, parallel COPY into a
skinny, unindexed table, using 8 parallel copies on a 4 x dual-core x86_64
and with fsync turned off (to approximately simulate SSD, which I do not
have), I get a speed improvement of 2-4% with the patch over unpatched
head."

That makes sense, and using this spec I could probably come up with the
test program to reproduce this.  But I'm getting tired of doing that.  It's
hard enough to reproduce performance changes when someone gives the exact
configuration and test program they used.  If we're working with a verbal
spec for how to reproduce the issues, forget about it--that's more than we
can expect a reviewer to handle, and the odds of that whole thing ending
well are low.

Jeff: before we do anything else with your patch, I'd like to see a script
of some sort that runs the test you describe above, everything changed in
the postgresql.conf from the defaults, and the resulting raw numbers that
come from it on your system that prove an improvement there--not just a
summary of the change.  That's really mandatory for a performance patch.
If any reviewer who's interested can't just run something and get a report
suggesting whether the patch helped or harmed results in five minutes,
unless we really, really want your patch it's just going to stall at that
point.  And unfortunately, in the case of something that touches the WAL
path, we really don't want to change anything unless there's a quite good
reason to do so.

I've also realized that "Patch LWlocks instrumentation" at
http://archives.postgresql.org/message-id/op.uz8sfkxycke6l8@soyouz should
have been evaluated as its own patch altogether.  I think that the test
program you're suggesting also proves its utility though, so for now I'll
keep them roped together.

Sorry this ended up so late in this CommitFest; just a series of unexpected
stuff rippled down to you.  On the bright side, had you submitted this
before the whole organized CF process started, you could have waited months
longer to get the same feedback.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.com
On Wed, Dec 9, 2009 at 9:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> so I'd like some independent confirmation that it does.
>

what kind of tests could show that? or simply running pgbench several
times for 15 minutes each run could show any benefit?

--
Regards,
Jaime Casanova
PostgreSQL support and training
Systems consulting and development
Guayaquil - Ecuador
Cel. +59387171157
Jaime Casanova wrote:
> On Wed, Dec 9, 2009 at 9:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > so I'd like some independent confirmation that it does.
>
> what kind of tests could show that? or simply running pgbench several
> times for 15 minutes each run could show any benefit?

Isn't the supposed performance improvement in this patch linked strongly
to backup blocks?  If that's really the case, I think it would show more
prominently if you had very frequent checkpoints.

--
Alvaro Herrera                               http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Thu, Dec 10, 2009 at 3:58 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Jaime Casanova wrote:
>> On Wed, Dec 9, 2009 at 9:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > so I'd like some independent confirmation that it does.
>>
>> what kind of tests could show that? or simply running pgbench several
>> times for 15 minutes each run could show any benefit?
>
> Isn't the supposed performance improvement in this patch linked strongly
> to backup blocks?  If that's really the case, I think it would show more
> prominently if you had very frequent checkpoints.
>

Ah! Ok, i was only following the logic that it was eliminating the need to
execute a loop twice... But you are right: while the loop always executes,
it only does something meaningful after a checkpoint, and the for statement
only makes 3 loops each time, because XLR_MAX_BKP_BLOCKS is defined as 3 in
src/include/access/xlog.h.  Looked at that way, it seems like the benefit
could be only marginal.

To prove that, i compiled with and without the patch, and changed
checkpoint_segments = 1 and checkpoint_timeout = 1min to force frequent
checkpoints (actually they occurred a few seconds apart).

i initialized the pgbench database in each installation with:

pgbench -i -s 200 -F 90 test

and executed 6 times with:

pgbench -n -c 50 -j 5 -l -T 900 test

Results are:

Min (tps)
  Unpatched - including connections establishing  133.046069
              excluding it                        133.085274
  Patched   - including connections establishing  139.567284
              excluding it                        139.591229

Max (tps)
  Unpatched - including connections establishing  147.082181
              excluding it                        147.108752
  Patched   - including connections establishing  151.276416
              excluding it                        151.311745

Avg (tps)
  Unpatched - including connections establishing  140.750998
              excluding it                        140.790336
  Patched   - including connections establishing  146.383735
              excluding it                        146.411039

So in this extreme case avg tps is just 6 transactions better.

PS: i'm attaching the files i used for the tests

--
Regards,
Jaime Casanova
PostgreSQL support and training
Systems consulting and development
Guayaquil - Ecuador
Cel. +59387171157
Jaime Casanova wrote:
> So in this extreme case avg tps is just 6 transactions better

Great job trying to find the spot where the code worked better.  I'm not so
sure I trust pgbench results where the TPS was so low though.  Which leads
us right back to exactly how Jeff measured his original results.

As I said already, I think we need more insight into Jeff's performance
report, a way to replicate that test, to look a bit at the latency as
reported by the updated LWLock patch that Pierre submitted.  Tweaking your
test to give more useful results is a nice second opinion on top of that.
But we're out of time for now, so this patch is getting returned with
feedback.  I encourage Jeff to resubmit the same patch or a better one with
a little more data on performance measurements to our final 8.5 CommitFest,
in hopes we can confirm this is an improvement worth committing.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.com
2009/12/15 Greg Smith <greg@2ndquadrant.com>
Last week I worked on a FUSE-based filesystem, which I call BlackholeFS. It's similar to /dev/null, but for directories. Basically it simply returns success for all the writes, but doesn't do any writes on the files under it.
Would moving the pg_xlog/ (and possibly table data too) to such a filesystem exercise this patch better?
Best regards,
Lets call it Postgres
EnterpriseDB http://www.enterprisedb.com
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
Twitter: singh_gurjeet
Skype: singh_gurjeet
Mail sent from my BlackLaptop device
On Wednesday 16 December 2009 20:07:07 Gurjeet Singh wrote:
> 2009/12/15 Greg Smith <greg@2ndquadrant.com>
> > Jaime Casanova wrote:
> >> So in this extreme case avg tps is just 6 transactions better
> >
> > Great job trying to find the spot where the code worked better.  I'm
> > not so sure I trust pgbench results where the TPS was so low though.
> > Which leads us right back to exactly how Jeff measured his original
> > results.
> >
> > As I said already, I think we need more insight into Jeff's performance
> > report, a way to replicate that test, to look a bit at the latency as
> > reported by the updated LWLock patch that Pierre submitted.  Tweaking
> > your test to give more useful results is a nice second opinion on top
> > of that.  But we're out of time for now, so this patch is getting
> > returned with feedback.  I encourage Jeff to resubmit the same patch or
> > a better one with a little more data on performance measurements to our
> > final 8.5 CommitFest in hopes we can confirm this is an improvement
> > worth committing.
>
> Last week I worked on a FUSE-based filesystem, which I call BlackholeFS.
> It's similar to /dev/null, but for directories.  Basically it simply
> returns success for all the writes, but doesn't do any writes on the
> files under it.

I doubt that it will be faster than a tmpfs - the additional context
switches et al probably will hurt already.  If you constrain
checkpoint_segments to something sensible, it shouldn't use too much
memory.

Andres