Thread: XLogInsert
In XLogInsert (src/backend/access/transam/xlog.c), the part that adds back-up
blocks into the rdata chain is described like this:

	/*
	 * Make additional rdata chain entries for the backup blocks, so that we
	 * don't need to special-case them in the write loop.  Note that we have
	 * now irrevocably changed the input rdata chain.

If I read the code correctly, the only thing that is irrevocable is that it
writes into rdt->next, and if it saved an old copy of rdt first, then it
could revoke the changes just by doing rdt_old->next=NULL.  If that were
done, then I think this code could be moved out of the section holding the
WALInsertLock.  It could probably be folded into the part of the code that
computes the CRC.

I don't think this would be a big performance win, as that part of the code
is pretty fast, but every bit helps in a highly contended lock, and I think
the code would be simpler as well.

Do you think it would be worth me giving this a try?

Cheers,

Jeff
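To make the "save and revoke" idea concrete, here is a minimal standalone
sketch; the struct and function names are simplified stand-ins invented for
illustration, not the actual xlog.c code:

#include <stddef.h>

typedef struct RDataSketch
{
    char                *data;
    size_t               len;
    struct RDataSketch  *next;
} RDataSketch;

/*
 * Append the backup-block entries to the caller's rdata chain, but return
 * the old tail so the caller can undo the append with tail->next = NULL.
 */
RDataSketch *
append_backup_blocks(RDataSketch *rdt, RDataSketch *bkp_entries)
{
    while (rdt->next != NULL)
        rdt = rdt->next;            /* find the current tail of the chain */

    rdt->next = bkp_entries;        /* the "irrevocable" write ...        */
    return rdt;                     /* ... revoked by setting next = NULL */
}

In XLogInsert itself the equivalent would be remembering the chain tail
before the backup-block entries are linked in, so the link can be cleared if
the insert has to be retried.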
Jeff Janes <jeff.janes@gmail.com> writes:
> If I read the code correctly, the only thing that is irrevocable is
> that it writes into rdt->next, and if it saved an old copy of rdt
> first, then it could revoke the changes just by doing
> rdt_old->next=NULL.  If that were done, then I think this code could
> be moved out of the section holding the WALInsertLock.

Hmm, I recall that the changes are ... or were ... more complex.
The tricky case I think is where we have to go back and redo the
block-backup decisions after discovering that the checkpoint REDO
pointer has just moved.

If you can get the work out of the WALInsertLock section for just a
few more instructions, it would definitely be worth doing.

			regards, tom lane
On Wed, Aug 19, 2009 at 9:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> If I read the code correctly, the only thing that is irrevocable is
>> that it writes into rdt->next, and if it saved an old copy of rdt
>> first, then it could revoke the changes just by doing
>> rdt_old->next=NULL.  If that were done, then I think this code could
>> be moved out of the section holding the WALInsertLock.
>
> Hmm, I recall that the changes are ... or were ... more complex.
> The tricky case I think is where we have to go back and redo the
> block-backup decisions after discovering that the checkpoint REDO
> pointer has just moved.
>
> If you can get the work out of the WALInsertLock section for just a
> few more instructions, it would definitely be worth doing.
I've attached a patch which removes the iteration over the blocks to be backed-up from the critical section of XLogInsert. Now those blocks are only looped over in one piece of code which both computes the CRC and builds the linked list, rather than having parallel loops.
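Roughly, that single piece of code has the following shape: one pass that
folds each backup block into the record CRC and links it onto the chain at
the same time. The sketch below is illustrative only, with simplified
stand-in types and a toy checksum in place of PostgreSQL's COMP_CRC32():

#include <stddef.h>

typedef struct RData
{
    const char   *data;
    size_t        len;
    struct RData *next;
} RData;

static unsigned int
fold_crc(unsigned int crc, const char *data, size_t len)
{
    /* toy stand-in for COMP_CRC32(), just to keep the sketch self-contained */
    while (len-- > 0)
        crc = (crc << 1) ^ (unsigned char) *data++;
    return crc;
}

/* Fold the backup blocks into the CRC and chain them in the same pass. */
unsigned int
crc_and_chain(RData *tail, RData bkp[], int nblocks, unsigned int crc)
{
    int     i;

    for (i = 0; i < nblocks; i++)
    {
        crc = fold_crc(crc, bkp[i].data, bkp[i].len);
        tail->next = &bkp[i];       /* build the linked list here too */
        tail = &bkp[i];
    }
    tail->next = NULL;
    return crc;
}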
I've used an elog statement (not shown in the patch) to demonstrate that the "goto begin;" after detecting the REDO race actually does get executed under a standard workload (pgbench -c10). Two to four out of the 10 backends execute that code path for each checkpoint on my single-CPU machine. By doing a kill -9 on a process to simulate a crash, during the period after the goto begin is exercised but before the precipitating checkpoint completes, I can force it to use the written WAL records in recovery. The database automatically recovers and the results are self-consistent.
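The retry being exercised here follows this general pattern (a standalone
sketch with hypothetical helper names, not the patch itself): prepare the
record against a cached copy of the REDO pointer outside the lock, and if
the shared value has moved once the lock is held, revoke the chain change
and start over:

#include <stdint.h>

typedef struct Chain
{
    struct Chain *next;
} Chain;

/* Hypothetical stand-ins for the real locking and WAL machinery. */
extern void     lock_wal_insert(void);
extern void     unlock_wal_insert(void);
extern uint64_t shared_redo_ptr(void);        /* value protected by the lock */
extern uint64_t prepare_record(Chain *tail);  /* CRC + backup-block entries,
                                               * done without the lock; returns
                                               * the REDO pointer it assumed */

void
insert_record(Chain *tail)
{
    uint64_t    assumed_redo;
    Chain      *saved_next;

begin:
    saved_next = tail->next;             /* remember so we can revoke       */
    assumed_redo = prepare_record(tail); /* may append backup-block entries */

    lock_wal_insert();
    if (shared_redo_ptr() != assumed_redo)
    {
        tail->next = saved_next;         /* revoke the chain changes        */
        unlock_wal_insert();
        goto begin;                      /* redo the backup-block decisions */
    }
    /* ... copy the record into the WAL buffers while holding the lock ... */
    unlock_wal_insert();
}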
I cannot imagine any other races, rare events, or action at a distance that could come into play with this code change, so I cannot think of anything else to test at the moment.
I could not detect a speed difference with pgbench, but as I cannot get pgbench to be XLogInsert-bound, that is not surprising. Using the only XLogInsert-bound test case I know of (parallel COPY into a skinny, unindexed table, using 8 parallel copies on a 4 x dual-core x86_64 and with fsync turned off, to approximately simulate an SSD, which I do not have), I get a speed improvement of 2-4% with the patch over unpatched head. Maybe with more CPUs the benefit would be greater.
That small improvement is probably not very attractive, however I think the patch makes the overall code a bit cleaner, so it may be warranted on that ground. Indeed, my motivation for working on this is that I kept beating my head against the complexity of the old code, and thought that simplifying it would make future work easier.
Cheers,
Jeff
A little bit of a reply to Jeff's email about WALInsertLock.

This patch instruments LWLocks; it is controlled with the following
#define's in lwlock.c:

LWLOCK_STATS
LWLOCK_TIMING_STATS

It is an upgrade of the current lwlock stats. When active, at backend exit,
it will display stats as shown below (here, we have a parallel COPY with 4
concurrent processes into the same table, on a 4-core machine). If the
(rather wide) sample output is mangled in your mail client, I've attached
it as a separate text file.

-------- Lock stats for PID 22403
PID    Lock  ShAcq  ShWait  ShWaitT  ShHeldT  ExAcq    ExWait  ExWaitT           ExHeldT          Name
22403  7     0      0       0.00     0.00     2500002  730338  24.02 ( 53.49 %)  7.25 ( 16.14 %)  WALInsert
22403  8     0      0       0.00     0.00     19501    73       3.48 (  7.75 %)  0.40 (  0.88 %)  WALWrite
-------- Lock stats for PID 22404
PID    Lock  ShAcq  ShWait  ShWaitT  ShHeldT  ExAcq    ExWait  ExWaitT           ExHeldT          Name
22404  7     0      0       0.00     0.00     2500002  724683  23.34 ( 51.59 %)  8.24 ( 18.20 %)  WALInsert
22404  8     0      0       0.00     0.00     19418    90       4.37 (  9.67 %)  0.44 (  0.97 %)  WALWrite
-------- Lock stats for PID 22402
PID    Lock  ShAcq  ShWait  ShWaitT  ShHeldT  ExAcq    ExWait  ExWaitT           ExHeldT          Name
22402  7     0      0       0.00     0.00     2500002  735958  24.06 ( 52.73 %)  8.05 ( 17.63 %)  WALInsert
22402  8     0      0       0.00     0.00     19154    97       4.21 (  9.22 %)  0.39 (  0.85 %)  WALWrite
-------- Lock stats for PID 22400
PID    Lock  ShAcq  ShWait  ShWaitT  ShHeldT  ExAcq    ExWait  ExWaitT           ExHeldT          Name
22400  7     0      0       0.00     0.00     2500002  736265  25.50 ( 55.59 %)  6.74 ( 14.70 %)  WALInsert
22400  8     0      0       0.00     0.00     19391    66       2.95 (  6.42 %)  0.39 (  0.85 %)  WALWrite

Here we see that PID 22400 spent:

  25.50 s waiting to get exclusive on WALInsert
   6.74 s while holding exclusive on WALInsert

The percentages represent the fraction of time relative to the backend
process' lifetime. Here, I've exited the processes right after committing
the transactions, but if you use psql and want accurate %, you'll need to
exit quickly after the query to benchmark. Here, for example, backends
spend more than 50% of their time waiting on WALInsert...
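For readers unfamiliar with this kind of instrumentation, the general shape
of per-lock timing like LWLOCK_TIMING_STATS is sketched below; this is not
the patch itself, just an illustration with stand-in types, covering only
exclusive acquisitions:

#include <sys/time.h>

typedef struct LockTiming
{
    long    ex_acquires;
    double  ex_wait_secs;       /* time spent waiting to acquire */
    double  ex_held_secs;       /* time spent holding the lock   */
    double  acquired_at;        /* timestamp of the last acquire */
} LockTiming;

static double
now_secs(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

void
timed_acquire(LockTiming *t, void (*do_acquire)(void))
{
    double  start = now_secs();

    do_acquire();               /* may block on the lock */
    t->acquired_at = now_secs();
    t->ex_wait_secs += t->acquired_at - start;
    t->ex_acquires++;
}

void
timed_release(LockTiming *t, void (*do_release)(void))
{
    t->ex_held_secs += now_secs() - t->acquired_at;
    do_release();
}

At backend exit the accumulated counters can then be printed per lock, which
is where the percentages in the report above come from.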
2009/9/14 Pierre Frédéric Caillaud <lists@peufeu.com>
Hi Pierre,
Have you looked at the total execution time with and without the LWLOCK_TIMING_STATS?
I've implemented something similar to this myself (only without attempting to make it portable and otherwise worthy of submitting as a general-interest patch). What I found is that attempting to time every "hold" substantially increased the overall run time (which I worry distorts the reported times; cue bad Heisenberg analogies). The problem is that gettimeofday is slow, and on some multi-processor systems it is a global point of serialization, making it even slower. I decided to time only the time spent waiting on a block, and not the time spent holding the lock. This way you only call gettimeofday twice if you actually need to block, and not at all if you immediately get the lock. This had a much smaller effect on runtime, and the info produced was sufficient for my purposes.
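A sketch of that cheaper approach, with hypothetical stand-in helpers rather
than the real LWLock API: try a non-blocking acquire first, and only call
gettimeofday on the slow path where the backend actually has to wait:

#include <stdbool.h>
#include <sys/time.h>

extern bool try_acquire(void);      /* stand-in: non-blocking attempt */
extern void acquire(void);          /* stand-in: blocking acquire     */

static double blocked_secs;         /* per-backend accumulator        */

void
acquire_timing_waits_only(void)
{
    struct timeval t0,
                   t1;

    if (try_acquire())
        return;                     /* fast path: no clock calls at all */

    gettimeofday(&t0, NULL);
    acquire();                      /* slow path: we really blocked     */
    gettimeofday(&t1, NULL);
    blocked_secs += (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
}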
Not that this changes your conclusion: with or without that distortion, I completely believe that WALInsertLock is the bottleneck for parallel bulk copy into unindexed tables. I just can't find anything else for which it is the primary bottleneck. I think the only real solution for bulk copy is to call XLogInsert less often. For example, it could build blocks in local memory, then when done copy each one into the shared buffers and toss the entire block into WAL in one call. Easier said than implemented, of course.
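As a purely conceptual sketch of that idea (every name here is a
hypothetical stand-in, not an existing PostgreSQL function):

#define BLCKSZ 8192                 /* PostgreSQL's default block size */

extern void fill_local_page(char *page);            /* stand-in: pack many rows */
extern void copy_into_shared_buffers(const char *page);   /* stand-in           */
extern void xlog_insert_full_page(const char *page);      /* stand-in           */

void
bulk_copy_one_page(void)
{
    char    page[BLCKSZ];

    fill_local_page(page);          /* append rows locally, no WAL yet  */
    copy_into_shared_buffers(page); /* publish the finished page        */
    xlog_insert_full_page(page);    /* one XLogInsert call per page,
                                     * instead of one per row           */
}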
Cheers,
Jeff
> Have you looked at the total execution time with and without the
> LWLOCK_TIMING_STATS?

It didn't show any significant overhead on the little COPY test I made. On
selects, it probably does (just like EXPLAIN ANALYZE), but I didn't test.
It is not meant to be always active; it's a #define, so I guess it would be
OK though.

I'm going to modify it according to your suggestions and repost it (why
didn't I do that first?...)

> Not that this changes your conclusion.  With or without that distortion I
> completely believe that WALInsertLock is the bottleneck of parallel bulk
> copy into unindexed tables.  I just can't find anything else it is a
> primary bottleneck on.  I think the only real solution for bulk copy is
> to call XLogInsert less often.  For example, it could build blocks in
> local memory, then when done copy it into the shared buffers and then
> toss the entire block into WAL in one call.  Easier said than
> implemented, of course.

Actually,
http://archives.postgresql.org/pgsql-hackers/2009-09/msg00806.php
On Sun, Sep 13, 2009 at 10:42 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> I've attached a patch which removes the iteration over the blocks to be
> backed-up from the critical section of XLogInsert.  Now those blocks are
> only looped over in one piece of code which both computes the CRC and
> builds the linked list, rather than having parallel loops.
>

ok, i started to review this one... what the patch is doing seems very
obvious and doesn't break the original logic AFAIUI (i could be wrong, this
seems like chinese to me ;)

i made some tests archiving wal and recovering... crashing the server and
restarting... every test was running the regression tests, don't know what
else to do, ideas?

i haven't made any performance tests but it should gain something :), maybe
someone can make those tests?  if not, i will mark the patch as ready for
committer some time in the next hours...

--
Regards,
Jaime Casanova
PostgreSQL support and training
Systems consulting and development
Guayaquil - Ecuador
Cel. +59387171157
Jaime Casanova <jcasanov@systemguards.com.ec> writes:
> i haven't made any performance tests but it should gain something :),
> maybe someone can make those tests?

The argument for changing this at all is only that it will improve
performance, so I'd like some independent confirmation that it does.

			regards, tom lane
Tom Lane wrote:
> Jaime Casanova <jcasanov@systemguards.com.ec> writes:
>> i haven't made any performance tests but it should gain something :),
>> maybe someone can make those tests?
>
> The argument for changing this at all is only that it will improve
> performance, so I'd like some independent confirmation that it does.

I've done a little review of this myself, and I'm not quite happy with how
this patch was delivered to us.  The bar for committing something that
touches the WAL is really high--it needs to be an unquestionable win to
touch that code.  The justification of "the patch makes the overall code a
bit cleaner" is a hard sell on something that's hard to debug (triggering
bad WAL situations at will isn't easy) and critical to the system.  If
there's a clear performance improvement, that helps justify why it's worth
working on.

Here's the original performance justification:

"Using the only XLogInsert-bound test case I know of, parallel COPY into a
skinny, unindexed table, using 8 parallel copies on a 4 x dual-core x86_64
and with fsync turned off (to approximately simulate SSD, which I do not
have), I get a speed improvement of 2-4% with the patch over unpatched
head."

That makes sense, and using this spec I could probably come up with the
test program to reproduce this.  But I'm getting tired of doing that.  It's
hard enough to reproduce performance changes when someone gives the exact
configuration and test program they used.  If we're working with a verbal
spec for how to reproduce the issues, forget about it--that's more than we
can expect a reviewer to handle, and the odds of that whole thing ending
well are low.

Jeff: before we do anything else with your patch, I'd like to see a script
of some sort that runs the test you describe above, everything changed in
the postgresql.conf from the defaults, and the resulting raw numbers that
come from it on your system that prove an improvement there--not just a
summary of the change.  That's really mandatory for a performance patch.
If any reviewer who's interested can't just run something and get a report
suggesting whether the patch helped or harmed results in five minutes,
unless we really, really want your patch it's just going to stall at that
point.  And unfortunately, in the case of something that touches the WAL
path, we really don't want to change anything unless there's a quite good
reason to do so.

I've also realized that "Patch LWlocks instrumentation" at
http://archives.postgresql.org/message-id/op.uz8sfkxycke6l8@soyouz should
have been evaluated as its own patch altogether.  I think that the test
program you're suggesting also proves its utility though, so for now I'll
keep them roped together.

Sorry this ended up so late in this CommitFest; just a series of unexpected
stuff rippled down to you.  On the bright side, had you submitted this
before the whole organized CF process started, you could have waited months
longer to get the same feedback.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.com
On Wed, Dec 9, 2009 at 9:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> so I'd like some independent confirmation that it does.
>

what kind of tests could show that? or simply running pgbench several
times for 15 minutes each run could show any benefit?

--
Regards,
Jaime Casanova
PostgreSQL support and training
Systems consulting and development
Guayaquil - Ecuador
Cel. +59387171157
Jaime Casanova wrote:
> On Wed, Dec 9, 2009 at 9:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > so I'd like some independent confirmation that it does.
>
> what kind of tests could show that? or simply running pgbench several
> times for 15 minutes each run could show any benefit?

Isn't the supposed performance improvement in this patch linked strongly
to backup blocks?  If that's really the case, I think it would show more
prominently if you had very frequent checkpoints.

--
Alvaro Herrera                               http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Thu, Dec 10, 2009 at 3:58 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Jaime Casanova wrote:
>> On Wed, Dec 9, 2009 at 9:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > so I'd like some independent confirmation that it does.
>>
>> what kind of tests could show that? or simply running pgbench several
>> times for 15 minutes each run could show any benefit?
>
> Isn't the supposed performance improvement in this patch linked strongly
> to backup blocks?  If that's really the case, I think it would show more
> prominently if you had very frequent checkpoints.
>

Ah! Ok, i was only following the logic that it was eliminating the need to
execute a loop twice... But you are right: while the loop always executes,
it only does something meaningful after a checkpoint, and the for statement
only makes 3 loops each time, because XLR_MAX_BKP_BLOCKS is defined as 3 in
src/include/access/xlog.h.  Looked at that way, it seems like the benefit
could be only marginal.

To prove that, i compiled with and without the patch, and changed
checkpoint_segments = 1 and checkpoint_timeout = 1min to force frequent
checkpoints (actually they occurred a few seconds apart).

i initialized the pgbench database in each installation with:

pgbench -i -s 200 -F 90 test

and executed 6 times with:

pgbench -n -c 50 -j 5 -l -T 900 test

Results are:

Min (tps)
  Unpatched - including connections establishing  133.046069
              excluding it                        133.085274
  Patched   - including connections establishing  139.567284
              excluding it                        139.591229

Max (tps)
  Unpatched - including connections establishing  147.082181
              excluding it                        147.108752
  Patched   - including connections establishing  151.276416
              excluding it                        151.311745

Avg (tps)
  Unpatched - including connections establishing  140.750998
              excluding it                        140.790336
  Patched   - including connections establishing  146.383735
              excluding it                        146.411039

So in this extreme case avg tps is just 6 transactions better.

PS: i'm attaching the files i used for the tests

--
Regards,
Jaime Casanova
PostgreSQL support and training
Systems consulting and development
Guayaquil - Ecuador
Cel. +59387171157
Jaime Casanova wrote:
> So in this extreme case avg tps is just 6 transactions better

Great job trying to find the spot where the code worked better.  I'm not so
sure I trust pgbench results where the TPS was so low though.  Which leads
us right back to exactly how Jeff measured his original results.

As I said already, I think we need more insight into Jeff's performance
report, a way to replicate that test, to look a bit at the latency as
reported by the updated LWLock patch that Pierre submitted.  Tweaking your
test to give more useful results is a nice second opinion on top of that.
But we're out of time for now, so this patch is getting returned with
feedback.  I encourage Jeff to resubmit the same patch or a better one with
a little more data on performance measurements to our final 8.5 CommitFest,
in hopes we can confirm this is an improvement worth committing.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.com
2009/12/15 Greg Smith <greg@2ndquadrant.com>
Last week I worked on a FUSE-based filesystem, which I call BlackholeFS. It's similar to /dev/null, but for directories. Basically it simply returns success for all the writes, but doesn't do any writes on the files under it.
Would moving the pg_xlog/ (and possibly table data too) to such a filesystem exercise this patch better?
Best regards,
Lets call it Postgres
EnterpriseDB http://www.enterprisedb.com
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
Twitter: singh_gurjeet
Skype: singh_gurjeet
Mail sent from my BlackLaptop device
On Wednesday 16 December 2009 20:07:07 Gurjeet Singh wrote:
> 2009/12/15 Greg Smith <greg@2ndquadrant.com>
> > Jaime Casanova wrote:
> >> So in this extreme case avg tps is just 6 transactions better
> >
> > Great job trying to find the spot where the code worked better.  I'm
> > not so sure I trust pgbench results where the TPS was so low though.
> > Which leads us right back to exactly how Jeff measured his original
> > results.
> >
> > As I said already, I think we need more insight into Jeff's performance
> > report, a way to replicate that test, to look a bit at the latency as
> > reported by the updated LWLock patch that Pierre submitted.  Tweaking
> > your test to give more useful results is a nice second opinion on top
> > of that.  But we're out of time for now, so this patch is getting
> > returned with feedback.  I encourage Jeff to resubmit the same patch or
> > a better one with a little more data on performance measurements to our
> > final 8.5 CommitFest in hopes we can confirm this is an improvement
> > worth committing.
>
> Last week I worked on a FUSE-based filesystem, which I call BlackholeFS.
> It's similar to /dev/null, but for directories.  Basically it simply
> returns success for all the writes, but doesn't do any writes on the
> files under it.

I doubt that it will be faster than a tmpfs - the additional context
switches et al probably will hurt already.  If you constrain
checkpoint_segments to something sensible, it shouldn't use too much
memory.

Andres