Thread: Re: WIP patch for latestCompletedXid method of computing snapshot xmax

Re: WIP patch for latestCompletedXid method of computing snapshot xmax

From: Tom Lane
I wrote:
> This patch implements Florian's idea about how to manage snapshot xmax
> without the ugly and performance-losing tactic of taking XidGenLock and
> ProcArrayLock at the same time.  I had to do a couple of slightly klugy
> things to get bootstrap and prepared transactions to work, but on the
> whole it seems at least as clean as the code we have now.  Comments?

I spent a fair amount of time this afternoon trying to measure the
performance impact of this patch using pgbench, without a lot of success
--- as far as average transaction rates go, it's a wash compared to CVS
HEAD.  However, it eventually struck me to look at the distribution of
transaction times, using pgbench's -l logging output, and on that basis
it is clear that getting rid of the XidGenLock interaction has a good
deal of use in eliminating outlier times.  My desktop machine has a
single consumer-grade IDE drive, and even with fsync off and
synchronous_commit off, it can barely make 190 tps sustained pgbench
throughput --- it's just disk write bound all the time.  On a run with 8
clients, 10000 transactions per client, DB scale factor 25, I get this
distribution of transaction times from CVS HEAD:

postgres=# select usec/1000000, count(*) from plhead group by 1 order by 1;
 ?column? | count 
----------+-------
        0 | 79306
        1 |   290
        2 |   116
        3 |    65
        4 |    82
        5 |    30
        6 |    31
        7 |    32
       10 |     8
       11 |     8
       13 |    16
       14 |     5
       15 |     3
       20 |     8
(14 rows)

and this from HEAD plus the patch:

postgres=# select usec/1000000, count(*) from plpatch group by 1 order by 1;
 ?column? | count 
----------+-------
        0 | 79305
        1 |   325
        2 |    85
        3 |    49
        4 |    68
        5 |    50
        6 |    45
        7 |    35
        8 |    14
        9 |     8
       10 |     6
       11 |    10
(12 rows)

The worst-case transaction time has dropped by nearly a factor of 2:

postgres=# select * from plhead order by usec desc limit 20;
 client | trans |   usec   | f |   epoch    |  lsb   
--------+-------+----------+---+------------+--------
      2 |  6379 | 20621910 | 0 | 1189280557 | 664207
      6 |  5992 | 20621175 | 0 | 1189280557 | 665970
      7 |  5795 | 20621024 | 0 | 1189280557 | 666353
      1 |  6327 | 20620833 | 0 | 1189280557 | 663606
      3 |  6463 | 20620277 | 0 | 1189280557 | 663895
      4 |  6383 | 20620260 | 0 | 1189280557 | 664000
      5 |  6209 | 20620077 | 0 | 1189280557 | 665060
      0 |  6269 | 20619875 | 0 | 1189280557 | 664935
      6 |  8182 | 15191784 | 0 | 1189280655 |  87859
      3 |  8810 | 15191637 | 0 | 1189280655 |  86802
      2 |  8700 | 15185120 | 0 | 1189280655 |  86742
      5 |  8479 | 14078513 | 0 | 1189280653 | 978339
      1 |  8618 | 14077106 | 0 | 1189280653 | 978216
      7 |  7930 | 14076905 | 0 | 1189280653 | 978832
      4 |  8704 | 14076429 | 0 | 1189280653 | 977877
      0 |  8557 | 14076249 | 0 | 1189280653 | 977477
      0 |  6717 | 13932179 | 0 | 1189280576 |  65288
      1 |  6775 | 13931973 | 0 | 1189280576 |  65387
      6 |  6410 | 13931493 | 0 | 1189280576 |  67192
      7 |  6201 | 13931140 | 0 | 1189280576 |  69247
(20 rows)

postgres=# select * from plpatch order by usec desc limit 20;
 client | trans |   usec   | f |   epoch    |  lsb   
--------+-------+----------+---+------------+--------
      6 |  6008 | 11833702 | 0 | 1189281093 | 646851
      0 |  6140 | 11833041 | 0 | 1189281093 | 645738
      2 |  6289 | 11809343 | 0 | 1189281093 | 616734
      4 |  6315 | 11808044 | 0 | 1189281093 | 617505
      3 |  6344 | 11807762 | 0 | 1189281093 | 616970
      7 |  5802 | 11807641 | 0 | 1189281093 | 617932
      5 |  6183 | 11806964 | 0 | 1189281093 | 618060
      1 |  6163 | 11805494 | 0 | 1189281093 | 616679
      7 |  8175 | 11027973 | 0 | 1189281239 | 675499
      2 |  8725 | 11019066 | 0 | 1189281239 | 674305
      5 |  8828 | 10997331 | 0 | 1189281239 | 674953
      0 |  8541 | 10987629 | 0 | 1189281239 | 673773
      1 |  8799 | 10713734 | 0 | 1189281239 | 673217
      4 |  8897 | 10705975 | 0 | 1189281239 | 672743
      6 |  8364 | 10702875 | 0 | 1189281239 | 677163
      3 |  8814 | 10701467 | 0 | 1189281239 | 674369
      3 |  9223 |  9158554 | 0 | 1189281258 | 202934
      5 |  9234 |  9143744 | 0 | 1189281258 | 203073
      7 |  8493 |  9099174 | 0 | 1189281258 | 204092
      4 |  9306 |  9097074 | 0 | 1189281258 | 202402
(20 rows)
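
(For anyone wanting to reproduce this: plhead and plpatch are just the raw
pgbench -l log lines loaded with COPY.  Each -l line carries client id,
transaction counter, elapsed microseconds, file number, epoch seconds, and
epoch microseconds, which map directly onto the columns above.  A minimal
load sketch --- the log file path here is hypothetical:

    -- one table per run; pgbench's per-transaction log is space-separated
    CREATE TABLE plhead (client int, trans int, usec bigint, f int,
                         epoch bigint, lsb int);
    COPY plhead FROM '/tmp/pgbench_log.12345' DELIMITER ' ';

plpatch is loaded the same way from the patched run's log.)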


So on the strength of that, I'm going to go ahead and commit the patch,
but I'd be interested to see benchmarks from people with access to
better hardware.
        regards, tom lane


Re: WIP patch for latestCompletedXid method of computing snapshot xmax

From: Josh Berkus
Tom,

> So on the strength of that, I'm going to go ahead and commit the patch,
> but I'd be interested to see benchmarks from people with access to
> better hardware.

Is the latest version of the patch from -patches the one I should test?

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: WIP patch for latestCompletedXid method of computing snapshot xmax

From: Markus Schiltknecht
Hello Tom,

Tom Lane wrote:
> So on the strength of that, I'm going to go ahead and commit the patch,
> but I'd be interested to see benchmarks from people with access to
> better hardware.

I've just completed two dbt2 test runs on a mid-level system, with 4GB 
RAM and a 7 disk SATA RAID 1+0 w/ BBU. Once with code as of 2007/09/05 
18:00 (which is *before* the first lazy xid commit) and once with cvs 
HEAD (2007/09/08) + latestCompletedXid.patch.

Here are the results from the first test run (test run 33, without lazy 
xid):

> $ cat 33/driver/results.out 
>                          Response Time (s)
>  Transaction      %    Average :    90th %        Total        Rollbacks      %
> ------------  -----  ---------------------  -----------  ---------------  -----
>     Delivery   3.97      3.745 :     7.844        11844                0   0.00
>    New Order  45.35      3.844 :     7.692       135192             1352   1.01
> Order Status   3.95      2.728 :     6.371        11764                0   0.00
>      Payment  42.74      2.649 :     6.349       127415                0   0.00
>  Stock Level   4.00      2.172 :     5.634        11915                0   0.00
> ------------  -----  ---------------------  -----------  ---------------  -----
> 
> 1103.45 new-order transactions per minute (NOTPM)
> 120.1 minute duration
> 0 total unknown errors
> 1003 second(s) ramping up

And that's with HEAD + latestCompletedXid.patch (test run 34):

> $ cat 34/driver/results.out 
>                          Response Time (s)
>  Transaction      %    Average :    90th %        Total        Rollbacks      %
> ------------  -----  ---------------------  -----------  ---------------  -----
>     Delivery   3.96      3.843 :     8.223        11760                0   0.00
>    New Order  45.28      4.049 :     8.451       134398             1300   0.98
> Order Status   3.97      2.877 :     6.815        11777                0   0.00
>      Payment  42.80      2.745 :     6.718       127027                0   0.00
>  Stock Level   4.00      2.280 :     6.129        11859                0   0.00
> ------------  -----  ---------------------  -----------  ---------------  -----
> 
> 1097.71 new-order transactions per minute (NOTPM)
> 120.1 minute duration
> 0 total unknown errors
> 1003 second(s) ramping up

Both tests ran for two hours, with 100 warehouses and 50 connections. 
shared_buffers was set to 1024MB and effective_cache_size to 3800MB; all 
other settings were left at their defaults.

Regards

Markus


Re: WIP patch for latestCompletedXid method of computing snapshot xmax

From: Josh Berkus
Markus,

> I've just completed two dbt2 test runs on a mid-level system, with 4GB
> RAM and a 7 disk SATA RAID 1+0 w/ BBU. Once with code as of 2007/09/05
> 18:00 (which is *before* the first lazy xid commit) and once with cvs
> HEAD (2007/09/08  +latestCompletedXid.patch.

Hmmm.  Your results are within the margin of error for DBT2, so they show no 
real difference.  What we need for this is a heavy-read workload, though; on 
a workload like DBT2 (which is mostly writes) I wouldn't expect lazy-XID to 
help much.

I'll see if I can find something appropriate.  Maybe Jan's TPC-W 
implementation?

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: WIP patch for latestCompletedXid method of computing snapshot xmax

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> Hmmm.  Your results are withing the margin of error for DBT2, so they
> show no real difference.  What we need for this is a heavy-read
> workload, though; on a workload like DBT2 (which is mostly writes) I
> wouldn't expect lazy-XID to help much.

Lazy-XID doesn't help on a write-mostly workload, but I would have
expected to see some benefit from the latestCompletedXid patch.

The rough tests I just finished suggest that the win to look for is
improvement of the very tail of the distribution --- well beyond the
90th percentile point, which is the most we can see in Markus'
printout.  Can we get 95% and 99% response time percentiles?
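
(A query along these lines would pull those percentiles from a table of
logged response times --- a sketch, with hypothetical table and column
names; percentile_cont is a much newer aggregate than anything available
at the time, when the equivalent was an ORDER BY ... LIMIT/OFFSET trick:

    -- assumes response_times(txn_type text, response_s float8) holds one
    -- row per completed transaction from the dbt2 driver log
    SELECT txn_type,
           percentile_cont(0.95) WITHIN GROUP (ORDER BY response_s) AS p95,
           percentile_cont(0.99) WITHIN GROUP (ORDER BY response_s) AS p99
    FROM response_times
    GROUP BY txn_type;

The GROUP BY gives one row per dbt2 transaction type.)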

For comparison, in my little test I got 0.828546 vs 0.873957 as the
99th percentile pgbench transaction times.
        regards, tom lane


Re: WIP patch for latestCompletedXid method of computing snapshot xmax

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> Is the latest version of the patch from -patches the one I should test?

Yeah, the posted patch is the same as what I committed.
        regards, tom lane


Re: WIP patch for latestCompletedXid method of computing snapshot xmax

From: Gregory Stark
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> My desktop machine has a single consumer-grade IDE drive, and even with
> fsync off and synchronous_commit off, it can barely make 190 tps sustained
> pgbench throughput --- it's just disk write bound all the time. On a run
> with 8 clients, 10000 transactions per client, DB scale factor 25, I get
> this distribution of transaction times from CVS HEAD:

Wouldn't you expect to see more of an effect on cpu-bound environments? Is
there any i/o going on while these locks are being held? I suppose there's a
WAL commit record that has to be synced?

Have you tried with a smaller scale factor?

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: WIP patch for latestCompletedXid method of computing snapshot xmax

From: Gregory Stark
"Josh Berkus" <josh@agliodbs.com> writes:

> Hmmm.  Your results are withing the margin of error for DBT2, so they show no 
> real difference.  What we need for this is a heavy-read workload, though; on 
> a workload like DBT2 (which is mostly writes) I wouldn't expect lazy-XID to 
> help much.

The TPM is definitely within the margin of error, but the 90th percentile
response times seem worrisome --- they're already over 5s, which counts as
a failure under TPC-C rules.  From what we've seen, once you push the scale
factor high enough that those numbers get even close to 5s, you get very
unstable results.

The published benchmarks seem to keep the max around 5s which puts the 90th
percentile around half that. That seems overly conservative but I wonder what
reason they have for doing it that way.

> I'll see if I can find something appropriate.  Maybe Jan's TPC-W 
> implementation?

We could just rejigger the percentages on the transactions.  I think Stock
Level and Order Status are entirely read-only, but I would have to check.
Stock Level is a very intensive query, though, so I wouldn't suggest
raising the percentage on that.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com