Thread: [PATCH] 2PC state files on shared memory

[PATCH] 2PC state files on shared memory

From

Michael Paquier

Date:

07 August 2009, 00:31:25

Hi all,

Based on an idea of Heikki Linnakangas, here is a patch in order to improve 2PC
by sending the state files of prepared transactions to shared memory instead of disk.
The XLOG flush cannot be avoided, but reducing the amount of data sent to disk speeds up the 2PC process.

During a checkpoint, only the state files of transactions that are prepared but not yet committed are flushed from shared memory to disk.
The shared memory reserved for state files is sized by an additional parameter, max_state_file_space, in postgresql.conf.
Of course, if there are too many transactions and not enough space in shared memory, state files are written to disk as before.

By default, the allocated space is 0, since max_prepared_transactions defaults to 0 in 8.4.
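The fallback behaviour described above can be sketched as a toy model. This is an illustration only, not code from the patch: the class `StateFileArena`, its methods, and the choice to keep a flushed entry in memory after a checkpoint are all assumptions made for the example. Only the idea (store in a fixed-size area if the state file fits, otherwise write to disk as before) comes from the message.

```python
class StateFileArena:
    """Toy model of a fixed-size area holding 2PC state files (hypothetical)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes   # plays the role of max_state_file_space
        self.used = 0
        self.in_memory = {}              # gid -> state file bytes held in the arena
        self.on_disk = {}                # gid -> state file bytes (stand-in for files)

    def store(self, gid, state):
        """PREPARE: keep the state file in the arena if it fits, else use disk."""
        if self.used + len(state) <= self.capacity:
            self.in_memory[gid] = state
            self.used += len(state)
            return "shmem"
        self.on_disk[gid] = state
        return "disk"

    def checkpoint(self):
        """Copy state files of still-prepared transactions to disk."""
        flushed = []
        for gid, state in self.in_memory.items():
            if gid not in self.on_disk:
                self.on_disk[gid] = state
                flushed.append(gid)
        return flushed

    def finish(self, gid):
        """COMMIT PREPARED or ROLLBACK PREPARED: drop the state file everywhere."""
        state = self.in_memory.pop(gid, None)
        if state is not None:
            self.used -= len(state)
        self.on_disk.pop(gid, None)
```

With a capacity of 0, as in the default configuration mentioned above, every call to `store` in this model falls through to the disk path.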

For more results, please refer to the wiki page I wrote about this 2PC improvement.
http://wiki.postgresql.org/wiki/2PC_improvement:_state_files_in_shared_memory
This page explains the simulation method used to analyze the patch and gathers the main results.

Here are some of the performance results obtained by testing the code on a disk array with a battery-backed cache and 8 disks in a RAID0 configuration.
The four tables below differ in the pgbench scale factor (1 or 100) and in whether the results are normalized.
Normalized results have no unit; pure results are in TX/s.
Tests were run via pgbench using transactions whose state file sizes are 600B and 712B.
As the tables show, the patch improves the transaction throughput by up to 15-18%, which is not negligible.

1) Case scale factor 1, normalized results

State File Size (B)		600			712
Use of 2PC / pgbench conf		State file on Shmem	State file on Disk	No 2PC	State file on Shmem	State file on Disk	No 2PC
Conn	Trans	Tps1-2	Tps2-2	Tps3-2	Tps1-2	Tps2-2	Tps3-2
2	10000	0.078663793	0	1	0.079653	0	1
5	10000	0.105263158	0	1	0.08438061	0	1
10	10000	0.096105528	0	1	0.07166124	0	1
25	10000	0.106321839	0	1	0.12846154	0	1
35	10000	0.138996139	0	1	0.12106136	0	1
50	10000	0.130278527	0	1	0.14072693	0	1
60	10000	0.133937563	0	1	0.1517094	0	1
70	10000	0.17218543	0	1	0.14913295	0	1
80	10000	0.1775	0	1	0.17786561	0	1
90	10000	0.179806362	0	1	0.15232722	0	1
100	10000	0.182242991	0	1	0.15264798	0	1

2) Case scale factor 1, pure TX/s results

State File Size (B)		600			712
Use of 2PC / pgbench conf		State file on Shmem	State file on Disk	No 2PC	State file on Shmem	State file on Disk	No 2PC
Conn	Trans	Tps1-2	Tps2-2	Tps3-2	Tps1-2	Tps2-2	Tps3-2
2	10000	1163	1017	2873	1134	1033	2301
5	10000	1263	1077	2844	1213	1072	2743
10	10000	1265	1112	2704	1175	1065	2600
25	10000	1233	1085	2477	1205	1038	2338
35	10000	1220	1040	2335	1169	1023	2229
50	10000	1190	1045	2158	1143	992	2065
60	10000	1151	1018	2011	1111	969	1905
70	10000	1127	971	1877	1067	938	1803
80	10000	1091	949	1749	1021	886	1645
90	10000	1050	920	1643	939	831	1540
100	10000	1012	895	1537	889	791	1433

3) Case scale factor 100, normalized results

State File Size (B)		600			712
Use of 2PC / pgbench conf		State file on Shmem	State file on Disk	No 2PC	State file on Shmem	State file on Disk	No 2PC
Conn	Trans	Tps1-2	Tps2-2	Tps3-2	Tps1-2	Tps2-2	Tps3-2
2	10000	0.031791908	0	1	0.00426621	0	1
5	10000	0.018481848	0	1	0.03858731	0	1
10	10000	0.049115914	0	1	0.07661017	0	1
25	10000	0.06954612	0	1	0.06117247	0	1
35	10000	0.077677841	0	1	0.05846422	0	1
50	10000	0.059885932	0	1	0.08961303	0	1
60	10000	0.071888412	0	1	0.06997743	0	1
70	10000	0.094007051	0	1	0.03571429	0	1
80	10000	0.078838174	0	1	0.05635838	0	1

4) Case scale factor 100, pure TX/s results

State File Size (B)		600			712
Use of 2PC / pgbench conf		State file on Shmem	State file on Disk	No 2PC	State file on Shmem	State file on Disk	No 2PC
Conn	Trans	Tps1-2	Tps2-2	Tps3-2	Tps1-2	Tps2-2	Tps3-2
2	10000	1113	1058	2788	1147	1142	2314
5	10000	1240	1212	2727	1184	1125	2654
10	10000	1225	1150	2677	1203	1090	2565
25	10000	1218	1123	2489	1176	1104	2281
35	10000	1210	1115	2338	1151	1084	2230
50	10000	1153	1090	2142	1127	1039	2021
60	10000	1126	1059	1991	1083	1021	1907
70	10000	1087	1007	1858	1014	986	1770
80	10000	1046	989	1712	983	944	1636
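
The message does not state how the normalized columns are computed. The figures in tables 1 and 3 appear consistent with rescaling each row so that "state file on disk" maps to 0 and "no 2PC" maps to 1. A minimal sketch under that assumption (the function names are made up for this example):

```python
def normalize(tps_shmem, tps_disk, tps_no_2pc):
    """Fraction of the gap between on-disk 2PC and no 2PC recovered by shmem.

    Assumed formula, inferred from the tables: 0 means no better than
    state files on disk, 1 means as fast as running without 2PC.
    """
    return (tps_shmem - tps_disk) / (tps_no_2pc - tps_disk)


def relative_gain(tps_shmem, tps_disk):
    """Plain throughput improvement of shmem over disk, as a fraction."""
    return tps_shmem / tps_disk - 1.0


# Scale factor 1, 600B state file, 2 connections (first row of table 2).
first_row = normalize(1163, 1017, 2873)

# Scale factor 1, 600B state file, 5 connections (second row of table 2).
gain = relative_gain(1263, 1077)
```

Note that the two measures differ: the normalized value is a share of the gap to the no-2PC case, while `relative_gain` is the direct TX/s improvement over the on-disk case.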

Regards,

--
Michael Paquier

NTT OSSC

Attachment

postgresql-8.4.0-2PCshmem.patch

Re: [PATCH] 2PC state files on shared memory

From

Tom Lane

Date:

07 August 2009, 21:13:23

Michael Paquier <michael.paquier@gmail.com> writes:
> Based on an idea of Heikki Linnakangas, here is a patch in order to improve
> 2PC
> by sending the state files of prepared transactions to shared memory instead
> of disk.

I don't understand how this can possibly work.  The entire point of
2PC is that the state file is guaranteed to be on disk so it will
survive a crash.  What good is it if it's in shared memory?

Quite aside from that, the fixed size of shared memory makes this seem
pretty impractical.
        regards, tom lane

Re: [PATCH] 2PC state files on shared memory

From

Heikki Linnakangas

Date:

08 August 2009, 10:32:03

Tom Lane wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> Based on an idea of Heikki Linnakangas, here is a patch in order to improve
>> 2PC
>> by sending the state files of prepared transactions to shared memory instead
>> of disk.
> 
> I don't understand how this can possibly work.  The entire point of
> 2PC is that the state file is guaranteed to be on disk so it will
> survive a crash.  What good is it if it's in shared memory?

The state files are not fsync'd when they're written, but a copy is
written to WAL so that it can be replayed on crash. With this patch,
it's still written to WAL, but the write to a file on disk is skipped,
and it's stored in shared memory instead.

> Quite aside from that, the fixed size of shared memory makes this seem
> pretty impractical.

Most state files are small. If one doesn't fit in the area reserved for
this, it's written to disk as usual. It's just an optimization.

I'm a bit disappointed by the performance gains. I would've expected
more, given a decent battery-backed-up cache to buffer the WAL fsyncs.
But it looks like they're still causing the most overhead, even with a
battery-backed-up cache.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com
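
Heikki's point about durability can be illustrated with a toy model: the prepared state lives in volatile memory during normal operation, and after a crash it is rebuilt purely from the log. This is a simplified sketch, not PostgreSQL's recovery code; `ToyServer` and its record format are hypothetical, and appending to a Python list stands in for a WAL insert followed by a flush.

```python
class ToyServer:
    """Toy 2PC participant: durable log plus volatile in-memory state."""

    def __init__(self, wal=None):
        # The WAL survives a crash; pass the old one in to simulate restart.
        self.wal = wal if wal is not None else []
        # Volatile memory: always rebuilt from the WAL at startup.
        self.shmem = {}
        self._replay()

    def _replay(self):
        """Crash recovery: re-create prepared transactions from the WAL."""
        for kind, gid, state in self.wal:
            if kind == "PREPARE":
                self.shmem[gid] = state
            else:  # "COMMIT PREPARED"
                self.shmem.pop(gid, None)

    def prepare(self, gid, state):
        # The state goes to the WAL (durable) and to memory (fast access).
        # No separate state file is written here.
        self.wal.append(("PREPARE", gid, state))
        self.shmem[gid] = state

    def commit_prepared(self, gid):
        state = self.shmem.pop(gid)
        self.wal.append(("COMMIT PREPARED", gid, None))
        return state


server = ToyServer()
server.prepare("tx1", b"locks held by tx1")
server.prepare("tx2", b"locks held by tx2")
server.commit_prepared("tx1")

# Simulate a crash: memory is lost, only the WAL is handed to the new server.
restarted = ToyServer(wal=list(server.wal))
```

After the simulated restart, only `tx2` is present again as a prepared transaction, because its PREPARE record has no matching commit record in the log.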

Re: [PATCH] 2PC state files on shared memory

From

Robert Haas

Date:

08 August 2009, 10:44:48

On Sat, Aug 8, 2009 at 9:31 AM, Heikki
Linnakangas<heikki.linnakangas@enterprisedb.com> wrote:
> Tom Lane wrote:
>> Michael Paquier <michael.paquier@gmail.com> writes:
>>> Based on an idea of Heikki Linnakangas, here is a patch in order to improve
>>> 2PC
>>> by sending the state files of prepared transactions to shared memory instead
>>> of disk.
>>
>> I don't understand how this can possibly work.  The entire point of
>> 2PC is that the state file is guaranteed to be on disk so it will
>> survive a crash.  What good is it if it's in shared memory?
>
> The state files are not fsync'd when they're written, but a copy is
> written to WAL so that it can be replayed on crash. With this patch,
> it's still written to WAL, but the write to a file on disk is skipped,
> and it's stored in shared memory instead.
>
>> Quite aside from that, the fixed size of shared memory makes this seem
>> pretty impractical.
>
> Most state files are small. If one doesn't fit in the area reserved for
> this, it's written to disk as usual. It's just an optimization.
>
> I'm a bit disappointed by the performance gains. I would've expected
> more, given a decent battery-backed-up cache to buffer the WAL fsyncs.
> But it looks like they're still causing the most overhead, even with a
> battery-backed-up cache.

It doesn't seem that surprising to me that a write to shared memory
and a write to an un-fsync'd file would be about the same speed.  The
file write will eventually generate some I/O when it goes to disk, but
at the time you make the system call it's basically just a memory
copy.

...Robert
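
Robert's observation can be checked informally with a small timing script. This is a rough sketch, not a benchmark of PostgreSQL: the function name, the write count, and the 712-byte payload (borrowed from the state file size quoted earlier in the thread) are choices made for this example, and results depend heavily on the operating system and storage.

```python
import os
import tempfile
import time


def time_writes(n_writes=200, size=712, do_fsync=False):
    """Time n_writes small writes to a temp file, optionally fsync'ing each."""
    payload = b"x" * size
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        for _ in range(n_writes):
            os.write(fd, payload)      # usually just a copy into the OS cache
            if do_fsync:
                os.fsync(fd)           # forces the data out to stable storage
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print("buffered writes:", time_writes())
    print("fsync'd writes: ", time_writes(do_fsync=True))
```

On many systems the buffered case is expected to finish much faster than the fsync'd case, which would match the argument that skipping an un-fsync'd file write saves little compared with the WAL flush.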

Re: [PATCH] 2PC state files on shared memory

From

Tom Lane

Date:

08 August 2009, 12:29:21

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Quite aside from that, the fixed size of shared memory makes this seem
>> pretty impractical.

> Most state files are small. If one doesn't fit in the area reserved for
> this, it's written to disk as usual. It's just an optimization.

What evidence do you have for that assumption?  And what's "small" anyway?
I think setting the size parameter for this would be a frightfully
difficult problem; the fact that average installations wouldn't use it
doesn't make that any better for those who would.  After our bad
experiences with fixed-size FSM, I'm pretty wary of introducing new
fixed-size structures that the user is expected to figure out how to
size.

> I'm a bit disappointed by the performance gains. I would've expected
> more, given a decent battery-backed-up cache to buffer the WAL fsyncs.
> But it looks like they're still causing the most overhead, even with a
> battery-backed-up cache.

If you can't demonstrate order-of-magnitude speedups, I think we
shouldn't touch this.
        regards, tom lane

Re: [PATCH] 2PC state files on shared memory

From

Tom Lane

Date:

08 August 2009, 12:43:22

Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Aug 8, 2009 at 9:31 AM, Heikki
> Linnakangas<heikki.linnakangas@enterprisedb.com> wrote:
>> I'm a bit disappointed by the performance gains. I would've expected
>> more, given a decent battery-backed-up cache to buffer the WAL fsyncs.

> It doesn't seem that surprising to me that a write to shared memory
> and a write to an un-fsync'd file would be about the same speed.

I just had a second thought about this.  The idea is to avoid writing
the separate 2PC state file until/unless it has to be checkpointed.
(And, per the comments for CheckPointTwoPhase, that is an uncommon
case --- especially now with our time-extended checkpoints.)

What if PREPARE simply didn't write the 2PC file at all, except into WAL?
Then, make CheckPointTwoPhase write the 2PC file for any still-live
GXACT, by means of reaching into the WAL and pulling the data out.
All it would need for that is the LSN of the WAL record, which I think
the GXACT has already.  (It might have the end location rather than
the start, but in any case we could store both.)  Similarly, COMMIT
PREPARED could be taught to pull the data from WAL instead of a 2PC
file, in the typical case where the file didn't exist yet.  I think
there might be some synchronization issues against checkpoints --- you
couldn't recycle WAL until you were sure there was no COMMIT PREPARED
pulling from it.  But it seems possibly workable, and there's no tuning
knob needed.
        regards, tom lane
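
Tom's proposal can be sketched as a toy model in which a byte buffer plays the role of the WAL and byte offsets play the role of LSNs. Everything here is hypothetical (`Toy2PC`, `Gxact`, the in-memory `state_file` field standing in for a file on disk), and the sketch deliberately ignores the WAL recycling and synchronization issues he raises.

```python
class Gxact:
    """Toy global transaction entry remembering where its WAL record lives."""

    def __init__(self, gid, start_lsn, end_lsn):
        self.gid = gid
        self.start_lsn = start_lsn
        self.end_lsn = end_lsn
        self.state_file = None   # set only if a checkpoint had to write it


class Toy2PC:
    def __init__(self):
        self.wal = bytearray()   # stand-in for the write-ahead log
        self.gxacts = {}         # gid -> Gxact for still-prepared transactions

    def prepare(self, gid, state):
        # PREPARE writes the state into WAL only and remembers both LSNs.
        start = len(self.wal)
        self.wal += state
        self.gxacts[gid] = Gxact(gid, start, len(self.wal))

    def _read_wal(self, gxact):
        return bytes(self.wal[gxact.start_lsn:gxact.end_lsn])

    def checkpoint(self):
        """Write state files only for transactions that are still prepared."""
        written = []
        for gxact in self.gxacts.values():
            if gxact.state_file is None:
                gxact.state_file = self._read_wal(gxact)
                written.append(gxact.gid)
        return written

    def commit_prepared(self, gid):
        """Use the state file if a checkpoint wrote one, else pull from WAL."""
        gxact = self.gxacts.pop(gid)
        if gxact.state_file is not None:
            return gxact.state_file
        return self._read_wal(gxact)
```

In this model a transaction that is prepared and committed between two checkpoints never gets a state file at all, and no size parameter has to be chosen by the user.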

Re: [PATCH] 2PC state files on shared memory

From

Heikki Linnakangas

Date:

08 August 2009, 16:03:05

Tom Lane wrote:
> What if PREPARE simply didn't write the 2PC file at all, except into WAL?
> Then, make CheckPointTwoPhase write the 2PC file for any still-live
> GXACT, by means of reaching into the WAL and pulling the data out.
> All it would need for that is the LSN of the WAL record, which I think
> the GXACT has already.  (It might have the end location rather than
> the start, but in any case we could store both.)  Similarly, COMMIT
> PREPARED could be taught to pull the data from WAL instead of a 2PC
> file, in the typical case where the file didn't exist yet.  I think
> there might be some synchronization issues against checkpoints --- you
> couldn't recycle WAL until you were sure there was no COMMIT PREPARED
> pulling from it.  But it seems possibly workable, and there's no tuning
> knob needed.

Interesting idea, might be worth performance testing. Peeking into the
WAL files during normal operation feels naughty, but it should work.
However, if the bottleneck is the WAL fsyncs, I doubt it's any faster
than Michael's current patch.

Actually, it would be interesting to performance test a stripped down
broken implementation that doesn't write the state files anywhere but
WAL, PREPARE releases all locks like regular COMMIT does, and COMMIT
PREPARED just writes the commit record and fsyncs. That would give an
upper bound on how much gain any of these patches can have. If that's
not much, we can throw in the towel.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com

Re: [PATCH] 2PC state files on shared memory

From

Tom Lane

Date:

08 August 2009, 16:54:56

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> What if PREPARE simply didn't write the 2PC file at all, except into WAL?

> Interesting idea, might be worth performance testing. Peeking into the
> WAL files during normal operation feels naughty, but it should work.
> However, if the bottleneck is the WAL fsyncs, I doubt it's any faster
> than Michael's current patch.

This isn't about faster, it's about not requiring users to estimate
a suitable size for a shared-memory arena.

> Actually, it would be interesting to performance test a stripped down
> broken implementation that doesn't write the state files anywhere but
> WAL, PREPARE releases all locks like regular COMMIT does, and COMMIT
> PREPARED just writes the commit record and fsyncs. That would give an
> upper bound on how much gain any of these patches can have. If that's
> not much, we can throw in the towel.

Good idea --- although I would think that the performance of 2PC would
be pretty context-dependent anyway.  What load would you test under?
        regards, tom lane

Re: [PATCH] 2PC state files on shared memory

From

Michael Paquier

Date:

09 August 2009, 23:37:00

After running many tests, I found that the state file size is typically no more than 600B.
In some cases it reached a maximum of 712B, and I used such transactions in my tests.

> I think setting the size parameter for this would be a frightfully
> difficult problem; the fact that average installations wouldn't use it
> doesn't make that any better for those who would. After our bad
> experiences with fixed-size FSM, I'm pretty wary of introducing new
> fixed-size structures that the user is expected to figure out how to
> size.

The patch is designed so that if a state file is larger than the space configured by the user,
it is written to disk instead of shared memory. So it poses no danger to the stability of the system.

The case of too many prepared transactions is also covered by max_prepared_transactions.

Regards,

--
Michael Paquier

NTT OSSC

Re: [PATCH] 2PC state files on shared memory

From

Tom Lane

Date:

10 August 2009, 03:45:32

Michael Paquier <michael.paquier@gmail.com> writes:
> After making a lot of tests, state file size is not more than 600B.
> In some cases, it reached a maximum of size of 712B and I used such
> transactions in my tests.

I can only say that that demonstrates you didn't test very many cases.
It is trivial to generate enormous state files --- try something with
a lot of subtransactions, for example, or a lot of files created or
deleted.  I remain of the opinion that asking users to estimate the
amount of shared memory needed for this patch will cripple its
usability.  We learned that lesson the hard way for FSM, I see no
reason we have to fail to learn from experience.
        regards, tom lane
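
To make the sizing concern concrete, here is a back-of-the-envelope estimate of how a state file might grow. The per-item sizes are assumptions for illustration (4 bytes per subtransaction XID, 12 bytes per created or deleted relation file entry), the 600-byte base is simply the figure Michael measured with pgbench, and the function name is hypothetical. Real state files also carry other records, so treat this as a lower-bound sketch rather than the actual file layout.

```python
def estimate_state_file_bytes(n_subxacts=0, n_rels=0,
                              base_bytes=600, xid_bytes=4, rel_bytes=12):
    """Rough state file size under the assumptions stated above."""
    return base_bytes + n_subxacts * xid_bytes + n_rels * rel_bytes


# A pgbench-like transaction with no subtransactions or file changes.
small = estimate_state_file_bytes()

# A transaction with 100,000 subtransactions and 1,000 relation files.
large = estimate_state_file_bytes(n_subxacts=100_000, n_rels=1_000)
```

Under these assumptions the second case is several hundred times larger than the first, which shows why a single fixed size is hard to choose in advance.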