Thread: Could synchronous streaming replication really degrade the performance of the primary?

Hello,

I've heard from some people that synchronous streaming replication has a
severe performance impact on the primary. They said that the transaction
throughput of a TPC-C-like benchmark (perhaps DBT-2) decreased by 50%. I'm
sorry I haven't asked them about their testing environment; they only
shared their experience. They think that this result is much worse than
with some commercial databases.

I'm surprised. I know that the amount of transaction log PostgreSQL generates
is larger than that of other databases, because it logs the entire row for
each update operation instead of just the changed columns, and because of
full-page writes. But I can't (and don't want to) believe that those have
such a big negative impact.

Does anyone have any experience of benchmarking synchronous streaming
replication under TPC-C or similar write-heavy workload? Could anybody give
me any performance evaluation result if you don't mind?

Regards
MauMau



On Wed, May 9, 2012 at 8:06 AM, MauMau <maumau307@gmail.com> wrote:
> Hello,
>
> I've heard from some people that synchronous streaming replication has a
> severe performance impact on the primary. They said that the transaction
> throughput of a TPC-C-like benchmark (perhaps DBT-2) decreased by 50%. I'm
> sorry I haven't asked them about their testing environment; they only
> shared their experience. They think that this result is much worse than
> with some commercial databases.

I can't speak for other databases, but it's only natural to assume
that tps must drop.  At minimum, you have to add the latency of
communication and remote sync operation to your transaction time.  For
very short transactions this adds up to a lot of extra work relative
to the transaction itself.
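
To put rough numbers on it (the figures below are assumptions picked only to
show the shape of the effect, not measurements):

  # Illustrative only: how much an extra synchronous-commit wait hurts
  # depends on how long the transaction itself takes.
  def single_session_tps(tx_ms, commit_wait_ms):
      # one client issuing one transaction at a time
      return 1000.0 / (tx_ms + commit_wait_ms)

  for tx_ms in (1.0, 10.0, 100.0):
      local = single_session_tps(tx_ms, 0.0)  # wait only for the local flush
      sync = single_session_tps(tx_ms, 1.0)   # assume +1 ms for round trip + remote flush
      print(tx_ms, local, sync, sync / local)

For a 1 ms transaction that same 1 ms of extra wait halves the per-session
rate; for a 100 ms transaction it is barely visible.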

merlin

On Wed, May 9, 2012 at 3:58 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, May 9, 2012 at 8:06 AM, MauMau <maumau307@gmail.com> wrote:
>> I've heard from some people that synchronous streaming replication has a
>> severe performance impact on the primary. They said that the transaction
>> throughput of a TPC-C-like benchmark (perhaps DBT-2) decreased by 50%. I'm
>> sorry I haven't asked them about their testing environment; they only
>> shared their experience. They think that this result is much worse than
>> with some commercial databases.
>
> I can't speak for other databases, but it's only natural to assume
> that tps must drop.  At minimum, you have to add the latency of
> communication and remote sync operation to your transaction time.  For
> very short transactions this adds up to a lot of extra work relative
> to the transaction itself.

Actually I would expect 50% degradation if both databases run on
identical hardware: the second instance needs to do the same work
(i.e. write WAL AND ensure it reached the disk) before it can
acknowledge.

"When requesting synchronous replication, each commit of a write
transaction will wait until confirmation is received that the commit
has been written to the transaction log on disk of both the primary
and standby server."
http://www.postgresql.org/docs/9.1/static/warm-standby.html#SYNCHRONOUS-REPLICATION

I am not sure whether the standby is only triggered to commit to disk
after the commit to disk on the master has succeeded; if that were the
case there would be true serialization => 50%.

This sounds like it could actually be the case (note the "after it commits"):
"When synchronous replication is requested the transaction will wait
after it commits until it receives confirmation that the transfer has
been successful."
http://wiki.postgresql.org/wiki/Synchronous_replication
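
A crude way to picture the difference (all timings below are assumptions,
only meant to show the shape of the argument):

  # Assumed timings, for illustration only.
  local_flush_ms = 2.0    # primary's own WAL fsync
  remote_flush_ms = 2.0   # standby's WAL fsync (identical hardware)
  rtt_ms = 0.2            # network round trip

  # If the standby's flush can only start once the primary's flush finished:
  serialized_wait = local_flush_ms + rtt_ms + remote_flush_ms
  # If both flushes can proceed in parallel once the WAL has been sent:
  overlapped_wait = max(local_flush_ms, rtt_ms + remote_flush_ms)

  # For a single latency-bound session, tps is roughly 1000 / commit wait:
  print(1000.0 / serialized_wait, 1000.0 / overlapped_wait)

In the serialized case a commit-dominated workload would indeed land near the
50% figure; in the overlapped case the penalty is mostly just the round trip.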

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

On Wed, May 9, 2012 at 12:41 PM, Robert Klemme
<shortcutter@googlemail.com> wrote:
> I am not sure whether the standby is only triggered to commit to disk
> after the commit to disk on the master has succeeded; if that were the
> case there would be true serialization => 50%.
>
> This sounds like it could actually be the case (note the "after it commits"):
> "When synchronous replication is requested the transaction will wait
> after it commits until it receives confirmation that the transfer has
> been successful."
> http://wiki.postgresql.org/wiki/Synchronous_replication

That should only happen for very short transactions.
IIRC, WAL records can be sent to the slaves before the transaction in
the master commits, so bigger transactions would see higher
parallelism.
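
A toy model of that (numbers are assumptions, only to show why transaction
size matters):

  # Toy model: if WAL is streamed while the transaction runs, only the tail
  # written since the last send still has to cross the wire at COMMIT time.
  def commit_wait_ms(wal_kb, streamed_while_running,
                     ship_kb_per_ms=100.0, remote_flush_ms=1.0):
      to_send_kb = 0.0 if streamed_while_running else wal_kb
      return to_send_kb / ship_kb_per_ms + remote_flush_ms

  print(commit_wait_ms(50000, streamed_while_running=False))  # shipped only at commit
  print(commit_wait_ms(50000, streamed_while_running=True))   # streamed as generated

A big transaction that keeps the stream busy during its own runtime only pays
for the final flush and acknowledgement at commit time.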

On Wed, May 9, 2012 at 5:45 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, May 9, 2012 at 12:41 PM, Robert Klemme
> <shortcutter@googlemail.com> wrote:
>> I am not sure whether the standby is only triggered to commit to disk
>> after the commit to disk on the master has succeeded; if that were the
>> case there would be true serialization => 50%.
>>
>> This sounds like it could actually be the case (note the "after it commits"):
>> "When synchronous replication is requested the transaction will wait
>> after it commits until it receives confirmation that the transfer has
>> been successful."
>> http://wiki.postgresql.org/wiki/Synchronous_replication
>
> That should only happen for very short transactions.
> IIRC, WAL records can be sent to the slaves before the transaction in
> the master commits, so bigger transactions would see higher
> parallelism.

I considered that as well.  But the question is: when are they written
to disk in the slave?  If they are in buffer cache until data is
synched to disk then you only gain a bit of advantage by earlier
sending (i.e. network latency).  Assuming a high bandwidth and low
latency network (which you want to have in this case anyway) that gain
is probably not big compared to the time it takes to ensure WAL is
written to disk.  I do not know implementation details but *if* the
server triggers sync only after its own sync has succeeded *then* you
basically have serialization and you need to wait twice the time.

For small transactions, OTOH, the network overhead might be so large
compared to WAL IO (for example with a battery-backed cache in the
controller) that it shows.  Since we do not know the test cases which led
to the 50% statement we can probably only speculate.  Ultimately each
individual setup and workload has to be tested.

Kind regards

robert


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

On Wed, May 9, 2012 at 12:03 PM, Robert Klemme
<shortcutter@googlemail.com> wrote:
> On Wed, May 9, 2012 at 5:45 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Wed, May 9, 2012 at 12:41 PM, Robert Klemme
>> <shortcutter@googlemail.com> wrote:
>>> I am not sure whether the standby is only triggered to commit to disk
>>> after the commit to disk on the master has succeeded; if that were the
>>> case there would be true serialization => 50%.
>>>
>>> This sounds like it could actually be the case (note the "after it commits"):
>>> "When synchronous replication is requested the transaction will wait
>>> after it commits until it receives confirmation that the transfer has
>>> been successful."
>>> http://wiki.postgresql.org/wiki/Synchronous_replication
>>
>> That should only happen for very short transactions.
>> IIRC, WAL records can be sent to the slaves before the transaction in
>> the master commits, so bigger transactions would see higher
>> parallelism.
>
> I considered that as well.  But the question is: when are they written
> to disk in the slave?  If they are in buffer cache until data is
> synched to disk then you only gain a bit of advantage by earlier
> sending (i.e. network latency).  Assuming a high bandwidth and low
> latency network (which you want to have in this case anyway) that gain
> is probably not big compared to the time it takes to ensure WAL is
> written to disk.  I do not know implementation details but *if* the
> server triggers sync only after its own sync has succeeded *then* you
> basically have serialization and you need to wait twice the time.
>
> For small transactions, OTOH, the network overhead might be so large
> compared to WAL IO (for example with a battery-backed cache in the
> controller) that it shows.  Since we do not know the test cases which led
> to the 50% statement we can probably only speculate.  Ultimately each
> individual setup and workload has to be tested.

yeah. note the upcoming 9.2 synchronous_commit=remote_write setting is
intended to improve this situation by letting the transaction go a bit
earlier -- the slave basically only has to acknowledge receipt of the
data.
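
roughly, as I understand the docs, a COMMIT on the primary waits for
something like the following under the different synchronous_commit settings
(a sketch, not an exhaustive description):

  # Sketch of what COMMIT waits for, per synchronous_commit setting,
  # as I read the 9.1/9.2 documentation.
  commit_waits_for = {
      "off": "nothing extra (WAL is flushed in the background)",
      "local": "the primary's own WAL flush only",
      "remote_write": "local flush + standby has written (not yet fsynced) the WAL (new in 9.2)",
      "on": "local flush + standby has flushed the WAL to disk (full sync rep)",
  }
  for setting, waits in commit_waits_for.items():
      print(setting, "-", waits)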


merlin

From: "Merlin Moncure" <mmoncure@gmail.com>
> On Wed, May 9, 2012 at 8:06 AM, MauMau <maumau307@gmail.com> wrote:
>> Hello,
>>
>> I've heard from some people that synchronous streaming replication has a
>> severe performance impact on the primary. They said that the transaction
>> throughput of a TPC-C-like benchmark (perhaps DBT-2) decreased by 50%. I'm
>> sorry I haven't asked them about their testing environment; they only
>> shared their experience. They think that this result is much worse than
>> with some commercial databases.
>
> I can't speak for other databases, but it's only natural to assume
> that tps must drop.  At minimum, you have to add the latency of
> communication and remote sync operation to your transaction time.  For
> very short transactions this adds up to a lot of extra work relative
> to the transaction itself.

Yes, I understand it is natural for the response time of each transaction to
double or more. But I think the throughput drop would be amortized among
multiple simultaneous transactions. So, 50% throughput decrease seems
unreasonable.

If this thinking is correct, and someone could kindly share his/her past
performance evaluation results (ideally of DBT-2), I want to say to my
acquaintance "hey, community people experience better performance, so you
may need to review your configuration."

Regards
MauMau


On Wed, May 9, 2012 at 7:34 PM, MauMau <maumau307@gmail.com> wrote:
>> I can't speak for other databases, but it's only natural to assume
>> that tps must drop.  At minimum, you have to add the latency of
>> communication and remote sync operation to your transaction time.  For
>> very short transactions this adds up to a lot of extra work relative
>> to the transaction itself.
>
>
> Yes, I understand it is natural for the response time of each transaction to
> double or more. But I think the throughput drop would be amortized among
> multiple simultaneous transactions. So, 50% throughput decrease seems
> unreasonable.

I'm pretty sure it depends a lot on the workload. Knowing the
methodology used to arrive at those figures is critical. Was the
throughput decrease measured against no replication, or asynchronous
replication? How many clients were used? What was the workload like?
Was it CPU bound? I/O bound? Read-mostly?

We have asynchronous replication in production and throughput has not
changed relative to no replication. I cannot see how making it
synchronous would change throughput, as it only induces waiting time on
the clients, but no extra work. I can only assume the test didn't use
enough clients to saturate the hardware under high-latency situations,
or clients were somehow experiencing application-specific contention.
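
Little's law makes that intuition concrete (illustrative numbers only):

  # Roughly: throughput ~= concurrent transactions / latency. Extra commit
  # latency only caps throughput if you cannot add clients or the server
  # itself (CPU, disks) is already saturated. Numbers are assumed.
  def tps(clients, tx_latency_ms):
      return clients * 1000.0 / tx_latency_ms

  print(tps(10, 5.0))   # 10 clients, 5 ms per transaction -> 2000 tps
  print(tps(10, 10.0))  # same clients, commit latency doubled -> 1000 tps
  print(tps(20, 10.0))  # double the clients to cover the wait -> 2000 tps again

With too few clients the benchmark measures latency, not capacity, and a 50%
drop is exactly what you would expect to see.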

I don't know the code, but knowing how synchronous replication works,
I would say any such drop under high concurrency would be a bug
(contention among waiting processes or something like that) that needs
to be fixed.

From: "Claudio Freire" <klaussfreire@gmail.com>
On Wed, May 9, 2012 at 7:34 PM, MauMau <maumau307@gmail.com> wrote:
>> Yes, I understand it is natural for the response time of each transaction to
>> double or more. But I think the throughput drop would be amortized among
>> multiple simultaneous transactions. So, 50% throughput decrease seems
>> unreasonable.

> I'm pretty sure it depends a lot on the workload. Knowing the
> methodology used to arrive at those figures is critical. Was the
> throughput decrease measured against no replication, or asynchronous
> replication? How many clients were used? What was the workload like?
> Was it CPU bound? I/O bound? Read-mostly?

> We have asynchronous replication in production and throughput has not
> changed relative to no replication. I cannot see how making it
> synchronous would change throughput, as it only induces waiting time on
> the clients, but no extra work. I can only assume the test didn't use
> enough clients to saturate the hardware under high-latency situations,
> or clients were somehow experiencing application-specific contention.

Thank you for your experience and opinion.

The workload is a TPC-C-like, write-heavy one: DBT-2. They compared the
throughput of the synchronous replication case against that of the no
replication case.

Today, they told me that they ran the test on two virtual machines on a
single physical machine. They also used pgpool-II in both cases. In
addition, they may have run the applications and pgpool-II on the same
virtual machine as the database server.

It sounded to me like resources were so scarce that concurrency was low, or
your assumption may be correct. I'll hear more about their environment from
them.

BTW it's a pity that I cannot find any case study of the performance of the
flagship feature of PostgreSQL 9.0/9.1, streaming replication...

Regards
MauMau


MauMau, 10.05.2012 13:34:
> Today, they told me that they ran the test on two virtual machines on
> a single physical machine.

Which means that both databases shared the same I/O system (hard disks).
Therefore it's not really surprising that the overall performance goes down
if you increase the I/O load.

A more realistic test (at least in my opinion) would have been to use two
separate computers with two separate I/O systems.



On 10 May 2012, 13:34, MauMau wrote:
> The workload is a TPC-C-like, write-heavy one: DBT-2. They compared the
> throughput of the synchronous replication case against that of the no
> replication case.
>
> Today, they told me that they ran the test on two virtual machines on a
> single physical machine. They also used pgpool-II in both cases. In
> addition, they may have run the applications and pgpool-II on the same
> virtual machine as the database server.

So they've run a test that is usually I/O bound on a single machine? If
they've used the same I/O devices, I'm surprised the degradation was just
50%. If you have a system that can handle X IOPS, and you run two
instances there, each will get ~X/2 IOPS. No magic can help here.
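
In other words (numbers assumed, just to show the scale of the effect):

  # Two I/O-bound instances sharing one disk subsystem split its IOPS budget,
  # no matter how the replication itself behaves. Numbers are assumed.
  disk_iops = 10000.0   # what the shared storage can sustain
  iops_per_tx = 10.0    # assumed I/O cost of one DBT-2-like transaction

  standalone_tps = disk_iops / iops_per_tx          # one instance owns the disks
  colocated_tps = (disk_iops / 2.0) / iops_per_tx   # primary and standby share them
  print(standalone_tps, colocated_tps, colocated_tps / standalone_tps)  # -> 0.5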

Even if they used separate I/O devices, there are probably many things
that are shared and can become a bottleneck in a virtualized environment.

The setup is definitely very suspicious.

> It sounded to me like resources were so scarce that concurrency was low, or
> your assumption may be correct. I'll hear more about their environment from
> them.
>
> BTW it's a pity that I cannot find any case study of the performance of the
> flagship feature of PostgreSQL 9.0/9.1, streaming replication...

There were some nice talks about performance impact of sync rep, for
example this one:

  http://www.2ndquadrant.com/static/2quad/media/pdfs/talks/SyncRepDurability.pdf

There's also a video:

  http://www.youtube.com/watch?v=XL7j8hTd6R8

Tomas


On Wed, May 9, 2012 at 5:34 PM, MauMau <maumau307@gmail.com> wrote:
> Yes, I understand it is natural for the response time of each transaction to
> double or more. But I think the throughput drop would be amortized among
> multiple simultaneous transactions. So, 50% throughput decrease seems
> unreasonable.
>
> If this thinking is correct, and someone could kindly share his/her past
> performance evaluation results (ideally of DBT-2), I want to say to my
> acquaintance "hey, community people experience better performance, so you
> may need to review your configuration."

It seems theoretically possible to interleave the processing on both
sides, but a 50% reduction in throughput for latency-bound transactions
seems to be broadly advertised as what to reasonably expect for sync
rep with 9.1.

9.2 beta is arriving shortly, and when it does I suggest experimenting
with the new remote_write setting for synchronous commit on non-production
workloads.

merlin

From: "Tomas Vondra" <tv@fuzzy.cz>
> There were some nice talks about performance impact of sync rep, for
> example this one:
>
>
> http://www.2ndquadrant.com/static/2quad/media/pdfs/talks/SyncRepDurability.pdf
>
> There's also a video:
>
>  http://www.youtube.com/watch?v=XL7j8hTd6R8

Thanks. The video is especially interesting. I'll tell my acquaintance to
check it, too.

Regards
MauMau



On Thu, May 10, 2012 at 8:34 PM, MauMau <maumau307@gmail.com> wrote:
> Today, they told me that they ran the test on two virtual machines on a
> single physical machine. They also used pgpool-II in both cases. In
> addition, they may have run the applications and pgpool-II on the same
> virtual machine as the database server.

So they compared the throughput of one server running on a single machine
(non-replication case) with that of two servers (i.e., master and standby)
running on the same single machine (sync rep case)? The amount of CPU/Mem/IO
resource available per server is not the same between those two cases. So
ISTM it's very unfair to the sync rep case. In this situation, I'm not
surprised to see a 50% performance degradation in the sync rep case.

> It sounded to me like resources were so scarce that concurrency was low, or
> your assumption may be correct. I'll hear more about their environment from
> them.
>
> BTW it's a pity that I cannot find any case study of the performance of the
> flagship feature of PostgreSQL 9.0/9.1, streaming replication...

Though I cannot show the details for certain reasons, when I measured
the performance overhead of sync rep using pgbench, the throughput
overhead was less than 10%. When measuring sync rep, I used two
sets of physical machines and storage for the master and standby, and
used a 1Gbps network between them.
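
For anyone who wants to repeat that kind of comparison, a minimal driver
could look roughly like this (a sketch only; it assumes a primary with a
synchronous standby already configured via synchronous_standby_names, and a
pgbench-initialized database named "bench"):

  import os
  import subprocess

  def run_pgbench(sync_commit, clients=32, threads=4, seconds=300, db="bench"):
      # Override synchronous_commit per session via PGOPTIONS, so the same
      # primary/standby pair can be tested with and without waiting for the
      # standby's flush.
      env = dict(os.environ, PGOPTIONS="-c synchronous_commit=%s" % sync_commit)
      out = subprocess.check_output(
          ["pgbench", "-c", str(clients), "-j", str(threads),
           "-T", str(seconds), db],
          env=env)
      return out.decode()  # contains the "tps = ..." summary lines

  baseline = run_pgbench("local")  # commit waits only for the local WAL flush
  syncrep = run_pgbench("on")      # commit also waits for the standby's flush
  print(baseline)
  print(syncrep)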

Regards,

--
Fujii Masao

From: "Fujii Masao" <masao.fujii@gmail.com>
> Though I cannot show the details for certain reasons, when I measured
> the performance overhead of sync rep using pgbench, the throughput
> overhead was less than 10%. When measuring sync rep, I used two
> sets of physical machines and storage for the master and standby, and
> used a 1Gbps network between them.

Fujii-san, thanks a million. That's valuable information. An overhead of less
than 10%, presumably under a high-concurrency, write-heavy workload, exceeds
my expectations. Great!

Though I couldn't contact the testers today, I'll tell this to them next
week.

Regards
MauMau