Thread: Sync Rep: First Thoughts on Code
Breaking down of the patch into sections works very well for review. Should allow us to get different reviewers on different parts of the code - review wranglers please take note: Dave, Josh.

Can you confirm that all the docs on the Wiki page are up to date? There are a few minor discrepancies that make me think it isn't. Examples:

"For example, to make a single multi-statement transaction replication asynchronously when the default is the opposite, issue SET LOCAL synchronous_commit TO OFF within the transaction."

Do we mean synchronous_replication in this sentence? I think you've copied the text and not changed all of the necessary parts - please re-read the whole section (probably the whole Wiki, actually).

"wal_writer_delay" - do we mean wal_sender_delay? Is there some ability to measure the amount of data to be sent and avoid the delay altogether, when the server is sufficiently busy?

The reaction to replication_timeout may need to be configurable. I might not want to keep on processing if the information didn't reach the standby. I would prefer in many cases that the transactions that were waiting for walsender would abort, but the walsender kept processing. How can we restart the walsender if it shuts down? Do we want a maximum wait for a transaction and a maximum wait for the server?

Do we report stats on how long the replication has been taking? If the average rep time is close to the rep timeout then we will be fragile, so we need some way to notice this and produce warnings. Or at least provide info to an external monitoring system.

How do we specify the user we use to connect to the primary?

Definitely need more explanatory comments/README-style docs.

For example, 03_libpq seems simple and self-contained. I'm not sure why we have a state called PGASYNC_REPLICATION; I was hoping that would be dynamic, but I'm not sure where to look for that. It would be useful to have a very long comment within the code to explain how the replication messages work, and a note on each function saying who the intended client and server is.

02_pqcomm: What does HAVE_POLL mean? Do we need to worry about periodic renegotiation of keys in be-secure.c? Not sure I understand why there are so many new functions in there.

04_recovery_conf is a change I agree with, though I think it may not work with EXEC_BACKEND for Windows.

05... I need some commentary to explain this better.

06 and 07 are large and will take substantial review time. So we must get the overall architecture done first and then check the code that implements that.

08 - I think I get this, but some docs will help to confirm.

09 pg_standby changes: so more changes are coming there? OK. Can we refer to those two options as failover and switchover? There's no need to change definitions that many Postgres people already use. This change can be done without making any change to server behaviour, so it can benefit 8.2 and 8.3 people also.

01_signal_handling: I've looked at the LWLock acquires and releases in the patch and am fairly happy, except for the ProcArrayLock acquire during this sub-patch. Do we really need to do things this way? Is the actual state important? Could we just do this with a counter which cycles? So callers increment the counter atomically and the reader just polls to see if anybody has incremented? Or could we protect that part of the proc with a different lock? Touching ProcArrayLock is bad news.

Anyway, feeling very positive about this. Hope we can get this reviewed and committed in the next 3-4 weeks.
I have many clues as to how to structure my own work also. Thanks.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
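A minimal sketch of the cycling-counter alternative floated above, using plain C11 atomics to stand in for the backend's own primitives; every name below is invented for illustration, nothing is taken from the patch:

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_uint signal_counter;  /* wraps around; only changes matter */

    /* signalling side: announce that new state is available */
    static void
    announce_signal(void)
    {
        atomic_fetch_add_explicit(&signal_counter, 1, memory_order_release);
    }

    /* reading side: poll, remembering the last value acted upon */
    static bool
    signal_pending(unsigned *last_seen)
    {
        unsigned cur = atomic_load_explicit(&signal_counter,
                                            memory_order_acquire);

        if (cur == *last_seen)
            return false;
        *last_seen = cur;
        return true;
    }

Because the reader only tests for inequality, counter wraparound is harmless; the point is simply that signallers never need to touch ProcArrayLock.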
Hi, Simon. Thanks for taking many hours to review the code!!

On Mon, Dec 1, 2008 at 8:42 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Can you confirm that all the docs on the Wiki page are up to date? There are a few minor discrepancies that make me think it isn't.

Documentation is ongoing. Sorry for my slow progress. BTW, I'm going to add and change the sgml files listed on the wiki.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Documentation_Plan

> Examples: "For example, to make a single multi-statement transaction replication asynchronously when the default is the opposite, issue SET LOCAL synchronous_commit TO OFF within the transaction."
> Do we mean synchronous_replication in this sentence? I think you've copied the text and not changed all of the necessary parts - please re-read the whole section (probably the whole Wiki, actually).

Oops! It's just a typo. Sorry for the confusion. I will revise this section.

> "wal_writer_delay" - do we mean wal_sender_delay?

Yes. I will fix it.

> Is there some ability to measure the amount of data to be sent and avoid the delay altogether, when the server is sufficiently busy?

Why is the former ability required? The latter is possible, I think. We can guarantee that the WAL is sent (in more detail, that send(2) is called) at least once per wal_sender_delay. Of course, this depends on the kernel's scheduler.

> The reaction to replication_timeout may need to be configurable. I might not want to keep on processing if the information didn't reach the standby.

OK. I will add a new GUC variable (PGC_SIGHUP) to specify the reaction to the timeout.

> I would prefer in many cases that the transactions that were waiting for walsender would abort, but the walsender kept processing.

Is it dangerous to abort the transaction while replication continues when the timeout occurs? I think that the WAL consistency between the two servers might be broken, because the WAL writing and sending are done concurrently, and the backend might already have written the WAL to disk on the primary while waiting for walsender.

> How can we restart the walsender if it shuts down?

Just restart the standby (with walreceiver). The standby connects to the postmaster on the primary, and the postmaster then forks a new walsender.

> Do we want a maximum wait for a transaction and a maximum wait for the server?

ISTM that these features are too much.

> Do we report stats on how long the replication has been taking? If the average rep time is close to the rep timeout then we will be fragile, so we need some way to notice this and produce warnings. Or at least provide info to an external monitoring system.

Sounds good. How about log_min_duration_replication? If the rep time is greater than it, we produce a warning (or log entry), like log_min_duration_xx.

> How do we specify the user we use to connect to the primary?

Yes, I need to add a new option to specify the user name in recovery.conf. Thanks for reminding me!

> Definitely need more explanatory comments/README-style docs.

Completely agreed ;-) I will write a README together with the other documents.

> For example, 03_libpq seems simple and self-contained. I'm not sure why we have a state called PGASYNC_REPLICATION; I was hoping that would be dynamic, but I'm not sure where to look for that. It would be useful to have a very long comment within the code to explain how the replication messages work, and a note on each function saying who the intended client and server is.

OK.
I will reconsider whether PGASYNC_REPLICATION is removable, and write a comment about it.

> 02_pqcomm: What does HAVE_POLL mean?

It identifies whether poll(2) is available on the platform. We use poll(2) if it's defined, otherwise select(2). There is similar code at pqSocketPoll() in fe-misc.c.

> Do we need to worry about periodic renegotiation of keys in be-secure.c?

What are the "keys" you mean?

> Not sure I understand why there are so many new functions in there.

It's because walsender waits for the reply from the standby and the request from the backend concurrently. So we need poll(2) or select(2) to make walsender wait for them, and some functions for non-blocking receiving.

> 04_recovery_conf is a change I agree with, though I think it may not work with EXEC_BACKEND for Windows.

OK. I will examine and fix it.

> 05... I need some commentary to explain this better.
>
> 06 and 07 are large and will take substantial review time. So we must get the overall architecture done first and then check the code that implements that.
>
> 08 - I think I get this, but some docs will help to confirm.

Yes. I need more documentation.

> 09 pg_standby changes: so more changes are coming there? OK. Can we refer to those two options as failover and switchover?

You mean a failover trigger and a switchover one? ISTM that those names and features might not suit. Naming always bothers me, and the current name "commit/abort trigger" might tend to cause confusion. Is there any other suitable name?

> There's no need to change definitions that many Postgres people already use. This change can be done without making any change to server behaviour, so it can benefit 8.2 and 8.3 people also.

Agreed.

> 01_signal_handling: I've looked at the LWLock acquires and releases in the patch and am fairly happy, except for the ProcArrayLock acquire during this sub-patch. Do we really need to do things this way? Is the actual state important? Could we just do this with a counter which cycles? So callers increment the counter atomically and the reader just polls to see if anybody has incremented? Or could we protect that part of the proc with a different lock? Touching ProcArrayLock is bad news.

Agreed. I will add a new lock for proc.signalFlags.

> Anyway, feeling very positive about this. Hope we can get this reviewed and committed in the next 3-4 weeks.
>
> I have many clues as to how to structure my own work also. Thanks.

Thanks again!

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
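For illustration, a simplified sketch of the HAVE_POLL fallback just described, modelled loosely on pqSocketPoll() in fe-misc.c; the function name is invented here and error handling is elided:

    #ifdef HAVE_POLL
    #include <poll.h>
    #else
    #include <sys/select.h>
    #endif

    /*
     * Wait until fd is readable or timeout_ms elapses.
     * Returns >0 if ready, 0 on timeout, <0 on error.
     */
    static int
    wait_readable(int fd, int timeout_ms)
    {
    #ifdef HAVE_POLL
        struct pollfd pfd;

        pfd.fd = fd;
        pfd.events = POLLIN;
        pfd.revents = 0;
        return poll(&pfd, 1, timeout_ms);
    #else
        fd_set          rfds;
        struct timeval  tv;

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        return select(fd + 1, &rfds, NULL, NULL, &tv);
    #endif
    }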
On Tue, 2008-12-02 at 21:37 +0900, Fujii Masao wrote:
> Thanks for taking many hours to review the code!!
>
> On Mon, Dec 1, 2008 at 8:42 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > Can you confirm that all the docs on the Wiki page are up to date? There are a few minor discrepancies that make me think it isn't.
>
> Documentation is ongoing. Sorry for my slow progress.
>
> BTW, I'm going to add and change the sgml files listed on the wiki.
> http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Documentation_Plan

I'm patient, I know it takes time. Happy to spend hours on the review, but I want to do that knowing I agree with the higher-level features and architecture first. This was just a first review; I expect to spend more time on it yet.

> > The reaction to replication_timeout may need to be configurable. I might not want to keep on processing if the information didn't reach the standby.
>
> OK. I will add a new GUC variable (PGC_SIGHUP) to specify the reaction to the timeout.
>
> > I would prefer in many cases that the transactions that were waiting for walsender would abort, but the walsender kept processing.
>
> Is it dangerous to abort the transaction while replication continues when the timeout occurs? I think that the WAL consistency between the two servers might be broken, because the WAL writing and sending are done concurrently, and the backend might already have written the WAL to disk on the primary while waiting for walsender.

The issue I see is that we might want to keep wal_sender_delay small so that transaction times are not increased. But we also want wal_sender_delay high so that replication never breaks. It seems better to have the action on wal_sender_delay configurable if we have an unsteady network (like the internet). Marcus made some comments on line dropping that seem relevant here; we should listen to his experience.

Hmmm, dangerous? Well, assuming we're linking commits with replication sends, then it sounds like it. We might end up committing to disk and then deciding to abort instead. But remember we don't remove the xid from the procarray or mark the result in clog until the flush is over, so it is possible. But I think we should discuss this in more detail when the main patch is committed.

> > Do we report stats on how long the replication has been taking? If the average rep time is close to the rep timeout then we will be fragile, so we need some way to notice this and produce warnings. Or at least provide info to an external monitoring system.
>
> Sounds good. How about log_min_duration_replication? If the rep time is greater than it, we produce a warning (or log entry), like log_min_duration_xx.

Maybe. Let's put in something that logs if >50% (?) of the timeout. Make that configurable with a #define and see if we need that to be configurable with a GUC later.

> > Do we need to worry about periodic renegotiation of keys in be-secure.c?
>
> What are the "keys" you mean?

See the notes in that file for explanation. I wondered whether it might be a perf problem for us?

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
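A sketch of that check, assuming it would run where the backend finishes waiting for walsender; the function name and the 50% figure are the placeholders from the paragraph above, and only elog() is a real backend call:

    #include "postgres.h"       /* elog(), WARNING */

    #define REPLICATION_WARN_PERCENT 50    /* #define first; GUC later if needed */

    static void
    check_replication_duration(long elapsed_ms, long timeout_ms)
    {
        /* warn when one round trip used more than half the timeout budget */
        if (timeout_ms > 0 &&
            elapsed_ms * 100 > timeout_ms * REPLICATION_WARN_PERCENT)
            elog(WARNING,
                 "replication took %ld ms, over %d%% of replication_timeout (%ld ms)",
                 elapsed_ms, REPLICATION_WARN_PERCENT, timeout_ms);
    }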
On Tue, 2008-12-02 at 11:08 -0800, Jeff Davis wrote:
> On Tue, 2008-12-02 at 13:09 +0000, Simon Riggs wrote:
> > > Is it dangerous to abort the transaction while replication continues when the timeout occurs? I think that the WAL consistency between the two servers might be broken, because the WAL writing and sending are done concurrently, and the backend might already have written the WAL to disk on the primary while waiting for walsender.
> >
> > The issue I see is that we might want to keep wal_sender_delay small so that transaction times are not increased. But we also want wal_sender_delay high so that replication never breaks. It seems better to have the action on wal_sender_delay configurable if we have an unsteady network (like the internet). Marcus made some comments on line dropping that seem relevant here; we should listen to his experience.
> >
> > Hmmm, dangerous? Well, assuming we're linking commits with replication sends, then it sounds like it. We might end up committing to disk and then deciding to abort instead. But remember we don't remove the xid from the procarray or mark the result in clog until the flush is over, so it is possible. But I think we should discuss this in more detail when the main patch is committed.
>
> What is the "it" in "it is possible"? It seems like there's still a problem window in there.

Marking a transaction aborted after we have written a commit record, but before we have removed it from the proc array and marked it in clog. We'd need a special kind of WAL record to do that.

> Even if that could be made safe, in the event of a real network failure, you'd just wait the full timeout every transaction, because it still thinks it's replicating.

True, but I did suggest having two timeouts. There is considerable reason to reduce the timeout as well as reason to increase it - at the same time. Anyway, let's wait for some user experience following commit.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Tue, 2008-12-02 at 13:09 +0000, Simon Riggs wrote:
> > Is it dangerous to abort the transaction while replication continues when the timeout occurs? I think that the WAL consistency between the two servers might be broken, because the WAL writing and sending are done concurrently, and the backend might already have written the WAL to disk on the primary while waiting for walsender.
>
> The issue I see is that we might want to keep wal_sender_delay small so that transaction times are not increased. But we also want wal_sender_delay high so that replication never breaks. It seems better to have the action on wal_sender_delay configurable if we have an unsteady network (like the internet). Marcus made some comments on line dropping that seem relevant here; we should listen to his experience.
>
> Hmmm, dangerous? Well, assuming we're linking commits with replication sends, then it sounds like it. We might end up committing to disk and then deciding to abort instead. But remember we don't remove the xid from the procarray or mark the result in clog until the flush is over, so it is possible. But I think we should discuss this in more detail when the main patch is committed.

What is the "it" in "it is possible"? It seems like there's still a problem window in there.

Even if that could be made safe, in the event of a real network failure, you'd just wait the full timeout every transaction, because it still thinks it's replicating. If the timeout is exceeded, it seems more reasonable to abandon the slave until you can re-sync it, and continue processing as normal. As you pointed out, that's not necessarily an expensive operation because you can use something like rsync. The process of re-syncing might be made easier (or perhaps less costly), of course.

If we want to still allow processing to happen after a timeout, it seems reasonable to have a configurable option to allow/disallow non-read-only transactions when out of sync.

Regards,
Jeff Davis
> Breaking down of the patch into sections works very well for review. Should allow us to get different reviewers on different parts of the code - review wranglers please take note: Dave, Josh.

Fujii-san, could you break the patch up into several parts? We have quite a few junior reviewers who are idle right now.

-- 
--Josh

Josh Berkus
PostgreSQL
San Francisco
Jeff,

> Even if that could be made safe, in the event of a real network failure, you'd just wait the full timeout every transaction, because it still thinks it's replicating.

Hmmm. I'd suggest that if we keep getting timeouts in a row for more than 10x the timeout value, replication stops. Unfortunately, we should probably make that *another* configuration setting.

-- 
--Josh

Josh Berkus
PostgreSQL
San Francisco
Hi,

On Wed, Dec 3, 2008 at 6:03 AM, Josh Berkus <josh@agliodbs.com> wrote:
> > Breaking down of the patch into sections works very well for review. Should allow us to get different reviewers on different parts of the code - review wranglers please take note: Dave, Josh.
>
> Fujii-san, could you break the patch up into several parts? We have quite a few junior reviewers who are idle right now.

Yes, I divided the patch into 9 pieces. Do I need to divide it further?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii-san,

> Yes, I divided the patch into 9 pieces. Do I need to divide it further?

That's plenty. Where do reviewers find the 9 pieces?

-- 
Josh Berkus
PostgreSQL
San Francisco
Hi,

On Wed, Dec 3, 2008 at 3:21 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Fujii-san,
>
> > Yes, I divided the patch into 9 pieces. Do I need to divide it further?
>
> That's plenty. Where do reviewers find the 9 pieces?

The latest patch set (v4) is on the wiki.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Patch_set

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hello,

On Tue, Dec 2, 2008 at 10:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > > The reaction to replication_timeout may need to be configurable. I might not want to keep on processing if the information didn't reach the standby.
> >
> > OK. I will add a new GUC variable (PGC_SIGHUP) to specify the reaction to the timeout.
> >
> > > I would prefer in many cases that the transactions that were waiting for walsender would abort, but the walsender kept processing.
> >
> > Is it dangerous to abort the transaction while replication continues when the timeout occurs? I think that the WAL consistency between the two servers might be broken, because the WAL writing and sending are done concurrently, and the backend might already have written the WAL to disk on the primary while waiting for walsender.
>
> The issue I see is that we might want to keep wal_sender_delay small so that transaction times are not increased. But we also want wal_sender_delay high so that replication never breaks.

Are you assuming only the async case? In the sync case, since walsender is awoken by the signal from the backend, we don't need to keep the delay so small. And wal_sender_delay has no relation to the unexpected termination of replication.

> It seems better to have the action on wal_sender_delay configurable if we have an unsteady network (like the internet). Marcus made some comments on line dropping that seem relevant here; we should listen to his experience.

OK, I will look for his comments. Please let me know which thread has the comments if you know.

> Hmmm, dangerous? Well, assuming we're linking commits with replication sends, then it sounds like it. We might end up committing to disk and then deciding to abort instead. But remember we don't remove the xid from the procarray or mark the result in clog until the flush is over, so it is possible. But I think we should discuss this in more detail when the main patch is committed.

If the transaction is aborted while the backend is waiting for replication, the transaction commit command returns a "false" indication to the client. But the transaction commit record might be written on both the primary and standby. As you say, it may not be dangerous as long as the primary is alive. But when we recover the failed primary, the clog of the transaction is marked "success" because of the commit record. Is it safe? And in that case, the transaction is treated as "success" on the standby, and is visible to read-only queries. On the other hand, it's invisible on the primary. Isn't it dangerous?

> > > Do we need to worry about periodic renegotiation of keys in be-secure.c?
> >
> > What are the "keys" you mean?
>
> See the notes in that file for explanation.

Thanks! I will check it.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hi,

On Wed, Dec 3, 2008 at 4:08 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> Even if that could be made safe, in the event of a real network failure, you'd just wait the full timeout every transaction, because it still thinks it's replicating.

If walsender detects a real network failure, the transaction doesn't need to wait for the timeout. Configuring keepalive options would help walsender detect it. Of course, keepalive on Linux might not work as expected.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
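A sketch of the kind of keepalive configuration meant here; SO_KEEPALIVE is portable, while the tuning knobs shown are Linux-specific and guarded accordingly. The helper name is invented for illustration:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static int
    enable_keepalive(int sock)
    {
        int on = 1;

        /* turn on keepalive probing for this connection */
        if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
            return -1;

    #ifdef TCP_KEEPIDLE                 /* Linux-specific tuning knobs */
        {
            int idle = 60;              /* seconds of idle before probing starts */
            int interval = 10;          /* seconds between probes */
            int count = 3;              /* failed probes before the connection drops */

            setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
            setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval));
            setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
        }
    #endif
        return 0;
    }

With settings like these, a dead peer is noticed in roughly idle + interval * count seconds, rather than only when replication_timeout fires.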
Hi,

On Tue, Dec 2, 2008 at 10:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, 2008-12-02 at 21:37 +0900, Fujii Masao wrote:
> > Thanks for taking many hours to review the code!!
> >
> > Documentation is ongoing. Sorry for my slow progress.
> >
> > BTW, I'm going to add and change the sgml files listed on the wiki.
> > http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Documentation_Plan
>
> I'm patient, I know it takes time. Happy to spend hours on the review, but I want to do that knowing I agree with the higher-level features and architecture first.

Since I thought that the figures would be more intelligible for some people than my poor English, I illustrated the architecture first.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Detailed_Design

Are there any other parts which should be illustrated for review?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, 2008-12-03 at 21:37 +0900, Fujii Masao wrote:
> Since I thought that the figures would be more intelligible for some people than my poor English, I illustrated the architecture first.
> http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Detailed_Design
>
> Are there any other parts which should be illustrated for review?

Those are very useful, thanks.

Some questions to check my understanding (expected answers in brackets):

* Diagram on p.2 has two Archives. We have just one (yes)
* We send data continuously, whether or not we are in sync/async? (yes) So the only difference between sync/async is whether we wait when we flush the commit? (yes)
* If we have synchronous_commit = off, do we ignore synchronous_replication = on? (yes)
* If two transactions commit almost simultaneously and one is sync and the other async, then only the sync backend will wait? (yes)

Do we definitely need the archiver to move the files written by walreceiver to the archive and then move them back out again? Seems like we can streamline that part in many (all?) cases.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi,

On Wed, Dec 3, 2008 at 11:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I'm patient, I know it takes time. Happy to spend hours on the review, but I want to do that knowing I agree with the higher-level features and architecture first.

I wrote up the features and restrictions of Synch Rep. Please also check it together with the figures of the architecture.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#User_Overview

> Some questions to check my understanding (expected answers in brackets)
>
> * Diagram on p.2 has two Archives. We have just one (yes)

No, we need an archive on both the primary and standby. The primary needs an archive because a base backup is required when starting the standby. Meanwhile, the standby needs an archive for cooperating with pg_standby.

If the directory that pg_standby checks were the same as the directory where walreceiver writes the WAL, a halfway-written WAL file might be restored by pg_standby, and continuous recovery would fail. So we have to separate the directories, and I assigned pg_xlog and archive to them.

Another idea: walreceiver writes the WAL to a file with a temporary name, and renames it to the final name when it fills, so pg_standby never restores a halfway-written WAL file (see the sketch below). But it's more difficult to perform a failover because the unrenamed WAL file remains. Do you have any other good idea?

> * We send data continuously, whether or not we are in sync/async? (yes)

Yes.

> So the only difference between sync/async is whether we wait when we flush the commit? (yes)

Yes. And, in the async case, the backend basically doesn't send the wakeup signal to walsender.

> * If we have synchronous_commit = off, do we ignore synchronous_replication = on? (yes)

No, we can configure them independently. synchronous_commit covers only local writing of the WAL. If synch_*commit* should cover both local writing and replication, I'd like to add a new GUC which covers only local writing (synchronous_local_write?).

> * If two transactions commit almost simultaneously and one is sync and the other async, then only the sync backend will wait? (yes)

Yes.

> Do we definitely need the archiver to move the files written by walreceiver to the archive and then move them back out again?

Yes, because of cooperating with pg_standby.

> Seems like we can streamline that part in many (all?) cases.

Agreed. But I thought that such streamlining was a TODO for next time.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
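A sketch of the temporary-name alternative mentioned above: walreceiver fills "<segment>.partial" and renames it into place only once the segment is complete, so pg_standby never sees a half-written file. All names here are illustrative, not from the patch:

    #include <stdio.h>

    /* Called by walreceiver once a WAL segment has been fully written. */
    static int
    finalize_segment(const char *xlogdir, const char *segname)
    {
        char tmppath[1024];
        char finalpath[1024];

        snprintf(tmppath, sizeof(tmppath), "%s/%s.partial", xlogdir, segname);
        snprintf(finalpath, sizeof(finalpath), "%s/%s", xlogdir, segname);

        /*
         * rename() is atomic within a filesystem, so a restorer such as
         * pg_standby sees either no file or a complete one.
         */
        return rename(tmppath, finalpath);
    }

The drawback Fujii-san notes follows directly from this design: after a crash, the ".partial" file holds the newest WAL but is invisible to the normal restore path, so failover must handle it specially.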
Hi,

On Wed, Dec 3, 2008 at 3:38 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > > > Do we need to worry about periodic renegotiation of keys in be-secure.c?
> > >
> > > What are the "keys" you mean?
> >
> > See the notes in that file for explanation.
>
> Thanks! I will check it.

The key is used only when we use SSL for the replication connection. As far as I examined, secure_write() renegotiates the key if needed. Since walsender calls secure_write() when sending WAL to the standby, the key is renegotiated periodically. So I think that we don't need to worry about the obsolescence of the key. Am I missing something?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, 2008-12-04 at 17:57 +0900, Fujii Masao wrote:
> On Wed, Dec 3, 2008 at 3:38 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > > > > Do we need to worry about periodic renegotiation of keys in be-secure.c?
> > > >
> > > > What are the "keys" you mean?
> > >
> > > See the notes in that file for explanation.
> >
> > Thanks! I will check it.
>
> The key is used only when we use SSL for the replication connection. As far as I examined, secure_write() renegotiates the key if needed. Since walsender calls secure_write() when sending WAL to the standby, the key is renegotiated periodically. So I think that we don't need to worry about the obsolescence of the key.

Understood. Is the periodic renegotiation of keys something that would interfere with the performance or robustness of replication? Is the delay likely to affect sync rep? I'm just checking we've thought about it.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, 2008-12-04 at 16:10 +0900, Fujii Masao wrote:
> > * Diagram on p.2 has two Archives. We have just one (yes)
>
> No, we need an archive on both the primary and standby. The primary needs an archive because a base backup is required when starting the standby. Meanwhile, the standby needs an archive for cooperating with pg_standby.
>
> If the directory that pg_standby checks were the same as the directory where walreceiver writes the WAL, a halfway-written WAL file might be restored by pg_standby, and continuous recovery would fail. So we have to separate the directories, and I assigned pg_xlog and archive to them.
>
> Another idea: walreceiver writes the WAL to a file with a temporary name, and renames it to the final name when it fills, so pg_standby never restores a halfway-written WAL file. But it's more difficult to perform a failover because the unrenamed WAL file remains.

WAL sending is either via the archiver or via streaming. We must switch cleanly from one mode to the other and not halfway through a WAL file. When WAL sending is about to begin, issue an xlog switch. Then tell the archiver to shut down once it has got to the last file. All files after that point are streamed. So there need be no conflict in filenames.

We must avoid having two archives, because people will configure this incorrectly.

> > * If we have synchronous_commit = off, do we ignore synchronous_replication = on? (yes)
>
> No, we can configure them independently. synchronous_commit covers only local writing of the WAL. If synch_*commit* should cover both local writing and replication, I'd like to add a new GUC which covers only local writing (synchronous_local_write?).

The only sensible settings are:

synchronous_commit = on, synchronous_replication = on
synchronous_commit = on, synchronous_replication = off
synchronous_commit = off, synchronous_replication = off

This doesn't make any sense: (does it??)

synchronous_commit = off, synchronous_replication = on

> > Do we definitely need the archiver to move the files written by walreceiver to the archive and then move them back out again?
>
> Yes, because of cooperating with pg_standby.

It seems very easy to make this happen the way we want. We could make pg_standby look into pg_xlog also, for example.

I was expecting you to have walreceiver and startup share an end-of-WAL address via shared memory, so that startup never tries to read past the end. That way we would be able to begin reading a WAL file *before* it was filled. Waiting until a file fills means we still have to have archive_timeout set to ensure we switch regularly.

We need the existing mechanisms for the start of replication (base backup etc.) but we don't need them after that point.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Thu, 2008-12-04 at 17:57 +0900, Fujii Masao wrote:
> > The key is used only when we use SSL for the replication connection. As far as I examined, secure_write() renegotiates the key if needed. Since walsender calls secure_write() when sending WAL to the standby, the key is renegotiated periodically. So I think that we don't need to worry about the obsolescence of the key.
>
> Understood. Is the periodic renegotiation of keys something that would interfere with the performance or robustness of replication? Is the delay likely to affect sync rep? I'm just checking we've thought about it.

It will certainly add an extra piece of delay. But if you are worried about performance for it, you are likely not running SSL. Plus, if you don't renegotiate the key, you gamble with security.

If it does have a negative effect on the robustness of the replication, we should just recommend against using it - or refuse to use it - not disable renegotiation.

/Magnus
On Thu, 2008-12-04 at 12:41 +0100, Magnus Hagander wrote:
> > Understood. Is the periodic renegotiation of keys something that would interfere with the performance or robustness of replication? Is the delay likely to affect sync rep? I'm just checking we've thought about it.
>
> It will certainly add an extra piece of delay. But if you are worried about performance for it, you are likely not running SSL. Plus, if you don't renegotiate the key, you gamble with security.
>
> If it does have a negative effect on the robustness of the replication, we should just recommend against using it - or refuse to use it - not disable renegotiation.

I didn't mean to imply renegotiation might be optional. I just wanted to check whether there is anything to worry about as a result of it; there may not be. *If* it took a long time, I would not want sync commits to wait for it.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi,

On Thu, Dec 4, 2008 at 6:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The only sensible settings are:
>
> synchronous_commit = on, synchronous_replication = on
> synchronous_commit = on, synchronous_replication = off
> synchronous_commit = off, synchronous_replication = off
>
> This doesn't make any sense: (does it??)
>
> synchronous_commit = off, synchronous_replication = on

If the standby replies before writing the WAL, that strategy can improve performance with moderate reliability, and sounds sensible. IIRC, MySQL Cluster might use that strategy.

> I was expecting you to have walreceiver and startup share an end-of-WAL address via shared memory, so that startup never tries to read past the end. That way we would be able to begin reading a WAL file *before* it was filled. Waiting until a file fills means we still have to have archive_timeout set to ensure we switch regularly.

You mean that the startup process, not pg_standby, waits for the next WAL to be available? If so, I agree with you for the future. That is, I just think that this is a next TODO, because there are many problems which we should resolve carefully to achieve it. But if it's essential for 8.4, I will tackle it. What is your opinion? I'd like to clear up the goal for 8.4.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hello,

On Fri, Dec 5, 2008 at 12:09 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > I was expecting you to have walreceiver and startup share an end-of-WAL address via shared memory, so that startup never tries to read past the end. That way we would be able to begin reading a WAL file *before* it was filled. Waiting until a file fills means we still have to have archive_timeout set to ensure we switch regularly.
>
> You mean that the startup process, not pg_standby, waits for the next WAL to be available? If so, I agree with you for the future. That is, I just think that this is a next TODO, because there are many problems which we should resolve carefully to achieve it. But if it's essential for 8.4, I will tackle it. What is your opinion? I'd like to clear up the goal for 8.4.

Umm.. on second thought, this feature (continuous recovery without pg_standby) seems to be essential for 8.4. So I will try it.

Development plan:
- Share the end-of-WAL address via shared memory <--- Done! (see the sketch below)
- Change ReadRecord() to wait for the next WAL *record* to be available.
- Change ReadRecord() to restore the WAL from the archive by using pg_standby before reaching the replication starting position, then read the half-streamed WAL from pg_xlog.
- Add a new trigger for promoting the standby to the primary. As the trigger, when fast shutdown (SIGINT) is requested during recovery, the standby would recover the WAL up to the end and become the primary.

What system call does walreceiver have to call on the WAL before the startup process reads it? Probably we need to call write(2), and don't need fsync(2) on Linux. How about other platforms?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
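As one possible shape for the first item above, a minimal sketch of sharing the received-WAL end point between walreceiver and the startup process; the struct and function names are invented here and need not match the patch:

    /* Backend-style sketch; assumes the usual PostgreSQL headers. */
    #include "postgres.h"
    #include "access/xlogdefs.h"    /* XLogRecPtr */
    #include "storage/spin.h"       /* slock_t, SpinLockAcquire/Release */

    typedef struct
    {
        slock_t     mutex;          /* protects receivedUpto */
        XLogRecPtr  receivedUpto;   /* end of WAL written (via write(2)) by walreceiver */
    } WalRcvShmemData;

    static WalRcvShmemData *WalRcv; /* set up at shmem initialization */

    /* walreceiver: publish a new end point after writing more WAL */
    static void
    SetReceivedUpto(XLogRecPtr ptr)
    {
        SpinLockAcquire(&WalRcv->mutex);
        WalRcv->receivedUpto = ptr;
        SpinLockRelease(&WalRcv->mutex);
    }

    /* startup process: never replay past the value returned here */
    static XLogRecPtr
    GetReceivedUpto(void)
    {
        XLogRecPtr  ptr;

        SpinLockAcquire(&WalRcv->mutex);
        ptr = WalRcv->receivedUpto;
        SpinLockRelease(&WalRcv->mutex);
        return ptr;
    }

A spinlock suffices because the critical section is a single assignment; ReadRecord() in the startup process would simply poll this value instead of waiting for a whole file to arrive.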
Hi, sorry for my consecutive posting.

On Fri, Dec 5, 2008 at 4:00 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Umm.. on second thought, this feature (continuous recovery without pg_standby) seems to be essential for 8.4. So I will try it.
>
> Development plan:
> - Share the end-of-WAL address via shared memory <--- Done!
> - Change ReadRecord() to wait for the next WAL *record* to be available.
> - Change ReadRecord() to restore the WAL from the archive by using pg_standby before reaching the replication starting position, then read the half-streamed WAL from pg_xlog.
> - Add a new trigger for promoting the standby to the primary. As the trigger, when fast shutdown (SIGINT) is requested during recovery, the standby would recover the WAL up to the end and become the primary.
>
> What system call does walreceiver have to call on the WAL before the startup process reads it? Probably we need to call write(2), and don't need fsync(2) on Linux. How about other platforms?

I added the figures about the latest architecture into the PDF file. Please check p.6 and p.7. Is this architecture close to your image?
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Detailed_Design

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, 2008-12-05 at 16:00 +0900, Fujii Masao wrote:
> > You mean that the startup process, not pg_standby, waits for the next WAL to be available? If so, I agree with you for the future. That is, I just think that this is a next TODO, because there are many problems which we should resolve carefully to achieve it. But if it's essential for 8.4, I will tackle it. What is your opinion? I'd like to clear up the goal for 8.4.
>
> Umm.. on second thought, this feature (continuous recovery without pg_standby) seems to be essential for 8.4. So I will try it.

Sounds good. Perhaps you can share what changed your mind in those 4 hours...

Could we start with pictures and some descriptions first, so we know we're on the right track? I foresee no coding issues.

My understanding is that we start with a normal log shipping architecture, then we switch into continuous recovery mode. So we do use pg_standby at the beginning, but then it gets turned off.

Let's look at all of the corner cases also:
* standby keeps pace with primary (desired state)
* standby falls behind primary
* standby restarts to change shmem settings
etc

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2008-12-05 at 12:09 +0900, Fujii Masao wrote:
> On Thu, Dec 4, 2008 at 6:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > The only sensible settings are:
> >
> > synchronous_commit = on, synchronous_replication = on
> > synchronous_commit = on, synchronous_replication = off
> > synchronous_commit = off, synchronous_replication = off
> >
> > This doesn't make any sense: (does it??)
> >
> > synchronous_commit = off, synchronous_replication = on
>
> If the standby replies before writing the WAL, that strategy can improve performance with moderate reliability, and sounds sensible.

Do you think it likely that your replication time is consistently and noticeably less than your time-to-disk? If not, you'll wait just as long but be less robust. I guess it's possible.

On a related thought: presumably we force a sync rep if forceSyncCommit is set?

> IIRC, MySQL Cluster might use that strategy.

Not the most convincing argument I've heard.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi,

On Fri, Dec 5, 2008 at 7:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > If the standby replies before writing the WAL, that strategy can improve performance with moderate reliability, and sounds sensible.
>
> Do you think it likely that your replication time is consistently and noticeably less than your time-to-disk?

It depends on the system environment:
- How many miles between the two servers? Same rack? Separate continents?
- Does the system have high-end storage? A cheap one?
... etc.

> On a related thought: presumably we force a sync rep if forceSyncCommit is set?

Yes! Please see RecordTransactionCommit() in xact.c (in my patch).

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
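The interaction being confirmed here can be summarised in two predicates; this is a sketch only, with illustrative names (forceSyncCommit itself is real in xact.c and is set by operations that must never be lost, such as CREATE DATABASE):

    #include <stdbool.h>

    /* Does commit wait for the local WAL flush? */
    static bool
    commit_waits_for_flush(bool synchronous_commit, bool force_sync_commit)
    {
        /* forceSyncCommit overrides asynchronous commit... */
        return synchronous_commit || force_sync_commit;
    }

    /* Does commit wait for the standby to acknowledge? */
    static bool
    commit_waits_for_standby(bool synchronous_replication,
                             bool force_sync_commit)
    {
        /* ...and, per the answer above, forces sync replication too */
        return synchronous_replication || force_sync_commit;
    }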
Greetings!

On Fri, Dec 5, 2008 at 6:59 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > Umm.. on second thought, this feature (continuous recovery without pg_standby) seems to be essential for 8.4. So I will try it.
>
> Sounds good. Perhaps you can share what changed your mind in those 4 hours...

Yeah, it was my imagining the real situation after the 8.4 release, especially the future conjugal life of Synch Rep and Hot Standby ;) Waiting to redo until the file fills might lead to marital breakdown.

> Could we start with pictures and some descriptions first, so we know we're on the right track? I foresee no coding issues.
>
> My understanding is that we start with a normal log shipping architecture, then we switch into continuous recovery mode. So we do use pg_standby at the beginning, but then it gets turned off.

Yes, I also understand it so. Updated sequence pictures are on the wiki, as per usual. Please see p.3 and p.4.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Detailed_Design

> Let's look at all of the corner cases also:
> * standby keeps pace with primary (desired state)
> * standby falls behind primary
> * standby restarts to change shmem settings
> etc

Yes, I will examine such cases!

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sat, 2008-12-06 at 17:55 +0900, Fujii Masao wrote:
> Yeah, it was my imagining the real situation after the 8.4 release, especially the future conjugal life of Synch Rep and Hot Standby ;) Waiting to redo until the file fills might lead to marital breakdown.

You're obviously working with some comedians now. ;-)

> > Could we start with pictures and some descriptions first, so we know we're on the right track? I foresee no coding issues.
> >
> > My understanding is that we start with a normal log shipping architecture, then we switch into continuous recovery mode. So we do use pg_standby at the beginning, but then it gets turned off.
>
> Yes, I also understand it so. Updated sequence pictures are on the wiki, as per usual. Please see p.3 and p.4.
> http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#Detailed_Design

p.6 looks good.

But what is p.7? It's even more complex than the original. Forgive me, but I don't understand that. Can you explain?

What is the procedure if the standby shuts down, for example if we wish to restart the server to change a parameter? Or to reboot the system it is on. Does the primary switch back to writing files to the archive?

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi, thanks for the comment!

On Mon, Dec 8, 2008 at 11:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> p.6 looks good.
>
> But what is p.7? It's even more complex than the original. Forgive me, but I don't understand that. Can you explain?

p.7 shows one of the system configuration examples. Some people who don't want to share an archive between the two servers would probably choose this configuration, I think.

If the archive is not shared, some WAL files from before replication starts would not be copied automatically from the primary to the standby. So we have to copy them by hand, or using clusterware, etc. This is what p.7 shows. If the archive is shared, the archiver on the primary would copy them automatically (p.6).

> What is the procedure if the standby shuts down, for example if we wish to restart the server to change a parameter?

Stop postgres using immediate shutdown, and start postgres from the existing database cluster directory. When restarting postgres, if there are one or more archives, we also need to copy the WAL files written after replication stopped, before restarting replication.

> Or to reboot the system it is on. Does the primary switch back to writing files to the archive?

I assume that the primary always writes files to the archive; that is, basically the primary doesn't switch to non-archiving mode. Of course, if archiving is disabled on the primary for any reason when restarting the standby, the primary needs to switch back.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, 2008-12-09 at 17:15 +0900, Fujii Masao wrote:
> > But what is p.7? It's even more complex than the original. Forgive me, but I don't understand that. Can you explain?
>
> p.7 shows one of the system configuration examples. Some people who don't want to share an archive between the two servers would probably choose this configuration, I think.
>
> If the archive is not shared, some WAL files from before replication starts would not be copied automatically from the primary to the standby. So we have to copy them by hand, or using clusterware, etc. This is what p.7 shows. If the archive is shared, the archiver on the primary would copy them automatically (p.6).

I agree that is the way to do it *if* the archive is not shared. But why would you want to *not* share the archive??

> > What is the procedure if the standby shuts down, for example if we wish to restart the server to change a parameter?
>
> Stop postgres using immediate shutdown, and start postgres from the existing database cluster directory. When restarting postgres, if there are one or more archives, we also need to copy the WAL files written after replication stopped, before restarting replication.
>
> > Or to reboot the system it is on. Does the primary switch back to writing files to the archive?
>
> I assume that the primary always writes files to the archive; that is, basically the primary doesn't switch to non-archiving mode.

OK, I think that clears up what I was seeing in the code, i.e. I didn't understand the modes of operation.

I really like most of what you've done, though you must forgive me for saying I still don't like this. I really am with you on how tiresome that sounds.

For clarity: I don't think it's acceptable to have the archiver send files to the archive at the same time as we're streaming data. In normal running we should not duplicate the data paths - it's just too much data volume and/or bandwidth.

The cleanest way I can see is to have two modes of operation:
* First mode is file-based log shipping (FLS) (i.e. "warm standby")
* Second mode is streaming log shipping (SLS) (wal sender to wal receiver)

When we start, we are in FLS mode; then we catch up to the cross-over point and we switch to SLS mode. If streaming stops, we just switch back to FLS mode. If they reconnect, we follow the same procedure again. So the two modes are compatible, but are never simultaneously active except for a short period when we switch modes. If SLS mode is active then the archiver doesn't send files. If FLS mode is active, we send files. All of the places in the code that are currently not optimised when XLogArchivingActive() must remain unoptimised for either FLS or SLS mode, so we need a new name for that.

This makes the least number of changes to the existing architecture. People currently use FLS mode and understand it (!); they just add an understanding of SLS mode. It's also a very straightforward architecture, which means fewer code paths and less weird bugs. (There's been enough already, as you know.)

So just for clarity, let me rephrase it: We set up FLS mode as we do currently. Then we initiate SLS mode. At the end of the next WAL file on the primary we archive it, then turn off archiving on the primary. (So for up to one WAL file we operate the two modes together.) If SLS mode ends, we send the next WAL file via the archiver. Some part of that file has already been streamed across, but that doesn't matter. (If SLS mode ends because the primary is down, we obviously do nothing. If we have a split-brain situation then we rely on clusterware to kill us (STONITH).)
So AFAICS p.6 of the architecture is all we really need. Nice, simple.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
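A compact sketch of the two-mode scheme just described; the enum and the transition notes are illustrative only, not code from the patch:

    typedef enum
    {
        SHIPPING_FLS,   /* file-based log shipping: the archiver ships
                         * completed WAL files ("warm standby") */
        SHIPPING_SLS    /* streaming log shipping: walsender streams WAL
                         * records to walreceiver */
    } ShippingMode;

    /*
     * Transitions, per the description above:
     *
     *   FLS -> SLS: the standby catches up; the primary archives the
     *   current segment (forcing an xlog switch), then stops sending
     *   files -- the two modes overlap for at most one WAL file.
     *
     *   SLS -> FLS: the stream drops; the next completed WAL file goes
     *   out via the archiver again.  Part of that file was already
     *   streamed, but resending it inside the file is harmless.
     */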
Simon Riggs wrote:
> For clarity: I don't think it's acceptable to have the archiver send files to the archive at the same time as we're streaming data. In normal running we should not duplicate the data paths - it's just too much data volume and/or bandwidth.

What if you want to run archiving for backup purposes, and also have a standby server?

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-12-09 at 14:42 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > For clarity: I don't think it's acceptable to have the archiver send files to the archive at the same time as we're streaming data. In normal running we should not duplicate the data paths - it's just too much data volume and/or bandwidth.
>
> What if you want to run archiving for backup purposes, and also have a standby server?

If we want to include that as an option, yes. If it is "always on" then no; not everybody wants that.

The best way to implement that is to archive from the standby, not to send the data twice. By definition the archive is more closely associated with the standby node than the primary.

Maybe I misunderstood the diagrams? The additional flows to the archive are actually all optional?

Anyway, I enclose a slightly simplified version of p.6 to allow us to see the progression of file mode through to streaming mode. This is an in-my-understanding version.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
[Attachment: simplified version of the p.6 architecture diagram]
Hi,

Thanks for explaining the architecture in detail!

> If we want to include that as an option, yes. If it is "always on" then no; not everybody wants that.

Yes. I also think that archiving should be optional on each server.

> The best way to implement that is to archive from the standby, not to send the data twice. By definition the archive is more closely associated with the standby node than the primary.
>
> Maybe I misunderstood the diagrams? The additional flows to the archive are actually all optional?
>
> Anyway, I enclose a slightly simplified version of p.6 to allow us to see the progression of file mode through to streaming mode. This is an in-my-understanding version.

Yes, I basically agree with you! The only difference between us is whether the primary also has to switch between the two modes (FLS <-> SLS). I think that the primary doesn't need to stop archiving forcibly when replication starts; that should be optional for the user. The user who doesn't want to archive can disable archiving using the existing mechanism (change archive_command & pg_ctl reload). It's more complicated to switch the modes on each server.

For clarity: the user can choose a strategy of archiving from the following.

1) both primary and standby archive
2) only the primary archives
3) only the standby archives
4) no server archives

The user who doesn't want to share an archive would choose 1). The user who wants to share an archive and cannot accept any increase of bandwidth would choose 4). On the other hand, the user who can accept it would choose 2) or 3). I prefer 2) to 3), for multiple standbys in the future. And, if 3) is adopted, I wonder if we can get a base backup. Can we get it from the standby during recovery?

> I agree that is the way to do it *if* the archive is not shared. But why would you want to *not* share the archive??

First of all, I'd not like to buy a machine only for an archive, other than the primary and standby. Meanwhile, if an archive is located on either the primary or the standby (which should we locate it on?), post-failure processing is complicated.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, 2008-12-10 at 14:51 +0900, Fujii Masao wrote:
> Yes, I basically agree with you! The only difference between us is whether the primary also has to switch between the two modes (FLS <-> SLS). I think that the primary doesn't need to stop archiving forcibly when replication starts; that should be optional for the user. The user who doesn't want to archive can disable archiving using the existing mechanism (change archive_command & pg_ctl reload). It's more complicated to switch the modes on each server.

Yes, I see that a manual change of parameter is possible. But it is difficult to get the timing of the manual change correct, and yet important not to get that wrong. I don't want to spend the next year answering questions on list about how that works and agreeing that it isn't ideal. We should have an optional mechanism that will turn archiving on the primary off *automatically* when the mode changes. Maybe a third mode on archive_mode to cater for this, but other ways are possible also.

> For clarity: the user can choose a strategy of archiving from the following.
>
> 1) both primary and standby archive
> 2) only the primary archives
> 3) only the standby archives
> 4) no server archives

Those are all possible, but they aren't all equally usable as it stands.

In my experience most people do things very simply, so (4) is the common use case. So it needs to Just Work. We need to cater for a range of use cases, from simple implementations through to complex multi-node cases. I don't think it's right to assume that everybody is implementing a complex use case and so we mostly cater for that.

> The user who doesn't want to share an archive would choose 1).

If we include a feature you need to explain why it's there. Asking the question doesn't mean that I'm opposed, just that I'm checking why you think it's important to have that option. So, why would you want to run with multiple archives?

> The user who wants to share an archive and cannot accept any increase of bandwidth would choose 4). On the other hand, the user who can accept it would choose 2) or 3). I prefer 2) to 3), for multiple standbys in the future. And, if 3) is adopted, I wonder if we can get a base backup. Can we get it from the standby during recovery?

That's an important feature, so we should make it "yes". (Can't understand why you've built this with the archiver active on the standby node if this isn't possible.) People I talk to consider "low impact on primary" to be an important aspect of this feature. Though if you forced me to prioritise, I would say making (4) automatic is more important than (3).

> > I agree that is the way to do it *if* the archive is not shared. But why would you want to *not* share the archive??
>
> First of all, I'd not like to buy a machine only for an archive, other than the primary and standby. Meanwhile, if an archive is located on either the primary or the standby (which should we locate it on?), post-failure processing is complicated.

Are you saying that putting the archive on the primary is an option? What is complicated about having the archive on the standby server?

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:

> On Wed, 2008-12-10 at 14:51 +0900, Fujii Masao wrote:
>> For clarity: the user can choose an archiving strategy from the following.
>>
>> 1) each primary and standby archives
>> 2) only primary archives
>> 3) only standby archives
>> 4) no server archives
>
> Those are all possible, but they aren't all equally usable as it stands.
>
> In my experience most people do things very simply, so (4) is the common > use case. So it needs to Just Work.

Agreed.

All this talk about archiving and streaming working at the same time is very confusing. AFAICS, the patch as submitted doesn't work if archiving is disabled in the primary, which means that strategies (3) and (4) in your list are not possible. The standby relies on the archiving and file-based log shipping to work correctly. The streaming is just an extra thing, shortcutting the normal file-based log shipping path to keep the latest WAL segment up-to-date in the standby.

In the current form, is there any reason why walreceiver needs to be an integrated server process? Couldn't it just be a stand-alone program that connects to the primary and writes the received records to the right WAL file? The only reason I can see is to reliably kill it when the standby server is promoted to primary.

For a solution that doesn't depend on the file-based log shipping, I think we'll need a way for the standby to request a certain starting point for the streaming when it connects. When the standby starts, it would first recover all the log segments it can obtain using recovery_command, and then connect to the primary and request to start streaming from where recovery_command stopped.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2008-12-10 at 14:39 +0200, Heikki Linnakangas wrote:

> For a solution that doesn't depend on the file-based log shipping, I > think we'll need a way for the standby to request a certain starting > point for the streaming when it connects. When the standby starts, it > would first recover all the log segments it can obtain using > recovery_command, and then connect to the primary and request to start > streaming from where recovery_command stopped.

That was already suggested and rejected because it introduces a potentially unacceptable delay in the start of synch replication - for large databases this could be hours. (I should add it was suggested by me and I now accept that it should be rejected.)

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, 2008-12-10 at 14:39 +0200, Heikki Linnakangas wrote:

> In the current form, is there any reason why walreceiver needs to be > an integrated server process? Couldn't it just be a stand-alone > program that connects to the primary and writes the received records > to the right WAL file? The only reason I can see is to reliably kill > it when the standby server is promoted to primary.

Reasons:

* integration: we have one service we stop and start, not two. We want one log, one set of commands, one set of parameters etc.

* cooperation: if walreceiver is a server process we can reasonably communicate the current WAL limit via shared memory. That gives us a smooth flow of WAL between receiver and replay (startup process) rather than a burst of activity each time a file arrives. That helps smooth performance and minimises failover time. Without this we would need to retain the concept of archive_timeout on the primary even when streaming, which is fairly strange.

* code management

Other than that there isn't that much in it...

We've all read the stuff about how other RDBMSs come with integrated replication. We *can* make this integrated, robust and very very easy to use, yet with flexibility for a variety of purposes.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote:

> On Wed, 2008-12-10 at 14:39 +0200, Heikki Linnakangas wrote:
>
>> For a solution that doesn't depend on the file-based log shipping, I >> think we'll need a way for the standby to request a certain starting >> point for the streaming when it connects. When the standby starts, it >> would first recover all the log segments it can obtain using >> recovery_command, and then connect to the primary and request to start >> streaming from where recovery_command stopped.
>
> That was already suggested and rejected because it introduces a > potentially unacceptable delay in the start of synch replication - for > large databases this could be hours. (I should add it was suggested by > me and I now accept that it should be rejected.)

I don't understand that argument. If the standby is missing say 100 log files, it's not up-to-date with the primary until it has somehow obtained and replayed all those log files. It doesn't make any difference whether it obtains them over the wire via walreceiver, or via an archive. Until it has obtained and replayed all those files, it's not up-to-date, and a failover would lead to data loss.

Or did I misunderstand what "start of synch replication" means? Got a pointer to the previous discussion?

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote:

> * cooperation: if walreceiver is a server process we can reasonably > communicate the current WAL limit via shared memory. That gives us a > smooth flow of WAL between receiver and replay (startup process) rather > than a burst of activity each time a file arrives. That helps smooth > performance and minimises failover time. Without this we would need to > retain the concept of archive_timeout on the primary even when > streaming, which is fairly strange.

Does it actually do that? I can see comments suggesting that in walreceiver, but I can't find the place in xlog.c where the startup process does the waiting.

> * code management
>
> Other than that there isn't that much in it...

Ok, just making sure I wasn't missing something crucial. I agree it should be integrated. What I'm actually worried about is that this system isn't integrated enough, and having to set up the archiving, pg_standby, and the synchronous replication itself, correctly, makes it too complex to be practical.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2008-12-10 at 20:52 +0200, Heikki Linnakangas wrote:

> Simon Riggs wrote:
> > On Wed, 2008-12-10 at 14:39 +0200, Heikki Linnakangas wrote:
> >
> >> For a solution that doesn't depend on the file-based log shipping, I >> think we'll need a way for the standby to request a certain starting >> point for the streaming when it connects. When the standby starts, it >> would first recover all the log segments it can obtain using >> recovery_command, and then connect to the primary and request to start >> streaming from where recovery_command stopped.
> >
> > That was already suggested and rejected because it introduces a > > potentially unacceptable delay in the start of synch replication - for > > large databases this could be hours. (I should add it was suggested by > > me and I now accept that it should be rejected.)
>
> I don't understand that argument. If the standby is missing say 100 log > files, it's not up-to-date with the primary until it has somehow > obtained and replayed all those log files. It doesn't make any difference > whether it obtains them over the wire via walreceiver, or via an > archive. Until it has obtained and replayed all those files, it's not > up-to-date, and a failover would lead to data loss.
>
> Or did I misunderstand what "start of synch replication" means? Got a > pointer to the previous discussion?

I think you just went down the same path I did before. (That's a good sign.)

When the WAL starts streaming the *primary* can immediately perform synchronous replication, i.e. commit waits for transfer. The *standby* has an initial lag before it catches up, whatever we do (as you say).

I suggested that way initially because it simplifies the mode change. The mode change isn't really that complex, so I agreed we should change it.

The two ways of doing this are/were:

1. (Initial suggestion)
* allow standby to catch up
* then connect and allow sync rep

2. Preferred Choice
* connect to primary and allow sync rep
* catch up

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, 2008-12-10 at 11:52 -0800, Jeff Davis wrote:

> On Wed, 2008-12-10 at 09:48 +0000, Simon Riggs wrote:
> > What is complicated about having the archive on the standby server?
>
> If the storage on the standby fails, you would lose the archive, right?

As well as the standby itself presumably. Either way you need to restart from a base backup.

> I think there's a use case for having two identical servers, and just > setting them up to replicate synchronously. Many of these use-cases > might not even care much about write performance or the duplication of > maintaining two copies of the archive.

Yes, that's what I've said also.

> They might care a lot about PITR > though, and that would be impossible if you lose the archive.

Agreed, yes we need it as an option.

> Do you see a cost to allowing all of the options listed by Fujii Masao?

I haven't argued in favour of removing any options, so I'm not sure what you mean. I have asked for an explanation of why certain features are needed, so we can judge whether there is a simpler way of providing everything required. It may not exist.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
* Simon Riggs <simon@2ndQuadrant.com> [081210 14:58]:

> I think you just went down the same path I did before. (That's a good > sign.)
>
> When the WAL starts streaming the *primary* can immediately perform > synchronous replication, i.e. commit waits for transfer. The *standby* > has an initial lag before it catches up, whatever we do (as you say).
>
> I suggested that way initially because it simplifies the mode change. > The mode change isn't really that complex, so I agreed we should change > it.
>
> The two ways of doing this are/were:
>
> 1. (Initial suggestion)
> * allow standby to catch up
> * then connect and allow sync rep
>
> 2. Preferred Choice
> * connect to primary and allow sync rep
> * catch up

Call me thick, but I'm confused... In sync rep, there *can't be* any catching up to do... i.e. if the "slave" isn't accepting the WAL the master "stops" doing *anything*...

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Wed, 2008-12-10 at 09:48 +0000, Simon Riggs wrote:

> What is complicated about having the archive on the standby server?

If the storage on the standby fails, you would lose the archive, right?

I think there's a use case for having two identical servers, and just setting them up to replicate synchronously. Many of these use-cases might not even care much about write performance or the duplication of maintaining two copies of the archive. They might care a lot about PITR though, and that would be impossible if you lose the archive.

Do you see a cost to allowing all of the options listed by Fujii Masao?

Regards, Jeff Davis
On Wed, 2008-12-10 at 21:02 +0200, Heikki Linnakangas wrote:

> Simon Riggs wrote:
> > * cooperation: if walreceiver is a server process we can reasonably > > communicate the current WAL limit via shared memory. That gives us a > > smooth flow of WAL between receiver and replay (startup process) rather > > than a burst of activity each time a file arrives. That helps smooth > > performance and minimises failover time. Without this we would need to > > retain the concept of archive_timeout on the primary even when > > streaming, which is fairly strange.
>
> Does it actually do that? I can see comments suggesting that in > walreceiver, but I can't find the place in xlog.c where the startup > process does the waiting.

Not yet... we agreed it would do that a few days ago. This thread, Fri 5 Dec.

> > * code management
> >
> > Other than that there isn't that much in it...
>
> Ok, just making sure I wasn't missing something crucial. I agree it > should be integrated. What I'm actually worried about is that this > system isn't integrated enough, and having to set up the archiving, > pg_standby, and the synchronous replication itself, correctly, makes it > too complex to be practical.

I'm worried about the complexity also. If we didn't use the existing archiving mechanism we'd need to invent something that looks just like it. If I could get rid of pg_standby as well, I would. I've got no qualms about chopping stuff I wrote, as long as we do it for a good reason. Keeping the parts of the old model that make sense means less code and less process change for existing users.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, 2008-12-10 at 20:04 +0000, Simon Riggs wrote:

> > They might care a lot about PITR > > though, and that would be impossible if you lose the archive.
>
> Agreed, yes we need it as an option.
>
> > Do you see a cost to allowing all of the options listed by Fujii Masao?
>
> I haven't argued in favour of removing any options, so I'm not sure what you > mean. I have asked for an explanation of why certain features are needed, > so we can judge whether there is a simpler way of providing everything > required. It may not exist.

I was trying to provide a use-case for maintaining the archive on both primary and standby, i.e. option (1). My understanding was that you were asking for such a use case with this question: "So, why would you want to run with multiple archives?"

Regards, Jeff Davis
Simon Riggs wrote:

> When the WAL starts streaming the *primary* can immediately perform > synchronous replication, i.e. commit waits for transfer.

Until the standby has obtained all the missing log files, it's not up-to-date, and there's no guarantee that it can finish the replay. For example, imagine that your archive_command is an scp from the primary to the standby. If lightning strikes the primary before some WAL file has been copied over to the archive directory in the standby, the standby can't catch up. In the primary then, what's the point of having a commit wait for transfer, if the reply from the standby doesn't guarantee that the transaction is safe in the standby?

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-12-11 at 09:44 +0200, Heikki Linnakangas wrote:

> Simon Riggs wrote:
> > When the WAL starts streaming the *primary* can immediately perform > > synchronous replication, i.e. commit waits for transfer.
>
> Until the standby has obtained all the missing log files, it's not > up-to-date, and there's no guarantee that it can finish the replay. For > example, imagine that your archive_command is an scp from the primary to > the standby. If lightning strikes the primary before some WAL file has > been copied over to the archive directory in the standby, the standby > can't catch up. In the primary then, what's the point of having a commit > wait for transfer, if the reply from the standby doesn't guarantee that > the transaction is safe in the standby?

The WAL files will have already left the primary.

The timeline is this, in my understanding:

1 [Primary] Set up continuous archiving
2 [Primary] Take base backup
3 [Standby] Connect to primary to initiate streaming
4 [Primary] Log switch and, optionally, turn off archiving
5 [Standby] Begin replaying files, initially from archive
6 [Standby] Switch to replaying WAL records immediately after streaming

So sync rep would turn on after step 4, so that all intermediate WAL files have been sent to the archive. If we lose the Primary after this point then all transactions are accessible to the standby. If we lose the Standby or Archive, then we need to replace them and re-run the above.

The above was outlined on the thread "Synchronous Log Shipping Replication" and pretty much all agreed on 18 Sep.

Recent changes I have requested in the architecture are:

* making archiving optional on the primary, so we don't need to send WAL data *twice*
* allowing streaming/startup process to work together via shared memory, to reduce average replication delay and improve performance
* skipping the archiving/de-archiving step on the standby because it's superfluous

(all on this thread)

All of those are fairly minor code changes, but they reduce the complexity of the solution and significantly reduce the amount of copying of WAL files (3 copy actions to/from the archive removed without loss of robustness). I would have made the suggestions earlier but it wasn't until I saw the architecture diagrams that I understood the intention of the code.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
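In configuration terms, steps 1 and 4 of this timeline might look like the minimal sketch below (commands illustrative; the reload trick in step 4 is the existing mechanism Fujii-san mentioned, since archive_mode itself cannot be changed without a restart):

    # step 1, primary postgresql.conf: continuous archiving on
    archive_mode = on
    archive_command = 'cp %p /archive/%f'

    # step 4, primary (optional): stop shipping to the archive once streaming runs
    archive_command = '/bin/true'    # no-op: segments are marked archived and recycled
    # then: pg_ctl reload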
Simon Riggs wrote:

> On Thu, 2008-12-11 at 09:44 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> When the WAL starts streaming the *primary* can immediately perform >>> synchronous replication, i.e. commit waits for transfer.
>> Until the standby has obtained all the missing log files, it's not >> up-to-date, and there's no guarantee that it can finish the replay. For >> example, imagine that your archive_command is an scp from the primary to >> the standby. If lightning strikes the primary before some WAL file has >> been copied over to the archive directory in the standby, the standby >> can't catch up. In the primary then, what's the point of having a commit >> wait for transfer, if the reply from the standby doesn't guarantee that >> the transaction is safe in the standby?
>
> The WAL files will have already left the primary.
>
> The timeline is this, in my understanding:
> 1 [Primary] Set up continuous archiving
> 2 [Primary] Take base backup
> 3 [Standby] Connect to primary to initiate streaming
> 4 [Primary] Log switch and, optionally, turn off archiving
> 5 [Standby] Begin replaying files, initially from archive
> 6 [Standby] Switch to replaying WAL records immediately after streaming
>
> So sync rep would turn on after step 4, so that all intermediate WAL > files have been sent to the archive. If we lose the Primary after this > point then all transactions are accessible to the standby. If we lose the > Standby or Archive, then we need to replace them and re-run the above.

Between steps 4 and 5, there's no guarantee that all WAL files generated after step 3 and the start of streaming have already been archived. There's a delay between writing a WAL file and when the file has been safely archived. If you lose the primary during that window, the standby will have old WAL files in the archive, and the most recent ones received by walreceiver, but it's missing the WAL files generated just before the switch to streaming mode.

> Recent changes I have requested in the architecture are:
> * making archiving optional on the primary, so we don't need to send WAL > data *twice*

Agreed. I'm not so much worried about the bandwidth, but it's a lot of extra work from an administration point of view. It's very hard to get it right, so that you eliminate windows like the above.

As the patch stands, if you turn off archiving in the primary, and the standby ever disconnects, even for only a few seconds, the standby will miss any WAL generated until it reconnects, and without archiving there's no way for the standby to get hold of the missed WAL.

> * allowing streaming/startup process to work together via shared memory, > to reduce average replication delay and improve performance
> * skipping the archiving/de-archiving step on the standby because it's superfluous
>
> (all on this thread)
>
> All of those are fairly minor code changes, but they reduce the complexity of > the solution and significantly reduce the amount of copying of WAL files (3 > copy actions to/from the archive removed without loss of robustness). I > would have made the suggestions earlier but it wasn't until I saw the > architecture diagrams that I understood the intention of the code.

To make archiving optional in the primary, I don't see any other choice than adding the capability for the standby to request arbitrary WAL files from the primary, over the wire. That seems like a pretty significant change to walsender: it needs to be able to read WAL not only from wal_buffers, but from files.
That would be a good idea for performance reasons, too: currently if there's a network glitch and the primary doesn't get acknowledgements from the standby for a short while, XLogInserts in the primary will block waiting for the standby after wal_buffers fills up. That's not a big deal for synchronous replication, but in asynchronous mode you don't want network glitches like that to stall the primary. And of course it means changes in the startup code as well. And we'll need bookkeeping in the primary of what WAL the standby has already received, so that it doesn't recycle the WAL segments until they've been sent to the standby. Or alternatively, the primary needs to be able to retrieve segments from the archive, but then we're dependent on archiving again. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-12-11 at 11:29 +0200, Heikki Linnakangas wrote:

> Simon Riggs wrote:
> > On Thu, 2008-12-11 at 09:44 +0200, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> >>> When the WAL starts streaming the *primary* can immediately perform >>> synchronous replication, i.e. commit waits for transfer.
> >> Until the standby has obtained all the missing log files, it's not >> up-to-date, and there's no guarantee that it can finish the replay. For >> example, imagine that your archive_command is an scp from the primary to >> the standby. If lightning strikes the primary before some WAL file has >> been copied over to the archive directory in the standby, the standby >> can't catch up. In the primary then, what's the point of having a commit >> wait for transfer, if the reply from the standby doesn't guarantee that >> the transaction is safe in the standby?
> >
> > The WAL files will have already left the primary.
> >
> > The timeline is this, in my understanding:
> > 1 [Primary] Set up continuous archiving
> > 2 [Primary] Take base backup
> > 3 [Standby] Connect to primary to initiate streaming
> > 4 [Primary] Log switch and, optionally, turn off archiving
> > 5 [Standby] Begin replaying files, initially from archive
> > 6 [Standby] Switch to replaying WAL records immediately after streaming
> >
> > So sync rep would turn on after step 4, so that all intermediate WAL > > files have been sent to the archive. If we lose the Primary after this > > point then all transactions are accessible to the standby. If we lose the > > Standby or Archive, then we need to replace them and re-run the above.
>
> Between steps 4 and 5, there's no guarantee that all WAL files generated > after step 3 and the start of streaming have already been archived. > There's a delay between writing a WAL file and when the file has been > safely archived. If you lose the primary during that window, the standby > will have old WAL files in the archive, and the most recent ones received > by walreceiver, but it's missing the WAL files generated just before the > switch to streaming mode.

I was presuming that the synchronisation was clear, but I'm sorry it wasn't. Sync rep would begin only *after* the last WAL file was archived.

> > Recent changes I have requested in the architecture are:
> > * making archiving optional on the primary, so we don't need to send WAL > > data *twice*
>
> Agreed. I'm not so much worried about the bandwidth, but it's a lot of > extra work from an administration point of view. It's very hard to get it > right, so that you eliminate windows like the above.
>
> As the patch stands, if you turn off archiving in the primary, and the > standby ever disconnects, even for only a few seconds, the standby will > miss any WAL generated until it reconnects, and without archiving > there's no way for the standby to get hold of the missed WAL.

I described earlier that archiving would turn back on again if the replication ever failed (with correct synchronisation).

All I've asked for is the ability to turn archiving off and back on again, yes, with synchronisation so it's safe.

Personally, I think people will laugh if we tell them we decided to ship all the data twice and couldn't see another way. That's the kind of thing people give presentations at PGCon about...
> > * allowing streaming/startup process to work together via shared memory, > > to reduce average replication delay and improve performance
> > * skipping the archiving/de-archiving step on the standby because it's superfluous
> >
> > (all on this thread)
> >
> > All of those are fairly minor code changes, but they reduce the complexity of > > the solution and significantly reduce the amount of copying of WAL files (3 > > copy actions to/from the archive removed without loss of robustness). I > > would have made the suggestions earlier but it wasn't until I saw the > > architecture diagrams that I understood the intention of the code.
>
> To make archiving optional in the primary, I don't see any other choice > than adding the capability for the standby to request arbitrary WAL > files from the primary, over the wire.

I don't think that's the only or even a desirable way. We cannot allow a build-up of WAL files to occur on the primary. Making archiving optional isn't the big deal you're saying it is.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi,

On Thu, Dec 11, 2008 at 7:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

>> > Recent changes I have requested in the architecture are:
>> > * making archiving optional on the primary, so we don't need to send WAL >> > data *twice*
>>
>> Agreed. I'm not so much worried about the bandwidth, but it's a lot of >> extra work from an administration point of view. It's very hard to get it >> right, so that you eliminate windows like the above.
>>
>> As the patch stands, if you turn off archiving in the primary, and the >> standby ever disconnects, even for only a few seconds, the standby will >> miss any WAL generated until it reconnects, and without archiving >> there's no way for the standby to get hold of the missed WAL.
>
> I described earlier that archiving would turn back on again if the > replication ever failed (with correct synchronisation).
>
> All I've asked for is the ability to turn archiving off and back on again, > yes, with synchronisation so it's safe.
>
> Personally, I think people will laugh if we tell them we decided to ship > all the data twice and couldn't see another way. That's the kind of > thing people give presentations at PGCon about...

OK, I will add such an archiving feature. My new design of archiving is as follows.

Primary
----------

I extend archive_mode as follows, so the user can choose the archiving strategy on the primary.

- always
The primary always archives the WAL. This is compatible with the current (<=8.3) archive_mode = on.

- none
The primary never archives the WAL. This is compatible with the current archive_mode = off.

- standalone
The primary doesn't archive the WAL only while replication is in progress; if replication is not in progress, the primary archives the WAL. That is, the primary switches modes whenever replication starts / ends.

[FLS->SLS]
When replication starts, the primary disables archiving *after* the switched WAL file is archived. WAL streaming doesn't need to wait for archiving to be disabled, so processing on the primary isn't blocked by the start of replication. But both WAL streaming and archiving would be in progress for a while (until the switched WAL file is archived) after replication starts.

[SLS->FLS]
When replication ends, the primary restarts archiving immediately. This also doesn't block processing on the primary. But it might cause the loss of some files from the archive if archiving is slow on the standby. Should the primary look for the last file archived (by the standby) in the archive and restart archiving from the subsequent file? Of course, the primary cannot archive a file that has already been removed on the primary.

Standby
-----------

I would add a new option for archiving during recovery into recovery.conf (recovery_archive_mode). Though this option is similar to archive_mode, merging them would confuse the user more, I think. Or should I merge them? And do you want to be able to configure the archive command only for recovery? If so, I would add a new option to specify the archive command during recovery (recovery_archive_command).

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
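To make the proposal concrete, the settings described above might look like this sketch; note that the "standalone" value and the two recovery_* options are only proposed here and exist in no released server:

    # primary postgresql.conf (proposed)
    archive_mode = standalone                # archive only while replication is not running
    archive_command = 'cp %p /archive/%f'    # illustrative

    # standby recovery.conf (proposed)
    recovery_archive_mode = on
    recovery_archive_command = 'cp %p /archive/%f'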
Hi,

On Thu, Dec 11, 2008 at 7:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

> On Thu, 2008-12-11 at 11:29 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>> > On Thu, 2008-12-11 at 09:44 +0200, Heikki Linnakangas wrote:
>> >> Simon Riggs wrote:
>> >>> When the WAL starts streaming the *primary* can immediately perform >>> synchronous replication, i.e. commit waits for transfer.
>> >> Until the standby has obtained all the missing log files, it's not >> up-to-date, and there's no guarantee that it can finish the replay. For >> example, imagine that your archive_command is an scp from the primary to >> the standby. If lightning strikes the primary before some WAL file has >> been copied over to the archive directory in the standby, the standby >> can't catch up. In the primary then, what's the point of having a commit >> wait for transfer, if the reply from the standby doesn't guarantee that >> the transaction is safe in the standby?
>> >
>> > The WAL files will have already left the primary.
>> >
>> > The timeline is this, in my understanding:
>> > 1 [Primary] Set up continuous archiving
>> > 2 [Primary] Take base backup
>> > 3 [Standby] Connect to primary to initiate streaming
>> > 4 [Primary] Log switch and, optionally, turn off archiving
>> > 5 [Standby] Begin replaying files, initially from archive
>> > 6 [Standby] Switch to replaying WAL records immediately after streaming
>> >
>> > So sync rep would turn on after step 4, so that all intermediate WAL >> > files have been sent to the archive. If we lose the Primary after this >> > point then all transactions are accessible to the standby. If we lose the >> > Standby or Archive, then we need to replace them and re-run the above.
>>
>> Between steps 4 and 5, there's no guarantee that all WAL files generated >> after step 3 and the start of streaming have already been archived. >> There's a delay between writing a WAL file and when the file has been >> safely archived. If you lose the primary during that window, the standby >> will have old WAL files in the archive, and the most recent ones received >> by walreceiver, but it's missing the WAL files generated just before the >> switch to streaming mode.

Yes, since such a standby is unsafe, the user must not promote it to the primary. Then, the user has to stop the standby (without completing recovery), restart the primary, and restart the standby.

> I was presuming that the synchronisation was clear, but I'm sorry it > wasn't. Sync rep would begin only *after* the last WAL file was > archived.

Agreed. In order for the user to confirm whether replication began or not, we might need to log the name of the switched WAL file.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, 2008-12-10 at 15:06 -0500, Aidan Van Dyk wrote:

> Call me thick, but I'm confused... In sync rep, there *can't be* any > catching up to do... i.e. if the "slave" isn't accepting the WAL the > master "stops" doing *anything*...

In normal/steady state, yes, you are correct. But there is more...

The simplest way to configure a standby would be to freeze the primary while we set up the standby and then go straight into normal/steady state. That could mean hours of downtime for large databases, which is unacceptable in a feature aimed at increasing availability. So we need to allow the primary to continue working while the standby is set up. That then creates a log gap between the LSN of the primary and the LSN of the standby, which must be resolved.

So the catch-up occurs during the transient initial phase when the standby is catching up with the primary, before they continue together in normal/steady state.

Most of the architectural discussion over the last few months has been about the need for the initial state and how to handle it. Most of the code complexity also.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Thu, 2008-12-11 at 19:19 +0900, Fujii Masao wrote:

> > All I've asked for is the ability to turn archiving off and back on again, > > yes, with synchronisation so it's safe.
> (snip)
> OK, I will add such an archiving feature. My new design of archiving is as follows.
>
> Primary
> ----------
> I extend archive_mode as follows, so the user can choose the > archiving strategy on the primary.
>
> - always
> The primary always archives the WAL. This is compatible with the current (<=8.3) > archive_mode = on.
>
> - none
> The primary never archives the WAL. This is compatible with the current > archive_mode = off.
>
> - standalone
> The primary doesn't archive the WAL only while replication is in progress; if > replication is not in progress, the primary archives the WAL. That is, the > primary switches modes whenever replication starts / ends.
>
> [FLS->SLS]
> When replication starts, the primary disables archiving *after* the switched > WAL file is archived. WAL streaming doesn't need to wait for archiving to be > disabled, so processing on the primary isn't blocked by the start of > replication. But both WAL streaming and archiving would be in progress > for a while (until the switched WAL file is archived) after > replication starts.

I'm OK with that, but that is slightly different from what Heikki had said in relation to the point at which sync rep begins on the primary, so he may have a different view.

synchronous_replication means "if a standby server has connected to us we will wait for all WAL associated with a transaction to be transferred prior to commit". So there is never a 100% guarantee that the transaction is safe, just an "if possible, 100%". So this implements the equivalent of DRBD Protocol A and B. Do we have an option to allow the WALreceiver to fsync the WAL file after a commit is received, which would make it equivalent to Protocol C? If we don't, I'm OK with that, since it reduces performance so much it isn't a practical option in many cases. http://www.drbd.org/users-guide/s-replication-protocols.html

> [SLS->FLS]
> When replication ends, the primary restarts archiving immediately. This > also doesn't block processing on the primary. But it might cause the > loss of some files from the archive if archiving is slow on the standby. > Should the primary look for the last file archived (by the standby) in > the archive and restart archiving from the subsequent file? Of course, > the primary cannot archive a file that has already been removed on the primary.

The standby will always have kept enough files to allow it to restart from the last restartpoint, so a gap in the file sequence is unlikely. As long as we archive the WAL file that contains the last LSN we transferred before streaming failed. That conceivably might mean we need to write a .ready message after a WAL file is filled, which might mean we have problems if the replication timeout is longer than the checkpoint timeout, but that seems an unlikely configuration. And if anybody has a problem with that we just recommend they use the "always" mode.

> Standby
> -----------
> I would add a new option for archiving during recovery into recovery.conf > (recovery_archive_mode). Though this option is similar to archive_mode, > merging them would confuse the user more, I think. Or should I merge them? > And do you want to be able to configure the archive command only for recovery? > If so, I would add a new option to specify the archive command during > recovery (recovery_archive_command).
I think if you really want two archives, or archiving during recovery, then this is desirable to avoid confusion. Explaining all this in the docs will be fun. :-)

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
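For reference, the DRBD analogy above maps onto the settings roughly as follows (a sketch; every name here other than synchronous_replication is hypothetical):

    synchronous_replication = off    # fully asynchronous; loosely DRBD protocol A
    synchronous_replication = on     # commit waits for transfer to the standby: protocol B
    # protocol C would additionally need the walreceiver to fsync before
    # acknowledging, e.g. a hypothetical walreceiver_fsync = on, which as
    # noted above may cost too much performance to be practical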
* Simon Riggs <simon@2ndQuadrant.com> [081211 05:45]:

> On Wed, 2008-12-10 at 15:06 -0500, Aidan Van Dyk wrote:
>
> > Call me thick, but I'm confused... In sync rep, there *can't be* any > > catching up to do... i.e. if the "slave" isn't accepting the WAL the > > master "stops" doing *anything*...
>
> In normal/steady state, yes, you are correct. But there is more...
>
> The simplest way to configure a standby would be to freeze the primary > while we set up the standby and then go straight into normal/steady > state. That could mean hours of downtime for large databases, which is > unacceptable in a feature aimed at increasing availability. So we need > to allow the primary to continue working while the standby is set up. > That then creates a log gap between the LSN of the primary and the LSN > of the standby, which must be resolved.
>
> So the catch-up occurs during the transient initial phase when the standby is > catching up with the primary, before they continue together in normal/steady > state.

But "catchup" *has* to be *done* before PostgreSQL can enter "sync rep". So, if I start PostgreSQL in sync rep mode, without any capable clients to rep with....

But I'd rather be buggered there than find out tonight at 3am that it was in sync rep mode but wasn't really doing sync rep, because I'd messed up something somewhere (firewall, config, password, anything) and there was no "caught up" client at the time, and I've just lost a day's worth of my $$$$$ transactions...

> Most of the architectural discussion over the last few months has been about > the need for the initial state and how to handle it. Most of the code > complexity also.

Well, for me, I'm quite happy with a "restart/stop&start" being a necessary "downtime" to move to synchronous replication. This way, I could see a "setup" routine that looks like:

1) Current "production" DB does normal backups/PITR/WAL archiving
2) I set up a new "slave", which involves:
   - restore from backup + WAL recovery (pg_standby type)
   - could take days+++ - oh well....
3) Stop production
4) So, now the slave is caught up...
5) Start "production" now in sync rep mode as master
6) Start the slave in sync-rep mode as slave...

So downtime would be limited to the time from the old postmaster shutdown to the time the slave has replayed the last WAL and connected to the restarted postmaster as a sync rep slave...

Or am I way too naive to think that a small downtime to "switch" from non-sync-rep to sync-rep is acceptable...

a.

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
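Step 2's catch-up phase in this routine is the standard warm-standby recipe; a minimal sketch of the slave's recovery.conf during that phase (the archive path is illustrative):

    # slave recovery.conf while catching up from the WAL archive
    restore_command = 'pg_standby /path/to/archive %f %p'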
* Fujii Masao <masao.fujii@gmail.com> [081211 05:25]:

> - standalone
> The primary doesn't archive the WAL only while replication is in progress; if > replication is not in progress, the primary archives the WAL. That is, the > primary switches modes whenever replication starts / ends.

That scares the heebie-jeebies out of me... I'm doing sync-rep because I *really* *want* *my* *data* .... *always* ... I want sync-rep because I'm going to get even *stronger* guarantees on my data (and, if hot-standby works out, load balancing too, but that's not *my* primary desire for sync-rep)...

But I'm sure as hell *not* going to throw all my eggs into that slave's basket and do away with my WAL archive... Would anyone actually use that "standalone" mode, and if not, why complicate the code for it?

a.

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Thu, 2008-12-11 at 09:27 -0500, Aidan Van Dyk wrote:

> But "catchup" *has* to be *done* before PostgreSQL can enter "sync rep".

Not true. Please reread the thread where Heikki questions that and I reply. This was Fujii-san's idea, which I now agree with.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Thu, 2008-12-11 at 09:37 -0500, Aidan Van Dyk wrote:

> * Fujii Masao <masao.fujii@gmail.com> [081211 05:25]:
>
> > - standalone
> > The primary doesn't archive the WAL only while replication is in progress; if > > replication is not in progress, the primary archives the WAL. That is, the > > primary switches modes whenever replication starts / ends.
>
> But I'm sure as hell *not* going to throw all my eggs into that slave's > basket and do away with my WAL archive... Would anyone actually use > that "standalone" mode, and if not, why complicate the code for it?

Sending data twice is not a requirement I ever heard expressed, nor has the lack of ability to send it twice been voiced as a criticism for any form of replication I'm familiar with. Ask the DRBD guys if sending data twice is necessary or required to make replication work.

If multiple people think it's a good idea then I respect your choice of option.

But I also think that many or perhaps most people will choose not to send data twice, and I respect that choice of option also.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
* Simon Riggs <simon@2ndQuadrant.com> [081211 10:03]:

> Sending data twice is not a requirement I ever heard expressed, nor has > the lack of ability to send it twice been voiced as a criticism for any > form of replication I'm familiar with. Ask the DRBD guys if sending data > twice is necessary or required to make replication work.
>
> If multiple people think it's a good idea then I respect your choice of > option.
>
> But I also think that many or perhaps most people will choose not to > send data twice, and I respect that choice of option also.

Well, PostgreSQL has WAL, so we've already accepted the notion of "send data twice" being useful sometimes... But I would note that the "archive" and "streaming" are both sending the data to *different* places... or at least, in my case they would be...

And, also, I know WAL archiving isn't necessary for replication to work, but it's necessary for me to sleep comfortably at night ;-)

I'm just surprised that people are willing to throw away their backup/PITR archiving once they have a single "live slave" up.

a.

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [081211 10:09]:

> Simon Riggs wrote:
>> On Thu, 2008-12-11 at 09:27 -0500, Aidan Van Dyk wrote:
>>
>>> But "catchup" *has* to be *done* before PostgreSQL can enter "sync rep".
>>
>> Not true. Please reread the thread where Heikki questions that and I >> reply. This was Fujii-san's idea, which I now agree with.
>
> I think the confusion here is about what exactly "sync rep" means in > this situation. It's true that you can start streaming the WAL before > the standby has fully caught up. But from the client's point of view, > there's not much point in streaming the log *synchronously* and making > the client wait for the acknowledgment from the standby, if the > acknowledgment from the standby that WAL has been streamed up to point X > doesn't actually guarantee that the slave can recover all the way to > that point.

Quite possibly a terminology problem.. In my case I said "sync rep" meaning the mode such that the transaction doesn't commit successfully for my PG client until the xlog record has been "streamed" to the slave... and I understand from Fujii-san's presentation at PGCon that there could be possible variants on when the "streamed" is considered done, based on network, slave RAM, disk, application, etc.

a.

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Thu, 2008-12-11 at 17:07 +0200, Heikki Linnakangas wrote:

> Simon Riggs wrote:
> > On Thu, 2008-12-11 at 09:27 -0500, Aidan Van Dyk wrote:
> >
> >> But "catchup" *has* to be *done* before PostgreSQL can enter "sync rep".
> >
> > Not true. Please reread the thread where Heikki questions that and I > > reply. This was Fujii-san's idea, which I now agree with.
>
> I think the confusion here is about what exactly "sync rep" means in > this situation. It's true that you can start streaming the WAL before > the standby has fully caught up.

Yep.

> But from the client's point of view, > there's not much point in streaming the log *synchronously* and making > the client wait for the acknowledgment from the standby, if the > acknowledgment from the standby that WAL has been streamed up to point X > doesn't actually guarantee that the slave can recover all the way to > that point.

I disagree. This morning I showed it was possible, given the synchronisation I outlined. There is a slight relaxation of that in the current proposal, so you need to take that up if you see any problem there.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote:

> On Thu, 2008-12-11 at 09:27 -0500, Aidan Van Dyk wrote:
>
>> But "catchup" *has* to be *done* before PostgreSQL can enter "sync rep".
>
> Not true. Please reread the thread where Heikki questions that and I > reply. This was Fujii-san's idea, which I now agree with.

I think the confusion here is about what exactly "sync rep" means in this situation. It's true that you can start streaming the WAL before the standby has fully caught up. But from the client's point of view, there's not much point in streaming the log *synchronously* and making the client wait for the acknowledgment from the standby, if the acknowledgment from the standby that WAL has been streamed up to point X doesn't actually guarantee that the slave can recover all the way to that point.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-12-11 at 19:19 +0900, Fujii Masao wrote:

> My new design of archiving is as follows.

So far I haven't asked about running multiple standby servers, and I don't recall seeing it mentioned anywhere. Forgive me if it was.

The idea is that we would be able to have multiple standby servers connecting to one primary, yes? It would be useful to have sync replication work such that it must get an acknowledgement from at least one standby before it continues.

Or do you think we would stream to just one standby, then use the archiver (primary or standby) to keep sending files to allow multiple additional standby nodes?

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi,

On Fri, Dec 12, 2008 at 12:15 AM, Aidan Van Dyk <aidan@highrise.ca> wrote:

> * Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [081211 10:09]:
>> Simon Riggs wrote:
>>> On Thu, 2008-12-11 at 09:27 -0500, Aidan Van Dyk wrote:
>>>
>>>> But "catchup" *has* to be *done* before PostgreSQL can enter "sync rep".
>>>
>>> Not true. Please reread the thread where Heikki questions that and I >>> reply. This was Fujii-san's idea, which I now agree with.
>>
>> I think the confusion here is about what exactly "sync rep" means in >> this situation. It's true that you can start streaming the WAL before >> the standby has fully caught up. But from the client's point of view, >> there's not much point in streaming the log *synchronously* and making >> the client wait for the acknowledgment from the standby, if the >> acknowledgment from the standby that WAL has been streamed up to point X >> doesn't actually guarantee that the slave can recover all the way to >> that point.
>
> Quite possibly a terminology problem.. In my case I said "sync rep" > meaning the mode such that the transaction doesn't commit successfully > for my PG client until the xlog record has been "streamed" to the > slave... and I understand from Fujii-san's presentation at PGCon that > there could be possible variants on when the "streamed" is considered > done, based on network, slave RAM, disk, application, etc.

I'd like to define the meaning of "synch rep" again. "synch rep" means:

(1) Transaction commit waits for WAL records to be replicated to the standby before the command returns a "success" indication to the client.
(2) The standby has (can read) all WAL files indispensable for recovery.

If both are true, your system is in "synch rep"; you can perform failover safely, without any transaction loss, whenever the primary falls down. On the other hand, if either is false, your system is not in "synch rep" but "standalone"; the failure of the primary might cause a certain transaction loss.

Starting the standby doesn't mean "synch rep" directly. We have to wait for (1) *and* (2) after starting the standby. (1) is reported as a server log message, so we can wait for (1). (2) is somewhat complicated: if an archive is shared, the server log message for archiving indicates (2); otherwise, the copy operation (copying the indispensable WAL files from the primary to the standby) by the user or clusterware indicates (2). But, as Simon pointed out, since many people share an archive, they should monitor only the server log messages. Or, should I create a feature for the user to confirm whether it's in "synch rep" via SQL?

Since there is a little delay between (1) and (2), we could do WAL streaming asynchronously during only that delay, as Heikki pointed out. But I'm not sure it's worth trying.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Hi,

> The idea is that we would be able to have multiple standby servers > connecting to one primary, yes? It would be useful to have sync > replication work such that it must get an acknowledgement from at least one > standby before it continues.

No, in my current patch, only one standby can perform WAL streaming. Of course, yes in the future (8.5?).

> Or do you think we would stream to just one standby, then use the > archiver (primary or standby) to keep sending files to allow multiple > additional standby nodes?

Interesting! And yes, we can.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
* Fujii Masao <masao.fujii@gmail.com> [081211 23:00]:

> Hi,
> Or, should I > create a feature for the user to confirm whether it's in "synch rep" via SQL?

I don't need a way to check via SQL, but I'd love a postgresql.conf option that, when set, would make sure that all connections pretty much just hang until a slave has connected and everything is set up for "sync rep". I think I saw that you're using "normal" connection setup to start the WAL streaming to the slave, so you have to allow connections, but I'd really not want any of my pg-clients able to do anything if sync-rep isn't happening...

a.

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Hi,

On Fri, Dec 12, 2008 at 1:34 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:

> * Fujii Masao <masao.fujii@gmail.com> [081211 23:00]:
>> Hi,
>
>> Or, should I >> create a feature for the user to confirm whether it's in "synch rep" via SQL?
>
> I don't need a way to check via SQL, but I'd love a postgresql.conf > option that, when set, would make sure that all connections pretty much > just hang until a slave has connected and everything is set up for "sync > rep". I think I saw that you're using "normal" connection setup to start > the WAL streaming to the slave, so you have to allow connections, but > I'd really not want any of my pg-clients able to do anything if > sync-rep isn't happening...

How about stopping the requests / connections from clients in front of postgres (e.g. in connection pooling software)? Or, we should first develop a feature like Oracle's OFFLINE, apart from Synch Rep.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, 2008-12-12 at 12:53 +0900, Fujii Masao wrote:

> > Quite possibly a terminology problem.. In my case I said "sync rep" > > meaning the mode such that the transaction doesn't commit successfully > > for my PG client until the xlog record has been "streamed" to the > > slave... and I understand from Fujii-san's presentation at PGCon that > > there could be possible variants on when the "streamed" is considered > > done, based on network, slave RAM, disk, application, etc.
>
> I'd like to define the meaning of "synch rep" again. "synch rep" means:
>
> (1) Transaction commit waits for WAL records to be replicated to the standby > before the command returns a "success" indication to the client.
> (2) The standby has (can read) all WAL files indispensable for recovery.

I would change "can read" in (2) to "has access to". "Can read" implies we have read all files and checked the CRCs of individual records.

The crux of this is what we mean by "synchronous_replication = on". There are two possible meanings:

1. Commit will wait only if streaming is available and has waited for all necessary startup conditions. This provides "Highest Availability".

2. Commit will wait *until* full sync rep is available. So we don't allow commits before the standby is available, and we also don't allow them if the standby goes down. This provides "Highest Transaction Durability", though it is fairly fragile. Other systems recommend use of multiple standby nodes if this option is selected.

Perhaps we should add this as a third option to synchronous_replication, so we have off, on, or only.

So far I realise I've been talking exclusively about (1). In that mode synchronous_replication = on would wait for streaming to complete even if the last WAL file has not been fully transferred.

For (2) we need a full interlock. Given that we don't currently support multiple streamed standby servers, there seems not much point in implementing the interlock (2) would require. Should we leave that part for 8.5, or do it now?

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
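As a sketch of the three-valued setting proposed above (the "only" value is the new proposal here; nothing beyond off/on exists in the patch as described):

    synchronous_replication = off     # never wait for the standby
    synchronous_replication = on      # wait if streaming is available: "Highest Availability"
    synchronous_replication = only    # refuse commits unless full sync rep is in effect:
                                      # "Highest Transaction Durability"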
* Simon Riggs <simon@2ndQuadrant.com> [081212 08:20]:

> 2. Commit will wait *until* full sync rep is available. So we don't > allow commits before the standby is available, and we also don't allow > them if the standby goes down.
> This provides "Highest Transaction Durability", though it is fairly > fragile. Other systems recommend use of multiple standby nodes if this > option is selected.

yes please!

> Perhaps we should add this as a third option to synchronous_replication, > so we have off, on, or only.
>
> So far I realise I've been talking exclusively about (1). In that mode > synchronous_replication = on would wait for streaming to complete even > if the last WAL file has not been fully transferred.

Seems reasonable...

> For (2) we need a full interlock. Given that we don't currently support > multiple streamed standby servers, there seems not much point in > implementing the interlock (2) would require. Should we leave that part > for 8.5, or do it now?

Ugh... If all sync-rep is going to give is "if it's working, the commit made it to the slaves, but it might not be working [anymore|yet], but you (the app using pg) have no way of knowing...", that sort of defeats the point ;-)

I'd love multiple slaves, but I understand that's not in the current work, and I understand that it might be hard with the accept & become walsender approach. It should be very easy to make a walsender handle "multiple" slaves, with voting on quorum/etc. as "successfully on slave", except that we need to get the multiple connections to the walsender backend...

a.

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Fri, 2008-12-12 at 12:53 +0900, Fujii Masao wrote:

> Or, should I create a feature for the user to confirm whether it's in > "synch rep" via SQL?

I think this would be useful.

Regards, Jeff Davis
On Fri, 2008-12-12 at 08:57 -0500, Aidan Van Dyk wrote:

> > For (2) we need a full interlock. Given that we don't currently support > > multiple streamed standby servers, there seems not much point in > > implementing the interlock (2) would require. Should we leave that part > > for 8.5, or do it now?
>
> Ugh... If all sync-rep is going to give is "if it's working, the commit > made it to the slaves, but it might not be working [anymore|yet], but you > (the app using pg) have no way of knowing...", that sort of defeats the > point ;-)

http://archives.postgresql.org/pgsql-hackers/2008-12/msg00865.php

Fujii Masao offers to provide a SQL function that will tell you definitively whether you are in full sync rep or some degraded mode. I assume that there will also be server log messages to identify whether you ever left sync rep mode.

Regards, Jeff Davis
* Jeff Davis <pgsql@j-davis.com> [081212 13:41]:

> On Fri, 2008-12-12 at 08:57 -0500, Aidan Van Dyk wrote:
> > > For (2) we need a full interlock. Given that we don't currently support > > > multiple streamed standby servers, there seems not much point in > > > implementing the interlock (2) would require. Should we leave that part > > > for 8.5, or do it now?
> >
> > Ugh... If all sync-rep is going to give is "if it's working, the commit > > made it to the slaves, but it might not be working [anymore|yet], but you > > (the app using pg) have no way of knowing...", that sort of defeats the > > point ;-)
>
> http://archives.postgresql.org/pgsql-hackers/2008-12/msg00865.php
>
> Fujii Masao offers to provide a SQL function that will tell you > definitively whether you are in full sync rep or some degraded mode. I > assume that there will also be server log messages to identify whether > you ever left sync rep mode.

So when would I have to call that function? Before begin, after begin, before commit, or all of them, to guarantee that I know that my application is supposed to "delay" calling commit until sync mode is actually synchronous? And then afterwards, I have to call it again to make sure it didn't fall "out of" mode between my previous call and the commit actually working?

Bugger it, then I'll have to patch every single app/query that writes transactions to the database to be "sync rep" aware... And if I miss one...

Some might say that if the data's that important, that auditing/patching to be "sync rep" aware is worth it, but then I guess they'd say that then you might as well do application-level replication as well ;-)

a.

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Fri, 2008-12-12 at 14:23 -0500, Aidan Van Dyk wrote: > So when would I have to call that function? Before begin, after begin, > before commit, or all of them, to guarantee that I know my application is > supposed to "delay" calling commit until sync-mode is actually > synchronous? And then afterwards, I have to call it again to make sure > it didn't fall "out of" mode between my previous call and the commit > actually working? I'm not suggesting that applications call the function. It's a way for a monitoring system to know that you're in a degraded state and notify you. I'm not sure I entirely understand the use case you're advocating: Let's say the standby has a major failure. Now you have a single point of failure (the primary), so _all_ of your transactions are in jeopardy anyway -- at least until you get back into sync rep. Rejecting new transactions won't save your old ones. The only time it helps is when the failure is temporary, i.e. you didn't really lose the storage on the standby. But you would need to rely on some guarantee that the storage is still intact on the standby system even though the standby is unresponsive. Is that the use case? Regards, Jeff Davis
Hi, Fujii Masao wrote: > I'd like to define the meaning of "synch rep" again. "synch rep" means: > > (1) Transaction commit waits for WAL records to be replicated to the standby > before the command returns a "success" indication to the client. > > (2) The standby has (can read) all WAL files indispensable for recovery. Let me point out that - very much like the original Postgres-R algorithm - this guarantees committed transactions to be durable and consistent (no late aborts of conflicting transactions), but it does not guarantee that a transaction committed on one node is immediately visible on the other node. In that sense, it is not synchronous as commonly understood, because it does not "operate with all their parts in synchrony" [1], as implied by the term "synchronous". This might lead (and often has led in the past) to confusion. It's certainly enough of a reason for me to rather use the term "eager replication". See [2] for a more in-depth explanation. I might also point out that Jan Wieck called this very same approach "an asynchronous replication system by all means" [3]. Regards Markus Wanner [1]: Wikipedia on Synchronization http://en.wikipedia.org/wiki/Synchronization [2]: Postgres-R general mailing list, by Markus Wanner, subject: terms for database replication: synchronous vs eager http://lists.pgfoundry.org/pipermail/postgres-r-general/2008-September/000014.html [3]: Postgres General mailing list, by Jan Wieck, subject: terms for database replication: synchronous vs eager http://archives.postgresql.org/pgsql-hackers/2007-09/msg00631.php
On Sat, 2008-12-13 at 00:00 +0100, Markus Wanner wrote: > Hi, > > Fujii Masao wrote: > > I'd like to define the meaning of "synch rep" again. "synch rep" means: > > > > (1) Transaction commit waits for WAL records to be replicated to the standby > > before the command returns a "success" indication to the client. > > > > (2) The standby has (can read) all WAL files indispensable for recovery. > > Let me point out that - very much like the original Postgres-R algorithm > - this guarantees committed transactions to be durable and consistent > (no late aborts of conflicting transactions), but it does not guarantee > that a transaction committed on one node is immediately visible on the > other node. In that sense, it is not synchronous as commonly understood, > because it does not "operate with all their parts in synchrony" [1], as > implied by the term "synchronous". This might lead (and often has led in > the past) to confusion. You're right that neither the data transfer nor data availability is entirely synchronous, but the data transfer is synchronous at the time of *commit*: it is recorded on multiple nodes at the same time. The term "synchronous replication" is already well used in the industry to mean synchronous commit, so I don't think we should change the name now. The project here is also known to everybody as "synch rep". * Oracle Data Guard calls it "synchronous redo transport" * MS Exchange calls it "synchronous replication" * MS SQL Server has "Database Mirroring", "Log Shipping" and "Replication". "Database Mirroring" provides the synchronous mechanism, with "Replication" meaning data transfer to other databases, publish & subscribe. * DB2 HADR provides "synchronous replication" * MySQL calls it "synchronous replication" What is confusing is that "replication" itself is a much abused term and is used to describe technologies for HA, DR and data movement. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi, Simon Riggs wrote: > You're right that neither the data transfer nor data availability is > entirely synchronous, but the data transfer is synchronous at the time of > *commit*: it is recorded on multiple nodes at the same time. I'm unsure what you mean by a "data transfer being synchronous". To what other process or state should the data transfer be synchronous? > The term "synchronous replication" is already well used in the industry > to mean synchronous commit, so I don't think we should change the name > now. The project here is also known to everybody as "synch rep". I understand very well that you don't want to change the name. I've been hesitant to "relabel" Postgres-R from synchronous to asynchronous to eager. However, that is a marketing decision [1], which should not be mixed with the technical discussion here. Speaking of a "synchronous commit" is utterly misleading, because the commit itself is exactly the thing that's *not* synchronous. It *is* an optimization over fully synchronous replication to defer commit on the "slave" and only make sure that the transaction *can* be applied at some time in the future. However, this *does* have the drawback of transactions not being immediately visible on the slave. Often enough, this is acceptable. But it certainly matters to some application developers. > What is confusing is that "replication" itself is a much abused term and > is used to describe technologies for HA, DR and data movement. I absolutely agree with that. And I'm thus recommending to at least be consistent and honest with the term "synchronous": point out that WAL writing is synchronous for the log shipping approach here (AFAIK), but that the commit is asynchronous for performance reasons. In other words: this approach is certainly (and hopefully, for performance reasons) different from a fully synchronous approach. Even for marketing reasons, it might make sense to point out that difference (.. "no, we are faster than fully sync rep."). Regards Markus Wanner [1]: Some people like the term "virtually synchronous" for marketing purposes. That's at least halfway technically correct.
On 2008-12-13, at 13:07, Markus Wanner wrote: > > > However, that is a marketing decision [1], which should not be mixed > with the technical discussion here. Speaking of a "synchronous commit" > is utterly misleading, because the commit itself is exactly the thing > that's *not* synchronous. > > [1]: Some people like the term "virtually synchronous" for marketing > purposes. That's at least half-ways technically correct. Marketing people are virtually trustworthy, from my life experience. If you ask me, this is just preposterous.
On Sat, 2008-12-13 at 14:07 +0100, Markus Wanner wrote: > Speaking of a "synchronous commit" > is utterly misleading, because the commit itself is exactly the thing > that's *not* synchronous. Not really sure where you're going here. "synchronous replication" is used exactly as described in the Wikipedia entry here: http://en.wikipedia.org/wiki/Database_replication No two-word phrase is going to accurately sum up the complexity and potential for data loss in these situations. DRBD saw that too and just called them A, B and C and then described them more accurately. But I don't think we should say "PostgreSQL just implemented algorithm B", which is just unhelpful. I don't think it's "marketing" to refer to it by the phrase most commonly used for the technology we are building. Nobody suggested we call it "wizrep" or suchlike... The docs can contain the exact description of data loss and timing windows. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi, Simon Riggs wrote: > On Sat, 2008-12-13 at 14:07 +0100, Markus Wanner wrote: >> Speaking of a "synchronous commit" >> is utterly misleading, because the commit itself is exactly the thing >> that's *not* synchronous. > > Not really sure where you're going here. I'm pointing to a potential misunderstanding, trying to help prevent you from running into the same issues and discussions as I did. I've learned the hard way that the Postgres-R algorithm is not fully synchronous (in the strict sense). This caused confusion for people who take the word "synchronous" by its original meaning. The algorithm proposed here seems similar enough to potentially cause the same confusion. As I see it now, I think it's well worth pointing out the difference, from both the technical and the marketing perspective. The former for better understanding, the latter to prevent users from thinking it must be slow by definition. Arguing that your approach is not fully synchronous definitely helps defending that concern. However, I'm just now realizing that the difference is only relevant as soon as you begin to allow read-only access on the slave. AFAIK that's among the goals of this effort, no? > "synchronous replication" is > used exactly as described in the Wikipedia entry here: > http://en.wikipedia.org/wiki/Database_replication That article describes pretty much all variants of replication; what exactly are you referring to? Under "Database Replication > Multi-Master replication" it describes eager vs lazy variants, which is IMO a more appropriate and useful distinction than sync vs async. (But that's admittedly a sentence I've contributed myself, IIRC.) Under "Storage Replication > Synchronous Replication" one can read: "Write is not considered complete until acknowledgement by both local and remote storage." For the proposed approach this might hold true for WAL writing. However, the user certainly doesn't care how synchronously the log is shipped or written, as long as she doesn't see the changes on the slave. That's the difference between fully synchronous and eager (or virtually or approximately synchronous) algorithms. You seem to refer to both as "synchronous". Phrases like "synchronous commit" or "synchronous data transfer" do not help me to understand what exactly you are talking about. Explaining that the slave commits (and therefore makes the transactions visible) asynchronously would help. And it would prevent disappointment for users who expect changes to be immediately visible on the slave. > No two-word phrase is going to accurately sum up the complexity and > potential for data loss in these situations. DRBD saw that too and just > called them A, B and C and then described them more accurately. Agreed. I've chosen lazy, eager and sync, so far. I'm open to better terms, and I leave it up to you to call your variants whatever you like. But to understand what you are talking about, I'd prefer to get these distinctions crisp and clear. > But I don't think we should say "PostgreSQL just implemented algorithm > B", which is just unhelpful. I don't think it's "marketing" to refer to it > by the phrase most commonly used for the technology we are building. I certainly agree to using such terms. Unfortunately, in my experience, synchronous replication is commonly used to mean that transactions are guaranteed to be immediately visible on remote nodes after the client got commit acknowledgment. That's the cause for confusion I'm envisioning.
I'm hoping to be somewhat helpful to this effort of getting a log shipping replication variant into Postgres. It can only be beneficial for Postgres-R in that we gain field experience with ..uhm.. this special kind of replication, however we name it. I'm already on xmas vacation, so I won't bother you any further on this issue. Have fun coding and make sure to enjoy this time of the year. All the best. Markus Wanner
> I certainly agree to using such terms. Unfortunately, in my experience, > synchronous replication is commonly used to mean that transactions are > guaranteed to be immediately visible on remote nodes after the client > got commit acknowledgment. That's the cause for confusion I'm envisioning. I think that's a very important point. It's very possible that 8.4 may support both this feature and Hot Standby (although the latter seems to have stalled a bit...). That makes me think "oh, great, I can offload any subset of my read-only queries to the standby". Not so fast. I think we need to reserve the term "synchronous replication" for a system where transactions that begin at the same time on the primary and standby see the same tuples. Clearly that is "more" synchronous than what is being proposed here; if we call this "synchronous replication", what will we call that? "Really Synchronous, Honest, No Kidding"? Admittedly, we may never implement that feature, but that seems irrelevant. It would be useful to have names for all the different possibilities. Random ideas:

Log Shipping. After each log switch, the previous WAL log is copied to the standby in its entirety.

WAL Streaming - Asynchronous. The WAL log is streamed from master to standby as it is written, but transactions on the master never wait.

WAL Streaming - Synchronous Receive. The WAL log is streamed from master to standby as it is written, and transactions on the master wait until the standby acknowledges receipt of the WAL.

WAL Streaming - Synchronous Write. The WAL log is streamed from master to standby as it is written, and transactions on the master wait until the standby acknowledges that the WAL has been written to disk.

WAL Streaming - Synchronous Apply. The WAL log is streamed from master to standby as it is written, and transactions on the master wait until the standby acknowledges that WAL has been written to disk and applied.

...Robert
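One way to picture these levels is as values of a single setting that controls how long commit waits. The setting name and its values below are assumptions made for the sketch only; the patch under review defines none of them:

    -- Hypothetical sketch: synchronous_replication and its values are
    -- assumed names, mapping one-to-one onto the taxonomy above.
    SET synchronous_replication = 'async';  -- streaming, no commit wait
    SET synchronous_replication = 'recv';   -- wait until standby received WAL
    SET synchronous_replication = 'fsync';  -- wait until standby wrote WAL to disk
    SET synchronous_replication = 'apply';  -- wait until standby applied WAL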
"Robert Haas" <robertmhaas@gmail.com> writes: > I think we need to reserve the term "synchronous replication" for a > system where transactions that begin at the same time on the primary > and standby see the same tuples. Clearly that is "more" synchronous > than what is being proposed here; if we call this "synchronous > replication", what will we call that? "Really Synchronous, Honest, No > Kidding"? Admittedly, we may never implement that feature, but that > seems irrelevant. We won't call it anything, because we never will or can implement that. See the theory of relativity: the notion of exactly simultaneous events at distinct locations isn't even well-defined, because observers at yet other locations will disagree about what is "simultaneous". And I'm not just making a joke here --- speed-of-light delays in a WAN are meaningful compared to current computer speeds. In practice, the slave and the master will never commit at exactly the same time. I agree with the point made upthread that we should use the term "synchronous replication" the way it's commonly used in the industry. Inventing our own terminology might be fun but it's not really going to result in less confusion. regards, tom lane
Synchronous replication, "sync rep", is *not* interested in the "slave's visibility of the commit", because PostgreSQL doesn't "serve" requests when in recovery (wal receiving) mode *now*. This sync rep patch/proposal/discussion is *strictly* (at this point, though hot standby may eventually or hopefully soon change that) the means to get the data "safely in 2 separate places", before the COMMIT returns, by means of wal streaming. That "safely in 2 places" can have various implementation options (like received, on disk, or applied), and Fujii-san explained some of the options as to what to consider "safe" and their trade-offs in his presentation last year. Once both sync-rep (the wal-streaming getting changes into two places) and hot-standby (run queries while WAL is being applied) are available in PostgreSQL, at that point we might need to start considering "other client visibility", but even then, we still don't need to worry about multi-master options... a. * Markus Wanner <markus@bluegap.ch> [081213 12:17]: > Hi, > > Simon Riggs wrote: > > On Sat, 2008-12-13 at 14:07 +0100, Markus Wanner wrote: > >> Speaking of a "synchronous commit" > >> is utterly misleading, because the commit itself is exactly the thing > >> that's *not* synchronous. > > > > Not really sure where you're going here. > > I'm pointing to a potential misunderstanding, trying to help prevent > you from running into the same issues and discussions as I did. > > I've learned the hard way that the Postgres-R algorithm is not fully > synchronous (in the strict sense). This caused confusion for people who > take the word "synchronous" by its original meaning. The algorithm > proposed here seems similar enough to potentially cause the same confusion. > > As I see it now, I think it's well worth pointing out the difference, > from both the technical and the marketing perspective. The > former for better understanding, the latter to prevent users from > thinking it must be slow by definition. Arguing that your approach is > not fully synchronous definitely helps defending that concern. > > However, I'm just now realizing that the difference is only relevant as > soon as you begin to allow read-only access on the slave. AFAIK that's > among the goals of this effort, no? > > > "synchronous replication" is > > used exactly as described in the Wikipedia entry here: > > http://en.wikipedia.org/wiki/Database_replication > > That article describes pretty much all variants of replication; what > exactly are you referring to? > > Under "Database Replication > Multi-Master replication" it describes > eager vs lazy variants, which is IMO a more appropriate and useful > distinction than sync vs async. (But that's admittedly a sentence I've > contributed myself, IIRC.) > > Under "Storage Replication > Synchronous Replication" one can read: > "Write is not considered complete until acknowledgement by both local > and remote storage." For the proposed approach this might hold true for > WAL writing. However, the user certainly doesn't care how synchronously > the log is shipped or written, as long as she doesn't see the > changes on the slave. > > That's the difference between fully synchronous and eager (or virtually > or approximately synchronous) algorithms. You seem to refer to both as > "synchronous". Phrases like "synchronous commit" or "synchronous data > transfer" do not help me to understand what exactly you are talking about.
> > Explaining that the slave commits (and therefore makes the transactions > visible) asynchronously would help. And it would prevent disappointment > for users who expect changes to be immediately visible on the slave. > > > No two-word phrase is going to accurately sum up the complexity and > > potential for data loss in these situations. DRBD saw that too and just > > called them A, B and C and then described them more accurately. > > Agreed. I've chosen lazy, eager and sync, so far. I'm open to better > terms, and I leave it up to you to call your variants whatever you like. > But to understand what you are talking about, I'd prefer to get > these distinctions crisp and clear. > > > But I don't think we should say "PostgreSQL just implemented algorithm > > B", which is just unhelpful. I don't think it's "marketing" to refer to it > > by the phrase most commonly used for the technology we are building. > > I certainly agree to using such terms. Unfortunately, in my experience, > synchronous replication is commonly used to mean that transactions are > guaranteed to be immediately visible on remote nodes after the client > got commit acknowledgment. That's the cause for confusion I'm envisioning. > > > I'm hoping to be somewhat helpful to this effort of getting a log > shipping replication variant into Postgres. It can only be beneficial > for Postgres-R in that we gain field experience with ..uhm.. this > special kind of replication, however we name it. > > I'm already on xmas vacation, so I won't bother you any further on this > issue. Have fun coding and make sure to enjoy this time of the year. > > All the best. > > Markus Wanner -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Sat, 2008-12-13 at 13:05 -0500, Robert Haas wrote: > Hot Standby (although the latter > seems to have stalled a bit...) It's just being worked on asynchronously. ;-) -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Sat, 2008-12-13 at 13:05 -0500, Robert Haas wrote: > > I certainly agree to using such terms. Unfortunately, in my experience, > > synchronous replication is commonly used to mean that transactions are > > guaranteed to be immediately visible on remote nodes after the client > > got commit acknowledgment. That's the cause for confusion I'm envisioning. > > I think that's a very important point. It's very possible that 8.4 > may support both this feature and Hot Standby (although the latter > seems to have stalled a bit...). That makes me think "oh, great, I > can offload any subset of my read-only queries to the standby". Not > so fast. > > I think we need to reserve the term "synchronous replication" for a > system where transactions that begin at the same time on the primary > and standby see the same tuples. Define "same time". You can have a variant of sync rep + hot standby where the master does not return committed before the slave has both synced the data and applied the transaction so that it is visible on the slave, but in that case you may have a use case where it is actually visible on the slave _before_ it is visible on the master. Actually, you can't have that "same time" guarantee even on a single system; that is, if you start two transactions on two connections "at the same time", you still can't be sure there is no third transaction which has committed between those two and which makes the visible data on those two different. > Clearly that is "more" synchronous > than what is being proposed here; if we call this "synchronous > replication", what will we call that? "Really Synchronous, Honest, No > Kidding"? Admittedly, we may never implement that feature, but that > seems irrelevant. > > It would be useful to have names for all the different possibilities. > Random ideas: > > Log Shipping. After each log switch, the previous WAL log is copied > to the standby in its entirety. > > WAL Streaming - Asynchronous. The WAL log is streamed from master to > standby as it is written, but transactions on the master never wait. > > WAL Streaming - Synchronous Receive. The WAL log is streamed from > master to standby as it is written, and transactions on the master > wait until the standby acknowledges receipt of the WAL. > > WAL Streaming - Synchronous Write. The WAL log is streamed from > master to standby as it is written, and transactions on the master > wait until the standby acknowledges that the WAL has been written to > disk. > > WAL Streaming - Synchronous Apply. The WAL log is streamed from > master to standby as it is written, and transactions on the master > wait until the standby acknowledges that WAL has been written to disk > and applied. We could still call the Sync Rep feature "synchronous replication" on the basis that "WAL Streaming - Synchronous Write" is the highest security level achievable using the feature. And maybe have Sync Hot Standby as a feature on top of that which provides "WAL Streaming - Synchronous Apply". ------------------------------------------ Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
On Sat, 2008-12-13 at 21:35 +0200, Hannu Krosing wrote: > We could still call the Sync Rep feature "synchronous replication" on > the basis that "WAL Streaming - Synchronous Write" is the highest security > level achievable using the feature. > > And maybe have Sync Hot Standby as a feature on top of that which > provides "WAL Streaming - Synchronous Apply" Or maybe better to call it Serializable Hot Standby, as the actual guarantee that can be achieved is this: when one client does something on the master and, after committing on the master, starts another transaction on the slave, then the effects of the query on the master are visible on the slave. -- ------------------------------------------ Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
Hi, Tom Lane wrote: > We won't call it anything, because we never will or can implement that. > See the theory of relativity: the notion of exactly simultaneous events > at distinct locations isn't even well-defined That has never been the point of the discussion. It's rather about the question whether changes from transactions are guaranteed to be visible on remote nodes immediately after commit acknowledgment. Whether or not this is guaranteed, in both cases the term "synchronous replication" is commonly used, which is causing confusion. Regards Markus Wanner
Hi, Simon Riggs wrote: >> Hot Standby (although the latter >> seems to have stalled a bit...) > > It's just being worked on asynchronously. ;-) LOL, thanks for bringing humor into this discussion :-) Regards Markus Wanner
Hi, Hannu Krosing wrote: > You can have a variant of sync rep + hot standby where the master does > not return committed before the slave has both synced the data and > applied the transaction so that it is visible on the slave, but in that case > you may have a use case where it is actually visible on the slave _before_ > it is visible on the master. As long as it's not visible *before* the client requests a COMMIT, that certainly doesn't matter (because the application cannot check that). What matters is that an application might expect a node to show the changes of a transaction which has previously (seen from the application itself) been committed and acknowledged by another node. AFAICT the common understanding of synchronous replication is that all nodes confirm to have committed the changes of a transaction *before* acknowledging COMMIT to the application (and obviously only *after* the application requested to COMMIT the transaction, so the guarantee is that all nodes commit *sometime* within that time frame, which is certainly possible to guarantee; see 2PC approaches). This guarantee is not provided by the Postgres-R algorithm, nor by the approach presented. Both only guarantee that the transaction *will* get committed (and thus get visible) on all nodes *sometime* *after* the application requested to commit it (even in case of various failures, that is) [1]. As cited before, that has been enough of a reason for Jan Wieck to call Postgres-R asynchronous, and I certainly see his point. Note that the amount of time that passes between the commit acknowledgment and the actual commit on remote nodes may theoretically be infinitely long. And in practice certainly long enough for an application to notice the difference. However, it still is a practical optimization, because most applications should cope with it just fine. But not all... Do you consider the proposed log shipping approach to be synchronous? How about the Postgres-R algorithm? Regards Markus Wanner [1]: of course these approaches also guarantee that the transaction is committed on the local node *before* acknowledging commit, so that subsequent (seen from the application) queries are guaranteed to see the changes. But that guarantee only holds true for the local node.
* Markus Wanner <markus@bluegap.ch> [081213 16:33]: > Hi, > > Hannu Krosing wrote: > > You can have a variant of sync rep + hot standby where the master does > > not return committed before the slave has both synced the data and > > applied the transaction so that it is visible on the slave, but in that case > > you may have a use case where it is actually visible on the slave _before_ > > it is visible on the master. > > As long as it's not visible *before* the client requests a COMMIT, that > certainly doesn't matter (because the application cannot check that). Well, I think the PG MVCC (which wal-streaming just ships across somewhere else) will cover that. So with hot-standby you could have another client see the result *after* the COMMIT has been requested, but *before* the COMMIT returns... But we have this situation in a single current PG instance anyways, so it's nothing new.... But with hot-standby, I could also see that it could be done such that the wal-stream is fsynced to disk (i.e. xlog) and acknowledged, but because of a currently running query, application of it is delayed... But this is hot-standby's problem of describing itself, not sync-rep's. IMHO, sync-rep is about getting the change "durably to a slave" before acknowledging the COMMIT. That slave could be any number of things:
- A "WAL archive" type system having the ability to be used for recovery
- A PG with a special "recovery mode" that reads the stream and applies it
- A full hot-standby recovery
I could see any and all of those (and probably others) being useful and used. But the current patch focuses on the streaming (sending), and a receiver "recovery" mode that can accept/apply it, again, without worrying about actually running queries (yet)... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Markus Wanner wrote:
> Tom Lane wrote:
>> We won't call it anything, because we never will or can implement that.
>> See the theory of relativity: the notion of exactly simultaneous events
>> at distinct locations isn't even well-defined
> That has never been the point of the discussion. It's rather about the
> question whether changes from transactions are guaranteed to be visible
> on remote nodes immediately after commit acknowledgment. Whether or not
> this is guaranteed, in both cases the term "synchronous replication" is
> commonly used, which is causing confusion.

Might it not be true that anybody unfamiliar would be confused and that this is a bit of a straw man?

I don't think synchronous replication guarantees that it will be immediately visible. Even if it did push the change to the other machine, and the other machine had committed it, that doesn't guarantee that any reader sees it, any more than committing to the same machine (no replication) guarantees that another session sees the change. Synchronous replication only means that I can be assured that my change has been saved permanently by the time my commit completes. It doesn't mean anybody else can see my change or is guaranteed to see my change if they query from another session.

If my application assumes that it can commit to one server, and then read back the commit from another server, and my application breaks as a result, it's because I didn't understand the problem. Even if PostgreSQL didn't use the word "synchronous replication", I could still be confused. I need to understand the problem no matter what words are used.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
Hi, Aidan Van Dyk wrote: > Well, I think the PG MVCC (which wal-streaming just ships across > somewhere else) will cover that. So with hot-standby you could have > another client see the result *after* the COMMIT has been > requested, but *before* the COMMIT returns... But we have this > situation in a single current PG instance anyways, so it's nothing > new.... AFAIU the proposed algorithm only waits until WAL is written on the slave before acknowledging COMMIT. Application of the changes may be deferred, so it's not necessarily immediately visible on the slave. > But with hot-standby, I could also see that it could be done such that > the wal-stream is fsynced to disk (i.e. xlog) and acknowledged, but > because of a currently running query, application of it is delayed... But > this is hot-standby's problem of describing itself, not sync-rep's. I'm thinking of the overall system and don't care much if it's hot-standby's or sync-rep's problem. But it's certainly the master which needs to await certain acknowledgments from the slaves. That has so far been discussed within this sync-rep thread. > IMHO, sync-rep is about getting the change "durably to a slave" before > acknowledging the COMMIT. That slave could be any number of things: > - A "WAL archive" type system having the ability to be used for > recovery > - A PG with a special "recovery mode" that reads the stream and applies it > - A full hot-standby recovery > > I could see any and all of those (and probably others) being useful and > used. I certainly agree with that. Regards Markus Wanner
Hi, Mark Mielke wrote: > Might it not be true that anybody unfamiliar would be confused and that > this is a bit of a straw man? Might be. I've neglected the issue myself for a while. > I don't think synchronous replication guarantees that it will be > immediately visible. Even if it did push the change to the other > machine, and the other machine had committed it, that doesn't guarantee > that any reader sees it, any more than committing to the same machine > (no replication) guarantees that another session sees the change. AFAIK every snapshot taken after a transaction has acknowledged its commit is guaranteed to see changes from that transaction. Isn't that a pretty frequent and obvious user expectation? > Synchronous replication only means that I can be assured that > my change has been saved permanently by the time my commit completes. It > doesn't mean anybody else can see my change or is guaranteed to see my > change if they query from another session. So you wouldn't be surprised if a transaction from two hours ago isn't visible on another node, just because that node happens to be rather busy with lots of other readers and maintenance tasks? > If my application assumes that it can commit to one server, and then > read back the commit from another server, and my application breaks as a > result, it's because I didn't understand the problem. Well, yeah, it depends on user expectations. I'm surprised to hear that you have that understanding of synchronous replication. > Even if PostgreSQL > didn't use the word "synchronous replication", I could still be > confused. I need to understand the problem no matter what words are used. As said, it depends on what the common understanding of "synchronous replication" is. I've so far been under the impression that these potential lags are unexpected and confusing. Several people pointed me at that problem and I've thus "relabeled" Postgres-R as not being synchronous. I'm at least surprised to suddenly get pushed in the other direction. :-) However, I absolutely agree that it's not that important how we name it. What is important is that users and developers understand the difference. Regards Markus Wanner
Markus Wanner wrote:<br /><blockquote cite="mid:494436AE.2080207@bluegap.ch" type="cite"><blockquote type="cite"><pre wrap="">Idon't think synchronous replication guarantees that it will be immediately visible. Even if it did push the change to the other machine, and the other machine had committed it, that doesn't guarantee that any reader sees it any more than if I commit to the same machine (no replication), I am guaranteed to see the change from another session. </pre></blockquote><pre wrap=""> AFAIK every snapshot taken after a transaction has acknowledged its commit is guaranteed to see changes from that transaction. Isn't that a pretty frequent and obvious user expectation? </pre></blockquote><br /> Yes - but that's only really true while the sessioncontinues. From another session? I've never assumed that I could reconnect and be guaranteed to get the latest snapshotthat includes absolutely everything that has been committed.<br /><br /> Any system that guaranteed this even wheninvolving multiple machines would be guaranteed to be inefficient and difficult to scale in my opinion. How could anysystem promise to have reasonable commit times while also guaranteeing that once a commit completes, any session to anyother server will be able to see the commit? I think this forces some sort of serialization between multiple machinesand defeats the purpose of having multiple machines. Where before it was indeterminate to know when the commit wouldtake effect at each replica, it's not indeterminate when my commit will succeed. That is, my commit cannot succeed untilevery single server acknowledge that it is has fully received and committed my transaction. What happens if there arenetwork problems, or what happens if I am replicating over a slower link? What if I am committing to 100 servers? Is itreasonable to expect 100 server negotiations to complete in full before my own commit will return?<br /><br /><blockquotecite="mid:494436AE.2080207@bluegap.ch" type="cite"><blockquote type="cite"><pre wrap="">Synchronous replicationonly means that I can be assured that my change has been saved permanently by the time my commit completes. It doesn't mean anybody else can see my change or is guaranteed to see my change if the query from another session. </pre></blockquote><pre wrap="">So you wouldn't be surprised if a transactionfrom two hours ago isn't visible on another node, just because that node happens to be rather busy with lots of other readers and maintenance tasks? </pre></blockquote><br /> Any system that is two hours behind shouldfall out of the pool used to satisfy reads from. So, if there was a surprise, it would be this. I don't believe ACIDrequires that a commit on one server is immediately visible on another server. Any work I do on the "behind" server wouldstill be safe from a transaction and referential integrity perspective. However, if I executed 'commit' on this "behind"server, I would expect the commit to wait until it catches up, or in the case of a 2 hour behind, I would expectthe commit to fail. Look at the alternative - all commits to any server in the pool would be locked up waiting forthis one machine to catch up on 2 hours of transaction. 
This emphasizes that the problem is that a server two hours out of date is still in the pool, rather than the problem being keeping things up-to-date.

>> If my application assumes that it can commit to one server, and then
>> read back the commit from another server, and my application breaks as a
>> result, it's because I didn't understand the problem.
> Well, yeah, it depends on user expectations. I'm surprised to hear that
> you have that understanding of synchronous replication.

I've seen people face it in the past. Most recently we had a presentation from the developer of digg.com, and he described how he had this problem with MySQL and that he had to work around it.

On a smaller scale and slightly unrelated, I had this problem frequently between memcache and PostgreSQL. That is, memcache would always be latest, but PostgreSQL might not be latest, because the commit had not occurred.

It seems like a standard enough problem to me. I don't expect Postgres-R to do the impossible. As with my previous paragraph, I don't expect Postgres-R to wait 2 hours to commit just because one server is falling behind.

>> Even if PostgreSQL didn't use the word "synchronous replication", I
>> could still be confused. I need to understand the problem no matter what
>> words are used.
> As said, it depends on what the common understanding of "synchronous
> replication" is. I've so far been under the impression that these
> potential lags are unexpected and confusing. Several people pointed me at
> that problem and I've thus "relabeled" Postgres-R as not being
> synchronous. I'm at least surprised to suddenly get pushed in the other
> direction. :-)
> However, I absolutely agree that it's not that important how we name it.
> What is important is that users and developers understand the difference.

I agree they are unexpected and confusing. I don't agree that they are unexpected or confusing to those knowledgeable in the domain. So, the question becomes - whose expectation is wrong? Should the user learn more? Or should we push for a change in terminology? Does it make sense for Postgres-R (which looks excellent to me BTW, at least in principle) to be marketed differently, because a few users tie "synchronous replication" to "serialized access"?

Because that's really what we're talking about - we're talking about transactions in all sessions being serialized between machines to provide less surprise to users who don't understand the complexity of having multiple replicas.

Forget replication - even for the exact same server - I don't expect that if I commit from one session, I will be able to see the change immediately from my other session or a new session that I just opened. Perhaps this is often stable to rely on, and it is useful for the database server to minimize the window during which the commit becomes visible to others, but I think it's a false expectation from the start that it absolutely will be immediately visible to another session. I'm thinking of situations where some part of the table is in cache.
The only way the commit can communicate that the new transaction is available is by communication between the processes or threads, or between the multiple CPUs on the machine. Do I want every commit to force each session to become fully in alignment before my commit completes? Does PostgreSQL make this guarantee today? I bet it doesn't if you look far enough into the guts. It might be very fast - I don't think it is infinitely fast.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Robert Haas" <robertmhaas@gmail.com> writes: >> I think we need to reserve the term "synchronous replication" for a >> system where transactions that begin at the same time on the primary >> and standby see the same tuples. Clearly that is "more" synchronous > > We won't call it anything, because we never will or can implement that. > See the theory of relativity: the notion of exactly simultaneous events OK, fine. I'll be more precise. I think we need to reserve the term "synchronous replication" for a system where transactions that begin on the standby after the transaction has committed on the master see the effects of the committed transaction. > at distinct locations isn't even well-defined, because observers at yet > other locations will disagree about what is "simultaneous". And I'm > not just making a joke here --- speed-of-light delays in a WAN are > meaningful compared to current computer speeds. In practice, the > slave and the master will never commit at exactly the same time. > > I agree with the point made upthread that we should use the term > "synchronous replication" the way it's commonly used in the industry. > Inventing our own terminology might be fun but it's not really going > to result in less confusion. I just googled "synchronous replication" and read through the first page of hits. Most of them do not address the question of whether synchronous replication can be said to have completed when WAL has been received by the standby but not yet applied. One of the ones that does is: http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign ...which refers to what we're proposing to call "Synchronous Replication" as "Semi-Synchronous Replication" (or 2-safe replication) specifically to distinguish it. The other is: http://www.cnds.jhu.edu/pub/papers/cnds-2002-4.pdf ...which doesn't specifically examine the issue but seems to take the opposite position, namely that the server on which the transaction is executed needs to wait only for one server to apply the changes to the database (the others need only to know that they need to commit it; they don't actually need to have done it). However, that same paper refers to two-phase commit as a synchronous replication algorithm, and Wikipedia's discussion of two-phase commit: http://en.wikipedia.org/wiki/Two-phase_commit_protocol ...clearly implies that the transaction must be applied everywhere before it can be said to have committed. The second page of Google results is mostly a further discussion of the MySQL solution, which is mostly described as "semi-synchronous replication". Simon Riggs said upthread that Oracle calls this "synchronous redo transport". That is obviously much closer to what we are doing than "synchronous replication". ...Robert
On Sat, 2008-12-13 at 21:35 -0500, Robert Haas wrote: > On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > "Robert Haas" <robertmhaas@gmail.com> writes: > >> I think we need to reserve the term "synchronous replication" for a > >> system where transactions that begin at the same time on the primary > >> and standby see the same tuples. Clearly that is "more" synchronous > > > > We won't call it anything, because we never will or can implement that. > > See the theory of relativity: the notion of exactly simultaneous events > > OK, fine. I'll be more precise. I think we need to reserve the term > "synchronous replication" for a system where transactions that begin > on the standby after the transaction has committed on the master see > the effects of the committed transaction. > If it's guaranteed to be visible on the standby after it's committed on the master, and you don't have any way to make it actually simultaneous, then that implies that it's visible on the slave for some brief period of time before it's committed on the master. That situation is still asymmetric, so why is that a better use of the term "synchronous"? Regards, Jeff Davis
> If it's guaranteed to be visible on the standby after it's committed on > the master, and you don't have any way to make it actually simultaneous, > then that implies that it's visible on the slave for some brief period > of time before it's committed on the master. > > That situation is still asymmetric, so why is that a better use of the > term "synchronous"? Because that happens anyway. If I request a commit on a single, unreplicated server, the server makes the commit visible to new transactions and then sends me a message informing me that the commit has completed. Since the message takes some finite time to reach me, there is a window of time after the commit has completed and before I know that the commit has been completed. Suppose for the sake of argument that the single, unreplicated server did these two tasks in the opposite order - namely, first, it sent a message to the process requesting the commit stating that the commit had completed, and only then made the transaction visible. This would create a race condition: the process requesting the commit might receive the acknowledgment and begin a new transaction before the previous transaction had been made visible, and would therefore not be able to see the results of its own previous actions. I think it's fair to say that this behavior would be judged totally intolerable. Therefore, there can't possibly be any applications out there which are depending on the fact that commits don't become visible until they are acknowledged, but there very well could be some applications which depend on the fact that once commits are acknowledged, they are visible. If replication is synchronous in this sense, then I can open a connection to the master, write some data, close the connection, open a new connection to the master or the slave (not caring which), and read back the data that I just wrote (assuming no one else has modified it in the meantime). If it isn't, then I can't. Some people will not care about this, but some will. The point here is that synchronous replication, at least to some people, is going to imply that the user-visible states of the two copies are consistent. To other people, it is going to imply that committed transactions will never be lost even in the event of a catastrophic loss of the primary 1 picosecond after the commit is acknowledged. We need to choose some word that implies that we are guaranteeing the latter of these two things but not the former. Otherwise, we will have confused users, and terminological confusion when and if we ever implement the former as well. ...Robert
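To put the read-back scenario in concrete terms, here is a minimal sketch. The table is hypothetical, and the behavior notes describe the receive/write vs apply variants under discussion, not anything the patch promises:

    -- Session A, connected to the primary:
    BEGIN;
    INSERT INTO accounts (id, balance) VALUES (42, 100);  -- hypothetical table
    COMMIT;  -- under sync rep, returns only after the standby has the WAL

    -- Session B, connected to the standby, opened after A's COMMIT returned:
    SELECT balance FROM accounts WHERE id = 42;
    -- Under "synchronous receive/write", this may find no row yet: the WAL
    -- is safely on the standby, but not necessarily applied (visible).
    -- Under "synchronous apply", the row must be visible here.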
> Might it not be true that anybody unfamiliar would be confused and that this > is a bit of a straw man? [...] > If my application assumes that it can commit to one server, and then read > back the commit from another server, and my application breaks as a result, > it's because I didn't understand the problem. Even if PostgreSQL didn't use > the word "synchronous replication", I could still be confused. I need to > understand the problem no matter what words are used. That is certainly true. But there is value in choosing words which elucidate the situation as much as possible. ...Robert
On Sat, 2008-12-13 at 22:23 -0500, Robert Haas wrote: > > If it's guaranteed to be visible on the standby after it's committed on > > the master, and you don't have any way to make it actually simultaneous, > > then that implies that it's visible on the slave for some brief period > > of time before it's committed on the master. > > > > That situation is still asymmetric, so why is that a better use of the > > term "synchronous"? > > Because that happens anyway. If I request a commit on a single, > unreplicated server, the server makes the commit visible to new > transactions and then sends me a message informing me that the commit > has completed. Since the message takes some finite time to reach me, > there is a window of time after the commit has completed and before I > know that the commit has been completed. > Oh, I see the distinction now. Thanks for the detailed reply. Regards, Jeff Davis
> The point here is that synchronous replication, at least to some > people, is going to imply that the user-visible states of the two > copies are consistent. To other people, it is going to imply that > committed transactions will never be lost even in the event of a > catastrophic loss of the primary 1 picosecond after the commit is > acknowledged. We need to choose some word that implies that we are > guaranteeing the latter of these two things but not the former. > Otherwise, we will have confused users, and terminological confusion > when and if we ever implement the former as well. Right. Before watching this thread, I had thought that the log shipping sync replication behaved like the former (and I had said so to people in Japan who are interested in 8.4 development; of course, this is my fault). Now I understand the log shipping sync replication does not behave the same as other "sync replications" such as pgpool and PGCluster (there may be more, but I don't know of them). -- Tatsuo Ishii SRA OSS, Inc. Japan
> The point here is that synchronous replication, at least to some > people, is going to imply that the user-visible states of the two > copies are consistent. To other people, it is going to imply that > committed transactions will never be lost even in the event of a > catastrophic loss of the primary 1 picosecond after the commit is > acknowledged. We need to choose some word that implies that we are > guaranteeing the latter of these two things but not the former. > Otherwise, we will have confused users, and terminological confusion > when and if we ever implement the former as well. With apologies for replying to my own post: It's also important to understand that these two invariants are completely separate and it is possible to guarantee either without the other. If you want (1), the standby needs to apply the WAL before sending an acknowledgment to the primary but does not necessarily need to write it to disk (of course, it will have to be written to disk before the modified buffers are written to disk, but that's a separate issue). If you want (2), the standby needs to write the WAL to disk before sending the acknowledgment but does not necessarily need to apply it. If you want both, then you need to wait for both (and it's worth noting that your performance will probably be nothing to write home about). I also did some research on terminology that has been used in the literature. As Jim Gray describes it:

1-safe replication. Transaction is committed when it has been locally WAL-logged to durable storage.

Group-safe replication. Transaction is committed when WAL has been received by all remote servers, but not necessarily written to durable storage.

Group-safe & 1-safe replication. Transaction is committed when it has been locally WAL-logged to durable storage and WAL has been received by all remote servers.

2-safe replication. Transaction is committed when it has been written to durable storage on both local and remote servers.

Very safe replication. As 2-safe, but fails any read-write transaction if the secondary is down.

(Actually, it appears that "Transaction Processing" by Jim Gray and Andreas Reuter, 1993, uses 2-safe to refer to either 2-safe or group-safe; the distinction between the two is a subsequent development. See e.g. Advances in Database Technology - EDBT 2004 by Elisa Bertino.) The term of art for making sure that transactions committed on the primary are visible on the secondary seems to be "one-copy serializability" (see, for example, a Google Books search on that term). ...Robert
Robert Haas wrote:
> On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We won't call it anything, because we never will or can implement that.
>> See the theory of relativity: the notion of exactly simultaneous events
> OK, fine. I'll be more precise. I think we need to reserve the term
> "synchronous replication" for a system where transactions that begin
> on the standby after the transaction has committed on the master see
> the effects of the committed transaction.

Wouldn't this be serialized transactions?

I'd like to see proof of some sort that PostgreSQL guarantees that the instant a 'commit' returns, any transactions already open with the appropriate transaction isolation level, or any new sessions, *will* see the results of the commit.

I know that most of the time this happens - but what process synchronization steps occur to *guarantee* that this happens?

> I just googled "synchronous replication" and read through the first
> page of hits. Most of them do not address the question of whether
> synchronous replication can be said to have completed when WAL has been
> received by the standby but not yet applied. One of the ones that does is:
> http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign
> ...which refers to what we're proposing to call "Synchronous
> Replication" as "Semi-Synchronous Replication" (or 2-safe replication)
> specifically to distinguish it. The other is:
> http://www.cnds.jhu.edu/pub/papers/cnds-2002-4.pdf
> ...which doesn't specifically examine the issue but seems to take the
> opposite position, namely that the server on which the transaction is
> executed needs to wait only for one server to apply the changes to the
> database (the others need only to know that they need to commit it;
> they don't actually need to have done it). However, that same paper
> refers to two-phase commit as a synchronous replication algorithm, and
> Wikipedia's discussion of two-phase commit:
> http://en.wikipedia.org/wiki/Two-phase_commit_protocol
> ...clearly implies that the transaction must be applied everywhere
> before it can be said to have committed.
> The second page of Google results is mostly a further discussion of
> the MySQL solution, which is mostly described as "semi-synchronous
> replication". Simon Riggs said upthread that Oracle calls this
> "synchronous redo transport". That is obviously much closer to what we
> are doing than "synchronous replication".

Two phase commit doesn't imply that the transaction is guaranteed to be immediately visible. See my previous paragraph.
Unless transactions are locked from starting until they are able to prove that they have the latest commit (a feat which I'm going to theorize is impossible - because the moment you wait for a commit and begin again, you really have no guarantee that another commit has not occurred in the meantime), I think it's clear that two-phase commit guarantees that the commit has taken place, but does *not* guarantee anything about visibility.

It might be a good bet - but a guarantee? There is no such guarantee.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>
Hi all,

I just wanted to point out a detail that I have not seen mentioned in this thread (but I might have skipped some messages and I apologize in advance if this is a duplicate).

What the application is going to see is a failure when the postmaster it is connected to is going down. If this happens at commit time, I think that there is no guarantee for the application to know what happened:

1. failure occurred before the request reached the postmaster: no instance committed
2. failure occurred during commit: might be committed on either node
3. failure occurred while sending back the ack of the commit to the client: both instances have committed

But for the client, it will all look the same: an error on commit. This is just to point out that despite all your efforts, the client might think that some transactions have failed (error on commit) when they are actually committed.

If you don't put some state in the driver that is able to check at failover time whether the commit operation succeeded or not, it does not really matter what happens to in-flight transactions (or in-commit transactions) at failure time. In all cases, a manual inspection of the database logs will be required. Actually, if there were a way to query the database about the status of a particular transaction by providing a cluster-wide unique id, that would help a lot.

I wrote a paper on the issues with database replication at SIGMOD earlier this year (http://infoscience.epfl.ch/record/129042). Even though it was targeted at middleware replication, I think that some of it is still relevant for the problem at hand.

Regarding the wording, if experts can't agree, you can be sure that users won't either. Most of them don't have a clue about the different flavors of replication. So as long as you state clearly how it behaves and define all the terms you use, that should be fine.

manu

-- 
Emmanuel Cecchet
FTO @ Frog Thinker
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: manu@frogthinker.org
Skype: emmanuel_cecchet
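Manu's "state in the driver" idea can be sketched as follows. This is a client-side illustration only, assuming psycopg2 and a hypothetical txn_log marker table; nothing like this exists in the patch, and the cluster-wide status query he wishes for is emulated here by reading back the marker after reconnecting.

import uuid
import psycopg2

def commit_with_check(dsn, work_sql):
    txn_id = str(uuid.uuid4())
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        cur.execute(work_sql)
        # Written inside the same transaction, so the marker is visible
        # if and only if the transaction actually committed.
        cur.execute("INSERT INTO txn_log (id) VALUES (%s)", (txn_id,))
    try:
        conn.commit()
        return True
    except psycopg2.OperationalError:
        # The connection died during COMMIT: the outcome is unknown.
        # Reconnect (possibly to a promoted standby) and look for the marker.
        conn2 = psycopg2.connect(dsn)
        with conn2.cursor() as cur:
            cur.execute("SELECT 1 FROM txn_log WHERE id = %s", (txn_id,))
            return cur.fetchone() is not None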
Robert Haas wrote:
> The term of art for making sure that transactions committed on the
> primary are visible on the secondary seems to be "one-copy
> serializability" (see, for example, a Google Books search on that
> term).

Not exactly. 1-copy serializability, which is the standard for multi-master solutions, guarantees that transactions are executed in the same serializable order at each replica (which means that transactions can be executed in a different order and committed at different times on different replicas, as long as a consistent serializable view is presented to the client). There are a number of optimizations in that area, but in a multi-master case, replicas rarely commit at the same time. There are interesting papers on the subject (like Tashkent & Tashkent+, based on Postgres) for those who want to understand these problems more thoroughly.

Hope this helps,
manu

-- 
Emmanuel Cecchet
FTO @ Frog Thinker
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: manu@frogthinker.org
Skype: emmanuel_cecchet
On Sun, 2008-12-14 at 13:31 +0900, Tatsuo Ishii wrote:
> > The point here is that synchronous replication, at least to some
> > people, is going to imply that the user-visible states of the two
> > copies are consistent. To other people, it is going to imply that
> > committed transactions will never be lost even in the event of a
> > catastrophic loss of the primary 1 picosecond after the commit is
> > acknowledged. We need to choose some word that implies that we are
> > guaranteeing the latter of these two things but not the former.
> > Otherwise, we will have confused users, and terminological confusion
> > when and if we ever implement the former as well.
>
> Right. Before watching this thread, I had thought that the log
> shipping sync replication behaves like the former (and I had told so
> to people in Japan who are interested in 8.4 development. Of course
> this is my fault, though).
>
> Now I understand the log shipping sync replication does not behave
> the same as other "sync replications" such as pgpool and PGCluster
> (there may be more, but I don't know)

GENERAL COMMENTS, not to anybody in particular:

'Tis but thy name that is my enemy. ...
What's in a name? That which we call a rose
By any other name would smell as sweet. ...
Juliet, from "Romeo and Juliet"

I am truly lost to understand why the *name* "synchronous replication" causes so much discussion, yet nobody has discussed what they would actually like the software to *do* (this being a software discussion list...). AFAICS we can make the software behave like *any* of the definitions discussed so far.

It is certainly far too early to say what the final exact behaviour will be, and there is no reason at all to pre-suppose that it need only be a single behaviour. I'm in favour of options, generally, but I would say that the distinction between some of these options is mostly very fine and strongly doubt whether people would use them if they existed. *But* I think we can add them at a later stage of development if requirements genuinely exist once all the benefits *and* costs are understood.

I would also point out that the distinction made between various meanings of synchronous is *only* important if Hot Standby is included as well. And that is closely linked to the replication feature, which we really need to complete first. We have much to do yet. So let's please end the name debate there and think about software. ...

We can make the reply to a commit message when any of the following events have occurred:

1. We sent the message to standby
2. We received the message on standby
3. We wrote the WAL to the WAL file
4. We fsync'd the WAL file
5. We CRC checked the WAL commit record
6. We applied the WAL commit record

Now you might think from what people have said that having synchronised contents on both primary and standby is the only way to achieve exactly the same results to queries on both nodes. Another way is to utilise a snapshot taken on the primary and simply wait until the standby catches up with that snapshot's LSN. So there is more than one way of achieving a particular result, and it is not dependent upon the exact synchronisation we employ at commit time.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
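As a toy illustration of Simon's list (not walreceiver code), here is where a standby could send its acknowledgment for each reply point; the function arguments are stand-ins. Point 1 happens on the primary, which would reply as soon as its send() returns.

import os
import zlib

def receive_commit(sock_recv, wal_file, ack, apply_record, expected_crc, reply_at):
    record = sock_recv()                        # 2. received on standby
    if reply_at == 2:
        ack()
    wal_file.write(record)                      # 3. written to the WAL file
    if reply_at == 3:
        ack()
    wal_file.flush()
    os.fsync(wal_file.fileno())                 # 4. fsync'd
    if reply_at == 4:
        ack()
    assert zlib.crc32(record) == expected_crc   # 5. CRC checked
    if reply_at == 5:
        ack()
    apply_record(record)                        # 6. applied
    if reply_at == 6:
        ack()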
Simon Riggs wrote: > I am truly lost to understand why the *name* "synchronous replication" > causes so much discussion, yet nobody has discussed what they would > actually like the software to *do* (this being a software discussion > list...). AFAICS we can make the software behave like *any* of the > definitions discussed so far. > I think people have talked about 'like' in the context of user expectations. That is, there seems to exist a set of people (probably those who've never worked with a multi-replica solution before) who expect that once commit completes on one server, they can query any other master or slave and be guaranteed visibility of the transaction they just committed. These people may theoretically change their decision to not use Postgres-R, or at least change their approach to how they work with Postgres-R, if the name was in some way more intuitive to them in terms of what is actually being provided. "Synchronous replication" itself says only details about replication, it does not say anything about visibility, so to some degree, people are focusing on the wrong term as the problem. Even if it says "asynchronous replication" - not sure that I care either way - this doesn't improve the understanding for the casual user of what is happening behind the scenes. Neither synchronous nor asynchronous guarantees that the change will be immediately visible from other nodes after I type 'commit;'. Asynchronous might err on the side of not immediately visible, where synchronous might (incorrectly) imply immediate visibility, but it's not an accurate guarantee to provide. Synchronous does not guarantee visibility immediately after. Some indefinite but usually short time must normally pass from when my 'commit;' completes until when the shared memory visible to my process "sees" the transaction. Multiple replicas with network latency or reliability issues increases the theoretical minimum size of this window to something that would be normally encountered as opposed to something that is normally not encountered. The only way to guarantee visibility is to ensure that the new transaction is guaranteed to be visible from a shared memory perspective on every machine in the pool, and every active backend process. If my 'commit;' is going to wait for this to occur, first, I think this forces every commit to have numerous network round trips to each machine in the pool, it forces each machine in the pool to be network accessible and responsive, it forces all commits to be serialized in the sense of "the slowest machine in the pool determines the time for my commit to complete", and I think it implies some sort of inter-process signalling, or at the very least CPU level signalling about shared memory (in the case of multiple CPUs). People such as myself think that a visibility guarantee is unreasonable and certain to cause scalability or reliability problems. So, my 'like' is an efficient multi-master solution where if I put 10 machines in the pool, I expect my normal query/commit loads to approach 10X as fast. My like prefers scalability over guarantees that may be difficult to provide, and probably are not provided today even in a single server scenario. > It is certainly far too early to say what the final exact behaviour will > be and there is no reason at all to pre-suppose that it need only be a > single behaviour. 
> I'm in favour of options, generally, but I would say that the
> distinction between some of these options is mostly very fine and
> strongly doubt whether people would use them if they existed. *But* I
> think we can add them at a later stage of development if requirements
> genuinely exist once all the benefits *and* costs are understood.

The above 'commit;' behaviour difference - whether it completes when the commit is permanent (it definitely will be applied for certain to all replicas - it just may take time to apply to all replicas), or when the commit has actually taken effect (two-phase commit on all replicas - and both phases have completed on all replicas - what happens if the second phase of the commit fails on one or more servers?), or when the commit is guaranteed to be visible from all existing and new sessions (two-phase commit plus additional signalling required?) might be such an option. I'm doubtful, though - as the difference in implementation between the first and second is pretty significant.

I'm curious about your suggestion to direct queries that need the latest snapshot to the 'primary'. I might have misunderstood it - but it seems that the expectation from some is that *all* sessions see the latest snapshot, so would this not imply that all sessions would be redirected to the 'primary'? I don't think it is reasonable myself, but I might be misunderstanding something...

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>
> We can make the reply to a commit message when any of the following
> events have occurred
>
> 1. We sent the message to standby
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file
> 4. We fsync'd the WAL file
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record

Also:

0. The same time we would have done so if replication had not been configured at all.

I think the basic problem here is that we can talk about "asynchronous replication" and "synchronous replication", but there are n>2 possible/useful behaviors (I would guess principally 0, 2, 4, and 6, but YMMV). So we're going to need some way to clarify what we mean.

BTW, in case my previous emails on this topic might have given someone the contrary impression, I'm not really that worked up about this either. Interesting? Yes. Have opinions? Yes. Lie awake nights worrying about it? Nope. :-)

...Robert
Mark Mielke wrote:
> Forget replication - even for the exact same server - I don't expect
> that if I commit from one session, I will be able to see the change
> immediately from my other session or a new session that I just opened.
> Perhaps this is often stable to rely on this, and it is useful for the
> database server to minimize the window during which the commit becomes
> visible to others, but I think it's a false expectation from the start
> that it absolutely will be immediately visible to another session. I'm
> thinking of situations where some part of the table is in cache. The
> only way the commit can communicate that the new transaction is
> available is by during communication between the processes or threads,
> or between the multiple CPUs on the machine. Do I want every commit to
> force each session to become fully in alignment before my commit
> completes? Does PostgreSQL make this guarantee today? I bet it doesn't
> if you look far enough into the guts. It might be very fast - I don't
> think it is infinitely fast.

FYI: I haven't been able to prove this. Multiple sessions running on my dual-core CPU seem to be able to see the latest commits before they begin executing. Am I wrong about this? Does PostgreSQL provide an intentional guarantee that a commit from one session that completes immediately followed by a query from another session will always find the commit effect visible (provided the transaction isolation level doesn't get in the way)? Or is the machine and algorithms just fast enough that by the time it executes the query (up to 1 ms later) the commit is always visible in practice?

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>
Mark Mielke wrote:
> Mark Mielke wrote:
>> Forget replication - even for the exact same server - I don't expect
>> that if I commit from one session, I will be able to see the change
>> immediately from my other session or a new session that I just opened.
>> Perhaps this is often stable to rely on this, and it is useful for the
>> database server to minimize the window during which the commit becomes
>> visible to others, but I think it's a false expectation from the start
>> that it absolutely will be immediately visible to another session. I'm
>> thinking of situations where some part of the table is in cache. The
>> only way the commit can communicate that the new transaction is
>> available is by during communication between the processes or threads,
>> or between the multiple CPUs on the machine. Do I want every commit to
>> force each session to become fully in alignment before my commit
>> completes? Does PostgreSQL make this guarantee today? I bet it doesn't
>> if you look far enough into the guts. It might be very fast - I don't
>> think it is infinitely fast.
>
> FYI: I haven't been able to prove this. Multiple sessions running on my
> dual-core CPU seem to be able to see the latest commits before they
> begin executing. Am I wrong about this? Does PostgreSQL provide an
> intentional guarantee that a commit from one session that completes
> immediately followed by a query from another session will always find
> the commit effect visible (provided the transaction isolation level
> doesn't get in the way)?

Yes. PostgreSQL does guarantee that, and I would expect any other DBMS to do the same.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Hi,

On 14 Dec 2008, at 16:48, Simon Riggs wrote:
> I am truly lost to understand why the *name* "synchronous replication"
> causes so much discussion, yet nobody has discussed what they would
> actually like the software to *do* (this being a software discussion
> list...). AFAICS we can make the software behave like *any* of the
> definitions discussed so far.

It seems that the easy parts are the ones more people will participate in. Maybe it's that simple.

> We can make the reply to a commit message when any of the following
> events have occurred
>
> 1. We sent the message to standby
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file
> 4. We fsync'd the WAL file
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record

Ok, so let's talk about this easy part: my understanding of "synchronous replication" is that it gives its users the strong guarantee that at commit time the transaction is secured on the slave(s). That means you get the D of ACID on more than one server. Why synchronous? Because you know the durability is ensured exactly when you receive the COMMIT ack.

So I'm with Simon on this: the term Synchronous Replication does describe accurately what's being implemented here, and on the other hand, as so many of us are saying, it's true that it tells very little about it. Those 6 options are all in the scope of the infamous naming, just different guarantee levels, from almost strong to very strong, with some "almost, but not quite, entirely unlike the strong I want". Pick your naming here too.

At least, that's how I'm understanding this; the bottom line of why I care to send this email is that maybe it'll help some people to recover from sleep deprivation ;)

My 2¢,
-- 
dim
Robert Haas wrote:
>> We can make the reply to a commit message when any of the following
>> events have occurred
>>
>> 1. We sent the message to standby
>> 2. We received the message on standby
>> 3. We wrote the WAL to the WAL file
>> 4. We fsync'd the WAL file
>> 5. We CRC checked the WAL commit record
>> 6. We applied the WAL commit record

Perhaps it'd be useful if the failure modes these are trying to protect against were described too. If I understand right:

1. Protects all the transactions from the failure of the master; so long as neither the network nor the slave machine die soon?
2. Protects all the transactions from the failure of the master and the network between the slave and master, so long as the slave doesn't die soon?
3. Same as #2?
4. Protects against the failure of the master, the network, and parts of the slave; so long as the slave's disk survives the failure?
5. Protects against all of the above, and bit-errors in the memories of the slave machine (except the slave's disk controller?)? Or are we reading back the CRC from the slave's disk and comparing to the CRC computed on the master, where it might protect from even more?
6. Same as #4?

If this is right, #2, #3, #4, and #6 feel similar except that they're protecting against failures of different (but still all incomplete) subsets of the hardware on the slave, right?
Heikki Linnakangas wrote:
> Mark Mielke wrote:
>> FYI: I haven't been able to prove this. Multiple sessions running on
>> my dual-core CPU seem to be able to see the latest commits before
>> they begin executing. Am I wrong about this? Does PostgreSQL provide
>> an intentional guarantee that a commit from one session that completes
>> immediately followed by a query from another session will always find
>> the commit effect visible (provided the transaction isolation level
>> doesn't get in the way)?
> Yes. PostgreSQL does guarantee that, and I would expect any other DBMS
> to do the same.

Where does the expectation come from? I don't recall ever reading it in the documentation, and unless the session processes are contending over the integers (using some sort of synchronization primitive) in memory that represent the "latest visible commit" on every single select, I'm wondering how it is accomplished? If they are contending over these integers, doesn't that represent a scaling limitation, in the sense that on a 32-core machine, they're going to be fighting with each other to get the latest version of these shared integers into the CPU for processing? Maybe it's such a small penalty that we don't care? :-)

I was never instilled with the logic that 'commit in one session guarantees visibility of the effects in another session'. But, as I say above, I wasn't able to make PostgreSQL "fail" in this regard. So maybe I have no clue what I am talking about? :-)

If you happen to know where the code or documentation makes this promise, feel free to point it out. I'd like to review the code. If you don't know - don't worry about it, I'll find it later...

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>
When the database says the data is committed it has to mean the data is really committed. Imagine if you looked at a bank account balance after withdrawing all the money and saw a balance which didn't reflect the withdrawal and allowed you to withdraw more money again...

-- 
Greg

On 14 Dec 2008, at 14:44, Mark Mielke <mark@mark.mielke.cc> wrote:
> Mark Mielke wrote:
>> Forget replication - even for the exact same server - I don't expect
>> that if I commit from one session, I will be able to see the change
>> immediately from my other session or a new session that I just
>> opened. Perhaps this is often stable to rely on this, and it is
>> useful for the database server to minimize the window during which
>> the commit becomes visible to others, but I think it's a false
>> expectation from the start that it absolutely will be immediately
>> visible to another session. I'm thinking of situations where some
>> part of the table is in cache. The only way the commit can
>> communicate that the new transaction is available is by during
>> communication between the processes or threads, or between the
>> multiple CPUs on the machine. Do I want every commit to force each
>> session to become fully in alignment before my commit completes?
>> Does PostgreSQL make this guarantee today? I bet it doesn't if you
>> look far enough into the guts. It might be very fast - I don't
>> think it is infinitely fast.
>
> FYI: I haven't been able to prove this. Multiple sessions running on
> my dual-core CPU seem to be able to see the latest commits before
> they begin executing. Am I wrong about this? Does PostgreSQL provide
> an intentional guarantee that a commit from one session that
> completes immediately followed by a query from another session will
> always find the commit effect visible (provided the transaction
> isolation level doesn't get in the way)? Or is the machine and
> algorithms just fast enough that by the time it executes the query
> (up to 1 ms later) the commit is always visible in practice?
>
> Cheers,
> mark
> If this is right, #2, #3, #4, and #6 feel similar except
> that they're protecting against failures of different (but
> still all incomplete) subsets of the hardware on the slave, right?

Right. Actually, the biggest difference with #6 has nothing to do with protecting against failures. It has rather to do with the ease of writing applications in the context of hot standby. You can close your connection, open a connection to a different server, and know that your transactions will be reflected there. On the other hand, I'd be surprised if it didn't come with a substantial performance penalty, so it may not be too practical in real life even if it sounds good on paper.

#1, #3, and #5 don't feel that useful to me. In the case of #1, sending your WAL over the network and then not checking that it got there is sort of silly: the likelihood of packet loss on the network has got to be several orders of magnitude more likely than a failure on the master. #3 and #5 just don't seem to provide any real benefits over their immediate predecessors.

Honestly, I think the most useful thing is probably going to be asynchronous replication: in other words, when a commit is requested on the master, we write WAL and return success. In the background, we stream the WAL to a secondary, which writes it and applies it. This will give us a secondary which is mostly up to date (and can run queries, with hot standby) without killing performance. The other options are going to be for environments where losing a transaction is really, really bad, or (in the case of #6) read-mostly environments where it's useful to spread the query load out across several servers, but the overhead associated with waiting for the rare write transactions to apply everywhere is tolerable.

...Robert
Greg Stark wrote:
> When the database says the data is committed it has to mean the data
> is really committed. Imagine if you looked at a bank account balance
> after withdrawing all the money and saw a balance which didn't reflect
> the withdrawal and allowed you to withdraw more money again...

Within the same session - sure. From different sessions? PostgreSQL MVCC lets you see an older snapshot, although it does prefer to have the latest snapshot with each command.

For allowing to withdraw more money again, I would expect some sort of locking "SELECT ... FOR UPDATE;" to be used. This lock then forces the two transactions to become serialized and the second will either wait for the first to complete or fail. Any banking program that assumed that it could SELECT to confirm a balance and then UPDATE to withdraw the money as separate instructions would be a bad banking program. To exploit it, I would just have to start both operations at the same time - they both SELECT, they both see I have money, they both give me the money and UPDATE, and I get double the money (although my balance would show a big negative value - but I'm already gone...). Database 101.

When I asked for "does PostgreSQL guarantee this?" I didn't mean hand waving examples or hand waving expectations. I meant a pointer into the code that has some comment that says "we want to guarantee that a commit in one session will be immediately visible to other sessions, and that a later select issued in the other sessions will ALWAYS see the commit whether 1 nanosecond later or 200 seconds later" Robert's expectation and yours seem like taking this "guarantee" for granted rather than being justified with design intent and proof thus far. :-) Given my experiment to try and force it to fail, I can see why this would be taken for granted. Is this a real promise, though? Or just an unlikely scenario that never seems to be hit?

To me, the question is relevant in terms of the expectations of a multi-replica solution. We know people have the expectation. We know it can be convenient. Is the expectation valid in the first place?

I've probably drawn this question out too long and should do my own research and report back... Sorry... :-)

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>
Mark Mielke wrote:
> When I asked for "does PostgreSQL guarantee this?" I didn't mean hand
> waving examples or hand waving expectations. I meant a pointer into the
> code that has some comment that says "we want to guarantee that a commit
> in one session will be immediately visible to other sessions, and that a
> later select issued in the other sessions will ALWAYS see the commit
> whether 1 nanosecond later or 200 seconds later" Robert's expectation
> and yours seem like taking this "guarantee" for granted rather than
> being justified with design intent and proof thus far. :-) Given my
> experiment to try and force it to fail, I can see why this would be
> taken for granted. Is this a real promise, though?

Yes. In a nutshell, commit works like this:

1. Write and flush WAL record about the commit
2. Mark the transaction as committed in clog
3. Remove the xid from the shared memory ProcArray
4. Release locks and other resources
5. Reply to client that the transaction has been committed

After step 3, any backend taking a snapshot will see the transaction as committed. Since we only reply to the client at step 5, it is guaranteed that a transaction beginning after step 5, as well as an already open transaction taking a new snapshot (ie. running a new command in read committed mode) after that, will see the transaction as committed. The relevant code is in CommitTransaction() in xact.c.

> To me, the question is relevant in terms of the expectations of a
> multi-replica solution. We know people have the expectation.

Yeah, I think Robert is right. We should reserve the term "synchronous replication" for the mode where that guarantee holds for the slave as well.

In fact, waiting for reply from standby server before acknowledging a commit to the client is a bit pointless otherwise. It puts you in a strange situation, where you're waiting for the commits in normal operation, but if there's a network glitch or the standby goes down, you're willing to go ahead without it. You get a high guarantee that your data is up-to-date in the standby, except when it isn't. Which isn't much of a guarantee.

But with hot standby, it makes a lot of sense. The guarantee is that if the standby is accepting queries, it's up-to-date with the primary.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
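Heikki's ordering can be mirrored in a few lines of Python. This is only the shape of the argument, with stand-in structures; the real logic lives in CommitTransaction() in xact.c.

import threading

clog = {}                          # xid -> "committed" / "aborted"
proc_array = set()                 # xids currently in progress
proc_array_lock = threading.Lock()

def commit(xid, flush_wal, release_resources, reply_to_client):
    flush_wal(xid)                 # 1. commit record durable on disk
    clog[xid] = "committed"        # 2. marked committed in clog
    with proc_array_lock:
        proc_array.discard(xid)    # 3. no longer "running" to new snapshots
    release_resources(xid)         # 4. locks etc.
    reply_to_client(xid)           # 5. client learns of the commit

# Any snapshot taken after step 3 omits xid from its running set, so a
# query issued after the client has seen step 5 must see xid as committed.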
Mark Mielke wrote:
> Where does the expectation come from? I don't recall ever reading it in
> the documentation, and unless the session processes are contending over
> the integers (using some sort of synchronization primitive) in memory
> that represent the "latest visible commit" on every single select, I'm
> wondering how it is accomplished?

The "integers" you're imagining are the ProcArray. Every backend has an entry there, and among other things it contains the current XID the backend is running. When a backend takes a new snapshot (on every single select in read committed mode), it locks the ProcArray, scans all the entries and collects all the XIDs listed there in the snapshot. Those are the set of transactions that were running when the snapshot was taken, and that set is used in the visibility checks.

> If they are contending over these integers, doesn't that represent a
> scaling limitation, in the sense that on a 32-core machine, they're
> going to be fighting with each other to get the latest version of these
> shared integers into the CPU for processing? Maybe it's such a small
> penalty that we don't care? :-)

The ProcArrayLock is indeed quite busy on systems with a lot of CPUs. It's held for such short times that it's not a problem usually, but it can become a bottleneck with a machine like that with all backends running small transactions.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
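The snapshot side of that dance looks roughly like the sketch below; again invented names for illustration, not the ProcArray code itself.

import threading

proc_array = {}                    # backend id -> xid it is running, or None
proc_array_lock = threading.Lock()

def get_snapshot():
    with proc_array_lock:          # the lock Heikki describes as contended
        return {xid for xid in proc_array.values() if xid is not None}

def is_visible(xid, snapshot, clog):
    # A tuple written by xid stays invisible to any snapshot whose running
    # set contains xid, even after that transaction commits.
    return xid not in snapshot and clog.get(xid) == "committed"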
On Sun, 2008-12-14 at 12:57 -0500, Mark Mielke wrote:
> I'm curious about your suggestion to direct queries that need the
> latest snapshot to the 'primary'. I might have misunderstood it - but
> it seems that the expectation from some is that *all* sessions see the
> latest snapshot, so would this not imply that all sessions would be
> redirected to the 'primary'? I don't think it is reasonable myself,
> but I might be misunderstanding something...

I said "a snapshot taken on the primary", but the query would run on the standby.

Synchronising primary and standby so that they are identical from the perspective of a query requires some synchronisation delay. I'm pointing out that the synchronisation delay can occur

* at the time we apply WAL - which will slow down commits (i.e. #6 on my previous list of options)
* at the time we run a query that needs to see primary and standby synchronised

So the same effect can be achieved in various ways. The first way would require *all* transactions to be applied on standby, i.e. option #6 for all transactions. That is a performance disaster and I would not force that onto everybody. The second way can be done by taking a snapshot on the primary, with an associated LSN, then using that snapshot on the standby. That is somewhat complex, but possible.

I see the requirement for getting the same answer on multiple nodes as a further extension of "transaction isolation mode" and think that not all people will want this, so we should allow that as an option. I'm not going to worry about this at the moment. Hot standby will be useful without this and so I regard this as a secondary objective. Rome wasn't built in a single release, or something like that.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
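The second way Simon describes might be sketched like this, with every method an assumption made for illustration rather than a proposed API: export a snapshot and its LSN on the primary, then delay the standby query until replay has passed that LSN.

import time

def run_with_primary_snapshot(primary, standby, query):
    snapshot, lsn = primary.export_snapshot_with_lsn()
    while standby.last_replayed_lsn() < lsn:
        time.sleep(0.01)   # the synchronisation delay, paid by the reader
    return standby.run(query, snapshot=snapshot)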
On Sun, 2008-12-14 at 21:41 -0500, Robert Haas wrote:
> > If this is right, #2, #3, #4, and #6 feel similar except
> > that they're protecting against failures of different (but
> > still all incomplete) subsets of the hardware on the slave, right?
>
> Right. Actually, the biggest difference with #6 has nothing to do
> with protecting against failures. It has rather to do with the ease
> of writing applications in the context of hot standby. You can close
> your connection, open a connection to a different server, and know
> that your transactions will be reflected there. On the other hand,
> I'd be surprised if it didn't come with a substantial performance
> penalty, so it may not be too practical in real life even if it
> sounds good on paper.
>
> #1, #3, and #5 don't feel that useful to me.

Yes, looks that way for me also. Good analysis, Ron. I agree with Robert that #6 is there for other reasons.

#2 corresponds to DRBD algorithm B
#4 corresponds to DRBD algorithm C

Fujii-san, please can we incorporate those two options, rather than just one choice "synchronous_replication = on". They look like two commonly requested options.

#6 is an additional synchronization step in Hot Standby. I would say that people won't want that when they see how it performs (they probably won't want #4 either for that same reason, but that is for robustness).

Also, I would point out that the class of synch_rep is selected by the user on the primary and can vary from transaction to transaction. That is a very good thing, as far as I am concerned. We would need to enforce #6 for all transactions (if we implemented synchronisation in this way).

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> I am truly lost to understand why the *name* "synchronous replication"
> causes so much discussion, yet nobody has discussed what they would
> actually like the software to *do*

It's the color of the bikeshed ...

> We can make the reply to a commit message when any of the following
> events have occurred
>
> 1. We sent the message to standby
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file
> 4. We fsync'd the WAL file
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record

In DRBD tradition, I suggest you implement all of them, or at least factor the code so that each of them can be a one-line change. (We can probably later drop one or two options.)
> In fact, waiting for reply from standby server before acknowledging a commit > to the client is a bit pointless otherwise. It puts you in a strange > situation, where you're waiting for the commits in normal operation, but if > there's a network glitch or the standby goes down, you're willing to go > ahead without it. You get a high guarantee that your data is up-to-date in > the standby, except when it isn't. Which isn't much of a guarantee. It protects you against a catastrophic loss of the primary, which is a non-trivial consideration. At the risk of being ghoulish, imagine that you are a large financial company headquartered in the world trade center. ...Robert
It's a real promise. The reason you're getting hand-wavy answers is because it's such a basic requirement that I'm trying to point out just how fundamental a requirement it is.

If you want to see the actual code which guarantees this, take a look around the code for procarray - in particular the code for taking a snapshot. There are comments there about what locks are needed when committing and when taking a snapshot and why. But it's quite technical.

-- 
Greg

On 15 Dec 2008, at 02:03, Mark Mielke <mark@mark.mielke.cc> wrote:
> Greg Stark wrote:
>> When the database says the data is committed it has to mean the
>> data is really committed. Imagine if you looked at a bank account
>> balance after withdrawing all the money and saw a balance which
>> didn't reflect the withdrawal and allowed you to withdraw more
>> money again...
>
> Within the same session - sure. From different sessions? PostgreSQL
> MVCC lets you see an older snapshot, although it does prefer to
> have the latest snapshot with each command.
>
> For allowing to withdraw more money again, I would expect some sort
> of locking "SELECT ... FOR UPDATE;" to be used. This lock then
> forces the two transactions to become serialized and the second will
> either wait for the first to complete or fail. Any banking program
> that assumed that it could SELECT to confirm a balance and then
> UPDATE to withdraw the money as separate instructions would be a bad
> banking program. To exploit it, I would just have to start both
> operations at the same time - they both SELECT, they both see I have
> money, they both give me the money and UPDATE, and I get double the
> money (although my balance would show a big negative value - but I'm
> already gone...). Database 101.
>
> When I asked for "does PostgreSQL guarantee this?" I didn't mean
> hand waving examples or hand waving expectations. I meant a pointer
> into the code that has some comment that says "we want to guarantee
> that a commit in one session will be immediately visible to other
> sessions, and that a later select issued in the other sessions will
> ALWAYS see the commit whether 1 nanosecond later or 200 seconds
> later" Robert's expectation and yours seem like taking this
> "guarantee" for granted rather than being justified with design
> intent and proof thus far. :-) Given my experiment to try and force
> it to fail, I can see why this would be taken for granted. Is this a
> real promise, though? Or just an unlikely scenario that never seems
> to be hit?
>
> To me, the question is relevant in terms of the expectations of a
> multi-replica solution. We know people have the expectation. We know
> it can be convenient. Is the expectation valid in the first place?
>
> I've probably drawn this question out too long and should do my own
> research and report back... Sorry... :-)
>
> Cheers,
> mark
Robert Haas wrote:
>> In fact, waiting for reply from standby server before acknowledging a
>> commit to the client is a bit pointless otherwise. It puts you in a
>> strange situation, where you're waiting for the commits in normal
>> operation, but if there's a network glitch or the standby goes down,
>> you're willing to go ahead without it. You get a high guarantee that
>> your data is up-to-date in the standby, except when it isn't. Which
>> isn't much of a guarantee.
>
> It protects you against a catastrophic loss of the primary, which is a
> non-trivial consideration. At the risk of being ghoulish, imagine
> that you are a large financial company headquartered in the world
> trade center.

So you'd want all commits to wait until the transaction is safely replicated in the standby. But if there's a network glitch, or the standby is restarted, you're happy to reply to the client that it's committed if it's only safely committed in the primary. Essentially, you wait for the reply as long as the standby responds within X seconds, but if it takes more than Y seconds, you don't wait. I know that people do that, but it seems counterintuitive to me. In that case, when the primary acks the transaction as committed, you only know that it's safely committed in the primary; it doesn't give any hard guarantee about the state in the standby.

But when you consider the possibility to use the standby for queries, the synchronous mode makes sense too. I'm not opposed to providing all the options, but the synchronous mode where we can guarantee that if you query the standby, you will see the effects of all transactions committed in the primary, makes the synchronous mode much more interesting. If you don't need that property, you're most likely more happy with asynchronous mode anyway.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
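The "wait, but not forever" behaviour being debated reduces to something like the sketch below; the names are invented, and a real implementation would live in the backend's commit path rather than in Python.

import threading

def wait_for_standby_ack(ack_event, timeout_s, warn):
    # Block the committing backend until the standby acknowledges,
    # or fall back to effectively-asynchronous operation on timeout.
    if ack_event.wait(timeout=timeout_s):
        return "synchronous"   # ack arrived in time: the normal case
    warn("standby did not acknowledge within timeout; committing locally only")
    return "degraded"          # reply to the client anyway

Here ack_event would be a threading.Event set by whatever thread reads the standby's acknowledgments.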
* Robert Haas <robertmhaas@gmail.com> [081215 07:32]:
> > In fact, waiting for reply from standby server before acknowledging a
> > commit to the client is a bit pointless otherwise. It puts you in a
> > strange situation, where you're waiting for the commits in normal
> > operation, but if there's a network glitch or the standby goes down,
> > you're willing to go ahead without it. You get a high guarantee that
> > your data is up-to-date in the standby, except when it isn't. Which
> > isn't much of a guarantee.
>
> It protects you against a catastrophic loss of the primary, which is a
> non-trivial consideration. At the risk of being ghoulish, imagine
> that you are a large financial company headquartered in the world
> trade center.

This was exactly my original point - I want the transaction durably on the slave before the commit is acknowledged (to build as much local redundancy as I can), but I certainly *don't* want to lose the ability to use WAL archiving, because I ship my WAL off-site too...

The ability to have an extra local copy is good. But I'm certainly not going to want to give up my off-site backup/WAL for it...

a.

-- 
Aidan Van Dyk             Create like a god,
aidan@highrise.ca         command like a king,
http://www.highrise.ca/   work like a slave.
> So you'd want all commits to wait until the transaction is safely
> replicated in the standby. But if there's a network glitch, or the
> standby is restarted, you're happy to reply to the client that it's
> committed if it's only safely committed in the primary. Essentially,
> you wait for the reply as long as the standby responds within X
> seconds, but if it takes more than Y seconds, you don't wait. I know
> that people do that, but it seems counterintuitive to me. In that
> case, when the primary acks the transaction as committed, you only
> know that it's safely committed in the primary; it doesn't give any
> hard guarantee about the state in the standby.

I understand your point, but I think there's still a use case. The idea is that declaring the secondary dead is a rare event, and there's some mechanism by which you're enabled to page your network staff, and they hightail it into the office to fix the problem. It might not be the way that you want to run your system, but I don't think it's unreasonable for someone else to want it.

> But when you consider the possibility to use the standby for queries,
> the synchronous mode makes sense too. I'm not opposed to providing
> all the options, but the synchronous mode where we can guarantee that
> if you query the standby, you will see the effects of all transactions
> committed in the primary, makes the synchronous mode much more
> interesting. If you don't need that property, you're most likely more
> happy with asynchronous mode anyway.

I agree that asynchronous mode will be the right solution for a very large subset of our users.

...Robert
Fujii-san,

Just repeating this in case you lost this comment:

On Mon, 2008-12-15 at 09:40 +0000, Simon Riggs wrote:
> Fujii-san, please can we incorporate those two options, rather than
> just one choice "synchronous_replication = on". They look like two
> commonly requested options.

I see the comment at line 230+ of walreceiver.c, so understand that you have implemented option #3 from the following list. So from my previous list:

1. We sent the message to standby (A)
2. We received the message on standby
3. We wrote the WAL to the WAL file (B)
4. We fsync'd the WAL file (C)
5. We CRC checked the WAL commit record
6. We applied the WAL commit record

Please could you also add an option #4, i.e. add the *option* to fsync the WAL to disk at commit time also. That requires us to add a third option to the synchronous_replication parameter. That then means we will have robustness options that map directly to DRBD algorithms A, B and C (shown in brackets in the above list). I believe these also map to the Data Guard options Maximum Performance and Maximum Availability.

AFAICS if we implement the additional items I've requested over the last few days, then the architecture is now at a good point for 8.4 and we can begin to look at low-level implementation details. Or put another way, I'm not expecting to come up with more architecture changes.

> #6 is an additional synchronization step in Hot Standby. I would say
> that people won't want that when they see how it performs (they
> probably won't want #4 either for that same reason, but that is for
> robustness).

We can jointly add option #6 once we have both sync rep and hot standby committed, or at a late stage of hot standby development. There's not much point looking at it before then.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Mon, 2008-12-15 at 09:19 -0500, Robert Haas wrote:
> I understand your point, but I think there's still a use case. The
> idea is that declaring the secondary dead is a rare event, and there's
> some mechanism by which you're enabled to page your network staff, and
> they hightail it into the office to fix the problem. It might not be
> the way that you want to run your system, but I don't think it's
> unreasonable for someone else to want it.

Agreed: there's an analogy to RAID here. When a disk goes out, it still allows writes, but moves to a degraded state. Hopefully your monitoring system notifies you, and you fix it.

Also, let's say that the standby suffers catastrophic storage failure. Now you only have your data on one server anyway (the primary). Rejecting new transactions from committing doesn't save all the old transactions in the event of a subsequent storage failure on the primary.

I'm not advocating this option in particular, other than saying that it seems like a reasonable option to me.

Regards,
Jeff Davis
Peter Eisentraut wrote:
> Simon Riggs wrote:
>> I am truly lost to understand why the *name* "synchronous replication"
>> causes so much discussion, yet nobody has discussed what they would
>> actually like the software to *do*
>
> It's the color of the bikeshed ...

Hmmm. I thought this was pretty clear. There's three levels of synch which are useful features:

1) "synchronous" standby which is really asynchronous, but only has a gap of < 100ms.

2) Synchronous standby which guarantees that all committed transactions are on the failover node and that no data will be lost for failover, but the failover node is still in standby mode.

3) Synchronous replication where the standby node has identical transactions to the master node, and is queryable read-only.

Any of these levels would be useful and allow a certain number of our users to deploy PostgreSQL in an environment where it wasn't used before. So if we can only do (2) for 8.4, that's still very useful for telecoms and banks.

--Josh
Josh Berkus wrote:
> Hmmm. I thought this was pretty clear. There's three levels of synch
> which are useful features:
>
> 1) "synchronous" standby which is really asynchronous, but only has a
> gap of < 100ms.
>
> 2) Synchronous standby which guarantees that all committed
> transactions are on the failover node and that no data will be lost
> for failover, but the failover node is still in standby mode.
>
> 3) Synchronous replication where the standby node has identical
> transactions to the master node, and is queryable read-only.
>
> Any of these levels would be useful....

Isn't the "queryable read-only" feature totally orthogonal with how synchronous the replication is?

For one reporting system I have, where new data is continually being added every second, I'd love to have a read-only slave even if that system has the "100ms" gap you mentioned in #1. Heck, I don't care if the queries it runs even have a 100 *minute* gap; but I sure would like it to be synchronous in the sense that all the transactions survive a failure of the primary.
> Isn't the "queryable read-only" feature totally orthogonal with > how synchronous the replication is? Yes. However, it introduces specific difficult issues which an unreadable synchronous slave does not have. --Josh
On Mon, 2008-12-15 at 13:43 -0800, Josh Berkus wrote:
> > Isn't the "queryable read-only" feature totally orthogonal with
> > how synchronous the replication is?
>
> Yes. However, it introduces specific difficult issues which an
> unreadable synchronous slave does not have.

Don't think it's hugely difficult, but there are multiple ways of doing this. But it is irrelevant until we have the basic ability to run queries.

I've explained this twice now on different parts of this thread. Could I politely direct your attention to those posts?

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon, > I've explained this twice now on different parts of this thread. Could I > politely direct your attention to those posts? Chill. I was just explaining that the *goal* of sync standby was not complicated or really something to be argued about. It's pretty clear. --Josh
On Mon, 2008-12-15 at 13:06 -0800, Josh Berkus wrote:
> Peter Eisentraut wrote:
> > Simon Riggs wrote:
> >> I am truly lost to understand why the *name* "synchronous
> >> replication" causes so much discussion, yet nobody has discussed
> >> what they would actually like the software to *do*
> >
> > It's the color of the bikeshed ...
>
> Hmmm. I thought this was pretty clear. There's three levels of synch
> which are useful features:
>
> 1) "synchronous" standby which is really asynchronous, but only has a
> gap of < 100ms.
>
> 2) Synchronous standby which guarantees that all committed
> transactions are on the failover node and that no data will be lost
> for failover, but the failover node is still in standby mode.
>
> 3) Synchronous replication where the standby node has identical
> transactions to the master node, and is queryable read-only.
>
> Any of these levels would be useful and allow a certain number of our
> users to deploy PostgreSQL in an environment where it wasn't used
> before. So if we can only do (2) for 8.4, that's still very useful
> for telecoms and banks.

The (2) mentioned here could be any of sync points #2-5 referred to upthread. Different people have requested different levels of robustness. Looking at DRBD and Oracle, they both subdivide (2) into at least two further levels of option. So (2) is too broad a brush to paint with.

I don't believe that (2) as stated is sufficient for banks, though it is reasonable for many telco applications. But #4 or #5 would be suitable for banks, i.e. we must fsync to disk for very high value transactions. The extra code to do this is minor, which is why I've asked Fujii-san to include it now within the patch.

All of this is controllable by the parameter synchronous_replication, which it is important to note can be set for each individual transaction rather than just fixed for the whole server. This is identical to the way we can mix synchronous commit and asynchronous commit transactions.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi,

Sorry for this late reply. And, thanks for the hot discussion ;)

On Tue, Dec 16, 2008 at 1:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> Fujii-san,
>
> Just repeating this in case you lost this comment:
>
> On Mon, 2008-12-15 at 09:40 +0000, Simon Riggs wrote:
>
>> Fujii-san, please can we incorporate those two options, rather than
>> just one choice "synchronous_replication = on". They look like two
>> commonly requested options.
>
> I see the comment at line 230+ of walreceiver.c, so understand that you
> have implemented option #3 from the following list.
>
> So from my previous list
>
> 1. We sent the message to standby (A)
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file (B)
> 4. We fsync'd the WAL file (C)
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record
>
> Please could you also add an option #4, i.e. add the *option* to fsync
> the WAL to disk at commit time also. That requires us to add a third
> option to the synchronous_replication parameter.

The above option should be configured on the primary? or standby? The primary is suitable to vary it from transaction to transaction. On the other hand, it should be configured on the standby in order to choose it for every standby (in the future).

I prefer the latter, and thought that it should be added into recovery.conf. I mean, synchronous_replication identifies only whether commit waits for replication (if the name is confusing, I would rename it). The above options (#1-#6) are chosen in recovery.conf. What is your opinion?

>> #6 is an additional synchronization step in Hot Standby. I would say
>> that people won't want that when they see how it performs (they
>> probably won't want #4 either for that same reason, but that is for
>> robustness).
>
> We can jointly add option #6 once we have both sync rep and hot standby
> committed, or at a late stage of hot standby development. There's not
> much point looking at it before then.

Agreed.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, 2008-12-16 at 12:36 +0900, Fujii Masao wrote:

> > So from my previous list
> >
> > 1. We sent the message to standby (A)
> > 2. We received the message on standby
> > 3. We wrote the WAL to the WAL file (B)
> > 4. We fsync'd the WAL file (C)
> > 5. We CRC checked the WAL commit record
> > 6. We applied the WAL commit record
> >
> > Please could you also add an option #4, i.e. add the *option* to fsync
> > the WAL to disk at commit time also. That requires us to add a third
> > option to the synchronous_replication parameter.
>
> The above option should be configured on the primary? or standby?
> The primary is suitable to vary it from transaction to transaction. On
> the other hand, it should be configured on the standby in order to
> choose it for every standby (in the future).
>
> I prefer the latter, and thought that it should be added into
> recovery.conf. I mean, synchronous_replication identifies only whether
> commit waits for replication (if the name is confusing, I would rename
> it). The above options (#1-#6) are chosen in recovery.conf. What is
> your opinion?

No, we've been through that loop already a few months back: Transaction-controlled robustness.

It should be up to the client on the primary to decide how much waiting they would like to perform in order to provide a guarantee. A change of setting on the standby should not be allowed to alter the performance or durability on the primary.

My perspective is that synchronous_replication specifies how long to wait. Current settings are "off" (don't wait) or "on" (meaning wait until point #3). So I think we should change this to a list of options to allow people to more carefully select how much waiting is required.

This feature is then analogous to the way synchronous_commit works. It also provides a level of application control not seen in any other RDBMS in the industry, which makes it very suitable for large and important applications that need a fine mix of robustness and performance.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi,

On Tue, Dec 16, 2008 at 7:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Tue, 2008-12-16 at 12:36 +0900, Fujii Masao wrote:
>
>> > So from my previous list
>> >
>> > 1. We sent the message to standby (A)
>> > 2. We received the message on standby
>> > 3. We wrote the WAL to the WAL file (B)
>> > 4. We fsync'd the WAL file (C)
>> > 5. We CRC checked the WAL commit record
>> > 6. We applied the WAL commit record
>> >
>> > Please could you also add an option #4, i.e. add the *option* to fsync
>> > the WAL to disk at commit time also. That requires us to add a third
>> > option to the synchronous_replication parameter.
>>
>> The above option should be configured on the primary? or standby?
>> The primary is suitable to vary it from transaction to transaction. On
>> the other hand, it should be configured on the standby in order to
>> choose it for every standby (in the future).
>>
>> I prefer the latter, and thought that it should be added into
>> recovery.conf. I mean, synchronous_replication identifies only whether
>> commit waits for replication (if the name is confusing, I would rename
>> it). The above options (#1-#6) are chosen in recovery.conf. What is
>> your opinion?
>
> No, we've been through that loop already a few months back:
> Transaction-controlled robustness.
>
> It should be up to the client on the primary to decide how much waiting
> they would like to perform in order to provide a guarantee. A change of
> setting on the standby should not be allowed to alter the performance
> or durability on the primary.

OK. I will extend synchronous_replication, make walsender send XLOG with synchronization mode flag and make walreceiver perform according to the flag.

> My perspective is that synchronous_replication specifies how long to
> wait. Current settings are "off" (don't wait) or "on" (meaning wait
> until point #3). So I think we should change this to a list of options
> to allow people to more carefully select how much waiting is required.

In the latest patch, "off" keeps us waiting for replication in some cases, e.g. forceSyncCommit = true. This is analogous to the way synchronous_commit works. When "off" keeps us waiting for replication, which option (#1-#6) should we choose? Should it be user-configurable (though the parameter values are doubled)? hardcode #3? "off" always should not keep us waiting for replication?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
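What Fujii-san describes could look roughly like this on the wire: the walsender tags each shipped commit with the level the committing transaction asked for, and the walreceiver acks once it reaches that point. The one-byte message layout is invented for the sketch.

import struct

LEVELS = {"recv": 2, "write": 3, "fsync": 4}   # matching points #2-#4 above

def pack_commit(wal_bytes, sync_level):
    # one byte of sync mode, then the WAL payload
    return struct.pack("!B", LEVELS[sync_level]) + wal_bytes

def unpack_commit(msg):
    (level,) = struct.unpack_from("!B", msg)
    return level, msg[1:]   # receiver acks once it reaches this point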
On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote: > OK. I will extend synchronous_replication, make walsender send XLOG > with synchronization mode flag and make walreceiver perform according > to the flag. Sounds good. > > My perspective is that synchronous_replication specifies how long to > > wait. Current settings are "off" (don't wait) or "on" (meaning wait > > until point #3). So I think we should change this to a list of options > > to allow people to more carefully select how much waiting is required. > > In the latest patch, "off" keeps us waiting for replication in some > cases, e.g. forceSyncCommit = true. This is analogous to the way > synchronous_commit works. When "off" keeps us waiting for > replication, which option (#1-#6) should we choose? Should it be > user-configurable (though the parameter values are doubled)? > hardcode #3? "off" always should not keep us waiting for > replication? I would hard code #4, i.e. make it fsync, so that DDL changes are regarded as "high value transactions". A parameter sounds like overkill. We'd need to explain what forceSyncCommit does to users then, which is easier to avoid. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Hi, Thanks for the helpful comments! On Wed, Dec 17, 2008 at 8:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote: > >> OK. I will extend synchronous_replication, make walsender send XLOG >> with synchronization mode flag and make walreceiver perform according >> to the flag. > > Sounds good. > >> > My perspective is that synchronous_replication specifies how long to >> > wait. Current settings are "off" (don't wait) or "on" (meaning wait >> > until point #3). So I think we should change this to a list of options >> > to allow people to more carefully select how much waiting is required. >> >> In the latest patch, "off" keeps us waiting for replication in some >> cases, e.g. forceSyncCommit = true. This is analogous to the way >> synchronous_commit works. When "off" keeps us waiting for >> replication, which option (#1-#6) should we choose? Should it be >> user-configurable (though the parameter values are doubled)? >> hardcode #3? "off" always should not keep us waiting for >> replication? > > I would hard code #4, i.e. make it fsync, so that DDL changes are > regarded as "high value transactions". > > A parameter sounds like overkill. We'd need to explain what > forceSyncCommit does to users then, which is easier to avoid. Agreed, I also think that hard code is better. But I'm nervous that "off" keeps us waiting for replication in cases other than DDL, e.g. flush buffer, truncate clog, checkpoint.. etc. synchronous_replication = off is quite similar to synchronous_commit = off. If we would hard code #4, the performance might degrade although it's asynchronous replication. So, I'd like to hard code #3. What is your opinion? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, 2008-12-18 at 11:03 +0900, Fujii Masao wrote: > Hi, > > Thanks for the helpful comments! > > On Wed, Dec 17, 2008 at 8:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > > On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote: > > > >> OK. I will extend synchronous_replication, make walsender send XLOG > >> with synchronization mode flag and make walreceiver perform according > >> to the flag. > > > > Sounds good. > > > >> > My perspective is that synchronous_replication specifies how long to > >> > wait. Current settings are "off" (don't wait) or "on" (meaning wait > >> > until point #3). So I think we should change this to a list of options > >> > to allow people to more carefully select how much waiting is required. > >> > >> In the latest patch, "off" keeps us waiting for replication in some > >> cases, e.g. forceSyncCommit = true. This is analogous to the way > >> synchronous_commit works. When "off" keeps us waiting for > >> replication, which option (#1-#6) should we choose? Should it be > >> user-configurable (though the parameter values are doubled)? > >> hardcode #3? "off" always should not keep us waiting for > >> replication? > > > > I would hard code #4, i.e. make it fsync, so that DDL changes are > > regarded as "high value transactions". > > > > A parameter sounds like overkill. We'd need to explain what > > forceSyncCommit does to users then, which is easier to avoid. > > Agreed, I also think that hard code is better. But I'm nervous that "off" > keeps us waiting for replication in cases other than DDL, e.g. flush > buffer, truncate clog, checkpoint.. etc. synchronous_replication = off > is quite similar to synchronous_commit = off. If we would hard code #4, > the performance might degrade although it's asynchronous replication. > So, I'd like to hard code #3. What is your opinion? We don't do that when we flush buffer, truncate clog or checkpoint, not sure why you mention those. We ForceSyncCommit when we * VACUUM FULL * CREATE/DROP DATABASE or USER * Create/Drop Tablespace I don't see a problem in forcing an fsync for those. I will sleep safer knowing those guys are on disk even in async mode. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
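For reference, the commit-time decision under discussion looks roughly like the fragment below, loosely modelled on RecordTransactionCommit() in xact.c; the replication wait at the end is a hypothetical addition showing where hard-coding #4 for forceSyncCommit would land:

    if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0)
    {
        /* Synchronous commit: flush the commit record's WAL locally. */
        XLogFlush(XactLastRecEnd);

        /*
         * Hypothetical: even with synchronous_replication = off, treat
         * forceSyncCommit transactions as "high value" and wait for the
         * standby to fsync (#4), per the suggestion above.
         */
        if (forceSyncCommit)
            WaitForReplication(XactLastRecEnd, REPLICATION_WAIT_FSYNC);
    }
    else
    {
        /* Asynchronous commit: the WAL writer will flush it later. */
        XLogSetAsyncCommitLSN(XactLastRecEnd);
    }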
Hi, On Thu, Dec 18, 2008 at 11:19 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Thu, 2008-12-18 at 11:03 +0900, Fujii Masao wrote: >> Hi, >> >> Thanks for the helpful comments! >> >> On Wed, Dec 17, 2008 at 8:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> > >> > On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote: >> > >> >> OK. I will extend synchronous_replication, make walsender send XLOG >> >> with synchronization mode flag and make walreceiver perform according >> >> to the flag. >> > >> > Sounds good. >> > >> >> > My perspective is that synchronous_replication specifies how long to >> >> > wait. Current settings are "off" (don't wait) or "on" (meaning wait >> >> > until point #3). So I think we should change this to a list of options >> >> > to allow people to more carefully select how much waiting is required. >> >> >> >> In the latest patch, "off" keeps us waiting for replication in some >> >> cases, e.g. forceSyncCommit = true. This is analogous to the way >> >> synchronous_commit works. When "off" keeps us waiting for >> >> replication, which option (#1-#6) should we choose? Should it be >> >> user-configurable (though the parameter values are doubled)? >> >> hardcode #3? "off" always should not keep us waiting for >> >> replication? >> > >> > I would hard code #4, i.e. make it fsync, so that DDL changes are >> > regarded as "high value transactions". >> > >> > A parameter sounds like overkill. We'd need to explain what >> > forceSyncCommit does to users then, which is easier to avoid. >> >> Agreed, I also think that hard code is better. But I'm nervous that "off" >> keeps us waiting for replication in cases other than DDL, e.g. flush >> buffer, truncate clog, checkpoint.. etc. synchronous_replication = off >> is quite similar to synchronous_commit = off. If we would hard code #4, >> the performance might degrade although it's asynchronous replication. >> So, I'd like to hard code #3. What is your opinion? > > We don't do that when we flush buffer, truncate clog or checkpoint, not > sure why you mention those. > > We ForceSyncCommit when we > * VACUUM FULL > * CREATE/DROP DATABASE or USER > * Create/Drop Tablespace > > I don't see a problem in forcing an fsync for those. I will sleep safer > knowing those guys are on disk even in async mode. If my understanding is correct, XLOG flush is forced up to buffer's LSN when flushing buffer even if asynchronous commit case. Am I missing something? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, 2008-12-18 at 12:08 +0900, Fujii Masao wrote: > >> Agreed, I also think that hard code is better. But I'm nervous that "off" > >> keeps us waiting for replication in cases other than DDL, e.g. flush > >> buffer, truncate clog, checkpoint.. etc. synchronous_replication = off > >> is quite similar to synchronous_commit = off. If we would hard code #4, > >> the performance might degrade although it's asynchronous replication. > >> So, I'd like to hard code #3. What is your opinion? > > > > We don't do that when we flush buffer, truncate clog or checkpoint, not > > sure why you mention those. > > > > We ForceSyncCommit when we > > * VACUUM FULL > > * CREATE/DROP DATABASE or USER > > * Create/Drop Tablespace > > > > I don't see a problem in forcing an fsync for those. I will sleep safer > > knowing those guys are on disk even in async mode. > > If my understanding is correct, XLOG flush is forced up to buffer's LSN > when flushing buffer even if asynchronous commit case. Am I missing > something? Yes, please check the call points for ForceSyncCommit. Do I think every xlog flush should be synchronous, no, I don't. That's why we have a user settable parameter for it. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Hi, On Thu, Dec 18, 2008 at 6:35 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Thu, 2008-12-18 at 12:08 +0900, Fujii Masao wrote: > >> >> Agreed, I also think that hard code is better. But I'm nervous that "off" >> >> keeps us waiting for replication in cases other than DDL, e.g. flush >> >> buffer, truncate clog, checkpoint.. etc. synchronous_replication = off >> >> is quite similar to synchronous_commit = off. If we would hard code #4, >> >> the performance might degrade although it's asynchronous replication. >> >> So, I'd like to hard code #3. What is your opinion? >> > >> > We don't do that when we flush buffer, truncate clog or checkpoint, not >> > sure why you mention those. >> > >> > We ForceSyncCommit when we >> > * VACUUM FULL >> > * CREATE/DROP DATABASE or USER >> > * Create/Drop Tablespace >> > >> > I don't see a problem in forcing an fsync for those. I will sleep safer >> > knowing those guys are on disk even in async mode. >> >> If my understanding is correct, XLOG flush is forced up to buffer's LSN >> when flushing buffer even if asynchronous commit case. Am I missing >> something? > > Yes, please check the call points for ForceSyncCommit. > > Do I think every xlog flush should be synchronous, no, I don't. That's > why we have a user settable parameter for it. Umm.. I focus attention on XLogFlush() called except RecordTransactionCommit(). For example, FlushBuffer(), WriteTruncateXlogRec().. etc. These XLogFlush() might flush XLOG synchronously even if asynchronous commit case. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, 2008-12-19 at 09:43 +0900, Fujii Masao wrote: > > Yes, please check the call points for ForceSyncCommit. > > > > Do I think every xlog flush should be synchronous, no, I don't. > That's why we have a user settable parameter for it. > > Umm.. I focus attention on XLogFlush() called except > RecordTransactionCommit(). > For example, FlushBuffer(), WriteTruncateXlogRec().. etc. These > XLogFlush() might > flush XLOG synchronously even if asynchronous commit case. XLogFlush() flushes because of an interlock between a dirty buffer write and an outstanding WAL write. Dirty buffer writes are not replicated, so there is no need to have a similar interlock on WAL streaming. So making those call points synchronous is possible, but neither necessary or IMHO desirable. On a related but different point: We don't need an interlock between dirty buffers and WAL during recovery because the WAL has already been written. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
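Much simplified, the interlock in question sits in FlushBuffer() in bufmgr.c. The local XLogFlush() enforces the "WAL before data" rule; the final comment marks the spot where a cross-server wait could go, and why it isn't needed:

    /*
     * Force the WAL describing this page to disk before writing the page
     * itself, so crash recovery can never find a data page whose WAL is
     * missing ("WAL before data").
     */
    XLogRecPtr  recptr = BufferGetLSN(buf);  /* newest WAL record touching
                                              * this page */
    XLogFlush(recptr);                       /* local flush: required */
    /* ... smgrwrite() then writes out the dirty page ... */

    /*
     * No equivalent wait on the WAL stream is needed here: the standby
     * rebuilds its pages from WAL alone, never from shipped data pages.
     */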
Simon Riggs wrote: > On a related but different point: We don't need an interlock between > dirty buffers and WAL during recovery because the WAL has already been > written. Assuming the WAL has also been fsync'd. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2008-12-19 at 11:04 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On a related but different point: We don't need an interlock between > > dirty buffers and WAL during recovery because the WAL has already been > > written. > > Assuming the WAL has also been fsync'd. True, so this will need to change for 8.4 also. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi, Mark Mielke wrote: > Where does the expectation come from? I find the seat reservation, bank account or stock trading examples pretty obvious WRT user expectations. Nonetheless, I've compiled some hints from the documentation and sources: "Since in Read Committed mode each new command starts with a new snapshot that includes all transactions committed up to that instant" [1]. "This [SERIALIZABLE ISOLATION] level emulates serial transaction execution, as if transactions had been executed one after another, serially, rather than concurrently." [1]. (IMO this implies, that a transaction "sees" changes from all preceding transactions). "All changes made by the transaction become visible to others and are guaranteed to be durable if a crash occurs." [2]. (Agreed, it's not overly clear here, when exactly the changes become visible. OTOH, there's no warning, that another session doesn't immediately see committed transactions. Not sure where you got that from). > I don't recall ever reading it in > the documentation, and unless the session processes are contending over > the integers (using some sort of synchronization primitive) in memory > that represent the "latest visible commit" on every single select, I'm > wondering how it is accomplished? See the transaction system's README [3]. It documents the process of snapshot taking and transaction isolation pretty well. Around line 226 it says: "What we actually enforce is strict serialization of commits and rollbacks with snapshot-taking". (So the outcome of your experiment is no surprise at all). And a bit later: "This rule is stronger than necessary for consistency, but is relatively simple to enforce, and it assists with some other issues as explained below.". While this implies, that an optimization is theoretically possible, I very much doubt it would be worth it (for a single node system). In a distributed system, things are a bit different. Network latency is an order of magnitude higher than memory latency (for IPC). So a similar optimization is very well worth it. However, the application (or the load balancer or both) need to know about this potential lag between nodes. And as you've outlined elsewhere, a limit for how much a single node may lag behind needs to be established. (As a side note: for a multi-master system like Postgres-R, it's beneficial to keep the lag time as low as possible, because the larger the lag, the higher the probability for a conflict between two transactions on different nodes.) Regards Markus Wanner [1]: Pg 8.3 Docu: Concurrency Control: http://www.postgresql.org/docs/8.3/static/transaction-iso.html [2]: Pg 8.3 Docu: COMMIT command: http://www.postgresql.org/docs/8.3/static/sql-commit.html [3]: README of transam (src/backend/access/transam/README): https://projects.commandprompt.com/public/pgsql/browser/trunk/pgsql/src/backend/access/transam/README#L224
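In current sources the strict serialization the README describes comes down to lock ordering on ProcArrayLock; the fragment below paraphrases ProcArrayEndTransaction() and GetSnapshotData(), heavily simplified:

    /* Committing backend, ProcArrayEndTransaction() (simplified): */
    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
    /* clear our XID from the shared proc array: the transaction becomes
     * visible to all future snapshots in one atomic step */
    LWLockRelease(ProcArrayLock);

    /* Snapshot-taking backend, GetSnapshotData() (simplified): */
    LWLockAcquire(ProcArrayLock, LW_SHARED);
    /* collect the still-running XIDs; because committers hold the lock
     * exclusively, a commit is observed either entirely or not at all */
    LWLockRelease(ProcArrayLock);

This is why the experiment mentioned above could not observe a half-visible commit on a single node, and also why the same guarantee does not extend for free across a network of nodes.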
Good answers, Markus. Thanks.

I've bought the thinking of several here that the user should have some control over what they expect (and what optimizations they are willing to accept as a good choice), but that commit should still be able to have a capped time limit.

I can think of many of my own applications where I would choose one mode vs another mode, even within the same application, depending on the operation itself. The most important requirement is that transactions are durable. It becomes convenient, though, to provide additional guarantees for some operation sequences.

I still see the requirement for seat reservation, bank account, or stock trading, as synchronizing using read-write locks before starting the select, rather than enforcing latest on every select.

For my own bank, when I do an online transaction, operations don't always immediately appear in my list of transactions. They appear to sometimes be batched, sometimes in near real time, and sometimes as part of some sort of day-end processing.

For seat reservation, the time the seat layout is shown on the screen is not usually locked during a transaction. Between the time the travel agent brings up the seats on the plane, and the time they select the seat, the seat could be taken. What's important is that the reservation is durable, and that conflicts are not introduced. The commit must fail if another person has chosen the seat already. The commit does not need to wait until the reservation is pushed out to all systems before completing. The same is true of stock trading.

However, it can be very convenient for commits to be immediately visible after the commit completes. This allows for lazier models, such as a web site that reloads the view on the reservations or recent trades and expects to see recent commits no matter which server it accesses, rather than taking into account that the commit succeeded when presenting the next view.

If I look at sites like Google - they take the opposite extreme. I can post a message, and it remembers that I posted the message and makes it immediately visible, however, I might not see other new messages in a thread until a minute or more later.

So it looks like there is value to both ends of the spectrum, and while I feel the most value would be in providing a very fast system that scales near linear to the number of nodes in the system, even at the expense of immediately visible transactions from all servers, I can accept that sometimes the expectations are stricter and would appreciate seeing an option to let me choose based upon my requirements.

Cheers, mark

Markus Wanner wrote: > [full quote of the preceding message trimmed]

-- Mark Mielke <mark@mielke.cc>
Hi, Mark Mielke wrote: > Robert Haas wrote: >> On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> We won't call it anything, because we never will or can implement that. >>> See the theory of relativity: the notion of exactly simultaneous events >> >> OK, fine. I'll be more precise. I think we need to reserve the term >> "synchronous replication" for a system where transactions that begin >> on the standby after the transaction has committed on the master see >> the effects of the committed transaction. I agree with Robert here. As far as I know this is the common understanding of "synchronous replication". Everything less - including Postgres-R - is considered to be asynchronous. > I'd like to see proof of some sort that PostgreSQL guarantees that the > instant a 'commit' returns, any transactions already open with the > appropriate transaction isolation level, or any new sessions *will* see > the results of the commit. It's given within this thread, here [1]. > Two phase commit doesn't imply that the transaction is guaranteed to be > immediately visible. Just for the record: that's plain wrong. As with any other transaction, a COMMIT of a prepared transaction guarantees visibility from all subsequent snapshots (at least for Postgres and other serious RDBMSen). Systems based on 2PC are the typical synchronous replication solution: works, resistant to failures, consistent across nodes (WRT visibility), but unusably slow. This is what people have in mind and expect when they hear "synchronous replication" for databases. (And which is why I'm thinking it's better for an optimized solution not to call itself "synchronous"). > Unless transactions are > locked from starting until they are able to prove that they have the > latest commit See the cited README. It already happens for (single node) Postgres systems, because the action of snapshot taking and committing are serialized. > (a feat which I'm going to theorize as impossible - > because the moment you wait for a commit, and you begin again, you > really have no guarantee that another commit has not occurred in the > mean time) This problem is solved by locking. Regards Markus Wanner [1]: Hints to docs and source, that COMMIT actually ensures subsequent snapshots "include" changes of the committed transaction: http://archives.postgresql.org/message-id/494CFFFF.2060200@bluegap.ch
Hi, Josh Berkus wrote: > Peter Eisentraut wrote: >> It's the color of the bikeshed ... Agreed. It's why I've decided to support various modes for Postgres-R. I'm glad to see that the current "Sync Rep" approach does the same. > Hmmm. I thought this was pretty clear. There's three levels of synch > which are useful features: > > 1) "synchronous" standby which is really asynchronous, but only has a gap > of < 100ms. A synchronous standby which is really asynchronous? That's exactly the naming challenge I've been pointing to. Commonly used terms are: "virtually synchronous", "approximately synchronous", "near-real-time replication" or "eager replication", but for most users, this is not "synchronous" (enough). (BTW: there's no such "< 100 ms" guarantee. It may be typically below 100 ms, or even below 10 ms on average. But replication is not about the typical or average case. It's much more about failures and uncommon cases. The guarantee you can get in such a system (by declaring a node as dead) is much more likely to be within the range of several seconds and more, be it network, disk or whatever other failure-timeout that applies here.) > 2) Synchronous standby which guarantees that all committed transactions > are on the failover node and that no data will be lost for failover, but > the failover node is still in standby mode. What's the difference to 1) here? I'm not following. > 3) Synchronous replication where the standby node has identical > transactions to the master node, and is queryable read-only. So, a synchronous standby is different from synchronous replication in that it's asynchronous? Sorry for bugging with naming, but I think it is important for an understanding during development. > Any of these levels would be useful and allow a certain number of our > users to deploy PostgreSQL in an environment where it wasn't used > before. I absolutely agree to that statement. However, please do not confuse future users (and today's hackers), but instead use existing terms consistently and clearly. Something that lags behind, potentially by several seconds (in case of failure) is commonly considered asynchronous, no matter how close to "immediate" it is on average. Regards Markus Wanner
Hi, Mark Mielke wrote: > Good answers, Markus. Thanks. You are welcome. > So it looks like there is value to both ends of the spectrum, and while > I feel the most value would be in providing a very fast system that > scales near linear to the number of nodes in the system, even at the > expense of immediately visible transactions from all servers, I can > accept that sometimes the expectations are stricter and would appreciate > seeing an option to let me choose based upon my requirements. I absolutely agree to that. The original Postgres-R algorithm covers the eager (or virtually synchronous) part. I'm planning to extend it with a (fully) synchronous mode and let the user choose per transaction. Regards Markus Wanner
Hi, On Fri, Dec 19, 2008 at 5:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Fri, 2008-12-19 at 09:43 +0900, Fujii Masao wrote: > >> > Yes, please check the call points for ForceSyncCommit. >> > >> > Do I think every xlog flush should be synchronous, no, I don't. >> That's why we have a user settable parameter for it. >> >> Umm.. I focus attention on XLogFlush() called except >> RecordTransactionCommit(). >> For example, FlushBuffer(), WriteTruncateXlogRec().. etc. These >> XLogFlush() might >> flush XLOG synchronously even if asynchronous commit case. > > XLogFlush() flushes because of an interlock between a dirty buffer write > and an outstanding WAL write. Dirty buffer writes are not replicated, so > there is no need to have a similar interlock on WAL streaming. > > So making those call points synchronous is possible, but neither > necessary or IMHO desirable. Yes in upcoming 8.4, but probably no in the future. What if the primary fails after writing the dirty data buffer before sending the corresponding logs? This would make data on the primary and logs on the standby inconsistent. In 8.4, such inconsistency might not matter because we don't use the data on the failed primary for recovery (when restarting the failed server, we always need a fresh backup). But, since this restriction is not good for some people, in the future, the failed server should restart without a fresh backup, and the inconsistency would be problem. So, I think that the inconsistency should be removed even if asynchronous replication case, and we should enforce "WAL rule" over some servers. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
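A sketch of what "enforcing the WAL rule over some servers" could mean at the dirty-page write described above. The local flush is what happens today; the replication wait is hypothetical, and is exactly the extra synchronization point the thread goes on to debate:

    XLogRecPtr  recptr = BufferGetLSN(buf);
    XLogFlush(recptr);      /* today's rule: local WAL before local data */

    /*
     * Hypothetical cross-server rule: don't let the page hit disk on the
     * primary until the standby has at least received WAL up to the
     * page's LSN, so a surviving standby's WAL always covers the
     * primary's on-disk data.
     */
    WaitForReplication(recptr, REPLICATION_WAIT_RECV);
    /* ... then write out the dirty page ... */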
Hi, Simon Riggs wrote: > The second way can be done by taking a snapshot on the primary, with an > associated LSN, then using that snapshot on the standby. That is > somewhat complex, but possible. I see the requirement for getting the > same answer on multiple nodes as a further extension of "transaction > isolation mode" and think that not all people will want this, so we > should allow that as an option. I've been thinking a bit about this pretty interesting idea. It's certainly of interest for Postgres-R as well. AFAIK a function could simply wait until the node which is being queried reaches a given point in time of application of transactions (an LSN, in the Sync-Rep world). Calling such a waiting function just after BEGIN would ensure that it sees (at least) the given snapshot. If that snapshot has already been reached or passed, the function does nothing. What I like is that it's optimistic, in that the wait is only enforced when needed by the reader. However, unlike enforcing the wait before COMMIT, it requires changing the application to cope with this behavior of the distributed database system. And knowing when to require which snapshot sounds rather difficult from the point of view of the application developer. Also note that it might be the issuer of the transaction who wants to ensure "his" transaction got propagated to the remote nodes. > I'm not going to worry about this at the moment. Hot standby will be > useful without this and so I regard this as a secondary objective. Rome > wasn't built in a single release, or something like that. Sounds like a decent plan. Good luck. Regards Markus Wanner
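A minimal sketch of such a waiting function on the standby side, assuming a hypothetical GetLastReplayedLSN() accessor; nothing like this exists in the patch:

    void
    WaitForReplayLSN(XLogRecPtr target)
    {
        for (;;)
        {
            /* hypothetical: LSN up to which recovery has applied WAL */
            XLogRecPtr  replayed = GetLastReplayedLSN();

            if (XLByteLE(target, replayed))
                break;          /* a snapshot taken now covers 'target' */

            pg_usleep(10000L);  /* poll; a real implementation would
                                 * rather block on a latch */
        }
    }

Called just after BEGIN with an LSN obtained from the primary, this gives the "at least that snapshot" guarantee described above, while readers that don't care pay nothing.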
Hi, Emmanuel Cecchet wrote: > What the application is going to see is a failure when the postmaster it > is connected to is going down. If this happen at commit time, I think > that there is no guarantee for the application to know what happened: > 1. failure occurred before the request reached postmaster: no instance > committed > 2. failure occurred during commit: might be committed on either nodes > 3. failure occurred while sending back ack of commit to client: both > instances have committed > But for the client, it will all look the same: an error on commit. This is very much the same for a single node database system, so I think current application developers are used to that behavior. A distributed database system just needs to make sure, that committed transactions can and will eventually get committed on all nodes. So in case the client doesn't receive a COMMIT acknowledgment due to atny kind of failure, it can only be sure the distributed database is in a consistent state. The client cannot tell, if it applied the transaction or not. Much like for single node systems. I agree, that a distributed database system could theoretically do better, but that would require changes on the client side connection library as well, letting the client connect to two or even more nodes of the distributed database system. > This is just to point out that despite all your efforts, the client > might think that some transactions have failed (error on commit) but > they are actually committed. As pointed out above, that would currently be an erroneous conclusion. > If you don't put some state in the driver > that is able to check at failover time if the commit operation succeeded > or not, it does not really matter what happens for in-flight > transactions (or in-commit transactions) at failure time. Sure it does. The database system still needs to guarantee consistency. Hm.. well, you're right if there's only one single standby left (as is obviously the case for the proposed Sync Rep). Ensuring consistency is pretty simple in such a case. But imagine having two or more standby servers. Those would need to agree on a set of in-flight transactions from the master they both need to apply. > Actually, if there was a way to query the database about the status of a > particular transaction by providing a cluster-wide unique id, that would > help a lot. You're certainly aware, that Postgres-R features such a global transaction id... And I guess Sync Rep could easily add such an identifier as well. > I wrote a paper on the issues with database replication at > Sigmod earlier this year (http://infoscience.epfl.ch/record/129042). > Even though it was targeted at middleware replication, I think that some > of it is still relevant for the problem at hand. Interesting read, thanks for the pointer. It's pretty obvious that I don't consider Postgres-R to be obsolete... I've linked your paper from www.postgres-r.org [1]. > Regarding the wording, if experts can't agree, you can be sure that > users won't either. Most of them don't have a clue about the different > flavors of replication. So as long as you state clearly how it behaves > and define all the terms you use that should be fine. I mostly agree to that, without repeating my concerns again, here. Regards Markus Wanner [1]: Referenced Papers from Postgres-R website: http://www.postgres-r.org/documentation/references
Hi Markus, I am happy to see that Postgres-R is alive again. The paper was written in 07 (and published in 08, the review process is longer than a CommitFest ;-)) and at the time of the writing there was no version of Postgres-R available, hence the 'obsolete' mention referring to past versions. I think that it is legitimate for users to expect more guarantees from a replicated database than from a single database. Not knowing what happen when a failure happens at commit time when some nodes are still active in a cluster is not intuitive for users. I did not look at the source, but if Postgres -R continue to elaborate on Bettina's ideas with writeset extraction and a certification protocol, I think that it will be a bad idea to try to mix it with Sync Rep (mentioned in another thread). If you delay commits, you will increase the window for transactions to conflict and therefore induce a higher abort rate (thus less scalability). Certification-based approaches have already multiple reliability issues to improve write performance compared to statement-based replication, but this is very dependent on the capacity of the system to limit the conflicting window for concurrent transactions. The writeset extraction mechanisms have had too many limitations so far to allow the use of certification-based replication in production (AFAIK). Good luck with Postgres-R. Emmanuel > Emmanuel Cecchet wrote: > >> What the application is going to see is a failure when the postmaster it >> is connected to is going down. If this happen at commit time, I think >> that there is no guarantee for the application to know what happened: >> 1. failure occurred before the request reached postmaster: no instance >> committed >> 2. failure occurred during commit: might be committed on either nodes >> 3. failure occurred while sending back ack of commit to client: both >> instances have committed >> But for the client, it will all look the same: an error on commit. >> > > This is very much the same for a single node database system, so I think > current application developers are used to that behavior. > > A distributed database system just needs to make sure, that committed > transactions can and will eventually get committed on all nodes. So in > case the client doesn't receive a COMMIT acknowledgment due to atny kind > of failure, it can only be sure the distributed database is in a > consistent state. The client cannot tell, if it applied the transaction > or not. Much like for single node systems. > > I agree, that a distributed database system could theoretically do > better, but that would require changes on the client side connection > library as well, letting the client connect to two or even more nodes of > the distributed database system. > > >> This is just to point out that despite all your efforts, the client >> might think that some transactions have failed (error on commit) but >> they are actually committed. >> > > As pointed out above, that would currently be an erroneous conclusion. > > >> If you don't put some state in the driver >> that is able to check at failover time if the commit operation succeeded >> or not, it does not really matter what happens for in-flight >> transactions (or in-commit transactions) at failure time. >> > > Sure it does. The database system still needs to guarantee consistency. > > Hm.. well, you're right if there's only one single standby left (as is > obviously the case for the proposed Sync Rep). Ensuring consistency is > pretty simple in such a case. 
But imagine having two or more standby > servers. Those would need to agree on a set of in-flight transactions > from the master they both need to apply. > > >> Actually, if there was a way to query the database about the status of a >> particular transaction by providing a cluster-wide unique id, that would >> help a lot. >> > > You're certainly aware, that Postgres-R features such a global > transaction id... And I guess Sync Rep could easily add such an > identifier as well. > > >> I wrote a paper on the issues with database replication at >> Sigmod earlier this year (http://infoscience.epfl.ch/record/129042). >> Even though it was targeted at middleware replication, I think that some >> of it is still relevant for the problem at hand. >> > > Interesting read, thanks for the pointer. It's pretty obvious that I > don't consider Postgres-R to be obsolete... > > I've linked your paper from www.postgres-r.org [1]. > > >> Regarding the wording, if experts can't agree, you can be sure that >> users won't either. Most of them don't have a clue about the different >> flavors of replication. So as long as you state clearly how it behaves >> and define all the terms you use that should be fine. >> > > I mostly agree to that, without repeating my concerns again, here. > > Regards > > Markus Wanner > > > [1]: Referenced Papers from Postgres-R website: > http://www.postgres-r.org/documentation/references > > -- Emmanuel Cecchet FTO @ Frog Thinker Open Source Development & Consulting -- Web: http://www.frogthinker.org email: manu@frogthinker.org Skype: emmanuel_cecchet
Hello Emmanuel, Emmanuel Cecchet wrote: > I am happy to see that Postgres-R is alive again. The paper was written > in 07 (and published in 08, the review process is longer than a > CommitFest ;-)) and at the time of the writing there was no version of > Postgres-R available, hence the 'obsolete' mention referring to past > versions. Understood. > I think that it is legitimate for users to expect more guarantees from a > replicated database than from a single database. Not knowing what happen > when a failure happens at commit time when some nodes are still active > in a cluster is not intuitive for users. I absolutely agree to that. However, it's lower priority for me. > I did not look at the source, but if Postgres -R continue to elaborate > on Bettina's ideas with writeset extraction and a certification > protocol, I think that it will be a bad idea to try to mix it with Sync > Rep (mentioned in another thread). I'm not quite sure what you mean by "certification protocol", there's no such thing in Postgres-R (as proposed by Kemme). Although, I remember having heard that term in the context of F. Pedone's work. Can you point me to some paper explaining this certification protocol? > If you delay commits, you will > increase the window for transactions to conflict and therefore induce a > higher abort rate (thus less scalability). This assumes that *all* types of transactions are unlikely to conflict. But there sometimes just are transactions with a very high probability for conflicts with other transactions. Applying optimistic locking (as the original Postgres-R algorithm does) cannot be efficient in such a case, because of lots of useless retries. (It could even lead to starvation of long running transactions, which always get aborted be shorter conflicting ones). > Certification-based > approaches have already multiple reliability issues to improve write > performance compared to statement-based replication, but this is very > dependent on the capacity of the system to limit the conflicting window > for concurrent transactions. What do you mean by "reliability issues"? Keeping the "conflicting window" as narrow as possibly certainly benefits performance, yes. But keeping the retry rate low also helps a lot (and influences the conflict window in turn). > The writeset extraction mechanisms have had > too many limitations so far to allow the use of certification-based > replication in production (AFAIK). What limitations are you speaking of here? > Good luck with Postgres-R. Thank you. Regards Markus Wanner
Hi, On Wed, Dec 17, 2008 at 12:07 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> No, we've been through that loop already a few months back: >> Transaction-controlled robustness. >> >> It should be up to the client on the primary to decide how much waiting >> they would like to perform in order to provide a guarantee. A change of >> setting on the standby should not be allowed to alter the performance or >> durability on the primary. > > OK. I will extend synchronous_replication, make walsender send XLOG > with synchronization mode flag and make walreceiver perform according > to the flag. Not so simple. At least the primary has to additionally maintain the byte position the standby has already fsynced. The main difference from the current patch is whether the standby fsyncs the logfile when it fills even if you don't choose #4 (fsync). To prevent having to go back and re-open prior logfiles when an fsync request comes along later, we would need to ignore the sync mode and make the standby fsync the logfile when it fills. This would degrade the performance periodically. Is this acceptable? I think there are four choices. Which do you prefer?

1) Accept the above change.
2) Go back and re-open prior logfiles when an fsync request comes along.
3) Stop the sync control by the primary and leave it to the standby.
4) Add a new option to specify whether to permit optimistic fsync; this option makes the standby fsync only the current logfile when an fsync request comes along (don't go back and re-open prior logfiles).

2) would cause another performance degradation. 4) would furthermore confuse users about setting a sync mode. So, I prefer 3), though I'm sorry for digging up the discussion about transaction control. Please feel free to comment! Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
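In walreceiver terms, choice 1) might look like the fragment below (the helpers and the waitLevel flag are hypothetical); the unconditional fsync at a segment switch is what buys the freedom to never re-open prior logfiles:

    /* After writing a chunk of WAL on the standby (hypothetical sketch): */
    if (segment_is_full)
    {
        issue_xlog_fsync();         /* always fsync a completed segment,
                                     * whatever the sync mode flag says */
        advance_to_next_segment();  /* hypothetical helper */
    }
    else if (msg->waitLevel >= REPLICATION_WAIT_FSYNC)
        issue_xlog_fsync();         /* mid-segment fsync only on request */

The cost is one forced fsync per 16MB segment even for purely asynchronous standbys, which is the periodic degradation being asked about.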
Hi Markus, > I'm not quite sure what you mean by "certification protocol", there's no > such thing in Postgres-R (as proposed by Kemme). Although, I remember > having heard that term in the context of F. Pedone's work. Can you point > me to some paper explaining this certification protocol? > What Bettina calls the Lock Phase in http://www.cs.mcgill.ca/~kemme/papers/vldb00.pdf is actually a certification. You can find more references to certification protocols in http://gorda.di.uminho.pt/download/reports/gapi.pdf I would also recommend the work of Sameh on Tashkent and Tashkent+ that was based on Postgres: http://labos.epfl.ch/webdav/site/labos/users/157494/public/papers/tashkent.eurosys2006.pdf and http://infoscience.epfl.ch/record/97654/files/tashkentPlus.eurosys2007.final.pdf >> Certification-based >> approaches have already multiple reliability issues to improve write >> performance compared to statement-based replication, but this is very >> dependent on the capacity of the system to limit the conflicting window >> for concurrent transactions. >> > > What do you mean by "reliability issues"? > These approaches usually require an atomic broadcast primitive that is often fragile (limited scalability, hard to tune failure timeouts, etc.). Most prototype implementations have the load balancer and/or the certifier as a SPOF (single point of failure). Building reliability for these components will come with a significant performance penalty. >> The writeset extraction mechanisms have had >> too many limitations so far to allow the use of certification-based >> replication in production (AFAIK). >> > What limitations are you speaking of here? > Oftentimes DDL support is very limited. Non-transactional objects like sequences are not captured. Session or environment variables are not necessarily propagated. Support of temp tables varies between databases, which makes it hard to support them properly in a generic way. Well, I guess everyone has a story on some limitations they have found with some database replication technology, especially when a user expects a cluster to behave like a single database instance. Happy holidays, Emmanuel -- Emmanuel Cecchet FTO @ Frog Thinker Open Source Development & Consulting -- Web: http://www.frogthinker.org email: manu@frogthinker.org Skype: emmanuel_cecchet
On Sun, 2008-12-21 at 14:46 +0900, Fujii Masao wrote: > > XLogFlush() flushes because of an interlock between a dirty buffer write > > and an outstanding WAL write. Dirty buffer writes are not replicated, so > > there is no need to have a similar interlock on WAL streaming. > > > > So making those call points synchronous is possible, but neither > > necessary or IMHO desirable. > > Yes in upcoming 8.4, but probably no in the future. > > What if the primary fails after writing the dirty data buffer before sending > the corresponding logs? This would make data on the primary and logs > on the standby inconsistent. In 8.4, such inconsistency might not matter > because we don't use the data on the failed primary for recovery (when > restarting the failed server, we always need a fresh backup). But, since > this restriction is not good for some people, in the future, the failed server > should restart without a fresh backup, and the inconsistency would be > problem. So, I think that the inconsistency should be removed even if > asynchronous replication case, and we should enforce "WAL rule" over > some servers. I don't get this argument. Why would we care what happens on the failed server? The additional synchronizations you suggest are neither necessary, nor IMHO desirable. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Hi, On Tue, Dec 23, 2008 at 5:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sun, 2008-12-21 at 14:46 +0900, Fujii Masao wrote: > >> > XLogFlush() flushes because of an interlock between a dirty buffer write >> > and an outstanding WAL write. Dirty buffer writes are not replicated, so >> > there is no need to have a similar interlock on WAL streaming. >> > >> > So making those call points synchronous is possible, but neither >> > necessary or IMHO desirable. >> >> Yes in upcoming 8.4, but probably no in the future. >> >> What if the primary fails after writing the dirty data buffer before sending >> the corresponding logs? This would make data on the primary and logs >> on the standby inconsistent. In 8.4, such inconsistency might not matter >> because we don't use the data on the failed primary for recovery (when >> restarting the failed server, we always need a fresh backup). But, since >> this restriction is not good for some people, in the future, the failed server >> should restart without a fresh backup, and the inconsistency would be >> problem. So, I think that the inconsistency should be removed even if >> asynchronous replication case, and we should enforce "WAL rule" over >> some servers. > > I don't get this argument. Why would we care what happens on the failed server? It's because, in the future, I'd like to use the data on the failed server when making it catch up with new primary. This desire might be violated by the inconsistency which I described. > > The additional synchronizations you suggest are neither necessary, nor > IMHO desirable. Not additional. It's quite analogous to synchronous_commit. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, 2008-12-23 at 18:00 +0900, Fujii Masao wrote: > > I don't get this argument. Why would we care what happens on the > failed server? > > It's because, in the future, I'd like to use the data on the failed > server when making it catch up with new primary. This desire might be > violated by the inconsistency which I described. I don't really understand why you would put something in there that has no use at all. Why make every server in the world do extra synchronisation? Whatever you build in the future can include this, if that is still a required point at the time you add the new feature. Are you thinking about switchover rather than failover? I'm sure a graceful switchover doesn't need this. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Hi, On Tue, Dec 23, 2008 at 6:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Tue, 2008-12-23 at 18:00 +0900, Fujii Masao wrote: >> > I don't get this argument. Why would we care what happens on the >> failed server? >> >> It's because, in the future, I'd like to use the data on the failed >> server when making it catch up with new primary. This desire might be >> violated by the inconsistency which I described. > > I don't really understand why you would put something in there that has > no use at all. Why make every server in the world do extra > synchronisation? > > Whatever you build in the future can include this, if that is still a > required point at the time you add the new feature. Right. But since it's difficult to change a specification once it's fixed, I'm thinking about it now with the future in mind. But, since I cannot obtain consensus from hackers including you, I would change my course, and forbid XLogFlush (called from other than RecordTransactionCommit) to replicate xlog synchronously if asynchronous replication case. BTW, here are the callers other than RecordTransactionCommit.

- CreateCheckPoint()
- EndPrepare()
- FlushBuffer()
- RecordTransactionAbortPrepared()
- RecordTransactionCommitPrepared()
- RelationTruncate()
- SlruPhysicalWritePage()
- WriteTruncateXlogRec()
- XLogAsyncCommitFlush()

> > Are you thinking about switchover rather than failover? I'm sure a > graceful switchover doesn't need this. Yes, switchover is one of case example I care. Typically, I care about restarting the failed server (original primary) after failover: ------------- 1. a dirty buffer page is chosen as victim of buffer replacement 2. flush xlog up to the buffer's LSN on only primary 3. write out the dirty buffer page 4. primary fails (replication up to buffer's LSN is not performed) The above case produces inconsistency between data on the original primary (failed server) and xlogs on the original standby (new primary after failover). Isn't this right? 5. restart the failed server and make it catch up with new primary We cannot recycle the existing data on the failed server because of that inconsistency. I think this restriction should be removed. ------------- Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Dec 23, 2008 at 4:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > > > But, since I cannot obtain consensus from hackers including you, > I would change my course, and forbid XLogFlush (called from other > than RecordTransactionCommit) to replicate xlog synchronously > if asynchronous replication case. > Since synchronous/asynchronous behavior of replication is tied to a transaction (even if there is global default) , I don't understand why we should not ship the xlogs to the standby when xlogs are written on primary outside of a transaction context. This is quite same as we do with asynchronous_commit where we flush the xlog to disk at certain points irrespective of the synchronization set. > Yes, switchover is one of case example I care. Typically, I care > about restarting the failed server (original primary) after failover: > I think this is a very important requirement because it's quite unrealistic to expect that every time there is a failover, fresh backup is required for the old primary to join back the replication. > ------------- > 1. a dirty buffer page is chosen as victim of buffer replacement > 2. flush xlog up to the buffer's LSN on only primary > 3. write out the dirty buffer page > 4. primary fails > (replication up to buffer's LSN is not performed) > > The above case produces inconsistency between data on the > original primary (failed server) and xlogs on the original standby > (new primary after failover). Isn't this right? > Yes, it would create inconsistency which I don't think can be corrected without a fresh backup. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-12-23 at 16:54 +0530, Pavan Deolasee wrote: > On Tue, Dec 23, 2008 at 4:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > > > > But, since I cannot obtain consensus from hackers including you, > > I would change my course, and forbid XLogFlush (called from other > > than RecordTransactionCommit) to replicate xlog synchronously > > if asynchronous replication case. > > Since synchronous/asynchronous behavior of replication is tied to a > transaction (even if there is global default) , I don't understand why > we should not ship the xlogs to the standby when xlogs are written on > primary outside of a transaction context. This is quite same as we do > with asynchronous_commit where we flush the xlog to disk at certain > points irrespective of the synchronization set. We stream constantly from primary to standby. That point is not being debated. The issue is whether we should add additional synchronisation points (i.e. additional times we need to wait) into the WAL stream. Currently, I have said no because this has no purpose in the current design: definitely not performance, not robustness, not code clarity. Specifically, we're talking about slowing down WAL flushes required because of dirty page replacement, amongst others. That's not something I want to see slowed down on a server that has specifically opted for asynchronous replication, presumably because of a slow link. The other call points are also potential contention points. > > Yes, switchover is one of case example I care. Typically, I care > > about restarting the failed server (original primary) after failover: > > > > I think this is a very important requirement because it's quite > unrealistic to expect that every time there is a failover, fresh > backup is required for the old primary to join back the replication. I personally don't expect that, because we have rsync. If that is a very important requirement then the current software needs to include all the aspects of a feature, not just some of them. Either we include a whole feature or we leave it out. A release will need to stand for 5+ years, so supporting extraneous features is troublesome and wasteful. Currently, Fujii-san has stated he is not planning to allow fast resynchronization in 8.4, so why would we need this? If we were to add fast resynchronisation as a feature in 8.4, then I will be happy to have *all* required changes included. People mention it enough that I would be happy to see the whole feature added in this release -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Tue, Dec 23, 2008 at 5:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > We stream constantly from primary to standby. That point is not being > debated. The issue is whether we should add additional synchronisation > points (i.e. additional times we need to wait) into the WAL stream. > Currently, I have said no because this has no purpose in the current > design: definitely not performance, not robustness, not code clarity. > > Specifically, we're talking about slowing down WAL flushes required > because of dirty page replacement, amongst others. That's not something > I want to see slowed down on a server that has specifically opted for > asynchronous replication, presumably because of a slow link. The other > call points are also potential contention points. So we would still be sending WAL to standby at XLogWrite time (and I think that's necessary). The question is whether we should wait for standby ack at XLogFlush time, right ? Hmm. I think the argument for that would be what Fujii-san described for maintaining consistency between data and WAL. I agree with you that we should add additional synchronization points only if they give us any real value in administrating replication setup. Personally, I would like to have a simple setup where I can initially setup primary and standby and they continue to work in a single-failure mode without any additional administrative overhead (such as rsync). But that's just me and I don't know what the preferred option in the field. BTW, I won't be too much worried about dirty buffer case because the WAL synchronization at that point usually occurs much later than the WAL is actually sent to the standby. I would imagine that most of the time WAL would have made to standby by that time. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-12-23 at 18:36 +0530, Pavan Deolasee wrote:

> Personally, I would like a simple setup where I can initially set up
> primary and standby and they continue to work in single-failure mode
> without any additional administrative overhead (such as rsync). But
> that's just me, and I don't know what the preferred option is in the
> field.

If you want a tripod, you need to turn up with all 3 legs. :-)

PostgreSQL is a working product, not a framework or a function library.
We're not going to add code that has no function at all other than as
part of a larger feature, unless we add the whole feature.

I'm happy if that whole feature is added. If we do add it, it will be a
utility like "pg_resync". So in admin terms it will be almost identical
to using rsync, just a specific version that minimizes effort even more
than rsync does currently. The only difference as I see it would be
some gain in performance, but we don't need to send the whole database
down the wire again in either case.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi,

On Tue, Dec 23, 2008 at 10:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I'm happy if that whole feature is added. If we do add it, it will be a
> utility like "pg_resync". So in admin terms it will be almost identical
> to using rsync, just a specific version that minimizes effort even more
> than rsync does currently. The only difference as I see it would be
> some gain in performance, but we don't need to send the whole database
> down the wire again in either case.

I think your type of user is different from mine. If a server fails by
simple termination of the process, I don't want to spend an extra
minute on restarting it beyond the catch-up itself. For me, taking a
fresh backup (not only copying the backup data but also the checkpoint
forced by pg_start_backup) is an expensive operation.

Of course, since I'm not planning to tackle that problem in 8.4, I
would not add an "additional" synchronization point.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hi,

On Tue, Dec 23, 2008 at 11:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Of course, since I'm not planning to tackle that problem in 8.4, I
> would not add an "additional" synchronization point.

On second thought: for the normal shutdown case, we probably should
force synchronous replication in CreateCheckPoint at least.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
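A sketch of that second thought, under the same assumed names as the sketch
above (SyncRepWaitForAck, plus a hypothetical InsertCheckpointRecord); this is
not actual patch code. The shutdown checkpoint is the last WAL the primary
writes before going down, so it is replicated synchronously even when the
configured mode is asynchronous.

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    #define CHECKPOINT_IS_SHUTDOWN 0x0001

    extern void SyncRepWaitForAck(XLogRecPtr upto);
    extern XLogRecPtr InsertCheckpointRecord(int flags);  /* hypothetical */

    static void
    CreateCheckPointSketch(int flags)
    {
        XLogRecPtr recptr = InsertCheckpointRecord(flags);

        /* force a synchronous round-trip for the shutdown checkpoint so
         * the standby is known to hold the complete WAL stream before
         * the primary goes away */
        if (flags & CHECKPOINT_IS_SHUTDOWN)
            SyncRepWaitForAck(recptr);
    }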
On Tue, 2008-12-23 at 23:31 +0900, Fujii Masao wrote:

> I think your type of user is different from mine.

Perhaps, but why do you say that? I've not blocked you from adding
anything useful to Postgres.

> If a server fails by simple termination of the process, I don't want
> to spend an extra minute on restarting it beyond the catch-up itself.
> For me, taking a fresh backup (not only copying the backup data but
> also the checkpoint forced by pg_start_backup) is an expensive
> operation.

As I said: "I'm happy if that whole feature is added."

You scare me that you see failover as sufficiently frequent that being
without one of the servers for an extra 60 seconds during a failover is
a problem, and then say you're not going to add the feature after all.
I really don't understand. If it's important, add the feature, the
whole feature that is. If not, don't.

My expectation is that most failovers are serious ones: the primary
system is down and not coming back very fast. Your worries seem to come
from a scenario where the primary system is still up but Postgres
bounces/crashes, we can diagnose the cause of the crash, decide the
crashed server is safe and then wish to recommence operations on it
again as quickly as possible, where seconds count in doing so.

Are failovers going to be common? Why?

> Of course, since I'm not planning to tackle that problem in 8.4,

If you change your mind, having it in 8.4 would be good.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hello Emmanuel,

Emmanuel Cecchet wrote:
> What Bettina calls the Lock Phase in
> http://www.cs.mcgill.ca/~kemme/papers/vldb00.pdf is actually a
> certification.

Aha. Hm.. that has gone since Postgres-R (SI) and doesn't exist anymore
in my current version either (so far called Postgres-R (8)). Most of
what the certifier does (ordering of write sets) is handled by the GCS;
everything else (i.e. what Tashkent refers to as write-write conflicts)
happens within the database system itself using MVCC.

> You can find more references to certification protocols in
> http://gorda.di.uminho.pt/download/reports/gapi.pdf

Thank you for that pointer. It seems the term "certify" irritated me
because in my mind it is much more tied to public key encryption and
the like.

> I would also recommend the work of Sameh on Tashkent and Tashkent+
> that was based on Postgres:

Thanks again. I've read the first one, which confirmed that I'm on the
right track with what I'm doing with Postgres-R (8). I'm preparing to
relieve the single replicas of (most of) the WAL logging and instead
apply separate change-set or write-set logging. That seems to be the
main achievement of Tashkent. Its savings are pretty obvious, IMO,
because it heavily reduces the overall amount of I/O operations.

> > What do you mean by "reliability issues"?
>
> These approaches usually require an atomic broadcast primitive that is
> usually fragile (limited scalability, hard to tune failure
> timeouts, ...).

I didn't have many reliability issues with ensemble, appia or spread so
far, although I admit I didn't ever run any of these in production.
Performance is certainly an issue, yes.

> Most prototype implementations have the load balancer and/or the
> certifier as a SPOF (single point of failure). Building reliability
> for these components will come with a significant performance penalty.

That's a point, yeah. There's always a compromise between performance
and reliability. And more often than not, the third aspect complicating
the matter even further is cost.

> > What limitations are you speaking of here?
>
> Oftentimes DDL support is very limited.

Agreed. My Postgres-R version doesn't support any of those yet. BTW,
that's one of the cases where (fully) synchronous replication is more
efficient: because DDL commands very often conflict with other
transactions, it's better to use pessimistic locking.

> Non-transactional objects like sequences are not captured.

Postgres-R (8) partly covers sequences already. It uses atomic
broadcasts (independent from change set collection or multi-casting).
An optional per-node caching of sequence numbers helps reduce network
latency for sequence increments.

> Session or environment variables are not necessarily propagated.
> Support of temp tables varies between databases, which makes it hard
> to support them properly in a generic way.
> Well, I guess everyone has a story on some limitations they have found
> with some database replication technology, especially when a user
> expects a cluster to behave like a single database instance.

Certainly, yes.

> Happy holidays,

Thanks, same to you!

Regards

Markus Wanner
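For readers unfamiliar with the per-node sequence caching mentioned above,
here is a rough C sketch of the general idea; it is not Postgres-R source
code, and gcs_reserve_sequence_block is an assumed GCS primitive. A node
reserves a block of values through one totally-ordered broadcast, then serves
increments locally until the block is exhausted.

    #include <stdint.h>

    #define CACHE_SIZE 32

    /* assumed GCS primitive: totally ordered delivery to all nodes,
     * returning the first value of the block granted to this node */
    extern int64_t gcs_reserve_sequence_block(const char *seqname, int n);

    typedef struct SeqCache
    {
        const char *seqname;
        int64_t     next;      /* next value to hand out locally */
        int64_t     limit;     /* first value beyond our reserved block */
    } SeqCache;

    static int64_t
    cached_nextval(SeqCache *cache)
    {
        if (cache->next >= cache->limit)
        {
            /* block exhausted: one network round per CACHE_SIZE values */
            cache->next = gcs_reserve_sequence_block(cache->seqname,
                                                     CACHE_SIZE);
            cache->limit = cache->next + CACHE_SIZE;
        }
        return cache->next++;
    }

The trade-off is the familiar one from CACHE on ordinary sequences: values
handed out by different nodes are interleaved and may leave gaps.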
Simon Riggs wrote:
> You scare me that you see failover as sufficiently frequent that being
> without one of the servers for an extra 60 seconds during a failover
> is a problem, and then say you're not going to add the feature after
> all. I really don't understand. If it's important, add the feature,
> the whole feature that is. If not, don't.
>
> My expectation is that most failovers are serious ones: the primary
> system is down and not coming back very fast. Your worries seem to
> come from a scenario where the primary system is still up but Postgres
> bounces/crashes, we can diagnose the cause of the crash, decide the
> crashed server is safe and then wish to recommence operations on it
> again as quickly as possible, where seconds count in doing so.
>
> Are failovers going to be common? Why?

Hi Simon:

I agree with most of your criticism of the "fail over only" approach,
but I don't agree that failover frequency should really impact
expectations for the failed system to return to service.

I see "soft" fails (*not* serious ones) as potentially common:
somewhere on the network something went down or some packet was lost,
and the system took a few too many seconds to respond. My expectation
is that the node can quickly be detected as out of service and removed
from the pool, and that when the situation is resolved (often
automatically, outside of my control) it automatically "catches up" and
is put back into the pool.

Having to run some other process such as rsync seems unreliable, as we
already have a mechanism for streaming the data. All that is missing is
streaming from an earlier point in time to catch up efficiently and
reliably.

I think I'm talking more about the complete solution, though, which is
in line with what you are saying? :-)

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>
Hi,

On Wed, Dec 24, 2008 at 12:38 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Perhaps, but why do you say that?

Since you often pointed out that taking a backup is not a problem
because of incremental backup (e.g. rsync), I just thought so.

> I've not blocked you from adding anything useful to Postgres.

Yes, I see.

> You scare me that you see failover as sufficiently frequent that being
> without one of the servers for an extra 60 seconds during a failover
> is a problem, and then say you're not going to add the feature after
> all. I really don't understand. If it's important, add the feature,
> the whole feature that is. If not, don't.

Oh, sorry. I don't want to scare you ;) But, yes, it's important. We
should rethink the question: "Why does the failed server always need a
fresh backup?" We discussed this previously and concluded that it
should be done next time:
http://archives.postgresql.org/pgsql-hackers/2008-11/msg01612.php

> My expectation is that most failovers are serious ones: the primary
> system is down and not coming back very fast. Your worries seem to
> come from a scenario where the primary system is still up but Postgres
> bounces/crashes, we can diagnose the cause of the crash, decide the
> crashed server is safe and then wish to recommence operations on it
> again as quickly as possible, where seconds count in doing so.
>
> Are failovers going to be common? Why?

As you say, not *all* failovers are serious ones. I think that a user
would choose the most convenient restart method according to his or her
situation (come back immediately? need careful diagnosis?).

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, 2008-12-24 at 02:23 +0900, Fujii Masao wrote:

> Oh, sorry. I don't want to scare you ;) But, yes, it's important. We
> should rethink the question: "Why does the failed server always need a
> fresh backup?" We discussed this previously and concluded that it
> should be done next time:
> http://archives.postgresql.org/pgsql-hackers/2008-11/msg01612.php

We might ask why pg_start_backup() needs to perform a checkpoint,
though, since you have remarked that is a problem also.

The answer is that it doesn't really need to; we just need to be
certain that archiving has been running since whenever we choose as the
start time. So we could easily just use the last normal checkpoint
time, as long as we had some way of tracking the archiving.

ISTM we can solve the checkpoint problem more easily, and it would
potentially save much more time than "tuning rsync for Postgres", which
is what the other idea amounted to. So I do see a solution that is both
better and more quickly achievable for 8.4.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi,

On Wed, Dec 24, 2008 at 2:37 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Wed, 2008-12-24 at 02:23 +0900, Fujii Masao wrote:
>
>> Oh, sorry. I don't want to scare you ;) But, yes, it's important. We
>> should rethink the question: "Why does the failed server always need
>> a fresh backup?" We discussed this previously and concluded that it
>> should be done next time:
>> http://archives.postgresql.org/pgsql-hackers/2008-11/msg01612.php
>
> We might ask why pg_start_backup() needs to perform a checkpoint,
> though, since you have remarked that is a problem also.
>
> The answer is that it doesn't really need to; we just need to be
> certain that archiving has been running since whenever we choose as
> the start time. So we could easily just use the last normal checkpoint
> time, as long as we had some way of tracking the archiving.
>
> ISTM we can solve the checkpoint problem more easily, and it would
> potentially save much more time than "tuning rsync for Postgres",
> which is what the other idea amounted to. So I do see a solution that
> is both better and more quickly achievable for 8.4.

Sounds good. I agree that pg_start_backup basically doesn't need a
checkpoint. But for full_page_writes == off, we probably cannot get rid
of it. Even if full_page_writes == on, since we cannot tell whether all
the indispensable full pages were written after the last checkpoint,
pg_start_backup must do a checkpoint with forcePageWrites = on.

The problem is that an online backup itself is unsafe: even if there is
no disk failure (i.e. the normal case), we can easily produce a partial
write in an online backup. So we always need full pages when recovering
from an online backup, and therefore pg_start_backup always needs a
checkpoint with forcePageWrites = on. I think that we probably have to
track the history of full_page_writes in order to get rid of the
checkpoint in pg_start_backup.

On the other hand, the data after a crash other than a media crash is
"safe". Currently, we can recover it without full-page writes, as in
the simple crash recovery case. I think that we can use that for
archive recovery as well, because there isn't really any distinction
between the two. I've not found the corner case yet. Have you?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hi,

On Mon, Dec 22, 2008 at 1:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Not so simple.
>
> At least the primary has to additionally maintain the byte position
> the standby has already fsynced. The main difference from the current
> patch is whether the standby fsyncs the logfile when it fills, even if
> you don't choose #4 (fsync). In order to avoid having to go back and
> re-open prior logfiles when an fsync request comes along later, we
> would need to ignore the sync mode and make the standby fsync the
> logfile when it fills. This would degrade performance periodically. Is
> this acceptable?
>
> I think there are four choices. Which do you prefer?
>
> 1) Accept the above change.
> 2) Go back and re-open prior logfiles when an fsync request comes
>    along.
> 3) Stop the sync control by the primary and leave it to the standby.
> 4) Add a new option to specify whether to permit optimistic fsync;
>    this option makes the standby fsync only the current logfile when
>    an fsync request comes along (don't go back and re-open prior
>    logfiles).
>
> 2) would cause another performance degradation. 4) would furthermore
> confuse users about setting a sync mode. So, I prefer 3), though I'm
> sorry for digging up the discussion about transaction control. Please
> feel free to comment!

5) Only allow optimistic fsync.

I'm going to adopt 5) for the next patch, at least for a while.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
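A rough C sketch of what option 5 ("optimistic fsync") means on the standby
side; this is not the actual patch, and names like walrcv_file are assumptions
for illustration. When the primary requests an fsync, the standby syncs only
the WAL segment file it currently has open; prior segments were already synced
when they filled, so they are never re-opened.

    #include <unistd.h>
    #include <stdint.h>

    static int      walrcv_file = -1;     /* fd of the open segment */
    static uint64_t walrcv_synced_upto;   /* position durable on disk */

    /* called when the current segment fills up */
    static void
    segment_full(uint64_t segment_end)
    {
        /* always make a filled segment durable before moving on, so a
         * later fsync request never has to look at older files */
        fsync(walrcv_file);
        close(walrcv_file);
        walrcv_synced_upto = segment_end;
        /* ... open the next segment into walrcv_file ... */
    }

    /* called when the primary's message stream asks for durability */
    static void
    handle_fsync_request(uint64_t requested_upto)
    {
        if (requested_upto <= walrcv_synced_upto)
            return;             /* already durable, nothing to do */

        fsync(walrcv_file);     /* only the current file, optimistically */
        /* ... update walrcv_synced_upto from the current write position
         * and acknowledge back to the primary ... */
    }

The cost Fujii-san mentions is visible in segment_full(): every segment switch
pays an fsync even when the configured sync mode would not otherwise require
one.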
On Wed, 2008-12-24 at 11:39 +0900, Fujii Masao wrote:

> > We might ask why pg_start_backup() needs to perform a checkpoint,
> > though, since you have remarked that is a problem also.
> >
> > The answer is that it doesn't really need to; we just need to be
> > certain that archiving has been running since whenever we choose as
> > the start time. So we could easily just use the last normal
> > checkpoint time, as long as we had some way of tracking the
> > archiving.
> >
> > ISTM we can solve the checkpoint problem more easily, and it would
> > potentially save much more time than "tuning rsync for Postgres",
> > which is what the other idea amounted to. So I do see a solution
> > that is both better and more quickly achievable for 8.4.
>
> Sounds good. I agree that pg_start_backup basically doesn't need a
> checkpoint. But for full_page_writes == off, we probably cannot get
> rid of it. Even if full_page_writes == on, since we cannot tell
> whether all the indispensable full pages were written after the last
> checkpoint, pg_start_backup must do a checkpoint with
> forcePageWrites = on.

Yes, OK. So I think it would only work when full_page_writes = on, and
has been on since the last checkpoint. So two changes:

* We just need a boolean that starts at true every checkpoint and gets
set to false anytime someone resets full_page_writes or
archive_command. If the flag is set && full_page_writes = on, then we
skip the checkpoint entirely and use the values from the last
checkpoint. (See the sketch following this message.)

* My "infra" patch also had a modified version of pg_start_backup()
that allowed you to specify IMMEDIATE checkpoint or not. Reworking that
seems a waste of time, and I want to listen to everybody else now and
change pg_start_backup() so it throws an IMMEDIATE CHECKPOINT and leave
it there.

Can you work on those also?

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
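A minimal C sketch of the first change above, not the committed patch; the
struct and function names are assumptions. A flag starts true at every
checkpoint and is cleared whenever full_page_writes or archive_command
changes; pg_start_backup() may then reuse the previous checkpoint's REDO
location instead of forcing a new checkpoint.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct BackupCtlData
    {
        bool     fpw_stable_since_ckpt; /* no GUC reset since checkpoint */
        uint64_t last_ckpt_redo;        /* REDO ptr of last checkpoint */
    } BackupCtlData;

    static BackupCtlData BackupCtl;

    extern bool     fullPageWrites;            /* current GUC setting */
    extern uint64_t request_backup_checkpoint(void);  /* hypothetical */

    /* at every checkpoint */
    static void
    checkpoint_hook(uint64_t redo_ptr)
    {
        BackupCtl.last_ckpt_redo = redo_ptr;
        BackupCtl.fpw_stable_since_ckpt = true;
    }

    /* whenever full_page_writes or archive_command is changed */
    static void
    guc_reset_hook(void)
    {
        BackupCtl.fpw_stable_since_ckpt = false;
    }

    /* pg_start_backup(): returns the checkpoint REDO location to use */
    static uint64_t
    start_backup_checkpoint(void)
    {
        if (BackupCtl.fpw_stable_since_ckpt && fullPageWrites)
            return BackupCtl.last_ckpt_redo;   /* skip the checkpoint */

        /* otherwise request a checkpoint run with forcePageWrites = on
         * and use its REDO location */
        return request_backup_checkpoint();
    }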
Hi,

On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Yes, OK. So I think it would only work when full_page_writes = on, and
> has been on since the last checkpoint. So two changes:
>
> * We just need a boolean that starts at true every checkpoint and gets
> set to false anytime someone resets full_page_writes or
> archive_command. If the flag is set && full_page_writes = on, then we
> skip the checkpoint entirely and use the values from the last
> checkpoint.

Sounds good. pg_start_backup on the standby (which you are probably
planning?) also needs this logic? If so, resetting full_page_writes or
archive_command should generate its own xlog record.

I have another thought: should we forbid resetting archive_command
during an online backup? Currently we can do that. If we don't need to
allow it, we also don't need to track the reset of archiving for a fast
pg_start_backup.

> * My "infra" patch also had a modified version of pg_start_backup()
> that allowed you to specify IMMEDIATE checkpoint or not. Reworking
> that seems a waste of time, and I want to listen to everybody else now
> and change pg_start_backup() so it throws an IMMEDIATE CHECKPOINT and
> leave it there.
>
> Can you work on those also?

Umm.. I'm busy. Of course, I will try it if no one else raises his or
her hand. But I'd like to put coding the core of synch rep ahead of
this.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hi,

On Wed, Dec 24, 2008 at 7:58 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Yes, OK. So I think it would only work when full_page_writes = on,
>> and has been on since the last checkpoint. So two changes:
>>
>> * We just need a boolean that starts at true every checkpoint and
>> gets set to false anytime someone resets full_page_writes or
>> archive_command. If the flag is set && full_page_writes = on, then we
>> skip the checkpoint entirely and use the values from the last
>> checkpoint.
>
> Sounds good.

I attached a self-contained patch to skip the checkpoint at
pg_start_backup.

> pg_start_backup on the standby (which you are probably planning?) also
> needs this logic? If so, resetting full_page_writes or archive_command
> should generate its own xlog record.

The patch doesn't handle this yet.

> I have another thought: should we forbid resetting archive_command
> during an online backup? Currently we can do that. If we don't need to
> allow it, we also don't need to track the reset of archiving for a
> fast pg_start_backup.

Not handled yet either.

Happy Holidays!

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, 2008-12-25 at 00:10 +0900, Fujii Masao wrote:

> On Wed, Dec 24, 2008 at 7:58 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> Yes, OK. So I think it would only work when full_page_writes = on,
> >> and has been on since the last checkpoint. So two changes:
> >>
> >> * We just need a boolean that starts at true every checkpoint and
> >> gets set to false anytime someone resets full_page_writes or
> >> archive_command. If the flag is set && full_page_writes = on, then
> >> we skip the checkpoint entirely and use the values from the last
> >> checkpoint.
> >
> > Sounds good.
>
> I attached a self-contained patch to skip the checkpoint at
> pg_start_backup.

Good.

Can we change to IMMEDIATE when we do need the checkpoint?

What is bkpCount for? I think we should discuss whatever that is for
separately. It isn't used in any if-test, AFAICS.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi Markus,

> I didn't have many reliability issues with ensemble, appia or spread
> so far, although I admit I didn't ever run any of these in production.
> Performance is certainly an issue, yes.

I may suggest another reading; even though it is a bit dated, most of
the results still apply:
http://jmob.objectweb.org/jgroups/JGroups-middleware-2004.pdf

The baseline is that if you use UDP multicast, you need a dedicated
switch, and the tuning is a nightmare. I discussed these issues with
the developers of Spread and they have no real magic. TCP seems a more
reliable alternative (especially for predictable performance), but the
TCP timeouts are also tricky to tune depending on the platform. We
worked quite a bit with Nuno around Appia in the context of Sequoia,
and performance can be outstanding when properly tuned, or absolutely
awful if some default values are wrong.

The chaotic behavior of a GCS under stress quickly compromises the
reliability of the replication system, and admission control on UDP
multicast has no good solution so far. It's just a heads-up on what is
awaiting you in production when the system is stressed. There is no
good solution so far besides good admission control on top of the GCS
(in the application).

I am now off for the holidays.

Cheers,
Emmanuel

-- 
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com
Hi,

I fixed some bugs.

On Thu, Dec 25, 2008 at 12:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> Can we change to IMMEDIATE when we do need the checkpoint?

Perhaps yes, though the current patch doesn't handle it. I'm not sure
we really need the feature. Yes, as you say, I'd like to also listen to
everybody else.

> What is bkpCount for?

So far, the name of a backup history file consists only of the
checkpoint redo location. But with this patch, since several backups
can use the same checkpoint, a backup history file could unfortunately
be overwritten. So, I introduced bkpCount as an ID for the backups that
use the same checkpoint.

> I think we should discuss whatever that is for separately. It isn't
> used in any if-test, AFAICS.

Yes, this patch is a testbed. We need to discuss it more.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
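To illustrate the bkpCount idea, here is a hypothetical sketch of the file
naming, not the actual patch; the exact layout of history file names is an
assumption for illustration. Because several pg_start_backup() calls may now
share one checkpoint, a per-checkpoint counter is appended so a later backup
cannot overwrite an earlier backup's history file.

    #include <stdio.h>
    #include <stdint.h>

    static unsigned bkpCount = 0;   /* reset to 0 at each new checkpoint */

    static void
    BackupHistoryFileName(char *buf, size_t len,
                          unsigned tli, unsigned logid, unsigned seg,
                          unsigned startoff)
    {
        /* e.g. 00000001000000120000003A.0051C8.1.backup */
        snprintf(buf, len, "%08X%08X%08X.%06X.%u.backup",
                 tli, logid, seg, startoff, ++bkpCount);
    }

Two backups started under the same checkpoint would then produce ".1.backup"
and ".2.backup" files instead of colliding on one name.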