Thread: Constant WAL replay

Constant WAL replay

From
Hans-Jürgen Schönig
Date:
We have toyed around with PostgreSQL's WAL structure and we are 
wondering whether it is (theoretically) possible to replay WAL records 
coming from a remote host while a system is working in read only mode.

The idea: We are looking for a way to implement a synchronous 
single-master / multiple slaves systems.
Meanwhile we are able to serialize / deserialize WAL records and send 
them to a group communication system which transports those records to 
the slave database.
This is not hard to do. The problem is: How can we replay data on the 
slave while the slave is answering read-only SQL-statements?

As far as I understand the code we can either recover the system or 
answer SQL statements. Which kind of modifications are necessary to 
replay and answer at the same time? (assuming that locks and 
transactional information are available on the slave).

Input is very much appreciated.
Best regards,
    Hans

-- 
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/664/393 39 74
www.cybertec.at, www.postgresql.at



Re: Constant WAL replay

From
Tom Lane
Date:
Hans-Jürgen Schönig <postgres@cybertec.at> writes:
> We have toyed around with PostgreSQL's WAL structure and we are 
> wondering whether it is (theoretically) possible to replay WAL records 
> coming from a remote host while a system is working in read only mode.

There's been some idle speculation about that, and offhand I don't know
of any showstopper reasons why it couldn't be done ... but no one's
actually looked into making it happen.  Feel free ;-)

> Which kind of modifications are necessary to 
> replay and answer at the same time? (assuming that locks and 
> transactional information are available on the slave).

I think you should definitely assume that locks are *not* available, and
probably not any live transaction info.  The discussions before included
the idea of being able to do queries against a known-broken,
non-recoverable database without making it any worse than it was before.
In the current code there is no such thing as a hard read-only behavior
--- for example we will try to update commit-status hint bits no matter
what.  Allowing that to be turned off would be interesting for a number
of purposes, such as burning a database onto CD.
        regards, tom lane


Re: Constant WAL replay

From
Neil Conway
Date:
Hans-Jürgen Schönig wrote:
> The idea: We are looking for a way to implement a synchronous 
> single-master / multiple slaves systems.
> Meanwhile we are able to serialize / deserialize WAL records and send 
> them to a group communication system which transports those records to 
> the slave database.

BTW, what is the benefit to using a GCS here?

-Neil


Re: Constant WAL replay

From
Klaus Naumann
Date:
> what.  Allowing that to be turned off would be interesting for a number
> of purposes, such as burning a database onto CD.

FWIW, Oracle suggests a "transportable tablespace" for this feature.
Which is a tablespace that is not written too and which can be read by
any database.
Would that solve the purposes you mean?

Greetings, Klaus


Re: Constant WAL replay

From
Tom Lane
Date:
Klaus Naumann <lists@distinctmind.de> writes:
>> what.  Allowing that to be turned off would be interesting for a number
>> of purposes, such as burning a database onto CD.

> FWIW, Oracle suggests a "transportable tablespace" for this feature.
> Which is a tablespace that is not written too and which can be read by
> any database.
> Would that solve the purposes you mean?

It's a very long way from here to there.  In particular, since different
installations have different transaction histories, the XIDs in the
table could not be transportable.  You'd almost be forced to build
something like the non-MVCC, XID-less table type that was being
speculated about up-thread.
        regards, tom lane


Re: Constant WAL replay

From
Alvaro Herrera
Date:
On Sun, Apr 24, 2005 at 11:41:17AM -0400, Tom Lane wrote:
> Klaus Naumann <lists@distinctmind.de> writes:
> >> what.  Allowing that to be turned off would be interesting for a number
> >> of purposes, such as burning a database onto CD.
> 
> > FWIW, Oracle suggests a "transportable tablespace" for this feature.
> > Which is a tablespace that is not written too and which can be read by
> > any database.
> > Would that solve the purposes you mean?
> 
> It's a very long way from here to there.  In particular, since different
> installations have different transaction histories, the XIDs in the
> table could not be transportable.

Unless the tables are frozen first.

-- 
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"En las profundidades de nuestro inconsciente hay una obsesiva necesidad
de un universo lógico y coherente. Pero el universo real se halla siempre
un paso más allá de la lógica" (Irulan)


Re: Constant WAL replay

From
Alvaro Herrera
Date:
On Sun, Apr 24, 2005 at 08:10:34AM +0200, Hans-Jürgen Schönig wrote:

> The idea: We are looking for a way to implement a synchronous 
> single-master / multiple slaves systems.
> Meanwhile we are able to serialize / deserialize WAL records and send 
> them to a group communication system which transports those records to 
> the slave database.
> This is not hard to do.

In fact, I believe Command Prompt's Mammoth Replicator does exactly
this.

-- 
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"La victoria es para quien se atreve a estar solo"


Re: Constant WAL replay

From
"Joshua D. Drake"
Date:
Alvaro Herrera wrote:
> On Sun, Apr 24, 2005 at 08:10:34AM +0200, Hans-Jürgen Schönig wrote:
> 
> 
>>The idea: We are looking for a way to implement a synchronous 
>>single-master / multiple slaves systems.
>>Meanwhile we are able to serialize / deserialize WAL records and send 
>>them to a group communication system which transports those records to 
>>the slave database.
>>This is not hard to do.
> 
> 
> In fact, I believe Command Prompt's Mammoth Replicator does exactly
> this.

Very close. We don't use the WAL (yet, slated for probably 8.1) but we
do use a transaction log shipping method. So the implementation is 
almost the same.

Sincerely,

Joshua D. Drake
Command Prompt, Inc.




Re: Constant WAL replay

From
Bruce Momjian
Date:
Joshua D. Drake wrote:
> Alvaro Herrera wrote:
> > On Sun, Apr 24, 2005 at 08:10:34AM +0200, Hans-J?rgen Sch?nig wrote:
> > 
> > 
> >>The idea: We are looking for a way to implement a synchronous 
> >>single-master / multiple slaves systems.
> >>Meanwhile we are able to serialize / deserialize WAL records and send 
> >>them to a group communication system which transports those records to 
> >>the slave database.
> >>This is not hard to do.
> > 
> > 
> > In fact, I believe Command Prompt's Mammoth Replicator does exactly
> > this.
> 
> Very close. We don't use the WAL (yet, slated for probably 8.1) but we
> do use a transaction log shipping method. So the implementation is 
> almost the same.

Can you run queries on the slave?  If so, how do you handle xid collisions?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Constant WAL replay

From
"Joshua D. Drake"
Date:
>>
>>Very close. We don't use the WAL (yet, slated for probably 8.1) but we
>>do use a transaction log shipping method. So the implementation is 
>>almost the same.
> 
> 
> Can you run queries on the slave?  If so, how do you handle xid collisions?

You can run any query that does not modify data on a replicated table. 
You can run any non data modifying query on any of the tables.

Sincerely,

Joshua D. Drake



> 



Re: Constant WAL replay

From
Bruce Momjian
Date:
Joshua D. Drake wrote:
> >>
> >>Very close. We don't use the WAL (yet, slated for probably 8.1) but we
> >>do use a transaction log shipping method. So the implementation is 
> >>almost the same.
> > 
> > 
> > Can you run queries on the slave?  If so, how do you handle xid collisions?
> 
> You can run any query that does not modify data on a replicated table. 
> You can run any non data modifying query on any of the tables.

So, do you modify the slave to prevent it from grabbing an xid?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Constant WAL replay

From
Hans-Jürgen Schönig
Date:
Neil Conway wrote:
> Hans-Jürgen Schönig wrote:
> 
>> The idea: We are looking for a way to implement a synchronous 
>> single-master / multiple slaves systems.
>> Meanwhile we are able to serialize / deserialize WAL records and send 
>> them to a group communication system which transports those records to 
>> the slave database.
> 
> 
> BTW, what is the benefit to using a GCS here?
> 
> -Neil


currently i have used it because it is simple.
at the moment we are just doing experiments to see how log information 
can be treated. shipping the data is not the problem - the real problem 
is getting it back in.

good ideas are welcome ;)
best regards,
    hans


-- 
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/664/393 39 74
www.cybertec.at, www.postgresql.at



Re: Constant WAL replay

From
Hans-Jürgen Schönig
Date:
Joshua D. Drake wrote:
> Alvaro Herrera wrote:
> 
>> On Sun, Apr 24, 2005 at 08:10:34AM +0200, Hans-Jürgen Schönig wrote:
>>
>>
>>> The idea: We are looking for a way to implement a synchronous 
>>> single-master / multiple slaves systems.
>>> Meanwhile we are able to serialize / deserialize WAL records and send 
>>> them to a group communication system which transports those records 
>>> to the slave database.
>>> This is not hard to do.
>>
>>
>>
>> In fact, I believe Command Prompt's Mammoth Replicator does exactly
>> this.
> 
> 
> Very close. We don't use the WAL (yet, slated for probably 8.1) but we
> do use a transaction log shipping method. So the implementation is 
> almost the same.
> 
> Sincerely,
> 
> Joshua D. Drake
> Command Prompt, Inc.

Joshua,

This sounds interesting. If you don't use the WAL but a transaction log 
shipping - what does it mean in terms of PostgreSQL? Do you create your 
own transaction log?
What really interests me here: Where is you code located in order to 
make sure that this things can work reliably?
Best regards,
    Hans

-- 
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/664/393 39 74
www.cybertec.at, www.postgresql.at



Re: Constant WAL replay

From
"Joshua D. Drake"
Date:
> 
> 
> Joshua,
> 
> This sounds interesting. If you don't use the WAL but a transaction log 
> shipping - what does it mean in terms of PostgreSQL? Do you create your 
> own transaction log?

Yes.

> What really interests me here: Where is you code located in order to 
> make sure that this things can work reliably?

We are integrated into the PostgreSQL backend.

Sincerely,

Joshua D. Drake


> 
>     Best regards,
> 
>         Hans
> 


-- 
Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240
PostgreSQL Replication, Consulting, Custom Programming, 24x7 support
Managed Services, Shared and Dedication Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/


Re: Constant WAL replay

From
Bruce Momjian
Date:
Joshua D. Drake wrote:
> > 
> > 
> > Joshua,
> > 
> > This sounds interesting. If you don't use the WAL but a transaction log 
> > shipping - what does it mean in terms of PostgreSQL? Do you create your 
> > own transaction log?
> 
> Yes.
> 
> > What really interests me here: Where is you code located in order to 
> > make sure that this things can work reliably?
> 
> We are integrated into the PostgreSQL backend.

So they have to get a new PostgreSQL release from you for every minor
upgrade, I assume at no cost?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Constant WAL replay

From
Hans-Jürgen Schönig
Date:
Joshua D. Drake wrote:

>> Joshua,
>>
>> This sounds interesting. If you don't use the WAL but a transaction 
>> log shipping - what does it mean in terms of PostgreSQL? Do you 
>> create your own transaction log?
>
>
> Yes.
>
>> What really interests me here: Where is you code located in order to 
>> make sure that this things can work reliably?
>
>
> We are integrated into the PostgreSQL backend.


.... which is a large place ;).
   best regards,
      hans


Re: Constant WAL replay

From
"Joshua D. Drake"
Date:
>>
>>>What really interests me here: Where is you code located in order to 
>>>make sure that this things can work reliably?
>>
>>We are integrated into the PostgreSQL backend.
> 
> 
> So they have to get a new PostgreSQL release from you for every minor
> upgrade, I assume at no cost?

Yep :)

Sincerely,

Joshua D. Drake
Command Prompt, Inc.

> 


-- 
Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240
PostgreSQL Replication, Consulting, Custom Programming, 24x7 support
Managed Services, Shared and Dedication Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/


Re: Writing Commit Status hint bits (was Re: Constant

From
Simon Riggs
Date:
On Wed, 2005-07-20 at 09:24 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Tue, 2005-07-19 at 22:24 -0400, Tom Lane wrote:
> >> Simon Riggs <simon@2ndquadrant.com> writes:
> >>> Short patch enclosed to turn off writing of commit-status hint bits.
> >>
> >> Doesn't this entirely destroy the ability to truncate clog, and
> >> therefore the ability to survive XID wraparound?
>
> > I hope not for all our sakes, since the hint bits are not WAL logged and
> > anything that relies upon them would be fragile.
>
> We don't rely on any one write of them to work, but that doesn't mean
> that we can indefinitely postpone writing them.

OK, I think I understand where you're coming from now.

A table can't be migrated to read-only media until all of its tuples
have an xmin of FrozenTransactionId; I said the following, which was
wrong:
"This should allow migration of older child tables to hierarchical
storage when using a large historical table design."
The patch doesn't directly contribute to that goal, though is of value
in a large historical table design with many read only child tables (and
other situations).

But that comment wasn't the only inspiration for the patch.

When VACUUM freezes the xid, it *does* make sense at that point to
update the hint bits as a performance optimization. That isn't required
though, and *can* be indefinitely postponed, AFAICS. All of the tqual
routines will still work just as well without the hint bits set. If you
know different, I'll need a deeper explanation before I understand.

IMHO, the direction the patch is going in is still worthwhile because of
these issues:

1. Any block read may attempt to set hint bits, which dirties the block
and must be written out. So *reads* can result in heavier write activity
at checkpoint time. That effects both OLTP and DW systems: Random read
transactions against a large table will be worst effected, since we may
end up writing the block once for each read.

2. A lazy vacuum may also dirty a block, even when it has done nothing
else useful to that block. Worse, if we vacuum a table that is bigger
than shared_buffers (or close), then we will end up having to evict
dirty buffers that the vacuum itself has written in order to continue
the vacuum. Since Vacuum is two-pass, we may end up writing a block
*twice*, once where we set the hint bits and then again later where we
remove the tuples and re-write. So this patch will allow a normal VACUUM
to perform better on larger tables.

(1) is a pain, but there's no point solving it without also solving (2).

The patch would fail an Assert test during a VACUUM, since the info bits
are not actually set if cache_txn_status_with_data = false during
VACUUM.

As a result of (2), perhaps we should remove all of the
SetBufferCommitInfoNeedsSave calls in HeapTupleSatisfiesVacuum, and add
a call to SetBufferCommitInfoNeedsSave that overrides
cache_txn_status_with_data when we actually freeze a row. (Or perhaps
that should be a VACUUM FAST command?) That way we would set the hint
bits *only* when we freeze a row and not at any other time.

If we further reduced the number of times we dirty the block *at all* on
the first pass of a VACUUM, we would reduce the chance of writing twice.
We could save the setting of frozen transactions until the second phase,
i.e. only dirty the block if (pgchanged && vacrelstats->num_dead_tuples
> prev_dead_count)

The patch sets cache_txn_status_with_data as a USERSET, with the
intention that particular read-only users would not wish to have their
read-only transactions turn into write transactions. There was no
intention to prevent VACUUM, not to avoid the optimisation of writing
hint bits on a VACUUM FREEZE nor to set full read only status - which we
discussed previously but is a much longer project.

Sorry for any confusion caused in my initial patch submission. Does my
longer explanation make sense of what the patch is trying to achieve.

Best Regards, Simon Riggs


Re: Writing Commit Status hint bits (was Re: Constant WAL replay)

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Wed, 2005-07-20 at 09:24 -0400, Tom Lane wrote:
>> We don't rely on any one write of them to work, but that doesn't mean
>> that we can indefinitely postpone writing them.

> OK, I think I understand where you're coming from now.

Apparently not :-(

> When VACUUM freezes the xid, it *does* make sense at that point to
> update the hint bits as a performance optimization.

The hint bits are not really relevant when xmin = FrozenTransactionId,
since any examiner of the tuple would consider that XID committed anyway.
Besides, the hint bit is guaranteed set in that scenario; note the
Assert where vacuum is setting it:

                    HeapTupleHeaderSetXmin(tuple.t_data, FrozenTransactionId);
                    /* infomask should be okay already */
                    Assert(tuple.t_data->t_infomask & HEAP_XMIN_COMMITTED);

The scenario in which the hint bit *must* be set is where it is for
an XID for which we have deleted the relevant section of pg_clog,
which we are willing to do well before freeze occurs, if we know that
all the relevant XIDs have been hinted.  See TruncateCLOG.  Your patch
breaks that logic by not considering hint-bit updates as changes that
must be flushed to disk by checkpoint.

> 1. Any block read may attempt to set hint bits, which dirties the block
> and must be written out. So *reads* can result in heavier write activity
> at checkpoint time.

Sure, but the alternative is heavier activity in repeated checks of
pg_clog to find out commit state that a previous examiner of the tuple
already found out.  The patch supposes that one write is worse than N
reads, which is clearly a loss at some not-exceedingly-large value of N.
If we thought that was a good tradeoff, we might as well not have the
hint bits at all.

            regards, tom lane

Re: Writing Commit Status hint bits (was Re: Constant

From
Simon Riggs
Date:
On Wed, 2005-07-20 at 13:20 -0400, Tom Lane wrote:

> > When VACUUM freezes the xid, it *does* make sense at that point to
> > update the hint bits as a performance optimization.
>
> The hint bits are not really relevant when xmin = FrozenTransactionId,
> since any examiner of the tuple would consider that XID committed anyway.
> Besides, the hint bit is guaranteed set in that scenario; note the
> Assert where vacuum is setting it:
>
>                     HeapTupleHeaderSetXmin(tuple.t_data, FrozenTransactionId);
>                     /* infomask should be okay already */
>                     Assert(tuple.t_data->t_infomask & HEAP_XMIN_COMMITTED);
> The scenario in which the hint bit *must* be set is where it is for
> an XID for which we have deleted the relevant section of pg_clog,
> which we are willing to do well before freeze occurs, if we know that
> all the relevant XIDs have been hinted.  See TruncateCLOG.  Your patch
> breaks that logic by not considering hint-bit updates as changes that
> must be flushed to disk by checkpoint.

OK, I see that now. Thank you for showing me that.

I will change the patch so that this does not prevent VACUUM from
setting hint bits.

All of this does raise other questions, but thats probably sensible to
raise those on other threads.

> > 1. Any block read may attempt to set hint bits, which dirties the block
> > and must be written out. So *reads* can result in heavier write activity
> > at checkpoint time.
>
> Sure, but the alternative is heavier activity in repeated checks of
> pg_clog to find out commit state that a previous examiner of the tuple
> already found out.  The patch supposes that one write is worse than N
> reads, which is clearly a loss at some not-exceedingly-large value of N.
> If we thought that was a good tradeoff, we might as well not have the
> hint bits at all.

That is not a decision we can make without knowledge of the application.
In general, the hint bits are good. In *some* cases, not. I still seek
control over that as a designer.

Specifically, the scenario I want to optimize is this:
- we load a table with lots of real time measurement data, as one child
out of a large number of similar child tables
- we then immediately create summary tables from the measurements
- after this the detailed data is only sporadically accessed, if ever
- detail data is dropped from the database after a few weeks
- the majority of the database is detail data, so those tables are never
vacuumed since no rows are ever deleted from those tables (the tables
are dropped) nor is access sufficiently frequent to make it sensible to
set hint bits - hence *no* complete database vacuum is run on a regular
basis

In this design, the second step causes the whole detailed data table to
be written out to the database (again) immediately after loading. I
would like to be able to prevent that.

That design is *not* uncommon, since we might describe it as a
"regulatory/monitoring data archiving" application. I have two current
clients with exactly this design, plus have worked on similar apps in
the past. The PostgreSQL licence cost makes it particularly suitable for
that kind of application.

Best Regards, Simon Riggs



Re: Writing Commit Status hint bits (was Re: Constant WAL replay)

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> In general, the hint bits are good. In *some* cases, not. I still seek
> control over that as a designer.

> Specifically, the scenario I want to optimize is this:
> - we load a table with lots of real time measurement data, as one child
> out of a large number of similar child tables
> - we then immediately create summary tables from the measurements
> - after this the detailed data is only sporadically accessed, if ever
> - detail data is dropped from the database after a few weeks
> - the majority of the database is detail data, so those tables are never
> vacuumed since no rows are ever deleted from those tables (the tables
> are dropped) nor is access sufficiently frequent to make it sensible to
> set hint bits - hence *no* complete database vacuum is run on a regular
> basis

> In this design, the second step causes the whole detailed data table to
> be written out to the database (again) immediately after loading. I
> would like to be able to prevent that.

This application seems sufficiently off the beaten path to not be worth
solving with a knob as klugy and potentially dangerous as
suppress_hint_bits.

A better way to avoid the write pass, if you must, is to do the summary
table creation in the same transaction that loads the data.

            regards, tom lane

Re: Writing Commit Status hint bits (was Re: Constant

From
Simon Riggs
Date:
On Fri, 2005-07-22 at 09:40 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > In general, the hint bits are good. In *some* cases, not. I still seek
> > control over that as a designer.

...

> ... not be worth
> solving with a knob as klugy and potentially dangerous as
> suppress_hint_bits.

I note that heap_get_latest_tid() relies upon the setting of
HEAP_XMIN_COMMITTED in the infomask in lieu of checking any return codes
from the HeapTupleSatisfies check. Sequences also touch on that.

I agree with you that it is potentially dangerous to mess with that, for
now.

I'll look at Read-Only Tables for 8.2 rather than this quick-fix.

Best Regards, Simon Riggs