Thread: Keeping separate WAL segments for each database
I talked to Sybase people about their latest technologies. Many features
that shipped with their latest ASE 15.5 product have been in PostgreSQL
for many years (like functional indexes, etc.). :)

One of the things that interested me was the parallel recovery feature.
They said that they keep separate xlogs for each database, which speeds
up recovery in case of a crash. It would also increase performance,
since we could write xlogs to separate disks.

Is that doable for PostgreSQL, too?

--
Devrim GÜNDÜZ
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
PostgreSQL RPM Repository: http://yum.pgrpms.org
Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org  Twitter: http://twitter.com/devrimgunduz
Excerpts from Devrim GÜNDÜZ's message of Wed Jun 30 14:54:06 -0400 2010:

> One of the things that interested me was the parallel recovery feature.
> They said that they keep separate xlogs for each database, which speeds
> up recovery in case of a crash. It would also increase performance,
> since we could write xlogs to separate disks.

I'm not sure about this. You'd need to have one extra WAL stream for
shared catalogs; and what would you do with a transaction that touches
both shared catalogs and local objects? You'd have to split its WAL
entries across those two WAL streams.

I think you could try to solve this by having yet another WAL stream
for transaction commit, and have the database-specific streams
reference that one. Operations touching shared catalogs would act as
barriers: all other databases' WAL streams would have to be
synchronized to that one. This would still allow you to have some
concurrency because, presumably, operations on shared catalogs are
rare.
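A minimal sketch of the barrier idea described above, as a toy data
structure rather than anything from the PostgreSQL tree; every type and
field name here is invented for illustration. Each per-database stream
remembers the commit-stream position of the last shared-catalog
operation it depends on, and may not replay past its own records until
the commit stream has been replayed at least that far:

/* Toy model of the barrier scheme; not PostgreSQL code. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef uint64_t Lsn;           /* log sequence number within one stream */

typedef struct DbStream {
    Lsn replayed_to;            /* how far this database's stream has replayed */
    Lsn barrier_in_commit_log;  /* commit-stream LSN of the last shared-catalog
                                 * operation this stream depends on */
} DbStream;

/* Replay of a database stream may proceed only once the shared commit
 * stream has caught up to the barrier the stream depends on. */
static bool
can_replay(const DbStream *db, Lsn commit_stream_replayed_to)
{
    return commit_stream_replayed_to >= db->barrier_in_commit_log;
}

int
main(void)
{
    DbStream db = { .replayed_to = 100, .barrier_in_commit_log = 42 };
    printf("may replay: %s\n", can_replay(&db, 40) ? "yes" : "no");  /* no  */
    printf("may replay: %s\n", can_replay(&db, 42) ? "yes" : "no");  /* yes */
    return 0;
}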
2010/6/30 Alvaro Herrera <alvherre@commandprompt.com>:
> I'm not sure about this. You'd need to have one extra WAL stream for
> shared catalogs; and what would you do with a transaction that touches
> both shared catalogs and local objects? You'd have to split its WAL
> entries across those two WAL streams.
>
> I think you could try to solve this by having yet another WAL stream
> for transaction commit, and have the database-specific streams
> reference that one. Operations touching shared catalogs would act as
> barriers: all other databases' WAL streams would have to be
> synchronized to that one. This would still allow you to have some
> concurrency because, presumably, operations on shared catalogs are
> rare.

I think one stream per database and one extra one for the shared
catalogs would be enough. Most transactions would touch either just the
database or just the shared catalogs, so you'd write the commit record
in whichever stream was appropriate. If you had a transaction that
touched both, you'd write the commit record in both places, and include
in each stream a reference to the other stream. On replay, when you
reach a commit record that references another stream, you pause until
the referenced stream also reaches the matching commit record. If you
reach the end of that WAL stream without finding the commit record,
then, in archive recovery, you just keep waiting for more of the stream
to arrive; and, in crash recovery, you write a matching commit record
at the end of WAL.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
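A toy sketch of that replay rule, again with invented names and no
relation to the actual xlog code. Each stream is replayed by its own
cursor; a commit record carrying a cross-reference marks itself as
"reached" and then blocks until the referenced stream has also reached
its twin commit record for the same xid:

/* Toy replay loop for cross-referenced commit records. */
#include <stdbool.h>
#include <stdio.h>

typedef struct Rec {
    int  xid;
    bool is_commit;
    int  xref_stream;       /* -1, or index of stream holding the twin commit */
} Rec;

#define NSTREAMS 2
#define MAXXID   16

static bool reached[NSTREAMS][MAXXID];  /* cursor has arrived at commit(xid) */

static bool
try_apply(int s, const Rec *r)
{
    if (r->is_commit && r->xref_stream >= 0)
    {
        reached[s][r->xid] = true;          /* our cursor is at the commit */
        if (!reached[r->xref_stream][r->xid])
            return false;                   /* twin not reached yet: pause */
    }
    printf("stream %d: applied xid %d%s\n", s, r->xid,
           r->is_commit ? " (commit)" : "");
    return true;
}

int
main(void)
{
    /* Stream 0: a data record for xid 7, then a cross-referenced commit.
     * Stream 1: the twin commit record for xid 7. */
    Rec  s0[] = { {7, false, -1}, {7, true, 1} };
    Rec  s1[] = { {7, true, 0} };
    Rec *streams[NSTREAMS] = { s0, s1 };
    int  len[NSTREAMS] = { 2, 1 };
    int  pos[NSTREAMS] = { 0, 0 };
    bool progress = true;

    while (progress)        /* round-robin until no stream can advance */
    {
        progress = false;
        for (int s = 0; s < NSTREAMS; s++)
            while (pos[s] < len[s] && try_apply(s, &streams[s][pos[s]]))
            {
                pos[s]++;
                progress = true;
            }
    }
    return 0;
}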
Robert Haas <robertmhaas@gmail.com> writes:
> If you reach the end of that WAL stream without finding the commit
> record, then, in archive recovery, you just keep waiting for more of
> the stream to arrive; and, in crash recovery, you write a matching
> commit record at the end of WAL.

Surely you'd have to roll back, not commit, in that situation. You have
no excuse for assuming that you've replayed all effects of the
transaction.

			regards, tom lane
2010/6/30 Tom Lane <tgl@sss.pgh.pa.us>:
> Surely you'd have to roll back, not commit, in that situation. You
> have no excuse for assuming that you've replayed all effects of the
> transaction.

Hmm, good point. But you could make it work either way, I think. If you
flush WAL stream A, write the commit record to WAL stream B, flush WAL
stream B, and then write the commit record to WAL stream A, then commit
is correct. If you write the commit record to A, flush A, write the
commit record to B, and flush B, then abort is correct.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
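The two orderings, spelled out as a toy write/flush sequence;
wal_write() and wal_flush() are invented stand-ins, not PostgreSQL
functions. The point is which record's durability is allowed to imply
the other's:

/* Toy illustration of the two flush orderings. */
#include <stdio.h>

static void wal_write(char stream, const char *what)
{
    printf("write %s to stream %c\n", what, stream);
}

static void wal_flush(char stream)
{
    printf("flush stream %c\n", stream);
}

int
main(void)
{
    /* Ordering 1: treating the xact as committed on replay is correct.
     * By the time B's commit record is durable, every record of the
     * transaction on A is already durable, so a missing commit record
     * on A could safely be synthesized at end of WAL. */
    wal_flush('A');                 /* all of the xact's records on A */
    wal_write('B', "commit");
    wal_flush('B');
    wal_write('A', "commit");       /* may or may not survive a crash */

    printf("---\n");

    /* Ordering 2: treating the xact as aborted on replay is correct.
     * A's commit record can be durable while B's commit record (and
     * possibly other records on B) is not, so finding only one commit
     * record proves nothing about the other stream. */
    wal_write('A', "commit");
    wal_flush('A');
    wal_write('B', "commit");
    wal_flush('B');                 /* a crash before this loses B's commit */
    return 0;
}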
Robert Haas <robertmhaas@gmail.com> writes:
> 2010/6/30 Tom Lane <tgl@sss.pgh.pa.us>:
>> Surely you'd have to roll back, not commit, in that situation. You
>> have no excuse for assuming that you've replayed all effects of the
>> transaction.

> Hmm, good point. But you could make it work either way, I think. If
> you flush WAL stream A, write the commit record to WAL stream B, flush
> WAL stream B, and then write the commit record to WAL stream A, then
> commit is correct.

I don't think so. "I flushed this" is not equivalent to "it is certain
that it will be possible to read this again". In particular, corruption
of WAL stream A leaves you in trouble if you take the commit on B as a
certificate for stream A being complete.

(thinks for a bit...) Maybe if the commit record on B included a
minimum stopping point for stream A, it'd be all right. This wouldn't
be exactly the expected LSN of the A commit record, mind you, because
you don't want to block insertions into the A stream while you're
flushing B. But it would say that all non-commit records for the xact
on stream A are known to be before that point. If you've replayed A
that far then you can take the transaction as being committable.

(thinks some more...) No, you still lose, because a commit record isn't
just a single bit. What about subtransactions, for example? I guess
maybe the commit record written/flushed first is the real commit record
with all the auxiliary data, and the one written second isn't so much a
commit record as a fencepoint record to prevent advancing beyond that
point in stream A before you've processed the relevant commit from B.

(thinks some more...) Maybe you don't even need the fencepoint record
per se. I think all it's doing for you is making sure you don't process
commit records on different streams out of order. There might be some
other, more direct way to do that.

(thinks yet more...) Actually, the weak point in this scheme is that it
wouldn't serialize transactions that occur in different databases and
don't touch any shared catalogs. It'd be entirely possible for T1 in
DB1 to be reported committed, then T2 in DB2 to be reported committed,
then a crash occurs after which T2 is seen committed and T1 not. While
this would be all right if the clients for T1 and T2 can't communicate,
that isn't the real world.

			regards, tom lane
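A sketch of what the "minimum stopping point" might look like as a
record layout; the field and type names are invented, not taken from
PostgreSQL's record formats. The replay-side test reduces to a single
comparison against how far stream A has been replayed:

/* Toy layout for the minimum-stopping-point idea. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef uint64_t Lsn;

typedef struct CommitRecB {
    uint32_t xid;
    Lsn      min_stop_a;    /* all of this xact's non-commit records on
                             * stream A lie strictly before this point */
} CommitRecB;

/* The xact is committable once stream A has been replayed at least up
 * to the stopping point named in B's commit record. */
static bool
committable(const CommitRecB *c, Lsn a_replayed_to)
{
    return a_replayed_to >= c->min_stop_a;
}

int
main(void)
{
    CommitRecB c = { .xid = 7, .min_stop_a = 1000 };
    printf("%s\n", committable(&c, 999)  ? "committable" : "not yet");
    printf("%s\n", committable(&c, 1000) ? "committable" : "not yet");
    return 0;
}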
2010/6/30 Tom Lane <tgl@sss.pgh.pa.us>:
> (thinks some more...) Maybe you don't even need the fencepoint record
> per se. I think all it's doing for you is making sure you don't
> process commit records on different streams out of order. There might
> be some other, more direct way to do that.
>
> (thinks yet more...) Actually, the weak point in this scheme is that
> it wouldn't serialize transactions that occur in different databases
> and don't touch any shared catalogs. It'd be entirely possible for T1
> in DB1 to be reported committed, then T2 in DB2 to be reported
> committed, then a crash occurs after which T2 is seen committed and T1
> not. While this would be all right if the clients for T1 and T2 can't
> communicate, that isn't the real world.

Eh? If T1 and T2 are both reported committed, then they'll still be
committed after crash recovery, assuming synchronous_commit is turned
on. If not, our ACID has no D. Still, I suspect you're right that there
are serialization anomalies buried in here somewhere that can't happen
today.

And at any rate, the per-database thing isn't really the design goal.
It would be much nicer if we could find a way to support N>1 WAL
streams without requiring that they be segregated by database. We'd
like to be able to write WAL faster, and commit faster, during normal
operation, and recover more quickly during recovery, especially archive
recovery. You need to make sure not only that you replay commit records
in order, but also that, for example, you don't replay an
XLOG_HEAP2_CLEAN record too early.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On 06/30/2010 05:52 PM, Robert Haas wrote:
> And at any rate, the per-database thing isn't really the design goal.

FWIW, I've run into more than one client for whom PITR and/or warm
standby at the per-database level would be a killer feature.

Joe
Robert Haas <robertmhaas@gmail.com> writes:
> You need to make sure not only that you replay commit records in
> order, but also that, for example, you don't replay an
> XLOG_HEAP2_CLEAN record too early.

Hm, good point. That probably means that you *do* need fencepost
records, and furthermore that you might need an interlock to ensure
that you get the fencepost in early enough on the other stream. Ugh ---
there goes your concurrency.

What about having a single WAL stream for all commit records (thereby
avoiding any possible xact-serialization funnies) and other WAL records
divided up among multiple streams in some fashion or other? A commit
record would bear minimum-LSN pointers for all the streams that its
transaction had written to. Things like HEAP_CLEAN records would bear
minimum-LSN pointers for the commit stream. Workable?

			regards, tom lane
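Toy record layouts for the single-commit-stream design sketched above;
all names are invented, not PostgreSQL's. A commit record pins how far
each data stream must be replayed before the commit may be applied, and
a cleanup-style record pins how far the commit stream must be replayed
before it may be:

/* Toy model of minimum-LSN pointers on commit and cleanup records. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define NSTREAMS 4

typedef uint64_t Lsn;

typedef struct CommitRec {
    uint32_t xid;
    Lsn      min_lsn[NSTREAMS]; /* 0 = stream untouched by this xact */
} CommitRec;

typedef struct CleanupRec {
    Lsn min_commit_lsn;         /* commit stream must be replayed this far */
} CleanupRec;

static bool
commit_ready(const CommitRec *c, const Lsn replayed[NSTREAMS])
{
    for (int s = 0; s < NSTREAMS; s++)
        if (c->min_lsn[s] != 0 && replayed[s] < c->min_lsn[s])
            return false;       /* some data stream still lags behind */
    return true;
}

static bool
cleanup_ready(const CleanupRec *r, Lsn commit_replayed_to)
{
    return commit_replayed_to >= r->min_commit_lsn;
}

int
main(void)
{
    Lsn        replayed[NSTREAMS] = { 500, 0, 300, 0 };
    CommitRec  c = { .xid = 7, .min_lsn = { 450, 0, 350, 0 } };
    CleanupRec k = { .min_commit_lsn = 90 };

    printf("commit ready: %s\n", commit_ready(&c, replayed) ? "yes" : "no");
    replayed[2] = 350;          /* stream 2 catches up */
    printf("commit ready: %s\n", commit_ready(&c, replayed) ? "yes" : "no");
    printf("cleanup ready: %s\n", cleanup_ready(&k, 100) ? "yes" : "no");
    return 0;
}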
2010/6/30 Tom Lane <tgl@sss.pgh.pa.us>:
> What about having a single WAL stream for all commit records (thereby
> avoiding any possible xact-serialization funnies) and other WAL
> records divided up among multiple streams in some fashion or other? A
> commit record would bear minimum-LSN pointers for all the streams that
> its transaction had written to. Things like HEAP_CLEAN records would
> bear minimum-LSN pointers for the commit stream. Workable?

I don't see why not. Of course, the performance of any of these ideas
is another question altogether...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Wed, 2010-06-30 at 22:21 -0400, Tom Lane wrote:
> What about having a single WAL stream for all commit records (thereby
> avoiding any possible xact-serialization funnies) and other WAL
> records divided up among multiple streams in some fashion or other? A
> commit record would bear minimum-LSN pointers for all the streams that
> its transaction had written to. Things like HEAP_CLEAN records would
> bear minimum-LSN pointers for the commit stream. Workable?

I'm interested in the idea of putting full-page writes into one stream
and all other WAL records into another.

That would allow us to stream less data for log shipping.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Jul 3, 2010 at 10:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I'm interested in the idea of putting full-page writes into one stream
> and all other WAL records into another.
>
> That would allow us to stream less data for log shipping.

Yeah, that would be great. Heikki and I were discussing this a bit. I
think the standby can have problems with torn pages, too, but maybe the
full-page writes could be moved completely outside the xlog system. In
other words, you write the full-page writes to a separate log which is
maintained locally, and independently, on the master and standby, and
which (I think) can be recycled after each checkpoint/restartpoint.
You'd have to figure out when to refer back to that log during redo, of
course; I'm not sure if that would require fencepost records of some
kind or just careful accounting. One disadvantage of this approach is
that you'd be fsync-ing the WAL and the full-page log separately - I'm
not sure whether that would suck or not.

An even more radical idea is to try to find a way to reduce the need
for full-page writes or even eliminate them altogether. If a particular
WAL record can be replayed without reference to the existing page
contents, it's immune to any problem that might be caused by a torn
page. So I suppose if all of our records had that property, we wouldn't
need this. Or maybe we could decide that xlog is allowed to rely on the
first N bytes of the page but not the full contents, and then instead
of full-page writes, just xlog that much of it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
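A toy contrast between the two kinds of redo discussed above, with
invented structures and a tiny fake page size; this is an illustration
of the property, not PostgreSQL's actual redo code. A redo routine that
depends on the page's prior contents is vulnerable to torn pages, while
one that fully rewrites every byte it touches is not:

/* Toy contrast: page-dependent redo vs. self-contained redo. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define PAGESIZE 64             /* toy page, not BLCKSZ */

typedef struct Page { uint8_t bytes[PAGESIZE]; } Page;

/* Page-dependent redo: shifts existing contents, so replaying it on a
 * page whose halves come from different writes produces garbage. */
static void
redo_compact(Page *p, int gap_off, int gap_len)
{
    memmove(p->bytes + gap_off,
            p->bytes + gap_off + gap_len,
            PAGESIZE - gap_off - gap_len);
}

/* Self-contained redo: the record carries every byte it touches, so
 * the result is the same no matter what was on the page before. */
static void
redo_overwrite(Page *p, int off, const uint8_t *data, int len)
{
    memcpy(p->bytes + off, data, len);
}

int
main(void)
{
    Page          p;
    const uint8_t payload[4] = { 1, 2, 3, 4 };

    memset(p.bytes, 0xAA, PAGESIZE);    /* pretend half-torn contents */
    redo_compact(&p, 8, 4);             /* result depends on old bytes */
    redo_overwrite(&p, 8, payload, 4);  /* result does not */
    printf("byte at offset 8 after overwrite: %d\n", p.bytes[8]);
    return 0;
}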