Thread: WAL format

From: Heikki Linnakangas
While looking at the streaming replication patch, I can't help but
wonder why our WAL format is so complicated.

WAL is divided into WAL segments, each 16 MB by default. Each WAL
segment is divided into pages, 8k by default. At the beginning of each
WAL page, there's a page header, but the header at the first page of
each WAL segment contains a few extra fields.

If a WAL record crosses a page boundary, we write as much of it as fits
onto the first page, and write so-called continuation records containing
the rest of the data on subsequent pages.
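The splitting can be sketched roughly as follows (a toy model; the 24-byte header size and the fragment labels are illustrative, not the actual xlog.c structures):

```python
PAGE_SIZE = 8192  # default WAL page size
PAGE_HDR = 24     # illustrative page-header size

def split_record(payload: bytes, space_on_page: int):
    """Split a WAL record so the first fragment fills the remainder of
    the current page and the rest becomes continuation fragments on
    subsequent pages, each of which starts with its own page header."""
    fragments = [("record", payload[:space_on_page])]
    rest = payload[space_on_page:]
    per_page = PAGE_SIZE - PAGE_HDR  # usable bytes on a fresh page
    while rest:
        fragments.append(("contrecord", rest[:per_page]))
        rest = rest[per_page:]
    return fragments

# a 20 kB record with 1000 bytes left on the current page
frags = split_record(b"x" * 20000, space_on_page=1000)
```

Reassembling the payload is then a matter of concatenating the fragments in order, which is exactly the bookkeeping the continuation-record handling in ReadRecord has to do.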

In particular I wonder why we bother with the page headers. A much
simpler format would be:

- get rid of page headers, except for the header at the beginning of
each WAL segment
- get rid of continuation records
- at the end of WAL segment, when there's not enough space to write the
next WAL record, always write an XLOG SWITCH record to fill the rest of
the segment.
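Under this proposal, the insert path at a segment boundary could be sketched like so (a toy model: the filler is written as zeros rather than a real XLOG SWITCH record, and the API is invented for illustration):

```python
SEG_SIZE = 16 * 1024 * 1024  # default WAL segment size

def append_record(segment: bytearray, insert_pos: int, record: bytes):
    """Append a record to the segment, or, when it doesn't fit, pad the
    rest of the segment with switch-record filler so the caller can
    start a new segment. Assumes no record exceeds a segment."""
    if insert_pos + len(record) > SEG_SIZE:
        # not enough room: fill the tail (stand-in for XLOG SWITCH)
        segment[insert_pos:] = b"\x00" * (SEG_SIZE - insert_pos)
        return "switch", 0  # caller opens the next segment at offset 0
    segment[insert_pos:insert_pos + len(record)] = record
    return "ok", insert_pos + len(record)
```

With this scheme every record is contiguous within one segment, which is what would let the reader drop page-at-a-time logic.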

The page address stored in the WAL page header gives some extra
protection for detecting the end of valid WAL correctly, but we rely on
the prev-links and the per-record CRC for that anyway, so I wouldn't
mind losing it.

The changes to ReadRecord in the streaming replication patch feel a bit
awkward, because they have to work around the fact that WAL is streamed
as a sequence of bytes while ReadRecord works one page at a time. I'd
like to replace ReadRecord with a simpler ring buffer approach, but
handling the continuation records makes that a bit hard.
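A minimal version of such a ring buffer might look like this (a hypothetical sketch, not code from the patch):

```python
class RingBuffer:
    """Fixed-size byte ring: streamed WAL bytes go in at the head,
    the record reader consumes them from the tail."""
    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.size = size
        self.head = 0   # next write position
        self.tail = 0   # next read position
        self.count = 0  # bytes currently buffered

    def write(self, data: bytes) -> int:
        """Buffer as many incoming bytes as fit; return how many."""
        n = min(len(data), self.size - self.count)
        for b in data[:n]:
            self.buf[self.head] = b
            self.head = (self.head + 1) % self.size
        self.count += n
        return n

    def read(self, n: int) -> bytes:
        """Consume up to n buffered bytes in arrival order."""
        n = min(n, self.count)
        out = bytearray()
        for _ in range(n):
            out.append(self.buf[self.tail])
            self.tail = (self.tail + 1) % self.size
        self.count -= n
        return bytes(out)
```

The reader would simply wait until a whole record's worth of bytes is buffered and consume it in one call; it is the continuation records, whose fragments are interleaved with page headers, that break this simple picture.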

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: WAL format

From: Kevin Grittner
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> In particular I wonder why we bother with the page headers.
Since we re-use the file for a new segment, without overwriting the
old contents, it seems like we would need to do *something* to
reliably determine when we've hit the end of a segment and have
moved into old data from a previous use of the file.  Would your
proposed changes cover that adequately?  (I'm not sure I understood
your proposal well enough to be comfortable about that.)
-Kevin


Re: WAL format

From: Alvaro Herrera
Heikki Linnakangas wrote:

> - at the end of WAL segment, when there's not enough space to write the
> next WAL record, always write an XLOG SWITCH record to fill the rest of
> the segment.

What happens if a record is larger than a WAL segment?  For example,
what if I insert a 16 MB+ datum into a varlena field?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: WAL format

From: Tom Lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> In particular I wonder why we bother with the page headers. A much
> simpler format would be:

> - get rid of page headers, except for the header at the beginning of
> each WAL segment
> - get rid of continuation records
> - at the end of WAL segment, when there's not enough space to write the
> next WAL record, always write an XLOG SWITCH record to fill the rest of
> the segment.

What do you do with a WAL record that doesn't fit in a segment?  (They
do exist.)  I don't think you can eliminate continuation records.
You could maybe use them only at segment boundaries but I doubt that
makes things any simpler than they are now.
        regards, tom lane


Re: WAL format

From: Tom Lane
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>> In particular I wonder why we bother with the page headers.
> Since we re-use the file for a new segment, without overwriting the
> old contents, it seems like we would need to do *something* to
> reliably determine when we've hit the end of a segment and have
> moved into old data from a previous use of the file.  Would your
> proposed changes cover that adequately?

AFAICT the proposal would make us 100% dependent on the record CRC
to detect when a record has been torn (ie, only the first few sectors
made it to disk).  I'm a bit nervous about that from a reliability
standpoint --- with a 32-bit CRC you've got a 1-in-4-billion chance
of accepting bad data.  Checking the page headers too gives us many
more bits that have to be as-expected to consider the data good.

Since the records are fed to XLogInsert as units, it seems like the
actual problem might be addressable by hooking in the sync-rep data
sending at that level, rather than looking at the WAL page buffers
as I gather it must be doing now.
        regards, tom lane
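The arithmetic behind those odds can be checked directly; the 64-bit page-address figure below is illustrative of "many more bits", not the exact set of fields xlog.c validates:

```python
# Probability that uniformly random garbage passes validation.
# (Real torn pages are not uniformly random, so this is only indicative.)
crc_bits = 32
p_crc_only = 2.0 ** -crc_bits            # ~2.3e-10: "1 in 4 billion"

# Suppose each page header additionally carries, say, a 64-bit page
# address that must match the expected position (illustrative width).
extra_header_bits = 64
p_with_headers = 2.0 ** -(crc_bits + extra_header_bits)

print(f"CRC alone:     1 in {2 ** crc_bits:,}")
print(f"CRC + headers: 1 in {2 ** (crc_bits + extra_header_bits):,}")
```

Each independently-checked header field multiplies the odds against accepting bad data, which is the extra protection the page headers buy.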


Re: WAL format

From: Andres Freund
On Monday 07 December 2009 21:44:37 Tom Lane wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> >> In particular I wonder why we bother with the page headers.
> >
> > Since we re-use the file for a new segment, without overwriting the
> > old contents, it seems like we would need to do *something* to
> > reliably determine when we've hit the end of a segment and have
> > moved into old data from a previous use of the file.  Would your
> > proposed changes cover that adequately?
> AFAICT the proposal would make us 100% dependent on the record CRC
> to detect when a record has been torn (ie, only the first few sectors
> made it to disk).  I'm a bit nervous about that from a reliability
> standpoint --- with a 32-bit CRC you've got a 1-in-4-billion chance
> of accepting bad data.  Checking the page headers too gives us many
> more bits that have to be as-expected to consider the data good.
One could argue that that's a good argument to go back to 64-bit CRCs,
considering that with such a change they would be computed less often,
and that CPUs have gotten faster...

Andres


Re: WAL format

From: Tom Lane
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Heikki Linnakangas wrote:
>> - at the end of WAL segment, when there's not enough space to write the
>> next WAL record, always write an XLOG SWITCH record to fill the rest of
>> the segment.

> What happens if a record is larger than a WAL segment?  For example,
> what if I insert a 16 MB+ datum into a varlena field?

That case doesn't pose a problem --- the datum would be toasted into
individual tuples that are certainly no larger than a page.  However
we do have cases where a WAL record can get arbitrarily large; in
particular a commit record with many subtransactions and/or many
disk files to delete.  These cases do get exercised in the field
too --- I can recall at least one related bug report.
        regards, tom lane


Re: WAL format

From: Simon Riggs
On Mon, 2009-12-07 at 21:28 +0200, Heikki Linnakangas wrote:

> The changes to ReadRecord in the streaming replication patch feel a
> bit awkward, because it has to work around the fact that WAL is
> streamed as a stream of bytes, but ReadRecord works one page at a
> time. I'd like to replace ReadRecord with a simpler ring buffer
> approach, but handling the continuation records makes it a bit hard.

If this was earlier in the release cycle, I'd feel happier.

2.5 months before beta is the wrong time to re-design the crash recovery
data format, especially because it's only "a bit awkward". We're bound to
break something unforeseen and not have time to fix it. If you were
telling me "impossible", I'd be all ears.

I feel your pain, but less drastic solutions are always best in such an
important area, at least while we lack automated test harnesses there.

-- Simon Riggs           www.2ndQuadrant.com



Re: WAL format

From: Heikki Linnakangas
Tom Lane wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>>> In particular I wonder why we bother with the page headers.
>  
>> Since we re-use the file for a new segment, without overwriting the
>> old contents, it seems like we would need to do *something* to
>> reliably determine when we've hit the end of a segment and have
>> moved into old data from a previous use of the file.  Would your
>> proposed changes cover that adequately?
> 
> AFAICT the proposal would make us 100% dependent on the record CRC
> to detect when a record has been torn (ie, only the first few sectors
> made it to disk).  I'm a bit nervous about that from a reliability
> standpoint --- with a 32-bit CRC you've got a 1-in-4-billion chance
> of accepting bad data.  Checking the page headers too gives us many
> more bits that have to be as-expected to consider the data good.

We also check the prev-link, and apply some weak checks on the rmid and
length fields.

> Since the records are fed to XLogInsert as units, it seems like the
> actual problem might be addressable by hooking in the sync-rep data
> sending at that level, rather than looking at the WAL page buffers
> as I gather it must be doing now.

No, walsender reads from disk. The sending side actually looks OK to me,
it's the code in ReadRecord that reads partial pages at the receiving
end that I'd like to simplify. It works as it is, but we have to re-read
the most recent page when it wasn't received as a whole yet, and add some
state to track that. I think it's already relying on the fact that
walsender always sends full records (it can stop at a page boundary, at
a continuation record).

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: WAL format

From: Heikki Linnakangas
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> In particular I wonder why we bother with the page headers. A much
>> simpler format would be:
> 
>> - get rid of page headers, except for the header at the beginning of
>> each WAL segment
>> - get rid of continuation records
>> - at the end of WAL segment, when there's not enough space to write the
>> next WAL record, always write an XLOG SWITCH record to fill the rest of
>> the segment.
> 
> What do you do with a WAL record that doesn't fit in a segment?  (They
> do exist.)  I don't think you can eliminate continuation records.
> You could maybe use them only at segment boundaries but I doubt that
> makes things any simpler than they are now.

Hmm, yeah, it doesn't make it that much simpler then.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: WAL format

From: Fujii Masao
On Tue, Dec 8, 2009 at 10:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> If this was earlier in the release cycle, I'd feel happier.
>
> 2.5 months before beta is the wrong time to re-design the crash recovery
> data format, especially because it's only "a bit awkward". We're bound to
> break something unforeseen and not have time to fix it. If you were
> telling me "impossible", I'd be all ears.

To avoid harmful effects on the existing feature, how about
introducing a new function that reads WAL records at the byte level
for Streaming Replication? ISTM that making the one function
ReadRecord cover several cases (crash recovery and replication)
would increase complexity.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: WAL format

From: Greg Stark
On Mon, Dec 7, 2009 at 8:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Heikki Linnakangas wrote:
>>> - at the end of WAL segment, when there's not enough space to write the
>>> next WAL record, always write an XLOG SWITCH record to fill the rest of
>>> the segment.
>
>> What happens if a record is larger than a WAL segment?  For example,
>> what if I insert a 16 MB+ datum into a varlena field?
>
> That case doesn't pose a problem --- the datum would be toasted into
> individual tuples that are certainly no larger than a page.  However
> we do have cases where a WAL record can get arbitrarily large; in
> particular a commit record with many subtransactions and/or many
> disk files to delete.  These cases do get exercised in the field
> too --- I can recall at least one related bug report.

Sounds like a reason to make the format simpler...

If we raise the maximum segment size, is there a point where we would
be in a reasonable range to impose maximum sizes for these lists?
32MB? 64MB? It's not like there isn't a limit now -- we'll just throw
an out-of-memory error during replay if the list doesn't fit in
memory.

What if we push the work of handling these lists up to the recovery
manager instead of xlog.c? Commit records would then send a record
saying "when xid nnnn commits, the following subtransactions commit as
well", and could send multiple such records. When the recovery manager
sees such a record, it is responsible for remembering the list
somewhere, appending the new values if it has already seen part of the
list (possibly even spilling to disk), and reloading it when it sees
the corresponding commit.
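Such a replay-side manager might look roughly like this (a hypothetical sketch; the record kinds and the in-memory-only accumulation are invented for illustration, and a real version could spill large lists to disk):

```python
from collections import defaultdict

class RecoveryManager:
    """Accumulates 'subxacts of xid N' records until the matching
    commit arrives, so no single WAL record has to carry the whole
    subtransaction list."""
    def __init__(self):
        self.pending = defaultdict(list)  # xid -> subxact ids seen so far

    def replay(self, record):
        kind, xid, payload = record
        if kind == "subxacts":
            # may arrive in several chunks; just append each one
            self.pending[xid].extend(payload)
            return None
        if kind == "commit":
            # commit closes out whatever list we have accumulated
            subxacts = self.pending.pop(xid, [])
            return ("committed", xid, subxacts)

mgr = RecoveryManager()
mgr.replay(("subxacts", 7, [101, 102]))
mgr.replay(("subxacts", 7, [103]))
result = mgr.replay(("commit", 7, None))
```

This keeps each individual WAL record bounded in size while still letting an arbitrarily long subtransaction list be reconstructed at commit time.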

--
greg