Thread: WAL format
While looking at the streaming replication patch, I can't help but
wonder why our WAL format is so complicated. WAL is divided into WAL
segments, each 16 MB by default. Each WAL segment is divided into
pages, 8k by default. At the beginning of each WAL page, there's a
page header, but the header at the first page of each WAL segment
contains a few extra fields. If a WAL record crosses a page boundary,
we write as much of it as fits onto the first page, and so-called
continuation records with the rest of the data on subsequent pages.

In particular I wonder why we bother with the page headers. A much
simpler format would be:

- get rid of page headers, except for the header at the beginning of
  each WAL segment
- get rid of continuation records
- at the end of WAL segment, when there's not enough space to write
  the next WAL record, always write an XLOG SWITCH record to fill the
  rest of the segment.

The page addr stored in the WAL page header gives some extra
protection for detecting end of valid WAL correctly, but we rely on
the prev-links and CRC within page for that anyway, so I wouldn't mind
losing that.

The changes to ReadRecord in the streaming replication patch feel a
bit awkward, because it has to work around the fact that WAL is
streamed as a stream of bytes, but ReadRecord works one page at a
time. I'd like to replace ReadRecord with a simpler ring buffer
approach, but handling the continuation records makes it a bit hard.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
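[For readers following along, here is an illustrative sketch of the layout being discussed. The field names loosely mirror PostgreSQL's page header (magic, info, timeline, page address), but the exact sizes, values, and the `parse_page_headers` helper are simplifications for illustration, not the real structs:]

```python
import struct

WAL_SEG_SIZE = 16 * 1024 * 1024   # 16 MB segments (default)
WAL_PAGE_SIZE = 8192              # 8k pages (default)

# Simplified per-page header: magic, info flags, timeline ID, and the
# WAL address of the page (field sizes are assumptions).
PAGE_HDR = struct.Struct("<HHIQ")

# The first page of each segment carries a "long" header with extra
# fields (system identifier, segment size, page size) -- sketched here
# but not parsed below.
LONG_HDR_EXTRA = struct.Struct("<QII")

def parse_page_headers(segment: bytes):
    """Yield (page_offset, header_fields) for every page in a segment.

    A record that crosses a page boundary is split around the next
    page's header and continued as a continuation record -- exactly
    the bookkeeping the proposal wants to eliminate.
    """
    for off in range(0, len(segment), WAL_PAGE_SIZE):
        magic, info, tli, pageaddr = PAGE_HDR.unpack_from(segment, off)
        yield off, (magic, info, tli, pageaddr)
```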
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

> In particular I wonder why we bother with the page headers.

Since we re-use the file for a new segment, without overwriting the
old contents, it seems like we would need to do *something* to
reliably determine when we've hit the end of a segment and have moved
into old data from a previous use of the file. Would your proposed
changes cover that adequately? (I'm not sure I understood your
proposal well enough to be comfortable about that.)

-Kevin
Heikki Linnakangas wrote:
> - at the end of WAL segment, when there's not enough space to write the
> next WAL record, always write an XLOG SWITCH record to fill the rest of
> the segment.

What happens if a record is larger than a WAL segment? For example,
what if I insert a 16 MB+ datum into a varlena field?

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> In particular I wonder why we bother with the page headers. A much
> simpler format would be:

> - get rid of page headers, except for the header at the beginning of
> each WAL segment
> - get rid of continuation records
> - at the end of WAL segment, when there's not enough space to write the
> next WAL record, always write an XLOG SWITCH record to fill the rest of
> the segment.

What do you do with a WAL record that doesn't fit in a segment? (They
do exist.) I don't think you can eliminate continuation records. You
could maybe use them only at segment boundaries, but I doubt that
makes things any simpler than they are now.

			regards, tom lane
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>> In particular I wonder why we bother with the page headers.

> Since we re-use the file for a new segment, without overwriting the
> old contents, it seems like we would need to do *something* to
> reliably determine when we've hit the end of a segment and have
> moved into old data from a previous use of the file. Would your
> proposed changes cover that adequately?

AFAICT the proposal would make us 100% dependent on the record CRC to
detect when a record has been torn (ie, only the first few sectors
made it to disk). I'm a bit nervous about that from a reliability
standpoint --- with a 32-bit CRC you've got a 1-in-4-billion chance of
accepting bad data. Checking the page headers too gives us many more
bits that have to be as-expected to consider the data good.

Since the records are fed to XLogInsert as units, it seems like the
actual problem might be addressable by hooking in the sync-rep data
sending at that level, rather than looking at the WAL page buffers as
I gather it must be doing now.

			regards, tom lane
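[To put rough numbers on that nervousness: if garbage data passed every checked bit independently at random (a simplifying assumption), the CRC alone accepts it with probability 2^-32, and each additional header bit that must match halves that. The 64-bit figure for header checks below is an illustrative guess, not a count of the real fields:]

```python
# Probability that random garbage passes validation, under the
# simplifying assumption that every checked bit is independent and
# uniformly random.
def false_accept(check_bits: int) -> float:
    return 2.0 ** -check_bits

crc_only = false_accept(32)           # the "1-in-4-billion" chance

# Suppose page-header checks (magic, timeline, page address) contribute
# roughly another 64 must-match bits -- an assumed figure:
crc_plus_headers = false_accept(32 + 64)
```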
On Monday 07 December 2009 21:44:37 Tom Lane wrote:
> AFAICT the proposal would make us 100% dependent on the record CRC
> to detect when a record has been torn (ie, only the first few sectors
> made it to disk). I'm a bit nervous about that from a reliability
> standpoint --- with a 32-bit CRC you've got a 1-in-4-billion chance
> of accepting bad data. Checking the page headers too gives us many
> more bits that have to be as-expected to consider the data good.

One could argue that that's a good argument to go back to 64-bit CRCs,
considering that they would be computed less often with such a change,
and that CPUs have gotten more modern...

Andres
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Heikki Linnakangas wrote:
>> - at the end of WAL segment, when there's not enough space to write the
>> next WAL record, always write an XLOG SWITCH record to fill the rest of
>> the segment.

> What happens if a record is larger than a WAL segment? For example,
> what if I insert a 16 MB+ datum into a varlena field?

That case doesn't pose a problem --- the datum would be toasted into
individual tuples that are certainly no larger than a page. However we
do have cases where a WAL record can get arbitrarily large; in
particular a commit record with many subtransactions and/or many disk
files to delete. These cases do get exercised in the field too --- I
can recall at least one related bug report.

			regards, tom lane
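[For a sense of scale, a back-of-the-envelope sketch: if each subtransaction XID adds 4 bytes to the commit record, a commit with a few million subtransactions alone overflows a 16 MB segment. The fixed overhead and per-file entry sizes here are assumptions, not the real on-disk layout:]

```python
XID_BYTES = 4                   # one 32-bit xid per subtransaction
SEG_SIZE = 16 * 1024 * 1024     # default segment size
HEADER_OVERHEAD = 64            # assumed fixed record overhead

def commit_record_size(nsubxacts: int, nfiles: int = 0,
                       file_entry_bytes: int = 8) -> int:
    """Rough size of a commit record carrying subxact xids and
    files-to-delete entries (all sizes are illustrative)."""
    return (HEADER_OVERHEAD
            + nsubxacts * XID_BYTES
            + nfiles * file_entry_bytes)

# Around 4.2 million subtransactions the record no longer fits in one
# 16 MB segment, so some form of continuation is unavoidable.
min_overflow_subxacts = (SEG_SIZE - HEADER_OVERHEAD) // XID_BYTES + 1
```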
On Mon, 2009-12-07 at 21:28 +0200, Heikki Linnakangas wrote:
> The changes to ReadRecord in the streaming replication patch feel a
> bit awkward, because it has to work around the fact that WAL is
> streamed as a stream of bytes, but ReadRecord works one page at a
> time. I'd like to replace ReadRecord with a simpler ring buffer
> approach, but handling the continuation records makes it a bit hard.

If this was earlier in the release cycle, I'd feel happier.

2.5 months before beta is the wrong time to re-design the crash
recovery data format, especially because it's only "a bit awkward".
We're bound to break something unforeseen and not have time to fix it.
If you were telling me "impossible", I'd be all ears.

I feel your pain, but less drastic solutions are always best in such
an important area, at least while we lack automated test harnesses
there.

-- 
Simon Riggs           www.2ndQuadrant.com
Tom Lane wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Since we re-use the file for a new segment, without overwriting the
>> old contents, it seems like we would need to do *something* to
>> reliably determine when we've hit the end of a segment and have
>> moved into old data from a previous use of the file. Would your
>> proposed changes cover that adequately?
>
> AFAICT the proposal would make us 100% dependent on the record CRC
> to detect when a record has been torn (ie, only the first few sectors
> made it to disk). I'm a bit nervous about that from a reliability
> standpoint --- with a 32-bit CRC you've got a 1-in-4-billion chance
> of accepting bad data. Checking the page headers too gives us many
> more bits that have to be as-expected to consider the data good.

We also check the prev-link, and do some weak checks on the rmid and
length fields.

> Since the records are fed to XLogInsert as units, it seems like the
> actual problem might be addressable by hooking in the sync-rep data
> sending at that level, rather than looking at the WAL page buffers
> as I gather it must be doing now.

No, walsender reads from disk. The sending side actually looks OK to
me; it's the code in ReadRecord that reads partial pages at the
receiving end that I'd like to simplify. It works as it is, but we
have to re-read the most recent page when it wasn't received as a
whole yet, and add some state to track that. I think it's already
relying on the fact that walsender always sends full records (it can
stop at a page boundary, at a continuation record).

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
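[The checks listed above can be sketched as a small validation pipeline: cheap structural checks first, CRC last. The field names, the rmid bound, and the use of CRC-32 via `zlib` are illustrative stand-ins, not the real ReadRecord logic:]

```python
import zlib

RM_MAX_ID = 20   # assumed count of valid resource manager IDs

def record_valid(rec: dict, expected_prev: int, payload: bytes) -> bool:
    """Validate one WAL record (hypothetical framing).

    rec carries 'prev' (link to the previous record), 'rmid'
    (resource manager ID), 'length', and 'crc'.
    """
    if rec["prev"] != expected_prev:          # prev-link must chain
        return False
    if not 0 <= rec["rmid"] <= RM_MAX_ID:     # weak rmid sanity check
        return False
    if rec["length"] == 0 or rec["length"] != len(payload):
        return False
    # With no page headers, the CRC becomes the last line of defense
    # against torn records.
    return zlib.crc32(payload) & 0xFFFFFFFF == rec["crc"]
```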
Tom Lane wrote:
> What do you do with a WAL record that doesn't fit in a segment? (They
> do exist.) I don't think you can eliminate continuation records.
> You could maybe use them only at segment boundaries but I doubt that
> makes things any simpler than they are now.

Hmm, yeah, it doesn't make it that much simpler then.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Tue, Dec 8, 2009 at 10:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> If this was earlier in the release cycle, I'd feel happier.
>
> 2.5 months before beta is the wrong time to re-design the crash recovery
> data format, especially because it's only "a bit awkward". We're bound to
> break something unforeseen and not have time to fix it. If you were
> telling me "impossible", I'd be all ears.

To avoid harming the existing functionality, how about introducing a
new function which reads WAL records at the byte level for Streaming
Replication? ISTM that making the one function ReadRecord cover
several cases (crash recovery and replication) would increase
complexity.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
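[A byte-level reader of the kind suggested here might look like the following minimal sketch: a buffer that accumulates streamed bytes and hands out records once each has fully arrived. The length-prefix framing and all names are hypothetical simplifications, not the patch's code:]

```python
import struct

class WALByteReader:
    """Accumulate a byte stream and yield complete records.

    Assumes each record is framed as a 4-byte little-endian length
    followed by its payload (a stand-in for the real record header).
    """
    LEN = struct.Struct("<I")

    def __init__(self):
        self.buf = bytearray()

    def feed(self, data: bytes):
        """Append newly streamed bytes; return all completed records."""
        self.buf += data
        records = []
        while len(self.buf) >= self.LEN.size:
            (n,) = self.LEN.unpack_from(self.buf)
            if len(self.buf) < self.LEN.size + n:
                break  # record still partial; wait for more bytes
            start = self.LEN.size
            records.append(bytes(self.buf[start:start + n]))
            del self.buf[:start + n]
        return records
```

Because this layer never cares where page boundaries fall, the continuation-record bookkeeping that makes ReadRecord awkward would simply not exist in it.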
On Mon, Dec 7, 2009 at 8:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That case doesn't pose a problem --- the datum would be toasted into
> individual tuples that are certainly no larger than a page. However
> we do have cases where a WAL record can get arbitrarily large; in
> particular a commit record with many subtransactions and/or many
> disk files to delete. These cases do get exercised in the field
> too --- I can recall at least one related bug report.

Sounds like a reason to make the format simpler...

If we raise the maximum segment size, is there a point where we would
be in a reasonable range to impose maximum sizes for these lists?
32MB? 64MB? It's not like there isn't a limit now -- we'll just throw
an out-of-memory error during replay if the record doesn't fit in
memory.

What if we push the work of handling these lists up to the recovery
manager instead of xlog.c? Commit records would then send a record
saying "when xid nnnn commits, the following subtransactions commit as
well", and could send multiple such records. When the recovery manager
sees such a record, it is responsible for remembering the list
somewhere, appending the new values if it has already seen part of the
list (possibly even spilling to disk), and reloading it when it sees
the corresponding commit.

-- 
greg