Thread: Avoid full page images in streaming replication?
ISTM it should be possible to avoid sending full page writes to a streaming replica once the replica has reached a consistent state. I assume that the replica would still need to write full pages to its disk in case of a crash, but the sender could insert special WAL records to tell it when to do so, instead of sending the full page image. Presumably this would be a big win for replication over a WAN.

Am I missing something? I see that pglesslog is no longer supported but couldn't find any particular reason for that...
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
> ISTM it should be possible to avoid sending full page writes to a
> streaming replica once the replica has reached a consistent state. I
> assume that the replica would still need to write full pages to its
> disk in case of a crash, but the sender could insert special WAL records
> to tell it when to do so, instead of sending the full page image.
> Presumably this would be a big win for replication over a WAN.

How are you going to make that work without LSNs in the WAL received by the replica diverging from those in the master's WAL?

			regards, tom lane
<p dir="ltr"><br /> On Oct 22, 2015 23:38, "Tom Lane" <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>> wrote:<br/> ><br /> > Jim Nasby <Jim.Nasby@BlueTreble.com> writes:<br /> > > ISTM it should be possibleto avoid sending full page writes to a<br /> > > streaming replica once the replica has reached a consistentstate. I<br /> > > assume that the replica would still need to write full pages to it's<br /> > > diskin case of a crash, but the sender could insert special WAL records<br /> > > to tell it when to do so, insteadof sending the full page image.<br /> > > Presumably this would be a big win for replication over a WAN.<br/> ><br /> > How are you going to make that work without LSNs in the WAL received by<br /> > the replicadiverging from those in the master's WAL?<br /> ><p dir="ltr">We could in theory send a "this would be been a fpibut it's skipped" record which would only exist in streaming and just make the standby write a noop of some kind? It wouldstill be on the standby but it would at least not consume the bandwidth. Just skip sending the actual contents of thefpi. <p dir="ltr">/Magnus
On 10/22/15 4:42 PM, Magnus Hagander wrote:
> > How are you going to make that work without LSNs in the WAL received by
> > the replica diverging from those in the master's WAL?
>
> We could in theory send a "this would have been an FPI but it's skipped"
> record which would only exist in streaming and just make the standby
> write a noop of some kind? It would still be on the standby but it would
> at least not consume the bandwidth. Just skip sending the actual
> contents of the FPI.

I don't think it can be a noop on the receiver though... doesn't the receiver still need full page images in case of a crash? (Assuming full_page_writes is enabled...)

The other issue is chained replicas, where one of the children may need full page writes (during initial copy).
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 2015-10-22 16:34:38 -0500, Jim Nasby wrote:
> ISTM it should be possible to avoid sending full page writes to a streaming
> replica once the replica has reached a consistent state. I assume that the
> replica would still need to write full pages to its disk in case of a
> crash, but the sender could insert special WAL records to tell it when to do
> so, instead of sending the full page image. Presumably this would be a big
> win for replication over a WAN.

Note that FPIs are often pretty good for replay performance, avoiding lots of synchronous random reads.

I think FPI compression is a better solution for now. I found it to be extremely effective in some benchmarks I recently ran.

Andres
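For reference, on 9.5 the FPI compression Andres is presumably referring to is enabled with the wal_compression setting; the message itself doesn't name the knob, so treat this as an illustrative sketch rather than his exact setup:

    # postgresql.conf (PostgreSQL 9.5 or later)
    # Compress full-page images as they are written to WAL; they are
    # decompressed during replay. Trades some CPU on the primary for
    # smaller WAL, and therefore less data shipped to streaming replicas.
    wal_compression = on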
On Thu, Oct 22, 2015 at 5:57 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> We could in theory send a "this would have been an FPI but it's skipped"
>> record which would only exist in streaming and just make the standby
>> write a noop of some kind? It would still be on the standby but it would
>> at least not consume the bandwidth. Just skip sending the actual
>> contents of the FPI.
>
> I don't think it can be a noop on the receiver though... doesn't the
> receiver still need full page images in case of a crash? (Assuming
> full_page_writes is enabled...)

Yes. If the standby is in the middle of writing a page updated by a WAL record and crashes, it can end up with a torn page. We restart from a restartpoint at a location where the master checkpointed so that we can be certain that replay of the FPI will fix the problem. If you got rid of the FPIs, you'd be dead. This is true both before and after reaching a consistent state, which seems like a fatal flaw in this plan.

It's true that if the standby didn't have the master's FPIs, it could generate its own in some side location that behaves like a separate WAL stream or a double-write buffer. But that would be a heck of a lot of work to implement for an uncertain benefit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10/22/15 5:03 PM, Andres Freund wrote:
> On 2015-10-22 16:34:38 -0500, Jim Nasby wrote:
>> ISTM it should be possible to avoid sending full page writes to a streaming
>> replica once the replica has reached a consistent state. I assume that the
>> replica would still need to write full pages to its disk in case of a
>> crash, but the sender could insert special WAL records to tell it when to do
>> so, instead of sending the full page image. Presumably this would be a big
>> win for replication over a WAN.
>
> Note that FPIs are often pretty good for replay performance, avoiding
> lots of synchronous random reads.

Right. You'd only want to use this if you're streaming over a slow link (like a WAN).

> I think FPI compression is a better solution for now. I found it to be
> extremely effective in some benchmarks I recently ran.

What I'm wondering is how compressible a 'normal' FPI is. Certainly if the hole is zeroed out and the page is mostly empty you'll get great compression. What about other workloads? For reference, replacing an 8192-byte page image with a 16-byte 'FPI placeholder' WAL record is a 512:1 reduction; with a 12-byte record it's roughly 680:1. (I'm assuming we write the hole too, but maybe that's not true?)

FWIW, I started thinking about this when a client overwhelmed a remote slave doing VACUUM FREEZE after a Slony upgrade to 9.4. Granted, that's not normal, but it looks like normal vacuuming generates 2-6 bytes per modified tuple (depending on what was done). So even if you vacuumed 100 rows on a page (which seems pretty high for most cases) that's only ~200-600 bytes, compared to ~8200 bytes for the FPI.

The other interesting thing is that even their local slaves (with 20Gbps bandwidth) fell behind with vacuum_cost_delay=0, because replay was CPU-bound. But presumably that's not due to FPIs.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
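A quick way to get a data point on how compressible a given page is, sketched here with placeholder table, database, and file names (and with gzip standing in for whatever the FPW compression patch actually uses):

    # Requires the pageinspect extension in the target database:
    #   CREATE EXTENSION pageinspect;
    # Dump block 0 of a table as raw bytes, then compare sizes.
    psql -d mydb -Atc "SELECT encode(get_raw_page('my_table', 0), 'hex')" \
        | xxd -r -p > page.raw
    ls -l page.raw                  # should be 8192 bytes
    gzip -9 -c page.raw | wc -c     # compressed size; divide by 8192 for the ratio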
On 10/22/15 5:11 PM, Robert Haas wrote:
> It's true that if the standby didn't have the master's FPIs, it could
> generate its own in some side location that behaves like a separate
> WAL stream or a double-write buffer. But that would be a heck of a
> lot of work to implement for an uncertain benefit.

If the receiver didn't write the WAL before processing it then it can just stick the page image into the WAL it's writing for itself. Probably not good for syncrep, but I don't think you'd want this on for syncrep anyway.

But yes, this is all very hand-wavy without any actual data on what percentage of the WAL stream is FPIs. Looks like pageinspect doesn't work for WAL... does anyone have a script/tool that breaks out what percentage of a WAL file is FPIs?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 2015-10-22 17:47:01 -0500, Jim Nasby wrote:
> On 10/22/15 5:11 PM, Robert Haas wrote:
>> It's true that if the standby didn't have the master's FPIs, it could
>> generate its own in some side location that behaves like a separate
>> WAL stream or a double-write buffer. But that would be a heck of a
>> lot of work to implement for an uncertain benefit.
>
> If the receiver didn't write the WAL before processing it then it can just
> stick the page image into the WAL it's writing for itself. Probably not good
> for syncrep, but I don't think you'd want this on for syncrep anyway.

To me this sounds like a recipe for disaster (i.e. complex bugs). WAL (and thus CRC checksums) differing between nodes. Ugh.

> But yes, this is all very hand-wavy without any actual data on what
> percentage of the WAL stream is FPIs. Looks like pageinspect doesn't work
> for WAL... does anyone have a script/tool that breaks out what percentage of
> a WAL file is FPIs?

pg_xlogdump --stats
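For example (the segment name below is made up; point it at a real file under pg_xlog):

    # Per-resource-manager summary of a WAL segment, including how many
    # bytes went to full page images versus ordinary record data.
    pg_xlogdump --stats 000000010000000000000042

    # Same summary, broken down per record type instead of per rmgr.
    pg_xlogdump --stats=record 000000010000000000000042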
Jim Nasby wrote:

> But yes, this is all very hand-wavy without any actual data on what
> percentage of the WAL stream is FPIs. Looks like pageinspect doesn't work
> for WAL... does anyone have a script/tool that breaks out what percentage of
> a WAL file is FPIs?

pg_xlogdump -z

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 10/22/15 5:52 PM, Andres Freund wrote:
>> If the receiver didn't write the WAL before processing it then it can just
>> stick the page image into the WAL it's writing for itself. Probably not good
>> for syncrep, but I don't think you'd want this on for syncrep anyway.
>
> To me this sounds like a recipe for disaster (i.e. complex bugs). WAL
> (and thus CRC checksums) differing between nodes. Ugh.

The WAL would *not* differ. This would only affect streaming replication, and only the stream itself.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 2015-10-22 17:59:06 -0500, Jim Nasby wrote:
> The WAL would *not* differ.

It would. Hint bits and all.
On 10/22/15 5:53 PM, Alvaro Herrera wrote:
> Jim Nasby wrote:
>
>> But yes, this is all very hand-wavy without any actual data on what
>> percentage of the WAL stream is FPIs. Looks like pageinspect doesn't work
>> for WAL... does anyone have a script/tool that breaks out what percentage of
>> a WAL file is FPIs?
>
> pg_xlogdump -z

Hrm, any option for 9.4? I was hoping to get numbers from some real workloads...
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
Jim Nasby wrote:
> On 10/22/15 5:53 PM, Alvaro Herrera wrote:
>> Jim Nasby wrote:
>>
>>> But yes, this is all very hand-wavy without any actual data on what
>>> percentage of the WAL stream is FPIs. Looks like pageinspect doesn't work
>>> for WAL... does anyone have a script/tool that breaks out what percentage of
>>> a WAL file is FPIs?
>>
>> pg_xlogdump -z
>
> Hrm, any option for 9.4? I was hoping to get numbers from some real
> workloads...

You can run the new one in old pg_xlog ...

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-10-22 20:27:20 -0300, Alvaro Herrera wrote:
> You can run the new one in old pg_xlog ...

You can? The xlog format between 9.4 and 9.5 changed, so I can't see how that'd work?
On Fri, Oct 23, 2015 at 7:37 AM, Jim Nasby wrote:
> What I'm wondering is how compressible a 'normal' FPI is. Certainly if the
> hole is zeroed out and the page is mostly empty you'll get great
> compression. What about other workloads? For reference, replacing an
> 8192-byte page image with a 16-byte 'FPI placeholder' WAL record is a 512:1
> reduction; with a 12-byte record it's roughly 680:1. (I'm assuming we write
> the hole too, but maybe that's not true?)

Well, to begin with, FPIs usually avoid including the page hole in the middle. Now, regarding the compressibility of a page taken without its hole, that's highly schema-dependent. Based on some measurements I did some time ago, a page with repetitive data could compress by up to 40%; with less compressible stuff like UUIDs I recall the savings being more like 20~25%.

For the FPW compression patch I hacked up a module able to work directly on raw pages to test their compressibility:
https://github.com/michaelpq/pg_plugins/tree/master/compress_test
get_raw_page() has been taken from pageinspect, and I added an option to it to remove the hole in the middle of the page. Using that you can get an idea of how much your pages would compress.
--
Michael
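If you only want to know how large the hole on a particular page is, stock pageinspect is enough for a rough check; the table name below is a placeholder:

    -- CREATE EXTENSION pageinspect;  -- if not installed yet
    -- upper - lower is the hole in the middle of the page that an FPI can omit.
    SELECT lower, upper,
           upper - lower AS hole_bytes,
           round(100.0 * (upper - lower) / 8192, 1) AS hole_pct
    FROM page_header(get_raw_page('my_table', 0));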
On Fri, Oct 23, 2015 at 8:42 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-10-22 20:27:20 -0300, Alvaro Herrera wrote:
>> You can run the new one in old pg_xlog ...
>
> You can? The xlog format between 9.4 and 9.5 changed, so I can't see how
> that'd work?

That's not going to work.
--
Michael
Andres Freund wrote:
> On 2015-10-22 20:27:20 -0300, Alvaro Herrera wrote:
>> You can run the new one in old pg_xlog ...
>
> You can? The xlog format between 9.4 and 9.5 changed, so I can't see how
> that'd work?

Oh, crap. Must have been some other cross-version trial run I did, then. I would hope it's at least not terribly difficult to back-patch that commit locally, anyway.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--On 22. Oktober 2015 22:23:58 -0300 Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

>> You can? The xlog format between 9.4 and 9.5 changed, so I can't see how
>> that'd work?
>
> Oh, crap. Must have been some other cross-version trial run I did,
> then. I would hope it's at least not terribly difficult to back-patch
> that commit locally, anyway.

You might want to try <https://github.com/credativ/pg_xlog_analyzer>

Adrian created it basically to track down archive replay performance issues a while ago. Maybe it's useful somehow.

--
Thanks
Bernd
--On 23. Oktober 2015 00:03:30 +0200 Andres Freund <andres@anarazel.de> wrote:

> Note that FPIs are often pretty good for replay performance, avoiding
> lots of synchronous random reads.

That's a very important argument, I think. The difference can be significant even if you have decent storage, and it can render a replica unusable. We had examples in the past where only optimizing the aggressive writing of FPIs led to an acceptable lag on the replica.

--
Thanks
Bernd