Thread: Avoid full page images in streaming replication?

Avoid full page images in streaming replication?

From
Jim Nasby
Date:
ISTM it should be possible to avoid sending full page writes to a 
streaming replica once the replica has reached a consistent state. I 
assume that the replica would still need to write full pages to it's 
disk in case of a crash, but the sender could insert special WAL records 
to tell it when to do so, instead of sending the full page image. 
Presumably this would be a big win for replication over a WAN.

Am I missing something? I see that pglesslog is no longer supported but 
couldn't find any particular reason for that...
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Avoid full page images in streaming replication?

From
Tom Lane
Date:
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
> ISTM it should be possible to avoid sending full page writes to a 
> streaming replica once the replica has reached a consistent state. I 
> assume that the replica would still need to write full pages to it's 
> disk in case of a crash, but the sender could insert special WAL records 
> to tell it when to do so, instead of sending the full page image. 
> Presumably this would be a big win for replication over a WAN.

How are you going to make that work without LSNs in the WAL received by
the replica diverging from those in the master's WAL?
        regards, tom lane



Re: Avoid full page images in streaming replication?

From
Magnus Hagander
Date:
<p dir="ltr"><br /> On Oct 22, 2015 23:38, "Tom Lane" <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>>
wrote:<br/> ><br /> > Jim Nasby <Jim.Nasby@BlueTreble.com> writes:<br /> > > ISTM it should be
possibleto avoid sending full page writes to a<br /> > > streaming replica once the replica has reached a
consistentstate. I<br /> > > assume that the replica would still need to write full pages to it's<br /> > >
diskin case of a crash, but the sender could insert special WAL records<br /> > > to tell it when to do so,
insteadof sending the full page image.<br /> > > Presumably this would be a big win for replication over a
WAN.<br/> ><br /> > How are you going to make that work without LSNs in the WAL received by<br /> > the
replicadiverging from those in the master's WAL?<br /> ><p dir="ltr">We could in theory send a "this would be been a
fpibut it's skipped" record which would only exist in streaming and just make the standby write a noop of some kind? It
wouldstill be on the standby but it would at least not consume the bandwidth. Just skip sending the actual contents of
thefpi. <p dir="ltr">/Magnus  

Re: Avoid full page images in streaming replication?

From
Jim Nasby
Date:
On 10/22/15 4:42 PM, Magnus Hagander wrote:
>  > How are you going to make that work without LSNs in the WAL received by
>  > the replica diverging from those in the master's WAL?
>  >
>
> We could in theory send a "this would be been a fpi but it's skipped"
> record which would only exist in streaming and just make the standby
> write a noop of some kind? It would still be on the standby but it would
> at least not consume the bandwidth. Just skip sending the actual
> contents of the fpi.

I don't think it can be a noop on the receiver though... doesn't the 
receiver still need full page images in case of a crash? (Assuming 
full_page_writes is enabled...)

The other issue is chained replicas, where one of the children may need 
full page writes (during initial copy).
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Avoid full page images in streaming replication?

From
Andres Freund
Date:
On 2015-10-22 16:34:38 -0500, Jim Nasby wrote:
> ISTM it should be possible to avoid sending full page writes to a streaming
> replica once the replica has reached a consistent state. I assume that the
> replica would still need to write full pages to it's disk in case of a
> crash, but the sender could insert special WAL records to tell it when to do
> so, instead of sending the full page image. Presumably this would be a big
> win for replication over a WAN.

Note that FPIs are often pretty good for replay performance, avoiding
lots of synchronous random reads.

I think FPI compression is a better solution for now. I found it to be
extremely effective in some benchmarks I recently ran.

Andres



Re: Avoid full page images in streaming replication?

From
Robert Haas
Date:
On Thu, Oct 22, 2015 at 5:57 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> We could in theory send a "this would be been a fpi but it's skipped"
>> record which would only exist in streaming and just make the standby
>> write a noop of some kind? It would still be on the standby but it would
>> at least not consume the bandwidth. Just skip sending the actual
>> contents of the fpi.
>
> I don't think it can be a noop on the receiver though... doesn't the
> receiver still need full page images in case of a crash? (Assuming
> full_page_writes is enabled...)

Yes. If the standby is in the middle of writing a page updated by a
WAL record and crashes, it can end up with a torn page.  We restart
from a restartpoint at a location where the master checkpointed so
that we can be certain that replay of the FPI will fix the problem.
If you got rid of the FPIs, you'd be dead.

This is true both before and after reaching a consistent state, which
seems like a fatal flaw in this plan.

It's true that if the standby didn't have the master's FPIs, it could
generate its own in some side location that behaves like a separate
WAL stream or a double-write buffer.  But that would be a heck of a
lot of work to implement for an uncertain benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Avoid full page images in streaming replication?

From
Jim Nasby
Date:
On 10/22/15 5:03 PM, Andres Freund wrote:
> On 2015-10-22 16:34:38 -0500, Jim Nasby wrote:
>> ISTM it should be possible to avoid sending full page writes to a streaming
>> replica once the replica has reached a consistent state. I assume that the
>> replica would still need to write full pages to it's disk in case of a
>> crash, but the sender could insert special WAL records to tell it when to do
>> so, instead of sending the full page image. Presumably this would be a big
>> win for replication over a WAN.
>
> Note that FPIs are often pretty good for replay performance, avoiding
> lots of synchronous random reads.

Right. You'd only want to use this if you're streaming over a slow link 
(like a WAN).

> I think FPI compression is a better solution for now. I found it to be
> extremely effective in some benchmarks I recently ran.

What I'm wondering is how compressible a 'normal' FPI is. Certainly if 
the hole is zero'd out and the page is mostly empty you'll get great 
compression. What about other workloads? For reference, if a 'FPI 
placeholder' WAL record is 16 bytes, that's 51,200% compression. If it's 
12 bytes, it's 68,200% compression. (I'm assuming we write the hole too, 
but maybe that's not true?)

FWIW, I started thinking about this when a client overwhelmed a remote 
slave doing VACUUM FREEZE after a Slony upgrade to 9.4. Granted, that's 
not normal, but it looks like normal vacuuming generates 2-6 bytes per 
modified tuple (depending on what was done). So even if you vacuumed 100 
rows on a page (which seems pretty high for most cases) that's only 
~200-600 bytes, compared to ~8200 bytes for the FPI.

The other interesting thing is that even their local slaves (with 20Gbps 
bandwidth) fell behind with vacuum_cost_delay=0, because replay was 
CPU-bound. But presumably that's not due to FPIs.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Avoid full page images in streaming replication?

From
Jim Nasby
Date:
On 10/22/15 5:11 PM, Robert Haas wrote:
> It's true that if the standby didn't have the master's FPIs, it could
> generate its own in some side location that behaves like a separate
> WAL stream or a double-write buffer.  But that would be a heck of a
> lot of work to implement for an uncertain benefit.

If the receiver didn't write the WAL before processing it then it can 
just stick the page image into the WAL it's writing for itself. Probably 
not good for syncrep, but I don't think you'd want this on for syncrep 
anyway.

But yes, this is all very hand-wavy without any actual data on what 
percentage of the WAL stream is FPIs. Looks like pageinspect doesn't 
work for WAL... does anyone have a script/tool that breaks out what 
percentage of a WAL file is FPIs?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Avoid full page images in streaming replication?

From
Andres Freund
Date:
On 2015-10-22 17:47:01 -0500, Jim Nasby wrote:
> On 10/22/15 5:11 PM, Robert Haas wrote:
> >It's true that if the standby didn't have the master's FPIs, it could
> >generate its own in some side location that behaves like a separate
> >WAL stream or a double-write buffer.  But that would be a heck of a
> >lot of work to implement for an uncertain benefit.
> 
> If the receiver didn't write the WAL before processing it then it can just
> stick the page image into the WAL it's writing for itself. Probably not good
> for syncrep, but I don't think you'd want this on for syncrep anyway.

To me this sounds like a recipe for disaster (i.e. complex bugs). WAL
(and thus CRC checksums) differing between nodes. Ugh.

> But yes, this is all very hand-wavy without any actual data on what
> percentage of the WAL stream is FPIs. Looks like pageinspect doesn't work
> for WAL... does anyone have a script/tool that breaks out what percentage of
> a WAL file is FPIs?

pg_xlogdump --stats



Re: Avoid full page images in streaming replication?

From
Alvaro Herrera
Date:
Jim Nasby wrote:

> But yes, this is all very hand-wavy without any actual data on what
> percentage of the WAL stream is FPIs. Looks like pageinspect doesn't work
> for WAL... does anyone have a script/tool that breaks out what percentage of
> a WAL file is FPIs?

pg_xlogdump -z

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Avoid full page images in streaming replication?

From
Jim Nasby
Date:
On 10/22/15 5:52 PM, Andres Freund wrote:
>> If the receiver didn't write the WAL before processing it then it can just
>> >stick the page image into the WAL it's writing for itself. Probably not good
>> >for syncrep, but I don't think you'd want this on for syncrep anyway.
> To me this sounds like a recipe for disaster (i.e. complex bugs). WAL
> (and thus CRC checksums) differing between nodes. Ugh.

The WAL would *not* differ. This would only affect streaming 
replication, and only the stream itself.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Avoid full page images in streaming replication?

From
Andres Freund
Date:
On 2015-10-22 17:59:06 -0500, Jim Nasby wrote:
> The WAL would *not* differ.

It would. Hint bits and all.



Re: Avoid full page images in streaming replication?

From
Jim Nasby
Date:
On 10/22/15 5:53 PM, Alvaro Herrera wrote:
> Jim Nasby wrote:
>
>> But yes, this is all very hand-wavy without any actual data on what
>> percentage of the WAL stream is FPIs. Looks like pageinspect doesn't work
>> for WAL... does anyone have a script/tool that breaks out what percentage of
>> a WAL file is FPIs?
>
> pg_xlogdump -z

Hrm, any option for 9.4? I was hoping to get numbers from some real 
workloads...
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Avoid full page images in streaming replication?

From
Alvaro Herrera
Date:
Jim Nasby wrote:
> On 10/22/15 5:53 PM, Alvaro Herrera wrote:
> >Jim Nasby wrote:
> >
> >>But yes, this is all very hand-wavy without any actual data on what
> >>percentage of the WAL stream is FPIs. Looks like pageinspect doesn't work
> >>for WAL... does anyone have a script/tool that breaks out what percentage of
> >>a WAL file is FPIs?
> >
> >pg_xlogdump -z
> 
> Hrm, any option for 9.4? I was hoping to get numbers from some real
> workloads...

You can run the new one in old pg_xlog ...

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Avoid full page images in streaming replication?

From
Andres Freund
Date:
On 2015-10-22 20:27:20 -0300, Alvaro Herrera wrote:
> You can run the new one in old pg_xlog ...

You can? The xlog format between 9.4 and 9.5 changed, so I can't see how
that'd work?



Re: Avoid full page images in streaming replication?

From
Michael Paquier
Date:
On Fri, Oct 23, 2015 at 7:37 AM, Jim Nasby wrote:
> What I'm wondering is how compressible a 'normal' FPI is. Certainly if the
> hole is zero'd out and the page is mostly empty you'll get great
> compression. What about other workloads? For reference, if a 'FPI
> placeholder' WAL record is 16 bytes, that's 51,200% compression. If it's 12
> bytes, it's 68,200% compression. (I'm assuming we write the hole too, but
> maybe that's not true?)

Well, to begin with FPI usually avoid to include the page hole in the
middle. Now, regarding the compressibility of a page taken without its
hole, that's highly schema-dependent. Based on some measurements I did
some time ago a page with repetitive data could compress up to 40%,
with less compressible stuff like UUID I recall it to be 20~25%. I
hacked out for the FPW compression patch a module able to work
directly on raw pages to test their compressibility:
https://github.com/michaelpq/pg_plugins/tree/master/compress_test
get_raw_page() has been taken from pageinspect and I added to it an
option to remove the hole in the middle of the page. Using that you
are able to guess how much pages can get compressed.
-- 
Michael



Re: Avoid full page images in streaming replication?

From
Michael Paquier
Date:
On Fri, Oct 23, 2015 at 8:42 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-10-22 20:27:20 -0300, Alvaro Herrera wrote:
>> You can run the new one in old pg_xlog ...
>
> You can? The xlog format between 9.4 and 9.5 changed, so I can't see how
> that'd work?

That's not going to work.
-- 
Michael



Re: Avoid full page images in streaming replication?

From
Alvaro Herrera
Date:
Andres Freund wrote:
> On 2015-10-22 20:27:20 -0300, Alvaro Herrera wrote:
> > You can run the new one in old pg_xlog ...
> 
> You can? The xlog format between 9.4 and 9.5 changed, so I can't see how
> that'd work?

Oh, crap.  Must have been some other cross-version trial run I did,
then.  I would hope it's at least not terribly difficult to back-patch
that commit locally, anyway.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Avoid full page images in streaming replication?

From
Bernd Helmle
Date:

--On 22. Oktober 2015 22:23:58 -0300 Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

>> You can? The xlog format between 9.4 and 9.5 changed, so I can't see how
>> that'd work?
> 
> Oh, crap.  Must have been some other cross-version trial run I did,
> then.  I would hope it's at least not terribly difficult to back-patch
> that commit locally, anyway.

You might want to try 

<https://github.com/credativ/pg_xlog_analyzer>

Adrian created it basically to track down archive replay performance issues
a while ago. Maybe it's useful somehow.

-- 
Thanks
Bernd



Re: Avoid full page images in streaming replication?

From
Bernd Helmle
Date:

--On 23. Oktober 2015 00:03:30 +0200 Andres Freund <andres@anarazel.de>
wrote:

> 
> Note that FPIs are often pretty good for replay performance, avoiding
> lots of synchronous random reads.

That's a very import argument, i think. The difference can be significant,
even if you have a decent storage, rendering a replica unusable. We had
examples in the past where only optimizing aggressive writes of FPIs lead
to an acceptable lag of the replica.

-- 
Thanks
Bernd