Thread: finding changed blocks using WAL scanning

finding changed blocks using WAL scanning

From
Robert Haas
Date:
Over at https://www.postgresql.org/message-id/CA%2BTgmobFVe4J4AA7z9OMUzKnm09Tt%2BsybhxeL_Ddst3q3wqpzQ%40mail.gmail.com
I mentioned parsing the WAL to extract block references so that
incremental backup could efficiently determine which blocks needed to
be copied.  Ashwin replied in
http://postgr.es/m/CALfoeitO-vkfjubMFQRmgyXghL-uUnZLNxbr=obrQQsm8kFO4A@mail.gmail.com
to mention that the same approach could be useful for pg_upgrade:

# Currently, pg_rewind requires
# all the WAL logs to be present on source side from point of divergence to
# rewind. Instead just parse the WAL and keep the changed blocks around on
# source. Then don't need to retain the WAL but can still rewind using the
# changed block map.

Since there are at least two possible use cases for an efficient way to
learn when blocks last changed, and in each case the performance
benefits of being able to learn that are potentially quite large, I'm
starting this thread to brainstorm how such a system might work.

It seems to me that there are basically two ways of storing this kind
of information, plus a bunch of variants.  One way is to store files
that cover a range of LSNs, and basically contain a synopsis of the
WAL for those LSNs.  You omit all the actual data and just mention
which blocks were changed by some record in that part of the WAL.  In
this type of scheme, the storage required is roughly proportional to
the volume of WAL for which you wish to retain data.  Pruning old data
is easy; just remove the files that provide information about LSNs
that you don't care about any more.  The other way is to store data
about each block, or each range of blocks, or all the blocks that hash
onto a certain slot; for each, store the newest LSN that has modified
that block, or a block in that range, or a block that hashes onto
that slot.  In this system, storage is roughly proportional to the
size of the database cluster, except maybe in the hashing case, but I
*think* that case will degrade unless you basically expand the map to
be roughly proportional to the size of the cluster anyway.  I may be
wrong.

Of these two variants, I am inclined to prefer the version where each
file is a summary of the block references within some range of LSNs.
It seems simpler to implement to me.  You just read a bunch of WAL
files and then when you get tired you stop and emit an output file.
You need to protect yourself against untimely crashes.  One way is to
stick a checksum into the output file.  After you finish writing it,
fsync() it before you start writing the next one.  After a restart,
read the latest such file and see if the checksum is OK.  If not,
regenerate it; if so, assume it's good and move on.  Files other than
the last one can be assumed good.  Another way is to create the file
with a temporary name, fsync() it, and then rename it into place and
fsync() again.  The background worker that generates the files can
have a GUC to remove them when they are older than some threshold
amount of time, or you can keep them forever and let the user manually
remove stuff they no longer want based on LSN.  That's pretty much it.
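
To sketch that second, rename-based approach concretely
(write_modblock_file() and its error handling are purely illustrative,
not an existing PostgreSQL function):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write buf durably under final_path using the
 * temporary-name + fsync + rename + fsync pattern described above.
 * dir is the directory that will contain the file. */
static int
write_modblock_file(const char *dir, const char *final_path,
                    const void *buf, size_t len)
{
    char        tmp_path[1024];
    int         fd;

    snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", final_path);

    fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);

    /* Make the file visible under its final name... */
    if (rename(tmp_path, final_path) != 0)
        return -1;

    /* ...and fsync the containing directory so the rename is durable. */
    fd = open(dir, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

Whether that second fsync() should target the renamed file or its
containing directory is an implementation detail; the sketch fsyncs the
directory so the rename itself is known to survive a crash.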

The version where you keep an LSN per block/range/hash bucket seems
more complex in terms of durability.  Now you have to worry not only
about creating new files, but about modifying them.  That seems to add
some complexity.  I think it may be possible to make it work without
doing write-ahead logging for every change, but it certainly needs
careful thought to avoid torn page problems and checkpoint
synchronization issues.  Moreover, it potentially uses lots and lots
of inodes if there are many relations in the cluster.  You can avoid
that by not creating maps for small files, if you like, or by
switching to the hash bucket approach.  But either of those approaches
is lossy.  Omitting the maps for small files means you always have to
assume everything in those files changed. The hash bucket approach is
vulnerable to hash collisions; you have to assume that all blocks that
hash to a given bucket have changed.  Those are probably manageable
disadvantages, but I think they are real ones.

There is one thing that does worry me about the file-per-LSN-range
approach, and that is memory consumption when trying to consume the
information.  Suppose you have a really high velocity system.  I don't
know exactly what the busiest systems around are doing in terms of
data churn these days, but let's say just for kicks that we are
dirtying 100GB/hour.  That means, roughly 12.5 million block
references per hour.  If each block reference takes 12 bytes, that's
maybe 150MB/hour in block reference files.  If you run a daily
incremental backup, you've got to load all the block references for
the last 24 hours and deduplicate them, which means you're going to
need about 3.6GB of memory.  If you run a weekly incremental backup,
you're going to need about 25GB of memory.  That is not ideal.  One
can keep the memory consumption to a more reasonable level by using
temporary files.  For instance, say you realize you're going to need
25GB of memory to store all the block references you have, but you
only have 1GB of memory that you're allowed to use.  Well, just
hash-partition the data 32 ways by dboid/tsoid/relfilenode/segno,
writing each batch to a separate temporary file, and then process each
of those 32 files separately.  That does add some additional I/O, but
it's not crazily complicated and doesn't seem too terrible, at least
to me.  Still, it's something not to like.
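
As a rough sketch of that partitioning step (BlockRef and the hash
below are assumptions made for illustration, not existing PostgreSQL
types or functions):

#include <stdint.h>
#include <stdio.h>

typedef struct BlockRef
{
    uint32_t    dboid;
    uint32_t    tsoid;
    uint32_t    relfilenode;
    uint32_t    segno;
    uint32_t    blockno;
} BlockRef;

#define NUM_PARTITIONS 32

/* Hash only the file identity, not the block number, so that all
 * references to one segment land in the same partition. */
static uint32_t
blockref_partition(const BlockRef *ref)
{
    uint32_t    h = ref->dboid;

    h = h * 31 + ref->tsoid;
    h = h * 31 + ref->relfilenode;
    h = h * 31 + ref->segno;
    return h % NUM_PARTITIONS;
}

/* Append the reference to its partition's temporary file; each
 * partition is later loaded and deduplicated on its own, within the
 * memory budget. */
static void
spill_block_ref(FILE *partitions[NUM_PARTITIONS], const BlockRef *ref)
{
    fwrite(ref, sizeof(BlockRef), 1, partitions[blockref_partition(ref)]);
}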

Another problem to think about is whether the most recent data is
going to be available when you need it.  This concern applies to
either approach.  In the case of incremental backup, the server is
going to be up and running, so if the WAL-scanner gets behind, you can
just wait for it to catch up.  In the case of pg_rewind, the server is
going to be down, so that doesn't work.  You probably need to figure
out how new the data you have is, and then scan all the newer WAL to
pick up any additional block references.  That's a bit of a pain, but
I don't see any real alternative.  In the case of the
file-per-LSN-range approach, it's easy to see what LSNs are covered by
the files.  In the case of the LSN-per-block/range/hash bucket
approach, you can presumably rely on the last checkpoint having
flushed all the pending changes to disk, but the WAL scanner could've
been arbitrarily far behind at that point, so you'd have to store a
piece of metadata someplace that tells you exactly how far the WAL
scanner had progressed and then have pg_rewind fish it out.

Yet another question is how to make sure WAL doesn't get removed
before we finish scanning it.  Peter mentioned on the other thread
that we could use a variant replication slot, which immediately made
me wonder why we'd need a variant.  Actually, the biggest problem I
see here is that if we use a replication slot, somebody might try to
drop it or use it for some other purpose, and that would mess things
up.  I guess we could pull the usual trick of reserving names that
start with 'pg_' for internal purposes.  Or we could just hard-code
the LSN that was last scanned for this purpose as a bespoke constraint
on WAL discard.  Not sure what is best.

I think all of this should be optional functionality.  It's going to
be really useful for people with large databases, I think, but people
with small databases may not care, and it won't be entirely free.  If
it's not enabled, then the functionality that would otherwise exploit
it can fall back to doing things in a less efficient way; nothing
needs to break hard.

Opinions?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Wed, Apr 10, 2019 at 5:49 PM Robert Haas <robertmhaas@gmail.com> wrote:
> There is one thing that does worry me about the file-per-LSN-range
> approach, and that is memory consumption when trying to consume the
> information.  Suppose you have a really high velocity system.  I don't
> know exactly what the busiest systems around are doing in terms of
> data churn these days, but let's say just for kicks that we are
> dirtying 100GB/hour.  That means, roughly 12.5 million block
> references per hour.  If each block reference takes 12 bytes, that's
> maybe 150MB/hour in block reference files.  If you run a daily
> incremental backup, you've got to load all the block references for
> the last 24 hours and deduplicate them, which means you're going to
> need about 3.6GB of memory.  If you run a weekly incremental backup,
> you're going to need about 25GB of memory.  That is not ideal.  One
> can keep the memory consumption to a more reasonable level by using
> temporary files.  For instance, say you realize you're going to need
> 25GB of memory to store all the block references you have, but you
> only have 1GB of memory that you're allowed to use.  Well, just
> hash-partition the data 32 ways by dboid/tsoid/relfilenode/segno,
> writing each batch to a separate temporary file, and then process each
> of those 32 files separately.  That does add some additional I/O, but
> it's not crazily complicated and doesn't seem too terrible, at least
> to me.  Still, it's something not to like.

Oh, I'm being dumb.  We should just have the process that writes out
these files sort the records first.  Then when we read them back in to
use them, we can just do a merge pass like MergeAppend would do.  Then
you never need very much memory at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Peter Eisentraut
Date:
On 2019-04-10 23:49, Robert Haas wrote:
> It seems to me that there are basically two ways of storing this kind
> of information, plus a bunch of variants.  One way is to store files
> that cover a range of LSNs, and basically contain a synopsis of the
> WAL for those LSNs.  You omit all the actual data and just mention
> which blocks were changed by some record in that part of the WAL.

That seems better than the other variant.

> Yet another question is how to make sure WAL doesn't get removed
> before we finish scanning it.  Peter mentioned on the other thread
> that we could use a variant replication slot, which immediately made
> me wonder why we'd need a variant.  Actually, the biggest problem I
> see here is that if we use a replication slot, somebody might try to
> drop it or use it for some other purpose, and that would mess things
> up.  I guess we could pull the usual trick of reserving names that
> start with 'pg_' for internal purposes.  Or we could just hard-code
> the LSN that was last scanned for this purpose as a bespoke constraint
> on WAL discard.  Not sure what is best.

The word "variant" was used as a hedge ;-), but now that I think about
it ...

I had in mind that you could have different overlapping incremental
backup jobs in existence at the same time.  Maybe a daily one to a
nearby disk and a weekly one to a faraway cloud.  Each one of these
would need a separate replication slot, so that the information that is
required for *that* incremental backup series is preserved between runs.
 So just one reserved replication slot that feeds the block summaries
wouldn't work.  Perhaps what would work is a flag on the replication
slot itself "keep block summaries for this slot".  Then when all the
slots with the block summary flag are past an LSN, you can clean up the
summaries before that LSN.

> I think all of this should be optional functionality.  It's going to
> be really useful for people with large databases, I think, but people
> with small databases may not care, and it won't be entirely free.  If
> it's not enabled, then the functionality that would otherwise exploit
> it can fall back to doing things in a less efficient way; nothing
> needs to break hard.

With the flag on the slot scheme you wouldn't need a separate knob to
turn this on, because it's just enabled when a backup software has
created an appropriate slot.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Thu, Apr 11, 2019 at 3:52 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> I had in mind that you could have different overlapping incremental
> backup jobs in existence at the same time.  Maybe a daily one to a
> nearby disk and a weekly one to a faraway cloud.  Each one of these
> would need a separate replication slot, so that the information that is
> required for *that* incremental backup series is preserved between runs.
>  So just one reserved replication slot that feeds the block summaries
> wouldn't work.  Perhaps what would work is a flag on the replication
> slot itself "keep block summaries for this slot".  Then when all the
> slots with the block summary flag are past an LSN, you can clean up the
> summaries before that LSN.

I don't think that quite works.  There are two different LSNs.  One is
the LSN of the oldest WAL archive that we need to keep around so that
it can be summarized, and the other is the LSN of the oldest summary
we need to keep around so it can be used for incremental backup
purposes.  You can't keep both of those LSNs in the same slot.
Furthermore, the LSN stored in the slot is defined as the amount of
WAL we need to keep, not the amount of something else (summaries) that
we need to keep.  Reusing that same field to mean something different
sounds inadvisable.

In other words, I think there are two problems which we need to
clearly separate: one is retaining WAL so we can generate summaries,
and the other is retaining summaries so we can generate incremental
backups.  Even if we solve the second problem by using some kind of
replication slot, we still need to solve the first problem somehow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Ashwin Agrawal
Date:

On Wed, Apr 10, 2019 at 2:50 PM Robert Haas <robertmhaas@gmail.com> wrote:
Over at https://www.postgresql.org/message-id/CA%2BTgmobFVe4J4AA7z9OMUzKnm09Tt%2BsybhxeL_Ddst3q3wqpzQ%40mail.gmail.com
I mentioned parsing the WAL to extract block references so that
incremental backup could efficiently determine which blocks needed to
be copied.  Ashwin replied in
http://postgr.es/m/CALfoeitO-vkfjubMFQRmgyXghL-uUnZLNxbr=obrQQsm8kFO4A@mail.gmail.com
to mention that the same approach could be useful for pg_upgrade:

Thank you for initiating this separate thread. Just a typo above: not pg_upgrade but pg_rewind.

Let me first explain the thought I have around how to leverage this for pg_rewind, or really any type of incremental recovery. Would love to hear thoughts on it.

Currently, incremental recovery of any form, whether a replica goes down and comes back up or we try to bring back the primary after failover to a replica, requires *all* the WAL to be present from the point of disconnect. So it's boolean in those terms: if the WAL is available we can recover incrementally, otherwise we have to perform a full basebackup. If we come up with this mechanism to find and store changed blocks from WAL, we can provide an intermediate level of incremental recovery which will be better than full recovery.

WAL allows tuple-level granularity for recovery (if we ignore FPIs for a moment). The modified blocks extracted from WAL, for when the WAL itself is no longer available, provide block-level incremental recovery.

So pg_basebackup (or some other tool, or just an option to it) and pg_rewind can leverage the changed blocks to perform the recovery if the WAL can't be retained due to space constraints.

pg_rewind can also be optimized: it currently copies from source to target the blocks that were found in the target's WAL for the rewind, so such blocks could easily be skipped instead of being copied again.

Depending on the pattern and volume of changes in the WAL, just copying over the changed blocks could prove more efficient than replaying all the WAL for incremental recovery.

It seems to me that there are basically two ways of storing this kind
of information, plus a bunch of variants.  One way is to store files
that cover a range of LSNs, and basically contain a synopsis of the
WAL for those LSNs.  You omit all the actual data and just mention
which blocks were changed by some record in that part of the WAL.  In
this type of scheme, the storage required is roughly proportional to
the volume of WAL for which you wish to retain data.  Pruning old data
is easy; just remove the files that provide information about LSNs
that you don't care about any more.  The other way is to store data
about each block, or each range of blocks, or all the blocks that hash
onto a certain slot; for each, store the newest LSN that has modified
that block, or a block in that range, or a block that hashes onto
that slot.  In this system, storage is roughly proportional to the
size of the database cluster, except maybe in the hashing case, but I
*think* that case will degrade unless you basically expand the map to
be roughly proportional to the size of the cluster anyway.  I may be
wrong.

Of these two variants, I am inclined to prefer the version where each
file is a summary of the block references within some range of LSNs.
It seems simpler to implement to me.  You just read a bunch of WAL
files and then when you get tired you stop and emit an output file.
You need to protect yourself against untimely crashes.  One way is to
stick a checksum into the output file.  After you finish writing it,
fsync() it before you start writing the next one.  After a restart,
read the latest such file and see if the checksum is OK.  If not,
regenerate it; if so, assume it's good and move on.  Files other than
the last one can be assumed good.  Another way is to create the file
with a temporary name, fsync() it, and then rename it into place and
fsync() again.  The background worker that generates the files can
have a GUC to remove them when they are older than some threshold
amount of time, or you can keep them forever and let the user manually
remove stuff they no longer want based on LSN.  That's pretty much it.

+1 for the first option. Seems simpler and more straightforward.

Re: finding changed blocks using WAL scanning

From
Ashwin Agrawal
Date:

On Thu, Apr 11, 2019 at 6:27 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Apr 11, 2019 at 3:52 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> I had in mind that you could have different overlapping incremental
> backup jobs in existence at the same time.  Maybe a daily one to a
> nearby disk and a weekly one to a faraway cloud.  Each one of these
> would need a separate replication slot, so that the information that is
> required for *that* incremental backup series is preserved between runs.
>  So just one reserved replication slot that feeds the block summaries
> wouldn't work.  Perhaps what would work is a flag on the replication
> slot itself "keep block summaries for this slot".  Then when all the
> slots with the block summary flag are past an LSN, you can clean up the
> summaries before that LSN.

I don't think that quite works.  There are two different LSNs.  One is
the LSN of the oldest WAL archive that we need to keep around so that
it can be summarized, and the other is the LSN of the oldest summary
we need to keep around so it can be used for incremental backup
purposes.  You can't keep both of those LSNs in the same slot.
Furthermore, the LSN stored in the slot is defined as the amount of
WAL we need to keep, not the amount of something else (summaries) that
we need to keep.  Reusing that same field to mean something different
sounds inadvisable.

In other words, I think there are two problems which we need to
clearly separate: one is retaining WAL so we can generate summaries,
and the other is retaining summaries so we can generate incremental
backups.  Even if we solve the second problem by using some kind of
replication slot, we still need to solve the first problem somehow.

Just a thought for the first problem (it may not be simpler): could the replication slot be enhanced to define X amount of WAL to retain, and after reaching that limit, collect the summary and let the WAL be deleted?

Re: finding changed blocks using WAL scanning

From
Kyotaro HORIGUCHI
Date:
At Thu, 11 Apr 2019 10:00:35 -0700, Ashwin Agrawal <aagrawal@pivotal.io> wrote in
<CALfoeis0qOyGk+KQ3AbkfRVv=XbsSecqHfKSag=i_SLWMT+B0A@mail.gmail.com>
> On Thu, Apr 11, 2019 at 6:27 AM Robert Haas <robertmhaas@gmail.com> wrote:
> 
> > On Thu, Apr 11, 2019 at 3:52 AM Peter Eisentraut
> > <peter.eisentraut@2ndquadrant.com> wrote:
> > > I had in mind that you could have different overlapping incremental
> > > backup jobs in existence at the same time.  Maybe a daily one to a
> > > nearby disk and a weekly one to a faraway cloud.  Each one of these
> > > would need a separate replication slot, so that the information that is
> > > required for *that* incremental backup series is preserved between runs.
> > >  So just one reserved replication slot that feeds the block summaries
> > > wouldn't work.  Perhaps what would work is a flag on the replication
> > > slot itself "keep block summaries for this slot".  Then when all the
> > > slots with the block summary flag are past an LSN, you can clean up the
> > > summaries before that LSN.
> >
> > I don't think that quite works.  There are two different LSNs.  One is
> > the LSN of the oldest WAL archive that we need to keep around so that
> > it can be summarized, and the other is the LSN of the oldest summary
> > we need to keep around so it can be used for incremental backup
> > purposes.  You can't keep both of those LSNs in the same slot.
> > Furthermore, the LSN stored in the slot is defined as the amount of
> > WAL we need to keep, not the amount of something else (summaries) that
> > we need to keep.  Reusing that same field to mean something different
> > sounds inadvisable.
> >
> > In other words, I think there are two problems which we need to
> > clearly separate: one is retaining WAL so we can generate summaries,
> > and the other is retaining summaries so we can generate incremental
> > backups.  Even if we solve the second problem by using some kind of
> > replication slot, we still need to solve the first problem somehow.
> 
> Just a thought for the first problem (it may not be simpler): could the
> replication slot be enhanced to define X amount of WAL to retain, and
> after reaching that limit, collect the summary and let the WAL be deleted?

I think Peter is saying that a slot for block summaries doesn't
keep the WAL segments themselves, but instead keeps (perhaps
segmented) block summaries.  n block-summary slots maintain n block
summaries, and the newest block summary is "active", in other words
continuously updated by the WAL records passing by.  When a backup
tool requests the block summary, for example for the oldest slot,
the active summary is closed and a new summary is opened from the
LSN at that time, which becomes the new LSN of the slot.  Then the
concatenated block summary is sent.  Finally the oldest summary is
removed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: finding changed blocks using WAL scanning

From
Bruce Momjian
Date:
On Wed, Apr 10, 2019 at 08:11:11PM -0400, Robert Haas wrote:
> On Wed, Apr 10, 2019 at 5:49 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > There is one thing that does worry me about the file-per-LSN-range
> > approach, and that is memory consumption when trying to consume the
> > information.  Suppose you have a really high velocity system.  I don't
> > know exactly what the busiest systems around are doing in terms of
> > data churn these days, but let's say just for kicks that we are
> > dirtying 100GB/hour.  That means, roughly 12.5 million block
> > references per hour.  If each block reference takes 12 bytes, that's
> > maybe 150MB/hour in block reference files.  If you run a daily
> > incremental backup, you've got to load all the block references for
> > the last 24 hours and deduplicate them, which means you're going to
> > need about 3.6GB of memory.  If you run a weekly incremental backup,
> > you're going to need about 25GB of memory.  That is not ideal.  One
> > can keep the memory consumption to a more reasonable level by using
> > temporary files.  For instance, say you realize you're going to need
> > 25GB of memory to store all the block references you have, but you
> > only have 1GB of memory that you're allowed to use.  Well, just
> > hash-partition the data 32 ways by dboid/tsoid/relfilenode/segno,
> > writing each batch to a separate temporary file, and then process each
> > of those 32 files separately.  That does add some additional I/O, but
> > it's not crazily complicated and doesn't seem too terrible, at least
> > to me.  Still, it's something not to like.
> 
> Oh, I'm being dumb.  We should just have the process that writes out
> these files sort the records first.  Then when we read them back in to
> use them, we can just do a merge pass like MergeAppend would do.  Then
> you never need very much memory at all.

Can I throw out a simple idea?  What if, when we finish writing a WAL
file, we create a new file 000000010000000000000001.modblock which
lists all the heap/index files and block numbers modified in that WAL
file?  How much does that help with the list I posted earlier?

    I think there is some interesting complexity brought up in this thread.
    Which options are going to minimize storage I/O, network I/O, have only
    background overhead, allow parallel operation, integrate with
    pg_basebackup.  Eventually we will need to evaluate the incremental
    backup options against these criteria.

I am thinking tools could retain modblock files along with WAL, could
pull full-page-writes from WAL, or from PGDATA.  It avoids the need to
scan 16MB WAL files, and the WAL files and modblock files could be
expired independently.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Mon, Apr 15, 2019 at 4:31 PM Bruce Momjian <bruce@momjian.us> wrote:
> Can I throw out a simple idea?  What if, when we finish writing a WAL
> file, we create a new file 000000010000000000000001.modblock which
> lists all the heap/index files and block numbers modified in that WAL
> file?  How much does that help with the list I posted earlier?
>
>         I think there is some interesting complexity brought up in this thread.
>         Which options are going to minimize storage I/O, network I/O, have only
>         background overhead, allow parallel operation, integrate with
>         pg_basebackup.  Eventually we will need to evaluate the incremental
>         backup options against these criteria.
>
> I am thinking tools could retain modblock files along with WAL, could
> pull full-page-writes from WAL, or from PGDATA.  It avoids the need to
> scan 16MB WAL files, and the WAL files and modblock files could be
> expired independently.

That is pretty much exactly what I was intending to propose.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Bruce Momjian
Date:
On Mon, Apr 15, 2019 at 09:04:13PM -0400, Robert Haas wrote:
> On Mon, Apr 15, 2019 at 4:31 PM Bruce Momjian <bruce@momjian.us> wrote:
> > Can I throw out a simple idea?  What if, when we finish writing a WAL
> > file, we create a new file 000000010000000000000001.modblock which
> > lists all the heap/index files and block numbers modified in that WAL
> > file?  How much does that help with the list I posted earlier?
> >
> >         I think there is some interesting complexity brought up in this thread.
> >         Which options are going to minimize storage I/O, network I/O, have only
> >         background overhead, allow parallel operation, integrate with
> >         pg_basebackup.  Eventually we will need to evaluate the incremental
> >         backup options against these criteria.
> >
> > I am thinking tools could retain modblock files along with WAL, could
> > pull full-page-writes from WAL, or from PGDATA.  It avoids the need to
> > scan 16MB WAL files, and the WAL files and modblock files could be
> > expired independently.
> 
> That is pretty much exactly what I was intending to propose.

OK, good.  Some of your wording was vague so I was unclear exactly what
you were suggesting.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: finding changed blocks using WAL scanning

From
Michael Paquier
Date:
On Mon, Apr 15, 2019 at 09:04:13PM -0400, Robert Haas wrote:
> That is pretty much exactly what I was intending to propose.

Any caller of XLogWrite() could switch to a new segment once the
current one is done, and I am not sure that we would want some random
backend to potentially slow down to do that kind of operation.

Or would a separate background worker do this work by itself?  An
external tool can do that easily already:
https://github.com/michaelpq/pg_plugins/tree/master/pg_wal_blocks
--
Michael


Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Mon, Apr 15, 2019 at 10:22 PM Bruce Momjian <bruce@momjian.us> wrote:
> > > I am thinking tools could retain modblock files along with WAL, could
> > > pull full-page-writes from WAL, or from PGDATA.  It avoids the need to
> > > scan 16MB WAL files, and the WAL files and modblock files could be
> > > expired independently.
> >
> > That is pretty much exactly what I was intending to propose.
>
> OK, good.  Some of your wording was vague so I was unclear exactly what
> you were suggesting.

Well, I guess the part that isn't like what I was suggesting is the
idea that there should be exactly one modified block file per segment.
The biggest problem with that idea is that a single WAL record can be
split across two segments (or, in pathological cases, perhaps more).
I think it makes sense to talk about the blocks modified by WAL
between LSN A and LSN B, but it doesn't make much sense to talk about
the block modified by the WAL in segment XYZ.

You can make it kinda make sense by saying "the blocks modified by
records *beginning in* segment XYZ" or alternatively "the blocks
modified by records *ending in* segment XYZ", but that seems confusing
to me.  For example, suppose you decide on the first one --
000000010000000100000068.modblock will contain all blocks modified by
records that begin in 000000010000000100000068.  Well, that means that
to generate the 000000010000000100000068.modblock, you will need
access to 000000010000000100000068 AND probably also
000000010000000100000069 and in rare cases perhaps
00000001000000010000006A or even later files.  I think that's actually
pretty confusing.

It seems better to me to give the files names like
${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g.
00000001.0000000168000058.00000001687DBBB8.modblock, so that you can
see exactly which *records* are covered by that segment.
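
For illustration, a name of that shape could be built like this (a
sketch only; modblock_file_name() is not an existing function):

#include <stdint.h>
#include <stdio.h>

static void
modblock_file_name(char *buf, size_t buflen,
                   uint32_t tli, uint64_t start_lsn, uint64_t end_lsn)
{
    /* e.g. tli 1, LSNs 0x168000058 and 0x1687DBBB8 give
     * 00000001.0000000168000058.00000001687DBBB8.modblock */
    snprintf(buf, buflen, "%08X.%016llX.%016llX.modblock",
             tli,
             (unsigned long long) start_lsn,
             (unsigned long long) end_lsn);
}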

And I suspect it may also be a good idea to bunch up the records from
several WAL files.  Especially if you are using 16MB WAL files,
collecting all of the block references from a single WAL file is going
to produce a very small file.  I suspect that the modified block files
will end up being 100x smaller than the WAL itself, perhaps more, and
I don't think anybody will appreciate us adding another PostgreSQL
system that spews out huge numbers of tiny little files.  If, for
example, somebody's got a big cluster that is churning out a WAL
segment every second, they would probably still be happy to have a new
modified block file only, say, every 10 seconds.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Mon, Apr 15, 2019 at 11:45 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Apr 15, 2019 at 09:04:13PM -0400, Robert Haas wrote:
> > That is pretty much exactly what I was intending to propose.
>
> Any caller of XLogWrite() could switch to a new segment once the
> current one is done, and I am not sure that we would want some random
> backend to potentially slow down to do that kind of operation.
>
> Or would a separate background worker do this work by itself?  An
> external tool can do that easily already:
> https://github.com/michaelpq/pg_plugins/tree/master/pg_wal_blocks

I was thinking that a dedicated background worker would be a good
option, but Stephen Frost seems concerned (over on the other thread)
about how much load that would generate.  That never really occurred
to me as a serious issue and I suspect for many people it wouldn't be,
but there might be some.

It's cool that you have a command-line tool that does this as well.
Over there, it was also discussed that we might want to have both a
command-line tool and a background worker.  I think, though, that we
would want to get the output in some kind of compressed binary format,
rather than text.  e.g.

4-byte database OID
4-byte tablespace OID
any number of relation OID/block OID pairings for that
database/tablespace combination
4-byte zero to mark the end of the relation OID/block OID list
and then repeat all of the above any number of times

That might be too dumb and I suspect we want some headers and a
checksum, but we should try to somehow exploit the fact that there
aren't likely to be many distinct databases or many distinct
tablespaces mentioned -- whereas relation OID and block number will
probably have a lot more entropy.
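
As a rough sketch of a writer for that grouped layout (all names here
are hypothetical), relying on the fact that a relation OID is never
zero, so a 4-byte zero can terminate the pair list:

#include <stdint.h>
#include <stdio.h>

static void
write_u32(FILE *out, uint32_t v)
{
    fwrite(&v, sizeof(v), 1, out);
}

/* One group: database/tablespace header, relation OID / block number
 * pairs, and a 4-byte zero terminator. */
static void
write_group(FILE *out, uint32_t dboid, uint32_t tsoid,
            const uint32_t *relids, const uint32_t *blocknos, size_t npairs)
{
    size_t      i;

    write_u32(out, dboid);
    write_u32(out, tsoid);
    for (i = 0; i < npairs; i++)
    {
        write_u32(out, relids[i]);
        write_u32(out, blocknos[i]);
    }
    write_u32(out, 0);
}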

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Bruce Momjian
Date:
On Thu, Apr 18, 2019 at 03:43:30PM -0400, Robert Haas wrote:
> You can make it kinda make sense by saying "the blocks modified by
> records *beginning in* segment XYZ" or alternatively "the blocks
> modified by records *ending in* segment XYZ", but that seems confusing
> to me.  For example, suppose you decide on the first one --
> 000000010000000100000068.modblock will contain all blocks modified by
> records that begin in 000000010000000100000068.  Well, that means that
> to generate the 000000010000000100000068.modblock, you will need
> access to 000000010000000100000068 AND probably also
> 000000010000000100000069 and in rare cases perhaps
> 00000001000000010000006A or even later files.  I think that's actually
> pretty confusing.
> 
> It seems better to me to give the files names like
> ${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g.
> 00000001.0000000168000058.00000001687DBBB8.modblock, so that you can
> see exactly which *records* are covered by that segment.

How would you choose the STARTLSN/ENDLSN?  If you could do it per
checkpoint, rather than per-WAL, I think that would be great.

> And I suspect it may also be a good idea to bunch up the records from
> several WAL files.  Especially if you are using 16MB WAL files,
> collecting all of the block references from a single WAL file is going
> to produce a very small file.  I suspect that the modified block files
> will end up being 100x smaller than the WAL itself, perhaps more, and
> I don't think anybody will appreciate us adding another PostgreSQL
> systems that spews out huge numbers of tiny little files.  If, for
> example, somebody's got a big cluster that is churning out a WAL
> segment every second, they would probably still be happy to have a new
> modified block file only, say, every 10 seconds.

Agreed.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Thu, Apr 18, 2019 at 3:51 PM Bruce Momjian <bruce@momjian.us> wrote:
> How would you choose the STARTLSN/ENDLSN?  If you could do it per
> checkpoint, rather than per-WAL, I think that would be great.

I thought of that too.  It seems appealing, because you probably only
really care whether a particular block was modified between one
checkpoint and the next, not exactly when during that interval it was
modified.  However, the simple algorithm of "just stop when you get to
a checkpoint record" does not work, because the checkpoint record
itself points back to a much earlier LSN, and I think that it's that
earlier LSN that is interesting.  So if you want to make this work you
have to be more clever, and I'm not sure I'm clever enough.

I think it's important that a .modblock file not be too large, because
then it will use too much memory, and that it not cover too much WAL,
because then it will be too imprecise about when the blocks were
modified.  Perhaps we should have a threshold for each -- e.g. emit
the next .modblock file after finding 2^20 distinct block references
or scanning 1GB of WAL.  Then individual files would probably be in
the single-digit numbers of megabytes in size, assuming we do a decent
job with the compression, and you never need to scan more than 1GB of
WAL to regenerate one.  If the starting point for a backup falls in
the middle of such a file, and you include the whole file, at worst
you have ~8GB of extra blocks to read, but in most cases less, because
your writes probably have some locality and the file may not actually
contain the full 2^20 block references.  You could also make it more
fine-grained than that if you don't mind having more smaller files
floating around.
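
A trivial sketch of that dual threshold, with the constants and the
helper name being assumptions:

#include <stdbool.h>
#include <stdint.h>

#define MAX_BLOCK_REFS   (1024 * 1024)              /* 2^20 references */
#define MAX_WAL_SCANNED  (1024UL * 1024 * 1024)     /* 1GB of WAL      */

static bool
time_to_emit_modblock(uint64_t ndistinct_refs, uint64_t wal_bytes_scanned)
{
    return ndistinct_refs >= MAX_BLOCK_REFS ||
           wal_bytes_scanned >= MAX_WAL_SCANNED;
}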

It would definitely be better if we could set things up so that we
could always switch to the next .modblock file when we cross a
potential redo start point, but they're not noted in the WAL so I
don't see how to do that.  I don't know if it would be possible to
insert some new kind of log record concurrently with fixing the redo
location, so that redo always started at a record of this new type.
That would certainly be helpful for this kind of thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Bruce Momjian
Date:
On Thu, Apr 18, 2019 at 04:25:24PM -0400, Robert Haas wrote:
> On Thu, Apr 18, 2019 at 3:51 PM Bruce Momjian <bruce@momjian.us> wrote:
> > How would you choose the STARTLSN/ENDLSN?  If you could do it per
> > checkpoint, rather than per-WAL, I think that would be great.
> 
> I thought of that too.  It seems appealing, because you probably only
> really care whether a particular block was modified between one
> checkpoint and the next, not exactly when during that interval it was
> modified.  However, the simple algorithm of "just stop when you get to
> a checkpoint record" does not work, because the checkpoint record
> itself points back to a much earlier LSN, and I think that it's that
> earlier LSN that is interesting.  So if you want to make this work you
> have to be more clever, and I'm not sure I'm clever enough.

OK, so let's back up and study how this will be used.  Someone wanting
to make a useful incremental backup will need the changed blocks from
the time of the start of the base backup.  It is fine if they
incrementally back up some blocks modified _before_ the base backup, but
they need all blocks after, until some marker.  They will obviously
still use WAL to recover to a point after the incremental backup, so
there is no need to get every modifified block up to current, just up to
some cut-off point where WAL can be discarded.

I can see a 1GB marker being used for that.  It would prevent an
incremental backup from being done until the first 1GB modblock file was
written, since until then there is no record of modified blocks, but
that seems fine.  A 1G marker would allow for consistent behavior
independent of server restarts and base backups.

How would the modblock file record all the modified blocks across
restarts and crashes?  I assume that 1G of WAL would not be available
for scanning.  I suppose that writing a modblock file to some PGDATA
location when WAL is removed would work since during a crash the
modblock file could be updated with the contents of the existing pg_wal
files.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: finding changed blocks using WAL scanning

From
Andres Freund
Date:
On 2019-04-18 17:47:56 -0400, Bruce Momjian wrote:
> I can see a 1GB marker being used for that.  It would prevent an
> incremental backup from being done until the first 1GB modblock file was
> written, since until then there is no record of modified blocks, but
> that seems fine.  A 1G marker would allow for consistent behavior
> independent of server restarts and base backups.

That doesn't seem like a good idea - it'd make writing regression tests
for this impractical.



Re: finding changed blocks using WAL scanning

From
Michael Paquier
Date:
On Thu, Apr 18, 2019 at 03:51:14PM -0400, Robert Haas wrote:
> I was thinking that a dedicated background worker would be a good
> option, but Stephen Frost seems concerned (over on the other thread)
> about how much load that would generate.  That never really occurred
> to me as a serious issue and I suspect for many people it wouldn't be,
> but there might be some.

WAL segment size can go up to 1GB, and this no longer depends on
compile-time options.  So scanning a very large segment is not going to
be free.  I think that the performance concerns of Stephen are legit,
as the WAL partitions currently see only sequential read and write
patterns.

> It's cool that you have a command-line tool that does this as well.
> Over there, it was also discussed that we might want to have both a
> command-line tool and a background worker.  I think, though, that we
> would want to get the output in some kind of compressed binary format,
> rather than text.  e.g.

If you want to tweak it the way you want, please feel free to reuse
it for any patch submitted to upstream.  Reshaping or redirecting the
data is not a big issue once the basics with the WAL reader are in
place.
--
Michael


Re: finding changed blocks using WAL scanning

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Apr 15, 2019 at 11:45 PM Michael Paquier <michael@paquier.xyz> wrote:
> > Any caller of XLogWrite() could switch to a new segment once the
> > current one is done, and I am not sure that we would want some random
> > backend to potentially slow down to do that kind of operation.
> >
> > Or would a separate background worker do this work by itself?  An
> > external tool can do that easily already:
> > https://github.com/michaelpq/pg_plugins/tree/master/pg_wal_blocks
>
> I was thinking that a dedicated background worker would be a good
> option, but Stephen Frost seems concerned (over on the other thread)
> about how much load that would generate.  That never really occurred
> to me as a serious issue and I suspect for many people it wouldn't be,
> but there might be some.

While I do think we should at least be thinking about the load caused
from scanning the WAL to generate a list of blocks that are changed, the
load I was more concerned with in the other thread is the effort
required to actually merge all of those changes together over a large
amount of WAL.  I'm also not saying that we couldn't have either of
those pieces done as a background worker, just that it'd be really nice
to have an external tool (or library) that can be used on an independent
system to do that work.

> It's cool that you have a command-line tool that does this as well.
> Over there, it was also discussed that we might want to have both a
> command-line tool and a background worker.  I think, though, that we
> would want to get the output in some kind of compressed binary format,
> rather than text.  e.g.
>
> 4-byte database OID
> 4-byte tablespace OID
> any number of relation OID/block OID pairings for that
> database/tablespace combination
> 4-byte zero to mark the end of the relation OID/block OID list
> and then repeat all of the above any number of times

I agree that we'd like to get the data in a binary format of some kind.

> That might be too dumb and I suspect we want some headers and a
> checksum, but we should try to somehow exploit the fact that there
> aren't likely to be many distinct databases or many distinct
> tablespaces mentioned -- whereas relation OID and block number will
> probably have a lot more entropy.

I'm not remembering exactly where this idea came from, but I don't
believe it's my own (and I think there's some tool which already does
this..  maybe it's rsync?), but I certainly don't think we want to
repeat the relation OID for every block, and I don't think we really
want to store a block number for every block.  Instead, something like:

4-byte database OID
4-byte tablespace OID
relation OID

starting-ending block numbers
bitmap covering range of blocks
starting-ending block numbers
bitmap covering range of blocks
4-byte zero to mark the end of the relation
...
4-byte database OID
4-byte tablespace OID
relation OID

starting-ending block numbers
bitmap covering range of blocks
4-byte zero to mark the end of the relation
...

Only for relations which actually have changes though, of course.

Haven't implemented it, so it's entirely possible there are reasons why it
wouldn't work, but I do like the bitmap idea.  I definitely think we
need a checksum, as you mentioned.
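
To illustrate the idea, here is a rough sketch of how one block-range
entry for a relation might be emitted; the field widths and the
64k-block range limit are assumptions for the example, not a proposal:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Write one "starting-ending block numbers" + "bitmap covering the
 * range" entry.  The caller ensures the range covers at most 64k
 * blocks (the local bitmap buffer's capacity in this sketch). */
static void
write_block_range_bitmap(FILE *out, uint32_t start_blk, uint32_t end_blk,
                         const uint32_t *changed, size_t nchanged)
{
    uint32_t    nblocks = end_blk - start_blk + 1;
    size_t      nbytes = (nblocks + 7) / 8;
    uint8_t     bitmap[8192];       /* 8192 bytes = 65536 blocks */
    size_t      i;

    memset(bitmap, 0, nbytes);
    for (i = 0; i < nchanged; i++)
    {
        uint32_t    bit = changed[i] - start_blk;

        bitmap[bit / 8] |= (uint8_t) (1 << (bit % 8));
    }

    fwrite(&start_blk, sizeof(start_blk), 1, out);
    fwrite(&end_blk, sizeof(end_blk), 1, out);
    fwrite(bitmap, 1, nbytes, out);
}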

Thanks!

Stephen


Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Thu, Apr 18, 2019 at 5:47 PM Bruce Momjian <bruce@momjian.us> wrote:
> How would the modblock file record all the modified blocks across
> restarts and crashes?  I assume that 1G of WAL would not be available
> for scanning.  I suppose that writing a modblock file to some PGDATA
> location when WAL is removed would work since during a crash the
> modblock file could be updated with the contents of the existing pg_wal
> files.

I think you've got to prevent the WAL from being removed until a
.modblock file has been written.  In more detail, you should (a) scan
all the WAL segments that will be summarized in the .modblock file,
(b) write the file under a temporary name, (c) fsync it, (d) rename it
into place, (e) fsync it again, and (f) then allow those WAL segments
to be removed, if they are otherwise eligible to be removed.

If 1GB of WAL is too much to keep around (which I doubt, except on
systems that are so small and low-activity that they don't need
incremental backups anyway), then you'd have to scan less WAL at once
and write smaller .modblock files.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Fri, Apr 19, 2019 at 8:39 PM Stephen Frost <sfrost@snowman.net> wrote:
> While I do think we should at least be thinking about the load caused
> from scanning the WAL to generate a list of blocks that are changed, the
> load I was more concerned with in the other thread is the effort
> required to actually merge all of those changes together over a large
> amount of WAL.  I'm also not saying that we couldn't have either of
> those pieces done as a background worker, just that it'd be really nice
> to have an external tool (or library) that can be used on an independent
> system to do that work.

Oh.  Well, I already explained my algorithm for doing that upthread,
which I believe would be quite cheap.

1. When you generate the .modblock files, stick all the block
references into a buffer.  qsort().  Dedup.  Write out in sorted
order.

2. When you want to use a bunch of .modblock files, do the same thing
MergeAppend does, or what merge-sort does when it does a merge pass.
Read the first 1MB of each file (or whatever amount).  Repeatedly pull
an item from whichever file has the lowest remaining value, using a
binary heap.  When no buffered data remains for a particular file,
read another chunk from that file.

If each .modblock file covers 1GB of WAL, you could merge the data from
across 1TB of WAL using only 1GB of memory, and that's assuming you
have a 1MB buffer for each .modblock file.  You probably don't need
such a large buffer.  If you use, say, a 128kB buffer, you could merge
the data from across 8TB of WAL using 1GB of memory.  And if you have
8TB of WAL and you can't spare 1GB for the task of computing which
blocks need to be included in your incremental backup, it's time for a
hardware upgrade.
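
To make the merge pass concrete, here is a condensed sketch.  It
assumes each .modblock file has already been reduced to a sorted stream
of packed 64-bit block references, that each cursor has its first value
loaded, and that the caller has heapified the cursor array; all names
are illustrative, not an existing implementation:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct FileCursor
{
    FILE       *fp;
    uint64_t    current;        /* current (packed) block reference */
    bool        exhausted;
} FileCursor;

static void
cursor_advance(FileCursor *c)
{
    if (fread(&c->current, sizeof(c->current), 1, c->fp) != 1)
        c->exhausted = true;
}

/* Sift the root down to restore the min-heap property. */
static void
heap_sift_down(FileCursor **heap, int nheap)
{
    int         i = 0;

    for (;;)
    {
        int         smallest = i;
        int         l = 2 * i + 1;
        int         r = 2 * i + 2;
        FileCursor *tmp;

        if (l < nheap && heap[l]->current < heap[smallest]->current)
            smallest = l;
        if (r < nheap && heap[r]->current < heap[smallest]->current)
            smallest = r;
        if (smallest == i)
            break;
        tmp = heap[i];
        heap[i] = heap[smallest];
        heap[smallest] = tmp;
        i = smallest;
    }
}

/* Emit the merged, deduplicated stream of block references. */
static void
merge_modblock_files(FileCursor **heap, int nheap, FILE *out)
{
    uint64_t    last = UINT64_MAX;  /* sentinel: nothing emitted yet */

    while (nheap > 0)
    {
        FileCursor *top = heap[0];

        if (top->current != last)   /* dedup across files */
        {
            fwrite(&top->current, sizeof(top->current), 1, out);
            last = top->current;
        }
        cursor_advance(top);
        if (top->exhausted)
            heap[0] = heap[--nheap];    /* drop this file from the heap */
        heap_sift_down(heap, nheap);
    }
}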

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Thu, Apr 18, 2019 at 8:38 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Apr 18, 2019 at 03:51:14PM -0400, Robert Haas wrote:
> > I was thinking that a dedicated background worker would be a good
> > option, but Stephen Frost seems concerned (over on the other thread)
> > about how much load that would generate.  That never really occurred
> > to me as a serious issue and I suspect for many people it wouldn't be,
> > but there might be some.
>
> WAL segment size can go up to 1GB, and this does not depend on the
> compilation anymore.  So scanning a very large segment is not going to
> be free.

The segment size doesn't have much to do with it.  If you make
segments bigger, you'll have to scan fewer larger ones; if you make
them smaller, you'll have more smaller ones.  The only thing that
really matters is the amount of I/O and CPU required, and that doesn't
change very much as you vary the segment size.

> I think that the performance concerns of Stephen are legit
> as now we have on the WAL partitions sequential read and write
> patterns.

As to that, what I'm proposing here is no different than what we are
already doing with physical and logical replication, except that it's
probably a bit cheaper.  Physical replication reads all the WAL and
sends it all out over the network.  Logical replication reads all the
WAL, does a bunch of computation, and then sends the results, possibly
filtered, out over the network.  This would read the WAL and then
write a relatively small file to your local disk.

I think the impact will be about the same as having one additional
standby, give or take.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Fri, Apr 19, 2019 at 8:39 PM Stephen Frost <sfrost@snowman.net> wrote:
> > While I do think we should at least be thinking about the load caused
> > from scanning the WAL to generate a list of blocks that are changed, the
> > load I was more concerned with in the other thread is the effort
> > required to actually merge all of those changes together over a large
> > amount of WAL.  I'm also not saying that we couldn't have either of
> > those pieces done as a background worker, just that it'd be really nice
> > to have an external tool (or library) that can be used on an independent
> > system to do that work.
>
> Oh.  Well, I already explained my algorithm for doing that upthread,
> which I believe would be quite cheap.
>
> 1. When you generate the .modblock files, stick all the block
> references into a buffer.  qsort().  Dedup.  Write out in sorted
> order.

Having all of the block references in a sorted order does seem like it
would help, but would also make those potentially quite a bit larger
than necessary (I had some thoughts about making them smaller elsewhere
in this discussion).  That might be worth it though.  I suppose it might
also be possible to line up the bitmaps suggested elsewhere to do
essentially a BitmapOr of them to identify the blocks changed (while
effectively de-duping at the same time).

> 2. When you want to use a bunch of .modblock files, do the same thing
> MergeAppend does, or what merge-sort does when it does a merge pass.
> Read the first 1MB of each file (or whatever amount).  Repeatedly pull
> an item from whichever file has the lowest remaining value, using a
> binary heap.  When no buffered data remains for a particular file,
> read another chunk from that file.

Sure, this is essentially a MergeAppend/MergeSort+GroupAgg on top to get
the distinct set, if I'm following what you're suggesting here.

> If each .modblock file covers 1GB of WAL, you could merge the data from
> across 1TB of WAL using only 1GB of memory, and that's assuming you
> have a 1MB buffer for each .modblock file.  You probably don't need
> such a large buffer.  If you use, say, a 128kB buffer, you could merge
> the data from across 8TB of WAL using 1GB of memory.  And if you have
> 8TB of WAL and you can't spare 1GB for the task of computing which
> blocks need to be included in your incremental backup, it's time for a
> hardware upgrade.

How much additional work is it going to be to sort/dedup for each 1GB of
WAL though, along with the resulting size?  I'm specifically thinking
about some of the very high WAL-generation rate systems that I've seen,
with TBs of WAL per day.  I get that once you've got nicely sorted
inputs that it isn't too bad to generate a distinct set from that, but
it seems like that moves the problem to the sorting side then rather
than eliminating it, and may make other things less efficient,
particularly when there are workloads that have strings of modified
buffers together and we could capture that more efficiently.

Thanks,

Stephen


Re: finding changed blocks using WAL scanning

From
Bruce Momjian
Date:
On Sat, Apr 20, 2019 at 12:09:42AM -0400, Robert Haas wrote:
> On Thu, Apr 18, 2019 at 5:47 PM Bruce Momjian <bruce@momjian.us> wrote:
> > How would the modblock file record all the modified blocks across
> > restarts and crashes?  I assume that 1G of WAL would not be available
> > for scanning.  I suppose that writing a modblock file to some PGDATA
> > location when WAL is removed would work since during a crash the
> > modblock file could be updated with the contents of the existing pg_wal
> > files.
> 
> I think you've got to prevent the WAL from being removed until a
> .modblock file has been written.  In more detail, you should (a) scan
> all the WAL segments that will be summarized in the .modblock file,
> (b) write the file under a temporary name, (c) fsync it, (d) rename it
> into place, (e) fsync it again, and (f) then allow those WAL segments
> to be removed, if they are otherwise eligible to be removed.

Makes sense.  So when you are about to remove WAL, you create the
.modblock files for all complete WAL files and only create a new one
when you are about to remove a WAL file that was not covered by a
previous .modblock file.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: finding changed blocks using WAL scanning

From
Bruce Momjian
Date:
On Sat, Apr 20, 2019 at 12:21:36AM -0400, Robert Haas wrote:
> As to that, what I'm proposing here is no different than what we are
> already doing with physical and logical replication, except that it's
> probably a bit cheaper.  Physical replication reads all the WAL and
> sends it all out over the network.  Logical replication reads all the
> WAL, does a bunch of computation, and then sends the results, possibly
> filtered, out over the network.  This would read the WAL and then
> write a relatively small file to your local disk.
> 
> I think the impact will be about the same as having one additional
> standby, give or take.

Good point.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Sat, Apr 20, 2019 at 9:18 AM Bruce Momjian <bruce@momjian.us> wrote:
> > I think you've got to prevent the WAL from being removed until a
> > .modblock file has been written.  In more detail, you should (a) scan
> > all the WAL segments that will be summarized in the .modblock file,
> > (b) write the file under a temporary name, (c) fsync it, (d) rename it
> > into place, (e) fsync it again, and (f) then allow those WAL segments
> > to be removed, if they are otherwise eligible to be removed.
>
> Makes sense.  So when you are about to remove WAL, you create the
> .modblock files for all complete WAL files and only create a new one
> when you are about to remove a WAL that was not in a previous .modblock
> file.

There will often be a partial WAL record at the end of each file.  So
if you make a .modblock file for WAL files 1-10, you can probably
remove files 1-9, but you probably have to keep WAL file 10 around
until you generate the NEXT .modblock file, because otherwise you
wouldn't be able to read and parse the WAL record that spans the end
of file 10 and the beginning of file 11.

This is a detail that is very very very important to get right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote:
> > Oh.  Well, I already explained my algorithm for doing that upthread,
> > which I believe would be quite cheap.
> >
> > 1. When you generate the .modblock files, stick all the block
> > references into a buffer.  qsort().  Dedup.  Write out in sorted
> > order.
>
> Having all of the block references in a sorted order does seem like it
> would help, but would also make those potentially quite a bit larger
> than necessary (I had some thoughts about making them smaller elsewhere
> in this discussion).  That might be worth it though.  I suppose it might
> also be possible to line up the bitmaps suggested elsewhere to do
> essentially a BitmapOr of them to identify the blocks changed (while
> effectively de-duping at the same time).

I don't see why this would make them bigger than necessary.  If you
sort by relfilenode/fork/blocknumber and dedup, then references to
nearby blocks will be adjacent in the file.  You can then decide what
format will represent that most efficiently on output.  Whether or not
a bitmap is a better idea than a list of block numbers or something else
depends on what percentage of blocks are modified and how clustered
they are.
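
To make step 1 from upthread concrete, a rough standalone sketch - the
struct layout and names are only for illustration, not a proposed
on-disk format - could be:

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct BlockRef
    {
        uint32_t    relfilenode;
        uint32_t    forknum;
        uint32_t    blkno;
    } BlockRef;

    static int
    blockref_cmp(const void *pa, const void *pb)
    {
        const BlockRef *a = pa, *b = pb;

        if (a->relfilenode != b->relfilenode)
            return a->relfilenode < b->relfilenode ? -1 : 1;
        if (a->forknum != b->forknum)
            return a->forknum < b->forknum ? -1 : 1;
        if (a->blkno != b->blkno)
            return a->blkno < b->blkno ? -1 : 1;
        return 0;
    }

    /* sort, then squeeze out duplicates in place; returns distinct count */
    size_t
    sort_and_dedup(BlockRef *refs, size_t n)
    {
        size_t      out = 0;

        if (n == 0)
            return 0;
        qsort(refs, n, sizeof(BlockRef), blockref_cmp);
        for (size_t i = 1; i < n; i++)
        {
            if (blockref_cmp(&refs[out], &refs[i]) != 0)
                refs[++out] = refs[i];
        }
        return out + 1;
    }

The output encoding - plain list, bitmap, run-length, whatever - can
then be picked per run of adjacent references.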

> > 2. When you want to use a bunch of .modblock files, do the same thing
> > MergeAppend does, or what merge-sort does when it does a merge pass.
> > Read the first 1MB of each file (or whatever amount).  Repeatedly pull
> > an item from whichever file has the lowest remaining value, using a
> > binary heap.  When no buffered data remains for a particular file,
> > read another chunk from that file.
>
> Sure, this is essentially a MergeAppend/MergeSort+GroupAgg on top to get
> the distinct set, if I'm following what you're suggesting here.

Yeah, something like that.

> > If each .modblock file covers 1GB of WAL, you could merge the data from
> > across 1TB of WAL using only 1GB of memory, and that's assuming you
> > have a 1MB buffer for each .modblock file.  You probably don't need
> > such a large buffer.  If you use, say, a 128kB buffer, you could merge
> > the data from across 8TB of WAL using 1GB of memory.  And if you have
> > 8TB of WAL and you can't spare 1GB for the task of computing which
> > blocks need to be included in your incremental backup, it's time for a
> > hardware upgrade.
>
> How much additional work is it going to be to sort/dedup for each 1GB of
> WAL though, along with the resulting size?

Well all you have to do is quicksort an array with a million or so
elements.  I don't know off-hand how many CPU cycles that takes but I
doubt it's a whole lot.  And for the amount of CPU time and memory
that it saves you when you actually go to use the files, I think it's
got to be worth it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Bruce Momjian
Date:
On Sat, Apr 20, 2019 at 04:17:08PM -0400, Robert Haas wrote:
> On Sat, Apr 20, 2019 at 9:18 AM Bruce Momjian <bruce@momjian.us> wrote:
> > > I think you've got to prevent the WAL from being removed until a
> > > .modblock file has been written.  In more detail, you should (a) scan
> > > all the WAL segments that will be summarized in the .modblock file,
> > > (b) write the file under a temporary name, (c) fsync it, (d) rename it
> > > into place, (e) fsync it again, and (f) then allow those WAL segments
> > > to be removed, if they are otherwise eligible to be removed.
> >
> > Makes sense.  So when you are about to remove WAL, you create the
> > .modblock files for all complete WAL files and only create a new one
> > when you are about to remove a WAL that was not in a previous .modblock
> > file.
> 
> There will often be a partial WAL record at the end of each file.  So
> if you make a .modblock file for WAL files 1-10, you can probably
> remove files 1-9, but you probably have to keep WAL file 10 around
> until you generate the NEXT .modblock file, because otherwise you
> wouldn't be able to read and parse the WAL record that spans the end
> of file 10 and the beginning of file 11.
> 
> This is a detail that is very very very important to get right.

Good point.  You mentioned:

    It seems better to me to give the files names like
    ${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g.
    00000001.0000000168000058.00000001687DBBB8.modblock, so that you can
    see exactly which *records* are covered by that segment.

but it seems like it should be ${TLI}.${ENDLSN}... (END first) because
you would not want to delete the modblock file until you are about to
delete the final WAL, not the first WAL, but as you mentioned, it might
be ENDLSN-1.
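
Either way the name itself is trivial to build; a sketch with a made-up
helper, matching the hex layout of your example:

    #include <stdint.h>
    #include <stdio.h>

    /* made-up helper: TLI plus an LSN range, hex-encoded as in the example */
    void
    modblock_name(char *buf, size_t buflen, uint32_t tli,
                  uint64_t startlsn, uint64_t endlsn)
    {
        snprintf(buf, buflen, "%08X.%08X%08X.%08X%08X.modblock",
                 tli,
                 (uint32_t) (startlsn >> 32), (uint32_t) startlsn,
                 (uint32_t) (endlsn >> 32), (uint32_t) endlsn);
    }

With END first, the two LSN groups would just be swapped.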

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Sat, Apr 20, 2019 at 5:54 PM Bruce Momjian <bruce@momjian.us> wrote:
> Good point.  You mentioned:
>
>         It seems better to me to give the files names like
>         ${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g.
>         00000001.0000000168000058.00000001687DBBB8.modblock, so that you can
>         see exactly which *records* are covered by that segment.
>
> but it seems like it should be ${TLI}.${ENDLSN}... (END first) because
> you would not want to delete the modblock file until you are about to
> delete the final WAL, not the first WAL, but as you mentioned, it might
> be ENDLSN-1.

Hmm.  It seems to me that it is almost universally the convention to
put the starting point prior to the ending point.  If you are taking a
biology class, the teacher will not tell you to study chapters six
through three.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Michael Paquier
Date:
On Sat, Apr 20, 2019 at 12:21:36AM -0400, Robert Haas wrote:
> The segment size doesn't have much to do with it.  If you make
> segments bigger, you'll have to scan fewer larger ones; if you make
> them smaller, you'll have more smaller ones.  The only thing that
> really matters is the amount of I/O and CPU required, and that doesn't
> change very much as you vary the segment size.

If you create the extra file when a segment is finished and we switch
to a new one, then the extra work would happen for a random backend,
and it is going to be more costly to scan a 1GB segment than a 16MB
segment as a one-time operation, and fewer backends would see a
slowdown for the same amount of WAL generated.  From what I can see,
you are not planning to do such operations when a segment finishes
being written, which would be much better.

> As to that, what I'm proposing here is no different than what we are
> already doing with physical and logical replication, except that it's
> probably a bit cheaper.  Physical replication reads all the WAL and
> sends it all out over the network.  Logical replication reads all the
> WAL, does a bunch of computation, and then sends the results, possibly
> filtered, out over the network.  This would read the WAL and then
> write a relatively small file to your local disk.
>
> I think the impact will be about the same as having one additional
> standby, give or take.

If you put the load on an extra process, yeah I don't think that it
would be noticeable.
--
Michael


Re: finding changed blocks using WAL scanning

From
Bruce Momjian
Date:
On Sun, Apr 21, 2019 at 06:24:50PM -0400, Robert Haas wrote:
> On Sat, Apr 20, 2019 at 5:54 PM Bruce Momjian <bruce@momjian.us> wrote:
> > Good point.  You mentioned:
> >
> >         It seems better to me to give the files names like
> >         ${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g.
> >         00000001.0000000168000058.00000001687DBBB8.modblock, so that you can
> >         see exactly which *records* are covered by that segment.
> >
> > but it seems like it should be ${TLI}.${ENDLSN}... (END first) because
> > you would not want to delete the modblock file until you are about to
> > delete the final WAL, not the first WAL, but as you mentioned, it might
> > be ENDLSN-1.
> 
> Hmm.  It seems to me that it is almost universally the convention to
> put the starting point prior to the ending point.  If you are taking a
> biology class, the teacher will not tell you to study chapters six
> through three.

My point is that most WAL archive tools will order and remove files
based on their lexical ordering, so if you put the start first, the file
will normally be removed when it should be kept, e.g., if you have WAL
files like:

    000000010000000000000001
    000000010000000000000002
    000000010000000000000003
    000000010000000000000004
    000000010000000000000005

putting the start first and archiving some wal would lead to:

    000000010000000000000001-000000010000000000000004.modblock
    000000010000000000000003
    000000010000000000000004
    000000010000000000000005

We removed 1 and 2, but kept the modblock file, which looks out of
order. Having the end at the start would have:

    000000010000000000000003
    000000010000000000000004
    000000010000000000000004-000000010000000000000001.modblock
    000000010000000000000005

My point is that you would normally only remove the modblock file when 4
is removed, because this modblock file is useful for incremental backups
from base backups that happened between 1 and 4.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: finding changed blocks using WAL scanning

From
Bruce Momjian
Date:
On Mon, Apr 22, 2019 at 11:20:43AM +0900, Michael Paquier wrote:
> On Sat, Apr 20, 2019 at 12:21:36AM -0400, Robert Haas wrote:
> > The segment size doesn't have much to do with it.  If you make
> > segments bigger, you'll have to scan fewer larger ones; if you make
> > them smaller, you'll have more smaller ones.  The only thing that
> > really matters is the amount of I/O and CPU required, and that doesn't
> > change very much as you vary the segment size.
> 
> If you create the extra file when a segment is finished and we switch
> to a new one, then the extra work would happen for a random backend,
> and it is going to be more costly to scan a 1GB segment than a 16MB
> segment as a one-time operation, and less backends would see a
> slowdown at equal WAL data generated.  From what I can see, you are
> not planning to do such operations when a segment finishes being
> written, which would be much better.

I think your point is that the 16MB is more likely to be in memory,
while the 1GB is less likely.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Sun, Apr 21, 2019 at 10:21 PM Michael Paquier <michael@paquier.xyz> wrote:
> If you create the extra file when a segment is finished and we switch
> to a new one, then the extra work would happen for a random backend,
> and it is going to be more costly to scan a 1GB segment than a 16MB
> segment as a one-time operation, and less backends would see a
> slowdown at equal WAL data generated.  From what I can see, you are
> not planning to do such operations when a segment finishes being
> written, which would be much better.

Well, my plan was to do it all from a background worker, so I do not
think a random backend would ever have to do any extra work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Mon, Apr 22, 2019 at 11:48 AM Bruce Momjian <bruce@momjian.us> wrote:
> My point is that most WAL archive tools will order and remove files
> based on their lexical ordering, so if you put the start first, the file
> will normally be removed when it should be kept, e.g., if you have WAL
> files like:
>
>         000000010000000000000001
>         000000010000000000000002
>         000000010000000000000003
>         000000010000000000000004
>         000000010000000000000005
>
> putting the start first and archiving some wal would lead to:
>
>         000000010000000000000001-000000010000000000000004.modblock
>         000000010000000000000003
>         000000010000000000000004
>         000000010000000000000005
>
> We removed 1 and 2, but kept the modblock file, which looks out of
> order. Having the end at the start would have:
>
>         000000010000000000000003
>         000000010000000000000004
>         000000010000000000000004-000000010000000000000001.modblock
>         000000010000000000000005
>
> My point is that you would normally only remove the modblock file when 4
> is removed because this modblock files is useful for incremental backups
> from base backups that happened between 1 and 4.

That's an interesting point.  On the other hand, I think it would be
typical to want the master to retain .modblock files for much longer
than it retains WAL segments, and in my design, the WAL archive
wouldn't see those files at all; they'd be stored on the master.  I
was actually thinking that they should possibly be stored in a
separate directory to avoid confusion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Bruce Momjian
Date:
On Mon, Apr 22, 2019 at 12:15:32PM -0400, Robert Haas wrote:
> On Mon, Apr 22, 2019 at 11:48 AM Bruce Momjian <bruce@momjian.us> wrote:
> > My point is that you would normally only remove the modblock file when 4
> > is removed because this modblock files is useful for incremental backups
> > from base backups that happened between 1 and 4.
> 
> That's an interesting point.  On the other hand, I think it would be
> typical to want the master to retain .modblock files for much longer
> than it retains WAL segments, and in my design, the WAL archive
> wouldn't see those files at all; they'd be stored on the master.  I
> was actually thinking that they should possibly be stored in a
> separate directory to avoid confusion.

I assumed the modblock files would be stored in the WAL archive so some
external tools could generate incremental backups using just the WAL
files.  I assumed they would also be sent to standby servers so
incremental backups could be done on standby servers too.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Mon, Apr 22, 2019 at 12:35 PM Bruce Momjian <bruce@momjian.us> wrote:
> I assumed the modblock files would be stored in the WAL archive so some
> external tools could generate incremental backups using just the WAL
> files.  I assumed they would also be sent to standby servers so
> incremental backups could be done on standby servers too.

Yeah, that's another possible approach.  I am not sure what is best.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Bruce Momjian
Date:
On Mon, Apr 22, 2019 at 01:11:22PM -0400, Robert Haas wrote:
> On Mon, Apr 22, 2019 at 12:35 PM Bruce Momjian <bruce@momjian.us> wrote:
> > I assumed the modblock files would be stored in the WAL archive so some
> > external tools could generate incremental backups using just the WAL
> > files.  I assumed they would also be sent to standby servers so
> > incremental backups could be done on standby servers too.
> 
> Yeah, that's another possible approach.  I am not sure what is best.

I am thinking you need to allow any of these, and putting the modblock
files in pg_wal and having them streamed and archived gives that
flexibility.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: finding changed blocks using WAL scanning

From
Tomas Vondra
Date:
On Thu, Apr 18, 2019 at 04:25:24PM -0400, Robert Haas wrote:
>On Thu, Apr 18, 2019 at 3:51 PM Bruce Momjian <bruce@momjian.us> wrote:
>> How would you choose the STARTLSN/ENDLSN?  If you could do it per
>> checkpoint, rather than per-WAL, I think that would be great.
>
>I thought of that too.  It seems appealing, because you probably only
>really care whether a particular block was modified between one
>checkpoint and the next, not exactly when during that interval it was
>modified. 

That's probably true for incremental backups, but there are other use
cases that could leverage this information.

Some time ago there was a discussion about prefetching blocks during
recovery on a standby, and that's a great example of a use case that
would benefit from this - look at which blocks were modified in the next
chunk of WAL, and prefetch them. But that requires fairly detailed
information about which blocks were modified in the next few megabytes
of WAL.

So just doing it once per checkpoint (or even anything above a single
WAL segment) and removing all the detailed LSN information makes it
useless for this use case.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: finding changed blocks using WAL scanning

From
Tomas Vondra
Date:
On Mon, Apr 22, 2019 at 01:15:49PM -0400, Bruce Momjian wrote:
>On Mon, Apr 22, 2019 at 01:11:22PM -0400, Robert Haas wrote:
>> On Mon, Apr 22, 2019 at 12:35 PM Bruce Momjian <bruce@momjian.us> wrote:
>> > I assumed the modblock files would be stored in the WAL archive so some
>> > external tools could generate incremental backups using just the WAL
>> > files.  I assumed they would also be sent to standby servers so
>> > incremental backups could be done on standby servers too.
>>
>> Yeah, that's another possible approach.  I am not sure what is best.
>
>I am thinking you need to allow any of these, and putting the WAL files
>in pg_wal and having them streamed and archived gives that flexibility.
>

I agree - this would be quite useful for the prefetching use case I've
already mentioned in my previous message.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: finding changed blocks using WAL scanning

From
Tomas Vondra
Date:
On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
>On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote:
>> > Oh.  Well, I already explained my algorithm for doing that upthread,
>> > which I believe would be quite cheap.
>> >
>> > 1. When you generate the .modblock files, stick all the block
>> > references into a buffer.  qsort().  Dedup.  Write out in sorted
>> > order.
>>
>> Having all of the block references in a sorted order does seem like it
>> would help, but would also make those potentially quite a bit larger
>> than necessary (I had some thoughts about making them smaller elsewhere
>> in this discussion).  That might be worth it though.  I suppose it might
>> also be possible to line up the bitmaps suggested elsewhere to do
>> essentially a BitmapOr of them to identify the blocks changed (while
>> effectively de-duping at the same time).
>
>I don't see why this would make them bigger than necessary.  If you
>sort by relfilenode/fork/blocknumber and dedup, then references to
>nearby blocks will be adjacent in the file.  You can then decide what
>format will represent that most efficiently on output.  Whether or not
>a bitmap is better idea than a list of block numbers or something else
>depends on what percentage of blocks are modified and how clustered
>they are.
>

Not sure I understand correctly - do you suggest deduplicating and
sorting the data before writing them into the .modblock files? Because
the sorting would make this information mostly useless for the recovery
prefetching use case I mentioned elsewhere. For that to work we need
information about both the LSN and the block, in LSN order.

So if we want to allow that use case to leverage this infrastructure, we
need to write the .modblock files kinda "raw" and do this processing in
some later step.

Now, maybe the incremental backup use case is so much more important
that the right thing to do is to ignore this other use case, and I'm OK
with that - as long as it's a conscious choice.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: finding changed blocks using WAL scanning

From
Bruce Momjian
Date:
On Tue, Apr 23, 2019 at 01:21:27AM +0200, Tomas Vondra wrote:
> On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
> > On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote:
> > > > Oh.  Well, I already explained my algorithm for doing that upthread,
> > > > which I believe would be quite cheap.
> > > >
> > > > 1. When you generate the .modblock files, stick all the block
> > > > references into a buffer.  qsort().  Dedup.  Write out in sorted
> > > > order.
> > > 
> > > Having all of the block references in a sorted order does seem like it
> > > would help, but would also make those potentially quite a bit larger
> > > than necessary (I had some thoughts about making them smaller elsewhere
> > > in this discussion).  That might be worth it though.  I suppose it might
> > > also be possible to line up the bitmaps suggested elsewhere to do
> > > essentially a BitmapOr of them to identify the blocks changed (while
> > > effectively de-duping at the same time).
> > 
> > I don't see why this would make them bigger than necessary.  If you
> > sort by relfilenode/fork/blocknumber and dedup, then references to
> > nearby blocks will be adjacent in the file.  You can then decide what
> > format will represent that most efficiently on output.  Whether or not
> > a bitmap is better idea than a list of block numbers or something else
> > depends on what percentage of blocks are modified and how clustered
> > they are.
> > 
> 
> Not sure I understand correctly - do you suggest to deduplicate and sort
> the data before writing them into the .modblock files? Because that the
> the sorting would make this information mostly useless for the recovery
> prefetching use case I mentioned elsewhere. For that to work we need
> information about both the LSN and block, in the LSN order.
> 
> So if we want to allow that use case to leverage this infrastructure, we
> need to write the .modfiles kinda "raw" and do this processing in some
> later step.
> 
> Now, maybe the incremental backup use case is so much more important the
> right thing to do is ignore this other use case, and I'm OK with that -
> as long as it's a conscious choice.

I think the concern is that the more granular the modblock files are
(with less de-duping), the larger they will be.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: finding changed blocks using WAL scanning

From
Tomas Vondra
Date:
On Mon, Apr 22, 2019 at 07:44:45PM -0400, Bruce Momjian wrote:
>On Tue, Apr 23, 2019 at 01:21:27AM +0200, Tomas Vondra wrote:
>> On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
>> > On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote:
>> > > > Oh.  Well, I already explained my algorithm for doing that upthread,
>> > > > which I believe would be quite cheap.
>> > > >
>> > > > 1. When you generate the .modblock files, stick all the block
>> > > > references into a buffer.  qsort().  Dedup.  Write out in sorted
>> > > > order.
>> > >
>> > > Having all of the block references in a sorted order does seem like it
>> > > would help, but would also make those potentially quite a bit larger
>> > > than necessary (I had some thoughts about making them smaller elsewhere
>> > > in this discussion).  That might be worth it though.  I suppose it might
>> > > also be possible to line up the bitmaps suggested elsewhere to do
>> > > essentially a BitmapOr of them to identify the blocks changed (while
>> > > effectively de-duping at the same time).
>> >
>> > I don't see why this would make them bigger than necessary.  If you
>> > sort by relfilenode/fork/blocknumber and dedup, then references to
>> > nearby blocks will be adjacent in the file.  You can then decide what
>> > format will represent that most efficiently on output.  Whether or not
>> > a bitmap is better idea than a list of block numbers or something else
>> > depends on what percentage of blocks are modified and how clustered
>> > they are.
>> >
>>
>> Not sure I understand correctly - do you suggest to deduplicate and sort
>> the data before writing them into the .modblock files? Because that the
>> the sorting would make this information mostly useless for the recovery
>> prefetching use case I mentioned elsewhere. For that to work we need
>> information about both the LSN and block, in the LSN order.
>>
>> So if we want to allow that use case to leverage this infrastructure, we
>> need to write the .modfiles kinda "raw" and do this processing in some
>> later step.
>>
>> Now, maybe the incremental backup use case is so much more important the
>> right thing to do is ignore this other use case, and I'm OK with that -
>> as long as it's a conscious choice.
>
>I think the concern is that the more graunular the modblock files are
>(with less de-duping), the larger they will be.
>

Well, I understand that concern - all I'm saying is that it makes this
useless for some use cases (which may or may not be important enough).

However, it seems to me those files are guaranteed to be much smaller
than the WAL segments, so I don't see how size alone could be an issue
as long as we do the merging and deduplication when recycling the
segments. At that point the standby can't request the WAL from the
primary anyway, so it won't need the raw .modblock files either.

And we probably only care about the size of the data we need to keep for
a long time. And that we can deduplicate/reorder any way we want.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: finding changed blocks using WAL scanning

From
Bruce Momjian
Date:
On Tue, Apr 23, 2019 at 02:13:29AM +0200, Tomas Vondra wrote:
> Well, I understand that concern - all I'm saying is that makes this
> useless for some use cases (that may or may not be important enough).
> 
> However, it seems to me those files are guaranteed to be much smaller
> than the WAL segments, so I don't see how size alone could be an issue
> as long as we do the merging and deduplication when recycling the
> segments. At that point the standby can't request the WAL from the
> primary anyway, so it won't need the raw .mdblock files either.
> 
> And we probably only care about the size of the data we need to keep for
> a long time. And that we can deduplicate/reorder any way we want.

Well, the interesting question is whether the server will generate a
single modblock file for all WAL in pg_wal only right before we are
ready to expire some WAL, or whether modblock files will be generated
offline, perhaps independent of the server, and perhaps by aggregating
smaller modblock files.

To throw out an idea, what if we had an executable that could generate a
modblock file by scanning a set of WAL files?  How far would that take
us toward meeting incremental backup needs?  I can imagine db/relfilenode
OID volatility could be a problem, but it might be fixable.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: finding changed blocks using WAL scanning

From
Bruce Momjian
Date:
On Mon, Apr 22, 2019 at 08:52:11PM -0400, Bruce Momjian wrote:
> Well, the interesting question is whether the server will generate a
> single modblock file for all WAL in pg_wal only right before we are
> ready to expire some WAL, or whether modblock files will be generated
> offline, perhaps independent of the server, and perhaps by aggregating
> smaller modblock files.
> 
> To throw out an idea, what if we had an executable that could generate a
> modblock file by scanning a set of WAL files?  How far would that take
> us to meeing incremental backup needs?  I can imagine db/relfilenode oid
> volatility could be a problem, but might be fixable.

Well, this actually brings up a bunch of questions:

*  How often do we create modblock files?  Per segment, per checkpoint,
   at WAL deletion time (1GB?)

*  What is the modblock file format?  dboid, relfilenode, blocknum?
   Use compression?  Sorted?

*  How do we create incremental backups?

*  What is the incremental backup file format?

*  How do we apply incremental backups to base backups?

And there are some secondary questions:

*  Can modblock files be merged?

*  Can incremental backups be merged?

*  Can modblock files be used for restore prefetching?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Mon, Apr 22, 2019 at 7:04 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Some time ago there was a discussion about prefetching blocks during
> recovery on a standby, and that's a great example of a use case that
> benefit from this - look which blocks where modified in the next chunk
> of WAL, prefetch them. But that requires fairly detailed information
> about which blocks were modified in the next few megabytes of WAL.
>
> So just doing it once per checkpoint (or even anything above a single
> WAL segment) and removing all the detailed LSN location makes it useless
> for this use case.

For this particular use case, wouldn't you want to read the WAL itself
and use that to issue prefetch requests?  Because if you use the
.modblock files, the data file blocks will end up in memory but the
WAL blocks won't, and you'll still be waiting for I/O.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
> >On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote:
> >>> Oh.  Well, I already explained my algorithm for doing that upthread,
> >>> which I believe would be quite cheap.
> >>>
> >>> 1. When you generate the .modblock files, stick all the block
> >>> references into a buffer.  qsort().  Dedup.  Write out in sorted
> >>> order.
> >>
> >>Having all of the block references in a sorted order does seem like it
> >>would help, but would also make those potentially quite a bit larger
> >>than necessary (I had some thoughts about making them smaller elsewhere
> >>in this discussion).  That might be worth it though.  I suppose it might
> >>also be possible to line up the bitmaps suggested elsewhere to do
> >>essentially a BitmapOr of them to identify the blocks changed (while
> >>effectively de-duping at the same time).
> >
> >I don't see why this would make them bigger than necessary.  If you
> >sort by relfilenode/fork/blocknumber and dedup, then references to
> >nearby blocks will be adjacent in the file.  You can then decide what
> >format will represent that most efficiently on output.  Whether or not
> >a bitmap is better idea than a list of block numbers or something else
> >depends on what percentage of blocks are modified and how clustered
> >they are.
>
> Not sure I understand correctly - do you suggest to deduplicate and sort
> the data before writing them into the .modblock files? Because that the
> the sorting would make this information mostly useless for the recovery
> prefetching use case I mentioned elsewhere. For that to work we need
> information about both the LSN and block, in the LSN order.

I'm not sure I follow- why does the prefetching need to get the blocks
in LSN order..?  Once the blocks that we know are going to change in the
next segment have been identified, we could prefetch them all and have
them ready for when replay gets to them.  I'm not sure that we
specifically need to have them pre-fetched in the same order that the
replay happens and it might even be better to fetch them in an order
that's as sequential as possible to get them in as quickly as possible.

> So if we want to allow that use case to leverage this infrastructure, we
> need to write the .modfiles kinda "raw" and do this processing in some
> later step.

If we really need the LSN info for the blocks, then we could still
de-dup, picking the 'first modified in this segment at LSN X', or keep
both first and last, or I suppose every LSN if we really want, and then
have that information included with the other information about the
block.  Downstream clients could then sort based on the LSN info if they
want to have a list of blocks in sorted-by-LSN-order.
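
As a sketch of what I mean (the field names here are invented purely for
illustration), the dedup step could just carry an LSN range along with
each block reference:

    #include <stdint.h>

    /*
     * One de-duplicated block reference plus the range of WAL positions
     * that touched it; a consumer that wants LSN ordering can re-sort
     * by first_lsn afterwards.
     */
    typedef struct BlockRefLsn
    {
        uint32_t    relfilenode;
        uint32_t    forknum;
        uint32_t    blkno;
        uint64_t    first_lsn;      /* earliest record touching the block */
        uint64_t    last_lsn;       /* latest record touching the block */
    } BlockRefLsn;

    /* fold a duplicate reference 'src' into the entry we are keeping */
    void
    merge_lsn_range(BlockRefLsn *dst, const BlockRefLsn *src)
    {
        if (src->first_lsn < dst->first_lsn)
            dst->first_lsn = src->first_lsn;
        if (src->last_lsn > dst->last_lsn)
            dst->last_lsn = src->last_lsn;
    }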

> Now, maybe the incremental backup use case is so much more important the
> right thing to do is ignore this other use case, and I'm OK with that -
> as long as it's a conscious choice.

I'd certainly like to have a way to prefetch, but I'm not entirely sure
that it makes sense to combine it with this, so while I sketched out
some ideas about how to do that above, I don't want it to come across as
being a strong endorsement of the overall idea.

For pre-fetching purposes, for an async streaming replica, it seems like
the wal sender process could potentially just scan the WAL and have a
list of blocks ready to pass to the replica which are "this is what's
coming soon" or similar, rather than working with the modfiles at all.
Not sure if we'd always send that or if we wait for the replica to ask
for it.  Though for doing WAL replay from the archive, being able to ask
for the modfile first to do prefetching before replaying the WAL itself
could certainly be beneficial, so maybe it does make sense to have that
information there too..  still not sure we really need it in LSN order
or that we need to prefetch in LSN order though.

Thanks!

Stephen


Re: finding changed blocks using WAL scanning

From
Tomas Vondra
Date:
On Tue, Apr 23, 2019 at 10:22:46AM -0400, Stephen Frost wrote:
>Greetings,
>
>* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
>> >On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote:
>> >>> Oh.  Well, I already explained my algorithm for doing that upthread,
>> >>> which I believe would be quite cheap.
>> >>>
>> >>> 1. When you generate the .modblock files, stick all the block
>> >>> references into a buffer.  qsort().  Dedup.  Write out in sorted
>> >>> order.
>> >>
>> >>Having all of the block references in a sorted order does seem like it
>> >>would help, but would also make those potentially quite a bit larger
>> >>than necessary (I had some thoughts about making them smaller elsewhere
>> >>in this discussion).  That might be worth it though.  I suppose it might
>> >>also be possible to line up the bitmaps suggested elsewhere to do
>> >>essentially a BitmapOr of them to identify the blocks changed (while
>> >>effectively de-duping at the same time).
>> >
>> >I don't see why this would make them bigger than necessary.  If you
>> >sort by relfilenode/fork/blocknumber and dedup, then references to
>> >nearby blocks will be adjacent in the file.  You can then decide what
>> >format will represent that most efficiently on output.  Whether or not
>> >a bitmap is better idea than a list of block numbers or something else
>> >depends on what percentage of blocks are modified and how clustered
>> >they are.
>>
>> Not sure I understand correctly - do you suggest to deduplicate and sort
>> the data before writing them into the .modblock files? Because that the
>> the sorting would make this information mostly useless for the recovery
>> prefetching use case I mentioned elsewhere. For that to work we need
>> information about both the LSN and block, in the LSN order.
>
>I'm not sure I follow- why does the prefetching need to get the blocks
>in LSN order..?  Once the blocks that we know are going to change in the
>next segment have been identified, we could prefetch them all and have
>them ready for when replay gets to them.  I'm not sure that we
>specifically need to have them pre-fetched in the same order that the
>replay happens and it might even be better to fetch them in an order
>that's as sequential as possible to get them in as quickly as possible.
>

That means we'd have to prefetch all blocks for the whole WAL segment,
which is pretty useless, IMO. A single INSERT (especially for indexes) is
often just ~100B, so a single 16MB segment can fit ~160k of them. Surely
we don't want to prefetch all of that at once? And it's even worse for
larger WAL segments, which are likely to get more common now that it's an
initdb option.

I'm pretty sure the prefetching needs to be more like "prefetch the next
1024 blocks we'll need" or "prefetch blocks from the next X megabytes of
WAL". That doesn't mean we can't do some additional optimization (like
reordering them a bit), but there's a point where it gets detrimental
because the kernel will just evict some of the prefetched blocks before we
actually access them. And the more blocks you prefetch, the more likely
that is.
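
As a very rough sketch of the bounded window I have in mind, using plain
posix_fadvise() as the prefetch primitive (the real prefetch mechanism,
the smgr/fd handling and the constants are all hand-waved here):

    #define _POSIX_C_SOURCE 200112L

    #include <fcntl.h>
    #include <stdint.h>

    #define BLCKSZ           8192
    #define PREFETCH_WINDOW  1024       /* blocks "in flight" at most */

    static int prefetch_in_flight = 0;

    /* hint the kernel about a block replay will need soon */
    void
    prefetch_block(int fd, uint32_t blkno)
    {
        if (prefetch_in_flight >= PREFETCH_WINDOW)
            return;     /* window full; a real version would defer, not drop */
        (void) posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                             POSIX_FADV_WILLNEED);
        prefetch_in_flight++;
    }

    /* called as replay actually reads a block, freeing a slot */
    void
    prefetch_block_consumed(void)
    {
        if (prefetch_in_flight > 0)
            prefetch_in_flight--;
    }

The point is just that the window, not the WAL segment size, bounds how
far ahead we go.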


>> So if we want to allow that use case to leverage this infrastructure, we
>> need to write the .modfiles kinda "raw" and do this processing in some
>> later step.
>
>If we really need the LSN info for the blocks, then we could still
>de-dup, picking the 'first modified in this segment at LSN X', or keep
>both first and last, or I suppose every LSN if we really want, and then
>have that information included with the other information about the
>block.  Downstream clients could then sort based on the LSN info if they
>want to have a list of blocks in sorted-by-LSN-order.
>

Possibly. I don't think keeping just the first block occurrence is enough,
particularly for large WAL segment sizes, but I agree we can reduce the
stuff a bit (say, ignore references that are less than 1MB apart or so).
We just can't remove all the LSN information entirely.

>> Now, maybe the incremental backup use case is so much more important the
>> right thing to do is ignore this other use case, and I'm OK with that -
>> as long as it's a conscious choice.
>
>I'd certainly like to have a way to prefetch, but I'm not entirely sure
>that it makes sense to combine it with this, so while I sketched out
>some ideas about how to do that above, I don't want it to come across as
>being a strong endorsement of the overall idea.
>
>For pre-fetching purposes, for an async streaming replica, it seems like
>the wal sender process could potentially just scan the WAL and have a
>list of blocks ready to pass to the replica which are "this is what's
>coming soon" or similar, rather than working with the modfiles at all.
>Not sure if we'd always send that or if we wait for the replica to ask
>for it.  Though for doing WAL replay from the archive, being able to ask
>for the modfile first to do prefetching before replaying the WAL itself
>could certainly be beneficial, so maybe it does make sense to have that
>information there too..  still not sure we really need it in LSN order
>or that we need to prefetch in LSN order though.
>

Well, exactly how the prefetching should extract the block list is still
an open question. But this seems to deal with pretty much the same stuff,
so it might make sense to support the prefetching use case too. Or maybe
not, I'm not sure - but IMHO we should try.

I don't think it makes sense to do prefetching when the standby is fully
caught up with the primary. It gets more important when it falls behind by
a significant amount of WAL, at which point it's likely to fetch already
closed WAL segments - either from the primary or the archive. So either it
can get the block list from there, or it may extract the info on its own
(using some of the infrastructure used to build the modblock files).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: finding changed blocks using WAL scanning

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On Tue, Apr 23, 2019 at 10:22:46AM -0400, Stephen Frost wrote:
> >* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> >>On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
> >>>On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote:
> >>>>> Oh.  Well, I already explained my algorithm for doing that upthread,
> >>>>> which I believe would be quite cheap.
> >>>>>
> >>>>> 1. When you generate the .modblock files, stick all the block
> >>>>> references into a buffer.  qsort().  Dedup.  Write out in sorted
> >>>>> order.
> >>>>
> >>>>Having all of the block references in a sorted order does seem like it
> >>>>would help, but would also make those potentially quite a bit larger
> >>>>than necessary (I had some thoughts about making them smaller elsewhere
> >>>>in this discussion).  That might be worth it though.  I suppose it might
> >>>>also be possible to line up the bitmaps suggested elsewhere to do
> >>>>essentially a BitmapOr of them to identify the blocks changed (while
> >>>>effectively de-duping at the same time).
> >>>
> >>>I don't see why this would make them bigger than necessary.  If you
> >>>sort by relfilenode/fork/blocknumber and dedup, then references to
> >>>nearby blocks will be adjacent in the file.  You can then decide what
> >>>format will represent that most efficiently on output.  Whether or not
> >>>a bitmap is better idea than a list of block numbers or something else
> >>>depends on what percentage of blocks are modified and how clustered
> >>>they are.
> >>
> >>Not sure I understand correctly - do you suggest to deduplicate and sort
> >>the data before writing them into the .modblock files? Because that the
> >>the sorting would make this information mostly useless for the recovery
> >>prefetching use case I mentioned elsewhere. For that to work we need
> >>information about both the LSN and block, in the LSN order.
> >
> >I'm not sure I follow- why does the prefetching need to get the blocks
> >in LSN order..?  Once the blocks that we know are going to change in the
> >next segment have been identified, we could prefetch them all and have
> >them ready for when replay gets to them.  I'm not sure that we
> >specifically need to have them pre-fetched in the same order that the
> >replay happens and it might even be better to fetch them in an order
> >that's as sequential as possible to get them in as quickly as possible.
>
> That means we'd have to prefetch all blocks for the whole WAL segment,
> which is pretty useless, IMO. A single INSERT (especially for indexes) is
> often just ~100B, so a single 16MB segment can fit ~160k of them. Surely
> we don't want to prefetch all of that at once? And it's even worse for
> larger WAL segments, which are likely to get more common now that it's an
> initdb option.

Ah, yeah, I had been thinking about FPIs and blocks and figuring that
16MB wasn't that bad but you're certainly right that we could end up
touching a lot more blocks in a given WAL segment.

> I'm pretty sure the prefetching needs to be more like "prefetch the next
> 1024 blocks we'll need" or "prefetch blocks from the next X megabytes of
> WAL". That doesn't mean we can't do some additional optimization (like
> reordering them a bit), but there's a point where it gets detrimental
> because the kernel will just evict some of the prefetched blocks before we
> actually access them. And the larger amount of blocks you prefetch the
> more likely that is.

Since we're prefetching, maybe we shouldn't be just pulling the blocks
into the filesystem cache but instead into something we have more
control over..?

> >>So if we want to allow that use case to leverage this infrastructure, we
> >>need to write the .modfiles kinda "raw" and do this processing in some
> >>later step.
> >
> >If we really need the LSN info for the blocks, then we could still
> >de-dup, picking the 'first modified in this segment at LSN X', or keep
> >both first and last, or I suppose every LSN if we really want, and then
> >have that information included with the other information about the
> >block.  Downstream clients could then sort based on the LSN info if they
> >want to have a list of blocks in sorted-by-LSN-order.
>
> Possibly. I don't think keeping just the first block occurence is enough,
> particularly for large WAL segmnent sizes, but I agree we can reduce the
> stuff a bit (say, ignore references that are less than 1MB apart or so).
> We just can't remove all the LSN information entirely.

Yeah, it depends on how much memory we want to use for prefetching and
how large the WAL segments are.

> >>Now, maybe the incremental backup use case is so much more important the
> >>right thing to do is ignore this other use case, and I'm OK with that -
> >>as long as it's a conscious choice.
> >
> >I'd certainly like to have a way to prefetch, but I'm not entirely sure
> >that it makes sense to combine it with this, so while I sketched out
> >some ideas about how to do that above, I don't want it to come across as
> >being a strong endorsement of the overall idea.
> >
> >For pre-fetching purposes, for an async streaming replica, it seems like
> >the wal sender process could potentially just scan the WAL and have a
> >list of blocks ready to pass to the replica which are "this is what's
> >coming soon" or similar, rather than working with the modfiles at all.
> >Not sure if we'd always send that or if we wait for the replica to ask
> >for it.  Though for doing WAL replay from the archive, being able to ask
> >for the modfile first to do prefetching before replaying the WAL itself
> >could certainly be beneficial, so maybe it does make sense to have that
> >information there too..  still not sure we really need it in LSN order
> >or that we need to prefetch in LSN order though.
>
> Well, how exactly should the prefetching extract the block list is still
> an open question. But this seems to deal with pretty much the same stuff,
> so it might make sense to support the prefetching use case too. Or maybe
> not, I'm not sure - but IMHO we should try.

Yeah, makes sense to me to at least be thinking about it and seeing if
there's a way to make adding prefetching easier.

> I don't think it makes sense to do prefetching when the standby is fully
> caught up with the primary. It gets more important when it falls behind by
> a significant amount of WAL, at which point it's likely to fetch already
> closed WAL segments - either from primary or archive. So either it can get
> the block list from there, or it may extract the info on it's own (using
> some of the infrastructure used to build the mdblock files).

I don't know..  if we're able to keep up with the primary because we did
a bunch of prefetching and that made WAL replay fast enough that we
catch up, aren't we going to possibly fall right back behind again
pretty quickly if we don't prefetch for the blocks that are about to be
sent?  If we are caught up and there's no data to be sent then we don't
have much choice, but as soon as there's WAL data to be sent our way,
any more than just one block, then grabbing the list of blocks that we
know we're going to need during replay seems to make a lot of sense to
me.  In other words, I'd think we would just always do this and how much
we do would be dictated by both how far behind we are and how much
memory we want to allocate for this.

Thanks!

Stephen


Re: finding changed blocks using WAL scanning

From
Tomas Vondra
Date:
On Tue, Apr 23, 2019 at 11:43:05AM -0400, Stephen Frost wrote:
>Greetings,
>
>* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> On Tue, Apr 23, 2019 at 10:22:46AM -0400, Stephen Frost wrote:
>> >* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> >>On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
>> >>>On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote:
>> >>>>> Oh.  Well, I already explained my algorithm for doing that upthread,
>> >>>>> which I believe would be quite cheap.
>> >>>>>
>> >>>>> 1. When you generate the .modblock files, stick all the block
>> >>>>> references into a buffer.  qsort().  Dedup.  Write out in sorted
>> >>>>> order.
>> >>>>
>> >>>>Having all of the block references in a sorted order does seem like it
>> >>>>would help, but would also make those potentially quite a bit larger
>> >>>>than necessary (I had some thoughts about making them smaller elsewhere
>> >>>>in this discussion).  That might be worth it though.  I suppose it might
>> >>>>also be possible to line up the bitmaps suggested elsewhere to do
>> >>>>essentially a BitmapOr of them to identify the blocks changed (while
>> >>>>effectively de-duping at the same time).
>> >>>
>> >>>I don't see why this would make them bigger than necessary.  If you
>> >>>sort by relfilenode/fork/blocknumber and dedup, then references to
>> >>>nearby blocks will be adjacent in the file.  You can then decide what
>> >>>format will represent that most efficiently on output.  Whether or not
>> >>>a bitmap is better idea than a list of block numbers or something else
>> >>>depends on what percentage of blocks are modified and how clustered
>> >>>they are.
>> >>
>> >>Not sure I understand correctly - do you suggest to deduplicate and sort
>> >>the data before writing them into the .modblock files? Because that the
>> >>the sorting would make this information mostly useless for the recovery
>> >>prefetching use case I mentioned elsewhere. For that to work we need
>> >>information about both the LSN and block, in the LSN order.
>> >
>> >I'm not sure I follow- why does the prefetching need to get the blocks
>> >in LSN order..?  Once the blocks that we know are going to change in the
>> >next segment have been identified, we could prefetch them all and have
>> >them ready for when replay gets to them.  I'm not sure that we
>> >specifically need to have them pre-fetched in the same order that the
>> >replay happens and it might even be better to fetch them in an order
>> >that's as sequential as possible to get them in as quickly as possible.
>>
>> That means we'd have to prefetch all blocks for the whole WAL segment,
>> which is pretty useless, IMO. A single INSERT (especially for indexes) is
>> often just ~100B, so a single 16MB segment can fit ~160k of them. Surely
>> we don't want to prefetch all of that at once? And it's even worse for
>> larger WAL segments, which are likely to get more common now that it's an
>> initdb option.
>
>Ah, yeah, I had been thinking about FPIs and blocks and figuring that
>16MB wasn't that bad but you're certainly right that we could end up
>touching a lot more blocks in a given WAL segment.
>
>> I'm pretty sure the prefetching needs to be more like "prefetch the next
>> 1024 blocks we'll need" or "prefetch blocks from the next X megabytes of
>> WAL". That doesn't mean we can't do some additional optimization (like
>> reordering them a bit), but there's a point where it gets detrimental
>> because the kernel will just evict some of the prefetched blocks before we
>> actually access them. And the larger amount of blocks you prefetch the
>> more likely that is.
>
>Since we're prefetching, maybe we shouldn't be just pulling the blocks
>into the filesystem cache but instead into something we have more
>control over..?
>

Well, yeah, and there was a discussion about that in the prefetching
thread IIRC. But in that case it's probably even more important to
prefetch only a fairly small chunk of blocks (instead of a whole WAL
segment's worth of blocks), because shared buffers are usually much
smaller than the page cache. So we'd probably want some sort of small
ring buffer there, so the LSN information is even more important.

>> >>So if we want to allow that use case to leverage this infrastructure, we
>> >>need to write the .modfiles kinda "raw" and do this processing in some
>> >>later step.
>> >
>> >If we really need the LSN info for the blocks, then we could still
>> >de-dup, picking the 'first modified in this segment at LSN X', or keep
>> >both first and last, or I suppose every LSN if we really want, and then
>> >have that information included with the other information about the
>> >block.  Downstream clients could then sort based on the LSN info if they
>> >want to have a list of blocks in sorted-by-LSN-order.
>>
>> Possibly. I don't think keeping just the first block occurence is enough,
>> particularly for large WAL segmnent sizes, but I agree we can reduce the
>> stuff a bit (say, ignore references that are less than 1MB apart or so).
>> We just can't remove all the LSN information entirely.
>
>Yeah, it depends on how much memory we want to use for prefetching and
>how large the WAL segments are.
>

Right.

>> >>Now, maybe the incremental backup use case is so much more important the
>> >>right thing to do is ignore this other use case, and I'm OK with that -
>> >>as long as it's a conscious choice.
>> >
>> >I'd certainly like to have a way to prefetch, but I'm not entirely sure
>> >that it makes sense to combine it with this, so while I sketched out
>> >some ideas about how to do that above, I don't want it to come across as
>> >being a strong endorsement of the overall idea.
>> >
>> >For pre-fetching purposes, for an async streaming replica, it seems like
>> >the wal sender process could potentially just scan the WAL and have a
>> >list of blocks ready to pass to the replica which are "this is what's
>> >coming soon" or similar, rather than working with the modfiles at all.
>> >Not sure if we'd always send that or if we wait for the replica to ask
>> >for it.  Though for doing WAL replay from the archive, being able to ask
>> >for the modfile first to do prefetching before replaying the WAL itself
>> >could certainly be beneficial, so maybe it does make sense to have that
>> >information there too..  still not sure we really need it in LSN order
>> >or that we need to prefetch in LSN order though.
>>
>> Well, how exactly the prefetching should extract the block list is still
>> an open question. But this seems to deal with pretty much the same stuff,
>> so it might make sense to support the prefetching use case too. Or maybe
>> not, I'm not sure - but IMHO we should try.
>
>Yeah, makes sense to me to at least be thinking about it and seeing if
>there's a way to make adding prefetching easier.
>

Agreed.

>> I don't think it makes sense to do prefetching when the standby is fully
>> caught up with the primary. It gets more important when it falls behind by
>> a significant amount of WAL, at which point it's likely to fetch already
>> closed WAL segments - either from primary or archive. So either it can get
>> the block list from there, or it may extract the info on its own (using
>> some of the infrastructure used to build the modblock files).
>
>I don't know..  if we're able to keep up with the primary because we did
>a bunch of prefetching and that made WAL replay fast enough that we
>catch up, aren't we going to possibly fall right back behind again
>pretty quickly if we don't prefetch for the blocks that are about to be
>sent?  If we are caught up and there's no data to be sent then we don't
>have much choice, but as soon as there's WAL data to be sent our way,
>any more than just one block, then grabbing the list of blocks that we
>know we're going to need during replay seems to make a lot of sense to
>me.  In other words, I'd think we would just always do this and how much
>we do would be dictated by both how far behind we are and how much
>memory we want to allocate for this.
>

Well, the thing is that for prefetching to be possible you actually have
to be a bit behind. Otherwise you can't really look ahead to see which
blocks will be needed, right?

IMHO the main use case for prefetching is when there's a spike of activity
on the primary, making the standby fall behind, and then it takes hours to
catch up. I don't think the cases with just a couple of MBs of lag are the
issue prefetching is meant to address (if it does help there, great).

Anyway, not sure if this is the right thread to discuss prefetching in
much more detail.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: finding changed blocks using WAL scanning

From
Andres Freund
Date:
Hi,

On 2019-04-23 18:07:40 +0200, Tomas Vondra wrote:
> Well, the thing is that for prefetching to be possible you actually have
> to be a bit behind. Otherwise you can't really look ahead to see which
> blocks will be needed, right?
>
> IMHO the main use case for prefetching is when there's a spike of activity
> on the primary, making the standby fall behind, and then it takes hours to
> catch up. I don't think the cases with just a couple of MBs of lag are the
> issue prefetching is meant to address (if it does help there, great).

I'd be surprised if a good implementation didn't. Even just some smarter
IO scheduling in the startup process could help a good bit. E.g. no need
to sequentially read the first and then the second block for an update
record, if you can issue both at the same time - just about every
storage system these days can do a number of IO requests in parallel,
and it nearly halves latency effects. And reading a few records (as in a
few hundred bytes commonly) ahead allows us to do much more than that.
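
Just to illustrate what I mean - a hand-wavy sketch using plain
posix_fadvise(), ignoring how we get at the right file descriptors (the
real thing would of course go through the buffer manager):

/* Sketch only: hint the kernel about both blocks touched by an update
 * record before reading either one, instead of read(A) then read(B). */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdint.h>

#define BLCKSZ 8192

static void
hint_block(int fd, uint32_t blkno)
{
    /* POSIX_FADV_WILLNEED kicks off readahead asynchronously */
    (void) posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
}

static void
prefetch_update_record(int old_fd, uint32_t old_blkno,
                       int new_fd, uint32_t new_blkno)
{
    /* issue both hints first, so the two reads can be serviced in
     * parallel by the storage ... */
    hint_block(old_fd, old_blkno);
    hint_block(new_fd, new_blkno);
    /* ... and by the time replay actually reads the pages, much of the
     * I/O latency has already been paid. */
}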

Greetings,

Andres Freund



Re: finding changed blocks using WAL scanning

From
Tomas Vondra
Date:
On Tue, Apr 23, 2019 at 09:34:54AM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-04-23 18:07:40 +0200, Tomas Vondra wrote:
>> Well, the thing is that for prefetching to be possible you actually have
>> to be a bit behind. Otherwise you can't really look ahead to see which
>> blocks will be needed, right?
>>
>> IMHO the main use case for prefetching is when there's a spike of activity
>> on the primary, making the standby fall behind, and then it takes hours to
>> catch up. I don't think the cases with just a couple of MBs of lag are the
>> issue prefetching is meant to address (if it does help there, great).
>
>I'd be surprised if a good implementation didn't. Even just some smarter
>IO scheduling in the startup process could help a good bit. E.g. no need
>to sequentially read the first and then the second block for an update
>record, if you can issue both at the same time - just about every
>storage system these days can do a number of IO requests in parallel,
>and it nearly halves latency effects. And reading a few records (as in a
>few hundred bytes commonly) ahead allows us to do much more than that.
>

I don't disagree with that - prefetching certainly can improve utilization
of the storage system. The question is whether it can meaningfully improve
performance of the recovery process in cases when it does not lag.  And I
think it can't (perhaps with remote_apply being  an exception).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: finding changed blocks using WAL scanning

From
Andres Freund
Date:
Hi,

On 2019-04-23 19:01:29 +0200, Tomas Vondra wrote:
> On Tue, Apr 23, 2019 at 09:34:54AM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2019-04-23 18:07:40 +0200, Tomas Vondra wrote:
> > > Well, the thing is that for prefetching to be possible you actually have
> > > to be a bit behind. Otherwise you can't really look ahead to see which
> > > blocks will be needed, right?
> > >
> > > IMHO the main use case for prefetching is when there's a spike of activity
> > > on the primary, making the standby fall behind, and then it takes hours to
> > > catch up. I don't think the cases with just a couple of MBs of lag are the
> > > issue prefetching is meant to address (if it does help there, great).
> > 
> > I'd be surprised if a good implementation didn't. Even just some smarter
> > IO scheduling in the startup process could help a good bit. E.g. no need
> > to sequentially read the first and then the second block for an update
> > record, if you can issue both at the same time - just about every
> > storage system these days can do a number of IO requests in parallel,
> > and it nearly halves latency effects. And reading a few records (as in a
> > few hundred bytes commonly) ahead allows us to do much more than that.
> > 
> 
> I don't disagree with that - prefetching certainly can improve utilization
> of the storage system. The question is whether it can meaningfully improve
> performance of the recovery process in cases when it does not lag.  And I
> think it can't (perhaps with remote_apply being  an exception).

Well. I think a few dozen records behind doesn't really count as "lag",
and I think that's where it'd start to help (and for some record types
like updates it'd start to help even for single records). It'd convert
scenarios where we'd currently fall behind slowly into scenarios where
we can keep up - but where there's no meaningful lag while we keep up.
What's your argument for me being wrong?

And even if we'd keep up without any prefetching, issuing requests in a
more efficient manner allows for more efficient concurrent use of the
storage system. It'll often effectively reduce the amount of random
iops.

Greetings,

Andres Freund



Re: finding changed blocks using WAL scanning

From
Tomas Vondra
Date:
On Tue, Apr 23, 2019 at 10:09:39AM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-04-23 19:01:29 +0200, Tomas Vondra wrote:
>> On Tue, Apr 23, 2019 at 09:34:54AM -0700, Andres Freund wrote:
>> > Hi,
>> >
>> > On 2019-04-23 18:07:40 +0200, Tomas Vondra wrote:
>> > > Well, the thing is that for prefetching to be possible you actually have
>> > > to be a bit behind. Otherwise you can't really look ahead to see which
>> > > blocks will be needed, right?
>> > >
>> > > IMHO the main use case for prefetching is when there's a spike of activity
>> > > on the primary, making the standby fall behind, and then it takes hours to
>> > > catch up. I don't think the cases with just a couple of MBs of lag are the
>> > > issue prefetching is meant to address (if it does help there, great).
>> >
>> > I'd be surprised if a good implementation didn't. Even just some smarter
>> > IO scheduling in the startup process could help a good bit. E.g. no need
>> > to sequentially read the first and then the second block for an update
>> > record, if you can issue both at the same time - just about every
>> > storage system these days can do a number of IO requests in parallel,
>> > and it nearly halves latency effects. And reading a few records (as in a
>> > few hundred bytes commonly) ahead allows us to do much more than that.
>> >
>>
>> I don't disagree with that - prefetching certainly can improve utilization
>> of the storage system. The question is whether it can meaningfully improve
>> performance of the recovery process in cases when it does not lag.  And I
>> think it can't (perhaps with remote_apply being  an exception).
>
>Well. I think a few dozen records behind doesn't really count as "lag",
>and I think that's where it'd start to help (and for some record types
>like updates it'd start to help even for single records). It'd convert
>scenarios where we'd currently fall behind slowly into scenarios where
>we can keep up - but where there's no meaningful lag while we keep up.
>What's your argument for me being wrong?
>

I was not saying you are wrong. I think we actually agree on the main
points. My point is that prefetching is most valuable for cases when the
standby can't keep up and falls behind significantly - at which point we
have a sufficient queue of blocks to prefetch. I don't care about the case
when the standby can keep up even without prefetching, because the metric
we need to optimize (i.e. lag) is close to 0 even without prefetching.

>And even if we'd keep up without any prefetching, issuing requests in a
>more efficient manner allows for more efficient concurrent use of the
>storage system. It'll often effectively reduce the amount of random
>iops.

Maybe, although the metric we (and users) care about the most is the
amount of lag. If the system keeps up even without prefetching, no one
will complain about I/O utilization.

When the lag is close to 0, the average throughput/IOPS/... is bound to be
the same in both cases, because prefetching does not affect how fast the standby
receives WAL from the primary. Except that it's somewhat "spikier" with
prefetching, because we issue requests in bursts. Which may actually be a
bad thing.

Of course, maybe prefetching will make it much more efficient even in the
"no lag" case, and while it won't improve the recovery, it'll leave more
I/O bandwidth for the other processes (say, queries on hot standby).

So to be clear, I'm not against prefetching even in this case, but it's
not the primary reason why I think we need to do that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Mon, Apr 22, 2019 at 9:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
> For this particular use case, wouldn't you want to read the WAL itself
> and use that to issue prefetch requests?  Because if you use the
> .modblock files, the data file blocks will end up in memory but the
> WAL blocks won't, and you'll still be waiting for I/O.

I'm still interested in the answer to this question, but I don't see a
reply that specifically concerns it.  Apologies if I have missed one.

Stepping back a bit, I think that the basic issue under discussion
here is how granular you want your .modblock files.  At one extreme,
one can imagine an application that wants to know exactly which blocks
were accessed at exactly which LSNs.  At the other extreme, if you want
to run a daily incremental backup, you just want to know which blocks
have been modified between the start of the previous backup and the
start of the current backup - i.e. sometime in the last ~24 hours.
These are quite different things.  When you only want approximate
information - is there a chance that this block was changed within
this LSN range, or not? - you can sort and deduplicate in advance;
when you want exact information, you cannot do that.  Furthermore, if
you want exact information, you must store an LSN for every record; if
you want approximate information, you emit a file for each LSN range
and consider it sufficient to know that the change happened somewhere
within the range of LSNs encompassed by that file.
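
To illustrate what that means for storage - just a sketch, not a proposed
on-disk layout, with a single relnode standing in for the full relation
identifier:

#include <stdint.h>

/* Exact: one entry per block reference, each carrying its own LSN.
 * Volume grows with the WAL, and the entries cannot be sorted or
 * deduplicated without losing information. */
typedef struct ExactEntry
{
    uint64_t    lsn;        /* LSN of the record referencing the block */
    uint32_t    relnode;
    uint32_t    blkno;
} ExactEntry;

/* Approximate: the file header covers an LSN range; the entries are just
 * block references, sorted and deduplicated, with no per-entry LSN -
 * "changed somewhere in [start_lsn, end_lsn)". */
typedef struct ModblockFileHeader
{
    uint64_t    start_lsn;
    uint64_t    end_lsn;
    uint32_t    nentries;
} ModblockFileHeader;

typedef struct ApproxEntry
{
    uint32_t    relnode;
    uint32_t    blkno;
} ApproxEntry;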

It's pretty clear in my mind that what I want to do here is provide
approximate information, not exact information.  Being able to sort
and deduplicate in advance seems critical to be able to make something
like this work on high-velocity systems.  If you are generating a
terabyte of WAL between incremental backups, and you don't do any
sorting or deduplication prior to the point when you actually try to
generate the modified block map, you are going to need a whole lot of
memory (and CPU time, though that's less critical, I think) to process
all of that data.  If you can read modblock files which are already
sorted and deduplicated, you can generate results incrementally and
send them to the client incrementally and you never really need more
than some fixed amount of memory no matter how much data you are
processing.
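
To spell out the kind of processing that enables - a sketch that ignores
relation identifiers, error handling, and real file I/O, and uses a linear
scan rather than a heap, but shows why memory stays proportional to the
number of inputs rather than to the amount of data:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct Input
{
    const uint32_t *blocks;     /* one modblock file: sorted, deduplicated */
    int         nblocks;
    int         pos;
} Input;

static void
merge_modblock_inputs(Input *inputs, int ninputs)
{
    uint32_t    last = 0;
    bool        have_last = false;

    for (;;)
    {
        int         best = -1;

        /* find the input whose current block number is smallest */
        for (int i = 0; i < ninputs; i++)
        {
            if (inputs[i].pos < inputs[i].nblocks &&
                (best < 0 ||
                 inputs[i].blocks[inputs[i].pos] <
                 inputs[best].blocks[inputs[best].pos]))
                best = i;
        }
        if (best < 0)
            break;              /* all inputs exhausted */

        uint32_t    blk = inputs[best].blocks[inputs[best].pos++];

        /* deduplicate across inputs while emitting incrementally */
        if (!have_last || blk != last)
        {
            printf("block %u\n", (unsigned) blk);   /* "send to client" */
            last = blk;
            have_last = true;
        }
    }
}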

While I'm convinced that this particular feature should provide
approximate rather than exact information, the degree of approximation
is up for debate, and maybe it's best to just make that configurable.
Some applications might work best with small modblock files covering
only ~16MB of WAL each, or even less, while others might prefer larger
quanta, say 1GB or even more.  For incremental backup, I believe that
the quanta will depend on the system velocity.  On a system that isn't
very busy, fine-grained modblock files will make incremental backup
more efficient.  If each modblock file covers only 16MB of data, and
the backup manages to start someplace in the middle of that 16MB, then
you'll only be including 16MB or less of unnecessary block references
in the backup so you won't incur much extra work.  On the other hand,
on a busy system, you probably do not want such a small quantum,
because you will then end up with gazillions of modblock files and that
will be hard to manage.  It could also have performance problems,
because merging data from a couple of hundred files is fine, but
merging data from a couple of hundred thousand files is going to be
inefficient.  My experience hacking on and testing tuplesort.c a few
years ago (with valuable tutelage by Peter Geoghegan) showed me that
there is a slow drop-off in efficiency as the merge order increases --
and in this case, at some point you will blow out the size of the OS
file descriptor table and have to start opening and closing files
every time you access a different one, and that will be unpleasant.
Finally, deduplication will tend to be more effective across larger
numbers of block references, at least on some access patterns.
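
To put rough numbers on the file-count issue: with the terabyte-of-WAL
scenario above, 16MB-granularity modblock files mean 1TB / 16MB = 65,536
files to merge, which is already well past the typical default open-file
limit (often 1024), whereas 1GB granularity gives a far more manageable
1,024 files.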

So all of that is to say that if somebody wants modblock files each of
which covers 1MB of WAL, I think that the same tools I'm proposing to
build here for incremental backup could support that use case with
just a configuration change.  Moreover, the resulting files would
still be usable by the incremental backup engine.  So that's good: the
same system can, at least to some extent, be reused for whatever other
purposes people want to know about modified blocks.  On the other
hand, the incremental backup engine will likely not cope smoothly with
having hundreds of thousands or millions of modblock files shoved down
its gullet, so if there is a dramatic difference in the granularity
requirements of different consumers, another approach is likely
indicated.  Especially if some consumer wants to see block references
in the exact order in which they appear in WAL, or wants to know the
exact LSN of each reference, it's probably best to go for a different
approach.  For example, pg_waldump could grow a new option which spits
out just the block references and in a format designed to be easily
machine-parseable; or a hypothetical background worker that does
prefetching for recovery could just contain its own copy of the
xlogreader machinery.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: finding changed blocks using WAL scanning

From
Tomas Vondra
Date:
On Wed, Apr 24, 2019 at 09:25:12AM -0400, Robert Haas wrote:
>On Mon, Apr 22, 2019 at 9:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> For this particular use case, wouldn't you want to read the WAL itself
>> and use that to issue prefetch requests?  Because if you use the
>> .modblock files, the data file blocks will end up in memory but the
>> WAL blocks won't, and you'll still be waiting for I/O.
>
>I'm still interested in the answer to this question, but I don't see a
>reply that specifically concerns it.  Apologies if I have missed one.
>

I don't think prefetching WAL blocks is all that important. The WAL
segment was probably received fairly recently (either from primary or
archive) and so it's reasonable to assume it's still in page cache. And
even if it's not, sequential reads are handled by readahead pretty well.
Which is a form of prefetching.

But even if WAL prefetching was useful in some cases, I think it's a mostly
orthogonal issue - it certainly does not make prefetching of data pages
unnecessary.

>Stepping back a bit, I think that the basic issue under discussion
>here is how granular you want your .modblock files.  At one extreme,
>one can imagine an application that wants to know exactly which blocks
>were accessed at exactly which LSNs.  At the other extreme, if you want
>to run a daily incremental backup, you just want to know which blocks
>have been modified between the start of the previous backup and the
>start of the current backup - i.e. sometime in the last ~24 hours.
>These are quite different things.  When you only want approximate
>information - is there a chance that this block was changed within
>this LSN range, or not? - you can sort and deduplicate in advance;
>when you want exact information, you cannot do that.  Furthermore, if
>you want exact information, you must store an LSN for every record; if
>you want approximate information, you emit a file for each LSN range
>and consider it sufficient to know that the change happened somewhere
>within the range of LSNs encompassed by that file.
>

Those are the extreme design options, yes. But I think there may be a
reasonable middle ground that would allow using the modblock files for
both use cases.

>It's pretty clear in my mind that what I want to do here is provide
>approximate information, not exact information.  Being able to sort
>and deduplicate in advance seems critical to be able to make something
>like this work on high-velocity systems.

Do you have any analysis / data to support that claim? I mean, it's
obvious that sorting and deduplicating the data right away makes
subsequent processing more efficient, but it's not clear to me that not
doing it would make it useless for high-velocity systems.

> If you are generating a
>terabyte of WAL between incremental backups, and you don't do any
>sorting or deduplication prior to the point when you actually try to
>generate the modified block map, you are going to need a whole lot of
>memory (and CPU time, though that's less critical, I think) to process
>all of that data.  If you can read modblock files which are already
>sorted and deduplicated, you can generate results incrementally and
>send them to the client incrementally and you never really need more
>than some fixed amount of memory no matter how much data you are
>processing.
>

Sure, but that's not what I proposed elsewhere in this thread. My proposal
was to keep modblock files "raw" for WAL segments that were not recycled yet
(so the last ~3 checkpoints), and deduplicate them after that. So the vast
majority of the 1TB of WAL will already have deduplicated data.

Also, maybe we can do partial deduplication, in a way that would be useful
for prefetching. Say we only deduplicate 1MB windows - that would work at
least for cases that touch the same page frequently (say, by inserting to
the tail of an index, or so).
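
Roughly something like this, as a sketch - a toy direct-mapped table where
a collision simply overwrites the slot, which at worst costs us an extra
(duplicate) block reference, never a missing one:

/* Sketch of window-based partial deduplication: keep block references in
 * WAL order, but drop a reference if the same block was already emitted
 * within the last WINDOW bytes of WAL. Hot pages (e.g. the tail of a
 * btree index) collapse to roughly one reference per window, while every
 * reference that survives keeps an approximate LSN useful for prefetching. */
#include <stdbool.h>
#include <stdint.h>

#define WINDOW      (1024 * 1024)   /* 1MB of WAL */
#define TABLE_SIZE  8192            /* illustration-sized */

typedef struct RecentRef
{
    uint32_t    relnode;
    uint32_t    blkno;
    uint64_t    last_lsn;
    bool        valid;
} RecentRef;

static RecentRef recent[TABLE_SIZE];

/* Returns true if this block reference should be written out; assumes
 * references arrive in increasing LSN order. */
static bool
should_emit(uint32_t relnode, uint32_t blkno, uint64_t lsn)
{
    uint32_t    slot = (relnode * 2654435761u ^ blkno) % TABLE_SIZE;
    RecentRef  *r = &recent[slot];

    if (r->valid && r->relnode == relnode && r->blkno == blkno &&
        lsn - r->last_lsn < WINDOW)
        return false;           /* same block seen less than 1MB ago */

    r->relnode = relnode;
    r->blkno = blkno;
    r->last_lsn = lsn;
    r->valid = true;
    return true;
}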

>While I'm convinced that this particular feature should provide
>approximate rather than exact information, the degree of approximation
>is up for debate, and maybe it's best to just make that configurable.
>Some applications might work best with small modblock files covering
>only ~16MB of WAL each, or even less, while others might prefer larger
>quanta, say 1GB or even more.  For incremental backup, I believe that
>the quanta will depend on the system velocity.  On a system that isn't
>very busy, fine-grained modblock files will make incremental backup
>more efficient.  If each modblock file covers only 16MB of data, and
>the backup manages to start someplace in the middle of that 16MB, then
>you'll only be including 16MB or less of unnecessary block references
>in the backup so you won't incur much extra work.  On the other hand,
>on a busy system, you probably do not want such a small quantum,
>because you will then end up with gazillions of modblock files and that
>will be hard to manage.  It could also have performance problems,
>because merging data from a couple of hundred files is fine, but
>merging data from a couple of hundred thousand files is going to be
>inefficient.  My experience hacking on and testing tuplesort.c a few
>years ago (with valuable tutelage by Peter Geoghegan) showed me that
>there is a slow drop-off in efficiency as the merge order increases --
>and in this case, at some point you will blow out the size of the OS
>file descriptor table and have to start opening and closing files
>every time you access a different one, and that will be unpleasant.
>Finally, deduplication will tend to be more effective across larger
>numbers of block references, at least on some access patterns.
>

I agree with those observations in general, but I don't think they somehow
prove we have to deduplicate/sort the data.

FWIW no one cares about low-velocity systems. While raw modblock files
would not be an issue on them, it's also mostly uninteresting from the
prefetching perspective. It's the high-velocity systems that have lag.

>So all of that is to say that if somebody wants modblock files each of
>which covers 1MB of WAL, I think that the same tools I'm proposing to
>build here for incremental backup could support that use case with
>just a configuration change.  Moreover, the resulting files would
>still be usable by the incremental backup engine.  So that's good: the
>same system can, at least to some extent, be reused for whatever other
>purposes people want to know about modified blocks.

+1 to configuration change, at least during the development phase. It'll
allow comfortable testing and benchmarking.

>On the other hand, the incremental backup engine will likely not cope
>smoothly with having hundreds of thousands or millions of modblock files
>shoved down its gullet, so if there is a dramatic difference in the
>granularity requirements of different consumers, another approach is
>likely indicated.  Especially if some consumer wants to see block
>references in the exact order in which they appear in WAL, or wants to
>know the exact LSN of each reference, it's probably best to go for a
>different approach.  For example, pg_waldump could grow a new option
>which spits out just the block references and in a format designed to be
>easily machine-parseable; or a hypothetical background worker that does
>prefetching for recovery could just contain its own copy of the
>xlogreader machinery.
>

Again, I don't think we have to keep the raw modblock files forever. Send
them to the archive, remove/deduplicate/sort them after we recycle the WAL
segment, or something like that. That way the incremental backups don't
need to deal with an excessive number of modblock files.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: finding changed blocks using WAL scanning

From
Robert Haas
Date:
On Wed, Apr 24, 2019 at 10:10 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> >I'm still interested in the answer to this question, but I don't see a
> >reply that specifically concerns it.  Apologies if I have missed one.
>
> I don't think prefetching WAL blocks is all that important. The WAL
> segment was probably received fairly recently (either from primary or
> archive) and so it's reasonable to assume it's still in page cache. And
> even if it's not, sequential reads are handled by readahead pretty well.
> Which is a form of prefetching.

True.  But if you are going to need to read the WAL anyway to apply
it, why shouldn't the prefetcher just read it first and use that to
drive prefetching, instead of using the modblock files?  It's strictly
less I/O, because you were going to read the WAL files anyway and now
you don't have to also read some other modblock file, and it doesn't
really seem to have any disadvantages.

> >It's pretty clear in my mind that what I want to do here is provide
> >approximate information, not exact information.  Being able to sort
> >and deduplicate in advance seems critical to be able to make something
> >like this work on high-velocity systems.
>
> Do you have any analysis / data to support that claim? I mean, it's
> obvious that sorting and deduplicating the data right away makes
> subsequent processing more efficient, but it's not clear to me that not
> doing it would make it useless for high-velocity systems.

I did include some analysis of this point in my original post.  It
does depend on your assumptions.  If you assume that users will be OK
with memory usage that runs into the tens of gigabytes when the amount
of change since the last incremental backup is very large, then there
is probably no big problem, but that assumption sounds shaky to me.

(The customers I seem to end up on the phone with seem to be
disproportionately those running enormous systems on dramatically
underpowered hardware, which is not infrequently related to the reason
I end up on the phone with them.)

> Sure, but that's not what I proposed elsewhere in this thread. My proposal
> was to keep modblock files "raw" for WAL segments that were not recycled yet
> (so the last ~3 checkpoints), and deduplicate them after that. So the vast
> majority of the 1TB of WAL will already have deduplicated data.

OK, I missed that proposal.  My biggest concern about this is that I
don't see how to square this with the proposal elsewhere on this
thread that these files should be put someplace that makes them
subject to archiving.  If the files are managed by the master in a
separate directory it can easily do this sort of thing, but if they're
archived then you can't.  Now maybe that's just a reason not to adopt
that proposal, but I don't see how to adopt both that proposal and
this one, unless we just say that we're going to spew craptons of tiny
little non-deduplicated modblock files into the archive.

> Also, maybe we can do partial deduplication, in a way that would be useful
> for prefetching. Say we only deduplicate 1MB windows - that would work at
> least for cases that touch the same page frequently (say, by inserting to
> the tail of an index, or so).

Maybe, but I'm not sure that's really optimal for any use case.

> FWIW no one cares about low-velocity systems. While raw modblock files
> would not be an issue on them, it's also mostly uninteresting from the
> prefetching perspective. It's the high-velocity systems that have lag.

I don't think that's particularly fair.  Low-velocity systems are some
of the best candidates for incremental backup, and people who are
running such systems probably care about that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company