Thread: finding changed blocks using WAL scanning
Over at https://www.postgresql.org/message-id/CA%2BTgmobFVe4J4AA7z9OMUzKnm09Tt%2BsybhxeL_Ddst3q3wqpzQ%40mail.gmail.com I mentioned parsing the WAL to extract block references so that incremental backup could efficiently determine which blocks needed to be copied. Ashwin replied in http://postgr.es/m/CALfoeitO-vkfjubMFQRmgyXghL-uUnZLNxbr=obrQQsm8kFO4A@mail.gmail.com to mention that the same approach could be useful for pg_upgrade:

# Currently, pg_rewind requires
# all the WAL logs to be present on source side from point of divergence to
# rewind. Instead just parse the wal and keep the changed blocks around on
# source. Then don't need to retain the WAL but can still rewind using the
# changed block map.

Since there are at least two possible use cases for an efficient way to learn when blocks last changed, and in each case the performance benefits of being able to learn that are potentially quite large, I'm starting this thread to brainstorm how such a system might work.

It seems to me that there are basically two ways of storing this kind of information, plus a bunch of variants. One way is to store files that cover a range of LSNs, and basically contain a synopsis of the WAL for those LSNs. You omit all the actual data and just mention which blocks were changed by some record in that part of the WAL. In this type of scheme, the storage required is roughly proportional to the volume of WAL for which you wish to retain data. Pruning old data is easy; just remove the files that provide information about LSNs that you don't care about any more. The other way is to store data about each block, or each range of blocks, or all the blocks that hash onto a certain slot; for each, store the newest LSN that has modified that block, or a block in that range, or a block that hashes onto that slot. In this system, storage is roughly proportional to the size of the database cluster, except maybe in the hashing case, but I *think* that case will degrade unless you basically expand the map to be roughly proportional to the size of the cluster anyway. I may be wrong.

Of these two variants, I am inclined to prefer the version where each file is a summary of the block references within some range of LSNs. It seems simpler to implement to me. You just read a bunch of WAL files and then when you get tired you stop and emit an output file. You need to protect yourself against untimely crashes. One way is to stick a checksum into the output file. After you finish writing it, fsync() it before you start writing the next one. After a restart, read the latest such file and see if the checksum is OK. If not, regenerate it; if so, assume it's good and move on. Files other than the last one can be assumed good. Another way is to create the file with a temporary name, fsync() it, and then rename it into place and fsync() again. The background worker that generates the files can have a GUC to remove them when they are older than some threshold amount of time, or you can keep them forever and let the user manually remove stuff they no longer want based on LSN. That's pretty much it.

The version where you keep an LSN per block/range/hash bucket seems more complex in terms of durability. Now you have to worry not only about creating new files, but about modifying them. That seems to add some complexity. I think it may be possible to make it work without doing write-ahead logging for every change, but it certainly needs careful thought to avoid torn page problems and checkpoint synchronization issues.
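To make the crash-safety recipe concrete, here is a minimal, illustrative sketch in Python (not server code; the file naming and the fixed 20-byte record layout are invented for the example) of writing a summary file under a temporary name, fsyncing it, renaming it into place, fsyncing the directory, and appending a trailing checksum so the newest file can be re-verified after a restart:

    import os, struct, zlib

    REC = struct.Struct("<IIIII")   # dboid, tsoid, relfilenode, forknum, blockno (hypothetical)

    def write_summary(dirpath, start_lsn, end_lsn, block_refs):
        refs = sorted(set(block_refs))
        body = struct.pack("<QQ", start_lsn, end_lsn) + b"".join(REC.pack(*r) for r in refs)
        data = body + struct.pack("<I", zlib.crc32(body))          # trailing checksum
        final = os.path.join(dirpath, "%016X.%016X.summary" % (start_lsn, end_lsn))
        tmp = final + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())        # make the file contents durable
        os.rename(tmp, final)           # atomically publish the finished file
        dfd = os.open(dirpath, os.O_RDONLY)
        try:
            os.fsync(dfd)               # make the directory entry durable
        finally:
            os.close(dfd)
        return final

    def checksum_ok(path):
        # After a restart, re-check only the newest file; regenerate it if this fails.
        data = open(path, "rb").read()
        return len(data) >= 20 and zlib.crc32(data[:-4]) == struct.unpack("<I", data[-4:])[0]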
Moreover, it potentially uses lots and lots of inodes if there are many relations in the cluster. You can avoid that by not creating maps for small files, if you like, or by switching to the hash bucket approach. But either of those approaches is lossy. Omitting the maps for small files means you always have to assume everything in those files changed. The hash bucket approach is vulnerable to hash collisions; you have to assume that all blocks that hash to a given bucket have changed. Those are probably manageable disadvantages, but I think they are real ones.

There is one thing that does worry me about the file-per-LSN-range approach, and that is memory consumption when trying to consume the information. Suppose you have a really high velocity system. I don't know exactly what the busiest systems around are doing in terms of data churn these days, but let's say just for kicks that we are dirtying 100GB/hour. That means, roughly 12.5 million block references per hour. If each block reference takes 12 bytes, that's maybe 150MB/hour in block reference files. If you run a daily incremental backup, you've got to load all the block references for the last 24 hours and deduplicate them, which means you're going to need about 3.6GB of memory. If you run a weekly incremental backup, you're going to need about 25GB of memory. That is not ideal. One can keep the memory consumption to a more reasonable level by using temporary files. For instance, say you realize you're going to need 25GB of memory to store all the block references you have, but you only have 1GB of memory that you're allowed to use. Well, just hash-partition the data 32 ways by dboid/tsoid/relfilenode/segno, writing each batch to a separate temporary file, and then process each of those 32 files separately. That does add some additional I/O, but it's not crazily complicated and doesn't seem too terrible, at least to me. Still, it's something not to like.

Another problem to think about is whether the most recent data is going to be available when you need it. This concern applies to either approach. In the case of incremental backup, the server is going to be up and running, so if the WAL-scanner gets behind, you can just wait for it to catch up. In the case of pg_rewind, the server is going to be down, so that doesn't work. You probably need to figure out how new the data you have is, and then scan all the newer WAL to pick up any additional block references. That's a bit of a pain, but I don't see any real alternative. In the case of the file-per-LSN-range approach, it's easy to see what LSNs are covered by the files. In the case of the LSN-per-block/range/hash bucket approach, you can presumably rely on the last checkpoint having flushed all the pending changes to disk, but the WAL scanner could've been arbitrarily far behind at that point, so you'd have to store a piece of metadata someplace that tells you exactly how far the WAL scanner had progressed and then have pg_rewind fish it out.

Yet another question is how to make sure WAL doesn't get removed before we finish scanning it. Peter mentioned on the other thread that we could use a variant replication slot, which immediately made me wonder why we'd need a variant. Actually, the biggest problem I see here is that if we use a replication slot, somebody might try to drop it or use it for some other purpose, and that would mess things up. I guess we could pull the usual trick of reserving names that start with 'pg_' for internal purposes.
Or we could just hard-code the LSN that was last scanned for this purpose as a bespoke constraint on WAL discard. Not sure what is best.

I think all of this should be optional functionality. It's going to be really useful for people with large databases, I think, but people with small databases may not care, and it won't be entirely free. If it's not enabled, then the functionality that would otherwise exploit it can fall back to doing things in a less efficient way; nothing needs to break hard.

Opinions?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 10, 2019 at 5:49 PM Robert Haas <robertmhaas@gmail.com> wrote: > There is one thing that does worry me about the file-per-LSN-range > approach, and that is memory consumption when trying to consume the > information. Suppose you have a really high velocity system. I don't > know exactly what the busiest systems around are doing in terms of > data churn these days, but let's say just for kicks that we are > dirtying 100GB/hour. That means, roughly 12.5 million block > references per hour. If each block reference takes 12 bytes, that's > maybe 150MB/hour in block reference files. If you run a daily > incremental backup, you've got to load all the block references for > the last 24 hours and deduplicate them, which means you're going to > need about 3.6GB of memory. If you run a weekly incremental backup, > you're going to need about 25GB of memory. That is not ideal. One > can keep the memory consumption to a more reasonable level by using > temporary files. For instance, say you realize you're going to need > 25GB of memory to store all the block references you have, but you > only have 1GB of memory that you're allowed to use. Well, just > hash-partition the data 32 ways by dboid/tsoid/relfilenode/segno, > writing each batch to a separate temporary file, and then process each > of those 32 files separately. That does add some additional I/O, but > it's not crazily complicated and doesn't seem too terrible, at least > to me. Still, it's something not to like. Oh, I'm being dumb. We should just have the process that writes out these files sort the records first. Then when we read them back in to use them, we can just do a merge pass like MergeAppend would do. Then you never need very much memory at all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
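To make the sort-then-merge idea in the preceding message concrete, here is a small illustrative sketch (Python, with an invented fixed-width record format; the real thing would live in the backend and the backup tooling). Each summary file is assumed to be written sorted and deduplicated, so any number of them can be combined in a single streaming merge pass with bounded memory:

    import heapq, struct

    REC = struct.Struct("<IIIII")    # dboid, tsoid, relfilenode, forknum, blockno (hypothetical)

    def read_refs(path):
        with open(path, "rb") as f:
            while chunk := f.read(REC.size):
                yield REC.unpack(chunk)

    def changed_blocks(summary_paths):
        """Yield each distinct block reference once, in sorted order."""
        last = None
        for ref in heapq.merge(*(read_refs(p) for p in summary_paths)):
            if ref != last:              # dedup on the fly across all inputs
                yield ref
                last = ref

    # Usage: for db, ts, rel, fork, blk in changed_blocks(paths): ...copy that block...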
On 2019-04-10 23:49, Robert Haas wrote: > It seems to me that there are basically two ways of storing this kind > of information, plus a bunch of variants. One way is to store files > that cover a range of LSNs, and basically contain a synopsis of the > WAL for those LSNs. You omit all the actual data and just mention > which blocks were changed by some record in that part of the WAL. That seems better than the other variant. > Yet another question is how to make sure WAL doesn't get removed > before we finish scanning it. Peter mentioned on the other thread > that we could use a variant replication slot, which immediately made > me wonder why we'd need a variant. Actually, the biggest problem I > see here is that if we use a replication slot, somebody might try to > drop it or use it for some other purpose, and that would mess things > up. I guess we could pull the usual trick of reserving names that > start with 'pg_' for internal purposes. Or we could just hard-code > the LSN that was last scanned for this purpose as a bespoke constraint > on WAL discard. Not sure what is best. The word "variant" was used as a hedge ;-), but now that I think about it ... I had in mind that you could have different overlapping incremental backup jobs in existence at the same time. Maybe a daily one to a nearby disk and a weekly one to a faraway cloud. Each one of these would need a separate replication slot, so that the information that is required for *that* incremental backup series is preserved between runs. So just one reserved replication slot that feeds the block summaries wouldn't work. Perhaps what would work is a flag on the replication slot itself "keep block summaries for this slot". Then when all the slots with the block summary flag are past an LSN, you can clean up the summaries before that LSN. > I think all of this should be optional functionality. It's going to > be really useful for people with large databases, I think, but people > with small databases may not care, and it won't be entirely free. If > it's not enabled, then the functionality that would otherwise exploit > it can fall back to doing things in a less efficient way; nothing > needs to break hard. With the flag on the slot scheme you wouldn't need a separate knob to turn this on, because it's just enabled when a backup software has created an appropriate slot. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
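A rough sketch of the retention rule Peter is suggesting, with entirely made-up structures (a real implementation would hang this off the replication slot machinery): summaries older than the minimum LSN across all slots flagged "keep block summaries" can be removed.

    from dataclasses import dataclass

    @dataclass
    class Slot:
        name: str
        restart_lsn: int                  # oldest LSN this consumer still needs
        keep_block_summaries: bool = False

    def summary_cleanup_horizon(slots):
        lsns = [s.restart_lsn for s in slots if s.keep_block_summaries]
        return min(lsns) if lsns else None    # None: nothing forces summary retention

    slots = [Slot("daily_backup", 0x1687DBBB8, True),
             Slot("weekly_backup", 0x168000058, True),
             Slot("ordinary_standby", 0x150000000)]
    print(hex(summary_cleanup_horizon(slots)))   # summaries before 0x168000058 can go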
On Thu, Apr 11, 2019 at 3:52 AM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > I had in mind that you could have different overlapping incremental > backup jobs in existence at the same time. Maybe a daily one to a > nearby disk and a weekly one to a faraway cloud. Each one of these > would need a separate replication slot, so that the information that is > required for *that* incremental backup series is preserved between runs. > So just one reserved replication slot that feeds the block summaries > wouldn't work. Perhaps what would work is a flag on the replication > slot itself "keep block summaries for this slot". Then when all the > slots with the block summary flag are past an LSN, you can clean up the > summaries before that LSN. I don't think that quite works. There are two different LSNs. One is the LSN of the oldest WAL archive that we need to keep around so that it can be summarized, and the other is the LSN of the oldest summary we need to keep around so it can be used for incremental backup purposes. You can't keep both of those LSNs in the same slot. Furthermore, the LSN stored in the slot is defined as the amount of WAL we need to keep, not the amount of something else (summaries) that we need to keep. Reusing that same field to mean something different sounds inadvisable. In other words, I think there are two problems which we need to clearly separate: one is retaining WAL so we can generate summaries, and the other is retaining summaries so we can generate incremental backups. Even if we solve the second problem by using some kind of replication slot, we still need to solve the first problem somehow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 10, 2019 at 2:50 PM Robert Haas <robertmhaas@gmail.com> wrote:
Over at https://www.postgresql.org/message-id/CA%2BTgmobFVe4J4AA7z9OMUzKnm09Tt%2BsybhxeL_Ddst3q3wqpzQ%40mail.gmail.com
I mentioned parsing the WAL to extract block references so that
incremental backup could efficiently determine which blocks needed to
be copied. Ashwin replied in
http://postgr.es/m/CALfoeitO-vkfjubMFQRmgyXghL-uUnZLNxbr=obrQQsm8kFO4A@mail.gmail.com
to mention that the same approach could be useful for pg_upgrade:
Thank you for initiating this separate thread. Just a typo above: not pg_upgrade but pg_rewind.
Let me first explain the thought I have around how to leverage this for pg_rewind (actually, any type of incremental recovery, to be exact). Would love to hear thoughts on it.
Currently, incremental recovery of any form (a replica that goes down and comes back up, or bringing back the old primary after failover to a replica) requires *all* the WAL to be present from the point of disconnect. So it's boolean in those terms: if the WAL is available, we can recover incrementally; otherwise we have to perform a full basebackup. If we come up with this mechanism to find and store changed blocks from the WAL, we can provide an intermediate level of incremental recovery which will be better than full recovery.
WAL allows tuple-level granularity for recovery (if we ignore FPIs for a moment). Modified-block information extracted from the WAL will provide block-level incremental recovery when the WAL itself is no longer available.
So pg_basebackup (or some other tool, or just an option to it) and pg_rewind can leverage the changed blocks to perform the recovery when the WAL can't be retained due to space constraints.
pg_rewind can also be optimized: it currently copies blocks from source to target which were present in the target's WAL in order to rewind, so such blocks can easily be skipped from being copied again.
Depending on the pattern and volume of changes in the WAL, just copying over the changed blocks could prove more efficient than replaying all the WAL for incremental recovery.
It seems to me that there are basically two ways of storing this kind
of information, plus a bunch of variants. One way is to store files
that cover a range of LSNs, and basically contain a synopsis of the
WAL for those LSNs. You omit all the actual data and just mention
which blocks were changed by some record in that part of the WAL. In
this type of scheme, the storage required is roughly proportional to
the volume of WAL for which you wish to retain data. Pruning old data
is easy; just remove the files that provide information about LSNs
that you don't care about any more. The other way is to store data
about each block, or each range of blocks, or all the blocks that hash
onto a certain slot; for each, store the newest LSN that has modified
that block, or a block in that range, or a block that hashes onto
that slot. In this system, storage is roughly proportional to the
size of the database cluster, except maybe in the hashing case, but I
*think* that case will degrade unless you basically expand the map to
be roughly proportional to the size of the cluster anyway. I may be
wrong.
Of these two variants, I am inclined to prefer the version where each
file is a summary of the block references within some range of LSNs.
It seems simpler to implement to me. You just read a bunch of WAL
files and then when you get tired you stop and emit an output file.
You need to protect yourself against untimely crashes. One way is to
stick a checksum into the output file. After you finish writing it,
fsync() it before you start writing the next one. After a restart,
read the latest such file and see if the checksum is OK. If not,
regenerate it; if so, assume it's good and move on. Files other than
the last one can be assumed good. Another way is to create the file
with a temporary name, fsync() it, and then rename it into place and
fsync() again. The background worker that generates the files can
have a GUC to remove them when they are older than some threshold
amount of time, or you can keep them forever and let the user manually
remove stuff they no longer want based on LSN. That's pretty much it.
+1 for the first option. Seems simpler and straightforward.
On Thu, Apr 11, 2019 at 6:27 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Apr 11, 2019 at 3:52 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> I had in mind that you could have different overlapping incremental
> backup jobs in existence at the same time. Maybe a daily one to a
> nearby disk and a weekly one to a faraway cloud. Each one of these
> would need a separate replication slot, so that the information that is
> required for *that* incremental backup series is preserved between runs.
> So just one reserved replication slot that feeds the block summaries
> wouldn't work. Perhaps what would work is a flag on the replication
> slot itself "keep block summaries for this slot". Then when all the
> slots with the block summary flag are past an LSN, you can clean up the
> summaries before that LSN.
I don't think that quite works. There are two different LSNs. One is
the LSN of the oldest WAL archive that we need to keep around so that
it can be summarized, and the other is the LSN of the oldest summary
we need to keep around so it can be used for incremental backup
purposes. You can't keep both of those LSNs in the same slot.
Furthermore, the LSN stored in the slot is defined as the amount of
WAL we need to keep, not the amount of something else (summaries) that
we need to keep. Reusing that same field to mean something different
sounds inadvisable.
In other words, I think there are two problems which we need to
clearly separate: one is retaining WAL so we can generate summaries,
and the other is retaining summaries so we can generate incremental
backups. Even if we solve the second problem by using some kind of
replication slot, we still need to solve the first problem somehow.
Just a thought for the first problem (maybe not simpler): could the replication slot be enhanced to define X amount of WAL to retain, and, after reaching that limit, collect the summary and let the WAL be deleted?
At Thu, 11 Apr 2019 10:00:35 -0700, Ashwin Agrawal <aagrawal@pivotal.io> wrote in <CALfoeis0qOyGk+KQ3AbkfRVv=XbsSecqHfKSag=i_SLWMT+B0A@mail.gmail.com>
> On Thu, Apr 11, 2019 at 6:27 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > On Thu, Apr 11, 2019 at 3:52 AM Peter Eisentraut
> > <peter.eisentraut@2ndquadrant.com> wrote:
> > > I had in mind that you could have different overlapping incremental
> > > backup jobs in existence at the same time. Maybe a daily one to a
> > > nearby disk and a weekly one to a faraway cloud. Each one of these
> > > would need a separate replication slot, so that the information that is
> > > required for *that* incremental backup series is preserved between runs.
> > > So just one reserved replication slot that feeds the block summaries
> > > wouldn't work. Perhaps what would work is a flag on the replication
> > > slot itself "keep block summaries for this slot". Then when all the
> > > slots with the block summary flag are past an LSN, you can clean up the
> > > summaries before that LSN.
> >
> > I don't think that quite works. There are two different LSNs. One is
> > the LSN of the oldest WAL archive that we need to keep around so that
> > it can be summarized, and the other is the LSN of the oldest summary
> > we need to keep around so it can be used for incremental backup
> > purposes. You can't keep both of those LSNs in the same slot.
> > Furthermore, the LSN stored in the slot is defined as the amount of
> > WAL we need to keep, not the amount of something else (summaries) that
> > we need to keep. Reusing that same field to mean something different
> > sounds inadvisable.
> >
> > In other words, I think there are two problems which we need to
> > clearly separate: one is retaining WAL so we can generate summaries,
> > and the other is retaining summaries so we can generate incremental
> > backups. Even if we solve the second problem by using some kind of
> > replication slot, we still need to solve the first problem somehow.
>
> Just a thought for first problem, may not to simpler, can replication slot
> be enhanced to define X amount of WAL to retain, after reaching such limit
> collect summary and let the WAL be deleted.

I think Peter is saying that a slot for block summaries doesn't keep the WAL segments themselves, but instead keeps segmented block summaries. n block-summary slots maintain n block summaries, and the newest block summary is "active", in other words, continuously updated by the WAL records passing by. When a backup tool requests the block summary for, say, the oldest slot, the active summary is closed and a new summary is opened starting at the LSN of that moment, which becomes the new LSN of the slot. Then the concatenated block summaries are sent. Finally, the oldest summary is removed.

regards.

-- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, Apr 10, 2019 at 08:11:11PM -0400, Robert Haas wrote: > On Wed, Apr 10, 2019 at 5:49 PM Robert Haas <robertmhaas@gmail.com> wrote: > > There is one thing that does worry me about the file-per-LSN-range > > approach, and that is memory consumption when trying to consume the > > information. Suppose you have a really high velocity system. I don't > > know exactly what the busiest systems around are doing in terms of > > data churn these days, but let's say just for kicks that we are > > dirtying 100GB/hour. That means, roughly 12.5 million block > > references per hour. If each block reference takes 12 bytes, that's > > maybe 150MB/hour in block reference files. If you run a daily > > incremental backup, you've got to load all the block references for > > the last 24 hours and deduplicate them, which means you're going to > > need about 3.6GB of memory. If you run a weekly incremental backup, > > you're going to need about 25GB of memory. That is not ideal. One > > can keep the memory consumption to a more reasonable level by using > > temporary files. For instance, say you realize you're going to need > > 25GB of memory to store all the block references you have, but you > > only have 1GB of memory that you're allowed to use. Well, just > > hash-partition the data 32 ways by dboid/tsoid/relfilenode/segno, > > writing each batch to a separate temporary file, and then process each > > of those 32 files separately. That does add some additional I/O, but > > it's not crazily complicated and doesn't seem too terrible, at least > > to me. Still, it's something not to like. > > Oh, I'm being dumb. We should just have the process that writes out > these files sort the records first. Then when we read them back in to > use them, we can just do a merge pass like MergeAppend would do. Then > you never need very much memory at all. Can I throw out a simple idea? What if, when we finish writing a WAL file, we create a new file 000000010000000000000001.modblock which lists all the heap/index files and block numbers modified in that WAL file? How much does that help with the list I posted earlier? I think there is some interesting complexity brought up in this thread. Which options are going to minimize storage I/O, network I/O, have only background overhead, allow parallel operation, integrate with pg_basebackup. Eventually we will need to evaluate the incremental backup options against these criteria. I am thinking tools could retain modblock files along with WAL, could pull full-page-writes from WAL, or from PGDATA. It avoids the need to scan 16MB WAL files, and the WAL files and modblock files could be expired independently. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Mon, Apr 15, 2019 at 4:31 PM Bruce Momjian <bruce@momjian.us> wrote: > Can I throw out a simple idea? What if, when we finish writing a WAL > file, we create a new file 000000010000000000000001.modblock which > lists all the heap/index files and block numbers modified in that WAL > file? How much does that help with the list I posted earlier? > > I think there is some interesting complexity brought up in this thread. > Which options are going to minimize storage I/O, network I/O, have only > background overhead, allow parallel operation, integrate with > pg_basebackup. Eventually we will need to evaluate the incremental > backup options against these criteria. > > I am thinking tools could retain modblock files along with WAL, could > pull full-page-writes from WAL, or from PGDATA. It avoids the need to > scan 16MB WAL files, and the WAL files and modblock files could be > expired independently. That is pretty much exactly what I was intending to propose. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 15, 2019 at 09:04:13PM -0400, Robert Haas wrote: > On Mon, Apr 15, 2019 at 4:31 PM Bruce Momjian <bruce@momjian.us> wrote: > > Can I throw out a simple idea? What if, when we finish writing a WAL > > file, we create a new file 000000010000000000000001.modblock which > > lists all the heap/index files and block numbers modified in that WAL > > file? How much does that help with the list I posted earlier? > > > > I think there is some interesting complexity brought up in this thread. > > Which options are going to minimize storage I/O, network I/O, have only > > background overhead, allow parallel operation, integrate with > > pg_basebackup. Eventually we will need to evaluate the incremental > > backup options against these criteria. > > > > I am thinking tools could retain modblock files along with WAL, could > > pull full-page-writes from WAL, or from PGDATA. It avoids the need to > > scan 16MB WAL files, and the WAL files and modblock files could be > > expired independently. > > That is pretty much exactly what I was intending to propose. OK, good. Some of your wording was vague so I was unclear exactly what you were suggesting. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Mon, Apr 15, 2019 at 09:04:13PM -0400, Robert Haas wrote: > That is pretty much exactly what I was intending to propose. Any caller of XLogWrite() could switch to a new segment once the current one is done, and I am not sure that we would want some random backend to potentially slow down to do that kind of operation. Or would a separate background worker do this work by itself? An external tool can do that easily already: https://github.com/michaelpq/pg_plugins/tree/master/pg_wal_blocks -- Michael
On Mon, Apr 15, 2019 at 10:22 PM Bruce Momjian <bruce@momjian.us> wrote:
> > > I am thinking tools could retain modblock files along with WAL, could
> > > pull full-page-writes from WAL, or from PGDATA. It avoids the need to
> > > scan 16MB WAL files, and the WAL files and modblock files could be
> > > expired independently.
> >
> > That is pretty much exactly what I was intending to propose.
>
> OK, good. Some of your wording was vague so I was unclear exactly what
> you were suggesting.

Well, I guess the part that isn't like what I was suggesting is the idea that there should be exactly one modified block file per segment. The biggest problem with that idea is that a single WAL record can be split across two segments (or, in pathological cases, perhaps more). I think it makes sense to talk about the blocks modified by WAL between LSN A and LSN B, but it doesn't make much sense to talk about the blocks modified by the WAL in segment XYZ.

You can make it kinda make sense by saying "the blocks modified by records *beginning in* segment XYZ" or alternatively "the blocks modified by records *ending in* segment XYZ", but that seems confusing to me. For example, suppose you decide on the first one -- 000000010000000100000068.modblock will contain all blocks modified by records that begin in 000000010000000100000068. Well, that means that to generate the 000000010000000100000068.modblock, you will need access to 000000010000000100000068 AND probably also 000000010000000100000069 and in rare cases perhaps 00000001000000010000006A or even later files. I think that's actually pretty confusing.

It seems better to me to give the files names like ${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g. 00000001.0000000168000058.00000001687DBBB8.modblock, so that you can see exactly which *records* are covered by that segment.

And I suspect it may also be a good idea to bunch up the records from several WAL files. Especially if you are using 16MB WAL files, collecting all of the block references from a single WAL file is going to produce a very small file. I suspect that the modified block files will end up being 100x smaller than the WAL itself, perhaps more, and I don't think anybody will appreciate us adding another PostgreSQL system that spews out huge numbers of tiny little files. If, for example, somebody's got a big cluster that is churning out a WAL segment every second, they would probably still be happy to have a new modified block file only, say, every 10 seconds.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
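For illustration, the proposed naming scheme is easy to generate and parse mechanically; the helper functions below are hypothetical, but the example output matches the name given above:

    import re

    def modblock_name(tli, start_lsn, end_lsn):
        return "%08X.%016X.%016X.modblock" % (tli, start_lsn, end_lsn)

    def parse_modblock_name(name):
        m = re.fullmatch(r"([0-9A-F]{8})\.([0-9A-F]{16})\.([0-9A-F]{16})\.modblock", name)
        if m is None:
            raise ValueError("not a modblock file name: %s" % name)
        return tuple(int(g, 16) for g in m.groups())   # (tli, start_lsn, end_lsn)

    print(modblock_name(1, 0x168000058, 0x1687DBBB8))
    # -> 00000001.0000000168000058.00000001687DBBB8.modblock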
On Mon, Apr 15, 2019 at 11:45 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Apr 15, 2019 at 09:04:13PM -0400, Robert Haas wrote:
> > That is pretty much exactly what I was intending to propose.
>
> Any caller of XLogWrite() could switch to a new segment once the
> current one is done, and I am not sure that we would want some random
> backend to potentially slow down to do that kind of operation.
>
> Or would a separate background worker do this work by itself? An
> external tool can do that easily already:
> https://github.com/michaelpq/pg_plugins/tree/master/pg_wal_blocks

I was thinking that a dedicated background worker would be a good option, but Stephen Frost seems concerned (over on the other thread) about how much load that would generate. That never really occurred to me as a serious issue and I suspect for many people it wouldn't be, but there might be some.

It's cool that you have a command-line tool that does this as well. Over there, it was also discussed that we might want to have both a command-line tool and a background worker. I think, though, that we would want to get the output in some kind of compressed binary format, rather than text. e.g.

4-byte database OID
4-byte tablespace OID
any number of relation OID/block OID pairings for that database/tablespace combination
4-byte zero to mark the end of the relation OID/block OID list
and then repeat all of the above any number of times

That might be too dumb and I suspect we want some headers and a checksum, but we should try to somehow exploit the fact that there aren't likely to be many distinct databases or many distinct tablespaces mentioned -- whereas relation OID and block number will probably have a lot more entropy.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
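As a sketch of how such a grouped layout could be encoded, exploiting the low entropy of database/tablespace OIDs (byte order, the zero terminator, and the absence of a header/checksum are placeholders here, not a proposed on-disk format):

    import struct
    from itertools import groupby
    from operator import itemgetter

    def encode(block_refs):
        """block_refs: iterable of (dboid, tsoid, reloid, blockno) tuples."""
        out = bytearray()
        for (db, ts), group in groupby(sorted(set(block_refs)), key=itemgetter(0, 1)):
            out += struct.pack("<II", db, ts)       # stated once per database/tablespace group
            for _, _, rel, blk in group:
                out += struct.pack("<II", rel, blk)
            out += struct.pack("<I", 0)             # relation OID 0 never occurs: end of list
        return bytes(out)

    refs = [(16384, 1663, 16399, 0), (16384, 1663, 16399, 7), (16384, 1663, 24576, 3)]
    print(len(encode(refs)))    # 8 + 3*8 + 4 = 36 bytes for three block references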
On Thu, Apr 18, 2019 at 03:43:30PM -0400, Robert Haas wrote: > You can make it kinda make sense by saying "the blocks modified by > records *beginning in* segment XYZ" or alternatively "the blocks > modified by records *ending in* segment XYZ", but that seems confusing > to me. For example, suppose you decide on the first one -- > 000000010000000100000068.modblock will contain all blocks modified by > records that begin in 000000010000000100000068. Well, that means that > to generate the 000000010000000100000068.modblock, you will need > access to 000000010000000100000068 AND probably also > 000000010000000100000069 and in rare cases perhaps > 00000001000000010000006A or even later files. I think that's actually > pretty confusing. > > It seems better to me to give the files names like > ${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g. > 00000001.0000000168000058.00000001687DBBB8.modblock, so that you can > see exactly which *records* are covered by that segment. How would you choose the STARTLSN/ENDLSN? If you could do it per checkpoint, rather than per-WAL, I think that would be great. > And I suspect it may also be a good idea to bunch up the records from > several WAL files. Especially if you are using 16MB WAL files, > collecting all of the block references from a single WAL file is going > to produce a very small file. I suspect that the modified block files > will end up being 100x smaller than the WAL itself, perhaps more, and > I don't think anybody will appreciate us adding another PostgreSQL > systems that spews out huge numbers of tiny little files. If, for > example, somebody's got a big cluster that is churning out a WAL > segment every second, they would probably still be happy to have a new > modified block file only, say, every 10 seconds. Agreed. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Thu, Apr 18, 2019 at 3:51 PM Bruce Momjian <bruce@momjian.us> wrote: > How would you choose the STARTLSN/ENDLSN? If you could do it per > checkpoint, rather than per-WAL, I think that would be great. I thought of that too. It seems appealing, because you probably only really care whether a particular block was modified between one checkpoint and the next, not exactly when during that interval it was modified. However, the simple algorithm of "just stop when you get to a checkpoint record" does not work, because the checkpoint record itself points back to a much earlier LSN, and I think that it's that earlier LSN that is interesting. So if you want to make this work you have to be more clever, and I'm not sure I'm clever enough. I think it's important that a .modblock file not be too large, because then it will use too much memory, and that it not cover too much WAL, because then it will be too imprecise about when the blocks were modified. Perhaps we should have a threshold for each -- e.g. emit the next .modblock file after finding 2^20 distinct block references or scanning 1GB of WAL. Then individual files would probably be in the single-digit numbers of megabytes in size, assuming we do a decent job with the compression, and you never need to scan more than 1GB of WAL to regenerate one. If the starting point for a backup falls in the middle of such a file, and you include the whole file, at worst you have ~8GB of extra blocks to read, but in most cases less, because your writes probably have some locality and the file may not actually contain the full 2^20 block references. You could also make it more fine-grained than that if you don't mind having more smaller files floating around. It would definitely be better if we could set things up so that we could always switch to the next .modblock file when we cross a potential redo start point, but they're not noted in the WAL so I don't see how to do that. I don't know if it would be possible to insert some new kind of log record concurrently with fixing the redo location, so that redo always started at a record of this new type. That would certainly be helpful for this kind of thing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
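A sketch of the dual-threshold policy floated above (emit a new .modblock file after 2^20 distinct block references or 1GB of WAL, whichever comes first); the record iterator and the emit() callback are stand-ins, not real APIs:

    MAX_REFS = 1 << 20          # ~one million distinct block references
    MAX_WAL_BYTES = 1 << 30     # 1 GB of WAL scanned

    def summarize(wal_records, emit):
        """wal_records yields (start_lsn, end_lsn, block_refs) per decoded record."""
        refs, first_lsn, wal_seen, last_end = set(), None, 0, None
        for start_lsn, end_lsn, block_refs in wal_records:
            if first_lsn is None:
                first_lsn = start_lsn
            refs.update(block_refs)
            wal_seen += end_lsn - start_lsn
            last_end = end_lsn
            if len(refs) >= MAX_REFS or wal_seen >= MAX_WAL_BYTES:
                emit(first_lsn, last_end, sorted(refs))      # one .modblock file
                refs, first_lsn, wal_seen = set(), None, 0
        if refs:
            emit(first_lsn, last_end, sorted(refs))          # flush the final partial batch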
On Thu, Apr 18, 2019 at 04:25:24PM -0400, Robert Haas wrote:
> On Thu, Apr 18, 2019 at 3:51 PM Bruce Momjian <bruce@momjian.us> wrote:
> > How would you choose the STARTLSN/ENDLSN? If you could do it per
> > checkpoint, rather than per-WAL, I think that would be great.
>
> I thought of that too. It seems appealing, because you probably only
> really care whether a particular block was modified between one
> checkpoint and the next, not exactly when during that interval it was
> modified. However, the simple algorithm of "just stop when you get to
> a checkpoint record" does not work, because the checkpoint record
> itself points back to a much earlier LSN, and I think that it's that
> earlier LSN that is interesting. So if you want to make this work you
> have to be more clever, and I'm not sure I'm clever enough.

OK, so let's back up and study how this will be used. Someone wanting to make a useful incremental backup will need the changed blocks from the time of the start of the base backup. It is fine if they incrementally back up some blocks modified _before_ the base backup, but they need all blocks after, until some marker. They will obviously still use WAL to recover to a point after the incremental backup, so there is no need to get every modified block up to current, just up to some cut-off point where WAL can be discarded.

I can see a 1GB marker being used for that. It would prevent an incremental backup from being done until the first 1GB modblock file was written, since until then there is no record of modified blocks, but that seems fine. A 1GB marker would allow for consistent behavior independent of server restarts and base backups.

How would the modblock file record all the modified blocks across restarts and crashes? I assume that 1GB of WAL would not be available for scanning. I suppose that writing a modblock file to some PGDATA location when WAL is removed would work since during a crash the modblock file could be updated with the contents of the existing pg_wal files.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
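As an illustration of the selection step Bruce describes (which .modblock files a backup tool would read, and where it has to fall back to scanning pg_wal itself), with hypothetical helper names and file metadata:

    def summaries_for_backup(modblock_files, prior_backup_start_lsn):
        """modblock_files: list of (start_lsn, end_lsn, path), sorted by start_lsn.
        Returns the files whose range ends after the prior backup started, plus the
        newest summarized LSN; any WAL beyond that LSN must still be read directly."""
        needed = [f for f in modblock_files if f[1] > prior_backup_start_lsn]
        newest_summarized = max((f[1] for f in modblock_files), default=0)
        return needed, newest_summarized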
On 2019-04-18 17:47:56 -0400, Bruce Momjian wrote: > I can see a 1GB marker being used for that. It would prevent an > incremental backup from being done until the first 1G modblock files was > written, since until then there is no record of modified blocks, but > that seems fine. A 1G marker would allow for consistent behavior > independent of server restarts and base backups. That doesn't seem like a good idea - it'd make writing regression tests for this impractical.
On Thu, Apr 18, 2019 at 03:51:14PM -0400, Robert Haas wrote: > I was thinking that a dedicated background worker would be a good > option, but Stephen Frost seems concerned (over on the other thread) > about how much load that would generate. That never really occurred > to me as a serious issue and I suspect for many people it wouldn't be, > but there might be some. WAL segment size can go up to 1GB, and this does not depend on the compilation anymore. So scanning a very large segment is not going to be free. I think that the performance concerns of Stephen are legit as now we have on the WAL partitions sequential read and write patterns. > It's cool that you have a command-line tool that does this as well. > Over there, it was also discussed that we might want to have both a > command-line tool and a background worker. I think, though, that we > would want to get the output in some kind of compressed binary format, > rather than text. e.g. If you want to tweak it the way you want, please feel free to reuse it for any patch submitted to upstream. Reshaping or redirecting the data is not a big issue once the basics with the WAL reader are in place. -- Michael
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Apr 15, 2019 at 11:45 PM Michael Paquier <michael@paquier.xyz> wrote:
> > Any caller of XLogWrite() could switch to a new segment once the
> > current one is done, and I am not sure that we would want some random
> > backend to potentially slow down to do that kind of operation.
> >
> > Or would a separate background worker do this work by itself? An
> > external tool can do that easily already:
> > https://github.com/michaelpq/pg_plugins/tree/master/pg_wal_blocks
>
> I was thinking that a dedicated background worker would be a good
> option, but Stephen Frost seems concerned (over on the other thread)
> about how much load that would generate. That never really occurred
> to me as a serious issue and I suspect for many people it wouldn't be,
> but there might be some.

While I do think we should at least be thinking about the load caused from scanning the WAL to generate a list of blocks that are changed, the load I was more concerned with in the other thread is the effort required to actually merge all of those changes together over a large amount of WAL. I'm also not saying that we couldn't have either of those pieces done as a background worker, just that it'd be really nice to have an external tool (or library) that can be used on an independent system to do that work.

> It's cool that you have a command-line tool that does this as well.
> Over there, it was also discussed that we might want to have both a
> command-line tool and a background worker. I think, though, that we
> would want to get the output in some kind of compressed binary format,
> rather than text. e.g.
>
> 4-byte database OID
> 4-byte tablespace OID
> any number of relation OID/block OID pairings for that
> database/tablespace combination
> 4-byte zero to mark the end of the relation OID/block OID list
> and then repeat all of the above any number of times

I agree that we'd like to get the data in a binary format of some kind.

> That might be too dumb and I suspect we want some headers and a
> checksum, but we should try to somehow exploit the fact that there
> aren't likely to be many distinct databases or many distinct
> tablespaces mentioned -- whereas relation OID and block number will
> probably have a lot more entropy.

I'm not remembering exactly where this idea came from, but I don't believe it's my own (and I think there's some tool which already does this.. maybe it's rsync?), but I certainly don't think we want to repeat the relation OID for every block, and I don't think we really want to store a block number for every block. Instead, something like:

4-byte database OID
4-byte tablespace OID
relation OID
starting-ending block numbers
bitmap covering range of blocks
starting-ending block numbers
bitmap covering range of blocks
4-byte zero to mark the end of the relation
...
4-byte database OID
4-byte tablespace OID
relation OID
starting-ending block numbers
bitmap covering range of blocks
4-byte zero to mark the end of the relation
...

Only for relations which actually have changes though, of course. Haven't implemented it, so it's entirely possible there's reasons why it wouldn't work, but I do like the bitmap idea. I definitely think we need a checksum, as you mentioned.

Thanks!

Stephen
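A rough sketch of the range-plus-bitmap idea for one relation's changed blocks; the run-splitting threshold and the tuple framing are invented for illustration, not a proposed format:

    def ranges_to_bitmaps(block_numbers, gap=64):
        """Group sorted block numbers into (start, end, bitmap bytes) runs,
        starting a new run when the next block is more than `gap` blocks away."""
        blocks = sorted(set(block_numbers))
        runs, cur = [], [blocks[0]] if blocks else []
        for b in blocks[1:]:
            if b - cur[-1] > gap:
                runs.append(cur)
                cur = [b]
            else:
                cur.append(b)
        if cur:
            runs.append(cur)
        out = []
        for run in runs:
            start, end = run[0], run[-1]
            bits = bytearray((end - start) // 8 + 1)
            for b in run:
                bits[(b - start) // 8] |= 1 << ((b - start) % 8)
            out.append((start, end, bytes(bits)))
        return out

    print(ranges_to_bitmaps([10, 11, 13, 500]))   # two runs: blocks 10-13 and block 500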
On Thu, Apr 18, 2019 at 5:47 PM Bruce Momjian <bruce@momjian.us> wrote: > How would the modblock file record all the modified blocks across > restarts and crashes? I assume that 1G of WAL would not be available > for scanning. I suppose that writing a modblock file to some PGDATA > location when WAL is removed would work since during a crash the > modblock file could be updated with the contents of the existing pg_wal > files. I think you've got to prevent the WAL from being removed until a .modblock file has been written. In more detail, you should (a) scan all the WAL segments that will be summarized in the .modblock file, (b) write the file under a temporary name, (c) fsync it, (d) rename it into place, (e) fsync it again, and (f) then allow those WAL segments to be removed, if they are otherwise eligible to be removed. If 1GB of WAL is too much to keep around (which I doubt, except on systems that are so small and low-activity that they don't need incremental backups anyway), then you'd have to scan less WAL at once and write smaller .modblock files. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Apr 19, 2019 at 8:39 PM Stephen Frost <sfrost@snowman.net> wrote:
> While I do think we should at least be thinking about the load caused
> from scanning the WAL to generate a list of blocks that are changed, the
> load I was more concerned with in the other thread is the effort
> required to actually merge all of those changes together over a large
> amount of WAL. I'm also not saying that we couldn't have either of
> those pieces done as a background worker, just that it'd be really nice
> to have an external tool (or library) that can be used on an independent
> system to do that work.

Oh. Well, I already explained my algorithm for doing that upthread, which I believe would be quite cheap.

1. When you generate the .modblock files, stick all the block references into a buffer. qsort(). Dedup. Write out in sorted order.

2. When you want to use a bunch of .modblock files, do the same thing MergeAppend does, or what merge-sort does when it does a merge pass. Read the first 1MB of each file (or whatever amount). Repeatedly pull an item from whichever file has the lowest remaining value, using a binary heap. When no buffered data remains for a particular file, read another chunk from that file.

If each .modblock file covers 1GB of WAL, you could merge the data from across 1TB of WAL using only 1GB of memory, and that's assuming you have a 1MB buffer for each .modblock file. You probably don't need such a large buffer. If you use, say, a 128kB buffer, you could merge the data from across 8TB of WAL using 1GB of memory. And if you have 8TB of WAL and you can't spare 1GB for the task of computing which blocks need to be included in your incremental backup, it's time for a hardware upgrade.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
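For the record, the buffer arithmetic above works out exactly (a trivial check):

    GB, MB, KB = 1024**3, 1024**2, 1024
    files_1tb = 1024 * GB // GB                 # 1 TB of WAL, one file per 1 GB -> 1024 files
    files_8tb = 8 * 1024 * GB // GB             # 8 TB of WAL -> 8192 files
    print(files_1tb * MB // GB)                 # 1024 x 1 MB buffers   = 1 GB of memory
    print(files_8tb * 128 * KB // GB)           # 8192 x 128 kB buffers = 1 GB of memory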
On Thu, Apr 18, 2019 at 8:38 PM Michael Paquier <michael@paquier.xyz> wrote: > On Thu, Apr 18, 2019 at 03:51:14PM -0400, Robert Haas wrote: > > I was thinking that a dedicated background worker would be a good > > option, but Stephen Frost seems concerned (over on the other thread) > > about how much load that would generate. That never really occurred > > to me as a serious issue and I suspect for many people it wouldn't be, > > but there might be some. > > WAL segment size can go up to 1GB, and this does not depend on the > compilation anymore. So scanning a very large segment is not going to > be free. The segment size doesn't have much to do with it. If you make segments bigger, you'll have to scan fewer larger ones; if you make them smaller, you'll have more smaller ones. The only thing that really matters is the amount of I/O and CPU required, and that doesn't change very much as you vary the segment size. > I think that the performance concerns of Stephen are legit > as now we have on the WAL partitions sequential read and write > patterns. As to that, what I'm proposing here is no different than what we are already doing with physical and logical replication, except that it's probably a bit cheaper. Physical replication reads all the WAL and sends it all out over the network. Logical replication reads all the WAL, does a bunch of computation, and then sends the results, possibly filtered, out over the network. This would read the WAL and then write a relatively small file to your local disk. I think the impact will be about the same as having one additional standby, give or take. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Fri, Apr 19, 2019 at 8:39 PM Stephen Frost <sfrost@snowman.net> wrote: > > While I do think we should at least be thinking about the load caused > > from scanning the WAL to generate a list of blocks that are changed, the > > load I was more concerned with in the other thread is the effort > > required to actually merge all of those changes together over a large > > amount of WAL. I'm also not saying that we couldn't have either of > > those pieces done as a background worker, just that it'd be really nice > > to have an external tool (or library) that can be used on an independent > > system to do that work. > > Oh. Well, I already explained my algorithm for doing that upthread, > which I believe would be quite cheap. > > 1. When you generate the .modblock files, stick all the block > references into a buffer. qsort(). Dedup. Write out in sorted > order. Having all of the block references in a sorted order does seem like it would help, but would also make those potentially quite a bit larger than necessary (I had some thoughts about making them smaller elsewhere in this discussion). That might be worth it though. I suppose it might also be possible to line up the bitmaps suggested elsewhere to do essentially a BitmapOr of them to identify the blocks changed (while effectively de-duping at the same time). > 2. When you want to use a bunch of .modblock files, do the same thing > MergeAppend does, or what merge-sort does when it does a merge pass. > Read the first 1MB of each file (or whatever amount). Repeatedly pull > an item from whichever file has the lowest remaining value, using a > binary heap. When no buffered data remains for a particular file, > read another chunk from that file. Sure, this is essentially a MergeAppend/MergeSort+GroupAgg on top to get the distinct set, if I'm following what you're suggesting here. > If each .modblock file covers 1GB of WAL, you could the data from > across 1TB of WAL using only 1GB of memory, and that's assuming you > have a 1MB buffer for each .modblock file. You probably don't need > such a large buffer. If you use, say, a 128kB buffer, you could merge > the data from across 8TB of WAL using 1GB of memory. And if you have > 8TB of WAL and you can't spare 1GB for the task of computing which > blocks need to be included in your incremental backup, it's time for a > hardware upgrade. How much additional work is it going to be to sort/dedup for each 1GB of WAL though, along with the resulting size? I'm specifically thinking about some of the very high WAL-generation rate systems that I've seen, with TBs of WAL per day. I get that once you've got nicely sorted inputs that it isn't too bad to generate a distinct set from that, but it seems like that moves the problem to the sorting side then rather than eliminating it, and may make other things less efficient, particularly when there's workloads that have strings of modified buffers together and we could capture that more efficiently. Thanks, Stephen
On Sat, Apr 20, 2019 at 12:09:42AM -0400, Robert Haas wrote: > On Thu, Apr 18, 2019 at 5:47 PM Bruce Momjian <bruce@momjian.us> wrote: > > How would the modblock file record all the modified blocks across > > restarts and crashes? I assume that 1G of WAL would not be available > > for scanning. I suppose that writing a modblock file to some PGDATA > > location when WAL is removed would work since during a crash the > > modblock file could be updated with the contents of the existing pg_wal > > files. > > I think you've got to prevent the WAL from being removed until a > .modblock file has been written. In more detail, you should (a) scan > all the WAL segments that will be summarized in the .modblock file, > (b) write the file under a temporary name, (c) fsync it, (d) rename it > into place, (e) fsync it again, and (f) then allow those WAL segments > to be removed, if they are otherwise eligible to be removed. Makes sense. So when you are about to remove WAL, you create the .modblock files for all complete WAL files and only create a new one when you are about to remove a WAL that was not in a previous .modblock file. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Sat, Apr 20, 2019 at 12:21:36AM -0400, Robert Haas wrote: > As to that, what I'm proposing here is no different than what we are > already doing with physical and logical replication, except that it's > probably a bit cheaper. Physical replication reads all the WAL and > sends it all out over the network. Logical replication reads all the > WAL, does a bunch of computation, and then sends the results, possibly > filtered, out over the network. This would read the WAL and then > write a relatively small file to your local disk. > > I think the impact will be about the same as having one additional > standby, give or take. Good point. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Sat, Apr 20, 2019 at 9:18 AM Bruce Momjian <bruce@momjian.us> wrote: > > I think you've got to prevent the WAL from being removed until a > > .modblock file has been written. In more detail, you should (a) scan > > all the WAL segments that will be summarized in the .modblock file, > > (b) write the file under a temporary name, (c) fsync it, (d) rename it > > into place, (e) fsync it again, and (f) then allow those WAL segments > > to be removed, if they are otherwise eligible to be removed. > > Makes sense. So when you are about to remove WAL, you create the > .modblock files for all complete WAL files and only create a new one > when you are about to remove a WAL that was not in a previous .modblock > file. There will often be a partial WAL record at the end of each file. So if you make a .modblock file for WAL files 1-10, you can probably remove files 1-9, but you probably have to keep WAL file 10 around until you generate the NEXT .modblock file, because otherwise you wouldn't be able to read and parse the WAL record that spans the end of file 10 and the beginning of file 11. This is a detail that is very very very important to get right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
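To illustrate that retention detail, here is a small sketch of which segments remain removable once records up to a given end LSN have been summarized: the segment containing that LSN must be kept, because the next record to summarize may begin there and spill into the following segment. The geometry assumes the default 16MB segments on timeline 1, standard WAL file naming, and hypothetical helper names:

    WAL_SEG_SIZE = 16 * 1024 * 1024

    def wal_file_name(tli, lsn, seg_size=WAL_SEG_SIZE):
        segno = lsn // seg_size
        segs_per_id = 0x100000000 // seg_size          # 256 segments per "xlogid" at 16 MB
        return "%08X%08X%08X" % (tli, segno // segs_per_id, segno % segs_per_id)

    def removable_segnos(existing_segnos, summarized_end_lsn, seg_size=WAL_SEG_SIZE):
        keep_from = summarized_end_lsn // seg_size     # the segment holding end_lsn must stay
        return [s for s in existing_segnos if s < keep_from]

    print(wal_file_name(1, 0x168000058))   # -> 000000010000000100000068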
On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote: > > Oh. Well, I already explained my algorithm for doing that upthread, > > which I believe would be quite cheap. > > > > 1. When you generate the .modblock files, stick all the block > > references into a buffer. qsort(). Dedup. Write out in sorted > > order. > > Having all of the block references in a sorted order does seem like it > would help, but would also make those potentially quite a bit larger > than necessary (I had some thoughts about making them smaller elsewhere > in this discussion). That might be worth it though. I suppose it might > also be possible to line up the bitmaps suggested elsewhere to do > essentially a BitmapOr of them to identify the blocks changed (while > effectively de-duping at the same time). I don't see why this would make them bigger than necessary. If you sort by relfilenode/fork/blocknumber and dedup, then references to nearby blocks will be adjacent in the file. You can then decide what format will represent that most efficiently on output. Whether or not a bitmap is better idea than a list of block numbers or something else depends on what percentage of blocks are modified and how clustered they are. > > 2. When you want to use a bunch of .modblock files, do the same thing > > MergeAppend does, or what merge-sort does when it does a merge pass. > > Read the first 1MB of each file (or whatever amount). Repeatedly pull > > an item from whichever file has the lowest remaining value, using a > > binary heap. When no buffered data remains for a particular file, > > read another chunk from that file. > > Sure, this is essentially a MergeAppend/MergeSort+GroupAgg on top to get > the distinct set, if I'm following what you're suggesting here. Yeah, something like that. > > If each .modblock file covers 1GB of WAL, you could the data from > > across 1TB of WAL using only 1GB of memory, and that's assuming you > > have a 1MB buffer for each .modblock file. You probably don't need > > such a large buffer. If you use, say, a 128kB buffer, you could merge > > the data from across 8TB of WAL using 1GB of memory. And if you have > > 8TB of WAL and you can't spare 1GB for the task of computing which > > blocks need to be included in your incremental backup, it's time for a > > hardware upgrade. > > How much additional work is it going to be to sort/dedup for each 1GB of > WAL though, along with the resulting size? Well all you have to do is quicksort an array with a million or so elements. I don't know off-hand how many CPU cycles that takes but I doubt it's a whole lot. And for the amount of CPU time and memory that it saves you when you actually go to use the files, I think it's got to be worth it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Apr 20, 2019 at 04:17:08PM -0400, Robert Haas wrote: > On Sat, Apr 20, 2019 at 9:18 AM Bruce Momjian <bruce@momjian.us> wrote: > > > I think you've got to prevent the WAL from being removed until a > > > .modblock file has been written. In more detail, you should (a) scan > > > all the WAL segments that will be summarized in the .modblock file, > > > (b) write the file under a temporary name, (c) fsync it, (d) rename it > > > into place, (e) fsync it again, and (f) then allow those WAL segments > > > to be removed, if they are otherwise eligible to be removed. > > > > Makes sense. So when you are about to remove WAL, you create the > > .modblock files for all complete WAL files and only create a new one > > when you are about to remove a WAL that was not in a previous .modblock > > file. > > There will often be a partial WAL record at the end of each file. So > if you make a .modblock file for WAL files 1-10, you can probably > remove files 1-9, but you probably have to keep WAL file 10 around > until you generate the NEXT .modblock file, because otherwise you > wouldn't be able to read and parse the WAL record that spans the end > of file 10 and the beginning of file 11. > > This is a detail that is very very very important to get right. Good point. You mentioned: It seems better to me to give the files names like ${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g. 00000001.0000000168000058.00000001687DBBB8.modblock, so that you can see exactly which *records* are covered by that segment. but it seems like it should be ${TLI}.${ENDLSN}... (END first) because you would not want to delete the modblock file until you are about to delete the final WAL, not the first WAL, but as you mentioned, it might be ENDLSN-1. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
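Purely for illustration, a tiny C sketch that builds a name of the ${TLI}.${STARTLSN}.${ENDLSN}.modblock shape quoted above; the field widths are inferred from the example name and are an assumption, not a settled format.

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Build a summary file name of the form discussed above, e.g.
     * 00000001.0000000168000058.00000001687DBBB8.modblock.  The widths are
     * inferred from that example and are an assumption.
     */
    static void
    modblock_name(char *buf, size_t buflen, uint32_t tli,
                  uint64_t start_lsn, uint64_t end_lsn)
    {
        snprintf(buf, buflen, "%08X.%016llX.%016llX.modblock",
                 (unsigned) tli,
                 (unsigned long long) start_lsn,
                 (unsigned long long) end_lsn);
    }

    int
    main(void)
    {
        char        name[64];

        modblock_name(name, sizeof(name), 1,
                      0x0000000168000058ULL, 0x00000001687DBBB8ULL);
        puts(name);     /* 00000001.0000000168000058.00000001687DBBB8.modblock */
        return 0;
    }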
On Sat, Apr 20, 2019 at 5:54 PM Bruce Momjian <bruce@momjian.us> wrote: > Good point. You mentioned: > > It seems better to me to give the files names like > ${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g. > 00000001.0000000168000058.00000001687DBBB8.modblock, so that you can > see exactly which *records* are covered by that segment. > > but it seems like it should be ${TLI}.${ENDLSN}... (END first) because > you would not want to delete the modblock file until you are about to > delete the final WAL, not the first WAL, but as you mentioned, it might > be ENDLSN-1. Hmm. It seems to me that it is almost universally the convention to put the starting point prior to the ending point. If you are taking a biology class, the teacher will not tell you to study chapters six through three. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Apr 20, 2019 at 12:21:36AM -0400, Robert Haas wrote: > The segment size doesn't have much to do with it. If you make > segments bigger, you'll have to scan fewer larger ones; if you make > them smaller, you'll have more smaller ones. The only thing that > really matters is the amount of I/O and CPU required, and that doesn't > change very much as you vary the segment size. If you create the extra file when a segment is finished and we switch to a new one, then the extra work would happen for a random backend, and it is going to be more costly to scan a 1GB segment than a 16MB segment as a one-time operation, and less backends would see a slowdown at equal WAL data generated. From what I can see, you are not planning to do such operations when a segment finishes being written, which would be much better. > As to that, what I'm proposing here is no different than what we are > already doing with physical and logical replication, except that it's > probably a bit cheaper. Physical replication reads all the WAL and > sends it all out over the network. Logical replication reads all the > WAL, does a bunch of computation, and then sends the results, possibly > filtered, out over the network. This would read the WAL and then > write a relatively small file to your local disk. > > I think the impact will be about the same as having one additional > standby, give or take. If you put the load on an extra process, yeah I don't think that it would be noticeable. -- Michael
On Sun, Apr 21, 2019 at 06:24:50PM -0400, Robert Haas wrote: > On Sat, Apr 20, 2019 at 5:54 PM Bruce Momjian <bruce@momjian.us> wrote: > > Good point. You mentioned: > > > > It seems better to me to give the files names like > > ${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g. > > 00000001.0000000168000058.00000001687DBBB8.modblock, so that you can > > see exactly which *records* are covered by that segment. > > > > but it seems like it should be ${TLI}.${ENDLSN}... (END first) because > > you would not want to delete the modblock file until you are about to > > delete the final WAL, not the first WAL, but as you mentioned, it might > > be ENDLSN-1. > > Hmm. It seems to me that it is almost universally the convention to > put the starting point prior to the ending point. If you are taking a > biology class, the teacher will not tell you to study chapters six > through three. My point is that most WAL archive tools will order and remove files based on their lexical ordering, so if you put the start first, the file will normally be removed when it should be kept, e.g., if you have WAL files like:
    000000010000000000000001
    000000010000000000000002
    000000010000000000000003
    000000010000000000000004
    000000010000000000000005
putting the start first and archiving some WAL would lead to:
    000000010000000000000001-000000010000000000000004.modblock
    000000010000000000000003
    000000010000000000000004
    000000010000000000000005
We removed 1 and 2, but kept the modblock file, which looks out of order. Having the end at the start would have:
    000000010000000000000003
    000000010000000000000004
    000000010000000000000004-000000010000000000000001.modblock
    000000010000000000000005
My point is that you would normally only remove the modblock file when 4 is removed because this modblock file is useful for incremental backups from base backups that happened between 1 and 4. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Mon, Apr 22, 2019 at 11:20:43AM +0900, Michael Paquier wrote: > On Sat, Apr 20, 2019 at 12:21:36AM -0400, Robert Haas wrote: > > The segment size doesn't have much to do with it. If you make > > segments bigger, you'll have to scan fewer larger ones; if you make > > them smaller, you'll have more smaller ones. The only thing that > > really matters is the amount of I/O and CPU required, and that doesn't > > change very much as you vary the segment size. > > If you create the extra file when a segment is finished and we switch > to a new one, then the extra work would happen for a random backend, > and it is going to be more costly to scan a 1GB segment than a 16MB > segment as a one-time operation, and less backends would see a > slowdown at equal WAL data generated. From what I can see, you are > not planning to do such operations when a segment finishes being > written, which would be much better. I think your point is that the 16MB is more likely to be in memory, while the 1GB is less likely. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Sun, Apr 21, 2019 at 10:21 PM Michael Paquier <michael@paquier.xyz> wrote: > If you create the extra file when a segment is finished and we switch > to a new one, then the extra work would happen for a random backend, > and it is going to be more costly to scan a 1GB segment than a 16MB > segment as a one-time operation, and less backends would see a > slowdown at equal WAL data generated. From what I can see, you are > not planning to do such operations when a segment finishes being > written, which would be much better. Well, my plan was to do it all from a background worker, so I do not think a random backend would ever have to do any extra work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 22, 2019 at 11:48 AM Bruce Momjian <bruce@momjian.us> wrote: > My point is that most WAL archive tools will order and remove files > based on their lexical ordering, so if you put the start first, the file > will normally be removed when it should be kept, e.g., if you have WAL > files like: > > 000000010000000000000001 > 000000010000000000000002 > 000000010000000000000003 > 000000010000000000000004 > 000000010000000000000005 > > putting the start first and archiving some wal would lead to: > > 000000010000000000000001-000000010000000000000004.modblock > 000000010000000000000003 > 000000010000000000000004 > 000000010000000000000005 > > We removed 1 and 2, but kept the modblock file, which looks out of > order. Having the end at the start would have: > > 000000010000000000000003 > 000000010000000000000004 > 000000010000000000000004-000000010000000000000001.modblock > 000000010000000000000005 > > My point is that you would normally only remove the modblock file when 4 > is removed because this modblock files is useful for incremental backups > from base backups that happened between 1 and 4. That's an interesting point. On the other hand, I think it would be typical to want the master to retain .modblock files for much longer than it retains WAL segments, and in my design, the WAL archive wouldn't see those files at all; they'd be stored on the master. I was actually thinking that they should possibly be stored in a separate directory to avoid confusion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 22, 2019 at 12:15:32PM -0400, Robert Haas wrote: > On Mon, Apr 22, 2019 at 11:48 AM Bruce Momjian <bruce@momjian.us> wrote: > > My point is that you would normally only remove the modblock file when 4 > > is removed because this modblock files is useful for incremental backups > > from base backups that happened between 1 and 4. > > That's an interesting point. On the other hand, I think it would be > typical to want the master to retain .modblock files for much longer > than it retains WAL segments, and in my design, the WAL archive > wouldn't see those files at all; they'd be stored on the master. I > was actually thinking that they should possibly be stored in a > separate directory to avoid confusion. I assumed the modblock files would be stored in the WAL archive so some external tools could generate incremental backups using just the WAL files. I assumed they would also be sent to standby servers so incremental backups could be done on standby servers too. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Mon, Apr 22, 2019 at 12:35 PM Bruce Momjian <bruce@momjian.us> wrote: > I assumed the modblock files would be stored in the WAL archive so some > external tools could generate incremental backups using just the WAL > files. I assumed they would also be sent to standby servers so > incremental backups could be done on standby servers too. Yeah, that's another possible approach. I am not sure what is best. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 22, 2019 at 01:11:22PM -0400, Robert Haas wrote: > On Mon, Apr 22, 2019 at 12:35 PM Bruce Momjian <bruce@momjian.us> wrote: > > I assumed the modblock files would be stored in the WAL archive so some > > external tools could generate incremental backups using just the WAL > > files. I assumed they would also be sent to standby servers so > > incremental backups could be done on standby servers too. > > Yeah, that's another possible approach. I am not sure what is best. I am thinking you need to allow any of these, and putting the WAL files in pg_wal and having them streamed and archived gives that flexibility. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Thu, Apr 18, 2019 at 04:25:24PM -0400, Robert Haas wrote: >On Thu, Apr 18, 2019 at 3:51 PM Bruce Momjian <bruce@momjian.us> wrote: >> How would you choose the STARTLSN/ENDLSN? If you could do it per >> checkpoint, rather than per-WAL, I think that would be great. > >I thought of that too. It seems appealing, because you probably only >really care whether a particular block was modified between one >checkpoint and the next, not exactly when during that interval it was >modified. That's probably true for incremental backups, but there are other use cases that could leverage this information. Some time ago there was a discussion about prefetching blocks during recovery on a standby, and that's a great example of a use case that benefits from this - look at which blocks were modified in the next chunk of WAL, prefetch them. But that requires fairly detailed information about which blocks were modified in the next few megabytes of WAL. So just doing it once per checkpoint (or even anything above a single WAL segment) and removing all the detailed LSN information makes it useless for this use case. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 22, 2019 at 01:15:49PM -0400, Bruce Momjian wrote: >On Mon, Apr 22, 2019 at 01:11:22PM -0400, Robert Haas wrote: >> On Mon, Apr 22, 2019 at 12:35 PM Bruce Momjian <bruce@momjian.us> wrote: >> > I assumed the modblock files would be stored in the WAL archive so some >> > external tools could generate incremental backups using just the WAL >> > files. I assumed they would also be sent to standby servers so >> > incremental backups could be done on standby servers too. >> >> Yeah, that's another possible approach. I am not sure what is best. > >I am thinking you need to allow any of these, and putting the WAL files >in pg_wal and having them streamed and archived gives that flexibility. > I agree - this would be quite useful for the prefetching use case I've already mentioned in my previous message. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote: >On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote: >> > Oh. Well, I already explained my algorithm for doing that upthread, >> > which I believe would be quite cheap. >> > >> > 1. When you generate the .modblock files, stick all the block >> > references into a buffer. qsort(). Dedup. Write out in sorted >> > order. >> >> Having all of the block references in a sorted order does seem like it >> would help, but would also make those potentially quite a bit larger >> than necessary (I had some thoughts about making them smaller elsewhere >> in this discussion). That might be worth it though. I suppose it might >> also be possible to line up the bitmaps suggested elsewhere to do >> essentially a BitmapOr of them to identify the blocks changed (while >> effectively de-duping at the same time). > >I don't see why this would make them bigger than necessary. If you >sort by relfilenode/fork/blocknumber and dedup, then references to >nearby blocks will be adjacent in the file. You can then decide what >format will represent that most efficiently on output. Whether or not >a bitmap is better idea than a list of block numbers or something else >depends on what percentage of blocks are modified and how clustered >they are. > Not sure I understand correctly - do you suggest to deduplicate and sort the data before writing them into the .modblock files? Because the sorting would make this information mostly useless for the recovery prefetching use case I mentioned elsewhere. For that to work we need information about both the LSN and block, in the LSN order. So if we want to allow that use case to leverage this infrastructure, we need to write the .modfiles kinda "raw" and do this processing in some later step. Now, maybe the incremental backup use case is so much more important that the right thing to do is ignore this other use case, and I'm OK with that - as long as it's a conscious choice. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
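A small sketch to make that trade-off concrete, using two made-up entry layouts (neither is a proposed on-disk format): entries written "raw" in WAL order keep the LSN of every reference, which is what replay prefetching wants, while sorted and deduplicated entries drop the LSN and typically end up both fewer and smaller, which is all incremental backup needs.

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Two hypothetical entry layouts for a block-reference file; neither is
     * an actual proposed format, they just make the trade-off concrete.
     *
     * "Raw" entries keep the LSN of each reference and are written in WAL
     * order, so a standby could prefetch blocks shortly before replay
     * reaches them.
     */
    typedef struct RawEntry
    {
        uint64_t    lsn;            /* where in the WAL the block was touched */
        uint32_t    relfilenode;
        uint8_t     forknum;
        uint32_t    blocknum;
    } RawEntry;

    /*
     * Sorted, deduplicated entries drop the LSN: enough to know *whether* a
     * block changed somewhere in the summarized range, which is all that
     * incremental backup needs, and usually far fewer entries survive the
     * deduplication.
     */
    typedef struct DedupedEntry
    {
        uint32_t    relfilenode;
        uint8_t     forknum;
        uint32_t    blocknum;
    } DedupedEntry;

    int
    main(void)
    {
        /* Sizes include struct padding; the relative difference is the point. */
        printf("raw entry: %zu bytes, deduped entry: %zu bytes\n",
               sizeof(RawEntry), sizeof(DedupedEntry));
        return 0;
    }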
On Tue, Apr 23, 2019 at 01:21:27AM +0200, Tomas Vondra wrote: > On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote: > > On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote: > > > > Oh. Well, I already explained my algorithm for doing that upthread, > > > > which I believe would be quite cheap. > > > > > > > > 1. When you generate the .modblock files, stick all the block > > > > references into a buffer. qsort(). Dedup. Write out in sorted > > > > order. > > > > > > Having all of the block references in a sorted order does seem like it > > > would help, but would also make those potentially quite a bit larger > > > than necessary (I had some thoughts about making them smaller elsewhere > > > in this discussion). That might be worth it though. I suppose it might > > > also be possible to line up the bitmaps suggested elsewhere to do > > > essentially a BitmapOr of them to identify the blocks changed (while > > > effectively de-duping at the same time). > > > > I don't see why this would make them bigger than necessary. If you > > sort by relfilenode/fork/blocknumber and dedup, then references to > > nearby blocks will be adjacent in the file. You can then decide what > > format will represent that most efficiently on output. Whether or not > > a bitmap is better idea than a list of block numbers or something else > > depends on what percentage of blocks are modified and how clustered > > they are. > > > > Not sure I understand correctly - do you suggest to deduplicate and sort > the data before writing them into the .modblock files? Because that the > the sorting would make this information mostly useless for the recovery > prefetching use case I mentioned elsewhere. For that to work we need > information about both the LSN and block, in the LSN order. > > So if we want to allow that use case to leverage this infrastructure, we > need to write the .modfiles kinda "raw" and do this processing in some > later step. > > Now, maybe the incremental backup use case is so much more important the > right thing to do is ignore this other use case, and I'm OK with that - > as long as it's a conscious choice. I think the concern is that the more graunular the modblock files are (with less de-duping), the larger they will be. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Mon, Apr 22, 2019 at 07:44:45PM -0400, Bruce Momjian wrote: >On Tue, Apr 23, 2019 at 01:21:27AM +0200, Tomas Vondra wrote: >> On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote: >> > On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote: >> > > > Oh. Well, I already explained my algorithm for doing that upthread, >> > > > which I believe would be quite cheap. >> > > > >> > > > 1. When you generate the .modblock files, stick all the block >> > > > references into a buffer. qsort(). Dedup. Write out in sorted >> > > > order. >> > > >> > > Having all of the block references in a sorted order does seem like it >> > > would help, but would also make those potentially quite a bit larger >> > > than necessary (I had some thoughts about making them smaller elsewhere >> > > in this discussion). That might be worth it though. I suppose it might >> > > also be possible to line up the bitmaps suggested elsewhere to do >> > > essentially a BitmapOr of them to identify the blocks changed (while >> > > effectively de-duping at the same time). >> > >> > I don't see why this would make them bigger than necessary. If you >> > sort by relfilenode/fork/blocknumber and dedup, then references to >> > nearby blocks will be adjacent in the file. You can then decide what >> > format will represent that most efficiently on output. Whether or not >> > a bitmap is better idea than a list of block numbers or something else >> > depends on what percentage of blocks are modified and how clustered >> > they are. >> > >> >> Not sure I understand correctly - do you suggest to deduplicate and sort >> the data before writing them into the .modblock files? Because that the >> the sorting would make this information mostly useless for the recovery >> prefetching use case I mentioned elsewhere. For that to work we need >> information about both the LSN and block, in the LSN order. >> >> So if we want to allow that use case to leverage this infrastructure, we >> need to write the .modfiles kinda "raw" and do this processing in some >> later step. >> >> Now, maybe the incremental backup use case is so much more important the >> right thing to do is ignore this other use case, and I'm OK with that - >> as long as it's a conscious choice. > >I think the concern is that the more graunular the modblock files are >(with less de-duping), the larger they will be. > Well, I understand that concern - all I'm saying is that makes this useless for some use cases (that may or may not be important enough). However, it seems to me those files are guaranteed to be much smaller than the WAL segments, so I don't see how size alone could be an issue as long as we do the merging and deduplication when recycling the segments. At that point the standby can't request the WAL from the primary anyway, so it won't need the raw .mdblock files either. And we probably only care about the size of the data we need to keep for a long time. And that we can deduplicate/reorder any way we want. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 23, 2019 at 02:13:29AM +0200, Tomas Vondra wrote: > Well, I understand that concern - all I'm saying is that makes this > useless for some use cases (that may or may not be important enough). > > However, it seems to me those files are guaranteed to be much smaller > than the WAL segments, so I don't see how size alone could be an issue > as long as we do the merging and deduplication when recycling the > segments. At that point the standby can't request the WAL from the > primary anyway, so it won't need the raw .mdblock files either. > > And we probably only care about the size of the data we need to keep for > a long time. And that we can deduplicate/reorder any way we want. Well, the interesting question is whether the server will generate a single modblock file for all WAL in pg_wal only right before we are ready to expire some WAL, or whether modblock files will be generated offline, perhaps independent of the server, and perhaps by aggregating smaller modblock files. To throw out an idea, what if we had an executable that could generate a modblock file by scanning a set of WAL files? How far would that take us to meeting incremental backup needs? I can imagine db/relfilenode oid volatility could be a problem, but might be fixable. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Mon, Apr 22, 2019 at 08:52:11PM -0400, Bruce Momjian wrote: > Well, the interesting question is whether the server will generate a > single modblock file for all WAL in pg_wal only right before we are > ready to expire some WAL, or whether modblock files will be generated > offline, perhaps independent of the server, and perhaps by aggregating > smaller modblock files. > > To throw out an idea, what if we had an executable that could generate a > modblock file by scanning a set of WAL files? How far would that take > us to meeting incremental backup needs? I can imagine db/relfilenode oid > volatility could be a problem, but might be fixable. Well, this actually brings up a bunch of questions:
* How often do we create blockmod files? Per segment, per checkpoint, at WAL deletion time (1GB?)
* What is the blockmod file format? dboid, relfilenode, blocknum? Use compression? Sorted?
* How do we create incremental backups?
* What is the incremental backup file format?
* How do we apply incremental backups to base backups?
And there are some secondary questions:
* Can blockmod files be merged?
* Can incremental backups be merged?
* Can blockmod files be used for restore prefetching?
-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Mon, Apr 22, 2019 at 7:04 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Some time ago there was a discussion about prefetching blocks during > recovery on a standby, and that's a great example of a use case that > benefit from this - look which blocks where modified in the next chunk > of WAL, prefetch them. But that requires fairly detailed information > about which blocks were modified in the next few megabytes of WAL. > > So just doing it once per checkpoint (or even anything above a single > WAL segment) and removing all the detailed LSN location makes it useless > for this use case. For this particular use case, wouldn't you want to read the WAL itself and use that to issue prefetch requests? Because if you use the .modblock files, the data file blocks will end up in memory but the WAL blocks won't, and you'll still be waiting for I/O. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote: > >On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote: > >>> Oh. Well, I already explained my algorithm for doing that upthread, > >>> which I believe would be quite cheap. > >>> > >>> 1. When you generate the .modblock files, stick all the block > >>> references into a buffer. qsort(). Dedup. Write out in sorted > >>> order. > >> > >>Having all of the block references in a sorted order does seem like it > >>would help, but would also make those potentially quite a bit larger > >>than necessary (I had some thoughts about making them smaller elsewhere > >>in this discussion). That might be worth it though. I suppose it might > >>also be possible to line up the bitmaps suggested elsewhere to do > >>essentially a BitmapOr of them to identify the blocks changed (while > >>effectively de-duping at the same time). > > > >I don't see why this would make them bigger than necessary. If you > >sort by relfilenode/fork/blocknumber and dedup, then references to > >nearby blocks will be adjacent in the file. You can then decide what > >format will represent that most efficiently on output. Whether or not > >a bitmap is better idea than a list of block numbers or something else > >depends on what percentage of blocks are modified and how clustered > >they are. > > Not sure I understand correctly - do you suggest to deduplicate and sort > the data before writing them into the .modblock files? Because that the > the sorting would make this information mostly useless for the recovery > prefetching use case I mentioned elsewhere. For that to work we need > information about both the LSN and block, in the LSN order. I'm not sure I follow- why does the prefetching need to get the blocks in LSN order..? Once the blocks that we know are going to change in the next segment have been identified, we could prefetch them all and have them ready for when replay gets to them. I'm not sure that we specifically need to have them pre-fetched in the same order that the replay happens and it might even be better to fetch them in an order that's as sequential as possible to get them in as quickly as possible. > So if we want to allow that use case to leverage this infrastructure, we > need to write the .modfiles kinda "raw" and do this processing in some > later step. If we really need the LSN info for the blocks, then we could still de-dup, picking the 'first modified in this segment at LSN X', or keep both first and last, or I suppose every LSN if we really want, and then have that information included with the other information about the block. Downstream clients could then sort based on the LSN info if they want to have a list of blocks in sorted-by-LSN-order. > Now, maybe the incremental backup use case is so much more important the > right thing to do is ignore this other use case, and I'm OK with that - > as long as it's a conscious choice. I'd certainly like to have a way to prefetch, but I'm not entirely sure that it makes sense to combine it with this, so while I sketched out some ideas about how to do that above, I don't want it to come across as being a strong endorsement of the overall idea. 
For pre-fetching purposes, for an async streaming replica, it seems like the wal sender process could potentially just scan the WAL and have a list of blocks ready to pass to the replica which are "this is what's coming soon" or similar, rather than working with the modfiles at all. Not sure if we'd always send that or if we wait for the replica to ask for it. Though for doing WAL replay from the archive, being able to ask for the modfile first to do prefetching before replaying the WAL itself could certainly be beneficial, so maybe it does make sense to have that information there too.. still not sure we really need it in LSN order or that we need to prefetch in LSN order though. Thanks! Stephen
On Tue, Apr 23, 2019 at 10:22:46AM -0400, Stephen Frost wrote: >Greetings, > >* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: >> On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote: >> >On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote: >> >>> Oh. Well, I already explained my algorithm for doing that upthread, >> >>> which I believe would be quite cheap. >> >>> >> >>> 1. When you generate the .modblock files, stick all the block >> >>> references into a buffer. qsort(). Dedup. Write out in sorted >> >>> order. >> >> >> >>Having all of the block references in a sorted order does seem like it >> >>would help, but would also make those potentially quite a bit larger >> >>than necessary (I had some thoughts about making them smaller elsewhere >> >>in this discussion). That might be worth it though. I suppose it might >> >>also be possible to line up the bitmaps suggested elsewhere to do >> >>essentially a BitmapOr of them to identify the blocks changed (while >> >>effectively de-duping at the same time). >> > >> >I don't see why this would make them bigger than necessary. If you >> >sort by relfilenode/fork/blocknumber and dedup, then references to >> >nearby blocks will be adjacent in the file. You can then decide what >> >format will represent that most efficiently on output. Whether or not >> >a bitmap is better idea than a list of block numbers or something else >> >depends on what percentage of blocks are modified and how clustered >> >they are. >> >> Not sure I understand correctly - do you suggest to deduplicate and sort >> the data before writing them into the .modblock files? Because that the >> the sorting would make this information mostly useless for the recovery >> prefetching use case I mentioned elsewhere. For that to work we need >> information about both the LSN and block, in the LSN order. > >I'm not sure I follow- why does the prefetching need to get the blocks >in LSN order..? Once the blocks that we know are going to change in the >next segment have been identified, we could prefetch them all and have >them ready for when replay gets to them. I'm not sure that we >specifically need to have them pre-fetched in the same order that the >replay happens and it might even be better to fetch them in an order >that's as sequential as possible to get them in as quickly as possible. > That means we'd have to prefetch all blocks for the whole WAL segment, which is pretty useless, IMO. A single INSERT (especially for indexes) is often just ~100B, so a single 16MB segment can fit ~160k of them. Surely we don't want to prefetch all of that at once? And it's even worse for larger WAL segments, which are likely to get more common now that it's an initdb option. I'm pretty sure the prefetching needs to be more like "prefetch the next 1024 blocks we'll need" or "prefetch blocks from the next X megabytes of WAL". That doesn't mean we can't do some additional optimization (like reordering them a bit), but there's a point where it gets detrimental because the kernel will just evict some of the prefetched blocks before we actually access them. And the larger amount of blocks you prefetch the more likely that is. >> So if we want to allow that use case to leverage this infrastructure, we >> need to write the .modfiles kinda "raw" and do this processing in some >> later step. 
> >If we really need the LSN info for the blocks, then we could still >de-dup, picking the 'first modified in this segment at LSN X', or keep >both first and last, or I suppose every LSN if we really want, and then >have that information included with the other information about the >block. Downstream clients could then sort based on the LSN info if they >want to have a list of blocks in sorted-by-LSN-order. > Possibly. I don't think keeping just the first block occurrence is enough, particularly for large WAL segment sizes, but I agree we can reduce the stuff a bit (say, ignore references that are less than 1MB apart or so). We just can't remove all the LSN information entirely. >> Now, maybe the incremental backup use case is so much more important the >> right thing to do is ignore this other use case, and I'm OK with that - >> as long as it's a conscious choice. >I'd certainly like to have a way to prefetch, but I'm not entirely sure >that it makes sense to combine it with this, so while I sketched out >some ideas about how to do that above, I don't want it to come across as >being a strong endorsement of the overall idea. > >For pre-fetching purposes, for an async streaming replica, it seems like >the wal sender process could potentially just scan the WAL and have a >list of blocks ready to pass to the replica which are "this is what's >coming soon" or similar, rather than working with the modfiles at all. >Not sure if we'd always send that or if we wait for the replica to ask >for it. Though for doing WAL replay from the archive, being able to ask >for the modfile first to do prefetching before replaying the WAL itself >could certainly be beneficial, so maybe it does make sense to have that >information there too.. still not sure we really need it in LSN order >or that we need to prefetch in LSN order though. > Well, how exactly the prefetching should extract the block list is still an open question. But this seems to deal with pretty much the same stuff, so it might make sense to support the prefetching use case too. Or maybe not, I'm not sure - but IMHO we should try. I don't think it makes sense to do prefetching when the standby is fully caught up with the primary. It gets more important when it falls behind by a significant amount of WAL, at which point it's likely to fetch already closed WAL segments - either from primary or archive. So either it can get the block list from there, or it may extract the info on its own (using some of the infrastructure used to build the mdblock files). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
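A minimal sketch of the bounded-lookahead idea ("prefetch the next N blocks we'll need"), using only standard POSIX calls; the PrefetchItem queue, the depth constant, and the stand-in file opened in main() are assumptions for illustration, not part of any actual design.

    #define _POSIX_C_SOURCE 200112L

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    #define BLCKSZ          8192
    #define PREFETCH_DEPTH  1024    /* cap on outstanding prefetch hints */

    /* Hypothetical queue item: a data-file fd and block number that replay
     * will need soon, taken from upcoming WAL in LSN order. */
    typedef struct PrefetchItem
    {
        int         fd;
        uint32_t    blocknum;
    } PrefetchItem;

    /*
     * Issue kernel read-ahead hints for at most PREFETCH_DEPTH upcoming
     * blocks.  posix_fadvise(POSIX_FADV_WILLNEED) returns immediately, so
     * this only lets the kernel start the reads early; bounding the depth
     * keeps the hinted pages from being evicted before replay reaches them.
     */
    static void
    prefetch_upcoming_blocks(const PrefetchItem *items, int nitems)
    {
        int         limit = nitems < PREFETCH_DEPTH ? nitems : PREFETCH_DEPTH;

        for (int i = 0; i < limit; i++)
            (void) posix_fadvise(items[i].fd,
                                 (off_t) items[i].blocknum * BLCKSZ,
                                 BLCKSZ, POSIX_FADV_WILLNEED);
    }

    int
    main(void)
    {
        int         fd = open("/etc/hostname", O_RDONLY);   /* stand-in file */
        PrefetchItem items[] = {{fd, 0}};

        if (fd >= 0)
        {
            prefetch_upcoming_blocks(items, 1);
            close(fd);
        }
        return 0;
    }

Capping the depth is the point: hinting a whole segment's worth of blocks at once risks the kernel evicting them again before replay reaches them, as noted above.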
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On Tue, Apr 23, 2019 at 10:22:46AM -0400, Stephen Frost wrote: > >* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > >>On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote: > >>>On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote: > >>>>> Oh. Well, I already explained my algorithm for doing that upthread, > >>>>> which I believe would be quite cheap. > >>>>> > >>>>> 1. When you generate the .modblock files, stick all the block > >>>>> references into a buffer. qsort(). Dedup. Write out in sorted > >>>>> order. > >>>> > >>>>Having all of the block references in a sorted order does seem like it > >>>>would help, but would also make those potentially quite a bit larger > >>>>than necessary (I had some thoughts about making them smaller elsewhere > >>>>in this discussion). That might be worth it though. I suppose it might > >>>>also be possible to line up the bitmaps suggested elsewhere to do > >>>>essentially a BitmapOr of them to identify the blocks changed (while > >>>>effectively de-duping at the same time). > >>> > >>>I don't see why this would make them bigger than necessary. If you > >>>sort by relfilenode/fork/blocknumber and dedup, then references to > >>>nearby blocks will be adjacent in the file. You can then decide what > >>>format will represent that most efficiently on output. Whether or not > >>>a bitmap is better idea than a list of block numbers or something else > >>>depends on what percentage of blocks are modified and how clustered > >>>they are. > >> > >>Not sure I understand correctly - do you suggest to deduplicate and sort > >>the data before writing them into the .modblock files? Because that the > >>the sorting would make this information mostly useless for the recovery > >>prefetching use case I mentioned elsewhere. For that to work we need > >>information about both the LSN and block, in the LSN order. > > > >I'm not sure I follow- why does the prefetching need to get the blocks > >in LSN order..? Once the blocks that we know are going to change in the > >next segment have been identified, we could prefetch them all and have > >them ready for when replay gets to them. I'm not sure that we > >specifically need to have them pre-fetched in the same order that the > >replay happens and it might even be better to fetch them in an order > >that's as sequential as possible to get them in as quickly as possible. > > That means we'd have to prefetch all blocks for the whole WAL segment, > which is pretty useless, IMO. A single INSERT (especially for indexes) is > often just ~100B, so a single 16MB segment can fit ~160k of them. Surely > we don't want to prefetch all of that at once? And it's even worse for > larger WAL segments, which are likely to get more common now that it's an > initdb option. Ah, yeah, I had been thinking about FPIs and blocks and figuring that 16MB wasn't that bad but you're certainly right that we could end up touching a lot more blocks in a given WAL segment. > I'm pretty sure the prefetching needs to be more like "prefetch the next > 1024 blocks we'll need" or "prefetch blocks from the next X megabytes of > WAL". That doesn't mean we can't do some additional optimization (like > reordering them a bit), but there's a point where it gets detrimental > because the kernel will just evict some of the prefetched blocks before we > actually access them. And the larger amount of blocks you prefetch the > more likely that is. 
Since we're prefetching, maybe we shouldn't be just pulling the blocks into the filesystem cache but instead into something we have more control over..? > >>So if we want to allow that use case to leverage this infrastructure, we > >>need to write the .modfiles kinda "raw" and do this processing in some > >>later step. > > > >If we really need the LSN info for the blocks, then we could still > >de-dup, picking the 'first modified in this segment at LSN X', or keep > >both first and last, or I suppose every LSN if we really want, and then > >have that information included with the other information about the > >block. Downstream clients could then sort based on the LSN info if they > >want to have a list of blocks in sorted-by-LSN-order. > > Possibly. I don't think keeping just the first block occurence is enough, > particularly for large WAL segmnent sizes, but I agree we can reduce the > stuff a bit (say, ignore references that are less than 1MB apart or so). > We just can't remove all the LSN information entirely. Yeah, it depends on how much memory we want to use for prefetching and how large the WAL segments are. > >>Now, maybe the incremental backup use case is so much more important the > >>right thing to do is ignore this other use case, and I'm OK with that - > >>as long as it's a conscious choice. > > > >I'd certainly like to have a way to prefetch, but I'm not entirely sure > >that it makes sense to combine it with this, so while I sketched out > >some ideas about how to do that above, I don't want it to come across as > >being a strong endorsement of the overall idea. > > > >For pre-fetching purposes, for an async streaming replica, it seems like > >the wal sender process could potentially just scan the WAL and have a > >list of blocks ready to pass to the replica which are "this is what's > >coming soon" or similar, rather than working with the modfiles at all. > >Not sure if we'd always send that or if we wait for the replica to ask > >for it. Though for doing WAL replay from the archive, being able to ask > >for the modfile first to do prefetching before replaying the WAL itself > >could certainly be beneficial, so maybe it does make sense to have that > >information there too.. still not sure we really need it in LSN order > >or that we need to prefetch in LSN order though. > > Well, how exactly should the prefetching extract the block list is still > an open question. But this seems to deal with pretty much the same stuff, > so it might make sense to support the prefetching use case too. Or maybe > not, I'm not sure - but IMHO we should try. Yeah, makes sense to me to at least be thinking about it and seeing if there's a way to make adding prefetching easier. > I don't think it makes sense to do prefetching when the standby is fully > caught up with the primary. It gets more important when it falls behind by > a significant amount of WAL, at which point it's likely to fetch already > closed WAL segments - either from primary or archive. So either it can get > the block list from there, or it may extract the info on it's own (using > some of the infrastructure used to build the mdblock files). I don't know.. if we're able to keep up with the primary because we did a bunch of prefetching and that made WAL replay fast enough that we catch up, aren't we going to possibly fall right back behind again pretty quickly if we don't prefetch for the blocks that are about to be sent? 
If we are caught up and there's no data to be sent then we don't have much choice, but as soon as there's WAL data to be sent our way, any more than just one block, then grabbing the list of blocks that we know we're going to need during replay seems to make a lot of sense to me. In other words, I'd think we would just always do this and how much we do would be dictated by both how far behind we are and how much memory we want to allocate for this. Thanks! Stephen
On Tue, Apr 23, 2019 at 11:43:05AM -0400, Stephen Frost wrote: >Greetings, > >* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: >> On Tue, Apr 23, 2019 at 10:22:46AM -0400, Stephen Frost wrote: >> >* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: >> >>On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote: >> >>>On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote: >> >>>>> Oh. Well, I already explained my algorithm for doing that upthread, >> >>>>> which I believe would be quite cheap. >> >>>>> >> >>>>> 1. When you generate the .modblock files, stick all the block >> >>>>> references into a buffer. qsort(). Dedup. Write out in sorted >> >>>>> order. >> >>>> >> >>>>Having all of the block references in a sorted order does seem like it >> >>>>would help, but would also make those potentially quite a bit larger >> >>>>than necessary (I had some thoughts about making them smaller elsewhere >> >>>>in this discussion). That might be worth it though. I suppose it might >> >>>>also be possible to line up the bitmaps suggested elsewhere to do >> >>>>essentially a BitmapOr of them to identify the blocks changed (while >> >>>>effectively de-duping at the same time). >> >>> >> >>>I don't see why this would make them bigger than necessary. If you >> >>>sort by relfilenode/fork/blocknumber and dedup, then references to >> >>>nearby blocks will be adjacent in the file. You can then decide what >> >>>format will represent that most efficiently on output. Whether or not >> >>>a bitmap is better idea than a list of block numbers or something else >> >>>depends on what percentage of blocks are modified and how clustered >> >>>they are. >> >> >> >>Not sure I understand correctly - do you suggest to deduplicate and sort >> >>the data before writing them into the .modblock files? Because that the >> >>the sorting would make this information mostly useless for the recovery >> >>prefetching use case I mentioned elsewhere. For that to work we need >> >>information about both the LSN and block, in the LSN order. >> > >> >I'm not sure I follow- why does the prefetching need to get the blocks >> >in LSN order..? Once the blocks that we know are going to change in the >> >next segment have been identified, we could prefetch them all and have >> >them ready for when replay gets to them. I'm not sure that we >> >specifically need to have them pre-fetched in the same order that the >> >replay happens and it might even be better to fetch them in an order >> >that's as sequential as possible to get them in as quickly as possible. >> >> That means we'd have to prefetch all blocks for the whole WAL segment, >> which is pretty useless, IMO. A single INSERT (especially for indexes) is >> often just ~100B, so a single 16MB segment can fit ~160k of them. Surely >> we don't want to prefetch all of that at once? And it's even worse for >> larger WAL segments, which are likely to get more common now that it's an >> initdb option. > >Ah, yeah, I had been thinking about FPIs and blocks and figuring that >16MB wasn't that bad but you're certainly right that we could end up >touching a lot more blocks in a given WAL segment. > >> I'm pretty sure the prefetching needs to be more like "prefetch the next >> 1024 blocks we'll need" or "prefetch blocks from the next X megabytes of >> WAL". 
That doesn't mean we can't do some additional optimization (like >> reordering them a bit), but there's a point where it gets detrimental >> because the kernel will just evict some of the prefetched blocks before we >> actually access them. And the larger amount of blocks you prefetch the >> more likely that is. > >Since we're prefetching, maybe we shouldn't be just pulling the blocks >into the filesystem cache but instead into something we have more >control over..? > Well, yeah, and there was a discussion about that in the prefetching thread IIRC. But but in that case it's probably even more imporant to prefetch only a fairly small chunk of blocks (instead of the whole WAL segment worth of blocks), because shared buffers are usually much smaller compared to page cache. So we'd probably want some sort of small ring buffer there, so the LSN information is even more important. >> >>So if we want to allow that use case to leverage this infrastructure, we >> >>need to write the .modfiles kinda "raw" and do this processing in some >> >>later step. >> > >> >If we really need the LSN info for the blocks, then we could still >> >de-dup, picking the 'first modified in this segment at LSN X', or keep >> >both first and last, or I suppose every LSN if we really want, and then >> >have that information included with the other information about the >> >block. Downstream clients could then sort based on the LSN info if they >> >want to have a list of blocks in sorted-by-LSN-order. >> >> Possibly. I don't think keeping just the first block occurence is enough, >> particularly for large WAL segmnent sizes, but I agree we can reduce the >> stuff a bit (say, ignore references that are less than 1MB apart or so). >> We just can't remove all the LSN information entirely. > >Yeah, it depends on how much memory we want to use for prefetching and >how large the WAL segments are. > Right. >> >>Now, maybe the incremental backup use case is so much more important the >> >>right thing to do is ignore this other use case, and I'm OK with that - >> >>as long as it's a conscious choice. >> > >> >I'd certainly like to have a way to prefetch, but I'm not entirely sure >> >that it makes sense to combine it with this, so while I sketched out >> >some ideas about how to do that above, I don't want it to come across as >> >being a strong endorsement of the overall idea. >> > >> >For pre-fetching purposes, for an async streaming replica, it seems like >> >the wal sender process could potentially just scan the WAL and have a >> >list of blocks ready to pass to the replica which are "this is what's >> >coming soon" or similar, rather than working with the modfiles at all. >> >Not sure if we'd always send that or if we wait for the replica to ask >> >for it. Though for doing WAL replay from the archive, being able to ask >> >for the modfile first to do prefetching before replaying the WAL itself >> >could certainly be beneficial, so maybe it does make sense to have that >> >information there too.. still not sure we really need it in LSN order >> >or that we need to prefetch in LSN order though. >> >> Well, how exactly should the prefetching extract the block list is still >> an open question. But this seems to deal with pretty much the same stuff, >> so it might make sense to support the prefetching use case too. Or maybe >> not, I'm not sure - but IMHO we should try. > >Yeah, makes sense to me to at least be thinking about it and seeing if >there's a way to make adding prefetching easier. > Agreed. 
>> I don't think it makes sense to do prefetching when the standby is fully >> caught up with the primary. It gets more important when it falls behind by >> a significant amount of WAL, at which point it's likely to fetch already >> closed WAL segments - either from primary or archive. So either it can get >> the block list from there, or it may extract the info on its own (using >> some of the infrastructure used to build the mdblock files). > >I don't know.. if we're able to keep up with the primary because we did >a bunch of prefetching and that made WAL replay fast enough that we >catch up, aren't we going to possibly fall right back behind again >pretty quickly if we don't prefetch for the blocks that are about to be >sent? If we are caught up and there's no data to be sent then we don't >have much choice, but as soon as there's WAL data to be sent our way, >any more than just one block, then grabbing the list of blocks that we >know we're going to need during replay seems to make a lot of sense to >me. In other words, I'd think we would just always do this and how much >we do would be dictated by both how far behind we are and how much >memory we want to allocate for this. > Well, the thing is that for prefetching to be possible you actually have to be a bit behind. Otherwise you can't really look ahead to see which blocks will be needed, right? IMHO the main use case for prefetching is when there's a spike of activity on the primary, making the standby fall behind, and then it takes hours to catch up. I don't think the cases with just a couple of MBs of lag are the issue prefetching is meant to improve (if it does, great). Anyway, not sure if this is the right thread to discuss prefetching in much more detail. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2019-04-23 18:07:40 +0200, Tomas Vondra wrote: > Well, the thing is that for prefetching to be possible you actually have > to be a bit behind. Otherwise you can't really look forward which blocks > will be needed, right? > > IMHO the main use case for prefetching is when there's a spike of activity > on the primary, making the standby to fall behind, and then hours takes > hours to catch up. I don't think the cases with just a couple of MBs of > lag are the issue prefetching is meant to improve (if it does, great). I'd be surprised if a good implementation didn't. Even just some smarter IO scheduling in the startup process could help a good bit. E.g. no need to sequentially read the first and then the second block for an update record, if you can issue both at the same time - just about every storage system these days can do a number of IO requests in parallel, and it nearly halves latency effects. And reading a few records (as in a few hundred bytes commonly) ahead, allows to do much more than that. Greetings, Andres Freund
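A sketch of the per-record idea described here, issuing both block reads of a single update record concurrently rather than back to back; it uses POSIX AIO only to make the point concrete (the stand-in file, offsets, and helper are hypothetical, and this is not how the startup process currently reads pages).

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /*
     * Read the two blocks referenced by a single (hypothetical) update
     * record with both I/Os in flight at once, instead of waiting for the
     * first read to finish before issuing the second.  POSIX AIO is used
     * only to make the idea concrete; link with -lrt on older glibc.
     */
    static int
    read_two_blocks_concurrently(int fd, off_t off1, off_t off2,
                                 char *buf1, char *buf2)
    {
        struct aiocb cb1, cb2;
        const struct aiocb *list[2] = {&cb1, &cb2};

        memset(&cb1, 0, sizeof(cb1));
        memset(&cb2, 0, sizeof(cb2));
        cb1.aio_fildes = fd;
        cb1.aio_buf = buf1;
        cb1.aio_nbytes = BLCKSZ;
        cb1.aio_offset = off1;
        cb2.aio_fildes = fd;
        cb2.aio_buf = buf2;
        cb2.aio_nbytes = BLCKSZ;
        cb2.aio_offset = off2;

        /* Queue both reads before waiting on either. */
        if (aio_read(&cb1) != 0 || aio_read(&cb2) != 0)
            return -1;

        while (aio_error(&cb1) == EINPROGRESS || aio_error(&cb2) == EINPROGRESS)
            aio_suspend(list, 2, NULL);

        return (aio_return(&cb1) < 0 || aio_return(&cb2) < 0) ? -1 : 0;
    }

    int
    main(void)
    {
        char        buf1[BLCKSZ], buf2[BLCKSZ];
        int         fd = open("/bin/sh", O_RDONLY);     /* stand-in data file */

        if (fd >= 0)
        {
            if (read_two_blocks_concurrently(fd, 0, BLCKSZ, buf1, buf2) == 0)
                printf("read both blocks concurrently\n");
            close(fd);
        }
        return 0;
    }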
On Tue, Apr 23, 2019 at 09:34:54AM -0700, Andres Freund wrote: >Hi, > >On 2019-04-23 18:07:40 +0200, Tomas Vondra wrote: >> Well, the thing is that for prefetching to be possible you actually have >> to be a bit behind. Otherwise you can't really look forward which blocks >> will be needed, right? >> >> IMHO the main use case for prefetching is when there's a spike of activity >> on the primary, making the standby to fall behind, and then hours takes >> hours to catch up. I don't think the cases with just a couple of MBs of >> lag are the issue prefetching is meant to improve (if it does, great). > >I'd be surprised if a good implementation didn't. Even just some smarter >IO scheduling in the startup process could help a good bit. E.g. no need >to sequentially read the first and then the second block for an update >record, if you can issue both at the same time - just about every >storage system these days can do a number of IO requests in parallel, >and it nearly halves latency effects. And reading a few records (as in a >few hundred bytes commonly) ahead, allows to do much more than that. > I don't disagree with that - prefetching certainly can improve utilization of the storage system. The question is whether it can meaningfully improve performance of the recovery process in cases when it does not lag. And I think it can't (perhaps with remote_apply being an exception). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2019-04-23 19:01:29 +0200, Tomas Vondra wrote: > On Tue, Apr 23, 2019 at 09:34:54AM -0700, Andres Freund wrote: > > Hi, > > > > On 2019-04-23 18:07:40 +0200, Tomas Vondra wrote: > > > Well, the thing is that for prefetching to be possible you actually have > > > to be a bit behind. Otherwise you can't really look forward which blocks > > > will be needed, right? > > > > > > IMHO the main use case for prefetching is when there's a spike of activity > > > on the primary, making the standby to fall behind, and then hours takes > > > hours to catch up. I don't think the cases with just a couple of MBs of > > > lag are the issue prefetching is meant to improve (if it does, great). > > > > I'd be surprised if a good implementation didn't. Even just some smarter > > IO scheduling in the startup process could help a good bit. E.g. no need > > to sequentially read the first and then the second block for an update > > record, if you can issue both at the same time - just about every > > storage system these days can do a number of IO requests in parallel, > > and it nearly halves latency effects. And reading a few records (as in a > > few hundred bytes commonly) ahead, allows to do much more than that. > > > > I don't disagree with that - prefetching certainly can improve utilization > of the storage system. The question is whether it can meaningfully improve > performance of the recovery process in cases when it does not lag. And I > think it can't (perhaps with remote_apply being an exception). Well. I think a few dozen records behind doesn't really count as "lag", and I think that's where it'd start to help (and for some record types like updates it'd start to help even for single records). It'd convert scenarios where we'd currently fall behind slowly into scenarios where we can keep up - but where there's no meaningful lag while we keep up. What's your argument for me being wrong? And even if we'd keep up without any prefetching, issuing requests in a more efficient manner allows for more efficient concurrent use of the storage system. It'll often effectively reduce the amount of random iops. Greetings, Andres Freund
On Tue, Apr 23, 2019 at 10:09:39AM -0700, Andres Freund wrote:
> Hi,
>
> On 2019-04-23 19:01:29 +0200, Tomas Vondra wrote:
> > On Tue, Apr 23, 2019 at 09:34:54AM -0700, Andres Freund wrote:
> > > Hi,
> > >
> > > On 2019-04-23 18:07:40 +0200, Tomas Vondra wrote:
> > > > Well, the thing is that for prefetching to be possible you actually have to be a bit behind. Otherwise you can't really look ahead to see which blocks will be needed, right?
> > > >
> > > > IMHO the main use case for prefetching is when there's a spike of activity on the primary, making the standby fall behind, and then it takes hours to catch up. I don't think the cases with just a couple of MBs of lag are the issue prefetching is meant to improve (if it does, great).
> > >
> > > I'd be surprised if a good implementation didn't. Even just some smarter IO scheduling in the startup process could help a good bit. E.g. there's no need to sequentially read the first and then the second block for an update record if you can issue both at the same time - just about every storage system these days can do a number of IO requests in parallel, and that nearly halves the latency effects. And reading a few records (commonly just a few hundred bytes) ahead allows doing much more than that.
> >
> > I don't disagree with that - prefetching certainly can improve utilization of the storage system. The question is whether it can meaningfully improve performance of the recovery process in cases where it does not lag. And I think it can't (perhaps with remote_apply being an exception).
>
> Well. I think a few dozen records behind doesn't really count as "lag", and I think that's where it'd start to help (and for some record types, like updates, it'd start to help even for single records). It'd convert scenarios where we'd currently fall behind slowly into scenarios where we can keep up - but where there's no meaningful lag while we keep up. What's your argument for me being wrong?

I was not saying you are wrong. I think we actually agree on the main points.

My point is that prefetching is most valuable in cases where the standby can't keep up and falls behind significantly - at which point we have a sufficient queue of blocks to prefetch. I don't care about the case where the standby can keep up even without prefetching, because the metric we need to optimize (i.e. lag) is close to 0 even without prefetching.

> And even if we'd keep up without any prefetching, issuing requests in a more efficient manner allows for more efficient concurrent use of the storage system. It'll often effectively reduce the amount of random iops.

Maybe, although the metric we (and users) care about the most is the amount of lag. If the system keeps up even without prefetching, no one will complain about I/O utilization. When the lag is close to 0, the average throughput/IOPS/... is bound to be the same in both cases, because prefetching does not affect how fast the standby receives WAL from the primary. Except that it's somewhat "spikier" with prefetching, because we issue requests in bursts - which may actually be a bad thing.

Of course, maybe prefetching will make the I/O much more efficient even in the "no lag" case, and while it won't improve the recovery itself, it'll leave more I/O bandwidth for the other processes (say, queries on a hot standby). So to be clear, I'm not against prefetching even in this case, but it's not the primary reason why I think we need to do it.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 22, 2019 at 9:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
> For this particular use case, wouldn't you want to read the WAL itself and use that to issue prefetch requests? Because if you use the .modblock files, the data file blocks will end up in memory but the WAL blocks won't, and you'll still be waiting for I/O.

I'm still interested in the answer to this question, but I don't see a reply that specifically concerns it. Apologies if I have missed one.

Stepping back a bit, I think that the basic issue under discussion here is how granular you want your .modblock files to be. At one extreme, one can imagine an application that wants to know exactly which blocks were accessed at exactly which LSNs. At the other extreme, if you want to run a daily incremental backup, you just want to know which blocks have been modified between the start of the previous backup and the start of the current backup - i.e. sometime in the last ~24 hours. These are quite different things. When you only want approximate information - is there a chance that this block was changed within this LSN range, or not? - you can sort and deduplicate in advance; when you want exact information, you cannot do that. Furthermore, if you want exact information, you must store an LSN for every record; if you want approximate information, you emit a file for each LSN range and consider it sufficient to know that the change happened somewhere within the range of LSNs encompassed by that file.

It's pretty clear in my mind that what I want to do here is provide approximate information, not exact information. Being able to sort and deduplicate in advance seems critical to making something like this work on high-velocity systems. If you are generating a terabyte of WAL between incremental backups, and you don't do any sorting or deduplication prior to the point when you actually try to generate the modified block map, you are going to need a whole lot of memory (and CPU time, though that's less critical, I think) to process all of that data. If you can read modblock files which are already sorted and deduplicated, you can generate results incrementally and send them to the client incrementally, and you never really need more than some fixed amount of memory no matter how much data you are processing.

While I'm convinced that this particular feature should provide approximate rather than exact information, the degree of approximation is up for debate, and maybe it's best to just make that configurable. Some applications might work best with small modblock files covering only ~16MB of WAL each, or even less, while others might prefer larger quanta, say 1GB or even more. For incremental backup, I believe that the right quantum will depend on the system's velocity. On a system that isn't very busy, fine-grained modblock files will make incremental backup more efficient. If each modblock file covers only 16MB of WAL, and the backup manages to start someplace in the middle of that 16MB, then you'll only be including 16MB or less of unnecessary block references in the backup, so you won't incur much extra work. On the other hand, on a busy system, you probably do not want such a small quantum, because you will then end up with gazillions of modblock files, and that will be hard to manage. It could also have performance problems, because merging data from a couple of hundred files is fine, but merging data from a couple of hundred thousand files is going to be inefficient. My experience hacking on and testing tuplesort.c a few years ago (with valuable tutelage from Peter Geoghegan) showed me that there is a slow drop-off in efficiency as the merge order increases - and in this case, at some point you will blow out the size of the OS file descriptor table and have to start opening and closing files every time you access a different one, and that will be unpleasant. Finally, deduplication will tend to be more effective across larger numbers of block references, at least for some access patterns.

So all of that is to say that if somebody wants modblock files each of which covers 1MB of WAL, I think that the same tools I'm proposing to build here for incremental backup could support that use case with just a configuration change. Moreover, the resulting files would still be usable by the incremental backup engine. So that's good: the same system can, at least to some extent, be reused for whatever other purposes people have that require knowing about modified blocks. On the other hand, the incremental backup engine will likely not cope smoothly with having hundreds of thousands or millions of modblock files shoved down its gullet, so if there is a dramatic difference in the granularity requirements of different consumers, another approach is likely indicated. Especially if some consumer wants to see block references in the exact order in which they appear in the WAL, or wants to know the exact LSN of each reference, it's probably best to go for a different approach. For example, pg_waldump could grow a new option which spits out just the block references in a format designed to be easily machine-parseable; or a hypothetical background worker that does prefetching for recovery could just contain its own copy of the xlogreader machinery.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
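[Editorial sketch of the fixed-memory merge described above - an illustration under an assumed on-disk format, not the proposed incremental backup code. Merging already-sorted, deduplicated modblock files only ever holds one pending block reference per input file, so memory is bounded by the number of files, not the number of references. The BlockRef layout is made up for the example.]

/*
 * Illustrative sketch only: merge pre-sorted, deduplicated modblock files,
 * dropping duplicates, with one buffered record per input file.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct BlockRef
{
    uint32_t    relfile;        /* assumed: which relation file */
    uint32_t    block;          /* assumed: block number */
} BlockRef;

static int
blockref_cmp(const BlockRef *a, const BlockRef *b)
{
    if (a->relfile != b->relfile)
        return a->relfile < b->relfile ? -1 : 1;
    if (a->block != b->block)
        return a->block < b->block ? -1 : 1;
    return 0;
}

/*
 * A real implementation would use a binary heap instead of the linear scan
 * below, and would keep nfiles well under the file descriptor limit.
 */
static void
merge_modblock_files(FILE **in, int nfiles, FILE *out)
{
    BlockRef    head[nfiles];
    bool        valid[nfiles];
    BlockRef    last = {0, 0};
    bool        have_last = false;

    for (int i = 0; i < nfiles; i++)
        valid[i] = (fread(&head[i], sizeof(BlockRef), 1, in[i]) == 1);

    for (;;)
    {
        int         best = -1;

        for (int i = 0; i < nfiles; i++)
            if (valid[i] && (best < 0 || blockref_cmp(&head[i], &head[best]) < 0))
                best = i;
        if (best < 0)
            break;              /* all inputs exhausted */

        /* emit only if different from the previously emitted reference */
        if (!have_last || blockref_cmp(&head[best], &last) != 0)
        {
            fwrite(&head[best], sizeof(BlockRef), 1, out);
            last = head[best];
            have_last = true;
        }
        valid[best] = (fread(&head[best], sizeof(BlockRef), 1, in[best]) == 1);
    }
}

[The same shape works for streaming results to a client incrementally: nothing is buffered beyond one record per open input.]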
On Wed, Apr 24, 2019 at 09:25:12AM -0400, Robert Haas wrote:
> On Mon, Apr 22, 2019 at 9:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > For this particular use case, wouldn't you want to read the WAL itself and use that to issue prefetch requests? Because if you use the .modblock files, the data file blocks will end up in memory but the WAL blocks won't, and you'll still be waiting for I/O.
>
> I'm still interested in the answer to this question, but I don't see a reply that specifically concerns it. Apologies if I have missed one.

I don't think prefetching WAL blocks is all that important. The WAL segment was probably received fairly recently (either from the primary or from the archive), and so it's reasonable to assume it's still in the page cache. And even if it's not, sequential reads are handled pretty well by readahead - which is itself a form of prefetching.

But even if WAL prefetching were useful in some cases, I think it's mostly an orthogonal issue - it certainly does not make prefetching of data pages unnecessary.

> Stepping back a bit, I think that the basic issue under discussion here is how granular you want your .modblock files to be. At one extreme, one can imagine an application that wants to know exactly which blocks were accessed at exactly which LSNs. At the other extreme, if you want to run a daily incremental backup, you just want to know which blocks have been modified between the start of the previous backup and the start of the current backup - i.e. sometime in the last ~24 hours. These are quite different things. When you only want approximate information - is there a chance that this block was changed within this LSN range, or not? - you can sort and deduplicate in advance; when you want exact information, you cannot do that. Furthermore, if you want exact information, you must store an LSN for every record; if you want approximate information, you emit a file for each LSN range and consider it sufficient to know that the change happened somewhere within the range of LSNs encompassed by that file.

Those are the extreme design options, yes. But I think there may be a reasonable middle ground that would allow using the modblock files for both use cases.

> It's pretty clear in my mind that what I want to do here is provide approximate information, not exact information. Being able to sort and deduplicate in advance seems critical to making something like this work on high-velocity systems.

Do you have any analysis / data to support that claim? I mean, it's obvious that sorting and deduplicating the data right away makes subsequent processing more efficient, but it's not clear to me that not doing it would make the approach useless for high-velocity systems.

> If you are generating a terabyte of WAL between incremental backups, and you don't do any sorting or deduplication prior to the point when you actually try to generate the modified block map, you are going to need a whole lot of memory (and CPU time, though that's less critical, I think) to process all of that data. If you can read modblock files which are already sorted and deduplicated, you can generate results incrementally and send them to the client incrementally, and you never really need more than some fixed amount of memory no matter how much data you are processing.

Sure, but that's not what I proposed elsewhere in this thread. My proposal was to keep modblock files "raw" for WAL segments that were not yet recycled (so roughly the last ~3 checkpoints), and deduplicate them after that. So the vast majority of the 1TB of WAL will already have deduplicated data.

Also, maybe we can do partial deduplication, in a way that would still be useful for prefetching. Say we only deduplicate within 1MB windows - that would work at least for cases that touch the same page frequently (say, by inserting into the tail of an index, or so).

> While I'm convinced that this particular feature should provide approximate rather than exact information, the degree of approximation is up for debate, and maybe it's best to just make that configurable. Some applications might work best with small modblock files covering only ~16MB of WAL each, or even less, while others might prefer larger quanta, say 1GB or even more. For incremental backup, I believe that the right quantum will depend on the system's velocity. On a system that isn't very busy, fine-grained modblock files will make incremental backup more efficient. If each modblock file covers only 16MB of WAL, and the backup manages to start someplace in the middle of that 16MB, then you'll only be including 16MB or less of unnecessary block references in the backup, so you won't incur much extra work. On the other hand, on a busy system, you probably do not want such a small quantum, because you will then end up with gazillions of modblock files, and that will be hard to manage. It could also have performance problems, because merging data from a couple of hundred files is fine, but merging data from a couple of hundred thousand files is going to be inefficient. My experience hacking on and testing tuplesort.c a few years ago (with valuable tutelage from Peter Geoghegan) showed me that there is a slow drop-off in efficiency as the merge order increases - and in this case, at some point you will blow out the size of the OS file descriptor table and have to start opening and closing files every time you access a different one, and that will be unpleasant. Finally, deduplication will tend to be more effective across larger numbers of block references, at least for some access patterns.

I agree with those observations in general, but I don't think they somehow prove we have to deduplicate/sort the data immediately.

FWIW no one cares about low-velocity systems. While raw modblock files would not be an issue on them, they're also mostly uninteresting from the prefetching perspective. It's the high-velocity systems that have lag.

> So all of that is to say that if somebody wants modblock files each of which covers 1MB of WAL, I think that the same tools I'm proposing to build here for incremental backup could support that use case with just a configuration change. Moreover, the resulting files would still be usable by the incremental backup engine. So that's good: the same system can, at least to some extent, be reused for whatever other purposes people have that require knowing about modified blocks.

+1 to making it a configuration change, at least during the development phase - it'll allow comfortable testing and benchmarking.

> On the other hand, the incremental backup engine will likely not cope smoothly with having hundreds of thousands or millions of modblock files shoved down its gullet, so if there is a dramatic difference in the granularity requirements of different consumers, another approach is likely indicated. Especially if some consumer wants to see block references in the exact order in which they appear in the WAL, or wants to know the exact LSN of each reference, it's probably best to go for a different approach. For example, pg_waldump could grow a new option which spits out just the block references in a format designed to be easily machine-parseable; or a hypothetical background worker that does prefetching for recovery could just contain its own copy of the xlogreader machinery.

Again, I don't think we have to keep the raw modblock files forever. Send them to the archive, remove/deduplicate/sort them after we recycle the WAL segment, or something like that. That way the incremental backups don't need to deal with an excessive number of modblock files.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
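[Editorial sketch of what the "deduplicate after the WAL segment is recycled" pass proposed above could look like, reusing the same made-up fixed-width BlockRef record as the earlier merge example: slurp one segment's raw, WAL-ordered references, sort, squash duplicates, and write the compacted file. Not actual code from the thread.]

/*
 * Illustrative sketch only: turn a raw, WAL-ordered modblock file (which may
 * contain duplicates) into a sorted, deduplicated one.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct BlockRef
{
    uint32_t    relfile;        /* assumed layout, as before */
    uint32_t    block;
} BlockRef;

static int
blockref_qsort_cmp(const void *pa, const void *pb)
{
    const BlockRef *a = pa, *b = pb;

    if (a->relfile != b->relfile)
        return a->relfile < b->relfile ? -1 : 1;
    if (a->block != b->block)
        return a->block < b->block ? -1 : 1;
    return 0;
}

/* Returns the number of deduplicated references written, or -1 on error. */
static long
dedup_modblock_file(FILE *raw, FILE *out)
{
    BlockRef   *refs = NULL;
    size_t      n = 0, cap = 0, nout = 0;
    BlockRef    r;

    /* slurp the raw file; one WAL segment's worth is a modest array */
    while (fread(&r, sizeof(r), 1, raw) == 1)
    {
        if (n == cap)
        {
            size_t      newcap = cap ? cap * 2 : 1024;
            BlockRef   *tmp = realloc(refs, newcap * sizeof(BlockRef));

            if (tmp == NULL)
            {
                free(refs);
                return -1;
            }
            refs = tmp;
            cap = newcap;
        }
        refs[n++] = r;
    }

    qsort(refs, n, sizeof(BlockRef), blockref_qsort_cmp);

    for (size_t i = 0; i < n; i++)
    {
        if (i > 0 && blockref_qsort_cmp(&refs[i - 1], &refs[i]) == 0)
            continue;           /* duplicate of the previous reference */
        fwrite(&refs[i], sizeof(BlockRef), 1, out);
        nout++;
    }

    free(refs);
    return (long) nout;
}

[In practice the sort key would presumably carry the full relfilenode (tablespace, database, relation) plus fork number, but the shape of the pass is the same.]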
On Wed, Apr 24, 2019 at 10:10 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > I'm still interested in the answer to this question, but I don't see a reply that specifically concerns it. Apologies if I have missed one.
>
> I don't think prefetching WAL blocks is all that important. The WAL segment was probably received fairly recently (either from the primary or from the archive), and so it's reasonable to assume it's still in the page cache. And even if it's not, sequential reads are handled pretty well by readahead - which is itself a form of prefetching.

True. But if you are going to need to read the WAL anyway to apply it, why shouldn't the prefetcher just read it first and use that to drive prefetching, instead of using the modblock files? It's strictly less I/O, because you were going to read the WAL files anyway and now you don't have to also read some other modblock file, and it doesn't really seem to have any disadvantages.

> > It's pretty clear in my mind that what I want to do here is provide approximate information, not exact information. Being able to sort and deduplicate in advance seems critical to making something like this work on high-velocity systems.
>
> Do you have any analysis / data to support that claim? I mean, it's obvious that sorting and deduplicating the data right away makes subsequent processing more efficient, but it's not clear to me that not doing it would make the approach useless for high-velocity systems.

I did include some analysis of this point in my original post. It does depend on your assumptions. If you assume that users will be OK with memory usage that runs into the tens of gigabytes when the amount of change since the last incremental backup is very large, then there is probably no big problem, but that assumption sounds shaky to me. (The customers I end up on the phone with seem to be disproportionately those running enormous systems on dramatically underpowered hardware, which is not infrequently related to the reason I end up on the phone with them.)

> Sure, but that's not what I proposed elsewhere in this thread. My proposal was to keep modblock files "raw" for WAL segments that were not yet recycled (so roughly the last ~3 checkpoints), and deduplicate them after that. So the vast majority of the 1TB of WAL will already have deduplicated data.

OK, I missed that proposal. My biggest concern about this is that I don't see how to square it with the proposal elsewhere on this thread that these files should be put someplace that makes them subject to archiving. If the files are managed by the master in a separate directory, it can easily do this sort of thing, but if they're archived then you can't. Now maybe that's just a reason not to adopt that proposal, but I don't see how to adopt both that proposal and this one, unless we just say that we're going to spew craptons of tiny little non-deduplicated modblock files into the archive.

> Also, maybe we can do partial deduplication, in a way that would still be useful for prefetching. Say we only deduplicate within 1MB windows - that would work at least for cases that touch the same page frequently (say, by inserting into the tail of an index, or so).

Maybe, but I'm not sure that's really optimal for any use case.

> FWIW no one cares about low-velocity systems. While raw modblock files would not be an issue on them, they're also mostly uninteresting from the prefetching perspective. It's the high-velocity systems that have lag.

I don't think that's particularly fair. Low-velocity systems are some of the best candidates for incremental backup, and people who are running such systems probably care about that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
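[Editorial back-of-envelope sketch of where a "tens of gigabytes" figure can come from; every constant below is an assumption chosen for illustration, not a measurement from this thread.]

/*
 * Rough arithmetic sketch, all inputs assumed: memory needed to hold the
 * raw (unsorted, non-deduplicated) block references for 1 TB of WAL.
 */
#include <stdio.h>

int
main(void)
{
    const double wal_bytes = 1e12;           /* 1 TB of WAL between backups */
    const double wal_bytes_per_ref = 400;    /* assumed avg WAL bytes per block reference */
    const double mem_bytes_per_ref = 20;     /* assumed in-memory size of one reference */

    double      nrefs = wal_bytes / wal_bytes_per_ref;
    double      mem_gb = nrefs * mem_bytes_per_ref / 1e9;

    /* with these assumptions: ~2.5e9 references, ~50 GB before any dedup */
    printf("%.1e references, ~%.0f GB\n", nrefs, mem_gb);
    return 0;
}

[The exact figures swing widely with workload and record mix, but the point stands that without prior sorting/deduplication the working set scales with WAL volume rather than with cluster size.]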