Re: finding changed blocks using WAL scanning - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: finding changed blocks using WAL scanning
Date
Msg-id 20190423160740.kzo45g7agmki3tsp@development
In response to Re: finding changed blocks using WAL scanning  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers
On Tue, Apr 23, 2019 at 11:43:05AM -0400, Stephen Frost wrote:
>Greetings,
>
>* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> On Tue, Apr 23, 2019 at 10:22:46AM -0400, Stephen Frost wrote:
>> >* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> >>On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
>> >>>On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote:
>> >>>>> Oh.  Well, I already explained my algorithm for doing that upthread,
>> >>>>> which I believe would be quite cheap.
>> >>>>>
>> >>>>> 1. When you generate the .modblock files, stick all the block
>> >>>>> references into a buffer.  qsort().  Dedup.  Write out in sorted
>> >>>>> order.
>> >>>>
>> >>>>Having all of the block references in a sorted order does seem like it
>> >>>>would help, but would also make those potentially quite a bit larger
>> >>>>than necessary (I had some thoughts about making them smaller elsewhere
>> >>>>in this discussion).  That might be worth it though.  I suppose it might
>> >>>>also be possible to line up the bitmaps suggested elsewhere to do
>> >>>>essentially a BitmapOr of them to identify the blocks changed (while
>> >>>>effectively de-duping at the same time).
>> >>>
>> >>>I don't see why this would make them bigger than necessary.  If you
>> >>>sort by relfilenode/fork/blocknumber and dedup, then references to
>> >>>nearby blocks will be adjacent in the file.  You can then decide what
>> >>>format will represent that most efficiently on output.  Whether or not
>> >>>a bitmap is a better idea than a list of block numbers or something else
>> >>>depends on what percentage of blocks are modified and how clustered
>> >>>they are.
>> >>
>> >>Not sure I understand correctly - do you suggest deduplicating and
>> >>sorting the data before writing them into the .modblock files? Because
>> >>the sorting would make this information mostly useless for the recovery
>> >>prefetching use case I mentioned elsewhere. For that to work we need
>> >>information about both the LSN and block, in LSN order.
>> >
>> >I'm not sure I follow- why does the prefetching need to get the blocks
>> >in LSN order..?  Once the blocks that we know are going to change in the
>> >next segment have been identified, we could prefetch them all and have
>> >them ready for when replay gets to them.  I'm not sure that we
>> >specifically need to have them pre-fetched in the same order that the
>> >replay happens and it might even be better to fetch them in an order
>> >that's as sequential as possible to get them in as quickly as possible.
>>
>> That means we'd have to prefetch all blocks for the whole WAL segment,
>> which is pretty useless, IMO. A single INSERT (especially for indexes) is
>> often just ~100B, so a single 16MB segment can fit ~160k of them. Surely
>> we don't want to prefetch all of that at once? And it's even worse for
>> larger WAL segments, which are likely to get more common now that it's an
>> initdb option.
>
>Ah, yeah, I had been thinking about FPIs and blocks and figuring that
>16MB wasn't that bad but you're certainly right that we could end up
>touching a lot more blocks in a given WAL segment.
>
>> I'm pretty sure the prefetching needs to be more like "prefetch the next
>> 1024 blocks we'll need" or "prefetch blocks from the next X megabytes of
>> WAL". That doesn't mean we can't do some additional optimization (like
>> reordering them a bit), but there's a point where it gets detrimental
>> because the kernel will just evict some of the prefetched blocks before we
>> actually access them. And the larger the amount of blocks you prefetch,
>> the more likely that is.
>
>Since we're prefetching, maybe we shouldn't be just pulling the blocks
>into the filesystem cache but instead into something we have more
>control over..?
>

Well, yeah, and there was a discussion about that in the prefetching
thread IIRC. But in that case it's probably even more important to
prefetch only a fairly small chunk of blocks (instead of the whole WAL
segment worth of blocks), because shared buffers are usually much smaller
than the page cache. So we'd probably want some sort of small ring buffer
there, which makes the LSN information even more important.
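
Just to illustrate what I mean by a small ring buffer - a very rough
sketch, with the types, names and constants entirely made up (nothing like
actual backend code), limiting how far ahead of the replay LSN we queue
blocks:

#include <stdbool.h>
#include <stdint.h>

#define PREFETCH_RING_SIZE    1024                 /* hypothetical cap on queued blocks */
#define PREFETCH_LSN_WINDOW   (4 * 1024 * 1024)    /* look at most ~4MB of WAL ahead */

typedef struct PrefetchEntry
{
    uint64_t    lsn;            /* LSN of the WAL record touching the block */
    uint32_t    relfilenode;
    uint8_t     forknum;
    uint32_t    blocknum;
} PrefetchEntry;

typedef struct PrefetchRing
{
    PrefetchEntry entries[PREFETCH_RING_SIZE];
    int         head;           /* next free slot */
    int         tail;           /* oldest entry not yet replayed */
} PrefetchRing;

/*
 * Queue a block for prefetching, but only if it's not too far ahead of
 * the current replay LSN - otherwise it'd likely get evicted again before
 * replay actually needs it.
 */
static bool
prefetch_ring_add(PrefetchRing *ring, uint64_t replay_lsn, PrefetchEntry e)
{
    int         next = (ring->head + 1) % PREFETCH_RING_SIZE;

    if (next == ring->tail)
        return false;           /* ring full, try again after replay advances */
    if (e.lsn > replay_lsn + PREFETCH_LSN_WINDOW)
        return false;           /* too far ahead of replay */

    ring->entries[ring->head] = e;
    ring->head = next;
    return true;
}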

>> >>So if we want to allow that use case to leverage this infrastructure, we
>> >>need to write the .modfiles kinda "raw" and do this processing in some
>> >>later step.
>> >
>> >If we really need the LSN info for the blocks, then we could still
>> >de-dup, picking the 'first modified in this segment at LSN X', or keep
>> >both first and last, or I suppose every LSN if we really want, and then
>> >have that information included with the other information about the
>> >block.  Downstream clients could then sort based on the LSN info if they
>> >want to have a list of blocks in sorted-by-LSN-order.
>>
>> Possibly. I don't think keeping just the first block occurrence is enough,
>> particularly for large WAL segment sizes, but I agree we can reduce the
>> amount of information a bit (say, ignore references that are less than
>> 1MB apart or so). We just can't remove the LSN information entirely.
>
>Yeah, it depends on how much memory we want to use for prefetching and
>how large the WAL segments are.
>

Right.
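
FWIW something like this is roughly what I'd imagine for the sort/dedup
step that still keeps the first LSN per block - again just a sketch with
made-up types and names, not actual code:

#include <stdint.h>
#include <stdlib.h>

typedef struct BlockRef
{
    uint32_t    relfilenode;
    uint8_t     forknum;
    uint32_t    blocknum;
    uint64_t    first_lsn;      /* first WAL record in this segment touching the block */
} BlockRef;

static int
blockref_cmp(const void *a, const void *b)
{
    const BlockRef *x = a;
    const BlockRef *y = b;

    if (x->relfilenode != y->relfilenode)
        return x->relfilenode < y->relfilenode ? -1 : 1;
    if (x->forknum != y->forknum)
        return x->forknum < y->forknum ? -1 : 1;
    if (x->blocknum != y->blocknum)
        return x->blocknum < y->blocknum ? -1 : 1;
    /* same block: earliest LSN first, so dedup keeps the first occurrence */
    if (x->first_lsn != y->first_lsn)
        return x->first_lsn < y->first_lsn ? -1 : 1;
    return 0;
}

/* Sort and dedup in place; returns the number of distinct blocks. */
static size_t
dedup_blockrefs(BlockRef *refs, size_t n)
{
    size_t      out = 0;

    if (n == 0)
        return 0;

    qsort(refs, n, sizeof(BlockRef), blockref_cmp);

    for (size_t i = 1; i < n; i++)
    {
        if (refs[i].relfilenode == refs[out].relfilenode &&
            refs[i].forknum == refs[out].forknum &&
            refs[i].blocknum == refs[out].blocknum)
            continue;           /* duplicate block, keep the first (lowest LSN) entry */
        refs[++out] = refs[i];
    }
    return out + 1;
}

A consumer that cares about replay order can then simply re-sort the
deduplicated entries by first_lsn.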

>> >>Now, maybe the incremental backup use case is so much more important that
>> >>the right thing to do is ignore this other use case, and I'm OK with that -
>> >>as long as it's a conscious choice.
>> >
>> >I'd certainly like to have a way to prefetch, but I'm not entirely sure
>> >that it makes sense to combine it with this, so while I sketched out
>> >some ideas about how to do that above, I don't want it to come across as
>> >being a strong endorsement of the overall idea.
>> >
>> >For pre-fetching purposes, for an async streaming replica, it seems like
>> >the wal sender process could potentially just scan the WAL and have a
>> >list of blocks ready to pass to the replica which are "this is what's
>> >coming soon" or similar, rather than working with the modfiles at all.
>> >Not sure if we'd always send that or if we wait for the replica to ask
>> >for it.  Though for doing WAL replay from the archive, being able to ask
>> >for the modfile first to do prefetching before replaying the WAL itself
>> >could certainly be beneficial, so maybe it does make sense to have that
>> >information there too..  still not sure we really need it in LSN order
>> >or that we need to prefetch in LSN order though.
>>
>> Well, how exactly the prefetching should extract the block list is still
>> an open question. But this seems to deal with pretty much the same stuff,
>> so it might make sense to support the prefetching use case too. Or maybe
>> not, I'm not sure - but IMHO we should try.
>
>Yeah, makes sense to me to at least be thinking about it and seeing if
>there's a way to make adding prefetching easier.
>

Agreed.
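
For the walsender case, I guess the hint could be as simple as a small
message carrying the block list for the next chunk of WAL - purely
hypothetical, of course, no such protocol message exists today:

#include <stdint.h>

/* Hypothetical "blocks coming soon" hint sent ahead of the WAL data. */
typedef struct PrefetchHintMsg
{
    uint64_t    start_lsn;      /* WAL range the hint covers */
    uint64_t    end_lsn;
    uint32_t    nblocks;        /* number of block references that follow */
    /* followed by nblocks entries of (relfilenode, fork, block) */
} PrefetchHintMsg;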

>> I don't think it makes sense to do prefetching when the standby is fully
>> caught up with the primary. It gets more important when it falls behind by
>> a significant amount of WAL, at which point it's likely to fetch already
>> closed WAL segments - either from the primary or the archive. So either it
>> can get the block list from there, or it may extract the info on its own
>> (using some of the infrastructure used to build the modblock files).
>
>I don't know..  if we're able to keep up with the primary because we did
>a bunch of prefetching and that made WAL replay fast enough that we
>catch up, aren't we going to possibly fall right back behind again
>pretty quickly if we don't prefetch for the blocks that are about to be
>sent?  If we are caught up and there's no data to be sent then we don't
>have much choice, but as soon as there's WAL data to be sent our way,
>any more than just one block, then grabbing the list of blocks that we
>know we're going to need during replay seems to make a lot of sense to
>me.  In other words, I'd think we would just always do this and how much
>we do would be dictated by both how far behind we are and how much
>memory we want to allocate for this.
>

Well, the thing is that for prefetching to be possible you actually have
to be a bit behind. Otherwise you can't really look ahead to see which
blocks will be needed, right?

IMHO the main use case for prefetching is when there's a spike of activity
on the primary, making the standby fall behind, and it then takes hours to
catch up. I don't think the cases with just a couple of MBs of lag are the
problem prefetching is meant to address (if it helps there too, great).

Anyway, not sure if this is the right thread to discuss prefetching in
much more detail.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



