Re: finding changed blocks using WAL scanning - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: finding changed blocks using WAL scanning
Date
Msg-id 20190423152729.gkat5d3lvyro72sp@development
Whole thread Raw
In response to Re: finding changed blocks using WAL scanning  (Stephen Frost <sfrost@snowman.net>)
Responses Re: finding changed blocks using WAL scanning
List pgsql-hackers
On Tue, Apr 23, 2019 at 10:22:46AM -0400, Stephen Frost wrote:
>Greetings,
>
>* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
>> >On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote:
>> >>> Oh.  Well, I already explained my algorithm for doing that upthread,
>> >>> which I believe would be quite cheap.
>> >>>
>> >>> 1. When you generate the .modblock files, stick all the block
>> >>> references into a buffer.  qsort().  Dedup.  Write out in sorted
>> >>> order.
>> >>
>> >>Having all of the block references in a sorted order does seem like it
>> >>would help, but would also make those potentially quite a bit larger
>> >>than necessary (I had some thoughts about making them smaller elsewhere
>> >>in this discussion).  That might be worth it though.  I suppose it might
>> >>also be possible to line up the bitmaps suggested elsewhere to do
>> >>essentially a BitmapOr of them to identify the blocks changed (while
>> >>effectively de-duping at the same time).
>> >
>> >I don't see why this would make them bigger than necessary.  If you
>> >sort by relfilenode/fork/blocknumber and dedup, then references to
>> >nearby blocks will be adjacent in the file.  You can then decide what
>> >format will represent that most efficiently on output.  Whether or not
>> >a bitmap is better idea than a list of block numbers or something else
>> >depends on what percentage of blocks are modified and how clustered
>> >they are.
>>
>> Not sure I understand correctly - do you suggest to deduplicate and sort
>> the data before writing them into the .modblock files? Because that the
>> the sorting would make this information mostly useless for the recovery
>> prefetching use case I mentioned elsewhere. For that to work we need
>> information about both the LSN and block, in the LSN order.
>
>I'm not sure I follow- why does the prefetching need to get the blocks
>in LSN order..?  Once the blocks that we know are going to change in the
>next segment have been identified, we could prefetch them all and have
>them ready for when replay gets to them.  I'm not sure that we
>specifically need to have them pre-fetched in the same order that the
>replay happens and it might even be better to fetch them in an order
>that's as sequential as possible to get them in as quickly as possible.
>

That means we'd have to prefetch all blocks for the whole WAL segment,
which is pretty useless, IMO. A single INSERT (especially for indexes) is
often just ~100B, so a single 16MB segment can fit ~160k of them. Surely
we don't want to prefetch all of that at once? And it's even worse for
larger WAL segments, which are likely to get more common now that it's an
initdb option.

I'm pretty sure the prefetching needs to be more like "prefetch the next
1024 blocks we'll need" or "prefetch blocks from the next X megabytes of
WAL". That doesn't mean we can't do some additional optimization (like
reordering them a bit), but there's a point where it gets detrimental
because the kernel will just evict some of the prefetched blocks before we
actually access them. And the larger amount of blocks you prefetch the
more likely that is.


>> So if we want to allow that use case to leverage this infrastructure, we
>> need to write the .modfiles kinda "raw" and do this processing in some
>> later step.
>
>If we really need the LSN info for the blocks, then we could still
>de-dup, picking the 'first modified in this segment at LSN X', or keep
>both first and last, or I suppose every LSN if we really want, and then
>have that information included with the other information about the
>block.  Downstream clients could then sort based on the LSN info if they
>want to have a list of blocks in sorted-by-LSN-order.
>

Possibly. I don't think keeping just the first block occurence is enough,
particularly for large WAL segmnent sizes, but I agree we can reduce the
stuff a bit (say, ignore references that are less than 1MB apart or so).
We just can't remove all the LSN information entirely.

>> Now, maybe the incremental backup use case is so much more important the
>> right thing to do is ignore this other use case, and I'm OK with that -
>> as long as it's a conscious choice.
>
>I'd certainly like to have a way to prefetch, but I'm not entirely sure
>that it makes sense to combine it with this, so while I sketched out
>some ideas about how to do that above, I don't want it to come across as
>being a strong endorsement of the overall idea.
>
>For pre-fetching purposes, for an async streaming replica, it seems like
>the wal sender process could potentially just scan the WAL and have a
>list of blocks ready to pass to the replica which are "this is what's
>coming soon" or similar, rather than working with the modfiles at all.
>Not sure if we'd always send that or if we wait for the replica to ask
>for it.  Though for doing WAL replay from the archive, being able to ask
>for the modfile first to do prefetching before replaying the WAL itself
>could certainly be beneficial, so maybe it does make sense to have that
>information there too..  still not sure we really need it in LSN order
>or that we need to prefetch in LSN order though.
>

Well, how exactly should the prefetching extract the block list is still
an open question. But this seems to deal with pretty much the same stuff,
so it might make sense to support the prefetching use case too. Or maybe
not, I'm not sure - but IMHO we should try.

I don't think it makes sense to do prefetching when the standby is fully
caught up with the primary. It gets more important when it falls behind by
a significant amount of WAL, at which point it's likely to fetch already
closed WAL segments - either from primary or archive. So either it can get
the block list from there, or it may extract the info on it's own (using
some of the infrastructure used to build the mdblock files).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: finding changed blocks using WAL scanning
Next
From: Stephen Frost
Date:
Subject: Re: finding changed blocks using WAL scanning