Re: finding changed blocks using WAL scanning - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: finding changed blocks using WAL scanning
Date
Msg-id 20190422234445.s7mxt6xwfmumzlge@momjian.us
Whole thread Raw
In response to Re: finding changed blocks using WAL scanning  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: finding changed blocks using WAL scanning
List pgsql-hackers
On Tue, Apr 23, 2019 at 01:21:27AM +0200, Tomas Vondra wrote:
> On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
> > On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote:
> > > > Oh.  Well, I already explained my algorithm for doing that upthread,
> > > > which I believe would be quite cheap.
> > > >
> > > > 1. When you generate the .modblock files, stick all the block
> > > > references into a buffer.  qsort().  Dedup.  Write out in sorted
> > > > order.
> > > 
> > > Having all of the block references in a sorted order does seem like it
> > > would help, but would also make those potentially quite a bit larger
> > > than necessary (I had some thoughts about making them smaller elsewhere
> > > in this discussion).  That might be worth it though.  I suppose it might
> > > also be possible to line up the bitmaps suggested elsewhere to do
> > > essentially a BitmapOr of them to identify the blocks changed (while
> > > effectively de-duping at the same time).
> > 
> > I don't see why this would make them bigger than necessary.  If you
> > sort by relfilenode/fork/blocknumber and dedup, then references to
> > nearby blocks will be adjacent in the file.  You can then decide what
> > format will represent that most efficiently on output.  Whether or not
> > a bitmap is better idea than a list of block numbers or something else
> > depends on what percentage of blocks are modified and how clustered
> > they are.
> > 
> 
> Not sure I understand correctly - do you suggest to deduplicate and sort
> the data before writing them into the .modblock files? Because that the
> the sorting would make this information mostly useless for the recovery
> prefetching use case I mentioned elsewhere. For that to work we need
> information about both the LSN and block, in the LSN order.
> 
> So if we want to allow that use case to leverage this infrastructure, we
> need to write the .modfiles kinda "raw" and do this processing in some
> later step.
> 
> Now, maybe the incremental backup use case is so much more important the
> right thing to do is ignore this other use case, and I'm OK with that -
> as long as it's a conscious choice.

I think the concern is that the more graunular the modblock files are
(with less de-duping), the larger they will be.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: finding changed blocks using WAL scanning
Next
From: Mikhail Bautin
Date:
Subject: memory leak checking