Re: finding changed blocks using WAL scanning - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: finding changed blocks using WAL scanning
Msg-id: CA+TgmoaD=xx=QRm2HA81n4ODD-+Gk1a_mp1S9NQEgBsHVgpu7A@mail.gmail.com
In response to: Re: finding changed blocks using WAL scanning (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

On Mon, Apr 22, 2019 at 9:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
> For this particular use case, wouldn't you want to read the WAL itself
> and use that to issue prefetch requests?  Because if you use the
> .modblock files, the data file blocks will end up in memory but the
> WAL blocks won't, and you'll still be waiting for I/O.

I'm still interested in the answer to this question, but I don't see a
reply that specifically concerns it.  Apologies if I have missed one.

Stepping back a bit, I think that the basic issue under discussion
here is how granular you want your .modblock files.  At one extreme,
one can imagine an application that wants to know exactly which blocks
were accessed at exactly which LSNs.  At the other extreme, if you want
to run a daily incremental backup, you just want to know which blocks
have been modified between the start of the previous backup and the
start of the current backup - i.e. sometime in the last ~24 hours.
These are quite different things.  When you only want approximate
information - is there a chance that this block was changed within
this LSN range, or not? - you can sort and deduplicate in advance;
when you want exact information, you cannot do that.  Furthermore, if
you want exact information, you must store an LSN for every record; if
you want approximate information, you emit a file for each LSN range
and consider it sufficient to know that the change happened somewhere
within the range of LSNs encompassed by that file.
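
To make the approximate case concrete, here is a purely illustrative
sketch of what one entry in a sorted, deduplicated .modblock file might
look like; the field layout is hypothetical, not a proposal for the
actual on-disk format:

/*
 * Hypothetical fixed-size record for a .modblock file.  Entries are
 * sorted by (tablespace, database, relfilenode, fork, block) so that
 * consumers can merge and deduplicate by simple comparison.  The LSN
 * range covered by the file would live once in a file header rather
 * than in each entry, which is exactly what makes the information
 * approximate rather than exact.
 */
#include <stdint.h>

typedef struct ModBlockEntry
{
    uint32_t    spcNode;    /* tablespace OID */
    uint32_t    dbNode;     /* database OID */
    uint32_t    relNode;    /* relation filenode */
    uint32_t    forkNum;    /* main fork, FSM, visibility map, or init */
    uint32_t    blockNum;   /* block number within that fork */
} ModBlockEntry;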

It's pretty clear in my mind that what I want to do here is provide
approximate information, not exact information.  Being able to sort
and deduplicate in advance seems critical to making something like this
work on high-velocity systems.  If you are generating a
terabyte of WAL between incremental backups, and you don't do any
sorting or deduplication prior to the point when you actually try to
generate the modified block map, you are going to need a whole lot of
memory (and CPU time, though that's less critical, I think) to process
all of that data.  If you can read modblock files which are already
sorted and deduplicated, you can generate results incrementally and
send them to the client incrementally and you never really need more
than some fixed amount of memory no matter how much data you are
processing.
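
As a sketch of why sorted, deduplicated inputs keep memory bounded, here
is an illustrative merge loop over the hypothetical entry format above;
it buffers exactly one entry per input file no matter how much total
data flows through.  Error handling and the real file format are
omitted, and none of this is meant as the actual implementation:

/*
 * Merge N already-sorted, already-deduplicated .modblock files into one
 * sorted, deduplicated output stream.  Memory use is O(N): one buffered
 * entry per input, independent of the total volume of block references.
 */
#include <stdbool.h>
#include <stdio.h>

static int
modblock_cmp(const ModBlockEntry *a, const ModBlockEntry *b)
{
    if (a->spcNode != b->spcNode)
        return a->spcNode < b->spcNode ? -1 : 1;
    if (a->dbNode != b->dbNode)
        return a->dbNode < b->dbNode ? -1 : 1;
    if (a->relNode != b->relNode)
        return a->relNode < b->relNode ? -1 : 1;
    if (a->forkNum != b->forkNum)
        return a->forkNum < b->forkNum ? -1 : 1;
    if (a->blockNum != b->blockNum)
        return a->blockNum < b->blockNum ? -1 : 1;
    return 0;
}

static void
merge_modblock_files(FILE **in, int nfiles, FILE *out)
{
    ModBlockEntry heads[nfiles];
    bool        valid[nfiles];
    ModBlockEntry last;
    bool        have_last = false;

    for (int i = 0; i < nfiles; i++)
        valid[i] = (fread(&heads[i], sizeof(ModBlockEntry), 1, in[i]) == 1);

    for (;;)
    {
        int         best = -1;

        /* linear scan is fine for a few hundred inputs; a heap scales better */
        for (int i = 0; i < nfiles; i++)
            if (valid[i] &&
                (best < 0 || modblock_cmp(&heads[i], &heads[best]) < 0))
                best = i;
        if (best < 0)
            break;              /* all inputs exhausted */

        /* emit only when different from the previously emitted entry */
        if (!have_last || modblock_cmp(&heads[best], &last) != 0)
        {
            fwrite(&heads[best], sizeof(ModBlockEntry), 1, out);
            last = heads[best];
            have_last = true;
        }
        valid[best] =
            (fread(&heads[best], sizeof(ModBlockEntry), 1, in[best]) == 1);
    }
}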

While I'm convinced that this particular feature should provide
approximate rather than exact information, the degree of approximation
is up for debate, and maybe it's best to just make that configurable.
Some applications might work best with small modblock files covering
only ~16MB of WAL each, or even less, while others might prefer larger
quanta, say 1GB or even more.  For incremental backup, I believe that
the right quantum will depend on the system's velocity.  On a system that isn't
very busy, fine-grained modblock files will make incremental backup
more efficient.  If each modblock file covers only 16MB of data, and
the backup manages to start someplace in the middle of that 16MB, then
you'll be including at most ~16MB of WAL's worth of unnecessary block
references in the backup, so you won't incur much extra work.  On the other hand,
on a busy system, you probably do not want such a small quantum,
because you will then end up with gazillions of modblock files and that
will be hard to manage.  It could also have performance problems,
because merging data from a couple of hundred files is fine, but
merging data from a couple of hundred thousand files is going to be
inefficient.  My experience hacking on and testing tuplesort.c a few
years ago (with valuable tutelage by Peter Geoghegan) showed me that
there is a slow drop-off in efficiency as the merge order increases --
and in this case, at some point you will blow out the size of the OS
file descriptor table and have to start opening and closing files
every time you access a different one, and that will be unpleasant.
Finally, deduplication will tend to be more effective across larger
numbers of block references, at least on some access patterns.
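
To put rough numbers on that: assuming a terabyte of WAL between
incremental backups, a 16MB quantum works out to 65,536 modblock files
to merge, whereas a 1GB quantum works out to 1,024.  The former is well
past the default per-process open file limit on many systems (commonly
1024 on Linux), while the latter is comfortable.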

So all of that is to say that if somebody wants modblock files each of
which covers 1MB of WAL, I think that the same tools I'm proposing to
build here for incremental backup could support that use case with
just a configuration change.  Moreover, the resulting files would
still be usable by the incremental backup engine.  So that's good: the
same system can, at least to some extent, be reused for whatever other
purposes people want to know about modified blocks.  On the other
hand, the incremental backup engine will likely not cope smoothly with
having hundreds of thousands or millions of modblock files shoved down
its gullet, so if there is a dramatic difference in the granularity
requirements of different consumers, another approach is likely
indicated.  Especially if some consumer wants to see block references
in the exact order in which they appear in WAL, or wants to know the
exact LSN of each reference, it's probably best to go for a different
approach.  For example, pg_waldump could grow a new option which spits
out just the block references, in a format designed to be easily
machine-parseable; or a hypothetical background worker that does
prefetching for recovery could just contain its own copy of the
xlogreader machinery.
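
Just to illustrate the pg_waldump direction: its existing text output
already mentions each block a record touches, on lines of the form
"blkref #0: rel 1663/13593/16384 blk 0", so a consumer could scrape
those even today.  The sketch below parses that textual output; it is
illustration only, since the exact format is not a stable interface and
varies somewhat across versions and options:

/*
 * Illustrative scraper: read pg_waldump's text output on stdin and
 * print one line per block reference.  The textual format is not a
 * stable API, so treat this as a sketch, not a tool.
 *
 *     pg_waldump 000000010000000000000001 | ./extract_blkrefs
 */
#include <stdio.h>
#include <string.h>

int
main(void)
{
    char        line[8192];

    while (fgets(line, sizeof(line), stdin))
    {
        char       *p = line;

        /* one WAL record can reference several blocks, hence the loop */
        while ((p = strstr(p, "blkref #")) != NULL)
        {
            unsigned    spc,
                        db,
                        rel,
                        blk;
            char       *relpos = strstr(p, "rel ");
            char       *blkpos = strstr(p, "blk ");

            if (relpos && blkpos &&
                sscanf(relpos, "rel %u/%u/%u", &spc, &db, &rel) == 3 &&
                sscanf(blkpos, "blk %u", &blk) == 1)
                printf("%u/%u/%u block %u\n", spc, db, rel, blk);
            p += strlen("blkref #");
        }
    }
    return 0;
}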

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


