Re: finding changed blocks using WAL scanning - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: finding changed blocks using WAL scanning |
Date | |
Msg-id | CA+TgmoaD=xx=QRm2HA81n4ODD-+Gk1a_mp1S9NQEgBsHVgpu7A@mail.gmail.com |
In response to | Re: finding changed blocks using WAL scanning (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: finding changed blocks using WAL scanning |
List | pgsql-hackers |
On Mon, Apr 22, 2019 at 9:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
> For this particular use case, wouldn't you want to read the WAL itself
> and use that to issue prefetch requests? Because if you use the
> .modblock files, the data file blocks will end up in memory but the
> WAL blocks won't, and you'll still be waiting for I/O.

I'm still interested in the answer to this question, but I don't see a reply that specifically concerns it. Apologies if I have missed one.

Stepping back a bit, I think that the basic issue under discussion here is how granular you want your .modblock files. At one extreme, one can imagine an application that wants to know exactly which blocks were accessed at exactly which LSNs. At the other extreme, if you want to run a daily incremental backup, you just want to know which blocks have been modified between the start of the previous backup and the start of the current backup - i.e. sometime in the last ~24 hours. These are quite different things.

When you only want approximate information - is there a chance that this block was changed within this LSN range, or not? - you can sort and deduplicate in advance; when you want exact information, you cannot do that. Furthermore, if you want exact information, you must store an LSN for every record; if you want approximate information, you emit a file for each LSN range and consider it sufficient to know that the change happened somewhere within the range of LSNs encompassed by that file.

It's pretty clear in my mind that what I want to do here is provide approximate information, not exact information. Being able to sort and deduplicate in advance seems critical to making something like this work on high-velocity systems. If you are generating a terabyte of WAL between incremental backups, and you don't do any sorting or deduplication before the point where you actually try to generate the modified block map, you are going to need a whole lot of memory (and CPU time, though that's less critical, I think) to process all of that data. If you can read modblock files which are already sorted and deduplicated, you can generate results incrementally and send them to the client incrementally, and you never need more than some fixed amount of memory no matter how much data you are processing (see the sketch below).

While I'm convinced that this particular feature should provide approximate rather than exact information, the degree of approximation is up for debate, and maybe it's best to just make that configurable. Some applications might work best with small modblock files covering only ~16MB of WAL each, or even less, while others might prefer larger quanta, say 1GB or even more.

For incremental backup, I believe that the right quantum will depend on the system's velocity. On a system that isn't very busy, fine-grained modblock files will make incremental backup more efficient: if each modblock file covers only 16MB of WAL, and the backup happens to start someplace in the middle of that 16MB, then you'll only be including 16MB or less of unnecessary block references in the backup, so you won't incur much extra work. On the other hand, on a busy system, you probably do not want such a small quantum, because you will end up with gazillions of modblock files, and that will be hard to manage. It could also cause performance problems, because merging data from a couple of hundred files is fine, but merging data from a couple of hundred thousand files is going to be inefficient.
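To make the bounded-memory claim concrete, here is a minimal sketch - not the proposed patch, and not a real on-disk format. It assumes each modblock file holds fixed-size block-reference keys (collapsed here to a single uint64, standing in for a real relation/fork/block record), already sorted and deduplicated within the file, and merges any number of such files while emitting each distinct reference exactly once. Memory use is proportional to the number of input files, not to the volume of WAL they summarize:

```c
/*
 * Hedged sketch only.  Merges any number of "modblock" files, each of which
 * is assumed to contain uint64 keys in ascending order with no duplicates,
 * and prints every distinct key exactly once.  The minimum is found with a
 * linear scan; a real implementation would switch to a binary heap once the
 * merge order grows large.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	FILE	   *fp;
	uint64_t	key;		/* current record from this file */
	int			valid;		/* does 'key' hold a record? */
} MergeSource;

static void
advance(MergeSource *src)
{
	src->valid = (fread(&src->key, sizeof(uint64_t), 1, src->fp) == 1);
}

int
main(int argc, char **argv)
{
	int			nfiles = argc - 1;
	MergeSource *src = malloc(nfiles * sizeof(MergeSource));
	uint64_t	last = 0;
	int			emitted = 0;

	for (int i = 0; i < nfiles; i++)
	{
		src[i].fp = fopen(argv[i + 1], "rb");
		src[i].valid = 0;
		if (src[i].fp != NULL)
			advance(&src[i]);
	}

	for (;;)
	{
		int			min = -1;

		/* pick the smallest current key across all inputs */
		for (int i = 0; i < nfiles; i++)
			if (src[i].valid && (min < 0 || src[i].key < src[min].key))
				min = i;
		if (min < 0)
			break;				/* every input is exhausted */

		/* emit each distinct key once, even if several files contain it */
		if (!emitted || src[min].key != last)
		{
			printf("%llu\n", (unsigned long long) src[min].key);
			last = src[min].key;
			emitted = 1;
		}
		advance(&src[min]);
	}

	for (int i = 0; i < nfiles; i++)
		if (src[i].fp != NULL)
			fclose(src[i].fp);
	free(src);
	return 0;
}
```

The per-file state is just one open file descriptor and one buffered record, which is why a few hundred inputs are cheap but a few hundred thousand are not - that's where the merge-order and file-descriptor concerns below come in.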
My experience hacking on and testing tuplesort.c a few years ago (with valuable tutelage from Peter Geoghegan) showed me that there is a slow drop-off in efficiency as the merge order increases -- and in this case, at some point you will blow out the size of the OS file descriptor table and have to start opening and closing files every time you access a different one, which will be unpleasant. Finally, deduplication will tend to be more effective across larger numbers of block references, at least for some access patterns.

So all of that is to say that if somebody wants modblock files each of which covers 1MB of WAL, I think that the same tools I'm proposing to build here for incremental backup could support that use case with just a configuration change. Moreover, the resulting files would still be usable by the incremental backup engine. So that's good: the same system can, at least to some extent, be reused for whatever other purposes people have for knowing about modified blocks. On the other hand, the incremental backup engine will likely not cope smoothly with having hundreds of thousands or millions of modblock files shoved down its gullet, so if there is a dramatic difference in the granularity requirements of different consumers, another approach is probably indicated. In particular, if some consumer wants to see block references in the exact order in which they appear in the WAL, or wants to know the exact LSN of each reference, it's probably best to go a different route. For example, pg_waldump could grow a new option which spits out just the block references, in a format designed to be easily machine-parseable; or a hypothetical background worker that does prefetching for recovery could just contain its own copy of the xlogreader machinery.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company