Thread: [HACKERS] On markers of changed data
Hi, hackers!

Currently I'm working on page-level incremental backups using the WAL-G codebase [0]. I have two questions that I cannot resolve myself.

An incremental backup is a set of changes that should be applied over a preexisting backup. I use the page LSN to decide whether a page should be backed up or not.

Question 1. FSM and VM.
As you can see here [1], FSM and VM files are exempt from incremental tracking and are backed up as whole files. I've done it this way because sanity checks [2] of page headers indicated a lot of "invalid" pages in FSM and VM files. But it seems that in some pages the headers are valid, with sane LSNs. Can I use LSNs as a history marker on FSM and VM pages? On a 1Tb backup I get something like 150Mb of FSM+VM, and that's kind of a lot.

Question 2. File dates.
Is it safe to use file modification time to track whether a file has changed since the previous backup? If the file has a date before the start of the previous backup, I just add it to the "skip list" [3]. My assumption is: every time a file is changed in the filesystem, its modification date is updated to a higher value. Is this assumption valid for most of the platforms and filesystems in use? Or can I check for this capability of the FS?

Thank you for your attention. I'll be glad to receive any information/pointers on this matter.

Best regards, Andrey Borodin, Yandex.

[0] https://github.com/wal-g/wal-g/pull/29
[1] https://github.com/wal-g/wal-g/pull/29/files#diff-d77406e827f5f947d4d4a1e6d76c1f4eR114
[2] https://github.com/wal-g/wal-g/pull/29/files#diff-d77406e827f5f947d4d4a1e6d76c1f4eR50
[3] https://github.com/wal-g/wal-g/pull/29/files#diff-f5c8f0067297f98eb5acc6e2c6b1b234R87

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
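For readers unfamiliar with the page-LSN check being described: `pd_lsn` is the first 8 bytes of PostgreSQL's PageHeaderData, stored as two uint32s (xlogid, xrecoff). A minimal sketch of the filter follows; the 8KB BLCKSZ and little-endian layout are assumptions (they match the default build on x86), and the function names are hypothetical, not WAL-G's actual code.

```python
import struct

PAGE_SIZE = 8192  # BLCKSZ; assumed default build value

def page_lsn(page: bytes) -> int:
    """Extract pd_lsn from a PostgreSQL page header.

    The first 8 bytes of PageHeaderData are pd_lsn, stored as two
    uint32s (xlogid, xrecoff); little-endian is assumed here.
    """
    xlogid, xrecoff = struct.unpack_from("<II", page, 0)
    return (xlogid << 32) | xrecoff

def changed_pages(relfile: bytes, since_lsn: int):
    """Yield (block number, page) for pages newer than since_lsn."""
    for blkno in range(len(relfile) // PAGE_SIZE):
        page = relfile[blkno * PAGE_SIZE:(blkno + 1) * PAGE_SIZE]
        if page_lsn(page) > since_lsn:
            yield blkno, page
```

An incremental backup would then ship only the pages yielded by `changed_pages`, keyed by block number.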
Andrey Borodin <x4mmm@yandex-team.ru> writes:
> Is it safe to use file modification time to track whether a file has
> changed since the previous backup?

I'd say no:

1. You don't know the granularity of the filesystem's timestamps, at least not without making unportable assumptions.

2. There's no guarantee that the system clock can't be set backwards.

3. It's not uncommon for filesystems to have optimizations whereby they skip or delay some updates of file mtimes. (I think this is usually optional, but you couldn't know whether it's turned on.)

#2 is probably the worst of these problems.

			regards, tom lane
On Fri, Oct 6, 2017 at 11:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrey Borodin <x4mmm@yandex-team.ru> writes:
>> Is it safe to use file modification time to track whether a file has
>> changed since the previous backup?
>
> I'd say no:
>
> 1. You don't know the granularity of the filesystem's timestamps, at least
> not without making unportable assumptions.
>
> 2. There's no guarantee that the system clock can't be set backwards.
>
> 3. It's not uncommon for filesystems to have optimizations whereby they
> skip or delay some updates of file mtimes.  (I think this is usually
> optional, but you couldn't know whether it's turned on.)
>
> #2 is probably the worst of these problems.

Or upwards. A simple example of things depending on clock changes is VM snapshotting. Any logic not depending on monotonic timestamps, with things like clock_gettime(CLOCK_MONOTONIC), is a lot of fun to investigate until you know that it is not using any monotonic logic... So the answer is *no*, do not depend on FS-level timestamps. The only sane method for Postgres is really to scan the page header LSNs, and of course you already know that.
-- 
Michael
Michael Paquier wrote:

> The only sane method for Postgres is really to scan the
> page header LSNs, and of course you already know that.

I hope the idea is not to have to scan every single page in the database, because that would be too slow. It should be possible to build this so that a single summary LSN is kept for a largish group of pages, allowing large portions of the database to be skipped from even being read if they are known not to contain any page newer than the previous backup.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
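The summary-LSN idea above can be sketched in a few lines. Everything here is illustrative: the group size, class, and method names are invented for the sketch, not part of any existing PostgreSQL structure.

```python
GROUP = 1024  # pages per summary entry; arbitrary illustration value

class LsnSummary:
    """Keep the highest page LSN per group of pages.

    A backup tool can skip reading an entire group when its summary
    LSN is not newer than the LSN of the previous backup.
    """
    def __init__(self):
        self.max_lsn = {}  # group number -> highest LSN seen in group

    def note_write(self, blkno: int, lsn: int):
        g = blkno // GROUP
        if lsn > self.max_lsn.get(g, 0):
            self.max_lsn[g] = lsn

    def groups_to_read(self, since_lsn: int):
        # only these groups can contain pages newer than the last backup
        return [g for g, lsn in self.max_lsn.items() if lsn > since_lsn]
```

With 1024 pages per entry, a scan touches one summary value instead of 8MB of data for every unchanged group.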
Tom, Michael,

* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Fri, Oct 6, 2017 at 11:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Andrey Borodin <x4mmm@yandex-team.ru> writes:
> >> Is it safe to use file modification time to track whether a file has
> >> changed since the previous backup?
> >
> > I'd say no:
> >
> > 1. You don't know the granularity of the filesystem's timestamps, at least
> > not without making unportable assumptions.
> >
> > 2. There's no guarantee that the system clock can't be set backwards.
> >
> > 3. It's not uncommon for filesystems to have optimizations whereby they
> > skip or delay some updates of file mtimes.  (I think this is usually
> > optional, but you couldn't know whether it's turned on.)
> >
> > #2 is probably the worst of these problems.
>
> Or upwards. A simple example of things depending on clock changes is
> VM snapshotting. Any logic not depending on monotonic timestamps,
> with things like clock_gettime(CLOCK_MONOTONIC), is a lot of fun to
> investigate until you know that it is not using any monotonic
> logic... So the answer is *no*, do not depend on FS-level timestamps.
> The only sane method for Postgres is really to scan the page header
> LSNs, and of course you already know that.

Really, these comments appear, at least to me, to be based on an incorrect assumption that only considers how tools like rsync use mtime. No, you can't trust rsync-based backups that look at mtime and only copy if the mtime of the source file is currently 'more recent' than the mtime of the destination file. That doesn't mean that mtime can't be used to perform incremental backups with PG, but care has to be taken when doing so to minimize the risk of a file getting skipped that should have been copied. There are a few things to do to minimize that risk:

Use mtime only as an indication of whether the file changed since the last time you looked at it; it doesn't matter if the mtime on the file is newer or older. If the mtime is *different*, then you can't trust that the contents are the same and you need to include the file in the backup. Of course, combine this with checking whether the file size has changed, but in PG there are lots of files of the same size, so that's not a terribly good indicator.

Further, you have to get the mtimes of all the files *before* you start backing them up. If you take the mtime of a file at the time you start actually copying it, then it could be modified while you copy it without the mtime being updated from when you initially pulled it (and that's not even talking about the concerns around the clock moving back and forth).

To address the granularity concern, you should also be sure to wait, after you collect all the mtimes but before actually starting the backup, for at least the granularity of the timestamps. Any optimization which delays setting the mtime would, certainly, still get around to updating the mtime before the next backup runs, and therefore that file might get copied even though it hadn't changed, but that's still technically correct, just slightly more work.

Lastly, don't trust any times which are from after the time that you collected the mtimes, either during the initial backup or when you are doing the subsequent incremental. Any file whose mtime is different *or* is from after the time the mtimes were collected should be copied.

This isn't to say that there isn't some risk to using mtime; there still is. If a backup is made of a file and its mtime collected, and then time moves backwards, and the file is modified again at the *exact* same time, leading the 'new' mtime to be identical to the 'old' mtime while the file's contents are different, and that file is not subsequently modified before the next backup happens, then the file might not be included in the backup even though it should be.

Other risks are just blatant corruption happening in the mtime field, or a kernel-level bug that doesn't update the mtime when it should, or the kernel somehow resetting the mtime back after the file has been changed, or someone explicitly setting the mtime back after changing a file, or perhaps other such attacks. Eliminating all of those risks isn't possible regardless of the solution; someone could go change the LSN on a page too, for example, and foil a tool based on that. These are risks which I'd love to remove, but they also strike me as quite small and ones which practical users are willing to accept for their incremental and differential backups, though it's a reason to also take full backups regularly.

As Alvaro notes downthread, it's also the only reasonable option available today. It'd be great to have a better solution, and perhaps one which summarizes the LSNs in each file would work and be better, but that would also only be available for PG11, at the earliest.

Thanks!

Stephen
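The precautions in the mail above can be condensed into a small sketch. This is a hedged illustration, not backup-tool code: the function names are hypothetical, and the rules implemented are exactly the ones described (compare mtimes for *inequality*, snapshot all mtimes before copying starts, and copy anything whose mtime is at or after the snapshot time).

```python
import os
import time

def collect_mtimes(paths):
    """Snapshot the mtimes of all files *before* the backup starts."""
    now = time.time()
    return now, {p: os.stat(p).st_mtime for p in paths}

def needs_copy(path, prev_snapshot_time, prev_mtimes):
    """Decide whether a file must go into the incremental backup.

    Copy the file if:
      - its mtime *differs* from the recorded one (newer OR older);
      - its mtime is from at or after the time the previous mtime
        snapshot was taken (it may have changed while being read);
      - we have no record of the file at all.
    """
    mtime = os.stat(path).st_mtime
    if path not in prev_mtimes:
        return True
    if mtime != prev_mtimes[path]:
        return True
    if mtime >= prev_snapshot_time:
        return True
    return False
```

Note this deliberately never asks "is the file newer than the last backup?", only "is it different from what was recorded?", which is what makes a backwards clock jump survivable.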
> On 6 Oct 2017, at 23:44, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Michael Paquier wrote:
>
>> The only sane method for Postgres is really to scan the
>> page header LSNs, and of course you already know that.
>
> I hope the idea is not to have to scan every single page in the
> database, because that would be too slow. It should be possible to
> build this so that a single summary LSN is kept for a largish group of
> pages, allowing large portions of the database to be skipped from even
> being read if they are known not to contain any page newer than the
> previous backup.

That's actually what pg_rman is doing for what it calls incremental backups (perhaps that would be a differential backup in PG terminology?), and the performance is bad, as you can imagine. We could have a dedicated LSN map to do such things with 4 bytes per page. I am still not convinced that this much facility and the potential bug risks are worth it though; Postgres already knows about differential backups if you shape them as a delta of WAL segments. I think that, in order to find an LSN map more convincing, we should first find other use cases where it could become useful. Some use cases may pop up with VACUUM, but I have not studied the question hard enough...
-- 
Michael
Michael Paquier wrote:

> That's actually what pg_rman is doing for what it calls incremental
> backups (perhaps that would be a differential backup in PG
> terminology?), and the performance is bad, as you can imagine. We could
> have a dedicated LSN map to do such things with 4 bytes per page. I am
> still not convinced that this much facility and the potential bug
> risks are worth it though; Postgres already knows about differential
> backups if you shape them as a delta of WAL segments. I think that, in
> order to find an LSN map more convincing, we should first find other
> use cases where it could become useful. Some use cases may pop up with
> VACUUM, but I have not studied the question hard enough...

The case I've discussed with barman developers is a large database (a couple dozen TBs should be enough) where a large fraction (say 95%) is read-only but there are many changes to the active part of the data, so that the WAL is more massive than the size of the active data.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro, Michael,

* Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
> Michael Paquier wrote:
> > That's actually what pg_rman is doing for what it calls incremental
> > backups (perhaps that would be a differential backup in PG
> > terminology?), and the performance is bad, as you can imagine. We could
> > have a dedicated LSN map to do such things with 4 bytes per page. I am
> > still not convinced that this much facility and the potential bug
> > risks are worth it though; Postgres already knows about differential
> > backups if you shape them as a delta of WAL segments. I think that, in
> > order to find an LSN map more convincing, we should first find other
> > use cases where it could become useful. Some use cases may pop up with
> > VACUUM, but I have not studied the question hard enough...
>
> The case I've discussed with barman developers is a large database
> (a couple dozen TBs should be enough) where a large fraction (say 95%)
> is read-only but there are many changes to the active part of the data,
> so that the WAL is more massive than the size of the active data.

Yes, we've seen environments like that also.

Thanks!

Stephen
Tom, Alvaro, Michael, and especially Stephen, thank you for your valuable comments. I feel enlightened about mtime. My takeaway is:

1. Any other marker would be better (it could be a WAL scan during archiving, some new LSN-based mechanics*, etc.)
2. mtime could be used, provided the precautions described by Stephen are taken.

But my other question still seems unanswered: can I use LSN logic for incrementing FSM and VM? It seems like most of the time there is a valid LSN.

* I like the idea of using something for both incr (diff) backups and VACUUM; it's worth thinking about.

Best regards, Andrey Borodin.
Andrey,

* Andrey Borodin (x4mmm@yandex-team.ru) wrote:
> But my other question still seems unanswered: can I use LSN logic for
> incrementing FSM and VM? It seems like most of the time there is a valid LSN.

I haven't gone and audited it myself, but I would certainly expect you to be able to use the LSN for everything which is WAL'd. If you have cases where that's not the case, it'd be useful to see them.

Thanks!

Stephen
> On 8 Oct 2017, at 20:11, Stephen Frost <sfrost@snowman.net> wrote:
> * Andrey Borodin (x4mmm@yandex-team.ru) wrote:
>> But my other question still seems unanswered: can I use LSN logic for
>> incrementing FSM and VM? It seems like most of the time there is a valid LSN.
>
> I haven't gone and audited it myself, but I would certainly expect you
> to be able to use the LSN for everything which is WAL'd. If you have
> cases where that's not the case, it'd be useful to see them.

Thanks, Stephen, this actually pointed out what to look for:
VM is WAL-logged [0]
FSM is not [1]

Now I have everything I wanted, and I'm going back to coding :)

Best regards, Andrey Borodin.

[0] https://github.com/postgres/postgres/blob/113b0045e20d40f726a0a30e33214455e4f1385e/src/backend/access/heap/visibilitymap.c#L315
[1] https://github.com/postgres/postgres/blob/1d25779284fe1ba08ecd57e647292a9deb241376/src/backend/storage/freespace/freespace.c#L593
On Mon, Oct 9, 2017 at 2:23 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> I haven't gone and audited it myself, but I would certainly expect you
>> to be able to use the LSN for everything which is WAL'd. If you have
>> cases where that's not the case, it'd be useful to see them.
>
> Thanks, Stephen, this actually pointed out what to look for:
> VM is WAL-logged [0]
> FSM is not [1]

If you are willing to go down this road, here are my takes on the matter:
- Any LSN map should be in a different file than the FSM and VM. The VM uses 2 bits per block now, and the FSM 8 bits. The FSM is meant to remain small, so trying to plug more data into it is not welcome. The VM bits also are dedicated purely to visibility matters; we may not want to use the VM for that either.
- Those should not be logged, as we would end up tracking down WAL records for things that themselves track the effects of other records.
-- 
Michael
Hi Michael!

I was asking about FSM and VM not because I wanted to place something there, but because I was looking for a way to back up FSM and VM efficiently. VM can be copied page-incrementally; FSM cannot.

> On 9 Oct 2017, at 17:28, Michael Paquier <michael.paquier@gmail.com> wrote:
>
>> VM is WAL-logged
>> FSM is not
>
> If you are willing to go down this road, here are my takes on the matter:
> - Any LSN map should be in a different file than the FSM and VM. The VM
> uses 2 bits per block now, and the FSM 8 bits. The FSM is meant to
> remain small, so trying to plug more data into it is not welcome. The
> VM bits also are dedicated purely to visibility matters; we may not
> want to use the VM for that either.
> - Those should not be logged, as we would end up tracking down
> WAL records for things that themselves track the effects of other
> records.

But the design you are describing resembles PTRACK [0]: a fork for page change tracking, not WAL-logged, but crash safe due to recovery from other WAL.

Best regards, Andrey Borodin.

[0] https://gist.github.com/stalkerg/ab833d94e2f64df241f1835651e06e4b
On Sat, Oct 7, 2017 at 6:34 AM, Stephen Frost <sfrost@snowman.net> wrote:
>> > That's actually what pg_rman is doing for what it calls incremental
>> > backups (perhaps that would be a differential backup in PG
>> > terminology?), and the performance is bad, as you can imagine. We could
>> > have a dedicated LSN map to do such things with 4 bytes per page. I am
>> > still not convinced that this much facility and the potential bug
>> > risks are worth it though; Postgres already knows about differential
>> > backups if you shape them as a delta of WAL segments. I think that, in
>> > order to find an LSN map more convincing, we should first find other
>> > use cases where it could become useful. Some use cases may pop up with
>> > VACUUM, but I have not studied the question hard enough...
>>
>> The case I've discussed with barman developers is a large database
>> (a couple dozen TBs should be enough) where a large fraction (say 95%)
>> is read-only but there are many changes to the active part of the data,
>> so that the WAL is more massive than the size of the active data.
>
> Yes, we've seen environments like that also.

I'm pretty sure that those are cases where there are many more FPIs than might be expected, due to a lack of locality. (UUID PKs can make the size of the WAL balloon, for example.)

-- 
Peter Geoghegan
On 8 October 2017 at 08:52, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> 1. Any other marker would be better (it could be a WAL scan during archiving, some new LSN-based mechanics*, etc.)

The general shape of what I would like to see is some log which lists where each checkpoint starts and ends and what blocks are modified since the previous checkpoint. Then, to generate an incremental backup from any point in time to the current one, you union all the block lists between them and fetch those blocks.

There are other ways of using this aside from incremental backups on disk too: you could imagine a replica that has fallen behind requesting the block lists and then fetching just those blocks instead of needing to receive and apply all the WAL. Or possibly even making a cost-based decision between the two, depending on which would be faster.

It would also be useful for going in the reverse direction: look up all the records (or just the last record) that modified a given block. Instead of having to scan all the WAL, you would only need to scan the checkpoint eras that had touched that block.

-- 
greg
Greg Stark wrote:

> The general shape of what I would like to see is some log which lists
> where each checkpoint starts and ends and what blocks are modified
> since the previous checkpoint. Then, to generate an incremental backup
> from any point in time to the current one, you union all the block lists
> between them and fetch those blocks. There are other ways of using
> this aside from incremental backups on disk too: you could imagine a
> replica that has fallen behind requesting the block lists and then
> fetching just those blocks instead of needing to receive and apply all
> the WAL.

Hmm, this sounds pretty clever. And we already have the blocks touched by each record thanks to the work for pg_rewind (so we don't have to do any nasty tricks like the stuff Suzuki-san did for pg_lesslog, where each WAL record had to be processed individually to know what blocks it referenced), so it shouldn't be *too* difficult ...

> It would also be useful for going in the reverse direction: look up
> all the records (or just the last record) that modified a given block.

Well, a LSN map is what I was suggesting.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro,

* Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
> Greg Stark wrote:
>
> > The general shape of what I would like to see is some log which lists
> > where each checkpoint starts and ends and what blocks are modified
> > since the previous checkpoint. Then, to generate an incremental backup
> > from any point in time to the current one, you union all the block lists
> > between them and fetch those blocks. There are other ways of using
> > this aside from incremental backups on disk too: you could imagine a
> > replica that has fallen behind requesting the block lists and then
> > fetching just those blocks instead of needing to receive and apply all
> > the WAL.
>
> Hmm, this sounds pretty clever. And we already have the blocks touched
> by each record thanks to the work for pg_rewind (so we don't have to do
> any nasty tricks like the stuff Suzuki-san did for pg_lesslog, where
> each WAL record had to be processed individually to know what blocks it
> referenced), so it shouldn't be *too* difficult ...

Yeah, it sounds interesting, but I was just chatting w/ David about it and we were thinking about how checkpoints are really done rather often, so you end up with quite a few of these lists being out there.

Now, if the lists were always kept in a sorted fashion, then perhaps we would be able to essentially merge-sort them all back together and de-dup that way, but even then, you're talking about an awful lot if you're looking at daily incrementals: that's 288 standard 5-minute checkpoints, each with some 128k pages changed, assuming max_wal_size of 1GB, and I think we can all agree that the default max_wal_size is for rather small systems. That ends up being something around 2MB per checkpoint to store the pages in, or half a gig per day, just to keep track of the pages which changed in each checkpoint across that day. There's a bit of hand-waving in there, but I don't think it takes all that much to reach the conclusion that this might not be the best approach.

David and I were kicking around the notion of a 'last LSN' which is kept on a per-relation basis, but, of course, that ends up not being granular enough, and it would likely be a source of contention unless we could work out a way to make it lazily updated somehow, or similar.

> > It would also be useful for going in the reverse direction: look up
> > all the records (or just the last record) that modified a given block.
>
> Well, a LSN map is what I was suggesting.

Not sure I entirely followed what you were getting at here..?

Thanks!

Stephen
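The back-of-the-envelope math above checks out if one assumes 8KB pages and roughly 16 bytes to record each changed page (the mail doesn't state a per-entry size, but 2MB / 128k pages works out to 16 bytes):

```python
BLCKSZ = 8192                                    # assumed default page size
max_wal_size = 1 << 30                           # 1GB
pages_per_checkpoint = max_wal_size // BLCKSZ    # 131072, i.e. "128k pages"
entry_bytes = 16                                 # assumed per-entry size
per_checkpoint = pages_per_checkpoint * entry_bytes   # "around 2MB per checkpoint"
checkpoints_per_day = 24 * 60 // 5               # 288 five-minute checkpoints
per_day = per_checkpoint * checkpoints_per_day   # ~576MB, the "half a gig per day"
```

So the estimate of half a gigabyte of bookkeeping per day follows directly from the stated assumptions.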
On 10 October 2017 at 23:50, Stephen Frost <sfrost@snowman.net> wrote:
> Yeah, it sounds interesting, but I was just chatting w/ David about it
> and we were thinking about how checkpoints are really done rather often,
> so you end up with quite a few of these lists being out there.
>
> Now, if the lists were always kept in a sorted fashion, then perhaps we
> would be able to essentially merge-sort them all back together and
> de-dup that way, but even then, you're talking about an awful lot if
> you're looking at daily incrementals: that's 288 standard 5-minute
> checkpoints, each with some 128k pages changed, assuming max_wal_size of
> 1GB, and I think we can all agree that the default max_wal_size is for
> rather small systems. That ends up being something around 2MB per
> checkpoint to store the pages in, or half a gig per day, just to keep
> track of the pages which changed in each checkpoint across that day.

I was actually imagining a bitmap, probably one for each 1GB piece of each table. That's probably how you would maintain this data in memory anyway. After compression it should be fairly small: you'll probably be modifying the same blocks frequently or doing bulk loads which touch a consecutive range of blocks. But that's still about the same amount of data.

But probably you don't want to actually keep every checkpoint anyway. The nice thing about the change lists is that they tend to reach a maximum size regardless of how long a time range they span, so if you keep one change list for every 10 checkpoints or every 100 checkpoints you could reduce the storage needs and only lose time precision.

-- 
greg
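Greg's per-segment bitmap is compact: a 1GB segment holds 131072 8KB blocks, so one dirty-page bitmap is 16KB before compression. A sketch, with invented names (this is not an existing PostgreSQL structure):

```python
BLCKSZ = 8192
SEGMENT_BLOCKS = (1 << 30) // BLCKSZ   # blocks per 1GB table segment

class SegmentBitmap:
    """One dirty-block bitmap per 1GB table segment.

    131072 blocks per segment -> a 16KB bitmap before compression.
    """
    def __init__(self):
        self.bits = bytearray(SEGMENT_BLOCKS // 8)

    def mark(self, blkno: int):
        self.bits[blkno // 8] |= 1 << (blkno % 8)

    def is_set(self, blkno: int) -> bool:
        return bool(self.bits[blkno // 8] & (1 << (blkno % 8)))

    def union(self, other: "SegmentBitmap"):
        # unioning the block lists of several checkpoints is a bytewise OR
        for i, b in enumerate(other.bits):
            self.bits[i] |= b
```

The `union` method is the "union all the block lists between them" step from Greg's earlier mail, and it is why merging every-10th-checkpoint lists loses only time precision, not correctness.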
On Fri, Oct 6, 2017 at 10:34 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Fri, Oct 6, 2017 at 11:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'd say no:
>>
>> 1. You don't know the granularity of the filesystem's timestamps, at least
>> not without making unportable assumptions.
>>
>> 2. There's no guarantee that the system clock can't be set backwards.
>>
>> 3. It's not uncommon for filesystems to have optimizations whereby they
>> skip or delay some updates of file mtimes.  (I think this is usually
>> optional, but you couldn't know whether it's turned on.)
>>
>> #2 is probably the worst of these problems.
>
> Or upwards. A simple example of things depending on clock changes is
> VM snapshotting. Any logic not depending on monotonic timestamps,
> with things like clock_gettime(CLOCK_MONOTONIC), is a lot of fun to
> investigate until you know that it is not using any monotonic
> logic... So the answer is *no*, do not depend on FS-level timestamps.
> The only sane method for Postgres is really to scan the page header
> LSNs, and of course you already know that.

It might also be worth noting that the FSM is not WAL-logged, so it probably either doesn't have page LSNs or they are not meaningful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello!

> On 9 Oct 2017, at 10:23, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Thanks, Stephen, this actually pointed out what to look for:
> VM is WAL-logged [0]
> FSM is not [1]
>
> [0] https://github.com/postgres/postgres/blob/113b0045e20d40f726a0a30e33214455e4f1385e/src/backend/access/heap/visibilitymap.c#L315
> [1] https://github.com/postgres/postgres/blob/1d25779284fe1ba08ecd57e647292a9deb241376/src/backend/storage/freespace/freespace.c#L593

After tests of binary equivalence before and after backup, I've come to the conclusion that the Visibility Map cannot be backed up incrementally: its bits are unset without a page LSN bump. This can lead to wrong results from Index Only Scans executed on freshly restored backups.

In my implementation of incremental backup in WAL-G, I will disable any FSM, VM and XACT/CLOG incrementation.

Posting this for the record, so that if someone goes this way the info will be available.

Thank you for your attention.

Best regards, Andrey Borodin.