Thread: [HACKERS] Early locking option to parallel backup
Hello people,
I am sending a patch to improve parallel backup on larger databases.
THE PROBLEM
pg_dump was taking more than 24 hours to complete on one of my databases, so I began to research alternatives. Parallel backup reduced the backup time to a little less than an hour, but it failed almost every time because of concurrent queries that take exclusive locks. It is difficult to guarantee that my applications will not issue queries such as drop table, alter table, truncate table, create index or drop index for an hour, and I prefer not to build control mechanisms to that end if I can work around it.
THE SOLUTION
The patch creates a "--lock-early" option which makes pg_dump issue shared locks on all tables in the backup TOC at each parallel worker start. That way, the backup has a very small chance of failing; when it does fail, it happens in the first few seconds of the backup job. My backup scripts (not included here) are aware of that and retry the backup in case of failure.
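For illustration, the effect is that each worker connection issues something along these lines right after it starts, before any table data is requested (a sketch of the behaviour only; the table names are placeholders and the real statements are generated from the TOC):

    -- each parallel worker, at startup, takes one shared lock per table in the TOC
    LOCK TABLE public.orders      IN ACCESS SHARE MODE;
    LOCK TABLE public.customers   IN ACCESS SHARE MODE;
    LOCK TABLE public.order_items IN ACCESS SHARE MODE;
    -- ...and so on for every table selected for the dump.
    -- Without --lock-early, a worker requests each table lock (with NOWAIT) only when it is
    -- about to dump that table, leaving a window in which a queued ACCESS EXCLUSIVE request
    -- (e.g. from DROP TABLE, ALTER TABLE or TRUNCATE) makes that later request fail.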
TESTS
I have been using this technique in production for over a year now and it is working well for me.
Lucas
On Sun, Nov 5, 2017 at 7:17 PM, Lucas <lucas75@gmail.com> wrote:
> The patch creates a "--lock-early" option which makes pg_dump issue
> shared locks on all tables in the backup TOC at each parallel worker
> start. That way, the backup has a very small chance of failing; when it
> does fail, it happens in the first few seconds of the backup job. My
> backup scripts (not included here) are aware of that and retry the
> backup in case of failure.

You should register your patch in the next open commit fest, which will begin in January, if you are looking for feedback and review:
https://commitfest.postgresql.org/16/
--
Michael
On Sun, Nov 5, 2017 at 5:17 AM, Lucas <lucas75@gmail.com> wrote:
> The patch creates a "--lock-early" option which makes pg_dump issue
> shared locks on all tables in the backup TOC at each parallel worker
> start. That way, the backup has a very small chance of failing; when it
> does fail, it happens in the first few seconds of the backup job. My
> backup scripts (not included here) are aware of that and retry the
> backup in case of failure.

I wonder why we don't do this already ... and by default.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-11-05 17:38:39 -0500, Robert Haas wrote:
> On Sun, Nov 5, 2017 at 5:17 AM, Lucas <lucas75@gmail.com> wrote:
> > The patch creates a "--lock-early" option which makes pg_dump issue
> > shared locks on all tables in the backup TOC at each parallel worker
> > start. That way, the backup has a very small chance of failing; when it
> > does fail, it happens in the first few seconds of the backup job. My
> > backup scripts (not included here) are aware of that and retry the
> > backup in case of failure.
>
> I wonder why we don't do this already ... and by default.

Well, the current approach afaics requires #relations * 2 locks, whereas acquiring them in every worker would scale that with the number of workers.

IIUC the problem here is that even though a lock is already held by the main backend an independent locker's request will prevent the on-demand lock by the dump worker from being granted. It seems to me the correct fix here would be to somehow avoid the fairness logic in the parallel dump case - although I don't quite know how to best do so.

Greetings,

Andres Freund
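To make the fairness problem concrete, the failing interleaving looks roughly like this (session labels and the table name are illustrative; the NOWAIT mirrors how parallel pg_dump workers request their locks):

    -- session 1: pg_dump leader, holds its lock from the start of the run
    BEGIN;
    LOCK TABLE public.orders IN ACCESS SHARE MODE;         -- granted

    -- session 2: application DDL arriving mid-dump
    ALTER TABLE public.orders ADD COLUMN note text;        -- needs ACCESS EXCLUSIVE, queues behind session 1

    -- session 3: pg_dump worker, reaching this table later
    LOCK TABLE public.orders IN ACCESS SHARE MODE NOWAIT;  -- fails: the request would have to queue behind
                                                           -- session 2's waiting ACCESS EXCLUSIVE, even
                                                           -- though session 1 (the same dump) already
                                                           -- holds a compatible lock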
Andres Freund <andres@anarazel.de> writes:
> Well, the current approach afaics requires #relations * 2 locks, whereas
> acquiring them in every worker would scale that with the number of
> workers.

Yeah, that's gonna be a problem with this proposal.

> IIUC the problem here is that even though a lock is already
> held by the main backend an independent locker's request will prevent
> the on-demand lock by the dump worker from being granted. It seems to
> me the correct fix here would be to somehow avoid the fairness logic in
> the parallel dump case - although I don't quite know how to best do so.

I wonder if we couldn't somehow repurpose the work that was done for parallel workers' locks. Lots of security-type issues to be handled if we're to open that up to clients, but maybe it's solvable. For instance, maybe only allowing it to clients sharing the same snapshot would help.

			regards, tom lane
On Mon, Nov 6, 2017 at 4:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wonder if we couldn't somehow repurpose the work that was done for
> parallel workers' locks. Lots of security-type issues to be handled
> if we're to open that up to clients, but maybe it's solvable. For
> instance, maybe only allowing it to clients sharing the same snapshot
> would help.

Interesting idea. There's a bunch of holes that would need to be patched there; for instance, you can't have one session running DDL while somebody else has AccessShareLock. Parallel query relies on the parallel-mode restrictions to prevent that kind of thing from happening, but it would be strange (and likely somewhat broken) to try to enforce those here. It would be strange and probably bad if LOCK TABLE a; LOCK TABLE b in one session and LOCK TABLE b; LOCK TABLE a in another session failed to deadlock. In short, there's a big difference between a single session using multiple processes and multiple closely coordinated sessions.

Also, even if you did it, you still need a lot of PROCLOCKs. Workers don't need to take all locks up front because they can be assured of getting them later, but they've still got to lock the objects they actually want to access. Group locking aims to prevent deadlocks between cooperating processes; it is not a license to skip locking altogether.

None of which is to say that the problems don't feel related somehow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
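Spelled out, the lock-ordering scenario above is the classic two-session deadlock:

    -- session A:
    BEGIN;
    LOCK TABLE a;                -- ACCESS EXCLUSIVE on a, granted
    -- session B:
    BEGIN;
    LOCK TABLE b;                -- ACCESS EXCLUSIVE on b, granted
    -- session A:
    LOCK TABLE b;                -- waits for session B
    -- session B:
    LOCK TABLE a;                -- waits for session A; the deadlock detector aborts one session
    -- If A and B were members of a single lock group, neither LOCK would conflict and no
    -- deadlock would occur, which is the "strange and probably bad" outcome described above.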
On 2017-11-05 22:43:34 -0500, Tom Lane wrote:
> > IIUC the problem here is that even though a lock is already
> > held by the main backend an independent locker's request will prevent
> > the on-demand lock by the dump worker from being granted. It seems to
> > me the correct fix here would be to somehow avoid the fairness logic in
> > the parallel dump case - although I don't quite know how to best do so.
>
> I wonder if we couldn't somehow repurpose the work that was done for
> parallel workers' locks. Lots of security-type issues to be handled
> if we're to open that up to clients, but maybe it's solvable. For
> instance, maybe only allowing it to clients sharing the same snapshot
> would help.

Yea, I'd been wondering the same. I'm slightly worried that somehow tying multiple clients into parallel mode would cause a bunch of problems - that's not really the purpose of the code, and a number of its assumptions aren't quite right for that.

I'm not sure it really buys us much in contrast to just allowing a locker to specify that it's allowed to jump the lock queue for an ASL (AccessShareLock) if it has 'backup' rights or such. Or actually, just allow it as a general option to LOCK; there are plenty of other operational cases where the current "fair" behaviour is really annoying, e.g. when executing operational DDL/DML and such.

Greetings,

Andres Freund
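No such LOCK option exists today; purely as a sketch of the idea, with an invented keyword:

    -- Hypothetical syntax only: "SKIP QUEUE" is made up for illustration and is not part of
    -- any PostgreSQL release.
    LOCK TABLE public.orders IN ACCESS SHARE MODE;             -- today: honours the wait queue
    LOCK TABLE public.orders IN ACCESS SHARE MODE SKIP QUEUE;  -- proposed: granted as soon as the
                                                                -- locks currently held are compatible,
                                                                -- ignoring queued ACCESS EXCLUSIVE waiters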
Lucas,

* Lucas (lucas75@gmail.com) wrote:
> pg_dump was taking more than 24 hours to complete on one of my databases,
> so I began to research alternatives. Parallel backup reduced the backup
> time to a little less than an hour, but it failed almost every time
> because of concurrent queries that take exclusive locks. It is difficult
> to guarantee that my applications will not issue queries such as drop
> table, alter table, truncate table, create index or drop index for an
> hour, and I prefer not to build control mechanisms to that end if I can
> work around it.

I certainly understand the value of pg_dump-based backups, but have you considered doing file-based backups? That would avoid the need to do any in-database locking at all, and would give you the ability to do PITR too. Further, you could actually restore that backup to another system and then do a pg_dump there to get a logical representation (and this would test your physical database backup/restore process too...).

Thanks!

Stephen
On 05/11/2017 21:09, Andres Freund wrote:
> On 2017-11-05 17:38:39 -0500, Robert Haas wrote:
> > On Sun, Nov 5, 2017 at 5:17 AM, Lucas <lucas75@gmail.com> wrote:
> > > The patch creates a "--lock-early" option which makes pg_dump issue
> > > shared locks on all tables in the backup TOC at each parallel worker
> > > start. That way, the backup has a very small chance of failing; when it
> > > does fail, it happens in the first few seconds of the backup job. My
> > > backup scripts (not included here) are aware of that and retry the
> > > backup in case of failure.
> >
> > I wonder why we don't do this already ... and by default.
>
> Well, the current approach afaics requires #relations * 2 locks, whereas
> acquiring them in every worker would scale that with the number of
> workers.

Yes, that is why I proposed it as an option. As an option, it will not affect anyone who does not want to use it.

> IIUC the problem here is that even though a lock is already
> held by the main backend an independent locker's request will prevent
> the on-demand lock by the dump worker from being granted. It seems to
> me the correct fix here would be to somehow avoid the fairness logic in
> the parallel dump case - although I don't quite know how to best do so.

It seems natural to think of several connections in a synchronized snapshot as the same connection. Then it may be reasonable to grant a shared lock out of turn if any connection of the same shared snapshot already has a granted lock for the same relation. Last year Tom mentioned that there is already queue-jumping logic of that sort in the lock manager for other purposes. Although it seems conceptually simple, I suspect the implementation is not.

On the other hand, the lock-early option is very simple and has no impact on anyone who does not want to use it.

---
Lucas
On 11/06/2017 12:30 PM, Stephen Frost wrote:
> * Lucas (lucas75@gmail.com) wrote:
> > pg_dump was taking more than 24 hours to complete on one of my databases,
> > so I began to research alternatives. Parallel backup reduced the backup
> > time to a little less than an hour, but it failed almost every time
> > because of concurrent queries that take exclusive locks. It is difficult
> > to guarantee that my applications will not issue queries such as drop
> > table, alter table, truncate table, create index or drop index for an
> > hour, and I prefer not to build control mechanisms to that end if I can
> > work around it.
>
> I certainly understand the value of pg_dump-based backups, but have you
> considered doing file-based backups? That would avoid the need to do
> any in-database locking at all, and would give you the ability to do
> PITR too. Further, you could actually restore that backup to another
> system and then do a pg_dump there to get a logical representation (and
> this would test your physical database backup/restore process too...).

Yes, point-in-time recovery has the advantage of keeping the backup more up to date, but it has the disadvantage of being more expensive and complex. In my case, point-in-time recovery would require an upgrade of 10 TB of storage space, and my stakeholders have not approved this investment yet.

I suspect that there are lots of users who use pg_dump as their primary backup tool and who would benefit from a more reliable parallel backup.
Lucas, Robert, all,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Nov 6, 2017 at 4:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I wonder if we couldn't somehow repurpose the work that was done for
> > parallel workers' locks. Lots of security-type issues to be handled
> > if we're to open that up to clients, but maybe it's solvable. For
> > instance, maybe only allowing it to clients sharing the same snapshot
> > would help.
>
> Interesting idea. There's a bunch of holes that would need to be
> patched there; for instance, you can't have one session running DDL
> while somebody else has AccessShareLock. Parallel query relies on the
> parallel-mode restrictions to prevent that kind of thing from
> happening, but it would be strange (and likely somewhat broken) to try
> to enforce those here. It would be strange and probably bad if LOCK
> TABLE a; LOCK TABLE b in one session and LOCK TABLE b; LOCK TABLE a in
> another session failed to deadlock. In short, there's a big
> difference between a single session using multiple processes and
> multiple closely coordinated sessions.

Parallel pg_dump is based on synchronized transactions though, and we have a bunch of checks in ImportSnapshot() because a pg_dump parallel worker also can't really be quite the same as a normal backend. Perhaps we could add more restrictions in ImportSnapshot() to match the restrictions for parallel-mode workers? If we think there are other users of SET TRANSACTION SNAPSHOT then we might need to extend that command for this case, but that seems relatively straight-forward. I don't know how reasonable the idea of taking a normally-started backend and making it close enough to a parallel worker when a SET TRANSACTION SNAPSHOT PARALLEL (or whatever) happens to allow it to skip the lock fairness is, though.

> Also, even if you did it, you still need a lot of PROCLOCKs. Workers
> don't need to take all locks up front because they can be assured of
> getting them later, but they've still got to lock the objects they
> actually want to access. Group locking aims to prevent deadlocks
> between cooperating processes; it is not a license to skip locking
> altogether.

This wouldn't be any different from what's happening today in pg_dump though, so I'm not sure why this would be an issue? The proposed patch locks everything in every worker, which is quite different from the main process locking everything and then the individual workers locking the objects they're working on.

* Lucas B (lucas75@gmail.com) wrote:
> On 05/11/2017 21:09, Andres Freund wrote:
> > Well, the current approach afaics requires #relations * 2 locks, whereas
> > acquiring them in every worker would scale that with the number of
> > workers.
>
> Yes, that is why I proposed it as an option. As an option, it will not
> affect anyone who does not want to use it.

That's not actually correct - additional options add maintenance overhead for hackers and for users to have to deal with, and there's at least a clear idea of how we could make this "just work" without having to add an additional option. Based on that and the discussion thus far, it seems like the next step is to try and work through a way to change things to allow workers to skip the lock fairness, and that the current patch isn't going to be accepted.

> It seems natural to think of several connections in a synchronized snapshot
> as the same connection. Then it may be reasonable to grant a shared lock
> out of turn if any connection of the same shared snapshot already has a
> granted lock for the same relation. Last year Tom mentioned that there is
> already queue-jumping logic of that sort in the lock manager for other
> purposes. Although it seems conceptually simple, I suspect the
> implementation is not.

The implementation is almost certainly not simple, but that doesn't mean we should go in another direction and require a great deal more locks or add an option to do so.

I'm going to move this patch into 'Waiting for Author' since it's clearly gotten a good bit of review and discussion already. If there's no interest in further exploring the approach of changing the locking logic then we should probably update the patch to RWF (Returned with Feedback). This seems like a good project to go on the TODO list though, if you aren't going to pursue it, to further explore how to let parallel pg_dump processes jump the lock queue, with a reference back to this thread.

Thanks!

Stephen
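For reference, the snapshot synchronization that parallel pg_dump already relies on, and that ImportSnapshot() handles on the server side, works like this (the snapshot identifier is an example value):

    -- leader connection:
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT pg_export_snapshot();                      -- returns an identifier such as '00000003-0000001B-1'

    -- each worker connection, before running any query in its transaction:
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SET TRANSACTION SNAPSHOT '00000003-0000001B-1';   -- adopt the leader's snapshot
    -- The suggestion above is to hang the extra parallel-worker-style restrictions (and any
    -- permission to skip the lock fairness) off this existing import path, or off an extended
    -- form of this command such as the hypothetical SET TRANSACTION SNAPSHOT ... PARALLEL.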
On Thu, Jan 11, 2018 at 9:02 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Parallel pg_dump is based on synchronized transactions though, and we
> have a bunch of checks in ImportSnapshot() because a pg_dump parallel
> worker also can't really be quite the same as a normal backend. Perhaps
> we could add more restrictions in ImportSnapshot() to match the
> restrictions for parallel-mode workers? If we think there are other users
> of SET TRANSACTION SNAPSHOT then we might need to extend that command
> for this case, but that seems relatively straight-forward. I don't know
> how reasonable the idea of taking a normally-started backend and making
> it close enough to a parallel worker when a SET TRANSACTION SNAPSHOT
> PARALLEL (or whatever) happens to allow it to skip the lock fairness is,
> though.

It seems pretty tricky to me. I actually don't think this use case has all that much in common with the parallel query case. Both can be addressed by tweaking the lock manager, but I think this needs a different set of tweaks than that did.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-01-15 12:12:26 -0500, Robert Haas wrote:
> On Thu, Jan 11, 2018 at 9:02 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > Parallel pg_dump is based on synchronized transactions though, and we
> > have a bunch of checks in ImportSnapshot() because a pg_dump parallel
> > worker also can't really be quite the same as a normal backend. Perhaps
> > we could add more restrictions in ImportSnapshot() to match the
> > restrictions for parallel-mode workers? If we think there are other users
> > of SET TRANSACTION SNAPSHOT then we might need to extend that command
> > for this case, but that seems relatively straight-forward. I don't know
> > how reasonable the idea of taking a normally-started backend and making
> > it close enough to a parallel worker when a SET TRANSACTION SNAPSHOT
> > PARALLEL (or whatever) happens to allow it to skip the lock fairness is,
> > though.
>
> It seems pretty tricky to me. I actually don't think this use case
> has all that much in common with the parallel query case. Both can be
> addressed by tweaking the lock manager, but I think this needs a
> different set of tweaks than that did.

There seems to be consensus in this thread that the approach Lucas proposed isn't what we want, and that instead some shared-lock-based approach is desirable. As that has been the case for ~1.5 months, I propose we mark this as returned with feedback?

Greetings,

Andres Freund
On Fri, Mar 2, 2018 at 2:29 AM, Andres Freund <andres@anarazel.de> wrote:
> There seems to be consensus in this thread that the approach Lucas
> proposed isn't what we want, and that instead some shared-lock-based
> approach is desirable. As that has been the case for ~1.5 months, I
> propose we mark this as returned with feedback?

Yes, that seems pretty clear-cut to me. It would be totally unfair if a patch that hasn't been updated since November were allowed to have a new version submitted after the start of the final CommitFest. We shouldn't be working on anything now that hasn't been under active development recently; we have enough things (and then some) that have.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-03-02 11:42:06 -0500, Robert Haas wrote:
> On Fri, Mar 2, 2018 at 2:29 AM, Andres Freund <andres@anarazel.de> wrote:
> > There seems to be consensus in this thread that the approach Lucas
> > proposed isn't what we want, and that instead some shared-lock-based
> > approach is desirable. As that has been the case for ~1.5 months, I
> > propose we mark this as returned with feedback?
>
> Yes, that seems pretty clear-cut to me. It would be totally unfair if
> a patch that hasn't been updated since November were allowed to have a
> new version submitted after the start of the final CommitFest. We
> shouldn't be working on anything now that hasn't been under active
> development recently; we have enough things (and then some) that have.

Done.