Thread: Hard limit on WAL space used (because PANIC sucks)
In the "Redesigning checkpoint_segments" thread, many people opined that there should be a hard limit on the amount of disk space used for WAL: http://www.postgresql.org/message-id/CA+TgmoaOkgZb5YsmQeMg8ZVqWMtR=6S4-PPd+6jiy4OQ78ihUA@mail.gmail.com. I'm starting a new thread on that, because that's mostly orthogonal to redesigning checkpoint_segments. The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can try to avoid that by checkpointing early enough, so that we can remove old WAL segments to make room for new ones before you run out, but unless we somehow throttle or stop new WAL insertions, it's always going to be possible to use up all disk space. A typical scenario where that happens is when archive_command fails for some reason; even a checkpoint can't remove old, unarchived segments in that case. But it can happen even without WAL archiving. I've seen a case, where it was even worse than a PANIC and shutdown. pg_xlog was on a separate partition that had nothing else on it. The partition filled up, and the system shut down with a PANIC. Because there was no space left, it could not even write the checkpoint after recovery, and thus refused to start up again. There was nothing else on the partition that you could delete to make space. The only recourse would've been to add more disk space to the partition (impossible), or manually delete an old WAL file that was not needed to recover from the latest checkpoint (scary). Fortunately this was a test system, so we just deleted everything. So we need to somehow stop new WAL insertions from happening, before it's too late. Peter Geoghegan suggested one method here: http://www.postgresql.org/message-id/flat/CAM3SWZQcyNxvPaskr-pxm8DeqH7_qevW7uqbhPCsg1FpSxKpoQ@mail.gmail.com. I don't think that exact proposal is going to work very well; throttling WAL flushing by holding WALWriteLock in WAL writer can have knock-on effects on the whole system, as Robert Haas mentioned. Also, it'd still be possible to run out of space, just more difficult. To make sure there is enough room for the checkpoint to finish, other WAL insertions have to stop some time before you completely run out of disk space. The question is how to do that. A naive idea is to check if there's enough preallocated WAL space, just before inserting the WAL record. However, it's too late to check that in XLogInsert; once you get there, you're already holding exclusive locks on data pages, and you are in a critical section so you can't back out. At that point, you have to write the WAL record quickly, or the whole system will suffer. So we need to act earlier. A more workable idea is to sprinkle checks in higher-level code, before you hold any critical locks, to check that there is enough preallocated WAL. Like, at the beginning of heap_insert, heap_update, etc., and all similar indexam entry points. I propose that we maintain a WAL reservation system in shared memory. First of all, keep track of how much preallocated WAL there is left (and try to create more if needed). Also keep track of a different number: the amount of WAL pre-reserved for future insertions. Before entering the critical section, increase the reserved number with a conservative estimate (ie. high enough) of how much WAL space you need, and check that there is still enough preallocated WAL to satisfy all the reservations. If not, throw an error or sleep until there is. 
After you're done with the insertion, release the reservation by decreasing the number again.

A shared reservation counter like that could become a point of contention. One optimization is to keep a constant reservation of, say, 32 KB for each backend. That's enough for most operations. Change the logic so that you check if you've exceeded the reserved amount of space *after* writing the WAL record, while you're holding WALInsertLock anyway. If you do go over the limit, set a flag in backend-private memory indicating that the *next* time you're about to enter a critical section where you will write a WAL record, you check again if more space has been made available.

- Heikki
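For illustration, a minimal sketch of the reservation scheme described above. None of these names exist in the PostgreSQL tree; the struct, the functions, and the error-instead-of-sleep policy are all assumptions, not a proposed patch.

#include "postgres.h"
#include "storage/spin.h"

/* Hypothetical shared-memory state for the WAL reservation scheme. */
typedef struct XLogSpaceCtlData
{
	slock_t		mutex;			/* protects the two counters below */
	uint64		prealloc_bytes;	/* preallocated WAL space still available */
	uint64		reserved_bytes;	/* space pre-reserved by pending insertions */
} XLogSpaceCtlData;

static XLogSpaceCtlData *XLogSpaceCtl;

/*
 * Call before entering the critical section, with a conservative (high)
 * estimate of the WAL space the upcoming record might need.  We are not in
 * a critical section yet, so throwing an error is safe; sleeping and
 * retrying would be the alternative policy mentioned above.
 */
static void
ReserveXLogSpace(uint64 estimate)
{
	bool		ok;

	SpinLockAcquire(&XLogSpaceCtl->mutex);
	ok = (XLogSpaceCtl->reserved_bytes + estimate <= XLogSpaceCtl->prealloc_bytes);
	if (ok)
		XLogSpaceCtl->reserved_bytes += estimate;
	SpinLockRelease(&XLogSpaceCtl->mutex);

	if (!ok)
		ereport(ERROR,
				(errcode(ERRCODE_DISK_FULL),
				 errmsg("could not reserve WAL space")));
}

/* Call after the insertion is done, to release the reservation. */
static void
ReleaseXLogSpace(uint64 estimate)
{
	SpinLockAcquire(&XLogSpaceCtl->mutex);
	XLogSpaceCtl->reserved_bytes -= estimate;
	SpinLockRelease(&XLogSpaceCtl->mutex);
}

Callers like heap_insert() would then call ReserveXLogSpace() before taking buffer locks and ReleaseXLogSpace() after the record has been inserted; the per-backend constant reservation described above would avoid touching the shared counter on every record.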
On 2013-06-06 17:00:30 +0300, Heikki Linnakangas wrote: > A more workable idea is to sprinkle checks in higher-level code, before you > hold any critical locks, to check that there is enough preallocated WAL. > Like, at the beginning of heap_insert, heap_update, etc., and all similar > indexam entry points. I propose that we maintain a WAL reservation system in > shared memory. I am rather doubtful that this won't end up with a bunch of complex code that won't prevent the situation in all circumstances but which will provide bugs/performance problems for some time. Obviously that's just gut feeling since I haven't seen the code... I am much more excited about getting the soft limit case right and then seeing how many problems remain in reality. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 06.06.2013 17:17, Andres Freund wrote: > On 2013-06-06 17:00:30 +0300, Heikki Linnakangas wrote: >> A more workable idea is to sprinkle checks in higher-level code, before you >> hold any critical locks, to check that there is enough preallocated WAL. >> Like, at the beginning of heap_insert, heap_update, etc., and all similar >> indexam entry points. I propose that we maintain a WAL reservation system in >> shared memory. > > I am rather doubtful that this won't end up with a bunch of complex code > that won't prevent the situation in all circumstances but which will > provide bugs/performance problems for some time. > Obviously that's just gut feeling since I haven't see the code... I also have a feeling that we'll likely miss some corner cases in the first cut, so that you can still run out of disk space if you try hard enough / are unlucky. But I think it would still be a big improvement if it only catches, say, 90% of the cases. I think it can be made fairly robust otherwise, and the performance impact should be pretty easy to measure with e.g. pgbench. - Heikki
* Heikki Linnakangas wrote: > The current situation is that if you run out of disk space while writing > WAL, you get a PANIC, and the server shuts down. That's awful. We can > So we need to somehow stop new WAL insertions from happening, before > it's too late. > A naive idea is to check if there's enough preallocated WAL space, just > before inserting the WAL record. However, it's too late to check that in There is a database engine, Microsoft's "Jet Blue" aka the Extensible Storage Engine, that just keeps some preallocated log files around, specifically so it can get consistent and halt cleanly if it runs out of disk space. In other words, the idea is not to check over and over again that there is enough already-reserved WAL space, but to make sure there always is by having a preallocated segment that is never used outside a disk space emergency. -- Christian
On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote: > * Heikki Linnakangas wrote: > > >The current situation is that if you run out of disk space while writing > >WAL, you get a PANIC, and the server shuts down. That's awful. We can > > >So we need to somehow stop new WAL insertions from happening, before > >it's too late. > > >A naive idea is to check if there's enough preallocated WAL space, just > >before inserting the WAL record. However, it's too late to check that in > > There is a database engine, Microsoft's "Jet Blue" aka the Extensible > Storage Engine, that just keeps some preallocated log files around, > specifically so it can get consistent and halt cleanly if it runs out of > disk space. > > In other words, the idea is not to check over and over again that there is > enough already-reserved WAL space, but to make sure there always is by > having a preallocated segment that is never used outside a disk space > emergency. That's not a bad technique. I wonder how reliable it would be in postgres. Do all filesystems allow a rename() to succeed if there isn't actually any space left? E.g. on btrfs I wouldn't be sure. We need to rename because WAL files need to be named after the LSN and timeline ID... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 6, 2013 at 10:38 PM, Andres Freund <andres@2ndquadrant.com> wrote: > That's not a bad technique. I wonder how reliable it would be in > postgres. Do all filesystems allow a rename() to succeed if there isn't > actually any space left? E.g. on btrfs I wouldn't be sure. We need to > rename because WAL files need to be named after the LSN timelineid... I suppose we could just always do the rename at the same time as setting up the current log file. That is, when we start wal log x also set up wal file x+1 at that time. This isn't actually guaranteed to be enough btw. It's possible that the record we're actively about to write will require all of both those files... But that should be very unlikely. -- greg
Let's talk failure cases. There's actually three potential failure cases here:

- One Volume: WAL is on the same volume as PGDATA, and that volume is completely out of space.

- XLog Partition: WAL is on its own partition/volume, and fills it up.

- Archiving: archiving is failing or too slow, causing the disk to fill up with waiting log segments.

I'll argue that these three cases need to be dealt with in three different ways, and no single solution is going to work for all three.

Archiving
---------

In some ways, this is the simplest case. Really, we just need a way to know when the available WAL space has become 90% full, and abort archiving at that stage. Once we stop attempting to archive, we can clean up the unneeded log segments.

What we need is a better way for the DBA to find out that archiving is falling behind when it first starts to fall behind. Tailing the log and examining the rather cryptic error messages we give out isn't very effective.

XLog Partition
--------------

As Heikki pointed out, a full dedicated WAL drive is hard to fix once it gets full, since there's nothing you can safely delete to clear space, even enough for a checkpoint record.

On the other hand, it should be easy to prevent full status; we could simply force a non-spread checkpoint whenever the available WAL space gets 90% full. We'd also probably want to be prepared to switch to a read-only mode if we get full enough that there's only room for the checkpoint records.

One Volume
----------

This is the most complicated case, because we wouldn't necessarily run out of space because of WAL using it up. Anything could cause us to run out of disk space, including activity logs, swapping, pgsql_tmp files, database growth, or some other process which writes files. This means that the DBA getting out of disk-full manually is in some ways easier; there's usually stuff she can delete. However, it's much harder -- maybe impossible -- for PostgreSQL to prevent this kind of space outage.

There should be things we can do to make it easier for the DBA to troubleshoot this, but I'm not sure what. We could use a hard limit for WAL to prevent WAL from contributing to out-of-space, but that'll only prevent a minority of cases.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Thu, Jun 6, 2013 at 4:28 PM, Christian Ullrich <chris@chrullrich.net> wrote: > * Heikki Linnakangas wrote: > >> The current situation is that if you run out of disk space while writing >> WAL, you get a PANIC, and the server shuts down. That's awful. We can > > >> So we need to somehow stop new WAL insertions from happening, before >> it's too late. > > >> A naive idea is to check if there's enough preallocated WAL space, just >> before inserting the WAL record. However, it's too late to check that in > > > There is a database engine, Microsoft's "Jet Blue" aka the Extensible > Storage Engine, that just keeps some preallocated log files around, > specifically so it can get consistent and halt cleanly if it runs out of > disk space. > fwiw, informix (at least until IDS 2000, not sure after that) had the same thing. only this was a parameter to set, and bad things happened if you forgot about it :D -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación Phone: +593 4 5107566 Cell: +593 987171157
On Thursday, June 6, 2013, Josh Berkus wrote:
Let's talk failure cases.
There's actually three potential failure cases here:
- One Volume: WAL is on the same volume as PGDATA, and that volume is
completely out of space.
- XLog Partition: WAL is on its own partition/volume, and fills it up.
- Archiving: archiving is failing or too slow, causing the disk to fill
up with waiting log segments.
I'll argue that these three cases need to be dealt with in three
different ways, and no single solution is going to work for all three.
Archiving
---------
In some ways, this is the simplest case. Really, we just need a way to
know when the available WAL space has become 90% full, and abort
archiving at that stage. Once we stop attempting to archive, we can
clean up the unneeded log segments.
I would oppose that as the solution, either an unconditional one, or configurable with it as the default. Those segments are not unneeded. I need them. That is why I set up archiving in the first place. If you need to shut down the database rather than violate my established retention policy, then shut down the database.
What we need is a better way for the DBA to find out that archiving is
falling behind when it first starts to fall behind. Tailing the log and
examining the rather cryptic error messages we give out isn't very
effective.
The archive command can be made a shell script (or for that matter a compiled program) which can do anything it wants upon failure, including emailing people. Of course maybe whatever causes the archive to fail will also cause the delivery of the message to fail, but I don't see a real solution to this that doesn't start down an infinite regress. If it is not failing outright, but merely falling behind, then I don't really know how to go about detecting that, either in archive_command, or through tailing the PostgreSQL log. I guess archive_command, each time it is invoked, could count the files in the pg_xlog directory and warn if it thinks the number is unreasonable.
xLog Partition
--------------
As Heikki pointed out, a full dedicated WAL drive is hard to fix once
it gets full, since there's nothing you can safely delete to clear
space, even enough for a checkpoint record.
Although the DBA probably wouldn't know it from reading the manual, it is almost always safe to delete the oldest WAL file (after copying it to a different partition just in case something goes wrong--it should be possible to do that, since if WAL is on its own partition it is hard to imagine you can't scrounge up 16MB on a different one), as PostgreSQL keeps two complete checkpoints' worth of WAL around. I think the only reason you would not be able to recover after removing the oldest file is if the controldata file is damaged such that the most recent checkpoint record cannot be found and so it has to fall back to the previous one. Or at least, this is my understanding.
On the other hand, it should be easy to prevent full status; we could
simply force a non-spread checkpoint whenever the available WAL space
gets 90% full. We'd also probably want to be prepared to switch to a
read-only mode if we get full enough that there's only room for the
checkpoint records.
I think that last sentence could be applied without modification to the "one volume" case as well.
So what would that look like? Before accepting a (non-checkpoint) WAL Insert that fills up the current segment to a high enough level that a checkpoint record will no longer fit, it must first verify that a recycled file exists, or if not it must successfully init a new file.
If that init fails, then it must do what? Signal for a checkpoint, release its locks, and then ERROR out? That would be better than a PANIC, but can it do better? Enter a retry loop so that once the checkpoint has finished, and assuming it has freed up enough WAL files for recycling/removal, it can try the original WAL Insert again?
Cheers,
Jeff
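For illustration, a rough sketch of the retry loop Jeff describes. TryPrepareNextWALSegment() is a hypothetical stand-in for "verify a recycled segment exists or init a new one", the retry policy is made up, and this assumes the caller is not yet holding buffer locks or inside a critical section. RequestCheckpoint() and its flags do exist in the tree.

#include "postgres.h"
#include "access/xlog.h"
#include "postmaster/bgwriter.h"

extern bool TryPrepareNextWALSegment(void);	/* hypothetical */

static void
EnsureRoomForCheckpointRecord(void)
{
	int			attempts;

	for (attempts = 0; attempts < 10; attempts++)
	{
		if (TryPrepareNextWALSegment())
			return;

		/*
		 * Out of space: request an immediate checkpoint, which may free old
		 * segments for recycling, then try again.  This has to happen before
		 * any buffer locks are taken or a critical section is entered.
		 */
		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);
	}

	ereport(ERROR,
			(errcode(ERRCODE_DISK_FULL),
			 errmsg("could not create a new WAL segment: no space left on device")));
}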
On 06/06/2013 09:30 PM, Jeff Janes wrote: > Archiving > --------- > > In some ways, this is the simplest case. Really, we just need a way to > know when the available WAL space has become 90% full, and abort > archiving at that stage. Once we stop attempting to archive, we can > clean up the unneeded log segments. > > > I would oppose that as the solution, either an unconditional one, or > configurable with is it as the default. Those segments are not > unneeded. I need them. That is why I set up archiving in the first > place. If you need to shut down the database rather than violate my > established retention policy, then shut down the database. Agreed and I would oppose it even as configurable. We set up the archiving for a reason. I do think it might be useful to be able to store archiving logs as well as wal_keep_segments logs in a different location than pg_xlog. > > What we need is a better way for the DBA to find out that archiving is > falling behind when it first starts to fall behind. Tailing the log and > examining the rather cryptic error messages we give out isn't very > effective. > > > The archive command can be made a shell script (or that matter a > compiled program) which can do anything it wants upon failure, including > emailing people. Yep, that is what PITRTools does. You can make it do whatever you want. JD
On Thu, Jun 6, 2013 at 9:30 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > I would oppose that as the solution, either an unconditional one, or > configurable with is it as the default. Those segments are not unneeded. I > need them. That is why I set up archiving in the first place. If you need > to shut down the database rather than violate my established retention > policy, then shut down the database. Same boat. My archives are the real storage. The disks are write-back caching. That's because the storage of my archives is probably three to five orders of magnitude more reliable.
On 07.06.2013 00:38, Andres Freund wrote: > On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote: >> * Heikki Linnakangas wrote: >> >>> The current situation is that if you run out of disk space while writing >>> WAL, you get a PANIC, and the server shuts down. That's awful. We can >> >>> So we need to somehow stop new WAL insertions from happening, before >>> it's too late. >> >>> A naive idea is to check if there's enough preallocated WAL space, just >>> before inserting the WAL record. However, it's too late to check that in >> >> There is a database engine, Microsoft's "Jet Blue" aka the Extensible >> Storage Engine, that just keeps some preallocated log files around, >> specifically so it can get consistent and halt cleanly if it runs out of >> disk space. >> >> In other words, the idea is not to check over and over again that there is >> enough already-reserved WAL space, but to make sure there always is by >> having a preallocated segment that is never used outside a disk space >> emergency. > > That's not a bad technique. I wonder how reliable it would be in > postgres. That's no different from just having a bit more WAL space in the first place. We need a mechanism to stop backends from writing WAL, before you run out of it completely. It doesn't matter if the reservation is done by stashing away a WAL segment for emergency use, or by a variable in shared memory. Either way, backends need to stop using it up, by blocking or throwing an error before they enter the critical section. I guess you could use the stashed away segment to ensure that you can recover after PANIC. At recovery, there are no other backends that could use up the emergency segment. But that's not much better than what we have now. - Heikki
--On 6. Juni 2013 16:25:29 -0700 Josh Berkus <josh@agliodbs.com> wrote: > Archiving > --------- > > In some ways, this is the simplest case. Really, we just need a way to > know when the available WAL space has become 90% full, and abort > archiving at that stage. Once we stop attempting to archive, we can > clean up the unneeded log segments. > > What we need is a better way for the DBA to find out that archiving is > falling behind when it first starts to fall behind. Tailing the log and > examining the rather cryptic error messages we give out isn't very > effective. Slightly OT, but i always wondered whether we could create a function, say pg_last_xlog_removed() for example, returning a value suitable to be used to calculate the distance to the current position. An increasing value could be used to instruct monitoring to throw a warning if a certain threshold is exceeded. I've also seen people creating monitoring scripts by looking into archive_status and doing simple counts on the .ready files, giving a warning if that exceeds an expected maximum value. I haven't looked at the code very deep, but i think we already store the position of the last removed xlog in shared memory, maybe this can be used somehow. Afaik, we do cleanup only during checkpoints, so this all has too much delay... -- Thanks Bernd
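For illustration, a function along those lines could be a thin SQL-callable wrapper over that shared-memory value. The sketch below is hypothetical; in particular, GetLastRemovedSegno() stands in for however the last-removed position would actually be read.

#include "postgres.h"
#include "fmgr.h"

extern uint64 GetLastRemovedSegno(void);	/* hypothetical accessor */

PG_FUNCTION_INFO_V1(pg_last_xlog_removed);

/*
 * Returns the segment number of the most recently removed/recycled WAL
 * file, so monitoring can compare it against the current insert position
 * and alert when the gap keeps growing.
 */
Datum
pg_last_xlog_removed(PG_FUNCTION_ARGS)
{
	PG_RETURN_INT64((int64) GetLastRemovedSegno());
}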
On 06.06.2013 17:00, Heikki Linnakangas wrote: > A more workable idea is to sprinkle checks in higher-level code, before > you hold any critical locks, to check that there is enough preallocated > WAL. Like, at the beginning of heap_insert, heap_update, etc., and all > similar indexam entry points. Actually, there's one place that catches most of these: LockBuffer(..., BUFFER_LOCK_EXCLUSIVE). In all heap and index operations, you always grab an exclusive lock on a page first, before entering the critical section where you call XLogInsert. That leaves a few miscellaneous XLogInsert calls that need to be guarded, but it leaves a lot less room for bugs of omission, and keeps the code cleaner. - Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 06.06.2013 17:00, Heikki Linnakangas wrote: >> A more workable idea is to sprinkle checks in higher-level code, before >> you hold any critical locks, to check that there is enough preallocated >> WAL. Like, at the beginning of heap_insert, heap_update, etc., and all >> similar indexam entry points. > Actually, there's one place that catches most of these: LockBuffer(..., > BUFFER_LOCK_EXCLUSIVE). In all heap and index operations, you always > grab an exclusive lock on a page first, before entering the critical > section where you call XLogInsert. Not only is that a horrible layering/modularity violation, but surely LockBuffer can have no idea how much WAL space will be needed. regards, tom lane
On 07.06.2013 19:33, Tom Lane wrote: > Heikki Linnakangas<hlinnakangas@vmware.com> writes: >> On 06.06.2013 17:00, Heikki Linnakangas wrote: >>> A more workable idea is to sprinkle checks in higher-level code, before >>> you hold any critical locks, to check that there is enough preallocated >>> WAL. Like, at the beginning of heap_insert, heap_update, etc., and all >>> similar indexam entry points. > >> Actually, there's one place that catches most of these: LockBuffer(..., >> BUFFER_LOCK_EXCLUSIVE). In all heap and index operations, you always >> grab an exclusive lock on a page first, before entering the critical >> section where you call XLogInsert. > > Not only is that a horrible layering/modularity violation, but surely > LockBuffer can have no idea how much WAL space will be needed. It can be just a conservative guess, like, 32KB. That should be enough for almost all WAL-logged operations. The only exception that comes to mind is a commit record, which can be arbitrarily large, when you have a lot of subtransactions or dropped/created relations. - Heikki
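For illustration, the hook could look roughly like the sketch below. ReserveXLogSpace() is the hypothetical reservation function from the earlier sketch, and the 32 kB constant is the conservative guess mentioned above; this is not actual bufmgr code, just the shape of the idea.

#include "postgres.h"
#include "storage/bufmgr.h"

#define XLOG_RESERVE_DEFAULT	(32 * 1024)		/* conservative per-operation guess */

extern void ReserveXLogSpace(uint64 estimate);	/* hypothetical, see earlier sketch */

void
LockBuffer(Buffer buffer, int mode)
{
	/*
	 * Hypothetical hook: before handing out an exclusive content lock, make
	 * sure a conservative amount of preallocated WAL is still available, so
	 * the caller never reaches its critical section without room for the
	 * WAL record it is about to emit.
	 */
	if (mode == BUFFER_LOCK_EXCLUSIVE)
		ReserveXLogSpace(XLOG_RESERVE_DEFAULT);

	/* ... the existing LockBuffer() implementation would follow here ... */
}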
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 07.06.2013 19:33, Tom Lane wrote: >> Not only is that a horrible layering/modularity violation, but surely >> LockBuffer can have no idea how much WAL space will be needed. > It can be just a conservative guess, like, 32KB. That should be enough > for almost all WAL-logged operations. The only exception that comes to > mind is a commit record, which can be arbitrarily large, when you have a > lot of subtransactions or dropped/created relations. What happens when several updates are occurring concurrently? regards, tom lane
>> I would oppose that as the solution, either an unconditional one, or >> configurable with is it as the default. Those segments are not >> unneeded. I need them. That is why I set up archiving in the first >> place. If you need to shut down the database rather than violate my >> established retention policy, then shut down the database.
>
> Agreed and I would oppose it even as configurable. We set up the > archiving for a reason. I do think it might be useful to be able to > store archiving logs as well as wal_keep_segments logs in a different > location than pg_xlog.

People have different configurations. Most of my clients use archiving for backup or replication; they would rather have archiving cease (and send a CRITICAL alert) than have the master go offline. That's pretty common, probably more common than the "if I don't have redundancy shut down" case.

Certainly anyone making the decision that their master database should shut down rather than cease archiving should make it *consciously*, instead of finding out the hard way.

>> The archive command can be made a shell script (or that matter a >> compiled program) which can do anything it wants upon failure, including >> emailing people.

You're talking about using external tools -- frequently hackish, workaround ones -- to handle something which PostgreSQL should be doing itself, and which only the database engine has full knowledge of. While that's the only solution we have for now, it's hardly a worthy design goal.

Right now, what we're telling users is "You can have continuous backup with Postgres, but you'd better hire an expensive consultant to set it up for you, or use this external tool of dubious provenance which there's no packages for, or you might accidentally cause your database to shut down in the middle of the night."

At which point most sensible users say "no thanks, I'll use something else".

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Fri, Jun 7, 2013 at 12:14 PM, Josh Berkus <josh@agliodbs.com> wrote: > Right now, what we're telling users is "You can have continuous backup > with Postgres, but you'd better hire and expensive consultant to set it > up for you, or use this external tool of dubious provenance which > there's no packages for, or you might accidentally cause your database > to shut down in the middle of the night."

Inverted and just as well supported: "if you want to not accidentally lose data, you better hire an expensive consultant to check your systems for all sorts of default 'safety = off' features." This being but the hypothetical first one.

Furthermore, I see no reason why high quality external archiving software cannot exist. Maybe some even exists already, and no doubt they can be improved and the contract with Postgres enriched to that purpose. Contrast: JSON, where the stable OID in the core distribution helps pragmatically punt on a particularly sticky problem (extension dependencies and non-system OIDs), I can't think of a reason an external archiver couldn't do its job well right now.

> At which point most sensible users say "no thanks, I'll use something else".

Oh, I lost some disks, well, no big deal, I'll use the archives. Surprise! <forensic analysis ensues> So, as it turns out, it has been dropping segments at times because of systemic backlog for months/years.

Alternative ending: Hey, I restored the database. <later> Why is the state so old? Why are customers getting warnings that their (thought paid) invoices are overdue? Oh crap, the restore was cut short by this stupid option and this database lives in the past! Fin.

I have a clear bias in experience here, but I can't relate to someone who sets up archives but is totally okay losing a segment unceremoniously, because it only takes one of those once in a while to make a really, really bad day. Who is this person that lackadaisically archives, and are they just fooling themselves? And where are these archivers that enjoy even a modicum of long-term success that are not reliable?

If one wants to casually drop archives, how is someone going to find out and freak out a bit? Per experience, logs are pretty clearly hazardous to the purpose.

Basically, I think the default that opts one into danger is not good, especially since the system is starting from a position of "do too much stuff and you'll crash."

Finally, it's not that hard to teach any archiver how to no-op at user-peril, or perhaps Postgres can learn a way to do this expressly to standardize the procedure a bit to ease publicly shared recipes, perhaps.
From: "Daniel Farina" <daniel@heroku.com> > On Fri, Jun 7, 2013 at 12:14 PM, Josh Berkus <josh@agliodbs.com> wrote: >> Right now, what we're telling users is "You can have continuous backup >> with Postgres, but you'd better hire and expensive consultant to set it >> up for you, or use this external tool of dubious provenance which >> there's no packages for, or you might accidentally cause your database >> to shut down in the middle of the night." >> >> At which point most sensible users say "no thanks, I'll use something >> else". > Inverted and just as well supported: "if you want to not accidentally > lose data, you better hire an expensive consultant to check your > systems for all sorts of default 'safety = off' features." This > being but the hypothetical first one. > > Furthermore, I see no reason why high quality external archiving > software cannot exist. Maybe some even exists already, and no doubt > they can be improved and the contract with Postgres enriched to that > purpose. > > Finally, it's not that hard to teach any archiver how to no-op at > user-peril, or perhaps Postgres can learn a way to do this expressly > to standardize the procedure a bit to ease publicly shared recipes, > perhaps. Yes, I feel designing reliable archiving, even for the simplest case - copy WAL to disk, is very difficult. I know there are following three problems if you just follow the PostgreSQL manual. Average users won't notice them. I guess even professional DBAs migrating from other DBMSs won't, either. 1. If the machine or postgres crashes while archive_command is copying a WAL file, later archive recovery fails. This is because cp leaves a file of less than 16MB in archive area, and postgres refuses to start when it finds such a small archive WAL file. The solution, which IIRC Tomas san told me here, is to do like "cp %p /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f". 2. archive_command dumps core when you run pg_ctl stop -mi. This is because postmaster sends SIGQUIT to all its descendants. The core files accumulate in the data directory, which will be backed up with the database. Of course those core files are garbage. archive_command script needs to catch SIGQUIT and exit. 3. You cannot know the reason of archive_command failure (e.g. archive area full) if you don't use PostgreSQL's server logging. This is because archive_command failure is not logged in syslog/eventlog. I hope PostgreSQL will provide a reliable archiving facility that is ready to use. Regards MauMau
On 06/07/2013 12:14 PM, Josh Berkus wrote: > Right now, what we're telling users is "You can have continuous backup > with Postgres, but you'd better hire and expensive consultant to set it > up for you, or use this external tool of dubious provenance which > there's no packages for, or you might accidentally cause your database > to shut down in the middle of the night." This is an outright falsehood. We are telling them, "You better know what you are doing" or "You should call a consultant". This is no different than, "You better know what you are doing" or "You should take driving lessons". > > At which point most sensible users say "no thanks, I'll use something else". > Josh I have always admired your flair for dramatics, it almost rivals mine. Users are free to use what they want, some will chose lesser databases. I am ok with that because eventually if PostgreSQL is the right tool, they will come back to us, and PgExperts or CMD or OmniTI or they will know what they are doing and thus don't need us. JD
On 06/08/2013 07:36 AM, MauMau wrote: > 1. If the machine or postgres crashes while archive_command is copying a > WAL file, later archive recovery fails. > This is because cp leaves a file of less than 16MB in archive area, and > postgres refuses to start when it finds such a small archive WAL file. > The solution, which IIRC Tomas san told me here, is to do like "cp %p > /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f". Well it seems to me that one of the problems here is we tell people to use copy. We should be telling people to use a command (or supply a command) that is smarter than that. > 3. You cannot know the reason of archive_command failure (e.g. archive > area full) if you don't use PostgreSQL's server logging. > This is because archive_command failure is not logged in syslog/eventlog. Wait, what? Is this true (someone else?) JD
On 06/06/2013 07:52 AM, Heikki Linnakangas wrote: > I think it can be made fairly robust otherwise, and the performance > impact should be pretty easy to measure with e.g pgbench.

Once upon a time in a land far, far away, we expected users to manage their own systems. We had things like soft and hard quotas on disks and lastlog to find out who was logging into the system. Alas, as far as I know soft and hard quotas are kind of a thing of the past but that doesn't mean that their usefulness has ended.

The idea that we PANIC is not just awful, it is stupid. I don't think anyone is going to disagree with that. However, there is a question of what to do instead. I think the idea of sprinkling checks into the higher level code before specific operations is not invalid but I also don't think it is necessary.

To me, a more pragmatic approach makes sense. Obviously having some kind of code that checks the space makes sense but I don't know that it needs to be around any operation other than when we are creating a segment. What do we care why the segment is being created? If we don't have enough room to create the segment, the transaction rolls back with some OBVIOUS not OBTUSE error.

Obviously this could cause a ton of transactions to roll back but I think keeping the database consistent and rolling back a transaction in case of error is exactly what we are supposed to do.

Sincerely, Joshua D. Drake
On 2013-06-08 11:15:40 -0700, Joshua D. Drake wrote: > To me, a more pragmatic approach makes sense. Obviously having some kind of > code that checks the space makes sense but I don't know that it needs to be > around any operation other than we are creating a segment. What do we care > why the segment is being created? If we don't have enough room to create the > segment, the transaction rollsback with some OBVIOUS not OBTUSE error. > > Obviously this could cause a ton of transactions to roll back but I think > keeping the database consistent and rolling back a transaction in case of > error is exactly what we are supposed to do. You know, the PANIC isn't there just because we like to piss off users. There's actual technical reasons that don't just go away by judging the PANIC as stupid. At the points where the XLogInsert()s happens we're in critical sections out of which we *cannot* ERROR out because we already may have made modifications that cannot be allowed to be performed partially/unlogged. That's why we're throwing a PANIC which will force a cluster wide restart including *NOT* writing any further buffers from s_b out. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-06-07 12:02:57 +0300, Heikki Linnakangas wrote: > On 07.06.2013 00:38, Andres Freund wrote: > >On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote: > >>* Heikki Linnakangas wrote: > >> > >>>The current situation is that if you run out of disk space while writing > >>>WAL, you get a PANIC, and the server shuts down. That's awful. We can > >> > >>>So we need to somehow stop new WAL insertions from happening, before > >>>it's too late. > >> > >>>A naive idea is to check if there's enough preallocated WAL space, just > >>>before inserting the WAL record. However, it's too late to check that in > >> > >>There is a database engine, Microsoft's "Jet Blue" aka the Extensible > >>Storage Engine, that just keeps some preallocated log files around, > >>specifically so it can get consistent and halt cleanly if it runs out of > >>disk space. > >> > >>In other words, the idea is not to check over and over again that there is > >>enough already-reserved WAL space, but to make sure there always is by > >>having a preallocated segment that is never used outside a disk space > >>emergency. > > > >That's not a bad technique. I wonder how reliable it would be in > >postgres. > > That's no different from just having a bit more WAL space in the first > place. We need a mechanism to stop backends from writing WAL, before you run > out of it completely. It doesn't matter if the reservation is done by > stashing away a WAL segment for emergency use, or by a variable in shared > memory. Either way, backends need to stop using it up, by blocking or > throwing an error before they enter the critical section. Well, if you have 16 or 32MB of reserved WAL space available you don't need to judge all that precisely how much space is available. So we can just sprinkle some EnsureXLogHasSpace() on XLogInsert() callsites like heap_insert(), but we can do that outside of the critical sections and we can do it without locks since there needs to happen quite some write activity to overrun the reserved space. Anything that desperately needs to write stuff, like the end of recovery checkpoint, can just not call EnsureXLogHasSpace() and rely on the reserved space. Seems like 90% of the solution for 30% of the complexity or so. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
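To illustrate, such a check might look like the sketch below. EnsureXLogHasSpace() is the name Andres uses above, but the body, the GetFreePreallocatedSegments() accessor, and the two-segment reserve are all assumptions, not existing code.

#include "postgres.h"

#define RESERVED_EMERGENCY_SEGMENTS		2	/* hypothetical emergency reserve */

extern int	GetFreePreallocatedSegments(void);	/* hypothetical accessor */

/*
 * Cheap check for XLogInsert() callsites such as heap_insert(), called
 * before any buffer locks are taken or a critical section is entered.
 * No locking: with a reserve of a segment or two, a slightly stale answer
 * does no harm.
 */
void
EnsureXLogHasSpace(void)
{
	if (GetFreePreallocatedSegments() <= RESERVED_EMERGENCY_SEGMENTS)
		ereport(ERROR,
				(errcode(ERRCODE_DISK_FULL),
				 errmsg("no preallocated WAL space left"),
				 errhint("Free disk space on the WAL volume, or fix a failing archive_command.")));
}

Paths that must always be able to write, like the end-of-recovery checkpoint, would simply skip the call and draw down the reserve, as described above.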
On Fri, Jun 7, 2013 at 12:14 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> The archive command can be made a shell script (or that matter a
>> compiled program) which can do anything it wants upon failure, including
>> emailing people.

You're talking about using external tools -- frequently hackish,
workaround ones -- to handle something which PostgreSQL should be doing
itself, and which only the database engine has full knowledge of.
I think the database engine is about the last thing which would have full knowledge of the best way to contact the DBA, especially during events which by definition mean things are already going badly. I certainly don't see having core code which knows how to talk to every PBX, SMS, email system, or twitter feed that anyone might wish to use for logging. PostgreSQL already supports two formats of text logs, plus syslog and eventlog. Is there some additional logging management tool that we could support which is widely used, doesn't require an expensive consultant to set-up and configure correctly (or even to decide what "correctly" means for the given situation), and which solves 80% of the problems?
It would be nice to have the ability to specify multiple log destinations with different log_min_messages for each one. I'm sure syslog already must implement some kind of method for doing that, but I've been happy enough with the text logs that I've never bothered to look into it much.
While
that's the only solution we have for now, it's hardly a worthy design goal.
Right now, what we're telling users is "You can have continuous backup
with Postgres, but you'd better hire an expensive consultant to set it
up for you, or use this external tool of dubious provenance which
there's no packages for, or you might accidentally cause your database
to shut down in the middle of the night."
At which point most sensible users say "no thanks, I'll use something else".
What does the something else do? Hopefully it is not "silently invalidate your backups".
Cheers,
Jeff
On Sat, Jun 8, 2013 at 11:15 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
Once upon a time in a land far, far away, we expected users to manage their own systems. We had things like soft and hard quotas on disks and last log to find out who was logging into the system. Alas, as far as I know soft and hard quotas are kind of a thing of the past but that doesn't mean that their usefulness has ended.
On 06/06/2013 07:52 AM, Heikki Linnakangas wrote:
I think it can be made fairly robust otherwise, and the performance
impact should be pretty easy to measure with e.g pgbench.
The idea that we PANIC is not just awful, it is stupid. I don't think anyone is going to disagree with that. However, there is a question of what to do instead. I think the idea of sprinkling checks into the higher level code before specific operations is not invalid but I also don't think it is necessary.
Given that the system is going to become unusable, I don't see why PANIC is an awful, stupid way of doing it. And if it can only be used for things that don't generate WAL, that is pretty much unusable, as even read only transactions often need to do clean-up tasks that generate WAL.
Cheers,
Jeff
On 06/08/2013 11:27 AM, Andres Freund wrote: > > On 2013-06-08 11:15:40 -0700, Joshua D. Drake wrote: >> To me, a more pragmatic approach makes sense. Obviously having some kind of >> code that checks the space makes sense but I don't know that it needs to be >> around any operation other than we are creating a segment. What do we care >> why the segment is being created? If we don't have enough room to create the >> segment, the transaction rollsback with some OBVIOUS not OBTUSE error. >> >> Obviously this could cause a ton of transactions to roll back but I think >> keeping the database consistent and rolling back a transaction in case of >> error is exactly what we are supposed to do. > > You know, the PANIC isn't there just because we like to piss of > users. There's actual technical reasons that don't just go away by > judging the PANIC as stupid. Yes I know we aren't trying to piss off users. What I am saying is that it is stupid to the user that it PANICS. I apologize if that came out wrong. > At the points where the XLogInsert()s happens we're in critical sections > out of which we *cannot* ERROR out because we already may have made > modifications that cannot be allowed to be performed > partially/unlogged. That's why we're throwing a PANIC which will force a > cluster wide restart including *NOT* writing any further buffers from > s_b out. > Does this preclude (sorry I don't know this part of the code very well) my suggestion of on log create? JD > Greetings, > > Andres Freund >
On 7 June 2013 10:02, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 07.06.2013 00:38, Andres Freund wrote: >> >> On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote: >>> >>> * Heikki Linnakangas wrote: >>> >>>> The current situation is that if you run out of disk space while writing >>>> WAL, you get a PANIC, and the server shuts down. That's awful. We can >>> >>> >>>> So we need to somehow stop new WAL insertions from happening, before >>>> it's too late. >>> >>> >>>> A naive idea is to check if there's enough preallocated WAL space, just >>>> before inserting the WAL record. However, it's too late to check that in >>> >>> >>> There is a database engine, Microsoft's "Jet Blue" aka the Extensible >>> Storage Engine, that just keeps some preallocated log files around, >>> specifically so it can get consistent and halt cleanly if it runs out of >>> disk space. >>> >>> In other words, the idea is not to check over and over again that there >>> is >>> enough already-reserved WAL space, but to make sure there always is by >>> having a preallocated segment that is never used outside a disk space >>> emergency. >> >> >> That's not a bad technique. I wonder how reliable it would be in >> postgres. > > > That's no different from just having a bit more WAL space in the first > place. We need a mechanism to stop backends from writing WAL, before you run > out of it completely. It doesn't matter if the reservation is done by > stashing away a WAL segment for emergency use, or by a variable in shared > memory. Either way, backends need to stop using it up, by blocking or > throwing an error before they enter the critical section. > > I guess you could use the stashed away segment to ensure that you can > recover after PANIC. At recovery, there are no other backends that could use > up the emergency segment. But that's not much better than what we have now. Christian's idea seems good to me. Looks like you could be dismissing this too early, especially since there's no better idea emerged. I doubt that we're going to think of something others didn't already face. It's pretty clear that most other DBMS do this with their logs. For your fast wal insert patch to work well, we need a simple and fast technique to detect out of space. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Jun 8, 2013 at 11:27 AM, Andres Freund <andres@2ndquadrant.com> wrote:
You know, the PANIC isn't there just because we like to piss off
users. There's actual technical reasons that don't just go away by
judging the PANIC as stupid.
At the points where the XLogInsert()s happens we're in critical sections
out of which we *cannot* ERROR out because we already may have made
modifications that cannot be allowed to be performed
partially/unlogged. That's why we're throwing a PANIC which will force a
cluster wide restart including *NOT* writing any further buffers from
s_b out.
If archiving is on and failure is due to no space, could we just keep trying XLogFileInit again for a couple minutes to give archiving a chance to do its thing? Doing that while holding onto locks and remaining in a critical section would be unfortunate, but if the alternative is a PANIC, it might be acceptable.
The problem is that even if the file is only being kept so it can be archived, once archiving succeeds I think the file is not removed immediately but rather not until the next checkpoint, which will never happen when the locks are still held.
Cheers,
Jeff
From: "Joshua D. Drake" <jd@commandprompt.com> > On 06/08/2013 07:36 AM, MauMau wrote: >> 3. You cannot know the reason of archive_command failure (e.g. archive >> area full) if you don't use PostgreSQL's server logging. >> This is because archive_command failure is not logged in syslog/eventlog. > > Wait, what? Is this true (someone else?) I'm sorry, please let me correct myself. What I meant is: This is because the exact reason for archive_command failure (e.g. cp's output) is not logged in syslog/eventlog. The fact itself that archive_command failed is recorded. Regards MauMau
From: "Joshua D. Drake" <jd@commandprompt.com> > On 06/08/2013 11:27 AM, Andres Freund wrote: >> You know, the PANIC isn't there just because we like to piss of >> users. There's actual technical reasons that don't just go away by >> judging the PANIC as stupid. > > Yes I know we aren't trying to piss off users. What I am saying is that it > is stupid to the user that it PANICS. I apologize if that came out wrong. I've experienced PANIC shutdown several times due to WAL full during development. As a user, one problem with PANIC is that it dumps core. I think core files should be dumped only when "can't happen" events has occurred like PostgreSQL's bug. I didn't expect postgres dumps core simply because disk is full. I want postgres to shutdown with FATAL message in that exact case. Regards MauMau
From: "Josh Berkus" <josh@agliodbs.com> > There's actually three potential failure cases here: > > - One Volume: WAL is on the same volume as PGDATA, and that volume is > completely out of space. > > - XLog Partition: WAL is on its own partition/volume, and fills it up. > > - Archiving: archiving is failing or too slow, causing the disk to fill > up with waiting log segments. I think there is one more case. Is this correct? - Failure of a disk containing data directory or tablespace If checkpoint can't write buffers to disk because of disk failure, checkpoint cannot complete, thus WAL files accumulate in pg_xlog/. This means that one disk failure will lead to postgres shutdown. > xLog Partition > -------------- > > As Heikki pointed, out, a full dedicated WAL drive is hard to fix once > it gets full, since there's nothing you can safely delete to clear > space, even enough for a checkpoint record. This sounds very scary. Is it possible to complete recovery and start up postmaster with either or both of the following modifications? [Idea 1] During recovery, force archiving a WAL file and delete/recycle it in pg_xlog/ as soon as its contents are applied. [Idea 2] During recovery, when disk full is encountered at end-of-recovery checkpoint, force archiving all unarchived WAL files and delete/recycle them at that time. Regards MauMau
On 06/06/2013 10:00 PM, Heikki Linnakangas wrote: > > I've seen a case, where it was even worse than a PANIC and shutdown. > pg_xlog was on a separate partition that had nothing else on it. The > partition filled up, and the system shut down with a PANIC. Because > there was no space left, it could not even write the checkpoint after > recovery, and thus refused to start up again. There was nothing else > on the partition that you could delete to make space. The only > recourse would've been to add more disk space to the partition > (impossible), or manually delete an old WAL file that was not needed > to recover from the latest checkpoint (scary). Fortunately this was a > test system, so we just deleted everything. There were a couple of dba.stackexchange.com reports along the same lines recently, too. Both involved an antivirus vendor's VM appliance with a canned (stupid) configuration that set wal_keep_segments too high for the disk space allocated and stored WAL on a separate partition. People are having issues with WAL space management in the real world and I think it and autovacuum are the two hardest things for most people to configure and understand in Pg right now. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 06/08/2013 10:57 AM, Daniel Farina wrote: > >> At which point most sensible users say "no thanks, I'll use something else". > [snip] > > I have a clear bias in experience here, but I can't relate to someone > who sets up archives but is totally okay losing a segment unceremoniously, > because it only takes one of those once in a while to make a really, > really bad day. It sounds like between you both you've come up with a pretty solid argument by exclusion for throttling WAL writing as space grows critical. Dropping archive segments seems pretty unsafe from the solid arguments Daniel presents, and Josh makes a good case for why shutting the DB down when archiving can't keep up isn't any better. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
<div class="moz-cite-prefix">On 06/09/2013 08:32 AM, MauMau wrote:<br /></div><blockquote cite="mid:D117F8165E894FA1B805302504C6F1D7@maumau"type="cite"><br /> - Failure of a disk containing data directory or tablespace<br /> If checkpoint can't write buffers to disk because of disk failure, checkpoint cannot complete, thus WALfiles accumulate in pg_xlog/. <br /> This means that one disk failure will lead to postgres shutdown. <br /></blockquote>... which is why tablespaces aren't disposable, and why creating a tablespace in a RAM disk is such an awfulidea.<br /><br /> I'd rather like to be able to recover from this by treating the tablespace as dead, so any attemptto get a lock on any table within it fails with an error and already-in-WAL writes to it just get discarded. It'sthe sort of thing that'd only be reasonable to do as a recovery option (like zero_damaged_pages) since if applied bydefault it'd lead to potentially severe and unexpected data loss.<br /><br /> I've seen a couple of people bitten by themisunderstanding that tablespaces are a way to split up your data based on different reliability requirements, and I reallyneed to write a docs patch for <a href="http://www.postgresql.org/docs/9.2/static/manage-ag-tablespaces.html">http://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html</a> thatadds a prominent warning like:<br /><br /> WARNING: Every tablespace must be present before the database can be started.There is no easy way to recover the database if a tablespace is lost to disk failure, deletion, use of volatile storage,etc. <b>Do not put a tablespace on a RAM disk</b>; instead just use UNLOGGED tables.<br /><br /> (Opinionson the above?)<br /><br /><pre class="moz-signature" cols="72">-- Craig Ringer <a class="moz-txt-link-freetext"href="http://www.2ndQuadrant.com/">http://www.2ndQuadrant.com/</a>PostgreSQL Development, 24x7Support, Training & Services</pre>
On 06/09/2013 03:02 AM, Jeff Janes wrote: > It would be nice to have the ability to specify multiple log destinations > with different log_min_messages for each one. I'm sure syslog already must > implement some kind of method for doing that, but I've been happy enough > with the text logs that I've never bothered to look into it much. Smarter syslog flavours like rsyslog certainly do this. No alert system triggered by events within Pg will ever be fully sufficient. "Oops, the postmaster crashed with stack corruption, I'll just exec whatever's in this on_panic_exec GUC (if I can still read it and it's still valid) to hopefully tell the sysadmin about my death." Hmm, sounds as reliable and safe as a bicycle powered by a home-made rocket engine. External monitoring is IMO always necessary. Something like Icinga with check_postgres can trivially poke Pg to make sure it's alive. It can also efficiently check the 'pg_error.log' from rsyslog that contains only severe errors and raise alerts if it doesn't like what it sees. If I'm already doing external monitoring (which is necessary as outlined above) then I see much less point having Pg able to raise alerts for problems, and am more interested in better built-in functions and views for exposing Pg's state. Easier monitoring of WAL build-up, ways to slow the master if async replicas or archiving are getting too far behind, etc. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-06-08 13:26:56 -0700, Joshua D. Drake wrote: > >At the points where the XLogInsert()s happens we're in critical sections > >out of which we *cannot* ERROR out because we already may have made > >modifications that cannot be allowed to be performed > >partially/unlogged. That's why we're throwing a PANIC which will force a > >cluster wide restart including *NOT* writing any further buffers from > >s_b out. > > > > Does this preclude (sorry I don't know this part of the code very well) my > suggestion of on log create? Well, yes. We create log segments some layers below XLogInsert() if necessary, and as I said above, we're in a critical section at that point, so just rolling back isn't one of the options. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
From: "Craig Ringer" <craig@2ndquadrant.com> > On 06/09/2013 08:32 AM, MauMau wrote: >> >> - Failure of a disk containing data directory or tablespace >> If checkpoint can't write buffers to disk because of disk failure, >> checkpoint cannot complete, thus WAL files accumulate in pg_xlog/. >> This means that one disk failure will lead to postgres shutdown. > > I've seen a couple of people bitten by the misunderstanding that > tablespaces are a way to split up your data based on different > reliability requirements, and I really need to write a docs patch for > http://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html > <http://www.postgresql.org/docs/9.2/static/manage-ag-tablespaces.html> > that adds a prominent warning like: > > WARNING: Every tablespace must be present before the database can be > started. There is no easy way to recover the database if a tablespace is > lost to disk failure, deletion, use of volatile storage, etc. <b>Do not > put a tablespace on a RAM disk</b>; instead just use UNLOGGED tables. > > (Opinions on the above?) Yes, I'm sure this is useful for DBAs to know how postgres behaves and take some preparations. However, this does not apply to my case, because I'm using tablespaces for I/O distribution across multiple disks and simply for database capacity. The problem is that the reliability of the database system decreases with more disks, because failure of any one of those disks would result in a database PANIC shutdown > I'd rather like to be able to recover from this by treating the > tablespace as dead, so any attempt to get a lock on any table within it > fails with an error and already-in-WAL writes to it just get discarded. > It's the sort of thing that'd only be reasonable to do as a recovery > option (like zero_damaged_pages) since if applied by default it'd lead > to potentially severe and unexpected data loss. I'm in favor of taking a tablespace offline when I/O failure is encountered, and continue running the database server. But WAL must not be discarded because committed transactions must be preserved for durability of ACID. Postgres needs to take these steps when it encounters an I/O error: 1. Take the tablespace offline, so that subsequent read/write against it returns an error without actually issuing read/write against data files. 2. Discard shared buffers containing data in the tablespace. WAL is not affected by the offlining of tablespaces. WAL records already written on the WAL buffer will be written to pg_xlog/ and archived as usual. Those WAL records will be used to recover committed transactions during archive recovery. Regards MauMau
On 06/10/2013 06:39 AM, MauMau wrote: > The problem is that the reliability of the database system decreases > with more disks, because failure of any one of those disks would result > in a database PANIC shutdown. More specifically, with more independent sets of disks / file systems. >> I'd rather like to be able to recover from this by treating the >> tablespace as dead, so any attempt to get a lock on any table within it >> fails with an error and already-in-WAL writes to it just get discarded. >> It's the sort of thing that'd only be reasonable to do as a recovery >> option (like zero_damaged_pages) since if applied by default it'd lead >> to potentially severe and unexpected data loss. > > I'm in favor of taking a tablespace offline when an I/O failure is > encountered, and continuing to run the database server. But WAL must not > be discarded, because committed transactions must be preserved for > the durability of ACID. [snip] > WAL is not affected by the offlining of tablespaces. WAL records > already written to the WAL buffer will be written to pg_xlog/ and > archived as usual. Those WAL records will be used to recover committed > transactions during archive recovery. (I'm still learning the details of Pg's WAL, WAL replay and recovery, so the below's just my understanding): The problem is that WAL for all tablespaces is mixed together in the archives. If you lose your tablespace then you have to keep *all* WAL around and replay *all* of it again when the tablespace comes back online. This would be very inefficient, and would require a lot of tricks to cope with applying WAL to a database whose on-disk state is in the future as far as the archives are concerned. It's not as simple as just replaying all WAL all over again - as I understand it, things like CLUSTER or TRUNCATE will result in relfilenodes not being where they're expected to be as far as old WAL archives are concerned. Selective replay would be required, and that leaves the door open to all sorts of new and exciting bugs in areas that'd hardly ever get tested. To solve the massive disk space explosion problem I imagine we'd have to have per-tablespace WAL. That'd cause a *huge* increase in fsync costs and loss of the rather nice property that WAL writes are nice sequential writes. It'd be complicated and probably cause nightmares during recovery, for archive-based replication, etc. The only other thing I can think of is: When a tablespace is offline, write WAL records to a separate "tablespace recovery log" as they're encountered. Replay this log when the tablespace comes back online, before applying any other new WAL to the tablespace. This wouldn't affect archive-based recovery since it'd already have the records from the original WAL. None of these options seem exactly simple or pretty, especially given the additional complexities that'd be involved in allowing WAL records to be applied out-of-order, something that AFAIK _never_ happens at the moment. The key problem, of course, is that this all sounds like a lot of complicated work for a case that's not really supposed to happen. Right now, the answer is "your database is unrecoverable, switch to your streaming warm standby and re-seed it from the standby". Not pretty, but at least there's the option of using a sync standby and avoiding data loss. How would you approach this? -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
From: "Craig Ringer" <craig@2ndquadrant.com> > The problem is that WAL for all tablespaces is mixed together in the > archives. If you lose your tablespace then you have to keep *all* WAL > around and replay *all* of it again when the tablespace comes back > online. This would be very inefficient, would require a lot of tricks to > cope with applying WAL to a database that has an on-disk state in the > future as far as the archives are concerned. It's not as simple as just > replaying all WAL all over again - as I understand it, things like > CLUSTER or TRUNCATE will result in relfilenodes not being where they're > expected to be as far as old WAL archives are concerned. Selective > replay would be required, and that leaves the door open to all sorts of > new and exciting bugs in areas that'd hardly ever get tested. Although I still lack understanding of PostgreSQL implementation, I have an optimistic feeling that such complexity would not be required. While a tablespace is offline, subsequent access from new or existing transactions is rejected with an error "tablespace is offline". So new WAL records would not be generated for the offline tablespace. To take the tablespace back online, the DBA performs per-tablespace archive recovery. Per-tablespace archive recovery restores tablespace data files from the backup, then read through archive and pg_xlog/ WAL as usual, and selectively applies WAL records for the tablespace. I don't think it's a "must-be-fixed" problem that the WAL for all tablespaces is mixed in one location. I suppose we can tolerate that archive recovery takes a long time. > To solve the massive disk space explosion problem I imagine we'd have to > have per-tablespace WAL. That'd cause a *huge* increase in fsync costs > and loss of the rather nice property that WAL writes are nice sequential > writes. It'd be complicated and probably cause nightmares during > recovery, for archive-based replication, etc. Per-tablespace WAL is very interesting for another reason -- massive-scale OLTP for database consolidation. This feature would certainly be a breakthrough for amazing performance, because WAL is usually the last bottleneck in OLTP. Yes, I can imagine recovery would be much, much more complicated,. > None of these options seem exactly simple or pretty, especially given > the additional complexities that'd be involved in allowing WAL records > to be applied out-of-order, something that AFAIK _never_h happens at the > moment. As I mentioned above, in my shallow understanding, it seems that the additional complexities can be controlled. > The key problem, of course, is that this all sounds like a lot of > complicated work for a case that's not really supposed to happen. Right > now, the answer is "your database is unrecoverable, switch to your > streaming warm standby and re-seed it from the standby". Not pretty, but > at least there's the option of using a sync standby and avoiding data > loss. Sync standby... maybe. Let me consider this. > How would you approach this? Thanks Craig, you gave me some interesting insights. All of these topics are interesting, and I'd like to work on them when I have acquired enough knowledge and experience in PostgreSQL development. Regards MauMau
Josh, Daniel, >> Right now, what we're telling users is "You can have continuous backup >> with Postgres, but you'd better hire an expensive consultant to set it >> up for you, or use this external tool of dubious provenance which >> there's no packages for, or you might accidentally cause your database >> to shut down in the middle of the night." > > This is an outright falsehood. We are telling them, "You better know > what you are doing" or "You should call a consultant". This is no > different than, "You better know what you are doing" or "You should take > driving lessons". What I'm pointing out is that there is no "simple case" for archiving the way we have it set up. That is, every possible way to deploy PITR for Postgres involves complex, error-prone configuration, setup, and monitoring. I don't think that's necessary; simple cases should have simple solutions. If you do a quick survey of pgsql-general, you will see that the issue of databases shutting down unexpectedly because archiving ran them out of disk space is a very common problem. People shouldn't be afraid of their backup solutions. I'd agree that one possible answer for this is to just get one of the external tools simplified, well-packaged, distributed, instrumented for common monitoring systems, and referenced in our main documentation. I'd say Barman is the closest to "a simple solution for the simple common case", at least for PITR. I've been able to give some clients Barman and have them deploy it themselves. This isn't true of the other tools I've tried. Too bad it's GPL, and doesn't do archiving-for-streaming. > I have a clear bias in experience here, but I can't relate to someone > who sets up archives but is totally okay losing a segment unceremoniously, > because it only takes one of those once in a while to make a really, > really bad day. Who is this person that lackadaisically archives, and > are they just fooling themselves? And where are these archivers that If WAL archiving is your *second* level of redundancy, you will generally be willing to have it break rather than interfere with the production workload. This is particularly the case if you're using archiving just as a backup for streaming replication. Heck, I've had one client where archiving was being used *only* to spin up staging servers, and not for production at all; do you think they wanted production to shut down if they ran out of archive space (which it did)? I'll also point out that archiving can silently fail for a number of reasons having nothing to do with "safety" options, such as an NFS mount in Linux silently going away (I've also had this happen), or network issues causing file corruption. Which just points out that we need better ways to detect gaps/corruption in archiving. Anyway, what I'm pointing out is that this is a business decision, and there is no way that we can make a decision for the users what to do when we run out of WAL space. And that the "stop archiving" option needs to be there for users, as well as the "shut down" option. *without* requiring users to learn the internals of the archiving system to implement it, or to know the implied effects of non-obvious PostgreSQL settings. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Sat, Jun 8, 2013 at 11:07 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
This will overwrite /archive/dir/%f if it already exists, which is usually recommended against. Although I don't know that it necessarily should be. One common problem with archiving is for a network glitch to occur during the archive command, so the archive command fails and tries again later. But the later tries will always fail, because the target was created before/during the glitch. Perhaps a more full featured archive command would detect and rename an existing file, rather than either overwriting it or failing.
> On 06/08/2013 07:36 AM, MauMau wrote:
>> 1. If the machine or postgres crashes while archive_command is copying a
>> WAL file, later archive recovery fails.
>> This is because cp leaves a file of less than 16MB in archive area, and
>> postgres refuses to start when it finds such a small archive WAL file.

Should that be changed? If the file is 16MB but it turns to gibberish after 3MB, recovery proceeds up to the gibberish. Given that, why should it refuse to start if the file is only 3MB to start with?

>> The solution, which IIRC Tomas san told me here, is to do like "cp %p
>> /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f".

If we have no compunction about overwriting the file, then I don't see a reason to use the cp + mv combination. If the simple cp fails to copy the entire file, it will be tried again until it succeeds.

> Well it seems to me that one of the problems here is we tell people to use
> copy. We should be telling people to use a command (or supply a command)
> that is smarter than that.

Actually we describe what archive_command needs to fulfill, and tell them to use something that accomplishes that. The example with cp is explicitly given as an example, not a recommendation.

> Wait, what? Is this true (someone else?)
>> 3. You cannot know the reason of archive_command failure (e.g. archive
>> area full) if you don't use PostgreSQL's server logging.
>> This is because archive_command failure is not logged in syslog/eventlog.

It is kind of true. PostgreSQL does not automatically arrange for the stderr of the archive_command to be sent to syslog. But archive_command can do whatever it wants, including arranging for its own failure messages to go to syslog.
Cheers,
Jeff
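Taken together, the tmp-file-plus-rename trick and the worry about overwriting an existing segment suggest what a slightly smarter local archive script could look like. The sketch below is only illustrative - the directory, the rename-aside policy and the script name are assumptions, not anything proposed verbatim in this thread - but it shows the two ideas combined: never leave a partial file under the final name, never silently clobber a different pre-existing copy.

#!/usr/bin/env bash
# Hypothetical wrapper; call from postgresql.conf as: wal_archive "%p" "%f"
ARCHIVEDIR="/archive/dir"            # illustrative archive location
SOURCE="$1"                          # %p - path to the WAL segment
FILE="$2"                            # %f - WAL segment file name
DEST="${ARCHIVEDIR}/${FILE}"

if [ -f "${DEST}" ]; then
    # A byte-identical copy just means a previous attempt already
    # succeeded (e.g. the acknowledgement was lost); report success.
    if cmp -s "${SOURCE}" "${DEST}"; then
        exit 0
    fi
    # A different file under the same name is suspicious: move it
    # aside rather than silently overwriting it.
    mv "${DEST}" "${DEST}.old.$(date +%s)" || exit 1
fi

# Copy to a temporary name, then rename, so a crash mid-copy never
# leaves a truncated segment under the final name.
cp "${SOURCE}" "${DEST}.tmp" || exit 1
mv "${DEST}.tmp" "${DEST}"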
On Mon, Jun 10, 2013 at 11:59 AM, Josh Berkus <josh@agliodbs.com> wrote: > Anyway, what I'm pointing out is that this is a business decision, and > there is no way that we can make a decision for the users what to do > when we run out of WAL space. And that the "stop archiving" option > needs to be there for users, as well as the "shut down" option. > *without* requiring users to learn the internals of the archiving system > to implement it, or to know the implied effects of non-obvious > PostgreSQL settings. I don't doubt this, that's why I do have a no-op fallback for emergencies. The discussion was about defaults. I still think that drop-wal-from-archiving-whenever is not a good one. You may have noticed I also wrote that a neater, common way to drop WAL when under pressure might be nice, to avoid having it ad-hoc and all over, so it's not as though I wanted to suggest that a Postgres feature to this effect would be an anti-feature. And, as I wrote before, it's much easier to teach an external system to drop WAL than it is to teach Postgres to attenuate, hence the repeated correspondence from my fellows and myself about the attenuation side of the equation. Hope that clears things up about where I stand on the matter.
Daniel, Jeff, > I don't doubt this, that's why I do have a no-op fallback for > emergencies. The discussion was about defaults. I still think that > drop-wal-from-archiving-whenever is not a good one. Yeah, we can argue defaults for a long time. What would be better is some way to actually determine what the user is trying to do, or wants to happen. That's why I'd be in favor of an explicit setting; if there's a setting which says: on_archive_failure=shutdown ... then it's a LOT clearer to the user what will happen if the archive runs out of space, even if we make no change to the defaults. And if that setting is changeable on reload, it even becomes a way for users to get out of tight spots. > You may have noticed I also wrote that a neater, common way to drop > WAL when under pressure might be nice, to avoid having it ad-hoc and > all over, so it's not as though I wanted to suggest that a Postgres > feature to this effect would be an anti-feature. Yep. Drake was saying it was an anti-feature, though, so I was arguing with him. >> Well it seems to me that one of the problems here is we tell people to use >> copy. We should be telling people to use a command (or supply a command) >> that is smarter than that. > > Actually we describe what archive_command needs to fulfill, and tell them > to use something that accomplishes that. The example with cp is explicitly > given as an example, not a recommendation. If we offer cp as an example, we *are* recommending it. If we don't recommend it, we shouldn't have it as an example. In fact, if we don't recommend cp, then PostgreSQL should ship with some example shell scripts for archive commands, just as we do for init scripts. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, Jun 10, 2013 at 4:42 PM, Josh Berkus <josh@agliodbs.com> wrote: > Daniel, Jeff, > >> I don't doubt this, that's why I do have a no-op fallback for >> emergencies. The discussion was about defaults. I still think that >> drop-wal-from-archiving-whenever is not a good one. > > Yeah, we can argue defaults for a long time. What would be better is > some way to actually determine what the user is trying to do, or wants > to happen. That's why I'd be in favor of an explicit setting; if there's > a setting which says: > > on_archive_failure=shutdown > > ... then it's a LOT clearer to the user what will happen if the archive > runs out of space, even if we make no change to the defaults. And if > that setting is changeable on reload, it even becomes a way for users to > get out of tight spots. I like your suggestion, save one thing: it's not a 'failure' of archiving if it cannot keep up, provided one subscribes to the view that archiving is not elective. I nit pick at this because one might think this has something to do with a non-zero return code from the archiving program, which already has a pretty alarmist message in the event of transient failures (I think someone brought this up on -hackers a few months ago...can't remember if that resulted in a change). I don't have a better suggestion that is less jargonrific though, but I wanted to express my general appreciation as to the shape of the suggestion.
On 06/10/2013 04:42 PM, Josh Berkus wrote: >> Actually we describe what archive_command needs to fulfill, and tell them >> to use something that accomplishes that. The example with cp is explicitly >> given as an example, not a recommendation. > > If we offer cp as an example, we *are* recommending it. If we don't > recommend it, we shouldn't have it as an example. > > In fact, if we don't recommend cp, then PostgreSQL should ship with some > example shell scripts for archive commands, just as we do for init scripts. Not a bad idea. One that supports rsync and another that supports robocopy. That should cover every platform we support. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
> Not a bad idea. One that supports rsync and another that supports
> robocopy. That should cover every platform we support.

Example script:

=============================
#!/usr/bin/env bash
# Simple script to copy WAL archives from one server to another
# to be called as archive_command (call this as wal_archive "%p" "%f")

# Settings. Please change the below to match your configuration.

# holding directory for the archived log files on the replica
# this is NOT pg_xlog:
WALDIR="/var/lib/pgsql/archive_logs"

# touch file to shut off archiving in case of filling up the disk:
NOARCHIVE="/var/lib/pgsql/NOARCHIVE"

# replica IP, IPv6 or DNS address:
REPLICA="192.168.1.3"

# put any special SSH options here,
# and the location of RSYNC:
export RSYNC_RSH="ssh"
RSYNC="/usr/bin/rsync"

######## DO NOT CHANGE THINGS BELOW THIS LINE ##########

SOURCE="$1"    # %p
FILE="$2"      # %f
DEST="${WALDIR}/${FILE}"

# See whether we want all archiving off
test -f ${NOARCHIVE} && exit 0

# Copy the file to the spool area on the replica, error out if
# the transfer fails
${RSYNC} --quiet --archive --rsync-path=${RSYNC} ${SOURCE} \
    ${REPLICA}:${DEST}
if [ $? -ne 0 ]; then
    exit 1
fi

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Sat, Jun 8, 2013 at 10:36 AM, MauMau <maumau307@gmail.com> wrote: > Yes, I feel designing reliable archiving, even for the simplest case - copy > WAL to disk, is very difficult. I know there are following three problems > if you just follow the PostgreSQL manual. Average users won't notice them. > I guess even professional DBAs migrating from other DBMSs won't, either. > > 1. If the machine or postgres crashes while archive_command is copying a WAL > file, later archive recovery fails. > This is because cp leaves a file of less than 16MB in archive area, and > postgres refuses to start when it finds such a small archive WAL file. > The solution, which IIRC Tomas san told me here, is to do like "cp %p > /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f". > > 2. archive_command dumps core when you run pg_ctl stop -mi. > This is because postmaster sends SIGQUIT to all its descendants. The core > files accumulate in the data directory, which will be backed up with the > database. Of course those core files are garbage. > archive_command script needs to catch SIGQUIT and exit. > > 3. You cannot know the reason of archive_command failure (e.g. archive area > full) if you don't use PostgreSQL's server logging. > This is because archive_command failure is not logged in syslog/eventlog. > > > I hope PostgreSQL will provide a reliable archiving facility that is ready > to use. +1. I think we should have a way to set an archive DIRECTORY, rather than an archive command. And if you set it, then PostgreSQL should just do all of that stuff correctly, without any help from the user. Of course, some users will want to archive to a remote machine via ssh or rsync or what-have-you, and those users will need to provide their own tools. But it's got to be pretty common to archive to a local path that happens to be a remote mount, or to a local directory whose contents are subsequently copied off by a batch job. Making that work nicely with near-zero configuration would be a significant advance. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 12, 2013 at 11:55 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I hope PostgreSQL will provide a reliable archiving facility that is ready >> to use. > > +1. I think we should have a way to set an archive DIRECTORY, rather > than an archive command. And if you set it, then PostgreSQL should > just do all of that stuff correctly, without any help from the user. > Of course, some users will want to archive to a remote machine via ssh > or rsync or what-have-you, and those users will need to provide their > own tools. But it's got to be pretty common to archive to a local > path that happens to be a remote mount, or to a local directory whose > contents are subsequently copied off by a batch job. Making that work > nicely with near-zero configuration would be a significant advance. That, or provide a standard archive command that takes the directory as argument? I bet we have tons of those available among us users...
> On Sat, Jun 8, 2013 at 10:36 AM, MauMau <maumau307@gmail.com> wrote: >> Yes, I feel designing reliable archiving, even for the simplest case - copy >> WAL to disk, is very difficult. I know there are following three problems >> if you just follow the PostgreSQL manual. Average users won't notice them. >> I guess even professional DBAs migrating from other DBMSs won't, either. >> >> 1. If the machine or postgres crashes while archive_command is copying a WAL >> file, later archive recovery fails. >> This is because cp leaves a file of less than 16MB in archive area, and >> postgres refuses to start when it finds such a small archive WAL file. >> The solution, which IIRC Tomas san told me here, is to do like "cp %p >> /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f". >> >> 2. archive_command dumps core when you run pg_ctl stop -mi. >> This is because postmaster sends SIGQUIT to all its descendants. The core >> files accumulate in the data directory, which will be backed up with the >> database. Of course those core files are garbage. >> archive_command script needs to catch SIGQUIT and exit. >> >> 3. You cannot know the reason of archive_command failure (e.g. archive area >> full) if you don't use PostgreSQL's server logging. >> This is because archive_command failure is not logged in syslog/eventlog. >> >> >> I hope PostgreSQL will provide a reliable archiving facility that is ready >> to use. > > +1. I think we should have a way to set an archive DIRECTORY, rather > than an archive command. And if you set it, then PostgreSQL should > just do all of that stuff correctly, without any help from the user. > Of course, some users will want to archive to a remote machine via ssh > or rsync or what-have-you, and those users will need to provide their > own tools. But it's got to be pretty common to archive to a local > path that happens to be a remote mount, or to a local directory whose > contents are subsequently copied off by a batch job. Making that work > nicely with near-zero configuration would be a significant advance. And there's another example why we need an archive command: > I'm just setting up pgpool replication on Amazon AWS. > I'm sending WAL archives to an S3 bucket, which doesn't appear as a > directory on the server. From: http://www.pgpool.net/pipermail/pgpool-general/2013-June/001851.html -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
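That pgpool case also shows why a plain "archive directory" setting can't cover everything: the target isn't a filesystem at all. A minimal sketch of an archive_command for such a setup, assuming the s3cmd client is installed and configured and using a made-up bucket name:

#!/usr/bin/env bash
# Hypothetical wal_to_s3 script, configured as:
#   archive_command = 'wal_to_s3 "%p" "%f"'
# The bucket name and key prefix are illustrative only.
SOURCE="$1"    # %p - path to the WAL segment on the server
FILE="$2"      # %f - WAL segment file name

# s3cmd exits non-zero on failure, so PostgreSQL will keep the segment
# and retry the archive later.
exec s3cmd put "${SOURCE}" "s3://my-wal-archive/wal/${FILE}"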
<p dir="ltr"><br /> On Jun 12, 2013 4:56 PM, "Robert Haas" <<a href="mailto:robertmhaas@gmail.com">robertmhaas@gmail.com</a>>wrote:<br /> ><br /> > On Sat, Jun 8, 2013 at 10:36AM, MauMau <<a href="mailto:maumau307@gmail.com">maumau307@gmail.com</a>> wrote:<br /> > > Yes, I feel designingreliable archiving, even for the simplest case - copy<br /> > > WAL to disk, is very difficult. I know thereare following three problems<br /> > > if you just follow the PostgreSQL manual. Average users won't notice them.<br/> > > I guess even professional DBAs migrating from other DBMSs won't, either.<br /> > ><br /> >> 1. If the machine or postgres crashes while archive_command is copying a WAL<br /> > > file, later archiverecovery fails.<br /> > > This is because cp leaves a file of less than 16MB in archive area, and<br /> >> postgres refuses to start when it finds such a small archive WAL file.<br /> > > The solution, which IIRCTomas san told me here, is to do like "cp %p<br /> > > /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f".<br/> > ><br /> > > 2. archive_command dumps core when you run pg_ctl stop -mi.<br /> > >This is because postmaster sends SIGQUIT to all its descendants. The core<br /> > > files accumulate in the datadirectory, which will be backed up with the<br /> > > database. Of course those core files are garbage.<br />> > archive_command script needs to catch SIGQUIT and exit.<br /> > ><br /> > > 3. You cannot know thereason of archive_command failure (e.g. archive area<br /> > > full) if you don't use PostgreSQL's server logging.<br/> > > This is because archive_command failure is not logged in syslog/eventlog.<br /> > ><br /> >><br /> > > I hope PostgreSQL will provide a reliable archiving facility that is ready<br /> > > to use.<br/> ><br /> > +1. I think we should have a way to set an archive DIRECTORY, rather<br /> > than an archivecommand. And if you set it, then PostgreSQL should<br /> > just do all of that stuff correctly, without any helpfrom the user.<p dir="ltr">Wouldn't that encourage people to do local archiving, which is almost always a bad idea? <pdir="ltr">I'd rather improve the experience with pg_receivexlog or another way that does remote archiving... <br /><p dir="ltr">>Of course, some users will want to archive to a remote machine via ssh<br /> > or rsync or what-have-you,and those users will need to provide their<br /> > own tools. But it's got to be pretty common to archiveto a local<br /> > path that happens to be a remote mount, or to a local directory whose<br /> > contents aresubsequently copied off by a batch job. Making that work<br /> > nicely with near-zero configuration would be a significantadvance.<p dir="ltr">I guess archiving to a nfs mount or so isn't too bad, but archiving and using a cronjob toget the files off is typically a great way to loose data, and we really shouldn't encourage that by default, Imo. <p dir="ltr">/Magnus
On Sat, Jun 8, 2013 at 7:20 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > If archiving is on and failure is due to no space, could we just keep trying > XLogFileInit again for a couple minutes to give archiving a chance to do its > thing? Doing that while holding onto locks and a critical section would be > unfortunate, but if the alternative is a PANIC, it might be acceptable. Blech. I think that's setting our standards pretty low. It would neither be possible to use the system nor to shut it down cleanly; I think the effect would be to turn an immediate PANIC into a slightly-delayed PANIC, possibly accompanied by some DBA panic.

It seems to me that there are two general ways of approaching this problem.

1. Discover sooner that we're out of space. Once we've modified the buffer and entered the critical section, it's too late to have second thoughts about completing the operation. If we could guarantee prior to modifying the buffers that enough WAL space was present to store the record we're about to write, then we'd be certain not to fail for this reason. In theory, this is simple: keep track of how much uncommitted WAL space we have. Increment the value when we create new WAL segments and decrement it by the size of the WAL record we plan to write. In practice, it's not so simple. We don't know whether we're going to emit FPIs until after we enter the critical section, so the size of the record can't be known precisely early enough. We could think about estimating the space needed conservatively and truing it up occasionally. However, there's a second problem: the available-WAL-space counter would surely become a contention point.

Here's a sketch of a possible solution. Suppose we know that an individual WAL record can't be larger than, uh, 64kB. I'm not sure there is a limit on the size of a WAL record, but let's say there is, or we can install one, at around that size limit. Before we enter a critical section that's going to write a WAL record, we verify that the amount of WAL space remaining is at least 64kB * MaxBackends. If it's not, we embark on a series of short sleeps, rechecking after each one; if we hit some time limit, we ERROR out. As long as every backend checks this before every WAL record, we can always be sure there will be at least 64kB left for us. With this approach, the shared variable that stores the amount of WAL space remaining only needs to be updated under WALInsertLock; the "reservation" step only involves a read. That might be cheap enough not to matter.

2. Recover from the fact that we ran out of space by backing out the changes to shared buffers. Initially, I thought this might be a promising approach: if we've modified any shared buffers and discover that we can't log the changes, just invalidate the buffers! Of course, it doesn't work, because the buffer might have already been dirty when we locked it. So we'd actually need a way to reverse out all the changes we were about to log. That's probably too expensive to contemplate, in general; and the code would likely get so little testing as to invite bugs. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 12, 2013 at 11:32 AM, Magnus Hagander <magnus@hagander.net> wrote: > Wouldn't that encourage people to do local archiving, which is almost always > a bad idea? Maybe, but refusing to improve the UI because people might then use the feature seems wrong-headed. > I'd rather improve the experience with pg_receivexlog or another way that > does remote archiving... Sure, remote archiving is great, and I'm glad you've been working on it. In general, I think that's a cleaner approach, but there are still enough people using archive_command that we can't throw them under the bus. > I guess archiving to a nfs mount or so isn't too bad, but archiving and > using a cronjob to get the files off is typically a great way to lose data, > and we really shouldn't encourage that by default, Imo. Well, I think what we're encouraging right now is for people to do it wrong. The proliferation of complex tools to manage this process suggests that it is not easy to manage without a complex tool. That's a problem. And we regularly have users who discover, under a variety of circumstances, that they've been doing it wrong. If there's a better solution than hard-wiring some smarts about local directories, I'm all ears - but making the simple case just work would still be better than doing nothing. Right now you have to be a rocket scientist no matter what configuration you're running. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 6/12/13 10:55 AM, Robert Haas wrote: > But it's got to be pretty common to archive to a local > path that happens to be a remote mount, or to a local directory whose > contents are subsequently copied off by a batch job. Making that work > nicely with near-zero configuration would be a significant advance. Doesn't that just move the problem to managing NFS or batch jobs? Do we want to encourage that? I suspect that there are actually only about 5 or 6 common ways to do archiving (say, local, NFS, scp, rsync, S3, ...). There's no reason why we can't fully specify and/or script what to do in each of these cases.
On Wed, Jun 12, 2013 at 12:07 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On 6/12/13 10:55 AM, Robert Haas wrote: >> But it's got to be pretty common to archive to a local >> path that happens to be a remote mount, or to a local directory whose >> contents are subsequently copied off by a batch job. Making that work >> nicely with near-zero configuration would be a significant advance. > > Doesn't that just move the problem to managing NFS or batch jobs? Do we > want to encourage that? > > I suspect that there are actually only about 5 or 6 common ways to do > archiving (say, local, NFS, scp, rsync, S3, ...). There's no reason why > we can't fully specify and/or script what to do in each of these cases. Go for it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 06/12/2013 08:49 AM, Robert Haas wrote: > Sure, remote archiving is great, and I'm glad you've been working on > it. In general, I think that's a cleaner approach, but there are > still enough people using archive_command that we can't throw them > under the bus. Correct. > >> I guess archiving to a nfs mount or so isn't too bad, but archiving and >> using a cronjob to get the files off is typically a great way to lose data, >> and we really shouldn't encourage that by default, Imo. > Certainly not by default, but it is also something that can be easy to set up reliably if you know what you are doing. > Well, I think what we're encouraging right now is for people to do it > wrong. The proliferation of complex tools to manage this process > suggests that it is not easy to manage without a complex tool. No. It suggests that people have more than one requirement that the project WILL NEVER be able to solve. Granted we have solved some of them, for example pg_basebackup. However, pg_basebackup isn't really useful for a large database. Multithreaded rsync is much more efficient. > That's > a problem. And we regularly have users who discover, under a variety > of circumstances, that they've been doing it wrong. If there's a > better solution than hard-wiring some smarts about local directories, > I'm all ears - but making the simple case just work would still be > better than doing nothing. Agreed. > Right now you have to be a rocket > scientist no matter what configuration you're running. This is quite a bit overblown. Assuming your needs are simple, archiving, as it is now, is a relatively simple process to set up, even without something like PITRTools. Where we run into trouble is when they aren't, and that is ok because we can't solve every problem. We can only provide tools for others to solve their particular issue. What concerns me is we seem to be trying to make this "easy". It isn't supposed to be easy. This is hard stuff. Smart people built it and it takes a smart person to run it. When did it become a bad thing to be something that smart people need to run? Yes, we need to make it reliable. We don't need to be the Nanny database. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
On Wed, Jun 12, 2013 at 6:03 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > >> Right now you have to be a rocket >> scientist no matter what configuration you're running. > > > This is quite a bit overblown. Assuming your needs are simple, archiving, > as it is now, is a relatively simple process to set up, even without something > like PITRTools. Where we run into trouble is when they aren't and that is > ok because we can't solve every problem. We can only provide tools for > others to solve their particular issue. > > What concerns me is we seem to be trying to make this "easy". It isn't > supposed to be easy. This is hard stuff. Smart people built it and it takes > a smart person to run it. When did it become a bad thing to be something > that smart people need to run? > > Yes, we need to make it reliable. We don't need to be the Nanny database. More than easy, it should be obvious. Obvious doesn't mean easy, it just means what you have to do to get it right is clearly in front of you. When you give people the freedom of an "archive command", you also take away any guidance more restrictive options give. I think the point here is that a default would guide people in how to make this work reliably, without having to rediscover it every time. A good, *obvious* (not easy) default. Even "cp blah to NFS mount" is obvious, while not easy (setting up NFS through firewalls is never easy). So, having archive utilities in place of cp would ease the burden of administration, because it'd be based on collective knowledge. Some "pg_cp" (or more likely "pg_archive_wal") could check there's enough space, and whatever else collective knowledge decides is necessary.
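To make that concrete, here is a rough sketch of such a wrapper; the name pg_archive_wal, the paths and the 1 GB threshold are all illustrative assumptions rather than a concrete proposal:

#!/usr/bin/env bash
# Hypothetical pg_archive_wal: refuse to archive when the target
# filesystem is nearly full, so the failure is visible and retried
# rather than silently filling the disk. Call as: pg_archive_wal "%p" "%f"
ARCHIVEDIR="/archive/dir"
MIN_FREE_KB=$((1024 * 1024))         # require at least ~1 GB free

SOURCE="$1"
FILE="$2"

# df -P prints the available space (in 1K blocks) in column 4
FREE_KB=$(df -P "${ARCHIVEDIR}" | awk 'NR==2 {print $4}')
if [ "${FREE_KB}" -lt "${MIN_FREE_KB}" ]; then
    echo "pg_archive_wal: only ${FREE_KB}kB free in ${ARCHIVEDIR}" >&2
    exit 1    # non-zero: PostgreSQL keeps the segment and retries later
fi

# Copy via a temporary file, then rename, to avoid partial segments.
cp "${SOURCE}" "${ARCHIVEDIR}/${FILE}.tmp" && \
    mv "${ARCHIVEDIR}/${FILE}.tmp" "${ARCHIVEDIR}/${FILE}"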
On 06/12/2013 02:03 PM, Joshua D. Drake wrote: > What concerns me is we seem to be trying to make this "easy". It isn't > supposed to be easy. This is hard stuff. Smart people built it and it > takes a smart person to run it. When did it become a bad thing to be > something that smart people need to run? 1997, last I checked. Our unofficial motto: "PostgreSQL: making very hard things possible, and simple things hard." It *is* hard. But that's because we've *made* it hard to understand and manage, not because the problem is inherently hard. For example: can you explain to me in 10 words or less how to monitor to see if archiving is falling behind? I'll bet you can't, and that's because we've provided no reliable way to do so. It's normal when you're developing features for the ability to utilize them to go from hacker --> high-end user --> regular user. We suck at moving to that last stage, partly because whenever someone on this list introduces the idea of making a feature not just great but easy to use, people actually object to the idea that anything should be easy to use. It's like we're afraid of being polluted by the unwashed DevOps masses. In the meantime, Mongo kicks our butts at new user adoption. Why? Their features suck, but the features they do have are easy to use. You'd think we would have learned something from MySQL. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
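For what it's worth, the closest thing to an answer today lives outside Postgres: segments waiting to be archived show up as .ready files under pg_xlog/archive_status, so an external check can count them. A rough sketch, assuming the check has read access to the data directory and using made-up paths and thresholds:

#!/usr/bin/env bash
# Rough archiving-lag check: count WAL segments the archiver has not
# yet shipped. PGDATA and the threshold are illustrative only.
PGDATA="/var/lib/pgsql/data"
WARN_SEGMENTS=32                 # ~512MB of backlog at 16MB per segment

READY=$(ls "${PGDATA}/pg_xlog/archive_status/"*.ready 2>/dev/null | wc -l)
if [ "${READY}" -gt "${WARN_SEGMENTS}" ]; then
    echo "WARNING: ${READY} WAL segments waiting to be archived"
    exit 1
fi
echo "OK: ${READY} WAL segments waiting to be archived"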
On 06/14/2013 11:16 AM, Josh Berkus wrote: > > On 06/12/2013 02:03 PM, Joshua D. Drake wrote: >> What concerns me is we seem to be trying to make this "easy". It isn't >> supposed to be easy. This is hard stuff. Smart people built it and it >> takes a smart person to run it. When did it become a bad thing to be >> something that smart people need to run? > > 1997, last I checked. > > Our unofficial motto: "PostgreSQL: making very hard things possible, and > simple things hard." > > It *is* hard. But that's because we've *made* it hard to understand and > manage, not because the problem is inherently hard. For example: can > you explain to me in 10 words or less how to monitor to see if archiving > is falling behind? I'll bet you can't, and that's because we've > provided no reliable way to do so. Hey, I never said we shouldn't have a complete feature set. I agree with you. IMO it should not have even been committed without the ability to actually know what is going on, and we have had it since (in theory) 8.1? My primary concern is: Don't make it stupid. I liked Claudio's comment, "More than easy, it should be obvious.". It should be obvious from a review of the documentation how to manage this stuff. It isn't, and worse, even if we wrote the documentation it still wouldn't be, because the feature is not complete. With great power comes great responsibility.... :P JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
<div class="moz-cite-prefix">On 06/15/2013 02:16 AM, Josh Berkus wrote:<br /></div><blockquote cite="mid:51BB5DE2.7080506@agliodbs.com"type="cite"><pre wrap="">On 06/12/2013 02:03 PM, Joshua D. Drake wrote: </pre><blockquote type="cite"><pre wrap="">What concerns me is we seem to be trying to make this "easy". It isn't supposed to be easy. This is hard stuff. Smart people built it and it takes a smart person to run it. When did it become a bad thing to be something that smart people need to run? </pre></blockquote><pre wrap=""> 1997, last I checked. Our unofficial motto: "PostgreSQL: making very hard things possible, and simple things hard." It *is* hard. But that's because we've *made* it hard to understand and manage, not because the problem is inherently hard.</pre></blockquote><br /> I have to agree with all this... Pg has someof the best docs around, a really nice SQL level interface, and some truly shocking usability outside that nice zone.<br/><br /> Once a user steps into the "admin zone" they're confronted with a lot of settings they'll really struggleto understand and manage.<br /><br /> I don't want this to be used as an argument not to commit early stages of work,though. I think iterative development with exposure to real-world testing and experience is necessary when you're gettingto the complexity of things that are now going in to Pg. It's more that "commited" != "done"; right now, once itsusable at that power-user stage further management and improvement gets farmed out to external tools and the usabilityof the core feature stays rather ... rough.<br /><br /> Some examples:<br /><br /> fsync=off<br /> ------------<br/><br /> We have a giant foot-cannon in the config files, "fsync" with the "off" option neatly documentedalongside all the others. No note saying "setting fsync=off is equivalent to setting yes_you_can_eat_my_data=on".No WARNING in the logs, not that a user who'd set that without understanding it would look atthe logs. The fsync section of <a href="http://www.postgresql.org/docs/9.2/static/runtime-config-wal.html">http://www.postgresql.org/docs/current/static/runtime-config-wal.html</a> isok, though it could do with a more prominent warning... but the user needs to know where to look. I've repeatedly askedto change this - usually after yet another user comes onto -general with data loss due to not understanding fsync=off- and haven't been able to get agreement on even a config file comment. <br /><br /> Proposed fix 9.3, config filecomment saying "Warning, fsync=off may cause data loss, see the user manual."<br /><br /> Proposed fix 9.4+: Remove fsync=offfrom docs. Leave the GUC enum there but have the postmaster FATAL when it sees it with a message saying "fsync=offhas been replaced with unsafe_writes=on, please change your postgresql.conf". Add the corresponding new GUC.<br/><br /> max_connections<br /> ------------------------<br /><br /> max_connections is another one. I see systemswith max_connections=3000 in the wild... performing terribly, as you'd expect. Yet there's no indication (even inthe docs) that this is often a terrible idea, and that you should really look into a connection pooler if you're goingabove a few hundred (hardware/workload dependent). 
<a href="http://www.postgresql.org/docs/9.1/static/runtime-config-connection.html">http://www.postgresql.org/docs/current/static/runtime-config-connection.html</a> doesn'tmention it, there's no config file comment, etc.<br /><br /> Proposed fix: Comment in the config file saying somethinglike "See the documentation before raising this above a few hundred". In the docs, a note about the perf impactof high max_connections with a brief mention of external connection pooling and links to pgbouncer/pgpool-II, mentionthat many app frameworks have built-in connection pools. Brief comment about there being an optimum workload-and-hardwaredependent level of concurrency above which performance degrades. I'll be happy to write a draft patchfor this if there's agreement on the idea.<br /><br /> vacuum/autovacuum<br /> ---------------------------<br /><br/> autovaccum tuning. We've just had this conversation and there seems to be agreement that it needs some love, butunlike the above two there's no easy fix and it's an ongoing process. I don't have any right to complain about it unlessI do more to help fix it.<br /><br /> Bloat<br /> ------<br /><br /> Table bloat. Table bloat has been a major issuewith PostgreSQL users/admins for years. Anyone care to explain to me in a simple paragraph how to find out if you havetable or index bloat issues in your database and what to do about it? (Maybe we need "pg_catalog.pg_index_bloat" and"pg_catalog.pg_table_bloat" views including FILLFACTOR correction?)<br /><br /> I think I'll draft up a patch to add exactlythat.<br /><br /> Dump/restore and globals<br /> ----------------------------------<br /><br /> Dump and restore.The standard advice I give is to do a "pg_dumpall --globals-only" followed by a "pg_dump -Fc" of each database, sincewe don't have "pg_dumpfall -Fc". Users often seem to do single-DB dumps then find themselves having trouble restoringthem due to missing user accounts, etc. Or they do a pg_dumpall then want to restore just one DB/table.<br /><br/> There's also a lot of confusion around restoring dumps due to the different formats. This has improved now that pg_restoretells the user to restore a SQL dump using psql:<br /><br /> $ pg_restore regress.sql <br /> pg_restore: [archiver]input file appears to be a text format dump. Please use psql.<br /><br /> ... though psql still chokes horriblyon a pg_dump -Fc file:<br /><br /> psql:regress.out:1: ERROR: syntax error at or near "PGDMP"<br /> LINE 1: PGDMP^A^L^A^A^A^A^A^A^A^AREVOKEALL ON SCHEMA public FROM postgres;<br /> ^<br /> psql:regress.out:2: WARNING: noprivileges were granted for "public"<br /> GRANT<br /><br /><br /> Proposed fix: Should we have a pg_dumpall that producesper-database -Fc or -Fd output? Or perhaps --include-roles / --include-tablespaces options to pg_dump that stashesa pg_dumpall --globals-only inside the -Fc archive?<br /><br /> Proposed fix : If psql sees the pseudo-command PGDMPit should quit immediately with an error saying "This is a PostgreSQL custom format dump file and must be restored withthe pg_restore command". (Happy to try to come up with a patch for this).<br /><br /> Proposed fix: Instead of just tellingthe user to run psql, pg_restore should, if there are no selective restore options, propose a psql command. Or evenjust invoke psql, though I'm hesitant about that because of issues where the psql on the PATH is one version and theuser runs /path/to/pg_restore for another version. 
Or, if we built a full path using $0, the case where pg_restore isbeing run from within the source tree so there's no psql in the same directory.<br /><br /> pg_hba.conf<br /> ----------------<br/><br /> The #1 question I see on Stack Overflow has to be confusion about pg_hba.conf, mostly from peoplewho have no idea it exists, don't understand how to configure it, etc. They can't tell the difference between peer/ident/trust,don't understand that if you set 'md5' but don't set a password then the password will always be wrong,etc.<br /><br /> I list this last because I think Pg's client authentication is well documented, it lacks obvious foot-guns,and it's really not that hard. I have little sympathy for people who respond to a docs link with "I don't havetime for that, I'm not a DBA, can you just tell me what I need to change?". Authentication and authorization isn't simple,and attempts to make it too simple usually also make it wrong. At some point I want to think about how to make iteasier to manage Pg's auth, but I know there are some big complexities around it because there's no DB available at thetime pg_hba.conf checking is done on an incoming connection so not even a shared/global table may be read from. <br /><br/> Proposed fix: None required at this time.<br /><br /><pre class="moz-signature" cols="72">-- Craig Ringer <a class="moz-txt-link-freetext" href="http://www.2ndQuadrant.com/">http://www.2ndQuadrant.com/</a>PostgreSQLDevelopment, 24x7 Support, Training & Services</pre>
On 15 June 2013 14:43, Craig Ringer <craig@2ndquadrant.com> wrote: > The #1 question I see on Stack Overflow has to be confusion about > pg_hba.conf, mostly from people who have no idea it exists, don't understand > how to configure it, etc. The totally non-obvious name of the file probably has something to do with that. It should be called 'auth.conf'. Cheers, BJ
On 06/15/2013 02:08 PM, Brendan Jurd wrote: > On 15 June 2013 14:43, Craig Ringer <craig@2ndquadrant.com> wrote: >> The #1 question I see on Stack Overflow has to be confusion about >> pg_hba.conf, mostly from people who have no idea it exists, don't understand >> how to configure it, etc. > The totally non-obvious name of the file probably has something to do > with that. It should be called 'auth.conf'. Not convinced; since it only controls one facet of auth - it doesn't define users, passwords, grants, etc - that'd probably be just as confusing. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 15 June 2013 16:18, Craig Ringer <craig@2ndquadrant.com> wrote: > On 06/15/2013 02:08 PM, Brendan Jurd wrote: >> On 15 June 2013 14:43, Craig Ringer <craig@2ndquadrant.com> wrote: >>> The #1 question I see on Stack Overflow has to be confusion about >>> pg_hba.conf, mostly from people who have no idea it exists, don't understand >>> how to configure it, etc. >> The totally non-obvious name of the file probably has something to do >> with that. It should be called 'auth.conf'. > Not convinced; since it only controls one facet of auth - it doesn't > define users, passwords, grants, etc ... When somebody is setting up postgres for the first time, and they list the contents of the config directory, you want them to have some idea what each of the files is for. If they see something called 'auth.conf', they'll get the right general idea. An understanding of the nuances (like that it doesn't control user accounts) will come once they open up the file -- which they may well do, because it is called 'auth.conf', and 'auth' is a thing you want to configure. If they see something called 'pg_hba.conf', they may very reasonably assume that it is some internal/advanced stuff that they don't need to worry about just yet, because what the heck is a 'pg_hba'? The 'pg' is unnecessary and the 'hba' is an internal jargon term that we've ill-advisedly allowed to leak out into the filename. If you really feel that 'auth.conf' is too imprecise, maybe something like 'conn-auth.conf' would be more your style. Cheers, BJ
On 2013-06-15 08:44 CEST, Brendan Jurd wrote: > On 15 June 2013 16:18, Craig Ringer ... wrote: >> On 06/15/2013 02:08 PM, Brendan Jurd wrote: >>> On 15 June 2013 14:43, Craig Ringer ... wrote: >>>> The #1 question I see on Stack Overflow has to be confusion about >>>> pg_hba.conf, mostly from people who have no idea it exists, don't understand >>>> how to configure it, etc. >>> The totally non-obvious name of the file probably has something to do >>> with that. It should be called 'auth.conf'. >> Not convinced; since it only controls one facet of auth - it doesn't >> define users, passwords, grants, etc ... > > When somebody is setting up postgres for the first time, and they list > the contents of the config directory, you want them to have some idea as they may not have read up to section 19.1 The pg_hba.conf File inside chapter 19 Client Authentication of part III. Server Administration :-?, which states (as of 9.2.4): """ Client authentication is controlled by a configuration file, which traditionally is named pg_hba.conf and is stored in the database cluster's data directory. (HBA stands for host-based authentication.) A default pg_hba.conf file is installed when the data directory is initialized by initdb. It is possible to place the authentication configuration file elsewhere, however; see the hba_file configuration parameter. ... """ ;-) thanks to hyperlinks this is quite close to the start, but I was surprised to not find it by skimming the text and following the hyperlinks but by knowing the filename instead and entering it ("pg_hba.conf") into the Search Documentation text field on the top right corner of http://www.postgresql.org/docs/9.2/interactive/index.html. Maybe we could find a better place of the whatever-then-name inside the part of the docs even the "TL;DR" mood people might read? A paragraph or two spiced up with some catchy StackOverflow-inspired terms people with a need to configure this authentication aspect might have expected could also be expected in INSTALL like docs or directly observable on the hyperlinked way from part I. Tutorial chapter 1 Getting Started section 1.1 Installation all down to chapter 15. Installation from Source Code. But of course only, if this is "wanted behavior". If I read the section 1.1 Installation (again 9.2.4) I have the impression, that it more transports the message in our case, that "you are the site admin, deal with it, read the docs", or don't I read it right? (I am a non-native English reader) > what each of the files is for. If they see something called > 'auth.conf', they'll get the right general idea. An understanding of > the nuances (like that it doesn't control user accounts) will come > once they open up the file -- which they may well do, because it is > called 'auth.conf', and 'auth' is a thing you want to configure. that may well be, I do not know, how people that prefer reading folder and filenames over manuals written for them grok text, as I read the docs, promised ;-) > If they see something called 'pg_hba.conf', they may very reasonably > assume that it is some internal/advanced stuff that they don't need to > worry about just yet, because what the heck is a 'pg_hba'? The 'pg' > is unnecessary and the 'hba' is an internal jargon term that we've > ill-advisedly allowed to leak out into the filename. 
at around 1995 when I started using Postgres95 it sure took some time to find that pg_hba.conf file, but I then perceived it to be very well documented, and also felt a bit guilty, as its name occurred in the INSTALL file cf. ftp://ftp-archives.postgresql.org/pub/source/v7.2/postgresql-7.2.tar.gz and the INSTALL file. Therein "buried" inside Step 1 of "If You Are Upgrading" ... > If you really feel that 'auth.conf' is too imprecise, maybe something > like 'conn-auth.conf' would be more your style. I think you guys did and still do a fantastic job with PostgreSQL and esp. its documentation, but in this case I doubt that any renaming of config files will really have an impact on usability in the shady area of "TL;DR" - at least for the next twenty years or so - as it still holds that from a false start (e.g. not reading the documentation written) anything may follow. But as usability is a practical concern I (as a user) would be +0 on renaming it, if people not finding it under the old name but then editing it is really the wanted behavior. All the best, Stefan.
On 06/14/2013 11:18 PM, Craig Ringer wrote: > > On 06/15/2013 02:08 PM, Brendan Jurd wrote: >> On 15 June 2013 14:43, Craig Ringer <craig@2ndquadrant.com> wrote: >>> The #1 question I see on Stack Overflow has to be confusion about >>> pg_hba.conf, mostly from people who have no idea it exists, don't understand >>> how to configure it, etc. >> The totally non-obvious name of the file probably has something to do >> with that. It should be called 'auth.conf'. > Not convinced; since it only controls one facet of auth - it doesn't > define users, passwords, grants, etc - that'd probably be just as confusing. > Yeah, this one is not making the grade. pg_hba is just that, host-based auth, but I think we are bikeshedding now. JD
On 06/14/2013 11:44 PM, Brendan Jurd wrote: > If they see something called 'pg_hba.conf', they may very reasonably > assume that it is some internal/advanced stuff that they don't need to > worry about just yet, because what the heck is a 'pg_hba'? The 'pg' Only the uneducated. Look, I am not trying to be an ass but seriously. Read the docs. I will argue vigorously against the idea of us designing a system that has people NOT reading the docs. JD
On 06/15/2013 03:53 PM, Joshua D. Drake wrote: > > Yeah this one is not making the grade. pg_hba is just that host based > auth but I think we are bikeshedding now. Agreed... Even as I posted, I realised I shouldn't have mentioned the last point, since everything else has been ignored. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Jun 15, 2013 at 12:43:10PM +0800, Craig Ringer wrote: > Bloat > ------ > > Table bloat. Table bloat has been a major issue with PostgreSQL > users/admins for years. Anyone care to explain to me in a simple > paragraph how to find out if you have table or index bloat issues in > your database and what to do about it? (Maybe we need > "pg_catalog.pg_index_bloat" and "pg_catalog.pg_table_bloat" views > including FILLFACTOR correction?) > > I think I'll draft up a patch to add exactly that. Nice list btw. I monitor this by using the excellent check_postgres nagios plugin, which has stuff to check for things like this. Which makes me think that it might be possible to add some other checks like this, in for example pg_ctl. A big fat warning 'your data may be eaten' might get noticed at startup. (A minor annoyance is that in recent versions of PostgreSQL you have to give check_postgres admin rights, otherwise it can't warn you about "idle in transaction" problems.) Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
On 06/15/2013 05:57 PM, Martijn van Oosterhout wrote: > On Sat, Jun 15, 2013 at 12:43:10PM +0800, Craig Ringer wrote: >> Bloat ------ >> >> Table bloat. Table bloat has been a major issue with PostgreSQL >> users/admins for years. Anyone care to explain to me in a simple >> paragraph how to find out if you have table or index bloat issues >> in your database and what to do about it? (Maybe we need >> "pg_catalog.pg_index_bloat" and "pg_catalog.pg_table_bloat" >> views including FILLFACTOR correction?) >> >> I think I'll draft up a patch to add exactly that. > > Nice list btw. I monitor this by using the excellent check_progres > nagios plugin, which has stuff to check for things like this. It's been a vital tool for me too. It probably serves as a reasonable guide for some things we could really usefully expose in system views. Once in core we can document them in the main user manual, making them reasonably discoverable. Argh. I think my TODO has some kind of horrible disease, it keeps growing uncontrollably. > Which makes me think that it might be possible to add some other > checks like this, in for example pg_ctl. A big fat warning 'your > data may be eaten' might get noticed at startup. The users who have this kind of issue aren't the ones running pg_ctl. They'll usually be using launchd, systemctl, upstart, sysv init scripts, etc ... whatever, something that sends the warning straight to the system logs that they likely never read. I don't have tons of sympathy for these people, but I do think making fsync=off so easy to set without understanding it is kind of like handing a grenade to a toddler. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
> fsync=off synchronous_commit=off replaced fsync=off in almost every use case where fsync=off might have been useful. The only remaining use case is for the initial build of a database. In that case what the user really wants is to turn off WAL logging entirely though. Having a WAL log and not fsyncing it is kind of pointless. I guess it lets you replicate the database but it doesn't let you use the WAL log for recovery locally. > Bloat > ------ > > Table bloat. Table bloat has been a major issue with PostgreSQL users/admins > for years. Anyone care to explain to me in a simple paragraph how to find > out if you have table or index bloat issues in your database and what to do > about it? (Maybe we need "pg_catalog.pg_index_bloat" and > "pg_catalog.pg_table_bloat" views including FILLFACTOR correction?) A nice view that exposes a summary of information from the fsm per table would be pretty handy. In general there's a lot of data tied up in things like the fsm that could be usefully exposed to users. -- greg
Peter Eisentraut <peter_e@gmx.net> writes: > I suspect that there are actually only about 5 or 6 common ways to do > archiving (say, local, NFS, scp, rsync, S3, ...). There's no reason why > we can't fully specify and/or script what to do in each of these cases. And provide either fully reliable contrib scripts or "internal" archive commands ready to use for those common cases. I can't think of other common use cases, by the way. +1 Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Mon, Jun 10, 2013 at 07:28:24AM +0800, Craig Ringer wrote: > (I'm still learning the details of Pg's WAL, WAL replay and recovery, so > the below's just my understanding): > > The problem is that WAL for all tablespaces is mixed together in the > archives. If you lose your tablespace then you have to keep *all* WAL > around and replay *all* of it again when the tablespace comes back > online. This would be very inefficient, would require a lot of tricks to > cope with applying WAL to a database that has an on-disk state in the > future as far as the archives are concerned. It's not as simple as just > replaying all WAL all over again - as I understand it, things like > CLUSTER or TRUNCATE will result in relfilenodes not being where they're > expected to be as far as old WAL archives are concerned. Selective > replay would be required, and that leaves the door open to all sorts of > new and exciting bugs in areas that'd hardly ever get tested. > > To solve the massive disk space explosion problem I imagine we'd have to > have per-tablespace WAL. That'd cause a *huge* increase in fsync costs > and loss of the rather nice property that WAL writes are nice sequential > writes. It'd be complicated and probably cause nightmares during > recovery, for archive-based replication, etc. > > The only other thing I can think of is: When a tablespace is offline, > write WAL records to a separate "tablespace recovery log" as they're > encountered. Replay this log when the tablespace is restored, > before applying any other new WAL to the tablespace. This wouldn't > affect archive-based recovery since it'd already have the records from > the original WAL. > > None of these options seem exactly simple or pretty, especially given > the additional complexities that'd be involved in allowing WAL records > to be applied out-of-order, something that AFAIK _never_ happens at the > moment. > > The key problem, of course, is that this all sounds like a lot of > complicated work for a case that's not really supposed to happen. Right > now, the answer is "your database is unrecoverable, switch to your > streaming warm standby and re-seed it from the standby". Not pretty, but > at least there's the option of using a sync standby and avoiding data loss. > > How would you approach this? Sorry to be replying late. You are right that you could record/apply WAL separately for offline tablespaces. The problem is that you could have logical ties from the offline tablespace to online tablespaces. For example, what happens if data in an online tablespace references a primary key in an offline tablespace? What if the system catalogs are stored in an offline tablespace? Right now, we allow logical bindings across physical tablespaces. To do what you want, you would really need to store each database in its own tablespace. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 6 June 2013 16:00, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > In the "Redesigning checkpoint_segments" thread, many people opined that > there should be a hard limit on the amount of disk space used for WAL: > http://www.postgresql.org/message-id/CA+TgmoaOkgZb5YsmQeMg8ZVqWMtR=6S4-PPd+6jiy4OQ78ihUA@mail.gmail.com. > I'm starting a new thread on that, because that's mostly orthogonal to > redesigning checkpoint_segments. > > The current situation is that if you run out of disk space while writing > WAL, you get a PANIC, and the server shuts down. That's awful. We can try to > avoid that by checkpointing early enough, so that we can remove old WAL > segments to make room for new ones before you run out, but unless we somehow > throttle or stop new WAL insertions, it's always going to be possible to use > up all disk space. A typical scenario where that happens is when > archive_command fails for some reason; even a checkpoint can't remove old, > unarchived segments in that case. But it can happen even without WAL > archiving. I don't see that we need to prevent WAL insertions when the disk fills. We still have the whole of wal_buffers to use up. When that is full, we will prevent further WAL insertions because we will be holding WALWriteLock to clear more space. So the rest of the system will lock up nicely, like we want, apart from read-only transactions. Instead of PANICing, we should simply signal the checkpointer to perform a shutdown checkpoint. That normally requires a WAL insertion to complete, but it seems easy enough to make that happen by simply rewriting the control file, after which ALL WAL files are superfluous for crash recovery and can be deleted. Once that checkpoint is complete, we can begin deleting WAL files that are archived/replicated and continue as normal. The previously failing WAL write can now be made again and may succeed this time - if it does, we continue, if not - now we PANIC. Note that this would not require in-progress transactions to be aborted. They can continue normally once wal_buffers re-opens. We don't really want anything too drastic, because if this situation happens once it may happen many times - I'm imagining a flaky network etc. So we want the situation to recover quickly and easily, without too many consequences. The above appears to be a very minimal change from existing code and doesn't introduce lots of new points of breakage. > I've seen a case, where it was even worse than a PANIC and shutdown. pg_xlog > was on a separate partition that had nothing else on it. The partition > filled up, and the system shut down with a PANIC. Because there was no space > left, it could not even write the checkpoint after recovery, and thus > refused to start up again. There was nothing else on the partition that you > could delete to make space. The only recourse would've been to add more disk > space to the partition (impossible), or manually delete an old WAL file that > was not needed to recover from the latest checkpoint (scary). Fortunately > this was a test system, so we just deleted everything. Doing shutdown checkpoints via the control file would exactly solve that issue. We already depend upon the readability of the control file anyway, so this changes nothing. (And if you think it does, then we can have multiple control files, or at least a backup control file at shutdown). We can make the shutdown checkpoint happen always at EOF of a WAL segment, so at shutdown we don't need any WAL files to remain at all.
> So we need to somehow stop new WAL insertions from happening, before it's > too late. I don't think we do. What might be sensible is to have checkpoints speed up as WAL volume approaches a predefined limit, so that we minimise the delay caused when wal_buffers locks up. Not suggesting anything here for 9.4, since we're midCF. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
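For illustration, a rough sketch of how "checkpoints speed up as WAL volume approaches a predefined limit" might be computed; max_wal_space and this function are hypothetical, and completion_target stands in for the existing checkpoint_completion_target setting:

    /*
     * Sketch only: shrink the checkpoint spreading target as WAL usage
     * approaches a configured ceiling, so segments get recycled sooner.
     */
    static double
    effective_completion_target(double completion_target,   /* checkpoint_completion_target GUC */
                                uint64 wal_bytes_used,
                                uint64 max_wal_space)        /* hypothetical limit */
    {
        double      used = (double) wal_bytes_used / (double) max_wal_space;

        if (used >= 0.9)
            return 0.0;             /* nearly full: finish the checkpoint as fast as possible */

        return completion_target * (1.0 - used);    /* spread less as WAL grows */
    }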
Simon Riggs <simon@2ndQuadrant.com> writes: > On 6 June 2013 16:00, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> The current situation is that if you run out of disk space while writing >> WAL, you get a PANIC, and the server shuts down. That's awful. > I don't see we need to prevent WAL insertions when the disk fills. We > still have the whole of wal_buffers to use up. When that is full, we > will prevent further WAL insertions because we will be holding the > WALwritelock to clear more space. So the rest of the system will lock > up nicely, like we want, apart from read-only transactions. I'm not sure that "all writing transactions lock up hard" is really so much better than the current behavior. My preference would be that we simply start failing writes with ERRORs rather than PANICs. I'm not real sure ATM why this has to be a PANIC condition. Probably the cause is that it's being done inside a critical section, but could we move that? > Instead of PANICing, we should simply signal the checkpointer to > perform a shutdown checkpoint. And if that fails for lack of disk space? In any case, what you're proposing sounds like a lot of new complication in a code path that is necessarily never going to be terribly well tested. regards, tom lane
On 21 January 2014 18:35, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On 6 June 2013 16:00, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >>> The current situation is that if you run out of disk space while writing >>> WAL, you get a PANIC, and the server shuts down. That's awful. > >> I don't see we need to prevent WAL insertions when the disk fills. We >> still have the whole of wal_buffers to use up. When that is full, we >> will prevent further WAL insertions because we will be holding the >> WALwritelock to clear more space. So the rest of the system will lock >> up nicely, like we want, apart from read-only transactions. > > I'm not sure that "all writing transactions lock up hard" is really so > much better than the current behavior. Lock up momentarily, until the situation clears. But my proposal would allow the situation to fully clear, i.e. all WAL files could be deleted as soon as replication/archiving has caught up. The current behaviour doesn't automatically correct itself as this proposal would. My proposal also remains fully safe in combination with synchronous replication, and has zero performance overhead for mainline processing. > My preference would be that we simply start failing writes with ERRORs > rather than PANICs. Yes, that is what I am proposing, amongst other points. > I'm not real sure ATM why this has to be a PANIC > condition. Probably the cause is that it's being done inside a critical > section, but could we move that? Yes, I think so. >> Instead of PANICing, we should simply signal the checkpointer to >> perform a shutdown checkpoint. > > And if that fails for lack of disk space? I proposed a way to ensure it wouldn't fail for that, at least on pg_xlog space. > In any case, what you're > proposing sounds like a lot of new complication in a code path that > is necessarily never going to be terribly well tested. It's the smallest amount of change proposed so far... I agree on the danger of untested code. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Fwiw I think "all transactions lock up until space appears" is *much* better than PANICing. Often disks fill up due to other transient storage use, or people may have options to manually increase the amount of space. It's much better if the database just continues to function after that rather than needing to be restarted.
Greg Stark <stark@mit.edu> writes: > Fwiw I think "all transactions lock up until space appears" is *much* > better than PANICing. Often disks fill up due to other transient > storage or people may have options to manually increase the amount of > space. it's much better if the database just continues to function > after that rather than need to be restarted. Well, PANIC is certainly bad, but what I'm suggesting is that we just focus on getting that down to ERROR and not worry about trying to get out of the disk-shortage situation automatically. Nor do I believe that it's such a good idea to have the database freeze up until space appears rather than reporting errors. regards, tom lane
On Tue, Jan 21, 2014 at 9:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On 6 June 2013 16:00, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >>> The current situation is that if you run out of disk space while writing >>> WAL, you get a PANIC, and the server shuts down. That's awful. >> I don't see we need to prevent WAL insertions when the disk fills. We >> still have the whole of wal_buffers to use up. When that is full, we >> will prevent further WAL insertions because we will be holding the >> WALwritelock to clear more space. So the rest of the system will lock >> up nicely, like we want, apart from read-only transactions. > I'm not sure that "all writing transactions lock up hard" is really so > much better than the current behavior. My preference would be that we > simply start failing writes with ERRORs rather than PANICs. I'm not real > sure ATM why this has to be a PANIC condition. Probably the cause is that > it's being done inside a critical section, but could we move that? My understanding is that if it runs out of buffer space while in an XLogInsert, it will be holding one or more buffer content locks exclusively, and unless it can complete the xlog (or scrounge up the info to return that buffer to its previous state), it can never release that lock. There might be other paths where it could get by with an ERROR, but if no one can write xlog anymore, all of those paths must quickly converge to the one that cannot simply ERROR. Cheers, Jeff
Jeff Janes <jeff.janes@gmail.com> writes: > On Tue, Jan 21, 2014 at 9:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> My preference would be that we simply start failing writes with ERRORs >> rather than PANICs. I'm not real sure ATM why this has to be a PANIC >> condition. Probably the cause is that it's being done inside a critical >> section, but could we move that? > My understanding is that if it runs out of buffer space while in an > XLogInsert, it will be holding one or more buffer content locks > exclusively, and unless it can complete the xlog (or scrounge up the info > to return that buffer to its previous state), it can never release that > lock. There might be other paths were it could get by with an ERROR, but > if no one can write xlog anymore, all of those paths must quickly converge > to the one that cannot simply ERROR. Well, the point is we'd have to somehow push detection of the problem to a point before the critical section that does the buffer changes and WAL insertion. The first idea that comes to mind is (1) estimate the XLOG space needed (an overestimate is fine here); (2) just before entering the critical section, call some function to "reserve" that space, such that we always have at least sum(outstanding reservations) available future WAL space; (3) release our reservation as part of the actual XLogInsert call. The problem here is that the "reserve" function would presumably need an exclusive lock, and would be about as much of a hot spot as XLogInsert itself is. Plus we'd be paying a lot of extra cycles to solve a corner case problem that, with all due respect, comes up pretty darn seldom. So probably we need a better idea than that. Maybe we could get some mileage out of the fact that very approximate techniques would be good enough. For instance, I doubt anyone would bleat if the system insisted on having 10MB or even 100MB of future WAL space always available. But I'm not sure exactly how to make use of that flexibility. regards, tom lane
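For illustration, a minimal sketch of steps (1)-(3) above; none of these names (WalReservationShmem, XLogReserveSpace, XLogReleaseReservation) exist in PostgreSQL, and the preallocated-space counter is assumed to be kept up to date elsewhere, e.g. by the WAL writer:

    #include "postgres.h"
    #include "storage/spin.h"

    /* Hypothetical shared-memory state; SpinLockInit() runs at shmem creation. */
    typedef struct WalReservationShmem
    {
        slock_t     mutex;               /* protects the counters below */
        uint64      reserved_bytes;      /* sum of outstanding reservations */
        uint64      preallocated_bytes;  /* future WAL space known to exist */
    } WalReservationShmem;

    static WalReservationShmem *WalReservation;

    /* Step (2): call just before entering the critical section. */
    static bool
    XLogReserveSpace(uint64 estimate)
    {
        bool        ok;

        SpinLockAcquire(&WalReservation->mutex);
        ok = (WalReservation->reserved_bytes + estimate <=
              WalReservation->preallocated_bytes);
        if (ok)
            WalReservation->reserved_bytes += estimate;
        SpinLockRelease(&WalReservation->mutex);

        return ok;                       /* caller ERRORs or sleeps if false */
    }

    /* Step (3): release the reservation once XLogInsert has run. */
    static void
    XLogReleaseReservation(uint64 estimate)
    {
        SpinLockAcquire(&WalReservation->mutex);
        WalReservation->reserved_bytes -= estimate;
        SpinLockRelease(&WalReservation->mutex);
    }

The spinlock here is exactly the hot spot described above, which is what motivates the more approximate schemes discussed below.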
On Tue, Jan 21, 2014 at 3:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Maybe we could get some mileage out of the fact that very approximate > techniques would be good enough. For instance, I doubt anyone would bleat > if the system insisted on having 10MB or even 100MB of future WAL space > always available. But I'm not sure exactly how to make use of that > flexibility. In the past I've thought that one approach that would eliminate concerns about portably and efficiently knowing how much space is left on the pg_xlog filesystem is to have a "ballast file". Under this scheme, perhaps XLogInsert() could differentiate between a soft and hard failure. Hopefully the reserve function you mentioned, which is still called at the same place, just before each critical section thereby becomes cheap. Perhaps I'm just restating what you said, though. -- Peter Geoghegan
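For illustration, a bare-bones sketch of the ballast-file idea; the file name and both functions are hypothetical, and, as noted downthread, a per-record stat() would likely be too slow, so a real implementation would presumably do this check in a background process:

    #include <sys/stat.h>
    #include <unistd.h>
    #include <stdbool.h>

    /* Hypothetical fixed-size reserve created at initdb time. */
    #define BALLAST_FILE "pg_xlog/ballast"

    /* Soft check: as long as the ballast exists, that much space can still be freed. */
    static bool
    ballast_available(void)
    {
        struct stat st;

        return stat(BALLAST_FILE, &st) == 0;
    }

    /* On ENOSPC, give the reserved space back so the final checkpoint can complete. */
    static bool
    release_ballast(void)
    {
        return unlink(BALLAST_FILE) == 0;
    }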
On 2014-01-21 18:24:39 -0500, Tom Lane wrote: > Jeff Janes <jeff.janes@gmail.com> writes: > > On Tue, Jan 21, 2014 at 9:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> My preference would be that we simply start failing writes with ERRORs > >> rather than PANICs. I'm not real sure ATM why this has to be a PANIC > >> condition. Probably the cause is that it's being done inside a critical > >> section, but could we move that? > > > My understanding is that if it runs out of buffer space while in an > > XLogInsert, it will be holding one or more buffer content locks > > exclusively, and unless it can complete the xlog (or scrounge up the info > > to return that buffer to its previous state), it can never release that > > lock. There might be other paths were it could get by with an ERROR, but > > if no one can write xlog anymore, all of those paths must quickly converge > > to the one that cannot simply ERROR. > > Well, the point is we'd have to somehow push detection of the problem > to a point before the critical section that does the buffer changes > and WAL insertion. Well, I think that's already hard for the heapam.c stuff, but doing that in xact.c seems fracking hairy. We can't simply stop in the middle of a commit and not continue, that'd often grind the system to a halt preventing cleanup. Additionally the size of the inserted record for commits is essentially unbounded, which makes it an especially fun case. > The first idea that comes to mind is (1) estimate the XLOG space needed > (an overestimate is fine here); (2) just before entering the critical > section, call some function to "reserve" that space, such that we always > have at least sum(outstanding reservations) available future WAL space; > (3) release our reservation as part of the actual XLogInsert call. I think that's not necessarily enough. In a COW filesystem like btrfs or ZFS you really cannot give many guarantees about writes succeeding or failing, even if we were able to create (and zero) a new segment. Even if you disregard that, we'd need to keep up with lots of concurrent reservations, looking a fair bit into the future. E.g. during a "smart" shutdown in a workload with lots of subtransactions trying to reserve space might make the situation actually worse because we might end up trying to reserve the combined size of records. > The problem here is that the "reserve" function would presumably need an > exclusive lock, and would be about as much of a hot spot as XLogInsert > itself is. Plus we'd be paying a lot of extra cycles to solve a corner > case problem that, with all due respect, comes up pretty darn seldom. > So probably we need a better idea than that. Yea, I don't think anything really safe is going to work without significant penalties. > Maybe we could get some mileage out of the fact that very approximate > techniques would be good enough. For instance, I doubt anyone would bleat > if the system insisted on having 10MB or even 100MB of future WAL space > always available. But I'm not sure exactly how to make use of that > flexibility. If we'd be more aggressive with preallocating WAL files and doing so in the WAL writer, we could stop accepting writes in some common codepaths (e.g. nodeModifyTable.c) as soon as preallocating failed but continue to accept writes in other locations (e.g. TRUNCATE, DROP TABLE). That'd still fail if you write a *very* large commit record using up all the reserve though... I personally think this isn't worth complicating the code for.
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-01-21 18:24:39 -0500, Tom Lane wrote: >> Maybe we could get some mileage out of the fact that very approximate >> techniques would be good enough. For instance, I doubt anyone would bleat >> if the system insisted on having 10MB or even 100MB of future WAL space >> always available. But I'm not sure exactly how to make use of that >> flexibility. > If we'd be more aggressive with preallocating WAL files and doing so in > the WAL writer, we could stop accepting writes in some common codepaths > (e.g. nodeModifyTable.c) as soon as preallocating failed but continue to > accept writes in other locations (e.g. TRUNCATE, DROP TABLE). That'd > still fail if you write a *very* large commit record using up all the > reserve though... > I personally think this isn't worth complicating the code for. I too have got doubts about whether a completely bulletproof solution is practical. (And as you say, even if our internal logic was bulletproof, a COW filesystem defeats all guarantees in this area anyway.) But perhaps a 99% solution would be a useful compromise. Another thing to think about is whether we couldn't put a hard limit on WAL record size somehow. Multi-megabyte WAL records are an abuse of the design anyway, when you get right down to it. So for example maybe we could split up commit records, with most of the bulky information dumped into separate records that appear before the "real commit". This would complicate replay --- in particular, if we abort the transaction after writing a few such records, how does the replayer realize that it can forget about those records? But that sounds probably surmountable. regards, tom lane
On 2014-01-21 18:59:13 -0500, Tom Lane wrote: > Another thing to think about is whether we couldn't put a hard limit on > WAL record size somehow. Multi-megabyte WAL records are an abuse of the > design anyway, when you get right down to it. So for example maybe we > could split up commit records, with most of the bulky information dumped > into separate records that appear before the "real commit". This would > complicate replay --- in particular, if we abort the transaction after > writing a few such records, how does the replayer realize that it can > forget about those records? But that sounds probably surmountable. Which cases of essentially unbounded record sizes do we currently have? The only ones that I remember right now are commit and abort records (including when wrapped in a prepared xact), containing a) the list of committing/aborting subtransactions, b) the list of files to drop, and c) cache invalidations. Hm. There's also xl_standby_locks, but that'd be easily splittable. I think removing the list of subtransactions from commit records would essentially require not truncating pg_subtrans after a restart anymore. If we'd truncate it in concord with pg_clog, we'd only need to log the subxids which haven't been explicitly assigned. Unfortunately pg_subtrans will be bigger than pg_clog, making such a scheme likely to be painful. We could relatively easily split off logging the dropped files from commit records and log them in groups afterwards; we already have several races that allow leaking files. We could do something similar for the cache invalidations, but that seems likely to get rather ugly, as we'd need to hold ProcArrayLock till the next record is read, or until a shutdown or end-of-recovery record is read. If one of the latter is found before the corresponding invalidations, we'd need to invalidate the entire syscache. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 21 January 2014 23:01, Jeff Janes <jeff.janes@gmail.com> wrote: > On Tue, Jan 21, 2014 at 9:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> Simon Riggs <simon@2ndQuadrant.com> writes: >> > On 6 June 2013 16:00, Heikki Linnakangas <hlinnakangas@vmware.com> >> > wrote: >> >> The current situation is that if you run out of disk space while >> >> writing >> >> WAL, you get a PANIC, and the server shuts down. That's awful. >> >> > I don't see we need to prevent WAL insertions when the disk fills. We >> > still have the whole of wal_buffers to use up. When that is full, we >> > will prevent further WAL insertions because we will be holding the >> > WALwritelock to clear more space. So the rest of the system will lock >> > up nicely, like we want, apart from read-only transactions. >> >> I'm not sure that "all writing transactions lock up hard" is really so >> much better than the current behavior. >> >> My preference would be that we simply start failing writes with ERRORs >> rather than PANICs. I'm not real sure ATM why this has to be a PANIC >> condition. Probably the cause is that it's being done inside a critical >> section, but could we move that? > > > My understanding is that if it runs out of buffer space while in an > XLogInsert, it will be holding one or more buffer content locks exclusively, > and unless it can complete the xlog (or scrounge up the info to return that > buffer to its previous state), it can never release that lock. There might > be other paths were it could get by with an ERROR, but if no one can write > xlog anymore, all of those paths must quickly converge to the one that > cannot simply ERROR. Agreed. You don't say it but I presume you intend to point out that such long-lived contention could easily have a knock-on effect on other read-only statements. I'm pretty sure other databases work the same way. Our choices are 1. Waiting 2. Abort transactions 3. Some kind of release-locks-then-wait-and-retry (3) is a step too far for me, even though it is easier than you say since we write WAL before changing the data block so a failure to insert WAL could just result in a temporary drop lock, sleep and retry. I would go for (1) waiting for up to checkpoint_timeout then (2), if we think that is a problem. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-01-22 01:18:36 +0100, Simon Riggs wrote: > > My understanding is that if it runs out of buffer space while in an > > XLogInsert, it will be holding one or more buffer content locks exclusively, > > and unless it can complete the xlog (or scrounge up the info to return that > > buffer to its previous state), it can never release that lock. There might > > be other paths were it could get by with an ERROR, but if no one can write > > xlog anymore, all of those paths must quickly converge to the one that > > cannot simply ERROR. > > Agreed. You don't say it but I presume you intend to point out that > such long-lived contention could easily have a knock on effect to > other read-only statements. I'm pretty sure other databases work the > same way. > > Our choice are > > 1. Waiting > 2. Abort transactions > 3. Some kind of release-locks-then-wait-and-retry > > (3) is a step too far for me, even though it is easier than you say > since we write WAL before changing the data block so a failure to > insert WAL could just result in a temporary drop lock, sleep and > retry. > > I would go for (1) waiting for up to checkpoint_timeout then (2), if > we think that is a problem. How are we supposed to wait while holding e.g. ProcArrayLock? Aborting transactions doesn't work either, that writes abort records which can get significantly large. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-01-21 18:59:13 -0500, Tom Lane wrote: >> Another thing to think about is whether we couldn't put a hard limit on >> WAL record size somehow. Multi-megabyte WAL records are an abuse of the >> design anyway, when you get right down to it. So for example maybe we >> could split up commit records, with most of the bulky information dumped >> into separate records that appear before the "real commit". This would >> complicate replay --- in particular, if we abort the transaction after >> writing a few such records, how does the replayer realize that it can >> forget about those records? But that sounds probably surmountable. > I think removing the list of subtransactions from commit records would > essentially require not truncating pg_subtrans after a restart > anymore. I'm not suggesting that we stop providing that information! I'm just saying that we perhaps don't need to store it all in one WAL record, if instead we put the onus on WAL replay to be able to reconstruct what it needs from a series of WAL records. > We could relatively easily split of logging the dropped files from > commit records and log them in groups afterwards, we already have > several races allowing to leak files. I was thinking the other way around: emit the subsidiary records before the atomic commit or abort record, indeed before we've actually committed. Part of the point is to reduce the risk that lack of WAL space would prevent us from fully committing. Also, writing those records afterwards increases the risk of a post-commit failure, which is a bad thing. Replay would then involve either accumulating the subsidiary records in memory, or being willing to go back and re-read them when the real commit or abort record is seen. regards, tom lane
Andres Freund <andres@2ndquadrant.com> writes: > How are we supposed to wait while e.g. ProcArrayLock? Aborting > transactions doesn't work either, that writes abort records which can > get signficantly large. Yeah, that's an interesting point ;-). We can't *either* commit or abort without emitting some WAL, possibly quite a bit of WAL. regards, tom lane
On Tue, Jan 21, 2014 at 3:43 PM, Andres Freund <andres@2ndquadrant.com> wrote: > I personally think this isn't worth complicating the code for. You're probably right. However, I don't see why the bar has to be very high when we're considering the trade-off between taking some emergency precaution against having a PANIC shutdown, and an assured PANIC shutdown. Heikki said somewhere upthread that he'd be happy with a solution that only catches 90% of the cases. That is probably a conservative estimate. The schemes discussed here would probably be much more effective than that in practice. Sure, you can still poke holes in them. For example, there has been some discussion of arbitrarily large commit records. However, this is the kind of thing that just isn't that relevant in the real world. I believe that in practice the majority of commit records are all about the same size. I do not believe that the two acceptable outcomes here are either that we continue to always PANIC shutdown (i.e. do nothing), or promise to never PANIC shutdown. There is likely to be a third way, which is that the probability of a PANIC shutdown is, at the macro level, reduced somewhat from the present probability of 1.0. People are not going to develop a lackadaisical attitude about running out of disk space on the pg_xlog partition if we do so. They still have plenty of incentive to make sure that that doesn't happen. -- Peter Geoghegan
On 2014-01-21 19:23:57 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-01-21 18:59:13 -0500, Tom Lane wrote: > >> Another thing to think about is whether we couldn't put a hard limit on > >> WAL record size somehow. Multi-megabyte WAL records are an abuse of the > >> design anyway, when you get right down to it. So for example maybe we > >> could split up commit records, with most of the bulky information dumped > >> into separate records that appear before the "real commit". This would > >> complicate replay --- in particular, if we abort the transaction after > >> writing a few such records, how does the replayer realize that it can > >> forget about those records? But that sounds probably surmountable. > > > I think removing the list of subtransactions from commit records would > > essentially require not truncating pg_subtrans after a restart > > anymore. > > I'm not suggesting that we stop providing that information! I'm just > saying that we perhaps don't need to store it all in one WAL record, > if instead we put the onus on WAL replay to be able to reconstruct what > it needs from a series of WAL records. That'd likely require something similar to the incomplete actions used in btrees (and until recently in more places). I think that is/was a disaster I really don't want to extend. > > We could relatively easily split of logging the dropped files from > > commit records and log them in groups afterwards, we already have > > several races allowing to leak files. > > I was thinking the other way around: emit the subsidiary records before the > atomic commit or abort record, indeed before we've actually committed. > Part of the point is to reduce the risk that lack of WAL space would > prevent us from fully committing. > Replay would then involve either accumulating the subsidiary records in > memory, or being willing to go back and re-read them when the real commit > or abort record is seen. Well, the reason I suggested doing it the other way round is that we wouldn't need to reassemble anything (outside of cache invalidations which I don't know how to handle that way) which I think is a significant increase in robustness and decrease in complexity. > Also, writing those records afterwards > increases the risk of a post-commit failure, which is a bad thing. Well, most of those could be done outside of a critical section, possibly just FATALing out. Beats PANICing. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-01-21 19:23:57 -0500, Tom Lane wrote: >> I'm not suggesting that we stop providing that information! I'm just >> saying that we perhaps don't need to store it all in one WAL record, >> if instead we put the onus on WAL replay to be able to reconstruct what >> it needs from a series of WAL records. > That'd likely require something similar to the incomplete actions used > in btrees (and until recently in more places). I think that is/was a > disaster I really don't want to extend. I don't think that's a comparable case. Incomplete actions are actions to be taken immediately, and which the replayer then has to complete somehow if it doesn't find the rest of the action in the WAL sequence. The only thing to be done with the records I'm proposing is to remember their contents (in some fashion) until it's time to apply them. If you hit end of WAL you don't really have to do anything. regards, tom lane
On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote: > On Tue, Jan 21, 2014 at 3:43 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > I personally think this isn't worth complicating the code for. > > You're probably right. However, I don't see why the bar has to be very > high when we're considering the trade-off between taking some > emergency precaution against having a PANIC shutdown, and an assured > PANIC shutdown Well, the problem is that the tradeoff would very likely include making already complex code even more complex. None of the proposals, even the one just decreasing the likelihood of a PANIC, look like they'd end up being simple implementation-wise. And that additional complexity would hurt robustness and prevent things I find much more important than this. > Heikki said somewhere upthread that he'd be happy with > a solution that only catches 90% of the cases. That is probably a > conservative estimate. The schemes discussed here would probably be > much more effective than that in practice. Sure, you can still poke > holes in them. For example, there has been some discussion of > arbitrarily large commit records. However, this is the kind of thing > just isn't that relevant in the real world. I believe that in practice > the majority of commit records are all about the same size. Yes, realistically the boundary will be relatively low, but I don't think that that means that we can disregard issues like the possibility that a record might be bigger than wal_buffers. Not because it'd allow theoretical issues, but because it rules out several tempting approaches like e.g. extending the in-memory reservation scheme of Heikki's scalability work to handle this. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-01-21 19:45:19 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-01-21 19:23:57 -0500, Tom Lane wrote: > >> I'm not suggesting that we stop providing that information! I'm just > >> saying that we perhaps don't need to store it all in one WAL record, > >> if instead we put the onus on WAL replay to be able to reconstruct what > >> it needs from a series of WAL records. > > > That'd likely require something similar to the incomplete actions used > > in btrees (and until recently in more places). I think that is/was a > > disaster I really don't want to extend. > > I don't think that's a comparable case. Incomplete actions are actions > to be taken immediately, and which the replayer then has to complete > somehow if it doesn't find the rest of the action in the WAL sequence. > The only thing to be done with the records I'm proposing is to remember > their contents (in some fashion) until it's time to apply them. If you > hit end of WAL you don't really have to do anything. Would that work for the promotion case as well? Afair there's the assumption that everything >= TransactionXmin can be looked up in pg_subtrans or in the procarray - which afaics wouldn't be the case with your scheme? And TransactionXmin could very well be below such an "incomplete commit"'s xids afaics. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-01-21 19:45:19 -0500, Tom Lane wrote: >> I don't think that's a comparable case. Incomplete actions are actions >> to be taken immediately, and which the replayer then has to complete >> somehow if it doesn't find the rest of the action in the WAL sequence. >> The only thing to be done with the records I'm proposing is to remember >> their contents (in some fashion) until it's time to apply them. If you >> hit end of WAL you don't really have to do anything. > Would that work for the promotion case as well? Afair there's the > assumption that everything >= TransactionXmin can be looked up in > pg_subtrans or in the procarray - which afaics wouldn't be the case with > your scheme? And TransactionXmin could very well be below such an > "incomplete commit"'s xids afaics. Uh, what? The behavior I'm talking about is *exactly the same* as what happens now. The only change is that the data sent to the WAL file is laid out a bit differently, and the replay logic has to work harder to reassemble it before it can apply the commit or abort action. If anything outside replay can detect a difference at all, that would be a bug. Once again: the replayer is not supposed to act immediately on the subsidiary records. It's just supposed to remember their contents so it can reattach them to the eventual commit or abort record, and then do what it does today to replay the commit or abort. regards, tom lane
On 22 January 2014 01:23, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@2ndquadrant.com> writes: >> On 2014-01-21 18:59:13 -0500, Tom Lane wrote: >>> Another thing to think about is whether we couldn't put a hard limit on >>> WAL record size somehow. Multi-megabyte WAL records are an abuse of the >>> design anyway, when you get right down to it. So for example maybe we >>> could split up commit records, with most of the bulky information dumped >>> into separate records that appear before the "real commit". This would >>> complicate replay --- in particular, if we abort the transaction after >>> writing a few such records, how does the replayer realize that it can >>> forget about those records? But that sounds probably surmountable. > >> I think removing the list of subtransactions from commit records would >> essentially require not truncating pg_subtrans after a restart >> anymore. > > I'm not suggesting that we stop providing that information! I'm just > saying that we perhaps don't need to store it all in one WAL record, > if instead we put the onus on WAL replay to be able to reconstruct what > it needs from a series of WAL records. I think removing excess subxacts from commit and abort records could be a good idea. Not sure anybody considered it before. We already emit xid allocation records when we overflow, so we already know which subxids have been allocated. We also issue subxact abort records for anything that aborted. So in theory we should be able to reconstruct an arbitrarily long chain of subxacts. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 22 January 2014 01:30, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@2ndquadrant.com> writes: >> How are we supposed to wait while e.g. ProcArrayLock? Aborting >> transactions doesn't work either, that writes abort records which can >> get signficantly large. > > Yeah, that's an interesting point ;-). We can't *either* commit or abort > without emitting some WAL, possibly quite a bit of WAL. Right, which is why we don't need to lock ProcArrayLock. As soon as we try to write a commit or abort it goes through the normal XLogInsert route. As soon as wal_buffers fills WALWriteLock will be held continuously until we free some space. Since ProcArrayLock isn't held, read-only users can continue. As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Note that my proposal would not require aborting any in-flight transactions; they would continue to completion as soon as space is cleared. My proposal again, so we can review how simple it was... 1. Allow a checkpoint to complete by updating the control file, rather than writing WAL. The control file is already there and is fixed size, so we can be more confident it will accept the update. We could add a new checkpoint mode for that, or we could do that always for shutdown checkpoints (my preferred option). EFFECT: Since a checkpoint can now be called and complete without writing WAL, we are able to write dirty buffers and then clean out WAL files to reduce space. 2. If we fill the disk when writing WAL we do not PANIC, we signal the checkpointer process to perform an immediate checkpoint and then wait for its completion. EFFECT: Since we are holding WALWriteLock, all other write users will soon either wait for that lock directly or indirectly. Both of those points are relatively straightforward to implement and this proposal minimises seldom-tested code paths. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
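For illustration, the error path in point 2 might look roughly like this; RequestCheckpoint() and the CHECKPOINT_IMMEDIATE/FORCE/WAIT flags exist today, but CHECKPOINT_CONTROLFILE_ONLY and this wrapper are hypothetical, and partial writes are ignored for brevity:

    #include <errno.h>
    #include <unistd.h>

    /*
     * Sketch only: on ENOSPC while writing WAL, ask the checkpointer for an
     * immediate checkpoint that completes by rewriting the control file
     * (making all old segments removable), retry once, and PANIC only if
     * the retry also fails.
     */
    static void
    XLogWriteOrRecover(int fd, char *from, Size nbytes)
    {
        if (write(fd, from, nbytes) == (ssize_t) nbytes)
            return;

        if (errno == ENOSPC)
        {
            RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
                              CHECKPOINT_WAIT | CHECKPOINT_CONTROLFILE_ONLY);

            /* old segments can now be recycled; retry the write once */
            if (write(fd, from, nbytes) == (ssize_t) nbytes)
                return;
        }

        ereport(PANIC,
                (errcode_for_file_access(),
                 errmsg("could not write to log file: %m")));
    }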
On 01/22/2014 02:10 PM, Simon Riggs wrote: > As Jeff points out, the blocks being modified would be locked until > space is freed up. Which could make other users wait. The code > required to avoid that wait would be complex and not worth any > overhead. Checkpoint also acquires the content lock of every dirty page in the buffer cache... - Heikki
On 22 January 2014 13:14, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 01/22/2014 02:10 PM, Simon Riggs wrote: >> >> As Jeff points out, the blocks being modified would be locked until >> space is freed up. Which could make other users wait. The code >> required to avoid that wait would be complex and not worth any >> overhead. > > > Checkpoint also acquires the content lock of every dirty page in the buffer > cache... Good point. We would need to take special action for any dirty blocks that we cannot obtain content lock for, which should be a smallish list, to be dealt with right at the end of the checkpoint writes. We know that anyone waiting for the WAL lock will not be modifying the block and so we can copy it without obtaining the lock. We can inspect the lock queue on the WAL locks and then see which buffers we can skip the lock for. The alternative of adding a check for WAL space to the normal path is a non-starter, IMHO. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Well, PANIC is certainly bad, but what I'm suggesting is that we > just focus on getting that down to ERROR and not worry about > trying to get out of the disk-shortage situation automatically. > Nor do I believe that it's such a good idea to have the database > freeze up until space appears rather than reporting errors. Dusting off my DBA hat for a moment, I would say that I would be happy if each process either generated an ERROR or went into a wait state. They would not all need to do the same thing; whichever is easier in each process's context would do. It would be nice if a process which went into a long wait state until disk space became available would issue a WARNING about that, where that is possible. I feel that anyone running a database that matters to them should be monitoring disk space and getting an alert from the monitoring system in advance of actually running out of disk space; but we all know that poorly managed systems are out there. To accommodate them we don't want to add risk or performance hits for those who do manage their systems well. The attempt to make more space by deleting WAL files scares me a bit. The dynamics of that seem like they could be counterproductive if the pg_xlog directory is on the same filesystem as everything else and there is a rapidly growing file. Every time you cleaned up, the fast-growing file would eat more of the space pg_xlog had been using, until perhaps it had no space to keep even a single segment, no? And how would that work with a situation where the archive_command was failing, causing a build-up of WAL files? It just seems like too much mechanism and incomplete coverage of the problem space. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-01-21 21:42:19 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-01-21 19:45:19 -0500, Tom Lane wrote: > >> I don't think that's a comparable case. Incomplete actions are actions > >> to be taken immediately, and which the replayer then has to complete > >> somehow if it doesn't find the rest of the action in the WAL sequence. > >> The only thing to be done with the records I'm proposing is to remember > >> their contents (in some fashion) until it's time to apply them. If you > >> hit end of WAL you don't really have to do anything. > > > Would that work for the promotion case as well? Afair there's the > > assumption that everything >= TransactionXmin can be looked up in > > pg_subtrans or in the procarray - which afaics wouldn't be the case with > > your scheme? And TransactionXmin could very well be below such an > > "incomplete commit"'s xids afaics. > > Uh, what? The behavior I'm talking about is *exactly the same* > as what happens now. The only change is that the data sent to the > WAL file is laid out a bit differently, and the replay logic has > to work harder to reassemble it before it can apply the commit or > abort action. If anything outside replay can detect a difference > at all, that would be a bug. > > Once again: the replayer is not supposed to act immediately on the > subsidiary records. It's just supposed to remember their contents > so it can reattach them to the eventual commit or abort record, > and then do what it does today to replay the commit or abort. I (think) I get what you want to do, but splitting the record like that nonetheless opens up behaviour that previously wasn't there. Imagine we promote in between replaying the list of subxacts (only storing it in memory) and the main commit record. Either we have something like the incomplete action stuff doing something with the in-memory data, or we are in a situation where there can be xids bigger than TransactionXmin that are not in pg_subtrans and not in the procarray. Which I don't think exists today since we either read the commit record in its entirety or not. We'd also need to use the MyPgXact->delayChkpt mechanism to prevent checkpoints from occurring in between those records, but we do that already, so that seems rather uncontroversial. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-01-21 21:42:19 -0500, Tom Lane wrote: >> Uh, what? The behavior I'm talking about is *exactly the same* >> as what happens now. The only change is that the data sent to the >> WAL file is laid out a bit differently, and the replay logic has >> to work harder to reassemble it before it can apply the commit or >> abort action. If anything outside replay can detect a difference >> at all, that would be a bug. >> >> Once again: the replayer is not supposed to act immediately on the >> subsidiary records. It's just supposed to remember their contents >> so it can reattach them to the eventual commit or abort record, >> and then do what it does today to replay the commit or abort. > I (think) I get what you want to do, but splitting the record like that > nonetheless opens up behaviour that previously wasn't there. Obviously we are not on the same page yet. In my vision, the WAL writer is dumping the same data it would have dumped, though in a different layout, and it's working from process-local state same as it does now. The WAL replayer is taking the same actions at the same time using the same data as it does now. There is no "behavior that wasn't there", unless you're claiming that there are *existing* race conditions in commit/abort WAL processing. The only thing that seems mildly squishy about this is that it's not clear how long the WAL replayer ought to hang onto subsidiary records for a commit or abort it hasn't seen yet. In the case where we change our minds and abort a transaction after already having written some subsidiary records for the commit, it's not really a problem; the replayer can throw away any saved data related to the commit of xid N as soon as it sees an abort for xid N. However, what if the session crashes and never writes either a final commit or abort record? I think we can deal with this fairly easily though, because that case should end with a crash recovery cycle writing a shutdown checkpoint to the log (we do do that no?). So the rule can be "discard any unmatched subsidiary records if you see a shutdown checkpoint". This makes sense on its own terms since there are surely no active transactions at that point in the log. regards, tom lane
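For illustration, a sketch of the replay-side bookkeeping described above; all names here are invented for the example, and real code would hang off the existing xact redo routines. Subsidiary records are stashed per-xid until the matching commit or abort arrives; an abort for that xid, or a shutdown checkpoint, simply discards the stash:

    #include "postgres.h"
    #include "nodes/pg_list.h"
    #include "utils/hsearch.h"

    /* Hypothetical per-xid stash of not-yet-applied subsidiary data. */
    typedef struct PendingCommit
    {
        TransactionId xid;          /* hash key */
        List       *subxids;
        List       *files_to_drop;
        List       *invalidations;
    } PendingCommit;

    static HTAB *pending_commits = NULL;

    static void
    init_pending_commits(void)
    {
        HASHCTL     ctl;

        MemSet(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(TransactionId);
        ctl.entrysize = sizeof(PendingCommit);
        ctl.hash = tag_hash;
        pending_commits = hash_create("pending commit data", 64, &ctl,
                                      HASH_ELEM | HASH_FUNCTION);
    }

    /* Replay of a hypothetical "commit subxids" subsidiary record. */
    static void
    stash_subxids(TransactionId xid, List *subxids)
    {
        bool        found;
        PendingCommit *pc = hash_search(pending_commits, &xid,
                                        HASH_ENTER, &found);

        if (!found)
        {
            pc->subxids = NIL;
            pc->files_to_drop = NIL;
            pc->invalidations = NIL;
        }
        pc->subxids = list_concat(pc->subxids, subxids);
    }

    /*
     * Called after replaying the real commit/abort record for xid, or for
     * every stashed xid when a shutdown checkpoint is seen.
     */
    static void
    forget_pending(TransactionId xid)
    {
        hash_search(pending_commits, &xid, HASH_REMOVE, NULL);
    }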
On 22 January 2014 14:25, Simon Riggs <simon@2ndquadrant.com> wrote: > On 22 January 2014 13:14, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> On 01/22/2014 02:10 PM, Simon Riggs wrote: >>> >>> As Jeff points out, the blocks being modified would be locked until >>> space is freed up. Which could make other users wait. The code >>> required to avoid that wait would be complex and not worth any >>> overhead. >> >> >> Checkpoint also acquires the content lock of every dirty page in the buffer >> cache... > > Good point. We would need to take special action for any dirty blocks > that we cannot obtain content lock for, which should be a smallish > list, to be dealt with right at the end of the checkpoint writes. > > We know that anyone waiting for the WAL lock will not be modifying the > block and so we can copy it without obtaining the lock. We can inspect > the lock queue on the WAL locks and then see which buffers we can skip > the lock for. This could be handled similarly to the way we handle buffer pin deadlocks in Hot Standby. So I don't see any blockers from that angle. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 1/21/14, 6:46 PM, Andres Freund wrote:
> On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote:
>> On Tue, Jan 21, 2014 at 3:43 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> I personally think this isn't worth complicating the code for.
>>
>> You're probably right. However, I don't see why the bar has to be very
>> high when we're considering the trade-off between taking some
>> emergency precaution against having a PANIC shutdown, and an assured
>> PANIC shutdown
>
> Well, the problem is that the tradeoff would very likely include making
> already complex code even more complex. None of the proposals, even the
> one just decreasing the likelihood of a PANIC, like like they'd end up
> being simple implementation-wise.
> And that additional complexity would hurt robustness and prevent things
> I find much more important than this.

If we're not looking for perfection, what's wrong with Peter's idea of a
ballast file? Presumably the check to see if that file still exists would
be cheap, so we can do that before entering the appropriate critical
section.

There's still a small chance that we'd end up panicking, but it's better
than today. I'd argue that even if it doesn't work for CoW filesystems
it'd still be a win.

--
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net
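A minimal sketch of the ballast idea, assuming a preallocated file on the same filesystem as pg_xlog; the file name and both helpers are invented, and this is not PostgreSQL code.

#include <sys/stat.h>
#include <stdbool.h>
#include <unistd.h>

#define BALLAST_PATH "pg_xlog_ballast"      /* hypothetical file name */

/* Called before entering a critical section that will write WAL. */
static bool
ballast_still_present(void)
{
    struct stat st;

    return stat(BALLAST_PATH, &st) == 0;
}

/* Called when WAL space runs out: free up room so a checkpoint can finish. */
static bool
release_ballast(void)
{
    return unlink(BALLAST_PATH) == 0;
}

The attraction is that a backend seeing ballast_still_present() return false could error out before taking any buffer locks, rather than hitting the out-of-space condition inside a critical section. Andres's objection below is about the cost of that stat() on every WAL-writing operation.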
On 2014-01-22 18:19:25 -0600, Jim Nasby wrote:
> On 1/21/14, 6:46 PM, Andres Freund wrote:
>> On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote:
>>> On Tue, Jan 21, 2014 at 3:43 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>>> I personally think this isn't worth complicating the code for.
>>>
>>> You're probably right. However, I don't see why the bar has to be very
>>> high when we're considering the trade-off between taking some
>>> emergency precaution against having a PANIC shutdown, and an assured
>>> PANIC shutdown
>> Well, the problem is that the tradeoff would very likely include making
>> already complex code even more complex. None of the proposals, even the
>> one just decreasing the likelihood of a PANIC, like like they'd end up
>> being simple implementation-wise.
>> And that additional complexity would hurt robustness and prevent things
>> I find much more important than this.
>
> If we're not looking for perfection, what's wrong with Peter's idea of
> a ballast file? Presumably the check to see if that file still exists
> would be cheap, so we can do that before entering the appropriate
> critical section.

That'd be noticeably expensive. Opening/stat'ing a file isn't cheap,
especially if you do it via filename and not via fd, which is what we'd
have to do. I am pretty sure it would be noticeable in single-client
workloads, and it damned sure will be noticeable on busy multi-socket
workloads.

I still think doing the checks in the WAL writer is the best bet,
setting a flag in shared memory that can then cheaply be tested. When
set, it would cause any further action that would write xlog to error
out, unless that action is already in progress.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
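A sketch of that shape, with all names invented (this is not the actual WAL writer code): the background process does the expensive measurement and publishes a single boolean, while the per-record path only reads it.

#include <stdbool.h>

/* hypothetical: returns bytes of preallocated-but-unused WAL space */
extern long FreeWalSpace(void);
#define WAL_SPACE_RESERVE   (16L * 1024 * 1024)     /* example threshold */

typedef struct WalSpaceShared
{
    volatile bool   wal_space_low;  /* written by WAL writer, read by backends */
} WalSpaceShared;

static WalSpaceShared *walSpace;    /* would live in shared memory */

/* WAL writer side: run once per wakeup, pays the measurement cost. */
static void
WalWriterCheckSpace(void)
{
    walSpace->wal_space_low = (FreeWalSpace() < WAL_SPACE_RESERVE);
}

/* Backend side: called before any action that will write xlog. */
static bool
WalSpaceAvailableForInsert(bool already_in_progress)
{
    /*
     * Callers that get 'false' would raise an ERROR here, before taking
     * any buffer locks or entering a critical section; actions already
     * in progress are allowed to finish so they can release their locks.
     */
    return already_in_progress || !walSpace->wal_space_low;
}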
On 23 January 2014 01:19, Jim Nasby <jim@nasby.net> wrote:
> On 1/21/14, 6:46 PM, Andres Freund wrote:
>> On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote:
>>> On Tue, Jan 21, 2014 at 3:43 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>>> I personally think this isn't worth complicating the code for.
>>>
>>> You're probably right. However, I don't see why the bar has to be very
>>> high when we're considering the trade-off between taking some
>>> emergency precaution against having a PANIC shutdown, and an assured
>>> PANIC shutdown
>>
>> Well, the problem is that the tradeoff would very likely include making
>> already complex code even more complex. None of the proposals, even the
>> one just decreasing the likelihood of a PANIC, like like they'd end up
>> being simple implementation-wise.
>> And that additional complexity would hurt robustness and prevent things
>> I find much more important than this.
>
> If we're not looking for perfection, what's wrong with Peter's idea of a
> ballast file? Presumably the check to see if that file still exists would
> be cheap, so we can do that before entering the appropriate critical
> section.
>
> There's still a small chance that we'd end up panicking, but it's better
> than today. I'd argue that even if it doesn't work for CoW filesystems
> it'd still be a win.

I grant that it does sound simple enough for a partial stopgap. My
concern is that it provides only a short delay before the eventual
disk-full situation, which it doesn't actually prevent.

IMHO the main issue now is how we clear down old WAL files. We need to
perform a checkpoint to do that - and, as has been pointed out in
relation to my proposal, we cannot complete that because of locks that
will be held for some time when we do eventually lock up. That issue is
not solved by having one or more ballast files.

IMHO we need to resolve the deadlock inherent in the
disk-full/WAL-lock-up/checkpoint situation. My view is that it can be
solved in a similar way to the way the buffer pin deadlock was resolved
for Hot Standby.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 2014-01-23 13:56:49 +0100, Simon Riggs wrote:
> IMHO we need to resolve the deadlock inherent in the
> disk-full/WAL-lock-up/checkpoint situation. My view is that it can be
> solved in a similar way to the way the buffer pin deadlock was resolved
> for Hot Standby.

I don't think that approach works here. We're not talking about mere
buffer pins but the big bad exclusively locked buffer which is held by
a backend in a critical section. Killing such a backend costs you a
PANIC.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services