Thread: Hard limit on WAL space used (because PANIC sucks)
In the "Redesigning checkpoint_segments" thread, many people opined that there should be a hard limit on the amount of disk space used for WAL: http://www.postgresql.org/message-id/CA+TgmoaOkgZb5YsmQeMg8ZVqWMtR=6S4-PPd+6jiy4OQ78ihUA@mail.gmail.com. I'm starting a new thread on that, because that's mostly orthogonal to redesigning checkpoint_segments. The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can try to avoid that by checkpointing early enough, so that we can remove old WAL segments to make room for new ones before you run out, but unless we somehow throttle or stop new WAL insertions, it's always going to be possible to use up all disk space. A typical scenario where that happens is when archive_command fails for some reason; even a checkpoint can't remove old, unarchived segments in that case. But it can happen even without WAL archiving. I've seen a case, where it was even worse than a PANIC and shutdown. pg_xlog was on a separate partition that had nothing else on it. The partition filled up, and the system shut down with a PANIC. Because there was no space left, it could not even write the checkpoint after recovery, and thus refused to start up again. There was nothing else on the partition that you could delete to make space. The only recourse would've been to add more disk space to the partition (impossible), or manually delete an old WAL file that was not needed to recover from the latest checkpoint (scary). Fortunately this was a test system, so we just deleted everything. So we need to somehow stop new WAL insertions from happening, before it's too late. Peter Geoghegan suggested one method here: http://www.postgresql.org/message-id/flat/CAM3SWZQcyNxvPaskr-pxm8DeqH7_qevW7uqbhPCsg1FpSxKpoQ@mail.gmail.com. I don't think that exact proposal is going to work very well; throttling WAL flushing by holding WALWriteLock in WAL writer can have knock-on effects on the whole system, as Robert Haas mentioned. Also, it'd still be possible to run out of space, just more difficult. To make sure there is enough room for the checkpoint to finish, other WAL insertions have to stop some time before you completely run out of disk space. The question is how to do that. A naive idea is to check if there's enough preallocated WAL space, just before inserting the WAL record. However, it's too late to check that in XLogInsert; once you get there, you're already holding exclusive locks on data pages, and you are in a critical section so you can't back out. At that point, you have to write the WAL record quickly, or the whole system will suffer. So we need to act earlier. A more workable idea is to sprinkle checks in higher-level code, before you hold any critical locks, to check that there is enough preallocated WAL. Like, at the beginning of heap_insert, heap_update, etc., and all similar indexam entry points. I propose that we maintain a WAL reservation system in shared memory. First of all, keep track of how much preallocated WAL there is left (and try to create more if needed). Also keep track of a different number: the amount of WAL pre-reserved for future insertions. Before entering the critical section, increase the reserved number with a conservative estimate (ie. high enough) of how much WAL space you need, and check that there is still enough preallocated WAL to satisfy all the reservations. If not, throw an error or sleep until there is. 
After you're done with the insertion, release the reservation by decreasing the number again.

A shared reservation counter like that could become a point of contention. One optimization is to keep a constant reservation of, say, 32 KB for each backend. That's enough for most operations. Change the logic so that you check if you've exceeded the reserved amount of space *after* writing the WAL record, while you're holding WALInsertLock anyway. If you do go over the limit, set a flag in backend-private memory indicating that the *next* time you're about to enter a critical section where you will write a WAL record, you check again if more space has been made available.

- Heikki
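For illustration, a minimal sketch of the reservation scheme described above. None of these names exist in the PostgreSQL tree; the struct, the functions, and the error-instead-of-sleep policy are all assumptions, not a proposed patch.

#include "postgres.h"
#include "storage/spin.h"

/* Hypothetical shared-memory state for the WAL reservation scheme. */
typedef struct XLogSpaceCtlData
{
	slock_t		mutex;			/* protects the two counters below */
	uint64		prealloc_bytes;	/* preallocated WAL space still available */
	uint64		reserved_bytes;	/* space pre-reserved by pending insertions */
} XLogSpaceCtlData;

static XLogSpaceCtlData *XLogSpaceCtl;

/*
 * Call before entering the critical section, with a conservative (high)
 * estimate of the WAL space the upcoming record might need.  We are not in
 * a critical section yet, so throwing an error is safe; sleeping and
 * retrying would be the alternative policy mentioned above.
 */
static void
ReserveXLogSpace(uint64 estimate)
{
	bool		ok;

	SpinLockAcquire(&XLogSpaceCtl->mutex);
	ok = (XLogSpaceCtl->reserved_bytes + estimate <= XLogSpaceCtl->prealloc_bytes);
	if (ok)
		XLogSpaceCtl->reserved_bytes += estimate;
	SpinLockRelease(&XLogSpaceCtl->mutex);

	if (!ok)
		ereport(ERROR,
				(errcode(ERRCODE_DISK_FULL),
				 errmsg("could not reserve WAL space")));
}

/* Call after the insertion is done, to release the reservation. */
static void
ReleaseXLogSpace(uint64 estimate)
{
	SpinLockAcquire(&XLogSpaceCtl->mutex);
	XLogSpaceCtl->reserved_bytes -= estimate;
	SpinLockRelease(&XLogSpaceCtl->mutex);
}

Callers like heap_insert() would then call ReserveXLogSpace() before taking buffer locks and ReleaseXLogSpace() after the record has been inserted; the per-backend constant reservation described above would avoid touching the shared counter on every record.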
On 2013-06-06 17:00:30 +0300, Heikki Linnakangas wrote: > A more workable idea is to sprinkle checks in higher-level code, before you > hold any critical locks, to check that there is enough preallocated WAL. > Like, at the beginning of heap_insert, heap_update, etc., and all similar > indexam entry points. I propose that we maintain a WAL reservation system in > shared memory. I am rather doubtful that this won't end up with a bunch of complex code that won't prevent the situation in all circumstances but which will provide bugs/performance problems for some time. Obviously that's just gut feeling since I haven't seen the code... I am much more excited about getting the soft limit case right and then seeing how many problems remain in reality. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 06.06.2013 17:17, Andres Freund wrote: > On 2013-06-06 17:00:30 +0300, Heikki Linnakangas wrote: >> A more workable idea is to sprinkle checks in higher-level code, before you >> hold any critical locks, to check that there is enough preallocated WAL. >> Like, at the beginning of heap_insert, heap_update, etc., and all similar >> indexam entry points. I propose that we maintain a WAL reservation system in >> shared memory. > > I am rather doubtful that this won't end up with a bunch of complex code > that won't prevent the situation in all circumstances but which will > provide bugs/performance problems for some time. > Obviously that's just gut feeling since I haven't see the code... I also have a feeling that we'll likely miss some corner cases in the first cut, so that you can still run out of disk space if you try hard enough / are unlucky. But I think it would still be a big improvement if it only catches, say, 90% of the cases. I think it can be made fairly robust otherwise, and the performance impact should be pretty easy to measure with e.g. pgbench. - Heikki
* Heikki Linnakangas wrote: > The current situation is that if you run out of disk space while writing > WAL, you get a PANIC, and the server shuts down. That's awful. We can > So we need to somehow stop new WAL insertions from happening, before > it's too late. > A naive idea is to check if there's enough preallocated WAL space, just > before inserting the WAL record. However, it's too late to check that in There is a database engine, Microsoft's "Jet Blue" aka the Extensible Storage Engine, that just keeps some preallocated log files around, specifically so it can get consistent and halt cleanly if it runs out of disk space. In other words, the idea is not to check over and over again that there is enough already-reserved WAL space, but to make sure there always is by having a preallocated segment that is never used outside a disk space emergency. -- Christian
On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote: > * Heikki Linnakangas wrote: > > >The current situation is that if you run out of disk space while writing > >WAL, you get a PANIC, and the server shuts down. That's awful. We can > > >So we need to somehow stop new WAL insertions from happening, before > >it's too late. > > >A naive idea is to check if there's enough preallocated WAL space, just > >before inserting the WAL record. However, it's too late to check that in > > There is a database engine, Microsoft's "Jet Blue" aka the Extensible > Storage Engine, that just keeps some preallocated log files around, > specifically so it can get consistent and halt cleanly if it runs out of > disk space. > > In other words, the idea is not to check over and over again that there is > enough already-reserved WAL space, but to make sure there always is by > having a preallocated segment that is never used outside a disk space > emergency. That's not a bad technique. I wonder how reliable it would be in postgres. Do all filesystems allow a rename() to succeed if there isn't actually any space left? E.g. on btrfs I wouldn't be sure. We need to rename because WAL files need to be named after the LSN and timeline ID... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 6, 2013 at 10:38 PM, Andres Freund <andres@2ndquadrant.com> wrote: > That's not a bad technique. I wonder how reliable it would be in > postgres. Do all filesystems allow a rename() to succeed if there isn't > actually any space left? E.g. on btrfs I wouldn't be sure. We need to > rename because WAL files need to be named after the LSN timelineid... I suppose we could just always do the rename at the same time as setting up the current log file. That is, when we start wal log x also set up wal file x+1 at that time. This isn't actually guaranteed to be enough btw. It's possible that the record we're actively about to write will require all of both those files... But that should be very unlikely. -- greg
Let's talk failure cases. There's actually three potential failure cases here:

- One Volume: WAL is on the same volume as PGDATA, and that volume is completely out of space.

- XLog Partition: WAL is on its own partition/volume, and fills it up.

- Archiving: archiving is failing or too slow, causing the disk to fill up with waiting log segments.

I'll argue that these three cases need to be dealt with in three different ways, and no single solution is going to work for all three.

Archiving
---------

In some ways, this is the simplest case. Really, we just need a way to know when the available WAL space has become 90% full, and abort archiving at that stage. Once we stop attempting to archive, we can clean up the unneeded log segments.

What we need is a better way for the DBA to find out that archiving is falling behind when it first starts to fall behind. Tailing the log and examining the rather cryptic error messages we give out isn't very effective.

XLog Partition
--------------

As Heikki pointed out, a full dedicated WAL drive is hard to fix once it gets full, since there's nothing you can safely delete to clear space, even enough for a checkpoint record.

On the other hand, it should be easy to prevent full status; we could simply force a non-spread checkpoint whenever the available WAL space gets 90% full. We'd also probably want to be prepared to switch to a read-only mode if we get full enough that there's only room for the checkpoint records.

One Volume
----------

This is the most complicated case, because we wouldn't necessarily run out of space because of WAL using it up. Anything could cause us to run out of disk space, including activity logs, swapping, pgsql_tmp files, database growth, or some other process which writes files. This means that the DBA getting out of disk-full manually is in some ways easier; there's usually stuff she can delete. However, it's much harder -- maybe impossible -- for PostgreSQL to prevent this kind of space outage.

There should be things we can do to make it easier for the DBA to troubleshoot this, but I'm not sure what. We could use a hard limit for WAL to prevent WAL from contributing to out-of-space, but that'll only prevent a minority of cases.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Thu, Jun 6, 2013 at 4:28 PM, Christian Ullrich <chris@chrullrich.net> wrote: > * Heikki Linnakangas wrote: > >> The current situation is that if you run out of disk space while writing >> WAL, you get a PANIC, and the server shuts down. That's awful. We can > > >> So we need to somehow stop new WAL insertions from happening, before >> it's too late. > > >> A naive idea is to check if there's enough preallocated WAL space, just >> before inserting the WAL record. However, it's too late to check that in > > > There is a database engine, Microsoft's "Jet Blue" aka the Extensible > Storage Engine, that just keeps some preallocated log files around, > specifically so it can get consistent and halt cleanly if it runs out of > disk space. > fwiw, informix (at least until IDS 2000, not sure after that) had the same thing. only this was a parameter to set, and bad things happened if you forgot about it :D -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación Phone: +593 4 5107566 Cell: +593 987171157
On Thursday, June 6, 2013, Josh Berkus wrote:
Let's talk failure cases.
There's actually three potential failure cases here:
- One Volume: WAL is on the same volume as PGDATA, and that volume is
completely out of space.
- XLog Partition: WAL is on its own partition/volume, and fills it up.
- Archiving: archiving is failing or too slow, causing the disk to fill
up with waiting log segments.
I'll argue that these three cases need to be dealt with in three
different ways, and no single solution is going to work for all three.
Archiving
---------
In some ways, this is the simplest case. Really, we just need a way to
know when the available WAL space has become 90% full, and abort
archiving at that stage. Once we stop attempting to archive, we can
clean up the unneeded log segments.
I would oppose that as the solution, either an unconditional one, or configurable with it as the default. Those segments are not unneeded. I need them. That is why I set up archiving in the first place. If you need to shut down the database rather than violate my established retention policy, then shut down the database.
What we need is a better way for the DBA to find out that archiving is
falling behind when it first starts to fall behind. Tailing the log and
examining the rather cryptic error messages we give out isn't very
effective.
The archive command can be made a shell script (or for that matter a compiled program) which can do anything it wants upon failure, including emailing people. Of course maybe whatever causes the archive to fail will also cause the delivery of the message to fail, but I don't see a real solution to this that doesn't start down an infinite regress. If it is not failing outright, but merely falling behind, then I don't really know how to go about detecting that, either in archive_command, or through tailing the PostgreSQL log. I guess archive_command, each time it is invoked, could count the files in the pg_xlog directory and warn if it thinks the number is unreasonable.
xLog Partition
--------------
As Heikki pointed out, a full dedicated WAL drive is hard to fix once
it gets full, since there's nothing you can safely delete to clear
space, even enough for a checkpoint record.
Although the DBA probably wouldn't know it from reading the manual, it is almost always safe to delete the oldest WAL file (after copying it to a different partition just in case something goes wrong--it should be possible to do that, since if WAL is on its own partition it is hard to imagine you can't scrounge up 16MB on a different one), as PostgreSQL keeps two complete checkpoints' worth of WAL around. I think the only reason you would not be able to recover after removing the oldest file is if the controldata file is damaged such that the most recent checkpoint record cannot be found and so it has to fall back to the previous one. Or at least, this is my understanding.
On the other hand, it should be easy to prevent full status; we could
simply force a non-spread checkpoint whenever the available WAL space
gets 90% full. We'd also probably want to be prepared to switch to a
read-only mode if we get full enough that there's only room for the
checkpoint records.
I think that last sentence could be applied without modification to the "one volume" case as well.
So what would that look like? Before accepting a (non-checkpoint) WAL Insert that fills up the current segment to a high enough level that a checkpoint record will no longer fit, it must first verify that a recycled file exists, or if not it must successfully init a new file.
If that init fails, then it must do what? Signal for a checkpoint, release its locks, and then ERROR out? That would be better than a PANIC, but can it do better? Enter a retry loop so that once the checkpoint has finished, and assuming it has freed up enough WAL files for recycling/removal, it can try the original WAL Insert again?
Cheers,
Jeff
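For illustration, a rough sketch of the retry loop Jeff describes. TryPrepareNextWALSegment() is a hypothetical stand-in for "verify a recycled segment exists or init a new one", the retry policy is made up, and this assumes the caller is not yet holding buffer locks or inside a critical section. RequestCheckpoint() and its flags do exist in the tree.

#include "postgres.h"
#include "access/xlog.h"
#include "postmaster/bgwriter.h"

extern bool TryPrepareNextWALSegment(void);	/* hypothetical */

static void
EnsureRoomForCheckpointRecord(void)
{
	int			attempts;

	for (attempts = 0; attempts < 10; attempts++)
	{
		if (TryPrepareNextWALSegment())
			return;

		/*
		 * Out of space: request an immediate checkpoint, which may free old
		 * segments for recycling, then try again.  This has to happen before
		 * any buffer locks are taken or a critical section is entered.
		 */
		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);
	}

	ereport(ERROR,
			(errcode(ERRCODE_DISK_FULL),
			 errmsg("could not create a new WAL segment: no space left on device")));
}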
On 06/06/2013 09:30 PM, Jeff Janes wrote: > Archiving > --------- > > In some ways, this is the simplest case. Really, we just need a way to > know when the available WAL space has become 90% full, and abort > archiving at that stage. Once we stop attempting to archive, we can > clean up the unneeded log segments. > > > I would oppose that as the solution, either an unconditional one, or > configurable with is it as the default. Those segments are not > unneeded. I need them. That is why I set up archiving in the first > place. If you need to shut down the database rather than violate my > established retention policy, then shut down the database. Agreed and I would oppose it even as configurable. We set up the archiving for a reason. I do think it might be useful to be able to store archiving logs as well as wal_keep_segments logs in a different location than pg_xlog. > > What we need is a better way for the DBA to find out that archiving is > falling behind when it first starts to fall behind. Tailing the log and > examining the rather cryptic error messages we give out isn't very > effective. > > > The archive command can be made a shell script (or that matter a > compiled program) which can do anything it wants upon failure, including > emailing people. Yep, that is what PITRTools does. You can make it do whatever you want. JD
On Thu, Jun 6, 2013 at 9:30 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > I would oppose that as the solution, either an unconditional one, or > configurable with is it as the default. Those segments are not unneeded. I > need them. That is why I set up archiving in the first place. If you need > to shut down the database rather than violate my established retention > policy, then shut down the database. Same boat. My archives are the real storage. The disks are write-back caching. That's because the storage of my archives is probably three to five orders of magnitude more reliable.
On 07.06.2013 00:38, Andres Freund wrote: > On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote: >> * Heikki Linnakangas wrote: >> >>> The current situation is that if you run out of disk space while writing >>> WAL, you get a PANIC, and the server shuts down. That's awful. We can >> >>> So we need to somehow stop new WAL insertions from happening, before >>> it's too late. >> >>> A naive idea is to check if there's enough preallocated WAL space, just >>> before inserting the WAL record. However, it's too late to check that in >> >> There is a database engine, Microsoft's "Jet Blue" aka the Extensible >> Storage Engine, that just keeps some preallocated log files around, >> specifically so it can get consistent and halt cleanly if it runs out of >> disk space. >> >> In other words, the idea is not to check over and over again that there is >> enough already-reserved WAL space, but to make sure there always is by >> having a preallocated segment that is never used outside a disk space >> emergency. > > That's not a bad technique. I wonder how reliable it would be in > postgres. That's no different from just having a bit more WAL space in the first place. We need a mechanism to stop backends from writing WAL, before you run out of it completely. It doesn't matter if the reservation is done by stashing away a WAL segment for emergency use, or by a variable in shared memory. Either way, backends need to stop using it up, by blocking or throwing an error before they enter the critical section. I guess you could use the stashed away segment to ensure that you can recover after PANIC. At recovery, there are no other backends that could use up the emergency segment. But that's not much better than what we have now. - Heikki
--On 6. Juni 2013 16:25:29 -0700 Josh Berkus <josh@agliodbs.com> wrote: > Archiving > --------- > > In some ways, this is the simplest case. Really, we just need a way to > know when the available WAL space has become 90% full, and abort > archiving at that stage. Once we stop attempting to archive, we can > clean up the unneeded log segments. > > What we need is a better way for the DBA to find out that archiving is > falling behind when it first starts to fall behind. Tailing the log and > examining the rather cryptic error messages we give out isn't very > effective. Slightly OT, but i always wondered whether we could create a function, say pg_last_xlog_removed() for example, returning a value suitable to be used to calculate the distance to the current position. An increasing value could be used to instruct monitoring to throw a warning if a certain threshold is exceeded. I've also seen people creating monitoring scripts by looking into archive_status and doing simple counts on the .ready files, giving a warning if that exceeds an expected maximum value. I haven't looked at the code very deep, but i think we already store the position of the last removed xlog in shared memory, maybe this can be used somehow. Afaik, we do cleanup only during checkpoints, so this all has too much delay... -- Thanks Bernd
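For illustration, a function along those lines could be a thin SQL-callable wrapper over that shared-memory value. The sketch below is hypothetical; in particular, GetLastRemovedSegno() stands in for however the last-removed position would actually be read.

#include "postgres.h"
#include "fmgr.h"

extern uint64 GetLastRemovedSegno(void);	/* hypothetical accessor */

PG_FUNCTION_INFO_V1(pg_last_xlog_removed);

/*
 * Returns the segment number of the most recently removed/recycled WAL
 * file, so monitoring can compare it against the current insert position
 * and alert when the gap keeps growing.
 */
Datum
pg_last_xlog_removed(PG_FUNCTION_ARGS)
{
	PG_RETURN_INT64((int64) GetLastRemovedSegno());
}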
On 06.06.2013 17:00, Heikki Linnakangas wrote: > A more workable idea is to sprinkle checks in higher-level code, before > you hold any critical locks, to check that there is enough preallocated > WAL. Like, at the beginning of heap_insert, heap_update, etc., and all > similar indexam entry points. Actually, there's one place that catches most of these: LockBuffer(..., BUFFER_LOCK_EXCLUSIVE). In all heap and index operations, you always grab an exclusive lock on a page first, before entering the critical section where you call XLogInsert. That leaves a few miscellaneous XLogInsert calls that need to be guarded, but it leaves a lot less room for bugs of omission, and keeps the code cleaner. - Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 06.06.2013 17:00, Heikki Linnakangas wrote: >> A more workable idea is to sprinkle checks in higher-level code, before >> you hold any critical locks, to check that there is enough preallocated >> WAL. Like, at the beginning of heap_insert, heap_update, etc., and all >> similar indexam entry points. > Actually, there's one place that catches most of these: LockBuffer(..., > BUFFER_LOCK_EXCLUSIVE). In all heap and index operations, you always > grab an exclusive lock on a page first, before entering the critical > section where you call XLogInsert. Not only is that a horrible layering/modularity violation, but surely LockBuffer can have no idea how much WAL space will be needed. regards, tom lane
On 07.06.2013 19:33, Tom Lane wrote: > Heikki Linnakangas<hlinnakangas@vmware.com> writes: >> On 06.06.2013 17:00, Heikki Linnakangas wrote: >>> A more workable idea is to sprinkle checks in higher-level code, before >>> you hold any critical locks, to check that there is enough preallocated >>> WAL. Like, at the beginning of heap_insert, heap_update, etc., and all >>> similar indexam entry points. > >> Actually, there's one place that catches most of these: LockBuffer(..., >> BUFFER_LOCK_EXCLUSIVE). In all heap and index operations, you always >> grab an exclusive lock on a page first, before entering the critical >> section where you call XLogInsert. > > Not only is that a horrible layering/modularity violation, but surely > LockBuffer can have no idea how much WAL space will be needed. It can be just a conservative guess, like, 32KB. That should be enough for almost all WAL-logged operations. The only exception that comes to mind is a commit record, which can be arbitrarily large, when you have a lot of subtransactions or dropped/created relations. - Heikki
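For illustration, the hook could look roughly like the sketch below. ReserveXLogSpace() is the hypothetical reservation function from the earlier sketch, and the 32 kB constant is the conservative guess mentioned above; this is not actual bufmgr code, just the shape of the idea.

#include "postgres.h"
#include "storage/bufmgr.h"

#define XLOG_RESERVE_DEFAULT	(32 * 1024)		/* conservative per-operation guess */

extern void ReserveXLogSpace(uint64 estimate);	/* hypothetical, see earlier sketch */

void
LockBuffer(Buffer buffer, int mode)
{
	/*
	 * Hypothetical hook: before handing out an exclusive content lock, make
	 * sure a conservative amount of preallocated WAL is still available, so
	 * the caller never reaches its critical section without room for the
	 * WAL record it is about to emit.
	 */
	if (mode == BUFFER_LOCK_EXCLUSIVE)
		ReserveXLogSpace(XLOG_RESERVE_DEFAULT);

	/* ... the existing LockBuffer() implementation would follow here ... */
}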
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 07.06.2013 19:33, Tom Lane wrote: >> Not only is that a horrible layering/modularity violation, but surely >> LockBuffer can have no idea how much WAL space will be needed. > It can be just a conservative guess, like, 32KB. That should be enough > for almost all WAL-logged operations. The only exception that comes to > mind is a commit record, which can be arbitrarily large, when you have a > lot of subtransactions or dropped/created relations. What happens when several updates are occurring concurrently? regards, tom lane
>> I would oppose that as the solution, either an unconditional one, or >> configurable with is it as the default. Those segments are not >> unneeded. I need them. That is why I set up archiving in the first >> place. If you need to shut down the database rather than violate my >> established retention policy, then shut down the database.
>
> Agreed and I would oppose it even as configurable. We set up the > archiving for a reason. I do think it might be useful to be able to > store archiving logs as well as wal_keep_segments logs in a different > location than pg_xlog.

People have different configurations. Most of my clients use archiving for backup or replication; they would rather have archiving cease (and send a CRITICAL alert) than have the master go offline. That's pretty common, probably more common than the "if I don't have redundancy shut down" case.

Certainly anyone making the decision that their master database should shut down rather than cease archiving should make it *consciously*, instead of finding out the hard way.

>> The archive command can be made a shell script (or that matter a >> compiled program) which can do anything it wants upon failure, including >> emailing people.

You're talking about using external tools -- frequently hackish, workaround ones -- to handle something which PostgreSQL should be doing itself, and which only the database engine has full knowledge of. While that's the only solution we have for now, it's hardly a worthy design goal.

Right now, what we're telling users is "You can have continuous backup with Postgres, but you'd better hire an expensive consultant to set it up for you, or use this external tool of dubious provenance which there's no packages for, or you might accidentally cause your database to shut down in the middle of the night."

At which point most sensible users say "no thanks, I'll use something else".

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Fri, Jun 7, 2013 at 12:14 PM, Josh Berkus <josh@agliodbs.com> wrote: > Right now, what we're telling users is "You can have continuous backup > with Postgres, but you'd better hire and expensive consultant to set it > up for you, or use this external tool of dubious provenance which > there's no packages for, or you might accidentally cause your database > to shut down in the middle of the night."

Inverted and just as well supported: "if you want to not accidentally lose data, you better hire an expensive consultant to check your systems for all sorts of default 'safety = off' features." This being but the hypothetical first one.

Furthermore, I see no reason why high quality external archiving software cannot exist. Maybe some even exists already, and no doubt they can be improved and the contract with Postgres enriched to that purpose. Contrast: JSON, where the stable OID in the core distribution helps pragmatically punt on a particularly sticky problem (extension dependencies and non-system OIDs), I can't think of a reason an external archiver couldn't do its job well right now.

> At which point most sensible users say "no thanks, I'll use something else".

Oh, I lost some disks, well, no big deal, I'll use the archives. Surprise! <forensic analysis ensues> So, as it turns out, it has been dropping segments at times because of systemic backlog for months/years.

Alternative ending: Hey, I restored the database. <later> Why is the state so old? Why are customers getting warnings that their (thought paid) invoices are overdue? Oh crap, the restore was cut short by this stupid option and this database lives in the past! Fin.

I have a clear bias in experience here, but I can't relate to someone who sets up archives but is totally okay losing a segment unceremoniously, because it only takes one of those once in a while to make a really, really bad day. Who is this person that lackadaisically archives, and are they just fooling themselves? And where are these archivers that enjoy even a modicum of long-term success that are not reliable?

If one wants to casually drop archives, how is someone going to find out and freak out a bit? Per experience, logs are pretty clearly hazardous to the purpose.

Basically, I think the default that opts one into danger is not good, especially since the system is starting from a position of "do too much stuff and you'll crash."

Finally, it's not that hard to teach any archiver how to no-op at user-peril, or perhaps Postgres can learn a way to do this expressly to standardize the procedure a bit to ease publicly shared recipes, perhaps.
From: "Daniel Farina" <daniel@heroku.com> > On Fri, Jun 7, 2013 at 12:14 PM, Josh Berkus <josh@agliodbs.com> wrote: >> Right now, what we're telling users is "You can have continuous backup >> with Postgres, but you'd better hire and expensive consultant to set it >> up for you, or use this external tool of dubious provenance which >> there's no packages for, or you might accidentally cause your database >> to shut down in the middle of the night." >> >> At which point most sensible users say "no thanks, I'll use something >> else". > Inverted and just as well supported: "if you want to not accidentally > lose data, you better hire an expensive consultant to check your > systems for all sorts of default 'safety = off' features." This > being but the hypothetical first one. > > Furthermore, I see no reason why high quality external archiving > software cannot exist. Maybe some even exists already, and no doubt > they can be improved and the contract with Postgres enriched to that > purpose. > > Finally, it's not that hard to teach any archiver how to no-op at > user-peril, or perhaps Postgres can learn a way to do this expressly > to standardize the procedure a bit to ease publicly shared recipes, > perhaps. Yes, I feel designing reliable archiving, even for the simplest case - copy WAL to disk, is very difficult. I know there are following three problems if you just follow the PostgreSQL manual. Average users won't notice them. I guess even professional DBAs migrating from other DBMSs won't, either. 1. If the machine or postgres crashes while archive_command is copying a WAL file, later archive recovery fails. This is because cp leaves a file of less than 16MB in archive area, and postgres refuses to start when it finds such a small archive WAL file. The solution, which IIRC Tomas san told me here, is to do like "cp %p /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f". 2. archive_command dumps core when you run pg_ctl stop -mi. This is because postmaster sends SIGQUIT to all its descendants. The core files accumulate in the data directory, which will be backed up with the database. Of course those core files are garbage. archive_command script needs to catch SIGQUIT and exit. 3. You cannot know the reason of archive_command failure (e.g. archive area full) if you don't use PostgreSQL's server logging. This is because archive_command failure is not logged in syslog/eventlog. I hope PostgreSQL will provide a reliable archiving facility that is ready to use. Regards MauMau
On 06/07/2013 12:14 PM, Josh Berkus wrote: > Right now, what we're telling users is "You can have continuous backup > with Postgres, but you'd better hire and expensive consultant to set it > up for you, or use this external tool of dubious provenance which > there's no packages for, or you might accidentally cause your database > to shut down in the middle of the night." This is an outright falsehood. We are telling them, "You better know what you are doing" or "You should call a consultant". This is no different than, "You better know what you are doing" or "You should take driving lessons". > > At which point most sensible users say "no thanks, I'll use something else". > Josh I have always admired your flair for dramatics, it almost rivals mine. Users are free to use what they want, some will chose lesser databases. I am ok with that because eventually if PostgreSQL is the right tool, they will come back to us, and PgExperts or CMD or OmniTI or they will know what they are doing and thus don't need us. JD
On 06/08/2013 07:36 AM, MauMau wrote: > 1. If the machine or postgres crashes while archive_command is copying a > WAL file, later archive recovery fails. > This is because cp leaves a file of less than 16MB in archive area, and > postgres refuses to start when it finds such a small archive WAL file. > The solution, which IIRC Tomas san told me here, is to do like "cp %p > /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f". Well it seems to me that one of the problems here is we tell people to use copy. We should be telling people to use a command (or supply a command) that is smarter than that. > 3. You cannot know the reason of archive_command failure (e.g. archive > area full) if you don't use PostgreSQL's server logging. > This is because archive_command failure is not logged in syslog/eventlog. Wait, what? Is this true (someone else?) JD
On 06/06/2013 07:52 AM, Heikki Linnakangas wrote: > I think it can be made fairly robust otherwise, and the performance > impact should be pretty easy to measure with e.g pgbench.

Once upon a time in a land far, far away, we expected users to manage their own systems. We had things like soft and hard quotas on disks and lastlog to find out who was logging into the system. Alas, as far as I know soft and hard quotas are kind of a thing of the past but that doesn't mean that their usefulness has ended.

The idea that we PANIC is not just awful, it is stupid. I don't think anyone is going to disagree with that. However, there is a question of what to do instead. I think the idea of sprinkling checks into the higher level code before specific operations is not invalid but I also don't think it is necessary.

To me, a more pragmatic approach makes sense. Obviously having some kind of code that checks the space makes sense but I don't know that it needs to be around any operation other than when we are creating a segment. What do we care why the segment is being created? If we don't have enough room to create the segment, the transaction rolls back with some OBVIOUS not OBTUSE error.

Obviously this could cause a ton of transactions to roll back but I think keeping the database consistent and rolling back a transaction in case of error is exactly what we are supposed to do.

Sincerely, Joshua D. Drake
On 2013-06-08 11:15:40 -0700, Joshua D. Drake wrote: > To me, a more pragmatic approach makes sense. Obviously having some kind of > code that checks the space makes sense but I don't know that it needs to be > around any operation other than we are creating a segment. What do we care > why the segment is being created? If we don't have enough room to create the > segment, the transaction rollsback with some OBVIOUS not OBTUSE error. > > Obviously this could cause a ton of transactions to roll back but I think > keeping the database consistent and rolling back a transaction in case of > error is exactly what we are supposed to do. You know, the PANIC isn't there just because we like to piss off users. There's actual technical reasons that don't just go away by judging the PANIC as stupid. At the points where the XLogInsert()s happens we're in critical sections out of which we *cannot* ERROR out because we already may have made modifications that cannot be allowed to be performed partially/unlogged. That's why we're throwing a PANIC which will force a cluster wide restart including *NOT* writing any further buffers from s_b out. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-06-07 12:02:57 +0300, Heikki Linnakangas wrote: > On 07.06.2013 00:38, Andres Freund wrote: > >On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote: > >>* Heikki Linnakangas wrote: > >> > >>>The current situation is that if you run out of disk space while writing > >>>WAL, you get a PANIC, and the server shuts down. That's awful. We can > >> > >>>So we need to somehow stop new WAL insertions from happening, before > >>>it's too late. > >> > >>>A naive idea is to check if there's enough preallocated WAL space, just > >>>before inserting the WAL record. However, it's too late to check that in > >> > >>There is a database engine, Microsoft's "Jet Blue" aka the Extensible > >>Storage Engine, that just keeps some preallocated log files around, > >>specifically so it can get consistent and halt cleanly if it runs out of > >>disk space. > >> > >>In other words, the idea is not to check over and over again that there is > >>enough already-reserved WAL space, but to make sure there always is by > >>having a preallocated segment that is never used outside a disk space > >>emergency. > > > >That's not a bad technique. I wonder how reliable it would be in > >postgres. > > That's no different from just having a bit more WAL space in the first > place. We need a mechanism to stop backends from writing WAL, before you run > out of it completely. It doesn't matter if the reservation is done by > stashing away a WAL segment for emergency use, or by a variable in shared > memory. Either way, backends need to stop using it up, by blocking or > throwing an error before they enter the critical section. Well, if you have 16 or 32MB of reserved WAL space available you don't need to judge all that precisely how much space is available. So we can just sprinkle some EnsureXLogHasSpace() on XLogInsert() callsites like heap_insert(), but we can do that outside of the critical sections and we can do it without locks since there needs to happen quite some write activity to overrun the reserved space. Anything that desperately needs to write stuff, like the end of recovery checkpoint, can just not call EnsureXLogHasSpace() and rely on the reserved space. Seems like 90% of the solution for 30% of the complexity or so. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
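To illustrate, such a check might look like the sketch below. EnsureXLogHasSpace() is the name Andres uses above, but the body, the GetFreePreallocatedSegments() accessor, and the two-segment reserve are all assumptions, not existing code.

#include "postgres.h"

#define RESERVED_EMERGENCY_SEGMENTS		2	/* hypothetical emergency reserve */

extern int	GetFreePreallocatedSegments(void);	/* hypothetical accessor */

/*
 * Cheap check for XLogInsert() callsites such as heap_insert(), called
 * before any buffer locks are taken or a critical section is entered.
 * No locking: with a reserve of a segment or two, a slightly stale answer
 * does no harm.
 */
void
EnsureXLogHasSpace(void)
{
	if (GetFreePreallocatedSegments() <= RESERVED_EMERGENCY_SEGMENTS)
		ereport(ERROR,
				(errcode(ERRCODE_DISK_FULL),
				 errmsg("no preallocated WAL space left"),
				 errhint("Free disk space on the WAL volume, or fix a failing archive_command.")));
}

Paths that must always be able to write, like the end-of-recovery checkpoint, would simply skip the call and draw down the reserve, as described above.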
On Fri, Jun 7, 2013 at 12:14 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> The archive command can be made a shell script (or that matter a
>> compiled program) which can do anything it wants upon failure, including
>> emailing people.

You're talking about using external tools -- frequently hackish,
workaround ones -- to handle something which PostgreSQL should be doing
itself, and which only the database engine has full knowledge of.
I think the database engine is about the last thing which would have full knowledge of the best way to contact the DBA, especially during events which by definition mean things are already going badly. I certainly don't see having core code which knows how to talk to every PBX, SMS, email system, or twitter feed that anyone might wish to use for logging. PostgreSQL already supports two formats of text logs, plus syslog and eventlog. Is there some additional logging management tool that we could support which is widely used, doesn't require an expensive consultant to set-up and configure correctly (or even to decide what "correctly" means for the given situation), and which solves 80% of the problems?
It would be nice to have the ability to specify multiple log destinations with different log_min_messages for each one. I'm sure syslog already must implement some kind of method for doing that, but I've been happy enough with the text logs that I've never bothered to look into it much.
While
that's the only solution we have for now, it's hardly a worthy design goal.
Right now, what we're telling users is "You can have continuous backup
with Postgres, but you'd better hire an expensive consultant to set it
up for you, or use this external tool of dubious provenance which
there's no packages for, or you might accidentally cause your database
to shut down in the middle of the night."
At which point most sensible users say "no thanks, I'll use something else".
What does the something else do? Hopefully it is not "silently invalidate your backups".
Cheers,
Jeff
On Sat, Jun 8, 2013 at 11:15 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
Once upon a time in a land far, far away, we expected users to manage their own systems. We had things like soft and hard quotas on disks and last log to find out who was logging into the system. Alas, as far as I know soft and hard quotas are kind of a thing of the past but that doesn't mean that their usefulness has ended.
On 06/06/2013 07:52 AM, Heikki Linnakangas wrote:
I think it can be made fairly robust otherwise, and the performance
impact should be pretty easy to measure with e.g pgbench.
The idea that we PANIC is not just awful, it is stupid. I don't think anyone is going to disagree with that. However, there is a question of what to do instead. I think the idea of sprinkling checks into the higher level code before specific operations is not invalid but I also don't think it is necessary.
Given that the system is going to become unusable, I don't see why PANIC is an awful, stupid way of doing it. And if it can only be used for things that don't generate WAL, that is pretty much unusable, as even read only transactions often need to do clean-up tasks that generate WAL.
Cheers,
Jeff
On 06/08/2013 11:27 AM, Andres Freund wrote: > > On 2013-06-08 11:15:40 -0700, Joshua D. Drake wrote: >> To me, a more pragmatic approach makes sense. Obviously having some kind of >> code that checks the space makes sense but I don't know that it needs to be >> around any operation other than we are creating a segment. What do we care >> why the segment is being created? If we don't have enough room to create the >> segment, the transaction rollsback with some OBVIOUS not OBTUSE error. >> >> Obviously this could cause a ton of transactions to roll back but I think >> keeping the database consistent and rolling back a transaction in case of >> error is exactly what we are supposed to do. > > You know, the PANIC isn't there just because we like to piss of > users. There's actual technical reasons that don't just go away by > judging the PANIC as stupid. Yes I know we aren't trying to piss off users. What I am saying is that it is stupid to the user that it PANICS. I apologize if that came out wrong. > At the points where the XLogInsert()s happens we're in critical sections > out of which we *cannot* ERROR out because we already may have made > modifications that cannot be allowed to be performed > partially/unlogged. That's why we're throwing a PANIC which will force a > cluster wide restart including *NOT* writing any further buffers from > s_b out. > Does this preclude (sorry I don't know this part of the code very well) my suggestion of on log create? JD > Greetings, > > Andres Freund >
On 7 June 2013 10:02, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 07.06.2013 00:38, Andres Freund wrote: >> >> On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote: >>> >>> * Heikki Linnakangas wrote: >>> >>>> The current situation is that if you run out of disk space while writing >>>> WAL, you get a PANIC, and the server shuts down. That's awful. We can >>> >>> >>>> So we need to somehow stop new WAL insertions from happening, before >>>> it's too late. >>> >>> >>>> A naive idea is to check if there's enough preallocated WAL space, just >>>> before inserting the WAL record. However, it's too late to check that in >>> >>> >>> There is a database engine, Microsoft's "Jet Blue" aka the Extensible >>> Storage Engine, that just keeps some preallocated log files around, >>> specifically so it can get consistent and halt cleanly if it runs out of >>> disk space. >>> >>> In other words, the idea is not to check over and over again that there >>> is >>> enough already-reserved WAL space, but to make sure there always is by >>> having a preallocated segment that is never used outside a disk space >>> emergency. >> >> >> That's not a bad technique. I wonder how reliable it would be in >> postgres. > > > That's no different from just having a bit more WAL space in the first > place. We need a mechanism to stop backends from writing WAL, before you run > out of it completely. It doesn't matter if the reservation is done by > stashing away a WAL segment for emergency use, or by a variable in shared > memory. Either way, backends need to stop using it up, by blocking or > throwing an error before they enter the critical section. > > I guess you could use the stashed away segment to ensure that you can > recover after PANIC. At recovery, there are no other backends that could use > up the emergency segment. But that's not much better than what we have now. Christian's idea seems good to me. Looks like you could be dismissing this too early, especially since there's no better idea emerged. I doubt that we're going to think of something others didn't already face. It's pretty clear that most other DBMS do this with their logs. For your fast wal insert patch to work well, we need a simple and fast technique to detect out of space. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Jun 8, 2013 at 11:27 AM, Andres Freund <andres@2ndquadrant.com> wrote:
You know, the PANIC isn't there just because we like to piss off
users. There's actual technical reasons that don't just go away by
judging the PANIC as stupid.
At the points where the XLogInsert()s happens we're in critical sections
out of which we *cannot* ERROR out because we already may have made
modifications that cannot be allowed to be performed
partially/unlogged. That's why we're throwing a PANIC which will force a
cluster wide restart including *NOT* writing any further buffers from
s_b out.
If archiving is on and failure is due to no space, could we just keep trying XLogFileInit again for a couple minutes to give archiving a chance to do its thing? Doing that while holding onto locks and remaining in a critical section would be unfortunate, but if the alternative is a PANIC, it might be acceptable.
The problem is that even if the file is only being kept so it can be archived, once archiving succeeds I think the file is not removed immediately but rather not until the next checkpoint, which will never happen when the locks are still held.
Cheers,
Jeff
From: "Joshua D. Drake" <jd@commandprompt.com> > On 06/08/2013 07:36 AM, MauMau wrote: >> 3. You cannot know the reason of archive_command failure (e.g. archive >> area full) if you don't use PostgreSQL's server logging. >> This is because archive_command failure is not logged in syslog/eventlog. > > Wait, what? Is this true (someone else?) I'm sorry, please let me correct myself. What I meant is: This is because the exact reason for archive_command failure (e.g. cp's output) is not logged in syslog/eventlog. The fact itself that archive_command failed is recorded. Regards MauMau
From: "Joshua D. Drake" <jd@commandprompt.com> > On 06/08/2013 11:27 AM, Andres Freund wrote: >> You know, the PANIC isn't there just because we like to piss of >> users. There's actual technical reasons that don't just go away by >> judging the PANIC as stupid. > > Yes I know we aren't trying to piss off users. What I am saying is that it > is stupid to the user that it PANICS. I apologize if that came out wrong. I've experienced PANIC shutdown several times due to WAL full during development. As a user, one problem with PANIC is that it dumps core. I think core files should be dumped only when "can't happen" events has occurred like PostgreSQL's bug. I didn't expect postgres dumps core simply because disk is full. I want postgres to shutdown with FATAL message in that exact case. Regards MauMau
From: "Josh Berkus" <josh@agliodbs.com> > There's actually three potential failure cases here: > > - One Volume: WAL is on the same volume as PGDATA, and that volume is > completely out of space. > > - XLog Partition: WAL is on its own partition/volume, and fills it up. > > - Archiving: archiving is failing or too slow, causing the disk to fill > up with waiting log segments. I think there is one more case. Is this correct? - Failure of a disk containing data directory or tablespace If checkpoint can't write buffers to disk because of disk failure, checkpoint cannot complete, thus WAL files accumulate in pg_xlog/. This means that one disk failure will lead to postgres shutdown. > xLog Partition > -------------- > > As Heikki pointed, out, a full dedicated WAL drive is hard to fix once > it gets full, since there's nothing you can safely delete to clear > space, even enough for a checkpoint record. This sounds very scary. Is it possible to complete recovery and start up postmaster with either or both of the following modifications? [Idea 1] During recovery, force archiving a WAL file and delete/recycle it in pg_xlog/ as soon as its contents are applied. [Idea 2] During recovery, when disk full is encountered at end-of-recovery checkpoint, force archiving all unarchived WAL files and delete/recycle them at that time. Regards MauMau
On 06/06/2013 10:00 PM, Heikki Linnakangas wrote: > > I've seen a case, where it was even worse than a PANIC and shutdown. > pg_xlog was on a separate partition that had nothing else on it. The > partition filled up, and the system shut down with a PANIC. Because > there was no space left, it could not even write the checkpoint after > recovery, and thus refused to start up again. There was nothing else > on the partition that you could delete to make space. The only > recourse would've been to add more disk space to the partition > (impossible), or manually delete an old WAL file that was not needed > to recover from the latest checkpoint (scary). Fortunately this was a > test system, so we just deleted everything. There were a couple of dba.stackexchange.com reports along the same lines recently, too. Both involved an antivirus vendor's VM appliance with a canned (stupid) configuration that set wal_keep_segments too high for the disk space allocated and stored WAL on a separate partition. People are having issues with WAL space management in the real world and I think it and autovacuum are the two hardest things for most people to configure and understand in Pg right now. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 06/08/2013 10:57 AM, Daniel Farina wrote: > >> At which point most sensible users say "no thanks, I'll use something else". > [snip] > > I have a clear bias in experience here, but I can't relate to someone > who sets up archives but is totally okay losing a segment unceremoniously, > because it only takes one of those once in a while to make a really, > really bad day. It sounds like between you both you've come up with a pretty solid argument by exclusion for throttling WAL writing as space grows critical. Dropping archive segments seems pretty unsafe from the solid arguments Daniel presents, and Josh makes a good case for why shutting the DB down when archiving can't keep up isn't any better. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
<div class="moz-cite-prefix">On 06/09/2013 08:32 AM, MauMau wrote:<br /></div><blockquote cite="mid:D117F8165E894FA1B805302504C6F1D7@maumau"type="cite"><br /> - Failure of a disk containing data directory or tablespace<br /> If checkpoint can't write buffers to disk because of disk failure, checkpoint cannot complete, thus WALfiles accumulate in pg_xlog/. <br /> This means that one disk failure will lead to postgres shutdown. <br /></blockquote>... which is why tablespaces aren't disposable, and why creating a tablespace in a RAM disk is such an awfulidea.<br /><br /> I'd rather like to be able to recover from this by treating the tablespace as dead, so any attemptto get a lock on any table within it fails with an error and already-in-WAL writes to it just get discarded. It'sthe sort of thing that'd only be reasonable to do as a recovery option (like zero_damaged_pages) since if applied bydefault it'd lead to potentially severe and unexpected data loss.<br /><br /> I've seen a couple of people bitten by themisunderstanding that tablespaces are a way to split up your data based on different reliability requirements, and I reallyneed to write a docs patch for <a href="http://www.postgresql.org/docs/9.2/static/manage-ag-tablespaces.html">http://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html</a> thatadds a prominent warning like:<br /><br /> WARNING: Every tablespace must be present before the database can be started.There is no easy way to recover the database if a tablespace is lost to disk failure, deletion, use of volatile storage,etc. <b>Do not put a tablespace on a RAM disk</b>; instead just use UNLOGGED tables.<br /><br /> (Opinionson the above?)<br /><br /><pre class="moz-signature" cols="72">-- Craig Ringer <a class="moz-txt-link-freetext"href="http://www.2ndQuadrant.com/">http://www.2ndQuadrant.com/</a>PostgreSQL Development, 24x7Support, Training & Services</pre>
On 06/09/2013 03:02 AM, Jeff Janes wrote: > It would be nice to have the ability to specify multiple log destinations > with different log_min_messages for each one. I'm sure syslog already must > implement some kind of method for doing that, but I've been happy enough > with the text logs that I've never bothered to look into it much. Smarter syslog flavours like rsyslog certainly do this. No alert system triggered by events within Pg will ever be fully sufficient. "Oops, the postmaster crashed with stack corruption, I'll just exec whatever's in this on_panic_exec GUC (if I can still read it and it's still valid) to hopefully tell the sysadmin about my death." Hmm, sounds as reliable and safe as a bicycle powered by a home-made rocket engine. External monitoring is IMO always necessary. Something like Icinga with check_postgres can trivially poke Pg to make sure it's alive. It can also efficiently check the 'pg_error.log' from rsyslog that contains only severe errors and raise alerts if it doesn't like what it sees. If I'm already doing external monitoring (which is necessary as outlined above) then I see much less point having Pg able to raise alerts for problems, and am more interested in better built-in functions and views for exposing Pg's state. Easier monitoring of WAL build-up, ways to slow the master if async replicas or archiving are getting too far behind, etc. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-06-08 13:26:56 -0700, Joshua D. Drake wrote: > >At the points where the XLogInsert()s happens we're in critical sections > >out of which we *cannot* ERROR out because we already may have made > >modifications that cannot be allowed to be performed > >partially/unlogged. That's why we're throwing a PANIC which will force a > >cluster wide restart including *NOT* writing any further buffers from > >s_b out. > > > > Does this preclude (sorry I don't know this part of the code very well) my > suggestion of on log create? Well, yes. We create log segments some layers below XLogInsert() if necessary, and as I said above, we're in a critical section at that point, so just rolling back isn't one of the options. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
From: "Craig Ringer" <craig@2ndquadrant.com> > On 06/09/2013 08:32 AM, MauMau wrote: >> >> - Failure of a disk containing data directory or tablespace >> If checkpoint can't write buffers to disk because of disk failure, >> checkpoint cannot complete, thus WAL files accumulate in pg_xlog/. >> This means that one disk failure will lead to postgres shutdown. > > I've seen a couple of people bitten by the misunderstanding that > tablespaces are a way to split up your data based on different > reliability requirements, and I really need to write a docs patch for > http://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html > <http://www.postgresql.org/docs/9.2/static/manage-ag-tablespaces.html> > that adds a prominent warning like: > > WARNING: Every tablespace must be present before the database can be > started. There is no easy way to recover the database if a tablespace is > lost to disk failure, deletion, use of volatile storage, etc. <b>Do not > put a tablespace on a RAM disk</b>; instead just use UNLOGGED tables. > > (Opinions on the above?) Yes, I'm sure this is useful for DBAs to know how postgres behaves and take some preparations. However, this does not apply to my case, because I'm using tablespaces for I/O distribution across multiple disks and simply for database capacity. The problem is that the reliability of the database system decreases with more disks, because failure of any one of those disks would result in a database PANIC shutdown > I'd rather like to be able to recover from this by treating the > tablespace as dead, so any attempt to get a lock on any table within it > fails with an error and already-in-WAL writes to it just get discarded. > It's the sort of thing that'd only be reasonable to do as a recovery > option (like zero_damaged_pages) since if applied by default it'd lead > to potentially severe and unexpected data loss. I'm in favor of taking a tablespace offline when I/O failure is encountered, and continue running the database server. But WAL must not be discarded because committed transactions must be preserved for durability of ACID. Postgres needs to take these steps when it encounters an I/O error: 1. Take the tablespace offline, so that subsequent read/write against it returns an error without actually issuing read/write against data files. 2. Discard shared buffers containing data in the tablespace. WAL is not affected by the offlining of tablespaces. WAL records already written on the WAL buffer will be written to pg_xlog/ and archived as usual. Those WAL records will be used to recover committed transactions during archive recovery. Regards MauMau
On 06/10/2013 06:39 AM, MauMau wrote: > The problem is that the reliability of the database system decreases > with more disks, because failure of any one of those disks would result > in a database PANIC shutdown. More specifically, with more independent sets of disks / file systems. >> I'd rather like to be able to recover from this by treating the >> tablespace as dead, so any attempt to get a lock on any table within it >> fails with an error and already-in-WAL writes to it just get discarded. >> It's the sort of thing that'd only be reasonable to do as a recovery >> option (like zero_damaged_pages) since if applied by default it'd lead >> to potentially severe and unexpected data loss. > > I'm in favor of taking a tablespace offline when an I/O failure is > encountered, and continuing to run the database server. But WAL must not > be discarded, because committed transactions must be preserved for > the durability of ACID. [snip] > WAL is not affected by the offlining of tablespaces. WAL records > already written to the WAL buffer will be written to pg_xlog/ and > archived as usual. Those WAL records will be used to recover committed > transactions during archive recovery. (I'm still learning the details of Pg's WAL, WAL replay and recovery, so the below's just my understanding): The problem is that WAL for all tablespaces is mixed together in the archives. If you lose your tablespace then you have to keep *all* WAL around and replay *all* of it again when the tablespace comes back online. This would be very inefficient, and would require a lot of tricks to cope with applying WAL to a database whose on-disk state is in the future as far as the archives are concerned. It's not as simple as just replaying all WAL all over again - as I understand it, things like CLUSTER or TRUNCATE will result in relfilenodes not being where they're expected to be as far as old WAL archives are concerned. Selective replay would be required, and that leaves the door open to all sorts of new and exciting bugs in areas that'd hardly ever get tested. To solve the massive disk space explosion problem I imagine we'd have to have per-tablespace WAL. That'd cause a *huge* increase in fsync costs and loss of the rather nice property that WAL writes are nice sequential writes. It'd be complicated and probably cause nightmares during recovery, for archive-based replication, etc. The only other thing I can think of is: When a tablespace is offline, write WAL records to a separate "tablespace recovery log" as they're encountered. Replay this log when the tablespace comes back online, before applying any other new WAL to the tablespace. This wouldn't affect archive-based recovery since it'd already have the records from the original WAL. None of these options seem exactly simple or pretty, especially given the additional complexities that'd be involved in allowing WAL records to be applied out-of-order, something that AFAIK _never_ happens at the moment. The key problem, of course, is that this all sounds like a lot of complicated work for a case that's not really supposed to happen. Right now, the answer is "your database is unrecoverable, switch to your streaming warm standby and re-seed it from the standby". Not pretty, but at least there's the option of using a sync standby and avoiding data loss. How would you approach this? -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
From: "Craig Ringer" <craig@2ndquadrant.com> > The problem is that WAL for all tablespaces is mixed together in the > archives. If you lose your tablespace then you have to keep *all* WAL > around and replay *all* of it again when the tablespace comes back > online. This would be very inefficient, would require a lot of tricks to > cope with applying WAL to a database that has an on-disk state in the > future as far as the archives are concerned. It's not as simple as just > replaying all WAL all over again - as I understand it, things like > CLUSTER or TRUNCATE will result in relfilenodes not being where they're > expected to be as far as old WAL archives are concerned. Selective > replay would be required, and that leaves the door open to all sorts of > new and exciting bugs in areas that'd hardly ever get tested. Although I still lack understanding of PostgreSQL implementation, I have an optimistic feeling that such complexity would not be required. While a tablespace is offline, subsequent access from new or existing transactions is rejected with an error "tablespace is offline". So new WAL records would not be generated for the offline tablespace. To take the tablespace back online, the DBA performs per-tablespace archive recovery. Per-tablespace archive recovery restores tablespace data files from the backup, then read through archive and pg_xlog/ WAL as usual, and selectively applies WAL records for the tablespace. I don't think it's a "must-be-fixed" problem that the WAL for all tablespaces is mixed in one location. I suppose we can tolerate that archive recovery takes a long time. > To solve the massive disk space explosion problem I imagine we'd have to > have per-tablespace WAL. That'd cause a *huge* increase in fsync costs > and loss of the rather nice property that WAL writes are nice sequential > writes. It'd be complicated and probably cause nightmares during > recovery, for archive-based replication, etc. Per-tablespace WAL is very interesting for another reason -- massive-scale OLTP for database consolidation. This feature would certainly be a breakthrough for amazing performance, because WAL is usually the last bottleneck in OLTP. Yes, I can imagine recovery would be much, much more complicated,. > None of these options seem exactly simple or pretty, especially given > the additional complexities that'd be involved in allowing WAL records > to be applied out-of-order, something that AFAIK _never_h happens at the > moment. As I mentioned above, in my shallow understanding, it seems that the additional complexities can be controlled. > The key problem, of course, is that this all sounds like a lot of > complicated work for a case that's not really supposed to happen. Right > now, the answer is "your database is unrecoverable, switch to your > streaming warm standby and re-seed it from the standby". Not pretty, but > at least there's the option of using a sync standby and avoiding data > loss. Sync standby... maybe. Let me consider this. > How would you approach this? Thanks Craig, you gave me some interesting insights. All of these topics are interesting, and I'd like to work on them when I have acquired enough knowledge and experience in PostgreSQL development. Regards MauMau
Josh, Daniel, >> Right now, what we're telling users is "You can have continuous backup >> with Postgres, but you'd better hire an expensive consultant to set it >> up for you, or use this external tool of dubious provenance which >> there's no packages for, or you might accidentally cause your database >> to shut down in the middle of the night." > > This is an outright falsehood. We are telling them, "You better know > what you are doing" or "You should call a consultant". This is no > different than, "You better know what you are doing" or "You should take > driving lessons". What I'm pointing out is that there is no "simple case" for archiving the way we have it set up. That is, every possible way to deploy PITR for Postgres involves complex, error-prone configuration, setup, and monitoring. I don't think that's necessary; simple cases should have simple solutions. If you do a quick survey of pgsql-general, you will see that the issue of databases shutting down unexpectedly because archiving ran them out of disk space is a very common problem. People shouldn't be afraid of their backup solutions. I'd agree that one possible answer for this is to just get one of the external tools simplified, well-packaged, distributed, instrumented for common monitoring systems, and referenced in our main documentation. I'd say Barman is the closest to "a simple solution for the simple common case", at least for PITR. I've been able to give some clients Barman and have them deploy it themselves. This isn't true of the other tools I've tried. Too bad it's GPL, and doesn't do archiving-for-streaming. > I have a clear bias in experience here, but I can't relate to someone > who sets up archives but is totally okay losing a segment unceremoniously, > because it only takes one of those once in a while to make a really, > really bad day. Who is this person that lackadaisically archives, and > are they just fooling themselves? And where are these archivers that If WAL archiving is your *second* level of redundancy, you will generally be willing to have it break rather than interfere with the production workload. This is particularly the case if you're using archiving just as a backup for streaming replication. Heck, I've had one client where archiving was being used *only* to spin up staging servers, and not for production at all; do you think they wanted production to shut down if they ran out of archive space (which it did)? I'll also point out that archiving can silently fail for a number of reasons having nothing to do with "safety" options, such as an NFS mount in Linux silently going away (I've also had this happen), or network issues causing file corruption. Which just points out that we need better ways to detect gaps/corruption in archiving. Anyway, what I'm pointing out is that this is a business decision, and there is no way that we can make a decision for the users what to do when we run out of WAL space. And that the "stop archiving" option needs to be there for users, as well as the "shut down" option. *without* requiring users to learn the internals of the archiving system to implement it, or to know the implied effects of non-obvious PostgreSQL settings. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Sat, Jun 8, 2013 at 11:07 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
This will overwrite /archive/dir/%f if it already exists, which is usually recommended against. Although I don't know that it necessarily should be. One common problem with archiving is for a network glitch to occur during the archive command, so the archive command fails and tries again later. But the later tries will always fail, because the target was created before/during the glitch. Perhaps a more full featured archive command would detect and rename an existing file, rather than either overwriting it or failing.
> On 06/08/2013 07:36 AM, MauMau wrote:
>> 1. If the machine or postgres crashes while archive_command is copying a
>> WAL file, later archive recovery fails.
>> This is because cp leaves a file of less than 16MB in archive area, and
>> postgres refuses to start when it finds such a small archive WAL file.

Should that be changed? If the file is 16MB but it turns to gibberish after 3MB, recovery proceeds up to the gibberish. Given that, why should it refuse to start if the file is only 3MB to start with?

>> The solution, which IIRC Tomas san told me here, is to do like "cp %p
>> /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f".

If we have no compunction about overwriting the file, then I don't see a reason to use the cp + mv combination. If the simple cp fails to copy the entire file, it will be tried again until it succeeds.

> Well it seems to me that one of the problems here is we tell people to use
> copy. We should be telling people to use a command (or supply a command)
> that is smarter than that.

Actually we describe what archive_command needs to fulfill, and tell them to use something that accomplishes that. The example with cp is explicitly given as an example, not a recommendation.

> Wait, what? Is this true (someone else?)
>> 3. You cannot know the reason of archive_command failure (e.g. archive
>> area full) if you don't use PostgreSQL's server logging.
>> This is because archive_command failure is not logged in syslog/eventlog.

It is kind of true. PostgreSQL does not automatically arrange for the stderr of the archive_command to be sent to syslog. But archive_command can do whatever it wants, including arranging for its own failure messages to go to syslog.
Cheers,
Jeff
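Taken together, the tmp-file-plus-rename trick and the worry about overwriting an existing segment suggest what a slightly smarter local archive script could look like. The sketch below is only illustrative - the directory, the rename-aside policy and the script name are assumptions, not anything proposed verbatim in this thread - but it shows the two ideas combined: never leave a partial file under the final name, never silently clobber a different pre-existing copy.

#!/usr/bin/env bash
# Hypothetical wrapper; call from postgresql.conf as: wal_archive "%p" "%f"
ARCHIVEDIR="/archive/dir"            # illustrative archive location
SOURCE="$1"                          # %p - path to the WAL segment
FILE="$2"                            # %f - WAL segment file name
DEST="${ARCHIVEDIR}/${FILE}"

if [ -f "${DEST}" ]; then
    # A byte-identical copy just means a previous attempt already
    # succeeded (e.g. the acknowledgement was lost); report success.
    if cmp -s "${SOURCE}" "${DEST}"; then
        exit 0
    fi
    # A different file under the same name is suspicious: move it
    # aside rather than silently overwriting it.
    mv "${DEST}" "${DEST}.old.$(date +%s)" || exit 1
fi

# Copy to a temporary name, then rename, so a crash mid-copy never
# leaves a truncated segment under the final name.
cp "${SOURCE}" "${DEST}.tmp" || exit 1
mv "${DEST}.tmp" "${DEST}"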
On Mon, Jun 10, 2013 at 11:59 AM, Josh Berkus <josh@agliodbs.com> wrote: > Anyway, what I'm pointing out is that this is a business decision, and > there is no way that we can make a decision for the users what to do > when we run out of WAL space. And that the "stop archiving" option > needs to be there for users, as well as the "shut down" option. > *without* requiring users to learn the internals of the archiving system > to implement it, or to know the implied effects of non-obvious > PostgreSQL settings. I don't doubt this, that's why I do have a no-op fallback for emergencies. The discussion was about defaults. I still think that drop-wal-from-archiving-whenever is not a good one. You may have noticed I also wrote that a neater, common way to drop WAL when under pressure might be nice, to avoid having it ad-hoc and all over, so it's not as though I wanted to suggest that a Postgres feature to this effect would be an anti-feature. And, as I wrote before, it's much easier to teach an external system to drop WAL than it is to teach Postgres to attenuate, hence the repeated correspondence from my fellows and myself about the attenuation side of the equation. Hope that clears things up about where I stand on the matter.
Daniel, Jeff, > I don't doubt this, that's why I do have a no-op fallback for > emergencies. The discussion was about defaults. I still think that > drop-wal-from-archiving-whenever is not a good one. Yeah, we can argue defaults for a long time. What would be better is some way to actually determine what the user is trying to do, or wants to happen. That's why I'd be in favor of an explicit setting; if there's a setting which says: on_archive_failure=shutdown ... then it's a LOT clearer to the user what will happen if the archive runs out of space, even if we make no change to the defaults. And if that setting is changeable on reload, it even becomes a way for users to get out of tight spots. > You may have noticed I also wrote that a neater, common way to drop > WAL when under pressure might be nice, to avoid having it ad-hoc and > all over, so it's not as though I wanted to suggest that a Postgres > feature to this effect would be an anti-feature. Yep. Drake was saying it was an anti-feature, though, so I was arguing with him. >> Well it seems to me that one of the problems here is we tell people to use >> copy. We should be telling people to use a command (or supply a command) >> that is smarter than that. > > Actually we describe what archive_command needs to fulfill, and tell them > to use something that accomplishes that. The example with cp is explicitly > given as an example, not a recommendation. If we offer cp as an example, we *are* recommending it. If we don't recommend it, we shouldn't have it as an example. In fact, if we don't recommend cp, then PostgreSQL should ship with some example shell scripts for archive commands, just as we do for init scripts. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, Jun 10, 2013 at 4:42 PM, Josh Berkus <josh@agliodbs.com> wrote: > Daniel, Jeff, > >> I don't doubt this, that's why I do have a no-op fallback for >> emergencies. The discussion was about defaults. I still think that >> drop-wal-from-archiving-whenever is not a good one. > > Yeah, we can argue defaults for a long time. What would be better is > some way to actually determine what the user is trying to do, or wants > to happen. That's why I'd be in favor of an explicit setting; if there's > a setting which says: > > on_archive_failure=shutdown > > ... then it's a LOT clearer to the user what will happen if the archive > runs out of space, even if we make no change to the defaults. And if > that setting is changeable on reload, it even becomes a way for users to > get out of tight spots. I like your suggestion, save one thing: it's not a 'failure' of archiving if it cannot keep up, provided one subscribes to the view that archiving is not elective. I nit pick at this because one might think this has something to do with a non-zero return code from the archiving program, which already has a pretty alarmist message in the event of transient failures (I think someone brought this up on -hackers a few months ago...can't remember if that resulted in a change). I don't have a better suggestion that is less jargonrific though, but I wanted to express my general appreciation as to the shape of the suggestion.
On 06/10/2013 04:42 PM, Josh Berkus wrote: >> Actually we describe what archive_command needs to fulfill, and tell them >> to use something that accomplishes that. The example with cp is explicitly >> given as an example, not a recommendation. > > If we offer cp as an example, we *are* recommending it. If we don't > recommend it, we shouldn't have it as an example. > > In fact, if we don't recommend cp, then PostgreSQL should ship with some > example shell scripts for archive commands, just as we do for init scripts. Not a bad idea. One that supports rsync and another that supports robocopy. That should cover every platform we support. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
> Not a bad idea. One that supports rsync and another that supports
> robocopy. That should cover every platform we support.

Example script:

=============================
#!/usr/bin/env bash
# Simple script to copy WAL archives from one server to another
# to be called as archive_command (call this as wal_archive "%p" "%f")

# Settings. Please change the below to match your configuration.

# holding directory for the archived log files on the replica
# this is NOT pg_xlog:
WALDIR="/var/lib/pgsql/archive_logs"

# touch file to shut off archiving in case of filling up the disk:
NOARCHIVE="/var/lib/pgsql/NOARCHIVE"

# replica IP, IPv6 or DNS address:
REPLICA="192.168.1.3"

# put any special SSH options here,
# and the location of RSYNC:
export RSYNC_RSH="ssh"
RSYNC="/usr/bin/rsync"

######## DO NOT CHANGE THINGS BELOW THIS LINE ##########

SOURCE="$1"    # %p
FILE="$2"      # %f
DEST="${WALDIR}/${FILE}"

# See whether we want all archiving off
test -f ${NOARCHIVE} && exit 0

# Copy the file to the spool area on the replica, error out if
# the transfer fails
${RSYNC} --quiet --archive --rsync-path=${RSYNC} ${SOURCE} \
    ${REPLICA}:${DEST}
if [ $? -ne 0 ]; then
    exit 1
fi

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Sat, Jun 8, 2013 at 10:36 AM, MauMau <maumau307@gmail.com> wrote: > Yes, I feel designing reliable archiving, even for the simplest case - copy > WAL to disk, is very difficult. I know there are following three problems > if you just follow the PostgreSQL manual. Average users won't notice them. > I guess even professional DBAs migrating from other DBMSs won't, either. > > 1. If the machine or postgres crashes while archive_command is copying a WAL > file, later archive recovery fails. > This is because cp leaves a file of less than 16MB in archive area, and > postgres refuses to start when it finds such a small archive WAL file. > The solution, which IIRC Tomas san told me here, is to do like "cp %p > /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f". > > 2. archive_command dumps core when you run pg_ctl stop -mi. > This is because postmaster sends SIGQUIT to all its descendants. The core > files accumulate in the data directory, which will be backed up with the > database. Of course those core files are garbage. > archive_command script needs to catch SIGQUIT and exit. > > 3. You cannot know the reason of archive_command failure (e.g. archive area > full) if you don't use PostgreSQL's server logging. > This is because archive_command failure is not logged in syslog/eventlog. > > > I hope PostgreSQL will provide a reliable archiving facility that is ready > to use. +1. I think we should have a way to set an archive DIRECTORY, rather than an archive command. And if you set it, then PostgreSQL should just do all of that stuff correctly, without any help from the user. Of course, some users will want to archive to a remote machine via ssh or rsync or what-have-you, and those users will need to provide their own tools. But it's got to be pretty common to archive to a local path that happens to be a remote mount, or to a local directory whose contents are subsequently copied off by a batch job. Making that work nicely with near-zero configuration would be a significant advance. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 12, 2013 at 11:55 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I hope PostgreSQL will provide a reliable archiving facility that is ready >> to use. > > +1. I think we should have a way to set an archive DIRECTORY, rather > than an archive command. And if you set it, then PostgreSQL should > just do all of that stuff correctly, without any help from the user. > Of course, some users will want to archive to a remote machine via ssh > or rsync or what-have-you, and those users will need to provide their > own tools. But it's got to be pretty common to archive to a local > path that happens to be a remote mount, or to a local directory whose > contents are subsequently copied off by a batch job. Making that work > nicely with near-zero configuration would be a significant advance. That, or provide a standard archive command that takes the directory as argument? I bet we have tons of those available among us users...
> On Sat, Jun 8, 2013 at 10:36 AM, MauMau <maumau307@gmail.com> wrote: >> Yes, I feel designing reliable archiving, even for the simplest case - copy >> WAL to disk, is very difficult. I know there are following three problems >> if you just follow the PostgreSQL manual. Average users won't notice them. >> I guess even professional DBAs migrating from other DBMSs won't, either. >> >> 1. If the machine or postgres crashes while archive_command is copying a WAL >> file, later archive recovery fails. >> This is because cp leaves a file of less than 16MB in archive area, and >> postgres refuses to start when it finds such a small archive WAL file. >> The solution, which IIRC Tomas san told me here, is to do like "cp %p >> /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f". >> >> 2. archive_command dumps core when you run pg_ctl stop -mi. >> This is because postmaster sends SIGQUIT to all its descendants. The core >> files accumulate in the data directory, which will be backed up with the >> database. Of course those core files are garbage. >> archive_command script needs to catch SIGQUIT and exit. >> >> 3. You cannot know the reason of archive_command failure (e.g. archive area >> full) if you don't use PostgreSQL's server logging. >> This is because archive_command failure is not logged in syslog/eventlog. >> >> >> I hope PostgreSQL will provide a reliable archiving facility that is ready >> to use. > > +1. I think we should have a way to set an archive DIRECTORY, rather > than an archive command. And if you set it, then PostgreSQL should > just do all of that stuff correctly, without any help from the user. > Of course, some users will want to archive to a remote machine via ssh > or rsync or what-have-you, and those users will need to provide their > own tools. But it's got to be pretty common to archive to a local > path that happens to be a remote mount, or to a local directory whose > contents are subsequently copied off by a batch job. Making that work > nicely with near-zero configuration would be a significant advance. And there's another example why we need an archive command: > I'm just setting up pgpool replication on Amazon AWS. > I'm sending WAL archives to an S3 bucket, which doesn't appear as a > directory on the server. From: http://www.pgpool.net/pipermail/pgpool-general/2013-June/001851.html -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
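That pgpool case also shows why a plain "archive directory" setting can't cover everything: the target isn't a filesystem at all. A minimal sketch of an archive_command for such a setup, assuming the s3cmd client is installed and configured and using a made-up bucket name:

#!/usr/bin/env bash
# Hypothetical wal_to_s3 script, configured as:
#   archive_command = 'wal_to_s3 "%p" "%f"'
# The bucket name and key prefix are illustrative only.
SOURCE="$1"    # %p - path to the WAL segment on the server
FILE="$2"      # %f - WAL segment file name

# s3cmd exits non-zero on failure, so PostgreSQL will keep the segment
# and retry the archive later.
exec s3cmd put "${SOURCE}" "s3://my-wal-archive/wal/${FILE}"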
<p dir="ltr"><br /> On Jun 12, 2013 4:56 PM, "Robert Haas" <<a href="mailto:robertmhaas@gmail.com">robertmhaas@gmail.com</a>>wrote:<br /> ><br /> > On Sat, Jun 8, 2013 at 10:36AM, MauMau <<a href="mailto:maumau307@gmail.com">maumau307@gmail.com</a>> wrote:<br /> > > Yes, I feel designingreliable archiving, even for the simplest case - copy<br /> > > WAL to disk, is very difficult. I know thereare following three problems<br /> > > if you just follow the PostgreSQL manual. Average users won't notice them.<br/> > > I guess even professional DBAs migrating from other DBMSs won't, either.<br /> > ><br /> >> 1. If the machine or postgres crashes while archive_command is copying a WAL<br /> > > file, later archiverecovery fails.<br /> > > This is because cp leaves a file of less than 16MB in archive area, and<br /> >> postgres refuses to start when it finds such a small archive WAL file.<br /> > > The solution, which IIRCTomas san told me here, is to do like "cp %p<br /> > > /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f".<br/> > ><br /> > > 2. archive_command dumps core when you run pg_ctl stop -mi.<br /> > >This is because postmaster sends SIGQUIT to all its descendants. The core<br /> > > files accumulate in the datadirectory, which will be backed up with the<br /> > > database. Of course those core files are garbage.<br />> > archive_command script needs to catch SIGQUIT and exit.<br /> > ><br /> > > 3. You cannot know thereason of archive_command failure (e.g. archive area<br /> > > full) if you don't use PostgreSQL's server logging.<br/> > > This is because archive_command failure is not logged in syslog/eventlog.<br /> > ><br /> >><br /> > > I hope PostgreSQL will provide a reliable archiving facility that is ready<br /> > > to use.<br/> ><br /> > +1. I think we should have a way to set an archive DIRECTORY, rather<br /> > than an archivecommand. And if you set it, then PostgreSQL should<br /> > just do all of that stuff correctly, without any helpfrom the user.<p dir="ltr">Wouldn't that encourage people to do local archiving, which is almost always a bad idea? <pdir="ltr">I'd rather improve the experience with pg_receivexlog or another way that does remote archiving... <br /><p dir="ltr">>Of course, some users will want to archive to a remote machine via ssh<br /> > or rsync or what-have-you,and those users will need to provide their<br /> > own tools. But it's got to be pretty common to archiveto a local<br /> > path that happens to be a remote mount, or to a local directory whose<br /> > contents aresubsequently copied off by a batch job. Making that work<br /> > nicely with near-zero configuration would be a significantadvance.<p dir="ltr">I guess archiving to a nfs mount or so isn't too bad, but archiving and using a cronjob toget the files off is typically a great way to loose data, and we really shouldn't encourage that by default, Imo. <p dir="ltr">/Magnus
On Sat, Jun 8, 2013 at 7:20 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > If archiving is on and failure is due to no space, could we just keep trying > XLogFileInit again for a couple minutes to give archiving a chance to do its > thing? Doing that while holding onto locks and a critical section would be > unfortunate, but if the alternative is a PANIC, it might be acceptable. Blech. I think that's setting our standards pretty low. It would neither be possible to use the system nor to shut it down cleanly; I think the effect would be to turn an immediate PANIC into a slightly-delayed PANIC, possibly accompanied by some DBA panic.

It seems to me that there are two general ways of approaching this problem.

1. Discover sooner that we're out of space. Once we've modified the buffer and entered the critical section, it's too late to have second thoughts about completing the operation. If we could guarantee prior to modifying the buffers that enough WAL space was present to store the record we're about to write, then we'd be certain not to fail for this reason. In theory, this is simple: keep track of how much uncommitted WAL space we have. Increment the value when we create new WAL segments and decrement it by the size of the WAL record we plan to write. In practice, it's not so simple. We don't know whether we're going to emit FPIs until after we enter the critical section, so the size of the record can't be known precisely early enough. We could think about estimating the space needed conservatively and truing it up occasionally. However, there's a second problem: the available-WAL-space counter would surely become a contention point.

Here's a sketch of a possible solution. Suppose we know that an individual WAL record can't be larger than, uh, 64kB. I'm not sure there is a limit on the size of a WAL record, but let's say there is, or we can install one, at around that size limit. Before we enter a critical section that's going to write a WAL record, we verify that the amount of WAL space remaining is at least 64kB * MaxBackends. If it's not, we embark on a series of short sleeps, rechecking after each one; if we hit some time limit, we ERROR out. As long as every backend checks this before every WAL record, we can always be sure there will be at least 64kB left for us. With this approach, the shared variable that stores the amount of WAL space remaining only needs to be updated under WALInsertLock; the "reservation" step only involves a read. That might be cheap enough not to matter.

2. Recover from the fact that we ran out of space by backing out the changes to shared buffers. Initially, I thought this might be a promising approach: if we've modified any shared buffers and discover that we can't log the changes, just invalidate the buffers! Of course, it doesn't work, because the buffer might have already been dirty when we locked it. So we'd actually need a way to reverse out all the changes we were about to log. That's probably too expensive to contemplate, in general; and the code would likely get so little testing as to invite bugs. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 12, 2013 at 11:32 AM, Magnus Hagander <magnus@hagander.net> wrote: > Wouldn't that encourage people to do local archiving, which is almost always > a bad idea? Maybe, but refusing to improve the UI because people might then use the feature seems wrong-headed. > I'd rather improve the experience with pg_receivexlog or another way that > does remote archiving... Sure, remote archiving is great, and I'm glad you've been working on it. In general, I think that's a cleaner approach, but there are still enough people using archive_command that we can't throw them under the bus. > I guess archiving to a nfs mount or so isn't too bad, but archiving and > using a cronjob to get the files off is typically a great way to lose data, > and we really shouldn't encourage that by default, Imo. Well, I think what we're encouraging right now is for people to do it wrong. The proliferation of complex tools to manage this process suggests that it is not easy to manage without a complex tool. That's a problem. And we regularly have users who discover, under a variety of circumstances, that they've been doing it wrong. If there's a better solution than hard-wiring some smarts about local directories, I'm all ears - but making the simple case just work would still be better than doing nothing. Right now you have to be a rocket scientist no matter what configuration you're running. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 6/12/13 10:55 AM, Robert Haas wrote: > But it's got to be pretty common to archive to a local > path that happens to be a remote mount, or to a local directory whose > contents are subsequently copied off by a batch job. Making that work > nicely with near-zero configuration would be a significant advance. Doesn't that just move the problem to managing NFS or batch jobs? Do we want to encourage that? I suspect that there are actually only about 5 or 6 common ways to do archiving (say, local, NFS, scp, rsync, S3, ...). There's no reason why we can't fully specify and/or script what to do in each of these cases.
On Wed, Jun 12, 2013 at 12:07 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On 6/12/13 10:55 AM, Robert Haas wrote: >> But it's got to be pretty common to archive to a local >> path that happens to be a remote mount, or to a local directory whose >> contents are subsequently copied off by a batch job. Making that work >> nicely with near-zero configuration would be a significant advance. > > Doesn't that just move the problem to managing NFS or batch jobs? Do we > want to encourage that? > > I suspect that there are actually only about 5 or 6 common ways to do > archiving (say, local, NFS, scp, rsync, S3, ...). There's no reason why > we can't fully specify and/or script what to do in each of these cases. Go for it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 06/12/2013 08:49 AM, Robert Haas wrote: > Sure, remote archiving is great, and I'm glad you've been working on > it. In general, I think that's a cleaner approach, but there are > still enough people using archive_command that we can't throw them > under the bus. Correct. > >> I guess archiving to a nfs mount or so isn't too bad, but archiving and >> using a cronjob to get the files off is typically a great way to lose data, >> and we really shouldn't encourage that by default, Imo. > Certainly not by default, but it is also something that can be easy to set up reliably if you know what you are doing. > Well, I think what we're encouraging right now is for people to do it > wrong. The proliferation of complex tools to manage this process > suggests that it is not easy to manage without a complex tool. No. It suggests that people have more than one requirement that the project WILL NEVER be able to solve. Granted we have solved some of them, for example pg_basebackup. However, pg_basebackup isn't really useful for a large database. Multithreaded rsync is much more efficient. > That's > a problem. And we regularly have users who discover, under a variety > of circumstances, that they've been doing it wrong. If there's a > better solution than hard-wiring some smarts about local directories, > I'm all ears - but making the simple case just work would still be > better than doing nothing. Agreed. > Right now you have to be a rocket > scientist no matter what configuration you're running. This is quite a bit overblown. Assuming your needs are simple, archiving, as it is now, is a relatively simple process to set up, even without something like PITRTools. Where we run into trouble is when they aren't, and that is ok because we can't solve every problem. We can only provide tools for others to solve their particular issue. What concerns me is we seem to be trying to make this "easy". It isn't supposed to be easy. This is hard stuff. Smart people built it and it takes a smart person to run it. When did it become a bad thing to be something that smart people need to run? Yes, we need to make it reliable. We don't need to be the Nanny database. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
On Wed, Jun 12, 2013 at 6:03 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > >> Right now you have to be a rocket >> scientist no matter what configuration you're running. > > > This is quite a bit overblown. Assuming your needs are simple, archiving, > as it is now, is a relatively simple process to set up, even without something > like PITRTools. Where we run into trouble is when they aren't and that is > ok because we can't solve every problem. We can only provide tools for > others to solve their particular issue. > > What concerns me is we seem to be trying to make this "easy". It isn't > supposed to be easy. This is hard stuff. Smart people built it and it takes > a smart person to run it. When did it become a bad thing to be something > that smart people need to run? > > Yes, we need to make it reliable. We don't need to be the Nanny database. More than easy, it should be obvious. Obvious doesn't mean easy, it just means what you have to do to get it right is clearly in front of you. When you give people the freedom of an "archive command", you also take away any guidance more restrictive options give. I think the point here is that a default would guide people in how to make this work reliably, without having to rediscover it every time. A good, *obvious* (not easy) default. Even "cp blah to NFS mount" is obvious, while not easy (setting up NFS through firewalls is never easy). So, having archive utilities in place of cp would ease the burden of administration, because it'd be based on collective knowledge. Some "pg_cp" (or more likely "pg_archive_wal") could check there's enough space, and whatever else collective knowledge decides is necessary.
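To make that concrete, here is a rough sketch of such a wrapper; the name pg_archive_wal, the paths and the 1 GB threshold are all illustrative assumptions rather than a concrete proposal:

#!/usr/bin/env bash
# Hypothetical pg_archive_wal: refuse to archive when the target
# filesystem is nearly full, so the failure is visible and retried
# rather than silently filling the disk. Call as: pg_archive_wal "%p" "%f"
ARCHIVEDIR="/archive/dir"
MIN_FREE_KB=$((1024 * 1024))         # require at least ~1 GB free

SOURCE="$1"
FILE="$2"

# df -P prints the available space (in 1K blocks) in column 4
FREE_KB=$(df -P "${ARCHIVEDIR}" | awk 'NR==2 {print $4}')
if [ "${FREE_KB}" -lt "${MIN_FREE_KB}" ]; then
    echo "pg_archive_wal: only ${FREE_KB}kB free in ${ARCHIVEDIR}" >&2
    exit 1    # non-zero: PostgreSQL keeps the segment and retries later
fi

# Copy via a temporary file, then rename, to avoid partial segments.
cp "${SOURCE}" "${ARCHIVEDIR}/${FILE}.tmp" && \
    mv "${ARCHIVEDIR}/${FILE}.tmp" "${ARCHIVEDIR}/${FILE}"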
On 06/12/2013 02:03 PM, Joshua D. Drake wrote: > What concerns me is we seem to be trying to make this "easy". It isn't > supposed to be easy. This is hard stuff. Smart people built it and it > takes a smart person to run it. When did it become a bad thing to be > something that smart people need to run? 1997, last I checked. Our unofficial motto: "PostgreSQL: making very hard things possible, and simple things hard." It *is* hard. But that's because we've *made* it hard to understand and manage, not because the problem is inherently hard. For example: can you explain to me in 10 words or less how to monitor to see if archiving is falling behind? I'll bet you can't, and that's because we've provided no reliable way to do so. It's normal when you're developing features for the ability to utilize them to go from hacker --> high-end user --> regular user. We suck at moving to that last stage, partly because whenever someone on this list introduces the idea of making a feature not just great but easy to use, people actually object to the idea that anything should be easy to use. It's like we're afraid of being polluted by the unwashed DevOps masses. In the meantime, Mongo kicks our butts at new user adoption. Why? Their features suck, but the features they do have are easy to use. You'd think we would have learned something from MySQL. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
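For what it's worth, the closest thing to an answer today lives outside Postgres: segments waiting to be archived show up as .ready files under pg_xlog/archive_status, so an external check can count them. A rough sketch, assuming the check has read access to the data directory and using made-up paths and thresholds:

#!/usr/bin/env bash
# Rough archiving-lag check: count WAL segments the archiver has not
# yet shipped. PGDATA and the threshold are illustrative only.
PGDATA="/var/lib/pgsql/data"
WARN_SEGMENTS=32                 # ~512MB of backlog at 16MB per segment

READY=$(ls "${PGDATA}/pg_xlog/archive_status/"*.ready 2>/dev/null | wc -l)
if [ "${READY}" -gt "${WARN_SEGMENTS}" ]; then
    echo "WARNING: ${READY} WAL segments waiting to be archived"
    exit 1
fi
echo "OK: ${READY} WAL segments waiting to be archived"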
On 06/14/2013 11:16 AM, Josh Berkus wrote: > > On 06/12/2013 02:03 PM, Joshua D. Drake wrote: >> What concerns me is we seem to be trying to make this "easy". It isn't >> supposed to be easy. This is hard stuff. Smart people built it and it >> takes a smart person to run it. When did it become a bad thing to be >> something that smart people need to run? > > 1997, last I checked. > > Our unofficial motto: "PostgreSQL: making very hard things possible, and > simple things hard." > > It *is* hard. But that's because we've *made* it hard to understand and > manage, not because the problem is inherently hard. For example: can > you explain to me in 10 words or less how to monitor to see if archiving > is falling behind? I'll bet you can't, and that's because we've > provided no reliable way to do so. Hey, I never said we shouldn't have a complete feature set. I agree with you. IMO it should not have even been committed without the ability to actually know what is going on, and we have had it since (in theory) 8.1? My primary concern is: Don't make it stupid. I liked Claudio's comment, "More than easy, it should be obvious.". It should be obvious from a review of the documentation how to manage this stuff. It isn't, and worse, even if we wrote the documentation it still wouldn't be, because the feature is not complete. With great power comes great responsibility.... :P JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
<div class="moz-cite-prefix">On 06/15/2013 02:16 AM, Josh Berkus wrote:<br /></div><blockquote cite="mid:51BB5DE2.7080506@agliodbs.com"type="cite"><pre wrap="">On 06/12/2013 02:03 PM, Joshua D. Drake wrote: </pre><blockquote type="cite"><pre wrap="">What concerns me is we seem to be trying to make this "easy". It isn't supposed to be easy. This is hard stuff. Smart people built it and it takes a smart person to run it. When did it become a bad thing to be something that smart people need to run? </pre></blockquote><pre wrap=""> 1997, last I checked. Our unofficial motto: "PostgreSQL: making very hard things possible, and simple things hard." It *is* hard. But that's because we've *made* it hard to understand and manage, not because the problem is inherently hard.</pre></blockquote><br /> I have to agree with all this... Pg has someof the best docs around, a really nice SQL level interface, and some truly shocking usability outside that nice zone.<br/><br /> Once a user steps into the "admin zone" they're confronted with a lot of settings they'll really struggleto understand and manage.<br /><br /> I don't want this to be used as an argument not to commit early stages of work,though. I think iterative development with exposure to real-world testing and experience is necessary when you're gettingto the complexity of things that are now going in to Pg. It's more that "commited" != "done"; right now, once itsusable at that power-user stage further management and improvement gets farmed out to external tools and the usabilityof the core feature stays rather ... rough.<br /><br /> Some examples:<br /><br /> fsync=off<br /> ------------<br/><br /> We have a giant foot-cannon in the config files, "fsync" with the "off" option neatly documentedalongside all the others. No note saying "setting fsync=off is equivalent to setting yes_you_can_eat_my_data=on".No WARNING in the logs, not that a user who'd set that without understanding it would look atthe logs. The fsync section of <a href="http://www.postgresql.org/docs/9.2/static/runtime-config-wal.html">http://www.postgresql.org/docs/current/static/runtime-config-wal.html</a> isok, though it could do with a more prominent warning... but the user needs to know where to look. I've repeatedly askedto change this - usually after yet another user comes onto -general with data loss due to not understanding fsync=off- and haven't been able to get agreement on even a config file comment. <br /><br /> Proposed fix 9.3, config filecomment saying "Warning, fsync=off may cause data loss, see the user manual."<br /><br /> Proposed fix 9.4+: Remove fsync=offfrom docs. Leave the GUC enum there but have the postmaster FATAL when it sees it with a message saying "fsync=offhas been replaced with unsafe_writes=on, please change your postgresql.conf". Add the corresponding new GUC.<br/><br /> max_connections<br /> ------------------------<br /><br /> max_connections is another one. I see systemswith max_connections=3000 in the wild... performing terribly, as you'd expect. Yet there's no indication (even inthe docs) that this is often a terrible idea, and that you should really look into a connection pooler if you're goingabove a few hundred (hardware/workload dependent). 
<a href="http://www.postgresql.org/docs/9.1/static/runtime-config-connection.html">http://www.postgresql.org/docs/current/static/runtime-config-connection.html</a> doesn'tmention it, there's no config file comment, etc.<br /><br /> Proposed fix: Comment in the config file saying somethinglike "See the documentation before raising this above a few hundred". In the docs, a note about the perf impactof high max_connections with a brief mention of external connection pooling and links to pgbouncer/pgpool-II, mentionthat many app frameworks have built-in connection pools. Brief comment about there being an optimum workload-and-hardwaredependent level of concurrency above which performance degrades. I'll be happy to write a draft patchfor this if there's agreement on the idea.<br /><br /> vacuum/autovacuum<br /> ---------------------------<br /><br/> autovaccum tuning. We've just had this conversation and there seems to be agreement that it needs some love, butunlike the above two there's no easy fix and it's an ongoing process. I don't have any right to complain about it unlessI do more to help fix it.<br /><br /> Bloat<br /> ------<br /><br /> Table bloat. Table bloat has been a major issuewith PostgreSQL users/admins for years. Anyone care to explain to me in a simple paragraph how to find out if you havetable or index bloat issues in your database and what to do about it? (Maybe we need "pg_catalog.pg_index_bloat" and"pg_catalog.pg_table_bloat" views including FILLFACTOR correction?)<br /><br /> I think I'll draft up a patch to add exactlythat.<br /><br /> Dump/restore and globals<br /> ----------------------------------<br /><br /> Dump and restore.The standard advice I give is to do a "pg_dumpall --globals-only" followed by a "pg_dump -Fc" of each database, sincewe don't have "pg_dumpfall -Fc". Users often seem to do single-DB dumps then find themselves having trouble restoringthem due to missing user accounts, etc. Or they do a pg_dumpall then want to restore just one DB/table.<br /><br/> There's also a lot of confusion around restoring dumps due to the different formats. This has improved now that pg_restoretells the user to restore a SQL dump using psql:<br /><br /> $ pg_restore regress.sql <br /> pg_restore: [archiver]input file appears to be a text format dump. Please use psql.<br /><br /> ... though psql still chokes horriblyon a pg_dump -Fc file:<br /><br /> psql:regress.out:1: ERROR: syntax error at or near "PGDMP"<br /> LINE 1: PGDMP^A^L^A^A^A^A^A^A^A^AREVOKEALL ON SCHEMA public FROM postgres;<br /> ^<br /> psql:regress.out:2: WARNING: noprivileges were granted for "public"<br /> GRANT<br /><br /><br /> Proposed fix: Should we have a pg_dumpall that producesper-database -Fc or -Fd output? Or perhaps --include-roles / --include-tablespaces options to pg_dump that stashesa pg_dumpall --globals-only inside the -Fc archive?<br /><br /> Proposed fix : If psql sees the pseudo-command PGDMPit should quit immediately with an error saying "This is a PostgreSQL custom format dump file and must be restored withthe pg_restore command". (Happy to try to come up with a patch for this).<br /><br /> Proposed fix: Instead of just tellingthe user to run psql, pg_restore should, if there are no selective restore options, propose a psql command. Or evenjust invoke psql, though I'm hesitant about that because of issues where the psql on the PATH is one version and theuser runs /path/to/pg_restore for another version. 
Or, if we built a full path using $0, the case where pg_restore isbeing run from within the source tree so there's no psql in the same directory.<br /><br /> pg_hba.conf<br /> ----------------<br/><br /> The #1 question I see on Stack Overflow has to be confusion about pg_hba.conf, mostly from peoplewho have no idea it exists, don't understand how to configure it, etc. They can't tell the difference between peer/ident/trust,don't understand that if you set 'md5' but don't set a password then the password will always be wrong,etc.<br /><br /> I list this last because I think Pg's client authentication is well documented, it lacks obvious foot-guns,and it's really not that hard. I have little sympathy for people who respond to a docs link with "I don't havetime for that, I'm not a DBA, can you just tell me what I need to change?". Authentication and authorization isn't simple,and attempts to make it too simple usually also make it wrong. At some point I want to think about how to make iteasier to manage Pg's auth, but I know there are some big complexities around it because there's no DB available at thetime pg_hba.conf checking is done on an incoming connection so not even a shared/global table may be read from. <br /><br/> Proposed fix: None required at this time.<br /><br /><pre class="moz-signature" cols="72">-- Craig Ringer <a class="moz-txt-link-freetext" href="http://www.2ndQuadrant.com/">http://www.2ndQuadrant.com/</a>PostgreSQLDevelopment, 24x7 Support, Training & Services</pre>
On 15 June 2013 14:43, Craig Ringer <craig@2ndquadrant.com> wrote: > The #1 question I see on Stack Overflow has to be confusion about > pg_hba.conf, mostly from people who have no idea it exists, don't understand > how to configure it, etc. The totally non-obvious name of the file probably has something to do with that. It should be called 'auth.conf'. Cheers, BJ
On 06/15/2013 02:08 PM, Brendan Jurd wrote: > On 15 June 2013 14:43, Craig Ringer <craig@2ndquadrant.com> wrote: >> The #1 question I see on Stack Overflow has to be confusion about >> pg_hba.conf, mostly from people who have no idea it exists, don't understand >> how to configure it, etc. > The totally non-obvious name of the file probably has something to do > with that. It should be called 'auth.conf'. Not convinced; since it only controls one facet of auth - it doesn't define users, passwords, grants, etc - that'd probably be just as confusing. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 15 June 2013 16:18, Craig Ringer <craig@2ndquadrant.com> wrote: > On 06/15/2013 02:08 PM, Brendan Jurd wrote: >> On 15 June 2013 14:43, Craig Ringer <craig@2ndquadrant.com> wrote: >>> The #1 question I see on Stack Overflow has to be confusion about >>> pg_hba.conf, mostly from people who have no idea it exists, don't understand >>> how to configure it, etc. >> The totally non-obvious name of the file probably has something to do >> with that. It should be called 'auth.conf'. > Not convinced; since it only controls one facet of auth - it doesn't > define users, passwords, grants, etc ... When somebody is setting up postgres for the first time, and they list the contents of the config directory, you want them to have some idea what each of the files is for. If they see something called 'auth.conf', they'll get the right general idea. An understanding of the nuances (like that it doesn't control user accounts) will come once they open up the file -- which they may well do, because it is called 'auth.conf', and 'auth' is a thing you want to configure. If they see something called 'pg_hba.conf', they may very reasonably assume that it is some internal/advanced stuff that they don't need to worry about just yet, because what the heck is a 'pg_hba'? The 'pg' is unnecessary and the 'hba' is an internal jargon term that we've ill-advisedly allowed to leak out into the filename. If you really feel that 'auth.conf' is too imprecise, maybe something like 'conn-auth.conf' would be more your style. Cheers, BJ
On 2013-06-15 08:44 CEST, Brendan Jurd wrote: > On 15 June 2013 16:18, Craig Ringer ... wrote: >> On 06/15/2013 02:08 PM, Brendan Jurd wrote: >>> On 15 June 2013 14:43, Craig Ringer ... wrote: >>>> The #1 question I see on Stack Overflow has to be confusion about >>>> pg_hba.conf, mostly from people who have no idea it exists, don't understand >>>> how to configure it, etc. >>> The totally non-obvious name of the file probably has something to do >>> with that. It should be called 'auth.conf'. >> Not convinced; since it only controls one facet of auth - it doesn't >> define users, passwords, grants, etc ... > > When somebody is setting up postgres for the first time, and they list > the contents of the config directory, you want them to have some idea as they may not have read up to section 19.1 The pg_hba.conf File inside chapter 19 Client Authentication of part III. Server Administration :-?, which states (as of 9.2.4): """ Client authentication is controlled by a configuration file, which traditionally is named pg_hba.conf and is stored in the database cluster's data directory. (HBA stands for host-based authentication.) A default pg_hba.conf file is installed when the data directory is initialized by initdb. It is possible to place the authentication configuration file elsewhere, however; see the hba_file configuration parameter. ... """ ;-) thanks to hyperlinks this is quite close to the start, but I was surprised to not find it by skimming the text and following the hyperlinks but by knowing the filename instead and entering it ("pg_hba.conf") into the Search Documentation text field on the top right corner of http://www.postgresql.org/docs/9.2/interactive/index.html. Maybe we could find a better place of the whatever-then-name inside the part of the docs even the "TL;DR" mood people might read? A paragraph or two spiced up with some catchy StackOverflow-inspired terms people with a need to configure this authentication aspect might have expected could also be expected in INSTALL like docs or directly observable on the hyperlinked way from part I. Tutorial chapter 1 Getting Started section 1.1 Installation all down to chapter 15. Installation from Source Code. But of course only, if this is "wanted behavior". If I read the section 1.1 Installation (again 9.2.4) I have the impression, that it more transports the message in our case, that "you are the site admin, deal with it, read the docs", or don't I read it right? (I am a non-native English reader) > what each of the files is for. If they see something called > 'auth.conf', they'll get the right general idea. An understanding of > the nuances (like that it doesn't control user accounts) will come > once they open up the file -- which they may well do, because it is > called 'auth.conf', and 'auth' is a thing you want to configure. that may well be, I do not know, how people that prefer reading folder and filenames over manuals written for them grok text, as I read the docs, promised ;-) > If they see something called 'pg_hba.conf', they may very reasonably > assume that it is some internal/advanced stuff that they don't need to > worry about just yet, because what the heck is a 'pg_hba'? The 'pg' > is unnecessary and the 'hba' is an internal jargon term that we've > ill-advisedly allowed to leak out into the filename. 
at around 1995 when I started using Postgres95 it sure took some time to find that pg_hba.conf file, but I then perceived it to be very well documented, and also felt a bit guilty, as its name occurred in the INSTALL file cf. ftp://ftp-archives.postgresql.org/pub/source/v7.2/postgresql-7.2.tar.gz and the INSTALL file. Therein "buried" inside Step 1 of "If You Are Upgrading" ... > If you really feel that 'auth.conf' is too imprecise, maybe something > like 'conn-auth.conf' would be more your style. I think you guys did and still do a fantastic job with PostgreSQL and esp. its documentation, but in this case I doubt that any renaming of config files will really have an impact on usability in the shady area of "TL;DR" - at least for the next twenty years or so - as it still holds that from a false start (e.g. not reading the documentation written) anything may follow. But as usability is a practical concern I (as a user) would be +0 on renaming it, if people not finding it under the old name but then editing it is really the wanted behavior. All the best, Stefan.
On 06/14/2013 11:18 PM, Craig Ringer wrote: > > On 06/15/2013 02:08 PM, Brendan Jurd wrote: >> On 15 June 2013 14:43, Craig Ringer <craig@2ndquadrant.com> wrote: >>> The #1 question I see on Stack Overflow has to be confusion about >>> pg_hba.conf, mostly from people who have no idea it exists, don't understand >>> how to configure it, etc. >> The totally non-obvious name of the file probably has something to do >> with that. It should be called 'auth.conf'. > Not convinced; since it only controls one facet of auth - it doesn't > define users, passwords, grants, etc - that'd probably be just as confusing. > Yeah, this one is not making the grade. pg_hba is just that, host-based auth, but I think we are bikeshedding now. JD
On 06/14/2013 11:44 PM, Brendan Jurd wrote: > If they see something called 'pg_hba.conf', they may very reasonably > assume that it is some internal/advanced stuff that they don't need to > worry about just yet, because what the heck is a 'pg_hba'? The 'pg' Only the uneducated. Look, I am not trying to be an ass but seriously. Read the docs. I will argue vigorously against the idea of us designing a system that has people NOT reading the docs. JD
On 06/15/2013 03:53 PM, Joshua D. Drake wrote: > > Yeah this one is not making the grade. pg_hba is just that host based > auth but I think we are bikeshedding now. Agreed... Even as I posted, I realised I shouldn't have mentioned the last point, since everything else has been ignored. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Jun 15, 2013 at 12:43:10PM +0800, Craig Ringer wrote: > Bloat > ------ > > Table bloat. Table bloat has been a major issue with PostgreSQL > users/admins for years. Anyone care to explain to me in a simple > paragraph how to find out if you have table or index bloat issues in > your database and what to do about it? (Maybe we need > "pg_catalog.pg_index_bloat" and "pg_catalog.pg_table_bloat" views > including FILLFACTOR correction?) > > I think I'll draft up a patch to add exactly that. Nice list btw. I monitor this by using the excellent check_postgres nagios plugin, which has stuff to check for things like this. Which makes me think that it might be possible to add some other checks like this, in for example pg_ctl. A big fat warning 'your data may be eaten' might get noticed at startup. (A minor annoyance is that in recent versions of PostgreSQL you have to give check_postgres admin rights, otherwise it can't warn you about "idle in transaction" problems.) Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
On 06/15/2013 05:57 PM, Martijn van Oosterhout wrote: > On Sat, Jun 15, 2013 at 12:43:10PM +0800, Craig Ringer wrote: >> Bloat ------ >> >> Table bloat. Table bloat has been a major issue with PostgreSQL >> users/admins for years. Anyone care to explain to me in a simple >> paragraph how to find out if you have table or index bloat issues >> in your database and what to do about it? (Maybe we need >> "pg_catalog.pg_index_bloat" and "pg_catalog.pg_table_bloat" >> views including FILLFACTOR correction?) >> >> I think I'll draft up a patch to add exactly that. > > Nice list btw. I monitor this by using the excellent check_progres > nagios plugin, which has stuff to check for things like this. It's been a vital tool for me too. It probably serves as a reasonable guide for some things we could really usefully expose in system views. Once in core we can document them in the main user manual, making them reasonably discoverable. Argh. I think my TODO has some kind of horrible disease, it keeps growing uncontrollably. > Which makes me think that it might be possible to add some other > checks like this, in for example pg_ctl. A big fat warning 'your > data may be eaten' might get noticed at startup. The users who have this kind of issue aren't the ones running pg_ctl. They'll usually be using launchd, systemctl, upstart, sysv init scripts, etc ... whatever, something that sends the warning straight to the system logs that they likely never read. I don't have tons of sympathy for these people, but I do think making fsync=off so easy to set without understanding it is kind of like handing a grenade to a toddler. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
> fsync=off synchronous_commit=off replaced fsync=off in almost every use case where fsync=off might have been useful. The only remaining use case is for the initial build of a database. In that case what the user really wants is to turn off WAL logging entirely though. Having a WAL log and not fsyncing it is kind of pointless. I guess it lets you replicate the database but it doesn't let you use the WAL log for recovery locally. > Bloat > ------ > > Table bloat. Table bloat has been a major issue with PostgreSQL users/admins > for years. Anyone care to explain to me in a simple paragraph how to find > out if you have table or index bloat issues in your database and what to do > about it? (Maybe we need "pg_catalog.pg_index_bloat" and > "pg_catalog.pg_table_bloat" views including FILLFACTOR correction?) A nice view that exposes a summary of information from the fsm per table would be pretty handy. In general there's a lot of data tied up in things like the fsm that could be usefully exposed to users. -- greg
Peter Eisentraut <peter_e@gmx.net> writes: > I suspect that there are actually only about 5 or 6 common ways to do > archiving (say, local, NFS, scp, rsync, S3, ...). There's no reason why > we can't fully specify and/or script what to do in each of these cases. And provide either fully reliable contrib scripts or "internal" archive commands ready to use for those common cases. I can't think of other common use cases, by the way. +1 Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Mon, Jun 10, 2013 at 07:28:24AM +0800, Craig Ringer wrote: > (I'm still learning the details of Pg's WAL, WAL replay and recovery, so > the below's just my understanding): > > The problem is that WAL for all tablespaces is mixed together in the > archives. If you lose your tablespace then you have to keep *all* WAL > around and replay *all* of it again when the tablespace comes back > online. This would be very inefficient, would require a lot of tricks to > cope with applying WAL to a database that has an on-disk state in the > future as far as the archives are concerned. It's not as simple as just > replaying all WAL all over again - as I understand it, things like > CLUSTER or TRUNCATE will result in relfilenodes not being where they're > expected to be as far as old WAL archives are concerned. Selective > replay would be required, and that leaves the door open to all sorts of > new and exciting bugs in areas that'd hardly ever get tested. > > To solve the massive disk space explosion problem I imagine we'd have to > have per-tablespace WAL. That'd cause a *huge* increase in fsync costs > and loss of the rather nice property that WAL writes are nice sequential > writes. It'd be complicated and probably cause nightmares during > recovery, for archive-based replication, etc. > > The only other thing I can think of is: When a tablespace is offline, > write WAL records to a separate "tablespace recovery log" as they're > encountered. Replay this log when the tablespace is restored, > before applying any other new WAL to the tablespace. This wouldn't > affect archive-based recovery since it'd already have the records from > the original WAL. > > None of these options seem exactly simple or pretty, especially given > the additional complexities that'd be involved in allowing WAL records > to be applied out-of-order, something that AFAIK _never_ happens at the > moment. > > The key problem, of course, is that this all sounds like a lot of > complicated work for a case that's not really supposed to happen. Right > now, the answer is "your database is unrecoverable, switch to your > streaming warm standby and re-seed it from the standby". Not pretty, but > at least there's the option of using a sync standby and avoiding data loss. > > How would you approach this? Sorry to be replying late. You are right that you could record/apply WAL separately for offline tablespaces. The problem is that you could have logical ties from the offline tablespace to online tablespaces. For example, what happens if data in an online tablespace references a primary key in an offline tablespace? What if the system catalogs are stored in an offline tablespace? Right now, we allow logical bindings across physical tablespaces. To do what you want, you would really need to store each database in its own tablespace. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 6 June 2013 16:00, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > In the "Redesigning checkpoint_segments" thread, many people opined that > there should be a hard limit on the amount of disk space used for WAL: > http://www.postgresql.org/message-id/CA+TgmoaOkgZb5YsmQeMg8ZVqWMtR=6S4-PPd+6jiy4OQ78ihUA@mail.gmail.com. > I'm starting a new thread on that, because that's mostly orthogonal to > redesigning checkpoint_segments. > > The current situation is that if you run out of disk space while writing > WAL, you get a PANIC, and the server shuts down. That's awful. We can try to > avoid that by checkpointing early enough, so that we can remove old WAL > segments to make room for new ones before you run out, but unless we somehow > throttle or stop new WAL insertions, it's always going to be possible to use > up all disk space. A typical scenario where that happens is when > archive_command fails for some reason; even a checkpoint can't remove old, > unarchived segments in that case. But it can happen even without WAL > archiving. I don't see that we need to prevent WAL insertions when the disk fills. We still have the whole of wal_buffers to use up. When that is full, we will prevent further WAL insertions because we will be holding WALWriteLock to clear more space. So the rest of the system will lock up nicely, like we want, apart from read-only transactions. Instead of PANICing, we should simply signal the checkpointer to perform a shutdown checkpoint. That normally requires a WAL insertion to complete, but it seems easy enough to make that happen by simply rewriting the control file, after which ALL WAL files are superfluous for crash recovery and can be deleted. Once that checkpoint is complete, we can begin deleting WAL files that are archived/replicated and continue as normal. The previously failing WAL write can now be made again and may succeed this time - if it does, we continue, if not - now we PANIC. Note that this would not require in-progress transactions to be aborted. They can continue normally once wal_buffers re-opens. We don't really want anything too drastic, because if this situation happens once it may happen many times - I'm imagining a flaky network etc. So we want the situation to recover quickly and easily, without too many consequences. The above appears to be a very minimal change from existing code and doesn't introduce lots of new points of breakage. > I've seen a case, where it was even worse than a PANIC and shutdown. pg_xlog > was on a separate partition that had nothing else on it. The partition > filled up, and the system shut down with a PANIC. Because there was no space > left, it could not even write the checkpoint after recovery, and thus > refused to start up again. There was nothing else on the partition that you > could delete to make space. The only recourse would've been to add more disk > space to the partition (impossible), or manually delete an old WAL file that > was not needed to recover from the latest checkpoint (scary). Fortunately > this was a test system, so we just deleted everything. Doing shutdown checkpoints via the control file would exactly solve that issue. We already depend upon the readability of the control file anyway, so this changes nothing. (And if you think it does, then we can have multiple control files, or at least a backup control file at shutdown). We can make the shutdown checkpoint happen always at EOF of a WAL segment, so at shutdown we don't need any WAL files to remain at all.
> So we need to somehow stop new WAL insertions from happening, before it's > too late. I don't think we do. What might be sensible is to have checkpoints speed up as WAL volume approaches a predefined limit, so that we minimise the delay caused when wal_buffers locks up. Not suggesting anything here for 9.4, since we're midCF. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
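For illustration, a rough sketch of how "checkpoints speed up as WAL volume approaches a predefined limit" might be computed; max_wal_space and this function are hypothetical, and completion_target stands in for the existing checkpoint_completion_target setting:

    /*
     * Sketch only: shrink the checkpoint spreading target as WAL usage
     * approaches a configured ceiling, so segments get recycled sooner.
     */
    static double
    effective_completion_target(double completion_target,   /* checkpoint_completion_target GUC */
                                uint64 wal_bytes_used,
                                uint64 max_wal_space)        /* hypothetical limit */
    {
        double      used = (double) wal_bytes_used / (double) max_wal_space;

        if (used >= 0.9)
            return 0.0;             /* nearly full: finish the checkpoint as fast as possible */

        return completion_target * (1.0 - used);    /* spread less as WAL grows */
    }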
Simon Riggs <simon@2ndQuadrant.com> writes: > On 6 June 2013 16:00, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> The current situation is that if you run out of disk space while writing >> WAL, you get a PANIC, and the server shuts down. That's awful. > I don't see we need to prevent WAL insertions when the disk fills. We > still have the whole of wal_buffers to use up. When that is full, we > will prevent further WAL insertions because we will be holding the > WALwritelock to clear more space. So the rest of the system will lock > up nicely, like we want, apart from read-only transactions. I'm not sure that "all writing transactions lock up hard" is really so much better than the current behavior. My preference would be that we simply start failing writes with ERRORs rather than PANICs. I'm not real sure ATM why this has to be a PANIC condition. Probably the cause is that it's being done inside a critical section, but could we move that? > Instead of PANICing, we should simply signal the checkpointer to > perform a shutdown checkpoint. And if that fails for lack of disk space? In any case, what you're proposing sounds like a lot of new complication in a code path that is necessarily never going to be terribly well tested. regards, tom lane
On 21 January 2014 18:35, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On 6 June 2013 16:00, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >>> The current situation is that if you run out of disk space while writing >>> WAL, you get a PANIC, and the server shuts down. That's awful. > >> I don't see we need to prevent WAL insertions when the disk fills. We >> still have the whole of wal_buffers to use up. When that is full, we >> will prevent further WAL insertions because we will be holding the >> WALwritelock to clear more space. So the rest of the system will lock >> up nicely, like we want, apart from read-only transactions. > > I'm not sure that "all writing transactions lock up hard" is really so > much better than the current behavior. Lock up momentarily, until the situation clears. But my proposal would allow the situation to fully clear, i.e. all WAL files could be deleted as soon as replication/archiving has caught up. The current behaviour doesn't automatically correct itself as this proposal would. My proposal also remains fully safe in combination with synchronous replication, and has zero performance overhead for mainline processing. > My preference would be that we simply start failing writes with ERRORs > rather than PANICs. Yes, that is what I am proposing, amongst other points. > I'm not real sure ATM why this has to be a PANIC > condition. Probably the cause is that it's being done inside a critical > section, but could we move that? Yes, I think so. >> Instead of PANICing, we should simply signal the checkpointer to >> perform a shutdown checkpoint. > > And if that fails for lack of disk space? I proposed a way to ensure it wouldn't fail for that, at least on pg_xlog space. > In any case, what you're > proposing sounds like a lot of new complication in a code path that > is necessarily never going to be terribly well tested. It's the smallest amount of change proposed so far... I agree on the danger of untested code. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Fwiw I think "all transactions lock up until space appears" is *much* better than PANICing. Often disks fill up due to other transient storage use, or people may have options to manually increase the amount of space. It's much better if the database just continues to function after that rather than needing to be restarted.
Greg Stark <stark@mit.edu> writes: > Fwiw I think "all transactions lock up until space appears" is *much* > better than PANICing. Often disks fill up due to other transient > storage or people may have options to manually increase the amount of > space. it's much better if the database just continues to function > after that rather than need to be restarted. Well, PANIC is certainly bad, but what I'm suggesting is that we just focus on getting that down to ERROR and not worry about trying to get out of the disk-shortage situation automatically. Nor do I believe that it's such a good idea to have the database freeze up until space appears rather than reporting errors. regards, tom lane
On Tue, Jan 21, 2014 at 9:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On 6 June 2013 16:00, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >>> The current situation is that if you run out of disk space while writing >>> WAL, you get a PANIC, and the server shuts down. That's awful. >> I don't see we need to prevent WAL insertions when the disk fills. We >> still have the whole of wal_buffers to use up. When that is full, we >> will prevent further WAL insertions because we will be holding the >> WALwritelock to clear more space. So the rest of the system will lock >> up nicely, like we want, apart from read-only transactions. > I'm not sure that "all writing transactions lock up hard" is really so > much better than the current behavior. My preference would be that we > simply start failing writes with ERRORs rather than PANICs. I'm not real > sure ATM why this has to be a PANIC condition. Probably the cause is that > it's being done inside a critical section, but could we move that? My understanding is that if it runs out of buffer space while in an XLogInsert, it will be holding one or more buffer content locks exclusively, and unless it can complete the xlog (or scrounge up the info to return that buffer to its previous state), it can never release that lock. There might be other paths where it could get by with an ERROR, but if no one can write xlog anymore, all of those paths must quickly converge to the one that cannot simply ERROR. Cheers, Jeff
Jeff Janes <jeff.janes@gmail.com> writes: > On Tue, Jan 21, 2014 at 9:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> My preference would be that we simply start failing writes with ERRORs >> rather than PANICs. I'm not real sure ATM why this has to be a PANIC >> condition. Probably the cause is that it's being done inside a critical >> section, but could we move that? > My understanding is that if it runs out of buffer space while in an > XLogInsert, it will be holding one or more buffer content locks > exclusively, and unless it can complete the xlog (or scrounge up the info > to return that buffer to its previous state), it can never release that > lock. There might be other paths were it could get by with an ERROR, but > if no one can write xlog anymore, all of those paths must quickly converge > to the one that cannot simply ERROR. Well, the point is we'd have to somehow push detection of the problem to a point before the critical section that does the buffer changes and WAL insertion. The first idea that comes to mind is (1) estimate the XLOG space needed (an overestimate is fine here); (2) just before entering the critical section, call some function to "reserve" that space, such that we always have at least sum(outstanding reservations) available future WAL space; (3) release our reservation as part of the actual XLogInsert call. The problem here is that the "reserve" function would presumably need an exclusive lock, and would be about as much of a hot spot as XLogInsert itself is. Plus we'd be paying a lot of extra cycles to solve a corner case problem that, with all due respect, comes up pretty darn seldom. So probably we need a better idea than that. Maybe we could get some mileage out of the fact that very approximate techniques would be good enough. For instance, I doubt anyone would bleat if the system insisted on having 10MB or even 100MB of future WAL space always available. But I'm not sure exactly how to make use of that flexibility. regards, tom lane
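For illustration, a minimal sketch of steps (1)-(3) above; none of these names (WalReservationShmem, XLogReserveSpace, XLogReleaseReservation) exist in PostgreSQL, and the preallocated-space counter is assumed to be kept up to date elsewhere, e.g. by the WAL writer:

    #include "postgres.h"
    #include "storage/spin.h"

    /* Hypothetical shared-memory state; SpinLockInit() runs at shmem creation. */
    typedef struct WalReservationShmem
    {
        slock_t     mutex;               /* protects the counters below */
        uint64      reserved_bytes;      /* sum of outstanding reservations */
        uint64      preallocated_bytes;  /* future WAL space known to exist */
    } WalReservationShmem;

    static WalReservationShmem *WalReservation;

    /* Step (2): call just before entering the critical section. */
    static bool
    XLogReserveSpace(uint64 estimate)
    {
        bool        ok;

        SpinLockAcquire(&WalReservation->mutex);
        ok = (WalReservation->reserved_bytes + estimate <=
              WalReservation->preallocated_bytes);
        if (ok)
            WalReservation->reserved_bytes += estimate;
        SpinLockRelease(&WalReservation->mutex);

        return ok;                       /* caller ERRORs or sleeps if false */
    }

    /* Step (3): release the reservation once XLogInsert has run. */
    static void
    XLogReleaseReservation(uint64 estimate)
    {
        SpinLockAcquire(&WalReservation->mutex);
        WalReservation->reserved_bytes -= estimate;
        SpinLockRelease(&WalReservation->mutex);
    }

The spinlock here is exactly the hot spot described above, which is what motivates the more approximate schemes discussed below.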
On Tue, Jan 21, 2014 at 3:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Maybe we could get some mileage out of the fact that very approximate > techniques would be good enough. For instance, I doubt anyone would bleat > if the system insisted on having 10MB or even 100MB of future WAL space > always available. But I'm not sure exactly how to make use of that > flexibility. In the past I've thought that one approach that would eliminate concerns about portably and efficiently knowing how much space is left on the pg_xlog filesystem is to have a "ballast file". Under this scheme, perhaps XLogInsert() could differentiate between a soft and hard failure. Hopefully the reserve function you mentioned, which is still called at the same place, just before each critical section thereby becomes cheap. Perhaps I'm just restating what you said, though. -- Peter Geoghegan
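For illustration, a bare-bones sketch of the ballast-file idea; the file name and both functions are hypothetical, and, as noted downthread, a per-record stat() would likely be too slow, so a real implementation would presumably do this check in a background process:

    #include <sys/stat.h>
    #include <unistd.h>
    #include <stdbool.h>

    /* Hypothetical fixed-size reserve created at initdb time. */
    #define BALLAST_FILE "pg_xlog/ballast"

    /* Soft check: as long as the ballast exists, that much space can still be freed. */
    static bool
    ballast_available(void)
    {
        struct stat st;

        return stat(BALLAST_FILE, &st) == 0;
    }

    /* On ENOSPC, give the reserved space back so the final checkpoint can complete. */
    static bool
    release_ballast(void)
    {
        return unlink(BALLAST_FILE) == 0;
    }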
On 2014-01-21 18:24:39 -0500, Tom Lane wrote: > Jeff Janes <jeff.janes@gmail.com> writes: > > On Tue, Jan 21, 2014 at 9:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> My preference would be that we simply start failing writes with ERRORs > >> rather than PANICs. I'm not real sure ATM why this has to be a PANIC > >> condition. Probably the cause is that it's being done inside a critical > >> section, but could we move that? > > > My understanding is that if it runs out of buffer space while in an > > XLogInsert, it will be holding one or more buffer content locks > > exclusively, and unless it can complete the xlog (or scrounge up the info > > to return that buffer to its previous state), it can never release that > > lock. There might be other paths were it could get by with an ERROR, but > > if no one can write xlog anymore, all of those paths must quickly converge > > to the one that cannot simply ERROR. > > Well, the point is we'd have to somehow push detection of the problem > to a point before the critical section that does the buffer changes > and WAL insertion. Well, I think that's already hard for the heapam.c stuff, but doing that in xact.c seems fracking hairy. We can't simply stop in the middle of a commit and not continue, that'd often grind the system to a halt preventing cleanup. Additionally the size of the inserted record for commits is essentially unbounded, which makes it an especially fun case. > The first idea that comes to mind is (1) estimate the XLOG space needed > (an overestimate is fine here); (2) just before entering the critical > section, call some function to "reserve" that space, such that we always > have at least sum(outstanding reservations) available future WAL space; > (3) release our reservation as part of the actual XLogInsert call. I think that's not necessarily enough. In a COW filesystem like btrfs or ZFS you really cannot give many guarantees about writes succeeding or failing, even if we were able to create (and zero) a new segment. Even if you disregard that, we'd need to keep up with lots of concurrent reservations, looking a fair bit into the future. E.g. during a "smart" shutdown in a workload with lots of subtransactions trying to reserve space might make the situation actually worse because we might end up trying to reserve the combined size of records. > The problem here is that the "reserve" function would presumably need an > exclusive lock, and would be about as much of a hot spot as XLogInsert > itself is. Plus we'd be paying a lot of extra cycles to solve a corner > case problem that, with all due respect, comes up pretty darn seldom. > So probably we need a better idea than that. Yea, I don't think anything really safe is going to work without significant penalties. > Maybe we could get some mileage out of the fact that very approximate > techniques would be good enough. For instance, I doubt anyone would bleat > if the system insisted on having 10MB or even 100MB of future WAL space > always available. But I'm not sure exactly how to make use of that > flexibility. If we'd be more aggressive with preallocating WAL files and doing so in the WAL writer, we could stop accepting writes in some common codepaths (e.g. nodeModifyTable.c) as soon as preallocating failed but continue to accept writes in other locations (e.g. TRUNCATE, DROP TABLE). That'd still fail if you write a *very* large commit record using up all the reserve though... I personally think this isn't worth complicating the code for.
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-01-21 18:24:39 -0500, Tom Lane wrote: >> Maybe we could get some mileage out of the fact that very approximate >> techniques would be good enough. For instance, I doubt anyone would bleat >> if the system insisted on having 10MB or even 100MB of future WAL space >> always available. But I'm not sure exactly how to make use of that >> flexibility. > If we'd be more aggressive with preallocating WAL files and doing so in > the WAL writer, we could stop accepting writes in some common codepaths > (e.g. nodeModifyTable.c) as soon as preallocating failed but continue to > accept writes in other locations (e.g. TRUNCATE, DROP TABLE). That'd > still fail if you write a *very* large commit record using up all the > reserve though... > I personally think this isn't worth complicating the code for. I too have got doubts about whether a completely bulletproof solution is practical. (And as you say, even if our internal logic was bulletproof, a COW filesystem defeats all guarantees in this area anyway.) But perhaps a 99% solution would be a useful compromise. Another thing to think about is whether we couldn't put a hard limit on WAL record size somehow. Multi-megabyte WAL records are an abuse of the design anyway, when you get right down to it. So for example maybe we could split up commit records, with most of the bulky information dumped into separate records that appear before the "real commit". This would complicate replay --- in particular, if we abort the transaction after writing a few such records, how does the replayer realize that it can forget about those records? But that sounds probably surmountable. regards, tom lane
On 2014-01-21 18:59:13 -0500, Tom Lane wrote: > Another thing to think about is whether we couldn't put a hard limit on > WAL record size somehow. Multi-megabyte WAL records are an abuse of the > design anyway, when you get right down to it. So for example maybe we > could split up commit records, with most of the bulky information dumped > into separate records that appear before the "real commit". This would > complicate replay --- in particular, if we abort the transaction after > writing a few such records, how does the replayer realize that it can > forget about those records? But that sounds probably surmountable. Which cases of essentially unbounded record sizes do we currently have? The only ones that I remember right now are commit and abort records (including when wrapped in a prepared xact), containing a) the list of committing/aborting subtransactions, b) the list of files to drop, and c) cache invalidations. Hm. There's also xl_standby_locks, but that'd be easily splittable. I think removing the list of subtransactions from commit records would essentially require not truncating pg_subtrans after a restart anymore. If we'd truncate it in concord with pg_clog, we'd only need to log the subxids which haven't been explicitly assigned. Unfortunately pg_subtrans will be bigger than pg_clog, making such a scheme likely to be painful. We could relatively easily split off logging the dropped files from commit records and log them in groups afterwards; we already have several races that allow leaking files. We could do something similar for the cache invalidations, but that seems likely to get rather ugly, as we'd need to hold ProcArrayLock till the next record is read, or until a shutdown or end-of-recovery record is read. If one of the latter is found before the corresponding invalidations, we'd need to invalidate the entire syscache. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 21 January 2014 23:01, Jeff Janes <jeff.janes@gmail.com> wrote: > On Tue, Jan 21, 2014 at 9:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> Simon Riggs <simon@2ndQuadrant.com> writes: >> > On 6 June 2013 16:00, Heikki Linnakangas <hlinnakangas@vmware.com> >> > wrote: >> >> The current situation is that if you run out of disk space while >> >> writing >> >> WAL, you get a PANIC, and the server shuts down. That's awful. >> >> > I don't see we need to prevent WAL insertions when the disk fills. We >> > still have the whole of wal_buffers to use up. When that is full, we >> > will prevent further WAL insertions because we will be holding the >> > WALwritelock to clear more space. So the rest of the system will lock >> > up nicely, like we want, apart from read-only transactions. >> >> I'm not sure that "all writing transactions lock up hard" is really so >> much better than the current behavior. >> >> My preference would be that we simply start failing writes with ERRORs >> rather than PANICs. I'm not real sure ATM why this has to be a PANIC >> condition. Probably the cause is that it's being done inside a critical >> section, but could we move that? > > > My understanding is that if it runs out of buffer space while in an > XLogInsert, it will be holding one or more buffer content locks exclusively, > and unless it can complete the xlog (or scrounge up the info to return that > buffer to its previous state), it can never release that lock. There might > be other paths were it could get by with an ERROR, but if no one can write > xlog anymore, all of those paths must quickly converge to the one that > cannot simply ERROR. Agreed. You don't say it but I presume you intend to point out that such long-lived contention could easily have a knock-on effect on other read-only statements. I'm pretty sure other databases work the same way. Our choices are 1. Waiting 2. Abort transactions 3. Some kind of release-locks-then-wait-and-retry (3) is a step too far for me, even though it is easier than you say since we write WAL before changing the data block so a failure to insert WAL could just result in a temporary drop lock, sleep and retry. I would go for (1) waiting for up to checkpoint_timeout then (2), if we think that is a problem. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-01-22 01:18:36 +0100, Simon Riggs wrote: > > My understanding is that if it runs out of buffer space while in an > > XLogInsert, it will be holding one or more buffer content locks exclusively, > > and unless it can complete the xlog (or scrounge up the info to return that > > buffer to its previous state), it can never release that lock. There might > > be other paths were it could get by with an ERROR, but if no one can write > > xlog anymore, all of those paths must quickly converge to the one that > > cannot simply ERROR. > > Agreed. You don't say it but I presume you intend to point out that > such long-lived contention could easily have a knock on effect to > other read-only statements. I'm pretty sure other databases work the > same way. > > Our choice are > > 1. Waiting > 2. Abort transactions > 3. Some kind of release-locks-then-wait-and-retry > > (3) is a step too far for me, even though it is easier than you say > since we write WAL before changing the data block so a failure to > insert WAL could just result in a temporary drop lock, sleep and > retry. > > I would go for (1) waiting for up to checkpoint_timeout then (2), if > we think that is a problem. How are we supposed to wait while holding e.g. ProcArrayLock? Aborting transactions doesn't work either, that writes abort records which can get significantly large. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-01-21 18:59:13 -0500, Tom Lane wrote: >> Another thing to think about is whether we couldn't put a hard limit on >> WAL record size somehow. Multi-megabyte WAL records are an abuse of the >> design anyway, when you get right down to it. So for example maybe we >> could split up commit records, with most of the bulky information dumped >> into separate records that appear before the "real commit". This would >> complicate replay --- in particular, if we abort the transaction after >> writing a few such records, how does the replayer realize that it can >> forget about those records? But that sounds probably surmountable. > I think removing the list of subtransactions from commit records would > essentially require not truncating pg_subtrans after a restart > anymore. I'm not suggesting that we stop providing that information! I'm just saying that we perhaps don't need to store it all in one WAL record, if instead we put the onus on WAL replay to be able to reconstruct what it needs from a series of WAL records. > We could relatively easily split of logging the dropped files from > commit records and log them in groups afterwards, we already have > several races allowing to leak files. I was thinking the other way around: emit the subsidiary records before the atomic commit or abort record, indeed before we've actually committed. Part of the point is to reduce the risk that lack of WAL space would prevent us from fully committing. Also, writing those records afterwards increases the risk of a post-commit failure, which is a bad thing. Replay would then involve either accumulating the subsidiary records in memory, or being willing to go back and re-read them when the real commit or abort record is seen. regards, tom lane
Andres Freund <andres@2ndquadrant.com> writes: > How are we supposed to wait while e.g. ProcArrayLock? Aborting > transactions doesn't work either, that writes abort records which can > get signficantly large. Yeah, that's an interesting point ;-). We can't *either* commit or abort without emitting some WAL, possibly quite a bit of WAL. regards, tom lane
On Tue, Jan 21, 2014 at 3:43 PM, Andres Freund <andres@2ndquadrant.com> wrote: > I personally think this isn't worth complicating the code for. You're probably right. However, I don't see why the bar has to be very high when we're considering the trade-off between taking some emergency precaution against having a PANIC shutdown, and an assured PANIC shutdown. Heikki said somewhere upthread that he'd be happy with a solution that only catches 90% of the cases. That is probably a conservative estimate. The schemes discussed here would probably be much more effective than that in practice. Sure, you can still poke holes in them. For example, there has been some discussion of arbitrarily large commit records. However, this is the kind of thing that just isn't that relevant in the real world. I believe that in practice the majority of commit records are all about the same size. I do not believe that the two acceptable outcomes here are either that we continue to always PANIC shutdown (i.e. do nothing), or promise to never PANIC shutdown. There is likely to be a third way, which is that the probability of a PANIC shutdown is, at the macro level, reduced somewhat from the present probability of 1.0. People are not going to develop a lackadaisical attitude about running out of disk space on the pg_xlog partition if we do so. They still have plenty of incentive to make sure that that doesn't happen. -- Peter Geoghegan
On 2014-01-21 19:23:57 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-01-21 18:59:13 -0500, Tom Lane wrote: > >> Another thing to think about is whether we couldn't put a hard limit on > >> WAL record size somehow. Multi-megabyte WAL records are an abuse of the > >> design anyway, when you get right down to it. So for example maybe we > >> could split up commit records, with most of the bulky information dumped > >> into separate records that appear before the "real commit". This would > >> complicate replay --- in particular, if we abort the transaction after > >> writing a few such records, how does the replayer realize that it can > >> forget about those records? But that sounds probably surmountable. > > > I think removing the list of subtransactions from commit records would > > essentially require not truncating pg_subtrans after a restart > > anymore. > > I'm not suggesting that we stop providing that information! I'm just > saying that we perhaps don't need to store it all in one WAL record, > if instead we put the onus on WAL replay to be able to reconstruct what > it needs from a series of WAL records. That'd likely require something similar to the incomplete actions used in btrees (and until recently in more places). I think that is/was a disaster I really don't want to extend. > > We could relatively easily split of logging the dropped files from > > commit records and log them in groups afterwards, we already have > > several races allowing to leak files. > > I was thinking the other way around: emit the subsidiary records before the > atomic commit or abort record, indeed before we've actually committed. > Part of the point is to reduce the risk that lack of WAL space would > prevent us from fully committing. > Replay would then involve either accumulating the subsidiary records in > memory, or being willing to go back and re-read them when the real commit > or abort record is seen. Well, the reason I suggested doing it the other way round is that we wouldn't need to reassemble anything (outside of cache invalidations which I don't know how to handle that way) which I think is a significant increase in robustness and decrease in complexity. > Also, writing those records afterwards > increases the risk of a post-commit failure, which is a bad thing. Well, most of those could be done outside of a critical section, possibly just FATALing out. Beats PANICing. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-01-21 19:23:57 -0500, Tom Lane wrote: >> I'm not suggesting that we stop providing that information! I'm just >> saying that we perhaps don't need to store it all in one WAL record, >> if instead we put the onus on WAL replay to be able to reconstruct what >> it needs from a series of WAL records. > That'd likely require something similar to the incomplete actions used > in btrees (and until recently in more places). I think that is/was a > disaster I really don't want to extend. I don't think that's a comparable case. Incomplete actions are actions to be taken immediately, and which the replayer then has to complete somehow if it doesn't find the rest of the action in the WAL sequence. The only thing to be done with the records I'm proposing is to remember their contents (in some fashion) until it's time to apply them. If you hit end of WAL you don't really have to do anything. regards, tom lane
On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote: > On Tue, Jan 21, 2014 at 3:43 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > I personally think this isn't worth complicating the code for. > > You're probably right. However, I don't see why the bar has to be very > high when we're considering the trade-off between taking some > emergency precaution against having a PANIC shutdown, and an assured > PANIC shutdown Well, the problem is that the tradeoff would very likely include making already complex code even more complex. None of the proposals, even the one just decreasing the likelihood of a PANIC, look like they'd end up being simple implementation-wise. And that additional complexity would hurt robustness and prevent things I find much more important than this. > Heikki said somewhere upthread that he'd be happy with > a solution that only catches 90% of the cases. That is probably a > conservative estimate. The schemes discussed here would probably be > much more effective than that in practice. Sure, you can still poke > holes in them. For example, there has been some discussion of > arbitrarily large commit records. However, this is the kind of thing > just isn't that relevant in the real world. I believe that in practice > the majority of commit records are all about the same size. Yes, realistically the boundary will be relatively low, but I don't think that that means that we can disregard issues like the possibility that a record might be bigger than wal_buffers. Not because it'd allow theoretical issues, but because it rules out several tempting approaches like e.g. extending the in-memory reservation scheme of Heikki's scalability work to handle this. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-01-21 19:45:19 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-01-21 19:23:57 -0500, Tom Lane wrote: > >> I'm not suggesting that we stop providing that information! I'm just > >> saying that we perhaps don't need to store it all in one WAL record, > >> if instead we put the onus on WAL replay to be able to reconstruct what > >> it needs from a series of WAL records. > > > That'd likely require something similar to the incomplete actions used > > in btrees (and until recently in more places). I think that is/was a > > disaster I really don't want to extend. > > I don't think that's a comparable case. Incomplete actions are actions > to be taken immediately, and which the replayer then has to complete > somehow if it doesn't find the rest of the action in the WAL sequence. > The only thing to be done with the records I'm proposing is to remember > their contents (in some fashion) until it's time to apply them. If you > hit end of WAL you don't really have to do anything. Would that work for the promotion case as well? Afair there's the assumption that everything >= TransactionXmin can be looked up in pg_subtrans or in the procarray - which afaics wouldn't be the case with your scheme? And TransactionXmin could very well be below such an "incomplete commit"'s xids afaics. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-01-21 19:45:19 -0500, Tom Lane wrote: >> I don't think that's a comparable case. Incomplete actions are actions >> to be taken immediately, and which the replayer then has to complete >> somehow if it doesn't find the rest of the action in the WAL sequence. >> The only thing to be done with the records I'm proposing is to remember >> their contents (in some fashion) until it's time to apply them. If you >> hit end of WAL you don't really have to do anything. > Would that work for the promotion case as well? Afair there's the > assumption that everything >= TransactionXmin can be looked up in > pg_subtrans or in the procarray - which afaics wouldn't be the case with > your scheme? And TransactionXmin could very well be below such an > "incomplete commit"'s xids afaics. Uh, what? The behavior I'm talking about is *exactly the same* as what happens now. The only change is that the data sent to the WAL file is laid out a bit differently, and the replay logic has to work harder to reassemble it before it can apply the commit or abort action. If anything outside replay can detect a difference at all, that would be a bug. Once again: the replayer is not supposed to act immediately on the subsidiary records. It's just supposed to remember their contents so it can reattach them to the eventual commit or abort record, and then do what it does today to replay the commit or abort. regards, tom lane
On 22 January 2014 01:23, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@2ndquadrant.com> writes: >> On 2014-01-21 18:59:13 -0500, Tom Lane wrote: >>> Another thing to think about is whether we couldn't put a hard limit on >>> WAL record size somehow. Multi-megabyte WAL records are an abuse of the >>> design anyway, when you get right down to it. So for example maybe we >>> could split up commit records, with most of the bulky information dumped >>> into separate records that appear before the "real commit". This would >>> complicate replay --- in particular, if we abort the transaction after >>> writing a few such records, how does the replayer realize that it can >>> forget about those records? But that sounds probably surmountable. > >> I think removing the list of subtransactions from commit records would >> essentially require not truncating pg_subtrans after a restart >> anymore. > > I'm not suggesting that we stop providing that information! I'm just > saying that we perhaps don't need to store it all in one WAL record, > if instead we put the onus on WAL replay to be able to reconstruct what > it needs from a series of WAL records. I think removing excess subxacts from commit and abort records could be a good idea. Not sure anybody considered it before. We already emit xid allocation records when we overflow, so we already know which subxids have been allocated. We also issue subxact abort records for anything that aborted. So in theory we should be able to reconstruct an arbitrarily long chain of subxacts. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 22 January 2014 01:30, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@2ndquadrant.com> writes: >> How are we supposed to wait while e.g. ProcArrayLock? Aborting >> transactions doesn't work either, that writes abort records which can >> get signficantly large. > > Yeah, that's an interesting point ;-). We can't *either* commit or abort > without emitting some WAL, possibly quite a bit of WAL. Right, which is why we don't need to lock ProcArrayLock. As soon as we try to write a commit or abort it goes through the normal XLogInsert route. As soon as wal_buffers fills WALWriteLock will be held continuously until we free some space. Since ProcArrayLock isn't held, read-only users can continue. As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Note that my proposal would not require aborting any in-flight transactions; they would continue to completion as soon as space is cleared. My proposal again, so we can review how simple it was... 1. Allow a checkpoint to complete by updating the control file, rather than writing WAL. The control file is already there and is fixed size, so we can be more confident it will accept the update. We could add a new checkpoint mode for that, or we could do that always for shutdown checkpoints (my preferred option). EFFECT: Since a checkpoint can now be called and complete without writing WAL, we are able to write dirty buffers and then clean out WAL files to reduce space. 2. If we fill the disk when writing WAL we do not PANIC, we signal the checkpointer process to perform an immediate checkpoint and then wait for its completion. EFFECT: Since we are holding WALWriteLock, all other write users will soon either wait for that lock directly or indirectly. Both of those points are relatively straightforward to implement and this proposal minimises seldom-tested code paths. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
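For illustration, the error path in point 2 might look roughly like this; RequestCheckpoint() and the CHECKPOINT_IMMEDIATE/FORCE/WAIT flags exist today, but CHECKPOINT_CONTROLFILE_ONLY and this wrapper are hypothetical, and partial writes are ignored for brevity:

    #include <errno.h>
    #include <unistd.h>

    /*
     * Sketch only: on ENOSPC while writing WAL, ask the checkpointer for an
     * immediate checkpoint that completes by rewriting the control file
     * (making all old segments removable), retry once, and PANIC only if
     * the retry also fails.
     */
    static void
    XLogWriteOrRecover(int fd, char *from, Size nbytes)
    {
        if (write(fd, from, nbytes) == (ssize_t) nbytes)
            return;

        if (errno == ENOSPC)
        {
            RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
                              CHECKPOINT_WAIT | CHECKPOINT_CONTROLFILE_ONLY);

            /* old segments can now be recycled; retry the write once */
            if (write(fd, from, nbytes) == (ssize_t) nbytes)
                return;
        }

        ereport(PANIC,
                (errcode_for_file_access(),
                 errmsg("could not write to log file: %m")));
    }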
On 01/22/2014 02:10 PM, Simon Riggs wrote: > As Jeff points out, the blocks being modified would be locked until > space is freed up. Which could make other users wait. The code > required to avoid that wait would be complex and not worth any > overhead. Checkpoint also acquires the content lock of every dirty page in the buffer cache... - Heikki
On 22 January 2014 13:14, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 01/22/2014 02:10 PM, Simon Riggs wrote: >> >> As Jeff points out, the blocks being modified would be locked until >> space is freed up. Which could make other users wait. The code >> required to avoid that wait would be complex and not worth any >> overhead. > > > Checkpoint also acquires the content lock of every dirty page in the buffer > cache... Good point. We would need to take special action for any dirty blocks that we cannot obtain content lock for, which should be a smallish list, to be dealt with right at the end of the checkpoint writes. We know that anyone waiting for the WAL lock will not be modifying the block and so we can copy it without obtaining the lock. We can inspect the lock queue on the WAL locks and then see which buffers we can skip the lock for. The alternative of adding a check for WAL space to the normal path is a non-starter, IMHO. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Well, PANIC is certainly bad, but what I'm suggesting is that we > just focus on getting that down to ERROR and not worry about > trying to get out of the disk-shortage situation automatically. > Nor do I believe that it's such a good idea to have the database > freeze up until space appears rather than reporting errors. Dusting off my DBA hat for a moment, I would say that I would be happy if each process either generated an ERROR or went into a wait state. They would not all need to do the same thing; whichever is easier in each process's context would do. It would be nice if a process which went into a long wait state until disk space became available would issue a WARNING about that, where that is possible. I feel that anyone running a database that matters to them should be monitoring disk space and getting an alert from the monitoring system in advance of actually running out of disk space; but we all know that poorly managed systems are out there. To accommodate them we don't want to add risk or performance hits for those who do manage their systems well. The attempt to make more space by deleting WAL files scares me a bit. The dynamics of that seem like they could be counterproductive if the pg_xlog directory is on the same filesystem as everything else and there is a rapidly growing file. Every time you cleaned up, the fast-growing file would eat more of the space pg_xlog had been using, until perhaps it had no space to keep even a single segment, no? And how would that work with a situation where the archive_command was failing, causing a build-up of WAL files? It just seems like too much mechanism and incomplete coverage of the problem space. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-01-21 21:42:19 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-01-21 19:45:19 -0500, Tom Lane wrote: > >> I don't think that's a comparable case. Incomplete actions are actions > >> to be taken immediately, and which the replayer then has to complete > >> somehow if it doesn't find the rest of the action in the WAL sequence. > >> The only thing to be done with the records I'm proposing is to remember > >> their contents (in some fashion) until it's time to apply them. If you > >> hit end of WAL you don't really have to do anything. > > > Would that work for the promotion case as well? Afair there's the > > assumption that everything >= TransactionXmin can be looked up in > > pg_subtrans or in the procarray - which afaics wouldn't be the case with > > your scheme? And TransactionXmin could very well be below such an > > "incomplete commit"'s xids afaics. > > Uh, what? The behavior I'm talking about is *exactly the same* > as what happens now. The only change is that the data sent to the > WAL file is laid out a bit differently, and the replay logic has > to work harder to reassemble it before it can apply the commit or > abort action. If anything outside replay can detect a difference > at all, that would be a bug. > > Once again: the replayer is not supposed to act immediately on the > subsidiary records. It's just supposed to remember their contents > so it can reattach them to the eventual commit or abort record, > and then do what it does today to replay the commit or abort. I (think) I get what you want to do, but splitting the record like that nonetheless opens up behaviour that previously wasn't there. Imagine we promote in between replaying the list of subxacts (only storing it in memory) and the main commit record. Either we have something like the incomplete action stuff doing something with the in-memory data, or we are in a situation where there can be xids bigger than TransactionXmin that are not in pg_subtrans and not in the procarray. Which I don't think exists today since we either read the commit record in its entirety or not. We'd also need to use the MyPgXact->delayChkpt mechanism to prevent checkpoints from occurring in between those records, but we do that already, so that seems rather uncontroversial. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-01-21 21:42:19 -0500, Tom Lane wrote: >> Uh, what? The behavior I'm talking about is *exactly the same* >> as what happens now. The only change is that the data sent to the >> WAL file is laid out a bit differently, and the replay logic has >> to work harder to reassemble it before it can apply the commit or >> abort action. If anything outside replay can detect a difference >> at all, that would be a bug. >> >> Once again: the replayer is not supposed to act immediately on the >> subsidiary records. It's just supposed to remember their contents >> so it can reattach them to the eventual commit or abort record, >> and then do what it does today to replay the commit or abort. > I (think) I get what you want to do, but splitting the record like that > nonetheless opens up behaviour that previously wasn't there. Obviously we are not on the same page yet. In my vision, the WAL writer is dumping the same data it would have dumped, though in a different layout, and it's working from process-local state same as it does now. The WAL replayer is taking the same actions at the same time using the same data as it does now. There is no "behavior that wasn't there", unless you're claiming that there are *existing* race conditions in commit/abort WAL processing. The only thing that seems mildly squishy about this is that it's not clear how long the WAL replayer ought to hang onto subsidiary records for a commit or abort it hasn't seen yet. In the case where we change our minds and abort a transaction after already having written some subsidiary records for the commit, it's not really a problem; the replayer can throw away any saved data related to the commit of xid N as soon as it sees an abort for xid N. However, what if the session crashes and never writes either a final commit or abort record? I think we can deal with this fairly easily though, because that case should end with a crash recovery cycle writing a shutdown checkpoint to the log (we do do that no?). So the rule can be "discard any unmatched subsidiary records if you see a shutdown checkpoint". This makes sense on its own terms since there are surely no active transactions at that point in the log. regards, tom lane
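For illustration, a sketch of the replay-side bookkeeping described above; all names here are invented for the example, and real code would hang off the existing xact redo routines. Subsidiary records are stashed per-xid until the matching commit or abort arrives; an abort for that xid, or a shutdown checkpoint, simply discards the stash:

    #include "postgres.h"
    #include "nodes/pg_list.h"
    #include "utils/hsearch.h"

    /* Hypothetical per-xid stash of not-yet-applied subsidiary data. */
    typedef struct PendingCommit
    {
        TransactionId xid;          /* hash key */
        List       *subxids;
        List       *files_to_drop;
        List       *invalidations;
    } PendingCommit;

    static HTAB *pending_commits = NULL;

    static void
    init_pending_commits(void)
    {
        HASHCTL     ctl;

        MemSet(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(TransactionId);
        ctl.entrysize = sizeof(PendingCommit);
        ctl.hash = tag_hash;
        pending_commits = hash_create("pending commit data", 64, &ctl,
                                      HASH_ELEM | HASH_FUNCTION);
    }

    /* Replay of a hypothetical "commit subxids" subsidiary record. */
    static void
    stash_subxids(TransactionId xid, List *subxids)
    {
        bool        found;
        PendingCommit *pc = hash_search(pending_commits, &xid,
                                        HASH_ENTER, &found);

        if (!found)
        {
            pc->subxids = NIL;
            pc->files_to_drop = NIL;
            pc->invalidations = NIL;
        }
        pc->subxids = list_concat(pc->subxids, subxids);
    }

    /*
     * Called after replaying the real commit/abort record for xid, or for
     * every stashed xid when a shutdown checkpoint is seen.
     */
    static void
    forget_pending(TransactionId xid)
    {
        hash_search(pending_commits, &xid, HASH_REMOVE, NULL);
    }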
On 22 January 2014 14:25, Simon Riggs <simon@2ndquadrant.com> wrote: > On 22 January 2014 13:14, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> On 01/22/2014 02:10 PM, Simon Riggs wrote: >>> >>> As Jeff points out, the blocks being modified would be locked until >>> space is freed up. Which could make other users wait. The code >>> required to avoid that wait would be complex and not worth any >>> overhead. >> >> >> Checkpoint also acquires the content lock of every dirty page in the buffer >> cache... > > Good point. We would need to take special action for any dirty blocks > that we cannot obtain content lock for, which should be a smallish > list, to be dealt with right at the end of the checkpoint writes. > > We know that anyone waiting for the WAL lock will not be modifying the > block and so we can copy it without obtaining the lock. We can inspect > the lock queue on the WAL locks and then see which buffers we can skip > the lock for. This could be handled similarly to the way we handle buffer pin deadlocks in Hot Standby. So I don't see any blockers from that angle. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 1/21/14, 6:46 PM, Andres Freund wrote:
> On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote:
>> On Tue, Jan 21, 2014 at 3:43 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> I personally think this isn't worth complicating the code for.
>>
>> You're probably right. However, I don't see why the bar has to be very
>> high when we're considering the trade-off between taking some
>> emergency precaution against having a PANIC shutdown, and an assured
>> PANIC shutdown
>
> Well, the problem is that the tradeoff would very likely include making
> already complex code even more complex. None of the proposals, even the
> one just decreasing the likelihood of a PANIC, like like they'd end up
> being simple implementation-wise.
> And that additional complexity would hurt robustness and prevent things
> I find much more important than this.

If we're not looking for perfection, what's wrong with Peter's idea of a
ballast file? Presumably the check to see if that file still exists would
be cheap, so we can do that before entering the appropriate critical
section.

There's still a small chance that we'd end up panicking, but it's better
than today. I'd argue that even if it doesn't work for CoW filesystems
it'd still be a win.

--
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net
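A minimal sketch of the ballast idea, assuming a preallocated file on the same filesystem as pg_xlog; the file name and both helpers are invented, and this is not PostgreSQL code.

#include <sys/stat.h>
#include <stdbool.h>
#include <unistd.h>

#define BALLAST_PATH "pg_xlog_ballast"      /* hypothetical file name */

/* Called before entering a critical section that will write WAL. */
static bool
ballast_still_present(void)
{
    struct stat st;

    return stat(BALLAST_PATH, &st) == 0;
}

/* Called when WAL space runs out: free up room so a checkpoint can finish. */
static bool
release_ballast(void)
{
    return unlink(BALLAST_PATH) == 0;
}

The attraction is that a backend seeing ballast_still_present() return false could error out before taking any buffer locks, rather than hitting the out-of-space condition inside a critical section. Andres's objection below is about the cost of that stat() on every WAL-writing operation.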
On 2014-01-22 18:19:25 -0600, Jim Nasby wrote:
> On 1/21/14, 6:46 PM, Andres Freund wrote:
>> On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote:
>>> On Tue, Jan 21, 2014 at 3:43 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>>> I personally think this isn't worth complicating the code for.
>>>
>>> You're probably right. However, I don't see why the bar has to be very
>>> high when we're considering the trade-off between taking some
>>> emergency precaution against having a PANIC shutdown, and an assured
>>> PANIC shutdown
>> Well, the problem is that the tradeoff would very likely include making
>> already complex code even more complex. None of the proposals, even the
>> one just decreasing the likelihood of a PANIC, like like they'd end up
>> being simple implementation-wise.
>> And that additional complexity would hurt robustness and prevent things
>> I find much more important than this.
>
> If we're not looking for perfection, what's wrong with Peter's idea of
> a ballast file? Presumably the check to see if that file still exists
> would be cheap, so we can do that before entering the appropriate
> critical section.

That'd be noticeably expensive. Opening/stat'ing a file isn't cheap,
especially if you do it via filename and not via fd, which is what we'd
have to do. I am pretty sure it would be noticeable in single-client
workloads, and it damned sure will be noticeable on busy multi-socket
workloads.

I still think doing the checks in the WAL writer is the best bet,
setting a flag in shared memory that can then cheaply be tested. When
set, it would cause any further action that would write xlog to error
out, unless that action is already in progress.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
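A sketch of that shape, with all names invented (this is not the actual WAL writer code): the background process does the expensive measurement and publishes a single boolean, while the per-record path only reads it.

#include <stdbool.h>

/* hypothetical: returns bytes of preallocated-but-unused WAL space */
extern long FreeWalSpace(void);
#define WAL_SPACE_RESERVE   (16L * 1024 * 1024)     /* example threshold */

typedef struct WalSpaceShared
{
    volatile bool   wal_space_low;  /* written by WAL writer, read by backends */
} WalSpaceShared;

static WalSpaceShared *walSpace;    /* would live in shared memory */

/* WAL writer side: run once per wakeup, pays the measurement cost. */
static void
WalWriterCheckSpace(void)
{
    walSpace->wal_space_low = (FreeWalSpace() < WAL_SPACE_RESERVE);
}

/* Backend side: called before any action that will write xlog. */
static bool
WalSpaceAvailableForInsert(bool already_in_progress)
{
    /*
     * Callers that get 'false' would raise an ERROR here, before taking
     * any buffer locks or entering a critical section; actions already
     * in progress are allowed to finish so they can release their locks.
     */
    return already_in_progress || !walSpace->wal_space_low;
}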
On 23 January 2014 01:19, Jim Nasby <jim@nasby.net> wrote:
> On 1/21/14, 6:46 PM, Andres Freund wrote:
>> On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote:
>>> On Tue, Jan 21, 2014 at 3:43 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>>> I personally think this isn't worth complicating the code for.
>>>
>>> You're probably right. However, I don't see why the bar has to be very
>>> high when we're considering the trade-off between taking some
>>> emergency precaution against having a PANIC shutdown, and an assured
>>> PANIC shutdown
>>
>> Well, the problem is that the tradeoff would very likely include making
>> already complex code even more complex. None of the proposals, even the
>> one just decreasing the likelihood of a PANIC, like like they'd end up
>> being simple implementation-wise.
>> And that additional complexity would hurt robustness and prevent things
>> I find much more important than this.
>
> If we're not looking for perfection, what's wrong with Peter's idea of a
> ballast file? Presumably the check to see if that file still exists would
> be cheap, so we can do that before entering the appropriate critical
> section.
>
> There's still a small chance that we'd end up panicking, but it's better
> than today. I'd argue that even if it doesn't work for CoW filesystems
> it'd still be a win.

I grant that it does sound simple enough for a partial stopgap. My
concern is that it provides only a short delay before the eventual
disk-full situation, which it doesn't actually prevent.

IMHO the main issue now is how we clear down old WAL files. We need to
perform a checkpoint to do that - and, as has been pointed out in
relation to my proposal, we cannot complete that because of locks that
will be held for some time when we do eventually lock up. That issue is
not solved by having one or more ballast files.

IMHO we need to resolve the deadlock inherent in the
disk-full/WAL-lock-up/checkpoint situation. My view is that it can be
solved in a similar way to the way the buffer pin deadlock was resolved
for Hot Standby.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 2014-01-23 13:56:49 +0100, Simon Riggs wrote:
> IMHO we need to resolve the deadlock inherent in the
> disk-full/WAL-lock-up/checkpoint situation. My view is that it can be
> solved in a similar way to the way the buffer pin deadlock was resolved
> for Hot Standby.

I don't think that approach works here. We're not talking about mere
buffer pins but the big bad exclusively locked buffer which is held by
a backend in a critical section. Killing such a backend costs you a
PANIC.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services