Thread: BUG #10675: alter database set tablespace and unlogged table
The following bug has been logged on the website: Bug reference: 10675 Logged by: Maxim Boguk Email address: maxim.boguk@gmail.com PostgreSQL version: 9.3.4 Operating system: Linux (Ubuntu) Description: Hi, Now bug report with easy/short test case. PostgreSQL seems doesn't flush dirty buffers related to unlogged tables in the database during alter database set tablespace ...; Test case: mboguk=# create database test tablespace tmp; CREATE DATABASE mboguk=# \c test You are now connected to database "test" as user "mboguk". test=# create unlogged table test (id integer); CREATE TABLE test=# insert into test select * from generate_series(1,10000000); INSERT 0 10000000 test=# \c postgres You are now connected to database "postgres" as user "mboguk". postgres=# alter database test set tablespace pg_default; ALTER DATABASE postgres=# checkpoint; ERROR: checkpoint request failed HINT: Consult recent messages in the server log for details. In PostgreSQL logs: 2014-06-16 23:16:41 EDT ERROR: could not open file "pg_tblspc/16558/PG_9.3_201306121/16559/16560": No such file or directory 2014-06-16 23:16:41 EDT CONTEXT: writing block 27059 of relation pg_tblspc/16558/PG_9.3_201306121/16559/16560 2014-06-16 23:16:41 EDT WARNING: could not write block 27059 of pg_tblspc/16558/PG_9.3_201306121/16559/16560 2014-06-16 23:16:41 EDT DETAIL: Multiple failures --- write error might be permanent. Some additional info: select oid,* from pg_tablespace where oid=16558; oid | spcname | spcowner | spcacl | spcoptions -------+---------+----------+--------+------------ 16558 | tmp | 16397 | | test=# select relname from pg_class where relfilenode=16560; relname --------- test So the database flush dirty shared buffers related to the unlogged table into the old file location (pre-alter tablespace). Kind Regards, Maksym
On Tue, Jun 17, 2014 at 8:54 AM, <maxim.boguk@gmail.com> wrote:
The following bug has been logged on the website:
Bug reference: 10675
Logged by: Maxim Boguk
Email address: maxim.boguk@gmail.com
PostgreSQL version: 9.3.4
Operating system: Linux (Ubuntu)
Description:
Hi,
Now bug report with easy/short test case.
PostgreSQL seems doesn't flush dirty buffers related to unlogged tables in
the database during alter database set tablespace ...;
Test case:
mboguk=# create database test tablespace tmp;
CREATE DATABASE
mboguk=# \c test
You are now connected to database "test" as user "mboguk".
test=# create unlogged table test (id integer);
CREATE TABLE
test=# insert into test select * from generate_series(1,10000000);
INSERT 0 10000000
test=# \c postgres
You are now connected to database "postgres" as user "mboguk".
postgres=# alter database test set tablespace pg_default;
ALTER DATABASE
postgres=# checkpoint;
ERROR: checkpoint request failed
HINT: Consult recent messages in the server log for details.
In PostgreSQL logs:
2014-06-16 23:16:41 EDT ERROR: could not open file
"pg_tblspc/16558/PG_9.3_201306121/16559/16560": No such file or directory
2014-06-16 23:16:41 EDT CONTEXT: writing block 27059 of relation
pg_tblspc/16558/PG_9.3_201306121/16559/16560
2014-06-16 23:16:41 EDT WARNING: could not write block 27059 of
pg_tblspc/16558/PG_9.3_201306121/16559/16560
2014-06-16 23:16:41 EDT DETAIL: Multiple failures --- write error might be
permanent.
Thanks for the report. I can reproduce this on the current HEAD as well albeit not with the exact same steps. For me, it happens during the shutdown checkpoint.
LOG: shutting down
FATAL: could not open file "pg_tblspc/24576/PG_9.5_201406121/40971/40972": No such file or directory
CONTEXT: writing block 0 of relation pg_tblspc/24576/PG_9.5_201406121/40971/40972
WARNING: buffer refcount leak: [193] (rel=pg_tblspc/24576/PG_9.5_201406121/40971/40972, blockNum=0, flags=0x97, refcount=1 1)
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "/home/pavan.deolasee/work/pgsql/postgresql/src/backend/storage/buffer/bufmgr.c", Line: 1773)
LOG: checkpointer process (PID 2070) was terminated by signal 6: Aborted
It's clearly a bug. During a normal or a forced CHECKPOINT, we don't write buffers of UNLOGGED tables, even if they are dirty. ALTER DATABASE SET TABLESPACE relies on the checkpoint mechnism to ensure that all dirty buffers are written to the disk before proceeding with moving the files to the new tablespace. Leaving behind those dirty buffers with old tablespace and old relfilenode causes problems later when we try to sync those dirty buffers to the disk. AFAICS it can happen during the SHUTDOWN checkpoint or the End-of-Recovery checkpoint, at least in the HEAD.
ISTM that the right fix is to write *all* dirty pages during a FORCE checkpoint since system relies on FORCE checkpoints to handle such alterations. Attached patch fixes this for the HEAD. But this needs to be fixed all the way to 9.1 when unlogged tables were first introduced.
Thanks,
Pavan
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
Attachment
On Wed, Jun 18, 2014 at 12:31 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
Attached patch fixes this for the HEAD. But this needs to be fixed all the way to 9.1 when unlogged tables were first introduced.
I'd forgotten to update the comment in the code. Fixed in the attached version.
Thanks,
Pavan
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
Attachment
On 2014-06-18 15:33:58 +0530, Pavan Deolasee wrote: > On Wed, Jun 18, 2014 at 12:31 PM, Pavan Deolasee <pavan.deolasee@gmail.com> > wrote: > > > > > Attached patch fixes this for the HEAD. But this needs to be fixed all > >> the way to 9.1 when unlogged tables were first introduced. > >> > > > > > I'd forgotten to update the comment in the code. Fixed in the attached > version. > > Thanks, > Pavan > > -- > Pavan Deolasee > http://www.linkedin.com/in/pavandeolasee > diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c > index c070278..e0b9aff 100644 > --- a/src/backend/storage/buffer/bufmgr.c > +++ b/src/backend/storage/buffer/bufmgr.c > @@ -1221,11 +1221,12 @@ BufferSync(int flags) > ResourceOwnerEnlargeBuffers(CurrentResourceOwner); > > /* > - * Unless this is a shutdown checkpoint, we write only permanent, dirty > - * buffers. But at shutdown or end of recovery, we write all dirty > - * buffers. > + * If a force checkpoint or at shutdown or end of recovery, we write all > + * dirty buffers, otherwise we only write permanent buffers > */ > - if (!((flags & CHECKPOINT_IS_SHUTDOWN) || (flags & CHECKPOINT_END_OF_RECOVERY))) > + if (!((flags & CHECKPOINT_IS_SHUTDOWN) || > + (flags & CHECKPOINT_END_OF_RECOVERY) || > + (flags & CHECKPOINT_FORCE))) > mask |= BM_PERMANENT; Won't that significantly regress manually issued CHECKPOINT;s? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jun 18, 2014 at 3:40 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > Won't that significantly regress manually issued CHECKPOINT;s? > > May be if there are large unlogged table(s) which are frequently updated between manual checkpoints. I don't know how unlogged tables are being currently used to make that call. We could add another flag and use that while taking system checkpoints. But I wonder if not flushing dirty buffers of unlogged tables at a checkpoint is a bad idea anyways. User might expect that the unlogged tables to sustain server crash or unclean shutdown if there had been no writes after successful manual checkpoint(s). I understand we provide no such guarantee and its explicitly stated in the documentation, but user may have that expectation after a successful checkpoint or at the least would expect a way to flush unlogged tables to disk. Currently I see no way of doing that. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
On 2014-06-18 16:15:47 +0530, Pavan Deolasee wrote: > On Wed, Jun 18, 2014 at 3:40 PM, Andres Freund <andres@2ndquadrant.com> > wrote: > > > > > Won't that significantly regress manually issued CHECKPOINT;s? > > > > > May be if there are large unlogged table(s) which are frequently updated > between manual checkpoints. I don't know how unlogged tables are being > currently used to make that call. I think that's actually one of the major reasons to use unlogged tables. > We could add another flag and use that while taking system checkpoints. Don't really have a better idea. CHECKPOINT_FLUSH_ALL? > But I wonder if not flushing dirty buffers > of unlogged tables at a checkpoint is a bad idea anyways. User might expect > that the unlogged tables to sustain server crash or unclean shutdown if > there had been no writes after successful manual checkpoint(s). They'll get reset at unlcean startup anyway. Independent of having been touched or not. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-06-18 16:15:47 +0530, Pavan Deolasee wrote: >> But I wonder if not flushing dirty buffers >> of unlogged tables at a checkpoint is a bad idea anyways. User might expect >> that the unlogged tables to sustain server crash or unclean shutdown if >> there had been no writes after successful manual checkpoint(s). > They'll get reset at unlcean startup anyway. Independent of having been > touched or not. I'm with Pavan on this one: it's *not* a good thing that manually issued checkpoints skip unlogged tables. That's surprising, possibly dangerous, and no case whatsoever has been made that anyone sees it as an important performance benefit. I trust that a shutdown checkpoint, at least, would write such pages? If so, I'd expect that a manual checkpoint would do it too. Maybe I'm checkpointing because I want to be sure that the shutdown will be quick so I can do a minor release update with minimal downtime. I think we should just change this. -1 for new flags and more complication. regards, tom lane
On 2014-06-18 09:45:46 -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-06-18 16:15:47 +0530, Pavan Deolasee wrote: > >> But I wonder if not flushing dirty buffers > >> of unlogged tables at a checkpoint is a bad idea anyways. User might expect > >> that the unlogged tables to sustain server crash or unclean shutdown if > >> there had been no writes after successful manual checkpoint(s). > > > They'll get reset at unlcean startup anyway. Independent of having been > > touched or not. > > I'm with Pavan on this one: it's *not* a good thing that manually issued > checkpoints skip unlogged tables. That's surprising, possibly dangerous, > and no case whatsoever has been made that anyone sees it as an important > performance benefit. I don't understand what dangers it has? Any unclean shutdown resets all unlogged tables. /* REDO */ if (InRecovery) { ... /* * We're in recovery, so unlogged relations may be trashed and must be * reset. This should be done BEFORE allowing Hot Standby * connections, so that read-only backends don't try to read whatever * garbage is left over from before. */ ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP); ... And I don't really see why it's so surprising given that normal checkpoints skip those buffers? > I trust that a shutdown checkpoint, at least, would write such pages? Yes. > If so, I'd expect that a manual checkpoint would do it too. Maybe > I'm checkpointing because I want to be sure that the shutdown will be > quick so I can do a minor release update with minimal downtime. That's a fair argument. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jun 18, 2014 at 7:49 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-06-18 09:45:46 -0400, Tom Lane wrote: > > Andres Freund <andres@2ndquadrant.com> writes: > > > On 2014-06-18 16:15:47 +0530, Pavan Deolasee wrote: > > >> But I wonder if not flushing dirty buffers > > >> of unlogged tables at a checkpoint is a bad idea anyways. User might > expect > > >> that the unlogged tables to sustain server crash or unclean shutdown > if > > >> there had been no writes after successful manual checkpoint(s). > > > > > They'll get reset at unlcean startup anyway. Independent of having been > > > touched or not. > > > > I'm with Pavan on this one: it's *not* a good thing that manually issued > > checkpoints skip unlogged tables. That's surprising, possibly dangerous, > > and no case whatsoever has been made that anyone sees it as an important > > performance benefit. > > I don't understand what dangers it has? Any unclean shutdown resets all > unlogged tables. > > Looks like there is no agreement on this. I agree with Andreas that given the current mechanism of truncating unlogged relations at the end of redo recovery, there is no danger in not flushing the dirty buffers belonging to unlogged relation at a normal checkpoint. Having said that, I find it confusing that we don't do that, for one reason that Tom explained and also because there is practically just no way to flush those dirty buffers to disk if the user wants so. Also, there had been discussions about altering unlogged tables to normal tables and we may also want to improve upon the current mechanism of truncating unlogged relations at the end of recovery even if the table was fully synced to the disk. It looks simpler to just flush everything instead of devising a new flag for checkpoint. Anyone else has an opinion on this? Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
On Fri, Jun 20, 2014 at 11:34 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
Anyone else has an opinion on this?Looks like there is no agreement on this. I agree with Andreas that given the current mechanism of truncating unlogged relations at the end of redo recovery, there is no danger in not flushing the dirty buffers belonging to unlogged relation at a normal checkpoint. Having said that, I find it confusing that we don't do that, for one reason that Tom explained and also because there is practically just no way to flush those dirty buffers to disk if the user wants so.Also, there had been discussions about altering unlogged tables to normal tables and we may also want to improve upon the current mechanism of truncating unlogged relations at the end of recovery even if the table was fully synced to the disk. It looks simpler to just flush everything instead of devising a new flag for checkpoint.
Since I did not hear anything on this, I created a patch that adds a new flag to tell checkpointer to flush all pages to the disk. Tom (and even I) have reservations about the approach, but I would nevertheless leave it to the committer to decide. IMV we must fix this bug one way or the other. Otherwise users face risk of failing to do clean shutdown.
Thanks,
Pavan
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
Attachment
On 2014-06-20 23:34:36 +0530, Pavan Deolasee wrote: > On Wed, Jun 18, 2014 at 7:49 PM, Andres Freund <andres@2ndquadrant.com> > wrote: > > > On 2014-06-18 09:45:46 -0400, Tom Lane wrote: > > > Andres Freund <andres@2ndquadrant.com> writes: > > > > On 2014-06-18 16:15:47 +0530, Pavan Deolasee wrote: > > > I'm with Pavan on this one: it's *not* a good thing that manually issued > > > checkpoints skip unlogged tables. That's surprising, possibly dangerous, > > > and no case whatsoever has been made that anyone sees it as an important > > > performance benefit. > > > > I don't understand what dangers it has? Any unclean shutdown resets all > > unlogged tables. > > > > > Looks like there is no agreement on this. I agree with Andreas that given > the current mechanism of truncating unlogged relations at the end of redo > recovery, there is no danger in not flushing the dirty buffers belonging to > unlogged relation at a normal checkpoint. Having said that, I find it > confusing that we don't do that, for one reason that Tom explained and also > because there is practically just no way to flush those dirty buffers to > disk if the user wants so. Which reason, except the faster shutdown, for a user to want that? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2014-07-02 17:28:25 +0530, Pavan Deolasee wrote: > On Fri, Jun 20, 2014 at 11:34 PM, Pavan Deolasee <pavan.deolasee@gmail.com> > wrote: > > > > > > Looks like there is no agreement on this. I agree with Andreas that given > > the current mechanism of truncating unlogged relations at the end of redo > > recovery, there is no danger in not flushing the dirty buffers belonging to > > unlogged relation at a normal checkpoint. Having said that, I find it > > confusing that we don't do that, for one reason that Tom explained and also > > because there is practically just no way to flush those dirty buffers to > > disk if the user wants so. > > > > Also, there had been discussions about altering unlogged tables to normal > > tables and we may also want to improve upon the current mechanism of > > truncating unlogged relations at the end of recovery even if the table was > > fully synced to the disk. It looks simpler to just flush everything instead > > of devising a new flag for checkpoint. > > > > Anyone else has an opinion on this? > > > > > Since I did not hear anything on this, I created a patch that adds a new > flag to tell checkpointer to flush all pages to the disk. Tom (and even I) > have reservations about the approach, but I would nevertheless leave it to > the committer to decide. IMV we must fix this bug one way or the other. > Otherwise users face risk of failing to do clean shutdown. I think one reason for the separate flag is that the checkpoint performed by pg_start_backup/pg_basebackup shouldn't just become more expensive because unlogged tables are needlessly flushed to disk. After all, unlogged tables are used because normal tables have a too high overhead in that scenario. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jul 2, 2014 at 5:36 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > Which reason, except the faster shutdown, for a user to want that? > > May be to clean up shared buffers so that a regular backend does not need to write the page during eviction? Of course bgwriter would that but not under user's control. Say user wants to force a checkpoint before some high priority, response time sensitive tasks are started. Not sure though if this a valid use case in real world. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
Andres Freund <andres@2ndquadrant.com> writes: > I think one reason for the separate flag is that the checkpoint > performed by pg_start_backup/pg_basebackup shouldn't just become more > expensive because unlogged tables are needlessly flushed to disk. After > all, unlogged tables are used because normal tables have a too high > overhead in that scenario. AFAIK, the "overhead" that unlogged tables are trying to avoid is WAL I/O. Nobody has argued (until this thread) that we are worried about whether checkpoints write them. regards, tom lane
On Wed, Jul 2, 2014 at 11:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@2ndquadrant.com> writes: >> I think one reason for the separate flag is that the checkpoint >> performed by pg_start_backup/pg_basebackup shouldn't just become more >> expensive because unlogged tables are needlessly flushed to disk. After >> all, unlogged tables are used because normal tables have a too high >> overhead in that scenario. > > AFAIK, the "overhead" that unlogged tables are trying to avoid is WAL > I/O. Nobody has argued (until this thread) that we are worried about > whether checkpoints write them. Sorry, I didn't see this thread until just now. I was definitely worried about that issue when I wrote the unlogged tables patch, and I added the BM_PERMANENT for precisely that reason. I think it's a completely legitimate worry, too. The point of avoiding WAL I/O is that writing WAL to disk is expensive; writing data files to disk is no less expensive, and often moreso, because it's not always sequential I/O. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-07-02 11:38:51 -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > I think one reason for the separate flag is that the checkpoint > > performed by pg_start_backup/pg_basebackup shouldn't just become more > > expensive because unlogged tables are needlessly flushed to disk. After > > all, unlogged tables are used because normal tables have a too high > > overhead in that scenario. > > AFAIK, the "overhead" that unlogged tables are trying to avoid is WAL > I/O. Nobody has argued (until this thread) that we are worried about > whether checkpoints write them. I don't think that's true. In production scenarios checkpoint IO is one of the two top problems I see (the other being crazy amount of WAL due to FPIs because of too short checkpoints). And unlogged tables have explicitly been excluded from checkpoints, so it's not like nobody has thought about it. I seem to recall lengthy discussions even. Yep the discussion is around http://archives.postgresql.org/message-id/AANLkTimxC%2BG9M9_s0dXa_huoAeZpkCmoWCo5S-7DsLi%3D%40mail.gmail.com Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jul 2, 2014 at 5:28 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > On Fri, Jun 20, 2014 at 11:34 PM, Pavan Deolasee <pavan.deolasee@gmail.com > > wrote: >> >> >> Looks like there is no agreement on this. I agree with Andreas that given >> the current mechanism of truncating unlogged relations at the end of redo >> recovery, there is no danger in not flushing the dirty buffers belonging to >> unlogged relation at a normal checkpoint. Having said that, I find it >> confusing that we don't do that, for one reason that Tom explained and also >> because there is practically just no way to flush those dirty buffers to >> disk if the user wants so. >> >> Also, there had been discussions about altering unlogged tables to normal >> tables and we may also want to improve upon the current mechanism of >> truncating unlogged relations at the end of recovery even if the table was >> fully synced to the disk. It looks simpler to just flush everything instead >> of devising a new flag for checkpoint. >> >> Anyone else has an opinion on this? >> >> > Since I did not hear anything on this, I created a patch that adds a new > flag to tell checkpointer to flush all pages to the disk. Tom (and even I) > have reservations about the approach, but I would nevertheless leave it to > the committer to decide. IMV we must fix this bug one way or the other. > Otherwise users face risk of failing to do clean shutdown. > > As Robert as voted in favor of keeping existing checkpoint behavior intact, should we consider this patch before the minor releases are out next week? Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
On Wed, Jul 16, 2014 at 1:32 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > > > > On Wed, Jul 2, 2014 at 5:28 PM, Pavan Deolasee <pavan.deolasee@gmail.com> > wrote: >> >> On Fri, Jun 20, 2014 at 11:34 PM, Pavan Deolasee >> <pavan.deolasee@gmail.com> wrote: >>> >>> >>> Looks like there is no agreement on this. I agree with Andreas that given >>> the current mechanism of truncating unlogged relations at the end of redo >>> recovery, there is no danger in not flushing the dirty buffers belonging to >>> unlogged relation at a normal checkpoint. Having said that, I find it >>> confusing that we don't do that, for one reason that Tom explained and also >>> because there is practically just no way to flush those dirty buffers to >>> disk if the user wants so. >>> >>> Also, there had been discussions about altering unlogged tables to normal >>> tables and we may also want to improve upon the current mechanism of >>> truncating unlogged relations at the end of recovery even if the table was >>> fully synced to the disk. It looks simpler to just flush everything instead >>> of devising a new flag for checkpoint. >>> >>> Anyone else has an opinion on this? >>> >> >> Since I did not hear anything on this, I created a patch that adds a new >> flag to tell checkpointer to flush all pages to the disk. Tom (and even I) >> have reservations about the approach, but I would nevertheless leave it to >> the committer to decide. IMV we must fix this bug one way or the other. >> Otherwise users face risk of failing to do clean shutdown. >> > > As Robert as voted in favor of keeping existing checkpoint behavior intact, > should we consider this patch before the minor releases are out next week? What's the status of this? ISTM that the patch has not been applied yet. Right? Regards, -- Fujii Masao
On Thu, Aug 7, 2014 at 5:54 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > > > What's the status of this? There was no clear consensus. The thread has patches per both the approaches. I tried to bring it up couple of times, last time just before the minor releases were out. But did not get any attention. > ISTM that the patch has not been applied yet. Right? > > Right. I've now added it to the commitfest so that we don't lose track of it. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
On Fri, Aug 8, 2014 at 4:08 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > On Thu, Aug 7, 2014 at 5:54 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> >> >> >> What's the status of this? > > > There was no clear consensus. The thread has patches per both the > approaches. I tried to bring it up couple of times, last time just before > the minor releases were out. But did not get any attention. I vote to make checkpoint flush unlogged tables only when the special flag (that maybe we will introduce) is set. Unlogged tables are not flushed by manual checkpoint. Per Robert, that's originally intentional. No dangerous risk that may cause has been reported yet. So I don't think that we need to change manual checkpoint so that it always flushes unlogged tables, at least for now. OTOH, I agree to extend manual checkpoint so that it can flush unlogged table to shorten the time of subsequent shutdown checkpoint. For example, we can implement something like "CHECKPOINT [ALL | PERMANENT | UNLOGGED]". But that's a feature request. We should work on this later, not for now. >> ISTM that the patch has not been applied yet. Right? >> > > Right. I've now added it to the commitfest so that we don't lose track of > it. Thanks! Regards, -- Fujii Masao
From: "Pavan Deolasee" <pavan.deolasee@gmail.com> >> Anyone else has an opinion on this? >> >> > Since I did not hear anything on this, I created a patch that adds a new > flag to tell checkpointer to flush all pages to the disk. Tom (and even I) > have reservations about the approach, but I would nevertheless leave it to > the committer to decide. IMV we must fix this bug one way or the other. > Otherwise users face risk of failing to do clean shutdown. I'm reviewing this patch. According to the approach of the patch, CREATE DATABASE also needs the new flag to sync the unlogged tables before copying them from source database to the new one. You need to check other places which do checkpointing. I'm with Tom-san: I think it would be better for online checkpoints to sync unlogged tables without a new flag. The reasons are: * There's a greater danger of losing data during operating system restart. For example, IIRC, Windows gives only 20 seconds to terminate all services during OS shutdown. If many dirty buffers for unlogged tables linger in the shared buffers, PostgreSQL service may fail to complete database shutdown. Even if the online checkpoint writes out all dirty buffers, the possibility of there being many dirty buffers at shutdown is not zero, but the probability would be lower. * As you missed the CREATE DATABASE case, we may fail to specify the new flag when necessary in the future code. * Unlogged tables are, as its name suggests, for better performance by not writing WAL. * If the online checkpoint leaves dirty buffers, it would be more probable that the backends have to write them out when evicting the buffers. This may also indicate some influence on how to interpret pg_stat_bgwriter. I'm inclined to just make the online checkpoint write out all dirty buffers. Could you consider this again? Let me mark this as waiting on authoer now. Regards MauMau
On 2014-08-13 21:57:08 +0900, MauMau wrote: > From: "Pavan Deolasee" <pavan.deolasee@gmail.com> > >>Anyone else has an opinion on this? > >> > >> > >Since I did not hear anything on this, I created a patch that adds a new > >flag to tell checkpointer to flush all pages to the disk. Tom (and even I) > >have reservations about the approach, but I would nevertheless leave it to > >the committer to decide. IMV we must fix this bug one way or the other. > >Otherwise users face risk of failing to do clean shutdown. > > I'm reviewing this patch. > > According to the approach of the patch, CREATE DATABASE also needs the new > flag to sync the unlogged tables before copying them from source database to > the new one. You need to check other places which do checkpointing. > I'm with Tom-san: I think it would be better for online checkpoints to sync > unlogged tables without a new flag. The reasons are: That'd mean that the next pointrelease will incur *significantly* higher IO in many setups. If you currently have a workload where all dirty buffers of unlogged tables fit in s_b you'll never have any OS writes. That'd completely change. A compromise would be to couple it to CHECKPOINT_IMMEDIATE and/or _FORCE instead of creating a new flag - I think this is actually what Tom proposed. I'd planned to look (and possibly) commit this patch soonish. If I have to decide it'd be a new flag. > * There's a greater danger of losing data during operating system restart. > For example, IIRC, Windows gives only 20 seconds to terminate all services > during OS shutdown. If many dirty buffers for unlogged tables linger in the > shared buffers, PostgreSQL service may fail to complete database shutdown. > Even if the online checkpoint writes out all dirty buffers, the possibility > of there being many dirty buffers at shutdown is not zero, but the > probability would be lower. Meh. That won't lead to data loss, just recovery on restart. And 20s isn't sufficient for any halfway busy database anyway. > * Unlogged tables are, as its name suggests, for better performance by not > writing WAL. Pft. They exist because they have higher performance due to lower durability guarantees. Should we really create CREATE UNLOGGED AND NOT WRITTEN UNTIL REALLY NECESSARY TABLE ...(...)? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
From: "Andres Freund" <andres@2ndquadrant.com> > That'd mean that the next pointrelease will incur *significantly* higher > IO in many setups. If you currently have a workload where all dirty > buffers of unlogged tables fit in s_b you'll never have any OS > writes. That'd completely change. Yes, that's the only headache... But I'm not worried so much, because bgwriter and/or backends may write out dirty buffers for unlogged tables, so the total I/O is not low anyway. > A compromise would be to couple it to CHECKPOINT_IMMEDIATE and/or _FORCE > instead of creating a new flag - I think this is actually what Tom > proposed. That rescues the CREATE DATABASE case, and would probably prevent potential similar problems. >> * There's a greater danger of losing data during operating system >> restart. >> For example, IIRC, Windows gives only 20 seconds to terminate all >> services >> during OS shutdown. If many dirty buffers for unlogged tables linger in >> the >> shared buffers, PostgreSQL service may fail to complete database >> shutdown. >> Even if the online checkpoint writes out all dirty buffers, the >> possibility >> of there being many dirty buffers at shutdown is not zero, but the >> probability would be lower. > > Meh. That won't lead to data loss, just recovery on restart. And 20s > isn't sufficient for any halfway busy database anyway. The unlogged tables are emptied (= data loss) at recovery restart. Regards MauMau
On 2014-08-13 23:31:46 +0900, MauMau wrote: > From: "Andres Freund" <andres@2ndquadrant.com> > >That'd mean that the next pointrelease will incur *significantly* higher > >IO in many setups. If you currently have a workload where all dirty > >buffers of unlogged tables fit in s_b you'll never have any OS > >writes. That'd completely change. > > Yes, that's the only headache... But I'm not worried so much, because > bgwriter and/or backends may write out dirty buffers for unlogged tables, so > the total I/O is not low anyway. If the workload fits in s_b - not an infrequent thing with today's memory sizes - that's simply not true. > >>* There's a greater danger of losing data during operating system > >>restart. > >>For example, IIRC, Windows gives only 20 seconds to terminate all > >>services > >>during OS shutdown. If many dirty buffers for unlogged tables linger in > >>the > >>shared buffers, PostgreSQL service may fail to complete database > >>shutdown. > >>Even if the online checkpoint writes out all dirty buffers, the > >>possibility > >>of there being many dirty buffers at shutdown is not zero, but the > >>probability would be lower. > > > >Meh. That won't lead to data loss, just recovery on restart. And 20s > >isn't sufficient for any halfway busy database anyway. > > The unlogged tables are emptied (= data loss) at recovery restart. If that's data loss you shouldn't be using unlogged tables. That's not an argument. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 08/13/2014 04:32 PM, Andres Freund wrote: > On 2014-08-13 23:31:46 +0900, MauMau wrote: >> From: "Andres Freund" <andres@2ndquadrant.com> >>> That'd mean that the next pointrelease will incur *significantly* higher >>> IO in many setups. If you currently have a workload where all dirty >>> buffers of unlogged tables fit in s_b you'll never have any OS >>> writes. That'd completely change. >> >> Yes, that's the only headache... But I'm not worried so much, because >> bgwriter and/or backends may write out dirty buffers for unlogged tables, so >> the total I/O is not low anyway. > > If the workload fits in s_b - not an infrequent thing with today's > memory sizes - that's simply not true. > >>>> * There's a greater danger of losing data during operating system >>>> restart. >>>> For example, IIRC, Windows gives only 20 seconds to terminate all >>>> services >>>> during OS shutdown. If many dirty buffers for unlogged tables linger in >>>> the >>>> shared buffers, PostgreSQL service may fail to complete database >>>> shutdown. >>>> Even if the online checkpoint writes out all dirty buffers, the >>>> possibility >>>> of there being many dirty buffers at shutdown is not zero, but the >>>> probability would be lower. >>> >>> Meh. That won't lead to data loss, just recovery on restart. And 20s >>> isn't sufficient for any halfway busy database anyway. >> >> The unlogged tables are emptied (= data loss) at recovery restart. > > If that's data loss you shouldn't be using unlogged tables. That's not > an argument. There is also this issue which has been bugging me for a while but I haven't had time to look at providing a patch for: postgres=# create unlogged table t (id integer); CREATE TABLE postgres=# insert into t values (1); INSERT 0 1 postgres=# create index on t using hash (id); CREATE INDEX <crash and restart server here> postgres=# set enable_seqscan = off; SET postgres=# select * from t where id = 1; ERROR: index "t_id_idx" contains unexpected zero page at block 0 HINT: Please REINDEX it. All because the init fork is never checkpointed to disk. If there's anywhere a hash index should be safe to use, it's on unlogged tables. -- Vik
On 2014-08-13 18:43:53 +0200, Vik Fearing wrote: > There is also this issue which has been bugging me for a while but I > haven't had time to look at providing a patch for: > > postgres=# create unlogged table t (id integer); > CREATE TABLE > postgres=# insert into t values (1); > INSERT 0 1 > postgres=# create index on t using hash (id); > CREATE INDEX > > <crash and restart server here> > > postgres=# set enable_seqscan = off; > SET > postgres=# select * from t where id = 1; > ERROR: index "t_id_idx" contains unexpected zero page at block 0 > HINT: Please REINDEX it. > > All because the init fork is never checkpointed to disk. If there's > anywhere a hash index should be safe to use, it's on unlogged tables. I don't think this really is related. For one, this this surely can't be fixed with anything checkpoint related. The overhead of that would be prohibitive. Other *builempty routines use smgrimmedsync(), but hash doesn't. That's the problem here. Note that that still won't make it safe across streaming rep et al.... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2014-07-02 17:28:25 +0530, Pavan Deolasee wrote: > flag to tell checkpointer to flush all pages to the disk. Tom (and even I) > have reservations about the approach, but I would nevertheless leave it to > the committer to decide. IMV we must fix this bug one way or the other. > Otherwise users face risk of failing to do clean shutdown. I'm looking into this again now. We haven't really come to a agreement about which approach to take. So unless there's some progress on that front I'll push a variant of this patch. That seems better than leaving the issue open for another round of releases. Issues I've found with the patch so far: * The RequestCheckpoint() in createdb() also needs CHECKPOINT_FLUSH_ALL. * I don't think it's wise to renumber the CHECKPOINT_* flags in a commit that's supposed to be backpatched. That'll cause interesting issues if there's a extension out there containing a RequestCheckpoint() call. * I've also done some minor code and comment changes. Attached is my new version. I've confirmed that I could reproduce various broken behaviour before (wrong query results, ERRORs in further queries trying to write out buffers, shutdown failures), but not after. I'll note that I think there's arguably another bug that participated in the shutdown/checkpoint error you reported in http://archives.postgresql.org/message-id/CABOikdMxX0VdJEKSBd62sX_XAwa_%3DMAFqdAXXbU5V%2BBHZOxrng%40mail.gmail.com When moving a database into a different tablespace we never invalidate it's buffers even though they don't correspond to any existing files. The only reason that currently works is because they're not dirty and thus never written back or anything. I think movedb() should also grow a DropDatabaseBuffers() call to fix that. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Thu, Oct 9, 2014 at 10:52 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Hi, > > On 2014-07-02 17:28:25 +0530, Pavan Deolasee wrote: >> flag to tell checkpointer to flush all pages to the disk. Tom (and even I) >> have reservations about the approach, but I would nevertheless leave it to >> the committer to decide. IMV we must fix this bug one way or the other. >> Otherwise users face risk of failing to do clean shutdown. > > I'm looking into this again now. We haven't really come to a agreement > about which approach to take. So unless there's some progress on that > front I'll push a variant of this patch. That seems better than leaving > the issue open for another round of releases. > > Issues I've found with the patch so far: > * The RequestCheckpoint() in createdb() also needs CHECKPOINT_FLUSH_ALL. > * I don't think it's wise to renumber the CHECKPOINT_* flags in a commit > that's supposed to be backpatched. That'll cause interesting issues > if there's a extension out there containing a RequestCheckpoint() > call. > * I've also done some minor code and comment changes. > > Attached is my new version. I've confirmed that I could reproduce > various broken behaviour before (wrong query results, ERRORs in further > queries trying to write out buffers, shutdown failures), but not after. +1 for applying this change. - (flags & CHECKPOINT_CAUSE_TIME) ? " time" : ""); + (flags & CHECKPOINT_CAUSE_TIME) ? " time" : "", + (flags & CHECKPOINT_FLUSH_ALL) ? " flush-all" :""); ISTM that you forgot to add the following change. - msg = "restartpoint starting:%s%s%s%s%s%s%s"; + msg = "restartpoint starting:%s%s%s%s%s%s%s%s"; else - msg = "checkpoint starting:%s%s%s%s%s%s%s"; + msg = "checkpoint starting:%s%s%s%s%s%s%s%s"; Regards, -- Fujii Masao
On 18 June 2014 14:45, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@2ndquadrant.com> writes: >> On 2014-06-18 16:15:47 +0530, Pavan Deolasee wrote: >>> But I wonder if not flushing dirty buffers >>> of unlogged tables at a checkpoint is a bad idea anyways. User might expect >>> that the unlogged tables to sustain server crash or unclean shutdown if >>> there had been no writes after successful manual checkpoint(s). > >> They'll get reset at unlcean startup anyway. Independent of having been >> touched or not. > > I'm with Pavan on this one: it's *not* a good thing that manually issued > checkpoints skip unlogged tables. That's surprising, possibly dangerous, > and no case whatsoever has been made that anyone sees it as an important > performance benefit. > > I trust that a shutdown checkpoint, at least, would write such pages? > If so, I'd expect that a manual checkpoint would do it too. Maybe > I'm checkpointing because I want to be sure that the shutdown will be > quick so I can do a minor release update with minimal downtime. I think manual checkpoints should flush everything. This is a valid use case. What other use case is there for a manual checkpoint? > I think we should just change this. -1 for new flags and more > complication. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2014-10-16 04:08:50 +0100, Simon Riggs wrote: > On 18 June 2014 14:45, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Andres Freund <andres@2ndquadrant.com> writes: > >> On 2014-06-18 16:15:47 +0530, Pavan Deolasee wrote: > >>> But I wonder if not flushing dirty buffers > >>> of unlogged tables at a checkpoint is a bad idea anyways. User might expect > >>> that the unlogged tables to sustain server crash or unclean shutdown if > >>> there had been no writes after successful manual checkpoint(s). > > > >> They'll get reset at unlcean startup anyway. Independent of having been > >> touched or not. > > > > I'm with Pavan on this one: it's *not* a good thing that manually issued > > checkpoints skip unlogged tables. That's surprising, possibly dangerous, > > and no case whatsoever has been made that anyone sees it as an important > > performance benefit. > > > > I trust that a shutdown checkpoint, at least, would write such pages? > > If so, I'd expect that a manual checkpoint would do it too. Maybe > > I'm checkpointing because I want to be sure that the shutdown will be > > quick so I can do a minor release update with minimal downtime. > > I think manual checkpoints should flush everything. > > This is a valid use case. > > What other use case is there for a manual checkpoint? There's people using frequent manual checkpoints to keep performance predictable. Unfortunately that actually can improve jitter quite measurably. If we want to change this, fine, but we shouldn't sneak it into the back branches, together with a correctness fix Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 16 October 2014 05:26, Andres Freund <andres@2ndquadrant.com> wrote: >> I think manual checkpoints should flush everything. >> >> This is a valid use case. >> >> What other use case is there for a manual checkpoint? > > There's people using frequent manual checkpoints to keep performance > predictable. Unfortunately that actually can improve jitter quite > measurably. Hmm, more discussion required there it would seem. > If we want to change this, fine, but we shouldn't sneak it into the back > branches, together with a correctness fix The bug lies in the default behaviour, which we must fix. I agree backpatching is awkward, though there may also be people who believe that a CHECKPOINT flushes everything, plus other related bugs may be lurking. Seems like we could backpatch this... CHECKPOINT [ALL (defult) | PERMANENT] and people who are doing CHECKPOINT PERMANENT for performance reasons can add the new keyword easy enough, if we highlight it in the release notes. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2014-10-16 09:52:58 +0100, Simon Riggs wrote: > On 16 October 2014 05:26, Andres Freund <andres@2ndquadrant.com> wrote: > > >> I think manual checkpoints should flush everything. > >> > >> This is a valid use case. > >> > >> What other use case is there for a manual checkpoint? > > > > There's people using frequent manual checkpoints to keep performance > > predictable. Unfortunately that actually can improve jitter quite > > measurably. > > Hmm, more discussion required there it would seem. Can we please decouple the discussion about an actual bug fix about the discussion about behavioural changes? It really doesn't seem to be helpful. The bug lingered for weeks because of this, and I really don't want it to miss another minor release. > > If we want to change this, fine, but we shouldn't sneak it into the back > > branches, together with a correctness fix > > The bug lies in the default behaviour, which we must fix. But the bug is unrelated to manual checkpoints. So I don't think this is really related. > I agree backpatching is awkward, though there may also be people who > believe that a CHECKPOINT flushes everything, plus other related bugs > may be lurking. > Seems like we could backpatch this... > CHECKPOINT [ALL (defult) | PERMANENT] Why. This really is a new feature. If we want the new thing, can we please treat it as any other feature? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 16 October 2014 10:00, Andres Freund <andres@2ndquadrant.com> wrote: >> The bug lies in the default behaviour, which we must fix. > > But the bug is unrelated to manual checkpoints. So I don't think this is > really related. Yes, agreed; I was just thinking that myself. We need to decouple the bug fix from any behaviour change of the CHECKPOINT command, so v3 is not the right thing for backpatching. We may choose to do v3 for later releases, but there does seem to be a case to retain the current behaviour in some form as well. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2014-10-16 10:56:46 +0100, Simon Riggs wrote: > On 16 October 2014 10:00, Andres Freund <andres@2ndquadrant.com> wrote: > > >> The bug lies in the default behaviour, which we must fix. > > > > But the bug is unrelated to manual checkpoints. So I don't think this is > > really related. > > Yes, agreed; I was just thinking that myself. > > We need to decouple the bug fix from any behaviour change of the > CHECKPOINT command, so v3 is not the right thing for backpatching. I'm not following? v3 is the one that doesn't change the behaviour of a manual CHECKPOINT? v1/v2 changed the behaviour of it? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2014-10-10 16:15:58 +0900, Fujii Masao wrote: > +1 for applying this change. I've just done so. > - (flags & CHECKPOINT_CAUSE_TIME) ? " time" : ""); > + (flags & CHECKPOINT_CAUSE_TIME) ? " time" : "", > + (flags & CHECKPOINT_FLUSH_ALL) ? " flush-all" :""); > > ISTM that you forgot to add the following change. > > - msg = "restartpoint starting:%s%s%s%s%s%s%s"; > + msg = "restartpoint starting:%s%s%s%s%s%s%s%s"; > else > - msg = "checkpoint starting:%s%s%s%s%s%s%s"; > + msg = "checkpoint starting:%s%s%s%s%s%s%s%s"; Good catch. Thanks for having a look! Interesting that gcc can't deduce this... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services