Thread: making an unlogged table logged
Somebody asked about this on Depesz's blog today, and I think it's come up here before too, so I thought it might be worth my writing up a few comments on this. I don't think I'm going to have time to work on this any time soon, but if someone else wants to work up a patch, I'm game to review. I think it'd clearly be a good feature.

Generally, to do this, it would be necessary to do the following things (plus anything I'm forgetting):

1. Take an AccessExclusiveLock on the target table. You might think that concurrent selects could be allowed, but I believe that's not the case. Read on.

2. Verify that there are no foreign keys referencing other unlogged tables, because if that were the case then after the change we'd have a permanent table referencing an unlogged table, which would violate referential integrity. (Note that unlogged referencing permanent is OK, but permanent referencing unlogged is a no-no, so what matters when upgrading is "outbound" foreign keys.)

3. Write out all shared buffers for the target table, and drop them. This ensures that there are no buffers floating around for the target relation that are marked BM_UNLOGGED, which would be a bad thing. Or maybe it's possible to just clear the BM_UNLOGGED flag, instead of dropping them. This is the step that makes me think we need an access exclusive lock - otherwise, somebody else might read in a buffer and, seeing that the relation is unlogged (which is true, since we haven't committed yet), mark it BM_UNLOGGED.

4. fsync() any segments of the target relation - of any fork except the init fork - that might have dirty pages not on disk.

5. Arrange for the appropriate file deletions at commit or abort, by updating pendingDeletes. On commit, we want to delete the init fork for the table and all its indexes. On abort, we want to delete everything else, but only for pretend; that is, the abort record should reflect the deletions, since they'll need to happen on any standbys, but we shouldn't actually perform them on the master, since we don't want to obliterate the contents of the table for no reason. There's a subtle problem here I'm not quite sure how to deal with: what happens if we *crash* without writing an abort record? It seems like that could leave a stray file around on a standby, because the current code only cleans things up on the standby at the start of recovery; to make this bullet-proof, I suppose it'd need to repeat that every time a crash happens on the master, but I don't know how to do that. Note also that if wal_level is minimal, then we need only worry about the commit case; the abort case can be a no-op.

6. If wal_level != minimal, XLOG every page of every fork except the init fork, for both the table and the associated indexes. (Note that this step also requires an AccessExclusiveLock rather than some weaker lock, because of the arbitrary rule that only AccessExclusiveLocks are sent to standbys. If we held only ShareRowExclusiveLock on the master, for example, a Hot Standby backend might read the table while it's only been half-copied.)

7. Update pg_class.relpersistence from 'u' to 'p', for both the table and the associated indexes.

Going the other direction ought to be possible too, although it seems somewhat less useful. For that, you'd need to flip around the check in step #2 (i.e. check for a reference FROM a permanent table), perform step #3, skip step #4, do step #5 backwards (create and log init forks, arranging for them to be removed on abort - this too has an issue with crashes that don't leave abort records behind - and arrange for the rest of the forks to be removed on commit on any standby without doing it on the master), skip step #6, and do step #7 backwards.
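To make the commit/abort bookkeeping in steps 3, 5, and 7 concrete, here is a rough Python simulation - purely an illustrative sketch, not PostgreSQL's actual code; names like `Relation`, `pending_deletes`, and the string flags are loose stand-ins for the real C structures:

```python
# Illustrative simulation of steps 3, 5, and 7 of the proposed
# unlogged->logged conversion. All names here are invented stand-ins,
# not PostgreSQL's real structures.

class Relation:
    def __init__(self, name):
        self.name = name
        self.persistence = 'u'             # 'u' = unlogged, 'p' = permanent
        self.forks = {'main', 'fsm', 'init'}
        self.buffer_flags = {'BM_UNLOGGED'}

def set_logged(rel):
    """In-transaction work; returns the deletions scheduled for commit/abort."""
    # Step 3: flush the relation's buffers and clear BM_UNLOGGED
    # (here modeled as simply discarding the flag).
    rel.buffer_flags.discard('BM_UNLOGGED')
    # Step 5: schedule file deletions via the pendingDeletes mechanism.
    return {
        # On commit, the init fork goes away.
        'commit': [(rel.name, 'init')],
        # On abort, the other forks are deleted only "for pretend":
        # the abort record carries them for standbys, the master keeps them.
        'abort_wal_only': [(rel.name, f) for f in rel.forks if f != 'init'],
    }

def commit(rel, pending_deletes):
    for _name, fork in pending_deletes['commit']:
        rel.forks.discard(fork)
    rel.persistence = 'p'                  # step 7: flip relpersistence

rel = Relation('t')
pd = set_logged(rel)
commit(rel, pd)
assert rel.persistence == 'p' and 'init' not in rel.forks
```

The point of the `abort_wal_only` split is the subtlety in step 5: the abort-side deletions exist only in the WAL record, which is exactly why a crash that never writes an abort record can strand a file on a standby.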
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 4, 2011 at 7:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Somebody asked about this on Depesz's blog today, and I think it's
> come up here before too, so I thought it might be worth my writing up
> a few comments on this.
> [...]
A couple thoughts:

1. Could making a table logged use a non-exclusive lock if the ALTER is allowed to take a full checkpoint?

2. Unlogged to logged has a giant use case.

3. In MySQL I have had to ALTER tables to engine BLACKHOLE because they held data that was not vital, but the server was out of IO. Going logged -> unlogged has a significant place, I think.

-- Rob Wultsch wultsch@gmail.com
> A couple thoughts:
> 1. Could making a table logged use a non-exclusive lock if the
> ALTER is allowed to take a full checkpoint?

If possible, that would certainly be better. If the bgwriter is doing what it is supposed to, it would be relatively painless.

> 2. Unlogged to logged has a giant use case.

Agreed.

JD

-- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
On Tue, Jan 4, 2011 at 10:56 PM, Rob Wultsch <wultsch@gmail.com> wrote:
> 1. Could making a table logged use a non-exclusive lock if the
> ALTER is allowed to take a full checkpoint?

No, that doesn't solve either of the two problems I described, unfortunately.

> 2. Unlogged to logged has a giant use case.

Agree.

> 3. In MySQL I have had to ALTER tables to engine BLACKHOLE because
> they held data that was not vital, but the server was out of IO. Going
> logged -> unlogged has a significant place, I think.

Interesting. So you'd change a logged table into an unlogged table to cut down on I/O, and take the risk of losing the data if the server went down?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 05, 2011 at 09:04:08AM -0500, Robert Haas wrote:
> On Tue, Jan 4, 2011 at 10:56 PM, Rob Wultsch <wultsch@gmail.com> wrote:
> > 3. In MySQL I have had to ALTER tables to engine BLACKHOLE because
> > they held data that was not vital, but the server was out of IO.
> > Going logged -> unlogged has a significant place, I think.
>
> Interesting. So you'd change a logged table into an unlogged table
> to cut down on I/O, and take the risk of losing the data if the
> server went down?

BLACKHOLE is a "storage engine" that's equivalent to /dev/null, so it wasn't a risk /per se/.

Cheers, David.

-- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, Jan 5, 2011 at 7:48 AM, David Fetter <david@fetter.org> wrote:
> On Wed, Jan 05, 2011 at 09:04:08AM -0500, Robert Haas wrote:
>> No, that doesn't solve either of the two problems I described,
>> unfortunately.

That is too bad.

>> Interesting. So you'd change a logged table into an unlogged table
>> to cut down on I/O, and take the risk of losing the data if the
>> server went down?
>
> BLACKHOLE is a "storage engine" that's equivalent to /dev/null, so it
> wasn't a risk /per se/.

Exactly. It was data I could live without, and by having schema attached to /dev/null the application did not error out and die. It is a very bad option, and being able to turn off logging for a table is a much better one.

-- Rob Wultsch wultsch@gmail.com
On 1/4/11 6:41 PM, Robert Haas wrote:
> Going the other direction ought to be possible too, although it seems
> somewhat less useful.

Actually, it's more useful; many people who *upgrade* to 9.1 will want to convert one or two of their tables to unlogged.

Note that the conversion both ways can be worked around with the CREATE/ALTER TABLE do-si-do, so I don't think either of these is critical for 9.1. Unless you feel like working on them, of course.

-- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
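The CREATE/ALTER TABLE workaround mentioned above is presumably the usual create-copy-swap dance. A sketch of the statements involved, generated from Python for illustration - the helper, the table names, and the exact DDL are assumptions, not anything from the thread (and note the step #2 caveat: a table referenced by permanent-table foreign keys can't simply be swapped for an unlogged one):

```python
# Hypothetical sketch of the create-copy-swap ("do-si-do") workaround:
# build a new table with the desired persistence, copy the rows, then
# swap names. Table names and DDL details are invented for illustration.

def dosido_statements(table, unlogged=True):
    persistence = "UNLOGGED " if unlogged else ""
    return [
        "BEGIN;",
        f"CREATE {persistence}TABLE {table}_new (LIKE {table} INCLUDING ALL);",
        f"INSERT INTO {table}_new SELECT * FROM {table};",
        f"DROP TABLE {table};",
        f"ALTER TABLE {table}_new RENAME TO {table};",
        "COMMIT;",
    ]

stmts = dosido_statements("events", unlogged=True)
assert stmts[1].startswith("CREATE UNLOGGED TABLE")
```

Unlike the in-place ALTER being discussed, this rewrites the whole table and drops any inbound foreign keys, views, and privileges on the old table, which is exactly why a native conversion would be nicer.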
On Wed, Jan 5, 2011 at 2:02 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Note that the conversion both ways can be worked around with the
> CREATE/ALTER TABLE do-si-do, so I don't think either of these is critical
> for 9.1. Unless you feel like working on them, of course.

As I said in my OP, I don't intend to work on them for 9.1, but will review and possibly commit a patch written by someone else. It'd be nice to have, but there are other things I want more, and a ton of other large and small patches to review.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, 2011-01-04 at 21:41 -0500, Robert Haas wrote:
> 6. If wal_level != minimal, XLOG every page of every fork except the
> init fork, for both the table and the associated indexes. (Note that
> this step also requires an AccessExclusiveLock rather than some weaker
> lock, because of the arbitrary rule that only AccessExclusiveLocks are
> sent to standbys. If we held only ShareRowExclusiveLock on the
> master, for example, a Hot Standby backend might read the table while
> it's only been half-copied.)

That rule is not arbitrary; there is simply no need to send other lock types, since they would not conflict with queries. It's an optimisation.

The lock strength selected on the master doesn't need to be the same as the lock strength on the standby. You could quite easily generate AEL lock records to send to the standby, without actually taking that lock level on the master.

Question: what does an unlogged table look like on the standby? Are they visible at all, does it throw an error, or do they just look empty? We probably need some docs in the HS section to explain that. I hope the answer isn't "look empty", since that is effectively data loss for people spreading queries across multiple nodes.

-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Jan 5, 2011 at 2:36 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The lock strength selected on the master doesn't need to be the same as
> the lock strength on the standby. You could quite easily generate AEL
> lock records to send to the standby, without actually taking that lock
> level on the master.

True.

> Question: what does an unlogged table look like on the standby? Are they
> visible at all, does it throw an error, or do they just look empty? We
> probably need some docs in the HS section to explain that. I hope the
> answer isn't "look empty", since that is effectively data loss for people
> spreading queries across multiple nodes.

Error.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jan 5, 2011, at 2:37 PM, Robert Haas wrote:
> On Wed, Jan 5, 2011 at 2:36 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Question: what does an unlogged table look like on the standby? Are they
>> visible at all, does it throw an error, or do they just look empty? We
>> probably need some docs in the HS section to explain that. I hope the
>> answer isn't "look empty", since that is effectively data loss for people
>> spreading queries across multiple nodes.
>
> Error.

Hm - if the unlogged tables are being used as HTTP transient state storage, it would be handy to have that (admittedly non-essential) data on the standby when it becomes master, even if there are no guarantees surrounding the data beyond "it looked like this at some point". Since the tables are not writing WAL, would it be possible to allow for writing to unlogged tables on the standby to allow for out-of-band syncing? Otherwise, it seems the only alternative is to push changes to a separate database on the standby machine and then suck the data in when it becomes master.

Cheers, M
> Hm - if the unlogged tables are being used as HTTP transient state
> storage, it would be handy to have that (admittedly non-essential) data
> on the standby when it becomes master, even if there are no guarantees
> surrounding the data beyond "it looked like this at some point". Since
> the tables are not writing WAL, would it be possible to allow for
> writing to unlogged tables on the standby to allow for out-of-band
> syncing?

It would be useful (this would allow per-standby buffer tables, for that matter), but it would also be tremendously difficult. Seems worthy of a TODO, though.

-- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> It would be useful (this would allow per-standby buffer tables, for that
> matter), but it would also be tremendously difficult. Seems worthy of a
> TODO, though.

Don't we have anything covering xid-less tables in the TODO already? The read-only table "compression" of removing a large part of the headers seems to come up often enough…

Regards,

-- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Wed, Jan 5, 2011 at 3:58 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Hm - if the unlogged tables are being used as HTTP transient state
>> storage, it would be handy to have that (admittedly non-essential) data
>> on the standby when it becomes master, even if there are no guarantees
>> surrounding the data beyond "it looked like this at some point". Since
>> the tables are not writing WAL, would it be possible to allow for
>> writing to unlogged tables on the standby to allow for out-of-band
>> syncing? Otherwise, it seems the only alternative is to push changes to
>> a separate database on the standby machine and then suck the data in
>> when it becomes master.
>
> It would be useful (this would allow per-standby buffer tables, for that
> matter), but it would also be tremendously difficult. Seems worthy of a
> TODO, though.

I think that's probably a dead end - just to take one example, if you don't sync often enough, the standby might have transaction ID wraparound problems. Autovacuum on the master will prevent that for permanent tables, but not for an only-occasionally-updated copy of an unlogged table.

If what you want is a possibly-slightly-stale copy of a fast table on the master, there are ways to engineer that using technology we already have, such as asynchronous replication + Hot Standby + synchronous_commit=off, or by using Slony. Admittedly, none of these solutions is quite perfect, but it would probably be much easier to improve those solutions or develop something completely new rather than try to somehow go through the mammoth contortions that would be needed to make it work with unlogged tables.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/5/11 3:14 PM, Robert Haas wrote:
> I think that's probably a dead end - just to take one example, if you
> don't sync often enough, the standby might have transaction ID
> wraparound problems. Autovacuum on the master will prevent that for
> permanent tables, but not for an only-occasionally-updated copy of an
> unlogged table.

I think you're missing Agent M's idea: if you could write to unlogged tables on the standby, then you could use application code to periodically synch them.

Mind you, I personally don't find that idea that useful - unlogged tables are supposed to be for highly volatile data, after all. No doubt M was thinking that in a failover situation, it would be better to have stale data than none at all.

However, if an unlogged table created on the master could be available for writing and initially empty on the standbys, it would give each standby available temporary/buffer tables they could use. That would be *really* useful.

Also, one of the obvious uses for unlogged tables is materialized views. If unlogged tables don't get replicated, and can't be created on the standby, then it severely limits their utility for this purpose.

-- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Wed, Jan 5, 2011 at 6:25 PM, Josh Berkus <josh@agliodbs.com> wrote:
> I think you're missing Agent M's idea: if you could write to unlogged
> tables on the standby, then you could use application code to
> periodically synch them.
>
> However, if an unlogged table created on the master could be available
> for writing and initially empty on the standbys, it would give each
> standby available temporary/buffer tables they could use. That would be
> *really* useful.

OIC, sorry. Well, that could possibly be done, but it'd be tricky. The obvious problem is that the backend doing the writing would need an XID, and it'd probably have to ask the master to assign it one... which is possibly doable, but certainly not ideal (you can't write on the slave if the master is down, unless you promote it). Then there's a whole bunch of follow-on problems, like now the standby needs to run autovacuum - but only on the unlogged tables, and without being able to update or rely on pg_database.datfrozenxid.

I think we have to face up to the fact that WAL shipping is an extremely limiting way to do replication. It has its perks, certainly - principally, that it minimizes the amount of extra work that must be done on the master, which is an extremely valuable consideration for many applications. However, it's also got some pretty major disadvantages, and one of the big ones is that it's not well-suited to partial replication. If it were possible to replicate individual tables, we wouldn't be having this conversation. You'd just replicate some tables from the master to the standby and then create a few extra ones on the standby (perhaps permanent, perhaps unlogged, perhaps temporary) and call it good.

I think we ought to seriously consider having both physical and logical replication in core. Physical replication, which we have today, is great for what it does, but trying to make it do things that it's not good at is going to be an awful lot of work, and require an awful lot of hacks, to make it cover everything that people really want to be able to do. Adding logical replication would be a lot of work, but we'd get a lot of collateral benefits. Imagine that PG had the ability to spit out INSERT/UPDATE/DELETE statements for designated tables, as they were modified. That would provide a framework for partial replication, replication from PG into other databases, even multi-master replication if you add some kind of conflict resolution. Even though this would require essentially building a whole new system, it's starting to seem to me that it would be simpler than trying to remove the limitations of our existing system incrementally.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
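As a toy illustration of the change stream described above - not any real PostgreSQL facility; the `ChangeLog` class, the hook, and the table names are all invented, and trigger-based systems of this era (Slony, londiste) did roughly this with queues - a logical replication layer might capture row changes on designated tables as replayable statements:

```python
# Toy sketch of logical change capture: row-level changes on designated
# tables are recorded as replayable INSERT/DELETE statements, which is
# the essence of partial/logical replication. All names are invented.

class ChangeLog:
    def __init__(self, tables):
        self.tables = set(tables)   # only these tables are replicated
        self.entries = []

    def record(self, table, op, row, key=None):
        if table not in self.tables:
            return                  # partial replication: ignore other tables
        if op == 'INSERT':
            cols = ', '.join(row)
            vals = ', '.join(repr(v) for v in row.values())
            self.entries.append(f"INSERT INTO {table} ({cols}) VALUES ({vals});")
        elif op == 'DELETE':
            self.entries.append(f"DELETE FROM {table} WHERE id = {key!r};")

log = ChangeLog({'accounts'})
log.record('accounts', 'INSERT', {'id': 1, 'balance': 100})
log.record('scratch', 'INSERT', {'id': 7})   # not designated: skipped
log.record('accounts', 'DELETE', {}, key=1)
assert len(log.entries) == 2
```

Replaying `log.entries` on another database (PostgreSQL or otherwise) is what gives the partial-replication and cross-database benefits mentioned; conflict resolution for multi-master would sit on top of a stream like this.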
Robert Haas <robertmhaas@gmail.com> writes:
> Adding logical replication would be a lot of
> work but we'd get a lot of collateral benefits.

There has been extensive discussion at last pgcon about that (mainly in the "hallway track", but also in the devroom we had) to work on getting the common bits between the Slony and londiste queuing solutions into core, to support what you're saying. In fact we already have the txid datatype and its functions; those came from Slony through PGQ into core.

Jan Wieck started a discussion back then to offer the basics we need in core as far as queuing goes, here:

http://archives.postgresql.org/pgsql-hackers/2010-05/msg01209.php

What happened to that effort?

-- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
> Jan Wieck started a discussion back then to offer the basics we need in
> core as far as queuing goes, here:
>
> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01209.php
>
> What happened to that effort?

Stalled due to lack of manpower, currently.

-- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Tue, Jan 4, 2011 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Generally, to do this, it would be necessary to do the following
> things (plus anything I'm forgetting):

It occurs to me that almost exactly this same procedure could be used to make a *temporary* table into a permanent table. You'd have to also change the schema, and there'd be some other adjustments, but overall it'd be pretty similar.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company