Thread: Make unlogged table resets detectable
One problem with unlogged tables is that the application has no way to tell if they were reset, or if they just happen to be empty.

This can be a problem with sharding, where you might have different shards of an unlogged table on different servers. If one server crashes, you'll be missing only one shard of the data, which may appear inconsistent. In that case, you'd like the application (or sharding solution) to be able to detect that one shard was lost, and TRUNCATE those that remain to get back to a reasonable state.

It would be easy enough for the init fork to have a single page with a flag set. That way, when the main fork is replaced with the init fork, other code could detect that a reset happened.

When detected, depending on a GUC, the behavior could be to auto-truncate it (to get the current silent behavior), or refuse to perform the operation (except an explicit TRUNCATE), or issue a warning/log/notice.

The biggest challenge would be: when should we detect that the reset has happened? There might be a lot of entry points. Another idea would be to just have a SQL function that the application could call whenever it needs to know.

Thoughts?

Jeff Davis
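For concreteness, a minimal sketch of what the SQL-function idea could look like from the application's side. The function name below is purely illustrative (it does not exist in PostgreSQL or in any patch here); the point is only the calling pattern:

    -- Hypothetical function: returns true if the table's main fork has been
    -- replaced by its init fork (i.e. the unlogged table was reset) since the
    -- application last acknowledged a reset.
    SELECT pg_unlogged_table_was_reset('my_unlogged_shard'::regclass);

    -- If true, the sharding layer would TRUNCATE the surviving shards of the
    -- same table on the other nodes and repopulate the whole table.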
On 03/06/2021 23:04, Jeff Davis wrote:
> One problem with unlogged tables is that the application has no way to
> tell if they were reset, or if they just happen to be empty.
>
> This can be a problem with sharding, where you might have different
> shards of an unlogged table on different servers. If one server
> crashes, you'll be missing only one shard of the data, which may appear
> inconsistent. In that case, you'd like the application (or sharding
> solution) to be able to detect that one shard was lost, and TRUNCATE
> those that remain to get back to a reasonable state.
>
> It would be easy enough for the init fork to have a single page with a
> flag set. That way, when the main fork is replaced with the init fork,
> other code could detect that a reset happened.

I'd suggest using a counter rather than a flag. With a flag, if one client clears the flag to acknowledge that a truncation happened, others might miss it. See also the ABA problem.

> When detected, depending on a GUC, the behavior could be to
> auto-truncate it (to get the current silent behavior), or refuse to
> perform the operation (except an explicit TRUNCATE), or issue a
> warning/log/notice.

TRUNCATE isn't quite what happens when an unlogged table is re-initialized. It changes the relfilenode, resets stats, and requires a stricter lock. So I don't think repurposing TRUNCATE for re-initializing a table is a good idea.

There's also potential for a race condition, if two connections see that a table needs re-initialization and issue "TRUNCATE + INSERT" concurrently. One of the INSERTs will be lost.

A warning or notice is easy to miss.

> The biggest challenge would be: when should we detect that the reset
> has happened? There might be a lot of entry points. Another idea would
> be to just have a SQL function that the application could call whenever
> it needs to know.

Yeah, a SQL function to get the current "reset counter" would be nice.

- Heikki
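To make the flag-vs-counter distinction concrete, a hedged sketch of the counter variant; again, the function name is hypothetical and exists in no patch here:

    -- A monotonically increasing reset counter cannot be "consumed" by one
    -- client on behalf of the others, and multiple resets are not collapsed
    -- into a single observation.
    SELECT pg_unlogged_reset_counter('my_unlogged_shard'::regclass);

    -- Each client remembers the last value it acted on; if the current value
    -- is larger, at least one re-initialization happened since then,
    -- regardless of how many other clients also noticed it.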
On Fri, 2021-06-04 at 09:42 +0300, Heikki Linnakangas wrote:
> I'd suggest using a counter rather than a flag. With a flag, if one
> client clears the flag to acknowledge that a truncation happened,
> others might miss it. See also ABA problem.

This feels like it's getting more complex.

Stepping back, maybe unlogged tables are the wrong level to solve this problem. We could just have a "crash counter" in pg_control that would be incremented every time a crash happened (and all unlogged tables are reset). It might be a number or maybe the LSN of the startup checkpoint after the most recent crash.

A SQL function could read the value. Perhaps we'd also have a SQL function to reset it, but I don't see a use case for it.

Then, it's up to the client to check it against a stored value, and clear/repopulate unlogged tables as necessary.

Regards,
Jeff Davis
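A minimal sketch of the client-side bookkeeping this would enable, assuming a hypothetical pg_crash_recovery_count() function that exposes the pg_control counter (neither the function nor the table below exists; they only illustrate the pattern):

    -- A regular (logged) table holding the counter value the application has
    -- already handled; unlike the unlogged data, it must survive a crash.
    CREATE TABLE IF NOT EXISTS crash_watermark (seen_count bigint NOT NULL);

    -- At application startup, or before trusting unlogged data:
    SELECT pg_crash_recovery_count() AS current_count;  -- hypothetical

    -- If current_count differs from the stored seen_count, a crash (and an
    -- unlogged-table reset) has happened since the last check: clear and
    -- repopulate the unlogged tables, then write current_count back into
    -- crash_watermark.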
On Fri, Jun 4, 2021 at 8:41 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Stepping back, maybe unlogged tables are the wrong level to solve this
> problem. We could just have a "crash counter" in pg_control that would
> be incremented every time a crash happened (and all unlogged tables are
> reset). It might be a number or maybe the LSN of the startup checkpoint
> after the most recent crash.
>
> A SQL function could read the value. Perhaps we'd also have a SQL
> function to reset it, but I don't see a use case for it.
>
> Then, it's up to the client to check it against a stored value, and
> clear/repopulate unlogged tables as necessary.

I think this would be useful for a variety of purposes. Both being able to know the last time that it happened and being able to know the number of times that it happened could be useful, depending on the scenario.

For example, if one of my employer's customers began complaining about a problem that started happening recently, it would be useful to be able to establish whether there had also been a crash recently, and a timestamp or LSN would help a lot. On the other hand, if we had a counter, we'd probably find out some interesting things, too. Maybe someone would report that the value of the counter was surprisingly large. For example, if a customer's pg_control output showed that the database cluster had performed crash recovery 162438 times, I might have some, err, follow-up questions.

This is not a vote for or against any specific proposal; it's just a general statement that I support trying to do something in this area, and that it feels like anything we do will likely have some value.

--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jun 4, 2021 at 8:41 PM Jeff Davis <pgsql@j-davis.com> wrote:
>> Stepping back, maybe unlogged tables are the wrong level to solve this
>> problem. We could just have a "crash counter" in pg_control that would
>> be incremented every time a crash happened (and all unlogged tables are
>> reset). It might be a number or maybe the LSN of the startup checkpoint
>> after the most recent crash.

> I think this would be useful for a variety of purposes. Both being
> able to know the last time that it happened and being able to know the
> number of times that it happened could be useful, depending on the
> scenario.

+1. I'd support recording the time of the last crash recovery, as well as having a counter. I think an LSN would not be as useful as a timestamp.

regards, tom lane
On Mon, Jun 07, 2021 at 02:56:57PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, Jun 4, 2021 at 8:41 PM Jeff Davis <pgsql@j-davis.com> wrote:
> >> Stepping back, maybe unlogged tables are the wrong level to solve this
> >> problem. We could just have a "crash counter" in pg_control that would
> >> be incremented every time a crash happened (and all unlogged tables are
> >> reset). It might be a number or maybe the LSN of the startup checkpoint
> >> after the most recent crash.
>
> > I think this would be useful for a variety of purposes. Both being
> > able to know the last time that it happened and being able to know the
> > number of times that it happened could be useful, depending on the
> > scenario.
>
> +1. I'd support recording the time of the last crash recovery, as
> well as having a counter. I think an LSN would not be as useful
> as a timestamp.

+1

It's been suggested before ;)
https://www.postgresql.org/message-id/20180228221653.GB32095%40telsasoft.com

PS. I currently monitor for crashes by checking something hacky like:
| SELECT backend_start - pg_postmaster_start_time() FROM pg_stat_activity ORDER BY 1
On Mon, Jun 07, 2021 at 02:56:57PM -0400, Tom Lane wrote:
> +1. I'd support recording the time of the last crash recovery, as
> well as having a counter. I think an LSN would not be as useful
> as a timestamp.

One could guess a timestamp based on a LSN, no? So I'd like to think the opposite actually: a LSN would be more useful than a timestamp.
--
Michael
On Tue, Jun 08, 2021 at 12:46:05PM +0900, Michael Paquier wrote:
> On Mon, Jun 07, 2021 at 02:56:57PM -0400, Tom Lane wrote:
> > +1. I'd support recording the time of the last crash recovery, as
> > well as having a counter. I think an LSN would not be as useful
> > as a timestamp.
>
> One could guess a timestamp based on a LSN, no? So I'd like to think
> the opposite actually: a LSN would be more useful than a timestamp.

Wouldn't that work only if the LSN is recent enough, depending on the WAL activity?
On Mon, Jun 7, 2021 at 11:46 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Jun 07, 2021 at 02:56:57PM -0400, Tom Lane wrote:
> > +1. I'd support recording the time of the last crash recovery, as
> > well as having a counter. I think an LSN would not be as useful
> > as a timestamp.
>
> One could guess a timestamp based on a LSN, no? So I'd like to think
> the opposite actually: a LSN would be more useful than a timestamp.

One could also guess an LSN based on a timestamp, but I think in either case one has to be a pretty good guesser. The rate at which WAL is generated is hardly guaranteed to be uniform, and if you're looking at a system for the first time you may have no idea what it is.

--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jun 7, 2021 at 11:46 PM Michael Paquier <michael@paquier.xyz> wrote:
>> On Mon, Jun 07, 2021 at 02:56:57PM -0400, Tom Lane wrote:
>>> +1. I'd support recording the time of the last crash recovery, as
>>> well as having a counter. I think an LSN would not be as useful
>>> as a timestamp.

>> One could guess a timestamp based on a LSN, no? So I'd like to think
>> the opposite actually: a LSN would be more useful than a timestamp.

> One could also guess an LSN based on a timestamp, but I think in
> either case one has to be a pretty good guesser.

Yeah. If there are actually use-cases for knowing both things, then we ought to record both. However, it's not real clear to me why LSN would be interesting.

BTW, I spent a bit of time thinking about whether we should record the timestamp at start or end of crash recovery; my conclusion is we should record the latter. It would only make a difference to people who wanted to inspect the value (a) while crash recovery is in progress or (b) after a failed crash recovery. In both scenarios, you have other mechanisms to discover the start time of the current crash; while if we overwrite the pg_control field at the start, there's no longer a way to know how long ago the previous crash was. So it seems best not to overwrite the time of the previous crash until we're up.

(If there is a reason to log LSN, maybe the argument is different for that? Although I'd think that looking at the last checkpoint REDO location is sufficient for figuring out where the current crash recovery attempt started.)

regards, tom lane
On Tue, 2021-06-08 at 12:52 -0400, Tom Lane wrote:
> Yeah. If there are actually use-cases for knowing both things, then
> we ought to record both. However, it's not real clear to me why
> LSN would be interesting.

Let me expand on my use case: in a sharded environment, how do you figure out if you need to repopulate an UNLOGGED table?

For a single node, there's not much risk, because you either have the data or you don't. But in a sharded environment, if one node crashes, you might end up with some shards empty and others populated, and that's inconsistent.

If Postgres provides a way to figure out when the last crash happened, that would give the sharding solution the basic information it needs to figure out if it needs to clear and repopulate the entire unlogged table (i.e. all its shards on all nodes). Clearly, the sharding solution would need to do some tracking of its own, like recording when the last TRUNCATE happened, to figure out what to do. For that tracking, I think using the LSN makes more sense than a timestamp.

> (If there is a reason to log LSN, maybe the argument is different
> for that? Although I'd think that looking at the last checkpoint
> REDO location is sufficient for figuring out where the current
> crash recovery attempt started.)

I came to a similar conclusion for my use case: tracking the LSN at the end of the recovery makes more sense.

I attached a patch to track the last recovery LSN, time, and total count. But there are a few issues:

1. Do we want a way to reset the counter? If so, should it be done with pg_resetwal or a superuser SQL function?

2. It would be helpful to also know the last time a promotion happened, for the same reason (e.g. a failover of a single node leading to an unlogged table with some empty shards and some populated ones). Should we also store the last promotion LSN and time? Does a "promotion count" make sense, and should we track that, too?

3. Should we try to track crash information across promotions, or just start them at the initial values when promoted?

Regards,
Jeff Davis
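A hedged sketch of how the proposed fields might be read. It assumes the attached patch is applied and extends pg_control_recovery(); only last_recovery_lsn is named later in this thread, so the other two column names are assumptions about the patch, not a released interface:

    SELECT r.last_recovery_lsn,   -- LSN at the end of the last crash recovery
           r.last_recovery_time,  -- assumed name for the timestamp field
           r.recovery_count       -- assumed name for the total-count field
    FROM pg_control_recovery() AS r;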
Jeff Davis <pgsql@j-davis.com> writes:
> On Tue, 2021-06-08 at 12:52 -0400, Tom Lane wrote:
>> Yeah. If there are actually use-cases for knowing both things, then
>> we ought to record both. However, it's not real clear to me why
>> LSN would be interesting.

> Let me expand on my use case: in a sharded environment, how do you
> figure out if you need to repopulate an UNLOGGED table?

Since we don't put LSNs into unlogged tables, nor would the different shards be likely to have equivalent LSNs, I'm not seeing that LSN is remarkably better for this than a timestamp.

> 1. Do we want a way to reset the counter? If so, should it be done with
> pg_resetwal or a superuser SQL function?

I'd be kind of inclined to say no, short of pg_resetwal, and maybe not then.

> 2. It would be helpful to also know the last time a promotion happened,

I'm not following this either. How do you unpromote a node?

regards, tom lane
On Tue, 2021-06-08 at 16:08 -0400, Tom Lane wrote:
> Since we don't put LSNs into unlogged tables, nor would the different
> shards be likely to have equivalent LSNs, I'm not seeing that LSN is
> remarkably better for this than a timestamp.

It requires some other bookkeeping on the part of the sharding solution. This is ugly (alternative suggestions welcome), but I think it would work:

1. The sharding code would create, on each node:

   CREATE UNLOGGED TABLE unlogged_table_status(
       shard_name regclass,
       last_truncate pg_lsn);

2. When you create an unlogged table, each node would do:

   INSERT INTO unlogged_table_status
   VALUES ('my_unlogged_shard', pg_current_wal_flush_lsn());

3. When you TRUNCATE an unlogged table, each node would do:

   UPDATE unlogged_table_status
   SET last_truncate = pg_current_wal_flush_lsn()
   WHERE shard_name = 'my_unlogged_shard';

4. When connecting to a node and accessing a shard of an unlogged table for the first time, test whether the shard has been lost with:

   SELECT last_truncate <= (pg_control_recovery()).last_recovery_lsn
          AS shard_was_lost
   FROM unlogged_table_status
   WHERE shard_name = 'my_unlogged_shard';

5. If the shard was lost, truncate all shards for that table on all nodes (and update unlogged_table_status on all nodes as in #3).

Not exactly straightforward, but better than the current situation. And I think it can be made more robust than a timestamp.

> I'd be kind of inclined to say no, short of pg_resetwal, and maybe
> not then.

Agreed, at least until we find some use case that says otherwise.

> > 2. It would be helpful to also know the last time a promotion
> > happened,
>
> I'm not following this either. How do you unpromote a node?

What I meant by "node" here is actually a primary+standby pair. Let's say each primary+standby pair holds one shard of an unlogged table. In this case, a crash followed by a restart is equivalent to a primary failing over to a promoted standby -- in either case, the shard is gone, but other shards of the same table may be populated on other primaries. We need to detect that the shard is gone and then wipe out all the other shards on the healthy primaries.

You could reasonably say that it's the job of the sharding solution to keep track of these crashes and handle unlogged tables at the time. But it's inconvenient to insert more tasks into a sensitive process like failover/recovery. It's preferable to be able to detect the unlogged table problem after the fact and handle it when the systems are all up and stable.

Regards,
Jeff Davis
Is this patch targeting pg15? There's been no discussion since June.

Latest at 2021-06-08 21:29:25 by Jeff Davis <pgsql at j-davis.com>

2022-02-02 16:37:58  Julien Rouhaud (rjuju)            Closed in commitfest 2022-01 with status: Moved to next CF
2021-12-03 06:18:05  Michael Paquier (michael-kun)     Closed in commitfest 2021-11 with status: Moved to next CF
2021-10-04 16:32:49  Jaime Casanova (jcasanov)         Closed in commitfest 2021-09 with status: Moved to next CF
2021-08-03 02:29:40  Masahiko Sawada (masahikosawada)  Closed in commitfest 2021-07 with status: Moved to next CF
On Fri, Mar 04, 2022 at 10:12:27AM -0600, Justin Pryzby wrote:
> Is this patch targeting pg15? There's been no discussion since June.
>
> Latest at 2021-06-08 21:29:25 by Jeff Davis <pgsql at j-davis.com>

It has been too long since the last activity, so let's discard this patch for now.
--
Michael