Thread: error "can only drop stats once" brings down database
Hi,
On a database we have, we've recently seen a fatal error occur twice. The error happened on two different physical replicas (of the same cluster) during a WAL redo action in the recovery process. They're running Postgres 15.5.
Occurrence 1:
2024-02-01 06:55:54.476 CET,,,70290,,65a29b60.11292,6,,2024-01-13 15:17:04 CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at A7BD1/D6F9B6C0 for Transaction/COMMIT: 2024-02-01 06:55:54.395851+01; dropped stats: 2/16405/2991517839 2/16405/2991517838 2/16405/2991517835; inval msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 32 catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 catcache 32 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 80 catcache 79 catcache 80 catcache 79 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 snapshot 2608 relcache 2991517835 snapshot 1214 relcache 2991517838 relcache 2991517839 relcache 2991517838 snapshot 2608 relcache 2991517838 relcache 2991517839 relcache 2991517835 snapshot 2608 relcache 2991517839 relcache 2991517838 snapshot 2608 relcache 2991517838 snapshot 2608 snapshot 2608 snapshot 2608 relcache 2991517835 snapshot 2608 snapshot 1214",,,,"","startup",,0
2024-02-01 06:55:56.793 CET,,,70282,,65a29b5f.1128a,8,,2024-01-13 15:17:03 CET,,0,LOG,00000,"startup process (PID 70290) exited with exit code 1",,,,,,,,,"","postmaster",,0
2024-02-01 06:55:56.793 CET,,,70282,,65a29b5f.1128a,9,,2024-01-13 15:17:03 CET,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,"","postmaster",,0
2024-02-01 06:55:57.145 CET,,,26624,"100.104.20.59:37478",65bb326d.6800,1,"",2024-02-01 06:55:57
Occurrence 2:
2024-02-09 19:15:41.353 CET,,,88714,,65a29b5e.15a8a,6,,2024-01-13 15:17:02 CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at A95F1/3C9D3D88 for Transaction/COMMIT: 2024-02-09 19:15:41.33755+01; dropped stats: 2/16405/3843195112 2/16405/3843195111 2/16405/3843195107; inval msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 32 catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 32 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 80 catcache 79 catcache 80 catcache 79 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 snapshot 2608 relcache 3843195107 snapshot 1214 relcache 3843195111 relcache 3843195112 relcache 3843195111 snapshot 2608 relcache 3843195111 relcache 3843195112 relcache 3843195107 snapshot 2608 relcache 3843195107 snapshot 2608 relcache 3843195107 snapshot 2608 relcache 3843195112 relcache 3843195111 snapshot 2608 relcache 3843195111 snapshot 2608 snapshot 2608 snapshot 2608 relcache 3843195107 snapshot 2608 snapshot 1214",,,,"","startup",,0
Googling the error only shows one conversation from last year: https://www.postgresql.org/message-id/flat/17947-b9554521ad963c9c%40postgresql.org
However, in that thread it is caused by logical replication slots, which this cluster does not have.
Does anyone have an idea what may be causing this?
Separately, it seems like Postgres did not restart by itself when this error occurred (normally if a backend crashes Postgres would kill all connections and initiate a restart), but in this case because it’s the recovery process that stops, the whole database just shuts down and needs to be manually restarted. Is that intended?
After manually starting the database again, recovery would continue as normal. The database would be in normal operation again.
-Floris
Floris Van Nee <florisvannee@Optiver.com> writes:
> Hi,
> On a database we have, we've recently seen a fatal error occur twice. The error happened on two different physical replicas (of the same cluster) during a WAL redo action in the recovery process. They're running Postgres 15.5.
> Occurrence 1:
> 2024-02-01 06:55:54.476 CET,,,70290,,65a29b60.11292,6,,2024-01-13 15:17:04 CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at A7BD1/D6F9B6C0 for Transaction/COMMIT: 2024-02-01 06:55:54.395851+01; ...

Hmm. This must be coming from pgstat_drop_entry_internal. I suspect the correct fix is in pgstat_drop_entry, along the lines of

- if (shent)
+ if (shent && !shent->dropped)

but it's not clear to me how the already-dropped case ought to affect the function's bool result. Also, how are we getting into a concurrent-drop situation in recovery?

regards, tom lane
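The check Tom is pointing at lives in src/backend/utils/activity/pgstat_shmem.c. Paraphrased from the 15.x sources (a sketch, not the literal code), the relevant pieces look roughly like this, with his proposed guard marked in the caller:

    /* in pgstat_drop_entry_internal() -- the source of the FATAL above
     * (ERROR is promoted to FATAL when raised in the startup process) */
    if (shent->dropped)
        elog(ERROR, "can only drop stats once");
    shent->dropped = true;

    /* in pgstat_drop_entry(), after looking up the shared hash entry
     * for key = (kind, dboid, objoid) */
    shent = dshash_find(pgStatLocal.shared_hash, &key, true);
    if (shent)    /* Tom's suggestion: if (shent && !shent->dropped) */
        freed = pgstat_drop_entry_internal(shent, NULL);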
> Floris Van Nee <florisvannee@Optiver.com> writes:
> > Hi,
> > On a database we have, we've recently seen a fatal error occur twice. The error happened on two different physical replicas (of the same cluster) during a WAL redo action in the recovery process. They're running Postgres 15.5.
> >
> > Occurrence 1:
> > 2024-02-01 06:55:54.476 CET,,,70290,,65a29b60.11292,6,,2024-01-13 15:17:04 CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at A7BD1/D6F9B6C0 for Transaction/COMMIT: 2024-02-01 06:55:54.395851+01; ...
>
> Hmm. This must be coming from pgstat_drop_entry_internal. I suspect the correct fix is in pgstat_drop_entry, along the lines of
>
> - if (shent)
> + if (shent && !shent->dropped)
>
> but it's not clear to me how the already-dropped case ought to affect the function's bool result. Also, how are we getting into a concurrent-drop situation in recovery?

Does anyone have further thoughts on this? This still happens occasionally.

-Floris
Floris Van Nee <florisvannee@Optiver.com> writes:
>> Hmm. This must be coming from pgstat_drop_entry_internal. I suspect the correct fix is in pgstat_drop_entry, along the lines of
>>
>> - if (shent)
>> + if (shent && !shent->dropped)
>>
>> but it's not clear to me how the already-dropped case ought to affect the function's bool result. Also, how are we getting into a concurrent-drop situation in recovery?

> Does anyone have further thoughts on this? This still happens occasionally.

I was expecting Andres to comment on it eventually. I don't know that code well enough to want to modify it on my own authority, especially not just before a release.

One thing that would be safe enough and perhaps useful is to modify the "can only drop stats once" message to provide a little more detail, like the hash key of the problematic entry. That might help us understand what's triggering this. The OIDs would be opaque perhaps, but the PgStat_Kind value seems useful to know.

regards, tom lane
Hello Floris,

03.05.2024 21:10, Floris Van Nee wrote:
>> Floris Van Nee <florisvannee@Optiver.com> writes:
>>> Hi,
>>> On a database we have, we've recently seen a fatal error occur twice. The error happened on two different physical replicas (of the same cluster) during a WAL redo action in the recovery process. They're running Postgres 15.5.
>>>
>>> Occurrence 1:
>>> 2024-02-01 06:55:54.476 CET,,,70290,,65a29b60.11292,6,,2024-01-13 15:17:04 CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at A7BD1/D6F9B6C0 for Transaction/COMMIT: 2024-02-01 06:55:54.395851+01; ...
>>
>> Hmm. This must be coming from pgstat_drop_entry_internal. I suspect the correct fix is in pgstat_drop_entry, along the lines of
>>
>> - if (shent)
>> + if (shent && !shent->dropped)
>>
>> but it's not clear to me how the already-dropped case ought to affect the function's bool result. Also, how are we getting into a concurrent-drop situation in recovery?
> Does anyone have further thoughts on this? This still happens occasionally.

Please take a look at bug #17947 (maybe you encounter the same?):
https://www.postgresql.org/message-id/17947-b9554521ad963c9c%40postgresql.org

Best regards,
Alexander
Hi,

On 2024-05-03 18:10:05 +0000, Floris Van Nee wrote:
> > Floris Van Nee <florisvannee@Optiver.com> writes:
> > > Hi,
> > > On a database we have, we've recently seen a fatal error occur twice. The error happened on two different physical replicas (of the same cluster) during a WAL redo action in the recovery process. They're running Postgres 15.5.
> > >
> > > Occurrence 1:
> > > 2024-02-01 06:55:54.476 CET,,,70290,,65a29b60.11292,6,,2024-01-13 15:17:04 CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at A7BD1/D6F9B6C0 for Transaction/COMMIT: 2024-02-01 06:55:54.395851+01; ...
> >
> > Hmm. This must be coming from pgstat_drop_entry_internal. I suspect the correct fix is in pgstat_drop_entry, along the lines of
> >
> > - if (shent)
> > + if (shent && !shent->dropped)
> >
> > but it's not clear to me how the already-dropped case ought to affect the function's bool result.

I don't think that'd be quite right - just ignoring that we're confused about tracking "stats object" liveliness seems likely to hide bugs.

Elsewhere in this thread you suggested adding more details about the error - let's do that. Something like the attached might already be an improvement?

> > Also, how are we getting into a concurrent-drop situation in recovery?

I'd like to know how we get into the situation too. It's perhaps worth noting that stats can be generated on a standby, albeit not by the replay process. But locking should prevent active use of the stats entry when it's being dropped...

> Does anyone have further thoughts on this? This still happens occasionally.

Do you have any more details about the workload leading to this issue? Is the standby used for queries? Given the "high value" your oids/relfilenodes have, I assume there are a lot of created/dropped/truncated relations?

Greetings,

Andres Freund
Hi,

On 2024-05-05 09:09:15 -0700, Andres Freund wrote:
> Do you have any more details about the workload leading to this issue? Is the standby used for queries? Given the "high value" your oids/relfilenodes have, I assume there are a lot of created/dropped/truncated relations?

I suspect that the "high value" oids are related to the issue. I bet you're running into OID wraparound for objects at a much higher frequency than most users. I suspect the reason this causes problems is the problem I just found and described here:
https://www.postgresql.org/message-id/20240505183741.5tefii3razzefvtc%40awork3.anarazel.de

Could you confirm that you have a) a lot of oid assignments b) your startup process was running for a long time by the time of the crash?

The fact that we're not triggering "stats object garbage collection" for most drops also explains why you're much more likely to see this on a standby than on a primary. On a primary it's going to be rare to have a single backend live long enough to observe an oid wraparound leading to one backend accessing stats for the same object type with the same oid after that object previously having been dropped. But most stats accesses on a standby are going to be by the same process, the startup process. You'd still need some other accesses to prevent the object from being dropped "immediately", but that could be due to shorter lived processes.

Greetings,

Andres Freund
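To make the wraparound mechanics concrete: object OIDs are plain unsigned 32-bit counters, and once the counter wraps it skips the range reserved for system objects rather than restarting at zero, so previously used OIDs become available again. Below is a tiny standalone model of that behavior (not PostgreSQL code; the real GetNewObjectId() also handles locking and bootstrap mode, and GetNewOidWithIndex() additionally checks for collisions in the catalog), assuming FirstNormalObjectId = 16384 as in the server headers:

    /* toy model of normal-object OID assignment after the counter wraps */
    #include <stdint.h>
    #include <stdio.h>

    #define FIRST_NORMAL_OBJECT_ID 16384u  /* FirstNormalObjectId in the server */

    static uint32_t next_oid = 4294967290u; /* pretend the cluster is close to wrapping */

    static uint32_t
    toy_get_new_object_id(void)
    {
        /* once the 32-bit counter wraps past 0, skip the reserved system range */
        if (next_oid < FIRST_NORMAL_OBJECT_ID)
            next_oid = FIRST_NORMAL_OBJECT_ID;
        return next_oid++;
    }

    int
    main(void)
    {
        for (int i = 0; i < 10; i++)
            printf("assigned oid: %u\n", toy_get_new_object_id());
        /* prints 4294967290 .. 4294967295, then 16384, 16385, ...:
         * after a wrap, OIDs handed out long ago can come around again */
        return 0;
    }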
> Could you confirm that you have a) a lot of oid assignments b) your startup process was running for a long time by the time of the crash?

Thanks Andres. Both higher than average, I guess, although it depends on what is considered 'a lot' and 'a long time'. The startup process was running for a few months. There are ~500k entries in pg_class, of which most are (Timescale) partitions. However, even with this number of items in pg_class I wouldn't expect wraparound to happen frequently? These are not dropped/recreated. I've monitored "select count(*) from pg_class" for a while to see if it changes often, and while there are changes during the day (likely temporary tables being created), it also doesn't nearly happen at a frequency that will get us to a wraparound quickly.

Oids aren't just consumed by pg_class though. And I do see the system oid counter growing quickly (when doing CREATE TABLE twice with a minute in between, then checking the oid difference of the two tables). I don't know how to investigate the cause of this. What would be the best way to check what could be consuming these oids so quickly?

-Floris
> I suspect that the "high value" oids are related to the issue. I bet you're running into OID wraparound for objects at a much higher frequency than most users. I suspect the reason this causes problems is the problem I just found and described here:

I definitely suspect oid wraparound has something to do with it. However, given what I found on the other thread (https://www.postgresql.org/message-id/flat/17947-b9554521ad963c9c%40postgresql.org), I don't think the two have the exact same root cause. The other thread seems a clear case of 'forgot to call gc' when it should.

> The fact that we're not triggering "stats object garbage collection" for most drops also explains why you're much more likely to see this on a standby than on a primary. On a primary it's going to be rare to have a single backend live long enough to observe an oid wraparound leading to one backend accessing stats for the same object type with the same oid after that object previously having been dropped. But most stats accesses on a standby are going to be by the same process, the startup process. You'd still need some other accesses to prevent the object from being dropped "immediately", but that could be due to shorter lived processes.

As mentioned on the other thread, I do think garbage collection gets called as part of pgstat_execute_transactional_drops in recovery. There's still the case of "skip gc of entry if pending" though; however, I struggle to understand how this can lead to such an error in the recovery process after wraparound. I'd expect the gc to happen relatively frequently (at least much more frequently than a wraparound) as tables get created/dropped quite frequently. Next to that, because the recovery process is the only one creating/dropping relations on standby, it should be impossible to get in a state where the recovery "local" stats cache has a non-gc'd entry that should be dropped? Because when dropping, it at least removes the local cache entry (even if it cannot remove the shared one). So later, when creating it again in pgstat_get_entry_ref, it should always hit the pgstat_reinit_entry path?

Would it make sense to at least commit your patch to enhance the error message a bit?

-Floris
On Sun, Jun 02, 2024 at 10:01:53PM +0000, Floris Van Nee wrote:
> Would it make sense to at least commit your patch to enhance the error message a bit?

FWIW, I am +1 for the improvement of this error message. Note that there is a very similar proposal here:
https://www.postgresql.org/message-id/ZkM30paAD8Cr/Bix@ip-10-97-1-34.eu-west-3.compute.internal

The only reason why I did not do one or the other yet is that this is an improvement, hence I was waiting for v18 to open. If somebody feels differently, feel free to proceed before that with the format you like. Still I'd vote for the addition of the refcount to the patch proposed on this thread, to provide more debugging info to the information reported.
--
Michael
> The only reason why I did not do one or the other yet is that this is an improvement, hence I was waiting for v18 to open. If somebody feels differently, feel free to proceed before that with the format you like. Still I'd vote for the addition of the refcount to the patch proposed on this thread, to provide more debugging info to the information reported.

This patch would help in getting more info about this specific issue if it reoccurs, but in that case it's only helpful if it's backpatched.
Agree the refcount addition makes sense.

-Floris
On Wed, Jun 05, 2024 at 10:31:04AM +0000, Floris Van Nee wrote:
> This patch would help in getting more info about this specific issue if it reoccurs, but in that case it's only helpful if it's backpatched.
> Agree the refcount addition makes sense.

Adding Bertrand in CC as he was similarly playing with this area.

What would both of you think about the attached. This is a mix of what's been proposed on the two threads, with the addition of the refcount.
--
Michael
Hi,

On Thu, Jun 06, 2024 at 01:44:26PM +0900, Michael Paquier wrote:
> On Wed, Jun 05, 2024 at 10:31:04AM +0000, Floris Van Nee wrote:
> > This patch would help in getting more info about this specific issue if it reoccurs, but in that case it's only helpful if it's backpatched.
> > Agree the refcount addition makes sense.
>
> Adding Bertrand in CC as he was similarly playing with this area.
>
> What would both of you think about the attached. This is a mix of what's been proposed on the two threads, with the addition of the refcount.

I like the idea of adding refcount in the error message. OTOH I think it's more appropriate to make use of errdetail_internal() as proposed in [1], but that's a nit.

[1]: https://www.postgresql.org/message-id/ZkRsoOMrSrNI945i%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
At Thu, 6 Jun 2024 06:05:11 +0000, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote in
> > What would both of you think about the attached. This is a mix of what's been proposed on the two threads, with the addition of the refcount.
>
> I like the idea of adding refcount in the error message. OTOH I think it's more appropriate to make use of errdetail_internal() as proposed in [1], but that's a nit.

To me, this is something that is not expected to happen, but if it does, we would want detailed information. In that sense, it might be better not to hide it in the DETAILS field.

regards.

> [1]: https://www.postgresql.org/message-id/ZkRsoOMrSrNI945i%40ip-10-97-1-34.eu-west-3.compute.internal

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Jun 07, 2024 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:
> To me, this is something that is not expected to happen, but if it does, we would want detailed information. In that sense, it might be better not to hide it in the DETAILS field.

Same arguments here. A simple elog() makes it easier to get to this data.
--
Michael
On 2024-06-07 11:46:04 +0900, Michael Paquier wrote:
> On Fri, Jun 07, 2024 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:
> > To me, this is something that is not expected to happen, but if it does, we would want detailed information. In that sense, it might be better not to hide it in the DETAILS field.
>
> Same arguments here. A simple elog() makes it easier to get to this data.

+1
Hi,

On Thu, Jun 06, 2024 at 08:12:58PM -0700, Andres Freund wrote:
> On 2024-06-07 11:46:04 +0900, Michael Paquier wrote:
> > On Fri, Jun 07, 2024 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:
> > > To me, this is something that is not expected to happen, but if it does, we would want detailed information. In that sense, it might be better not to hide it in the DETAILS field.
> >
> > Same arguments here. A simple elog() makes it easier to get to this data.
>
> +1

That was a nit, so fine by me to move on with a simple elog().

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Jun 07, 2024 at 09:32:13AM +0000, Bertrand Drouvot wrote:
> That was a nit, so fine by me to move on with a simple elog().

Okay, I've just applied the elog() version down to 15 then. Now, about the entry getting dropped twice..
--
Michael
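For readers without the commit at hand: the applied change replaces the bare "can only drop stats once" with an elog() carrying the entry's hash key and reference count, roughly along these lines (paraphrased from memory of the commit, so the exact committed wording may differ slightly):

    /* in pgstat_drop_entry_internal(), instead of the previous bare message */
    if (shent->dropped)
        elog(ERROR,
             "trying to drop stats entry already dropped: kind=%s dboid=%u objoid=%u refcount=%u",
             pgstat_get_kind_info(shent->key.kind)->name,
             shent->key.dboid,
             shent->key.objoid,
             pg_atomic_read_u32(&shent->refcount));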
> Okay, I've just applied the elog() version down to 15 then. Now, about the entry getting dropped twice..

Thanks Michael.

I've got an update about the bug. I managed to reproduce it locally after a lot of digging.

How to repro:
- Setup primary + replica
- Open a psql session on both
- On primary session: create table t (a int); select 't'::regclass::oid;
- On replica session: select * from t;
- On primary session: drop table t; vacuum pg_class; checkpoint;
- Gdb attach to the backend for your primary, set a breakpoint for catalog.c:GetNewOidWithIndex, just before it calls GetNewObjectId()
- On primary session: create table t (a int);
- When it hits the breakpoint, simulate oid wraparound by setting: ShmemVariableCache->nextOid = <the output value of the select earlier>
  This will make pg create the new table with the same oid as the previous one.
- On primary session: drop table t; -- this triggers the replica to go down

The reason it crashes on the replica is that the recovery process is responsible for dropping stats on commit, but it's not creating them on table creation. Thus, on the second create table call, the old shared stats entry still exists (due to a backend still having a ref to it), but it is never reinitialized by the logic in pgstat_reinit_entry(). On the primary it's not possible to reach this state, because heap_create() creates the stats entry immediately when the table is created.

I wonder what's the best way to fix this though.
Should the redo process call pgstat_create_relation somewhere, just like heap_create does?
Should we just ignore this 'drop stats twice' error on standby?

-Floris
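The sequence Floris describes can be condensed into a small standalone model (plain C, not PostgreSQL code; the names only loosely mirror pgstat_shmem.c, and all locking, hashing and gc machinery is omitted). It tracks a single (kind, dboid, objoid) slot and assumes some standby backend keeps a cached reference across the whole sequence:

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* one shared-memory stats slot, standing in for a dshash entry */
    typedef struct ToySharedEntry
    {
        bool exists;     /* present in the shared hash at all */
        bool dropped;    /* object dropped, entry kept alive by references */
        int  refcount;   /* references held by backends' local caches */
    } ToySharedEntry;

    static ToySharedEntry slot;

    /* a backend looks the entry up, creating/reinitializing it if asked to */
    static void
    toy_get_entry_ref(bool create)
    {
        if (!slot.exists)
        {
            if (!create)
                return;
            slot = (ToySharedEntry) { .exists = true, .dropped = false, .refcount = 0 };
        }
        else if (slot.dropped && create)
            slot.dropped = false;   /* the pgstat_reinit_entry()-style path */
        slot.refcount++;
    }

    /* replay of a commit record that dropped the object */
    static void
    toy_drop_entry(void)
    {
        if (!slot.exists)
            return;
        if (slot.dropped)
        {
            fprintf(stderr, "FATAL: can only drop stats once\n");
            exit(1);
        }
        slot.dropped = true;
        if (slot.refcount == 0)
            slot.exists = false;    /* nobody holds a ref: freed right away */
        /* otherwise the entry lingers until the ref holders garbage-collect */
    }

    int
    main(void)
    {
        /* standby backend runs "select * from t": entry created, ref cached */
        toy_get_entry_ref(true);

        /* primary drops t; the startup process replays the drop.  The
         * backend's cached ref keeps the (now dropped) entry alive. */
        toy_drop_entry();

        /* OID wraparound: a new table reuses the same oid on the primary.
         * heap_create() reinitializes the entry there, but the standby's
         * startup process creates nothing during redo, so "dropped" stays set. */

        /* primary drops the new table; the standby replays a second drop */
        toy_drop_entry();           /* -> "can only drop stats once" */

        printf("not reached\n");
        return 0;
    }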
On Sat, Jun 08, 2024 at 11:52:43AM +0000, Floris Van Nee wrote:
> I've got an update about the bug. I managed to reproduce it locally after a lot of digging.
>
> How to repro:
> - Setup primary + replica
> - Open a psql session on both
> - On primary session: create table t (a int); select 't'::regclass::oid;
> - On replica session: select * from t;
> - On primary session: drop table t; vacuum pg_class; checkpoint;
> - Gdb attach to the backend for your primary, set a breakpoint for catalog.c:GetNewOidWithIndex, just before it calls GetNewObjectId()
> - On primary session: create table t (a int);
> - When it hits the breakpoint, simulate oid wraparound by setting: ShmemVariableCache->nextOid = <the output value of the select earlier>
>   This will make pg create the new table with the same oid as the previous one.
> - On primary session: drop table t; -- this triggers the replica to go down

Okay, this stuff makes the beginning of a week fun.

> The reason it crashes on the replica is that the recovery process is responsible for dropping stats on commit, but it's not creating them on table creation. Thus, on the second create table call, the old shared stats entry still exists (due to a backend still having a ref to it), but it is never reinitialized by the logic in pgstat_reinit_entry(). On the primary it's not possible to reach this state, because heap_create() creates the stats entry immediately when the table is created.
>
> I wonder what's the best way to fix this though.
> Should the redo process call pgstat_create_relation somewhere, just like heap_create does?
> Should we just ignore this 'drop stats twice' error on standby?

Nah, ignoring the double-drop error does not seem right to me. Wouldn't it make the most sense to ensure that the stats are dropped on the standby instead, on the first DROP replayed, even if there are still references to it held, making sure that the stats entry with this OID is gone before reusing it after wraparound?
--
Michael
> Nah, ignoring the double-drop error does not seem right to me. Wouldn't it make the most sense to ensure that the stats are dropped on the standby instead, on the first DROP replayed, even if there are still references to it held, making sure that the stats entry with this OID is gone before reusing it after wraparound?

Agree it doesn't seem right to just ignore. However I don't believe we can just drop the stats entry on replay of the first drop, right? That's why all this logic with pgstat_request_entry_refs_gc was invented in the first place, to get backends to clean up their refs first. If we free it, backends can't see the 'dropped' attribute anymore. This can only be changed if the whole gc mechanism gets revamped to use a different mechanism to notify backends that their local reference in pgStatEntryRefHash is suddenly invalid. I doubt that's backpatchable.

On the primary this doesn't happen because it's guaranteed that 'drop stats' can only ever happen after a 'create stats' call. This is the thing that's different on a replica: it is possible to get 'drop stats' twice, without a 'create stats' in between.

Thing is that even on the primary there are already some edge cases that it handles explicitly (e.g. the create-drop-create case where on the second 'create' the old stats entry hasn't been removed yet). If the create-drop-create case has special logic to handle the old entry still existing, why wouldn't the create-drop-drop case have special handling as well?

Andres, would you be able to chime in on this?

-Floris
At Mon, 10 Jun 2024 15:13:16 +0000, Floris Van Nee <florisvannee@Optiver.com> wrote in
> > Nah, ignoring the double-drop error does not seem right to me. Wouldn't it make the most sense to ensure that the stats are dropped on the standby instead, on the first DROP replayed, even if there are still references to it held, making sure that the stats entry with this OID is gone before reusing it after wraparound?
>
> Agree it doesn't seem right to just ignore. However I don't believe we can just drop the stats entry on replay of the first drop, right? That's why all this logic with pgstat_request_entry_refs_gc was invented in the first place, to get backends to clean up their refs first. If we free it, backends can't see the 'dropped' attribute anymore. This can only be changed if the whole gc mechanism gets revamped to use a different mechanism to notify backends that their local reference in pgStatEntryRefHash is suddenly invalid. I doubt that's backpatchable.
>
> On the primary this doesn't happen because it's guaranteed that 'drop stats' can only ever happen after a 'create stats' call. This is the thing that's different on a replica: it is possible to get 'drop stats' twice, without a 'create stats' in between.

We don't simply allow double-drop; instead, we permit this kind of behavior only during recovery by signaling the redo-execution functions with the parameter isRedo, like smgr_unlink().

> Thing is that even on the primary there are already some edge cases that it handles explicitly (e.g. the create-drop-create case where on the second 'create' the old stats entry hasn't been removed yet). If the create-drop-create case has special logic to handle the old entry still existing, why wouldn't the create-drop-drop case have special handling as well?

The two things look completely different to me. The former is an indisputably legit operation, but the latter is indisputably illegitimate as internal logic.

> Andres, would you be able to chime in on this?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Jun 11, 2024 at 04:44:57PM +0900, Kyotaro Horiguchi wrote:
> We don't simply allow double-drop; instead, we permit this kind of behavior only during recovery by signaling the redo-execution functions with the parameter isRedo, like smgr_unlink().

Still, the case is different here, no? Recovery forgives unlinks as long as we are in crash recovery, or consistency has not been reached on a standby. This case is different: we are in a consistent state and a backend on the standby still holds references to the stats entry that is transparently reused after a wraparound.

I am wondering if we should tackle this problem by adding a notion of "age" in the stats entries, for what's stored in the central dshash and in the copies held by the backends, so as we bump the "age" of an existing stats entry if we find out that it gets reused in a create path. That requires some design and would not be backpatchable, but it does not seem that invasive at quick glance (I had my eyes on this code the full week). Other thoughts are welcome.
--
Michael
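As a purely illustrative sketch of the "age" idea (a standalone toy, not a patch and not existing PostgreSQL code): the shared entry would carry a counter that is bumped whenever a dropped entry gets reused by a create path, and backend-local references would remember the counter value they observed, so a stale reference can be detected and discarded instead of being trusted:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct ToySharedEntry
    {
        uint32_t age;       /* bumped every time the entry is reused after a drop */
        bool     dropped;
    } ToySharedEntry;

    typedef struct ToyLocalRef
    {
        ToySharedEntry *shared;
        uint32_t        age;    /* value observed when the reference was taken */
    } ToyLocalRef;

    /* create path finding a previously dropped entry: reinitialize and age it */
    static void
    toy_reuse_entry(ToySharedEntry *e)
    {
        e->dropped = false;
        e->age++;
    }

    /* a backend would check this before trusting its cached reference */
    static bool
    toy_ref_is_stale(const ToyLocalRef *ref)
    {
        return ref->age != ref->shared->age;
    }

    int
    main(void)
    {
        ToySharedEntry entry = { .age = 0, .dropped = false };
        ToyLocalRef    ref   = { .shared = &entry, .age = entry.age };

        entry.dropped = true;    /* object dropped while the ref is still held */
        toy_reuse_entry(&entry); /* same OID reused after wraparound */

        printf("cached reference stale? %s\n", toy_ref_is_stale(&ref) ? "yes" : "no");
        return 0;
    }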
On Fri, Jun 14, 2024 at 09:37:35AM +0900, Michael Paquier wrote:
> I am wondering if we should tackle this problem by adding a notion of "age" in the stats entries, for what's stored in the central dshash and in the copies held by the backends, so as we bump the "age" of an existing stats entry if we find out that it gets reused in a create path. That requires some design and would not be backpatchable, but it does not seem that invasive at quick glance (I had my eyes on this code the full week). Other thoughts are welcome.

This still sounds like a good idea after sleeping on it over the weekend. Please note that I am moving on with the problem on this thread where this has been reported, and that is the point of origin:
https://www.postgresql.org/message-id/ZmvVudpT8z6hbY9t@paquier.xyz

Let's continue the discussion there.
--
Michael
At Mon, 17 Jun 2024 08:28:10 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> On Fri, Jun 14, 2024 at 09:37:35AM +0900, Michael Paquier wrote:
> > I am wondering if we should tackle this problem by adding a notion of "age" in the stats entries, for what's stored in the central dshash and in the copies held by the backends, so as we bump the "age" of an existing stats entry if we find out that it gets reused in a create path. That requires some design and would not be backpatchable, but it does not seem that invasive at quick glance (I had my eyes on this code the full week). Other thoughts are welcome.
>
> This still sounds like a good idea after sleeping on it over the weekend. Please note that I am moving on with the problem on this thread where this has been reported, and that is the point of origin:
> https://www.postgresql.org/message-id/ZmvVudpT8z6hbY9t@paquier.xyz
>
> Let's continue the discussion there.

Is it something like this? I forgot the reason why this was excluded from the final version, though.

https://www.postgresql.org/message-id/20201001.090722.322196928507744460.horikyota.ntt%40gmail.com

> Dshash scans are used for dropping and resetting stats entries. Entry dropping is performed in the following steps.
>
> (delete_current_stats_entry())
> - Drop the dshash entry (needs exlock of dshash partition).
> - If refcount of the stats entry body is already zero, free the memory immediately.
> - If not, set the "dropped" flag of the body. No lock is required because the "dropped" flag won't be even referred to by other backends until the next step is done.
> - Increment deletion count of the shared hash. (This is used as the "age" of the local pointer cache hash (pgstat_cache).)
>
> (get_stat_entry())
> - If dshash deletion count is different from the local cache age, scan over the local cache hash to find "dropped" entries.
> - Decrement refcount of the dropped entry and free the shared entry if it is no longer referenced. Apparently no lock is required.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Jun 25, 2024 at 09:22:48AM +0900, Kyotaro Horiguchi wrote:
> Is it something like this? I forgot the reason why this was excluded from the final version, though.
>
> https://www.postgresql.org/message-id/20201001.090722.322196928507744460.horikyota.ntt%40gmail.com

That's a bit different from the gc_count presented in 0004, because there is no relationship to autovacuum with the cleanup of the entries; still, there is some overlap in the thought process: this suggestion just makes sure that any backend holding local references to a shared entry with a different age does not attempt to manipulate it after switching locally to a dropped state.
--
Michael