Thread: Commit Timestamp and LSN Inversion issue
Hello hackers, (Cc people involved in the earlier discussion) I would like to discuss the $Subject. While discussing Logical Replication's Conflict Detection and Resolution (CDR) design in [1] , it came to our notice that the commit LSN and timestamp may not correlate perfectly i.e. commits may happen with LSN1 < LSN2 but with Ts1 > Ts2. This issue may arise because, during the commit process, the timestamp (xactStopTimestamp) is captured slightly earlier than when space is reserved in the WAL. ~~ Reproducibility of conflict-resolution problem due to the timestamp inversion ------------------------------------------------ It was suggested that timestamp inversion *may* impact the time-based resolutions such as last_update_wins (targeted to be implemented in [1]) as we may end up making wrong decisions if timestamps and LSNs are not correctly ordered. And thus we tried some tests but failed to find any practical scenario where it could be a problem. Basically, the proposed conflict resolution is a row-level resolution, and to cause the row value to be inconsistent, we need to modify the same row in concurrent transactions and commit the changes concurrently. But this doesn't seem possible because concurrent updates on the same row are disallowed (e.g., the later update will be blocked due to the row lock). See [2] for the details. We tried to give some thoughts on multi table cases as well e.g., update table A with foreign key and update the table B that table A refers to. But update on table A will block the update on table B as well, so we could not reproduce data-divergence due to the LSN/timestamp mismatch issue there. ~~ Idea proposed to fix the timestamp inversion issue ------------------------------------------------ There was a suggestion in [3] to acquire the timestamp while reserving the space (because that happens in LSN order). The clock would need to be monotonic (easy enough with CLOCK_MONOTONIC), but also cheap. The main problem why it's being done outside the critical section, because gettimeofday() may be quite expensive. There's a concept of hybrid clock, combining "time" and logical counter, which might be useful independently of CDR. On further analyzing this idea, we found that CLOCK_MONOTONIC can be accepted only by clock_gettime() which has more precision than gettimeofday() and thus is equally or more expensive theoretically (we plan to test it and post the results). It does not look like a good idea to call any of these when holding spinlock to reserve the wal position. As for the suggested solution "hybrid clock", it might not help here because the logical counter is only used to order the transactions with the same timestamp. The problem here is how to get the timestamp along with wal position reservation(ReserveXLogInsertLocation). ~~ We can explore further but as we are not able to see any real-time scenario where this could actually be problem, it may or may not be worth to spend time on this. Thoughts? [1]: (See: "How is this going to deal with the fact that commit LSN and timestamps may not correlate perfectly?"). https://www.postgresql.org/message-id/CAJpy0uBWBEveM8LO2b7wNZ47raZ9tVJw3D2_WXd8-b6LSqP6HA%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAA4eK1JTMiBOoGqkt%3DaLPLU8Rs45ihbLhXaGHsz8XC76%2BOG3%2BQ%40mail.gmail.com [3]: (See: "The clock would need to be monotonic (easy enough with CLOCK_MONOTONIC"). https://www.postgresql.org/message-id/a3a70a19-a35e-426c-8646-0898cdc207c8%40enterprisedb.com thanks Shveta
Hi Shveta, > While discussing Logical Replication's Conflict Detection and > Resolution (CDR) design in [1] , it came to our notice that the > commit LSN and timestamp may not correlate perfectly i.e. commits may > happen with LSN1 < LSN2 but with Ts1 > Ts2. This issue may arise > because, during the commit process, the timestamp (xactStopTimestamp) > is captured slightly earlier than when space is reserved in the WAL. > [...] > There was a suggestion in [3] to acquire the timestamp while reserving > the space (because that happens in LSN order). The clock would need to > be monotonic (easy enough with CLOCK_MONOTONIC), but also cheap. The > main problem why it's being done outside the critical section, because > gettimeofday() may be quite expensive. There's a concept of hybrid > clock, combining "time" and logical counter, which might be useful > independently of CDR. I don't think you can rely on a system clock for conflict resolution. In a corner case a DBA can move the clock forward or backward between recordings of Ts1 and Ts2. On top of that there is no guarantee that 2+ servers have synchronised clocks. It seems to me that what you are proposing will just hide the problem instead of solving it in the general case. This is the reason why {latest|earliest}_timestamp_wins strategies you are proposing to use for CDR are poor strategies. In practice they work as random_key_wins which is not extremely useful (what it does is basically _deleting_ random data, not solving conflicts). On top of that strategies like this don't take into account checks and constraints the database may have, including high-level constraints that may not be explicitly defined in the DBMS but may exist in the application logic. Unfortunately I can't reference a particular article or white paper on the subject but I know that "last write wins" was widely criticized back in the 2010s when people were building distributed systems on commodity hardware. In this time period I worked on several projects as a backend software engineer and I can assure you that LWW is not something you want. IMO the right approach to the problem would be defining procedures for conflict resolution that may not only semi-randomly choose between two tuples but also implement a user-defined logic. Similarly to INSERT INTO ... ON CONFLICT ... semantics, or similar approaches from long-lived and well-explored distributed system, e.g. Riak. Alternatively / additionally we could support CRDTs in Postgres. -- Best regards, Aleksander Alekseev
On Wed, Sep 4, 2024 at 2:05 PM Aleksander Alekseev <aleksander@timescale.com> wrote: > > > While discussing Logical Replication's Conflict Detection and > > Resolution (CDR) design in [1] , it came to our notice that the > > commit LSN and timestamp may not correlate perfectly i.e. commits may > > happen with LSN1 < LSN2 but with Ts1 > Ts2. This issue may arise > > because, during the commit process, the timestamp (xactStopTimestamp) > > is captured slightly earlier than when space is reserved in the WAL. > > [...] > > There was a suggestion in [3] to acquire the timestamp while reserving > > the space (because that happens in LSN order). The clock would need to > > be monotonic (easy enough with CLOCK_MONOTONIC), but also cheap. The > > main problem why it's being done outside the critical section, because > > gettimeofday() may be quite expensive. There's a concept of hybrid > > clock, combining "time" and logical counter, which might be useful > > independently of CDR. > > I don't think you can rely on a system clock for conflict resolution. > In a corner case a DBA can move the clock forward or backward between > recordings of Ts1 and Ts2. On top of that there is no guarantee that > 2+ servers have synchronised clocks. It seems to me that what you are > proposing will just hide the problem instead of solving it in the > general case. > It is possible that we can't rely on the system clock for conflict resolution but that is not the specific point of this thread. As mentioned in the subject of this thread, the problem is "Commit Timestamp and LSN Inversion issue". The LSN value and timestamp for a commit are not generated atomically, so two different transactions can have them in different order. Your point as far as I can understand is that in the first place, it is not a good idea to have a strategy like "last_write_wins" which relies on the system clock. So, even if LSN->Timestamp ordering has any problem, it won't matter to us. Now, we can discuss whether "last_write_wins" is a poor strategy or not but if possible, for the sake of the point of this thread, let's assume that users using the resolution feature ("last_write_wins") ensure that clocks are synced or they won't enable this feature and then see if we can think of any problem w.r.t the current code. -- With Regards, Amit Kapila.
Hi Amit, > > I don't think you can rely on a system clock for conflict resolution. > > In a corner case a DBA can move the clock forward or backward between > > recordings of Ts1 and Ts2. On top of that there is no guarantee that > > 2+ servers have synchronised clocks. It seems to me that what you are > > proposing will just hide the problem instead of solving it in the > > general case. > > > > It is possible that we can't rely on the system clock for conflict > resolution but that is not the specific point of this thread. As > mentioned in the subject of this thread, the problem is "Commit > Timestamp and LSN Inversion issue". The LSN value and timestamp for a > commit are not generated atomically, so two different transactions can > have them in different order. Hm.... Then I'm having difficulties understanding why this is a problem and why it was necessary to mention CDR in this context in the first place. OK, let's forget about CDR completely. Who is affected by the current behavior and why would it be beneficial changing it? -- Best regards, Aleksander Alekseev
On Wed, Sep 4, 2024 at 6:35 PM Aleksander Alekseev <aleksander@timescale.com> wrote: > > > > I don't think you can rely on a system clock for conflict resolution. > > > In a corner case a DBA can move the clock forward or backward between > > > recordings of Ts1 and Ts2. On top of that there is no guarantee that > > > 2+ servers have synchronised clocks. It seems to me that what you are > > > proposing will just hide the problem instead of solving it in the > > > general case. > > > > > > > It is possible that we can't rely on the system clock for conflict > > resolution but that is not the specific point of this thread. As > > mentioned in the subject of this thread, the problem is "Commit > > Timestamp and LSN Inversion issue". The LSN value and timestamp for a > > commit are not generated atomically, so two different transactions can > > have them in different order. > > Hm.... Then I'm having difficulties understanding why this is a > problem This is a potential problem pointed out during discussion of CDR [1] (Please read the point starting from "How is this going to deal .." and response by Shveta). The point of this thread is that though it appears to be a problem but practically there is no scenario where it can impact even when we implement "last_write_wins" startegy as explained in the initial email. If you or someone sees a problem due to LSN<->timestamp inversion then we need to explore the solution for it. > > and why it was necessary to mention CDR in this context in the > first place. > > OK, let's forget about CDR completely. Who is affected by the current > behavior and why would it be beneficial changing it? > We can't forget CDR completely as this could only be a potential problem in that context. Right now, we don't have any built-in resolution strategies, so this can't impact but if this is a problem then we need to have a solution for it before considering a solution like "last_write_wins" strategy. Now, instead of discussing LSN<->timestamp inversion issue, you started to discuss "last_write_wins" strategy itself which we have discussed to some extent in the thread [2]. BTW, we are planning to start a separate thread as well just to discuss the clock skew problem w.r.t resolution strategies like "last_write_wins" strategy. So, we can discuss clock skew in that thread and keep the focus of this thread LSN<->timestamp inversion problem. [1] - https://www.postgresql.org/message-id/CAJpy0uBWBEveM8LO2b7wNZ47raZ9tVJw3D2_WXd8-b6LSqP6HA%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAJpy0uD0-DpYVMtsxK5R%3DzszXauZBayQMAYET9sWr_w0CNWXxQ%40mail.gmail.com -- With Regards, Amit Kapila.
On Tue, Nov 5, 2024 at 7:28 PM Jan Wieck <jan@wi3ck.info> wrote: > > > > > We can't forget CDR completely as this could only be a potential > > problem in that context. Right now, we don't have any built-in > > resolution strategies, so this can't impact but if this is a problem > > then we need to have a solution for it before considering a solution > > like "last_write_wins" strategy. > > I agree that we can't forget about CDR. This is precisely the problem we > ran into here at pgEdge and why we came up with a solution (attached). > I would like to highlight that we need to solve LSN<->Timestamp inversion issue not only for resolution strategies like 'last_write_wins' but also for conflict detection as well. In particular, while implementing/discussing the patch to detect the update_deleted conflict type, we came across the race conditions [1] where the inversion issue discussed here would lead to removing the required rows before we could detect the conflict. So, +1 to solve this issue. > > Now, instead of discussing LSN<->timestamp inversion issue, you > > started to discuss "last_write_wins" strategy itself which we have > > discussed to some extent in the thread [2]. BTW, we are planning to > > start a separate thread as well just to discuss the clock skew problem > > w.r.t resolution strategies like "last_write_wins" strategy. So, we > > can discuss clock skew in that thread and keep the focus of this > > thread LSN<->timestamp inversion problem. > > Fact is that "last_write_wins" together with some implementation of > Conflict free Replicated Data Types (CRDT) is good enough for many real > world situations. Anything resembling a TPC-B or TPC-C is quite happy > with it. > > The attached solution is minimally invasive because it doesn't move the > timestamp generation (clock_gettime() call) into the critical section of > ReserveXLogInsertLocation() that is protected by a spinlock. Instead it > keeps track of the last commit-ts written to WAL in shared memory and > simply bumps that by one microsecond if the next one is below or equal. > There is one extra condition in that code section plus a function call > by pointer for every WAL record. > I think we avoid calling hook/callback functions after holding a lock (spinlock or LWLock) as the user may do an expensive operation or acquire some other locks in those functions which could lead to deadlocks or impact the concurrency. So, it would be better to directly call an inline function to perform the required operation. This sounds like a good idea to solve this problem. Thanks for sharing the patch. [1] - https://www.postgresql.org/message-id/CAA4eK1LKgkjyNKeW5jEhy1%3DuE8z0p7Pdae0rohoJP51eJGd%3Dgg%40mail.gmail.com -- With Regards, Amit Kapila.
On 11/6/24 06:23, Amit Kapila wrote: > I think we avoid calling hook/callback functions after holding a lock > (spinlock or LWLock) as the user may do an expensive operation or > acquire some other locks in those functions which could lead to > deadlocks or impact the concurrency. So, it would be better to > directly call an inline function to perform the required operation. This is a valid concern. The reason why I kept it as a hook is because ReserveXLogInsertLocation() has no knowledge that this is allocating WAL space for a commit record. Only the caller does. We certainly need to be extremely careful what any such hook function is doing. Acquiring additional locks is definitely not acceptable. But I am not sure we should burden this function with specialized knowledge about what it is reserving WAL space for. Regards, Jan
On Tuesday, November 5, 2024 9:59 PM Jan Wieck <jan@wi3ck.info> wrote: > > Hi Hackers, > > On 9/5/24 01:39, Amit Kapila wrote: > > > > We can't forget CDR completely as this could only be a potential > > problem in that context. Right now, we don't have any built-in > > resolution strategies, so this can't impact but if this is a problem > > then we need to have a solution for it before considering a solution > > like "last_write_wins" strategy. > > I agree that we can't forget about CDR. This is precisely the problem we ran into > here at pgEdge and why we came up with a solution (attached). > > > Now, instead of discussing LSN<->timestamp inversion issue, you > > started to discuss "last_write_wins" strategy itself which we have > > discussed to some extent in the thread [2]. BTW, we are planning to > > start a separate thread as well just to discuss the clock skew problem > > w.r.t resolution strategies like "last_write_wins" strategy. So, we > > can discuss clock skew in that thread and keep the focus of this > > thread LSN<->timestamp inversion problem. > > Fact is that "last_write_wins" together with some implementation of Conflict > free Replicated Data Types (CRDT) is good enough for many real world > situations. Anything resembling a TPC-B or TPC-C is quite happy with it. > > The attached solution is minimally invasive because it doesn't move the > timestamp generation (clock_gettime() call) into the critical section of > ReserveXLogInsertLocation() that is protected by a spinlock. Instead it keeps > track of the last commit-ts written to WAL in shared memory and simply bumps > that by one microsecond if the next one is below or equal. > There is one extra condition in that code section plus a function call by pointer > for every WAL record. In the unlikely case of encountering a stalled or > backwards running commit-ts, the expensive part of recalculating the CRC of > the altered commit WAL-record is done later, after releasing the spinlock. I have > not been able to measure any performance impact on a machine with 2x > Xeon-Silver (32 HT cores). > > This will work fine until we have systems that can sustain a commit rate of one > million transactions per second or higher for more than a few milliseconds. Thanks for the patch! I am reading the patch and noticed few minor things. 1. + /* + * This is a local transaction. Make sure that the xact_time + * higher than any timestamp we have seen thus far. + * + * TODO: This is not postmaster restart safe. If the local + * system clock is further behind other nodes than it takes + * for the postmaster to restart (time between it stops + * accepting new transactions and time when it becomes ready + * to accept new transactions), local transactions will not + * be bumped into the future correctly. + */ The TODO section mentions other nodes, but I believe think patch currently do not have the handling of timestamps for other nodes. Should we either remove this part or add a brief explanation to clarify the relationship with other nodes? 2. +/* + * Hook function to be called while holding the WAL insert spinlock. + * to adjust commit timestamps via Lamport clock if needed. + */ The second line seems can be improved: "to adjust commit timestamps .." => "It adjusts commit timestamps ..." Best Regards, Hou zj
Hi, On 2024-11-05 08:58:36 -0500, Jan Wieck wrote: > The attached solution is minimally invasive because it doesn't move the > timestamp generation (clock_gettime() call) into the critical section of > ReserveXLogInsertLocation() that is protected by a spinlock. Instead it > keeps track of the last commit-ts written to WAL in shared memory and simply > bumps that by one microsecond if the next one is below or equal. There is > one extra condition in that code section plus a function call by pointer for > every WAL record. In the unlikely case of encountering a stalled or > backwards running commit-ts, the expensive part of recalculating the CRC of > the altered commit WAL-record is done later, after releasing the spinlock. I > have not been able to measure any performance impact on a machine with 2x > Xeon-Silver (32 HT cores). I think it's *completely* unacceptable to call a hook inside ReserveXLogInsertLocation, with the spinlock held, no less. That's the most contended section of code in postgres. Even leaving that aside, something with a spinlock held is never a suitable place to call a hook. As the header comments say: * Keep in mind the coding rule that spinlocks must not be held for more * than a few instructions. In particular, we assume it is not possible * for a CHECK_FOR_INTERRUPTS() to occur while holding a spinlock, and so * it is not necessary to do HOLD/RESUME_INTERRUPTS() in these macros. I'm also quite certain that we don't want to just allow to redirect XLogInsert to some random function. WAL logging is very crucial code and allowing to randomly redirect it to something we don't know about will make it harder to write correct WAL logging code and makes already quite complicated code even more complicated. > This will work fine until we have systems that can sustain a commit rate of > one million transactions per second or higher for more than a few > milliseconds. It's sustainable for many minutes today. Greetings, Andres Freund
On 11/7/24 13:19, Andres Freund wrote: > I think it's *completely* unacceptable to call a hook inside > ReserveXLogInsertLocation, with the spinlock held, no less. That's the most > contended section of code in postgres. Aside from your technical concerns, can we at least agree that the commit-ts vs LSN inversion is a problem that should get addressed? Regards, Jan
Hi, On 2024-11-07 15:05:31 -0500, Jan Wieck wrote: > On 11/7/24 13:19, Andres Freund wrote: > > I think it's *completely* unacceptable to call a hook inside > > ReserveXLogInsertLocation, with the spinlock held, no less. That's the most > > contended section of code in postgres. > > Aside from your technical concerns, can we at least agree that the commit-ts > vs LSN inversion is a problem that should get addressed? I'm not against addressing it, but I'd consider it not a high priority and that the acceptable cost (both runtime and maintenance) of it should would need to be extremely low. Greetings, Andres Freund
On Thu, Nov 7, 2024 at 9:39 PM Jan Wieck <jan@wi3ck.info> wrote: > > On 11/6/24 21:30, Zhijie Hou (Fujitsu) wrote: > > > > Thanks for the patch! I am reading the patch and noticed few minor things. > > > > 1. > > + /* > > + * This is a local transaction. Make sure that the xact_time > > + * higher than any timestamp we have seen thus far. > > + * > > + * TODO: This is not postmaster restart safe. If the local > > + * system clock is further behind other nodes than it takes > > + * for the postmaster to restart (time between it stops > > + * accepting new transactions and time when it becomes ready > > + * to accept new transactions), local transactions will not > > + * be bumped into the future correctly. > > + */ > > > > The TODO section mentions other nodes, but I believe think patch currently do > > not have the handling of timestamps for other nodes. Should we either remove > > this part or add a brief explanation to clarify the relationship with other > > nodes? > > That TODO is actually obsolete. I understood from Amit Kapila that the > community does assume that NTP synchronization is good enough. > This is my understanding from the relevant discussion in the email thread [1]. [1] - https://www.postgresql.org/message-id/2423650.1726842094%40sss.pgh.pa.us -- With Regards, Amit Kapila.