Re: Global snapshots - Mailing list pgsql-hackers

From:           Arseny Sher
Subject:        Re: Global snapshots
Date:
Msg-id:         87pnx0fgiw.fsf@ars-thinkpad
In response to: Re: Global snapshots (Andrey Borodin <x4mmm@yandex-team.ru>)
List:           pgsql-hackers
Hello,

Andrey Borodin <x4mmm@yandex-team.ru> writes:

> I like the idea that with this patch set universally all postgres
> instances are bound into single distributed DB, even if they never
> heard about each other before :) This is just amazing. Or do I get
> something wrong?

Yeah, in the sense of xact visibility we can view it like this.

> I've got few questions:
> 1. If we coordinate HA-clusters with replicas, can replicas
> participate if their part of transaction is read-only?

Ok, there are several things to consider. Clock-SI as described in the
paper technically boils down to three things. The first two assume a
CSN-based implementation of MVCC where local time acts as the
CSN/snapshot source, and they impose the following additional rules:

1) When an xact expands to some node and imports its snapshot, it must
   be blocked until the clocks on this node show time >= the snapshot
   being imported: a node never processes xacts with a snapshot 'from
   the future'.

2) When we choose the CSN to commit an xact with, we must read the
   clocks on all the nodes which participated in it and set the CSN to
   the max among the read values.

These rules ensure *integrity* of the snapshot in the face of clock
skews regardless of which node we access: that is, snapshots are stable
(no non-repeatable reads) and no xact is considered half-committed;
they prevent the situation when the snapshot sees some xact on one node
as committed and on another node as still running. (Actually, this is
only true under the assumption that any distributed xact is committed
at all nodes instantly at the same time; this is obviously not true,
see the 3rd point below.)

If we are speaking about e.g. traditional usage of a hot standby, when
a client in one xact accesses either the primary or some standby, but
not several nodes at once, we just don't need this stuff because the
usual MVCC in Postgres already provides you with a consistent snapshot.
The same is true for multimaster-like setups, when each node accepts
writes but a client still accesses a single node; if there is a write
conflict (e.g. the same row updated on different nodes), one of the
xacts must be aborted; the snapshot is still good.

However, if you really intend to read data from multiple nodes in one
xact (e.g. read the primary and then a replica), then yes, these
problems arise and Clock-SI helps with them. Honestly, though, it is
hard for me to make up a reason why you would want to do that: reading
local data is always more efficient than visiting several nodes. It
would make sense if we could read the primary and a replica in
parallel, but that is currently impossible in core Postgres. A more
straightforward application of the patchset is sharding, when data is
split and you might need to go to several nodes in one xact to collect
the needed data. Also, this patchset adds the core algorithm and makes
use of it only in postgres_fdw; you would need to adapt replication
(import/export global snapshot API) to make it work there.

3) The third rule of Clock-SI deals with the following problem. A
   distributed (writing to several nodes) xact doesn't commit
   (i.e. become visible) instantly at all nodes. That means there is a
   time hole in which we can see the xact as committed on some node and
   still running on another. To mitigate this, Clock-SI adds a kind of
   two-phase commit on visibility: an additional state InDoubt which
   blocks all attempts to read this xact's changes until the xact's
   fate (commit/abort) is determined.
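To put the three rules together, here is a rough C sketch of how they
could look; every name in it is invented purely for illustration and is
not the patchset's actual API:

/*
 * A rough sketch of the three Clock-SI rules. All names below are
 * invented for illustration; this is not the patchset's actual API.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t GlobalCSN;     /* CSN derived from the local clock */
typedef uint32_t Xid;

extern GlobalCSN GetLocalClockCsn(void);     /* read this node's clock */
extern void      SleepForAWhile(void);
extern bool      XactIsInDoubt(Xid xid);
extern void      WaitForXactResolution(Xid xid); /* blocks until commit/abort */
extern GlobalCSN XactCommitCsn(Xid xid);     /* 0 if aborted or still running */

/* Rule 1: never run under a snapshot "from the future". */
static void
ImportGlobalSnapshot(GlobalCSN snapshot_csn)
{
    while (GetLocalClockCsn() < snapshot_csn)
        SleepForAWhile();
    /* ...install snapshot_csn as the active snapshot here... */
}

/* Rule 2: commit CSN is the max of the clocks of all participant nodes. */
static GlobalCSN
ChooseCommitCsn(const GlobalCSN *participant_clocks, int nparticipants)
{
    GlobalCSN   commit_csn = 0;

    for (int i = 0; i < nparticipants; i++)
        if (participant_clocks[i] > commit_csn)
            commit_csn = participant_clocks[i];
    return commit_csn;
}

/* Rule 3: an InDoubt xact blocks visibility checks until it is resolved. */
static bool
XidVisibleInSnapshot(Xid xid, GlobalCSN snapshot_csn)
{
    if (XactIsInDoubt(xid))
        WaitForXactResolution(xid);
    return XactCommitCsn(xid) != 0 && XactCommitCsn(xid) <= snapshot_csn;
}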
Unlike the previous problem, this issue exists in all replicated
setups. Even if we just have a primary streaming data to one hot
standby, xacts are not committed on both of them instantly, and we
might observe an xact as committed on the primary, then quickly switch
to the standby and find that the data we have just seen has
disappeared. remote_apply mode partially alleviates this problem
(apparently to the degree comfortable for most application developers)
by switching positions: with it, an xact always commits on replicas
earlier than on the master. At least this guarantees that the guy who
wrote the xact will definitely see it on the replica, unless he drops
the connection to the master before the commit ack. Still, the problem
is not fully solved: only the addition of the InDoubt state can fix
this. While Clock-SI (and this patchset) certainly addresses the issue,
as it becomes even more serious in sharded setups (it makes it possible
to see /parts/ of transactions), there is nothing CSN- or
clock-specific here. In theory, we could implement the same two-phase
commit on visibility without switching to timestamp-based CSN MVCC.

Aside from the paper, you can have a look at the Clock-SI explanation
in these slides [1] from PGCon.

> 2. How does InDoubt transaction behave when we add or subtract leap
> seconds?

Good question! In Clock-SI, time can be arbitrarily desynchronized and
might go forward with arbitrary speed (e.g. clocks can be stopped), but
it must never go backwards. So if the leap second correction is
implemented by doubling the duration of a certain second (as it usually
seems to be), we are fine.

> Also, I could not understand some notes from Arseny:
>
>> On 25 July 2018, at 16:35, Arseny Sher <a.sher@postgrespro.ru> wrote:
>>
>> * One drawback of these patches is that only REPEATABLE READ is
>> supported. For READ COMMITTED, we must export every new snapshot
>> generated on coordinator to all nodes, which is fairly easy to
>> do. SERIALIZABLE will definitely require chattering between nodes,
>> but that's much less demanded isolevel (e.g. we still don't support
>> it on replicas).
>
> If all shards are executing transaction in SERIALIZABLE, what
> anomalies does it permit?
>
> If you have transactions on server A and server B, there are
> transactions 1 and 2, transaction A1 is serialized before A2, but B1
> is after B2, right?
>
> Maybe we can somehow abort 1 or 2?

Yes, your explanation is concise and correct. To put it another way,
ensuring SERIALIZABLE in MVCC requires tracking reads, and there is no
infrastructure for doing that globally. Classical write skew is
possible: you have node A holding x and node B holding y, initially
x = y = 30, and there is a constraint x + y > 0. Two concurrent xacts
start:

T1: x = x - 42;
T2: y = y - 42;

They don't see each other, so both commit successfully and the
constraint is violated. We need to transfer info about reads between
nodes to know when we need to abort someone.
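Just to make the example concrete, here is a toy standalone C
simulation of it; no real nodes or MVCC are involved, the snapshots are
modeled as plain local copies:

/*
 * Toy illustration of the write skew above: both transactions validate
 * the constraint against their own snapshot (x = y = 30), neither sees
 * the other's write, and the invariant x + y > 0 silently breaks.
 */
#include <assert.h>
#include <stdio.h>

int
main(void)
{
    int x = 30;     /* lives on node A */
    int y = 30;     /* lives on node B */

    /* Both xacts take their snapshots before either of them commits. */
    int t1_x = x, t1_y = y;
    int t2_x = x, t2_y = y;

    /* T1: the constraint still holds in its snapshot, so it updates x. */
    if (t1_x - 42 + t1_y > 0)
        x -= 42;                    /* commits on node A */

    /* T2: same check against its own, equally stale, snapshot. */
    if (t2_x + t2_y - 42 > 0)
        y -= 42;                    /* commits on node B */

    printf("x + y = %d\n", x + y);  /* prints -24 */
    assert(x + y > 0);              /* the global invariant is violated */
    return 0;
}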
>> * Another somewhat serious issue is that there is a risk of recency
>> guarantee violation. If client starts transaction at node with
>> lagging clocks, its snapshot might not include some recently
>> committed transactions; if client works with different nodes, she
>> might not even see her own changes. CockroachDB describes at [1] how
>> they and Google Spanner overcome this problem. In short, both set
>> hard limit on maximum allowed clock skew. Spanner uses atomic
>> clocks, so this skew is small and they just wait it at the end of
>> each transaction before acknowledging the client. In CockroachDB, if
>> tuple is not visible but we are unsure whether it is truly invisible
>> or it's just the skew (the difference between snapshot and tuple's
>> csn is less than the skew), transaction is restarted with advanced
>> snapshot. This process is not infinite because the upper border
>> (initial snapshot + max skew) stays the same; this is correct as we
>> just want to ensure that our xact sees all the committed ones before
>> it started. We can implement the same thing.
>
> I think that this situation is also covered in Clock-SI since
> transactions will not exit InDoubt state before we can see them. But
> I'm not sure, chances are that I get something wrong, I'll think more
> about it. I'd be happy to hear comments from Stas about this.

The InDoubt state protects us from seeing an xact that is not yet
committed everywhere, but it doesn't protect us from starting an xact
on a node with lagging clocks and obtaining a plainly old snapshot. We
won't see any half-committed data with it (InDoubt covers us here), but
some recently committed xacts might not get into our old snapshot at
all.

>> * 003_bank_shared.pl test is removed. In current shape (loading one
>> node) it is useless, and if we bombard both nodes, deadlock surely
>> appears. In general, global snapshots are not needed for such
>> multimaster-like setup -- either there are no conflicts and we are
>> fine, or there is a conflict, in which case we get a deadlock.
>
> Can we do something with this deadlock? Will placing an upper limit on
> time of InDoubt state fix the issue? I understand that aborting
> automatically is kind of dangerous...

Sure, this is just a generalization of the basic deadlock problem to a
distributed system. To deal with it, someone must periodically collect
locks across the nodes, build a graph, check for loops and punish
(abort) one of the chain creators if a loop exists. InDoubt (and global
snapshots in general) is unrelated to this; we hung on a usual
row-level lock in this test. BTW, our pg_shardman extension has a
primitive deadlock detector [2]; I suppose Citus [3] also has one.

> Also, currently hanging 2pc transaction can cause a lot of headache
> for DBA. Can we have some kind of protection for the case when one
> node is gone permanently during transaction?

Oh, the subject of automatic resolution of 2PC xacts is a matter of
another (probably many-miles) thread, largely unrelated to global
snapshots/visibility. In general, the problem of distributed
transaction commit which doesn't block while a majority of nodes is
alive requires implementing a distributed consensus algorithm like
Paxos or Raft. You might also find thread [4] interesting.

Thank you for your interest in the topic!

[1] https://yadi.sk/i/qgmFeICvuRwYNA
[2] https://github.com/postgrespro/pg_shardman/blob/broadcast/pg_shardman--0.0.3.sql#L2497
[3] https://www.citusdata.com/
[4] https://www.postgresql.org/message-id/flat/CAFjFpRc5Eo%3DGqgQBa1F%2B_VQ-q_76B-d5-Pt0DWANT2QS24WE7w%40mail.gmail.com

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company