Re: sequences vs. synchronous replication - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: sequences vs. synchronous replication
Msg-id: 9fb080d5-f509-cca4-1353-fd9da85db1d2@enterprisedb.com
In response to: sequences vs. synchronous replication (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: sequences vs. synchronous replication
List: pgsql-hackers
On 12/22/21 18:50, Fujii Masao wrote:
>
>
> On 2021/12/22 21:11, Tomas Vondra wrote:
>> Interesting idea, but I think it has a couple of issues :-(
>
> Thanks for the review!
>
>> 1) We'd need to know the LSN of the last WAL record for any given
>> sequence, and we'd need to communicate that between backends somehow.
>> Which seems rather tricky to do without affecting performance.
>
> How about using the page lsn for the sequence? nextval_internal()
> already uses that to check whether it's less than or equal to checkpoint
> redo location.
>

I explored the idea of using the page LSN a bit, and there's some good and some bad news.

The patch from 22/12 simply checks whether the change would wait for a sync replica, and if so it WAL-logs the sequence increment. There's a couple of problems with this, unfortunately:

1) Imagine a high-concurrency environment, with a lot of sessions doing nextval('s') at the same time. One session WAL-logs the increment, but before that WAL gets flushed / sent to the replica, another session calls nextval. SyncRepNeedsWait() returns true, so it WAL-logs again, moving the page LSN forward. And so on. So in a high-concurrency environment this actually makes matters worse - it causes an avalanche of WAL writes instead of saving anything. (You don't even need multiple sessions - a single session calling nextval would have the same issue, WAL-logging every call.)

2) It assumes there is a synchronous replica, but that's wrong. It's partially my fault, because I formulated the issue as if it was just about sync replicas, but that's only one symptom - it applies even to systems without any replicas. Imagine you do

  BEGIN;
  SELECT nextval('s') FROM generate_series(1,40);
  ROLLBACK;

  SELECT nextval('s');

and then you murder the server by "kill -9". If you restart it and do a nextval('s') again, the value will likely go backwards, generating duplicate values :-(

So I think this approach is not really an improvement over WAL-logging every increment.
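To make the second problem concrete, here's a toy model of how nextval() WAL-logs ahead - hypothetical Python, not PostgreSQL code; SEQ_LOG_VALS mirrors the constant in sequence.c, but the WAL/recovery mechanics are heavily simplified:

```python
# Toy model of the crash scenario above (hypothetical, not PostgreSQL code).
# nextval() WAL-logs one record per SEQ_LOG_VALS values, recording a restart
# value SEQ_LOG_VALS ahead -- but the record only reaches disk when some
# later WAL flush (e.g. a commit) happens.

SEQ_LOG_VALS = 32   # mirrors the constant in sequence.c

class ToySequence:
    def __init__(self):
        self.value = 0        # last value handed out (in memory)
        self.log_cnt = 0      # values still covered by the last WAL record
        self.wal_logged = 0   # restart value in the latest WAL record (buffer)
        self.wal_flushed = 0  # restart value whose WAL record is on disk

    def nextval(self):
        if self.log_cnt == 0:
            # WAL-log SEQ_LOG_VALS ahead -- but only into the WAL buffer
            self.wal_logged = self.value + SEQ_LOG_VALS
            self.log_cnt = SEQ_LOG_VALS
        self.log_cnt -= 1
        self.value += 1
        return self.value

    def flush_wal(self):
        # a committed transaction (with an XID) would get us here
        self.wal_flushed = self.wal_logged

    def crash_and_recover(self):
        # kill -9: recovery can only replay WAL that reached disk
        self.value = self.wal_flushed
        self.log_cnt = 0

seq = ToySequence()

# BEGIN; SELECT nextval('s') FROM generate_series(1,40); ROLLBACK;
for _ in range(40):
    seq.nextval()
# the ROLLBACK flushes nothing

before = seq.nextval()    # SELECT nextval('s'); value handed to the client
seq.crash_and_recover()   # kill -9 before any WAL flush
after = seq.nextval()

print(f"before crash: {before}, after restart: {after}")
# -> before crash: 41, after restart: 1
```

The unflushed WAL record is simply lost in the crash, so the sequence falls back behind values already handed out to clients.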
But there's a better way, I think - we don't need to generate more WAL, we just need to ensure we wait for it to be flushed at transaction end in RecordTransactionCommit(). That is, instead of generating more WAL, simply update XactLastRecEnd and then make sure RecordTransactionCommit() flushes/waits etc.

Attached is a patch doing exactly that - the changes in sequence.c are trivial, and the changes in RecordTransactionCommit() simply ensure we flush/wait even without an XID (this actually raises some additional questions that I'll discuss in a separate message in this thread).

I repeated the benchmark measurements with the nextval/insert workloads, to compare this with the other patch (WAL-logging every increment). I had to use a different machine, so the results are not directly comparable to the numbers presented earlier.

The log-all column is the first patch, page-lsn is the new patch using the page LSN. The first three columns are raw pgbench tps values, the last two columns are the comparison to master. On btrfs, it looks like this (the numbers next to nextval are the cache size, with 1 being the default):

  client  test         master  log-all  page-lsn  log-all  page-lsn
  -------------------------------------------------------------------
  1       insert          829      807       802      97%       97%
          nextval/1     16491      814     16465       5%      100%
          nextval/32    24487    16462     24632      67%      101%
          nextval/64    24516    24918     24671     102%      101%
          nextval/128   32337    33178     32863     103%      102%

  client  test         master  log-all  page-lsn  log-all  page-lsn
  -------------------------------------------------------------------
  4       insert         1577     1590      1546     101%       98%
          nextval/1     45607     1579     21220       3%       47%
          nextval/32    68453    49141     51170      72%       75%
          nextval/64    66928    65534     66408      98%       99%
          nextval/128   83502    81835     82576      98%       99%

The results seem clearly better, I think. For "insert" there's no drop at all (same as before), because as soon as a transaction generates any WAL, it has to flush/wait anyway. And for "nextval" there's a drop, but only with 4 clients, and it's much smaller (53% instead of 97%).
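For reference, the mechanism being measured can be sketched as a toy model - hypothetical Python, not the actual C changes: nextval() remembers that it wrote WAL (the XactLastRecEnd analogue), and commit flushes that WAL even when the transaction never acquired an XID.

```python
# Toy sketch of the page-lsn patch (hypothetical, not the actual C code).
# Contrast with the unpatched behaviour, where a commit without an XID
# flushes nothing and the sequence WAL record can be lost.

SEQ_LOG_VALS = 32

class ToyBackend:
    def __init__(self):
        self.value = 0
        self.log_cnt = 0
        self.wal_logged = 0       # restart value in the WAL buffer
        self.wal_flushed = 0      # restart value durable on disk
        self.last_rec_end = None  # XactLastRecEnd analogue

    def nextval(self):
        if self.log_cnt == 0:
            self.wal_logged = self.value + SEQ_LOG_VALS
            self.log_cnt = SEQ_LOG_VALS
            self.last_rec_end = self.wal_logged  # we wrote WAL this xact
        self.log_cnt -= 1
        self.value += 1
        return self.value

    def commit(self, flush_without_xid):
        # patched behaviour: flush if the xact wrote any WAL, XID or not
        if flush_without_xid and self.last_rec_end is not None:
            self.wal_flushed = self.wal_logged
        self.last_rec_end = None

    def crash_and_recover(self):
        self.value = self.wal_flushed
        self.log_cnt = 0

def run(patched):
    b = ToyBackend()
    first = b.nextval()               # SELECT nextval('s');  (no XID)
    b.commit(flush_without_xid=patched)
    b.crash_and_recover()             # kill -9
    return first, b.nextval()

print("unpatched:", run(False))   # the first value gets handed out again
print("patched:  ", run(True))    # restart lands past the logged-ahead value
```

The point is that the flush happens once per transaction at commit, not on every nextval() call, which is why the page-lsn numbers stay close to master.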
And increasing the cache size eliminates even that.

Out of curiosity I ran the tests on tmpfs too, which should show any overhead not related to I/O. The results are similar:

  client  test         master  log-all  page-lsn  log-all  page-lsn
  -------------------------------------------------------------------
  1       insert        44033    43740     43215      99%       98%
          nextval/1     58640    48384     59243      83%      101%
          nextval/32    61089    60901     60830     100%      100%
          nextval/64    60412    61315     61550     101%      102%
          nextval/128   61436    61605     61503     100%      100%

  client  test         master  log-all  page-lsn  log-all  page-lsn
  -------------------------------------------------------------------
  4       insert        88212    85731     87350      97%       99%
          nextval/1    115059    90644    113541      79%       99%
          nextval/32   119765   118115    118511      99%       99%
          nextval/64   119717   119220    118410     100%       99%
          nextval/128  120258   119448    118826      99%       99%

Seems pretty nice, I guess. The original patch did pretty well here too (only about a 20% drop).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company