Re: sequences vs. synchronous replication - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: sequences vs. synchronous replication
Date:
Msg-id: 9fb080d5-f509-cca4-1353-fd9da85db1d2@enterprisedb.com
In response to: sequences vs. synchronous replication (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: sequences vs. synchronous replication
List: pgsql-hackers

On 12/22/21 18:50, Fujii Masao wrote:
> 
> 
> On 2021/12/22 21:11, Tomas Vondra wrote:
>> Interesting idea, but I think it has a couple of issues :-(
> 
> Thanks for the review!
> 
>> 1) We'd need to know the LSN of the last WAL record for any given
>> sequence, and we'd need to communicate that between backends somehow.
>> Which seems rather tricky to do without affecting performance.
> 
> How about using the page lsn for the sequence? nextval_internal()
> already uses that to check whether it's less than or equal to checkpoint
> redo location.
> 

I explored the idea of using page LSN a bit, and there's some good and
bad news.
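
For reference, this is (roughly) the existing check in nextval_internal()
that Fujii-san refers to - if the sequence page was last updated before
the latest checkpoint's redo pointer, we force a new WAL record:

  /* in nextval_internal(): force a new WAL record if the sequence page
   * was last updated before the latest checkpoint (simplified) */
  XLogRecPtr  redoptr = GetRedoRecPtr();

  if (PageGetLSN(page) <= redoptr)
  {
      /* last update of seq was before checkpoint */
      fetch = log = fetch + SEQ_LOG_VALS;
      logit = true;
  }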

The patch from 22/12 simply checks whether the change would have to wait
for a sync replica, and if so it WAL-logs the sequence increment. There
are a couple of problems with this, unfortunately:

1) Imagine a high-concurrency environment, with a lot of sessions doing
nextval('s') at the same time. One session WAL-logs the increment, but
before that WAL gets flushed / sent to the replica, another session
calls nextval. SyncRepNeedsWait() says true, so it WAL-logs it again,
moving the page LSN forward. And so on. So in a high-concurrency
environment this simply makes matters worse - it causes an avalanche of
WAL writes instead of saving anything.

(You don't even need multiple sessions - a single session calling
nextval would have the same issue, WAL-logging every call.)
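
In pseudo-code, the problematic part looks roughly like this
(SyncRepNeedsWait() is the helper from the 22/12 patch, not something
in core, and I'm simplifying its signature here):

  /* simplified sketch of the 22/12 (log-all) approach in nextval_internal() */
  if (SyncRepNeedsWait(PageGetLSN(page)))   /* not flushed / sent to replica yet? */
      logit = true;                         /* ... so WAL-log this increment */

  /*
   * But WAL-logging moves the page LSN past the current flush/send
   * position, so the next nextval() sees an unflushed LSN again and
   * logs again - hence the avalanche.
   */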


2) It assumes the problem only exists with a synchronous replica, but
that's wrong. That's partially my fault, because I formulated the issue
as if it was just about sync replicas, but that's only one symptom - it
applies even to systems without any replicas.

Imagine you do

  BEGIN;
  SELECT nextval('s') FROM generate_series(1,40);
  ROLLBACK;

  SELECT nextval('s');

and then you murder the server by "kill -9". If you restart it and do a
nextval('s') again, the value will likely go backwards, generating
duplicate values, because the WAL record covering that value came from
the rolled-back transaction and was never flushed :-(


So I think this approach is not really an improvement over WAL-logging
every increment. But there's a better way, I think - we don't need to
generate WAL, we just need to ensure we wait for it to be flushed at
transaction end in RecordTransactionCommit().

That is, instead of generating more WAL, simply update XactLastRecEnd
and then ensure RecordTransactionCommit flushes/waits etc. Attached is a
patch doing that - the changes in sequence.c are trivial, changes in
RecordTransactionCommit simply ensure we flush/wait even without XID
(this actually raises some additional questions that I'll discuss in a
separate message in this thread).
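
To make that concrete, the sequence.c side is essentially this (a
simplified sketch, not the patch verbatim - logit and page are the
existing locals in nextval_internal()):

  /* if we're not WAL-logging this call ourselves, remember the LSN of
   * the WAL record that covered this value (the sequence page LSN), so
   * that commit flushes it and waits for sync replicas */
  if (!logit)
  {
      XLogRecPtr  page_lsn = PageGetLSN(page);

      if (page_lsn > XactLastRecEnd)
          XactLastRecEnd = page_lsn;
  }

And on the commit side the idea is simply that having an LSN to wait for
is enough, even without an XID - very roughly (the real conditions in
RecordTransactionCommit() are more involved):

  /* flush/wait based on XactLastRecEnd even when we have no XID */
  if (XactLastRecEnd != 0 &&
      synchronous_commit > SYNCHRONOUS_COMMIT_OFF)
  {
      XLogFlush(XactLastRecEnd);
      SyncRepWaitForLSN(XactLastRecEnd, false);
  }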

I repeated the benchmark measurements with the nextval/insert workloads,
to compare this with the other patch (WAL-logging every increment). I
had to use a different machine, so the results are not directly
comparable to the numbers presented earlier.

On btrfs, it looks like this (the numbers next to nextval are the cache
size, with 1 being the default). The log-all columns are the first patch,
page-lsn is the new patch using the page LSN. The first columns are raw
pgbench tps values, the last two columns are the comparison to master:

  client  test         master   log-all  page-lsn   log-all  page-lsn
  -------------------------------------------------------------------
       1  insert          829       807       802       97%       97%
          nextval/1     16491       814     16465        5%      100%
          nextval/32    24487     16462     24632       67%      101%
          nextval/64    24516     24918     24671      102%      101%
          nextval/128   32337     33178     32863      103%      102%

  client  test         master   log-all  page-lsn   log-all  page-lsn
  -------------------------------------------------------------------
       4  insert         1577      1590      1546      101%       98%
          nextval/1     45607      1579     21220        3%       47%
          nextval/32    68453     49141     51170       72%       75%
          nextval/64    66928     65534     66408       98%       99%
          nextval/128   83502     81835     82576       98%       99%

The results seem clearly better, I think.

For "insert" there's no drop at all (same as before), because as soon as
a transaction generates any WAL, it has to flush/wait anyway.

And for "nextval" there's a drop, but only with 4 clients, and it's much
smaller (53% instead of 97%). And increasing the cache size eliminates
even that.

Out of curiosity I ran the tests on tmpfs too, which should show the
overhead not related to I/O. The results are similar:

  client  test         master   log-all  page-lsn   log-all  page-lsn
  -------------------------------------------------------------------
        1 insert        44033     43740     43215       99%       98%
          nextval/1     58640     48384     59243       83%      101%
          nextval/32    61089     60901     60830      100%      100%
          nextval/64    60412     61315     61550      101%      102%
          nextval/128   61436     61605     61503      100%      100%

  client  test         master   log-all  page-lsn   log-all  page-lsn
  -------------------------------------------------------------------
       4  insert        88212     85731     87350       97%       99%
          nextval/1    115059     90644    113541       79%       99%
          nextval/32   119765    118115    118511       99%       99%
          nextval/64   119717    119220    118410      100%       99%
          nextval/128  120258    119448    118826       99%       99%

Seems pretty nice, I guess. The original patch did pretty well too (only
about 20% drop).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
