Re: BUG #18385: Assert("strategy_delta >= 0") in BgBufferSync() fails due to race condition - Mailing list pgsql-bugs

From Tender Wang
Subject Re: BUG #18385: Assert("strategy_delta >= 0") in BgBufferSync() fails due to race condition
Date
Msg-id CAHewXNk0sTA-OKc5ezxta16=56ODh2808Vqtu8q-9Zjpdy6DGA@mail.gmail.com
Whole thread Raw
In response to Re: BUG #18385: Assert("strategy_delta >= 0") in BgBufferSync() fails due to race condition  (Alexander Lakhin <exclusion@gmail.com>)
Responses Re: BUG #18385: Assert("strategy_delta >= 0") in BgBufferSync() fails due to race condition  (Tender Wang <tndrwang@gmail.com>)
List pgsql-bugs


Alexander Lakhin <exclusion@gmail.com> 于2024年3月13日周三 16:00写道:
Hi Tender Wang,

13.03.2024 09:50, Tender Wang wrote:
> Hi Alexander,
>    I haven't been able to reproduce this issue on my machine(2 vCPU, 2GB memory).
>
>  Can you reproduce this issue reliably on your machine?

Thanks for your attention to this issue!

On my 12-core workstation, where pgbench shows approximately 2500 tps,
`pgbench -t 10000 -c 40` failed on iterations 1, 1, 3:
...
number of transactions actually processed: 398639/400000
number of failed transactions: 0 (0.000%)
latency average = 15.931 ms
initial connection time = 56.806 ms
tps = 2510.845701 (without initial connection time)
pgbench: error: Run was aborted; the above results are incomplete.
...

The server built with gcc,
CPPFLAGS="-O0" ./configure --enable-debug --enable-cassert ...

I have also intensified bgwriter as follows:
          rc = WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-                       BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
+                       1 /* ms */ , WAIT_EVENT_BGWRITER_MAIN);

That is, I have the attached modification applied and the following in my
extra.config:
fsync = off
autovacuum = off
shared_buffers = '1MB'

Using your provided patch and configuration, I can reproduce this issue easily on my mashine(2 vCPU, 2GB).
I went throug the  StrategySyncStart() and ClockSweepTick() code. After d72731a70 commit, operation on StrategyControl->nextVictimBuffer
doesn't need to get the StrategyControl->buffer_strategy_lock, but StrategyControl->completePasses still needs lock.

Before d72731a70, if bgwriter gets the StrategyControl->buffer_strategy_lock spinlock in StrategySyncStart(), backends can't add StrategyControl->nextVictimBuffer,
then bgwriter can get a consistent  value consisting of nextVictimBuffer and completePasses. 

But now, even though bgwriter gets the spinlock, backends also could add the StrategyControl->nextVictimBuffer. In corner cases, bgwriter will see a litter StrategyControl->nextVictimBuffer
value due to StrategyControl->nextVictimBuffer wraparound. But StrategyControl->completePasses didn't  update. So the Assert(strategy_delta >= 0) will trigger. 

I have two solutions in my head:
1. remove the Assert, but I'm not srue bgwriter write strategy should change if strategy_delta < 0.
2. add more check in BgBufferSync() just like below:
   /*
    * Since nextVictimBuffer in StrategyControl has been atomic.
    * So its operation would not need to get buffer_strategy_lock.
   * In extreme circumstances, StrategySyncStart would see not consistent
   * value consisting of nextVictimBuffer and completePasses.
   * So we add one to passes_delta to make strategy_delta >= 0.
   */
  if (passes_delta == 0 && strategy_delta < 0)
       passes_delta++;

I test two times using the second solution, not trigger Assert again. Need more times to test.

--
Tender Wang
OpenPie:  https://en.openpie.com/

pgsql-bugs by date:

Previous
From: PG Bug reporting form
Date:
Subject: BUG #18395: Checksum verification failed for: deb_postgis_3_4_pg16.app.zip
Next
From: Tender Wang
Date:
Subject: Re: BUG #18385: Assert("strategy_delta >= 0") in BgBufferSync() fails due to race condition