heavily contended lwlocks with long wait queues scale badly - Mailing list pgsql-hackers

From Andres Freund
Subject heavily contended lwlocks with long wait queues scale badly
Date
Msg-id 20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Whole thread Raw
Responses Re: heavily contended lwlocks with long wait queues scale badly  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Re: heavily contended lwlocks with long wait queues scale badly  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Re: heavily contended lwlocks with long wait queues scale badly  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

I am working on posting a patch series making relation extension more
scalable. As part of that I was running some benchmarks for workloads that I
thought should not or just positively impacted - but I was wrong, there was
some very significant degradation at very high client counts. After pulling my
hair out for quite a while to try to understand that behaviour, I figured out
that it's just a side-effect of *removing* some other contention. This
morning, turns out sleeping helps, I managed to reproduce it in an unmodified
postgres.

$ cat ~/tmp/txid.sql
SELECT txid_current();
$ for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n "$c ";pgbench -n -M prepared -f ~/tmp/txid.sql
-c$c-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done
 
1        60174
2       116169
4       208119
8       373685
16      515247
32      554726
64      497508
128     415097
256     334923
512     243679
768     192959
1024    157734
2048     82904
4096     32007

(I didn't properly round TPS, but that doesn't matter here)


Performance completely falls off a cliff starting at ~256 clients. There's
actually plenty CPU available here, so this isn't a case of running out of
CPU time.

Rather, the problem is very bad contention on the "spinlock" for the lwlock
wait list. I realized that something in that direction was off when trying to
investigate why I was seeing spin delays of substantial duration (>100ms).

The problem isn't a fundamental issue with lwlocks, it's that
LWLockDequeueSelf() does this:

        LWLockWaitListLock(lock);

        /*
         * Can't just remove ourselves from the list, but we need to iterate over
         * all entries as somebody else could have dequeued us.
         */
        proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
        {
                if (iter.cur == MyProc->pgprocno)
                {
                        found = true;
                        proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
                        break;
                }
        }

I.e. it iterates over the whole waitlist to "find itself". The longer the
waitlist gets, the longer this takes. And the longer it takes for
LWLockWakeup() to actually wake up all waiters, the more likely it becomes
that LWLockDequeueSelf() needs to be called.


We can't make the trivial optimization and use proclist_contains(), because
PGPROC->lwWaitLink is also used for the list of processes to wake up in
LWLockWakeup().

But I think we can solve that fairly reasonably nonetheless. We can change
PGPROC->lwWaiting to not just be a boolean, but have three states:
0: not waiting
1: waiting in waitlist
2: waiting to be woken up

which we then can use in LWLockDequeueSelf() to only remove ourselves from the
list if we're on it. As removal from that list is protected by the wait list
lock, there's no race to worry about.

client  patched   HEAD
1        60109    60174
2       112694   116169
4       214287   208119
8       377459   373685
16      524132   515247
32      565772   554726
64      587716   497508
128     581297   415097
256     550296   334923
512     486207   243679
768     449673   192959
1024    410836   157734
2048    326224    82904
4096    250252    32007

Not perfect with the patch, but not awful either.


I suspect this issue might actually explain quite a few odd performance
behaviours we've seen at the larger end in the past. I think it has gotten a
bit worse with the conversion of lwlock.c to proclists (I see lots of
expensive multiplications to deal with sizeof(PGPROC)), but otherwise likely
exists at least as far back as ab5194e6f61, in 9.5.

I guess there's an argument for considering this a bug that we should
backpatch a fix for? But given the vintage, probably not?  The only thing that
gives me pause is that this is quite hard to pinpoint as happening.


I've attached my quick-and-dirty patch. Obviously it'd need a few defines etc,
but I wanted to get this out to discuss before spending further time.

Greetings,

Andres Freund

Attachment

pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: Code checks for App Devs, using new options for transaction behavior
Next
From: Alvaro Herrera
Date:
Subject: Re: Reducing planning time on tables with many indexes