experimental: replace s_lock spinlock code with pthread_mutex on linux

From Nils Goroll
Subject experimental: replace s_lock spinlock code with pthread_mutex on linux
Date
Msg-id 4FEA3EA7.1040104@schokola.de
In response to Re: why roll-your-own s_lock? / improving scalability  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: experimental: replace s_lock spinlock code with pthread_mutex on linux
Re: experimental: replace s_lock spinlock code with pthread_mutex on linux
List pgsql-hackers
> It's
> still unproven whether it'd be an improvement, but you could expect to
> prove it one way or the other with a well-defined amount of testing.

I've hacked the code to use adaptive pthread mutexes instead of spinlocks; see
the attached patch. The patch is against git head, but it applies easily to
9.1.3, which is what I used for my tests.
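
In case it helps to see the shape of the change without reading the diff, the
core idea is roughly the sketch below. This is illustrative only - the attached
patch is authoritative, and names like s_init_lock are mine; note that
PTHREAD_MUTEX_ADAPTIVE_NP is a glibc extension, hence the _GNU_SOURCE:

#define _GNU_SOURCE
#include <pthread.h>

/* sketch: the spinlock type becomes a pthread mutex living in shared
   memory, initialized once at startup */
typedef pthread_mutex_t slock_t;

static void
s_init_lock(slock_t *lock)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    /* backends are separate processes, so the mutex must work in
       shared memory */
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    /* adaptive mutexes spin briefly in userspace before sleeping in
       the kernel via futex */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    pthread_mutex_init(lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

S_LOCK and S_UNLOCK then map to pthread_mutex_lock() and pthread_mutex_unlock().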

This had disastrous effects on Solaris, because Solaris does not use anything
similar to futexes for PTHREAD_PROCESS_SHARED mutexes (only the _PRIVATE
mutexes get by without syscalls in the uncontended case).

But I was surprised to see that it works relatively well on Linux. Here's a
glimpse of my results:

hacked code 9.1.3:

-bash-4.1$ rsync -av --delete /tmp/test_template_data/ ../data/ ; /usr/bin/time
./postgres -D ../data -p 55502 & ppid=$! ; pid=$(pgrep -P $ppid ) ; sleep 15 ;
./pgbench -c 768 -t 20 -j 128 -p 55502 postgres ; kill $pid
sending incremental file list
...
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 768
number of threads: 128
number of transactions per client: 20
number of transactions actually processed: 15360/15360
tps = 476.873261 (including connections establishing)
tps = 485.964355 (excluding connections establishing)
LOG:  received smart shutdown request
LOG:  autovacuum launcher shutting down
-bash-4.1$ LOG:  shutting down
LOG:  database system is shut down
210.58user 78.88system 0:50.64elapsed 571%CPU (0avgtext+0avgdata
1995968maxresident)k
0inputs+1153872outputs (0major+2464649minor)pagefaults 0swaps

original code (vanilla build on amd64) 9.1.3:

-bash-4.1$ rsync -av --delete /tmp/test_template_data/ ../data/ ; /usr/bin/time
./postgres -D ../data -p 55502 & ppid=$! ; pid=$(pgrep -P $ppid ) ; sleep 15 ;
./pgbench -c 768 -t 20 -j 128 -p 55502 postgres ; kill $pid
sending incremental file list
...
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 768
number of threads: 128
number of transactions per client: 20
number of transactions actually processed: 15360/15360
tps = 499.993685 (including connections establishing)
tps = 510.410883 (excluding connections establishing)
LOG:  received smart shutdown request
-bash-4.1$ LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  database system is shut down
196.21user 71.38system 0:47.99elapsed 557%CPU (0avgtext+0avgdata
1360800maxresident)k
0inputs+1147904outputs (0major+2375965minor)pagefaults 0swaps
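
In short: 485.96 vs. 510.41 tps excluding connection establishment, so the
hacked build is roughly 5% slower in this single run, but in the same ballpark.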


config:

-bash-4.1$ egrep '^[a-z]' /tmp/test_template_data/postgresql.conf
max_connections = 1800            # (change requires restart)
shared_buffers = 10GB            # min 128kB
temp_buffers = 64MB            # min 800kB
work_mem = 256MB            # min 64kB, default 1MB
maintenance_work_mem = 2GB        # min 1MB, default 16MB
bgwriter_delay = 10ms            # 10-10000ms between rounds
bgwriter_lru_maxpages = 1000        # 0-1000 max buffers written/round
bgwriter_lru_multiplier = 10.0        # 0-10.0 multiplier on buffers scanned/round
wal_level = hot_standby            # minimal, archive, or hot_standby
wal_buffers = 64MB            # min 32kB, -1 sets based on shared_buffers
commit_delay = 10000            # range 0-100000, in microseconds
datestyle = 'iso, mdy'
lc_messages = 'en_US.UTF-8'            # locale for system error message
lc_monetary = 'en_US.UTF-8'            # locale for monetary formatting
lc_numeric = 'en_US.UTF-8'            # locale for number formatting
lc_time = 'en_US.UTF-8'                # locale for time formatting
default_text_search_config = 'pg_catalog.english'
seq_page_cost = 1.0            # measured on an arbitrary scale
random_page_cost = 1.5            # same scale as above (default: 4.0)
cpu_tuple_cost = 0.005
cpu_index_tuple_cost = 0.0025
cpu_operator_cost = 0.0001
effective_cache_size = 192GB



So it looks like using pthread mutexes could at least be an option on Linux.

Using futexes directly could be even cheaper.
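
To sketch what "directly" could look like: the classic three-state lock from
Ulrich Drepper's "Futexes Are Tricky" takes no syscall at all on the
uncontended paths. This is not part of the attached patch, just an
illustration (0 = free, 1 = locked, 2 = locked with possible waiters):

#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

static long
sys_futex(atomic_int *uaddr, int op, int val)
{
    /* futex(2) has no glibc wrapper; unused arguments are zeroed */
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void
futex_lock(atomic_int *l)
{
    int c = 0;

    /* fast path: 0 -> 1, no syscall when uncontended */
    if (atomic_compare_exchange_strong(l, &c, 1))
        return;
    /* contended: advertise a waiter (state 2), then sleep */
    if (c != 2)
        c = atomic_exchange(l, 2);
    while (c != 0) {
        sys_futex(l, FUTEX_WAIT, 2);
        c = atomic_exchange(l, 2);
    }
}

static void
futex_unlock(atomic_int *l)
{
    /* wake exactly one waiter, and only if one may be sleeping */
    if (atomic_exchange(l, 0) == 2)
        sys_futex(l, FUTEX_WAKE, 1);
}

The unlock path is the part relevant to the production issue below: it wakes
exactly one waiter instead of the whole herd, and skips the syscall entirely
when nobody is sleeping.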


As a side note, it looks like I have not expressed myself clearly:

I did not intend to suggest replacing proven, working code (which is probably
the best you can get on some platforms) with POSIX calls. I apologize for the
provocative question.


Regarding the actual production issue: using pgbench, I did not manage to
synthetically provoke the saturation we are seeing in production - I could not
even get anywhere near the production load. So I cannot currently test whether
reducing the amount of spinning and waking exactly one waiter (which is what
Linux/NPTL pthread_mutex_unlock does) would solve or mitigate the production
issue I am working on, and I'd highly appreciate any pointers in this direction.

Cheers, Nils

