Thread: ExclusiveLock on extension of relation with huge shared_buffers

ExclusiveLock on extension of relation with huge shared_buffers

From
Borodin Vladimir
Date:
Hi all.

Some time ago I wrote [0] to pgsql-performance@ about problems with ExclusiveLock on relation extension during checkpoints on a system with large shared_buffers. At that time the problem was worked around by making shared_buffers much smaller.

Right now I am testing 9.4beta3 with large shared_buffers (100GB of 128GB RAM) and huge pages under a write-only (insert-only) load, and I am hitting the same issue. The reason I do this is that such a configuration shows much better performance (roughly 2.5x) on our read-only load profile than 4GB shared_buffers with the rest of RAM left for the page cache.

The symptoms are well described in the first letter, but I will repeat them briefly. Most of the CPU time is spent in system (kernel) mode, and pg_locks shows that many backends are waiting for an ExclusiveLock on the extension of some relation. In the log I can see lots of lines like these:

< 2014-10-23 18:03:39.287 MSK >LOG:  process 5426 acquired ExclusiveLock on extension of relation 111669 of database 110866 after 3421.279 ms
< 2014-10-23 18:03:39.501 MSK >LOG:  process 2904 still waiting for ExclusiveLock on extension of relation 111714 of database 110866 after 1000.053 ms
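
For reference, the OIDs in such messages can be resolved to names, and the backends currently waiting on relation extension locks can be listed, with queries along these lines (just a sketch; the OIDs below are simply the ones from the log lines above):

-- Resolve the OIDs from the log lines above
-- (the pg_class lookup has to be run while connected to database 110866).
SELECT relname FROM pg_class WHERE oid = 111669;
SELECT datname FROM pg_database WHERE oid = 110866;

-- Backends currently waiting for (or holding) relation extension locks.
SELECT pid, relation::regclass AS relation, mode, granted
FROM pg_locks
WHERE locktype = 'extend'
ORDER BY granted, pid;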

I started digging to find the root of the problem. The CPU spike does not happen at the start of the checkpoint or at its end; it happens one or several times while the checkpoint is in progress. Since this is not a production environment, I have written a SystemTap script that logs all checkpoint events and tracks the time spent doing work under the ExclusiveLock. The script is attached to this letter (checkpoint_markers.stp). Also attached are the output of this script (stap.out) and the output of dstat (dstat.log), which show the system resource consumption.

From stap.out you can see some strange spikes in the time spent between LockRelationForExtension and UnlockRelationForExtension for the same pid and the same relation. And from dstat.log you can see that at the same moments there are spikes of CPU time spent in system mode; "perf top" says that most of that time is spent in _spin_lock_irq.

I have looked at src/backend/access/heap/hio.c:RelationGetBufferForTuple and followed the call chain down to the smgr* functions to see what happens under this lock, but I haven't found any potential problems except one: relation extension is done by a backend while holding the exclusive lock, and the relation is extended by only one page at a time. Are there any plans for yet another background process that would extend relations after some threshold is reached?
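
As an illustration (a sketch with a placeholder table name, not taken from my test run), this one-page-at-a-time growth can be watched from SQL:

-- 'some_table' is a placeholder name; under an insert-only load this value
-- grows by exactly one page per relation extension.
SELECT pg_relation_size('some_table') / current_setting('block_size')::int AS pages;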

I’m pretty sure that I have missed something (for example, I still can’t explain why this does not happen on a system with small shared_buffers). What kind of debugging can I turn on, or where else should I look? Any ideas would be really appreciated. Below are some details about the system.




I am running PostgreSQL 9.4beta3 installed from RPM packages (compiled with --enable-debug and --enable-dtrace) on RHEL 6.5. The host has the following CPU (32 cores) and memory:

root@xdb01d ~ # fgrep -m1 'model name' /proc/cpuinfo
model name  : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
root@xdb01d ~ # free -m
             total       used       free     shared    buffers     cached
Mem:        129027     128293        734          0         21      17345
-/+ buffers/cache:     110927      18100
Swap:        16378          0      16378
root@xdb01d ~ #

PGDATA lives on a RAID6 array of 8 SSDs with ext4 (pg_xlog is on a separate RAID10 array mounted with nobarrier). Right now PGDATA takes a bit more than 100GB.

When a checkpoint starts, the I/O load on the system becomes much higher than before. That happens because the bgwriter doesn't really do the job (although the bgwriter settings are quite aggressive):

xdb01d/postgres # select * from pg_stat_bgwriter;
-[ RECORD 1 ]---------+------------------------------
checkpoints_timed     | 0
checkpoints_req       | 19
checkpoint_write_time | 22769904
checkpoint_sync_time  | 199006
buffers_checkpoint    | 85116980
buffers_clean         | 92815
maxwritten_clean      | 80
buffers_backend       | 3743898
buffers_backend_fsync | 0
buffers_alloc         | 13366074
stats_reset           | 2014-10-23 10:59:17.123677+04

Time: 10.451 ms
xdb01d/postgres #
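
For context, the share of buffer writes done by the checkpointer, the bgwriter and the backends themselves can be derived from these counters with a query along these lines:

-- Percentage of buffer writes done by each writer.
SELECT round(100.0 * buffers_checkpoint / nullif(buffers_checkpoint + buffers_clean + buffers_backend, 0), 1) AS pct_checkpointer,
       round(100.0 * buffers_clean      / nullif(buffers_checkpoint + buffers_clean + buffers_backend, 0), 1) AS pct_bgwriter,
       round(100.0 * buffers_backend    / nullif(buffers_checkpoint + buffers_clean + buffers_backend, 0), 1) AS pct_backends
FROM pg_stat_bgwriter;

With the numbers above that is roughly 95.7% checkpointer, 0.1% bgwriter and 4.2% backends.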

< 2014-10-23 18:54:13.913 MSK >LOG:  checkpoint starting: xlog
< 2014-10-23 19:23:36.432 MSK >LOG:  checkpoint complete: wrote 4586752 buffers (35.0%); 0 transaction log file(s) added, 0 removed, 1881 recycled; write=1761.595 s, sync=0.722 s, total=1762.518 s; sync files=172, longest=0.162 s, average=0.004 s

< 2014-10-23 19:28:47.526 MSK >LOG:  checkpoint starting: xlog
< 2014-10-23 19:59:29.860 MSK >LOG:  checkpoint complete: wrote 4735580 buffers (36.1%); 0 transaction log file(s) added, 0 removed, 2138 recycled; write=1841.799 s, sync=0.328 s, total=1842.334 s; sync files=179, longest=0.124 s, average=0.001 s
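
As a rough back-of-envelope figure (assuming the default 8KB block size), the first checkpoint above wrote about 4,586,752 * 8KB ≈ 35GB in 1761.6 seconds, i.e. roughly 20MB/s on average:

-- Average write rate of the first checkpoint above (8KB blocks assumed).
SELECT pg_size_pretty((4586752::bigint * 8192 / 1761.595)::bigint) AS avg_bytes_per_second;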

The configuration for postgres is also in the attached file.




P.S. Since this letter is more a question about why a relation is extended by only one page, I am writing to the hackers list rather than performance. Of course, it can be moved to the performance list if someone thinks that is more appropriate.

--
Vladimir




Attachment

Re: ExclusiveLock on extension of relation with huge shared_buffers

From
Borodin Vladimir
Date:
I have taken some backtraces (attached to this letter) of two processes with the following command:
pid=17981; while true; do date; gdb -batch -e back /usr/pgsql-9.4/bin/postgres $pid; echo; echo; echo; echo; sleep 0.1; done

Process 17981 was holding the lock for a long time - http://pastie.org/9671931.
And process 13886 was waiting for the lock (at a different time and for a different blocker, actually, but I don’t think that is really important) - http://pastie.org/9671939.

As far as I can see, 17981 is actually waiting for the BufFreelistLock LWLock in the StrategyGetBuffer function (freelist.c:134) while holding the exclusive lock on the relation. I will try increasing NUM_BUFFER_PARTITIONS (on a read-only load it also gave us some performance boost) and will report the result in this thread.

And sorry for the spam. This letter is mostly to let you know that you do not need to read my first long message too carefully.



--
Vladimir




Attachment

Re: ExclusiveLock on extension of relation with huge shared_buffers

From
Jim Nasby
Date:
Please don't top-post.

On 10/24/14, 3:40 AM, Borodin Vladimir wrote:
> I have taken some backtraces (they are attached to the letter) of two processes with such command:
> pid=17981; while true; do date; gdb -batch -e back /usr/pgsql-9.4/bin/postgres $pid; echo; echo; echo; echo; sleep 0.1; done
>
> Process 17981 was holding the lock for a long time - http://pastie.org/9671931.
> And process 13886 was waiting for lock (in different time and from different blocker actually but I don’t think it is really important) - http://pastie.org/9671939.
>
> As I can see, 17981 is actually waiting for LWLock on BufFreelistLock in StrategyGetBuffer function, freelist.c:134 while holding exclusive lock on relation. I will try to increase NUM_BUFFER_PARTITIONS (on read-only load it also gave us some performance boost) and write the result in this thread.
 

BufFreelistLock becomes very contended when shared buffers are under a lot of pressure.

Here's what I believe is happening:

If RelationGetBufferForTuple() decides it needs to extend, this happens:
    LockRelationForExtension(relation, ExclusiveLock);
    buffer = ReadBufferBI(relation, P_NEW, bistate);
 

Assuming bistate is false (I didn't check the bulk case), ReadBufferBI() ends up at ReadBuffer_common(), which calls BufferAlloc(). In the normal case, BufferAlloc() won't find the necessary buffer, so it will call StrategyGetBuffer(), which will end up getting the freelist lock. Currently the free list is normally empty, which means we now need to run the clock sweep to find a victim buffer. The clock sweep will keep running until it finds a buffer that is not pinned and has usage_count = 0. If shared buffers are under heavy pressure, you can have a huge number of them with usage_count = 5, which means that for 100GB shared buffers and an 8K BLCKSZ you could have to check buffers *52 million* times (assuming you finally find a buffer at the start of the 5th loop) before you find a victim.
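
The usage_count distribution is easy to check with the pg_buffercache contrib module; a quick sketch (assumes the default 8K block size, and note that scanning the view is itself somewhat intrusive on a busy system):

-- Requires: CREATE EXTENSION pg_buffercache;
SELECT usagecount, count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS size
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;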
 

Keep in mind that's all happening while you're holding both the extension lock *and the freelist lock*, which basically means no one else in the entire system can allocate a new buffer.
 

This is one reason why a large shared_buffers setting is usually counter-productive. Experience with older versions is that setting it higher than about 8GB is more likely to hurt than to help. Newer versions are probably better, but I think you'll be hard-pressed to find a workload where 100GB makes sense. It might if your entire database fits in shared_buffers (though, even then there's probably a number of O(n) or worse operations that will hurt you), but if your database is > shared_buffers you're probably in trouble.
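
Whether the whole database even fits in shared_buffers is easy to check, for example:

-- Compare the current database size against the shared_buffers setting.
SELECT pg_size_pretty(pg_database_size(current_database())) AS database_size,
       current_setting('shared_buffers') AS shared_buffers;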
 

I suggest cutting shared_buffers *way* down. Old-school advice for this machine would be 8G (since 25% of 128G would be too big). You might be able to do better than 8G, but I recommend not even trying unless you've got a good way to test your performance.
 

If you can test performance and find an optimal setting for shared_buffers, please do share your test data and findings. :)
 
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: ExclusiveLock on extension of relation with huge shared_buffers

From
Borodin Vladimir
Date:

On 25 Oct 2014, at 4:31, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote:

> Please don't top-post.
>
> On 10/24/14, 3:40 AM, Borodin Vladimir wrote:
>> I have taken some backtraces (they are attached to the letter) of two processes with such command:
>> pid=17981; while true; do date; gdb -batch -e back /usr/pgsql-9.4/bin/postgres $pid; echo; echo; echo; echo; sleep 0.1; done
>>
>> Process 17981 was holding the lock for a long time - http://pastie.org/9671931.
>> And process 13886 was waiting for lock (in different time and from different blocker actually but I don’t think it is really important) - http://pastie.org/9671939.
>>
>> As I can see, 17981 is actually waiting for LWLock on BufFreelistLock in StrategyGetBuffer function, freelist.c:134 while holding exclusive lock on relation. I will try to increase NUM_BUFFER_PARTITIONS (on read-only load it also gave us some performance boost) and write the result in this thread.
>
> BufFreelistLock becomes very contended when shared buffers are under a lot of pressure.
>
> Here's what I believe is happening:
>
> If RelationGetBufferForTuple() decides it needs to extend, this happens:
>     LockRelationForExtension(relation, ExclusiveLock);
>     buffer = ReadBufferBI(relation, P_NEW, bistate);
>
> Assuming bistate is false (I didn't check the bulk case), ReadBufferBI() ends up at ReadBuffer_common(), which calls BufferAlloc(). In the normal case, BufferAlloc() won't find the necessary buffer, so it will call StrategyGetBuffer(), which will end up getting the freelist lock. Currently the free list is normally empty, which means we now need to run the clock sweep to find a victim buffer. The clock sweep will keep running until it finds a buffer that is not pinned and has usage_count = 0. If shared buffers are under heavy pressure, you can have a huge number of them with usage_count = 5, which for 100GB shared buffers and an 8K BLKSZ, you could have to check buffers *52 million* times (assuming you finally find a buffer on the start of the 5th loop) before you find a victim.
>
> Keep in mind that's all happening while you're holding both the extension lock *and the freelist lock*, which basically means no one else in the entire system can allocate a new buffer.

I’ll try the same workload with the recent patch from Andres Freund [0].


> This is one reason why a large shared_buffers setting is usually counter-productive. Experience with older versions is that setting it higher than about 8GB is more likely to hurt than to help. Newer versions are probably better, but I think you'll be hard-pressed to find a workload where 100GB makes sense. It might if your entire database fits in shared_buffers (though, even then there's probably a number of O(n) or worse operations that will hurt you), but if your database is > shared_buffers you're probably in trouble.
>
> I suggest cutting shared_buffers *way* down. Old-school advice for this machine would be 8G (since 25% of 128G would be too big). You might be able to do better than 8G, but I recommend not even trying unless you've got a good way to test your performance.
>
> If you can test performance and find an optimal setting for shared_buffers, please do share your test data and findings. :)

Of course, it works well with shared_buffers <= 8GB. But we have seen that on a read-only load, when the data set fits in RAM, with <= 8GB shared_buffers we hit BufFreelistLock contention while moving pages between shared buffers and the page cache. Increasing shared_buffers to the size of the data set improves performance by up to 2.5x on this read-only load. So we started testing a configuration with huge shared_buffers under a write load, and that is why I started this thread.

Since StrategyGetBuffer() no longer uses the BufFreelistLock LWLock [1], I’ll also re-run the tests with a read-only load and small shared_buffers.


> --
> Jim Nasby, Data Architect, Blue Treble Consulting
> Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Vladimir