Performance features the 4th - Mailing list pgsql-hackers

From Jan Wieck
Subject Performance features the 4th
Msg-id 3FA94A52.8070603@Yahoo.com
Responses Re: Performance features the 4th  (Manfred Spraul <manfred@colorfullife.com>)
Re: Performance features the 4th  (Neil Conway <neilc@samurai.com>)
List pgsql-hackers
I've just uploaded

http://developer.postgresql.org/~wieck/all_performance.v4.74.diff.gz

This patch contains the "still not yet ready" performance improvements 
discussed over the last couple of days.

_Shared buffer replacement_:

The buffer replacement strategy is a slightly modified version of ARC. 
The modifications are some specializations in how CDBs get promoted. 
Since PostgreSQL always looks up a buffer multiple times when updating 
(first during the scan, then during heap_update() etc.), every updated 
block would otherwise jump right into the T2 (frequently accessed) 
queue. To prevent that, the Xid of the transaction that added a buffer 
to the T1 queue is remembered, and if the block is found in T1 again, 
the same transaction will not promote it into T2. This also covers 
blocks accessed via SELECT ... FOR UPDATE; UPDATE, since that is a 
common pattern and does not mean that this particular datum is accessed 
frequently.
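The xid-gated promotion can be sketched roughly as follows. This is an illustrative reconstruction, not code from the patch; the type and field names (CacheDirBlock, t1_xid, cdb_hit) are invented here:

```c
/* Minimal sketch of the xid-gated T1 -> T2 promotion described above.
 * All names are illustrative, not taken from the actual patch. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef enum { QUEUE_T1, QUEUE_T2 } CacheQueue;

typedef struct
{
    CacheQueue    queue;   /* which ARC queue this CDB sits in */
    TransactionId t1_xid;  /* xid of the xact that faulted it into T1 */
} CacheDirBlock;

/* On a T1 hit: promote to T2 only if a *different* transaction touches
 * the block; repeated access within the same transaction (scan, then
 * heap_update()) does not count as frequent use. */
static void
cdb_hit(CacheDirBlock *cdb, TransactionId cur_xid)
{
    if (cdb->queue == QUEUE_T1 && cdb->t1_xid != cur_xid)
        cdb->queue = QUEUE_T2;
}
```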

Blocks faulted in by vacuum are handled specially in that they end up at 
the LRU end of the T1 queue, and when they are evicted from there their 
CDB gets destroyed instead of being added to the B1 queue, to keep 
vacuum from polluting the cache's auto-tuning.

A guc variable
    buffer_strategy_status_interval = 0 # 0-600 seconds

controls DEBUG1 messages emitted every n seconds showing the current 
queue sizes and the cache hit rates during the last interval.


_Vacuum page delay_:

Tom Lane's napping during vacuum, with another tuning option. I replaced 
the usleep() call with a PG_DELAY(msec) macro in miscadmin.h, which 
uses select(2) instead. That should address the possible portability 
problems.
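A select(2)-based delay macro could look like the sketch below; the actual macro in miscadmin.h may differ in detail:

```c
/* Sketch of a PG_DELAY(msec) macro built on select(2), as described
 * above.  select() with no fd sets is a portable sub-second sleep. */
#include <sys/select.h>

#define PG_DELAY(msec) \
    do { \
        struct timeval _tv; \
        _tv.tv_sec  = (msec) / 1000; \
        _tv.tv_usec = ((msec) % 1000) * 1000; \
        (void) select(0, NULL, NULL, NULL, &_tv); \
    } while (0)
```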

The config options
    vacuum_page_group_delay = 0  # 0-100 milliseconds
    vacuum_page_group_size  = 10 # 1-1000 pages

control how many pages get vacuumed as a group and how long vacuum will 
nap between groups.

I think this can be improved further if vacuum gets feedback from the 
buffer manager on whether a page was actually found clean or already 
dirty in the cache, or had to be faulted in. Together with whether 
vacuum itself dirties the page, this would yield a sort of "vacuum page 
cost" that is accumulated and controls how often to nap. Vacuuming a 
page that is found in the cache and has no dead tuples would be cheap, 
while vacuuming a page that caused another dirty block to get evicted, 
then had to be read in and finally ends up dirty because of dead tuples 
would be expensive.
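The proposed cost accounting could be sketched like this; the cost values, the threshold and all names here are invented for illustration, not part of the patch:

```c
/* Hypothetical "vacuum page cost" accumulator sketching the idea
 * above.  Costs and the nap threshold are made-up illustrations. */
#include <stdbool.h>

#define COST_PAGE_HIT     1   /* page found clean in the cache */
#define COST_PAGE_MISS   10   /* page had to be faulted in */
#define COST_PAGE_DIRTY  20   /* vacuum dirtied the page */
#define COST_NAP_LIMIT  100   /* nap once this much cost accrues */

static int vacuum_cost_balance = 0;

/* Account for one vacuumed page; returns true when vacuum should nap
 * (the caller would then do its PG_DELAY-style sleep). */
static bool
vacuum_account_page(bool faulted_in, bool dirtied)
{
    vacuum_cost_balance += faulted_in ? COST_PAGE_MISS : COST_PAGE_HIT;
    if (dirtied)
        vacuum_cost_balance += COST_PAGE_DIRTY;

    if (vacuum_cost_balance >= COST_NAP_LIMIT)
    {
        vacuum_cost_balance = 0;
        return true;
    }
    return false;
}
```

With this shape, a run of cheap cached pages accumulates cost slowly while an expensive fault-in-and-dirty page contributes thirty times as much, so the nap frequency tracks the real IO load.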


_Lazy checkpoint_:

This is the checkpoint process with the ability to schedule the buffer 
flushing over some time. Also the buffers are written in an order told 
by the buffer replacement strategy. Currently that is a merged list of 
dirty buffers in the order of the T1 and T2 queues of ARC. Since buffers 
are replaced in that order, it causes backends to find clean buffers for 
eviction more often.

The config options
    lazy_checkpoint_time = 0        # 0-3600 seconds
    lazy_checkpoint_group_size = 50 # 10-1000 pages
    lazy_checkpoint_maxdelay = 500  # 100-1000 milliseconds

control how long the buffer flushing "should" take and how many dirty 
pages to write as a group before syncing and napping. The maxdelay 
parameter prevents really small amounts of changes from being spread 
out over the whole interval.
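The pacing arithmetic implied by these three options can be sketched as follows; this is my reconstruction of the idea, not code from the patch, and the function name is invented:

```c
/* Sketch: spread the flush of n_dirty pages over ckpt_time_sec,
 * writing group_size pages between naps, but cap each nap at
 * maxdelay_ms so a small checkpoint finishes quickly.  Illustrative
 * only; not taken from the patch. */
static int
lazy_checkpoint_nap_ms(int n_dirty, int ckpt_time_sec,
                       int group_size, int maxdelay_ms)
{
    int n_groups = (n_dirty + group_size - 1) / group_size;
    int nap_ms;

    if (n_groups <= 1)
        return 0;               /* nothing to spread out */

    nap_ms = (ckpt_time_sec * 1000) / n_groups;
    return nap_ms > maxdelay_ms ? maxdelay_ms : nap_ms;
}
```

Without the maxdelay cap, a checkpoint with only a handful of dirty groups would nap for minutes between them just to fill the configured window.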

The syncing is currently done in a new function in md.c, 
mdfsyncrecent(), called through the smgr. The intention is to maintain 
some LRU of recently written-to file descriptors and pg_fdatasync() 
them. I haven't found the right place for that yet, so it simply does a 
system-global sync().

My idea here is that it really does not matter how accurately the 
individual files are forced to disk during this; all we care about is 
causing some physical writes by the kernel while we're writing the 
buffers out, rather than letting the OS buffer those writes until we 
finish the checkpoint.

The lazy checkpoint configuration should only affect automatic 
checkpoints started by the postmaster because a checkpoint_timeout 
occurred. Actually it seems to apply to manually started checkpoints as 
well. BufferSync() monitors the time to finish, held in shared memory, 
so it would be relatively easy to hurry up a running lazy checkpoint by 
setting that to zero. It's just that the postmaster can't do that, 
because it does not have a PGPROC structure and therefore can't lock 
that shmem structure. This is a must-fix item, because hurrying up the 
checkpointer is critical at shutdown time.


_TODO_:

* Replace the global sync() in mdfsyncrecent(int max) with calls to
  pg_fdatasync().

* Add functionality to postmaster to hurry up a running checkpoint
  at shutdown.

* Make sure that manual checkpoints are not affected by the lazy
  checkpoint config options and that they too hurry up a running one.

* Further improve vacuum's napping strategy depending on the actual
  IO caused per page.


_NOTE_:

The core team is well aware of the high demand for these features. As 
things stand however, it is impossible to get this functionality 
released in version 7.4.

That does not mean that we have no chance to include some or all of the 
functionality in a subsequent 7.4.x release. But for that to happen, 
the TODOs mentioned above must get done first. Further, we need a good 
amount of evidence that these changes actually achieve the desired 
effect to a degree that justifies breaking our "no features in dot 
releases" rule. We also need a good amount of evidence that the 
features don't break anything or sacrifice stability, and that a 
backward compatible behaviour (where possible ... not possible with ARC 
vs. LRU) is the default.

I personally would like to see this work included in a 7.4.x release. 
But that requires people to actually run tests, stress some hardware, 
check platform portability and *give us feedback*, because this is what 
we get for the release candidates, and these improvements can under no 
circumstances be of lower quality than that. If this goes into a 7.4.x 
release and there is any platform-dependent issue in it, it endangers 
the timely fix of other bugs on those platforms, and that's a no-go.


Happy testing


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


