Re: Bug: Buffer cache is not scan resistant - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Bug: Buffer cache is not scan resistant
Date
Msg-id 24581.1173141787@sss.pgh.pa.us
In response to Re: Bug: Buffer cache is not scan resistant  (Mark Kirkwood <markir@paradise.net.nz>)
Responses Re: Bug: Buffer cache is not scan resistant
List pgsql-hackers
Mark Kirkwood <markir@paradise.net.nz> writes:
> Tom Lane wrote:
>> But what I wanted to see was the curve of
>> elapsed time vs shared_buffers?
> ...
> Looks *very* similar.

Yup, thanks for checking.

I've been poking into this myself.  I find that I can reproduce the
behavior to some extent even with a slow disk drive (this machine is a
dual 2.8GHz Xeon EM64T running Fedora Core 5; the dd-to-dev-null test
shows the disk read speed as 43MB/sec or so).  Test case is a
several-gig table, no indexes, fully vacuumed so that neither VACUUM nor
COUNT(*) have to do anything but seqscan as fast as they can.  Given a
*freshly started* postmaster, I see

regression=# show shared_buffers;
 shared_buffers
----------------
 128MB
(1 row)

regression=# \timing
Timing is on.
regression=# vacuum lineitem;
VACUUM
Time: 63652.333 ms
regression=# vacuum lineitem;
VACUUM
Time: 63562.303 ms
regression=# select count(*) from lineitem;
  count
----------
 10240000
(1 row)

Time: 63142.174 ms
regression=# vacuum lineitem;
VACUUM
Time: 61638.421 ms
regression=# vacuum lineitem;
VACUUM
Time: 61785.905 ms

I didn't show it here, but you can repeat the VACUUM all you want before
the SELECT, and its times are stable; and you can repeat all you want
after the SELECT, and the times are stable but a couple seconds lower.
Restart the postmaster and it goes back to the slower behavior.  (I'm
keeping autovac off so it doesn't change the results.)

I decided to get down and dirty with oprofile, and soon found that the
user-space CPU consumption is indistinguishable in both states:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
  samples|      %|
------------------
  141065 73.8193 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
   26368 13.7984 /home/tgl/testversion/bin/postgres
   12765  6.6799 /libata
    2238  1.1711 /lib64/libc-2.4.so
    1112  0.5819 /dm_mod

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
  samples|      %|
------------------
  113177 70.2169 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
   26284 16.3070 /home/tgl/testversion/bin/postgres
   12004  7.4475 /libata
    2093  1.2985 /lib64/libc-2.4.so
     996  0.6179 /dm_mod

Inside the kernel, there's only one routine that's significantly different:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
57779    40.9591  copy_user_generic
18175    12.8841  __delay
3994      2.8313  _raw_spin_lock
2388      1.6928  put_page
2184      1.5482  mwait_idle
2083      1.4766  _raw_write_unlock
1909      1.3533  _raw_write_lock

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
37437    33.0783  copy_user_generic
17891    15.8080  __delay
3372      2.9794  _raw_spin_lock
2218      1.9598  mwait_idle
2067      1.8263  _raw_write_unlock
1837      1.6231  _raw_write_lock
1531      1.3527  put_page

So that's part of the mystery: apparently copy_user_generic is coded in
such a way that it's faster to copy into memory that's already in
processor cache.  This strikes me as something that probably
could/should be fixed in the kernel; I don't see any good reason why
overwriting a whole cache line oughtn't be the same speed either way.
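
To make that concrete, here's a toy user-space memcpy benchmark (my own sketch, nothing to do with the kernel routine; names, sizes, and the 512MB figure are made up for illustration).  It copies the same 8K page either into one reused, cache-resident destination or across a destination far larger than L2; the cold-destination loop should come out measurably slower, which is the same effect copy_user_generic is showing.

/*
 * Toy benchmark (a sketch, not kernel code): copy an 8K page repeatedly,
 * once into a small destination that stays resident in the CPU cache and
 * once striding across a destination much bigger than L2, so every copy
 * lands on cold cache lines.  The volatile function pointer just keeps
 * the compiler from optimizing the copies away; error checks omitted.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGE    8192                    /* one PostgreSQL block */
#define BIGSZ   (512 * 1024 * 1024)     /* much larger than L2 cache */
#define ITERS   (BIGSZ / PAGE)

static void *(*volatile do_copy) (void *, const void *, size_t) = memcpy;

static double
seconds(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int
main(void)
{
    char       *src = malloc(PAGE);
    char       *hot = malloc(PAGE);
    char       *cold = malloc(BIGSZ);
    struct timespec t0, t1, t2;
    size_t      i;

    memset(src, 'x', PAGE);
    memset(cold, 0, BIGSZ);             /* fault all the pages in up front */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
        do_copy(hot, src, PAGE);        /* destination stays cache-hot */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (i = 0; i < ITERS; i++)
        do_copy(cold + i * PAGE, src, PAGE);    /* destination always cold */
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("hot destination:  %.3f s\n", seconds(t0, t1));
    printf("cold destination: %.3f s\n", seconds(t1, t2));
    return 0;
}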

The other thing that was bothering me is why does the SELECT change
VACUUM's behavior?  A debugging printout added to ReadBuffer gave the
answer: after postmaster start, we see things like

read block 353094 into buffer 11386
read block 353095 into buffer 11387
read block 353096 into buffer 11388
read block 353097 into buffer 11389
read block 353098 into buffer 11390
read block 353099 into buffer 11391
read block 353100 into buffer 11392
read block 353101 into buffer 11393
read block 353102 into buffer 11394
read block 353103 into buffer 11395

and after the SELECT it behaves like

read block 336761 into buffer 9403
read block 336762 into buffer 9402
read block 336763 into buffer 9403
read block 336764 into buffer 9402
read block 336765 into buffer 9403
read block 336766 into buffer 9402
read block 336767 into buffer 9403
read block 336768 into buffer 9402
read block 336769 into buffer 9403
read block 336770 into buffer 9402
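
(For the record, the instrumentation behind those traces is nothing fancier than a one-line printout in ReadBuffer, roughly like the following; the exact placement and variable names in the real function differ.)

/*
 * Hypothetical sketch of the debugging printout: emitted in ReadBuffer
 * once the target block has been mapped to a buffer slot.
 */
elog(LOG, "read block %u into buffer %d",
     blockNum, BufferDescriptorGetBuffer(bufHdr));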

What's going on is that VACUUM puts each buffer it's finished with on
the tail of the freelist.  In the post-SELECT state, there are just two
buffers cycling through the freelist (not sure why not only one, but it
doesn't matter) and so the cache footprint is limited.  But immediately
after postmaster start, (nearly) all the buffers are in the freelist and
so they all cycle through VACUUM's usage.  In any real-world situation,
of course, the freelist is going to be nearly empty most of the time and
so I don't think this part is worth changing.
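
If it helps to see the mechanism in isolation, here's a toy model of that freelist behavior (a simplification for illustration, not the actual bufmgr code): buffers are handed out from the head of the freelist, and VACUUM pushes each buffer it's done with back onto the tail.  Seed the list with all the buffers and the scan cycles through every one of them; seed it with just two and they ping-pong exactly like the trace above.

/*
 * Toy model of the freelist behavior described above (not real bufmgr
 * code).  take_from_freelist() plays the role of getting a victim buffer
 * for the next block read; put_on_freelist() is VACUUM handing the buffer
 * back once it's finished with it.
 */
#include <stdio.h>

#define NBUFFERS 16                     /* pretend shared_buffers = 16 */

static int  freelist[NBUFFERS];         /* circular queue of buffer ids */
static int  head, tail, nfree;

static void
put_on_freelist(int buf)                /* append at the tail */
{
    freelist[tail] = buf;
    tail = (tail + 1) % NBUFFERS;
    nfree++;
}

static int
take_from_freelist(void)                /* pop the head */
{
    int         buf = freelist[head];

    head = (head + 1) % NBUFFERS;
    nfree--;
    return buf;
}

static void
simulate(int initial_free, int nblocks)
{
    int         i;

    head = tail = nfree = 0;
    for (i = 0; i < initial_free; i++)
        put_on_freelist(i);

    printf("freelist starts with %d buffer(s):", initial_free);
    for (i = 0; i < nblocks; i++)
    {
        int         buf = take_from_freelist();

        printf(" %d", buf);             /* "read block i into buffer buf" */
        put_on_freelist(buf);           /* VACUUM is done with it at once */
    }
    printf("\n");
}

int
main(void)
{
    simulate(NBUFFERS, 20);             /* fresh postmaster: cycles through all */
    simulate(2, 20);                    /* post-SELECT state: ping-pongs on two */
    return 0;
}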

So I now concede Luke's argument that this behavior is related to L2
cache usage.  But the next question is whether we ought to change
regular seqscan's behavior to mimic VACUUM.  I'm very worried about
pessimizing other cases if we do.  ISTM there's a fairly clear case that
this might be fixable at the kernel level.  Moreover, the issue only
arises because of the way that the copy-from-kernel-buffer-to-userspace
routine behaves, which means that if we go to a regime where we stop
relying on OS caching and ask for direct DMA into our buffers, the
advantage would disappear anyway.  Lastly, I think the case where a win
is possible is fairly narrow --- as soon as you've got anything but the
one seqscan going on, it's not going to help.
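
For what it's worth, the direct-I/O regime I'm alluding to would look roughly like this on Linux (a sketch only, and obviously not something the backend does today): open the data file with O_DIRECT and read into a suitably aligned buffer, so the kernel page cache and copy_user_generic drop out of the picture entirely.

/*
 * Sketch of reading a data file with O_DIRECT: the device DMAs straight
 * into our aligned buffer instead of the data being copied out of the
 * kernel page cache.  File name and block size are just placeholders.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
    void       *buf;
    int         fd;
    ssize_t     nread;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s datafile\n", argv[0]);
        return 1;
    }
    /* O_DIRECT requires the buffer (and I/O sizes) to be block aligned */
    if (posix_memalign(&buf, 4096, BLCKSZ) != 0)
        return 1;

    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    while ((nread = read(fd, buf, BLCKSZ)) > 0)
        ;                               /* DMA lands directly in buf */
    close(fd);
    free(buf);
    return 0;
}
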
        regards, tom lane

