Re: Bug: Buffer cache is not scan resistant - Mailing list pgsql-hackers

From Luke Lonergan
Subject Re: Bug: Buffer cache is not scan resistant
Date
Msg-id C3E62232E3BCF24CBA20D72BFDCB6BF802AF2821@MI8NYCMAIL08.Mi8.com
In response to Bug: Buffer cache is not scan resistant  ("Luke Lonergan" <llonergan@greenplum.com>)
Responses Re: Bug: Buffer cache is not scan resistant  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
<p><font size="2">Hi Tom,<br /><br /> Good info - it's the same in Solaris, the routine is uiomove (Sherry wrote
it).<br/><br /><br /> - Luke<br /><br /> Msg is shrt cuz m on ma treo<br /><br />  -----Original Message-----<br />
From:  Tom Lane [<a href="mailto:tgl@sss.pgh.pa.us">mailto:tgl@sss.pgh.pa.us</a>]<br /> Sent:   Monday, March 05, 2007
07:43PM Eastern Standard Time<br /> To:     Mark Kirkwood<br /> Cc:     Pavan Deolasee; Gavin Sherry; Luke Lonergan;
PGSQLHackers; Doug Rady; Sherry Moore<br /> Subject:        Re: [HACKERS] Bug: Buffer cache is not scan resistant<br
Mark Kirkwood <markir@paradise.net.nz> writes:
> Tom Lane wrote:
>> But what I wanted to see was the curve of
>> elapsed time vs shared_buffers?
> ...
> Looks *very* similar.

Yup, thanks for checking.

I've been poking into this myself.  I find that I can reproduce the
behavior to some extent even with a slow disk drive (this machine is a
dual 2.8GHz Xeon EM64T running Fedora Core 5; the dd-to-dev-null test
shows the disk read speed as 43MB/sec or so).  Test case is a
several-gig table, no indexes, fully vacuumed so that neither VACUUM nor
COUNT(*) have to do anything but seqscan as fast as they can.  Given a
*freshly started* postmaster, I see

regression=# show shared_buffers;
 shared_buffers
----------------
 128MB
(1 row)

regression=# \timing
Timing is on.
regression=# vacuum lineitem;
VACUUM
Time: 63652.333 ms
regression=# vacuum lineitem;
VACUUM
Time: 63562.303 ms
regression=# select count(*) from lineitem;
  count
----------
 10240000
(1 row)

Time: 63142.174 ms
regression=# vacuum lineitem;
VACUUM
Time: 61638.421 ms
regression=# vacuum lineitem;
VACUUM
Time: 61785.905 ms

I didn't show it here, but you can repeat the VACUUM all you want before
the SELECT, and its times are stable; and you can repeat all you want
after the SELECT, and the times are stable but a couple seconds lower.
Restart the postmaster and it goes back to the slower behavior.  (I'm
keeping autovac off so it doesn't change the results.)
I decided to get down and dirty with oprofile, and soon found that the
user-space CPU consumption is indistinguishable in both states:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
  samples|      %|
------------------
   141065 73.8193 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
    26368 13.7984 /home/tgl/testversion/bin/postgres
    12765  6.6799 /libata
     2238  1.1711 /lib64/libc-2.4.so
     1112  0.5819 /dm_mod

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
  samples|      %|
------------------
   113177 70.2169 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
    26284 16.3070 /home/tgl/testversion/bin/postgres
    12004  7.4475 /libata
     2093  1.2985 /lib64/libc-2.4.so
      996  0.6179 /dm_mod
Inside the kernel, there's only one routine that's significantly different:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
57779    40.9591  copy_user_generic
18175    12.8841  __delay
3994      2.8313  _raw_spin_lock
2388      1.6928  put_page
2184      1.5482  mwait_idle
2083      1.4766  _raw_write_unlock
1909      1.3533  _raw_write_lock

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
37437    33.0783  copy_user_generic
17891    15.8080  __delay
3372      2.9794  _raw_spin_lock
2218      1.9598  mwait_idle
2067      1.8263  _raw_write_unlock
1837      1.6231  _raw_write_lock
1531      1.3527  put_page

So that's part of the mystery: apparently copy_user_generic is coded in
such a way that it's faster to copy into memory that's already in
processor cache.  This strikes me as something that probably
could/should be fixed in the kernel; I don't see any good reason why
overwriting a whole cache line oughtn't be the same speed either way.
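(A minimal userspace sketch of the effect, with plain memcpy standing in
for copy_user_generic and the 8K chunk size and buffer counts picked
arbitrarily, would be something like the following; the only thing that
varies is whether the destination stays cache-resident:)

/*
 * Sketch: copy the same 8K source chunk many times, once into a tiny
 * destination that stays in the CPU cache ("hot") and once into a 128MB
 * destination that cannot ("cold").  Only the destination footprint differs.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define CHUNK   8192            /* one PostgreSQL-sized block */
#define NCHUNKS (512 * 1024)    /* roughly 4GB copied in total */

static double copy_loop(const char *src, char *dst, size_t dst_chunks)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < NCHUNKS; i++)
        memcpy(dst + (i % dst_chunks) * CHUNK, src, CHUNK);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    char *src  = malloc(CHUNK);
    char *hot  = malloc(2 * CHUNK);               /* fits easily in L2 */
    char *cold = malloc((size_t) 16384 * CHUNK);  /* 128MB, far bigger than L2 */

    memset(src, 'x', CHUNK);
    memset(hot, 0, 2 * CHUNK);
    memset(cold, 0, (size_t) 16384 * CHUNK);

    printf("hot  (2 x 8K slots):     %.2f s\n", copy_loop(src, hot, 2));
    printf("cold (16384 x 8K slots): %.2f s\n", copy_loop(src, cold, 16384));
    return 0;
}

(On hardware like the above you'd expect the "hot" loop to come out
clearly ahead, since its destination lines never leave the cache; that's
the same asymmetry the oprofile numbers show for copy_user_generic.)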
The other thing that was bothering me is why does the SELECT change
VACUUM's behavior?  A debugging printout added to ReadBuffer gave the
answer: after postmaster start, we see things like

read block 353094 into buffer 11386
read block 353095 into buffer 11387
read block 353096 into buffer 11388
read block 353097 into buffer 11389
read block 353098 into buffer 11390
read block 353099 into buffer 11391
read block 353100 into buffer 11392
read block 353101 into buffer 11393
read block 353102 into buffer 11394
read block 353103 into buffer 11395

and after the SELECT it behaves like

read block 336761 into buffer 9403
read block 336762 into buffer 9402
read block 336763 into buffer 9403
read block 336764 into buffer 9402
read block 336765 into buffer 9403
read block 336766 into buffer 9402
read block 336767 into buffer 9403
read block 336768 into buffer 9402
read block 336769 into buffer 9403
read block 336770 into buffer 9402

What's going on is that VACUUM puts each buffer it's finished with on
the tail of the freelist.  In the post-SELECT state, there are just two
buffers cycling through the freelist (not sure why not only one, but it
doesn't matter) and so the cache footprint is limited.  But immediately
after postmaster start, (nearly) all the buffers are in the freelist and
so they all cycle through VACUUM's usage.  In any real-world situation,
of course, the freelist is going to be nearly empty most of the time and
so I don't think this part is worth changing.
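(As a toy model of that freelist behavior, deliberately simplified and
not the real bufmgr code, with a made-up buffer count, consider:)

/*
 * Simplified model: "ReadBuffer" takes its victim from the head of the
 * freelist, and "VACUUM" puts each buffer it is finished with back on
 * the tail.  The only difference between the two runs is how many
 * buffers are on the freelist to begin with.
 */
#include <stdio.h>

#define NBUFFERS 16                     /* toy-sized shared_buffers */

static int freelist[NBUFFERS];
static int head, tail, nfree;

static void push_tail(int buf)
{
    freelist[tail] = buf;
    tail = (tail + 1) % NBUFFERS;
    nfree++;
}

static int pop_head(void)
{
    int buf = freelist[head];
    head = (head + 1) % NBUFFERS;
    nfree--;
    return buf;
}

static void simulate(const char *label, int initial_free)
{
    head = tail = nfree = 0;
    for (int i = 0; i < initial_free; i++)
        push_tail(i);

    printf("%-12s", label);
    for (int block = 0; block < 12; block++)
    {
        int buf = pop_head();           /* grab a victim buffer */
        printf(" %2d", buf);            /* ...this block lands here */
        push_tail(buf);                 /* done with it, back to the tail */
    }
    printf("\n");
}

int main(void)
{
    simulate("fresh start:", NBUFFERS); /* every buffer starts on the freelist */
    simulate("post-SELECT:", 2);        /* only two buffers left free */
    return 0;
}

(The first run touches every buffer before reusing one, i.e. the whole
shared_buffers footprint, while the second just ping-pongs between two
buffers, which is exactly the 9402/9403 pattern above.)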
So I now concede Luke's argument that this behavior is related to L2
cache usage.  But the next question is whether we ought to change
regular seqscan's behavior to mimic VACUUM.  I'm very worried about
pessimizing other cases if we do.  ISTM there's a fairly clear case that
this might be fixable at the kernel level.  Moreover, the issue only
arises because of the way that the copy-from-kernel-buffer-to-userspace
routine behaves, which means that if we go to a regime where we stop
relying on OS caching and ask for direct DMA into our buffers, the
advantage would disappear anyway.  Lastly, I think the case where a win
is possible is fairly narrow --- as soon as you've got anything but the
one seqscan going on, it's not going to help.

                        regards, tom lane
