Thread: Effective Cache Size
Hi,

I have been following a thread on this list, "Inconsistent performance", and had a few questions, especially about the bits concerning effective_cache_size. I have read some of the docs and some other threads on this setting, and it seems to be used by the planner to choose between a sequential and an index scan. So it will not necessarily increase performance as such, but rather helps the planner pick the optimal plan. Is this correct?

It is not that we are suffering massive performance issues at the moment, but our database is expected to grow considerably in the next couple of years, both in terms of load and size.

Also, what would an appropriate setting be? From what I read of Scott Marlowe's email, and from the information below, I reckon it should be somewhere in the region of 240,000.

Danger maths ahead. Beware!!!!

<maths>
   141816K buff
+ 1781764K cached
-----------------
  1923580K total

effective_cache_size = 1923580 / 8 = 240447.5
</maths>

Here is some information on the server in question. If any more information is required then please say. It is a dedicated PG machine with no other services being hosted off it. As you can see from the uptime, its load average is 0.00, and it is currently so chilled it's almost frozen!!!!! That will change though :-(

Hardware
========
Dual PIII 1.4GHz
2Gb RAM
1Tb SAN with hardware RAID 5, using 1Gbps Fibre Channel

OS
==
Linux webbasedth5 2.4.18-18.7.xsmp #1 SMP Wed Nov 13 19:01:42 EST 2002 i686
Red Hat Linux release 7.3 (Valhalla)

PG
==
PostgreSQL 7.3.1 on i686-pc-linux-gnu, compiled by GCC 2.96

Database
========
This includes all indexes and tables. I can provide more information on how this is chopped up if needed.

Size   : 1,141.305 Mb
Tuples : 13,416,397

Uptime
======
11:15am up 197 days, 16:50, 1 user, load average: 0.00, 0.00, 0.00

Top
===
Mem:  2064836K av, 2018648K used,   46188K free, 0K shrd, 141816K buff
Swap: 2096472K av,    4656K used, 2091816K free           1781764K cached

Postgresql.conf (all defaults except)
=====================================
max_connections = 1000
shared_buffers = 16000      (128 Mb)
max_fsm_relations = 5000
max_fsm_pages = 500000
vacuum_mem = 65535

Kind Regards,

Nick Barr
On 17 Sep 2003 at 11:48, Nick Barr wrote:

> Hi,
>
> I have been following a thread on this list, "Inconsistent performance",
> and had a few questions, especially about the bits concerning
> effective_cache_size. I have read some of the docs and some other threads
> on this setting, and it seems to be used by the planner to choose between
> a sequential and an index scan. So it will not necessarily increase
> performance as such, but rather helps the planner pick the optimal plan.
> Is this correct?

That is correct.

> Danger maths ahead. Beware!!!!
>
> <maths>
>    141816K buff
> + 1781764K cached
> -----------------
>   1923580K total
>
> effective_cache_size = 1923580 / 8 = 240447.5
> </maths>

That would be a bit too aggressive. I would say set it around 200K to leave room for odd stuff. The rest of your configuration seems fine. Of course, the latest version of PostgreSQL is always good, though...

Bye
 Shridhar

--
Power is danger.    -- The Centurion, "Balance of Terror", stardate 1709.2
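For anyone who wants to reproduce the arithmetic above, here is a minimal shell sketch, assuming the "Buffers" and "Cached" figures are read from /proc/meminfo in kilobytes and that the server uses the default 8 kB PostgreSQL page size. The 200000 figure is just Shridhar's conservative suggestion, not an official formula:

    # Rough effective_cache_size estimate from the kernel's buffer + page cache.
    buff=$(awk '/^Buffers:/ {print $2}' /proc/meminfo)     # kB
    cached=$(awk '/^Cached:/ {print $2}' /proc/meminfo)    # kB
    pages=$(( (buff + cached) / 8 ))                       # 8 kB PostgreSQL pages
    echo "estimated effective_cache_size = ${pages} pages"
    # Then, a bit more conservatively, in postgresql.conf:
    #   effective_cache_size = 200000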
I have been experimenting with a new Seagate Cheetah 10k-RPM SCSI drive to compare with a cheaper Seagate Barracuda 7200-RPM IDE drive (each in a single-drive configuration). The Cheetah definitely dominates the generic IO tests such as bonnie++, but fares poorly with pgbench (and other PostgreSQL operations).

I don't understand why switching to a SCSI drive in an otherwise identical setup would so seriously degrade performance. I would have expected the opposite. vmstat does not reveal (to me) any bottlenecks in the SCSI configuration.

The only difference between the two test scenarios is that I stopped the postmaster, copied the data dir to the other drive and put in a symlink to point to the new path. I ran the tests several times, so these are not flukes. Can anyone explain why this might be happening and how to better leverage this 10k drive?

thanks,
Mike Adler

System info:

Box is a Dell 600SC with Adaptec 39160 SCSI controller.
Linux 2.4.18-bf2.4
CPU: Intel(R) Celeron(R) CPU 2.00GHz stepping 09
Memory: 512684k/524224k available (1783k kernel code, 11156k reserved, 549k data, 280k init, 0k highmem)

postgresql.conf settings:
shared_buffers = 10000
random_page_cost = 0.3
sort_mem = 4096

##################################################
TEST 1: IDE Seagate Barracuda

hde: ST340014A, ATA DISK drive
hde: 78165360 sectors (40021 MB) w/2048KiB Cache, CHS=77545/16/63

bonnie++ -f:
Version 1.02b       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
dellmar          1G           27001  10 11863   4           20867   3 161.7   0

sample "vmstat 1" output during bonnie++:
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  0   9332   4456   5056 467728   0   0 20864     0  429   698   0   5  95
 0  1  1   9332   4380   5056 467728   0   0  5248 27056  361   207   1   4  95
 0  1  1   9332   4376   5056 467728   0   0   384 26936  338    55   0   0 100
 0  1  0   9332   4416   5064 468368   0   0 10112  9764  385   350   0   4  96
 1  0  0   9332   4408   5056 468120   0   0 20608     0  427   684   1   7  92
 1  0  0   9332   4392   5056 467864   0   0 20992     0  431   692   0   5  95

pgbench:
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
number of clients: 2
number of transactions per client: 400
number of transactions actually processed: 800/800
tps = 110.213013 (including connections establishing)
tps = 110.563927 (excluding connections establishing)

sample "vmstat 1" output during pgbench:
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 2  0  0    160   4348  50032 419320   0   0   240  3432  514  3849  34   7  59
 0  2  0    160   4392  50764 418544   0   0   224  3348  500  3701  33   6  61
 2  0  0    160   4364  51652 417688   0   0   240  3908  573  4411  43   8  50
 2  0  0    160   4364  52508 416832   0   0   160  3708  548  4273  44   8  49
 1  1  1    160   4420  53332 415944   0   0   160  3604  541  4174  40  13  48
 0  1  1    160   4420  54160 415120   0   0   104  3552  526  4048  42  14  45
 1  0  0    160   4964  54720 414576   0   0   128  4328  645  5819  69   7  24

########################################################
TEST 2: SCSI Seagate Cheetah 10k.6

Vendor: SEAGATE   Model: ST336607LW   Rev: DS08

bonnie++ -f:
Version 1.02b       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
dellmar          1G           40249  14 21177   7           46620   7 365.8   0

sample "vmstat 1" output during bonnie++:
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  1  1   8916   4400   1844 467216   0   0   384 42348  475    80   0   0 100
 0  1  1   8916   4392   1844 467216   0   0   512 46420  472   103   0   2  98
 1  0  0   8916   4364   1852 469392   0   0  7168 26552  507   268   0   3  97
 1  0  0   8916   4452   1868 469392   0   0 28544 12312  658   947   1  15  84
 1  0  0   8916   4416   1860 468888   0   0 47744     4  850  1534   0  18  82
 1  0  0   8916   4436   1796 468312   0   0 48384     0  859  1555   0  19  81
 1  0  0   8916   4452   1744 467724   0   0 48640     0  863  1569   2  20  78

pgbench (sounds thrashy):
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
number of clients: 2
number of transactions per client: 400
number of transactions actually processed: 800/800
tps = 33.274922 (including connections establishing)
tps = 33.307125 (excluding connections establishing)

sample "vmstat 1" output during pgbench:
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  1  1    160   4356  36568 432772   0   0     0  1120  232  1325  12   2  86
 0  1  1    160   4452  36592 432652   0   0     0  1108  229  1295  14   2  84
 0  1  1    160   4428  36616 432652   0   0     0  1168  234  1370   9   4  87
 0  1  1    160   4392  36636 432668   0   0     0  1120  231  1303  12   3  85
 0  1  0    160   4364  36664 432668   0   0     0  1084  230  1361  16   5  79
 0  1  0    160   4456  36696 432548   0   0     0  1196  234  1300  13   2  85

Mike Adler
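For reference, a minimal sketch of the data-directory move Mike describes (stop the postmaster, copy the cluster, symlink the old path to the new location). The paths /var/lib/pgsql/data and /mnt/scsi are hypothetical stand-ins, and the last line assumes a pgbench database has already been initialized with "pgbench -i":

    pg_ctl stop -D /var/lib/pgsql/data               # stop the postmaster first
    cp -a /var/lib/pgsql/data /mnt/scsi/data         # copy the whole cluster to the drive under test
    mv /var/lib/pgsql/data /var/lib/pgsql/data.old   # keep the original as a fallback
    ln -s /mnt/scsi/data /var/lib/pgsql/data         # old path now points at the new location
    pg_ctl start -D /var/lib/pgsql/data
    pgbench -c 2 -t 400 pgbench                      # re-run the same benchmark as above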
Michael Adler <adler@pobox.com> writes:
> I have been experimenting with a new Seagate Cheetah 10k-RPM SCSI drive to
> compare with a cheaper Seagate Barracuda 7200-RPM IDE drive (each in a
> single-drive configuration). The Cheetah definitely dominates the generic
> IO tests such as bonnie++, but fares poorly with pgbench (and other
> PostgreSQL operations).

It's fairly common for ATA drives to be configured to lie about write completion (ie, claim write-complete as soon as data is accepted into their onboard RAM buffer), whereas SCSI drives usually report write complete only when the data is actually down to disk. The performance differential may thus be coming at the expense of reliability. If you run Postgres with fsync off, does the differential go away?

			regards, tom lane
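A minimal sketch of the experiment Tom suggests: turn fsync off for the cluster under test, restart, and re-run the same pgbench workload. fsync is a documented postgresql.conf setting; the path and database name here are hypothetical, and fsync should be turned back on afterwards since running without it risks data loss on a crash:

    # In postgresql.conf for the cluster under test:
    #   fsync = false
    pg_ctl restart -D /var/lib/pgsql/data    # pick up the changed setting
    pgbench -c 2 -t 400 pgbench              # same workload as before
    # Set fsync = true again when the comparison is done.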
On Wed, 17 Sep 2003, Tom Lane wrote:
> Michael Adler <adler@pobox.com> writes:
> > I have been experimenting with a new Seagate Cheetah 10k-RPM SCSI drive to
> > compare with a cheaper Seagate Barracuda 7200-RPM IDE drive (each in a
> > single-drive configuration). The Cheetah definitely dominates the generic
> > IO tests such as bonnie++, but fares poorly with pgbench (and other
> > PostgreSQL operations).
>
> It's fairly common for ATA drives to be configured to lie about write
> completion (ie, claim write-complete as soon as data is accepted into
> their onboard RAM buffer), whereas SCSI drives usually report write
> complete only when the data is actually down to disk. The performance
> differential may thus be coming at the expense of reliability. If you
> run Postgres with fsync off, does the differential go away?

Yes, they both perform equally, at about 190 tps, with fsync off.

The culprit turns out to be write-caching on the IDE drive. It is enabled by default, but can be disabled with "hdparm -W0 /dev/hdx". After it is disabled, the tps are proportional to rpms.

There's a (2001) Linux kernel thread on this if anyone is interested:
http://www.ussg.iu.edu/hypermail/linux/kernel/0103.0/0331.html

So the question is whether it is ever sensible to use write-caching and expect comparable persistence.

Thanks,
Michael Adler
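A short sketch of the comparison Michael describes, assuming the IDE drive shows up as /dev/hde as on the test box above (the device name and database name are placeholders):

    hdparm -W /dev/hde            # show the current write-cache state
    hdparm -W0 /dev/hde           # disable the on-drive write cache
    pgbench -c 2 -t 400 pgbench   # tps should now track rotational speed
    hdparm -W1 /dev/hde           # re-enable the cache to compare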
On Wed, Sep 17, 2003 at 04:46:00PM -0400, Michael Adler wrote:
> So the question is whether it is ever sensible to use write-caching and
> expect comparable persistence.

Yes. If and only if you have a battery-backed cache. I know of no IDE drives that have that, but there's nothing about the spec which makes it impossible.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
On Wed, 1 Oct 2003, Andrew Sullivan wrote:
> On Wed, Sep 17, 2003 at 04:46:00PM -0400, Michael Adler wrote:
> > So the question is whether it is ever sensible to use write-caching and
> > expect comparable persistence.
>
> Yes. If and only if you have a battery-backed cache. I know of no
> IDE drives that have that, but there's nothing about the spec which
> makes it impossible.

FYI, on a dual PIV 2800 with 2 gig RAM and a single UDMA 80 gig hard drive, I went from 420 tps to 22 tps when I disabled write caching. WOW. A factor of about 20 times slower. (pgbench -c 4 -t 100)
On Wed, Oct 01, 2003 at 07:14:32AM -0600, scott.marlowe wrote:
> FYI, on a dual PIV 2800 with 2 gig RAM and a single UDMA 80 gig hard drive,
> I went from 420 tps to 22 tps when I disabled write caching. WOW. A factor
> of about 20 times slower. (pgbench -c 4 -t 100)

That's completely consistent with tests Chris Browne has done here on cache-enabled and cache-disabled boxes that we have.

It's a _really_ big difference. The combination of a battery-backed write cache on your controller plus a really good UPS is quite possibly the number one thing you can do to improve performance. For what it's worth, I can't see how this is something special about Postgres: even raw-filesystem type systems have to make sure the disk actually has the data, and a write cache is bound to be a big help for that.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
andrew@libertyrms.info (Andrew Sullivan) writes:
> On Wed, Oct 01, 2003 at 07:14:32AM -0600, scott.marlowe wrote:
>> FYI, on a dual PIV 2800 with 2 gig RAM and a single UDMA 80 gig hard drive,
>> I went from 420 tps to 22 tps when I disabled write caching. WOW. A factor
>> of about 20 times slower. (pgbench -c 4 -t 100)
>
> That's completely consistent with tests Chris Browne has done here on
> cache-enabled and cache-disabled boxes that we have.
>
> It's a _really_ big difference. The combination of a battery-backed
> write cache on your controller plus a really good UPS is quite possibly
> the number one thing you can do to improve performance. For what
> it's worth, I can't see how this is something special about Postgres:
> even raw-filesystem type systems have to make sure the disk actually
> has the data, and a write cache is bound to be a big help for that.

Indeed. When I ran the tests, I found that JFS was preferable to XFS and ext3 on Linux on the machine with the big battery-backed cache (with the side effect that it was getting, yes, probably about 20x the performance of systems without the cache).

The FS-related result appeared surprising, as the "stories" I had heard suggested that JFS hadn't been particularly heavily tuned on Linux, whereas XFS was supposed to be the "speed demon."

It is entirely possible that the result I saw is one that would reverse partially or even totally on a system LACKING that cache. XFS might "play better" when we're cacheless, the (perhaps only fabled) demerits of JFS being more than totally hidden once we add the cache.

What I find disappointing is that it isn't possible to get SSD cards that are relatively inexpensive. A similarly fabulous performance increase _ought_ to be attainable if you could stick pg_xlog and pg_clog on a 256MB (or bigger!) battery-backed SSD, ideally one that plugs into a PCI slot. This should have the further benefit of diminishing the amount of mechanical activity going on, as WAL activity would no longer involve ANY mechanical I/O.

Unfortunately, while there are companies hawking SSDs, they are in the "you'll have to talk to our salescritter for pricing" category, which means that they must be ferociously expensive. :-(.

--
output = ("cbbrowne" "@" "libertyrms.info")
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)
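Putting pg_xlog on a separate fast device is, in these PostgreSQL versions, just a filesystem-level symlink trick rather than a built-in feature. A minimal sketch, assuming the fast device (SSD, battery-backed card, whatever) is mounted at /mnt/fastlog and the cluster lives in /var/lib/pgsql/data (both hypothetical paths):

    pg_ctl stop -D /var/lib/pgsql/data                    # never move WAL under a running postmaster
    mv /var/lib/pgsql/data/pg_xlog /mnt/fastlog/pg_xlog   # relocate the WAL directory
    ln -s /mnt/fastlog/pg_xlog /var/lib/pgsql/data/pg_xlog
    pg_ctl start -D /var/lib/pgsql/data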
Christopher Browne wrote on Wed, 01.10.2003 at 19:21:
>
> The FS-related result appeared surprising, as the "stories" I had
> heard suggested that JFS hadn't been particularly heavily tuned on
> Linux, whereas XFS was supposed to be the "speed demon."

Gentoo Linux recommends XFS only for SAN + fibre channel + a good UPS, for anything but database use ;)

> It is entirely possible that the result I saw is one that would
> reverse partially or even totally on a system LACKING that cache. XFS
> might "play better" when we're cacheless, the (perhaps only fabled)
> demerits of JFS being more than totally hidden once we add the cache.
>
> What I find disappointing is that it isn't possible to get SSD cards
> that are relatively inexpensive. A similarly fabulous performance
> increase _ought_ to be attainable if you could stick pg_xlog and
> pg_clog on a 256MB (or bigger!) battery-backed SSD, ideally one that
> plugs into a PCI slot.

For really cheap, and for small-size transactions, you could experiment with USB 2.0 memory sticks (some of them claim 34MB/s write speed), perhaps in a striped/mirrored configuration. You would just need something counting writes in the driver layer to alert you when you are reaching the x00k "writes" limit and have to plug in new sticks :)

OTOH, the articles I found through quick googling suggest only 2.4MB/s write and 7MB/s read speeds for USB 2.0 memory sticks, so the 34MB/s is probably just sales pitch and refers to bus speed, not actual write speed ;(

> Unfortunately, while there are companies hawking SSDs, they are in the
> "you'll have to talk to our salescritter for pricing" category, which
> means that they must be ferociously expensive. :-(.

The cheapest I found, one with external backup power, was ~$1.8k for a 2GB PCI device:

http://www.cdw.com/shop/search/Results.aspx?key=platypus&x=0&y=0

An external 16GB one, with battery backup and write-to-small-IDE-drives-on-power-failure, was ~$25k.

-----------------
Hannu
> > Unfortunately, while there are companies hawking SSDs, they are in the
> > "you'll have to talk to our salescritter for pricing" category, which
> > means that they must be ferociously expensive. :-(.
>
> The cheapest I found, one with external backup power, was ~$1.8k for a
> 2GB PCI device:
>
> http://www.cdw.com/shop/search/Results.aspx?key=platypus&x=0&y=0

That is pretty neat.

> An external 16GB one, with battery backup and
> write-to-small-IDE-drives-on-power-failure, was ~$25k.

And they scale up from there. This company has one that goes up to 1TB of RAM. 4.5kW power consumption? I'd hate to see what kind of heat that thing generates.

http://www.imperialtech.com/pdf/MRSpec_021803.pdf

I have no idea what the monster unit costs, but someone described the price of one of the *lesser* units as "made me physically ill", so I can only imagine.

--
greg
Andrew Sullivan wrote:
> Yes. If and only if you have a battery-backed cache. I know of no
> IDE drives that have that, but there's nothing about the spec which
> makes it impossible.

http://www.ussg.iu.edu/hypermail/linux/kernel/0103.0/1084.html

Relevant section:

<quote>
Maybe that is why there is a vender disk-cache dump zone on the edge of
the platters...just maybe you need to buy your drives from somebody that
does this and has a predictive sector stretcher as the energy from the
inertia by the DC three-phase motor executes the dump. Ever wondered why
modern drives have open collectors on the databuss? Maybe to disconnect
the power draw so that the motor now generator provides the needed power
to complete the data dump...
</quote>

This SEEMS to imply that at least some IDE drives have enough power left after the power goes off to flush the write-cached data to disk. The rest of the email is not very reassuring, though, and note that it is two years old.

Anyone want to test? :)

--
Linux homer 2.4.18-14 #1 Wed Sep 4 13:35:50 EDT 2002 i686 i686 i386 GNU/Linux
  4:30pm  up 280 days,  8:00,  8 users,  load average: 6.05, 6.01, 6.02
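One crude way to run that test, sketched here as an assumption rather than a rigorous tool: keep writing and syncing an increasing counter on the drive under test, pull the plug mid-run, and after reboot compare the file with the last value reported as synced. /mnt/testdrive is a hypothetical mount point on the drive being tested:

    #!/bin/sh
    # Crude pull-the-plug test for write-cache honesty.
    i=0
    while true; do
        i=$((i + 1))
        echo "$i" > /mnt/testdrive/counter.txt
        sync                               # ask the OS to flush everything; the drive's honesty is what's under test
        echo "last value synced: $i"       # note this on the console / serial line
    done
    # Pull the plug while this runs. After reboot, if the file holds a smaller
    # value than the last one reported as synced, the drive lied about the write.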
>>>>> "CB" == Christopher Browne <cbbrowne@libertyrms.info> writes:

CB> Unfortunately, while there are companies hawking SSDs, they are in the
CB> "you'll have to talk to our salescritter for pricing" category, which
CB> means that they must be ferociously expensive. :-(.

You ain't kidding. Unfortunately, one of the major vendors just went belly up (Imperial Technology), so pricing probably won't get any better anytime soon.

Perhaps one of these days I'll try that experiment on my SSD... ;-)

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.                Khera Communications, Inc.
Internet: khera@kciLink.com       Rockville, MD  +1-240-453-8497
AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/