Thread: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1
Can anyone answer this for me: Although I realize my client's disk subsystem (SCSI/RAID Smart Array E200 controller using RAID 1) is less than impressive - is the default setting of 4.0 realistic or could it be lower? Thanks!
Should be a lot higher; something like 10-15 is a more accurate approximation. Increasing the number of disks in a RAID actually makes the number higher, not lower. Until Postgres gets AIO + the ability to post multiple concurrent I/Os on index probes, random I/O does not scale with increasing disk count, but sequential I/O does, hence the increasing "random page cost" as the RAID gets faster. The reason to change the number is to try to discourage the planner from choosing index scans too aggressively. We (GP) have implemented something we call "Adaptive Nested Loop" to replace a nested loop + index scan with a hash join when the selectivity estimates are off in order to improve this behavior. We also run with a "random_page_cost=100" because we generally run on machines with fast sequential I/O. - Luke
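For anyone experimenting with this, the setting can be tried per session before changing postgresql.conf. A minimal sketch of that pattern (the table and query are hypothetical; the point is only how SET and EXPLAIN ANALYZE let you see the planner's choice change):

SET random_page_cost = 10;    -- discourage index scans for this session
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

SET random_page_cost = 2;     -- make index scans look cheaper
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

RESET random_page_cost;       -- back to the postgresql.conf value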
Luke Lonergan wrote: > Should be a lot higher; something like 10-15 is a more accurate approximation. In my own case, I have a much smaller database that I normally work with, where everything should fit in memory (100 Mbytes?), and reducing it to 3.0 has resulted in consistently better timings for me. I think this means that the planner doesn't understand my database size to effective memory size ratio. :-) Anyway, my point is that if you change the default to 10 you may hurt people like me. Cheers, mark -- Mark Mielke <mark@mielke.cc>
Luke, > We (GP) have implemented > something we call "Adaptive Nested Loop" to replace a nested loop + > index scan with a hash join when the selectivity estimates are off in > order to improve this behavior. We also run with a > "random_page_cost=100" because we generally run on machines with fast > sequential I/O. So, when is this getting contributed? ;-) -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco
"Luke Lonergan" <llonergan@greenplum.com> writes: > Should be a lot higher, something like 10-15 is approximating accurate. Most people's experience is that due to Postgres underestimating the benefits of caching lowering the random_page_cost is helpful. > Increasing the number of disks in a RAID actually makes the number higher, > not lower. Until Postgres gets AIO + the ability to post multiple > concurrent IOs on index probes, random IO does not scale with increasing > disk count, but sequential does, thus the increasing "random page cost" as > the RAID gets faster. That does sound right, though I don't think it's very common. If you have very wide stripes you can get some amazing sequential scan speeds and it won't really help random access at all. This is especially helpful if you're in an environment where you don't care about the costs you're imposing on other processes, such as a data warehouse where you have a single large batch query running at a time. What I don't understand is the bit about "until Postgres gets AIO + the ability to post multiple concurrent IOs on index probes". Even with AIO your seek times are not going to be improved by wide raid stripes. And you can't possibly find the page at level n+1 before you've looked at the page at level n. Do you mean to be able to probe multiple index keys simultaneously? How does that work out? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 9/10/07, Gregory Stark <stark@enterprisedb.com> wrote: > > "Luke Lonergan" <llonergan@greenplum.com> writes: > > > Should be a lot higher; something like 10-15 is a more accurate approximation. > > Most people's experience is that, because Postgres underestimates the benefits > of caching, lowering random_page_cost is helpful. Quite often the real problem is that they have effective_cache_size too small, and they use random_page_cost to get the planner to switch to index scans on small tables. With a large effective_cache_size and a small to moderate table (i.e. one that fits in memory pretty handily), the planner has gotten much better in the last few major releases at picking an index over a sequential scan.
On Mon, 2007-09-10 at 22:44 +0100, Gregory Stark wrote: > What I don't understand is the bit about "until Postgres gets AIO + the > ability to post multiple concurrent IOs on index probes". Even with AIO your > seek times are not going to be improved by wide raid stripes. And you can't > possibly find the page at level n+1 before you've looked at the page at level > n. Do you mean to be able to probe multiple index keys simultaneously? How > does that work out? > I think he's referring to mirrors, in which there are multiple spindles that can return a requested block. That could mitigate random I/O, if the I/O is asynchronous and something intelligent (OS or controller) can schedule it. Regards, Jeff Davis
Gregory Stark wrote:
> "Luke Lonergan" <llonergan@greenplum.com> writes:
>> Increasing the number of disks in a RAID actually makes the number higher, not lower. Until Postgres gets AIO + the ability to post multiple concurrent I/Os on index probes, random I/O does not scale with increasing disk count, but sequential I/O does, hence the increasing "random page cost" as the RAID gets faster.
> What I don't understand is the bit about "until Postgres gets AIO + the ability to post multiple concurrent I/Os on index probes". Even with AIO your seek times are not going to be improved by wide RAID stripes. And you can't possibly find the page at level n+1 before you've looked at the page at level n. Do you mean to be able to probe multiple index keys simultaneously? How does that work out?

One suggestion: The plan is already in a tree. With some dependency analysis, I assume the tree could be executed in parallel (multiple threads or event-triggered entry into a state machine), and I/O to fetch index pages or table pages could be scheduled in parallel. At this point, AIO becomes necessary to let the underlying system (and hardware with tagged queueing?) schedule which pages should be served best first.

Cheers,
mark

-- Mark Mielke <mark@mielke.cc>
Scott Marlowe wrote:
> On 9/10/07, Gregory Stark <stark@enterprisedb.com> wrote:
>> "Luke Lonergan" <llonergan@greenplum.com> writes:
>>> Should be a lot higher; something like 10-15 is a more accurate approximation.
>> Most people's experience is that, because Postgres underestimates the benefits of caching, lowering random_page_cost is helpful.
> Quite often the real problem is that they have effective_cache_size too small, and they use random_page_cost to get the planner to switch to index scans on small tables. With a large effective_cache_size and a small to moderate table (i.e. one that fits in memory pretty handily), the planner has gotten much better in the last few major releases at picking an index over a sequential scan.

In my case, I set effective_cache_size to 25% of the RAM available to the system (256 Mbytes), for a database that was about 100 Mbytes or less. I found performance to increase when reducing random_page_cost from 4.0 to 3.0.

For a database that truly fits entirely in memory, I assume random_page_cost is closer to 1.0. The planner should know that there is no significant seek cost for RAM.

I will try to compare results tonight using 8.2. The last time I checked may have been 8.1. I am also curious to see what the current algorithm is with regard to effective_cache_size.

Cheers,
mark

-- Mark Mielke <mark@mielke.cc>
Hi Mark, Greg, On 9/10/07 3:08 PM, "Mark Mielke" <mark@mark.mielke.cc> wrote: > One suggestion: The plan is already in a tree. With some dependency analysis, > I assume the tree could be executed in parallel (multiple threads or event > triggered entry into a state machine), and I/O to fetch index pages or table > pages could be scheduled in parallel. At this point, AIO becomes necessary to > let the underlying system (and hardware with tagged queueing?) schedule which > pages should be served best first. Right now the pattern for index scan goes like this: - Find qualifying TID in index - Seek to TID location in relfile - Acquire tuple from relfile, return When the tuples are widely distributed in the table, as is the case with a very selective predicate against an evenly distributed attribute on a relation 2x larger than the I/O cache + bufcache, this pattern will result in effectively "random I/O". In actual fact, the use of the in-memory bitmap index will make the I/Os sequential, but sparse, which is another version of "random" if the sequential I/Os are larger than the gather/scatter I/O aggregation in the OS scheduler (say 1MB). This is a very common circumstance for DSS / OLAP / DW workloads. For plans that qualify with the above conditions, the executor will issue blocking calls to lseek(), which will translate to a single disk actuator moving to the needed location in seek_time, approximately 8ms. The seek_time for a single query will not improve with the increase in number of disks in an underlying RAID pool, so we can do about 1000/8 = 125 seeks per second no matter what I/O subsystem we have. If we implement AIO and allow for multiple pending I/Os used to prefetch groups of qualifying tuples, basically a form of random readahead, we can improve the throughput for any given query by taking advantage of multiple disk actuators. This will work for RAID5, RAID10 and other disk pooling mechanisms because the lseek() will be issued as parallel events. Note that the same approach would also work to speed sequential access by overlapping compute and I/O. - Luke
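A back-of-the-envelope version of that arithmetic (the 8 ms seek and 8 KB page size are from the post above; the 50-60 MB/s sequential figure is what gets measured with dd later in the thread):

1000 ms/s / 8 ms per seek     ~ 125 random page reads/s for one backend
125 reads/s * 8 KB per page   ~ 1 MB/s of useful random-read bandwidth, no matter how many disks are in the array
single-drive sequential scan  ~ 50-60 MB/s, and it scales up with additional spindles

That 50:1-or-worse gap for uncached, single-stream access is where suggestions like 10-15 (or Greenplum's 100) come from; the much lower values reported elsewhere in this thread reflect workloads where most "random" reads are really cache hits.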
Hi Josh, On 9/10/07 2:26 PM, "Josh Berkus" <josh@agliodbs.com> wrote: > So, when is this getting contributed? ;-) Yes, that's the right question to ask :-) One feeble answer: "when we're not overwhelmed by customer activity"... - Luke
"Luke Lonergan" <llonergan@greenplum.com> writes: > Right now the pattern for index scan goes like this: > > - Find qualifying TID in index > - Seek to TID location in relfile > - Acquire tuple from relfile, return >... > If we implement AIO and allow for multiple pending I/Os used to prefetch > groups of qualifying tuples, basically a form of random readahead Ah, I see what you mean now. It makes a lot more sense if you think of it for bitmap index scans. So, for example, the bitmap index scan could stream tids to the executor and the executor would strip out the block numbers and pass them to the i/o layer saying "i need this block now but following that I'll need these blocks so get them moving now". I think this seems pretty impractical for regular (non-bitmap) index probes though. You might be able to do it sometimes but not very effectively and you won't know when it would be useful. I think what this means is that there are actually *three* kinds of i/o: 1) Sequential which means you get the full bandwidth of your drives * the number of spindles; 2) Random which gets you 1 block per seek latency regardless of how many spindles you have; and 3) Random but with prefetch which gets you the random bandwidth above times the number of spindles. The extra spindles speed up sequential i/o too so the ratio between sequential and random with prefetch would still be about 4.0. But the ratio between sequential and random without prefetch would be even higher. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Greg, > I think this seems pretty impractical for regular > (non-bitmap) index probes though. You might be able to do it > sometimes but not very effectively and you won't know when it > would be useful. Maybe so, though I think it's reasonable to get multiple actuators going even if the seeks are truly random. It's a dynamic / tricky business to determine how many pending seeks to post, but it should be roughly close to the number of disks in the pool IMO. > I think what this means is that there are actually *three* > kinds of i/o: 1) Sequential which means you get the full > bandwidth of your drives * the number of spindles; 2) Random > which gets you 1 block per seek latency regardless of how > many spindles you have; and 3) Random but with prefetch which > gets you the random bandwidth above times the number of spindles. Perhaps so, though I'm more optimistic that prefetch would help most random seek situations. For reasonable amounts of concurrent usage this point becomes moot - we get the benefit of multiple backends doing seeking anyway, but I think if we do dynamic prefetch right it would degenerate gracefully in those circumstances. > The extra spindles speed up sequential i/o too so the ratio > between sequential and random with prefetch would still be > about 4.0. But the ratio between sequential and random > without prefetch would be even higher. Right :-) - Luke
Luke Lonergan wrote: > For plans that qualify with the above conditions, the executor will issue > blocking calls to lseek(), which will translate to a single disk actuator > moving to the needed location in seek_time, approximately 8ms. I doubt it's actually the lseeks, but the reads/writes after the lseeks that block. > If we implement AIO and allow for multiple pending I/Os used to prefetch > groups of qualifying tuples, basically a form of random readahead, we can > improve the throughput for any given query by taking advantage of multiple > disk actuators. Rather than jumping to AIO, which is a huge change, I think we could get much of the benefit by using posix_fadvise(WILLNEED) in strategic places to tell the OS what pages we're going to need in the near future. If the OS has implemented that properly, it should schedule I/Os for the requested pages ahead of time. That would require very little change to PostgreSQL code, and could simply be #ifdef'd away on platforms that don't support posix_fadvise. > Note that > the same approach would also work to speed sequential access by overlapping > compute and I/O. Yes, though the OS should already be doing read-ahead for us. How efficient it is is another question. posix_fadvise(SEQUENTIAL) could be used to give a hint on that as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
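To illustrate the hint-then-read pattern being proposed, here is a minimal standalone sketch in C; this is not PostgreSQL code, and the file name, block numbers, and BLCKSZ value are assumptions made up for the example:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Hint the kernel about blocks we know we will need, then read them.
 * With POSIX_FADV_WILLNEED the OS can get several RAID members seeking
 * in parallel instead of handling one blocking read at a time. */
static void read_with_prefetch(int fd, const long *blocks, int nblocks)
{
    char buf[BLCKSZ];
    int  i;

    for (i = 0; i < nblocks; i++)
        posix_fadvise(fd, blocks[i] * (off_t) BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);

    for (i = 0; i < nblocks; i++)
    {
        if (pread(fd, buf, BLCKSZ, blocks[i] * (off_t) BLCKSZ) != BLCKSZ)
            perror("pread");
        /* ... process the page ... */
    }
}

int main(void)
{
    long blocks[] = { 17, 4242, 9, 100001 };  /* hypothetical block numbers from an index or bitmap scan */
    int  fd = open("relfile", O_RDONLY);      /* hypothetical relation data file */

    if (fd < 0) { perror("open"); return 1; }
    read_with_prefetch(fd, blocks, 4);
    close(fd);
    return 0;
}

Whether the kernel actually starts those I/Os early depends on the filesystem and kernel version, which is exactly the "if the OS has implemented that properly" caveat above.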
Gregory Stark wrote (in part): > The extra spindles speed up sequential i/o too so the ratio between sequential > and random with prefetch would still be about 4.0. But the ratio between > sequential and random without prefetch would be even higher. > I never figured out how extra spindles help sequential I/O, because consecutive logical blocks are not necessarily written consecutively in a Linux or UNIX file system. They try to group a bunch of blocks together (eight 512-byte blocks?), but that is about it. So even if you are reading sequentially, the head actuator may be seeking around anyway. I suppose you could fix this, if the database were reasonably static, by backing up the entire database, doing a mkfs on the file system, and restoring it. This might make the database more contiguous, at least for a while. When I was working on a home-brew UNIX dbms, I used raw I/O on a separate disk drive so that the files could be contiguous, and this would work. Similarly, IBM's DB2 does that (optionally). But it is my understanding that PostgreSQL does not. OTOH, the large (in my case) cache in the kernel can be helpful if I seek around back and forth to nearby records, since they may be in the cache. On my 8 GByte RAM, I have the shared buffers set to 200,000, which should keep any busy stuff in memory, and there are about 6 GBytes of RAM presently available for the system I/O cache. I have not optimized anything yet because I am still laundering the major input data to initialize the database, so I do not have any real transactions going through it yet. I have 6 SCSI hard drives on two Ultra/320 SCSI controllers. Of the database partitions, sda8 has the write-ahead log, sdb7 has a few tiny seldom-used tables and pg_log, and sdc1, sdd1, sde1, and sdf1 are just for the other tables. For the data on sd[c-f]1 (there is nothing else on these drives), I keep the index for a table on a different drive from the data. When populating the database initially, this seems to help since I tend to fill one table, or a very few tables, at a time, so the table itself and its index do not contend for the head actuator. Presumably, the SCSI controllers can do simultaneous seeks on the various drives and one transfer on each controller. When loading the database (using INSERTs mainly -- because the input data are gawdawful unnormalized spreadsheets obtained from elsewhere, growing once a week), the system is I/O limited by seeks (and rotational latency). I/O transfers average about 1.7 Megabytes/second, although there are peaks that exceed 10 Megabytes/second. If I run pg_restore from a backup tape, I can see 90 Megabyte/second transfer rates for bursts of several seconds at a time, but that is pretty much a record. -- .~. Jean-David Beyer Registered Linux User 85642. /V\ PGP-Key: 9A2FC99A Registered Machine 241939. /( )\ Shrewsbury, New Jersey http://counter.li.org ^^-^^ 06:35:01 up 33 days, 9:57, 0 users, load average: 4.06, 4.07, 4.02
"Jean-David Beyer" <jeandavid8@verizon.net> writes: > Gregory Stark wrote (in part): > >> The extra spindles speed up sequential i/o too so the ratio between sequential >> and random with prefetch would still be about 4.0. But the ratio between >> sequential and random without prefetch would be even higher. >> > I never figured out how extra spindles help sequential I-O because > consecutive logical blocks are not necessarily written consecutively in a > Linux or UNIX file system. They try to group a bunch (8 512-bit?) of blocks > together, but that is about it. So even if you are reading sequentially, the > head actuator may be seeking around anyway. That's somewhat true but good filesystems group a whole lot more than 8 blocks together. You can do benchmarks with dd and compare the speed of reading from a file with the speed of reading from the raw device. On typical consumer drives these days you'll get 50-60MB/s raw and I would expect not a whole lot less than that with a large ext2 file, at least if it's created all in one chunk on a not overly-full filesystem. (Those assumptions is not necessarily valid for Postgres which is another topic, but one that requires some empirical numbers before diving into.) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote:
> "Jean-David Beyer" <jeandavid8@verizon.net> writes:
>> Gregory Stark wrote (in part):
>>> The extra spindles speed up sequential i/o too so the ratio between sequential and random with prefetch would still be about 4.0. But the ratio between sequential and random without prefetch would be even higher.
>> I never figured out how extra spindles help sequential I/O, because consecutive logical blocks are not necessarily written consecutively in a Linux or UNIX file system. They try to group a bunch of blocks together (eight 512-byte blocks?), but that is about it. So even if you are reading sequentially, the head actuator may be seeking around anyway.
> That's somewhat true, but good filesystems group a whole lot more than 8 blocks together. You can do benchmarks with dd and compare the speed of reading from a file with the speed of reading from the raw device. On typical consumer drives these days you'll get 50-60MB/s raw, and I would expect not a whole lot less than that with a large ext2 file, at least if it's created all in one chunk on a not overly-full filesystem.

# date; dd if=/dev/sda8 of=/dev/null;date
Tue Sep 11 14:27:36 EDT 2007
8385867+0 records in
8385867+0 records out
4293563904 bytes (4.3 GB) copied, 71.7648 seconds, 59.8 MB/s
Tue Sep 11 14:28:48 EDT 2007

# date; dd bs=8192 if=/dev/sda8 of=/dev/null;date
Tue Sep 11 14:29:15 EDT 2007
524116+1 records in
524116+1 records out
4293563904 bytes (4.3 GB) copied, 68.2595 seconds, 62.9 MB/s
Tue Sep 11 14:30:23 EDT 2007

# date; dd bs=8192 if=/srv/dbms/dataA/pgsql/data/pg_xlog/000000010000002B0000002F of=/dev/null;date
Tue Sep 11 14:34:25 EDT 2007
2048+0 records in
2048+0 records out
16777216 bytes (17 MB) copied, 0.272343 seconds, 61.6 MB/s
Tue Sep 11 14:34:26 EDT 2007

The first two are the partition where the W.A.L. is (and a bit more):

[/srv/dbms/dataA/pgsql/data]# ls -l
total 104
-rw------- 1 postgres postgres     4 Aug 11 13:32 PG_VERSION
drwx------ 5 postgres postgres  4096 Aug 11 13:32 base
drwx------ 2 postgres postgres  4096 Sep 11 14:35 global
drwx------ 2 postgres postgres  4096 Sep 10 18:58 pg_clog
-rw------- 1 postgres postgres  3396 Aug 11 13:32 pg_hba.conf
-rw------- 1 root     root      3396 Aug 16 14:32 pg_hba.conf.dist
-rw------- 1 postgres postgres  1460 Aug 11 13:32 pg_ident.conf
drwx------ 4 postgres postgres  4096 Aug 11 13:32 pg_multixact
drwx------ 2 postgres postgres  4096 Sep 10 19:48 pg_subtrans
drwx------ 2 postgres postgres  4096 Aug 12 16:14 pg_tblspc
drwx------ 2 postgres postgres  4096 Aug 11 13:32 pg_twophase
drwx------ 3 postgres postgres  4096 Sep 10 19:53 pg_xlog
-rw------- 1 postgres postgres 15527 Sep  8 00:35 postgresql.conf
-rw------- 1 postgres postgres 13659 Aug 11 13:32 postgresql.conf.dist
-rw------- 1 root     root     15527 Sep  4 10:37 postgresql.conf~
-rw------- 1 postgres postgres    56 Sep  8 08:12 postmaster.opts
-rw------- 1 postgres postgres    53 Sep  8 08:12 postmaster.pid

It is tricky for me to find a big enough file to test. I tried one of the pg_xlog files, but I cannot easily copy from there because it acts a bit interactive and the time is mostly my time. If I copy it elsewhere and give it to non-root, then it is all in the cache, so it does not really read it.

> (Those assumptions are not necessarily valid for Postgres, which is another topic, but one that requires some empirical numbers before diving into.)

-- .~. Jean-David Beyer Registered Linux User 85642.
/V\ PGP-Key: 9A2FC99A Registered Machine 241939. /( )\ Shrewsbury, New Jersey http://counter.li.org ^^-^^ 14:30:04 up 33 days, 17:52, 1 user, load average: 5.50, 4.67, 4.29
On Mon, Sep 10, 2007 at 06:22:06PM -0400, Mark Mielke wrote: > In my case, I set effective_cache_size to 25% of the RAM available to > the system (256 Mbytes), for a database that was about 100 Mbytes or > less. I found performance to increase when reducing random_page_cost > from 4.0 to 3.0. Just for the record, effective_cache_size of 25% is *way* too low in most cases, though if you only have 1GB setting it to 500MB probably isn't too far off. Generally, I'll set this to however much memory is in the server, minus 1G for the OS, unless there's less than 4G of total memory in which case I subtract less. -- Decibel!, aka Jim Nasby decibel@decibel.org EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
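A concrete sketch of that rule of thumb in postgresql.conf (the machine sizes are invented; note that effective_cache_size is only a planner hint, not an allocation, and that versions before 8.2 take the value in 8 kB pages rather than with a memory unit):

# 16 GB server, mostly dedicated to PostgreSQL: total RAM minus roughly 1 GB for the OS
effective_cache_size = 15GB

# 2 GB server: subtract less than a full gigabyte
effective_cache_size = 1536MB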
On Tue, Sep 11, 2007 at 02:49:28PM -0400, Jean-David Beyer wrote: > It is tricky for me to find a big enough file to test. I tried one of the dd if=/dev/zero of=bigfile bs=8192 count=1000000 -- Decibel!, aka Jim Nasby decibel@decibel.org EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Decibel! wrote:
> On Mon, Sep 10, 2007 at 06:22:06PM -0400, Mark Mielke wrote:
>> In my case, I set effective_cache_size to 25% of the RAM available to the system (256 Mbytes), for a database that was about 100 Mbytes or less. I found performance to increase when reducing random_page_cost from 4.0 to 3.0.
> Just for the record, effective_cache_size of 25% is *way* too low in most cases, though if you only have 1GB setting it to 500MB probably isn't too far off. Generally, I'll set this to however much memory is in the server, minus 1G for the OS, unless there's less than 4G of total memory in which case I subtract less.

Agreed. My point was only that there are conflicting database requirements, and that one setting may not be valid for both. The default should be whatever is most useful for the largest number of people. People who fall into one of the two extremes should know enough to set the value based on actual performance measurements.

Cheers,
mark

-- Mark Mielke <mark@mielke.cc>
"Decibel!" <decibel@decibel.org> writes: > On Tue, Sep 11, 2007 at 02:49:28PM -0400, Jean-David Beyer wrote: >> It is tricky for me to find a big enough file to test. I tried one of the > > dd if=/dev/zero of=bigfile bs=8192 count=1000000 On linux another useful trick is: echo 1 > /proc/sys/vm/drop_caches Also, it helps to run a "vmstat 1" in another window and watch the bi and bo columns. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Wed, Sep 12, 2007 at 12:02:46AM +0100, Gregory Stark wrote: > "Decibel!" <decibel@decibel.org> writes: > > > On Tue, Sep 11, 2007 at 02:49:28PM -0400, Jean-David Beyer wrote: > >> It is tricky for me to find a big enough file to test. I tried one of the > > > > dd if=/dev/zero of=bigfile bs=8192 count=1000000 > > On linux another useful trick is: > > echo 1 > /proc/sys/vm/drop_caches The following C code should have a similar effect...

/*
 * $Id: clearmem.c,v 1.1 2003/06/29 20:41:33 decibel Exp $
 *
 * Utility to clear out a chunk of memory and zero it. Useful for flushing disk buffers
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    /* argv[1] is the number of megabytes to allocate and zero */
    if (argc < 2 || !calloc(atoi(argv[1]), 1024 * 1024))
    {
        printf("Error allocating memory.\n");
        return 1;
    }
    return 0;
}

-- Decibel!, aka Jim Nasby decibel@decibel.org EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
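One possible way to compile and run it (the size here is arbitrary; to actually push cached file data out it needs to be a sizeable fraction of the machine's RAM):

gcc -O2 -o clearmem clearmem.c
./clearmem 6000    # try to allocate and zero 6000 MB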
On Wed, 12 Sep 2007, Gregory Stark wrote: > Also, it helps to run a "vmstat 1" in another window and watch the bi and bo > columns. Recently on Linux systems I've been using dstat ( http://dag.wieers.com/home-made/dstat/ ) for live monitoring in this sort of situation. Once you get the command line parameters right, you can get data for each of the major disks on your system that keep the columns human readable (like switching from KB/s to MB/s as appropriate) as activity goes up and down combined with the standard vmstat data. I still use vmstat/iostat if I want to archive or parse the data, but if I'm watching it I always use dstat now. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
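For example, an invocation along these lines shows time, CPU, and per-disk throughput for two specific drives once per second (the flags and device names depend on the dstat version and the system, so treat this as an assumption rather than a recipe; without -D, dstat reports one aggregated disk column):

dstat -t -c -d -D sda,sdc 1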
Greg Smith wrote:
> On Wed, 12 Sep 2007, Gregory Stark wrote:
>> Also, it helps to run a "vmstat 1" in another window and watch the bi and bo columns.
> Recently on Linux systems I've been using dstat ( http://dag.wieers.com/home-made/dstat/ ) for live monitoring in this sort of situation. Once you get the command line parameters right, you can get data for each of the major disks on your system that keep the columns human readable (like switching from KB/s to MB/s as appropriate) as activity goes up and down combined with the standard vmstat data.
> I still use vmstat/iostat if I want to archive or parse the data, but if I'm watching it I always use dstat now.

Thanks for the tip Greg... I hadn't heard of dstat.

Sincerely,

Joshua D. Drake

-- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 24x7/Emergency: +1.800.492.2240 PostgreSQL solutions since 1997 http://www.commandprompt.com/ UNIQUE NOT NULL Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
"Gregory Stark" <stark@enterprisedb.com> writes: > "Luke Lonergan" <llonergan@greenplum.com> writes: > >> Right now the pattern for index scan goes like this: >> >> - Find qualifying TID in index >> - Seek to TID location in relfile >> - Acquire tuple from relfile, return >>... >> If we implement AIO and allow for multiple pending I/Os used to prefetch >> groups of qualifying tuples, basically a form of random readahead > > Ah, I see what you mean now. It makes a lot more sense if you think of it for > bitmap index scans. So, for example, the bitmap index scan could stream tids > to the executor and the executor would strip out the block numbers and pass > them to the i/o layer saying "i need this block now but following that I'll > need these blocks so get them moving now". Wow, I've done some preliminary testing here on Linux using posix_fadvise and Solaris using libaio to prefetch blocks and then access them randomly and I think there's a lot of low hanging fruit here. The use case where this helps is indeed on a raid array where you're not maxing out the bandwidth of the array and care about the transaction latency, perhaps a narrow use case but still, quite common. Since our random access is synchronous it means we have to wait for one seek, process that page, then wait for the next seek on another drive which was sitting idle while we were processing the first page. By prefetching the pages we'll need next we can get all the members of the array working for us simultaneously even if they're all doing seeks. What I've done is write a test program which generates a 1G file, syncs it and drops the caches (not working yet on Solaris but doesn't seem to affect the results) and then picks 4096 8k buffers and reads them in random order. The machines it's running on have a small raid array with 4 drives. Just seeking without any prefetch it takes about 12.5s on Linux and 13.5s on Solaris. If I prefetch even a single buffer using posix_fadvise or libaio I see a noticeable improvement, over 25%. At 128 buffers of prefetch both systems are down to about 2.5-2.7s. That's on the small raid array. On the boot both have a small beneficial effect but only at very large prefetch set sizes which I would chalk down to being able to re-order the reads even if it can't overlap them. I want to test how much of this effect evaporates when I compare it to a bitmap index style scan but that depends on a lot of factors like the exact pattern of file extensions on the database files. In any case bitmap index scans get us the reordering effect, but not the overlapping i/o requests assuming they're spread quite far apart in the data files. > I think this seems pretty impractical for regular (non-bitmap) index probes > though. You might be able to do it sometimes but not very effectively and you > won't know when it would be useful. How useful this is depends a lot on how invasively we let it infect things like regular index scans. If we can prefetch right siblings and deeper index pages as we descend an index tree and future heap pages it could help a lot as those aren't sorted like bitmap index scans. But even if we only fetch heap pages all together before processing the heap pages it could be a big help. Incidentally we do need to try to make use of both as Solaris doesn't have posix_fadvise as far as I can tell and Linux's libaio doesn't support non-O_DIRECT files. 
Raw data:

Blocks       Linux           Solaris
Prefetched   posix_fadvise   libaio
---------------------------------------
    1        12.473          13.597
    2         9.053           9.830
    4         6.787           7.594
    8         5.303           6.588
   16         4.209           5.120
   32         3.388           4.014
   64         2.869           3.216
  128         2.515           2.710
  256         2.312           2.327
  512         2.168           2.099
 1024         2.139           1.974
 2048         2.242           1.903
 4096         2.222           1.890

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com
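Since the test program itself isn't posted, here is a rough sketch of what the Linux/posix_fadvise variant might look like (the file name, random seed, and prefetch depth are invented; timing is left to the shell, e.g. time ./a.out, run after the drop_caches step mentioned earlier):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ   8192
#define NREADS   4096
#define PREFETCH 128              /* blocks hinted ahead; 1..4096 in the table above */

int main(void)
{
    int    fd = open("testfile.1G", O_RDONLY);       /* pre-built 1 GB test file */
    off_t  nblocks = (off_t) 1024 * 1024 * 1024 / BLCKSZ;
    off_t  offsets[NREADS];
    char   buf[BLCKSZ];
    int    i;

    if (fd < 0) { perror("open"); return 1; }

    srandom(12345);                                   /* fixed seed: same offsets every run */
    for (i = 0; i < NREADS; i++)
        offsets[i] = (random() % nblocks) * (off_t) BLCKSZ;

    for (i = 0; i < NREADS; i++)
    {
        if (i % PREFETCH == 0)                        /* hint the next window of reads */
        {
            int j, end = (i + PREFETCH < NREADS) ? i + PREFETCH : NREADS;
            for (j = i; j < end; j++)
                posix_fadvise(fd, offsets[j], BLCKSZ, POSIX_FADV_WILLNEED);
        }
        if (pread(fd, buf, BLCKSZ, offsets[i]) != BLCKSZ)
            perror("pread");
    }

    close(fd);
    return 0;
}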
>>> On Mon, Sep 10, 2007 at 2:25 PM, in message <fc45mi$1tj9$1@news.hub.org>, "Carlo Stonebanks" <stonec.register@sympatico.ca> wrote: > is the default setting of 4.0 realistic or could it be lower? Wow, such a simple, innocent question. As you may have inferred, it can't be answered in isolation. Make sure that you have reviewed all of your memory settings, then try adjusting this and seeing what the results are. With accurate effective_cache_size and a fairly generous work_mem setting, we have found that these settings work best for us with our actual production loads:

(1) Cache well below database size (for example 6 GB or 12 GB RAM on a box running a 210 GB database):
#seq_page_cost = 1.0
random_page_cost = 2.0

(2) On a database which is entirely contained within cache:
seq_page_cost = 0.1
random_page_cost = 0.1

(3) Where caching is very significant, but not complete, we have to test to see where performance is best. One example that significantly beat both of the above in production on a particular box:
seq_page_cost = 0.3
random_page_cost = 0.5

So, the short answer to your question is that the default might be realistic in some environments; the best choice will be lower in many environments; the best choice will be higher in some environments; only testing your actual applications in your actual environment can tell you which is the case for you. My approach is to pick one of the first two, depending on whether the database will be fully cached, then monitor for performance problems. When the queries with unacceptable response time have been identified, I look for ways to improve them. One of the things I may try, where a bad plan seems to have been chosen, is to adjust the random page cost. If I do change that in production, then I closely monitor for regression in other queries. -Kevin