Thread: Non-linear Performance
I'm noticing that performance in creating a particular index, and also a little bit in a simple query, seems somewhat non-linear, and I'm wondering if anybody could give me any clues as to what might be causing this, and how I might fix it, if that's possible.

I've done a COPY in and index build on three data sets (just integer data) consisting of 10m rows (500 MB table), 100m rows (5 GB table), and 500m rows (25 GB table). (This is all on a 1.4 GHz P4, 512 MB RAM, two 7200 RPM IDE drives, one for data and one for log.)

The COPY command to do the import is pretty linear, at about 230 sec., 2,000 sec. and 10,000 sec. for the 10m, 100m and 500m row tables. Neither disk I/O nor CPU seem to be pushed particularly, though I'm getting lots of small writes to the log files. I tried using a larger checkpoint size, but it didn't seem to help any.

Recreating the primary key (on an INT, and of course all values unique) took 123, 1300 and 6700 seconds, again pretty linear. The next column is a DATE, which is the same value for the first 3.5M records, the next day for the next 3.5M records, and so on, incrementing a day for every 3.5M records (in order). This index build took about 60, 700 and 3500 seconds, respectively, again linear.

But then I get to the next INT column, which in every row is filled in with a random value between 0 and 99,999. This index takes about 175, 3600, and 28,000 seconds, respectively, to generate. So it takes about 2x as long per record going from 10m to 100m records, and about 1.5x as long again per record when going from 100m to 500m records.

Queries using that index seem to do this too, though not quite as badly. Using a very simple query such as "SELECT COUNT(*) FROM table WHERE value = 12345" (where value is the last INT column above that took ages to index), typical query times (including connection overhead) for data not in the cache are 0.6 sec., 11 sec. and 72 sec.

This query, as expected, is completely dominated by random I/O; the disk the table is on sits there at 100% usage (i.e., disk requests outstanding 100% of the time) and not much else is happening at all. It does seem to do a few more disk transfers than I would really expect. I get back a count of around 4000-5000, which to me implies about 5000 reads plus the index reads (which one would think would not amount to more than one or two hundred pages), yet 110 I/O requests per second times 70 seconds implies about 7000 reads. Is there something I'm missing here?

(If the query plan and analysis is any help, here it is:

Aggregate (cost=22081.46..22081.46 rows=1 width=0) (actual time=70119.88..70119.88 rows=1 loops=1)
  -> Index Scan using data_3_value on data_3 (cost=0.00..22067.71 rows=5498 width=0) (actual time=38.70..70090.45 rows=4980 loops=1)
Total runtime: 70121.74 msec)

Anyway, I'm open to any thoughts on this. In particular, I'm open to suggestions for cheap ways of dealing with this horrible random I/O load. (Yeah, yeah, I know: disk array, and SCSI while I'm about it. But I've been asked to get this sort of thing working fast on systems much cheaper than the current Sun/Sybase/EMC or whatever it is that they're using.)

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
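A rough sketch of the setup being described, for reference (the table name data_3, the index data_3_value and the column value come from the plan above; the other column names, index names and the data file path are assumptions, not details from the post):

    -- schematic reconstruction of the test; names marked above are guesses
    CREATE TABLE data_3 (id INTEGER, d DATE, value INTEGER);
    COPY data_3 FROM '/path/to/datafile';             -- placeholder path; 10m/100m/500m rows
    CREATE UNIQUE INDEX data_3_pkey ON data_3 (id);   -- stands in for recreating the primary key
    CREATE INDEX data_3_d ON data_3 (d);              -- the nearly-sorted DATE column
    CREATE INDEX data_3_value ON data_3 (value);      -- the random INT column that builds slowly
    SELECT COUNT(*) FROM data_3 WHERE value = 12345;  -- the query that is dominated by random I/O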
Curt Sampson <cjs@cynic.net> writes:

> I'm noticing that performance in creating a particular index, and
> also a little bit in a simple query, seems somewhat non-linear,

Btree index build is primarily a sort, so cannot have better than O(n*log(n)) performance for random data. Not sure why you'd expect linearity. Increasing SORT_MEM would help the constant factor, however...

> Queries using that index seem to do this too, though not quite as
> badly. Using a very simple query such as "SELECT COUNT(*) FROM
> table WHERE value = 12345" (where value is the last INT column
> above that took ages to index), typical query times (including
> connection overhead) for data not in the cache are 0.6 sec., 11
> sec. and 72 sec.

I guess that the smaller datasets would get proportionally more benefit from kernel disk caching.

> It does seem to do a few more disk transfers than I would really
> expect. I get back a count of around 4000-5000, which to me implies
> about 5000 reads plus the index reads (which one would think would
> not amount to more than one or two hundred pages), yet 110 I/O
> requests per second times 70 seconds implies about 7000 reads. Is
> there something I'm missing here?

Can you demonstrate that it actually did 7000 reads, and not 5000+? That extrapolation technique doesn't look to me like it has the accuracy to tell the difference. You might try setting show_query_stats (note the results go into the postmaster log, not to the client; perhaps we ought to change that someday).

Also, if you've updated the table at all, there might be some fetches of dead tuples involved.

> Anyway, I'm open to any thoughts on this. In particular, I'm open
> to suggestions for cheap ways of dealing with this horrible random
> I/O load.

More RAM, perhaps.

			regards, tom lane
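A sketch of how one might act on both of those suggestions, assuming 7.2-era setting names (show_query_stats may need to be enabled in postgresql.conf rather than with SET, and its output still lands in the postmaster log):

    SET show_query_stats = on;    -- per-query buffer/I-O statistics go to the postmaster log
    SELECT COUNT(*) FROM data_3 WHERE value = 12345;
    VACUUM ANALYZE data_3;        -- reclaims dead tuple versions that could cause extra heap fetches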
Tom Lane wrote: > >Btree index build is primarily a sort, so cannot have better than >O(n*log(n)) performance for random data. Not sure why you'd expect >linearity. > >Increasing SORT_MEM would help the constant factor, however... > What is the most amount of SORT_MEM it makes sense to allocate? Pretend the ONLY thing I care about is BTREE index creation time. (2 gigs of RAM on my DB machine.) Disk IO is not a bottleneck. -Pete
Peter A. Daly wrote:

> Tom Lane wrote:
>> Btree index build is primarily a sort, so cannot have better than
>> O(n*log(n)) performance for random data. Not sure why you'd expect
>> linearity.
>>
>> Increasing SORT_MEM would help the constant factor, however...
>
> What is the most amount of SORT_MEM it makes sense to allocate?
> Pretend the ONLY thing I care about is BTREE index creation time.
> (2 gigs of RAM on my DB machine.) Disk IO is not a bottleneck.

Another thought: if postgres has one of my CPUs at near 100%, does that mean I can't get any more performance out of it? Still, how big can I make sort_mem?

-Pete
"Peter A. Daly" <petedaly@ix.netcom.com> writes: > Tom Lane wrote: >> Increasing SORT_MEM would help the constant factor, however... >> > What is the most amount of SORT_MEM it makes sense to allocate? I've never experimented with it, but certainly the standard default (512K = 0.5M) is pretty small for modern machines. In a 2G machine I might try settings around 100M-500M to see what works best. (Note this is just for a one-off btree creation --- for ordinary queries you need to allow for multiple sorts going on in parallel, which is one reason the default sort_mem is not very large.) regards, tom lane
>> What is the most amount of SORT_MEM it makes sense to allocate?
>
> I've never experimented with it, but certainly the standard default
> (512K = 0.5M) is pretty small for modern machines. In a 2G machine
> I might try settings around 100M-500M to see what works best. (Note
> this is just for a one-off btree creation --- for ordinary queries you
> need to allow for multiple sorts going on in parallel, which is one
> reason the default sort_mem is not very large.)

I will run some benchmarks and let the list know the results. If this can speed it up a large amount, I can get another 6 gigs of RAM into this machine, which I hope will let me leave SORT_MEM high enough to speed up our huge nightly batch database reload. It's a dual Xeon 500 MHz machine.

-Pete
> Another thought: if postgres has one of my CPUs at near 100%, does that
> mean I can't get any more performance out of it? Still, how big can I
> make sort_mem?

I regularly have one CPU pegged at 100% with PostgreSQL on my queries (tables of 4m-ish rows and up). (Dual P3 1GHz, 2GB, Linux 2.2, PG 7.2.1.) I believe it cannot multithread individual queries using several processors. I believe IBM DB2 Enterprise _can_ do that - but it also costs $20,000 per CPU, and damned if I can figure out how to install it even for the trial.

Let me ask a similar question. If I ran PostgreSQL with a Linux 2.4 kernel and 6GB of RAM with dual P4 Xeons:

a) Could PostgreSQL use all the RAM?
b) The RAM paging would incur a 2-4% slowdown; any other noted speed reductions?
c) Could PostgreSQL use this "hyperthreading" that apparently is in these P4 Xeons, making dual processors look like four processors (obviously, on four separate queries)?
d) How much extra performance does having the log or indices on a different disk buy you, esp. in the instance where you are inserting millions of records into a table? An indexed table?

I ask "d" above because, as I am considering upgrading to such a box, it will have an 8-disk RAID controller, and I'm wondering if it is a better idea to:

a) Run everything on one 7-drive RAID 5 partition (8th drive as hot spare)
b) Run logs as a 2-drive mirror and the rest on a 5-drive RAID 5
c) Run logs on a 2-drive mirror, indices on a 2-drive mirror, and the rest on a 3-drive RAID 5
d) Run logs & indices on a 2-drive mirror and the rest on a 5-drive RAID 5

Thanks,

Doug
>>> Increasing SORT_MEM would help the constant factor, however...
>>
>> What is the most amount of SORT_MEM it makes sense to allocate?
>
> I've never experimented with it, but certainly the standard default
> (512K = 0.5M) is pretty small for modern machines. In a 2G machine
> I might try settings around 100M-500M to see what works best.

I currently use 1.4 gigs for "shared mem" in my database (out of 2G) - I couldn't get PostgreSQL to run with more than that (it might be an OS limit, Linux 2.2). I also use 64 megs as a SORT_MEM setting, on the theory that 4 sorts at once will leave just enough RAM for the rest of my system not to swap.

Cheers,

Doug
Doug Fields <dfields-pg-general@pexicom.com> writes:

> d) How much extra performance does having the log or indices on a different
> disk buy you, esp. in the instance where you are inserting millions of
> records into a table? An indexed table?

Keeping the logs on a separate drive is a big win, I believe, for heavy update situations. (For read-only queries, of course the log doesn't matter.)

Keeping indexes on a separate drive is also traditional database advice, but I don't have any feeling for how much it matters in Postgres.

> a) Run everything on one 7-drive RAID 5 partition (8th drive as hot spare)
> b) Run logs as a 2-drive mirror and the rest on a 5-drive RAID 5
> c) Run logs on a 2-drive mirror, indices on a 2-drive mirror, and the rest
>    on a 3-drive RAID 5
> d) Run logs & indices on a 2-drive mirror and the rest on a 5-drive RAID 5

You could probably get away without mirroring the indices, if you are willing to incur a little downtime to rebuild them after an index drive failure. So another possibility is:

2-drive mirror for log, 1 plain old drive for indexes, rest for data.

If your data will fit on 2 drives then you could mirror both, still have your 8th drive as hot spare, and feel pretty secure.

Note that while it is reasonably painless to configure PG with WAL logs in a special place (after initdb, move the pg_xlog subdirectory and make a symlink to its new location), it's not currently easy to separate indexes from data. So the most practical approach in the short term is probably your (b).

			regards, tom lane
Bear in mind that I am a newbie to the PostgreSQL world but have experience in other RDBMSs when I ask this question:

If you are looking for the best performance, why go with a RAID5 as opposed to a RAID1+0 (mirrored stripes) solution? Understandably RAID5 is a cheaper solution requiring fewer drives for redundancy but, from my experience, RAID5 chokes horribly under heavy disk writing. RAID5 always requires at least two write operations for every block written; one to the data and one to the redundancy algorithm.

Is this wrong?

(I mean no disrespect)

--
Kurt Gunderson
Senior Programmer, Applications Development, Lottery Group
Canadian Bank Note Company, Limited
kgunders@cbnlottery.com
http://www.cbnco.com/
> Keeping indexes on a separate drive is also traditional database advice,
> but I don't have any feeling for how much it matters in Postgres.

My gut feeling is that you'd be better off taking two drives and striping them RAID 0 instead of putting data on one and indexes on another. Either way, you have the same potential total throughput, but if they're in the RAID array, then you can use the total throughput more easily, and in more situations - you don't have to depend on reading two separate pieces of data simultaneously to utilize the entire throughput.

steve
Your RAID analysis is a bit wrong.

In striping (disk joining) every byte written requires 1 byte sent to 1 disk. This gives you ZERO redundancy: RAID0 is used purely for making a large partition from smaller disks.

In RAID1 (mirroring) every 1 byte written requires 1 byte written to EACH of the 2 mirrored disks, for total disk IO of 2 bytes.

In RAID5, the most efficient solution, every 1 byte written requires LESS than 1 byte written for the CRC. Roughly (depending on implementation and number of disks) every 3 bytes written requires 4 bytes of disk IO.

RAID5 is the fastest from an algorithmic standpoint. There are some gotchas: RAID5 implemented by hardware is faster than RAID5 implemented by the OS, simply because the controller on the SCSI card acts like a parallel processor.

RAID5 also wastes the least amount of disk space.

What is cheapest is a relative thing. What is certain is that RAID5 requires more disks (at least 3) than mirroring (exactly 2), but RAID5 wastes less space, so the cost analysis begins with a big "it depends...".

Any disk system will choke under heavy load, especially if the disk write system is inefficient (like IBM's IDE interface). I think if you did a test, you would find RAID1 would choke more than RAID5, simply because RAID1 requires MORE disk IO for the same bytes being saved.

Referring to what Tom Lane said, he recommends 7-drive RAID5 for a very good reason: the more the drives, the faster the performance. Here's why: write 7 bytes on a 7-drive RAID5, and the first byte goes to drive 1, the 2nd byte to drive 2, etc., and the CRC to the final drive. For high performance SCSI systems, whose bus IO is faster than the drives (and most SCSI IO chains ARE faster than the drives they are attached to), the drives actually write in PARALLEL. I can give you a more detailed example, but suffice to say that with RAID5, writing 7 bytes to 7 data drives takes about the same time as writing 3 or 4 bytes to a single non-RAID drive. That, my friends, is why RAID5 (especially when done by hardware) actually improves performance.

Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com
I think your understanding of RAID 5 is wrong also.

For a general N-disk RAID 5 the process is:

1) Read sector
2) XOR with data to write
3) Read parity sector
4) XOR with result above
5) Write data
6) Write parity

So you can see, for every logical write, there are two reads and two writes.

For a 3-disk RAID 5 the process can be shortened:

1) Write data
2) Read other disk
3) XOR with data
4) Write to parity disk

So, two writes and one read.

JLL
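For what it's worth, the read-modify-write shortcut above works because of the XOR identity: new parity = old parity XOR old data XOR new data. That is why the controller only has to fetch the old copy of the one data block plus the old parity block, rather than reading every other data disk in the stripe, and it is exactly where the two reads and two writes per logical write come from.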
> If you are looking for the best performance, why go with a RAID5 as
> opposed to a RAID1+0 (mirrored stripes) solution? Understandably RAID5
> is a cheaper solution requiring fewer drives for redundancy but, from my
> experience, RAID5 chokes horribly under heavy disk writing. RAID5
> always requires at least two write operations for every block written;
> one to the data and one to the redundancy algorithm.
>
> Is this wrong?

Here's my take on it... If you have enough RAM to keep everything buffered/cached, and fsync() is turned off, then the speed of the disk subsystem becomes vastly less important - you'll only read the data once (first couple of queries), and then the disks will sit idle. The lights on the disks on my DB machine only flicker once per minute or less.

If that's the case, then I'd rather use RAID 5 with a hot-spare or two, to increase storage capacity over 1+0, and speed at reading as well. Of course, if you write tremendous amounts of data, and your data set is larger than you can cache/buffer, that all flies out the window.

steve
My simplification was intended; anyway, it still equates to the same thing, because in a performance machine (lots of memory) reads are (mostly) pulled from cache, not disk IO. So the real cost is disk writes, and 2 = 2.

Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com
This still troubles me.

terry@greatgulfhomes.com wrote:

> In striping (disk joining) every byte written requires 1 byte sent to 1
> disk. This gives you ZERO redundancy: RAID0 is used purely for making a
> large partition from smaller disks.

Not necessarily. RAID0 absolutely shines in 'concurrent' access to disk. When a hundred people want a hundred different records from disk, the chance becomes greater that any needle on any disk in the striped set will be found near the block where the record exists. Small gains for a small shop, but when multiplied by hundreds or thousands of concurrent users across many gigs of data, the benefits add up. Likewise, because a data block is written across many disks (depending on stripe size), the task can 'almost' be done concurrently by the controller.

> In RAID1 (mirroring) every 1 byte written requires 1 byte written to EACH
> of the 2 mirrored disks, for total disk IO of 2 bytes.

I agree, and to go further: on some (most?) RAID disk controllers, when reading a block of data, the first disk of the mirrored pair (given the location of the needle on the spindle) to locate the data will return the block to the controller. Likewise, when writing to the mirrored pair (and using 'write-through', never 'write-back'), the controller will pass along the 'data written' flag to the CPU when the first disk of the pair writes the data. The second will sync eventually, but the controller need not wait for both.

So in combining these technologies and utilizing RAID1+0, you gain all the benefits of striping and mirroring.

> In RAID5, the most efficient solution, every 1 byte written requires LESS
> than 1 byte written for the CRC. Roughly (depending on implementation and
> number of disks) every 3 bytes written requires 4 bytes of disk IO.
>
> RAID5 also wastes the least amount of disk space.

In addition to the additional writes, RAID5 still requires at least the computation of the CRC based on the amount of data written, even if it can concurrently write that data across many disks. You can achieve the same efficiency with RAID0, less the cost of calculating the CRC, and gain the integrity/performance by using RAID1.

> with RAID5 writing 7 bytes to 7 data drives takes about the same time as
> writing 3 or 4 bytes to a single non-RAID drive. That, my friends, is why
> RAID5 (especially when done by hardware) actually improves performance.

I fully agree with you that hardware RAID is the way to go. I would still place log files (or sequential transaction files) on a RAID1 because movement of the needle on the disk of the mirrored pair is minimal.
--
Kurt Gunderson
Senior Programmer, Applications Development, Lottery Group
Canadian Bank Note Company, Limited
kgunders@cbnlottery.com
http://www.cbnco.com/
On Thu, 30 May 2002 terry@greatgulfhomes.com wrote:

I agree with ALMOST everything you say, but have a few minor nits to pick. Nothing personal, just my own experience on RAID testing and such.

> In RAID5, the most efficient solution, every 1 byte written requires LESS
> than 1 byte written for the CRC.

This isn't true for any RAID 5 implementation I'm familiar with. The parity stripe is exactly the same size as the data stripes it shares space with. But this isn't a real important point, since most RAID arrays write 8k to 128k at a time. Note that it's not CRC (Cyclic Redundancy Check) that gets written, but straight XOR, hence no space savings.

> Roughly (depending on implementation and number of disks) every 3 bytes
> written requires 4 bytes of disk IO.
>
> RAID5 is the fastest from an algorithmic standpoint. There are some
> gotchas: RAID5 implemented by hardware is faster than RAID5 implemented
> by the OS, simply because the controller on the SCSI card acts like a
> parallel processor.

This is most definitely not always true, even given equal hardware specs (i.e. number and type of drives / interfaces all the same).

My old AMI MegaRAID card with 3 Ultra Wide SCSI ports can generate 64 Megs of parity data per second. My Celeron 1.1GHz machine can generate 2584 Megs of parity data per second. The load on the CPU under VERY heavy reads and writes is about 0.3% CPU, and the max throughput on reads on a RAID array of 4 VERY old (non-ultra, non-wide, 7200 RPM) 2 Gig drives is about 33 Megs a second read speed.

The same setup with 7200 RPM 4 Gig Ultra narrow drives on an AMI RAID can read at about 14 Megabytes a second.

The most important part of fast RAID is the drives first, the interface second, and hardware versus software RAID last. While it was often true in the dark past of 33 Megahertz CPUs and such that hardware RAID was always faster, it is often much better to spend the extra money a RAID controller would cost you on more drives or cheaper controllers or memory or CPUs.

Generally speaking, I've found RAID5 with 4 or fewer drives to be about even with RAID1, while RAID5 with 6 or more drives quickly starts to outrun a two-drive mirror set. This is especially true under heavy parallel access.

On a subject no one's mentioned yet: >2 drives in a RAID1 setup.

I've done some testing with >2 drives in a mirror (NOT 1+0 or 0+1, just RAID1 with >2 drives) under Linux, and found that if you are doing 90% reads then it's also a good solution, but for most real world database apps, it doesn't really help a lot.
What is RAID1 with > 2 drives???

Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com
On Thu, 30 May 2002 terry@greatgulfhomes.com wrote:

> What is RAID1 with > 2 drives???

Just that. A mirror set that writes all changes to all drives. It costs more for writes, but runs faster for reads, so it's not a good idea to use one in a heavily written environment, but it really flies for heavy read setups, like data mining or CRM stuff. Plus, a single drive failure is no big deal; you just replace it at your leisure, since you're still fully redundant.

This means you could do crazy things like build 9 disks into a RAID 0+1 by making three sets of RAID 1 with three drives each, and then putting all three together to make the RAID 0. Very good read speeds under heavy loads, and fairly good write speeds as well.

You could run a fairly large database on a setup of nine 36 Gig Ultra SCSI drives like this, get some nice throughput, and have it be VERY redundant. I.e. ANY two drives could fail and you'd still be up, and if the two drives that failed were in different RAID1s, you'd still be fully redundant. Crazy stuff.
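To put rough numbers on that layout: with three 3-way mirrors striped together, usable capacity is 3 x 36 = 108 Gigs out of 9 x 36 = 324 Gigs raw, and each individual mirror can lose two of its three drives before any data is at risk.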
Jean-Luc Lachance writes:

>> I think your understanding of RAID 5 is wrong also.
>>
>> For a general N-disk RAID 5 the process is:
>> 1) Read sector
>> 2) XOR with data to write
>> 3) Read parity sector
>> 4) XOR with result above
>> 5) Write data
>> 6) Write parity

Yes, generally. There are a couple of tricks you can do to help get around this, though.

One, which works very nicely when doing sequential writes, is to attempt to hold off on the write until you collect an entire stripe's worth of data. Then you can calculate the parity based on what's in memory, and write the new blocks across all of the disks without worrying about what was on them before. 3ware's Escalade IDE RAID controllers (the 3W-7x50 series, anyway) do this. Their explanation is at http://www.3ware.com/NewFaq/general_operating_and_troubleshooting.htm#R5_Fusion_Explained .

Another tactic is just to buffer entire stripes. Baydel does this with their disk arrays, which are actually RAID-3, not RAID-5. (Since they do only full-stripe reads and writes, it doesn't really make any difference which they use.) You want a fair amount of RAM in your controller for buffering in this case, but it keeps the computer's "read, modify, write" cycle on one block from turning into "read, read, modify, write".

Terry Fielder writes:

> My simplification was intended; anyway, it still equates to the same
> thing, because in a performance machine (lots of memory) reads are
> (mostly) pulled from cache, not disk IO. So the real cost is disk
> writes, and 2 = 2.

Well, it really depends on your workload. If you're selecting stuff almost completely randomly scattered about a large table (like the 25 GB one I'm dealing with right now), it's going to be a bit pricy to get hold of a machine with enough memory to cache that effectively.

Kurt Gunderson writes:

] Likewise, when writing to the mirrored pair (and using 'write-through',
] never 'write-back'), the controller will pass along the 'data written'
] flag to the CPU when the first disk of the pair writes the data. The
] second will sync eventually but the controller need not wait for both.

I hope not! I think that controller ought to wait for both to be written, because otherwise you can have this scenario:

1. Write of block X scheduled for drives A and B.
2. Block written to drive A. Still pending on drive B.
3. Controller returns "block committed to stable storage" to application.
4. Power failure. Pending write to drive B is never written.

Now, how do you know, when the system comes back up, that you have a good copy of the block on drive A, but not on drive B?

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
On Thu, 30 May 2002, Steve Wolfe wrote:

>> Keeping indexes on a separate drive is also traditional database advice,
>> but I don't have any feeling for how much it matters in Postgres.
>
> My gut feeling is that you'd be better off taking two drives and
> striping them RAID 0 instead of putting data on one and indexes on
> another. Either way, you have the same potential total throughput, but if
> they're in the RAID array, then you can use the total throughput more
> easily, and in more situations - you don't have to depend on reading two
> separate pieces of data simultaneously to utilize the entire throughput.

I'd go with that too. Looking at the stuff I'm playing with right now, which is heavily random-read oriented: when I get 5000 rows back out of a table, the system does about 5020 reads. Since it's highly unlikely that more than a very few of those rows would be in the same disk block, I expect that the system is only doing about 20-30 index block reads here, and the rest is on the table. Splitting this by index vs. table would put 99% of the burden of this query on the table's drive.

The log-file situation tends to be somewhat different, since you're also looking at different types of disk access. Log files just don't get randomly accessed, and so having only one disk arm available for them is not really a disadvantage. So it's usually fine on a single disk (or mirrored pair). Giving more disk arms to a random-access application (such as pulling blocks out of an (unsorted) table based on what you retrieved from an index) can provide a lot of help, though, so spreading it across many disks makes sense.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
On Thu, 30 May 2002, Doug Fields wrote:

> I currently use 1.4 gigs for "shared mem" in my database (out of 2G) - I
> couldn't get PostgreSQL to run with more than that (it might be an OS
> limit, Linux 2.2).

Does it just fail to allocate? There may be a kernel parameter you have to tweak.

I'm presuming that the idea with allocating lots of shared memory is so that postgres can buffer data blocks from the disk. However, since postgres uses the filesystem, the operating system will also buffer disk data, using whatever memory is free to do so. Thus, increasing the shared memory allocated to postgres will just reduce the amount of memory available to the OS to do block buffering. What is the advantage, if any, to having postgres do the buffering in its shared memory rather than letting the OS do it?

One disadvantage I can think of is that when a back end (or several back ends) allocates a lot of memory for sorting (assuming you let them do that), you might end up pushing the sort memory out to your swap disk, whereas if the OS is doing the buffer management, it can just buffer fewer file blocks while you're doing the sort, instead.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
On Thu, 30 May 2002, Doug Fields wrote:

> If I ran PostgreSQL with a Linux 2.4 kernel and 6GB of RAM with dual P4 Xeons:
> a) Could PostgreSQL use all the RAM?

Yes, but no single process could. Since you're on a 32-bit machine your maximum address space is 4 GB, and the Linux kernel keeps 1 GB of that for itself, so no process (i.e., no individual backend, in this case) can address more than 3 GB of RAM. I'm not sure how the Linux buffer caching works, but there is a possibility that this kind of thing could give you a very nice large buffer cache.

> c) Could PostgreSQL use this "hyperthreading" that apparently is in these
> P4 Xeons making dual processors look like four processors (obviously, on
> four separate queries)?

Does Linux come up saying there are 4 processors installed in your system? If so, yes. If not, no.

> d) How much extra performance does having the log or indices on a different
> disk buy you, esp. in the instance where you are inserting millions of
> records into a table? An indexed table?

When inserting, log on a separate disk buys you a lot. (You don't get the table and index writes moving the disk head away from the end of the log file.) Splitting index and table files may help somewhat, but that's harder to determine. You always want to throw at them as many disk arms as you can, but it's usually the safer bet to combine it all on one big stripe set or RAID or whatever, unless you have a very consistent workload and are willing to play around a lot to determine how you can best optimize this stuff.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
On Thu, 30 May 2002, Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > I'm noticing that performance in creating a particular index, and > > also a little bit in a simple query, seems somewhat non-linear, > > Btree index build is primarily a sort, so cannot have better than > O(n*log(n)) performance for random data. Not sure why you'd expect > linearity. Not exactly "expect." More "wish," really. > Increasing SORT_MEM would help the constant factor, however... Thanks for the tip. I tried that, though not on the monster 500m row table yet. On the 100m row table, I get some interesting results. First I confirmed that with the default sortmem = 512, I get the results I previously got, about an hour: ! system usage stats: ! 3715.756060 elapsed 1975.705112 user 218.771892 system sec ! [1975.717191 user 218.771892 sys total] ! 29/535 [29/538] filesystem blocks in/out ! 124/1949 [124/2281] page faults/reclaims, 0 [0] swaps ! 0 [0] signals rcvd, 0/0 [0/5] messages rcvd/sent ! 98271/335134 [98271/335139] voluntary/involuntary context switches ! postgres usage stats: ! Shared blocks: 636965 read, 9 written, buffer hit rate = 0.05% ! Local blocks: 218962 read, 218962 written, buffer hit rate = 0.00% ! Direct blocks: 0 read, 0 written Next I tried to bump the sortmem up to 256 MB, half the physical RAM of the machine. I found out the hard way that the back-end doing the indexing will grow to something over three times the size of sortmem, and proceeded (slowly) to extricate myself from swap hell. Next try, 128 MB. This time the process was about 400 MB in size (350 MB resident), and it did indeed shave more than 20% off my sort time, clocking in at about 48 1/2 minutes. ! system usage stats: ! 2897.031560 elapsed 2396.215835 user 168.270137 system sec ! [2396.215835 user 168.285445 sys total] ! 2/678 [2/680] filesystem blocks in/out ! 165926/91843 [165926/92159] page faults/reclaims, 0 [0] swaps ! 0 [0] signals rcvd, 0/0 [0/5] messages rcvd/sent ! 166718/225004 [166750/225010] voluntary/involuntary context switches ! postgres usage stats: ! Shared blocks: 636979 read, 9 written, buffer hit rate = 0.05% ! Local blocks: 218962 read, 218962 written, buffer hit rate = 0.00% ! Direct blocks: 0 read, 0 written But I decided I couldn't spare that much memory for a database doing real work, too, and backed down to 32 MB. That got the process down to 115 MB (100 MB resident), but oddly enough sped it up even more, shaving the sort time by more than 35% from the original. (The new time was about 40 minutes.) DEBUG: QUERY STATISTICS ! system usage stats: ! 2386.940418 elapsed 2155.223379 user 178.008478 system sec ! [2155.233638 user 178.008478 sys total] ! 0/592 [0/594] filesystem blocks in/out ! 100/25113 [100/25399] page faults/reclaims, 0 [0] swaps ! 0 [0] signals rcvd, 0/0 [0/5] messages rcvd/sent ! 2376/274787 [2376/274793] voluntary/involuntary context switches ! postgres usage stats: ! Shared blocks: 636969 read, 10 written, buffer hit rate = 0.05% ! Local blocks: 218962 read, 218962 written, buffer hit rate = 0.00% ! Direct blocks: 0 read, 0 written So what's up with that, do you think? Anyway, over the weekend I might have time to try rebuilding an index on the 500m row table a couple of times with different sortmem sizes to see how well that works. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
On Thu, 30 May 2002, Tom Lane wrote: > > Queries using that index seem to do this too, though not quite as > > badly. Using a very simple query such as "SELECT COUNT(*) FROM > > table WHERE value = 12345" (where value is the last INT column > > above that took ages to index), typical query times (including > > connection overhead) for data not in the cache are 0.6 sec., 11 > > sec. and 72 sec. > > I guess that the smaller datasets would get proportionally more benefit > from kernel disk caching. Actually, I re-did the 100m row and 500m row queries from a cold start of the machine, and I still get the same results: 10 sec. vs 70 sec. (Thus, 7x as long to query only 5x as much data.) So I don't think caching is an issue here. > Can you demonstrate that it actually did 7000 reads, and not 5000+? > That extrapolation technique doesn't look to me like it has the > accuracy to tell the difference. You might try setting show_query_stats > (note the results go into the postmaster log, not to the client; > perhaps we ought to change that someday). Ah, I should have looked up and used show_query_stats in the first place. Thanks for the tip. (BTW, yes, it would be really, really cool if we could turn this on and off on a per-client basis, and have the results sent to the client. As it stands, this facility is rather difficult to use to try to debug things on a production system.) You're right, it is doing around 5000 reads, not 7000. The results for all queries are very consistent: it does just slightly (5-20) more reads than the number of rows returned, which is exactly what one would expect. So it seems I somehow go from about 100 reads per second to 70 reads per second. Some of that would no doubt be due to having five times as much area over which to seek. I guess I'd have to play with some random-read benchmarks to see if that's a reasonable amount of performance degradation for this disk alone to be responsible for. > Also, if you've updated the table at all, there might be some fetches of > dead tuples involved. Hm. I did do a few updates, but not to any of the tuples I'm searching on. > More RAM, perhaps. Given the size of the table (25 GB for the table alone; about 60 GB with all of the indices), I don't think more RAM is going to help all that much. I'm not expecting to get a lot of repeated queries. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson <cjs@cynic.net> writes: > What is the advantage, if any, to having postgres do the buffering > in its shared memory rather than letting the OS do it? Not much, if any. I don't believe in making shared_buffers be more than (at most) 25% of physical RAM. In most cases it's just as effective to keep it smaller. I would recommend bumping up the default though ;-). Something in the low thousands (of buffers) is probably a realistic minimum. regards, tom lane
Curt Sampson <cjs@cynic.net> writes: > Next I tried to bump the sortmem up to 256 MB, half the physical RAM > of the machine. I found out the hard way that the back-end doing the > indexing will grow to something over three times the size of sortmem, > and proceeded (slowly) to extricate myself from swap hell. [ scratches head ... ] The sort code tries very hard to keep track of how much it's storing. I wonder why it's so far off? Perhaps failing to allow for palloc overhead? Remind me again of the exact datatype you were indexing, please. > But I decided I couldn't spare that much memory for a database doing > real work, too, and backed down to 32 MB. That got the process down to > 115 MB (100 MB resident), but oddly enough sped it up even more, shaving > the sort time by more than 35% from the original. Looks like the 128Mb case was swapping a lot (note the page faults count). That probably explains the difference. regards, tom lane
> > I currently use 1.4 gigs for "shared mem" in my database (out of 2G) - I > > couldn't get PostgreSQL to run with more than that (it might be an OS > > limit, Linux 2.2). > >Does it just fail to allocate? There may be a kernel parameter you >have to tweak. I'm simply not sure, since I was trying to allocate 1.5gb for it, and modified the shm-max parameter in the /proc filesystem accordingly. However, since I got it to work for about 1.3 GB as it turns out (numeric value 170000), that seemed enough for the moment and I stopped investigating. >What is the advantage, if any, to having postgres do the buffering >in its shared memory rather than letting the OS do it? I monitor sar and vmstat regularly and have noticed that PostgreSQL performs nicely with these settings, and no swap usage occurs. I have this unfounded assumption that PostgreSQL will keep better track of its shared memory than the typical LRU disk-block-caching strategy the OS uses. E.g., it seems obvious to me that for tables which can't fit entirely in memory, it makes more sense to keep the indices in shared memory than the table itself, and so when it needs to rearrange its memory, it would drop the data tables first. I am no DB programmer (leave that to Tom, he seems to do a good job :) but since it is a dedicated DB machine, it has been fine so far. >One disadvantage I can think of is that when a back end (or several >back ends) allocates a lot of memory for sorting (assuming you let >them do that), you might end up pushing the sort memory out to your >swap disk, whereas if the OS is doing the buffer management, it >can just buffer fewer file blocks while you're doing the sort, >instead. Yes, this has been noted several times. So far, however, I have not seen any swap usage. I would be more likely to decrease sort mem first, and would prefer it if I could do all my queries without sorts in the first place! (wishful thinking) Cheers, Doug
Curt Sampson <cjs@cynic.net> writes: > On Thu, 30 May 2002, Tom Lane wrote: >> I guess that the smaller datasets would get proportionally more benefit >> from kernel disk caching. > Actually, I re-did the 100m row and 500m row queries from a cold > start of the machine, and I still get the same results: 10 sec. vs > 70 sec. (Thus, 7x as long to query only 5x as much data.) So I > don't think caching is an issue here. But even from a cold start, there would be cache effects within the query, viz. fetching the same table block more than once when it is referenced from different places in the index. On the smaller table, the block is more likely to still be in kernel cache when it is next wanted. On a pure random-chance basis, you'd not expect that fetching 5k rows out of 100m would hit the same table block twice --- but I'm wondering if the data was somewhat clustered. Do the system usage stats on your machine reflect the difference between physical reads and reads satisfied from kernel buffer cache? Or maybe your idea about extra seek time is correct. regards, tom lane
On Fri, 31 May 2002, Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > Next I tried to bump the sortmem up to 256 MB, half the physical RAM > > of the machine. I found out the hard way that the back-end doing the > > indexing will grow to something over three times the size of sortmem, > > and proceeded (slowly) to extricate myself from swap hell. > > [ scratches head ... ] The sort code tries very hard to keep track > of how much it's storing. I wonder why it's so far off? Perhaps > failing to allow for palloc overhead? Remind me again of the exact > datatype you were indexing, please. It's just an int. The exact table definition is (int, date, int, int). The last two are both random ints ranging from 0-99999, and are the slow ones to index. The working set of the backend appears to be most of what it's allocating, too. I'd assumed that it might be allocating two or three sorts' worth of memory, perhaps to merge the sort files. > Looks like the 128Mb case was swapping a lot (note the page faults > count). That probably explains the difference. Oh, duh. I should read this stuff a bit more closely. But I wonder why? The working set was well under the RAM size of the machine. Perhaps it wasn't touching all the pages in its working set often enough, and the I/O was driving the backend's pages out of memory from time to time. It's a known problem with NetBSD under certain circumstances. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
On Fri, 31 May 2002, Tom Lane wrote: > On a pure random-chance basis, you'd not expect that fetching 5k rows > out of 100m would hit the same table block twice --- but I'm wondering > if the data was somewhat clustered. I don't think so. I generated the data myself, and the data are entirely pseudo-random. (Unless Perl's PRNG is quite screwed, which doesn't seem likely.) > Do the system usage stats on your machine reflect the difference > between physical reads and reads satisfied from kernel buffer cache? Well, not that I've been looking at yet, but it would definitely be cool if I could figure out a way to do this. > Or maybe your idea about extra seek time is correct. If I can get a spare couple of hours, I'll cons up a little benchmark program to help determine seek time on disks and various parts of disks, and play a bit. But on the other hand, it's not like I can do all that much about this sort of problem anyway, and I'm kind of doubting that the fault here lies with postgres. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
Peter A. Daly wrote: > > > > > >>What is the most amount of SORT_MEM it makes sense to allocate? > >> > > > >I've never experimented with it, but certainly the standard default > >(512K = 0.5M) is pretty small for modern machines. In a 2G machine > >I might try settings around 100M-500M to see what works best. (Note > >this is just for a one-off btree creation --- for ordinary queries you > >need to allow for multiple sorts going on in parallel, which is one > >reason the default sort_mem is not very large.) > > > I will run some benchmarks and let the list know the results. If this > can speed it up a large amount, I can get another 6 gig of RAM into this > machine which I hope can let me leave the SORT_MEM at a high enough > amount to speed up our huge nightly batch database reload. > > It's a Dual Xeon 500Mhz Machine. I am curious how you are going to address >4 gig of RAM on a 32-bit system, especially if you want to address it all from the same process. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote: > >I am curious how you are going to address >4 gig of RAM on a 32-bit >system, especially if you want to address it all from the same process. > Frankly, I don't know. I have been told the machine can hold 8 gig of RAM. I have heard rumors about how Linux plays some games with how it allocates memory to use over 4 gigs of memory, but I have no direct knowledge of the matter myself. Something along the lines of a single process being able to use up to 4 gig of memory. Not sure on the specifics. Looks like we will not be upgrading the database server for at least another 60 days (the postgres one at least), and the dual Xeon (Quad soon?) may have another purpose. We have a 1.5 TB file server/image rendering machine that is out of space for expansion. The Xeon has enough PCI slots to let us add more RAID arrays for a lot longer, whereas the current server is out of slots. This being the case, I have stopped trying to milk speed out of postgres on that machine. On the positive side of this, I may get a higher-end Quad Xeon after July for the next Postgres box. We have our eyes on Quad 700 MHz Xeon, 8 gig RAM boxes. -Pete
I have my Dual P4-Xeon 2.4 GHz, 8 GB RAM, Adaptec SCSI RAID 5 PostgreSQL server coming in on Friday, from my vendor POGO Linux (highly recommended). I expect most of the 8 GB to be used as buffer cache directly by the operating system. I will let you know how it compares to the dual P3-1 GHz, 2 GB, IDE RAID 1 server we use now. The server cost just over half the price of a DB2 enterprise single-CPU license. Thank god for Postgres! My biggest fear is that the added CPU speed will be sub-linear, and that the real speed improvement would come if PG would multi-thread individual queries. Cheers, Doug At 11:42 PM 6/7/2002, Peter A. Daly wrote: >Bruce Momjian wrote: > >> >>I am curious how you are going to address >4 gig of RAM on a 32-bit >>system, especially if you want to address it all from the same process. >Frankly, I don't know. I have been told the machine can hold 8 gig of >RAM. I have heard rumors about how Linux plays some games with how it >allocates memory to use over 4 gigs of memory, but I have no direct >knowledge of the matter myself. Something along the lines of a single >process being able to use up to 4 gig of memory. Not sure on the >specifics. Looks like we will not be upgrading the database server for at >least another 60 days (the postgres one at least), and the dual Xeon (Quad >soon?) may have another purpose. We have a 1.5 TB file server/image >rendering machine that is out of space for expansion. The Xeon has enough >PCI slots to let us add more RAID arrays for a lot longer, whereas the >current server is out of slots. This being the case, I have stopped >trying to milk speed out of postgres on that machine. > >On the positive side of this, I may get a higher-end Quad Xeon after July >for the next Postgres box. We have our eyes on Quad 700 MHz Xeon, 8 gig RAM >boxes. > >-Pete
On Sat, 08 Jun 2002 00:28:18 -0400 "Doug Fields" <dfields-pg-general@pexicom.com> wrote: > My biggest fear is that the added CPU speed will be sub-linear, and that > the real speed improvement would come if PG would multi-thread individual > queries. I'll confess that I can't understand the need for this feature (even if one ignores the complexity of implementing it). On most heavily loaded database servers, isn't the number of concurrent queries far greater than the number of processors? In which case, what is the point of allowing a single query to be executed by multiple CPUs concurrently? If your queries are CPU-bound, the other CPUs are going to be busy executing other queries at the same time (not to mention that the cost of migrating processes from one CPU to another and back again would not help). I can see the benefit when you're executing a small number of enormously expensive queries (and those queries are CPU-bound); but that situation is quite rare, IME. Cheers, Neil -- Neil Conway <neilconway@rogers.com> PGP Key ID: DB3C29FC
>I can see the benefit when you're executing a small number of >enormously expensive queries (and those queries are CPU-bound); >but that situation is quite rare, IME. Bingo. I have some self-joins (O(N^2) at worst) on VARCHAR fields which, if you have tables with millions of rows, sometimes take 900 seconds or more of full CPU - and I have a need to expand from 7-digit to 9-digit row counts, which could mean 10,000 times slower. I can speed that up by going from a P3-1 GHz to a P4-2.4 GHz, but not by much. The best way to solve the above problem is to figure out a more efficient query, of course, or avoid the need for the query in the first place. Sometimes that's easier said than done. Also, the relatively slow speed for inserts with VARCHAR indices would conceivably be helped; I often insert data into a temp table and then INSERT INTO SELECT *... from the temp to the other one to get reasonable performance. I'd love to be able to turn off MVCC for those kinds of things; that is, instead of seeing one universal insert of 3 million rows, I wouldn't mind seeing them dribble in one at a time, thereby saving the overhead of MVCCing them. Cheers, Doug
Curt Sampson <cjs@cynic.net> writes: > On Fri, 31 May 2002, Tom Lane wrote: >> Curt Sampson <cjs@cynic.net> writes: > Next I tried to bump the sortmem up to 256 MB, half the physical RAM > of the machine. I found out the hard way that the back-end doing the > indexing will grow to something over three times the size of sortmem, > and proceeded (slowly) to extricate myself from swap hell. >> >> [ scratches head ... ] The sort code tries very hard to keep track >> of how much it's storing. I wonder why it's so far off? Perhaps >> failing to allow for palloc overhead? Remind me again of the exact >> datatype you were indexing, please. > It's just an int. Hm. The actual size of an IndexTuple with an integer datum would be 12 bytes, which is how the tuplesort.c code would account for it. But the palloc allocation to hold it would be 16 data bytes plus the AllocChunkData header overhead, which could be 8, 12, or 16 bytes depending on whether you have MEMORY_CONTEXT_CHECKING enabled (did you configure --enable-cassert?) and on whether MAXALIGN is 4 or 8 bytes on your hardware. So the palloc overhead could indeed amount to a factor of nearly 3x. While looking into this, I came across the following commentary in tuplesort.c: * NOTES about memory consumption calculations: * * We count space requested for tuples against the SortMem limit. * Fixed-size space (primarily the LogicalTapeSet I/O buffers) is not * counted, nor do we count the variable-size memtuples and memtupindex * arrays. (Even though those could grow pretty large, they should be * small compared to the tuples proper, so this is not unreasonable.) * * The major deficiency in this approach is that it ignores palloc overhead. * The memory space actually allocated for a palloc chunk is always more * than the request size, and could be considerably more (as much as 2X * larger, in the current aset.c implementation). So the space used could * be considerably more than SortMem says. * * One way to fix this is to add a memory management function that, given * a pointer to a palloc'd chunk, returns the actual space consumed by the * chunk. This would be very easy in the current aset.c module, but I'm * hesitant to do it because it might be unpleasant to support in future * implementations of memory management. (For example, a direct * implementation of palloc as malloc could not support such a function * portably.) * * A cruder answer is just to apply a fudge factor, say by initializing * availMem to only three-quarters of what SortMem indicates. This is * probably the right answer if anyone complains that SortMem is not being * obeyed very faithfully. This note, and the code too, was written some years ago by yours truly; apparently you're the first to complain about the inaccuracy. I have changed my mind since then: I now think that adding an actual-space inquiry function to the memory management API would be a reasonable thing to do. We're depending on enough other stuff that isn't provided by plain malloc() that one more addition won't make any difference. Also, the two variable-size arrays mentioned above would add another 8 bytes per tuple, which really isn't negligible compared to 12-byte tuples. So that might account for the rest of your 3x growth observation. Probably tuplesort needs to include those in its space calculation. If we did make these changes then I'd be inclined to tweak the default SortMem up from 512K to say 1024K; otherwise we'd effectively be reducing the default sort size because of the extra space being charged for. 
Comments? BTW, I'm not committing to get this done for 7.3... regards, tom lane
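To put rough numbers on that (using the figures above, and assuming the smallest case: 4-byte MAXALIGN and no MEMORY_CONTEXT_CHECKING): tuplesort charges 12 bytes per integer index tuple against SortMem, but the palloc chunk that actually holds it occupies 16 data bytes plus an 8-byte chunk header, i.e. 24 bytes, and the memtuples/memtupindex array entries add roughly another 8 bytes per tuple. That is about 32 bytes of real memory for every 12 bytes accounted for - roughly 2.7x, which lines up with the roughly 3x process growth reported earlier in the thread.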
On Sun, 9 Jun 2002, Tom Lane wrote: > ...depending on whether you have MEMORY_CONTEXT_CHECKING enabled > (did you configure --enable-cassert?) No, I didn't enable that. > and on whether MAXALIGN is 4 or 8 bytes on your hardware. i386, so probably 4 bytes. > So the palloc overhead could indeed amount to a factor of nearly 3x. That would seem about right. Possibly it's even more in some way; even with sortmem set to 32 MB, my backend grows to about 115 MB, 105 MB resident. Normally it's around 3 MB. > If we did make these changes then I'd be inclined to tweak the default > SortMem up from 512K to say 1024K; otherwise we'd effectively be > reducing the default sort size because of the extra space being charged > for. I'd be inclined to bump it up to 2-4 MB, actually; machines tend to be large enough these days that grabbing an extra MB or two when you need it is not a problem at all. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
On Fri, 7 Jun 2002, Bruce Momjian wrote: > I am curious how you are going to address >4 gig of RAM on a 32-bit > system, especially if you want to address it all from the same process. Bank switching. :-) But to be a bit more precise, physical addresses are more than 32 bits. This is not a problem for the kernel; when it maps a page into an address space, it just hands over a 36-bit (or whatever) address. It's always mapping and unmapping stuff all over the place anyway, so life is not much different. For user processes, however, there's a bit of a problem. A single process can only address 4 GB of RAM at any particular moment, and in fact it's less under Linux (2 GB or 3 GB, depending on how you built your kernel) because the kernel is using up address space at the same time. So the OS would have to provide special system calls for, essentially, bank switching memory in the process's address space, and the specific application (postgres in this case) would have to know how to use them. But, in fact, this facility already exists, in a sort of kludgy way. You'd want to try it and see if it actually works under any specific OS, of course. You ought to be able to create, say, four or five 1 GB SysV shared memory segments, and map them individually in and out of your address space using the shmat system call. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
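To make the shmat idea concrete, here is a minimal C sketch of the bank-switching scheme described above (untested; the four-segments-of-1-GB layout is only illustrative, and whether the kernel will actually grant segments that large depends on its SHMMAX/SHMALL settings):

    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NBANKS   4
    #define BANKSIZE (1024UL * 1024 * 1024)   /* 1 GB per segment */

    int
    main(void)
    {
        int   id[NBANKS];
        void *bank;
        int   i;

        /* Create the segments once; together they occupy more physical RAM
         * than a single 32-bit process could address at one time. */
        for (i = 0; i < NBANKS; i++)
        {
            id[i] = shmget(IPC_PRIVATE, BANKSIZE, IPC_CREAT | 0600);
            if (id[i] < 0)
            {
                perror("shmget");
                exit(1);
            }
        }

        /* "Bank switch": map one segment, work on it, unmap it, map the
         * next.  Only one bank occupies the address space at any moment. */
        for (i = 0; i < NBANKS; i++)
        {
            bank = shmat(id[i], NULL, 0);
            if (bank == (void *) -1)
            {
                perror("shmat");
                exit(1);
            }
            /* ... touch the data in this bank here ... */
            shmdt(bank);
        }

        /* Remove the segments when done; IPC_PRIVATE segments otherwise
         * persist until explicitly removed. */
        for (i = 0; i < NBANKS; i++)
            shmctl(id[i], IPC_RMID, NULL);
        return 0;
    }

Each shmat()/shmdt() pair costs a system call plus page-table work, so this only pays off when the amount of work done per mapping is large enough to amortize that cost.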
On Sat, Jun 08, 2002 at 08:36:50PM -0400, Doug Fields wrote: > Also, the relatively slow speed for inserts with VARCHAR indices would > conceivably be helped; I often insert data into a temp table and then > INSERT INTO SELECT *... from the temp to the other one to get reasonable > performance. I'd love to be able to turn off MVCC for those kinds of > things; that is, instead of seeing one universal insert of 3 million rows, > I wouldn't mind seeing them dribble in one at a time, thereby saving the > overhead of MVCCing them. Interesting. I have many indexes on varchar fields and have no such issues. Of course, inserting them all within a single transaction does speed it up a lot. But when inserting rows, all MVCC adds is a few more bytes to each row; there is no speed overhead. I'd be interested to know how MVCC would be slowing anything down. -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > There are 10 kinds of people in the world, those that can do binary > arithmetic and those that can't.
At 08:11 AM 6/10/2002, Martijn van Oosterhout wrote: >there is no speed overhead. I'd be interested to know how MVCC would be >slowing anything down. I'd be interested in being able to turn it off and see if it speeds anything up. :) I didn't mean to imply that I _know_ that it slows things down. What I do know is that I _assume_ it does, and would like to be able to selectively disable it - in the hope of gaining speed at certain points. I figure the biggest speed gain will be because PostgreSQL can then skip writing anything to the transaction log. But then again, I also do not _know_ that. Cheers, Doug
Doug Fields wrote: > At 08:11 AM 6/10/2002, Martijn van Oosterhout wrote: > >there is no speed overhead. I'd be interested to know how MVCC would be > >slowing anything down. > > I'd be interested in being able to turn it off and see if it speeds > anything up. :) > > I didn't mean to imply that I _know_ that it slows things down. What I do > know is that I _assume_ it does, and would like to be able to selectively > disable it - in the hope of gaining speed at certain points. I figure the > biggest speed gain will be because PostgreSQL can then skip writing > anything to the transaction log. But then again, I also do not _know_ that. MVCC is an integral part of how PostgreSQL's no-overwriting storage manager works. It can't be turned off, and if it could, it wouldn't provide _any_ speed increase. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Just going back through old mail, I notice this. So I'm not the only one with this opinion. I've seen, at least twice in the last week or so, people make the mistake of devoting about half their memory to postgres shared memory buffers (the worst thing you can do!). Would someone care to go around and find all the spots that talk about this and update them to have more reasonable advice? cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC On Fri, 31 May 2002, Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > What is the advantage, if any, to having postgres do the buffering > > in its shared memory rather than letting the OS do it? > > Not much, if any. I don't believe in making shared_buffers be more than > (at most) 25% of physical RAM. In most cases it's just as effective to > keep it smaller. I would recommend bumping up the default though ;-). > Something in the low thousands (of buffers) is probably a realistic > minimum. > > regards, tom lane >
Curt Sampson wrote: > > Just going back through old mail, I notice this. So I'm not the only one > with this opinion. I've seen, at least twice in the last week or so, > people make the mistake of devoting about half their memory to postgres > shared memory buffers (the worst thing you can do!). Would someone care > to go around and find all the spots that talk about this and update them > to have more reasonable advice? I advise in my hardware tuning guide that it be increased until paging starts, then reduce it, and start around 25%. Is that OK? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
>I advise in my hardware tuning guide that it be increased until paging >starts, then reduce it, and start around 25%. Is that OK? This is fine up to a point. My PostgreSQL server has 8 GB RAM (running under 2.4.18). I can't allocate 2 GB of shared memory because that basically takes up all the address space of any individual process, since a process on i386 can only address 2 or 3 gigs (the kernel reserves the rest of the 4 GB address space for itself). So, I've set it to 256 MB of shared memory, and I'll increase it from there as necessary. The rest, I hope, will be used by the operating system to keep pages in cache, which it will hopefully be able to move into the shared memory relatively quickly. Cheers, Doug
On Wed, 26 Jun 2002, Bruce Momjian wrote: > I advise in my hardware tuning guide that it be increased until paging > starts, then reduce it, and start around 25%. Is that OK? I don't think that's optimal. The general gist is that if the OS has as much free memory as postgres has shared memory devoted to buffering, it's going to be buffering pretty much all the data that postgres is. So every block that postgres holds is using 8K of memory that the OS could use to cache a different block. Now it's a bit more subtle than this in that copying a buffer between postgres's shared memory and OS memory also has a cost, and we should try to minimize that. But still, that's a lot cheaper than moving a buffer between memory and disk. This sort of thing is one of the main reasons I like the idea of moving to mmap for data files; it gets rid of both the double buffering and the copying back-and-forth. (And even if you do end up copying, you still save the kernel/userland transition, which is also a little bit expensive.) cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
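For what it's worth, here is a tiny C illustration of the read()-versus-mmap() difference being described (a sketch only, not PostgreSQL code; the file name and the 8 KB block size are made up):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    int
    main(void)
    {
        int          fd = open("datafile", O_RDONLY);
        char         private_buf[BLCKSZ];
        char        *page;
        struct stat  st;

        if (fd < 0 || fstat(fd, &st) < 0 || st.st_size < BLCKSZ)
        {
            fprintf(stderr, "need a readable \"datafile\" of at least %d bytes\n", BLCKSZ);
            exit(1);
        }

        /* read(): the kernel keeps its copy of the block in the buffer
         * cache *and* copies it into our private buffer -- the data ends
         * up in RAM twice, and we pay a memory copy plus a kernel/userland
         * transition for every read. */
        if (read(fd, private_buf, BLCKSZ) != BLCKSZ)
        {
            perror("read");
            exit(1);
        }

        /* mmap(): the process page table just points at the kernel's own
         * page, so there is one copy in RAM and no copying on access. */
        page = mmap(NULL, BLCKSZ, PROT_READ, MAP_SHARED, fd, 0);
        if (page == MAP_FAILED)
        {
            perror("mmap");
            exit(1);
        }

        printf("first byte via read(): %d, via mmap(): %d\n",
               private_buf[0], page[0]);

        munmap(page, BLCKSZ);
        close(fd);
        return 0;
    }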
Curt Sampson wrote: > > On Wed, 26 Jun 2002, Bruce Momjian wrote: > > > I advise in my hardware tuning guide that it be increased until paging > > starts, then reduce it, and start around 25%. Is that OK? > > I don't think that's optimal. The general gist is that if the OS > has as much [...] All right, all right. So far we've seen a couple of responses from you, all rejecting someone else's advice as not being optimal or even being worse. Can you please enlighten us what the optimum is? And please don't escape into the mmap solution if you cannot give that answer in diff format. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On Thu, 27 Jun 2002, Jan Wieck wrote: > Can you please enlighten us what the optimum is? And please don't escape > into the mmap solution if you cannot give that answer in diff format. Well, first of all, let me note that this is all theoretical work on my part so far. If someone is actually doing some real testing of this, I'd be interested in the results. If the results are different from what I think they should be, something strange is going on. But you can look back through my posts if you like and tell me if you disagree with my analysis of why increasing shared memory buffers is generally going to reduce performance. The optimum will depend on the number of connections you have and the type of workload you have. At this point, I'm not even sure about how to go about determining precisely what would be better and worse; it would be a lot of work. (Probably a lot more than it's worth, given the prices of memory these days.) I'd say, at a rough estimate, go for a number of buffers 2-3 times the maximum number of connections you allow. Or less if you anticipate rarely ever having that many connections. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
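For a sense of scale (the connection count here is made up purely for illustration): with 8 KB blocks, 100 allowed connections at two to three buffers apiece is only 200-300 buffers, i.e. roughly 2 MB of shared buffer space - a tiny amount next to the hundreds of megabytes some of the configurations mentioned earlier in the thread devote to shared buffers. That is the point of the recommendation: keep postgres's own pool small and leave the rest of RAM for the kernel's cache to manage.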
Curt Sampson wrote: > > On Thu, 27 Jun 2002, Jan Wieck wrote: > > > Can you please enlighten us what the optimum is? And please don't escape > > into the mmap solution if you cannot give that answer in diff format. > > Well, first of all, let me note that this is all theoretical work on my > part so far. If someone is actually doing some real testing of this, I'd > be interested in the results. If the results are different from what I > think they should be, something strange is going on. Since none of us actually has done real benchmarks in this area, we are all just debating out of the blue. So please don't take that personally, okay? > The optimum will depend on the number of connections you have and the > type of workload you have. At this point, I'm not even sure about how to > go about determining precisely what would be better and worse; it would > be a lot of work. (Probably a lot more than it's worth, given the prices > of memory these days.) Sure, the optimum will depend on the application and its usage profile. But that's fine tuning, not a rough rule of thumb for general purposes, and I think we were looking for the latter. > I'd say, at a rough estimate, go for a number of buffers 2-3 times the > maximum number of connections you allow. Or less if you anticipate > rarely ever having that many connections. Here I disagree. The more shared buffer cache you have, the bigger the percentage of your database that neither causes read()'s nor memory copying from the OS buffer cache. Whatever, we can debate that forever without some numbers. Let's continue that discussion when we have at least some benchmarking results. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On Thu, Jun 27, 2002 at 11:27:33AM -0400, Jan Wieck wrote: > Here I disagree. The more shared buffer cache you have, the bigger the > percentage of your database that neither causes read()'s nor memory > copying from the OS buffer cache. For what it's worth, truss on Solaris 7 indicates that copying from OS buffer costs a few milliseconds (between 2 and 8 in our tests, depending on load), at least in our tests. A -- ---- Andrew Sullivan 87 Mowat Avenue Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M6K 3E3 +1 416 646 3304 x110
I guess I'll add my 2 cents here... I run a Linux/Postgres server with 1 GB of RAM, and give postgres 48049 shared buffers. This lets the backends top out around 400MB each when they really get going. The effect I've seen is that more buffers don't hurt. However, adding more buffers stops making a significant difference at some low number that is a small fraction of what I use (and would obviously depend heavily on usage). Now assuming that the server is dedicated to postgres (which mine is), I can't see any reason not to give almost all available physical memory to postgres. The whole point of the OS disk caching is for *shared* caching, so if only one application is using the majority of the disk cache, what does it hurt to just let that application have it all as private memory? If there were no cost associated with copying pages from OS cache to user land, then I'd say it makes more sense to give almost no memory to the postgres cache, but obviously there is a cost; the more often postgres has to ask the OS for a page, the worse performance will be. The more memory you give to postgres, the more often asking the OS for a page will result in a disk read. It seems optimal to me to *never* ask the OS for a page unless it has to be pulled from the disk. Obviously a lot of people disagree with this... I'd like to understand why this approach is considered incorrect when postgres rather owns the machine? Glen Parker glenebob@nwlink.com > > Curt Sampson wrote: > > > > Just going back through old mail, I notice this. So I'm not > the only one > > with this opinion. I've seen, at least twice in the last week or so, > > people make the mistake of devoting about half their memory > to postgres > > shared memory buffers (the worst thing you can do!). Would > someone care > > to go around and find all the spots that talk about this > and update them > > to have more reasonable advice? > > I advise in my hardware tuning guide that it be increased until paging > starts, then reduce it, and start around 25%. Is that OK? > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 853-3000 > + If your life is a hard drive, | 830 Blythe Avenue > + Christ can be your backup. | Drexel Hill, > Pennsylvania 19026
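For reference, 48049 buffers at 8 KB each works out to roughly 375 MB, a bit over a third of the machine's 1 GB - which presumably is also why the backends appear to top out around 400 MB: most of that figure is the shared segment mapped into every backend rather than per-process private memory.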
On Thu, Jun 27, 2002 at 01:44:56PM -0700, Glen Parker wrote: > Obviously a lot of people disagree with this... I'd like to understand > why this approach is considered incorrect when postgres rather owns the > machine? If you starve other things on the machine for memory, you'll cause swapping. Consider all the other things you're doing on the machine -- just little things, like cron and such. All that takes memory. Therefore, it's dangerous not to let the OS manage a good chunk of memory. There also appears to be a diminishing returns problem: at a certain point, you're unlikely to need more shared space, and if you do something else on the machine that could use the memory, you're throwing it away. But I don't see that you're wrong in principle. Just don't get it wrong -- I _have_ caused a 16 gig machine to swap. It's not fun. A -- ---- Andrew Sullivan 87 Mowat Avenue Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M6K 3E3 +1 416 646 3304 x110
On Thu, 27 Jun 2002, Jan Wieck wrote: > Since none of us actually has done real benchmarks in this area, we are > all just debating out of the blue. So please don't take that personally, > okay? Well, I have a pretty good knowledge of how Unix operating systems work internally (I've been a NetBSD developer for about six years now), so it's not just out of the blue. However, I will always bow to hard data. > Sure, the optimum will depend on the application and its usage profile. > But that's fine tuning, not a rough rule of thumb for general purposes, > and I think we were looking for the latter. Good. That's about all I can give. > > I'd say, at a rough estimate, go for a number of buffers 2-3 times the > > maximum number of connections you allow. Or less if you anticipate > > rarely ever having that many connections. > > Here I disagree. The more shared buffer cache you have, the bigger the > percentage of your database that neither causes read()'s nor memory > copying from the OS buffer cache. Certainly. But overall, you will cache a smaller number of blocks because you will be buffering them twice. When you copy a block from the OS buffer to shared memory, the copy still exists in the OS buffer. So that block is now buffered twice. For most workloads, in the long run, that will force you to do disk I/O that you would not have had to do otherwise. A single disk I/O is far more expensive than hundreds of copies between the OS buffer cache and postgres' shared memory. Draw your own conclusions. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
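As a made-up illustration of the double-buffering point: on a machine with, say, 768 MB available for caching, giving 512 MB to postgres's shared buffers leaves only 256 MB for the kernel's cache, and because the hottest blocks tend to sit in both caches at once, the amount of distinct data cached ends up somewhere between 512 MB and 768 MB rather than the full 768 MB - every block held in both places displaces a block that could otherwise have been cached.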
On Thu, 27 Jun 2002, Andrew Sullivan wrote: > For what it's worth, truss on Solaris 7 indicates that copying from > OS buffer costs a few milliseconds (between 2 and 8 in our tests, > depending on load), at least in our tests. Ouch! Are you sure you mean MILLIseconds, rather than MICROSECONDS? 8K in 2 ms. = 4 MB/sec. 8K in 8 usec. = 976 MB/sec. The latter would be a much more reasonable figure for memory bandwidth... cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
On Thu, 27 Jun 2002, Andrew Sullivan wrote: > Consider all the other things you're doing on the machine -- just > little things, like cron and such. All that takes memory. > Therefore, it's dangerous not to let the OS manage a good chunk of > memory. Postgres itself doesn't manage all of its own memory usage. Sorts and the like, for example, are allocated from OS-owned memory, not postgres' shared memory. So if you give most of the machine's memory over to postgres shared buffers, postgres itself might drive the machine into swapping, because the shared buffers are holding memory that the sort needs but cannot use. That's why it's better to err toward minimum shared memory rather than maximum. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
On Fri, Jun 28, 2002 at 01:13:01PM +0900, Curt Sampson wrote: > On Thu, 27 Jun 2002, Andrew Sullivan wrote: > > > For what it's worth, truss on Solaris 7 indicates that copying from > > OS buffer costs a few milliseconds (between 2 and 8 in our tests, > > depending on load), at least in our tests. > > Ouch! Are you sure you mean MILLIseconds, rather than MICROSECONDS? Doh. Had a braino. Thanks. Yes, microseconds. A -- ---- Andrew Sullivan 87 Mowat Avenue Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M6K 3E3 +1 416 646 3304 x110
>>>>> "CS" == Curt Sampson <cjs@cynic.net> writes: CS> For most workloads, in the long run, that will force you to do disk CS> I/O that you would not have had to do otherwise. A single disk I/O CS> is far more expensive than hundreds of copies between the OS buffer CS> cache and postgres' shared memory. It occurs to me that these two opposing factors could be plotted and make somewhat of a line each. The ideal point would be where they crossed. Perhaps one day when I've got a few hours I'll copy over my large dataset to my devel server and try to mimic the load with different shm sizes... -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Vivek Khera, Ph.D. Khera Communications, Inc. Internet: khera@kciLink.com Rockville, MD +1-240-453-8497 AIM: vivekkhera Y!: vivek_khera http://www.khera.org/~vivek/
Curt Sampson wrote: > Certainly. But overall, you will cache a smaller number of blocks > because you will be buffering them twice. When you copy a block > from the OS buffer to shared memory, the copy still exists in the > OS buffer. So that block is now buffered twice. You have the block in the kernel buffer when you copy it to the PostgreSQL buffers, but nothing says you have to keep that block in the kernel buffers while PostgreSQL has it. Only when it is written does it return to the kernel, and if it is only read, it never returns to the kernel. > For most workloads, in the long run, that will force you to do disk > I/O that you would not have had to do otherwise. A single disk I/O > is far more expensive than hundreds of copies between the OS buffer > cache and postgres' shared memory. Yes, if you swap, you went too far. That has always been the upper limit. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
On Fri, 28 Jun 2002, Bruce Momjian wrote: > You have the block in the kernel buffer when you copy it to the > PostgreSQL buffers, but nothing says you have to keep that block in the > kernel buffers while PostgreSQL has it. Only when it is written does it > return to the kernel, and if it is only read, it never returns to the > kernel. Sure. But the OS doesn't know that after you read a block it may immediately abandon buffering of that block. It's going to use the same strategy it does for any other block, which is to keep it in memory for a while, preferring to get rid of older blocks. If you had a way to tell the OS, "I'm buffering block X, so you don't need to" that would be one thing. But you don't. > > For most workloads, in the long run, that will force you to do disk > > I/O that you would not have had to do otherwise. A single disk I/O > > is far more expensive than hundreds of copies between the OS buffer > > cache and postgres' shared memory. > > Yes, if you swap, you went too far. That has always been the upper > limit. Not just swap: disk I/O for blocks that would have been buffered by the OS if you hadn't been using memory that it could use. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson wrote: > On Fri, 28 Jun 2002, Bruce Momjian wrote: > > > You have the block in the kernel buffer when you copy it to the > > PostgreSQL buffers, but nothing says you have to keep that block in the > > kernel buffers while PostgreSQL has it. Only when it is written does it > > return to the kernel, and if it is only read, it never returns to the > > kernel. > > Sure. But the OS doesn't know that after you read a block it may > immediately abandon buffering of that block. It's going to use the > same strategy it does for any other block, which is to keep it in > memory for a while, preferring to get rid of older blocks. > > If you had a way to tell the OS, "I'm buffering block X, so you > don't need to" that would be one thing. But you don't. > > > > For most workloads, in the long run, that will force you to do disk > > > I/O that you would not have had to do otherwise. A single disk I/O > > > is far more expensive than hundreds of copies between the OS buffer > > > cache and postgres' shared memory. > > > > Yes, if you swap, you went too far. That has always been the upper > > limit. > > Not just swap: disk I/O for blocks that would have been buffered > by the OS if you hadn't been using memory that it could use. Yes, but how does a smaller PostgreSQL buffer and larger kernel buffer fix this? It seems it only makes it worse, because PostgreSQL is loading its buffers more often. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026