Thread: two disks - best way to use them?
I installed another drive in my linux pc in an attempt to improve performance
on a large COPY to a table with a geometry index.
Based on previous discussion, it seems there are three things competing for the hard
drive:
1) the input data file
2) the pg table
3) the WAL
What is the best way to distribute these among two drives? From Tom’s comments
I would think that the pg table and the WAL should be separate. Does it matter where
the input data is?
At 01:58 PM 12/2/2005, Rick Schumeyer wrote:
> I installed another drive in my linux pc in an attempt to improve
> performance on a large COPY to a table with a geometry index.
>
> Based on previous discussion, it seems there are three things
> competing for the hard drive:
>
> 1) the input data file
> 2) the pg table
> 3) the WAL
>
> What is the best way to distribute these among two drives? From
> Tom's comments I would think that the pg table and the WAL should be
> separate. Does it matter where the input data is?

Best is to have 3 HD or HD sets, one for each of the above.

With only 2, and assuming the input file is too large to fit
completely into RAM at once, I'd test to see whether:
a= input on one + pg table & WAL on the other, or
b= WAL on one + pg table & input file on the other
is best.

If the input file can be made 100% RAM resident, then use
c= pg table on one + WAL and input file on the other.

The big goal here is to minimize HD head seeks.

Ron
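[Since the workload in question is a single large COPY, one simple way to compare layouts a, b and c is to time the same load under each arrangement. This is only a sketch; the database, table and file names are placeholders, not anything from the thread:]

    # run once per layout, against a freshly truncated copy of the table
    time psql -d testdb -c "\copy mytable from '/path/to/input.dat'"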
Rick Schumeyer wrote:
> 1) the input data file
> 2) the pg table
> 3) the WAL

And the filesystem journal, especially if you have not set the
"noatime" mount option. The WAL and the filesystem journal both like
to sync.

IMHO: on the first disk (a RAID mirror :)) I place /, pg_table, and
the filesystem journal of the second disk. On the second, /var and
the pg tables. Thus the first disk gets synced from time to time, and
the second does not.

--
Olleg Samoylov
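[For reference, the two things being described here look roughly like this on ext3. The device and volume names are only placeholders, and moving the ext3 journal onto the other disk is an optional extra step; treat this as a sketch rather than a tested recipe:]

    # /etc/fstab: mount the data filesystem with noatime so plain reads
    # don't generate inode-update writes
    /dev/vgraida/pgsql  /var/lib/postgresql  ext3  rw,noatime  0 2

    # optionally, give the data filesystem an external ext3 journal
    # that lives on the other disk (done at mkfs time here)
    mke2fs -O journal_dev /dev/sdb1
    mkfs.ext3 -J device=/dev/sdb1 /dev/sda1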
Ron wrote:
> At 01:58 PM 12/2/2005, Rick Schumeyer wrote:
>
>> I installed another drive in my linux pc in an attempt to improve
>> performance on a large COPY to a table with a geometry index.
>>
>> Based on previous discussion, it seems there are three things
>> competing for the hard drive:
>>
>> 1) the input data file
>> 2) the pg table
>> 3) the WAL
>>
>> What is the best way to distribute these among two drives? From
>> Tom's comments I would think that the pg table and the WAL should be
>> separate. Does it matter where the input data is?
>
> Best is to have 3 HD or HD sets, one for each of the above.
>
> With only 2, and assuming the input file is too large to fit
> completely into RAM at once, I'd test to see whether:
> a= input on one + pg table & WAL on the other, or
> b= WAL on one + pg table & input file on the other
> is best.
>
> If the input file can be made 100% RAM resident, then use
> c= pg table on one + WAL and input file on the other.
>
> The big goal here is to minimize HD head seeks.

(noob question incoming)

Section 26.4 WAL Internals
http://www.postgresql.org/docs/8.1/interactive/wal-internals.html

This seems to be the applicable chapter. They talk about creating a
symlink for the data/pg_xlog folder to point at another disk set.

If I have (2) RAID1 sets with LVM2, can I instead create a logical
volume on the 2nd disk set and just mount data/pg_xlog to point at the
logical volume on the 2nd disk set?

For example, I have an LVM on my primary mirror called 'pgsql'. And
I've created a 2nd LVM on my secondary mirror called 'pgxlog'. These
are mounted as:

/dev/vgraida/pgsql on /var/lib/postgresql type ext3 (rw,noatime)

/dev/vgraidb/pgxlog on /var/lib/postgresql/data/pg_xlog type ext3
(rw,noatime)

From the application's P.O.V., it's the same thing, right? (It seems
to be working, I'm just trying to double-check that I'm not missing
something.)
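[For what it's worth, the symlink from the docs and the direct mount described above end up in the same place. A rough sketch of both, reusing the volume names from this message; the init-script and mount paths are examples, and the server should be stopped either way:]

    # stop the server before either change
    /etc/init.d/postgresql stop

    # either: the symlink approach from section 26.4 of the docs,
    # assuming the second disk set is mounted at /mnt/disk2
    mv /var/lib/postgresql/data/pg_xlog /mnt/disk2/pg_xlog
    ln -s /mnt/disk2/pg_xlog /var/lib/postgresql/data/pg_xlog

    # or: mount the logical volume directly over data/pg_xlog, after
    # copying the existing WAL segments onto it (the mount hides
    # whatever is underneath the mount point)
    mount -o noatime /dev/vgraidb/pgxlog /var/lib/postgresql/data/pg_xlog

    /etc/init.d/postgresql start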
On Mon, 5 Dec 2005, Thomas Harold wrote:

> (noob question incoming)
>
> Section 26.4 WAL Internals
> http://www.postgresql.org/docs/8.1/interactive/wal-internals.html
>
> This seems to be the applicable chapter. They talk about creating a
> symlink for the data/pg_xlog folder to point at another disk set.
>
> If I have (2) RAID1 sets with LVM2, can I instead create a logical
> volume on the 2nd disk set and just mount data/pg_xlog to point at
> the logical volume on the 2nd disk set?
>
> For example, I have an LVM on my primary mirror called 'pgsql'. And
> I've created a 2nd LVM on my secondary mirror called 'pgxlog'. These
> are mounted as:
>
> /dev/vgraida/pgsql on /var/lib/postgresql type ext3 (rw,noatime)
>
> /dev/vgraidb/pgxlog on /var/lib/postgresql/data/pg_xlog type ext3
> (rw,noatime)
>
> From the application's P.O.V., it's the same thing, right? (It seems
> to be working, I'm just trying to double-check that I'm not missing
> something.)

The application can't tell the difference, but the reason for
separating them isn't for the application; it's so that different
pieces of hardware can work on different things without having to
bounce back and forth between them.

Using the same drives with LVM doesn't achieve this goal.

The problem is that the WAL is doing a LOT of writes, and postgres
waits until each write is completed before going on to the next thing
(for safety). If a disk is dedicated to the WAL then the head doesn't
move much. If the disk is used for other things as well, then the
heads have to move across the disk surface between the WAL and where
the data is. This drastically slows down the number of items that can
go into the WAL, and therefore slows down the entire system.

This slowdown isn't even something as simple as cutting your speed in
half (half the time spent working on the WAL, half spent on the data
itself); it's more like 10% spent on the WAL, 10% spent on the data,
and 80% moving back and forth between them (I am probably wrong on the
exact numbers, but it is something similarly drastic).

This is also the reason why it's so good to have a filesystem journal
on a different drive.

David Lang
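[One way to actually watch this seek contention while the load runs, assuming the sysstat package is installed:]

    # extended per-device stats every 5 seconds; a drive that stays near
    # saturation with high service times while the CPU sits in iowait is
    # the seek-bound disk
    iostat -x 5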
David Lang wrote:
> The application can't tell the difference, but the reason for
> separating them isn't for the application; it's so that different
> pieces of hardware can work on different things without having to
> bounce back and forth between them.
>
> Using the same drives with LVM doesn't achieve this goal.
>
> The problem is that the WAL is doing a LOT of writes, and postgres
> waits until each write is completed before going on to the next thing
> (for safety). If a disk is dedicated to the WAL then the head doesn't
> move much. If the disk is used for other things as well, then the
> heads have to move across the disk surface between the WAL and where
> the data is. This drastically slows down the number of items that can
> go into the WAL, and therefore slows down the entire system.
>
> This slowdown isn't even something as simple as cutting your speed in
> half (half the time spent working on the WAL, half spent on the data
> itself); it's more like 10% spent on the WAL, 10% spent on the data,
> and 80% moving back and forth between them (I am probably wrong on
> the exact numbers, but it is something similarly drastic).

Yeah, I don't think I was clear about the config. It's (4) disks set
up as a pair of RAID1 sets. My original config was pgsql on the first
RAID set (data and WAL). I'm now experimenting with putting the
data/pg_xlog folder on the 2nd set of disks.

Under the old setup (everything on the original RAID1 set, in a
dedicated 32GB LVM volume), I was seeing 80-90% wait percentages in
"top". My understanding is that this is an indicator of an overloaded
/ bottlenecked disk system. This was while doing massive inserts into
a test table (millions of narrow rows). I'm waiting to see what
happens once I have data/pg_xlog on the 2nd disk set.

Thanks for the input.
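[For anyone reproducing that layout, carving a pg_xlog volume out of the second RAID1 set might look like the following. The md device name, volume group name, and size are examples only, not the exact setup from this thread:]

    pvcreate /dev/md1                  # the second RAID1 array
    vgcreate vgraidb /dev/md1
    lvcreate -L 8G -n pgxlog vgraidb   # the WAL needs far less than this
    mkfs.ext3 /dev/vgraidb/pgxlog
    mount -o noatime /dev/vgraidb/pgxlog /var/lib/postgresql/data/pg_xlog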
On Mon, 5 Dec 2005, Thomas Harold wrote:

> Yeah, I don't think I was clear about the config. It's (4) disks set
> up as a pair of RAID1 sets. My original config was pgsql on the first
> RAID set (data and WAL). I'm now experimenting with putting the
> data/pg_xlog folder on the 2nd set of disks.
>
> Under the old setup (everything on the original RAID1 set, in a
> dedicated 32GB LVM volume), I was seeing 80-90% wait percentages in
> "top". My understanding is that this is an indicator of an
> overloaded / bottlenecked disk system. This was while doing massive
> inserts into a test table (millions of narrow rows). I'm waiting to
> see what happens once I have data/pg_xlog on the 2nd disk set.

In that case you logically have two disks, so see the post from Ron
earlier in this thread.

David Lang
David Lang wrote:
> In that case you logically have two disks, so see the post from Ron
> earlier in this thread.

And it's a very nice performance gain. Percent spent waiting according
to "top" is down around 10-20% instead of 80-90%. While I'm not
prepared to benchmark, database performance is way up. The client
machines that are writing the data are running closer to 100% CPU
(before they were well below 50% CPU utilization).
At 12:52 AM 12/6/2005, Thomas Harold wrote:
> David Lang wrote:
>
>> In that case you logically have two disks, so see the post from Ron
>> earlier in this thread.
>
> And it's a very nice performance gain. Percent spent waiting
> according to "top" is down around 10-20% instead of 80-90%. While
> I'm not prepared to benchmark, database performance is way up. The
> client machines that are writing the data are running closer to 100%
> CPU (before they were well below 50% CPU utilization).

For accuracy's sake, which exact config did you finally use?

How did you choose the config you finally used? Did you test the
three options or just pick one?

Ron
Ron wrote:
> For accuracy's sake, which exact config did you finally use?
>
> How did you choose the config you finally used? Did you test the
> three options or just pick one?

(Note: I'm not the original poster.)

I just picked the option of putting the data/pg_xlog directory (WAL)
on a 2nd set of spindles. That was the easiest thing for me to change
on this test box.

The test server is simply a Gentoo box running software RAID and LVM2.
The primary disk set is 2x7200RPM 300GB drives and the secondary disk
set is 2x5400RPM 300GB drives. Brand new install of PGSQL 8.1, with
mostly default settings (I changed FSM pages to a higher value,
max_fsm_pages = 150000). PGSQL was given its own 32GB ext3 LVM volume
on the primary disk set (2x7200RPM). Originally, all files were on the
primary disk.

The task at hand was inserting a large quantity of ~45-byte rows
(according to "vacuum verbose"), on the order of millions of records
per table. There was a unique key and a unique index. Test clients
were accessing the database via ODBC / ADO and doing the inserts in a
fairly brute-force mode (attempt the insert, calling .CancelUpdate if
it fails).

When the tables were under 2 million rows, performance was okay. At
one point, I had a 1.8GHz P4, a dual Opteron 246, and an Opteron 148
running at nearly 100% CPU, processing and doing inserts into the
database. So I had 4 clients running, performing inserts to 4 separate
tables in the same database. The P4 ran at about half the throughput
of the Opterons (client-bound due to the code that generated row data
prior to the insert), so I'd score my throughput as roughly 3.3-3.4,
where 1.0 would be full utilization of the Opteron 148 box.

However, once the tables started getting above ~2 million rows,
performance took a nose dive. CPU utilization on the 4 client CPUs
dropped into the basement (5-20% CPU) and I had to back off on the
number of clients, so throughput had dropped down to around 0.25 or
so. The linux box was spending nearly all of its time waiting on the
primary disks.

Moving the data/pg_xlog (WAL) to the 2nd set of disks (2x5400RPM) in
the test server made a dramatic difference for this mass insert. I'm
running the P4 (100% CPU) and the Opteron 148 (~80% CPU) at the
moment. While it's not up to full speed, a throughput of ~1.3 is a lot
better than the ~0.25 that I was getting prior. (The two tables
currently being written have over 5 million rows each. One table has
~16 million rows.) Wait percentage in "top" is only running 20-30%
(dipping as low as 10%).

I haven't pushed this new setup hard enough to determine where the
upper limit for throughput is. It's very much a niche test (millions
of inserts of narrow rows into multiple tables using fairly brain-dead
code). But it gives me data points on which to base the purchase of
the production box. The original plan was a simple RAID1 setup (2
spindles), but this tells me it's better to order 4 spindles and set
it up as a pair of RAID1 sets. Whether 4 spindles is better as two
separate RAID1 arrays, or configured as a single RAID1+0 array...
dunno. Our application is typically more limited by insert speed than
read speed (so I'm leaning towards separate RAID arrays).

I'm sure there's also more tuning that could be done to the PGSQL
database (in the configuration file). Also, the code is throwaway code
that isn't the most elegant.
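[On the "more tuning in the configuration file" point: the only change made above was max_fsm_pages. The handful of 8.1 settings people usually look at next for a write-heavy load are sketched below; the values are purely illustrative, not recommendations for this particular box:]

    # postgresql.conf (8.1) -- the one change already made above
    max_fsm_pages = 150000

    # commonly raised for bulk-insert workloads (illustrative values only)
    shared_buffers = 20000         # in 8kB buffers
    wal_buffers = 64               # in 8kB buffers
    checkpoint_segments = 32       # fewer, larger checkpoints
    maintenance_work_mem = 262144  # in kB; helps index builds and vacuum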
On Tue, 6 Dec 2005, Thomas Harold wrote:

> Ron wrote:
>
>> For accuracy's sake, which exact config did you finally use?
>>
>> How did you choose the config you finally used? Did you test the
>> three options or just pick one?
>
> (Note: I'm not the original poster.)
>
> I just picked the option of putting the data/pg_xlog directory (WAL)
> on a 2nd set of spindles. That was the easiest thing for me to change
> on this test box.
>
> The test server is simply a Gentoo box running software RAID and
> LVM2. The primary disk set is 2x7200RPM 300GB drives and the
> secondary disk set is 2x5400RPM 300GB drives. Brand new install of
> PGSQL 8.1, with mostly default settings (I changed FSM pages to a
> higher value, max_fsm_pages = 150000). PGSQL was given its own 32GB
> ext3 LVM volume on the primary disk set (2x7200RPM). Originally, all
> files were on the primary disk.

The WAL is more sensitive to drive speeds than the data is, so you may
pick up a little more performance by switching the WAL to the 7200 RPM
drives instead of the 5400 RPM drives.

If you see a noticeable difference with this, consider buying a pair
of smaller but faster drives (10k or 15k RPM drives, or a solid-state
drive).

You can test this (with significant data risk) by putting the WAL on a
ramdisk and seeing what your performance looks like.

David Lang
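[A sketch of that ramdisk test using tmpfs. As noted above, this is strictly a throwaway benchmark: a crash or power loss takes the WAL, and effectively the cluster, with it. The paths and size are examples:]

    /etc/init.d/postgresql stop
    mkdir -p /mnt/walram
    mount -t tmpfs -o size=512m tmpfs /mnt/walram
    cp -a /var/lib/postgresql/data/pg_xlog/. /mnt/walram/
    mv /var/lib/postgresql/data/pg_xlog /var/lib/postgresql/data/pg_xlog.disk
    ln -s /mnt/walram /var/lib/postgresql/data/pg_xlog
    /etc/init.d/postgresql start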