Thread: Best COPY Performance
What is the best COPY performance that you have gotten on a "normal" table?

I know that this question is almost too general, but it might help me out a
bit, or at least give me the right things to tweak. Perhaps the question can
be rewritten as "Where are the major bottlenecks in a COPY?" or "How can I
compute the max theoretical COPY performance for my hardware?". The two
subquestions that I have from this are:

-Are my ETL scripts (perl) maximizing the database COPY speeds?
-Can I tweak my DB further to eke out a bit more performance?

I'm using perl to ETL a decent sized data set (10 million records) and then
loading it through Perl DBI's COPY support. I am currently getting between
10K and 15K inserts/second. I've profiled the ETL scripts a bit and have
performance-improved a lot of the code, but I'd like to determine whether it
makes sense to try and further optimize my Perl or count it as "done" and
look for improvements elsewhere.

I ran a trivial little insert into a table with a single integer column and
came close to 250K inserts/second using psql's \copy, so I'm thinking that
my code could be optimized a bit more, but wanted to check around to see if
that was the case.

I am most interested in loading two tables, one with about 21 (small)
VARCHARs where each record is about 200 bytes, and another with 7 INTEGERs,
3 TIMESTAMPs, and 1 BYTEA where each record is about 350 bytes.

I have implemented most of the various bits of PG config advice that I have
seen, both here and with some googling, such as:

wal_buffers=128
checkpoint_segments=128
checkpoint_timeout=3000

Software: PG 8.1.3 on RHEL 4.3 x86_64
Hardware: Quad Dual-core Opteron, Fibre Channel SAN with 256MB BBC

Thanks!
On 10/21/06, Worky Workerson <worky.workerson@gmail.com> wrote:
> What is the best COPY performance that you have gotten on a "normal" table?
>
> I know that this question is almost too general, but it might help
> me out a bit, or at least give me the right things to tweak. Perhaps
> the question can be rewritten as "Where are the major bottlenecks in a
> COPY?" or "How can I compute the max theoretical COPY performance for
> my hardware?". The two subquestions that I have from this are:
> -Are my ETL scripts (perl) maximizing the database COPY speeds?
> -Can I tweak my DB further to eke out a bit more performance?
>
> I'm using perl to ETL a decent sized data set (10 million records) and
> then loading it through Perl DBI's COPY support. I am currently getting
> between 10K and 15K inserts/second. I've profiled the ETL scripts a
> bit and have performance-improved a lot of the code, but I'd like to
> determine whether it makes sense to try and further optimize my Perl
> or count it as "done" and look for improvements elsewhere.
>
> I ran a trivial little insert into a table with a single integer column
> and came close to 250K inserts/second using psql's \copy, so I'm thinking
> that my code could be optimized a bit more, but wanted to check around
> to see if that was the case.
>
> I am most interested in loading two tables, one with about 21 (small)
> VARCHARs where each record is about 200 bytes, and another with 7
> INTEGERs, 3 TIMESTAMPs, and 1 BYTEA where each record is about 350
> bytes.

indexes/keys? more memory for sorting during index creation can have
a dramatic effect on bulk insert performance. check for pgsql_tmp
folders popping up during the copy run.

> I have implemented most of the various bits of PG config advice that I
> have seen, both here and with some googling, such as:
>
> wal_buffers=128
> checkpoint_segments=128
> checkpoint_timeout=3000
>
> Software: PG 8.1.3 on RHEL 4.3 x86_64
> Hardware: Quad Dual-core Opteron, Fibre Channel SAN with 256MB BBC

for a table light on indexes, 10-15k for copy is pretty poor. you can
get pretty close to that with raw inserts on good hardware. I would
suggest configuring your perl script to read from stdin and write to
stdout, and pipe it to psql using copy from stdin. then just
benchmark your perl script redirecting output to a file.

merlin
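[As a concrete illustration of the pipe-to-psql approach Merlin describes
above, a minimal sketch -- the script, database and table names are
placeholders, not anything from the thread:]

    # transform on stdin/stdout and load in one pipeline
    ./etl.pl < raw_input.csv | psql -d mydb -c "COPY mytable FROM STDIN"

    # benchmark the perl step by itself
    time ./etl.pl < raw_input.csv > /dev/null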
Hi, Worky,

Worky Workerson wrote:
> I am currently getting
> between 10K and 15K inserts/second.

> I ran a trivial little insert into a table with a single integer column
> and came close to 250K inserts/second using psql's \copy, so I'm thinking
> that my code could be optimized a bit more, but wanted to check around
> to see if that was the case.

Could you COPY one of your tables out to disk via psql, and then COPY it
back into the database, to reproduce this measurement with your real data?

Also, how much is the disk load, and the CPU usage?

As long as psql is a factor of 20 better than your perl script, I think
that the perl interface is what should be optimized.

On a table with no indices, triggers and constraints, we managed to COPY
about 7-8 megabytes/second with psql over our 100 MBit network, so here
the network was the bottleneck.

You should think about having your perl program write the COPY data as
text and piping it into psql.

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org
Markus, On 10/23/06 2:27 AM, "Markus Schaber" <schabi@logix-tt.com> wrote: > On a table with no indices, triggers and contstraints, we managed to > COPY about 7-8 megabytes/second with psql over our 100 MBit network, so > here the network was the bottleneck. We routinely get 10-12MB/s on I/O hardware that can sustain a sequential write rate of 60+ MB/s with the WAL and data on the same disks. It depends on a few things you might not consider, including the number and type of columns in the table and the client and server encoding. The fastest results are with more columns in a table and when the client and server encoding are the same. - Luke
> > I am most interested in loading two tables, one with about 21 (small)
> > VARCHARs where each record is about 200 bytes, and another with 7
> > INTEGERs, 3 TIMESTAMPs, and 1 BYTEA where each record is about 350
> > bytes.
>
> indexes/keys? more memory for sorting during index creation can have
> a dramatic effect on bulk insert performance. check for pgsql_tmp
> folders popping up during the copy run.

The only index on load is a single IP4 btree primary key, which I
figure should function about like an INTEGER.

> for a table light on indexes, 10-15k for copy is pretty poor. you can
> get pretty close to that with raw inserts on good hardware. I would
> suggest configuring your perl script to read from stdin and write to
> stdout, and pipe it to psql using copy from stdin. then just
> benchmark your perl script redirecting output to a file.

So simple and I hadn't thought of that ... thanks. When I pre-create a COPY
file, I can load it at about 45K inserts/sec (the file was 1.8GB, or 14.5
million records, loaded in 331 seconds), which works out to about 5.5 MB/s.
I'm loading from a local 15K SCSI320 RAID10 (which also contains the PG log
files) to a 10K SCSI320 RAID10 on an FC SAN. Does this look more consistent
with "decent" performance, or should I go looking into some hardware issues,
i.e. SAN configuration? I currently wear several hats, including
hardware/systems/security admin as well as DBA and programmer, and my SAN
setup skills could definitely use some more work.

Hardware aside, my perl can definitely use some work, and it seems to be
mostly the CSV stuff that I am using, mostly for convenience. I'll see if I
can't redo some of that to eliminate some CSV processing, or, barring that,
multithread the process to utilize more of the CPUs. Part of the reason
that I hadn't used psql in the first place is that I'm loading the data
into partitioned tables, and the loader keeps several COPY connections open
at a time to load the data into the right table. I guess I could just as
easily keep several psql pipes open, but it seemed cleaner to go through DBI.
On Mon, Oct 23, 2006 at 11:10:19AM -0400, Worky Workerson wrote: > >> I am most interested in loading two tables, one with about 21 (small) > >> VARCHARs where each record is about 200 bytes, and another with 7 > >> INTEGERs, 3 TIMESTAMPs, and 1 BYTEA where each record is about 350 > >> bytes. > > > >indexes/keys? more memory for sorting during index creation can have > >a dramatic affect on bulk insert performance. check for pg_tmp > >folders popping up during copy run. > > The only index on load is a single IP4 btree primary key, which I > figure should function about like an INTEGER. > > >for table light on indexes, 10-15k for copy is pretty poor. you can > >get pretty close to that with raw inserts on good hardware. I would > >suggest configuirng your perl script to read from stdin and write to > >stdout, and pipe it to psql using copy from stdin. then just > >benchmark your perl script redirecting output to a file. > > So simple and hadn't thought of that ... thanks. When I pre-create a > COPY file, I can load it at about 45K inserts/sec (file was 1.8GB or > 14.5 million records in 331 seconds), which looks like its about 5.5 > MB/s. I'm loading from a local 15K SCSI320 RAID10 (which also > contains the PG log files) to a 10K SCSI320 RAID10 on an FC SAN. Does > this look more consistent with "decent" performance, or should I go > looking into some hardware issues i.e. SAN configuration? I've > currently got several hats including hardware/systems/security admin, > as well as DBA and programmer, and my SAN setup skills could > definitely use some more work. > > Hardware aside, my perl can definitely use some work, and it seems to > be mostly the CSV stuff that I am using, mostly for convenience. I'll > see if I can't redo some of that to eliminate some CSV processing, or, > barring that, multithread the process to utilize more of the CPUs. > Part of the reason that I hadn't used psql in the first place is that > I'm loading the data into partitioned tables, and the loader keeps > several COPY connections open at a time to load the data into the > right table. I guess I could just as easily keep several psql pipes > open, but it seemed cleaner to go through DBI. http://stats.distributed.net used to use a perl script to do some transformations before loading data into the database. IIRC, when we switched to using C we saw 100x improvement in speed, so I suspect that if you want performance perl isn't the way to go. I think you can compile perl into C, so maybe that would help some. Ultimately, you might be best of using triggers instead of rules for the partitioning since then you could use copy. Or go to raw insert commands that are wrapped in a transaction. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
> Ultimately, you might be best of using triggers instead of rules for the > partitioning since then you could use copy. Or go to raw insert commands > that are wrapped in a transaction. My experience is that triggers are quite a bit faster than rules in any kind of partitioning that involves more than say 7 tables. Sincerely, Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
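[For anyone following along, a rough sketch of the trigger-based partition
routing being suggested here -- the table, column names and date boundary
are invented for the example. The point is that COPY into the parent table
fires triggers, whereas rules are not applied to COPY:]

    CREATE OR REPLACE FUNCTION route_insert() RETURNS trigger AS $$
    BEGIN
        -- route each row to the right child table (columns are hypothetical)
        IF NEW.seen < DATE '2006-11-01' THEN
            INSERT INTO events_200610 (ip, seen, payload)
                VALUES (NEW.ip, NEW.seen, NEW.payload);
        ELSE
            INSERT INTO events_200611 (ip, seen, payload)
                VALUES (NEW.ip, NEW.seen, NEW.payload);
        END IF;
        RETURN NULL;  -- do not store the row in the parent table
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER events_route
        BEFORE INSERT ON events
        FOR EACH ROW EXECUTE PROCEDURE route_insert();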
Jim C. Nasby wrote:
> http://stats.distributed.net used to use a perl script to do some
> transformations before loading data into the database. IIRC, when we
> switched to using C we saw 100x improvement in speed, so I suspect that
> if you want performance perl isn't the way to go. I think you can
> compile perl into C, so maybe that would help some.

I use Perl extensively, and have never seen a performance problem. I
suspect the perl-to-C "100x improvement" was due to some other factor, like
a slight change in the schema, indexes, or the fundamental way the client
(C vs Perl) handled the data during the transformation, or just plain bad
Perl code.

Modern scripting languages like Perl and Python make programmers far, far
more productive than the bad old days of C/C++. Don't shoot yourself in the
foot by reverting to low-level languages like C/C++ until you've exhausted
all other possibilities. I only use C/C++ for intricate scientific
algorithms.

In many cases, Perl is *faster* than C/C++ code that I write, because I
can't take the time (for example) to write the high-performance string
manipulation routines that have been fine-tuned and extensively optimized
in Perl.

Craig
Markus,

> Could you COPY one of your tables out to disk via psql, and then COPY it
> back into the database, to reproduce this measurement with your real data?

$ psql -c "COPY my_table TO STDOUT" > my_data
$ ls my_data
2018792 edgescape_pg_load
$ time cat my_data | psql -c "COPY mytable FROM STDIN"
real    5m43.194s
user    0m35.412s
sys     0m9.567s

> Also, how much is the disk load, and CPU usage?

When I am loading via the perl (which I've established is a bottleneck),
one CPU core is at 99% for the perl and another is at 30% for a postmaster,
vs about 90% for the postmaster when going through psql.

The disk load is where I start to get a little fuzzy, as I haven't played
with iostat to figure out what is "normal". The local drives contain
PG_DATA as well as all the log files, but there is a tablespace on the
FibreChannel SAN that contains the destination table. The disk usage
pattern that I see is that there is a ton of consistent activity on the
local disk, with iostat reporting an average of 30K Blk_wrtn/s, which I
assume is the log files. Every several seconds there is a massive burst of
activity on the FC partition, to the tune of 250K Blk_wrtn/s.

> On a table with no indices, triggers and constraints, we managed to
> COPY about 7-8 megabytes/second with psql over our 100 MBit network, so
> here the network was the bottleneck.

hmm, this makes me think that either my PG config is really lacking, or
that the SAN is badly misconfigured, as I would expect it to outperform a
100Mb network. As it is, with a straight pipe to psql COPY, I'm only
working with a little over 5.5 MB/s. Could this be due to the primary key
index updates?

Thanks!
On Mon, Oct 23, 2006 at 03:37:47PM -0700, Craig A. James wrote: > Jim C. Nasby wrote: > >http://stats.distributed.net used to use a perl script to do some > >transformations before loading data into the database. IIRC, when we > >switched to using C we saw 100x improvement in speed, so I suspect that > >if you want performance perl isn't the way to go. I think you can > >compile perl into C, so maybe that would help some. > > I use Perl extensively, and have never seen a performance problem. I > suspect the perl-to-C "100x improvement" was due to some other factor, like > a slight change in the schema, indexes, or the fundamental way the client > (C vs Perl) handled the data during the transformation, or just plain bad > Perl code. > > Modern scripting languages like Perl and Python make programmers far, far > more productive than the bad old days of C/C++. Don't shoot yourself in > the foot by reverting to low-level languages like C/C++ until you've > exhausted all other possibilities. I only use C/C++ for intricate > scientific algorithms. > > In many cases, Perl is *faster* than C/C++ code that I write, because I > can't take the time (for example) to write the high-performance string > manipulation that have been fine-tuned and extensively optimized in Perl. Well, the code is all at http://cvs.distributed.net/viewcvs.cgi/stats-proc/hourly/ (see logmod directory and logmod_*.pl). There have been changes made to the C code since we changed over, but you can find the appropriate older versions in there. IIRC, nothing in the database changed when we went from perl to C (it's likely that was the *only* change that happened anywhere around that time). -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Tue, Oct 24, 2006 at 09:17:08AM -0400, Worky Workerson wrote: > >http://stats.distributed.net used to use a perl script to do some > >transformations before loading data into the database. IIRC, when we > >switched to using C we saw 100x improvement in speed, so I suspect that > >if you want performance perl isn't the way to go. I think you can > >compile perl into C, so maybe that would help some. > > Like Craig mentioned, I have never seen those sorts of improvements > going from perl->C, and developer efficiency is primo for me. I've > profiled most of the stuff, and have used XS modules and Inline::C on > the appropriate, often used functions, but I still think that it comes > down to my using CSV and Text::CSV_XS. Even though its XS, CSV is > still a pain in the ass. > > >Ultimately, you might be best of using triggers instead of rules for the > >partitioning since then you could use copy. Or go to raw insert commands > >that are wrapped in a transaction. > > Eh, I've put the partition loading logic in the loader, which seems to > work out pretty well, especially since I keep things sorted and am the > only one inserting into the DB and do so with bulk loads. But I'll > keep this in mind for later use. Well, given that perl is using an entire CPU, it sounds like you should start looking either at ways to remove some of the overhead from perl, or to split that perl into multiple processes. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
> http://stats.distributed.net used to use a perl script to do some
> transformations before loading data into the database. IIRC, when we
> switched to using C we saw 100x improvement in speed, so I suspect that
> if you want performance perl isn't the way to go. I think you can
> compile perl into C, so maybe that would help some.

Like Craig mentioned, I have never seen those sorts of improvements going
from perl->C, and developer efficiency is primo for me. I've profiled most
of the stuff, and have used XS modules and Inline::C on the appropriate,
often-used functions, but I still think that it comes down to my using CSV
and Text::CSV_XS. Even though it's XS, CSV is still a pain in the ass.

> Ultimately, you might be best off using triggers instead of rules for the
> partitioning since then you could use copy. Or go to raw insert commands
> that are wrapped in a transaction.

Eh, I've put the partition loading logic in the loader, which seems to work
out pretty well, especially since I keep things sorted and am the only one
inserting into the DB and do so with bulk loads. But I'll keep this in
mind for later use.

Thanks!
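[As a reference point for the CSV discussion above, a minimal sketch of a
Text::CSV_XS filter that reads CSV on stdin and writes COPY text format on
stdout. The per-field transform is just a placeholder and the escaping
shown is simplified:]

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new({ binary => 1 });
    while (my $row = $csv->getline(\*STDIN)) {
        # ...per-field transformations would go here...
        my @out = map {
            my $f = $_;
            if (defined $f) {
                # COPY text format: escape backslash, tab and newline
                $f =~ s/\\/\\\\/g;
                $f =~ s/\t/\\t/g;
                $f =~ s/\n/\\n/g;
                $f;
            } else {
                '\N';    # NULL
            }
        } @$row;
        print join("\t", @out), "\n";
    }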
Jim C. Nasby wrote:
> Well, given that perl is using an entire CPU, it sounds like you should
> start looking either at ways to remove some of the overhead from perl,
> or to split that perl into multiple processes.

I use Perl for big database copies (usually with some
processing/transformation along the way) and I've never seen 100% CPU usage
except for brief periods, even when copying BLOBS and such. My typical copy
divides operations into blocks, for example doing

  N = 0
  while (more rows to go) {
     begin transaction
     select ... where primary_key > N order by primary_key limit 1000
     while (fetch a row)
        insert into ...
     N = (highest value found in last block)
     commit
  }

Doing it like this in Perl should keep Postgres busy, with Perl using only
moderate resources. If you're seeing high Perl CPU usage, I'd look first
at the Perl code.

Craig
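[For the curious, a rough Perl/DBI rendering of the block-copy loop Craig
sketches above -- the connection strings, table and column names are
invented, and the transform step is a placeholder:]

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $src = DBI->connect('dbi:Pg:dbname=source', 'user', 'pass',
                           { RaiseError => 1 });
    my $dst = DBI->connect('dbi:Pg:dbname=dest', 'user', 'pass',
                           { RaiseError => 1, AutoCommit => 0 });

    my $ins = $dst->prepare('INSERT INTO dest_table (id, payload) VALUES (?, ?)');
    my $n = 0;
    while (1) {
        # fetch the next block of 1000 rows past the last key we saw
        my $rows = $src->selectall_arrayref(
            'SELECT id, payload FROM src_table WHERE id > ? ORDER BY id LIMIT 1000',
            undef, $n);
        last unless @$rows;
        for my $row (@$rows) {
            # ...transform the row here...
            $ins->execute(@$row);
        }
        $n = $rows->[-1][0];    # highest key seen in this block
        $dst->commit;           # one transaction per block
    }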
On 23 Oct 2006, at 22:59, Jim C. Nasby wrote:
> http://stats.distributed.net used to use a perl script to do some
> transformations before loading data into the database. IIRC, when we
> switched to using C we saw 100x improvement in speed, so I suspect that
> if you want performance perl isn't the way to go. I think you can
> compile perl into C, so maybe that would help some.

http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=perl&lang2=gcc

100x isn't totally impossible if that is even vaguely accurate and you
happen to be using bits of Perl which are a lot slower than the C
implementation would be... The slowest things appear to involve calling
functions; all the slowest tests involve lots of function calls.
On 10/25/06, Craig A. James <cjames@modgraph-usa.com> wrote:
> Jim C. Nasby wrote:
> > Well, given that perl is using an entire CPU, it sounds like you should
> > start looking either at ways to remove some of the overhead from perl,
> > or to split that perl into multiple processes.
>
> I use Perl for big database copies (usually with some
> processing/transformation along the way) and I've never seen 100% CPU
> usage except for brief periods, even when copying BLOBS and such. My
> typical copy divides operations into blocks, for example doing

I'm just doing CSV-style transformations (and calling a lot of functions
along the way), but the end result is a straight bulk load of data into a
blank database. And we've established that Postgres can do *way* better
than what I am seeing, so it's not surprising that perl is using 100% of a
CPU.

However, I am still curious as to the rather slow COPYs from psql to local
disks. Like I mentioned previously, I was only seeing about 5.7 MB/s
(1.8 GB / 330 seconds), where it seemed like others were doing
substantially better. What sorts of things should I look into?

Thanks!
Hi, Worky, Worky Workerson wrote: > $ psql -c "COPY my_table TO STDOUT" > my_data > $ ls my_data > 2018792 edgescape_pg_load > $ time cat my_data | psql -c "COPY mytable FROM STDIN" > real 5m43.194s > user 0m35.412s > sys 0m9.567s That's via PSQL, and you get about 5 MB/Sec. >> On a table with no indices, triggers and contstraints, we managed to >> COPY about 7-8 megabytes/second with psql over our 100 MBit network, so >> here the network was the bottleneck. > > hmm, this makes me think that either my PG config is really lacking, > or that the SAN is badly misconfigured, as I would expect it to > outperform a 100Mb network. As it is, with a straight pipe to psql > COPY, I'm only working with a little over 5.5 MB/s. Could this be due > to the primary key index updates? Yes, index updates cause both CPU load, and random disk access (which is slow by nature). HTH, Markus -- Markus Schaber | Logical Tracking&Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in Europe! www.ffii.org www.nosoftwarepatents.org
On 10/23/06, Worky Workerson <worky.workerson@gmail.com> wrote: > The disk load is where I start to get a little fuzzy, as I haven't > played with iostat to figure what is "normal". The local drives > contain PG_DATA as well as all the log files, but there is a > tablespace on the FibreChannel SAN that contains the destination > table. The disk usage pattern that I see is that there is a ton of > consistent activity on the local disk, with iostat reporting an > average of 30K Blk_wrtn/s, which I assume is the log files. Every > several seconds there is a massive burst of activity on the FC > partition, to the tune of 250K Blk_wrtn/s. > > > On a table with no indices, triggers and contstraints, we managed to > > COPY about 7-8 megabytes/second with psql over our 100 MBit network, so > > here the network was the bottleneck. I'm guessing the high bursts are checkpoints. Can you check your log files for pg and see if you are getting warnings about checkpoint frequency? You can get some mileage here by increasing wal files. Have you determined that pg is not swapping? try upping maintenance_work_mem. What exactly is your architecture? is your database server direct attached to the san? if so, 2gb/4gb fc? what san? have you bonnie++ the san? basically, you can measure iowait to see if pg is waiting on your disks. regarding perl, imo the language performance is really about which libraries you use. the language itself is plenty fast. merlin
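[For reference, one plausible way to run the bonnie++ test Merlin suggests
against the SAN mount -- the path, user and size below are examples only;
the test size should be roughly twice RAM so the OS cache doesn't hide the
disk speed:]

    # 32768 MB test file on the SAN-backed filesystem, run as the postgres user
    bonnie++ -d /data/test -s 32768 -u postgres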
Mr. Worky Workerson,

On 10/25/06 5:03 AM, "Worky Workerson" <worky.workerson@gmail.com> wrote:
> However, I am still curious as to the rather slow COPYs from psql to
> local disks. Like I mentioned previously, I was only seeing about 5.7
> MB/s (1.8 GB / 330 seconds), where it seemed like others were doing
> substantially better. What sorts of things should I look into?

It's probable that you have a really poor performing disk configuration.
Judging from earlier results, you may only be getting 3 x 5.7 = 17 MB/s of
write performance to your disks, which is about 1/4 of a single disk drive.

Please run this test and report the time here:

1) Calculate the size of 2x memory in 8KB blocks:
   # of blocks = 250,000 x memory_in_GB
   Example: 250,000 x 16GB = 4,000,000 blocks

2) Benchmark the time taken to write 2x RAM sequentially to your disk:
   time bash -c "dd if=/dev/zero of=bigfile bs=8k count=<# of blocks> && sync"

3) Benchmark the time taken to read same:
   time dd if=bigfile of=/dev/null bs=8k

- Luke
On Tue, Oct 24, 2006 at 10:36:04PM -0700, Craig A. James wrote: > Jim C. Nasby wrote: > >Well, given that perl is using an entire CPU, it sounds like you should > >start looking either at ways to remove some of the overhead from perl, > >or to split that perl into multiple processes. > > I use Perl for big database copies (usually with some > processing/transformation along the way) and I've never seen 100% CPU usage > except for brief periods, even when copying BLOBS and such. My typical > copy divides operations into blocks, for example doing > > N = 0 > while (more rows to go) { > begin transaction > select ... where primary_key > N order by primary_key limit 1000 > while (fetch a row) > insert into ... > N = (highest value found in last block) > commit > } > > Doing it like this in Perl should keep Postgres busy, with Perl using only > moderate resources. If you're seeing high Perl CPU usage, I'd look first > at the Perl code. Wait... so you're using perl to copy data between two tables? And using a cursor to boot? I can't think of any way that could be more inefficient... What's wrong with a plain old INSERT INTO ... SELECT? Or if you really need to break it into multiple transaction blocks, at least don't shuffle the data from the database into perl and then back into the database; do an INSERT INTO ... SELECT with that same where clause. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Wed, Oct 25, 2006 at 08:03:38AM -0400, Worky Workerson wrote:
> I'm just doing CSV-style transformations (and calling a lot of
> functions along the way), but the end result is a straight bulk load
> of data into a blank database. And we've established that Postgres
> can do *way* better than what I am seeing, so it's not surprising that
> perl is using 100% of a CPU.

If you're loading into an empty database, there's a number of tricks that
will help you:

Turn off fsync

Add constraints and indexes *after* you've loaded the data (best to add as
many of them as possible on a per-table basis right after the table is
loaded, so that it's hopefully still in cache)

Crank up maintenance_work_mem, especially for tables that won't fit into
cache anyway

Bump up checkpoint_segments and wal_buffers.

Disable PITR

Create a table and load its data in a single transaction (8.2 will avoid
writing any WAL data if you do this and PITR is turned off)
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
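[A rough sketch of what that load sequence might look like -- the table,
column and file names are invented, the settings are example values only,
the fsync/checkpoint/archive lines live in postgresql.conf, and fsync should
be switched back on once the load is done:]

    -- postgresql.conf for the duration of the load (example values only):
    --   fsync = off
    --   checkpoint_segments = 128
    --   wal_buffers = 128
    --   maintenance_work_mem = 524288
    --   archive_command = ''           -- no PITR

    BEGIN;
    CREATE TABLE events (ip inet, seen timestamp, payload bytea);
    COPY events FROM '/tmp/events.dat';     -- server-side file, text format
    COMMIT;

    -- build indexes and constraints only after the data is in
    ALTER TABLE events ADD PRIMARY KEY (ip);
    ANALYZE events;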
On 10/25/06, Worky Workerson <worky.workerson@gmail.com> wrote:
> > I'm guessing the high bursts are checkpoints. Can you check your log
> > files for pg and see if you are getting warnings about checkpoint
> > frequency? You can get some mileage here by increasing wal files.
>
> Nope, nothing in the log. I have set:
> wal_buffers=128
> checkpoint_segments=128
> checkpoint_timeout=3000
> which I thought was rather generous. Perhaps I should set it even
> higher for the loads?
>
> > Have you determined that pg is not swapping? try upping maintenance_work_mem.
>
> maintenance_work_mem = 524288 ... should I increase it even more?
> Doesn't look like pg is swapping ...

nah, you already addressed it. either pg is swapping or it isn't, and I'm
guessing it isn't.

> I'm currently running bonnie++ with the defaults ... should I change
> the execution to better mimic Postgres' behavior?

just post what you have...

> RHEL 4.3 x86_64
> HP DL585, 4 Dual Core Opteron 885s
> 16 GB RAM
> 2x300GB 10K SCSI320, RAID10
> HP MSA1000 SAN direct connected via single 2GB Fibre Channel Arbitrated Loop
> 10x300GB 10K SCSI320, RAID10

in theory, with 10 10k disks in raid 10, you should be able to keep your
2Gb FC link saturated all the time unless your i/o is extremely random.
random i/o is the wild card here; ideally you should see at least 2000
seeks in bonnie... let's see what comes up. hopefully, bonnie will report
close to 200 MB/sec. in extreme sequential cases, the 2Gb FC link should be
a bottleneck if the raid controller is doing its job.

if you are having cpu issues, try breaking your process down into at least
4 processes (you have a quad dual-core box, after all)... that's a
no-brainer.

merlin
> I'm guessing the high bursts are checkpoints. Can you check your log
> files for pg and see if you are getting warnings about checkpoint
> frequency? You can get some mileage here by increasing wal files.

Nope, nothing in the log. I have set:

wal_buffers=128
checkpoint_segments=128
checkpoint_timeout=3000

which I thought was rather generous. Perhaps I should set it even higher
for the loads?

> Have you determined that pg is not swapping? try upping maintenance_work_mem.

maintenance_work_mem = 524288 ... should I increase it even more?
Doesn't look like pg is swapping ...

> What exactly is your architecture? is your database server direct
> attached to the san? if so, 2gb/4gb fc? what san? have you bonnie++
> the san? basically, you can measure iowait to see if pg is waiting on
> your disks.

I'm currently running bonnie++ with the defaults ... should I change the
execution to better mimic Postgres' behavior?

RHEL 4.3 x86_64
HP DL585, 4 Dual Core Opteron 885s
16 GB RAM
2x300GB 10K SCSI320, RAID10
HP MSA1000 SAN direct connected via single 2GB Fibre Channel Arbitrated Loop
10x300GB 10K SCSI320, RAID10
Merlin, On 10/25/06 8:38 AM, "Merlin Moncure" <mmoncure@gmail.com> wrote: > in theory, with 10 10k disks in raid 10, you should be able to keep > your 2fc link saturated all the time unless your i/o is extremely > random. random i/o is the wild card here, ideally you should see at > least 2000 seeks in bonnie...lets see what comes up. The 2000 seeks/sec are irrelevant to Postgres with one user doing COPY. Because the I/O is single threaded, you will get one disk worth of seeks for one user, roughly 150/second on a 10K RPM drive. I suspect the problem here is the sequential I/O rate - let's wait and see what the dd test results look like. - Luke
Jim C. Nasby wrote:
> Wait... so you're using perl to copy data between two tables? And using
> a cursor to boot? I can't think of any way that could be more
> inefficient...
>
> What's wrong with a plain old INSERT INTO ... SELECT? Or if you really
> need to break it into multiple transaction blocks, at least don't
> shuffle the data from the database into perl and then back into the
> database; do an INSERT INTO ... SELECT with that same where clause.

The data are on two different computers, and I do processing of the data as
it passes through the application. Otherwise, the INSERT INTO ... SELECT
is my first choice.

Craig
> -----Original Message----- > From: pgsql-performance-owner@postgresql.org > [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of > Craig A. James > Sent: Wednesday, October 25, 2006 12:52 PM > To: Jim C. Nasby > Cc: Worky Workerson; Merlin Moncure; pgsql-performance@postgresql.org > Subject: Re: [PERFORM] Best COPY Performance > > Jim C. Nasby wrote: > > Wait... so you're using perl to copy data between two tables? And > > using a cursor to boot? I can't think of any way that could be more > > inefficient... > > > > What's wrong with a plain old INSERT INTO ... SELECT? Or if > you really > > need to break it into multiple transaction blocks, at least don't > > shuffle the data from the database into perl and then back into the > > database; do an INSERT INTO ... SELECT with that same where clause. > > The data are on two different computers, and I do processing > of the data as it passes through the application. Otherwise, > the INSERT INTO ... SELECT is my first choice. Would dblink() help in any way? Greg
Spiegelberg, Greg wrote: >> The data are on two different computers, and I do processing >> of the data as it passes through the application. Otherwise, >> the INSERT INTO ... SELECT is my first choice. > > Would dblink() help in any way? It might if perl wasn't so damned good at this. ;-) Craig
Merlin/Luke:

> > in theory, with 10 10k disks in raid 10, you should be able to keep
> > your 2Gb FC link saturated all the time unless your i/o is extremely
> > random. random i/o is the wild card here; ideally you should see at
> > least 2000 seeks in bonnie... let's see what comes up.
>
> I suspect the problem here is the sequential I/O rate - let's wait and see
> what the dd test results look like.

Here are the tests that you suggested that I do, on both the local disks
(WAL) and the SAN (tablespace). The random seeks seem to be far below what
Merlin said was "good", so I am a bit concerned. There is a bit of other
activity on the box at the moment which is hard to stop, so that might have
had an impact on the processing.

Here is the bonnie++ output:

Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
              Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
Local Disks    31G 45119  85 56548  21 27527   8 35069  66 86506  13 499.6   1
SAN            31G 53544  98 93385  35 18266   5 24970  47 57911   8 611.8   1

And here are the dd results for 16GB RAM, i.e. 4,000,000 8K blocks:

# Local Disks
$ time bash -c "dd if=/dev/zero of=/home/myhome/bigfile bs=8k count=4000000 && sync"
4000000+0 records in
4000000+0 records out

real    10m0.382s
user    0m1.117s
sys     2m45.681s

$ time dd if=/home/myhome/bigfile of=/dev/null bs=8k count=4000000
4000000+0 records in
4000000+0 records out

real    6m22.904s
user    0m0.717s
sys     0m53.766s

# Fibre Channel SAN
$ time bash -c "dd if=/dev/zero of=/data/test/bigfile bs=8k count=4000000 && sync"
4000000+0 records in
4000000+0 records out

real    5m58.846s
user    0m1.096s
sys     2m18.026s

$ time dd if=/data/test/bigfile of=/dev/null bs=8k count=4000000
4000000+0 records in
4000000+0 records out

real    14m9.560s
user    0m0.739s
sys     0m53.806s
Hi, Craig, Craig A. James wrote: >> Would dblink() help in any way? > > It might if perl wasn't so damned good at this. ;-) You know that you can use Perl inside PostgreS via plperl? HTH, Markus
Mr. Worky,

On 10/25/06 11:26 AM, "Worky Workerson" <worky.workerson@gmail.com> wrote:
> And here are the dd results for 16GB RAM, i.e. 4,000,000 8K blocks:

So, if we divide 32,000 MB by the real time, we get:

/home (WAL):
  53 MB/s write
  84 MB/s read

/data (data):
  89 MB/s write
  38 MB/s read

The write and read speeds on /home look like a single disk drive, which is
not good if you have more drives in a RAID. OTOH, it should be sufficient
for WAL writing and you should think that the COPY speed won't be limited
by WAL.

The read speed on your /data volume is awful to the point where you should
consider it broken and find a fix. A quick comparison: the same number on a
16 drive internal SATA array with 7200 RPM disks gets 950 MB/s read, about
25 times faster for about 1/4 the price.

But again, this may not have anything to do with the speed of your COPY
statements. Can you provide about 10 seconds worth of "vmstat 1" while
running your COPY so we can get a global view of the I/O and CPU?

- Luke
On Wed, Oct 25, 2006 at 11:25:01AM -0400, Worky Workerson wrote:
> > I'm guessing the high bursts are checkpoints. Can you check your log
> > files for pg and see if you are getting warnings about checkpoint
> > frequency? You can get some mileage here by increasing wal files.
>
> Nope, nothing in the log. I have set:
> wal_buffers=128
> checkpoint_segments=128
> checkpoint_timeout=3000
> which I thought was rather generous. Perhaps I should set it even
> higher for the loads?

But depending on your shared_buffers and bgwriter settings (as well as how
much WAL traffic you're generating), you could still end up with big slugs
of work to be done when checkpoints happen.

If you set checkpoint_warning to 3001, you'll see exactly when checkpoints
are happening, so you can determine if that's an issue.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
I do have a dirty little secret, one which I wasn't completely aware of until a little while ago. Apparently, someone decided to install Oracle on the server, and use the SAN as the primary tablespace, so that might have something to do with the poor performance of the SAN. At least, I'm hoping that's what is involved. I'll be able to tell for sure when I can rerun the benchmarks on Monday without Oracle. Thank you all for all your help so far. I've learned a ton about Postgres (and life!) from both this thread and this list in general, and the "support" has been nothing less than spectacular. I'm hoping that the corporate Oracle machine won't shut down my pg projects. On total side note, if anyone knows how to best limit Oracle's impact on a system (i.e. memory usage, etc), I'd be interested. I hate shared DB servers. Thanks!
On 10/27/06, Worky Workerson <worky.workerson@gmail.com> wrote: > I'm hoping that the corporate Oracle machine won't shut down my pg > projects. On total side note, if anyone knows how to best limit > Oracle's impact on a system (i.e. memory usage, etc), I'd be > interested. rm -rf /usr/local/oracle? merlin
> The read speed on your /data volume is awful to the point where you should
> consider it broken and find a fix. A quick comparison: the same number on a
> 16 drive internal SATA array with 7200 RPM disks gets 950 MB/s read, about
> 25 times faster for about 1/4 the price.

I'm hoping that the poor performance is a result of Oracle doing random
reads while I try and do the sequential read. If not (I'll test on Monday),
I'll start looking other places. Any immediate suggestions?

> Can you provide about 10 seconds worth of "vmstat 1" while running your COPY
> so we can get a global view of the I/O and CPU?

Here it is, taken from a spot about halfway through a 'cat file | psql'
load, with the "Oracle-is-installed-and-running" caveat:

r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 345732 29328 770980 12947212 0 0 20 16552 1223 3677 12 2 85 1
1 0 345732 29840 770520 12946924 0 0 20 29244 1283 2955 11 2 85 1
1 0 345732 32144 770560 12944436 0 0 12 16436 1204 2936 11 2 86 1
1 0 345732 33744 770464 12942764 0 0 20 16460 1189 2005 10 2 86 1
2 0 345732 32656 770140 12943972 0 0 16 7068 1057 3434 13 2 85 0
1 0 345732 34832 770184 12941820 0 0 20 9368 1170 3120 11 2 86 1
1 0 345732 36528 770228 12939804 0 0 16 32668 1297 2109 11 2 85 1
1 0 345732 29304 770272 12946764 0 0 16 16428 1192 3105 12 2 85 1
1 0 345732 30840 770060 12945480 0 0 20 16456 1196 3151 12 2 84 1
1 0 345732 32760 769972 12943528 0 0 12 16460 1185 3103 11 2 86 1
Worky (that your real name? :-) On 10/27/06 12:08 PM, "Worky Workerson" <worky.workerson@gmail.com> wrote: > Here it is, taken from a spot about halfway through a 'cat file | > psql' load, with the "Oracle-is-installed-and-running" caveat: > > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 0 345732 29328 770980 12947212 0 0 20 16552 1223 3677 12 2 85 1 > 1 0 345732 29840 770520 12946924 0 0 20 29244 1283 2955 11 2 85 1 > 1 0 345732 32144 770560 12944436 0 0 12 16436 1204 2936 11 2 86 1 > 1 0 345732 33744 770464 12942764 0 0 20 16460 1189 2005 10 2 86 1 > 2 0 345732 32656 770140 12943972 0 0 16 7068 1057 3434 13 2 85 0 > 1 0 345732 34832 770184 12941820 0 0 20 9368 1170 3120 11 2 86 1 > 1 0 345732 36528 770228 12939804 0 0 16 32668 1297 2109 11 2 85 1 > 1 0 345732 29304 770272 12946764 0 0 16 16428 1192 3105 12 2 85 1 > 1 0 345732 30840 770060 12945480 0 0 20 16456 1196 3151 12 2 84 1 > 1 0 345732 32760 769972 12943528 0 0 12 16460 1185 3103 11 2 86 1 It doesn't look like there's anything else running - the runnable "r" is about 1. Your "bo" blocks output rate is about 16MB/s, so divide by 3 and you're about in range with your 5MB/s COPY rate. The interesting thing is that the I/O wait is pretty low. How many CPUs on the machine? Can you send the result of "cat /proc/cpuinfo"? Is your "cat file | psql" being done on the DBMS server or is it on the network? - Luke
On 10/28/06, Luke Lonergan <llonergan@greenplum.com> wrote: > Worky (that your real name? :-) > > > On 10/27/06 12:08 PM, "Worky Workerson" <worky.workerson@gmail.com> wrote: > > > Here it is, taken from a spot about halfway through a 'cat file | > > psql' load, with the "Oracle-is-installed-and-running" caveat: > > > > r b swpd free buff cache si so bi bo in cs us sy id wa > > 1 0 345732 29328 770980 12947212 0 0 20 16552 1223 3677 12 2 85 1 > > 1 0 345732 29840 770520 12946924 0 0 20 29244 1283 2955 11 2 85 1 > > 1 0 345732 32144 770560 12944436 0 0 12 16436 1204 2936 11 2 86 1 > > 1 0 345732 33744 770464 12942764 0 0 20 16460 1189 2005 10 2 86 1 > > 2 0 345732 32656 770140 12943972 0 0 16 7068 1057 3434 13 2 85 0 > > 1 0 345732 34832 770184 12941820 0 0 20 9368 1170 3120 11 2 86 1 > > 1 0 345732 36528 770228 12939804 0 0 16 32668 1297 2109 11 2 85 1 > > 1 0 345732 29304 770272 12946764 0 0 16 16428 1192 3105 12 2 85 1 > > 1 0 345732 30840 770060 12945480 0 0 20 16456 1196 3151 12 2 84 1 > > 1 0 345732 32760 769972 12943528 0 0 12 16460 1185 3103 11 2 86 1 > > It doesn't look like there's anything else running - the runnable "r" is > about 1. Your "bo" blocks output rate is about 16MB/s, so divide by 3 and > you're about in range with your 5MB/s COPY rate. The interesting thing is > that the I/O wait is pretty low. > > How many CPUs on the machine? Can you send the result of "cat > /proc/cpuinfo"? > > Is your "cat file | psql" being done on the DBMS server or is it on the > network? iirc, he is running quad opteron 885 (8 cores), so if my math is correct he can split up his process for an easy gain. merlin
> Worky (that your real name? :-) Nope, its Mike. worky.workerson is just the email that I use for "work" :) > How many CPUs on the machine? Can you send the result of "cat > /proc/cpuinfo"? Not at work at the moment, however I do have quad dual-core opterons, like Merlin mentioned. > Is your "cat file | psql" being done on the DBMS server or is it on the > network? Everything is currently being done on the same DB server. The data that is being loaded is on the same 2-disk "RAID10" as the OS and WAL.
On 10/27/06, Merlin Moncure <mmoncure@gmail.com> wrote: > > > r b swpd free buff cache si so bi bo in cs us sy id wa > > > 1 0 345732 29328 770980 12947212 0 0 20 16552 1223 3677 12 2 85 1 > > > 1 0 345732 29840 770520 12946924 0 0 20 29244 1283 2955 11 2 85 1 > > > 1 0 345732 32144 770560 12944436 0 0 12 16436 1204 2936 11 2 86 1 > > > 1 0 345732 33744 770464 12942764 0 0 20 16460 1189 2005 10 2 86 1 > > > 2 0 345732 32656 770140 12943972 0 0 16 7068 1057 3434 13 2 85 0 > > > 1 0 345732 34832 770184 12941820 0 0 20 9368 1170 3120 11 2 86 1 > > > 1 0 345732 36528 770228 12939804 0 0 16 32668 1297 2109 11 2 85 1 > > > 1 0 345732 29304 770272 12946764 0 0 16 16428 1192 3105 12 2 85 1 > > > 1 0 345732 30840 770060 12945480 0 0 20 16456 1196 3151 12 2 84 1 > > > 1 0 345732 32760 769972 12943528 0 0 12 16460 1185 3103 11 2 86 1 > > > > It doesn't look like there's anything else running - the runnable "r" is > > about 1. Your "bo" blocks output rate is about 16MB/s, so divide by 3 and > > you're about in range with your 5MB/s COPY rate. The interesting thing is > > that the I/O wait is pretty low. > > > > How many CPUs on the machine? Can you send the result of "cat > > /proc/cpuinfo"? > > iirc, he is running quad opteron 885 (8 cores), so if my math is > correct he can split up his process for an easy gain. Are you saying that I should be able to issue multiple COPY commands because my I/O wait is low? I was under the impression that I am I/O bound, so multiple simeoultaneous loads would have a detrimental effect ...
Worky, On 10/27/06 8:47 PM, "Worky Workerson" <worky.workerson@gmail.com> wrote: >>>> 1 0 345732 29304 770272 12946764 0 0 16 16428 1192 3105 12 2 85 1 >>>> 1 0 345732 30840 770060 12945480 0 0 20 16456 1196 3151 12 2 84 1 >>>> 1 0 345732 32760 769972 12943528 0 0 12 16460 1185 3103 11 2 86 1 >> >> iirc, he is running quad opteron 885 (8 cores), so if my math is >> correct he can split up his process for an easy gain. > > Are you saying that I should be able to issue multiple COPY commands > because my I/O wait is low? I was under the impression that I am I/O > bound, so multiple simeoultaneous loads would have a detrimental > effect ... The reason I asked how many CPUs was to make sense of the 12% usr CPU time in the above. That means you are CPU bound and are fully using one CPU. So you aren't being limited by the I/O in this case, it's the CPU. I agree with Merlin that you can speed things up by breaking the file up. Alternately you can use the OSS Bizgres java loader, which lets you specify the number of I/O threads with the "-n" option on a single file. OTOH, you should find that you will only double your COPY speed with this approach because your write speed as you previously found was limited to 30 MB/s. For now, you could simply split the file in two pieces and load two copies at once, then watch the same "vmstat 1" for 10 seconds and look at your "bo" rate. If this does speed things up, you really should check out the Bizgres Java loader. The other thing to wonder about though is why you are so CPU bound at 5 MB/s. What version of Postgres is this? - Luke
> > Are you saying that I should be able to issue multiple COPY commands > > because my I/O wait is low? I was under the impression that I am I/O > > bound, so multiple simeoultaneous loads would have a detrimental > > effect ... > > The reason I asked how many CPUs was to make sense of the 12% usr CPU time > in the above. That means you are CPU bound and are fully using one CPU. So > you aren't being limited by the I/O in this case, it's the CPU. > > I agree with Merlin that you can speed things up by breaking the file up. > Alternately you can use the OSS Bizgres java loader, which lets you specify > the number of I/O threads with the "-n" option on a single file. Thanks, I'll try that on Monday. > The other thing to wonder about though is why you are so CPU bound at 5 > MB/s. What version of Postgres is this? I was wondering about that as well, and the only thing that I can think of is that its the PK btree index creation on the IP4. PG 8.1.3 x86_64. I installed it via a RH rpm for their "Web Services Beta", or something like that. I know I'm a bit behind the times, but getting stuff in (and out) of my isolated lab is a bit of a pain. I'll compile up a 8.2 beta as well and see how that works out.
Michael (aka Worky),

On 10/28/06 5:03 AM, "Michael Artz" <mlartz@gmail.com> wrote:
> PG 8.1.3 x86_64. I installed it via a RH rpm for their "Web Services
> Beta", or something like that. I know I'm a bit behind the times, but
> getting stuff in (and out) of my isolated lab is a bit of a pain.
> I'll compile up a 8.2 beta as well and see how that works out.

I think 8.1 and 8.2 should be the same in this regard.

Maybe it is just the PK *build* that slows it down, but I just tried some
small scale experiments on my MacBook Pro laptop (which has the same disk
performance as your server) and I get only a 10-15% slowdown from having a
PK on an integer column. The 10-15% slowdown was on 8.1.5 MPP, so it used
both CPUs to build the index and load at about 15 MB/s.

Note that the primary key is the first column.

        Table "public.part"
    Column     |         Type          | Modifiers
---------------+-----------------------+-----------
 p_partkey     | integer               | not null
 p_name        | character varying(55) | not null
 p_mfgr        | text                  | not null
 p_brand       | text                  | not null
 p_type        | character varying(25) | not null
 p_size        | integer               | not null
 p_container   | text                  | not null
 p_retailprice | double precision      | not null
 p_comment     | character varying(23) | not null
Indexes:
    "part_pkey" PRIMARY KEY, btree (p_partkey)

What is your schema for the table?

- Luke
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Luke Lonergan
> Sent: Saturday, October 28, 2006 12:07 AM
> To: Worky Workerson; Merlin Moncure
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Best COPY Performance
>
> Worky,
>
> On 10/27/06 8:47 PM, "Worky Workerson" <worky.workerson@gmail.com> wrote:
>
> > Are you saying that I should be able to issue multiple COPY commands
> > because my I/O wait is low? I was under the impression that I am I/O
> > bound, so multiple simultaneous loads would have a detrimental
> > effect ...
>
> ...
> I agree with Merlin that you can speed things up by breaking the file up.
> Alternately you can use the OSS Bizgres java loader, which lets you
> specify the number of I/O threads with the "-n" option on a single file.

As a result of this thread, and b/c I've tried this in the past but never
had much success at speeding the process up, I attempted just that here,
except via 2 psql CLI's with access to the local file.

1.1M rows of data varying in width from 40 to 200 characters COPY'd to a
table with only one text column, no keys, indexes, &c took about 15 seconds
to load. ~73K rows/second.

I broke that file into 2 files each of 550K rows and performed 2
simultaneous COPY's after dropping the table, recreating, issuing a sync on
the system to be sure, &c and nearly every time both COPY's finish in 12
seconds. About a 20% gain to ~91K rows/second.

Admittedly, this was a pretty rough test, but a 20% savings, if it can be
put into production, is worth exploring for us.

B/c I'll be asked, I did this on an idle, dual 3.06GHz Xeon with 6GB of
memory, U320 SCSI internal drives and PostgreSQL 8.1.4.

Greg
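[For anyone who wants to repeat the experiment, a rough sketch of the
two-way split -- the file, database and table names here are made up:]

    # split the input roughly in half and run two COPYs in parallel
    split -l 550000 data.txt part_
    time ( psql -d test -c "COPY t1 FROM STDIN" < part_aa &
           psql -d test -c "COPY t1 FROM STDIN" < part_ab &
           wait )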
Greg, On 10/30/06 7:09 AM, "Spiegelberg, Greg" <gspiegelberg@cranel.com> wrote: > I broke that file into 2 files each of 550K rows and performed 2 > simultaneous COPY's after dropping the table, recreating, issuing a sync > on the system to be sure, &c and nearly every time both COPY's finish in > 12 seconds. About a 20% gain to ~91K rows/second. > > Admittedly, this was a pretty rough test but a 20% savings, if it can be > put into production, is worth exploring for us. Did you see whether you were I/O or CPU bound in your single threaded COPY? A 10 second "vmstat 1" snapshot would tell you/us. With Mr. Workerson (:-) I'm thinking his benefit might be a lot better because the bottleneck is the CPU and it *may* be the time spent in the index building bits. We've found that there is an ultimate bottleneck at about 12-14MB/s despite having sequential write to disk speeds of 100s of MB/s. I forget what the latest bottleneck was. - Luke
Luke Lonergan wrote:
> Greg,
>
> On 10/30/06 7:09 AM, "Spiegelberg, Greg" <gspiegelberg@cranel.com> wrote:
>
>> I broke that file into 2 files each of 550K rows and performed 2
>> simultaneous COPY's after dropping the table, recreating, issuing a sync
>> on the system to be sure, &c and nearly every time both COPY's finish in
>> 12 seconds. About a 20% gain to ~91K rows/second.
>>
>> Admittedly, this was a pretty rough test but a 20% savings, if it can be
>> put into production, is worth exploring for us.
>
> Did you see whether you were I/O or CPU bound in your single threaded COPY?
> A 10 second "vmstat 1" snapshot would tell you/us.
>
> With Mr. Workerson (:-) I'm thinking his benefit might be a lot better
> because the bottleneck is the CPU and it *may* be the time spent in the
> index building bits.
>
> We've found that there is an ultimate bottleneck at about 12-14MB/s despite
> having sequential write to disk speeds of 100s of MB/s. I forget what the
> latest bottleneck was.

I have personally managed to load a bit less than 400k rows/s (5 int
columns, no indexes) on very fast disk hardware - at that point postgresql
is completely CPU bottlenecked (2.6GHz Opteron).

Using multiple processes to load the data will help it scale up to about
900k rows/s (4 processes on 4 cores).

Stefan
Stefan, On 10/30/06 8:57 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote: >> We've found that there is an ultimate bottleneck at about 12-14MB/s despite >> having sequential write to disk speeds of 100s of MB/s. I forget what the >> latest bottleneck was. > > I have personally managed to load a bit less then 400k/s (5 int columns > no indexes) - on very fast disk hardware - at that point postgresql is > completely CPU bottlenecked (2,6Ghz Opteron). 400,000 rows/s x 4 bytes/column x 5 columns/row = 8MB/s > Using multiple processes to load the data will help to scale up to about > 900k/s (4 processes on 4 cores). 18MB/s? Have you done this? I've not seen this much of an improvement before by using multiple COPY processes to the same table. Another question: how to measure MB/s - based on the input text file? On the DBMS storage size? We usually consider the input text file in the calculation of COPY rate. - Luke
Luke Lonergan wrote:
> Stefan,
>
> On 10/30/06 8:57 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:
>
>>> We've found that there is an ultimate bottleneck at about 12-14MB/s despite
>>> having sequential write to disk speeds of 100s of MB/s. I forget what the
>>> latest bottleneck was.
>> I have personally managed to load a bit less than 400k rows/s (5 int
>> columns, no indexes) on very fast disk hardware - at that point postgresql
>> is completely CPU bottlenecked (2.6GHz Opteron).
>
> 400,000 rows/s x 4 bytes/column x 5 columns/row = 8MB/s
>
>> Using multiple processes to load the data will help it scale up to about
>> 900k rows/s (4 processes on 4 cores).

yes, I did that about half a year ago as part of the "CREATE INDEX on a
1.8B row table" thread on -hackers that resulted in some of the sorting
improvements in 8.2. I don't think there is much more possible in terms of
import speed by using more cores (at least not when importing into the same
table) - iirc I was at nearly 700k rows/s with two cores and 850k rows/s
with 3 cores or such ...

> 18MB/s? Have you done this? I've not seen this much of an improvement
> before by using multiple COPY processes to the same table.
>
> Another question: how to measure MB/s - based on the input text file? On
> the DBMS storage size? We usually consider the input text file in the
> calculation of COPY rate.

yeah, that is a good question (and part of the reason why I cited the
rows/sec number btw.)

Stefan
> > And here are the dd results for 16GB RAM, i.e. 4,000,000 8K blocks:
>
> So, if we divide 32,000 MB by the real time, we get:
> /data (data):
>   89 MB/s write
>   38 MB/s read
... snip ...
> The read speed on your /data volume is awful to the point where you should
> consider it broken and find a fix. A quick comparison: the same number on a
> 16 drive internal SATA array with 7200 RPM disks gets 950 MB/s read, about
> 25 times faster for about 1/4 the price.

I managed to get approval to shut down the Oracle instance and reran the
dd's on the SAN (/data) and came up with about 60 MB/s write (I had missed
the 'sync' in the previous runs) and about 58 MB/s read, still no comparison
to your SATA array. Any recommendations on what to look at to find a fix?
One thing which I never mentioned was that I am using ext3 mounted with
noatime,data=writeback.

An interesting note (at least to me) is the inverse relationship between
free memory and bo when writing with dd, i.e.:

$ vmstat 5
r b swpd free buff cache si so bi bo in cs us sy id wa
0 3 244664 320688 23588 15383120 0 0 0 28 1145 197 0 1 74 25
2 6 244664 349488 22276 15204980 0 0 0 24 1137 188 0 1 75 25
2 6 244664 28264 23024 15526552 0 0 0 65102 1152 335 0 12 60 28
2 4 244664 28968 23588 15383120 0 0 1 384386 1134 372 0 19 34 47
1 5 244664 28840 23768 15215728 0 0 1 438482 1144 494 0 24 33 43
0 5 247256 41320 20144 15212788 0 524 0 57062 1142 388 0 6 43 51
1 6 247256 29096 19588 15226788 0 0 5 60999 1140 391 0 15 42 43

Is this because of the kernel attempting to cache the file in memory?
Worky (!),

On 10/31/06 12:11 PM, "Worky Workerson" <worky.workerson@gmail.com> wrote:
> Any recommendations on what to look at to find a fix? One thing which I
> never mentioned was that I am using ext3 mounted with noatime,data=writeback.

You can try setting the max readahead like this:

  /sbin/blockdev --setra 16384 /dev/sd[a-z]

It will set the max readahead to 16MB for whatever devices are in
/dev/sd[a-z] for the booted machine. You'd need to put the line(s) in
/etc/rc.d/rc.local to have the setting persist on reboot.

- Luke
> > > I'm guessing the high bursts are checkpoints. Can you check your log
> > > files for pg and see if you are getting warnings about checkpoint
> > > frequency? You can get some mileage here by increasing wal files.
> >
> > Nope, nothing in the log. I have set:
> >   wal_buffers=128
> >   checkpoint_segments=128
> >   checkpoint_timeout=3000
> > which I thought was rather generous. Perhaps I should set it even
> > higher for the loads?
>
> But depending on your shared_buffer and bgwriter settings (as well as
> how much WAL traffic you're generating), you could still end up with big
> slugs of work to be done when checkpoints happen.
>
> If you set checkpoint_warning to 3001, you'll see exactly when
> checkpoints are happening, so you can determine if that's an issue.

I did that, and I get two log messages, seeing checkpoints happening 316 and 147 seconds apart on my load of a 1.9 GB file. Is this "an issue"?

Thanks!
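For anyone following along, a rough sketch of that change (assuming $PGDATA points at the data directory; the value is the one from this thread, not a general recommendation):

  # checkpoints that occur closer together than this many seconds get logged
  echo 'checkpoint_warning = 3001' >> "$PGDATA/postgresql.conf"
  # checkpoint_warning can be picked up with a reload, no restart needed
  pg_ctl reload -D "$PGDATA"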
> Maybe it is just the PK *build* that slows it down, but I just tried some
> small scale experiments on my MacBook Pro laptop (which has the same disk
> performance as your server) and I get only a 10-15% slowdown from having a
> PK on an integer column. The 10-15% slowdown was on 8.1.5 MPP, so it used
> both CPUs to build the index and load at about 15 MB/s.

...snip...

> What is your schema for the table?

A single IP4 PK and 21 VARCHARs. It takes about 340 seconds to load a 1.9GB file with the PK index, and about 230 seconds without it (ALTER TABLE mytable DROP CONSTRAINT mytable_pkey), which is a pretty significant (~30%) savings.

If I read the vmstat output correctly (i.e. the "us" CPU column), I'm still at 12% and thus still CPU-bound, except for when the checkpoint occurs, i.e. everything is chugging along similar to the first line, then stuff gets wonky:

 r  b   swpd    free   buff   cache si so bi      bo   in   cs us sy id wa
 2  0 279028 4620040 717940 9697664  0  0  0   19735 1242 7534 13  4 82  1
 1  2 279028 4476120 718120 9776840  0  0  0 2225483 1354 5269 13  6 71 11
 0  3 279028 4412928 718320 9866672  0  0  2   19746 1324 3978 10  2 69 18
 1  1 279028 4334112 718528 9971456  0  0  0   20615 1311 5912 10  3 69 18
 0  1 279028 4279904 718608 9995244  0  0  0  134946 1205  674  1  3 85 11
 0  2 279028 4307344 718616 9995304  0  0  0      54 1132  247  0  1 77 22
 1  0 279028 7411104 718768 6933860  0  0  0    9942 1148 3618 11  6 80  3
 1  0 279028 7329312 718964 7015536  0  0  1   19766 1232 5108 13  2 84  1

Also, as a semi-side note, I only have a single checkpoint without the index, while I have two with the index.
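A sketch of that workflow with placeholder names (the real table, key column, and input file differ, and the input is assumed to be in COPY text format); rebuilding the index once at the end is what buys the ~30% during the load:

  psql -c "ALTER TABLE mytable DROP CONSTRAINT mytable_pkey;"
  psql -c "\copy mytable FROM '/path/to/load_file'"
  # rebuild the index in a single pass after the bulk load
  psql -c "ALTER TABLE mytable ADD PRIMARY KEY (ip);"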
Checkpoints are not an issue here; the vmstat you included was on a 5 second interval, so the 'bursts' were bursting at a rate of 60MB/s.

- Luke

Msg is shrt cuz m on ma treo
> >>>> 1 0 345732 29304 770272 12946764    0    0   16 16428 1192 3105 12  2 85  1
> >>>> 1 0 345732 30840 770060 12945480    0    0   20 16456 1196 3151 12  2 84  1
> >>>> 1 0 345732 32760 769972 12943528    0    0   12 16460 1185 3103 11  2 86  1
> >>
> >> iirc, he is running quad opteron 885 (8 cores), so if my math is
> >> correct he can split up his process for an easy gain.
> >
> > Are you saying that I should be able to issue multiple COPY commands
> > because my I/O wait is low? I was under the impression that I am I/O
> > bound, so multiple simultaneous loads would have a detrimental
> > effect ...
>
> The reason I asked how many CPUs was to make sense of the 12% usr CPU time
> in the above. That means you are CPU bound and are fully using one CPU. So
> you aren't being limited by the I/O in this case, it's the CPU.

...snip...

> For now, you could simply split the file in two pieces and load two copies
> at once, then watch the same "vmstat 1" for 10 seconds and look at your "bo"
> rate.

Significantly higher on average, and parallel loads were ~30% faster than a single load with index builds (240s vs 340s) and ~45% faster (150s vs 230s) without the PK index. I'll definitely look into the bizgres java loader.

Thanks!
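The split-and-load-in-parallel approach looks roughly like this (table and file names are placeholders, and the input is assumed to be line-oriented COPY text):

  # split the input roughly in half, then run one COPY per chunk concurrently
  split -l $(( ($(wc -l < load_file) + 1) / 2 )) load_file chunk_
  for f in chunk_*; do
      psql -c "\copy mytable FROM '$f'" &
  done
  # both loads go into the same table; watch "vmstat 1" while they run
  wait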
On Tuesday 31 October 2006 21:11, Worky Workerson wrote:
> One thing which I never mentioned was that I
> am using ext3 mounted with noatime,data=writeback.

You might also want to try data=ordered. I have noticed that nowadays it seems to be a bit faster, but not much. I don't know why; maybe it has received more optimization effort.

Teemu
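For comparison, the corresponding /etc/fstab entries would look something like the following (device and mount point are placeholders); note that ext3 typically refuses to change the data= mode on a live remount, so plan for an unmount or a reboot:

  /dev/sdb1  /data  ext3  noatime,data=writeback  0 2
  # vs.
  /dev/sdb1  /data  ext3  noatime,data=ordered    0 2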
For the third time today, our server has crashed, or frozen, actually something in between. Normally there are about 30-50 connections because of mod_perl processes that keep connections open. After the crash, there are three processes remaining:

# ps -ef | grep postgres
postgres 23832     1  0 Nov11 pts/1    00:02:53 /usr/local/pgsql/bin/postmaster -D /postgres/main
postgres  1200 23832 20 14:28  pts/1    00:58:14 postgres: pubchem pubchem 66.226.76.106(58882) SELECT
postgres  4190 23832 25 14:33  pts/1    01:09:12 postgres: asinex asinex 66.226.76.106(56298) SELECT

But they're not doing anything: No CPU time consumed, no I/O going on, no progress. If I try to connect with psql(1), it says:

psql: FATAL: the database system is in recovery mode

And the server log has:

LOG: background writer process (PID 23874) was terminated by signal 9
LOG: terminating any other active server processes
LOG: statistics collector process (PID 23875) was terminated by signal 9
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited ab
... repeats about 50 times, one per process.

Questions:
1. Any idea what happened and how I can avoid this? It's a *big* problem.
2. Why didn't the database recover? Why are there two processes that couldn't be killed?
3. Where did the "signal 9" come from? (Nobody but me ever logs in to the server machine.)

Help!

Thanks,
Craig
Craig A. James wrote:
> For the third time today, our server has crashed, or frozen, actually
> something in between. Normally there are about 30-50 connections
> because of mod_perl processes that keep connections open. After the
> crash, there are three processes remaining:
>
> # ps -ef | grep postgres
> postgres 23832     1  0 Nov11 pts/1    00:02:53 /usr/local/pgsql/bin/postmaster -D /postgres/main
> postgres  1200 23832 20 14:28 pts/1    00:58:14 postgres: pubchem pubchem 66.226.76.106(58882) SELECT
> postgres  4190 23832 25 14:33 pts/1    01:09:12 postgres: asinex asinex 66.226.76.106(56298) SELECT
>
> But they're not doing anything: No CPU time consumed, no I/O going on,
> no progress. If I try to connect with psql(1), it says:
>
> psql: FATAL: the database system is in recovery mode
>
> And the server log has:
>
> LOG: background writer process (PID 23874) was terminated by signal 9
> LOG: terminating any other active server processes
> LOG: statistics collector process (PID 23875) was terminated by signal 9
> WARNING: terminating connection because of crash of another server process
> DETAIL: The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> HINT: In a moment you should be able to reconnect to the database and
> repeat your command.
> WARNING: terminating connection because of crash of another server process
> DETAIL: The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited ab
> ... repeats about 50 times, one per process.
>
> Questions:
> 1. Any idea what happened and how I can avoid this? It's a *big* problem.
> 2. Why didn't the database recover? Why are there two processes
>    that couldn't be killed?
> 3. Where did the "signal 9" come from? (Nobody but me ever logs
>    in to the server machine.)

I would guess it's the Linux OOM killer, if you are running Linux. You need to turn off killing of processes when you run out of memory. Are you getting close to running out of memory?

> Help!
>
> Thanks,
> Craig
Russell Smith wrote:
> Craig A. James wrote:
>> Questions:
>> 1. Any idea what happened and how I can avoid this? It's a *big* problem.
>> 2. Why didn't the database recover? Why are there two processes
>>    that couldn't be killed?

I'm guessing it didn't recover *because* there were two processes that couldn't be killed. Responsibility for that falls to the operating system. I've seen it most often with faulty drivers or hardware that's being communicated with/written to. However, see below.

>> 3. Where did the "signal 9" come from? (Nobody but me ever logs
>>    in to the server machine.)
>
> I would guess it's the linux OOM if you are running linux.

If not, it means the server is hacked or haunted. Something outside PG issued a kill -9, and the OOM killer is the prime suspect I'd say.

--
Richard Huxton
Archonet Ltd
Russell Smith wrote:
>> For the third time today, our server has crashed...
>
> I would guess it's the linux OOM if you are running linux. You need to
> turn off killing of processes when you run out of memory. Are you
> getting close to running out of memory?

Good suggestion; it was a memory leak in an add-on library that we plug in to the Postgres server.

OOM? Can you give me a quick pointer to what this acronym stands for and how I can reconfigure it? It sounds like a "feature" old UNIX systems like SGI IRIX had, where the system would allocate virtual memory that it didn't really have, then kill your process if you tried to use it. I.e. malloc() would never return NULL even if swap space was over-allocated. Is this what you're talking about? Having this enabled on a server is deadly for reliability.

Thanks,
Craig
Craig A. James wrote: > Russell Smith wrote: >>> For the third time today, our server has crashed... >> >> I would guess it's the linux OOM if you are running linux. You need to >> turn off killing of processes when you run out of memory. Are you >> getting close to running out of memory? > > Good suggestion, it was a memory leak in an add-on library that we plug > in to the Postgres server. > > OOM? Can you give me a quick pointer to what this acronym stands for > and how I can reconfigure it? Out Of Memory > It sounds like a "feature" old UNIX > systems like SGI IRIX had, where the system would allocate virtual > memory that it didn't really have, then kill your process if you tried > to use it. That's it. > I.e. malloc() would never return NULL even if swap space was > over allocated. Is this what you're talking about? Having this enabled > on a server is deadly for reliability. Indeed. See the manuals for details. Section 16.4.3 http://www.postgresql.org/docs/8.1/static/kernel-resources.html#AEN18128 -- Richard Huxton Archonet Ltd
By the way, in spite of my questions and concerns, I was *very* impressed by the recovery process. I know it might seem like old hat to you guys to watch the WAL in action, and I know on a theoretical level it's supposed to work, but watching it recover 150 separate databases and find and fix a couple of problems was very impressive. It gives me great confidence that I made the right choice to use Postgres.

Richard Huxton wrote:
>>> 2. Why didn't the database recover? Why are there two processes
>>>    that couldn't be killed?
>
> I'm guessing it didn't recover *because* there were two processes that
> couldn't be killed. Responsibility for that falls to the
> operating system. I've seen it most often with faulty drivers or
> hardware that's being communicated with/written to. However, see below.

It can't be a coincidence that these were the only two processes in a SELECT operation. Does the server disable signals at critical points?

I'd make a wild guess that this is some sort of deadlock problem -- these two servers have disabled signals for a critical section of SELECT, and are waiting for something from the postmaster, but the postmaster is dead.

This is an ordinary system, no hardware problems, stock RH FC3 kernel, stock PG 8.1.4, with 4 GB memory, and at the moment the database is running on a single SATA disk. I'm worried that a production server can get into a state that requires manual intervention to recover.

Craig
Craig A. James wrote: > By the way, in spite of my questions and concerns, I was *very* > impressed by the recovery process. I know it might seem like old hat to > you guys to watch the WAL in action, and I know on a theoretical level > it's supposed to work, but watching it recover 150 separate databases, > and find and fix a couple of problems was very impressive. It gives me > great confidence that I made the right choice to use Postgres. > > Richard Huxton wrote: >>>> 2. Why didn't the database recover? Why are there two processes >>>> that couldn't be killed? >> >> I'm guessing it didn't recover *because* there were two processes that >> couldn't be killed. Responsibility for that falls to the >> operating-system. I've seen it most often with faulty drivers or >> hardware that's being communicated with/written to. However, see below. > > It can't be a coincidence that these were the only two processes in a > SELECT operation. Does the server disable signals at critical points? If a "kill -9" as root doesn't get rid of them, I think I'm right in saying that it's a kernel-level problem rather than something else. -- Richard Huxton Archonet Ltd
Richard Huxton <dev@archonet.com> writes: > Craig A. James wrote: >> It can't be a coincidence that these were the only two processes in a >> SELECT operation. Does the server disable signals at critical points? > If a "kill -9" as root doesn't get rid of them, I think I'm right in > saying that it's a kernel-level problem rather than something else. I didn't actually see Craig say anywhere that he'd tried "kill -9" on those backends. If he did and it didn't do anything, then clearly they were in some kind of uninterruptable wait, which is certainly evidence of a kernel or hardware issue. If he didn't do "kill -9" then we don't really know what the issue was --- it's not improbable that they were stuck in some loop not containing a CHECK_FOR_INTERRUPTS() test. regards, tom lane
"Craig A. James" <cjames@modgraph-usa.com> writes: > OOM? Can you give me a quick pointer to what this acronym stands for > and how I can reconfigure it? See "Linux Memory Overcommit" at http://www.postgresql.org/docs/8.1/static/kernel-resources.html#AEN18128 or try googling for "OOM kill" for non-Postgres-specific coverage. > It sounds like a "feature" old UNIX > systems like SGI IRIX had, where the system would allocate virtual > memory that it didn't really have, then kill your process if you tried > to use it. I.e. malloc() would never return NULL even if swap space > was over allocated. Is this what you're talking about? Having this > enabled on a server is deadly for reliability. No kidding :-(. The default behavior in Linux is extremely unfortunate. regards, tom lane
Craig A. James wrote: > Richard Huxton wrote: >> If a "kill -9" as root doesn't get rid of them, I think I'm right in >> saying that it's a kernel-level problem rather than something else. > > Sorry I didn't clarify that. "kill -9" did kill them. Other signals > did not. It wasn't until I manually intervened with the "kill -9" that > the system began the recovery process. Ah, when you said "unkillable" in the first msg I jumped to the wrong conclusion. In that case, see Tom's answer. -- Richard Huxton Archonet Ltd
On 11/15/06, Craig A. James <cjames@modgraph-usa.com> wrote: > Questions: > 1. Any idea what happened and how I can avoid this? It's a *big* problem. > 2. Why didn't the database recover? Why are there two processes > that couldn't be killed? > 3. Where did the "signal 9" come from? (Nobody but me ever logs > in to the server machine.) how much memory is in the box? maybe vmstat will give you some clues about what is going on leading up to the crash...a spike in swap for example. Maybe, if you can budget for it, query logging might give some clues also. Try increasing swap. merlin
OOM stands for "Out Of Memory" and it does indeed seem to be the same as what IRIX had. I believe you can turn the feature off and also configure its overcommitment by setting something in /proc/..... and unfortunately, I don't remember more than that.

On Thu, 16 Nov 2006, Craig A. James wrote:
> OOM? Can you give me a quick pointer to what this acronym stands for and how
> I can reconfigure it? It sounds like a "feature" old UNIX systems like SGI
> IRIX had, where the system would allocate virtual memory that it didn't
> really have, then kill your process if you tried to use it. I.e. malloc()
> would never return NULL even if swap space was over allocated. Is this what
> you're talking about? Having this enabled on a server is deadly for
> reliability.
>
> Thanks,
> Craig
On Thu, 16 Nov 2006, Tom Lane wrote:
>
> "Craig A. James" <cjames@modgraph-usa.com> writes:
> > OOM? Can you give me a quick pointer to what this acronym stands for
> > and how I can reconfigure it?
>
> See "Linux Memory Overcommit" at
> http://www.postgresql.org/docs/8.1/static/kernel-resources.html#AEN18128
> or try googling for "OOM kill" for non-Postgres-specific coverage.

I did that - spent about two f-ing hours looking for what I wanted. (Guess I entered poor choices for my searches. -frown- ) There are a LOT of articles that TALK ABOUT OOM, but precious few actually tell you what you can do about it.

Trying to save you some time:

On Linux you can use the sysctl utility to muck with vm.overcommit_memory; you can disable the "feature."

Google _that_ for more info!

> > It sounds like a "feature" old UNIX
> > systems like SGI IRIX had, where the system would allocate virtual
> > memory that it didn't really have, then kill your process if you tried
> > to use it. I.e. malloc() would never return NULL even if swap space
> > was over allocated. Is this what you're talking about? Having this
> > enabled on a server is deadly for reliability.
>
> No kidding :-(. The default behavior in Linux is extremely unfortunate.
>
> regards, tom lane

That's a major understatement. The reason I spent a couple of hours looking for what I could learn on this is that I've been absolutely beside myself on this "extremely unfortunate" "feature." I had a badly behaving app (but didn't know which app it was), so Linux would kill lots of things, like, oh, say, inetd. Good luck sshing into the box. You just had to suffer with pushing the damned reset button... It must have taken at least a week before figuring out what not to do. (What I couldn't/can't understand is why the system wouldn't just refuse the bad app the memory when it was short - no, you've had enough!)

<soapbox> ...I read a large number of articles on this subject and am absolutely dumbfounded by the -ahem- idiots who think killing a random process is an appropriate action. I'm just taking their word for it that there's some kind of impossibility of the existing Linux kernel not getting itself into a potentially hung situation because it didn't save itself any memory. Frankly, if it takes a complete kernel rewrite to fix the problem that the damned operating system can't manage its own needs, then the kernel needs to be rewritten! </soapbox>

These kernel hackers could learn something from VAX/VMS.

Richard

--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy@ScienceTools.com, http://ScienceTools.com/
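To spell that out as shell (2.6-kernel semantics; verify against your kernel's Documentation/vm/overcommit-accounting before relying on it):

  # strict accounting: allocations fail up front instead of the OOM killer firing later
  sysctl -w vm.overcommit_memory=2
  # make the setting persistent across reboots
  echo 'vm.overcommit_memory = 2' >> /etc/sysctl.conf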
Richard Troy wrote:
> I did that - spent about two f-ing hours looking for what I wanted. (Guess
> I entered poor choices for my searches. -frown- ) There are a LOT of
> articles that TALK ABOUT OOM, but precious few actually tell you what you
> can do about it.
>
> Trying to save you some time:
>
> On Linux you can use the sysctl utility to muck with vm.overcommit_memory;
> you can disable the "feature."
>
> Google _that_ for more info!

Here's something I found googling for "memory overcommitment"+linux:

http://archives.neohapsis.com/archives/postfix/2000-04/0512.html

From /usr/src/linux/Documentation/sysctl/vm.txt:

"overcommit_memory:

This value contains a flag that enables memory overcommitment. When this flag is 0, the kernel checks before each malloc() to see if there's enough memory left. If the flag is nonzero, the system pretends there's always enough memory."

This flag is settable in /proc/sys/vm. Lo and behold, here it is on my system:

$ cat /proc/sys/vm/overcommit_memory
0
$ cat /proc/sys/vm/overcommit_ratio
50

Craig
"Craig A. James" <cjames@modgraph-usa.com> writes: > Here's something I found googling for "memory overcommitment"+linux > http://archives.neohapsis.com/archives/postfix/2000-04/0512.html That might have been right when it was written (note the reference to a 2.2 Linux kernel), but it's 100% wrong now. 0 is the default, not-safe setting. regards, tom lane
Tom Lane wrote: > "Craig A. James" <cjames@modgraph-usa.com> writes: >> Here's something I found googling for "memory overcommitment"+linux >> http://archives.neohapsis.com/archives/postfix/2000-04/0512.html > > That might have been right when it was written (note the reference to a > 2.2 Linux kernel), but it's 100% wrong now. 0 is the default, not-safe > setting. Is this something we could detect and show a warning for at initdb time; or perhaps log a warning in the log files at postmaster startup? This dangerous choice of settings seems to catch quite a few postgresql users.
On Sat, Nov 18, 2006 at 05:28:46PM -0800, Richard Troy wrote:
> On Linux you can use the sysctl utility to muck with vm.overcommit_memory;
> you can disable the "feature."

Be aware that there are "reasons" the "feature" exists before you "cast" "aspersions" and "quote marks" all over the place, and the "reasons" are "things" like "shared memory" and "copy on write" memory, which are "generally" considered "good things".

At one point someone complained about the ability to configure, e.g., IRIX to allow memory overcommit. I worked on some large IRIX installations where full memory accounting would have required on the order of 100s of gigabytes of swap, due to large shared memory allocations. If the swap had been configured rather than allowing overcommit, sure, it might have reduced the chance of an application allocating memory that it couldn't use. In practice, if you're 100s of gigs into swap the system isn't going to be usable anyway.

> itself any memory. Frankly, if it takes a complete kernel rewrite to fix
> the problem that the damned operating system can't manage its own needs,
> then the kernel needs to be rewritten! </soapbox>
>
> These kernel hackers could learn something from VAX/VMS.

Like how to make a slower system? You can configure a VM to operate the way they did 25 years ago, but in general people have preferred to get better performance (*much* better performance) instead. It could be that they're all idiots, or it could be that in practice the problem isn't as bad as you make it out to be. (Maybe other people are just better at configuring their memory usage?)

Mike Stone
> (Maybe other people are just
> better at configuring their memory usage?)

I don't mean to hijack the thread, but I am interested in learning the science behind configuring memory usage. A lot of the docs that I have found on this subject speak in terms of generalities and rules of thumb. Are there any resources that show how to tune the kernel's parameters to make use of all of the available memory and at the same time allow all services to get along?

From what I've read it seems that tuning kernel parameters is more of an art than a science, since most recommendations boil down to experiences based on trial and error.

Regards,
Richard Broersma Jr.
On Sun, Nov 19, 2006 at 12:42:45PM -0800, Richard Broersma Jr wrote: >I don't mean to hijack the thread, but I am interested in learning the science behind configuring >memory usage. There isn't one. You need experience, and an awareness of your particular requirements. If it were easy, it would be autotuned. :) Mike Stone
Michael Stone wrote:
> At one point someone complained about the ability to configure, e.g.,
> IRIX to allow memory overcommit. I worked on some large IRIX
> installations where full memory accounting would have required on the
> order of 100s of gigabytes of swap, due to large shared memory
> allocations.

These were mostly scientific and graphical apps where reliability took a back seat to performance and to program complexity. They would allocate 100's of GB of swap space rather than taking the time to design proper data structures. If the program crashed every week or two, no big deal -- just run it again. Overallocating memory is a valuable technique for such applications.

But overallocating memory has no place in a server environment. When memory overcommitment is allowed, it is impossible to write a reliable application, because no matter how carefully and correctly you craft your code, someone else's program that leaks memory like Elmer Fudd's rowboat after his shotgun goes off can kill your well-written application. Installing Postgres on such a system makes Postgres unreliable.

Tom Lane wrote:
> That might have been right when it was written (note the reference to a
> 2.2 Linux kernel), but it's 100% wrong now.
> [Setting /proc/sys/vm/overcommit_memory to] 0 is the default, not-safe
> setting.

I'm surprised that the Linux kernel people take such an uncritical view of reliability that they set, as *default*, a feature that makes Linux an unreliable platform for servers.

And speaking of SGI, this very issue was among the things that sank the company. As the low-end graphics cards ate into their visualization market, they tried to become an Oracle server platform. Their servers were *fast*. But they crashed -- a lot. And memory overcommit was one of the reasons. IRIX admins would brag that their systems only crashed every couple of weeks. I had HP and Sun systems that would run for years.

Craig
On Sun, Nov 19, 2006 at 02:12:01PM -0800, Craig A. James wrote: >And speaking of SGI, this very issue was among the things that sank the >company. As the low-end graphics cards ate into their visualization >market, they tried to become an Oracle Server platform. Their servers were >*fast*. But they crashed -- a lot. And memory-overcommit was one of the >reasons. You realize that it had to be turned on explicitly on IRIX, right? But don't let facts get in the way of a good rant... Mike Stone
> You realize that it had to be turned on explicitly on IRIX, right? But
> don't let facts get in the way of a good rant...

On the contrary, with IRIX 4 and earlier it was the default, but it caused so many problems that SGI switched the default to OFF in IRIX 5. But because it had been available for so long, many important apps had come to rely on it, so most sites had to immediately re-enable virtual swap on every IRIX 5 server that came in. Admins just got used to doing it, so it became a "default" at most sites, and admins often couldn't be convinced to disable it for database server machines, because "That's our standard for IRIX configuration."

I worked at a big molecular modeling/visualization company; our visualization programs *required* virtual swap, and our server programs *prohibited* virtual swap. Imagine how our sales people felt about that, telling customers that they'd have to buy two $30,000 machines just because of one kernel parameter. Of course, they didn't, and our server apps took the heat as being "unreliable."

SGI called it "virtual swap", which I always thought was a hoot. You have virtual memory, which is really your swap space, and then virtual swap, which is some kind of dark hole...

Craig
On Sun, Nov 19, 2006 at 14:12:01 -0800, "Craig A. James" <cjames@modgraph-usa.com> wrote: > > These were mostly scientific and graphical apps where reliability took a > back seat to performance and to program complexity. They would allocate > 100's of GB of swap space rather than taking the time to design proper data > structures. If the program crashed every week or two, no big deal -- just > run it again. Overallocating memory is a valuable technique for such > applications. I don't think the above applies generally. Programmers need to be aware of the working set of CPU bound apps. If the program is constantly paging, the performance is going to be abysmal.
We're getting off-topic for this group, except that this is *why* we're plagued with this problem, so I'll make one more observation.

Bruno Wolff III wrote:
>> They would allocate
>> 100's of GB of swap space rather than taking the time to design proper data
>> structures. If the program crashed every week or two, no big deal -- just
>> run it again. Overallocating memory is a valuable technique for such
>> applications.
>
> I don't think the above applies generally. Programmers need to be aware of
> the working set of CPU bound apps. If the program is constantly paging,
> the performance is going to be abysmal.

You're doing planetary number crunching, so you allocate an (x,y,z) space with 10^6 points on every axis, or 10^18 points, for roughly 2^60 64-bit floating-point numbers, or about 2^63 bytes. But your space is mostly empty (just like real space). Where there's a planet or star, you're actually using memory; everywhere else, the memory is never referenced.

So you're using just an ordinary amount of memory, but you can access your planet's data by simply referencing a three-dimensional array in FORTRAN. Just allow infinite overcommit, let the OS do the rest, and it actually works. A lot of molecular modeling works this way too.

This is how these applications were/are actually written. I'm not defending the method, just pointing out that it's real and it's one of the reasons virtual swap was invented.

Craig
if [ `cat /proc/sys/vm/overcommit_memory` == 0 ]; then
    echo "WARNING: Watch out for the oom-killer! Consider echo 2 > /proc/sys/vm/overcommit_memory"
fi

----- Original Message -----
From: "Ron Mayer" <rm_pg@cheapcomplexdevices.com>
To: <pgsql-performance@postgresql.org>
Sent: Sunday, November 19, 2006 6:07 PM
Subject: Re: Postgres server crash

> Tom Lane wrote:
>> "Craig A. James" <cjames@modgraph-usa.com> writes:
>>> Here's something I found googling for "memory overcommitment"+linux
>>> http://archives.neohapsis.com/archives/postfix/2000-04/0512.html
>>
>> That might have been right when it was written (note the reference to a
>> 2.2 Linux kernel), but it's 100% wrong now. 0 is the default, not-safe
>> setting.
>
> Is this something we could detect and show a warning for at initdb
> time; or perhaps log a warning in the log files at postmaster startup?
> This dangerous choice of settings seems to catch quite a few postgresql
> users.
Hi, Richard,

Richard Troy wrote:
> The reason I spent a couple of hours looking for what I could learn on
> this is that I've been absolutely beside myself on this "extremely
> unfortunate" "feature." I had a badly behaving app (but didn't know which
> app it was), so Linux would kill lots of things, like, oh, say, inetd.

Actually, AFAICT, Linux tries to kill the process(es) that use the most memory resources first. Without overcommitment, the OOM killer won't kick in, and as long as the hoggy applications don't actually exit when malloc fails, they will just stay around, sitting on their memory.

> Good luck sshing into the box.

SSHing into the box will get even worse without overcommitment. When the machine is stuck, the sshd that tries to spawn its child will get the out-of-memory error, and you won't be able to log in.

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org www.nosoftwarepatents.org
On Sat, Nov 18, 2006 at 05:28:46PM -0800, Richard Troy wrote: > <soapbox> ...I read a large number of articles on this subject and am > absolutely dumbfounded by the -ahem- idiots who think killing a random > process is an appropriate action. I'm just taking their word for it that > there's some kind of impossibility of the existing Linux kernel not > getting itself into a potentially hung situation because it didn't save > itself any memory. Frankly, if it takes a complete kernel rewrite to fix > the problem that the damned operating system can't manage its own needs, > then the kernel needs to be rewritten! </soapbox> > > These kernel hackers could learn something from VAX/VMS. What's interesting is that apparently FreeBSD also has overcommit (and IIRC no way to disable it), yet I never hear people going off on OOM kills in FreeBSD. My theory is that FreeBSD admins are smart enough to dedicate a decent amount of swap space, so that by the time you got to an OOM kill situation you'd be so far into swapping that the box would be nearly unusable. Many linux 'admins' think it's ok to save a few GB of disk space by allocating a small amount of swap (or none at all), and *kaboom*. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Sun, Nov 26, 2006 at 05:41:02PM -0600, Jim C. Nasby wrote:
> What's interesting is that apparently FreeBSD also has overcommit (and
> IIRC no way to disable it), yet I never hear people going off on OOM
> kills in FreeBSD.

Could just be that nobody is using FreeBSD. <ducks>

Seriously, though, there are so many Linux installations out there that you're statistically more likely to see corner cases. FWIW, I know a whole lotta Linux machines that have never had OOM-killer problems. Remember, though, that anecdotal evidence isn't.

Mike Stone
* Jim C. Nasby: > What's interesting is that apparently FreeBSD also has overcommit (and > IIRC no way to disable it), yet I never hear people going off on OOM > kills in FreeBSD. My theory is that FreeBSD admins are smart enough to > dedicate a decent amount of swap space, so that by the time you got to > an OOM kill situation you'd be so far into swapping that the box would > be nearly unusable. I've seen OOM situations with our regular "use twice as much swap space as there is RAM in the machine" rule. Perhaps Linux is just paging too fast for this to work. 8-P By the way, the sysctl setting breaks some applications and programming environments, such as SBCL and CMUCL.