Thread: Best COPY Performance

Best COPY Performance

From
"Worky Workerson"
Date:
What is the best COPY performance that you have gotten on a "normal" table?

I know that this question is almost too general, but it might help
me out a bit, or at least give me the right things to tweak.  Perhaps
the question can be rewritten as "Where are the major bottlenecks in a
COPY?" or "How can I compute the max theoretical COPY performance for
my hardware?".  The two subquestions that I have from this are:
  -Are my ETL scripts (perl) maximizing the database COPY speeds?
  -Can I tweak my DB further to eke out a bit more performance?

I'm using perl to ETL a decent sized data set (10 million records) and
then loading it through perl::DBI's copy.  I am currently getting
between 10K and 15K inserts/second.  I've profiled the ETL scripts a
bit and have performance-improved a lot of the code, but I'd like to
determine whether it makes sense to try and further optimize my Perl
or count it as "done" and look for improvements elsewhere.
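
A loader of this shape looks roughly like the following (a sketch, not my
actual code; it assumes a DBD::Pg recent enough to have the
pg_putcopydata/pg_putcopyend COPY API, and the table and connection
details are placeholders):

  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'loader', '',
                         { AutoCommit => 0, RaiseError => 1 });

  $dbh->do('COPY my_table FROM STDIN');
  while (my $line = <STDIN>) {
      # ... per-record ETL transformations happen here ...
      $dbh->pg_putcopydata($line);   # one tab-separated row per line
  }
  $dbh->pg_putcopyend();             # finish the COPY stream
  $dbh->commit;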

I ran a trivial little insert test into a table with a single integer
column and came close to 250K inserts/second using psql's \copy, so I'm
thinking that my code could be optimized a bit more, but wanted to check
around to see if that was the case.

I am most interested in loading two tables, one with about 21 (small)
VARCHARs where each record is about 200 bytes, and another with 7
INTEGERs, 3 TIMESTAMPs, and 1 BYTEA where each record is about 350
bytes.

I have implemented most of the various bits of PG config advice that I
have seen, both here and with some googling, such as:

 wal_buffers=128
 checkpoint_segments=128
 checkpoint_timeout=3000

Software:  PG 8.1.3 on RHEL 4.3 x86_64
Hardware: Quad Dual-core Opteron, Fibre Channel SAN with 256M BBC

Thanks!

Re: Best COPY Performance

From
"Merlin Moncure"
Date:
On 10/21/06, Worky Workerson <worky.workerson@gmail.com> wrote:
> What is the best COPY performance that you have gotten on a "normal" table?
>
> I know that this question is almost too general, but it might help
> me out a bit, or at least give me the right things to tweak.  Perhaps
> the question can be rewritten as "Where are the major bottlenecks in a
> COPY?" or "How can I compute the max theoretical COPY performance for
> my hardware?".  The two subquestions that I have from this are:
>   -Are my ETL scripts (perl) maximizing the database COPY speeds?
>   -Can I tweak my DB further to eke out a bit more performance?
>
> I'm using perl to ETL a decent sized data set (10 million records) and
> then loading it through perl::DBI's copy.  I am currently getting
> between 10K and 15K inserts/second.  I've profiled the ETL scripts a
> bit and have performance-improved a lot of the code, but I'd like to
> determine whether it makes sense to try and further optimize my Perl
> or count it as "done" and look for improvements elsewhere.
>
> I ran a trivial little insert test into a table with a single integer
> column and came close to 250K inserts/second using psql's \copy, so I'm
> thinking that my code could be optimized a bit more, but wanted to check
> around to see if that was the case.
>
> I am most interested in loading two tables, one with about 21 (small)
> VARCHARs where each record is about 200 bytes, and another with 7
> INTEGERs, 3 TIMESTAMPs, and 1 BYTEA where each record is about 350
> bytes.

indexes/keys?  more memory for sorting during index creation can have
a dramatic effect on bulk insert performance.  check for pgsql_tmp
folders popping up during the copy run.

> I have implemented most of the various bits of PG config advice that I
> have seen, both here and with some googling, such as:
>
>  wal_buffers=128
>  checkpoint_segments=128
>  checkpoint_timeout=3000
>
> Software:  PG 8.1.3 on RHEL 4.3 x86_64
> Hardware: Quad Dual-core Opteron, Fibre Channel SAN with 256M BBC

for a table light on indexes, 10-15k rows/sec for copy is pretty poor.  you
can get pretty close to that with raw inserts on good hardware.  I would
suggest configuring your perl script to read from stdin and write to
stdout, and pipe it to psql using copy from stdin.  then just
benchmark your perl script redirecting output to a file.
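
something like this (a sketch; the split is a stand-in for whatever your
real transformation does):

  #!/usr/bin/perl
  # filter.pl -- read raw records on stdin, emit COPY text format on stdout
  use strict;
  use warnings;

  while (my $line = <STDIN>) {
      chomp $line;
      my @fields = split /,/, $line;    # stand-in for the real ETL work
      print join("\t", @fields), "\n";  # COPY text format: tab-separated
  }

then load with 'perl filter.pl < raw.csv | psql -c "COPY my_table FROM
STDIN" mydb', and benchmark the perl alone with
'perl filter.pl < raw.csv > /dev/null'.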

merlin

Re: Best COPY Performance

From
Markus Schaber
Date:
Hi, Worky,

Worky Workerson wrote:
> I am currently getting
> between 10K and 15K inserts/second.

> I ran a trivial little insert test into a table with a single integer
> column and came close to 250K inserts/second using psql's \copy, so I'm
> thinking that my code could be optimized a bit more, but wanted to check
> around to see if that was the case.

Could you COPY one of your tables out to disk via psql, and then COPY it
back into the database, to reproduce this measurement with your real data?

Also, how high are the disk load and CPU usage?

As long as psql is a factor of 20 better than your perl script, I think
that the perl interface is what should be optimized.

On a table with no indices, triggers and constraints, we managed to
COPY about 7-8 megabytes/second with psql over our 100 MBit network, so
here the network was the bottleneck.

You should think about making your perl program write the COPY data
as text, and piping it into psql.

HTH,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

Re: Best COPY Performance

From
"Luke Lonergan"
Date:
Markus,

On 10/23/06 2:27 AM, "Markus Schaber" <schabi@logix-tt.com> wrote:

> On a table with no indices, triggers and constraints, we managed to
> COPY about 7-8 megabytes/second with psql over our 100 MBit network, so
> here the network was the bottleneck.

We routinely get 10-12MB/s on I/O hardware that can sustain a sequential
write rate of 60+ MB/s with the WAL and data on the same disks.

It depends on a few things you might not consider, including the number and
type of columns in the table and the client and server encoding.  The
fastest results are with more columns in a table and when the client and
server encoding are the same.

- Luke



Re: Best COPY Performance

From
"Worky Workerson"
Date:
> > I am most interested in loading two tables, one with about 21 (small)
> > VARCHARs where each record is about 200 bytes, and another with 7
> > INTEGERs, 3 TIMESTAMPs, and 1 BYTEA where each record is about 350
> > bytes.
>
> indexes/keys?  more memory for sorting during index creation can have
> a dramatic effect on bulk insert performance.  check for pgsql_tmp
> folders popping up during the copy run.

The only index on load is a single IP4 btree primary key, which I
figure should function about like an INTEGER.

> for a table light on indexes, 10-15k rows/sec for copy is pretty poor.  you
> can get pretty close to that with raw inserts on good hardware.  I would
> suggest configuring your perl script to read from stdin and write to
> stdout, and pipe it to psql using copy from stdin.  then just
> benchmark your perl script redirecting output to a file.

So simple and I hadn't thought of that ... thanks.  When I pre-create a
COPY file, I can load it at about 45K inserts/sec (file was 1.8GB or
14.5 million records in 331 seconds), which looks like it's about 5.5
MB/s.  I'm loading from a local 15K SCSI320 RAID10 (which also
contains the PG log files) to a 10K SCSI320 RAID10 on an FC SAN.  Does
this look more consistent with "decent" performance, or should I go
looking into some hardware issues, i.e. SAN configuration?  I've
currently got several hats including hardware/systems/security admin,
as well as DBA and programmer, and my SAN setup skills could
definitely use some more work.

Hardware aside, my perl can definitely use some work, and it seems to
be mostly the CSV stuff that I am using, mostly for convenience.  I'll
see if I can't redo some of that to eliminate some CSV processing, or,
barring that, multithread the process to utilize more of the CPUs.
Part of the reason that I hadn't used psql in the first place is that
I'm loading the data into partitioned tables, and the loader keeps
several COPY connections open at a time to load the data into the
right table.  I guess I could just as easily keep several psql pipes
open, but it seemed cleaner to go through DBI.
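
If I do go the psql route, I'm picturing something like this (a sketch;
partition names and the routing logic are stand-ins):

  use strict;
  use warnings;

  # one writable psql pipe per partition
  my %pipe;
  for my $part (qw(my_table_p1 my_table_p2)) {
      open $pipe{$part}, '|-', qq{psql -c "COPY $part FROM STDIN" mydb}
          or die "can't start psql for $part: $!";
  }

  while (my $row = <STDIN>) {
      print { $pipe{ pick_partition($row) } } $row;
  }
  close $_ or die "psql exited abnormally" for values %pipe;

  # stub: the real loader routes on a key field in the row
  sub pick_partition { return 'my_table_p1' }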

Re: Best COPY Performance

From
"Jim C. Nasby"
Date:
On Mon, Oct 23, 2006 at 11:10:19AM -0400, Worky Workerson wrote:
> >> I am most interested in loading two tables, one with about 21 (small)
> >> VARCHARs where each record is about 200 bytes, and another with 7
> >> INTEGERs, 3 TIMESTAMPs, and 1 BYTEA where each record is about 350
> >> bytes.
> >
> >indexes/keys?  more memory for sorting during index creation can have
> >a dramatic effect on bulk insert performance.  check for pgsql_tmp
> >folders popping up during the copy run.
>
> The only index on load is a single IP4 btree primary key, which I
> figure should function about like an INTEGER.
>
> >for a table light on indexes, 10-15k rows/sec for copy is pretty poor.  you
> >can get pretty close to that with raw inserts on good hardware.  I would
> >suggest configuring your perl script to read from stdin and write to
> >stdout, and pipe it to psql using copy from stdin.  then just
> >benchmark your perl script redirecting output to a file.
>
> So simple and I hadn't thought of that ... thanks.  When I pre-create a
> COPY file, I can load it at about 45K inserts/sec (file was 1.8GB or
> 14.5 million records in 331 seconds), which looks like it's about 5.5
> MB/s.  I'm loading from a local 15K SCSI320 RAID10 (which also
> contains the PG log files) to a 10K SCSI320 RAID10 on an FC SAN.  Does
> this look more consistent with "decent" performance, or should I go
> looking into some hardware issues, i.e. SAN configuration?  I've
> currently got several hats including hardware/systems/security admin,
> as well as DBA and programmer, and my SAN setup skills could
> definitely use some more work.
>
> Hardware aside, my perl can definitely use some work, and it seems to
> be mostly the CSV stuff that I am using, mostly for convenience.  I'll
> see if I can't redo some of that to eliminate some CSV processing, or,
> barring that, multithread the process to utilize more of the CPUs.
> Part of the reason that I hadn't used psql in the first place is that
> I'm loading the data into partitioned tables, and the loader keeps
> several COPY connections open at a time to load the data into the
> right table.  I guess I could just as easily keep several psql pipes
> open, but it seemed cleaner to go through DBI.

http://stats.distributed.net used to use a perl script to do some
transformations before loading data into the database. IIRC, when we
switched to using C we saw 100x improvement in speed, so I suspect that
if you want performance perl isn't the way to go. I think you can
compile perl into C, so maybe that would help some.

Ultimately, you might be best off using triggers instead of rules for the
partitioning, since then you could use COPY. Or go to raw insert commands
that are wrapped in a transaction.
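
COPY doesn't fire rules, but it does fire row triggers, which is why the
trigger approach works. A sketch of such a routing trigger (all names
hypothetical, and VALUES (NEW.*) needs a reasonably recent plpgsql),
issued here via DBI to match the loader:

  $dbh->do(<<'SQL');   # DDL shown via DBI for illustration
  CREATE OR REPLACE FUNCTION my_table_route() RETURNS trigger AS $$
  BEGIN
      IF NEW.ts < DATE '2006-02-01' THEN
          INSERT INTO my_table_p1 VALUES (NEW.*);
      ELSE
          INSERT INTO my_table_p2 VALUES (NEW.*);
      END IF;
      RETURN NULL;  -- suppress the insert into the parent table
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER my_table_route_trig
      BEFORE INSERT ON my_table
      FOR EACH ROW EXECUTE PROCEDURE my_table_route();
  SQL
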
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Best COPY Performance

From
"Joshua D. Drake"
Date:
> Ultimately, you might be best off using triggers instead of rules for the
> partitioning, since then you could use COPY. Or go to raw insert commands
> that are wrapped in a transaction.

My experience is that triggers are quite a bit faster than rules in any
kind of partitioning that involves more than say 7 tables.

Sincerely,

Joshua D. Drake


--

   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
   Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


Re: Best COPY Performance

From
"Craig A. James"
Date:
Jim C. Nasby wrote:
> http://stats.distributed.net used to use a perl script to do some
> transformations before loading data into the database. IIRC, when we
> switched to using C we saw 100x improvement in speed, so I suspect that
> if you want performance perl isn't the way to go. I think you can
> compile perl into C, so maybe that would help some.

I use Perl extensively, and have never seen a performance problem.  I
suspect the perl-to-C "100x improvement" was due to some other factor, like
a slight change in the schema, indexes, or the fundamental way the client
(C vs Perl) handled the data during the transformation, or just plain bad
Perl code.

Modern scripting languages like Perl and Python make programmers far, far
more productive than the bad old days of C/C++.  Don't shoot yourself in
the foot by reverting to low-level languages like C/C++ until you've
exhausted all other possibilities.  I only use C/C++ for intricate
scientific algorithms.

In many cases, Perl is *faster* than C/C++ code that I write, because I
can't take the time (for example) to write the high-performance string
manipulation routines that have been fine-tuned and extensively optimized
in Perl.

Craig


Re: Best COPY Performance

From
"Worky Workerson"
Date:
Markus,

> Could you COPY one of your tables out to disk via psql, and then COPY it
> back into the database, to reproduce this measurement with your real data?

$ psql -c "COPY my_table TO STDOUT" > my_data
$ ls my_data
2018792 edgescape_pg_load
$ time cat my_data | psql -c "COPY mytable FROM STDIN"
real    5m43.194s
user    0m35.412s
sys     0m9.567s

> Also, how much is the disk load, and CPU usage?

When I am loading via the perl (which I've established is a
bottleneck), one CPU core is at 99% for the perl and another is at
30% for a postmaster, vs about 90% for the postmaster when going
through psql.

The disk load is where I start to get a little fuzzy, as I haven't
played with iostat to figure what is "normal".  The local drives
contain PG_DATA as well as all the log files, but there is a
tablespace on the FibreChannel SAN that contains the destination
table.  The disk usage pattern that I see is that there is a ton of
consistent activity on the local disk, with iostat reporting an
average of 30K Blk_wrtn/s, which I assume is the log files.  Every
several seconds there is a massive burst of activity on the FC
partition, to the tune of 250K Blk_wrtn/s.

> On a table with no indices, triggers and constraints, we managed to
> COPY about 7-8 megabytes/second with psql over our 100 MBit network, so
> here the network was the bottleneck.

hmm, this makes me think that either my PG config is really lacking,
or that the SAN is badly misconfigured, as I would expect it to
outperform a 100Mb network.  As it is, with a straight pipe to psql
COPY, I'm only working with a little over 5.5 MB/s.  Could this be due
to the primary key index updates?

Thanks!

Re: Best COPY Performance

From
"Jim C. Nasby"
Date:
On Mon, Oct 23, 2006 at 03:37:47PM -0700, Craig A. James wrote:
> Jim C. Nasby wrote:
> >http://stats.distributed.net used to use a perl script to do some
> >transformations before loading data into the database. IIRC, when we
> >switched to using C we saw 100x improvement in speed, so I suspect that
> >if you want performance perl isn't the way to go. I think you can
> >compile perl into C, so maybe that would help some.
>
> I use Perl extensively, and have never seen a performance problem.  I
> suspect the perl-to-C "100x improvement" was due to some other factor, like
> a slight change in the schema, indexes, or the fundamental way the client
> (C vs Perl) handled the data during the transformation, or just plain bad
> Perl code.
>
> Modern scripting languages like Perl and Python make programmers far, far
> more productive than the bad old days of C/C++.  Don't shoot yourself in
> the foot by reverting to low-level languages like C/C++ until you've
> exhausted all other possibilities.  I only use C/C++ for intricate
> scientific algorithms.
>
> In many cases, Perl is *faster* than C/C++ code that I write, because I
> can't take the time (for example) to write the high-performance string
> manipulation routines that have been fine-tuned and extensively optimized
> in Perl.

Well, the code is all at
http://cvs.distributed.net/viewcvs.cgi/stats-proc/hourly/ (see logmod
directory and logmod_*.pl). There have been changes made to the C code
since we changed over, but you can find the appropriate older versions
in there. IIRC, nothing in the database changed when we went from perl
to C (it's likely that was the *only* change that happened anywhere
around that time).
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Best COPY Performance

From
"Jim C. Nasby"
Date:
On Tue, Oct 24, 2006 at 09:17:08AM -0400, Worky Workerson wrote:
> >http://stats.distributed.net used to use a perl script to do some
> >transformations before loading data into the database. IIRC, when we
> >switched to using C we saw 100x improvement in speed, so I suspect that
> >if you want performance perl isn't the way to go. I think you can
> >compile perl into C, so maybe that would help some.
>
> Like Craig mentioned, I have never seen those sorts of improvements
> going from perl->C, and developer efficiency is primo for me.  I've
> profiled most of the stuff, and have used XS modules and Inline::C on
> the appropriate, often used functions, but I still think that it comes
> down to my using CSV and Text::CSV_XS.  Even though it's XS, CSV is
> still a pain in the ass.
>
> >Ultimately, you might be best off using triggers instead of rules for the
> >partitioning, since then you could use COPY. Or go to raw insert commands
> >that are wrapped in a transaction.
>
> Eh, I've put the partition loading logic in the loader, which seems to
> work out pretty well, especially since I keep things sorted and am the
> only one inserting into the DB and do so with bulk loads.  But I'll
> keep this in mind for later use.

Well, given that perl is using an entire CPU, it sounds like you should
start looking either at ways to remove some of the overhead from perl,
or to split that perl into multiple processes.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Best COPY Performance

From
"Worky Workerson"
Date:
> http://stats.distributed.net used to use a perl script to do some
> transformations before loading data into the database. IIRC, when we
> switched to using C we saw 100x improvement in speed, so I suspect that
> if you want performance perl isn't the way to go. I think you can
> compile perl into C, so maybe that would help some.

Like Craig mentioned, I have never seen those sorts of improvements
going from perl->C, and developer efficiency is primo for me.  I've
profiled most of the stuff, and have used XS modules and Inline::C on
the appropriate, often used functions, but I still think that it comes
down to my using CSV and Text::CSV_XS.  Even though it's XS, CSV is
still a pain in the ass.
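
The hot loop is basically this shape (a sketch; the real per-field
munging is elided):

  use strict;
  use warnings;
  use Text::CSV_XS;

  my $csv = Text::CSV_XS->new({ binary => 1 });
  while (my $line = <STDIN>) {
      $csv->parse($line) or die "bad CSV: $line";
      my @fields = $csv->fields;
      # ... per-field transformations ...
      print join("\t", @fields), "\n";   # emit COPY text format
  }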

> Ultimately, you might be best off using triggers instead of rules for the
> partitioning, since then you could use COPY. Or go to raw insert commands
> that are wrapped in a transaction.

Eh, I've put the partition loading logic in the loader, which seems to
work out pretty well, especially since I keep things sorted and am the
only one inserting into the DB and do so with bulk loads.  But I'll
keep this in mind for later use.

Thanks!

Re: Best COPY Performance

From
"Craig A. James"
Date:
Jim C. Nasby wrote:
> Well, given that perl is using an entire CPU, it sounds like you should
> start looking either at ways to remove some of the overhead from perl,
> or to split that perl into multiple processes.

I use Perl for big database copies (usually with some
processing/transformation along the way) and I've never seen 100% CPU
usage except for brief periods, even when copying BLOBS and such.  My
typical copy divides operations into blocks, for example doing

  N = 0
  while (more rows to go) {
     begin transaction
     select ... where primary_key > N order by primary_key limit 1000
     while (fetch a row)
        insert into ...
     N = (highest value found in last block)
     commit
   }

Doing it like this in Perl should keep Postgres busy, with Perl using only
moderate resources.  If you're seeing high Perl CPU usage, I'd look first
at the Perl code.
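
In runnable Perl that block structure comes out roughly as follows (a
sketch; $src and $dst are assumed DBI handles with AutoCommit on, and the
table/column names are hypothetical):

  my $n = 0;
  while (1) {
      my $rows = $src->selectall_arrayref(
          'SELECT id, payload FROM src_table WHERE id > ? ORDER BY id LIMIT 1000',
          undef, $n);
      last unless @$rows;
      $dst->begin_work;
      for my $row (@$rows) {
          # ... processing/transformation of each row goes here ...
          $dst->do('INSERT INTO dst_table (id, payload) VALUES (?, ?)',
                   undef, @$row);
      }
      $n = $rows->[-1][0];   # highest key seen in this block
      $dst->commit;
  }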

Craig

Re: Best COPY Performance

From
Alex Stapleton
Date:
On 23 Oct 2006, at 22:59, Jim C. Nasby wrote:
> http://stats.distributed.net used to use a perl script to do some
> transformations before loading data into the database. IIRC, when we
> switched to using C we saw 100x improvement in speed, so I suspect that
> if you want performance perl isn't the way to go. I think you can
> compile perl into C, so maybe that would help some.


http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=perl&lang2=gcc

100x isn't totally impossible if that is even vaguely accurate and
you happen to be using bits of Perl which are a lot slower than the C
implementation would be...  The slowest things appear to involve calling
functions; all the slowest tests involve lots of function calls.

Re: Best COPY Performance

From
"Worky Workerson"
Date:
On 10/25/06, Craig A. James <cjames@modgraph-usa.com> wrote:
> Jim C. Nasby wrote:
> > Well, given that perl is using an entire CPU, it sounds like you should
> > start looking either at ways to remove some of the overhead from perl,
> > or to split that perl into multiple processes.
>
> I use Perl for big database copies (usually with some processing/transformation along the
> way) and I've never seen 100% CPU usage except for brief periods, even when copying
> BLOBS and such.  My typical copy divides operations into blocks, for example doing

I'm just doing CSV style transformations (and calling a lot of
functions along the way), but the end result is a straight bulk load
of data into a blank database.  And we've established that Postgres
can do *way* better than what I am seeing, so it's not surprising that
perl is using 100% of a CPU.

However, I am still curious as to the rather slow COPYs from psql to
local disks.  Like I mentioned previously, I was only seeing about 5.7
MB/s (1.8 GB / 330 seconds), where it seemed like others were doing
substantially better.  What sorts of things should I look into?

Thanks!

Re: Best COPY Performance

From
Markus Schaber
Date:
Hi, Worky,

Worky Workerson wrote:

> $ psql -c "COPY my_table TO STDOUT" > my_data
> $ ls my_data
> 2018792 edgescape_pg_load
> $ time cat my_data | psql -c "COPY mytable FROM STDIN"
> real    5m43.194s
> user    0m35.412s
> sys     0m9.567s

That's via psql, and you get about 5 MB/sec.

>> On a table with no indices, triggers and constraints, we managed to
>> COPY about 7-8 megabytes/second with psql over our 100 MBit network, so
>> here the network was the bottleneck.
>
> hmm, this makes me think that either my PG config is really lacking,
> or that the SAN is badly misconfigured, as I would expect it to
> outperform a 100Mb network.  As it is, with a straight pipe to psql
> COPY, I'm only working with a little over 5.5 MB/s.  Could this be due
> to the primary key index updates?

Yes, index updates cause both CPU load and random disk access (which is
slow by nature).
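
A common workaround, if nothing else writes to the table during the load,
is to drop the index for the load and rebuild it afterwards (a sketch;
constraint and column names hypothetical):

  $dbh->do('ALTER TABLE my_table DROP CONSTRAINT my_table_pkey');
  # ... run the COPY load here ...
  $dbh->do('ALTER TABLE my_table ADD PRIMARY KEY (ip)');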


HTH,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

Re: Best COPY Performance

From
"Merlin Moncure"
Date:
On 10/23/06, Worky Workerson <worky.workerson@gmail.com> wrote:
> The disk load is where I start to get a little fuzzy, as I haven't
> played with iostat to figure what is "normal".  The local drives
> contain PG_DATA as well as all the log files, but there is a
> tablespace on the FibreChannel SAN that contains the destination
> table.  The disk usage pattern that I see is that there is a ton of
> consistent activity on the local disk, with iostat reporting an
> average of 30K Blk_wrtn/s, which I assume is the log files.  Every
> several seconds there is a massive burst of activity on the FC
> partition, to the tune of 250K Blk_wrtn/s.
>
> > On a table with no indices, triggers and constraints, we managed to
> > COPY about 7-8 megabytes/second with psql over our 100 MBit network, so
> > here the network was the bottleneck.

I'm guessing the high bursts are checkpoints.  Can you check your log
files for pg and see if you are getting warnings about checkpoint
frequency?   You can get some mileage here by increasing wal files.

Have you determined that pg is not swapping?  try upping maintenance_work_mem.

What exactly is your architecture?  is your database server direct
attached to the san? if so, 2gb/4gb fc?  what san?  have you bonnie++
the san?  basically, you can measure iowait to see if pg is waiting on
your disks.

regarding perl, imo the language performance is really about which
libraries you use. the language itself is plenty fast.

merlin

Re: Best COPY Performance

From
"Luke Lonergan"
Date:
Mr. Worky Workerson,

On 10/25/06 5:03 AM, "Worky Workerson" <worky.workerson@gmail.com> wrote:

> However, I am still curious as to the rather slow COPYs from psql to
> local disks.  Like I mentioned previously, I was only seeing about 5.7
> MB/s (1.8 GB / 330 seconds), where it seemed like others were doing
> substantially better.  What sorts of things should I look into?

It's probable that you have a really poorly performing disk configuration.
Judging from earlier results, you may only be getting 3 x 5.7 = 17 MB/s of
write performance to your disks, which is about 1/4 of a single disk drive.

Please run this test and report the time here:

1) Calculate the size of 2x memory in 8KB blocks:
  # of blocks = 250,000 x memory_in_GB

Example:
  250,000 x 16GB = 4,000,000 blocks

2) Benchmark the time taken to write 2x RAM sequentially to your disk:
  time bash -c "dd if=/dev/zero of=bigfile bs=8k count=<# of blocks> && sync"

3) Benchmark the time taken to read same:
  time dd if=bigfile of=/dev/null bs=8k

- Luke



Re: Best COPY Performance

From
"Jim C. Nasby"
Date:
On Tue, Oct 24, 2006 at 10:36:04PM -0700, Craig A. James wrote:
> Jim C. Nasby wrote:
> >Well, given that perl is using an entire CPU, it sounds like you should
> >start looking either at ways to remove some of the overhead from perl,
> >or to split that perl into multiple processes.
>
> I use Perl for big database copies (usually with some
> processing/transformation along the way) and I've never seen 100% CPU usage
> except for brief periods, even when copying BLOBS and such.  My typical
> copy divides operations into blocks, for example doing
>
>  N = 0
>  while (more rows to go) {
>     begin transaction
>     select ... where primary_key > N order by primary_key limit 1000
>     while (fetch a row)
>        insert into ...
>     N = (highest value found in last block)
>     commit
>   }
>
> Doing it like this in Perl should keep Postgres busy, with Perl using only
> moderate resources.  If you're seeing high Perl CPU usage, I'd look first
> at the Perl code.

Wait... so you're using perl to copy data between two tables? And using
a cursor to boot? I can't think of any way that could be more
inefficient...

What's wrong with a plain old INSERT INTO ... SELECT? Or if you really
need to break it into multiple transaction blocks, at least don't
shuffle the data from the database into perl and then back into the
database; do an INSERT INTO ... SELECT with that same where clause.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Best COPY Performance

From
"Jim C. Nasby"
Date:
On Wed, Oct 25, 2006 at 08:03:38AM -0400, Worky Workerson wrote:
> I'm just doing CSV style transformations (and calling a lot of
> functions along the way), but the end result is a straight bulk load
> of data into a blank database.  And we've established that Postgres
> can do *way* better than what I am seeing, so it's not surprising that
> perl is using 100% of a CPU.

If you're loading into an empty database, there are a number of tricks
that will help you (a rough config sketch follows the list):

Turn off fsync
Add constraints and indexes *after* you've loaded the data (best to add
as many of them as possible on a per-table basis right after the table
is loaded so that it's hopefully still in cache)
Crank up maintenance_work_mem, especially for tables that won't fit into
cache anyway
Bump up checkpoint_segments and wal_buffers
Disable PITR
Create a table and load its data in a single transaction (8.2 will
avoid writing any WAL data if you do this and PITR is turned off)
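
As a rough postgresql.conf sketch of the above (values illustrative only,
and fsync = off is only safe when you can rebuild everything from the
source data):

 fsync = off
 maintenance_work_mem = 524288
 wal_buffers = 128
 checkpoint_segments = 128
 archive_command = ''        # PITR disabled
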
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Best COPY Performance

From
"Merlin Moncure"
Date:
On 10/25/06, Worky Workerson <worky.workerson@gmail.com> wrote:
> > I'm guessing the high bursts are checkpoints.  Can you check your log
> > files for pg and see if you are getting warnings about checkpoint
> > frequency?   You can get some mileage here by increasing wal files.
>
> Nope, nothing in the log.  I have set:
>  wal_buffers=128
>  checkpoint_segments=128
>  checkpoint_timeout=3000
> which I thought was rather generous.  Perhaps I should set it even
> higher for the loads?
>
> > Have you determined that pg is not swapping?  try upping maintenance_work_mem.
>
> maintenance_work_mem = 524288 ... should I increase it even more?
> Doesn't look like pg is swapping ...

nah, you already addressed it.  either pg is swapping or it isn't, and
i'm guessing it isn't.

> I'm currently running bonnie++ with the defaults ... should I change
> the execution to better mimic Postgres' behavior?

just post what you have...

> RHEL 4.3 x86_64
> HP DL585, 4 Dual Core Opteron 885s
>   16 GB RAM
>   2x300GB 10K SCSI320, RAID10
> HP MSA1000 SAN direct connected via single 2GB Fibre Channel Arbitrated Loop
>   10x300GB 10K SCSI320, RAID10

in theory, with 10 10k disks in raid 10, you should be able to keep
your 2fc link saturated all the time unless your i/o is extremely
random.  random i/o is the wild card here; ideally you should see at
least 2000 seeks in bonnie...let's see what comes up.

hopefully, bonnie will report close to 200 mb/sec.  in extreme
sequential cases, the 2fc link should be a bottleneck if the raid
controller is doing its job.

if you are having cpu issues, try breaking your process down into at
least 4 processes (you have a quad dual-core box after all)...that's a
no brainer.
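
e.g. something like this (a sketch; chunk file names are placeholders,
pre-split with split(1) or similar):

  use strict;
  use warnings;

  my @chunks = ('chunk.aa', 'chunk.ab', 'chunk.ac', 'chunk.ad');
  for my $chunk (@chunks) {
      my $pid = fork;
      die "fork failed: $!" unless defined $pid;
      next if $pid;                    # parent: keep spawning
      # child: stream one chunk through a COPY of its own
      system(qq{psql -c "COPY my_table FROM STDIN" mydb < $chunk}) == 0
          or die "load of $chunk failed";
      exit 0;
  }
  1 while waitpid(-1, 0) > 0;          # reap all the loaders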

merlin

Re: Best COPY Performance

From
"Worky Workerson"
Date:
> I'm guessing the high bursts are checkpoints.  Can you check your log
> files for pg and see if you are getting warnings about checkpoint
> frequency?   You can get some mileage here by increasing wal files.

Nope, nothing in the log.  I have set:
 wal_buffers=128
 checkpoint_segments=128
 checkpoint_timeout=3000
which I thought was rather generous.  Perhaps I should set it even
higher for the loads?

> Have you determined that pg is not swapping?  try upping maintenance_work_mem.

maintenance_work_mem = 524288 ... should I increase it even more?
Doesn't look like pg is swapping ...

> What exactly is your architecture?  is your database server direct
> attached to the san? if so, 2gb/4gb fc?  what san?  have you bonnie++
> the san?  basically, you can measure iowait to see if pg is waiting on
> your disks.

I'm currently running bonnie++ with the defaults ... should I change
the execution to better mimic Postgres' behavior?

RHEL 4.3 x86_64
HP DL585, 4 Dual Core Opteron 885s
  16 GB RAM
  2x300GB 10K SCSI320, RAID10
HP MSA1000 SAN direct connected via single 2GB Fibre Channel Arbitrated Loop
  10x300GB 10K SCSI320, RAID10

Re: Best COPY Performance

From
"Luke Lonergan"
Date:
Merlin,

On 10/25/06 8:38 AM, "Merlin Moncure" <mmoncure@gmail.com> wrote:

> in theory, with 10 10k disks in raid 10, you should be able to keep
> your 2fc link saturated all the time unless your i/o is extremely
> random.  random i/o is the wild card here; ideally you should see at
> least 2000 seeks in bonnie...let's see what comes up.

The 2000 seeks/sec are irrelevant to Postgres with one user doing COPY.
Because the I/O is single threaded, you will get one disk worth of seeks for
one user, roughly 150/second on a 10K RPM drive.

I suspect the problem here is the sequential I/O rate - let's wait and see
what the dd test results look like.

- Luke



Re: Best COPY Performance

From
"Craig A. James"
Date:
Jim C. Nasby wrote:
> Wait... so you're using perl to copy data between two tables? And using
> a cursor to boot? I can't think of any way that could be more
> inefficient...
>
> What's wrong with a plain old INSERT INTO ... SELECT? Or if you really
> need to break it into multiple transaction blocks, at least don't
> shuffle the data from the database into perl and then back into the
> database; do an INSERT INTO ... SELECT with that same where clause.

The data are on two different computers, and I do processing of the data
as it passes through the application.  Otherwise, the INSERT INTO ...
SELECT is my first choice.

Craig

Re: Best COPY Performance

From
"Spiegelberg, Greg"
Date:
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of
> Craig A. James
> Sent: Wednesday, October 25, 2006 12:52 PM
> To: Jim C. Nasby
> Cc: Worky Workerson; Merlin Moncure; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Best COPY Performance
>
> Jim C. Nasby wrote:
> > Wait... so you're using perl to copy data between two tables? And
> > using a cursor to boot? I can't think of any way that could be more
> > inefficient...
> >
> > What's wrong with a plain old INSERT INTO ... SELECT? Or if
> you really
> > need to break it into multiple transaction blocks, at least don't
> > shuffle the data from the database into perl and then back into the
> > database; do an INSERT INTO ... SELECT with that same where clause.
>
> The data are on two different computers, and I do processing
> of the data as it passes through the application.  Otherwise,
> the INSERT INTO ... SELECT is my first choice.

Would dblink() help in any way?

Greg



Re: Best COPY Performance

From
"Craig A. James"
Date:
Spiegelberg, Greg wrote:
>> The data are on two different computers, and I do processing
>> of the data as it passes through the application.  Otherwise,
>> the INSERT INTO ... SELECT is my first choice.
>
> Would dblink() help in any way?

It might if perl wasn't so damned good at this. ;-)

Craig


Re: Best COPY Performance

From
"Worky Workerson"
Date:
Merlin/Luke:

> > in theory, with 10 10k disks in raid 10, you should be able to keep
> > your 2fc link saturated all the time unless your i/o is extremely
> > random.  random i/o is the wild card here; ideally you should see at
> > least 2000 seeks in bonnie...let's see what comes up.

> I suspect the problem here is the sequential I/O rate - let's wait and see
> what the dd test results look like.

Here are the tests that you suggested that I do, on both the local
disks (WAL) and the SAN (tablespace).  The random seeks seem to be far
below what Merlin said was "good", so I am a bit concerned.  There is
a bit of other activity on the box at the moment which is hard to
stop, so that might have had an impact on the processing.

Here is the bonnie++ output:

Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
              Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
Local Disks    31G 45119  85 56548  21 27527   8 35069  66 86506  13 499.6   1
SAN            31G 53544  98 93385  35 18266   5 24970  47 57911   8 611.8   1

And here are the dd results for 16GB RAM, i.e. 4,000,000 8K blocks:
# Local Disks
$ time bash -c "dd if=/dev/zero of=/home/myhome/bigfile bs=8k count=4000000 && sync"
4000000+0 records in
4000000+0 records out

real    10m0.382s
user    0m1.117s
sys     2m45.681s
$ time dd if=/home/myhome/bigfile of=/dev/null bs=8k count=4000000
4000000+0 records in
4000000+0 records out

real    6m22.904s
user    0m0.717s
sys     0m53.766s

# Fibre Channel SAN
$ time bash -c "dd if=/dev/zero of=/data/test/bigfile bs=8k count=4000000 && sync"
4000000+0 records in
4000000+0 records out

real    5m58.846s
user    0m1.096s
sys     2m18.026s
$ time dd if=/data/test/bigfile of=/dev/null bs=8k count=4000000
4000000+0 records in
4000000+0 records out

real    14m9.560s
user    0m0.739s
sys     0m53.806s

Re: Best COPY Performance

From
Markus Schaber
Date:
Hi, Craig,

Craig A. James wrote:

>> Would dblink() help in any way?
>
> It might if perl wasn't so damned good at this. ;-)

You know that you can use Perl inside PostgreSQL via PL/Perl?

HTH,
Markus

Re: Best COPY Performance

From
"Luke Lonergan"
Date:
Mr. Worky,

On 10/25/06 11:26 AM, "Worky Workerson" <worky.workerson@gmail.com> wrote:

> And here are the dd results for 16GB RAM, i.e. 4,000,000 8K blocks:

So, if we divide 32,000 MB by the real time, we get:

/home (WAL):
53 MB/s write
84 MB/s read

/data (data):
89 MB/s write
38 MB/s read

The write and read speeds on /home look like a single disk drive, which is
not good if you have more drives in a RAID.  OTOH, it should be sufficient
for WAL writing, and the COPY speed shouldn't be limited by WAL.

The read speed on your /data volume is awful to the point where you should
consider it broken and find a fix.  A quick comparison: the same number on a
16 drive internal SATA array with 7200 RPM disks gets 950 MB/s read, about
25 times faster for about 1/4 the price.

But again, this may not have anything to do with the speed of your COPY
statements.

Can you provide about 10 seconds worth of "vmstat 1" while running your COPY
so we can get a global view of the I/O and CPU?

- Luke



Re: Best COPY Performance

From
"Jim C. Nasby"
Date:
On Wed, Oct 25, 2006 at 11:25:01AM -0400, Worky Workerson wrote:
> >I'm guessing the high bursts are checkpoints.  Can you check your log
> >files for pg and see if you are getting warnings about checkpoint
> >frequency?   You can get some mileage here by increasing wal files.
>
> Nope, nothing in the log.  I have set:
> wal_buffers=128
> checkpoint_segments=128
> checkpoint_timeout=3000
> which I thought was rather generous.  Perhaps I should set it even
> higher for the loads?

But depending on your shared_buffers and bgwriter settings (as well as
how much WAL traffic you're generating), you could still end up with big
slugs of work to be done when checkpoints happen.

If you set checkpoint_warning to 3001, you'll see exactly when
checkpoints are happening, so you can determine if that's an issue.
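
In postgresql.conf terms:

 checkpoint_warning = 3001   # > checkpoint_timeout, so every checkpoint is logged
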
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Best COPY Performance

From
"Worky Workerson"
Date:
I do have a dirty little secret, one which I wasn't completely aware
of until a little while ago.  Apparently, someone decided to install
Oracle on the server, and use the SAN as the primary tablespace, so
that might have something to do with the poor performance of the SAN.
At least, I'm hoping that's what is involved.  I'll be able to tell
for sure when I can rerun the benchmarks on Monday without Oracle.

Thank you all for all your help so far.  I've learned a ton about
Postgres (and life!) from both this thread and this list in general,
and the "support" has been nothing less than spectacular.

I'm hoping that the corporate Oracle machine won't shut down my pg
projects.  On a total side note, if anyone knows how to best limit
Oracle's impact on a system (i.e. memory usage, etc.), I'd be
interested.

I hate shared DB servers.

Thanks!

Re: Best COPY Performance

From
"Merlin Moncure"
Date:
On 10/27/06, Worky Workerson <worky.workerson@gmail.com> wrote:
> I'm hoping that the corporate Oracle machine won't shut down my pg
> projects.  On a total side note, if anyone knows how to best limit
> Oracle's impact on a system (i.e. memory usage, etc.), I'd be
> interested.


rm -rf /usr/local/oracle?

merlin

Re: Best COPY Performance

From
"Worky Workerson"
Date:
> The read speed on your /data volume is awful to the point where you should
> consider it broken and find a fix.  A quick comparison: the same number on a
> 16 drive internal SATA array with 7200 RPM disks gets 950 MB/s read, about
> 25 times faster for about 1/4 the price.

I'm hoping that the poor performance is a result of Oracle doing
random reads while I try and do the sequential read.  If not (I'll
test on Monday), I'll start looking in other places.  Any immediate
suggestions?

> Can you provide about 10 seconds worth of "vmstat 1" while running your COPY
> so we can get a global view of the I/O and CPU?

Here it is, taken from a spot about halfway through a 'cat file |
psql' load, with the "Oracle-is-installed-and-running" caveat:

r b   swpd  free   buff    cache si so bi    bo   in   cs us sy id wa
1 0 345732 29328 770980 12947212  0  0 20 16552 1223 3677 12  2 85  1
1 0 345732 29840 770520 12946924  0  0 20 29244 1283 2955 11  2 85  1
1 0 345732 32144 770560 12944436  0  0 12 16436 1204 2936 11  2 86  1
1 0 345732 33744 770464 12942764  0  0 20 16460 1189 2005 10  2 86  1
2 0 345732 32656 770140 12943972  0  0 16  7068 1057 3434 13  2 85  0
1 0 345732 34832 770184 12941820  0  0 20  9368 1170 3120 11  2 86  1
1 0 345732 36528 770228 12939804  0  0 16 32668 1297 2109 11  2 85  1
1 0 345732 29304 770272 12946764  0  0 16 16428 1192 3105 12  2 85  1
1 0 345732 30840 770060 12945480  0  0 20 16456 1196 3151 12  2 84  1
1 0 345732 32760 769972 12943528  0  0 12 16460 1185 3103 11  2 86  1

Re: Best COPY Performance

From
"Luke Lonergan"
Date:
Worky (is that your real name? :-)


On 10/27/06 12:08 PM, "Worky Workerson" <worky.workerson@gmail.com> wrote:

> Here it is, taken from a spot about halfway through a 'cat file |
> psql' load, with the "Oracle-is-installed-and-running" caveat:
>
> r b   swpd  free   buff    cache si so bi    bo   in   cs us sy id wa
> 1 0 345732 29328 770980 12947212  0  0 20 16552 1223 3677 12  2 85  1
> 1 0 345732 29840 770520 12946924  0  0 20 29244 1283 2955 11  2 85  1
> 1 0 345732 32144 770560 12944436  0  0 12 16436 1204 2936 11  2 86  1
> 1 0 345732 33744 770464 12942764  0  0 20 16460 1189 2005 10  2 86  1
> 2 0 345732 32656 770140 12943972  0  0 16  7068 1057 3434 13  2 85  0
> 1 0 345732 34832 770184 12941820  0  0 20  9368 1170 3120 11  2 86  1
> 1 0 345732 36528 770228 12939804  0  0 16 32668 1297 2109 11  2 85  1
> 1 0 345732 29304 770272 12946764  0  0 16 16428 1192 3105 12  2 85  1
> 1 0 345732 30840 770060 12945480  0  0 20 16456 1196 3151 12  2 84  1
> 1 0 345732 32760 769972 12943528  0  0 12 16460 1185 3103 11  2 86  1

It doesn't look like there's anything else running - the runnable "r" is
about 1.  Your "bo" blocks output rate is about 16MB/s, so divide by 3 and
you're about in range with your 5MB/s COPY rate.  The interesting thing is
that the I/O wait is pretty low.

How many CPUs on the machine?  Can you send the result of "cat
/proc/cpuinfo"?

Is your "cat file | psql" being done on the DBMS server or is it on the
network?

- Luke



Re: Best COPY Performance

From
"Merlin Moncure"
Date:
On 10/28/06, Luke Lonergan <llonergan@greenplum.com> wrote:
> Worky (is that your real name? :-)
>
>
> On 10/27/06 12:08 PM, "Worky Workerson" <worky.workerson@gmail.com> wrote:
>
> > Here it is, taken from a spot about halfway through a 'cat file |
> > psql' load, with the "Oracle-is-installed-and-running" caveat:
> >
> > r b   swpd  free   buff    cache si so bi    bo   in   cs us sy id wa
> > 1 0 345732 29328 770980 12947212  0  0 20 16552 1223 3677 12  2 85  1
> > 1 0 345732 29840 770520 12946924  0  0 20 29244 1283 2955 11  2 85  1
> > 1 0 345732 32144 770560 12944436  0  0 12 16436 1204 2936 11  2 86  1
> > 1 0 345732 33744 770464 12942764  0  0 20 16460 1189 2005 10  2 86  1
> > 2 0 345732 32656 770140 12943972  0  0 16  7068 1057 3434 13  2 85  0
> > 1 0 345732 34832 770184 12941820  0  0 20  9368 1170 3120 11  2 86  1
> > 1 0 345732 36528 770228 12939804  0  0 16 32668 1297 2109 11  2 85  1
> > 1 0 345732 29304 770272 12946764  0  0 16 16428 1192 3105 12  2 85  1
> > 1 0 345732 30840 770060 12945480  0  0 20 16456 1196 3151 12  2 84  1
> > 1 0 345732 32760 769972 12943528  0  0 12 16460 1185 3103 11  2 86  1
>
> It doesn't look like there's anything else running - the runnable "r" is
> about 1.  Your "bo" blocks output rate is about 16MB/s, so divide by 3 and
> you're about in range with your 5MB/s COPY rate.  The interesting thing is
> that the I/O wait is pretty low.
>
> How many CPUs on the machine?  Can you send the result of "cat
> /proc/cpuinfo"?
>
> Is your "cat file | psql" being done on the DBMS server or is it on the
> network?

iirc, he is running quad opteron 885 (8 cores), so if my math is
correct he can split up his process for an easy gain.

merlin

Re: Best COPY Performance

From
"Worky Workerson"
Date:
> Worky (that your real name? :-)

Nope, it's Mike.  worky.workerson is just the email that I use for "work" :)

> How many CPUs on the machine?  Can you send the result of "cat
> /proc/cpuinfo"?

Not at work at the moment, however I do have quad dual-core opterons,
like Merlin mentioned.

> Is your "cat file | psql" being done on the DBMS server or is it on the
> network?

Everything is currently being done on the same DB server.  The data
that is being loaded is on the same 2-disk "RAID10" as the OS and WAL.

Re: Best COPY Performance

From
"Worky Workerson"
Date:
On 10/27/06, Merlin Moncure <mmoncure@gmail.com> wrote:
> > > r b   swpd  free   buff    cache si so bi    bo   in   cs us sy id wa
> > > 1 0 345732 29328 770980 12947212  0  0 20 16552 1223 3677 12  2 85  1
> > > 1 0 345732 29840 770520 12946924  0  0 20 29244 1283 2955 11  2 85  1
> > > 1 0 345732 32144 770560 12944436  0  0 12 16436 1204 2936 11  2 86  1
> > > 1 0 345732 33744 770464 12942764  0  0 20 16460 1189 2005 10  2 86  1
> > > 2 0 345732 32656 770140 12943972  0  0 16  7068 1057 3434 13  2 85  0
> > > 1 0 345732 34832 770184 12941820  0  0 20  9368 1170 3120 11  2 86  1
> > > 1 0 345732 36528 770228 12939804  0  0 16 32668 1297 2109 11  2 85  1
> > > 1 0 345732 29304 770272 12946764  0  0 16 16428 1192 3105 12  2 85  1
> > > 1 0 345732 30840 770060 12945480  0  0 20 16456 1196 3151 12  2 84  1
> > > 1 0 345732 32760 769972 12943528  0  0 12 16460 1185 3103 11  2 86  1
> >
> > It doesn't look like there's anything else running - the runnable "r" is
> > about 1.  Your "bo" blocks output rate is about 16MB/s, so divide by 3 and
> > you're about in range with your 5MB/s COPY rate.  The interesting thing is
> > that the I/O wait is pretty low.
> >
> > How many CPUs on the machine?  Can you send the result of "cat
> > /proc/cpuinfo"?
>
> iirc, he is running quad opteron 885 (8 cores), so if my math is
> correct he can split up his process for an easy gain.

Are you saying that I should be able to issue multiple COPY commands
because my I/O wait is low?  I was under the impression that I am I/O
bound, so multiple simultaneous loads would have a detrimental
effect ...

Re: Best COPY Performance

From
"Luke Lonergan"
Date:
Worky,

On 10/27/06 8:47 PM, "Worky Workerson" <worky.workerson@gmail.com> wrote:

>>>> 1 0 345732 29304 770272 12946764  0  0 16 16428 1192 3105 12  2 85  1
>>>> 1 0 345732 30840 770060 12945480  0  0 20 16456 1196 3151 12  2 84  1
>>>> 1 0 345732 32760 769972 12943528  0  0 12 16460 1185 3103 11  2 86  1
>>
>> iirc, he is running quad opteron 885 (8 cores), so if my math is
>> correct he can split up his process for an easy gain.
>
> Are you saying that I should be able to issue multiple COPY commands
> because my I/O wait is low?  I was under the impression that I am I/O
> bound, so multiple simultaneous loads would have a detrimental
> effect ...

The reason I asked how many CPUs was to make sense of the 12% usr CPU time
in the above.  That means you are CPU bound and are fully using one CPU.  So
you aren't being limited by the I/O in this case, it's the CPU.

I agree with Merlin that you can speed things up by breaking the file up.
Alternately you can use the OSS Bizgres java loader, which lets you specify
the number of I/O threads with the "-n" option on a single file.

OTOH, you should find that you will only double your COPY speed with this
approach because your write speed as you previously found was limited to 30
MB/s.

For now, you could simply split the file in two pieces and load two copies
at once, then watch the same "vmstat 1" for 10 seconds and look at your "bo"
rate.

If this does speed things up, you really should check out the Bizgres Java
loader.

The other thing to wonder about though is why you are so CPU bound at 5
MB/s.  What version of Postgres is this?

- Luke



Re: Best COPY Performance

From
"Michael Artz"
Date:
> > Are you saying that I should be able to issue multiple COPY commands
> > because my I/O wait is low?  I was under the impression that I am I/O
> > bound, so multiple simultaneous loads would have a detrimental
> > effect ...
>
> The reason I asked how many CPUs was to make sense of the 12% usr CPU time
> in the above.  That means you are CPU bound and are fully using one CPU.  So
> you aren't being limited by the I/O in this case, it's the CPU.
>
> I agree with Merlin that you can speed things up by breaking the file up.
> Alternately you can use the OSS Bizgres java loader, which lets you specify
> the number of I/O threads with the "-n" option on a single file.

Thanks, I'll try that on Monday.

> The other thing to wonder about though is why you are so CPU bound at 5
> MB/s.  What version of Postgres is this?

I was wondering about that as well, and the only thing that I can
think of is that it's the PK btree index creation on the IP4.

PG 8.1.3 x86_64.  I installed it via a RH rpm for their "Web Services
Beta", or something like that.  I know I'm a bit behind the times, but
getting stuff in (and out) of my isolated lab is a bit of a pain.
I'll compile up a 8.2 beta as well and see how that works out.

Re: Best COPY Performance

From
"Luke Lonergan"
Date:
Michael (aka Worky),

On 10/28/06 5:03 AM, "Michael Artz" <mlartz@gmail.com> wrote:

> PG 8.1.3 x86_64.  I installed it via a RH rpm for their "Web Services
> Beta", or something like that.  I know I'm a bit behind the times, but
> getting stuff in (and out) of my isolated lab is a bit of a pain.
> I'll compile up a 8.2 beta as well and see how that works out.

I think 8.1 and 8.2 should be the same in this regard.

Maybe it is just the PK *build* that slows it down, but I just tried some
small scale experiments on my MacBook Pro laptop (which has the same disk
performance as your server) and I get only a 10-15% slowdown from having a
PK on an integer column.  The 10-15% slowdown was on 8.1.5 MPP, so it used
both CPUs to build the index and load at about 15 MB/s.

Note that the primary key is the first column.

                Table "public.part"
    Column     |         Type          | Modifiers
---------------+-----------------------+-----------
 p_partkey     | integer               | not null
 p_name        | character varying(55) | not null
 p_mfgr        | text                  | not null
 p_brand       | text                  | not null
 p_type        | character varying(25) | not null
 p_size        | integer               | not null
 p_container   | text                  | not null
 p_retailprice | double precision      | not null
 p_comment     | character varying(23) | not null
Indexes:
    "part_pkey" PRIMARY KEY, btree (p_partkey)

What is your schema for the table?

- Luke



Re: Best COPY Performance

From
"Spiegelberg, Greg"
Date:
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of
> Luke Lonergan
> Sent: Saturday, October 28, 2006 12:07 AM
> To: Worky Workerson; Merlin Moncure
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Best COPY Performance
>
> Worky,
>
> On 10/27/06 8:47 PM, "Worky Workerson"
> <worky.workerson@gmail.com> wrote:
>
> > Are you saying that I should be able to issue multiple COPY commands
> > because my I/O wait is low?  I was under the impression that I am I/O
> > bound, so multiple simultaneous loads would have a detrimental
> > effect ...
>
> ...
> I agree with Merlin that you can speed things up by breaking
> the file up.
> Alternately you can use the OSS Bizgres java loader, which
> lets you specify the number of I/O threads with the "-n"
> option on a single file.

As a result of this thread, and b/c I've tried this in the past but
never had much success at speeding the process up, I attempted just that
here except via 2 psql CLI's with access to the local file.  1.1M rows
of data varying in width from 40 to 200 characters COPY'd to a table
with only one text column, no keys, indexes, &c took about 15 seconds to
load. ~73K rows/second.

I broke that file into 2 files each of 550K rows and performed 2
simultaneous COPY's after dropping the table, recreating, issuing a sync
on the system to be sure, &c and nearly every time both COPY's finish in
12 seconds.  About a 20% gain to ~91K rows/second.

Admittedly, this was a pretty rough test but a 20% savings, if it can be
put into production, is worth exploring for us.

B/c I'll be asked, I did this on an idle, dual 3.06GHz Xeon with 6GB of
memory, U320 SCSI internal drives and PostgreSQL 8.1.4.

Greg

Re: Best COPY Performance

From
"Luke Lonergan"
Date:
Greg,

On 10/30/06 7:09 AM, "Spiegelberg, Greg" <gspiegelberg@cranel.com> wrote:

> I broke that file into 2 files each of 550K rows and performed 2
> simultaneous COPY's after dropping the table, recreating, issuing a sync
> on the system to be sure, &c and nearly every time both COPY's finish in
> 12 seconds.  About a 20% gain to ~91K rows/second.
>
> Admittedly, this was a pretty rough test but a 20% savings, if it can be
> put into production, is worth exploring for us.

Did you see whether you were I/O or CPU bound in your single threaded COPY?
A 10 second "vmstat 1" snapshot would tell you/us.

With Mr. Workerson (:-) I'm thinking his benefit might be a lot better
because the bottleneck is the CPU and it *may* be the time spent in the
index building bits.

We've found that there is an ultimate bottleneck at about 12-14MB/s despite
having sequential write to disk speeds of 100s of MB/s.  I forget what the
latest bottleneck was.

- Luke



Re: Best COPY Performance

From
Stefan Kaltenbrunner
Date:
Luke Lonergan wrote:
> Greg,
>
> On 10/30/06 7:09 AM, "Spiegelberg, Greg" <gspiegelberg@cranel.com> wrote:
>
>> I broke that file into 2 files each of 550K rows and performed 2
>> simultaneous COPY's after dropping the table, recreating, issuing a sync
>> on the system to be sure, &c and nearly every time both COPY's finish in
>> 12 seconds.  About a 20% gain to ~91K rows/second.
>>
>> Admittedly, this was a pretty rough test but a 20% savings, if it can be
>> put into production, is worth exploring for us.
>
> Did you see whether you were I/O or CPU bound in your single threaded COPY?
> A 10 second "vmstat 1" snapshot would tell you/us.
>
> With Mr. Workerson (:-) I'm thinking his benefit might be a lot better
> because the bottleneck is the CPU and it *may* be the time spent in the
> index building bits.
>
> We've found that there is an ultimate bottleneck at about 12-14MB/s despite
> having sequential write to disk speeds of 100s of MB/s.  I forget what the
> latest bottleneck was.

I have personally managed to load a bit less than 400k rows/s (5 int
columns, no indexes) - on very fast disk hardware - at that point
postgresql is completely CPU bottlenecked (2.6GHz Opteron).
Using multiple processes to load the data will help to scale up to
about 900k/s (4 processes on 4 cores).


Stefan

Re: Best COPY Performance

From
"Luke Lonergan"
Date:
Stefan,

On 10/30/06 8:57 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:

>> We've found that there is an ultimate bottleneck at about 12-14MB/s despite
>> having sequential write to disk speeds of 100s of MB/s.  I forget what the
>> latest bottleneck was.
>
> I have personally managed to load a bit less than 400k rows/s (5 int
> columns, no indexes) - on very fast disk hardware - at that point
> postgresql is completely CPU bottlenecked (2.6GHz Opteron).

400,000 rows/s x 4 bytes/column x 5 columns/row = 8MB/s

> Using multiple processes to load the data will help to scale up to
> about 900k/s (4 processes on 4 cores).

18MB/s?  Have you done this?  I've not seen this much of an improvement
before by using multiple COPY processes to the same table.

Another question: how to measure MB/s - based on the input text file?  On
the DBMS storage size?  We usually consider the input text file in the
calculation of COPY rate.
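
As a sketch of the input-file-based calculation (names hypothetical):

   ls -l data.txt                                  # input size in bytes
   time psql -d mydb -c "\copy mytable from 'data.txt'"
   # COPY rate in MB/s = (bytes / 1048576) / elapsed seconds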

- Luke



Re: Best COPY Performance

From
Stefan Kaltenbrunner
Date:
Luke Lonergan wrote:
> Stefan,
>
> On 10/30/06 8:57 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:
>
>>> We've found that there is an ultimate bottleneck at about 12-14MB/s despite
>>> having sequential write to disk speeds of 100s of MB/s.  I forget what the
>>> latest bottleneck was.
>> I have personally managed to load a bit less then 400k/s (5 int columns
>> no indexes) - on very fast disk hardware - at that point postgresql is
>> completely CPU bottlenecked (2,6Ghz Opteron).
>
> 400,000 rows/s x 4 bytes/column x 5 columns/row = 8MB/s
>
>> Using multiple processes to load the data will help to scale up to about
>>   900k/s (4 processes on 4 cores).

Yes, I did that about half a year ago as part of the CREATE INDEX on a
1.8B row table thread on -hackers that resulted in some of the sorting
improvements in 8.2.
I don't think there is much more possible in terms of import speed by
using more cores (at least not when importing into the same table) - IIRC
I was at nearly 700k/s with two cores and 850k/s with 3 cores or so ...

>
> 18MB/s?  Have you done this?  I've not seen this much of an improvement
> before by using multiple COPY processes to the same table.
>
> Another question: how to measure MB/s - based on the input text file?  On
> the DBMS storage size?  We usually consider the input text file in the
> calculation of COPY rate.


Yeah, that is a good question (and part of the reason why I cited the
rows/sec number, btw).


Stefan

Re: Best COPY Performance

From
"Worky Workerson"
Date:
> > And here are the dd results for 16GB RAM, i.e. 4,000,000 8K blocks:
>
> So, if we divide 32,000 MB by the real time, we get:
> /data (data):
> 89 MB/s write
> 38 MB/s read
... snip ...
> The read speed on your /data volume is awful to the point where you should
> consider it broken and find a fix.  A quick comparison: the same number on a
> 16 drive internal SATA array with 7200 RPM disks gets 950 MB/s read, about
> 25 times faster for about 1/4 the price.

I managed to get approval to shut down the Oracle instance and reran
the dd's on the SAN (/data) and came up with about 60MB/s write (I had
missed the 'sync' in the previous runs) and about 58MB/s read - still
no comparison to your SATA array.  Any recommendations on what to
look at to find a fix?  One thing which I never mentioned was that I
am using ext3 mounted with noatime,data=writeback.
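
For reference, the dd test under discussion looks roughly like this --
4,000,000 8K blocks = 32GB, i.e. 2x RAM, so the page cache can't satisfy
the reads (the scratch file path is hypothetical):

   # write test; the trailing sync makes sure buffered data is counted
   sync; time sh -c 'dd if=/dev/zero of=/data/ddtest bs=8k count=4000000 && sync'
   # read test
   time dd if=/data/ddtest of=/dev/null bs=8k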

An interesting note (at least to me) is the inverse relationship
between free memory and bo when writing with dd, i.e.:

$ vmstat 5
r b   swpd   free   buff   cache si  so bi     bo   in  cs us sy id wa
0 3 244664 320688 23588 15383120  0   0  0     28 1145 197  0  1 74 25
2 6 244664 349488 22276 15204980  0   0  0     24 1137 188  0  1 75 25
2 6 244664  28264 23024 15526552  0   0  0  65102 1152 335  0 12 60 28
2 4 244664  28968 23588 15383120  0   0  1 384386 1134 372  0 19 34 47
1 5 244664  28840 23768 15215728  0   0  1 438482 1144 494  0 24 33 43
0 5 247256  41320 20144 15212788  0 524  0  57062 1142 388  0  6 43 51
1 6 247256  29096 19588 15226788  0   0  5  60999 1140 391  0 15 42 43

Is this because of the kernel attempting to cache the file in memory?

Re: Best COPY Performance

From
"Luke Lonergan"
Date:
Worky (!),

On 10/31/06 12:11 PM, "Worky Workerson" <worky.workerson@gmail.com> wrote:

> Any recommendations on what to
> look at to find a fix?  One thing which I never mentioned was that I
> am using ext3 mounted with noatime,data=writeback.

You can try setting the max readahead like this:
  /sbin/blockdev --setra 16384 /dev/sd[a-z]

It will set the max readahead (the value is in 512-byte sectors, so
16384 = 8MB) for whatever devices are in /dev/sd[a-z] on the booted
machine.  You'd need to put the line(s) in /etc/rc.d/rc.local to have
the setting persist across reboots.
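
To sanity-check the value before and after (device name hypothetical):

   /sbin/blockdev --getra /dev/sda   # reports the readahead in 512-byte sectors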

- Luke



Re: Best COPY Performance

From
"Worky Workerson"
Date:
> > > I'm guessing the high bursts are checkpoints.  Can you check your log
> > > files for pg and see if you are getting warnings about checkpoint
> > > frequency?  You can get some mileage here by increasing wal files.
> >
> > Nope, nothing in the log.  I have set:
> >  wal_buffers=128
> >  checkpoint_segments=128
> >  checkpoint_timeout=3000
> > which I thought was rather generous.  Perhaps I should set it even
> > higher for the loads?
>
> But depending on your shared_buffers and bgwriter settings (as well as
> how much WAL traffic you're generating), you could still end up with big
> slugs of work to be done when checkpoints happen.
>
> If you set checkpoint_warning to 3001, you'll see exactly when
> checkpoints are happening, so you can determine if that's an issue.

I did that, and I got two log messages: checkpoints happened 316 and
147 seconds apart during my load of a 1.9GB file.  Is this "an
issue"?

Thanks!

Re: Best COPY Performance

From
"Worky Workerson"
Date:
> Maybe it is just the PK *build* that slows it down, but I just tried some
> small scale experiments on my MacBook Pro laptop (which has the same disk
> performance as your server) and I get only a 10-15% slowdown from having a
> PK on an integer column.  The 10-15% slowdown was on 8.1.5 MPP, so it used
> both CPUs to build the index and load at about 15 MB/s.
...snip...
> What is your schema for the table?

A single IP4 PK and 21 VARCHARs.  It takes about 340 seconds to load a
1.9GB file with the PK index, and about 230 seconds without it (ALTER
TABLE mytable DROP CONSTRAINT mytable_pkey), which is a pretty
significant (~30%) savings.  If I read the vmstat output correctly
(i.e. the cpu us column), I'm still at 12% and thus still CPU-bound,
except for when the checkpoint occurs, i.e. everything is chugging
along similar to the first line, then stuff gets wonky:

r b   swpd    free   buff   cache si so bi     bo   in  cs   us sy id wa
2 0 279028 4620040 717940 9697664  0  0  0   19735 1242 7534 13  4 82  1
1 2 279028 4476120 718120 9776840  0  0  0 2225483 1354 5269 13  6 71 11
0 3 279028 4412928 718320 9866672  0  0  2   19746 1324 3978 10  2 69 18
1 1 279028 4334112 718528 9971456  0  0  0   20615 1311 5912 10  3 69 18
0 1 279028 4279904 718608 9995244  0  0  0  134946 1205  674  1  3 85 11
0 2 279028 4307344 718616 9995304  0  0  0      54 1132  247  0  1 77 22
1 0 279028 7411104 718768 6933860  0  0  0    9942 1148 3618 11  6 80  3
1 0 279028 7329312 718964 7015536  0  0  1   19766 1232 5108 13  2 84  1

Also, as a semi-side note, I only have a single checkpoint without the
index, while I have 2 with the index.
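
(A sketch of the drop-and-rebuild sequence being timed above; the
database, table, and column names are hypothetical:)

   # drop the PK, bulk load, then rebuild the index in one pass
   psql -d mydb -c "ALTER TABLE mytable DROP CONSTRAINT mytable_pkey"
   time psql -d mydb -c "\copy mytable from 'data.txt'"
   psql -d mydb -c "ALTER TABLE mytable ADD PRIMARY KEY (ip)"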

Re: Best COPY Performance

From
"Luke Lonergan"
Date:
Checkpoints are not an issue here; the vmstat you included was on a 5 second interval, so the 'bursts' were bursting at a rate of 60MB/s.


- Luke

Msg is shrt cuz m on ma treo


Re: Best COPY Performance

From
"Worky Workerson"
Date:
> >>>> 1 0 345732 29304 770272 12946764  0  0 16 16428 1192 3105 12  2 85  1
> >>>> 1 0 345732 30840 770060 12945480  0  0 20 16456 1196 3151 12  2 84  1
> >>>> 1 0 345732 32760 769972 12943528  0  0 12 16460 1185 3103 11  2 86  1
> >>
> >> iirc, he is running quad opteron 885 (8 cores), so if my math is
> >> correct he can split up his process for an easy gain.
> >
> > Are you saying that I should be able to issue multiple COPY commands
> > because my I/O wait is low?  I was under the impression that I am I/O
> > bound, so multiple simultaneous loads would have a detrimental
> > effect ...
>
> The reason I asked how many CPUs was to make sense of the 12% usr CPU time
> in the above.  That means you are CPU bound and are fully using one CPU.  So
> you aren't being limited by the I/O in this case, it's the CPU.
... snip ...
> For now, you could simply split the file in two pieces and load two copies
> at once, then watch the same "vmstat 1" for 10 seconds and look at your "bo"
> rate.

Significantly higher on average, and parallel loads were ~30% faster
than a single load with index builds (240s vs 340s) and ~45% faster
(150s vs 230s) without the PK index.  I'll definitely look into the
bizgres java loader.

Thanks!

Re: Best COPY Performance

From
Teemu Torma
Date:
On Tuesday 31 October 2006 21:11, Worky Workerson wrote:
> One thing which I never mentioned was that I
> am using ext3 mounted with noatime,data=writeback.

You might also want to try with data=ordered.  I have noticed that
nowadays it seems to be a bit faster, but not much.  I don't know why,
maybe it has got more optimization efforts.

Teemu

Postgres server crash

From
"Craig A. James"
Date:
For the third time today, our server has crashed, or frozen, actually something in between.  Normally there are about
30-50 connections because of mod_perl processes that keep connections open.  After the crash, there are three processes
remaining:

# ps -ef | grep postgres
postgres 23832     1  0 Nov11 pts/1    00:02:53 /usr/local/pgsql/bin/postmaster -D /postgres/main
postgres  1200 23832 20 14:28 pts/1    00:58:14 postgres: pubchem pubchem 66.226.76.106(58882) SELECT
postgres  4190 23832 25 14:33 pts/1    01:09:12 postgres: asinex asinex 66.226.76.106(56298) SELECT

But they're not doing anything: No CPU time consumed, no I/O going on, no progress.  If I try to connect with psql(1),
it says:

   psql: FATAL:  the database system is in recovery mode

And the server log has:

LOG:  background writer process (PID 23874) was terminated by signal 9
LOG:  terminating any other active server processes
LOG:  statistics collector process (PID 23875) was terminated by signal 9
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared memory.
... repeats about 50 times, one per process.

Questions:
  1. Any idea what happened and how I can avoid this?  It's a *big* problem.
  2. Why didn't the database recover?  Why are there two processes
     that couldn't be killed?
  3. Where did the "signal 9" come from?  (Nobody but me ever logs
     in to the server machine.)

Help!

Thanks,
Craig


Re: Postgres server crash

From
Russell Smith
Date:
Craig A. James wrote:
> For the third time today, our server has crashed, or frozen, actually
> something in between.  Normally there are about 30-50 connections
> because of mod_perl processes that keep connections open.  After the
> crash, there are three processes remaining:
>
> # ps -ef | grep postgres
> postgres 23832     1  0 Nov11 pts/1    00:02:53
> /usr/local/pgsql/bin/postmaster -D /postgres/main
> postgres  1200 23832 20 14:28 pts/1    00:58:14 postgres: pubchem
> pubchem 66.226.76.106(58882) SELECT
> postgres  4190 23832 25 14:33 pts/1    01:09:12 postgres: asinex
> asinex 66.226.76.106(56298) SELECT
>
> But they're not doing anything: No CPU time consumed, no I/O going on,
> no progress.  If I try to connect with psql(1), it says:
>
>   psql: FATAL:  the database system is in recovery mode
>
> And the server log has:
>
> LOG:  background writer process (PID 23874) was terminated by signal 9
> LOG:  terminating any other active server processes
> LOG:  statistics collector process (PID 23875) was terminated by signal 9
> WARNING:  terminating connection because of crash of another server
> process
> DETAIL:  The postmaster has commanded this server process to roll back
> the current transaction and exit, because another server process
> exited abnormally and possibly corrupted shared memory.
> HINT:  In a moment you should be able to reconnect to the database and
> repeat your command.
> WARNING:  terminating connection because of crash of another server
> process
> DETAIL:  The postmaster has commanded this server process to roll back
> the current transaction and exit, because another server process
> exited abnormally and possibly corrupted shared memory.
> ... repeats about 50 times, one per process.
>
> Questions:
>  1. Any idea what happened and how I can avoid this?  It's a *big*
> problem.
>  2. Why didn't the database recover?  Why are there two processes
>     that couldn't be killed?
>  3. Where did the "signal 9" come from?  (Nobody but me ever logs
>     in to the server machine.)
>
I would guess it's the Linux OOM killer, if you are running Linux. You
need to turn off killing of processes when you run out of memory.  Are
you getting close to running out of memory?

> Help!
>
> Thanks,
> Craig


Re: Postgres server crash

From
Richard Huxton
Date:
Russell Smith wrote:
> Craig A. James wrote:
>> Questions:
>>  1. Any idea what happened and how I can avoid this?  It's a *big*
>> problem.
>>  2. Why didn't the database recover?  Why are there two processes
>>     that couldn't be killed?

I'm guessing it didn't recover *because* there were two processes that
couldn't be killed. Responsibility for that falls to the
operating-system. I've seen it most often with faulty drivers or
hardware that's being communicated with/written to. However, see below.

>>  3. Where did the "signal 9" come from?  (Nobody but me ever logs
>>     in to the server machine.)
>>
> I would guess it's the linux OOM if you are running linux.

If not, it means the server is hacked or haunted. Something outside PG
issued a kill -9 and the OOM killer is the prime suspect I'd say.

--
   Richard Huxton
   Archonet Ltd

Re: Postgres server crash

From
"Craig A. James"
Date:
Russell Smith wrote:
>> For the third time today, our server has crashed...
>
> I would guess it's the linux OOM if you are running linux. You need to
> turn off killing of processes when you run out of memory.  Are you
> getting close to running out of memory?

Good suggestion, it was a memory leak in an add-on library that we plug in to the Postgres server.

OOM?  Can you give me a quick pointer to what this acronym stands for and how I can reconfigure it?  It sounds like a
"feature" old UNIX systems like SGI IRIX had, where the system would allocate virtual memory that it didn't really have,
then kill your process if you tried to use it.  I.e. malloc() would never return NULL even if swap space was over
allocated.  Is this what you're talking about?  Having this enabled on a server is deadly for reliability.

Thanks,
Craig


Re: Postgres server crash

From
Richard Huxton
Date:
Craig A. James wrote:
> Russell Smith wrote:
>>> For the third time today, our server has crashed...
>>
>> I would guess it's the linux OOM if you are running linux. You need to
>> turn off killing of processes when you run out of memory.  Are you
>> getting close to running out of memory?
>
> Good suggestion, it was a memory leak in an add-on library that we plug
> in to the Postgres server.
>
> OOM?   Can you give me a quick pointer to what this acronym stands for
> and how I can reconfigure it?

Out Of Memory

 > It sounds like a "feature" old UNIX
> systems like SGI IRIX had, where the system would allocate virtual
> memory that it didn't really have, then kill your process if you tried
> to use it.

That's it.

 > I.e. malloc() would never return NULL even if swap space was
> over allocated.  Is this what you're talking about?  Having this enabled
> on a server is deadly for reliability.

Indeed. See the manuals for details. Section 16.4.3

http://www.postgresql.org/docs/8.1/static/kernel-resources.html#AEN18128

--
   Richard Huxton
   Archonet Ltd

Re: Postgres server crash

From
"Craig A. James"
Date:
By the way, in spite of my questions and concerns, I was *very* impressed by the recovery process.  I know it might
seem like old hat to you guys to watch the WAL in action, and I know on a theoretical level it's supposed to work, but
watching it recover 150 separate databases, and find and fix a couple of problems, was very impressive.  It gives me
great confidence that I made the right choice to use Postgres.

Richard Huxton wrote:
>>>  2. Why didn't the database recover?  Why are there two processes
>>>     that couldn't be killed?
>
> I'm guessing it didn't recover *because* there were two processes that
> couldn't be killed. Responsibility for that falls to the
> operating-system. I've seen it most often with faulty drivers or
> hardware that's being communicated with/written to. However, see below.

It can't be a coincidence that these were the only two processes in a SELECT operation.  Does the server disable
signals at critical points?

I'd make a wild guess that this is some sort of deadlock problem -- these two servers have disabled signals for a
critical section of SELECT, and are waiting for something from the postmaster, but postmaster is dead.

This is an ordinary system, no hardware problems, stock RH FC3 kernel, stock PG 8.1.4, with 4 GB memory, and at the
moment the database is running on a single SATA disk.  I'm worried that a production server can get into a state that
requires manual intervention to recover.

Craig

Re: Postgres server crash

From
Richard Huxton
Date:
Craig A. James wrote:
> By the way, in spite of my questions and concerns, I was *very*
> impressed by the recovery process.  I know it might seem like old hat to
> you guys to watch the WAL in action, and I know on a theoretical level
> it's supposed to work, but watching it recover 150 separate databases,
> and find and fix a couple of problems was very impressive.  It gives me
> great confidence that I made the right choice to use Postgres.
>
> Richard Huxton wrote:
>>>>  2. Why didn't the database recover?  Why are there two processes
>>>>     that couldn't be killed?
>>
>> I'm guessing it didn't recover *because* there were two processes that
>> couldn't be killed. Responsibility for that falls to the
>> operating-system. I've seen it most often with faulty drivers or
>> hardware that's being communicated with/written to. However, see below.
>
> It can't be a coincidence that these were the only two processes in a
> SELECT operation.  Does the server disable signals at critical points?

If a "kill -9" as root doesn't get rid of them, I think I'm right in
saying that it's a kernel-level problem rather than something else.

--
   Richard Huxton
   Archonet Ltd

Re: Postgres server crash

From
Tom Lane
Date:
Richard Huxton <dev@archonet.com> writes:
> Craig A. James wrote:
>> It can't be a coincidence that these were the only two processes in a
>> SELECT operation.  Does the server disable signals at critical points?

> If a "kill -9" as root doesn't get rid of them, I think I'm right in
> saying that it's a kernel-level problem rather than something else.

I didn't actually see Craig say anywhere that he'd tried "kill -9" on
those backends.  If he did and it didn't do anything, then clearly they
were in some kind of uninterruptable wait, which is certainly evidence
of a kernel or hardware issue.  If he didn't do "kill -9" then we don't
really know what the issue was --- it's not improbable that they were
stuck in some loop not containing a CHECK_FOR_INTERRUPTS() test.

            regards, tom lane

Re: Postgres server crash

From
Tom Lane
Date:
"Craig A. James" <cjames@modgraph-usa.com> writes:
> OOM?  Can you give me a quick pointer to what this acronym stands for
> and how I can reconfigure it?

See "Linux Memory Overcommit" at
http://www.postgresql.org/docs/8.1/static/kernel-resources.html#AEN18128
or try googling for "OOM kill" for non-Postgres-specific coverage.

> It sounds like a "feature" old UNIX
> systems like SGI IRIX had, where the system would allocate virtual
> memory that it didn't really have, then kill your process if you tried
> to use it.  I.e. malloc() would never return NULL even if swap space
> was over allocated.  Is this what you're talking about?  Having this
> enabled on a server is deadly for reliability.

No kidding :-(.  The default behavior in Linux is extremely unfortunate.

            regards, tom lane

Re: Postgres server crash

From
Richard Huxton
Date:
Craig A. James wrote:
> Richard Huxton wrote:
>> If a "kill -9" as root doesn't get rid of them, I think I'm right in
>> saying that it's a kernel-level problem rather than something else.
>
> Sorry I didn't clarify that.  "kill -9" did kill them.  Other signals
> did not.  It wasn't until I manually intervened with the "kill -9" that
> the system began the recovery process.

Ah, when you said "unkillable" in the first msg I jumped to the wrong
conclusion.

In that case, see Tom's answer.

--
   Richard Huxton
   Archonet Ltd

Re: Postgres server crash

From
"Merlin Moncure"
Date:
On 11/15/06, Craig A. James <cjames@modgraph-usa.com> wrote:

> Questions:
>   1. Any idea what happened and how I can avoid this?  It's a *big* problem.
>   2. Why didn't the database recover?  Why are there two processes
>      that couldn't be killed?
>   3. Where did the "signal 9" come from?  (Nobody but me ever logs
>      in to the server machine.)

how much memory is in the box?  maybe vmstat will give you some clues
about what is going on leading up to the crash...a spike in swap for
example.  Maybe, if you can budget for it, query logging might give
some clues also.  Try increasing swap.

merlin

Re: Postgres server crash

From
Ben
Date:
OOM stands for "Out Of Memory" and it does indeed seem to be the same as
what IRIX had. I believe you can turn the feature off and also configure
its overcommitment by setting something in /proc/..... and unfortunately, I
don't remember more than that.

On Thu, 16 Nov 2006, Craig A. James wrote:

> OOM?  Can you give me a quick pointer to what this acronym stands for and how
> I can reconfigure it?  It sounds like a "feature" old UNIX systems like SGI
> IRIX had, where the system would allocate virtual memory that it didn't
> really have, then kill your process if you tried to use it.  I.e. malloc()
> would never return NULL even if swap space was over allocated.  Is this what
> you're talking about?  Having this enabled on a server is deadly for
> reliability.
>
> Thanks,
> Craig

Re: Postgres server crash

From
Richard Troy
Date:

On Thu, 16 Nov 2006, Tom Lane wrote:
>
> "Craig A. James" <cjames@modgraph-usa.com> writes:
> > OOM?  Can you give me a quick pointer to what this acronym stands for
> > and how I can reconfigure it?
>
> See "Linux Memory Overcommit" at
> http://www.postgresql.org/docs/8.1/static/kernel-resources.html#AEN18128
> or try googling for "OOM kill" for non-Postgres-specific coverage.

I did that - spent about two f-ing hours looking for what I wanted. (Guess
I entered poor choices for my searches. -frown- )  There are a LOT of
articles that TALK ABOUT OOM, but precious few actually tell you what you
can do about it.

Trying to save you some time:

On linux you can use the sysctl utility to muck with vm.overcommit_memory;
You can disable the "feature."

Google _that_ for more info!
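
Concretely, a sketch for 2.6-era kernels (mode 2 = strict accounting,
no overcommit):

   sysctl -w vm.overcommit_memory=2
   # equivalently: echo 2 > /proc/sys/vm/overcommit_memory
   # to persist across reboots, add to /etc/sysctl.conf:
   #   vm.overcommit_memory = 2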

>
> > It sounds like a "feature" old UNIX
> > systems like SGI IRIX had, where the system would allocate virtual
> > memory that it didn't really have, then kill your process if you tried
> > to use it.  I.e. malloc() would never return NULL even if swap space
> > was over allocated.  Is this what you're talking about?  Having this
> > enabled on a server is deadly for reliability.
>
> No kidding :-(.  The default behavior in Linux is extremely unfortunate.
>
>             regards, tom lane

That's a major understatement.

The reason I spent a couple of hours looking for what I could learn on
this is that I've been absolutely beside myself on this "extremely
unfortunate" "feature." I had a badly behaving app (but didn't know which
app it was), so Linux would kill lots of things, like, oh, say, inetd.
Good luck sshing into the box. You just had to suffer with pushing the
damned reset button... It must have taken at least a week before figuring
out what not to do. (What I couldn't/can't understand is why the system
wouldn't just refuse the bad app the memory when it was short - no, you've
had enough!)

<soapbox> ...I read a large number of articles on this subject and am
absolutely dumbfounded by the -ahem- idiots who think killing a random
process is an appropriate action. I'm just taking their word for it that
there's some kind of impossibility of the existing Linux kernel not
getting itself into a potentially hung situation because it didn't save
itself any memory. Frankly, if it takes a complete kernel rewrite to fix
the problem that the damned operating system can't manage its own needs,
then the kernel needs to be rewritten! </soapbox>

These kernel hackers could learn something from VAX/VMS.

Richard

--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy@ScienceTools.com, http://ScienceTools.com/


Re: Postgres server crash

From
"Craig A. James"
Date:
Richard Troy wrote:
> I did that - spent about two f-ing hours looking for what I wanted. (Guess
> I entered poor choices for my searches. -frown- ) There are a LOT of
> articles that TALK ABOUT OOM, but prescious few actually tell you what you
> can do about it.
>
> Trying to save you some time:
>
> On linux you can use the sysctl utility to muck with vm.overcommit_memory;
> You can disable the "feature."
>
> Google _that_ for more info!


Here's something I found googling for "memory overcommitment"+linux

  http://archives.neohapsis.com/archives/postfix/2000-04/0512.html

  From /usr/src/linux/Documentation/sysctl/vm.txt

  "overcommit_memory:

  This value contains a flag that enables memory overcommitment.
  When this flag is 0, the kernel checks before each malloc()
  to see if there's enough memory left. If the flag is nonzero,
  the system pretends there's always enough memory."

  This flag is settable in /proc/sys/vm

Lo and behold, here it is on my system:

   $ cat /proc/sys/vm/overcommit_memory
   0
   $ cat /proc/sys/vm/overcommit_ratio
   50
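
With overcommit_memory=2 the kernel enforces a commit limit of swap plus
overcommit_ratio percent of RAM.  A quick way to see where you stand
(2.6 kernels):

   # CommitLimit is the cap; Committed_AS is what's currently promised
   grep -E 'CommitLimit|Committed_AS' /proc/meminfo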

Craig

Re: Postgres server crash

From
Tom Lane
Date:
"Craig A. James" <cjames@modgraph-usa.com> writes:
> Here's something I found googling for "memory overcommitment"+linux
>   http://archives.neohapsis.com/archives/postfix/2000-04/0512.html

That might have been right when it was written (note the reference to a
2.2 Linux kernel), but it's 100% wrong now.  0 is the default, not-safe
setting.

            regards, tom lane

Re: Postgres server crash

From
Ron Mayer
Date:
Tom Lane wrote:
> "Craig A. James" <cjames@modgraph-usa.com> writes:
>> Here's something I found googling for "memory overcommitment"+linux
>>   http://archives.neohapsis.com/archives/postfix/2000-04/0512.html
>
> That might have been right when it was written (note the reference to a
> 2.2 Linux kernel), but it's 100% wrong now.  0 is the default, not-safe
> setting.

Is this something we could detect and show a warning for at initdb
time; or perhaps log a warning in the log files at postmaster startup?
This dangerous choice of settings seems to catch quite a few postgresql
users.

Re: Postgres server crash

From
Michael Stone
Date:
On Sat, Nov 18, 2006 at 05:28:46PM -0800, Richard Troy wrote:
>On linux you can use the sysctl utility to muck with vm.overcommit_memory;
>You can disable the "feature."

Be aware that there are "reasons" the "feature" exists before you
"cast" "aspersions" and "quote marks" all over the place, and the
"reasons" are "things" like "shared memory" and "copy on write"
memory, which are "generally" considered "good things".

At one point someone complained about the ability to configure, e.g.,
IRIX to allow memory overcommit. I worked on some large IRIX
installations where full memory accounting would have required on the
order of 100s of gigabytes of swap, due to large shared memory
allocations. If the swap had been configured rather than allowing
overcommit, sure it might have reduced the chance of an application
allocating memory that it couldn't use. In practice, if you're 100s of
gigs into swap the system isn't going to be useable anyway.

>itself any memory. Frankly, if it takes a complete kernel rewrite to fix
>the problem that the damned operating system can't manage its own needs,
>then the kernel needs to be rewritten! </soapbox>
>
>These kernel hackers could learn something from VAX/VMS.

Like how to make a slower system? You can configure a VM to operate
the way they did 25 years ago, but in general people have preferred to
get better performance (*much* better performance) instead. It could be
that they're all idiots, or it could be that in practice the problem
isn't as bad as you make it out to be. (Maybe other people are just
better at configuring their memory usage?)

Mike Stone

Re: Postgres server crash

From
Richard Broersma Jr
Date:
> (Maybe other people are just
> better at configuring their memory usage?)

I don't mean to hijack the thread, but I am interested in learning the science behind configuring
memory usage.  A lot of the docs that I have found on this subject speak in terms of generalities
and rules of thumb.  Are there any resources that show how to tune the kernel's parameters to
make use of all of the available memory and at the same time allow all services to get along?

From what I've read, it seems that tuning kernel parameters is more of an art than a
science, since most recommendations boil down to experiences based on trial and error.

Regards,

Richard Broersma Jr.


Re: Postgres server crash

From
Michael Stone
Date:
On Sun, Nov 19, 2006 at 12:42:45PM -0800, Richard Broersma Jr wrote:
>I don't mean to hijack the thread,  but I am interested in learning the science behind configuring
>memory usage.

There isn't one. You need experience, and an awareness of your
particular requirements. If it were easy, it would be autotuned. :)

Mike Stone

Re: Postgres server crash

From
"Craig A. James"
Date:
Michael Stone wrote:
> At one point someone complained about the ability to configure, e.g.,
> IRIX to allow memory overcommit. I worked on some large IRIX
> installations where full memory accounting would have required on the
> order of 100s of gigabytes of swap, due to large shared memory
> allocations.

These were mostly scientific and graphical apps where reliability took a back seat to performance and to program
complexity. They would allocate 100's of GB of swap space rather than taking the time to design proper data structures.
If the program crashed every week or two, no big deal -- just run it again.  Overallocating memory is a valuable
technique for such applications.

But overallocating memory has no place in a server environment.  When memory overcommitment is allowed, it is
impossible to write a reliable application, because no matter how carefully and correctly you craft your code, someone
else's program that leaks memory like Elmer Fudd's rowboat after his shotgun goes off can kill your well-written
application.

Installing Postgres on such a system makes Postgres unreliable.

Tom Lane wrote:
> That might have been right when it was written (note the reference to a
> 2.2 Linux kernel), but it's 100% wrong now.
> [Setting /proc/sys/vm/overcommit_memory to] 0 is the default, not-safe
> setting.

I'm surprised that the Linux kernel people take such an uncritical view of reliability that they set, as *default*, a
feature that makes Linux an unreliable platform for servers.

And speaking of SGI, this very issue was among the things that sank the company.  As the low-end graphics cards ate
into their visualization market, they tried to become an Oracle Server platform.  Their servers were *fast*.  But they
crashed -- a lot.  And memory-overcommit was one of the reasons.  IRIX admins would brag that their systems only
crashed every couple of weeks.  I had HP and Sun systems that would run for years.

Craig

Re: Postgres server crash

From
Michael Stone
Date:
On Sun, Nov 19, 2006 at 02:12:01PM -0800, Craig A. James wrote:
>And speaking of SGI, this very issue was among the things that sank the
>company.  As the low-end graphics cards ate into their visualization
>market, they tried to become an Oracle Server platform.  Their servers were
>*fast*.  But they crashed -- a lot.  And memory-overcommit was one of the
>reasons.

You realize that it had to be turned on explicitly on IRIX, right? But
don't let facts get in the way of a good rant...

Mike Stone

Re: Postgres server crash

From
"Craig A. James"
Date:
> You realize that it had to be turned on explicitly on IRIX, right? But
> don't let facts get in the way of a good rant...

On the contrary, with IRIX 4 and earlier it was the default, but it caused so many problems that SGI switched the
default to OFF in IRIX 5.  But because it had been available for so long, many important apps had come to rely on it,
so most sites had to immediately re-enable virtual swap on every IRIX 5 server that came in.  Admins just got used to
doing it, so it became a "default" at most sites, and admins often couldn't be convinced to disable it for database
server machines, because "That's our standard for IRIX configuration."

I worked at a big molecular modeling/visualization company; our visualization programs *required* virtual swap, and
our server programs *prohibited* virtual swap.  Imagine how our sales people felt about that, telling customers that
they'd have to buy two $30,000 machines just because of one kernel parameter.  Of course, they didn't, and our server
apps took the heat as being "unreliable."

SGI called it "virtual swap" which I always thought was a hoot.  You have virtual memory, which is really your swap
space, and then virtual swap, which is some kind of dark hole...

Craig

Re: Postgres server crash

From
Bruno Wolff III
Date:
On Sun, Nov 19, 2006 at 14:12:01 -0800,
  "Craig A. James" <cjames@modgraph-usa.com> wrote:
>
> These were mostly scientific and graphical apps where reliability took a
> back seat to performance and to program complexity.  They would allocate
> 100's of GB of swap space rather than taking the time to design proper data
> structures.  If the program crashed every week or two, no big deal -- just
> run it again.  Overallocating memory is a valuable technique for such
> applications.

I don't think the above applies generally. Programmers need to be aware of
the working set of CPU bound apps. If the program is constantly paging,
the performance is going to be abysmal.

Re: Postgres server crash

From
"Craig A. James"
Date:
We're getting off-topic for this group except that this is *why* we're plagued with this problem, so I'll make one more
observation.

Bruno Wolff III wrote:
>> They would allocate
>> 100's of GB of swap space rather than taking the time to design proper data
>> structures.  If the program crashed every week or two, no big deal -- just
>> run it again.  Overallocating memory is a valuable technique for such
>> applications.
>
> I don't think the above applies generally. Programmers need to be aware of
> the working set of CPU bound apps. If the program is constantly paging,
> the performance is going to be abysmal.

You're doing planetary number crunching, so you allocate an (x,y,z) space with 10^6 points on every axis, or 10^18
points, for roughly 2^60 64-bit floating-point numbers, or about 2^63 bytes.  But your space is mostly empty (just
like real space).  Where there's a planet or star, you're actually using memory; everywhere else, the memory is never
referenced.  So you're using just an ordinary amount of memory, but you can access your planet's data by simply
referencing a three-dimensional array in FORTRAN.  Just allow infinite overcommit, let the OS do the rest, and it
actually works.  A lot of molecular modeling works this way too.

This is how these applications were/are actually written.  I'm not defending the method, just pointing out that it's
real and it's one of the reasons virtual swap was invented.

Craig

Re: Postgres server crash

From
"Mattias Kregert"
Date:
if [ "$(cat /proc/sys/vm/overcommit_memory)" -eq 0 ]; then
    # warn at startup when the kernel is in its default overcommit mode
    echo "WARNING: Watch out for the oom-killer!" \
         "Consider: echo 2 > /proc/sys/vm/overcommit_memory"
fi

----- Original Message -----
From: "Ron Mayer" <rm_pg@cheapcomplexdevices.com>
To: <pgsql-performance@postgresql.org>
Sent: Sunday, November 19, 2006 6:07 PM
Subject: Re: Postgres server crash


> Tom Lane wrote:
>> "Craig A. James" <cjames@modgraph-usa.com> writes:
>>> Here's something I found googling for "memory overcommitment"+linux
>>>   http://archives.neohapsis.com/archives/postfix/2000-04/0512.html
>>
>> That might have been right when it was written (note the reference to a
>> 2.2 Linux kernel), but it's 100% wrong now.  0 is the default, not-safe
>> setting.
>
> Is this something we could detect and show a warning for at initdb
> time; or perhaps log a warning in the log files at postmaster startup?
> This dangerous choice of settings seems to catch quite a few postgresql
> users.
>


Re: Postgres server crash

From
Markus Schaber
Date:
Hi, Richard,

Richard Troy wrote:

> The reason I spent a couple of hours looking for what I could learn on
> this is that I've been absolutely beside myself on this "extremely
> unfortunate" "feature." I had a badly behaving app (but didn't know which
> app it was), so Linux would kill lots of things, like, oh, say, inetd.

Actually, AFAICT, Linux tries to kill the process(es) that use the most
memory resources first.

Without overcommitment, the OOM killer won't kick in, and as long as the
hoggy applications don't actually exit when malloc fails, they will just
stay around, sitting on their memory.

> Good luck sshing into the box.

SSH'ing into the box can get even worse without overcommitment. When
the machine is stuck, the sshd that tries to spawn its child will get
an out-of-memory failure, and you won't be able to log in.

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

Re: Postgres server crash

From
"Jim C. Nasby"
Date:
On Sat, Nov 18, 2006 at 05:28:46PM -0800, Richard Troy wrote:
> <soapbox> ...I read a large number of articles on this subject and am
> absolutely dumbfounded by the -ahem- idiots who think killing a random
> process is an appropriate action. I'm just taking their word for it that
> there's some kind of impossibility of the existing Linux kernel not
> getting itself into a potentially hung situation because it didn't save
> itself any memory. Frankly, if it takes a complete kernel rewrite to fix
> the problem that the damned operating system can't manage its own needs,
> then the kernel needs to be rewritten! </soapbox>
>
> These kernel hackers could learn something from VAX/VMS.

What's interesting is that apparently FreeBSD also has overcommit (and
IIRC no way to disable it), yet I never hear people going off on OOM
kills in FreeBSD. My theory is that FreeBSD admins are smart enough to
dedicate a decent amount of swap space, so that by the time you got to
an OOM kill situation you'd be so far into swapping that the box would
be nearly unusable. Many linux 'admins' think it's ok to save a few GB
of disk space by allocating a small amount of swap (or none at all), and
*kaboom*.
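
(A quick sanity check for any given box:)

   swapon -s   # configured swap devices and sizes
   free -m     # RAM vs. swap, in MB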
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Postgres server crash

From
Michael Stone
Date:
On Sun, Nov 26, 2006 at 05:41:02PM -0600, Jim C. Nasby wrote:
>What's interesting is that apparently FreeBSD also has overcommit (and
>IIRC no way to disable it), yet I never hear people going off on OOM
>kills in FreeBSD.

Could just be that nobody is using FreeBSD. <ducks>

Seriously, though, there are so many linux installations out there that
you're statistically more likely to see corner cases. FWIW, I know a
whole lotta linux machines that have never had OOM-killer problems.
Remember, though, that anecdotal evidence isn't.

Mike Stone

Re: Postgres server crash

From
Florian Weimer
Date:
* Jim C. Nasby:

> What's interesting is that apparently FreeBSD also has overcommit (and
> IIRC no way to disable it), yet I never hear people going off on OOM
> kills in FreeBSD. My theory is that FreeBSD admins are smart enough to
> dedicate a decent amount of swap space, so that by the time you got to
> an OOM kill situation you'd be so far into swapping that the box would
> be nearly unusable.

I've seen OOM situations with our regular "use twice as much swap
space as there is RAM in the machine" rule.  Perhaps Linux is just
paging too fast for this to work. 8-P

By the way, the sysctl setting breaks some applications and
programming environments, such as SBCL and CMUCL.