Thread: Re: Slowdown problem when writing 1.7million records
Stephen Livesey wrote:
>
> I am very new to PostgreSQL and have installed v7.03 on a Red Hat Linux
> Server (v6.2). I am accessing the files using JDBC from a Windows 2000 PC.
>
> I have created a small file as follows:
>
>   CREATE TABLE expafh (
>     postcode CHAR(8) NOT NULL,
>     postcode_record_no INT,
>     street_name CHAR(30),
>     town CHAR(31),
>     PRIMARY KEY(postcode) )
>
> I am now writing 1.7 million records to this file.
>
> The first 100,000 records took 15 mins.
> The next 100,000 records took 30 mins.
> The last 100,000 records took 4 hours.
>
> In total, it took 43 hours to write 1.7 million records.
>
> Is this sort of degradation normal using a PostgreSQL database?

AFAICT, no.

> I have never experienced this sort of degradation with any other database,
> and I have done exactly the same test (using the same hardware) on the
> following databases:
>
>   DB2 v7        in total took 10 hours 6 mins
>   Oracle 8i     in total took 3 hours 20 mins
>   Interbase v6  in total took 1 hr 41 min
>   MySQL v3.23   in total took 54 mins
>
> Any help or advice would be appreciated.

Did you "vacuum analyse" your DB? This seems to be essential to PG
performance, for various reasons.

Do you have a unique index on your primary key?

HTH,

Emmanuel Charpentier
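For reference, those two checks look roughly like this in psql (the table
name is taken from Stephen's CREATE TABLE; the PRIMARY KEY should already
have created a unique index, typically named expafh_pkey):

    -- refresh planner statistics and reclaim dead space
    VACUUM ANALYZE expafh;

    -- show the table definition; \di lists the existing indexes
    \d expafh
    \di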
> > I am very new to PostgreSQL and have installed v7.03 on a Red Hat Linux
> > Server (v6.2). I am accessing the files using JDBC from a Windows 2000 PC.
> >
> > I have created a small file as follows:
> >
> >   CREATE TABLE expafh (
> >     postcode CHAR(8) NOT NULL,
> >     postcode_record_no INT,
> >     street_name CHAR(30),
> >     town CHAR(31),
> >     PRIMARY KEY(postcode) )
> >
> > I am now writing 1.7 million records to this file.
> >
> > The first 100,000 records took 15 mins.
> > The next 100,000 records took 30 mins.
> > The last 100,000 records took 4 hours.
> >
> > In total, it took 43 hours to write 1.7 million records.
> >
> > Is this sort of degradation normal using a PostgreSQL database?
>
> AFAICT, no.

Funny you should say that, but I seem to be experiencing something similar
when doing a pg_dump/restore. When restoring a table with around 40M
records, everything starts nice and fast, but by the time it's half done,
inserts take forever. It took about 16 hours to restore all 40M records
(3 GB or so, plus indices). Obviously, since it was a restore, there was no
vacuuming during the operation, but it still seemed to get unusually slow
toward the end... judging by the rate at which the files grew, which I
accept isn't the most reliable way to measure this.

But I guess this is nothing compared to your problem. I was inserting 40M
records and it took 16 hours; you are inserting 1.7M records and it's
taking 43 hours.

Are you using autocommit? Are you using fsync? You should probably
periodically commit the data every few thousand inserts if you don't
have autocommit enabled.

Are you running out of memory? What amount of shared memory do you have
enabled? How much is Postgres allowed to use?

I know this is fairly obvious stuff, but it is, as such, quite easy to
overlook...

> > I have never experienced this sort of degradation with any other database,
> > and I have done exactly the same test (using the same hardware) on the
> > following databases:
> >
> >   DB2 v7        in total took 10 hours 6 mins
> >   Oracle 8i     in total took 3 hours 20 mins
> >   Interbase v6  in total took 1 hr 41 min
> >   MySQL v3.23   in total took 54 mins
> >
> > Any help or advice would be appreciated.

The only thing that comes to mind is that if you're doing a bulk insert,
you should probably drop all indices that aren't unique or for primary
keys, and re-create them once you've inserted all your data...

Regards.

Gordan
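To make the batching suggestion concrete, here is a rough sketch of what it
amounts to at the SQL level (the sample values and the expafh_town_idx index
name are made up for illustration; from JDBC the same effect comes from
Connection.setAutoCommit(false) plus a periodic commit()):

    BEGIN;
    INSERT INTO expafh (postcode, postcode_record_no, street_name, town)
        VALUES ('AB1 2CD', 1, 'Some Street', 'Some Town');
    -- ... a few thousand more INSERTs ...
    COMMIT;
    -- repeat BEGIN ... COMMIT for each batch of rows

    -- For a bulk load, drop any secondary (non-unique, non-primary-key)
    -- indexes first and rebuild them afterwards, e.g.:
    -- DROP INDEX expafh_town_idx;
    -- ... load the data ...
    -- CREATE INDEX expafh_town_idx ON expafh (town);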
> Server (v6.2). I am accessing the files using JDBC from a Windows 2000 PC.

I don't use Microsoft software or Java, but here are a few general
suggestions:

- you may want to create the index at the end, in one operation, using
  CREATE INDEX
- you may want to run your requests in transactions, e.g. a transaction
  that is committed every N insertions
- in some rare cases you may want to disable PostgreSQL's fsync()
- you may want to use a more efficient interface, such as the COPY command

Example (using transactions):

#! /usr/bin/perl -wI /home/schaefer/perl-libs

# $Id$

use DBI;
use test_db;
use strict;

my $debug = 0;
my $result = 0;                   # success
my $reason = "of an unknown error";

$| = 1;

my $dbh = &open_database();

if (defined($dbh)) {
   my $amount_entries = 4000000;
   my $commit_every = 10000;

   # Sometimes, large data entries are done better with a COPY.

   $dbh->{AutoCommit} = 0;        # Use transactions.

   foreach (('sol_f', 'sol_i', 'sol_d')) {
      # Using transactions should make insertion faster, since fsync()s
      # are probably not required. However, when changes are very big,
      # it might actually make it slower or use much space; this is why
      # we have $commit_every above and below.

      # I was expecting a BEGIN WORK; here, but that seems to be implicit.

      eval {
         my $i;
         my $failed_reason = "unknown db error";

         print "Populating " . $_ . " ...";

         for ($i = 0; ($i < $amount_entries) && ($result == 0); $i++) {
            my @titles = ('id', 'ref', 'sentpos', 'wordpos');

            if (!&do_query($dbh,
                           "INSERT INTO " . $_
                              . " (" . join(", ", @titles) . ") VALUES ("
                              . join(", ", ('?') x @titles) . ")",
                           \@titles,
                           undef,
                           [ int(rand(32768)), 'truc',
                             int(rand(32768)), 'temp' ],
                           undef,
                           \$failed_reason)) {
               $result = 1;
               $reason = "can't insert " . $i . ": " . $failed_reason;
            } else {
               if (($i % $commit_every) == 0) {
                  if ($dbh->commit) {
                     print "C ";
                  } else {
                     $result = 1;
                     $reason = "can't commit: " . $dbh->errstr;
                  }
               }
            }
         }

         if ($result == 0) {
            print " POPULATED.\n";
         } else {
            print " FAILED.\n";
         }
      };

      # DB is not set to die mode, so we will catch only our bugs.
      if ($@) {
         print;
         $result = 1;
         $reason = "transaction failed: " . $@;   # Not always right.
         $dbh->rollback;          # result ignored (in failure mode anyway)
      } elsif ($result) {
         $dbh->rollback;          # result ignored (in failure mode anyway)
      } else {
         if (!$dbh->commit) {
            $result = 1;
            $reason = "can't commit: " . $dbh->errstr;
         } else {
            print "COMMITTED.\n";
         }
      }
   }

   $dbh->{AutoCommit} = 1;        # No transactions.

   if (!$dbh->disconnect) {
      $result = 1;
      $reason = "disconnect error: " . $dbh->errstr;
   }

   undef $dbh;
} else {
   # Obviously, can't use $dbh->errstr here.
   $reason = "can't connect to database: " . $DBI::errstr;
   $result = 1;
}

if ($result) {
   print $0 . ": failed " . $result . " because " . $reason . "\n";
} else {
   print "SUCCESSFUL.\n";
}

exit $result;

sub create_table {
   my ($dbh, $name, $val) = @_;
   my $result = 0;

   my $sth = $dbh->prepare("CREATE TABLE $name ($val)");

   if (defined($sth)) {
      my $rv = $sth->execute;

      if (defined($rv)) {
         if ($debug) {
            print "$name: succeeded.\n";
         }

         $result = 1;
      }

      $sth->finish;
      undef $sth;
   }

   return $result;
}

And using COPY:

#! /usr/bin/perl -wI /home/schaefer/perl-libs

# USAGE
#    ./copy.pl | psql test_db

# $Id$

my $amount_entries = 4000000;
my $tell_every = 100000;

print "COPY sol_f FROM stdin;\n";

my $i;

for ($i = 1; $i <= $amount_entries; $i++) {
   print int(rand(32768)) . "\t" . 'truc' . "\t"
         . int(rand(32768)) . "\t" . 'temp' . "\n";

   if (($i % $tell_every) == 0) {
      print STDERR $i . "\n";
   }
}

print "\\.\n";   # COPY's end-of-data marker is a backslash-period
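Applied to the table from the original post, the COPY stream fed to psql
would look something like this (the data rows are invented examples; the
columns are tab-separated, and the stream ends with a backslash-period
line):

    COPY expafh FROM stdin;
    AB1 2CD	1	Some Street	Some Town
    AB1 2CE	2	Another Street	Some Town
    \.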
"Gordan Bobic" <gordan@freeuk.com> writes: > Are you using autocommit? Are you using fsync? You should probably > periodically commit the data every few thousand inserts if you don't > have autocommit enabled. > [ various other speedup suggestions from other people ] The thing that interests me here is the apparent slowdown as more data is inserted. I don't see a good reason for that, and none of the suggested remedies seem likely to eliminate it if we don't know what's causing it. INSERT per se should be an essentially constant-time operation, since it's just slapping another record at the end of the table. Unless you're using advanced features like triggers/rules/foreign keys, it seems like any progressive slowdown would have to be blamed on updates of indexes. But that's not normally a performance problem except in pathological cases (zillions of equal keys for example --- but Stephen's only index is a PRIMARY KEY, therefore UNIQUE, therefore no equal keys). So I'm confused, and think this deserves more careful examination as to *why* the slowdown, rather than just looking for a quick&dirty workaround. regards, tom lane