Thread: Re: Heaps of read() syscalls by the postmaster
Hi *,

I am trying to find out why the MySQL benchmarks result in such _lousy_
performance for PostgreSQL. Tracing a simple update loop, I see the
postmaster hogging 95% CPU, doing this:

$ strace -c -p 10174
[ wait a few seconds ]
^C
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 88.84    1.813381          60     30247           read
  8.82    0.180123           6     30552           lseek
  1.76    0.035868        1087        33           recv
  0.45    0.009101          54       170           write
  0.07    0.001528          45        34           send
  0.03    0.000619          19        33           time
  0.03    0.000618          19        33        33 ioctl
------ ----------- ----------- --------- --------- ----------------
100.00    2.041238                 61102        33 total

... which translates to "the query

    update bench1 set dummy1='updated' where id=1747

is processed by calling the read() system call _one_thousand_times_".
That's a bit excessive. What the hell could possibly cause this?
Failure to use the index?? EXPLAIN says it's using the index.

This query, unsurprisingly, takes 0.08 seconds to execute, which is way
too much. The table has been created (via Perl) thusly:

$server->create("bench1",
                ["id int NOT NULL", "id2 int NOT NULL",
                 "id3 int NOT NULL", "dummy1 char(30)"],
                ["primary key (id,id2)", "index index_id3 (id3)"]);

which translates to

create table bench1 (id int NOT NULL, id2 int NOT NULL,
                     id3 int NOT NULL, dummy1 char(30));
create unique index bench1_index_ on bench1 using btree (id,id2);
create index bench1_index_1 on bench1 using btree (id3);

I don't have a clue what the reason for this might be.

This is PostgreSQL 7.0, compiled with i686-pc-linux-gnu/2.95, using no
special options to compile or setup, except that fsync was turned off,
as verified by the above system call summary. Auto-commit was on during
this test (what's the SQL command to turn it off, anyway? I couldn't
find it).

NB: The same benchmark revealed that CREATE TABLE (or maybe it's CREATE
INDEX) leaks about 2k of memory.

-- 
Matthias Urlichs  |  noris network GmbH  |  smurf@noris.de  |  ICQ: 20193661
The quote was selected randomly. Really.  |  http://smurf.noris.de/
-- 
Justice: A commodity which (in a more or less adulterated condition)
the State sells to the citizen as a reward for his allegiance,
taxes, and personal service.
                              -- Ambrose Bierce, "The Devil's Dictionary"
Hi,

short add-on:

> The table has been created (via Perl) thusly:
> [...]

and then 300000 records have been inserted, with unique values of 'id'
of course, before doing the updates I mentioned in my previous post.

The backend does 500 read() calls for each INSERT, too, which is
equally dog-slow. Trace of the testing program's send() calls:

08:59:20.367593 send(3, "Qinsert into bench1 (id,id2,id3,dummy1) values (2730,2730,2730,\'ABCDEFGHIJ\')\0", 77, 0) = 77
08:59:20.416391 send(3, "Qinsert into bench1 (id,id2,id3,dummy1) values (2731,2731,2731,\'ABCDEFGHIJ\')\0", 77, 0) = 77
08:59:20.457082 send(3, "Qinsert into bench1 (id,id2,id3,dummy1) values (2732,2732,2732,\'ABCDEFGHIJ\')\0", 77, 0) = 77
08:59:20.497766 send(3, "Qinsert into bench1 (id,id2,id3,dummy1) values (2733,2733,2733,\'ABCDEFGHIJ\')\0", 77, 0) = 77
08:59:20.538928 send(3, "Qinsert into bench1 (id,id2,id3,dummy1) values (2734,2734,2734,\'ABCDEFGHIJ\')\0", 77, 0) = 77

Trace summary of the server while doing this:

$ sudo strace -c -p 27264
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 84.77    1.224299          60     20301           read
 10.90    0.157477           8     20573           lseek
  3.30    0.047625        1058        45           recv
  0.69    0.009914          54       182           write
  0.22    0.003168          70        45           send
  0.07    0.000955          21        45        45 ioctl
  0.06    0.000899          20        45           time
------ ----------- ----------- --------- --------- ----------------
100.00    1.444337                 41236        45 total

i.e., 450 or so read() calls per request, apparently serially scanning
for something or other:

$ strace -e lseek -p <pid-of-postmaster>
lseek(13, 0, SEEK_SET)      = 0
lseek(13, 8192, SEEK_SET)   = 8192
lseek(13, 16384, SEEK_SET)  = 16384
lseek(13, 24576, SEEK_SET)  = 24576
lseek(13, 32768, SEEK_SET)  = 32768
lseek(13, 40960, SEEK_SET)  = 40960
lseek(13, 49152, SEEK_SET)  = 49152
lseek(13, 57344, SEEK_SET)  = 57344
lseek(13, 65536, SEEK_SET)  = 65536
lseek(13, 73728, SEEK_SET)  = 73728
lseek(13, 81920, SEEK_SET)  = 81920
lseek(13, 90112, SEEK_SET)  = 90112
lseek(13, 98304, SEEK_SET)  = 98304
lseek(13, 106496, SEEK_SET) = 106496
lseek(13, 114688, SEEK_SET) = 114688

I think you'll agree that this kind of query is supposed to use an
index:

test=> explain select * from bench1 where id = 123;
NOTICE:  QUERY PLAN:

Index Scan using bench1_index_ on bench1  (cost=0.00..8.14 rows=10 width=24)

EXPLAIN
test=> explain insert into bench1 (id,id2,id3,dummy1) values
test-> (2730,2730,2730,'ABCDEFGHIJ');
NOTICE:  QUERY PLAN:

Result  (cost=0.00..0.00 rows=0 width=0)

EXPLAIN
test=>

Hmmm... what, no query plan at all???

-- 
Matthias Urlichs  |  noris network GmbH  |  smurf@noris.de  |  ICQ: 20193661
The quote was selected randomly. Really.  |  http://smurf.noris.de/
-- 
Many changes of mind and mood; do not hesitate too long.
> -----Original Message-----
> From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
> Behalf Of Matthias Urlichs
>
> Hi,
>
> short add-on:
>
> > The table has been created (via Perl) thusly:
> > [...]
>
> i.e., 450 or so read() calls per request, apparently serially scanning
> for something or other:
>
> $ strace -e lseek -p <pid-of-postmaster>
> lseek(13, 0, SEEK_SET)      = 0
> lseek(13, 8192, SEEK_SET)   = 8192
> lseek(13, 16384, SEEK_SET)  = 16384
> lseek(13, 24576, SEEK_SET)  = 24576
> lseek(13, 32768, SEEK_SET)  = 32768
> lseek(13, 40960, SEEK_SET)  = 40960
> lseek(13, 49152, SEEK_SET)  = 49152
> lseek(13, 57344, SEEK_SET)  = 57344
> lseek(13, 65536, SEEK_SET)  = 65536
> lseek(13, 73728, SEEK_SET)  = 73728
> lseek(13, 81920, SEEK_SET)  = 81920
> lseek(13, 90112, SEEK_SET)  = 90112
> lseek(13, 98304, SEEK_SET)  = 98304
> lseek(13, 106496, SEEK_SET) = 106496
> lseek(13, 114688, SEEK_SET) = 114688

This seems to be a scan of a system table (pg_index?). There are some
(many?) places where indexes aren't (or couldn't be) used to scan
system tables. Should we avoid sequential scans of system tables where
possible?

Comments?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
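One quick way to confirm which file descriptor 13 actually refers to is
to resolve it through /proc (Linux-specific). A minimal sketch in Perl;
the PID and fd number are taken from the traces above, so adjust them
to whatever your strace shows:

#!/usr/bin/perl
# Resolve an open file descriptor of a traced backend to the file it
# points at. On Linux, /proc/<pid>/fd/<n> is a symlink to the open
# file. PID 27264 and fd 13 are the values from the strace output
# earlier in this thread.
my ($pid, $fd) = (27264, 13);
my $target = readlink("/proc/$pid/fd/$fd")
    or die "cannot readlink /proc/$pid/fd/$fd: $!\n";
print "fd $fd of pid $pid -> $target\n";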
Hi,

Hiroshi Inoue:
> This seems to be a scan of a system table (pg_index?).

You're right. It's the database's test/pg_attribute file,
which is a whopping 41 MBytes.

It would probably be very interesting to figure out (a) why it's so
big, and (b) why it's not accessed using an index.

Any pointers on how I should proceed?

-- 
Matthias Urlichs  |  noris network GmbH  |  smurf@noris.de  |  ICQ: 20193661
The quote was selected randomly. Really.  |  http://smurf.noris.de/
-- 
A man can sleep around, no questions asked, but if a woman makes
nineteen or twenty mistakes she's a tramp.
                                            -- Joan Rivers
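A quick way to see which relations (system or user) are bloated is to
sort pg_class by page count. relpages is only refreshed by VACUUM, so
the numbers are approximate, but a 41 MB pg_attribute would stand out
immediately. A minimal sketch using DBI, assuming the same 'test'
database as above:

#!/usr/bin/perl
use strict;
use DBI;

# List the ten largest relations by page count. A page is 8 kB in a
# stock build; relpages is only as fresh as the last VACUUM.
my $dbh = DBI->connect('dbi:Pg:dbname=test', undef, undef,
                       { RaiseError => 1 });
my $sth = $dbh->prepare(
    'SELECT relname, relpages FROM pg_class ORDER BY relpages DESC LIMIT 10');
$sth->execute();
while (my ($name, $pages) = $sth->fetchrow_array()) {
    printf "%-24s %8d pages (~%.1f MB)\n",
           $name, $pages, $pages * 8192 / 1e6;
}
$dbh->disconnect();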
Matthias Urlichs wrote:

> You're right. It's the database's test/pg_attribute file,
> which is a whopping 41 MBytes.

I suppose the obvious question is: if you copy the database to a new
database, is the new database's pg_attribute also 41 MB?
Matthias Urlichs wrote:
>
> Hi,
>
> Hiroshi Inoue:
> > This seems to be a scan of a system table (pg_index?).
>
> You're right. It's the database's test/pg_attribute file,
> which is a whopping 41 MBytes.

Do we shrink system tables on vacuum?

It's possible that running some benchmark that creates/drops tables
repeatedly will blow up the size of system tables, incl. pg_attribute.

----------------
Hannu
Hi,

Chris:
> Matthias Urlichs wrote:
> > You're right. It's the database's test/pg_attribute file,
> > which is a whopping 41 MBytes.
>
> I suppose the obvious question is: if you copy the database to a new
> database, is the new database's pg_attribute also 41 MB?

? I don't understand. The database was created by a simple
initdb/createdb. The test user was created and given access, and the
benchmark was started. The person watching the benchmark subsequently
fell asleep. ;-)

The reason for the huge size of this file might be the fact that the
full benchmark first creates a whole damn lot of tables, which it then
deletes. Apparently this process results in a rather suboptimal
pg_attribute/pg_index file. It also leaks memory (about 2k per table)
in the backend.

I'll restart the whole thing later today. We'll see if the problem
comes back. Hopefully not.

-- 
Matthias Urlichs  |  noris network GmbH  |  smurf@noris.de  |  ICQ: 20193661
The quote was selected randomly. Really.  |  http://smurf.noris.de/
-- 
Guitar players had their licks.
Hi,

Hannu Krosing:
> Do we shrink system tables on vacuum?

If the user calling the VACUUM has access rights to them, yes.

> It's possible that running some benchmark that creates/drops tables
> repeatedly will blow up the size of system tables, incl. pg_attribute.

Now that this test has run once, I hope that the fixups done by that
VACUUM call will be persistent. I'll let you know.

-- 
Matthias Urlichs  |  noris network GmbH  |  smurf@noris.de  |  ICQ: 20193661
The quote was selected randomly. Really.  |  http://smurf.noris.de/
-- 
It's hard to keep a good girl down -- but lots of fun trying.
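A minimal sketch of the vacuum-and-verify cycle being discussed,
assuming the superuser can connect locally without a password; it
vacuums the database and reports pg_attribute's page count before and
after (VACUUM updates relpages, so the second reading reflects the
shrunken heap, though, as noted below, not the indexes):

#!/usr/bin/perl
use strict;
use DBI;

# VACUUM as the postgres superuser (needed to shrink the system
# tables). Run with AutoCommit on (the DBI default here) so the
# VACUUM is not wrapped in a transaction block.
my $dbh = DBI->connect('dbi:Pg:dbname=test', 'postgres', undef,
                       { RaiseError => 1 });

sub attr_pages {
    my ($pages) = $dbh->selectrow_array(
        "SELECT relpages FROM pg_class WHERE relname = 'pg_attribute'");
    return $pages;
}

printf "pg_attribute before: %d pages\n", attr_pages();
$dbh->do('VACUUM');
printf "pg_attribute after:  %d pages\n", attr_pages();
$dbh->disconnect();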
On Fri, 19 May 2000, Matthias Urlichs wrote:

> This is PostgreSQL 7.0, compiled with i686-pc-linux-gnu/2.95, using no
> special options to compile or setup, except that fsync was turned off,
> as verified by the above system call summary. Auto-commit was on during
> this test (what's the SQL command to turn it off, anyway? I couldn't
> find it).

In Perl, I do:

my $dbh = DBI->connect("dbi:Pg:dbname=$dbname;host=$dbhost;port=$dbport",
                       "$dbuser");
$dbh->{AutoCommit} = 0;

And then make sure you do a $dbh->commit(); whenever you want to end
the transaction ...

> NB: The same benchmark revealed that CREATE TABLE (or maybe it's CREATE
> INDEX) leaks about 2k of memory.

Marc G. Fournier                  ICQ#7615664              IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org          secondary: scrappy@{freebsd|postgresql}.org
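The practical upshot for insert-heavy loops like the benchmark's: with
AutoCommit off, commit once per batch instead of once per row. A
minimal sketch reusing the bench1 table from earlier in the thread;
the connection parameters and batch size are placeholders:

#!/usr/bin/perl
use strict;
use DBI;

# Batch INSERTs into explicit transactions. With AutoCommit off, DBI
# starts a transaction implicitly and commit() ends it, so each commit
# below covers a thousand rows instead of one.
my $dbh = DBI->connect('dbi:Pg:dbname=test', 'test', 'test',
                       { RaiseError => 1, AutoCommit => 0 });
my $sth = $dbh->prepare(
    'INSERT INTO bench1 (id, id2, id3, dummy1) VALUES (?, ?, ?, ?)');
for my $i (0 .. 4999) {
    $sth->execute($i, $i, $i, 'ABCDEFGHIJ');
    $dbh->commit() if $i % 1000 == 999;    # commit every 1000 rows
}
$dbh->commit();    # flush whatever is left in the final batch
$dbh->disconnect();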
"Matthias Urlichs" <smurf@noris.net> writes: >> Do we shrink system tables on vacuum ? >> > If the user calling the VACUUM has access rights to them, yes. But the indexes don't shrink (same problem as for user indexes). VACUUM doesn't really make any distinction between system tables and user tables; they're all handled the same way. IIRC, the only special-case in 7.0 is that it doesn't try to compute pg_statistic entries for pg_statistic ;-) >> It's possible that running some benchmark that creates/drops tables >> repetedly will blow up the size of system tables incl. pg_attribute. Yes, if you don't vacuum them every so often... But what I don't understand is why a simple INSERT is doing a sequential scan of pg_attribute. Presumably the parser needs to find out what the table's columns are ... but why isn't the catcache providing the data? Needs to be looked at. regards, tom lane
Hi,

Tom Lane:
> But what I don't understand is why a simple INSERT is doing a
> sequential scan of pg_attribute.

It might have been pg_index. I will look into this later today.

-- 
Matthias Urlichs  |  noris network GmbH  |  smurf@noris.de  |  ICQ: 20193661
The quote was selected randomly. Really.  |  http://smurf.noris.de/
-- 
A raccoon tangled with a 23,000 volt line today. The results blacked
out 1400 homes and, of course, one raccoon.
                                            -- Steel City News
If we create a unique index on a column, seems we could update
pg_statistic automatically to mark it as unique.

> "Matthias Urlichs" <smurf@noris.net> writes:
> >> Do we shrink system tables on vacuum?
> >
> > If the user calling the VACUUM has access rights to them, yes.
>
> But the indexes don't shrink (same problem as for user indexes).
>
> VACUUM doesn't really make any distinction between system tables and
> user tables; they're all handled the same way. IIRC, the only
> special-case in 7.0 is that it doesn't try to compute pg_statistic
> entries for pg_statistic ;-)
>
> >> It's possible that running some benchmark that creates/drops tables
> >> repeatedly will blow up the size of system tables, incl. pg_attribute.
>
> Yes, if you don't vacuum them every so often...
>
> But what I don't understand is why a simple INSERT is doing a
> sequential scan of pg_attribute. Presumably the parser needs to find
> out what the table's columns are ... but why isn't the catcache
> providing the data? Needs to be looked at.
>
> 			regards, tom lane

-- 
  Bruce Momjian                       |  http://www.op.net/~candle
  pgman@candle.pha.pa.us              |  (610) 853-3000
  +  If your life is a hard drive,    |  830 Blythe Avenue
  +  Christ can be your backup.       |  Drexel Hill, Pennsylvania 19026
"Matthias Urlichs" <smurf@noris.net> writes: > NB: The same benchmark revealed that CREATE TABLE (or maybe it's CREATE > INDEX) leaks about 2k of memory. Following up on this other point: this could simply be the new table's relcache entry (2K seems high though). Currently the relcache doesn't have any procedure for discarding uninteresting entries, so once a table is referenced by a backend that relcache entry will be there until the backend quits or has some specific reason for flushing the entry. I wouldn't expect a CREATE TABLE / DELETE TABLE cycle to show any memory leak, since the DELETE would flush the relcache entry. But creating a few thousand tables in a row would start to eat up memory a little bit. What is the benchmark doing exactly? We could add a mechanism for aging relcache entries out of the cache when they haven't been touched recently, but so far it hasn't seemed worth the trouble... regards, tom lane
I'm attaching the Perl script I run to benchmark Postgres.
NOTE: you should create the 'pgtest' database first. The script uses
the DBI, DBD::Pg and Benchmark modules. There are obvious parameters
you can play with.

	Regards,
		Oleg

On Fri, 19 May 2000, Tom Lane wrote:

> Date: Fri, 19 May 2000 14:46:13 -0400
> From: Tom Lane <tgl@sss.pgh.pa.us>
> To: Matthias Urlichs <smurf@noris.net>
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Re: Heaps of read() syscalls by the postmaster
>
> "Matthias Urlichs" <smurf@noris.net> writes:
> > NB: The same benchmark revealed that CREATE TABLE (or maybe it's CREATE
> > INDEX) leaks about 2k of memory.
>
> Following up on this other point: this could simply be the new table's
> relcache entry (2K seems high though). Currently the relcache doesn't
> have any procedure for discarding uninteresting entries, so once a
> table is referenced by a backend, that relcache entry will be there
> until the backend quits or has some specific reason for flushing the
> entry.
>
> I wouldn't expect a CREATE TABLE / DELETE TABLE cycle to show any
> memory leak, since the DELETE would flush the relcache entry. But
> creating a few thousand tables in a row would start to eat up memory a
> little bit. What is the benchmark doing exactly?
>
> We could add a mechanism for aging relcache entries out of the cache
> when they haven't been touched recently, but so far it hasn't seemed
> worth the trouble...
>
> 			regards, tom lane

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

#!/usr/local/bin/perl

use strict;
use DBI;
use DBD::Pg;
use Benchmark;

# Change this
my $DSN = 'DBI:Pg:dbname=pgtest';
my $TABLE = 'bench';

my($i);

sub TimeMe ($$$$) {
    my($startMsg, $endMsg, $code, $count) = @_;
    printf("\n%s\n", $startMsg);
    my($t1) = Benchmark->new();
    $@ = '';
    eval {
        for ($i = 0;  $i < $count;  $i++) {
            &$code;
        }
    };
    if ($@) {
        print "Test failed, message: $@\n";
    } else {
        my($td) = Benchmark::timediff(Benchmark->new(), $t1);
        my($dur) = $td->cpu_a;
        printf($endMsg, $count, $dur, $count / $dur);
        print "\n";
    }
}

TimeMe("Testing empty loop speed ...",
       "%d iterations in %.1f cpu+sys seconds (%d per sec)",
       sub { },
       100000);

my($dbh);
TimeMe("Testing connect/disconnect speed ...",
       "%d connections in %.1f cpu+sys seconds (%d per sec)",
       sub {
           $dbh = DBI->connect($DSN, undef, undef, { 'RaiseError' => 1 });
           $dbh->disconnect();
       },
       2000);

$dbh = DBI->connect($DSN, undef, undef, { 'RaiseError' => 0 });
$dbh->do("DROP TABLE $TABLE");
$dbh->disconnect();

$dbh = DBI->connect($DSN, undef, undef, { 'RaiseError' => 1 });
TimeMe("Testing CREATE/DROP TABLE speed ...",
       "%d files in %.1f cpu+sys seconds (%d per sec)",
       sub {
           $dbh->do("CREATE TABLE $TABLE (id INT4, name CHAR(40),"
                    . " firstname CHAR(40), address CHAR(40),"
                    . " zip CHAR(10), city CHAR(40), email CHAR(40))");
           $dbh->do("DROP TABLE $TABLE");
       },
       1000);
$dbh->disconnect();

$dbh = DBI->connect($DSN, undef, undef, { 'RaiseError' => 1 });
$dbh->do("CREATE TABLE $TABLE (id INT4, name CHAR(40),"
         . " firstname CHAR(40), address CHAR(40),"
         . " zip CHAR(10), city CHAR(40), email CHAR(40))");

my(@vals) = (0 .. 499);
my($num);
TimeMe("Testing INSERT speed ...",
       "%d rows in %.1f cpu+sys seconds (%d per sec)",
       sub {
           ($num) = splice(@vals, int(rand(@vals)), 1);
           $dbh->do("INSERT INTO $TABLE VALUES (?, 'Wiedmann', 'Jochen',"
                    . " 'Am Eisteich 9', '72555', 'Metzingen',"
                    . " 'joe\@ispsoft.de')", undef, $num);
       },
       500);
$dbh->disconnect();

@vals = (0 .. 499);
$dbh = DBI->connect($DSN, undef, undef, { 'RaiseError' => 1 });
TimeMe("Testing UPDATE speed ...",
       "%d rows in %.1f cpu+sys seconds (%d per sec)",
       sub {
           ($num) = splice(@vals, int(rand(@vals)), 1);
           $dbh->do("UPDATE $TABLE SET name='NnamdeiW' WHERE id=$num");
       },
       500);
$dbh->disconnect();

$dbh = DBI->connect($DSN, undef, undef, { 'RaiseError' => 1 });
my($sth);
TimeMe("Testing SELECT speed ...",
       "%d single rows in %.1f cpu+sys seconds (%.1f per sec)",
       sub {
           $num = int(rand(500));
           $sth = $dbh->prepare("SELECT * FROM $TABLE WHERE id = $num");
           $sth->execute();
           $sth->fetch() or die "Expected result for id = $num";
       },
       100);
$sth->finish;
$dbh->disconnect();

$dbh = DBI->connect($DSN, undef, undef, { 'RaiseError' => 1 });
TimeMe("Testing SELECT speed (multiple rows) ...",
       "%d times 100 rows in %.1f cpu+sys seconds (%.1f per sec)",
       sub {
           $num = int(rand(400));
           $sth = $dbh->prepare("SELECT * FROM $TABLE WHERE id >= $num"
                                . " AND id < " . ($num + 100));
           $sth->execute();
           ($sth->rows() == 100)
               or die "Expected 100 rows for id = $num, got " . $sth->rows();
           while ($sth->fetch()) { }
       },
       100);
$sth->finish;
$dbh->disconnect();

exit;
Hi,

Tom Lane:
> "Matthias Urlichs" <smurf@noris.net> writes:
> > NB: The same benchmark revealed that CREATE TABLE (or maybe it's CREATE
> > INDEX) leaks about 2k of memory.
>
> Following up on this other point: this could simply be the new table's
> relcache entry (2K seems high though).

The first part of the many-tables benchmark does 10000 CREATE
TABLE/CREATE INDEX calls followed by 10000 DROP TABLE calls (i.e.
you have ten thousand tables after the first step).

The postmaster process grows from 15 to 50 MBytes during that time.

The second part does 10000 CREATE TABLE/CREATE INDEX/DROP TABLE calls
(i.e. it deletes every table immediately). Afterwards, the postmaster
process is 85 MBytes big.

> What is the benchmark doing exactly?

I can put a standalone version of the benchmark up for download
someplace.

> We could add a mechanism for aging relcache entries out of the cache
> when they haven't been touched recently, but so far it hasn't seemed
> worth the trouble...

Not for that benchmark, certainly, but there are real-world
applications which do play with lots of temporary tables for one
reason or another.

-- 
Matthias Urlichs  |  noris network GmbH  |  smurf@noris.de  |  ICQ: 20193661
The quote was selected randomly. Really.  |  http://smurf.noris.de/
-- 
I want to dress you up as TALLULAH BANKHEAD and cover you with
VASELINE and WHEAT THINS ...
                              -- Zippy the Pinhead
"Matthias Urlichs" <smurf@noris.net> writes: >>>> NB: The same benchmark revealed that CREATE TABLE (or maybe it's CREATE >>>> INDEX) leaks about 2k of memory. >> >> Following up on this other point: this could simply be the new table's >> relcache entry (2K seems high though). > The first part of the many-tables benchmark does 10000 CREATE > TABLE/CREATE INDEX calls followed by 10000 DROP TABLE calls (i.e. > you have ten thousand tables after the first step). > The postmaster pricess grows from 15 to 50 MBtes during that time. > The second part does 10000 CREATE TABLE/CREATE INDEX/DROP TABLE calls > (i.e. it deletes every table immediately). Afterwards, the postmaster > process is 85 MBytes big. Hmm. So the space *is* leaked. Grumble. Another TODO list item... thanks for following up on it. >> What is the benchmark doing exactly? >> > I can put a standalone version of the benchmark up for download > someplace. OK. If this benchmark comes with MySQL, though, just tell us where to look in their tarball --- no need for you to provide an alternate distribution when we can get it from their server... >> We could add a mechanism for aging relcache entries out of the cache >> when they haven't been touched recently, but so far it hasn't seemed >> worth the trouble... >> > Not for that benchmark, certainly, but there are real-world applications > which do play with lots of temporary tables for one reason or another. Agreed, but I thought the space would get reclaimed when you deleted the temp table. That's definitely a bug. regards, tom lane
Hi,

Tom Lane:
> OK. If this benchmark comes with MySQL, though, just tell us where
> to look in their tarball --- no need for you to provide an alternate
> distribution when we can get it from their server...

Unpack mysql-3.23.16.tar.gz, subdirectory sql-bench;

./run-all-tests --server=pg --fast --database test --user=test --password=test

(Or ./test-whatever ... with the same options, for a single test run.)

Running the test as the postgres superuser will change the results.

-- 
Matthias Urlichs  |  noris network GmbH  |  smurf@noris.de  |  ICQ: 20193661
The quote was selected randomly. Really.  |  http://smurf.noris.de/
-- 
APL programmers do it with stile.
>>>>> NB: The same benchmark revealed that CREATE TABLE (or maybe it's CREATE
>>>>> INDEX) leaks about 2k of memory.

> Hmm. So the space *is* leaked. Grumble. Another TODO list item...
> thanks for following up on it.

OK, I think this bug is swatted --- it's not relcache really, just a
couple of generic memory leaks that happened to occur during relcache
entry construction (worst possible time). If you care to try the
repaired code, see our CVS server, or grab a nightly snapshot dated
later than this message.

			regards, tom lane
Hi,

Tom Lane:
> entry construction (worst possible time). If you care to try the
> repaired code, see our CVS server, or grab a nightly snapshot dated
> later than this message.

Thank you, I'll do that.

-- 
Matthias Urlichs  |  noris network GmbH  |  smurf@noris.de  |  ICQ: 20193661
The quote was selected randomly. Really.  |  http://smurf.noris.de/
-- 
Go ahead - the Surgeon General has determined that you only live once