Thread: Sustained inserts per sec ... ?
Just curious, but does anyone have an idea of what we are capable of? I
realize that size of record would affect things, as well as hardware, but
if anyone has some ideas on max, with 'record size', that would be
appreciated ...

Thanks ...

----
Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org   Yahoo!: yscrappy   ICQ: 7615664
> Just curious, but does anyone have an idea of what we are capable of? I
> realize that size of record would affect things, as well as hardware, but
> if anyone has some ideas on max, with 'record size', that would be
> appreciated ...

Well, I just did an insert of 27,500 records with 9 fields, averaging
around 118 bytes per record, each insert coming from a separate SQL
statement fed to psql, and it took a bit over 4 minutes, or about 106
inserts per second.

That seems consistent with what I get when I do a restore of a dump file
that has insert statements instead of COPY.

The hardware is a Dell dual Xeon system; the disks are mirrored SATA
drives with write buffering turned off.
--
Mike Nolan
On Apr 1, 2005 1:06 PM, Marc G. Fournier <scrappy@postgresql.org> wrote:
>
> Just curious, but does anyone have an idea of what we are capable of? I
> realize that size of record would affect things, as well as hardware, but
> if anyone has some ideas on max, with 'record size', that would be
> appreciated ...

On an AMD64/3000, 1GB RAM, 2 SATA drives (1 for log, 1 for data),
inserting in batches of 500-1000 rows using the COPY syntax, I have seen
an interesting thing. There are 5 indexes involved here, BTW. This is
Linux 2.6 running on an XFS file system (ext3 was even worse for me).

I can start at about 4,000 rows/second, but at about 1M rows it plummets,
and by 4M it's taking 6-15 seconds to insert 1000 rows. That's only about
70-170 rows/second, which is quite pathetic. The problem seems to be
related to my indexes, since I have to keep them online (the system is
continually querying as well).

This was an application originally written for MySQL/MyISAM, and it's
looking like PostgreSQL can't hold up for it, simply because it's "too
much database", if that makes sense. The same box, running the MySQL
implementation (which uses no transactions), runs at around 800-1000
rows/second sustained. Just a point of reference.

I'm trying to collect some data so that I can provide some charts of the
degradation, hoping to find the point where it dies and thereby find the
point where it needs attention.

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
Christopher Petrilli <petrilli@gmail.com> writes:
> I can start at about 4,000 rows/second, but at about 1M rows it
> plummets, and by 4M it's taking 6-15 seconds to insert 1000 rows.
> That's only about 70-170 rows/second, which is quite pathetic. The
> problem seems to be related to my indexes, since I have to keep them
> online (the system is continually querying as well).

I doubt it has anything to do with your indexes. I'm wondering about
foreign key checks, myself. What FKs exist on these tables? Is the
"start" condition zero rows in every table? Which PG version exactly?

			regards, tom lane
On Apr 1, 2005 3:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Christopher Petrilli <petrilli@gmail.com> writes:
> > I can start at about 4,000 rows/second, but at about 1M rows it
> > plummets, and by 4M it's taking 6-15 seconds to insert 1000 rows.
> > That's only about 70-170 rows/second, which is quite pathetic. The
> > problem seems to be related to my indexes, since I have to keep them
> > online (the system is continually querying as well).
>
> I doubt it has anything to do with your indexes. I'm wondering about
> foreign key checks, myself. What FKs exist on these tables? Is the
> "start" condition zero rows in every table? Which PG version exactly?

Sure, I'm going to post something on my web site when I get some numbers
that will make this more valuable. To answer your questions:

1. No foreign keys (long story, but they just don't exist for this one
   table).
2. Start condition is zero. I'm using multiple inherited tables to deal
   with the data "partitioning", since the eventual size is billions of
   rows. Each "partition" currently has 10M rows in it as a goal.
3. Version 8.0.2; however, I started this experiment with 8.0.0.
4. fsync=off

What seems to happen is it slams into a "wall" of some sort, the system
goes into a disk write frenzy (wait=90% CPU), and eventually recovers and
starts running for a while at a more normal speed. What I need, though,
is to not have that wall happen. It is easier for me to accept a constant
degradation of 5% than a 99% degradation for short periods, as it can
cause cascade problems in the system.

My goal is to gather some numbers, and post code + schema + analysis.

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
> What seems to happen is it slams into a "wall" of some sort, the
> system goes into a disk write frenzy (wait=90% CPU), and eventually
> recovers and starts running for a while at a more normal speed. What
> I need, though, is to not have that wall happen. It is easier for me
> to accept a constant degradation of 5% than a 99% degradation for
> short periods, as it can cause cascade problems in the system.

Could this possibly be a checkpoint happening?

Also, how many checkpoint segments do you have?

> My goal is to gather some numbers, and post code + schema + analysis.
>
> Chris

--
Command Prompt, Inc., Your PostgreSQL solutions company. 503-667-4564
Custom programming, 24x7 support, managed services, and hosting
Open Source Authors: plPHP, pgManage, Co-Authors: plPerlNG
Reliable replication, Mammoth Replicator - http://www.commandprompt.com/
Christopher Petrilli <petrilli@gmail.com> writes:
> What seems to happen is it slams into a "wall" of some sort, the
> system goes into a disk write frenzy (wait=90% CPU), and eventually
> recovers and starts running for a while at a more normal speed.

Checkpoint maybe? If so, tweaking the bgwriter parameters might improve
matters.

			regards, tom lane
On Apr 1, 2005 3:53 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> > What seems to happen is it slams into a "wall" of some sort, the
> > system goes into a disk write frenzy (wait=90% CPU), and eventually
> > recovers and starts running for a while at a more normal speed. What
> > I need, though, is to not have that wall happen. It is easier for me
> > to accept a constant degradation of 5% than a 99% degradation for
> > short periods, as it can cause cascade problems in the system.
>
> Could this possibly be a checkpoint happening?
>
> Also, how many checkpoint segments do you have?

Changes to the postgresql.conf file from "default":

maintenance_work_mem = 131072
fsync = false
checkpoint_segments = 32

I set the checkpoint segments up until it no longer complained about them
rolling over; that was the best "advice" I could find online. I upped
maintenance_work_mem to deal with indexes being updated constantly. And
finally, since I'm willing to risk some loss, I turned fsync off, since
the system is on a UPS (or will be in production) and carefully monitored.

I did actually wonder about checkpoint_segments being an issue, since it
seems to me the more of them you have, the more you'd have to deal with
when checkpointing, so you might actually want to turn that number down
to get "smoother" behavior. Unfortunately, a lot of the advice for
"loading data" doesn't apply when you have a constant stream of load
rather than something sporadic.

Any advice is more than appreciated.

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
Mike Nolan <nolan@gw.tssi.com> writes:
> Well, I just did an insert of 27,500 records with 9 fields, averaging
> around 118 bytes per record, each insert coming from a separate SQL
> statement fed to psql, and it took a bit over 4 minutes, or about
> 106 inserts per second.

Is that with a separate transaction for each insert command? I can get
significantly higher rates on my devel machine if the inserts are bundled
into transactions of reasonable length.

With fsync on, you can't expect to get more than about one commit per
disk rotation (with a single inserting process), so with, say, a 7200 RPM
drive (120 revs/sec) the above is a pretty good fraction of the
theoretical limit.

			regards, tom lane
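To make the comparison concrete, here is a minimal sketch (using a
hypothetical table t, not one from the thread) of the two ways of feeding
inserts to psql; with fsync on, each autocommit statement pays its own WAL
flush, while the explicit transaction pays one flush at COMMIT for the
whole batch:

-- One transaction per statement: roughly one WAL flush
-- (one disk rotation) per row when fsync is on.
INSERT INTO t (id, payload) VALUES (1, 'a');
INSERT INTO t (id, payload) VALUES (2, 'b');

-- One explicit transaction: a single commit-time flush
-- amortized over all the rows.
BEGIN;
INSERT INTO t (id, payload) VALUES (3, 'c');
INSERT INTO t (id, payload) VALUES (4, 'd');
-- ... hundreds more rows ...
COMMIT;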
1250/sec with an average record size of 26 bytes.
800/sec with an average record size of 48 bytes.
250/sec with an average record size of 618 bytes.

Data from pg_stats and our own job monitoring.

The system has four partitions: two RAID 1s, a four-disk RAID 10, and a
six-disk RAID 10. pg_xlog is on the four-disk RAID 10; the database is on
the other RAID 10.

The data is very spread out because database turnover time is very high,
so our performance is about double this with a fresh DB (the data
half-life is probably measurable in days or weeks).

Alex Turner
netEconomist

On Apr 1, 2005 1:06 PM, Marc G. Fournier <scrappy@postgresql.org> wrote:
>
> Just curious, but does anyone have an idea of what we are capable of? I
> realize that size of record would affect things, as well as hardware, but
> if anyone has some ideas on max, with 'record size', that would be
> appreciated ...
>
> Thanks ...
>
> ----
> Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
> Email: scrappy@hub.org   Yahoo!: yscrappy   ICQ: 7615664
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
Oh - this is with a separate transaction per command. fsync is on.

Alex Turner
netEconomist

On Apr 1, 2005 4:17 PM, Alex Turner <armtuk@gmail.com> wrote:
> 1250/sec with an average record size of 26 bytes.
> 800/sec with an average record size of 48 bytes.
> 250/sec with an average record size of 618 bytes.
>
> Data from pg_stats and our own job monitoring.
>
> The system has four partitions: two RAID 1s, a four-disk RAID 10, and a
> six-disk RAID 10. pg_xlog is on the four-disk RAID 10; the database is
> on the other RAID 10.
>
> The data is very spread out because database turnover time is very high,
> so our performance is about double this with a fresh DB (the data
> half-life is probably measurable in days or weeks).
>
> Alex Turner
> netEconomist
>
> On Apr 1, 2005 1:06 PM, Marc G. Fournier <scrappy@postgresql.org> wrote:
> >
> > Just curious, but does anyone have an idea of what we are capable of?
> > I realize that size of record would affect things, as well as
> > hardware, but if anyone has some ideas on max, with 'record size',
> > that would be appreciated ...
> >
> > Thanks ...
> >
> > ----
> > Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
> > Email: scrappy@hub.org   Yahoo!: yscrappy   ICQ: 7615664
Alex Turner <armtuk@gmail.com> writes:
> On Apr 1, 2005 4:17 PM, Alex Turner <armtuk@gmail.com> wrote:
>> 1250/sec with an average record size of 26 bytes.
>> 800/sec with an average record size of 48 bytes.
>> 250/sec with an average record size of 618 bytes.

> Oh - this is with a separate transaction per command.
> fsync is on.

[ raised eyebrow... ] What kind of disk hardware is that exactly, and
does it have write cache enabled? It's hard to believe those numbers if
not.

Write caching is fine if it's done in a battery-backed cache, which you
can get in the higher-end hardware RAID controllers. Otherwise you're
going to have problems whenever the power goes away unexpectedly.

			regards, tom lane
> > Well, I just did an insert of 27,500 records with 9 fields, averaging
> > around 118 bytes per record, each insert coming from a separate SQL
> > statement fed to psql, and it took a bit over 4 minutes, or about
> > 106 inserts per second.
>
> Is that with a separate transaction for each insert command? I can get
> significantly higher rates on my devel machine if the inserts are
> bundled into transactions of reasonable length.

That's with autocommit on. If I do it as a single transaction block, it
takes about 6.5 seconds, which is about 4200 inserts per second.
--
Mike Nolan
On Apr 1, 2005 3:59 PM, Christopher Petrilli <petrilli@gmail.com> wrote:
> On Apr 1, 2005 3:53 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> > Could this possibly be a checkpoint happening?
> >
> > Also, how many checkpoint segments do you have?
>
> Changes to the postgresql.conf file from "default":
>
> maintenance_work_mem = 131072
> fsync = false
> checkpoint_segments = 32

I've now had a chance to run a couple more tests, and here are two graphs
of the time required to insert (via COPY from a file) 500 records at a
time:

http://www.amber.org/~petrilli/diagrams/pgsql_copy500.png
http://www.amber.org/~petrilli/diagrams/pgsql_copy500_bgwriter.png

The first is with the above changes; the second contains two additional
modifications to the configuration:

bgwriter_percent = 25
bgwriter_maxpages = 1000

To my (likely faulty) intuition, it would seem that there is a backup
happening in the moving of data from the WAL to its final resting place,
and that by increasing these I could pull that forward. As you can see
from the charts, that doesn't seem to have any major impact.

The point, in the rough middle, is where the program begins inserting
into a new table (inherited). The X axis is the "total" number of rows
inserted.

The table has:

* 21 columns (nothing too strange)
* No OIDs
* 5 indexes, including the primary key on a string

They are created by creating a main table, then doing:

CREATE TABLE foo001 () INHERITS (foos);

and then recreating all the indexes.

Thoughts? Any advice would be more than appreciated.

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
Yup, battery backed, cache enabled: a 6-drive RAID 10, a 4-drive RAID 10,
and 2 x RAID 1. It's a 3ware 9500S-8MI - not bad for $450 plus BBU.

Alex Turner
netEconomist

On Apr 1, 2005 6:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alex Turner <armtuk@gmail.com> writes:
> > On Apr 1, 2005 4:17 PM, Alex Turner <armtuk@gmail.com> wrote:
> >> 1250/sec with an average record size of 26 bytes.
> >> 800/sec with an average record size of 48 bytes.
> >> 250/sec with an average record size of 618 bytes.
>
> > Oh - this is with a separate transaction per command.
> > fsync is on.
>
> [ raised eyebrow... ] What kind of disk hardware is that exactly, and
> does it have write cache enabled? It's hard to believe those numbers
> if not.
>
> Write caching is fine if it's done in a battery-backed cache, which you
> can get in the higher-end hardware RAID controllers. Otherwise you're
> going to have problems whenever the power goes away unexpectedly.
>
>			regards, tom lane
Christopher Petrilli <petrilli@gmail.com> writes:
> The table has:
> * 21 columns (nothing too strange)
> * No OIDs
> * 5 indexes, including the primary key on a string

Could we see the *exact* SQL definitions of the table and indexes?
Also some sample data would be interesting. I'm wondering for example
about the incidence of duplicate index keys.

			regards, tom lane
On Apr 4, 2005 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Christopher Petrilli <petrilli@gmail.com> writes:
> > The table has:
> > * 21 columns (nothing too strange)
> > * No OIDs
> > * 5 indexes, including the primary key on a string
>
> Could we see the *exact* SQL definitions of the table and indexes?
> Also some sample data would be interesting. I'm wondering for example
> about the incidence of duplicate index keys.

Of course. This is a bit cleansed, since it's an internal project, but
only the column names are changed:

CREATE TABLE foos (
    foo_id        VARCHAR(32),
    s_ts          INTEGER NOT NULL,
    c_ts          INTEGER NOT NULL,
    bar_id        INTEGER NOT NULL,
    proto         INTEGER NOT NULL,
    src_ip        INT8 NOT NULL,
    dst_ip        INT8 NOT NULL,
    src_port      INTEGER,
    dst_port      INTEGER,
    nated         INTEGER NOT NULL,
    src_nat_ip    INT8,
    dst_nat_ip    INT8,
    src_nat_port  INTEGER,
    dst_nat_port  INTEGER,
    foo_class     INTEGER NOT NULL,
    foo_type      INTEGER NOT NULL,
    src_bar       INTEGER NOT NULL,
    dst_bar       INTEGER NOT NULL,
    user_name     VARCHAR(255),
    info          TEXT
) WITHOUT OIDS;

ALTER TABLE foos ADD CONSTRAINT foos_foo_id_pk UNIQUE (foo_id);
CREATE INDEX foos_c_ts_idx ON foos(c_ts);
CREATE INDEX foos_src_ip_idx ON foos(src_ip);
CREATE INDEX foos_dst_ip_idx ON foos(dst_ip);
CREATE INDEX foos_foo_class_idx ON foos(foo_class);
CREATE INDEX foos_foo_type_idx ON foos(foo_type);

CREATE TABLE foos001 (
) INHERITS (foos) WITHOUT OIDS;

ALTER TABLE foos001 ADD CONSTRAINT foos001_foo_id_pk UNIQUE (foo_id);
CREATE INDEX foos001_c_ts_idx ON foos001(c_ts);
CREATE INDEX foos001_src_ip_idx ON foos001(src_ip);
CREATE INDEX foos001_dst_ip_idx ON foos001(dst_ip);
CREATE INDEX foos001_foo_class_idx ON foos001(foo_class);
CREATE INDEX foos001_foo_type_idx ON foos001(foo_type);

That continues on, but you get the idea...

So, as you asked about data content, specifically regarding the indices,
here's what the "simulator" creates:

foo_id - a 32-character UID (generated by the UUID function in mxTools),
which looks like '00beef19420053c64f3f01aeb0b4a2a5' and varies in the
upper components more than the lower.

*_ts - UNIX epoch timestamps, sequential. There's a long story behind not
using the DATETIME format, but if that's the big issue, it can be dealt
with.

*_ip - Randomly selected 32-bit integers from a pre-generated list
containing about 500 different numbers ranging from 3232235500 to
3232236031. This is, unfortunately, not too atypical of the "real world".

*_class - Randomly selected 1-100 (again, not atypical, although a normal
distribution would be less random).

*_type - Randomly selected 1-10000 (not atypical, and more random than in
the real world).

Hopefully this helps?

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
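As a sketch of how to quantify the duplicate-key incidence Tom asks
about, the planner statistics can be read directly, assuming the table
has been ANALYZEd (n_distinct is reported as a negative fraction of the
row count when it scales with table size):

SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'foos001'
  AND attname IN ('foo_id', 'src_ip', 'dst_ip', 'foo_class', 'foo_type');

Given the data described above (roughly 500 distinct IPs, 100 classes,
and 10,000 types spread over 10M rows), every column except foo_id should
show very heavy duplication.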
Christopher Petrilli <petrilli@gmail.com> writes:
> On Apr 4, 2005 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Could we see the *exact* SQL definitions of the table and indexes?

> Of course, this is a bit cleansed, since it's an internal project, but
> only the column names are changed:

Thanks. No smoking gun in sight there. But out of curiosity, can you
do a test run with *no* indexes on the table, just to see if it behaves
any differently? Basically I was wondering if index overhead might be
part of the problem.

Also, the X-axis on your graphs seems to be total number of rows
inserted ... can you relate that to elapsed real time for us?

			regards, tom lane
On Apr 4, 2005 12:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Christopher Petrilli <petrilli@gmail.com> writes:
> > On Apr 4, 2005 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Could we see the *exact* SQL definitions of the table and indexes?
>
> > Of course, this is a bit cleansed, since it's an internal project, but
> > only the column names are changed:
>
> Thanks. No smoking gun in sight there. But out of curiosity, can you
> do a test run with *no* indexes on the table, just to see if it behaves
> any differently? Basically I was wondering if index overhead might be
> part of the problem.

Running now, but it'll take a while, since I have a 3/4-second pause
after each COPY to better reflect the "real world" ... the application
does 1 COPY per second, or whenever it hits 1000 entries. This seemed to
be a sane way to deal with it, and not burden the system with needless
index balancing, etc.

> Also, the X-axis on your graphs seems to be total number of rows
> inserted ... can you relate that to elapsed real time for us?

Sure. Like I said, there's a 3/4-second sleep between each COPY,
regardless of how long it took (which, well, isn't quite right, but
close enough for this test). I've created a PNG with the X axis
reflecting elapsed time:

http://www.amber.org/~petrilli/diagrams/pgsql_copyperf_timeline.png

In addition, I've put up the raw data I used:

http://www.amber.org/~petrilli/diagrams/results_timeline.txt

The columns are row count, elapsed time, and instance time. Hopefully
this might help some? This machine has nothing else running on it other
than the normal stripped-down background processes (like sshd).
--
| Christopher Petrilli
| petrilli@gmail.com
On Mon, 2005-04-04 at 09:48 -0400, Christopher Petrilli wrote:
> The point, in the rough middle, is where the program begins inserting
> into a new table (inherited). The X axis is the "total" number of rows
> inserted.

and you also mention the same data plotted with elapsed time:
http://www.amber.org/~petrilli/diagrams/pgsql_copyperf_timeline.png

Your graphs look identical to others I've seen, so I think we're touching
on something wider than your specific situation. The big difference is
that things seem to go back to high performance when you switch to a new
inherited table.

I'm very interested in the graphs of elapsed time for COPY 500 rows
against rows inserted. The simplistic inference from those graphs is that
if you only inserted 5 million rows into each table, rather than 10
million rows, then everything would be much quicker. I hope this doesn't
work, but could you try that to see if it works? I'd like to rule out a
function of "number of rows" as an issue, or focus in on it depending
upon the results.

Q: Please can you confirm that the discontinuity on the graph at around
5000 elapsed seconds matches EXACTLY with the switch from one table to
another? That is an important point.

Q: How many data files are there for these relations? Wouldn't it be two,
by any chance, when we have 10 million rows in them?

Q: What is the average row length? About 150-160 bytes?

Thanks,

Best Regards, Simon Riggs
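Simon's sizing questions can also be answered from the catalog; a sketch,
assuming the hypothetical foos001 table from earlier, the default 8K
block size, and a reasonably fresh ANALYZE (relpages and reltuples are
only estimates):

SELECT relname, relpages, reltuples,
       relpages * 8192.0 / NULLIF(reltuples, 0) AS approx_bytes_per_row
FROM pg_class
WHERE relname = 'foos001';

Since each relation is split into 1GB segment files on disk, a table
crosses into its second data file at 131,072 pages.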
On Apr 4, 2005 3:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Mon, 2005-04-04 at 09:48 -0400, Christopher Petrilli wrote:
> > The point, in the rough middle, is where the program begins inserting
> > into a new table (inherited). The X axis is the "total" number of
> > rows inserted.
>
> and you also mention the same data plotted with elapsed time:
> http://www.amber.org/~petrilli/diagrams/pgsql_copyperf_timeline.png
>
> Your graphs look identical to others I've seen, so I think we're
> touching on something wider than your specific situation. The big
> difference is that things seem to go back to high performance when you
> switch to a new inherited table.

This is correct.

> I'm very interested in the graphs of elapsed time for COPY 500 rows
> against rows inserted. The simplistic inference from those graphs is
> that if you only inserted 5 million rows into each table, rather than
> 10 million rows, then everything would be much quicker. I hope this
> doesn't work, but could you try that to see if it works? I'd like to
> rule out a function of "number of rows" as an issue, or focus in on it
> depending upon the results.
>
> Q: Please can you confirm that the discontinuity on the graph at around
> 5000 elapsed seconds matches EXACTLY with the switch from one table to
> another? That is an important point.

Well, the changeover happens at 51593.395205 seconds :-) Here are two
lines from the results with row count and time added:

10000000 51584.9818912 8.41331386566
10000500 51593.395205 0.416964054108

Note that 10M is when it swaps. I see no reason to interpret it
differently, so it seems to be totally based around switching tables
(and thereby indices).

> Q: How many data files are there for these relations? Wouldn't it be
> two, by any chance, when we have 10 million rows in them?

I allow PostgreSQL to manage all the data files itself, so here's the
default tablespace:

total 48
drwx------ 2 pgsql pgsql 4096 Jan 26 20:59 1
drwx------ 2 pgsql pgsql 4096 Dec 17 19:15 17229
drwx------ 2 pgsql pgsql 4096 Feb 16 17:55 26385357
drwx------ 2 pgsql pgsql 4096 Mar 24 23:56 26425059
drwx------ 2 pgsql pgsql 8192 Mar 28 11:31 26459063
drwx------ 2 pgsql pgsql 8192 Mar 31 23:54 26475755
drwx------ 2 pgsql pgsql 4096 Apr 4 15:07 26488263
[root@bigbird base]# du
16624   ./26425059
5028    ./26385357
5660    ./26459063
4636    ./17229
6796    ./26475755
4780    ./1
1862428 ./26488263
1905952 .

> Q: What is the average row length? About 150-160 bytes?

Raw data is around 150 bytes; after insertion, I'd need to do some other
calculations.

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
On Mon, 2005-04-04 at 15:56 -0400, Christopher Petrilli wrote:
> On Apr 4, 2005 3:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On Mon, 2005-04-04 at 09:48 -0400, Christopher Petrilli wrote:
> > > The point, in the rough middle, is where the program begins
> > > inserting into a new table (inherited). The X axis is the "total"
> > > number of rows inserted.
> >
> > and you also mention the same data plotted with elapsed time:
> > http://www.amber.org/~petrilli/diagrams/pgsql_copyperf_timeline.png
> >
> > Your graphs look identical to others I've seen, so I think we're
> > touching on something wider than your specific situation. The big
> > difference is that things seem to go back to high performance when
> > you switch to a new inherited table.
>
> This is correct.
>
> > I'm very interested in the graphs of elapsed time for COPY 500 rows
> > against rows inserted. The simplistic inference from those graphs is
> > that if you only inserted 5 million rows into each table, rather than
> > 10 million rows, then everything would be much quicker. I hope this
> > doesn't work, but could you try that to see if it works? I'd like to
> > rule out a function of "number of rows" as an issue, or focus in on
> > it depending upon the results.

Any chance of running a multiple load of 4 million rows per table,
leaving the test running for at least 3 tables worth (12+ M rows)?

> > Q: Please can you confirm that the discontinuity on the graph at
> > around 5000 elapsed seconds matches EXACTLY with the switch from one
> > table to another? That is an important point.
>
> Well, the changeover happens at 51593.395205 seconds :-) Here are two
> lines from the results with row count and time added:
>
> 10000000 51584.9818912 8.41331386566
> 10000500 51593.395205 0.416964054108
>
> Note that 10M is when it swaps. I see no reason to interpret it
> differently, so it seems to be totally based around switching tables
> (and thereby indices).

OK, but do you have some other external knowledge that it is definitely
happening at that time? Your argument above seems slightly circular to
me. This is really important because we need to know whether it ties in
with that event, or some other.

Have you run this for more than 2 files, say 3 or more?

You COMMIT after each 500 rows?

> > Q: How many data files are there for these relations? Wouldn't it be
> > two, by any chance, when we have 10 million rows in them?
>
> I allow PostgreSQL to manage all the data files itself, so here's the
> default tablespace:
>
> total 48
> drwx------ 2 pgsql pgsql 4096 Jan 26 20:59 1
> drwx------ 2 pgsql pgsql 4096 Dec 17 19:15 17229
> drwx------ 2 pgsql pgsql 4096 Feb 16 17:55 26385357
> drwx------ 2 pgsql pgsql 4096 Mar 24 23:56 26425059
> drwx------ 2 pgsql pgsql 8192 Mar 28 11:31 26459063
> drwx------ 2 pgsql pgsql 8192 Mar 31 23:54 26475755
> drwx------ 2 pgsql pgsql 4096 Apr 4 15:07 26488263
> [root@bigbird base]# du
> 16624   ./26425059
> 5028    ./26385357
> 5660    ./26459063
> 4636    ./17229
> 6796    ./26475755
> 4780    ./1
> 1862428 ./26488263
> 1905952 .

OK. Please...

cd $PGDATA/base/26488263
ls -l

I'm looking for the number of files associated with each inherited table
(heap).

> > Q: What is the average row length? About 150-160 bytes?
>
> Raw data is around 150 bytes; after insertion, I'd need to do some
> other calculations.

By my calculations, you should have just 2 data files per 10M rows for
the main table. The performance degradation seems to coincide with the
point where we move to inserting into the second of the two files.

I'm not looking for explanations yet, just examining coincidences and
trying to predict the behaviour based upon conjectures.

Best Regards, Simon Riggs
On Apr 4, 2005 4:11 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > > I'm very interested in the graphs of elapsed time for COPY 500 rows
> > > against rows inserted. The simplistic inference from those graphs
> > > is that if you only inserted 5 million rows into each table, rather
> > > than 10 million rows, then everything would be much quicker. I hope
> > > this doesn't work, but could you try that to see if it works? I'd
> > > like to rule out a function of "number of rows" as an issue, or
> > > focus in on it depending upon the results.
>
> Any chance of running a multiple load of 4 million rows per table,
> leaving the test running for at least 3 tables worth (12+ M rows)?

As soon as I get done running a test without indexes :-)

> > > Q: Please can you confirm that the discontinuity on the graph at
> > > around 5000 elapsed seconds matches EXACTLY with the switch from
> > > one table to another? That is an important point.
> >
> > Well, the changeover happens at 51593.395205 seconds :-) Here are two
> > lines from the results with row count and time added:
> >
> > 10000000 51584.9818912 8.41331386566
> > 10000500 51593.395205 0.416964054108
> >
> > Note that 10M is when it swaps. I see no reason to interpret it
> > differently, so it seems to be totally based around switching tables
> > (and thereby indices).
>
> OK, but do you have some other external knowledge that it is definitely
> happening at that time? Your argument above seems slightly circular to
> me.

My program *SPECIFICALLY* counts to 10M, then switches the COPY
statement.

> This is really important because we need to know whether it ties in
> with that event, or some other.

Unless basic integer math is failing, it's definitely happening at 10M
rows.

> Have you run this for more than 2 files, say 3 or more?

You mean 3 or more tables? I'm not sure which type of files you are
referring to here.

> You COMMIT after each 500 rows?

This is done using COPY syntax, not INSERT syntax. So I suppose "yes" I
do. The file that is being used for COPY is kept on a ramdisk.

> OK. Please...
> cd $PGDATA/base/26488263
> ls -l

[root@bigbird base]# cd 26488263/
[root@bigbird 26488263]# ls -l
total 2003740
-rw------- 1 pgsql pgsql 49152 Apr 4 12:26 1247
-rw------- 1 pgsql pgsql 245760 Apr 4 12:27 1249
-rw------- 1 pgsql pgsql 573440 Apr 4 12:24 1255
-rw------- 1 pgsql pgsql 57344 Apr 4 14:44 1259
-rw------- 1 pgsql pgsql 0 Apr 4 12:24 16384
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 16386
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 16388
-rw------- 1 pgsql pgsql 24576 Apr 4 12:29 16390
-rw------- 1 pgsql pgsql 106496 Apr 4 12:24 16392
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16394
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 16396
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16398
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 16400
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 16402
-rw------- 1 pgsql pgsql 0 Apr 4 12:24 16404
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 16406
-rw------- 1 pgsql pgsql 212992 Apr 4 14:44 16408
-rw------- 1 pgsql pgsql 49152 Apr 4 12:24 16410
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 16412
-rw------- 1 pgsql pgsql 0 Apr 4 12:24 16414
-rw------- 1 pgsql pgsql 114688 Apr 4 12:24 16416
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16418
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 16672
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16674
-rw------- 1 pgsql pgsql 237568 Apr 4 12:26 16676
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16678
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16679
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16680
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16681
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16682
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16683
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 16684
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 16685
-rw------- 1 pgsql pgsql 245760 Apr 4 12:26 16686
-rw------- 1 pgsql pgsql 73728 Apr 4 12:26 16687
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16688
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16689
-rw------- 1 pgsql pgsql 16384 Apr 4 12:26 16690
-rw------- 1 pgsql pgsql 65536 Apr 4 12:26 16691
-rw------- 1 pgsql pgsql 16384 Apr 4 12:26 16692
-rw------- 1 pgsql pgsql 16384 Apr 4 12:26 16693
-rw------- 1 pgsql pgsql 16384 Apr 4 12:26 16694
-rw------- 1 pgsql pgsql 16384 Apr 4 12:26 16695
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16696
-rw------- 1 pgsql pgsql 32768 Apr 4 12:24 16697
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16698
-rw------- 1 pgsql pgsql 163840 Apr 4 12:26 16701
-rw------- 1 pgsql pgsql 196608 Apr 4 12:26 16702
-rw------- 1 pgsql pgsql 73728 Apr 4 12:24 16703
-rw------- 1 pgsql pgsql 16384 Apr 4 12:26 16706
-rw------- 1 pgsql pgsql 16384 Apr 4 12:26 16707
-rw------- 1 pgsql pgsql 16384 Apr 4 12:26 16708
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16709
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16710
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 16711
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16712
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16713
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16714
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16715
-rw------- 1 pgsql pgsql 32768 Apr 4 12:24 16716
-rw------- 1 pgsql pgsql 106496 Apr 4 12:24 16717
-rw------- 1 pgsql pgsql 106496 Apr 4 12:24 16718
-rw------- 1 pgsql pgsql 1212416 Apr 4 12:24 16719
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16720
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16721
-rw------- 1 pgsql pgsql 40960 Apr 4 14:44 16724
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16727
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16728
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16729
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16730
-rw------- 1 pgsql pgsql 16384 Apr 4 12:26 16731
-rw------- 1 pgsql pgsql 49152 Apr 4 12:26 16732
-rw------- 1 pgsql pgsql 0 Apr 4 12:24 16735
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 16737
-rw------- 1 pgsql pgsql 0 Apr 4 12:24 16738
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 16740
-rw------- 1 pgsql pgsql 0 Apr 4 12:24 16744
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 16746
-rw------- 1 pgsql pgsql 0 Apr 4 12:24 16750
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 16752
-rw------- 1 pgsql pgsql 122880 Apr 4 12:24 16753
-rw------- 1 pgsql pgsql 16384 Apr 4 12:24 16755
-rw------- 1 pgsql pgsql 0 Apr 4 12:24 16759
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 16761
-rw------- 1 pgsql pgsql 40960 Apr 4 12:24 17158
-rw------- 1 pgsql pgsql 0 Apr 4 12:24 17160
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 17162
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 17163
-rw------- 1 pgsql pgsql 0 Apr 4 12:24 17165
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 17167
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 17168
-rw------- 1 pgsql pgsql 0 Apr 4 12:24 17170
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 17172
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 17173
-rw------- 1 pgsql pgsql 0 Apr 4 12:24 17175
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 17177
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 17178
-rw------- 1 pgsql pgsql 0 Apr 4 12:24 17180
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 17182
-rw------- 1 pgsql pgsql 0 Apr 4 12:24 17183
-rw------- 1 pgsql pgsql 0 Apr 4 12:24 17185
-rw------- 1 pgsql pgsql 8192 Apr 4 12:24 17187
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488264
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488266
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488268
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488269
-rw------- 1 pgsql pgsql 1073741824 Apr 4 15:07 26488271
-rw------- 1 pgsql pgsql 407527424 Apr 4 16:17 26488271.1
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488273
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488275
-rw------- 1 pgsql pgsql 565067776 Apr 4 16:17 26488276
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488278
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488280
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488282
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488283
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488285
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488287
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488289
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488290
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488292
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488294
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488296
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488297
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488299
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488301
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488303
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488304
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488306
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488308
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488310
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488311
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488313
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488315
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488317
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488318
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488320
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488322
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488324
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488325
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488327
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488329
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488331
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488332
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488334
-rw------- 1 pgsql pgsql 0 Apr 4 12:26 26488336
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488338
-rw------- 1 pgsql pgsql 8192 Apr 4 12:26 26488339
-rw------- 1 pgsql pgsql 60045 Apr 4 12:24 pg_internal.init
-rw------- 1 pgsql pgsql 4 Apr 4 12:24 PG_VERSION

Hopefully this helps.

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
> This is done using COPY syntax, not INSERT syntax. So I suppose "yes"
> I do. The file that is being used for COPY is kept on a ramdisk.

COPY or psql \copy ?

If you wanna be sure you commit after each COPY, launch a psql in a shell
and check if the inserted rows are visible (watching SELECT count(*) grow
will do).
On Apr 4, 2005 4:53 PM, PFC <lists@boutiquenumerique.com> wrote:
> > This is done using COPY syntax, not INSERT syntax. So I suppose "yes"
> > I do. The file that is being used for COPY is kept on a ramdisk.
>
> COPY or psql \copy ?
> If you wanna be sure you commit after each COPY, launch a psql in a
> shell and check if the inserted rows are visible (watching SELECT
> count(*) grow will do).

The script is Python, using pexpect (a la expect), and does this,
exactly:

psql = pexpect.spawn('/usr/local/pgsql/bin/psql -d bench2')
[ ... ]
start = time.time()
# Wait for the prompt left over from the previous command.
psql.expect_exact('bench2=#')
psql.sendline("COPY events%03i FROM '/mnt/tmpfs/loadfile';" % (tablenum + 1))
# Record the elapsed time for this iteration.
results.write('%s\n' % (time.time() - start))
results.flush()

There's other code, but it's all related to building the loadfile. Note
that I'm specifically including the time it takes to get the prompt back
in the timing (but it does slip 1 loop, which isn't relevant).

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
On Mon, 2005-04-04 at 16:18 -0400, Christopher Petrilli wrote:
> On Apr 4, 2005 4:11 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > > > I'm very interested in the graphs of elapsed time for COPY 500
> > > > rows against rows inserted. The simplistic inference from those
> > > > graphs is that if you only inserted 5 million rows into each
> > > > table, rather than 10 million rows, then everything would be much
> > > > quicker. I hope this doesn't work, but could you try that to see
> > > > if it works? I'd like to rule out a function of "number of rows"
> > > > as an issue, or focus in on it depending upon the results.
> >
> > Any chance of running a multiple load of 4 million rows per table,
> > leaving the test running for at least 3 tables worth (12+ M rows)?
>
> As soon as I get done running a test without indexes :-)
>
> > > > Q: Please can you confirm that the discontinuity on the graph at
> > > > around 5000 elapsed seconds matches EXACTLY with the switch from
> > > > one table to another? That is an important point.
> > >
> > > Well, the changeover happens at 51593.395205 seconds :-) Here are
> > > two lines from the results with row count and time added:
> > >
> > > 10000000 51584.9818912 8.41331386566
> > > 10000500 51593.395205 0.416964054108
>
> My program *SPECIFICALLY* counts to 10M, then switches the COPY
> statement.
>
> > OK. Please...
> > cd $PGDATA/base/26488263
> > ls -l
>
> [root@bigbird base]# cd 26488263/
> [root@bigbird 26488263]# ls -l
> total 2003740
> -rw------- 1 pgsql pgsql 1073741824 Apr 4 15:07 26488271
> -rw------- 1 pgsql pgsql 407527424 Apr 4 16:17 26488271.1

Can you do:

select relname from pg_class where relfilenode = 26488271;

and confirm that the name is the table you've been loading...

Couldn't see all your indexes... are they still there?

Thanks,

Best Regards, Simon Riggs
On Apr 4, 2005 4:58 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Can you do:
> select relname from pg_class where relfilenode = 26488271;
> and confirm that the name is the table you've been loading...

It is.

> Couldn't see all your indexes... are they still there?

Nope, I'm running a second run without the auxiliary indices. I only have
the primary key index. So far, a quick scan with the eye says that it's
behaving "better", but beginning to have issues again. I'll post results
as soon as they are done.

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
On Mon, 2005-04-04 at 17:03 -0400, Christopher Petrilli wrote:
> On Apr 4, 2005 4:58 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > Can you do:
> > select relname from pg_class where relfilenode = 26488271;
> > and confirm that the name is the table you've been loading...
>
> It is.
>
> > Couldn't see all your indexes... are they still there?
>
> Nope, I'm running a second run without the auxiliary indices. I only
> have the primary key index. So far, a quick scan with the eye says
> that it's behaving "better", but beginning to have issues again. I'll
> post results as soon as they are done.

Hmmm.... Before I start to tunnel-vision on a particular coincidence...

How much memory have you got on the system?

How much of that have you allocated to various tasks?

What else is happening on your system?

Tell us more about disk set-up and other hardware related things.
Disk cache... disk speed... seek times... etc.

Best Regards, Simon Riggs
If I'm getting the point of this thread correctly, having a huge amount
of data in one table degrades INSERT/COPY performance even with just a
PKEY index. If that's about the size of it, read on. If not, ignore me
because I missed something.

On Apr 4, 2005 10:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Before I start to tunnel-vision on a particular coincidence...

Don't worry too much about tunnel vision. I see the same thing every day
with multi-million row tables. The bigger the table gets (with only a
pkey index), the slower the inserts go. If I start over (truncate,
drop/create table), or if I point the initial load at a new table,
everything gets speedy. I've always figured it was a function of table
size and learned to live with it...

> How much memory have you got on the system?

On mine, 16G

> How much of that have you allocated to various tasks?

shared buffers: 15000

> What else is happening on your system?

Nothing on mine.

> Tell us more about disk set-up and other hardware related things.

6-disk RAID 10 on a Compaq SmartArray 6404 with 256M BB cache, WAL on a
2-disk mirror on the built-in SmartArray5 controller.

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
On Apr 4, 2005 6:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Before I start to tunnel-vision on a particular coincidence...
>
> How much memory have you got on the system?

Now, 2GB, but most of it is free in this situation. Earlier, I posted
some of the settings related to work mem.

> How much of that have you allocated to various tasks?

Do you mean inside PostgreSQL?

> What else is happening on your system?

sshd, that's it :-)

> Tell us more about disk set-up and other hardware related things.
> Disk cache... disk speed... seek times... etc.

Sure, here's the system configuration:

* AMD64/3000
* 2GB RAM (was 1GB, has made no difference)
* 1 x 120GB SATA drive (w/WAL), 7200RPM Seagate
* 1 x 160GB SATA drive (main), 7200RPM Seagate

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
Christopher Petrilli <petrilli@gmail.com> writes:
> On Apr 4, 2005 12:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> do a test run with *no* indexes on the table, just to see if it behaves
>> any differently? Basically I was wondering if index overhead might be
>> part of the problem.

> http://www.amber.org/~petrilli/diagrams/pgsql_copy500_pkonly.png

> I apologize, I forgot to kill the PK, but as you can see, the curve
> flattened out a lot. It still begins to increase in what seems like
> the same place. You can find the results themselves at:

Yeah, this confirms the thought that the indexes are the source of the
issue. (Which is what I'd expect, because a bare INSERT ought to be an
approximately constant-time operation. But it's good to verify.)

Now some amount of slowdown is to be expected as the indexes get larger,
since it ought to take roughly O(log N) time to insert a new entry in an
index of size N. The weird thing about your curves is the very sudden
jump in the insert times.

What I think might be happening is that the "working set" of pages
touched during index inserts is gradually growing, and at some point it
exceeds shared_buffers, and at that point performance goes in the toilet
because we are suddenly doing lots of reads to pull in index pages that
fell out of the shared buffer area.

It would be interesting to watch the output of iostat or vmstat during
this test run. If I'm correct about this, the I/O load should be
basically all writes during the initial part of the test, and then
suddenly develop a significant and increasing fraction of reads at the
point where the slowdown occurs.

The indicated fix of course is to increase shared_buffers.

			regards, tom lane
On Apr 4, 2005 10:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Christopher Petrilli <petrilli@gmail.com> writes:
> > On Apr 4, 2005 12:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> do a test run with *no* indexes on the table, just to see if it
> >> behaves any differently? Basically I was wondering if index overhead
> >> might be part of the problem.
>
> > http://www.amber.org/~petrilli/diagrams/pgsql_copy500_pkonly.png
>
> > I apologize, I forgot to kill the PK, but as you can see, the curve
> > flattened out a lot. It still begins to increase in what seems like
> > the same place.
>
> Yeah, this confirms the thought that the indexes are the source of
> the issue. (Which is what I'd expect, because a bare INSERT ought to be
> an approximately constant-time operation. But it's good to verify.)

This seemed to be my original idea, but I wanted to eliminate everything
else as much as possible. I was also concerned that I might be hitting a
bad case in the trees; I had to change some UID generation code to hash
better, so...

> Now some amount of slowdown is to be expected as the indexes get larger,
> since it ought to take roughly O(log N) time to insert a new entry in an
> index of size N. The weird thing about your curves is the very sudden
> jump in the insert times.

Right, I expected O(log N) behavior myself, and it seems to behave that
way if you look at the first section (although there are some interesting
patterns visible if you exclude data outside the 90th percentile, which
seem to coincide with some write activity).

> It would be interesting to watch the output of iostat or vmstat during
> this test run. If I'm correct about this, the I/O load should be
> basically all writes during the initial part of the test, and then
> suddenly develop a significant and increasing fraction of reads at the
> point where the slowdown occurs.

Well, I can track this on a run, if it would be useful, but I think
you're right, as it matches what I saw from looking at iostat at those
points.

> The indicated fix of course is to increase shared_buffers.

Any idea where it should be set?

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
Christopher Petrilli <petrilli@gmail.com> writes:
> On Apr 4, 2005 10:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The indicated fix of course is to increase shared_buffers.

> Any idea where it should be set?

Not really. An upper bound would be the total size of the finished
indexes for one 10M-row table, but one would suppose that that's
overkill. The leaf pages shouldn't have to stay in RAM to have reasonable
behavior --- the killer case is when upper-level tree pages drop out. Or
that's what I'd expect anyway.

You could probably drop the inter-insert sleep for testing purposes, if
you want to experiment with several shared_buffers values quickly.

			regards, tom lane
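For the upper bound Tom describes, the catalog can total the current
index sizes; a sketch, again assuming the hypothetical foos001 table and
the default 8K block size:

SELECT sum(c.relpages) AS index_pages,
       sum(c.relpages) * 8.0 / 1024 AS index_mb
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE i.indrelid = 'foos001'::regclass;

index_pages is directly comparable to shared_buffers, which is measured
in the same 8K pages.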
Tom Lane <tgl@sss.pgh.pa.us> writes:
> What I think might be happening is that the "working set" of pages
> touched during index inserts is gradually growing, and at some point it
> exceeds shared_buffers, and at that point performance goes in the toilet
> because we are suddenly doing lots of reads to pull in index pages that
> fell out of the shared buffer area.

All this is happening within a single transaction too, right? So there
hasn't been an fsync the entire time. It's entirely up to the kernel when
to decide to start writing data. It's possible it's just buffering all
the writes in memory until the amount of free buffers drops below some
threshold, then it suddenly starts writing out buffers.

> It would be interesting to watch the output of iostat or vmstat during
> this test run. If I'm correct about this, the I/O load should be
> basically all writes during the initial part of the test, and then
> suddenly develop a significant and increasing fraction of reads at the
> point where the slowdown occurs.

I think he's right: if you see a reasonable write volume before the
performance drop, followed by a sudden increase in read volume (and a
decrease of write volume proportionate to the drop in performance), then
it's just shared buffers becoming a bottleneck.

If there's hardly any write volume before, then a sudden increase in
write volume despite a drop in performance, then I might be right. In
which case you might want to look into tools to tune your kernel vm
system.

--
greg
On 04 Apr 2005 23:45:47 -0400, Greg Stark <gsstark@mit.edu> wrote:
>
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > What I think might be happening is that the "working set" of pages
> > touched during index inserts is gradually growing, and at some point
> > it exceeds shared_buffers, and at that point performance goes in the
> > toilet because we are suddenly doing lots of reads to pull in index
> > pages that fell out of the shared buffer area.
>
> All this is happening within a single transaction too, right? So there
> hasn't been an fsync the entire time. It's entirely up to the kernel
> when to decide to start writing data.

This was my concern, and in fact moving from ext3 -> XFS has helped
substantially in this regard. This is all happening inside COPY
statements, so there's effectively a commit every 500 rows. I could
enlarge this, but I didn't notice a huge increase in performance when
doing tests on smaller bits.

Also, you are correct, I am running without fsync, although I could
change that if you thought it would "smooth" the performance. The issue
is less absolute performance than something more deterministic. Going
from 0.05 seconds for a 500-row COPY to 26 seconds really messes with the
system.

One thing that was mentioned early on, and I hope people remember, is
that I am running autovacuum in the background, but the timing of it
seems to have little to do with the system's problems; at least the debug
output doesn't coincide with performance loss.

> It's possible it's just buffering all the writes in memory until the
> amount of free buffers drops below some threshold, then it suddenly
> starts writing out buffers.

That was happening with ext3, actually, or at least to the best of my
knowledge.

> > It would be interesting to watch the output of iostat or vmstat
> > during this test run. If I'm correct about this, the I/O load should
> > be basically all writes during the initial part of the test, and then
> > suddenly develop a significant and increasing fraction of reads at
> > the point where the slowdown occurs.
>
> I think he's right: if you see a reasonable write volume before the
> performance drop, followed by a sudden increase in read volume (and a
> decrease of write volume proportionate to the drop in performance),
> then it's just shared buffers becoming a bottleneck.

I've set shared_buffers to 16000 (from the original 1000) and am running
now, without the pauses. We'll see what it looks like, but so far it
seems to be running faster. How much and how it degrades will be an
interesting view.

> If there's hardly any write volume before, then a sudden increase in
> write volume despite a drop in performance, then I might be right. In
> which case you might want to look into tools to tune your kernel vm
> system.

Here's a quick snapshot of iostat:

Linux 2.6.9-1.667 (bigbird.amber.org)   04/04/2005

avg-cpu:  %user   %nice    %sys %iowait   %idle
           1.05    0.01    0.63   13.15   85.17

Device:    tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
hda       0.00         0.00         0.00       3616          0
sda      23.15        68.09       748.89  246884021 2715312654
sdb      19.08        37.65       773.03  136515457 2802814036

The first 3 columns have been identical (or nearly so) the whole time,
which tells me the system is pegged in its performance on IO. This is not
surprising.

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
Greg Stark <gsstark@mit.edu> writes:
> All this is happening within a single transaction too, right? So there
> hasn't been an fsync the entire time. It's entirely up to the kernel
> when to decide to start writing data.

No ... there's a commit every 500 records. However, I think Chris said
he was running with fsync off; so you're right that the kernel is at
liberty to write stuff to disk when it feels like. It could be that
those outlier points are transactions that occurred in the middle of
periodic syncer-driven mass writes. Maybe fsync off is counterproductive
for this situation?

			regards, tom lane
On Apr 4, 2005 11:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > All this is happening within a single transaction too, right? So
> > there hasn't been an fsync the entire time. It's entirely up to the
> > kernel when to decide to start writing data.
>
> No ... there's a commit every 500 records. However, I think Chris said
> he was running with fsync off; so you're right that the kernel is at
> liberty to write stuff to disk when it feels like. It could be that
> those outlier points are transactions that occurred in the middle of
> periodic syncer-driven mass writes. Maybe fsync off is
> counterproductive for this situation?

Looking at preliminary results from running with shared_buffers at
16000, it seems this may be correct. Performance was flatter for a BIT
longer, but slammed right into the wall and started hitting the 3-30
second range per COPY. I've restarted the run, with fsync turned on
(fdatasync), and we'll see.

My fear is that it's some bizarre situation interacting with both
issues, and one that might not be solvable. Does anyone else have much
experience with this sort of sustained COPY?

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
On Tue, Apr 05, 2005 at 12:16:27AM -0400, Christopher Petrilli wrote:
> My fear is that it's some bizarre situation interacting with both
> issues, and one that might not be solvable. Does anyone else have
> much experience with this sort of sustained COPY?

You might ask the guy who just posted to -admin about a database that's
doing 340M inserts a day in 300M transactions...
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
On Apr 5, 2005 12:16 AM, Christopher Petrilli <petrilli@gmail.com> wrote:
> Looking at preliminary results from running with shared_buffers at
> 16000, it seems this may be correct. Performance was flatter for a BIT
> longer, but slammed right into the wall and started hitting the 3-30
> second range per COPY. I've restarted the run, with fsync turned on
> (fdatasync), and we'll see.
>
> My fear is that it's some bizarre situation interacting with both
> issues, and one that might not be solvable. Does anyone else have much
> experience with this sort of sustained COPY?

Well, here are the results:

http://www.amber.org/~petrilli/diagrams/pgsql_copy500_comparison.png

The red is the run with shared_buffers turned up, but fsync off. The blue
is the run with shared_buffers turned up, but fsync on. Note that it hits
the wall sooner. Unfortunately, my brain is fried, and I'm not sure what
that means!

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
On Mon, 2005-04-04 at 22:36 -0400, Tom Lane wrote:
> Christopher Petrilli <petrilli@gmail.com> writes:
> > On Apr 4, 2005 12:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> do a test run with *no* indexes on the table, just to see if it behaves
> >> any differently? Basically I was wondering if index overhead might be
> >> part of the problem.
>
> > http://www.amber.org/~petrilli/diagrams/pgsql_copy500_pkonly.png
>
> > I apologize, I forgot to kill the PK, but as you can see, the curve
> > flattened out a lot. It still begins to increase in what seems like
> > the same place. You can find the results themselves at:
>
> Yeah, this confirms the thought that the indexes are the source of
> the issue. (Which is what I'd expect, because a bare INSERT ought to be
> an approximately constant-time operation. But it's good to verify.)

Yup, indexes are the best explanation so far - block extension needs some
work, but I doubted that it was the source of this effect.

> Now some amount of slowdown is to be expected as the indexes get larger,
> since it ought to take roughly O(log N) time to insert a new entry in an
> index of size N. The weird thing about your curves is the very sudden
> jump in the insert times.

Well, ISTM that the curve is far from unique. Mark's OSDL tests show it too.
What was weird, for me, was that it "resets" when you move to a new table.
The index theory does accurately explain that.

Perhaps the jump is not so sudden? Do I see a first small step up at about
4.5M rows, then another much bigger one at 7.5M (which looks like the only
one at first glance)?

> What I think might be happening is that the "working set" of pages
> touched during index inserts is gradually growing, and at some point it
> exceeds shared_buffers, and at that point performance goes in the toilet
> because we are suddenly doing lots of reads to pull in index pages that
> fell out of the shared buffer area.

So this does seem to be the best explanation, and it seems a good one.

It's also an excellent advert for table and index partitioning, and some
damning evidence against global indexes on partitioned tables (though they
may still be better than the alternative...)

> The indicated fix of course is to increase shared_buffers.

Splitting your tables at 4M, not 10M, would work even better.

Anyway, where most of this started was with Christopher's comments:

On Fri, 2005-04-01 at 14:38 -0500, Christopher Petrilli wrote:
> This was an application originally written for MySQL/MYISAM, and it's
> looking like PostgreSQL can't hold up for it, simply because it's "too
> much database" if that makes sense. The same box, running the MySQL
> implementation (which uses no transactions) runs around 800-1000
> rows/second sustained.

B-trees aren't unique to PostgreSQL; the explanation developed here would
work equally well for any database system that used tree-based indexes. Do
we still think that MySQL can do this when PostgreSQL cannot? How?

Do we have performance test results showing the same application load
without the degradation? We don't need to look at the source code to measure
MySQL performance...

Best Regards, Simon Riggs
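A minimal sketch of the inheritance-based partitioning under discussion,
with hypothetical table and column names (8.0 has no declarative
partitioning, so each child table is created and indexed by hand):

-- Parent table; children inherit its columns.
CREATE TABLE events (
    id         bigint,
    logged_at  timestamptz,
    payload    text
);

-- One child per partition. Each child carries its own, smaller indexes,
-- which is why insert times "reset" when the load moves to a new partition.
CREATE TABLE events_part1 () INHERITS (events);
CREATE INDEX events_part1_logged_at ON events_part1 (logged_at);

CREATE TABLE events_part2 () INHERITS (events);
CREATE INDEX events_part2_logged_at ON events_part2 (logged_at);

-- Queries against the parent scan all children unless ONLY is specified.
SELECT count(*) FROM events WHERE logged_at > now() - interval '1 day';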
On Apr 5, 2005 3:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > Now some amount of slowdown is to be expected as the indexes get larger,
> > since it ought to take roughly O(log N) time to insert a new entry in an
> > index of size N. The weird thing about your curves is the very sudden
> > jump in the insert times.
>
> Well, ISTM that the curve is far from unique. Mark's OSDL tests show
> it too. What was weird, for me, was that it "resets" when you move to
> a new table. The index theory does accurately explain that.
>
> Perhaps the jump is not so sudden? Do I see a first small step up at
> about 4.5M rows, then another much bigger one at 7.5M (which looks like
> the only one at first glance)?
>
> > What I think might be happening is that the "working set" of pages
> > touched during index inserts is gradually growing, and at some point it
> > exceeds shared_buffers, and at that point performance goes in the toilet
> > because we are suddenly doing lots of reads to pull in index pages that
> > fell out of the shared buffer area.
>
> So this does seem to be the best explanation, and it seems a good one.
>
> It's also an excellent advert for table and index partitioning, and some
> damning evidence against global indexes on partitioned tables (though
> they may still be better than the alternative...)
>
> > The indicated fix of course is to increase shared_buffers.
>
> Splitting your tables at 4M, not 10M, would work even better.

Unfortunately, given we are talking about billions of rows potentially, I'm
concerned about that many tables when it comes to query time. I assume this
will kick in the genetic optimizer?

> Anyway, where most of this started was with Christopher's comments:
>
> On Fri, 2005-04-01 at 14:38 -0500, Christopher Petrilli wrote:
> > This was an application originally written for MySQL/MYISAM, and it's
> > looking like PostgreSQL can't hold up for it, simply because it's "too
> > much database" if that makes sense. The same box, running the MySQL
> > implementation (which uses no transactions) runs around 800-1000
> > rows/second sustained.
>
> B-trees aren't unique to PostgreSQL; the explanation developed here
> would work equally well for any database system that used tree-based
> indexes. Do we still think that MySQL can do this when PostgreSQL
> cannot? How?

There are customers in production using MySQL with 10M rows/table, and I
have no evidence of this behavior there. I do not have a test jig for MySQL,
but I can create one, which is what I will do. Note that they are using
MyISAM tables, so there is no ACID behavior. Also, I have seen troubling
corruption issues that I've never been able to concretely identify. Above
all, I've been impressed that PostgreSQL, even when it hits this wall, never
corrupts anything.

> Do we have performance test results showing the same application load
> without the degradation? We don't need to look at the source code to
> measure MySQL performance...

I will see what I can do in the next few days to create a similar little
test for MySQL.

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
On Apr 5, 2005 3:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> B-trees aren't unique to PostgreSQL; the explanation developed here
> would work equally well for any database system that used tree-based
> indexes. Do we still think that MySQL can do this when PostgreSQL
> cannot? How?
>
> Do we have performance test results showing the same application load
> without the degradation? We don't need to look at the source code to
> measure MySQL performance...

http://www.amber.org/~petrilli/diagrams/comparison_mysql_pgsql.png

That chart shows MySQL (using INSERT against MyISAM tables) and PostgreSQL
(using COPY) running with the exact same code otherwise. Note that MySQL
does hit a bit of a wall, but nothing as drastic as PostgreSQL's, and it
actually stays "more flat". The red and blue dashed lines are the 95th
percentile points.

My suspicion is that what we're seeing is WAL issues, not particularly index
issues. The indexes just fill up the WAL faster because there's more data.
This is a WAG, basically, but it would seem to explain the difference. In
both cases the indexes were identical: five on each.

One interesting thing: PostgreSQL starts out a good bit faster, but loses in
the end.

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
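If the WAL theory were right, the usual 8.0-era knobs would be the
checkpoint and WAL buffer settings in postgresql.conf. The numbers below are
illustrative guesses only, not values anyone in the thread has tested:

checkpoint_segments = 30      # default is 3; each WAL segment is 16MB
checkpoint_timeout = 300      # seconds between forced checkpoints
wal_buffers = 64              # in 8kB pages; default is 8

Raising checkpoint_segments spreads checkpoints out, which is one way to
tell WAL stalls apart from the shared_buffers working-set effect.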
On Tue, 2005-04-05 at 18:55 -0400, Christopher Petrilli wrote:
> On Apr 5, 2005 3:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > B-trees aren't unique to PostgreSQL; the explanation developed here
> > would work equally well for any database system that used tree-based
> > indexes. Do we still think that MySQL can do this when PostgreSQL
> > cannot? How?
> >
> > Do we have performance test results showing the same application load
> > without the degradation? We don't need to look at the source code to
> > measure MySQL performance...
>
> http://www.amber.org/~petrilli/diagrams/comparison_mysql_pgsql.png
>
> That chart shows MySQL (using INSERT against MyISAM tables) and
> PostgreSQL (using COPY) running with the exact same code otherwise.
> Note that MySQL does hit a bit of a wall, but nothing as drastic as
> PostgreSQL's, and it actually stays "more flat". The red and blue
> dashed lines are the 95th percentile points.

Interesting comparison. Any chance of separating the graphs as well? I'm
interested in the detail of both. Could you estimate the apparent
periodicity in the PostgreSQL graph?

> My suspicion is that what we're seeing is WAL issues, not particularly
> index issues. The indexes just fill up the WAL faster because there's
> more data. This is a WAG, basically, but it would seem to explain the
> difference. In both cases the indexes were identical: five on each.

Let's test the shared_buffers theory. Would you mind loading only 5M rows
per table, but the same amount of data overall? That should keep us within
the comparable zone throughout.

Best Regards, Simon Riggs
On Tue, 2005-04-05 at 16:05 -0400, Christopher Petrilli wrote:
> On Apr 5, 2005 3:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > > The indicated fix of course is to increase shared_buffers.
> >
> > Splitting your tables at 4M, not 10M, would work even better.
>
> Unfortunately, given we are talking about billions of rows
> potentially, I'm concerned about that many tables when it comes to
> query time. I assume this will kick in the genetic optimizer?

No, it won't start using the genetic optimizer. You could just buy more RAM
and keep table size the same.

Best Regards, Simon Riggs
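For reference on why partitioning alone doesn't trigger GEQO: the genetic
optimizer is governed by geqo_threshold, which counts FROM items (join
partners) in a query. Scanning many inherited children of one parent table
is an append, not a join, so it doesn't count toward the threshold. The
postgresql.conf defaults look like this:

geqo = true                   # on by default
geqo_threshold = 12           # use GEQO only for queries with this many
                              # or more FROM items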