Thread: Grave performance issues...
I am porting a system that I have been developing on an MS platform (including SQL Server) to a Linux system with PostgreSQL. Apart from the usual difficulties (and not being very knowledgeable in the Unix world), I have now gotten most things to work. However, I am experiencing some pretty serious performance issues.

The system collects data from users through a web interface, and also has a background process that does rather heavy work. I will refer to just one of the SQL statements here, to keep things focused. The idea is that the background process is, among other things, constructing/updating a table called Distribution from a table called Data, so that Distribution represents the statistical distribution of the information found in Data (using user groups, too).

The UPDATE statement actually performing this looks like this:

UPDATE Distribution
SET Value =
(
    SELECT COUNT(*)
    FROM Data INNER JOIN GroupMember ON Data.UserID = GroupMember.UserID
    WHERE ItemID = Distribution.ItemID
      AND Grade = Distribution.Grade
      AND WeightGroupID = Distribution.WeightGroupID
);

Distribution takes this form:

CREATE TABLE Distribution (
    Grade int NOT NULL,
    ItemID numeric(18, 0) NOT NULL,
    WeightGroupID numeric(18, 0) NOT NULL,
    Value float NOT NULL,
    PRIMARY KEY (Grade, ItemID, WeightGroupID)
);

I am unsure whether I need to post the DDL for the other tables; feel free to ask.

Anyway, there are two seemingly distinct problems with this:

*1
As you can see, this UPDATE touches the entire table on each execution, which in PostgreSQL seems to mean that performance quickly deteriorates unless you are constantly vacuuming. To answer an obvious question: I *do* feel this is necessary - if I were to update only those rows that would acquire a new value, it would most probably turn out to be the majority of them anyway, as the system is expected to receive a steady flow of input data once launched.

This operation also needs to be performed very often - ideally about every 10 seconds. I left the background process running for a few days without vacuuming over the holiday, and while testing today I found that the above query takes approximately 2 minutes to complete - for a Distribution of 600 rows! Neither of the other two tables referenced contained more than 30 rows.

*2
Even when the involved tables are freshly vacuumed (and VACUUM ANALYZEd), the above query on the data volume from (*1) takes about 4 seconds to complete. Compare that to the roughly 50 ms it takes for the same amount of data on Windows 2000 against SQL Server on a comparable machine.

Needless to say, this is rather catastrophic, and I'm pretty much out of ideas. The figures above are from executing the query directly in psql, with no other connections to the database, so this does not seem to be a problem with the rest of the system. A configuration problem, perhaps? A bug? SQL ignorance on my part? I could surely use some help. As noted, I am not very comfortable with Unix systems, and not exactly an SQL guru either. The Linux distribution is some version of Red Hat, and the PostgreSQL version is 7.1.3.

Any help or attempt to help would be greatly appreciated.

/ Ztream
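P.S. In case it matters, "constantly vacuuming" above means running something like this between update cycles - it keeps things usable at this size, but clearly doesn't scale:

VACUUM ANALYZE Distribution;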
"Ztream" <ztream@highrad.org> wrote in message news:<a0fr1a$1eil$1@news.tht.net>... > Distribution size of 600 rows! Also, neither of the other two tables > referenced contained more than 30 rows. > > *2 > Even when the involved tables are freshly vacuumed (and vacuum analyzed), > the above query using the data amount from (*1) takes about 4 seconds to That time does sound REALLY excessive for the quantity of data involved. Have you tried "EXPLAIN" to see how PostgreSQL is implementing the query? Are there indexes on UserID in "Data" and "GroupMember"? Have you tried using a temporary table as an in-between stage? Something like: CREATE TEMPORARY TABLE stats AS SELECT ItemID, Grade, WeightGroupID,COUNT(*)as statcount FROM Data INNER JOIN GroupMember ON Data.UserID = GroupMember.UserID GROUP BY 1,2,3 UPDATE distribution SET value = (SELECT statcount FROM stats WHERE ItemID = Distribution.ItemID AND Grade = Distribution.Grade AND WeightGroupID = Distribution.WeightGroupID) Have you tried watching the output of "vmstat 1", to see if excessive disk I/O or swapping is the culprit? Would you consider emailing me a pg_dump of the affected tables, so I can try it on a computer here to see if the problem is in your SQL or in your server? Inquiring minds want to know ... Calvin Dodge Certified Linux Bigot(tm)
First off, it IS surprising that the query is running so slowly. Did you create any indexes? What size buffer (-B option) are you using? How much RAM is in your system? Have you disabled fsync? What does EXPLAIN say about the query?

Also, it seems silly to recompute a table on a fixed schedule of every 10 seconds. Why not do it only when the Data table changes? Or, for that matter, compute Distribution on the fly, only when you actually need to know it - something like the sketch below.
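Untested, but assuming the column layout implied by your UPDATE, the on-the-fly version is just the aggregate itself:

SELECT ItemID, Grade, WeightGroupID, COUNT(*) AS Value
FROM Data INNER JOIN GroupMember ON Data.UserID = GroupMember.UserID
GROUP BY ItemID, Grade, WeightGroupID;

One scan, nothing to vacuum. The one caveat: combinations with a zero count simply don't appear in the output, whereas your UPDATE writes an explicit 0 for them.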
--- Ztream <ztream@highrad.org> wrote:
> [original post quoted in full - trimmed]

=====
Chris Albertson
chrisalbertson90278@yahoo.com
Dear Z,

... decided to have a bit of a play with this. I created tables _similar_ to yours (guessed at what Data and GroupMember were like):

CREATE TABLE Distribution (
    Grade integer NOT NULL,
    ItemID integer NOT NULL,
    WeightGroupID integer NOT NULL,
    Value float NOT NULL
);

CREATE TABLE GroupMember (
    GroupID integer NOT NULL,
    GroupName varchar(30) NOT NULL,
    UserID integer NOT NULL
);

CREATE TABLE Data (
    ItemID integer NOT NULL,
    UserID integer NOT NULL,
    WeightGroupID integer NOT NULL,
    Grade integer NOT NULL
);

and populated them with generated data (1000, 100 and 10000 rows respectively).

I created an index on Distribution only:

CREATE INDEX dist_i1 ON Distribution (Grade, ItemID, WeightGroupID);

Initially, the update:

UPDATE Distribution
SET Value =
(
    SELECT COUNT(*)
    FROM Data INNER JOIN GroupMember ON Data.UserID = GroupMember.UserID
    WHERE ItemID = Distribution.ItemID
      AND Grade = Distribution.Grade
      AND WeightGroupID = Distribution.WeightGroupID
);

took about 60s (slow). Adding 3 other indexes:

CREATE INDEX groupm_i1 ON GroupMember (UserID);
CREATE INDEX data_i1 ON Data (UserID);
CREATE INDEX data_i2 ON Data (Grade, ItemID, WeightGroupID);

and analyzing reduced the elapsed time for the update to 2s (not bad).

Now, I have brutalized your schema (apologies...) in the interest of making scripted data generation easy (typed everything as numbers where possible), but hopefully the basic idea will still be appropriate. On that note, don't use "fancy" datatypes like numeric if integer will do, as the simple ones are faster.

On the tuning front I set these postgresql.conf parameters:

shared_buffers = 16000    # 128M of shared buffers
sort_mem = 10240          # 10M of sort memory

You probably don't need the buffers that high - unless you expect the big table(s) to have millions of rows. (I used Pg 7.2b4, but the same ideas should work with 7.1.x.)

Good luck

Mark
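P.S. If it's still slow for you after indexing, EXPLAIN will show whether the planner actually uses the indexes. You can't EXPLAIN the correlated subquery as-is, but substituting literals for the outer references (the values below are made up) gives you roughly the per-row plan:

EXPLAIN SELECT COUNT(*)
FROM Data INNER JOIN GroupMember ON Data.UserID = GroupMember.UserID
WHERE ItemID = 1 AND Grade = 1 AND WeightGroupID = 1;

You want to see index scans (e.g. using data_i2), not sequential scans.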
I might be missing something, but have you considered either using a trigger to update the Distribution table each time the Data table is amended, OR wrapping updates to the Distribution table in a function, i.e. f_add_data(....), that would keep the two tables in sync? I think this would probably do away with the need for the background process altogether.

Also, do you actually need the GroupMember table in the update statement?
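For the trigger route, a rough sketch of the insert case - untested, 7.1-style syntax, names made up, and it assumes a matching Distribution row already exists (deletes/updates to Data would need companion triggers, and plpgsql must be installed in the database):

CREATE FUNCTION f_add_data() RETURNS opaque AS '
BEGIN
    -- bump the matching count by the number of group memberships
    -- the new row''s user has (mirrors the join in the big UPDATE)
    UPDATE Distribution
    SET Value = Value + (SELECT COUNT(*)
                         FROM GroupMember
                         WHERE UserID = NEW.UserID)
    WHERE ItemID = NEW.ItemID
      AND Grade = NEW.Grade
      AND WeightGroupID = NEW.WeightGroupID;
    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER data_add AFTER INSERT ON Data
    FOR EACH ROW EXECUTE PROCEDURE f_add_data();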
hih
sb

"Ztream" <ztream@highrad.org> wrote in message news:a0fr1a$1eil$1@news.tht.net...
> [original post quoted in full - trimmed]
I am running Postgres 7.1 on SuSE. I got all the RPMs from SuSE, but I cannot find the libs to use Perl or Python in Postgres. I can add Tcl and PL/pgSQL without any problem. Any help would be great.
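For concreteness, what I am trying to get working is registering the languages with the createlang utility (database name made up):

createlang plperl mydb

but it fails, presumably because plperl.so is nowhere in the RPMs I installed - from what I can tell the handler library may be a separate package, or need a source build configured with --with-perl. (And I may be wrong, but I believe PL/Python only appears in 7.2, so it would not be available on 7.1 at all.)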
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of steve boyle
Sent: Tuesday, January 08, 2002 12:25 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Grave performance issues...

[steve boyle's reply, quoted in full above - trimmed]