Thread: Grave performance issues...

Grave performance issues...

From: "Ztream"

I am porting a system that I have been developing on an MS platform
(including SQL Server) to a Linux system with PostgreSQL. Apart from the
usual difficulties (and not being very knowledgeable in the Unix world), I
have now gotten most things to work. However, I am experiencing some pretty
serious performance issues.
The system collects data from users through a web interface, and also has a
background process that does rather heavy work. I will refer to just one of
the SQL statements here, to keep things focused. The idea is that the
background process is, among other things, constructing/updating a table
called Distribution from a table called Data; Distribution then represents
the statistical distribution of the information found in Data (taking user
groups into account, too).
The UPDATE statement actually performing this looks like this:

UPDATE Distribution
SET Value =
(
  SELECT COUNT(*)
  FROM Data INNER JOIN GroupMember ON Data.UserID = GroupMember.UserID
  WHERE ItemID = Distribution.ItemID
    AND Grade = Distribution.Grade
    AND WeightGroupID = Distribution.WeightGroupID
)

Distribution takes the form:
CREATE TABLE Distribution (
 Grade int NOT NULL,
 ItemID numeric(18, 0) NOT NULL,
 WeightGroupID numeric(18, 0) NOT NULL,
 Value float NOT NULL,
 PRIMARY KEY (Grade, ItemID, WeightGroupID)
);

I am unsure if I need to post the DDL descriptions of the other tables; feel
free to ask.

Anyway, there are two seemingly distinct problems with this:

*1
As you can see, this UPDATE touches the entire table on each execution,
which in PostgreSQL seems to mean that performance quickly deteriorates
unless you are constantly vacuuming. To answer an obvious question: I *do*
feel that this is necessary - if I were to update only those rows that
would acquire a new value, it would most probably turn out to be the
majority of them anyway, as the system is expected to receive a steady
flow of input data once launched.
This operation also needs to be performed very often - ideally about every
10 seconds. I left the background process running for a few days over the
holiday without vacuuming, and while testing today I found that the above
query takes approximately 2 minutes to complete - for a Distribution size
of 600 rows! Also, neither of the other two tables referenced contained
more than 30 rows.
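(The constant vacuuming in question amounts to running something like this
between update cycles:)

VACUUM ANALYZE Distribution;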

*2
Even when the involved tables are freshly vacuumed (and vacuum analyzed),
the above query on the data volume from (*1) takes about 4 seconds to
complete. Compare that to the roughly 50 ms it takes for the same amount
of data on Windows 2000 against SQL Server on a comparable machine.

Needless to say, this is rather catastrophic, and I'm pretty much out of
ideas. The performance figures given above are those I got while executing
the query directly in psql, with no other connections to the database, so
this does not seem to be a problem with the system itself. A configuration
problem, perhaps? A bug? SQL ignorance on my part? I could surely use some
help.
As previously noted, I am not very comfortable with Unix systems, and not
exactly an SQL guru either.
The Linux distribution is some version of Red Hat, and the PostgreSQL
version is 7.1.3.

Any help or attempt to help would be greatly appreciated.

 / Ztream



Re: Grave performance issues...

From: caldodge@fpcc.net (Calvin Dodge)

"Ztream" <ztream@highrad.org> wrote in message news:<a0fr1a$1eil$1@news.tht.net>...
> Distribution size of 600 rows! Also, neither of the other two tables
> referenced contained more than 30 rows.
>
> *2
> Even when the involved tables are freshly vacuumed (and vacuum analyzed),
> the above query using the data amount from (*1) takes about 4 seconds to

That time does sound REALLY excessive for the quantity of data
involved.

Have you tried "EXPLAIN" to see how PostgreSQL is implementing the
query?
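For example, in psql:

EXPLAIN UPDATE Distribution
SET Value =
(
  SELECT COUNT(*)
  FROM Data INNER JOIN GroupMember ON Data.UserID = GroupMember.UserID
  WHERE ItemID = Distribution.ItemID
    AND Grade = Distribution.Grade
    AND WeightGroupID = Distribution.WeightGroupID
);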

Are there indexes on UserID in "Data" and "GroupMember"?
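If not, something along these lines might help (the index names here are
made up):

CREATE INDEX data_userid_idx ON Data (UserID);
CREATE INDEX groupmember_userid_idx ON GroupMember (UserID);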

Have you tried using a temporary table as an in-between stage?
Something like:

CREATE TEMPORARY TABLE stats AS
SELECT ItemID, Grade, WeightGroupID, COUNT(*) AS statcount
FROM Data INNER JOIN GroupMember ON Data.UserID = GroupMember.UserID
GROUP BY 1, 2, 3;

UPDATE Distribution
SET Value = COALESCE(
  (SELECT statcount FROM stats
   WHERE ItemID = Distribution.ItemID
     AND Grade = Distribution.Grade
     AND WeightGroupID = Distribution.WeightGroupID), 0);

(The COALESCE keeps combinations with no matching stats row from being set
to NULL, since your Value column is NOT NULL.)



Have you tried watching the output of "vmstat 1", to see if excessive
disk I/O or swapping is the culprit?
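(Run it in a second terminal while the query executes - the si/so columns
show swapping, and bi/bo show block I/O.)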

Would you consider emailing me a pg_dump of the affected tables, so I
can try it on a computer here to see if the problem is in your SQL or
in your server?

Inquiring minds want to know ...

Calvin Dodge
Certified Linux Bigot(tm)

Re: Grave performance issues...

From: Chris Albertson

First off, it IS surprising that the query is running so slowly.

Did you create any indexes?
What size buffer cache (-B option) are you using?
How much RAM is in your system?
Have you disabled fsync?
What does "explain" say about the query?

Also, it seems silly to recompute a table on a fixed schedule of every 10
seconds. Why not do it only when the Data table changes? Or, for that
matter, compute "distribution" on the fly, only when you need to know it.
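Something like this (reusing the join from the UPDATE, and assuming ItemID,
Grade, and WeightGroupID all come from Data) gives all the counts in one
pass:

SELECT Data.ItemID, Data.Grade, Data.WeightGroupID, COUNT(*)
FROM Data INNER JOIN GroupMember ON Data.UserID = GroupMember.UserID
GROUP BY Data.ItemID, Data.Grade, Data.WeightGroupID;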

--- Ztream <ztream@highrad.org> wrote:
> I am porting a system that I have been developing on an MS platform
> (including SQL Server) to a linux system with postgresql.
> [rest of original message quoted in full - snipped]


=====
Chris Albertson
  Home:   310-376-1029  chrisalbertson90278@yahoo.com
  Cell:   310-990-7550
  Office: 310-336-5189  Christopher.J.Albertson@aero.org


Re: Grave performance issues...

From: Mark Kirkwood

Dear Z,

...decided to have a bit of a play with this.

I created tables _similar_ to yours
(guessed at what Data and GroupMember were like):

CREATE TABLE Distribution (
  Grade         integer NOT NULL,
  ItemID        integer NOT NULL,
  WeightGroupID integer NOT NULL,
  Value         float   NOT NULL
);

CREATE TABLE Groupmember (
  GroupID       integer NOT NULL,
  GroupName     varchar(30) NOT NULL,
  UserID        integer NOT NULL
);

CREATE TABLE Data (
  ItemID        integer NOT NULL,
  UserID        integer NOT NULL,
  weightGroupID integer NOT NULL,
  Grade         integer NOT NULL
);

and populated them with generated data (1000, 100, and 10000 rows
respectively).

I created an index on distribution only:

CREATE INDEX dist_i1 ON distribution(Grade, ItemID, WeightGroupID);


Initially, the update:
UPDATE Distribution
SET Value =
(
  SELECT COUNT(*)
  FROM Data INNER JOIN GroupMember ON Data.UserID = GroupMember.UserID
  WHERE ItemID = Distribution.ItemID
  AND Grade = Distribution.Grade
  AND WeightGroupID = Distribution.WeightGroupID
);

took about 60s (slow).

Adding 3 other indexes :

CREATE INDEX groupm_i1 ON groupmember(UserID);
--
CREATE INDEX data_i1 ON data(UserID);
CREATE INDEX data_i2 ON data(Grade,ItemID,WeightGroupID);

and analyzing reduced the elapsed time for the update to 2s (not bad).

Now, I have brutalized your schema (apologies...) in the interest of making
scripted data generation easy (typed everything as numbers where possible),
but hopefully the basic idea still applies...

On that note, don't use "fancy" datatypes like numeric if integer will do
(the simple ones are faster).

On the tuning front, I set these postgresql.conf parameters:

shared_buffers = 16000  # 128M of shared buffers
sort_mem       = 10240  # 10M of sort memory

You probably don't need shared_buffers that high - unless you expect the big
table(s) to have millions of rows...

(I used Pg 7.2b4 but the same ideas should work with 7.1.x)

Good luck

Mark



Re: Grave performance issues...

From: "steve boyle"

I might be missing something, but have you considered either using a trigger
to update the Distribution table each time the Data table is amended, or
wrapping updates to the Distribution table in a function, i.e. f_add_data(...),
that would keep the two tables in sync? I think this would probably do away
with the need for the background process altogether.
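A rough sketch of that trigger approach (the function and trigger names are
invented, this only handles INSERTs on Data, and it assumes the matching
Distribution row already exists):

CREATE FUNCTION f_add_data() RETURNS opaque AS '
BEGIN
  UPDATE Distribution
  SET Value = Value + 1
  WHERE ItemID = NEW.ItemID
    AND Grade = NEW.Grade
    AND WeightGroupID = NEW.WeightGroupID
    AND EXISTS (SELECT 1 FROM GroupMember WHERE UserID = NEW.UserID);
  RETURN NEW;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER data_ins_trig AFTER INSERT ON Data
  FOR EACH ROW EXECUTE PROCEDURE f_add_data();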

Also, are you using the GroupMember table in the update statement?

hih

sb

"Ztream" <ztream@highrad.org> wrote in message
news:a0fr1a$1eil$1@news.tht.net...
> I am porting a system that I have been developing on an MS platform
> (including SQL Server) to a linux system with postgresql. Apart from the
> usual difficulties (and not being very knowledgeable in the unix world), I
> have now gotten most things to work. However, I am experiencing some
pretty
> serious performance issues.
> The system collects data from users thru a web interface, and also has a
> background process that does rather heavy work. I will here refer to but
one
> of the SQL statements, to keep things focused. The idea here is that the
> background process is, among other things, constructing/updating a table
> called Distribution from a table called Data, which then represents the
> statistic distribution of the information found in Data (using user
groups,
> too).
> The update-statement actually performing this looks like this:
>
> UPDATE Distribution
> SET Value =
> (
> SELECT COUNT(*) FROM Data INNER JOIN GroupMember ON Data.UserID =
> GroupMember.UserID
> WHERE ItemID = Distribution.ItemID AND Grade = Distribution.Grade AND
> WeightGroupID = Distribution.WeightGroupID
> )
>
> Distribution takes the form:
> CREATE TABLE Distribution (
>  Grade int NOT NULL,
>  ItemID numeric(18, 0) NOT NULL,
>  WeightGroupID numeric(18, 0) NOT NULL,
>  Value float NOT NULL,
>  PRIMARY KEY (Grade, ItemID, WeightGroupID)
> );
>
> I am unsure if I need to post the DDL descriptions of the other tables;
feel
> free to ask.
>
> Anyway, there are two seemingly distinct problems with this:
>
> *1
> As you can see, this UPDATE touches the entire table upon each execution,
> which in postgresql seems to mean that performance quickly deteriorates
> unless you are constantly vacuuming. To answer an obvious question, I *do*
> feel that this is a necessary thing to do - if I were to update only those
> rows that would acquire a new value, it would most probably turn out to be
> the majority of them as the system is expected to receive a steady flow of
> input data when launched.
> This operation also needs to be performed very often - ideally about every
> 10th second. I tried leaving the background process running for a few days
> without vacuuming during the holiday, and while testing today I found that
> the above query takes aproximately 2 minutes to complete - for a
> Distribution size of 600 rows! Also, neither of the other two tables
> referenced contained more than 30 rows.
>
> *2
> Even when the involved tables are freshly vacuumed (and vacuum analyzed),
> the above query using the data amount from (*1) takes about 4 seconds to
> complete. That should be compared to the about 50ms it takes to complete
for
> the same amount of data on Windows 2000 against SQL Server on a comparable
> machine.
>
> Needless to say, this is rather catastrophical, and I'm pretty much out of
> ideas. The performance values given above are those I got while executing
> the query directly in psql, with no other connections to the database, so
> this does not seem to me to be a problem with the system in itself. A
> configuration problem, perhaps? A bug? SQL ignorance on my part? I could
> surely use some help.
> As previously noted, I am not very comfortable with unix systems, and not
> exactly an SQL guru either.
> The linux distribution is some version of redhat, and the postgresql
version
> is 7.1.3.
>
> Any help or attempt to help would be greatly appreciated.
>
>  / Ztream
>
>



Stored procedures in Perl or Python

From: "David Siebert"

I am running Postgres 7.1 on SuSE. I got all the RPMs from SuSE, but I
cannot find the libs to use Perl or Python in Postgres. I can add Tcl and
PL/pgSQL without any problem. Any help would be great.


-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of steve boyle
Sent: Tuesday, January 08, 2002 12:25 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Grave performance issues...

[steve boyle's message, quoted above in full - snipped]

