Thread: for help!

for help!

From
linweidong
Date:

Hi,

 

I am writing to discuss a performance problem we have encountered with the
PostgreSQL database in our project, and I would like to get your
suggestions.

The PostgreSQL database we use needs to process several million records.
There are only six tables in the database; one of them contains several
million records, and the others are much smaller. We need to select more
than 100 thousand records from the table that contains several million
records within 10 seconds. The selection speed is not stable: sometimes it
takes 2 minutes, sometimes 20 seconds. After analyzing where the time goes,
we found that count(*) is very slow. We have already tuned parameters such
as max_fsm_relations, max_fsm_pages, and shared memory size, but the
performance has not improved satisfactorily.

Under these conditions, I would like some useful suggestions from you. How
can we optimize the database? How can we improve count(*)? We want to get
the number of records in the result set we retrieved.

Thank you very much! I hope to hear from you soon.

                                                                                                 Wind

                                                                                                      2003-4-15

Re: for help!

From
Shridhar Daithankar
Date:
On Tuesday 15 April 2003 15:14, you wrote:
> The PostgreSQL database we use needs to process several million records.
> There are only six tables in the database; one of them contains several
> million records, and the others are much smaller. We need to select more
> than 100 thousand records from the table that contains several million
> records within 10 seconds. The selection speed is not stable: sometimes
> it takes 2 minutes, sometimes 20 seconds. After analyzing where the time
> goes, we found that count(*) is very slow. We have already tuned
> parameters such as max_fsm_relations, max_fsm_pages, and shared memory
> size, but the performance has not improved satisfactorily.

Why do you need to do select count(*) to select more than 100 thousand
records?

PostgreSQL being an MVCC database, select count(*) is never going to be
fast, especially if you have concurrent transactions occurring on the table.

As far as just selecting rows from the table goes, that should be fast if
there are proper indexes, the table is analyzed every now and then, and
there are enough shared buffers.
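
For example, a minimal sketch with a hypothetical table big_table and a
hypothetical filter column status:

    -- Index the column used in the WHERE clause.
    CREATE INDEX big_table_status_idx ON big_table (status);

    -- Refresh planner statistics so the index can be chosen.
    ANALYZE big_table;

    -- Verify the plan shows an index scan rather than a seq scan.
    EXPLAIN ANALYZE SELECT * FROM big_table WHERE status = 'open';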

If you post your queries and table schemas, that would be very helpful.
Your tweaked settings in postgresql.conf and your hardware specs would be
good as well.

> Under these conditions, I would like some useful suggestions from you.
> How can we optimize the database? How can we improve count(*)? We want
> to get the number of records in the result set we retrieved.

If you are using, say, libpq, you don't need to issue both a select
count(*) where foo and a select where foo to obtain the record count and
the records themselves; PQntuples() on the result already gives you the
count. I believe every other interface stemming from libpq should provide
such hooks as well. I've never used any other myself (barring ecpg).

 HTH

 Shridhar


Re: for help!

From
Will LaShell
Date:
<SNIP>
> > Under these conditions, I would like some useful suggestions from you.
> > How can we optimize the database? How can we improve count(*)? We want
> > to get the number of records in the result set we retrieved.
>
> If you are using, say, libpq, you don't need to issue both a select
> count(*) where foo and a select where foo to obtain the record count and
> the records themselves; PQntuples() on the result already gives you the
> count. I believe every other interface stemming from libpq should
> provide such hooks as well. I've never used any other myself (barring ecpg).

The Python interfaces most definitely do. Doing the count is quite
unnecessary, just as Shridhar points out.

>  HTH
>
>  Shridhar
>


Re: for help!

From
"scott.marlowe"
Date:
On Tue, 15 Apr 2003, linweidong wrote:

> Under these conditions, I would like some useful suggestions from you.
> How can we optimize the database? How can we improve count(*)? We want
> to get the number of records in the result set we retrieved.

Well, you can always use the trick of putting an on insert / delete
trigger on the table that maintains a single-row table with the current
count.  That way, whenever a row is added or removed, the count is
updated.  This will slow down inserts and deletes a little, but TANSTAAFL.
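
In outline, with hypothetical names (big_table, row_count) and assuming
plpgsql is already installed via createlang:

    -- One-row table holding the current count.
    CREATE TABLE row_count (n bigint NOT NULL);
    INSERT INTO row_count SELECT count(*) FROM big_table;

    CREATE FUNCTION maintain_row_count() RETURNS trigger AS '
    BEGIN
        IF TG_OP = ''INSERT'' THEN
            UPDATE row_count SET n = n + 1;
        ELSE
            UPDATE row_count SET n = n - 1;
        END IF;
        RETURN NULL;  -- return value is ignored for AFTER row triggers
    END;
    ' LANGUAGE 'plpgsql';

    CREATE TRIGGER big_table_count AFTER INSERT OR DELETE ON big_table
        FOR EACH ROW EXECUTE PROCEDURE maintain_row_count();

Reading the count is then just "SELECT n FROM row_count", but note that
every writer now serializes on that single counter row.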


Re: for help!

From
Josh Berkus
Date:
Scott,

> Well, you can always use the trick of putting an on insert / delete
> trigger on the table that maintains a single-row table with the current
> count.  That way, whenever a row is added or removed, the count is
> updated.  This will slow down inserts and deletes a little, but TANSTAAFL.

BTW, I tested this for a client.   I found the performance penalty on inserts
and updates to be:

-- For a single stream of intermittent updates from a single connection
   on an adequately powered server with moderate disk support (IDE Linux RAID)
   (100 inserts/updates per minute, with VACUUM every 5 minutes)
   PL/pgSQL Trigger:  20% penalty      C Trigger: 9-11% penalty

-- For 5 streams of inserts and updates at high volume on an overloaded
   server with moderate disk support (dual fast SCSI disks)
   (1000 inserts/updates per minute, vacuum every 5 minutes)
   PL/pgSQL Trigger: 65% penalty    C Trigger: 40% penalty

Please note that the effective performance penalty on inserts and updates was
dramatically higher for large batches of updates than for small ones.

--
Josh Berkus
Aglio Database Solutions
San Francisco


Re: for help!

From
Andrew Sullivan
Date:
On Wed, Apr 16, 2003 at 08:48:36AM -0700, Josh Berkus wrote:
> Scott,
>
> > Well, you can always use the trick of putting an on insert / delete
> > trigger on the table that maintains a single-row table with the current
> > count.  That way, whenever a row is added or removed, the count is

> BTW, I tested this for a client.   I found the performance penalty
> on inserts and updates to be:

[. . .]

> Please note that the effective performance penalty on inserts and
> updates was dramatically higher for large batches of updates than
> for small ones.

Presumably the problem was to do with contention?  This is why I
don't really like the "update one row" approach for this sort of
thing.

But you _could_ write a trigger which inserts into a "staging" table,
and write a little daemon which only updates the count table with the
data from the staging table.  It's a mighty ugly hack, but it ought
to work.
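
A rough sketch of that, reusing the hypothetical big_table and row_count
from earlier in the thread:

    -- Writers append +1/-1 deltas; no shared row, so no lock contention.
    CREATE TABLE count_staging (delta integer NOT NULL);

    CREATE FUNCTION stage_count() RETURNS trigger AS '
    BEGIN
        IF TG_OP = ''INSERT'' THEN
            INSERT INTO count_staging VALUES (1);
        ELSE
            INSERT INTO count_staging VALUES (-1);
        END IF;
        RETURN NULL;
    END;
    ' LANGUAGE 'plpgsql';

    CREATE TRIGGER big_table_stage AFTER INSERT OR DELETE ON big_table
        FOR EACH ROW EXECUTE PROCEDURE stage_count();

    -- The daemon runs this periodically.  SERIALIZABLE gives the UPDATE
    -- and DELETE the same snapshot, so no delta is lost or double-counted.
    BEGIN;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    UPDATE row_count
       SET n = n + (SELECT coalesce(sum(delta), 0) FROM count_staging);
    DELETE FROM count_staging;
    COMMIT;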

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: for help!

From
Tom Lane
Date:
Andrew Sullivan <andrew@libertyrms.info> writes:
> But you _could_ write a trigger which inserts into a "staging" table,
> and write a little daemon which only updates the count table with the
> data from the staging table.  It's a mighty ugly hack, but it ought
> to work.

The $64 question with this sort of thing is "how accurate (up-to-date)
does the count have to be?".

Given that Josh is willing to vacuum every five minutes, he might find
that returning pg_class.reltuples is Close Enough (TM).
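
That is, roughly (table name hypothetical):

    -- reltuples is refreshed by VACUUM and ANALYZE; it is an estimate,
    -- not an exact count, but here at most five minutes stale.
    SELECT reltuples FROM pg_class WHERE relname = 'big_table';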

            regards, tom lane


Re: for help!

From
Andrew Sullivan
Date:
On Wed, Apr 16, 2003 at 12:01:56PM -0400, Tom Lane wrote:
> Given that Josh is willing to vacuum every five minutes, he might find
> that returning pg_class.reltuples is Close Enough (TM).

Certainly, it's not going to be any farther off than the
staging-table+real-table approach, anyway.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110