Thread: Huge Data

Huge Data

From
Sezai YILMAZ
Date:
Hi,

I use PostgreSQL 7.4 to store a huge amount of data, for example 7
million rows. But when I run the query "select count(*) from table;", it
returns after about 120 seconds. Is this normal for such a huge
table? Are there any methods to speed up the query time? The huge
table has an integer primary key and some other indexes on other columns.

The hardware is a PIII 800 MHz processor, 512 MB RAM, and an IDE hard
disk drive.

-sezai

Re: Huge Data

From
Richard Huxton
Date:
On Wednesday 14 January 2004 11:11, Sezai YILMAZ wrote:
> Hi,
>
> I use PostgreSQL 7.4 to store a huge amount of data, for example 7
> million rows. But when I run the query "select count(*) from table;", it
> returns after about 120 seconds. Is this normal for such a huge
> table? Are there any methods to speed up the query time? The huge
> table has an integer primary key and some other indexes on other columns.

PG uses MVCC to manage concurrency. A downside of this is that to verify the
exact number of rows in a table you have to visit them all.
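
You can see this with EXPLAIN (the table name here is a placeholder):

explain select count(*) from your_table;

The plan is a sequential scan over the whole table feeding the aggregate; an
index cannot answer count(*), because index entries carry no row-visibility
information.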

There's plenty on this in the archives, and probably the FAQ too.

What are you using the count() for?

--
  Richard Huxton
  Archonet Ltd

Re: Huge Data

From
Sezai YILMAZ
Date:
Richard Huxton wrote:

>On Wednesday 14 January 2004 11:11, Sezai YILMAZ wrote:
>
>>Hi,
>>
>>I use PostgreSQL 7.4 to store a huge amount of data, for example 7
>>million rows. But when I run the query "select count(*) from table;", it
>>returns after about 120 seconds. Is this normal for such a huge
>>table? Are there any methods to speed up the query time? The huge
>>table has an integer primary key and some other indexes on other columns.
>
>PG uses MVCC to manage concurrency. A downside of this is that to verify the
>exact number of rows in a table you have to visit them all.
>
>There's plenty on this in the archives, and probably the FAQ too.
>
>What are you using the count() for?

I use count() for some statistics, just to show how many records have been
collected so far.

-sezai

Re: Huge Data

From
Sezai YILMAZ
Date:
Richard Huxton wrote:

>PG uses MVCC to manage concurrency. A downside of this is that to verify the
>exact number of rows in a table you have to visit them all.
>
>There's plenty on this in the archives, and probably the FAQ too.
>
>What are you using the count() for?

select logid, agentid, logbody from log where logid=3000000;

This query also takes about 120 seconds to return. The log table has about
7 million records, and logid is its primary key. What about that? Why is it
so slow?

-sezai


Re: Huge Data

From
Shridhar Daithankar
Date:
On Wednesday 14 January 2004 17:57, Sezai YILMAZ wrote:
> Richard Huxton wrote:
> >What are you using the count() for?
>
> I use count() for some statistics, just to show how many records have been
> collected so far.

Rather than doing count(*), you should either cache the count in application
memory, or analyze often and use the following:

select reltuples from pg_class where relname = 'foo';

This will give you an approximate count. I believe it should suffice for your
needs.
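
For example, with the log table from this thread:

analyze log;
select reltuples from pg_class where relname = 'log';

reltuples is a statistics estimate maintained by ANALYZE (and VACUUM), so it
is only as fresh as the last run of one of those commands.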

HTH

 Shridhar


Re: Huge Data

From
"Matthew Lunnon"
Date:
Have you run 'vacuum analyze log;'? Also I believe that in Oracle count(1) used to be quicker than count(*).
Matthew
----- Original Message -----
Sent: Wednesday, January 14, 2004 12:39 PM
Subject: Re: [GENERAL] Huge Data

Richard Huxton wrote:

>PG uses MVCC to manage concurrency. A downside of this is that to verify the
>exact number of rows in a table you have to visit them all.
>
>There's plenty on this in the archives, and probably the FAQ too.
>
>What are you using the count() for?

select logid, agentid, logbody from log where logid=3000000;

This query also takes about 120 seconds to return. The log table has about
7 million records, and logid is its primary key. What about that? Why is it
so slow?

-sezai



Re: Huge Data

From
Shridhar Daithankar
Date:
On Wednesday 14 January 2004 18:22, Matthew Lunnon wrote:
>   select logid, agentid, logbody from log where logid=3000000;
>
>   This query also takes about 120 seconds to return. The log table has about
>   7 million records, and logid is its primary key. What about that? Why is it
>   so slow?

How about

select logid, agentid, logbody from log where logid='3000000';

or

select logid, agentid, logbody from log where logid=3000000::int4;

Basically you need to typecast the constant; then it will use the index.

I am not sure about the first form, though. I recommend you use the latter
form.
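
You can verify with EXPLAIN that the index is being used:

explain select logid, agentid, logbody from log where logid='3000000';

With the typecast in place the plan should show an index scan on the primary
key rather than a sequential scan.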

 Shridhar


Re: Huge Data

From
Richard Huxton
Date:
On Wednesday 14 January 2004 12:39, Sezai YILMAZ wrote:
>
> select logid, agentid, logbody from log where logid=3000000;

At a guess, because logid is bigint, whereas 3000000 is taken to be integer.
Try ... where logid = 3000000::bigint;

This is in the FAQ too I think, and is certainly in the archives.

Other things you might come across:
SELECT max() involves a sequential scan just like count(); you can rewrite it
as SELECT target_column FROM my_table ORDER BY target_column DESC LIMIT 1.
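
For example, to get the highest logid without touching every row:

SELECT logid FROM log ORDER BY logid DESC LIMIT 1;

This walks the primary-key index backwards from the high end and stops after
one row.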

The default config values are very conservative. You will definitely want to
tune them for performance. See the articles here for a good introduction:
  http://www.varlena.com/varlena/GeneralBits/Tidbits/index.php

The VACUUM command is used to reclaim unused space, and the ANALYZE command to
regenerate statistics. It's worth reading up on both.
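
For example, after heavy insert/delete activity on the log table:

VACUUM ANALYZE log;

does both jobs in one pass.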

You can use EXPLAIN ANALYSE <query here> to see the plan that PG uses. I think
there's a discussion of it at http://techdocs.postgresql.org/

--
  Richard Huxton
  Archonet Ltd

Re: Huge Data

From
Richard Huxton
Date:
On Wednesday 14 January 2004 12:27, Sezai YILMAZ wrote:
> Richard Huxton wrote:
> >There's plenty on this in the archives, and probably the FAQ too.
> >
> >What are you using the count() for?
>
> I use count() for some statistics. Just to show how many records
> collected so far.

If you want an accurate number without scanning the table, you'll need to use
a trigger to keep a count up to date.
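
A rough sketch of that approach (table, function, and trigger names are
illustrative; assumes the plpgsql language is installed in the database):

create table log_rowcount (n bigint not null);
insert into log_rowcount values (0);   -- seed this with a one-off count(*)

create function log_rowcount_trig() returns trigger as '
begin
    if TG_OP = ''INSERT'' then
        update log_rowcount set n = n + 1;
    else  -- DELETE
        update log_rowcount set n = n - 1;
    end if;
    return null;  -- return value is ignored for AFTER triggers
end;
' language plpgsql;

create trigger log_rowcount_update after insert or delete on log
    for each row execute procedure log_rowcount_trig();

select n from log_rowcount;  -- instant, and exact under MVCC

Bear in mind that every insert and delete then updates the same counter row,
which serializes concurrent writers and bloats log_rowcount, so vacuum it
often.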

--
  Richard Huxton
  Archonet Ltd

Re: Huge Data

From
Sezai YILMAZ
Date:
Shridhar Daithankar wrote:

>Rather than doing count(*), you should either cache the count in application
>memory, or analyze often and use the following:
>
>select reltuples from pg_class where relname = 'foo';
>
Thank you very much Shridhar. This one responds immediately. I think I will
use this method for gathering the row count. But I am unhappy that it breaks
the SQL standard; the code becomes unportable.

-sezai

Re: Huge Data

From
Franco Bruno Borghesi
Date:
If the mentioned solution fits your needs, you could wrap it in a stored procedure. The PostgreSQL implementation could select from pg_class, while the same function in another database could execute select count(*) on the table.
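
A minimal sketch of such a function (the name estimated_count is
illustrative, and the body uses 7.4-era string quoting):

create function estimated_count(text) returns bigint as '
    select reltuples::bigint from pg_class where relname = $1;
' language sql;

select estimated_count('log');

On another database you would keep the same signature and implement the body
with a real count(*).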

On Wed, 2004-01-14 at 10:25, Sezai YILMAZ wrote:
Shridhar Daithankar wrote:

>Rather than doing count(*), you should either cache the count in application
>memory, or analyze often and use the following:
>
>select reltuples from pg_class where relname = 'foo';
>
Thank you very much Shridhar. This one responds immediately. I think I will
use this method for gathering the row count. But I am unhappy that it breaks
the SQL standard; the code becomes unportable.

-sezai


Re: Huge Data

From
Sezai YILMAZ
Date:
Shridhar Daithankar wrote:

>On Wednesday 14 January 2004 18:22, Matthew Lunnon wrote:
>
>>  select logid, agentid, logbody from log where logid=3000000;
>>
>>  This query also takes about 120 seconds to return. The log table has about
>>  7 million records, and logid is its primary key. What about that? Why is it
>>  so slow?
>
>How about
>
>select logid, agentid, logbody from log where logid='3000000';

Oh my god, it is unbelievable. The result is great. Thanks to everyone who
helped me.

-sezai

Re: Huge Data

From
Shridhar Daithankar
Date:
On Wednesday 14 January 2004 18:55, Sezai YILMAZ wrote:
> Shridhar Daithankar wrote:
> >Rather than doing count(*), you should either cache the count in
> >application memory, or analyze often and use the following:
> >
> >select reltuples from pg_class where relname = 'foo';
>
> Thank you very much Shridhar. This one responds immediately. I think I will
> use this method for gathering the row count. But I am unhappy that it breaks
> the SQL standard; the code becomes unportable.

Well, you could document it somewhere for your reference. It is not that
hard. :-)

And remember, the value you get is just an estimate. You need to analyze the
table often, in proportion to its update/insert/delete activity, to keep the
estimate reasonably accurate. Vacuuming also updates the estimate.
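
For example, you could refresh the statistics from cron (the schedule and
database name here are illustrative):

0 * * * *  psql -c 'analyze log' mydb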

 Shridhar


Re: Huge Data

From
Shridhar Daithankar
Date:
On Wednesday 14 January 2004 18:09, Sezai YILMAZ wrote:
> Richard Huxton wrote:
> >PG uses MVCC to manage concurrency. A downside of this is that to verify
> > the exact number of rows in a table you have to visit them all.
> >
> >There's plenty on this in the archives, and probably the FAQ too.
> >
> >What are you using the count() for?
>
> select logid, agentid, logbody from log where logid=3000000;

How about

select logid, agentid, logbody from log where logid='3000000';

or

select logid, agentid, logbody from log where logid=3000000::int4;

Basically you need to typecast the constant; then it will use the index.

I am not sure about the first form, though. I recommend you use the latter
form.

 Shridhar