Thread: Huge Data
Hi,

I use PostgreSQL 7.4 for storing a huge amount of data, for example 7 million rows. But when I run the query "select count(*) from table;", it returns only after about 120 seconds. Is this normal for such a huge table? Are there any methods to speed up the query time? The huge table has an integer primary key and some other indexes on other columns.

The hardware is: PIII 800 MHz processor, 512 MB RAM, and an IDE hard disk drive.

-sezai
On Wednesday 14 January 2004 11:11, Sezai YILMAZ wrote:
> Hi,
>
> I use PostgreSQL 7.4 for storing a huge amount of data, for example 7
> million rows. But when I run the query "select count(*) from table;", it
> returns only after about 120 seconds. Is this normal for such a huge
> table? Are there any methods to speed up the query time? The huge table
> has an integer primary key and some other indexes on other columns.

PG uses MVCC to manage concurrency. A downside of this is that to verify the
exact number of rows in a table you have to visit them all.

There's plenty on this in the archives, and probably the FAQ too.

What are you using the count() for?

--
Richard Huxton
Archonet Ltd
Richard Huxton wrote:
> On Wednesday 14 January 2004 11:11, Sezai YILMAZ wrote:
>> Hi,
>>
>> I use PostgreSQL 7.4 for storing a huge amount of data, for example 7
>> million rows. But when I run the query "select count(*) from table;", it
>> returns only after about 120 seconds. Is this normal for such a huge
>> table? Are there any methods to speed up the query time? The huge table
>> has an integer primary key and some other indexes on other columns.
>
> PG uses MVCC to manage concurrency. A downside of this is that to verify the
> exact number of rows in a table you have to visit them all.
>
> There's plenty on this in the archives, and probably the FAQ too.
>
> What are you using the count() for?

I use count() for some statistics, just to show how many records have been
collected so far.

-sezai
Richard Huxton wrote:
> PG uses MVCC to manage concurrency. A downside of this is that to verify the
> exact number of rows in a table you have to visit them all.
>
> There's plenty on this in the archives, and probably the FAQ too.
>
> What are you using the count() for?

select logid, agentid, logbody from log where logid=3000000;

This query also returns after about 120 seconds. The log table has about
7 million records, and logid is the primary key of the log table. What about
that? Why is it so slow?

-sezai
On Wednesday 14 January 2004 17:57, Sezai YILMAZ wrote:
> Richard Huxton wrote:
> > What are you using the count() for?
>
> I use count() for some statistics, just to show how many records have been
> collected so far.

Rather than doing count(*), you should either cache the count in application
memory, or analyze often and use the following:

select reltuples from pg_class where relname = 'foo';

This would give you an approximate count. I believe it should suffice for
your needs.

HTH

Shridhar
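For reference, a minimal sketch of how the two halves of that suggestion fit
together, assuming the table is the "log" table from the original post (the
name is only illustrative):

    -- refresh statistics; reltuples is only as fresh as the last ANALYZE or VACUUM
    ANALYZE log;

    -- read the estimated row count from the system catalog (an estimate, not an exact count)
    SELECT reltuples FROM pg_class WHERE relname = 'log';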
Have you run 'vacuum analyze log;'? Also I believe that in Oracle count(1) used to be quicker than count(*).
Matthew
----- Original Message -----
From: Sezai YILMAZ
To: Richard Huxton
Sent: Wednesday, January 14, 2004 12:39 PM
Subject: Re: [GENERAL] Huge Data

Richard Huxton wrote:
>PG uses MVCC to manage concurrency. A downside of this is that to verify the
>exact number of rows in a table you have to visit them all.
>
>There's plenty on this in the archives, and probably the FAQ too.
>
>What are you using the count() for?
>
>
>
select logid, agentid, logbody from log where logid=3000000;
This query also returns after about 120 seconds. The log table has about
7 million records, and logid is the primary key of the log table. What about
that? Why is it so slow?
-sezai
On Wednesday 14 January 2004 18:22, Matthew Lunnon wrote:
> select logid, agentid, logbody from log where logid=3000000;
>
> This query also returns after about 120 seconds. The log table has about
> 7 million records, and logid is the primary key of the log table. What
> about that? Why is it so slow?

How about

select logid, agentid, logbody from log where logid='3000000';

or

select logid, agentid, logbody from log where logid=3000000::int4;

Basically you need to typecast the constant. Then it would use the index. I
am not sure of the first form though. I recommend you use the latter form.

Shridhar
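For anyone reading this in the archives, a quick way to confirm whether the
index is being used is EXPLAIN. A minimal sketch against the log table from
this thread, assuming logid is a bigint column and the primary key index is
named log_pkey (both assumptions, adjust to your schema):

    -- a plain integer literal does not match a bigint index on 7.4
    EXPLAIN SELECT logid, agentid, logbody FROM log WHERE logid = 3000000;
    -- expected plan: Seq Scan on log

    -- quoting or casting the constant lets the planner use the index
    EXPLAIN SELECT logid, agentid, logbody FROM log WHERE logid = '3000000';
    EXPLAIN SELECT logid, agentid, logbody FROM log WHERE logid = 3000000::bigint;
    -- expected plan: Index Scan using log_pkey on log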
On Wednesday 14 January 2004 12:39, Sezai YILMAZ wrote:
> select logid, agentid, logbody from log where logid=3000000;

At a guess, because logid is bigint, whereas 3000000 is taken to be integer.
Try

... where logid = 3000000::bigint;

This is in the FAQ too I think, and is certainly in the archives.

Other things you might come across:

SELECT max() involves a sequential scan just like count(); you can rewrite it
as

SELECT target_column FROM my_table ORDER BY target_column DESC LIMIT 1

The default config values are very conservative. You will definitely want to
tune them for performance. See the articles here for a good introduction:
http://www.varlena.com/varlena/GeneralBits/Tidbits/index.php

The VACUUM command is used to reclaim unused space, and the ANALYZE command
to regenerate statistics. It's worth reading up on both.

You can use EXPLAIN ANALYSE <query here> to see the plan that PG uses. I
think there's a discussion of it at http://techdocs.postgresql.org/

--
Richard Huxton
Archonet Ltd
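A small sketch of the max() rewrite Richard describes, using the column names
from earlier in the thread as an example:

    -- SELECT max(logid) FROM log;   -- forces a sequential scan over 7 million rows on 7.4
    SELECT logid FROM log ORDER BY logid DESC LIMIT 1;   -- can walk the index from the top instead

    -- EXPLAIN ANALYZE shows which plan was actually chosen, with real timings
    EXPLAIN ANALYZE SELECT logid FROM log ORDER BY logid DESC LIMIT 1;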
On Wednesday 14 January 2004 12:27, Sezai YILMAZ wrote:
> Richard Huxton wrote:
> > There's plenty on this in the archives, and probably the FAQ too.
> >
> > What are you using the count() for?
>
> I use count() for some statistics, just to show how many records have been
> collected so far.

If you want an accurate number without scanning the table, you'll need to use
a trigger to keep a count up to date.

--
Richard Huxton
Archonet Ltd
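A rough sketch of that trigger approach, for the archives. All names here are
made up for illustration, plpgsql must already be installed with createlang,
and note that every insert or delete then updates the same counter row, which
serializes heavy concurrent writers:

    -- one-row side table holding the running count
    CREATE TABLE log_rowcount (n bigint NOT NULL);
    INSERT INTO log_rowcount SELECT count(*) FROM log;   -- seed it once

    CREATE FUNCTION log_count_trig() RETURNS trigger AS '
    BEGIN
        IF TG_OP = ''INSERT'' THEN
            UPDATE log_rowcount SET n = n + 1;
        ELSE
            UPDATE log_rowcount SET n = n - 1;
        END IF;
        RETURN NULL;   -- return value is ignored for AFTER triggers
    END;
    ' LANGUAGE plpgsql;

    CREATE TRIGGER log_count AFTER INSERT OR DELETE ON log
        FOR EACH ROW EXECUTE PROCEDURE log_count_trig();

    -- the "how many rows so far" question becomes a single-row read
    SELECT n FROM log_rowcount;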
Shridhar Daithankar wrote:
> Rather than doing count(*), you should either cache the count in application
> memory, or analyze often and use the following:
>
> select reltuples from pg_class where relname = 'foo';

Thank you very much Shridhar. This one responds immediately. I think I will
use this method for gathering the row count. But my complaint is that it
breaks the SQL standard; the code will become unportable.

-sezai
If the mentioned solution fits your needs, you could create a stored
procedure using that. The PostgreSQL implementation could select from
pg_class, while the same function in another database could execute the
select count() on the table.
On Wed, 2004-01-14 at 10:25, Sezai YILMAZ wrote:
> Shridhar Daithankar wrote:
> > Rather than doing count(*), you should either cache the count in
> > application memory, or analyze often and use the following:
> >
> > select reltuples from pg_class where relname = 'foo';
>
> Thank you very much Shridhar. This one responds immediately. I think I will
> use this method for gathering the row count. But my complaint is that it
> breaks the SQL standard; the code will become unportable.
>
> -sezai
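A rough sketch of such a wrapper function under PostgreSQL 7.4, with a made-up
name (approx_count) purely for illustration; on another database the same
function name could simply run select count(*) against the table:

    -- approximate count via the catalog: a cheap lookup instead of a full scan
    CREATE FUNCTION approx_count(text) RETURNS bigint AS '
        SELECT reltuples::bigint FROM pg_class WHERE relname = $1::name;
    ' LANGUAGE sql STABLE;

    -- usage: SELECT approx_count('log');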
Shridhar Daithankar wrote:
> On Wednesday 14 January 2004 18:22, Matthew Lunnon wrote:
> > select logid, agentid, logbody from log where logid=3000000;
> >
> > This query also returns after about 120 seconds. The log table has about
> > 7 million records, and logid is the primary key of the log table. What
> > about that? Why is it so slow?
>
> How about
>
> select logid, agentid, logbody from log where logid='3000000';

Oh my god. It is unbelievable. The result is great. Thanks to all the guys
who helped me.

-sezai
On Wednesday 14 January 2004 18:55, Sezai YILMAZ wrote:
> Shridhar Daithankar wrote:
> > Rather than doing count(*), you should either cache the count in
> > application memory, or analyze often and use the following:
> >
> > select reltuples from pg_class where relname = 'foo';
>
> Thank you very much Shridhar. This one responds immediately. I think I will
> use this method for gathering the row count. But my complaint is that it
> breaks the SQL standard; the code will become unportable.

Well, you could document it somewhere for your reference. It is not that
hard.. :-)

And remember, the value you get is just an estimate. You need to analyze the
table often, in line with its update/insert/delete activity, to keep the
estimate reasonably accurate. Vacuuming would also update the estimate.

Shridhar
On Wednesday 14 January 2004 18:09, Sezai YILMAZ wrote:
> Richard Huxton wrote:
> > PG uses MVCC to manage concurrency. A downside of this is that to verify
> > the exact number of rows in a table you have to visit them all.
> >
> > There's plenty on this in the archives, and probably the FAQ too.
> >
> > What are you using the count() for?
>
> select logid, agentid, logbody from log where logid=3000000;

How about

select logid, agentid, logbody from log where logid='3000000';

or

select logid, agentid, logbody from log where logid=3000000::int4;

Basically you need to typecast the constant. Then it would use the index. I
am not sure of the first form though. I recommend you use the latter form.

Shridhar