Re: selects from large tables - Mailing list pgsql-performance

From Nikk Anderson
Subject Re: selects from large tables
Date
Msg-id DA1274E682D3734B8802904A9B36124C298AA1@nic-nts1.nic.parallel.ltd.uk
Whole thread Raw
In response to selects from large tables  (Nikk Anderson <Nikk.Anderson@parallel.ltd.uk>)
Responses Re: selects from large tables  (Bruce Momjian <pgman@candle.pha.pa.us>)
Re: selects from large tables  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: selects from large tables  (Rod Taylor <rbt@rbt.ca>)
Re: selects from large tables  (Hannu Krosing <hannu@tm.ee>)
List pgsql-performance

Hi,

I tried a test cluster on a copy of our real data - all 10 million rows or so.  WOW!   The normal select performance improved drastically. 

Selecting 3 months worth of data was taking 146 seconds to retrieve.  After clustering it took 7.7 seconds!  We are now looking into ways we can automate clustering to keep the table up to date.  The cluster itself took around 2.5 hours.

As our backend systems are writing hundreds of rows of data in per minute into the table that needs clustering - will cluster handle locking the tables when dropping the old, and renaming the clustered data?  What happens to the data being added to the table while cluster is running? Our backend systems may have some problems if the table does not exist when it tries to insert, and we don't want to lose any data.

Thanks

Nikk

-----Original Message-----
From: Charles H. Woloszynski [mailto:chw@clearmetrix.com]
Sent: 18 November 2002 15:46
To: Nikk Anderson
Cc: 'Stephan Szabo'; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] selects from large tables

Nikk:

Are you doing vaccums on these tables?  I was under the understanding
that the estimated row count should be close to the real row count
returned, and when it is not (as it looks in your case), the primary
reason for the disconnect is that the stats for the tables are
out-of-date. 

Since it used the indexes, I am not sure if the old stats are causing
any issues, but I suspect they are not helping. 

Also, do you do any clustering of the data (since the queries are mostly
time limited)?  I am wondering if the system is doing lots of seeks to
get the data (implying that the data is all over the disk and not
clustered). 

Charlie

Nikk Anderson wrote:

> Hi,
> Thanks for the reply Stephen, the data is 'somewhat' realistic.....
>
> The data in the table is actually synthetic, but the structure is the
> same as our live system, and the queries are similar to those we
> actually carry out.
>
> As the data was synthetic there was a bit of repetition (19 million
> rows of repetition!! ) of the item used in the where clause, meaning
> that most of the table was returned by the queries - oops!  So, I have
> done is some more realistic queries from our live system, and put the
> time it takes, and the explain results.  Just to note that the
> explain's estimated number of rows is way out - its guesses are way
> too low.
>
> Typically a normal query on our live system returns between 200 and
> 30000 rows depending on the reports a user wants to generate.  In
> prior testing, we noted that using SELECT COUNT( ..   was slower than
> other queries, which is why we though we would test counts first.
>
>
> Here are some more realistic results, which still take a fair whack of
> time........
>
>
> Starting query 0
> Query 0: SELECT * FROM xx WHERE time BETWEEN '2002-11-17 14:08:58.021'
> AND '2002-11-18 14:08:58.021' AND job_id = 335
> Time taken = 697 ms
> Index Scan using http_timejobid on xx  (cost=0.00..17.01 rows=4 width=57)
> This query returns 500 rows of data
>
>
> Starting query 1
> Query 1: SELECT * FROM xx WHERE time BETWEEN '2002-11-11 14:08:58.021'
> AND '2002-11-18 14:08:58.021' AND job_id = 335
> Time taken = 15 seconds
> Index Scan using http_timejobid on xx  (cost=0.00..705.57 rows=175
> width=57)
> This query return 3582 rows
>
> Starting query 2
> Query 2: SELECT * FROM xx WHERE time BETWEEN '2002-10-19 15:08:58.021'
> AND '2002-11-18 14:08:58.021' AND job_id = 335;
> Time taken = 65 seconds
> Index Scan using http_timejobid on xx  (cost=0.00..3327.55 rows=832
> width=57)
> This query returns 15692 rows
>
> Starting query 3
> Query 3: SELECT * FROM xx_result WHERE time BETWEEN '2002-08-20
> 15:08:58.021' AND '2002-11-18 14:08:58.021' AND job_id = 335;
>
> Time taken = 241 seconds
> Index Scan using http_timejobid on xx  (cost=0.00..10111.36 rows=2547
> width=57)
> This query returns 48768 rows
>
>
> Cheers
>
> Nikk
>
>
>
>
> -----Original Message-----
> From: Stephan Szabo [mailto:sszabo@megazone23.bigpanda.com]
> Sent: 18 November 2002 13:02
> To: Nikk Anderson
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] selects from large tables
>
>
>
> On Mon, 18 Nov 2002, Nikk Anderson wrote:
>
> > Any ideas on how we can select data more quickly from large tables?
>
> Are these row estimates realistic? It's estimating nearly 20 million rows
> to be returned by some of the queries (unless I'm misreading the
> number - possible since it's 5am here).  At that point you almost
> certainly want to be using a cursor rather than plain queries since even a
> small width result (say 50 bytes) gives a very large (1 gig) result set.
>
> > - Queries and explain plans
> >
> > select count(*) from table_name;
> > NOTICE:  QUERY PLAN:
> > Aggregate  (cost=488700.65..488700.65 rows=1 width=0)
> >   ->  Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412
> width=0)
> >
> > hawkdb=# explain select count(job_id) from table_name;
> > NOTICE:  QUERY PLAN:
> > Aggregate  (cost=488700.65..488700.65 rows=1 width=4)
> >   ->  Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412
> width=4)
> >
> > hawkdb=# explain select * from table_name;
> > NOTICE:  QUERY PLAN:
> > Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412 width=57)
> >
> > hawkdb=# explain select count(*) from table_name where job_id = 13;
> > NOTICE:  QUERY PLAN:
> > Aggregate  (cost=537874.18..537874.18 rows=1 width=0)
> >   ->  Seq Scan on table_name  (cost=0.00..488700.65 rows=19669412
> width=0)
> >
> > hawkdb=# explain select * from table_name where job_id = 13;
> > NOTICE:  QUERY PLAN:
> > Seq Scan on http_result  (cost=0.00..488700.65 rows=19669412 width=57)
> >
> > hawkdb=# explain select * from table_name where job_id = 1;
> > NOTICE:  QUERY PLAN:
> > Index Scan using http_result_pk on table_name  (cost=0.00..5.01 rows=1
> > width=57)
> >
> > hawkdb=#explain select * from table_name where time > '2002-10-10';
> > NOTICE:  QUERY PLAN:
> > Seq Scan on table_name  (cost=0.00..488700.65 rows=19649743 width=57)
> >
> > hawkdb=# explain select * from http_result where time < '2002-10-10';
> > NOTICE:  QUERY PLAN:
> > Index Scan using table_name_time on table_name  (cost=0.00..75879.17
> > rows=19669 width=57)
>

--

Charles H. Woloszynski

ClearMetrix, Inc.
115 Research Drive
Bethlehem, PA 18015

tel: 610-419-2210 x400
fax: 240-371-3256
web: www.clearmetrix.com



---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

pgsql-performance by date:

Previous
From: Richard Huxton
Date:
Subject: Re: Slow DELETE with IN clausule
Next
From: Bruce Momjian
Date:
Subject: Re: selects from large tables