Thread: selects from large tables
Hi All,
We are using Postgres 7.1 on Solaris 8 - the hardware is a 400MHz Netra X1 with 512MB RAM, and the database is on a separate partition.
Our main result tables are getting really big, and we don't want to delete any data yet. Currently, our largest table has around 10 million rows and is growing at a rate of around 1 million per month. The table has 13 integer columns, one boolean and one timestamp. We index the table on an ID number and the timestamp, and we vacuum analyse the table every night. The performance has steadily degraded, and the more data we try to select, the longer the select queries take.
The queries are not complex, and do not involve any unions etc, eg:
SELECT * FROM table_name WHERE column1 = 454 AND time BETWEEN '2002-10-13 13:44:00.0' AND '2002-11-14'
SELECT count(DISTINCT id) FROM table_name WHERE column1 = 454 AND time BETWEEN '2002-10-13 13:44:00.0' AND '2002-11-14'
See the various queries and explain plans at the end of this email for more info on the type of queries we are doing.
Most of the queries use a sequential scan - disabling this and forcing an index scan decreases performance further for those queries.
These queries are sometimes taking over 2 minutes to perform!!!! If we reduce the table size significantly (i.e. to around 1 million rows) it is obviously faster - down to a few seconds.
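For reference, disabling sequential scans for those tests was just a session-level toggle, roughly:

SET enable_seqscan TO off;
SELECT * FROM table_name
WHERE column1 = 454
  AND time BETWEEN '2002-10-13 13:44:00.0' AND '2002-11-14';
SET enable_seqscan TO on;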
We then tried the DB on a clean installation of Solaris 9, on a dual 400MHz processor Sun E250 with 2GB RAM and two 17GB SCSI disks. We put the database onto the second disk. Surprisingly, the performance is only 5-10% better. I expected far more, given the increased power of the machine. Looking at the OS info on this machine, the IO wait is negligible, as is the CPU usage. So this machine is not working as hard as the Netra X1, yet the time taken to perform queries is not much different.
We have tried tweaking the shared buffers and sort mem (also tweaking the kernel shared memory size), which makes little difference; in fact, if we increase shared buffers to around 25% of total memory, performance degrades slightly. We changed shared buffers from the default to 64000, which gives us access to around 25% of total system memory.
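For what it's worth, the settings we have been experimenting with look roughly like this (the sort_mem figure below is just illustrative, not the exact value we used):

# postgresql.conf (PostgreSQL 7.1)
shared_buffers = 64000    # 8KB pages, roughly 512MB, about 25% of RAM on the E250
sort_mem = 4096           # per-sort working memory, in KB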
Any ideas on how we can select data more quickly from large tables?
Another idea we had was to split the data over multiple tables by ID (resulting in several thousand tables); however, this would make management of the database in terms of keys, triggers and integrity very difficult and messy.
I hope someone can offer some advice.
Cheers
Nikk
- Queries and explain plans
select count(*) from table_name;
NOTICE: QUERY PLAN:
Aggregate (cost=488700.65..488700.65 rows=1 width=0)
-> Seq Scan on table_name (cost=0.00..439527.12 rows=19669412 width=0)
hawkdb=# explain select count(job_id) from table_name;
NOTICE: QUERY PLAN:
Aggregate (cost=488700.65..488700.65 rows=1 width=4)
-> Seq Scan on table_name (cost=0.00..439527.12 rows=19669412 width=4)
hawkdb=# explain select * from table_name;
NOTICE: QUERY PLAN:
Seq Scan on table_name (cost=0.00..439527.12 rows=19669412 width=57)
hawkdb=# explain select count(*) from table_name where job_id = 13;
NOTICE: QUERY PLAN:
Aggregate (cost=537874.18..537874.18 rows=1 width=0)
-> Seq Scan on table_name (cost=0.00..488700.65 rows=19669412 width=0)
hawkdb=# explain select * from table_name where job_id = 13;
NOTICE: QUERY PLAN:
Seq Scan on http_result (cost=0.00..488700.65 rows=19669412 width=57)
hawkdb=# explain select * from table_name where job_id = 1;
NOTICE: QUERY PLAN:
Index Scan using http_result_pk on table_name (cost=0.00..5.01 rows=1 width=57)
hawkdb=#explain select * from table_name where time > '2002-10-10';
NOTICE: QUERY PLAN:
Seq Scan on table_name (cost=0.00..488700.65 rows=19649743 width=57)
hawkdb=# explain select * from http_result where time < '2002-10-10';
NOTICE: QUERY PLAN:
Index Scan using table_name_time on table_name (cost=0.00..75879.17 rows=19669 width=57)
Nikk Anderson
Parallel ltd.
Cranfield Innovation Centre
University Way
Cranfield
Bedfordshire
MK43 0BT
http://www.nexuswatch.com
http://www.parallel.ltd.uk
Tel: +44 (0)8700 PARALLEL (727255)
Fax: +44 (0)8700 PARAFAX (727232)
On Mon, 18 Nov 2002, Nikk Anderson wrote:

> Any ideas on how we can select data more quickly from large tables?

Are these row estimates realistic? It's estimating nearly 20 million rows
to be returned by some of the queries (unless I'm misreading the
number - possible since it's 5am here). At that point you almost
certainly want to be using a cursor rather than plain queries since even a
small width result (say 50 bytes) gives a very large (1 gig) result set.

> - Queries and explain plans
> [...]
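A minimal sketch of the cursor approach described above, reusing the table and column names from the quoted queries (the fetch size of 1000 is arbitrary):

BEGIN;
DECLARE big_cur CURSOR FOR
    SELECT * FROM table_name
    WHERE column1 = 454
      AND time BETWEEN '2002-10-13 13:44:00.0' AND '2002-11-14';
FETCH 1000 FROM big_cur;   -- repeat until no more rows come back
CLOSE big_cur;
COMMIT;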
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> On Mon, 18 Nov 2002, Nikk Anderson wrote:
>> Any ideas on how we can select data more quickly from large tables?

> Are these row estimates realistic?

Showing EXPLAIN ANALYZE results would be much more useful than just
EXPLAIN.

regards, tom lane
Hi,
Thanks for the reply Stephen, the data is 'somewhat' realistic.....
The data in the table is actually synthetic, but the structure is the same as our live system, and the queries are similar to those we actually carry out.
As the data was synthetic there was a bit of repetition (19 million rows of repetition!!) of the item used in the where clause, meaning that most of the table was returned by the queries - oops! So, what I have done is run some more realistic queries from our live system, and included the time each takes and the explain results. Just to note that explain's estimated number of rows is way out - its guesses are far too low.
Typically a normal query on our live system returns between 200 and 30000 rows, depending on the reports a user wants to generate. In prior testing we noted that using SELECT COUNT(...) was slower than other queries, which is why we thought we would test counts first.
Here are some more realistic results, which still take a fair whack of time........
Starting query 0
Query 0: SELECT * FROM xx WHERE time BETWEEN '2002-11-17 14:08:58.021' AND '2002-11-18 14:08:58.021' AND job_id = 335
Time taken = 697 ms
Index Scan using http_timejobid on xx (cost=0.00..17.01 rows=4 width=57)
This query returns 500 rows of data
Starting query 1
Query 1: SELECT * FROM xx WHERE time BETWEEN '2002-11-11 14:08:58.021' AND '2002-11-18 14:08:58.021' AND job_id = 335
Time taken = 15 seconds
Index Scan using http_timejobid on xx (cost=0.00..705.57 rows=175 width=57)
This query returns 3582 rows
Starting query 2
Query 2: SELECT * FROM xx WHERE time BETWEEN '2002-10-19 15:08:58.021' AND '2002-11-18 14:08:58.021' AND job_id = 335;
Time taken = 65 seconds
Index Scan using http_timejobid on xx (cost=0.00..3327.55 rows=832 width=57)
This query returns 15692 rows
Starting query 3
Query 3: SELECT * FROM xx_result WHERE time BETWEEN '2002-08-20 15:08:58.021' AND '2002-11-18 14:08:58.021' AND job_id = 335;
Time taken = 241 seconds
Index Scan using http_timejobid on xx (cost=0.00..10111.36 rows=2547 width=57)
This query returns 48768 rows
Cheers
Nikk
-----Original Message-----
From: Stephan Szabo [mailto:sszabo@megazone23.bigpanda.com]
Sent: 18 November 2002 13:02
To: Nikk Anderson
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] selects from large tables
On Mon, 18 Nov 2002, Nikk Anderson wrote:
> Any ideas on how we can select data more quickly from large tables?
Are these row estimates realistic? It's estimating nearly 20 million rows
to be returned by some of the queries (unless I'm misreading the
number - possible since it's 5am here). At that point you almost
certainly want to be using a cursor rather than plain queries since even a
small width result (say 50 bytes) gives a very large (1 gig) result set.
> - Queries and explain plans
>
> select count(*) from table_name;
> NOTICE: QUERY PLAN:
> Aggregate (cost=488700.65..488700.65 rows=1 width=0)
> -> Seq Scan on table_name (cost=0.00..439527.12 rows=19669412 width=0)
>
> hawkdb=# explain select count(job_id) from table_name;
> NOTICE: QUERY PLAN:
> Aggregate (cost=488700.65..488700.65 rows=1 width=4)
> -> Seq Scan on table_name (cost=0.00..439527.12 rows=19669412 width=4)
>
> hawkdb=# explain select * from table_name;
> NOTICE: QUERY PLAN:
> Seq Scan on table_name (cost=0.00..439527.12 rows=19669412 width=57)
>
> hawkdb=# explain select count(*) from table_name where job_id = 13;
> NOTICE: QUERY PLAN:
> Aggregate (cost=537874.18..537874.18 rows=1 width=0)
> -> Seq Scan on table_name (cost=0.00..488700.65 rows=19669412 width=0)
>
> hawkdb=# explain select * from table_name where job_id = 13;
> NOTICE: QUERY PLAN:
> Seq Scan on http_result (cost=0.00..488700.65 rows=19669412 width=57)
>
> hawkdb=# explain select * from table_name where job_id = 1;
> NOTICE: QUERY PLAN:
> Index Scan using http_result_pk on table_name (cost=0.00..5.01 rows=1
> width=57)
>
> hawkdb=#explain select * from table_name where time > '2002-10-10';
> NOTICE: QUERY PLAN:
> Seq Scan on table_name (cost=0.00..488700.65 rows=19649743 width=57)
>
> hawkdb=# explain select * from http_result where time < '2002-10-10';
> NOTICE: QUERY PLAN:
> Index Scan using table_name_time on table_name (cost=0.00..75879.17
> rows=19669 width=57)
Hi,
Unfortunately EXPLAIN ANALYZE does not work on our Postgres version (7.1).
I think I will download and compile 7.2, and try to compile in 64bit mode to see if that helps improve performance.
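Once we are on 7.2, I assume it is just a matter of prefixing the problem queries, e.g.:

EXPLAIN ANALYZE
SELECT * FROM xx
WHERE time BETWEEN '2002-11-11 14:08:58.021' AND '2002-11-18 14:08:58.021'
  AND job_id = 335;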
Cheers
Nikk
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: 18 November 2002 15:04
To: Stephan Szabo
Cc: Nikk Anderson; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] selects from large tables
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> On Mon, 18 Nov 2002, Nikk Anderson wrote:
>> Any ideas on how we can select data more quickly from large tables?
> Are these row estimates realistic?
Showing EXPLAIN ANALYZE results would be much more useful than just
EXPLAIN.
regards, tom lane
On Mon, Nov 18, 2002 at 03:36:08PM -0000, Nikk Anderson wrote:
> Hi,
>
> Unfortunately explain analyze does not work on our postgres version (7.1) ?

No, it doesn't.

> I think I will download and compile 7.2, and try to compile in 64bit mode to
> see if that helps improve performance.

I have seen something like a 40% improvement in performance from 7.1 to 7.2
on Solaris 7 in my tests.

There are some problems with the 64 bit compilation, by the way, so make sure
that you check out the FAQ and test carefully. You need to make some
modifications of the source files in order to avoid some buggy libraries on
Solaris.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                             Toronto, Ontario Canada
<andrew@libertyrms.info>                M2P 2A8
                                        +1 416 646 3304 x110
Nikk:

Are you doing vaccums on these tables? I was under the understanding
that the estimated row count should be close to the real row count
returned, and when it is not (as it looks in your case), the primary
reason for the disconnect is that the stats for the tables are
out-of-date.

Since it used the indexes, I am not sure if the old stats are causing
any issues, but I suspect they are not helping.

Also, do you do any clustering of the data (since the queries are mostly
time limited)? I am wondering if the system is doing lots of seeks to
get the data (implying that the data is all over the disk and not
clustered).

Charlie

Nikk Anderson wrote:
> [...]

--
Charles H. Woloszynski
ClearMetrix, Inc.
115 Research Drive
Bethlehem, PA 18015
tel: 610-419-2210 x400
fax: 240-371-3256
web: www.clearmetrix.com
"Charles H. Woloszynski" <chw@clearmetrix.com> writes: > Are you doing vaccums on these tables? I was under the understanding > that the estimated row count should be close to the real row count > returned, and when it is not (as it looks in your case), the primary > reason for the disconnect is that the stats for the tables are > out-of-date. The fact that he's using 7.1 doesn't help any; the statistics mechanisms in 7.1 are pretty weak compared to 7.2. > Also, do you do any clustering of the data (since the queries are mostly > time limited)? I am wondering if the system is doing lots of seeks to > get the data (implying that the data is all over the disk and not > clustered). It would also be interesting to try a two-column index ordered the other way (timestamp as the major sort key instead of ID). Can't tell if that will be a win without more info about the data properties, but it's worth looking at. regards, tom lane
On Mon, 18 Nov 2002, Nikk Anderson wrote:
> Hi,
> Thanks for the reply Stephen, the data is 'somewhat' realistic.....

Tom's said most of what I would have, except that if you've got a wide
variation based on job_id you may want to change the statistics gathering
defaults for that column with ALTER TABLE ALTER COLUMN SET STATISTICS
when you get to 7.2.
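That would be something along these lines (table name assumed from the thread, and 100 is just an example statistics target):

ALTER TABLE http_result ALTER COLUMN job_id SET STATISTICS 100;
ANALYZE http_result;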
Hi Charlie,
We do a vacuum analyze every night at midnight. I thought that perhaps the analyzing was not being done correctly, so I manually did a vacuum analyze, and the estimated row counts were still way out.
I will look into clustering the data and see what effect that may have.
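For reference, I believe the command to try is just the old-style CLUSTER syntax with our index and table names, i.e.:

CLUSTER http_timejobid ON http_result;

though as I understand it this rewrites the whole table, so it would need re-running periodically as new rows arrive.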
Thanks
Nikk
-----Original Message-----
From: Charles H. Woloszynski [mailto:chw@clearmetrix.com]
Sent: 18 November 2002 15:46
To: Nikk Anderson
Cc: 'Stephan Szabo'; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] selects from large tables
Nikk:
Are you doing vaccums on these tables? I was under the understanding
that the estimated row count should be close to the real row count
returned, and when it is not (as it looks in your case), the primary
reason for the disconnect is that the stats for the tables are
out-of-date.
Since it used the indexes, I am not sure if the old stats are causing
any issues, but I suspect they are not helping.
Also, do you do any clustering of the data (since the queries are mostly
time limited)? I am wondering if the system is doing lots of seeks to
get the data (implying that the data is all over the disk and not
clustered).
Charlie
Nikk Anderson wrote:
> Hi,
> Thanks for the reply Stephen, the data is 'somewhat' realistic.....
>
> The data in the table is actually synthetic, but the structure is the
> same as our live system, and the queries are similar to those we
> actually carry out.
>
> As the data was synthetic there was a bit of repetition (19 million
> rows of repetition!! ) of the item used in the where clause, meaning
> that most of the table was returned by the queries - oops! So, I have
> done is some more realistic queries from our live system, and put the
> time it takes, and the explain results. Just to note that the
> explain's estimated number of rows is way out - its guesses are way
> too low.
>
> Typically a normal query on our live system returns between 200 and
> 30000 rows depending on the reports a user wants to generate. In
> prior testing, we noted that using SELECT COUNT( .. was slower than
> other queries, which is why we though we would test counts first.
>
>
> Here are some more realistic results, which still take a fair whack of
> time........
>
>
> Starting query 0
> Query 0: SELECT * FROM xx WHERE time BETWEEN '2002-11-17 14:08:58.021'
> AND '2002-11-18 14:08:58.021' AND job_id = 335
> Time taken = 697 ms
> Index Scan using http_timejobid on xx (cost=0.00..17.01 rows=4 width=57)
> This query returns 500 rows of data
>
>
> Starting query 1
> Query 1: SELECT * FROM xx WHERE time BETWEEN '2002-11-11 14:08:58.021'
> AND '2002-11-18 14:08:58.021' AND job_id = 335
> Time taken = 15 seconds
> Index Scan using http_timejobid on xx (cost=0.00..705.57 rows=175
> width=57)
> This query return 3582 rows
>
> Starting query 2
> Query 2: SELECT * FROM xx WHERE time BETWEEN '2002-10-19 15:08:58.021'
> AND '2002-11-18 14:08:58.021' AND job_id = 335;
> Time taken = 65 seconds
> Index Scan using http_timejobid on xx (cost=0.00..3327.55 rows=832
> width=57)
> This query returns 15692 rows
>
> Starting query 3
> Query 3: SELECT * FROM xx_result WHERE time BETWEEN '2002-08-20
> 15:08:58.021' AND '2002-11-18 14:08:58.021' AND job_id = 335;
>
> Time taken = 241 seconds
> Index Scan using http_timejobid on xx (cost=0.00..10111.36 rows=2547
> width=57)
> This query returns 48768 rows
>
>
> Cheers
>
> Nikk
>
>
>
>
> -----Original Message-----
> From: Stephan Szabo [mailto:sszabo@megazone23.bigpanda.com]
> Sent: 18 November 2002 13:02
> To: Nikk Anderson
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] selects from large tables
>
>
>
> On Mon, 18 Nov 2002, Nikk Anderson wrote:
>
> > Any ideas on how we can select data more quickly from large tables?
>
> Are these row estimates realistic? It's estimating nearly 20 million rows
> to be returned by some of the queries (unless I'm misreading the
> number - possible since it's 5am here). At that point you almost
> certainly want to be using a cursor rather than plain queries since even a
> small width result (say 50 bytes) gives a very large (1 gig) result set.
>
> > - Queries and explain plans
> >
> > select count(*) from table_name;
> > NOTICE: QUERY PLAN:
> > Aggregate (cost=488700.65..488700.65 rows=1 width=0)
> > -> Seq Scan on table_name (cost=0.00..439527.12 rows=19669412
> width=0)
> >
> > hawkdb=# explain select count(job_id) from table_name;
> > NOTICE: QUERY PLAN:
> > Aggregate (cost=488700.65..488700.65 rows=1 width=4)
> > -> Seq Scan on table_name (cost=0.00..439527.12 rows=19669412
> width=4)
> >
> > hawkdb=# explain select * from table_name;
> > NOTICE: QUERY PLAN:
> > Seq Scan on table_name (cost=0.00..439527.12 rows=19669412 width=57)
> >
> > hawkdb=# explain select count(*) from table_name where job_id = 13;
> > NOTICE: QUERY PLAN:
> > Aggregate (cost=537874.18..537874.18 rows=1 width=0)
> > -> Seq Scan on table_name (cost=0.00..488700.65 rows=19669412
> width=0)
> >
> > hawkdb=# explain select * from table_name where job_id = 13;
> > NOTICE: QUERY PLAN:
> > Seq Scan on http_result (cost=0.00..488700.65 rows=19669412 width=57)
> >
> > hawkdb=# explain select * from table_name where job_id = 1;
> > NOTICE: QUERY PLAN:
> > Index Scan using http_result_pk on table_name (cost=0.00..5.01 rows=1
> > width=57)
> >
> > hawkdb=#explain select * from table_name where time > '2002-10-10';
> > NOTICE: QUERY PLAN:
> > Seq Scan on table_name (cost=0.00..488700.65 rows=19649743 width=57)
> >
> > hawkdb=# explain select * from http_result where time < '2002-10-10';
> > NOTICE: QUERY PLAN:
> > Index Scan using table_name_time on table_name (cost=0.00..75879.17
> > rows=19669 width=57)
>
--
Charles H. Woloszynski
ClearMetrix, Inc.
115 Research Drive
Bethlehem, PA 18015
tel: 610-419-2210 x400
fax: 240-371-3256
web: www.clearmetrix.com
Hi Tom,
Yes, we should upgrade to 7.2 soon; it's just that, as it is a live system running 24x7, we are careful about upgrading core components so we do not disrupt our data collection agents too much.
Here is some table info; we currently index by time then ID. Generally, data will be selected by ID, then time range. Clustering may help with this.
Attribute | Type | Modifier
-----------------+--------------------------+----------
job_id | integer | not null
server_id | integer | not null
time | timestamp with time zone | not null
availability | boolean |
connection_time | integer |
dns_setup | integer |
server_response | integer |
frontpage_size | integer |
frontpage_time | integer |
transfer_size | integer |
transfer_time | integer |
error_id | integer |
redirect_time | integer |
polling_id | integer | not null
Indices: http_result_pk,
http_timejobid
Thanks
Nikk
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: 18 November 2002 16:25
To: Charles H. Woloszynski
Cc: Nikk Anderson; 'Stephan Szabo'; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] selects from large tables
"Charles H. Woloszynski" <chw@clearmetrix.com> writes:
> Are you doing vaccums on these tables? I was under the understanding
> that the estimated row count should be close to the real row count
> returned, and when it is not (as it looks in your case), the primary
> reason for the disconnect is that the stats for the tables are
> out-of-date.
The fact that he's using 7.1 doesn't help any; the statistics mechanisms
in 7.1 are pretty weak compared to 7.2.
> Also, do you do any clustering of the data (since the queries are mostly
> time limited)? I am wondering if the system is doing lots of seeks to
> get the data (implying that the data is all over the disk and not
> clustered).
It would also be interesting to try a two-column index ordered the other
way (timestamp as the major sort key instead of ID). Can't tell if that
will be a win without more info about the data properties, but it's
worth looking at.
regards, tom lane
You might want to check out 7.3 while you're at it. It's currently planned
to be released around Dec 1st, which might fit in nicely with your upgrade
schedule.

Robert Treat

On Mon, 2002-11-18 at 11:36, Nikk Anderson wrote:
> [...]
Hi,
I tried a test cluster on a copy of our real data - all 10 million rows or so. WOW! The normal select performance improved drastically.
Selecting 3 months worth of data was taking 146 seconds to retrieve. After clustering it took 7.7 seconds! We are now looking into ways we can automate clustering to keep the table up to date. The cluster itself took around 2.5 hours.
As our backend systems are writing hundreds of rows of data per minute into the table that needs clustering - will cluster handle locking the table when dropping the old data and renaming the clustered data? What happens to the data being added to the table while cluster is running? Our backend systems may have some problems if the table does not exist when they try to insert, and we don't want to lose any data.
Thanks
Nikk
-----Original Message-----
From: Charles H. Woloszynski [mailto:chw@clearmetrix.com]
Sent: 18 November 2002 15:46
To: Nikk Anderson
Cc: 'Stephan Szabo'; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] selects from large tables
[...]
Nikk Anderson wrote:
> [...]
> As our backend systems are writing hundreds of rows of data in per minute
> into the table that needs clustering - will cluster handle locking the
> tables when dropping the old, and renaming the clustered data? What happens
> to the data being added to the table while cluster is running? Our backend
> systems may have some problems if the table does not exist when it tries to
> insert, and we don't want to lose any data.

CLUSTER will exclusively lock the table from read/write during the
CLUSTER. Sorry.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Nikk Anderson <Nikk.Anderson@parallel.ltd.uk> writes:
> As our backend systems are writing hundreds of rows of data in per minute
> into the table that needs clustering - will cluster handle locking the
> tables when dropping the old, and renaming the clustered data? What happens
> to the data being added to the table while cluster is running?

Nothing, because there won't be any: cluster acquires exclusive lock on
the table while it runs. Any would-be inserter will block till it's done.

If you are clustering by timestamp of insertion, and you never update or
delete rows, then I think it's a one-time-and-you're-done kind of task
anyway --- newly inserted rows will always get added at the end, and so
will be in timestamp order anyway. But if you need to update the table
then things aren't so nice :-(

regards, tom lane

PS: it's not really necessary to quote the entire thread in every
message, and it's definitely not nice to do so twice in both plain text
and HTML :-(. Please have some consideration for the size of your
emails that Marc is archiving for posterity ...
On Wed, 2002-11-20 at 10:08, Nikk Anderson wrote:
> [...]
> As our backend systems are writing hundreds of rows of data in per
> minute into the table that needs clustering - will cluster handle
> locking the tables when dropping the old, and renaming the clustered
> data? What happens to the data being added to the table while cluster
> is running? Our backend systems may have some problems if the table
> does not exist when it tries to insert, and we don't want to lose any
> data.

The table will be locked while cluster is running. Meaning, any new
data will have to sit and wait.

Cluster won't buy much on a mostly clustered table. But it's probably
worth it for you to do it when 20% of the tuples turn over (deleted,
updated, inserted, etc).

I'm a little curious to know when the last time you had run a VACUUM
FULL on that table was.

--
Rod Taylor <rbt@rbt.ca>
Hi,
> I'm a little curious to know when the last time you had run a VACUUM
> FULL on that table was.
>
>--
>Rod Taylor <rbt@rbt.ca>
We do a VACUUM ANALYZE every night; there is no FULL option in our version (7.1).
Nikk
> We do a VACUUM ANALYZE every night, there is no option for FULL on our
> version (7.1)

Oh, I see. Old stuff :)

--
Rod Taylor <rbt@rbt.ca>
Nikk Anderson wrote on Wed, 20.11.2002 at 20:08:
> Hi,
>
> I tried a test cluster on a copy of our real data - all 10 million
> rows or so. WOW! The normal select performance improved
> drastically.
> [...]
> What happens to the data being added to the table while cluster
> is running? Our backend systems may have some problems if the table
> does not exist when it tries to insert, and we don't want to lose any
> data.

You could use a staging table that takes all the inserts, and whose
contents are moved to the main table (begin; insert into big select from
small; delete from small; commit; vacuum full small;) once a day (or week
or month), just before clustering the big one.

Then do all your selects from a UNION view over both - thus you have a
big, fast, clustered table and a non-clustered "live" table which stays
small.

That should make your selects fast(er).

-----------------
Hannu
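A rough sketch of that layout, with illustrative names (http_result standing in for the big clustered table, http_result_live for the small staging table), assuming a version where UNION ALL views and VACUUM FULL are available:

-- small staging table with the same structure as the big one
CREATE TABLE http_result_live AS
    SELECT * FROM http_result WHERE 1 = 0;

-- all selects go through a view over both tables
CREATE VIEW http_result_all AS
    SELECT * FROM http_result
    UNION ALL
    SELECT * FROM http_result_live;

-- periodically move the staged rows into the big table, then re-cluster it
BEGIN;
INSERT INTO http_result SELECT * FROM http_result_live;
DELETE FROM http_result_live;
COMMIT;
VACUUM FULL http_result_live;            -- plain VACUUM on 7.1
CLUSTER http_timejobid ON http_result;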