Thread: Too slow querying a table of 15 million records

From
kjelle@bingon.no
Date:
Hello!

I use FreeBSD 4.11 with PostgreSQL 7.3.8.

I have a huge database with roughly 15 million records. There is just one
table, with a timestamp field, a few ints, and a few strings.


table test
fields time (timestamp), source (string), destination (string), p1 (int),
p2 (int)


I have run VACUUM ANALYZE ;

I have created indexes on every field, but for some reason the PostgreSQL
server wants to use a seqscan, even though I know an indexed scan would be
much faster.


create index test_time_idx on test (time) ;
create index test_source_idx on test (source) ;
create index test_destination_idx on test (destination) ;
create index test_p1_idx on test (p1) ;
create index test_p2_idx on test (p2) ;



What is really strange is that PostgreSQL uses a seqscan when I query a
count(*) on one of the int fields (p1) for a value with very few matches.
In another count on the same field, where I know the value matches about
2.2 million rows, it again uses a seqscan instead of an indexed one. Isn't
the whole idea of indexing to speed up large queries? To make a sort of
phone book for the values, so it is faster to look up whatever you
need... This just seems like the opposite...

Here is an EXPLAIN of my query:

database=> explain select date_trunc('hour', time),count(*) as total from
test where p1=53 and time > now() - interval '24 hours' group by
date_trunc order by date_trunc ;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Aggregate  (cost=727622.61..733143.23 rows=73608 width=8)
   ->  Group  (cost=727622.61..731303.02 rows=736083 width=8)
         ->  Sort  (cost=727622.61..729462.81 rows=736083 width=8)
               Sort Key: date_trunc('hour'::text, "time")
               ->  Seq Scan on test  (cost=0.00..631133.12 rows=736083
width=8)
                     Filter: ((p1 = 53) AND ("time" > (now() - '1
day'::interval)))
(6 rows)



database=> drop INDEX test_<TAB>
test_source_idx       test_destination_idx   test_p1_idx
test_p2_idx           test_time_idx


After all this, I tried setting enable_seqscan to off and
enable_nestloop to on. This didn't help much either; the query still takes
minutes to run. My result is the number of rows for each hour, and it
gives about 1000-2000 hits per hour. I have read somewhere that
PostgreSQL can easily handle 100-200 million records and, with a properly
tuned system, perform very well. I would like to learn how :)

I also found an article on a page
(http://techdocs.postgresql.org/techdocs/pgsqladventuresep3.php):

Tip #11:  Don't bother indexing columns with huge numbers of records and a
small range of values, such as BOOLEAN columns.

This tip, regretfully, is perhaps the only tip where I cannot provide a
good, real-world example from my work.  So I'll give you a hypothetical
situation instead:

Imagine that you have a database table with a list of every establishment
vending ice cream in the US.  A simple example might look like:

There are almost 1 million rows but, due to simplistic data entry, only
three possible values for type (1-SUPERMARKET, 2-BOUTIQUE, and 3-OTHER),
which are relatively evenly distributed.  In this hypothetical situation,
you might find (with testing using EXPLAIN) that an index on type is
ignored and the planner uses a "seq scan" (or table scan) instead.  This is
because a table scan can actually be faster than an index scan in this
situation.  Thus, any index on type should be dropped.

Certainly, the boolean column (active) requires no indexing as it has only
two possible values and no index will be faster than a table scan.


Then I ask: what good is indexing if I can't use it on a VERY large
database? It is on my 15 million record database that it takes forever to
do seqscans over and over again... As the quote above explains, this is
probably why the planner chooses a full scan and not an indexed one...

So what do I do? :confused:

I've used SQL for years, but never at this scale, so I have never had to
learn how to deal with large numbers of records; usually a maximum of 1000.
Now, with millions, I need to learn a way to make my sucky queries better.

I'm trying to learn more about tuning my system, writing better queries,
and so on. I've found some documents on the Internet, but they are far
from the best.

Feedback most appreciated!

Regards,
a learning PostgreSQL user





Re: Too slow querying a table of 15 million records

From
Tobias Brox
Date:
[kjelle@bingon.no - Tue at 08:33:58PM +0200]
> I use FreeBSD 4.11 with PostGreSQL 7.3.8.
(...)
> database=> explain select date_trunc('hour', time),count(*) as total from
> test where p1=53 and time > now() - interval '24 hours' group by
> date_trunc order by date_trunc ;

I haven't looked through all your email yet, but this phenomenon has come
up on the list a couple of times.  Try replacing "now() - interval '24
hours'" with a fixed timestamp, and see if it helps.

pg7 will plan the query without knowledge of what "now() - interval '24
hours'" will evaluate to.  This should be fixed in pg8.
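A minimal sketch of that rewrite (the literal timestamp here is of course
hypothetical; in practice the client application computes "24 hours ago"
just before issuing the query):

```sql
-- Instead of letting the planner see an opaque expression:
--   where p1=53 and time > now() - interval '24 hours'
-- substitute a precomputed literal, which the planner can use to
-- estimate how many rows the time range actually covers:
explain select date_trunc('hour', time), count(*) as total
from test
where p1 = 53
  and time > '2005-06-28 15:34:00'   -- literal for "24 hours ago"
group by date_trunc order by date_trunc ;
```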

--
Tobias Brox, +86-13521622905
Nordicbet, IT dept

Re: Too slow querying a table of 15 million records

From
John A Meinel
Date:
Tobias Brox wrote:

>[kjelle@bingon.no - Tue at 08:33:58PM +0200]
>>I use FreeBSD 4.11 with PostGreSQL 7.3.8.
>(...)
>>database=> explain select date_trunc('hour', time),count(*) as total from
>>test where p1=53 and time > now() - interval '24 hours' group by
>>date_trunc order by date_trunc ;
>
>I haven't looked through all your email yet, but this phenomenon has come
>up on the list a couple of times.  Try replacing "now() - interval '24
>hours'" with a fixed timestamp, and see if it helps.
>
>pg7 will plan the query without knowledge of what "now() - interval '24
>hours'" will evaluate to.  This should be fixed in pg8.
>
>
The grandparent was a mailing list double send. Notice the date is 1
week ago. It has already been answered (though your answer is still
correct).

John
=:->



Re: Too slow querying a table of 15 million records

From
Christopher Kings-Lynne
Date:
> database=> explain select date_trunc('hour', time),count(*) as total from
> test where p1=53 and time > now() - interval '24 hours' group by
> date_trunc order by date_trunc ;

Try going:

time > '2005-06-28 15:34:00'

i.e. put in the timestamp from 24 hours ago as a literal constant.

Chris


Re: Too slow querying a table of 15 million records

From
PFC
Date:

> database=> explain select date_trunc('hour', time),count(*) as total from
> test where p1=53 and time > now() - interval '24 hours' group by
> date_trunc order by date_trunc ;

    1. Use CURRENT_TIMESTAMP (which is considered a constant by the
planner) instead of now().
    2. Create a multicolumn index on (p1, time) or (time, p1), whichever
works better.
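A sketch of suggestion 2, using a hypothetical index name; only EXPLAIN
against the real data can tell which column order wins, so create one at a
time and compare plans:

```sql
-- Candidate A: the equality column (p1) first, then the range column
create index test_p1_time_idx on test (p1, "time");
-- Candidate B: the other order (uncomment to try instead)
-- create index test_time_p1_idx on test ("time", p1);

analyze test;  -- refresh statistics so the planner considers the new index

explain select date_trunc('hour', time), count(*) as total
from test where p1 = 53 and time > '2005-06-28 15:34:00'
group by date_trunc order by date_trunc ;
```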

Re: Too slow querying a table of 15 million records

From
Tom Lane
Date:
PFC <lists@boutiquenumerique.com> writes:
>     1. Use CURRENT_TIMESTAMP (which is considered a constant by the planner)
> instead of now()

Oh?

            regards, tom lane