Hello!
I use FreeBSD 4.11 with PostGreSQL 7.3.8.
I got a huge database with roughly 15 million records. There is just one
table, with a time field, a few ints and a few strings.
table test
fields time (timestamp), source (string), destination (string), p1 (int),
p2 (int)
I have run VACUUM ANALYZE ;
I have created indexes on every field, but for some reason my postgre
server wants to use a seqscan, even tho i know a indexed scan would be
much faster.
create index test_time_idx on test (time) ;
create index test_source_idx on test (source) ;
create index test_destination_idx on test (destination) ;
create index test_p1_idx on test (p1) ;
create index test_p2_idx on test (p2) ;
What is really strange, is that when i query a count(*) on one of the int
fields (p1), which has a very low count, postgre uses seqscan. In another
count on the same int field (p1), i know he is giving about 2.2 million
hits, but then he suddenly uses seqscan, instead of a indexed one. Isn't
the whole idea of indexing to increase performance in large queries.. To
make sort of a phonebook for the values, to make it faster to look up what
ever you need... This just seems opposite..
Here is a EXPLAIN of my query
database=> explain select date_trunc('hour', time),count(*) as total from
test where p1=53 and time > now() - interval '24 hours' group by
date_trunc order by date_trunc ;
QUERY PLAN
------------------------------------------------------------------------------------------
Aggregate (cost=727622.61..733143.23 rows=73608 width=8)
-> Group (cost=727622.61..731303.02 rows=736083 width=8)
-> Sort (cost=727622.61..729462.81 rows=736083 width=8)
Sort Key: date_trunc('hour'::text, "time")
-> Seq Scan on test (cost=0.00..631133.12 rows=736083
width=8)
Filter: ((p1 = 53) AND ("time" > (now() - '1
day'::interval)))
(6 rows)
database=> drop INDEX test_<TABULATOR>
test_source_idx test_destination_idx test_p1_idx
test_p2_idx test_time_idx
After all this, i tried to set enable_seqscan to off and
enable_nestedloops to on. This didnt help much either. The time to run the
query is still in minutes. My results are the number of elements for each
hour, and it gives about 1000-2000 hits per hour. I have read somewhere,
about PostGreSQL, that it can easily handle 100-200million records. And
with the right tuned system, have a great performance.. I would like to
learn how :)
I also found an article on a page
(http://techdocs.postgresql.org/techdocs/pgsqladventuresep3.php):
Tip #11: Don't bother indexing columns with huge numbers of records and a
small range of values, such as BOOLEAN columns.
This tip, regretfully, is perhaps the only tip where I cannot provide a
good, real-world example from my work. So I'll give you a hypothetical
situation instead:
Imagine that you have a database table with a list of every establishment
vending ice cream in the US. A simple example might look like:
Where there were almost 1 million rows, but due to simplistic data entry,
only three possible values for type (1-SUPERMARKET, 2-BOUTIQUE, and
3-OTHER) which are relatively evenly distributed. In this hypothetical
situation, you might find (with testing using EXPLAIN) that an index on
type is ignored and the parser uses a "seq scan" (or table scan) instead.
This is because a table scan can actually be faster than an index scan in
this situation. Thus, any index on type should be dropped.
Certainly, the boolean column (active) requires no indexing as it has only
two possible values and no index will be faster than a table scan.
Then I ask, what is useful with indexing, when I can't use it on a VERY
large database? It is on my 15 million record database it takes for ever
to do seqscans over and over again... This is probably why, as i mentioned
earlier, the reason (read the quote) why he chooses a full scan and not a
indexed one...
So what do I do? :confused:
I'v used SQL for years, but never in such a big scale. Thus, not having to
learn how to deal with large number of records. Usually a maximum of 1000
records. Now, with millions, I need to learn a way to make my sucky
queries better.
Im trying to learn more about tuning my system, makeing better queries and
such. I'v found some documents on the Internet, but far from the best.
Feedback most appreciated!
Regards,
a learning PostGreSQL user