Re: Querying 19million records very slowly - Mailing list pgsql-performance
From: Paul Ramsey
Subject: Re: Querying 19million records very slowly
Date:
Msg-id: 42B8719A.6040809@refractions.net
In response to: Querying 19million records very slowly (Kjell Tore Fossbakk <kjelltore@gmail.com>)
Responses: Re: Querying 19million records very slowly
List: pgsql-performance
Some tips:

- EXPLAIN ANALYZE provides a more useful analysis of a slow query, because
  it gives both the estimated and actual times/rows for each step in the
  plan.

- The documentation is right: rows with little variation are pretty useless
  to index. Indexing is about "selectivity", reducing the amount of stuff
  the database has to read off the disk.

- You only have two things in your WHERE clause, so that is where the most
  important indexes reside. How many of your rows have p1 = 53? How many of
  your rows have happened in the last day? If your answer is "a lot", then
  the indexes are not going to help: PostgreSQL will be more efficient
  scanning every tuple than it will be jumping around the index structure
  for a large number of tuples.

- If neither time nor p1 is particularly selective individually, but they
  are selective taken together, try a multi-key index on them both, as
  sketched below.

Paul
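Concretely, the last two tips could be tried with something along these
lines (just a sketch: the index name is made up, the table and column names
are taken from the schema in the quoted message below, and on a 19 million
row table the index build will take a while):

  -- Multi-key index: the equality column (p1) first, then the range column.
  -- The name test_p1_time_idx is only illustrative.
  CREATE INDEX test_p1_time_idx ON test (p1, "time");

  -- Refresh planner statistics so the new index is costed sensibly.
  ANALYZE test;

  -- EXPLAIN ANALYZE actually runs the query and reports estimated vs.
  -- actual row counts and timings for every node in the plan.
  EXPLAIN ANALYZE
  SELECT date_trunc('hour', "time"), count(*) AS total
  FROM test
  WHERE p1 = 53
    AND "time" > now() - interval '24 hours'
  GROUP BY date_trunc('hour', "time")
  ORDER BY date_trunc('hour', "time");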
Kjell Tore Fossbakk wrote:

> Hello!
>
> I use FreeBSD 4.11 with PostgreSQL 7.3.8.
>
> I have a huge database with roughly 19 million records. There is just one
> table, with a time field, a few ints and a few strings.
>
> table test
> fields: time (timestamp), source (string), destination (string), p1 (int),
> p2 (int)
>
> I have run VACUUM ANALYZE;
>
> I have created indexes on every field, but for some reason my PostgreSQL
> server wants to use a seqscan, even though I know an indexed scan would be
> much faster.
>
> create index test_time_idx on test (time) ;
> create index test_source_idx on test (source) ;
> create index test_destination_idx on test (destination) ;
> create index test_p1_idx on test (p1) ;
> create index test_p2_idx on test (p2) ;
>
> What is really strange is that when I query a count(*) on one of the int
> fields (p1) which has a very low count, PostgreSQL uses a seqscan. In
> another count on the same int field (p1), I know it returns about 2.2
> million hits, but then it suddenly uses a seqscan instead of an indexed
> one. Isn't the whole idea of indexing to increase performance in large
> queries, to make a sort of phonebook for the values so it is faster to
> look up whatever you need? This just seems the opposite.
>
> Here is an EXPLAIN of my query:
>
> database=> explain select date_trunc('hour', time),count(*) as total from
> test where p1=53 and time > now() - interval '24 hours' group by
> date_trunc order by date_trunc ;
>                                         QUERY PLAN
> ------------------------------------------------------------------------------------------
>  Aggregate  (cost=727622.61..733143.23 rows=73608 width=8)
>    ->  Group  (cost=727622.61..731303.02 rows=736083 width=8)
>          ->  Sort  (cost=727622.61..729462.81 rows=736083 width=8)
>                Sort Key: date_trunc('hour'::text, "time")
>                ->  Seq Scan on test  (cost=0.00..631133.12 rows=736083 width=8)
>                      Filter: ((p1 = 53) AND ("time" > (now() - '1 day'::interval)))
> (6 rows)
>
> database=> drop INDEX test_<TABULATOR>
> test_source_idx       test_destination_idx  test_p1_idx
> test_p2_idx           test_time_idx
>
> After all this, I tried to set enable_seqscan to off and enable_nestloop
> to on. This didn't help much either. The time to run the query is still in
> minutes. My results are the number of elements for each hour, and it gives
> about 1000-2000 hits per hour. I have read somewhere that PostgreSQL can
> easily handle 100-200 million records, and with a properly tuned system
> have great performance. I would like to learn how. :)
>
> I also found an article on a page
> (http://techdocs.postgresql.org/techdocs/pgsqladventuresep3.php):
>
> Tip #11: Don't bother indexing columns with huge numbers of records and a
> small range of values, such as BOOLEAN columns.
>
> This tip, regretfully, is perhaps the only tip where I cannot provide a
> good, real-world example from my work. So I'll give you a hypothetical
> situation instead:
>
> Imagine that you have a database table with a list of every establishment
> vending ice cream in the US. A simple example might look like:
>
> Where there were almost 1 million rows, but due to simplistic data entry,
> only three possible values for type (1-SUPERMARKET, 2-BOUTIQUE, and
> 3-OTHER) which are relatively evenly distributed. In this hypothetical
> situation, you might find (with testing using EXPLAIN) that an index on
> type is ignored and the parser uses a "seq scan" (or table scan) instead.
> This is because a table scan can actually be faster than an index scan in
> this situation. Thus, any index on type should be dropped.
>
> Certainly, the boolean column (active) requires no indexing as it has only
> two possible values and no index will be faster than a table scan.
>
> Then I ask: what is the use of indexing, when I can't use it on a VERY
> large database? On my 15 million record database it takes forever to do
> seqscans over and over again... This is probably, as I mentioned earlier,
> the reason (read the quote) why it chooses a full scan and not an indexed
> one...
>
> So what do I do? :confused:
>
> I've used SQL for years, but never at such a big scale, and thus never had
> to learn how to deal with large numbers of records. Usually a maximum of
> 1000 records. Now, with millions, I need to learn a way to make my sucky
> queries better.
>
> I'm trying to learn more about tuning my system, making better queries and
> such. I've found some documents on the Internet, but they are far from the
> best.
>
> Feedback most appreciated!
>
> Regards,
> a learning PostgreSQL user
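As a quick way to answer the selectivity questions in the tips above, and
to see whether an index plan would really beat the seqscan on this data, an
experiment along the following lines might help (again only a sketch:
enable_seqscan affects just the current session, and the bare counts will
themselves be slow on a table this size):

  -- How selective is each predicate on its own?  If either count is a
  -- large fraction of the 19 million rows, an index on that column alone
  -- is unlikely to help.
  SELECT count(*) FROM test WHERE p1 = 53;
  SELECT count(*) FROM test WHERE "time" > now() - interval '24 hours';

  -- Discourage the sequential scan for this session only, then compare the
  -- EXPLAIN ANALYZE timings against the seqscan plan.
  SET enable_seqscan = off;
  EXPLAIN ANALYZE
  SELECT date_trunc('hour', "time"), count(*) AS total
  FROM test
  WHERE p1 = 53
    AND "time" > now() - interval '24 hours'
  GROUP BY date_trunc('hour', "time")
  ORDER BY date_trunc('hour', "time");
  SET enable_seqscan = on;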