Re: Querying 19million records very slowly - Mailing list pgsql-performance

From Paul Ramsey
Subject Re: Querying 19million records very slowly
Date
Msg-id 42B8719A.6040809@refractions.net
In response to Querying 19million records very slowly  (Kjell Tore Fossbakk <kjelltore@gmail.com>)
Responses Re: Querying 19million records very slowly
List pgsql-performance
Some tips:

- EXPLAIN ANALYZE provides a more useful analysis of a slow query,
because it gives both the estimate and actual times/rows for each step
in the plan.
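For example, a sketch against the table from the original post (same query as the poster's EXPLAIN below, but with ANALYZE so the actual times and row counts appear alongside the estimates):

```sql
-- EXPLAIN shows only the planner's estimates; EXPLAIN ANALYZE
-- actually executes the query and reports real times and row
-- counts per plan node, so estimate/actual mismatches show up.
EXPLAIN ANALYZE
SELECT date_trunc('hour', "time"), count(*) AS total
FROM test
WHERE p1 = 53
  AND "time" > now() - interval '24 hours'
GROUP BY date_trunc('hour', "time")
ORDER BY date_trunc('hour', "time");
```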

- The documentation is right: columns with little variation are pretty
useless to index. Indexing is about "selectivity": reducing the amount
of stuff the database has to read off the disk.

- You only have two things in your WHERE clause, so that is where the
most important indexes reside. How many of your rows have p1=53? How
many of your rows have happened in the last day? If your answer is "a
lot" then the indexes are not going to help: PostgreSQL will be more
efficient scanning every tuple than it will be jumping around the index
structure for a large number of tuples.
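A quick way to check this, using the poster's table (a sketch; compare each count against the total row count to gauge selectivity):

```sql
-- How many rows does each predicate match on its own?
SELECT count(*) FROM test WHERE p1 = 53;
SELECT count(*) FROM test WHERE "time" > now() - interval '24 hours';
-- Total rows, for comparison. If a predicate matches a large
-- fraction of the table, an index on that column will not help.
SELECT count(*) FROM test;
```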

- If neither time nor p1 are particularly selective individually, but
they are selective when taken together, try a multi-key index on them both.
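A sketch of such an index, assuming the column names from the original post (the equality column first, the range column second, so the index can narrow on p1 and then scan the time range):

```sql
-- Composite index covering both WHERE-clause predicates
-- (p1 = ? AND "time" > ?) in the slow query.
CREATE INDEX test_p1_time_idx ON test (p1, "time");
```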

Paul

Kjell Tore Fossbakk wrote:

> Hello!
>
> I use FreeBSD 4.11 with PostGreSQL 7.3.8.
>
> I got a huge database with roughly 19 million records. There is just one
> table, with a time field, a few ints and a few strings.
>
>
> table test
> fields time (timestamp), source (string), destination (string), p1 (int),
> p2 (int)
>
>
> I have run VACUUM ANALYZE ;
>
> I have created indexes on every field, but for some reason my PostgreSQL
> server wants to use a seqscan, even though I know an indexed scan would be
> much faster.
>
>
> create index test_time_idx on test (time) ;
> create index test_source_idx on test (source) ;
> create index test_destination_idx on test (destination) ;
> create index test_p1_idx on test (p1) ;
> create index test_p2_idx on test (p2) ;
>
>
>
> What is really strange is that when I query a count(*) on one of the int
> fields (p1) for a value with a very low count, PostgreSQL uses a seqscan.
> In another count on the same int field (p1), which I know returns about
> 2.2 million hits, it again uses a seqscan instead of an indexed one. Isn't
> the whole idea of indexing to increase performance on large queries? To
> make a sort of phonebook for the values, so it is faster to look up
> whatever you need? This just seems the opposite.
>
> Here is a EXPLAIN of my query
>
> database=> explain select date_trunc('hour', time),count(*) as total from
> test where p1=53 and time > now() - interval '24 hours' group by
> date_trunc order by date_trunc ;
>                                         QUERY PLAN
> ------------------------------------------------------------------------------------------
> Aggregate  (cost=727622.61..733143.23 rows=73608 width=8)
>    ->  Group  (cost=727622.61..731303.02 rows=736083 width=8)
>          ->  Sort  (cost=727622.61..729462.81 rows=736083 width=8)
>                Sort Key: date_trunc('hour'::text, "time")
>                ->  Seq Scan on test  (cost=0.00..631133.12 rows=736083
> width=8)
>                      Filter: ((p1 = 53) AND ("time" > (now() - '1
> day'::interval)))
> (6 rows)
>
>
>
>
> database=> drop INDEX test_<TABULATOR>
> test_source_idx         test_destination_idx        test_p1_idx
> test_p2_idx       test_time_idx
>
>
> After all this, I tried to set enable_seqscan to off and
> enable_nestloop to on. This didn't help much either. The time to run the
> query is still in minutes. My results are the number of elements for each
> hour, and it gives about 1000-2000 hits per hour. I have read somewhere
> that PostgreSQL can easily handle 100-200 million records and, with the
> right tuning, have great performance. I would like to learn how :)
>
> I also found an article on a page
> ( http://techdocs.postgresql.org/techdocs/pgsqladventuresep3.php):
> Tip #11:  Don't bother indexing columns with huge numbers of records and a
> small range of values, such as BOOLEAN columns.
>
> This tip, regretfully, is perhaps the only tip where I cannot provide a
> good, real-world example from my work.  So I'll give you a hypothetical
> situation instead:
>
> Imagine that you have a database table with a list of every establishment
> vending ice cream in the US.  A simple example might look like:
>
> Where there were almost 1 million rows, but due to simplistic data entry,
> only three possible values for type (1-SUPERMARKET, 2-BOUTIQUE, and
> 3-OTHER) which are relatively evenly distributed.  In this hypothetical
> situation, you might find (with testing using EXPLAIN) that an index on
> type is ignored and the parser uses a "seq scan" (or table scan) instead.
> This is because a table scan can actually be faster than an index scan in
> this situation.  Thus, any index on type should be dropped.
>
> Certainly, the boolean column (active) requires no indexing as it has only
> two possible values and no index will be faster than a table scan.
>
>
> Then I ask: what is the use of indexing, if I can't use it on a VERY
> large database? On my 15 million record database it takes forever to do
> seqscans over and over again... This is probably why, as I mentioned
> earlier (read the quote), it chooses a full scan and not an indexed
> one...
>
> So what do I do? :confused:
>
> I've used SQL for years, but never at such a big scale, and thus never had
> to learn how to deal with large numbers of records. Usually a maximum of
> 1000 records. Now, with millions, I need to learn a way to make my sucky
> queries better.
>
> I'm trying to learn more about tuning my system, making better queries and
> such. I've found some documents on the Internet, but they are far from the
> best.
>
> Feedback most appreciated!
>
> Regards,
> a learning PostGreSQL user
>
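The enable_seqscan experiment the poster describes is still a useful diagnostic; a minimal sketch (the setting is session-local, and forcing the planner this way is for testing only, not for production):

```sql
-- Disable sequential scans for this session, then re-run EXPLAIN
-- to see which index plan the planner falls back to and its cost.
SET enable_seqscan TO off;
EXPLAIN SELECT count(*) FROM test WHERE p1 = 53;
-- Restore the default afterwards.
SET enable_seqscan TO on;
```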

