Re: Slow query with a lot of data - Mailing list pgsql-performance

From: Moritz Onken
Subject: Re: Slow query with a lot of data
Msg-id: 3663375B-7AC0-43F6-B6EC-D3EE572F3D34@houseofdesign.de
In response to: Re: Slow query with a lot of data (Matthew Wakeling <matthew@flymine.org>)
List: pgsql-performance

On 18.08.2008, at 18:05, Matthew Wakeling wrote:

> On Mon, 18 Aug 2008, Moritz Onken wrote:
>> Running the query for more than one user is indeed not much slower.
>> That's what I need. I'm clustering the results table on domain
>> right now. But why is this better than clustering it on "user"?
>
> The reason is the way that the merge join algorithm works. What it
> does is take two tables and sort them both by the join fields.
> Then it can stream through both tables producing results as it goes.
> It's the best join algorithm, but it does require both tables to be
> sorted by the same thing, which is domain in this case. The
> aggregation on user happens after the join has been done, and the
> hash aggregate can accept the users in random order.
>
> If you look at your last EXPLAIN, you'll see that it has to sort the
> result table on domain, although it can read domain_categories in
> domain order thanks to the clustered index.
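
For reference, this is roughly how both tables ended up clustered on
the join column (a sketch of the commands; only domain_categories_domain
appears in the plan below, the index name on result is a guess):

-- index name on result is a guess; domain_categories_domain is from the plan
create index result_domain_idx on result ("domain");
cluster result using result_domain_idx;
cluster domain_categories using domain_categories_domain;
analyze result;
analyze domain_categories;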

explain select
   a."user", b.category, sum(1.0/b.cat_count)::float
   from result a, domain_categories b
   where a."domain" = b."domain"
   group by a."user", b.category;

"GroupAggregate  (cost=21400443313.69..22050401897.13 rows=35049240
width=12)"
"  ->  Sort  (cost=21400443313.69..21562757713.35 rows=64925759864
width=12)"
"        Sort Key: a."user", b.category"
"        ->  Merge Join  (cost=4000210.40..863834009.08
rows=64925759864 width=12)"
"              Merge Cond: (b.domain = a.domain)"
"              ->  Index Scan using domain_categories_domain on
domain_categories b  (cost=0.00..391453.79 rows=12105014 width=12)"
"              ->  Materialize  (cost=3999931.73..4253766.93
rows=20306816 width=8)"
"                    ->  Sort  (cost=3999931.73..4050698.77
rows=20306816 width=8)"
"                          Sort Key: a.domain"
"                          ->  Seq Scan on result a
(cost=0.00..424609.16 rows=20306816 width=8)"

Both result and domain_categories are clustered on domain and analyzed.
Running this query for 280 users (adding "user" IN ([280 ids]) to the
WHERE clause) took 50 minutes; 78,000 rows were returned and stored in
a table. Is that reasonable?
Why is it still sorting on domain? I thought clustering would let the
planner skip that sort.
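
If I understand it correctly, a plain seq scan gives the planner no
ordering guarantee, so it sorts result even though the table is
physically clustered; only an index scan is known to return rows in
domain order. As a test (just a sketch, not a fix), forcing the planner
away from the seq scan should show whether the sort disappears:

set enable_seqscan = off;  -- test only: see if an index scan on result avoids the sort
explain select
   a."user", b.category, sum(1.0/b.cat_count)::float
   from result a, domain_categories b
   where a."domain" = b."domain"
   group by a."user", b.category;
reset enable_seqscan;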

moritz
