Re: Query performance - Mailing list pgsql-performance

From Tomas Vondra
Subject Re: Query performance
Date
Msg-id 54C52080.90708@2ndquadrant.com
Whole thread Raw
In response to Re: Query performance  (Joe Van Dyk <joe@tanga.com>)
List pgsql-performance
Hi,

On 25.1.2015 07:38, Joe Van Dyk wrote:
>
> Here's one that's not quite as well: http://explain.depesz.com/s/SgT

As Pavel already pointed out, the first problem is this part of the plan:

Seq Scan on events e2 (cost=0.00..120,179.60 rows=4,450,241 width=51)
(actual time=0.014..33,773.370 rows=4,450,865 loops=1)
    Filter: (product_id <> '81716'::citext)

Consuming ~33 seconds of the runtime. If you can make this faster
somehow (e.g. by getting rid of the citext cast), that'd be nice.

Another issue is that the hashjoin is batched:

   Buckets: 65536 Batches: 8 Memory Usage: 46085kB

The hash preparation takes ~40 seconds, so maybe try to give it a bit
more memory - I assume you have work_mem=64MB, so try doubling that
(ISTM 512MB should work with a single batch). Maybe this won't really
improve the performance, though. It still has to process ~4.5M rows.

Increasing the work mem could also result in switching to hash
aggregate, making the sort (~30 seconds) unnecessary.

Anyway, ISTM this works as expected, i.e.

(a) with rare product_id values the queries are fast
(b) with common product_id values the queries are slow

That's expected, because (b) needs to process much more data. I don't
think you can magically make it run as fast as (a). The best solution
might be to keep a pre-aggregated results - I don't think you really
need exact answers when recommending "similar" products.

I also wonder if you really need to join the tables? I mean, what if you
do something like this:

CREATE TABLE events_aggregated AS SELECT
   site_id,
   array_agg(product_id) AS product_ids,
   count(nullif(e2.type='viewed', false)) view_count,
   count(nullif(e2.type='purchased', false)) purchase_count
FROM events
GROUP BY 1;

and then using intarray with GIN indexes to query this table?
Something like this:

  CREATE products_agg_idx ON aggregated
                       USING GIN (product_ids gin__int_ops);

  SELECT * FROM events_aggregated WHERE product_ids @> ARRAY['82503'];

regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-performance by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: Query performance
Next
From: Marc Mamin
Date:
Subject: Re: Query performance