Re: Query performance - Mailing list pgsql-performance

From Marc Mamin
Subject Re: Query performance
Date
Msg-id B6F6FD62F2624C4C9916AC0175D56D8828B452AA@jenmbs01.ad.intershop.net
Whole thread Raw
In response to Query performance  (Joe Van Dyk <joe@tanga.com>)
List pgsql-performance

>I have an events table that records page views and purchases (type = 'viewed' or type='purchased'). I have a query that figures out "people who bought/viewed this also bought/viewed that".
>
>It worked fine, taking about 0.1 seconds to complete, until a few hours ago when it started taking hours to complete. Vacuum/analyze didn't help.  Turned out there was one session_id that had 400k rows in the system. Deleting that made the query performant again.
>
>Is there anything I can do to make the query work better in cases like that? Missing index, or better query?
>
>This is on 9.3.5.
>
>The below is reproduced at the following URL if it's not formatted correctly in the email. https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt

Hello,

here  are 2 variations that should be somewhat faster.

 It seems you may have duplicate (site_id,session_id,product_id)
 which would false the result. In that case you'll need some more logic in the query.
 
 select
    '82503' as product_id,
    e2.site_id,
    e2.product_id,
    count(nullif(e2.type='viewed', false)) view_count,
    count(nullif(e2.type='purchased', false)) purchase_count
  from events e1
  join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
  where
    e1.product_id = '82503' and
    e2.product_id != '82503'
 group by  e2.product_id, e2.site_id;
 
 
 OR:
 
 WITH SALL as(
  select
     e2.site_id,
     e2.product_id,
     count(nullif(e2.type='viewed', false)) view_count,
     count(nullif(e2.type='purchased', false)) purchase_count
   from events e1
   join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
   where
     e1.product_id = '82503'
  group by  e2.product_id, e2.site_id
 )
 SELECT
    '82503' as product_id_1,
    site_id,
    product_id,
    view_count,
    purchase_count
 FROM SALL
 WHERE product_id != '82503';
    

 regards,
 Marc Mamin
 


>explain  select
>   e1.product_id,
>   e2.site_id,
>   e2.product_id,
>   count(nullif(e2.type='viewed', false)) view_count,
>   count(nullif(e2.type='purchased', false)) purchase_count
> from events e1
> join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
> where
>   e1.product_id = '82503' and
>   e1.product_id != e2.product_id
> group by e1.product_id, e2.product_id, e2.site_id;
>                                                         QUERY PLAN
>----------------------------------------------------------------------------------------------------------------------------
> GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
>   ->  Sort  (cost=828395.67..840117.89 rows=4688885 width=19)
>         Sort Key: e1.product_id, e2.product_id, e2.site_id
>         ->  Nested Loop  (cost=11.85..20371.14 rows=4688885 width=19)
>               ->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 rows=369 width=49)
>                     Recheck Cond: (product_id = '82503'::citext)
>                     ->  Bitmap Index Scan on events_product_id_site_id_idx  (cost=0.00..11.20 rows=369 width=0)
>                           Index Cond: (product_id = '82503'::citext)
>               ->  Index Scan using events_session_id_type_product_id_idx on events e2  (cost=0.56..51.28 rows=12 width=51)
>                     Index Cond: ((session_id = e1.session_id) AND (type = e1.type))
>                     Filter: (e1.product_id <> product_id)
>(11 rows)
>
>recommender_production=> \d events
>                                    Table "public.events"
>   Column    |           Type           |                      Modifiers
>-------------+--------------------------+-----------------------------------------------------
> id          | bigint                   | not null default nextval('events_id_seq'::regclass)
> user_id     | citext                   |
> session_id  | citext                   | not null
> product_id  | citext                   | not null
> site_id     | citext                   | not null
> type        | text                     | not null
> happened_at | timestamp with time zone | not null
> created_at  | timestamp with time zone | not null
>Indexes:
>    "events_pkey" PRIMARY KEY, btree (id)
>    "events_product_id_site_id_idx" btree (product_id, site_id)
>    "events_session_id_type_product_id_idx" btree (session_id, type, product_id)
>Check constraints:
>    "events_session_id_check" CHECK (length(session_id::text) < 255)
>    "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 'viewed'::text]))
>    "events_user_id_check" CHECK (length(user_id::text) < 255)
>
>
>
>

pgsql-performance by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: Query performance
Next
From: "Christian Roche"
Date:
Subject: Why is PostgreSQL not using my index?