Efficient "filter query" with positive and/or negative conditions - Mailing list pgsql-general

From Anders Steinlein
Subject Efficient "filter query" with positive and/or negative conditions
Date
Msg-id 7DCE73D5-B4D6-4AAB-86F7-900E73254B4C@steinlein.no
Whole thread Raw
List pgsql-general
We have a database with contacts within lists, which can have any
number of tags associated with them. I would like to segment the
contacts based on any number of tag-based conditions, all of which
must be fulfilled (i.e. an ALL rule/filter). For instance, I want to
extract all contacts which have the tags "customer" AND "newsletter"
but NOT the "invited" tag. However, segments need not contain
"positive" predicates, so a segment defined as only NOT "invited" tag
should include ALL contacts except those with the "invited" tag. I
have found a working query using CTE, but it's way too slow when tens
of thousands of contacts are involved. Suggestions for an alternative
query and/or optimizations to this one?

Server:
-------
PostgreSQL 8.4.2 on a shared server running Apache/PHP/MySQL/PostgreSQL
shared_buffers: 128MB
work_mem: 8MB
effective_cache_size: 256MB

Current query:
--------------
WITH segments_contacts_tags AS (
     SELECT segmentid, tagname, tagtype, email
         FROM segments_tags st
             LEFT JOIN contacts_tags ct USING (tagname)
     WHERE segmentid = 93
), positive_segment_matches AS (
     SELECT segmentid, email, COUNT(email) AS tags
         FROM segments_contacts_tags
         WHERE tagtype = 1
         GROUP BY 1, 2
), positive_predicates_count AS (
     SELECT segmentid, COUNT(*) AS count
         FROM segments_tags
         WHERE tagtype = 1
         GROUP BY 1
)
SELECT s.segmentid, cl.email
     FROM segments s
         INNER JOIN contacts_lists cl USING (lid)
         LEFT JOIN positive_segment_matches psm USING (segmentid, email)
         LEFT JOIN positive_predicates_count ppc USING (segmentid)
     WHERE segmentid = 93
         AND ((psm.email IS NULL AND ppc.count IS NULL) OR psm.tags =
ppc.count)
EXCEPT
SELECT segmentid, email
     FROM segments_contacts_tags st
     WHERE st.tagtype = 0;

Explain analyze:
----------------
  HashSetOp Except  (cost=1817.12..5822.48 rows=200 width=36) (actual
time=5827.885..6051.771 rows=64672 loops=1)
    CTE segments_contacts_tags
      ->  Nested Loop Left Join  (cost=0.00..1323.86 rows=21819
width=61) (actual time=0.143..696.589 rows=64672 loops=1)
            ->  Seq Scan on segments_tags st  (cost=0.00..1.05 rows=1
width=29) (actual time=0.014..0.025 rows=1 loops=1)
                  Filter: (segmentid = 93)
            ->  Index Scan using contacts_tags_tagname on
contacts_tags ct  (cost=0.00..1050.07 rows=21819 width=49) (actual
time=0.110..265.453 rows=64672 loops=1)
                  Index Cond: (st.tagname = ct.tagname)
    CTE positive_segment_matches
      ->  HashAggregate  (cost=491.75..491.88 rows=11 width=36)
(actual time=1733.143..1995.489 rows=64672 loops=1)
            ->  CTE Scan on segments_contacts_tags  (cost=0.00..490.93
rows=109 width=36) (actual time=0.161..1245.836 rows=64672 loops=1)
                  Filter: (tagtype = 1)
    CTE positive_predicates_count
      ->  HashAggregate  (cost=1.07..1.12 rows=4 width=4) (actual
time=0.069..0.088 rows=4 loops=1)
            ->  Seq Scan on segments_tags  (cost=0.00..1.05 rows=4
width=4) (actual time=0.015..0.032 rows=4 loops=1)
                  Filter: (tagtype = 1)
    ->  Append  (cost=0.26..3681.56 rows=64811 width=36) (actual
time=2734.790..5363.556 rows=64672 loops=1)
          ->  Subquery Scan "*SELECT* 1"  (cost=0.26..3189.55
rows=64702 width=36) (actual time=2734.782..4923.849 rows=64672 loops=1)
                ->  Hash Left Join  (cost=0.26..2542.53 rows=64702
width=36) (actual time=2734.773..4483.547 rows=64672 loops=1)
                      Hash Cond: ((s.segmentid = psm.segmentid) AND
((cl.email)::text = (psm.email)::text))
                      Filter: (((psm.email IS NULL) AND (ppc.count IS
NULL)) OR (psm.tags = ppc.count))
                      ->  Nested Loop  (cost=0.00..2052.95 rows=64702
width=44) (actual time=0.169..687.204 rows=64702 loops=1)
                            Join Filter: (s.lid = cl.lid)
                            ->  Nested Loop Left Join
(cost=0.00..1.15 rows=1 width=16) (actual time=0.130..0.189 rows=1
loops=1)
                                  Join Filter: (s.segmentid =
ppc.segmentid)
                                  ->  Seq Scan on segments s
(cost=0.00..1.05 rows=1 width=8) (actual time=0.025..0.032 rows=1
loops=1)
                                        Filter: (segmentid = 93)
                                  ->  CTE Scan on
positive_predicates_count ppc  (cost=0.00..0.09 rows=1 width=12)
(actual time=0.086..0.129 rows=1 loops=1)
                                        Filter: (ppc.segmentid = 93)
                            ->  Seq Scan on contacts_lists cl
(cost=0.00..1243.02 rows=64702 width=36) (actual time=0.014..231.331
rows=64702 loops=1)
                      ->  Hash  (cost=0.25..0.25 rows=1 width=44)
(actual time=2734.535..2734.535 rows=64672 loops=1)
                            ->  CTE Scan on positive_segment_matches
psm  (cost=0.00..0.25 rows=1 width=44) (actual time=1733.169..2481.359
rows=64672 loops=1)
                                  Filter: (segmentid = 93)
          ->  Subquery Scan "*SELECT* 2"  (cost=0.00..492.02 rows=109
width=36) (actual time=23.767..23.767 rows=0 loops=1)
                ->  CTE Scan on segments_contacts_tags st
(cost=0.00..490.93 rows=109 width=36) (actual time=23.758..23.758
rows=0 loops=1)
                      Filter: (tagtype = 0)
  Total runtime: 6278.849 ms

Involved tables:
----------------

Table "public.contacts_lists"
      Column     |            Type             |          Modifiers
----------------+-----------------------------
+------------------------------
  email          | email                       | not null
  lid            | integer                     | not null
  lstatus        | character(1)                | not null default
'a'::bpchar
  ladded         | timestamp without time zone | not null default now()
  lstatuschanged | timestamp without time zone | not null default now()
Indexes:
     "contacts_lists_pkey" PRIMARY KEY, btree (email, lid)
     "contacts_lists_statchanged" btree (lstatuschanged)
     "contacts_lists_status" btree (lstatus)
Check constraints:
     "status_check" CHECK (lstatus = 'a'::bpchar OR lstatus =
'b'::bpchar OR lstatus = 'u'::bpchar)
Foreign-key constraints:
     "contacts_lists_email_fkey" FOREIGN KEY (email) REFERENCES
contacts(email) ON UPDATE CASCADE ON DELETE CASCADE
     "contacts_lists_lid_fkey" FOREIGN KEY (lid) REFERENCES lists(lid)
ON UPDATE RESTRICT ON DELETE CASCADE

Table "public.contacts_tags"
  Column  | Type  | Modifiers
---------+-------+-----------
  email   | email | not null
  tagname | text  | not null
Indexes:
     "contacts_tags_pkey" PRIMARY KEY, btree (email, tagname)
     "contacts_tags_tagname" btree (tagname)
Foreign-key constraints:
     "contacts_tags_email_fkey" FOREIGN KEY (email) REFERENCES
contacts(email) ON UPDATE CASCADE ON DELETE CASCADE
     "contacts_tags_tagname_fkey" FOREIGN KEY (tagname) REFERENCES
tags(tagname) ON UPDATE CASCADE ON DELETE CASCADE

Table "public.tags"
   Column   |            Type             |       Modifiers
-----------+-----------------------------+------------------------
  tagname   | text                        | not null
  createdat | timestamp without time zone | not null default now()
Indexes:
     "tags_pkey" PRIMARY KEY, btree (tagname)
Referenced by:
     TABLE "contacts_tags" CONSTRAINT "contacts_tags_tagname_fkey"
FOREIGN KEY (tagname) REFERENCES tags(tagname) ON UPDATE CASCADE ON
DELETE CASCADE
     TABLE "segments_tags" CONSTRAINT "segments_tags_tagname_fkey"
FOREIGN KEY (tagname) REFERENCES tags(tagname) ON UPDATE CASCADE ON
DELETE CASCADE

Table "public.segments"
    Column    |            Type             |
Modifiers
-------------+-----------------------------
+--------------------------------------------------------------
  segmentid   | integer                     | not null default
nextval('segments_segmentid_seq'::regclass)
  lid         | integer                     | not null
  segmentname | text                        | not null
  createdat   | timestamp without time zone | not null default now()
Indexes:
     "segments_pkey" PRIMARY KEY, btree (segmentid)
Foreign-key constraints:
     "segments_lid_fkey" FOREIGN KEY (lid) REFERENCES lists(lid) ON
UPDATE RESTRICT ON DELETE CASCADE
Referenced by:
     TABLE "segments_tags" CONSTRAINT "segments_tags_segmentid_fkey"
FOREIGN KEY (segmentid) REFERENCES segments(segmentid) ON UPDATE
RESTRICT ON DELETE CASCADE

Table "public.segments_tags"
   Column   |  Type   | Modifiers
-----------+---------+-----------
  segmentid | integer | not null
  tagname   | text    | not null
  tagtype   | integer | not null
Indexes:
     "segments_tags_pkey" PRIMARY KEY, btree (segmentid, tagname)
Foreign-key constraints:
     "segments_tags_segmentid_fkey" FOREIGN KEY (segmentid) REFERENCES
segments(segmentid) ON UPDATE RESTRICT ON DELETE CASCADE
     "segments_tags_tagname_fkey" FOREIGN KEY (tagname) REFERENCES
tags(tagname) ON UPDATE CASCADE ON DELETE CASCADE

-- anders

pgsql-general by date:

Previous
From: John R Pierce
Date:
Subject: Re: Installing Postgresql on Windows XP embedded
Next
From: Bruce Momjian
Date:
Subject: Re: hardware for a server