WHERE condition not being pushed down to union parts - Mailing list pgsql-performance

From John L. Clark
Subject WHERE condition not being pushed down to union parts
Date
Msg-id 4fb69e5d0904210721m78dabc2ofe4869582804752e@mail.gmail.com
Whole thread Raw
Responses Re: WHERE condition not being pushed down to union parts  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
I have a database with two tables that relate similar data, and a view
which projects and combines the data from these two tables in order to
access them both in a consistent manner.  With enough information, the
application can specifically choose to query from one table or the
other, but in the more general case the data could come from either
table, so I need to query the view.  When I join against the view (or
an equivalent subselect), however, it looks like the joining condition
is not pushed down into the individual components of the union that
defines the view.  This leads to a significant performance degradation
when using the view; I ask the list for help in resolving this
problem.  The remainder of this email digs into this problem in
detail.

(If you were interested in background on this database, it implements
a backing store for a higher level RDF database, specifically for the
RDFLib project.  I would be happy to talk more about this application,
or the corresponding database design issues, with anyone who might be
interested, in whatever forum would be appropriate.)

I begin with the poorly performing query, which follows this
paragraph.  This query joins one of the tables to the view, and using
'explain' on this query gives the query plan listed below the query.
Note that in this query plan, the join filter happens after (above)
the collection of matching rows from each of the parts of the UNION.

<query>
select * from
  relations as component_0_statements
cross join
  URI_or_literal_object as component_1_statements
where
component_0_statements.predicate = -2875059751320018987 and
component_0_statements.object = -2827607394936393903 and
component_1_statements.subject = component_0_statements.subject and
component_1_statements.predicate = -2875059751320018987
</query>

<query-plan>

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=96.31..36201.57 rows=1 width=128)
  Join Filter: (component_0_statements.subject = literalproperties.subject)
  ->  Index Scan using relations_poscindex on relations
component_0_statements  (cost=0.00..9.96 rows=1 width=40)
        Index Cond: ((predicate = (-2875059751320018987)::bigint) AND
(object = (-2827607394936393903)::bigint))
  ->  Append  (cost=96.31..36044.62 rows=11759 width=88)
        ->  Bitmap Heap Scan on literalproperties
(cost=96.31..16190.72 rows=5052 width=49)
              Recheck Cond: (literalproperties.predicate =
(-2875059751320018987)::bigint)
              ->  Bitmap Index Scan on
literalproperties_predicateindex  (cost=0.00..95.04 rows=5052 width=0)
                    Index Cond: (literalproperties.predicate =
(-2875059751320018987)::bigint)
        ->  Bitmap Heap Scan on relations  (cost=128.99..19736.31
rows=6707 width=40)
              Recheck Cond: (relations.predicate =
(-2875059751320018987)::bigint)
              ->  Bitmap Index Scan on relations_predicateindex
(cost=0.00..127.32 rows=6707 width=0)
                    Index Cond: (relations.predicate =
(-2875059751320018987)::bigint)
(13 rows)
</query-plan>

As it turns out, all of the results are in fact from the 'relations'
table, so we get the same results if we query that table instead of
the more general view.  The corresponding query follows this
paragraph, and its query plan immediately follows it.  Note that in
this query plan, the join condition is pushed down to the leaf node as
an Index Condition, which seems to be the main source of the dramatic
performance difference.

<query>
select * from
  relations as component_0_statements
cross join
  relations as component_1_statements
where
component_0_statements.predicate = -2875059751320018987 and
component_0_statements.object = -2827607394936393903 and
component_1_statements.subject = component_0_statements.subject and
component_1_statements.predicate = -2875059751320018987
</query>

<query-plan>

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..26.11 rows=1 width=80)
  ->  Index Scan using relations_poscindex on relations
component_0_statements  (cost=0.00..9.96 rows=1 width=40)
        Index Cond: ((predicate = (-2875059751320018987)::bigint) AND
(object = (-2827607394936393903)::bigint))
  ->  Index Scan using relations_subjectindex on relations
component_1_statements  (cost=0.00..16.13 rows=1 width=40)
        Index Cond: (component_1_statements.subject =
component_0_statements.subject)
        Filter: (component_1_statements.predicate =
(-2875059751320018987)::bigint)
(6 rows)
</query-plan>

My research led me to a post by Tom Lane describing the conditions in
which the WHERE conditions cannot be pushed down to the UNION parts:
<http://archives.postgresql.org/pgsql-performance/2007-11/msg00041.php>.
I refactored the UNION definition slightly to attempt to bring all
the column types into alignment, as that seemed like it might be a
blocker, but the problem persists.  It didn't look like the other
conditions would hold in my case, but I certainly could be wrong.  For
reference, the definitions of the two tables and the view are listed
below.  The 'literalproperties' tables has 8229098 rows, and the
'relations' table has 6960820 rows.

# \d literalproperties
 Table "public.literalproperties"
    Column     |         Type         | Modifiers
----------------+----------------------+-----------
 subject        | bigint               | not null
 subject_term   | character(1)         | not null
 predicate      | bigint               | not null
 predicate_term | character(1)         | not null
 object         | bigint               | not null
 context        | bigint               | not null
 context_term   | character(1)         | not null
 data_type      | bigint               |
 language       | character varying(3) |
Indexes:
   "literalproperties_poscindex" UNIQUE, btree (predicate, object,
subject, context, data_type, language)
   "literalproperties_context_termindex" btree (context_term)
   "literalproperties_contextindex" btree (context)
   "literalproperties_data_typeindex" btree (data_type)
   "literalproperties_languageindex" btree (language)
   "literalproperties_objectindex" btree (object)
   "literalproperties_predicate_termindex" btree (predicate_term)
   "literalproperties_predicateindex" btree (predicate)
   "literalproperties_subject_termindex" btree (subject_term)
   "literalproperties_subjectindex" btree (subject)

# \d relations;
 Table "public.relations"
    Column     |     Type     | Modifiers
----------------+--------------+-----------
 subject        | bigint       | not null
 subject_term   | character(1) | not null
 predicate      | bigint       | not null
 predicate_term | character(1) | not null
 object         | bigint       | not null
 object_term    | character(1) | not null
 context        | bigint       | not null
 context_term   | character(1) | not null
Indexes:
   "relations_poscindex" UNIQUE, btree (predicate, object, subject, context)
   "relations_context_termindex" btree (context_term)
   "relations_contextindex" btree (context)
   "relations_object_termindex" btree (object_term)
   "relations_objectindex" btree (object)
   "relations_predicate_termindex" btree (predicate_term)
   "relations_predicateindex" btree (predicate)
   "relations_subject_termindex" btree (subject_term)
   "relations_subjectindex" btree (subject)

# \d uri_or_literal_object
 View "public.uri_or_literal_object"
    Column     |         Type         | Modifiers
----------------+----------------------+-----------
 subject        | bigint               |
 subject_term   | character(1)         |
 predicate      | bigint               |
 predicate_term | character(1)         |
 object         | bigint               |
 object_term    | character(1)         |
 context        | bigint               |
 context_term   | character(1)         |
 data_type      | bigint               |
 language       | character varying(3) |
View definition:
 SELECT literalproperties.subject, literalproperties.subject_term,
literalproperties.predicate, literalproperties.predicate_term,
literalproperties.object, 'L'::character(1) AS object_term,
literalproperties.context, literalproperties.context_term,
literalproperties.data_type, literalproperties.language
  FROM literalproperties
UNION ALL
 SELECT relations.subject, relations.subject_term,
relations.predicate, relations.predicate_term, relations.object,
relations.object_term, relations.context, relations.context_term,
NULL::bigint AS data_type, NULL::character varying(3) AS language
  FROM relations;

Does anyone have any ideas about how I could better optimize joins
against a union (either with a view or a subquery) like this?

Thanks, and take care,

   John L. Clark

pgsql-performance by date:

Previous
From: Kenneth Marshall
Date:
Subject: Re: performance for high-volume log insertion
Next
From: Tom Lane
Date:
Subject: Re: WHERE condition not being pushed down to union parts