Q on views and performance - Mailing list pgsql-performance

From Kynn Jones
Subject Q on views and performance
Date
Msg-id c2350ba40802221249h37913d0cyaa6387a93c55fe69@mail.gmail.com
Whole thread Raw
Responses Re: Q on views and performance
Re: Q on views and performance
List pgsql-performance
Hi.  I'm trying to optimize the performance of a database whose main purpose is to support two (rather similar) kinds of queries.  The first kind, which is expected to be the most common (I estimate it will account for about 90% of all the queries performed on this DB), has the following general structure:

(Q1)   SELECT a1.word, a2.word
         FROM T a1 JOIN T a2 USING ( zipk )
        WHERE a1.type = <int1>
          AND a2.type = <int2>;

...where <int1> and <int2> stand for some two integers.  In English, this query essentially executes an inner join between two "virtual subtables" of table T, which are defined by the value of the type column.  For brevity, I will refer to these (virtual) subtables as T<int1> and T<int2>.  (I should point out that T holds about 2 million records, spread roughly evenly over about 100 values of the type column.  So each of these virtual subtables has about 20K records.  Also, for practical purposes T may be regarded as an immutable, read-only table, since it gets re-built from scratch about once a month.  And, FWIW, all the columns mentioned in this post have a NOT NULL constraint.)

The second form is similar to the first, except that now the join is taken between T and T<int2>:

(Q2)   SELECT a1.word, a2.word
         FROM T a1 JOIN T a2 USING ( zipk )
        WHERE a2.type = <int2>;

(Both the forms above are somewhat oversimplified relative to the actual situation; in our actual application, the joins are actually left outer ones, and each query also involves an additional inner join with another table, S.  For the sake of completeness, I give the "real-world" versions of these queries at the end of this post, but I think that for the purpose of my question, the additional complications they entail can be neglected.)

One way to speed (Q1) would be to break T into its subtables, i.e. to create T1, T2, T3, ... , T100 as bona fide tables.  Then the query would become a simple join without the two condition of the original's WHERE clause, which I figure should make it noticeably faster.

But since the second kind of query (Q2) requires T, we can't get rid of this table, so all the data would need to be stored twice, once in T and once in some T<int*>.

In trying to come up with a way around this duplication, it occurred to me that instead of creating tables T1, T2, etc., I could create the analogous views V1, V2, etc.  (e.g. CREATE VIEW V1 AS SELECT * FROM T WHERE type = 1).  With this design, the two queries above would become

(Q1*)  SELECT V<int1>.word, V<int2>.word
         FROM V<int1> JOIN V<int2> USING ( zipk );

(Q2*)  SELECT T.word, V<int2>.word
         FROM T JOIN V<int2> USING ( zipk );

Of course, I expect that using views V<int1> and V<int2>... would result in a loss in performance relative to a version that used bona fide tables T<int1> and T<int2>.  My question is, how can I minimize this performance loss?

More specifically, how can I go about building table T and the views V<int?>'s to maximize the performance of (Q1)?  For example, I'm thinking that if T had an additional id column and were built in such a way that all the records belonging to each V<int?> were physically contiguous, and (say) had contiguous values in the id column, then I could define each view like this

  CREATE VIEW V<int1> AS SELECT * FROM T
   WHERE <start_int1> <= id AND id < <start_int1+1>;

So my question is, what design would make querying V1, V2, V3 ... as fast as possible?  Is it possible to approach the performance of the design that uses bona fide tables T1, T2, T3, ... instead of views V1, V2, V3 ...?

Thank you very much for reading this long post, and many thanks in advance for your comments!

Kynn


P.S.  Here are the actual form of the queries.  They now include an initial join with table S, and the join with T<int2> (or V<int2>) is a left outer join.  Interestingly, even though the queries below that use views (i.e. Q1*** and Q2***) are not much more complex-looking than before, the other two (Q1** and Q2**) are.  I don't know if this is because my ineptitude with SQL, but I am not able to render (Q1**) and (Q2**) without resorting to the subquery sq.

(Q1**)  SELECT a1.word, sq.word FROM
               S      JOIN T a1 USING ( word )
                 LEFT JOIN ( SELECT * FROM T a2
                             WHERE a2.type = <int2> ) sq USING ( zipk )
         WHERE a1.type = <int1>;

(Q2**)  SELECT a1.word, sq.word FROM
               S      JOIN T a1 USING ( word )
                 LEFT JOIN ( SELECT * FROM T a2
                             WHERE a2.type = <int2> ) sq USING ( zipk )

       ---------------------------------------------

(Q1***) SELECT V<int1>.word, V<int2>.word FROM
               S      JOIN V<int1> USING ( word )
                 LEFT JOIN V<int2> USING ( zipk );

(Q2***) SELECT T.word, V<int2>.word
          FROM S      JOIN T       USING ( word )
                 LEFT JOIN V<int2> USING ( zipk );


pgsql-performance by date:

Previous
From: Susan Russo
Date:
Subject: loading same instance of dump to two different servers simultaneously?
Next
From: "Dean Gibson (DB Administrator)"
Date:
Subject: Re: Q on views and performance