Q on views and performance - Mailing list pgsql-performance
From | Kynn Jones |
---|---|
Subject | Q on views and performance |
Date | |
Msg-id | c2350ba40802221249h37913d0cyaa6387a93c55fe69@mail.gmail.com Whole thread Raw |
Responses |
Re: Q on views and performance
Re: Q on views and performance |
List | pgsql-performance |
Hi. I'm trying to optimize the performance of a database whose main purpose is to support two (rather similar) kinds of queries. The first kind, which is expected to be the most common (I estimate it will account for about 90% of all the queries performed on this DB), has the following general structure:
(Q1) SELECT a1.word, a2.word
FROM T a1 JOIN T a2 USING ( zipk )
WHERE a1.type = <int1>
AND a2.type = <int2>;
...where <int1> and <int2> stand for some two integers. In English, this query essentially executes an inner join between two "virtual subtables" of table T, which are defined by the value of the type column. For brevity, I will refer to these (virtual) subtables as T<int1> and T<int2>. (I should point out that T holds about 2 million records, spread roughly evenly over about 100 values of the type column. So each of these virtual subtables has about 20K records. Also, for practical purposes T may be regarded as an immutable, read-only table, since it gets re-built from scratch about once a month. And, FWIW, all the columns mentioned in this post have a NOT NULL constraint.)
The second form is similar to the first, except that now the join is taken between T and T<int2>:
(Q2) SELECT a1.word, a2.word
FROM T a1 JOIN T a2 USING ( zipk )
WHERE a2.type = <int2>;
(Both the forms above are somewhat oversimplified relative to the actual situation; in our actual application, the joins are actually left outer ones, and each query also involves an additional inner join with another table, S. For the sake of completeness, I give the "real-world" versions of these queries at the end of this post, but I think that for the purpose of my question, the additional complications they entail can be neglected.)
One way to speed (Q1) would be to break T into its subtables, i.e. to create T1, T2, T3, ... , T100 as bona fide tables. Then the query would become a simple join without the two condition of the original's WHERE clause, which I figure should make it noticeably faster.
But since the second kind of query (Q2) requires T, we can't get rid of this table, so all the data would need to be stored twice, once in T and once in some T<int*>.
In trying to come up with a way around this duplication, it occurred to me that instead of creating tables T1, T2, etc., I could create the analogous views V1, V2, etc. (e.g. CREATE VIEW V1 AS SELECT * FROM T WHERE type = 1). With this design, the two queries above would become
(Q1*) SELECT V<int1>.word, V<int2>.word
FROM V<int1> JOIN V<int2> USING ( zipk );
(Q2*) SELECT T.word, V<int2>.word
FROM T JOIN V<int2> USING ( zipk );
Of course, I expect that using views V<int1> and V<int2>... would result in a loss in performance relative to a version that used bona fide tables T<int1> and T<int2>. My question is, how can I minimize this performance loss?
More specifically, how can I go about building table T and the views V<int?>'s to maximize the performance of (Q1)? For example, I'm thinking that if T had an additional id column and were built in such a way that all the records belonging to each V<int?> were physically contiguous, and (say) had contiguous values in the id column, then I could define each view like this
CREATE VIEW V<int1> AS SELECT * FROM T
WHERE <start_int1> <= id AND id < <start_int1+1>;
So my question is, what design would make querying V1, V2, V3 ... as fast as possible? Is it possible to approach the performance of the design that uses bona fide tables T1, T2, T3, ... instead of views V1, V2, V3 ...?
Thank you very much for reading this long post, and many thanks in advance for your comments!
Kynn
P.S. Here are the actual form of the queries. They now include an initial join with table S, and the join with T<int2> (or V<int2>) is a left outer join. Interestingly, even though the queries below that use views (i.e. Q1*** and Q2***) are not much more complex-looking than before, the other two (Q1** and Q2**) are. I don't know if this is because my ineptitude with SQL, but I am not able to render (Q1**) and (Q2**) without resorting to the subquery sq.
(Q1**) SELECT a1.word, sq.word FROM
S JOIN T a1 USING ( word )
LEFT JOIN ( SELECT * FROM T a2
WHERE a2.type = <int2> ) sq USING ( zipk )
WHERE a1.type = <int1>;
(Q2**) SELECT a1.word, sq.word FROM
S JOIN T a1 USING ( word )
LEFT JOIN ( SELECT * FROM T a2
WHERE a2.type = <int2> ) sq USING ( zipk )
---------------------------------------------
(Q1***) SELECT V<int1>.word, V<int2>.word FROM
S JOIN V<int1> USING ( word )
LEFT JOIN V<int2> USING ( zipk );
(Q2***) SELECT T.word, V<int2>.word
FROM S JOIN T USING ( word )
LEFT JOIN V<int2> USING ( zipk );
pgsql-performance by date: