<div class="Section1"><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial"> This is a collection of many performance tips that we've gathered together at Affymetrix, and I
thoughtit would be useful to post them to the PostgreSQL news group. </span></font><p class="MsoNormal"><font
face="Arial"size="2"><span style="font-size:10.0pt;
font-family:Arial"> The single most helpful trick has been the "Partial index trick" at the bottom and the use of temp
tables. Most of these tricks came from either this news group, or from my colleagues in the bioinformatics department,
soI'd like to thank and acknowledge both groups.</span></font><p class="MsoNormal"><font face="Arial" size="2"><span
style="font-size:10.0pt;
font-family:Arial"> I'd like to thank </span></font><font face="Arial" size="2"><span
style="font-size:10.0pt;font-family:Arial">TomLane</span></font><font face="Arial" size="2"><span
style="font-size:10.0pt;font-family:Arial">,who clearly has been working very hard on the Optimizer, and all the other
peoplewho have worked on Postgres. Your efforts have been invaluable to us. Keep up the good work!</span></font><p
class="MsoNormal"><fontface="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial"> We are currently working on a Transcriptome project, which is a follow-on to the human genome
project,in which we systematically look across all parts of the genome to see what is expressed in the form of RNA. It
ispublicly funded by the National Cancer Institute and the data is made publicly available at: <a
href="http://www.netaffx.com/transcriptome/">http://www.netaffx.com/transcriptome/</a></span></font><p
class="MsoNormal"><fontface="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial"> We currently have about 100GB of data and will soon grow to a multi-terabyte system. We have
tablesof up to 1 billion rows and have been able to get ~1 million row queries to run in about 5 min. We've been very
pleasedwith postgres. After a couple of major outages in our server room, it came back up flawlessly each time. So it
hasbeen an invaluable asset for this project. We run 7.2 on Red Hat on a 2-processor machine with SAN, and we have a
128-nodelinux cluster which will make analysis runs against the database.</span></font><p class="MsoNormal"><font
face="Arial"size="2"><span style="font-size:10.0pt;
font-family:Arial"> Our main request is continued enhancement of the optimizer for these heavy types of queries.
Improveduse of indexes, ability to control execution plans explicitly, ability to use indexes for data retrieval
withouttouching the table in certain cases, and other such features would be very useful. I'm also curious to hear
aboutwhether there is any good clustering system for making a parallel postgres installation, and if others have
experiencewith creating such large databases.</span></font><p class="MsoNormal"><font face="Arial" size="2"><span
style="font-size:10.0pt;
font-family:Arial"> We've been very happy and impressed with the constant improvements to the system. Thank
You!</span></font><p><fontface="Times New Roman" size="3"><span style="font-size:12.0pt">This page is a long detailed
listof performance tips for doing heavy duty queries. </span></font><ul type="disc"><li class="MsoNormal"><font
face="TimesNew Roman" size="3"><span style="font-size:12.0pt">Indexes 1. Indexes are critical. Create exactly the
combined(multi-field) indexes that are being joined in a particular join. The order of fields in the index and in the
joinmust match exactly. </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span
style="font-size:12.0pt">Indexes2. Multi-Field Indexes. Having indexes on individual columns as well as combinations of
2,3,and4 columns can help. Sometimes is uses the 3 version, and sometimes it uses one 2 and one singlet index. This can
behelpful, especially when seq scan is turned off and you are using limit. </span></font><li class="MsoNormal"><font
face="TimesNew Roman" size="3"><span style="font-size:12.0pt">Indexes 3. Remember that multiple-field indexes must have
thefields in the correct order as they are accessed in the query. An index can only be used to the extent allowed by
thekeys. An index over (A B C) can be used to find (A B), but not (B C). </span></font><li class="MsoNormal"><font
face="TimesNew Roman" size="3"><span style="font-size:12.0pt">Vacuum. Always vacuum analyze the table(s) after creating
indices(or loading/deleting data). </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span
style="font-size:12.0pt">Limitand Order by. May have to use order by and/or limit to use the indexes. May need to use
orderby with limit. Sometimes order by increases speed by causing use of an index. Sometimes it decreases speed because
aSort step is required. A where condition that is sufficiently restrictive may also cause an index to be used.
</span></font><liclass="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Join Order.
Orderof fields, joins, and order by fields has a big impact. </span></font><li class="MsoNormal"><font face="Times New
Roman"size="3"><span style="font-size:12.0pt">Casting 1. May have to explicitly cast things. For instance where x=3
mustbecome (where x=cast(3 as smallint)). This can make a huge difference. </span></font><li class="MsoNormal"><font
face="TimesNew Roman" size="3"><span style="font-size:12.0pt">Casting 2. Simply adding abs(destype)=(cast 111 as
smallint)to my query and turning seq scans off seems to change the query execution plan. Writing this as (destype=111
ordestype=-111) makes the cost over 7 times higher!! </span></font><li class="MsoNormal"><font face="Times New Roman"
size="3"><spanstyle="font-size:12.0pt">Seq Scans 1. Can you disable seq scans? Yes, you can type "set
enable_seqscan=no;"at the psql prompt and disable it. Do not be surprised if this does not work though. You can also
disablemerges, joins, nested loops, and sorts. Try this and attempt to enable the correct combination that you want it
touse. </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Seq
Scans2. In general you would like it to use an index, but don't be afraid to try the seq scans if cost is say <
150,000and see if it it finishes in a few minutes. For large joins with no where clause, Postgres always uses seq
scans.Try to add a where clause, even a non-restrictive one, and use an index. However, remember that postgres must go
getthe table data too, so this can be more costly. Postgres cannot read data solely from an index (some commercial
databasescan). </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span
style="font-size:12.0pt">SeqScans 3. Sometimes it is true that seq scans are faster. It tries to use the analyzed
statisticsto decide which is better. But don't always trust it, try it both ways. This is why analyzing your table will
producedifferent execution plans at after analysis -- The analysis step will update the stats of the table. The change
inestimated costs might cause a different plan to be chosen. </span></font><li class="MsoNormal"><font face="Times New
Roman"size="3"><span style="font-size:12.0pt">Explain Output. Reading the Explain output can be confusing. In general,
thenumbers are a range. If you are trying to just get some rows back, you'd like the left most number to be 0. This
meansthat the right-most number will probably not happen, because you will not really have to search the entire table.
Theright-most number is an upper bound. The numbers sum as you go up. What you don't want is a large number for both
themin and max. Sometimes a cost of about 100,000 takes about 3 minutes. Sometimes this is not accurate. Sometimes I
wasable to to see a lower seq scan cost, but when I disable seq scans and used indexes, the actual performance was
faster. In general the cost is in milliseconds. Use Explain Analyze which will run through they query and produce
actualtimes.</span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span
style="font-size:12.0pt">SQLtricks. Remember the standard SQL tricks which I will not cover here (get a good thick SQL
book).For example using Like, etc. can be slow. Remember that if there is no data in your table for a given where
clause,it must scan the entire result just to tell you "no results found" so know your data in advance.
</span></font><liclass="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Nested loops
areprobably the most expensive operation. </span></font><li class="MsoNormal"><font face="Times New Roman"
size="3"><spanstyle="font-size:12.0pt">Having several merges and sorts can be way better than having a single nestloop
inyour query. </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span
style="font-size:12.0pt">ExplicitJoins. For more than 2 joined tables, consider using explicit joins (see: <a
href="http://www.ca.postgresql.org/users-lounge/docs/7.1/postgres/explicit-joins.html)"
target="_top">http://www.ca.postgresql.org/users-lounge/docs/7.1/postgres/explicit-joins.html)</a></span></font><li
class="MsoNormal"><fontface="Times New Roman" size="3"><span style="font-size:12.0pt">News Groups. Try the postgres
newsgroups: <a href="http://www.us.postgresql.org/users-lounge/index.html"
target="_top">http://www.us.postgresql.org/users-lounge/index.html</a></span></font><li class="MsoNormal"><font
face="TimesNew Roman" size="3"><span style="font-size:12.0pt">Hardware/Configuration changes. I won't go into a lot of
detailhere as this page is more about the query optimizer, but you can look at how much your CPU and memory is being
taxed,and try running postmaster with various flags to increase speed and memory. However, if your query plan is not
comingout right this will have little impact. </span></font><li class="MsoNormal"><font face="Times New Roman"
size="3"><spanstyle="font-size:12.0pt">Identities. You can try typing "and a.id=a.id" and this will actually help
encouragethe query planner to use an index. In one example, select with x=x and y=y order by x worked best (order by y
toomade it worse!). </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span
style="font-size:12.0pt">Temptables. You may want to explicitly control the query by breaking it into several steps,
withintermediate tables being created along the way. You can make these true temp tables, which will go away when you
logout, or you may want to keep them around. You might want to create a procedure or script that automates/hides this
process.</span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Views.
Viewssometimes say that they are adding a step to the query planner, but it does not seem to impact query speed. But if
youadd more clauses to the view this may change the query plan in a bad way, which is confusing to the user.
</span></font><liclass="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Stored
Procedures.Try writing a stored procedure to more explicitly control the query execution. If you do this break out SQL
intomany small cursors instead of 1 large cursor, otherwise you will run up against the same problems.
</span></font><liclass="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">External
programs.As above, breaking out a query into a series of small, explicit nested loops in a C, Perl, or other client
program,may actually improve performance (especially if you want a subset of results/tables). </span></font><li
class="MsoNormal"><fontface="Times New Roman" size="3"><span style="font-size:12.0pt">Monitor Query Progress. Alan
Williamsprovided a good trick to monitor the progress of a long running query. If you add to the query a sequence
(selectnextval('sq_test'),...) then you can use select currval('sq_test') to see how far the query has progressed.
</span></font><liclass="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Partial
Indices.You can use this feature to force use of an index!!! (it is also useful as a true partial index). Assume table1
belowhas no rows where field1=0. By doing the actions below, it stores the clause field1<>0 in pg_index and when
itsees that predicate, it always uses the partial index. In this case we are using it as a full index to trick it.
Example:</span></font></ul><p class="MsoNormal" style="margin-left:.25in"><font face="Times New Roman" size="3"><span
style="font-size:12.0pt"> create index i on table1(field1) where field1 <> 0;</span></font><pre
style="margin-left:.25in"><fontface="Courier New" size="2"><span style="font-size:10.0pt"> select * from table1
wherefield1<>0;</span></font></pre><pre style="margin-left:.5in"><font face="Courier New" size="2"><span
style="font-size:10.0pt"></span></font></pre><p class="MsoNormal"><font face="Arial" size="2"><span
style="font-size:10.0pt;
font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial">Shane Brubaker</span></font><p class="MsoNormal"><font face="Arial" size="2"><span
style="font-size:10.0pt;
font-family:Arial">BioInformatics Engineer</span></font><p class="MsoNormal"><font face="Arial" size="2"><span
style="font-size:10.0pt;
font-family:Arial">Affymetric, Inc.</span></font><p class="MsoNormal"><font face="Arial" size="2"><span
style="font-size:10.0pt;
font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial"> </span></font></div>