Collection of PostgreSQL Performance Tips - Mailing list pgsql-sql
From | Brubaker, Shane |
---|---|
Subject | Collection of PostgreSQL Performance Tips |
Date | |
Msg-id | 53386E0C47E7D41194BB0002B325C997747385@NTEX60 |
List | pgsql-sql |
<div class="Section1"><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> This is a collection of many performance tips that we've gathered together at Affymetrix, and I thoughtit would be useful to post them to the PostgreSQL news group. </span></font><p class="MsoNormal"><font face="Arial"size="2"><span style="font-size:10.0pt; font-family:Arial"> The single most helpful trick has been the "Partial index trick" at the bottom and the use of temp tables. Most of these tricks came from either this news group, or from my colleagues in the bioinformatics department, soI'd like to thank and acknowledge both groups.</span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> I'd like to thank </span></font><font face="Arial" size="2"><span style="font-size:10.0pt;font-family:Arial">TomLane</span></font><font face="Arial" size="2"><span style="font-size:10.0pt;font-family:Arial">,who clearly has been working very hard on the Optimizer, and all the other peoplewho have worked on Postgres. Your efforts have been invaluable to us. Keep up the good work!</span></font><p class="MsoNormal"><fontface="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> We are currently working on a Transcriptome project, which is a follow-on to the human genome project,in which we systematically look across all parts of the genome to see what is expressed in the form of RNA. It ispublicly funded by the National Cancer Institute and the data is made publicly available at: <a href="http://www.netaffx.com/transcriptome/">http://www.netaffx.com/transcriptome/</a></span></font><p class="MsoNormal"><fontface="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> We currently have about 100GB of data and will soon grow to a multi-terabyte system. We have tablesof up to 1 billion rows and have been able to get ~1 million row queries to run in about 5 min. 
We've been very pleased with Postgres. After a couple of major outages in our server room, it came back up flawlessly each time, so it has been an invaluable asset for this project. We run 7.2 on Red Hat on a 2-processor machine with a SAN, and we have a 128-node Linux cluster which will make analysis runs against the database.</span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">Our main request is continued enhancement of the optimizer for these heavy types of queries. Improved use of indexes, the ability to control execution plans explicitly, the ability to use indexes for data retrieval without touching the table in certain cases, and other such features would be very useful. I'm also curious to hear about whether there is any good clustering system for making a parallel Postgres installation, and whether others have experience with creating such large databases.</span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">We've been very happy and impressed with the constant improvements to the system. Thank you!</span></font><p><font face="Times New Roman" size="3"><span style="font-size:12.0pt">What follows is a long, detailed list of performance tips for doing heavy-duty queries. </span></font><ul type="disc"><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Indexes 1. Indexes are critical. Create exactly the combined (multi-field) indexes that are being joined in a particular join. The order of fields in the index and in the join must match exactly. </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Indexes 2. Multi-field indexes. Having indexes on individual columns as well as on combinations of 2, 3, and 4 columns can help. Sometimes the planner uses the 3-column index, and sometimes it uses one 2-column index plus one single-column index. This can be helpful, especially when seq scan is turned off and you are using LIMIT.
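For example (a minimal sketch; the table and column names here are invented for illustration): <pre style="margin-left:.25in"><font face="Courier New" size="2"><span style="font-size:10.0pt">-- combined index matching a join/where on (probe_id, chrom), plus single-column indexes
create index i_expr_probe_chrom on expr (probe_id, chrom);
create index i_expr_probe on expr (probe_id);
create index i_expr_chrom on expr (chrom);
vacuum analyze expr;  -- update planner statistics so the new indexes are considered</span></font></pre>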
</span></font><li class="MsoNormal"><font face="TimesNew Roman" size="3"><span style="font-size:12.0pt">Indexes 3. Remember that multiple-field indexes must have thefields in the correct order as they are accessed in the query. An index can only be used to the extent allowed by thekeys. An index over (A B C) can be used to find (A B), but not (B C). </span></font><li class="MsoNormal"><font face="TimesNew Roman" size="3"><span style="font-size:12.0pt">Vacuum. Always vacuum analyze the table(s) after creating indices(or loading/deleting data). </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Limitand Order by. May have to use order by and/or limit to use the indexes. May need to use orderby with limit. Sometimes order by increases speed by causing use of an index. Sometimes it decreases speed because aSort step is required. A where condition that is sufficiently restrictive may also cause an index to be used. </span></font><liclass="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Join Order. Orderof fields, joins, and order by fields has a big impact. </span></font><li class="MsoNormal"><font face="Times New Roman"size="3"><span style="font-size:12.0pt">Casting 1. May have to explicitly cast things. For instance where x=3 mustbecome (where x=cast(3 as smallint)). This can make a huge difference. </span></font><li class="MsoNormal"><font face="TimesNew Roman" size="3"><span style="font-size:12.0pt">Casting 2. Simply adding abs(destype)=(cast 111 as smallint)to my query and turning seq scans off seems to change the query execution plan. Writing this as (destype=111 ordestype=-111) makes the cost over 7 times higher!! </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><spanstyle="font-size:12.0pt">Seq Scans 1. Can you disable seq scans? Yes, you can type "set enable_seqscan=no;"at the psql prompt and disable it. 
Do not be surprised if this does not work, though. You can also disable merge joins, hash joins, nested loops, and sorts. Try this and attempt to enable the correct combination that you want it to use. </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Seq Scans 2. In general you would like it to use an index, but don't be afraid to try the seq scan if the cost is, say, &lt; 150,000, and see if it finishes in a few minutes. For large joins with no where clause, Postgres always uses seq scans. Try to add a where clause, even a non-restrictive one, so that an index can be used. However, remember that Postgres must go get the table data too, so this can be more costly. Postgres cannot read data solely from an index (some commercial databases can). </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Seq Scans 3. Sometimes seq scans really are faster. The planner uses the analyzed statistics to decide which is better, but don't always trust it; try it both ways. This is why analyzing your table can produce a different execution plan afterwards: the analysis step updates the table's statistics, and the change in estimated costs may cause a different plan to be chosen. </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Explain Output. Reading the EXPLAIN output can be confusing. In general, the numbers are a range. If you are just trying to get some rows back, you'd like the left-most number to be 0; this means the right-most number will probably not be reached, because you will not really have to search the entire table. The right-most number is an upper bound. The numbers sum as you go up the plan. What you don't want is a large number for both the min and the max. Sometimes a cost of about 100,000 takes about 3 minutes. Sometimes this is not accurate.
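For example (the table, index, and numbers here are invented; they only illustrate the format of a plan line): <pre style="margin-left:.25in"><font face="Courier New" size="2"><span style="font-size:10.0pt">explain select * from expr where probe_id = 42;
-- Index Scan using i_expr_probe on expr  (cost=0.00..3510.02 rows=1021 width=16)
--   left number: estimated start-up cost; right number: upper-bound total cost</span></font></pre>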
Sometimes I was able to see a lower seq scan cost, but when I disabled seq scans and used indexes, the actual performance was faster. Note that the cost numbers are in the planner's own arbitrary units (roughly disk page fetches), not milliseconds. Use EXPLAIN ANALYZE, which will actually run the query and produce real timings.</span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">SQL Tricks. Remember the standard SQL tricks, which I will not cover here (get a good thick SQL book). For example, using LIKE and the like can be slow. Remember that if there is no data in your table for a given where clause, it must scan the entire result just to tell you "no results found", so know your data in advance. </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Nested loops are probably the most expensive operation. </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Having several merges and sorts can be way better than having a single nested loop in your query. </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Explicit Joins. For more than 2 joined tables, consider using explicit joins (see <a href="http://www.ca.postgresql.org/users-lounge/docs/7.1/postgres/explicit-joins.html" target="_top">http://www.ca.postgresql.org/users-lounge/docs/7.1/postgres/explicit-joins.html</a>). </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">News Groups. Try the Postgres news groups: <a href="http://www.us.postgresql.org/users-lounge/index.html" target="_top">http://www.us.postgresql.org/users-lounge/index.html</a></span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Hardware/Configuration Changes.
I won't go into a lot of detail here, as this page is more about the query optimizer, but you can look at how much your CPU and memory are being taxed, and try running postmaster with various flags to increase speed and memory. However, if your query plan is not coming out right, this will have little impact. </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Identities. You can try adding "and a.id=a.id", which can actually help encourage the query planner to use an index. In one example, a select with x=x and y=y order by x worked best (order by y too made it worse!). </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Temp Tables. You may want to control the query explicitly by breaking it into several steps, with intermediate tables being created along the way. You can make these true temp tables, which will go away when you log out, or you may want to keep them around. You might want to create a procedure or script that automates/hides this process. </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Views. EXPLAIN sometimes shows a view adding a step to the plan, but it does not seem to impact query speed. However, if you add more clauses to the view, this may change the query plan in a bad way, which is confusing to the user. </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Stored Procedures. Try writing a stored procedure to control the query execution more explicitly. If you do this, break the SQL into many small cursors instead of one large cursor; otherwise you will run up against the same problems.
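The temp-table approach above can be sketched like this (all table and column names are invented): <pre style="margin-left:.25in"><font face="Courier New" size="2"><span style="font-size:10.0pt">-- step 1: materialize the restrictive part of the query into a true temp table
select probe_id, chrom into temp table tmp_hits from expr where score &gt; 100;
create index i_tmp_hits on tmp_hits (probe_id);
vacuum analyze tmp_hits;  -- give the planner statistics on the intermediate table
-- step 2: join the small intermediate table against the next big table
select h.probe_id, s.seq_data from tmp_hits h, seqs s where h.probe_id = s.probe_id;</span></font></pre>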
</span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">External Programs. As above, breaking a query out into a series of small, explicit nested loops in a C, Perl, or other client program may actually improve performance (especially if you want a subset of the results/tables). </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Monitor Query Progress. Alan Williams provided a good trick to monitor the progress of a long-running query. If you add a sequence to the query (select nextval('sq_test'), ...), then you can use select currval('sq_test') to see how far the query has progressed. </span></font><li class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">Partial Indices. You can use this feature to force the use of an index! (It is also useful as a true partial index.) Assume table1 below has no rows where field1=0. By doing the actions below, the clause field1 &lt;&gt; 0 is stored in pg_index, and whenever the planner sees that predicate, it will use the partial index. In this case we are using it as a full index to trick it.
Example:</span></font></ul><p class="MsoNormal" style="margin-left:.25in"><font face="Times New Roman" size="3"><span style="font-size:12.0pt">create index i on table1(field1) where field1 &lt;&gt; 0;</span></font><pre style="margin-left:.25in"><font face="Courier New" size="2"><span style="font-size:10.0pt">select * from table1 where field1 &lt;&gt; 0;</span></font></pre><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">Shane Brubaker</span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">BioInformatics Engineer</span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">Affymetrix, Inc.</span></font></div>