Thread: Tuning to speed select
Hi,

I'm running PostgreSQL version 8 on a dual 2.4GHz Xeon with 1GB of RAM
and an IDE hard drive. My big table has around 9 million records.

Is there a tuning parameter I can change to increase the speed of selects?
Clearly, there's already some buffering going on, since selecting an
indexed ~50,000 records takes 17 seconds on the first try and only 0.5
seconds on the second try (from psql).

cowpea=> explain analyze select bs_fk from blast_result where si_fk=11843254;
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using si_fk_index on blast_result  (cost=0.00..22874.87 rows=58118 width=4) (actual time=112.249..17472.935 rows=50283 loops=1)
   Index Cond: (si_fk = 11843254)
 Total runtime: 17642.522 ms
(3 rows)

cowpea=> explain analyze select bs_fk from blast_result where si_fk=11843254;
                                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using si_fk_index on blast_result  (cost=0.00..22874.87 rows=58118 width=4) (actual time=0.178..341.643 rows=50283 loops=1)
   Index Cond: (si_fk = 11843254)
 Total runtime: 505.011 ms
(3 rows)

cowpea=>

Thanks,
Tom

--
Tom Laudeman
twl8n@virginia.edu
(434) 924-2456
http://www.people.virginia.edu/~twl8n/
http://laudeman.com/
On Wed, Aug 09, 2006 at 09:19:31AM -0400, Tom Laudeman wrote:
> Hi,
>
> I'm running PostgreSQL version 8 on a dual 2.4GHz Xeon with 1GB of RAM
> and an IDE hard drive. My big table has around 9 million records.
>
> Is there a tuning parameter I can change to increase speed of selects?
> Clearly, there's already some buffering going on since selecting an
> indexed ~50,000 records takes 17 seconds on the first try, and only 0.5
> seconds on the second try (from pgsql).

Your OS is probably buffering; 1GB of RAM holds a lot of data. You can
try increasing the shared_buffers parameter, but if the delay is in
getting data from the disk, that won't really help you.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
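For reference, a postgresql.conf sketch along the lines Martijn mentions. The
numbers are only illustrative guesses for a 1GB machine, not recommendations,
and in the 8.x releases discussed here both settings are expressed in 8kB
pages:

# postgresql.conf (illustrative values only)
shared_buffers = 32768          # 32768 * 8kB pages = 256MB; raising this may
                                # also require raising the kernel's SHMMAX
effective_cache_size = 65536    # ~512MB; a hint about how much the OS caches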
On Wed, Aug 09, 2006 at 03:46:38PM +0200, Martijn van Oosterhout wrote:
> On Wed, Aug 09, 2006 at 09:19:31AM -0400, Tom Laudeman wrote:
> > Is there a tuning parameter I can change to increase speed of selects?
> > Clearly, there's already some buffering going on since selecting an
> > indexed ~50,000 records takes 17 seconds on the first try, and only 0.5
> > seconds on the second try (from pgsql).
>
> Your OS is probably buffering, 1GB of RAM holds a lot of data. You can
> try increasing the shared_buffers parameter, but if the delay is
> getting data from the disk, that won't really help you.

If most of your queries use the same index then clustering on that index
might speed up initial (i.e., not-cached) queries by reducing the number
of disk pages that need to be read. See the documentation for more
information.

http://www.postgresql.org/docs/8.1/interactive/sql-cluster.html

--
Michael Fuhr
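As a concrete sketch of the above, using the table and index names from the
first message (CLUSTER rewrites the table once and locks it while it runs,
so it has to be repeated as the data changes):

CLUSTER si_fk_index ON blast_result;  -- pre-8.3 syntax: CLUSTER indexname ON tablename
ANALYZE blast_result;                 -- refresh planner statistics afterwards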
What about creating views on areas of the table that are queried often? I
don't know whether you have a way to find out what kinds of queries hit
the table most often, but if you create some views on frequently visited
information, this could also help.

Tom Laudeman wrote:
> Hi,
>
> I'm running PostgreSQL version 8 on a dual 2.4GHz Xeon with 1GB of RAM
> and an IDE hard drive. My big table has around 9 million records.
>
> Is there a tuning parameter I can change to increase speed of selects?
> Clearly, there's already some buffering going on since selecting an
> indexed ~50,000 records takes 17 seconds on the first try, and only
> 0.5 seconds on the second try (from pgsql).
>
> [...]
>
> Thanks,
> Tom
Views certainly help in managing complexity. They do nothing to improve
query speed.

Querying a view gets rewritten on the fly into queries against the
underlying tables (as long as there are no materialized views, which are
still on the TODO list).
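A quick way to see this for yourself (the view name is made up; bs_fk,
si_fk, and blast_result are from the original post):

CREATE VIEW hot_results AS
    SELECT bs_fk FROM blast_result WHERE si_fk = 11843254;

EXPLAIN ANALYZE SELECT bs_fk FROM hot_results;
-- shows essentially the same index scan as querying blast_result directly,
-- so the view by itself saves no I/O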
--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Reinsburgstraße 202b
70197 Stuttgart
0173/9409607
-
Let's set so double the killer delete select all.
> Views certainly help in managing complexity. They do nothing to improve
> query-speed.
>
> Querying a view gets rewritten to queries to the underlying tables on the
> fly.
> (as long as there are no materialized views, which are still on the TODO
> list)

Would partial indexes on the most queried regions of the table help with
query speed?

Regards,
Richard Broersma Jr.
I'm not so sure about that. When you create a view on a table (at least
with Oracle) that is a subset of the table's information (the trivial or
'proper' subset being the entire table), then when a select is issued
against the table, Oracle at least determines whether there is already a
view on that table with a potentially smaller amount of information to
process. As long as the view contains constraints that match your select
criteria, the RDBMS engine has fewer records to process, which I'd say
certainly constitutes a time benefit in terms of performance gain.

Hence my reasoning: if there is a subset of the 'big table' that is
frequented, I'd create a view on it, assuming PostgreSQL does this too?
Maybe somebody else can answer that for pgsql-general's general
information.

Query speed itself is going to be as fast or slow as your system is
configured for; my point was to shave some time off a 1M+ record table by
implementing views of frequently visited records meeting the same
specifications.

Harald Armin Massa wrote:
> Louis,
>
> Views certainly help in managing complexity. They do nothing to
> improve query-speed.
>
> Querying a view gets rewritten to queries to the underlying tables on
> the fly.
> (as long as there are no materialized views, which are still on the
> TODO list)
>
> --
> GHUM Harald Massa
> persuadere et programmare
> Harald Armin Massa
> Reinsburgstraße 202b
> 70197 Stuttgart
> 0173/9409607
> -
> Let's set so double the killer delete select all.
On Wed, 2006-08-09 at 14:58, louis gonzales wrote:
> I'm not so sure about that, when you create a view on a table - at least
> with Oracle - which is a subset(the trivial or 'proper' subset is the
> entire table view) of the information on a table, when a select is
> issued against a table, Oracle at least, determines if there is a view
> already on a the table which potentially has a smaller amount of
> information to process - as long as the view contains the proper
> constraints that meet your 'select' criteria, the RDBMS engine will have
> fewer records to process - which I'd say, certainly constitutes a time
> benefit, in terms of 'performance gain.'
>
> Hence my reasoning behind determining IF there is a subset of the 'big
> table' that is frequented, I'd create a view on this, assuming
> postgresql does this too? Maybe somebody else can answer that for the
> pgsql-general's general information?
>
> query-speed itself is going to be as fast/slow as your system is
> configured for, however my point was to shave some time off of a 1M+
> record table, but implementing views of 'frequently' visited/hit
> records meeting the same specifications.
There are basically two ways to do views. The simple way is to have a
view represent a query that gets run every time you call it. The more
complex way is to "materialize" the view data, put it into a new table,
and then update that table whenever the source table changes.

PostgreSQL has native support for the first type. They're cheap and
easy, and work for most of the things people need views for (i.e. hiding
complexity).

PostgreSQL is extensible, so you can implement the second type (i.e.
materialized views) on your own. Thankfully, someone else, by the name of
Jonathan Gardner, has already done most of the work for us, and you can
find his nifty guide by typing "materialized views postgresql" into
Google.
Gardner's materialized views support several update methods depending on
what you need from your mat views. It's also a danged fine tutorial on
how to write some simple plpgsql functions.
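For the archives, a very rough sketch of the trigger-maintained idea. This
is not Gardner's code; the table and function names are invented, and a real
implementation would also have to handle UPDATEs and concurrent writers:

-- summary table holding a materialized count per si_fk value
CREATE TABLE blast_result_counts (
    si_fk integer PRIMARY KEY,
    n     bigint NOT NULL
);

CREATE OR REPLACE FUNCTION blast_result_counts_sync() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE blast_result_counts SET n = n + 1 WHERE si_fk = NEW.si_fk;
        IF NOT FOUND THEN
            INSERT INTO blast_result_counts (si_fk, n) VALUES (NEW.si_fk, 1);
        END IF;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE blast_result_counts SET n = n - 1 WHERE si_fk = OLD.si_fk;
    END IF;
    RETURN NULL;  -- AFTER trigger; the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER blast_result_counts_trig
    AFTER INSERT OR DELETE ON blast_result
    FOR EACH ROW EXECUTE PROCEDURE blast_result_counts_sync();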
Great suggestion. I've read about CLUSTER, but never had a chance to use it. The only problem is that this table with 9 million records has 5 or 6 indexes. It is hard to pick the most used, but I'll bet CLUSTER will make at least one of the queries run very fast, especially for an index with a small number of distinct values.
The speed of the query is (as Michael implies) limited by the rate at which the disk can seek and read. I have done experiments with views and cursors; there was no improvement in speed. I've also tried pulling back only primary keys in the hope that a smaller amount of data would be read into memory more quickly. No speed increase. I have also raised all the usual memory limits, with the expected results (slight speed improvements).
I'll try CLUSTER (I'm looking forward to that test), but if we really need speed, it will probably be necessary to create copies of the table, or copy portions of the table elsewhere (essentially creating materialized views, I suppose). I'm still trying to get my science compatriot here to tell me which index he most wants to improve, then I'll CLUSTER the table on that index.
Thanks!
Tom
Michael Fuhr wrote:
> On Wed, Aug 09, 2006 at 03:46:38PM +0200, Martijn van Oosterhout wrote:
> > On Wed, Aug 09, 2006 at 09:19:31AM -0400, Tom Laudeman wrote:
> > > Is there a tuning parameter I can change to increase speed of selects?
> > > Clearly, there's already some buffering going on since selecting an
> > > indexed ~50,000 records takes 17 seconds on the first try, and only 0.5
> > > seconds on the second try (from pgsql).
> >
> > Your OS is probably buffering, 1GB of RAM holds a lot of data. You can
> > try increasing the shared_buffers parameter, but if the delay is
> > getting data from the disk, that won't really help you.
>
> If most of your queries use the same index then clustering on that
> index might speed up initial (i.e., not-cached) queries by reducing
> the number of disk pages that need to be read. See the documentation
> for more information.
>
> http://www.postgresql.org/docs/8.1/interactive/sql-cluster.html
--
Tom Laudeman
twl8n@virginia.edu
(434) 924-2456
http://www.people.virginia.edu/~twl8n/
http://laudeman.com/
On Wed, Aug 09, 2006 at 04:54:00PM -0400, Tom Laudeman wrote:
> I'll try CLUSTER (I'm looking forward to that test), but if we really
> need speed, it will probably be necessary to create copies of the table,
> or copy portions of the table elsewhere (essentially creating
> materialized views, I suppose). I'm still trying to get my science
> compatriot here to tell me which index he most wants to improve, then
> I'll CLUSTER the table on that index.

If you enable statistics collection then you could use those statistics
to see which indexes are used the most. Those indexes might be good
candidates for clustering.

http://www.postgresql.org/docs/8.1/interactive/monitoring-stats.html

--
Michael Fuhr
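Assuming row-level statistics are turned on (stats_start_collector and
stats_row_level in postgresql.conf), something like this shows which indexes
on the big table actually get used:

SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE relname = 'blast_result'
ORDER BY idx_scan DESC;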
Tom Laudeman wrote:
> Great suggestion. I've read about CLUSTER, but never had a chance to use
> it. The only problem is that this table with 9 million records has 5 or 6
> indexes. It is hard to pick the most used, but I'll bet CLUSTER will make
> at least one of the queries run very fast, especially for an index with a
> small number of distinct values.
Tom-
I found multi-column indexes and clustering to be extremely beneficial in 7.4. I still use them in 8.1, but I haven't compared them extensively with equivalent queries that use bitmap index scans. The obvious downside of having more indexes is the additional time and space overhead during insert, update, or delete.
The approach I took to design multi-column indexes was to run explain on representative queries and look for seq scans. The seq scans indicate which columns /might/ be well-served by indexes. In 7.4, the order of indexed columns was important. (I saw something in the 8.1 release notes that made me think that this was no longer true, but I haven't verified that.)
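A sketch of that workflow; the second column (score) is purely hypothetical
and only stands in for whatever extra column a real query constrains on:

EXPLAIN SELECT bs_fk FROM blast_result WHERE si_fk = 11843254 AND score > 100;
-- look for Seq Scan nodes, or Filter lines that discard most of the rows

CREATE INDEX blast_result_si_score_idx ON blast_result (si_fk, score);
ANALYZE blast_result;
-- re-run the EXPLAIN and check whether the plan switches to the new index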
-Reece
-- Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0
Richard Broersma Jr wrote:
>> Views certainly help in managing complexity. They do nothing to improve
>> query-speed.
>>
>> Querying a view gets rewritten to queries to the underlying tables on
>> the fly.
>> (as long as there are no materialized views, which are still on the
>> TODO list)
>
> Would partial indexes on the most queried regions of the table help with
> query speed?

They would, as long as they can be used by the client application's
queries. If all the data is used often (partially or not), then there's
little point to partial indices, but that's a border case. In general
they can make a major difference.

Also, make sure you add enough constraints to your query so that the
partial index's constraints match your query's constraints; otherwise the
index will not be used.

Another point to check: make sure you don't cause type casts over indexed
columns. That might (under certain conditions) cause a type cast on the
column value of every indexed row, which slows things down significantly.
I believe this was solved in PostgreSQL 8.something.

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //
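A minimal example of a partial index, reusing the si_fk value from the first
message purely for illustration:

CREATE INDEX blast_result_hot_idx ON blast_result (bs_fk)
    WHERE si_fk = 11843254;

-- usable only when the query's WHERE clause matches (or provably implies)
-- the index predicate:
SELECT bs_fk FROM blast_result WHERE si_fk = 11843254;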
On 8/9/06, Tom Laudeman <twl8n@virginia.edu> wrote:
> The speed of the query is (as Michael implies) limited by the rate at which
> the disk can seek and read. I have done experiments with views and cursors;
> there was no improvement in speed. I've also tried only pulling back
> primary keys in the hope that a smaller amount of data would more quickly be
> read into memory. No speed increase. I have also raised all the usual memory
> limits, with the expected results (slight speed improvements).

Are your data structures normalized? Performance problems querying a
single giant table are usually (but not necessarily in your case) a sign
of a poorly designed table structure.

Otherwise it's pretty clear you get the most bang for the buck with
hardware. Consider upping RAM and/or buying better disks. You could buy
a cheap SATA controller and 4 Raptors in a RAID 0+1 configuration for
<$1000 and you will feel like you have a supercomputer relative to what
you have now :)

merlin
We have some multi-column indexes. The speed of insert, update, and delete is not an issue since this data is essentially static: write once, read many.
As far as I can tell (from running actual tests), Postgres will not use a multi-column index when the SELECT constrains only one of the columns in the index. If I need a single-column index, I create one. If I need three columns in an index, I create a specific three-column index. Granted, my memory is fuzzy, some of my testing was done on version 7.x, and there may be improvements in version 8.x.
-Tom
...
I found multi-column indexes and clustering to be extremely beneficial in 7.4. I still use them in 8.1, but I haven't compared them extensively with equivalent queries that use bitmap index scans. The obvious downside of having more indexes is the additional time and space overhead during insert, update, or delete.
...
-Reece
-- Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0
--
Tom Laudeman
twl8n@virginia.edu
(434) 924-2456
http://www.people.virginia.edu/~twl8n/
http://laudeman.com/
On Thu, Aug 10, 2006 at 11:00:00AM -0400, Tom Laudeman wrote:
> As far as I can tell (from running actual tests) Postgres will not use a
> multi-column index when the SELECT is constraining on only one of the
> columns in the index.

Excerpt from the 8.1 Release Notes:

* Allow nonconsecutive index columns to be used in a multicolumn
  index (Tom)

  For example, this allows an index on columns a,b,c to be used in
  a query with WHERE a = 4 and c = 10.

If you're querying only on c then an index on (a,b,c) will probably
be slower than an index on (c), but if you have only (a,b,c) then
the planner will consider using it.

--
Michael Fuhr
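Spelled out with throwaway names (t and t_abc_idx are placeholders, matching
the release-note example):

CREATE TABLE t (a int, b int, c int);
CREATE INDEX t_abc_idx ON t (a, b, c);

EXPLAIN SELECT * FROM t WHERE a = 4 AND c = 10;
-- in 8.1 both a = 4 and c = 10 can appear in the Index Cond; older releases
-- could only use the leading column(s), here a = 4, as index conditions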
Michael Fuhr <mike@fuhr.org> writes:
> Excerpt from the 8.1 Release Notes:
> * Allow nonconsecutive index columns to be used in a multicolumn
>   index (Tom)
>   For example, this allows an index on columns a,b,c to be used in
>   a query with WHERE a = 4 and c = 10.
> If you're querying only on c then an index on (a,b,c) will probably
> be slower than an index on (c), but if you have only (a,b,c) then
> the planner will consider using it.

Note that if your query doesn't have a constraint on the first index
column, it's unlikely that that index will get chosen, because the
entire index would have to be scanned. Before 8.1 it was guaranteed
not to be chosen.

			regards, tom lane
Merlin,

The problem is disk. I've got a WD Caviar. hdparm says it does 44MB/sec
(I ran that in single-user mode so there was nothing interfering). A WD
Caviar SE SATA in one of my servers at home gets 56MB/sec on a quiescent
system at runlevel 3. What kind of values does hdparm give for a SATA
Raptor?

I think my Dell Precision 650 has SATA on the motherboard. The boss says
I can order one drive, so what should I get? How much faster is RAID 0+1
than a single drive?

Aside from size, I can't see much difference between these drives (WD
Raptors at NewEgg):
http://www.newegg.com/Product/ProductList.asp?DEPA=0&type=&Description=raptor&Submit=ENE&Ntk=all&N=0&minPrice=&maxPrice=&Go.x=0&Go.y=0

CLUSTER certainly helped. Each of the following queries would have
returned roughly 50,000 records. Note that selecting a single record from
blast_result using an index is plenty fast (~50 ms), so my primary
concern is pulling back larger subsets of data.

It appears that count(*) on a CLUSTERed table uses the index (as opposed
to the old way of doing a sequential scan). Count on the table after
CLUSTER appears to be a *lot* faster, maybe almost 100x. I know we
shouldn't count, but we've been too lazy to keep the record counts in
another table, and our customers occasionally want to know how many
records are in a certain subset.

Before CLUSTER:

explain analyze select * from blast_result where si_fk=11843253;
 Total runtime: 16334.539 ms

explain analyze select * from blast_result where si_fk=11843248;
 Total runtime: 31406.999 ms

explain analyze select * from blast_result where si_fk=11218929;
 Total runtime: 15319.440 ms

After CLUSTER and vacuum analyze:

explain analyze select * from blast_result where si_fk=11843253;
 Total runtime: 2343.893 ms

explain analyze select * from blast_result where si_fk=11843248;
 Total runtime: 2158.395 ms

explain analyze select * from blast_result where si_fk=11218929;
 Total runtime: 1880.586 ms

explain analyze select * from blast_result where si_fk=11843250;
 Total runtime: 2085.253 ms

Thanks,
Tom

> Are your data structures normalized? Performance problems querying a
> single giant table are usually (but not necessarily in your case) a sign
> of a poorly designed table structure.
>
> Otherwise it's pretty clear you get the most bang for the buck with
> hardware. Consider upping RAM and/or buying better disks. You could buy
> a cheap SATA controller and 4 Raptors in a RAID 0+1 configuration for
> <$1000 and you will feel like you have a supercomputer relative to what
> you have now :)
>
> merlin

--
Tom Laudeman
twl8n@virginia.edu
(434) 924-2456
http://www.people.virginia.edu/~twl8n/
http://laudeman.com/
On 8/11/06, Tom Laudeman <twl8n@virginia.edu> wrote:
> Merlin,
> The problem is disk. I've got a WD Caviar. hdparm says it does 44MB/sec
> (I ran that in single-user mode so there was nothing interfering). A WD
> Caviar SE SATA in one of my servers at home gets 56MB/sec on a quiescent
> system at runlevel 3. What kind of values does hdparm give for a SATA
> Raptor?

I don't have one handy on a Linux box to test, but all Raptors are 10k
drives with 16MB cache. They feel about twice as fast as 7200 rpm IDE
drives in general use.

> I think my Dell Precision 650 has SATA on the motherboard. The boss says
> I can order one drive, so what should I get? How much faster is RAID 0+1
> than a single drive?

Depends on a lot of factors; near 100% improvement is realistic even with
software RAID. I would tell your boss that you could buy 2 36GB Raptors
($110 each) and do a simple RAID 0. Just be aware that either drive
failing will take you out. Or you could do RAID 1 for redundancy.

> Aside from size, I can't see much difference between these drives (WD
> Raptors at NewEgg):
> http://www.newegg.com/Product/ProductList.asp?DEPA=0&type=&Description=raptor&Submit=ENE&Ntk=all&N=0&minPrice=&maxPrice=&Go.x=0&Go.y=0

The retail parts are more expensive, as is the silly drive that you can
look into. Buy the cheapest part at the size level you need.

> CLUSTER certainly helped. Each of the following queries would have
> returned roughly 50,000 records. Note that selecting a single record from
> blast_result using an index is plenty fast (~50 ms), so my primary
> concern is pulling back larger subsets of data.

Maybe. You may have had table bloat as well; CLUSTER does a full table
rebuild like VACUUM FULL.

> It appears that count(*) on a CLUSTERed table uses the index (as opposed
> to the old way of doing a sequential scan). Count on the table after
> CLUSTER appears to be a *lot* faster, maybe almost 100x. I know we
> shouldn't count, but we've been too lazy to keep the record counts in
> another table, and our customers occasionally want to know how many
> records are in a certain subset.

No, AFAIK count(*) still uses the table (try explain analyze). You just
compacted and optimized the table for efficient sequential scans. Are you
vacuuming regularly?

merlin
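To see Merlin's point about count(*), the plan makes it explicit (8.1 has no
index-only scans, so every matching row is still fetched from the table):

EXPLAIN ANALYZE SELECT count(*) FROM blast_result WHERE si_fk = 11843254;
-- an Aggregate node sitting on top of an index, bitmap, or seq scan of
-- blast_result; the heap is read either way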
# twl8n@virginia.edu / 2006-08-11 10:12:40 -0400:
> I think my Dell Precision 650 has SATA on the motherboard. The boss says
> I can order one drive, so what should I get? How much faster is RAID 0+1
> than a single drive?

You need 4 disks for 0+1 (or 1+0, also called 10).

--
How many Vietnam vets does it take to screw in a light bulb?
You don't know, man. You don't KNOW.
Cause you weren't THERE.
http://bash.org/?255991
Roman Neuhauser wrote:
> # twl8n@virginia.edu / 2006-08-11 10:12:40 -0400:
>> I think my Dell Precision 650 has SATA on the motherboard. The boss says
>> I can order one drive, so what should I get? How much faster is RAID 0+1
>> than a single drive?

If you can order one drive, get a drive that is twice the size of what
you have. Use one partition of that drive to create a RAID 1. Use the
other partition to push off nightly rsyncs :)

Sincerely,

Joshua D. Drake

> You need 4 disks for 0+1 (or 1+0, also called 10).

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/