Thread: Simple join doesn't use index
Hi everybody,
I have implemented my first app using PG DB and thought for a minute (maybe two) that I know something about PG, but the problem below totally destroyed my confidence :). Please help me to restore it.
Here is a simple join query. It runs just fine on MS SQL 2008 and uses all available indexes, even with a bigger overall dataset.
Quick performance stats:
MS SQL: 1 second, 264K rows
PG: 158 seconds, 264K rows
Explain plan from both DBs
PG QUERY PLAN
Hash Join (cost=12716.17..1101820.09 rows=248494 width=8)
Hash Cond: (views.visit_id = visits.id)
-> Seq Scan on views (cost=0.00..819136.56 rows=17434456 width=8)
-> Hash (cost=10549.16..10549.16 rows=132081 width=4)
-> Index Scan using visits_created_at_index on visits (cost=0.00..10549.16 rows=132081 width=4)
Index Cond: ((created_at >= '2012-11-15 00:00:00'::timestamp without time zone) AND (created_at < '2012-11-16 00:00:00'::timestamp without time zone))
schemaname | tablename | indexname | tablespace | indexdef
------------+-----------+---------------------------------+------------+------------------------------------------------------------------------------------------
public | views | views_pkey | | CREATE UNIQUE INDEX views_pkey ON views USING btree (id)
public | views | views_visit_id_index | | CREATE INDEX views_visit_id_index ON views USING btree (visit_id)
MS SQL Query plan
'11/16/2012'
|--Parallelism(Gather Streams)
|--Nested Loops(Inner Join, OUTER REFERENCES:([visits].[id], [Expr1006]) OPTIMIZED WITH UNORDERED PREFETCH)
|--Index Seek(OBJECT:([visits].[test]), SEEK:([visits].[created_at] >= '2012-11-15 00:00:00.000' AND [visits].[created_at] < '2012-11-16 00:00:00.000') ORDERED FORWARD)
|--Index Seek(OBJECT:([views].[views_visit_id_index]), SEEK:([views].[visit_id]=[raw_visits].[id]) ORDERED FORWARD)
It is clear that PG does a full table scan: "Seq Scan on views (cost=0.00..819136.56 rows=17434456 width=8)"
I don't understand why PG doesn't use views_visit_id_index in that query but rather scans the whole table. One explanation I found is that when the resulting dataset constitutes ~15% of the total number of rows in the table, a seq scan is used. In this case the resulting dataset is just 1.5% of the total number of rows. So it must be something different. Any reason why this happens and how to fix it?
Postgres 9.2
Ubuntu 12.04.1 LTS
shared_buffers = 4GB; the rest of the settings are the defaults
Thanks
-Alex
On 01/03/2013 10:54 PM, Alex Vinnik wrote:
> I have implemented my first app using PG DB and thought for a minute (maybe two) that I know something about PG, but the problem below totally destroyed my confidence :). Please help me to restore it.

https://wiki.postgresql.org/wiki/SlowQueryQuestions

-- Jeremy
On 01/03/2013 11:54 PM, Alex Vinnik wrote:
> I don't understand why PG doesn't use views_visit_id_index in that query but rather scans the whole table. One explanation I found is that when the resulting dataset constitutes ~15% of the total number of rows in the table, a seq scan is used. In this case the resulting dataset is just 1.5% of the total number of rows. So it must be something different. Any reason why this happens and how to fix it?

But does the query planner know the same? If you added the EXPLAIN ANALYZE output of the query and something like:

SELECT tablename AS table_name, attname AS column_name,
       null_frac, avg_width, n_distinct, correlation
FROM pg_stats
WHERE tablename IN ('views', 'visits');

.. one could possibly tell a bit more.

> Postgres 9.2
> Ubuntu 12.04.1 LTS
> shared_buffers = 4GB; the rest of the settings are the defaults

There is more than just this one memory-related value that needs to be changed for optimal performance. E.g. effective_cache_size can have a direct effect on the use of nested loops. See: http://www.postgresql.org/docs/9.2/static/runtime-config-query.html

Regards, Stefan
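A session-level experiment along those lines could look roughly like the following sketch. The SELECT is only a guess at the shape of the query under discussion, since the actual statement was never posted, and the 8GB figure is an assumption about the machine's usable cache:

-- try a more realistic cache estimate for just this session
SET effective_cache_size = '8GB';   -- assumption; size it to the RAM actually available for caching

-- hypothetical reconstruction of the join being discussed
EXPLAIN ANALYZE
SELECT views.id, views.visit_id
FROM views
JOIN visits ON views.visit_id = visits.id
WHERE visits.created_at >= '2012-11-15'
  AND visits.created_at <  '2012-11-16';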
Hi all,

I have a table that has about 73mm rows in it and growing. Running 9.0.x on a server that unfortunately is a little I/O constrained. Some (maybe) pertinent settings:

default_statistics_target = 50
maintenance_work_mem = 512MB
constraint_exclusion = on
effective_cache_size = 5GB
work_mem = 18MB
wal_buffers = 8MB
checkpoint_segments = 32
shared_buffers = 2GB

The server has 12GB RAM, 4 cores, but is shared with a big webapp running in Tomcat -- and I only have a RAID1 disk to work on. Woes me...

Anyway, this table is going to continue to grow, and it's used frequently (Read and Write). From what I read, this table is a candidate to be partitioned for performance and scalability. I have tested some scripts to build the "inherits" tables with their constraints and the trigger/function to perform the work.

Am I doing the right thing by partitioning this? If so, and I can afford some downtime, is dumping the table via pg_dump and then loading it back in the best way to do this? Should I run a cluster or vacuum full after all is done? Is there a major benefit if I can upgrade to 9.2.x in some way that I haven't realized?

Finally, if anyone has any comments about my settings listed above that might help improve performance, I thank you in advance.

-AJ
On Friday, January 4, 2013, AJ Weber wrote:
Hi all,
I have a table that has about 73mm rows in it and growing.
How big is the table in MB? Its indexes?
...
The server has 12GB RAM, 4 cores, but is shared with a big webapp running in Tomcat -- and I only have a RAID1 disk to work on. Woes me...
By a RAID1 disk, do you mean two disks in a RAID1 configuration, or a single RAID1 composed of an unspecified number of disks?
Often spending many thousands of dollars in DBA time can save you from having to buy many hundreds of dollars in hard drives. :) On the other hand, often you end up having to buy the extra disks anyway after all.
Anyway, this table is going to continue to grow, and it's used frequently (Read and Write).
Are all rows in the table read and written with equal vigor, or are there hot rows and cold rows that can be recognized based on the row's values?
From what I read, this table is a candidate to be partitioned for performance and scalability. I have tested some scripts to build the "inherits" tables with their constraints and the trigger/function to perform the work.
Am I doing the right thing by partitioning this?
Probably not. Or at least, you haven't given us the information to know. Very broadly speaking, well-implemented partitioning makes bulk loading and removal operations take less IO, but makes normal operations take more IO, or if lucky leaves it unchanged. There are exceptions, but unless you can identify a very specific reason to think you might have one of those exceptions, then you probably don't.
Do you have a natural partitioning key? That is, is there a column (or expression) which occurs as a selective component in the where clause of almost all of your most io consuming SQL and DML? If so, you might benefit from partitioning on it. (But in that case, you might be able to get most of the benefits of partitioning, without the headaches of it, just by revamping your indexes to include that column/expression as their leading field).
If you don't have a good candidate partitioning key, then partitioning will almost surely make things worse.
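For reference, the 9.0-style inheritance partitioning being discussed is usually set up roughly like the sketch below; the table, column names, and range are invented for illustration, since AJ's actual scripts weren't posted:

-- hypothetical parent table
CREATE TABLE order_item (
    order_num  bigint    NOT NULL,
    item_no    integer   NOT NULL,
    created_at timestamp NOT NULL,
    payload    text
);

-- one child per range of the partitioning key; constraint exclusion
-- understands range CHECK constraints like this one
CREATE TABLE order_item_p01 (
    CHECK (order_num >= 1 AND order_num < 10000000)
) INHERITS (order_item);

-- trigger routing inserts on the parent into the matching child
CREATE OR REPLACE FUNCTION order_item_insert_trigger() RETURNS trigger AS $$
BEGIN
    IF NEW.order_num >= 1 AND NEW.order_num < 10000000 THEN
        INSERT INTO order_item_p01 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'order_num % has no partition', NEW.order_num;
    END IF;
    RETURN NULL;  -- suppress the insert into the parent itself
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_item_partition_insert
    BEFORE INSERT ON order_item
    FOR EACH ROW EXECUTE PROCEDURE order_item_insert_trigger();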
If so, and I can afford some downtime, is dumping the table via pg_dump and then loading it back in the best way to do this?
To do efficient bulk loading into a partitioned table, you need to specifically target each partition, rather than targeting with a trigger. That pretty much rules out pg_dump, AFAIK, unless you are going to parse the dump file(s) and rewrite them.
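Concretely, a direct-to-partition load might look something like this sketch, continuing the invented names from above; the file path, format, and range are assumptions:

-- load each child directly instead of inserting through the parent's trigger
COPY order_item_p01 FROM '/tmp/order_item_p01.csv' WITH (FORMAT csv);

-- or route rows from the old, unpartitioned table into the matching child
INSERT INTO order_item_p01
SELECT * FROM order_item_old
WHERE order_num >= 1 AND order_num < 10000000;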
Should I run a cluster or vacuum full after all is done?
Probably not. If a cluster after the partitioning would be beneficial, there would be a pretty good chance you could do a cluster *instead* of the partitioning and get the same benefit.
If you do some massive deletes from the parent table as part of populating the children, then a vacuum full of the parent could be useful. But if you dump the parent table, truncate it, and reload it as partitioned tables, then vacuum full would probably not be useful.
Really, you need to identify your most resource-intensive queries before you can make any reasonable decisions.
Is there a major benefit if I can upgrade to 9.2.x in some way that I haven't realized?
If you have specific queries that are misoptimized and so are generating more IO than they need to, then upgrading could help. On the other hand, it could also make things worse, if a currently well optimized query becomes worse.
But, instrumentation has improved in 9.2 from 9.0, so upgrading would make it easier to figure out just which queries are really bad and have the most opportunity for improvement. A little well informed optimization might obviate the need for either partitioning or more hard drives.
Finally, if anyone has any comments about my settings listed above that might help improve performance, I thank you in advance.
Your default statistics target seemed low. Without knowing the nature of your most resource intensive queries or how much memory tomcat is using, it is hard to say more.
Cheers,
Jeff
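On the statistics-target point above, the knob can be raised globally or only for the columns that drive the expensive queries; a sketch, with an invented table and column name:

-- raise the default used by future ANALYZE runs (per session, or in postgresql.conf)
SET default_statistics_target = 100;

-- or raise it only for a hot column, then re-analyze
ALTER TABLE order_item ALTER COLUMN order_num SET STATISTICS 200;
ANALYZE order_item;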
All fair questions...
Thank you for your detailed response!
On 1/4/2013 11:03 PM, Jeff Janes wrote:
> On Friday, January 4, 2013, AJ Weber wrote:
>> Hi all,
>> I have a table that has about 73mm rows in it and growing.
> How big is the table in MB? Its indexes?

Not sure on this. Will see if pgAdmin tells me.

>> The server has 12GB RAM, 4 cores, but is shared with a big webapp running in Tomcat -- and I only have a RAID1 disk to work on. Woes me...
> By a RAID1 disk, do you mean two disks in a RAID1 configuration, or a single RAID1 composed of an unspecified number of disks?
> Often spending many thousands of dollars in DBA time can save you from having to buy many hundreds of dollars in hard drives. :) On the other hand, often you end up having to buy the extra disks anyway after all.

I mean I have two disks in a RAID1 configuration. The server is currently in a whitebox datacenter and I have zero control over the hardware, so adding disks is unfortunately out of the question. I completely understand the comment, and would love to have a larger SAN available to me that I could configure...I just don't and have no way of getting one anytime soon....

>> Anyway, this table is going to continue to grow, and it's used frequently (Read and Write).
> Are all rows in the table read and written with equal vigor, or are there hot rows and cold rows that can be recognized based on the row's values?

No, I could probably figure out a way to setup an "archive" or "older" section of the data that is updated much less frequently. Deletes are rare. Inserts/Updates "yes". Select on existing rows -- very frequent.

>> From what I read, this table is a candidate to be partitioned for performance and scalability. I have tested some scripts to build the "inherits" tables with their constraints and the trigger/function to perform the work.
>> Am I doing the right thing by partitioning this?
> Probably not. Or at least, you haven't given us the information to know. Very broadly speaking, well-implemented partitioning makes bulk loading and removal operations take less IO, but makes normal operations take more IO, or if lucky leaves it unchanged. There are exceptions, but unless you can identify a very specific reason to think you might have one of those exceptions, then you probably don't.

I know you can't believe everything you read, but I thought I saw some metrics about when a table's size exceeds some fraction of available RAM, or when it approaches 100mm rows, it's a big candidate for partitioning.

> Do you have a natural partitioning key? That is, is there a column (or expression) which occurs as a selective component in the where clause of almost all of your most io consuming SQL and DML? If so, you might benefit from partitioning on it. (But in that case, you might be able to get most of the benefits of partitioning, without the headaches of it, just by revamping your indexes to include that column/expression as their leading field).
> If you don't have a good candidate partitioning key, then partitioning will almost surely make things worse.

The table is a "detail table" to its master records. That is, it's like an order-details table where it will have a 1-n rows joined to the master ("order") table on the order-id. So I can partition it based on the order number pretty easily (which is a bigint, btw).

>> If so, and I can afford some downtime, is dumping the table via pg_dump and then loading it back in the best way to do this?
> To do efficient bulk loading into a partitioned table, you need to specifically target each partition, rather than targeting with a trigger. That pretty much rules out pg_dump, AFAIK, unless you are going to parse the dump file(s) and rewrite them.
>> Should I run a cluster or vacuum full after all is done?
> Probably not. If a cluster after the partitioning would be beneficial, there would be a pretty good chance you could do a cluster *instead* of the partitioning and get the same benefit.

I did try clustering the table on the PK (which is actually 4 columns), and it appeared to help a bit. I was hoping partitioning was going to help me even more.

> If you do some massive deletes from the parent table as part of populating the children, then a vacuum full of the parent could be useful. But if you dump the parent table, truncate it, and reload it as partitioned tables, then vacuum full would probably not be useful.
> Really, you need to identify your most resource-intensive queries before you can make any reasonable decisions.
>> Is there a major benefit if I can upgrade to 9.2.x in some way that I haven't realized?
> If you have specific queries that are misoptimized and so are generating more IO than they need to, then upgrading could help. On the other hand, it could also make things worse, if a currently well optimized query becomes worse.

Is there some new feature or optimization you're thinking about with this comment? If so, could you please just send me a link and/or feature name and I'll google it myself?

> But, instrumentation has improved in 9.2 from 9.0, so upgrading would make it easier to figure out just which queries are really bad and have the most opportunity for improvement. A little well informed optimization might obviate the need for either partitioning or more hard drives.

This is interesting too. I obviously would like the best available options to tune the database and the application. Is this detailed in the release notes somewhere, and what tools could I use to take advantage of this? (Are there new/improved details included in the EXPLAIN statement or something?)

>> Finally, if anyone has any comments about my settings listed above that might help improve performance, I thank you in advance.
> Your default statistics target seemed low. Without knowing the nature of your most resource intensive queries or how much memory tomcat is using, it is hard to say more.

Tomcat uses 4G of RAM, plus we have nginx in front using a little and some other, smaller services running on the server in addition to the usual Linux gamut of processes.

> Cheers,
> Jeff
On Thu, Jan 3, 2013 at 4:54 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
> I don't understand why PG doesn't use views_visit_id_index in that query but rather scans the whole table. One explanation I found is that when the resulting dataset constitutes ~15% of the total number of rows in the table, a seq scan is used. In this case the resulting dataset is just 1.5% of the total number of rows. So it must be something different. Any reason why this happens and how to fix it?
>
> Postgres 9.2
> Ubuntu 12.04.1 LTS
> shared_buffers = 4GB; the rest of the settings are the defaults
<snip>

It happens because you lied to the database...heh. In particular, the 'effective_cache_size' setting, which defaults to 128mb. That probably needs to be much, much larger. Basically postgres is figuring the cache is much smaller than the data and starts to favor sequential plans once you hit a certain threshold. If you had a server with only say 256mb ram, it probably *would* be faster.

SQL server probably uses all kinds of crazy native unportable kernel calls to avoid having to make a similar .conf setting. Or maybe it just assumes infinite cache size...dunno.

merlin
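A quick way to test that theory without editing postgresql.conf is a per-session override; a sketch, where the 8GB figure is just a guess at the machine's usable cache:

SHOW effective_cache_size;           -- still 128MB if it was never changed
SET effective_cache_size = '8GB';    -- assumption: roughly the RAM the OS can use for caching
-- then re-run EXPLAIN on the problem query in the same session to see if the plan flips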
On Sunday, January 6, 2013, AJ Weber wrote:
> All fair questions...
> Thank you for your detailed response!
> On 1/4/2013 11:03 PM, Jeff Janes wrote:
>> On Friday, January 4, 2013, AJ Weber wrote:
>>> Hi all,
>>> I have a table that has about 73mm rows in it and growing.
>> How big is the table in MB? Its indexes?
> Not sure on this. Will see if pgAdmin tells me.
It probably does, but from psql command line, you can do \d+ and \di+
>>> Anyway, this table is going to continue to grow, and it's used frequently (Read and Write).
>> Are all rows in the table read and written with equal vigor, or are there hot rows and cold rows that can be recognized based on the row's values?
> No, I could probably figure out a way to setup an "archive" or "older" section of the data that is updated much less frequently.

So the data that delineates this does not exist in that table, but it does exist someplace, either just in your head, or in the column of a higher level table?

> Deletes are rare. Inserts/Updates "yes". Select on existing rows -- very frequent.
If you have little control over your storage and are already IO bound, and the tables are growing rapidly, you may need to rethink that "deletes are rare" bit. So the inserts and updates do target a hot part, while the selects are evenly spread?
In that case, it is very important to know if the slow part are the selects, or the insert and deletes. If the selects are slow, and the hot rows for selects can't be gathered together into a hot partition, then after clustering they will still be slow as the disk will still have to seek all over the place (massive data-mining type selects might be an exception to that, but I wouldn't count on it).
>>> From what I read, this table is a candidate to be partitioned for performance and scalability. I have tested some scripts to build the "inherits" tables with their constraints and the trigger/function to perform the work.
>>> Am I doing the right thing by partitioning this?
>> Probably not. Or at least, you haven't given us the information to know. Very broadly speaking, well-implemented partitioning makes bulk loading and removal operations take less IO, but makes normal operations take more IO, or if lucky leaves it unchanged. There are exceptions, but unless you can identify a very specific reason to think you might have one of those exceptions, then you probably don't.
> I know you can't believe everything you read, but I thought I saw some metrics about when a table's size exceeds some fraction of available RAM, or when it approaches 100mm rows, it's a big candidate for partitioning.
I think it is a matter of semantics. A small table is poor candidate for partitioning even if it has an excellent key to use for partitioning. A large table could be a good candidate up until you realize it doesn't have a good key to use, at which point it stops being a good candidate (in my opinion).
>>> Should I run a cluster or vacuum full after all is done?
>> Probably not. If a cluster after the partitioning would be beneficial, there would be a pretty good chance you could do a cluster *instead* of the partitioning and get the same benefit.
> I did try clustering the table on the PK (which is actually 4 columns), and it appeared to help a bit. I was hoping partitioning was going to help me even more.
Was the order_num (from the parent table) the leading field of the 4 column PK? If not, you might want to reorder the PK so that it is the leading field and cluster again. Or if reordering the PK columns is not convenient, make a new index on the order_num and cluster on that (perhaps dropping the index after the cluster, if it no longer serves a purpose)
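In SQL terms, that suggestion is roughly the following; the index and table names are invented for the example:

-- cluster the detail table by order_num so rows for one order are physically adjacent
CREATE INDEX order_item_order_num_idx ON order_item (order_num);
CLUSTER order_item USING order_item_order_num_idx;
ANALYZE order_item;
-- optionally: DROP INDEX order_item_order_num_idx;  -- if the index serves no other purpose afterwards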
>>> Is there a major benefit if I can upgrade to 9.2.x in some way that I haven't realized?
>> If you have specific queries that are misoptimized and so are generating more IO than they need to, then upgrading could help. On the other hand, it could also make things worse, if a currently well optimized query becomes worse.
> Is there some new feature or optimization you're thinking about with this comment? If so, could you please just send me a link and/or feature name and I'll google it myself?
The main things I am thinking of are the "fudge factor" for large indexes, which is currently being discussed in both performance and hackers mailing lists, which was made overly aggressive in 9.2 and so can make it choose worse plans, and the "allow the planner to generate custom plans for specific parameter values even when using prepared statements" from the 9.2 release notes, which can allow it to choose better plans. But, surely there are other changes as well, which amount to corner cases and so are hard to discuss in the abstract. Which is why instrumentation is important. There isn't much point in worrying about possible changed plans until you've identified the queries that are important to worry about.
>> But, instrumentation has improved in 9.2 from 9.0, so upgrading would make it easier to figure out just which queries are really bad and have the most opportunity for improvement. A little well informed optimization might obviate the need for either partitioning or more hard drives.
> This is interesting too. I obviously would like the best available options to tune the database and the application. Is this detailed in the release notes somewhere, and what tools could I use to take advantage of this? (Are there new/improved details included in the EXPLAIN statement or something?)
track_io_timing is new, and it exposes new data in EXPLAIN (ANALYZE, BUFFERS) as well as in other places. You might not want to turn this on permanently, as it can affect performance (but you can test with pg_test_timing, as outlined in the docs, to see how large the probable effect is). Also, EXPLAIN displays the number of rows removed by filters, which may or may not be useful to you.
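For example, a sketch where the SELECT is just a placeholder for whichever query is being investigated:

SET track_io_timing = on;                             -- 9.2+, per session
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM order_item WHERE order_num = 1234567;  -- hypothetical query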
Most exciting I think are the improvements to the contrib module pg_stat_statements. That would be my first recourse, to find out which of your statements are taking the most time (and/or IO). I try to install and configure this for all of my databases now as a matter of course.
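A minimal setup and a typical "worst offenders" query might look like this sketch (column names as in the 9.2 view):

-- in postgresql.conf (restart required):
--   shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION pg_stat_statements;

-- then, after some normal workload has run:
SELECT query, calls, total_time, rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;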
See the 9.2 release notes (with links therein to the rest of the documentation) for discussion of these.
Cheers,
Jeff
> It probably does, but from psql command line, you can do \d+ and \di+

\d+ doesn't appear to display any size information.

> If you have little control over your storage and are already IO bound, and the tables are growing rapidly, you may need to rethink that "deletes are rare" bit. So the inserts and updates do target a hot part, while the selects are evenly spread?
> In that case, it is very important to know if the slow part are the selects, or the insert and deletes. If the selects are slow, and the hot rows for selects can't be gathered together into a hot partition, then after clustering they will still be slow as the disk will still have to seek all over the place (massive data-mining type selects might be an exception to that, but I wouldn't count on it).

Since order_num is sequential, I could partition on it in broad (sequential) ranges. That would put all recent/new rows in one table-partition that would be a fraction of the size of the overall (unpartitioned) table. I guess that would require manual maintenance over-time (to switch to another, new partition as each grows).

> I think it is a matter of semantics. A small table is poor candidate for partitioning even if it has an excellent key to use for partitioning. A large table could be a good candidate up until you realize it doesn't have a good key to use, at which point it stops being a good candidate (in my opinion).

My first idea to evenly-partition the table was to use the order_num and do a "mod" on it with the number of tables I wanted to use. That would yield a partition-table number of 0-mod, and all rows for the same order would stay within the same partition-table. However, you're right in thinking that a search for orders could -- really WOULD -- require retrieving details from multiple partitions, probably increasing IO. So maybe the sequential partitioning (if at all) is better, just more maintenance down-the-road.

> Was the order_num (from the parent table) the leading field of the 4 column PK? If not, you might want to reorder the PK so that it is the leading field and cluster again. Or if reordering the PK columns is not convenient, make a new index on the order_num and cluster on that (perhaps dropping the index after the cluster, if it no longer serves a purpose)

Yes, the order_num is the first column in the PK, and our main browse queries use, at a minimum, the first 2-3 columns in that PK in their where-clause.

Many thanks again for all the input!

-AJ
On Tue, Jan 8, 2013 at 8:45 AM, AJ Weber <aweber@comcast.net> wrote:
>> It probably does, but from psql command line, you can do \d+ and \di+
> \d+ doesn't appear to display any size information.

It does if you use it without an argument, to display all the tables in the search path:

jjanes=# \d+
                      List of relations
 Schema |       Name       | Type  | Owner  |  Size   | Description
--------+------------------+-------+--------+---------+-------------
 public | pgbench_accounts | table | jjanes | 128 MB  |
 public | pgbench_branches | table | jjanes | 40 kB   |
 public | pgbench_history  | table | jjanes | 0 bytes |
 public | pgbench_tellers  | table | jjanes | 40 kB   |
(4 rows)

It rather annoys me that you actually get less information (no size, no owner) when you use \d+ on a named table. I don't know if there is a reason for that feature, or if it was just an oversight.

>> If you have little control over your storage and are already IO bound, and the tables are growing rapidly, you may need to rethink that "deletes are rare" bit. So the inserts and updates do target a hot part, while the selects are evenly spread?
>> In that case, it is very important to know if the slow part are the selects, or the insert and deletes. If the selects are slow, and the hot rows for selects can't be gathered together into a hot partition, then after clustering they will still be slow as the disk will still have to seek all over the place (massive data-mining type selects might be an exception to that, but I wouldn't count on it).
> Since order_num is sequential, I could partition on it in broad (sequential) ranges. That would put all recent/new rows in one table-partition that would be a fraction of the size of the overall (unpartitioned) table. I guess that would require manual maintenance over-time (to switch to another, new partition as each grows).

Yep. If your selects are concentrated in those recent/new rows, this could be very useful. But, if your selects are not concentrated on the recent/new rows, the benefit would be small.

>> I think it is a matter of semantics. A small table is poor candidate for partitioning even if it has an excellent key to use for partitioning. A large table could be a good candidate up until you realize it doesn't have a good key to use, at which point it stops being a good candidate (in my opinion).
> My first idea to evenly-partition the table was to use the order_num and do a "mod" on it with the number of tables I wanted to use. That would yield a partition-table number of 0-mod,

The current constraint exclusion code is quite simple-minded and doesn't know how to make use of check constraints that use the mod function, so the indexes of all partitions would have to be searched for each order_num-driven query, even though we know the data could only exist in one of them. The constraint exclusion code does understand check constraints that involve ranges.

There could still be some benefit as the table data would be concentrated, even if the index data is not.

> and all rows for the same order would stay within the same partition-table.

But usually a given order_num would only be of interest for a fraction of a second before moving on to some other order_num of interest, so by the time the relevant partition became fully cached, it would no longer be hot. Or, if the partitions were small enough, you could assume that all rows would be dragged into memory when the first one was requested because they lay so close to each other. But it is not feasible to have a large enough number of partitions to make that happen. But if the table is clustered, this is exactly what you would get--the trouble would be keeping it clustered. If most of the line-items are inserted at the same time as each other, they probably should be fairly well clustered to start with.

Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> On Tue, Jan 8, 2013 at 8:45 AM, AJ Weber <aweber@comcast.net> wrote:
>> \d+ doesn't appear to display any size information.
> It does if you use it without an argument, to display all the tables in the search path:
>
> jjanes=# \d+
>                       List of relations
>  Schema |       Name       | Type  | Owner  |  Size   | Description
> --------+------------------+-------+--------+---------+-------------
>  public | pgbench_accounts | table | jjanes | 128 MB  |
>  public | pgbench_branches | table | jjanes | 40 kB   |
>  public | pgbench_history  | table | jjanes | 0 bytes |
>  public | pgbench_tellers  | table | jjanes | 40 kB   |
> (4 rows)
>
> It rather annoys me that you actually get less information (no size, no owner) when you use \d+ on a named table. I don't know if there is a reason for that feature, or if it was just an oversight.

This is actually an abbreviation for \dtisv+, which is a completely different command from "\d table". You can use something like "\dt+ table-pattern" to get a display of the above form for a subset of tables. I agree it ain't too consistent.

regards, tom lane
> It does if you use it without an argument, to display all the tables in the search path:
>
> jjanes=# \d+
>                       List of relations
>  Schema |       Name       | Type  | Owner  |  Size   | Description
> --------+------------------+-------+--------+---------+-------------
>  public | pgbench_accounts | table | jjanes | 128 MB  |
>  public | pgbench_branches | table | jjanes | 40 kB   |
>  public | pgbench_history  | table | jjanes | 0 bytes |
>  public | pgbench_tellers  | table | jjanes | 40 kB   |
> (4 rows)
>
> It rather annoys me that you actually get less information (no size, no owner) when you use \d+ on a named table. I don't know if there is a reason for that feature, or if it was just an oversight.

That is rather peculiar. Sorry for that. Table in question is 9284MB (Parent table is 621MB)

> The current constraint exclusion code is quite simple-minded and doesn't know how to make use of check constraints that use the mod function, so the indexes of all partitions would have to be searched for each order_num-driven query, even though we know the data could only exist in one of them. The constraint exclusion code does understand check constraints that involve ranges.

Hmm. That's a bit of a limitation I didn't know about. I assume it doesn't understand the percent (mod operator) just the same as not understanding the MOD() function? Either way, I guess this strategy does not pan out.

> There could still be some benefit as the table data would be concentrated, even if the index data is not.

I'm reaching way, way back in my head, but I think _some_ RDBMS I worked with previously had a way to "cluster" the rows around a single one of the indexes on the table, thus putting the index and the row-data "together" and reducing the number of IO's to retrieve the row if that index was used. Am I understanding that PG's "cluster" is strictly to group like rows together logically -- table data only, not to coordinate the table row with the index upon which you clustered them?

>> and all rows for the same order would stay within the same partition-table.
> But usually a given order_num would only be of interest for a fraction of a second before moving on to some other order_num of interest, so by the time the relevant partition became fully cached, it would no longer be hot. Or, if the partitions were small enough, you could assume that all rows would be dragged into memory when the first one was requested because they lay so close to each other. But it is not feasible to have a large enough number of partitions to make that happen. But if the table is clustered, this is exactly what you would get--the trouble would be keeping it clustered. If most of the line-items are inserted at the same time as each other, they probably should be fairly well clustered to start with.

Does decreasing the fill to like 90 help keep it clustered in-between times that I could shutdown the app and perform a (re-) cluster on the overall table? Problem is, with a table that size, and the hardware I'm "blessed with", the cluster takes quite a bit of time. :(
On Thursday, January 3, 2013, Alex Vinnik wrote:
> Hi everybody,
> I have implemented my first app using PG DB and thought for a minute (maybe two) that I know something about PG, but the problem below totally destroyed my confidence :). Please help me to restore it.
> Here is a simple join query. It runs just fine on MS SQL 2008 and uses all available indexes, even with a bigger overall dataset.
> Quick performance stats:
> MS SQL: 1 second, 264K rows
If it returns 264K rows in 1 second, then it must have all data in memory. Which prompts a few questions:
Is *all* data in memory, or is it just the data needed for this particular query because you already ran it recently with the same date range?
> PG: 158 seconds, 264K rows
Does the PG machine have enough memory to hold all the data in RAM? If so, does it actually have all the data in RAM? That is, is the cache already warm? Starting from scratch it can take a long time for the cache to warm up naturally. And finally, if all the data is in RAM, does PG know this?
For the last question, the answer is "no", since you are using default settings. You need to lower random_page_cost and probably also seq_page_cost in order to trick PG to think the data is in RAM. Of course if you do this when the data is in fact not in RAM, the result could be catastrophically bad plans. (And when I tried to replicate your situation on anemic hardware, indeed the hash join often was faster than the nested loop on both indexes.)
> Explain plan from both DBs
> PG QUERY PLAN
> Hash Join (cost=12716.17..1101820.09 rows=248494 width=8)
>   Hash Cond: (views.visit_id = visits.id)
>   -> Seq Scan on views (cost=0.00..819136.56 rows=17434456 width=8)
This cost estimate is probably due mostly to seq_page_cost and cpu_tuple_cost, which at their defaults mean the table has 645,000 blocks (819136 - 17434456/100) and each block has ~30 rows.
But you are returning 248,494 rows, or roughly 1 / 2.5 of a row per block. Let's say you need to fetch 200,000 blocks (in random order) to get those rows. Since at default settings fetching 200,000 random blocks is considered as expensive as fetching 800,000 sequential blocks, the index scan you want already looks more expensive than the sequential scan. But, if you want to use the index scan, you also have to fetch the index blocks, which a sequential scan does not need to do. There are probably about 50,000 index blocks, but each one has to be fetched about 5 times (248,494/50,000). Because your effective_cache_size is so low, PG assumes the next time it needs to fetch the same block, it will no longer be in memory and so needs to be fetched again at full random_page_cost.
> It is clear that PG does a full table scan: "Seq Scan on views (cost=0.00..819136.56 rows=17434456 width=8)"
> I don't understand why PG doesn't use views_visit_id_index in that query but rather scans the whole table. One explanation I found is that when the resulting dataset constitutes ~15% of the total number of rows in the table, a seq scan is used.
I don't know where you found that rule of thumb, but it would probably more accurate if it was given in in terms of the percentage of the table's *blocks* scanned, rather than *rows*.
> In this case the resulting dataset is just 1.5% of the total number of rows.
Since there are about 30 rows per block, scanning 1.5% of the rows means scanning somewhat less than 45% of the blocks, assuming the rows are randomly distributed over the blocks. And they are scanned in a less efficient way.
> Postgres 9.2
You are probably getting hit hard by the overly-large "fudge factor" penalty for scans of large indexes, of much discussion recently in regards to 9.2.
> Ubuntu 12.04.1 LTS
> shared_buffers = 4GB; the rest of the settings are the defaults
The default effective_cache_size is almost certainly wrong, and if the analogy to MSSQL to is correct, then random_page_cost almost certainly is as well.
Another poster referred you to the wiki page for suggestion on how to report slow queries, particularly using EXPLAIN (analyze, buffers) rather than merely EXPLAIN. In this case, I would also try setting enable_hashjoin=off and enable_mergejoin=off in the session, in order to force the planner to use the plan you think you want, so we can see what PG thinks of that one.
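Putting those suggestions together, a session-only experiment might look like this sketch; the values are illustrative rather than recommendations, and the SELECT is again only a guess at the real query:

SET effective_cache_size = '8GB';   -- assumption: size to the RAM actually available for caching
SET enable_hashjoin = off;
SET enable_mergejoin = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT views.id, views.visit_id
FROM views
JOIN visits ON views.visit_id = visits.id
WHERE visits.created_at >= '2012-11-15'
  AND visits.created_at <  '2012-11-16';
RESET ALL;   -- put the session back to normal afterwards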
Cheers,
Jeff
Guys, thanks a lot for your input. It is very valuable for us. We plan to set up a separate dev server similar to the production one, copy all data there and try your suggestions, as we really don't want to do it on the production server. I also noticed that IOPS jumps to 100% when running this query. So it is a bit scary to make those changes in production directly. Will report back on the progress and findings.
On Wed, Jan 9, 2013 at 9:49 AM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
> Guys, thanks a lot for your input. It is very valuable for us. We plan to set up a separate dev server similar to the production one, copy all data there and try your suggestions, as we really don't want to do it on the production server. I also noticed that IOPS jumps to 100% when running this query. So it is a bit scary to make those changes in production directly. Will report back on the progress and findings.

nothing wrong with that, but keep in mind you can tweak 'effective_cache_size' for a single session with the 'set' command;

merlin
On Tue, Jan 8, 2013 at 10:04 AM, AJ Weber <aweber@comcast.net> wrote:
>> The current constraint exclusion code is quite simple-minded and doesn't know how to make use of check constraints that use the mod function, so the indexes of all partitions would have to be searched for each order_num-driven query, even though we know the data could only exist in one of them. The constraint exclusion code does understand check constraints that involve ranges.
> Hmm. That's a bit of a limitation I didn't know about. I assume it doesn't understand the percent (mod operator) just the same as not understanding the MOD() function? Either way, I guess this strategy does not pan out.

Yes, it doesn't detect either. It would use it if you formulate every equality query with an extra restriction: "where id=1234567 and mod(id,100)=67" or whatever. (But I was surprised that % and mod() are not recognized as being equivalent. If you specify it one way in the check constraint, you need to use the same "spelling" in the where clause.)

>> There could still be some benefit as the table data would be concentrated, even if the index data is not.
> I'm reaching way, way back in my head, but I think _some_ RDBMS I worked with previously had a way to "cluster" the rows around a single one of the indexes on the table, thus putting the index and the row-data "together" and reducing the number of IO's to retrieve the row if that index was used.

In Oracle this is called an "index organized table" or IOT (or it was at one point; they have the habit of renaming most of their features with each release). I don't know what other RDBMSs call it. Supporting secondary indexes when the table data could move around was quite intricate/weird. PG doesn't have this index-organized-table feature--it has been discussed, but I don't know of any currently active effort to add it.

There is another feature, sometimes called clustering, in which the rows from different tables can be mingled together in the same block. So both the parent order and the child order_line_item that have the same order_num (i.e. the join column) would be in the same block. So once you query for a specific order and did the necessary IO, the corresponding order_line_item rows would already be in memory. I thought this was interesting, but I don't know how often it was actually used.

> Am I understanding that PG's "cluster" is strictly to group like rows together logically -- table data only, not to coordinate the table row with the index upon which you clustered them?

They are coordinated in a sense. Not as one single structure, but as two structures in parallel.

>>> and all rows for the same order would stay within the same partition-table.
>> But usually a given order_num would only be of interest for a fraction of a second before moving on to some other order_num of interest, so by the time the relevant partition became fully cached, it would no longer be hot. Or, if the partitions were small enough, you could assume that all rows would be dragged into memory when the first one was requested because they lay so close to each other. But it is not feasible to have a large enough number of partitions to make that happen. But if the table is clustered, this is exactly what you would get--the trouble would be keeping it clustered. If most of the line-items are inserted at the same time as each other, they probably should be fairly well clustered to start with.
> Does decreasing the fill to like 90 help keep it clustered in-between times that I could shutdown the app and perform a (re-) cluster on the overall table? Problem is, with a table that size, and the hardware I'm "blessed with", the cluster takes quite a bit of time. :(

Probably not. If the data starts out clustered and gets updated a lot, lowering the fill factor might be able to prevent some de-clustering due to row migration. But when you insert new rows, PG makes no effort to put them near existing rows with the same key. (In a hypothetical future in which that did happen, lowering the fill factor would then probably help.)

Cheers,
Jeff
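To illustrate the difference being described: with a range CHECK constraint the planner can prune to one child, while a mod()-based CHECK only helps if the query repeats the expression with the same spelling. A sketch, continuing the earlier hypothetical names:

SET constraint_exclusion = partition;   -- the default since 8.4

-- range CHECK (order_num >= 1 AND order_num < 10000000): this can be pruned
EXPLAIN SELECT * FROM order_item WHERE order_num = 1234567;

-- mod()-based CHECK: only excluded if the predicate is spelled out explicitly,
-- using the same form (mod() vs %) as the constraint
EXPLAIN SELECT * FROM order_item
WHERE order_num = 1234567 AND mod(order_num, 100) = 67;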
It sure turned out that the default settings are not a good fit. Setting random_page_cost to 1.0 made the query run in 2.6 seconds and I clearly see that indexes are being used in the explain plan and IO utilization is close to 0.
QUERY PLAN
Sort (cost=969787.23..970288.67 rows=200575 width=8) (actual time=2176.045..2418.162 rows=241238 loops=1)
Sort Method: external sort Disk: 4248kB
-> Nested Loop (cost=0.00..950554.81 rows=200575 width=8) (actual time=0.048..1735.357 rows=241238 loops=1)
-> Index Scan using visits_created_at_index on visits (cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..178.591 rows=136021 loops=1)
Index Cond: ((created_at >= '2012-12-15 00:00:00'::timestamp without time zone) AND (created_at < '2012-12-16 00:00:00'::timestamp without time zone))
-> Index Scan using views_visit_id_index on views (cost=0.00..11.33 rows=12 width=8) (actual time=0.004..0.006 rows=2 loops=136021)
Index Cond: (visit_id = visits.id)
Total runtime: 2635.169 ms
However I noticed that the sorting is done using disk ("external sort Disk: 4248kB"), which prompted me to take a look at work_mem. But it turned out that a small increase to 4MB from the default 1MB turns off index usage and the query gets 10x slower. IO utilization jumped to 100% from literally nothing, so back to square one...
QUERY PLAN
Sort (cost=936642.75..937144.19 rows=200575 width=8) (actual time=33200.762..33474.443 rows=241238 loops=1)
Sort Method: external merge Disk: 4248kB
-> Hash Join (cost=6491.17..917410.33 rows=200575 width=8) (actual time=7156.498..32723.221 rows=241238 loops=1)
Hash Cond: (views.visit_id = visits.id)
-> Seq Scan on views (cost=0.00..832189.95 rows=8768395 width=8) (actual time=0.100..12126.342 rows=8200704 loops=1)
-> Hash (cost=5459.16..5459.16 rows=82561 width=4) (actual time=353.683..353.683 rows=136021 loops=1)
Buckets: 16384 Batches: 2 (originally 1) Memory Usage: 4097kB
-> Index Scan using visits_created_at_index on visits (cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..175.051 rows=136021 loops=1)
Index Cond: ((created_at >= '2012-12-15 00:00:00'::timestamp without time zone) AND (created_at < '2012-12-16 00:00:00'::timestamp without time zone))
Total runtime: 33698.000 ms
Basically PG is going through all views again and not using "Index Scan using views_visit_id_index on views". Looks like setting work_mem confuses the planner somehow. Any idea what can be done to do the sorting in memory? I suspect it should make the query even faster. Thanks -Alex
nothing wrong with that, but keep in mind you can tweak
'effective_cache_size' for a single session with 'set' command;
merlin
On Mon, Jan 28, 2013 at 5:43 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
> It sure turned out that the default settings are not a good fit.
do you know pgtune?
it's a good tool for starters, if you want a fast postgres and don't really want to learn what's behind the scenes.
random_page_cost=1 might not be what you really want.
it would mean that random reads are as fast as sequential reads, which probably is true only for SSD
Filip
On Mon, Jan 28, 2013 at 5:43 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
> It sure turned out that the default settings are not a good fit. Setting random_page_cost to 1.0 made the query run in 2.6 seconds and I clearly see that indexes are being used in the explain plan and IO utilization is close to 0.
>
> QUERY PLAN
> Sort (cost=969787.23..970288.67 rows=200575 width=8) (actual time=2176.045..2418.162 rows=241238 loops=1)
>   Sort Key: visits.id, views.id
>   Sort Method: external sort Disk: 4248kB
>   -> Nested Loop (cost=0.00..950554.81 rows=200575 width=8) (actual time=0.048..1735.357 rows=241238 loops=1)
>      -> Index Scan using visits_created_at_index on visits (cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..178.591 rows=136021 loops=1)
>         Index Cond: ((created_at >= '2012-12-15 00:00:00'::timestamp without time zone) AND (created_at < '2012-12-16 00:00:00'::timestamp without time zone))
>      -> Index Scan using views_visit_id_index on views (cost=0.00..11.33 rows=12 width=8) (actual time=0.004..0.006 rows=2 loops=136021)
>         Index Cond: (visit_id = visits.id)
> Total runtime: 2635.169 ms
>
> However I noticed that the sorting is done using disk ("external sort Disk: 4248kB"), which prompted me to take a look at work_mem. But it turned out that a small increase to 4MB from the default 1MB turns off index usage and the query gets 10x slower. IO utilization jumped to 100% from literally nothing, so back to square one...
>
> QUERY PLAN
> Sort (cost=936642.75..937144.19 rows=200575 width=8) (actual time=33200.762..33474.443 rows=241238 loops=1)
>   Sort Key: visits.id, views.id
>   Sort Method: external merge Disk: 4248kB
>   -> Hash Join (cost=6491.17..917410.33 rows=200575 width=8) (actual time=7156.498..32723.221 rows=241238 loops=1)
>      Hash Cond: (views.visit_id = visits.id)
>      -> Seq Scan on views (cost=0.00..832189.95 rows=8768395 width=8) (actual time=0.100..12126.342 rows=8200704 loops=1)
>      -> Hash (cost=5459.16..5459.16 rows=82561 width=4) (actual time=353.683..353.683 rows=136021 loops=1)
>         Buckets: 16384 Batches: 2 (originally 1) Memory Usage: 4097kB
>         -> Index Scan using visits_created_at_index on visits (cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..175.051 rows=136021 loops=1)
>            Index Cond: ((created_at >= '2012-12-15 00:00:00'::timestamp without time zone) AND (created_at < '2012-12-16 00:00:00'::timestamp without time zone))
> Total runtime: 33698.000 ms
>
> Basically PG is going through all views again and not using "Index Scan using views_visit_id_index on views". Looks like setting work_mem confuses the planner somehow. Any idea what can be done to do the sorting in memory? I suspect it should make the query even faster. Thanks -Alex

hm, what happens when you set work_mem a fair amount higher? (say, 64mb). You can set it for one session by going "set work_mem='64mb';" as opposed to the entire server in postgresql.conf.

merlin
On Mon, Jan 28, 2013 at 6:55 PM, Filip Rembiałkowski <plk.zuber@gmail.com> wrote:
> On Mon, Jan 28, 2013 at 5:43 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
>> It sure turned out that the default settings are not a good fit.
> do you know pgtune?
> it's a good tool for starters, if you want a fast postgres and don't really want to learn what's behind the scenes.

Yeah.. I came across pgtune but noticed that the latest version is dated 2009-10-29 http://pgfoundry.org/frs/?group_id=1000416 which is kind of outdated. Tar file has settings for pg 8.3. Is it still relevant?

> random_page_cost=1 might not be what you really want.
> it would mean that random reads are as fast as sequential reads, which probably is true only for SSD

What random_page_cost would be more appropriate for EC2 EBS Provisioned volume that can handle 2,000 IOPS?

> Filip
Setting work_mem to 64MB triggers an in-memory sort, but look what happens with the views lookup. PG goes through all records there ("Seq Scan on views") instead of using the visit_id index, and I have only a subset of the real data to play around with. I can imagine what the cost would be running it against a bigger dataset. Something else is in play here that makes the planner take this route. Any ideas how to gain more insight into the planner's inner workings?
QUERY PLAN
Sort (cost=960280.46..960844.00 rows=225414 width=8) (actual time=23328.040..23537.126 rows=209401 loops=1)
Sort Method: quicksort Memory: 15960kB
-> Hash Join (cost=8089.16..940238.66 rows=225414 width=8) (actual time=6622.072..22995.890 rows=209401 loops=1)
Hash Cond: (views.visit_id = visits.id)
-> Seq Scan on views (cost=0.00..831748.05 rows=8724205 width=8) (actual time=0.093..10552.306 rows=6995893 loops=1)
-> Hash (cost=6645.51..6645.51 rows=115492 width=4) (actual time=307.389..307.389 rows=131311 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 4617kB
-> Index Scan using visits_created_at_index on visits (cost=0.00..6645.51 rows=115492 width=4) (actual time=0.040..163.151 rows=131311 loops=1)
Index Cond: ((created_at >= '2013-01-15 00:00:00'::timestamp without time zone) AND (created_at < '2013-01-16 00:00:00'::timestamp without time zone))
Total runtime: 23733.045 ms
On Mon, Jan 28, 2013 at 8:31 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Mon, Jan 28, 2013 at 5:43 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
> It sure turned out that default settings are not a good fit. Setting
> random_page_cost to 1.0 made query to run in 2.6 seconds and I clearly see
> that indexes are being used in explain plan and IO utilization is close to
> 0.
>
> QUERY PLAN
> Sort (cost=969787.23..970288.67 rows=200575 width=8) (actual
> time=2176.045..2418.162 rows=241238 loops=1)
> Sort Key: visits.id, views.id
> Sort Method: external sort Disk: 4248kB
> -> Nested Loop (cost=0.00..950554.81 rows=200575 width=8) (actual
> time=0.048..1735.357 rows=241238 loops=1)
> -> Index Scan using visits_created_at_index on visits
> (cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..178.591
> rows=136021 loops=1)
> Index Cond: ((created_at >= '2012-12-15 00:00:00'::timestamp
> without time zone) AND (created_at < '2012-12-16 00:00:00'::timestamp
> without time zone))
> -> Index Scan using views_visit_id_index on views
> (cost=0.00..11.33 rows=12 width=8) (actual time=0.004..0.006 rows=2
> loops=136021)
> Index Cond: (visit_id = visits.id)
> Total runtime: 2635.169 ms
>
> However I noticed that sorting is done using disk("external sort Disk:
> 4248kB") which prompted me to take a look at work_mem. But it turned out
> that small increase to 4MB from default 1MB turns off index usage and query
> gets x10 slower. IO utilization jumped to 100% from literally nothing. so
> back to square one...
>
> QUERY PLAN
> Sort (cost=936642.75..937144.19 rows=200575 width=8) (actual
> time=33200.762..33474.443 rows=241238 loops=1)
> Sort Key: visits.id, views.id
> Sort Method: external merge Disk: 4248kB
> -> Hash Join (cost=6491.17..917410.33 rows=200575 width=8) (actual
> time=7156.498..32723.221 rows=241238 loops=1)
> Hash Cond: (views.visit_id = visits.id)
> -> Seq Scan on views (cost=0.00..832189.95 rows=8768395 width=8)
> (actual time=0.100..12126.342 rows=8200704 loops=1)
> -> Hash (cost=5459.16..5459.16 rows=82561 width=4) (actual
> time=353.683..353.683 rows=136021 loops=1)
> Buckets: 16384 Batches: 2 (originally 1) Memory Usage:
> 4097kB
> -> Index Scan using visits_created_at_index on visits
> (cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..175.051
> rows=136021 loops=1)
> Index Cond: ((created_at >= '2012-12-15
> 00:00:00'::timestamp without time zone) AND (created_at < '2012-12-16
> 00:00:00'::timestamp without time zone))
> Total runtime: 33698.000 ms
>
> Basically PG is going through all views again and not using "Index Scan
> using views_visit_id_index on views". Looks like setting work_mem confuses
> planner somehow. Any idea what can be done to do sorting in memory. I
> suspect it should make query even more faster. Thanks -Alex
hm, what happens when you set work_mem a fair amount higher? (say,
64mb). You can set it for one session by going "set work_mem='64mb';
" as opposed to the entire server in postgresql.conf.
merlin
On Tue, Jan 29, 2013 at 8:24 AM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
> On Mon, Jan 28, 2013 at 6:55 PM, Filip Rembiałkowski <plk.zuber@gmail.com> wrote:
>> do you know pgtune?
>> it's a good tool for starters, if you want a fast postgres and don't really want to learn what's behind the scenes.
> Yeah.. I came across pgtune but noticed that the latest version is dated 2009-10-29 http://pgfoundry.org/frs/?group_id=1000416 which is kind of outdated. Tar file has settings for pg 8.3. Is it still relevant?

Yes, I'm sure it will not do anything bad to your config.

>> random_page_cost=1 might not be what you really want.
>> it would mean that random reads are as fast as sequential reads, which probably is true only for SSD
> What random_page_cost would be more appropriate for EC2 EBS Provisioned volume that can handle 2,000 IOPS?
I'd say: don't guess. Measure.
Use any tool that can test sequential disk block reads versus random disk block reads.
bonnie++ is quite popular.
Filip
On Tue, Jan 29, 2013 at 8:41 AM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
> Setting work_mem to 64MB triggers an in-memory sort, but look what happens with the views lookup. PG goes through all records there ("Seq Scan on views") instead of using the visit_id index, and I have only a subset of the real data to play around with. I can imagine what the cost would be running it against a bigger dataset. Something else is in play here that makes the planner take this route. Any ideas how to gain more insight into the planner's inner workings?

did you set effective_cache_size as noted upthread?

merlin
On Jan 29, 2013, at 6:24 AM, Alex Vinnik wrote:
>> random_page_cost=1 might not be what you really want.
>> it would mean that random reads are as fast as sequential reads, which probably is true only for SSD
> What random_page_cost would be more appropriate for EC2 EBS Provisioned volume that can handle 2,000 IOPS?
For EC2 Provisioned IOPS volumes - not standard EBS - random_page_cost=1 is exactly what you want.
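For anyone wanting to try Ben's suggestion, a sketch of the two usual ways to change the setting without restarting the server ("mydb" below is a placeholder database name, not one from this thread):

SET random_page_cost = 1.0;                       -- current session only
ALTER DATABASE mydb SET random_page_cost = 1.0;   -- applies to new connections to mydb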
On Tue, Jan 29, 2013 at 11:39 AM, Ben Chobot <bench@silentmedia.com> wrote:
> On Jan 29, 2013, at 6:24 AM, Alex Vinnik wrote:
>>> random_page_cost=1 might be not what you really want.
>>> it would mean that random reads are as fast as sequential reads, which probably is true only for SSD
>> What random_page_cost would be more appropriate for EC2 EBS Provisioned volume that can handle 2,000 IOPS?
> For EC2 Provisioned IOPS volumes - not standard EBS - random_page_cost=1 is exactly what you want.

Well... after some experimentation it turned out that random_page_cost=0.6 gives me a fast query
QUERY PLAN
Sort (cost=754114.96..754510.46 rows=158199 width=8) (actual time=1839.324..2035.405 rows=209401 loops=1)
Sort Key: visits.id, views.id
Sort Method: quicksort Memory: 15960kB
-> Nested Loop (cost=0.00..740453.38 rows=158199 width=8) (actual time=0.048..1531.592 rows=209401 loops=1)
-> Index Scan using visits_created_at_index on visits (cost=0.00..5929.82 rows=115492 width=4) (actual time=0.032..161.488 rows=131311 loops=1)
Index Cond: ((created_at >= '2013-01-15 00:00:00'::timestamp without time zone) AND (created_at < '2013-01-16 00:00:00'::timestamp without time zone))
-> Index Scan using views_visit_id_index on views (cost=0.00..6.26 rows=10 width=8) (actual time=0.003..0.005 rows=2 loops=131311)
Index Cond: (visit_id = visits.id)
Total runtime: 2234.142 ms
random_page_cost=0.7 slows it down 16 times
Sort (cost=804548.42..804943.92 rows=158199 width=8) (actual time=37011.337..37205.449 rows=209401 loops=1)
Sort Key: visits.id, views.id
Sort Method: quicksort Memory: 15960kB
-> Merge Join (cost=15871.37..790886.85 rows=158199 width=8) (actual time=35673.602..36714.056 rows=209401 loops=1)
Merge Cond: (visits.id = views.visit_id)
-> Sort (cost=15824.44..16113.17 rows=115492 width=4) (actual time=335.486..463.085 rows=131311 loops=1)
Sort Key: visits.id
Sort Method: quicksort Memory: 12300kB
-> Index Scan using visits_created_at_index on visits (cost=0.00..6113.04 rows=115492 width=4) (actual time=0.034..159.326 rows=131311 loops=1)
Index Cond: ((created_at >= '2013-01-15 00:00:00'::timestamp without time zone) AND (created_at < '2013-01-16 00:00:00'::timestamp without time zone))
-> Index Scan using views_visit_id_visit_buoy_index on views (cost=0.00..757596.22 rows=6122770 width=8) (actual time=0.017..30765.316 rows=5145902 loops=1)
Total runtime: 37407.174 ms
I am totally puzzled now...
On Tue, Jan 29, 2013 at 12:59 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
> Well... after some experimentation it turned out that random_page_cost=0.6
> gives me fast query
> [...]
> random_page_cost=0.7 slows it down 16 times
> [...]
> -> Index Scan using views_visit_id_visit_buoy_index on views
> (cost=0.00..757596.22 rows=6122770 width=8) (actual time=0.017..30765.316
> rows=5145902 loops=1)

Something is awry here. pg is doing an index scan via
views_visit_id_visit_buoy_index with no matching condition. What's
the definition of that index? The reason why the random_page_cost
adjustment is working is that you are highly penalizing sequential
type scans so that the database is avoiding the merge (sort A, sort B,
stepwise compare).

SQL server is doing a nestloop/index scan, just like the faster pg
plan, but is a bit faster because it's parallelizing.

merlin
On Mon, Jan 28, 2013 at 3:43 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
> It sure turned out that default settings are not a good fit. Setting
> random_page_cost to 1.0 made query to run in 2.6 seconds and I clearly see
> that indexes are being used in explain plan and IO utilization is close to
> 0.
>
> QUERY PLAN
> Sort (cost=969787.23..970288.67 rows=200575 width=8) (actual
> time=2176.045..2418.162 rows=241238 loops=1)
> Sort Key: visits.id, views.id
> Sort Method: external sort Disk: 4248kB

What query are you running? The query you originally showed us should
not be doing this sort in the first place.

Cheers,
Jeff
On Tue, Jan 29, 2013 at 2:06 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Sort Key: visits.id, views.id
> Sort Method: external sort Disk: 4248kB
What query are you running? The query you originally showed us should
not be doing this sort in the first place.
Cheers,
Jeff
Here is the query
select visits.id, views.id
from visits join views on visits.id = views.visit_id
where visits.created_at >= '01/15/2013' and visits.created_at < '01/16/2013'
The original query didn't have an order by clause.
Here is the query plan w/o order by:
Merge Join (cost=18213.46..802113.80 rows=182579 width=8) (actual time=144443.693..145469.499 rows=209401 loops=1)
Merge Cond: (visits.id = views.visit_id)
-> Sort (cost=18195.47..18523.91 rows=131373 width=4) (actual time=335.496..464.929 rows=131311 loops=1)
Sort Key: visits.id
Sort Method: quicksort Memory: 12300kB
-> Index Scan using visits_created_at_index on visits (cost=0.00..7026.59 rows=131373 width=4) (actual time=0.037..162.047 rows=131311 loops=1)
Index Cond: ((created_at >= '2013-01-15 00:00:00'::timestamp without time zone) AND (created_at < '2013-01-16 00:00:00'::timestamp without time zone))
-> Index Scan using views_visit_id_visit_buoy_index on views (cost=0.00..766120.99 rows=6126002 width=8) (actual time=18.960..140565.130 rows=4014837 loops=1)
Total runtime: 145664.274 ms
index definition
CREATE INDEX views_visit_id_visit_buoy_index ON views USING btree (visit_id, visit_buoy)
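One way to see what that merge join has to walk is to compare the on-disk sizes of the table and its indexes; a small sketch using the object names from this thread:

SELECT relname,
       pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('views',
                  'views_visit_id_index',
                  'views_visit_id_visit_buoy_index');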
On Tue, Jan 29, 2013 at 1:35 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Tue, Jan 29, 2013 at 12:59 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
>
>
>
> On Tue, Jan 29, 2013 at 11:39 AM, Ben Chobot <bench@silentmedia.com> wrote:
>>
>> On Jan 29, 2013, at 6:24 AM, Alex Vinnik wrote:
>>
>>> random_page_cost=1 might be not what you really want.
>>> it would mean that random reads are as fast as as sequential reads, which
>>> probably is true only for SSD
>>
>> What randon_page_cost would be more appropriate for EC2 EBS Provisioned
>> volume that can handle 2,000 IOPS?
>>
>>
>> For EC2 Provisioned IOPS volumes - not standard EBS - random_page_cost=1
>> is exactly what you want.
>>
> Well... after some experimentation it turned out that random_page_cost=0.6
> gives me fast query
>
> QUERY PLAN
> Sort (cost=754114.96..754510.46 rows=158199 width=8) (actual
> time=1839.324..2035.405 rows=209401 loops=1)
> Sort Key: visits.id, views.id
> Sort Method: quicksort Memory: 15960kB
> -> Nested Loop (cost=0.00..740453.38 rows=158199 width=8) (actual
> time=0.048..1531.592 rows=209401 loops=1)
> -> Index Scan using visits_created_at_index on visits
> (cost=0.00..5929.82 rows=115492 width=4) (actual time=0.032..161.488
> rows=131311 loops=1)
> Index Cond: ((created_at >= '2013-01-15 00:00:00'::timestamp
> without time zone) AND (created_at < '2013-01-16 00:00:00'::timestamp
> without time zone))
> -> Index Scan using views_visit_id_index on views (cost=0.00..6.26
> rows=10 width=8) (actual time=0.003..0.005 rows=2 loops=131311)
> Index Cond: (visit_id = visits.id)
> Total runtime: 2234.142 ms
>
> random_page_cost=0.7 slows it down 16 times
>
> Sort (cost=804548.42..804943.92 rows=158199 width=8) (actual
> time=37011.337..37205.449 rows=209401 loops=1)
> Sort Key: visits.id, views.id
> Sort Method: quicksort Memory: 15960kB
> -> Merge Join (cost=15871.37..790886.85 rows=158199 width=8) (actual
> time=35673.602..36714.056 rows=209401 loops=1)
> Merge Cond: (visits.id = views.visit_id)
> -> Sort (cost=15824.44..16113.17 rows=115492 width=4) (actual
> time=335.486..463.085 rows=131311 loops=1)
> Sort Key: visits.id
> Sort Method: quicksort Memory: 12300kB
> -> Index Scan using visits_created_at_index on visits
> (cost=0.00..6113.04 rows=115492 width=4) (actual time=0.034..159.326
> rows=131311 loops=1)
> Index Cond: ((created_at >= '2013-01-15
> 00:00:00'::timestamp without time zone) AND (created_at < '2013-01-16
> 00:00:00'::timestamp without time zone))
> -> Index Scan using views_visit_id_visit_buoy_index on views
> (cost=0.00..757596.22 rows=6122770 width=8) (actual time=0.017..30765.316
> rows=5145902 loops=1)
Something is awry here. pg is doing an index scan via
views_visit_id_visit_buoy_index with no matching condition. What's
the definition of that index? The reason why the random_page_cost
adjustment is working is that you are highly penalizing sequential
type scans so that the database is avoiding the merge (sort A, sort B,
stepwise compare).
SQL server is doing a nestloop/index scan, just like the faster pg
plan, but is a bit faster because it's parallelizing.
merlin
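A diagnostic sketch along the lines of Merlin's point: rather than pushing random_page_cost below 1.0, disable the merge join for a single transaction and see whether the planner then picks the nested-loop plan on its own (the select list below is assumed from the plan's sort key, so adjust it to the real query):

BEGIN;
SET LOCAL enable_mergejoin = off;   -- diagnostic only, not a production setting
EXPLAIN ANALYZE
SELECT visits.id, views.id
FROM visits
JOIN views ON views.visit_id = visits.id
WHERE visits.created_at >= '2013-01-15'
  AND visits.created_at <  '2013-01-16';
ROLLBACK;   -- the SET LOCAL ends with the transaction either way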
On Mon, Jan 28, 2013 at 4:55 PM, Filip Rembiałkowski <plk.zuber@gmail.com> wrote:
> On Mon, Jan 28, 2013 at 5:43 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
>> It sure turned out that default settings are not a good fit.
>
> do you know pgtune?
> it's a good tool for starters, if you want a fast postgres and don't really
> want to learn what's behind the scenes.
>
> random_page_cost=1 might be not what you really want.
> it would mean that random reads are as fast as sequential reads, which
> probably is true only for SSD

Or that the "reads" are cached and coming from RAM, which is almost surely
the case here.

Cheers,
Jeff
On Monday, January 28, 2013, Alex Vinnik wrote:
> It sure turned out that default settings are not a good fit. Setting random_page_cost to 1.0 made query to run in 2.6 seconds and I clearly see that indexes are being used in explain plan and IO utilization is close to 0.

This is not surprising. The default settings are aimed at quite small servers, while you seem to have a rather substantial one. Have you done anything yet to fix effective_cache_size?

> QUERY PLAN
> Sort (cost=969787.23..970288.67 rows=200575 width=8) (actual time=2176.045..2418.162 rows=241238 loops=1)
>
> However I noticed that sorting is done using disk ("external sort Disk: 4248kB")

As far as pgsql is concerned, it is using the disk. But the kernel is probably buffering that to an extent that the disk is probably being touched rather little. So I wouldn't worry about it.

> which prompted me to take a look at work_mem. But it turned out that a small increase to 4MB from the default 1MB turns off index usage and the query gets 10x slower. IO utilization jumped to 100% from literally nothing. So back to square one...
>
> QUERY PLAN
> Sort (cost=936642.75..937144.19 rows=200575 width=8) (actual time=33200.762..33474.443 rows=241238 loops=1)
And why should the IO utilization have jumped? Is everything in memory, or is it not? You should run your EXPLAINs with (analyze, buffers), and also you should turn on track_io_timing, at least in the local session; that will give us some insights.
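A sketch of what Jeff is asking for, run in one session against the same date range (track_io_timing is available from 9.2 but changing it requires superuser; the select list is again assumed from the plan):

SET track_io_timing = on;   -- superuser-only setting in 9.2
EXPLAIN (ANALYZE, BUFFERS)
SELECT visits.id, views.id
FROM visits
JOIN views ON views.visit_id = visits.id
WHERE visits.created_at >= '2012-12-15'
  AND visits.created_at <  '2012-12-16';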
If everything is in memory, then why is the seq scan taking so long? If not, then why is the nested loop such a good idea? (In my hands, when everything does *not* fit in memory, the nested loop is very very bad)
You seem to have a bit of an infatuation with Dec 15th, running that one query over and over and over. Why? If the real
live query is not just for that one day repeatedly, then you should test with different days, not just one day repeatedly. (And if your real query really is like the movie "Groundhog Day", you should probably cluster or partition with that in mind.)
Anyway, there was an issue introduced in 9.2.0 and to be removed in 9.2.3 which over-penalized nested loops that had large indexes on the inner side. Since your different plans are so close to each other in estimated cost, I think this issue would be enough to tip it into the seq scan. Also, your poor setting of effective_cache_size might also be enough to tip it. And both combined, almost certainly are.
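For what it's worth, effective_cache_size is only a planner estimate of how much data the OS page cache plus shared_buffers can hold, so it can be tried per session before editing postgresql.conf; the value below is an illustrative assumption, not a recommendation for this particular server:

SET effective_cache_size = '6GB';   -- planner hint only; allocates no memory
-- then re-run EXPLAIN (ANALYZE, BUFFERS) and see whether the plan changes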
But ultimately, I think you are optimizing for a case that does not actually exist.
Cheers,
Jeff
>> Yeah.. I came across pgtune but noticed that the latest version is dated 2009-10-29 http://pgfoundry.org/frs/?group_id=1000416 which is kind of outdated. The tar file has settings for pg 8.3. Is it still relevant?
> Yes, I'm sure it will not do anything bad to your config.

Apologies for leaping in a little late, but I note the version on GitHub has been updated much more recently:
Cheers,
Dan