Thread: Simple join doesn't use index

Simple join doesn't use index

From

Alex Vinnik

Date:

04 January 2013, 01:54:16

Hi everybody,

I have implemented my first app using PG DB and thought for a minute(may be two) that I know something about PG but below problem totally destroyed my confidence :). Please help me to restore it.

Here is simple join query. It runs just fine on MS SQL 2008 and uses all available indexes using even bigger overall dataset.

select visits.id, views.id

from visits join views on visits.id = views.visit_id

where visits.created_at >= '11/15/2012' and visits.created_at < '11/16/2012'

Quick performance stat

MS SQL: 1 second, 264K rows

PG: 158 seconds, 264K rows

Explain plan from both DBs

PG QUERY PLAN

Hash Join (cost=12716.17..1101820.09 rows=248494 width=8)

Hash Cond: (views.visit_id = visits.id)

-> Seq Scan on views (cost=0.00..819136.56 rows=17434456 width=8)

-> Hash (cost=10549.16..10549.16 rows=132081 width=4)

-> Index Scan using visits_created_at_index on visits (cost=0.00..10549.16 rows=132081 width=4)

Index Cond: ((created_at >= '2012-11-15 00:00:00'::timestamp without time zone) AND (created_at < '2012-11-16 00:00:00'::timestamp without time zone))

schemaname | tablename | indexname | tablespace | indexdef

------------+-----------+---------------------------------+------------+------------------------------------------------------------------------------------------

public | views | views_pkey | | CREATE UNIQUE INDEX views_pkey ON views USING btree (id)

public | views | views_visit_id_index | | CREATE INDEX views_visit_id_index ON views USING btree (visit_id)

MS SQL Query plan

'11/16/2012'

|--Parallelism(Gather Streams)

|--Nested Loops(Inner Join, OUTER REFERENCES:([visits].[id], [Expr1006]) OPTIMIZED WITH UNORDERED PREFETCH)

|--Index Seek(OBJECT:([visits].[test]), SEEK:([visits].[created_at] >= '2012-11-15 00:00:00.000' AND [visits].[created_at] < '2012-11-16 00:00:00.000') ORDERED FORWARD)

|--Index Seek(OBJECT:([views].[views_visit_id_index]), SEEK:([views].[visit_id]=[raw_visits].[id]) ORDERED FORWARD)

It is clear that PG does full table scan "Seq Scan on views (cost=0.00..819136.56 rows=17434456 width=8)"

Don't understand why PG doesn't use views_visit_id_index in that query but rather scans whole table. One explanation I have found that when resulting dataset constitutes ~15% of total number of rows in the table then seq scan is used. In this case resulting dataset is just 1.5% of total number of rows. So it must be something different. Any reason why it happens and how to fix it?

Postgres 9.2

Ubuntu 12.04.1 LTS

shared_buffers = 4GB the rest of the settings are default ones

Thanks

-Alex

Re: Simple join doesn't use index

From

Jeremy Harris

Date:

04 January 2013, 02:11:49

On 01/03/2013 10:54 PM, Alex Vinnik wrote:
> I have implemented my first app using PG DB and thought for a minute(may be
> two) that I know something about PG but below problem totally destroyed my
> confidence :). Please help me to restore it.

https://wiki.postgresql.org/wiki/SlowQueryQuestions
--
Jeremy

Re: Simple join doesn't use index

From

Stefan Andreatta

Date:

04 January 2013, 07:33:16

On 01/03/2013 11:54 PM, Alex Vinnik wrote:
> Don't understand why PG doesn't use views_visit_id_index in that query
> but rather scans whole table. One explanation I have found that when
> resulting dataset constitutes ~15% of total number of rows in the table
> then seq scan is used. In this case resulting dataset is just 1.5% of
> total number of rows. So it must be something different. Any reason why
> it happens and how to fix it?

But does the query planner know the same? If you added the EXPLAIN
ANALYZE output of the query and something like:

  SELECT tablename AS table_name, attname AS column_name,
         null_frac, avg_width, n_distinct, correlation
      FROM pg_stats
      WHERE tablename in ('views', 'visits');

.. one could possibly tell a bit more.

> Postgres 9.2
> Ubuntu 12.04.1 LTS
> shared_buffers = 4GB the rest of the settings are default ones

There are more than just this one memory related value, that need to be
changed for optimal performance. E.g. effective_cache_size can have a
direct effect on use of nested loops. See:

http://www.postgresql.org/docs/9.2/static/runtime-config-query.html

Regards,
Stefan

Partition table in 9.0.x?

From

AJ Weber

Date:

05 January 2013, 00:31:59

Hi all,

I have a table that has about 73mm rows in it and growing.  Running
9.0.x on a server that unfortunately is a little I/O constrained.  Some
(maybe) pertinent settings:
default_statistics_target = 50
maintenance_work_mem = 512MB
constraint_exclusion = on
effective_cache_size = 5GB
work_mem = 18MB
wal_buffers = 8MB
checkpoint_segments = 32
shared_buffers = 2GB

The server has 12GB RAM, 4 cores, but is shared with a big webapp
running in Tomcat -- and I only have a RAID1 disk to work on.  Woes me...

Anyway, this table is going to continue to grow, and it's used
frequently (Read and Write).  From what I read, this table is a
candidate to be partitioned for performance and scalability.  I have
tested some scripts to build the "inherits" tables with their
constraints and the trigger/function to perform the work.

Am I doing the right thing by partitioning this?  If so, and I can
afford some downtime, is dumping the table via pg_dump and then loading
it back in the best way to do this?

Should I run a cluster or vacuum full after all is done?

Is there a major benefit if I can upgrade to 9.2.x in some way that I
haven't realized?

Finally, if anyone has any comments about my settings listed above that
might help improve performance, I thank you in advance.

-AJ

Re: Partition table in 9.0.x?

From

Jeff Janes

Date:

05 January 2013, 07:03:09

On Friday, January 4, 2013, AJ Weber wrote:

Hi all,

I have a table that has about 73mm rows in it and growing.

How big is the table in MB? Its indexes?

...

The server has 12GB RAM, 4 cores, but is shared with a big webapp running in Tomcat -- and I only have a RAID1 disk to work on. Woes me...

By a RAID1 disk, do you mean two disks in a RAID1 configuration, or a single RAID1 composed of an unspecified number of disks?

Often spending many thousands of dollars in DBA time can save you from having to buy many hundreds of dollars in hard drives. :) On the other hand, often you end up having to buy the extra disks anyway afterall.

Anyway, this table is going to continue to grow, and it's used frequently (Read and Write).

Are all rows in the table read and written with equal vigor, or are there hot rows and cold rows that can be recognized based on the row's values?

From what I read, this table is a candidate to be partitioned for performance and scalability. I have tested some scripts to build the "inherits" tables with their constraints and the trigger/function to perform the work.

Am I doing the right thing by partitioning this?

Probably not. Or at least, you haven't given us the information to know. Very broadly speaking, well-implemented partitioning makes bulk loading and removal operations take less IO, but makes normal operations take more IO, or if lucky leaves it unchanged. There are exceptions, but unless you can identify a very specific reason to think you might have one of those exceptions, then you probably don't.

Do you have a natural partitioning key? That is, is there a column (or expression) which occurs as a selective component in the where clause of almost all of your most io consuming SQL and DML? If so, you might benefit from partitioning on it. (But in that case, you might be able to get most of the benefits of partitioning, without the headaches of it, just by revamping your indexes to include that column/expression as their leading field).

If you don't have a good candidate partitioning key, then partitioning will almost surely make things worse.

If so, and I can afford some downtime, is dumping the table via pg_dump and then loading it back in the best way to do this?

To do efficient bulk loading into a partitioned table, you need to specifically target each partition, rather than targeting with a trigger. That pretty much rules out pg_dump, AFAIK, unless you are going to parse the dump file(s) and rewrite them.

Should I run a cluster or vacuum full after all is done?

Probably not. If a cluster after the partitioning would be beneficial, there would be a pretty good chance you could do a cluster *instead* of the partitioning and get the same benefit.

If you do some massive deletes from the parent table as part of populating the children, then a vacuum full of the parent could be useful. But if you dump the parent table, truncate it, and reload it as partitioned tables, then vacuum full would probably not be useful.

Really, you need to identify your most resource-intensive queries before you can make any reasonable decisions.

Is there a major benefit if I can upgrade to 9.2.x in some way that I haven't realized?

If you have specific queries that are misoptimized and so are generating more IO than they need to, then upgrading could help. On the other hand, it could also make things worse, if a currently well optimized query becomes worse.

But, instrumentation has improved in 9.2 from 9.0, so upgrading would make it easier to figure out just which queries are really bad and have the most opportunity for improvement. A little well informed optimization might obviate the need for either partitioning or more hard drives.

Finally, if anyone has any comments about my settings listed above that might help improve performance, I thank you in advance.

Your default statistics target seemed low. Without knowing the nature of your most resource intensive queries or how much memory tomcat is using, it is hard to say more.

Cheers,

Jeff

Re: Partition table in 9.0.x?

From

AJ Weber

Date:

06 January 2013, 18:27:40

All fair questions...

Thank you for your detailed response!

On 1/4/2013 11:03 PM, Jeff Janes wrote:

On Friday, January 4, 2013, AJ Weber wrote:
Hi all,

I have a table that has about 73mm rows in it and growing.

How big is the table in MB? Its indexes?

Not sure on this. Will see if pgAdmin tells me.

...

The server has 12GB RAM, 4 cores, but is shared with a big webapp running in Tomcat -- and I only have a RAID1 disk to work on. Woes me...

By a RAID1 disk, do you mean two disks in a RAID1 configuration, or a single RAID1 composed of an unspecified number of disks?

Often spending many thousands of dollars in DBA time can save you from having to buy many hundreds of dollars in hard drives. :) On the other hand, often you end up having to buy the extra disks anyway afterall.

I mean I have two disks in a RAID1 configuration. The server is currently in a whitebox datacenter and I have zero control over the hardware, so adding disks is unfortunately out of the question. I completely understand the comment, and would love to have a larger SAN available to me that I could configure...I just don't and have no way of getting one anytime soon.

Anyway, this table is going to continue to grow, and it's used frequently (Read and Write).

Are all rows in the table read and written with equal vigor, or are there hot rows and cold rows that can be recognized based on the row's values?

No, I could probably figure out a way to setup an "archive" or "older" section of the data that is updated much less frequently. Deletes are rare. Inserts/Updates "yes". Select on existing rows -- very frequent.

From what I read, this table is a candidate to be partitioned for performance and scalability. I have tested some scripts to build the "inherits" tables with their constraints and the trigger/function to perform the work.

Am I doing the right thing by partitioning this?

Probably not. Or at least, you haven't given us the information to know. Very broadly speaking, well-implemented partitioning makes bulk loading and removal operations take less IO, but makes normal operations take more IO, or if lucky leaves it unchanged. There are exceptions, but unless you can identify a very specific reason to think you might have one of those exceptions, then you probably don't.

I know you can't believe everything you read, but I thought I saw some metrics about when a table's size exceeds some fraction of available RAM, or when it approaches 100mm rows, it's a big candidate for partitioning.

Do you have a natural partitioning key? That is, is there a column (or expression) which occurs as a selective component in the where clause of almost all of your most io consuming SQL and DML? If so, you might benefit from partitioning on it. (But in that case, you might be able to get most of the benefits of partitioning, without the headaches of it, just by revamping your indexes to include that column/expression as their leading field).

If you don't have a good candidate partitioning key, then partitioning will almost surely make things worse.

The table is a "detail table" to its master records. That is, it's like an order-details table where it will have a 1-n rows joined to the master ("order") table on the order-id. So I can partition it based on the order number pretty easily (which is a bigint, btw).

If so, and I can afford some downtime, is dumping the table via pg_dump and then loading it back in the best way to do this?

To do efficient bulk loading into a partitioned table, you need to specifically target each partition, rather than targeting with a trigger. That pretty much rules out pg_dump, AFAIK, unless you are going to parse the dump file(s) and rewrite them.

Should I run a cluster or vacuum full after all is done?

Probably not. If a cluster after the partitioning would be beneficial, there would be a pretty good chance you could do a cluster *instead* of the partitioning and get the same benefit.

I did try clustering the table on the PK (which is actually 4 columns), and it appeared to help a bit. I was hoping partitioning was going to help me even more.

If you do some massive deletes from the parent table as part of populating the children, then a vacuum full of the parent could be useful. But if you dump the parent table, truncate it, and reload it as partitioned tables, then vacuum full would probably not be useful.

Really, you need to identify your most resource-intensive queries before you can make any reasonable decisions.

Is there a major benefit if I can upgrade to 9.2.x in some way that I haven't realized?

If you have specific queries that are misoptimized and so are generating more IO than they need to, then upgrading could help. On the other hand, it could also make things worse, if a currently well optimized query becomes worse.

Is there some new feature or optimization you're thinking about with this comment? If so, could you please just send me a link and/or feature name and I'll google it myself?

But, instrumentation has improved in 9.2 from 9.0, so upgrading would make it easier to figure out just which queries are really bad and have the most opportunity for improvement. A little well informed optimization might obviate the need for either partitioning or more hard drives.

This is interesting too. I obviously would like the best available options to tune the database and the application. Is this detailed in the release notes somewhere, and what tools could I use to take advantage of this? (Are there new/improved details included in the EXPLAIN statement or something?)

Finally, if anyone has any comments about my settings listed above that might help improve performance, I thank you in advance.

Your default statistics target seemed low. Without knowing the nature of your most resource intensive queries or how much memory tomcat is using, it is hard to say more.

Tomcat uses 4G of RAM, plus we have nginx in front using a little and some other, smaller services running on the server in addition to the usual Linux gamut of processes.

Cheers,

Jeff

Re: Simple join doesn't use index

From

Merlin Moncure

Date:

08 January 2013, 03:13:35

On Thu, Jan 3, 2013 at 4:54 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
> Don't understand why PG doesn't use views_visit_id_index in that query but
> rather scans whole table. One explanation I have found that when resulting
> dataset constitutes ~15% of total number of rows in the table then seq scan
> is used. In this case resulting dataset is just 1.5% of total number of
> rows. So it must be something different. Any reason why it happens and how
> to fix it?
>
> Postgres 9.2
> Ubuntu 12.04.1 LTS
> shared_buffers = 4GB the rest of the settings are default ones
<snip>

It happens because you lied to the database...heh.  In particular, the
'effective_cache_size' setting which defaults to 128mb.  That probably
needs to be much, much larger.   Basically postgres is figuring the
cache is much smaller than the data and starts to favor sequential
plans once you hit a certain threshold.  If you had a server with only
say 256mb ram, it probably *would* be faster.

SQL server probably uses all kinds of crazy native unportable kernel
calls to avoid having to make a similar .conf setting.  Or maybe it
just assumes infinite cache size...dunno.

merlin

Re: Partition table in 9.0.x?

From

Jeff Janes

Date:

08 January 2013, 18:26:11

On Sunday, January 6, 2013, AJ Weber wrote:

All fair questions...

Thank you for your detailed response!

On 1/4/2013 11:03 PM, Jeff Janes wrote:
On Friday, January 4, 2013, AJ Weber wrote:
Hi all,

I have a table that has about 73mm rows in it and growing.

How big is the table in MB? Its indexes?
Not sure on this. Will see if pgAdmin tells me.

It probably does, but from psql command line, you can do \d+ and \di+

Anyway, this table is going to continue to grow, and it's used frequently (Read and Write).

Are all rows in the table read and written with equal vigor, or are there hot rows and cold rows that can be recognized based on the row's values?
No, I could probably figure out a way to setup an "archive" or "older" section of the data that is updated much less frequently.

So the data that deliniates this does not exist in that table, but it does exist someplace, either just in your head, or in the column of a higher level table?

Deletes are rare. Inserts/Updates "yes". Select on existing rows -- very frequent.

If you have little control over your storage and are already IO bound, and the tables are growing rapidly, you may need to rethink that "deletes are rare" bit. So the inserts and updates do target a hot part, while the selects are evenly spread?

In that case, it is very important to know if the slow part are the selects, or the insert and deletes. If the selects are slow, and the hot rows for selects can't be gathered together into a hot partition, then after clustering they will still be slow as the disk will still have to seek all over the place (massive data-mining type selects might be an exception to that, but I wouldn't count on it).

From what I read, this table is a candidate to be partitioned for performance and scalability. I have tested some scripts to build the "inherits" tables with their constraints and the trigger/function to perform the work.

Am I doing the right thing by partitioning this?

Probably not. Or at least, you haven't given us the information to know. Very broadly speaking, well-implemented partitioning makes bulk loading and removal operations take less IO, but makes normal operations take more IO, or if lucky leaves it unchanged. There are exceptions, but unless you can identify a very specific reason to think you might have one of those exceptions, then you probably don't.
I know you can't believe everything you read, but I thought I saw some metrics about when a table's size exceeds some fraction of available RAM, or when it approaches 100mm rows, it's a big candidate for partitioning.

I think it is a matter of semantics. A small table is poor candidate for partitioning even if it has an excellent key to use for partitioning. A large table could be a good candidate up until you realize it doesn't have a good key to use, at which point it stops being a good candidate (in my opinion).

Should I run a cluster or vacuum full after all is done?

Probably not. If a cluster after the partitioning would be beneficial, there would be a pretty good chance you could do a cluster *instead* of the partitioning and get the same benefit.

I did try clustering the table on the PK (which is actually 4 columns), and it appeared to help a bit. I was hoping partitioning was going to help me even more.

Was the order_num (from the parent table) the leading field of the 4 column PK? If not, you might want to reorder the PK so that it is the leading field and cluster again. Or if reordering the PK columns is not convenient, make a new index on the order_num and cluster on that (perhaps dropping the index after the cluster, if it no longer serves a purpose)

Is there a major benefit if I can upgrade to 9.2.x in some way that I haven't realized?

If you have specific queries that are misoptimized and so are generating more IO than they need to, then upgrading could help. On the other hand, it could also make things worse, if a currently well optimized query becomes worse.

Is there some new feature or optimization you're thinking about with this comment? If so, could you please just send me a link and/or feature name and I'll google it myself?

The main things I am thinking of are the "fudge factor" for large indexes, which is currently being discussed in both performance and hackers mailing lists, which was made overly aggressive in 9.2 and so can make it choose worse plans, and the "allow the planner to generate custom plans for specific parameter values even when using prepared statements" from the 9.2 release notes, which can allow it to choose better plans. But, surely there are other changes as well, which amount to corner cases and so are hard to discuss in the abstract. Which is why instrumentation is important. There isn't much point in worrying about possible changed plans until you've identified the queries that are important to worry about.

But, instrumentation has improved in 9.2 from 9.0, so upgrading would make it easier to figure out just which queries are really bad and have the most opportunity for improvement. A little well informed optimization might obviate the need for either partitioning or more hard drives.

This is interesting too. I obviously would like the best available options to tune the database and the application. Is this detailed in the release notes somewhere, and what tools could I use to take advantage of this? (Are there new/improved details included in the EXPLAIN statement or something?)

track_io_timing is new, and it exposes new data into EXPLAIN (ANALYZE, BUFFERS) as well as into other places. You might not want to turn this on permanently, as it can affect performance (but you can test with pg_test_timing as outlined in the docs to see how large probable affect it). Also, EXPLAIN displays the number row removed by filters, which may or may not be useful to you.

Most exciting I think are the improvements to the contrib module pg_stat_statements. That would be my first recourse, to find out which of your statements are taking the most time (and/or IO). I try to install and configure this for all of my databases now as a matter of course.

See the 9.2 release notes (with links therein to the rest of the documentation) for discussion of these.

Cheers,

Jeff

Re: Partition table in 9.0.x?

From

AJ Weber

Date:

08 January 2013, 19:46:34

>
> It probably does, but from psql command line, you can do \d+ and \di+
\d+ doesn't appear to display any size information.

>
> If you have little control over your storage and are already IO bound,
> and the tables are growing rapidly, you may need to rethink that
> "deletes are rare" bit.  So the inserts and updates do target a hot
> part, while the selects are evenly spread?
>
> In that case, it is very important to know if the slow part are the
> selects, or the insert and deletes.  If the selects are slow, and the
> hot rows for selects can't be gathered together into a hot partition,
> then after clustering they will still be slow as the disk will still
> have to seek all over the place (massive data-mining type selects
> might be an exception to that, but I wouldn't count on it).
Since order_num is sequential, I could partition on it in broad
(sequential) ranges.  That would put all recent/new rows in one
table-partition that would be a fraction of the size of the overall
(unpartitioned) table.  I guess that would require manual maintenance
over-time (to switch to another, new partition as each grows).

>
> I think it is a matter of semantics. A small table is poor candidate
> for partitioning even if it has an excellent key to use for
> partitioning.  A large table could be a good candidate up until you
> realize it doesn't have a good key to use, at which point it stops
> being a good candidate (in my opinion).
>
My first idea to evenly-partition the table was to use the order_num and
do a "mod" on it with the number of tables I wanted to use.  That would
yield a partition-table number of 0-mod, and all rows for the same order
would stay within the same partition-table.  However, you're right in
thinking that a search for orders could -- really WOULD -- require
retrieving details from multiple partitions, probably increasing IO.  So
maybe the sequential partitioning (if at all) is better, just more
maintenance down-the-road.
>
> Was the order_num (from the parent table) the leading field of the 4
> column PK?  If not, you might want to reorder the PK so that it is the
> leading field and cluster again.  Or if reordering the PK columns is
> not convenient, make a new index on the order_num and cluster on that
> (perhaps dropping the index after the cluster, if it no longer serves
> a purpose)
>
Yes, the order_num is the first column in the PK, and our main browse
queries use, at a minimum, the first 2-3 columns in that PK in their
where-clause.

Many thanks again for all the input!
-AJ

Re: Partition table in 9.0.x?

From

Jeff Janes

Date:

08 January 2013, 20:21:46

On Tue, Jan 8, 2013 at 8:45 AM, AJ Weber <aweber@comcast.net> wrote:
>
>>
>> It probably does, but from psql command line, you can do \d+ and \di+
>
> \d+ doesn't appear to display any size information.

It does if you use it without an argument, to display all the tables
in the search path:

jjanes=# \d+
                         List of relations
 Schema |       Name       | Type  | Owner  |  Size   | Description
--------+------------------+-------+--------+---------+-------------
 public | pgbench_accounts | table | jjanes | 128 MB  |
 public | pgbench_branches | table | jjanes | 40 kB   |
 public | pgbench_history  | table | jjanes | 0 bytes |
 public | pgbench_tellers  | table | jjanes | 40 kB   |
(4 rows)

It rather annoys me that you actually get less information (no size,
no owner) when you use \d+ on a named table.  I don't know if there is
a reason for that feature, or if it was just an oversight.

>
>
>>
>> If you have little control over your storage and are already IO bound, and
>> the tables are growing rapidly, you may need to rethink that "deletes are
>> rare" bit.  So the inserts and updates do target a hot part, while the
>> selects are evenly spread?
>>
>> In that case, it is very important to know if the slow part are the
>> selects, or the insert and deletes.  If the selects are slow, and the hot
>> rows for selects can't be gathered together into a hot partition, then after
>> clustering they will still be slow as the disk will still have to seek all
>> over the place (massive data-mining type selects might be an exception to
>> that, but I wouldn't count on it).
>
> Since order_num is sequential, I could partition on it in broad (sequential)
> ranges.  That would put all recent/new rows in one table-partition that
> would be a fraction of the size of the overall (unpartitioned) table.  I
> guess that would require manual maintenance over-time (to switch to another,
> new partition as each grows).

Yep.  If your selects are concentrated in those recent/new, this could
be very useful.  But, if your selects are not concentrated on the
recent/new rows, the benefit would be small.

>
>
>>
>> I think it is a matter of semantics. A small table is poor candidate for
>> partitioning even if it has an excellent key to use for partitioning.  A
>> large table could be a good candidate up until you realize it doesn't have a
>> good key to use, at which point it stops being a good candidate (in my
>> opinion).
>>
> My first idea to evenly-partition the table was to use the order_num and do
> a "mod" on it with the number of tables I wanted to use.  That would yield a
> partition-table number of 0-mod,

The current constraint exclusion code is quite simple-minded and
doesn't know how to make use of check constraints that use the mod
function, so the indexes of all partitions would have to be searched
for each order_num-driven query, even though we know the data could
only exist in one of them.  The constraint exclusion codes does
understand check constraints that involve ranges.

There could still be some benefit as the table data would be
concentrated, even if the index data is not.

> and all rows for the same order would stay
> within the same partition-table.

But usually a given order_num would only be of interest for a fraction
of a second before moving on to some other order_num of interest, so
by the time the relevant partition become fully cached, it would no
longer be hot.  Or, if the partitions were small enough, you could
assume that all rows would be dragged into memory when the first one
was requested because they lay so close to each other.  But it is not
feasible to have a large enough number of partitions to make that
happen.  But if the table is clustered, this is exactly what you would
get--the trouble would be keeping it clustered.   If most of the
line-items are inserted at the same time as each other, they probably
should be fairly well clustered to start with.

Cheers,

Jeff

Re: Partition table in 9.0.x?

From

Tom Lane

Date:

08 January 2013, 20:53:40

Jeff Janes <jeff.janes@gmail.com> writes:
> On Tue, Jan 8, 2013 at 8:45 AM, AJ Weber <aweber@comcast.net> wrote:
>>
>> \d+ doesn't appear to display any size information.

> It does if you use it without an argument, to display all the tables
> in the search path:

> jjanes=# \d+
>                          List of relations
>  Schema |       Name       | Type  | Owner  |  Size   | Description
> --------+------------------+-------+--------+---------+-------------
>  public | pgbench_accounts | table | jjanes | 128 MB  |
>  public | pgbench_branches | table | jjanes | 40 kB   |
>  public | pgbench_history  | table | jjanes | 0 bytes |
>  public | pgbench_tellers  | table | jjanes | 40 kB   |
> (4 rows)

> It rather annoys me that you actually get less information (no size,
> no owner) when you use \d+ on a named table.  I don't know if there is
> a reason for that feature, or if it was just an oversight.

This is actually an abbreviation for \dtisv+, which is a completely
different command from "\d table".  You can use something like
"\dt+ table-pattern" to get a display of the above form for a subset
of tables.  I agree it ain't too consistent.

            regards, tom lane

Re: Partition table in 9.0.x?

From

AJ Weber

Date:

08 January 2013, 21:05:32

> It does if you use it without an argument, to display all the tables
> in the search path:
>
> jjanes=# \d+
>                           List of relations
>   Schema |       Name       | Type  | Owner  |  Size   | Description
> --------+------------------+-------+--------+---------+-------------
>   public | pgbench_accounts | table | jjanes | 128 MB  |
>   public | pgbench_branches | table | jjanes | 40 kB   |
>   public | pgbench_history  | table | jjanes | 0 bytes |
>   public | pgbench_tellers  | table | jjanes | 40 kB   |
> (4 rows)
>
> It rather annoys me that you actually get less information (no size,
> no owner) when you use \d+ on a named table.  I don't know if there is
> a reason for that feature, or if it was just an oversight.
That is rather peculiar.  Sorry for that.
Table in question is 9284MB
(Parent table is 621MB)

>
> The current constraint exclusion code is quite simple-minded and
> doesn't know how to make use of check constraints that use the mod
> function, so the indexes of all partitions would have to be searched
> for each order_num-driven query, even though we know the data could
> only exist in one of them.  The constraint exclusion codes does
> understand check constraints that involve ranges.
Hmm.  That's a bit of a limitation I didn't know about.  I assume it
doesn't understand the percent (mod operator) just the same as not
understanding the MOD() function?  Either way, I guess this strategy
does not pan-out.
> There could still be some benefit as the table data would be
> concentrated, even if the index data is not.
I'm reaching way, way back in my head, but I think _some_ RDBMS I worked
with previously had a way to "cluster" the rows around a single one of
the indexes on the table, thus putting the index and the row-data
"together" and reducing the number of IO's to retrieve the row if that
index was used.  Am I understanding that PG's "cluster" is strictly to
group like rows together logically -- table data only, not to coordinate
the table row with the index upon which you clustered them?

>
>> and all rows for the same order would stay
>> within the same partition-table.
> But usually a given order_num would only be of interest for a fraction
> of a second before moving on to some other order_num of interest, so
> by the time the relevant partition become fully cached, it would no
> longer be hot.  Or, if the partitions were small enough, you could
> assume that all rows would be dragged into memory when the first one
> was requested because they lay so close to each other.  But it is not
> feasible to have a large enough number of partitions to make that
> happen.  But if the table is clustered, this is exactly what you would
> get--the trouble would be keeping it clustered.   If most of the
> line-items are inserted at the same time as each other, they probably
> should be fairly well clustered to start with.
Does decreasing the fill to like 90 help keep it clustered in-between
times that I could shutdown the app and perform a (re-) cluster on the
overall table?  Problem is, with a table that size, and the hardware I'm
"blessed with", the cluster takes quite a bit of time. :(

Re: Simple join doesn't use index

From

Jeff Janes

Date:

09 January 2013, 07:34:17

On Thursday, January 3, 2013, Alex Vinnik wrote:

Hi everybody,

I have implemented my first app using PG DB and thought for a minute(may be two) that I know something about PG but below problem totally destroyed my confidence :). Please help me to restore it.

Here is simple join query. It runs just fine on MS SQL 2008 and uses all available indexes using even bigger overall dataset.

select visits.id, views.id
from visits join views on visits.id = views.visit_id
where visits.created_at >= '11/15/2012' and visits.created_at < '11/16/2012'

Quick performance stat

MS SQL: 1 second, 264K rows

If it returns 264K rows in 1 second, then it must have all data in memory. Which prompts a few questions:

Is *all* data in memory, or is it just the data needed for this particular query because you already ran it recently with the same date range?

PG: 158 seconds, 264K rows

Does the PG machine have enough memory to hold all the data in RAM? If so, does it actually have all the data in RAM? That is, is the cache already warm? Starting from scratch it can take a long time for the cache to warm up naturally. And finally, if all the data is in RAM, does PG know this?

For the last question, the answer is "no", since you are using default settings. You need to lower random_page_cost and probably also seq_page_cost in order to trick PG to think the data is in RAM. Of course if you do this when the data is in fact not in RAM, the result could be catastrophically bad plans. (And when I tried to replicate your situation on anemic hardware, indeed the hash join often was faster than the nested loop on both indexes.)

Explain plan from both DBs

PG QUERY PLAN
Hash Join (cost=12716.17..1101820.09 rows=248494 width=8)
Hash Cond: (views.visit_id = visits.id)
-> Seq Scan on views (cost=0.00..819136.56 rows=17434456 width=8)

This cost estimate is probably due mostly to seg_page_cost and cpu_tuple_cost, which at their defaults means the table has 645,000 blocks (819136 - 17434456/100) blocks and each block has ~30 rows.

But you are returning 248,494 rows, or roughly 1 / 2.5 of a row per block. Let's say you need to fetch 200,000 blocks (in random order) to get those rows. Since at default settings fetching 200,000 random blocks is considered as expensive as fetching 800,000 sequential blocks, the index scan you want already looks more expensive than the sequential scan. But, if you want to use the index scan, you also have to fetch the index blocks, which a sequential scan does not need to do. There are probably about 50,000 index blocks, but each one has to be fetched about 5 times (248,494/50,000). Because your effective_cache_size is so low, PG assumes the next time it needs to fetch the same block, it will no longer be in memory and so needs to be fetched again at full random_page_cost.

It is clear that PG does full table scan "Seq Scan on views (cost=0.00..819136.56 rows=17434456 width=8)"

Don't understand why PG doesn't use views_visit_id_index in that query but rather scans whole table. One explanation I have found that when resulting dataset constitutes ~15% of total number of rows in the table then seq scan is used.

I don't know where you found that rule of thumb, but it would probably more accurate if it was given in in terms of the percentage of the table's *blocks* scanned, rather than *rows*.

In this case resulting dataset is just 1.5% of total number of rows.

Since there are about 30 rows per block, scanning 1.5% of the rows means scanning somewhat less than 45% of the blocks, assuming the rows are randomly distributed over the blocks. And they are scanned in a less efficient way.

Postgres 9.2

You are probably getting hit hard by the overly-large "fudge factor" penalty for scans of large indexes, of much discussion recently in regards to 9.2.

Ubuntu 12.04.1 LTS
shared_buffers = 4GB the rest of the settings are default ones

The default effective_cache_size is almost certainly wrong, and if the analogy to MSSQL to is correct, then random_page_cost almost certainly is as well.

Another poster referred you to the wiki page for suggestion on how to report slow queries, particularly using EXPLAIN (analyze, buffers) rather than merely EXPLAIN. In this case, I would also try setting enable_hashjoin=off and enable_mergejoin=off in the session, in order to force the planner to use the plan you think you want, so we can see what PG thinks of that one.

Cheers,

Jeff

Re: Simple join doesn't use index

From

Alex Vinnik

Date:

09 January 2013, 18:49:46

Guys, thanks a lot for your input. It is very valuable for us. We plan to fix a separate dev server similar to production one, copy all data there and try you suggestions as we really don't want to do it on production server. I also noticed that IOPS jumps to 100% when running this query. So it is a bit scary to make those changes in production directly. Will report back on the progress and findings.

Re: Simple join doesn't use index

From

Merlin Moncure

Date:

09 January 2013, 18:53:10

On Wed, Jan 9, 2013 at 9:49 AM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
> Guys, thanks a lot for your input. It is very valuable for us. We plan to
> fix a separate dev server similar to production one, copy all data there and
> try you suggestions as we really don't want to do it on production server. I
> also noticed that IOPS jumps to 100% when running this query. So it is a bit
> scary to make those changes in production directly. Will report back on the
> progress and findings.

nothing wrong with that, but keep in mind you can tweak
'effective_cache_size' for a single session with 'set' command;

merlin

Re: Partition table in 9.0.x?

From

Jeff Janes

Date:

14 January 2013, 21:24:55

On Tue, Jan 8, 2013 at 10:04 AM, AJ Weber <aweber@comcast.net> wrote:
>>
>> The current constraint exclusion code is quite simple-minded and
>> doesn't know how to make use of check constraints that use the mod
>> function, so the indexes of all partitions would have to be searched
>> for each order_num-driven query, even though we know the data could
>> only exist in one of them.  The constraint exclusion codes does
>> understand check constraints that involve ranges.
>
> Hmm.  That's a bit of a limitation I didn't know about.  I assume it doesn't
> understand the percent (mod operator) just the same as not understanding the
> MOD() function?  Either way, I guess this strategy does not pan-out.

Yes, it doesn't detect either.  It would use it if you formulate to
every equality query with an extra restriction: "where id=1234567 and
mod(id,100)=67" or whatever.

(But I was surprised that % and mod() are not recognized as being
equivalent.  If you specify it one way in the check constraint, you
need to use the same "spelling" in the where clause)

>> There could still be some benefit as the table data would be
>> concentrated, even if the index data is not.
>
> I'm reaching way, way back in my head, but I think _some_ RDBMS I worked
> with previously had a way to "cluster" the rows around a single one of the
> indexes on the table, thus putting the index and the row-data "together" and
> reducing the number of IO's to retrieve the row if that index was used.

In Oracle this is called in "index organized table" or IOT (or it was
at one point, they have the habit of rename most of their features
with each release).  I don't know what other RDBMS call it.
Supporting secondary indexes when the table data could move around was
quite intricate/weird.

PG doesn't have this index-organized-table feature--it has been
discussed but I don't of any currently active effort to add it.

There is another feature, sometimes called clustering, in which the
rows from different tables can be mingled together in the same block.
So both the parent order and the child order_line_item that have the
same order_num (i.e. the join column) would be in the same block.  So
once you query for a specific order and did the necessary IO, the
corresponding order_line_item rows would already be in memory.   I
thought this was interesting, but I don't know how often it was
actually used.

> Am
> I understanding that PG's "cluster" is strictly to group like rows together
> logically -- table data only, not to coordinate the table row with the index
> upon which you clustered them?

They are coordinated in a sense.  Not as one single structure, but as
two structures in parallel.

>>> and all rows for the same order would stay
>>> within the same partition-table.
>>
>> But usually a given order_num would only be of interest for a fraction
>> of a second before moving on to some other order_num of interest, so
>> by the time the relevant partition become fully cached, it would no
>> longer be hot.  Or, if the partitions were small enough, you could
>> assume that all rows would be dragged into memory when the first one
>> was requested because they lay so close to each other.  But it is not
>> feasible to have a large enough number of partitions to make that
>> happen.  But if the table is clustered, this is exactly what you would
>> get--the trouble would be keeping it clustered.   If most of the
>> line-items are inserted at the same time as each other, they probably
>> should be fairly well clustered to start with.
>
> Does decreasing the fill to like 90 help keep it clustered in-between times
> that I could shutdown the app and perform a (re-) cluster on the overall
> table?  Problem is, with a table that size, and the hardware I'm "blessed
> with", the cluster takes quite a bit of time. :(

Probably not.  If the data starts out clustered and gets updated a
lot, lowering the fill factor might be able to prevent some
de-clustering due to row migration.  But when you insert new rows, PG
makes no effort to put them near existing rows with the same key.  (In
a hypothetical future in which that did happen, lowering the fill
factor would then probably help)

Cheers,

Jeff

Re: Simple join doesn't use index

From

Alex Vinnik

Date:

29 January 2013, 02:43:58

It sure turned out that default settings are not a good fit. Setting random_page_cost to 1.0 made query to run in 2.6 seconds and I clearly see that indexes are being used in explain plan and IO utilization is close to 0.

QUERY PLAN

Sort (cost=969787.23..970288.67 rows=200575 width=8) (actual time=2176.045..2418.162 rows=241238 loops=1)

Sort Key: visits.id, views.id

Sort Method: external sort Disk: 4248kB

-> Nested Loop (cost=0.00..950554.81 rows=200575 width=8) (actual time=0.048..1735.357 rows=241238 loops=1)

-> Index Scan using visits_created_at_index on visits (cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..178.591 rows=136021 loops=1)

Index Cond: ((created_at >= '2012-12-15 00:00:00'::timestamp without time zone) AND (created_at < '2012-12-16 00:00:00'::timestamp without time zone))

-> Index Scan using views_visit_id_index on views (cost=0.00..11.33 rows=12 width=8) (actual time=0.004..0.006 rows=2 loops=136021)

Index Cond: (visit_id = visits.id)

Total runtime: 2635.169 ms

However I noticed that sorting is done using disk("external sort Disk: 4248kB") which prompted me to take a look at work_mem. But it turned out that small increase to 4MB from default 1MB turns off index usage and query gets x10 slower. IO utilization jumped to 100% from literally nothing. so back to square one...

QUERY PLAN

Sort (cost=936642.75..937144.19 rows=200575 width=8) (actual time=33200.762..33474.443 rows=241238 loops=1)

Sort Key: visits.id, views.id

Sort Method: external merge Disk: 4248kB

-> Hash Join (cost=6491.17..917410.33 rows=200575 width=8) (actual time=7156.498..32723.221 rows=241238 loops=1)

Hash Cond: (views.visit_id = visits.id)

-> Seq Scan on views (cost=0.00..832189.95 rows=8768395 width=8) (actual time=0.100..12126.342 rows=8200704 loops=1)

-> Hash (cost=5459.16..5459.16 rows=82561 width=4) (actual time=353.683..353.683 rows=136021 loops=1)

Buckets: 16384 Batches: 2 (originally 1) Memory Usage: 4097kB

-> Index Scan using visits_created_at_index on visits (cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..175.051 rows=136021 loops=1)

Index Cond: ((created_at >= '2012-12-15 00:00:00'::timestamp without time zone) AND (created_at < '2012-12-16 00:00:00'::timestamp without time zone))

Total runtime: 33698.000 ms

Basically PG is going through all views again and not using "Index Scan using views_visit_id_index on views". Looks like setting work_mem confuses planner somehow. Any idea what can be done to do sorting in memory. I suspect it should make query even more faster. Thanks -Alex

nothing wrong with that, but keep in mind you can tweak
'effective_cache_size' for a single session with 'set' command;

merlin

Re: Simple join doesn't use index

From

Filip Rembiałkowski

Date:

29 January 2013, 03:55:35

On Mon, Jan 28, 2013 at 5:43 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:

It sure turned out that default settings are not a good fit.

do you know pgtune?
it's a good tool for starters, if you want a fast postgres and don't really want to learn what's behind the scenes.

random_page_cost=1 might be not what you really want.
it would mean that random reads are as fast as as sequential reads, which probably is true only for SSD

Filip

Re: Simple join doesn't use index

From

Merlin Moncure

Date:

29 January 2013, 05:31:35

On Mon, Jan 28, 2013 at 5:43 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
> It sure turned out that default settings are not a good fit. Setting
> random_page_cost to 1.0 made query to run in 2.6 seconds and I clearly see
> that indexes are being used in explain plan and IO utilization is close to
> 0.
>
> QUERY PLAN
> Sort  (cost=969787.23..970288.67 rows=200575 width=8) (actual
> time=2176.045..2418.162 rows=241238 loops=1)
>   Sort Key: visits.id, views.id
>   Sort Method: external sort  Disk: 4248kB
>   ->  Nested Loop  (cost=0.00..950554.81 rows=200575 width=8) (actual
> time=0.048..1735.357 rows=241238 loops=1)
>         ->  Index Scan using visits_created_at_index on visits
> (cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..178.591
> rows=136021 loops=1)
>               Index Cond: ((created_at >= '2012-12-15 00:00:00'::timestamp
> without time zone) AND (created_at < '2012-12-16 00:00:00'::timestamp
> without time zone))
>         ->  Index Scan using views_visit_id_index on views
> (cost=0.00..11.33 rows=12 width=8) (actual time=0.004..0.006 rows=2
> loops=136021)
>               Index Cond: (visit_id = visits.id)
> Total runtime: 2635.169 ms
>
> However I noticed that sorting is done using disk("external sort  Disk:
> 4248kB") which prompted me to take a look at work_mem. But it turned out
> that small increase to 4MB from default 1MB turns off index usage and query
> gets x10 slower. IO utilization jumped to 100% from literally nothing. so
> back to square one...
>
> QUERY PLAN
> Sort  (cost=936642.75..937144.19 rows=200575 width=8) (actual
> time=33200.762..33474.443 rows=241238 loops=1)
>   Sort Key: visits.id, views.id
>   Sort Method: external merge  Disk: 4248kB
>   ->  Hash Join  (cost=6491.17..917410.33 rows=200575 width=8) (actual
> time=7156.498..32723.221 rows=241238 loops=1)
>         Hash Cond: (views.visit_id = visits.id)
>         ->  Seq Scan on views  (cost=0.00..832189.95 rows=8768395 width=8)
> (actual time=0.100..12126.342 rows=8200704 loops=1)
>         ->  Hash  (cost=5459.16..5459.16 rows=82561 width=4) (actual
> time=353.683..353.683 rows=136021 loops=1)
>               Buckets: 16384  Batches: 2 (originally 1)  Memory Usage:
> 4097kB
>               ->  Index Scan using visits_created_at_index on visits
> (cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..175.051
> rows=136021 loops=1)
>                     Index Cond: ((created_at >= '2012-12-15
> 00:00:00'::timestamp without time zone) AND (created_at < '2012-12-16
> 00:00:00'::timestamp without time zone))
> Total runtime: 33698.000 ms
>
> Basically PG is going through all views again and not using "Index Scan
> using views_visit_id_index on views". Looks like setting work_mem confuses
> planner somehow. Any idea what can be done to do sorting in memory. I
> suspect it should make query even more faster. Thanks -Alex

hm, what happens when you set work_mem a fair amount higher? (say,
64mb).   You can set it for one session by going "set work_mem='64mb';
" as opposed to the entire server in postgresql.conf.

merlin

Re: Simple join doesn't use index

From

Alex Vinnik

Date:

29 January 2013, 17:24:16

On Mon, Jan 28, 2013 at 6:55 PM, Filip Rembiałkowski <plk.zuber@gmail.com> wrote:

On Mon, Jan 28, 2013 at 5:43 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
It sure turned out that default settings are not a good fit.

do you know pgtune?
it's a good tool for starters, if you want a fast postgres and don't really want to learn what's behind the scenes.

Yeah.. I came across pgtune but noticed that latest version dated 2009-10-29 http://pgfoundry.org/frs/?group_id=1000416 which is kind of outdated. Tar file has settings for pg 8.3. Is still relevant?

random_page_cost=1 might be not what you really want.
it would mean that random reads are as fast as as sequential reads, which probably is true only for SSD

What randon_page_cost would be more appropriate for EC2 EBS Provisioned volume that can handle 2,000 IOPS?

Filip

Re: Simple join doesn't use index

From

Alex Vinnik

Date:

29 January 2013, 17:41:54

Setting work_mem to 64MB triggers in memory sort but look what happens with views look up. PG goes through all records there "Seq Scan on views" instead of using visitor_id index and I have only subset of real data to play around. Can imagine what cost would be running it against bigger dataset. Something else is in play here that makes planner to take this route. Any ideas how to gain more insight into planner's inner workings?

QUERY PLAN

Sort (cost=960280.46..960844.00 rows=225414 width=8) (actual time=23328.040..23537.126 rows=209401 loops=1)

Sort Key: visits.id, views.id

Sort Method: quicksort Memory: 15960kB

-> Hash Join (cost=8089.16..940238.66 rows=225414 width=8) (actual time=6622.072..22995.890 rows=209401 loops=1)

Hash Cond: (views.visit_id = visits.id)

-> Seq Scan on views (cost=0.00..831748.05 rows=8724205 width=8) (actual time=0.093..10552.306 rows=6995893 loops=1)

-> Hash (cost=6645.51..6645.51 rows=115492 width=4) (actual time=307.389..307.389 rows=131311 loops=1)

Buckets: 16384 Batches: 1 Memory Usage: 4617kB

-> Index Scan using visits_created_at_index on visits (cost=0.00..6645.51 rows=115492 width=4) (actual time=0.040..163.151 rows=131311 loops=1)

Index Cond: ((created_at >= '2013-01-15 00:00:00'::timestamp without time zone) AND (created_at < '2013-01-16 00:00:00'::timestamp without time zone))

Total runtime: 23733.045 ms

On Mon, Jan 28, 2013 at 8:31 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Mon, Jan 28, 2013 at 5:43 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
> It sure turned out that default settings are not a good fit. Setting
> random_page_cost to 1.0 made query to run in 2.6 seconds and I clearly see
> that indexes are being used in explain plan and IO utilization is close to
> 0.
>
> QUERY PLAN
> Sort (cost=969787.23..970288.67 rows=200575 width=8) (actual
> time=2176.045..2418.162 rows=241238 loops=1)
> Sort Key: visits.id, views.id
> Sort Method: external sort Disk: 4248kB
> -> Nested Loop (cost=0.00..950554.81 rows=200575 width=8) (actual
> time=0.048..1735.357 rows=241238 loops=1)
> -> Index Scan using visits_created_at_index on visits
> (cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..178.591
> rows=136021 loops=1)
> Index Cond: ((created_at >= '2012-12-15 00:00:00'::timestamp
> without time zone) AND (created_at < '2012-12-16 00:00:00'::timestamp
> without time zone))
> -> Index Scan using views_visit_id_index on views
> (cost=0.00..11.33 rows=12 width=8) (actual time=0.004..0.006 rows=2
> loops=136021)
> Index Cond: (visit_id = visits.id)
> Total runtime: 2635.169 ms
>
> However I noticed that sorting is done using disk("external sort Disk:
> 4248kB") which prompted me to take a look at work_mem. But it turned out
> that small increase to 4MB from default 1MB turns off index usage and query
> gets x10 slower. IO utilization jumped to 100% from literally nothing. so
> back to square one...
>
> QUERY PLAN
> Sort (cost=936642.75..937144.19 rows=200575 width=8) (actual
> time=33200.762..33474.443 rows=241238 loops=1)
> Sort Key: visits.id, views.id
> Sort Method: external merge Disk: 4248kB
> -> Hash Join (cost=6491.17..917410.33 rows=200575 width=8) (actual
> time=7156.498..32723.221 rows=241238 loops=1)
> Hash Cond: (views.visit_id = visits.id)
> -> Seq Scan on views (cost=0.00..832189.95 rows=8768395 width=8)
> (actual time=0.100..12126.342 rows=8200704 loops=1)
> -> Hash (cost=5459.16..5459.16 rows=82561 width=4) (actual
> time=353.683..353.683 rows=136021 loops=1)
> Buckets: 16384 Batches: 2 (originally 1) Memory Usage:
> 4097kB
> -> Index Scan using visits_created_at_index on visits
> (cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..175.051
> rows=136021 loops=1)
> Index Cond: ((created_at >= '2012-12-15
> 00:00:00'::timestamp without time zone) AND (created_at < '2012-12-16
> 00:00:00'::timestamp without time zone))
> Total runtime: 33698.000 ms
>
> Basically PG is going through all views again and not using "Index Scan
> using views_visit_id_index on views". Looks like setting work_mem confuses
> planner somehow. Any idea what can be done to do sorting in memory. I
> suspect it should make query even more faster. Thanks -Alex

hm, what happens when you set work_mem a fair amount higher? (say,
64mb). You can set it for one session by going "set work_mem='64mb';
" as opposed to the entire server in postgresql.conf.

merlin

Re: Simple join doesn't use index

From

Filip Rembiałkowski

Date:

29 January 2013, 19:22:24

On Tue, Jan 29, 2013 at 8:24 AM, Alex Vinnik <alvinnik.g@gmail.com> wrote:

On Mon, Jan 28, 2013 at 6:55 PM, Filip Rembiałkowski <plk.zuber@gmail.com> wrote:

do you know pgtune?
it's a good tool for starters, if you want a fast postgres and don't really want to learn what's behind the scenes.
Yeah.. I came across pgtune but noticed that latest version dated 2009-10-29 http://pgfoundry.org/frs/?group_id=1000416 which is kind of outdated. Tar file has settings for pg 8.3. Is still relevant?

Yes, I'm sure it will not do anything bad to your config.

random_page_cost=1 might be not what you really want.
it would mean that random reads are as fast as as sequential reads, which probably is true only for SSD
What randon_page_cost would be more appropriate for EC2 EBS Provisioned volume that can handle 2,000 IOPS?

I'd say: don't guess. Measure.
Use any tool that can test sequential disk block reads versus random disk block reads.
bonnie++ is quite popular.

Filip

Re: Simple join doesn't use index

From

Merlin Moncure

Date:

29 January 2013, 19:41:51

On Tue, Jan 29, 2013 at 8:41 AM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
> Setting work_mem to 64MB triggers in memory sort but look what happens with
> views look up. PG goes through all records there "Seq Scan on views" instead
> of using visitor_id index and I have only subset of real data to play
> around. Can imagine what cost would be running it against bigger dataset.
> Something else is in play here that makes planner to take this route. Any
> ideas how to gain more insight into planner's inner workings?

did you set effective_cache_seize as noted upthread?

merlin

Re: Simple join doesn't use index

From

Ben Chobot

Date:

29 January 2013, 20:39:18

On Jan 29, 2013, at 6:24 AM, Alex Vinnik wrote:

random_page_cost=1 might be not what you really want.
it would mean that random reads are as fast as as sequential reads, which probably is true only for SSD
What randon_page_cost would be more appropriate for EC2 EBS Provisioned volume that can handle 2,000 IOPS?

For EC2 Provisioned IOPS volumes - not standard EBS - random_page_cost=1 is exactly what you want.

Re: Simple join doesn't use index

From

Alex Vinnik

Date:

29 January 2013, 21:59:14

On Tue, Jan 29, 2013 at 11:39 AM, Ben Chobot <bench@silentmedia.com> wrote:

On Jan 29, 2013, at 6:24 AM, Alex Vinnik wrote:

random_page_cost=1 might be not what you really want.
it would mean that random reads are as fast as as sequential reads, which probably is true only for SSD
What randon_page_cost would be more appropriate for EC2 EBS Provisioned volume that can handle 2,000 IOPS?

For EC2 Provisioned IOPS volumes - not standard EBS - random_page_cost=1 is exactly what you want.

Well... after some experimentation it turned out that random_page_cost=0.6 gives me fast query

QUERY PLAN

Sort (cost=754114.96..754510.46 rows=158199 width=8) (actual time=1839.324..2035.405 rows=209401 loops=1)

Sort Key: visits.id, views.id

Sort Method: quicksort Memory: 15960kB

-> Nested Loop (cost=0.00..740453.38 rows=158199 width=8) (actual time=0.048..1531.592 rows=209401 loops=1)

-> Index Scan using visits_created_at_index on visits (cost=0.00..5929.82 rows=115492 width=4) (actual time=0.032..161.488 rows=131311 loops=1)

Index Cond: ((created_at >= '2013-01-15 00:00:00'::timestamp without time zone) AND (created_at < '2013-01-16 00:00:00'::timestamp without time zone))

-> Index Scan using views_visit_id_index on views (cost=0.00..6.26 rows=10 width=8) (actual time=0.003..0.005 rows=2 loops=131311)

Index Cond: (visit_id = visits.id)

Total runtime: 2234.142 ms

random_page_cost=0.7 slows it down 16 times

Sort (cost=804548.42..804943.92 rows=158199 width=8) (actual time=37011.337..37205.449 rows=209401 loops=1)

Sort Key: visits.id, views.id

Sort Method: quicksort Memory: 15960kB

-> Merge Join (cost=15871.37..790886.85 rows=158199 width=8) (actual time=35673.602..36714.056 rows=209401 loops=1)

Merge Cond: (visits.id = views.visit_id)

-> Sort (cost=15824.44..16113.17 rows=115492 width=4) (actual time=335.486..463.085 rows=131311 loops=1)

Sort Key: visits.id

Sort Method: quicksort Memory: 12300kB

-> Index Scan using visits_created_at_index on visits (cost=0.00..6113.04 rows=115492 width=4) (actual time=0.034..159.326 rows=131311 loops=1)

Index Cond: ((created_at >= '2013-01-15 00:00:00'::timestamp without time zone) AND (created_at < '2013-01-16 00:00:00'::timestamp without time zone))

-> Index Scan using views_visit_id_visit_buoy_index on views (cost=0.00..757596.22 rows=6122770 width=8) (actual time=0.017..30765.316 rows=5145902 loops=1)

Total runtime: 37407.174 ms

I am totally puzzled now...

Re: Simple join doesn't use index

From

Merlin Moncure

Date:

29 January 2013, 22:35:19

On Tue, Jan 29, 2013 at 12:59 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
>
>
>
> On Tue, Jan 29, 2013 at 11:39 AM, Ben Chobot <bench@silentmedia.com> wrote:
>>
>> On Jan 29, 2013, at 6:24 AM, Alex Vinnik wrote:
>>
>>> random_page_cost=1 might be not what you really want.
>>> it would mean that random reads are as fast as as sequential reads, which
>>> probably is true only for SSD
>>
>> What randon_page_cost would be more appropriate for EC2 EBS Provisioned
>> volume that can handle 2,000 IOPS?
>>
>>
>> For EC2 Provisioned IOPS volumes - not standard EBS - random_page_cost=1
>> is exactly what you want.
>>
> Well... after some experimentation it turned out that random_page_cost=0.6
> gives me fast query
>
> QUERY PLAN
> Sort  (cost=754114.96..754510.46 rows=158199 width=8) (actual
> time=1839.324..2035.405 rows=209401 loops=1)
>   Sort Key: visits.id, views.id
>   Sort Method: quicksort  Memory: 15960kB
>   ->  Nested Loop  (cost=0.00..740453.38 rows=158199 width=8) (actual
> time=0.048..1531.592 rows=209401 loops=1)
>         ->  Index Scan using visits_created_at_index on visits
> (cost=0.00..5929.82 rows=115492 width=4) (actual time=0.032..161.488
> rows=131311 loops=1)
>               Index Cond: ((created_at >= '2013-01-15 00:00:00'::timestamp
> without time zone) AND (created_at < '2013-01-16 00:00:00'::timestamp
> without time zone))
>         ->  Index Scan using views_visit_id_index on views  (cost=0.00..6.26
> rows=10 width=8) (actual time=0.003..0.005 rows=2 loops=131311)
>               Index Cond: (visit_id = visits.id)
> Total runtime: 2234.142 ms
>
> random_page_cost=0.7 slows it down 16 times
>
> Sort  (cost=804548.42..804943.92 rows=158199 width=8) (actual
> time=37011.337..37205.449 rows=209401 loops=1)
>   Sort Key: visits.id, views.id
>   Sort Method: quicksort  Memory: 15960kB
>   ->  Merge Join  (cost=15871.37..790886.85 rows=158199 width=8) (actual
> time=35673.602..36714.056 rows=209401 loops=1)
>         Merge Cond: (visits.id = views.visit_id)
>         ->  Sort  (cost=15824.44..16113.17 rows=115492 width=4) (actual
> time=335.486..463.085 rows=131311 loops=1)
>               Sort Key: visits.id
>               Sort Method: quicksort  Memory: 12300kB
>               ->  Index Scan using visits_created_at_index on visits
> (cost=0.00..6113.04 rows=115492 width=4) (actual time=0.034..159.326
> rows=131311 loops=1)
>                     Index Cond: ((created_at >= '2013-01-15
> 00:00:00'::timestamp without time zone) AND (created_at < '2013-01-16
> 00:00:00'::timestamp without time zone))

>         ->  Index Scan using views_visit_id_visit_buoy_index on views
> (cost=0.00..757596.22 rows=6122770 width=8) (actual time=0.017..30765.316
> rows=5145902 loops=1)

Something is awry here. pg is doing an index scan via
views_visit_id_visit_buoy_index with no matching condition.  What's
the definition of that index? The reason why the random_page_cost
adjustment is working is that you are highly penalizing sequential
type scans so that the database is avoiding the merge (sort A, sort B,
stepwise compare).

SQL server is doing a nestloop/index scan, just like the faster pg
plan, but is a bit faster because it's parallelizing.

 merlin

Re: Simple join doesn't use index

From

Jeff Janes

Date:

29 January 2013, 23:06:54

On Mon, Jan 28, 2013 at 3:43 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
> It sure turned out that default settings are not a good fit. Setting
> random_page_cost to 1.0 made query to run in 2.6 seconds and I clearly see
> that indexes are being used in explain plan and IO utilization is close to
> 0.
>
> QUERY PLAN
> Sort  (cost=969787.23..970288.67 rows=200575 width=8) (actual
> time=2176.045..2418.162 rows=241238 loops=1)
>   Sort Key: visits.id, views.id
>   Sort Method: external sort  Disk: 4248kB

What query are you running?  The query you originally showed us should
not be doing this sort in the first place.

Cheers,

Jeff

Re: Simple join doesn't use index

From

Alex Vinnik

Date:

29 January 2013, 23:43:23

On Tue, Jan 29, 2013 at 2:06 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

> Sort Key: visits.id, views.id
> Sort Method: external sort Disk: 4248kB

What query are you running? The query you originally showed us should
not be doing this sort in the first place.

Cheers,

Jeff

Here is the query

 select visits.id, views.id

from visits join views on visits.id = views.visit_id

where visits.created_at >= '01/15/2013' and visits.created_at < '01/16/2013'

order by visits.id, views.id;

Original query didn't have order by clause

Here query plan w/o order by

Merge Join  (cost=18213.46..802113.80 rows=182579 width=8) (actual time=144443.693..145469.499 rows=209401 loops=1)
  Merge Cond: (visits.id = views.visit_id)
  ->  Sort  (cost=18195.47..18523.91 rows=131373 width=4) (actual time=335.496..464.929 rows=131311 loops=1)
        Sort Key: visits.id
        Sort Method: quicksort  Memory: 12300kB
        ->  Index Scan using visits_created_at_index on visits  (cost=0.00..7026.59 rows=131373 width=4) (actual time=0.037..162.047 rows=131311 loops=1)
              Index Cond: ((created_at >= '2013-01-15 00:00:00'::timestamp without time zone) AND (created_at < '2013-01-16 00:00:00'::timestamp without time zone))
  ->  Index Scan using views_visit_id_visit_buoy_index on views  (cost=0.00..766120.99 rows=6126002 width=8) (actual time=18.960..140565.130 rows=4014837 loops=1)
Total runtime: 145664.274 ms

Re: Simple join doesn't use index

From

Alex Vinnik

Date:

29 January 2013, 23:48:55

index definition

CREATE INDEX views_visit_id_visit_buoy_index ON views USING btree (visit_id, visit_buoy)

On Tue, Jan 29, 2013 at 1:35 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Tue, Jan 29, 2013 at 12:59 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
>
>
>
> On Tue, Jan 29, 2013 at 11:39 AM, Ben Chobot <bench@silentmedia.com> wrote:
>>
>> On Jan 29, 2013, at 6:24 AM, Alex Vinnik wrote:
>>
>>> random_page_cost=1 might be not what you really want.
>>> it would mean that random reads are as fast as as sequential reads, which
>>> probably is true only for SSD
>>
>> What randon_page_cost would be more appropriate for EC2 EBS Provisioned
>> volume that can handle 2,000 IOPS?
>>
>>
>> For EC2 Provisioned IOPS volumes - not standard EBS - random_page_cost=1
>> is exactly what you want.
>>
> Well... after some experimentation it turned out that random_page_cost=0.6
> gives me fast query
>
> QUERY PLAN
> Sort (cost=754114.96..754510.46 rows=158199 width=8) (actual
> time=1839.324..2035.405 rows=209401 loops=1)
> Sort Key: visits.id, views.id
> Sort Method: quicksort Memory: 15960kB
> -> Nested Loop (cost=0.00..740453.38 rows=158199 width=8) (actual
> time=0.048..1531.592 rows=209401 loops=1)
> -> Index Scan using visits_created_at_index on visits
> (cost=0.00..5929.82 rows=115492 width=4) (actual time=0.032..161.488
> rows=131311 loops=1)
> Index Cond: ((created_at >= '2013-01-15 00:00:00'::timestamp
> without time zone) AND (created_at < '2013-01-16 00:00:00'::timestamp
> without time zone))
> -> Index Scan using views_visit_id_index on views (cost=0.00..6.26
> rows=10 width=8) (actual time=0.003..0.005 rows=2 loops=131311)
> Index Cond: (visit_id = visits.id)
> Total runtime: 2234.142 ms
>
> random_page_cost=0.7 slows it down 16 times
>
> Sort (cost=804548.42..804943.92 rows=158199 width=8) (actual
> time=37011.337..37205.449 rows=209401 loops=1)
> Sort Key: visits.id, views.id
> Sort Method: quicksort Memory: 15960kB
> -> Merge Join (cost=15871.37..790886.85 rows=158199 width=8) (actual
> time=35673.602..36714.056 rows=209401 loops=1)
> Merge Cond: (visits.id = views.visit_id)
> -> Sort (cost=15824.44..16113.17 rows=115492 width=4) (actual
> time=335.486..463.085 rows=131311 loops=1)
> Sort Key: visits.id
> Sort Method: quicksort Memory: 12300kB
> -> Index Scan using visits_created_at_index on visits
> (cost=0.00..6113.04 rows=115492 width=4) (actual time=0.034..159.326
> rows=131311 loops=1)
> Index Cond: ((created_at >= '2013-01-15
> 00:00:00'::timestamp without time zone) AND (created_at < '2013-01-16
> 00:00:00'::timestamp without time zone))

> -> Index Scan using views_visit_id_visit_buoy_index on views
> (cost=0.00..757596.22 rows=6122770 width=8) (actual time=0.017..30765.316
> rows=5145902 loops=1)

Something is awry here. pg is doing an index scan via
views_visit_id_visit_buoy_index with no matching condition. What's
the definition of that index? The reason why the random_page_cost
adjustment is working is that you are highly penalizing sequential
type scans so that the database is avoiding the merge (sort A, sort B,
stepwise compare).

SQL server is doing a nestloop/index scan, just like the faster pg
plan, but is a bit faster because it's parallelizing.

merlin

Re: Simple join doesn't use index

From

Jeff Janes

Date:

30 January 2013, 02:15:47

On Mon, Jan 28, 2013 at 4:55 PM, Filip Rembiałkowski
<plk.zuber@gmail.com> wrote:
>
> On Mon, Jan 28, 2013 at 5:43 PM, Alex Vinnik <alvinnik.g@gmail.com> wrote:
>>
>> It sure turned out that default settings are not a good fit.
>
>
> do you know pgtune?
> it's a good tool for starters, if you want a fast postgres and don't really
> want to learn what's behind the scenes.
>
> random_page_cost=1 might be not what you really want.
> it would mean that random reads are as fast as as sequential reads, which
> probably is true only for SSD

Or that the "reads" are cached and coming from RAM, which is almost
surely the case here.

Cheers,

Jeff

Re: Simple join doesn't use index

From

Jeff Janes

Date:

02 February 2013, 19:39:46

On Monday, January 28, 2013, Alex Vinnik wrote:

It sure turned out that default settings are not a good fit. Setting random_page_cost to 1.0 made query to run in 2.6 seconds and I clearly see that indexes are being used in explain plan and IO utilization is close to 0.

This is not surprising. The default settings are aimed at quite small servers, while you seem to have a rather substantial one. Have you done anything yet to fix effective_cache_size?

QUERY PLAN
Sort (cost=969787.23..970288.67 rows=200575 width=8) (actual time=2176.045..2418.162 rows=241238 loops=1)

However I noticed that sorting is done using disk("external sort Disk: 4248kB")

As far as pgsql is concerned, it is using the disk. But the kernel is probably buffering that to an extent that the disk is probably being touched rather little. So I wouldn't worry about it.

which prompted me to take a look at work_mem. But it turned out that small increase to 4MB from default 1MB turns off index usage and query gets x10 slower. IO utilization jumped to 100% from literally nothing. so back to square one...

QUERY PLAN
Sort (cost=936642.75..937144.19 rows=200575 width=8) (actual time=33200.762..33474.443 rows=241238 loops=1)

And why should the IO utilization have jumped? Is everything in memory, or is it not? You should run your EXPLAINs with (analyze, buffers), and also you should turn on track_io_timings, at least in the local session; that will give us some insights.

If everything is in memory, then why is the seq scan taking so long? If not, then why is the nested loop such a good idea? (In my hands, when everything does *not* fit in memory, the nested loop is very very bad)

You seem have a bit of an infatuation with Dec 15th, running that one query over and over and over. Why? If the real

live query is not just for that one day repeatedly, then you should test with different days, not just one day repeatedly. (And if your real query really is like the movie "Groundhog Day", you should probably cluster or partition with that in mind.)

Anyway, there was an issue introduced in 9.2.0 and to be removed in 9.2.3 which over-penalized nested loops that had large indexes on the inner side. Since your different plans are so close to each other in estimated cost, I think this issue would be enough to tip it into the seq scan. Also, your poor setting of effective_cache_size might also be enough to tip it. And both combined, almost certainly are.

But ultimately, I think you are optimizing for a case that does not actually exist.

Cheers,

Jeff

Re: Simple join doesn't use index

From

Dan Fairs

Date:

05 February 2013, 00:15:59

Yeah.. I came across pgtune but noticed that latest version dated 2009-10-29 http://pgfoundry.org/frs/?group_id=1000416 which is kind of outdated. Tar file has settings for pg 8.3. Is still relevant?

Yes, I'm sure it will not do anything bad to your config.

Apologies for leaping in a little late, but I note the version on Github has been updated much more recently:

https://github.com/gregs1104/pgtune

Cheers,

Dan

--
Dan Fairs | dan.fairs@gmail.com | @danfairs | secondsync.com