Thread: How best to index foreign key for inner join

How best to index foreign key for inner join

From
Nathaniel Trellice
Date:
Hello,

I'd appreciate advice anyone can offer on this simple, newbie
question. It really is a straightforward question, so please bare with
me if I don't explain it succinctly below.

I want to search a table based on criteria on both a column within
that table and a column within a joined table, and I'm trying to
figure out how to index things for my particular form of query. The
postgres manual is excellent, but it doesn't appear to answer this
question explicitly through a matching example, and I'm not very good
with indexing, so I'm seeking advice here (yet again).

E.g. Suppose I have 2 tables defined:

CREATE TABLE table1 (
  id      SERIAL PRIMARY KEY,
  name    text
);

CREATE TABLE table2 (
  x            real,
  y            real,
  t            timestampz,
  table1_id    integer REFERENCES table1 ON DELETE CASCADE
);

My queries upon these tables are, almost exclusively, asking for all
the records with a particular 'name' between a range of times 't', and
so are of the form:

SELECT table2.x, table2.y, table2.t
FROM table2, table1
WHERE
  table2.table1_id = table1.id    -- inner join
  AND table1.name = 'some_name'    -- criterion on table1
  AND table2.t BETWEEN some_t AND some_other_t;    --criterion on table2


What columns should I index in table2, and how: multi-column on
(table1_id, t), say, or multiple single column indexes, or only one
on 't'?

I guess the query might benefit from an index on table1's 'name'
column, but since there are, typically, only tens or hundreds of
records in table1, it might not be worth it. There are millions of
records in table2.

Thanks,

Nathaniel




Re: How best to index foreign key for inner join

From
Thom Brown
Date:
2009/11/27 Nathaniel Trellice <naptrel@yahoo.co.uk>
Hello,

I'd appreciate advice anyone can offer on this simple, newbie
question. It really is a straightforward question, so please bare with
me if I don't explain it succinctly below.

I want to search a table based on criteria on both a column within
that table and a column within a joined table, and I'm trying to
figure out how to index things for my particular form of query. The
postgres manual is excellent, but it doesn't appear to answer this
question explicitly through a matching example, and I'm not very good
with indexing, so I'm seeking advice here (yet again).

E.g. Suppose I have 2 tables defined:

CREATE TABLE table1 (
 id      SERIAL PRIMARY KEY,
 name    text
);

CREATE TABLE table2 (
 x            real,
 y            real,
 t            timestampz,
 table1_id    integer REFERENCES table1 ON DELETE CASCADE
);

My queries upon these tables are, almost exclusively, asking for all
the records with a particular 'name' between a range of times 't', and
so are of the form:

SELECT table2.x, table2.y, table2.t
FROM table2, table1
WHERE
 table2.table1_id = table1.id    -- inner join
 AND table1.name = 'some_name'    -- criterion on table1
 AND table2.t BETWEEN some_t AND some_other_t;    --criterion on table2


What columns should I index in table2, and how: multi-column on
(table1_id, t), say, or multiple single column indexes, or only one
on 't'?

I guess the query might benefit from an index on table1's 'name'
column, but since there are, typically, only tens or hundreds of
records in table1, it might not be worth it. There are millions of
records in table2.

Thanks,

Nathaniel


Since for each row of table1 you'll be looking for multiple records on table2, I imagine you'd want to index table2.table1_id, and table2.t since that's what you're filtering on.  If table1 is particularly big, you might benefit from also indexing table1.name.

I personally prefer to use INNER JOIN syntax as it's clearer, although the query planner will probably be identical as it's clever that way:

SELECT table2.x, table2.y, table2.t
FROM table1
INNER JOIN table2 ON table1.id = table2.table1_id
AND table1.name = 'some_name'
WHERE table2.t BETWEEN some_t AND some_other_t;

Regards

Thom

Re: How best to index foreign key for inner join

From
Nathaniel Trellice
Date:
On 27 Nov 2009, at 11:12, Thom Brown <thombrown@gmail.com> wrote:

Since for each row of table1 you'll be looking for multiple records on table2, I imagine you'd want to index table2.table1_id, and table2.t since that's what you're filtering on.  If table1 is particularly big, you might benefit from also indexing table1.name.

I personally prefer to use INNER JOIN syntax as it's clearer, although the query planner will probably be identical as it's clever that way:

SELECT table2.x, table2.y, table2.t
FROM table1
INNER JOIN table2 ON table1.id = table2.table1_id
AND table1.name = 'some_name'
WHERE table2.t BETWEEN some_t AND some_other_t;

Thanks for the help Thom.

Do you think it's possible to phrase this query in such a way that the planner would be able to exploit any benefits from a mutli-column index on table2 on either (table1_id, t) or (t, table1_t)?

If you imagine setting up a view that encapsulates the inner join, creating a 'virtual' table with columns:

name text
x real
y real
t timestampz

Then the manual (11.3) suggests that ANDed queries on 'name' and 't' will be improved by using a multi-column index (name, t).

But the join confuses me. When the rule system breaks down both the query and the view's join, will the benefits of the multi-column index still be realised?


Nathaniel

Re: How best to index foreign key for inner join

From
Thom Brown
Date:
2009/11/27 Nathaniel Trellice <naptrel@yahoo.co.uk>
On 27 Nov 2009, at 11:12, Thom Brown <thombrown@gmail.com> wrote:

Since for each row of table1 you'll be looking for multiple records on table2, I imagine you'd want to index table2.table1_id, and table2.t since that's what you're filtering on.  If table1 is particularly big, you might benefit from also indexing table1.name.

I personally prefer to use INNER JOIN syntax as it's clearer, although the query planner will probably be identical as it's clever that way:

SELECT table2.x, table2.y, table2.t
FROM table1
INNER JOIN table2 ON table1.id = table2.table1_id
AND table1.name = 'some_name'
WHERE table2.t BETWEEN some_t AND some_other_t;

Thanks for the help Thom.

Do you think it's possible to phrase this query in such a way that the planner would be able to exploit any benefits from a mutli-column index on table2 on either (table1_id, t) or (t, table1_t)?

If you imagine setting up a view that encapsulates the inner join, creating a 'virtual' table with columns:

name text
x real
y real
t timestampz

Then the manual (11.3) suggests that ANDed queries on 'name' and 't' will be improved by using a multi-column index (name, t).

But the join confuses me. When the rule system breaks down both the query and the view's join, will the benefits of the multi-column index still be realised?


Nathaniel


Again, it really depends on the size of your tables.  If, for example, table2 contains 5 million rows and 99% of them have the same t value, and index probably won't be used by the planner since it's value distribution in the statistics data for that table (gathered by ANALYZE) would suggest a sequential scan is just as efficient.  If you have a lot of varying combinations between table2.table1_id and table2.t, a multi-column index will probably benefit you.

A good way to tell how much work is involved in running a query is to put EXPLAIN ANALYZE before it to see where the more expensive parts of the query are.

There are many instances where indexes won't be used by the planner purely because it's statistics tell it that's it's more or equally efficient to a sequential scan.  You need to know your data well and take the size of the table into account.

I'm hoping someone with better query-planner knowledge can chime in here with more solid recommendations though.

Thom