Thread: Joins, Deletes and Indexes

Joins, Deletes and Indexes

From
Butkus_Charles@emc.com
Date:
I've got 2 tables defined as follows:

CREATE TABLE "cluster"
(
  id int8 NOT NULL DEFAULT nextval('serial'::text),
  clusterid varchar(255) NOT NULL,
  ...
  CONSTRAINT pk_cluster PRIMARY KEY (id)
)

CREATE TABLE sensorreport
(
  id int8 NOT NULL DEFAULT nextval('serial'::text),
  clusterid int8 NOT NULL,
  ...
  CONSTRAINT pk_sensorreport PRIMARY KEY (id),
  CONSTRAINT fk_sensorreport_clusterid FOREIGN KEY (clusterid) REFERENCES
"cluster" (id) ON UPDATE RESTRICT ON DELETE RESTRICT
)

I've defined an Index on the clusterid field of sensorreport.


So I've run into 2 issues, one a SELECT, the other a DELETE;

SELECT issue:
So the following query:
EXPLAIN ANALYZE select * from sensorreport where clusterid = 25000114;

Yields:
"Index Scan using idx_sensorreport_clusterid on sensorreport
(cost=0.00..2.01 rows=1 width=129) (actual time=0.000..0.000 rows=38
loops=1)"
"  Index Cond: (clusterid = 25000114)"
"Total runtime: 0.000 ms"

However, when using a join as follows (in the cluster table id=25000114
clusterid='clusterid1'):
EXPLAIN ANALYZE select * from sensorreport as a join cluster as c on c.id =
a.clusterid where c.clusterid = 'clusterid1';

Yields:
Hash Join  (cost=1.18..566211.51 rows=1071429 width=287) (actual
time=150025.000..150025.000 rows=38 loops=1)
  Hash Cond: ("outer".clusterid = "inner".id)
  ->  Seq Scan on sensorreport a  (cost=0.00..480496.03 rows=15000003
width=129) (actual time=10.000..126751.000 rows=15000039 loops=1)
  ->  Hash  (cost=1.18..1.18 rows=1 width=158) (actual time=0.000..0.000
rows=0 loops=1)
        ->  Seq Scan on "cluster" c  (cost=0.00..1.18 rows=1 width=158)
(actual time=0.000..0.000 rows=1 loops=1)
              Filter: ((clusterid)::text = 'clusterid1'::text)
Total runtime: 150025.000 ms

My question is can I get the join query to use the
idx_sensorreport_clusterid index on the sensorreport table?

DELETE issue:
The statement:
EXPLAIN ANALYZE delete from cluster where clusterid='clusterid99'

Yields:
 Seq Scan on "cluster"  (cost=0.00..1.18 rows=1 width=6) (actual
time=0.000..0.000 rows=1 loops=1)
   Filter: ((clusterid)::text = 'clusterid99'::text)
 Total runtime: 275988.000 ms

I'm assuming that the length of the delete is because the "DELETE RESTRICT"
on the foreign key from sensortable.
Again, is there any way to get the delete to use the
idx_sensorreport_clusterid index?

Re: Joins, Deletes and Indexes

From
Richard Huxton
Date:
Butkus_Charles@emc.com wrote:
> I've got 2 tables defined as follows:
>
> CREATE TABLE "cluster"
> (
>   id int8 NOT NULL DEFAULT nextval('serial'::text),
>   clusterid varchar(255) NOT NULL,
>   ...
>   CONSTRAINT pk_cluster PRIMARY KEY (id)
> )
>
> CREATE TABLE sensorreport
> (
>   id int8 NOT NULL DEFAULT nextval('serial'::text),
>   clusterid int8 NOT NULL,
>   ...
>   CONSTRAINT pk_sensorreport PRIMARY KEY (id),
>   CONSTRAINT fk_sensorreport_clusterid FOREIGN KEY (clusterid) REFERENCES
> "cluster" (id) ON UPDATE RESTRICT ON DELETE RESTRICT
> )
>
> I've defined an Index on the clusterid field of sensorreport.

Looking further down, perhaps an index on cluster.clusterid too.

> So I've run into 2 issues, one a SELECT, the other a DELETE;
>
> SELECT issue:
> So the following query:
> EXPLAIN ANALYZE select * from sensorreport where clusterid = 25000114;
>
> Yields:
> "Index Scan using idx_sensorreport_clusterid on sensorreport
> (cost=0.00..2.01 rows=1 width=129) (actual time=0.000..0.000 rows=38
> loops=1)"
> "  Index Cond: (clusterid = 25000114)"
> "Total runtime: 0.000 ms"
>
> However, when using a join as follows (in the cluster table id=25000114
> clusterid='clusterid1'):
> EXPLAIN ANALYZE select * from sensorreport as a join cluster as c on c.id =
> a.clusterid where c.clusterid = 'clusterid1';

You don't say what version you're using, but older versions of PG took a
literal join as a request to plan a query in that order. Try rewriting
it without the "join" keyword and see if the plan alters.

> Yields:
> Hash Join  (cost=1.18..566211.51 rows=1071429 width=287) (actual
> time=150025.000..150025.000 rows=38 loops=1)
>   Hash Cond: ("outer".clusterid = "inner".id)
>   ->  Seq Scan on sensorreport a  (cost=0.00..480496.03 rows=15000003
> width=129) (actual time=10.000..126751.000 rows=15000039 loops=1)
>   ->  Hash  (cost=1.18..1.18 rows=1 width=158) (actual time=0.000..0.000
> rows=0 loops=1)
>         ->  Seq Scan on "cluster" c  (cost=0.00..1.18 rows=1 width=158)
> (actual time=0.000..0.000 rows=1 loops=1)
>               Filter: ((clusterid)::text = 'clusterid1'::text)
> Total runtime: 150025.000 ms
>
> My question is can I get the join query to use the
> idx_sensorreport_clusterid index on the sensorreport table?

The only reason to use the index on sensorreport is if it isn't going to
match many rows. That means we want to run the restriction on
"clisterid1" first, which suggests you want that index on table cluster.

> DELETE issue:
> The statement:
> EXPLAIN ANALYZE delete from cluster where clusterid='clusterid99'
>
> Yields:
>  Seq Scan on "cluster"  (cost=0.00..1.18 rows=1 width=6) (actual
> time=0.000..0.000 rows=1 loops=1)
>    Filter: ((clusterid)::text = 'clusterid99'::text)
>  Total runtime: 275988.000 ms
>
> I'm assuming that the length of the delete is because the "DELETE RESTRICT"
> on the foreign key from sensortable.
> Again, is there any way to get the delete to use the
> idx_sensorreport_clusterid index?

No, because this is the cluster table, not sensorreport :-)

--
   Richard Huxton
   Archonet Ltd

Re: Joins, Deletes and Indexes

From
Butkus_Charles@emc.com
Date:

> -----Original Message-----
> From: Richard Huxton [mailto:dev@archonet.com]
> Sent: Wednesday, February 23, 2005 3:40 AM
> To: Butkus_Charles@emc.com
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Joins, Deletes and Indexes
>
> Butkus_Charles@emc.com wrote:
> > I've got 2 tables defined as follows:
> >
> > CREATE TABLE "cluster"
> > (
> >   id int8 NOT NULL DEFAULT nextval('serial'::text),
> >   clusterid varchar(255) NOT NULL,
> >   ...
> >   CONSTRAINT pk_cluster PRIMARY KEY (id)
> > )
> >
> > CREATE TABLE sensorreport
> > (
> >   id int8 NOT NULL DEFAULT nextval('serial'::text),
> >   clusterid int8 NOT NULL,
> >   ...
> >   CONSTRAINT pk_sensorreport PRIMARY KEY (id),
> >   CONSTRAINT fk_sensorreport_clusterid FOREIGN KEY
> (clusterid) REFERENCES
> > "cluster" (id) ON UPDATE RESTRICT ON DELETE RESTRICT
> > )
> >
> > I've defined an Index on the clusterid field of sensorreport.
>
> Looking further down, perhaps an index on cluster.clusterid too.
>
> > So I've run into 2 issues, one a SELECT, the other a DELETE;
> >
> > SELECT issue:
> > So the following query:
> > EXPLAIN ANALYZE select * from sensorreport where clusterid
> = 25000114;
> >
> > Yields:
> > "Index Scan using idx_sensorreport_clusterid on sensorreport
> > (cost=0.00..2.01 rows=1 width=129) (actual time=0.000..0.000 rows=38
> > loops=1)"
> > "  Index Cond: (clusterid = 25000114)"
> > "Total runtime: 0.000 ms"
> >
> > However, when using a join as follows (in the cluster table
> id=25000114
> > clusterid='clusterid1'):
> > EXPLAIN ANALYZE select * from sensorreport as a join
> cluster as c on c.id =
> > a.clusterid where c.clusterid = 'clusterid1';
>
> You don't say what version you're using, but older versions
> of PG took a
> literal join as a request to plan a query in that order. Try
> rewriting
> it without the "join" keyword and see if the plan alters.

I'm using version 8.0 on Windows.

>
> > Yields:
> > Hash Join  (cost=1.18..566211.51 rows=1071429 width=287) (actual
> > time=150025.000..150025.000 rows=38 loops=1)
> >   Hash Cond: ("outer".clusterid = "inner".id)
> >   ->  Seq Scan on sensorreport a  (cost=0.00..480496.03
> rows=15000003
> > width=129) (actual time=10.000..126751.000 rows=15000039 loops=1)
> >   ->  Hash  (cost=1.18..1.18 rows=1 width=158) (actual
> time=0.000..0.000
> > rows=0 loops=1)
> >         ->  Seq Scan on "cluster" c  (cost=0.00..1.18
> rows=1 width=158)
> > (actual time=0.000..0.000 rows=1 loops=1)
> >               Filter: ((clusterid)::text = 'clusterid1'::text)
> > Total runtime: 150025.000 ms
> >
> > My question is can I get the join query to use the
> > idx_sensorreport_clusterid index on the sensorreport table?
>
> The only reason to use the index on sensorreport is if it
> isn't going to
> match many rows. That means we want to run the restriction on
> "clisterid1" first, which suggests you want that index on
> table cluster.

The cluster table only has 11 rows, so I'm not sure an index would
help.  The sensorreport table has 15,000,000 rows so that's why I've
got the index there.

>
> > DELETE issue:
> > The statement:
> > EXPLAIN ANALYZE delete from cluster where clusterid='clusterid99'
> >
> > Yields:
> >  Seq Scan on "cluster"  (cost=0.00..1.18 rows=1 width=6) (actual
> > time=0.000..0.000 rows=1 loops=1)
> >    Filter: ((clusterid)::text = 'clusterid99'::text)
> >  Total runtime: 275988.000 ms
> >
> > I'm assuming that the length of the delete is because the
> "DELETE RESTRICT"
> > on the foreign key from sensortable.
> > Again, is there any way to get the delete to use the
> > idx_sensorreport_clusterid index?
>
> No, because this is the cluster table, not sensorreport :-)

True, but the foreign key constraint on the sensorreport table forces
Postgres to check if there are any sensorreport's that are currently
using this cluster before allowing the cluster to be deleted.

>
> --
>    Richard Huxton
>    Archonet Ltd
>

Thanks a lot for the reply.

Chuck Butkus
EMC

Re: Joins, Deletes and Indexes

From
Richard Huxton
Date:
Butkus_Charles@emc.com wrote:
>
> The cluster table only has 11 rows, so I'm not sure an index would
> help.  The sensorreport table has 15,000,000 rows so that's why I've
> got the index there.

Ah - only 11?

>>>on the foreign key from sensortable.
>>>Again, is there any way to get the delete to use the
>>>idx_sensorreport_clusterid index?
>>
>>No, because this is the cluster table, not sensorreport :-)
>
> True, but the foreign key constraint on the sensorreport table forces
> Postgres to check if there are any sensorreport's that are currently
> using this cluster before allowing the cluster to be deleted.

If you only have 11 distinct values in the large table then it's
debatable whether it's always quicker to use the index. Since your first
example (clusterid = 25000114) returned so few rows, I'm guessing that
some other values represent a sizeable percentage of the table. That'd
explain the difference between PG's estimates and the actual number of
matching rows.

You can try "SET enable_seqscan =false;" before running the query and
see whether using the index helps things.

--
   Richard Huxton
   Archonet Ltd