Thread: Fast join

Fast join

From: Leon
Hello all!

I assume this is a developer's list, isn't it? Because the
address hackers-request@postgreSQL.org mentioned in PostgreSQL FAQ
doesn't work.

Ok. Let me begin at the beginning. We are a group of developers
of a free accounting system here in Russia. We want to use Postgres
as our database server. It is cool, but we encountered a problem
during preliminary tests, a problem concerning joins: joins are
simply too slow. Consider this sequence of commands given at the
command prompt of psql (duplicate NOTICEs have been removed):

--------------------
adb=> create table atable (afield int4 primary key,bfield int4);
NOTICE:  CREATE TABLE/PRIMARY KEY will create implicit index 'atable_pkey' for table 'atable'
CREATE
adb=> create table btable (afield int4 primary key,bfield int4);
NOTICE:  CREATE TABLE/PRIMARY KEY will create implicit index 'btable_pkey' for table 'btable'
CREATE
adb=> create index aindex on atable (bfield);
CREATE
adb=>  create index bindex on btable (bfield);
CREATE
adb=> explain select * from atable where atable.afield=btable.afield and atable.bfield>10;
NOTICE:  QUERY PLAN:

Hash Join  (cost=109.69 rows=335 width=12)
  ->  Seq Scan on btable  (cost=43.00 rows=1000 width=4)
  ->  Hash  (cost=21.67 rows=334 width=8)
        ->  Index Scan using aindex on atable  (cost=21.67 rows=334 width=8)
-------------------

This clearly shows that Postgres makes no attempt to use the
existing indices and restrictive qualifications when doing a join.
That's unacceptable to us, because it slows joins immensely.

What is desired here is the logic of a networked (or networking? :)
data model as opposed to the general relational model, because the
logic of nets is simpler than relational logic and can thus be
executed faster. They say a future SQL standard will incorporate
data types such as links (references), so how soon will they appear
in Postgres? We need this feature URGENTLY, because accounting
data is mostly network-type rather than general relational-type.
Please reply to me.

--
Leon.


Re: [GENERAL] Fast join

From: Bruce Momjian
> Hello all!
>
> I assume this is a developer's list, isn't it? Because the
> address hackers-request@postgreSQL.org mentioned in PostgreSQL FAQ
> doesn't work.
>
> Ok. Let me begin at the beginning. We are a group of developers
> of a free accounting system here in Russia. We want to use Postgres
> as our database server. It is cool, but we encountered a problem
> during preliminary tests, a problem concerning joins: joins are
> simply too slow. Consider this sequence of commands given at the
> command prompt of psql (duplicate NOTICEs have been removed):

Are you using 6.5?  If not, upgrade.  The optimizer is much smarter in
6.5.

Also, if a join touches most of the table, it is faster not to use
indexes, and just sort in the backend.
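To illustrate the trade-off Bruce describes, here is a toy cost model (the constants below are illustrative assumptions, not the real 6.5 planner numbers) showing why an index stops paying off once a qualification matches a large fraction of the table:

```python
# Toy planner cost model: an index scan pays one random page fetch per
# matching row (worst case), while a sequential scan reads every page
# once, cheaply and in order. The constants are assumptions for
# illustration only.

SEQ_PAGE_COST = 1.0      # cost to read one page sequentially
RANDOM_PAGE_COST = 4.0   # cost to read one page at random via an index

def seq_scan_cost(pages: int) -> float:
    """Read every page of the table once, in order."""
    return pages * SEQ_PAGE_COST

def index_scan_cost(pages: int, rows: int, selectivity: float) -> float:
    """Roughly one random page fetch per matching row (worst case)."""
    return rows * selectivity * RANDOM_PAGE_COST

pages, rows = 100, 10000
for sel in (0.001, 0.01, 0.1, 0.5):
    idx, seq = index_scan_cost(pages, rows, sel), seq_scan_cost(pages)
    winner = "index" if idx < seq else "seq scan"
    print(f"selectivity={sel:5.3f}  index={idx:8.1f}  seqscan={seq:6.1f}  -> {winner}")
```

Under these assumed constants the index only wins for very restrictive qualifications; a predicate matching even a few percent of the rows already favours the sequential scan.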

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [GENERAL] Fast join

From: Leon
Bruce Momjian wrote:
>
> > Hello all!

> > during preliminary tests, a problem concerning joins: joins are
> > simply too slow. Consider this sequence of commands given at the
> > command prompt of psql (duplicate NOTICEs have been removed):
>
> Are you using 6.5?  If not, upgrade.  The optimizer is much smarter in
> 6.5.

:)

The last and the latest!  6.5 I mean.

>
> Also, if a join touches most of the table, it is faster not to use
> indexes, and just sort in the backend.
>

The problem is - when you want just a small part of the table(s) and
you have indices to facilitate qualifications, Postgres doesn't
use 'em !  This is a question of Life and Death - i.e. to use or
not to use Postgres.

--
Leon.

Re: [GENERAL] Fast join

From: Bruce Momjian
> > Also, if a join touches most of the table, it is faster not to use
> > indexes, and just sort in the backend.
> >
>
> The problem is - when you want just a small part of the table(s) and
> you have indices to facilitate qualifications, Postgres doesn't
> use 'em !  This is a question of Life and Death - i.e. to use or
> not to use Postgres.

As I remember, your qualification was x > 10.  That may not be
restrictive enough to make an index faster.
--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [GENERAL] Fast join

From: Leon
Bruce Momjian wrote:
>
> > > Also, if a join touches most of the table, it is faster not to use
> > > indexes, and just sort in the backend.
> > >
> >
> > The problem is - when you want just a small part of the table(s) and
> > you have indices to facilitate qualifications, Postgres doesn't
> > use 'em !  This is a question of Life and Death - i.e. to use or
> > not to use Postgres.
>
> As I remember, your qualification was x > 10.  That may not be
> restrictive enough to make an index faster.

Oh, I'm sorry, it was a typo. But believe me, this behaviour
persists no matter what type of qualification is used. It is, so
to say, tested and approved. Look at Postgres's explanation of
its plan for a query on the database whose creation I showed you
earlier (it has two tables of 10000 rows, properly vacuumed):

---------------------
adb=> EXPLAIN SELECT COUNT(*) FROM atable WHERE atable.cfield = btable.cfield AND atable.afield<10;
NOTICE:  QUERY PLAN:

Aggregate  (cost=1047.69 rows=3334 width=12)
  ->  Hash Join  (cost=1047.69 rows=3334 width=12)
        ->  Seq Scan on btable  (cost=399.00 rows=10000 width=4)
        ->  Hash  (cost=198.67 rows=3334 width=8)
              ->  Index Scan using aindex on atable  (cost=198.67 rows=3334 width=8)

adb=> EXPLAIN SELECT COUNT(*) FROM atable WHERE atable.cfield = btable.cfield AND atable.afield>100;
NOTICE:  QUERY PLAN:

Aggregate  (cost=1047.69 rows=3334 width=12)
  ->  Hash Join  (cost=1047.69 rows=3334 width=12)
        ->  Seq Scan on btable  (cost=399.00 rows=10000 width=4)
        ->  Hash  (cost=198.67 rows=3334 width=8)
              ->  Index Scan using aindex on atable  (cost=198.67 rows=3334 width=8)
---------------------

It is clear that Postgres ALWAYS does a hash join of the whole tables.
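A telling detail in the two plans above: the estimate is rows=3334 out of 10000 whether the predicate is `afield<10` or `afield>100`, i.e. about one third in both cases. That is consistent with the planner falling back on a fixed default selectivity for inequality predicates when it has no value distribution to consult. A sketch of that guess (assumption: the constant is inferred from the EXPLAIN output, not taken from the 6.5 source):

```python
# Sketch: a planner with no histogram guesses a fixed fraction for any
# range predicate, so every "<" or ">" qualification looks the same.
DEFAULT_INEQ_SELECTIVITY = 1.0 / 3.0   # assumed constant, inferred from rows=3334

def estimate_rows(table_rows: int) -> int:
    # truncate and add one, matching the 3334-of-10000 estimate above
    return int(table_rows * DEFAULT_INEQ_SELECTIVITY) + 1

print(estimate_rows(10000))  # 3334, as in both EXPLAIN outputs
```

This would explain why the planner never prefers the index for these joins: every inequality looks like it returns a third of the table, which is far too many rows for an index scan to win.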

--
Leon.


Re: [GENERAL] Fast join

From: Herouth Maoz
At 18:33 +0300 on 29/06/1999, Leon wrote:


> adb=> EXPLAIN SELECT COUNT(*) FROM atable WHERE atable.cfield = btable.cfield AND atable.afield<10;
...
> adb=> EXPLAIN SELECT COUNT(*) FROM atable WHERE atable.cfield = btable.cfield AND atable.afield>100;

Hey, shouldn't these be:

SELECT COUNT(*)
FROM atable, btable  <----- Note this!
WHERE atable.cfield = btable.cfield
AND ....

I'm not sure that when the other table is implicit, the optimizer
consults btable's statistics in time.

Herouth

--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma



Re: [GENERAL] Fast join

From: Bruce Momjian
> Oh, I'm sorry, it was a typo. But believe me, this behaviour
> persists no matter what type of qualification is used. It is, so
> to say, tested and approved. Look at Postgres's explanation of
> its plan for a query on the database whose creation I showed you
> earlier (it has two tables of 10000 rows, properly vacuumed):
>
> ---------------------

> adb=> EXPLAIN SELECT COUNT(*) FROM atable WHERE atable.cfield = btable.cfield AND atable.afield<10;
> NOTICE:  QUERY PLAN:

But your only restriction is < 10.  That is not enough.  Make it = 10,
and I think it will use the index.


--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [GENERAL] Fast join

From: Leon
Herouth Maoz wrote:

> >
> > adb=> EXPLAIN SELECT COUNT(*) FROM atable WHERE atable.cfield = btable.cfield AND atable.afield>100;
>
> Hey, shouldn't these be:
>
> SELECT COUNT(*)
> FROM atable, btable  <----- Note this!
> WHERE atable.cfield = btable.cfield
> AND ....
>
> I'm not sure that when the other table is implicit, the optimizer
> consults btable's statistics in time.
>

adb=> EXPLAIN SELECT * FROM atable,btable WHERE atable.cfield = btable.cfield AND atable.afield<10;
NOTICE:  QUERY PLAN:

Hash Join  (cost=1052.69 rows=3334 width=40)
  ->  Seq Scan on btable  (cost=399.00 rows=10000 width=20)
  ->  Hash  (cost=198.67 rows=3334 width=20)
        ->  Index Scan using aindex on atable  (cost=198.67 rows=3334 width=20)

--
Leon.



Re: [GENERAL] Fast join

From: Leon
Bruce Momjian wrote:

> > adb=> EXPLAIN SELECT COUNT(*) FROM atable WHERE atable.cfield = btable.cfield AND atable.afield<10;
> > NOTICE:  QUERY PLAN:
>
> But your only restriction is < 10.  That is not enough.  Make it = 10,
> and I think it will use the index.

Ok. We did it! :)
-------------
adb=> EXPLAIN SELECT COUNT(*) FROM atable WHERE atable.cfield = btable.cfield AND atable.afield=10;
NOTICE:  QUERY PLAN:

Aggregate  (cost=4.10 rows=1 width=12)
  ->  Nested Loop  (cost=4.10 rows=1 width=12)
        ->  Index Scan using aindex on atable  (cost=2.05 rows=1 width=8)
        ->  Index Scan using hindex on btable  (cost=2.05 rows=10000 width=4)
-------------

But look here:
-------------
adb=> EXPLAIN SELECT * FROM atable WHERE atable.cfield = btable.cfield AND atable.afield IN (SELECT btable.bfield WHERE btable.bfield=10);
NOTICE:  QUERY PLAN:

Hash Join  (cost=1483.00 rows=10000 width=24)
  ->  Seq Scan on atable  (cost=399.00 rows=10000 width=20)
        SubPlan
          ->  Index Scan using gindex on btable  (cost=2.05 rows=1 width=4)
  ->  Hash  (cost=399.00 rows=10000 width=4)
        ->  Seq Scan on btable  (cost=399.00 rows=10000 width=4)
-------------

This is the same dumbness again. Will you fix the optimizer?

And one more thing: would you make a cool data type, a reference,
which is the physical record number of a foreign record? This could
make certain types of joins VERY fast, too good to be true. Such a
thing would really incorporate elements of the networking
(networked? :) data model into the relational model.

--
Leon.


Re: [GENERAL] Fast join

From: Bruce Momjian
> But look here:
> -------------
> adb=> EXPLAIN SELECT * FROM atable WHERE atable.cfield = btable.cfield AND atable.afield IN (SELECT btable.bfield WHERE btable.bfield=10);
> NOTICE:  QUERY PLAN:
>
> Hash Join  (cost=1483.00 rows=10000 width=24)
>   ->  Seq Scan on atable  (cost=399.00 rows=10000 width=20)
>         SubPlan
>           ->  Index Scan using gindex on btable  (cost=2.05 rows=1 width=4)
>   ->  Hash  (cost=399.00 rows=10000 width=4)
>         ->  Seq Scan on btable  (cost=399.00 rows=10000 width=4)
> -------------

You can't use an index here because an IN is not selective, or if it is,
there is no way for the optimizer to know this.
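A workaround consistent with this point is to phrase the restriction as a plain equality the planner can estimate, instead of hiding it inside an IN subplan. A sketch (assumption: the rewrite is equivalent here only because the subselect can yield nothing but the constant 10):

```sql
-- Instead of:
--   ... AND atable.afield IN (SELECT btable.bfield WHERE btable.bfield=10)
-- expose the constant directly so the index on afield can be considered:
SELECT *
  FROM atable, btable
 WHERE atable.cfield = btable.cfield
   AND atable.afield = 10;
```

With the bare `= 10` qualification visible, the planner can apply the same selective-equality estimate that produced the nested-loop plan earlier in the thread.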


--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [GENERAL] Fast join

From: Leon
Bruce Momjian wrote:

> >           ->  Index Scan using gindex on btable  (cost=2.05 rows=1 width=4)
> >   ->  Hash  (cost=399.00 rows=10000 width=4)
> >         ->  Seq Scan on btable  (cost=399.00 rows=10000 width=4)
> > -------------
>
> You can't use an index here because an IN is not selective, or if it is,
> there is no way for the optimizer to know this.

Ok. It seems the statistics within the optimizer's reach are quite
poor, so the optimizer can't always make sane predictions. And, I am
afraid, improving statistics gathering will require a major rewrite of the code.

So there should at least be some way to give hints to the optimizer,
shouldn't there?

--
Leon.


Re: [GENERAL] Fast join

From: Bruce Momjian
> Ok. It seems the statistics within the optimizer's reach are quite
> poor, so the optimizer can't always make sane predictions. And, I am
> afraid, improving statistics gathering will require a major rewrite of the code.
>
> So there should at least be some way to give hints to the optimizer,
> shouldn't there?

Currently, there is no way to give hints.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [GENERAL] Fast join

From: Leon
Bruce Momjian wrote:
>
> > Ok. It seems the statistics within the optimizer's reach are quite
> > poor, so the optimizer can't always make sane predictions. And, I am
> > afraid, improving statistics gathering will require a major rewrite of the code.
> >
> > So there should at least be some way to give hints to the optimizer,
> > shouldn't there?
>
> Currently, there is no way to give hints.

Picture this:

SELECT * FROM atable,btable  WHERE
atable.cfield = btable.cfield AND #(atable.afield>10)#

where the "magic brackets" tell the optimizer that the qualification
atable.afield>10 is very restrictive and should be treated on a par
with an "=" qualification. Such a hint seems much easier to implement
than making the optimizer gather real statistics on a table during
the decision process.
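No such syntax exists, but for what it's worth, a session-level approximation of a hint is possible with the planner toggles that later PostgreSQL releases expose (assumption: `enable_seqscan` as named in modern PostgreSQL; it may not be available under that name in the 6.5 release discussed here):

```sql
-- Discourage sequential scans for this session, pushing the planner
-- toward index scans wherever one is usable at all:
SET enable_seqscan = off;
EXPLAIN SELECT * FROM atable, btable
 WHERE atable.cfield = btable.cfield AND atable.afield > 10;
SET enable_seqscan = on;   -- restore the default afterwards
```

This is a blunt per-session switch rather than a per-qualification hint, but it covers the "I know this predicate is selective" case Leon is describing.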

--
Leon.



Re: [GENERAL] Fast join

From: Chris Bitmead
Leon wrote:

> And one more thing: would you make a cool data type, a reference,
> which is the physical record number of a foreign record? This could
> make certain types of joins VERY fast, too good to be true. Such a
> thing would really incorporate elements of the networking
> (networked? :) data model into the relational model.

When you say "a physical record number", I assume you mean some
reference to where the record is stored on disk. There are a number of
problems with this. One is space reclamation: you can't re-use space
any more, because if you put a new record in the place where the old
record was, there would be an integrity problem. If you somehow solve
that, there is still the problem that you can't move a record when it
gets bigger or when you want to re-organise the database. Backups also
become problematic.

In actual fact, you don't need physical record ids to make things
blindingly quick. Object databases like Versant have proved that.
I've no doubt their record-id lookup is massively optimised for the
special case, and I'd say it's significantly faster than most
relational database indexes, but the principle is no different. Record
ids are good. Physical record ids are bad.
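The distinction Chris draws can be sketched with an indirection table: a logical record id resolves to a physical slot through a map, so the record can move on disk without invalidating any reference to it (hypothetical structure for illustration, not Versant's or Postgres's actual design):

```python
# Logical ids vs physical ids: callers hold a logical id that resolves
# through an indirection map, so relocating a record only updates the map.

class Store:
    def __init__(self):
        self._next_id = 0
        self._location = {}   # logical id -> physical slot
        self._slots = {}      # physical slot -> record data

    def insert(self, slot, data):
        rid = self._next_id
        self._next_id += 1
        self._location[rid] = slot
        self._slots[slot] = data
        return rid            # callers keep this logical id, never the slot

    def move(self, rid, new_slot):
        """Relocate a record (e.g. it grew); only the map entry changes."""
        old = self._location[rid]
        self._slots[new_slot] = self._slots.pop(old)
        self._location[rid] = new_slot

    def fetch(self, rid):
        return self._slots[self._location[rid]]

store = Store()
rid = store.insert(slot=7, data="invoice #1")
store.move(rid, new_slot=42)        # record moved on "disk"
print(store.fetch(rid))             # still resolves: invoice #1
```

The extra map lookup is the price of being able to reclaim space, grow records, and reorganise storage, which is exactly what a raw physical record number forbids.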