Re: Bidirectional hard joins (fwd) - Mailing list pgsql-hackers

From Hannu Krosing
Subject Re: Bidirectional hard joins (fwd)
Date
Msg-id 1017934312.12446.31.camel@taru.tm.ee
Whole thread Raw
In response to Bidirectional hard joins (fwd)  (Oleg Bartunov <oleg@sai.msu.su>)
List pgsql-hackers
On Thu, 2002-04-04 at 14:17, Oleg Bartunov wrote:Subject: Bidirectional hard joins
> 
> Hi,
> 
> 
> Here we propose some essential improvement of postgreSQL functionality,
> which may provide a great perfomance increase.
> 
>   1. Problem
> 
> The fastest way to find and fetch a record from a table is to
> perform a SELECT ... WHERE record.id = value.
> Probably, an index scan would be performed for this SELECT.
> 
> Such index scan seems to be fast, but there are some cases where it may
> appear too slow. The most evident case is the case of a sub-query, which
> can arise as a result of a join or a nested select statement.
> 
> If it were possible to store direct references to database
> records in the tables, joins could be implemented in much more effective
> way.
> 
>   2. Possible soultion
> 
> Creating a datatype which stores direct reference
> to the record (i.e., physical location of the tuple) is only a part of the
> solution.

The tid type does exaclty what is needed.

> 
> When a record that is referenced is updated its physical location can be
> changed, so the references to it should be updated.  To make this
> possible, the referenced record should remember all the references to
> itself. Thus, we consider the direct tuple references as bidirectional
> links, or "bidirectional hard joins".
> 
> These "hard joins" are in some sense similar to hard links in a
> filesystem (in this analogy, classic joins are like symbolic links).
> 
> Philosophically, this means a convergence between indexes and tables: a
> table behaives like an index for an other table.
> 
> Obviously, this is is a nonrelational feature, and it requires some
> special SQL syntax. Below we provide some examples for clarification of
> the use of the proposed feature.

Or we can just use tid's and ordinary joins to make it a relational
feature.

IIRC this has been discussed on this list a few months ago. I'm not sure
if bi-directional tid usage was discussed, but I can't see how to use
them efficiently in a non-overwrite storage manager. 

...

>   4. Performance
> 
> When direct joins are used in select statements, they can strongly
> increase performance.
> 
> Let us examine the query plan of the request ("Find all Bob's
> children") from the example above in the present day postgres.
>   create table man (id SERIAL,name text);
>   create table child (id SERIAL,name text, parent_id int4 references man(id));
>   .. populate the tables ... and create indexes...
>   explain select child.name from child, man
>            where child.parent_id = man.id
>              and man.name = 'Bob Scott';
> 
>   Nested Loop
>   ->  Index Scan using man_n on man
>   ->  Index Scan using child_par on child
> 
> 
> 
> In a hypotetical postgres with hard joins it could be:
> 
>   Nested Loop
>   ->  Index Scan using man_n on man
>   ->  Direct retrieval on child
>
> I.e., the for each retrieved "man" record we retrieve the "child" records
> directly using hard join. The real overhead for this operation should be
> neglible in comparison with index scan.

OTOH, if index is in memory and the retrieved tuple is not then the
_speed_difference_ could be neglible.

> Using the hard joins require some additional overhead in updates. In fact,
> after updating the record which takes part in such join, the references
> to this record in the other records should be also updated.

And this should be in a non-overwriting way. If we just do a standard
UPDATE, causing a new heap record to be added this will result in a
circle as then the original records references are not valid anymore and
so also need to be updated and so on ...

> This operation
> is not essentially new for postgres as similar things are done with
> indexes when an indexed record is updated. Hence, the overhead for updates
> is not greater than the overhead for updating indexes.

AFAIK indexes are not "updated" but a new index entry is added as the
old tuple may be still visible to some other transaction.

>   5. Implementation and conclusion
> 
> Effective implementing of hard joins requires hard changes to postgres,
> most serious of them probably in the executor, where a new method "fetch
> record by reference" should be added in addition to "index scan" and "seq
> scan". Also the optimizer should be taught to deal with this.
> 
> The update support is not so hard as it is similar to the updating of
> indexes.
> 
> Though the implementation of such  hard joins is really a complicated task,
> the performance it brings should be tremendous, so we consider discussing
> this important.

Depending on usage the performance degradation can also be tremendous,
as a simple update can trigger an avalance of referencing tid updates
...

--------------
Hannu











pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: timeout implementation issues
Next
From: Tom Lane
Date:
Subject: Re: Changing column types...