Thread: Fwd: Joins and links
Hello hackers! I posted this message to the general mailing list and was told that the hackers' list is a more appropriate place for it. What do you say?

This is a forwarded message
From: Leon <leon@udmnet.ru>
To: pgsql-general <pgsql-general@postgreSQL.org>
Date: Monday, July 05, 1999, 5:46:36 PM
Subject: Joins and links

===8<==============Original message text===============

Hello all!

You probably remember me - recently I complained about the speed of joins in Postgres. After a short investigation a way was found in which Postgres's optimizer can do the right job. It was a constructive discussion. Now I want to tell you what could make Postgres better and faster, and what would make us (our development group) happy. Maybe I am bothering someone; if I am, tell me so. Let me begin.

First of all, some accounting reports need to be delivered very fast - within minutes or so. What's bad is that quite a few of these reports are time-consuming and search intensive. In particular, the internals of these reports involve a lot of joins on tables.

Secondly, almost all accounting information naturally fits the network data model, which can be implemented very efficiently. The issue described here is not accounting-specific; it appears in every database that uses master-detail tables and other such relations.

So how is a join performed in such cases? Although I am not an expert, I can imagine the way: first there is an (index) scan on the first table, and then an (index) scan on the second. That is the best case; reality can be much worse, as we have seen.

How can we radically improve performance in such cases? There is a simple and quite obvious way. (So that you don't think I am hallucinating, I will mention that there exist real servers that offer the features I am talking about.) We should make a real reference from one table to another! That means there could be a special data type called, say, "link", which is a physical record number in the foreign table. Queries could look like this:

	table1:
	    a int4
	    b link (-> table2)

	table2:
	    c int4
	    recnum (system auxiliary field - really a record number in the table)

	select * from table2 where table1.a > 5 and table1.b = table2.recnum

Such joins can fly really fast, as practice shows :) Just consider: table1.b = table2.recnum is a READY-MADE join, so the server doesn't have to build anything on top of it. It can simply perform a lookup through the link, and since it is a physical record number, this is done with the efficiency of C pointers! Thus the performance gain is ENORMOUS.

It also simplifies the optimizer, because it doesn't have to decide anything about whether to use indexes and the like. The join is always performed the same way, and it is the best way.

This feature, once implemented, could put Postgres ahead of most commercial servers, proving the creative abilities of the free software community. Let us make a step into the future!

Best regards, Leon

===8<===========End of original message text===========

Best regards, Leon
mailto:leon@udmnet.ru
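For comparison, here is roughly how the same query has to be written today, with an ordinary joinable column backed by an index rather than a physical link (the column pairing and index names below are only illustrative, not taken from the message above):

	CREATE TABLE table2 (c int4 PRIMARY KEY);        -- PRIMARY KEY gives an index on c
	CREATE TABLE table1 (a int4, b int4);            -- b holds values of table2.c
	CREATE INDEX table1_a_idx ON table1 (a);

	SELECT table2.*
	  FROM table1, table2
	 WHERE table1.a > 5
	   AND table1.b = table2.c;

The planner still has to pick a join strategy here (nested loop with the index, mergejoin, or hashjoin); Leon's point is that a link column would make that choice unnecessary.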
Leon <leon@udmnet.ru> writes:
> We should make a real reference in one table to another! That
> means there could be special data type called, say, "link",
> which is a physical record number in the foreign table.

There is no such thing as a physical record number for a tuple in Postgres. The closest you could come is an OID, which isn't really any faster than any other joinable field --- you still need an index to support fast lookup by OID.

If we did have such a concept, the speed penalties for supporting hard links from one tuple to another would be enormous. Every time you change a tuple, you'd have to try to figure out what other tuples reference it, and update them all.

Finally, I'm not convinced that the results would be materially faster than a standard mergejoin (assuming that you have indexes on both the fields being joined) or hashjoin (in the case that one table is small enough to be loaded into memory).

			regards, tom lane
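To make the OID point concrete, a rough sketch of the closest existing equivalent of a "link" (hypothetical table names; this assumes the referencing column stores the other table's OIDs, which rows received automatically in 1999-era Postgres):

	CREATE TABLE table2 (c int4);
	CREATE TABLE table1 (a int4, b oid);          -- b stores OIDs of table2 rows
	CREATE INDEX table2_oid_idx ON table2 (oid);  -- without this, lookup by OID is a sequential scan

	SELECT table2.*
	  FROM table1, table2
	 WHERE table1.a > 5
	   AND table1.b = table2.oid;

The OID is a stable logical identifier rather than a physical address, which is exactly why the index is still required.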
Hello Tom,

Monday, July 05, 1999 you wrote:

T> If we did have such a concept, the speed penalties for supporting
T> hard links from one tuple to another would be enormous. Every time
T> you change a tuple, you'd have to try to figure out what other tuples
T> reference it, and update them all.

I'm afraid that's mainly because fields in Postgres have variable length, and after an update the tuple goes to the end of the table. Am I right? In that case such referencing could be allowed only for tables with fixed-width rows, whose updates can naturally be done without moving. It is a small sacrifice, but it is worth it.

T> Finally, I'm not convinced that the results would be materially faster
T> than a standard mergejoin (assuming that you have indexes on both the
T> fields being joined) or hashjoin (in the case that one table is small
T> enough to be loaded into memory).

Consider this: no indices, no optimizer thinking, no index lookups - no nothing! Just the sequential number of the record multiplied by the record size. Exactly three CPU instructions: read, multiply, lookup. Can you see the gain now?

Best regards, Leon
Leon <leon@udmnet.ru> writes:
> I'm afraid that's mainly because fields in Postgres have variable
> length, and after an update the tuple goes to the end of the table.
> Am I right? In that case such referencing could be allowed only for
> tables with fixed-width rows, whose updates can naturally be done
> without moving. It is a small sacrifice, but it is worth it.

No, you are not right. Tuple updates can *never* be done without moving the tuple, because the old tuple value must not be overwritten until and unless the transaction is committed. (Under MVCC, it may need to stick around even longer than that, I believe.) Thus, a tuple update would require an update (and move) of every referencing tuple, which could cascade into updates of tuples that reference those tuples, etc.

But the really serious problem is simply that of finding the tuples that reference the tuple you are about to update. This is relatively straightforward for indexes --- you just compute the index value for the old tuple and probe into the index with it. When the referencing tuples might be anywhere in the database, there are no easy shortcuts. I think the only way would be maintaining an index listing all the referencing tuples for every referenced tuple. This would be a bit of a bear to maintain, as well as a source of unwanted blocks/deadlocks, since it would be a bottleneck for updates to *every* table in the database.

T> Finally, I'm not convinced that the results would be materially faster
T> than a standard mergejoin (assuming that you have indexes on both the
T> fields being joined) or hashjoin (in the case that one table is small
T> enough to be loaded into memory).

> Consider this: no indices,

You'd still need indices --- see above.

> no optimizer thinking,

You'd still need to run the optimizer to decide whether you wanted to use this technique or some more-conventional one (unless your proposal is to remove all other join mechanisms? Rather inflexible, that...)

> no nothing! Just the sequential number of the record multiplied by
> the record size. Exactly three CPU instructions: read, multiply,
> lookup. Can you see the gain now?

If you think it takes three instructions to access a tuple that's out on disk somewhere, I'm afraid you're sadly mistaken.

			regards, tom lane
> Leon <leon@udmnet.ru> writes:
> > We should make a real reference in one table to another! That
> > means there could be special data type called, say, "link",
> > which is a physical record number in the foreign table.
>
> There is no such thing as a physical record number for a tuple in
> Postgres. The closest you could come is an OID, which isn't really any
> faster than any other joinable field --- you still need an index to
> support fast lookup by OID.

Actually, there is:

	select ctid from pg_class;
	 ctid
	------
	(0,1)
	(0,2)
	...

The number is the block number and the offset within the block. It doesn't help because UPDATED rows would get a new tid. Tids can be used for short periods if you are sure the data in the table doesn't change, and there is a TODO item to allow ctid references in the WHERE clause.

--
  Bruce Momjian                 |  http://www.op.net/~candle
  maillist@candle.pha.pa.us     |  (610) 853-3000
  + If your life is a hard drive, |  830 Blythe Avenue
  + Christ can be your backup.    |  Drexel Hill, Pennsylvania 19026
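A short sketch of using a tid directly, as Bruce describes. The first query worked at the time; the second, looking a row up by its ctid in the WHERE clause, was still only a TODO item then and works only in later releases:

	SELECT ctid, relname FROM pg_class;

	-- direct lookup by tuple location, valid only as long as the row
	-- has not been updated and the table has not been vacuumed
	SELECT relname FROM pg_class WHERE ctid = '(0,1)';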
Hello Tom,

Tuesday, July 06, 1999 you wrote:

T> No, you are not right. Tuple updates can *never* be done without
T> moving the tuple, because the old tuple value must not be overwritten
T> until and unless the transaction is committed. (Under MVCC, it may
T> need to stick around even longer than that, I believe.) Thus, a tuple
T> update would require an update (and move) of every referencing tuple,
T> which could cascade into updates of tuples that reference those tuples,
T> etc.

This problem can be solved. An offhand solution is to have an additional system field which points to the new tuple left after an update. It is filled in at the same time as the original tuple is marked invalid. So the scenario is as follows: we follow the link, and if we find that the tuple we arrived at has this system field non-NULL, we go to where it points (in the same table, of course). VACUUM will eliminate these chains. The performance penalty is small.

T>> Finally, I'm not convinced that the results would be materially faster
T>> than a standard mergejoin (assuming that you have indexes on both the
T>> fields being joined) or hashjoin (in the case that one table is small
T>> enough to be loaded into memory).

>> Consider this: no indices,

T> You'd still need indices --- see above.

The only time we need to look up who is referencing us is during VACUUM. So there is no real need for indices.

>> no optimizer thinking,

T> You'd still need to run the optimizer to decide whether you wanted to
T> use this technique or some more-conventional one (unless your proposal
T> is to remove all other join mechanisms? Rather inflexible, that...)

No. I am not an evil force trying to eliminate everything :) I said that when the optimizer sees a join made through such a field, it has only one option - to follow the link. It simply has no choice.

T> If you think it takes three instructions to access a tuple that's out
T> on disk somewhere, I'm afraid you're sadly mistaken.

No. I meant a tuple which is in memory somewhere :)

Best regards, Leon
Hello Bruce,

Tuesday, July 06, 1999 you wrote:

B> Actually, there is:
B>	select ctid from pg_class;
B>	 ctid
B>	------
B>	(0,1)
B>	(0,2)
B>	...
B> The number is the block number and the offset within the block. It
B> doesn't help because UPDATED rows would get a new tid. Tids can be
B> used for short periods if you are sure the data in the table doesn't
B> change, and there is a TODO item to allow ctid references in the
B> WHERE clause.

It seems that you are moving in the right direction. But these tids seem to be of short-term use anyway. Why not upgrade to full-featured links at once?

Best regards, Leon
Leon <leon@udmnet.ru> writes:
> This problem can be solved. An offhand solution is to have an
> additional system field which points to the new tuple left after an
> update. It is filled in at the same time as the original tuple is
> marked invalid. So the scenario is as follows: we follow the link,
> and if we find that the tuple we arrived at has this system field
> non-NULL, we go to where it points (in the same table, of course).
> VACUUM will eliminate these chains. The performance penalty is small.

Is it small? After multiple updates to the referenced tuple, you'd be talking about following a chain of TID references in order to find the referenced tuple from the referencing tuple. I'd expect this to take more time than an index access within a fairly small number of updates (maybe four or so, just on the basis of counting disk-block fetches).

VACUUM is an interesting problem as well: to clean up the chains as you suggest, VACUUM could no longer be a one-table-at-a-time proposition. It would have to be able to update tuples elsewhere while repacking the tuples in the current table. This probably means that VACUUM requires a global lock across the whole database. Also, making those updates in an already-vacuumed table without undoing its nicely vacuumed state might be tricky.

			regards, tom lane
Hello Tom,

Tuesday, July 06, 1999 you wrote:

T> Is it small?

It is :) First you should tell me what the cost of a tid lookup is. If it is significantly more expensive than a C pointer, then we should consider implementing such a cheap pointer. If a tid is already cheap, then even 10 consecutive lookups will cost almost nothing. And besides, you should consider statistics. Can you name five or even three applications where large databases are massively updated without being vacuumed often?

T> After multiple updates to the referenced tuple, you'd be
T> talking about following a chain of TID references in order to find the
T> referenced tuple from the referencing tuple. I'd expect this to take
T> more time than an index access within a fairly small number of updates
T> (maybe four or so, just on the basis of counting disk-block fetches).

T> VACUUM is an interesting problem as well: to clean up the chains as you
T> suggest, VACUUM could no longer be a one-table-at-a-time proposition.
T> It would have to be able to update tuples elsewhere while repacking the
T> tuples in the current table. This probably means that VACUUM requires
T> a global lock across the whole database.

Does VACUUM require a lock on the vacuumed table now? I am sure it does. And in our case we must lock the vacuumed table and all the tables that reference it, not all tables. Besides, the manual suggests that VACUUM should be done nightly, not daily :) Having acquired such a lock, vacuum should update the "main" table first, then update all links in the referencing tables. This can be done using oids, which are matched between the new and old versions of the "main" table (are oids preserved during vacuum? - if they are not, it can be done with the primary key).

T> Also, making those updates
T> in an already-vacuumed table without undoing its nicely vacuumed state
T> might be tricky.

I didn't get the idea of the last sentence. Anyway, I am going to sleep. See you tomorrow :)

Best regards, Leon
> Leon <leon@udmnet.ru> writes:
> > I'm afraid that's mainly because fields in Postgres have variable
> > length, and after an update the tuple goes to the end of the table.
> > Am I right? In that case such referencing could be allowed only for
> > tables with fixed-width rows, whose updates can naturally be done
> > without moving. It is a small sacrifice, but it is worth it.
>
> No, you are not right. Tuple updates can *never* be done without
> moving the tuple, because the old tuple value must not be overwritten
> until and unless the transaction is committed. (Under MVCC, it may
> need to stick around even longer than that, I believe.) Thus, a tuple
> update would require an update (and move) of every referencing tuple,
> which could cascade into updates of tuples that reference those tuples,
> etc.

Yes, thanks, Tom. This is exactly the issue.

--
  Bruce Momjian                 |  http://www.op.net/~candle
  maillist@candle.pha.pa.us     |  (610) 853-3000
  + If your life is a hard drive, |  830 Blythe Avenue
  + Christ can be your backup.    |  Drexel Hill, Pennsylvania 19026
Congrats Tom for rising to the challenge ;)

istm that a brute force hack such as is being suggested should be left as an exercise for the reader. These "table links" seem to controvert the ability for an RDBMS to mix and match tables in ways which are not hardcoded beforehand.

Regardless of whether "there exist some real servers that offer such features I am talking about", a departure from the relational model in a relational database is likely to lead to undesirable constraints and restrictions in our future development.

If they are a good idea, you might be able to implement and prove them using an embedded language and the SPI facilities.

Just the usual $.02...

                      - Thomas

--
Thomas Lockhart				lockhart@alumni.caltech.edu
South Pasadena, California
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> Regardless of whether "there exist some real servers that offer such
> features I am talking about", a departure from the relational model in
> a relational database is likely to lead to undesirable constraints and
> restrictions in our future development.

That was another thing that was bothering me about the idea of "version links" between tuples (as in Leon's second proposal). They don't fit into the fundamental relational model.

I am not sure there's anything fundamentally wrong with his basic point; if, say, we could find a way to construct OIDs so that a tuple could be found very quickly from its OID, that wouldn't violate the relational model AFAICS, and such OIDs would work fine as "links". But I don't see any way to do that without either giving up UPDATE or introducing a huge amount of baggage into all processes that can update tables (VACUUM being the worst case, likely). Without doubt the best compromise would look remarkably like an index on OID.

Ultimately, when you consider both the update costs and the access costs, I doubt that this sort of thing could be a win, except maybe in the case where the referenced table is hardly ever changed so that the update costs are seldom incurred. But in that situation it's not clear you want to store the referenced table in an RDBMS anyway --- there are lots of lower-overhead ways to deal with fixed tables, such as perfect-hash generators.

> If they are a good idea, you might be able to implement and prove them
> using an embedded language and the SPI facilities.

I don't think VACUUM invokes triggers, so you couldn't really do anything about VACUUM rearranging the table under you that way, could you?

I'll be interested to see Vadim's comments on this thread...

			regards, tom lane
Thomas Lockhart wrote:
>
> Congrats Tom for rising to the challenge ;)
>
> istm that a brute force hack such as is being suggested should be left
> as an exercise for the reader.
>
> These "table links" seem to controvert the ability for an RDBMS to mix
> and match tables in ways which are not hardcoded beforehand.

Excuse me, in what way? All the usual features of SQL remain in their rightful place. What you get is one more additional ability, not a constraint.

> Regardless of whether "there exist some real servers that offer such
> features I am talking about", a departure from the relational model in
> a relational database is likely to lead to undesirable constraints and
> restrictions in our future development.

Until you offer an example of such a constraint, these words are groundless fear. Think of it not merely as an index-lookup speedup, but as a quick and clever way to fix the optimizer. As we have seen, the optimizer currently has trouble choosing the fastest way of executing a query. This is a deep-rooted problem, because the optimizer generally needs fine-grained statistics on the tables it is working on, and gathering such statistics would require, I am afraid, a total rewrite of the optimizer. In the case of links, the query is always done the best way.

--
Leon.
Tom Lane wrote:
>
> Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> > Regardless of whether "there exist some real servers that offer such
> > features I am talking about", a departure from the relational model in
> > a relational database is likely to lead to undesirable constraints and
> > restrictions in our future development.

Yep. Also, you fix one set of hard links and the next day you need to do a slightly different join, and it doesn't fit into the links you constructed because you left out a table or something silly.

I used to work on a system storing and retrieving real-time trading data on Tandem. We were integrating it with their first-generation CORBA system, but could handle 70 updates a second plus heavy reads of historical/real-time tick data. And this was on an old 4-processor K10000. Nearly all data was stored at least twice -- it made the inserts slightly slower (Tandem is bloody good at inserts!), but otherwise we couldn't have coped with reads of several MBytes of historical data per query.

Leon, I think you should study the accesses and build the right intermediate tables. Yes, I know you are not supposed to duplicate data, but hey, this is the real world, and disk is cheap. And triggers etc. make it fairly manageable to retain integrity. But what is indispensable is the flexibility you have in a true relational model, so that you can easily adapt to changing demands -- adding temporary tables as you need them for new reports and dropping them as they go out of use. Of course you can use views, but this can still be slow. As far as I can see: if you know which hard links you need, you know which additional table you need to build. And knocking up the triggers to keep it updated is child's play. Ah, yes -- and I always have to add a sanity_function as well that can fix things when I've made a real balls-up ;-)

Have fun,

Adriaan
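A sketch of the intermediate-table approach Adriaan suggests (all table and column names here are hypothetical): precompute the join once into a report table, index it, and let the report read from that instead of joining at query time.

	SELECT ol.order_id, p.name AS product_name, ol.amount
	  INTO TABLE order_report
	  FROM order_line ol, product p
	 WHERE ol.product_id = p.product_id;

	CREATE INDEX order_report_order_idx ON order_report (order_id);

Triggers on product and order_line (or a nightly rebuild) would then keep order_report current.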
After thinking a bit more, I decided to reply in a more constructive way.

Thomas Lockhart wrote:

> These "table links" seem to controvert the ability for an RDBMS to mix
> and match tables in ways which are not hardcoded beforehand.

Certainly links are only of use for their intended purpose, and for nothing more. But you should be aware that real-life relationships are exactly of this kind. The drawback of the general relational model is that links (= joins) are built from scratch at the moment of the join. This may seem an advantage, but really it is often an unnecessary, redundant feature whose design allows building a swarm of relationships which never existed and will never be used.

Keeping all that in mind, we might consider building a subsystem in the SQL server which is carefully optimized for such real-life tasks. There is no need to put any restrictions on general SQL; the only thing proposed is enhancement of a particular side of the server.

> Regardless of whether "there exist some real servers that offer such
> features I am talking about", a departure from the relational model in
> a relational database is likely to lead to undesirable constraints and
> restrictions in our future development.

You have already done the heroic deed of implementing MVCC, which seems like the thing most interfered with. I can see no serious interference with any SQL feature which you might implement.

--
Leon.
Adriaan Joubert wrote:

> Yep. Also, you fix one set of hard links and the next day you need to do
> a slightly different join, and it doesn't fit into the links you
> constructed because you left out a table or something silly.

No one is talking about abolishing any standard SQL feature. After you have carefully verified the design, you can hard-code links to speed up access. Until that has happened, the usual SQL will do.

> Leon, I think you should study the accesses and build the right
> intermediate tables. Yes, I know you are not supposed to duplicate data,
> but hey, this is the real world, and disk is cheap.

But RAM is not as big as the HDD. If the database doesn't fit in RAM, performance degrades severely.

> And triggers etc.
> make it fairly manageable to retain integrity.

Making a trigger would cost the same as rearranging the table after a poor design of links is discovered.

> But what is indispensable
> is the flexibility you have in a true relational model, so that you can
> easily adapt to changing demands -- adding temporary tables as you need
> them for new reports and dropping them as they go out of use.

This will immensely bloat the database, thus flooding the disk channel and, what is worse, main memory.

--
Leon.
Tom Lane wrote:
>
> I am not sure there's anything fundamentally wrong with his basic point;
> if, say, we could find a way to construct OIDs so that a tuple could be
> found very quickly from its OID, that wouldn't violate the relational
> model AFAICS, and such OIDs would work fine as "links". But I don't see
> any way to do that without either giving up UPDATE or introducing a huge
> amount of baggage into all processes that can update tables (VACUUM
> being the worst case, likely). Without doubt the best compromise would
> look remarkably like an index on OID.

There is no problem with UPDATE: the updated tuple points to the newer version, so we can avoid updating referencing tuples here. VACUUM would have to update the referencing tuples (via normal heap_replace, nothing special) while removing old versions. This may cause deadlocks, but we could give vacuum higher priority and abort the others.

So, vacuum is the worst case, as pointed out by Tom. No problems with MVCC and other things.

Vadim
Hello Vadim,

Tuesday, July 06, 1999 you wrote:

V> There is no problem with UPDATE: the updated tuple points to the newer
V> version, so we can avoid updating referencing tuples here.
V> VACUUM would have to update the referencing tuples (via normal
V> heap_replace, nothing special) while removing old versions.
V> This may cause deadlocks, but we could give vacuum higher priority
V> and abort the others.

V> So, vacuum is the worst case, as pointed out by Tom.
V> No problems with MVCC and other things.

So the main drawback is a higher priority for VACUUM. Not too large, eh?

When you decide whether to implement it or not, I urge you to think again about the relief for the optimizer, which I have stressed many times. No one has yet rebutted the point that adding enough brains to the optimizer so that it always uses the appropriate join method would require a major rewrite. With links you get the best join method as a side effect - virtually for free. These joins will never be too slow for an unknown reason. Think carefully. I hope you will make a wise decision.

Best regards, Leon
Leon wrote:
>
> So the main drawback is a higher priority for VACUUM. Not
> too large, eh?
>
> When you decide whether to implement it or not,

We will not decide -:)) If someone wants to implement it - welcome.

> I urge you to think again about the relief for the optimizer,
> which I have stressed many times. No one has yet rebutted the point
> that adding enough brains to the optimizer so that it always uses the
> appropriate join method would require a major rewrite. With links you
> get the best join method as a side effect - virtually for free. These
> joins will never be too slow for an unknown reason. Think carefully.
> I hope you will make a wise decision.

The optimizer requires a major rewrite in any case, even with links implemented.

Vadim
Hello Vadim,

Tuesday, July 06, 1999 you wrote:

>> These joins
>> will never be too slow for an unknown reason. Think carefully.
>> I hope you will make a wise decision.

V> The optimizer requires a major rewrite in any case, even
V> with links implemented.

I am afraid that the optimizer, even totally rewritten, can't always choose the best method. That is simply because it is such a complex animal :) Bacteria - simple links - will always win in the field where they live :)

Best regards, Leon
Leon wrote:
> No one is talking about abolishing any standard SQL feature. After
> you have carefully verified the design, you can hard-code links to
> speed up access. Until that has happened, the usual SQL will do.

Leon,

Interesting idea. You are re-introducing some of the hierarchical database ideas, is this right? Here is what I'm receiving; could you correct me if I'm mis-understanding? (much of this you did not say...)

- - - - - -

In current SQL implementations, when a tuple is updated, referencing tables are not usually identified or checked, let alone updated. Then, when a query is requested, the database figures out how the referenced data can be retrieved, and does this at the time of the query.

In this proposal, in addition to carrying a primary key for a referenced table, tuples in the referencing table will also have a place to record the physical address of each referenced tuple. In this way, referenced data is easily retrieved during a query, since the physical address of the referenced information is stored in the referent.

For example, let's take the following schema:

	ORDER      (ORDER_ID, ... );
	PRODUCT    (PRODUCT_ID, NAME, ... );
	ORDER_LINE (ORDER_ID, PRODUCT_ID, ... );

In the current case, changes to the PRODUCT table, let's say a changed name, do not result in an update of the ORDER_LINE tuples which reference the product tuple being changed.

In this proposal, a few hidden fields (ID/REF) would be added:

	ORDER      (LEON_ID, ORDER_ID, ... );
	PRODUCT    (LEON_ID, PRODUCT_ID, NAME, ... );
	ORDER_LINE (LEON_ID, ORDER_ID, PRODUCT_ID, ... , PRODUCT_LEON_REF);

where the ORDER_LINE table would have a reference to the physical LEON_ID of the tuple being referenced by PRODUCT_ID. Then, an update of the PRODUCT table would result in a cascading update of all referencing tables, including ORDER_LINE, to change the PRODUCT_LEON_REF from its previous value to the updated value. The LEON_ID and LEON_REF are internal implementation fields and are not available through SQL.

SUMMARY: With this method, query speed is drastically improved, since the "join" logic is performed once during insert instead of many times during select. This method should work well when the referenced table changes relatively infrequently. Thus people, products, and other relatively static "reference" information are key candidates for this 'indexing' technique. This technique should not be used if the frequency of updates exceeds the frequency of select statements.

- - - - - - -

Overall, I think it is a good idea. I frequently do weaker versions of this all the time, which I call "field caching", where the NAME field of infrequently changing tuples is frequently accessed. In this case, one may put PRODUCT_NAME in the ORDER_LINE table and put a trigger on PRODUCT to cascade updates of NAME to the ORDER_LINE.PRODUCT_NAME column. I tend to make monsters like this a nightly process, since product name changes need not be immediate (they are rare, and thus not frequent, and thus not usually immediate). This allows the cascade update to run at night when things are a lot less stressful on the database.

Is this in line with what you are saying?

I suggest borrowing an XML term for the idea, GROVE. In XML, a GROVE is a tree built from XML/SGML/notations. In this case, you can think of frequently joined information as cutting deep into the landscape: the more often the query is run, the greater the chance that the UPDATE/SELECT ratio will be small, and thus the greater the chance that the hard-wired physical address is cached in the table.
The reason I like the name is that it defines a common pathway that is easy, without preventing the efficiency of uncommon paths (where updates >> selects).

Hmm. I'm just worrying about the CASCADE nature of the beast. On the extreme end that I was writing about earlier, a prototype OO dbms that I was looking at about 6 years ago (god knows what its name was) did *everything* this way. And GOD it was slow... especially since it cascaded even when the frequency of updates far exceeded the frequency of selects.

Thoughts?

Best,

Clark
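A minimal sketch of the "field caching" Clark describes, using the schema names from his example plus a hypothetical product_name column; this is the plain-SQL nightly variant rather than a trigger:

	ALTER TABLE order_line ADD COLUMN product_name text;

	-- nightly cascade: copy the current product names into the
	-- referencing table so reports need not join against product
	UPDATE order_line
	   SET product_name = product.name
	  FROM product
	 WHERE order_line.product_id = product.product_id;

A trigger on PRODUCT can do the same thing immediately instead of nightly, at the cost of making product updates slower.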
Clark Evans wrote:
>
> In this proposal, a few hidden fields (ID/REF) would be added:
                           ^^^^^^
Not hidden, but with a _link_ type.

Vadim
Hello Clark,

Tuesday, July 06, 1999 you wrote:

C> Interesting idea. You are re-introducing some of the
C> hierarchical database ideas, is this right? Here is
C> what I'm receiving; could you correct me if I'm
C> mis-understanding? (much of this you did not say...)

Strictly speaking, this is neither a hierarchical nor a network database. It is not hierarchical because cyclic graphs are allowed (when tables reference one another, maybe through some intermediate table). And it is not a network database because it lacks the weird restriction placed on network databases (the textbook says that in a network database a referenced tuple may take part in at most one link of a certain link type).

C> In this proposal, in addition to carrying a primary key
C> for a referenced table, tuples in the referencing table
C> will also have a place to record the physical address
C> of each referenced tuple.

I have read the description carefully. I am afraid that MVCC will break your scheme, because a referencing tuple must have a way to reach all versions of the updated foreign tuple. If you update the referencing field, all other versions of the foreign tuple are lost. It seems the only way to satisfy MVCC is to chain updated foreign tuples, cleaning the chains up at VACUUM time. And there is no need for indices, since the only time they would be needed is during VACUUM.

Best regards, Leon
> When you decide whether to implement it or not,
> I urge you to think again about the relief for the optimizer,
> which I have stressed many times. No one has yet rebutted the point
> that adding enough brains to the optimizer so that it always uses the
> appropriate join method would require a major rewrite. With links you
> get the best join method as a side effect - virtually for free. These
> joins will never be too slow for an unknown reason. Think carefully.
> I hope you will make a wise decision.

I believe Ingres does allow this, as it has tids too. If you are creating a temp table, you could use tids during your processing. In fact, it seems tids would be valid until a vacuum is performed.

--
  Bruce Momjian                 |  http://www.op.net/~candle
  maillist@candle.pha.pa.us     |  (610) 853-3000
  + If your life is a hard drive, |  830 Blythe Avenue
  + Christ can be your backup.    |  Drexel Hill, Pennsylvania 19026
Leon wrote:
> C> In this proposal, in addition to carrying a primary key
> C> for a referenced table, tuples in the referencing table
> C> will also have a place to record the physical address
> C> of each referenced tuple.
>
> I have read the description carefully. I am afraid that MVCC
> will break your scheme, because a referencing tuple must have
> a way to reach all versions of the updated foreign tuple.
> If you update the referencing field, all other versions of
> the foreign tuple are lost. It seems the only way to satisfy
> MVCC is to chain updated foreign tuples, cleaning the chains up
> at VACUUM time. And there is no need for indices, since the only
> time they would be needed is during VACUUM.

(look of puzzlement) Where did I go wrong with what you are proposing? I'm not trying to invent my own scheme... I'm trying to understand yours. ;)

Clark
Hello Clark,

Tuesday, July 06, 1999 you wrote:

C> (look of puzzlement) Where did I go wrong with what
C> you are proposing? I'm not trying to invent my
C> own scheme... I'm trying to understand yours.

Ok. If the American people want to know the True Path, they can be enlightened :))) (it's a joke)

So here is exactly what is proposed.

Introduction of what looks like a new data type in the table structure:

	CREATE TABLE atable (a int4)
	CREATE TABLE btable (b int4, c link (atable))

- "link" looks like a new data type. An example query with a link:

	SELECT * FROM atable WHERE btable.b < 5 AND btable.c = atable.tid

(or maybe it should be ctid here - you know better).

Type checking:

	CREATE TABLE ctable (d int4)
	SELECT * FROM ctable WHERE btable.b < 5 AND btable.c = ctable.tid

- this should produce an error, because the link does not point to ctable.

No additional constraint is placed. Tables can reference one another in any combination; maybe a table should even be able to reference itself.

How is all that implemented? As we have seen, a link is matched against a tid in queries. That means a link can internally be of the same data type as a tid.

MVCC stuff: as Vadim pointed out, updated tuples are already chained, so this feature can naturally be utilized. The referencing tuple always points to the oldest version of the updated foreign tuple. If a transaction needs a version of the foreign tuple other than the oldest, it follows the chain. Vacuuming removes these chains, thus packing the table and rewriting references to the vacuumed table in other tables. Vacuuming therefore needs high priority, maybe a lock on the table being vacuumed and on all referencing tables. Since the referencing fields are rewritten only during vacuum, there is no need for indices on any field.

Best regards, Leon
On Mon, 5 Jul 1999, Tom Lane wrote:

> I am not sure there's anything fundamentally wrong with his basic point;
> if, say, we could find a way to construct OIDs so that a tuple could be
> found very quickly from its OID, that wouldn't violate the relational
> model AFAICS, and such OIDs would work fine as "links". But I don't see
> any way to do that without either giving up UPDATE or introducing a huge
> amount of baggage into all processes that can update tables (VACUUM
> being the worst case, likely). Without doubt the best compromise would
> look remarkably like an index on OID.

So is there anything wrong with that?

> Ultimately, when you consider both the update costs and the access
> costs, I doubt that this sort of thing could be a win, except maybe
> in the case where the referenced table is hardly ever changed so that
> the update costs are seldom incurred. But in that situation it's not
> clear you want to store the referenced table in an RDBMS anyway ---
> there are lots of lower-overhead ways to deal with fixed tables, such
> as perfect-hash generators.

While reading this thread I noticed that a lot of people are concerned about their update speeds. I am primarily concerned about query speeds. Consider how often you update data vs. how often you query it. That's the whole point of a database: to optimize information retrieval. Now I am not sure how big those update performance penalties would be, but I am not really concerned.

Meanwhile, I agree that hard-linking via record IDs sounds suspiciously like a page from the OODB textbook, where it is praised for exactly the same reasons the person who started this discussion cited: no joins. But in order for that to work (if it works), the database software would have to be written from scratch for it to be even marginally efficient.

The question I ask myself, though, is: are there any concrete plans for referential integrity via foreign key clauses? 6.6, 7.0, never? To me, that's really much more important than query speed or MVCC.

--
Peter Eisentraut			PathWay Computing, Inc.
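For reference, the declarative foreign-key syntax Peter is asking about looks like this (standard SQL; PostgreSQL did not yet enforce it at the time of this thread, and the table names are illustrative):

	CREATE TABLE atable (
	    a int4 PRIMARY KEY
	);

	CREATE TABLE btable (
	    b int4,
	    c int4 REFERENCES atable (a) ON DELETE CASCADE
	);

Note this gives integrity checking, not faster joins: lookups through c still go through an ordinary index on atable.a.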
Leon wrote:
>
> Hello Vadim,
>
> Tuesday, July 06, 1999 you wrote:
>
> >> These joins
> >> will never be too slow for an unknown reason. Think carefully.
> >> I hope you will make a wise decision.
>
> V> The optimizer requires a major rewrite in any case, even
> V> with links implemented.
>
> I am afraid that the optimizer, even totally rewritten, can't always
> choose the best method. That is simply because it is such a
> complex animal :) Bacteria - simple links - will always win
> in the field where they live :)

From what I have read in earlier posts about the optimizer, there can be situations where using links would actually be slower than going through the optimizer, similar to the case where scanning the whole table using an index can be orders of magnitude slower than doing a direct scan. That is of course only if they are used unwisely ;)

Another thing that has remained unclear to me is how to actually insert or update the links - you can't just put another record there, so it should be some kind of field value (tid, oid, ...) or some function like last_touched('other_table_name').

So, what have you thought of putting there?

------
Hannu
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Vadim Mikheev
> Sent: Tuesday, July 06, 1999 8:12 PM
> To: Tom Lane
> Cc: Thomas Lockhart; Leon; pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] Fwd: Joins and links
>
> Tom Lane wrote:
> >
> > I am not sure there's anything fundamentally wrong with his basic point;
> > if, say, we could find a way to construct OIDs so that a tuple could be
> > found very quickly from its OID, that wouldn't violate the relational
> > model AFAICS, and such OIDs would work fine as "links". But I don't see
> > any way to do that without either giving up UPDATE or introducing a huge
> > amount of baggage into all processes that can update tables (VACUUM
> > being the worst case, likely). Without doubt the best compromise would
> > look remarkably like an index on OID.
>
> There is no problem with UPDATE: the updated tuple points to the newer
> version, so we can avoid updating referencing tuples here.
> VACUUM would have to update the referencing tuples (via normal
> heap_replace, nothing special) while removing old versions.
> This may cause deadlocks, but we could give vacuum higher priority
> and abort the others.
>
> So, vacuum is the worst case, as pointed out by Tom.
> No problems with MVCC and other things.

What about dump/reload? And wouldn't vacuum be much more complicated than now? I think vacuum is sufficiently complicated already. Weren't these kinds of annoying things the reason RDBMSs overtook network DBMSs in spite of their lower performance?

If a "link" is necessary at any cost, how about the following story?

	"link" = OID + TID

If the oid pointed to by the TID is different from the stored OID, the executor resets the TID using an OID index (my story needs OID indices). This way we need not change vacuum/dump/reload etc. A command to update the TIDs to the latest values may be needed.

Comments?

Hiroshi Inoue
Inoue@tpf.co.jp
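A rough SQL illustration of Hiroshi's "link = OID + TID" idea (the table name and the literal values are hypothetical): the stored tid is tried first, and the stored oid is used to detect a stale tid and re-find the tuple through an OID index.

	-- fast path: the tuple is still where the link says it is,
	-- and its oid matches the one remembered in the link
	SELECT * FROM atable WHERE ctid = '(0,1)' AND oid = 100001;

	-- slow path, taken when the fast path returns nothing:
	-- re-find the tuple by oid and remember its new ctid in the link
	SELECT ctid FROM atable WHERE oid = 100001;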
Hannu Krosing wrote:

> Another thing that has remained unclear to me is how to actually
> insert or update the links - you can't just put another record there,
> so it should be some kind of field value (tid, oid, ...) or some
> function like last_touched('other_table_name').
>
> So, what have you thought of putting there?

Earlier I proposed that links should be of a type similar to tid, so inserts should be fed with tid values. But this requires an intermediate step, so there could be a function which takes a primary key and returns a tid, or, as you say, a function last_touched('other_table_name') - that seems the best choice.

--
Leon.
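Since no "link" type actually exists, a rough approximation of what Leon describes can be written with the existing tid type: look the target row up by its primary key once and store its current ctid (the names are hypothetical, and the stored value goes stale after any update or vacuum of atable, which is the whole problem discussed above).

	CREATE TABLE atable (a int4 PRIMARY KEY);
	CREATE TABLE btable (b int4, c tid);

	INSERT INTO atable VALUES (42);

	-- "follow the primary key once, keep the physical address"
	INSERT INTO btable
	SELECT 1, ctid FROM atable WHERE a = 42;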
I have been corresponding with Bob Devine for a few years. He was at Berkeley during the Postgres days and knows quite a bit about optimizers and storage systems. I will put his name at the bottom of the optimizer README, and if people have questions, he is willing to answer them as best he can.

---------------------------------------------------------------------------

Bob Devine wrote:
> Bruce Momjian wrote:
> >
> > Bob, any status on when you may want to get involved with PostgreSQL?
>
> Umm, hem, haw...
>
> I'm maxed out with current stuff now. It would be very hard
> for me to realistically promise any substantial commitment
> to doing any major work.
>
> However, I'm always available for discussion on designs.
> I've built two optimizers so I can help there. Otherwise
> I can help with general query execution or storage subsystems.
>
> Bob Devine

--
  Bruce Momjian                 |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us        |  (610) 359-1001
  + If your life is a hard drive, |  13 Roberts Road
  + Christ can be your backup.    |  Newtown Square, Pennsylvania 19073