Thread: Fwd: Joins and links
Hello hackers! I posted this message to the general mailing list and was told that the hackers' list is a more appropriate place for it. What do you say?

This is a forwarded message
From: Leon <leon@udmnet.ru>
To: pgsql-general <pgsql-general@postgreSQL.org>
Date: Monday, July 05, 1999, 5:46:36 PM
Subject: Joins and links

===8<==============Original message text===============

Hello all!

You probably remember me - recently I complained about the speed of joins in Postgres. After a short investigation a way was found in which Postgres's optimizer can do the right job. It was a constructive discussion. Now I want to tell you what could make Postgres better and faster, and what would make us (our development group) happy. Maybe I am bothering someone; if I am, tell me so. Let me begin.

First of all, some accounting reports need to be delivered very fast - within minutes or so. What's bad is that quite a few of these reports are time-consuming and search intensive. In particular, the internals of these reports involve a lot of joins on tables.

Secondly, almost all accounting information naturally fits the network data model, which can be implemented very efficiently. The issue described here is not accounting-specific; it appears in every database that uses master-detail tables and other such relations.

So how is a join performed in such cases? Although I am not an expert, I can imagine the way: first there is an (index) scan on the first table, and then an (index) scan on the second. That is the best case; reality can be much worse, as we have seen.

How can we radically improve performance in such cases? There is a simple and quite obvious way. (So that you don't think I am hallucinating, I will mention that there exist real servers that offer the features I am talking about.) We should make a real reference from one table to another! That means there could be a special data type called, say, "link", which is a physical record number in the foreign table. Queries could look like this:

	table1:
	    a int4
	    b link (-> table2)

	table2:
	    c int4
	    recnum (system auxiliary field - really a record number in the table)

	select * from table2 where table1.a > 5 and table1.b = table2.recnum

Such joins can fly really fast, as practice shows :) Just consider: table1.b = table2.recnum is a READY-MADE join, so the server doesn't have to build anything on top of it. It can simply perform a lookup through the link, and since it is a physical record number, this is done with the efficiency of C pointers! Thus the performance gain is ENORMOUS.

It also simplifies the optimizer, because it doesn't have to decide anything about whether to use indexes and the like. The join is always performed the same way, and it is the best way.

This feature, once implemented, could put Postgres ahead of most commercial servers, proving the creative abilities of the free software community. Let us make a step into the future!

Best regards, Leon

===8<===========End of original message text===========

Best regards, Leon
mailto:leon@udmnet.ru
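For comparison, here is roughly how the same query has to be written today, with an ordinary joinable column backed by an index rather than a physical link (the column pairing and index names below are only illustrative, not taken from the message above):

	CREATE TABLE table2 (c int4 PRIMARY KEY);        -- PRIMARY KEY gives an index on c
	CREATE TABLE table1 (a int4, b int4);            -- b holds values of table2.c
	CREATE INDEX table1_a_idx ON table1 (a);

	SELECT table2.*
	  FROM table1, table2
	 WHERE table1.a > 5
	   AND table1.b = table2.c;

The planner still has to pick a join strategy here (nested loop with the index, mergejoin, or hashjoin); Leon's point is that a link column would make that choice unnecessary.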
Leon <leon@udmnet.ru> writes:
> We should make a real reference in one table to another! That
> means there could be special data type called, say, "link",
> which is a physical record number in the foreign table.

There is no such thing as a physical record number for a tuple in Postgres. The closest you could come is an OID, which isn't really any faster than any other joinable field --- you still need an index to support fast lookup by OID.

If we did have such a concept, the speed penalties for supporting hard links from one tuple to another would be enormous. Every time you change a tuple, you'd have to try to figure out what other tuples reference it, and update them all.

Finally, I'm not convinced that the results would be materially faster than a standard mergejoin (assuming that you have indexes on both the fields being joined) or hashjoin (in the case that one table is small enough to be loaded into memory).

			regards, tom lane
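To make the OID point concrete, a rough sketch of the closest existing equivalent of a "link" (hypothetical table names; this assumes the referencing column stores the other table's OIDs, which rows received automatically in 1999-era Postgres):

	CREATE TABLE table2 (c int4);
	CREATE TABLE table1 (a int4, b oid);          -- b stores OIDs of table2 rows
	CREATE INDEX table2_oid_idx ON table2 (oid);  -- without this, lookup by OID is a sequential scan

	SELECT table2.*
	  FROM table1, table2
	 WHERE table1.a > 5
	   AND table1.b = table2.oid;

The OID is a stable logical identifier rather than a physical address, which is exactly why the index is still required.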
Hello Tom,

Monday, July 05, 1999 you wrote:

T> If we did have such a concept, the speed penalties for supporting
T> hard links from one tuple to another would be enormous. Every time
T> you change a tuple, you'd have to try to figure out what other tuples
T> reference it, and update them all.

I'm afraid that's mainly because fields in Postgres have variable length, and after an update the tuple goes to the end of the table. Am I right? In that case such referencing could be allowed only for tables with fixed-width rows, whose updates can naturally be done without moving. It is a small sacrifice, but it is worth it.

T> Finally, I'm not convinced that the results would be materially faster
T> than a standard mergejoin (assuming that you have indexes on both the
T> fields being joined) or hashjoin (in the case that one table is small
T> enough to be loaded into memory).

Consider this: no indices, no optimizer thinking, no index lookups - no nothing! Just the sequential number of the record multiplied by the record size. Exactly three CPU instructions: read, multiply, lookup. Can you see the gain now?

Best regards, Leon
Leon <leon@udmnet.ru> writes:
> I'm afraid that's mainly because fields in Postgres have variable
> length, and after an update the tuple goes to the end of the table.
> Am I right? In that case such referencing could be allowed only for
> tables with fixed-width rows, whose updates can naturally be done
> without moving. It is a small sacrifice, but it is worth it.

No, you are not right. Tuple updates can *never* be done without moving the tuple, because the old tuple value must not be overwritten until and unless the transaction is committed. (Under MVCC, it may need to stick around even longer than that, I believe.) Thus, a tuple update would require an update (and move) of every referencing tuple, which could cascade into updates of tuples that reference those tuples, etc.

But the really serious problem is simply that of finding the tuples that reference the tuple you are about to update. This is relatively straightforward for indexes --- you just compute the index value for the old tuple and probe into the index with it. When the referencing tuples might be anywhere in the database, there are no easy shortcuts. I think the only way would be maintaining an index listing all the referencing tuples for every referenced tuple. This would be a bit of a bear to maintain, as well as a source of unwanted blocks/deadlocks, since it would be a bottleneck for updates to *every* table in the database.

T> Finally, I'm not convinced that the results would be materially faster
T> than a standard mergejoin (assuming that you have indexes on both the
T> fields being joined) or hashjoin (in the case that one table is small
T> enough to be loaded into memory).

> Consider this: no indices,

You'd still need indices --- see above.

> no optimizer thinking,

You'd still need to run the optimizer to decide whether you wanted to use this technique or some more-conventional one (unless your proposal is to remove all other join mechanisms? Rather inflexible, that...)

> no nothing! Just the sequential number of the record multiplied by
> the record size. Exactly three CPU instructions: read, multiply,
> lookup. Can you see the gain now?

If you think it takes three instructions to access a tuple that's out on disk somewhere, I'm afraid you're sadly mistaken.

			regards, tom lane
> Leon <leon@udmnet.ru> writes:
> > We should make a real reference in one table to another! That
> > means there could be special data type called, say, "link",
> > which is a physical record number in the foreign table.
>
> There is no such thing as a physical record number for a tuple in
> Postgres. The closest you could come is an OID, which isn't really any
> faster than any other joinable field --- you still need an index to
> support fast lookup by OID.

Actually, there is:

	select ctid from pg_class;
	 ctid
	------
	(0,1)
	(0,2)
	...

The number is the block number and the offset within the block. It doesn't help because UPDATED rows would get a new tid. Tids can be used for short periods if you are sure the data in the table doesn't change, and there is a TODO item to allow ctid references in the WHERE clause.

--
  Bruce Momjian                 |  http://www.op.net/~candle
  maillist@candle.pha.pa.us     |  (610) 853-3000
  + If your life is a hard drive, |  830 Blythe Avenue
  + Christ can be your backup.    |  Drexel Hill, Pennsylvania 19026
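A short sketch of using a tid directly, as Bruce describes. The first query worked at the time; the second, looking a row up by its ctid in the WHERE clause, was still only a TODO item then and works only in later releases:

	SELECT ctid, relname FROM pg_class;

	-- direct lookup by tuple location, valid only as long as the row
	-- has not been updated and the table has not been vacuumed
	SELECT relname FROM pg_class WHERE ctid = '(0,1)';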
Hello Tom,

Tuesday, July 06, 1999 you wrote:

T> No, you are not right. Tuple updates can *never* be done without
T> moving the tuple, because the old tuple value must not be overwritten
T> until and unless the transaction is committed. (Under MVCC, it may
T> need to stick around even longer than that, I believe.) Thus, a tuple
T> update would require an update (and move) of every referencing tuple,
T> which could cascade into updates of tuples that reference those tuples,
T> etc.

This problem can be solved. An offhand solution is to have an additional system field which points to the new tuple left after an update. It is filled in at the same time as the original tuple is marked invalid. So the scenario is as follows: we follow the link, and if we find that the tuple we arrived at has this system field non-NULL, we go to where it points (in the same table, of course). VACUUM will eliminate these chains. The performance penalty is small.

T>> Finally, I'm not convinced that the results would be materially faster
T>> than a standard mergejoin (assuming that you have indexes on both the
T>> fields being joined) or hashjoin (in the case that one table is small
T>> enough to be loaded into memory).

>> Consider this: no indices,

T> You'd still need indices --- see above.

The only time we need to look up who is referencing us is during VACUUM. So there is no real need for indices.

>> no optimizer thinking,

T> You'd still need to run the optimizer to decide whether you wanted to
T> use this technique or some more-conventional one (unless your proposal
T> is to remove all other join mechanisms? Rather inflexible, that...)

No. I am not an evil force trying to eliminate everything :) I said that when the optimizer sees a join made through such a field, it has only one option - to follow the link. It simply has no choice.

T> If you think it takes three instructions to access a tuple that's out
T> on disk somewhere, I'm afraid you're sadly mistaken.

No. I meant a tuple which is in memory somewhere :)

Best regards, Leon
Hello Bruce,

Tuesday, July 06, 1999 you wrote:

B> Actually, there is:
B>	select ctid from pg_class;
B>	 ctid
B>	------
B>	(0,1)
B>	(0,2)
B>	...
B> The number is the block number and the offset within the block. It
B> doesn't help because UPDATED rows would get a new tid. Tids can be
B> used for short periods if you are sure the data in the table doesn't
B> change, and there is a TODO item to allow ctid references in the
B> WHERE clause.

It seems that you are moving in the right direction. But these tids seem to be of short-term use anyway. Why not upgrade to full-featured links at once?

Best regards, Leon
Leon <leon@udmnet.ru> writes:
> This problem can be solved. An offhand solution is to have an
> additional system field which points to the new tuple left after an
> update. It is filled in at the same time as the original tuple is
> marked invalid. So the scenario is as follows: we follow the link,
> and if we find that the tuple we arrived at has this system field
> non-NULL, we go to where it points (in the same table, of course).
> VACUUM will eliminate these chains. The performance penalty is small.

Is it small? After multiple updates to the referenced tuple, you'd be talking about following a chain of TID references in order to find the referenced tuple from the referencing tuple. I'd expect this to take more time than an index access within a fairly small number of updates (maybe four or so, just on the basis of counting disk-block fetches).

VACUUM is an interesting problem as well: to clean up the chains as you suggest, VACUUM could no longer be a one-table-at-a-time proposition. It would have to be able to update tuples elsewhere while repacking the tuples in the current table. This probably means that VACUUM requires a global lock across the whole database. Also, making those updates in an already-vacuumed table without undoing its nicely vacuumed state might be tricky.

			regards, tom lane
Hello Tom,

Tuesday, July 06, 1999 you wrote:

T> Is it small?

It is :) First you should tell me what the cost of a tid lookup is. If it is significantly more expensive than a C pointer, then we should consider implementing such a cheap pointer. If a tid is already cheap, then even 10 consecutive lookups will cost almost nothing. And besides, you should consider statistics. Can you name five or even three applications where large databases are massively updated without being vacuumed often?

T> After multiple updates to the referenced tuple, you'd be
T> talking about following a chain of TID references in order to find the
T> referenced tuple from the referencing tuple. I'd expect this to take
T> more time than an index access within a fairly small number of updates
T> (maybe four or so, just on the basis of counting disk-block fetches).

T> VACUUM is an interesting problem as well: to clean up the chains as you
T> suggest, VACUUM could no longer be a one-table-at-a-time proposition.
T> It would have to be able to update tuples elsewhere while repacking the
T> tuples in the current table. This probably means that VACUUM requires
T> a global lock across the whole database.

Does VACUUM require a lock on the vacuumed table now? I am sure it does. And in our case we must lock the vacuumed table and all the tables that reference it, not all tables. Besides, the manual suggests that VACUUM should be done nightly, not daily :) Having acquired such a lock, vacuum should update the "main" table first, then update all links in the referencing tables. This can be done using oids, which are matched between the new and old versions of the "main" table (are oids preserved during vacuum? - if they are not, it can be done with the primary key).

T> Also, making those updates
T> in an already-vacuumed table without undoing its nicely vacuumed state
T> might be tricky.

I didn't get the idea of the last sentence. Anyway, I am going to sleep. See you tomorrow :)

Best regards, Leon
> Leon <leon@udmnet.ru> writes:
> > I'm afraid that's mainly because fields in Postgres have variable
> > length, and after an update the tuple goes to the end of the table.
> > Am I right? In that case such referencing could be allowed only for
> > tables with fixed-width rows, whose updates can naturally be done
> > without moving. It is a small sacrifice, but it is worth it.
>
> No, you are not right. Tuple updates can *never* be done without
> moving the tuple, because the old tuple value must not be overwritten
> until and unless the transaction is committed. (Under MVCC, it may
> need to stick around even longer than that, I believe.) Thus, a tuple
> update would require an update (and move) of every referencing tuple,
> which could cascade into updates of tuples that reference those tuples,
> etc.

Yes, thanks, Tom. This is exactly the issue.

--
  Bruce Momjian                 |  http://www.op.net/~candle
  maillist@candle.pha.pa.us     |  (610) 853-3000
  + If your life is a hard drive, |  830 Blythe Avenue
  + Christ can be your backup.    |  Drexel Hill, Pennsylvania 19026
Congrats Tom for rising to the challenge ;)

istm that a brute force hack such as is being suggested should be left as an exercise for the reader. These "table links" seem to controvert the ability for an RDBMS to mix and match tables in ways which are not hardcoded beforehand.

Regardless of whether "there exist some real servers that offer such features I am talking about", a departure from the relational model in a relational database is likely to lead to undesirable constraints and restrictions in our future development.

If they are a good idea, you might be able to implement and prove them using an embedded language and the SPI facilities.

Just the usual $.02...

                      - Thomas

--
Thomas Lockhart				lockhart@alumni.caltech.edu
South Pasadena, California
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> Regardless of whether "there exist some real servers that offer such
> features I am talking about", a departure from the relational model in
> a relational database is likely to lead to undesirable constraints and
> restrictions in our future development.

That was another thing that was bothering me about the idea of "version links" between tuples (as in Leon's second proposal). They don't fit into the fundamental relational model.

I am not sure there's anything fundamentally wrong with his basic point; if, say, we could find a way to construct OIDs so that a tuple could be found very quickly from its OID, that wouldn't violate the relational model AFAICS, and such OIDs would work fine as "links". But I don't see any way to do that without either giving up UPDATE or introducing a huge amount of baggage into all processes that can update tables (VACUUM being the worst case, likely). Without doubt the best compromise would look remarkably like an index on OID.

Ultimately, when you consider both the update costs and the access costs, I doubt that this sort of thing could be a win, except maybe in the case where the referenced table is hardly ever changed so that the update costs are seldom incurred. But in that situation it's not clear you want to store the referenced table in an RDBMS anyway --- there are lots of lower-overhead ways to deal with fixed tables, such as perfect-hash generators.

> If they are a good idea, you might be able to implement and prove them
> using an embedded language and the SPI facilities.

I don't think VACUUM invokes triggers, so you couldn't really do anything about VACUUM rearranging the table under you that way, could you?

I'll be interested to see Vadim's comments on this thread...

			regards, tom lane
Thomas Lockhart wrote:
>
> Congrats Tom for rising to the challenge ;)
>
> istm that a brute force hack such as is being suggested should be left
> as an exercise for the reader.
>
> These "table links" seem to controvert the ability for an RDBMS to mix
> and match tables in ways which are not hardcoded beforehand.

Excuse me, in what way? All the usual features of SQL remain in their rightful place. What you get is one more additional ability, not a constraint.

> Regardless of whether "there exist some real servers that offer such
> features I am talking about", a departure from the relational model in
> a relational database is likely to lead to undesirable constraints and
> restrictions in our future development.

Until you offer an example of such a constraint, these words are groundless fear. Think of it not merely as an index-lookup speedup, but as a quick and clever way to fix the optimizer. As we have seen, the optimizer currently has trouble choosing the fastest way of executing a query. This is a deep-rooted problem, because the optimizer generally needs fine-grained statistics on the tables it is working on, and gathering such statistics would require, I am afraid, a total rewrite of the optimizer. In the case of links, the query is always done the best way.

--
Leon.
Tom Lane wrote:
>
> Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> > Regardless of whether "there exist some real servers that offer such
> > features I am talking about", a departure from the relational model in
> > a relational database is likely to lead to undesirable constraints and
> > restrictions in our future development.

Yep. Also, you fix one set of hard links and the next day you need to do a slightly different join, and it doesn't fit into the links you constructed because you left out a table or something silly.

I used to work on a system storing and retrieving real-time trading data on Tandem. We were integrating it with their first-generation CORBA system, but could handle 70 updates a second plus heavy reads of historical/real-time tick data. And this was on an old 4-processor K10000. Nearly all data was stored at least twice -- it made the inserts slightly slower (Tandem is bloody good at inserts!), but otherwise we couldn't have coped with reads of several MBytes of historical data per query.

Leon, I think you should study the accesses and build the right intermediate tables. Yes, I know you are not supposed to duplicate data, but hey, this is the real world, and disk is cheap. And triggers etc. make it fairly manageable to retain integrity. But what is indispensable is the flexibility you have in a true relational model, so that you can easily adapt to changing demands -- adding temporary tables as you need them for new reports and dropping them as they go out of use. Of course you can use views, but this can still be slow. As far as I can see: if you know which hard links you need, you know which additional table you need to build. And knocking up the triggers to keep it updated is child's play. Ah, yes -- and I always have to add a sanity_function as well that can fix things when I've made a real balls-up ;-)

Have fun,

Adriaan
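A sketch of the intermediate-table approach Adriaan suggests (all table and column names here are hypothetical): precompute the join once into a report table, index it, and let the report read from that instead of joining at query time.

	SELECT ol.order_id, p.name AS product_name, ol.amount
	  INTO TABLE order_report
	  FROM order_line ol, product p
	 WHERE ol.product_id = p.product_id;

	CREATE INDEX order_report_order_idx ON order_report (order_id);

Triggers on product and order_line (or a nightly rebuild) would then keep order_report current.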
After thinking a bit more, I decided to reply in a more constructive way.

Thomas Lockhart wrote:

> These "table links" seem to controvert the ability for an RDBMS to mix
> and match tables in ways which are not hardcoded beforehand.

Certainly links are only of use for their intended purpose, and for nothing more. But you should be aware that real-life relationships are exactly of this kind. The drawback of the general relational model is that links (= joins) are built from scratch at the moment of the join. This may seem an advantage, but really it is often an unnecessary, redundant feature whose design allows building a swarm of relationships which never existed and will never be used.

Keeping all that in mind, we might consider building a subsystem in the SQL server which is carefully optimized for such real-life tasks. There is no need to put any restrictions on general SQL; the only thing proposed is enhancement of a particular side of the server.

> Regardless of whether "there exist some real servers that offer such
> features I am talking about", a departure from the relational model in
> a relational database is likely to lead to undesirable constraints and
> restrictions in our future development.

You have already done the heroic deed of implementing MVCC, which seems like the thing most interfered with. I can see no serious interference with any SQL feature which you might implement.

--
Leon.
Adriaan Joubert wrote:

> Yep. Also, you fix one set of hard links and the next day you need to do
> a slightly different join, and it doesn't fit into the links you
> constructed because you left out a table or something silly.

No one is talking about abolishing any standard SQL feature. After you have carefully verified the design, you can hard-code links to speed up access. Until that has happened, the usual SQL will do.

> Leon, I think you should study the accesses and build the right
> intermediate tables. Yes, I know you are not supposed to duplicate data,
> but hey, this is the real world, and disk is cheap.

But RAM is not as big as the HDD. If the database doesn't fit in RAM, performance degrades severely.

> And triggers etc.
> make it fairly manageable to retain integrity.

Making a trigger would cost the same as rearranging the table after a poor design of links is discovered.

> But what is indispensable
> is the flexibility you have in a true relational model, so that you can
> easily adapt to changing demands -- adding temporary tables as you need
> them for new reports and dropping them as they go out of use.

This will immensely bloat the database, thus flooding the disk channel and, what is worse, main memory.

--
Leon.
Tom Lane wrote:
>
> I am not sure there's anything fundamentally wrong with his basic point;
> if, say, we could find a way to construct OIDs so that a tuple could be
> found very quickly from its OID, that wouldn't violate the relational
> model AFAICS, and such OIDs would work fine as "links". But I don't see
> any way to do that without either giving up UPDATE or introducing a huge
> amount of baggage into all processes that can update tables (VACUUM
> being the worst case, likely). Without doubt the best compromise would
> look remarkably like an index on OID.

There is no problem with UPDATE: the updated tuple points to the newer version, so we can avoid updating referencing tuples here. VACUUM would have to update the referencing tuples (via normal heap_replace, nothing special) while removing old versions. This may cause deadlocks, but we could give vacuum higher priority and abort the others.

So, vacuum is the worst case, as pointed out by Tom. No problems with MVCC and other things.

Vadim
Hello Vadim,

Tuesday, July 06, 1999 you wrote:

V> There is no problem with UPDATE: the updated tuple points to the newer
V> version, so we can avoid updating referencing tuples here.
V> VACUUM would have to update the referencing tuples (via normal
V> heap_replace, nothing special) while removing old versions.
V> This may cause deadlocks, but we could give vacuum higher priority
V> and abort the others.

V> So, vacuum is the worst case, as pointed out by Tom.
V> No problems with MVCC and other things.

So the main drawback is a higher priority for VACUUM. Not too large, eh?

When you decide whether to implement it or not, I urge you to think again about the relief for the optimizer, which I have stressed many times. No one has yet rebutted the point that adding enough brains to the optimizer so that it always uses the appropriate join method would require a major rewrite. With links you get the best join method as a side effect - virtually for free. These joins will never be too slow for an unknown reason. Think carefully. I hope you will make a wise decision.

Best regards, Leon
Leon wrote:
>
> So the main drawback is a higher priority for VACUUM. Not
> too large, eh?
>
> When you decide whether to implement it or not,

We will not decide -:)) If someone wants to implement it - welcome.

> I urge you to think again about the relief for the optimizer,
> which I have stressed many times. No one has yet rebutted the point
> that adding enough brains to the optimizer so that it always uses the
> appropriate join method would require a major rewrite. With links you
> get the best join method as a side effect - virtually for free. These
> joins will never be too slow for an unknown reason. Think carefully.
> I hope you will make a wise decision.

The optimizer requires a major rewrite in any case, even with links implemented.

Vadim
Hello Vadim,

Tuesday, July 06, 1999 you wrote:

>> These joins
>> will never be too slow for an unknown reason. Think carefully.
>> I hope you will make a wise decision.

V> The optimizer requires a major rewrite in any case, even
V> with links implemented.

I am afraid that the optimizer, even totally rewritten, can't always choose the best method. That is simply because it is such a complex animal :) Bacteria - simple links - will always win in the field where they live :)

Best regards, Leon
Leon wrote:
> No one is talking about abolishing any standard SQL feature. After
> you have carefully verified the design, you can hard-code links to
> speed up access. Until that has happened, the usual SQL will do.

Leon,

Interesting idea. You are re-introducing some of the hierarchical database ideas, is this right? Here is what I'm receiving; could you correct me if I'm mis-understanding? (much of this you did not say...)

- - - - - -

In current SQL implementations, when a tuple is updated, referencing tables are not usually identified or checked, let alone updated. Then, when a query is requested, the database figures out how the referenced data can be retrieved, and does this at the time of the query.

In this proposal, in addition to carrying a primary key for a referenced table, tuples in the referencing table will also have a place to record the physical address of each referenced tuple. In this way, referenced data is easily retrieved during a query, since the physical address of the referenced information is stored in the referent.

For example, let's take the following schema:

	ORDER      (ORDER_ID, ... );
	PRODUCT    (PRODUCT_ID, NAME, ... );
	ORDER_LINE (ORDER_ID, PRODUCT_ID, ... );

In the current case, changes to the PRODUCT table, let's say a changed name, do not result in an update of the ORDER_LINE tuples which reference the product tuple being changed.

In this proposal, a few hidden fields (ID/REF) would be added:

	ORDER      (LEON_ID, ORDER_ID, ... );
	PRODUCT    (LEON_ID, PRODUCT_ID, NAME, ... );
	ORDER_LINE (LEON_ID, ORDER_ID, PRODUCT_ID, ... , PRODUCT_LEON_REF);

where the ORDER_LINE table would have a reference to the physical LEON_ID of the tuple being referenced by PRODUCT_ID. Then, an update of the PRODUCT table would result in a cascading update of all referencing tables, including ORDER_LINE, to change the PRODUCT_LEON_REF from its previous value to the updated value. The LEON_ID and LEON_REF are internal implementation fields and are not available through SQL.

SUMMARY: With this method, query speed is drastically improved, since the "join" logic is performed once during insert instead of many times during select. This method should work well when the referenced table changes relatively infrequently. Thus people, products, and other relatively static "reference" information are key candidates for this 'indexing' technique. This technique should not be used if the frequency of updates exceeds the frequency of select statements.

- - - - - - -

Overall, I think it is a good idea. I frequently do weaker versions of this all the time, which I call "field caching", where the NAME field of infrequently changing tuples is frequently accessed. In this case, one may put PRODUCT_NAME in the ORDER_LINE table and put a trigger on PRODUCT to cascade updates of NAME to the ORDER_LINE.PRODUCT_NAME column. I tend to make monsters like this a nightly process, since product name changes need not be immediate (they are rare, and thus not frequent, and thus not usually immediate). This allows the cascade update to run at night when things are a lot less stressful on the database.

Is this in line with what you are saying?

I suggest borrowing an XML term for the idea, GROVE. In XML, a GROVE is a tree built from XML/SGML/notations. In this case, you can think of frequently joined information as cutting deep into the landscape: the more often the query is run, the greater the chance that the UPDATE/SELECT ratio will be small, and thus the greater the chance that the hard-wired physical address is cached in the table.
The reason I like the name is that it defines a common pathway that is easy, without preventing the efficiency of uncommon paths (where updates >> selects).

Hmm. I'm just worrying about the CASCADE nature of the beast. On the extreme end that I was writing about earlier, a prototype OO dbms that I was looking at about 6 years ago (god knows what its name was) did *everything* this way. And GOD it was slow... especially since it cascaded even when the frequency of updates far exceeded the frequency of selects.

Thoughts?

Best,

Clark
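A minimal sketch of the "field caching" Clark describes, using the schema names from his example plus a hypothetical product_name column; this is the plain-SQL nightly variant rather than a trigger:

	ALTER TABLE order_line ADD COLUMN product_name text;

	-- nightly cascade: copy the current product names into the
	-- referencing table so reports need not join against product
	UPDATE order_line
	   SET product_name = product.name
	  FROM product
	 WHERE order_line.product_id = product.product_id;

A trigger on PRODUCT can do the same thing immediately instead of nightly, at the cost of making product updates slower.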
Clark Evans wrote:
>
> In this proposal, a few hidden fields (ID/REF) would be added:
                           ^^^^^^
Not hidden, but with a _link_ type.

Vadim
Hello Clark,

Tuesday, July 06, 1999 you wrote:

C> Interesting idea. You are re-introducing some of the
C> hierarchical database ideas, is this right? Here is
C> what I'm receiving; could you correct me if I'm
C> mis-understanding? (much of this you did not say...)

Strictly speaking, this is neither a hierarchical nor a network database. It is not hierarchical because cyclic graphs are allowed (when tables reference one another, maybe through some intermediate table). And it is not a network database because it lacks the weird restriction placed on network databases (the textbook says that in a network database a referenced tuple may take part in at most one link of a certain link type).

C> In this proposal, in addition to carrying a primary key
C> for a referenced table, tuples in the referencing table
C> will also have a place to record the physical address
C> of each referenced tuple.

I have read the description carefully. I am afraid that MVCC will break your scheme, because a referencing tuple must have a way to reach all versions of the updated foreign tuple. If you update the referencing field, all other versions of the foreign tuple are lost. It seems the only way to satisfy MVCC is to chain updated foreign tuples, cleaning the chains up at VACUUM time. And there is no need for indices, since the only time they would be needed is during VACUUM.

Best regards, Leon
> When you decide whether to implement it or not,
> I urge you to think again about the relief for the optimizer,
> which I have stressed many times. No one has yet rebutted the point
> that adding enough brains to the optimizer so that it always uses the
> appropriate join method would require a major rewrite. With links you
> get the best join method as a side effect - virtually for free. These
> joins will never be too slow for an unknown reason. Think carefully.
> I hope you will make a wise decision.

I believe Ingres does allow this, as it has tids too. If you are creating a temp table, you could use tids during your processing. In fact, it seems tids would be valid until a vacuum is performed.

--
  Bruce Momjian                 |  http://www.op.net/~candle
  maillist@candle.pha.pa.us     |  (610) 853-3000
  + If your life is a hard drive, |  830 Blythe Avenue
  + Christ can be your backup.    |  Drexel Hill, Pennsylvania 19026
Leon wrote:
> C> In this proposal, in addition to carrying a primary key
> C> for a referenced table, tuples in the referencing table
> C> will also have a place to record the physical address
> C> of each referenced tuple.
>
> I have read the description carefully. I am afraid that MVCC
> will break your scheme, because a referencing tuple must have
> a way to reach all versions of the updated foreign tuple.
> If you update the referencing field, all other versions of
> the foreign tuple are lost. It seems the only way to satisfy
> MVCC is to chain updated foreign tuples, cleaning the chains up
> at VACUUM time. And there is no need for indices, since the only
> time they would be needed is during VACUUM.

(look of puzzlement) Where did I go wrong with what you are proposing? I'm not trying to invent my own scheme... I'm trying to understand yours. ;)

Clark
Hello Clark,

Tuesday, July 06, 1999 you wrote:

C> (look of puzzlement) Where did I go wrong with what
C> you are proposing? I'm not trying to invent my
C> own scheme... I'm trying to understand yours.

Ok. If the American people want to know the True Path, they can be enlightened :))) (it's a joke)

So here is exactly what is proposed.

Introduction of what looks like a new data type in the table structure:

	CREATE TABLE atable (a int4)
	CREATE TABLE btable (b int4, c link (atable))

- "link" looks like a new data type. An example query with a link:

	SELECT * FROM atable WHERE btable.b < 5 AND btable.c = atable.tid

(or maybe it should be ctid here - you know better).

Type checking:

	CREATE TABLE ctable (d int4)
	SELECT * FROM ctable WHERE btable.b < 5 AND btable.c = ctable.tid

- this should produce an error, because the link does not point to ctable.

No additional constraint is placed. Tables can reference one another in any combination; maybe a table should even be able to reference itself.

How is all that implemented? As we have seen, a link is matched against a tid in queries. That means a link can internally be of the same data type as a tid.

MVCC stuff: as Vadim pointed out, updated tuples are already chained, so this feature can naturally be utilized. The referencing tuple always points to the oldest version of the updated foreign tuple. If a transaction needs a version of the foreign tuple other than the oldest, it follows the chain. Vacuuming removes these chains, thus packing the table and rewriting references to the vacuumed table in other tables. Vacuuming therefore needs high priority, maybe a lock on the table being vacuumed and on all referencing tables. Since the referencing fields are rewritten only during vacuum, there is no need for indices on any field.

Best regards, Leon
On Mon, 5 Jul 1999, Tom Lane wrote:

> I am not sure there's anything fundamentally wrong with his basic point;
> if, say, we could find a way to construct OIDs so that a tuple could be
> found very quickly from its OID, that wouldn't violate the relational
> model AFAICS, and such OIDs would work fine as "links". But I don't see
> any way to do that without either giving up UPDATE or introducing a huge
> amount of baggage into all processes that can update tables (VACUUM
> being the worst case, likely). Without doubt the best compromise would
> look remarkably like an index on OID.

So is there anything wrong with that?

> Ultimately, when you consider both the update costs and the access
> costs, I doubt that this sort of thing could be a win, except maybe
> in the case where the referenced table is hardly ever changed so that
> the update costs are seldom incurred. But in that situation it's not
> clear you want to store the referenced table in an RDBMS anyway ---
> there are lots of lower-overhead ways to deal with fixed tables, such
> as perfect-hash generators.

While reading this thread I noticed that a lot of people are concerned about their update speeds. I am primarily concerned about query speeds. Consider how often you update data vs. how often you query it. That's the whole point of a database: to optimize information retrieval. Now I am not sure how big those update performance penalties would be, but I am not really concerned.

Meanwhile, I agree that hard-linking via record IDs sounds suspiciously like a page from the OODB textbook, where it is praised for exactly the same reasons the person who started this discussion cited: no joins. But in order for that to work (if it works), the database software would have to be written from scratch for it to be even marginally efficient.

The question I ask myself, though, is: are there any concrete plans for referential integrity via foreign key clauses? 6.6, 7.0, never? To me, that's really much more important than query speed or MVCC.

--
Peter Eisentraut			PathWay Computing, Inc.
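For reference, the declarative foreign-key syntax Peter is asking about looks like this (standard SQL; PostgreSQL did not yet enforce it at the time of this thread, and the table names are illustrative):

	CREATE TABLE atable (
	    a int4 PRIMARY KEY
	);

	CREATE TABLE btable (
	    b int4,
	    c int4 REFERENCES atable (a) ON DELETE CASCADE
	);

Note this gives integrity checking, not faster joins: lookups through c still go through an ordinary index on atable.a.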
Leon wrote:
>
> Hello Vadim,
>
> Tuesday, July 06, 1999 you wrote:
>
> >> These joins
> >> will never be too slow for an unknown reason. Think carefully.
> >> I hope you will make a wise decision.
>
> V> The optimizer requires a major rewrite in any case, even
> V> with links implemented.
>
> I am afraid that the optimizer, even totally rewritten, can't always
> choose the best method. That is simply because it is such a
> complex animal :) Bacteria - simple links - will always win
> in the field where they live :)

From what I have read in earlier posts about the optimizer, there can be situations where using links would actually be slower than going through the optimizer, similar to the case where scanning the whole table using an index can be orders of magnitude slower than doing a direct scan. That is of course only if they are used unwisely ;)

Another thing that has remained unclear to me is how to actually insert or update the links - you can't just put another record there, so it should be some kind of field value (tid, oid, ...) or some function like last_touched('other_table_name').

So, what have you thought of putting there?

------
Hannu
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Vadim Mikheev
> Sent: Tuesday, July 06, 1999 8:12 PM
> To: Tom Lane
> Cc: Thomas Lockhart; Leon; pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] Fwd: Joins and links
>
> Tom Lane wrote:
> >
> > I am not sure there's anything fundamentally wrong with his basic point;
> > if, say, we could find a way to construct OIDs so that a tuple could be
> > found very quickly from its OID, that wouldn't violate the relational
> > model AFAICS, and such OIDs would work fine as "links". But I don't see
> > any way to do that without either giving up UPDATE or introducing a huge
> > amount of baggage into all processes that can update tables (VACUUM
> > being the worst case, likely). Without doubt the best compromise would
> > look remarkably like an index on OID.
>
> There is no problem with UPDATE: the updated tuple points to the newer
> version, so we can avoid updating referencing tuples here.
> VACUUM would have to update the referencing tuples (via normal
> heap_replace, nothing special) while removing old versions.
> This may cause deadlocks, but we could give vacuum higher priority
> and abort the others.
>
> So, vacuum is the worst case, as pointed out by Tom.
> No problems with MVCC and other things.

What about dump/reload? And wouldn't vacuum be much more complicated than now? I think vacuum is sufficiently complicated already. Weren't these kinds of annoying things the reason RDBMSs overtook network DBMSs in spite of their lower performance?

If a "link" is necessary at any cost, how about the following story?

	"link" = OID + TID

If the oid pointed to by the TID is different from the stored OID, the executor resets the TID using an OID index (my story needs OID indices). This way we need not change vacuum/dump/reload etc. A command to update the TIDs to the latest values may be needed.

Comments?

Hiroshi Inoue
Inoue@tpf.co.jp
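A rough SQL illustration of Hiroshi's "link = OID + TID" idea (the table name and the literal values are hypothetical): the stored tid is tried first, and the stored oid is used to detect a stale tid and re-find the tuple through an OID index.

	-- fast path: the tuple is still where the link says it is,
	-- and its oid matches the one remembered in the link
	SELECT * FROM atable WHERE ctid = '(0,1)' AND oid = 100001;

	-- slow path, taken when the fast path returns nothing:
	-- re-find the tuple by oid and remember its new ctid in the link
	SELECT ctid FROM atable WHERE oid = 100001;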
Hannu Krosing wrote:

> Another thing that has remained unclear to me is how to actually
> insert or update the links - you can't just put another record there,
> so it should be some kind of field value (tid, oid, ...) or some
> function like last_touched('other_table_name').
>
> So, what have you thought of putting there?

Earlier I proposed that links should be of a type similar to tid, so inserts should be fed with tid values. But this requires an intermediate step, so there could be a function which takes a primary key and returns a tid, or, as you say, a function last_touched('other_table_name') - that seems the best choice.

--
Leon.
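Since no "link" type actually exists, a rough approximation of what Leon describes can be written with the existing tid type: look the target row up by its primary key once and store its current ctid (the names are hypothetical, and the stored value goes stale after any update or vacuum of atable, which is the whole problem discussed above).

	CREATE TABLE atable (a int4 PRIMARY KEY);
	CREATE TABLE btable (b int4, c tid);

	INSERT INTO atable VALUES (42);

	-- "follow the primary key once, keep the physical address"
	INSERT INTO btable
	SELECT 1, ctid FROM atable WHERE a = 42;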
I have been corresponding with Bob Devine for a few years. He was at Berkeley during the Postgres days and knows quite a bit about optimizers and storage systems. I will put his name at the bottom of the optimizer README, and if people have questions, he is willing to answer them as best he can.

---------------------------------------------------------------------------

Bob Devine wrote:
> Bruce Momjian wrote:
> >
> > Bob, any status on when you may want to get involved with PostgreSQL?
>
> Umm, hem, haw...
>
> I'm maxed out with current stuff now. It would be very hard
> for me to realistically promise any substantial commitment
> to doing any major work.
>
> However, I'm always available for discussion on designs.
> I've built two optimizers so I can help there. Otherwise
> I can help with general query execution or storage subsystems.
>
> Bob Devine

--
  Bruce Momjian                 |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us        |  (610) 359-1001
  + If your life is a hard drive, |  13 Roberts Road
  + Christ can be your backup.    |  Newtown Square, Pennsylvania 19073