Thread: New form of index "persistent reference"
For about 5 years now, I have been using a text search engine that I wrote and maintain. In the beginning, I hacked up function mechanisms to return multiple value sets and columns. Then PostgreSQL aded "setof" and it is was cool. Then it was able to return a set of rows, which was even better. Lately, I have been thinking that a cool form of index would be some sort of "persistent reference" index. Like the old ISAM days of yore, a fixed number could point you right to the row that you want. I'm not sure if the "persistent reference" is a specific auto numbering column type or separate index structure or both. I asked the question how do you get a record without going through an index, the answer was CTID, which unfortunately changes when the row is updated. Now, what I want to brainstorm is some sort of "persistent reference" where the value is not algorithmically stored, maybe just an offset into a table. The number of operations should be about 1 per lookup. Imagine a dynamically growing array that has one slot per row. Every row is considered unique. Rows which are updated, their CTID is updated in the reference. (with vacuum?) Imagine something like this: create table foobar(id reference, name varchar, value varchar); select * from foobar where id = 100; The reference type has an implicit index that is basically a lookup table. On unique references where the reference value is fairly arbitrary, this would be a HUGE gain for direct lookups. There is no need for the NlogN of a tree. On the surface level, this would be a huge win for websites that use semi-fixed tables of data.
<p><font size="2">If that ID is the only thing you use to access that data, why not just store it in a flat file with fixed-lengthrecords? seek() (or your language's equivalent) is usually fast. </font><p><font size="2">If you need to drivethat from within PostgreSQL, you would need an untrusted language to read the file, but you could also generate it froma table using a trigger. </font><p><font size="2">Or maybe use a serial column, an index on that column, and clusterthe table on that index. It's more than one lookup, but not much with a Btree index. (Not sure if this is better thanjust using a serial and an index. <a href="http://www.postgresql.org/docs/8.0/interactive/sql-cluster.html" target="_blank">http://www.postgresql.org/docs/8.0/interactive/sql-cluster.html</a>says it isn't, if I read it correctly.)</font><p><fontsize="2">Then anytime there is a batch of updates to the table, re-cluster it.</font><p><font size="2">>-----Original Message-----</font><br /><font size="2">> From: pgsql@mohawksoft.com [<a href="mailto:pgsql@mohawksoft.com">mailto:pgsql@mohawksoft.com</a>]</font><br/><font size="2">> Sent: Thursday, February10, 2005 11:22 AM</font><br /><font size="2">> To: pgsql-hackers@postgresql.org</font><br /><font size="2">>Subject: [HACKERS] New form of index "persistent reference"</font><br /><font size="2">> </font><br /><fontsize="2">> </font><br /><font size="2">> For about 5 years now, I have been using a text search engine </font><br/><font size="2">> that I wrote</font><br /><font size="2">> and maintain.</font><br /><font size="2">></font><br /><font size="2">> In the beginning, I hacked up function mechanisms to return </font><br /><fontsize="2">> multiple value</font><br /><font size="2">> sets and columns. Then PostgreSQL aded "setof" and itis was </font><br /><font size="2">> cool. Then it</font><br /><font size="2">> was able to return a set of rows,which was even better.</font><br /><font size="2">> </font><br /><font size="2">> Lately, I have been thinkingthat a cool form of index would </font><br /><font size="2">> be some sort</font><br /><font size="2">> of"persistent reference" index. Like the old ISAM days of </font><br /><font size="2">> yore, a fixed</font><br /><fontsize="2">> number could point you right to the row that you want. I'm </font><br /><font size="2">> not sureif the</font><br /><font size="2">> "persistent reference" is a specific auto numbering column type or</font><br /><fontsize="2">> separate index structure or both.</font><br /><font size="2">> </font><br /><font size="2">> Iasked the question how do you get a record without going through an</font><br /><font size="2">> index, the answer wasCTID, which unfortunately changes when </font><br /><font size="2">> the row is</font><br /><font size="2">> updated.</font><br/><font size="2">> </font><br /><font size="2">> Now, what I want to brainstorm is some sort of "persistentreference"</font><br /><font size="2">> where the value is not algorithmically stored, maybe just an </font><br/><font size="2">> offset into a</font><br /><font size="2">> table. The number of operations should be about1 per lookup.</font><br /><font size="2">> </font><br /><font size="2">> Imagine a dynamically growing array thathas one slot per </font><br /><font size="2">> row. Every row</font><br /><font size="2">> is considered unique.Rows which are updated, their CTID is </font><br /><font size="2">> updated in the</font><br /><font size="2">>reference. (with vacuum?)</font><br /><font size="2">> </font><br /><font size="2">> Imagine somethinglike this:</font><br /><font size="2">> </font><br /><font size="2">> create table foobar(id reference, namevarchar, value varchar);</font><br /><font size="2">> </font><br /><font size="2">> select * from foobar whereid = 100;</font><br /><font size="2">> </font><br /><font size="2">> The reference type has an implicit indexthat is basically a </font><br /><font size="2">> lookup table.</font><br /><font size="2">> On unique referenceswhere the reference value is fairly </font><br /><font size="2">> arbitrary, this</font><br /><font size="2">>would be a HUGE gain for direct lookups. There is no need for </font><br /><font size="2">> the NlogN of</font><br/><font size="2">> a tree.</font><br /><font size="2">> </font><br /><font size="2">> On the surfacelevel, this would be a huge win for websites that use</font><br /><font size="2">> semi-fixed tables of data.</font><br/><font size="2">> </font><br /><font size="2">> </font><br /><font size="2">> </font><br /><fontsize="2">> ---------------------------(end of </font><br /><font size="2">> broadcast)---------------------------</font><br/><font size="2">> TIP 1: subscribe and unsubscribe commands go to </font><br/><font size="2">> majordomo@postgresql.org</font><br /><font size="2">> </font>
> If that ID is the only thing you use to access that data, why not just > store > it in a flat file with fixed-length records? seek() (or your language's > equivalent) is usually fast. As a matter of policy, I would never manage data outside of the database. > > If you need to drive that from within PostgreSQL, you would need an > untrusted language to read the file, but you could also generate it from a > table using a trigger. Very ugly. > > Or maybe use a serial column, an index on that column, and cluster the > table > on that index. It's more than one lookup, but not much with a Btree index. > (Not sure if this is better than just using a serial and an index. > http://www.postgresql.org/docs/8.0/interactive/sql-cluster.html says it > isn't, if I read it correctly.) Clustering is OK, but it doesn't handle updates and additions until you recluster the data. If a static reference is all that is needed, then merely using CTID would suffice. I was thinking a little overhead for a reference table would allow it to hook into PostgreSQL and keep it up to date. > > Then anytime there is a batch of updates to the table, re-cluster it. Yea, like I said, there are easier ways of doing that with fairly static data. > >> -----Original Message----- >> From: pgsql@mohawksoft.com [mailto:pgsql@mohawksoft.com] >> Sent: Thursday, February 10, 2005 11:22 AM >> To: pgsql-hackers@postgresql.org >> Subject: [HACKERS] New form of index "persistent reference" >> >> >> For about 5 years now, I have been using a text search engine >> that I wrote >> and maintain. >> >> In the beginning, I hacked up function mechanisms to return >> multiple value >> sets and columns. Then PostgreSQL aded "setof" and it is was >> cool. Then it >> was able to return a set of rows, which was even better. >> >> Lately, I have been thinking that a cool form of index would >> be some sort >> of "persistent reference" index. Like the old ISAM days of >> yore, a fixed >> number could point you right to the row that you want. I'm >> not sure if the >> "persistent reference" is a specific auto numbering column type or >> separate index structure or both. >> >> I asked the question how do you get a record without going through an >> index, the answer was CTID, which unfortunately changes when >> the row is >> updated. >> >> Now, what I want to brainstorm is some sort of "persistent reference" >> where the value is not algorithmically stored, maybe just an >> offset into a >> table. The number of operations should be about 1 per lookup. >> >> Imagine a dynamically growing array that has one slot per >> row. Every row >> is considered unique. Rows which are updated, their CTID is >> updated in the >> reference. (with vacuum?) >> >> Imagine something like this: >> >> create table foobar(id reference, name varchar, value varchar); >> >> select * from foobar where id = 100; >> >> The reference type has an implicit index that is basically a >> lookup table. >> On unique references where the reference value is fairly >> arbitrary, this >> would be a HUGE gain for direct lookups. There is no need for >> the NlogN of >> a tree. >> >> On the surface level, this would be a huge win for websites that use >> semi-fixed tables of data. >> >> >> >> ---------------------------(end of >> broadcast)--------------------------- >> TIP 1: subscribe and unsubscribe commands go to >> majordomo@postgresql.org >> >
> Lately, I have been thinking that a cool form of index would be some sort > of "persistent reference" index. Like the old ISAM days of yore, a fixed > number could point you right to the row that you want. I'm not sure if the > "persistent reference" is a specific auto numbering column type or > separate index structure or both. What you are talking about is a 'relative file'. It turns out on modern ISAM file systems, the win you get over b-tree indexing is not worth losing the ability to do simple things like run-length compression on strings. Anyways, while storing a physical offset is O(1), so is computing a hash. How would a hash index not fill your need? Merlin
> I asked the question how do you get a record without going through an > index, the answer was CTID, which unfortunately changes when the row is > updated. The ctid is a physical location of the row. On update a new tuple is written in a new location, that is why the ctid changes. The old tuple has a system field t_ctid which is then a forward pointer to the new tuple. Thus you can follow that chain until the visible tuple is found. The current tid = tid does not do that (I think because the ODBC driver which was the first to use it (for result set modification) needed to notice when the tuple was updated underneath). But you can use:select * from atab where ctid = currtid2('atab', '(0,1)'); -- '(0,1)' is the old ctid Of course the old ctid is only valid until (auto)vacuum marks it free. Without vacuum you are currently safe to use the currtid functions. Andreas