Thread: New form of index "persistent reference"

New form of index "persistent reference"

From
pgsql@mohawksoft.com
Date:
For about 5 years now, I have been using a text search engine that I wrote
and maintain.

In the beginning, I hacked up function mechanisms to return multiple value
sets and columns. Then PostgreSQL aded "setof" and it is was cool. Then it
was able to return a set of rows, which was even better.

Lately, I have been thinking that a cool form of index would be some sort
of "persistent reference" index. Like the old ISAM days of yore, a fixed
number could point you right to the row that you want. I'm not sure if the
"persistent reference" is a specific auto numbering column type or
separate index structure or both.

I asked the question how do you get a record without going through an
index, the answer was CTID, which unfortunately changes when the row is
updated.

Now, what I want to brainstorm is some sort of "persistent reference"
where the value is not algorithmically stored, maybe just an offset into a
table. The number of operations should be about 1 per lookup.

Imagine a dynamically growing array that has one slot per row. Every row
is considered unique. Rows which are updated, their CTID is updated in the
reference. (with vacuum?)

Imagine something like this:

create table foobar(id reference, name varchar, value varchar);

select * from foobar where id = 100;

The reference type has an implicit index that is basically a lookup table.
On unique references where the reference value is fairly arbitrary, this
would be a HUGE gain for direct lookups. There is no need for the NlogN of
a tree.

On the surface level, this would be a huge win for websites that use
semi-fixed tables of data.




Re: New form of index "persistent reference"

From
"Bort, Paul"
Date:
<p><font size="2">If that ID is the only thing you use to access that data, why not just store it in a flat file with
fixed-lengthrecords? seek() (or your language's equivalent) is usually fast. </font><p><font size="2">If you need to
drivethat from within PostgreSQL, you would need an untrusted language to read the file, but you could also generate it
froma table using a trigger. </font><p><font size="2">Or maybe use a serial column, an index on that column, and
clusterthe table on that index. It's more than one lookup, but not much with a Btree index. (Not sure if this is better
thanjust using a serial and an index. <a href="http://www.postgresql.org/docs/8.0/interactive/sql-cluster.html"
target="_blank">http://www.postgresql.org/docs/8.0/interactive/sql-cluster.html</a>says it isn't, if I read it
correctly.)</font><p><fontsize="2">Then anytime there is a batch of updates to the table, re-cluster it.</font><p><font
size="2">>-----Original Message-----</font><br /><font size="2">> From: pgsql@mohawksoft.com [<a
href="mailto:pgsql@mohawksoft.com">mailto:pgsql@mohawksoft.com</a>]</font><br/><font size="2">> Sent: Thursday,
February10, 2005 11:22 AM</font><br /><font size="2">> To: pgsql-hackers@postgresql.org</font><br /><font
size="2">>Subject: [HACKERS] New form of index "persistent reference"</font><br /><font size="2">> </font><br
/><fontsize="2">> </font><br /><font size="2">> For about 5 years now, I have been using a text search engine
</font><br/><font size="2">> that I wrote</font><br /><font size="2">> and maintain.</font><br /><font
size="2">></font><br /><font size="2">> In the beginning, I hacked up function mechanisms to return </font><br
/><fontsize="2">> multiple value</font><br /><font size="2">> sets and columns. Then PostgreSQL aded "setof" and
itis was </font><br /><font size="2">> cool. Then it</font><br /><font size="2">> was able to return a set of
rows,which was even better.</font><br /><font size="2">> </font><br /><font size="2">> Lately, I have been
thinkingthat a cool form of index would </font><br /><font size="2">> be some sort</font><br /><font size="2">>
of"persistent reference" index. Like the old ISAM days of </font><br /><font size="2">> yore, a fixed</font><br
/><fontsize="2">> number could point you right to the row that you want. I'm </font><br /><font size="2">> not
sureif the</font><br /><font size="2">> "persistent reference" is a specific auto numbering column type or</font><br
/><fontsize="2">> separate index structure or both.</font><br /><font size="2">> </font><br /><font size="2">>
Iasked the question how do you get a record without going through an</font><br /><font size="2">> index, the answer
wasCTID, which unfortunately changes when </font><br /><font size="2">> the row is</font><br /><font size="2">>
updated.</font><br/><font size="2">> </font><br /><font size="2">> Now, what I want to brainstorm is some sort of
"persistentreference"</font><br /><font size="2">> where the value is not algorithmically stored, maybe just an
</font><br/><font size="2">> offset into a</font><br /><font size="2">> table. The number of operations should be
about1 per lookup.</font><br /><font size="2">> </font><br /><font size="2">> Imagine a dynamically growing array
thathas one slot per </font><br /><font size="2">> row. Every row</font><br /><font size="2">> is considered
unique.Rows which are updated, their CTID is </font><br /><font size="2">> updated in the</font><br /><font
size="2">>reference. (with vacuum?)</font><br /><font size="2">> </font><br /><font size="2">> Imagine
somethinglike this:</font><br /><font size="2">> </font><br /><font size="2">> create table foobar(id reference,
namevarchar, value varchar);</font><br /><font size="2">> </font><br /><font size="2">> select * from foobar
whereid = 100;</font><br /><font size="2">> </font><br /><font size="2">> The reference type has an implicit
indexthat is basically a </font><br /><font size="2">> lookup table.</font><br /><font size="2">> On unique
referenceswhere the reference value is fairly </font><br /><font size="2">> arbitrary, this</font><br /><font
size="2">>would be a HUGE gain for direct lookups. There is no need for </font><br /><font size="2">> the NlogN
of</font><br/><font size="2">> a tree.</font><br /><font size="2">> </font><br /><font size="2">> On the
surfacelevel, this would be a huge win for websites that use</font><br /><font size="2">> semi-fixed tables of
data.</font><br/><font size="2">> </font><br /><font size="2">> </font><br /><font size="2">> </font><br
/><fontsize="2">> ---------------------------(end of </font><br /><font size="2">>
broadcast)---------------------------</font><br/><font size="2">> TIP 1: subscribe and unsubscribe commands go to
</font><br/><font size="2">> majordomo@postgresql.org</font><br /><font size="2">> </font> 

Re: New form of index "persistent reference"

From
pgsql@mohawksoft.com
Date:
> If that ID is the only thing you use to access that data, why not just
> store
> it in a flat file with fixed-length records? seek() (or your language's
> equivalent) is usually fast.

As a matter of policy, I would never manage data outside of the database.

>
> If you need to drive that from within PostgreSQL, you would need an
> untrusted language to read the file, but you could also generate it from a
> table using a trigger.

Very ugly.

>
> Or maybe use a serial column, an index on that column, and cluster the
> table
> on that index. It's more than one lookup, but not much with a Btree index.
> (Not sure if this is better than just using a serial and an index.
> http://www.postgresql.org/docs/8.0/interactive/sql-cluster.html says it
> isn't, if I read it correctly.)

Clustering is OK, but it doesn't handle updates and additions until you
recluster the data.

If a static reference is all that is needed, then merely using CTID would
suffice. I was thinking a little overhead for a reference table would
allow it to hook into PostgreSQL and keep it up to date.



>
> Then anytime there is a batch of updates to the table, re-cluster it.

Yea, like I said, there are easier ways of doing that with fairly static
data.


>
>> -----Original Message-----
>> From: pgsql@mohawksoft.com [mailto:pgsql@mohawksoft.com]
>> Sent: Thursday, February 10, 2005 11:22 AM
>> To: pgsql-hackers@postgresql.org
>> Subject: [HACKERS] New form of index "persistent reference"
>>
>>
>> For about 5 years now, I have been using a text search engine
>> that I wrote
>> and maintain.
>>
>> In the beginning, I hacked up function mechanisms to return
>> multiple value
>> sets and columns. Then PostgreSQL aded "setof" and it is was
>> cool. Then it
>> was able to return a set of rows, which was even better.
>>
>> Lately, I have been thinking that a cool form of index would
>> be some sort
>> of "persistent reference" index. Like the old ISAM days of
>> yore, a fixed
>> number could point you right to the row that you want. I'm
>> not sure if the
>> "persistent reference" is a specific auto numbering column type or
>> separate index structure or both.
>>
>> I asked the question how do you get a record without going through an
>> index, the answer was CTID, which unfortunately changes when
>> the row is
>> updated.
>>
>> Now, what I want to brainstorm is some sort of "persistent reference"
>> where the value is not algorithmically stored, maybe just an
>> offset into a
>> table. The number of operations should be about 1 per lookup.
>>
>> Imagine a dynamically growing array that has one slot per
>> row. Every row
>> is considered unique. Rows which are updated, their CTID is
>> updated in the
>> reference. (with vacuum?)
>>
>> Imagine something like this:
>>
>> create table foobar(id reference, name varchar, value varchar);
>>
>> select * from foobar where id = 100;
>>
>> The reference type has an implicit index that is basically a
>> lookup table.
>> On unique references where the reference value is fairly
>> arbitrary, this
>> would be a HUGE gain for direct lookups. There is no need for
>> the NlogN of
>> a tree.
>>
>> On the surface level, this would be a huge win for websites that use
>> semi-fixed tables of data.
>>
>>
>>
>> ---------------------------(end of
>> broadcast)---------------------------
>> TIP 1: subscribe and unsubscribe commands go to
>> majordomo@postgresql.org
>>
>



Re: New form of index "persistent reference"

From
"Merlin Moncure"
Date:
> Lately, I have been thinking that a cool form of index would be some
sort
> of "persistent reference" index. Like the old ISAM days of yore, a
fixed
> number could point you right to the row that you want. I'm not sure if
the
> "persistent reference" is a specific auto numbering column type or
> separate index structure or both.

What you are talking about is a 'relative file'.  It turns out on modern
ISAM file systems, the win you get over b-tree indexing is not worth
losing the ability to do simple things like run-length compression on
strings.

Anyways, while storing a physical offset is O(1), so is computing a
hash.  How would a hash index not fill your need?

Merlin


Re: New form of index "persistent reference"

From
"Zeugswetter Andreas DAZ SD"
Date:
> I asked the question how do you get a record without going through an
> index, the answer was CTID, which unfortunately changes when the row is
> updated.

The ctid is a physical location of the row. On update a new tuple is written
in a new location, that is why the ctid changes. The old tuple has a system field
t_ctid which is then a forward pointer to the new tuple.

Thus you can follow that chain until the visible tuple is found.
The current tid = tid does not do that (I think because the ODBC driver
which was the first to use it (for result set modification) needed to notice when the
tuple was updated underneath).

But you can use:select * from atab where ctid = currtid2('atab', '(0,1)'); -- '(0,1)' is the old ctid

Of course the old ctid is only valid until (auto)vacuum marks it free.
Without vacuum you are currently safe to use the currtid functions.

Andreas