Thread: Transactions and indexes

Transactions and indexes

From
Chris Cleveland
Date:
Noob here.

I'm trying to implement a new type of index access method. I don't think I can use the built-in storage manager and buffer manager efficiently because files with 8k blocks aren't going to work. I really need to implement a log-structured file.

I'm confused on how to handle transactions and visibility. I don't see anything in the index action method functions (am*()) that tell me when to commit or rollback new index entries, or which transaction we're currently in so I can know whether recently-added index entries should be visible to the current scan. I'm guessing that all that magically happens in the storage and buffer managers.

So... how do I handle this? Is there some way for me to implement my own storage manager that manages visibility?

I'd be grateful for any guidance.


--
Chris Cleveland
312-339-2677 mobile

Re: Transactions and indexes

From
Peter Geoghegan
Date:
On Mon, Jul 19, 2021 at 4:31 PM Chris Cleveland
<ccleveland@dieselpoint.com> wrote:
> I'm confused on how to handle transactions and visibility.

In Postgres, indexes are not considered to be part of the logical
database. They're just data structures that point to TIDs in the
table. To an index, each TID is just another object -- it doesn't
possess any built-in idea about MVCC.

In practice the indexes may be able to surmise certain things about
MVCC and versioning, as an optimization -- but that is all speculative
and relies on cooperation from the table AM side. Also, the
implementation of unique indexes knows more than zero about versions,
since that's just necessary. These two cases may or may not be
considered exceptions to the general rule. I suppose that it's a
matter of perspective.

> So... how do I handle this? Is there some way for me to implement my own storage manager that manages visibility?

This is the responsibility of a table AM, not any one index AM. In
general we assume that each table AM implements something very much
like heapam's VACUUM implementation. Index AMs may also have
opportunistic cleanup of their own, as an optimization (actually this
is what I was referring to).

Theoretically index AMs and table AMs are orthogonal things. How true
that will be in a world with more than one mature table AM remains to
be seen.

-- 
Peter Geoghegan



Re: Transactions and indexes

From
Chris Cleveland
Date:
Thank you. Does this mean I can implement the index AM and return TIDs without having to worry about transactions at all?

Also, as far as I can tell, the only way that TIDs are removed from the index is in ambulkdelete(). Is this accurate? Does that mean that my index will be returning TIDs for deleted items and I don't have to worry about that either?

Don't TIDs get reused? What happens when my index returns an old TID which is now pointing to a new record?

This is going to make it really hard to implement Top X queries of the type you get from a search engine. A search engine will normally maintain an internal buffer (usually a priority queue) of a fixed size, X, and add tuples to it along with their relevance score. The buffer only remembers the Top X tuples with the highest score. In this way the search engine can iterate over millions of entries and retain only the best ones without having an unbounded buffer. For this to work, though, you need to know how many tuples to keep in the buffer in advance. If my index can't know, in advance, which TIDs are invisible or deleted, then it can't keep them out of the buffer, and this whole scheme fails.

This is not going to work unless the system gives the index a clear picture of transactions, visibility, and deletes as they happen. Is this information available?




On Mon, Jul 19, 2021 at 6:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Jul 19, 2021 at 4:31 PM Chris Cleveland
<ccleveland@dieselpoint.com> wrote:
> I'm confused on how to handle transactions and visibility.

In Postgres, indexes are not considered to be part of the logical
database. They're just data structures that point to TIDs in the
table. To an index, each TID is just another object -- it doesn't
possess any built-in idea about MVCC.

In practice the indexes may be able to surmise certain things about
MVCC and versioning, as an optimization -- but that is all speculative
and relies on cooperation from the table AM side. Also, the
implementation of unique indexes knows more than zero about versions,
since that's just necessary. These two cases may or may not be
considered exceptions to the general rule. I suppose that it's a
matter of perspective.

> So... how do I handle this? Is there some way for me to implement my own storage manager that manages visibility?

This is the responsibility of a table AM, not any one index AM. In
general we assume that each table AM implements something very much
like heapam's VACUUM implementation. Index AMs may also have
opportunistic cleanup of their own, as an optimization (actually this
is what I was referring to).

Theoretically index AMs and table AMs are orthogonal things. How true
that will be in a world with more than one mature table AM remains to
be seen.

--
Peter Geoghegan


--
Chris Cleveland
312-339-2677 mobile

Re: Transactions and indexes

From
Peter Geoghegan
Date:
On Mon, Jul 19, 2021 at 7:20 PM Chris Cleveland
<ccleveland@dieselpoint.com> wrote:
> Thank you. Does this mean I can implement the index AM and return TIDs without having to worry about transactions at
all?

Yes. That's the upside of the design -- it makes it easy to add new
transactional index AMs. Which is one reason why Postgres has so many.

> Also, as far as I can tell, the only way that TIDs are removed from the index is in ambulkdelete(). Is this accurate?

It doesn't have to be the only way, but in practice it can be. Depends
on the index AM. The core code relies on ambulkdelete() to make sure
that all TIDs dead in the table are gone from the index. This allows
VACUUM to finally physically recycle the previously referenced TIDs in
the table structure, without risk of index scans finding the wrong
thing.

> Does that mean that my index will be returning TIDs for deleted items and I don't have to worry about that either?

If you assume that you're using heapam (the standard table AM), then
yes. Otherwise I don't know -- it's ambiguous.

> Don't TIDs get reused? What happens when my index returns an old TID which is now pointing to a new record?

This can't happen because, as I said, the table cannot recycle
TIDs/line pointers until it's known that this cannot happen (because
VACUUM already cleaned out all the garbage index tuples).

> This is going to make it really hard to implement Top X queries of the type you get from a search engine. A search
enginewill normally maintain an internal buffer (usually a priority queue) of a fixed size, X, and add tuples to it
alongwith their relevance score. The buffer only remembers the Top X tuples with the highest score. In this way the
searchengine can iterate over millions of entries and retain only the best ones without having an unbounded buffer. For
thisto work, though, you need to know how many tuples to keep in the buffer in advance. If my index can't know, in
advance,which TIDs are invisible or deleted, then it can't keep them out of the buffer, and this whole scheme fails. 
>
> This is not going to work unless the system gives the index a clear picture of transactions, visibility, and deletes
asthey happen. Is this information available? 

Are you implementing a new index AM or a new table AM? Discarding data
based on something like a relevance score doesn't seem like something
that either API provides for. Indexes in Postgres can be lossy, but
that in itself doesn't change the result of queries.

--
Peter Geoghegan



Re: Transactions and indexes

From
Chris Cleveland
Date:
Are you implementing a new index AM or a new table AM? Discarding data
based on something like a relevance score doesn't seem like something
that either API provides for. Indexes in Postgres can be lossy, but
that in itself doesn't change the result of queries.

(Sorry if this doesn't quote properly, I'm trying to figure out how to do the quote-and-bottom-post thing in gmail).

My plan was to do an index AM alone, but I'm thinking that isn't going to work. The goal is to do better full-text search in Postgres, fast, over really large datasets.

Relevance scoring is like an ORDER BY score with a LIMIT. The code that traverses the index needs to know both of these things in advance.

The GIN code doesn't cut it. I'm still trying to understand the code for the RUM index type, but it's slow going.

Suggestions on how to go about this are welcome.