Thread: Concrete proposal for large objects and MVCC

Concrete proposal for large objects and MVCC

From
Tom Lane
Date:
I spent a little bit of time thinking about what it would mean exactly
for large-object operations to obey MVCC, and decided that there are
more worms in that can than I had realized.  Part of the problem is
that we have no concept of a lock on an individual LO, and thus
operations that really should be serialized, such as DROP, aren't going
to work very well.  We could implement DROP as the equivalent ofDELETE FROM pg_largeobject WHERE loid = nnn;
with an MVCC snapshot --- but there is no guarantee that we won't miss
a page that someone else is concurrently inserting into that same large
object.

So what I'm thinking is that the prudent course is to leave writing
semantics as they are, namely SnapshotNow rules.  (What this means
in practice is you get "tuple concurrently updated" errors if two
transactions try to write the same page of the same LO concurrently.
We have seen few if any complaints about that error in connection with
LO operations, so ISTM there's not a problem there that needs solving.)

The problem we do need to solve is letting pg_dump have a stable view
of the database's large objects.  I propose that we can fix this in
a suitably narrow way by making the following definition:
* A large object descriptor opened for read-only access saves  the current ActiveSnapshot and uses that snapshot to
read pg_largeobject for the duration of its existence.
 
* A large object descriptor opened for write-only or read-write  access uses SnapshotNow, same as before.

This avoids the risk of creating any serious backwards-compatibility
issues: if there's anyone out there who does need SnapshotNow reads,
they just have to be sure to open the LO in read-write mode to have
fully backward compatible operation.

Comments, objections?
        regards, tom lane


Re: Concrete proposal for large objects and MVCC

From
Tatsuo Ishii
Date:
> I spent a little bit of time thinking about what it would mean exactly
> for large-object operations to obey MVCC, and decided that there are
> more worms in that can than I had realized.  Part of the problem is
> that we have no concept of a lock on an individual LO, and thus
> operations that really should be serialized, such as DROP, aren't going
> to work very well.  We could implement DROP as the equivalent of
>     DELETE FROM pg_largeobject WHERE loid = nnn;
> with an MVCC snapshot --- but there is no guarantee that we won't miss
> a page that someone else is concurrently inserting into that same large
> object.
> 
> So what I'm thinking is that the prudent course is to leave writing
> semantics as they are, namely SnapshotNow rules.  (What this means
> in practice is you get "tuple concurrently updated" errors if two
> transactions try to write the same page of the same LO concurrently.
> We have seen few if any complaints about that error in connection with
> LO operations, so ISTM there's not a problem there that needs solving.)
> 
> The problem we do need to solve is letting pg_dump have a stable view
> of the database's large objects.  I propose that we can fix this in
> a suitably narrow way by making the following definition:
> 
>     * A large object descriptor opened for read-only access saves
>       the current ActiveSnapshot and uses that snapshot to read
>       pg_largeobject for the duration of its existence.
> 
>     * A large object descriptor opened for write-only or read-write
>       access uses SnapshotNow, same as before.
> 
> This avoids the risk of creating any serious backwards-compatibility
> issues: if there's anyone out there who does need SnapshotNow reads,
> they just have to be sure to open the LO in read-write mode to have
> fully backward compatible operation.
> 
> Comments, objections?

Besides the MVCC issue, I am not sure it's a good idea LO being binded
to OID. In my understanding OID is solely used to distinguish each LO
in a database. In another word, it's just a key to LO. I think giving
explicit key when creating a LO has some benefits:

1) not need to worry about OID wrap around problem
2) easier to find orpahn LO
3) for replication systems it's easier to replicate LOs

What do you think?
--
Tatsuo Ishii


Re: Concrete proposal for large objects and MVCC

From
"Joshua D. Drake"
Date:
>>This avoids the risk of creating any serious backwards-compatibility
>>issues: if there's anyone out there who does need SnapshotNow reads,
>>they just have to be sure to open the LO in read-write mode to have
>>fully backward compatible operation.
>>
>>Comments, objections?
> 
> 
> Besides the MVCC issue, I am not sure it's a good idea LO being binded
> to OID. In my understanding OID is solely used to distinguish each LO
> in a database. In another word, it's just a key to LO. I think giving
> explicit key when creating a LO has some benefits:
> 
> 1) not need to worry about OID wrap around problem
> 2) easier to find orpahn LO
> 3) for replication systems it's easier to replicate LOs  4) No longer tied to a system object and thus no oddities
neededfor 
 
backup/restore.

It should just be an int4 or in8 with a serial IMHO.

Sincerely,

Joshua D. Drake


> 
> What do you think?
> --
> Tatsuo Ishii
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
> 
>                http://www.postgresql.org/docs/faq


-- 
Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240
PostgreSQL Replication, Consulting, Custom Programming, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/


Re: Concrete proposal for large objects and MVCC

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> Besides the MVCC issue, I am not sure it's a good idea LO being binded
> to OID. In my understanding OID is solely used to distinguish each LO
> in a database. In another word, it's just a key to LO. I think giving
> explicit key when creating a LO has some benefits:

I'm not excited about making non-backwards-compatible changes in LOs;
the whole thing is pretty much a legacy feature in my mind, and changing
it significantly seems to miss the point.  However, we could offer a
new variant of lo_creat that allows a particular OID to be specified.
(That would simplify pg_dump tremendously, which is probably sufficient
reason to do it.)  I think the only other change needed is that the
default form of lo_creat should loop until it finds a free OID, which
is something I had intended to change anyway --- the current coding is
failure-prone once the OID counter wraps around.

This is really orthogonal to the MVCC issue, but I'm willing to change
it at the same time if there's no objections.

Anyone have a preference about the name for the new function?  (At least
at the libpq level, we have to invent a new name, we can't just
overload.)  I'm inclined to use "lo_create", but maybe that's too close
to "lo_creat".
        regards, tom lane


Re: Concrete proposal for large objects and MVCC

From
Christopher Kings-Lynne
Date:
> This avoids the risk of creating any serious backwards-compatibility
> issues: if there's anyone out there who does need SnapshotNow reads,
> they just have to be sure to open the LO in read-write mode to have
> fully backward compatible operation.
> 
> Comments, objections?

If you feel like it, feel free to make pg_dump custom format restore LOB 
comments :)

Chris