Thread: [PATCH] [v8.5] Security checks on largeobjects
This patch adds support for DAC security checks on largeobjects. The enhanced GRANT/REVOKE statement allows SELECT (read) and UPDATE (write) permissions to be set on individual largeobjects.

At creation time, it checks ACL_CREATE on the schema object. Currently a largeobject does not have any human-readable name or qualified namespace, so we assume the "public" namespace here. At deletion time, it checks ownership of the largeobject: only the owner and superusers can drop largeobjects.

The ownership and schema can be set using:

  ALTER LARGE OBJECT <lobj> OWNER TO <role>;
  ALTER LARGE OBJECT <lobj> SET SCHEMA <schema>;

Because the current pg_largeobject system catalog cannot store metadata of largeobjects, its data structure is modified:

  CATALOG(pg_largeobject,2613)
  {
      Oid       lonsp;      /* OID of the namespace */
      Oid       loowner;    /* OID of the owner */
      aclitem   loacl[1];   /* access permissions */
  } FormData_pg_largeobject;

The current pg_largeobject is renamed to pg_largeobject_data. One or more tuples within pg_largeobject_data point to the record within pg_largeobject which holds the metadata of a certain largeobject.

  CATALOG(pg_largeobject_data,2966) BKI_WITHOUT_OIDS
  {
      Oid       loid;       /* Identifier of large object */
      int4      pageno;     /* Page number (starting from 0) */
      bytea     data;       /* Data for page (may be zero-length) */
  } FormData_pg_largeobject_data;

Issues:
* Is the ALTER LARGE OBJECT interface suitable?
* How should we treat the namespace (schema) and the ownership of a largeobject?
* Is a named largeobject (including a fully qualified one) worthwhile? It would enable us to specify a largeobject with a human-readable identifier string.
* Is the data structure appropriate?
  - As an aside, pg_largeobject_data has a definition identical to that of TOAST tables. It may be possible to store the data within a TOAST table.
  - If so, it may also resolve another TODO item: Allow read/write into TOAST values like large objects.

Memo: http://wiki.postgresql.org/wiki/Largeobject_Enhancement

Example:

  postgres=# REVOKE ALL ON LARGE OBJECT 1234 FROM PUBLIC;
  REVOKE
  postgres=# GRANT SELECT ON LARGE OBJECT 1234 TO ymj;
  GRANT
  postgres=# GRANT SELECT,UPDATE ON LARGE OBJECT 1234 TO tak;
  GRANT
  postgres=# \c - ymj
  psql (8.4rc2)
  You are now connected to database "postgres" as user "ymj".
  postgres=> SELECT loread(lo_open(1234, x'40000'::int), 10);
     loread
  ------------
   1234567890
  (1 row)

  postgres=> SELECT lowrite(lo_open(1234, x'20000'::int), 'abcdefg');
  ERROR:  permission denied for largeobject largeobject:1234
  postgres=> \c - tak
  psql (8.4rc2)
  You are now connected to database "postgres" as user "tak".
  postgres=> SELECT lowrite(lo_open(1234, x'20000'::int), 'abcdefg');
   lowrite
  ---------
         7
  (1 row)

Scale of the patch:

  [kaigai@saba ]$ diffstat pgsql-lobj-perms-8.4rc2-r2080.patch
   backend/catalog/Makefile               |    6
   backend/catalog/aclchk.c               |  243 ++++++++++++++++++++++++++++++
   backend/catalog/dependency.c           |   15 +
   backend/catalog/pg_largeobject.c       |  265 ++++++++++++!!!!!!!!!!!!!!!!!!!!
   backend/commands/alter.c               |    9 +
   backend/libpq/be-fsstubs.c             |   25 +++
   backend/parser/gram.y                  |   28 +++
   backend/storage/large_object/inv_api.c |  140 +++------!!!!!!
   backend/tcop/utility.c                 |    6
   backend/utils/adt/acl.c                |    4
   backend/utils/cache/syscache.c         |   13 +
   include/catalog/dependency.h           |    1
   include/catalog/indexing.h             |    7
   include/catalog/pg_largeobject.h       |   21 !!
   include/catalog/pg_largeobject_data.h  |   54 ++++++
   include/catalog/toasting.h             |    1
   include/nodes/parsenodes.h             |    1
   include/utils/acl.h                    |    7
   include/utils/syscache.h               |    1
   test/regress/expected/sanity_check.out |    3
   test/regress/input/largeobject.source  |   95 +++++++++++
   test/regress/output/largeobject.source |  175 +++++++++++++++++++++
   22 files changed, 803 insertions(+), 46 deletions(-), 271 modifications(!)

--
OSS Platform Development Division, NEC
KaiGai Kohei <kaigai@ak.jp.nec.com>
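[Editor's note: for reference, a minimal sketch of the commands introduced or affected by the patch, assuming it were applied. ALTER LARGE OBJECT and GRANT/REVOKE ... ON LARGE OBJECT are the syntax proposed here, not existing features; the role "ymj" and the schema "archive" are placeholders.]

  -- Sketch only: ALTER LARGE OBJECT and GRANT/REVOKE ... ON LARGE OBJECT are
  -- the syntax proposed by this patch; role "ymj" and schema "archive" are
  -- placeholders.
  SELECT lo_create(1234);                       -- create a largeobject with OID 1234 (existing function)

  ALTER LARGE OBJECT 1234 OWNER TO ymj;         -- proposed: reassign ownership
  ALTER LARGE OBJECT 1234 SET SCHEMA archive;   -- proposed: move out of "public"

  REVOKE ALL ON LARGE OBJECT 1234 FROM PUBLIC;  -- proposed: revoke default access
  GRANT SELECT ON LARGE OBJECT 1234 TO ymj;     -- proposed: read-only access for ymj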
--On 26. Juni 2009 13:08:37 +0900 KaiGai Kohei <kaigai@ak.jp.nec.com> wrote:

> * Is a named largeobject (including a fully qualified one) worthwhile?
>   It would enable us to specify a largeobject with a human-readable
>   identifier string.

I don't understand the notion of this. Does this mean you can create a LO with an identifier string, generated from (e.g.) your application?

--
Thanks

Bernd
Bernd Helmle wrote:
> --On 26. Juni 2009 13:08:37 +0900 KaiGai Kohei <kaigai@ak.jp.nec.com> wrote:
>
>> * Is a named largeobject (including a fully qualified one) worthwhile?
>>   It would enable us to specify a largeobject with a human-readable
>>   identifier string.
>
> I don't understand the notion of this. Does this mean you can create a
> LO with an identifier string, generated from (e.g.) your application?

Yes, the intention is to assign an identifier string, not only a numeric large object identifier. The identifier string can be qualified with a certain namespace, as follows:

E.g.)
  SELECT lo_open('my_picture01', x'40000'::int);
  SELECT lo_create('pg_temp.my_musid02');

In the latter case, the new largeobject will be reclaimed after the session is closed, because it lives in the temporary namespace.

Thanks,
--
OSS Platform Development Division, NEC
KaiGai Kohei <kaigai@ak.jp.nec.com>
--On 29. Juni 2009 08:32:29 +0900 KaiGai Kohei <kaigai@ak.jp.nec.com> wrote:

> Yes, the intention is to assign an identifier string, not only a numeric
> large object identifier. The identifier string can be qualified with a
> certain namespace, as follows:
>
> E.g.)
>   SELECT lo_open('my_picture01', x'40000'::int);
>   SELECT lo_create('pg_temp.my_musid02');
>
> In the latter case, the new largeobject will be reclaimed after the
> session is closed, because it lives in the temporary namespace.

I'm not sure about the usefulness of this. While having an identifier for a LO is nice, I believe most users store additional metadata about objects within their own tables anyway, linking the LO there. Also, I doubt there is much need for temporary large objects (at least, I have no idea what they would be used for).

It might be interesting to dig into your proposal deeper in conjunction with TOAST (you've already mentioned this TODO). Having serial access with a nice interface into TOAST would eliminate the need for pg_largeobject completely (I'm not a big fan of the one-big-system-table approach the old LO interface currently uses).

--
Thanks

Bernd
Bernd Helmle <mailings@oopsware.de> writes:
> It might be interesting to dig into your proposal deeper in conjunction
> with TOAST (you've already mentioned this TODO). Having serial access
> with a nice interface into TOAST would eliminate the need for
> pg_largeobject completely (I'm not a big fan of the one-big-system-table
> approach the old LO interface currently uses).

Yeah, it would probably be more useful to fix that than to add decoration to the LO facility. Making LO more usable is just going to encourage people to bump into its other limitations (32-bit OIDs, 32-bit object size, finite maximum size of pg_largeobject, lack of dead-object cleanup, etc etc).

			regards, tom lane
Tom Lane wrote:
> Bernd Helmle <mailings@oopsware.de> writes:
>> It might be interesting to dig into your proposal deeper in conjunction
>> with TOAST (you've already mentioned this TODO). Having serial access
>> with a nice interface into TOAST would eliminate the need for
>> pg_largeobject completely (I'm not a big fan of the one-big-system-table
>> approach the old LO interface currently uses).
>
> Yeah, it would probably be more useful to fix that than to add
> decoration to the LO facility. Making LO more usable is just going to
> encourage people to bump into its other limitations (32-bit OIDs,
> 32-bit object size, finite maximum size of pg_largeobject, lack of
> dead-object cleanup, etc etc).

The reason why I tried to mention the named largeobject feature is that DAC security checks on largeobjects require them to belong to a certain schema, so I thought it would be quite natural to give them a string name. However, obviously, it is not a significant theme for me.

I also agree with your opinion that the largeobject interfaces should be redefined to access parts of TOAST'ed varlena data structures, not only pg_largeobject.

In this case, we will need a new pg_type.typstorage option which forces the given varlena data to be stored in an external relation without compression, because we cannot compute the data offset within inlined or compressed external varlena data.

I'll try to submit a design within a few days.

Thanks,
--
OSS Platform Development Division, NEC
KaiGai Kohei <kaigai@ak.jp.nec.com>
I concluded that the following issues should be solved if we apply largeobject-like interfaces to big toasted data within general relations, not only the pg_largeobject system catalog.

First, we need to add a new strategy to store a given varlena datum in the external TOAST relation. If we want to seek to and fetch a certain data chunk, it must be computable which chunk stores the data specified by offset and length, so the external chunks must always be uncompressed. This is a common requirement for both read and write operations.

If we want to update a part of the toasted data chunks, the datum must not be stored inline, regardless of its length, because in that case we would need to update the whole tuple that contains the inlined data. If we open the toasted varlena in read-only mode, an inlined datum does not prevent anything; this is an issue only for write operations.

I would like to add a new strategy to pg_type.typstorage with the following characteristics:
1. It always stores the given varlena data without any compression, so the data is stored as a set of fixed-length chunks.
2. It always stores the given varlena data in the external TOAST relation.

I suggest a new built-in type named BLOB which has a definition identical to the BYTEA type, except for its attstorage.

Next, a different version of lo_open() should be provided to accept the BLOB type, as follows:

  SELECT pictname, lo_open(pictdata, x'20000'::int) FROM my_picture;

It will allocate a largeobject descriptor for the given BLOB datum, and the user can read and write it using the loread() and lowrite() interfaces.

Issue: in this case, should it hold the relation handle and locks on the "my_picture" relation, not only its TOAST relation?

Issue: should lo_open() in read-only mode be available on the existing TEXT or BYTEA types? I could not find any reason to deny it.

Next, the pg_largeobject system catalog can be redefined using the BLOB type as follows:

  CATALOG(pg_largeobject,2613)
  {
      Oid       loowner;    /* OID of the largeobject owner */
      Oid       lonsp;      /* OID of the largeobject namespace */
      aclitem   loacl[1];   /* access permissions */
      blob      lodata;     /* contents of the largeobject */
  } FormData_pg_largeobject;

The existing largeobject interfaces operate on the pg_largeobject.lodata identified by the largeobject identifier. The rest of the metadata can be used for access-control purposes.

Thanks,

KaiGai Kohei wrote:
> Tom Lane wrote:
>> Bernd Helmle <mailings@oopsware.de> writes:
>>> It might be interesting to dig into your proposal deeper in conjunction with TOAST (you've already mentioned this TODO). Having serial access with a nice interface into TOAST would eliminate the need for pg_largeobject completely (I'm not a big fan of the one-big-system-table approach the old LO interface currently uses).
>>
>> Yeah, it would probably be more useful to fix that than to add decoration to the LO facility. Making LO more usable is just going to encourage people to bump into its other limitations (32-bit OIDs, 32-bit object size, finite maximum size of pg_largeobject, lack of dead-object cleanup, etc etc).
>
> The reason why I tried to mention the named largeobject feature is that DAC security checks on largeobjects require them to belong to a certain schema, so I thought it would be quite natural to give them a string name. However, obviously, it is not a significant theme for me.
>
> I also agree with your opinion that the largeobject interfaces should be redefined to access parts of TOAST'ed varlena data structures, not only pg_largeobject.
>
> In this case, we will need a new pg_type.typstorage option which forces the given varlena data to be stored in an external relation without compression, because we cannot compute the data offset within inlined or compressed external varlena data.
>
> I'll try to submit a design within a few days.
> Thanks,

--
OSS Platform Development Division, NEC
KaiGai Kohei <kaigai@ak.jp.nec.com>
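[Editor's note: to make the proposed interface concrete, a minimal sketch of how a user table might use it. The "blob" type and lo_open()/loread() on a table column are part of the proposal above, not existing PostgreSQL features; table contents are placeholders.]

  -- Sketch only: the "blob" type and lo_open()/loread() on a table column are
  -- the interfaces proposed above, not existing PostgreSQL features.
  CREATE TABLE my_picture (
      pictname    text PRIMARY KEY,
      pictdata    blob            -- proposed type: always external, never compressed
  );

  -- Open a descriptor on the toasted value and read its first 4 KB;
  -- x'40000' is INV_READ, as in the existing largeobject interface.
  SELECT pictname,
         loread(lo_open(pictdata, x'40000'::int), 4096)
    FROM my_picture
   WHERE pictname = 'sunset01';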
I found one more issue with applying largeobject-style interfaces to generic toasted varlena data.

When we fetch a toasted datum, we scan the pg_toast_%u relation with SnapshotToast, because it is assumed that toasted chunks never have multiple versions, and that visibility of the TOAST pointer always implies visibility of the TOAST chunks.

However, if we provide largeobject-style interfaces which allow partial updates of a toasted varlena, it seems to me this assumption no longer holds.

Is there any good idea?

KaiGai Kohei wrote:
> I concluded that the following issues should be solved if we apply largeobject-like interfaces to big toasted data within general relations, not only the pg_largeobject system catalog.
>
> First, we need to add a new strategy to store a given varlena datum in the external TOAST relation. If we want to seek to and fetch a certain data chunk, it must be computable which chunk stores the data specified by offset and length, so the external chunks must always be uncompressed. This is a common requirement for both read and write operations.
> If we want to update a part of the toasted data chunks, the datum must not be stored inline, regardless of its length, because in that case we would need to update the whole tuple that contains the inlined data. If we open the toasted varlena in read-only mode, an inlined datum does not prevent anything; this is an issue only for write operations.
>
> I would like to add a new strategy to pg_type.typstorage with the following characteristics:
> 1. It always stores the given varlena data without any compression, so the data is stored as a set of fixed-length chunks.
> 2. It always stores the given varlena data in the external TOAST relation.
>
> I suggest a new built-in type named BLOB which has a definition identical to the BYTEA type, except for its attstorage.
>
> Next, a different version of lo_open() should be provided to accept the BLOB type, as follows:
>
>   SELECT pictname, lo_open(pictdata, x'20000'::int) FROM my_picture;
>
> It will allocate a largeobject descriptor for the given BLOB datum, and the user can read and write it using the loread() and lowrite() interfaces.
>
> Issue: in this case, should it hold the relation handle and locks on the "my_picture" relation, not only its TOAST relation?
> Issue: should lo_open() in read-only mode be available on the existing TEXT or BYTEA types? I could not find any reason to deny it.
>
> Next, the pg_largeobject system catalog can be redefined using the BLOB type as follows:
>
>   CATALOG(pg_largeobject,2613)
>   {
>       Oid       loowner;    /* OID of the largeobject owner */
>       Oid       lonsp;      /* OID of the largeobject namespace */
>       aclitem   loacl[1];   /* access permissions */
>       blob      lodata;     /* contents of the largeobject */
>   } FormData_pg_largeobject;
>
> The existing largeobject interfaces operate on the pg_largeobject.lodata identified by the largeobject identifier. The rest of the metadata can be used for access-control purposes.
>
> Thanks,
>
> KaiGai Kohei wrote:
>> Tom Lane wrote:
>>> Bernd Helmle <mailings@oopsware.de> writes:
>>>> It might be interesting to dig into your proposal deeper in conjunction with TOAST (you've already mentioned this TODO). Having serial access with a nice interface into TOAST would eliminate the need for pg_largeobject completely (I'm not a big fan of the one-big-system-table approach the old LO interface currently uses).
>>> Yeah, it would probably be more useful to fix that than to add decoration to the LO facility. Making LO more usable is just going to encourage people to bump into its other limitations (32-bit OIDs, 32-bit object size, finite maximum size of pg_largeobject, lack of dead-object cleanup, etc etc).
>> The reason why I tried to mention the named largeobject feature is that DAC security checks on largeobjects require them to belong to a certain schema, so I thought it would be quite natural to give them a string name. However, obviously, it is not a significant theme for me.
>>
>> I also agree with your opinion that the largeobject interfaces should be redefined to access parts of TOAST'ed varlena data structures, not only pg_largeobject.
>>
>> In this case, we will need a new pg_type.typstorage option which forces the given varlena data to be stored in an external relation without compression, because we cannot compute the data offset within inlined or compressed external varlena data.
>>
>> I'll try to submit a design within a few days.
>> Thanks,

--
OSS Platform Development Division, NEC
KaiGai Kohei <kaigai@ak.jp.nec.com>
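[Editor's note: a hypothetical two-session scenario illustrating the visibility problem described above. The lo_open()/loread()/lowrite() variants on a toasted column are the interfaces proposed earlier in the thread, not existing ones; my_picture is the placeholder table from the earlier sketch.]

  -- Hypothetical scenario; the lo_open()/loread()/lowrite() variants on a
  -- toasted column are the proposed interfaces, not existing ones.

  -- Session A: take a serializable snapshot and read the first 10 bytes.
  BEGIN ISOLATION LEVEL SERIALIZABLE;
  SELECT loread(lo_open(pictdata, x'40000'::int), 10)
    FROM my_picture WHERE pictname = 'sunset01';

  -- Session B: partially overwrite the same toasted value and commit.
  -- Under today's TOAST rules chunks are never updated in place, only
  -- rewritten wholesale under a new chunk_id.
  SELECT lowrite(lo_open(pictdata, x'20000'::int), 'xxxxxxxxxx')
    FROM my_picture WHERE pictname = 'sunset01';

  -- Session A again: if the chunks were scanned with SnapshotToast, session
  -- B's in-place update would already be visible here, violating A's MVCC
  -- snapshot.  An MVCC-aware snapshot would have to be used instead.
  SELECT loread(lo_open(pictdata, x'40000'::int), 10)
    FROM my_picture WHERE pictname = 'sunset01';
  COMMIT;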
Joshua,

I found your name as a reviewer at commitfest.postgresql.org.

However, I don't think the initial proposal for largeobject security is currently in a state to be reviewed seriously.

My preference is that it should be reworked based on the TOASTing approach discussed earlier in this thread, if we can find a reasonable solution for the issue I raised before.

KaiGai Kohei wrote:
> I found one more issue with applying largeobject-style interfaces to generic toasted varlena data.
>
> When we fetch a toasted datum, we scan the pg_toast_%u relation with SnapshotToast, because it is assumed that toasted chunks never have multiple versions, and that visibility of the TOAST pointer always implies visibility of the TOAST chunks.
>
> However, if we provide largeobject-style interfaces which allow partial updates of a toasted varlena, it seems to me this assumption no longer holds.
>
> Is there any good idea?

The largeobject interface uses SnapshotNow for writable accesses and GetActiveSnapshot() for read-only accesses, but toast_fetch_datum() uses SnapshotToast to scan the TOAST relation.

It seems to me that SnapshotToast depends on the assumption that tuples within a TOAST relation never have multiple versions. When we update a TOAST value, the TOAST mechanism inserts the whole variable-length datum under a new chunk_id, and the older chunks are removed in toast_delete_datum(). The TOAST pointer is updated to the new chunk_id, and its visibility is under MVCC control.

The source code comment at HeapTupleSatisfiesToast() says:

  /*
   * HeapTupleSatisfiesToast
   *      True iff heap tuple is valid as a TOAST row.
   *
   * This is a simplified version that only checks for VACUUM moving conditions.
   * It's appropriate for TOAST usage because TOAST really doesn't want to do
   * its own time qual checks; if you can see the main table row that contains
   * a TOAST reference, you should be able to see the TOASTed value.  However,
   * vacuuming a TOAST table is independent of the main table, and in case such
   * a vacuum fails partway through, we'd better do this much checking.
   *
   * Among other things, this means you can't do UPDATEs of rows in a TOAST
   * table.
   */

If a largeobject-like interface is available to update a part of a TOAST value, we cannot keep this assumption.

I would like to settle this issue before reworking the patch. Is there any good idea? If we follow the current largeobject manner, what problems would we run into?

Thanks,

> KaiGai Kohei wrote:
>> I concluded that the following issues should be solved if we apply largeobject-like interfaces to big toasted data within general relations, not only the pg_largeobject system catalog.
>>
>> First, we need to add a new strategy to store a given varlena datum in the external TOAST relation. If we want to seek to and fetch a certain data chunk, it must be computable which chunk stores the data specified by offset and length, so the external chunks must always be uncompressed. This is a common requirement for both read and write operations.
>> If we want to update a part of the toasted data chunks, the datum must not be stored inline, regardless of its length, because in that case we would need to update the whole tuple that contains the inlined data. If we open the toasted varlena in read-only mode, an inlined datum does not prevent anything; this is an issue only for write operations.
>>
>> I would like to add a new strategy to pg_type.typstorage with the following characteristics:
>> 1. It always stores the given varlena data without any compression, so the data is stored as a set of fixed-length chunks.
>> 2. It always stores the given varlena data in the external TOAST relation.
>>
>> I suggest a new built-in type named BLOB which has a definition identical to the BYTEA type, except for its attstorage.
>>
>> Next, a different version of lo_open() should be provided to accept the BLOB type, as follows:
>>
>>   SELECT pictname, lo_open(pictdata, x'20000'::int) FROM my_picture;
>>
>> It will allocate a largeobject descriptor for the given BLOB datum, and the user can read and write it using the loread() and lowrite() interfaces.
>>
>> Issue: in this case, should it hold the relation handle and locks on the "my_picture" relation, not only its TOAST relation?
>> Issue: should lo_open() in read-only mode be available on the existing TEXT or BYTEA types? I could not find any reason to deny it.
>>
>> Next, the pg_largeobject system catalog can be redefined using the BLOB type as follows:
>>
>>   CATALOG(pg_largeobject,2613)
>>   {
>>       Oid       loowner;    /* OID of the largeobject owner */
>>       Oid       lonsp;      /* OID of the largeobject namespace */
>>       aclitem   loacl[1];   /* access permissions */
>>       blob      lodata;     /* contents of the largeobject */
>>   } FormData_pg_largeobject;
>>
>> The existing largeobject interfaces operate on the pg_largeobject.lodata identified by the largeobject identifier. The rest of the metadata can be used for access-control purposes.
>>
>> Thanks,
>>
>> KaiGai Kohei wrote:
>>> Tom Lane wrote:
>>>> Bernd Helmle <mailings@oopsware.de> writes:
>>>>> It might be interesting to dig into your proposal deeper in conjunction with TOAST (you've already mentioned this TODO). Having serial access with a nice interface into TOAST would eliminate the need for pg_largeobject completely (I'm not a big fan of the one-big-system-table approach the old LO interface currently uses).
>>>> Yeah, it would probably be more useful to fix that than to add decoration to the LO facility. Making LO more usable is just going to encourage people to bump into its other limitations (32-bit OIDs, 32-bit object size, finite maximum size of pg_largeobject, lack of dead-object cleanup, etc etc).
>>> The reason why I tried to mention the named largeobject feature is that DAC security checks on largeobjects require them to belong to a certain schema, so I thought it would be quite natural to give them a string name. However, obviously, it is not a significant theme for me.
>>>
>>> I also agree with your opinion that the largeobject interfaces should be redefined to access parts of TOAST'ed varlena data structures, not only pg_largeobject.
>>>
>>> In this case, we will need a new pg_type.typstorage option which forces the given varlena data to be stored in an external relation without compression, because we cannot compute the data offset within inlined or compressed external varlena data.
>>>
>>> I'll try to submit a design within a few days.
>>> Thanks,

--
OSS Platform Development Division, NEC
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Thu, Jul 16, 2009 at 01:49:14PM +0900, KaiGai Kohei wrote:
> Joshua,
>
> I found your name as a reviewer at commitfest.postgresql.org.
>
> However, I don't think the initial proposal for largeobject security is currently in a state to be reviewed seriously.
>
> My preference is that it should be reworked based on the TOASTing approach discussed earlier in this thread, if we can find a reasonable solution for the issue I raised before.

For whatever it's worth, I consider my capability limited to making sure the patch applies and coming up with a few interesting ways of testing it out, rather than seriously studying the code and knowing when there might be a competitive alternative implementation. I don't yet understand the issues that have been raised, but I will stop reviewing the patch if you feel it's not ready. Thanks for letting me know; I hope a solution to the problems you've brought up is forthcoming.

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com
2009/7/16 KaiGai Kohei <kaigai@ak.jp.nec.com>:
> However, I don't think the initial proposal for largeobject security is currently in a state to be reviewed seriously.

OK, I am moving this patch to returned with feedback.

...Robert
Robert Haas wrote:
> 2009/7/16 KaiGai Kohei <kaigai@ak.jp.nec.com>:
>> However, I don't think the initial proposal for largeobject security is currently in a state to be reviewed seriously.
>
> OK, I am moving this patch to returned with feedback.

If possible, I would like to have a discussion to reach consensus on its design and interfaces before submitting my patch to the next commitfest. I'll submit the design proposal (including a few issues) again.

Is anyone interested?

Thanks,
--
OSS Platform Development Division, NEC
KaiGai Kohei <kaigai@ak.jp.nec.com>
I have summarized the design proposal and the issues we currently have.

I would like to see any comments on the proposal. In particular, the choice of snapshot is a headache for me.

----------------
This project tries to solve two items listed at:
http://wiki.postgresql.org/wiki/Todo#Binary_Data

* Add security checks for large objects
* Allow read/write into TOAST values like large objects

= Introduction =

We need to associate metadata with a certain largeobject to implement security checks for largeobjects. However, the data structure of largeobjects is not suitable for managing metadata (such as owner identifier, database ACLs, ...) for a certain largeobject, because a largeobject is stored as separate page frames in the pg_largeobject system catalog. Thus, we need to revise the data structure used to manage a certain largeobject.

An interesting fact is the similarity of data structure between a TOAST table and pg_largeobject.

A TOAST relation is declared as follows:

  pg_toast_%u (
      chunk_id    oid,
      chunk_seq   int4,
      chunk_data  bytea,
      unique(chunk_id, chunk_seq)
  )

The definition of pg_largeobject is as follows:

  pg_largeobject (
      loid        oid,
      pageno      int4,
      data        bytea,
      unique(loid, pageno)
  )

They have an identical data structure, so it is quite natural to utilize the TOAST mechanism to store the page frames of largeobjects.

= Design =

In my plan, the role of pg_largeobject will be changed to manage the metadata of largeobjects, and it will be redefined as follows:

  CATALOG(pg_largeobject,2613)
  {
      Oid       loowner;    /* OID of the owner */
      Oid       lonsp;      /* OID of the namespace */
      aclitem   loacl[1];   /* ACL of the largeobject */
      Blob      lodata;     /* Contents of the largeobject */
  } FormData_pg_largeobject;

For access-control purposes, its ownership, namespace and ACLs are necessary. In addition, Blob is a new type which supports reading/writing parts of its TOAST value.

The current lo_xxx() interfaces will act as wrapper functions that access a certain pg_largeobject.lodata identified by a largeobject handle. The loread(), lowrite() and similar interfaces will support partial accesses on the Blob type. This enables user-defined relations to contain large data using the TOAST mechanism, with reasonable resource consumption. (Note that TOAST currently replaces the whole set of chunks with the same identifier, even if only a single byte changes.)

= Interfaces =

== New type ==

We need a new variable-length type with the following characteristics, to allow partial access:

* It always uses an external TOAST table, independent of the value's size. If a toasted value is stored inline, we cannot update it independently of the main table. Inline storage does not prevent partial reads, but it is pointless because inlined data is small enough anyway.

* It always stores the data without any compression. We cannot easily compute the required data offset within compressed data; all the toasted data would need to be decompressed, for both read and write access.

== lo_xxx() interfaces ==

A new version of loread() and lowrite() is necessary to access a part of toasted data within user-defined tables. They can be defined as follows:

  loread(Blob data, int32 offset, int32 length)
  lowrite(Blob data, int32 offset, Bytea data)

== GRANT/REVOKE ==

When we access traditional largeobjects, reader permission (SELECT) or writer permission (UPDATE) should be checked on access.

The GRANT/REVOKE statements are enhanced as follows:

  GRANT SELECT ON LARGE OBJECT 1234 TO kaigai;

This allows "kaigai" to read the largeobject 1234.

= Issues =

== New pg_type.typstorage ==

The variable-length data always needs to be stored in external storage and uncompressed. The existing typstorage strategies do not satisfy this requirement, so we need to add a new pg_type.typstorage strategy.

The new typstorage strategy forces:
- The given varlena data is always stored in the external TOAST relation.
- The given varlena data is always stored without any compression.

This costs performance, so the existing Text or Bytea types will remain more suitable for variable-length data that is not very large.

== Snapshot ==

The largeobject interface uses SnapshotNow for writable accesses and GetActiveSnapshot() for read-only accesses, but toast_fetch_datum() uses SnapshotToast to scan the TOAST relation.

It seems to me that SnapshotToast depends on the assumption that tuples within a TOAST relation never have multiple versions. When we update a TOAST value, the TOAST mechanism inserts the whole variable-length datum under a new chunk_id, and the older chunks are removed in toast_delete_datum(). The TOAST pointer is updated to the new chunk_id, and its visibility is under MVCC control.

The source code comment at HeapTupleSatisfiesToast() says:

  /*
   * HeapTupleSatisfiesToast
   *      True iff heap tuple is valid as a TOAST row.
   *
   * This is a simplified version that only checks for VACUUM moving conditions.
   * It's appropriate for TOAST usage because TOAST really doesn't want to do
   * its own time qual checks; if you can see the main table row that contains
   * a TOAST reference, you should be able to see the TOASTed value.  However,
   * vacuuming a TOAST table is independent of the main table, and in case such
   * a vacuum fails partway through, we'd better do this much checking.
   *
   * Among other things, this means you can't do UPDATEs of rows in a TOAST
   * table.
   */

If a largeobject-like interface is available to update a part of a TOAST value, we cannot keep this assumption.

For now, my plan is to apply the result of GetActiveSnapshot() when fetching toasted values. Are there any problems to be expected with that?

--
OSS Platform Development Division, NEC
KaiGai Kohei <kaigai@ak.jp.nec.com>
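[Editor's note: a concrete sketch of the proposed partial-access functions. These loread()/lowrite() signatures are the interfaces proposed above, not existing functions; my_picture is the hypothetical table from the earlier sketch, and the chunk-size remark in the comments is illustrative.]

  -- Sketch only: these loread()/lowrite() signatures are the interfaces
  -- proposed above, not existing functions; my_picture is the hypothetical
  -- table from the earlier sketch, with pictdata of the proposed blob type.

  -- Read 8192 bytes starting at byte offset 1048576 without detoasting the
  -- whole value.  With fixed-size, uncompressed chunks the first chunk to
  -- fetch is simply offset / chunk_size, which is why this storage strategy
  -- forbids compression.
  SELECT loread(pictdata, 1048576, 8192)
    FROM my_picture WHERE pictname = 'sunset01';

  -- Overwrite 7 bytes at offset 0; only the chunks covering that byte range
  -- would need to be rewritten, not the whole value.
  SELECT lowrite(pictdata, 0, 'abcdefg'::bytea)
    FROM my_picture WHERE pictname = 'sunset01';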
You may not remember, but I also proposed a patch to support access controls on largeobjects. It was suggested that largeobjects should store their contents in TOAST relations and provide interfaces to read/write them partially. For example:

> == lo_xxx() interfaces ==
>
> A new version of loread() and lowrite() is necessary to access a part of toasted data within user-defined tables. They can be defined as follows:
>
>   loread(Blob data, int32 offset, int32 length)
>   lowrite(Blob data, int32 offset, Bytea data)

But I found another problem when we try to update TOAST chunks partially. Because the TOAST pointer contains the total size of the series of chunks, not only the chunk_id and toastrelid, we cannot expand the size of TOAST'ed data. As shown below, the TOAST pointer is defined as follows:

  struct varatt_external
  {
      int32    va_rawsize;     /* Original data size (includes header) */
      int32    va_extsize;     /* External saved size (doesn't) */
      Oid      va_valueid;     /* Unique ID of value within TOAST table */
      Oid      va_toastrelid;  /* RelID of TOAST table containing it */
  };

These fields are inlined within the main tuple, and the contents of the TOAST datum are stored separately in the TOAST relation. If we update the TOAST data partially while expanding its total size (e.g. writing 10,000 bytes starting at offset 6,000 of a 12,000-byte varlena), it is also necessary to update the main tuple, because its va_rawsize and va_extsize must be updated as well, although va_valueid and va_toastrelid can stay as they are.

I wonder why va_extsize and va_rawsize need to be included within the TOAST pointer. They are fundamentally computable from the pair of va_valueid and va_toastrelid. (I guess it is for performance reasons. If we only need to check whether the value is compressed or not, a flag variable could provide enough of a hint.)

So we need to think more about how to implement partially writable varlena in user tables. However, as far as the existing largeobject interfaces are concerned, I believe it is possible to implement them using partially writable varlena: the largeobject identifier is known when we use the lo_*() interfaces, so we can also update the main tuple within pg_largeobject if the total length of the varlena changes on a partial write.

Thus, I would like to implement the feature focusing on the existing largeobject interfaces as the first step. Partially writable TOAST in user relations will come next. (This also keeps the patch smaller than an all-in-one patch.)

In summary, I will try to implement the following for the next commitfest:
- GRANT / REVOKE on largeobjects
- DAC access controls on largeobjects
- Storing the contents of largeobjects in a TOAST relation (including hole support)
- Largeobject interfaces to write them partially

Thanks,

KaiGai Kohei wrote:
> I have summarized the design proposal and the issues we currently have.
>
> I would like to see any comments on the proposal. In particular, the choice of snapshot is a headache for me.
>
> ----------------
> This project tries to solve two items listed at:
> http://wiki.postgresql.org/wiki/Todo#Binary_Data
>
> * Add security checks for large objects
> * Allow read/write into TOAST values like large objects
>
> = Introduction =
>
> We need to associate metadata with a certain largeobject to implement security checks for largeobjects. However, the data structure of largeobjects is not suitable for managing metadata (such as owner identifier, database ACLs, ...) for a certain largeobject, because a largeobject is stored as separate page frames in the pg_largeobject system catalog. Thus, we need to revise the data structure used to manage a certain largeobject.
>
> An interesting fact is the similarity of data structure between a TOAST table and pg_largeobject.
> A TOAST relation is declared as follows:
>
>   pg_toast_%u (
>       chunk_id    oid,
>       chunk_seq   int4,
>       chunk_data  bytea,
>       unique(chunk_id, chunk_seq)
>   )
>
> The definition of pg_largeobject is as follows:
>
>   pg_largeobject (
>       loid        oid,
>       pageno      int4,
>       data        bytea,
>       unique(loid, pageno)
>   )
>
> They have an identical data structure, so it is quite natural to utilize the TOAST mechanism to store the page frames of largeobjects.
>
> = Design =
>
> In my plan, the role of pg_largeobject will be changed to manage the metadata of largeobjects, and it will be redefined as follows:
>
>   CATALOG(pg_largeobject,2613)
>   {
>       Oid       loowner;    /* OID of the owner */
>       Oid       lonsp;      /* OID of the namespace */
>       aclitem   loacl[1];   /* ACL of the largeobject */
>       Blob      lodata;     /* Contents of the largeobject */
>   } FormData_pg_largeobject;
>
> For access-control purposes, its ownership, namespace and ACLs are necessary. In addition, Blob is a new type which supports reading/writing parts of its TOAST value.
>
> The current lo_xxx() interfaces will act as wrapper functions that access a certain pg_largeobject.lodata identified by a largeobject handle. The loread(), lowrite() and similar interfaces will support partial accesses on the Blob type. This enables user-defined relations to contain large data using the TOAST mechanism, with reasonable resource consumption. (Note that TOAST currently replaces the whole set of chunks with the same identifier, even if only a single byte changes.)
>
> = Interfaces =
>
> == New type ==
>
> We need a new variable-length type with the following characteristics, to allow partial access:
>
> * It always uses an external TOAST table, independent of the value's size. If a toasted value is stored inline, we cannot update it independently of the main table. Inline storage does not prevent partial reads, but it is pointless because inlined data is small enough anyway.
>
> * It always stores the data without any compression. We cannot easily compute the required data offset within compressed data; all the toasted data would need to be decompressed, for both read and write access.
>
> == lo_xxx() interfaces ==
>
> A new version of loread() and lowrite() is necessary to access a part of toasted data within user-defined tables. They can be defined as follows:
>
>   loread(Blob data, int32 offset, int32 length)
>   lowrite(Blob data, int32 offset, Bytea data)
>
> == GRANT/REVOKE ==
>
> When we access traditional largeobjects, reader permission (SELECT) or writer permission (UPDATE) should be checked on access.
>
> The GRANT/REVOKE statements are enhanced as follows:
>
>   GRANT SELECT ON LARGE OBJECT 1234 TO kaigai;
>
> This allows "kaigai" to read the largeobject 1234.
>
> = Issues =
>
> == New pg_type.typstorage ==
>
> The variable-length data always needs to be stored in external storage and uncompressed. The existing typstorage strategies do not satisfy this requirement, so we need to add a new pg_type.typstorage strategy.
>
> The new typstorage strategy forces:
> - The given varlena data is always stored in the external TOAST relation.
> - The given varlena data is always stored without any compression.
>
> This costs performance, so the existing Text or Bytea types will remain more suitable for variable-length data that is not very large.
>
> == Snapshot ==
>
> The largeobject interface uses SnapshotNow for writable accesses and GetActiveSnapshot() for read-only accesses, but toast_fetch_datum() uses SnapshotToast to scan the TOAST relation.
>
> It seems to me that SnapshotToast depends on the assumption that tuples within a TOAST relation never have multiple versions. When we update a TOAST value, the TOAST mechanism inserts the whole variable-length datum under a new chunk_id, and the older chunks are removed in toast_delete_datum(). The TOAST pointer is updated to the new chunk_id, and its visibility is under MVCC control.
>
> The source code comment at HeapTupleSatisfiesToast() says:
>
>   /*
>    * HeapTupleSatisfiesToast
>    *      True iff heap tuple is valid as a TOAST row.
>    *
>    * This is a simplified version that only checks for VACUUM moving conditions.
>    * It's appropriate for TOAST usage because TOAST really doesn't want to do
>    * its own time qual checks; if you can see the main table row that contains
>    * a TOAST reference, you should be able to see the TOASTed value.  However,
>    * vacuuming a TOAST table is independent of the main table, and in case such
>    * a vacuum fails partway through, we'd better do this much checking.
>    *
>    * Among other things, this means you can't do UPDATEs of rows in a TOAST
>    * table.
>    */
>
> If a largeobject-like interface is available to update a part of a TOAST value, we cannot keep this assumption.
>
> For now, my plan is to apply the result of GetActiveSnapshot() when fetching toasted values. Are there any problems to be expected with that?

--
OSS Platform Development Division, NEC
KaiGai Kohei <kaigai@ak.jp.nec.com>
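[Editor's note: for concreteness, a worked version of the expansion arithmetic described above. The 2000-byte chunk size is purely illustrative (the real TOAST chunk size depends on the build); the point is only that a partial write past the current end changes the total length recorded in the main tuple's TOAST pointer.]

  -- Worked numbers from the example above: writing 10,000 bytes at offset
  -- 6,000 of a 12,000-byte value.  The 2000-byte chunk size is illustrative.
  SELECT GREATEST(12000, 6000 + 10000)       AS new_raw_length,   -- 16000: the value grows, so
                                                                  -- va_rawsize/va_extsize must change
         6000 / 2000                         AS first_chunk_seq,  -- 3: first chunk touched by the write
         (6000 + 10000 - 1) / 2000           AS last_chunk_seq;   -- 7: last chunk touched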