Thread: Extending varlena
Folks,

As the things stored in databases grow, we're going to start needing
to think about database objects that 4 bytes of size can't describe.
People are already storing video in lo and bytea fields.  To date, the
sizes of media files have never trended downward.

What would need to happen for the next jump up from where varlena is
now, to 8 bytes?  Would we want to use the bit-stuffing model that the
current varvarlena uses?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
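For readers who haven't looked at it: the bit-stuffing David refers to
packs length and flag bits into the first byte(s) of each datum.  Below
is a rough paraphrase of the 8.3-era header layout, little-endian case
only -- an illustrative sketch with simplified names, not a verbatim
copy of postgres.h; consult the real macros before relying on details.

/*
 * Rough paraphrase of the 8.3-era "varvarlena" header, little-endian
 * case only; simplified names, NOT a verbatim copy of postgres.h.
 *
 * First byte of a varlena datum:
 *   xxxxxx00  4-byte header, uncompressed; length in the upper 30 bits
 *   xxxxxx10  4-byte header, compressed inline
 *   xxxxxxx1  1-byte header; length (up to 127) in the upper 7 bits
 *   00000001  1-byte header with zero length: TOAST pointer follows
 */
#include <stdint.h>

#define MY_VARATT_IS_1B(hdr)  (((hdr) & 0x01) == 0x01)
#define MY_VARATT_IS_4B(hdr)  (((hdr) & 0x01) == 0x00)

/* Two flag bits leave 30 bits of length: hence the familiar 1GB
 * per-field ceiling.  An 8-byte variant would pay the same flag-bit
 * tax out of its 64 bits. */
static inline uint32_t
my_varsize_4b(uint32_t header)
{
    return (header >> 2) & 0x3FFFFFFF;
}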
David Fetter <david@fetter.org> writes:
> What would need to happen for the next jump up from where varlena is
> now, to 8 bytes?

Dealing with upwards-of-4GB blobs as single Datums isn't remotely sane,
and won't become so in the near (or even medium) future.  So I don't
see the point of doing all the work that would be involved in making
this go.

What would make more sense is to redesign the large-object stuff to be
somewhat modern and featureful, and provide stream-access APIs (think
lo_read, lo_seek, etc) that allow offsets wider than 32 bits.  The main
things I think we'd need to consider besides just the access API are

- permissions features (more than "none" anyway)
- better management of orphaned objects (obsoleting vacuumlo)
- support > 16TB of large objects (maybe partition pg_largeobject?)
- dump and restore probably need improvement to be practical for such
  large data volumes

			regards, tom lane
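To make the shape of such an API concrete, here is a minimal sketch of
what 64-bit-offset variants of the libpq large-object calls might look
like.  The 64-suffixed names and the pg_int64 typedef are hypothetical
proposals; only the 32-bit signatures shown in the comment exist today.

/*
 * Hypothetical 64-bit-offset variants of the libpq large-object calls;
 * a proposal sketch, not an existing API.  For comparison, the real
 * 32-bit signatures are:
 *     int lo_lseek(PGconn *conn, int fd, int offset, int whence);
 *     int lo_tell(PGconn *conn, int fd);
 */
#include <libpq-fe.h>

typedef long long pg_int64;    /* assumed 64-bit offset type */

pg_int64 lo_lseek64(PGconn *conn, int fd, pg_int64 offset, int whence);
pg_int64 lo_tell64(PGconn *conn, int fd);
int      lo_truncate64(PGconn *conn, int fd, pg_int64 len);

/* usage sketch: seek 5GB into an object, then lo_read a chunk as
 * today; lo_read/lo_write themselves need no change, only the
 * seek/tell side must learn offsets past 2^31-1. */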
On Mon, Aug 18, 2008 at 04:22:56PM -0400, Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
> > What would need to happen for the next jump up from where varlena
> > is now, to 8 bytes?
>
> Dealing with upwards-of-4GB blobs as single Datums isn't remotely
> sane, and won't become so in the near (or even medium) future.  So I
> don't see the point of doing all the work that would be involved in
> making this go.

OK

> What would make more sense is to redesign the large-object stuff to
> be somewhat modern and featureful, and provide stream-access APIs
> (think lo_read, lo_seek, etc) that allow offsets wider than 32 bits.

Great!

> The main things I think we'd need to consider besides just the
> access API are
>
> - permissions features (more than "none" anyway)

Would ROLEs work, or are you thinking of the per-row and per-column
access controls people sometimes want?

> - better management of orphaned objects (obsoleting vacuumlo)
> - support > 16TB of large objects (maybe partition pg_largeobject?)
> - dump and restore probably need improvement to be practical for such
>   large data volumes

That, and the usual upgrade-in-place :)

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
David Fetter wrote:
> Folks,
>
> As the things stored in databases grow, we're going to start needing
> to think about database objects that 4 bytes of size can't describe.
> People are already storing video in lo and bytea fields.  To date,
> the sizes of media files have never trended downward.

I always find these requests puzzling.  Is it really useful to store
the data for a jpeg, video file or a 10GB tar ball in a database
column?  Does anyone actually search for byte sequences within those
data streams (maybe if it were text)?  I would think that the metadata
is what gets searched: title, track, name, file times, size, etc...

Database storage is normally pricey, stocked with 15K drives, so
wasting that expensive storage with non-searchable binary blobs doesn't
make much sense.  Why not offload the data to a file system with 7200
RPM SATA drives and store a reference to it in the db?  Keep the db
more compact and simpler to manage.

Andrew Chernow
eSilo, LLC
On Mon, Aug 18, 2008 at 06:09:13PM -0400, Andrew Chernow wrote:
> David Fetter wrote:
>> As the things stored in databases grow, we're going to start
>> needing to think about database objects that 4 bytes of size can't
>> describe.  People are already storing video in lo and bytea fields.
>> To date, the sizes of media files have never trended downward.
>
> I always find these requests puzzling.  Is it really useful to store
> the data for a jpeg, video file or a 10GB tar ball in a database
> column?

It is if you need transaction semantics.  Think medical records, etc.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
David Fetter wrote:
> On Mon, Aug 18, 2008 at 06:09:13PM -0400, Andrew Chernow wrote:
>> I always find these requests puzzling.  Is it really useful to store
>> the data for a jpeg, video file or a 10GB tar ball in a database
>> column?
>
> It is if you need transaction semantics.  Think medical records, etc.

I see that, although developing the middleware between db and fs is
rather trivial.  I think that is the puzzling part.  It just feels
awkward to "me" to just stuff it in the db.  You can do more by
distributing.

Anyways (back on topic), I am in favor of removing limits from any
section of the database ... not just your suggestion.  The end-user
application should impose limits.

Andrew
Andrew,

> I always find these requests puzzling.  Is it really useful to store
> the data for a jpeg, video file or a 10GB tar ball in a database
> column?

Some people find it useful.  Because LOs are actually easier to manage
in PG than in most other DBMSes, right now that's a significant source
of PostgreSQL adoption.  I'd like to encourage those users by giving
them more useful LO features.

-- 
--Josh

Josh Berkus
PostgreSQL
San Francisco
> I always find these requests puzzling.  Is it really useful to store
> the data for a jpeg, video file or a 10GB tar ball in a database
> column?

One use case is that it can use the existing postgresql protocol, and
does not require extra filesystem mounts, extra error handling, and
other complexity.

Regards,
	Jeff Davis
Andrew Chernow <ac@esilo.com> writes:
> Anyways (back on topic), I am in favor of removing limits from any
> section of the database ... not just your suggestion.  The end-user
> application should impose limits.

That's nice as an abstract principle, but there are only so many hours
in the day, so we need to prioritize which limits we're going to get
rid of.  The 4-byte limit on individual Datum sizes does not strike me
as a limit that's going to be significant for practical use any time
soon.  (I grant David's premise that people will soon want to work with
objects that are larger than that --- but not that they'll want to push
them around as indivisible, store-and-fetch-as-a-unit field values.)

			regards, tom lane
Tom Lane wrote:
> Andrew Chernow <ac@esilo.com> writes:
>> Anyways (back on topic), I am in favor of removing limits from any
>> section of the database ... not just your suggestion.  The end-user
>> application should impose limits.
>
> That's nice as an abstract principle, but there are only so many hours
> in the day, so we need to prioritize which limits we're going to get
> rid of.  The 4-byte limit on individual Datum sizes does not strike me
> as a limit that's going to be significant for practical use any time
> soon.  (I grant David's premise that people will soon want to work
> with objects that are larger than that --- but not that they'll want
> to push them around as indivisible, store-and-fetch-as-a-unit field
> values.)

Yeah, my comments were overly general.  I wasn't suggesting attention
be put on one limit over another.  I was only saying that the act of
removing a limit (of which many are arbitrary) is most often a good
one.

andrew
Jeff Davis wrote:
>> I always find these requests puzzling.  Is it really useful to store
>> the data for a jpeg, video file or a 10GB tar ball in a database
>> column?
>
> One use case is that it can use the existing postgresql protocol,

So can what I am suggesting.  How about a user-defined C function in
the backend that talks to the fs and uses SPI to sync info with a
record?  Now the operation is behind a transaction.  Yet, one must
handle fs orphans from evil crash cases.

Just one solution, but other more creative cats may have better ideas.
The point is, it can be done without too much effort.  A little TLC :)

Andrew
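A minimal sketch of the kind of backend function Andrew describes,
written against the 8.x-era SPI interface.  The blob_meta table and the
store_blob name are hypothetical; the SQL should be properly quoted or
parameterized in a real version, and the crash-orphan case he mentions
(file written, transaction aborted) is left unsolved.

/*
 * Sketch: write a blob to the filesystem and record its metadata via
 * SPI in the same transaction.  The metadata insert rolls back with
 * the transaction; the file itself does not -- that is the orphan
 * problem noted above.
 */
#include "postgres.h"
#include "fmgr.h"
#include "executor/spi.h"

#include <stdio.h>

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(store_blob);

Datum
store_blob(PG_FUNCTION_ARGS)
{
    text   *path_t = PG_GETARG_TEXT_P(0);
    bytea  *data   = PG_GETARG_BYTEA_P(1);
    int     plen   = VARSIZE(path_t) - VARHDRSZ;
    int     dlen   = VARSIZE(data) - VARHDRSZ;
    char   *path   = palloc(plen + 1);
    char    sql[1024];
    FILE   *fp;

    memcpy(path, VARDATA(path_t), plen);
    path[plen] = '\0';

    /* write the payload outside the database */
    fp = fopen(path, "wb");
    if (fp == NULL)
        ereport(ERROR, (errmsg("could not open \"%s\"", path)));
    fwrite(VARDATA(data), 1, dlen, fp);
    fclose(fp);

    /* record the metadata inside the current transaction */
    if (SPI_connect() != SPI_OK_CONNECT)
        ereport(ERROR, (errmsg("SPI_connect failed")));
    snprintf(sql, sizeof(sql),
             "INSERT INTO blob_meta (path, nbytes) VALUES ('%s', %d)",
             path, dlen);
    SPI_execute(sql, false, 0);
    SPI_finish();

    PG_RETURN_VOID();
}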
David Fetter <david@fetter.org> writes:
> On Mon, Aug 18, 2008 at 04:22:56PM -0400, Tom Lane wrote:
>> The main things I think we'd need to consider besides just the
>> access API are
>>
>> - permissions features (more than "none" anyway)

> Would ROLEs work, or are you thinking of the per-row and per-column
> access controls people sometimes want?

Well, obviously roles are the entities that receive permissions, but
on what do we base granting permissions to LOs?

With the current model that a LO is an independent entity that is
merely referenced (or not) by OIDs in the database, it seems like we'd
have to grant/revoke permissions to individual LOs, identified by OID;
which sure seems messy to me.  People don't really want to name their
LOs by OID anyway --- it's just a convention that's forced on them by
the current implementation.

I was kinda wondering about something closer to the TOAST model, where
a blob is only referenceable from a value that's in a table field;
and that value encapsulates the "name" of the blob in some way that
needn't even be user-visible.  This'd greatly simplify the
cleanup-dead-objects problem, and we could avoid addressing the
permissions problem at all, since regular SQL permissions on the table
would serve fine.  But it's not clear what regular SQL fetch and update
behaviors should be like for such a thing.  (Fetching or storing the
whole blob value is right out, IMHO.)  ISTR hearing of concepts roughly
like this in other DBs --- does it ring a bell for anyone?

			regards, tom lane
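One way to picture the model Tom sketches: the column value is a small
fixed-size handle, loosely like a TOAST pointer, and all real access
goes through stream calls keyed off it.  The struct and function names
below are entirely hypothetical -- a picture of the proposal, not
anything that exists.

/*
 * Hypothetical handle for a blob that is only referenceable from a
 * table field, loosely modeled on a TOAST pointer.
 */
#include <stdint.h>

typedef struct BlobReference
{
    uint32_t store_relid;   /* which blob store holds the data */
    uint64_t blob_id;       /* store-private name, never user-visible */
    uint64_t blob_size;     /* total size; wider than 32 bits by design */
} BlobReference;

/*
 * Access would go through stream calls taking the field value rather
 * than through whole-datum fetch/store (which is right out, per above):
 *
 *     fd = blob_open(ref, mode);          -- hypothetical
 *     n  = blob_read(fd, buf, len);       -- hypothetical
 *     blob_seek(fd, off64, whence);       -- hypothetical
 *
 * Permissions then piggyback on the table holding the reference, and
 * dead blobs can be reaped once no field value points at them.
 */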
David Fetter <david@fetter.org> writes:
> On Mon, Aug 18, 2008 at 06:09:13PM -0400, Andrew Chernow wrote:
>> I always find these requests puzzling.  Is it really useful to store
>> the data for a jpeg, video file or a 10GB tar ball in a database
>> column?

> It is if you need transaction semantics.  Think medical records, etc.

The basic problem with outside-the-DB data storage is keeping it in
sync with your inside-the-DB metadata.  In a slowly changing dataset
you can probably get away with external storage, but if there's lots
of updates then allowing the DB to manage the storage definitely makes
life easier.

This is not to say that you want SQL-style operations on the blobs;
in fact I think you probably don't, which is why I was pointing to a
LO-style API.

			regards, tom lane
On Mon, Aug 18, 2008 at 07:31:04PM -0400, Tom Lane wrote:
> I was kinda wondering about something closer to the TOAST model, where
> a blob is only referenceable from a value that's in a table field;
> and that value encapsulates the "name" of the blob in some way that
> needn't even be user-visible.

This vaguely reminds me of Sybase's hidden primary keys.

> This'd greatly simplify the
> cleanup-dead-objects problem, and we could avoid addressing the
> permissions problem at all, since regular SQL permissions on the table
> would serve fine.  But it's not clear what regular SQL fetch and
> update behaviors should be like for such a thing.  (Fetching or
> storing the whole blob value is right out, IMHO.)  ISTR hearing of
> concepts roughly like this in other DBs --- does it ring a bell for
> anyone?

Informix has some pretty good blob-handling:

http://publib.boulder.ibm.com/infocenter/idshelp/v10/index.jsp?topic=/com.ibm.sqlr.doc/sqlrmst101.htm

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
<p><font size="2">David Fetter wrote"<br /><br /> <...><br /> ><br /> > > This'd greatly simplify the<br />> > cleanup-dead-objects problem, and we could avoid addressing the<br /> > > permissions problem at all, sinceregular SQL permissions on the table<br /> > > would serve fine. But it's not clear what regular SQL fetch andupdate<br /> > > behaviors should be like for such a thing. (Fetching or storing the<br /> > > whole blobvalue is right out, IMHO.) ISTR hearing of concepts roughly<br /> > > like this in other DBs --- does it ringa bell for anyone?<br /> ><br /> > Informix has some pretty good blob-handling:<br /> ><br /> > <a href="http://publib.boulder.ibm.com/infocenter/idshelp/v10/index.jsp?topic=/com.ibm.sqlr.doc/sqlrmst101.htm">http://publib.boulder.ibm.com/infocenter/idshelp/v10/index.jsp?topic=/com.ibm.sqlr.doc/sqlrmst101.htm</a><br />><br /><br /> Agreed. I used Informix a few years back in a system that scanned both sides of multi-page financial documents;we stored them in Informix' blobs, which IIRC could be tuned to be given number of bytes. We found that 90% ofour images fit in a given size and since Informix raw disk access let them move up the whole blob in a single pass, itwas quite fast, and gave us all the warmth and fuzziness of ACID functionality. But we didn't fetch parts of the BLOB --metadata lived in its own table. There is/was an Illustra/Informix blade which let you in theory do some processing ofimages (indexing) but that seems like a very specialized case.<br /><br /> Greg Williamson<br /> Senior DBA<br /> DigitalGlobe<br/><br /> Confidentiality Notice: This e-mail message, including any attachments, is for the sole use of theintended recipient(s) and may contain confidential and privileged information and must be protected in accordance withthose provisions. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intendedrecipient, please contact the sender by reply e-mail and destroy all copies of the original message.<br /><br />(My corporate masters made me say this.)<br /></font>
On Mon, 2008-08-18 at 16:22 -0400, Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
> > What would need to happen for the next jump up from where varlena is
> > now, to 8 bytes?
>
> Dealing with upwards-of-4GB blobs as single Datums isn't remotely
> sane, and won't become so in the near (or even medium) future.  So I
> don't see the point of doing all the work that would be involved in
> making this go.
>
> What would make more sense is to redesign the large-object stuff to be
> somewhat modern and featureful, and provide stream-access APIs (think
> lo_read, lo_seek, etc) that allow offsets wider than 32 bits.  The
> main things I think we'd need to consider besides just the access API
> are
>
> - permissions features (more than "none" anyway)
> - better management of orphaned objects (obsoleting vacuumlo)
> - support > 16TB of large objects (maybe partition pg_largeobject?)
> - dump and restore probably need improvement to be practical for such
>   large data volumes

Sounds like a good list.

Probably also using a separate Sequence to allocate numbers rather than
using up all the Oids on LOs would be a good plan.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
> Probably also using a separate Sequence to allocate numbers rather
> than using up all the Oids on LOs would be a good plan.

Well, assuming that your Large Objects are actually Large, you aren't
going to need as many OIDs as all that ;-)

However: I was chewing on this point a few days ago, and it seemed to
me that essentially duplicating the functionality of the OID generator
wasn't likely to be a win.  What seems more practical is to extend the
internal next-OID counter to 64 bits, and allow callers to get either
the full 64 bits or just the lowest 32 bits depending on what they
need.  This change would actually be entirely transparent to 32-bit
callers, and the extra cycles to manage a 64-bit counter would surely
be lost in the noise compared to acquiring/releasing OidGenLock.

			regards, tom lane
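A minimal sketch of the counter arrangement described above; the names
and the commented-out locking are placeholders, not actual backend
code, and the real generator also skips reserved OID ranges on
wraparound, which this ignores.

/*
 * Sketch of a 64-bit next-OID counter serving both kinds of callers.
 */
#include <stdint.h>

typedef uint32_t Oid;

static uint64_t next_oid_64;    /* guarded by an OidGenLock-style lock */

/* new-style callers (e.g. large objects) take the full 64 bits */
static uint64_t
GetNewOid64(void)
{
    uint64_t result;

    /* LWLockAcquire(OidGenLock, LW_EXCLUSIVE); -- placeholder */
    result = next_oid_64++;
    /* LWLockRelease(OidGenLock); -- placeholder */
    return result;
}

/* existing 32-bit callers see only the low word, exactly as before */
static Oid
GetNewOid32(void)
{
    return (Oid) (GetNewOid64() & 0xFFFFFFFF);
}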
On Mon, 2008-08-18 at 23:43 -0400, Tom Lane wrote:
> However: I was chewing on this point a few days ago, and it seemed to
> me that essentially duplicating the functionality of the OID generator
> wasn't likely to be a win.  What seems more practical is to extend the
> internal next-OID counter to 64 bits, and allow callers to get either
> the full 64 bits or just the lowest 32 bits depending on what they
> need.  This change would actually be entirely transparent to 32-bit
> callers, and the extra cycles to manage a 64-bit counter would surely
> be lost in the noise compared to acquiring/releasing OidGenLock.

Sounds very cool.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Mon, 18 Aug 2008, Tom Lane wrote:
> What would make more sense is to redesign the large-object stuff to be
> somewhat modern and featureful, and provide stream-access APIs (think
> lo_read, lo_seek, etc) that allow offsets wider than 32 bits.

A few years ago, I was working on such a project for a company I used
to work for.  The company changed directions shortly thereafter, and
the project was dropped, but perhaps the patch might still be useful as
a starting point for someone else.  The original patch is
http://archives.postgresql.org/pgsql-hackers/2005-09/msg01026.php, and
the advice I was working on implementing was in
http://archives.postgresql.org/pgsql-hackers/2005-09/msg01063.php

I am attaching the latest version of the patch I found around.  As it
was almost 3 years ago, I am a little fuzzy on where I left off, but I
do remember that I was trying to work through the suggestions Tom Lane
gave in that second linked email.  I would recommend discarding the
libpq changes, since that seemed to not pass muster.

Note that this patch was against 8.0.3.  There only seem to be a few
issues applying it to the current head, but I haven't really dug into
them to see how difficult it would be to update.  Luckily, the large
object code is fairly slow-moving, so there aren't too many conflicts.

One thing I did notice is that it looks like someone extracted one of
the functions I wrote in this patch and applied it as a 32-bit version.
Good for them.  I'm glad someone got some use out of this project, and
perhaps more use will come of it.

-- 
At the source of every error which is blamed on the computer you will
find at least two human errors, including the error of blaming it on
the computer.
Josh Berkus wrote:
> Andrew,
>
>> I always find these requests puzzling.  Is it really useful to store
>> the data for a jpeg, video file or a 10GB tar ball in a database
>> column?
>
> Some people find it useful.  Because LOs are actually easier to manage
> in PG than in most other DBMSes, right now that's a significant source
> of PostgreSQL adoption.  I'd like to encourage those users by giving
> them more useful LO features.

Given that they don't replicate with most (any?) of the currently
available replication solutions, the fact that they are easy to use
becomes irrelevant fairly quickly to larger installations in my
experience.

But the interface *is* nice, so if we could fix that, it would be very
good.

//Magnus
Simon Riggs wrote:
> On Mon, 2008-08-18 at 16:22 -0400, Tom Lane wrote:
>> What would make more sense is to redesign the large-object stuff to
>> be somewhat modern and featureful, and provide stream-access APIs
>> (think lo_read, lo_seek, etc) that allow offsets wider than 32 bits.
>> The main things I think we'd need to consider besides just the
>> access API are
>>
>> - permissions features (more than "none" anyway)
>> - better management of orphaned objects (obsoleting vacuumlo)
>> - support > 16TB of large objects (maybe partition pg_largeobject?)
>> - dump and restore probably need improvement to be practical for
>>   such large data volumes
>
> Sounds like a good list.
>
> Probably also using a separate Sequence to allocate numbers rather
> than using up all the Oids on LOs would be a good plan.

The ability to partition the large object store would not suck
either... for backup/recovery purposes mainly.

//Magnus
On Monday, 18 August 2008, Tom Lane wrote:
> - permissions features (more than "none" anyway)
> - better management of orphaned objects (obsoleting vacuumlo)
> - support > 16TB of large objects (maybe partition pg_largeobject?)
> - dump and restore probably need improvement to be practical for such
>   large data volumes

If you replace the third point by "maybe partition TOAST tables",
replace large object handle by TOAST pointer, and create an API to work
on TOAST pointers, how are the two so much different?  And why should
they be?  I can see that there are going to be needs to access large
data with interfaces that are not traditional SQL, but at least the
storage handling could be the same.  That way you would solve the first
two points and others for free.
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I was kinda wondering about something closer to the TOAST model, where
> a blob is only referenceable from a value that's in a table field;
> and that value encapsulates the "name" of the blob in some way that
> needn't even be user-visible.  This'd greatly simplify the
> cleanup-dead-objects problem, and we could avoid addressing the
> permissions problem at all, since regular SQL permissions on the table
> would serve fine.  But it's not clear what regular SQL fetch and
> update behaviors should be like for such a thing.  (Fetching or
> storing the whole blob value is right out, IMHO.)  ISTR hearing of
> concepts roughly like this in other DBs --- does it ring a bell for
> anyone?

It'd probably be good to have methods parallel to the JDBC API within
the implementation.

http://java.sun.com/javase/6/docs/api/java/sql/Blob.html
http://java.sun.com/javase/6/docs/api/java/sql/Clob.html
http://java.sun.com/javase/6/docs/api/java/sql/NClob.html
http://java.sun.com/javase/6/docs/api/java/sql/ResultSet.html#getBlob(int)
http://java.sun.com/javase/6/docs/api/java/sql/ResultSet.html#getClob(int)
http://java.sun.com/javase/6/docs/api/java/sql/ResultSet.html#getNClob(int)
http://java.sun.com/javase/6/docs/api/java/sql/ResultSet.html#updateBlob(int,%20java.sql.Blob)
http://java.sun.com/javase/6/docs/api/java/sql/ResultSet.html#updateBlob(int,%20java.io.InputStream)
http://java.sun.com/javase/6/docs/api/java/sql/ResultSet.html#updateBlob(int,%20java.io.InputStream,%20long)
(similar for CLOB & NCLOB)

Reading a blob value gives you an object which lets you perform the
stream-based operations against the blob.  To set a blob value on an
insert, you prepare a statement and then link the stream to the blob --
the insertRow method sucks the data from the stream.  To set a blob on
an update, you use an updateable cursor (or maybe a prepared statement)
to do the same.  You can set a lob from another lob directly in SQL.

I assume we'd want to support streams directly inline in the protocol,
as well as support functions to convert between datums and streams
(for, say, tiny little 2MB or 10MB values), and files and streams
(kinda like COPY).

-Kevin
Peter Eisentraut wrote:
> If you replace the third point by "maybe partition TOAST tables",
> replace large object handle by TOAST pointer, and create an API to
> work on TOAST pointers, how are the two so much different?  And why
> should they be?  I can see that there are going to be needs to access
> large data with interfaces that are not traditional SQL, but at least
> the storage handling could be the same.  That way you would solve the
> first two points and others for free.

I've long wondered why toast and large object storage is not one and
the same (indeed).  It seems like a natural solution to marry the two.

-- 
Sincerely,
           Stephen R. van den Berg.

E-mails should be like a lady's skirt:
Long enough to cover the subject, and short enough to be interesting.
> Peter Eisentraut wrote:
>> If you replace the third point by "maybe partition TOAST tables",
>> replace large object handle by TOAST pointer, and create an API to
>> work on TOAST pointers, how are the two so much different?  And why
>> should they be?

The reason they should be different is that (IMHO anyway) you don't
want the default behavior of SELECT * FROM ... to include pulling back
the entire contents of the blob.  Indeed, we *can't* have that be the
behavior, unless we want to go back to the proposal that started this
thread of making the entire system safe for multi-gigabyte datums.

It's certainly possible that the underlying implementation could be
just TOAST, but we need some other API at the SQL level.

			regards, tom lane