Thread: WORM and Read Only Tables (v0.1)
Many applications need to archive data after it has been through the
initial flurry of reads and updates that follows its original
insertion. Currently there is no specific feature support to meet this
requirement, so I propose to add this for 8.4.

Use Case: VLDB with tons of (now) read-only data, some not. Data needs
to be accessible, but the data itself is rarely touched, allowing
storage costs to be minimised via a "storage hierarchy" of
progressively cheaper storage.

Features
- Read-Only Tables
- Compressed Tablespaces
- Attaching tables
- Per-Tablespace Access Costing
- Performance Tuning

Read-Only Tables
----------------
Postgres supports the concept of freezing tuples, so they can live
forever within the database without needing further writes. Currently
there is no command that will guarantee that a table has been
completely frozen. This makes it difficult to reliably write data
files to WORM media for longer-term archiving. (WORM means Write-Once,
Read-Many.) It's also a pain having to VACUUM a large table again just
because a small number of rows need to be frozen.

So we need a DDL command that will ensure all tuples are frozen and
then mark the table as read-only. Ideally, we would like to do this in
a way that doesn't hold long full-table locks, since we want the data
to remain accessible at all times.

So...

  VACUUM FREEZE table SET READ ONLY;

would be my first thought, but I'm guessing everybody will press me
towards supporting the more obvious

  ALTER TABLE table SET READ ONLY;

This command will place a ShareLock (only) on the table, preventing
anybody from writing to the table while we freeze it. The ShareLock is
incompatible with any transaction that has written to the table, so
when we acquire the lock all writers to the table will have completed.
We then run the equivalent of a VACUUM FREEZE, which will then be able
to freeze *all* rows in one pass (rather than all except the most
recent). On completion of the freeze pass we will then update the
pg_class entry to show that it is now read-only, so we will emulate
the way VACUUM does this.

This form of the ALTER TABLE command will need to be mangled so it can
only run outside of a transaction block and also so it takes only a
ShareLock rather than an AccessExclusiveLock.

Reversing the process is simpler, since we only have to turn off the
flag in pg_class:

  ALTER TABLE table SET READ WRITE;

possibly able to do this without grabbing an AccessExclusiveLock,
though that isn't an important part of this implementation.

Read-only tables never need VACUUMing, so we would be able to make
autovacuum and explicit vacuum ignore them.

Read-only tables may not be written to, yet would still allow implicit
or explicit INSERT, UPDATE and DELETE privileges to be held on the
table. Attempts to write to the table will result in a specific "read
only table cannot be modified" ERROR. This allows a table to be placed
into read-only mode for long periods of time and flipped back to
read-write if some updates are required. That is useful for various
performance reasons; see later. We can't use the privilege mechanism
to prevent writes, since superusers bypass them. (Thoughts?)

Requests for tuple-level SHARE locks (e.g. FOR SHARE) against
read-only tables will be ignored, since they are effectively already
there. So we don't need to change the internals of the locking, nor
edit the RI code to remove the call to SHARE lock referenced tables.
Do this during post-parse analysis.
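
A minimal standalone sketch of the per-tuple decision a freeze pass
makes, to make the above concrete. It assumes PostgreSQL's conventions
-- TransactionId as a wrapping 32-bit counter, FrozenTransactionId = 2,
and the circular comparison behind TransactionIdPrecedes() -- but it
is an illustration, not the proposed backend code:

/*
 * Sketch of the per-tuple freeze decision, standalone.  In PostgreSQL
 * a TransactionId is a 32-bit counter that wraps around, and
 * FrozenTransactionId (2) means "visible to everyone, forever" -- no
 * further xmin checks, hence no further writes are ever needed.
 */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef uint32_t TransactionId;
#define FrozenTransactionId ((TransactionId) 2)

/* Circular xid comparison: true if id1 logically precedes id2. */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/*
 * Freeze a tuple's xmin if it is older than the cutoff.  A normal
 * VACUUM FREEZE must leave tuples newer than the cutoff alone; the
 * proposed SET READ ONLY instead waits until no transaction can still
 * need the original xmin, then freezes *everything* in one pass.
 */
static bool
freeze_xmin(TransactionId *xmin, TransactionId cutoff)
{
    if (*xmin != FrozenTransactionId && xid_precedes(*xmin, cutoff))
    {
        *xmin = FrozenTransactionId;
        return true;            /* page is now dirty, must be written */
    }
    return false;
}

int
main(void)
{
    TransactionId xmin = 1000;

    if (freeze_xmin(&xmin, 5000))
        printf("tuple frozen, xmin is now %u\n", xmin);
    return 0;
}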
Tables can be copied to WORM media by using

  ALTER TABLE table SET TABLESPACE tblspc;

This would also use a ShareLock rather than an AccessExclusiveLock,
piggy-backing off the work mentioned above. Running SET TABLESPACE and
SET READ ONLY at the same time might sound like a good plan, but ISTM
it will require two fairly different code paths, so if we do it at all
it will be a later addition.

Compressed Tablespaces
----------------------
Frequently with large data archives there is a requirement to reduce
the footprint of the data to allow longer-term storage costs to be
reduced. For insert-only data we might imagine we can reduce the size
of tables by removing unused tuple header information. Although that
is possible, repeated headers compress fairly well, so it seems easier
to tackle the problem directly by having compressed tables.

Using a streaming library like zlib, it will be easy to read/write
data files into a still-usable form but with much reduced size. Access
to a compressed table only makes sense as a SeqScan. That would be
handled by introducing tablespace-specific access costs, discussed
below. Indexes on compressed tables would still be allowed, but would
hardly ever be used.

Access would probably be via tablespace-specific storage managers. So
implement mdcompress.c alongside md.c in src/backend/storage/smgr. If
that implementation route was chosen, it would then allow the
compression option to be made at tablespace level, so commands would
be:

  CREATE TABLESPACE tablespacename LOCATION 'directory' [COMPRESS];

(ALTER TABLESPACE support is going to be much more complex, so leave
that alone for now.)

So when we copy a read-only table to another tablespace the
compression would take place without additional syntax at ALTER TABLE
level, i.e. nothing new goes in tablecmds.c. Cool.

mdcompress.c seems fairly straightforward, though we would need to
think about how to implement smgr_nblocks(), since lseek-ing to get it
won't work because the file size is smaller than the actual
decompressed table. Perhaps with an info file that contains something
like an index metapage where we can read the number of blocks. Ideas?

In the first pass, I would only allow compressed read-only tables. In
time we might allow Inserts, Updates and Deletes, though the latter
two will never be very effective. So my initial approach does *not*
allow writing directly to a compressed table. A couple of reasons why
that isn't desirable/easy: If we write data straight to the table then
any aborted loads will be written to the table, so VACUUMing the table
would need to re-compress it, which sounds horrible. Plus hint status
bits would need to be set *after* the data was written. (Perhaps in
conjunction with a visibility map we can just forget that aspect of
it.) The other problem is that blocks would need to be written out of
shared buffers sequentially, which unfortunately we do not guarantee.
We can force that in the smgr layer by keeping track of the last
written blockid and then writing all cached blocks up to the currently
requested one, but that seems problematic. I'd say if we want more, we
do that in the next release, or at least the next phase of
development. So we would prevent the direct use of CREATE TABLE on a
COMPRESSED tablespace, for now.

I'm open to arguments that we don't need this at all because
filesystem utilities exist that do everything we need. Your experience
will be good to hear about in regard to this feature.
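
One possible shape for the info-file idea above, as a standalone
sketch. The file layout, field names and the .info suffix are
hypothetical, invented here for illustration; the point is only that
smgr_nblocks() for a compressed relation would read a stored logical
block count instead of lseek-ing:

/*
 * Hypothetical "info file" for a compressed relation: stores the
 * logical block count that lseek() can no longer tell us, since the
 * physical file is smaller than the decompressed table.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define INFO_MAGIC 0x434F4D50   /* "COMP", hypothetical */

typedef struct CompressedRelInfo
{
    uint32_t    magic;          /* identifies/validates the file */
    uint32_t    version;        /* format version, for upgrades */
    uint64_t    nblocks;        /* logical (decompressed) block count */
} CompressedRelInfo;

/* What a compressed smgr's smgr_nblocks() might do instead of lseek. */
static uint64_t
compressed_nblocks(const char *infopath)
{
    CompressedRelInfo info;
    FILE       *fp = fopen(infopath, "rb");

    if (fp == NULL ||
        fread(&info, sizeof(info), 1, fp) != 1 ||
        info.magic != INFO_MAGIC)
    {
        if (fp)
            fclose(fp);
        fprintf(stderr, "bad or missing info file: %s\n", infopath);
        exit(1);
    }
    fclose(fp);
    return info.nblocks;
}

int
main(void)
{
    /* Write a sample info file, then read it back as the smgr would. */
    CompressedRelInfo info = { INFO_MAGIC, 1, 131072 }; /* 1 GB of 8K blocks */
    FILE       *fp = fopen("16384.info", "wb");

    fwrite(&info, sizeof(info), 1, fp);
    fclose(fp);

    printf("nblocks = %llu\n",
           (unsigned long long) compressed_nblocks("16384.info"));
    return 0;
}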
Attach
------
Writing tables on one system and then moving that data to other
systems is fairly common. If we supported read-only tables then you
might consider how you would publish new versions to people.

For now, we just want to consider how we will upgrade from one release
to another without needing to unload and reload potentially many
Terabytes of data. We can't delete the old data until the new data is
successfully loaded, so we will have a huge temporary storage cost.
This could go very slowly if we use cheaper storage, plus reloading
the data means we have to re-freeze it again also.

So we need a way of attaching the old tables to the new database. We
might call this binary upgrade, or we might be slightly less ambitious
and talk about just moving the old read-only data. That's all I want
to do at this stage.

I'm mentioning this here now to see what comes out in debate, and what
others are planning to work on in this area.

Per-Tablespace Access Costs
---------------------------
Access costs are currently set using seq_page_cost and
random_page_cost for the whole database. Read-only tables and WORM
media would make it more sensible to consider different access costs
for each tablespace. This then allows tablespaces on tape, MAID,
CD-ROM or just fast and slow disk etc. to be catered for. For
compressed filesystems, setting random_page_cost = ~100000 should do
the trick for most planning issues.

A new parameter, startup_page_cost, might also be introduced, so we
can allow for some kinds of media that have non-zero initial access
times. MAID storage takes ~1 sec to reach the first block, whether
sequential or random. Tape storage can be ~15 seconds for robots, more
for humans, even more for outsourced offsite archival companies. (A
rough costing sketch follows at the end of this message.)

Performance
-----------
There are a number of optimisations that are possible with read-only
tables. I'm not suggesting to write all of them as part of this
initial project, but it is useful to discuss them at the beginning.

- Removal of empty read-only tables from query plans (particularly
  interesting for optimising partitioning; nice idea Greg)
- Column stats can hold min & max values for all columns, allowing
  min() and max() aggregates to be answered from the catalog, plus
  those values used for constraint exclusion during planning.
- pg_class.reltuples to be used directly for select count(*)
- Table contents can be cached, allowing RI checks against small
  read-only tables to operate in a similar manner to CHECK constraints
- Index-only plans would be able to trust the complete visibility of
  any index entry, which would speed up RI checks against large
  read-only tables

The first two are certainly related to advanced partitioning, so I
would be looking to work on those eventually.

Example
-------
When user tables get old we stamp them read-only and copy them away to
a compressed tablespace, which we create here also to complete the
example.

  VACUUM FREEZE mytable SET READ ONLY;
  CREATE TABLESPACE archive001 LOCATION '/vldb_archive/arc001' COMPRESS;
  ALTER TABLE mytable SET TABLESPACE archive001;

Implementation Schedule
-----------------------
Roughly in the order listed above. I'm being partially sponsored for
this work, but not fully. (Don't ask who by, I'm under NDA.) If you're
a likely user of these features and would like to sponsor me, please
drop me an email.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
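
A rough standalone sketch of the per-tablespace costing idea above.
The TablespaceCosts lookup and startup_page_cost are the proposal's,
not an existing API, and the numbers are illustrative guesses in
planner cost units (seq_page_cost = 1):

/*
 * Standalone sketch: how a seq-scan estimate could consult
 * per-tablespace costs instead of the global GUCs.  The struct and
 * constants are hypothetical; startup_page_cost is the proposed new
 * parameter for media (MAID, tape) with large first-access latency.
 */
#include <stdio.h>

typedef struct TablespaceCosts
{
    double      seq_page_cost;      /* per sequentially-read page */
    double      random_page_cost;   /* per randomly-read page */
    double      startup_page_cost;  /* one-time cost to reach first page */
} TablespaceCosts;

/* Illustrative settings: fast disk, compressed archive, MAID array. */
static const TablespaceCosts fast_disk  = { 1.0,      4.0,    0.0 };
static const TablespaceCosts compressed = { 1.0, 100000.0,    0.0 };
static const TablespaceCosts maid       = { 1.0,      4.0, 1000.0 };

/* Seq scan cost: startup latency plus one seq_page_cost per page. */
static double
seqscan_cost(const TablespaceCosts *ts, double pages)
{
    return ts->startup_page_cost + ts->seq_page_cost * pages;
}

int
main(void)
{
    double      pages = 100000.0;   /* ~800 MB table at 8K blocks */

    printf("fast disk:  %.0f\n", seqscan_cost(&fast_disk, pages));
    printf("compressed: %.0f\n", seqscan_cost(&compressed, pages));
    printf("MAID:       %.0f\n", seqscan_cost(&maid, pages));
    return 0;
}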
"Simon Riggs" <simon@2ndquadrant.com> writes: > So... VACUUM FREEZE table SET READ ONLY; > > would be my first thought, but I'm guessing everybody will press me > towards supporting the more obvious > > ALTER TABLE table SET READ ONLY; > > This command will place a ShareLock (only) on the table, preventing > anybody from writing to the table while we freeze it. The ShareLock is > incompatible with any transaction that has written to the table, so when > we acquire the lock all writers to the table will have completed. We > then run the equivalent of a VACUUM FREEZE which will then be able to > freeze *all* rows in one pass (rather than all except the most recent). > On completion of the freeze pass we will then update the pg_class entry > to show that it is now read-only, so we will emulate the way VACUUM does > this. To be clear it if it meets a block for which a tuple is not freezable -- that is, it has an xmin or xmax more recent than the global xmin then it needs to block waiting for the backend which that recent xmin. Then presumably it needs to update its concept of recent global xmin going forward. You might be best off grabbing a list of txid->xmin when you start and sorting them by xmin so you can loop through them sleeping until you reach the first txid with an xmin large enough to continue. > Reversing the process is simpler, since we only have to turn off the > flag in pg_class: I'm not sure how this interacts with: > Requests for tuple-level SHARE locks (e.g. FOR SHARE) against read-only > tables will be ignored, since they are effectively already there. So we > don't need to change the internals of the locking, nor edit the RI code > to remove the call to SHARE lock referenced tables. Do this during > post-parse analysis. Since queries which think they hold FOR SHARE tuple locks will be magically losing their share locks if you turn off the read-only flag. Do you need to obtain an exclusive lock on the table to turn it read-write? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB'sPostgreSQL training!
On Tue, 2007-12-11 at 11:12 +0000, Simon Riggs wrote:
> Features
> - Read-Only Tables
> - Compressed Tablespaces

I wonder if, instead of read-only tables, wouldn't it be better to
have some kind of automatic partitioning which permits having
different chunks of the table data in different tablespaces, and a
freeze command which effectively moves the data from the (normally
small) active chunk to the archive chunk when its transaction id is
older than a predefined threshold?

Then put the active chunk on a high-performance file system and the
archive tablespace on a compressed/slow/cheap file system and you're
done. Allow even the archive chunk to be updateable, and put new tuple
data in the active chunk. It would work just fine for cases where the
old data is rarely updated/deleted...

Another advantage I guess would be that active data would more likely
stay in cache, as updated records would stay together and not be
spread over the inactive data.

Cheers,
Csaba.
On Tue, 2007-12-11 at 13:44 +0100, Csaba Nagy wrote:
> Another advantage I guess would be that active data would more likely
> stay in cache, as updated records would stay together and not be
> spread over the inactive data.

And I forgot to mention that vacuum could mostly skip the archive
part, and only vacuum the active part, which would drastically reduce
the cost of vacuuming big & active tables.

Cheers,
Csaba.
On Tue, 2007-12-11 at 13:44 +0100, Csaba Nagy wrote:
> On Tue, 2007-12-11 at 11:12 +0000, Simon Riggs wrote:
> > Features
> > - Read-Only Tables
> > - Compressed Tablespaces
>
> I wonder if, instead of read-only tables, wouldn't it be better to
> have some kind of automatic partitioning which permits having
> different chunks of the table data in different tablespaces, and a
> freeze command which effectively moves the data from the (normally
> small) active chunk to the archive chunk when its transaction id is
> older than a predefined threshold?

This would be doable using Simon's proposed commands.

> Then put the active chunk on a high-performance file system and the
> archive tablespace on a compressed/slow/cheap file system and you're
> done. Allow even the archive chunk to be updateable, and put new tuple
> data in the active chunk. It would work just fine for cases where the
> old data is rarely updated/deleted...

You can't update a table on a read-only (write-once) partition, at
least not with the current header structure.

> Another advantage I guess would be that active data would more likely
> stay in cache, as updated records would stay together and not be
> spread over the inactive data.
On Tue, 2007-12-11 at 14:58 +0200, Hannu Krosing wrote:
> On Tue, 2007-12-11 at 13:44 +0100, Csaba Nagy wrote:
> > Then put the active chunk on a high-performance file system and the
> > archive tablespace on a compressed/slow/cheap file system and you're
> > done. Allow even the archive chunk to be updateable, and put new tuple
> > data in the active chunk. It would work just fine for cases where the
> > old data is rarely updated/deleted...
>
> You can't update a table on a read-only (write-once) partition, at
> least not with the current header structure.

OK, but that's what I'm challenging: why do you need a write-once
partition? Do you mean tapes by that? OK, it means I was thinking of
completely different usage scenarios then...

Cheers,
Csaba.
On 12/11/07, Simon Riggs <simon@2ndquadrant.com> wrote:

> Compressed Tablespaces

> Using a streaming library like zlib, it will be easy to read/write
> data files into a still-usable form but with much reduced size. Access
> to a compressed table only makes sense as a SeqScan. That would be
> handled by introducing tablespace-specific access costs, discussed
> below. Indexes on compressed tables would still be allowed, but would
> hardly ever be used.

I've actually been wanting this lately, for a couple of reasons. One
is reduced disk footprint, but the other is reduced I/O, similar to
how TOAST helps with large fields now. (In my particular scenario,
TOAST can't help due to small field sizes.) It would be useful to have
available even on read/write data.

To that end, it would probably make more sense to use a block
compression algorithm rather than a streaming one. Block-based
algorithms can generally get better compression than streaming ones as
well, at least when fed large enough blocks.

I'm not familiar with the implementation issues, other than the
obvious "variable block sizes make the I/O subsystem look very
different", so I don't know if there's a major tradeoff between the
two strategies (even just for read-only).

> I'm open to arguments that we don't need this at all because
> filesystem utilities exist that do everything we need. Your experience
> will be good to hear about in regard to this feature.

Some filesystems do support transparent compression, but they're not
always available. It would be nice to have compression on
unsophisticated systems with cheap hardware.
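
To illustrate the block-compression alternative: each 8K page
compressed independently with zlib's one-shot compress2() and
uncompress() calls, so any single page can be decompressed without
reading the stream from the start. A sketch under those assumptions,
not a proposed on-disk format (compile with -lz):

/*
 * Per-block compression sketch: each 8K page is compressed on its
 * own, preserving the random access that a pure streaming format
 * gives up.  Variable output sizes are the I/O-layer cost mentioned
 * above.
 */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define BLCKSZ 8192

int
main(void)
{
    unsigned char page[BLCKSZ];
    unsigned char packed[BLCKSZ + 64];  /* >= compressBound(8192) */
    unsigned char unpacked[BLCKSZ];
    uLongf      packed_len = sizeof(packed);
    uLongf      unpacked_len = sizeof(unpacked);

    /* Fake a page: repetitive content, like repeated tuple headers. */
    for (int i = 0; i < BLCKSZ; i++)
        page[i] = (unsigned char) (i % 64);

    /* Compress one page by itself (level 6 is zlib's default). */
    if (compress2(packed, &packed_len, page, BLCKSZ, 6) != Z_OK)
        return 1;
    printf("8192 -> %lu bytes\n", (unsigned long) packed_len);

    /* Decompress just this page -- no preceding stream needed. */
    if (uncompress(unpacked, &unpacked_len, packed, packed_len) != Z_OK ||
        unpacked_len != BLCKSZ || memcmp(page, unpacked, BLCKSZ) != 0)
        return 1;
    printf("round trip OK\n");
    return 0;
}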
On 11/12/2007, Csaba Nagy <nagy@ecircle-ag.com> wrote:
> On Tue, 2007-12-11 at 14:58 +0200, Hannu Krosing wrote:
> > On Tue, 2007-12-11 at 13:44 +0100, Csaba Nagy wrote:
> > > Then put the active chunk on a high-performance file system and the
> > > archive tablespace on a compressed/slow/cheap file system and you're
> > > done. Allow even the archive chunk to be updateable, and put new tuple
> > > data in the active chunk. It would work just fine for cases where the
> > > old data is rarely updated/deleted...
> >
> > You can't update a table on a read-only (write-once) partition, at
> > least not with the current header structure.
>
> OK, but that's what I'm challenging: why do you need a write-once
> partition? Do you mean tapes by that? OK, it means I was thinking of
> completely different usage scenarios then...
>
> Cheers,
> Csaba.

I think DVD or CD would make sense; tapes have the added limitation of
being sequential access only.

Peter Childs
On 11/12/2007, Simon Riggs <simon@2ndquadrant.com> wrote:
> Attach
> ------
> Writing tables on one system and then moving that data to other
> systems is fairly common. If we supported read-only tables then you
> might consider how you would publish new versions to people.
>
> For now, we just want to consider how we will upgrade from one release
> to another without needing to unload and reload potentially many
> Terabytes of data. We can't delete the old data until the new data is
> successfully loaded, so we will have a huge temporary storage cost.
> This could go very slowly if we use cheaper storage, plus reloading
> the data means we have to re-freeze it again also.
>
> So we need a way of attaching the old tables to the new database. We
> might call this binary upgrade, or we might be slightly less ambitious
> and talk about just moving the old read-only data. That's all I want
> to do at this stage.
>
> I'm mentioning this here now to see what comes out in debate, and what
> others are planning to work on in this area.

This sounds like allowing new versions to read the old version's file
structure, probably, I guess, on a tablespace-by-tablespace basis.

Another advantage might be being able to load an old PITR backup into
a new version across major versions, but I'm not sure that would work.

Maybe we need a command to upgrade a tablespace to a new version's
file format? We need to be careful we don't stunt future progress by
fixing the file format too much.

Peter Childs
On Tue, 2007-12-11 at 11:49 +0000, Gregory Stark wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>
> > So... VACUUM FREEZE table SET READ ONLY;
> >
> > would be my first thought, but I'm guessing everybody will press me
> > towards supporting the more obvious
> >
> > ALTER TABLE table SET READ ONLY;
> >
> > This command will place a ShareLock (only) on the table, preventing
> > anybody from writing to the table while we freeze it. The ShareLock is
> > incompatible with any transaction that has written to the table, so when
> > we acquire the lock all writers to the table will have completed. We
> > then run the equivalent of a VACUUM FREEZE, which will then be able to
> > freeze *all* rows in one pass (rather than all except the most recent).
> > On completion of the freeze pass we will then update the pg_class entry
> > to show that it is now read-only, so we will emulate the way VACUUM does
> > this.
>
> To be clear: if it meets a block for which a tuple is not freezable --
> that is, it has an xmin or xmax more recent than the global xmin --
> then it needs to block waiting for the backend which has that recent
> xmin. Then presumably it needs to update its concept of the recent
> global xmin going forward.
>
> You might be best off grabbing a list of txid->xmin when you start and
> sorting them by xmin, so you can loop through them, sleeping until you
> reach the first txid with an xmin large enough to continue.

D'oh. Completely agreed. Mea culpa. I had that bit in my original
design, but I was clearly looking elsewhere on this. I'd been trying
to think about how to do this for about two years, and it was only the
CREATE INDEX CONCURRENTLY stuff that showed me how. Thanks for nudging
me.

> > Reversing the process is simpler, since we only have to turn off the
> > flag in pg_class:
>
> I'm not sure how this interacts with:
>
> > Requests for tuple-level SHARE locks (e.g. FOR SHARE) against
> > read-only tables will be ignored, since they are effectively already
> > there. So we don't need to change the internals of the locking, nor
> > edit the RI code to remove the call to SHARE lock referenced tables.
> > Do this during post-parse analysis.
>
> since queries which think they hold FOR SHARE tuple locks will be
> magically losing their share locks if you turn off the read-only flag.
> Do you need to obtain an exclusive lock on the table to turn it
> read-write?

Agreed. I wasn't suggesting implementing without; I was just noting
that it might have been possible, but it seems not, as you say. I
don't think it's important to be able to do that with less than an
AccessExclusiveLock.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Simon Riggs <simon@2ndquadrant.com> writes:
> This command will place a ShareLock (only) on the table, preventing
> anybody from writing to the table while we freeze it. The ShareLock is
> incompatible with any transaction that has written to the table, so when
> we acquire the lock all writers to the table will have completed. We
> then run the equivalent of a VACUUM FREEZE, which will then be able to
> freeze *all* rows in one pass (rather than all except the most
> recent).

This breaks MVCC. The fact that a transaction has completed is not
license to discard tuple xmin immediately.

			regards, tom lane
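
A small standalone simulation of the failure Tom is pointing at, with
deliberately simplified snapshot rules: a snapshot taken before the
inserting transaction committed must not see the tuple, but a frozen
tuple is visible to every snapshot, old ones included:

/*
 * Why freezing too early breaks MVCC.  Real visibility checks also
 * consult the snapshot's in-progress list and commit status; this
 * keeps just enough to show the problem.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FROZEN_XID 2

typedef struct Snapshot
{
    uint32_t    xmin;           /* all xids below this are finished */
    uint32_t    xmax;           /* all xids at or above this are invisible */
} Snapshot;

static bool
tuple_visible(uint32_t tuple_xmin, const Snapshot *snap)
{
    if (tuple_xmin == FROZEN_XID)
        return true;                /* frozen: visible to everyone */
    return tuple_xmin < snap->xmax; /* simplified: committed & old enough */
}

int
main(void)
{
    Snapshot    old_snap = { 90, 100 };  /* taken before xid 100 existed */
    uint32_t    tuple_xmin = 100;        /* inserted by xid 100, now committed */

    printf("before freeze: visible = %d (correct: old snapshot can't see it)\n",
           tuple_visible(tuple_xmin, &old_snap));

    tuple_xmin = FROZEN_XID;             /* premature VACUUM FREEZE */
    printf("after freeze:  visible = %d (wrong: MVCC is broken)\n",
           tuple_visible(tuple_xmin, &old_snap));
    return 0;
}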
On Tue, 2007-12-11 at 10:19 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > This command will place a ShareLock (only) on the table, preventing
> > anybody from writing to the table while we freeze it. The ShareLock is
> > incompatible with any transaction that has written to the table, so when
> > we acquire the lock all writers to the table will have completed. We
> > then run the equivalent of a VACUUM FREEZE, which will then be able to
> > freeze *all* rows in one pass (rather than all except the most
> > recent).
>
> This breaks MVCC. The fact that a transaction has completed is not
> license to discard tuple xmin immediately.

Yeh, agreed. I knew I'd solved that bit, so I was focused elsewhere.
Sloppy, so apologies. I was originally planning to put a wait in at
the beginning, as is used by CREATE INDEX CONCURRENTLY, though I
prefer Greg's variant because it's more forgiving.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Tue, 2007-12-11 at 13:44 +0100, Csaba Nagy wrote:
> On Tue, 2007-12-11 at 11:12 +0000, Simon Riggs wrote:
> > Features
> > - Read-Only Tables
> > - Compressed Tablespaces
>
> I wonder if, instead of read-only tables, wouldn't it be better to
> have some kind of automatic partitioning

That's definitely on my list of requirements for partitioning.

> which permits having different chunks of the table data in different
> tablespaces, and a freeze command which effectively moves the data
> from the (normally small) active chunk to the archive chunk when its
> transaction id is older than a predefined threshold?

As Hannu says, this is exactly what the other features will allow, so
yes!

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Tue, Dec 11, 2007 at 11:12:46AM +0000, Simon Riggs wrote:
>
> Read-Only Tables
> ----------------

In the past when this topic came up, there was some discussion of
doing this at a level somewhere below the table horizon. There are a
number of nasty limitations for partitions currently (not the least of
which is that real uniqueness guarantees are impractical), so allowing
users to specify some segment of the table to be "read only" without
imposing it on the whole table would be awful nice. I seem to recall
Jan had an idea on how to do it, but I could be wrong.

Also, doing this at the tuple, rather than table-wide, level might
lead to additional capabilities in this area:

> Attach
> ------
>
> Writing tables on one system and then moving that data to other
> systems is fairly common. If we supported read-only tables then you
> might consider how you would publish new versions to people.

Some time ago I was speculating on pie-in-the-sky features I might
like in Postgres, and it was something like this attach. But the idea
was somehow related to the read-only tuples.

In my specific case, I have piles and piles of mostly useless data.
Sometimes, however, some of that data is possibly useful in
retrospect. So the suggestion was to have tables that could be mostly
offline -- archived somewhere -- but for which we had enough metadata
online to say, "You have some data that might match in catalog C. Go
mount it, and I'll check." I think this is subtly different from the
attach case you're outlining?

A
On Tue, 2007-12-11 at 14:25 -0500, Andrew Sullivan wrote:
> On Tue, Dec 11, 2007 at 11:12:46AM +0000, Simon Riggs wrote:
> >
> > Read-Only Tables
> > ----------------
>
> In the past when this topic came up, there was some discussion of
> doing this at a level somewhere below the table horizon. There are a
> number of nasty limitations for partitions currently (not the least of
> which is that real uniqueness guarantees are impractical), so allowing
> users to specify some segment of the table to be "read only" without
> imposing it on the whole table would be awful nice. I seem to recall
> Jan had an idea on how to do it, but I could be wrong.
>
> Also, doing this at the tuple, rather than table-wide, level might
> lead to additional capabilities in this area:

Seems fair comment. I'll move forward my thoughts on partitioning, so
we can decide whether to do things this way, or below the table
horizon as you say. I've got some detailed notes on this already which
showed it was roughly balanced between the two ways. I'll write it up
so we can see; the answer might change during the writing.

> > Attach
> > ------
> >
> > Writing tables on one system and then moving that data to other
> > systems is fairly common. If we supported read-only tables then you
> > might consider how you would publish new versions to people.
>
> Some time ago I was speculating on pie-in-the-sky features I might
> like in Postgres, and it was something like this attach. But the idea
> was somehow related to the read-only tuples.
>
> In my specific case, I have piles and piles of mostly useless data.
> Sometimes, however, some of that data is possibly useful in
> retrospect. So the suggestion was to have tables that could be mostly
> offline -- archived somewhere -- but for which we had enough metadata
> online to say, "You have some data that might match in catalog C. Go
> mount it, and I'll check." I think this is subtly different from the
> attach case you're outlining?

Yes it is, but I had hoped that what you're asking for is catered for
here. If you have a hierarchical storage manager, you can just access
the file, whereupon the actual file will be de-archived to allow
access. Or maybe you have it on MAID storage, so we spin up the disks
to allow access to the files. So I guess I was expecting the
de-archive to be automated at the file system level.

http://en.wikipedia.org/wiki/Hierarchical_Storage_Management

Regrettably, I'm not aware of an open source HSM, though XFS has some
hooks in it that mention this. HSM is an old IBM term, so some people
might refer to this concept as tiered storage.

Since I was planning to modify smgr to allow different kinds of
tablespaces, it should be possible to make the file open issue some
kind of call-out to mount the appropriate files. What would that
call-out look like? Would that be a DMAPI/XDSM implementation, or
something simpler as was used for PITR, or a roll-your-own plug-in
hook?

I can see I'll have to re-wire smgr_nblocks() for non-md smgrs to
access pg_class.relpages rather than issue a seek, which would need an
open first. That way we can do planning without actually accessing the
table. (Maybe that's presuming we're dealing with tables, oh well.)

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
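
One possible answer to "what would that call-out look like", modelled
on the simple shell-template style PITR already uses for
restore_command. The mount_command name, the %p placeholder and the
hook itself are hypothetical, sketched standalone:

/*
 * Hypothetical "mount_command" hook: a shell template where %p is
 * replaced with the path of the relation file about to be opened.
 * The HSM (or a MAID spin-up script) does the real work; we just run
 * the command and check that it succeeded.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int
run_mount_command(const char *tmpl, const char *path)
{
    char        cmd[1024];
    char       *dst = cmd;
    const char *src;

    /* Substitute every %p in the template with the file path. */
    for (src = tmpl; *src && dst < cmd + sizeof(cmd) - 1; src++)
    {
        if (src[0] == '%' && src[1] == 'p')
        {
            size_t      room = (size_t) (cmd + sizeof(cmd) - 1 - dst);
            size_t      len = strlen(path) < room ? strlen(path) : room;

            memcpy(dst, path, len);
            dst += len;
            src++;              /* skip the 'p' */
        }
        else
            *dst++ = *src;
    }
    *dst = '\0';

    /* Nonzero exit means the file could not be brought online. */
    return system(cmd);
}

int
main(void)
{
    /* e.g. mount_command = 'hsm-recall %p' set in postgresql.conf */
    const char *mount_command = "echo recalling %p from archive";

    if (run_mount_command(mount_command, "pg_tblspc/16385/16384") != 0)
        fprintf(stderr, "mount_command failed\n");
    return 0;
}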
On Tue, 2007-12-11 at 20:30 -0800, Josh Berkus wrote:
> Simon,
>
> > Use Case: VLDB with tons of (now) read-only data, some not. Data needs
> > to be accessible, but the data itself is rarely touched, allowing
> > storage costs to be minimised via a "storage hierarchy" of
> > progressively cheaper storage.
>
> There's actually 2 cases to optimize for:
> 1) write-once-read-many (WORM)
> 2) write-once-read-seldom (WORS)
>
> The 2nd case is becoming extremely popular due to the presence of
> government-mandated records databases. For example, I'm currently
> working on one call completion records database which will hold 75TB
> of data, of which we expect less than 1% to *ever* be queried.

Well, that's exactly the use case I'm writing for. I called that an
archival data store in my post on VLDB Features. WORM is a type of
storage that might be used, so it would be somewhat confusing if we
used it as the name of a specific use case.

Getting partitioning/read-only right will allow 70+TB of that to be on
tape or similar, which with compression can be reduced to maybe 20TB?
I don't want to promise any particular compression ratio, but it will
make a substantial difference, as I'm sure you realise.

> One of the other things I'd like to note is that for WORM, conventional
> storage is never going to approach column-store DBs for general
> performance. So, should we be working on incremental improvements like
> the ones you propose, or should we be working on integrating a c-store
> into PostgreSQL on a per-table basis?

What I'm saying is that there are some features that all VLDBs need.
If we had a column-store DB we would still need partitioning as well,
or the data structures would become unmanageable. Plus partitioning
can allow the planner to avoid de-archiving/spinning up data and help
reduce storage costs.

Radical can be good, but it can take more time also. I dare say it
would be harder for the community to accept also. So I look for
worthwhile change in acceptable-size chunks.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Simon,

> Use Case: VLDB with tons of (now) read-only data, some not. Data needs
> to be accessible, but the data itself is rarely touched, allowing
> storage costs to be minimised via a "storage hierarchy" of
> progressively cheaper storage.

There's actually 2 cases to optimize for:
1) write-once-read-many (WORM)
2) write-once-read-seldom (WORS)

The 2nd case is becoming extremely popular due to the presence of
government-mandated records databases. For example, I'm currently
working on one call completion records database which will hold 75TB
of data, of which we expect less than 1% to *ever* be queried.

One of the other things I'd like to note is that for WORM, conventional
storage is never going to approach column-store DBs for general
performance. So, should we be working on incremental improvements like
the ones you propose, or should we be working on integrating a c-store
into PostgreSQL on a per-table basis?

--
Josh "the Fuzzy" Berkus
San Francisco
> There are a number of nasty limitations for partitions currently (not
> the least of which is that real uniqueness guarantees are
> impractical),

Just to add another opinion to this statement, because it imho sounds
overly pessimistic:

Uniqueness is currently perfectly practical when the unique index
contains the column[s] that is/are used in a non-overlapping
partitioning scheme. If you cannot create separate unique indexes on
each partition that guarantee global uniqueness because of the chosen
partitioning scheme, you can often reconsider your scheme (e.g. use
natural keys instead of serials).

Other db software allows creating global indexes, or indexes with
separate partitioning schemes, but this is then often a pain. When you
drop/attach/detach a partition, such an index needs to be recreated or
reorganized. This then makes a large, slow transaction out of
attach/detach partition. If you don't need to attach/detach, there is
still one other argument against the huge global index, which is
"fault isolation".

There is imho large room to make it better than others :-) And I think
we should not regard them as positive examples, because that narrows
the view.

Andreas
> Getting partitioning/read-only right will allow 70+TB of that to be on
> tape or similar, which with compression can be reduced to maybe 20TB?
> I don't want to promise any particular compression ratio, but it will
> make a substantial difference, as I'm sure you realise.

Wouldn't one very substantial requirement of such storage be to have
it independent of db version, or even db product? Keeping old hardware
and software around can be quite expensive.

So, wouldn't a virtual table interface be a better match for such a
problem? Such a virtual table should be allowed to be part of a
partitioning scheme, have native or virtual indexes, ...

Andreas
On Wed, Dec 12, 2007 at 12:14:43PM +0100, Zeugswetter Andreas ADI SD wrote:
> Uniqueness is currently perfectly practical when the unique index
> contains the column[s] that is/are used in a non-overlapping
> partitioning scheme.

Well, yes, assuming you have no bugs. Part of the reason I want the
database to handle this for me is because, where I've come from, the
only thing I can be sure of is that there will be bugs. There'll even
be bugs before there is running code. One bug I can easily imagine is
that the non-overlapping partitioning scheme has a bug in it, such
that it turns out there _is_ an overlap some time.

All of that said, I agree with you, particularly about the alternative
ways things can suck instead :-/

A
On Wed, Dec 12, 2007 at 12:58:11PM +0100, Zeugswetter Andreas ADI SD wrote:
> Wouldn't one very substantial requirement of such storage be to have
> it independent of db version, or even db product? Keeping old hardware
> and software around can be quite expensive.

This was one of the explicit requirements I had when I wrote my
pie-in-the-sky outline. Hrm. I wonder if I can get permission to post
it. Let me find out.

The requirement was, anyway, that we be able to read old versions of
"archived" rows. IIRC there was an implementation choice: whether we
would _never_ allow such rows to be SET READ WRITE, or whether they'd
be immediately upgraded to the present format on SET READ WRITE.

A
Andrew,

> The requirement was, anyway, that we be able to read old versions of
> "archived" rows. IIRC there was an implementation choice: whether we
> would _never_ allow such rows to be SET READ WRITE, or whether they'd
> be immediately upgraded to the present format on SET READ WRITE.

Well, in theory we need this capability for upgrade-in-place too.
While that project has kind of stalled for the moment, we'll pick it
back up again soon.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
On Wed, 2007-12-12 at 10:48 -0800, Josh Berkus wrote:
> Andrew,
>
> > The requirement was, anyway, that we be able to read old versions of
> > "archived" rows. IIRC there was an implementation choice: whether we
> > would _never_ allow such rows to be SET READ WRITE, or whether they'd
> > be immediately upgraded to the present format on SET READ WRITE.
>
> Well, in theory we need this capability for upgrade-in-place too.
> While that project has kind of stalled for the moment, we'll pick it
> back up again soon.

Who was working on it?

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Wed, 2007-12-12 at 11:22 -0500, Andrew Sullivan wrote:
> On Wed, Dec 12, 2007 at 12:14:43PM +0100, Zeugswetter Andreas ADI SD wrote:
> > Uniqueness is currently perfectly practical when the unique index
> > contains the column[s] that is/are used in a non-overlapping
> > partitioning scheme.
>
> Well, yes, assuming you have no bugs. Part of the reason I want the
> database to handle this for me is because, where I've come from, the
> only thing I can be sure of is that there will be bugs. There'll even
> be bugs before there is running code. One bug I can easily imagine is
> that the non-overlapping partitioning scheme has a bug in it, such
> that it turns out there _is_ an overlap some time.

Enforcing uniqueness with a global index has a number of
disadvantages. The worst of these is that the index continues to get
bigger and bigger as the total data volume increases. You have to
index all partitions, plus each index entry needs to include a
partition id as well as the index key. So not only is it big, it's
huge. Huge indexes are slow, so an index with terabytes of data in it
is going to be almost unusable.

The best thing to do would be to sit down and work out exactly how big
and deep such an index would be in the case you're thinking of, so we
can tell whether it is very bad or merely bad (a rough worked example
follows below).

I seem to be the only one saying global indexes are bad, so if people
that want them can do the math and honestly say they want them, then I
will listen.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
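
A back-of-the-envelope version of that math, using Josh's 75TB example
and invented constants (8K pages, ~24-byte index entries including a
partition id, 70% fill, ~100 bytes per heap row). Under those
assumptions the global index alone is tens of terabytes, though only
about five levels deep (compile with -lm):

/*
 * Rough estimate of a global b-tree's size and depth.  Every constant
 * here is an illustrative assumption, not a measurement.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      rows = 75e12 / 100.0;       /* ~750G rows: 75 TB at ~100 B/row */
    double      entry_bytes = 16 + 4 + 4;   /* key + heap pointer + partition id */
    double      page_bytes = 8192 * 0.70;   /* usable bytes at 70% fill */
    double      fanout = page_bytes / entry_bytes;

    double      leaf_pages = ceil(rows / fanout);
    double      index_bytes = leaf_pages * 8192;
    int         depth = (int) ceil(log(rows) / log(fanout));

    printf("leaf pages:  %.3g\n", leaf_pages);
    printf("index size:  %.1f TB\n", index_bytes / 1e12);
    printf("tree depth:  ~%d levels\n", depth);
    return 0;
}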
On Wed, 12 Dec 2007 19:07:57 +0000, Simon Riggs <simon@2ndquadrant.com> wrote:

> I seem to be the only one saying global indexes are bad, so if people
> that want them can do the math and honestly say they want them, then I
> will listen.

Global indexes are bad for certain situations; for others they are
required. Constraint exclusion/partitioning is not only for ginormous
tables. It can also be used for maintenance efficiency,
micro-optimizations and just general data architecture.

Joshua D. Drake

--
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'
Simon,

> Who was working on it?

Zdenec and Dhanaraj from Sun, and someone from EDB (I'm not sure who,
maybe Korry?). Unfortunately, both companies have shifted staff around
and we need to re-start work.

Of course, if hackers other than those from EDB & Sun want to attack
the problem, the more the merrier.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
On Wed, Dec 12, 2007 at 07:07:57PM +0000, Simon Riggs wrote:
>
> Enforcing uniqueness with a global index has a number of disadvantages.

This is why I was trying to talk about "constraints" rather than
global indexes. Just because we happen to implement them that way
today does not mean that such constraints need be implemented that way
in every case. I think especially for the sort of "detached rows"
scenario I was dreaming about, a global index is never going to be
good.

A
> Read-Only Tables
> ----------------
> Postgres supports the concept of freezing tuples, so they can live
> forever within the database without needing further writes. Currently
> there is no command that will guarantee that a table has been
> completely frozen. This makes it difficult to reliably write data
> files to WORM media for longer-term archiving. (WORM means Write-Once,
> Read-Many.) It's also a pain having to VACUUM a large table again just
> because a small number of rows need to be frozen.

I'm not an expert at all, but I'd like to understand this: do you plan
that READ ONLY tables wouldn't even store transaction information?
That should save quite a lot of space. Maybe when the table was moved
to the compressed tablespace, MVCC information could be dropped too?
Of course that would prevent future insert & update possibilities,
though.

--
Albert Cervera i Areny
http://www.NaN-tic.com
On Sat, 2007-12-15 at 13:32 +0100, Albert Cervera i Areny wrote:
> > Read-Only Tables
> > ----------------
> > Postgres supports the concept of freezing tuples, so they can live
> > forever within the database without needing further writes. Currently
> > there is no command that will guarantee that a table has been
> > completely frozen. This makes it difficult to reliably write data
> > files to WORM media for longer-term archiving. (WORM means Write-Once,
> > Read-Many.) It's also a pain having to VACUUM a large table again just
> > because a small number of rows need to be frozen.
>
> I'm not an expert at all, but I'd like to understand this: do you plan
> that READ ONLY tables wouldn't even store transaction information?
> That should save quite a lot of space. Maybe when the table was moved
> to the compressed tablespace, MVCC information could be dropped too?
> Of course that would prevent future insert & update possibilities,
> though.

It could, but it's a lot of work for little gain. The tuple headers
look like they will compress fairly well, so why bother to remove them
at all?

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Dec 11, 2007, at 7:50 AM, Trevor Talbot wrote:
> I've actually been wanting this lately, for a couple of reasons. One
> is reduced disk footprint, but the other is reduced I/O, similar to
> how TOAST helps with large fields now. (In my particular scenario,
> TOAST can't help due to small field sizes.)

I've wanted the ability to define different toast limits for a long
time (so, for example, you could say that any fields over 100 bytes
get toasted). I might be able to get sponsorship for this; if others
are interested enough to pony up $$, please contact me.

--
Decibel!, aka Jim C. Nasby, Database Architect    decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
On Mon, 2007-12-17 at 09:20, Simon Riggs wrote:
> On Sat, 2007-12-15 at 13:32 +0100, Albert Cervera i Areny wrote:
> > > Read-Only Tables
> > > ----------------
> > > Postgres supports the concept of freezing tuples, so they can live
> > > forever within the database without needing further writes. Currently
> > > there is no command that will guarantee that a table has been
> > > completely frozen. This makes it difficult to reliably write data
> > > files to WORM media for longer-term archiving. (WORM means Write-Once,
> > > Read-Many.) It's also a pain having to VACUUM a large table again just
> > > because a small number of rows need to be frozen.
> >
> > I'm not an expert at all, but I'd like to understand this: do you plan
> > that READ ONLY tables wouldn't even store transaction information?
> > That should save quite a lot of space. Maybe when the table was moved
> > to the compressed tablespace, MVCC information could be dropped too?
> > Of course that would prevent future insert & update possibilities,
> > though.
>
> It could, but it's a lot of work for little gain. The tuple headers
> look like they will compress fairly well, so why bother to remove them
> at all?

One place for removing them would be if we do column stores, where
there would be one header per column instead of one per tuple.

--------------
Hannu