Thread: Storage Model for Partitioning
In my striving towards more effective partitioning for Postgres, I see we
have one main decision to make, and all other sub-tasks are driven directly
by that one issue. The issue is: at what point in the existing storage
model do we store the data? We discussed this in 2005, when I started work
on what became constraint exclusion, and it remains the core issue from
which all the other tasks are driven. If we can establish the basics of how
a table can be split into partitions, that allows work to progress on the
other issues.

I'd like some guidance from the senior crew on this, which is hopefully
possible without getting embroiled in all the details of partitioning, most
of which are more straightforward technical issues.

The current storage model for a table is that above the smgr layer a table
looks like a single contiguous range of blocks, while below the smgr layer
it is in fact a set of segment files.

Given where we are now, how should we change the storage model to support
partitioning? If at all.

We have two types of requirement:
- manageability - tablespaces for each partition etc.
- performance - excluding partitions etc.

I've argued from multiple sides of the fence, so I'm trying to present a
neutral view to allow us to take the best route forward. The basic options
are these:

0. Do Nothing - we don't want any of the other options.
1. Partitions are Contiguous Ranges of Blocks - as proposed for segment
exclusion based partitioning.
2. Partitions are Tables - as used by current constraint exclusion based
partitioning.
3. Partitions are RelFileNodes, but not Tables
4. Some Other Choice

In more detail...

1. Partitions are Contiguous Ranges of Blocks

Partitions are a simple subset of a table, i.e. a contiguous range of
blocks within the main block range of the table. That allows us to maintain
the current smgr model, which then:
- allows RI via SHARE locks
- avoids the need for complex planner changes
- allows unique indexes
- allows global indexes (because we only have one table)
- works naturally with synchronous scans and buffer recycling

Doing partitioning this way means we would (as a trivial example) assign
blocks 1-10 as partition 1, blocks 11-20 as partition 2, etc.; the mapping
is sketched in code after the two sub-options below.

There are two sub-options of this basic idea:

a) Dynamic Partitioning - we define partitioning based around what is
already in the table, rather than trying to force the data into a "correct"
partition. No changes to the current storage model. Useful, but it doesn't
do all that everybody wants.
- allows automated table extension, so works automatically with Slony
- allows partition-wise merge joins
- but not easily usable with declarative partitioning

b) Fixed Partitioning - we define partitions as static ranges of blocks,
which may leave us with holes in the range of BlockNumbers, plus each
partition has a maximum size that it cannot expand beyond. Probably
unacceptable.
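A rough sketch of that block-to-partition mapping, assuming fixed-width
ranges (all names are illustrative only, not actual Postgres code):

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    /* Each partition owns PARTITION_BLOCKS consecutive block numbers
     * (131072 x 8KB blocks = 1GB per partition). */
    #define PARTITION_BLOCKS 131072

    static int
    block_to_partition(BlockNumber blkno)
    {
        return (int) (blkno / PARTITION_BLOCKS);
    }

    static BlockNumber
    partition_first_block(int partno)
    {
        return (BlockNumber) partno * PARTITION_BLOCKS;
    }

The point is that nothing above the smgr layer needs to change: a partition
is just an arithmetic range within the table's existing block numbering.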
2. Partitions are Tables

The current situation. This means we have to change a number of things:
- nested loop joins, so an IndexScan can cross partitions within the
target table
- indexes, so they can refer to more than one partition
- share locking in the executor
- synchronous scans and buffer recycling, so they work across partitions
- plus we need automatic partition creation
- and a single DDL declaration on the parent table

3. Partitions are RelFileNodes, but not Tables

We allow a table to have multiple RelFileNodes, when explicitly declared
that way. This means we have to change:
- DDL, so that TABLE-level changes apply to all RelFileNodes, while
PARTITION-level changes apply to only one RelFileNode (a sketch of this
distinction follows below)
- indexes, so they can refer to more than one partition
- share locking in the executor
- synchronous scans and buffer recycling, so they work across RelFileNodes

There *are* other changes not mentioned here that are required for
partitioning, which although complex are less doubtful.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
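A minimal sketch of that TABLE-vs-PARTITION distinction for option 3
(hypothetical types and function names throughout; the real RelFileNode is
a struct, not an integer):

    typedef unsigned int RelFileNode;   /* stand-in for the real struct */

    typedef struct PartitionedRel
    {
        int          nnodes;            /* number of partitions */
        RelFileNode *nodes;             /* one RelFileNode per partition */
    } PartitionedRel;

    /* some storage-level operation, e.g. "move to tablespace X" */
    extern void apply_to_node(RelFileNode node);

    /* TABLE-level DDL touches every RelFileNode... */
    static void
    alter_table_storage(PartitionedRel *rel)
    {
        for (int i = 0; i < rel->nnodes; i++)
            apply_to_node(rel->nodes[i]);
    }

    /* ...while PARTITION-level DDL touches exactly one. */
    static void
    alter_partition_storage(PartitionedRel *rel, int partno)
    {
        apply_to_node(rel->nodes[partno]);
    }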
Simon Riggs wrote:
> 3. Partitions are RelFileNodes, but not Tables
>
> We allow a table to have multiple RelFileNodes, when explicitly declared
> that way. This means we have to change:
> - DDL, so that TABLE-level changes apply to all RelFileNodes, while
> PARTITION-level changes apply to only one RelFileNode
> - indexes, so they can refer to more than one partition
> - share locking in the executor
> - synchronous scans and buffer recycling, so they work across RelFileNodes
>
> There *are* other changes not mentioned here that are required for
> partitioning, which although complex are less doubtful.

This could also be used for global temporary tables, with some
modification: the table defines the structure, and the partitions define
the contents for individual sessions.

Regards
Pavel Stehule
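Pavel's idea might look something like this: the shared catalog entry
defines the structure once, while each backend session resolves the table
to its own private RelFileNode (all names here are invented for
illustration):

    typedef unsigned int RelFileNode;   /* stand-in for the real struct */

    #define MAX_SESSIONS 128

    typedef struct GlobalTempTable
    {
        /* ...column definitions, shared by all sessions... */
        RelFileNode session_nodes[MAX_SESSIONS];    /* 0 = none yet */
    } GlobalTempTable;

    extern RelFileNode create_private_node(void);

    static RelFileNode
    node_for_session(GlobalTempTable *t, int session_id)
    {
        if (t->session_nodes[session_id] == 0)
            t->session_nodes[session_id] = create_private_node();
        return t->session_nodes[session_id];
    }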
More dumb user questions...

Simon Riggs wrote:
> 1. Partitions are Contiguous Ranges of Blocks
>
> Partitions are a simple subset of a table, i.e. a contiguous range of
> blocks within the main block range of the table.
>
> b) Fixed Partitioning - we define partitions as static ranges of blocks,
> which may leave us with holes in the range of BlockNumbers, plus each
> partition has a maximum size that it cannot expand beyond. Probably
> unacceptable.

Clearly not going to make anyone happy if you can't fit 1.1GB of data into
your partition.

Is the following basically the same as option #3 (multiple RelFileNodes)?

1. Make an on-disk "chunk" much smaller (e.g. 64MB). Each chunk is a
contiguous range of blocks.
2. Make a table-partition (implied or explicit constraints) map to
multiple "chunks".

That would reduce fragmentation (you'd have on average 32MB's worth of
blocks wasted per partition) and allow for stretchy partitions, at the
cost of an extra layer of indirection.

For the single-partition case you'd not need to split the file, of course,
so it would end up looking much like the current arrangement.

--
Richard Huxton
Archonet Ltd
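A sketch of that extra indirection layer (64MB chunks; all names are
illustrative):

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    #define BLCKSZ       8192
    #define CHUNK_BLOCKS (64 * 1024 * 1024 / BLCKSZ)  /* 8192 blocks = 64MB */

    /* A partition maps to however many chunks it currently needs,
     * so it can stretch by appending another chunk to the list. */
    typedef struct Partition
    {
        int       nchunks;
        uint32_t *chunks;       /* on-disk chunk numbers, in logical order */
    } Partition;

    /* Translate a block offset within the partition to an on-disk spot. */
    static void
    locate_block(const Partition *p, BlockNumber blkno,
                 uint32_t *chunk, uint32_t *offset)
    {
        *chunk  = p->chunks[blkno / CHUNK_BLOCKS];
        *offset = blkno % CHUNK_BLOCKS;
    }

Since a partition wastes at most one nearly empty 64MB chunk, that is the
32MB of average waste Richard mentions.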
On Fri, 2008-01-11 at 11:34 +0000, Richard Huxton wrote:
> 1. Make an on-disk "chunk" much smaller (e.g. 64MB). Each chunk is a
> contiguous range of blocks.
> 2. Make a table-partition (implied or explicit constraints) map to
> multiple "chunks".
> That would reduce fragmentation (you'd have on average 32MB's worth of
> blocks wasted per partition) and allow for stretchy partitions, at the
> cost of an extra layer of indirection.

This sounds almost like some kind of "clustering index", where the index
contains ranges pointing to blocks of data... if the same index is also
used for inserting (i.e. the free space map is a partial "cluster index"
on blocks with free space), that would be a coarse clustering solution,
I guess...

Cheers,
Csaba.
Csaba Nagy wrote:
> On Fri, 2008-01-11 at 11:34 +0000, Richard Huxton wrote:
>> 1. Make an on-disk "chunk" much smaller (e.g. 64MB). Each chunk is a
>> contiguous range of blocks.
>> 2. Make a table-partition (implied or explicit constraints) map to
>> multiple "chunks".
>> That would reduce fragmentation (you'd have on average 32MB's worth of
>> blocks wasted per partition) and allow for stretchy partitions, at the
>> cost of an extra layer of indirection.
>
> This sounds almost like some kind of "clustering index", where the index
> contains ranges pointing to blocks of data... if the same index is also
> used for inserting (i.e. the free space map is a partial "cluster index"
> on blocks with free space), that would be a coarse clustering solution,
> I guess...

Which is roughly what Simon's original "Dynamic Partitioning" would be if
it became visible at the planner level (unless I've misunderstood). I was
picturing it in my head as a two-dimensional bitmap with value-ranges along
one axis and block-ranges along the other. Of course, unlike other indexes
it needs visibility information to be of any use.

Thinking about it, I'm not sure how my thinking would affect a full-table
seq-scan. You'd not get blocks back in order throughout the scan - would
that matter?

--
Richard Huxton
Archonet Ltd
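That two-dimensional bitmap might look something like this (a coarse
"could this block range contain this value range?" test; sizes and names
are purely illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define VALUE_RANGES 64     /* buckets along the value axis */
    #define BLOCK_RANGES 64     /* buckets along the block axis */

    /* One 64-bit row per value bucket; each row holds BLOCK_RANGES bits.
     * A set bit means some tuple in that block range may fall in that
     * value range. As Richard notes, a clear bit could only be trusted
     * if the map also carried visibility information. */
    typedef struct RangeBitmap
    {
        uint64_t rows[VALUE_RANGES];
    } RangeBitmap;

    static bool
    might_contain(const RangeBitmap *map, int value_bucket, int block_bucket)
    {
        return (map->rows[value_bucket] >> block_bucket) & 1;
    }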
> Which is roughly what Simon's original "Dynamic Partitioning" would be
> if it became visible at the planner level (unless I've misunderstood). I
> was picturing it in my head as a two-dimensional bitmap with
> value-ranges along one axis and block-ranges along the other. Of course,
> unlike other indexes it needs visibility information to be of any use.

But why not have it as a normal index of ranges? I'm not familiar with the
GiST extensions, but this sounds like a set of records (segments, in
Simon's terms) indexed by their interval position on a line... isn't that
covered by some GiST index type?

> Thinking about it, I'm not sure how my thinking would affect a
> full-table seq-scan. You'd not get blocks back in order throughout the
> scan - would that matter?

That could be covered by something like the bitmap scan, just on a coarser
level, with the bits covering segments instead of blocks.

Cheers,
Csaba.
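A coarse, segment-granularity bitmap scan might look roughly like this
(hypothetical names; 1GB segments assumed):

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    #define SEGMENT_BLOCKS 131072       /* 1GB segments at 8KB blocks */

    extern void scan_block(BlockNumber blkno);

    /* Visit only the segments whose bit is set, skipping whole 1GB
     * stretches of the table that the coarse bitmap excludes. Blocks
     * still come back in order within each visited segment. */
    static void
    coarse_bitmap_scan(const uint64_t *segment_bits, int nsegments)
    {
        for (int seg = 0; seg < nsegments; seg++)
        {
            if (!((segment_bits[seg / 64] >> (seg % 64)) & 1))
                continue;               /* whole segment excluded */
            for (BlockNumber b = 0; b < SEGMENT_BLOCKS; b++)
                scan_block((BlockNumber) seg * SEGMENT_BLOCKS + b);
        }
    }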
On Fri, 2008-01-11 at 11:34 +0000, Richard Huxton wrote:
> Is the following basically the same as option #3 (multiple RelFileNodes)?
>
> 1. Make an on-disk "chunk" much smaller (e.g. 64MB). Each chunk is a
> contiguous range of blocks.
> 2. Make a table-partition (implied or explicit constraints) map to
> multiple "chunks".
> That would reduce fragmentation (you'd have on average 32MB's worth of
> blocks wasted per partition) and allow for stretchy partitions, at the
> cost of an extra layer of indirection.
>
> For the single-partition case you'd not need to split the file, of
> course, so it would end up looking much like the current arrangement.

We need to think about the "data model" of the storage layer. Space itself
isn't the issue; it's the assumptions that all of the other subsystems
currently make about how a table is structured, indexed, accessed and
manipulated.

Currently: Table 1:M Segments

Option 1: Table 1:M Segments and *separately* Table 1:M Partitions, so
partitions always have a maximum size. The chunk size just changes the
scale of the impact; it doesn't remove the problems of holes, maximum
sizes etc.

e.g. an empty table with 10 partitions would be
a) 0 bytes in 1 file
b) 0 bytes in 1 file, plus 9GB in 9 files all full of empty blocks

e.g. a table with 10 partitions each of 1.5GB would be
a) 15GB in 15 files
b) hit the maximum size limit of a partition: ERROR

Option 2: Table 1:M Child Tables 1:M Segments

e.g. an empty table with 10 partitions would be
0 bytes in each of 10 files

e.g. a table with 10 partitions each of 1.5GB would be
15GB in 10 groups of 2 files

Option 3: Table 1:M Nodes 1:M Segments

e.g. an empty table with 10 partitions would be
0 bytes in each of 10 files

e.g. a table with 10 partitions each of 1.5GB would be
15GB in 10 groups of 2 files
(the arithmetic behind these file counts is sketched below)

So 1b) seems definitely out.

The implications of 2 and 3 are what I'm worried about, which is why the
shortcomings of 1a) seem acceptable currently.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
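The file counts above follow from the current 1GB segment size; a tiny
worked example (illustrative only):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double seg_gb  = 1.0;     /* current smgr segment size */
        const double part_gb = 1.5;     /* example partition size */
        const int    nparts  = 10;

        /* Option 1a: the whole table is one contiguous block range,
         * so 15GB splits into 15 x 1GB segment files. */
        printf("1a:  %.0f files\n", ceil(part_gb * nparts / seg_gb));

        /* Options 2 and 3: each partition is segmented separately,
         * so each 1.5GB partition is a 1GB file plus a 0.5GB file. */
        printf("2/3: %d groups of %.0f files\n",
               nparts, ceil(part_gb / seg_gb));
        return 0;
    }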
Simon Riggs wrote:
> On Fri, 2008-01-11 at 11:34 +0000, Richard Huxton wrote:
>
>> Is the following basically the same as option #3 (multiple RelFileNodes)?
>>
>> 1. Make an on-disk "chunk" much smaller (e.g. 64MB). Each chunk is a
>> contiguous range of blocks.
>> 2. Make a table-partition (implied or explicit constraints) map to
>> multiple "chunks".
>> That would reduce fragmentation (you'd have on average 32MB's worth of
>> blocks wasted per partition) and allow for stretchy partitions, at the
>> cost of an extra layer of indirection.
>>
>> For the single-partition case you'd not need to split the file, of
>> course, so it would end up looking much like the current arrangement.
>
> We need to think about the "data model" of the storage layer. Space
> itself isn't the issue; it's the assumptions that all of the other
> subsystems currently make about how a table is structured, indexed,
> accessed and manipulated.

Which was why I was thinking you'd want to maintain indexes etc. thinking
in terms of a table being a contiguous set of blocks, with the mapping to
an actual on-disk block taking place below that level. (If I've understood
you.)

> Currently: Table 1:M Segments
>
> Option 1: Table 1:M Segments and *separately* Table 1:M Partitions, so
> partitions always have a maximum size. The chunk size just changes the
> scale of the impact; it doesn't remove the problems of holes, maximum
> sizes etc.
>
> e.g. an empty table with 10 partitions would be
> a) 0 bytes in 1 file
> b) 0 bytes in 1 file, plus 9GB in 9 files all full of empty blocks

Well, presumably 0GB in 10 files, but 10GB worth of block numbers
"pre-allocated".

> e.g. a table with 10 partitions each of 1.5GB would be
> a) 15GB in 15 files

With the limitation that any given partition might contain a mix of data
ranges (e.g. 2005 lies half in partition 2 and half in partition 3).

> b) hit the maximum size limit of a partition: ERROR

In the case of 1b, you could have a segment mapping to more than one
partition, avoiding the error. So 2004's data is in partition 1, 2005's is
in partitions 2 and 3 (where 3 is half empty), and 2006's is in partition
4. However, this does mean you've got a lot of wasted block numbers. If you
were using explicit (fixed) partitioning and chose a bad set of criteria,
your maximum table size could be substantially reduced.

> Option 2: Table 1:M Child Tables 1:M Segments
>
> e.g. an empty table with 10 partitions would be
> 0 bytes in each of 10 files
>
> e.g. a table with 10 partitions each of 1.5GB would be
> 15GB in 10 groups of 2 files

Cross-table indexes and constraints would be useful outside of the current
scenario.

> Option 3: Table 1:M Nodes 1:M Segments
>
> e.g. an empty table with 10 partitions would be
> 0 bytes in each of 10 files
>
> e.g. a table with 10 partitions each of 1.5GB would be
> 15GB in 10 groups of 2 files

Ah, so this does seem to be roughly the same as what I was rambling about.
This would presumably mean that rather than (table, block #) specifying the
location of a row, you'd need (table, node #, block #).

> So 1b) seems definitely out.
>
> The implications of 2 and 3 are what I'm worried about, which is why the
> shortcomings of 1a) seem acceptable currently.

--
Richard Huxton
Archonet Ltd
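The widened row address Richard describes might look like this
(hypothetical structs; not actual Postgres code):

    #include <stdint.h>

    typedef uint32_t BlockNumber;
    typedef uint16_t OffsetNumber;

    /* Today a heap row is addressed by (block #, line pointer) within
     * a single table-wide block range. */
    typedef struct PlainTid
    {
        BlockNumber  blkno;
        OffsetNumber offnum;
    } PlainTid;

    /* Under option 3, an index entry would also have to say which of
     * the table's RelFileNodes the block number is relative to. */
    typedef struct NodeTid
    {
        uint16_t     nodeno;    /* which partition / RelFileNode */
        BlockNumber  blkno;     /* block within that node */
        OffsetNumber offnum;    /* line pointer within the block */
    } NodeTid;

Every index entry, and everything that consumes one, would have to carry
and understand the extra node number, which is presumably part of why the
implications of options 2 and 3 are the worrying ones.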