Thread: Fwd: [GENERAL] 4B row limit for CLOB tables
>> On 1/29/15, Roger Pack <rogerdpack2@gmail.com> wrote: >>> Hello. I see on this page a mention of basically a 4B row limit for >>> tables that have BLOB's >> >> Oops I meant for BYTEA or TEXT columns, but it's possible the >> reasoning is the same... > > It only applies to large objects, not bytea or text. OK I think I figured out possibly why the wiki says this. I guess BYTEA entries > 2KB will be autostored via TOAST, which uses an OID in its backend. So BYTEA has a same limitation. It appears that disabling TOAST is not an option [1]. So I guess if the number of BYTEA entries (in the sum all tables? partitioning doesn't help?) with size > 2KB is > 4 billion then there is actually no option there? If this occurred it might cause "all sorts of things to break"? [2] Thanks! -roger- [1] http://www.postgresql.org/message-id/20130405140348.GC4326@awork2.anarazel.de [2] http://www.postgresql.org/message-id/CAL1QdWfb-p5kE9DT2pMqBxohaKG=vxmDREmSBjc+7TkbOeKBBw@mail.gmail.com
On 1/30/15 11:54 AM, Roger Pack wrote: >>> On 1/29/15, Roger Pack <rogerdpack2@gmail.com> wrote: >>>> Hello. I see on this page a mention of basically a 4B row limit for >>>> tables that have BLOB's >>> >>> Oops I meant for BYTEA or TEXT columns, but it's possible the >>> reasoning is the same... >> >> It only applies to large objects, not bytea or text. > > OK I think I figured out possibly why the wiki says this. I guess > BYTEA entries > 2KB will be autostored via TOAST, which uses an OID in > its backend. So BYTEA has a same limitation. It appears that > disabling TOAST is not an option [1]. > So I guess if the number of BYTEA entries (in the sum all tables? > partitioning doesn't help?) with size > 2KB is > 4 billion then there > is actually no option there? If this occurred it might cause "all > sorts of things to break"? [2] It's a bit more complex than that. First, toast isn't limited to bytea; it holds for ALL varlena fields in a table that are allowed to store externally. Second, the limit is actually per-table: every table gets it's own toast table, and each toast table is limited to 4B unique OIDs. Third, the OID counter is actually global, but the code should handle conflicts by trying to get another OID. See toast_save_datum(), which calls GetNewOidWithIndex(). Now, the reality is that GetNewOidWithIndex() is going to keep incrementing the global OID counter until it finds an OID that isn't in the toast table. That means that if you actually get anywhere close to using 4B OIDs you're going to become extremely unhappy with the performance of toasting new data. I don't think it would be horrifically hard to change the way toast OIDs are assigned (I'm thinking we'd basically switch to creating a sequence for every toast table), but I don't think anyone's ever tried to push toast hard enough to hit this kind of limit. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
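As an illustration of the per-table TOAST mechanics described above, a minimal sketch (the table name and the pg_toast relation name are placeholders) for finding which TOAST table backs a given table and how many out-of-line values -- i.e. how many OIDs -- it currently holds:

    -- Which TOAST table backs this table?
    SELECT reltoastrelid::regclass AS toast_table
    FROM pg_class
    WHERE oid = 'my_big_table'::regclass;

    -- Each toasted value occupies one chunk_id (one OID), split across rows with
    -- increasing chunk_seq; counting the chunk_seq = 0 rows therefore counts values.
    -- Substitute the toast table name reported by the query above:
    SELECT count(*) AS toasted_values
    FROM pg_toast.pg_toast_16384
    WHERE chunk_seq = 0;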
Oops forgot to forward to the list (suggestion/feature request to the list admin for the various pg lists: make the default "reply to" go to the list, not the sender, if at all possible). Response below: On 1/30/15, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > On 1/30/15 11:54 AM, Roger Pack wrote: >>>> On 1/29/15, Roger Pack <rogerdpack2@gmail.com> wrote: >>>>> Hello. I see on this page a mention of basically a 4B row limit for >>>>> tables that have BLOB's >>>> >>>> Oops I meant for BYTEA or TEXT columns, but it's possible the >>>> reasoning is the same... >>> >>> It only applies to large objects, not bytea or text. >> >> OK I think I figured out possibly why the wiki says this. I guess >> BYTEA entries > 2KB will be autostored via TOAST, which uses an OID in >> its backend. So BYTEA has a same limitation. It appears that >> disabling TOAST is not an option [1]. >> So I guess if the number of BYTEA entries (in the sum all tables? >> partitioning doesn't help?) with size > 2KB is > 4 billion then there >> is actually no option there? If this occurred it might cause "all >> sorts of things to break"? [2] > > It's a bit more complex than that. First, toast isn't limited to bytea; > it holds for ALL varlena fields in a table that are allowed to store > externally. Second, the limit is actually per-table: every table gets > it's own toast table, and each toast table is limited to 4B unique OIDs. > Third, the OID counter is actually global, but the code should handle > conflicts by trying to get another OID. See toast_save_datum(), which > calls GetNewOidWithIndex(). > > Now, the reality is that GetNewOidWithIndex() is going to keep > incrementing the global OID counter until it finds an OID that isn't in > the toast table. That means that if you actually get anywhere close to > using 4B OIDs you're going to become extremely unhappy with the > performance of toasting new data. OK so "system stability" doesn't degrade per se when it wraps [since they all use that GetNewOid method or similar [?] good to know. So basically when it gets near 4B TOAST'ed rows it may have to wrap that counter and search for "unused" number, and for each number it's querying the TOAST table to see if it's already used, degrading performance. So I guess partitioning tables for now is an acceptable work around, good to know. Thanks much for your response, good to know the details before we dive into postgres with our 8B row table with BYTEA's in it :)
On 31/01/15 14:55, Roger Pack wrote: > [...] > > Now, the reality is that GetNewOidWithIndex() is going to keep > incrementing the global OID counter until it finds an OID that isn't in > the toast table. That means that if you actually get anywhere close to > using 4B OIDs you're going to become extremely unhappy with the > performance of toasting new data. > OK so "system stability" doesn't degrade per se when it wraps [since > they all use that GetNewOid method or similar [?] good to know. > > So basically when it gets near 4B TOAST'ed rows it may have to wrap that > counter and search for "unused" number, and for each number it's > querying the TOAST table to see if it's already used, degrading > performance. The problem here is that performance degrades exponentially, or worse. Speaking here from experience, we already tested this for a very similar case (table creation, where two oids are consumed from a global sequence when inserting to pg_class). Have a look at http://www.slideshare.net/nosys/billion-tables-project-nycpug-2013, slides 43-45. We tested there this scenario and shown that table creations per second dropped from 10K to a few per second and then to a few per day. In the graphs you can't even realize there were more tables been created. At around 8K tables from the theoretical limit of 4B oids consumed, the process basically stopped (doing more insertions). Hope that this information helps. Best regards, Álvaro -- Álvaro Hernández Tortosa ----------- 8Kdata
On 01/31/2015 12:25 AM, Jim Nasby wrote: > [snip] > It's a bit more complex than that. First, toast isn't limited to > bytea; it holds for ALL varlena fields in a table that are allowed to > store externally. Second, the limit is actually per-table: every table > gets it's own toast table, and each toast table is limited to 4B > unique OIDs. Third, the OID counter is actually global, but the code > should handle conflicts by trying to get another OID. See > toast_save_datum(), which calls GetNewOidWithIndex(). > > Now, the reality is that GetNewOidWithIndex() is going to keep > incrementing the global OID counter until it finds an OID that isn't > in the toast table. That means that if you actually get anywhere close > to using 4B OIDs you're going to become extremely unhappy with the > performance of toasting new data. Indeed ...... > I don't think it would be horrifically hard to change the way toast > OIDs are assigned (I'm thinking we'd basically switch to creating a > sequence for every toast table), but I don't think anyone's ever tried > to push toast hard enough to hit this kind of limit. We did. The Billion Table Project, part2 (a.k.a. "when does Postgres' OID allocator become a bottleneck").... The allocator becomes essentially unusable at about 2.1B OIDs, where it performed very well at "quite empty"(< 100M objects) levels. So yes, using one sequence per TOAST table should help. Combined with the new SequenceAMs / sequence implementation being proposed (specifically: one file for all sequences in a certain tablespace) this should scale much better. My 2c. Regards, / J.L.
On 1/30/15, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > On 1/30/15 11:54 AM, Roger Pack wrote: >>>> On 1/29/15, Roger Pack <rogerdpack2@gmail.com> wrote: >>>>> Hello. I see on this page a mention of basically a 4B row limit for >>>>> tables that have BLOB's >>>> >>>> Oops I meant for BYTEA or TEXT columns, but it's possible the >>>> reasoning is the same... >>> >>> It only applies to large objects, not bytea or text. >> >> OK I think I figured out possibly why the wiki says this. I guess >> BYTEA entries > 2KB will be autostored via TOAST, which uses an OID in >> its backend. So BYTEA has a same limitation. It appears that >> disabling TOAST is not an option [1]. >> So I guess if the number of BYTEA entries (in the sum all tables? >> partitioning doesn't help?) with size > 2KB is > 4 billion then there >> is actually no option there? If this occurred it might cause "all >> sorts of things to break"? [2] > > It's a bit more complex than that. First, toast isn't limited to bytea; > it holds for ALL varlena fields in a table that are allowed to store > externally. Second, the limit is actually per-table: every table gets > it's own toast table, and each toast table is limited to 4B unique OIDs. > Third, the OID counter is actually global, but the code should handle > conflicts by trying to get another OID. See toast_save_datum(), which > calls GetNewOidWithIndex(). > > Now, the reality is that GetNewOidWithIndex() is going to keep > incrementing the global OID counter until it finds an OID that isn't in > the toast table. That means that if you actually get anywhere close to > using 4B OIDs you're going to become extremely unhappy with the > performance of toasting new data. OK so "system stability" doesn't degrade per se when it wraps, good to know. So basically when it gets near 4B rows it may have to wrap that counter multiple times, and for each "entry" it's searching if it's already used, etc. So I guess partitioning tables for now is an acceptable work around, good to know. Thanks much for your response, good to know the details before we dive into postgres with our 8B row table with BYTEA's in it :)
On 2/2/15, José Luis Tallón <jltallon@adv-solutions.net> wrote: > On 01/31/2015 12:25 AM, Jim Nasby wrote: >> [snip] >> It's a bit more complex than that. First, toast isn't limited to >> bytea; it holds for ALL varlena fields in a table that are allowed to >> store externally. Second, the limit is actually per-table: every table >> gets it's own toast table, and each toast table is limited to 4B >> unique OIDs. Third, the OID counter is actually global, but the code >> should handle conflicts by trying to get another OID. See >> toast_save_datum(), which calls GetNewOidWithIndex(). >> >> Now, the reality is that GetNewOidWithIndex() is going to keep >> incrementing the global OID counter until it finds an OID that isn't >> in the toast table. That means that if you actually get anywhere close >> to using 4B OIDs you're going to become extremely unhappy with the >> performance of toasting new data. > > Indeed ...... > >> I don't think it would be horrifically hard to change the way toast >> OIDs are assigned (I'm thinking we'd basically switch to creating a >> sequence for every toast table), but I don't think anyone's ever tried >> to push toast hard enough to hit this kind of limit. > > We did. The Billion Table Project, part2 (a.k.a. "when does Postgres' > OID allocator become a bottleneck").... The allocator becomes > essentially unusable at about 2.1B OIDs, where it performed very well at > "quite empty"(< 100M objects) levels. > > So yes, using one sequence per TOAST table should help. > Combined with the new SequenceAMs / sequence implementation being > proposed (specifically: one file for all sequences in a certain > tablespace) this should scale much better. But it wouldn't be perfect, right? I mean if you had multiple deletion/insertions and pass 4B then the "one sequence per TOAST table" would still wrap [albeit more slowly], and performance start degrading the same way. And there would still be the hard 4B limit. Perhaps the foreign key to the TOAST table could be changed from oid (32 bits) to something else (64 bits) [as well the sequence] so that it never wraps? What do you think? And would a more aggressive change like this have a chance of being accepted into the code base? Thanks. -roger-
Roger Pack <rogerdpack2@gmail.com> writes: > On 2/2/15, José Luis Tallón <jltallon@adv-solutions.net> wrote: >> So yes, using one sequence per TOAST table should help. >> Combined with the new SequenceAMs / sequence implementation being >> proposed (specifically: one file for all sequences in a certain >> tablespace) this should scale much better. > But it wouldn't be perfect, right? I mean if you had multiple > deletion/insertions and pass 4B then the "one sequence per TOAST > table" would still wrap [albeit more slowly], and performance start > degrading the same way. And there would still be the hard 4B limit. > Perhaps the foreign key to the TOAST table could be changed from oid > (32 bits) to something else (64 bits) [as well the sequence] so that > it never wraps? What do you think? And would a more aggressive change > like this have a chance of being accepted into the code base? There has been some thought about this, but I have seen no, zero, reports of anyone actually running into problems *in practice* (as opposed to contrived cases like "can we create a billion tables"). So we probably aren't going to want to address it until it starts being a real problem. The reason it's not as significant as you might think is that small field values (less than a couple KB *after compression*) don't get pushed out to the TOAST table, so they don't consume OIDs. And large field values, like megabytes worth, aren't a problem either because you just aren't gonna have that many of them. (Simple arithmetic.) You could potentially get into trouble if you had a whole lot of entries that were just a little over the toasting threshold, because then you'd have a lot of OIDs consumed but still a manageable total amount of disk space. But that doesn't seem to be a very common usage pattern. Also, partitioning the table largely eliminates the problem because each partition will have its own TOAST table. I'm on record as saying that many people are far too quick to decide that they need partitioning; but once you get into the volume of data where 4B toast entries starts to look like a limitation, you will probably have other reasons to think that you need to partition. In short, this is something that's theoretically interesting but doesn't seem worth doing in practice --- yet anyway. regards, tom lane
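To make the partitioning point concrete, a small sketch using the inheritance-style partitioning available today (table names are hypothetical): every child table gets its own TOAST table, and with it its own 4B-OID space.

    CREATE TABLE sessions (id bigint, created date, payload text);
    CREATE TABLE sessions_2015 (CHECK (created >= DATE '2015-01-01' AND created < DATE '2016-01-01'))
        INHERITS (sessions);
    CREATE TABLE sessions_2016 (CHECK (created >= DATE '2016-01-01' AND created < DATE '2017-01-01'))
        INHERITS (sessions);

    -- Each relation carries its own TOAST table:
    SELECT relname, reltoastrelid::regclass AS toast_table
    FROM pg_class
    WHERE relname LIKE 'sessions%' AND relkind = 'r'
    ORDER BY relname;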
On 02/02/2015 09:36 PM, Roger Pack wrote: > On 2/2/15, José Luis Tallón <jltallon@adv-solutions.net> wrote: >> On 01/31/2015 12:25 AM, Jim Nasby wrote: >>> [snip] >>> It's a bit more complex than that. First, toast isn't limited to >>> bytea; it holds for ALL varlena fields in a table that are allowed to >>> store externally. Second, the limit is actually per-table: every table >>> gets it's own toast table, and each toast table is limited to 4B >>> unique OIDs. Third, the OID counter is actually global, but the code >>> should handle conflicts by trying to get another OID. See >>> toast_save_datum(), which calls GetNewOidWithIndex(). >>> >>> Now, the reality is that GetNewOidWithIndex() is going to keep >>> incrementing the global OID counter until it finds an OID that isn't >>> in the toast table. That means that if you actually get anywhere close >>> to using 4B OIDs you're going to become extremely unhappy with the >>> performance of toasting new data. >> Indeed ...... >> >>> I don't think it would be horrifically hard to change the way toast >>> OIDs are assigned (I'm thinking we'd basically switch to creating a >>> sequence for every toast table), but I don't think anyone's ever tried >>> to push toast hard enough to hit this kind of limit. >> We did. The Billion Table Project, part2 (a.k.a. "when does Postgres' >> OID allocator become a bottleneck").... The allocator becomes >> essentially unusable at about 2.1B OIDs, where it performed very well at >> "quite empty"(< 100M objects) levels. >> >> So yes, using one sequence per TOAST table should help. >> Combined with the new SequenceAMs / sequence implementation being >> proposed (specifically: one file for all sequences in a certain >> tablespace) this should scale much better. > But it wouldn't be perfect, right? I mean if you had multiple > deletion/insertions and pass 4B then the "one sequence per TOAST > table" would still wrap [albeit more slowly], and performance start > degrading the same way. And there would still be the hard 4B limit. > Perhaps the foreign key to the TOAST table could be changed from oid > (32 bits) to something else (64 bits) [as well the sequence] so that > it never wraps? Hmm.... 2^32 times aprox. 2kB (as per usual heuristics, ~4 rows per heap page) is 8796093022208 (~9e13) bytes ... which results in 8192 1GB segments :O Looks like partitioning might be needed much sooner than that (if only for index efficiency reasons)... unless access is purely sequential. The problem with changing the id from 32 to 64 bits is that the storage *for everybody else* doubles, making the implementation slower for most.... though this might be actually not that important. The alternative could be some "long LOB" ("HugeOBject"?) using the equivalent to "serial8" whereas regular LOBs would use "serial4". Anybody actually reaching this limit out there? Regards, / J .L.
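A quick sanity check of that arithmetic, runnable from psql:

    SELECT pg_size_pretty((2^32 * 2048)::bigint);   -- 8192 GB, i.e. 8192 1GB segments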
On 2/2/15 3:50 PM, José Luis Tallón wrote: > Hmm.... 2^32 times aprox. 2kB (as per usual heuristics, ~4 rows per heap > page) is 8796093022208 (~9e13) bytes > ... which results in 8192 1GB segments :O > Looks like partitioning might be needed much sooner than that (if only > for index efficiency reasons)... unless access is purely sequential. > > The problem with changing the id from 32 to 64 bits is that the storage > *for everybody else* doubles, making the implementation slower for > most.... though this might be actually not that important. > The alternative could be some "long LOB" ("HugeOBject"?) using the > equivalent to "serial8" whereas regular LOBs would use "serial4". Well, it depends on how we did this. We could (for example) add a field to pg_class that determines what type to use for toast pointers; OID, int, or bigint. That could then be taken into account in the *toast* functions. But as others have pointed out, we haven't even had any real complaints about toast using OIDs as being an issue until now, so I think it's premature to start messing with this. At most it's just something to keep in mind so we don't preclude doing this in the future. BTW, regarding the size of what gets toasted; I've often thought it would be useful to allow a custom size limit on columns so that you could easily force data to be toasted if you knew you were very unlikely to access it. Basically, a cheap form of vertical partitioning. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
> Hmm.... 2^32 times aprox. 2kB (as per usual heuristics, ~4 rows per heap page) is 8796093022208 (~9e13) bytes
> ... which results in 8192 1GB segments :O

8192 1GB segments is just 8TB, it's not _that_ large. At TripAdvisor we’ve been using a NoSQL solution to do session storage. We are looking to probably swap that out to be Postgres (every other db backing the site is Postgres). Essentially, what I’m building is a system with 1 logical table that maps session id to a 2KB+ grab bag of ever changing session attributes which is partially normalized, partially json. 315 million uniques a month multiplied by the retention policy means I need to hold 2-4 billion session objects (and somehow expire old ones). Additionally, most http calls can update the session, so between maintenance windows I expect to take around 20 billion 'upserts’. Obviously, I will have to shard and partition the table in practice, but this weekend I ran a test that demonstrated that a single table on a 9.4 server + logical replication + Dell 730xd can handle 4x that workload. Well, it can for 38 hours… until you wrap xid’s on the toast table. :P I’ll be the first to admit that isn’t the normal use case though. I’m happy to have found this thread, however, because I’m going to have to build around the global oid counter, explicitly to prevent the problem I explain below regarding clustering.

> Anybody actually reaching this limit out there?

Well, it's not the 4 billion row limit that concerns me, it's the global shared counter in conjunction with pg_restore/clustering that is actually pretty concerning.

Just checked through all of TripAdvisor’s normal databases and the max tuples I see in a single toast table is 17,000,000, so that is still a couple of orders of magnitude too small. (However, close enough that it’ll be a concern in a few years.)

However, I do have active databases where the current oid is between 1 billion and 2 billion. They were last dump-restored for a hardware upgrade a couple years ago and were a bit more than half the size. I therefore can imagine that I have tables which are keyed by ~8,000,000 consecutive oids.

I would argue that when it wraps there will be a single insert that will probably block for 2-5 minutes while it tries to accomplish ~8,000,000 index scans inside of GetNewOidWithIndex. Even partitioning doesn’t protect you from this potential problem.

What's even more weird is that this issue can be triggered by consuming too many oid’s in a different database in the same cluster (i.e. creating large amounts of temp tables).

> The problem with changing the id from 32 to 64 bits is that the storage *for everybody else* doubles, making the implementation slower for most.... though this might be actually not that important.

Well, you aren’t doubling the storage. Even if you have to store the key in 4 places, you are adding 16 bytes per TOAST tuple. If we work off the 2KB estimate for each TOAST tuple, then you are only increasing the storage by 0.7%. I’m sure there are more hidden costs but we are really only talking about a low single digit percent increase. In exchange, you get to drop one index scan per toast insert; an index scan looking in the only hot part of the index.

That being said I’d be perfectly happy merely giving each TOAST table its own sequence as that almost entirely mitigates the risk of an unexpected lock up on reasonably sized tables/partitions, and provides a functional work around for those of us with larger than average installs.

- Matt K
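A rough sketch of the kind of check described above -- which TOAST tables in the current database hold the most chunks. This is an upper bound on the OIDs they have consumed, since one large value spans several chunks under a single chunk_id:

    SELECT oid::regclass AS toast_table,
           reltuples::bigint AS approx_chunks   -- estimate maintained by (auto)vacuum
    FROM pg_class
    WHERE relkind = 't'                         -- 't' = TOAST tables
    ORDER BY reltuples DESC
    LIMIT 10;

    -- The global OID counter itself is visible (as of the last checkpoint) in pg_controldata
    -- output ("Latest checkpoint's NextOID"); PostgreSQL 9.6 and later also expose it with:
    --   SELECT next_oid FROM pg_control_checkpoint();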
On 02/03/2015 03:44 AM, Jim Nasby wrote: > [snip] >> The alternative could be some "long LOB" ("HugeOBject"?) using the >> equivalent to "serial8" whereas regular LOBs would use "serial4". > > Well, it depends on how we did this. We could (for example) add a > field to pg_class that determines what type to use for toast pointers; > OID, int, or bigint. That could then be taken into account in the > *toast* functions. > > But as others have pointed out, we haven't even had any real > complaints about toast using OIDs as being an issue until now, so I > think it's premature to start messing with this. At most it's just > something to keep in mind so we don't preclude doing this in the future. A patch creating those HOBs (Huge Objects) might well make sense *after* the sequence refactoring got merged. Removing the bottleneck due to the OID allocator for this use case will be definitively welcome (I don't dare to code that just yet, but here's hoping someone will step in O:-) > BTW, regarding the size of what gets toasted; I've often thought it > would be useful to allow a custom size limit on columns so that you > could easily force data to be toasted if you knew you were very > unlikely to access it. Basically, a cheap form of vertical partitioning. Hmmm.... alter column set storage external / set storage extended ? From http://www.postgresql.org/docs/9.4/static/sql-altertable.html : ALTER [ COLUMN ] column_name SET STORAGE { PLAIN| EXTERNAL | EXTENDED | MAIN } This would do what you described, right? HTH, / J.L.
Matthew Kelly <mkelly@tripadvisor.com> writes: > However, I do have active databases where the current oid is between 1 billion and 2 billion. They were last dump-restoredfor a hardware upgrade a couple years ago and were a bit more than half the size. I therefore can imagine thatI have tables which are keyed by ~8,000,000 consecutive oids. > I would argue that when it wraps there will be a single insert that will probably block for 2-5 minutes while it triesto accomplish ~8,000,000 index scans inside of GetNewOidWithIndex. Even partitioning doesn’t protect you from thispotential problem. That may be a hazard, but ... > That being said I’d be perfectly happy merely giving each TOAST table its own sequence as that almost entire mitigatesthe risk of an unexpected lock up on reasonably sized tables/partitions, and provides a functional work around forthose of us with larger than average installs. ... this "fix" would actually make things enormously worse. With the single counter feeding all tables, you at least have a reasonable probability that there are not enormously long runs of consecutive OIDs in any one toast table. With a sequence per table, you are nearly guaranteed that there are such runs, because inserts into other tables don't create a break. (This effect is also why you're wrong to claim that partitioning can't fix it.) regards, tom lane
On 2/3/15 10:01 AM, José Luis Tallón wrote:
> Hmmm.... alter column set storage external / set storage extended ?
>
> From http://www.postgresql.org/docs/9.4/static/sql-altertable.html :
> ALTER [ COLUMN ] column_name SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }
>
> This would do what you described, right?

EXTENDED is the default for most TOAST-able types and is still subject to TOAST_TUPLE_THRESHOLD which is normally 2K. EXTERNAL is the same but with no compression.

See: http://www.postgresql.org/docs/9.4/static/storage-toast.html

--
- David Steele
david@pgmasters.net
On 2/3/15 9:50 AM, David Steele wrote: > On 2/3/15 10:01 AM, José Luis Tallón wrote: > >> Hmmm.... alter column set storage external / set storage extended ? >> >> From http://www.postgresql.org/docs/9.4/static/sql-altertable.html : >> ALTER [ COLUMN ] column_name SET STORAGE { PLAIN | EXTERNAL | >> EXTENDED | MAIN } >> >> This would do what you described, right? > > EXTENDED is the default for most TOAST-able types and is still subject > to TOAST_TUPLE_THRESHOLD which is normally 2K. EXTERNAL is the same but > with no compression. > > See: http://www.postgresql.org/docs/9.4/static/storage-toast.html Right. I'd like to be able to set per-column TOAST_TUPLE_THRESHOLD. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On 2/3/15 9:01 AM, Tom Lane wrote: > Matthew Kelly <mkelly@tripadvisor.com> writes: >> However, I do have active databases where the current oid is between 1 billion and 2 billion. They were last dump-restoredfor a hardware upgrade a couple years ago and were a bit more than half the size. I therefore can imagine thatI have tables which are keyed by ~8,000,000 consecutive oids. > >> I would argue that when it wraps there will be a single insert that will probably block for 2-5 minutes while it triesto accomplish ~8,000,000 index scans inside of GetNewOidWithIndex. Even partitioning doesn’t protect you from thispotential problem. > > That may be a hazard, but ... > >> That being said I’d be perfectly happy merely giving each TOAST table its own sequence as that almost entire mitigatesthe risk of an unexpected lock up on reasonably sized tables/partitions, and provides a functional work around forthose of us with larger than average installs. > > ... this "fix" would actually make things enormously worse. With the > single counter feeding all tables, you at least have a reasonable > probability that there are not enormously long runs of consecutive OIDs in > any one toast table. With a sequence per table, you are nearly guaranteed > that there are such runs, because inserts into other tables don't create a > break. > > (This effect is also why you're wrong to claim that partitioning can't fix > it.) That's assuming that toasting is evenly spread between tables. In my experience, that's not a great bet... -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On 2/3/15 5:27 PM, Jim Nasby wrote: > On 2/3/15 9:50 AM, David Steele wrote: >> EXTENDED is the default for most TOAST-able types and is still subject >> to TOAST_TUPLE_THRESHOLD which is normally 2K. EXTERNAL is the same but >> with no compression. >> >> See: http://www.postgresql.org/docs/9.4/static/storage-toast.html > > Right. I'd like to be able to set per-column TOAST_TUPLE_THRESHOLD. No argument there. There are some columns that I would prefer to always TOAST because even 2K can be very big for some use cases. -- - David Steele david@pgmasters.net
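For reference, the per-column storage setting being discussed looks like this in practice (table and column names are hypothetical). Note that it changes how a value is stored once the tuple crosses the toast threshold, not the ~2KB threshold itself:

    -- Store this column out of line, uncompressed:
    ALTER TABLE sessions ALTER COLUMN payload SET STORAGE EXTERNAL;

    -- Inspect the per-column strategy: 'x' = extended, 'e' = external, 'm' = main, 'p' = plain
    SELECT attname, attstorage
    FROM pg_attribute
    WHERE attrelid = 'sessions'::regclass AND attnum > 0 AND NOT attisdropped;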
> That's assuming that toasting is evenly spread between tables. In my experience, that's not a great bet...
Time to create a test:
SELECT chunk_id::bigint/100000 as id_range, count(*), count(*)/(100000::float) density
FROM (SELECT chunk_id FROM pg_toast.pg_toast_39000165 WHERE chunk_id < 100000000 AND chunk_seq = 0) f
GROUP BY id_range ORDER BY id_range;
The machine in question was restored in parallel in Sept 2013 as part of an upgrade from 8.4. It has about 2000 tables, so it's definitely not dominated by a couple of tables. Progress towards oid wraparound is about 25.6%.
With minimal effort, I found 2 bad examples, and I’m sure I can easily find more. I attached the results for those two.
There were runs of 1,100,000+ and 600,000+ chunk_ids where more than 99% of the chunk_ids are taken. After restore completion, oid densities averaged less than 20 per 100,000 and 400 per 100,000 respectively. The only reason those runs seem to be so short is that the tables were much smaller back then. I expect that next time I dump restore (necessary for upgrading OS versions due to the collation issue), I’m going to have runs closer to 20,000,000.
> ... this "fix" would actually make things enormously worse. With the
> single counter feeding all tables, you at least have a reasonable
> probability that there are not enormously long runs of consecutive OIDs in
> any one toast table. With a sequence per table, you are nearly guaranteed
> that there are such runs, because inserts into other tables don't create a
> break.
It makes each toast table independent (and far less likely to wrap). It would wrap when the sum(mods on THIS toast table) > 2^32. Right now the function looks like:
sum(mods on ALL toast tables in cluster) + sum(created normal tables in cluster * k) + sum(created temp tables in cluster * k) + [...] > 2^32,
WHERE k is the average number of oids consumed for pg_class, pg_type, etc...
In the case of an insert only table (which is a common use case for partitions), the id would only wrap when the TOAST table was “full”. On the other hand currently, it would wrap into its pg_restored section when the combined oid consuming operations on the cluster surpassed 4 billion.
That being said, I’m certainly not attached to that solution. My real argument is that although it's not a problem today, we are only about 5 years from it being a problem for large installs, and the first time you’ll hear about it is after someone has a 5 minute production outage on a database that's been taking traffic for 2 years.
- Matt K.
On Sun, Feb 1, 2015 at 03:54:03PM +0100, Álvaro Hernández Tortosa wrote: > > On 31/01/15 14:55, Roger Pack wrote: > >[...] > > > >Now, the reality is that GetNewOidWithIndex() is going to keep > >incrementing the global OID counter until it finds an OID that isn't in > >the toast table. That means that if you actually get anywhere close to > >using 4B OIDs you're going to become extremely unhappy with the > >performance of toasting new data. > >OK so "system stability" doesn't degrade per se when it wraps [since > >they all use that GetNewOid method or similar [?] good to know. > > > >So basically when it gets near 4B TOAST'ed rows it may have to wrap that > >counter and search for "unused" number, and for each number it's > >querying the TOAST table to see if it's already used, degrading > >performance. > > > The problem here is that performance degrades exponentially, or > worse. Speaking here from experience, we already tested this for a > very similar case (table creation, where two oids are consumed from > a global sequence when inserting to pg_class). Have a look at > http://www.slideshare.net/nosys/billion-tables-project-nycpug-2013, > slides 43-45. We tested there this scenario and shown that table > creations per second dropped from 10K to a few per second and then > to a few per day. In the graphs you can't even realize there were > more tables been created. At around 8K tables from the theoretical > limit of 4B oids consumed, the process basically stopped (doing more > insertions).

I had a look at our FAQ about Postgres limitations and I don't see anything that needs changing:

https://wiki.postgresql.org/wiki/FAQ#What_is_the_maximum_size_for_a_row.2C_a_table.2C_and_a_database.3F

Maximum size for a database? unlimited (32 TB databases exist)
Maximum size for a table? 32 TB
Maximum size for a row? 400 GB
Maximum size for a field? 1 GB
Maximum number of rows in a table? unlimited
Maximum number of columns in a table? 250-1600 depending on column types
Maximum number of indexes on a table? unlimited

We don't report the maximum number of tables per database, or the maximum number of TOAST values. Agreed? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
* Bruce Momjian (bruce@momjian.us) wrote: > On Sun, Feb 1, 2015 at 03:54:03PM +0100, Álvaro Hernández Tortosa wrote: > > The problem here is that performance degrades exponentially, or > > worse. Speaking here from experience, we already tested this for a > > very similar case (table creation, where two oids are consumed from > > a global sequence when inserting to pg_class). Have a look at > > http://www.slideshare.net/nosys/billion-tables-project-nycpug-2013, > > slides 43-45. We tested there this scenario and shown that table > > creations per second dropped from 10K to a few per second and then > > to a few per day. In the graphs you can't even realize there were > > more tables been created. At around 8K tables from the theoretical > > limit of 4B oids consumed, the process basically stopped (doing more > > insertions). > > I had a look at our FAQ about Postgres limitations and I don't see > anything that needs changing: > > https://wiki.postgresql.org/wiki/FAQ#What_is_the_maximum_size_for_a_row.2C_a_table.2C_and_a_database.3F > > Maximum size for a database? unlimited (32 TB databases exist) > Maximum size for a table? 32 TB > Maximum size for a row? 400 GB > Maximum size for a field? 1 GB > Maximum number of rows in a table? unlimited > Maximum number of columns in a table? 250-1600 depending on column types > Maximum number of indexes on a table? unlimited > > We don't report the maximum number of tables per database, or the > maximum number of TOAST values. Agreed? For my 2c, this limitation is a surprise to users and therefore we should add documentation to point out that it exists, unless we're going to actually fix it (which is certainly what I'd prefer to see...). As for the other discussion on the thread, having a per-table sequence would be far better as it'd reduce the wrap-around risk quite a bit and, further, we should be able to provide that without breaking the on-disk format. What would be really nice is a way to expand the size of the key when needed- in other words, instead of wrapping around, if we actually hit 4B toasted values in a table then stick a flag somewhere for the next toasted value that says "this value is in the second toast table/fork" and then go up to 4B on that one, etc. That allows us to support more than 4B toasted values and doesn't require searching for holes in the values assigned. If we end up with empty toast tables eventually, then allow reusing them. Perhaps vacuum can even be used to make a note somewhere saying "this toast table is now empty and can be reused". In the end, I'd like to think we can do better here than having a hard limit at 4B when it comes to how many values over a few KB we can store. As mentioned, that isn't all that much these days. I'm not saying that my proposal or what's been proposed upthread is an answer, but I've certainly build PG systems which store over 4B rows and it's not hard to imagine cases where I might have wanted a toasted value for each of those rows. Thanks, Stephen
Stephen Frost <sfrost@snowman.net> writes: > * Bruce Momjian (bruce@momjian.us) wrote: >> On Sun, Feb 1, 2015 at 03:54:03PM +0100, Álvaro Hernández Tortosa wrote: >>> The problem here is that performance degrades exponentially, or >>> worse. Speaking here from experience, we already tested this for a >>> very similar case (table creation, where two oids are consumed from >>> a global sequence when inserting to pg_class). Have a look at >>> http://www.slideshare.net/nosys/billion-tables-project-nycpug-2013, >>> slides 43-45. We tested there this scenario and shown that table >>> creations per second dropped from 10K to a few per second and then >>> to a few per day. In the graphs you can't even realize there were >>> more tables been created. At around 8K tables from the theoretical >>> limit of 4B oids consumed, the process basically stopped (doing more >>> insertions). >> We don't report the maximum number of tables per database, or the >> maximum number of TOAST values. Agreed? > For my 2c, this limitation is a surprise to users and therefore we > should add documentation to point out that it exists, unless we're going > to actually fix it (which is certainly what I'd prefer to see...). TBH, I've got very little enthusiasm for fixing this given the number of reports of trouble from the field, which so far as I recall is zero. Álvaro's case came up through intentionally trying to create an unreasonable number of tables, not from real usage. This thread likewise appears to contain lots of speculation and no reports of anyone hitting a problem in practice. Certainly this is likely to become an issue at some point in the future, but I'm not finding it very compelling to worry about now. By the time it does become an issue, we may have additional considerations or use cases that should inform a solution; which seems to me to be a good argument not to try to fix it in advance of real problems. Perhaps, for example, we'd find that at the same time we ought to relax the 1GB limit on individual-value size; or perhaps not. Having said all that, if we did try to fix it today, I'd imagine changing TOAST value identifiers to int64 and inventing a new TOAST pointer format for use when 32 bits isn't wide enough for the ID. But I think we're best advised to hold off doing that until the need becomes pressing. regards, tom lane
On Thu, Apr 23, 2015 at 11:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Having said all that, if we did try to fix it today, I'd imagine changing > TOAST value identifiers to int64 and inventing a new TOAST pointer format > for use when 32 bits isn't wide enough for the ID. But I think we're best > advised to hold off doing that until the need becomes pressing. Just out of curiosity, has anyone thought about inventing a new TOAST pointer format on the grounds that our TOAST pointers are unreasonably large? IIUC, a TOAST pointer right now is 18 bytes: 16 for a varatt_external, and then that gets embedded in a varattrib_1b_e with a va_header byte and a va_tag byte. Eliminating one or both of va_rawsize and va_extsize from the TOAST pointer itself seems like it could save quite a bit of space on disk. Maybe you could even find a way to get rid of va_toastrelid; after all, at the point when you first acquire a pointer to the tuple, you surely know what relation it's a part of. You'd probably want to force de-TOASTing (or converting to a more expressive form of TOAST pointer, anyway) when you extracted the column from the tuple, which might be hard to arrange. But the benefits could be pretty significant. Suppose you have a table where each tuple is 4K untoasted, with all but 100 bytes of that in a single column. So, as stored, you've got 100 bytes of regular stuff plus an 18-byte TOAST header. If you could trim 2 of the above-mentioned 4-byte fields out of the TOAST header, that would reduce the size of the main relation fork by almost 7%. If you could trim all 3 of them out, you'd save more than 10%. That's not nothing, and the benefits could be even larger for rows that contain multiple TOAST pointers. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
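A quick check of the arithmetic in that example (118 bytes = 100 bytes of inline data plus the 18-byte TOAST pointer):

    SELECT round(100.0 * 8 / 118, 1) AS pct_saved_trimming_two_fields,    -- ~6.8%
           round(100.0 * 12 / 118, 1) AS pct_saved_trimming_three_fields; -- ~10.2%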
* Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Apr 23, 2015 at 11:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Having said all that, if we did try to fix it today, I'd imagine changing > > TOAST value identifiers to int64 and inventing a new TOAST pointer format > > for use when 32 bits isn't wide enough for the ID. But I think we're best > > advised to hold off doing that until the need becomes pressing. > > Just out of curiosity, has anyone thought about inventing a new TOAST > pointer format on the grounds that our TOAST pointers are unreasonably > large? I'd not thought about it, but sure sounds like a good idea from here. Would be particularly great if we were able to do this and increase the number of supported toast pointers and avoid having to go hunting for unused identifiers due to wrapping. Thanks! Stephen
On 24/04/15 05:24, Tom Lane wrote: > Stephen Frost <sfrost@snowman.net> writes: >> * Bruce Momjian (bruce@momjian.us) wrote: >>> On Sun, Feb 1, 2015 at 03:54:03PM +0100, Álvaro Hernández Tortosa wrote: >>>> The problem here is that performance degrades exponentially, or >>>> worse. Speaking here from experience, we already tested this for a >>>> very similar case (table creation, where two oids are consumed from >>>> a global sequence when inserting to pg_class). Have a look at >>>> http://www.slideshare.net/nosys/billion-tables-project-nycpug-2013, >>>> slides 43-45. We tested there this scenario and shown that table >>>> creations per second dropped from 10K to a few per second and then >>>> to a few per day. In the graphs you can't even realize there were >>>> more tables been created. At around 8K tables from the theoretical >>>> limit of 4B oids consumed, the process basically stopped (doing more >>>> insertions). >>> We don't report the maximum number of tables per database, or the >>> maximum number of TOAST values. Agreed? >> For my 2c, this limitation is a surprise to users and therefore we >> should add documentation to point out that it exists, unless we're going >> to actually fix it (which is certainly what I'd prefer to see...). > TBH, I've got very little enthusiasm for fixing this given the number > of reports of trouble from the field, which so far as I recall is zero. > Álvaro's case came up through intentionally trying to create an > unreasonable number of tables, not from real usage. This thread likewise > appears to contain lots of speculation and no reports of anyone hitting > a problem in practice. It is certainly true that this was a very synthetic case. I envision, however, certain use cases where we may hit a very large number of tables: - Massive multitenancy - Aggressive partitioning - Massive multitenancy with aggressive partitioning - Software dynamically generated tables, like those created by ToroDB (https://github.com/torodb/torodb). In ToroDB we generate tables depending only on the input data, so we may end up having as many as required by the datasource. For example, a "general purpose" json datastore may generate several tables per document inserted. > > Certainly this is likely to become an issue at some point in the future, > but I'm not finding it very compelling to worry about now. By the time > it does become an issue, we may have additional considerations or use > cases that should inform a solution; which seems to me to be a good > argument not to try to fix it in advance of real problems. Perhaps, I understand this argument, and it makes sense. However, on the other side, given the long time it may take from patch to commit and then release version to companies finally using it in production, I'd rather try to fix it soon, as there are already reports and use cases that may hit it, rather than wait three years until it explodes in our faces. After all, 640Kb RAM is enough, right? So maybe 2B tables is not that far in the horizon. Who knows. Regards, Álvaro -- Álvaro Hernández Tortosa ----------- 8Kdata
On 4/24/15 7:11 PM, Álvaro Hernández Tortosa wrote: > On 24/04/15 05:24, Tom Lane wrote: ... >> TBH, I've got very little enthusiasm for fixing this given the number >> of reports of trouble from the field, which so far as I recall is zero. >> Álvaro's case came up through intentionally trying to create an >> unreasonable number of tables, not from real usage. This thread likewise >> appears to contain lots of speculation and no reports of anyone hitting >> a problem in practice. > > It is certainly true that this was a very synthetic case. I > envision, however, certain use cases where we may hit a very large > number of tables: The original case has NOTHING to do with the number of tables and everything to do with the number of toasted values a table can have. If you have to toast 4B attributes in a single relation it will fail. In reality, if you get anywhere close to that things will fall apart due to OID conflicts. This case isn't nearly as insane as 4B tables. A table storing 10 text fields each of which is 2K would hit this limit with only 400M rows. If my math is right that's only 8TB; certainly not anything insane space-wise or rowcount-wise. Perhaps it's still not fixing, but I think it's definitely worth documenting. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On Fri, Apr 24, 2015 at 11:39:04PM -0500, Jim Nasby wrote: > On 4/24/15 7:11 PM, Álvaro Hernández Tortosa wrote: > >On 24/04/15 05:24, Tom Lane wrote: > ... > >>TBH, I've got very little enthusiasm for fixing this given the number > >>of reports of trouble from the field, which so far as I recall is zero. > >>Álvaro's case came up through intentionally trying to create an > >>unreasonable number of tables, not from real usage. This thread likewise > >>appears to contain lots of speculation and no reports of anyone hitting > >>a problem in practice. > > > > It is certainly true that this was a very synthetic case. I > >envision, however, certain use cases where we may hit a very large > >number of tables: > > The original case has NOTHING to do with the number of tables and > everything to do with the number of toasted values a table can have. > If you have to toast 4B attributes in a single relation it will > fail. In reality, if you get anywhere close to that things will fall > apart due to OID conflicts. > > This case isn't nearly as insane as 4B tables. A table storing 10 > text fields each of which is 2K would hit this limit with only 400M > rows. If my math is right that's only 8TB; certainly not anything > insane space-wise or rowcount-wise. > > Perhaps it's still not fixing, but I think it's definitely worth > documenting. And it is now documented in the Postgres FAQ thanks to 'Rogerdpack', which is where that "maximum" table came from: https://wiki.postgresql.org/wiki/FAQ#What_is_the_maximum_size_for_a_row.2C_a_table.2C_and_a_database.3F Note if you are storing a table with rows that exceed 2KB in size (aggregate size of each row) then the "Maximum number of rows in a table" may be limited to 4 Billion, see TOAST. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 25/04/15 06:39, Jim Nasby wrote: > On 4/24/15 7:11 PM, Álvaro Hernández Tortosa wrote: >> On 24/04/15 05:24, Tom Lane wrote: > ... >>> TBH, I've got very little enthusiasm for fixing this given the number >>> of reports of trouble from the field, which so far as I recall is zero. >>> Álvaro's case came up through intentionally trying to create an >>> unreasonable number of tables, not from real usage. This thread >>> likewise >>> appears to contain lots of speculation and no reports of anyone hitting >>> a problem in practice. >> >> It is certainly true that this was a very synthetic case. I >> envision, however, certain use cases where we may hit a very large >> number of tables: > > The original case has NOTHING to do with the number of tables and > everything to do with the number of toasted values a table can have. > If you have to toast 4B attributes in a single relation it will fail. > In reality, if you get anywhere close to that things will fall apart > due to OID conflicts. > > This case isn't nearly as insane as 4B tables. A table storing 10 text > fields each of which is 2K would hit this limit with only 400M rows. > If my math is right that's only 8TB; certainly not anything insane > space-wise or rowcount-wise. > > Perhaps it's still not fixing, but I think it's definitely worth > documenting. They are definitely different problems, but caused by similar symptoms: an oid wrapping around, or not even there: just trying to find an unused one. If fixed, we should probably look at both at the same time. It's worth document but also, as I said, maybe also fixing them, so that if three years from now they really show up, solution is already in production (rather than in patching state). Regards, Álvaro -- Álvaro Hernández Tortosa ----------- 8Kdata
* Álvaro Hernández Tortosa (aht@8Kdata.com) wrote: > It's worth document but also, as I said, maybe also fixing them, > so that if three years from now they really show up, solution is > already in production (rather than in patching state). With the proliferation of JSON usage in PG thanks to jsonb, I'd count us lucky if we don't get complaints about this in the next three years. I don't expect to have time to work on it in the near future, unfortunately, but Robert's thoughts on supporting a new TOAST pointer structure (with a way to support what's currently there, to avoid an on-disk break) seems like a good starting point to me. Thanks! Stephen
On 4/25/15 1:19 PM, Bruce Momjian wrote: > Note if you are storing a table with rows that exceed 2KB in size > (aggregate size of each row) then the "Maximum number of rows in a > table" may be limited to 4 Billion, see TOAST. That's not accurate though; you could be limited to far less than 4B rows. If each row has 10 fields that toast, you'd be limited to just 400M rows. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On 4/27/15, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > On 4/25/15 1:19 PM, Bruce Momjian wrote: >> Note if you are storing a table with rows that exceed 2KB in size >> (aggregate size of each row) then the "Maximum number of rows in a >> table" may be limited to 4 Billion, see TOAST. > > That's not accurate though; you could be limited to far less than 4B > rows. If each row has 10 fields that toast, you'd be limited to just > 400M rows. Good point. I noted that on the TOAST wiki page now, at least (and also mentioned that using partitioning is a "work around" for now).
On 04/27/2015 08:49 AM, Jim Nasby wrote: > On 4/25/15 1:19 PM, Bruce Momjian wrote: >> Note if you are storing a table with rows that exceed 2KB in size >> (aggregate size of each row) then the "Maximum number of rows in a >> table" may be limited to 4 Billion, see TOAST. > > That's not accurate though; you could be limited to far less than 4B > rows. If each row has 10 fields that toast, you'd be limited to just > 400M rows. ISTM like the solution is almost here, and could be done without too much (additional) work: * We have already discussed having a page-per-sequence with the new SeqAMs being introduced and how that would improve scalability. * We have commented on having a sequence per TOAST table (hence, 4B toasted values per table each up to 4B chunks in size... vs just 4B toasted values per cluster) I'm not sure that I can do it all by myself just yet, but I sure can try if there is interest. (just after I'm done with another patch that is independent from this, though) This would be material for 9.6, of course :) Thanks, J.L.
On 4/28/15 5:41 AM, José Luis Tallón wrote: > On 04/27/2015 08:49 AM, Jim Nasby wrote: >> On 4/25/15 1:19 PM, Bruce Momjian wrote: >>> Note if you are storing a table with rows that exceed 2KB in size >>> (aggregate size of each row) then the "Maximum number of rows in a >>> table" may be limited to 4 Billion, see TOAST. >> >> That's not accurate though; you could be limited to far less than 4B >> rows. If each row has 10 fields that toast, you'd be limited to just >> 400M rows. > > ISTM like the solution is almost here, and could be done without too > much (additional) work: > * We have already discussed having a page-per-sequence with the new > SeqAMs being introduced and how that would improve scalability. > * We have commented on having a sequence per TOAST table > (hence, 4B toasted values per table each up to 4B chunks in size... > vs just 4B toasted values per cluster) > > I'm not sure that I can do it all by myself just yet, but I sure > can try if there is interest. I don't think it would be hard at all to switch toast pointers to being sequence generated instead of OIDs. The only potential downside I see is the extra space required for all the sequences... but that would only matter on the tiniest of clusters (think embedded), which probably don't have that many tables to begin with. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com