Thread: 9.3: load path to mitigate load penalty for checksums
Introduction: ================================= A lot of progress has been made on the checksums issue, with Simon's excellent work beginning to gain consensus: http://archives.postgresql.org/message-id/CA +U5nMKw_GBs6qQ_Y8-RjGL1V7MVW2HWBHartB8LoJhnPfxL8g@mail.gmail.com For the purposes of this proposal, I'll assume that's the general direction we'll be taking for CRCs. The major drawback of that proposal is that it increases the amount of work to be done after a large data load by requiring more WAL. Proposal: ================================= I propose a special user-initiated loading mode at the table granularity. During this time, readers must ignore PD_ALL_VISIBLE, HEAP_XMIN_COMMITTED, and the visibility map entirely. However, writers may set all of those bits before the writing transaction commits, obviating the need to rewrite (and WAL) the data again later. Ideally, there would be no work for VACUUM to do after the data load (unless a transaction aborted). This would also help the non-CRC case of course, but I expect CRCs to make this significantly more important. Goals: ================================= * Table granularity (doesn't affect other tables at all) * Allows concurrent loaders * Allows loading into existing tableswith existing data * Online (allow reads to proceed, even if degraded) Rough user interaction: ================================= INITIATE LOAD ON foo AS 'job name'; -- run concurrent loading sessions FINALIZE LOAD 'job name'; High-level design: ================================= By "hints" I mean the VM bit, PD_ALL_VISIBLE, and HEAP_XMIN_COMMITTED. By "ignorehints" I mean a flag in pg_class indicating that readers should ignore hints. By "optimistichints" I mean a flag in pg_class indicating that writers can optimistically set hints. Obviously, readers and writers would need a mechanism to honor those flags, but I haven't dug into the details yet (additional routines in tqual.c?). States: 0: normal * ignorehints = false * optimistichints = false 1: trying to enter data load mode, waitingon existing lockers (who might be scanning) to finish * ignorehints = true * optimistichints = false 2: data load mode * ignorehints = true * optimistichints = true 3: trying to leave data load mode, waitingon old snapshots to be released and aborted transactions to be cleaned up * ignorehints = true * optimistichints= false INITIATE LOAD would first transition from state 0 to 1 by acquiring a ShareUpdateExclusiveLock on the table (to be sure no concurrent INITIATE or FINALIZE LOAD is going on) and setting ignorehints = true. Then it moves from state 1 to state 2 by waiting for all transactions that hold a lock on that table. Any transactions that don't already have a lock will see the new flag when they try to get it. Now we're sure that all readers will see the "ignorehints" flag, so we can set the "optimistichints" flag to indicate that writers can write hints optimistically. FINALIZE LOAD would first move from state 2 to state 3 by acquiring a ShareUpdateExclusiveLock on the table setting optimistichints = false. Then, it would move from state 3 to state 0 by first waiting for all transactions that currently hold a lock on the table, to ensure they see the optimistichints=false flag. Then, it would remember the current xid as max_loader_xid, and wait until the global xmin is greater than max_loader_xid. This should ensure that all snapshots regard all loading transactions as complete. Also, it would need to run a lazy VACUUM to remove any tuples from aborted transactions. Details and optimizations =================================* We probably want a graceful way to handle multiple data loads happening on the same table. Rather than erroring out, we could treat it more like a reference count, and only do the work to move in to data load mode if not there already, and only move out of data load mode if we're the last loading job on the table.* In general, there are some usability issues to sort out, to make sure a table isn't put into data load mode and left that way. Right now, I'm mostly concerned with getting a working design, but those will be important, too.* We could optimize away the VACUUM going from 3 -> 0 if we are sure no writing transactions aborted.* INITIATE and FINALIZE probably need to use PreventTransactionChain() and multiple transactions, to avoid holding the ShareUpdateExclusiveLock for too long. Also, we want to keep people from using it in the same transaction as the loading xact, because they might not realize that they would get a concurrency of 1 that way (because of the ShareUpdateExclusiveLock). Thoughts? Regards,Jeff Davis
On Mon, Jun 4, 2012 at 9:26 PM, Jeff Davis <pgsql@j-davis.com> wrote: > Thoughts? Simon already proposed a way of doing this that doesn't require explicit user action, which seems preferable to a method that does require explicit user action, even though it's a little harder to implement. His idea was to store the XID of the process creating the table in the pg_class row, which I think is *probably* better than your idea of having a process that waits and then flips the flag. There are some finicky details though - see previous thread for discussion of some of the issues. It would be very nice to have a method that detects whether or not there is only one open snapshot in a particular backend. Any time that condition is met, tuples written into a table created or truncated in the current transaction can be written with HEAP_XMIN_COMMITTED already set. That is not as nice as being able to set HEAP_XMIN_COMMITTED *and* PD_ALL_VISIBLE *and* the visibility map, but it would still be a big improvement over the status quo. I would like to see us get that part done and committed and then worry about writing the tuples with PD_ALL_VISIBLE set as a separate project. In many cases it would also be nice to write the tuples pre-frozen, so I think we should look for a design that will support that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 06, 2012 at 03:08:05PM -0400, Robert Haas wrote: > On Mon, Jun 4, 2012 at 9:26 PM, Jeff Davis <pgsql@j-davis.com> wrote: > > Thoughts? > > Simon already proposed a way of doing this that doesn't require > explicit user action, which seems preferable to a method that does > require explicit user action, even though it's a little harder to > implement. His idea was to store the XID of the process creating the > table in the pg_class row, which I think is *probably* better than > your idea of having a process that waits and then flips the flag. > There are some finicky details though - see previous thread for > discussion of some of the issues. I think both improvements have a place. Loads to a new heap should take as many automatic shortcuts as practical. Loads to an existing heap also matter, and Jeff's proposal helps those. > It would be very nice to have a method that detects whether or not > there is only one open snapshot in a particular backend. Any time > that condition is met, tuples written into a table created or > truncated in the current transaction can be written with > HEAP_XMIN_COMMITTED already set. Perhaps criteria like those in CheckTableNotInUse() would suffice?
On Wed, 2012-06-06 at 15:08 -0400, Robert Haas wrote: > On Mon, Jun 4, 2012 at 9:26 PM, Jeff Davis <pgsql@j-davis.com> wrote: > > Thoughts? > > Simon already proposed a way of doing this that doesn't require > explicit user action, which seems preferable to a method that does > require explicit user action, even though it's a little harder to > implement. His idea was to store the XID of the process creating the > table in the pg_class row, which I think is *probably* better than > your idea of having a process that waits and then flips the flag. > There are some finicky details though - see previous thread for > discussion of some of the issues. My goals include: * The ability to load into existing tables with existing data * The ability to load concurrently My understanding was that the proposal to which you're referring can't do those things, which seem like major limitations. Did I miss something? > In > many cases it would also be nice to write the tuples pre-frozen, so I > think we should look for a design that will support that. You're right, that would be nice. Regards,Jeff Davis
On Mon, Jun 04, 2012 at 06:26:04PM -0700, Jeff Davis wrote: > I propose a special user-initiated loading mode at the table > granularity. During this time, readers must ignore PD_ALL_VISIBLE, > HEAP_XMIN_COMMITTED, and the visibility map entirely. However, writers > may set all of those bits before the writing transaction commits, > obviating the need to rewrite (and WAL) the data again later. Ideally, > there would be no work for VACUUM to do after the data load (unless a > transaction aborted). Note that, currently, only VACUUM sets PD_ALL_VISIBLE and visibility map bits. Would you make something else like heap_multi_insert() be able to do so? > Goals: > ================================= > > * Table granularity (doesn't affect other tables at all) > * Allows concurrent loaders > * Allows loading into existing tables with existing data > * Online (allow reads to proceed, even if degraded) +1 > Obviously, readers and writers would need a mechanism to honor those > flags, but I haven't dug into the details yet (additional routines in > tqual.c?). Avoiding measurable overhead in tuple visibility checks when the feature is inactive may well prove to be a key implementation challenge. > FINALIZE LOAD would first move from state 2 to state 3 by acquiring a > ShareUpdateExclusiveLock on the table setting optimistichints = false. > > Then, it would move from state 3 to state 0 by first waiting for all > transactions that currently hold a lock on the table, to ensure they see > the optimistichints=false flag. This is certainly necessary, but ... > Then, it would remember the current xid > as max_loader_xid, and wait until the global xmin is greater than > max_loader_xid. This should ensure that all snapshots regard all loading > transactions as complete. ... this might not be. Each backend could decide, based on its own xmin, whether to ignore PD_ALL_VISIBLE in a given table. In other words, your ignorehints flag could be an xmin set to InvalidTransactionId during stages 1 and 2 and to the earliest safe xmin during stages 0 and 3. > * INITIATE and FINALIZE probably need to use PreventTransactionChain() > and multiple transactions, to avoid holding the ShareUpdateExclusiveLock > for too long. Also, we want to keep people from using it in the same > transaction as the loading xact, because they might not realize that > they would get a concurrency of 1 that way (because of the > ShareUpdateExclusiveLock). Yes. You need to commit the transaction modifying pg_class so other backends can observe the change, at which point you can gather the list to wait on. Consider splitting the INITIATE UI into two interfaces, one that transitions from state 0 to state 1 and another that expects state 1 and blocks until we reach state 2. You then have no need for PreventTransactionChain(), and the interfaces could even be normal functions. It's less clear how reasonably you could do this for the FINALIZE step, given its implicit VACUUM. It could be achieved by having the user do the VACUUM and making the new interface merely throw an error if a VACUUM is still needed. The trivial usage pattern might look like this: SELECT pg_initiate_load('bigtbl'); SELECT pg_wait_load('bigtbl'); -- not a great name COPY bigtbl FROM STDIN; SELECT pg_stop_load('bigtbl'); VACUUM bigtbl; SELECT pg_finalize_load('bigtbl'); It's definitely less elegant, alas. Perhaps offer the interface you've proposed and have it do the above under the hood. That way, users with complex needs have the flexibility of the lower-level interfaces while those who can tolerate PreventTransactionChain() have simplicity. Thanks, nm
On Wed, 2012-06-06 at 22:16 -0400, Noah Misch wrote: > Note that, currently, only VACUUM sets PD_ALL_VISIBLE and visibility map bits. > Would you make something else like heap_multi_insert() be able to do so? That was the plan (roughly). I was thinking about doing it at the time a new page was allocated. > Avoiding measurable overhead in tuple visibility checks when the feature is > inactive may well prove to be a key implementation challenge. Perhaps a rudimentary CLOG cache, or some other way to mitigate CLOG access could make it bearable. Although I would like it to be an online operation, I'm not quite as concerned about reads. I'd like to mitigate any major penalty, but if reads are expensive during a load, than so be it. > > Then, it would remember the current xid > > as max_loader_xid, and wait until the global xmin is greater than > > max_loader_xid. This should ensure that all snapshots regard all loading > > transactions as complete. > > ... this might not be. Each backend could decide, based on its own xmin, > whether to ignore PD_ALL_VISIBLE in a given table. In other words, your > ignorehints flag could be an xmin set to InvalidTransactionId during stages 1 > and 2 and to the earliest safe xmin during stages 0 and 3. That's a good idea. It might make it easier to implement, and removing a step from finalization is certainly a big plus. > > * INITIATE and FINALIZE probably need to use PreventTransactionChain() > > and multiple transactions, to avoid holding the ShareUpdateExclusiveLock > > for too long. Also, we want to keep people from using it in the same > > transaction as the loading xact, because they might not realize that > > they would get a concurrency of 1 that way (because of the > > ShareUpdateExclusiveLock). > > Yes. You need to commit the transaction modifying pg_class so other backends > can observe the change, at which point you can gather the list to wait on. > > Consider splitting the INITIATE UI into two interfaces, one that transitions > from state 0 to state 1 and another that expects state 1 and blocks until we > reach state 2. You then have no need for PreventTransactionChain(), and the > interfaces could even be normal functions. It's less clear how reasonably you > could do this for the FINALIZE step, given its implicit VACUUM. It could be > achieved by having the user do the VACUUM and making the new interface merely > throw an error if a VACUUM is still needed. The trivial usage pattern might > look like this: > > SELECT pg_initiate_load('bigtbl'); > SELECT pg_wait_load('bigtbl'); -- not a great name > COPY bigtbl FROM STDIN; > SELECT pg_stop_load('bigtbl'); > VACUUM bigtbl; > SELECT pg_finalize_load('bigtbl'); > > It's definitely less elegant, alas. Perhaps offer the interface you've > proposed and have it do the above under the hood. That way, users with > complex needs have the flexibility of the lower-level interfaces while those > who can tolerate PreventTransactionChain() have simplicity. I think that's a reasonable suggestion. I am going back and forth a little on this one. It's got the benefit that you can see the internal states more clearly, and it's easier to tell what's going on, and it's better if we want to do more sophisticated testing. The main drawback here is that it's exposing more to the user. I imagined that we might want to push other kinds of optimizations into the load path, and that might upset the interface you've described above. Then again, we'll probably need the normal, load, and transition states regardless, so maybe it's an empty concern. Regards,Jeff Davis
On Mon, Jun 11, 2012 at 2:16 AM, Jeff Davis <pgsql@j-davis.com> wrote: > On Wed, 2012-06-06 at 22:16 -0400, Noah Misch wrote: >> Note that, currently, only VACUUM sets PD_ALL_VISIBLE and visibility map bits. >> Would you make something else like heap_multi_insert() be able to do so? > > That was the plan (roughly). I was thinking about doing it at the time a > new page was allocated. > >> Avoiding measurable overhead in tuple visibility checks when the feature is >> inactive may well prove to be a key implementation challenge. > > Perhaps a rudimentary CLOG cache, or some other way to mitigate CLOG > access could make it bearable. > > Although I would like it to be an online operation, I'm not quite as > concerned about reads. I'd like to mitigate any major penalty, but if > reads are expensive during a load, than so be it. > >> > Then, it would remember the current xid >> > as max_loader_xid, and wait until the global xmin is greater than >> > max_loader_xid. This should ensure that all snapshots regard all loading >> > transactions as complete. >> >> ... this might not be. Each backend could decide, based on its own xmin, >> whether to ignore PD_ALL_VISIBLE in a given table. In other words, your >> ignorehints flag could be an xmin set to InvalidTransactionId during stages 1 >> and 2 and to the earliest safe xmin during stages 0 and 3. > > That's a good idea. It might make it easier to implement, and removing a > step from finalization is certainly a big plus. > >> > * INITIATE and FINALIZE probably need to use PreventTransactionChain() >> > and multiple transactions, to avoid holding the ShareUpdateExclusiveLock >> > for too long. Also, we want to keep people from using it in the same >> > transaction as the loading xact, because they might not realize that >> > they would get a concurrency of 1 that way (because of the >> > ShareUpdateExclusiveLock). >> >> Yes. You need to commit the transaction modifying pg_class so other backends >> can observe the change, at which point you can gather the list to wait on. >> >> Consider splitting the INITIATE UI into two interfaces, one that transitions >> from state 0 to state 1 and another that expects state 1 and blocks until we >> reach state 2. You then have no need for PreventTransactionChain(), and the >> interfaces could even be normal functions. It's less clear how reasonably you >> could do this for the FINALIZE step, given its implicit VACUUM. It could be >> achieved by having the user do the VACUUM and making the new interface merely >> throw an error if a VACUUM is still needed. The trivial usage pattern might >> look like this: >> >> SELECT pg_initiate_load('bigtbl'); >> SELECT pg_wait_load('bigtbl'); -- not a great name >> COPY bigtbl FROM STDIN; >> SELECT pg_stop_load('bigtbl'); >> VACUUM bigtbl; >> SELECT pg_finalize_load('bigtbl'); >> >> It's definitely less elegant, alas. Perhaps offer the interface you've >> proposed and have it do the above under the hood. That way, users with >> complex needs have the flexibility of the lower-level interfaces while those >> who can tolerate PreventTransactionChain() have simplicity. > > I think that's a reasonable suggestion. I am going back and forth a > little on this one. It's got the benefit that you can see the internal > states more clearly, and it's easier to tell what's going on, and it's > better if we want to do more sophisticated testing. > > The main drawback here is that it's exposing more to the user. I > imagined that we might want to push other kinds of optimizations into > the load path, and that might upset the interface you've described > above. Then again, we'll probably need the normal, load, and transition > states regardless, so maybe it's an empty concern. Instead of trying to maintain MVCC semantics, maybe we should just have something like COPY (FROZEN) that just writes frozen tuples into the table and throws MVCC out the window. Seems like that would be a lot easier to implement and satisfy basically the same use cases. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 12, 2012 at 7:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jun 11, 2012 at 2:16 AM, Jeff Davis <pgsql@j-davis.com> wrote: >> On Wed, 2012-06-06 at 22:16 -0400, Noah Misch wrote: >>> Note that, currently, only VACUUM sets PD_ALL_VISIBLE and visibility map bits. >>> Would you make something else like heap_multi_insert() be able to do so? >> >> That was the plan (roughly). I was thinking about doing it at the time a >> new page was allocated. >> >>> Avoiding measurable overhead in tuple visibility checks when the feature is >>> inactive may well prove to be a key implementation challenge. >> >> Perhaps a rudimentary CLOG cache, or some other way to mitigate CLOG >> access could make it bearable. >> >> Although I would like it to be an online operation, I'm not quite as >> concerned about reads. I'd like to mitigate any major penalty, but if >> reads are expensive during a load, than so be it. >> >>> > Then, it would remember the current xid >>> > as max_loader_xid, and wait until the global xmin is greater than >>> > max_loader_xid. This should ensure that all snapshots regard all loading >>> > transactions as complete. >>> >>> ... this might not be. Each backend could decide, based on its own xmin, >>> whether to ignore PD_ALL_VISIBLE in a given table. In other words, your >>> ignorehints flag could be an xmin set to InvalidTransactionId during stages 1 >>> and 2 and to the earliest safe xmin during stages 0 and 3. >> >> That's a good idea. It might make it easier to implement, and removing a >> step from finalization is certainly a big plus. >> >>> > * INITIATE and FINALIZE probably need to use PreventTransactionChain() >>> > and multiple transactions, to avoid holding the ShareUpdateExclusiveLock >>> > for too long. Also, we want to keep people from using it in the same >>> > transaction as the loading xact, because they might not realize that >>> > they would get a concurrency of 1 that way (because of the >>> > ShareUpdateExclusiveLock). >>> >>> Yes. You need to commit the transaction modifying pg_class so other backends >>> can observe the change, at which point you can gather the list to wait on. >>> >>> Consider splitting the INITIATE UI into two interfaces, one that transitions >>> from state 0 to state 1 and another that expects state 1 and blocks until we >>> reach state 2. You then have no need for PreventTransactionChain(), and the >>> interfaces could even be normal functions. It's less clear how reasonably you >>> could do this for the FINALIZE step, given its implicit VACUUM. It could be >>> achieved by having the user do the VACUUM and making the new interface merely >>> throw an error if a VACUUM is still needed. The trivial usage pattern might >>> look like this: >>> >>> SELECT pg_initiate_load('bigtbl'); >>> SELECT pg_wait_load('bigtbl'); -- not a great name >>> COPY bigtbl FROM STDIN; >>> SELECT pg_stop_load('bigtbl'); >>> VACUUM bigtbl; >>> SELECT pg_finalize_load('bigtbl'); >>> >>> It's definitely less elegant, alas. Perhaps offer the interface you've >>> proposed and have it do the above under the hood. That way, users with >>> complex needs have the flexibility of the lower-level interfaces while those >>> who can tolerate PreventTransactionChain() have simplicity. >> >> I think that's a reasonable suggestion. I am going back and forth a >> little on this one. It's got the benefit that you can see the internal >> states more clearly, and it's easier to tell what's going on, and it's >> better if we want to do more sophisticated testing. >> >> The main drawback here is that it's exposing more to the user. I >> imagined that we might want to push other kinds of optimizations into >> the load path, and that might upset the interface you've described >> above. Then again, we'll probably need the normal, load, and transition >> states regardless, so maybe it's an empty concern. > > Instead of trying to maintain MVCC semantics, maybe we should just > have something like COPY (FROZEN) that just writes frozen tuples into > the table and throws MVCC out the window. Seems like that would be a > lot easier to implement and satisfy basically the same use cases. -1: The situation with hint bit i/o patterns on many workloads is untenable but it's not safe to assume MVCC can be ditched in those workloads. Also, COPY does nothing about deletes. Neither does the proposal as stated but I think it's easier to generalize into 'I want to put hint bits in now so I don't have to deal with them later'. merlin
Merlin Moncure <mmoncure@gmail.com> writes: > On Tue, Jun 12, 2012 at 7:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Instead of trying to maintain MVCC semantics, maybe we should just >> have something like COPY (FROZEN) that just writes frozen tuples into >> the table and throws MVCC out the window. �Seems like that would be a >> lot easier to implement and satisfy basically the same use cases. > -1: The situation with hint bit i/o patterns on many workloads is > untenable but it's not safe to assume MVCC can be ditched in those > workloads. Also, COPY does nothing about deletes. Neither does the > proposal as stated but I think it's easier to generalize into 'I want > to put hint bits in now so I don't have to deal with them later'. Dunno, I think Robert's idea has a fair amount of merit: mainly because it will probably satisfy 90% of use cases for 1% of the work. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think Robert's idea has a fair amount of merit: mainly because > it will probably satisfy 90% of use cases for 1% of the work. +1, especially if we include a pg_dump option to write the COPY statements with that option. -Kevin
On Tue, Jun 12, 2012 at 11:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Merlin Moncure <mmoncure@gmail.com> writes: >> On Tue, Jun 12, 2012 at 7:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> Instead of trying to maintain MVCC semantics, maybe we should just >>> have something like COPY (FROZEN) that just writes frozen tuples into >>> the table and throws MVCC out the window. Seems like that would be a >>> lot easier to implement and satisfy basically the same use cases. > >> -1: The situation with hint bit i/o patterns on many workloads is >> untenable but it's not safe to assume MVCC can be ditched in those >> workloads. Also, COPY does nothing about deletes. Neither does the >> proposal as stated but I think it's easier to generalize into 'I want >> to put hint bits in now so I don't have to deal with them later'. > > Dunno, I think Robert's idea has a fair amount of merit: mainly because > it will probably satisfy 90% of use cases for 1% of the work. 90%? Hint bits i/o issues are not limited to bulk loads. They apply to all many-record-per-transaction DML including (and especially) deletes. Also it's not safe to assume that insertion heavy clients can be migrated to COPY. For example, JDBC bulk loading AFAIK does not use COPY and even if it did wouldn't be able to decorate the command for a long time for most production workloads. Vs Jeff's proposal you have a point -- I'm just very skeptical it's going to do enough to mitigate the performance hit. merlin
On Tue, Jun 12, 2012 at 5:41 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > Also it's not safe to assume that insertion heavy clients > can be migrated to COPY. For example, JDBC bulk loading AFAIK does > not use COPY and even if it did wouldn't be able to decorate the > command for a long time for most production workloads A non-mvcc table would only be useful if you're loading the data all at once and don't plan to load any more. So that wouldn't even be attempting to cover all "insertion heavy clients". Just specific use cases. I had in mind to implement something like this for archived data such as old partitions in partitioned table. I think data loaded in a copy that you know is never going to be modified would be a reasonable other case for the same codepath. Note that this particular use case is covered by FDW as well. If the source data is in a dense enough format you could save even the load step by reading directly from the source. I don't think this eliminates the idea of having a denser read-only format though. Postgres has lots of other features like indexing and different data representations that might be useful and would be missing from, say, a csv source file. -- greg
On Tue, Jun 12, 2012 at 12:41 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >>> -1: The situation with hint bit i/o patterns on many workloads is >>> untenable but it's not safe to assume MVCC can be ditched in those >>> workloads. Also, COPY does nothing about deletes. Neither does the >>> proposal as stated but I think it's easier to generalize into 'I want >>> to put hint bits in now so I don't have to deal with them later'. >> >> Dunno, I think Robert's idea has a fair amount of merit: mainly because >> it will probably satisfy 90% of use cases for 1% of the work. > > 90%? Hint bits i/o issues are not limited to bulk loads. They apply > to all many-record-per-transaction DML including (and especially) > deletes. Also it's not safe to assume that insertion heavy clients > can be migrated to COPY. For example, JDBC bulk loading AFAIK does > not use COPY and even if it did wouldn't be able to decorate the > command for a long time for most production workloads. > > Vs Jeff's proposal you have a point -- I'm just very skeptical it's > going to do enough to mitigate the performance hit. I don't think it's going to solve the problem in general, but TBH I don't think Jeff's proposal is, either. I mean, ignoring xmin-committed, xmax-committed, and all-visible bits is going to come with a pretty steep performance penalty all of its own. I don't doubt that you can construct situations in which it's better than incurring the I/O associated with setting those bits after the fact, but I also don't doubt that the reverse is true. Furthermore, any system that involves sometimes ignoring those things is going to add cost in extremely hot code paths even when the user doesn't elect to take advantage of the new feature. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I don't think it's going to solve the problem in general, but TBH I > don't think Jeff's proposal is, either. I mean, ignoring > xmin-committed, xmax-committed, and all-visible bits is going to come > with a pretty steep performance penalty all of its own. I don't doubt > that you can construct situations in which it's better than incurring > the I/O associated with setting those bits after the fact, but I also > don't doubt that the reverse is true. Furthermore, any system that > involves sometimes ignoring those things is going to add cost in > extremely hot code paths even when the user doesn't elect to take > advantage of the new feature. Yeah; the notion of adding cycles to the visibility-check code paths, which would add cost even for users who have no use at all for this feature, is close to being a deal-breaker for me. I would rather see us designing this around the idea of "what can we do without adding any new complexity in visibility checks?". At least for MVCC cases (hence, user tables only), it seems like we could pre-set XMIN_COMMITTED hint bits if there were a way to be sure that the XID in question would be seen as still running in the followup snapshot check that every MVCC visibility test must make anyway. The hard part of that seems to be post-crash behavior, but maybe it'd be acceptable to incur crash-recovery-time cost to run around and unset such bits? This doesn't do anything for pre-freezing of course, but I basically don't see any way that we could pre-freeze without breaking MVCC semantics anyhow. So there's still attraction in the idea of offering users the choice of not sustaining MVCC behavior in exchange for cheaper bulk loads. regards, tom lane
On Wed, Jun 6, 2012 at 8:42 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Wed, 2012-06-06 at 15:08 -0400, Robert Haas wrote: >> On Mon, Jun 4, 2012 at 9:26 PM, Jeff Davis <pgsql@j-davis.com> wrote: >> > Thoughts? >> >> Simon already proposed a way of doing this that doesn't require >> explicit user action, which seems preferable to a method that does >> require explicit user action, even though it's a little harder to >> implement. His idea was to store the XID of the process creating the >> table in the pg_class row, which I think is *probably* better than >> your idea of having a process that waits and then flips the flag. >> There are some finicky details though - see previous thread for >> discussion of some of the issues. > > My goals include: > > * The ability to load into existing tables with existing data > * The ability to load concurrently > > My understanding was that the proposal to which you're referring can't > do those things, which seem like major limitations. Did I miss > something? No, you're correct. I misread your original email, sorry. I'm just thinking about this a little more. It strikes me that the core trade-off here is between doing more post-commit work and doing more post-abort work. For example, in your proposal, we've got to run a lazy vacuum before exiting bulk load mode, because if a transaction has aborted, we've got to get rid of its tuples before letting anyone trust hint bits again. That is, abort cleanup gets harder. OTOH, commit cleanup gets easier, because all the hint bits and visibility map bits are already set: the only thing left is to freeze. In general, I think that's a good trade-off. Commits are much more common than aborts, and so we ought to be optimizing for the commit case. But maybe there are other ways of doing it, besides what you've proposed here. Not sure exactly what, but it might be worth thinking about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, 2012-06-12 at 08:42 -0400, Robert Haas wrote: > Instead of trying to maintain MVCC semantics, maybe we should just > have something like COPY (FROZEN) that just writes frozen tuples into > the table and throws MVCC out the window. Seems like that would be a > lot easier to implement and satisfy basically the same use cases. The reason that doesn't work is because there's really no answer for ABORTs except to delete the whole table, or ask the user to try to figure out what made it or what didn't. (That I can see, anyway; you may have a better idea.) In practice, I think the user would need to use this only for new tables and use only one loader so they would know what to do if there is an abort. That makes this very similar to the proposal for optimizing loads by the transaction that creates the table; except with fewer safeguards. It's basically just saying "don't abort, and be careful with concurrent reads (which are now dirty)". I think this problem largely has to do with freezing though, because that's what makes it impossible to differentiate after an abort. Let's set that aside for now, and just focus on setting HEAP_XMIN_COMMITTED, PD_ALL_VISIBLE, and the VM bit. Those account for a major part of the problem; being able to freeze also is really a bonus (and if the table is really that static, then maybe creating it in the loading transaction is reasonable). In those terms, we can reframe the questions as: what do we do about reads during the load? Answers could include: (a) nothing -- reads during a load are potentially dirty (b) readers ignore hint bits during the load,and pay the penalty (c) allow only INSERT/COPY, and block unsafe SELECT/UPDATE/DELETE (a) is your proposal except without the freezing. (b) was my proposal. (c) was a backup I had in mind (if reads are expensive anyway, maybe people would rather just block). These aren't exclusive. Maybe we can do (c) in READ COMMITTED and above, and (a) in READ UNCOMMITTED. Regards,Jeff Davis
On Tue, 2012-06-12 at 13:31 -0400, Tom Lane wrote: > Yeah; the notion of adding cycles to the visibility-check code paths, > which would add cost even for users who have no use at all for this > feature, is close to being a deal-breaker for me. I would rather see > us designing this around the idea of "what can we do without adding > any new complexity in visibility checks?". [waves hands wildly] We could create a different HeapTupleSatisfiesMVCC routine that ignores hint bits, and select that at the time the scan is started. That would at least avoid affecting existing code paths. > At least for MVCC cases (hence, user tables only), it seems like we > could pre-set XMIN_COMMITTED hint bits if there were a way to be sure > that the XID in question would be seen as still running in the followup > snapshot check that every MVCC visibility test must make anyway. The > hard part of that seems to be post-crash behavior, but maybe it'd be > acceptable to incur crash-recovery-time cost to run around and unset > such bits? Hmm... but that still leaves PD_ALL_VISIBLE, which will also cause a rewrite of the data. Under OLTP, we can assume that PD_ALL_VISIBLE is much more rare than HEAP_XMIN_COMMITTED; but for bulk loads, setting HEAP_XMIN_COMMITTED doesn't help us much without PD_ALL_VISIBLE. > This doesn't do anything for pre-freezing of course, but I basically > don't see any way that we could pre-freeze without breaking MVCC > semantics anyhow. So there's still attraction in the idea of offering > users the choice of not sustaining MVCC behavior in exchange for cheaper > bulk loads. That may be reasonable, but it's much more dangerous than just breaking MVCC (which to me implies isolation rules) -- pre-freezing would break atomicity if you have any aborts. And I can't think of very many cases where losing atomicity is reasonable (although I'm sure there are some). Regards,Jeff Davis
On Tue, Jun 12, 2012 at 01:31:05PM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > I don't think it's going to solve the problem in general, but TBH I > > don't think Jeff's proposal is, either. I mean, ignoring > > xmin-committed, xmax-committed, and all-visible bits is going to come > > with a pretty steep performance penalty all of its own. I don't doubt > > that you can construct situations in which it's better than incurring > > the I/O associated with setting those bits after the fact, but I also > > don't doubt that the reverse is true. Furthermore, any system that > > involves sometimes ignoring those things is going to add cost in > > extremely hot code paths even when the user doesn't elect to take > > advantage of the new feature. > > Yeah; the notion of adding cycles to the visibility-check code paths, > which would add cost even for users who have no use at all for this > feature, is close to being a deal-breaker for me. I would rather see > us designing this around the idea of "what can we do without adding > any new complexity in visibility checks?". > > At least for MVCC cases (hence, user tables only), it seems like we > could pre-set XMIN_COMMITTED hint bits if there were a way to be sure > that the XID in question would be seen as still running in the followup > snapshot check that every MVCC visibility test must make anyway. The > hard part of that seems to be post-crash behavior, but maybe it'd be > acceptable to incur crash-recovery-time cost to run around and unset > such bits? Well, truncating tables that were empty on the load would certainly be a easy to do --- not sure if that helps us, though. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Tue, 2012-06-12 at 13:10 -0400, Robert Haas wrote: > I don't think it's going to solve the problem in general, but TBH I > don't think Jeff's proposal is, either. I mean, ignoring > xmin-committed, xmax-committed, and all-visible bits is going to come > with a pretty steep performance penalty all of its own. I think we'd certainly have to discourage users from launching lots of full table scans during a data load. We could go so far as blocking reads, as I said in my other response, and still (mostly) meet my primary goals. But allowing reads without hint bits might be useful for highly selective index access, which is a substantial use case for the kind of large tables we'd be bulk-loading. Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > In those terms, we can reframe the questions as: what do we do about > reads during the load? > Answers could include: > (a) nothing -- reads during a load are potentially dirty > (b) readers ignore hint bits during the load, and pay the penalty > (c) allow only INSERT/COPY, and block unsafe SELECT/UPDATE/DELETE Or (d) it's not a problem, since the inserting XID is still busy according to the readers' snapshots. The part I think is actually hard is how to clean up if the inserting xact doesn't reach commit. I think what we're basically looking at here is pushing more cost into that path in order to avoid cost in successful cases. The first design that comes to mind is (1) the inserting xact remembers which tables it's inserted pre-hinted tuples into, and if it has to abort, it first seqscans those tables to reset the hint bits; (2) it also emits WAL entries that will cause crash recovery to perform an identical scan, if WAL ends without finding a commit or abort record for the inserting xact. But there may be other better ways. One problem with the above sketch is that you can't checkpoint till the insertion is committed, since a checkpoint would prevent crash recovery from seeing the warning WAL records. regards, tom lane
On Tue, 2012-06-12 at 17:53 -0400, Bruce Momjian wrote: > Well, truncating tables that were empty on the load would certainly be > a easy to do --- not sure if that helps us, though. This goes back to the single-loader-into-an-empty-table case, which I'd like to get away from. I think we already have some reasonable proposals on the table for optimizing that case. Ultimately, that pushes us toward partitioning with the same granularity as the data load. We can try to open that discussion up, but I think it's a bad idea to tie the data load granularity to the partitioning granularity, unless we have a way to coalesce them later without rewriting. It's too bad filesystems don't allow us to just reallocate some blocks from one file to another. [ OK, I guess I just opened this discussion myself ;) ] Regards,Jeff Davis
On Tue, 2012-06-12 at 18:02 -0400, Tom Lane wrote: > Or (d) it's not a problem, since the inserting XID is still busy > according to the readers' snapshots. How much of a savings did we get from PD_ALL_VISIBLE when it was added into the page-at-a-time visibility check? >From 608195a3a3656145a7eec7a47d903bc684011d73: "In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on each heap page, also indicating that all tuples on the page are visible to all transactions. It's important that this flag is kept up-to-date. It is also used to skip visibility tests in sequential scans, which gives a small performance gain on seqscans." If "small" means that it's something we can give up, then focusing on HEAP_XMIN_COMMITTED makes sense. But if we can't give it up, then we need to take it into account in the proposal. Regards,Jeff Davis
On Tue, Jun 12, 2012 at 6:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The part I think is actually hard is how to clean up if the inserting > xact doesn't reach commit. I think what we're basically looking at here > is pushing more cost into that path in order to avoid cost in successful > cases. The first design that comes to mind is > > (1) the inserting xact remembers which tables it's inserted pre-hinted > tuples into, and if it has to abort, it first seqscans those tables to > reset the hint bits; I don't think we can count on that to be safe in an arbitrarily chosen abort path. Anything FATAL, for instance. I think we're going to need to keep track of some kind table-xmin value, representing the oldest operation on the table that's not cleaned up yet, and make it autovacuum's job to clean any that precede OldestXmin. If the backend can clean itself up, great, but there has to be some kind of allowance for the case where that doesn't happen. I'm also skeptical about the notion that "scan the whole table" is going to be a good idea. It really will have to be a full sequential scan, if we're setting visibility map bits as we go, not just a scan of pages that are not-all-visible, as vacuum normally does. I think if we want to go this route, we need to log the TID of every tuple we write into the heap into some kind of undo fork (or maybe just the block numbers), so that if the transaction aborts, we (or autovacuum) can go back and find all of those TIDs and mark the tuples dead without having to scan through (potentially) terabytes of data. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 12, 2012 at 6:26 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Tue, 2012-06-12 at 18:02 -0400, Tom Lane wrote: >> Or (d) it's not a problem, since the inserting XID is still busy >> according to the readers' snapshots. > > How much of a savings did we get from PD_ALL_VISIBLE when it was added > into the page-at-a-time visibility check? > > >From 608195a3a3656145a7eec7a47d903bc684011d73: > > "In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on > each heap page, also indicating that all tuples on the page are visible > to all transactions. It's important that this flag is kept up-to-date. > It is also used to skip visibility tests in sequential scans, which > gives a small performance gain on seqscans." > > If "small" means that it's something we can give up, then focusing on > HEAP_XMIN_COMMITTED makes sense. But if we can't give it up, then we > need to take it into account in the proposal. It's significant. rhaas=# create table foo (a int, b text); ERROR: relation "foo" already exists rhaas=# create table bar (a int, b text); CREATE TABLE rhaas=# insert into bar select g, random()::text||random()::text||random()::text||random()::text from generate_series(1,1000000) g; INSERT 0 1000000 rhaas=# \timing Timing is on. rhaas=# select sum(1) from bar; sum ---------1000000 (1 row) Time: 257.500 ms rhaas=# select sum(1) from bar; sum ---------1000000 (1 row) Time: 140.763 ms rhaas=# select sum(1) from bar; sum ---------1000000 (1 row) Time: 142.760 ms rhaas=# select sum(1) from bar; sum ---------1000000 (1 row) Time: 140.603 ms rhaas=# vacuum bar; VACUUM Time: 133.084 ms rhaas=# select sum(1) from bar; sum ---------1000000 (1 row) Time: 123.591 ms rhaas=# select sum(1) from bar; sum ---------1000000 (1 row) Time: 123.096 ms rhaas=# select sum(1) from bar; sum ---------1000000 (1 row) Time: 122.653 ms So vacuuming to set the PD_ALL_VISIBLE bits is buying us more than 10% here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, 2012-06-12 at 22:06 -0400, Robert Haas wrote: > > How much of a savings did we get from PD_ALL_VISIBLE when it was added > > into the page-at-a-time visibility check? > > > > >From 608195a3a3656145a7eec7a47d903bc684011d73: > > > > "In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on > > each heap page, also indicating that all tuples on the page are visible > > to all transactions. It's important that this flag is kept up-to-date. > > It is also used to skip visibility tests in sequential scans, which > > gives a small performance gain on seqscans." > > > > If "small" means that it's something we can give up, then focusing on > > HEAP_XMIN_COMMITTED makes sense. But if we can't give it up, then we > > need to take it into account in the proposal. > > It's significant. In that case, the proposals that only involve HEAP_XMIN_COMMITTED don't seem viable to me. To get maxiumum read speed, we will need to set PD_ALL_VISIBLE also, and that means rewriting the entire table anyway (for the workload that I'm describing). However, maybe if we combine the approaches, readers could ignore PD_ALL_VISIBLE during the load, which looks like maybe a 10% penalty, without having to ignore HEAP_XMIN_COMMITTED (which seems much more costly to ignore). Regards,Jeff Davis