Thread: 9.3: load path to mitigate load penalty for checksums

9.3: load path to mitigate load penalty for checksums

From
Jeff Davis
Date:
Introduction:
=================================
A lot of progress has been made on the checksums issue, with Simon's
excellent work beginning to gain consensus:

http://archives.postgresql.org/message-id/CA+U5nMKw_GBs6qQ_Y8-RjGL1V7MVW2HWBHartB8LoJhnPfxL8g@mail.gmail.com

For the purposes of this proposal, I'll assume that's the general
direction we'll be taking for CRCs.

The major drawback of that proposal is that it increases the amount of
work to be done after a large data load by requiring more WAL.

Proposal:
=================================
I propose a special user-initiated loading mode at the table
granularity. During this time, readers must ignore PD_ALL_VISIBLE,
HEAP_XMIN_COMMITTED, and the visibility map entirely. However, writers
may set all of those bits before the writing transaction commits,
obviating the need to rewrite (and WAL) the data again later. Ideally,
there would be no work for VACUUM to do after the data load (unless a
transaction aborted).

This would also help the non-CRC case of course, but I expect CRCs to
make this significantly more important.

Goals:
=================================
  * Table granularity (doesn't affect other tables at all)
  * Allows concurrent loaders
  * Allows loading into existing tables with existing data
  * Online (allow reads to proceed, even if degraded)

Rough user interaction:
=================================
 INITIATE LOAD ON foo AS 'job name';
 -- run concurrent loading sessions
 FINALIZE LOAD 'job name';
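
For concreteness, a complete load might look like this (proposed syntax
only, nothing here is implemented; foo_staging and the job name are just
placeholders):

 -- session 1: put the table into data load mode
 INITIATE LOAD ON foo AS 'nightly import';

 -- sessions 2..N: load concurrently with ordinary statements
 COPY foo FROM STDIN;
 INSERT INTO foo SELECT * FROM foo_staging;

 -- session 1: once the loaders have committed, leave data load mode
 FINALIZE LOAD 'nightly import';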

High-level design:
=================================

By "hints" I mean the VM bit, PD_ALL_VISIBLE, and HEAP_XMIN_COMMITTED.

By "ignorehints" I mean a flag in pg_class indicating that readers
should ignore hints.

By "optimistichints" I mean a flag in pg_class indicating that writers
can optimistically set hints.

Obviously, readers and writers would need a mechanism to honor those
flags, but I haven't dug into the details yet (additional routines in
tqual.c?).
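
As a sketch only (these catalog columns do not exist; the names are
placeholders for whatever representation we settle on), the per-table
state could then be inspected with something like:

 -- hypothetical columns, shown only to illustrate the two flags
 SELECT relname, relignorehints, reloptimistichints
   FROM pg_class
  WHERE relname = 'foo';
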
 States:

  0: normal
     * ignorehints = false
     * optimistichints = false
  1: trying to enter data load mode, waiting on existing
     lockers (who might be scanning) to finish
     * ignorehints = true
     * optimistichints = false
  2: data load mode
     * ignorehints = true
     * optimistichints = true
  3: trying to leave data load mode, waiting on old snapshots to
     be released and aborted transactions to be cleaned up
     * ignorehints = true
     * optimistichints = false

INITIATE LOAD would first transition from state 0 to 1 by acquiring a
ShareUpdateExclusiveLock on the table (to be sure no concurrent INITIATE
or FINALIZE LOAD is going on) and setting ignorehints = true.

Then it moves from state 1 to state 2 by waiting for all transactions
that hold a lock on that table to finish. Any transactions that don't already have
a lock will see the new flag when they try to get it. Now we're sure
that all readers will see the "ignorehints" flag, so we can set the
"optimistichints" flag to indicate that writers can write hints
optimistically.

FINALIZE LOAD would first move from state 2 to state 3 by acquiring a
ShareUpdateExclusiveLock on the table and setting optimistichints = false.

Then, it would move from state 3 to state 0 by first waiting for all
transactions that currently hold a lock on the table, to ensure they see
the optimistichints=false flag. Then, it would remember the current xid
as max_loader_xid, and wait until the global xmin is greater than
max_loader_xid. This should ensure that all snapshots regard all loading
transactions as complete. Also, it would need to run a lazy VACUUM to
remove any tuples from aborted transactions.

Details and optimizations
=================================

 * We probably want a graceful way to handle multiple data loads
happening on the same table. Rather than erroring out, we could treat it
more like a reference count, and only do the work to move into data
load mode if not there already, and only move out of data load mode if
we're the last loading job on the table.

 * In general, there are some usability issues to sort out, to make sure
a table isn't put into data load mode and left that way. Right now, I'm
mostly concerned with getting a working design, but those will be
important, too.

 * We could optimize away the VACUUM going from 3 -> 0 if we are sure no
writing transactions aborted.

 * INITIATE and FINALIZE probably need to use PreventTransactionChain()
and multiple transactions, to avoid holding the ShareUpdateExclusiveLock
for too long. Also, we want to keep people from using it in the same
transaction as the loading xact, because they might not realize that
they would get a concurrency of 1 that way (because of the
ShareUpdateExclusiveLock).

Thoughts?

Regards,
	Jeff Davis



Re: 9.3: load path to mitigate load penalty for checksums

From
Robert Haas
Date:
On Mon, Jun 4, 2012 at 9:26 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> Thoughts?

Simon already proposed a way of doing this that doesn't require
explicit user action, which seems preferable to a method that does
require explicit user action, even though it's a little harder to
implement.  His idea was to store the XID of the process creating the
table in the pg_class row, which I think is *probably* better than
your idea of having a process that waits and then flips the flag.
There are some finicky details though - see previous thread for
discussion of some of the issues.

It would be very nice to have a method that detects whether or not
there is only one open snapshot in a particular backend.  Any time
that condition is met, tuples written into a table created or
truncated in the current transaction can be written with
HEAP_XMIN_COMMITTED already set.  That is not as nice as being able to
set HEAP_XMIN_COMMITTED *and* PD_ALL_VISIBLE *and* the visibility map,
but it would still be a big improvement over the status quo.  I would
like to see us get that part done and committed and then worry about
writing the tuples with PD_ALL_VISIBLE set as a separate project.  In
many cases it would also be nice to write the tuples pre-frozen, so I
think we should look for a design that will support that.
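
For reference, the qualifying pattern would be a load into a table created
or truncated in the same transaction, with only one open snapshot in the
backend. A sketch (the early setting of HEAP_XMIN_COMMITTED described
above is proposed behavior, not something that happens today):

 BEGIN;
 TRUNCATE bigtbl;              -- or CREATE TABLE bigtbl (...)
 COPY bigtbl FROM STDIN;       -- tuples could be written with
                               -- HEAP_XMIN_COMMITTED already set
 COMMIT;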

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 9.3: load path to mitigate load penalty for checksums

From
Noah Misch
Date:
On Wed, Jun 06, 2012 at 03:08:05PM -0400, Robert Haas wrote:
> On Mon, Jun 4, 2012 at 9:26 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > Thoughts?
> 
> Simon already proposed a way of doing this that doesn't require
> explicit user action, which seems preferable to a method that does
> require explicit user action, even though it's a little harder to
> implement.  His idea was to store the XID of the process creating the
> table in the pg_class row, which I think is *probably* better than
> your idea of having a process that waits and then flips the flag.
> There are some finicky details though - see previous thread for
> discussion of some of the issues.

I think both improvements have a place.  Loads to a new heap should take as
many automatic shortcuts as practical.  Loads to an existing heap also matter,
and Jeff's proposal helps those.

> It would be very nice to have a method that detects whether or not
> there is only one open snapshot in a particular backend.  Any time
> that condition is met, tuples written into a table created or
> truncated in the current transaction can be written with
> HEAP_XMIN_COMMITTED already set.

Perhaps criteria like those in CheckTableNotInUse() would suffice?


Re: 9.3: load path to mitigate load penalty for checksums

From
Jeff Davis
Date:
On Wed, 2012-06-06 at 15:08 -0400, Robert Haas wrote:
> On Mon, Jun 4, 2012 at 9:26 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > Thoughts?
> 
> Simon already proposed a way of doing this that doesn't require
> explicit user action, which seems preferable to a method that does
> require explicit user action, even though it's a little harder to
> implement.  His idea was to store the XID of the process creating the
> table in the pg_class row, which I think is *probably* better than
> your idea of having a process that waits and then flips the flag.
> There are some finicky details though - see previous thread for
> discussion of some of the issues.

My goals include:

* The ability to load into existing tables with existing data
* The ability to load concurrently

My understanding was that the proposal to which you're referring can't
do those things, which seem like major limitations. Did I miss
something?

> In
> many cases it would also be nice to write the tuples pre-frozen, so I
> think we should look for a design that will support that.

You're right, that would be nice.

Regards,
	Jeff Davis




Re: 9.3: load path to mitigate load penalty for checksums

From
Noah Misch
Date:
On Mon, Jun 04, 2012 at 06:26:04PM -0700, Jeff Davis wrote:
> I propose a special user-initiated loading mode at the table
> granularity. During this time, readers must ignore PD_ALL_VISIBLE,
> HEAP_XMIN_COMMITTED, and the visibility map entirely. However, writers
> may set all of those bits before the writing transaction commits,
> obviating the need to rewrite (and WAL) the data again later. Ideally,
> there would be no work for VACUUM to do after the data load (unless a
> transaction aborted).

Note that, currently, only VACUUM sets PD_ALL_VISIBLE and visibility map bits.
Would you make something else like heap_multi_insert() be able to do so?

> Goals:
> =================================
> 
>   * Table granularity (doesn't affect other tables at all)
>   * Allows concurrent loaders
>   * Allows loading into existing tables with existing data
>   * Online (allow reads to proceed, even if degraded)

+1

> Obviously, readers and writers would need a mechanism to honor those
> flags, but I haven't dug into the details yet (additional routines in
> tqual.c?).

Avoiding measurable overhead in tuple visibility checks when the feature is
inactive may well prove to be a key implementation challenge.

> FINALIZE LOAD would first move from state 2 to state 3 by acquiring a
> ShareUpdateExclusiveLock on the table setting optimistichints = false.
> 
> Then, it would move from state 3 to state 0 by first waiting for all
> transactions that currently hold a lock on the table, to ensure they see
> the optimistichints=false flag.

This is certainly necessary, but ...

> Then, it would remember the current xid
> as max_loader_xid, and wait until the global xmin is greater than
> max_loader_xid. This should ensure that all snapshots regard all loading
> transactions as complete.

... this might not be.  Each backend could decide, based on its own xmin,
whether to ignore PD_ALL_VISIBLE in a given table.  In other words, your
ignorehints flag could be an xmin set to InvalidTransactionId during stages 1
and 2 and to the earliest safe xmin during stages 0 and 3.

>  * INITIATE and FINALIZE probably need to use PreventTransactionChain()
> and multiple transactions, to avoid holding the ShareUpdateExclusiveLock
> for too long. Also, we want to keep people from using it in the same
> transaction as the loading xact, because they might not realize that
> they would get a concurrency of 1 that way (because of the
> ShareUpdateExclusiveLock).

Yes.  You need to commit the transaction modifying pg_class so other backends
can observe the change, at which point you can gather the list to wait on.

Consider splitting the INITIATE UI into two interfaces, one that transitions
from state 0 to state 1 and another that expects state 1 and blocks until we
reach state 2.  You then have no need for PreventTransactionChain(), and the
interfaces could even be normal functions.  It's less clear how reasonably you
could do this for the FINALIZE step, given its implicit VACUUM.  It could be
achieved by having the user do the VACUUM and making the new interface merely
throw an error if a VACUUM is still needed.  The trivial usage pattern might
look like this:

SELECT pg_initiate_load('bigtbl');
SELECT pg_wait_load('bigtbl'); -- not a great name
COPY bigtbl FROM STDIN;
SELECT pg_stop_load('bigtbl');
VACUUM bigtbl;
SELECT pg_finalize_load('bigtbl');

It's definitely less elegant, alas.  Perhaps offer the interface you've
proposed and have it do the above under the hood.  That way, users with
complex needs have the flexibility of the lower-level interfaces while those
who can tolerate PreventTransactionChain() have simplicity.

Thanks,
nm


Re: 9.3: load path to mitigate load penalty for checksums

From
Jeff Davis
Date:
On Wed, 2012-06-06 at 22:16 -0400, Noah Misch wrote:
> Note that, currently, only VACUUM sets PD_ALL_VISIBLE and visibility map bits.
> Would you make something else like heap_multi_insert() be able to do so?

That was the plan (roughly). I was thinking about doing it at the time a
new page was allocated.

> Avoiding measurable overhead in tuple visibility checks when the feature is
> inactive may well prove to be a key implementation challenge.

Perhaps a rudimentary CLOG cache, or some other way to mitigate CLOG
access could make it bearable.

Although I would like it to be an online operation, I'm not quite as
concerned about reads. I'd like to mitigate any major penalty, but if
reads are expensive during a load, then so be it.

> > Then, it would remember the current xid
> > as max_loader_xid, and wait until the global xmin is greater than
> > max_loader_xid. This should ensure that all snapshots regard all loading
> > transactions as complete.
> 
> ... this might not be.  Each backend could decide, based on its own xmin,
> whether to ignore PD_ALL_VISIBLE in a given table.  In other words, your
> ignorehints flag could be an xmin set to InvalidTransactionId during stages 1
> and 2 and to the earliest safe xmin during stages 0 and 3.

That's a good idea. It might make it easier to implement, and removing a
step from finalization is certainly a big plus.

> >  * INITIATE and FINALIZE probably need to use PreventTransactionChain()
> > and multiple transactions, to avoid holding the ShareUpdateExclusiveLock
> > for too long. Also, we want to keep people from using it in the same
> > transaction as the loading xact, because they might not realize that
> > they would get a concurrency of 1 that way (because of the
> > ShareUpdateExclusiveLock).
> 
> Yes.  You need to commit the transaction modifying pg_class so other backends
> can observe the change, at which point you can gather the list to wait on.
> 
> Consider splitting the INITIATE UI into two interfaces, one that transitions
> from state 0 to state 1 and another that expects state 1 and blocks until we
> reach state 2.  You then have no need for PreventTransactionChain(), and the
> interfaces could even be normal functions.  It's less clear how reasonably you
> could do this for the FINALIZE step, given its implicit VACUUM.  It could be
> achieved by having the user do the VACUUM and making the new interface merely
> throw an error if a VACUUM is still needed.  The trivial usage pattern might
> look like this:
> 
> SELECT pg_initiate_load('bigtbl');
> SELECT pg_wait_load('bigtbl'); -- not a great name
> COPY bigtbl FROM STDIN;
> SELECT pg_stop_load('bigtbl');
> VACUUM bigtbl;
> SELECT pg_finalize_load('bigtbl');
> 
> It's definitely less elegant, alas.  Perhaps offer the interface you've
> proposed and have it do the above under the hood.  That way, users with
> complex needs have the flexibility of the lower-level interfaces while those
> who can tolerate PreventTransactionChain() have simplicity.

I think that's a reasonable suggestion. I am going back and forth a
little on this one. It's got the benefit that you can see the internal
states more clearly, and it's easier to tell what's going on, and it's
better if we want to do more sophisticated testing.

The main drawback here is that it's exposing more to the user. I
imagined that we might want to push other kinds of optimizations into
the load path, and that might upset the interface you've described
above. Then again, we'll probably need the normal, load, and transition
states regardless, so maybe it's an empty concern.

Regards,
	Jeff Davis



Re: 9.3: load path to mitigate load penalty for checksums

From
Robert Haas
Date:
On Mon, Jun 11, 2012 at 2:16 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2012-06-06 at 22:16 -0400, Noah Misch wrote:
>> Note that, currently, only VACUUM sets PD_ALL_VISIBLE and visibility map bits.
>> Would you make something else like heap_multi_insert() be able to do so?
>
> That was the plan (roughly). I was thinking about doing it at the time a
> new page was allocated.
>
>> Avoiding measurable overhead in tuple visibility checks when the feature is
>> inactive may well prove to be a key implementation challenge.
>
> Perhaps a rudimentary CLOG cache, or some other way to mitigate CLOG
> access could make it bearable.
>
> Although I would like it to be an online operation, I'm not quite as
> concerned about reads. I'd like to mitigate any major penalty, but if
> reads are expensive during a load, than so be it.
>
>> > Then, it would remember the current xid
>> > as max_loader_xid, and wait until the global xmin is greater than
>> > max_loader_xid. This should ensure that all snapshots regard all loading
>> > transactions as complete.
>>
>> ... this might not be.  Each backend could decide, based on its own xmin,
>> whether to ignore PD_ALL_VISIBLE in a given table.  In other words, your
>> ignorehints flag could be an xmin set to InvalidTransactionId during stages 1
>> and 2 and to the earliest safe xmin during stages 0 and 3.
>
> That's a good idea. It might make it easier to implement, and removing a
> step from finalization is certainly a big plus.
>
>> >  * INITIATE and FINALIZE probably need to use PreventTransactionChain()
>> > and multiple transactions, to avoid holding the ShareUpdateExclusiveLock
>> > for too long. Also, we want to keep people from using it in the same
>> > transaction as the loading xact, because they might not realize that
>> > they would get a concurrency of 1 that way (because of the
>> > ShareUpdateExclusiveLock).
>>
>> Yes.  You need to commit the transaction modifying pg_class so other backends
>> can observe the change, at which point you can gather the list to wait on.
>>
>> Consider splitting the INITIATE UI into two interfaces, one that transitions
>> from state 0 to state 1 and another that expects state 1 and blocks until we
>> reach state 2.  You then have no need for PreventTransactionChain(), and the
>> interfaces could even be normal functions.  It's less clear how reasonably you
>> could do this for the FINALIZE step, given its implicit VACUUM.  It could be
>> achieved by having the user do the VACUUM and making the new interface merely
>> throw an error if a VACUUM is still needed.  The trivial usage pattern might
>> look like this:
>>
>> SELECT pg_initiate_load('bigtbl');
>> SELECT pg_wait_load('bigtbl'); -- not a great name
>> COPY bigtbl FROM STDIN;
>> SELECT pg_stop_load('bigtbl');
>> VACUUM bigtbl;
>> SELECT pg_finalize_load('bigtbl');
>>
>> It's definitely less elegant, alas.  Perhaps offer the interface you've
>> proposed and have it do the above under the hood.  That way, users with
>> complex needs have the flexibility of the lower-level interfaces while those
>> who can tolerate PreventTransactionChain() have simplicity.
>
> I think that's a reasonable suggestion. I am going back and forth a
> little on this one. It's got the benefit that you can see the internal
> states more clearly, and it's easier to tell what's going on, and it's
> better if we want to do more sophisticated testing.
>
> The main drawback here is that it's exposing more to the user. I
> imagined that we might want to push other kinds of optimizations into
> the load path, and that might upset the interface you've described
> above. Then again, we'll probably need the normal, load, and transition
> states regardless, so maybe it's an empty concern.

Instead of trying to maintain MVCC semantics, maybe we should just
have something like COPY (FROZEN) that just writes frozen tuples into
the table and throws MVCC out the window.  Seems like that would be a
lot easier to implement and satisfy basically the same use cases.
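
Roughly (hypothetical syntax, nothing like a FROZEN option exists today,
and the table name is only a placeholder):

 COPY archive_2011 FROM STDIN (FROZEN);  -- tuples written pre-frozen, and
                                         -- immediately visible to everyone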

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 9.3: load path to mitigate load penalty for checksums

From
Merlin Moncure
Date:
On Tue, Jun 12, 2012 at 7:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jun 11, 2012 at 2:16 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>> On Wed, 2012-06-06 at 22:16 -0400, Noah Misch wrote:
>>> Note that, currently, only VACUUM sets PD_ALL_VISIBLE and visibility map bits.
>>> Would you make something else like heap_multi_insert() be able to do so?
>>
>> That was the plan (roughly). I was thinking about doing it at the time a
>> new page was allocated.
>>
>>> Avoiding measurable overhead in tuple visibility checks when the feature is
>>> inactive may well prove to be a key implementation challenge.
>>
>> Perhaps a rudimentary CLOG cache, or some other way to mitigate CLOG
>> access could make it bearable.
>>
>> Although I would like it to be an online operation, I'm not quite as
>> concerned about reads. I'd like to mitigate any major penalty, but if
>> reads are expensive during a load, than so be it.
>>
>>> > Then, it would remember the current xid
>>> > as max_loader_xid, and wait until the global xmin is greater than
>>> > max_loader_xid. This should ensure that all snapshots regard all loading
>>> > transactions as complete.
>>>
>>> ... this might not be.  Each backend could decide, based on its own xmin,
>>> whether to ignore PD_ALL_VISIBLE in a given table.  In other words, your
>>> ignorehints flag could be an xmin set to InvalidTransactionId during stages 1
>>> and 2 and to the earliest safe xmin during stages 0 and 3.
>>
>> That's a good idea. It might make it easier to implement, and removing a
>> step from finalization is certainly a big plus.
>>
>>> >  * INITIATE and FINALIZE probably need to use PreventTransactionChain()
>>> > and multiple transactions, to avoid holding the ShareUpdateExclusiveLock
>>> > for too long. Also, we want to keep people from using it in the same
>>> > transaction as the loading xact, because they might not realize that
>>> > they would get a concurrency of 1 that way (because of the
>>> > ShareUpdateExclusiveLock).
>>>
>>> Yes.  You need to commit the transaction modifying pg_class so other backends
>>> can observe the change, at which point you can gather the list to wait on.
>>>
>>> Consider splitting the INITIATE UI into two interfaces, one that transitions
>>> from state 0 to state 1 and another that expects state 1 and blocks until we
>>> reach state 2.  You then have no need for PreventTransactionChain(), and the
>>> interfaces could even be normal functions.  It's less clear how reasonably you
>>> could do this for the FINALIZE step, given its implicit VACUUM.  It could be
>>> achieved by having the user do the VACUUM and making the new interface merely
>>> throw an error if a VACUUM is still needed.  The trivial usage pattern might
>>> look like this:
>>>
>>> SELECT pg_initiate_load('bigtbl');
>>> SELECT pg_wait_load('bigtbl'); -- not a great name
>>> COPY bigtbl FROM STDIN;
>>> SELECT pg_stop_load('bigtbl');
>>> VACUUM bigtbl;
>>> SELECT pg_finalize_load('bigtbl');
>>>
>>> It's definitely less elegant, alas.  Perhaps offer the interface you've
>>> proposed and have it do the above under the hood.  That way, users with
>>> complex needs have the flexibility of the lower-level interfaces while those
>>> who can tolerate PreventTransactionChain() have simplicity.
>>
>> I think that's a reasonable suggestion. I am going back and forth a
>> little on this one. It's got the benefit that you can see the internal
>> states more clearly, and it's easier to tell what's going on, and it's
>> better if we want to do more sophisticated testing.
>>
>> The main drawback here is that it's exposing more to the user. I
>> imagined that we might want to push other kinds of optimizations into
>> the load path, and that might upset the interface you've described
>> above. Then again, we'll probably need the normal, load, and transition
>> states regardless, so maybe it's an empty concern.
>
> Instead of trying to maintain MVCC semantics, maybe we should just
> have something like COPY (FROZEN) that just writes frozen tuples into
> the table and throws MVCC out the window.  Seems like that would be a
> lot easier to implement and satisfy basically the same use cases.

-1: The situation with hint bit i/o patterns on many workloads is
untenable but it's not safe to assume MVCC can be ditched in those
workloads.  Also, COPY does nothing about deletes.  Neither does the
proposal as stated but I think it's easier to generalize into 'I want
to put hint bits in now so I don't have to deal with them later'.

merlin


Re: 9.3: load path to mitigate load penalty for checksums

From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> On Tue, Jun 12, 2012 at 7:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Instead of trying to maintain MVCC semantics, maybe we should just
>> have something like COPY (FROZEN) that just writes frozen tuples into
>> the table and throws MVCC out the window.  Seems like that would be a
>> lot easier to implement and satisfy basically the same use cases.

> -1: The situation with hint bit i/o patterns on many workloads is
> untenable but it's not safe to assume MVCC can be ditched in those
> workloads.  Also, COPY does nothing about deletes.  Neither does the
> proposal as stated but I think it's easier to generalize into 'I want
> to put hint bits in now so I don't have to deal with them later'.

Dunno, I think Robert's idea has a fair amount of merit: mainly because
it will probably satisfy 90% of use cases for 1% of the work.
        regards, tom lane


Re: 9.3: load path to mitigate load penalty for checksums

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think Robert's idea has a fair amount of merit: mainly because
> it will probably satisfy 90% of use cases for 1% of the work.

+1, especially if we include a pg_dump option to write the COPY
statements with that option.

-Kevin


Re: 9.3: load path to mitigate load penalty for checksums

From
Merlin Moncure
Date:
On Tue, Jun 12, 2012 at 11:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> On Tue, Jun 12, 2012 at 7:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Instead of trying to maintain MVCC semantics, maybe we should just
>>> have something like COPY (FROZEN) that just writes frozen tuples into
>>> the table and throws MVCC out the window.  Seems like that would be a
>>> lot easier to implement and satisfy basically the same use cases.
>
>> -1: The situation with hint bit i/o patterns on many workloads is
>> untenable but it's not safe to assume MVCC can be ditched in those
>> workloads.  Also, COPY does nothing about deletes.  Neither does the
>> proposal as stated but I think it's easier to generalize into 'I want
>> to put hint bits in now so I don't have to deal with them later'.
>
> Dunno, I think Robert's idea has a fair amount of merit: mainly because
> it will probably satisfy 90% of use cases for 1% of the work.

90%?  Hint bit I/O issues are not limited to bulk loads.  They apply
to all many-record-per-transaction DML including (and especially)
deletes.  Also it's not safe to assume that insertion heavy clients
can be migrated to COPY.  For example, JDBC bulk loading AFAIK does
not use COPY and even if it did wouldn't be able to decorate the
command for a long time for most production workloads.

Vs Jeff's proposal you have a point -- I'm just very skeptical it's
going to do enough to mitigate the performance hit.

merlin


Re: 9.3: load path to mitigate load penalty for checksums

From
Greg Stark
Date:
On Tue, Jun 12, 2012 at 5:41 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Also it's not safe to assume that insertion heavy clients
> can be migrated to COPY.  For example, JDBC bulk loading AFAIK does
> not use COPY and even if it did wouldn't be able to decorate the
> command for a long time for most production workloads

A non-mvcc table would only be useful if you're loading the data all
at once and don't plan to load any more. So that wouldn't even be
attempting to cover all "insertion heavy clients". Just specific use
cases.

I had in mind to implement something like this for archived data such
as old partitions in partitioned table. I think data loaded in a copy
that you know is never going to be modified would be a reasonable
other case for the same codepath.
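
As a sketch of that case (placeholder names, and the FROZEN option
discussed above is only hypothetical at this point):

 CREATE TABLE measurements_2011 () INHERITS (measurements);
 COPY measurements_2011 FROM STDIN (FROZEN);
 -- the partition is never written again after this load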

Note that this particular use case is covered by FDW as well. If the
source data is in a dense enough format you could save even the load
step by reading directly from the source. I don't think this
eliminates the idea of having a denser read-only format though.
Postgres has lots of other features like indexing and different data
representations that might be useful and would be missing from, say, a
csv source file.

--
greg


Re: 9.3: load path to mitigate load penalty for checksums

From
Robert Haas
Date:
On Tue, Jun 12, 2012 at 12:41 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> -1: The situation with hint bit i/o patterns on many workloads is
>>> untenable but it's not safe to assume MVCC can be ditched in those
>>> workloads.  Also, COPY does nothing about deletes.  Neither does the
>>> proposal as stated but I think it's easier to generalize into 'I want
>>> to put hint bits in now so I don't have to deal with them later'.
>>
>> Dunno, I think Robert's idea has a fair amount of merit: mainly because
>> it will probably satisfy 90% of use cases for 1% of the work.
>
> 90%?  Hint bits i/o issues are not limited to bulk loads.   They apply
> to all many-record-per-transaction DML including (and especially)
> deletes.  Also it's not safe to assume that insertion heavy clients
> can be migrated to COPY.  For example, JDBC bulk loading AFAIK does
> not use COPY and even if it did wouldn't be able to decorate the
> command for a long time for most production workloads.
>
> Vs Jeff's proposal you have a point -- I'm just very skeptical it's
> going to do enough to mitigate the performance hit.

I don't think it's going to solve the problem in general, but TBH I
don't think Jeff's proposal is, either.  I mean, ignoring
xmin-committed, xmax-committed, and all-visible bits is going to come
with a pretty steep performance penalty all of its own.  I don't doubt
that you can construct situations in which it's better than incurring
the I/O associated with setting those bits after the fact, but I also
don't doubt that the reverse is true.  Furthermore, any system that
involves sometimes ignoring those things is going to add cost in
extremely hot code paths even when the user doesn't elect to take
advantage of the new feature.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 9.3: load path to mitigate load penalty for checksums

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I don't think it's going to solve the problem in general, but TBH I
> don't think Jeff's proposal is, either.  I mean, ignoring
> xmin-committed, xmax-committed, and all-visible bits is going to come
> with a pretty steep performance penalty all of its own.  I don't doubt
> that you can construct situations in which it's better than incurring
> the I/O associated with setting those bits after the fact, but I also
> don't doubt that the reverse is true.  Furthermore, any system that
> involves sometimes ignoring those things is going to add cost in
> extremely hot code paths even when the user doesn't elect to take
> advantage of the new feature.

Yeah; the notion of adding cycles to the visibility-check code paths,
which would add cost even for users who have no use at all for this
feature, is close to being a deal-breaker for me.  I would rather see
us designing this around the idea of "what can we do without adding
any new complexity in visibility checks?".

At least for MVCC cases (hence, user tables only), it seems like we
could pre-set XMIN_COMMITTED hint bits if there were a way to be sure
that the XID in question would be seen as still running in the followup
snapshot check that every MVCC visibility test must make anyway.  The
hard part of that seems to be post-crash behavior, but maybe it'd be
acceptable to incur crash-recovery-time cost to run around and unset
such bits?

This doesn't do anything for pre-freezing of course, but I basically
don't see any way that we could pre-freeze without breaking MVCC
semantics anyhow.  So there's still attraction in the idea of offering
users the choice of not sustaining MVCC behavior in exchange for cheaper
bulk loads.
        regards, tom lane


Re: 9.3: load path to mitigate load penalty for checksums

From
Robert Haas
Date:
On Wed, Jun 6, 2012 at 8:42 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2012-06-06 at 15:08 -0400, Robert Haas wrote:
>> On Mon, Jun 4, 2012 at 9:26 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> > Thoughts?
>>
>> Simon already proposed a way of doing this that doesn't require
>> explicit user action, which seems preferable to a method that does
>> require explicit user action, even though it's a little harder to
>> implement.  His idea was to store the XID of the process creating the
>> table in the pg_class row, which I think is *probably* better than
>> your idea of having a process that waits and then flips the flag.
>> There are some finicky details though - see previous thread for
>> discussion of some of the issues.
>
> My goals include:
>
> * The ability to load into existing tables with existing data
> * The ability to load concurrently
>
> My understanding was that the proposal to which you're referring can't
> do those things, which seem like major limitations. Did I miss
> something?

No, you're correct.  I misread your original email, sorry.

I'm just thinking about this a little more.  It strikes me that the
core trade-off here is between doing more post-commit work and doing
more post-abort work.  For example, in your proposal, we've got to run
a lazy vacuum before exiting bulk load mode, because if a transaction
has aborted, we've got to get rid of its tuples before letting anyone
trust hint bits again.  That is, abort cleanup gets harder.  OTOH,
commit cleanup gets easier, because all the hint bits and visibility
map bits are already set: the only thing left is to freeze.  In
general, I think that's a good trade-off.  Commits are much more
common than aborts, and so we ought to be optimizing for the commit
case.

But maybe there are other ways of doing it, besides what you've
proposed here.   Not sure exactly what, but it might be worth thinking
about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 9.3: load path to mitigate load penalty for checksums

From
Jeff Davis
Date:
On Tue, 2012-06-12 at 08:42 -0400, Robert Haas wrote:
> Instead of trying to maintain MVCC semantics, maybe we should just
> have something like COPY (FROZEN) that just writes frozen tuples into
> the table and throws MVCC out the window.  Seems like that would be a
> lot easier to implement and satisfy basically the same use cases.

The reason that doesn't work is that there's really no answer for
ABORTs except to delete the whole table, or ask the user to try to
figure out what made it in and what didn't. (That I can see, anyway; you may
have a better idea.)

In practice, I think the user would need to use this only for new tables
and use only one loader so they would know what to do if there is an
abort. That makes this very similar to the proposal for optimizing loads
by the transaction that creates the table; except with fewer safeguards.
It's basically just saying "don't abort, and be careful with concurrent
reads (which are now dirty)".

I think this problem largely has to do with freezing though, because
that's what makes it impossible to differentiate after an abort. Let's
set that aside for now, and just focus on setting HEAP_XMIN_COMMITTED,
PD_ALL_VISIBLE, and the VM bit. Those account for a major part of the
problem; being able to freeze also is really a bonus (and if the table
is really that static, then maybe creating it in the loading transaction
is reasonable).

In those terms, we can reframe the questions as: what do we do about
reads during the load?

Answers could include:
  (a) nothing -- reads during a load are potentially dirty
  (b) readers ignore hint bits during the load, and pay the penalty
  (c) allow only INSERT/COPY, and block unsafe SELECT/UPDATE/DELETE

(a) is your proposal except without the freezing. (b) was my proposal.
(c) was a backup I had in mind (if reads are expensive anyway, maybe
people would rather just block).

These aren't exclusive. Maybe we can do (c) in READ COMMITTED and above,
and (a) in READ UNCOMMITTED.
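
As a sketch of how a reader would opt into (a) under that split (proposed
behavior only; today PostgreSQL treats READ UNCOMMITTED the same as READ
COMMITTED):

 BEGIN TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
 SELECT count(*) FROM foo;  -- allowed during the load, but potentially dirty
 COMMIT;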

Regards,
	Jeff Davis



Re: 9.3: load path to mitigate load penalty for checksums

From
Jeff Davis
Date:
On Tue, 2012-06-12 at 13:31 -0400, Tom Lane wrote:
> Yeah; the notion of adding cycles to the visibility-check code paths,
> which would add cost even for users who have no use at all for this
> feature, is close to being a deal-breaker for me.  I would rather see
> us designing this around the idea of "what can we do without adding
> any new complexity in visibility checks?".

[waves hands wildly]

We could create a different HeapTupleSatisfiesMVCC routine that ignores
hint bits, and select that at the time the scan is started. That would
at least avoid affecting existing code paths.

> At least for MVCC cases (hence, user tables only), it seems like we
> could pre-set XMIN_COMMITTED hint bits if there were a way to be sure
> that the XID in question would be seen as still running in the followup
> snapshot check that every MVCC visibility test must make anyway.  The
> hard part of that seems to be post-crash behavior, but maybe it'd be
> acceptable to incur crash-recovery-time cost to run around and unset
> such bits?

Hmm... but that still leaves PD_ALL_VISIBLE, which will also cause a
rewrite of the data. Under OLTP, we can assume that PD_ALL_VISIBLE is
much more rare than HEAP_XMIN_COMMITTED; but for bulk loads, setting
HEAP_XMIN_COMMITTED doesn't help us much without PD_ALL_VISIBLE.

> This doesn't do anything for pre-freezing of course, but I basically
> don't see any way that we could pre-freeze without breaking MVCC
> semantics anyhow.  So there's still attraction in the idea of offering
> users the choice of not sustaining MVCC behavior in exchange for cheaper
> bulk loads.

That may be reasonable, but it's much more dangerous than just breaking
MVCC (which to me implies isolation rules) -- pre-freezing would break
atomicity if you have any aborts. And I can't think of very many cases
where losing atomicity is reasonable (although I'm sure there are some).

Regards,
	Jeff Davis



Re: 9.3: load path to mitigate load penalty for checksums

From
Bruce Momjian
Date:
On Tue, Jun 12, 2012 at 01:31:05PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I don't think it's going to solve the problem in general, but TBH I
> > don't think Jeff's proposal is, either.  I mean, ignoring
> > xmin-committed, xmax-committed, and all-visible bits is going to come
> > with a pretty steep performance penalty all of its own.  I don't doubt
> > that you can construct situations in which it's better than incurring
> > the I/O associated with setting those bits after the fact, but I also
> > don't doubt that the reverse is true.  Furthermore, any system that
> > involves sometimes ignoring those things is going to add cost in
> > extremely hot code paths even when the user doesn't elect to take
> > advantage of the new feature.
> 
> Yeah; the notion of adding cycles to the visibility-check code paths,
> which would add cost even for users who have no use at all for this
> feature, is close to being a deal-breaker for me.  I would rather see
> us designing this around the idea of "what can we do without adding
> any new complexity in visibility checks?".
> 
> At least for MVCC cases (hence, user tables only), it seems like we
> could pre-set XMIN_COMMITTED hint bits if there were a way to be sure
> that the XID in question would be seen as still running in the followup
> snapshot check that every MVCC visibility test must make anyway.  The
> hard part of that seems to be post-crash behavior, but maybe it'd be
> acceptable to incur crash-recovery-time cost to run around and unset
> such bits?

Well, truncating tables that were empty at the start of the load would
certainly be easy to do --- not sure if that helps us, though.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: 9.3: load path to mitigate load penalty for checksums

From
Jeff Davis
Date:
On Tue, 2012-06-12 at 13:10 -0400, Robert Haas wrote:
> I don't think it's going to solve the problem in general, but TBH I
> don't think Jeff's proposal is, either.  I mean, ignoring
> xmin-committed, xmax-committed, and all-visible bits is going to come
> with a pretty steep performance penalty all of its own.

I think we'd certainly have to discourage users from launching lots of
full table scans during a data load. We could go so far as blocking
reads, as I said in my other response, and still (mostly) meet my
primary goals.

But allowing reads without hint bits might be useful for highly
selective index access, which is a substantial use case for the kind of
large tables we'd be bulk-loading.

Regards,
	Jeff Davis



Re: 9.3: load path to mitigate load penalty for checksums

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> In those terms, we can reframe the questions as: what do we do about
> reads during the load?

> Answers could include:
>   (a) nothing -- reads during a load are potentially dirty
>   (b) readers ignore hint bits during the load, and pay the penalty
>   (c) allow only INSERT/COPY, and block unsafe SELECT/UPDATE/DELETE

Or (d) it's not a problem, since the inserting XID is still busy
according to the readers' snapshots.

The part I think is actually hard is how to clean up if the inserting
xact doesn't reach commit.  I think what we're basically looking at here
is pushing more cost into that path in order to avoid cost in successful
cases.  The first design that comes to mind is

(1) the inserting xact remembers which tables it's inserted pre-hinted
tuples into, and if it has to abort, it first seqscans those tables to
reset the hint bits;
(2) it also emits WAL entries that will cause crash recovery to perform
an identical scan, if WAL ends without finding a commit or abort record
for the inserting xact.

But there may be other better ways.  One problem with the above sketch
is that you can't checkpoint till the insertion is committed, since a
checkpoint would prevent crash recovery from seeing the warning WAL
records.
        regards, tom lane


Re: 9.3: load path to mitigate load penalty for checksums

From
Jeff Davis
Date:
On Tue, 2012-06-12 at 17:53 -0400, Bruce Momjian wrote:
> Well, truncating tables that were empty on the load would certainly be
> a easy to do --- not sure if that helps us, though.

This goes back to the single-loader-into-an-empty-table case, which I'd
like to get away from. I think we already have some reasonable proposals
on the table for optimizing that case.

Ultimately, that pushes us toward partitioning with the same granularity
as the data load. We can try to open that discussion up, but I think
it's a bad idea to tie the data load granularity to the partitioning
granularity, unless we have a way to coalesce them later without
rewriting. It's too bad filesystems don't allow us to just reallocate
some blocks from one file to another. [ OK, I guess I just opened this
discussion myself ;) ]

Regards,
	Jeff Davis




Re: 9.3: load path to mitigate load penalty for checksums

From
Jeff Davis
Date:
On Tue, 2012-06-12 at 18:02 -0400, Tom Lane wrote:
> Or (d) it's not a problem, since the inserting XID is still busy
> according to the readers' snapshots.

How much of a savings did we get from PD_ALL_VISIBLE when it was added
into the page-at-a-time visibility check?

From 608195a3a3656145a7eec7a47d903bc684011d73:

"In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on
each heap page, also indicating that all tuples on the page are visible
to all transactions. It's important that this flag is kept up-to-date.
It is also used to skip visibility tests in sequential scans, which
gives a small performance gain on seqscans."

If "small" means that it's something we can give up, then focusing on
HEAP_XMIN_COMMITTED makes sense. But if we can't give it up, then we
need to take it into account in the proposal.

Regards,
	Jeff Davis




Re: 9.3: load path to mitigate load penalty for checksums

From
Robert Haas
Date:
On Tue, Jun 12, 2012 at 6:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The part I think is actually hard is how to clean up if the inserting
> xact doesn't reach commit.  I think what we're basically looking at here
> is pushing more cost into that path in order to avoid cost in successful
> cases.  The first design that comes to mind is
>
> (1) the inserting xact remembers which tables it's inserted pre-hinted
> tuples into, and if it has to abort, it first seqscans those tables to
> reset the hint bits;

I don't think we can count on that to be safe in an arbitrarily chosen
abort path.  Anything FATAL, for instance.  I think we're going to
need to keep track of some kind of table-xmin value, representing the
oldest operation on the table that's not cleaned up yet, and make it
autovacuum's job to clean any that precede OldestXmin.  If the backend
can clean itself up, great, but there has to be some kind of allowance
for the case where that doesn't happen.

I'm also skeptical about the notion that "scan the whole table" is
going to be a good idea.  It really will have to be a full sequential
scan, if we're setting visibility map bits as we go, not just a scan
of pages that are not-all-visible, as vacuum normally does.  I think
if we want to go this route, we need to log the TID of every tuple we
write into the heap into some kind of undo fork (or maybe just the
block numbers), so that if the transaction aborts, we (or autovacuum)
can go back and find all of those TIDs and mark the tuples dead
without having to scan through (potentially) terabytes of data.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 9.3: load path to mitigate load penalty for checksums

From
Robert Haas
Date:
On Tue, Jun 12, 2012 at 6:26 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, 2012-06-12 at 18:02 -0400, Tom Lane wrote:
>> Or (d) it's not a problem, since the inserting XID is still busy
>> according to the readers' snapshots.
>
> How much of a savings did we get from PD_ALL_VISIBLE when it was added
> into the page-at-a-time visibility check?
>
> From 608195a3a3656145a7eec7a47d903bc684011d73:
>
> "In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on
> each heap page, also indicating that all tuples on the page are visible
> to all transactions. It's important that this flag is kept up-to-date.
> It is also used to skip visibility tests in sequential scans, which
> gives a small performance gain on seqscans."
>
> If "small" means that it's something we can give up, then focusing on
> HEAP_XMIN_COMMITTED makes sense. But if we can't give it up, then we
> need to take it into account in the proposal.

It's significant.

rhaas=# create table foo (a int, b text);
ERROR:  relation "foo" already exists
rhaas=# create table bar (a int, b text);
CREATE TABLE
rhaas=# insert into bar select g,
random()::text||random()::text||random()::text||random()::text from
generate_series(1,1000000) g;
INSERT 0 1000000
rhaas=# \timing
Timing is on.
rhaas=# select sum(1) from bar;
   sum
---------
 1000000
(1 row)

Time: 257.500 ms
rhaas=# select sum(1) from bar;
   sum
---------
 1000000
(1 row)

Time: 140.763 ms
rhaas=# select sum(1) from bar;
   sum
---------
 1000000
(1 row)

Time: 142.760 ms
rhaas=# select sum(1) from bar;
   sum
---------
 1000000
(1 row)

Time: 140.603 ms
rhaas=# vacuum bar;
VACUUM
Time: 133.084 ms
rhaas=# select sum(1) from bar;
   sum
---------
 1000000
(1 row)

Time: 123.591 ms
rhaas=# select sum(1) from bar;
   sum
---------
 1000000
(1 row)

Time: 123.096 ms
rhaas=# select sum(1) from bar;
   sum
---------
 1000000
(1 row)

Time: 122.653 ms

So vacuuming to set the PD_ALL_VISIBLE bits is buying us more than 10% here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 9.3: load path to mitigate load penalty for checksums

From
Jeff Davis
Date:
On Tue, 2012-06-12 at 22:06 -0400, Robert Haas wrote:
> > How much of a savings did we get from PD_ALL_VISIBLE when it was added
> > into the page-at-a-time visibility check?
> >
> > From 608195a3a3656145a7eec7a47d903bc684011d73:
> >
> > "In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on
> > each heap page, also indicating that all tuples on the page are visible
> > to all transactions. It's important that this flag is kept up-to-date.
> > It is also used to skip visibility tests in sequential scans, which
> > gives a small performance gain on seqscans."
> >
> > If "small" means that it's something we can give up, then focusing on
> > HEAP_XMIN_COMMITTED makes sense. But if we can't give it up, then we
> > need to take it into account in the proposal.
> 
> It's significant.

In that case, the proposals that only involve HEAP_XMIN_COMMITTED don't
seem viable to me. To get maximum read speed, we will need to set
PD_ALL_VISIBLE also, and that means rewriting the entire table anyway
(for the workload that I'm describing).

However, maybe if we combine the approaches, readers could ignore
PD_ALL_VISIBLE during the load, which looks like maybe a 10% penalty,
without having to ignore HEAP_XMIN_COMMITTED (which seems much more
costly to ignore).

Regards,Jeff Davis