Thread: Reduce WAL logging of INSERT SELECT
We currently have reduced WAL logging for wal_level = minimal for these
commands:

	CREATE TABLE AS
	CREATE INDEX
	CLUSTER
	COPY into tables that were created or truncated in the same
	transaction

One thing we don't optimize is INSERT ... SELECT when the table is
created or truncated in the same transaction.  Seems we could.

We optimize CREATE TABLE AS which is effectively SELECT ... INTO using a
different syntax.  Is this a TODO?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
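[For reference, the existing fast path looks like this in use -- a minimal
sketch, assuming wal_level = minimal; table and file names are illustrative:]

    -- COPY into a table created in the same transaction skips WAL;
    -- the heap is fsync'd at commit instead.
    BEGIN;
    CREATE TABLE items (id int, val text);
    COPY items FROM '/tmp/items.dat';
    COMMIT;

    -- The case under discussion: the same pattern with INSERT ... SELECT
    -- is fully WAL-logged today.
    BEGIN;
    CREATE TABLE items_copy (LIKE items);
    INSERT INTO items_copy SELECT * FROM items;
    COMMIT;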
Bruce Momjian <bruce@momjian.us> writes: > One thing we don't optimize is INSERT ... SELECT when the table is > created or truncated in the same transaction. Seems we could. > We optimize CREATE TABLE AS which is effectively SELECT ... INTO using a > different syntax. Is this a TODO? Considering that SELECT INTO is deprecated, I don't think we should be expending effort to encourage people to use it. regards, tom lane
On 08/04/2011 04:55 PM, Tom Lane wrote: > Bruce Momjian<bruce@momjian.us> writes: >> One thing we don't optimize is INSERT ... SELECT when the table is >> created or truncated in the same transaction. Seems we could. >> We optimize CREATE TABLE AS which is effectively SELECT ... INTO using a >> different syntax. Is this a TODO? > Considering that SELECT INTO is deprecated, I don't think we should be > expending effort to encourage people to use it. > > Right, but the original point about INSERT ... SELECT seems reasonable, no? cheers andrew
Andrew Dunstan wrote: > > > On 08/04/2011 04:55 PM, Tom Lane wrote: > > Bruce Momjian<bruce@momjian.us> writes: > >> One thing we don't optimize is INSERT ... SELECT when the table is > >> created or truncated in the same transaction. Seems we could. > >> We optimize CREATE TABLE AS which is effectively SELECT ... INTO using a > >> different syntax. Is this a TODO? > > Considering that SELECT INTO is deprecated, I don't think we should be > > expending effort to encourage people to use it. > > > > > > > Right, but the original point about INSERT ... SELECT seems reasonable, no? Right. I brought up SELECT INTO because you could make the argument that INSERT ... SELECT is not a utility command like the other ones and therefore can't be done easily, but CREATE TABLE AS is internal SELECT INTO and implemented in execMain.c, which I think is where INSERT ... SELECT would also be implemented. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Thu, Aug 4, 2011 at 10:46 PM, Bruce Momjian <bruce@momjian.us> wrote:

> Right.  I brought up SELECT INTO because you could make the argument
> that INSERT ... SELECT is not a utility command like the other ones and
> therefore can't be done easily, but CREATE TABLE AS is internal SELECT
> INTO and implemented in execMain.c, which I think is where INSERT ...
> SELECT would also be implemented.

What you should be asking is whether the optimisation would be
effective for INSERT SELECT, or even test it.

My observation is that the optimisation is only effective for very
large loads that cause I/O. As RAM sizes get bigger, I'm inclined to
remove the optimisation and make it optional, since it is ineffective
in many cases and has negative benefit for smaller cases.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs wrote:
> On Thu, Aug 4, 2011 at 10:46 PM, Bruce Momjian <bruce@momjian.us> wrote:
>
> > Right.  I brought up SELECT INTO because you could make the argument
> > that INSERT ... SELECT is not a utility command like the other ones and
> > therefore can't be done easily, but CREATE TABLE AS is internal SELECT
> > INTO and implemented in execMain.c, which I think is where INSERT ...
> > SELECT would also be implemented.
>
> What you should be asking is whether the optimisation would be
> effective for INSERT SELECT, or even test it.
>
> My observation is that the optimisation is only effective for very
> large loads that cause I/O. As RAM sizes get bigger, I'm inclined to
> remove the optimisation and make it optional, since it is ineffective
> in many cases and has negative benefit for smaller cases.

I am confused how avoiding WAL traffic that is larger than the heap
file we are fsync'ing can possibly be slower.  Are you just throwing out
an idea to try to make me prove it?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
On Thu, 2011-08-04 at 17:46 -0400, Bruce Momjian wrote:
> Right.  I brought up SELECT INTO because you could make the argument
> that INSERT ... SELECT is not a utility command like the other ones and
> therefore can't be done easily, but CREATE TABLE AS is internal SELECT
> INTO and implemented in execMain.c, which I think is where INSERT ...
> SELECT would also be implemented.

The above statement is a little confusing, so let me start from the
beginning:

How could we avoid WAL logging for INSERT ... SELECT?

The way we do it for CREATE TABLE AS is because nobody would even *see*
the table if our transaction doesn't commit. Therefore we don't need to
bother logging it. The same can be said for SELECT INTO.

INSERT ... SELECT is just an insert. It needs just as much logging as
inserting tuples any other way. For instance, it will potentially share
pages with other inserts, and it had better properly record all such
page modifications so that they can be restored to a consistent state.

Regards,
	Jeff Davis
Jeff Davis wrote: > On Thu, 2011-08-04 at 17:46 -0400, Bruce Momjian wrote: > > Right. I brought up SELECT INTO because you could make the argument > > that INSERT ... SELECT is not a utility command like the other ones and > > therefore can't be done easily, but CREATE TABLE AS is internal SELECT > > INTO and implemented in execMain.c, which I think is where INSERT ... > > SELECT would also be implemented. > > The above statement is a little confusing, so let me start from the > beginning: > > How could we avoid WAL logging for INSERT ... SELECT? > > The way we do it for CREATE TABLE AS is because nobody would even *see* > the table if our transaction doesn't commit. Therefore we don't need to > bother logging it. Same can be said for SELECT INTO. > > INSERT ... SELECT is just an insert. It needs just as much logging as > inserting tuples any other way. For instance, it will potentially share > pages with other inserts, and better properly record all such page > modifications so that they return to a consistent state. It would act like COPY, meaning the table would have to be truncated or created in the same transaction. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Thu, Aug 4, 2011 at 8:55 PM, Bruce Momjian <bruce@momjian.us> wrote: > Jeff Davis wrote: >> On Thu, 2011-08-04 at 17:46 -0400, Bruce Momjian wrote: >> > Right. I brought up SELECT INTO because you could make the argument >> > that INSERT ... SELECT is not a utility command like the other ones and >> > therefore can't be done easily, but CREATE TABLE AS is internal SELECT >> > INTO and implemented in execMain.c, which I think is where INSERT ... >> > SELECT would also be implemented. >> >> The above statement is a little confusing, so let me start from the >> beginning: >> >> How could we avoid WAL logging for INSERT ... SELECT? >> >> The way we do it for CREATE TABLE AS is because nobody would even *see* >> the table if our transaction doesn't commit. Therefore we don't need to >> bother logging it. Same can be said for SELECT INTO. >> >> INSERT ... SELECT is just an insert. It needs just as much logging as >> inserting tuples any other way. For instance, it will potentially share >> pages with other inserts, and better properly record all such page >> modifications so that they return to a consistent state. > > It would act like COPY, meaning the table would have to be truncated or > created in the same transaction. It seems to me that, if we know the relation was created or truncated in the current transaction, and if wal_level = minimal, then we don't need to WAL-log *anything* until transaction commit (provided we fsync at commit). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
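[A minimal sketch of the case Robert describes, assuming wal_level = minimal;
the table names are illustrative:]

    -- The table's prior contents are irrelevant if we crash, because the
    -- TRUNCATE itself rolls back, so in principle nothing about the new
    -- contents needs WAL before commit -- just an fsync of the heap.
    BEGIN;
    TRUNCATE staging;
    INSERT INTO staging SELECT * FROM src;  -- could skip WAL, like COPY does
    COMMIT;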
On Thu, 2011-08-04 at 20:55 -0400, Bruce Momjian wrote:
> It would act like COPY, meaning the table would have to be truncated or
> created in the same transaction.

Well, in that case it could work for any INSERT. No need for a SELECT to
be involved. For that matter, why not make it work for DELETE and
UPDATE, too?

However, I think this is all just a workaround for not having a faster
loading path. I don't object to applying this optimization to inserts,
but I think it might be more productive to figure out if we can support
loading data efficiently -- i.e. also set hint bits and frozenxid during
the load.

Regards,
	Jeff Davis
On 05.08.2011 04:23, Jeff Davis wrote:
> On Thu, 2011-08-04 at 20:55 -0400, Bruce Momjian wrote:
>> It would act like COPY, meaning the table would have to be truncated or
>> created in the same transaction.
>
> Well, in that case it could work for any INSERT. No need for a SELECT to
> be involved. For that matter, why not make it work for DELETE and
> UPDATE, too?

Yep. If we are to expand it, we should make it work for any operation.

However, for small operations it's a net loss - you avoid writing a WAL
record, but you have to fsync() the heap instead. If you only modify a
few rows, the extra fsync (or fsyncs if there are indexes too) is more
expensive than writing the WAL.

We'd need a heuristic to decide whether to write WAL or fsync at the
end. For regular INSERTs, UPDATEs and DELETEs, you have the planner's
estimate of the number of rows affected. Another thing we should do is
move the fsync call from the end of COPY (and other such operations) to
the end of transaction. That way if you do e.g. one COPY followed by a
bunch of smaller INSERTs or UPDATEs, you only need to fsync the files
once.

> However, I think this is all just a workaround for not having a faster
> loading path. I don't object to applying this optimization to inserts,
> but I think it might be more productive to figure out if we can support
> loading data efficiently -- i.e. also set hint bits and frozenxid during
> the load.

Yeah, that would make a much bigger impact in practice.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
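[To illustrate the estimate such a heuristic could consult -- a hypothetical
example; the table names are made up and the plan line is schematic:]

    -- The rows= figure in the top plan node is the planner's estimate of
    -- how many rows the statement will write; a heuristic could compare
    -- it against some threshold before deciding to skip WAL.
    EXPLAIN INSERT INTO staging SELECT * FROM src;
    --  Insert on staging  (cost=... rows=<estimate> width=...)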
On Thu, Aug 4, 2011 at 11:07 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Simon Riggs wrote:
>> On Thu, Aug 4, 2011 at 10:46 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>
>> > Right.  I brought up SELECT INTO because you could make the argument
>> > that INSERT ... SELECT is not a utility command like the other ones and
>> > therefore can't be done easily, but CREATE TABLE AS is internal SELECT
>> > INTO and implemented in execMain.c, which I think is where INSERT ...
>> > SELECT would also be implemented.
>>
>> What you should be asking is whether the optimisation would be
>> effective for INSERT SELECT, or even test it.
>>
>> My observation is that the optimisation is only effective for very
>> large loads that cause I/O. As RAM sizes get bigger, I'm inclined to
>> remove the optimisation and make it optional, since it is ineffective
>> in many cases and has negative benefit for smaller cases.
>
> I am confused how avoiding WAL traffic that is larger than the heap
> file we are fsync'ing can possibly be slower.

I'm telling you what I know to be true, as assistance to you.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Thu, 2011-08-04 at 18:07 -0400, Bruce Momjian wrote:
> I am confused how avoiding WAL traffic that is larger than the heap
> file we are fsync'ing can possibly be slower.  Are you just throwing out
> an idea to try to make me prove it?

That's worded in a slightly confusing way, but here is the trade-off:

1. If you are using WAL, then regardless of what your transaction does,
only the WAL needs to be fsync'd at commit time. Conveniently, that's
being written sequentially, so it's a single fairly cheap fsync; and all
the data page changes are deferred, collected together, and fsync'd at
checkpoint time (rather than commit time). The cost is that you
double-write the data.

2. If you aren't using WAL, you need to fsync every data file the
transaction touched, which are probably not localized with other
activity. Also, the _entire_ data file needs to be sync'd, so perhaps
many other transactions have made changes to one data file all over, and
it may require _many_ seeks to accomplish the one fsync. The benefit is
that you don't double-write the data.

So, fundamentally, WAL is (in the OLTP case, where a transaction is much
shorter than a checkpoint interval) a big performance _win_, because it
allows us to do nice sequential writing in a single place for all
activities of all transactions, and defer all those random writes to
data pages until the next checkpoint. So we shouldn't treat WAL like a
cost burden that we want to avoid in every case we can.

But in the data load case (where many checkpoints may happen during a
single transaction anyway), it happens that avoiding WAL is a
performance win, because the seeks are not the dominant cost.

Regards,
	Jeff Davis
Jeff Davis wrote:
> On Thu, 2011-08-04 at 18:07 -0400, Bruce Momjian wrote:
> > I am confused how avoiding WAL traffic that is larger than the heap
> > file we are fsync'ing can possibly be slower.  Are you just throwing out
> > an idea to try to make me prove it?
>
> That's worded in a slightly confusing way, but here is the trade-off:
>
> 1. If you are using WAL, then regardless of what your transaction does,
> only the WAL needs to be fsync'd at commit time. Conveniently, that's
> being written sequentially, so it's a single fairly cheap fsync; and all
> the data page changes are deferred, collected together, and fsync'd at
> checkpoint time (rather than commit time). The cost is that you
> double-write the data.
>
> 2. If you aren't using WAL, you need to fsync every data file the
> transaction touched, which are probably not localized with other
> activity. Also, the _entire_ data file needs to be sync'd, so perhaps
> many other transactions have made changes to one data file all over, and
> it may require _many_ seeks to accomplish the one fsync. The benefit is
> that you don't double-write the data.
>
> So, fundamentally, WAL is (in the OLTP case, where a transaction is much
> shorter than a checkpoint interval) a big performance _win_, because it
> allows us to do nice sequential writing in a single place for all
> activities of all transactions, and defer all those random writes to
> data pages until the next checkpoint. So we shouldn't treat WAL like a
> cost burden that we want to avoid in every case we can.
>
> But in the data load case (where many checkpoints may happen during a
> single transaction anyway), it happens that avoiding WAL is a
> performance win, because the seeks are not the dominant cost.

Well, if the table is created in the same transaction (which is the only
case under consideration), no other sessions can write to the table, so
you are just writing the entire table on commit, rather than to the WAL.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
> However, for small operations it's a net loss - you avoid writing a WAL
> record, but you have to fsync() the heap instead. If you only modify a
> few rows, the extra fsync (or fsyncs if there are indexes too) is more
> expensive than writing the WAL.
>
> We'd need a heuristic to decide whether to write WAL or fsync at the
> end. For regular INSERTs, UPDATEs and DELETEs, you have the planner's
> estimate of the number of rows affected. Another thing we should do is
> move the fsync call from the end of COPY (and other such operations) to
> the end of transaction. That way if you do e.g. one COPY followed by a
> bunch of smaller INSERTs or UPDATEs, you only need to fsync the files
> once.

Have you thought about recovery, especially when the page size is
greater than the disk block size (> 4K)?  With WAL, there is a way to
restore the pages to their original state during recovery if there are
partial page writes. Is it possible to do the same with direct fsync
without WAL?

Gokul.
On 06.08.2011 06:32, Gokulakannan Somasundaram wrote:
>> However, for small operations it's a net loss - you avoid writing a WAL
>> record, but you have to fsync() the heap instead. If you only modify a
>> few rows, the extra fsync (or fsyncs if there are indexes too) is more
>> expensive than writing the WAL.
>>
>> We'd need a heuristic to decide whether to write WAL or fsync at the
>> end. For regular INSERTs, UPDATEs and DELETEs, you have the planner's
>> estimate of the number of rows affected. Another thing we should do is
>> move the fsync call from the end of COPY (and other such operations) to
>> the end of transaction. That way if you do e.g. one COPY followed by a
>> bunch of smaller INSERTs or UPDATEs, you only need to fsync the files
>> once.
>
> Have you thought about recovery, especially when the page size is
> greater than the disk block size (> 4K)?  With WAL, there is a way to
> restore the pages to their original state during recovery if there are
> partial page writes. Is it possible to do the same with direct fsync
> without WAL?

The point of the optimization is that you can only skip WAL when the
table has been created (or truncated) in the same transaction. In that
case, if the transaction aborts because of a crash, you don't care about
the contents of the table anyway.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
On Sat, Aug 6, 2011 at 4:16 AM, Bruce Momjian <bruce@momjian.us> wrote:

> Well, if the table is created in the same transaction (which is the only
> case under consideration), no other sessions can write to the table so
> you are just writing the entire table on commit, rather than to the WAL.

Below a certain point, skipping WAL is slower, and over an intermediate
range there is no benefit. So small amounts of data on large servers go
slower.

heap_fsync() requires a scan of shared buffers, which may not be cheap.

There is a difficulty because you would need to calculate what the
cut-off is for a particular database, and then predict ahead of time
whether the number of rows that will be handled by the statement is low
enough to warrant using the optimisation. Both of those I call a hard
problem.

I think we should remove the COPY optimisation because of this and
definitely not extend INSERT SELECT to perform it automatically.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 06.08.2011 13:13, Simon Riggs wrote: > I think we should remove the COPY optimisation because of this and > definitely not extend INSERT SELECT to perform it automatically. It can be very helpful when loading a lot of data, so I'm not in favor of removing it altogether. Maybe WAL-log the first 10000 rows or such normally, and skip WAL after that. Of course, loading 10001 rows becomes the worst case then, but something along those lines... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sat, Aug 6, 2011 at 11:05 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 06.08.2011 13:13, Simon Riggs wrote:
>>
>> I think we should remove the COPY optimisation because of this and
>> definitely not extend INSERT SELECT to perform it automatically.
>
> It can be very helpful when loading a lot of data, so I'm not in favor of
> removing it altogether. Maybe WAL-log the first 10000 rows or such normally,
> and skip WAL after that. Of course, loading 10001 rows becomes the worst
> case then, but something along those lines...

Why 10000 rows? Maybe the right solution is to move toward making a
normal table unlogged, and vice versa... that's probably harder to do,
but we would have better control and fewer odd heuristics.

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Heikki Linnakangas wrote: > On 06.08.2011 13:13, Simon Riggs wrote: >> I think we should remove the COPY optimisation because of this and >> definitely not extend INSERT SELECT to perform it automatically. > > It can be very helpful when loading a lot of data, so I'm not in > favor of removing it altogether. Yeah, it can currently help a lot. Of course, if WAL-logging could in any way facilitate hint bit and frozen xmin setting during bulk loads, I'm sure the WAL-logged version would win easily. -Kevin
Jaime Casanova <jaime@2ndquadrant.com> writes: > On Sat, Aug 6, 2011 at 11:05 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> It can be very helpful when loading a lot of data, so I'm not in favor of >> removing it altogether. Maybe WAL-log the first 10000 rows or such normally, >> and skip WAL after that. Of course, loading 10001 rows becomes the worst >> case then, but something along those lines... > why 10000 rows? Yeah; any particular number is wrong. Perhaps it'd be better to put the behavior under user control. In the case of COPY, where we already have a place to stick random options, you could imagine writing something like COPY ... WITH (bulk) to cue the system that a lot of data is coming in. But I don't see any nice way to do something similar for INSERT/SELECT. I hesitate to suggest a GUC, but something like "SET bulk_load = on" would be pretty straightforward to use in pg_dump for instance. regards, tom lane
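[What that might look like, purely as hypothetical syntax -- neither the
COPY option nor the GUC exists; the names simply follow Tom's suggestion:]

    -- Per-statement cue that a big load is coming:
    COPY items FROM '/tmp/items.dat' WITH (bulk);

    -- Or a session-level cue, e.g. something pg_dump output could emit:
    SET bulk_load = on;
    INSERT INTO items SELECT * FROM staging;
    SET bulk_load = off;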
On Fri, 2011-08-05 at 23:16 -0400, Bruce Momjian wrote:
> Well, if the table is created in the same transaction (which is the only
> case under consideration), no other sessions can write to the table so
> you are just writing the entire table on commit, rather than to the WAL.

The transaction can still write to many tables that way, and that could
mean many fsyncs. Also, there may be a bunch of other transactions
trying to write to the WAL that have to wait as your transaction has to
seek out to fsync the data file and then seek back to the WAL.

Regards,
	Jeff Davis
Simon Riggs wrote:
> On Sat, Aug 6, 2011 at 4:16 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> > Well, if the table is created in the same transaction (which is the only
> > case under consideration), no other sessions can write to the table so
> > you are just writing the entire table on commit, rather than to the WAL.
>
> Below a certain point, skipping WAL is slower, and over an intermediate
> range there is no benefit. So small amounts of data on large servers go
> slower.
>
> heap_fsync() requires a scan of shared buffers, which may not be cheap.
>
> There is a difficulty because you would need to calculate what the
> cut-off is for a particular database, and then predict ahead of time
> whether the number of rows that will be handled by the statement is low
> enough to warrant using the optimisation. Both of those I call a hard
> problem.
>
> I think we should remove the COPY optimisation because of this and
> definitely not extend INSERT SELECT to perform it automatically.

I ran some tests and Simon was correct and I was wrong.  There is a
measurable overhead to the optimization of avoiding WAL traffic for
small tables.

I tested git head with COPY and created the table inside and outside
the COPY transaction, with the attached script.  It ran in 11 seconds
without the optimization, and 12 seconds with the CREATE TABLE inside
the COPY transaction.

With these results, I withdraw my idea of adding this optimization to
other commands.  I think COPY is usually used in bulk mode, but the
other commands are often used in smaller batches that would be worse
with this optimization.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

	:
	. traprm

	echo 1 > $TMP/1

	echo "DROP TABLE copytest;" > $TMP/0
	for X in $(jot 1000)
	do
		cat >> $TMP/0 <<END
	-- move the next line into the transaction block to test the optimization overhead
	CREATE TABLE copytest(x int);
	BEGIN WORK;
	COPY copytest FROM '$TMP/1';
	COMMIT;
	DROP TABLE copytest;
	END
	done

	time sql test < $TMP/0 > /dev/null
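[For anyone reproducing this interactively, one iteration of the benchmark
boils down to the following psql session -- a sketch; the data file path is
illustrative, and `sql` above is Bruce's local wrapper script:]

    \timing on
    -- Table created in the same transaction as the COPY, so the
    -- WAL-skip path is taken at wal_level = minimal.
    BEGIN;
    CREATE TABLE copytest (x int);
    COPY copytest FROM '/tmp/one_row.dat';
    COMMIT;
    DROP TABLE copytest;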