Thread: foreign table batch inserts
Hi,

I realized that inserts into foreign tables are only done row by row. Consider
copying data from one local table to a foreign table with

INSERT INTO foreign_table(a,b,c) SELECT a,b,c FROM local_table;

When the foreign server is, for example, in another datacenter with long
latency, this is an enormous performance trade-off.

Wouldn't it make sense to do the insert batch-wise, e.g. 100 rows at a time?
Are there any plans for doing that, or am I missing something?

regards

Manuel Kniep
On Wed, May 18, 2016 at 6:00 AM, Manuel Kniep <m.kniep@web.de> wrote:
> I realized that inserts into foreign tables are only done row by row.
> Consider copying data from one local table to a foreign table with
>
> INSERT INTO foreign_table(a,b,c) SELECT a,b,c FROM local_table;
>
> When the foreign server is, for example, in another datacenter with long
> latency, this is an enormous performance trade-off.
>
> Wouldn't it make sense to do the insert batch-wise, e.g. 100 rows at a time?

Using a single query string with multiple values, perhaps, but then the
query string length limit comes into consideration, particularly for
large text values... The query used for the insertion has been a prepared
statement since writable foreign tables were introduced in 9.3, which makes
the code quite simple, actually.

> Are there any plans for doing that, or am I missing something?

Not that I know of. I am adding Fujita-san in the loop here; he is
quite involved with postgres_fdw these days, so perhaps he has some
input to offer.
-- 
Michael
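FWIW, here is a minimal client-side sketch of the "single query string with
multiple values" idea, assuming a libpq connection taken from the environment
and the foreign_table(a,b,c) layout from the example above. postgres_fdw does
not do this today, and the helper below is purely illustrative.

/* Sketch: collapse a batch of rows into one multi-row INSERT ... VALUES. */
#include <stdio.h>
#include <libpq-fe.h>

#define NCOLS 3

static int
insert_batch(PGconn *conn, const char *const *values, int nrows)
{
    char        sql[32768];
    int         len;
    PGresult   *res;
    int         ok;

    /* Build "INSERT INTO foreign_table(a,b,c) VALUES ($1,$2,$3),($4,$5,$6),..." */
    len = snprintf(sql, sizeof(sql), "INSERT INTO foreign_table(a,b,c) VALUES ");
    for (int r = 0; r < nrows; r++)
        len += snprintf(sql + len, sizeof(sql) - len, "%s($%d,$%d,$%d)",
                        r > 0 ? "," : "",
                        r * NCOLS + 1, r * NCOLS + 2, r * NCOLS + 3);

    res = PQexecParams(conn, sql, nrows * NCOLS, NULL, values, NULL, NULL, 0);
    ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    if (!ok)
        fprintf(stderr, "batch insert failed: %s", PQerrorMessage(conn));
    PQclear(res);
    return ok;
}

int
main(void)
{
    PGconn     *conn = PQconnectdb("");     /* PGHOST/PGDATABASE/etc. from environment */
    const char *const rows[] = {"1", "2", "3", "4", "5", "6"};  /* two example rows */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }
    insert_batch(conn, rows, 2);
    PQfinish(conn);
    return 0;
}

The hard caps alluded to above would still apply: the Bind message allows at
most 65535 parameters per statement, and a long VALUES list makes for a long
query string, so a real implementation would have to split batches accordingly.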
On Wed, May 18, 2016 at 12:27 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 18 May 2016 at 06:08, Michael Paquier <michael.paquier@gmail.com> wrote:
>> > Wouldn't it make sense to do the insert batch-wise, e.g. 100 rows at a time?
>>
>> Using a single query string with multiple values, perhaps, but then the
>> query string length limit comes into consideration, particularly for
>> large text values... The query used for the insertion has been a prepared
>> statement since writable foreign tables were introduced in 9.3, which makes
>> the code quite simple, actually.
>
> This should be done how PgJDBC does batches. It'd require a libpq
> enhancement, but it's one we IMO need anyway: allow pipelined query
> execution from libpq.

That's also something that would be useful for the ODBC driver. Since
it uses libpq as a hard dependency and does not speak the protocol
directly, it does additional round trips to the server for this
exact reason when preparing a statement.

> [design follows]
> This would require libpq to be smarter about how it tracks queries. Right
> now it keeps track of the current query, query results, etc. directly in the
> connection object, and it sends a Sync after each operation, then expects to
> wait in a busy state until it gets the results from that operation.

Yep.

> Instead we'd have to have a FIFO queue of messages libpq expects responses
> for. Variants of PQsendPrepare, PQsendQueryPrepared, PQsendDescribePrepared,
> etc. would not send a Sync message and would append an entry to the expected
> result queue instead of setting the current query, etc. on the connection.
> They'd still mark the connection as busy, so no non-queue-aware calls could
> be run until the queue is consumed and empty.

Yep. That's exactly the ODBC regression, which becomes a huge problem
with more latency.
-- 
Michael
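To make the queued-execution model above a bit more concrete, here is a minimal
sketch of what pipelined inserts could look like from the client side. It is
written against the pipeline-mode entry points that much newer libpq versions
expose (PQenterPipelineMode, PQpipelineSync, PGRES_PIPELINE_SYNC); nothing like
this exists in libpq as of this thread, so treat it purely as an illustration of
the idea, with placeholder connection settings and table name.

/* Sketch: queue many INSERTs, send one Sync, then drain all the replies. */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("");     /* connection settings from environment */
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }
    if (!PQenterPipelineMode(conn))
        return 1;

    /* Queue a batch of INSERTs without waiting for individual responses. */
    for (int i = 0; i < 100; i++)
    {
        char        val[16];

        snprintf(val, sizeof(val), "%d", i);
        const char *const params[] = {val};

        if (!PQsendQueryParams(conn, "INSERT INTO foreign_table(a) VALUES ($1)",
                               1, NULL, params, NULL, NULL, 0))
            fprintf(stderr, "queueing failed: %s", PQerrorMessage(conn));
    }

    /* A single Sync ends the batch; only now do we wait on the network. */
    PQpipelineSync(conn);

    /* Drain: one result per queued INSERT, then the sync marker. */
    while ((res = PQgetResult(conn)) != NULL)
    {
        ExecStatusType st = PQresultStatus(res);

        if (st == PGRES_PIPELINE_SYNC)
        {
            PQclear(res);
            break;
        }
        if (st != PGRES_COMMAND_OK)
            fprintf(stderr, "insert failed: %s", PQresultErrorMessage(res));
        PQclear(res);
        (void) PQgetResult(conn);   /* consume the NULL that ends this query's results */
    }

    PQexitPipelineMode(conn);
    PQfinish(conn);
    return 0;
}

The FIFO queue described above is essentially what makes the drain loop
possible: the client has to remember, in order, which queries it has queued so
it can match each incoming result to the right caller.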
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Craig Ringer
Yes, I want FE-BE protocol-level batch inserts/updates/deletes, too. I was
just about to start thinking about how to implement it because of a recent
user question in pgsql-odbc. The OP uses Microsoft SQL Server Integration
Services (SSIS) to migrate data to PostgreSQL. He asked for a way to speed
up multi-row inserts, because ODBC's multi-row insert API takes about as
long as performing single-row inserts separately. This may prevent the
migration to PostgreSQL.

And it's also useful for ECPG. Our customer wanted ECPG to support multi-row
inserts to migrate to PostgreSQL, because their embedded-SQL apps use that
feature with a commercial database.

If you take on this feature, I can help by reviewing and testing, and by
implementing the ODBC and ECPG sides, etc.
Regards
Takayuki Tsunakawa
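For reference, the "multi-row insert API" being referred to on the ODBC side is
parameter-array binding (SQL_ATTR_PARAMSET_SIZE plus array-bound parameters).
A rough sketch follows; the DSN, table and column are made up, and, as described
above, psqlODBC today still ends up doing one insert per array element on the
wire, which is exactly why it is no faster than separate single-row inserts.

/* Sketch: ODBC parameter-array insert ("multi-row insert" from the app side). */
#include <stdio.h>
#include <sql.h>
#include <sqlext.h>

#define ROWS 100

int
main(void)
{
    SQLHENV     env;
    SQLHDBC     dbc;
    SQLHSTMT    stmt;
    SQLINTEGER  a[ROWS];
    SQLLEN      ind[ROWS];
    SQLULEN     done = 0;

    SQLAllocHandle(SQL_HANDLE_ENV, SQL_NULL_HANDLE, &env);
    SQLSetEnvAttr(env, SQL_ATTR_ODBC_VERSION, (SQLPOINTER) SQL_OV_ODBC3, 0);
    SQLAllocHandle(SQL_HANDLE_DBC, env, &dbc);
    SQLDriverConnect(dbc, NULL, (SQLCHAR *) "DSN=pg;", SQL_NTS,
                     NULL, 0, NULL, SQL_DRIVER_NOPROMPT);
    SQLAllocHandle(SQL_HANDLE_STMT, dbc, &stmt);

    for (int i = 0; i < ROWS; i++)
    {
        a[i] = i;
        ind[i] = 0;
    }

    /* Bind a whole array per parameter and declare the parameter-set size. */
    SQLSetStmtAttr(stmt, SQL_ATTR_PARAMSET_SIZE, (SQLPOINTER) (SQLULEN) ROWS, 0);
    SQLSetStmtAttr(stmt, SQL_ATTR_PARAMS_PROCESSED_PTR, &done, 0);
    SQLBindParameter(stmt, 1, SQL_PARAM_INPUT, SQL_C_SLONG, SQL_INTEGER,
                     0, 0, a, 0, ind);

    SQLPrepare(stmt, (SQLCHAR *) "INSERT INTO foreign_table(a) VALUES (?)", SQL_NTS);
    if (!SQL_SUCCEEDED(SQLExecute(stmt)))
        fprintf(stderr, "array insert failed\n");
    printf("rows processed: %lu\n", (unsigned long) done);

    SQLFreeHandle(SQL_HANDLE_STMT, stmt);
    SQLDisconnect(dbc);
    SQLFreeHandle(SQL_HANDLE_DBC, dbc);
    SQLFreeHandle(SQL_HANDLE_ENV, env);
    return 0;
}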
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Craig Ringer
Well, there's FE/BE level batching/pipelining already. Just no access to it from libpq.
Oh, really. The Bind ('B') message appears to take one set of parameter values,
not multiple sets (an array). Anyway, what I should have said is "I want a
batch update API in libpq", so that ODBC and ECPG can use it.
Regards
Takayuki Tsunakawa
You can, however, omit Sync from between messages and send a series of
protocol messages, like Parse/Bind/Execute/Bind/Execute/Bind/Execute/Sync,
to avoid round-trip overheads.
* PQsendQuery cannot be used, as it uses the simple query protocol; use PQsendQueryParams instead;
On 20 May 2016 at 15:35, Craig Ringer <craig@2ndquadrant.com> wrote:
> You can, however, omit Sync from between messages and send a series of
> protocol messages, like Parse/Bind/Execute/Bind/Execute/Bind/Execute/Sync,
> to avoid round-trip overheads.

I implemented what I think is a pretty solid proof of concept of this for
kicks this evening. Patch attached, including a basic test program. The
performance difference over higher-latency links is huge; see below.
On 2016/05/18 7:08, Michael Paquier wrote:
> On Wed, May 18, 2016 at 6:00 AM, Manuel Kniep <m.kniep@web.de> wrote:
>> I realized that inserts into foreign tables are only done row by row.
>> Consider copying data from one local table to a foreign table with
>>
>> INSERT INTO foreign_table(a,b,c) SELECT a,b,c FROM local_table;
>>
>> When the foreign server is, for example, in another datacenter with long
>> latency, this is an enormous performance trade-off.

> I am adding Fujita-san in the loop here; he is
> quite involved with postgres_fdw these days, so perhaps he has some
> input to offer.

Honestly, I didn't have any idea for executing such an insert efficiently,
but I was thinking we could execute an insert into a foreign table
efficiently by sending the whole insert to the remote server, if possible.
For example, if the insert is of the form:

INSERT INTO foreign_table(a,b,c) VALUES (1, 2, 3), (4, 5, 6) or
INSERT INTO foreign_table(a,b,c) SELECT a,b,c FROM foreign_table2

where foreign_table and foreign_table2 belong to the same foreign server,
then we could send the whole insert to the remote server. Wouldn't that
make sense?

Best regards,
Etsuro Fujita
On Thu, May 26, 2016 at 4:25 AM, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:
> On 2016/05/18 7:08, Michael Paquier wrote:
>> On Wed, May 18, 2016 at 6:00 AM, Manuel Kniep <m.kniep@web.de> wrote:
>>> I realized that inserts into foreign tables are only done row by row.
>>> Consider copying data from one local table to a foreign table with
>>>
>>> INSERT INTO foreign_table(a,b,c) SELECT a,b,c FROM local_table;
>>>
>>> When the foreign server is, for example, in another datacenter with long
>>> latency, this is an enormous performance trade-off.
>
>> I am adding Fujita-san in the loop here; he is
>> quite involved with postgres_fdw these days, so perhaps he has some
>> input to offer.
>
> Honestly, I didn't have any idea for executing such an insert efficiently,
> but I was thinking we could execute an insert into a foreign table
> efficiently by sending the whole insert to the remote server, if possible.
> For example, if the insert is of the form:
>
> INSERT INTO foreign_table(a,b,c) VALUES (1, 2, 3), (4, 5, 6) or
> INSERT INTO foreign_table(a,b,c) SELECT a,b,c FROM foreign_table2
>
> where foreign_table and foreign_table2 belong to the same foreign server,
> then we could send the whole insert to the remote server.
>
> Wouldn't that make sense?

Query strings have a limited length, and this assumption is true for
many code paths in the backend code, so doing that with a long string
would introduce more pain in the logic than anything else, as this
would become more data-type sensitive.
-- 
Michael
On 2016/05/27 8:49, Michael Paquier wrote:
> On Thu, May 26, 2016 at 4:25 AM, Etsuro Fujita
> <fujita.etsuro@lab.ntt.co.jp> wrote:
>> Honestly, I didn't have any idea for executing such an insert efficiently,
>> but I was thinking we could execute an insert into a foreign table
>> efficiently by sending the whole insert to the remote server, if possible.
>> For example, if the insert is of the form:
>>
>> INSERT INTO foreign_table(a,b,c) VALUES (1, 2, 3), (4, 5, 6) or
>> INSERT INTO foreign_table(a,b,c) SELECT a,b,c FROM foreign_table2
>>
>> where foreign_table and foreign_table2 belong to the same foreign server,
>> then we could send the whole insert to the remote server.
>>
>> Wouldn't that make sense?

> Query strings have a limited length, and this assumption is true for
> many code paths in the backend code, so doing that with a long string
> would introduce more pain in the logic than anything else, as this
> would become more data-type sensitive.

That's a good point, but the basic idea is to send the local query
almost as-is to the remote server if possible. For example, if the local
query is "INSERT INTO foreign_table(a,b,c) VALUES (1, 2, 3), (4, 5, 6)",
send the remote query "INSERT INTO remote_table(a,b,c) VALUES (1, 2, 3),
(4, 5, 6)" to the remote server, where remote_table is the table name for
the foreign table on the remote server. So, would the query string length
really be a problem in many cases? Maybe I'm missing something, though.

Best regards,
Etsuro Fujita
On 2016/05/30 22:59, Craig Ringer wrote:
> On 30 May 2016 at 16:17, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:
>> That's a good point, but the basic idea is to send the local query
>> almost as-is to the remote server if possible. For example, if the local
>> query is "INSERT INTO foreign_table(a,b,c) VALUES (1, 2, 3), (4, 5, 6)",
>> send the remote query "INSERT INTO remote_table(a,b,c) VALUES (1, 2, 3),
>> (4, 5, 6)" to the remote server, where remote_table is the table name for
>> the foreign table on the remote server. So, would the query string length
>> really be a problem in many cases? Maybe I'm missing something, though.
>
> FDWs don't operate at that level. They don't see the original query string.
> They're plan nodes that operate with a row-by-row push/pull model. The
> foreign table node in question has no idea you're doing a multivalued
> insert and doesn't care if it's INSERT INTO ... SELECT, INSERT INTO ...
> VALUES, or COPY.

IIUC, what Fujita-san seems to be referring to here is safe push-down of an
insert's query or values expression (and hence the whole insert itself),
considered during the *planning* step. Although that sounds like a
different optimization from what's being discussed on this thread. The
latter certainly seems to have its benefits in case of push-down failure,
which might well be the majority of cases.

Thanks,
Amit
On 2016/05/31 14:53, Amit Langote wrote:
> On 2016/05/30 22:59, Craig Ringer wrote:
>> On 30 May 2016 at 16:17, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:
>>> That's a good point, but the basic idea is to send the local query
>>> almost as-is to the remote server if possible. For example, if the local
>>> query is "INSERT INTO foreign_table(a,b,c) VALUES (1, 2, 3), (4, 5, 6)",
>>> send the remote query "INSERT INTO remote_table(a,b,c) VALUES (1, 2, 3),
>>> (4, 5, 6)" to the remote server, where remote_table is the table name for
>>> the foreign table on the remote server. So, would the query string length
>>> really be a problem in many cases? Maybe I'm missing something, though.
>> FDWs don't operate at that level. They don't see the original query string.
>> They're plan nodes that operate with a row-by-row push/pull model. The
>> foreign table node in question has no idea you're doing a multivalued
>> insert and doesn't care if it's INSERT INTO ... SELECT, INSERT INTO ...
>> VALUES, or COPY.
> IIUC, what Fujita-san seems to be referring to here is safe push-down of an
> insert's query or values expression (and hence the whole insert itself),
> considered during the *planning* step.

That's really what I have in mind. Thanks for the explanation!

> Although that sounds like a
> different optimization from what's being discussed on this thread. The
> latter certainly seems to have its benefits in case of push-down failure,
> which might well be the majority of cases.

Agreed.

Best regards,
Etsuro Fujita