Thread: Bulk Inserts
Hi,
I'm trying to measure the performance of the following: Multiple txns inserting tuples into a table concurrently vs single txn doing the whole insertion.
new table created as:
create table tab2 (
id serial,
attr1 integer not null,
attr2 integer not null,
primary key(id)
);
EXP 1: inserts with multiple txn:
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where attr2 = 10);
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where attr2 = 20);
note: attr2 has only two values 10 and 20
EXP 2: inserts with a single txn:
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1);
I also performed another experiment as follows:
EXP 3: select attr1, attr2 into tab2 from tab1;
The observation here is that EXP 3 is much faster than EXP 2, probably due to bulk inserts used by Postgres. However, I could not find a way to insert id values into tab2 using EXP 3. Also, select .. into .. from .. throws an error if we create the table first and then try to populate it using that command.
I have the following questions:
1. Is it possible to have an id column in tab2 and perform a bulk insert using select .. into .. from .. or using some other means?
2. If a table is already created, is it possible to do bulk inserts via multiple txns inserting into the same table (EXP 3)?
Best,
-SB
On 8/9/19 3:06 PM, Souvik Bhattacherjee wrote:
> Hi,
>
> I'm trying to measure the performance of the following: Multiple txns
> inserting tuples into a table concurrently vs single txn doing the whole
> insertion.
>
> *new table created as:*
> create table tab2 (
> id serial,
> attr1 integer not null,
> attr2 integer not null,
> primary key(id)
> );
>
> *EXP 1: inserts with multiple txn:*
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
> attr2 = 10);
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
> attr2 = 20);
>
> note: attr2 has only two values 10 and 20
>
> *EXP 2: inserts with a single txn:*
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1);
>
> I also performed another experiment as follows:
> *EXP 3:* select attr1, attr2 into tab2 from tab1;
>
> The observation here is EXP 3 is much faster than EXP 2 probably due to
> bulk inserts used by Postgres. However I could not find a way to insert
> id values in tab2 using EXP 3. Also select .. into .. from .. throws an
> error if we create a table first and then populate the tuples using the
> command.

Yes as SELECT INTO is functionally the same as CREATE TABLE AS:
https://www.postgresql.org/docs/11/sql-selectinto.html

> I have the following questions:
> 1. Is it possible to have an id column in tab2 and perform a bulk insert
> using select .. into .. from .. or using some other means?

Not using SELECT INTO for reasons given above.

Though it is possible to SELECT INTO as you show in EXP 3 and then:

alter table tab2 add column id serial primary key;

EXP 2 shows the other means.

> 2. If a table is already created, is it possible to do bulk inserts via
> multiple txns inserting into the same table (EXP 3)?

Yes, but you will need some code via client or function that batches the
inserts for you.

> Best,
> -SB

--
Adrian Klaver
adrian.klaver@aklaver.com
Hi Adrian,
Thanks for the response.
> Yes, but you will need some code via client or function that batches the
> inserts for you.
Could you please elaborate a bit on how EXP 1 could be performed such that it uses bulk inserts?
Best,
-SB
On Fri, Aug 9, 2019 at 7:26 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 8/9/19 3:06 PM, Souvik Bhattacherjee wrote:
> Hi,
>
> I'm trying to measure the performance of the following: Multiple txns
> inserting tuples into a table concurrently vs single txn doing the whole
> insertion.
>
> *new table created as:*
> create table tab2 (
> id serial,
> attr1 integer not null,
> attr2 integer not null,
> primary key(id)
> );
>
> *EXP 1: inserts with multiple txn:*
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
> attr2 = 10);
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
> attr2 = 20);
>
> note: attr2 has only two values 10 and 20
>
> *EXP 2: inserts with a single txn:*
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1);
>
> I also performed another experiment as follows:
> *EXP 3:* select attr1, attr2 into tab2 from tab1;
>
> The observation here is EXP 3 is much faster than EXP 2 probably due to
> bulk inserts used by Postgres. However I could not find a way to insert
> id values in tab2 using EXP 3. Also select .. into .. from .. throws an
> error if we create a table first and then populate the tuples using the
> command.
Yes as SELECT INTO is functionally the same as CREATE TABLE AS:
https://www.postgresql.org/docs/11/sql-selectinto.html
>
> I have the following questions:
> 1. Is it possible to have an id column in tab2 and perform a bulk insert
> using select .. into .. from .. or using some other means?
Not using SELECT INTO for reasons given above.
Though it is possible to SELECT INTO as you show in EXP 3 and then:
alter table tab2 add column id serial primary key;
EXP 2 shows the other means.
> 2. If a table is already created, is it possible to do bulk inserts via
> multiple txns inserting into the same table (EXP 3)?
Yes, but you will need some code via client or function that batches the
inserts for you.
>
> Best,
> -SB
--
Adrian Klaver
adrian.klaver@aklaver.com
Top-posting (i.e. putting your reply at the top) is discouraged here.

On Aug 10, 2019, at 8:47 PM, Souvik Bhattacherjee <kivuosb@gmail.com> wrote:

Hi Adrian,

Thanks for the response.

> Yes, but you will need some code via client or function that batches the
> inserts for you.

Could you please elaborate a bit on how EXP 1 could be performed such that
it uses bulk inserts?

Best,
-SB

On Fri, Aug 9, 2019 at 7:26 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 8/9/19 3:06 PM, Souvik Bhattacherjee wrote:
> Hi,
>
> I'm trying to measure the performance of the following: Multiple txns
> inserting tuples into a table concurrently vs single txn doing the whole
> insertion.
>
> *new table created as:*
> create table tab2 (
> id serial,
> attr1 integer not null,
> attr2 integer not null,
> primary key(id)
> );
>
> *EXP 1: inserts with multiple txn:*
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
> attr2 = 10);
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
> attr2 = 20);
>
> note: attr2 has only two values 10 and 20
>
> *EXP 2: inserts with a single txn:*
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1);
>
> I also performed another experiment as follows:
> *EXP 3:* select attr1, attr2 into tab2 from tab1;
>
> The observation here is EXP 3 is much faster than EXP 2 probably due to
> bulk inserts used by Postgres. However I could not find a way to insert
> id values in tab2 using EXP 3. Also select .. into .. from .. throws an
> error if we create a table first and then populate the tuples using the
> command.
Yes as SELECT INTO is functionally the same as CREATE TABLE AS:
https://www.postgresql.org/docs/11/sql-selectinto.html
>
> I have the following questions:
> 1. Is it possible to have an id column in tab2 and perform a bulk insert
> using select .. into .. from .. or using some other means?
Not using SELECT INTO for reasons given above.
Though it is possible to SELECT INTO as you show in EXP 3 and then:
alter table tab2 add column id serial primary key;
EXP 2 shows the other means.
> 2. If a table is already created, is it possible to do bulk inserts via
> multiple txns inserting into the same table (EXP 3)?
Yes, but you will need some code via client or function that batches the
inserts for you.
>
> Best,
> -SB
--
Adrian Klaver
adrian.klaver@aklaver.com
Does this appeal to you:
> Does this appeal to you:
Not sure if COPY can be used to transfer data between tables.
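(COPY itself reads from a file, a program, or STDIN rather than directly from another table, but the output of one COPY can be streamed into the input of another. A minimal sketch of that idea, assuming psycopg2 and an illustrative database name testdb, with tab1 and tab2 as defined earlier; the same thing can be done by piping two psql COPY commands together:)

import io
import psycopg2

# Sketch only: stream COPY output of tab1 straight into COPY input of tab2.
# tab2's id column is not listed, so its serial default still fills it in.
conn = psycopg2.connect("dbname=testdb")   # illustrative connection string

with conn, conn.cursor() as cur:
    buf = io.StringIO()
    cur.copy_expert("COPY (SELECT attr1, attr2 FROM tab1) TO STDOUT", buf)
    buf.seek(0)
    cur.copy_expert("COPY tab2 (attr1, attr2) FROM STDIN", buf)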
On Sat, Aug 10, 2019 at 11:01 PM Rob Sargent <robjsargent@gmail.com> wrote:
Top-posting (i.e. putting your reply at the top) is discouraged here.

On Aug 10, 2019, at 8:47 PM, Souvik Bhattacherjee <kivuosb@gmail.com> wrote:

Hi Adrian,

Thanks for the response.

> Yes, but you will need some code via client or function that batches the
> inserts for you.

Could you please elaborate a bit on how EXP 1 could be performed such that
it uses bulk inserts?

Best,
-SB

On Fri, Aug 9, 2019 at 7:26 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 8/9/19 3:06 PM, Souvik Bhattacherjee wrote:
> Hi,
>
> I'm trying to measure the performance of the following: Multiple txns
> inserting tuples into a table concurrently vs single txn doing the whole
> insertion.
>
> *new table created as:*
> create table tab2 (
> id serial,
> attr1 integer not null,
> attr2 integer not null,
> primary key(id)
> );
>
> *EXP 1: inserts with multiple txn:*
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
> attr2 = 10);
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
> attr2 = 20);
>
> note: attr2 has only two values 10 and 20
>
> *EXP 2: inserts with a single txn:*
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1);
>
> I also performed another experiment as follows:
> *EXP 3:* select attr1, attr2 into tab2 from tab1;
>
> The observation here is EXP 3 is much faster than EXP 2 probably due to
> bulk inserts used by Postgres. However I could not find a way to insert
> id values in tab2 using EXP 3. Also select .. into .. from .. throws an
> error if we create a table first and then populate the tuples using the
> command.
Yes as SELECT INTO is functionally the same as CREATE TABLE AS:
https://www.postgresql.org/docs/11/sql-selectinto.html
>
> I have the following questions:
> 1. Is it possible to have an id column in tab2 and perform a bulk insert
> using select .. into .. from .. or using some other means?
Not using SELECT INTO for reasons given above.
Though it is possible to SELECT INTO as you show in EXP 3 and then:
alter table tab2 add column id serial primary key;
EXP 2 shows the other means.
> 2. If a table is already created, is it possible to do bulk inserts via
> multiple txns inserting into the same table (EXP 3)?
Yes, but you will need some code via client or function that batches the
inserts for you.
>
> Best,
> -SB
--
Adrian Klaver
adrian.klaver@aklaver.com
Does this appeal to you:
On 2019-08-10 21:01:50 -0600, Rob Sargent wrote:
> On Aug 10, 2019, at 8:47 PM, Souvik Bhattacherjee <kivuosb@gmail.com>
> wrote:
>
> Hi Adrian,
>
> Thanks for the response.
>
> > Yes, but you will need some code via client or function that batches the
> > inserts for you.
>
> Could you please elaborate a bit on how EXP 1 could be performed such that
> it uses bulk inserts?
>
> Best,
> -SB
>
> On Fri, Aug 9, 2019 at 7:26 PM Adrian Klaver <adrian.klaver@aklaver.com>
> wrote:
>
[70 lines of full quote removed]

> Top-posting (i.e. putting your reply at the top) is discouraged here.

He didn't really top-post. He quoted the relevant part of Adrian's
posting and then wrote his reply below that. This is the style I prefer,
because it makes it really clear what one is replying to.

After his reply, he quoted Adrian's posting again, this time completely.
I think this is unnecessary and confusing (you apparently didn't even
see that he quoted something above his reply). But it's not as bad as
quoting everything below the answer (or - as you did - quoting
everything before the answer, which I think is even worse: If I don't see
any original content within the first 100 lines or so I usually skip the
rest).

hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
Sorry. I thought I had cut most of the redundancy.

> On Aug 11, 2019, at 2:26 AM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
>
>> On 2019-08-10 21:01:50 -0600, Rob Sargent wrote:
>> On Aug 10, 2019, at 8:47 PM, Souvik Bhattacherjee <kivuosb@gmail.com>
>> wrote:
>>
>> Hi Adrian,
>>
>> Thanks for the response.
>>
>>> Yes, but you will need some code via client or function that batches the
>>> inserts for you.
>>
>> Could you please elaborate a bit on how EXP 1 could be performed such that
>> it uses bulk inserts?
>>
>> Best,
>> -SB
>>
>> On Fri, Aug 9, 2019 at 7:26 PM Adrian Klaver <adrian.klaver@aklaver.com>
>> wrote:
>>
> [70 lines of full quote removed]
>
>> Top-posting (i.e. putting your reply at the top) is discouraged here.
>
> He didn't really top-post. He quoted the relevant part of Adrian's
> posting and then wrote his reply below that. This is the style I prefer,
> because it makes it really clear what one is replying to.
>
> After his reply, he quoted Adrian's posting again, this time completely.
> I think this is unnecessary and confusing (you apparently didn't even
> see that he quoted something above his reply). But it's not as bad as
> quoting everything below the answer (or - as you did - quoting
> everything before the answer, which I think is even worse: If I don't see
> any original content within the first 100 lines or so I usually skip the
> rest).
>
> hp
>
> --
>    _  | Peter J. Holzer    | we build much bigger, better disasters now
> |_|_) |                    | because we have much more sophisticated
> | |   | hjp@hjp.at         | management tools.
> __/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
On 8/10/19 7:47 PM, Souvik Bhattacherjee wrote:
> Hi Adrian,
>
> Thanks for the response.
>
> > Yes, but you will need some code via client or function that batches the
> > inserts for you.
>
> Could you please elaborate a bit on how EXP 1 could be performed such
> that it uses bulk inserts?

I guess it comes down to what you define as bulk inserts. From your OP:

EXP 1: inserts with multiple txn:
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
attr2 = 10);
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
attr2 = 20);

If the selects are returning more than one row then you are already
doing bulk inserts. If they are returning single rows or you want to
batch them then you need some sort of code to do that. Something
like (pseudo Python-like code):

attr2_vals = [(10, 20, 30, 40), (50, 60, 70, 80)]

for val_batch in attr2_vals:
    BEGIN
    for id in val_batch:
        insert into tab2 (attr1, attr2) (select attr1, attr2
        from tab1 where attr2 = id)
    COMMIT

> Best,
> -SB

--
Adrian Klaver
adrian.klaver@aklaver.com
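(For readers who want something runnable, a minimal sketch of the client-side batching idea in the pseudo code above, assuming psycopg2 and an illustrative database name testdb; each batch of attr2 values gets its own transaction:)

import psycopg2

attr2_vals = [(10, 20, 30, 40), (50, 60, 70, 80)]

conn = psycopg2.connect("dbname=testdb")   # illustrative connection string

with conn.cursor() as cur:
    for val_batch in attr2_vals:
        # psycopg2 opens a transaction implicitly on the first execute()
        for val in val_batch:
            cur.execute(
                "insert into tab2 (attr1, attr2) "
                "select attr1, attr2 from tab1 where attr2 = %s",
                (val,),
            )
        conn.commit()   # one commit per batch of values

conn.close()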
> If the selects are returning more than one row then you are already
> doing bulk inserts. If they are returning single rows or you want to
> batch them then you need some sort of code to do that. Something
> like (pseudo Python-like code):
>
> attr2_vals = [(10, 20, 30, 40), (50, 60, 70, 80)]
>
> for val_batch in attr2_vals:
>     BEGIN
>     for id in val_batch:
>         insert into tab2 (attr1, attr2) (select attr1, attr2
>         from tab1 where attr2 = id)
>     COMMIT
For EXP 1: inserts with multiple txn:
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where attr2 = 10);
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where attr2 = 20);
tab1 has ~6M rows and there are only two values for the attribute attr2 in
tab1, which are evenly distributed. So, yes, I guess I'm already doing
batching here.
Also, I ran the following two statements to see if their performances are comparable.
STMT 1 always runs faster on my machine, but their performances differ
by a couple of seconds at most.
STMT 1: select attr1, attr2 into tab2 from tab1;
STMT 2: insert into tab2 (select attr1, attr2 from tab1);
However, adding the serial id column as an ALTER TABLE statement actually takes more time
than inserting the tuples, so the combined total time is more than double the time taken to insert
the tuples into tab2 without the serial id column.
Best,
-SB
On Sun, Aug 11, 2019 at 11:11 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 8/10/19 7:47 PM, Souvik Bhattacherjee wrote:
> Hi Adrian,
>
> Thanks for the response.
>
> > Yes, but you will need some code via client or function that batches the
> > inserts for you.
>
> Could you please elaborate a bit on how EXP 1 could be performed such
> that it uses bulk inserts?
I guess it comes down to what you define as bulk inserts. From your OP:
EXP 1: inserts with multiple txn:
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
attr2 = 10);
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
attr2 = 20);
If the selects are returning more than one row then you are already
doing bulk inserts. If they are returning single rows or you want to
batch them then you need some sort of code to do that. Something
like (pseudo Python-like code):
attr2_vals = [(10, 20, 30, 40), (50, 60, 70, 80)]

for val_batch in attr2_vals:
    BEGIN
    for id in val_batch:
        insert into tab2 (attr1, attr2) (select attr1, attr2
        from tab1 where attr2 = id)
    COMMIT
>
> Best,
> -SB
--
Adrian Klaver
adrian.klaver@aklaver.com
On 8/13/19 6:34 AM, Souvik Bhattacherjee wrote:
> > If the selects are returning more than one row then you are already
> > doing bulk inserts. If they are returning single rows or you want to
> > batch them then you need some sort of code to do that. Something
> > like (pseudo Python-like code):
>
> > attr2_vals = [(10, 20, 30, 40), (50, 60, 70, 80)]
>
> > for val_batch in attr2_vals:
> >     BEGIN
> >     for id in val_batch:
> >         insert into tab2 (attr1, attr2) (select attr1, attr2
> >         from tab1 where attr2 = id)
> >     COMMIT
>
> For *EXP 1: inserts with multiple txn:*
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
> attr2 = 10);
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
> attr2 = 20);
>
> tab1 has ~6M rows and there are only two values for the attribute attr2 in
> tab1, which are evenly distributed. So, yes, I guess I'm already doing
> batching here.
>
> Also, I ran the following two statements to see if their performances
> are comparable. STMT 1 always runs faster on my machine, but their
> performances differ by a couple of seconds at most.
>
> STMT 1: select attr1, attr2 into tab2 from tab1;
> STMT 2: insert into tab2 (select attr1, attr2 from tab1);

All I have left is:

select row_number() OVER (order by attr2, attr1) AS id, attr1,
attr2 into tab2 from tab1;

That will not create a serial type in the id column though. You can
attach a sequence to that column. Something like:

1) create sequence tab2_id start <max id + 1> owned by tab2.id;
2) alter table tab2 alter COLUMN id set default nextval('tab2_id');

> However, adding the serial id column as an ALTER TABLE statement actually
> takes more time than inserting the tuples, so the combined total time is
> more than double the time taken to insert the tuples into tab2 without
> the serial id column.
>
> Best,
> -SB

--
Adrian Klaver
adrian.klaver@aklaver.com
> All I have left is:
>
> select row_number() OVER (order by attr2, attr1) AS id, attr1,
> attr2 into tab2 from tab1;
>
> That will not create a serial type in the id column though. You can
> attach a sequence to that column. Something like:
>
> 1) create sequence tab2_id start <max id + 1> owned by tab2.id;
> 2) alter table tab2 alter COLUMN id set default nextval('tab2_id');
Thanks. This is a bit indirect but works fine. Performance-wise this turns
out to be the best when inserting rows from one table to another (new) table
with a serial id column in the new table.
Best,
-SB
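(For reference, the whole approach can also be scripted; a rough sketch, again assuming psycopg2 and an illustrative database name testdb, and using setval() as one way to fill in the "<max id + 1>" placeholder from the suggestion above:)

import psycopg2

conn = psycopg2.connect("dbname=testdb")   # illustrative connection string

with conn, conn.cursor() as cur:
    # Bulk-load tab2 with a precomputed id column ...
    cur.execute(
        "select row_number() over (order by attr2, attr1) as id, attr1, attr2 "
        "into tab2 from tab1"
    )
    # ... then attach a sequence so later inserts keep generating ids.
    cur.execute("create sequence tab2_id owned by tab2.id")
    cur.execute("select setval('tab2_id', (select max(id) from tab2))")
    cur.execute("alter table tab2 alter column id set default nextval('tab2_id')")
    # add a primary key on id separately if the original schema is wanted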
On Tue, Aug 13, 2019 at 11:08 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 8/13/19 6:34 AM, Souvik Bhattacherjee wrote:
> > If the selects are returning more than one row then you are already
> > doing bulk inserts. If they are returning single rows or you want to
> > batch them then you need some sort of code to do that. Something
> > like (pseudo Python-like code):
>
> > attr2_vals = [(10, 20, 30, 40), (50, 60, 70, 80)]
>
> > for val_batch in attr2_vals:
> >     BEGIN
> >     for id in val_batch:
> >         insert into tab2 (attr1, attr2) (select attr1, attr2
> >         from tab1 where attr2 = id)
> >     COMMIT
>
> For *EXP 1: inserts with multiple txn:*
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
> attr2 = 10);
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
> attr2 = 20);
>
> tab1 has ~6M rows and there are only two values for the attribute attr2 in
> tab1, which are evenly distributed. So, yes, I guess I'm already doing
> batching here.
>
> Also, I ran the following two statements to see if their performances
> are comparable.
> STMT 1 always runs faster on my machine, but their performances
> differ by a couple of seconds at most.
>
> STMT 1: select attr1, attr2 into tab2 from tab1;
> STMT 2: insert into tab2 (select attr1, attr2 from tab1);
All I have left is:
select row_number() OVER (order by attr2, attr1) AS id, attr1,
attr2 into tab2 from tab1;
That will not create a serial type in the id column though. You can
attach a sequence to that column. Something like:
1) create sequence tab2_id start <max id + 1> owned by tab2.id;
2) alter table tab2 alter COLUMN id set default nextval('tab2_id');
>
> However, adding the serial id column as an ALTER TABLE statement actually
> takes more time than inserting the tuples, so the combined total time is
> more than double the time taken to insert the tuples into tab2 without
> the serial id column.
>
> Best,
> -SB
>
>
>
--
Adrian Klaver
adrian.klaver@aklaver.com