Re: How batch processing works - Mailing list pgsql-general

From Lok P
Subject Re: How batch processing works
Msg-id CAKna9VZA5x1g7-4j8cQrW1ByqX5jtPwY7sY2MHuRgrxrOu4LBg@mail.gmail.com
In response to How batch processing works  (Lok P <loknath.73@gmail.com>)
List pgsql-general


On Sat, Sep 21, 2024 at 8:07 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
On 2024-09-21 16:44:08 +0530, Lok P wrote:
> But wondering why we don't see any difference in performance between method-2
> and method-3 above.

The code runs completely inside the database. So there isn't much
difference between a single statement which inserts 50 rows and 50
statements which insert 1 row each. The work to be done is (almost) the
same.

This changes once you consider an application which runs outside of the
database (maybe even on a different host). Such an application has to
wait for the result of each statement before it can send the next one.
Now it makes a difference whether you are waiting 50 times for a
statement which does very little or just once for a statement which does
more work.
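The round-trip argument can be sketched with back-of-the-envelope arithmetic. The latency and per-row cost below are made-up illustrative numbers, not measurements from this thread:

```python
# Illustrative, assumed numbers -- not measurements.
rtt = 0.005           # assumed client<->server round-trip time per statement (s)
per_row = 0.0001      # assumed server-side work per inserted row (s)
rows = 10_000
batch_size = 50

# One statement per row: the client pays the round trip 10,000 times.
row_by_row = rows * (rtt + per_row)

# One 50-row statement per round trip: the client pays it only 200 times.
batched = (rows // batch_size) * (rtt + batch_size * per_row)

print(f"{row_by_row:.1f}s vs {batched:.1f}s")
```

With these assumed numbers, row-by-row spends about 51 seconds (almost all of it waiting on the network) while the batched version spends about 2 seconds, even though the server does the same total work in both cases.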

> So does it mean that I am testing this in a wrong way or

That depends on what you want to test. If you are interested in the
behaviour of stored procedures, the test is correct. If you want to know
about the performance of a database client (whether it's written in Java,
Python, Go or whatever), this is the wrong test. You have to write the
test in your target language and run it on the client system to get
realistic results (for example, the round-trip times will be a lot
shorter if the client and database are on the same computer than when
one is in Europe and the other in America).

For example, here are the three methods as Python scripts:

---------------------------------------------------------------------------------------------------
#!/usr/bin/python3

import time
import psycopg2

num_inserts = 10_000

db = psycopg2.connect()
csr = db.cursor()

csr.execute("drop table if exists parent_table")
csr.execute("create table parent_table (id int primary key, t text)")

start_time = time.monotonic()
for i in range(1, num_inserts+1):
    csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
    db.commit()
end_time = time.monotonic()
elapsed_time = end_time - start_time
print(f"Method 1: Individual Inserts with Commit after every Row: {elapsed_time:.3} seconds")

# vim: tw=99
---------------------------------------------------------------------------------------------------
#!/usr/bin/python3

import time
import psycopg2

num_inserts = 10_000
batch_size = 50

db = psycopg2.connect()
csr = db.cursor()

csr.execute("drop table if exists parent_table")
csr.execute("create table parent_table (id int primary key, t text)")
db.commit()

start_time = time.monotonic()
for i in range(1, num_inserts+1):
    csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
    if i % batch_size == 0:
        db.commit()
db.commit()
end_time = time.monotonic()
elapsed_time = end_time - start_time
print(f"Method 2: Individual Inserts with Commit after {batch_size} Rows: {elapsed_time:.3} seconds")

# vim: tw=99
---------------------------------------------------------------------------------------------------
#!/usr/bin/python3

import itertools
import time
import psycopg2

num_inserts = 10_000
batch_size = 50

db = psycopg2.connect()
csr = db.cursor()

csr.execute("drop table if exists parent_table")
csr.execute("create table parent_table (id int primary key, t text)")
db.commit()

start_time = time.monotonic()
batch = []
for i in range(1, num_inserts+1):
    batch.append((i, 'a'))
    if i % batch_size == 0:
        q = "insert into parent_table values" + ",".join(["(%s, %s)"] * len(batch))
        params = list(itertools.chain.from_iterable(batch))
        csr.execute(q, params)
        db.commit()
        batch = []
if batch:
    q = "insert into parent_table values" + ",".join(["(%s, %s)"] * len(batch))
    csr.execute(q, list(itertools.chain.from_iterable(batch)))
    db.commit()
    batch = []

end_time = time.monotonic()
elapsed_time = end_time - start_time
print(f"Method 3: Batch Inserts ({batch_size}) with Commit after each batch: {elapsed_time:.3} seconds")

# vim: tw=99
---------------------------------------------------------------------------------------------------
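As an aside, the third script builds the same multi-row statement in two places; that construction could be factored into a small helper. A minimal sketch (the function name is my own, not from the scripts above):

```python
def build_multirow_insert(batch):
    """Build a multi-row INSERT like the one in method 3.

    batch is a list of (id, text) tuples; returns the SQL string with one
    (%s, %s) placeholder group per row, plus the flattened parameter list
    that psycopg2 expects for those placeholders.
    """
    placeholders = ",".join(["(%s, %s)"] * len(batch))
    q = "insert into parent_table values" + placeholders
    params = [value for row in batch for value in row]
    return q, params

q, params = build_multirow_insert([(1, "a"), (2, "a")])
print(q)       # insert into parent_table values(%s, %s),(%s, %s)
print(params)  # [1, 'a', 2, 'a']
```

psycopg2 also ships `psycopg2.extras.execute_values`, which builds a similar multi-row statement internally from a plain list of tuples.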

On my laptop, method2 is about twice as fast as method3. But if I
connect to a database on the other side of the city, method2 is now more
than 16 times slower than method3, simply because the delay in
communication is now large compared to the time it takes to insert those
rows.


Thank you so much.
I was expecting method-3 (batch insert) to be the fastest, or at least, as you said, to perform at a similar speed to method-2 (row-by-row insert with a commit per batch) when run inside the database. Since method-3 prepares the insert and submits it to the database in one shot, in a single DB call, the context switching should be minimal, so it should be a bit faster. But from your figures it appears to be the opposite: method-2 is faster than method-3, and I am not able to understand the reason. In that case it seems we can follow method-2, as it is cheaper in terms of code change: we just shift the commit points, without any changes to perform the batch insert.

Btw, do you have any thoughts on why method-2 is faster than method-3 in your test?
