Thread: INSERTing lots of data
Greetings,

my Python application (http://perfbase.tigris.org) repeatedly needs to insert lots of data into an existing, non-empty, potentially large table. Currently, the bottleneck is with the Python application, so I intend to multi-thread it. Each thread should work on a part of the input file.

I already multi-threaded the query part of the application, which requires one connection per thread - cursors are serialized via a single connection.

Provided that
- the threads use their own connection
- the threads perform all INSERTs within a single transaction
- the machine has enough resources

will I get a speedup? Or will table-locking serialize things on the server side?

Suggestions for alternatives are welcome, but the data must go through the Python application via INSERTs (no bulk insert, COPY etc. possible).

thanks, Joachim
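P.S. For concreteness, a rough sketch of the pattern I have in mind - psycopg2 is assumed as the driver, and the table and column names are invented for illustration:

import threading

import psycopg2  # assumed DB-API driver

DSN = "dbname=perfbase"  # hypothetical connection string

def insert_worker(lines):
    # one private connection per thread; cursors on a shared
    # connection would be serialized by the driver
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    for line in lines:
        run_id, value = line.split()  # stand-in for the real parser
        cur.execute("INSERT INTO samples (run_id, value) VALUES (%s, %s)",
                    (int(run_id), float(value)))
    conn.commit()  # all INSERTs of this thread form one transaction
    conn.close()

with open('input.dat') as f:
    lines = f.readlines()
nthreads = 4
# split the lines round-robin across the threads
threads = [threading.Thread(target=insert_worker, args=(lines[i::nthreads],))
           for i in range(nthreads)]
for t in threads:
    t.start()
for t in threads:
    t.join()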
2010/5/28 Joachim Worringen <joachim.worringen@iathh.de>:
> Greetings,
>
> my Python application (http://perfbase.tigris.org) repeatedly needs to insert lots of data into an existing, non-empty, potentially large table. Currently, the bottleneck is with the Python application, so I intend to multi-thread it. Each thread should work on a part of the input file.
>
> I already multi-threaded the query part of the application, which requires one connection per thread - cursors are serialized via a single connection.
>
> Provided that
> - the threads use their own connection
> - the threads perform all INSERTs within a single transaction
> - the machine has enough resources
>
> will I get a speedup? Or will table-locking serialize things on the server side?
>
> Suggestions for alternatives are welcome, but the data must go through the Python application via INSERTs (no bulk insert, COPY etc. possible).
Remember that some Python implementations have a GIL (Global Interpreter Lock), so those threads could be serialized at the Python level.

It is possible that those inserts will be faster. The speed depends on the table structure, constraints and triggers, and even the database configuration. The best answer is: just check it with some test code - write a simple multithreaded application, try the inserts, and measure the result.
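For example, a minimal timing harness - a sketch that assumes an insert_worker(lines) function like the one in the postscript above:

import time
import threading

def timed_run(nthreads, lines):
    # split the lines round-robin across nthreads threads and
    # measure the wall-clock time of the whole run
    chunks = [lines[i::nthreads] for i in range(nthreads)]
    start = time.time()
    threads = [threading.Thread(target=insert_worker, args=(c,))
               for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("%d threads: %.1f s" % (nthreads, time.time() - start))

# compare e.g. 1 thread against 4 on the same input:
# timed_run(1, lines)
# timed_run(4, lines)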
regards
Szymon Guz
On 05/28/2010 11:48 AM, Szymon Guz wrote:
> Remember that some Python implementations have a GIL, so those threads could be serialized at the Python level.

My multi-threaded queries scale nicely with Python 2.6 on Linux, so this is not an issue here. But the queries do not perform concurrent write accesses on the same table.

> It is possible that those inserts will be faster. The speed depends on the table structure, constraints and triggers, and even the database configuration. The best answer is: just check it with some test code - write a simple multithreaded application, try the inserts, and measure the result.

Sure, testing always shows something, but I wonder if something general can be said about the execution of concurrent write transactions on the same table (no triggers, some non-NULL constraints, one index).

http://www.postgresql.org/docs/8.4/interactive/mvcc-intro.html says about MVCC:

"The main advantage of using the MVCC model of concurrency control rather than locking is that in MVCC locks acquired for querying (reading) data do not conflict with locks acquired for writing data, and so reading never blocks writing and writing never blocks reading."

It does not mention whether writing may block writing, or if it always does. http://bytes.com/topic/python/answers/728130-parallel-insert-postgresql-thread indicates it should not block - can this be confirmed by some PostgreSQL guru?

thanks, Joachim
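P.S. One way to check this empirically - a sketch assuming psycopg2 and a throwaway table t(x int); both transactions are left open at the same time, and both INSERTs return immediately, because plain INSERTs take only locks that do not conflict with each other:

import psycopg2  # assumed driver

a = psycopg2.connect("dbname=test")  # hypothetical DSN
b = psycopg2.connect("dbname=test")
ca, cb = a.cursor(), b.cursor()

ca.execute("INSERT INTO t (x) VALUES (1)")  # transaction A still open
cb.execute("INSERT INTO t (x) VALUES (2)")  # returns at once: writers
                                            # do not block writers here
a.commit()
b.commit()

(A unique constraint would be the exception: a second INSERT of the same key waits until the first transaction commits or aborts.)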
Good Afternoon Szymon!
Could you explain what a Python GIL is? And is there any workaround to the Python GIL we can implement to achieve better performance, possibly at the database level?
Kind regards
Martin Gainty
On 28 May 2010, at 12:14, Martin Gainty wrote:
> Good Afternoon Szymon!
>
> Could you explain what a Python GIL is? And is there any workaround to the Python GIL we can implement to achieve better performance, possibly at the database level?

See here: http://en.wikipedia.org/wiki/Global_Interpreter_Lock

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.
On 28/05/10 17:41, Joachim Worringen wrote:
> my Python application (http://perfbase.tigris.org) repeatedly needs to insert lots of data into an existing, non-empty, potentially large table. Currently, the bottleneck is with the Python application, so I intend to multi-thread it.

That may not be a great idea. For why, search for "Global Interpreter Lock" (GIL).

It might help if Python is mostly blocked on network I/O, as the GIL is released when Python blocks on the network, but still, your results may not be great.

> will I get a speedup? Or will table-locking serialize things on the server side?

Concurrent inserts work *great* with PostgreSQL; it's Python I'd be worried about.

--
Craig Ringer
On 05/28/2010 02:55 PM, Craig Ringer wrote:
> That may not be a great idea. For why, search for "Global Interpreter Lock" (GIL).
>
> It might help if Python is mostly blocked on network I/O, as the GIL is released when Python blocks on the network, but still, your results may not be great.

I verified that the threads actually execute queries concurrently. That does imply that they are blocked on I/O while the query is running, and that the query performance does in fact scale for this reason.

In the "import data" case, however, I really need concurrent processing on the CPU in the first place, so you may be right on this one. I'll check it.

> Concurrent inserts work *great* with PostgreSQL; it's Python I'd be worried about.

That's the part of the answer I wanted to hear...

thanks, Joachim
On 28/05/2010 9:17 PM, Joachim Worringen wrote:
> In the "import data" case, however, I really need concurrent processing on the CPU in the first place, so you may be right on this one. I'll check it.

If you run into GIL problems, you can always work around them by spawning multiple processes and communicating between them via pipes, signals, shm, etc. That way you get one GIL per Python instance.

BTW, if you need decent CPU performance with Python, make sure you're using the Psyco specializing compiler module for Python. Activating it requires literally four lines at the start of your main module, and the speed improvement is mind-blowing.

--
Craig Ringer
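(For reference, the usual Psyco activation idiom - the "four lines" mentioned above; psyco.full() is Psyco's documented entry point, but whether it helps this particular workload is untested:)

try:
    import psyco  # specializing compiler for CPython 2.x, x86 only
    psyco.full()  # compile functions on their first invocation
except ImportError:
    pass          # fall back to the plain interpreter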
Joachim Worringen wrote:
> my Python application (http://perfbase.tigris.org) repeatedly needs to insert lots of data into an existing, non-empty, potentially large table. Currently, the bottleneck is with the Python application, so I intend to multi-thread it. Each thread should work on a part of the input file.

You are wandering down a path followed by pgloader at one point: http://pgloader.projects.postgresql.org/#toc6 and one that I fought with briefly as well. Simple multi-threading can be of minimal help in scaling up insert performance here, due to the Python issues involved with the GIL. Maybe we can get Dimitri to chime in here; he did more of this than I did.

Two thoughts. First, build a test performance case assuming it will fail to scale upwards, looking for problems. If you get lucky, great, but don't assume this will work - it's proven more difficult than is obvious for others in the past.

Second, if you do end up being throttled by the GIL, you can probably build a solution for Python 2.6/3.0 using the multiprocessing module for your use case: http://docs.python.org/library/multiprocessing.html

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
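A minimal sketch of that multiprocessing approach - psycopg2 and a table samples(run_id, value) are assumed for illustration; each worker process gets its own interpreter, its own GIL, and its own connection:

import multiprocessing

import psycopg2  # assumed DB-API driver

DSN = "dbname=perfbase"  # hypothetical connection string

def insert_chunk(lines):
    # runs in a separate process: private GIL, private connection,
    # all INSERTs of this chunk in one transaction
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    for line in lines:
        run_id, value = line.split()  # stand-in for the real parser
        cur.execute("INSERT INTO samples (run_id, value) VALUES (%s, %s)",
                    (int(run_id), float(value)))
    conn.commit()
    conn.close()
    return len(lines)

if __name__ == '__main__':
    with open('input.dat') as f:
        lines = f.readlines()
    nworkers = 4
    # split the lines round-robin across the worker processes
    chunks = [lines[i::nworkers] for i in range(nworkers)]
    pool = multiprocessing.Pool(nworkers)
    total = sum(pool.map(insert_chunk, chunks))
    pool.close()
    pool.join()
    print("%d rows inserted" % total)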
On 06/01/2010 05:45 AM, Greg Smith wrote:
> Two thoughts. First, build a test performance case assuming it will fail to scale upwards, looking for problems. If you get lucky, great, but don't assume this will work - it's proven more difficult than is obvious for others in the past.
>
> Second, if you do end up being throttled by the GIL, you can probably build a solution for Python 2.6/3.0 using the multiprocessing module for your use case: http://docs.python.org/library/multiprocessing.html

Thanks, Greg - multiprocessing looks very usable for my application. Much more than using fork() and pipes myself...

Joachim
Greg Smith <greg@2ndquadrant.com> writes:
> You are wandering down a path followed by pgloader at one point: http://pgloader.projects.postgresql.org/#toc6 and one that I fought with briefly as well. Simple multi-threading can be of minimal help in scaling up insert performance here, due to the Python issues involved with the GIL. Maybe we can get Dimitri to chime in here; he did more of this than I did.

In my case pgloader is using COPY and not INSERT, which should mean that while one Python thread is blocked on network I/O the others have a chance to use the CPU. That should be a case where the GIL works OK. My tests show that it's not.

> Second, if you do end up being throttled by the GIL, you can probably build a solution for Python 2.6/3.0 using the multiprocessing module for your use case: http://docs.python.org/library/multiprocessing.html

My plan was to go with http://docs.python.org/library/subprocess.html but it seems multiprocessing is easier to use when you want to port existing threaded code.

Thanks Greg!

--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support