Thread: INSERTing lots of data
Greetings,

my Python application (http://perfbase.tigris.org) repeatedly needs to insert lots of data into an existing, non-empty, potentially large table. Currently, the bottleneck is with the Python application, so I intend to multi-thread it. Each thread should work on a part of the input file.

I already multi-threaded the query part of the application, which requires one connection per thread - cursors are serialized via a single connection.

Provided that
- the threads use their own connection
- the threads perform all INSERTs within a single transaction
- the machine has enough resources

will I get a speedup? Or will table-locking serialize things on the server side?

Suggestions for alternatives are welcome, but the data must go through the Python application via INSERTs (no bulk insert, COPY etc. possible).

thanks, Joachim
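P.S. For concreteness, a rough sketch of the pattern I have in mind - psycopg2 is assumed as the driver, and the table and column names are invented for illustration:

import threading

import psycopg2  # assumed DB-API driver

DSN = "dbname=perfbase"  # hypothetical connection string

def insert_worker(lines):
    # one private connection per thread; cursors on a shared
    # connection would be serialized by the driver
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    for line in lines:
        run_id, value = line.split()  # stand-in for the real parser
        cur.execute("INSERT INTO samples (run_id, value) VALUES (%s, %s)",
                    (int(run_id), float(value)))
    conn.commit()  # all INSERTs of this thread form one transaction
    conn.close()

with open('input.dat') as f:
    lines = f.readlines()
nthreads = 4
# split the lines round-robin across the threads
threads = [threading.Thread(target=insert_worker, args=(lines[i::nthreads],))
           for i in range(nthreads)]
for t in threads:
    t.start()
for t in threads:
    t.join()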
2010/5/28 Joachim Worringen <joachim.worringen@iathh.de>:
> Greetings,
>
> my Python application (http://perfbase.tigris.org) repeatedly needs to insert lots of data into an existing, non-empty, potentially large table. Currently, the bottleneck is with the Python application, so I intend to multi-thread it. Each thread should work on a part of the input file.
>
> I already multi-threaded the query part of the application, which requires one connection per thread - cursors are serialized via a single connection.
>
> Provided that
> - the threads use their own connection
> - the threads perform all INSERTs within a single transaction
> - the machine has enough resources
>
> will I get a speedup? Or will table-locking serialize things on the server side?
>
> Suggestions for alternatives are welcome, but the data must go through the Python application via INSERTs (no bulk insert, COPY etc. possible).
Remember that some Python implementations have a GIL (Global Interpreter Lock), so those threads could be serialized at the Python level.

It is possible that those inserts will be faster. The speed depends on the table structure, constraints and triggers, and even the database configuration. The best answer is: just check it with some test code - write a simple multithreaded application, try the inserts, and measure the result.
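For example, a minimal timing harness - a sketch that assumes an insert_worker(lines) function like the one in the postscript above:

import time
import threading

def timed_run(nthreads, lines):
    # split the lines round-robin across nthreads threads and
    # measure the wall-clock time of the whole run
    chunks = [lines[i::nthreads] for i in range(nthreads)]
    start = time.time()
    threads = [threading.Thread(target=insert_worker, args=(c,))
               for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("%d threads: %.1f s" % (nthreads, time.time() - start))

# compare e.g. 1 thread against 4 on the same input:
# timed_run(1, lines)
# timed_run(4, lines)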
regards
Szymon Guz
On 05/28/2010 11:48 AM, Szymon Guz wrote:
> Remember that some Python implementations have a GIL, so those threads could be serialized at the Python level.

My multi-threaded queries scale nicely with Python 2.6 on Linux, so this is not an issue here. But the queries do not perform concurrent write accesses on the same table.

> It is possible that those inserts will be faster. The speed depends on the table structure, constraints and triggers, and even the database configuration. The best answer is: just check it with some test code - write a simple multithreaded application, try the inserts, and measure the result.

Sure, testing always shows something, but I wonder if something general can be said about the execution of concurrent write transactions on the same table (no triggers, some non-NULL constraints, one index).

http://www.postgresql.org/docs/8.4/interactive/mvcc-intro.html says about MVCC:

"The main advantage of using the MVCC model of concurrency control rather than locking is that in MVCC locks acquired for querying (reading) data do not conflict with locks acquired for writing data, and so reading never blocks writing and writing never blocks reading."

It does not mention whether writing may block writing, or if it always does. http://bytes.com/topic/python/answers/728130-parallel-insert-postgresql-thread indicates it should not block - can this be confirmed by some PostgreSQL guru?

thanks, Joachim
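P.S. One way to check this empirically - a sketch assuming psycopg2 and a throwaway table t(x int); both transactions are left open at the same time, and both INSERTs return immediately, because plain INSERTs take only locks that do not conflict with each other:

import psycopg2  # assumed driver

a = psycopg2.connect("dbname=test")  # hypothetical DSN
b = psycopg2.connect("dbname=test")
ca, cb = a.cursor(), b.cursor()

ca.execute("INSERT INTO t (x) VALUES (1)")  # transaction A still open
cb.execute("INSERT INTO t (x) VALUES (2)")  # returns at once: writers
                                            # do not block writers here
a.commit()
b.commit()

(A unique constraint would be the exception: a second INSERT of the same key waits until the first transaction commits or aborts.)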
Good Afternoon Szymon!
Could you explain what a Python GIL is? And is there any workaround to the Python GIL we can implement to achieve better performance, possibly at the database level?
Kind regards
Martin Gainty
On 28 May 2010, at 12:14, Martin Gainty wrote:
> Good Afternoon Szymon!
>
> Could you explain what a Python GIL is? And is there any workaround to the Python GIL we can implement to achieve better performance, possibly at the database level?

See here: http://en.wikipedia.org/wiki/Global_Interpreter_Lock

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.
On 28/05/10 17:41, Joachim Worringen wrote:
> my Python application (http://perfbase.tigris.org) repeatedly needs to insert lots of data into an existing, non-empty, potentially large table. Currently, the bottleneck is with the Python application, so I intend to multi-thread it.

That may not be a great idea. For why, search for "Global Interpreter Lock" (GIL).

It might help if Python is mostly blocked on network I/O, as the GIL is released when Python blocks on the network, but still, your results may not be great.

> will I get a speedup? Or will table-locking serialize things on the server side?

Concurrent inserts work *great* with PostgreSQL; it's Python I'd be worried about.

--
Craig Ringer
On 05/28/2010 02:55 PM, Craig Ringer wrote:
> That may not be a great idea. For why, search for "Global Interpreter Lock" (GIL).
>
> It might help if Python is mostly blocked on network I/O, as the GIL is released when Python blocks on the network, but still, your results may not be great.

I verified that the threads actually execute queries concurrently. That does imply that they are blocked on I/O while the query is running, and that the query performance does in fact scale for this reason.

In the "import data" case, however, I really need concurrent processing on the CPU in the first place, so you may be right on this one. I'll check it.

> Concurrent inserts work *great* with PostgreSQL; it's Python I'd be worried about.

That's the part of the answer I wanted to hear...

thanks, Joachim
On 28/05/2010 9:17 PM, Joachim Worringen wrote:
> In the "import data" case, however, I really need concurrent processing on the CPU in the first place, so you may be right on this one. I'll check it.

If you run into GIL problems, you can always work around them by spawning multiple processes and communicating between them via pipes, signals, shm, etc. That way you get one GIL per Python instance.

BTW, if you need decent CPU performance with Python, make sure you're using the Psyco specializing compiler module for Python. Activating it requires literally four lines at the start of your main module, and the speed improvement is mind-blowing.

--
Craig Ringer
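(For reference, the usual Psyco activation idiom - the "four lines" mentioned above; psyco.full() is Psyco's documented entry point, but whether it helps this particular workload is untested:)

try:
    import psyco  # specializing compiler for CPython 2.x, x86 only
    psyco.full()  # compile functions on their first invocation
except ImportError:
    pass          # fall back to the plain interpreter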
Joachim Worringen wrote:
> my Python application (http://perfbase.tigris.org) repeatedly needs to insert lots of data into an existing, non-empty, potentially large table. Currently, the bottleneck is with the Python application, so I intend to multi-thread it. Each thread should work on a part of the input file.

You are wandering down a path followed by pgloader at one point: http://pgloader.projects.postgresql.org/#toc6 and one that I fought with briefly as well. Simple multi-threading can be of minimal help in scaling up insert performance here, due to the Python issues involved with the GIL. Maybe we can get Dimitri to chime in here; he did more of this than I did.

Two thoughts. First, build a test performance case assuming it will fail to scale upwards, looking for problems. If you get lucky, great, but don't assume this will work - it's proven more difficult than is obvious for others in the past.

Second, if you do end up being throttled by the GIL, you can probably build a solution for Python 2.6/3.0 using the multiprocessing module for your use case: http://docs.python.org/library/multiprocessing.html

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
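A minimal sketch of that multiprocessing approach - psycopg2 and a table samples(run_id, value) are assumed for illustration; each worker process gets its own interpreter, its own GIL, and its own connection:

import multiprocessing

import psycopg2  # assumed DB-API driver

DSN = "dbname=perfbase"  # hypothetical connection string

def insert_chunk(lines):
    # runs in a separate process: private GIL, private connection,
    # all INSERTs of this chunk in one transaction
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    for line in lines:
        run_id, value = line.split()  # stand-in for the real parser
        cur.execute("INSERT INTO samples (run_id, value) VALUES (%s, %s)",
                    (int(run_id), float(value)))
    conn.commit()
    conn.close()
    return len(lines)

if __name__ == '__main__':
    with open('input.dat') as f:
        lines = f.readlines()
    nworkers = 4
    # split the lines round-robin across the worker processes
    chunks = [lines[i::nworkers] for i in range(nworkers)]
    pool = multiprocessing.Pool(nworkers)
    total = sum(pool.map(insert_chunk, chunks))
    pool.close()
    pool.join()
    print("%d rows inserted" % total)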
On 06/01/2010 05:45 AM, Greg Smith wrote:
> Two thoughts. First, build a test performance case assuming it will fail to scale upwards, looking for problems. If you get lucky, great, but don't assume this will work - it's proven more difficult than is obvious for others in the past.
>
> Second, if you do end up being throttled by the GIL, you can probably build a solution for Python 2.6/3.0 using the multiprocessing module for your use case: http://docs.python.org/library/multiprocessing.html

Thanks, Greg - multiprocessing looks very usable for my application. Much more than using fork() and pipes myself...

Joachim
Greg Smith <greg@2ndquadrant.com> writes:
> You are wandering down a path followed by pgloader at one point: http://pgloader.projects.postgresql.org/#toc6 and one that I fought with briefly as well. Simple multi-threading can be of minimal help in scaling up insert performance here, due to the Python issues involved with the GIL. Maybe we can get Dimitri to chime in here; he did more of this than I did.

In my case pgloader is using COPY and not INSERT, which should mean that while one Python thread is blocked on network I/O the others have a chance to use the CPU. That should be a case where the GIL works OK. My tests show that it's not.

> Second, if you do end up being throttled by the GIL, you can probably build a solution for Python 2.6/3.0 using the multiprocessing module for your use case: http://docs.python.org/library/multiprocessing.html

My plan was to go with http://docs.python.org/library/subprocess.html but it seems multiprocessing is easier to use when you want to port existing threaded code.

Thanks Greg!

--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support