Thread: python module pre-import to avoid importing each time
Hey List,
I use plpython with postgis and 2 python modules (numpy and shapely).
Sadly, importing such a module in a plpython function is very slow
(several hundred milliseconds).
I also don't know if this overhead is applied each time the function is
called in the same session.
Is there a way to pre-import those modules once and for all,
so that the python functions run faster?
Thanks,
Cheers,
Rémi-C
On Thu, Jun 19, 2014 at 7:50 AM, Rémi Cura <remi.cura@gmail.com> wrote:
> Hey List,
>
> I use plpython with postgis and 2 python modules (numpy and shapely).
> Sadly, importing such a module in a plpython function is very slow
> (several hundred milliseconds).

Is that mostly shapely (which I don't have)? numpy seems to be pretty
fast, like 16ms. But that is still slow for what you want, perhaps.

> I also don't know if this overhead is applied each time the function is
> called in the same session.

It is not. The overhead is once per connection, not once per call.
So using a connection pooler could really be a help here.

> Is there a way to pre-import those modules once and for all,
> so that the python functions run faster?

I don't think there is. With plperl you can do this by loading the
module in plperl.on_init and by putting plperl into
shared_preload_libraries, so that this happens just at server start up.
But I don't see a way to do something analogous for plpython, due to the
lack of a plpython.on_init. I think that is because the infrastructure
to do that is part of making a "trusted" version of the language, which
python doesn't have. (But it could just be that no one has ever gotten
around to adding it.)

Cheers,
Jeff
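For reference, the plperl mechanism Jeff describes amounts to two lines
in postgresql.conf. This is only a sketch for comparison (the preloaded
Perl module is a made-up example, and plpython has no equivalent
setting):

    # load the plperl library itself at server start
    shared_preload_libraries = 'plperl'
    # Perl code run once, when the plperl interpreter is initialized
    plperl.on_init = 'use Geo::WKT;'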
Hey,
thanks for your answer!
Yep, you are right: the functions I would like to test are going to be
called a lot (100k times), so even 15 ms per call matters.
Cheers,
Rémi-C
On 06/26/2014 02:14 AM, Rémi Cura wrote:
> Hey,
> thanks for your answer!
>
> Yep, you are right: the functions I would like to test are going to be
> called a lot (100k times), so even 15 ms per call matters.
>
> I'm still a bit confused by a topic I found here:
> http://stackoverflow.com/questions/15023080/how-are-import-statements-in-plpython-handled
>
> The answer gives a trick to avoid importing each time, so somehow it
> must be useful.

Peter's answer is based on using the global dictionary SD to store an
imported library. For more information see here:

http://www.postgresql.org/docs/9.3/interactive/plpython-sharing.html

> On another internet page (can't find it anymore) somebody mentioned this
> module loading at server startup, one way or another, but gave no
> precision. It seems that the "plpy" python module gets loaded by default;
> wouldn't it be possible to hack this module to add other imports inside it?

In a sense that is what is being suggested above.

> I also use PL/R (untrusted I guess) and you can create a special table
> to indicate which modules to load at startup.
>
> Cheers,
> Rémi-C

--
Adrian Klaver
adrian.klaver@aklaver.com
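To make the SD trick concrete, here is a minimal sketch following the
plpython-sharing pattern linked above. The function name and the
geometry operation are made up for illustration:

CREATE OR REPLACE FUNCTION buffered_area(geom_wkb bytea) RETURNS float8 AS $$
    # Fetch the cached module from the SD dictionary; fall back to a
    # real import only on the first call in this session.
    if 'wkb' in SD:
        wkb = SD['wkb']
    else:
        from shapely import wkb
        SD['wkb'] = wkb
    # Example operation: buffer the geometry and return its area.
    return wkb.loads(geom_wkb).buffer(1.0).area
$$ LANGUAGE plpythonu;

Called as usual from SQL, e.g.:

SELECT buffered_area(ST_AsBinary(geom)) FROM my_big_table;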
On Thu, Jun 26, 2014 at 2:14 AM, Rémi Cura <remi.cura@gmail.com> wrote:
> Hey,
> thanks for your answer!
>
> Yep, you are right: the functions I would like to test are going to be
> called a lot (100k times), so even 15 ms per call matters.
>
> I'm still a bit confused by a topic I found here:
> http://stackoverflow.com/questions/15023080/how-are-import-statements-in-plpython-handled
>
> The answer gives a trick to avoid importing each time, so somehow it
> must be useful.

I'd want to see the benchmark before deciding how useful it actually
is... Anyway, that seems to be about calling import over and over
within the same connection, not between different connections, as is
your issue.

Also, I think that that suggestion is targeted at removing what is
already a very minor overhead, which is importing the symbols from the
module into the importer's namespace (or however you translate that
into python speak). The slow part is loading the module in the first
place (finding the shared objects, parsing the module's source code,
gluing them together, etc.), not importing the python symbols. If you
arrange to re-use connections, you will probably find no further
optimization is needed.

> On another internet page (can't find it anymore) somebody mentioned this
> module loading at server startup, one way or another, but gave no
> precision. It seems that the "plpy" python module gets loaded by default;
> wouldn't it be possible to hack this module to add other imports inside it?

I just thought your question looked lonely and that I'd tell you what I
learned about plperl in case it helped. There may be a way to do about
the same thing in plpython, but if so it doesn't seem to be documented,
or analogous to the way plperl does it. I'm afraid that exhausts my
knowledge of plpython. I don't see any files that suggest there is a
user-editable plpy.py module. If you are willing to monkey around with
C and recompiling, you could probably make it happen somehow, though.

Cheers,
Jeff
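A quick way to see Jeff's distinction outside PostgreSQL is to time a
first and a repeated import in a plain Python session (timings are
illustrative and machine-dependent):

import time

t0 = time.time()
import numpy            # first import: find, parse and initialize the module
print("first import:  %.1f ms" % ((time.time() - t0) * 1000))

t0 = time.time()
import numpy            # repeat import: just a sys.modules lookup, near-free
print("repeat import: %.3f ms" % ((time.time() - t0) * 1000))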
Adrian Klaver <adrian.klaver@aklaver.com> writes:
> On 06/26/2014 02:14 AM, Rémi Cura wrote:
>> On another internet page (can't find it anymore) somebody mentioned this
>> module loading at server startup, one way or another, but gave no
>> precision. It seems that the "plpy" python module gets loaded by default;
>> wouldn't it be possible to hack this module to add other imports inside it?

> In a sense that is what is being suggested above.

IIRC, plperl has a GUC you can set to tell it to do things at the time
it's loaded (which of course you use in combination with having listed
plperl in shared_preload_libraries). There's no reason except lack of
round tuits why plpython couldn't have a similar feature.

            regards, tom lane
On 06/26/2014 02:14 AM, Rémi Cura wrote:
> Hey,
> thanks for your answer!
>
> Yep, you are right: the functions I would like to test are going to be
> called a lot (100k times), so even 15 ms per call matters.

I got to thinking about this.

100K over what time frame?

How is it being called?

--
Adrian Klaver
adrian.klaver@aklaver.com
Hey,
thanks, now we have good information: unlike plperl or PL/R, there is no
easy way to preload packages in plpython.
There may be some solutions to make the import happen at connection
start, but it would involve C modifications (I found no trace of a
python file or hackable sql script in the postgres source and install
directories).
After that, further optimization is possible by avoiding the useless
'import' (because the module is already loaded) (see the trick here).
My use case is simple geometry manipulation functions. It is easier to
use plpython rather than plpgsql thanks to numpy for vector
manipulation. Usually the functions are called inside a complex query
with many CTEs and executed over 100k+ rows. Total execution time is on
the order of minutes. (Example query at the end.)
Thanks everybody,
Rémi
Example query:

CREATE TABLE holding_result AS
WITH the_geom AS (
    SELECT gid, geom
    FROM my_big_table -- 200k rows
)
SELECT gid, my_python_function(geom) AS result
FROM the_geom;
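For illustration, my_python_function could look something like this,
combining numpy with the SD import cache shown earlier. This is only a
sketch: the body is hypothetical (it assumes linestring input, and that
shapely can read the hex-encoded (E)WKB text that plpython receives for
a geometry argument):

CREATE OR REPLACE FUNCTION my_python_function(geom geometry) RETURNS float8 AS $$
    # Cache the modules in SD so their import cost is paid once per session.
    if 'modules' in SD:
        np, wkb = SD['modules']
    else:
        import numpy as np
        from shapely import wkb
        SD['modules'] = (np, wkb)
    # Hypothetical vector manipulation: mean segment length, via numpy.
    g = wkb.loads(geom, hex=True)
    coords = np.array(list(g.coords))
    segments = coords[1:] - coords[:-1]
    return float(np.sqrt((segments ** 2).sum(axis=1)).mean())
$$ LANGUAGE plpythonu;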