Thread: problem with lost connection while running long PL/R query

problem with lost connection while running long PL/R query

From: "David M. Kaplan"

Hi,

I have a query that uses a PL/R function to run a statistical model on
data in a postgresql table.  The query runs the function 4 times, each
of which generates about 2 million lines of results, producing a final
table of about 8 million lines.  Each time the function is called,
it takes about 3-5 minutes to finish its work.  I know that the function
is running fine, as it reports its progress, and I have had no trouble using
it on slightly smaller datasets. However, this time it does the 4 runs,
but ends with psql saying the connection to the server has been lost.
In the postgresql log file, I have:

2013-05-16 14:05:47 CEST LOG:  server process (PID 27488) was terminated by signal 9: Killed

2013-05-16 14:05:47 CEST LOG:  terminating any other active server processes

2013-05-16 14:05:47 CEST WARNING:  terminating connection because of crash of another server process

2013-05-16 14:05:47 CEST DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.


The process ID it mentions is the postgresql query process:

postgres 27488 87.1 26.1 2272848 2101804 ?     Rs   13:44   1:01 postgres: dkaplan fads [local] SELECT


Does anyone know what might be causing this problem and how to fix it?
Is something timing out?  Or is it a memory problem?  How would I go
about getting more diagnostic information?

Thanks for the help,
David


--
**********************************
David M. Kaplan
Charge de Recherche 1

Institut de Recherche pour le Developpement
Centre de Recherche Halieutique Mediterraneenne et Tropicale
av. Jean Monnet
B.P. 171
34203 Sete cedex
France

Phone: +33 (0)4 99 57 32 27
Fax: +33 (0)4 99 57 32 95

http://www.umr-eme.org/team/dkaplan/
http://www.amped.ird.fr/
**********************************



Re: problem with lost connection while running long PL/R query

From: Ryan Kelly

On Thu, May 16, 2013 at 02:47:28PM +0200, David M. Kaplan wrote:
> Hi,
>
> I have a query that uses a PL/R function to run a statistical model
> on data in a postgresql table.  The query runs the function 4 times,
> each of which generates about 2 million lines of results, generating
> a final table that has about 8 million lines.  Each time the
> function is called, it takes about ~3-5 minutes to finish its work.
> I know that the function is running fine as it gives me progress and
> I have had no trouble using it on slightly smaller datasets.
> However, this time it does the 4 runs, but ends with psql saying the
> connection to the server has been lost.  In the postgresql log file,
> I have:
>
> 2013-05-16 14:05:47 CEST LOG:  server process (PID 27488) was terminated by signal 9: Killed
This usually indicates the OOM killer has killed your backend. Please
see the documentation here:

http://www.postgresql.org/docs/9.2/interactive/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

-Ryan


Re: problem with lost connection while running long PL/R query

From: Tom Lane

"David M. Kaplan" <david.kaplan@ird.fr> writes:
> I have a query that uses a PL/R function to run a statistical model on
> data in a postgresql table.  The query runs the function 4 times, each
> of which generates about 2 million lines of results, generating a final
> table that has about 8 million lines.  Each time the function is called,
> it takes about ~3-5 minutes to finish its work. I know that the function
> is running fine as it gives me progress and I have had no trouble using
> it on slightly smaller datasets. However, this time it does the 4 runs,
> but ends with psql saying the connection to the server has been lost.
> In the postgresql log file, I have:

> 2013-05-16 14:05:47 CEST LOG:  server process (PID 27488) was terminated by signal 9: Killed

You've been hit by the dreaded OOM kill "feature" of Linux, as you can
confirm by looking into the kernel logs.  A lot of people just turn that
off, especially on servers.  Google for "memory overcommit" for more info.

            regards, tom lane


Re: problem with lost connection while running long PL/R query

From: "David M. Kaplan"

Hi,

Thanks for the help.  You have definitely identified the problem, but I
am still looking for a solution that works for me.  I tried setting
vm.overcommit_memory=2, but this just made the query crash quicker than
before, though without killing the entire connection to the database.  I
imagine that this means that I really am trying to use more memory than
the system can handle?

I am wondering if there is a way to tell postgresql to flush a set of
table lines out to disk so that the memory they are using can be
freed.  Basically, I am creating my table with 8 million lines in 4
chunks, and each chunk fits in memory.  As the 4 are inside one query, I
imagine that at some point postgresql tries to put the four together and
write them to disk as a postgresql table, and that is when things
break.  Is it possible to tell postgresql to write one "chunk" at a
time?  I imagine I could do this using PL/PGSQL and iteration, but
perhaps there is a postgresql configuration parameter that can help with
this?
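
For example, what I have in mind is something along these lines (table and
function names made up here), with one statement per chunk so that each
chunk is written out before the next call starts:

    -- hypothetical sketch, not my actual schema
    INSERT INTO model_results SELECT * FROM run_model('chunk_1');
    INSERT INTO model_results SELECT * FROM run_model('chunk_2');
    INSERT INTO model_results SELECT * FROM run_model('chunk_3');
    INSERT INTO model_results SELECT * FROM run_model('chunk_4');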

Thanks again,
David


On 05/16/2013 03:35 PM, Ryan Kelly wrote:
> On Thu, May 16, 2013 at 02:47:28PM +0200, David M. Kaplan wrote:
>> Hi,
>>
>> I have a query that uses a PL/R function to run a statistical model
>> on data in a postgresql table.  The query runs the function 4 times,
>> each of which generates about 2 million lines of results, generating
>> a final table that has about 8 million lines.  Each time the
>> function is called, it takes about ~3-5 minutes to finish its work.
>> I know that the function is running fine as it gives me progress and
>> I have had no trouble using it on slightly smaller datasets.
>> However, this time it does the 4 runs, but ends with psql saying the
>> connection to the server has been lost.  In the postgresql log file,
>> I have:
>>
>> 2013-05-16 14:05:47 CEST LOG:  server process (PID 27488) was terminated by signal 9: Killed
> This usually indicates the OOM killer has killed your backend. Please
> see the documentation here:
>
> http://www.postgresql.org/docs/9.2/interactive/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
>
> -Ryan
>
>




Re: problem with lost connection while running long PL/R query

From: Tom Lane

"David M. Kaplan" <david.kaplan@ird.fr> writes:
> Thanks for the help.  You have definitely identified the problem, but I
> am still looking for a solution that works for me.  I tried setting
> vm.overcommit_memory=2, but this just made the query crash quicker than
> before, though without killing the entire connection to the database.  I
> imagine that this means that I really am trying to use more memory than
> the system can handle?

> I am wondering if there is a way to tell postgresql to flush a set of
> table lines out to disk so that the memory they are using can be
> liberated.

Assuming you don't have work_mem set to something unreasonably large,
it seems likely that the excessive memory consumption is inside your
PL/R function, and not the fault of Postgres per se.  You might try
asking in some R-related forums about how to reduce the code's memory
usage.
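
A quick sanity check of the relevant server-side settings from psql:

    SHOW work_mem;
    SHOW shared_buffers;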

Also, if by "crash" this time you meant you got an "out of memory" error
from Postgres, there should be a memory map in the postmaster log
showing all the memory consumption Postgres itself is aware of.  If that
doesn't add up to a lot, it would be pretty solid proof that the problem
is inside R.  If there are any memory contexts that seem to have bloated
unreasonably, knowing which one(s) would be helpful information.

            regards, tom lane


Re: problem with lost connection while running long PL/R query

From: Joe Conway

On 05/16/2013 08:40 AM, Tom Lane wrote:
> "David M. Kaplan" <david.kaplan@ird.fr> writes:
>> Thanks for the help.  You have definitely identified the problem,
>> but I am still looking for a solution that works for me.  I tried
>> setting vm.overcommit_memory=2, but this just made the query
>> crash quicker than before, though without killing the entire
>> connection to the database.  I imagine that this means that I
>> really am trying to use more memory than the system can handle?
>
>> I am wondering if there is a way to tell postgresql to flush a
>> set of table lines out to disk so that the memory they are using
>> can be liberated.
>
> Assuming you don't have work_mem set to something unreasonably
> large, it seems likely that the excessive memory consumption is
> inside your PL/R function, and not the fault of Postgres per se.
> You might try asking in some R-related forums about how to reduce
> the code's memory usage.

The two "classic" approaches to this with PL/R are either create a
custom aggregate with the PL/R as the final function (i.e. work on one
group at a time) or use the SPI cursor functionality within the PL/R
function and page your data using a cursor. Not all forms of analysis
lend themselves to these approaches, but perhaps yours does.
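
For the aggregate route, a rough sketch (untested, made-up names) looks
something like this:

    -- accumulate one group's values into an array; the PL/R final function
    -- then runs the model on that single group
    CREATE OR REPLACE FUNCTION model_final(vals float8[]) RETURNS float8 AS $$
        # 'vals' holds one group's data; run the real fit/prediction here
        mean(vals, na.rm = TRUE)   # placeholder for the actual model
    $$ LANGUAGE plr;

    CREATE AGGREGATE model_agg(float8) (
        sfunc     = array_append,
        stype     = float8[],
        initcond  = '{}',
        finalfunc = model_final
    );

    -- usage: SELECT grp, model_agg(x) FROM obs GROUP BY grp;

The point is that R only ever sees one group's worth of rows at a time.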

Ultimately I would like to implement a form of R data.frame that does
the paging with a cursor transparently for you, but I have not been
able to find the time so far.

Joe


--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support


Re: problem with lost connection while running long PL/R query

From: "David M. Kaplan"

Hi Joe,

Thanks for responding, as you are clearly the expert on this sort of
problem.  My current function does page through the data using a cursor
precisely to avoid out-of-memory problems, which is why I am somewhat
surprised and stumped as to how this can be happening.  It does return
all the data at once, but a single call to the function works fine,
so I can't see why four calls wouldn't.  I am currently planning to do some test
runs using memory.profile() to see if each successive call to the PL/R
function is somehow accumulating memory usage somewhere.  Perhaps I am
not properly closing a query or something like that?

I am attaching my code.  Perhaps you will have some insight.  To give
you a basic idea of what I am trying to do, I have separately developed
a classification model for the state of a "system" based on the data in
the postgresql table.  I want to apply that model to each line of the
table.  I loop over the cursor and predict the state for batches of
10,000 lines at a time.
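
The general shape of it is roughly the following (this is only a sketch
with made-up names, not the attached code itself):

    CREATE OR REPLACE FUNCTION predict_classification_model()
    RETURNS SETOF record AS $$
        # sketch only -- the real function is in the attachment
        model <- readRDS("/path/to/classifier.rds")  # however the fitted model is loaded
        plan  <- pg.spi.prepare("SELECT * FROM observations")
        curs  <- pg.spi.cursor_open("obs_curs", plan)
        out   <- list()
        repeat {
            batch <- pg.spi.cursor_fetch(curs, TRUE, as.integer(10000))
            if (is.null(batch) || nrow(batch) == 0) break
            batch$state <- predict(model, newdata = batch)  # one 10,000-row batch
            out[[length(out) + 1]] <- batch
        }
        pg.spi.cursor_close(curs)
        do.call(rbind, out)  # everything is still returned at once at the end
    $$ LANGUAGE plr;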

Thanks again for the help.

Cheers,
David



On 05/16/2013 11:40 PM, Joe Conway wrote:
>
> On 05/16/2013 08:40 AM, Tom Lane wrote:
>> "David M. Kaplan" <david.kaplan@ird.fr> writes:
>>> Thanks for the help.  You have definitely identified the problem,
>>> but I am still looking for a solution that works for me.  I tried
>>> setting vm.overcommit_memory=2, but this just made the query
>>> crash quicker than before, though without killing the entire
>>> connection to the database.  I imagine that this means that I
>>> really am trying to use more memory than the system can handle?
>>> I am wondering if there is a way to tell postgresql to flush a
>>> set of table lines out to disk so that the memory they are using
>>> can be liberated.
>> Assuming you don't have work_mem set to something unreasonably
>> large, it seems likely that the excessive memory consumption is
>> inside your PL/R function, and not the fault of Postgres per se.
>> You might try asking in some R-related forums about how to reduce
>> the code's memory usage.
> The two "classic" approaches to this with PL/R are either create a
> custom aggregate with the PL/R as the final function (i.e. work on one
> group at a time) or use the SPI cursor functionality within the PL/R
> function and page your data using a cursor. Not all forms of analysis
> lend themselves to these approaches, but perhaps yours does.
>
> Ultimately I would like to implement a form of R data.frame that does
> the paging with a cursor transparently for you, but I have not been
> able to find the time so far.
>
> Joe



Attachment

Re: problem with lost connection while running long PL/R query

From: "David M. Kaplan"

Hi,

Of course today after a reboot to update to the newest kernel,
everything works without crashing...

I imagine that yesterday the problem was that I had forgotten about a
Windows virtual machine running on the server that was eating a good
chunk of memory.  Still, using a cursor to page through the data should
have avoided major memory difficulties...

As a footnote, I can confirm that PL/R was not eating up memory little
by little (see results of memory.profile() at beginning and end of each
function invocation below).
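
The profile lines below were emitted as NOTICEs from inside the function
body, roughly like this (a sketch, not the exact code):

    # at the start, and again at the end, of the PL/R body
    pg.thrownotice("Starting predict_classification_model")
    for (l in capture.output(memory.profile()))
        pg.thrownotice(l)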

Thanks,
David

Run 1:

NOTICE:  Starting predict_classification_model
NOTICE:        NULL      symbol    pairlist     closure environment     promise
NOTICE:           1        7847      173768        4308        1129        6273
NOTICE:    language     special     builtin        char     logical     integer
NOTICE:       46996         189        1839        9387        7351       17229
NOTICE:      double     complex   character         ...         any        list
NOTICE:        2190           1       42810           1           0       13456
NOTICE:  expression    bytecode externalptr     weakref         raw          S4
NOTICE:           3       10246         823         198         199        1061

NOTICE:  Ending predict_classification_model
NOTICE:        NULL      symbol    pairlist     closure environment     promise
NOTICE:           1        8062      181354        4360        1129        6273
NOTICE:    language     special     builtin        char     logical     integer
NOTICE:       50616         189        1839     2085318        7496       17972
NOTICE:      double     complex   character         ...         any        list
NOTICE:        2393           1       43459           1           0       13929
NOTICE:  expression    bytecode externalptr     weakref         raw          S4
NOTICE:           3       10683         814         187         188        1061

Run 2:

NOTICE:  Starting predict_classification_model
NOTICE:        NULL      symbol    pairlist     closure environment     promise
NOTICE:           1        8062      180936        4360        1128        6271
NOTICE:    language     special     builtin        char     logical     integer
NOTICE:       50536         189        1839        9684        7496       17878
NOTICE:      double     complex   character         ...         any        list
NOTICE:        2258           1       43312           1           0       13899
NOTICE:  expression    bytecode externalptr     weakref         raw          S4
NOTICE:           3       10683         813         188         189        1061

NOTICE:  Ending predict_classification_model
NOTICE:        NULL      symbol    pairlist     closure environment     promise
NOTICE:           1        8062      181356        4360        1129        6273
NOTICE:    language     special     builtin        char     logical     integer
NOTICE:       50617         189        1839     2085317        7496       17972
NOTICE:      double     complex   character         ...         any        list
NOTICE:        2393           1       43459           1           0       13929
NOTICE:  expression    bytecode externalptr     weakref         raw          S4
NOTICE:           3       10683         814         187         188        1061

Run 3:

NOTICE:  Starting predict_classification_model
NOTICE:        NULL      symbol    pairlist     closure environment     promise
NOTICE:           1        8062      180936        4360        1128        6271
NOTICE:    language     special     builtin        char     logical     integer
NOTICE:       50536         189        1839        9684        7496       17878
NOTICE:      double     complex   character         ...         any        list
NOTICE:        2258           1       43312           1           0       13899
NOTICE:  expression    bytecode externalptr     weakref         raw          S4
NOTICE:           3       10683         813         188         189        1061

NOTICE:  Ending predict_classification_model
NOTICE:        NULL      symbol    pairlist     closure environment     promise
NOTICE:           1        8062      181356        4360        1129        6273
NOTICE:    language     special     builtin        char     logical     integer
NOTICE:       50617         189        1839     2085319        7496       17972
NOTICE:      double     complex   character         ...         any        list
NOTICE:        2393           1       43459           1           0       13929
NOTICE:  expression    bytecode externalptr     weakref         raw          S4
NOTICE:           3       10683         814         187         188        1061



On 05/16/2013 11:40 PM, Joe Conway wrote:
>
> On 05/16/2013 08:40 AM, Tom Lane wrote:
>> "David M. Kaplan" <david.kaplan@ird.fr> writes:
>>> Thanks for the help.  You have definitely identified the problem,
>>> but I am still looking for a solution that works for me.  I tried
>>> setting vm.overcommit_memory=2, but this just made the query
>>> crash quicker than before, though without killing the entire
>>> connection to the database.  I imagine that this means that I
>>> really am trying to use more memory than the system can handle?
>>> I am wondering if there is a way to tell postgresql to flush a
>>> set of table lines out to disk so that the memory they are using
>>> can be liberated.
>> Assuming you don't have work_mem set to something unreasonably
>> large, it seems likely that the excessive memory consumption is
>> inside your PL/R function, and not the fault of Postgres per se.
>> You might try asking in some R-related forums about how to reduce
>> the code's memory usage.
> The two "classic" approaches to this with PL/R are either create a
> custom aggregate with the PL/R as the final function (i.e. work on one
> group at a time) or use the SPI cursor functionality within the PL/R
> function and page your data using a cursor. Not all forms of analysis
> lend themselves to these approaches, but perhaps yours does.
>
> Ultimately I would like to implement a form of R data.frame that does
> the paging with a cursor transparently for you, but I have not been
> able to find the time so far.
>
> Joe
