Thread: problem with lost connection while running long PL/R query
Hi,

I have a query that uses a PL/R function to run a statistical model on data in
a postgresql table. The query runs the function 4 times, each call producing
about 2 million lines of results, for a final table of about 8 million lines.
Each call to the function takes roughly 3-5 minutes to finish its work. I know
the function itself is running fine, as it reports progress and I have had no
trouble using it on slightly smaller datasets. This time, however, it completes
the 4 runs but ends with psql saying the connection to the server has been
lost. In the postgresql log file, I have:

2013-05-16 14:05:47 CEST LOG: server process (PID 27488) was terminated by signal 9: Killed
2013-05-16 14:05:47 CEST LOG: terminating any other active server processes
2013-05-16 14:05:47 CEST WARNING: terminating connection because of crash of another server process
2013-05-16 14:05:47 CEST DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory

The process ID it mentions is the postgresql query process:

postgres 27488 87.1 26.1 2272848 2101804 ? Rs 13:44 1:01 postgres: dkaplan fads [local] SELECT

Does anyone know what might be causing this problem and how to fix it? Is
something timing out? Or is it a memory problem? How would I go about getting
more diagnostic information?

Thanks for the help,
David

--
**********************************
David M. Kaplan
Charge de Recherche 1

Institut de Recherche pour le Developpement
Centre de Recherche Halieutique Mediterraneenne et Tropicale
av. Jean Monnet
B.P. 171
34203 Sete cedex
France

Phone: +33 (0)4 99 57 32 27
Fax: +33 (0)4 99 57 32 95

http://www.umr-eme.org/team/dkaplan/
http://www.amped.ird.fr/
**********************************
On Thu, May 16, 2013 at 02:47:28PM +0200, David M. Kaplan wrote:
> Hi,
>
> I have a query that uses a PL/R function to run a statistical model
> on data in a postgresql table. [...] However, this time it does the 4
> runs, but ends with psql saying the connection to the server has been
> lost. In the postgresql log file, I have:
>
> 2013-05-16 14:05:47 CEST LOG: server process (PID 27488) was terminated by signal 9: Killed

This usually indicates the OOM killer has killed your backend. Please
see the documentation here:

http://www.postgresql.org/docs/9.2/interactive/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

-Ryan
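For readers hitting the same symptom: the OOM kill can usually be confirmed in
the kernel log before changing any settings. A quick check (exact log location
and wording vary by distribution):

    # Look for the kernel's OOM killer messages
    dmesg | grep -i -E 'out of memory|killed process'
    # or, depending on the distro:
    #   grep -i 'killed process' /var/log/syslog      (Debian/Ubuntu)
    #   grep -i 'killed process' /var/log/messages    (RHEL/CentOS)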
"David M. Kaplan" <david.kaplan@ird.fr> writes: > I have a query that uses a PL/R function to run a statistical model on > data in a postgresql table. The query runs the function 4 times, each > of which generates about 2 million lines of results, generating a final > table that has about 8 million lines. Each time the function is called, > it takes about ~3-5 minutes to finish its work. I know that the function > is running fine as it gives me progress and I have had no trouble using > it on slightly smaller datasets. However, this time it does the 4 runs, > but ends with psql saying the connection to the server has been lost. > In the postgresql log file, I have: > 2013-05-16 14:05:47 CEST LOG: server process (PID 27488) was terminated by signal 9: Killed You've been hit by the dreaded OOM kill "feature" of Linux, as you can confirm by looking into the kernel logs. A lot of people just turn that off, especially on servers. Google for "memory overcommit" for more info. regards, tom lane
Hi,

Thanks for the help. You have definitely identified the problem, but I am
still looking for a solution that works for me. I tried setting
vm.overcommit_memory=2, but this just made the query crash quicker than
before, though without killing the entire connection to the database. I
imagine that this means that I really am trying to use more memory than the
system can handle?

I am wondering if there is a way to tell postgresql to flush a set of table
lines out to disk so that the memory they are using can be liberated.
Basically, I am creating my table with 8 million lines in 4 chunks, and each
chunk fits in memory. As the 4 are inside one query, I imagine that at some
point postgresql tries to put the four together and write them to disk as a
postgresql table, and that is when things break. Is it possible to tell
postgresql to write one "chunk" at a time? I imagine I could do this using
PL/PGSQL and iteration, but perhaps there is a postgresql configuration
parameter that can help with this?

Thanks again,
David

On 05/16/2013 03:35 PM, Ryan Kelly wrote:
> On Thu, May 16, 2013 at 02:47:28PM +0200, David M. Kaplan wrote:
>> [...]
>> 2013-05-16 14:05:47 CEST LOG: server process (PID 27488) was terminated by signal 9: Killed
> This usually indicates the OOM killer has killed your backend. Please
> see the documentation here:
>
> http://www.postgresql.org/docs/9.2/interactive/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
>
> -Ryan
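The PL/PGSQL-and-iteration idea mentioned above might look roughly like the
sketch below: one INSERT per chunk, so each chunk's rows go into the table
before the next function call starts. The names (results, my_model) are purely
illustrative, not the actual schema, and this does not help if the memory is
being held inside the PL/R function itself rather than by Postgres.

    -- Hypothetical names; a sketch of "one chunk at a time" instead of a
    -- single query that calls the PL/R function four times.
    DO $$
    BEGIN
      FOR i IN 1..4 LOOP
        INSERT INTO results SELECT * FROM my_model(i);
      END LOOP;
    END;
    $$;

Note that the whole DO block still runs in one transaction; four separate
INSERT statements would be the more cautious variant.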
"David M. Kaplan" <david.kaplan@ird.fr> writes: > Thanks for the help. You have definitely identified the problem, but I > am still looking for a solution that works for me. I tried setting > vm.overcommit_memory=2, but this just made the query crash quicker than > before, though without killing the entire connection to the database. I > imagine that this means that I really am trying to use more memory than > the system can handle? > I am wondering if there is a way to tell postgresql to flush a set of > table lines out to disk so that the memory they are using can be > liberated. Assuming you don't have work_mem set to something unreasonably large, it seems likely that the excessive memory consumption is inside your PL/R function, and not the fault of Postgres per se. You might try asking in some R-related forums about how to reduce the code's memory usage. Also, if by "crash" this time you meant you got an "out of memory" error from Postgres, there should be a memory map in the postmaster log showing all the memory consumption Postgres itself is aware of. If that doesn't add up to a lot, it would be pretty solid proof that the problem is inside R. If there are any memory contexts that seem to have bloated unreasonably, knowing which one(s) would be helpful information. regards, tom lane
On 05/16/2013 08:40 AM, Tom Lane wrote:
> "David M. Kaplan" <david.kaplan@ird.fr> writes:
>> Thanks for the help. You have definitely identified the problem,
>> but I am still looking for a solution that works for me. [...]
>
>> I am wondering if there is a way to tell postgresql to flush a
>> set of table lines out to disk so that the memory they are using
>> can be liberated.
>
> Assuming you don't have work_mem set to something unreasonably
> large, it seems likely that the excessive memory consumption is
> inside your PL/R function, and not the fault of Postgres per se.
> You might try asking in some R-related forums about how to reduce
> the code's memory usage.

The two "classic" approaches to this with PL/R are either create a
custom aggregate with the PL/R function as the final function (i.e. work
on one group at a time), or use the SPI cursor functionality within the
PL/R function and page through your data with a cursor. Not all forms of
analysis lend themselves to these approaches, but perhaps yours does.

Ultimately I would like to implement a form of R data.frame that does
the paging with a cursor transparently for you, but I have not been able
to find the time so far.

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support
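A very rough sketch of the second approach Joe describes (SPI cursor paging
inside a PL/R function) is below. It is untested, and the table, column, and
model names are invented for illustration; the pg.spi.* helpers are PL/R's
cursor functions, but check the PL/R documentation for the exact argument and
return-value details.

    -- Illustrative only: page through a source table 10,000 rows at a time
    -- and apply a previously saved classifier to each page.
    CREATE OR REPLACE FUNCTION predict_in_batches()
    RETURNS SETOF record AS $$
      mdl  <- readRDS("/path/to/model.rds")        # hypothetical saved model
      plan <- pg.spi.prepare("SELECT * FROM obs")  # hypothetical source table
      curs <- pg.spi.cursor_open("obs_curs", plan)
      out  <- NULL
      repeat {
        batch <- pg.spi.cursor_fetch(curs, TRUE, as.integer(10000))
        if (is.null(batch) || nrow(batch) == 0) break
        pred <- predict(mdl, newdata = batch)      # one page at a time in R
        out  <- rbind(out, data.frame(id = batch$id, state = pred))
      }
      pg.spi.cursor_close(curs)
      # Note: the full result is still accumulated in 'out' before returning,
      # so this bounds the per-page working set but not the final result size.
      out
    $$ LANGUAGE plr;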
Hi Joe,

Thanks for responding, as you would clearly be the expert on this sort of
problem. My current function does page through the data using a cursor,
precisely to avoid out-of-memory problems, which is why I am somewhat
surprised and stumped as to how this can be happening. It does return all
the data at once, but one call to the function seems to work, so I can't
see why 4 wouldn't.

I am currently planning to do some test runs using memory.profile() to see
if each successive call to the PL/R function is somehow accumulating memory
usage somewhere. Perhaps I am not properly closing a query or something
like that?

I am attaching my code. Perhaps you will have some insight. To give you a
basic idea of what I am trying to do, I have separately developed a
classification model for the state of a "system" based on the data in the
postgresql table. I want to apply that model to each line of the table. I
loop over the cursor and predict the state for batches of 10,000 lines at
a time.

Thanks again for the help.

Cheers,
David

On 05/16/2013 11:40 PM, Joe Conway wrote:
> The two "classic" approaches to this with PL/R are either create a
> custom aggregate with the PL/R as the final function (i.e. work on one
> group at a time) or use the SPI cursor functionality within the PL/R
> function and page your data using a cursor. Not all forms of analysis
> lend themselves to these approaches, but perhaps yours does.
>
> Ultimately I would like to implement a form of R data.frame that does
> the paging with a cursor transparently for you, but I have not been
> able to find the time so far.
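The kind of instrumentation David describes can be as simple as the sketch
below, dropped at the top and bottom of the PL/R function body. Here
pg.thrownotice() is PL/R's helper for emitting NOTICE messages, and the text
in the messages is just a label; this is a sketch, not the attached code.

    # At the start of the PL/R function body
    pg.thrownotice("Starting predict_classification_model")
    for (l in capture.output(print(memory.profile()))) pg.thrownotice(l)

    # ... cursor loop / prediction work ...

    # Just before returning
    pg.thrownotice("Ending predict_classification_model")
    for (l in capture.output(print(memory.profile()))) pg.thrownotice(l)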
Attachment
Hi,

Of course today, after a reboot to update to the newest kernel, everything
works without crashing... I imagine that yesterday's problem was that I had
forgotten I had a Windows virtual machine running on the server that was
eating a good piece of memory. Still, using a cursor to page through the data
should have avoided major memory difficulties...

As a footnote, I can confirm that PL/R was not eating up memory little by
little (see results of memory.profile() at the beginning and end of each
function invocation below).

Thanks,
David

Run 1:
NOTICE: Starting predict_classification_model
NOTICE: NULL symbol pairlist closure environment promise
NOTICE: 1 7847 173768 4308 1129 6273
NOTICE: language special builtin char logical integer
NOTICE: 46996 189 1839 9387 7351 17229
NOTICE: double complex character ... any list
NOTICE: 2190 1 42810 1 0 13456
NOTICE: expression bytecode externalptr weakref raw S4
NOTICE: 3 10246 823 198 199 1061
NOTICE: Ending predict_classification_model
NOTICE: NULL symbol pairlist closure environment promise
NOTICE: 1 8062 181354 4360 1129 6273
NOTICE: language special builtin char logical integer
NOTICE: 50616 189 1839 2085318 7496 17972
NOTICE: double complex character ... any list
NOTICE: 2393 1 43459 1 0 13929
NOTICE: expression bytecode externalptr weakref raw S4
NOTICE: 3 10683 814 187 188 1061

Run 2:
NOTICE: Starting predict_classification_model
NOTICE: NULL symbol pairlist closure environment promise
NOTICE: 1 8062 180936 4360 1128 6271
NOTICE: language special builtin char logical integer
NOTICE: 50536 189 1839 9684 7496 17878
NOTICE: double complex character ... any list
NOTICE: 2258 1 43312 1 0 13899
NOTICE: expression bytecode externalptr weakref raw S4
NOTICE: 3 10683 813 188 189 1061
NOTICE: Ending predict_classification_model
NOTICE: NULL symbol pairlist closure environment promise
NOTICE: 1 8062 181356 4360 1129 6273
NOTICE: language special builtin char logical integer
NOTICE: 50617 189 1839 2085317 7496 17972
NOTICE: double complex character ... any list
NOTICE: 2393 1 43459 1 0 13929
NOTICE: expression bytecode externalptr weakref raw S4
NOTICE: 3 10683 814 187 188 1061

Run 3:
NOTICE: Starting predict_classification_model
NOTICE: NULL symbol pairlist closure environment promise
NOTICE: 1 8062 180936 4360 1128 6271
NOTICE: language special builtin char logical integer
NOTICE: 50536 189 1839 9684 7496 17878
NOTICE: double complex character ... any list
NOTICE: 2258 1 43312 1 0 13899
NOTICE: expression bytecode externalptr weakref raw S4
NOTICE: 3 10683 813 188 189 1061
NOTICE: Ending predict_classification_model
NOTICE: NULL symbol pairlist closure environment promise
NOTICE: 1 8062 181356 4360 1129 6273
NOTICE: language special builtin char logical integer
NOTICE: 50617 189 1839 2085319 7496 17972
NOTICE: double complex character ... any list
NOTICE: 2393 1 43459 1 0 13929
NOTICE: expression bytecode externalptr weakref raw S4
NOTICE: 3 10683 814 187 188 1061

On 05/16/2013 11:40 PM, Joe Conway wrote:
> The two "classic" approaches to this with PL/R are either create a
> custom aggregate with the PL/R as the final function (i.e. work on one
> group at a time) or use the SPI cursor functionality within the PL/R
> function and page your data using a cursor. [...]