Thread: extreme memory use when loading in a lot of data

extreme memory use when loading in a lot of data

From
Vivek Khera
Date:
I have some historic data that I want to analyze.  To do this I set up
postgres on a spare box I picked up for cheap, which just lucked into
having tons of RAM (1.5G).   I set up postgres to use 10000 buffers,
and recompiled the kernel to allow 2Gb data size limit per process.

Since this is historical data, I'm actually merging a couple of dumps
that span the time range.  I've dealt with eliminating any conflicting
data (ie, clashing unique keys) but I'm not 100% sure that the foreign
key constraints are all met.  Thus, when loading the data from the
second dump, I am leaving the FK triggers on.

Now, this is where my trouble has begun... On importing row 29,796,801
for the first big table, I get this (after 27 hours!):

pg_restore: ERROR:  out of memory
DETAIL:  Failed on request of size 44.
CONTEXT:  COPY msg_recipients, line 29796801: "75312    3434358 sent"
pg_restore: [archiver (db)] error returned by PQendcopy

(I had originally tried this with a 1Gb data size limit on the
process, and it died at around row 15 million.)

The curious thing is that watching the process size grow on another
window, is that it shrunk considerably a few minutes before it
croaked:


% ps axuw | grep postg
pgsql   26032  6.0  7.9 2186888 123440  ??  D    Thu09AM 122:26.10 postmaster: khera vkmlm [local] COPY (postgres)

Note here that we are using nearly 2gb virtual memory (I'm sure a
bunch of that is stack and the shared memory segment).

a few minutes later I saw this:

% ps axuw | grep postg
pgsql   26032 10.4  5.5 91840 85624  ??  S    Thu09AM 123:17.24 postmaster: khera vkmlm [local] idle (postgres)

note the process size is down to 91M.  A few minutes later I got the
out of memory error.  This is very curious because I don't expect the
process to release the memory back to the OS like that.

There are about 157 million records in this table, three columns wide.

FreeBSD 4.10-PRERELEASE, PostgreSQL 7.4.2

So, is there any way to load this in or do I need to either break it
into chunks (no small feat) or disable triggers during load and hope
and pray there are no FK violations?

Is there some leak or does it just take that much RAM to load in data
from a table?

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.                Khera Communications, Inc.
Internet: khera@kciLink.com       Rockville, MD  +1-301-869-4449 x806
AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/

Re: extreme memory use when loading in a lot of data

From
Stephan Szabo
Date:
On Fri, 21 May 2004, Vivek Khera wrote:

> I have some historic data that I want to analyze.  To do this I set up
> postgres on a spare box I picked up for cheap, which just lucked into
> having tons of RAM (1.5G).   I set up postgres to use 10000 buffers,
> and recompiled the kernel to allow 2Gb data size limit per process.
>
> Since this is historical data, I'm actually merging a couple of dumps
> that span the time range.  I've dealt with eliminating any conflicting
> data (ie, clashing unique keys) but I'm not 100% sure that the foreign
> key constraints are all met.  Thus, when loading the data from the
> second dump, I am leaving the FK triggers on.

I'd suggest dropping the constraints, adding the data and adding the
constraint again. If you're using 7.4 the speed will be better for
checking the constraint, and if the constraint is not satisfied, you'll
need to remove the offending row and recreate the constraint, but that's
better than having to reimport.

> Now, this is where my trouble has begun... On importing row 29,796,801
> for the first big table, I get this (after 27 hours!):

I'd wonder if some large portion of the memory is the deferred trigger
queue which doesn't yet spill over to disk when it gets too large.

Re: extreme memory use when loading in a lot of data

From
Tom Lane
Date:
Vivek Khera <khera@kcilink.com> writes:
> Since this is historical data, I'm actually merging a couple of dumps
> that span the time range.  I've dealt with eliminating any conflicting
> data (ie, clashing unique keys) but I'm not 100% sure that the foreign
> key constraints are all met.  Thus, when loading the data from the
> second dump, I am leaving the FK triggers on.

I think you'd be better off to drop the FK constraint, import, and
re-add the constraint.  The out-of-memory problem is probably due to the
list of deferred trigger firings (one per tuple, or more if you have
multiple FKs to check).  Even if you had enough memory, you'd not have
enough patience for all those retail FK checks to occur after the COPY
finishes.

At least in 7.4, adding an FK constraint on an existing table should
produce a better plan than the retail checks involved in adding rows to
a table with an existing FK constraint.

            regards, tom lane

Re: extreme memory use when loading in a lot of data

From
Vivek Khera
Date:
>> key constraints are all met.  Thus, when loading the data from the
>> second dump, I am leaving the FK triggers on.
>
> I'd suggest dropping the constraints, adding the data and adding the
> constraint again. If you're using 7.4 the speed will be better for
> checking the constraint, and if the constraint is not satisfied, you'll
> need to remove the offending row and recreate the constraint, but
> that's
> better than having to reimport.

Thanks to Stephan and Tom for the same suggestion.  I'm also glad to
know it is not a leak but just a large amount of memory use.  The load
is still taking a LOOONG time because it is just a lowly IDE disk (but
it is UDMA100)... 96 hours so far and the memory usage is steady.