Thread: PLPythonU & Out of Memory - Importing Query

PLPythonU & Out of Memory - Importing Query

From
Jon Clements
Date:
Hi there,

I am currently experimenting using plpythonu with postgresql 8.0 for Win32. It's basically a quick script that imports
datafrom CSV files, but does some quite complicated data lookups and selections. The area in which I'm somewhat
confundedis memory usage. The process successfully runs, but keeps climbing in memory usage relentlessly, successfully
importingabout 200k records, before the memory usage of postgres soars to 2gb and of course, shortly after that, grinds
toa halt with a "Out of Memory" error.  

I'm not deliberately storing anything in the SD/GD dictionaries, and am not dealing with triggers...

create function blah(text) returns int8 as
$$
# Initialisation of plans
myplan = plpy.prepare('insert into tablename (var1,var2) values($1,$2)', ['text','text'] )
# Setup external CSV data source
# For each record, that meets certain critera, execute insert...
for rec in dsource: plpy.execute(myplan, [Value1, Value2] )

# Finishing stuff
return some_meaningful_value
$$
LANGUAGE PLPYTHONU;

Given I'm importing about 250 million records and only want to end up with about 4 million, is
1) This possible using the above?
2) Better suited to something else (I've looked at COPY but that would require the entire table be uploaded first, then
filteredand I'd like to avoid that if necessary, or thinking about it, I spose a trigger could be written that
respondedon the copy?). Also the other thing is COPY is only applicable to simple text files, while I want this import
scriptto be generic from whatever datasource it may be importing from (an ODBC/DBF/Berkeley DB format etc...) 

Anyhow, thanks in advance for any help.
Any Qs, please gimme a yell.

Regards,

Jon.








________________________________________________________________________
This e-mail has been scanned for all viruses by Star. The
service is powered by MessageLabs. For more information on a proactive
anti-virus service working around the clock, around the globe, visit:
http://www.star.net.uk
________________________________________________________________________


Re: PLPythonU & Out of Memory - Importing Query

From
Tom Lane
Date:
Jon Clements <jon.clements@trgstrata.co.uk> writes:
> I am currently experimenting using plpythonu with postgresql 8.0 for
> Win32. It's basically a quick script that imports data from CSV files,
> but does some quite complicated data lookups and selections. The area
> in which I'm somewhat confunded is memory usage. The process
> successfully runs, but keeps climbing in memory usage relentlessly,
> successfully importing about 200k records, before the memory usage of
> postgres soars to 2gb and of course, shortly after that, grinds to a
> halt with a "Out of Memory" error.

You may have stumbled across a memory leak in plpythonu, but there's
not enough info here for anyone to investigate.  Can you put together
a self-contained test case for people to try?
        regards, tom lane


Re: PLPythonU & Out of Memory - Importing Query

From
Michael Fuhr
Date:
On Tue, Nov 08, 2005 at 02:21:01PM +0000, Jon Clements wrote:
> I am currently experimenting using plpythonu with postgresql 8.0
> for Win32. It's basically a quick script that imports data from CSV
> files, but does some quite complicated data lookups and selections.
> The area in which I'm somewhat confunded is memory usage. The process
> successfully runs, but keeps climbing in memory usage relentlessly,
> successfully importing about 200k records, before the memory usage
> of postgres soars to 2gb and of course, shortly after that, grinds
> to a halt with a "Out of Memory" error.
[ ... ]
> # Setup external CSV data source
> # For each record, that meets certain critera, execute insert...
> for rec in dsource: plpy.execute(myplan, [Value1, Value2] )

What exactly is dsource?  If it were a simple list that was built
in its entirety prior to the "for" loop then I'd expect that to
cause memory problems; however, since you're getting a partial data
load I'm wondering if dsource is some other kind of object that
implements an iterator.  Have you run the same code (minus the
database parts) in a standalone Python program to see if it has a
memory leak?

-- 
Michael Fuhr