Does the psql executable have any ability to do a "fetch many", using a server-side named cursor, when returning
results?It seems like it tries to retrieve the query entirely to local memory before printing to standard out.
Specifically, I've tried running the following command on my desktop, which returns about 70 million lines:
echo "select [thirteen columns, all floating point and boolean types] from go_prediction_view_mouse gpv order by
combined_scoredesc nulls last" | psql -U [username] -h [remote host] [database] > mouse_predictions.tab
The command increases in memory usage until about 12GB, when it stops and there is significant disk activity (assume
pagingto disk). Closing a few big programs immediately causes it to increase its memory usage accordingly. After about
50minutes, I killed it.
If I instead run the Python program below, which simply performs the query using a fetchmany() call to retrieve a few
hundredthousand tuples at a time through a named server-side cursor, the program remains under about 20 MB of memory
usagethroughout and finishes in about 35 minutes.
I know that the query used here could have been a COPY statement, which I assume would be better-behaved, but I'm more
concernedabout the case in which the query is more complex.
The local (OSX 10.6.2) version of Postgres is 8.4.2, and the server's (Ubuntu 4.x) version of Postgres is 8.3.5.
------ Python source to use a named server-side cursor to dump a large number of rows to a file ------
#---Modify this section---#
species = 'mouse'
query = 'select go_term_ref, gene_remote_id, function_verified_exactly, function_verified_with_parent_go,
function_verified_with_child_go,combined_score, obozinski_score, lee_score, mostafavi_score, guan_score, kim_score,
joshi_score,tasan_score, tasan_revised_score, qi_score, leone_score from go_prediction_view_' + species + ' gpv order
bycombined_score desc nulls last'
outputFilePath = [*output file path*]
connectionString = [*connection string*]
queryBufferSize = 10000
# not important
#---End modify this section----#
#---Everything below should be genetic---#
conn = psycopg2.connect(connectionString);
cur = conn.cursor('temp_dump_cursor')
cur.arraysize = queryBufferSize
'An iterator that uses fetchmany to keep memory usage down'
done = False
while not done:
results = cursor.fetchmany(queryBufferSize)
if not results or len(results) == 0:
done = True
for result in results:
with open(outputFilePath, 'w') as file:
print 'Starting ' + query + '...'
i = 0
for row in resultIter(cur):
i += 1
row = processRow(row)
file.write('\t'.join(row) + '\n')
if i % queryBufferSize == 0:
print 'Wrote ' + str(i) + ' lines.'
print 'finished. Total of ' + str(i) + ' lines.'