Thread: Working with very large datasets
<p><font size="2">Greetings,</font><br /><font size="2">Can anyone enlighten me or point me at resources concerning use ofpgsql with</font><br /><font size="2">very large datasets?</font><p><font size="2">My specific problem is this:</font><p><fontsize="2">I have two tables, one with about 100 million rows and one with about 22,000</font><br /><fontsize="2">rows. My plan was to inner join the two tables on an integer key and output</font><br /><font size="2">the4 significant columns, excluding the keys. (Those with a better understanding</font><br /><font size="2">of pgsqlinternals, feel free to laugh.) The result was a big angry psql that</font><br /><font size="2">grew to 800+MB beforeI had to kill it.</font><p><font size="2">Example:</font><p><font size="2">CREATE TABLE users (</font><br /><font size="2"> userid INTEGER NOT NULL,</font><br /><font size="2"> fileid INTEGER NOT NULL</font><br/><font size="2">);</font><br /><font size="2">CREATE INDEX "users_userid_key" on "users" using btree ( "userid""int4_ops" );</font><p><font size="2">CREATE TABLE files</font><br /><font size="2">(</font><br /><font size="2"> fileid SERIAL NOT NULL,</font><br /><font size="2"> name VARCHAR(120),</font><br/><font size="2"> size INTEGER,</font><br /><font size="2"> mtime INTEGER</font><br /><font size="2">);</font><br /><font size="2">-- files.fileid will automagically get indexed</font><p><fontsize="2">[100 million inserts to 'users', 22,000 inserts to 'files'...]</font><p><font size="2">SELECTuserid,name,size,mtime FROM users INNER JOIN files ON users.fileid = files.fileid;</font><p><font size="2">[attackof the 50 foot psql...]</font><p><font size="2">Did I miss sonething important? Is there a better way? Any clues appreciated.</font><p><font size="2">-cw-</font><p><font size="2">(Apologies for any HTML that gets tackedonto this message. I spec'd plaintext!)</font><br /><font size="2">--</font><br /><font size="2">Charlie Wilkinson</font><br/><font size="2">TRIS Development Systems Administrator</font><br /><font size="2">IS:SD:CT:CC:TD</font><br/><font size="2">Phone: 202-283-3241</font><br /><font size="2">MSMail: Charlie.E.Wilkinson@irs.gov</font><br/><font size="2">SMTP: cwilkins@tris.irs.gov</font><br /><font size="2">Home: cwilkins@boinklabs.com</font><br/><font size="2">This message constructed from 90% post-consumer electrons. </font>
On Tue, 11 Feb 2003, Wilkinson Charlie E wrote:

> Greetings,
> Can anyone enlighten me or point me at resources concerning use of pgsql
> with
> very large datasets?
>
> My specific problem is this:
>
> I have two tables, one with about 100 million rows and one with about 22,000
> rows. My plan was to inner join the two tables on an integer key and output
> the 4 significant columns, excluding the keys. (Those with a better
> understanding
> of pgsql internals, feel free to laugh.) The result was a big angry psql
> that
> grew to 800+MB before I had to kill it.

Was it psql that grew to 800MB or a backend? If the former, how many rows
do you expect that to return? You probably want to look into using cursors
rather than returning the entire result set at once.
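As a sketch of the cursor approach from within psql (the cursor name
"bigjoin" is invented, the table and column names come from the original
schema, and the fetch size of 1000 is arbitrary):

BEGIN;  -- DECLARE CURSOR must run inside a transaction block
DECLARE bigjoin CURSOR FOR
    SELECT userid, name, size, mtime
    FROM users INNER JOIN files ON users.fileid = files.fileid;
FETCH 1000 FROM bigjoin;   -- repeat until a FETCH returns no rows
CLOSE bigjoin;
COMMIT;

Each FETCH pulls the next batch of rows, so the client only ever holds one
batch in memory instead of the whole 100M-row result.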
On Tue, 2003-02-11 at 18:42, Wilkinson Charlie E wrote:

> Greetings,
> Can anyone enlighten me or point me at resources concerning use of
> pgsql with
> very large datasets?

[snip]

> SELECT userid,name,size,mtime FROM users INNER JOIN files ON
> users.fileid = files.fileid;

If you just want to see a sample result from that query in psql, then you
can use LIMIT:

SELECT userid,name,size,mtime FROM users INNER JOIN files ON
users.fileid = files.fileid LIMIT 100;

This should only return 100 rows. I hope that helps. I think there is also
a way to specify offsets so that you can return the rows in batches.

--
Lex Berezhny <LBerezhny@DevIS.com>
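The offset mechanism referred to here is the OFFSET clause. A rough sketch
of batched retrieval (batch size arbitrary; the ORDER BY is needed so that
successive batches line up, since each batch re-runs the query):

SELECT userid, name, size, mtime
FROM users INNER JOIN files ON users.fileid = files.fileid
ORDER BY userid
LIMIT 1000 OFFSET 0;
-- next batches: ... LIMIT 1000 OFFSET 1000; ... LIMIT 1000 OFFSET 2000; etc.

Note that with a 100-million-row result each successive batch gets slower,
because the server still has to walk past all the skipped rows; for a full
dump, a cursor is usually the better tool.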
<p><font size="2">> -----Original Message-----</font><br /><font size="2">> From: Stephan Szabo [<a href="mailto:sszabo@megazone23.bigpanda.com">mailto:sszabo@megazone23.bigpanda.com</a>]</font><br/><font size="2">[...]</font><br/><font size="2">> Was it psql that grew to 800Mb or a backend? If the former, </font><br /><fontsize="2">> how many rows</font><br /><font size="2">> do you expect that to return? You probably want to lookinto using</font><br /><font size="2">> cursors rather than returning the entire result set at once.</font><p><fontsize="2">It was the front end, psql that got big. postmaster has been pretty</font><br /><font size="2">wellbehaved through all this. I'm looking to get the full 100M+</font><br /><font size="2">records back. I haveto print it all out and fax them to someone. ;)</font><br /><font size="2">Seriously, I want to dump it all to a (big!)file and sift through</font><br /><font size="2">it with some familiar unix tools to check the data and generally</font><br/><font size="2">look it over.</font><p><font size="2">Thanks to Stephan and Lex for the tips! I tookyour CURSOR and LIMIT</font><br /><font size="2">clues and ran with them. I looked at straight SQL>psql and PerlDBD:Pg</font><br /><font size="2">for solutions, but the end result was some "ecpg" C code:</font><p><font size="2">----------------------------------</font><br/><font size="2">#include <stdio.h></font><br /><font size="2">#defineNO_MORE_TUPLES (sqlca.sqlcode == ECPG_NOT_FOUND)</font><p><font size="2">int main(int argc, char* agv[]){</font><p><fontsize="2"> EXEC SQL BEGIN DECLARE SECTION;</font><br /><font size="2"> int myuser;</font><br/><font size="2"> char myname[121];</font><br /><font size="2"> int mysize;</font><br /><fontsize="2"> int mymtime;</font><br /><font size="2"> EXEC SQL END DECLARE SECTION;</font><p><font size="2"> EXEC SQL INCLUDE sqlca;</font><br /><font size="2"> EXEC SQL CONNECT TO userdb USER xxxxxxx;</font><br/><font size="2"> EXEC SQL WHENEVER sqlerror sqlprint;</font><p><font size="2"> EXEC SQLDECLARE mycursor CURSOR FOR SELECT user,name,size,mtime</font><br /><font size="2"> FROM users JOIN files USING(fileid) ORDER BY user;</font><p><font size="2"> EXEC SQL OPEN mycursor;</font><br /><font size="2"> while(1){</font><br/><font size="2"> EXEC SQL FETCH FROM mycursor</font><br /><font size="2"> INTO :myuser,:myname,:mysize,:mymtime;</font><br /><font size="2"> if(NO_MORE_TUPLES)</font><br/><font size="2"> break;</font><br /><font size="2"> myname[120]= '\0';</font><br /><font size="2"> printf("%d\t%s\t%d\t%d\n", myuser, myname, mysize, mymtime);</font><br/><font size="2"> }</font><br /><font size="2"> EXEC SQL CLOSE mycursor;</font><p><fontsize="2"> exit(0);</font><br /><font size="2">}</font><br /><font size="2">-----------------------------------</font><p><fontsize="2">No idea how long it's going to take to complete its run,but it</font><br /><font size="2">faithfully starts spitting out the records shortly after being</font><br /><font size="2">invoked. Thanks again guys!</font><p><font size="2">For those interested, more helpful hints & examples on epgc,</font><br/><font size="2">libpq and related goodies here:</font><br /><font size="2"> <a href="http://www.cse.ucsc.edu/classes/cmps180/Winter02/proj5.html" target="_blank">http://www.cse.ucsc.edu/classes/cmps180/Winter02/proj5.html</a></font><br/> <font size="2"> - thanksto UCSC and Google. :-)</font><p><font size="2">-cw-</font>