Thread: Working with very large datasets
<p><font size="2">Greetings,</font><br /><font size="2">Can anyone enlighten me or point me at resources concerning use ofpgsql with</font><br /><font size="2">very large datasets?</font><p><font size="2">My specific problem is this:</font><p><fontsize="2">I have two tables, one with about 100 million rows and one with about 22,000</font><br /><fontsize="2">rows. My plan was to inner join the two tables on an integer key and output</font><br /><font size="2">the4 significant columns, excluding the keys. (Those with a better understanding</font><br /><font size="2">of pgsqlinternals, feel free to laugh.) The result was a big angry psql that</font><br /><font size="2">grew to 800+MB beforeI had to kill it.</font><p><font size="2">Example:</font><p><font size="2">CREATE TABLE users (</font><br /><font size="2"> userid INTEGER NOT NULL,</font><br /><font size="2"> fileid INTEGER NOT NULL</font><br/><font size="2">);</font><br /><font size="2">CREATE INDEX "users_userid_key" on "users" using btree ( "userid""int4_ops" );</font><p><font size="2">CREATE TABLE files</font><br /><font size="2">(</font><br /><font size="2"> fileid SERIAL NOT NULL,</font><br /><font size="2"> name VARCHAR(120),</font><br/><font size="2"> size INTEGER,</font><br /><font size="2"> mtime INTEGER</font><br /><font size="2">);</font><br /><font size="2">-- files.fileid will automagically get indexed</font><p><fontsize="2">[100 million inserts to 'users', 22,000 inserts to 'files'...]</font><p><font size="2">SELECTuserid,name,size,mtime FROM users INNER JOIN files ON users.fileid = files.fileid;</font><p><font size="2">[attackof the 50 foot psql...]</font><p><font size="2">Did I miss sonething important? Is there a better way? Any clues appreciated.</font><p><font size="2">-cw-</font><p><font size="2">(Apologies for any HTML that gets tackedonto this message. I spec'd plaintext!)</font><br /><font size="2">--</font><br /><font size="2">Charlie Wilkinson</font><br/><font size="2">TRIS Development Systems Administrator</font><br /><font size="2">IS:SD:CT:CC:TD</font><br/><font size="2">Phone: 202-283-3241</font><br /><font size="2">MSMail: Charlie.E.Wilkinson@irs.gov</font><br/><font size="2">SMTP: cwilkins@tris.irs.gov</font><br /><font size="2">Home: cwilkins@boinklabs.com</font><br/><font size="2">This message constructed from 90% post-consumer electrons. </font>
On Tue, 11 Feb 2003, Wilkinson Charlie E wrote:

> Greetings,
> Can anyone enlighten me or point me at resources concerning use of pgsql
> with
> very large datasets?
>
> My specific problem is this:
>
> I have two tables, one with about 100 million rows and one with about 22,000
> rows. My plan was to inner join the two tables on an integer key and output
> the 4 significant columns, excluding the keys. (Those with a better
> understanding
> of pgsql internals, feel free to laugh.) The result was a big angry psql
> that
> grew to 800+MB before I had to kill it.

Was it psql that grew to 800MB or a backend? If the former, how many rows
do you expect that to return? You probably want to look into using cursors
rather than returning the entire result set at once.
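As a sketch of the cursor approach from within psql (the cursor name
"bigjoin" is invented, the table and column names come from the original
schema, and the fetch size of 1000 is arbitrary):

BEGIN;  -- DECLARE CURSOR must run inside a transaction block
DECLARE bigjoin CURSOR FOR
    SELECT userid, name, size, mtime
    FROM users INNER JOIN files ON users.fileid = files.fileid;
FETCH 1000 FROM bigjoin;   -- repeat until a FETCH returns no rows
CLOSE bigjoin;
COMMIT;

Each FETCH pulls the next batch of rows, so the client only ever holds one
batch in memory instead of the whole 100M-row result.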
On Tue, 2003-02-11 at 18:42, Wilkinson Charlie E wrote:

> Greetings,
> Can anyone enlighten me or point me at resources concerning use of
> pgsql with
> very large datasets?

[snip]

> SELECT userid,name,size,mtime FROM users INNER JOIN files ON
> users.fileid = files.fileid;

If you just want to see a sample result from that query in psql, then you
can use LIMIT:

SELECT userid,name,size,mtime FROM users INNER JOIN files ON
users.fileid = files.fileid LIMIT 100;

This should only return 100 rows. I hope that helps. I think there is also
a way to specify offsets so that you can return the rows in batches.

--
Lex Berezhny <LBerezhny@DevIS.com>
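The offset mechanism referred to here is the OFFSET clause. A rough sketch
of batched retrieval (batch size arbitrary; the ORDER BY is needed so that
successive batches line up, since each batch re-runs the query):

SELECT userid, name, size, mtime
FROM users INNER JOIN files ON users.fileid = files.fileid
ORDER BY userid
LIMIT 1000 OFFSET 0;
-- next batches: ... LIMIT 1000 OFFSET 1000; ... LIMIT 1000 OFFSET 2000; etc.

Note that with a 100-million-row result each successive batch gets slower,
because the server still has to walk past all the skipped rows; for a full
dump, a cursor is usually the better tool.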
<p><font size="2">> -----Original Message-----</font><br /><font size="2">> From: Stephan Szabo [<a href="mailto:sszabo@megazone23.bigpanda.com">mailto:sszabo@megazone23.bigpanda.com</a>]</font><br/><font size="2">[...]</font><br/><font size="2">> Was it psql that grew to 800Mb or a backend? If the former, </font><br /><fontsize="2">> how many rows</font><br /><font size="2">> do you expect that to return? You probably want to lookinto using</font><br /><font size="2">> cursors rather than returning the entire result set at once.</font><p><fontsize="2">It was the front end, psql that got big. postmaster has been pretty</font><br /><font size="2">wellbehaved through all this. I'm looking to get the full 100M+</font><br /><font size="2">records back. I haveto print it all out and fax them to someone. ;)</font><br /><font size="2">Seriously, I want to dump it all to a (big!)file and sift through</font><br /><font size="2">it with some familiar unix tools to check the data and generally</font><br/><font size="2">look it over.</font><p><font size="2">Thanks to Stephan and Lex for the tips! I tookyour CURSOR and LIMIT</font><br /><font size="2">clues and ran with them. I looked at straight SQL>psql and PerlDBD:Pg</font><br /><font size="2">for solutions, but the end result was some "ecpg" C code:</font><p><font size="2">----------------------------------</font><br/><font size="2">#include <stdio.h></font><br /><font size="2">#defineNO_MORE_TUPLES (sqlca.sqlcode == ECPG_NOT_FOUND)</font><p><font size="2">int main(int argc, char* agv[]){</font><p><fontsize="2"> EXEC SQL BEGIN DECLARE SECTION;</font><br /><font size="2"> int myuser;</font><br/><font size="2"> char myname[121];</font><br /><font size="2"> int mysize;</font><br /><fontsize="2"> int mymtime;</font><br /><font size="2"> EXEC SQL END DECLARE SECTION;</font><p><font size="2"> EXEC SQL INCLUDE sqlca;</font><br /><font size="2"> EXEC SQL CONNECT TO userdb USER xxxxxxx;</font><br/><font size="2"> EXEC SQL WHENEVER sqlerror sqlprint;</font><p><font size="2"> EXEC SQLDECLARE mycursor CURSOR FOR SELECT user,name,size,mtime</font><br /><font size="2"> FROM users JOIN files USING(fileid) ORDER BY user;</font><p><font size="2"> EXEC SQL OPEN mycursor;</font><br /><font size="2"> while(1){</font><br/><font size="2"> EXEC SQL FETCH FROM mycursor</font><br /><font size="2"> INTO :myuser,:myname,:mysize,:mymtime;</font><br /><font size="2"> if(NO_MORE_TUPLES)</font><br/><font size="2"> break;</font><br /><font size="2"> myname[120]= '\0';</font><br /><font size="2"> printf("%d\t%s\t%d\t%d\n", myuser, myname, mysize, mymtime);</font><br/><font size="2"> }</font><br /><font size="2"> EXEC SQL CLOSE mycursor;</font><p><fontsize="2"> exit(0);</font><br /><font size="2">}</font><br /><font size="2">-----------------------------------</font><p><fontsize="2">No idea how long it's going to take to complete its run,but it</font><br /><font size="2">faithfully starts spitting out the records shortly after being</font><br /><font size="2">invoked. Thanks again guys!</font><p><font size="2">For those interested, more helpful hints & examples on epgc,</font><br/><font size="2">libpq and related goodies here:</font><br /><font size="2"> <a href="http://www.cse.ucsc.edu/classes/cmps180/Winter02/proj5.html" target="_blank">http://www.cse.ucsc.edu/classes/cmps180/Winter02/proj5.html</a></font><br/> <font size="2"> - thanksto UCSC and Google. :-)</font><p><font size="2">-cw-</font>