Thread: psql large RSS (1.6GB)

psql large RSS (1.6GB)

From

TTK Ciar

Date:

31 October 2004, 05:42:58

Hello!

  My name is TTK, and I'm a software engineer at the Internet Archive's
Data Repository department.  We have recently started using postgresql
for a couple of projects (we have historically been a MySQL outfit),
and today my co-worker noticed psql eating memory like mad when invoked
with a simple select statement incorporating a join of two tables.

  The platform is a heavily modified RedHat 7.3 Linux.  We are using
version 7.4.5 of postgresql.

  The invocation was via sh script:

#!/bin/bash

outfile=$1
if [ -z "$outfile" ]; then
    outfile=/0/brad/all_arcs.txt
fi

/usr/lib/postgresql/bin/psql -c 'select
ServerDisks.servername,ServerDisks.diskserial,ServerDisks.diskmountpoint,DiskFiles.name,DiskFiles.md5from
DiskFiles,ServerDiskswhere DiskFiles.diskserial=ServerDisks.diskserial;' -F ' ' -A -t -o $outfile 

.. and the tables in question are somewhat large (hundreds of GB's
of data), though we didn't expect that to be an issue as far as the
psql process was concerned.

  We monitored server load via 'top -i -d 0.5' and watched the output
file for data.  Over the course of about 200 seconds, psql's RSS
climbed to about 1.6 GB, and stayed there, while no data was written
to the output file.  Eventually 10133194 lines were written to the
output file, all at once, about 1.2GB's worth of data.

  I re-ran the select query using psql in interactive mode, and saw
the same results.

  I re-ran it again, using "explain analyse", and this time psql's
RSS did *not* increase significantly.  The result is here, if it
helps:

brad=# explain analyse select
ServerDisks.servername,ServerDisks.diskserial,ServerDisks.diskmountpoint,DiskFiles.name,DiskFiles.md5from
DiskFiles,ServerDiskswhere DiskFiles.diskserial=ServerDisks.diskserial; 
                          QUERY PLAN
------------------------------------------------------------------
 Hash Join  (cost=22.50..65.00 rows=1000 width=274) (actual time=118.584..124653.729 rows=10133349 loops=1)
   Hash Cond: (("outer".diskserial)::text = ("inner".diskserial)::text)
   ->  Seq Scan on diskfiles  (cost=0.00..20.00 rows=1000 width=198) (actual time=7.201..31336.063 rows=10133349
loops=1)
   ->  Hash  (cost=20.00..20.00 rows=1000 width=158) (actual time=90.821..90.821 rows=0 loops=1)
         ->  Seq Scan on serverdisks  (cost=0.00..20.00 rows=1000 width=158) (actual time=9.985..87.364 rows=2280
loops=1)
 Total runtime: 130944.586 ms

  At a guess, it looks like the data set is being buffered in its
entirety by psql, before any data is written to the output file,
which is surprising.  I would have expected it to grab data as it
appeared on the socket from postmaster and write it to disk.  Is
there something we can do to stop psql from buffering results?
Does anyone know what's going on here?

  If the solution is to just write a little client that uses perl
DBI to fetch rows one at a time and write them out, that's doable,
but it would be nice if psql could be made to "just work" without
the monster RSS.

  I'd appreciate any feedback.  If you need any additional info,
please let me know and I will provide it.

  -- TTK
  ttk2@ciar.org
  ttk@archive.org

Re: psql large RSS (1.6GB)

From

Dustin Sallings

Date:

31 October 2004, 06:03:52

On Oct 27, 2004, at 0:57, TTK Ciar wrote:

>   At a guess, it looks like the data set is being buffered in its
> entirety by psql, before any data is written to the output file,
> which is surprising.  I would have expected it to grab data as it
> appeared on the socket from postmaster and write it to disk.  Is
> there something we can do to stop psql from buffering results?
> Does anyone know what's going on here?

    Yes, the result set is sent back to the client before it can be used.
An easy workaround when dealing with this much data is to use a cursor.
  Something like this:

db# start transaction;
START TRANSACTION
db# declare logcur cursor for select * from some_table;
DECLARE CURSOR
db# fetch 5 in logcur;
[...]
(5 rows)

    This will do approximately what you expected the select to do in the
first, place, but the fetch will decide how many rows to buffer into
the client at a time.

>   If the solution is to just write a little client that uses perl
> DBI to fetch rows one at a time and write them out, that's doable,
> but it would be nice if psql could be made to "just work" without
> the monster RSS.

    It wouldn't make a difference unless that driver implements the
underlying protocol on its own.

--
SPY                      My girlfriend asked me which one I like better.
pub  1024/3CAE01D5 1994/11/03 Dustin Sallings <dustin@spy.net>
|    Key fingerprint =  87 02 57 08 02 D0 DA D6  C8 0F 3E 65 51 98 D8 BE
L_______________________ I hope the answer won't upset her. ____________

Re: psql large RSS (1.6GB)

From

Markus Bertheau

Date:

31 October 2004, 10:27:14

В Срд, 27.10.2004, в 09:57, TTK Ciar пишет:

> brad=# explain analyse select
ServerDisks.servername,ServerDisks.diskserial,ServerDisks.diskmountpoint,DiskFiles.name,DiskFiles.md5from
DiskFiles,ServerDiskswhere DiskFiles.diskserial=ServerDisks.diskserial; 
>                           QUERY PLAN
> ------------------------------------------------------------------
>  Hash Join  (cost=22.50..65.00 rows=1000 width=274) (actual time=118.584..124653.729 rows=10133349 loops=1)
>    Hash Cond: (("outer".diskserial)::text = ("inner".diskserial)::text)
>    ->  Seq Scan on diskfiles  (cost=0.00..20.00 rows=1000 width=198) (actual time=7.201..31336.063 rows=10133349
loops=1)
>    ->  Hash  (cost=20.00..20.00 rows=1000 width=158) (actual time=90.821..90.821 rows=0 loops=1)
>          ->  Seq Scan on serverdisks  (cost=0.00..20.00 rows=1000 width=158) (actual time=9.985..87.364 rows=2280
loops=1)
>  Total runtime: 130944.586 ms

You should run ANALYZE on your database once in a while.

--
Markus Bertheau <twanger@bluetwanger.de>

Re: psql large RSS (1.6GB)

From

Gavin Sherry

Date:

01 November 2004, 13:46:34

On Sat, 30 Oct 2004, Dustin Sallings wrote:

> >   If the solution is to just write a little client that uses perl
> > DBI to fetch rows one at a time and write them out, that's doable,
> > but it would be nice if psql could be made to "just work" without
> > the monster RSS.
>
>     It wouldn't make a difference unless that driver implements the
> underlying protocol on its own.

Even though we can tell people to make use of cursors, it seems that
memory usage for large result sets should be addressed. A quick search of
the archives does not reveal any discussion about having libpq spill to
disk if a result set reaches some threshold. Has this been canvassed in
the past?

Thanks,

Gavin

Re: psql large RSS (1.6GB)

From

Bruce Momjian

Date:

01 November 2004, 14:05:33

Gavin Sherry wrote:
> On Sat, 30 Oct 2004, Dustin Sallings wrote:
>
> > >   If the solution is to just write a little client that uses perl
> > > DBI to fetch rows one at a time and write them out, that's doable,
> > > but it would be nice if psql could be made to "just work" without
> > > the monster RSS.
> >
> >     It wouldn't make a difference unless that driver implements the
> > underlying protocol on its own.
>
> Even though we can tell people to make use of cursors, it seems that
> memory usage for large result sets should be addressed. A quick search of
> the archives does not reveal any discussion about having libpq spill to
> disk if a result set reaches some threshold. Has this been canvassed in
> the past?

No, I don't remember hearing this discussed and I don't think most
people would want libpq spilling to disk by default.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: psql large RSS (1.6GB)

From

Tom Lane

Date:

01 November 2004, 16:04:07

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> No, I don't remember hearing this discussed and I don't think most
> people would want libpq spilling to disk by default.

Far more useful would be some sort of streaming API to let the
application process the rows as they arrive, or at least fetch the rows
in small batches (the V3 protocol supports the latter even without any
explicit use of a cursor).  I'm not sure if this can be bolted onto the
existing libpq framework reasonably, but that's the direction I'd prefer
to go in.

            regards, tom lane

Re: psql large RSS (1.6GB)

From

Josh Berkus

Date:

01 November 2004, 18:50:10

Tom,

> Far more useful would be some sort of streaming API to let the
> application process the rows as they arrive, or at least fetch the rows
> in small batches (the V3 protocol supports the latter even without any
> explicit use of a cursor).  I'm not sure if this can be bolted onto the
> existing libpq framework reasonably, but that's the direction I'd prefer
> to go in.

I think that TelegraphCQ incorporates this.   However, I'm not sure whether
it's a portable component; it may be too tied in to their streaming query
engine.   They have talked about porting their "background query" patch for
PSQL, though ...

--
Josh Berkus
Aglio Database Solutions
San Francisco