Thread: two memory-consuming postgres processes

two memory-consuming postgres processes

From: Alexy Khrabrov

Greetings -- I have an UPDATE query updating a 100 million row table,
and allocate enough memory via shared_buffers=1500MB.  However, I see
two processes in top, the UPDATE process eating about 850 MB and the
writer process eating about 750 MB.  The box starts paging.   Why is
there the writer taking almost as much space as the UPDATE, and how
can I shrink it?

Cheers,
Alexy

Re: two memory-consuming postgres processes

From: "Scott Marlowe"

On Fri, May 2, 2008 at 1:24 PM, Alexy Khrabrov <deliverable@gmail.com> wrote:
> Greetings -- I have an UPDATE query updating a 100 million row table, and
> allocate enough memory via shared_buffers=1500MB.  However, I see two
> processes in top, the UPDATE process eating about 850 MB and the writer
> process eating about 750 MB.  The box starts paging.   Why is there the
> writer taking almost as much space as the UPDATE, and how can I shrink it?

Shared_buffers is NOT the main memory pool for all operations in
pgsql, it is simply the buffer pool used to hold data being operated
on.

Things like sorts etc. use other memory and can exhaust your machine.
However, I'd like to see the output of vmstat 1 or top while this is
happening.

How much memory does this machine have?

Re: two memory-consuming postgres processes

From: Alexy Khrabrov

On May 2, 2008, at 12:30 PM, Scott Marlowe wrote:

> On Fri, May 2, 2008 at 1:24 PM, Alexy Khrabrov
> <deliverable@gmail.com> wrote:
>> Greetings -- I have an UPDATE query updating a 100 million row
>> table, and
>> allocate enough memory via shared_buffers=1500MB.  However, I see two
>> processes in top, the UPDATE process eating about 850 MB and the
>> writer
>> process eating about 750 MB.  The box starts paging.   Why is there
>> the
>> writer taking almost as much space as the UPDATE, and how can I
>> shrink it?
>
> Shared_buffers is NOT the main memory pool for all operations in
> pgsql, it is simply the buffer pool used to hold data being operated
> on.
>
> Things like sorts etc. use other memory and can exhaust your machine.
> However, I'd like to see the output of vmstat 1 or top while this is
> happening.
>
> How much memory does this machine have?

It's a 2GB RAM MacBook.  Here's the top for postgres

Processes:  117 total, 2 running, 6 stuck, 109 sleeping... 459 threads          12:34:27
Load Avg:  0.27,  0.24,  0.32    CPU usage:  8.41% user, 11.06% sys, 80.53% idle
SharedLibs: num =   15, resident =   40M code, 2172K data, 3172K linkedit.
MemRegions: num = 20719, resident =  265M +   12M private, 1054M shared.
PhysMem:  354M wired, 1117M active,  551M inactive, 2022M used,   19M free.
VM: 26G + 373M   1176145(160) pageins, 1446482(2) pageouts

  PID COMMAND      %CPU   TIME   #TH #PRTS #MREGS RPRVT  RSHRD  RSIZE  VSIZE
51775 postgres     6.8%  2:40.16   1     9     39 1504K   896M   859M+  1562M
51767 postgres     0.0%  0:39.74   1     8     28  752K   896M   752M   1560M

the first is the UPDATE, the second is the writer.

The query is very simple,

netflix=> create index movs_mid_idx on movs(mid);
CREATE INDEX
netflix=> update ratings set offset1=avg-rating from movs where
mid=movie_id;

where the table ratings has about 100 million rows, movs has about
20,000.

I randomly increased values in postgresql.conf to

shared_buffers = 1500MB
max_fsm_pages = 2000000
max_fsm_relations = 10000

Should I set the background writer parameters somehow to decrease the
RAM consumed by the writer?

Cheers,
Alexy

Re: two memory-consuming postgres processes

From: "Scott Marlowe"

On Fri, May 2, 2008 at 1:38 PM, Alexy Khrabrov <deliverable@gmail.com> wrote:
>
>
>  On May 2, 2008, at 12:30 PM, Scott Marlowe wrote:
>
>
> > On Fri, May 2, 2008 at 1:24 PM, Alexy Khrabrov <deliverable@gmail.com>
> wrote:
> >
> > > Greetings -- I have an UPDATE query updating a 100 million row table,
> and
> > > allocate enough memory via shared_buffers=1500MB.  However, I see two
> > > processes in top, the UPDATE process eating about 850 MB and the writer
> > > process eating about 750 MB.  The box starts paging.   Why is there the
> > > writer taking almost as much space as the UPDATE, and how can I shrink
> it?
> > >
> >
> > Shared_buffers is NOT the main memory pool for all operations in
> > pgsql, it is simply the buffer pool used to hold data being operated
> > on.
> >
> > Things like sorts etc. use other memory and can exhaust your machine.
> > However, I'd like to see the output of vmstat 1 or top while this is
> > happening.
> >
> > How much memory does this machine have?
> >
>
>  It's a 2GB RAM MacBook.  Here's the top for postgres
>
>  Processes:  117 total, 2 running, 6 stuck, 109 sleeping... 459 threads          12:34:27
>  Load Avg:  0.27,  0.24,  0.32    CPU usage:  8.41% user, 11.06% sys, 80.53% idle
>  SharedLibs: num =   15, resident =   40M code, 2172K data, 3172K linkedit.
>  MemRegions: num = 20719, resident =  265M +   12M private, 1054M shared.
>  PhysMem:  354M wired, 1117M active,  551M inactive, 2022M used,   19M free.
>  VM: 26G + 373M   1176145(160) pageins, 1446482(2) pageouts
>
>    PID COMMAND      %CPU   TIME   #TH #PRTS #MREGS RPRVT  RSHRD  RSIZE  VSIZE
>  51775 postgres     6.8%  2:40.16   1     9     39 1504K   896M   859M+  1562M
>  51767 postgres     0.0%  0:39.74   1     8     28  752K   896M   752M   1560M

SOME snipping here.

>  I randomly increased values in postgresql.conf to
>
>  shared_buffers = 1500MB
>  max_fsm_pages = 2000000
>  max_fsm_relations = 10000

On a laptop with 2G ram, 1.5Gig shared buffers is probably WAY too high.

>  Should I set the background writer parameters somehow to decrease the RAM
> consumed by the writer?

No, the background writer reads through the shared buffers for dirty
ones and writes them out.  so, it's not really using MORE memory, it's
just showing that it's attached to the ginormous shared_buffer pool
you've set up.

Lower your shared_buffers to about 512M or so and see how it works.
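
In postgresql.conf terms that would be something along these lines --
512MB here is only a starting point to experiment with on a 2GB machine,
not a tuned value:

shared_buffers = 512MB        # was 1500MB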

Re: two memory-consuming postgres processes

From: Tom Lane

"Scott Marlowe" <scott.marlowe@gmail.com> writes:
> On Fri, May 2, 2008 at 1:38 PM, Alexy Khrabrov <deliverable@gmail.com> wrote:
>> I randomly increased values in postgresql.conf to
>>
>> shared_buffers = 1500MB
>> max_fsm_pages = 2000000
>> max_fsm_relations = 10000

> On a laptop with 2G ram, 1.5Gig shared buffers is probably WAY too high.

s/probably/definitely/, especially seeing that OS X is a bit of a memory
hog itself.  I don't think you should figure on more than 1GB being
usefully available to Postgres, and you can't give all or even most of
that space to shared_buffers.

> No, the background writer reads through the shared buffers for dirty
> ones and writes them out.  so, it's not really using MORE memory, it's
> just showing that it's attached to the ginormous shared_buffer pool
> you've set up.

Yeah.  You have to be aware of top's quirky behavior for shared memory:
on most platforms it will count the shared memory against *each*
process, but only as much of the shared memory as that process has
touched so far.  So over time the reported size of any PG process will
tend to climb to something over the shared memory size, but most of that
isn't "real".

I haven't directly checked whether OS X's top behaves that way, but
given your report I think it does.

            regards, tom lane

Re: two memory-consuming postgres processes

From: Greg Smith

On Fri, 2 May 2008, Alexy Khrabrov wrote:

> I have an UPDATE query updating a 100 million row table, and
> allocate enough memory via shared_buffers=1500MB.

In addition to reducing that as you've been advised, you'll probably need
to increase checkpoint_segments significantly from the default (3) in
order to get good performance on an update that large.  Something like 30
would be a reasonable starting point.
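
In postgresql.conf that might look like the following; 30 is just the
starting point suggested above, and it can be raised further if the
server log keeps warning about checkpoints occurring too frequently:

checkpoint_segments = 30      # default is 3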

I'd suggest doing those two things, seeing how things go, and reporting
back if you still think performance is unacceptable.  We'd need to know
your PostgreSQL version in order to really target future suggestions.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: two memory-consuming postgres processes

From: Alexy Khrabrov

On May 2, 2008, at 1:13 PM, Tom Lane wrote:
> I don't think you should figure on more than 1GB being
> usefully available to Postgres, and you can't give all or even most of
> that space to shared_buffers.


So how should I divide say a 512 MB between shared_buffers and, um,
what else?  (new to pg tuning :)

I naively thought that if I have a 100,000,000 row table, of the form
(integer,integer,smallint,date), and add a real column to it, it will
scroll through the memory reasonably fast.  Yet when I had
shared_buffers=128 MB, it was hanging there 8 hours before I killed
it, and now with 1500MB is paging again for several hours with no end
in sight.  Why can't it just add a column to a row at a time and be
done with it soon enough? :)  It takes inordinately long compared to a
FORTRAN or even python program and there's no index usage for this
table, a sequential scan, why all the paging?

Cheers,
Alexy

Re: two memory-consuming postgres processes

From: Alexy Khrabrov

On May 2, 2008, at 1:22 PM, Greg Smith wrote:

> On Fri, 2 May 2008, Alexy Khrabrov wrote:
>
>> I have an UPDATE query updating a 100 million row table, and
>> allocate enough memory via shared_buffers=1500MB.
>
> In addition to reducing that as you've been advised, you'll probably
> need to increase checkpoint_segments significantly from the default
> (3) in order to get good performance on an update that large.
> Something like 30 would be a reasonable starting point.
>
> I'd suggest doing those two things, seeing how things go, and
> reporting back if you still think performance is unacceptable.  We'd
> need to know your PostgreSQL version in order to really target
> future suggestions.

PostgreSQL 8.3.1, compiled from source on Mac OSX 10.5.2 (Leopard).
Saw the checkpoint_segments warning every ~20sec and increased it to
100 already.  Will see what 512 MB buys me, but 128 MB was paging
miserably.

Cheers,
Alexy

Re: two memory-consuming postgres processes

From: Alexy Khrabrov

Interestingly, after shutting down the server with
shared_buffers=1500MB in the middle of that UPDATE, I see this:

bash-3.2$ /opt/bin/pg_ctl -D /data/pgsql/ stop
waiting for server to shut down....LOG:  received smart shutdown request
LOG:  autovacuum launcher shutting down
........................................................... failed
pg_ctl: server does not shut down
bash-3.2$ /opt/bin/pg_ctl -D /data/pgsql/ stop
waiting for server to shut down..........................................................LOG:  shutting down
LOG:  database system is shut down
  done
server stopped

-- had to do it twice, the box was paging for a minute or two.

Should I do something about autovacuum, e.g. turn it off completely?  I
thought it wasn't on, since all of it was still commented out in
postgresql.conf as shipped -- I only tweaked the few numbers reported
before.

Cheers,
Alexy


Re: two memory-consuming postgres processes

From: "Scott Marlowe"

On Fri, May 2, 2008 at 2:26 PM, Alexy Khrabrov <deliverable@gmail.com> wrote:
>
>  So how should I divide say a 512 MB between shared_buffers and, um, what
> else?  (new to pg tuning :)

Don't worry so much about the rest of the settings.  Maybe increase
sort_mem (aka work_mem) to something like 16M or so.  That's about it.
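
For example, either in postgresql.conf or just in the session that runs
the big statement (16MB being the ballpark figure above, not a tuned
value):

SET work_mem = '16MB';    -- per-session; or put work_mem = 16MB in postgresql.conf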

>  I naively thought that if I have a 100,000,000 row table, of the form
> (integer,integer,smallint,date), and add a real column to it, it will scroll
> through the memory reasonably fast.

This is a database.  It makes changes on disk in such a way that they
won't be lost should power be cut off.  If you're just gonna be batch
processing data that it's ok to lose halfway through, then python /
perl / php etc might be a better choice.

>  Yet when I had shared_buffers=128 MB,
> it was hanging there 8 hours before I killed it, and now with 1500MB is
> paging again for several hours with no end in sight.

You went from kinda small to WAY too big.  512M should be a happy medium.

>  Why can't it just add
> a column to a row at a time and be done with it soon enough? :)

Adding a column is instantaneous.  Populating it is not.

> It takes
> inordinately long compared to a FORTRAN or even python program and there's
> no index usage for this table, a sequential scan, why all the paging?

Again, a database protects your data from getting scrambled should the
program updating it quit halfway through etc...

Have you been vacuuming between these update attempts?  Each one has
created millions of dead rows and bloated your data store.  vacuum
full / cluster / reindex may be needed.
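
For instance, on the big table from the original post -- the index name
after USING is made up here, CLUSTER just needs some existing index to
reorder the table by:

VACUUM VERBOSE ratings;                    -- marks dead rows as reusable
-- if the table is badly bloated and there is spare disk space for a copy:
CLUSTER ratings USING ratings_movie_idx;
-- or, without the extra disk space:
VACUUM FULL ratings;
REINDEX TABLE ratings;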

Re: two memory-consuming postgres processes

From: Alexy Khrabrov

On May 2, 2008, at 1:40 PM, Scott Marlowe wrote:
> Again, a database protects your data from getting scrambled should the
> program updating it quit halfway through etc...

Right -- but this is data mining work: I add a derived column to a
row, and it's computed from that very row and a small second table
which should fit in RAM.

> Have you been vacuuming between these update attempts?  Each one has
> created millions of dead rows and bloated your data store.  vacuum
> full / cluster / reindex may be needed.

I've read postgresql.conf more carefully and see that autovacuum = on is
commented out, i.e. left at its default, so it's on.  That explains why
shutting down took so long -- it had to shut autovacuum down too.

Basically, the derived data is not critical at all -- so (1) can I turn
off transactional behavior for an UPDATE, and (2) should I care about
vacuuming being done on the fly when saving RAM, or need I defer it/
manage it manually?

I wonder what MySQL would do here on MyISAM tables without
transactional behavior -- perhaps this is the case more suitable for
them?

Cheers,
Alexy

Re: two memory-consuming postgres processes

From: Craig James

On Fri, May 2, 2008 at 2:26 PM, Alexy Khrabrov <deliverable@gmail.com> wrote:
>  I naively thought that if I have a 100,000,000 row table, of the form
> (integer,integer,smallint,date), and add a real column to it, it will scroll
> through the memory reasonably fast.

In Postgres, an update is the same as a delete/insert.  That means that
changing the data in one column rewrites ALL of the columns for that row,
and you end up with a table that's 50% dead space, which you then have to
vacuum.

Sometimes if you have a "volatile" column that goes with several "static"
columns, you're far better off to create a second table for the volatile
data, duplicating the primary key in both tables.  In your case, it would
mean the difference between 10^8 inserts of (int, float), very fast,
compared to what you're doing now, which is 10^8 inserts and 10^8 deletes
of (int, int, smallint, date, float), followed by a big vacuum/analyze
(also slow).

The down side of this design is that later on, it requires a join to fetch all the data for each key.
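
A sketch of that layout, with hypothetical names (ratings_pk stands in
for whatever the big table's primary key actually is; the derived value
is the offset1 from the earlier UPDATE):

CREATE TABLE ratings_offset (
    ratings_pk  integer PRIMARY KEY,   -- hypothetical: duplicates the key of ratings
    offset1     real
);

INSERT INTO ratings_offset
    SELECT r.ratings_pk, m.avg - r.rating
    FROM ratings r
    JOIN movs m ON m.mid = r.movie_id;

-- fetching everything back later requires the join mentioned above:
SELECT r.*, o.offset1
FROM ratings r
JOIN ratings_offset o ON o.ratings_pk = r.ratings_pk;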

You do have a primary key on your data, right?  Or some sort of index?

Craig

Re: two memory-consuming postgres processes

From: Alexy Khrabrov

On May 2, 2008, at 2:02 PM, Craig James wrote:

> On Fri, May 2, 2008 at 2:26 PM, Alexy Khrabrov
> <deliverable@gmail.com> wrote:
>> I naively thought that if I have a 100,000,000 row table, of the form
>> (integer,integer,smallint,date), and add a real column to it, it
>> will scroll
>> through the memory reasonably fast.
>
> In Postgres, an update is the same as a delete/insert.  That means
> that changing the data in one column rewrites ALL of the columns for
> that row, and you end up with a table that's 50% dead space, which
> you then have to vacuum.
>
> Sometimes if you have a "volatile" column that goes with several
> "static" columns, you're far better off to create a second table for
> the volatile data, duplicating the primary key in both tables.  In
> your case, it would mean the difference between 10^8 inserts of
> (int, float), very fast, compared to what you're doing now, which is
> 10^8 insert and 10^8 deletes of (int, int, smallint, date, float),
> followed by a big vacuum/analyze (also slow).
>
> The down side of this design is that later on, it requires a join to
> fetch all the data for each key.
>
> You do have a primary key on your data, right?  Or some sort of index?

I created several indices for the primary table, yes.  Sure I can do a
table for a volatile column, but then I'll have to create a new such
table for each derived column -- that's why I tried to add a column to
the existing table.  Yet seeing this is really slow, and I need to do
many derived analyses like this -- which are later scanned in other
computations, so should persist -- I indeed see no other way but to
procreate derived tables with the same key, one column per each...

Cheers,
Alexy

Re: two memory-consuming postgres processes

From: Greg Smith

On Fri, 2 May 2008, Alexy Khrabrov wrote:

> I created several indices for the primary table, yes.

That may be part of your problem.  All of the indexes are being
updated along with the main data in the row each time you touch a record.
There's some optimization there in 8.3 but it doesn't make index overhead
go away completely.  As mentioned already, the optimal solution to
problems in this area is to adjust table normalization as much as feasible
to limit what you're updating.

> Basically, the derived data is not critical at all -- so (1) can I turn
> off transactional behavior for an UPDATE,

What you can do is defer transaction commits to only happen periodically
rather than all the time by turning off synchronous_commit and increasing
wal_writer_delay; see
http://www.postgresql.com.cn/docs/8.3/static/wal-async-commit.html
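
For example, in postgresql.conf (illustrative values only; the linked
page explains the durability trade-off):

synchronous_commit = off
wal_writer_delay = 1000ms     # default is 200ms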

> (2) should I care about vacuuming being done on the fly when saving RAM,
> or need I defer it/manage it manually?

It's hard to speculate from here about what optimal vacuum behavior will
be.  You might find it more efficient to turn autovacuum off when doing
these large updates.  The flip side is that you'll be guaranteed to end up
with more dead rows in the table and that has its own impact later.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: two memory-consuming postgres processes

From: PFC

> I created several indices for the primary table, yes.  Sure I can do a
> table for a volatile column, but then I'll have to create a new such
> table for each derived column -- that's why I tried to add a column to
> the existing table.  Yet seeing this is really slow, and I need to do
> many derived analyses like this -- which are later scanned in other
> computations, so should persist -- I indeed see no other way but to
> procreate derived tables with the same key, one column per each...

    OK, so in that case, do all of your derived column
calculations in one query, like this:

CREATE TABLE derived AS SELECT ... FROM ... (perform all your derived
calculations here)

    or:

BEGIN;    -- important: creating and loading the table in the same transaction avoids writing xlog
CREATE TABLE derived (...);    -- empty table with the same columns as the SELECT below
INSERT INTO derived SELECT ... FROM ... (perform all your derived
calculations here)
COMMIT;

    Basically, updating the entire table several times to add a few simple
columns is a bad idea. If you can compute all the data you need in one
query, like above, it will be much faster. Especially if you join one
large table to several smaller ones, and as long as the huge data set
doesn't need to be sorted (check the query plan using EXPLAIN). Try to do
as much as possible in one query to scan the large dataset only once.
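
With the tables from the original UPDATE, that first form might look
roughly like this (column names are taken from the query posted earlier,
the rest is illustrative):

CREATE TABLE ratings_derived AS
    SELECT r.movie_id, r.rating, m.avg - r.rating AS offset1
    FROM ratings r
    JOIN movs m ON m.mid = r.movie_id;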

    Note that the above will be faster than updating the entire table since
it needs to write much less data: it doesn't need to delete the old rows,
and it doesn't need to write the transaction log, since if the transaction
rolls back, the table never existed anyway. Also since your newly created
table doesn't have any indexes, they won't need to be updated.

    If you really need to update an entire table multiple times, you will
need to :

    - Use hardware that can handle disk writes at a decent speed (that isn't
a characteristic of a laptop drive)
    - Use MyISAM, yes (but if you need to make complex queries on the data
afterwards, it could suck).


Re: two memory-consuming postgres processes

From: Alexy Khrabrov

On May 2, 2008, at 2:23 PM, Greg Smith wrote:

> On Fri, 2 May 2008, Alexy Khrabrov wrote:
>
>> I created several indices for the primary table, yes.
>
> That may be part of your problem.  All of the indexes are being
> updated along with the main data in the row each time you touch a
> record. There's some optimization there in 8.3 but it doesn't make
> index overhead go away completely.  As mentioned already, the
> optimal solution to problems in this area is to adjust table
> normalization as much as feasible to limit what you're updating.

Was wondering about it, too -- intuitively I'd like to say, "stop all
indexing" until the column is added, then say "reindex" -- is it
doable?  Or would it take longer anyway?  Since I don't index on that
new column, I'd assume my old indices would do -- do they change
because of rows deletions/insertions, with the effective new rows
addresses?

Cheers,
Alexy

Re: two memory-consuming postgres processes

From: "Kevin Grittner"

>>> Alexy Khrabrov wrote:

> Since I don't index on that
> new column, I'd assume my old indices would do -- do they change
> because of rows deletions/insertions, with the effective new rows
> addresses?

Every update is a delete and insert.  The new version of the row must
be added to the index.  Every access through the index then has to
look at both versions of the row to see which one is "current" for its
transaction.  Vacuum will make the space used by the dead rows
available for reuse, as well as removing the old index entries and
making that space available for new index entries.

-Kevin



Re: two memory-consuming postgres processes

From: "Kevin Grittner"

>>> Alexy Khrabrov wrote:

> OK.  I've cancelled all previous attempts at UPDATE and will now
> create some derived tables.  See no changes in the previous huge table
> -- the added column was completely empty.  Dropped it.  Should I
> vacuum just in case, or am I guaranteed not to have any extra rows
> since no UPDATE actually went through and none are showing?

The canceled attempts would have left dead space.  If you have
autovacuum running, it probably made the space available for reuse,
but depending on exactly how you got to where you are, you may have
bloat.  Personally, I would do a VACUUM ANALYZE VERBOSE and capture
the output.  If bloat is too bad, you may want to CLUSTER the table
(if you have the free disk space for a temporary extra copy of the
table) or VACUUM FULL followed by REINDEX (if you don't have that much
free disk space).
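
In command form, with ratings being the table in question:

VACUUM VERBOSE ANALYZE ratings;

That prints, per table and per index, the counts of removable and
nonremovable row versions and the number of pages, which is the bloat
information referred to above.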

Let us know if you need help interpreting the VERBOSE output.

-Kevin



Re: two memory-consuming postgres processes

From: Alexy Khrabrov

On May 2, 2008, at 2:43 PM, Kevin Grittner wrote:

>>>> Alexy Khrabrov wrote:
>
>> Since I don't index on that
>> new column, I'd assume my old indices would do -- do they change
>> because of rows deletions/insertions, with the effective new rows
>> addresses?
>
> Every update is a delete and insert.  The new version of the row must
> be added to the index.  Every access through the index then has to
> look at both versions of the row to see which one is "current" for its
> transaction.  Vacuum will make the space used by the dead rows
> available for reuse, as well as removing the old index entries and
> making that space available for new index entries.

OK.  I've cancelled all previous attempts at UPDATE and will now
create some derived tables.  See no changes in the previous huge table
-- the added column was completely empty.  Dropped it.  Should I
vacuum just in case, or am I guaranteed not to have any extra rows
since no UPDATE actually went through and none are showing?

Cheers,
Alexy

Re: two memory-consuming postgres processes

From: Matthew Wakeling

On Fri, 2 May 2008, PFC wrote:
> CREATE TABLE derived AS SELECT ... FROM ... (perform all your derived
> calculations here)

Given what you have said (that you really want all the data in one table)
it may be best to proceed like this:

First, take your original table, create an index on the primary key field,
and CLUSTER on that index.

CREATE TABLE derived AS SELECT ... FROM ... ORDER BY primary key field
CREATE INDEX derived_pk ON derived(primary key field)

Repeat those last two commands ad nauseam.

Then, when you want a final full table, run:

CREATE TABLE new_original AS SELECT * FROM original, derived, derived2,
       ... WHERE original.pk = derived.pk ...

That should be a merge join, which should run really quickly, and you can
then create all the indexes you want on it.
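
Filled in with names from this thread (ratings_pk is again a hypothetical
primary key column for the big table, and derived/derived2 stand for
whatever analyses get run):

CREATE INDEX ratings_pk_idx ON ratings (ratings_pk);
CLUSTER ratings USING ratings_pk_idx;

CREATE TABLE derived AS
    SELECT r.ratings_pk, m.avg - r.rating AS offset1
    FROM ratings r JOIN movs m ON m.mid = r.movie_id
    ORDER BY r.ratings_pk;
CREATE INDEX derived_pk ON derived (ratings_pk);

-- ...repeat for derived2, derived3, ... then:

CREATE TABLE new_ratings AS
    SELECT r.*, d.offset1
    FROM ratings r, derived d
    WHERE r.ratings_pk = d.ratings_pk;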

Matthew

--
When I first started working with sendmail, I was convinced that the cf
file had been created by someone bashing their head on the keyboard. After
a week, I realised this was, indeed, almost certainly the case.
        -- Unknown

Re: two memory-consuming postgres processes

From: "Merlin Moncure"

On Fri, May 2, 2008 at 4:51 PM, Alexy Khrabrov <deliverable@gmail.com> wrote:
>
>  On May 2, 2008, at 1:40 PM, Scott Marlowe wrote:
>
> > Again, a database protects your data from getting scrambled should the
> > program updating it quit halfway through etc...
> >
>
>  Right -- but this is data mining work: I add a derived column to a row,
> and it's computed from that very row and a small second table which should
> fit in RAM.

Full table update of a single field is one of the worst possible
operations with PostgreSQL.  MySQL is better at this because its lack of
proper transactions and full table locking allow the rows to be
(mostly) updated in place.  Ideally, you should be leveraging the
power of PostgreSQL so that you can avoid the full table update if
possible.  Maybe if you step back and think about the problem you may
be able to come up with a solution that is more efficient.

Also, if you must do it this way, (as others suggest), do CREATE TABLE
new_table AS SELECT...., then create keys, and drop the old table when
done.  This is much faster than update.
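
A rough outline of that, with illustrative names and the same derived
calculation used earlier in the thread:

CREATE TABLE ratings_new AS
    SELECT r.*, m.avg - r.rating AS offset1
    FROM ratings r JOIN movs m ON m.mid = r.movie_id;

CREATE INDEX ratings_new_movie_idx ON ratings_new (movie_id);
-- add any other needed indexes, then:
DROP TABLE ratings;
ALTER TABLE ratings_new RENAME TO ratings;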

merlin