Thread: PG writes a lot to the disk
I have a big PG server dedicated to serving only SELECT queries. The
database is updated permanently using Slony.

The server has 8 Xeon cores running at 3GHz, 24GB of RAM, and the
following disk arrays:
- one RAID1 serving the OS and the pg_xlog
- one RAID5 serving the database and the tables (base directory)
- one RAID5 serving the indexes (the indexes are in a separate tablespace)

This server can't take any load at all: it writes too much. When I plug
it into a client (sending 20 transactions/s) it works fine for about 10
minutes, then starts to write a lot in the pgdata/base directory (where
the database files are, not the indexes). It writes so much (3MB/s,
randomly) that it can't serve the queries anymore; the load is huge.

To locate the problem, I stopped Slony (so no more updates), mounted the
database and index partitions with the sync option (no FS write cache),
and the problem happened even faster, about 2 minutes after plugging the
client (and its queries) in. I can reproduce the problem at will.

I checked whether any file sizes were increasing a lot, and found
nothing beyond the usual DB growth (the DB is constantly updated by
Slony).

What is it writing so much in the base directory? If it's some
temporary table or anything like that, how can I locate it so I can fix
the problem?

Here's the PG memory configuration:

max_connections = 128
shared_buffers = 2GB
temp_buffers = 8MB
work_mem = 96MB
maintenance_work_mem = 4GB
max_stack_depth = 7MB
default_statistics_target = 100
effective_cache_size = 20GB

Thanks a lot for your advice!

--
Laurent Raufaste
<http://www.glop.org/>
In response to "Laurent Raufaste" <analogue@glop.org>: > I have a big PG server dedicated to serve only SELECT queries. > The database is updated permanently using Slony. > > The server has 8 Xeon cores running at 3Ghz, 24GB of RAM and the > following disk arrays: > - one RAID1 serving the OS and the pg_xlog > - one RAID5 serving the database and the tables (base directory) > - one RAID5 serving the indexes (indexes have an alternate tablespace) > > This server can't take anything, it writes too much. > > When I try to plug it to a client (sending 20 > transactions/s) it works fine for like 10 minutes, then start to write > a lot in the pgdata/base directory (where the database files are, not > the index). > > It writes so much (3MB/s randomly) that it can't serve the queries anymore, the > load is huge. > > In order to locate the problem, I stopped Slony (no updates anymore), > mounted the database and index partitions with the sync option (no FS > write cache), and the problem happens faster, like 2 minutes after > having plugged the client (and the queries) to it. > I can reproduce the problem at will. > > I tried to see if some file size were increasing a lot, and found > nothing more than the usual DB increase (DB is constantly updated by > Slony). > > What does it writes so much in the base directory ? If it's some > temporary table or anything, how can I locate it so I can fix the > problem ? My guess (based on the information you provided) is that it's temporary sort file usage. If you're using 8.3 there's a config option to log each time a sort file is required. Anything earlier than 8.3 and you'll have to rely on your OS tools to track it down. However, what makes you so sure it's write activity? I see no evidence attached to this email (iostat or similar output) so I'm wondering if it's actually read activity. Check your log levels, if you turn up PG's logging all the way, it generates a LOT of write activity ... more than you might imagine under some loads. Get rid of the RAID 5. RAID 5 sucks. Have you tried running bonnie++ or similar to see if it's not just a really crappy RAID 5 controller? > Here's the PG memory configuration: > max_connections = 128 > shared_buffers = 2GB Have you tuned this based on experience? Current best practices would recommend that you start with ~6G (1/4 RAM) and tune up/down as experience with your workload dictates. > temp_buffers = 8MB > work_mem = 96MB Considering you've got 24G of RAM, you might want to try bumping this and see if it helps without pushing the system into swap. If the problem is sort file usage, this is the option to tune it. > maintenance_work_mem = 4GB I doubt it's hurting anything, but I don't think a value this high will actually be used. > max_stack_depth = 7MB > default_statistics_target = 100 > effective_cache_size = 20GB -- Bill Moran Collaborative Fusion Inc. http://people.collaborativefusion.com/~wmoran/ wmoran@collaborativefusion.com Phone: 412-422-3463x4023 **************************************************************** IMPORTANT: This message contains confidential information and is intended only for the individual named. If the reader of this message is not an intended recipient (or the individual responsible for the delivery of this message to an intended recipient), please be advised that any re-use, dissemination, distribution or copying of this message is prohibited. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. 
On Wed, 19 Mar 2008, Laurent Raufaste wrote:

> When I plug it into a client (sending 20 transactions/s) it works
> fine for about 10 minutes, then starts to write a lot in the
> pgdata/base directory (where the database files are, not the
> indexes). It writes so much (3MB/s, randomly) that it can't serve
> the queries anymore; the load is huge.

You didn't mention adjusting any of the checkpoint parameters, and you
also didn't say what version of PostgreSQL you're running. If you've
got frozen sections approximately every 5 minutes, you should figure
out if they line up with the checkpoints on your system. How you do
that varies depending on the version; I've covered most of what you
need to get started at:

http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

It's also worth noting that RAID5 is known to be awful for write
performance with some disk controllers. You didn't mention which
controller you have. You should measure your disks to be sure they're
performing well at all; it's possible you're getting performance that's
barely better than a single disk.

http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm
has some ideas on how to do that.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
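A sketch of the checkpoint settings worth checking, assuming the server
turns out to be 8.3 (log_checkpoints and checkpoint_completion_target
do not exist before 8.3):

    # in postgresql.conf
    log_checkpoints = on                 # log when each checkpoint starts and finishes
    checkpoint_segments = 16             # the default of 3 forces very frequent checkpoints
    checkpoint_completion_target = 0.9   # spread checkpoint writes across the interval

If the 3MB/s write bursts line up with the timestamps these settings
produce in the server log, checkpoints are the culprit.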
Laurent Raufaste wrote:

> I have a big PG server dedicated to serving only SELECT queries.
> The database is updated permanently using Slony.
>
> [...] the load is huge.
>
> [...]
>
> What is it writing so much in the base directory? If it's some
> temporary table or anything like that, how can I locate it so I can
> fix the problem?

It could be a temporary file, although your work_mem setting is already
quite large.

Can you attach to the rogue backend with "strace" (if you're on Linux;
otherwise something similar) to see what it does, and use "lsof" to see
which files it has open?

Yours,
Laurenz Albe
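A sketch of that approach on Linux; the PID is hypothetical, and on
releases of this era the PID column in pg_stat_activity is called
procpid:

    -- in psql: find the backend responsible for the writes
    SELECT procpid, current_query FROM pg_stat_activity;

    # then, as root on the server, with 12345 standing in for that PID:
    strace -p 12345 -e trace=read,write,open    # watch its system calls
    lsof -p 12345                               # list the files it has open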
2008/3/19, Laurent Raufaste <analogue@glop.org>:

> What is it writing so much in the base directory? If it's some
> temporary table or anything like that, how can I locate it so I can
> fix the problem?

Thanks for your help, everybody! I fixed the problem by running ANALYZE
on every table (yes, I'm that much of a noob ;) ).

The problem was that the optimizer didn't know how to run the queries
well and touched millions of tuples for simple queries. For each tuple
it used, it was updating some bits in the table file, resulting in huge
write activity on that file.

After the ANALYZE, the optimizer worked smarter, touched a thousand
times fewer tuples per query, and PG no longer had to update so many
bits in the table files.

The server is now OK. Thanks!

--
Laurent Raufaste
<http://www.glop.org/>
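For anyone landing here with the same symptoms, the fix boils down to
the following sketch. The _comment/_article tables and the label path
come from Laurent's test later in the thread (the path type is assumed
to be contrib ltree); these are not his exact commands:

    -- gather planner statistics for every table in the current database
    ANALYZE;

    -- then check that the row-count estimates have become sane:
    EXPLAIN SELECT 1
    FROM _comment
    INNER JOIN _article ON _article.id = _comment.parent_id
    WHERE _comment.path <@ 'A.B.C';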
Laurent Raufaste wrote:

> The problem was that the optimizer didn't know how to run the queries
> well and touched millions of tuples for simple queries. For each
> tuple it used, it was updating some bits in the table file, resulting
> in huge write activity on that file.

Good that you solved your problem.

PostgreSQL doesn't write into the table files when it SELECTs data.

Without an EXPLAIN plan it is impossible to say what PostgreSQL was
doing, but most likely it was building a large hash structure or
something similar and had to dump data into temporary files.

Yours,
Laurenz Albe
In response to "Albe Laurenz" <laurenz.albe@wien.gv.at>: > Laurent Raufaste wrote: > > The problem was that the optimiser didn't know how to run the queries > > well and used millions of tuples for simple queries. For each tuple > > used it was updating some bit in the table file, resulting in a huge > > writing activity to that file. > > Good that you solved your problem. > > PostgreSQL doesn't write into the table files when it SELECTs data. > > Without an EXPLAIN plan it is impossible to say what PostgreSQL > was doing, but most likely it was building a large hash structure > or something similar and had to dump data into temporary files. As a parting comment on this topic ... Based on his previous messages, he was able to definitively tie filesystem write activity to specific tables, but also claimed that his PG logs showed only SELECT statements being executed. However, the part I wanted to comment on (and got busy yesterday so am only getting to it now) is that there's no guarantee that SELECT isn't modifying rows. SELECT nextval('some_seq'); is the simplest example I can imagine of a select that modifies database data, but it's hardly the only one. I suspect that the OP has procedures in his SELECTs that are modifying table data, or triggers that do it ON SELECT or something similar. Of course, without any details, this is purely speculation. -- Bill Moran Collaborative Fusion Inc. http://people.collaborativefusion.com/~wmoran/ wmoran@collaborativefusion.com Phone: 412-422-3463x4023
Bill Moran <wmoran@collaborativefusion.com> writes:

> However, the part I wanted to comment on (I got busy yesterday, so
> I'm only getting to it now) is that there's no guarantee that a
> SELECT isn't modifying rows.

Another way that a SELECT can cause disk writes is if it sets hint bits
on recently-committed rows. However, if the tables aren't actively
being modified any more, you'd expect that sort of activity to settle
out pretty quickly.

I concur with the temporary-file theory --- it's real hard to see how
analyzing the tables would've fixed it otherwise.

			regards, tom lane
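A practical corollary, sketched here as general practice rather than
something tested in this thread: after a bulk load or a fresh Slony
subscription, visiting every row once in a controlled pass writes all
the hint bits up front, instead of leaving that I/O to client SELECTs.

    -- either of these touches every row and sets the hint bits in bulk:
    VACUUM ANALYZE;                  -- also refreshes planner statistics
    SELECT count(*) FROM _comment;   -- lighter, one table at a time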
On Thu, 20 Mar 2008, Albe Laurenz wrote:

> PostgreSQL doesn't write into the table files when it SELECTs data.

It could easily be hint bit updates, set by selects, that are getting
written.

Kris Jurka
2008/3/20, Tom Lane <tgl@sss.pgh.pa.us>:

> Another way that a SELECT can cause disk writes is if it sets hint
> bits on recently-committed rows. However, if the tables aren't
> actively being modified any more, you'd expect that sort of activity
> to settle out pretty quickly.
>
> I concur with the temporary-file theory --- it's real hard to see how
> analyzing the tables would've fixed it otherwise.

That's exactly it. I concur with your first explanation, because:

- We have no modifications at all on SELECT, simply because it's a
Slony-replicated table and any update is forbidden (no nextval, no
triggers, nothing).
- While monitoring the SELECT activity, write activity happened within
the table files only, and without changing their size. No other file
was created, which rules out temporary files.
- Every table was recently committed, as the database had been
replicated from scratch 3 days earlier.

The most problematic query was of this form (more complicated, but
that's the idea):

SELECT * FROM blah WHERE tree <@ 'A.B.C';

We have millions of rows in blah, and blah had been created a few hours
earlier, with no ANALYZE after the injection of data.

All this makes me think that PG was setting some bits on every row it
used, which caused this massive write activity (3MB/s) in the table
files. I'm talking about approx. 50 SELECTs per second for a single
server.

And to prove it, I made a test. I switched Slony off on a server (no
more updates), synced the disks, took the mtime of every file in the
base/ folder, and executed hundreds of queries of the form:

SELECT 1
FROM _comment
INNER JOIN _article ON _article.id = _comment.parent_id
WHERE _comment.path <@ '%RANDOM_VALUE%';

During the massive write activity, I took a new snapshot of the
modified files in the base/ folder. The only files which had been
modified were:

base/16387/1819754
base/16387/18567

# SELECT relname FROM pg_class WHERE relfilenode IN (1819754, 18567);
 relname
----------
 _comment
 _article

So *yes*, table files are modified during SELECT, and it can result in
a lot of writes if the query plans touch a lot of rows.

Thanks for your help, I'm relieved =)

--
Laurent Raufaste
<http://www.glop.org/>
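One way to take the before/after snapshot described above (a sketch;
the marker-file trick and the paths are assumptions, not Laurent's
actual commands):

    sync                                  # flush anything already pending
    touch /tmp/marker                     # record the "before" time
    # ... run the hundreds of SELECT queries ...
    find "$PGDATA"/base -type f -newer /tmp/marker   # files modified since the marker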
"Laurent Raufaste" <analogue@glop.org> writes: > All this make me think that PG was setting some bit on every row it > used, which caused this massive write activity (3MB/s) in the table > files. I'm talking about approx. 50 SELECT per second for a single > server. Well that's true it does. But only once per row. So analyze would have set the bit on every row. You could do the same thing with something ligter like "select count(*) from <table>". Tom's thinking was that you would only expect a high update rate for a short time until all those bits were set. Slony's inserts, updates, and deletes count as updates to the table as well. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
2008/3/21, Gregory Stark <stark@enterprisedb.com>:

> Well, that's true, it does. But only once per row. So the ANALYZE
> would have set the bits on every row. You could do the same thing
> with something lighter like "SELECT count(*) FROM <table>".

Well, the table had been analyzed, I did a SELECT, and PG wrote to the
table. That's a fact.

But it's also true (I just tested it) that every file of a table is
modified by a SELECT count(*).

> Tom's thinking was that you would only expect a high update rate for
> a short time, until all those bits were set.
>
> Slony's inserts, updates, and deletes count as updates to the table
> as well.

Slony is shut down while I'm testing.

--
Laurent Raufaste
<http://www.glop.org/>
In response to "Laurent Raufaste" <analogue@glop.org>: > 2008/3/21, Gregory Stark <stark@enterprisedb.com>: > > > > Well that's true it does. But only once per row. So analyze would have set the > > bit on every row. You could do the same thing with something ligter like > > "select count(*) from <table>". > > Well, the table has been analyzed, I did SELECT, PG write on the > table. That's a fact. > > But it's also true (I juste tested it) that every file of a table is > modified by a SELECT COUNT. The real question (to verify Tom's point) is does a _second_ SELECT count() modify the table again? If so, then something else is going on than what Tom suggested. -- Bill Moran Collaborative Fusion Inc. http://people.collaborativefusion.com/~wmoran/ wmoran@collaborativefusion.com Phone: 412-422-3463x4023