Thread: PG writes a lot to the disk
I have a big PG server dedicated to serving only SELECT queries. The
database is updated permanently using Slony.

The server has 8 Xeon cores running at 3GHz, 24GB of RAM, and the
following disk arrays:
- one RAID1 serving the OS and the pg_xlog
- one RAID5 serving the database and the tables (base directory)
- one RAID5 serving the indexes (the indexes are in a separate tablespace)

This server can't take any load at all: it writes too much. When I plug
it into a client (sending 20 transactions/s) it works fine for about 10
minutes, then starts to write a lot in the pgdata/base directory (where
the database files are, not the indexes). It writes so much (3MB/s,
randomly) that it can't serve the queries anymore; the load is huge.

To locate the problem, I stopped Slony (so no more updates), mounted the
database and index partitions with the sync option (no FS write cache),
and the problem happened even faster, about 2 minutes after plugging the
client (and its queries) in. I can reproduce the problem at will.

I checked whether any file sizes were increasing a lot, and found
nothing beyond the usual DB growth (the DB is constantly updated by
Slony).

What is it writing so much in the base directory? If it's some
temporary table or anything like that, how can I locate it so I can fix
the problem?

Here's the PG memory configuration:

max_connections = 128
shared_buffers = 2GB
temp_buffers = 8MB
work_mem = 96MB
maintenance_work_mem = 4GB
max_stack_depth = 7MB
default_statistics_target = 100
effective_cache_size = 20GB

Thanks a lot for your advice!

--
Laurent Raufaste
<http://www.glop.org/>
In response to "Laurent Raufaste" <analogue@glop.org>: > I have a big PG server dedicated to serve only SELECT queries. > The database is updated permanently using Slony. > > The server has 8 Xeon cores running at 3Ghz, 24GB of RAM and the > following disk arrays: > - one RAID1 serving the OS and the pg_xlog > - one RAID5 serving the database and the tables (base directory) > - one RAID5 serving the indexes (indexes have an alternate tablespace) > > This server can't take anything, it writes too much. > > When I try to plug it to a client (sending 20 > transactions/s) it works fine for like 10 minutes, then start to write > a lot in the pgdata/base directory (where the database files are, not > the index). > > It writes so much (3MB/s randomly) that it can't serve the queries anymore, the > load is huge. > > In order to locate the problem, I stopped Slony (no updates anymore), > mounted the database and index partitions with the sync option (no FS > write cache), and the problem happens faster, like 2 minutes after > having plugged the client (and the queries) to it. > I can reproduce the problem at will. > > I tried to see if some file size were increasing a lot, and found > nothing more than the usual DB increase (DB is constantly updated by > Slony). > > What does it writes so much in the base directory ? If it's some > temporary table or anything, how can I locate it so I can fix the > problem ? My guess (based on the information you provided) is that it's temporary sort file usage. If you're using 8.3 there's a config option to log each time a sort file is required. Anything earlier than 8.3 and you'll have to rely on your OS tools to track it down. However, what makes you so sure it's write activity? I see no evidence attached to this email (iostat or similar output) so I'm wondering if it's actually read activity. Check your log levels, if you turn up PG's logging all the way, it generates a LOT of write activity ... more than you might imagine under some loads. Get rid of the RAID 5. RAID 5 sucks. Have you tried running bonnie++ or similar to see if it's not just a really crappy RAID 5 controller? > Here's the PG memory configuration: > max_connections = 128 > shared_buffers = 2GB Have you tuned this based on experience? Current best practices would recommend that you start with ~6G (1/4 RAM) and tune up/down as experience with your workload dictates. > temp_buffers = 8MB > work_mem = 96MB Considering you've got 24G of RAM, you might want to try bumping this and see if it helps without pushing the system into swap. If the problem is sort file usage, this is the option to tune it. > maintenance_work_mem = 4GB I doubt it's hurting anything, but I don't think a value this high will actually be used. > max_stack_depth = 7MB > default_statistics_target = 100 > effective_cache_size = 20GB -- Bill Moran Collaborative Fusion Inc. http://people.collaborativefusion.com/~wmoran/ wmoran@collaborativefusion.com Phone: 412-422-3463x4023 **************************************************************** IMPORTANT: This message contains confidential information and is intended only for the individual named. If the reader of this message is not an intended recipient (or the individual responsible for the delivery of this message to an intended recipient), please be advised that any re-use, dissemination, distribution or copying of this message is prohibited. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. 
On Wed, 19 Mar 2008, Laurent Raufaste wrote:

> When I plug it into a client (sending 20 transactions/s) it works
> fine for about 10 minutes, then starts to write a lot in the
> pgdata/base directory (where the database files are, not the
> indexes). It writes so much (3MB/s, randomly) that it can't serve
> the queries anymore; the load is huge.

You didn't mention adjusting any of the checkpoint parameters, and you
also didn't say what version of PostgreSQL you're running. If you've
got frozen sections approximately every 5 minutes, you should figure
out if they line up with the checkpoints on your system. How you do
that varies depending on the version; I've covered most of what you
need to get started at:

http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

It's also worth noting that RAID5 is known to be awful for write
performance with some disk controllers. You didn't mention which
controller you have. You should measure your disks to be sure they're
performing well at all; it's possible you're getting performance that's
barely better than a single disk.

http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm
has some ideas on how to do that.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
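A sketch of the checkpoint settings worth checking, assuming the server
turns out to be 8.3 (log_checkpoints and checkpoint_completion_target
do not exist before 8.3):

    # in postgresql.conf
    log_checkpoints = on                 # log when each checkpoint starts and finishes
    checkpoint_segments = 16             # the default of 3 forces very frequent checkpoints
    checkpoint_completion_target = 0.9   # spread checkpoint writes across the interval

If the 3MB/s write bursts line up with the timestamps these settings
produce in the server log, checkpoints are the culprit.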
Laurent Raufaste wrote:

> I have a big PG server dedicated to serving only SELECT queries.
> The database is updated permanently using Slony.
>
> [...] the load is huge.
>
> [...]
>
> What is it writing so much in the base directory? If it's some
> temporary table or anything like that, how can I locate it so I can
> fix the problem?

It could be a temporary file, although your work_mem setting is already
quite large.

Can you attach to the rogue backend with "strace" (if you're on Linux;
otherwise something similar) to see what it does, and use "lsof" to see
which files it has open?

Yours,
Laurenz Albe
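A sketch of that approach on Linux; the PID is hypothetical, and on
releases of this era the PID column in pg_stat_activity is called
procpid:

    -- in psql: find the backend responsible for the writes
    SELECT procpid, current_query FROM pg_stat_activity;

    # then, as root on the server, with 12345 standing in for that PID:
    strace -p 12345 -e trace=read,write,open    # watch its system calls
    lsof -p 12345                               # list the files it has open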
2008/3/19, Laurent Raufaste <analogue@glop.org>:

> What is it writing so much in the base directory? If it's some
> temporary table or anything like that, how can I locate it so I can
> fix the problem?

Thanks for your help, everybody! I fixed the problem by running ANALYZE
on every table (yes, I'm that much of a noob ;) ).

The problem was that the optimizer didn't know how to run the queries
well and touched millions of tuples for simple queries. For each tuple
it used, it was updating some bits in the table file, resulting in huge
write activity on that file.

After the ANALYZE, the optimizer worked smarter, touched a thousand
times fewer tuples per query, and PG no longer had to update so many
bits in the table files.

The server is now OK. Thanks!

--
Laurent Raufaste
<http://www.glop.org/>
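For anyone landing here with the same symptoms, the fix boils down to
the following sketch. The _comment/_article tables and the label path
come from Laurent's test later in the thread (the path type is assumed
to be contrib ltree); these are not his exact commands:

    -- gather planner statistics for every table in the current database
    ANALYZE;

    -- then check that the row-count estimates have become sane:
    EXPLAIN SELECT 1
    FROM _comment
    INNER JOIN _article ON _article.id = _comment.parent_id
    WHERE _comment.path <@ 'A.B.C';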
Laurent Raufaste wrote:

> The problem was that the optimizer didn't know how to run the queries
> well and touched millions of tuples for simple queries. For each
> tuple it used, it was updating some bits in the table file, resulting
> in huge write activity on that file.

Good that you solved your problem.

PostgreSQL doesn't write into the table files when it SELECTs data.

Without an EXPLAIN plan it is impossible to say what PostgreSQL was
doing, but most likely it was building a large hash structure or
something similar and had to dump data into temporary files.

Yours,
Laurenz Albe
In response to "Albe Laurenz" <laurenz.albe@wien.gv.at>: > Laurent Raufaste wrote: > > The problem was that the optimiser didn't know how to run the queries > > well and used millions of tuples for simple queries. For each tuple > > used it was updating some bit in the table file, resulting in a huge > > writing activity to that file. > > Good that you solved your problem. > > PostgreSQL doesn't write into the table files when it SELECTs data. > > Without an EXPLAIN plan it is impossible to say what PostgreSQL > was doing, but most likely it was building a large hash structure > or something similar and had to dump data into temporary files. As a parting comment on this topic ... Based on his previous messages, he was able to definitively tie filesystem write activity to specific tables, but also claimed that his PG logs showed only SELECT statements being executed. However, the part I wanted to comment on (and got busy yesterday so am only getting to it now) is that there's no guarantee that SELECT isn't modifying rows. SELECT nextval('some_seq'); is the simplest example I can imagine of a select that modifies database data, but it's hardly the only one. I suspect that the OP has procedures in his SELECTs that are modifying table data, or triggers that do it ON SELECT or something similar. Of course, without any details, this is purely speculation. -- Bill Moran Collaborative Fusion Inc. http://people.collaborativefusion.com/~wmoran/ wmoran@collaborativefusion.com Phone: 412-422-3463x4023
Bill Moran <wmoran@collaborativefusion.com> writes:

> However, the part I wanted to comment on (I got busy yesterday, so
> I'm only getting to it now) is that there's no guarantee that a
> SELECT isn't modifying rows.

Another way that a SELECT can cause disk writes is if it sets hint bits
on recently-committed rows. However, if the tables aren't actively
being modified any more, you'd expect that sort of activity to settle
out pretty quickly.

I concur with the temporary-file theory --- it's real hard to see how
analyzing the tables would've fixed it otherwise.

			regards, tom lane
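A practical corollary, sketched here as general practice rather than
something tested in this thread: after a bulk load or a fresh Slony
subscription, visiting every row once in a controlled pass writes all
the hint bits up front, instead of leaving that I/O to client SELECTs.

    -- either of these touches every row and sets the hint bits in bulk:
    VACUUM ANALYZE;                  -- also refreshes planner statistics
    SELECT count(*) FROM _comment;   -- lighter, one table at a time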
On Thu, 20 Mar 2008, Albe Laurenz wrote:

> PostgreSQL doesn't write into the table files when it SELECTs data.

It could easily be hint bit updates, set by selects, that are getting
written.

Kris Jurka
2008/3/20, Tom Lane <tgl@sss.pgh.pa.us>:

> Another way that a SELECT can cause disk writes is if it sets hint
> bits on recently-committed rows. However, if the tables aren't
> actively being modified any more, you'd expect that sort of activity
> to settle out pretty quickly.
>
> I concur with the temporary-file theory --- it's real hard to see how
> analyzing the tables would've fixed it otherwise.

That's exactly it. I concur with your first explanation, because:

- We have no modifications at all on SELECT, simply because it's a
Slony-replicated table and any update is forbidden (no nextval, no
triggers, nothing).
- While monitoring the SELECT activity, write activity happened within
the table files only, and without changing their size. No other file
was created, which rules out temporary files.
- Every table was recently committed, as the database had been
replicated from scratch 3 days earlier.

The most problematic query was of this form (more complicated, but
that's the idea):

SELECT * FROM blah WHERE tree <@ 'A.B.C';

We have millions of rows in blah, and blah had been created a few hours
earlier, with no ANALYZE after the injection of data.

All this makes me think that PG was setting some bits on every row it
used, which caused this massive write activity (3MB/s) in the table
files. I'm talking about approx. 50 SELECTs per second for a single
server.

And to prove it, I made a test. I switched Slony off on a server (no
more updates), synced the disks, took the mtime of every file in the
base/ folder, and executed hundreds of queries of the form:

SELECT 1
FROM _comment
INNER JOIN _article ON _article.id = _comment.parent_id
WHERE _comment.path <@ '%RANDOM_VALUE%';

During the massive write activity, I took a new snapshot of the
modified files in the base/ folder. The only files which had been
modified were:

base/16387/1819754
base/16387/18567

# SELECT relname FROM pg_class WHERE relfilenode IN (1819754, 18567);
 relname
----------
 _comment
 _article

So *yes*, table files are modified during SELECT, and it can result in
a lot of writes if the query plans touch a lot of rows.

Thanks for your help, I'm relieved =)

--
Laurent Raufaste
<http://www.glop.org/>
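One way to take the before/after snapshot described above (a sketch;
the marker-file trick and the paths are assumptions, not Laurent's
actual commands):

    sync                                  # flush anything already pending
    touch /tmp/marker                     # record the "before" time
    # ... run the hundreds of SELECT queries ...
    find "$PGDATA"/base -type f -newer /tmp/marker   # files modified since the marker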
"Laurent Raufaste" <analogue@glop.org> writes: > All this make me think that PG was setting some bit on every row it > used, which caused this massive write activity (3MB/s) in the table > files. I'm talking about approx. 50 SELECT per second for a single > server. Well that's true it does. But only once per row. So analyze would have set the bit on every row. You could do the same thing with something ligter like "select count(*) from <table>". Tom's thinking was that you would only expect a high update rate for a short time until all those bits were set. Slony's inserts, updates, and deletes count as updates to the table as well. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
2008/3/21, Gregory Stark <stark@enterprisedb.com>:

> Well, that's true, it does. But only once per row. So the ANALYZE
> would have set the bits on every row. You could do the same thing
> with something lighter like "SELECT count(*) FROM <table>".

Well, the table had been analyzed, I did a SELECT, and PG wrote to the
table. That's a fact.

But it's also true (I just tested it) that every file of a table is
modified by a SELECT count(*).

> Tom's thinking was that you would only expect a high update rate for
> a short time, until all those bits were set.
>
> Slony's inserts, updates, and deletes count as updates to the table
> as well.

Slony is shut down while I'm testing.

--
Laurent Raufaste
<http://www.glop.org/>
In response to "Laurent Raufaste" <analogue@glop.org>: > 2008/3/21, Gregory Stark <stark@enterprisedb.com>: > > > > Well that's true it does. But only once per row. So analyze would have set the > > bit on every row. You could do the same thing with something ligter like > > "select count(*) from <table>". > > Well, the table has been analyzed, I did SELECT, PG write on the > table. That's a fact. > > But it's also true (I juste tested it) that every file of a table is > modified by a SELECT COUNT. The real question (to verify Tom's point) is does a _second_ SELECT count() modify the table again? If so, then something else is going on than what Tom suggested. -- Bill Moran Collaborative Fusion Inc. http://people.collaborativefusion.com/~wmoran/ wmoran@collaborativefusion.com Phone: 412-422-3463x4023