Thread: Megabytes of stats saved after every connection
Dear PostgreSQL experts,

For some time I had been trying to work out why every connection to my database resulted in several megabytes of data being written to the disk, however trivial the query. I think I've found the culprit: global/pgstat.stat. This is with 7.4.7.

This is for a web application which uses a new connection for each CGI request. The server doesn't have a particularly high disk bandwidth and this mysterious activity had been the bottleneck for some time. The system is a little unusual as one of the databases has tens of thousands of tables (though I saw these writes whichever database I connected to).

Looking at the output of vmstat I could see about 2.7 Mbytes being written up to about 5 seconds after the query was processed. I was scratching my head about this for a long time, but today I noticed that this size was just a little larger than my global/pgstat.stat file. So I turned off stats_start_collector and stats_row_level and the writes vanished. Turning them back on, the pgstat.stat file is much smaller (10k) and the writes are invisible against the background noise.

So can I expect this file to grow again? I think I need the stats, though I'm not entirely sure about that. Was the entire file re-written, even when the only query I've run is "select 1"? Is this necessary?

Any comments or suggestions gratefully received.

--Phil.
Phil Endecott <spam_from_postgresql_general@chezphil.org> writes:
> For some time I had been trying to work out why every connection to my
> database resulted in several megabytes of data being written to the
> disk, however trivial the query. I think I've found the culprit:
> global/pgstat.stat. This is with 7.4.7.

> This is for a web application which uses a new connection for each CGI
> request. The server doesn't have a particularly high disk bandwidth and
> this mysterious activity had been the bottleneck for some time. The
> system is a little unusual as one of the databases has tens of thousands
> of tables (though I saw these writes whichever database I connected to).

Well, there's the problem --- the stats subsystem is designed in a way that makes it rewrite its entire stats collection on every update. That's clearly not going to scale well to a large number of tables. Offhand I don't see an easy solution ... Jan, any ideas?

> So can I expect this file to grow again? I think I need the stats,
> though I'm not entirely sure about that.

If you're not using autovacuum then you don't need stats_row_level.

        regards, tom lane
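For readers following along, the collector settings under discussion are ordinary configuration parameters, so their current values (and a rough idea of how many tables could end up in the stats file) can be checked from psql. A small sketch, assuming the 7.4-era parameter names:

    -- which parts of the stats collector are enabled (7.4 names)
    SHOW stats_start_collector;
    SHOW stats_row_level;
    SHOW stats_block_level;

    -- how many tables exist in this database
    -- (each one is a potential entry in the stats file)
    SELECT count(*) FROM pg_stat_all_tables;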
On 7/28/2005 2:03 PM, Tom Lane wrote:
> Phil Endecott <spam_from_postgresql_general@chezphil.org> writes:
>> For some time I had been trying to work out why every connection to my
>> database resulted in several megabytes of data being written to the
>> disk, however trivial the query. I think I've found the culprit:
>> global/pgstat.stat. This is with 7.4.7.
>>
>> This is for a web application which uses a new connection for each CGI
>> request. The server doesn't have a particularly high disk bandwidth and
>> this mysterious activity had been the bottleneck for some time. The
>> system is a little unusual as one of the databases has tens of thousands
>> of tables (though I saw these writes whichever database I connected to).
>
> Well, there's the problem --- the stats subsystem is designed in a way
> that makes it rewrite its entire stats collection on every update.
> That's clearly not going to scale well to a large number of tables.
> Offhand I don't see an easy solution ... Jan, any ideas?

PostgreSQL itself doesn't work too well with tens of thousands of tables. I don't see much of an easy solution either. The best workaround I can offer is to move that horror-DB to a separate postmaster with stats disabled altogether.

Jan
Jan Wieck <JanWieck@Yahoo.com> writes:
> On 7/28/2005 2:03 PM, Tom Lane wrote:
>> Well, there's the problem --- the stats subsystem is designed in a way
>> that makes it rewrite its entire stats collection on every update.
>> That's clearly not going to scale well to a large number of tables.
>> Offhand I don't see an easy solution ... Jan, any ideas?

> PostgreSQL itself doesn't work too well with tens of thousands of
> tables.

Really? AFAIK it should be pretty OK, assuming you are on a filesystem that doesn't choke with tens of thousands of entries in a directory. I think we should put down a TODO item to see if we can improve the stats subsystem's performance in such cases.

        regards, tom lane
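As a rough way to gauge the directory-size concern, the number of on-disk relation files in a database can be estimated from pg_class. A sketch (tables, indexes and TOAST tables each occupy at least one file in the database directory; this is only an approximation):

    -- approximate number of files in this database's directory
    SELECT relkind, count(*)
    FROM pg_class
    WHERE relkind IN ('r', 'i', 't')
    GROUP BY relkind;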
On 7/28/2005 2:28 PM, Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> On 7/28/2005 2:03 PM, Tom Lane wrote:
>>> Well, there's the problem --- the stats subsystem is designed in a way
>>> that makes it rewrite its entire stats collection on every update.
>>> That's clearly not going to scale well to a large number of tables.
>>> Offhand I don't see an easy solution ... Jan, any ideas?
>
>> PostgreSQL itself doesn't work too well with tens of thousands of
>> tables.
>
> Really? AFAIK it should be pretty OK, assuming you are on a filesystem
> that doesn't choke with tens of thousands of entries in a directory.
> I think we should put down a TODO item to see if we can improve the
> stats subsystem's performance in such cases.

Okay, I should be more specific. The problem with tens of thousands of tables does not exist just because of them being there. It will emerge if all those tables are actually used, because it means that you'd need all the pg_class and pg_attribute rows cached, and your vfd cache will constantly rotate.

Then again, the stats file is only written. There is nothing that actually forces the blocks out. On a busy system, one individual stats file will be created, written to, renamed, live for 500ms and be thrown away by the next stat file's rename operation. I would assume that with a decent filesystem and appropriate OS buffers, none of the data blocks of most stat files even hit the disk. I must be missing something.

Jan
Jan Wieck <JanWieck@Yahoo.com> writes:
> On 7/28/2005 2:28 PM, Tom Lane wrote:
>> Jan Wieck <JanWieck@Yahoo.com> writes:
>>> PostgreSQL itself doesn't work too well with tens of thousands of
>>> tables.
>>
>> Really? AFAIK it should be pretty OK, assuming you are on a filesystem
>> that doesn't choke with tens of thousands of entries in a directory.
>> I think we should put down a TODO item to see if we can improve the
>> stats subsystem's performance in such cases.

> Okay, I should be more specific. The problem with tens of thousands of
> tables does not exist just because of them being there. It will emerge
> if all those tables are actually used, because it means that you'd
> need all the pg_class and pg_attribute rows cached, and your vfd
> cache will constantly rotate.

Sure, if you have a single backend touching all tables you'll have some issues in that backend. But the stats problem is that it tracks every table anyone has ever touched, which makes the issue much more pressing.

> Then again, the stats file is only written. There is nothing that
> actually forces the blocks out. On a busy system, one individual stats
> file will be created, written to, renamed, live for 500ms and be thrown
> away by the next stat file's rename operation. I would assume that with a
> decent filesystem and appropriate OS buffers, none of the data blocks of
> most stat files even hit the disk. I must be missing something.

This is possibly true --- Phil, do you see actual disk I/O happening from the stats writes, or is it just kernel calls?

        regards, tom lane
On Thu, 2005-07-28 at 13:40, Jan Wieck wrote:
> On 7/28/2005 2:28 PM, Tom Lane wrote:
>> Jan Wieck <JanWieck@Yahoo.com> writes:
>>> On 7/28/2005 2:03 PM, Tom Lane wrote:
>>>> Well, there's the problem --- the stats subsystem is designed in a way
>>>> that makes it rewrite its entire stats collection on every update.
>>>> That's clearly not going to scale well to a large number of tables.
>>>> Offhand I don't see an easy solution ... Jan, any ideas?
>>
>>> PostgreSQL itself doesn't work too well with tens of thousands of
>>> tables.
>>
>> Really? AFAIK it should be pretty OK, assuming you are on a filesystem
>> that doesn't choke with tens of thousands of entries in a directory.
>> I think we should put down a TODO item to see if we can improve the
>> stats subsystem's performance in such cases.
>
> Okay, I should be more specific. The problem with tens of thousands of
> tables does not exist just because of them being there. It will emerge
> if all those tables are actually used, because it means that you'd
> need all the pg_class and pg_attribute rows cached, and your vfd
> cache will constantly rotate.
>
> Then again, the stats file is only written. There is nothing that
> actually forces the blocks out. On a busy system, one individual stats
> file will be created, written to, renamed, live for 500ms and be thrown
> away by the next stat file's rename operation. I would assume that with a
> decent filesystem and appropriate OS buffers, none of the data blocks of
> most stat files even hit the disk. I must be missing something.

Yeah, I found these three facets of the OP's system a bit disconcerting:

QUOTE ---
This is for a web application which uses a new connection for each CGI request.

The server doesn't have a particularly high disk bandwidth and this mysterious activity had been the bottleneck for some time.

The system is a little unusual as one of the databases has tens of thousands of tables.
ENDQUOTE ---

Any two of those choices could cause some issues, but all three together are pretty much a death knell for performance, whether or not the global/pgstat.stat file is being written.

Just an observation.
Jan Wieck <JanWieck@Yahoo.com> writes:
> Then again, the stats file is only written. There is nothing that actually
> forces the blocks out. On a busy system, one individual stats file will be
> created, written to, renamed, live for 500ms and be thrown away by the next
> stat file's rename operation. I would assume that with a decent filesystem and
> appropriate OS buffers, none of the data blocks of most stat files even hit the
> disk. I must be missing something.

Renaming is a metadata operation. Depending on the filesystem, it either has to be done synchronously or forces a log write barrier. I'm not sure how those things are implemented in various filesystems, but I could easily imagine some implementations treating them as implicit fsyncs for that file.

Perhaps this user could put the stats file on a ramdisk. It doesn't sound like losing it in a crash would be anything to worry about.

-- 
greg
Jan Wieck <JanWieck@Yahoo.com> writes:
>>> PostgreSQL itself doesn't work too well with tens of thousands of tables.
>
>> Really? AFAIK it should be pretty OK, assuming you are on a filesystem
>> that doesn't choke with tens of thousands of entries in a directory.
>> I think we should put down a TODO item to see if we can improve the
>> stats subsystem's performance in such cases.
>
> Okay, I should be more specific. The problem with tens of thousands of tables
> does not exist just because of them being there. It will emerge if all those
> tables are actually used, because it means that you'd need all the pg_class
> and pg_attribute rows cached, and your vfd cache will constantly rotate.

I think occasionally people get bitten by not having their pg_* tables vacuumed or analyzed regularly. If you have lots of tables and the stats are never updated for pg_class or related tables, you can find the planner taking a long time to plan queries.

This happens if you schedule a cron job to do your vacuuming and analyzing but connect as a user other than the database owner. For example, you leave the database owned by "postgres" but create a user to own all the tables and use that to run regularly scheduled "vacuum analyze"s.

I'm not sure how often these types of problems get properly diagnosed. The symptoms are quite mysterious. In retrospect I think I observed something like it and never figured out what was going on. The problem only went away when I upgraded the database and went through an initdb cycle.

-- 
greg
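One way to guard against that is to vacuum and analyze the system catalogs explicitly from a superuser session. A minimal sketch (which catalogs matter most will vary):

    -- run while connected as a superuser (e.g. the postgres user), since
    -- ordinary users cannot vacuum catalogs they do not own
    VACUUM ANALYZE pg_catalog.pg_class;
    VACUUM ANALYZE pg_catalog.pg_attribute;
    VACUUM ANALYZE pg_catalog.pg_index;

    -- or, connected as the database owner or a superuser, simply:
    VACUUM ANALYZE;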
Hello again,

Just to give a bit of background, in case it is useful: this is my family tree website, treefic.com. I have a schema for each user, each with about a dozen tables. In most cases the tables are small, i.e. tens of entries, but the users I care about are the ones with tens of thousands of people in their trees. The schemas are independent of each other. Example web page:

http://treefic.com/treefic/royal92

>>> Jan Wieck <JanWieck@Yahoo.com> writes:
>>>> PostgreSQL itself doesn't work too well with tens of thousands of
>>>> tables.

I've specifically asked about this here before. This is obviously important for my application, so I invite all readers to share any thoughts they might have about possible problems with large numbers of tables. I also create and drop large numbers of temporary tables - can anyone think of any additional problems with that? Issues I have discussed here before include tab-completion in psql (unimportant) and autovacuum's O(n^2) performance (important).

>> Okay, I should be more specific. The problem with tens of thousands of
>> tables does not exist just because of them being there. It will emerge
>> if all those tables are actually used, because it means that you'd
>> need all the pg_class and pg_attribute rows cached, and your vfd
>> cache will constantly rotate.

If many trees are being viewed simultaneously, another part of the system will be the bottleneck. Within any, say, 5 minute period, only hundreds of tables will be in use.

>> Then again, the stats file is only written. There is nothing that
>> actually forces the blocks out. On a busy system, one individual stats
>> file will be created, written to, renamed, live for 500ms and be thrown
>> away by the next stat file's rename operation. I would assume that with a
>> decent filesystem and appropriate OS buffers, none of the data blocks of
>> most stat files even hit the disk. I must be missing something.

> This is possibly true --- Phil, do you see actual disk I/O happening
> from the stats writes, or is it just kernel calls?

During my tests the system was idle; I would run "psql -c 'select 1;'" and see the blocks in vmstat's "bo" column a couple of seconds later. As I understand it that indicates actual I/O, and the delay suggests that it is being flushed by the kernel. When the system is busy it is harder to see what is going on, and it is possible that at least some of this activity was not being written to the disk. Typically I would see a lot more write bandwidth than read bandwidth (by a factor of 5 or so) according to vmstat; any advice about how to identify what files or processes are involved would be appreciated. I had previously imagined that it could be temporary tables.

This is Linux 2.4.26 and an ext3 filesystem.

Having disabled stats earlier, my stats file is still quite small. Presumably it will gradually grow back. In the meantime I cannot do any experiments.

Thanks as ever for your prompt responses.

Regards, --Phil.
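For readers unfamiliar with this kind of layout, the sketch below shows roughly what a per-user schema might look like; all table and column names are invented for illustration, since the real schema isn't shown in the thread.

    -- hypothetical per-user schema; names are illustrative only
    CREATE SCHEMA some_user;

    CREATE TABLE some_user.people (
        person_id  integer PRIMARY KEY,
        name       text,
        birth_date date
    );

    CREATE TABLE some_user.parent_child (
        parent_id  integer NOT NULL REFERENCES some_user.people,
        child_id   integer NOT NULL REFERENCES some_user.people,
        PRIMARY KEY (parent_id, child_id)
    );

With roughly a dozen such tables per user, a few thousand users is already enough to reach the tens of thousands of tables being discussed.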
Scott Marlowe wrote:
> Yeah, I found these three facets of the OP's system a bit disconcerting:
>
> QUOTE ---
> This is for a web application which uses a new connection for each CGI
> request.
> The server doesn't have a particularly high disk bandwidth and this
> mysterious activity had been the bottleneck for some time.
> The system is a little unusual as one of the databases has tens of
> thousands of tables.
> ENDQUOTE ---
>
> Any two of those choices could cause some issues, but all three together
> are pretty much a death knell for performance, whether or not the
> global/pgstat.stat file is being written.

See my previous message for some background about the application and an example URL. When PostgreSQL is running smoothly, it is not the bottleneck in the system: all it has to do is read maybe 100k from the disk (or more likely the cache), do some in-memory sorts and joins, and pass it to the rest of the application. As far as I can see it is only because some parts of PostgreSQL have poor O(num tables) performance that things are slowing down.

--Phil.
On Thu, Jul 28, 2005 at 08:31:21PM +0100, Phil Endecott wrote:
>
> This is Linux 2.4.26 and an ext3 filesystem.

With the dir_index feature or without?

Peter
On Thu, Jul 28, 2005 at 09:43:44PM +0200, Peter Wiersig wrote:
> On Thu, Jul 28, 2005 at 08:31:21PM +0100, Phil Endecott wrote:
>>
>> This is Linux 2.4.26 and an ext3 filesystem.
>
> With the dir_index feature or without?

Also, with data=ordered, data=writeback or data=journal? (The first one is the default.)

-- 
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
Jan Wieck wrote:
> Then again, the stats file is only written. There is nothing that
> actually forces the blocks out. On a busy system, one individual stats
> file will be created, written to, renamed, live for 500ms and be
> thrown away by the next stat file's rename operation. I would assume
> that with a decent filesystem and appropriate OS buffers, none of the
> data blocks of most stat files even hit the disk. I must be missing
> something.

(From someone who is at best semi-informed.) Unless battery-backed cache is available, we are advised to run with fsync enabled. Wouldn't that affect the stats files as well?

-- 
Guy Rouillier
>> This is Linux 2.4.26 and an ext3 filesystem.
> With the dir_index feature or without?

With, I believe. It is enabled in the superblock (tune2fs -O dir_index), but this was not done when the filesystem was created, so only new directories are indexed, I think. I don't think there's a way to index an existing directory on a mounted filesystem, or to tell if a particular directory is indexed. I created new directories for my postgres data and moved the files into them in the hope that they would then have indexes, but am not sure how to check. In any case, this does not seem to be a bottleneck.

--Phil.
On Thu, Jul 28, 2005 at 05:48:21PM -0500, Guy Rouillier wrote:
> Jan Wieck wrote:
>
>> Then again, the stats file is only written. There is nothing that
>> actually forces the blocks out. On a busy system, one individual stats
>> file will be created, written to, renamed, live for 500ms and be
>> thrown away by the next stat file's rename operation. I would assume
>> that with a decent filesystem and appropriate OS buffers, none of the
>> data blocks of most stat files even hit the disk. I must be missing
>> something.
>
> (From someone who is at best semi-informed.) Unless battery-backed
> cache is available, we are advised to run with fsync enabled. Wouldn't
> that affect the stats files as well?

The stats file is dispensable. In fact, it has been proposed that on crash recovery the stats file should be deleted.

-- 
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
On Thu, Jul 28, 2005 at 03:12:33PM -0400, Greg Stark wrote:
> I think occasionally people get bitten by not having their pg_* tables
> vacuumed or analyzed regularly. If you have lots of tables and the stats are
> never updated for pg_class or related tables, you can find the planner taking
> a long time to plan queries.
>
> This happens if you schedule a cron job to do your vacuuming and analyzing but
> connect as a user other than the database owner. For example, you leave the
> database owned by "postgres" but create a user to own all the tables and use
> that to run regularly scheduled "vacuum analyze"s.
>
> I'm not sure how often these types of problems get properly diagnosed. The
> symptoms are quite mysterious. In retrospect I think I observed something like
> it and never figured out what was going on. The problem only went away when I
> upgraded the database and went through an initdb cycle.

I've had exactly this problem at least five times, twice on my own systems and three times that I noticed on customer machines. It's an easy mistake to make on a system that doesn't have much interactive use, and if you're creating and dropping a lot of tables it can devastate your performance after a while.

Cheers,
Steve
On Jul 28, 2005, at 2:40 PM, Jan Wieck wrote:
> Then again, the stats file is only written. There is nothing that
> actually forces the blocks out. On a busy system, one individual
> stats file will be created, written to,

If one is running with stats_reset_on_server_start true (the default), do we even need this file if it is never read? To help alleviate his problem, could he symlink it to /dev/null?

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/
On Fri, Jul 29, 2005 at 09:08:28AM -0400, Jeff Trout wrote:
> On Jul 28, 2005, at 2:40 PM, Jan Wieck wrote:
>
>> Then again, the stats file is only written. There is nothing that
>> actually forces the blocks out. On a busy system, one individual
>> stats file will be created, written to,
>
> If one is running with stats_reset_on_server_start true (the default),
> do we even need this file if it is never read?
>
> To help alleviate his problem, could he symlink it to /dev/null?

If you don't want the stats collector to run, you can disable it. No need to play games with a useless pgstat file.

Anyway -- I see this as an argument in favor of the ability to deactivate stats on a per-database basis.

-- 
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
Phil Endecott <spam_from_postgresql_general@chezphil.org> writes:
> Hello again,
>
> Just to give a bit of background, in case it is useful: this is my family tree
> website, treefic.com. I have a schema for each user, each with about a dozen
> tables. In most cases the tables are small, i.e. tens of entries, but the
> users I care about are the ones with tens of thousands of people in their
> trees. The schemas are independent of each other. Example web page:

I would strongly suggest you reconsider this design altogether. A normal (and normalized) design would have a users table that assigns a sequential id to each user. Then every other table would combine everybody's data but have a user id column to indicate which user each row belongs to.

If you don't believe there's anything wrong with your current system, consider what it would look like to query your existing schema to find out the answer to the question "how many users have > 1000 people in their tree", or "how many users have updated their tree in the last 7 days".

In a normalized database you really want one table for any given type of data, not hundreds of tables that contain the same type of data but for different people.

-- 
greg
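As a concrete illustration of that suggestion, a minimal sketch of such a normalized layout might look like this; the table and column names are invented, not taken from the actual application:

    -- hypothetical normalized layout; names are illustrative only
    CREATE TABLE users (
        user_id  serial PRIMARY KEY,
        username text NOT NULL UNIQUE
    );

    CREATE TABLE people (
        person_id serial PRIMARY KEY,
        user_id   integer NOT NULL REFERENCES users,
        name      text
    );
    CREATE INDEX people_user_idx ON people (user_id);

    -- cross-user questions become ordinary aggregates, e.g.
    -- "how many users have more than 1000 people in their tree":
    SELECT user_id, count(*)
    FROM people
    GROUP BY user_id
    HAVING count(*) > 1000;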
Greg Stark wrote:
> Phil Endecott wrote:
>> Just to give a bit of background, in case it is useful: this is my family tree
>> website, treefic.com. I have a schema for each user, each with about a dozen
>> tables. In most cases the tables are small, i.e. tens of entries, but the
>> users I care about are the ones with tens of thousands of people in their
>> trees. The schemas are independent of each other. Example web page:
>
> I would strongly suggest you reconsider this design altogether. A normal (and
> normalized) design would have a users table that assigns a sequential id to
> each user. Then every other table would combine everybody's data but have a
> user id column to indicate which user each row belongs to.
>
> If you don't believe there's anything wrong with your current system, consider
> what it would look like to query your existing schema to find out the answer
> to the question "how many users have > 1000 people in their tree", or "how
> many users have updated their tree in the last 7 days".

Those aren't questions that I need to answer often. The sort of question I do need to answer is this: starting from individual X, find all the ancestors and descendants for n generations. This involves n iterations of a loop, joining the relatives found so far with the next generation. If there are p people in the tree, this has something like O(n log p) complexity. On the other hand, if I stored all users' data in the same tables and I had u users, this operation would have O(n log (u*p)) complexity. My guess is that it would be about an order of magnitude slower.

The individual users' sites are entirely disjoint - there are no queries that overlap them.

--Phil.
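For concreteness, the per-generation step described above can be sketched roughly as follows; the relations table and its columns are assumed here, since the real schema isn't shown. The application (or a plpgsql loop) repeats the insert once per generation.

    -- assumed layout: relations(parent_id, child_id) within the user's schema
    CREATE TEMP TABLE found (person_id integer PRIMARY KEY);
    INSERT INTO found VALUES (42);   -- individual X

    -- one generation of ancestors; repeat n times
    INSERT INTO found
    SELECT DISTINCT r.parent_id
    FROM relations r
    JOIN found f ON r.child_id = f.person_id
    WHERE r.parent_id NOT IN (SELECT person_id FROM found);

Each iteration probes indexes covering only that user's p people, which is where the O(n log p) versus O(n log (u*p)) comparison comes from.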
Phil Endecott <spam_from_postgresql_general@chezphil.org> writes:
> Those aren't questions that I need to answer often.

But the fact that they're utterly infeasible in your current design is a bad sign. Just because you don't need them now doesn't mean you won't need *something* that spans users later. Sometimes you have to be pragmatic, look at what your actual current needs are, and make sacrifices, but you should at least be aware that you're giving up a *lot*, and in this case I think for little or no gain.

> The sort of question I do need to answer is this: starting from individual
> X, find all the ancestors and descendants for n generations. This involves n
> iterations of a loop, joining the relatives found so far with the next
> generation. If there are p people in the tree, this has something like O(n
> log p) complexity. On the other hand, if I stored all users' data in the
> same tables and I had u users, this operation would have O(n log (u*p))
> complexity. My guess is that it would be about an order of magnitude slower.

You're omitting the time spent finding the actual table for the correct user in your current scheme. That's exactly the same as the log(u) factor above. Of course the time spent finding the table is pretty small, but it's also small in the normalized schema, where it represents probably a single extra btree level.

You might be interested in the ltree contrib module and GiST indexes. You might be able to do this recursive algorithm in a single indexed, non-recursive query using them.

> The individual users' sites are entirely disjoint - there are no queries that
> overlap them.

If you had a more flexible design you might find that you have a wealth of data that you're currently not able to see because your design hides it.

-- 
greg
Greg Stark wrote:
>> The sort of question I do need to answer is this: starting from individual
>> X, find all the ancestors and descendants for n generations. This involves n
>> iterations of a loop, joining the relatives found so far with the next
>> generation. If there are p people in the tree, this has something like O(n
>> log p) complexity. On the other hand, if I stored all users' data in the
>> same tables and I had u users, this operation would have O(n log (u*p))
>> complexity. My guess is that it would be about an order of magnitude slower.
>
> You're omitting the time spent finding the actual table for the correct user
> in your current scheme. That's exactly the same as the log(u) factor above.

I hope not - can anyone confirm?

I have the impression that within a plpgsql function, the table lookup cost happens once, and subsequent accesses to the same table are cheap. In fact this characteristic has caused problems for me in the past, see

http://archives.postgresql.org/pgsql-general/2004-09/msg00316.php

I hope that the same is true of PQexecPrepared - can anyone confirm?

> You might be interested in the ltree contrib module and GiST indexes. You
> might be able to do this recursive algorithm in a single indexed, non-recursive
> query using them.

I could use something like "CONNECT BY", though last time I investigated I believe there were some stability concerns with the patch. Unfortunately genealogies are not trees in anything other than the informal sense of the word, so I don't think ltree is applicable.

>> The individual users' sites are entirely disjoint - there are no queries that
>> overlap them.
>
> If you had a more flexible design you might find that you have a wealth of
> data that you're currently not able to see because your design hides it.

I have a wealth of data that the majority of my users want me to keep private. There are other sites that try to match up people's genealogies, and I'm not competing with them.

Thanks for your suggestions Greg, but I think I know what I'm doing. The PostgreSQL core copes well with this setup. It's just peripheral things, like autovacuum and this stats writing issue, where poor big-O complexity had gone un-noticed.

--Phil.
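The plan-caching behaviour being discussed can be seen with an explicit prepared statement; a small sketch, with the statement and table names made up for illustration:

    -- the table reference is resolved and the plan built once, at PREPARE
    -- time, and reused by every EXECUTE on this connection
    PREPARE next_generation (integer) AS
        SELECT parent_id FROM relations WHERE child_id = $1;

    EXECUTE next_generation(42);
    EXECUTE next_generation(43);

PQexecPrepared then executes a statement prepared this way by name, so the table lookup cost is paid once per connection rather than once per query.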
Phil Endecott <spam_from_postgresql_general@chezphil.org> writes:
> Greg Stark wrote:
>
>> You're omitting the time spent finding the actual table for the correct
>> user in your current scheme. That's exactly the same as the log(u) factor
>> above.
>
> I hope not - can anyone confirm?
>
> I have the impression that within a plpgsql function, the table lookup cost
> happens once, and subsequent accesses to the same table are cheap. In fact this
> characteristic has caused problems for me in the past, see
> http://archives.postgresql.org/pgsql-general/2004-09/msg00316.php
>
> I hope that the same is true of PQexecPrepared - can anyone confirm?

Are you really keeping prepared queries for each of your thousands of users? Then I have to wonder about the time to look up the relevant prepared query from amongst the thousands of prepared queries in the system.

I'm not saying it's a problem; it's (presumably) a small cost, just like looking up the table in the system tables (using indexes) is a small cost. And just like having another level in the btree index would be a small cost. I'm just saying you're not getting something for free here by having lots of small indexes instead of one big one. There can be some small linear gains, like the database using a sequential scan instead of an index scan for some queries, but there's no algorithmic gain here.

> I could use something like "CONNECT BY", though last time I investigated I
> believe there were some stability concerns with the patch.

I think the main problem was that it changed some internal structures such that a database created with a postgres with that patch was incompatible with a postgres without the patch. And if you switched back and forth you corrupted the database.

> Thanks for your suggestions Greg, but I think I know what I'm doing. The
> PostgreSQL core copes well with this setup. It's just peripheral things, like
> autovacuum and this stats writing issue, where poor big-O
> complexity had gone un-noticed.

Well that's useful for Postgres development in a "guinea pig" sort of way at least :)

-- 
greg