Thread: autovacuum stress-testing our system
Hi,

I've been struggling with autovacuum generating a lot of I/O and CPU on some of our systems - after a night spent analyzing this behavior, I believe the current autovacuum accidentally behaves a bit like a stress-test in some corner cases (but I may be seriously wrong, after all it was a long night).

First - our system really is not a "common" one - we have ~1000 databases of various sizes, each containing up to several thousand tables (several user-defined tables, the rest serving as caches for a reporting application - yes, it's a bit of a weird design, but that's life). This all leads to a pgstat.stat significantly larger than 60 MB.

Now, the two main pieces of information from pgstat.c are the timer definitions

---------------------------------- pgstat.c : 80 ----------------------------------
#define PGSTAT_STAT_INTERVAL    500     /* Minimum time between stats file
                                         * updates; in milliseconds. */

#define PGSTAT_RETRY_DELAY      10      /* How long to wait between checks for
                                         * a new file; in milliseconds. */

#define PGSTAT_MAX_WAIT_TIME    10000   /* Maximum time to wait for a stats
                                         * file update; in milliseconds. */

#define PGSTAT_INQ_INTERVAL     640     /* How often to ping the collector for
                                         * a new file; in milliseconds. */

#define PGSTAT_RESTART_INTERVAL 60      /* How often to attempt to restart a
                                         * failed statistics collector; in
                                         * seconds. */

#define PGSTAT_POLL_LOOP_COUNT  (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
#define PGSTAT_INQ_LOOP_COUNT   (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-----------------------------------------------------------------------------------

and then this loop (the current HEAD does this a bit differently, but the 9.2 code is a bit more readable and suffers from the same issue):

---------------------------------- pgstat.c : 3560 --------------------------------
    /*
     * Loop until fresh enough stats file is available or we ran out of time.
     * The stats inquiry message is sent repeatedly in case collector drops
     * it; but not every single time, as that just swamps the collector.
     */
    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
    {
        TimestampTz file_ts = 0;

        CHECK_FOR_INTERRUPTS();

        if (pgstat_read_statsfile_timestamp(false, &file_ts) &&
            file_ts >= min_ts)
            break;

        /* Not there or too old, so kick the collector and wait a bit */
        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
            pgstat_send_inquiry(min_ts);

        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
    }

    if (count >= PGSTAT_POLL_LOOP_COUNT)
        elog(WARNING, "pgstat wait timeout");

    /* Autovacuum launcher wants stats about all databases */
    if (IsAutoVacuumLauncherProcess())
        pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
    else
        pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false);
-----------------------------------------------------------------------------------

What this code does is check the stats file, and if it's not stale (the timestamp of the write start is not older than PGSTAT_RETRY_DELAY milliseconds), the loop terminates and the file is read.

Now, let's suppose the write takes >10 ms, which is the PGSTAT_RETRY_DELAY value. With our current pgstat.stat filesize / number of relations, this is quite common. Actually the typical write time in our case is ~100 ms, even if we move the file into tmpfs. That means that almost all the calls to backend_read_statsfile (which happen in all pgstat_fetch_stat_*entry calls) result in a continuous stream of inquiries from the autovacuum workers and writing/reading of the file. We're not getting 'pgstat wait timeout' though, because the file finally gets written before PGSTAT_MAX_WAIT_TIME elapses.
By moving the file to a tmpfs we've minimized the I/O impact, but now the collector and autovacuum launcher consume ~75% of CPU (i.e. ~one core) and do nothing except burn power, because the database is almost read-only. Not a good thing in the "green computing" era, I guess.

First, I'm interested in feedback - did I get all the details right, or am I missing something important?

Next, I'm thinking about ways to solve this:

1) turning off autovacuum, doing regular VACUUM ANALYZE from cron - certainly an option, but it's rather a workaround than a solution and I'm not very fond of it. Moreover it fixes only one side of the problem - triggering the statfile writes over and over. The file will be written anyway, although not that frequently.

2) tweaking the timer values, especially increasing PGSTAT_RETRY_DELAY and so on, to consider several seconds to be fresh enough - it would be nice to have these as GUC variables, although we can do another private patch on our own. But more knobs is not always better.

3) logic detecting the proper PGSTAT_RETRY_DELAY value - based mostly on the time it takes to write the file (e.g. 10x the write time or something).

4) keeping some sort of "dirty flag" in stat entries - and then writing only info about objects that were modified enough to be eligible for vacuum/analyze (e.g. an increasing number of index scans can't trigger autovacuum while inserting rows can). Also, I'm not worried about getting a slightly older number of index scans, so 'clean' records might be written less frequently than 'dirty' ones.

5) splitting the single stat file into multiple pieces - e.g. per database, written separately, so that the autovacuum workers don't need to read all the data even for databases that don't need to be vacuumed. This might be combined with (4).

Ideas? Objections? Preferred options? I kinda like (4+5), although that'd be a pretty big patch and I'm not entirely sure it can be done without breaking other things.

regards
Tomas
Really, as far as autovacuum is concerned, it would be much more useful to be able to reliably detect that a table has been recently vacuumed, without having to request a 10ms-recent pgstat snapshot. That would greatly reduce the amount of time autovac spends on pgstat requests. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Sep 26, 2012 at 5:43 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> First - our system really is not a "common" one - we do have ~1000 of databases of
> various size, each containing up to several thousands of tables (several user-defined
> tables, the rest serve as caches for a reporting application - yes, it's a bit weird
> design but that's life). This all leads to pgstat.stat significantly larger than 60 MB.
> ...
> Now, let's suppose the write takes >10 ms, which is the PGSTAT_RETRY_DELAY value.
> With our current pgstat.stat filesize/num of relations, this is quite common.
> Actually the common write time in our case is ~100 ms, even if we move the file
> into tmpfs. That means that almost all the calls to backend_read_statsfile (which
> happen in all pgstat_fetch_stat_*entry calls) result in continuous stream of
> inquiries from the autovacuum workers, writing/reading of the file.

I don't think it actually does. What you are missing is the same thing I was missing a few weeks ago when I also looked into something like this.

3962:
     * We don't recompute min_ts after sleeping, except in the
     * unlikely case that cur_ts went backwards.

That means the file must have been written within 10 ms of when we *first* asked for it.

What is generating the endless stream you are seeing is that you have 1000 databases, so if naptime is one minute you are vacuuming 16 per second. Since every database gets a new process, that process needs to read the file as it doesn't inherit one.

> First, I'm interested in feedback - did I get all the details right, or am I
> missing something important?
>
> Next, I'm thinking about ways to solve this:
>
> 1) turning off autovacuum, doing regular VACUUM ANALYZE from cron

Increasing autovacuum_naptime seems like a far better way to do effectively the same thing.

> 2) tweaking the timer values, especially increasing PGSTAT_RETRY_DELAY and so on
>    to consider several seconds to be fresh enough - Would be nice to have this
>    as a GUC variables, although we can do another private patch on our own. But
>    more knobs is not always better.

I think forking it off to another value would be better. If you are an autovacuum worker which is just starting up and so getting its initial stats, you can tolerate a stats file up to "autovacuum_naptime / 5.0" stale. If you are already started up and are just about to vacuum a table, then keep the staleness at PGSTAT_RETRY_DELAY as it currently is, so as not to redundantly vacuum a table.

> 3) logic detecting the proper PGSTAT_RETRY_DELAY value - based mostly on the time
>    it takes to write the file (e.g. 10x the write time or something).

This is already in place.

> 5) splitting the single stat file into multiple pieces - e.g. per database,
>    written separately, so that the autovacuum workers don't need to read all
>    the data even for databases that don't need to be vacuumed. This might be
>    combined with (4).

I think this needs to happen eventually.

Cheers,
Jeff
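Jeff's "two staleness values" suggestion would amount to something like this in backend_read_statsfile() (a minimal sketch only, not a patch; the av_worker_starting_up flag is made up here for illustration):

    /*
     * Sketch of the "two staleness values" idea: an autovacuum worker that is
     * just starting up can accept a much older stats file than one that is
     * about to decide whether a specific table still needs vacuuming.
     * av_worker_starting_up is a hypothetical flag, not an existing one.
     */
    if (IsAutoVacuumWorkerProcess() && av_worker_starting_up)
        min_ts = TimestampTzPlusMilliseconds(cur_ts,
                                             -(autovacuum_naptime * 1000 / 5));
    else if (IsAutoVacuumLauncherProcess() || IsAutoVacuumWorkerProcess())
        min_ts = TimestampTzPlusMilliseconds(cur_ts, -PGSTAT_RETRY_DELAY);
    else
        min_ts = TimestampTzPlusMilliseconds(cur_ts, -PGSTAT_STAT_INTERVAL);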
On 26-09-2012 09:43, Tomas Vondra wrote:
> I've been struggling with autovacuum generating a lot of I/O and CPU on some of our
> systems - after a night spent analyzing this behavior, I believe the current
> autovacuum accidentally behaves a bit like a stress-test in some corner cases (but
> I may be seriously wrong, after all it was a long night).
>
It is known that the statistics collector doesn't scale to a lot of databases. It wouldn't be a problem if we didn't have automatic maintenance (aka autovacuum).

> Next, I'm thinking about ways to solve this:
>
> 1) turning off autovacuum, doing regular VACUUM ANALYZE from cron - certainly an
>    option, but it's rather a workaround than a solution and I'm not very fond of
>    it. Moreover it fixes only one side of the problem - triggering the statfile
>    writes over and over. The file will be written anyway, although not that
>    frequently.
>
It solves your problem if you combine scheduled VA with pgstat.stat in a tmpfs. I don't see it as a definitive solution if we want to scale auto maintenance to several hundreds or even thousands of databases in a single cluster. (Someone could think that is not that common, but in hosting scenarios it is true. DBAs don't want to run several VMs or pg servers just to minimize the auto maintenance scalability problem.)

> 2) tweaking the timer values, especially increasing PGSTAT_RETRY_DELAY and so on
>    to consider several seconds to be fresh enough - Would be nice to have this
>    as a GUC variables, although we can do another private patch on our own. But
>    more knobs is not always better.
>
It doesn't solve the problem. Also, it could be a problem for autovacuum (which makes assumptions based on those fixed values).

> 3) logic detecting the proper PGSTAT_RETRY_DELAY value - based mostly on the time
>    it takes to write the file (e.g. 10x the write time or something).
>
Such adaptive logic would be good only if it takes a small fraction of time to execute. It has to pay attention to the limits. It appears to be a candidate for exploration.

> 4) keeping some sort of "dirty flag" in stat entries - and then writing only info
>    about objects that were modified enough to be eligible for vacuum/analyze (e.g.
>    increasing number of index scans can't trigger autovacuum while inserting
>    rows can). Also, I'm not worried about getting a bit older num of index scans,
>    so 'clean' records might be written less frequently than 'dirty' ones.
>
It minimizes your problem but harms collector tools (that want fresh statistics about databases).

> 5) splitting the single stat file into multiple pieces - e.g. per database,
>    written separately, so that the autovacuum workers don't need to read all
>    the data even for databases that don't need to be vacuumed. This might be
>    combined with (4).
>
IMHO that's the definitive solution. It would be one file per database plus a global one. That way, the check would only read the global.stat and process those databases that were modified. Also, an in-memory map could store that information to speed up the checks. The only downside I can see is that you will increase the number of open file descriptors.

> Ideas? Objections? Preferred options?
>
I prefer to attack 3, sort of 4 (explained in 5 -- in-memory map) and 5. Out of curiosity, did you run perf (or some other performance analyzer) to verify whether some (stats and/or autovac) functions pop up in the report?

--
Euler Taveira de Oliveira - Timbira
http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Dne 26.09.2012 16:51, Jeff Janes napsal:
> On Wed, Sep 26, 2012 at 5:43 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>
>> First - our system really is not a "common" one - we do have ~1000 of databases of
>> various size, each containing up to several thousands of tables (several user-defined
>> tables, the rest serve as caches for a reporting application - yes, it's a bit weird
>> design but that's life). This all leads to pgstat.stat significantly larger than 60 MB.
>>
> ...
>> Now, let's suppose the write takes >10 ms, which is the PGSTAT_RETRY_DELAY value.
>> With our current pgstat.stat filesize/num of relations, this is quite common.
>> Actually the common write time in our case is ~100 ms, even if we move the file
>> into tmpfs. That means that almost all the calls to backend_read_statsfile (which
>> happen in all pgstat_fetch_stat_*entry calls) result in continuous stream of
>> inquiries from the autovacuum workers, writing/reading of the file.
>
> I don't think it actually does. What you are missing is the same
> thing I was missing a few weeks ago when I also looked into something
> like this.
>
> 3962:
>
>      * We don't recompute min_ts after sleeping, except in the
>      * unlikely case that cur_ts went backwards.
>
> That means the file must have been written within 10 ms of when we
> *first* asked for it.

Yeah, right - I've missed the first "if (pgStatDBHash)" check right at the beginning.

> What is generating the endless stream you are seeing is that you have
> 1000 databases so if naptime is one minute you are vacuuming 16 per
> second. Since every database gets a new process, that process needs
> to read the file as it doesn't inherit one.

Right. But that makes the 10ms timeout even more strange, because the worker is then using the data for a very long time (even minutes).

> ...
>>
>> First, I'm interested in feedback - did I get all the details right, or am I
>> missing something important?
>>
>> Next, I'm thinking about ways to solve this:
>>
>> 1) turning off autovacuum, doing regular VACUUM ANALYZE from cron
>
> Increasing autovacuum_naptime seems like a far better way to do
> effectively the same thing.

Agreed. One of my colleagues turned autovacuum off a few years back and that was a nice lesson in how not to solve this kind of issue.

>> 2) tweaking the timer values, especially increasing PGSTAT_RETRY_DELAY and so on
>>    to consider several seconds to be fresh enough - Would be nice to have this
>>    as a GUC variables, although we can do another private patch on our own. But
>>    more knobs is not always better.
>
> I think forking it off to another value would be better. If you
> are an autovacuum worker which is just starting up and so getting its
> initial stats, you can tolerate a stats file up to "autovacuum_naptime
> / 5.0" stale. If you are already started up and are just about to
> vacuum a table, then keep the staleness at PGSTAT_RETRY_DELAY as it
> currently is, so as not to redundantly vacuum a table.

I always thought there's a "no more than one worker per database" limit, and that the file is always reloaded when switching to another database. So I'm not sure how a worker could see such stale table info? Or are the workers keeping the stats across multiple databases?

>> 3) logic detecting the proper PGSTAT_RETRY_DELAY value - based mostly on the time
>>    it takes to write the file (e.g. 10x the write time or something).
>
> This is already in place.

Really? Where?
I've checked the current master, and the only thing I see in pgstat_write_statsfile is this (line 3558):

    last_statwrite = globalStats.stats_timestamp;

https://github.com/postgres/postgres/blob/master/src/backend/postmaster/pgstat.c#L3558

I don't think that's doing what I meant. That really doesn't scale the timeout according to write time. What happens right now is that when the stats file is written at time 0 (starts at zero, write finishes at 100 ms), and a worker asks for the file at 99 ms (i.e. 1 ms before the write finishes), it will set the time of the inquiry to last_statrequest and then do this

    if (last_statwrite < last_statrequest)
        pgstat_write_statsfile(false);

i.e. comparing it to the start of the write. So another write will start right after the file is written out. And over and over.

Moreover there's the 'rename' step making the new file invisible to the worker processes, which makes the thing a bit more complicated.

What I'm suggesting is that there should be some sort of tracking of the write time, and then deciding whether the file is fresh enough using 10x that value. So when a file is written in 100 ms, it'd be considered OK for the next 900 ms, i.e. 1 sec in total. Sure, we could use 5x or some other coefficient, doesn't really matter.

>> 5) splitting the single stat file into multiple pieces - e.g. per database,
>>    written separately, so that the autovacuum workers don't need to read all
>>    the data even for databases that don't need to be vacuumed. This might be
>>    combined with (4).
>
> I think this needs to happen eventually.

Yes, a nice patch idea ;-)

thanks for the feedback
Tomas
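To make the proposed behavior concrete, the freshness check could look something like this (a minimal sketch only, not pgstat.c code; last_statwrite_end and last_write_msec are made-up names for "when the last write finished" and "how long it took"):

    /*
     * Sketch: scale the freshness window by the observed write time instead
     * of the fixed PGSTAT_RETRY_DELAY.  All names here are illustrative.
     */
    static TimestampTz last_statwrite_end;   /* when the last write finished */
    static int         last_write_msec;      /* how long that write took, in ms */

    static bool
    statsfile_fresh_enough(TimestampTz request_ts)
    {
        /* a file written in 100 ms stays "fresh" for ~1 second */
        int         window = 10 * last_write_msec;

        return !TimestampDifferenceExceeds(last_statwrite_end, request_ts, window);
    }

    /* the collector would then rewrite the file only when
     * !statsfile_fresh_enough(last_statrequest) */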
Excerpts from Euler Taveira's message of mié sep 26 11:53:27 -0300 2012: > On 26-09-2012 09:43, Tomas Vondra wrote: > > 5) splitting the single stat file into multiple pieces - e.g. per database, > > written separately, so that the autovacuum workers don't need to read all > > the data even for databases that don't need to be vacuumed. This might be > > combined with (4). > > > IMHO that's the definitive solution. It would be one file per database plus a > global one. That way, the check would only read the global.stat and process > those database that were modified. Also, an in-memory map could store that > information to speed up the checks. +1 > The only downside I can see is that you > will increase the number of opened file descriptors. Note that most users of pgstat will only have two files open (instead of one as currently) -- one for shared, one for their own database. Only pgstat itself and autovac launcher would need to open pgstat files for all databases; but both do not have a need to open other files (arbitrary tables) so this shouldn't be a major problem. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Excerpts from Tomas Vondra's message of mié sep 26 12:25:58 -0300 2012: > Dne 26.09.2012 16:51, Jeff Janes napsal: > > I think forking it off to to another value would be better. If you > > are an autovacuum worker which is just starting up and so getting its > > initial stats, you can tolerate a stats file up to > > "autovacuum_naptime > > / 5.0" stale. If you are already started up and are just about to > > vacuum a table, then keep the staleness at PGSTAT_RETRY_DELAY as it > > currently is, so as not to redundantly vacuum a table. > > I always thought there's a "no more than one worker per database" > limit, There is no such limitation. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Dne 26.09.2012 17:29, Alvaro Herrera napsal: > Excerpts from Tomas Vondra's message of mié sep 26 12:25:58 -0300 > 2012: >> Dne 26.09.2012 16:51, Jeff Janes napsal: > >> > I think forking it off to to another value would be better. If >> you >> > are an autovacuum worker which is just starting up and so getting >> its >> > initial stats, you can tolerate a stats file up to >> > "autovacuum_naptime >> > / 5.0" stale. If you are already started up and are just about to >> > vacuum a table, then keep the staleness at PGSTAT_RETRY_DELAY as >> it >> > currently is, so as not to redundantly vacuum a table. >> >> I always thought there's a "no more than one worker per database" >> limit, > > There is no such limitation. OK, thanks. Still, reading/writing the small (per-database) files would be much faster so it would be easy to read/write them more often on demand. Tomas
On Wed, Sep 26, 2012 at 8:25 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Dne 26.09.2012 16:51, Jeff Janes napsal:
>
>> What is generating the endless stream you are seeing is that you have
>> 1000 databases so if naptime is one minute you are vacuuming 16 per
>> second. Since every database gets a new process, that process needs
>> to read the file as it doesn't inherit one.
>
> Right. But that makes the 10ms timeout even more strange, because the
> worker is then using the data for very long time (even minutes).

On average that can't happen, or else your vacuuming would fall way behind. But I agree, there is no reason to have very fresh statistics to start with. naptime/5 seems like a good cutoff to me for the start-up reading. If a table only becomes eligible for vacuuming in the last 20% of the naptime, I see no reason it can't wait another round. But that just means the statistics collector needs to write the file less often; the workers still need to read it once per database, since each one only vacuums one database and doesn't inherit the data from the launcher.

>> I think forking it off to another value would be better. If you
>> are an autovacuum worker which is just starting up and so getting its
>> initial stats, you can tolerate a stats file up to "autovacuum_naptime
>> / 5.0" stale. If you are already started up and are just about to
>> vacuum a table, then keep the staleness at PGSTAT_RETRY_DELAY as it
>> currently is, so as not to redundantly vacuum a table.
>
> I always thought there's a "no more than one worker per database" limit,
> and that the file is always reloaded when switching to another database.
> So I'm not sure how could a worker see such a stale table info? Or are
> the workers keeping the stats across multiple databases?

If you only have one "active" database, then all the workers will be in it. I don't know how likely it is that they will leapfrog each other and collide. But anyway, if you have 1000s of databases, then each one will generally require zero vacuums per naptime (as you say, it is mostly read-only), so it is the reads upon start-up, not the reads per table needing vacuuming, that generate most of the traffic. Once you separate those two parameters out, playing around with the PGSTAT_RETRY_DELAY one seems like a needless risk.

>>> 3) logic detecting the proper PGSTAT_RETRY_DELAY value - based mostly on the time
>>>    it takes to write the file (e.g. 10x the write time or something).
>>
>> This is already in place.
>
> Really? Where?

I had thought that this part was effectively the same thing:

     * We don't recompute min_ts after sleeping, except in the
     * unlikely case that cur_ts went backwards.

But I think I did not understand your proposal.

> I've checked the current master, and the only thing I see in pgstat_write_statsfile
> is this (line 3558):
>
>     last_statwrite = globalStats.stats_timestamp;
>
> https://github.com/postgres/postgres/blob/master/src/backend/postmaster/pgstat.c#L3558
>
> I don't think that's doing what I meant. That really doesn't scale the timeout
> according to write time. What happens right now is that when the stats file is
> written at time 0 (starts at zero, write finishes at 100 ms), and a worker asks
> for the file at 99 ms (i.e. 1ms before the write finishes), it will set the time
> of the inquiry to last_statrequest and then do this
>
>     if (last_statwrite < last_statrequest)
>         pgstat_write_statsfile(false);
>
> i.e. comparing it to the start of the write.
> So another write will start right
> after the file is written out. And over and over.

Ah. I had wondered about this before too, and wondered if it would be a good idea to have it go back to the beginning of the stats file, and overwrite the timestamp with the current time (rather than the time it started writing it), as the last action it does before the rename. I think that would automatically make it adaptive to the time it takes to write out the file, in a fairly simple way.

> Moreover there's the 'rename' step making the new file invisible for the worker
> processes, which makes the thing a bit more complicated.

I think renames are assumed to be atomic. Either it sees the old one, or the new one, but never sees neither.

Cheers,
Jeff
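That "stamp the file at the end of the write" idea would look roughly like this (a rough sketch with error handling omitted; the header layout and the helper name are assumptions, not the real pgstat.stat format):

    /*
     * Sketch: serialize everything first, then - as the last action before
     * the rename - seek back to the header and overwrite the timestamp with
     * the time the write actually finished.  Error handling omitted.
     */
    static void
    write_statsfile_stamped_at_end(const char *tmpname, const char *statname)
    {
        FILE       *fp = fopen(tmpname, "wb");
        int32       format_id = PGSTAT_FILE_FORMAT_ID;
        TimestampTz ts = 0;     /* placeholder, rewritten below */

        fwrite(&format_id, sizeof(format_id), 1, fp);
        fwrite(&ts, sizeof(ts), 1, fp);         /* reserve room for the timestamp */

        /* ... write all database/table/function entries ... */

        ts = GetCurrentTimestamp();             /* time the write finished */
        fseek(fp, sizeof(format_id), SEEK_SET);
        fwrite(&ts, sizeof(ts), 1, fp);
        fclose(fp);

        rename(tmpname, statname);              /* atomic swap into place */
    }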
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Excerpts from Euler Taveira's message of mié sep 26 11:53:27 -0300 2012: >> On 26-09-2012 09:43, Tomas Vondra wrote: >>> 5) splitting the single stat file into multiple pieces - e.g. per database, >>> written separately, so that the autovacuum workers don't need to read all >>> the data even for databases that don't need to be vacuumed. This might be >>> combined with (4). >> IMHO that's the definitive solution. It would be one file per database plus a >> global one. That way, the check would only read the global.stat and process >> those database that were modified. Also, an in-memory map could store that >> information to speed up the checks. > +1 That would help for the case of hundreds of databases, but how much does it help for lots of tables in a single database? I'm a bit suspicious of the idea that we should encourage people to use hundreds of databases per installation anyway: the duplicated system catalogs are going to be mighty expensive, both in disk space and in their cache footprint in shared buffers. There was some speculation at the last PGCon about how we might avoid the duplication, but I think we're years away from any such thing actually happening. What seems to me like it could help more is fixing things so that the autovac launcher needn't even launch a child process for databases that haven't had any updates lately. I'm not sure how to do that, but it probably involves getting the stats collector to produce some kind of summary file. regards, tom lane
On Wed, Sep 26, 2012 at 9:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Excerpts from Euler Taveira's message of mié sep 26 11:53:27 -0300 2012:
>>> On 26-09-2012 09:43, Tomas Vondra wrote:
>>>> 5) splitting the single stat file into multiple pieces - e.g. per database,
>>>>    written separately, so that the autovacuum workers don't need to read all
>>>>    the data even for databases that don't need to be vacuumed. This might be
>>>>    combined with (4).
>
>>> IMHO that's the definitive solution. It would be one file per database plus a
>>> global one. That way, the check would only read the global.stat and process
>>> those database that were modified. Also, an in-memory map could store that
>>> information to speed up the checks.
>
>> +1
>
> That would help for the case of hundreds of databases, but how much
> does it help for lots of tables in a single database?

It doesn't help that case, but that case doesn't need much help. If you have N statistics-kept objects in total spread over M databases, of which T objects need vacuuming per naptime, the stats file traffic is proportional to N*(M+T). If T is low, then there generally is no problem if M is also low. Or at least, the problem is much smaller than when M is high for a fixed value of N.

> I'm a bit suspicious of the idea that we should encourage people to use
> hundreds of databases per installation anyway:

I agree with that, but we could still do a better job of tolerating it without encouraging it. If someone volunteers to write the code to do this, what trade-offs would there be?

Cheers,
Jeff
On 26.9.2012 18:29, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Excerpts from Euler Taveira's message of mié sep 26 11:53:27 -0300 2012:
>>> On 26-09-2012 09:43, Tomas Vondra wrote:
>>>> 5) splitting the single stat file into multiple pieces - e.g. per database,
>>>>    written separately, so that the autovacuum workers don't need to read all
>>>>    the data even for databases that don't need to be vacuumed. This might be
>>>>    combined with (4).
>
>>> IMHO that's the definitive solution. It would be one file per database plus a
>>> global one. That way, the check would only read the global.stat and process
>>> those database that were modified. Also, an in-memory map could store that
>>> information to speed up the checks.
>
>> +1
>
> That would help for the case of hundreds of databases, but how much
> does it help for lots of tables in a single database?

Well, it wouldn't, but it wouldn't make it worse either. Or at least that's how I understand it.

> I'm a bit suspicious of the idea that we should encourage people to use
> hundreds of databases per installation anyway: the duplicated system
> catalogs are going to be mighty expensive, both in disk space and in
> their cache footprint in shared buffers. There was some speculation
> at the last PGCon about how we might avoid the duplication, but I think
> we're years away from any such thing actually happening.

You don't need to encourage us to do that ;-) We know it's not perfect and we're considering a better alternative - e.g. several databases (~10) with schemas inside, replacing the current database-only approach. That way we'd get multiple stat files (thus gaining the benefits) with less overhead (shared catalogs). And yes, using tens of thousands of tables (serving as "caches") for a reporting solution is "interesting" (as in the old Chinese curse) too.

> What seems to me like it could help more is fixing things so that the
> autovac launcher needn't even launch a child process for databases that
> haven't had any updates lately. I'm not sure how to do that, but it
> probably involves getting the stats collector to produce some kind of
> summary file.

Yes, I've proposed something like this in my original mail - setting a "dirty" flag on objects (a database in this case) whenever a table in it becomes eligible for vacuum/analyze.

Tomas
On 26.9.2012 18:14, Jeff Janes wrote:
> On Wed, Sep 26, 2012 at 8:25 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> Dne 26.09.2012 16:51, Jeff Janes napsal:
>>
>>> What is generating the endless stream you are seeing is that you have
>>> 1000 databases so if naptime is one minute you are vacuuming 16 per
>>> second. Since every database gets a new process, that process needs
>>> to read the file as it doesn't inherit one.
>>
>> Right. But that makes the 10ms timeout even more strange, because the
>> worker is then using the data for very long time (even minutes).
>
> On average that can't happen, or else your vacuuming would fall way
> behind. But I agree, there is no reason to have very fresh statistics
> to start with. naptime/5 seems like a good cutoff for me for the
> start up reading. If a table only becomes eligible for vacuuming in
> the last 20% of the naptime, I see no reason that it can't wait
> another round. But that just means the statistics collector needs to
> write the file less often, the workers still need to read it once per
> database since each one only vacuums one database and don't inherit
> the data from the launcher.

So what happens if there are two workers vacuuming the same database? Wouldn't that make it more probable that tables were already vacuumed by the other worker? See the comment at the beginning of autovacuum.c, where it also states that the statfile is reloaded before each table (probably because of the calls to autovac_refresh_stats, which in turn calls clear_snapshot).

>>> I think forking it off to another value would be better. If you
>>> are an autovacuum worker which is just starting up and so getting its
>>> initial stats, you can tolerate a stats file up to "autovacuum_naptime
>>> / 5.0" stale. If you are already started up and are just about to
>>> vacuum a table, then keep the staleness at PGSTAT_RETRY_DELAY as it
>>> currently is, so as not to redundantly vacuum a table.
>>
>> I always thought there's a "no more than one worker per database" limit,
>> and that the file is always reloaded when switching to another database.
>> So I'm not sure how could a worker see such a stale table info? Or are
>> the workers keeping the stats across multiple databases?
>
> If you only have one "active" database, then all the workers will be
> in it. I don't how likely it is that they will leap frog each other
> and collide. But anyway, if you 1000s of databases, then each one
> will generally require zero vacuums per naptime (as you say, it is
> mostly read only), so it is the reads upon start-up, not the reads per
> table that needs vacuuming, which generates most of the traffic. Once
> you separate those two parameters out, playing around with the
> PGSTAT_RETRY_DELAY one seems like a needless risk.

OK, right. My fault. Yes, our databases are mostly read-only - more precisely, whenever we load data, we immediately do VACUUM ANALYZE on the tables, so autovacuum never kicks in on them. The only thing it works on are the system catalogs and such.

>>>> 3) logic detecting the proper PGSTAT_RETRY_DELAY value - based mostly on the time
>>>>    it takes to write the file (e.g. 10x the write time or something).
>>>
>>> This is already in place.
>>
>> Really? Where?
>
> I had thought that this part was effectively the same thing:
>
>      * We don't recompute min_ts after sleeping, except in the
>      * unlikely case that cur_ts went backwards.
>
> But I think I did not understand your proposal.
>> I've checked the current master, and the only thing I see in pgstat_write_statsfile
>> is this (line 3558):
>>
>>     last_statwrite = globalStats.stats_timestamp;
>>
>> https://github.com/postgres/postgres/blob/master/src/backend/postmaster/pgstat.c#L3558
>>
>> I don't think that's doing what I meant. That really doesn't scale the timeout
>> according to write time. What happens right now is that when the stats file is
>> written at time 0 (starts at zero, write finishes at 100 ms), and a worker asks
>> for the file at 99 ms (i.e. 1ms before the write finishes), it will set the time
>> of the inquiry to last_statrequest and then do this
>>
>>     if (last_statwrite < last_statrequest)
>>         pgstat_write_statsfile(false);
>>
>> i.e. comparing it to the start of the write. So another write will start right
>> after the file is written out. And over and over.
>
> Ah. I had wondered about this before too, and wondered if it would be
> a good idea to have it go back to the beginning of the stats file, and
> overwrite the timestamp with the current time (rather than the time it
> started writing it), as the last action it does before the rename. I
> think that would automatically make it adaptive to the time it takes
> to write out the file, in a fairly simple way.

Yeah, I was thinking about that too.

>> Moreover there's the 'rename' step making the new file invisible for the worker
>> processes, which makes the thing a bit more complicated.
>
> I think renames are assumed to be atomic. Either it sees the old one,
> or the new one, but never sees neither.

I'm not quite sure what I meant, but not this - I know the renames are atomic. I probably hadn't noticed that inquiries are using min_ts, so I thought that an inquiry sent right after the write starts (with min_ts before the write) would trigger another write, but that's not the case.

regards
Tomas
On 26 September 2012 15:47, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Really, as far as autovacuum is concerned, it would be much more useful > to be able to reliably detect that a table has been recently vacuumed, > without having to request a 10ms-recent pgstat snapshot. That would > greatly reduce the amount of time autovac spends on pgstat requests. VACUUMing generates a relcache invalidation. Can we arrange for those invalidations to be received by autovac launcher, so it gets immediate feedback of recent activity without polling? -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Excerpts from Simon Riggs's message of jue sep 27 06:51:28 -0300 2012: > On 26 September 2012 15:47, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > Really, as far as autovacuum is concerned, it would be much more useful > > to be able to reliably detect that a table has been recently vacuumed, > > without having to request a 10ms-recent pgstat snapshot. That would > > greatly reduce the amount of time autovac spends on pgstat requests. > > VACUUMing generates a relcache invalidation. Can we arrange for those > invalidations to be received by autovac launcher, so it gets immediate > feedback of recent activity without polling? Hmm, this is an interesting idea worth exploring, I think. Maybe we should sort tables in the autovac worker to-do list by age of last invalidation messages received, or something like that. Totally unclear on the details, but as I said, worth exploring. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 27 September 2012 15:57, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Excerpts from Simon Riggs's message of jue sep 27 06:51:28 -0300 2012:
>> On 26 September 2012 15:47, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>>
>>> Really, as far as autovacuum is concerned, it would be much more useful
>>> to be able to reliably detect that a table has been recently vacuumed,
>>> without having to request a 10ms-recent pgstat snapshot. That would
>>> greatly reduce the amount of time autovac spends on pgstat requests.
>>
>> VACUUMing generates a relcache invalidation. Can we arrange for those
>> invalidations to be received by autovac launcher, so it gets immediate
>> feedback of recent activity without polling?
>
> Hmm, this is an interesting idea worth exploring, I think. Maybe we
> should sort tables in the autovac worker to-do list by age of last
> invalidation messages received, or something like that. Totally unclear
> on the details, but as I said, worth exploring.

Just put them to the back of the queue if an inval is received. There is already support for listening to and yet never generating relcache inval messages.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi!

On 26.9.2012 19:18, Jeff Janes wrote:
> On Wed, Sep 26, 2012 at 9:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> Excerpts from Euler Taveira's message of mié sep 26 11:53:27 -0300 2012:
>>>> On 26-09-2012 09:43, Tomas Vondra wrote:
>>>>> 5) splitting the single stat file into multiple pieces - e.g. per database,
>>>>>    written separately, so that the autovacuum workers don't need to read all
>>>>>    the data even for databases that don't need to be vacuumed. This might be
>>>>>    combined with (4).
>>
>>>> IMHO that's the definitive solution. It would be one file per database plus a
>>>> global one. That way, the check would only read the global.stat and process
>>>> those database that were modified. Also, an in-memory map could store that
>>>> information to speed up the checks.
>>
>>> +1
>>
>> That would help for the case of hundreds of databases, but how much
>> does it help for lots of tables in a single database?
>
> It doesn't help that case, but that case doesn't need much help. If
> you have N statistics-kept objects in total spread over M databases,
> of which T objects need vacuuming per naptime, the stats file traffic
> is proportional to N*(M+T). If T is low, then there generally is
> no problem if M is also low. Or at least, the problem is much smaller
> than when M is high for a fixed value of N.

I've done some initial hacking on splitting the stat file into multiple smaller pieces over the weekend, and it seems promising (at least with respect to the issues we're having). See the patch attached, but be aware that this is a very early WIP (or rather a proof of concept), so it has many rough edges (read "sloppy coding"). I haven't even added it to the commitfest yet ...

The two main changes are these:

(1) The stats file is split into a common "db" file, containing all the DB entries, and per-database files with tables/functions. The common file is still called "pgstat.stat", the per-db files have the database OID appended, so for example "pgstat.stat.12345" etc.

    This was a trivial hack of the pgstat_read_statsfile/pgstat_write_statsfile functions, introducing two new functions:

        pgstat_read_db_statsfile
        pgstat_write_db_statsfile

    that do the trick of reading/writing the stat file for one database.

(2) pgstat_read_statsfile has an additional parameter "onlydbs" that says that you don't need table/func stats - just the list of db entries. This is used for the autovacuum launcher, which does not need to read the table/func stats (if I'm reading the code in autovacuum.c correctly - it seems to be working as expected).

So what are the benefits?

(a) When the launcher asks for info about databases, something like this is called in the end:

        pgstat_read_db_statsfile(InvalidOid, false, true)

    which means all databases (InvalidOid) and only db info (true). So it reads only the one common file with db entries, not the table/func stats.

(b) When a worker asks for stats for a given DB, something like this is called in the end:

        pgstat_read_db_statsfile(MyDatabaseId, false, false)

    which reads only the common stats file (with db entries) and only one file for the one database.

With the current implementation (the single pgstat.stat file), all the data had to be read (and skipped silently) in both cases. That's a lot of CPU time, and we're seeing ~60% of CPU spent on doing just this (writing/reading the huge statsfile). So with a lot of databases/objects, this "pgstat.stat split" saves us a lot of CPU ...
(c) This should lower the space requirements too - with a single file, you actually need at least 2x the disk space (or RAM, if you're using tmpfs as we are), because you need to keep two versions of the file at the same time (pgstat.stat and pgstat.tmp). Thanks to this split you only need additional space for a copy of the largest piece (with some reasonable safety reserve).

Well, it's a very early patch, so there are rough edges too:

(a) It does not solve the "many-schema" scenario at all - that'll need a completely new approach I guess :-(

(b) It does not solve the writing part at all - the current code uses a single timestamp (last_statwrite) to decide if a new file needs to be written.

    That clearly is not enough for multiple files - there should be one timestamp for each database/file. I'm thinking about how to solve this and how to integrate it with pgstat_send_inquiry etc. One way might be adding the timestamp(s) into PgStat_StatDBEntry, and another is using an array of inquiries, one for each database.

    And yet another one I'm thinking about is using a fixed-length array of timestamps (e.g. 256), indexed by mod(dboid,256). That would mean stats for all databases with the same mod(oid,256) would be written at the same time. Seems like over-engineering though.

(c) I'm a bit worried about the number of files - right now there's one for each database, and I'm thinking about splitting them by type (one for tables, one for functions), which might make it even faster for some apps with a lot of stored procedures etc.

    But is the large number of files actually a problem? After all, we're using one file per relation fork in the "base" directory, so this seems like a minor issue. And if it really is an issue, this might be solved by the mod(oid,256) trick to combine multiple files into one (which would work neatly with the fixed-length array of timestamps).

kind regards
Tomas
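For readers who want the shape of the write side, it boils down to something like this (a simplified sketch, not the patch itself; the loop body, the pgstat_write_db_statsfile signature and the get_dbstat_filename helper are illustrative assumptions):

    /*
     * Sketch of the split write: one common file with the per-database
     * entries, plus one small file per database with its table/function
     * stats.  Names and signatures here are simplified assumptions.
     */
    static void
    write_all_statsfiles(bool permanent)
    {
        HASH_SEQ_STATUS     hstat;
        PgStat_StatDBEntry *dbentry;

        /* common file: database entries only, no table/function stats */
        pgstat_write_statsfile(permanent);

        hash_seq_init(&hstat, pgStatDBHash);
        while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
        {
            char        path[MAXPGPATH];

            /* e.g. "pgstat.stat.<dboid>" - helper name is made up here */
            get_dbstat_filename(path, sizeof(path), dbentry->databaseid);

            /* per-db file: table and function stats for this database only */
            pgstat_write_db_statsfile(dbentry, path, permanent);
        }
    }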
On 26.9.2012 18:29, Tom Lane wrote:
> What seems to me like it could help more is fixing things so that the
> autovac launcher needn't even launch a child process for databases that
> haven't had any updates lately. I'm not sure how to do that, but it
> probably involves getting the stats collector to produce some kind of
> summary file.

Couldn't we use PgStat_StatDBEntry for this? By splitting the pgstat.stat file into multiple pieces (see my other post in this thread) there's a file with StatDBEntry items only, so maybe it could be used as the summary file ...

I've been thinking about this:

(a) add a "needs_autovacuuming" flag to PgStat_(TableEntry|StatDBEntry)

(b) when table stats are updated, run a quick check to decide whether the table needs to be processed by autovacuum (vacuumed or analyzed), and if yes then set needs_autovacuuming=true both for the table and the database

The worker may read the DB entries from the file and act only on those that need to be processed (those with needs_autovacuuming=true). Maybe the DB-level field could be a counter of tables that need to be processed, and the autovacuum daemon might act on those first? Although the simpler the better, I guess.

Or did you mean something else?

regards
Tomas
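Concretely, (a) and (b) might look something like this for the collector-side entries (a minimal sketch with the existing fields elided; the flag and the threshold check are the hypothetical parts):

    /*
     * Sketch of the "dirty flag" idea: existing fields are elided ("...");
     * needs_autovacuuming is the proposed addition on both entry types.
     */
    typedef struct PgStat_StatDBEntry
    {
        Oid         databaseid;
        /* ... existing per-database counters ... */
        bool        needs_autovacuuming;    /* any table here needs a worker? */
    } PgStat_StatDBEntry;

    typedef struct PgStat_StatTabEntry
    {
        Oid         tableid;
        /* ... existing per-table counters ... */
        bool        needs_autovacuuming;    /* crossed vacuum/analyze threshold? */
    } PgStat_StatTabEntry;

    /*
     * (b) would then hook into the place where table stats are updated,
     * e.g. pgstat_recv_tabstat(), roughly:
     *
     *     if (dead/changed tuples exceed the autovacuum thresholds)
     *     {
     *         tabentry->needs_autovacuuming = true;
     *         dbentry->needs_autovacuuming = true;
     *     }
     */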
On Sun, Nov 18, 2012 at 5:49 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > The two main changes are these: > > (1) The stats file is split into a common "db" file, containing all the > DB Entries, and per-database files with tables/functions. The common > file is still called "pgstat.stat", the per-db files have the > database OID appended, so for example "pgstat.stat.12345" etc. > > This was a trivial hack pgstat_read_statsfile/pgstat_write_statsfile > functions, introducing two new functions: > > pgstat_read_db_statsfile > pgstat_write_db_statsfile > > that do the trick of reading/writing stat file for one database. > > (2) The pgstat_read_statsfile has an additional parameter "onlydbs" that > says that you don't need table/func stats - just the list of db > entries. This is used for autovacuum launcher, which does not need > to read the table/stats (if I'm reading the code in autovacuum.c > correctly - it seems to be working as expected). I'm not an expert on the stats system, but this seems like a promising approach to me. > (a) It does not solve the "many-schema" scenario at all - that'll need > a completely new approach I guess :-( We don't need to solve every problem in the first patch. I've got no problem kicking this one down the road. > (b) It does not solve the writing part at all - the current code uses a > single timestamp (last_statwrite) to decide if a new file needs to > be written. > > That clearly is not enough for multiple files - there should be one > timestamp for each database/file. I'm thinking about how to solve > this and how to integrate it with pgstat_send_inquiry etc. Presumably you need a last_statwrite for each file, in a hash table or something, and requests need to specify which file is needed. > And yet another one I'm thinking about is using a fixed-length > array of timestamps (e.g. 256), indexed by mod(dboid,256). That > would mean stats for all databases with the same mod(oid,256) would > be written at the same time. Seems like an over-engineering though. That seems like an unnecessary kludge. > (c) I'm a bit worried about the number of files - right now there's one > for each database and I'm thinking about splitting them by type > (one for tables, one for functions) which might make it even faster > for some apps with a lot of stored procedures etc. > > But is the large number of files actually a problem? After all, > we're using one file per relation fork in the "base" directory, so > this seems like a minor issue. I don't see why one file per database would be a problem. After all, we already have on directory per database inside base/. If the user has so many databases that dirent lookups in a directory of that size are a problem, they're already hosed, and this will probably still work out to a net win. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 21.11.2012 19:02, Robert Haas wrote: > On Sun, Nov 18, 2012 at 5:49 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> The two main changes are these: >> >> (1) The stats file is split into a common "db" file, containing all the >> DB Entries, and per-database files with tables/functions. The common >> file is still called "pgstat.stat", the per-db files have the >> database OID appended, so for example "pgstat.stat.12345" etc. >> >> This was a trivial hack pgstat_read_statsfile/pgstat_write_statsfile >> functions, introducing two new functions: >> >> pgstat_read_db_statsfile >> pgstat_write_db_statsfile >> >> that do the trick of reading/writing stat file for one database. >> >> (2) The pgstat_read_statsfile has an additional parameter "onlydbs" that >> says that you don't need table/func stats - just the list of db >> entries. This is used for autovacuum launcher, which does not need >> to read the table/stats (if I'm reading the code in autovacuum.c >> correctly - it seems to be working as expected). > > I'm not an expert on the stats system, but this seems like a promising > approach to me. > >> (a) It does not solve the "many-schema" scenario at all - that'll need >> a completely new approach I guess :-( > > We don't need to solve every problem in the first patch. I've got no > problem kicking this one down the road. > >> (b) It does not solve the writing part at all - the current code uses a >> single timestamp (last_statwrite) to decide if a new file needs to >> be written. >> >> That clearly is not enough for multiple files - there should be one >> timestamp for each database/file. I'm thinking about how to solve >> this and how to integrate it with pgstat_send_inquiry etc. > > Presumably you need a last_statwrite for each file, in a hash table or > something, and requests need to specify which file is needed. > >> And yet another one I'm thinking about is using a fixed-length >> array of timestamps (e.g. 256), indexed by mod(dboid,256). That >> would mean stats for all databases with the same mod(oid,256) would >> be written at the same time. Seems like an over-engineering though. > > That seems like an unnecessary kludge. > >> (c) I'm a bit worried about the number of files - right now there's one >> for each database and I'm thinking about splitting them by type >> (one for tables, one for functions) which might make it even faster >> for some apps with a lot of stored procedures etc. >> >> But is the large number of files actually a problem? After all, >> we're using one file per relation fork in the "base" directory, so >> this seems like a minor issue. > > I don't see why one file per database would be a problem. After all, > we already have on directory per database inside base/. If the user > has so many databases that dirent lookups in a directory of that size > are a problem, they're already hosed, and this will probably still > work out to a net win. Attached is a v2 of the patch, fixing some of the issues and unclear points from the initial version. The main improvement is that it implements writing only stats for the requested database (set when sending inquiry). There's a dynamic array of request - for each DB only the last request is kept. I've done a number of changes - most importantly: - added a stats_timestamp field to PgStat_StatDBEntry, keeping the last write of the database (i.e. 
a per-database last_statwrite), which is used to decide whether the file is stale or not

- handling of the 'permanent' flag correctly (used when starting or stopping the cluster) for per-db files

- added a very simple header to the per-db files (basically just a format ID and a timestamp) - this is needed for checking the timestamp of the last write from the workers (although maybe we could just read the pgstat.stat, which is now rather small)

- a 'force' parameter (true - write all databases, even if they weren't specifically requested)

So with the exception of the 'multi-schema' case (which was not the aim of this effort), it should solve all the issues of the initial version.

There are two blocks of code dealing with clock glitches. I haven't fixed those yet, but that can wait I guess. I've also left in some logging I used during development (printing inquiries and which file is written and when).

The main unsolved problem I'm struggling with is what to do when a database is dropped. Right now, the statfile remains in pg_stat_tmp forever (or until a restart) - is there a good way to remove the file? I'm thinking about adding a message to be sent to the collector from the code that handles DROP DATABASE.

I've done some very simple performance testing - I've created 1000 databases with 1000 tables each, and done ANALYZE on all of them. With only autovacuum running, I've seen this:

Without the patch
-----------------

    %CPU  %MEM  TIME+    COMMAND
    18    3.0   0:10.10  postgres: autovacuum launcher process
    17    2.6   0:11.44  postgres: stats collector process

The I/O was seriously bogged down, doing ~150 MB/s (basically what the drive can handle) - with fewer dbs, or when the statfiles are placed on a tmpfs filesystem, we usually see ~70% of one core doing just this.

With the patch
--------------

Then, the typical "top" for PostgreSQL processes looked like this:

    %CPU  %MEM  TIME+    COMMAND
    2     0.3   1:16.57  postgres: autovacuum launcher process
    2     3.1   0:25.34  postgres: stats collector process

and the average write speed from the stats collector was ~3.5 MB/s (measured using iotop), and even when running the ANALYZE etc. I was getting rather light IO usage (like ~15 MB/s or something).

In both cases, the total size was ~150 MB, but without the patch the space requirements are actually 2x that (because of writing a copy and then renaming).

I'd like to put this into the 2013-01 commitfest, but if we can do some prior testing / comments first, that'd be great.

regards
Tomas
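The "very simple header" mentioned above is essentially just this (a sketch; the struct name, field order and the helper are illustrative, not the patch's exact code):

    /*
     * Sketch of the tiny per-database statsfile header: a format ID plus the
     * timestamp of the write, so a worker can check freshness by reading just
     * a few bytes instead of parsing the whole file.  Names are illustrative.
     */
    typedef struct DbStatsFileHeader
    {
        int32       format_id;          /* PGSTAT_FILE_FORMAT_ID */
        TimestampTz stats_timestamp;    /* when this database's file was written */
    } DbStatsFileHeader;

    static bool
    read_db_statsfile_timestamp(const char *path, TimestampTz *ts)
    {
        DbStatsFileHeader hdr;
        FILE       *fp = fopen(path, "rb");

        if (fp == NULL)
            return false;

        if (fread(&hdr, sizeof(hdr), 1, fp) != 1 ||
            hdr.format_id != PGSTAT_FILE_FORMAT_ID)
        {
            fclose(fp);
            return false;
        }

        fclose(fp);
        *ts = hdr.stats_timestamp;
        return true;
    }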
PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From: Tomas Vondra
Hi,

attached is a version of the patch that I believe is ready for the commitfest. As the patch was discussed over a large number of messages, I've prepared a brief summary for those who did not follow the thread.

Issue
=====

The patch aims to improve the situation in deployments with many tables in many databases (think for example 1000 tables in 1000 databases). Currently all the stats for all the objects (dbs, tables and functions) are written into a single file (pgstat.stat), which may get quite large, and that consequently leads to various issues:

1) I/O load - the file is frequently written / read, which may use a significant part of the I/O bandwidth. For example, we have to deal with cases where the pgstat.stat size is >150MB, and it's written (and read) continuously (once it's written, a new write starts) and utilizes 100% of the bandwidth on that device.

2) CPU load - a common solution to the previous issue is moving the file into RAM, using a tmpfs filesystem. That "fixes" the I/O bottleneck but causes high CPU load because the system is serializing and deserializing large amounts of data. We often see ~1 CPU core "lost" due to this (and causing higher power consumption, but that's Amazon's problem ;-)

3) disk space utilization - the pgstat.stat file is updated in two steps, i.e. a new version is written to another file (pgstat.tmp) and then it's renamed to pgstat.stat, which means the device (amount of RAM, if using a tmpfs device) needs to be >2x the actual size of the file. (Actually more, because there may be descriptors open to multiple versions of the file.)

This patch does not attempt to fix the "single DB with multiple schemas" scenario, although it should not have a negative impact on it.

What the patch does
===================

1) split into global and per-db files
-------------------------------------

The patch "splits" the huge pgstat.stat file into smaller pieces - one "global" one (global.stat) with the database stats, and one file for each of the databases (oid.stat) with its table/function stats.

This makes it possible to write/read much smaller amounts of data, because

a) the autovacuum launcher does not need to read the whole file - it needs just the list of databases (and not the table/func stats)

b) autovacuum workers request a fresh copy of a single database, so the stats collector may write just global.stat + one of the per-db files

and that consequently leads to much lower I/O and CPU load. During our tests we've seen the I/O drop from ~150MB/s to less than 4MB/s, and much lower CPU utilization.

2) a new global/stat directory
------------------------------

The pgstat.stat file was originally saved into the "global" directory, but with so many files that would get rather messy, so I've created a new global/stat directory and all the files are stored there.

This also means we can do a simple "delete files in the dir" when pgstat_reset_all is called.

3) pgstat_(read|write)_statsfile split
--------------------------------------

These two functions were split into global and per-db functions, so now there's

    pgstat_write_statsfile      -- global.stat
    pgstat_write_db_statsfile   -- oid.stat

    pgstat_read_statsfile       -- global.stat
    pgstat_read_db_statsfile    -- oid.stat

There's logic to read/write only those files that are actually needed.

4) list of (OID, timestamp) inquiries, last db-write
----------------------------------------------------

Originally there was a single pair of request/write timestamps for the whole file, updated whenever a worker requested a fresh file or when the file was written.
With the split, this had to be replaced by two lists - a timestamp of the last inquiry (per DB), and a timestamp when each database file was written for the last time. The timestamp of the last DB write was added to the PgStat_StatDBEntry and the list of inquiries is kept in last_statrequests. The fields are used at several places, so it's probably best to see the code. Handling the timestamps is a rather complex stuff because of the clock skews. One of those checks is not needed as the list of inquiries is freed right after writing all the databases. But I wouldn't be surprised if there was something I missed, as splitting the file into multiple pieces made this part more complex. So please, if you're going to review this patch this is one of the tricky places. 5) dummy file ------------- A special handling is necessary when an inquiry arrives for a database without a PgStat_StatDBEntry - this happens for example right after initdb, when there are no stats for template0 and template1, yet the autovacuum workers do send inqiries for them. The backend_read_statsfile now uses the timestamp stored in the header of the per-db file (not in the global one), and the easies way to handle this for new databases is writing an empty 'dummy file' (just a header with timestamp). Without this, this would result in 'pgstat wait timeout' errors. This is what pgstat_write_db_dummyfile (used in pgstat_write_statsfile) is for. 6) format ID ------------ I've bumped PGSTAT_FILE_FORMAT_ID to a new random value, although the filenames changed to so we could live with the old ID just fine. We've done a fair amount of testing so far, and if everything goes fine we plan to deploy a back-ported version of this patch (to 9.1) on a production in ~2 weeks. Then I'll be able to provide some numbers from a real-world workload (although our deployment and workload is not quite usual I guess). regards
Attachment
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Heikki Linnakangas
Date:
On 03.01.2013 01:15, Tomas Vondra wrote: > 2) a new global/stat directory > ------------------------------ > > The pgstat.stat file was originally saved into the "global" directory, > but with so many files that would get rather messy so I've created a new > global/stat directory and all the files are stored there. > > This also means we can do a simple "delete files in the dir" when > pgstat_reset_all is called. How about creating the new directory as a direct subdir of $PGDATA, rather than buried in global? "global" is supposed to contain data related to shared catalog relations (plus pg_control), so it doesn't seem like the right location for per-database stat files. Also, if we're going to have admins manually zapping the directory (hopefully when the system is offline), that's less scary if the directory is not buried as deep. - Heikki
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 3.1.2013 18:47, Heikki Linnakangas wrote: > On 03.01.2013 01:15, Tomas Vondra wrote: >> 2) a new global/stat directory >> ------------------------------ >> >> The pgstat.stat file was originally saved into the "global" directory, >> but with so many files that would get rather messy so I've created a new >> global/stat directory and all the files are stored there. >> >> This also means we can do a simple "delete files in the dir" when >> pgstat_reset_all is called. > > How about creating the new directory as a direct subdir of $PGDATA, > rather than buried in global? "global" is supposed to contain data > related to shared catalog relations (plus pg_control), so it doesn't > seem like the right location for per-database stat files. Also, if we're > going to have admins manually zapping the directory (hopefully when the > system is offline), that's less scary if the directory is not buried as > deep. That's clearly possible and it's a trivial change. I was thinking about that actually, but then I placed the directory into "global" because that's where the "pgstat.stat" originally was. Tomas
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Magnus Hagander
Date:
On Thu, Jan 3, 2013 at 8:31 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > On 3.1.2013 18:47, Heikki Linnakangas wrote: >> On 03.01.2013 01:15, Tomas Vondra wrote: >>> 2) a new global/stat directory >>> ------------------------------ >>> >>> The pgstat.stat file was originally saved into the "global" directory, >>> but with so many files that would get rather messy so I've created a new >>> global/stat directory and all the files are stored there. >>> >>> This also means we can do a simple "delete files in the dir" when >>> pgstat_reset_all is called. >> >> How about creating the new directory as a direct subdir of $PGDATA, >> rather than buried in global? "global" is supposed to contain data >> related to shared catalog relations (plus pg_control), so it doesn't >> seem like the right location for per-database stat files. Also, if we're >> going to have admins manually zapping the directory (hopefully when the >> system is offline), that's less scary if the directory is not buried as >> deep. > > That's clearly possible and it's a trivial change. I was thinking about > that actually, but then I placed the directory into "global" because > that's where the "pgstat.stat" originally was. Yeah, +1 for a separate directory not in global. --Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 3.1.2013 20:33, Magnus Hagander wrote: > On Thu, Jan 3, 2013 at 8:31 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> On 3.1.2013 18:47, Heikki Linnakangas wrote: >>> How about creating the new directory as a direct subdir of $PGDATA, >>> rather than buried in global? "global" is supposed to contain data >>> related to shared catalog relations (plus pg_control), so it doesn't >>> seem like the right location for per-database stat files. Also, if we're >>> going to have admins manually zapping the directory (hopefully when the >>> system is offline), that's less scary if the directory is not buried as >>> deep. >> >> That's clearly possible and it's a trivial change. I was thinking about >> that actually, but then I placed the directory into "global" because >> that's where the "pgstat.stat" originally was. > > Yeah, +1 for a separate directory not in global. OK, I moved the files from "global/stat" to "stat". Tomas
Attachment
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Alvaro Herrera
Date:
Tomas Vondra wrote: > diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c > index be3adf1..4ec485e 100644 > --- a/src/backend/postmaster/pgstat.c > +++ b/src/backend/postmaster/pgstat.c > @@ -64,10 +64,14 @@ > > /* ---------- > * Paths for the statistics files (relative to installation's $PGDATA). > + * Permanent and temprorary, global and per-database files. Note typo in the line above. > -#define PGSTAT_STAT_PERMANENT_FILENAME "global/pgstat.stat" > -#define PGSTAT_STAT_PERMANENT_TMPFILE "global/pgstat.tmp" > +#define PGSTAT_STAT_PERMANENT_DIRECTORY "stat" > +#define PGSTAT_STAT_PERMANENT_FILENAME "stat/global.stat" > +#define PGSTAT_STAT_PERMANENT_TMPFILE "stat/global.tmp" > +#define PGSTAT_STAT_PERMANENT_DB_FILENAME "stat/%d.stat" > +#define PGSTAT_STAT_PERMANENT_DB_TMPFILE "stat/%d.tmp" > +char *pgstat_stat_directory = NULL; > char *pgstat_stat_filename = NULL; > char *pgstat_stat_tmpname = NULL; > +char *pgstat_stat_db_filename = NULL; > +char *pgstat_stat_db_tmpname = NULL; I don't like the quoted parts very much; it seems awkward to have the snprintf patterns in one place and have them be used in very distant places. Is there a way to improve that? Also, if I understand clearly, the pgstat_stat_db_filename value needs to be an snprintf pattern too, right? What if it doesn't contain the required % specifier? Also, if you can filter this through pgindent, that would be best. Make sure to add DBWriteRequest to src/tools/pgindent/typedefs_list. > + /* > + * There's no request for this DB yet, so lets create it (allocate a > + * space for it, set the values). > + */ > + if (last_statrequests == NULL) > + last_statrequests = palloc(sizeof(DBWriteRequest)); > + else > + last_statrequests = repalloc(last_statrequests, > + (num_statrequests + 1)*sizeof(DBWriteRequest)); > + > + last_statrequests[num_statrequests].databaseid = msg->databaseid; > + last_statrequests[num_statrequests].request_time = msg->clock_time; > + num_statrequests += 1; Having to repalloc this array each time seems wrong. Would a list instead of an array help? see ilist.c/h; I vote for a dlist because you can easily delete elements from the middle of it, if required (I think you'd need that.) > + char db_statfile[strlen(pgstat_stat_db_filename) + 11]; > + snprintf(db_statfile, strlen(pgstat_stat_db_filename) + 11, > + pgstat_stat_filename, dbentry->databaseid); This pattern seems rather frequent. Can we use a macro or similar here? Encapsulating the "11" better would be good. Magic numbers are evil. > diff --git a/src/include/pgstat.h b/src/include/pgstat.h > index 613c1c2..b3467d2 100644 > --- a/src/include/pgstat.h > +++ b/src/include/pgstat.h > @@ -205,6 +205,7 @@ typedef struct PgStat_MsgInquiry > PgStat_MsgHdr m_hdr; > TimestampTz clock_time; /* observed local clock time */ > TimestampTz cutoff_time; /* minimum acceptable file timestamp */ > + Oid databaseid; /* requested DB (InvalidOid => all DBs) */ > } PgStat_MsgInquiry; Do we need to support the case that somebody requests stuff from the "shared" DB? IIRC that's what InvalidOid means in pgstat ... -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Jeff Janes
Date:
On Sat, Jan 5, 2013 at 8:03 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > On 3.1.2013 20:33, Magnus Hagander wrote: >> >> Yeah, +1 for a separate directory not in global. > > OK, I moved the files from "global/stat" to "stat". This has a warning: pgstat.c:5132: warning: 'pgstat_write_statsfile_needed' was used with no prototype before its definition I plan to do some performance testing, but that will take a while so I wanted to post this before I get distracted. Cheers, Jeff
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Jeff Janes
Date:
On Sat, Jan 5, 2013 at 8:03 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > On 3.1.2013 20:33, Magnus Hagander wrote: >> >> Yeah, +1 for a separate directory not in global. > > OK, I moved the files from "global/stat" to "stat". Why "stat" rather than "pg_stat"? The existence of "global" and "base" as exceptions already annoys me. (Especially when I do a tar -xf in my home directory without remembering the -C flag). Unless there is some unstated rule behind what gets a pg_ and what doesn't, I think we should have the "pg_". Cheers, Jeff
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Jeff Janes
Date:
On Sat, Feb 2, 2013 at 2:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Sat, Jan 5, 2013 at 8:03 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> On 3.1.2013 20:33, Magnus Hagander wrote: >>> >>> Yeah, +1 for a separate directory not in global. >> >> OK, I moved the files from "global/stat" to "stat". > > This has a warning: > > pgstat.c:5132: warning: 'pgstat_write_statsfile_needed' was used with > no prototype before its definition > > I plan to do some performance testing, but that will take a while so I > wanted to post this before I get distracted. Running "vacuumdb -a" on a cluster with 1000 db with 200 tables (x serial primary key) in each, I get log messages like this: last_statwrite 23682-06-18 22:36:52.960194-07 is later than collector's time 2013-02-03 12:49:19.700629-08 for db 16387 Note the bizarre year in the first time stamp. If it matters, I got this after shutting down the cluster, blowing away $DATA/stat/*, then restarting it and invoking vacuumdb. Cheers, Jeff
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 3.2.2013 20:46, Jeff Janes wrote: > On Sat, Jan 5, 2013 at 8:03 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> On 3.1.2013 20:33, Magnus Hagander wrote: >>> >>> Yeah, +1 for a separate directory not in global. >> >> OK, I moved the files from "global/stat" to "stat". > > Why "stat" rather than "pg_stat"? > > The existence of "global" and "base" as exceptions already annoys me. > (Especially when I do a tar -xf in my home directory without > remembering the -C flag). Unless there is some unstated rule behind > what gets a pg_ and what doesn't, I think we should have the "pg_". I don't think there's a clear naming rule. But I think your suggestion makes perfect sense, especially because we have pg_stat_tmp directory. So now we'd have pg_stat and pg_stat_tmp, which is quite elegant. Tomas
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 2.2.2013 23:33, Jeff Janes wrote: > On Sat, Jan 5, 2013 at 8:03 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> On 3.1.2013 20:33, Magnus Hagander wrote: >>> >>> Yeah, +1 for a separate directory not in global. >> >> OK, I moved the files from "global/stat" to "stat". > > This has a warning: > > pgstat.c:5132: warning: 'pgstat_write_statsfile_needed' was used with > no prototype before its definition I forgot to add "void" into the method prototype ... Thanks! Tomas
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 3.2.2013 21:54, Jeff Janes wrote: > On Sat, Feb 2, 2013 at 2:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> On Sat, Jan 5, 2013 at 8:03 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >>> On 3.1.2013 20:33, Magnus Hagander wrote: >>>> >>>> Yeah, +1 for a separate directory not in global. >>> >>> OK, I moved the files from "global/stat" to "stat". >> >> This has a warning: >> >> pgstat.c:5132: warning: 'pgstat_write_statsfile_needed' was used with >> no prototype before its definition >> >> I plan to do some performance testing, but that will take a while so I >> wanted to post this before I get distracted. > > Running "vacuumdb -a" on a cluster with 1000 db with 200 tables (x > serial primary key) in each, I get log messages like this: > > last_statwrite 23682-06-18 22:36:52.960194-07 is later than > collector's time 2013-02-03 12:49:19.700629-08 for db 16387 > > Note the bizarre year in the first time stamp. > > If it matters, I got this after shutting down the cluster, blowing > away $DATA/stat/*, then restarting it and invoking vacuumdb. I somehow expected that hash_search zeroes all the fields of a new entry, but looking at pgstat_get_db_entry that obviously is not the case. So stats_timestamp (which tracks timestamp of the last write for a DB) was random - that's where the bizarre year values came from. I've added a proper initialization (to 0), and now it works as expected. Although the whole sequence of errors I was getting was this: LOG: last_statwrite 11133-08-28 19:22:31.711744+02 is later than collector's time 2013-02-04 00:54:21.113439+01 for db 19093 WARNING: pgstat wait timeout LOG: last_statwrite 39681-12-23 18:48:48.9093+01 is later than collector's time 2013-02-04 00:54:31.424681+01 for db 46494 FATAL: could not find block containing chunk 0x2af4a60 LOG: statistics collector process (PID 10063) exited with exit code 1 WARNING: pgstat wait timeout WARNING: pgstat wait timeout I'm not entirely sure where the FATAL came from, but it seems it was somehow related to the issue - it was quite reproducible, although I don't see how exactly could this happen. There relevant block of code looks like this: char *writetime; char *mytime; /* Copy because timestamptz_to_str returns a static buffer */ writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp)); mytime = pstrdup(timestamptz_to_str(cur_ts)); elog(LOG, "last_statwrite %s is later than collector's time %s for " "db %d", writetime, mytime, dbentry->databaseid); pfree(writetime); pfree(mytime); which seems quite fine to mee. I'm not sure how one of the pfree calls could fail? Anyway, attached is a patch that fixes all three issues, i.e. 1) the un-initialized timestamp 2) the "void" ommited from the signature 3) rename to "pg_stat" instead of just "stat" Tomas
Attachment
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 1.2.2013 17:19, Alvaro Herrera wrote: > Tomas Vondra wrote: > >> diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c >> index be3adf1..4ec485e 100644 >> --- a/src/backend/postmaster/pgstat.c >> +++ b/src/backend/postmaster/pgstat.c >> @@ -64,10 +64,14 @@ >> >> /* ---------- >> * Paths for the statistics files (relative to installation's $PGDATA). >> + * Permanent and temprorary, global and per-database files. > > Note typo in the line above. > >> -#define PGSTAT_STAT_PERMANENT_FILENAME "global/pgstat.stat" >> -#define PGSTAT_STAT_PERMANENT_TMPFILE "global/pgstat.tmp" >> +#define PGSTAT_STAT_PERMANENT_DIRECTORY "stat" >> +#define PGSTAT_STAT_PERMANENT_FILENAME "stat/global.stat" >> +#define PGSTAT_STAT_PERMANENT_TMPFILE "stat/global.tmp" >> +#define PGSTAT_STAT_PERMANENT_DB_FILENAME "stat/%d.stat" >> +#define PGSTAT_STAT_PERMANENT_DB_TMPFILE "stat/%d.tmp" > >> +char *pgstat_stat_directory = NULL; >> char *pgstat_stat_filename = NULL; >> char *pgstat_stat_tmpname = NULL; >> +char *pgstat_stat_db_filename = NULL; >> +char *pgstat_stat_db_tmpname = NULL; > > I don't like the quoted parts very much; it seems awkward to have the > snprintf patterns in one place and have them be used in very distant I don't see that as particularly awkward, but that's a matter of taste. I still see that as a bunch of constants that are sprintf patterns at the same time. > places. Is there a way to improve that? Also, if I understand clearly, > the pgstat_stat_db_filename value needs to be an snprintf pattern too, > right? What if it doesn't contain the required % specifier? Ummmm, yes - it needs to be a pattern too, but the user specifies the directory (stats_temp_directory) and this is used to derive all the other values - see assign_pgstat_temp_directory() in guc.c. > Also, if you can filter this through pgindent, that would be best. Make > sure to add DBWriteRequest to src/tools/pgindent/typedefs_list. Will do. > >> + /* >> + * There's no request for this DB yet, so lets create it (allocate a >> + * space for it, set the values). >> + */ >> + if (last_statrequests == NULL) >> + last_statrequests = palloc(sizeof(DBWriteRequest)); >> + else >> + last_statrequests = repalloc(last_statrequests, >> + (num_statrequests + 1)*sizeof(DBWriteRequest)); >> + >> + last_statrequests[num_statrequests].databaseid = msg->databaseid; >> + last_statrequests[num_statrequests].request_time = msg->clock_time; >> + num_statrequests += 1; > > Having to repalloc this array each time seems wrong. Would a list > instead of an array help? see ilist.c/h; I vote for a dlist because you > can easily delete elements from the middle of it, if required (I think > you'd need that.) Thanks. I'm not very familiar with the list interface, so I've used plain array. But yes, there are better ways than doing repalloc all the time. > >> + char db_statfile[strlen(pgstat_stat_db_filename) + 11]; >> + snprintf(db_statfile, strlen(pgstat_stat_db_filename) + 11, >> + pgstat_stat_filename, dbentry->databaseid); > > This pattern seems rather frequent. Can we use a macro or similar here? > Encapsulating the "11" better would be good. Magic numbers are evil. Yes, this needs to be cleaned / improved. 
>> diff --git a/src/include/pgstat.h b/src/include/pgstat.h >> index 613c1c2..b3467d2 100644 >> --- a/src/include/pgstat.h >> +++ b/src/include/pgstat.h >> @@ -205,6 +205,7 @@ typedef struct PgStat_MsgInquiry >> PgStat_MsgHdr m_hdr; >> TimestampTz clock_time; /* observed local clock time */ >> TimestampTz cutoff_time; /* minimum acceptable file timestamp */ >> + Oid databaseid; /* requested DB (InvalidOid => all DBs) */ >> } PgStat_MsgInquiry; > > Do we need to support the case that somebody requests stuff from the > "shared" DB? IIRC that's what InvalidOid means in pgstat ... Frankly, I don't know, but I guess we do because it was in the original code, and there are such inquiries right after the database starts (that's why I had to add pgstat_write_db_dummyfile). Tomas
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Jeff Janes
Date:
On Sun, Feb 3, 2013 at 4:51 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > LOG: last_statwrite 11133-08-28 19:22:31.711744+02 is later than > collector's time 2013-02-04 00:54:21.113439+01 for db 19093 > WARNING: pgstat wait timeout > LOG: last_statwrite 39681-12-23 18:48:48.9093+01 is later than > collector's time 2013-02-04 00:54:31.424681+01 for db 46494 > FATAL: could not find block containing chunk 0x2af4a60 > LOG: statistics collector process (PID 10063) exited with exit code 1 > WARNING: pgstat wait timeout > WARNING: pgstat wait timeout > > I'm not entirely sure where the FATAL came from, but it seems it was > somehow related to the issue - it was quite reproducible, although I > don't see how exactly could this happen. There relevant block of code > looks like this: > > char *writetime; > char *mytime; > > /* Copy because timestamptz_to_str returns a static buffer */ > writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp)); > mytime = pstrdup(timestamptz_to_str(cur_ts)); > elog(LOG, "last_statwrite %s is later than collector's time %s for " > "db %d", writetime, mytime, dbentry->databaseid); > pfree(writetime); > pfree(mytime); > > which seems quite fine to mee. I'm not sure how one of the pfree calls > could fail? I don't recall seeing the FATAL errors myself, but didn't keep the logfile around. (I do recall seeing the pgstat wait timeout). Are you using windows? pstrdup seems to be different there. I'm afraid I don't have much to say on the code. Indeed I never even look at it (other than grepping for pstrdup just now). I am taking a purely experimental approach, Since Alvaro and others have looked at the code. > > Anyway, attached is a patch that fixes all three issues, i.e. > > 1) the un-initialized timestamp > 2) the "void" ommited from the signature > 3) rename to "pg_stat" instead of just "stat" Thanks. If I shutdown the server and blow away the stats with "rm data/pg_stat/*", it recovers gracefully when I start it back up. If a do "rm -r data/pg_stat" then it has problems the next time I shut it down, but I have no right to do that in the first place. If I initdb a database without this patch, then shut it down and restart with binaries that include this patch, and need to manually make the pg_stat directory. Does that mean it needs a catalog bump in order to force an initdb? A review: It applies cleanly (some offsets, no fuzz), builds without warnings, and passes make check including with cassert. The final test done in "make check" inherently tests this code, and it passes. If I intentionally break the patch by making pgstat_read_db_statsfile add one to the oid it opens, then the test fails. So the existing test is at least plausible as a test. doc/src/sgml/monitoring.sgml needs to be changed: "a permanent copy of the statistics data is stored in the global subdirectory". I'm not aware of any other needed changes to the docs. The big question is whether we want this. I think we do. While having hundreds of databases in a cluster is not recommended, that is no reason not to handle it better than we do. I don't see any down-sides, other than possibly some code uglification. Some file systems might not deal well with having lots of small stats files being rapidly written and rewritten, but it is hard to see how the current behavior would be more favorable for those systems. We do not already have this. There is no relevant spec. I can't see how this could need pg_dump support (but what about pg_upgrade?) I am not aware of any dangers. 
I have a question about its completeness. When I first start up the cluster and have not yet touched it, there is very little stats collector activity, either with or without this patch. When I kick the cluster sufficiently (I've been using vacuumdb -a to do that) then there is a lot of stats collector activity. Even once the vacuumdb has long finished, this high level of activity continues even though the database is otherwise completely idle, and this seems to happen for every. This patch makes that high level of activity much more efficient, but it does not reduce the activity. I don't understand why an idle database cannot get back into the state right after start-up. I do not think that the patch needs to solve this problem in order to be accepted, but if it can be addressed while the author and reviewers are paying attention to this part of the system, that would be ideal. And if not, then we should at least remember that there is future work that could be done here. I created 1000 databases each with 200 single column tables (x serial primary key). After vacuumdb -a, I let it idle for a long time to see what steady state was reached. without the patch: vacuumdb -a real 11m2.624s idle steady state: 48.17% user, 39.24% system, 11.78% iowait, 0.81% idle. with the patch: vacuumdb -a real 6m41.306s idle steady state: 7.86% user, 5.00% system 0.09% iowait 87% idle, I also ran pgbench on a scale that fits in memory with fsync=off, on a singe CPU machine. With the same above-mentioned 1000 databases as unused decoys to bloat the stats file. pgbench_tellers and branches undergo enough turn over that they should get vacuumed every minuted (nap time). Without the patch, they only get vacuumed every 40 minutes or so as the autovac workers are so distracted by reading the bloated stats file. and the TPS is ~680. With the patch, they get vacuumed every 1 to 2 minutes and TPS is ~940 So this seems to be effective at it intended goal. I have not done a review of the code itself, only the performance. Cheers, Jeff
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 5.2.2013 19:23, Jeff Janes wrote: > On Sun, Feb 3, 2013 at 4:51 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > >> LOG: last_statwrite 11133-08-28 19:22:31.711744+02 is later than >> collector's time 2013-02-04 00:54:21.113439+01 for db 19093 >> WARNING: pgstat wait timeout >> LOG: last_statwrite 39681-12-23 18:48:48.9093+01 is later than >> collector's time 2013-02-04 00:54:31.424681+01 for db 46494 >> FATAL: could not find block containing chunk 0x2af4a60 >> LOG: statistics collector process (PID 10063) exited with exit code 1 >> WARNING: pgstat wait timeout >> WARNING: pgstat wait timeout >> >> I'm not entirely sure where the FATAL came from, but it seems it was >> somehow related to the issue - it was quite reproducible, although I >> don't see how exactly could this happen. There relevant block of code >> looks like this: >> >> char *writetime; >> char *mytime; >> >> /* Copy because timestamptz_to_str returns a static buffer */ >> writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp)); >> mytime = pstrdup(timestamptz_to_str(cur_ts)); >> elog(LOG, "last_statwrite %s is later than collector's time %s for " >> "db %d", writetime, mytime, dbentry->databaseid); >> pfree(writetime); >> pfree(mytime); >> >> which seems quite fine to mee. I'm not sure how one of the pfree calls >> could fail? > > I don't recall seeing the FATAL errors myself, but didn't keep the > logfile around. (I do recall seeing the pgstat wait timeout). > > Are you using windows? pstrdup seems to be different there. Nope. I'll repeat the test with the original patch to find out what went wrong, just to be sure it was fixed. > I'm afraid I don't have much to say on the code. Indeed I never even > look at it (other than grepping for pstrdup just now). I am taking a > purely experimental approach, Since Alvaro and others have looked at > the code. Thanks for finding the issue with unitialized timestamp! > If I shutdown the server and blow away the stats with "rm > data/pg_stat/*", it recovers gracefully when I start it back up. If a > do "rm -r data/pg_stat" then it has problems the next time I shut it > down, but I have no right to do that in the first place. If I initdb > a database without this patch, then shut it down and restart with > binaries that include this patch, and need to manually make the > pg_stat directory. Does that mean it needs a catalog bump in order to > force an initdb? Ummmm, what you mean by "catalog bump"? Anyway, messing with files in the "base" directory is a bad idea in general, and I don't think that's a reason to treat the pg_stat directory differently. If you remove it by hand, you'll be rightfully punished by various errors. > A review: > > It applies cleanly (some offsets, no fuzz), builds without warnings, > and passes make check including with cassert. > > The final test done in "make check" inherently tests this code, and it > passes. If I intentionally break the patch by making > pgstat_read_db_statsfile add one to the oid it opens, then the test > fails. So the existing test is at least plausible as a test. > > doc/src/sgml/monitoring.sgml needs to be changed: "a permanent copy of > the statistics data is stored in the global subdirectory". I'm not > aware of any other needed changes to the docs. Yeah, that should be "in the global/pg_stat subdirectory". > The big question is whether we want this. I think we do. While > having hundreds of databases in a cluster is not recommended, that is > no reason not to handle it better than we do. 
I don't see any > down-sides, other than possibly some code uglification. Some file > systems might not deal well with having lots of small stats files > being rapidly written and rewritten, but it is hard to see how the > current behavior would be more favorable for those systems. If the filesystem has issues with that many entries, it's already hosed by contents of the "base" directory (one per directory) or in the database directories (multiple files per table). Moreover, it's still possible to use tmpfs to handle this at runtime (which is often the recommended solution with the current code), and use the actual filesystem only for keeping the data across restarts. > We do not already have this. There is no relevant spec. I can't see > how this could need pg_dump support (but what about pg_upgrade?) pg_dump - no pg_upgrage - IMHO it should create the pg_stat directory. I don't think it could "convert" statfile into the new format (by splitting it into the pieces). I haven't checked but I believe the default behavior is to delete it as there might be new fields / slight changes of meaning etc. > I am not aware of any dangers. > > I have a question about its completeness. When I first start up the > cluster and have not yet touched it, there is very little stats > collector activity, either with or without this patch. When I kick > the cluster sufficiently (I've been using vacuumdb -a to do that) then > there is a lot of stats collector activity. Even once the vacuumdb > has long finished, this high level of activity continues even though > the database is otherwise completely idle, and this seems to happen > for every. This patch makes that high level of activity much more > efficient, but it does not reduce the activity. I don't understand > why an idle database cannot get back into the state right after > start-up. What do you mean by "stats collector activity"? Is it reading/writing a lot of data, or is it just using a lot of CPU? Isn't that just a natural and expected behavior because the database needs to actually perform ANALYZE to actually collect the data. Although the tables are empty, it costs some CPU / IO and there's a lot of them (1000 dbs, each with 200 tables). I don't think there's a way around this. You may increase the autovacuum naptime, but that's about all. > I do not think that the patch needs to solve this problem in order to > be accepted, but if it can be addressed while the author and reviewers > are paying attention to this part of the system, that would be ideal. > And if not, then we should at least remember that there is future work > that could be done here. If I understand that correctly, you see the same behaviour even without the patch, right? In that case I'd vote not to make the patch more complex, and try to improve that separately (if it's even possible). > I created 1000 databases each with 200 single column tables (x serial > primary key). > > After vacuumdb -a, I let it idle for a long time to see what steady > state was reached. > > without the patch: > vacuumdb -a real 11m2.624s > idle steady state: 48.17% user, 39.24% system, 11.78% iowait, 0.81% idle. > > with the patch: > vacuumdb -a real 6m41.306s > idle steady state: 7.86% user, 5.00% system 0.09% iowait 87% idle, Nice. Another interesting numbers would be device utilization, average I/O speed and required space (which should be ~2x the pgstat.stat size without the patch). > I also ran pgbench on a scale that fits in memory with fsync=off, on a > singe CPU machine. 
With the same above-mentioned 1000 databases as > unused decoys to bloat the stats file. > > pgbench_tellers and branches undergo enough turn over that they should > get vacuumed every minuted (nap time). > > Without the patch, they only get vacuumed every 40 minutes or so as > the autovac workers are so distracted by reading the bloated stats > file. and the TPS is ~680. > > With the patch, they get vacuumed every 1 to 2 minutes and TPS is ~940 Great, I haven't really aimed to improve pgbench results, but it seems natural that the decreased CPU utilization can go somewhere else. Not bad. Have you moved the stats somewhere to tmpfs, or have you used the default location (on disk)? Tomas
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Pavel Stehule
Date:
>> >> with the patch: >> vacuumdb -a real 6m41.306s >> idle steady state: 7.86% user, 5.00% system 0.09% iowait 87% idle, > > Nice. Another interesting numbers would be device utilization, average > I/O speed and required space (which should be ~2x the pgstat.stat size > without the patch). > this point is important - with large warehouse with lot of databases and tables you have move stat file to some ramdisc - without it you lost lot of IO capacity - and it is very important if you need only half sized ramdisc Regards Pavel
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tom Lane
Date:
Pavel Stehule <pavel.stehule@gmail.com> writes: >> Nice. Another interesting numbers would be device utilization, average >> I/O speed and required space (which should be ~2x the pgstat.stat size >> without the patch). > this point is important - with large warehouse with lot of databases > and tables you have move stat file to some ramdisc - without it you > lost lot of IO capacity - and it is very important if you need only > half sized ramdisc [ blink... ] I confess I'd not been paying close attention to this thread, but if that's true I'd say the patch is DOA. Why should we accept 2x bloat in the already-far-too-large stats file? I thought the idea was just to split up the existing data into multiple files. regards, tom lane
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Alvaro Herrera
Date:
Tom Lane escribió: > Pavel Stehule <pavel.stehule@gmail.com> writes: > >> Nice. Another interesting numbers would be device utilization, average > >> I/O speed and required space (which should be ~2x the pgstat.stat size > >> without the patch). > > > this point is important - with large warehouse with lot of databases > > and tables you have move stat file to some ramdisc - without it you > > lost lot of IO capacity - and it is very important if you need only > > half sized ramdisc > > [ blink... ] I confess I'd not been paying close attention to this > thread, but if that's true I'd say the patch is DOA. Why should we > accept 2x bloat in the already-far-too-large stats file? I thought > the idea was just to split up the existing data into multiple files. I think they are saying just the opposite: maximum disk space utilization is now half of the unpatched code. This is because when we need to write the temporary file to rename on top of the other one, the temporary file is not of the size of the complete pgstat data collation, but just that for the requested database. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Pavel Stehule
Date:
2013/2/6 Alvaro Herrera <alvherre@2ndquadrant.com>: > Tom Lane escribió: >> Pavel Stehule <pavel.stehule@gmail.com> writes: >> >> Nice. Another interesting numbers would be device utilization, average >> >> I/O speed and required space (which should be ~2x the pgstat.stat size >> >> without the patch). >> >> > this point is important - with large warehouse with lot of databases >> > and tables you have move stat file to some ramdisc - without it you >> > lost lot of IO capacity - and it is very important if you need only >> > half sized ramdisc >> >> [ blink... ] I confess I'd not been paying close attention to this >> thread, but if that's true I'd say the patch is DOA. Why should we >> accept 2x bloat in the already-far-too-large stats file? I thought >> the idea was just to split up the existing data into multiple files. > > I think they are saying just the opposite: maximum disk space > utilization is now half of the unpatched code. This is because when we > need to write the temporary file to rename on top of the other one, the > temporary file is not of the size of the complete pgstat data collation, > but just that for the requested database. +1 Pavel > > -- > Álvaro Herrera http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
Dne 06.02.2013 16:53, Alvaro Herrera napsal: > Tom Lane escribió: >> Pavel Stehule <pavel.stehule@gmail.com> writes: >> >> Nice. Another interesting numbers would be device utilization, >> average >> >> I/O speed and required space (which should be ~2x the pgstat.stat >> size >> >> without the patch). >> >> > this point is important - with large warehouse with lot of >> databases >> > and tables you have move stat file to some ramdisc - without it >> you >> > lost lot of IO capacity - and it is very important if you need >> only >> > half sized ramdisc >> >> [ blink... ] I confess I'd not been paying close attention to this >> thread, but if that's true I'd say the patch is DOA. Why should we >> accept 2x bloat in the already-far-too-large stats file? I thought >> the idea was just to split up the existing data into multiple files. > > I think they are saying just the opposite: maximum disk space > utilization is now half of the unpatched code. This is because when > we > need to write the temporary file to rename on top of the other one, > the > temporary file is not of the size of the complete pgstat data > collation, > but just that for the requested database. Exactly. And I suspect the current (unpatched) code ofter requires more than twice the space because of open file descriptors to already deleted files. Tomas
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Jeff Janes
Date:
On Tue, Feb 5, 2013 at 2:31 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > On 5.2.2013 19:23, Jeff Janes wrote: > >> If I shutdown the server and blow away the stats with "rm >> data/pg_stat/*", it recovers gracefully when I start it back up. If a >> do "rm -r data/pg_stat" then it has problems the next time I shut it >> down, but I have no right to do that in the first place. If I initdb >> a database without this patch, then shut it down and restart with >> binaries that include this patch, and need to manually make the >> pg_stat directory. Does that mean it needs a catalog bump in order to >> force an initdb? > > Ummmm, what you mean by "catalog bump"? There is a catalog number in src/include/catalog/catversion.h, which when changed forces one to redo initdb. Formally I guess it is only for system catalog changes, but I thought it was used for any on-disk changes during development cycles. I like it the way it is, as I can use the same data directory for both versions of the binary (patched and unpatched), and just manually create or remove the directory pg_stat directory when changing modes. That is ideal for testing this patch, probably not ideal for being committed into the tree along with all the other ongoing devel work. But I think this is something the committer has to worry about. >> I have a question about its completeness. When I first start up the >> cluster and have not yet touched it, there is very little stats >> collector activity, either with or without this patch. When I kick >> the cluster sufficiently (I've been using vacuumdb -a to do that) then >> there is a lot of stats collector activity. Even once the vacuumdb >> has long finished, this high level of activity continues even though >> the database is otherwise completely idle, and this seems to happen >> for every. This patch makes that high level of activity much more >> efficient, but it does not reduce the activity. I don't understand >> why an idle database cannot get back into the state right after >> start-up. > > What do you mean by "stats collector activity"? Is it reading/writing a > lot of data, or is it just using a lot of CPU? Basically, the launching of new autovac workers and the work that that entails. Your patch reduces the size of data that needs to be written, read, and parsed for every launch, but not the number of times that that happens. > Isn't that just a natural and expected behavior because the database > needs to actually perform ANALYZE to actually collect the data. Although > the tables are empty, it costs some CPU / IO and there's a lot of them > (1000 dbs, each with 200 tables). It isn't touching the tables at all, just the stats files. I was wrong about the cluster opening quiet. It only does that if, while the cluster was shutdown, you remove the statistics files which I was doing, as I was switching back and forth between patched and unpatched. When the cluster opens, any databases that don't have statistics in the stat file(s) will not get an autovacuum worker process spawned. They only start getting spawned once someone asks for statistics for that database. But then once that happens, that database then gets a worker spawned for it every naptime (or, at least, as close to that as the server can keep up with) for eternity, even if that database is never used again. The only way to stop this is the unsupported way of blowing away the permanent stats files. > > I don't think there's a way around this. You may increase the autovacuum > naptime, but that's about all. 
> >> I do not think that the patch needs to solve this problem in order to >> be accepted, but if it can be addressed while the author and reviewers >> are paying attention to this part of the system, that would be ideal. >> And if not, then we should at least remember that there is future work >> that could be done here. > > If I understand that correctly, you see the same behaviour even without > the patch, right? In that case I'd vote not to make the patch more > complex, and try to improve that separately (if it's even possible). OK. I just thought that while digging through the code, you might have a good idea for fixing this part as well. If so, it would be a shame for that idea to be lost when you move on to other things. > > >> I created 1000 databases each with 200 single column tables (x serial >> primary key). >> >> After vacuumdb -a, I let it idle for a long time to see what steady >> state was reached. >> >> without the patch: >> vacuumdb -a real 11m2.624s >> idle steady state: 48.17% user, 39.24% system, 11.78% iowait, 0.81% idle. >> >> with the patch: >> vacuumdb -a real 6m41.306s >> idle steady state: 7.86% user, 5.00% system 0.09% iowait 87% idle, > > Nice. Another interesting numbers would be device utilization, average > I/O speed I didn't gather that data, as I never figured out how to interpret those numbers and so don't have much faith in them. (But I am pretty impressed with the numbers I do understand) > and required space (which should be ~2x the pgstat.stat size > without the patch). I didn't study this in depth, but the patch seems to do what it should (that is, take less space, not more). If I fill the device up so that there is less than 3x the size of the stats file available for use (i.e. space for the file itself and for 1 temp copy version of it but not space for a complete second temp copy), I occasionally get out-of-space warning with unpatched. But never get those errors with patched. Indeed, with patch I never get warnings even with only 1.04 times the aggregate size of the stats files available for use. (That is, size for all the files, plus just 1/25 that amount to spare. Obviously this limit is specific to having 1000 databases of equal size.) > >> I also ran pgbench on a scale that fits in memory with fsync=off, on a >> singe CPU machine. With the same above-mentioned 1000 databases as >> unused decoys to bloat the stats file. >> >> pgbench_tellers and branches undergo enough turn over that they should >> get vacuumed every minuted (nap time). >> >> Without the patch, they only get vacuumed every 40 minutes or so as >> the autovac workers are so distracted by reading the bloated stats >> file. and the TPS is ~680. >> >> With the patch, they get vacuumed every 1 to 2 minutes and TPS is ~940 > > Great, I haven't really aimed to improve pgbench results, but it seems > natural that the decreased CPU utilization can go somewhere else. Not bad. My goal there was to prove to myself that the correct tables were getting vacuumed. The TPS measurements were just a by-product of that, but since I had them I figured I'd post them. > Have you moved the stats somewhere to tmpfs, or have you used the > default location (on disk)? All the specific work I reported was with them on disk, except the part about running out of space, which was done on /dev/shm. But even with the data theoretically going to disk, the kernel caches it well enough that I wouldn't expect things to change very much. 
Two more questions I've come up with: If I invoke pg_stat_reset() from a database, the corresponding file does not get removed from the pg_stat_tmp directory. And when the database is shut down, a file for the reset database does get created in pg_stat. Is this OK? Cheers, Jeff
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes: > On Tue, Feb 5, 2013 at 2:31 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> Ummmm, what you mean by "catalog bump"? > There is a catalog number in src/include/catalog/catversion.h, which > when changed forces one to redo initdb. > Formally I guess it is only for system catalog changes, but I thought > it was used for any on-disk changes during development cycles. Yeah, it would be appropriate to bump the catversion if we're creating a new PGDATA subdirectory. I'm not excited about keeping code to take care of the lack of such a subdirectory at runtime, as I gather there is in the current state of the patch. Formally, if there were such code, we'd not need a catversion bump --- the rule of thumb is to change catversion if the new postgres executable would fail regression tests without a run of the new initdb. But it's pretty dumb to keep such code indefinitely, when it would have no more possible use after the next catversion bump (which is seldom more than a week or two away during devel phase). >> What do you mean by "stats collector activity"? Is it reading/writing a >> lot of data, or is it just using a lot of CPU? > Basically, the launching of new autovac workers and the work that that > entails. Your patch reduces the size of data that needs to be > written, read, and parsed for every launch, but not the number of > times that that happens. It doesn't seem very reasonable to ask this patch to redesign the autovacuum algorithms, which is essentially what it'll take to improve that. That's a completely separate layer of code. regards, tom lane
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 7.2.2013 00:40, Tom Lane wrote: > Jeff Janes <jeff.janes@gmail.com> writes: >> On Tue, Feb 5, 2013 at 2:31 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >>> Ummmm, what you mean by "catalog bump"? > >> There is a catalog number in src/include/catalog/catversion.h, which >> when changed forces one to redo initdb. > >> Formally I guess it is only for system catalog changes, but I thought >> it was used for any on-disk changes during development cycles. > > Yeah, it would be appropriate to bump the catversion if we're creating a > new PGDATA subdirectory. > > I'm not excited about keeping code to take care of the lack of such a > subdirectory at runtime, as I gather there is in the current state of > the patch. Formally, if there were such code, we'd not need a No, there is nothing to handle that at runtime. The directory is created at initdb and the patch expects that (and fails if it's gone). > catversion bump --- the rule of thumb is to change catversion if the new > postgres executable would fail regression tests without a run of the new > initdb. But it's pretty dumb to keep such code indefinitely, when it > would have no more possible use after the next catversion bump (which is > seldom more than a week or two away during devel phase). > >>> What do you mean by "stats collector activity"? Is it reading/writing a >>> lot of data, or is it just using a lot of CPU? > >> Basically, the launching of new autovac workers and the work that that >> entails. Your patch reduces the size of data that needs to be >> written, read, and parsed for every launch, but not the number of >> times that that happens. > > It doesn't seem very reasonable to ask this patch to redesign the > autovacuum algorithms, which is essentially what it'll take to improve > that. That's a completely separate layer of code. My opinion, exactly. Tomas
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Jeff Janes
Date:
On Tue, Feb 5, 2013 at 2:31 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > >> We do not already have this. There is no relevant spec. I can't see >> how this could need pg_dump support (but what about pg_upgrade?) > > pg_dump - no > > pg_upgrage - IMHO it should create the pg_stat directory. I don't think > it could "convert" statfile into the new format (by splitting it into > the pieces). I haven't checked but I believe the default behavior is to > delete it as there might be new fields / slight changes of meaning etc. Right, I have no concerns with pg_upgrade any more. The pg_stat will inherently get created by the initdb of the new cluster (because the initdb will done with the new binaries with your patch in place them). pg_upgrade currently doesn't copy over global/pgstat.stat. So that means the new cluster doesn't have the activity stats either way, patch or unpatched. So if it is not currently a problem it will not become one under the proposed patch. Cheers, Jeff
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Alvaro Herrera
Date:
Here's an updated version of this patch that takes care of the issues I reported previously: no more repalloc() of the requests array; it's now an slist, which makes the code much more natural IMV. And no more messing around with doing sprintf to create a separate sprintf pattern for the per-db stats file; instead have a function to return the name that uses just the pgstat dir as stored by GUC. I think this can be further simplified still. I haven't reviewed the rest yet; please do give this a try to confirm that the speedups previously reported are still there (i.e. I didn't completely blew it). Thanks -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Alvaro Herrera
Date:
Alvaro Herrera escribió: > Hm, and I now also realize another bug in this patch: the global stats > only include database entries for requested databases; but perhaps the > existing files can serve later requestors just fine for databases that > already had files; so the global stats file should continue to carry > entries for them, with the old timestamps. Actually the code already do things that way -- apologies. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Alvaro Herrera
Date:
Here's a ninth version of this patch. (version 8 went unpublished). I have simplified a lot of things and improved some comments; I think I understand much of it now. I think this patch is fairly close to committable, but one issue remains, which is this bit in pgstat_write_statsfiles(): /* In any case, we can just throw away all the db requests, but we need to * write dummy files for databases without a stat entry (it would cause * issues in pgstat_read_db_statsfile_timestamp and pgstat wait timeouts). * This may happen e.g. for shared DB (oid = 0) right after initdb. */ if (!slist_is_empty(&last_statrequests)) { slist_mutable_iter iter; slist_foreach_modify(iter, &last_statrequests) { DBWriteRequest *req = slist_container(DBWriteRequest, next, iter.cur); /* * Create dummy files for requested databases without a proper * dbentry. It's much easier this way than dealing with multiple * timestamps, possibly existing but not yet written DBs etc. * */ if (!pgstat_get_db_entry(req->databaseid, false)) pgstat_write_db_dummyfile(req->databaseid); pfree(req); } slist_init(&last_statrequests); } The problem here is that creating these dummy entries will cause a difference in autovacuum behavior. Autovacuum will skip processing databases with no pgstat entry, and the intended reason is that if there's no pgstat entry it's because the database doesn't have enough activity. Now perhaps we want to change that, but it should be an explicit decision taken after discussion and thought, not side effect from an unrelated patch. Hm, and I now also realize another bug in this patch: the global stats only include database entries for requested databases; but perhaps the existing files can serve later requestors just fine for databases that already had files; so the global stats file should continue to carry entries for them, with the old timestamps. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Josh Berkus
Date:
I saw discussion about this on this thread, but I'm not able to figure out what the answer is: how does this work with moving the stats file, for example to a RAMdisk? Specifically, if the user sets stats_temp_directory, does it continue to work the way it does now? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Alvaro Herrera
Date:
Josh Berkus wrote: > I saw discussion about this on this thread, but I'm not able to figure > out what the answer is: how does this work with moving the stats file, > for example to a RAMdisk? Specifically, if the user sets > stats_temp_directory, does it continue to work the way it does now? Of course. You get more files than previously, but yes. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Alvaro Herrera
Date:
Alvaro Herrera escribió: > Here's a ninth version of this patch. (version 8 went unpublished). I > have simplified a lot of things and improved some comments; I think I > understand much of it now. I think this patch is fairly close to > committable, but one issue remains, which is this bit in > pgstat_write_statsfiles(): > > /* In any case, we can just throw away all the db requests, but we need to > * write dummy files for databases without a stat entry (it would cause > * issues in pgstat_read_db_statsfile_timestamp and pgstat wait timeouts). > * This may happen e.g. for shared DB (oid = 0) right after initdb. > */ I think the real way to handle this is to fix backend_read_statsfile(). It's using the old logic of considering existance of the file, but of course now the file might not exist at all and that doesn't mean we need to continue kicking the collector to write it. We need a mechanism to figure that the collector is just not going to write the file no matter how hard we kick it ... -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Alvaro Herrera
Date:
Alvaro Herrera escribió: > Here's a ninth version of this patch. (version 8 went unpublished). I > have simplified a lot of things and improved some comments; I think I > understand much of it now. I think this patch is fairly close to > committable, but one issue remains, which is this bit in > pgstat_write_statsfiles(): I've marked this as Waiting on author for the time being. I'm going to review/work on other patches now, hoping that Tomas will post an updated version in time for it to be considered for 9.3. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 14.2.2013 20:43, Josh Berkus wrote: > I saw discussion about this on this thread, but I'm not able to figure > out what the answer is: how does this work with moving the stats file, > for example to a RAMdisk? Specifically, if the user sets > stats_temp_directory, does it continue to work the way it does now? No change in this respect - you can still use RAMdisk, and you'll actually need less space because the space requirements decreased due to breaking the single file into multiple pieces. We're using it this way (on a tmpfs filesystem) and it works like a charm. regards Tomas
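[For anyone wanting to try the same setup, a minimal sketch follows; the mount point, size and ownership commands are assumptions for illustration, not recommendations from this thread.]

=======================================================================
# /etc/fstab -- a small tmpfs dedicated to the stats files (path/size assumed)
tmpfs  /mnt/pg_stats_tmp  tmpfs  size=400M,mode=0700  0  0
# after mounting: chown postgres:postgres /mnt/pg_stats_tmp

# postgresql.conf -- point the statistics collector at the tmpfs
stats_temp_directory = '/mnt/pg_stats_tmp'
=======================================================================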
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
First of all, big thanks for working on this patch and not only identifying the issues but actually fixing them.

On 14.2.2013 20:23, Alvaro Herrera wrote:
> Here's a ninth version of this patch. (version 8 went unpublished). I
> have simplified a lot of things and improved some comments; I think I
> understand much of it now. I think this patch is fairly close to
> committable, but one issue remains, which is this bit in
> pgstat_write_statsfiles():
> ...
>
> The problem here is that creating these dummy entries will cause a
> difference in autovacuum behavior. Autovacuum will skip processing
> databases with no pgstat entry, and the intended reason is that if
> there's no pgstat entry it's because the database doesn't have enough
> activity. Now perhaps we want to change that, but it should be an
> explicit decision taken after discussion and thought, not side effect
> from an unrelated patch.

I don't see how that changes the autovacuum behavior. Can you explain that a bit more?

As I see it, with the old (single-file version) the autovacuum worker would get exactly the same thing, i.e. no stats at all. Which is exactly what the autovacuum worker gets with the new code, except that the check for the last statfile timestamp uses the "per-db" file, so we need to write it. This way the worker is able to read the timestamp and is happy because it gets a fresh file, even though it finds no stats for the database later.

Where is the behavior change? Can you provide an example?

kind regards
Tomas
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 14.2.2013 22:24, Alvaro Herrera wrote: > Alvaro Herrera escribió: >> Here's a ninth version of this patch. (version 8 went unpublished). I >> have simplified a lot of things and improved some comments; I think I >> understand much of it now. I think this patch is fairly close to >> committable, but one issue remains, which is this bit in >> pgstat_write_statsfiles(): > > I've marked this as Waiting on author for the time being. I'm going to > review/work on other patches now, hoping that Tomas will post an updated > version in time for it to be considered for 9.3. Sadly I have no idea how to fix that, and I think the solution you suggested in the previous messages does not actually do the trick :-( Tomas
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Alvaro Herrera
Date:
Tomas Vondra escribió: > I don't see how that changes the autovacuum behavior. Can you explain > that a bit more? It might be that I'm all wet on this. I'll poke at it some more. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Alvaro Herrera
Date:
Tomas Vondra escribió: > On 14.2.2013 20:23, Alvaro Herrera wrote: > > The problem here is that creating these dummy entries will cause a > > difference in autovacuum behavior. Autovacuum will skip processing > > databases with no pgstat entry, and the intended reason is that if > > there's no pgstat entry it's because the database doesn't have enough > > activity. Now perhaps we want to change that, but it should be an > > explicit decision taken after discussion and thought, not side effect > > from an unrelated patch. > > I don't see how that changes the autovacuum behavior. Can you explain > that a bit more? > > As I see it, with the old (single-file version) the autovacuum worker > would get exacly the same thing, i.e. no stats at all. See in autovacuum.c the calls to pgstat_fetch_stat_dbentry(). Most of them check for NULL result and act differently depending on that. Returning a valid (not NULL) entry full of zeroes is not the same. I didn't actually try to reproduce a problem. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 15.2.2013 16:38, Alvaro Herrera wrote:
> Tomas Vondra escribió:
>> On 14.2.2013 20:23, Alvaro Herrera wrote:
>>> The problem here is that creating these dummy entries will cause a
>>> difference in autovacuum behavior. Autovacuum will skip processing
>>> databases with no pgstat entry, and the intended reason is that if
>>> there's no pgstat entry it's because the database doesn't have enough
>>> activity. Now perhaps we want to change that, but it should be an
>>> explicit decision taken after discussion and thought, not side effect
>>> from an unrelated patch.
>>
>> I don't see how that changes the autovacuum behavior. Can you explain
>> that a bit more?
>>
>> As I see it, with the old (single-file version) the autovacuum worker
>> would get exacly the same thing, i.e. no stats at all.
>
> See in autovacuum.c the calls to pgstat_fetch_stat_dbentry(). Most of
> them check for NULL result and act differently depending on that.
> Returning a valid (not NULL) entry full of zeroes is not the same.
> I didn't actually try to reproduce a problem.

Errrr, but why would the patched code return an entry full of zeroes and not NULL as before?

The dummy files serve a single purpose - to confirm that the collector attempted to write info for the particular database (and did not find any data for it). All such a file contains is a timestamp of the write - nothing else. So the worker will read the global file (containing the list of stats for dbs) and will still get NULL, just like with the old code, because the database is not there and the patch does not change that at all.

Tomas
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 15.2.2013 01:02, Tomas Vondra wrote:
> On 14.2.2013 22:24, Alvaro Herrera wrote:
>> Alvaro Herrera escribió:
>>> Here's a ninth version of this patch. (version 8 went unpublished). I
>>> have simplified a lot of things and improved some comments; I think I
>>> understand much of it now. I think this patch is fairly close to
>>> committable, but one issue remains, which is this bit in
>>> pgstat_write_statsfiles():
>>
>> I've marked this as Waiting on author for the time being. I'm going to
>> review/work on other patches now, hoping that Tomas will post an updated
>> version in time for it to be considered for 9.3.
>
> Sadly I have no idea how to fix that, and I think the solution you
> suggested in the previous messages does not actually do the trick :-(

I've been thinking about this (actually I had a really weird dream about it this night) and I think it might work like this (see the sketch after this list):

(1) check the timestamp of the global file -> if it's too old, we need to send an inquiry or wait a bit longer

(2) if it's new enough, we need to read it and look for that particular database - if it's not found, we have no info about it yet (this is the case handled by the dummy files)

(3) if there's a database stat entry, we need to check the timestamp when it was written for the last time -> if it's too old, send an inquiry and wait a bit longer

(4) well, we have a recent global file, it contains the database stat entry and it's fresh enough -> tadaaaaaa, we're done

At least that's my idea - I haven't tried to implement it yet. I see a few pros and cons of this approach:

pros:
* no dummy files
* no timestamps in the per-db files (and thus no sync issues)

cons:
* the backends / workers will have to re-read the global file just to check that the per-db file is there and is fresh enough

So far it was sufficient just to peek at the timestamp at the beginning of the per-db stat file - minimum data read, no CPU-expensive processing etc. Sadly the more DBs there are, the larger the file gets (thus more overhead to read it). OTOH it's not that much data (~180B per entry, so with 1000 dbs it's just ~180kB) so I don't expect this to be a tremendous issue. And the pros seem to be quite compelling.

Tomas
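[A standalone sketch of that four-step check, with simplified types and stubbed helpers; in reality the logic would live in backend_read_statsfile() / pgstat_read_db_statsfile_timestamp(), so names and signatures here are assumptions.]

=======================================================================
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t sketch_time_t;

typedef struct
{
	unsigned int	databaseid;
	sketch_time_t	stats_timestamp;	/* last write of this DB's entry */
} SketchDbEntry;

/* stubbed stand-ins for reading the global stats file */
static bool
read_global_timestamp(sketch_time_t *ts)
{
	*ts = 0;
	return false;				/* pretend the global file is missing */
}

static SketchDbEntry *
find_db_entry(unsigned int dboid)
{
	(void) dboid;
	return NULL;				/* pretend this DB has no entry */
}

typedef enum { STATS_FRESH, STATS_STALE, STATS_NO_DB_ENTRY } SketchVerdict;

static SketchVerdict
check_db_stats(unsigned int dboid, sketch_time_t min_ts)
{
	sketch_time_t	global_ts;
	SketchDbEntry  *dbentry;

	/* (1) global file missing or too old -> inquire and wait a bit longer */
	if (!read_global_timestamp(&global_ts) || global_ts < min_ts)
		return STATS_STALE;

	/* (2) fresh global file, but no entry for this DB -> nothing to wait for */
	if ((dbentry = find_db_entry(dboid)) == NULL)
		return STATS_NO_DB_ENTRY;

	/* (3) entry exists but was written too long ago -> inquire and wait */
	if (dbentry->stats_timestamp < min_ts)
		return STATS_STALE;

	/* (4) fresh global file with a fresh per-DB entry -> done */
	return STATS_FRESH;
}

int
main(void)
{
	(void) check_db_stats(1, 100);
	return 0;
}
=======================================================================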
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Alvaro Herrera
Date:
Tomas Vondra wrote: > I've been thinking about this (actually I had a really weird dream about > it this night) and I think it might work like this: > > (1) check the timestamp of the global file -> if it's too old, we need > to send an inquiry or wait a bit longer > > (2) if it's new enough, we need to read it a look for that particular > database - if it's not found, we have no info about it yet (this is > the case handled by the dummy files) > > (3) if there's a database stat entry, we need to check the timestamp > when it was written for the last time -> if it's too old, send an > inquiry and wait a bit longer > > (4) well, we have a recent global file, it contains the database stat > entry and it's fresh enough -> tadaaaaaa, we're done Hmm, yes, I think this is what I was imagining. I had even considered that the timestamp would be removed from the per-db file as you suggest here. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 17.2.2013 06:46, Alvaro Herrera wrote:
> Tomas Vondra wrote:
>> I've been thinking about this (actually I had a really weird dream about
>> it this night) and I think it might work like this:
>>
>> (1) check the timestamp of the global file -> if it's too old, we need
>> to send an inquiry or wait a bit longer
>>
>> (2) if it's new enough, we need to read it and look for that particular
>> database - if it's not found, we have no info about it yet (this is
>> the case handled by the dummy files)
>>
>> (3) if there's a database stat entry, we need to check the timestamp
>> when it was written for the last time -> if it's too old, send an
>> inquiry and wait a bit longer
>>
>> (4) well, we have a recent global file, it contains the database stat
>> entry and it's fresh enough -> tadaaaaaa, we're done
>
> Hmm, yes, I think this is what I was imagining. I had even considered
> that the timestamp would be removed from the per-db file as you suggest
> here.

So, here's v10 of the patch (based on v9+v9a), which implements the approach described above. It turned out to be much easier than I expected (basically just a rewrite of the pgstat_read_db_statsfile_timestamp() function).

I've done a fair amount of testing (and will do some more next week) but it seems to work just fine - no errors, no measurable decrease of performance etc.

regards
Tomas Vondra
Attachment
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
Hi, just a few charts from our production system, illustrating the impact of this patch. We deployed this patch on January 12 on our systems running 9.1 (i.e. a backpatch), so we do have enough data to produce some nice pics from it. I don't expect the changes made to the patch since then to affect the impact significantly.

I've chosen two systems with large numbers of databases (over 1000 on each), each database containing multiple tables (possibly hundreds or more).

The first ("cpu") charts show quantiles of CPU usage, and assuming that the system usage did not change, there's a clear drop of about 15% in both cases. This is an 8-core system (on AWS), so this amounts to saving about 120% of one core.

The "disk usage" charts show how much space was needed for the stats (placed on a tmpfs filesystem, mounted at /mnt/pg_tmp). The filesystem max size is 400MB and the files require ~100MB. With the unpatched code the space required was actually ~200MB because of the copying.

regards
Tomas
Attachment
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Alvaro Herrera
Date:
Tomas Vondra wrote:
> So, here's v10 of the patch (based on the v9+v9a), that implements the
> approach described above.
>
> It turned out to be much easier than I expected (basically just a
> rewrite of the pgstat_read_db_statsfile_timestamp() function.

Thanks. I'm giving this another look now.

I think the new code means we no longer need the first_write logic; just let the collector idle until we get the first request. (If for some reason we considered that we should really be publishing initial stats as early as possible, we could just do a write_statsfiles(allDbs) call before entering the main loop. But I don't see any reason to do this. If you do, please speak up.)

Also, it seems to me that the new pgstat_db_requested() logic is slightly bogus (in the "inefficient" sense, not the "incorrect" sense): we should be comparing the timestamp of the request vs. what's already on disk instead of blindly returning true if the list is nonempty. If the request is older than the file, we don't need to write anything and can discard the request. For example, suppose that backend A sends a request for a DB; we write the file. If then quickly backend B also requests stats for the same DB, with the current logic we'd go write the file, but perhaps backend B would be fine with the file we already wrote.

Another point is that I think there's a better way to handle nonexistent files, instead of having to read the global file and all the DB records to find the one we want. Just try to read the database file, and only if that fails try to read the global file and compare the timestamp (so there might be two timestamps for each DB, one in the global file and one in the DB-specific file. I don't think this is a problem). The point is to avoid having to read the global file if possible.

So here's v11. I intend to commit this shortly. (I wanted to get it out before lunch, but I introduced a silly bug that took me a bit to fix.)

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
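[A standalone sketch of the timestamp comparison suggested above for pgstat_db_requested(): report a pending request only if it is newer than what is already on disk. Simplified list and types; this is not the committed code.]

=======================================================================
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t sketch_time_t;

typedef struct SketchRequest
{
	unsigned int			databaseid;
	sketch_time_t			request_time;	/* when the backend asked */
	struct SketchRequest   *next;
} SketchRequest;

/* A request older than the file already on disk is already satisfied
 * and can simply be discarded; only newer requests force a write. */
static bool
sketch_db_requested(const SketchRequest *requests, unsigned int dboid,
					sketch_time_t on_disk_ts)
{
	const SketchRequest *req;

	for (req = requests; req != NULL; req = req->next)
	{
		if (req->databaseid == dboid && req->request_time > on_disk_ts)
			return true;
	}
	return false;
}

int
main(void)
{
	SketchRequest	req = {1, 100, NULL};

	/* a file written at t=200 already covers a request from t=100 */
	return sketch_db_requested(&req, 1, 200) ? 1 : 0;
}
=======================================================================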
Attachment
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 18.2.2013 16:50, Alvaro Herrera wrote: > Tomas Vondra wrote: > >> So, here's v10 of the patch (based on the v9+v9a), that implements the >> approach described above. >> >> It turned out to be much easier than I expected (basically just a >> rewrite of the pgstat_read_db_statsfile_timestamp() function. > > Thanks. I'm giving this another look now. I think the new code means > we no longer need the first_write logic; just let the collector idle > until we get the first request. (If for some reason we considered that > we should really be publishing initial stats as early as possible, we > could just do a write_statsfiles(allDbs) call before entering the main > loop. But I don't see any reason to do this. If you do, please speak > up.) > > Also, it seems to me that the new pgstat_db_requested() logic is > slightly bogus (in the "inefficient" sense, not the "incorrect" sense): > we should be comparing the timestamp of the request vs. what's already > on disk instead of blindly returning true if the list is nonempty. If > the request is older than the file, we don't need to write anything and > can discard the request. For example, suppose that backend A sends a > request for a DB; we write the file. If then quickly backend B also > requests stats for the same DB, with the current logic we'd go write the > file, but perhaps backend B would be fine with the file we already > wrote. Hmmm, you're probably right. > Another point is that I think there's a better way to handle nonexistant > files, instead of having to read the global file and all the DB records > to find the one we want. Just try to read the database file, and only > if that fails try to read the global file and compare the timestamp (so > there might be two timestamps for each DB, one in the global file and > one in the DB-specific file. I don't think this is a problem). The > point is avoid having to read the global file if possible. I don't think that's a good idea. Keeping the timestamps at one place is a significant simplification, and I don't think it's worth the additional complexity. And the overhead is minimal. So my vote on this change is -1. > > So here's v11. I intend to commit this shortly. (I wanted to get it > out before lunch, but I introduced a silly bug that took me a bit to > fix.) ;-) Tomas
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Alvaro Herrera
Date:
Tomas Vondra wrote: > On 18.2.2013 16:50, Alvaro Herrera wrote: > > Also, it seems to me that the new pgstat_db_requested() logic is > > slightly bogus (in the "inefficient" sense, not the "incorrect" sense): > > we should be comparing the timestamp of the request vs. what's already > > on disk instead of blindly returning true if the list is nonempty. If > > the request is older than the file, we don't need to write anything and > > can discard the request. For example, suppose that backend A sends a > > request for a DB; we write the file. If then quickly backend B also > > requests stats for the same DB, with the current logic we'd go write the > > file, but perhaps backend B would be fine with the file we already > > wrote. > > Hmmm, you're probably right. I left it as is for now; I think it warrants revisiting. > > Another point is that I think there's a better way to handle nonexistant > > files, instead of having to read the global file and all the DB records > > to find the one we want. Just try to read the database file, and only > > if that fails try to read the global file and compare the timestamp (so > > there might be two timestamps for each DB, one in the global file and > > one in the DB-specific file. I don't think this is a problem). The > > point is avoid having to read the global file if possible. > > I don't think that's a good idea. Keeping the timestamps at one place is > a significant simplification, and I don't think it's worth the > additional complexity. And the overhead is minimal. > > So my vote on this change is -1. Fair enough. I have pushed it now. Further testing, of course, is always welcome. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Alvaro Herrera
Date:
Alvaro Herrera wrote: > I have pushed it now. Further testing, of course, is always welcome. Mastodon failed: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mastodon&dt=2013-02-19%2000%3A00%3A01 probably worth investigating a bit; we might have broken something. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
Dne 19.02.2013 05:46, Alvaro Herrera napsal:
> Alvaro Herrera wrote:
>> I have pushed it now. Further testing, of course, is always
>> welcome.
>
> Mastodon failed:
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mastodon&dt=2013-02-19%2000%3A00%3A01
>
> probably worth investigating a bit; we might have broken something.

Hmmm, interesting. A single Windows machine fails, while the other Windows machines seem to work fine (although some of them have not been built in a few weeks).

I'll look into that, but I have no clue why this might happen - except maybe some unexpected timing issue or something ...

Tomas
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tom Lane
Date:
Tomas Vondra <tv@fuzzy.cz> writes: > Dne 19.02.2013 05:46, Alvaro Herrera napsal: >> Mastodon failed: >> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mastodon&dt=2013-02-19%2000%3A00%3A01 >> >> probably worth investigating a bit; we might have broken something. > Hmmm, interesting. A single Windows machine, while the other Windows > machines seem to work fine (although some of them were not built for a > few weeks). Could be random chance --- we've seen the same failure before, eg http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mastodon&dt=2012-11-25%2006%3A00%3A00 regards, tom lane
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
Dne 19.02.2013 11:27, Tom Lane napsal:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> Dne 19.02.2013 05:46, Alvaro Herrera napsal:
>>> Mastodon failed:
>>>
>>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mastodon&dt=2013-02-19%2000%3A00%3A01
>>>
>>> probably worth investigating a bit; we might have broken something.
>>
>> Hmmm, interesting. A single Windows machine, while the other Windows
>> machines seem to work fine (although some of them were not built for a
>> few weeks).
>
> Could be random chance --- we've seen the same failure before, eg
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mastodon&dt=2012-11-25%2006%3A00%3A00

Maybe. But why does random chance happen to me only with regression tests, and never with the lottery like for normal people?

Tomas
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 19.2.2013 11:27, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> Dne 19.02.2013 05:46, Alvaro Herrera napsal:
>>> Mastodon failed:
>>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mastodon&dt=2013-02-19%2000%3A00%3A01
>>>
>>> probably worth investigating a bit; we might have broken something.
>>
>> Hmmm, interesting. A single Windows machine, while the other Windows
>> machines seem to work fine (although some of them were not built for a
>> few weeks).
>
> Could be random chance --- we've seen the same failure before, eg
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mastodon&dt=2012-11-25%2006%3A00%3A00
>
> regards, tom lane

I'm looking at that test, and I'm not really sure about a few details. First, this function seems pretty useless to me:

=======================================================================
create function wait_for_stats() returns void as $$
declare
  start_time timestamptz := clock_timestamp();
  updated bool;
begin
  -- we don't want to wait forever; loop will exit after 30 seconds
  for i in 1 .. 300 loop

    -- check to see if indexscan has been sensed
    SELECT (st.idx_scan >= pr.idx_scan + 1) INTO updated
      FROM pg_stat_user_tables AS st, pg_class AS cl, prevstats AS pr
     WHERE st.relname='tenk2' AND cl.relname='tenk2';

    exit when updated;

    -- wait a little
    perform pg_sleep(0.1);

    -- reset stats snapshot so we can test again
    perform pg_stat_clear_snapshot();

  end loop;

  -- report time waited in postmaster log (where it won't change test output)
  raise log 'wait_for_stats delayed % seconds',
    extract(epoch from clock_timestamp() - start_time);
end
$$ language plpgsql;
=======================================================================

AFAIK the stats remain the same within a transaction, and as a function runs within a transaction, it will either get new data on the first iteration, or it will run all 300 of them. I've checked several buildfarm members and I'm yet to see a single duration between 12ms and 30sec. So IMHO we can replace the function call with pg_sleep(30) and we'll get about the same effect.

But this obviously does not answer the question why it failed, although on both occasions there's this log message:

  [50b1b7fa.0568:14] LOG:  wait_for_stats delayed 34.75 seconds

which essentially means the stats were not updated before the call to wait_for_stats(). Anyway, there are these two failing queries:

=======================================================================
-- check effects
SELECT st.seq_scan >= pr.seq_scan + 1,
       st.seq_tup_read >= pr.seq_tup_read + cl.reltuples,
       st.idx_scan >= pr.idx_scan + 1,
       st.idx_tup_fetch >= pr.idx_tup_fetch + 1
  FROM pg_stat_user_tables AS st, pg_class AS cl, prevstats AS pr
 WHERE st.relname='tenk2' AND cl.relname='tenk2';

 ?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
 t        | t        | t        | t
(1 row)

SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
       st.idx_blks_read + st.idx_blks_hit >= pr.idx_blks + 1
  FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
 WHERE st.relname='tenk2' AND cl.relname='tenk2';

 ?column? | ?column?
----------+----------
 t        | t
(1 row)
=======================================================================

The first one returns only false values, the second one returns either (t,f) or (f,f) - for the two failures posted by Alvaro and TL earlier today.

I'm really wondering how that could happen. The only thing that I can think of is some strange timing issue, causing lost requests to write the stats, or maybe losing some of the stats updates.
Hmmm, IIRC the stats are sent over UDP - couldn't that be related? Tomas
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Alvaro Herrera
Date:
Tomas Vondra wrote: > AFAIK the stats remain the same within a transaction, and as a function > runs within a transaction, it will either get new data on the first > iteration, or it will run all 300 of them. I've checked several > buildfarm members and I'm yet to see a single duration between 12ms and > 30sec. No, there's a call to pg_stat_clear_snapshot() that takes care of that. > I'm really wondering how that could happen. The only thing that I can > think of is some strange timing issue, causing lost requests to write > the stats or maybe some of the stats updates. Hmmm, IIRC the stats are > sent over UDP - couldn't that be related? yes, UDP packet drops can certainly happen. This is considered a feature (do not cause backends to block when the network socket to stat collector is swamped; it's better to lose some stat messages instead). -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 19.2.2013 23:31, Alvaro Herrera wrote:> Tomas Vondra wrote: > >> AFAIK the stats remain the same within a transaction, and as a >> function runs within a transaction, it will either get new data on >> the first iteration, or it will run all 300 of them. I've checked >> several buildfarm members and I'm yet to see a single duration >> between 12ms and 30sec. > > No, there's a call to pg_stat_clear_snapshot() that takes care of > that. Aha! Missed that for some reason. Thanks. > >> I'm really wondering how that could happen. The only thing that I >> can think of is some strange timing issue, causing lost requests to >> write the stats or maybe some of the stats updates. Hmmm, IIRC the >> stats are sent over UDP - couldn't that be related? > > yes, UDP packet drops can certainly happen. This is considered a > feature (do not cause backends to block when the network socket to > stat collector is swamped; it's better to lose some stat messages > instead). Is there anything we could add to the test to identify this? Something that either shows "stats were sent" and "stats arrived" (maybe in the log only), or that some UPD packets were dropped? Tomas
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Jeff Janes
Date:
On Mon, Feb 18, 2013 at 7:50 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > So here's v11. I intend to commit this shortly. (I wanted to get it > out before lunch, but I introduced a silly bug that took me a bit to > fix.) On Windows with Mingw I get this: pgstat.c:4389:8: warning: variable 'found' set but not used [-Wunused-but-set-variable] I don't get that on Linux, but I bet that is just the gcc version (4.6.2 vs 4.4.6) rather than the OS. It looks like it is just a useless variable, rather than any possible cause of the Windows "make check" failure (which I can't reproduce). Cheers, Jeff
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Kevin Grittner
Date:
Jeff Janes <jeff.janes@gmail.com> wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> >> So here's v11. I intend to commit this shortly. (I wanted to get it >> out before lunch, but I introduced a silly bug that took me a bit to >> fix.) > > On Windows with Mingw I get this: > > pgstat.c:4389:8: warning: variable 'found' set but not used > [-Wunused-but-set-variable] > > I don't get that on Linux, but I bet that is just the gcc version > (4.6.2 vs 4.4.6) rather than the OS. I get it on Linux with gcc version 4.7.2. > It looks like it is just a useless variable Agreed. -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Alvaro Herrera
Date:
Jeff Janes escribió: > On Mon, Feb 18, 2013 at 7:50 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > > > So here's v11. I intend to commit this shortly. (I wanted to get it > > out before lunch, but I introduced a silly bug that took me a bit to > > fix.) > > On Windows with Mingw I get this: > > pgstat.c:4389:8: warning: variable 'found' set but not used > [-Wunused-but-set-variable] > > I don't get that on Linux, but I bet that is just the gcc version > (4.6.2 vs 4.4.6) rather than the OS. It looks like it is just a > useless variable, rather than any possible cause of the Windows "make > check" failure (which I can't reproduce). Hm, I remember looking at that code and thinking that the "return" there might not be the best idea because it'd miss running the code that checks for clock skew; and so the "found" was necessary because the return was to be taken out. But on second thought, a database for which the loop terminates early has already run the clock-skew detection code recently, so that's probably not worth worrying about. IOW I will just remove that variable. Thanks for the notice. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Noah Misch
Date:
On Mon, Feb 18, 2013 at 06:19:12PM -0300, Alvaro Herrera wrote:
> I have pushed it now. Further testing, of course, is always welcome.

While investigating stats.sql buildfarm failures, mostly on animals axolotl and shearwater, I found that this patch (commit 187492b) inadvertently removed the collector's ability to coalesce inquiries. Every PGSTAT_MTYPE_INQUIRY received now causes one stats file write. Before, pgstat_recv_inquiry() did:

	if (msg->inquiry_time > last_statrequest)
		last_statrequest = msg->inquiry_time;

and pgstat_write_statsfile() did:

	globalStats.stats_timestamp = GetCurrentTimestamp();
	... (work of writing a stats file) ...
	last_statwrite = globalStats.stats_timestamp;
	last_statrequest = last_statwrite;

If the collector entered pgstat_write_statsfile() with more inquiries waiting in its socket receive buffer, it would ignore them as being too old once it finished the write and resumed message processing. Commit 187492b converted last_statrequest to a "last_statrequests" list that we wipe after each write.

I modeled a machine with slow stats writes using the attached diagnostic patch (not for commit). It has pgstat_write_statsfiles() sleep just before renaming the temporary file, and it logs each stats message received. A three second delay causes stats.sql to fail the way it did on shearwater[1] and on axolotl[2]. Inquiries accumulate until the socket receive buffer overflows, at which point the socket drops stats messages whose effects we later test for. The 3s delay makes us drop some 85% of installcheck stats messages.

Logs from a single VACUUM show receipt of five inquiries ("stats 1:") with 3s between them:

24239 2015-12-10 04:21:03.439 GMT LOG:  connection authorized: user=nmisch database=test
24236 2015-12-10 04:21:03.454 GMT LOG:  stats 2: 1888 + 936 = 2824
24236 2015-12-10 04:21:03.454 GMT LOG:  stats 2: 2824 + 376 = 3200
24236 2015-12-10 04:21:03.454 GMT LOG:  stats 2: 3200 + 824 = 4024
24239 2015-12-10 04:21:03.455 GMT LOG:  statement: vacuum pg_class
24236 2015-12-10 04:21:03.455 GMT LOG:  stats 1: 4024 + 32 = 4056
24236 2015-12-10 04:21:06.458 GMT LOG:  stats 12: 4056 + 88 = 4144
24236 2015-12-10 04:21:06.458 GMT LOG:  stats 1: 4144 + 32 = 4176
24239 2015-12-10 04:21:06.463 GMT LOG:  disconnection: session time: 0:00:03.025 user=nmisch database=test host=[local]
24236 2015-12-10 04:21:09.486 GMT LOG:  stats 1: 4176 + 32 = 4208
24236 2015-12-10 04:21:12.503 GMT LOG:  stats 1: 4208 + 32 = 4240
24236 2015-12-10 04:21:15.519 GMT LOG:  stats 1: 4240 + 32 = 4272
24236 2015-12-10 04:21:18.535 GMT LOG:  stats 9: 4272 + 48 = 4320
24236 2015-12-10 04:21:18.535 GMT LOG:  stats 2: 4320 + 936 = 5256
24236 2015-12-10 04:21:18.535 GMT LOG:  stats 2: 5256 + 376 = 5632
24236 2015-12-10 04:21:18.535 GMT LOG:  stats 2: 5632 + 264 = 5896
24236 2015-12-10 04:21:18.535 GMT LOG:  stats 12: 5896 + 88 = 5984

As for how to fix this, the most direct translation of the old logic is to retain last_statrequests entries that could help coalesce inquiries. I lean toward that for an initial, back-patched fix. It would be good, though, to process two closely-spaced, different-database inquiries in one pgstat_write_statsfiles() call. We do one-database writes and all-databases writes, but we never write "1 < N < all" databases despite the code being prepared to do so. I tried calling pgstat_write_statsfiles() only when the socket receive buffer empties.
That's dead simple to implement and aggressively coalesces inquiries (even a 45s sleep did not break stats.sql), but it starves inquirers if the socket receive buffer stays full persistently. Ideally, I'd want to process inquiries when the buffer empties _or_ when the oldest inquiry is X seconds old. I don't have a more-specific design in mind, though. Thanks, nm [1] http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shearwater&dt=2015-09-23%2002%3A08%3A31 [2] http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-08-04%2019%3A31%3A22
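[One possible shape for that "write when the inbox drains or the oldest inquiry gets too old" idea, as a standalone sketch. The polling helpers and the 100 ms budget are assumptions for illustration, not PostgreSQL APIs.]

=======================================================================
#include <stdbool.h>
#include <stdint.h>

typedef int64_t ms_time_t;

/* stubbed stand-ins for the collector's real socket/clock/write machinery */
static bool have_message(void) { return false; }
static bool recv_and_apply_message(void) { return false; }	/* true if inquiry */
static ms_time_t now_ms(void) { return 0; }
static void write_requested_statsfiles(void) { }

#define MAX_INQUIRY_AGE_MS 100		/* latency budget; value is an assumption */

/* One iteration of a collector-style loop: drain the inbox, but never let
 * the oldest unanswered inquiry wait longer than the budget before the
 * requested files get written. */
static void
collector_iteration(void)
{
	ms_time_t	oldest_pending = -1;	/* -1 = no inquiry pending */

	while (have_message())
	{
		if (recv_and_apply_message() && oldest_pending < 0)
			oldest_pending = now_ms();

		/* stop draining if the oldest inquiry has waited long enough */
		if (oldest_pending >= 0 &&
			now_ms() - oldest_pending >= MAX_INQUIRY_AGE_MS)
			break;
	}

	if (oldest_pending >= 0)
		write_requested_statsfiles();
}

int
main(void)
{
	collector_iteration();
	return 0;
}
=======================================================================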
Attachment
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
Hi, On 12/22/2015 03:49 PM, Noah Misch wrote: > On Mon, Feb 18, 2013 at 06:19:12PM -0300, Alvaro Herrera wrote: >> I have pushed it now. Further testing, of course, is always welcome. > > While investigating stats.sql buildfarm failures, mostly on animals axolotl > and shearwater, I found that this patch (commit 187492b) inadvertently removed > the collector's ability to coalesce inquiries. Every PGSTAT_MTYPE_INQUIRY > received now causes one stats file write. Before, pgstat_recv_inquiry() did: > > if (msg->inquiry_time > last_statrequest) > last_statrequest = msg->inquiry_time; > > and pgstat_write_statsfile() did: > > globalStats.stats_timestamp = GetCurrentTimestamp(); > ... (work of writing a stats file) ... > last_statwrite = globalStats.stats_timestamp; > last_statrequest = last_statwrite; > > If the collector entered pgstat_write_statsfile() with more inquiries waiting > in its socket receive buffer, it would ignore them as being too old once it > finished the write and resumed message processing. Commit 187492b converted > last_statrequest to a "last_statrequests" list that we wipe after each write. So essentially we remove the list of requests, and thus on the next round we don't know the timestamp of the last request and write the file again unnecessarily. Do I get that right? What if we instead kept the list but marked the requests as 'invalid' so that we know the timestamp? In that case we'd be able to do pretty much exactly what the original code did (but at per-db granularity). We'd have to cleanup the list once in a while not to grow excessively large, but something like removing entries older than PGSTAT_STAT_INTERVAL should be enough. Actually, I think that was the idea when I wrote the patch, but apparently I got distracted and it did not make it into the code. > > I modeled a machine with slow stats writes using the attached diagnostic patch > (not for commit). It has pgstat_write_statsfiles() sleep just before renaming > the temporary file, and it logs each stats message received. A three second > delay causes stats.sql to fail the way it did on shearwater[1] and on > axolotl[2]. Inquiries accumulate until the socket receive buffer overflows, > at which point the socket drops stats messages whose effects we later test > for. The 3s delay makes us drop some 85% of installcheck stats messages. 
> Logs from a single VACUUM show receipt of five inquiries ("stats 1:") with 3s > between them: > > 24239 2015-12-10 04:21:03.439 GMT LOG: connection authorized: user=nmisch database=test > 24236 2015-12-10 04:21:03.454 GMT LOG: stats 2: 1888 + 936 = 2824 > 24236 2015-12-10 04:21:03.454 GMT LOG: stats 2: 2824 + 376 = 3200 > 24236 2015-12-10 04:21:03.454 GMT LOG: stats 2: 3200 + 824 = 4024 > 24239 2015-12-10 04:21:03.455 GMT LOG: statement: vacuum pg_class > 24236 2015-12-10 04:21:03.455 GMT LOG: stats 1: 4024 + 32 = 4056 > 24236 2015-12-10 04:21:06.458 GMT LOG: stats 12: 4056 + 88 = 4144 > 24236 2015-12-10 04:21:06.458 GMT LOG: stats 1: 4144 + 32 = 4176 > 24239 2015-12-10 04:21:06.463 GMT LOG: disconnection: session time: 0:00:03.025 user=nmisch database=test host=[local] > 24236 2015-12-10 04:21:09.486 GMT LOG: stats 1: 4176 + 32 = 4208 > 24236 2015-12-10 04:21:12.503 GMT LOG: stats 1: 4208 + 32 = 4240 > 24236 2015-12-10 04:21:15.519 GMT LOG: stats 1: 4240 + 32 = 4272 > 24236 2015-12-10 04:21:18.535 GMT LOG: stats 9: 4272 + 48 = 4320 > 24236 2015-12-10 04:21:18.535 GMT LOG: stats 2: 4320 + 936 = 5256 > 24236 2015-12-10 04:21:18.535 GMT LOG: stats 2: 5256 + 376 = 5632 > 24236 2015-12-10 04:21:18.535 GMT LOG: stats 2: 5632 + 264 = 5896 > 24236 2015-12-10 04:21:18.535 GMT LOG: stats 12: 5896 + 88 = 5984 > > > As for how to fix this, the most direct translation of the old logic is to > retain last_statrequests entries that could help coalesce inquiries.I lean > toward that for an initial, back-patched fix. That seems reasonable and I believe it's pretty much the idea I came up with above, right? Depending on how you define "entries that could help coalesce inquiries". > It would be good, though, to > process two closely-spaced, different-database inquiries in one > pgstat_write_statsfiles() call. We do one-database writes and all-databases > writes, but we never write "1 < N < all" databases despite the code prepared > to do so. I tried calling pgstat_write_statsfiles() only when the socket > receive buffer empties. That's dead simple to implement and aggressively > coalesces inquiries (even a 45s sleep did not break stats.sql), but it starves > inquirers if the socket receive buffer stays full persistently. Ideally, I'd > want to process inquiries when the buffer empties _or_ when the oldest inquiry > is X seconds old. I don't have a more-specific design in mind, though. That's a nice idea, but I agree that binding the coalescing to buffer like this seems like a pretty bad idea exactly because of the starving. What might work though is if we could look at how much data is there in the buffer, process only those requests and then write the stats files. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Noah Misch
Date:
On Mon, Feb 01, 2016 at 07:03:45PM +0100, Tomas Vondra wrote: > On 12/22/2015 03:49 PM, Noah Misch wrote: > >On Mon, Feb 18, 2013 at 06:19:12PM -0300, Alvaro Herrera wrote: > >>I have pushed it now. Further testing, of course, is always welcome. > > > >While investigating stats.sql buildfarm failures, mostly on animals axolotl > >and shearwater, I found that this patch (commit 187492b) inadvertently removed > >the collector's ability to coalesce inquiries. Every PGSTAT_MTYPE_INQUIRY > >received now causes one stats file write. Before, pgstat_recv_inquiry() did: > > > > if (msg->inquiry_time > last_statrequest) > > last_statrequest = msg->inquiry_time; > > > >and pgstat_write_statsfile() did: > > > > globalStats.stats_timestamp = GetCurrentTimestamp(); > >... (work of writing a stats file) ... > > last_statwrite = globalStats.stats_timestamp; > > last_statrequest = last_statwrite; > > > >If the collector entered pgstat_write_statsfile() with more inquiries waiting > >in its socket receive buffer, it would ignore them as being too old once it > >finished the write and resumed message processing. Commit 187492b converted > >last_statrequest to a "last_statrequests" list that we wipe after each write. > > So essentially we remove the list of requests, and thus on the next round we > don't know the timestamp of the last request and write the file again > unnecessarily. Do I get that right? Essentially right. Specifically, for each database, we must remember the globalStats.stats_timestamp of the most recent write. It could be okay to forget the last request timestamp. (I now doubt I picked the best lines to quote, above.) > What if we instead kept the list but marked the requests as 'invalid' so > that we know the timestamp? In that case we'd be able to do pretty much > exactly what the original code did (but at per-db granularity). The most natural translation of the old code would be to add a write_time field to struct DBWriteRequest. One can infer "invalid" from write_time and request_time. There are other reasonable designs, though. > We'd have to cleanup the list once in a while not to grow excessively large, > but something like removing entries older than PGSTAT_STAT_INTERVAL should > be enough. Specifically, if you assume the socket delivers messages in the order sent, you may as well discard entries having write_time at least PGSTAT_STAT_INTERVAL older than the most recent cutoff_time seen in a PgStat_MsgInquiry. That delivery order assumption does not hold in general, but I expect it's close enough for this purpose. > >As for how to fix this, the most direct translation of the old logic is to > >retain last_statrequests entries that could help coalesce inquiries.I lean > >toward that for an initial, back-patched fix. > > That seems reasonable and I believe it's pretty much the idea I came up with > above, right? Depending on how you define "entries that could help coalesce > inquiries". Yes.
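[For illustration, a standalone sketch of the struct change floated here: remember the last write time per request so an already-satisfied ("invalid") request can be recognized. Simplified types, not the actual pgstat.c definitions.]

=======================================================================
#include <stdbool.h>
#include <stdint.h>

typedef int64_t sketch_time_t;

typedef struct
{
	unsigned int	databaseid;
	sketch_time_t	request_time;	/* cutoff_time of the newest inquiry seen */
	sketch_time_t	write_time;		/* when we last wrote this DB's file; 0 = never */
} SketchDBWriteRequest;

/* A request still needs a write only if nothing at least as new is on disk;
 * otherwise it can be treated as "invalid" (already satisfied). */
static bool
sketch_request_pending(const SketchDBWriteRequest *req)
{
	return req->write_time == 0 || req->write_time < req->request_time;
}

int
main(void)
{
	SketchDBWriteRequest req = {1, 100, 200};

	return sketch_request_pending(&req) ? 1 : 0;	/* 0: already satisfied */
}
=======================================================================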
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 02/03/2016 06:46 AM, Noah Misch wrote: > On Mon, Feb 01, 2016 at 07:03:45PM +0100, Tomas Vondra wrote: >> On 12/22/2015 03:49 PM, Noah Misch wrote: ... >>> If the collector entered pgstat_write_statsfile() with more >>> inquiries waiting in its socket receive buffer, it would ignore >>> them as being too old once it finished the write and resumed >>> message processing. Commit 187492b converted last_statrequest to >>> a "last_statrequests" list that we wipe after each write. >> >> So essentially we remove the list of requests, and thus on the next >> round we don't know the timestamp of the last request and write the >> file again unnecessarily. Do I get that right? > > Essentially right. Specifically, for each database, we must remember > the globalStats.stats_timestamp of the most recent write. It could be > okay to forget the last request timestamp. (I now doubt I picked the > best lines to quote, above.) > >> What if we instead kept the list but marked the requests as >> 'invalid' so that we know the timestamp? In that case we'd be able >> to do pretty much exactly what the original code did (but at per-db >> granularity). > > The most natural translation of the old code would be to add a > write_time field to struct DBWriteRequest. One can infer "invalid" > from write_time and request_time. There are other reasonable designs, > though. OK, makes sense. I'll look into that. > >> We'd have to cleanup the list once in a while not to grow >> excessively large, but something like removing entries older than >> PGSTAT_STAT_INTERVAL should be enough. > > Specifically, if you assume the socket delivers messages in the order > sent, you may as well discard entries having write_time at least > PGSTAT_STAT_INTERVAL older than the most recent cutoff_time seen in a > PgStat_MsgInquiry. That delivery order assumption does not hold in > general, but I expect it's close enough for this purpose. Agreed. If I get that right, it might result in some false negatives (in the sense that we'll remove a record too early, forcing us to write the database file again). But I expect that to be a rare case. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
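[A standalone sketch of that pruning rule: keep only remembered requests whose write_time is within PGSTAT_STAT_INTERVAL of the newest cutoff_time seen. Array-based and simplified, purely for illustration.]

=======================================================================
#include <stdint.h>

#define SKETCH_STAT_INTERVAL_MS 500	/* mirrors PGSTAT_STAT_INTERVAL */

typedef int64_t sketch_time_t;

typedef struct
{
	unsigned int	databaseid;
	sketch_time_t	request_time;
	sketch_time_t	write_time;		/* 0 = not written yet */
} SketchReq;

/* Compact the array in place, dropping requests whose last write is more
 * than SKETCH_STAT_INTERVAL_MS older than the newest cutoff seen; returns
 * the number of entries kept. */
static int
sketch_prune_requests(SketchReq *reqs, int n, sketch_time_t newest_cutoff)
{
	int		i,
			kept = 0;

	for (i = 0; i < n; i++)
	{
		if (reqs[i].write_time == 0 ||
			reqs[i].write_time + SKETCH_STAT_INTERVAL_MS >= newest_cutoff)
			reqs[kept++] = reqs[i];
	}
	return kept;
}

int
main(void)
{
	SketchReq	reqs[] = {{1, 100, 100}, {2, 900, 900}};

	/* with newest_cutoff = 1000, only the second entry survives */
	return sketch_prune_requests(reqs, 2, 1000) == 1 ? 0 : 1;
}
=======================================================================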
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
Hi,

On 02/03/2016 06:46 AM, Noah Misch wrote:
> On Mon, Feb 01, 2016 at 07:03:45PM +0100, Tomas Vondra wrote:
>> On 12/22/2015 03:49 PM, Noah Misch wrote:
>>> On Mon, Feb 18, 2013 at 06:19:12PM -0300, Alvaro Herrera wrote:
>>>> I have pushed it now. Further testing, of course, is always welcome.
>>>
>>> While investigating stats.sql buildfarm failures, mostly on animals
>>> axolotl and shearwater, I found that this patch (commit 187492b)
>>> inadvertently removed the collector's ability to coalesce inquiries.
>>> Every PGSTAT_MTYPE_INQUIRY received now causes one stats file write.
>>> Before, pgstat_recv_inquiry() did:
>>>
>>>     if (msg->inquiry_time > last_statrequest)
>>>         last_statrequest = msg->inquiry_time;
>>>
>>> and pgstat_write_statsfile() did:
>>>
>>>     globalStats.stats_timestamp = GetCurrentTimestamp();
>>>     ... (work of writing a stats file) ...
>>>     last_statwrite = globalStats.stats_timestamp;
>>>     last_statrequest = last_statwrite;
>>>
>>> If the collector entered pgstat_write_statsfile() with more inquiries
>>> waiting in its socket receive buffer, it would ignore them as being too
>>> old once it finished the write and resumed message processing. Commit
>>> 187492b converted last_statrequest to a "last_statrequests" list that we
>>> wipe after each write.

So I've been looking at this today, and I think the attached patch should do the trick. I can't really verify it, because I've been unable to reproduce the non-coalescing - I presume it requires a much slower system (axolotl is RPi, not sure about shearwater).

The patch simply checks DBEntry.stats_timestamp in pgstat_recv_inquiry() and ignores requests that are already resolved by the last write (maybe this should accept a file written up to PGSTAT_STAT_INTERVAL in the past).

The required field is already in DBEntry (otherwise it'd be impossible to determine whether backends need to send inquiries or not), so this is a pretty trivial change. I can't think of a simpler patch.

Can you try applying the patch on a machine where the problem is reproducible? I might have some RPi machines lying around (for other purposes).

regards
-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Noah Misch
Date:
On Thu, Mar 03, 2016 at 06:08:07AM +0100, Tomas Vondra wrote: > On 02/03/2016 06:46 AM, Noah Misch wrote: > >On Mon, Feb 01, 2016 at 07:03:45PM +0100, Tomas Vondra wrote: > >>On 12/22/2015 03:49 PM, Noah Misch wrote: > >>>On Mon, Feb 18, 2013 at 06:19:12PM -0300, Alvaro Herrera wrote: > >>>>I have pushed it now. Further testing, of course, is always welcome. > >>> > >>>While investigating stats.sql buildfarm failures, mostly on animals > >>>axolotl and shearwater, I found that this patch (commit 187492b) > >>>inadvertently removed the collector's ability to coalesce inquiries. > >>>Every PGSTAT_MTYPE_INQUIRY received now causes one stats file write. > >>>Before, pgstat_recv_inquiry() did: > >>> > >>> if (msg->inquiry_time > last_statrequest) > >>> last_statrequest = msg->inquiry_time; > >>> > >>>and pgstat_write_statsfile() did: > >>> > >>> globalStats.stats_timestamp = GetCurrentTimestamp(); > >>>... (work of writing a stats file) ... > >>> last_statwrite = globalStats.stats_timestamp; > >>> last_statrequest = last_statwrite; > >>> > >>>If the collector entered pgstat_write_statsfile() with more inquiries > >>>waiting in its socket receive buffer, it would ignore them as being too > >>>old once it finished the write and resumed message processing. Commit > >>>187492b converted last_statrequest to a "last_statrequests" list that we > >>>wipe after each write. > > So I've been looking at this today, and I think the attached patch should do > the trick. I can't really verify it, because I've been unable to reproduce the > non-coalescing - I presume it requires much slower system (axolotl is RPi, not > sure about shearwater). > > The patch simply checks DBEntry,stats_timestamp in pgstat_recv_inquiry() and > ignores requests that are already resolved by the last write (maybe this > should accept a file written up to PGSTAT_STAT_INTERVAL in the past). > > The required field is already in DBEntry (otherwise it'd be impossible to > determine if backends need to send inquiries or not), so this is pretty > trivial change. I can't think of a simpler patch. > > Can you try applying the patch on a machine where the problem is > reproducible? I might have some RPi machines laying around (for other > purposes). I've not attempted to study the behavior on slow hardware. Instead, my report used stat-coalesce-v1.patch[1] to simulate slow writes. (That diagnostic patch no longer applies cleanly, so I'm attaching a rebased version. I've changed the patch name from "stat-coalesce" to "slow-stat-simulate" to more-clearly distinguish it from the "pgstat-coalesce" patch.) Problems remain after applying your patch; consider "VACUUM pg_am" behavior: 9.2 w/ stat-coalesce-v1.patch: VACUUM returns in 3s, stats collector writes each file 1x over 3s HEAD w/ slow-stat-simulate-v2.patch: VACUUM returns in 3s, stats collector writes each file 5x over 15s HEAD w/ slow-stat-simulate-v2.patch and your patch: VACUUM returns in 10s, stats collector writes no files over 10s [1] http://www.postgresql.org/message-id/20151222144950.GA2553834@tornado.leadboat.com
Attachment
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On Sun, 2016-03-13 at 18:46 -0400, Noah Misch wrote: > On Thu, Mar 03, 2016 at 06:08:07AM +0100, Tomas Vondra wrote: > > > > On 02/03/2016 06:46 AM, Noah Misch wrote: > > > > > > On Mon, Feb 01, 2016 at 07:03:45PM +0100, Tomas Vondra wrote: > > > > > > > > On 12/22/2015 03:49 PM, Noah Misch wrote: > > > > > > > > > > On Mon, Feb 18, 2013 at 06:19:12PM -0300, Alvaro Herrera > > > > > wrote: > > > > > > > > > > > > I have pushed it now. Further testing, of course, is > > > > > > always welcome. > > > > > While investigating stats.sql buildfarm failures, mostly on > > > > > animals > > > > > axolotl and shearwater, I found that this patch (commit > > > > > 187492b) > > > > > inadvertently removed the collector's ability to coalesce > > > > > inquiries. > > > > > Every PGSTAT_MTYPE_INQUIRY received now causes one stats file > > > > > write. > > > > > Before, pgstat_recv_inquiry() did: > > > > > > > > > > if (msg->inquiry_time > last_statrequest) > > > > > last_statrequest = msg->inquiry_time; > > > > > > > > > > and pgstat_write_statsfile() did: > > > > > > > > > > globalStats.stats_timestamp = GetCurrentTimestamp(); > > > > > ... (work of writing a stats file) ... > > > > > last_statwrite = globalStats.stats_timestamp; > > > > > last_statrequest = last_statwrite; > > > > > > > > > > If the collector entered pgstat_write_statsfile() with more > > > > > inquiries > > > > > waiting in its socket receive buffer, it would ignore them as > > > > > being too > > > > > old once it finished the write and resumed message > > > > > processing. Commit > > > > > 187492b converted last_statrequest to a "last_statrequests" > > > > > list that we > > > > > wipe after each write. > > So I've been looking at this today, and I think the attached patch > > should do > > the trick. I can't really verify it, because I've been unable to > > reproduce the > > non-coalescing - I presume it requires much slower system (axolotl > > is RPi, not > > sure about shearwater). > > > > The patch simply checks DBEntry,stats_timestamp in > > pgstat_recv_inquiry() and > > ignores requests that are already resolved by the last write (maybe > > this > > should accept a file written up to PGSTAT_STAT_INTERVAL in the > > past). > > > > The required field is already in DBEntry (otherwise it'd be > > impossible to > > determine if backends need to send inquiries or not), so this is > > pretty > > trivial change. I can't think of a simpler patch. > > > > Can you try applying the patch on a machine where the problem is > > reproducible? I might have some RPi machines laying around (for > > other > > purposes). > I've not attempted to study the behavior on slow hardware. Instead, > my report > used stat-coalesce-v1.patch[1] to simulate slow writes. (That > diagnostic > patch no longer applies cleanly, so I'm attaching a rebased > version. I've > changed the patch name from "stat-coalesce" to "slow-stat-simulate" > to > more-clearly distinguish it from the "pgstat-coalesce" > patch.) Problems > remain after applying your patch; consider "VACUUM pg_am" behavior: > > 9.2 w/ stat-coalesce-v1.patch: > VACUUM returns in 3s, stats collector writes each file 1x over 3s > HEAD w/ slow-stat-simulate-v2.patch: > VACUUM returns in 3s, stats collector writes each file 5x over 15s > HEAD w/ slow-stat-simulate-v2.patch and your patch: > VACUUM returns in 10s, stats collector writes no files over 10s Oh damn, the timestamp comparison in pgstat_recv_inquiry should be in the opposite direction. 
After fixing that "VACUUM pg_am" completes in 3 seconds and writes each file just once. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
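[A standalone sketch of the corrected check in pgstat_recv_inquiry(): skip an inquiry whose cutoff_time is already covered by the last write for that database (the broken version compared in the opposite direction). Simplified types for illustration, not the committed code.]

=======================================================================
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t sketch_time_t;

typedef struct
{
	sketch_time_t	cutoff_time;	/* backend is satisfied by a file this new */
	sketch_time_t	clock_time;		/* backend's clock when it asked */
} SketchInquiry;

typedef struct
{
	sketch_time_t	stats_timestamp;	/* when this DB's file was last written */
} SketchDbEntry;

/* Correct direction: the existing file is already new enough for the request. */
static bool
inquiry_already_satisfied(const SketchInquiry *msg, const SketchDbEntry *dbentry)
{
	return dbentry != NULL && dbentry->stats_timestamp >= msg->cutoff_time;
}

int
main(void)
{
	SketchInquiry	msg = {100, 150};
	SketchDbEntry	db = {120};

	return inquiry_already_satisfied(&msg, &db) ? 0 : 1;	/* 0: skip the write */
}
=======================================================================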
Attachment
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Noah Misch
Date:
[Aside: your new mail editor is rewrapping lines in quoted material, and the result is messy. I have rerewrapped one paragraph below.]

On Mon, Mar 14, 2016 at 02:00:03AM +0100, Tomas Vondra wrote:
> On Sun, 2016-03-13 at 18:46 -0400, Noah Misch wrote:
> > I've not attempted to study the behavior on slow hardware. Instead, my
> > report used stat-coalesce-v1.patch[1] to simulate slow writes. (That
> > diagnostic patch no longer applies cleanly, so I'm attaching a rebased
> > version. I've changed the patch name from "stat-coalesce" to
> > "slow-stat-simulate" to more-clearly distinguish it from the
> > "pgstat-coalesce" patch.) Problems remain after applying your patch;
> > consider "VACUUM pg_am" behavior:
> >
> > 9.2 w/ stat-coalesce-v1.patch:
> >   VACUUM returns in 3s, stats collector writes each file 1x over 3s
> > HEAD w/ slow-stat-simulate-v2.patch:
> >   VACUUM returns in 3s, stats collector writes each file 5x over 15s
> > HEAD w/ slow-stat-simulate-v2.patch and your patch:
> >   VACUUM returns in 10s, stats collector writes no files over 10s
>
> Oh damn, the timestamp comparison in pgstat_recv_inquiry should be in
> the opposite direction. After fixing that "VACUUM pg_am" completes in 3
> seconds and writes each file just once.

That fixes things. "make check" passes under an 8s stats write time.

> --- a/src/backend/postmaster/pgstat.c
> +++ b/src/backend/postmaster/pgstat.c
> @@ -4836,6 +4836,20 @@ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
>  	}
>
>  	/*
> +	 * Ignore requests that are already resolved by the last write.
> +	 *
> +	 * We discard the queue of requests at the end of pgstat_write_statsfiles(),
> +	 * so the requests already waiting on the UDP socket at that moment can't
> +	 * be discarded in the previous loop.
> +	 *
> +	 * XXX Maybe this should also care about the clock skew, just like the
> +	 * block a few lines down.

Yes, it should. (The problem is large (>~100s), backward clock resets, not skew.) A clock reset causing "msg->clock_time < dbentry->stats_timestamp" will usually also cause "msg->cutoff_time < dbentry->stats_timestamp". Such cases need the correction a few lines down.

The other thing needed here is to look through and update comments about last_statrequests. For example, this loop is dead code due to us writing files as soon as we receive one inquiry:

	/*
	 * Find the last write request for this DB. If it's older than the
	 * request's cutoff time, update it; otherwise there's nothing to do.
	 *
	 * Note that if a request is found, we return early and skip the below
	 * check for clock skew. This is okay, since the only way for a DB
	 * request to be present in the list is that we have been here since the
	 * last write round.
	 */
	slist_foreach(iter, &last_statrequests)
	...

I'm okay keeping the dead code for future flexibility, but the comment should reflect that.

Thanks,
nm
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
Hi, On 03/14/2016 07:14 AM, Noah Misch wrote: > [Aside: your new mail editor is rewrapping lines in quoted material, and the > result is messy. I have rerewrapped one paragraph below.] Thanks, I've noticed that too. I've been testing Evolution in the past few days, and apparently the line wrapping algorithm is broken. I've switched back to Thunderbird, so hopefully that'll fix it. > > On Mon, Mar 14, 2016 at 02:00:03AM +0100, Tomas Vondra wrote: >> On Sun, 2016-03-13 at 18:46 -0400, Noah Misch wrote: >>> I've not attempted to study the behavior on slow hardware. Instead, my >>> report used stat-coalesce-v1.patch[1] to simulate slow writes. (That >>> diagnostic patch no longer applies cleanly, so I'm attaching a rebased >>> version. I've changed the patch name from "stat-coalesce" to >>> "slow-stat-simulate" to more-clearly distinguish it from the >>> "pgstat-coalesce" patch.) Problems remain after applying your patch; >>> consider "VACUUM pg_am" behavior: >>> >>> 9.2 w/ stat-coalesce-v1.patch: >>> VACUUM returns in 3s, stats collector writes each file 1x over 3s >>> HEAD w/ slow-stat-simulate-v2.patch: >>> VACUUM returns in 3s, stats collector writes each file 5x over 15s >>> HEAD w/ slow-stat-simulate-v2.patch and your patch: >>> VACUUM returns in 10s, stats collector writes no files over 10s >> >> Oh damn, the timestamp comparison in pgstat_recv_inquiry should be in >> the opposite direction. After fixing that "VACUUM pg_am" completes in 3 >> seconds and writes each file just once. > > That fixes things. "make check" passes under an 8s stats write time. OK, good. > >> --- a/src/backend/postmaster/pgstat.c >> +++ b/src/backend/postmaster/pgstat.c >> @@ -4836,6 +4836,20 @@ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len) >> } >> >> /* >> + * Ignore requests that are already resolved by the last write. >> + * >> + * We discard the queue of requests at the end of pgstat_write_statsfiles(), >> + * so the requests already waiting on the UDP socket at that moment can't >> + * be discarded in the previous loop. >> + * >> + * XXX Maybe this should also care about the clock skew, just like the >> + * block a few lines down. > > Yes, it should. (The problem is large (>~100s), backward clock resets, not > skew.) A clock reset causing "msg->clock_time < dbentry->stats_timestamp" > will usually also cause "msg->cutoff_time < dbentry->stats_timestamp". Such > cases need the correction a few lines down. I'll look into that. I have to admit I have a hard time reasoning about the code handling clock skew, so it might take some time, though. > > The other thing needed here is to look through and update comments about > last_statrequests. For example, this loop is dead code due to us writing > files as soon as we receive one inquiry: > > /* > * Find the last write request for this DB. If it's older than the > * request's cutoff time, update it; otherwise there's nothing to do. > * > * Note that if a request is found, we return early and skip the below > * check for clock skew. This is okay, since the only way for a DB > * request to be present in the list is that we have been here since the > * last write round. > */ > slist_foreach(iter, &last_statrequests) ... > > I'm okay keeping the dead code for future flexibility, but the comment should > reflect that. Yes, that's another thing that I'd like to look into. Essentially the problem is that we always process the inquiries one by one, so we never actually see a list with more than a single element. Correct? 
I think the best way to address that is to first peek at how much data is in the UDP queue, and then fetch all of it before actually doing the writes. Peeking at the number of requests first (or even using some reasonable hard-coded limit) should prevent starving the inquirers in case of a steady stream of inquiries. regards Tomas -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
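A toy illustration of the drain-then-write idea above, under the assumption that each request carries only a database OID: if everything already sitting in the receive queue is collected before any write starts, duplicate requests for the same database collapse into a single write. This is not the collector's code, just a simulation with an in-memory array standing in for the UDP queue.
-----------------------------------------------------------------------------------
#include <stdio.h>
#include <stdbool.h>

#define MAX_DBS 16

int
main(void)
{
    /* OIDs of databases named in the queued inquiries, in arrival order */
    unsigned queued[] = {16384, 16385, 16384, 16384, 16386, 16385};
    int nqueued = sizeof(queued) / sizeof(queued[0]);

    unsigned todo[MAX_DBS];
    int ntodo = 0;

    /* drain the whole queue first, de-duplicating requests per database */
    for (int i = 0; i < nqueued; i++)
    {
        bool seen = false;

        for (int j = 0; j < ntodo; j++)
            if (todo[j] == queued[i])
                seen = true;
        if (!seen)
            todo[ntodo++] = queued[i];
    }

    /* ... and only then write, once per database */
    for (int j = 0; j < ntodo; j++)
        printf("write stats file for database %u\n", todo[j]);

    printf("%d queued inquiries -> %d writes\n", nqueued, ntodo);
    return 0;
}
-----------------------------------------------------------------------------------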
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Noah Misch
Date:
On Mon, Mar 14, 2016 at 01:33:08PM +0100, Tomas Vondra wrote: > On 03/14/2016 07:14 AM, Noah Misch wrote: > >On Mon, Mar 14, 2016 at 02:00:03AM +0100, Tomas Vondra wrote: > >>+ * XXX Maybe this should also care about the clock skew, just like the > >>+ * block a few lines down. > > > >Yes, it should. (The problem is large (>~100s), backward clock resets, not > >skew.) A clock reset causing "msg->clock_time < dbentry->stats_timestamp" > >will usually also cause "msg->cutoff_time < dbentry->stats_timestamp". Such > >cases need the correction a few lines down. > > I'll look into that. I have to admit I have a hard time reasoning about the > code handling clock skew, so it might take some time, though. No hurry; it would be no problem to delay this several months. > >The other thing needed here is to look through and update comments about > >last_statrequests. For example, this loop is dead code due to us writing > >files as soon as we receive one inquiry: > > > > /* > > * Find the last write request for this DB. If it's older than the > > * request's cutoff time, update it; otherwise there's nothing to do. > > * > > * Note that if a request is found, we return early and skip the below > > * check for clock skew. This is okay, since the only way for a DB > > * request to be present in the list is that we have been here since the > > * last write round. > > */ > > slist_foreach(iter, &last_statrequests) ... > > > >I'm okay keeping the dead code for future flexibility, but the comment should > >reflect that. > > Yes, that's another thing that I'd like to look into. Essentially the > problem is that we always process the inquiries one by one, so we never > actually see a list with more than a single element. Correct? Correct. > I think the best way to address that is to peek is to first check how much > data is in the UDP queue, and then fetching all of that before actually > doing the writes. Peeking at the number of requests first (or even some > reasonable hard-coded limit) should should prevent starving the inquirers in > case of a steady stream or inquiries. Now that you mention it, a hard-coded limit sounds good: write the files for pending inquiries whenever the socket empties or every N messages processed, whichever comes first. I don't think the amount of pending UDP data is portably available, and I doubt it's important. Breaking every, say, one thousand messages will make the collector predictably responsive to inquiries, and that is the important property. I would lean toward making this part 9.7-only; it would be a distinct patch from the one previously under discussion. Thanks, nm
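A small simulation of the hard-coded limit described above, with the message stream replaced by an array and the budget shrunk so the behavior is visible in a few lines of output. The constant name and the limit of four are invented for the example; the thread suggests something on the order of a thousand.
-----------------------------------------------------------------------------------
#include <stdio.h>

#define FLUSH_EVERY_N 4     /* stand-in for the suggested ~1000 messages */

static int pending_requests = 0;

static void
flush_pending(const char *reason)
{
    if (pending_requests > 0)
    {
        printf("write stats files for %d pending request(s) (%s)\n",
               pending_requests, reason);
        pending_requests = 0;
    }
}

int
main(void)
{
    /* 1 = inquiry message, 0 = some other stats message */
    int msgs[] = {0, 1, 0, 0, 1, 0, 0, 0, 1, 0};
    int nmsgs = sizeof(msgs) / sizeof(msgs[0]);
    int since_flush = 0;

    for (int i = 0; i < nmsgs; i++)
    {
        if (msgs[i] == 1)
            pending_requests++;     /* would record a DB write request */

        /* flush after a fixed number of messages, even if input keeps coming */
        if (++since_flush >= FLUSH_EVERY_N)
        {
            flush_pending("message budget reached");
            since_flush = 0;
        }
    }

    /* the socket would now report no more input */
    flush_pending("socket empty");
    return 0;
}
-----------------------------------------------------------------------------------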
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
Hi, On 03/15/2016 03:04 AM, Noah Misch wrote: > On Mon, Mar 14, 2016 at 01:33:08PM +0100, Tomas Vondra wrote: >> On 03/14/2016 07:14 AM, Noah Misch wrote: >>> On Mon, Mar 14, 2016 at 02:00:03AM +0100, Tomas Vondra wrote: >>>> + * XXX Maybe this should also care about the clock skew, just like the >>>> + * block a few lines down. >>> >>> Yes, it should. (The problem is large (>~100s), backward clock resets, not >>> skew.) A clock reset causing "msg->clock_time < dbentry->stats_timestamp" >>> will usually also cause "msg->cutoff_time < dbentry->stats_timestamp". Such >>> cases need the correction a few lines down. >> >> I'll look into that. I have to admit I have a hard time reasoning about the >> code handling clock skew, so it might take some time, though. > > No hurry; it would be no problem to delay this several months. Attached is a patch that should fix the coalescing, including the clock skew detection. In the end I reorganized the code a bit, moving the check to the end, after the clock skew detection. Otherwise I'd have to do the clock skew detection in multiple places, and that seemed ugly. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
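Since the patch itself is in the attachment, here is only a rough model of the ordering described above: deal with an apparent backward clock reset first, and only then drop requests that the last write already covers. The reset rule shown (pulling stats_timestamp back to just before the current time) and the type names are simplifications for illustration, not the patch's exact logic.
-----------------------------------------------------------------------------------
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t TimestampTz;            /* microseconds, stand-in type */

typedef struct
{
    TimestampTz stats_timestamp;        /* timestamp of the last write for the DB */
} DBEntry;

/* Decide whether an inquiry should result in a new stats file write. */
static bool
handle_inquiry(DBEntry *db, TimestampTz msg_clock_time,
               TimestampTz msg_cutoff_time, TimestampTz collector_now)
{
    /*
     * If both the requester's clock and ours appear to be behind the
     * recorded write time, distrust stats_timestamp and pull it back so
     * progress stays possible after a backward clock reset.
     */
    if (msg_clock_time < db->stats_timestamp &&
        collector_now < db->stats_timestamp)
        db->stats_timestamp = collector_now - 1;

    /* Only after that, skip requests the last write already resolved. */
    if (db->stats_timestamp >= msg_cutoff_time)
        return false;

    return true;                        /* queue a write for this database */
}

int
main(void)
{
    DBEntry db = { .stats_timestamp = 2000 };

    /* normal case: cutoff is newer than the last write, so write */
    printf("%d\n", handle_inquiry(&db, 2100, 2050, 2100));

    /* backward clock reset: stats_timestamp is pulled back, write proceeds */
    printf("%d\n", handle_inquiry(&db, 1500, 1500, 1500));

    /* request already covered by the last write: ignored */
    printf("%d\n", handle_inquiry(&db, 1600, 1400, 1600));
    return 0;
}
-----------------------------------------------------------------------------------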
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > Attached is a patch that should fix the coalescing, including the clock > skew detection. In the end I reorganized the code a bit, moving the > check at the end, after the clock skew detection. Otherwise I'd have to > do the clock skew detection on multiple places, and that seemed ugly. I hadn't been paying any attention to this thread, I must confess. But I rediscovered this no-coalescing problem while pursuing the poor behavior for shared catalogs that Peter complained of in https://www.postgresql.org/message-id/56AD41AC.1030509@gmx.net I posted a patch at https://www.postgresql.org/message-id/13023.1464213041@sss.pgh.pa.us which I think is functionally equivalent to what you have here, but it goes to some lengths to make the code more readable, whereas this is just adding another layer of complication to something that's already a mess (eg, the request_time field is quite useless as-is). So I'd like to propose pushing that in place of this patch ... do you care to review it first? Reacting to the thread overall: I see Noah's concern about wanting to merge the write work for requests about different databases. I've got mixed feelings about that: it's arguable that any such change would make things worse not better. In particular, it's inevitable that trying to merge writes will result in delaying the response to the first request, whether or not we are able to merge anything. That's not good in itself, and it means that we can't hope to merge requests over any very long interval, which very possibly will prevent any merging from happening in real situations. Also, considering that we know the stats collector can be pretty slow to respond at all under load, I'm worried that this would result in more outright failures. Moreover, what we'd hope to gain from merging is fewer writes of the global stats file and the shared-catalog stats file; but neither of those are very big, so I'm skeptical of what we'd win. In view of 52e8fc3e2, there's more or less no case in which we'd be writing stats without writing stats for the shared catalogs. So I'm tempted to propose that we try to reduce the overhead by merging the shared-catalog stats back into the global-stats file, thereby halving the filesystem metadata traffic for updating those. regards, tom lane
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
Hi, On 05/26/2016 10:10 PM, Tom Lane wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> Attached is a patch that should fix the coalescing, including the clock >> skew detection. In the end I reorganized the code a bit, moving the >> check at the end, after the clock skew detection. Otherwise I'd have to >> do the clock skew detection on multiple places, and that seemed ugly. > > I hadn't been paying any attention to this thread, I must confess. > But I rediscovered this no-coalescing problem while pursuing the poor > behavior for shared catalogs that Peter complained of in > https://www.postgresql.org/message-id/56AD41AC.1030509@gmx.net > > I posted a patch at > https://www.postgresql.org/message-id/13023.1464213041@sss.pgh.pa.us > which I think is functionally equivalent to what you have here, but > it goes to some lengths to make the code more readable, whereas this > is just adding another layer of complication to something that's > already a mess (eg, the request_time field is quite useless as-is). > So I'd like to propose pushing that in place of this patch ... do > you care to review it first? I do care and I'll look at it over the next few days. FWIW when writing that patch I intentionally refrained from major changes, as I think the plan was to backpatch it. But +1 for more readable code from me. > > Reacting to the thread overall: > > I see Noah's concern about wanting to merge the write work for > requests about different databases. I've got mixed feelings about > that: it's arguable that any such change would make things worse not > better. In particular, it's inevitable that trying to merge writes > will result in delaying the response to the first request, whether > or not we are able to merge anything. That's not good in itself, and > it means that we can't hope to merge requests over any very long > interval, which very possibly will prevent any merging from > happening in real situations. Also, considering that we know the > stats collector can be pretty slow to respond at all under load, I'm > worried that this would result in more outright failures. > > Moreover, what we'd hope to gain from merging is fewer writes of the > global stats file and the shared-catalog stats file; but neither of > those are very big, so I'm skeptical of what we'd win. Yep. Clearly there's a trade-off between slowing down response to the first request vs. speeding-up the whole process, but as you point out we probably can't gain enough to justify that. I wonder if that's still true on clusters with many databases (say, shared systems with thousands of dbs). Perhaps walking the list just once would save enough CPU to make this a win. In any case, if we decide to abandon the idea of merging requests for multiple databases, that probably means we can further simplify the code. last_statrequests is a list but it actually never contains more than just a single request. We kept it that way because of the plan to add the merging. But if that's not worth it ... > > In view of 52e8fc3e2, there's more or less no case in which we'd be > writing stats without writing stats for the shared catalogs. So I'm > tempted to propose that we try to reduce the overhead by merging the > shared-catalog stats back into the global-stats file, thereby > halving the filesystem metadata traffic for updating those. I find this a bit contradictory with the previous paragraph. 
If you believe that reducing the filesystem metadata traffic will have a measurable benefit, then surely merging writes for multiple dbs (thus not writing the global/shared files multiple times) will have even higher impact, no? E.g. let's assume we're still writing the global+shared+db files for each database. If there are requests for 10 databases, we'll write 30 files. If we merge those requests first, we're writing only 12 files. So I'm not sure about the metadata traffic argument, we'd need to see some numbers showing it really makes a difference. That being said, I'm not opposed to merging the shared catalog into the global-stats file - it's not really a separate database so having it in a separate file is a bit weird. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
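The file counts in that example work out as follows, assuming one global file, one shared-catalog file, and one file per database:
-----------------------------------------------------------------------------------
#include <stdio.h>

int
main(void)
{
    int ndbs = 10;

    /* one write round per database: global + shared + that DB's file */
    int unmerged = ndbs * 3;

    /* one merged round: global + shared written once, plus one file per DB */
    int merged = 2 + ndbs;

    printf("unmerged: %d files, merged: %d files\n", unmerged, merged);
    return 0;
}
-----------------------------------------------------------------------------------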
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Michael Paquier
Date:
On Thu, May 26, 2016 at 6:43 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On 05/26/2016 10:10 PM, Tom Lane wrote: >> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> In view of 52e8fc3e2, there's more or less no case in which we'd be >> writing stats without writing stats for the shared catalogs. So I'm >> tempted to propose that we try to reduce the overhead by merging the >> shared-catalog stats back into the global-stats file, thereby >> halving the filesystem metadata traffic for updating those. > > [...] > > That being said, I'm not opposed to merging the shared catalog into the > global-stats file - it's not really a separate database so having it in a > separate file is a bit weird. While looking at this stuff, to be honest I got surprised that shared relation stats are located in a file whose name depends on InvalidOid, so +1 from here as well to merge that into the global stats file. -- Michael
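For context, a sketch of the naming scheme being referred to; the directory and file-name format here are assumptions for illustration rather than text taken from the sources, but the point is that stats for the shared relations end up in the file keyed by InvalidOid, i.e. OID 0.
-----------------------------------------------------------------------------------
#include <stdio.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Build a per-database stats file path (illustrative format, not verbatim). */
static void
db_stat_filename(char *buf, size_t len, const char *dir, Oid dboid)
{
    snprintf(buf, len, "%s/db_%u.stat", dir, dboid);
}

int
main(void)
{
    char path[256];

    db_stat_filename(path, sizeof(path), "pg_stat_tmp", 16384);
    printf("regular database: %s\n", path);

    db_stat_filename(path, sizeof(path), "pg_stat_tmp", InvalidOid);
    printf("shared relations: %s\n", path);
    return 0;
}
-----------------------------------------------------------------------------------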
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On 05/26/2016 10:10 PM, Tom Lane wrote: >> In view of 52e8fc3e2, there's more or less no case in which we'd be >> writing stats without writing stats for the shared catalogs. So I'm >> tempted to propose that we try to reduce the overhead by merging the >> shared-catalog stats back into the global-stats file, thereby >> halving the filesystem metadata traffic for updating those. > I find this a bit contradictory with the previous paragraph. If you > believe that reducing the filesystem metadata traffic will have a > measurable benefit, then surely merging writes for multiple dbs (thus > not writing the global/shared files multiple times) will have even > higher impact, no? Well, my thinking is that this is something we could get "for free" without any penalty in response time. Going further will require some sort of tradeoff. regards, tom lane
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
Hi, On 05/26/2016 10:10 PM, Tom Lane wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> Attached is a patch that should fix the coalescing, including the clock >> skew detection. In the end I reorganized the code a bit, moving the >> check at the end, after the clock skew detection. Otherwise I'd have to >> do the clock skew detection on multiple places, and that seemed ugly. > > I hadn't been paying any attention to this thread, I must confess. > But I rediscovered this no-coalescing problem while pursuing the poor > behavior for shared catalogs that Peter complained of in > https://www.postgresql.org/message-id/56AD41AC.1030509@gmx.net > > I posted a patch at > https://www.postgresql.org/message-id/13023.1464213041@sss.pgh.pa.us > which I think is functionally equivalent to what you have here, but > it goes to some lengths to make the code more readable, whereas this > is just adding another layer of complication to something that's > already a mess (eg, the request_time field is quite useless as-is). > So I'd like to propose pushing that in place of this patch ... do you > care to review it first? I've reviewed the patch today, and it seems fine to me - correct and achieving the same goal as the patch posted to this thread (plus fixing the issue with shared catalogs and fixing many comments). FWIW do you still plan to back-patch this? Minimizing the amount of changes was one of the things I had in mind when writing "my" patch, which is why I ended up with parts that are less readable. The one change I'm not quite sure about is the removal of clock skew detection in pgstat_recv_inquiry(). You've removed the first check on the inquiry, replacing it with this comment: It seems sufficient to check for clock skew once per write round. But the first check was comparing msg/req, while the second check looks at dbentry/cur_ts. I don't see how those two clock skew check are redundant - if they are, the comment should explain that I guess. Another thing is that if you believe merging requests across databases is a silly idea, maybe we should bite the bullet and replace the list of requests with a single item. I'm not convinced about this, though. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On 05/26/2016 10:10 PM, Tom Lane wrote: >> I posted a patch at >> https://www.postgresql.org/message-id/13023.1464213041@sss.pgh.pa.us >> which I think is functionally equivalent to what you have here, but >> it goes to some lengths to make the code more readable, whereas this >> is just adding another layer of complication to something that's >> already a mess (eg, the request_time field is quite useless as-is). >> So I'd like to propose pushing that in place of this patch ... do you >> care to review it first? > I've reviewed the patch today, and it seems fine to me - correct and > achieving the same goal as the patch posted to this thread (plus fixing > the issue with shared catalogs and fixing many comments). Thanks for reviewing! > FWIW do you still plan to back-patch this? Minimizing the amount of > changes was one of the things I had in mind when writing "my" patch, > which is why I ended up with parts that are less readable. Yeah, I think it's a bug fix and should be back-patched. I'm not in favor of making things more complicated just to reduce the number of lines a patch touches. > The one change I'm not quite sure about is the removal of clock skew > detection in pgstat_recv_inquiry(). You've removed the first check on > the inquiry, replacing it with this comment: > It seems sufficient to check for clock skew once per write round. > But the first check was comparing msg/req, while the second check looks > at dbentry/cur_ts. I don't see how those two clock skew check are > redundant - if they are, the comment should explain that I guess. I'm confused here --- are you speaking of having removed if (msg->cutoff_time > req->request_time) req->request_time = msg->cutoff_time; ? That is not a check for clock skew, it's intending to be sure that req->request_time reflects the latest request for this DB when we've seen more than one request. But since req->request_time isn't actually being used anywhere, it's useless code. I reformatted the actual check for clock skew, but I do not think I changed its behavior. > Another thing is that if you believe merging requests across databases > is a silly idea, maybe we should bite the bullet and replace the list of > requests with a single item. I'm not convinced about this, though. No, I don't want to do that either. We're not spending much code by having pending_write_requests be a list rather than a single entry, and we might eventually figure out a reasonable way to time the flushes so that we can merge requests. regards, tom lane
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 05/31/2016 06:59 PM, Tom Lane wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> On 05/26/2016 10:10 PM, Tom Lane wrote: >>> I posted a patch at >>> https://www.postgresql.org/message-id/13023.1464213041@sss.pgh.pa.us >>> which I think is functionally equivalent to what you have here, but >>> it goes to some lengths to make the code more readable, whereas this >>> is just adding another layer of complication to something that's >>> already a mess (eg, the request_time field is quite useless as-is). >>> So I'd like to propose pushing that in place of this patch ... do you >>> care to review it first? > >> I've reviewed the patch today, and it seems fine to me - correct and >> achieving the same goal as the patch posted to this thread (plus fixing >> the issue with shared catalogs and fixing many comments). > > Thanks for reviewing! > >> FWIW do you still plan to back-patch this? Minimizing the amount of >> changes was one of the things I had in mind when writing "my" patch, >> which is why I ended up with parts that are less readable. > > Yeah, I think it's a bug fix and should be back-patched. I'm not in > favor of making things more complicated just to reduce the number of > lines a patch touches. > >> The one change I'm not quite sure about is the removal of clock skew >> detection in pgstat_recv_inquiry(). You've removed the first check on >> the inquiry, replacing it with this comment: >> It seems sufficient to check for clock skew once per write round. >> But the first check was comparing msg/req, while the second check looks >> at dbentry/cur_ts. I don't see how those two clock skew check are >> redundant - if they are, the comment should explain that I guess. > > I'm confused here --- are you speaking of having removed > > if (msg->cutoff_time > req->request_time) > req->request_time = msg->cutoff_time; > > ? That is not a check for clock skew, it's intending to be sure that > req->request_time reflects the latest request for this DB when we've > seen more than one request. But since req->request_time isn't > actually being used anywhere, it's useless code. Ah, you're right. I've made the mistake of writing the e-mail before drinking any coffee today, and I got distracted by the comment change. > I reformatted the actual check for clock skew, but I do not think I > changed its behavior. I'm not sure it does not change the behavior, though. request_time only became unused after you removed the two places that set the value (one of them in the clock skew check). I'm not sure this is a bad change, though. But there was a dependency between the new request and the preceding one. > >> Another thing is that if you believe merging requests across databases >> is a silly idea, maybe we should bite the bullet and replace the list of >> requests with a single item. I'm not convinced about this, though. > > No, I don't want to do that either. We're not spending much code by > having pending_write_requests be a list rather than a single entry, > and we might eventually figure out a reasonable way to time the flushes > so that we can merge requests. > +1 regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On 05/31/2016 06:59 PM, Tom Lane wrote: >> I'm confused here --- are you speaking of having removed >> if (msg->cutoff_time > req->request_time) >> req->request_time = msg->cutoff_time; >> ? That is not a check for clock skew, it's intending to be sure that >> req->request_time reflects the latest request for this DB when we've >> seen more than one request. But since req->request_time isn't >> actually being used anywhere, it's useless code. > Ah, you're right. I've made the mistake of writing the e-mail before > drinking any coffee today, and I got distracted by the comment change. >> I reformatted the actual check for clock skew, but I do not think I >> changed its behavior. > I'm not sure it does not change the behavior, though. request_time only > became unused after you removed the two places that set the value (one > of them in the clock skew check). Well, it's unused in the sense that the if-test quoted above is the only place in HEAD that examines the value of request_time. And since that if-test only controls whether we change the value, and not whether we proceed to make the clock skew check, I don't see how it's related to clock skew or indeed anything else at all. regards, tom lane
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tomas Vondra
Date:
On 05/31/2016 07:24 PM, Tom Lane wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> On 05/31/2016 06:59 PM, Tom Lane wrote: >>> I'm confused here --- are you speaking of having removed >>> if (msg->cutoff_time > req->request_time) >>> req->request_time = msg->cutoff_time; >>> ? That is not a check for clock skew, it's intending to be sure that >>> req->request_time reflects the latest request for this DB when we've >>> seen more than one request. But since req->request_time isn't >>> actually being used anywhere, it's useless code. > >> Ah, you're right. I've made the mistake of writing the e-mail before >> drinking any coffee today, and I got distracted by the comment change. > >>> I reformatted the actual check for clock skew, but I do not think I >>> changed its behavior. > >> I'm not sure it does not change the behavior, though. request_time only >> became unused after you removed the two places that set the value (one >> of them in the clock skew check). > > Well, it's unused in the sense that the if-test quoted above is the only > place in HEAD that examines the value of request_time. And since that > if-test only controls whether we change the value, and not whether we > proceed to make the clock skew check, I don't see how it's related > to clock skew or indeed anything else at all. I see, in that case it indeed is useless. I've checked how this worked in 9.2 (before the 9.3 patch that split the file per db), and back then last_statsrequest (transformed to request_time) was used to decide whether we need to write something. But now we do that by simply checking whether the list is empty. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > I've checked how this worked in 9.2 (before the 9.3 patch that split the > file per db), and back then last_statsrequest (transformed to > request_time) was used to decide whether we need to write something. But > now we do that by simply checking whether the list is empty. Right. In effect, 9.3 moved the decision about "do we need to write stats because of this request" from the main loop to pgstat_recv_inquiry() ... but we forgot to incorporate any check for whether the request was already satisfied into pgstat_recv_inquiry(). We can do that, though, as per either of the patches under discussion --- and once we do so, maintaining DBWriteRequest.request_time seems a bit pointless. It's conceivable that we'd try to implement merging of close-together requests in a way that would take account of how far back the oldest unsatisfied request is. But I think that a global oldest-request time would be sufficient for that; we don't need to track it per-database. In any case, it's hard to see exactly how to make that work without putting a gettimeofday() call into the inner message handling loop, which I believe we won't want to do on performance grounds. The previous speculation about doing writes every N messages or when we have no input to process seems more likely to lead to a useful answer. regards, tom lane
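Purely to illustrate the speculation in that last paragraph (and bearing in mind the concern above about putting a gettimeofday() call in the inner message loop), here is a sketch of what tracking a single, global oldest-request time could look like. The window constant and the names are invented for the example; it is not a proposal.
-----------------------------------------------------------------------------------
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t TimestampTz;            /* microseconds, stand-in type */

#define MERGE_WINDOW_USEC 100000        /* hypothetical 100 ms merge budget */

static TimestampTz oldest_pending = 0;  /* 0 means "no pending requests" */
static int pending_count = 0;

/* Record an inquiry; only the oldest unsatisfied request time matters. */
static void
note_request(TimestampTz now)
{
    if (pending_count++ == 0)
        oldest_pending = now;
}

/* Flush when the socket is idle or the oldest request has waited long enough. */
static bool
should_flush(TimestampTz now, bool socket_empty)
{
    if (pending_count == 0)
        return false;
    return socket_empty || (now - oldest_pending) >= MERGE_WINDOW_USEC;
}

int
main(void)
{
    note_request(1000);
    note_request(50000);                /* merged with the first request */

    printf("flush at t=60000, busy socket: %d\n", should_flush(60000, false));
    printf("flush at t=120000, busy socket: %d\n", should_flush(120000, false));
    return 0;
}
-----------------------------------------------------------------------------------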