Thread: Moving pgstat.stat and pgstat.tmp
Hi, I'm currently doctoring a situation wherein we've got table inheritance scheme that over the years that has ballooned like only in your nightmares (think well over 100K tables + indexes on those). The obvious solution is to re-design the schema with a better partitioning scheme in mind (see another msg from me later today on that) but that's a big project that's just getting underway and an immediate concern is the I/O on out data partition due in large part to the stats file(s) getting hammered. We can verify this by looking at our write volume 45+ Mbits/s and watching it drop to well below 10 on average when we disable stat_row_level as well as watching the insane amounts of writes to pgstat.tmp when running the rwsnoop dtrace script. So, for the interim we're looking to move where the stats files are written to. I've made the changes to the file paths for pgstat.stat and pgstat.tmp in src/backend/postmaster/pgstat.c, recompiled and verified that everything seems to be working ok on our test machine. However, seeing as how I'm not all that familiar with the code base, I'm asking here: is that all I need to do? Is there anything I've missed? Erik Jones Software Developer | Emma® erik@myemma.com 800.595.4401 or 615.292.5888 615.292.0777 (fax) Emma helps organizations everywhere communicate & market in style. Visit us online at http://www.myemma.com
Erik Jones <erik@myemma.com> writes: > Hi, I'm currently doctoring a situation wherein we've got table > inheritance scheme that over the years that has ballooned like only > in your nightmares (think well over 100K tables + indexes on those). > The obvious solution is to re-design the schema with a better > partitioning scheme in mind (see another msg from me later today on > that) but that's a big project that's just getting underway and an > immediate concern is the I/O on out data partition due in large part > to the stats file(s) getting hammered. Which PG version? Early 8.2.x releases had a nasty bug that caused excessive stats file writes. regards, tom lane
On Dec 3, 2007, at 4:16 PM, Tom Lane wrote: > Erik Jones <erik@myemma.com> writes: >> Hi, I'm currently doctoring a situation wherein we've got table >> inheritance scheme that over the years that has ballooned like only >> in your nightmares (think well over 100K tables + indexes on those). >> The obvious solution is to re-design the schema with a better >> partitioning scheme in mind (see another msg from me later today on >> that) but that's a big project that's just getting underway and an >> immediate concern is the I/O on out data partition due in large part >> to the stats file(s) getting hammered. > > Which PG version? Early 8.2.x releases had a nasty bug that caused > excessive stats file writes. 8.2.5 on Solaris 10. Before we upgraded to 8.2.4 it was doing about 65 Mbs/sec. Interestingly, a while back we were running with the data directory mounted with forcedirectio and saw none of this, I'm guessing that fsync calls would have something to do with that? Erik Jones Software Developer | Emma® erik@myemma.com 800.595.4401 or 615.292.5888 615.292.0777 (fax) Emma helps organizations everywhere communicate & market in style. Visit us online at http://www.myemma.com
Erik Jones <erik@myemma.com> writes: > 8.2.5 on Solaris 10. Before we upgraded to 8.2.4 it was doing about > 65 Mbs/sec. Interestingly, a while back we were running with the > data directory mounted with forcedirectio and saw none of this, I'm > guessing that fsync calls would have something to do with that? Hmm ... no, because the stats file never gets fsync'd. I should think that forcedirectio would have made things worse. regards, tom lane
On Dec 3, 2007, at 6:10 PM, Tom Lane wrote: > Erik Jones <erik@myemma.com> writes: >> 8.2.5 on Solaris 10. Before we upgraded to 8.2.4 it was doing about >> 65 Mbs/sec. Interestingly, a while back we were running with the >> data directory mounted with forcedirectio and saw none of this, I'm >> guessing that fsync calls would have something to do with that? > > Hmm ... no, because the stats file never gets fsync'd. I should think > that forcedirectio would have made things worse. Interesting. If this is anything you'd like to look into I can provide whatever diagnostic output you need (iostat, vmstat, dtrace script outputs, etc...) but I do have to reiterate that we are an extreme corner case due to out schema size. For now, is renaming the #define'd paths for the stats file and temp file sufficient for moving them? Basically, we'd like to move them onto a RAM disk to give our disks a break. Erik Jones Software Developer | Emma® erik@myemma.com 800.595.4401 or 615.292.5888 615.292.0777 (fax) Emma helps organizations everywhere communicate & market in style. Visit us online at http://www.myemma.com
Erik Jones <erik@myemma.com> writes: > For now, is renaming the > #define'd paths for the stats file and temp file sufficient for > moving them? I would think so, but haven't tried it. There definitely shouldn't be anything outside pgstat.c that's touching them. regards, tom lane
On Monday 03 December 2007 20:22, Erik Jones wrote: > On Dec 3, 2007, at 6:10 PM, Tom Lane wrote: > > Erik Jones <erik@myemma.com> writes: > >> 8.2.5 on Solaris 10. Before we upgraded to 8.2.4 it was doing about > >> 65 Mbs/sec. Interestingly, a while back we were running with the > >> data directory mounted with forcedirectio and saw none of this, I'm > >> guessing that fsync calls would have something to do with that? > > > > Hmm ... no, because the stats file never gets fsync'd. I should think > > that forcedirectio would have made things worse. > > Interesting. If this is anything you'd like to look into I can > provide whatever diagnostic output you need (iostat, vmstat, dtrace > script outputs, etc...) but I do have to reiterate that we are an > extreme corner case due to out schema size. For now, is renaming the > #define'd paths for the stats file and temp file sufficient for > moving them? Basically, we'd like to move them onto a RAM disk to > give our disks a break. > Yeah, we've noticed the same problem (pgstat is the most active file on the system... uncovered in much the same way... go solaris). Actually I was wondering if it could be done with symlinks, a la moving xlogs. Since we do custom builds, that's not a real issue, but I was curious. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
Robert Treat wrote: > On Monday 03 December 2007 20:22, Erik Jones wrote: > > Interesting. If this is anything you'd like to look into I can > > provide whatever diagnostic output you need (iostat, vmstat, dtrace > > script outputs, etc...) but I do have to reiterate that we are an > > extreme corner case due to out schema size. For now, is renaming the > > #define'd paths for the stats file and temp file sufficient for > > moving them? Basically, we'd like to move them onto a RAM disk to > > give our disks a break. > > Yeah, we've noticed the same problem (pgstat is the most active file on the > system... uncovered in much the same way... go solaris). Actually I was > wondering if it could be done with symlinks, a la moving xlogs. Not really, because a new file is created and renamed in place each time it's going to be rewritten. So the symlink would be lost in the first file rewrite. The first idea that comes to mind is to make the path configurable via GUC, so the user could set it to be written to an in-memory filesystem (/tmp in Solaris?). But then I thought, why do we need it to be a file at all? Why not use a mmap'ed memory area or something like that, and only write it to a file on postmaster shutdown? (Losing the file on unclean shutdown is not a problem, because the file is removed anyway.) -- Alvaro Herrera Valdivia, Chile ICBM: S 39º 49' 18.1", W 73º 13' 56.4" "Prefiero omelette con amigos que caviar con tontos" (Alain Nonnet)
On Wednesday 05 December 2007 07:22, Alvaro Herrera wrote: > Robert Treat wrote: > > On Monday 03 December 2007 20:22, Erik Jones wrote: > > > Interesting. If this is anything you'd like to look into I can > > > provide whatever diagnostic output you need (iostat, vmstat, dtrace > > > script outputs, etc...) but I do have to reiterate that we are an > > > extreme corner case due to out schema size. For now, is renaming the > > > #define'd paths for the stats file and temp file sufficient for > > > moving them? Basically, we'd like to move them onto a RAM disk to > > > give our disks a break. > > > > Yeah, we've noticed the same problem (pgstat is the most active file on > > the system... uncovered in much the same way... go solaris). Actually I > > was wondering if it could be done with symlinks, a la moving xlogs. > > Not really, because a new file is created and renamed in place each time > it's going to be rewritten. So the symlink would be lost in the first > file rewrite. > Ah yeah, thats what I concluded back then. > The first idea that comes to mind is to make the path configurable via > GUC, so the user could set it to be written to an in-memory filesystem > (/tmp in Solaris?). Yep, thought of that to, though it was after feature freeze so I didn't propose it. Course if someone wants to sneak that in it would be cool :-) > But then I thought, why do we need it to be a file > at all? Why not use a mmap'ed memory area or something like that, and > only write it to a file on postmaster shutdown? (Losing the file on > unclean shutdown is not a problem, because the file is removed anyway.) I suppose you need some facility to spill to disk, so maybe being in a file is better? Seems it might not be in most cases... I wonder how big a memory space we (or Erik) need. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > But then I thought, why do we need it to be a file > at all? Why not use a mmap'ed memory area or something like that, and > only write it to a file on postmaster shutdown? Yeah, we definitely need some other technology for this. The difficulty is in dealing with a highly variably sized chunk of data --- our existing shmem approach won't work well, and once you get away from that the old portability question raises its head. There's also a synchronization issue: how can the stats collector make updates appear atomic? mmap by itself doesn't solve that AFAIK. regards, tom lane
On Dec 5, 2007, at 7:50 AM, Robert Treat wrote: > On Wednesday 05 December 2007 07:22, Alvaro Herrera wrote: >> Robert Treat wrote: >>> On Monday 03 December 2007 20:22, Erik Jones wrote: >>>> Interesting. If this is anything you'd like to look into I can >>>> provide whatever diagnostic output you need (iostat, vmstat, dtrace >>>> script outputs, etc...) but I do have to reiterate that we are an >>>> extreme corner case due to out schema size. For now, is >>>> renaming the >>>> #define'd paths for the stats file and temp file sufficient for >>>> moving them? Basically, we'd like to move them onto a RAM disk to >>>> give our disks a break. >>> >>> Yeah, we've noticed the same problem (pgstat is the most active >>> file on >>> the system... uncovered in much the same way... go solaris). >>> Actually I >>> was wondering if it could be done with symlinks, a la moving xlogs. >> >> Not really, because a new file is created and renamed in place >> each time >> it's going to be rewritten. So the symlink would be lost in the >> first >> file rewrite. >> > > Ah yeah, thats what I concluded back then. > >> The first idea that comes to mind is to make the path configurable >> via >> GUC, so the user could set it to be written to an in-memory >> filesystem >> (/tmp in Solaris?). > > Yep, thought of that to, though it was after feature freeze so I > didn't > propose it. Course if someone wants to sneak that in it would be > cool :-) > >> But then I thought, why do we need it to be a file >> at all? Why not use a mmap'ed memory area or something like that, >> and >> only write it to a file on postmaster shutdown? (Losing the file on >> unclean shutdown is not a problem, because the file is removed >> anyway.) > > I suppose you need some facility to spill to disk, so maybe being > in a file is > better? Seems it might not be in most cases... I wonder how big a > memory > space we (or Erik) need. What I've done and tested on our test db server is to change lines 65 & 66 in pg_stat.c from #define PGSTAT_STAT_FILENAME "global/pgstat.stat" #define PGSTAT_STAT_TMPFILE "global/pgstat.tmp" to #define PGSTAT_STAT_FILENAME "global/pg_stats/pgstat.stat" #define PGSTAT_STAT_TMPFILE "global/pg_stats/pgstat.tmp" recompile and then create that pg_stats directory as a symlink to a directory with a swapfs mounted in it. Everything seems to be kosher. Of course, this adds a bit to our shutdown procedure in the case where we're going to bounce the actual server in that we need to make sure to copy the stats file(s) out of the swapfs directory in order to preserve stats in that case (and back in afterwards, of course). Erik Jones Software Developer | Emma® erik@myemma.com 800.595.4401 or 615.292.5888 615.292.0777 (fax) Emma helps organizations everywhere communicate & market in style. Visit us online at http://www.myemma.com
On Dec 5, 2007, at 7:50 AM, Robert Treat wrote: > On Wednesday 05 December 2007 07:22, Alvaro Herrera wrote: >> Robert Treat wrote: >>> On Monday 03 December 2007 20:22, Erik Jones wrote: >>>> Interesting. If this is anything you'd like to look into I can >>>> provide whatever diagnostic output you need (iostat, vmstat, dtrace >>>> script outputs, etc...) but I do have to reiterate that we are an >>>> extreme corner case due to out schema size. For now, is >>>> renaming the >>>> #define'd paths for the stats file and temp file sufficient for >>>> moving them? Basically, we'd like to move them onto a RAM disk to >>>> give our disks a break. >>> >>> Yeah, we've noticed the same problem (pgstat is the most active >>> file on >>> the system... uncovered in much the same way... go solaris). >>> Actually I >>> was wondering if it could be done with symlinks, a la moving xlogs. >> >> Not really, because a new file is created and renamed in place >> each time >> it's going to be rewritten. So the symlink would be lost in the >> first >> file rewrite. >> > > Ah yeah, thats what I concluded back then. > >> The first idea that comes to mind is to make the path configurable >> via >> GUC, so the user could set it to be written to an in-memory >> filesystem >> (/tmp in Solaris?). > > Yep, thought of that to, though it was after feature freeze so I > didn't > propose it. Course if someone wants to sneak that in it would be > cool :-) > >> But then I thought, why do we need it to be a file >> at all? Why not use a mmap'ed memory area or something like that, >> and >> only write it to a file on postmaster shutdown? (Losing the file on >> unclean shutdown is not a problem, because the file is removed >> anyway.) > > I suppose you need some facility to spill to disk, so maybe being > in a file is > better? Seems it might not be in most cases... I wonder how big a > memory > space we (or Erik) need. We made the swapfs 300MB which is actually way more than we need as I don't think I've seen our pgstat.stat file crack 10MB using the entirely scientific method of spot-checking :) Erik Jones Software Developer | Emma® erik@myemma.com 800.595.4401 or 615.292.5888 615.292.0777 (fax) Emma helps organizations everywhere communicate & market in style. Visit us online at http://www.myemma.com
Added to TODO: * Reduce file system activity overhead of statistics file pgstat.stat http://archives.postgresql.org/pgsql-general/2007-12/msg00106.php --------------------------------------------------------------------------- Erik Jones wrote: > Hi, I'm currently doctoring a situation wherein we've got table > inheritance scheme that over the years that has ballooned like only > in your nightmares (think well over 100K tables + indexes on those). > The obvious solution is to re-design the schema with a better > partitioning scheme in mind (see another msg from me later today on > that) but that's a big project that's just getting underway and an > immediate concern is the I/O on out data partition due in large part > to the stats file(s) getting hammered. We can verify this by looking > at our write volume 45+ Mbits/s and watching it drop to well below 10 > on average when we disable stat_row_level as well as watching the > insane amounts of writes to pgstat.tmp when running the rwsnoop > dtrace script. > > So, for the interim we're looking to move where the stats files are > written to. I've made the changes to the file paths for pgstat.stat > and pgstat.tmp in src/backend/postmaster/pgstat.c, recompiled and > verified that everything seems to be working ok on our test machine. > However, seeing as how I'm not all that familiar with the code base, > I'm asking here: is that all I need to do? Is there anything I've > missed? > > Erik Jones > > Software Developer | Emma? > erik@myemma.com > 800.595.4401 or 615.292.5888 > 615.292.0777 (fax) > > Emma helps organizations everywhere communicate & market in style. > Visit us online at http://www.myemma.com > > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +