Thread: WAL write of full pages
Our current WAL implementation writes copies of full pages to WAL before modifying the page on disk. This is done to prevent partial pages from being corrupted in case the operating system crashes during a page write. For example, suppose an 8k block is being written to a heap file. First the backend issues a write(), which copies the page into the kernel buffer cache. Later, the kernel sends the write request to the drive. Even if the file system uses 8k blocks, the disk is typically made up of 512-byte sectors, so the OS translates the 8k block into a contiguous number of disk sectors, in this case 16. There is no guarantee that all 16 sectors will be written --- perhaps 8 could be written, then the system crashes, or perhaps part of an 512-byte sector is written, but the remainder left unchanged. In all these cases, restarting the system will yield corrupt heap blocks. The WAL writes copies of full pages so that on restore, it can check each page to make sure it hasn't been corrupted. The system records an LSN (log serial number) on every page. When a pages is modified, its pre-change image is written to WAL, but not fsync'ed. Later, if a backend wants to write a page, it must make sure the LSN of page page is between the LSN of the last checkpoint and the LSN of the last fsync by a committed transactions. Only in those cases can the page be written because we are sure that a copy of the page is in the WAL in case there is a partial write. Now, as you can image, these WAL page writes take up a considerable amount of space in the WAL, and cause slowness, but no one has come up with a way to recover from partial pages write with it. The only way to minimze page writes is to increase checkpoint_segments and checkpoint_timeout so that checkpoints are less frequent, and pages have to be written fewer times to the WAL because old copies of the pages remain in WAL longer. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
If I understand WAL correctly (and I may not), it is essentially a write cache
for writes to the data files, because:

1. Data file writes are notoriously random, and writing the log is sequential.
   Ironically, the sectors mapped by the OS to the disk are likely not at all
   sequential, but they likely are more sequential than the random data
   writes.

2. Log writing allows use of small, super fast drives (e.g. Solid State Disks)
   to speed up total database performance. You can have slower drives for the
   large files in the database and still get acceptable performance.

3. WAL allows for syncing only the pages changed. For example, suppose 14
   transactions are in flight and each one modifies 40 pages of a data file.
   When one transaction commits, 560 pages are dirty, but only 40 need to be
   written. Without very close control of which buffers get dirtied to the OS
   (and Pg may have this, I am not sure), all 560 pages may get written in
   place of the 40 that actually need to be written.

My only complaint is about larger systems which have a single (or mirrored)
large array. If I have a very fast array of some sort that has proper caching,
and my data files are on the array, look at my options for log files:

1. Put them on the array.
   Pros:
   * Fastest "drive" available
   * RAID, so most reliable "drive" available
   Cons:
   * All changes get dumped twice: once for WAL, once at checkpoint.
   * The array is no slower on random writes than sequential ones, which means
     that the benefits of writing to WAL vs. the data files are lost.

2. Put them on an actual (or mirrored actual) spindle
   Pros:
   * Keeps WAL and data file I/O separate
   Cons:
   * All of the non-array drives are still slower than the array

3. Put them on mirrored solid state disks or another array
   Pros:
   * Very fast
   * WAL and data file I/O is separate
   Cons:
   * Big $. Extremely large $/GB ratio.
   * If an array, hordes of unused space.

I suspect (but cannot prove) that performance would jump for systems like ours
if WAL was done away with entirely and the individual data files were
synchronized on commit.

Is there a simple way to turn off WAL in the config files so that I may do
some benchmarking?

Bruce Momjian wrote:
> Our current WAL implementation writes copies of full pages to WAL before
> modifying the page on disk. This is done to prevent partial pages from being
> corrupted in case the operating system crashes during a page write.
>
> For example, suppose an 8k block is being written to a heap file. First the
> backend issues a write(), which copies the page into the kernel buffer
> cache. Later, the kernel sends the write request to the drive. Even if the
> file system uses 8k blocks, the disk is typically made up of 512-byte
> sectors, so the OS translates the 8k block into a contiguous run of disk
> sectors, in this case 16. There is no guarantee that all 16 sectors will be
> written --- perhaps 8 could be written, then the system crashes, or perhaps
> part of a 512-byte sector is written but the remainder left unchanged. In
> all these cases, restarting the system will yield corrupt heap blocks.
>
> The WAL writes copies of full pages so that on restore, it can check each
> page to make sure it hasn't been corrupted. The system records an LSN (log
> sequence number) on every page. When a page is modified, its pre-change
> image is written to WAL, but not fsync'ed. Later, if a backend wants to
> write a page, it must make sure the LSN of the page is between the LSN of
> the last checkpoint and the LSN of the last fsync by a committed
> transaction. Only in those cases can the page be written, because we are
> sure that a copy of the page is in the WAL in case there is a partial write.
>
> Now, as you can imagine, these WAL page writes take up a considerable amount
> of space in the WAL, and cause slowness, but no one has come up with a way
> to recover from partial page writes without them. The only way to minimize
> page writes is to increase checkpoint_segments and checkpoint_timeout so
> that checkpoints are less frequent, and pages have to be written fewer times
> to the WAL because old copies of the pages remain in WAL longer.
Marty Scholes <marty@outputservices.com> writes:
> * The array is no slower on random writes than sequential ones, which means
>   that the benefits of writing to WAL vs. the data files are lost.

The main benefit is that if the system crashes or loses power, your database
isn't lost.

-- 
greg
> I suspect (but cannot prove) that performance would jump for systems
> like ours if WAL was done away with entirely and the individual data
> files were synchronized on commit.

You know.. that's exactly what WAL is designed to prevent?

Grab a copy of 7.0 and 7.1. Do a benchmark between the 2 with fsync on in both
cases. I think you'll find one is about 50% faster than the other on a single
disk system, and about the same if you have gobs of battery backed write
cache.
Marty Scholes <marty@outputservices.com> writes:
> I suspect (but cannot prove) that performance would jump for systems
> like ours if WAL was done away with entirely and the individual data
> files were synchronized on commit.

I rather doubt this, since we used to do things that way and we saw an
across-the-board performance improvement when we got rid of it in favor of
WAL.

> Is there a simple way to turn off WAL in the config files so that I may
> do some benchmarking?

No, there's no way to turn it off at all. You can disable fsync'ing it, but
that's hardly representative of what would happen if the data writes had to be
fsync'd instead.

Your analysis is missing an important point, which is what happens when
multiple transactions successively modify the same page. With a
sync-the-data-files approach, we'd have to write the data page again for each
commit. With WAL, the data page will likely not get written at all (until a
checkpoint happens). Instead there will be per-transaction writes to the WAL,
but the data volume will be less since WAL records are generally tuple-sized
not page-sized. There's probably no win for large transactions that touch most
of the tuples on a given data page, but for small transactions it's a win.

			regards, tom lane
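A back-of-the-envelope calculation makes the difference in data volume easy to
see. The numbers below are invented for illustration (an 8K page, roughly
100-byte tuple-sized WAL records, 50 small transactions hitting the same
page); they are not measurements.

/* Back-of-the-envelope comparison; all numbers are made up but plausible. */
#include <stdio.h>

int main(void)
{
    const int page_size  = 8192;  /* data page size */
    const int wal_record = 100;   /* rough size of a tuple-sized WAL record */
    const int commits    = 50;    /* small transactions touching the same page */

    /* Sync-the-data-files: every commit must force the whole 8K page out. */
    long sync_files = (long) commits * page_size;

    /* WAL: one full-page image after the checkpoint, then tuple-sized
     * records; the data page itself is written once, at the next checkpoint. */
    long wal = page_size + (long) commits * wal_record + page_size;

    printf("sync data files: %ld bytes forced to disk\n", sync_files);  /* 409600 */
    printf("WAL approach:    %ld bytes forced to disk\n", wal);         /* 21384 */
    return 0;
}

Even granting that a cached array makes random writes as fast as sequential
ones, the WAL approach simply forces far fewer bytes per commit in this
scenario.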
Tom Lane wrote:
> Your analysis is missing an important point, which is what happens when
> multiple transactions successively modify the same page. With a
> sync-the-data-files approach, we'd have to write the data page again for
> each commit. With WAL, the data page will likely not get written at all
> (until a checkpoint happens). Instead there will be per-transaction
> writes to the WAL, but the data volume will be less since WAL records
> are generally tuple-sized not page-sized. There's probably no win for
> large transactions that touch most of the tuples on a given data page,
> but for small transactions it's a win.

Well said. I had not considered that the granularity of WAL entries was
different than that of dirtying data pages.

I have no doubt that all of these issues have been hashed out before, and I
appreciate you sharing the rationale behind the design decisions.

I can't help but wonder if there is a better way for update intensive
environments, which probably did not play a large role in design decisions.
Since I live it, I know of other shops that use an industrial strength RDBMS
(Oracle, Sybase, MS SQL, etc.) for batch data processing, not just transaction
processing. Often times a large data set comes in, gets loaded, then churned
for a few minutes/hours, then spit out, with relatively little residual data
held in the RDBMS.

Why use an RDBMS for this kind of work? Because it's faster/cheaper/better
than any alternative we have seen.

I have a 100 GB Oracle installation, small by most standards, but it has well
over 1 TB per month flushed through it. Bulk loads are not a "once in a while"
undertaking.

At any rate, thanks again.

Marty
Marty Scholes wrote:
>
> 2. Put them on an actual (or mirrored actual) spindle
>    Pros:
>    * Keeps WAL and data file I/O separate
>    Cons:
>    * All of the non-array drives are still slower than the array

Are you sure this is a problem? The dbt-2 benchmarks from osdl run on an 8-way
Intel computer with several raid arrays distributed to 40 disks. IIRC it
generates around 1.5 MB wal logs per second - well within the capability of a
single drive. My laptop can write around 10 MB/sec (measured with dd
if=/dev/zero of=fill and vmstat), fast drives should be above 20 MB/sec.

How much wal data is generated by large postgres setups? Are there any setups
that are limited by the wal logs?

-- 
   Manfred
Hi,

I was thinking the other way round. What if we write to WAL pages only those
portions which we need to modify and let the kernel do the job the way it sees
fit? What will happen if it fails?

Bruce Momjian wrote:
> Our current WAL implementation writes copies of full pages to WAL before
> modifying the page on disk. This is done to prevent partial pages from being
> corrupted in case the operating system crashes during a page write.

Assume a WAL page is zero at start and later written, say, a 128-byte block.
Then how exactly is writing 128 bytes different than writing the entire 8K
page, especially when we control neither the kernel/buffer cache nor the disk?

What is partial? PostgreSQL will always flush the entire data block to the WAL
page, won't it? If write returns, we can assume it is written.

> For example, suppose an 8k block is being written to a heap file. First the
> backend issues a write(), which copies the page into the kernel buffer
> cache. Later, the kernel sends the write request to the drive. Even if the
> file system uses 8k blocks, the disk is typically made up of 512-byte
> sectors, so the OS translates the 8k block into a contiguous run of disk
> sectors, in this case 16. There is no guarantee that all 16 sectors will be
> written --- perhaps 8 could be written, then the system crashes, or perhaps
> part of a 512-byte sector is written but the remainder left unchanged. In
> all these cases, restarting the system will yield corrupt heap blocks.

We are hoping to prevent WAL page corruption, which is part of file system
corruption. Do we propose to tackle file system corruption in order to
guarantee WAL integrity?

> The WAL writes copies of full pages so that on restore, it can check each
> page to make sure it hasn't been corrupted. The system records an LSN (log
> sequence number) on every page. When a page is modified, its pre-change
> image is written to WAL, but not fsync'ed. Later, if a backend wants to
> write a page, it must make sure the LSN of the page is between the LSN of
> the last checkpoint and the LSN of the last fsync by a committed
> transaction. Only in those cases can the page be written, because we are
> sure that a copy of the page is in the WAL in case there is a partial write.

Do we have a per page checksum? It could be in the control log, not
necessarily in WAL. But just asking since I don't know.

> Now, as you can imagine, these WAL page writes take up a considerable amount
> of space in the WAL, and cause slowness, but no one has come up with a way
> to recover from partial page writes without them. The only way to minimize
> page writes is to increase checkpoint_segments and checkpoint_timeout so
> that checkpoints are less frequent, and pages have to be written fewer times
> to the WAL because old copies of the pages remain in WAL longer.

If I am not mistaken, we rely upon WAL being consistent to ensure transaction
recovery. We write() WAL and fsync/open/close it to make sure it goes on disk
before data pages. What else can we do?

I can not see why writing an 8K block is any more safe than writing just the
changes.

I may be dead wrong but just putting my thoughts together..

Shridhar
Shridhar Daithankar wrote:
> Hi,
>
> I was thinking the other way round. What if we write to WAL pages only those
> portions which we need to modify and let the kernel do the job the way it
> sees fit? What will happen if it fails?

So you are saying only write the part of the page that we modify? I think the
kernel reads in the entire page, makes the modification, then writes it.
However, we still don't know that our 1.5k of changes made it onto the
platters completely.

> > Our current WAL implementation writes copies of full pages to WAL before
> > modifying the page on disk. This is done to prevent partial pages from
> > being corrupted in case the operating system crashes during a page
> > write.
>
> Assume a WAL page is zero at start and later written, say, a 128-byte block.
> Then how exactly is writing 128 bytes different than writing the entire 8K
> page, especially when we control neither the kernel/buffer cache nor the
> disk?
>
> What is partial? PostgreSQL will always flush the entire data block to the
> WAL page, won't it? If write returns, we can assume it is written.

If write returns, it means the data is in the kernel cache, not on the disks.
Fsync is the only thing that forces it to disk, and it is slow.

> > For example, suppose an 8k block is being written to a heap file. First
> > the backend issues a write(), which copies the page into the kernel buffer
> > cache. Later, the kernel sends the write request to the drive. Even if the
> > file system uses 8k blocks, the disk is typically made up of 512-byte
> > sectors, so the OS translates the 8k block into a contiguous run of disk
> > sectors, in this case 16. There is no guarantee that all 16 sectors will
> > be written --- perhaps 8 could be written, then the system crashes, or
> > perhaps part of a 512-byte sector is written but the remainder left
> > unchanged. In all these cases, restarting the system will yield corrupt
> > heap blocks.
>
> We are hoping to prevent WAL page corruption, which is part of file system
> corruption. Do we propose to tackle file system corruption in order to
> guarantee WAL integrity?

We assume the file system will come back with an xlog directory with files in
it because we fsync it.

> > The WAL writes copies of full pages so that on restore, it can check each
> > page to make sure it hasn't been corrupted. The system records an LSN (log
> > sequence number) on every page. When a page is modified, its pre-change
> > image is written to WAL, but not fsync'ed. Later, if a backend wants to
> > write a page, it must make sure the LSN of the page is between the LSN of
> > the last checkpoint and the LSN of the last fsync by a committed
> > transaction. Only in those cases can the page be written, because we are
> > sure that a copy of the page is in the WAL in case there is a partial
> > write.
>
> Do we have a per page checksum? It could be in the control log, not
> necessarily in WAL. But just asking since I don't know.

Yes, in WAL.

> > Now, as you can imagine, these WAL page writes take up a considerable
> > amount of space in the WAL, and cause slowness, but no one has come up
> > with a way to recover from partial page writes without them. The only way
> > to minimize page writes is to increase checkpoint_segments and
> > checkpoint_timeout so that checkpoints are less frequent, and pages have
> > to be written fewer times to the WAL because old copies of the pages
> > remain in WAL longer.
>
> If I am not mistaken, we rely upon WAL being consistent to ensure
> transaction recovery. We write() WAL and fsync/open/close it to make sure it
> goes on disk before data pages. What else can we do?
>
> I can not see why writing an 8K block is any more safe than writing just the
> changes.
>
> I may be dead wrong but just putting my thoughts together..

The problem is that we need to record what was on the page before we made the
modification, because there is no way to know that a write hasn't corrupted
some part of the page.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
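A minimal sketch of why the full-page copy answers the question (the names
here are hypothetical, not the real redo code): during recovery, a WAL record
that carries a full-page image simply overwrites whatever is on disk, so it
makes no difference whether the interrupted write left 0, 8, or 15 of the 16
sectors updated.

/* Sketch of redo using a full-page image; names are hypothetical. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 8192

typedef struct WalRecord {
    uint64_t lsn;                    /* position of this record in WAL */
    int      has_full_page_image;
    char     page_image[PAGE_SIZE];  /* valid when has_full_page_image is set */
    /* ... the tuple-level change data would follow ... */
} WalRecord;

/* Replay one record against the page as found on disk after the crash. */
static void redo_record(const WalRecord *rec, char *disk_page)
{
    if (rec->has_full_page_image) {
        /* We never look at the (possibly torn) on-disk contents at all. */
        memcpy(disk_page, rec->page_image, PAGE_SIZE);
    } else {
        /* Apply the tuple-level change.  This is only safe because some
         * earlier record since the last checkpoint restored the whole page. */
    }
}

int main(void)
{
    static char disk_page[PAGE_SIZE];       /* pretend this is a torn page */
    memset(disk_page, 0xAA, PAGE_SIZE / 2); /* only half of it was written */

    static WalRecord rec = { .lsn = 1234, .has_full_page_image = 1 };
    memset(rec.page_image, 0x55, PAGE_SIZE);

    redo_record(&rec, disk_page);
    printf("first byte after redo: 0x%02x\n", (unsigned char) disk_page[0]);
    return 0;
}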
Bruce Momjian wrote:
> Shridhar Daithankar wrote:
>> I can not see why writing an 8K block is any more safe than writing just
>> the changes.
>>
>> I may be dead wrong but just putting my thoughts together..
>
> The problem is that we need to record what was on the page before we
> made the modification because there is no way to know that a write
> hasn't corrupted some part of the page.

OK... I think there is hardly any way around the fact that we need to flush a
page the way we do it now. But that is slow. So what do we do?

How feasible would it be to push fsyncing those pages/files to the background
writer and have it done on priority? That way the disk I/O wait could get out
of the critical execution path. Maybe that could yield the performance benefit
we are looking for.

Also, just out of curiosity: is it possible that more than one transaction
grabs hold of different pages of WAL and starts putting data into them
simultaneously? In such a case a single fsync could do the job for more than
one backend, but replaying WAL would be akin to defragging a FAT partition..

Just a thought..

Shridhar
Bruce Momjian wrote:
> Our current WAL implementation writes copies of full pages to WAL before
> modifying the page on disk. This is done to prevent partial pages from
> being corrupted in case the operating system crashes during a page
> write.

InnoDB uses a doublebuffer system instead.
http://www.innodb.com/ibman.php#File.space.management

quote:

Starting from 3.23.40b, InnoDB uses a novel file flush technique called
"doublewrite". It adds safety to crash recovery after an operating system
crash or a power outage, and improves performance on most Unix flavors by
reducing the need for fsync() operations.

Doublewrite means that InnoDB before writing pages to a data file first writes
them to a contiguous tablespace area called the doublewrite buffer. Only after
the write and the flush to the doublewrite buffer has completed, InnoDB writes
the pages to their proper positions in the data file. If the operating system
crashes in the middle of a page write, InnoDB will in recovery find a good
copy of the page from the doublewrite buffer.

-- 
Dennis
Dennis Haney wrote:
> Bruce Momjian wrote:
>> Our current WAL implementation writes copies of full pages to WAL before
>> modifying the page on disk. This is done to prevent partial pages from
>> being corrupted in case the operating system crashes during a page
>> write.
>
> InnoDB uses a doublebuffer system instead.
> http://www.innodb.com/ibman.php#File.space.management
>
> quote:
>
> Starting from 3.23.40b, InnoDB uses a novel file flush technique called
> "doublewrite". It adds safety to crash recovery after an operating
> system crash or a power outage, and improves performance on most Unix
> flavors by reducing the need for fsync() operations.
>
> Doublewrite means that InnoDB before writing pages to a data file first
> writes them to a contiguous tablespace area called the doublewrite
> buffer. Only after the write and the flush to the doublewrite buffer has
> completed, InnoDB writes the pages to their proper positions in the data
> file. If the operating system crashes in the middle of a page write,
> InnoDB will in recovery find a good copy of the page from the
> doublewrite buffer.

That is what PostgreSQL calls WAL (Write Ahead Log). The issue here is that
WAL itself could become a bottleneck, since it is hit very frequently under
heavy load. So how do we speed up WAL itself?

Shridhar
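For comparison with the full-page-write approach, here is a minimal sketch of
the doublewrite idea in C. It is not InnoDB's code --- the file names, batch
size, and helper are assumptions made for illustration --- but it shows the
ordering that makes torn writes recoverable: flush the batch to a fixed
scratch area first, then to the real locations.

/* Sketch of a doublewrite-style flush; names and layout are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 8192
#define BATCH     16                 /* pages flushed per doublewrite cycle */

/* Write a batch of dirty pages safely:
 *   1. write + fsync them to the doublewrite area,
 *   2. only then write + fsync them to their final offsets.
 * If step 2 is torn by a crash, recovery copies the page back from step 1;
 * if step 1 is torn, the data file was never touched and is still intact. */
static int doublewrite_flush(int datafd, int dblwfd,
                             char pages[][PAGE_SIZE],
                             const off_t offsets[], int n)
{
    for (int i = 0; i < n; i++)
        if (pwrite(dblwfd, pages[i], PAGE_SIZE, (off_t) i * PAGE_SIZE) != PAGE_SIZE)
            return -1;
    if (fsync(dblwfd) != 0)
        return -1;

    for (int i = 0; i < n; i++)
        if (pwrite(datafd, pages[i], PAGE_SIZE, offsets[i]) != PAGE_SIZE)
            return -1;
    return fsync(datafd);
}

int main(void)
{
    static char pages[BATCH][PAGE_SIZE];
    off_t offsets[BATCH];
    for (int i = 0; i < BATCH; i++)
        offsets[i] = (off_t) i * PAGE_SIZE;

    int datafd = open("data.file", O_RDWR | O_CREAT, 0600);
    int dblwfd = open("doublewrite.area", O_RDWR | O_CREAT, 0600);
    if (datafd < 0 || dblwfd < 0 ||
        doublewrite_flush(datafd, dblwfd, pages, offsets, BATCH) != 0)
        perror("doublewrite_flush");
    return 0;
}

Compared with full-page images in WAL, the extra copy here goes to a fixed,
reused area instead of growing the log; the cost is that every dirty page is
still written twice.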
psql dbname can still connect, but when I go to the pg_database table the db
is not there; as a result I cannot do a pg_dump on it.

I tried forcing an entry into pg_database but it won't allow me to set the
oid?

Dave

-- 
Dave Cramer
519 939 0336
ICQ # 14675561
Shridhar Daithankar <shridhar@frodo.hserus.net> writes:
> We are hoping to prevent WAL page corruption which is part of file
> system corruption. Do we propose to tackle file system corruption in
> order to guarantee WAL integrity?

You really should study the code more before pontificating. We *do* take
measures to reduce the risk of file system corruption breaking WAL.
Specifically, a WAL segment is filled with zeroes and fsync'd before we ever
start to use it as live WAL space. The segment is never extended while in use.
Therefore, given a reasonable filesystem implementation, the metadata for the
segment file is down to disk before we ever use it, and it does not change
while we are using it. It's impractical to do the same for data files, of
course, since they have to be able to grow.

> I can not see why writing an 8K block is any more safe than writing just the
> changes.

It's not more safe, it's just a lot easier to manage. We'd need more than just
one "dirty" flag per buffer. In any case, the kernel would likely force the
write to be a multiple of its internal buffer size anyway. I'm not sure that
kernel buffers are as universally 8K as they once were (doesn't Linux use 4K?)
but trying to manage dirtiness down to the byte level is a waste of time.

			regards, tom lane
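The pre-zeroing Tom describes can be shown with a small sketch (not the actual
xlog.c code; the file name and constants are examples): the segment is created
at full size, filled with zero blocks, and fsync'd before any live WAL goes
into it, so its size and block layout never change afterwards.

/* Sketch of pre-allocating a WAL segment; file name and sizes are examples. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)   /* 16 MB WAL segment */
#define BLCKSZ   8192

int main(void)
{
    char zeroes[BLCKSZ];
    memset(zeroes, 0, sizeof(zeroes));

    int fd = open("000000010000000000000001", O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* Write the whole segment as zeroes so every block is allocated now. */
    for (off_t written = 0; written < SEG_SIZE; written += BLCKSZ)
        if (write(fd, zeroes, BLCKSZ) != BLCKSZ) { perror("write"); return 1; }

    /* Force both the data and the file metadata down to disk before the
     * segment is ever used as live WAL; after this point its size and
     * block layout never change. */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }
    close(fd);
    return 0;
}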
Shridhar Daithankar wrote:
> Bruce Momjian wrote:
>
>> Shridhar Daithankar wrote:
>>> I can not see why writing an 8K block is any more safe than writing just
>>> the changes.
>>>
>>> I may be dead wrong but just putting my thoughts together..
>
>> The problem is that we need to record what was on the page before we
>> made the modification because there is no way to know that a write
>> hasn't corrupted some part of the page.
>
> OK... I think there is hardly any way around the fact that we need to flush
> a page the way we do it now. But that is slow. So what do we do?
>
> How feasible would it be to push fsyncing those pages/files to the
> background writer and have it done on priority? That way the disk I/O wait
> could get out of the critical execution path. Maybe that could yield the
> performance benefit we are looking for.

We already allow committing transactions to flush WAL. We don't do the flush
when we write the page image to WAL, unless we can't get any other buffer and
have to write it ourselves and it hasn't already been fsync'ed by another
transaction. This is where the current LSN comes in --- it tells us how far
fsync has gone, and each page has an LSN that tells us when it was written to
WAL.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
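A minimal sketch of the rule Bruce is describing (hypothetical names, not the
real bufmgr/xlog code): before a dirty data page may be written out, WAL must
already be flushed at least up to that page's LSN, and because committing
transactions flush WAL anyway, the page writer usually finds the work already
done.

/* Sketch of the WAL-before-data rule; names are hypothetical. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t Lsn;

static Lsn wal_flushed_lsn;         /* how far WAL has been fsync'd */

/* Flush WAL up to 'upto' if it has not been flushed that far already. */
static void wal_flush(Lsn upto)
{
    if (upto > wal_flushed_lsn) {
        /* ... write() and fsync() the WAL files here ... */
        wal_flushed_lsn = upto;
    }
}

/* Write out a dirty data page.  The page's LSN says which WAL record last
 * touched it; that record (and the full-page image behind it) must be safely
 * on disk before the data page itself may go out. */
static void flush_data_page(Lsn page_lsn, const void *page)
{
    wal_flush(page_lsn);            /* usually a no-op: a commit already did it */
    /* ... now it is safe to write() the data page ... */
    (void) page;
}

int main(void)
{
    wal_flushed_lsn = 500;          /* a committed transaction flushed this far */
    flush_data_page(300, NULL);     /* no extra WAL fsync needed */
    flush_data_page(800, NULL);     /* forces a WAL flush first */
    printf("WAL flushed through LSN %llu\n", (unsigned long long) wal_flushed_lsn);
    return 0;
}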
You are correct, modern drives are much faster than this, for big, cacheable
writes.

Try compiling and running the following code and watching your disk I/O. Then
comment out the fsync(), which will make the writes cacheable. Notice the huge
difference. It is a multiple of 15 difference on my machine.

#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>     /* write(), fsync(), close(), unlink() */

int main()
{
    int i;
    char buf[8192];
    int ld;

    unlink("dump.out");
    ld = open("dump.out", O_WRONLY | O_CREAT, 0644);  /* O_CREAT needs a mode */

    for (i = 0; i < 65536; i++) {
        write(ld, buf, sizeof(buf));
        fsync(ld);
    }

    close(ld);
    return 0;
}

Manfred Spraul wrote:
> Marty Scholes wrote:
>
>> 2. Put them on an actual (or mirrored actual) spindle
>>    Pros:
>>    * Keeps WAL and data file I/O separate
>>    Cons:
>>    * All of the non-array drives are still slower than the array
>
> Are you sure this is a problem? The dbt-2 benchmarks from osdl run on an
> 8-way Intel computer with several raid arrays distributed to 40 disks.
> IIRC it generates around 1.5 MB wal logs per second - well within the
> capability of a single drive. My laptop can write around 10 MB/sec
> (measured with dd if=/dev/zero of=fill and vmstat), fast drives should
> be above 20 MB/sec.
>
> How much wal data is generated by large postgres setups? Are there any
> setups that are limited by the wal logs?
>
> --
>    Manfred
Dave Cramer <pg@fastcrypt.com> writes:
> psql dbname can still connect but when I go to the pg_database table the
> db is not there as a result I cannot do a pg_dump on it?

Hm, it doesn't make a lot of sense that fresh connections would still succeed
if the pg_database row is deleted, but ...

> I tried forcing an entry into pg_database but it won't allow me to set
> the oid ?

You don't have to; the DB OID doesn't appear anywhere within the database
(except possibly with the database comment, if you have one).

So:

* Determine the old DB OID, by elimination if necessary.

* Create a new database and determine its OID.

* Shut down postmaster.

* Blow away $PGDATA/base/NEWOID, and rename $PGDATA/base/OLDOID to be
  $PGDATA/base/NEWOID.

* Restart postmaster.

* Try to figure out what you did wrong, so you don't do it again...

			regards, tom lane
Tom,

Thanks, first of all it wasn't my mess, but someone else's.

Secondly, this worked, however I was unable to use the same name; some
remnants of the old database must have remained in pg_database.

I couldn't even reindex it with postgres -O -P

Dave

On Tue, 2004-03-16 at 11:11, Tom Lane wrote:
> Dave Cramer <pg@fastcrypt.com> writes:
> > psql dbname can still connect but when I go to the pg_database table the
> > db is not there as a result I cannot do a pg_dump on it?
>
> Hm, it doesn't make a lot of sense that fresh connections would still
> succeed if the pg_database row is deleted, but ...
>
> > I tried forcing an entry into pg_database but it won't allow me to set
> > the oid ?
>
> You don't have to; the DB OID doesn't appear anywhere within the
> database (except possibly with the database comment, if you have one).
>
> So:
>
> * Determine the old DB OID, by elimination if necessary.
>
> * Create a new database and determine its OID.
>
> * Shut down postmaster.
>
> * Blow away $PGDATA/base/NEWOID, and rename $PGDATA/base/OLDOID to
>   be $PGDATA/base/NEWOID.
>
> * Restart postmaster.
>
> * Try to figure out what you did wrong, so you don't do it again...
>
> 			regards, tom lane

-- 
Dave Cramer
519 939 0336
ICQ # 14675561
Dave Cramer <pg@fastcrypt.com> writes:
> Secondly, this worked, however I was unable to use the same name; some
> remnants of the old database must have remained in pg_database.

> I couldn't even reindex it with postgres -O -P

Interesting. I wonder what state the old tuple is really in ...

Could you send me the pg_database table file (off-list)?

			regards, tom lane
Try doing a vacuum full on template1 and restart the database. I've had to do
this before after renaming a database via the system catalogs.

Robert Treat

On Tue, 2004-03-16 at 12:05, Dave Cramer wrote:
> Tom,
>
> Thanks, first of all it wasn't my mess, but someone else's.
>
> Secondly, this worked, however I was unable to use the same name; some
> remnants of the old database must have remained in pg_database.
>
> I couldn't even reindex it with postgres -O -P
>
> Dave
> On Tue, 2004-03-16 at 11:11, Tom Lane wrote:
> > Dave Cramer <pg@fastcrypt.com> writes:
> > > psql dbname can still connect but when I go to the pg_database table the
> > > db is not there as a result I cannot do a pg_dump on it?
> >
> > Hm, it doesn't make a lot of sense that fresh connections would still
> > succeed if the pg_database row is deleted, but ...
> >
> > > I tried forcing an entry into pg_database but it won't allow me to set
> > > the oid ?
> >
> > You don't have to; the DB OID doesn't appear anywhere within the
> > database (except possibly with the database comment, if you have one).
> >
> > So:
> >
> > * Determine the old DB OID, by elimination if necessary.
> >
> > * Create a new database and determine its OID.
> >
> > * Shut down postmaster.
> >
> > * Blow away $PGDATA/base/NEWOID, and rename $PGDATA/base/OLDOID to
> >   be $PGDATA/base/NEWOID.
> >
> > * Restart postmaster.
> >
> > * Try to figure out what you did wrong, so you don't do it again...
> >
> > 			regards, tom lane
>
> -- 
> Dave Cramer
> 519 939 0336
> ICQ # 14675561

-- 
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
On Tue, 16 Mar 2004, Tom Lane wrote:

> Dave Cramer <pg@fastcrypt.com> writes:
> > Secondly, this worked, however I was unable to use the same name; some
> > remnants of the old database must have remained in pg_database.
>
> > I couldn't even reindex it with postgres -O -P
>
> Interesting. I wonder what state the old tuple is really in ...
>
> Could you send me the pg_database table file (off-list)?

Without looking at the actual data, it seems like PhonyHeapTupleSatisfiesNow()
in GetRawDatabaseInfo() might be to blame.
Gavin Sherry <swm@linuxworld.com.au> writes:
> Without looking at the actual data, it seems like
> PhonyHeapTupleSatisfiesNow() in GetRawDatabaseInfo() might be to blame.

That was my theory too, but having looked at the tuple, it's perfectly valid.
However, it appears that its xmin is way in the past, which means that
snapshot-aware scans won't see it. I think what happened is that the DBA of
this database (Dave says it ain't him!) never did any database-wide vacuums,
and thereby allowed the info in pg_database to get old enough to slide out of
the transaction window without being frozen :-(

			regards, tom lane
> Thanks, first of all it wasn't my mess, but someone else's.
>
> Secondly, this worked, however I was unable to use the same name; some
> remnants of the old database must have remained in pg_database.
>
> I couldn't even reindex it with postgres -O -P

Maybe try a full dump and reload now?

Chris