Thread: WAL write of full pages
Our current WAL implementation writes copies of full pages to WAL before modifying the page on disk. This is done to prevent partial pages from being corrupted in case the operating system crashes during a page write. For example, suppose an 8k block is being written to a heap file. First the backend issues a write(), which copies the page into the kernel buffer cache. Later, the kernel sends the write request to the drive. Even if the file system uses 8k blocks, the disk is typically made up of 512-byte sectors, so the OS translates the 8k block into a contiguous number of disk sectors, in this case 16. There is no guarantee that all 16 sectors will be written --- perhaps 8 could be written, then the system crashes, or perhaps part of an 512-byte sector is written, but the remainder left unchanged. In all these cases, restarting the system will yield corrupt heap blocks. The WAL writes copies of full pages so that on restore, it can check each page to make sure it hasn't been corrupted. The system records an LSN (log serial number) on every page. When a pages is modified, its pre-change image is written to WAL, but not fsync'ed. Later, if a backend wants to write a page, it must make sure the LSN of page page is between the LSN of the last checkpoint and the LSN of the last fsync by a committed transactions. Only in those cases can the page be written because we are sure that a copy of the page is in the WAL in case there is a partial write. Now, as you can image, these WAL page writes take up a considerable amount of space in the WAL, and cause slowness, but no one has come up with a way to recover from partial pages write with it. The only way to minimze page writes is to increase checkpoint_segments and checkpoint_timeout so that checkpoints are less frequent, and pages have to be written fewer times to the WAL because old copies of the pages remain in WAL longer. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
If I understand WAL correctly (and I may not), it is essentially a write cache
for writes to the data files, because:

1. Data file writes are notoriously random, and writing the log is sequential.
   Ironically, the sectors mapped by the OS to the disk are likely not at all
   sequential, but they likely are more sequential than the random data
   writes.

2. Log writing allows use of small, super fast drives (e.g. Solid State Disks)
   to speed up total database performance. You can have slower drives for the
   large files in the database and still get acceptable performance.

3. WAL allows for syncing only the pages changed. For example, suppose 14
   transactions are in flight and each one modifies 40 pages of a data file.
   When one transaction commits, 560 pages are dirty, but only 40 need to be
   written. Without very close control of which buffers get dirtied to the OS
   (and Pg may have this, I am not sure), all 560 pages may get written in
   place of the 40 that actually need to be written.

My only complaint is about larger systems which have a single (or mirrored)
large array. If I have a very fast array of some sort that has proper caching,
and my data files are on the array, look at my options for log files:

1. Put them on the array.
   Pros:
   * Fastest "drive" available
   * RAID, so most reliable "drive" available
   Cons:
   * All changes get dumped twice: once for WAL, once at checkpoint.
   * The array is no slower on random writes than sequential ones, which means
     that the benefits of writing to WAL vs. the data files are lost.

2. Put them on an actual (or mirrored actual) spindle
   Pros:
   * Keeps WAL and data file I/O separate
   Cons:
   * All of the non-array drives are still slower than the array

3. Put them on mirrored solid state disks or another array
   Pros:
   * Very fast
   * WAL and data file I/O is separate
   Cons:
   * Big $. Extremely large $/GB ratio.
   * If an array, hordes of unused space.

I suspect (but cannot prove) that performance would jump for systems like ours
if WAL was done away with entirely and the individual data files were
synchronized on commit.

Is there a simple way to turn off WAL in the config files so that I may do
some benchmarking?

Bruce Momjian wrote:
> Our current WAL implementation writes copies of full pages to WAL before
> modifying the page on disk. This is done to prevent partial pages from being
> corrupted in case the operating system crashes during a page write.
>
> For example, suppose an 8k block is being written to a heap file. First the
> backend issues a write(), which copies the page into the kernel buffer
> cache. Later, the kernel sends the write request to the drive. Even if the
> file system uses 8k blocks, the disk is typically made up of 512-byte
> sectors, so the OS translates the 8k block into a contiguous run of disk
> sectors, in this case 16. There is no guarantee that all 16 sectors will be
> written --- perhaps 8 could be written, then the system crashes, or perhaps
> part of a 512-byte sector is written but the remainder left unchanged. In
> all these cases, restarting the system will yield corrupt heap blocks.
>
> The WAL writes copies of full pages so that on restore, it can check each
> page to make sure it hasn't been corrupted. The system records an LSN (log
> sequence number) on every page. When a page is modified, its pre-change
> image is written to WAL, but not fsync'ed. Later, if a backend wants to
> write a page, it must make sure the LSN of the page is between the LSN of
> the last checkpoint and the LSN of the last fsync by a committed
> transaction. Only in those cases can the page be written, because we are
> sure that a copy of the page is in the WAL in case there is a partial write.
>
> Now, as you can imagine, these WAL page writes take up a considerable amount
> of space in the WAL, and cause slowness, but no one has come up with a way
> to recover from partial page writes without them. The only way to minimize
> page writes is to increase checkpoint_segments and checkpoint_timeout so
> that checkpoints are less frequent, and pages have to be written fewer times
> to the WAL because old copies of the pages remain in WAL longer.
Marty Scholes <marty@outputservices.com> writes:
> * The array is no slower on random writes than sequential ones, which means
>   that the benefits of writing to WAL vs. the data files are lost.

The main benefit is that if the system crashes or loses power, your database
isn't lost.

-- 
greg
> I suspect (but cannot prove) that performance would jump for systems
> like ours if WAL was done away with entirely and the individual data
> files were synchronized on commit.

You know.. that's exactly what WAL is designed to prevent?

Grab a copy of 7.0 and 7.1. Do a benchmark between the 2 with fsync on in both
cases. I think you'll find one is about 50% faster than the other on a single
disk system, and about the same if you have gobs of battery backed write
cache.
Marty Scholes <marty@outputservices.com> writes:
> I suspect (but cannot prove) that performance would jump for systems
> like ours if WAL was done away with entirely and the individual data
> files were synchronized on commit.

I rather doubt this, since we used to do things that way and we saw an
across-the-board performance improvement when we got rid of it in favor of
WAL.

> Is there a simple way to turn off WAL in the config files so that I may
> do some benchmarking?

No, there's no way to turn it off at all. You can disable fsync'ing it, but
that's hardly representative of what would happen if the data writes had to be
fsync'd instead.

Your analysis is missing an important point, which is what happens when
multiple transactions successively modify the same page. With a
sync-the-data-files approach, we'd have to write the data page again for each
commit. With WAL, the data page will likely not get written at all (until a
checkpoint happens). Instead there will be per-transaction writes to the WAL,
but the data volume will be less since WAL records are generally tuple-sized
not page-sized. There's probably no win for large transactions that touch most
of the tuples on a given data page, but for small transactions it's a win.

			regards, tom lane
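A back-of-the-envelope calculation makes the difference in data volume easy to
see. The numbers below are invented for illustration (an 8K page, roughly
100-byte tuple-sized WAL records, 50 small transactions hitting the same
page); they are not measurements.

/* Back-of-the-envelope comparison; all numbers are made up but plausible. */
#include <stdio.h>

int main(void)
{
    const int page_size  = 8192;  /* data page size */
    const int wal_record = 100;   /* rough size of a tuple-sized WAL record */
    const int commits    = 50;    /* small transactions touching the same page */

    /* Sync-the-data-files: every commit must force the whole 8K page out. */
    long sync_files = (long) commits * page_size;

    /* WAL: one full-page image after the checkpoint, then tuple-sized
     * records; the data page itself is written once, at the next checkpoint. */
    long wal = page_size + (long) commits * wal_record + page_size;

    printf("sync data files: %ld bytes forced to disk\n", sync_files);  /* 409600 */
    printf("WAL approach:    %ld bytes forced to disk\n", wal);         /* 21384 */
    return 0;
}

Even granting that a cached array makes random writes as fast as sequential
ones, the WAL approach simply forces far fewer bytes per commit in this
scenario.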
Tom Lane wrote:
> Your analysis is missing an important point, which is what happens when
> multiple transactions successively modify the same page. With a
> sync-the-data-files approach, we'd have to write the data page again for
> each commit. With WAL, the data page will likely not get written at all
> (until a checkpoint happens). Instead there will be per-transaction
> writes to the WAL, but the data volume will be less since WAL records
> are generally tuple-sized not page-sized. There's probably no win for
> large transactions that touch most of the tuples on a given data page,
> but for small transactions it's a win.

Well said. I had not considered that the granularity of WAL entries was
different than that of dirtying data pages.

I have no doubt that all of these issues have been hashed out before, and I
appreciate you sharing the rationale behind the design decisions.

I can't help but wonder if there is a better way for update intensive
environments, which probably did not play a large role in design decisions.
Since I live it, I know of other shops that use an industrial strength RDBMS
(Oracle, Sybase, MS SQL, etc.) for batch data processing, not just transaction
processing. Often times a large data set comes in, gets loaded, then churned
for a few minutes/hours, then spit out, with relatively little residual data
held in the RDBMS.

Why use an RDBMS for this kind of work? Because it's faster/cheaper/better
than any alternative we have seen.

I have a 100 GB Oracle installation, small by most standards, but it has well
over 1 TB per month flushed through it. Bulk loads are not a "once in a while"
undertaking.

At any rate, thanks again.

Marty
Marty Scholes wrote:
>
> 2. Put them on an actual (or mirrored actual) spindle
>    Pros:
>    * Keeps WAL and data file I/O separate
>    Cons:
>    * All of the non-array drives are still slower than the array

Are you sure this is a problem? The dbt-2 benchmarks from osdl run on an 8-way
Intel computer with several raid arrays distributed to 40 disks. IIRC it
generates around 1.5 MB wal logs per second - well within the capability of a
single drive. My laptop can write around 10 MB/sec (measured with dd
if=/dev/zero of=fill and vmstat), fast drives should be above 20 MB/sec.

How much wal data is generated by large postgres setups? Are there any setups
that are limited by the wal logs?

-- 
   Manfred
Hi,

I was thinking the other way round. What if we write to WAL pages only those
portions which we need to modify and let the kernel do the job the way it sees
fit? What will happen if it fails?

Bruce Momjian wrote:
> Our current WAL implementation writes copies of full pages to WAL before
> modifying the page on disk. This is done to prevent partial pages from being
> corrupted in case the operating system crashes during a page write.

Assume a WAL page is zero at start and later written, say, a 128-byte block.
Then how exactly is writing 128 bytes different than writing the entire 8K
page, especially when we control neither the kernel/buffer cache nor the disk?

What is partial? PostgreSQL will always flush the entire data block to the WAL
page, won't it? If write returns, we can assume it is written.

> For example, suppose an 8k block is being written to a heap file. First the
> backend issues a write(), which copies the page into the kernel buffer
> cache. Later, the kernel sends the write request to the drive. Even if the
> file system uses 8k blocks, the disk is typically made up of 512-byte
> sectors, so the OS translates the 8k block into a contiguous run of disk
> sectors, in this case 16. There is no guarantee that all 16 sectors will be
> written --- perhaps 8 could be written, then the system crashes, or perhaps
> part of a 512-byte sector is written but the remainder left unchanged. In
> all these cases, restarting the system will yield corrupt heap blocks.

We are hoping to prevent WAL page corruption, which is part of file system
corruption. Do we propose to tackle file system corruption in order to
guarantee WAL integrity?

> The WAL writes copies of full pages so that on restore, it can check each
> page to make sure it hasn't been corrupted. The system records an LSN (log
> sequence number) on every page. When a page is modified, its pre-change
> image is written to WAL, but not fsync'ed. Later, if a backend wants to
> write a page, it must make sure the LSN of the page is between the LSN of
> the last checkpoint and the LSN of the last fsync by a committed
> transaction. Only in those cases can the page be written, because we are
> sure that a copy of the page is in the WAL in case there is a partial write.

Do we have a per page checksum? It could be in the control log, not
necessarily in WAL. But just asking since I don't know.

> Now, as you can imagine, these WAL page writes take up a considerable amount
> of space in the WAL, and cause slowness, but no one has come up with a way
> to recover from partial page writes without them. The only way to minimize
> page writes is to increase checkpoint_segments and checkpoint_timeout so
> that checkpoints are less frequent, and pages have to be written fewer times
> to the WAL because old copies of the pages remain in WAL longer.

If I am not mistaken, we rely upon WAL being consistent to ensure transaction
recovery. We write() WAL and fsync/open/close it to make sure it goes on disk
before data pages. What else can we do?

I can not see why writing an 8K block is any more safe than writing just the
changes.

I may be dead wrong but just putting my thoughts together..

Shridhar
Shridhar Daithankar wrote:
> Hi,
>
> I was thinking the other way round. What if we write to WAL pages only those
> portions which we need to modify and let the kernel do the job the way it
> sees fit? What will happen if it fails?

So you are saying only write the part of the page that we modify? I think the
kernel reads in the entire page, makes the modification, then writes it.
However, we still don't know that our 1.5k of changes made it onto the
platters completely.

> > Our current WAL implementation writes copies of full pages to WAL before
> > modifying the page on disk. This is done to prevent partial pages from
> > being corrupted in case the operating system crashes during a page
> > write.
>
> Assume a WAL page is zero at start and later written, say, a 128-byte block.
> Then how exactly is writing 128 bytes different than writing the entire 8K
> page, especially when we control neither the kernel/buffer cache nor the
> disk?
>
> What is partial? PostgreSQL will always flush the entire data block to the
> WAL page, won't it? If write returns, we can assume it is written.

If write returns, it means the data is in the kernel cache, not on the disks.
Fsync is the only thing that forces it to disk, and it is slow.

> > For example, suppose an 8k block is being written to a heap file. First
> > the backend issues a write(), which copies the page into the kernel buffer
> > cache. Later, the kernel sends the write request to the drive. Even if the
> > file system uses 8k blocks, the disk is typically made up of 512-byte
> > sectors, so the OS translates the 8k block into a contiguous run of disk
> > sectors, in this case 16. There is no guarantee that all 16 sectors will
> > be written --- perhaps 8 could be written, then the system crashes, or
> > perhaps part of a 512-byte sector is written but the remainder left
> > unchanged. In all these cases, restarting the system will yield corrupt
> > heap blocks.
>
> We are hoping to prevent WAL page corruption, which is part of file system
> corruption. Do we propose to tackle file system corruption in order to
> guarantee WAL integrity?

We assume the file system will come back with an xlog directory with files in
it because we fsync it.

> > The WAL writes copies of full pages so that on restore, it can check each
> > page to make sure it hasn't been corrupted. The system records an LSN (log
> > sequence number) on every page. When a page is modified, its pre-change
> > image is written to WAL, but not fsync'ed. Later, if a backend wants to
> > write a page, it must make sure the LSN of the page is between the LSN of
> > the last checkpoint and the LSN of the last fsync by a committed
> > transaction. Only in those cases can the page be written, because we are
> > sure that a copy of the page is in the WAL in case there is a partial
> > write.
>
> Do we have a per page checksum? It could be in the control log, not
> necessarily in WAL. But just asking since I don't know.

Yes, in WAL.

> > Now, as you can imagine, these WAL page writes take up a considerable
> > amount of space in the WAL, and cause slowness, but no one has come up
> > with a way to recover from partial page writes without them. The only way
> > to minimize page writes is to increase checkpoint_segments and
> > checkpoint_timeout so that checkpoints are less frequent, and pages have
> > to be written fewer times to the WAL because old copies of the pages
> > remain in WAL longer.
>
> If I am not mistaken, we rely upon WAL being consistent to ensure
> transaction recovery. We write() WAL and fsync/open/close it to make sure it
> goes on disk before data pages. What else can we do?
>
> I can not see why writing an 8K block is any more safe than writing just the
> changes.
>
> I may be dead wrong but just putting my thoughts together..

The problem is that we need to record what was on the page before we made the
modification, because there is no way to know that a write hasn't corrupted
some part of the page.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
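A minimal sketch of why the full-page copy answers the question (the names
here are hypothetical, not the real redo code): during recovery, a WAL record
that carries a full-page image simply overwrites whatever is on disk, so it
makes no difference whether the interrupted write left 0, 8, or 15 of the 16
sectors updated.

/* Sketch of redo using a full-page image; names are hypothetical. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 8192

typedef struct WalRecord {
    uint64_t lsn;                    /* position of this record in WAL */
    int      has_full_page_image;
    char     page_image[PAGE_SIZE];  /* valid when has_full_page_image is set */
    /* ... the tuple-level change data would follow ... */
} WalRecord;

/* Replay one record against the page as found on disk after the crash. */
static void redo_record(const WalRecord *rec, char *disk_page)
{
    if (rec->has_full_page_image) {
        /* We never look at the (possibly torn) on-disk contents at all. */
        memcpy(disk_page, rec->page_image, PAGE_SIZE);
    } else {
        /* Apply the tuple-level change.  This is only safe because some
         * earlier record since the last checkpoint restored the whole page. */
    }
}

int main(void)
{
    static char disk_page[PAGE_SIZE];       /* pretend this is a torn page */
    memset(disk_page, 0xAA, PAGE_SIZE / 2); /* only half of it was written */

    static WalRecord rec = { .lsn = 1234, .has_full_page_image = 1 };
    memset(rec.page_image, 0x55, PAGE_SIZE);

    redo_record(&rec, disk_page);
    printf("first byte after redo: 0x%02x\n", (unsigned char) disk_page[0]);
    return 0;
}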
Bruce Momjian wrote:
> Shridhar Daithankar wrote:
>> I can not see why writing an 8K block is any more safe than writing just
>> the changes.
>>
>> I may be dead wrong but just putting my thoughts together..
>
> The problem is that we need to record what was on the page before we
> made the modification because there is no way to know that a write
> hasn't corrupted some part of the page.

OK... I think there is hardly any way around the fact that we need to flush a
page the way we do it now. But that is slow. So what do we do?

How feasible would it be to push fsyncing those pages/files to the background
writer and have it done on priority? That way the disk I/O wait could get out
of the critical execution path. Maybe that could yield the performance benefit
we are looking for.

Also, just out of curiosity: is it possible that more than one transaction
grabs hold of different pages of WAL and starts putting data into them
simultaneously? In such a case a single fsync could do the job for more than
one backend, but replaying WAL would be akin to defragging a FAT partition..

Just a thought..

Shridhar
Bruce Momjian wrote:
> Our current WAL implementation writes copies of full pages to WAL before
> modifying the page on disk. This is done to prevent partial pages from
> being corrupted in case the operating system crashes during a page
> write.

InnoDB uses a doublebuffer system instead.
http://www.innodb.com/ibman.php#File.space.management

quote:

Starting from 3.23.40b, InnoDB uses a novel file flush technique called
"doublewrite". It adds safety to crash recovery after an operating system
crash or a power outage, and improves performance on most Unix flavors by
reducing the need for fsync() operations.

Doublewrite means that InnoDB before writing pages to a data file first writes
them to a contiguous tablespace area called the doublewrite buffer. Only after
the write and the flush to the doublewrite buffer has completed, InnoDB writes
the pages to their proper positions in the data file. If the operating system
crashes in the middle of a page write, InnoDB will in recovery find a good
copy of the page from the doublewrite buffer.

-- 
Dennis
Dennis Haney wrote:
> Bruce Momjian wrote:
>> Our current WAL implementation writes copies of full pages to WAL before
>> modifying the page on disk. This is done to prevent partial pages from
>> being corrupted in case the operating system crashes during a page
>> write.
>
> InnoDB uses a doublebuffer system instead.
> http://www.innodb.com/ibman.php#File.space.management
>
> quote:
>
> Starting from 3.23.40b, InnoDB uses a novel file flush technique called
> "doublewrite". It adds safety to crash recovery after an operating
> system crash or a power outage, and improves performance on most Unix
> flavors by reducing the need for fsync() operations.
>
> Doublewrite means that InnoDB before writing pages to a data file first
> writes them to a contiguous tablespace area called the doublewrite
> buffer. Only after the write and the flush to the doublewrite buffer has
> completed, InnoDB writes the pages to their proper positions in the data
> file. If the operating system crashes in the middle of a page write,
> InnoDB will in recovery find a good copy of the page from the
> doublewrite buffer.

That is what PostgreSQL calls WAL (Write Ahead Log). The issue here is that
WAL itself could become a bottleneck, since it is hit very frequently under
heavy load. So how do we speed up WAL itself?

Shridhar
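For comparison with the full-page-write approach, here is a minimal sketch of
the doublewrite idea in C. It is not InnoDB's code --- the file names, batch
size, and helper are assumptions made for illustration --- but it shows the
ordering that makes torn writes recoverable: flush the batch to a fixed
scratch area first, then to the real locations.

/* Sketch of a doublewrite-style flush; names and layout are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 8192
#define BATCH     16                 /* pages flushed per doublewrite cycle */

/* Write a batch of dirty pages safely:
 *   1. write + fsync them to the doublewrite area,
 *   2. only then write + fsync them to their final offsets.
 * If step 2 is torn by a crash, recovery copies the page back from step 1;
 * if step 1 is torn, the data file was never touched and is still intact. */
static int doublewrite_flush(int datafd, int dblwfd,
                             char pages[][PAGE_SIZE],
                             const off_t offsets[], int n)
{
    for (int i = 0; i < n; i++)
        if (pwrite(dblwfd, pages[i], PAGE_SIZE, (off_t) i * PAGE_SIZE) != PAGE_SIZE)
            return -1;
    if (fsync(dblwfd) != 0)
        return -1;

    for (int i = 0; i < n; i++)
        if (pwrite(datafd, pages[i], PAGE_SIZE, offsets[i]) != PAGE_SIZE)
            return -1;
    return fsync(datafd);
}

int main(void)
{
    static char pages[BATCH][PAGE_SIZE];
    off_t offsets[BATCH];
    for (int i = 0; i < BATCH; i++)
        offsets[i] = (off_t) i * PAGE_SIZE;

    int datafd = open("data.file", O_RDWR | O_CREAT, 0600);
    int dblwfd = open("doublewrite.area", O_RDWR | O_CREAT, 0600);
    if (datafd < 0 || dblwfd < 0 ||
        doublewrite_flush(datafd, dblwfd, pages, offsets, BATCH) != 0)
        perror("doublewrite_flush");
    return 0;
}

Compared with full-page images in WAL, the extra copy here goes to a fixed,
reused area instead of growing the log; the cost is that every dirty page is
still written twice.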
psql dbname can still connect, but when I go to the pg_database table the db
is not there; as a result I cannot do a pg_dump on it.

I tried forcing an entry into pg_database but it won't allow me to set the
oid?

Dave

-- 
Dave Cramer
519 939 0336
ICQ # 14675561
Shridhar Daithankar <shridhar@frodo.hserus.net> writes:
> We are hoping to prevent WAL page corruption which is part of file
> system corruption. Do we propose to tackle file system corruption in
> order to guarantee WAL integrity?

You really should study the code more before pontificating. We *do* take
measures to reduce the risk of file system corruption breaking WAL.
Specifically, a WAL segment is filled with zeroes and fsync'd before we ever
start to use it as live WAL space. The segment is never extended while in use.
Therefore, given a reasonable filesystem implementation, the metadata for the
segment file is down to disk before we ever use it, and it does not change
while we are using it. It's impractical to do the same for data files, of
course, since they have to be able to grow.

> I can not see why writing an 8K block is any more safe than writing just the
> changes.

It's not more safe, it's just a lot easier to manage. We'd need more than just
one "dirty" flag per buffer. In any case, the kernel would likely force the
write to be a multiple of its internal buffer size anyway. I'm not sure that
kernel buffers are as universally 8K as they once were (doesn't Linux use 4K?)
but trying to manage dirtiness down to the byte level is a waste of time.

			regards, tom lane
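The pre-zeroing Tom describes can be shown with a small sketch (not the actual
xlog.c code; the file name and constants are examples): the segment is created
at full size, filled with zero blocks, and fsync'd before any live WAL goes
into it, so its size and block layout never change afterwards.

/* Sketch of pre-allocating a WAL segment; file name and sizes are examples. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)   /* 16 MB WAL segment */
#define BLCKSZ   8192

int main(void)
{
    char zeroes[BLCKSZ];
    memset(zeroes, 0, sizeof(zeroes));

    int fd = open("000000010000000000000001", O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* Write the whole segment as zeroes so every block is allocated now. */
    for (off_t written = 0; written < SEG_SIZE; written += BLCKSZ)
        if (write(fd, zeroes, BLCKSZ) != BLCKSZ) { perror("write"); return 1; }

    /* Force both the data and the file metadata down to disk before the
     * segment is ever used as live WAL; after this point its size and
     * block layout never change. */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }
    close(fd);
    return 0;
}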
Shridhar Daithankar wrote:
> Bruce Momjian wrote:
>
>> Shridhar Daithankar wrote:
>>> I can not see why writing an 8K block is any more safe than writing just
>>> the changes.
>>>
>>> I may be dead wrong but just putting my thoughts together..
>
>> The problem is that we need to record what was on the page before we
>> made the modification because there is no way to know that a write
>> hasn't corrupted some part of the page.
>
> OK... I think there is hardly any way around the fact that we need to flush
> a page the way we do it now. But that is slow. So what do we do?
>
> How feasible would it be to push fsyncing those pages/files to the
> background writer and have it done on priority? That way the disk I/O wait
> could get out of the critical execution path. Maybe that could yield the
> performance benefit we are looking for.

We already allow committing transactions to flush WAL. We don't do the flush
when we write the page image to WAL, unless we can't get any other buffer and
have to write it ourselves and it hasn't already been fsync'ed by another
transaction. This is where the current LSN comes in --- it tells us how far
fsync has gone, and each page has an LSN that tells us when it was written to
WAL.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
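A minimal sketch of the rule Bruce is describing (hypothetical names, not the
real bufmgr/xlog code): before a dirty data page may be written out, WAL must
already be flushed at least up to that page's LSN, and because committing
transactions flush WAL anyway, the page writer usually finds the work already
done.

/* Sketch of the WAL-before-data rule; names are hypothetical. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t Lsn;

static Lsn wal_flushed_lsn;         /* how far WAL has been fsync'd */

/* Flush WAL up to 'upto' if it has not been flushed that far already. */
static void wal_flush(Lsn upto)
{
    if (upto > wal_flushed_lsn) {
        /* ... write() and fsync() the WAL files here ... */
        wal_flushed_lsn = upto;
    }
}

/* Write out a dirty data page.  The page's LSN says which WAL record last
 * touched it; that record (and the full-page image behind it) must be safely
 * on disk before the data page itself may go out. */
static void flush_data_page(Lsn page_lsn, const void *page)
{
    wal_flush(page_lsn);            /* usually a no-op: a commit already did it */
    /* ... now it is safe to write() the data page ... */
    (void) page;
}

int main(void)
{
    wal_flushed_lsn = 500;          /* a committed transaction flushed this far */
    flush_data_page(300, NULL);     /* no extra WAL fsync needed */
    flush_data_page(800, NULL);     /* forces a WAL flush first */
    printf("WAL flushed through LSN %llu\n", (unsigned long long) wal_flushed_lsn);
    return 0;
}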
You are correct, modern drives are much faster than this, for big, cacheable
writes.

Try compiling and running the following code and watching your disk I/O. Then
comment out the fsync(), which will make the writes cacheable. Notice the huge
difference. It is a multiple of 15 difference on my machine.

#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>     /* write(), fsync(), close(), unlink() */

int main()
{
    int i;
    char buf[8192];
    int ld;

    unlink("dump.out");
    ld = open("dump.out", O_WRONLY | O_CREAT, 0644);  /* O_CREAT needs a mode */

    for (i = 0; i < 65536; i++) {
        write(ld, buf, sizeof(buf));
        fsync(ld);
    }

    close(ld);
    return 0;
}

Manfred Spraul wrote:
> Marty Scholes wrote:
>
>> 2. Put them on an actual (or mirrored actual) spindle
>>    Pros:
>>    * Keeps WAL and data file I/O separate
>>    Cons:
>>    * All of the non-array drives are still slower than the array
>
> Are you sure this is a problem? The dbt-2 benchmarks from osdl run on an
> 8-way Intel computer with several raid arrays distributed to 40 disks.
> IIRC it generates around 1.5 MB wal logs per second - well within the
> capability of a single drive. My laptop can write around 10 MB/sec
> (measured with dd if=/dev/zero of=fill and vmstat), fast drives should
> be above 20 MB/sec.
>
> How much wal data is generated by large postgres setups? Are there any
> setups that are limited by the wal logs?
>
> --
>    Manfred
Dave Cramer <pg@fastcrypt.com> writes:
> psql dbname can still connect but when I go to the pg_database table the
> db is not there as a result I cannot do a pg_dump on it?

Hm, it doesn't make a lot of sense that fresh connections would still succeed
if the pg_database row is deleted, but ...

> I tried forcing an entry into pg_database but it won't allow me to set
> the oid ?

You don't have to; the DB OID doesn't appear anywhere within the database
(except possibly with the database comment, if you have one).

So:

* Determine the old DB OID, by elimination if necessary.

* Create a new database and determine its OID.

* Shut down postmaster.

* Blow away $PGDATA/base/NEWOID, and rename $PGDATA/base/OLDOID to be
  $PGDATA/base/NEWOID.

* Restart postmaster.

* Try to figure out what you did wrong, so you don't do it again...

			regards, tom lane
Tom,

Thanks, first of all it wasn't my mess, but someone else's.

Secondly, this worked, however I was unable to use the same name; some
remnants of the old database must have remained in pg_database.

I couldn't even reindex it with postgres -O -P

Dave

On Tue, 2004-03-16 at 11:11, Tom Lane wrote:
> Dave Cramer <pg@fastcrypt.com> writes:
> > psql dbname can still connect but when I go to the pg_database table the
> > db is not there as a result I cannot do a pg_dump on it?
>
> Hm, it doesn't make a lot of sense that fresh connections would still
> succeed if the pg_database row is deleted, but ...
>
> > I tried forcing an entry into pg_database but it won't allow me to set
> > the oid ?
>
> You don't have to; the DB OID doesn't appear anywhere within the
> database (except possibly with the database comment, if you have one).
>
> So:
>
> * Determine the old DB OID, by elimination if necessary.
>
> * Create a new database and determine its OID.
>
> * Shut down postmaster.
>
> * Blow away $PGDATA/base/NEWOID, and rename $PGDATA/base/OLDOID to
>   be $PGDATA/base/NEWOID.
>
> * Restart postmaster.
>
> * Try to figure out what you did wrong, so you don't do it again...
>
> 			regards, tom lane

-- 
Dave Cramer
519 939 0336
ICQ # 14675561
Dave Cramer <pg@fastcrypt.com> writes:
> Secondly, this worked, however I was unable to use the same name; some
> remnants of the old database must have remained in pg_database.

> I couldn't even reindex it with postgres -O -P

Interesting. I wonder what state the old tuple is really in ...

Could you send me the pg_database table file (off-list)?

			regards, tom lane
Try doing a vacuum full on template1 and restart the database. I've had to do
this before after renaming a database via the system catalogs.

Robert Treat

On Tue, 2004-03-16 at 12:05, Dave Cramer wrote:
> Tom,
>
> Thanks, first of all it wasn't my mess, but someone else's.
>
> Secondly, this worked, however I was unable to use the same name; some
> remnants of the old database must have remained in pg_database.
>
> I couldn't even reindex it with postgres -O -P
>
> Dave
> On Tue, 2004-03-16 at 11:11, Tom Lane wrote:
> > Dave Cramer <pg@fastcrypt.com> writes:
> > > psql dbname can still connect but when I go to the pg_database table the
> > > db is not there as a result I cannot do a pg_dump on it?
> >
> > Hm, it doesn't make a lot of sense that fresh connections would still
> > succeed if the pg_database row is deleted, but ...
> >
> > > I tried forcing an entry into pg_database but it won't allow me to set
> > > the oid ?
> >
> > You don't have to; the DB OID doesn't appear anywhere within the
> > database (except possibly with the database comment, if you have one).
> >
> > So:
> >
> > * Determine the old DB OID, by elimination if necessary.
> >
> > * Create a new database and determine its OID.
> >
> > * Shut down postmaster.
> >
> > * Blow away $PGDATA/base/NEWOID, and rename $PGDATA/base/OLDOID to
> >   be $PGDATA/base/NEWOID.
> >
> > * Restart postmaster.
> >
> > * Try to figure out what you did wrong, so you don't do it again...
> >
> > 			regards, tom lane
>
> -- 
> Dave Cramer
> 519 939 0336
> ICQ # 14675561

-- 
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
On Tue, 16 Mar 2004, Tom Lane wrote:

> Dave Cramer <pg@fastcrypt.com> writes:
> > Secondly, this worked, however I was unable to use the same name; some
> > remnants of the old database must have remained in pg_database.
>
> > I couldn't even reindex it with postgres -O -P
>
> Interesting. I wonder what state the old tuple is really in ...
>
> Could you send me the pg_database table file (off-list)?

Without looking at the actual data, it seems like PhonyHeapTupleSatisfiesNow()
in GetRawDatabaseInfo() might be to blame.
Gavin Sherry <swm@linuxworld.com.au> writes:
> Without looking at the actual data, it seems like
> PhonyHeapTupleSatisfiesNow() in GetRawDatabaseInfo() might be to blame.

That was my theory too, but having looked at the tuple, it's perfectly valid.
However, it appears that its xmin is way in the past, which means that
snapshot-aware scans won't see it. I think what happened is that the DBA of
this database (Dave says it ain't him!) never did any database-wide vacuums,
and thereby allowed the info in pg_database to get old enough to slide out of
the transaction window without being frozen :-(

			regards, tom lane
> Thanks, first of all it wasn't my mess, but someone else's.
>
> Secondly, this worked, however I was unable to use the same name; some
> remnants of the old database must have remained in pg_database.
>
> I couldn't even reindex it with postgres -O -P

Maybe try a full dump and reload now?

Chris