Thread: Streaming replication on win32, still broken

Streaming replication on win32, still broken

From
Magnus Hagander
Date:
With the libpq fixes, I get further (more on that fix later, btw), but
now I get stuck in this. When I do something on the master that
generates WAL, such as insert a record, and then try to query this on
the slave, the walreceiver process crashes with:

PANIC:  XX000: could not write to log file 0, segment 9 at offset 0, length 160:Invalid argument
LOCATION:  XLogWalRcvWrite, .\src\backend\replication\walreceiver.c:487

I'll keep digging at the details, but if somebody has a good idea here.. ;)

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: Streaming replication on win32, still broken

From
Fujii Masao
Date:
On Tue, Feb 16, 2010 at 12:37 AM, Magnus Hagander <magnus@hagander.net> wrote:
> With the libpq fixes, I get further (more on that fix later, btw), but
> now I get stuck in this. When I do something on the master that
> generates WAL, such as insert a record, and then try to query this on
> the slave, the walreceiver process crashes with:
>
> PANIC:  XX000: could not write to log file 0, segment 9 at offset 0, length 160:
>  Invalid argument
> LOCATION:  XLogWalRcvWrite, .\src\backend\replication\walreceiver.c:487
>
> I'll keep digging at the details, but if somebody has a good idea here.. ;)

Yeah, this problem was reproduced in my (very slow :-( ) MinGW environment, too.
Though I've not idenfied the cause yet, I guess that it derives from wrong use
of the type of local variables in XLogWalRcvWrite(). I'll continue investigation
of it.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Streaming replication on win32, still broken

From
Magnus Hagander
Date:
2010/2/16 Fujii Masao <masao.fujii@gmail.com>:
> On Tue, Feb 16, 2010 at 12:37 AM, Magnus Hagander <magnus@hagander.net> wrote:
>> With the libpq fixes, I get further (more on that fix later, btw), but
>> now I get stuck in this. When I do something on the master that
>> generates WAL, such as insert a record, and then try to query this on
>> the slave, the walreceiver process crashes with:
>>
>> PANIC:  XX000: could not write to log file 0, segment 9 at offset 0, length 160:
>>  Invalid argument
>> LOCATION:  XLogWalRcvWrite, .\src\backend\replication\walreceiver.c:487
>>
>> I'll keep digging at the details, but if somebody has a good idea here.. ;)
>
> Yeah, this problem was reproduced in my (very slow :-( ) MinGW environment, too.
> Though I've not idenfied the cause yet, I guess that it derives from wrong use
> of the type of local variables in XLogWalRcvWrite(). I'll continue investigation
> of it.

Thanks!

I will be somewhat spottily available over the next two days due to
on-site work with clients.

Let me know if you would be helped by some details of how to get a
(somewhat faster) EC2 image up and running with MSVC to test on :-)

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: Streaming replication on win32, still broken

From
Fujii Masao
Date:
On Tue, Feb 16, 2010 at 7:20 PM, Magnus Hagander <magnus@hagander.net> wrote:
> 2010/2/16 Fujii Masao <masao.fujii@gmail.com>:
>> On Tue, Feb 16, 2010 at 12:37 AM, Magnus Hagander <magnus@hagander.net> wrote:
>>> With the libpq fixes, I get further (more on that fix later, btw), but
>>> now I get stuck in this. When I do something on the master that
>>> generates WAL, such as insert a record, and then try to query this on
>>> the slave, the walreceiver process crashes with:
>>>
>>> PANIC:  XX000: could not write to log file 0, segment 9 at offset 0, length 160:
>>>  Invalid argument
>>> LOCATION:  XLogWalRcvWrite, .\src\backend\replication\walreceiver.c:487
>>>
>>> I'll keep digging at the details, but if somebody has a good idea here.. ;)
>>
>> Yeah, this problem was reproduced in my (very slow :-( ) MinGW environment, too.
>> Though I've not idenfied the cause yet, I guess that it derives from wrong use
>> of the type of local variables in XLogWalRcvWrite(). I'll continue investigation
>> of it.
>
> Thanks!
>
> I will be somewhat spottily available over the next two days due to
> on-site work with clients.
>
> Let me know if you would be helped by some details of how to get a
> (somewhat faster) EC2 image up and running with MSVC to test on :-)

Thanks! I can probably use the EC2 image by reading your great blog post.
http://blog.hagander.net/archives/151-Testing-PostgreSQL-patches-on-Windows-using-Amazon-EC2.html

But it might take some time to make my sysadmin open the port for
rdesktop for some reasons...

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Streaming replication on win32, still broken

From
Magnus Hagander
Date:
2010/2/16 Fujii Masao <masao.fujii@gmail.com>:
> On Tue, Feb 16, 2010 at 7:20 PM, Magnus Hagander <magnus@hagander.net> wrote:
>> 2010/2/16 Fujii Masao <masao.fujii@gmail.com>:
>>> On Tue, Feb 16, 2010 at 12:37 AM, Magnus Hagander <magnus@hagander.net> wrote:
>>>> With the libpq fixes, I get further (more on that fix later, btw), but
>>>> now I get stuck in this. When I do something on the master that
>>>> generates WAL, such as insert a record, and then try to query this on
>>>> the slave, the walreceiver process crashes with:
>>>>
>>>> PANIC:  XX000: could not write to log file 0, segment 9 at offset 0, length 160:
>>>>  Invalid argument
>>>> LOCATION:  XLogWalRcvWrite, .\src\backend\replication\walreceiver.c:487
>>>>
>>>> I'll keep digging at the details, but if somebody has a good idea here.. ;)
>>>
>>> Yeah, this problem was reproduced in my (very slow :-( ) MinGW environment, too.
>>> Though I've not idenfied the cause yet, I guess that it derives from wrong use
>>> of the type of local variables in XLogWalRcvWrite(). I'll continue investigation
>>> of it.
>>
>> Thanks!
>>
>> I will be somewhat spottily available over the next two days due to
>> on-site work with clients.
>>
>> Let me know if you would be helped by some details of how to get a
>> (somewhat faster) EC2 image up and running with MSVC to test on :-)
>
> Thanks! I can probably use the EC2 image by reading your great blog post.
> http://blog.hagander.net/archives/151-Testing-PostgreSQL-patches-on-Windows-using-Amazon-EC2.html

Actually, that one deosn't work anymore, because I managed to break
the image :-)

If you send me your amazon id, I can get you premissions on my private
image. I plan to clean it up and make it public, just haven't gotten
around to it yet...

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: Streaming replication on win32, still broken

From
Fujii Masao
Date:
On Wed, Feb 17, 2010 at 6:28 AM, Magnus Hagander <magnus@hagander.net> wrote:
> If you send me your amazon id, I can get you premissions on my private
> image. I plan to clean it up and make it public, just haven't gotten
> around to it yet...

Thanks for your concern! I'll send the ID when I complete the preparation.

And, fortunately?, when I set wal_sync_method to open_sync, the problem was
reproduced in the linux, too. The cause is that the data that is written by
walreceiver is not aligned, even if O_DIRECT is used. On win32, O_DIRECT is
used by default. So the problem always happened on win32.

I propose two solution ideas:

1. O_DIRECT is somewhat harmful in the standby since the data written by  walreceiver is read by the startup process
immediately.So, how about  not making only walreceiver use O_DIRECT?
 

2. Straightforwardly observe the alignment rule. Since the received WAL  data might start at the middle of WAL block,
walreceiverneeds to keep  the last half-written WAL block for alignment. OTOH since the received  data might end at the
middleof WAL block, walreceiver needs zero-padding.  As a result, walreceiver writes the set of the last WAL block,
received data and zero-padding.
 

Which is better? Or do you have another better idea?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Streaming replication on win32, still broken

From
Magnus Hagander
Date:
On Wed, Feb 17, 2010 at 06:55, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Feb 17, 2010 at 6:28 AM, Magnus Hagander <magnus@hagander.net> wrote:
>> If you send me your amazon id, I can get you premissions on my private
>> image. I plan to clean it up and make it public, just haven't gotten
>> around to it yet...
>
> Thanks for your concern! I'll send the ID when I complete the preparation.

ok.


> And, fortunately?, when I set wal_sync_method to open_sync, the problem was
> reproduced in the linux, too. The cause is that the data that is written by

Ah, that's good. It always helps if it's a cross-platform issue -
particularly in that it's not one of the funky win32 specific things
we did :)


> walreceiver is not aligned, even if O_DIRECT is used. On win32, O_DIRECT is
> used by default. So the problem always happened on win32.

Ahh. I see.


> I propose two solution ideas:
>
> 1. O_DIRECT is somewhat harmful in the standby since the data written by
>   walreceiver is read by the startup process immediately. So, how about
>   not making only walreceiver use O_DIRECT?

In that case, O_DIRECT would be counterproductive, no? It maps to
FILE_FLAG_NOI_BUFFERING, which makes sure it doesn't go into the
cache. So the read in the startup proc is actually guaranteed to
reuqire a physical read - of something we just wrote, so it'll almost
certainly end up waiting for a rotation, no?

Seems like getting rid of O_DIRECT here is the right thing to do,
regardless of this.


> 2. Straightforwardly observe the alignment rule. Since the received WAL
>   data might start at the middle of WAL block, walreceiver needs to keep
>   the last half-written WAL block for alignment. OTOH since the received
>   data might end at the middle of WAL block, walreceiver needs zero-padding.
>   As a result, walreceiver writes the set of the last WAL block, received
>   data and zero-padding.

May there be other reasons to d this as well?


-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: Streaming replication on win32, still broken

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Wed, Feb 17, 2010 at 06:55, Fujii Masao <masao.fujii@gmail.com> wrote:
>> 2. Straightforwardly observe the alignment rule. Since the received WAL
>> � data might start at the middle of WAL block, walreceiver needs to keep
>> � the last half-written WAL block for alignment. OTOH since the received
>> � data might end at the middle of WAL block, walreceiver needs zero-padding.
>> � As a result, walreceiver writes the set of the last WAL block, received
>> � data and zero-padding.

> May there be other reasons to d this as well?

Writing misaligned data is certain to be expensive even when it works...
        regards, tom lane


Re: Streaming replication on win32, still broken

From
Fujii Masao
Date:
On Wed, Feb 17, 2010 at 3:03 PM, Magnus Hagander <magnus@hagander.net> wrote:
> In that case, O_DIRECT would be counterproductive, no? It maps to
> FILE_FLAG_NOI_BUFFERING, which makes sure it doesn't go into the
> cache. So the read in the startup proc is actually guaranteed to
> reuqire a physical read - of something we just wrote, so it'll almost
> certainly end up waiting for a rotation, no?
>
> Seems like getting rid of O_DIRECT here is the right thing to do,
> regardless of this.

Agreed. I'll remove O_DIRECT from walreceiver.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Streaming replication on win32, still broken

From
Fujii Masao
Date:
On Wed, Feb 17, 2010 at 3:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Wed, Feb 17, 2010 at 06:55, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> 2. Straightforwardly observe the alignment rule. Since the received WAL
>>>   data might start at the middle of WAL block, walreceiver needs to keep
>>>   the last half-written WAL block for alignment. OTOH since the received
>>>   data might end at the middle of WAL block, walreceiver needs zero-padding.
>>>   As a result, walreceiver writes the set of the last WAL block, received
>>>   data and zero-padding.
>
>> May there be other reasons to d this as well?
>
> Writing misaligned data is certain to be expensive even when it works...

Yeah, right. After I remove O_DIRECT, I'll change walreceiver so as to
do an alignment correctly, and then I'll test the performance.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Streaming replication on win32, still broken

From
Fujii Masao
Date:
On Wed, Feb 17, 2010 at 4:07 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Feb 17, 2010 at 3:03 PM, Magnus Hagander <magnus@hagander.net> wrote:
>> In that case, O_DIRECT would be counterproductive, no? It maps to
>> FILE_FLAG_NOI_BUFFERING, which makes sure it doesn't go into the
>> cache. So the read in the startup proc is actually guaranteed to
>> reuqire a physical read - of something we just wrote, so it'll almost
>> certainly end up waiting for a rotation, no?
>>
>> Seems like getting rid of O_DIRECT here is the right thing to do,
>> regardless of this.
>
> Agreed. I'll remove O_DIRECT from walreceiver.

Here is the patch to do that.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: Streaming replication on win32, still broken

From
Fujii Masao
Date:
On Wed, Feb 17, 2010 at 6:00 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Feb 17, 2010 at 4:07 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Wed, Feb 17, 2010 at 3:03 PM, Magnus Hagander <magnus@hagander.net> wrote:
>>> In that case, O_DIRECT would be counterproductive, no? It maps to
>>> FILE_FLAG_NOI_BUFFERING, which makes sure it doesn't go into the
>>> cache. So the read in the startup proc is actually guaranteed to
>>> reuqire a physical read - of something we just wrote, so it'll almost
>>> certainly end up waiting for a rotation, no?
>>>
>>> Seems like getting rid of O_DIRECT here is the right thing to do,
>>> regardless of this.
>>
>> Agreed. I'll remove O_DIRECT from walreceiver.
>
> Here is the patch to do that.

Ooops! I found the bug in the patch. Here is the updated version.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: Streaming replication on win32, still broken

From
Heikki Linnakangas
Date:
Fujii Masao wrote:
> On Wed, Feb 17, 2010 at 6:00 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Wed, Feb 17, 2010 at 4:07 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> On Wed, Feb 17, 2010 at 3:03 PM, Magnus Hagander <magnus@hagander.net> wrote:
>>>> In that case, O_DIRECT would be counterproductive, no? It maps to
>>>> FILE_FLAG_NOI_BUFFERING, which makes sure it doesn't go into the
>>>> cache. So the read in the startup proc is actually guaranteed to
>>>> reuqire a physical read - of something we just wrote, so it'll almost
>>>> certainly end up waiting for a rotation, no?
>>>>
>>>> Seems like getting rid of O_DIRECT here is the right thing to do,
>>>> regardless of this.
>>> Agreed. I'll remove O_DIRECT from walreceiver.
>> Here is the patch to do that.
> 
> Ooops! I found the bug in the patch. Here is the updated version.

If I'm reading the patch correctly, when wal_sync_method is 'open_sync',
walreceiver nevertheless opens the WAL file without the O_DIRECT flag.
When it later flushes it in XLogWalRcvFlush() by issue_xlog_fsync(),
issue_xlog_fsync() will do nothing because it assumes the write() synced
it already. So the data written isn't being forced to disk at all.

How about just forcing sync_method to 'fsync' in walreceiver?

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Streaming replication on win32, still broken

From
Fujii Masao
Date:
On Thu, Feb 18, 2010 at 5:28 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> If I'm reading the patch correctly, when wal_sync_method is 'open_sync',
> walreceiver nevertheless opens the WAL file without the O_DIRECT flag.
> When it later flushes it in XLogWalRcvFlush() by issue_xlog_fsync(),
> issue_xlog_fsync() will do nothing because it assumes the write() synced
> it already. So the data written isn't being forced to disk at all.

When 'open_sync' is chosen, the WAL file is opened with O_SYNC or O_FSYNC
flag. So I think that write() flushes the data to disk even if O_DIRECT
flag is not given. Am I missing something?

> How about just forcing sync_method to 'fsync' in walreceiver?

In win32, O_DSYNC seems to be preferred to 'fsync' so far. So I'm not sure
if reshuffling of priority is harmless.
http://archives.postgresql.org/pgsql-hackers-win32/2005-03/msg00148.php

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Streaming replication on win32, still broken

From
Heikki Linnakangas
Date:
Fujii Masao wrote:
> On Thu, Feb 18, 2010 at 5:28 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> If I'm reading the patch correctly, when wal_sync_method is 'open_sync',
>> walreceiver nevertheless opens the WAL file without the O_DIRECT flag.
>> When it later flushes it in XLogWalRcvFlush() by issue_xlog_fsync(),
>> issue_xlog_fsync() will do nothing because it assumes the write() synced
>> it already. So the data written isn't being forced to disk at all.
> 
> When 'open_sync' is chosen, the WAL file is opened with O_SYNC or O_FSYNC
> flag. So I think that write() flushes the data to disk even if O_DIRECT
> flag is not given. Am I missing something?

Ah, ok, you're right.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Streaming replication on win32, still broken

From
Magnus Hagander
Date:
2010/2/18 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
> Fujii Masao wrote:
>> On Thu, Feb 18, 2010 at 5:28 AM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>>> If I'm reading the patch correctly, when wal_sync_method is 'open_sync',
>>> walreceiver nevertheless opens the WAL file without the O_DIRECT flag.
>>> When it later flushes it in XLogWalRcvFlush() by issue_xlog_fsync(),
>>> issue_xlog_fsync() will do nothing because it assumes the write() synced
>>> it already. So the data written isn't being forced to disk at all.
>>
>> When 'open_sync' is chosen, the WAL file is opened with O_SYNC or O_FSYNC
>> flag. So I think that write() flushes the data to disk even if O_DIRECT
>> flag is not given. Am I missing something?
>
> Ah, ok, you're right.

Yes, I believe the difference is that with O_DIRECT it bypasses the
cache completely. Without it, we still sync it out, but it also goes
into the cache.

O_DIRECT helps us when we're not going to read the file again, because
we don't waste cache on it. If we are, which is the case here, it
should be really bad for performance, since we actually have to do a
physical read.

Incidentally, that should also apply to general WAL when archive_mdoe
is on. Do we optimize for that?


-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: Streaming replication on win32, still broken

From
Heikki Linnakangas
Date:
Magnus Hagander wrote:
> O_DIRECT helps us when we're not going to read the file again, because
> we don't waste cache on it. If we are, which is the case here, it
> should be really bad for performance, since we actually have to do a
> physical read.
> 
> Incidentally, that should also apply to general WAL when archive_mdoe
> is on. Do we optimize for that?

Hmm, no we don't. We do take that into account so that we refrain from
issuing posix_fadvice(DONTNEED) if archive_mode is on, but we don't
disable O_DIRECT. Maybe we should..

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Streaming replication on win32, still broken

From
Fujii Masao
Date:
On Thu, Feb 18, 2010 at 7:04 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Magnus Hagander wrote:
>> O_DIRECT helps us when we're not going to read the file again, because
>> we don't waste cache on it. If we are, which is the case here, it
>> should be really bad for performance, since we actually have to do a
>> physical read.
>>
>> Incidentally, that should also apply to general WAL when archive_mdoe
>> is on. Do we optimize for that?
>
> Hmm, no we don't. We do take that into account so that we refrain from
> issuing posix_fadvice(DONTNEED) if archive_mode is on, but we don't
> disable O_DIRECT. Maybe we should..

Since the performance of WAL write is more important than that of WAL
archiving in general, that optimization might offer little benefit.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Streaming replication on win32, still broken

From
Magnus Hagander
Date:
2010/2/18 Fujii Masao <masao.fujii@gmail.com>:
> On Thu, Feb 18, 2010 at 7:04 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> Magnus Hagander wrote:
>>> O_DIRECT helps us when we're not going to read the file again, because
>>> we don't waste cache on it. If we are, which is the case here, it
>>> should be really bad for performance, since we actually have to do a
>>> physical read.
>>>
>>> Incidentally, that should also apply to general WAL when archive_mdoe
>>> is on. Do we optimize for that?
>>
>> Hmm, no we don't. We do take that into account so that we refrain from
>> issuing posix_fadvice(DONTNEED) if archive_mode is on, but we don't
>> disable O_DIRECT. Maybe we should..
>
> Since the performance of WAL write is more important than that of WAL
> archiving in general, that optimization might offer little benefit.

Well, it's going to make the process that reads the WAL cause actual
physical I/O... That'll take a chunk out of your total available I/O,
which is likely to push you to the limit of your I/O capacity much
quicker.

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: Streaming replication on win32, still broken

From
Heikki Linnakangas
Date:
Magnus Hagander wrote:
> 2010/2/18 Fujii Masao <masao.fujii@gmail.com>:
>> On Thu, Feb 18, 2010 at 7:04 PM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>>> Magnus Hagander wrote:
>>>> O_DIRECT helps us when we're not going to read the file again, because
>>>> we don't waste cache on it. If we are, which is the case here, it
>>>> should be really bad for performance, since we actually have to do a
>>>> physical read.
>>>>
>>>> Incidentally, that should also apply to general WAL when archive_mdoe
>>>> is on. Do we optimize for that?
>>> Hmm, no we don't. We do take that into account so that we refrain from
>>> issuing posix_fadvice(DONTNEED) if archive_mode is on, but we don't
>>> disable O_DIRECT. Maybe we should..
>> Since the performance of WAL write is more important than that of WAL
>> archiving in general, that optimization might offer little benefit.
>
> Well, it's going to make the process that reads the WAL cause actual
> physical I/O... That'll take a chunk out of your total available I/O,
> which is likely to push you to the limit of your I/O capacity much
> quicker.

Right, doesn't seem sensible, though it would be nice to see a benchmark
on that.

Here's a patch to disable O_DIRECT when archiving or streaming is
enabled. This is pretty hard to test, so any extra eyeballs would be nice..

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
? GNUmakefile
? a.patch
? b
? backend
? config.log
? config.status
? config.status.lineno
? configure.lineno
? gin-splay-1.patch
? gin-splay-2.patch
? gin-splay-3.patch
? include
? libpq
? log_unlogged_op_0115.patch
? md-1.c
? md-1.patch
? replication
? temp-file-resowner-2.patch
? contrib/pgbench/.deps
? contrib/pgbench/fsynctest
? contrib/pgbench/fsynctest.c
? contrib/pgbench/fsynctestfile
? contrib/pgbench/pgbench
? contrib/spi/.deps
? doc/src/sgml/HTML.index
? doc/src/sgml/bookindex.sgml
? doc/src/sgml/features-supported.sgml
? doc/src/sgml/features-unsupported.sgml
? doc/src/sgml/version.sgml
? src/Makefile.global
? src/backend/aaa.patch
? src/backend/postgres
? src/backend/access/common/.deps
? src/backend/access/gin/.deps
? src/backend/access/gist/.deps
? src/backend/access/hash/.deps
? src/backend/access/heap/.deps
? src/backend/access/index/.deps
? src/backend/access/nbtree/.deps
? src/backend/access/transam/.deps
? src/backend/bootstrap/.deps
? src/backend/catalog/.deps
? src/backend/commands/.deps
? src/backend/executor/.deps
? src/backend/foreign/.deps
? src/backend/foreign/dummy/.deps
? src/backend/foreign/postgresql/.deps
? src/backend/lib/.deps
? src/backend/libpq/.deps
? src/backend/main/.deps
? src/backend/nodes/.deps
? src/backend/optimizer/geqo/.deps
? src/backend/optimizer/path/.deps
? src/backend/optimizer/plan/.deps
? src/backend/optimizer/prep/.deps
? src/backend/optimizer/util/.deps
? src/backend/parser/.deps
? src/backend/po/af.mo
? src/backend/po/cs.mo
? src/backend/po/hr.mo
? src/backend/po/hu.mo
? src/backend/po/it.mo
? src/backend/po/ko.mo
? src/backend/po/nb.mo
? src/backend/po/nl.mo
? src/backend/po/pl.mo
? src/backend/po/ro.mo
? src/backend/po/ru.mo
? src/backend/po/sk.mo
? src/backend/po/sl.mo
? src/backend/po/sv.mo
? src/backend/po/zh_CN.mo
? src/backend/po/zh_TW.mo
? src/backend/port/.deps
? src/backend/postmaster/.deps
? src/backend/regex/.deps
? src/backend/replication/.deps
? src/backend/replication/libpqwalreceiver/.deps
? src/backend/rewrite/.deps
? src/backend/snowball/.deps
? src/backend/snowball/snowball_create.sql
? src/backend/storage/buffer/.deps
? src/backend/storage/file/.deps
? src/backend/storage/freespace/.deps
? src/backend/storage/ipc/.deps
? src/backend/storage/large_object/.deps
? src/backend/storage/lmgr/.deps
? src/backend/storage/page/.deps
? src/backend/storage/smgr/.deps
? src/backend/tcop/.deps
? src/backend/tsearch/.deps
? src/backend/utils/.deps
? src/backend/utils/probes.h
? src/backend/utils/adt/.deps
? src/backend/utils/cache/.deps
? src/backend/utils/error/.deps
? src/backend/utils/fmgr/.deps
? src/backend/utils/hash/.deps
? src/backend/utils/init/.deps
? src/backend/utils/mb/.deps
? src/backend/utils/mb/Unicode/BIG5.TXT
? src/backend/utils/mb/Unicode/CP950.TXT
? src/backend/utils/mb/conversion_procs/conversion_create.sql
? src/backend/utils/mb/conversion_procs/ascii_and_mic/.deps
? src/backend/utils/mb/conversion_procs/cyrillic_and_mic/.deps
? src/backend/utils/mb/conversion_procs/euc2004_sjis2004/.deps
? src/backend/utils/mb/conversion_procs/euc_cn_and_mic/.deps
? src/backend/utils/mb/conversion_procs/euc_jis_2004_and_shift_jis_2004/.deps
? src/backend/utils/mb/conversion_procs/euc_jp_and_sjis/.deps
? src/backend/utils/mb/conversion_procs/euc_kr_and_mic/.deps
? src/backend/utils/mb/conversion_procs/euc_tw_and_big5/.deps
? src/backend/utils/mb/conversion_procs/latin2_and_win1250/.deps
? src/backend/utils/mb/conversion_procs/latin_and_mic/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_ascii/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_big5/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_cyrillic/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_euc2004/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_euc_cn/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_euc_jis_2004/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_euc_jp/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_euc_kr/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_euc_tw/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_gb18030/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_gbk/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_iso8859/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_iso8859_1/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_johab/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_shift_jis_2004/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_sjis/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_sjis2004/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_uhc/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_win/.deps
? src/backend/utils/misc/.deps
? src/backend/utils/mmgr/.deps
? src/backend/utils/resowner/.deps
? src/backend/utils/sort/.deps
? src/backend/utils/time/.deps
? src/bin/initdb/.deps
? src/bin/initdb/initdb
? src/bin/initdb/po/ko.mo
? src/bin/initdb/po/pl.mo
? src/bin/initdb/po/ro.mo
? src/bin/initdb/po/sk.mo
? src/bin/initdb/po/sl.mo
? src/bin/initdb/po/zh_CN.mo
? src/bin/initdb/po/zh_TW.mo
? src/bin/pg_config/.deps
? src/bin/pg_config/pg_config
? src/bin/pg_config/po/cs.mo
? src/bin/pg_config/po/pl.mo
? src/bin/pg_config/po/sl.mo
? src/bin/pg_config/po/zh_CN.mo
? src/bin/pg_config/po/zh_TW.mo
? src/bin/pg_controldata/.deps
? src/bin/pg_controldata/pg_controldata
? src/bin/pg_controldata/po/cs.mo
? src/bin/pg_controldata/po/fa.mo
? src/bin/pg_controldata/po/hu.mo
? src/bin/pg_controldata/po/nb.mo
? src/bin/pg_controldata/po/pl.mo
? src/bin/pg_controldata/po/ro.mo
? src/bin/pg_controldata/po/ru.mo
? src/bin/pg_controldata/po/sk.mo
? src/bin/pg_controldata/po/sl.mo
? src/bin/pg_controldata/po/zh_CN.mo
? src/bin/pg_controldata/po/zh_TW.mo
? src/bin/pg_ctl/.deps
? src/bin/pg_ctl/pg_ctl
? src/bin/pg_ctl/po/cs.mo
? src/bin/pg_ctl/po/ro.mo
? src/bin/pg_ctl/po/sk.mo
? src/bin/pg_ctl/po/sl.mo
? src/bin/pg_ctl/po/zh_CN.mo
? src/bin/pg_ctl/po/zh_TW.mo
? src/bin/pg_dump/.deps
? src/bin/pg_dump/pg_dump
? src/bin/pg_dump/pg_dumpall
? src/bin/pg_dump/pg_restore
? src/bin/pg_dump/po/cs.mo
? src/bin/pg_dump/po/ko.mo
? src/bin/pg_dump/po/nb.mo
? src/bin/pg_dump/po/ro.mo
? src/bin/pg_dump/po/ru.mo
? src/bin/pg_dump/po/sk.mo
? src/bin/pg_dump/po/sl.mo
? src/bin/pg_dump/po/zh_CN.mo
? src/bin/pg_dump/po/zh_TW.mo
? src/bin/pg_resetxlog/.deps
? src/bin/pg_resetxlog/pg_resetxlog
? src/bin/pg_resetxlog/po/cs.mo
? src/bin/pg_resetxlog/po/hu.mo
? src/bin/pg_resetxlog/po/nb.mo
? src/bin/pg_resetxlog/po/sk.mo
? src/bin/pg_resetxlog/po/sl.mo
? src/bin/pg_resetxlog/po/zh_CN.mo
? src/bin/pg_resetxlog/po/zh_TW.mo
? src/bin/psql/.deps
? src/bin/psql/psql
? src/bin/psql/po/fa.mo
? src/bin/psql/po/hu.mo
? src/bin/psql/po/it.mo
? src/bin/psql/po/ko.mo
? src/bin/psql/po/nb.mo
? src/bin/psql/po/ro.mo
? src/bin/psql/po/ru.mo
? src/bin/psql/po/sk.mo
? src/bin/psql/po/sl.mo
? src/bin/psql/po/zh_CN.mo
? src/bin/psql/po/zh_TW.mo
? src/bin/scripts/.deps
? src/bin/scripts/clusterdb
? src/bin/scripts/createdb
? src/bin/scripts/createlang
? src/bin/scripts/createuser
? src/bin/scripts/dropdb
? src/bin/scripts/droplang
? src/bin/scripts/dropuser
? src/bin/scripts/foo
? src/bin/scripts/reindexdb
? src/bin/scripts/vacuumdb
? src/bin/scripts/po/ru.mo
? src/bin/scripts/po/sk.mo
? src/bin/scripts/po/sl.mo
? src/bin/scripts/po/zh_CN.mo
? src/bin/scripts/po/zh_TW.mo
? src/include/pg_config.h
? src/include/stamp-h
? src/interfaces/ecpg/compatlib/.deps
? src/interfaces/ecpg/compatlib/exports.list
? src/interfaces/ecpg/compatlib/libecpg_compat.so.3.1
? src/interfaces/ecpg/compatlib/libecpg_compat.so.3.2
? src/interfaces/ecpg/ecpglib/.deps
? src/interfaces/ecpg/ecpglib/exports.list
? src/interfaces/ecpg/ecpglib/libecpg.so.6.1
? src/interfaces/ecpg/ecpglib/libecpg.so.6.2
? src/interfaces/ecpg/include/ecpg_config.h
? src/interfaces/ecpg/include/stamp-h
? src/interfaces/ecpg/pgtypeslib/.deps
? src/interfaces/ecpg/pgtypeslib/exports.list
? src/interfaces/ecpg/pgtypeslib/libpgtypes.so.3.1
? src/interfaces/ecpg/pgtypeslib/libpgtypes.so.3.2
? src/interfaces/ecpg/preproc/.deps
? src/interfaces/ecpg/preproc/ecpg
? src/interfaces/libpq/.deps
? src/interfaces/libpq/exports.list
? src/interfaces/libpq/libpq.so.5.3
? src/interfaces/libpq/po/af.mo
? src/interfaces/libpq/po/hr.mo
? src/interfaces/libpq/po/nb.mo
? src/interfaces/libpq/po/pl.mo
? src/interfaces/libpq/po/sk.mo
? src/interfaces/libpq/po/sl.mo
? src/interfaces/libpq/po/zh_CN.mo
? src/interfaces/libpq/po/zh_TW.mo
? src/pl/plpgsql/src/.deps
? src/pl/plpgsql/src/pl_scan.c
? src/pl/plpgsql/src/po/tr.mo
? src/port/.deps
? src/port/pg_config_paths.h
? src/test/regress/.deps
? src/test/regress/log
? src/test/regress/pg_regress
? src/test/regress/results
? src/test/regress/testtablespace
? src/test/regress/expected/constraints.out
? src/test/regress/expected/copy.out
? src/test/regress/expected/create_function_1.out
? src/test/regress/expected/create_function_2.out
? src/test/regress/expected/largeobject.out
? src/test/regress/expected/largeobject_1.out
? src/test/regress/expected/misc.out
? src/test/regress/expected/tablespace.out
? src/test/regress/sql/constraints.sql
? src/test/regress/sql/copy.sql
? src/test/regress/sql/create_function_1.sql
? src/test/regress/sql/create_function_2.sql
? src/test/regress/sql/largeobject.sql
? src/test/regress/sql/misc.sql
? src/test/regress/sql/tablespace.sql
? src/timezone/.deps
? src/timezone/zic
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.375
diff -u -r1.375 xlog.c
--- src/backend/access/transam/xlog.c    17 Feb 2010 04:19:39 -0000    1.375
+++ src/backend/access/transam/xlog.c    18 Feb 2010 12:40:05 -0000
@@ -2686,13 +2686,10 @@
      * WAL segment files will not be re-read in normal operation, so we advise
      * the OS to release any cached pages.    But do not do so if WAL archiving
      * or streaming is active, because archiver and walsender process could use
-     * the cache to read the WAL segment.  Also, don't bother with it if we
-     * are using O_DIRECT, since the kernel is presumably not caching in that
-     * case.
+     * the cache to read the WAL segment.
      */
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-    if (!XLogIsNeeded() &&
-        (get_sync_bit(sync_method) & PG_O_DIRECT) == 0)
+    if (!XLogIsNeeded())
         (void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
 #endif

@@ -7652,10 +7649,29 @@
 static int
 get_sync_bit(int method)
 {
+    int o_direct_flag = 0;
+
     /* If fsync is disabled, never open in sync mode */
     if (!enableFsync)
         return 0;

+    /*
+     * Optimize writes by bypassing kernel cache with O_DIRECT when using
+     * O_SYNC, O_DSYNC or O_FSYNC. But only if archiving and streaming are
+     * disabled, otherwise the archive command or walsender process will
+     * read the WAL soon after writing it, which is guaranteed to cause a
+     * physical read if we bypassed the kernel cache. We also skip the
+     * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the
+     * same reason.
+     *
+     * Never use O_DIRECT in walsender process for similar reasons; the WAL
+     * written by walreceiver is normally read by the startup process soon
+     * after its written. Also, walreceiver performs unaligned writes, which
+     * don't work with O_DIRECT, so it is required for correctness too.
+     */
+    if (!XLogIsNeeded() && !am_walreceiver)
+        o_direct_flag = PG_O_DIRECT;
+
     switch (method)
     {
             /*
@@ -7670,11 +7686,11 @@
             return 0;
 #ifdef OPEN_SYNC_FLAG
         case SYNC_METHOD_OPEN:
-            return OPEN_SYNC_FLAG;
+            return OPEN_SYNC_FLAG | o_direct_flag;
 #endif
 #ifdef OPEN_DATASYNC_FLAG
         case SYNC_METHOD_OPEN_DSYNC:
-            return OPEN_DATASYNC_FLAG;
+            return OPEN_DATASYNC_FLAG | o_direct_flag;
 #endif
         default:
             /* can't happen (unless we are out of sync with option array) */
Index: src/backend/replication/walreceiver.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/replication/walreceiver.c,v
retrieving revision 1.4
diff -u -r1.4 walreceiver.c
--- src/backend/replication/walreceiver.c    17 Feb 2010 04:19:39 -0000    1.4
+++ src/backend/replication/walreceiver.c    18 Feb 2010 12:40:05 -0000
@@ -50,6 +50,9 @@
 #include "utils/ps_status.h"
 #include "utils/resowner.h"

+/* Global variable to indicate if this process is a walreceiver process */
+bool am_walreceiver;
+
 /* libpqreceiver hooks to these when loaded */
 walrcv_connect_type walrcv_connect = NULL;
 walrcv_receive_type walrcv_receive = NULL;
@@ -158,6 +161,8 @@
     /* use volatile pointer to prevent code rearrangement */
     volatile WalRcvData *walrcv = WalRcv;

+    am_walreceiver = true;
+
     /*
      * WalRcv should be set up already (if we are a backend, we inherit
      * this by fork() or EXEC_BACKEND mechanism from the postmaster).
@@ -424,16 +429,18 @@
             bool    use_existent;

             /*
-             * XLOG segment files will be re-read in recovery operation soon,
-             * so we don't need to advise the OS to release any cache page.
+             * fsync() and close current file before we switch to next one.
+             * We would otherwise have to reopen this file to fsync it later
              */
             if (recvFile >= 0)
             {
+                XLogWalRcvFlush();
+
                 /*
-                 * fsync() before we switch to next file. We would otherwise
-                 * have to reopen this file to fsync it later
+                 * XLOG segment files will be re-read by recovery in startup
+                 * process soon, so we don't advise the OS to release cache
+                 * pages associated with the file like XLogFileClose() does.
                  */
-                XLogWalRcvFlush();
                 if (close(recvFile) != 0)
                     ereport(PANIC,
                             (errcode_for_file_access(),
@@ -445,8 +452,7 @@
             /* Create/use new log file */
             XLByteToSeg(recptr, recvId, recvSeg);
             use_existent = true;
-            recvFile = XLogFileInit(recvId, recvSeg,
-                                    &use_existent, true);
+            recvFile = XLogFileInit(recvId, recvSeg, &use_existent, true);
             recvOff = 0;
         }

Index: src/include/access/xlogdefs.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/access/xlogdefs.h,v
retrieving revision 1.25
diff -u -r1.25 xlogdefs.h
--- src/include/access/xlogdefs.h    15 Jan 2010 09:19:06 -0000    1.25
+++ src/include/access/xlogdefs.h    18 Feb 2010 12:40:05 -0000
@@ -106,23 +106,20 @@
  * configure determined whether fdatasync() is.
  */
 #if defined(O_SYNC)
-#define BARE_OPEN_SYNC_FLAG        O_SYNC
+#define OPEN_SYNC_FLAG        O_SYNC
 #elif defined(O_FSYNC)
-#define BARE_OPEN_SYNC_FLAG        O_FSYNC
-#endif
-#ifdef BARE_OPEN_SYNC_FLAG
-#define OPEN_SYNC_FLAG            (BARE_OPEN_SYNC_FLAG | PG_O_DIRECT)
+#define OPEN_SYNC_FLAG        O_FSYNC
 #endif

 #if defined(O_DSYNC)
 #if defined(OPEN_SYNC_FLAG)
 /* O_DSYNC is distinct? */
-#if O_DSYNC != BARE_OPEN_SYNC_FLAG
-#define OPEN_DATASYNC_FLAG        (O_DSYNC | PG_O_DIRECT)
+#if O_DSYNC != OPEN_SYNC_FLAG
+#define OPEN_DATASYNC_FLAG        O_DSYNC
 #endif
 #else                            /* !defined(OPEN_SYNC_FLAG) */
 /* Win32 only has O_DSYNC */
-#define OPEN_DATASYNC_FLAG        (O_DSYNC | PG_O_DIRECT)
+#define OPEN_DATASYNC_FLAG        O_DSYNC
 #endif
 #endif

Index: src/include/replication/walreceiver.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/replication/walreceiver.h,v
retrieving revision 1.6
diff -u -r1.6 walreceiver.h
--- src/include/replication/walreceiver.h    3 Feb 2010 09:47:19 -0000    1.6
+++ src/include/replication/walreceiver.h    18 Feb 2010 12:40:05 -0000
@@ -15,6 +15,8 @@
 #include "access/xlogdefs.h"
 #include "storage/spin.h"

+extern bool am_walreceiver;
+
 /*
  * MAXCONNINFO: maximum size of a connection string.
  *

Re: Streaming replication on win32, still broken

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> Magnus Hagander wrote:
>> Well, it's going to make the process that reads the WAL cause actual
>> physical I/O... That'll take a chunk out of your total available I/O,
>> which is likely to push you to the limit of your I/O capacity much
>> quicker.
> 
> Right, doesn't seem sensible, though it would be nice to see a benchmark
> on that.
> 
> Here's a patch to disable O_DIRECT when archiving or streaming is
> enabled. This is pretty hard to test, so any extra eyeballs would be nice..

Committed. Can you check that this fixed the PANIC you saw?

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Streaming replication on win32, still broken

From
Fujii Masao
Date:
On Fri, Feb 19, 2010 at 7:54 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Heikki Linnakangas wrote:
>> Magnus Hagander wrote:
>>> Well, it's going to make the process that reads the WAL cause actual
>>> physical I/O... That'll take a chunk out of your total available I/O,
>>> which is likely to push you to the limit of your I/O capacity much
>>> quicker.
>>
>> Right, doesn't seem sensible, though it would be nice to see a benchmark
>> on that.
>>
>> Here's a patch to disable O_DIRECT when archiving or streaming is
>> enabled. This is pretty hard to test, so any extra eyeballs would be nice..
>
> Committed. Can you check that this fixed the PANIC you saw?

Thanks! Yeah, SR works fine in my MinGW environment.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center