Re: BUG #15449: file_fdw using program cause exit code error whenusing LIMIT - Mailing list pgsql-bugs

From Etsuro Fujita
Subject Re: BUG #15449: file_fdw using program cause exit code error whenusing LIMIT
Date
Msg-id 5BE4318F.4040002@lab.ntt.co.jp
Whole thread Raw
In response to Re: BUG #15449: file_fdw using program cause exit code error whenusing LIMIT  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: BUG #15449: file_fdw using program cause exit code error whenusing LIMIT  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Re: BUG #15449: file_fdw using program cause exit code error whenusing LIMIT  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List pgsql-bugs
(2018/11/08 10:50), Thomas Munro wrote:
> On Wed, Nov 7, 2018 at 11:03 PM Etsuro Fujita
> <fujita.etsuro@lab.ntt.co.jp>  wrote:
>> (2018/11/07 9:22), Kyotaro HORIGUCHI wrote:
>>> At Tue, 06 Nov 2018 21:07:37 +0900, Etsuro Fujita<fujita.etsuro@lab.ntt.co.jp>   wrote
in<5BE18409.2070004@lab.ntt.co.jp>
>>>> (2018/11/06 12:53), Kyotaro HORIGUCHI wrote:
>>>>> even if it comes from another pipe, we can assume that the
>>>>> SIGPIPE immediately stopped the program before returning any
>>>>> garbage to us.
>>>>
>>>> Sorry, I don't understand this.
>>>
>>> Mmm. It looks confusing, sorry. In other words:
>>>
>>> We don't care the reason for program's ending if we received
>>> enough data from the program and we actively (= before receiving
>>> EOF) stop reading. On the other hand if SIGPIPE (on another pipe)
>>> or something fatal happened on the program before we end reading,
>>> we receive EOF just after that and we are interested in the cause
>>> of the program's death.
>>>
>>> # Does the above make sense?
>>
>> Yeah, thanks for the explanation!
>
> +1
>
> I take back what I said earlier about false positives from other
> pipes.  I think it's only traditional Unix programs designed for use
> in pipelines and naive programs that let SIGPIPE terminate the
> process.   The accepted answer here gives a good way to think about
> it:
>
> https://stackoverflow.com/questions/8369506/why-does-sigpipe-exist

Thanks for the information!

> A program sophisticated enough to be writing to other pipes is no
> longer in that category and should be setting up signal dispositions
> itself, so I agree that we should enable the default disposition and
> ignore WTERMSIG(exit_code) == SIGPIPE, as proposed.  That is pretty
> close to the intended purpose of that signal AFAICS.

Great!

>>> In the sense of "We don't care the reason", negligible reasons
>>> are necessariry restricted to SIGPIPE, evan SIGSEGV could be
>>> theoretically ignored safely. "theoretically" here means it is
>>> another issue whether we rely on the output from a program which
>>> causes SEGV (or any reason other than SIGPIPE, which we caused).
>>
>> For the SIGSEGV case, I think it would be better that we don't rely on
>> the output data, IMO, because I think there might be a possibility that
>> the program have generated that data incorrectly/unexpectedly.
>
> +1
>
> I don't think we should ignore termination by signals other than
> SIGPIPE: that could hide serious problems from users.  I want to know
> if my program is crashing with SIGBUS, SIGTERM, SIGFPE etc, even if it
> happens after we read enough data; there is a major problem that a
> human needs to investigate!

I think so too.

>>>>> Finally in the attached PoC, I set cstate->eof_reached on failure
>>>>> of NextCopyFromRawFields then if eof_reached we don't ingore
>>>>> SIGPIPE. For a program like
>>>>> "puts("test1");fflush(stdout);kill(getpid(), SIGPIPE);", I had
>>>>> the following behavior. (I got an error without the fflush() but
>>>>> it is inevitable).
>>>>>
>>>>> =# select * from ft2 limit 1;
>>>>>       a
>>>>> -------
>>>>>     test1
>>>>>
>>>>> =# select * from ft2 limit 2;
>>>>> ERROR:  program "/home/horiguti/work/exprog" failed
>>>>> DETAIL:  child process was terminated by signal 13: Broken pipe
>>>>>
>>>>> For the original case:
>>>>>
>>>>> =# select * from ft1 limit 1;
>>>>>      a
>>>>> ------
>>>>>     test
>>>>> =# select * from ft1 limit 2;
>>>>>      a
>>>>> ------
>>>>>     test
>>>>> (1 row)
>>>>>
>>>>>
>>>>> I didn't confirmed whether it is complete.
>>>>
>>>> Sorry, I don't understand this fully, but the reason to add the
>>>> eof_reached stuff is to avoid ignoring the SIGPIPE failure in normal
>>>> COPY FROM PROGRAM cases?
>>>
>>> Yes, if we have received EOF, the program ended before we
>>> finished reading from the pipe thus any error should be
>>> reported.  Elsewise we don't care at least SIGPIPE. In the
>>> attached patch all kind of errors are ignored when pipe is closed
>>> from reading-end on backends.
>>
>> OK, I understand your idea.  Thanks for the patch!
>
> +1
>
>>> As the result it doesn't report an error for SELECT * FROM ft2
>>> LIMIT 1 on "main(void){puts("test1"); return 1;}".
>>>
>>> =#  select * from ft limit 1;
>>>      a
>>> -------
>>>    test1
>>> (1 row)
>>>
>>> limit 2 reports the error.
>>>
>>> =# select * from ft limit 2;
>>> ERROR:  program "/home/horiguti/work/exprog" failed
>>> DETAIL:  child process exited with exit code 1
>>
>> I think this would be contrary to users expectations: if the SELECT
>> command works for limit 1, they would expect that the command would work
>> for limit 2 as well.  So, I think it would be better to error out that
>> command for limit 1 as well, as-is.
>
> I think it's correct that LIMIT 1 gives no error but LIMIT 2 gives an
> error.  For LIMIT 1, we got all the rows we wanted, and then we closed
> the pipe.  If we got a non-zero non-signal exit code, or a signal exit
> code and it was SIGPIPE (not any other signal!), then we should
> consider that to be expected.

Maybe I'm missing something, but the non-zero non-signal exit code means 
that there was something wrong with the called program, so I think a 
human had better investigate that as well IMO, which would probably be a 
minor problem, though.  Too restrictive?

> I tried to think of a scenario where the already-received output is
> truly invalidated by a later error that happens after we close the
> pipe.  It could be something involving a program that uses something
> optimistic like serializable snapshot isolation that can later decide
> that whatever it told you earlier is not valid after all.  Suppose the
> program is clever enough to expect EPIPE and not consider that to be
> an error, but wants to tell us about serialization failure with a
> non-zero exit code.  To handle that, you'd need a way to provide an
> option to file_fdw to tell it not to ignore non-zero exit codes after
> close.  This seems so exotic and contrived that it's not worth
> bothering with for now, but could always be added.

Interesting!  I agree that such an option could add more flexibility in 
handling the non-zero-exit-code case.

> BTW just for curiosity:
>
> perl -e 'for (my $i=0; $i<  1000000; $i++) { print "$i\n"; }' | head -5
> Exit code: terminated by SIGPIPE, like seq

Good to know!

> ruby -e 'for i in 1..1000000 do puts i; end' | head -5
> Exit code: 1, like Python

Sad.  Anyway, thanks a lot for these experiments in addition to the 
previous ones!

> On Wed, Nov 7, 2018 at 4:44 PM Etsuro Fujita
> <fujita.etsuro@lab.ntt.co.jp>  wrote:
>> (2018/11/06 19:50), Thomas Munro wrote:
>>> On my FreeBSD system, I compared the output of procstat -i (= show
>>> signal disposition) for two "sleep 60" processes, one invoked from the
>>> shell and the other from COPY ... FROM PROGRAM.  The differences were:
>>> PIPE, TTIN, TTOU and USR2.  For the first and last of those, the
>>> default action would be to terminate the process, but the COPY PROGRAM
>>> child ignored them; for TTIN and TTOU, the default action would be to
>>> stop the process, but again they are ignored.  Why do bgwriter.c,
>>> startup.c, ... set SIGTTIN and SIGTTOU back to SIG_DFL, but not
>>> regular backends?
>>
>> So, we should revert SIGUSR2 as well to default processing?
>
> I don't think it matters in practice, but it might be nice to restore
> that just for consistency.

Agreed.

> I'm not sure what to think about the TTIN,
> TTOU stuff; I don't understand job control well right now but I don't
> think it really applies to programs run by a PostgreSQL backend, so if
> we restore those it'd probably again be only for consistency.  Then
> again, there may be a reason someone decided to ignore those in the
> postmaster + regular backends but not the various auxiliary processes.
> Anyone?

I don't have any idea about that.

Best regards,
Etsuro Fujita


pgsql-bugs by date:

Previous
From: Amit Langote
Date:
Subject: Re: BUG #15448: server process (PID 22656) was terminated byexception 0xC0000005
Next
From: PG Bug reporting form
Date:
Subject: BUG #15493: Wrong name of fields/missing fields for the internalstatistic