Thread: brin regression test intermittent failures

brin regression test intermittent failures

From

Andrew Dunstan

Date:

15 May 2015, 22:31:53

There's something odd about the brin regression tests. They seem to 
generate intermittent failures, which suggests some sort of race 
condition or ordering failure.

See for example 
<http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=fulmar&dt=2015-05-15%2001%3A02%3A28>
and 
<http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=sittella&dt=2015-05-15%2021%3A08%3A38>

cheers

andrew

Re: brin regression test intermittent failures

From

Alvaro Herrera

Date:

15 May 2015, 22:44:58

Andrew Dunstan wrote:
> 
> There's something odd about the brin regression tests. They seem to generate
> intermittent failures, which suggests some sort of race condition or
> ordering failure.
> 
> See for example <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=fulmar&dt=2015-05-15%2001%3A02%3A28>
> and <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=sittella&dt=2015-05-15%2021%3A08%3A38>

Yeah it's pretty odd.  I guess the way to figure out what is going on is
to get the test to print out the index contents in case of failure.
I guess I could do something with \gset.

(The way to print out the index is to use the pageinspect functions.
One problem is that at the time the brin test is run we don't have
pageinspect)

Of course, if I could reproduce the issue locally, this would be a lot
easier.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: brin regression test intermittent failures

From

Tom Lane

Date:

15 May 2015, 22:49:44

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Andrew Dunstan wrote:
>> There's something odd about the brin regression tests. They seem to generate
>> intermittent failures, which suggests some sort of race condition or
>> ordering failure.
>> 
>> See for example <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=fulmar&dt=2015-05-15%2001%3A02%3A28>
>> and <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=sittella&dt=2015-05-15%2021%3A08%3A38>

> Yeah it's pretty odd.

Oooh.  I saw the sittella failure and assumed it was triggered by the
latest BRIN additions, but that fulmar failure is from before those hit.

Just from reading the documentation, couldn't the symptom we're seeing
arise from autovacuum having hit the table right before
brin_summarize_new_values got called?
        regards, tom lane

Re: brin regression test intermittent failures

From

Alvaro Herrera

Date:

15 May 2015, 23:06:30

Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Andrew Dunstan wrote:
> >> There's something odd about the brin regression tests. They seem to generate
> >> intermittent failures, which suggests some sort of race condition or
> >> ordering failure.
> >> 
> >> See for example <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=fulmar&dt=2015-05-15%2001%3A02%3A28>
> >> and <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=sittella&dt=2015-05-15%2021%3A08%3A38>
> 
> > Yeah it's pretty odd.
> 
> Oooh.  I saw the sittella failure and assumed it was triggered by the
> latest BRIN additions, but that fulmar failure is from before those hit.
> 
> Just from reading the documentation, couldn't the symptom we're seeing
> arise from autovacuum having hit the table right before
> brin_summarize_new_values got called?

Well, I added a autovacuum_enabled=off to that table recently precisely
because that was my hypothesis.  It didn't work though, so it must be
sometihng else.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: brin regression test intermittent failures

From

Tom Lane

Date:

15 May 2015, 23:51:49

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Just from reading the documentation, couldn't the symptom we're seeing
>> arise from autovacuum having hit the table right before
>> brin_summarize_new_values got called?

> Well, I added a autovacuum_enabled=off to that table recently precisely
> because that was my hypothesis.  It didn't work though, so it must be
> sometihng else.

Ah.  Not having noticed that, I'd locally added a pg_sleep(60) right
before the brin_summarize_new_values call, and failed to reproduce any
problem.  So it's not AV doing something, but it sure smells like
something close to that.

Is there a good reason why we need to exercise brin_summarize_new_values
as such here, rather than just doing a manual VACUUM on the table?  And
if there is, do we really need to verify its result value?  I mean, even
without whatever sort of race condition we're talking about, that expected
result of "5" looks pretty darn phase-of-the-moon-dependent to me.
        regards, tom lane

Re: brin regression test intermittent failures

From

Tom Lane

Date:

25 May 2015, 22:25:22

Andrew Dunstan <andrew@dunslane.net> writes:
> There's something odd about the brin regression tests. They seem to 
> generate intermittent failures, which suggests some sort of race 
> condition or ordering failure.

> See for example 
> <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=fulmar&dt=2015-05-15%2001%3A02%3A28>
> and 
> <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=sittella&dt=2015-05-15%2021%3A08%3A38>

I found the cause of this symptom today.  Alvaro said he'd added the
autovacuum_enabled=off option to the brintest table to prevent autovac
from screwing up this expected result ... but that only stops autovacuum
from summarizing the table.  Guess what is in the concurrently-executed
gist.sql test, at line 40.

While we could and perhaps should change that command to a more narrowly
targeted "vacuum analyze gist_tbl;", this will not prevent someone from
reintroducing an untargeted vacuum command in one of the concurrent tests
later.  I think a future-proof fix would require either making brintest
a temp table (losing all test coverage of WAL logging :-() or changing
the test so that it does not expect a specific result from
brin_summarize_new_values.

Or, maybe better, let's lose the brin_summarize_new_values call
altogether.  What does it test that wouldn't be better done by
explicitly running "vacuum brintest;" ?

Also worth noting is that there's a completely different failure symptom
that's shown up a few times, eg here:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2015-05-25%2009%3A56%3A55

This makes it look like brintest sometimes contains no rows at all,
which is difficult to explain ...
        regards, tom lane

Re: brin regression test intermittent failures

From

Peter Geoghegan

Date:

25 May 2015, 22:41:36

On Mon, May 25, 2015 at 3:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Also worth noting is that there's a completely different failure symptom
> that's shown up a few times, eg here:
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2015-05-25%2009%3A56%3A55
>
> This makes it look like brintest sometimes contains no rows at all,
> which is difficult to explain ...

I meant to get around to looking into it, but FWIW I see BRIN-related
Valgrind issues. e.g.:

2015-05-20 23:44:42.419 PDT 14787 LOG:  statement: CREATE INDEX
brinidx ON brintest USING brin (
byteacol,
charcol,
namecol,
int8col,
int2col,
int4col,
textcol,
oidcol,
tidcol,
float4col,
float8col,
macaddrcol,
inetcol inet_inclusion_ops,
inetcol inet_minmax_ops,
bpcharcol,
datecol,
timecol,
timestampcol,
timestamptzcol,
intervalcol,
timetzcol,
bitcol,
varbitcol,
numericcol,
uuidcol,
int4rangecol,
lsncol,
boxcol
) with (pages_per_range = 1);
==14787== Unaddressable byte(s) found during client check request
==14787==    at 0x7E19AD: PageAddItem (bufpage.c:314)
==14787==    by 0x4693AD: brin_doinsert (brin_pageops.c:315)
==14787==    by 0x46844C: form_and_insert_tuple (brin.c:1122)
==14787==    by 0x4672AD: brinbuildCallback (brin.c:540)
==14787==    by 0x544D35: IndexBuildHeapRangeScan (index.c:2549)
==14787==    by 0x54426A: IndexBuildHeapScan (index.c:2162)
==14787==    by 0x4676EA: brinbuild (brin.c:638)
==14787==    by 0x944D19: OidFunctionCall3Coll (fmgr.c:1649)
==14787==    by 0x543F75: index_build (index.c:2025)
==14787==    by 0x542C4D: index_create (index.c:1100)
==14787==    by 0x614E52: DefineIndex (indexcmds.c:605)
==14787==    by 0x7F29CC: ProcessUtilitySlow (utility.c:1258)
==14787==  Address 0x6932a3a is 1,738 bytes inside a block of size 8,192 alloc'd
==14787==    at 0x4C2AB80: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14787==    by 0x966E86: AllocSetAlloc (aset.c:847)
==14787==    by 0x969B19: palloc (mcxt.c:825)
==14787==    by 0x839203: datumCopy (datum.c:171)
==14787==    by 0x46D6B3: brin_minmax_add_value (brin_minmax.c:105)
==14787==    by 0x944116: FunctionCall4Coll (fmgr.c:1375)
==14787==    by 0x4673AF: brinbuildCallback (brin.c:562)
==14787==    by 0x544D35: IndexBuildHeapRangeScan (index.c:2549)
==14787==    by 0x54426A: IndexBuildHeapScan (index.c:2162)
==14787==    by 0x4676EA: brinbuild (brin.c:638)
==14787==    by 0x944D19: OidFunctionCall3Coll (fmgr.c:1649)
==14787==    by 0x543F75: index_build (index.c:2025)
==14787==
{  <insert_a_suppression_name_here>  Memcheck:User  fun:PageAddItem  fun:brin_doinsert  fun:form_and_insert_tuple
fun:brinbuildCallback fun:IndexBuildHeapRangeScan  fun:IndexBuildHeapScan  fun:brinbuild  fun:OidFunctionCall3Coll
fun:index_build fun:index_create  fun:DefineIndex  fun:ProcessUtilitySlow
 
}
==14787== Invalid read of size 8
==14787==    at 0x4C2F79E: memcpy@@GLIBC_2.14 (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14787==    by 0x7E19DB: PageAddItem (bufpage.c:317)
==14787==    by 0x4693AD: brin_doinsert (brin_pageops.c:315)
==14787==    by 0x46844C: form_and_insert_tuple (brin.c:1122)
==14787==    by 0x4672AD: brinbuildCallback (brin.c:540)
==14787==    by 0x544D35: IndexBuildHeapRangeScan (index.c:2549)
==14787==    by 0x54426A: IndexBuildHeapScan (index.c:2162)
==14787==    by 0x4676EA: brinbuild (brin.c:638)
==14787==    by 0x944D19: OidFunctionCall3Coll (fmgr.c:1649)
==14787==    by 0x543F75: index_build (index.c:2025)
==14787==    by 0x542C4D: index_create (index.c:1100)
==14787==    by 0x614E52: DefineIndex (indexcmds.c:605)
==14787==  Address 0x6932a38 is 728 bytes inside a block of size 730
client-defined
==14787==    at 0x969DC2: palloc0 (mcxt.c:864)
==14787==    by 0x46B9FE: brin_form_tuple (brin_tuple.c:166)
==14787==    by 0x46840B: form_and_insert_tuple (brin.c:1120)
==14787==    by 0x4672AD: brinbuildCallback (brin.c:540)
==14787==    by 0x544D35: IndexBuildHeapRangeScan (index.c:2549)
==14787==    by 0x54426A: IndexBuildHeapScan (index.c:2162)
==14787==    by 0x4676EA: brinbuild (brin.c:638)
==14787==    by 0x944D19: OidFunctionCall3Coll (fmgr.c:1649)
==14787==    by 0x543F75: index_build (index.c:2025)
==14787==    by 0x542C4D: index_create (index.c:1100)
==14787==    by 0x614E52: DefineIndex (indexcmds.c:605)
==14787==    by 0x7F29CC: ProcessUtilitySlow (utility.c:1258)
==14787==
{  <insert_a_suppression_name_here>  Memcheck:Addr8  fun:memcpy@@GLIBC_2.14  fun:PageAddItem  fun:brin_doinsert
fun:form_and_insert_tuple fun:brinbuildCallback  fun:IndexBuildHeapRangeScan  fun:IndexBuildHeapScan  fun:brinbuild
fun:OidFunctionCall3Coll fun:index_build  fun:index_create  fun:DefineIndex
 
}


-- 
Peter Geoghegan

Re: brin regression test intermittent failures

From

Tom Lane

Date:

26 May 2015, 01:57:55

Peter Geoghegan <pg@heroku.com> writes:
> I meant to get around to looking into it, but FWIW I see BRIN-related
> Valgrind issues. e.g.:

Fixed, see 79f2b5d583e2e2a7; but AFAICS this has no real-world impact
so it does not explain whatever is happening on chipmunk.
        regards, tom lane

Re: brin regression test intermittent failures

From

Tom Lane

Date:

26 May 2015, 02:45:59

I wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> I meant to get around to looking into it, but FWIW I see BRIN-related
>> Valgrind issues. e.g.:

> Fixed, see 79f2b5d583e2e2a7; but AFAICS this has no real-world impact
> so it does not explain whatever is happening on chipmunk.

BTW, after some further trawling in the buildfarm logs, it seem that that
"no results for" failure mode has been seen *only* on chipmunk, where it's
happened in roughly half the runs since that particular test case went in.
So it's definitely platform-specific.  It's less clear whether the test
case is bogus, or it's exposing a bug added elsewhere in db5f98ab4fa44bc5,
or the bug was pre-existing but not exposed by any older test case.
        regards, tom lane

Re: brin regression test intermittent failures

From

Alvaro Herrera

Date:

26 May 2015, 14:30:17

Tom Lane wrote:
> Peter Geoghegan <pg@heroku.com> writes:
> > I meant to get around to looking into it, but FWIW I see BRIN-related
> > Valgrind issues. e.g.:
> 
> Fixed, see 79f2b5d583e2e2a7; but AFAICS this has no real-world impact
> so it does not explain whatever is happening on chipmunk.

Ah, thanks for diagnosing that.

The chipmunk failure is strange -- notice it only references the
= operators, except for type box for which it's ~= that fails.  The test
includes a lot of operators ...

Also, we have quite a number of ARM boxes: apart from chipmunk we have
gull, hamster, mereswine, dangomushi, axolotl, grison.  (hamster and
chipmunk report hostname -m as "armv6l", the others armv7l).  All of
them are running Linux, either Fedora or Debian.  Most are using gcc,
compilation flags look pretty standard.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: brin regression test intermittent failures

From

Tom Lane

Date:

04 June 2015, 16:23:20

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Fixed, see 79f2b5d583e2e2a7; but AFAICS this has no real-world impact
>> so it does not explain whatever is happening on chipmunk.

> Ah, thanks for diagnosing that.

> The chipmunk failure is strange -- notice it only references the
> = operators, except for type box for which it's ~= that fails.  The test
> includes a lot of operators ...

Actually not --- if you browse through the last half dozen failures
on chipmunk you will notice that

(1) the set of operators complained of varies a bit from one failure
to the next;

(2) more often than not, this is one of the failures:

WARNING:  no results for (boxcol,@>,box,"((1,2),(300,400))")

Certainly the majority of the complaints are about equality operators,
but not quite all of them.

> Also, we have quite a number of ARM boxes: apart from chipmunk we have
> gull, hamster, mereswine, dangomushi, axolotl, grison.  (hamster and
> chipmunk report hostname -m as "armv6l", the others armv7l).  All of
> them are running Linux, either Fedora or Debian.  Most are using gcc,
> compilation flags look pretty standard.

I have no idea what might be different about chipmunk compared to any
other ARM buildfarm critter ... Heikki, any thoughts on that?
        regards, tom lane

Re: brin regression test intermittent failures

From

Alvaro Herrera

Date:

04 June 2015, 16:34:33

Tom Lane wrote:

> Actually not --- if you browse through the last half dozen failures
> on chipmunk you will notice that
> 
> (1) the set of operators complained of varies a bit from one failure
> to the next;
> 
> (2) more often than not, this is one of the failures:
> 
> WARNING:  no results for (boxcol,@>,box,"((1,2),(300,400))")
> 
> Certainly the majority of the complaints are about equality operators,
> but not quite all of them.

Hm.  Well, what this message says is that we ran that query using
both BRIN and seqscan, and that in both cases no row was returned.  Note
that if the BRIN and seqscan cases had returned different sets of rows,
the error message would have been different.  So this might be related
to the way the test table is created, rather than to a bug in BRIN.
Peter G. recently pointed out that this seems to be relying on an
index-only scan on table tenk1 and suggested an ORDER BY.  Maybe that
assumption is being violated on chipmunk and so the table populated is
different than what the table actually expects.

I just noticed that chipmunk has shared_buffers=10MB on its buildfarm
config.  I don't see that in any of the other ARM animals.  Maybe that
can change the plan choice.

I will test locally with reduced shared_buffers and see if I can
reproduce the results.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: brin regression test intermittent failures

From

Alvaro Herrera

Date:

04 June 2015, 16:53:59

Alvaro Herrera wrote:

> Hm.  Well, what this message says is that we ran that query using
> both BRIN and seqscan, and that in both cases no row was returned.  Note
> that if the BRIN and seqscan cases had returned different sets of rows,
> the error message would have been different.  So this might be related
> to the way the test table is created, rather than to a bug in BRIN.
> Peter G. recently pointed out that this seems to be relying on an
> index-only scan on table tenk1 and suggested an ORDER BY.  Maybe that
> assumption is being violated on chipmunk and so the table populated is
> different than what the table actually expects.

Evidently there is a problem right there.  If I simply add an "order by
tenthous" as proposed by Peter, many more errors appear; and what errors
appear differs if I change shared_buffers.  I think the real fix for
this is to change the hand-picked values used in the brinopers table, so
that they all pass the test using some reasonable ORDER BY specification
in the populating query (probably tenk1.unique1).

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: brin regression test intermittent failures

From

Tom Lane

Date:

04 June 2015, 17:02:29

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Evidently there is a problem right there.  If I simply add an "order by
> tenthous" as proposed by Peter, many more errors appear; and what errors
> appear differs if I change shared_buffers.  I think the real fix for
> this is to change the hand-picked values used in the brinopers table, so
> that they all pass the test using some reasonable ORDER BY specification
> in the populating query (probably tenk1.unique1).

I may be confused, but why would the physical ordering of the table
entries make a difference to the correct answers for this test?
(I can certainly see why that might break the brin code, but not
why it should change the seqscan's answers.)

Also, what I'd just noticed is that all of the cases that are failing are
ones where the expected number of matching rows is exactly 1.  I am
wondering if the test is sometimes just missing random rows, and we're not
seeing any reported problem unless that makes it go down to no rows.  (But
I do not know how that could simultaneously affect the seqscan case ...)

I think it would be a good idea to extend the brinopers table to include
the number of expected matches, and to complain if that's not what we got,
rather than simply checking for zero.
        regards, tom lane

Re: brin regression test intermittent failures

From

Tom Lane

Date:

04 June 2015, 17:15:23

I wrote:
> I think it would be a good idea to extend the brinopers table to include
> the number of expected matches, and to complain if that's not what we got,
> rather than simply checking for zero.

Also, further experimentation shows that there are about 30 entries in the
brinopers table that give rise to seqscan plans even when we're commanding
a bitmap scan, presumably because those operators aren't brin-indexable.
They're not the problematic cases, but things like 
((charcol)::text > 'A'::text)

Is there a reason to have such things in the table, or is this just a
thinko?  Or is it actually a bug that we're getting such plans?
        regards, tom lane

Re: brin regression test intermittent failures

From

Alvaro Herrera

Date:

04 June 2015, 17:17:39

Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Evidently there is a problem right there.  If I simply add an "order by
> > tenthous" as proposed by Peter, many more errors appear; and what errors
> > appear differs if I change shared_buffers.  I think the real fix for
> > this is to change the hand-picked values used in the brinopers table, so
> > that they all pass the test using some reasonable ORDER BY specification
> > in the populating query (probably tenk1.unique1).
> 
> I may be confused, but why would the physical ordering of the table
> entries make a difference to the correct answers for this test?
> (I can certainly see why that might break the brin code, but not
> why it should change the seqscan's answers.)

We create the brintest using a scan of tenk1 LIMIT 100, without
specifying the order.  So whether we find rows that match each test query
is pure chance.

> Also, what I'd just noticed is that all of the cases that are failing are
> ones where the expected number of matching rows is exactly 1.  I am
> wondering if the test is sometimes just missing random rows, and we're not
> seeing any reported problem unless that makes it go down to no rows.  (But
> I do not know how that could simultaneously affect the seqscan case ...)

Yeah, we compare the ctid sets of the results, and we assume that a
seqscan would get that correctly.

> I think it would be a good idea to extend the brinopers table to include
> the number of expected matches, and to complain if that's not what we got,
> rather than simply checking for zero.

That sounds reasonable.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: brin regression test intermittent failures

From

Alvaro Herrera

Date:

04 June 2015, 17:25:02

Tom Lane wrote:
> I wrote:
> > I think it would be a good idea to extend the brinopers table to include
> > the number of expected matches, and to complain if that's not what we got,
> > rather than simply checking for zero.
> 
> Also, further experimentation shows that there are about 30 entries in the
> brinopers table that give rise to seqscan plans even when we're commanding
> a bitmap scan, presumably because those operators aren't brin-indexable.
> They're not the problematic cases, but things like 
> 
>     ((charcol)::text > 'A'::text)
> 
> Is there a reason to have such things in the table, or is this just a
> thinko?  Or is it actually a bug that we're getting such plans?

No, I left those there knowing that there are no plans involving brin --
in a way, they provide some future proofing if some of those operators
are made indexable later.

I couldn't think of a way to test that the plans are actually using the
brin index or not, but if we can do that in some way, that would be
good.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: brin regression test intermittent failures

From

Tom Lane

Date:

04 June 2015, 17:26:53

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> I may be confused, but why would the physical ordering of the table
>> entries make a difference to the correct answers for this test?
>> (I can certainly see why that might break the brin code, but not
>> why it should change the seqscan's answers.)

> We create the brintest using a scan of tenk1 LIMIT 100, without
> specifying the order.  So whether we find rows that match each test query
> is pure chance.

Oooh ... normally that would not matter, but I wonder if what's happening
on chipmunk is that the synchronized-seqscan logic kicks in and causes us
to read some other part of tenk1 than we normally would, as a consequence
of some concurrent activity or other.  The connection to smaller than
normal shared_buffers would be that it would change our idea of what's a
large enough table to justify using synchronized seqscans.

Peter's patch failed to hit the place where this matters, btw.
        regards, tom lane

Re: brin regression test intermittent failures

From

Tom Lane

Date:

04 June 2015, 17:32:41

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Also, further experimentation shows that there are about 30 entries in the
>> brinopers table that give rise to seqscan plans even when we're commanding
>> a bitmap scan, presumably because those operators aren't brin-indexable.
>> They're not the problematic cases, but things like 
>> 
>> ((charcol)::text > 'A'::text)
>> 
>> Is there a reason to have such things in the table, or is this just a
>> thinko?  Or is it actually a bug that we're getting such plans?

> No, I left those there knowing that there are no plans involving brin --
> in a way, they provide some future proofing if some of those operators
> are made indexable later.

On closer investigation, I think the ones involving charcol are a flat
out bug in the test, namely failure to quote "char".  Observe:

regression=# explain select ctid from brintest where charcol = 'A'::char;                       QUERY PLAN
         
 
----------------------------------------------------------Seq Scan on brintest  (cost=0.00..101.88 rows=1 width=6)
Filter:((charcol)::text = 'A'::text)
 
(2 rows)

regression=# explain select ctid from brintest where charcol = 'A'::"char";                             QUERY PLAN
                        
 
-----------------------------------------------------------------------Bitmap Heap Scan on brintest  (cost=48.02..58.50
rows=3width=6)  Recheck Cond: (charcol = 'A'::"char")  ->  Bitmap Index Scan on brinidx  (cost=0.00..48.02 rows=3
width=0)       Index Cond: (charcol = 'A'::"char")
 
(4 rows)

Presumably we'd like to test the latter case not the former.

The other cases that I found involve cidrcol, and seem to represent
an actual bug in the brin planning logic, ie failure to disregard a
no-op cast.  I'll look closer.

> I couldn't think of a way to test that the plans are actually using the
> brin index or not, but if we can do that in some way, that would be
> good.

Yeah, we can do that --- the way I found out there's a problem is to
modify the test script to check the output of EXPLAIN.

So at this point it looks like (1) chipmunk's issue might be explained
by lack of forced ORDER BY; (2) the test script could be improved to
test more carefully, and it has got an issue with "char" vs char;
(3) there might be a planner bug.
        regards, tom lane

Re: brin regression test intermittent failures

From

Tom Lane

Date:

04 June 2015, 19:30:13

I wrote:
> The other cases that I found involve cidrcol, and seem to represent
> an actual bug in the brin planning logic, ie failure to disregard a
> no-op cast.  I'll look closer.

I leapt to the wrong conclusion on that one.  The reason for failure to
match to an index column had nothing to do with the extra cast, and
everything to do with the fact that there was no such index column.

I think we're probably good now, though it'd be wise to keep an eye on
chipmunk for awhile to verify that it doesn't see the problem anymore.
        regards, tom lane

Re: brin regression test intermittent failures

From

Alvaro Herrera

Date:

04 June 2015, 19:37:27

Tom Lane wrote:
> I wrote:
> > The other cases that I found involve cidrcol, and seem to represent
> > an actual bug in the brin planning logic, ie failure to disregard a
> > no-op cast.  I'll look closer.
> 
> I leapt to the wrong conclusion on that one.  The reason for failure to
> match to an index column had nothing to do with the extra cast, and
> everything to do with the fact that there was no such index column.

Oops!  Thanks for reviewing this.

> I think we're probably good now, though it'd be wise to keep an eye on
> chipmunk for awhile to verify that it doesn't see the problem anymore.

Will do.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services