Thread: [HACKERS] Parallel Bitmap scans a bit broken

[HACKERS] Parallel Bitmap scans a bit broken

From
David Rowley
Date:
I was just doing some testing on [1] when I noticed that there's a problem with parallel bitmap index scans scans.

Test case:

patch with [1]

=# create table r1(value int);
CREATE TABLE
=# insert into r1 select (random()*1000)::int from generate_Series(1,1000000);
INSERT 0 1000000
=# create index on r1 using brin(value);
CREATE INDEX
=# set enable_seqscan=0;
SET
=# explain select * from r1 where value=555;
                                       QUERY PLAN                                        
-----------------------------------------------------------------------------------------
 Gather  (cost=3623.52..11267.45 rows=5000 width=4)
   Workers Planned: 2
   ->  Parallel Bitmap Heap Scan on r1  (cost=2623.52..9767.45 rows=2083 width=4)
         Recheck Cond: (value = 555)
         ->  Bitmap Index Scan on r1_value_idx  (cost=0.00..2622.27 rows=522036 width=0)
               Index Cond: (value = 555)
(6 rows)

=# explain analyze select * from r1 where value=555;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

The crash occurs in tbm_shared_iterate() at:

PagetableEntry *page = &ptbase[idxpages[istate->spageptr]];


I see in tbm_prepare_shared_iterate() tbm->npages is zero. I'm unsure if bringetbitmap() does something different with npages than btgetbitmap() around setting npages?

But anyway, due to the npages being 0 the tbm->ptpages is not allocated in tbm_prepare_shared_iterate()

if (tbm->npages)
{
tbm->ptpages = dsa_allocate(tbm->dsa, sizeof(PTIterationArray) +
tbm->npages * sizeof(int));

so when tbm_shared_iterate runs this code;

/*
* If both chunk and per-page data remain, must output the numerically
* earlier page.
*/
if (istate->schunkptr < istate->nchunks)
{
PagetableEntry *chunk = &ptbase[idxchunks[istate->schunkptr]];
PagetableEntry *page = &ptbase[idxpages[istate->spageptr]];
BlockNumber chunk_blockno;

chunk_blockno = chunk->blockno + istate->schunkbit;

if (istate->spageptr >= istate->npages ||
chunk_blockno < page->blockno)
{
/* Return a lossy page indicator from the chunk */
output->blockno = chunk_blockno;
output->ntuples = -1;
output->recheck = true;
istate->schunkbit++;

LWLockRelease(&istate->lock);
return output;
}
}

it fails, due to idxpages pointing to random memory

Probably this is a simple fix for the authors, so passing it along. I'm a bit unable to see how the part above is meant to work.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Dilip Kumar
Date:
On Thu, Mar 9, 2017 at 9:17 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> patch with [1]
>
> =# create table r1(value int);
> CREATE TABLE
> =# insert into r1 select (random()*1000)::int from
> generate_Series(1,1000000);
> INSERT 0 1000000
> =# create index on r1 using brin(value);
> CREATE INDEX
> =# set enable_seqscan=0;
> SET
> =# explain select * from r1 where value=555;

I am looking into the issue, I have already reproduced it.  I will
update on this soon.

Thanks for reporting.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Dilip Kumar
Date:
On Thu, Mar 9, 2017 at 9:37 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> =# create table r1(value int);
>> CREATE TABLE
>> =# insert into r1 select (random()*1000)::int from
>> generate_Series(1,1000000);
>> INSERT 0 1000000
>> =# create index on r1 using brin(value);
>> CREATE INDEX
>> =# set enable_seqscan=0;
>> SET
>> =# explain select * from r1 where value=555;
>
> I am looking into the issue, I have already reproduced it.  I will
> update on this soon.
>
> Thanks for reporting.

I slightly modified your query to reproduce this issue.

explain analyze select * from r1 where value<555;

Patch is attached to fix the problem.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Dilip Kumar
Date:
On Thu, Mar 9, 2017 at 10:02 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I slightly modified your query to reproduce this issue.
>
> explain analyze select * from r1 where value<555;
>
> Patch is attached to fix the problem.

I forgot to mention the cause of the problem.

if (istate->schunkptr < istate->nchunks)
{  PagetableEntry *chunk = &ptbase[idxchunks[istate->schunkptr]];   PagetableEntry *page =
&ptbase[idxpages[istate->spageptr]];  BlockNumber chunk_blockno;
 

In above if condition we have only checked istate->schunkptr <
istate->nchunks that means we have some chunk left so we are safe to
access idxchunks,  But just after that we are accessing
ptbase[idxpages[istate->spageptr]] without checking that accessing
idxpages is safe or not.

tbm_iterator already handling this case, I broke it in tbm_shared_iterator.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Robert Haas
Date:
On Thu, Mar 9, 2017 at 11:50 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Mar 9, 2017 at 10:02 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I slightly modified your query to reproduce this issue.
>>
>> explain analyze select * from r1 where value<555;
>>
>> Patch is attached to fix the problem.
>
> I forgot to mention the cause of the problem.
>
> if (istate->schunkptr < istate->nchunks)
> {
>    PagetableEntry *chunk = &ptbase[idxchunks[istate->schunkptr]];
>     PagetableEntry *page = &ptbase[idxpages[istate->spageptr]];
>     BlockNumber chunk_blockno;
>
> In above if condition we have only checked istate->schunkptr <
> istate->nchunks that means we have some chunk left so we are safe to
> access idxchunks,  But just after that we are accessing
> ptbase[idxpages[istate->spageptr]] without checking that accessing
> idxpages is safe or not.
>
> tbm_iterator already handling this case, I broke it in tbm_shared_iterator.

I don't know if this is the only problem -- it would be good if David
could retest -- but it's certainly *a* problem, so committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
David Rowley
Date:
On 10 March 2017 at 06:17, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 9, 2017 at 11:50 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Mar 9, 2017 at 10:02 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I slightly modified your query to reproduce this issue.
>>
>> explain analyze select * from r1 where value<555;
>>
>> Patch is attached to fix the problem.
>
> I forgot to mention the cause of the problem.
>
> if (istate->schunkptr < istate->nchunks)
> {
>    PagetableEntry *chunk = &ptbase[idxchunks[istate->schunkptr]];
>     PagetableEntry *page = &ptbase[idxpages[istate->spageptr]];
>     BlockNumber chunk_blockno;
>
> In above if condition we have only checked istate->schunkptr <
> istate->nchunks that means we have some chunk left so we are safe to
> access idxchunks,  But just after that we are accessing
> ptbase[idxpages[istate->spageptr]] without checking that accessing
> idxpages is safe or not.
>
> tbm_iterator already handling this case, I broke it in tbm_shared_iterator.

I don't know if this is the only problem -- it would be good if David
could retest -- but it's certainly *a* problem, so committed.

Thanks for committing, and generally parallelising more stuff.

I confirm that my test case is now working again.

I'll be in this general area today, so will mention if I stumble over anything that looks broken. 

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Emre Hasegeli
Date:
> I don't know if this is the only problem

> I'll be in this general area today, so will mention if I stumble over
> anything that looks broken.

I was testing the same patch with a large dataset and got a different segfault:

> hasegeli=# explain select * from only mp_notification_20170225 where server_id = 7;
>                                                QUERY PLAN
> ----------------------------------------------------------------------------------------------------------
> Gather  (cost=26682.94..476995.88 rows=1 width=215)
>   Workers Planned: 2
>   ->  Parallel Bitmap Heap Scan on mp_notification_20170225  (cost=25682.94..475995.78 rows=1 width=215)
>         Recheck Cond: (server_id = 7)
>         ->  Bitmap Index Scan on mp_notification_block_idx  (cost=0.00..25682.94 rows=4557665 width=0)
>               Index Cond: (server_id = 7)
> (6 rows)
>
> hasegeli=# select * from only mp_notification_20170225 where server_id = 7;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

> * thread #1: tid = 0x5045a8f, 0x000000010ae44558 postgres`brin_deform_tuple(brdesc=0x00007fea3c86a3a8,
tuple=0x00007fea3c891040)+ 40 at brin_tuple.c:414, queue = 'com.apple.main-thread', stop reason = signal SIGUSR1 
>  * frame #0: 0x000000010ae44558 postgres`brin_deform_tuple(brdesc=0x00007fea3c86a3a8, tuple=0x00007fea3c891040) + 40
atbrin_tuple.c:414 [opt] 
>    frame #1: 0x000000010ae4000c postgres`bringetbitmap(scan=0x00007fea3c875c20, tbm=<unavailable>) + 428 at
brin.c:398[opt] 
>    frame #2: 0x000000010ae9b451 postgres`index_getbitmap(scan=0x00007fea3c875c20, bitmap=<unavailable>) + 65 at
indexam.c:726[opt] 
>    frame #3: 0x000000010b0035a9 postgres`MultiExecBitmapIndexScan(node=<unavailable>) + 233 at
nodeBitmapIndexscan.c:91[opt] 
>    frame #4: 0x000000010b002840 postgres`BitmapHeapNext(node=<unavailable>) + 400 at nodeBitmapHeapscan.c:143 [opt]
>    frame #5: 0x000000010afef5d0 postgres`ExecProcNode(node=0x00007fea3c873948) + 224 at execProcnode.c:459 [opt]
>    frame #6: 0x000000010b004cc9 postgres`ExecGather [inlined] gather_getnext(gatherstate=<unavailable>) + 520 at
nodeGather.c:276[opt] 
>    frame #7: 0x000000010b004ac1 postgres`ExecGather(node=<unavailable>) + 497 at nodeGather.c:212 [opt]
>    frame #8: 0x000000010afef6b2 postgres`ExecProcNode(node=0x00007fea3c872f58) + 450 at execProcnode.c:541 [opt]
>    frame #9: 0x000000010afeaf90 postgres`standard_ExecutorRun [inlined] ExecutePlan(estate=<unavailable>,
planstate=<unavailable>,use_parallel_mode=<unavailable>, operation=<unavailable>, numberTuples=0,
direction=<unavailable>,dest=<unavailable>) + 29 at execMain.c:1616 [opt] 
>    frame #10: 0x000000010afeaf73 postgres`standard_ExecutorRun(queryDesc=<unavailable>, direction=<unavailable>,
count=0)+ 291 at execMain.c:348 [opt] 
>    frame #11: 0x000000010af8b108 postgres`ExplainOnePlan(plannedstmt=0x00007fea3c871040, into=0x0000000000000000,
es=0x00007fea3c805360,queryString=<unavailable>, params=<unavailable>, planduration=<unavailable>) + 328 at
explain.c:533[opt] 
>    frame #12: 0x000000010af8ab98 postgres`ExplainOneQuery(query=0x00007fea3c805890, cursorOptions=<unavailable>,
into=0x0000000000000000,es=0x00007fea3c805360, queryString=<unavailable>,params=0x0000000000000000) + 280 at
explain.c:369[opt] 
>    frame #13: 0x000000010af8a773 postgres`ExplainQuery(pstate=<unavailable>, stmt=0x00007fea3d005450,
queryString="explainanalyze select * from only mp_notification_20170225 where server_id >
6;",params=0x0000000000000000,dest=0x00007fea3c8052c8) + 819 at explain.c:254 [opt] 
>    frame #14: 0x000000010b13b660 postgres`standard_ProcessUtility(pstmt=0x00007fea3d005fa8, queryString="explain
analyzeselect * from only mp_notification_20170225 where server_id > 6;",context=PROCESS_UTILITY_TOPLEVEL,
params=0x0000000000000000,dest=0x00007fea3c8052c8, completionTag=<unavailable>) + 1104 at utility.c:675 [opt] 
>    frame #15: 0x000000010b13ad2a postgres`PortalRunUtility(portal=0x00007fea3c837640, pstmt=0x00007fea3d005fa8,
isTopLevel='\x01',setHoldSnapshot=<unavailable>, dest=0x00007fea3c8052c8, completionTag=<unavailable>) + 90 at
pquery.c:1165[opt] 
>    frame #16: 0x000000010b139f56 postgres`FillPortalStore(portal=0x00007fea3c837640, isTopLevel='\x01') + 182 at
pquery.c:1025[opt] 
>    frame #17: 0x000000010b139c22 postgres`PortalRun(portal=0x00007fea3c837640, count=<unavailable>,
isTopLevel='\x01',dest=<unavailable>, altdest=<unavailable>, completionTag=<unavailable>) + 402 at pquery.c:757 [opt] 
>    frame #18: 0x000000010b13789b postgres`PostgresMain + 44 at postgres.c:1101 [opt]
>    frame #19: 0x000000010b13786f postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>, dbname=<unavailable>,
username=<unavailable>)+ 8927 at postgres.c:4066 [opt] 
>    frame #20: 0x000000010b0ba113 postgres`PostmasterMain [inlined] BackendRun + 7587 at postmaster.c:4317 [opt]
>    frame #21: 0x000000010b0ba0e8 postgres`PostmasterMain [inlined] BackendStartup at postmaster.c:3989 [opt]
>    frame #22: 0x000000010b0ba0e8 postgres`PostmasterMain at postmaster.c:1729 [opt]
>    frame #23: 0x000000010b0ba0e8 postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) + 7544 at
postmaster.c:1337[opt] 
>    frame #24: 0x000000010b0332af postgres`main(argc=<unavailable>, argv=<unavailable>) + 1567 at main.c:228 [opt]
>    frame #25: 0x00007fffb4e28255 libdyld.dylib`start + 1

I can try to provide a test case, if that wouldn't be enough to spot
the problem.



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Dilip Kumar
Date:
On Wed, Mar 15, 2017 at 8:11 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
>
> I can try to provide a test case, if that wouldn't be enough to spot
> the problem.

Thanks for reporting, I am looking into this.  Meanwhile, if you can
provide the reproducible test case then locating the issue will be
faster.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Dilip Kumar
Date:
On Wed, Mar 15, 2017 at 8:51 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I can try to provide a test case, if that wouldn't be enough to spot
>> the problem.
>
> Thanks for reporting, I am looking into this.  Meanwhile, if you can
> provide the reproducible test case then locating the issue will be
> faster.

After trying multiple attempts with different datasets I am unable to
reproduce the issue.

I tried with below test case:
create table t(a int, b varchar);insert into t values(generate_series(1,10000000), repeat('x', 100));insert into t
values(generate_series(1,100000000),repeat('x', 100));
 
create index idx on t using brin(a);
postgres=# analyze t;
ANALYZE
postgres=# explain analyze select * from t where a>6;

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------Gather
(cost=580794.52..3059826.52 rows=110414922 width=105) (actual
 
time=92.324..91853.716 rows=110425971 loops=1)  Workers Planned: 2  Workers Launched: 2  ->  Parallel Bitmap Heap Scan
ont  (cost=579794.52..3058826.52
 
rows=46006218 width=105) (actual time=65.651..62023.020 rows=36808657
loops=3)        Recheck Cond: (a > 6)        Rows Removed by Index Recheck: 4        Heap Blocks: lossy=204401
-> Bitmap Index Scan on idx  (cost=0.00..552190.79
 
rows=110425920 width=0) (actual time=88.215..88.215 rows=19040000
loops=1)              Index Cond: (a > 6)Planning time: 1.116 msExecution time: 96176.881 ms
(11 rows)

Is it possible for you to provide a reproducible test case?  I also
applied the patch given up thread[1] but still could not reproduce.


[1] https://www.postgresql.org/message-id/attachment/50164/brin-correlation-v3.patch

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Dilip Kumar
Date:
On Wed, Mar 15, 2017 at 8:11 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
>> * thread #1: tid = 0x5045a8f, 0x000000010ae44558 postgres`brin_deform_tuple(brdesc=0x00007fea3c86a3a8,
tuple=0x00007fea3c891040)+ 40 at brin_tuple.c:414, queue = 'com.apple.main-thread', stop reason = signal SIGUSR1
 
>>  * frame #0: 0x000000010ae44558 postgres`brin_deform_tuple(brdesc=0x00007fea3c86a3a8, tuple=0x00007fea3c891040) + 40
atbrin_tuple.c:414 [opt]
 
>>    frame #1: 0x000000010ae4000c postgres`bringetbitmap(scan=0x00007fea3c875c20, tbm=<unavailable>) + 428 at
brin.c:398[opt]
 
>>    frame #2: 0x000000010ae9b451 postgres`index_getbitmap(scan=0x00007fea3c875c20, bitmap=<unavailable>) + 65 at
indexam.c:726[opt]
 
>>    frame #3: 0x000000010b0035a9 postgres`MultiExecBitmapIndexScan(node=<unavailable>) + 233 at
nodeBitmapIndexscan.c:91[opt]
 
>>    frame #4: 0x000000010b002840 postgres`BitmapHeapNext(node=<unavailable>) + 400 at nodeBitmapHeapscan.c:143 [opt]

Further analyzing the call stack, seems like this is not exact call
stack where it crashed.  Because, if you notice the code in the
brin_deform_tuple (line 414)

brin_deform_tuple(BrinDesc *brdesc, BrinTuple *tuple)

{     dtup = brin_new_memtuple(brdesc);
    if (BrinTupleIsPlaceholder(tuple))        dtup->bt_placeholder = true;    dtup->bt_blkno = tuple->bt_blkno;  -->
line414
 

This can crash at line:414, if either tuple is invalid memory(but I
think it's not because we have already accessed this memory in above
if check) or dtup is invalid (this is also not possible because
brin_new_memtuple has already accessed this).

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Emre Hasegeli
Date:
> This can crash at line:414, if either tuple is invalid memory(but I
> think it's not because we have already accessed this memory in above
> if check) or dtup is invalid (this is also not possible because
> brin_new_memtuple has already accessed this).

I was testing with the brin correlation patch [1] applied.  I cannot
crash it without the patch either.  I am sorry for not testing it
before.  The patch make BRIN selectivity estimation function access
more information.

[1] https://www.postgresql.org/message-id/attachment/50164/brin-correlation-v3.patch



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Dilip Kumar
Date:
On Wed, Mar 15, 2017 at 10:02 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
> I was testing with the brin correlation patch [1] applied.  I cannot
> crash it without the patch either.  I am sorry for not testing it
> before.  The patch make BRIN selectivity estimation function access
> more information.
>
> [1] https://www.postgresql.org/message-id/attachment/50164/brin-correlation-v3.patch

With my test case, I could not crash even with this patch applied.
Can you provide your test case?
(table, index, data, query)


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Emre Hasegeli
Date:
> With my test case, I could not crash even with this patch applied.
> Can you provide your test case?

Yes:

> hasegeli=# create table r2 as select (random() * 3)::int as i from generate_series(1, 1000000);
> SELECT 1000000
> hasegeli=# create index on r2 using brin (i);
> CREATE INDEX
> hasegeli=# analyze r2;
> ANALYZE
> hasegeli=# explain select * from only r2 where i = 10;
>                                      QUERY PLAN
> -------------------------------------------------------------------------------------
>  Gather  (cost=2867.50..9225.32 rows=1 width=4)
>    Workers Planned: 2
>    ->  Parallel Bitmap Heap Scan on r2  (cost=1867.50..8225.22 rows=1 width=4)
>          Recheck Cond: (i = 10)
>          ->  Bitmap Index Scan on r2_i_idx  (cost=0.00..1867.50 rows=371082 width=0)
>                Index Cond: (i = 10)
> (6 rows)
>
> hasegeli=# select * from only r2 where i = 10;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Dilip Kumar
Date:
On Wed, Mar 15, 2017 at 10:21 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
>> hasegeli=# create table r2 as select (random() * 3)::int as i from generate_series(1, 1000000);
>> SELECT 1000000
>> hasegeli=# create index on r2 using brin (i);
>> CREATE INDEX
>> hasegeli=# analyze r2;
>> ANALYZE
>> hasegeli=# explain select * from only r2 where i = 10;
>>                                      QUERY PLAN
>> -------------------------------------------------------------------------------------
>>  Gather  (cost=2867.50..9225.32 rows=1 width=4)
>>    Workers Planned: 2
>>    ->  Parallel Bitmap Heap Scan on r2  (cost=1867.50..8225.22 rows=1 width=4)
>>          Recheck Cond: (i = 10)
>>          ->  Bitmap Index Scan on r2_i_idx  (cost=0.00..1867.50 rows=371082 width=0)
>>                Index Cond: (i = 10)
>> (6 rows)
>>
>> hasegeli=# select * from only r2 where i = 10;

I am able to reproduce the bug, and attached patch fixes the same.
Problem is that I am not handling TBM_EMPTY state properly.  I
remember that while reviewing the patch Robert mentioned that we might
need to handle the TBM_EMPTY and I told that since we are not handling
in non-parallel mode so we don't need to handle here as well.  But, I
was wrong.  So the problem is that if state is not TBM_HASH then it's
directly assuming TBM_ONE_PAGE which is completely wrong.  I have
fixed that and also fixed in other similar locations.

Please verify the fix.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Emre Hasegeli
Date:
> Please verify the fix.

The same test with both of the patches applied still crashes for me.



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Dilip Kumar
Date:
On Thu, Mar 16, 2017 at 12:56 AM, Emre Hasegeli <emre@hasegeli.com> wrote:
>> Please verify the fix.
>
> The same test with both of the patches applied still crashes for me.
After above fix, I am not able to reproduce. Can you give me the
backtrace of the crash location or the dump?

I am trying on the below commit

commit c5832346625af4193b1242e57e7d13e66a220b38
Author: Stephen Frost <sfrost@snowman.net>
Date:   Wed Mar 15 11:19:39 2017 -0400

+ https://www.postgresql.org/message-id/attachment/50164/brin-correlation-v3.patch
+ fix_tbm_empty.patch


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Dilip Kumar
Date:
On Thu, Mar 16, 2017 at 5:02 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> After above fix, I am not able to reproduce. Can you give me the
> backtrace of the crash location or the dump?
>
> I am trying on the below commit
>
> commit c5832346625af4193b1242e57e7d13e66a220b38
> Author: Stephen Frost <sfrost@snowman.net>
> Date:   Wed Mar 15 11:19:39 2017 -0400
>
> + https://www.postgresql.org/message-id/attachment/50164/brin-correlation-v3.patch
> + fix_tbm_empty.patch

Forgot to mention after fix I am seeing this output.

postgres=# explain analyze select * from only r2 where i = 10;
QUERYPLAN
 

-------------------------------------------------------------------------------------------------------------------------------Gather
(cost=2880.56..9251.98 rows=1 width=4) (actual
 
time=3.857..3.857 rows=0 loops=1)  Workers Planned: 2  Workers Launched: 2  ->  Parallel Bitmap Heap Scan on r2
(cost=1880.56..8251.88rows=1
 
width=4) (actual time=0.043..0.043 rows=0 loops=3)        Recheck Cond: (i = 10)        ->  Bitmap Index Scan on
r2_i_idx (cost=0.00..1880.56
 
rows=373694 width=0) (actual time=0.052..0.052 rows=0 loops=1)              Index Cond: (i = 10)Planning time: 0.111
msExecutiontime: 4.449 ms
 
(9 rows)

postgres=# select * from only r2 where i = 10;i
---
(0 rows)

Are you getting the crash with the same test case?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Emre Hasegeli
Date:
> Are you getting the crash with the same test case?

Yes.  Here is the new backtrace:

> * thread #1: tid = 0x51828fd, 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined]
pg_atomic_write_u32_impl(val=0)at generic.h:57, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1,
address=0x0)
>   * frame #0: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_write_u32_impl(val=0) at
generic.h:57[opt] 
>     frame #1: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32_impl(val_=0) at
generic.h:163[opt] 
>     frame #2: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32(val=0) + 17 at
atomics.h:237[opt] 
>     frame #3: 0x0000000100caf303 postgres`tbm_prepare_shared_iterate(tbm=<unavailable>) + 723 at tidbitmap.c:875
[opt]
>     frame #4: 0x0000000100c74844 postgres`BitmapHeapNext(node=<unavailable>) + 436 at nodeBitmapHeapscan.c:154 [opt]
>     frame #5: 0x0000000100c615b0 postgres`ExecProcNode(node=0x00007fdabf8189f0) + 224 at execProcnode.c:459 [opt]
>     frame #6: 0x0000000100c76ca9 postgres`ExecGather [inlined] gather_getnext(gatherstate=<unavailable>) + 520 at
nodeGather.c:276[opt] 
>     frame #7: 0x0000000100c76aa1 postgres`ExecGather(node=<unavailable>) + 497 at nodeGather.c:212 [opt]
>     frame #8: 0x0000000100c61692 postgres`ExecProcNode(node=0x00007fdabf818558) + 450 at execProcnode.c:541 [opt]
>     frame #9: 0x0000000100c5cf70 postgres`standard_ExecutorRun [inlined] ExecutePlan(estate=<unavailable>,
planstate=<unavailable>,use_parallel_mode=<unavailable>, operation=<unavailable>, numberTuples=0,
direction=<unavailable>,dest=<unavailable>) + 29 at execMain.c:1616 [opt] 
>    frame #10: 0x0000000100c5cf53 postgres`standard_ExecutorRun(queryDesc=<unavailable>, direction=<unavailable>,
count=0)+ 291 at execMain.c:348 [opt] 
>    frame #11: 0x0000000100dac0df postgres`PortalRunSelect(portal=0x00007fdac000b240, forward=<unavailable>, count=0,
dest=<unavailable>)+ 255 at pquery.c:921 [opt] 
>    frame #12: 0x0000000100dabc84 postgres`PortalRun(portal=0x00007fdac000b240, count=<unavailable>,
isTopLevel='\x01',dest=<unavailable>, altdest=<unavailable>, completionTag=<unavailable>) + 500 at pquery.c:762 [opt] 
>    frame #13: 0x0000000100da989b postgres`PostgresMain + 44 at postgres.c:1101 [opt]
>    frame #14: 0x0000000100da986f postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>, dbname=<unavailable>,
username=<unavailable>)+ 8927 at postgres.c:4066 [opt] 
>    frame #15: 0x0000000100d2c113 postgres`PostmasterMain [inlined] BackendRun + 7587 at postmaster.c:4317 [opt]
>    frame #16: 0x0000000100d2c0e8 postgres`PostmasterMain [inlined] BackendStartup at postmaster.c:3989 [opt]
>    frame #17: 0x0000000100d2c0e8 postgres`PostmasterMain at postmaster.c:1729 [opt]
>    frame #18: 0x0000000100d2c0e8 postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) + 7544 at
postmaster.c:1337[opt] 
>    frame #19: 0x0000000100ca528f postgres`main(argc=<unavailable>, argv=<unavailable>) + 1567 at main.c:228 [opt]
>    frame #20: 0x00007fffb4e28255 libdyld.dylib`start + 1
>    frame #21: 0x00007fffb4e28255 libdyld.dylib`start + 1



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Dilip Kumar
Date:
On Thu, Mar 16, 2017 at 3:52 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
>> * thread #1: tid = 0x51828fd, 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined]
pg_atomic_write_u32_impl(val=0)at generic.h:57, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1,
address=0x0)
>>   * frame #0: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_write_u32_impl(val=0) at
generic.h:57[opt]
 
>>     frame #1: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32_impl(val_=0) at
generic.h:163[opt]
 
>>     frame #2: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32(val=0) + 17 at
atomics.h:237[opt]
 

By looking at the call stack I got the problem location.  I am
reviewing other parts of the code if there are the similar mistake at
other places. Soon I will post the patch.  Thanks for the help.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Dilip Kumar
Date:
On Thu, Mar 16, 2017 at 5:14 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> pg_atomic_write_u32_impl(val=0) at generic.h:57, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS
(code=1,address=0x0)
 
>>>   * frame #0: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_write_u32_impl(val=0) at
generic.h:57[opt]
 
>>>     frame #1: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32_impl(val_=0) at
generic.h:163[opt]
 
>>>     frame #2: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32(val=0) + 17 at
atomics.h:237[opt]
 
>
> By looking at the call stack I got the problem location.  I am
> reviewing other parts of the code if there are the similar mistake at
> other places. Soon I will post the patch.  Thanks for the help.

Based on the call stack I have tried to fix the issue. The problem is
there was some uninitialized pointer access (in some special cases
i.e. TBM_EMPTY when pagetable is not created at all).

 fix_tbm_empty.patch have fixed some of them but induced one which you
are seeing in your call stack.

Hopefully, this time I got it correct.  Since I am unable to reproduce
the issue so I will again need your help in verifying the fix.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Emre Hasegeli
Date:
> Hopefully, this time I got it correct.  Since I am unable to reproduce
> the issue so I will again need your help in verifying the fix.

It is not crashing with the new patch.  Thank you.



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Robert Haas
Date:
On Thu, Mar 16, 2017 at 10:56 AM, Emre Hasegeli <emre@hasegeli.com> wrote:
>> Hopefully, this time I got it correct.  Since I am unable to reproduce
>> the issue so I will again need your help in verifying the fix.
>
> It is not crashing with the new patch.  Thank you.

Thanks for confirming.  Some review comments on v2:

+    if (istate->pagetable)

Please compare explicitly to InvalidDsaPointer.

+    if (iterator->ptbase)
+        ptbase = iterator->ptbase->ptentry;
+    if (iterator->ptpages)
+        idxpages = iterator->ptpages->index;
+    if (iterator->ptchunks)
+        idxchunks = iterator->ptchunks->index;

Similarly.

Dilip, please also provide a proposed commit message describing what
this is fixing.  Is it just the TBM_EMPTY case, or is there anything
else?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Dilip Kumar
Date:
On Thu, Mar 16, 2017 at 8:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Thanks for confirming.  Some review comments on v2:
>
> +    if (istate->pagetable)
fixed
>
> Please compare explicitly to InvalidDsaPointer.
>
> +    if (iterator->ptbase)
> +        ptbase = iterator->ptbase->ptentry;
> +    if (iterator->ptpages)
> +        idxpages = iterator->ptpages->index;
> +    if (iterator->ptchunks)
> +        idxchunks = iterator->ptchunks->index;
>
> Similarly.
fixed

Also fixed at
+ if (ptbase)
+   pg_atomic_init_u32(&ptbase->refcount, 0);

>
> Dilip, please also provide a proposed commit message describing what
> this is fixing.  Is it just the TBM_EMPTY case, or is there anything
> else?

Okay, I have added the commit message in the patch.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Dilip Kumar
Date:
On Thu, Mar 16, 2017 at 8:26 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
>> Hopefully, this time I got it correct.  Since I am unable to reproduce
>> the issue so I will again need your help in verifying the fix.
>
> It is not crashing with the new patch.  Thank you.

Thanks for verifying.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel Bitmap scans a bit broken

From
Robert Haas
Date:
On Thu, Mar 16, 2017 at 1:50 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> fixed

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company