Thread: [HACKERS] Parallel Bitmap scans a bit broken
I was just doing some testing on [1] when I noticed that there's a problem with parallel bitmap index scans.
Test case:
patch with [1]
=# create table r1(value int);
CREATE TABLE
=# insert into r1 select (random()*1000)::int from generate_Series(1,1000000);
INSERT 0 1000000
=# create index on r1 using brin(value);
CREATE INDEX
=# set enable_seqscan=0;
SET
=# explain select * from r1 where value=555;
QUERY PLAN
-----------------------------------------------------------------------------------------
Gather (cost=3623.52..11267.45 rows=5000 width=4)
Workers Planned: 2
-> Parallel Bitmap Heap Scan on r1 (cost=2623.52..9767.45 rows=2083 width=4)
Recheck Cond: (value = 555)
-> Bitmap Index Scan on r1_value_idx (cost=0.00..2622.27 rows=522036 width=0)
Index Cond: (value = 555)
(6 rows)
=# explain analyze select * from r1 where value=555;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
The crash occurs in tbm_shared_iterate() at:
PagetableEntry *page = &ptbase[idxpages[istate->spageptr]];
I see that in tbm_prepare_shared_iterate() tbm->npages is zero. I'm unsure whether bringetbitmap() does something different from btgetbitmap() around setting npages.
In any case, because npages is 0, tbm->ptpages is not allocated in tbm_prepare_shared_iterate():
if (tbm->npages)
{
	tbm->ptpages = dsa_allocate(tbm->dsa, sizeof(PTIterationArray) +
								tbm->npages * sizeof(int));
so when tbm_shared_iterate() runs this code:
	/*
	 * If both chunk and per-page data remain, must output the numerically
	 * earlier page.
	 */
	if (istate->schunkptr < istate->nchunks)
	{
		PagetableEntry *chunk = &ptbase[idxchunks[istate->schunkptr]];
		PagetableEntry *page = &ptbase[idxpages[istate->spageptr]];
		BlockNumber chunk_blockno;

		chunk_blockno = chunk->blockno + istate->schunkbit;
		if (istate->spageptr >= istate->npages ||
			chunk_blockno < page->blockno)
		{
			/* Return a lossy page indicator from the chunk */
			output->blockno = chunk_blockno;
			output->ntuples = -1;
			output->recheck = true;
			istate->schunkbit++;
			LWLockRelease(&istate->lock);
			return output;
		}
	}
it fails because idxpages points to random memory.

This is probably a simple fix for the authors, so I'm passing it along; I can't quite see how the part above is meant to work.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 9, 2017 at 9:17 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> patch with [1]
>
> =# create table r1(value int);
> CREATE TABLE
> =# insert into r1 select (random()*1000)::int from
> generate_Series(1,1000000);
> INSERT 0 1000000
> =# create index on r1 using brin(value);
> CREATE INDEX
> =# set enable_seqscan=0;
> SET
> =# explain select * from r1 where value=555;

I am looking into the issue; I have already reproduced it. I will update on this soon.

Thanks for reporting.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 9, 2017 at 9:37 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> =# create table r1(value int);
>> CREATE TABLE
>> =# insert into r1 select (random()*1000)::int from
>> generate_Series(1,1000000);
>> INSERT 0 1000000
>> =# create index on r1 using brin(value);
>> CREATE INDEX
>> =# set enable_seqscan=0;
>> SET
>> =# explain select * from r1 where value=555;
>
> I am looking into the issue, I have already reproduced it. I will
> update on this soon.
>
> Thanks for reporting.

I slightly modified your query to reproduce this issue:

explain analyze select * from r1 where value<555;

Patch is attached to fix the problem.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Thu, Mar 9, 2017 at 10:02 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I slightly modified your query to reproduce this issue.
>
> explain analyze select * from r1 where value<555;
>
> Patch is attached to fix the problem.

I forgot to mention the cause of the problem:

if (istate->schunkptr < istate->nchunks)
{
	PagetableEntry *chunk = &ptbase[idxchunks[istate->schunkptr]];
	PagetableEntry *page = &ptbase[idxpages[istate->spageptr]];
	BlockNumber chunk_blockno;

In the above if condition we have only checked istate->schunkptr < istate->nchunks, which means some chunks remain, so we are safe to access idxchunks. But just after that we access ptbase[idxpages[istate->spageptr]] without checking whether accessing idxpages is safe.

tbm_iterator already handles this case; I broke it in tbm_shared_iterator.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
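[Editor's note: the guard described above can be illustrated with a minimal standalone sketch. The struct and function below are hypothetical simplifications, not the actual PostgreSQL code; the point is only that BOTH array indices must be range-checked before either array is dereferenced.]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical, simplified stand-ins for the shared-iterator state --
 * not the real PostgreSQL structs, just enough to show the guard.
 */
typedef struct IterState
{
	int			npages;		/* number of exact-page entries */
	int			nchunks;	/* number of lossy-chunk entries */
	int			spageptr;	/* next page entry to return */
	int			schunkptr;	/* next chunk entry to return */
} IterState;

/*
 * Decide whether the next output should come from the chunk array.
 * The bug discussed above was dereferencing the page array whenever a
 * chunk remained; the fix is to consult the page array only when
 * spageptr < npages, i.e. guard both indices before dereferencing.
 */
static bool
next_is_chunk(const IterState *st,
			  const int *page_blknos, const int *chunk_blknos)
{
	if (st->schunkptr >= st->nchunks)
		return false;			/* no chunks left: next must be a page */
	if (st->spageptr >= st->npages)
		return true;			/* no pages left: chunk is safe to use */
	/* both remain: output the numerically earlier block */
	return chunk_blknos[st->schunkptr] < page_blknos[st->spageptr];
}
```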
On Thu, Mar 9, 2017 at 11:50 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Mar 9, 2017 at 10:02 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I slightly modified your query to reproduce this issue.
>>
>> explain analyze select * from r1 where value<555;
>>
>> Patch is attached to fix the problem.
>
> I forgot to mention the cause of the problem.
>
> if (istate->schunkptr < istate->nchunks)
> {
> PagetableEntry *chunk = &ptbase[idxchunks[istate->schunkptr]];
> PagetableEntry *page = &ptbase[idxpages[istate->spageptr]];
> BlockNumber chunk_blockno;
>
> In above if condition we have only checked istate->schunkptr <
> istate->nchunks that means we have some chunk left so we are safe to
> access idxchunks, But just after that we are accessing
> ptbase[idxpages[istate->spageptr]] without checking that accessing
> idxpages is safe or not.
>
> tbm_iterator already handling this case, I broke it in tbm_shared_iterator.

I don't know if this is the only problem -- it would be good if David could retest -- but it's certainly *a* problem, so committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10 March 2017 at 06:17, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 9, 2017 at 11:50 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Mar 9, 2017 at 10:02 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I slightly modified your query to reproduce this issue.
>>
>> explain analyze select * from r1 where value<555;
>>
>> Patch is attached to fix the problem.
>
> I forgot to mention the cause of the problem.
>
> if (istate->schunkptr < istate->nchunks)
> {
> PagetableEntry *chunk = &ptbase[idxchunks[istate->schunkptr]];
> PagetableEntry *page = &ptbase[idxpages[istate->spageptr]];
> BlockNumber chunk_blockno;
>
> In above if condition we have only checked istate->schunkptr <
> istate->nchunks that means we have some chunk left so we are safe to
> access idxchunks, But just after that we are accessing
> ptbase[idxpages[istate->spageptr]] without checking that accessing
> idxpages is safe or not.
>
> tbm_iterator already handling this case, I broke it in tbm_shared_iterator.
I don't know if this is the only problem -- it would be good if David
could retest -- but it's certainly *a* problem, so committed.
Thanks for committing, and generally parallelising more stuff.
I confirm that my test case is now working again.
I'll be in this general area today, so will mention if I stumble over anything that looks broken.
> I don't know if this is the only problem

> I'll be in this general area today, so will mention if I stumble over
> anything that looks broken.

I was testing the same patch with a large dataset and got a different segfault:

> hasegeli=# explain select * from only mp_notification_20170225 where server_id = 7;
>                                                 QUERY PLAN
> ----------------------------------------------------------------------------------------------------------
>  Gather  (cost=26682.94..476995.88 rows=1 width=215)
>    Workers Planned: 2
>    ->  Parallel Bitmap Heap Scan on mp_notification_20170225  (cost=25682.94..475995.78 rows=1 width=215)
>          Recheck Cond: (server_id = 7)
>          ->  Bitmap Index Scan on mp_notification_block_idx  (cost=0.00..25682.94 rows=4557665 width=0)
>                Index Cond: (server_id = 7)
> (6 rows)
>
> hasegeli=# select * from only mp_notification_20170225 where server_id = 7;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> * thread #1: tid = 0x5045a8f, 0x000000010ae44558 postgres`brin_deform_tuple(brdesc=0x00007fea3c86a3a8, tuple=0x00007fea3c891040) + 40 at brin_tuple.c:414, queue = 'com.apple.main-thread', stop reason = signal SIGUSR1
>   * frame #0: 0x000000010ae44558 postgres`brin_deform_tuple(brdesc=0x00007fea3c86a3a8, tuple=0x00007fea3c891040) + 40 at brin_tuple.c:414 [opt]
>     frame #1: 0x000000010ae4000c postgres`bringetbitmap(scan=0x00007fea3c875c20, tbm=<unavailable>) + 428 at brin.c:398 [opt]
>     frame #2: 0x000000010ae9b451 postgres`index_getbitmap(scan=0x00007fea3c875c20, bitmap=<unavailable>) + 65 at indexam.c:726 [opt]
>     frame #3: 0x000000010b0035a9 postgres`MultiExecBitmapIndexScan(node=<unavailable>) + 233 at nodeBitmapIndexscan.c:91 [opt]
>     frame #4: 0x000000010b002840 postgres`BitmapHeapNext(node=<unavailable>) + 400 at nodeBitmapHeapscan.c:143 [opt]
>     frame #5: 0x000000010afef5d0 postgres`ExecProcNode(node=0x00007fea3c873948) + 224 at execProcnode.c:459 [opt]
>     frame #6: 0x000000010b004cc9 postgres`ExecGather [inlined] gather_getnext(gatherstate=<unavailable>) + 520 at nodeGather.c:276 [opt]
>     frame #7: 0x000000010b004ac1 postgres`ExecGather(node=<unavailable>) + 497 at nodeGather.c:212 [opt]
>     frame #8: 0x000000010afef6b2 postgres`ExecProcNode(node=0x00007fea3c872f58) + 450 at execProcnode.c:541 [opt]
>     frame #9: 0x000000010afeaf90 postgres`standard_ExecutorRun [inlined] ExecutePlan(estate=<unavailable>, planstate=<unavailable>, use_parallel_mode=<unavailable>, operation=<unavailable>, numberTuples=0, direction=<unavailable>, dest=<unavailable>) + 29 at execMain.c:1616 [opt]
>     frame #10: 0x000000010afeaf73 postgres`standard_ExecutorRun(queryDesc=<unavailable>, direction=<unavailable>, count=0) + 291 at execMain.c:348 [opt]
>     frame #11: 0x000000010af8b108 postgres`ExplainOnePlan(plannedstmt=0x00007fea3c871040, into=0x0000000000000000, es=0x00007fea3c805360, queryString=<unavailable>, params=<unavailable>, planduration=<unavailable>) + 328 at explain.c:533 [opt]
>     frame #12: 0x000000010af8ab98 postgres`ExplainOneQuery(query=0x00007fea3c805890, cursorOptions=<unavailable>, into=0x0000000000000000, es=0x00007fea3c805360, queryString=<unavailable>, params=0x0000000000000000) + 280 at explain.c:369 [opt]
>     frame #13: 0x000000010af8a773 postgres`ExplainQuery(pstate=<unavailable>, stmt=0x00007fea3d005450, queryString="explain analyze select * from only mp_notification_20170225 where server_id > 6;", params=0x0000000000000000, dest=0x00007fea3c8052c8) + 819 at explain.c:254 [opt]
>     frame #14: 0x000000010b13b660 postgres`standard_ProcessUtility(pstmt=0x00007fea3d005fa8, queryString="explain analyze select * from only mp_notification_20170225 where server_id > 6;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0000000000000000, dest=0x00007fea3c8052c8, completionTag=<unavailable>) + 1104 at utility.c:675 [opt]
>     frame #15: 0x000000010b13ad2a postgres`PortalRunUtility(portal=0x00007fea3c837640, pstmt=0x00007fea3d005fa8, isTopLevel='\x01', setHoldSnapshot=<unavailable>, dest=0x00007fea3c8052c8, completionTag=<unavailable>) + 90 at pquery.c:1165 [opt]
>     frame #16: 0x000000010b139f56 postgres`FillPortalStore(portal=0x00007fea3c837640, isTopLevel='\x01') + 182 at pquery.c:1025 [opt]
>     frame #17: 0x000000010b139c22 postgres`PortalRun(portal=0x00007fea3c837640, count=<unavailable>, isTopLevel='\x01', dest=<unavailable>, altdest=<unavailable>, completionTag=<unavailable>) + 402 at pquery.c:757 [opt]
>     frame #18: 0x000000010b13789b postgres`PostgresMain + 44 at postgres.c:1101 [opt]
>     frame #19: 0x000000010b13786f postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>, dbname=<unavailable>, username=<unavailable>) + 8927 at postgres.c:4066 [opt]
>     frame #20: 0x000000010b0ba113 postgres`PostmasterMain [inlined] BackendRun + 7587 at postmaster.c:4317 [opt]
>     frame #21: 0x000000010b0ba0e8 postgres`PostmasterMain [inlined] BackendStartup at postmaster.c:3989 [opt]
>     frame #22: 0x000000010b0ba0e8 postgres`PostmasterMain at postmaster.c:1729 [opt]
>     frame #23: 0x000000010b0ba0e8 postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) + 7544 at postmaster.c:1337 [opt]
>     frame #24: 0x000000010b0332af postgres`main(argc=<unavailable>, argv=<unavailable>) + 1567 at main.c:228 [opt]
>     frame #25: 0x00007fffb4e28255 libdyld.dylib`start + 1

I can try to provide a test case, if that wouldn't be enough to spot the problem.
On Wed, Mar 15, 2017 at 8:11 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
> I can try to provide a test case, if that wouldn't be enough to spot
> the problem.

Thanks for reporting, I am looking into this. Meanwhile, if you can provide the reproducible test case then locating the issue will be faster.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 15, 2017 at 8:51 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I can try to provide a test case, if that wouldn't be enough to spot
>> the problem.
>
> Thanks for reporting, I am looking into this. Meanwhile, if you can
> provide the reproducible test case then locating the issue will be
> faster.

After multiple attempts with different datasets I am unable to reproduce the issue. I tried the below test case:

create table t(a int, b varchar);
insert into t values(generate_series(1,10000000), repeat('x', 100));
insert into t values(generate_series(1,100000000), repeat('x', 100));
create index idx on t using brin(a);

postgres=# analyze t;
ANALYZE
postgres=# explain analyze select * from t where a>6;
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=580794.52..3059826.52 rows=110414922 width=105) (actual time=92.324..91853.716 rows=110425971 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on t  (cost=579794.52..3058826.52 rows=46006218 width=105) (actual time=65.651..62023.020 rows=36808657 loops=3)
         Recheck Cond: (a > 6)
         Rows Removed by Index Recheck: 4
         Heap Blocks: lossy=204401
         ->  Bitmap Index Scan on idx  (cost=0.00..552190.79 rows=110425920 width=0) (actual time=88.215..88.215 rows=19040000 loops=1)
               Index Cond: (a > 6)
 Planning time: 1.116 ms
 Execution time: 96176.881 ms
(11 rows)

Is it possible for you to provide a reproducible test case? I also applied the patch given upthread [1] but still could not reproduce.

[1] https://www.postgresql.org/message-id/attachment/50164/brin-correlation-v3.patch

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 15, 2017 at 8:11 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
>> * thread #1: tid = 0x5045a8f, 0x000000010ae44558 postgres`brin_deform_tuple(brdesc=0x00007fea3c86a3a8, tuple=0x00007fea3c891040) + 40 at brin_tuple.c:414, queue = 'com.apple.main-thread', stop reason = signal SIGUSR1
>> * frame #0: 0x000000010ae44558 postgres`brin_deform_tuple(brdesc=0x00007fea3c86a3a8, tuple=0x00007fea3c891040) + 40 at brin_tuple.c:414 [opt]
>> frame #1: 0x000000010ae4000c postgres`bringetbitmap(scan=0x00007fea3c875c20, tbm=<unavailable>) + 428 at brin.c:398 [opt]
>> frame #2: 0x000000010ae9b451 postgres`index_getbitmap(scan=0x00007fea3c875c20, bitmap=<unavailable>) + 65 at indexam.c:726 [opt]
>> frame #3: 0x000000010b0035a9 postgres`MultiExecBitmapIndexScan(node=<unavailable>) + 233 at nodeBitmapIndexscan.c:91 [opt]
>> frame #4: 0x000000010b002840 postgres`BitmapHeapNext(node=<unavailable>) + 400 at nodeBitmapHeapscan.c:143 [opt]

Analyzing the call stack further, it seems this is not the exact call stack where it crashed, because if you look at the code in brin_deform_tuple (line 414):

brin_deform_tuple(BrinDesc *brdesc, BrinTuple *tuple)
{
	dtup = brin_new_memtuple(brdesc);

	if (BrinTupleIsPlaceholder(tuple))
		dtup->bt_placeholder = true;
	dtup->bt_blkno = tuple->bt_blkno;    --> line 414

This can crash at line 414 only if either tuple is invalid memory (but I think it's not, because we have already accessed this memory in the above if check) or dtup is invalid (also not possible, because brin_new_memtuple has already accessed it).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
> This can crash at line:414, if either tuple is invalid memory(but I
> think it's not because we have already accessed this memory in above
> if check) or dtup is invalid (this is also not possible because
> brin_new_memtuple has already accessed this).

I was testing with the brin correlation patch [1] applied. I cannot crash it without the patch either. I am sorry for not testing it before. The patch makes the BRIN selectivity estimation function access more information.

[1] https://www.postgresql.org/message-id/attachment/50164/brin-correlation-v3.patch
On Wed, Mar 15, 2017 at 10:02 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
> I was testing with the brin correlation patch [1] applied. I cannot
> crash it without the patch either. I am sorry for not testing it
> before. The patch make BRIN selectivity estimation function access
> more information.
>
> [1] https://www.postgresql.org/message-id/attachment/50164/brin-correlation-v3.patch

With my test case, I could not crash even with this patch applied. Can you provide your test case? (table, index, data, query)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
> With my test case, I could not crash even with this patch applied.
> Can you provide your test case?

Yes:

> hasegeli=# create table r2 as select (random() * 3)::int as i from generate_series(1, 1000000);
> SELECT 1000000
> hasegeli=# create index on r2 using brin (i);
> CREATE INDEX
> hasegeli=# analyze r2;
> ANALYZE
> hasegeli=# explain select * from only r2 where i = 10;
>                                      QUERY PLAN
> -------------------------------------------------------------------------------------
>  Gather  (cost=2867.50..9225.32 rows=1 width=4)
>    Workers Planned: 2
>    ->  Parallel Bitmap Heap Scan on r2  (cost=1867.50..8225.22 rows=1 width=4)
>          Recheck Cond: (i = 10)
>          ->  Bitmap Index Scan on r2_i_idx  (cost=0.00..1867.50 rows=371082 width=0)
>                Index Cond: (i = 10)
> (6 rows)
>
> hasegeli=# select * from only r2 where i = 10;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
On Wed, Mar 15, 2017 at 10:21 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
>> hasegeli=# create table r2 as select (random() * 3)::int as i from generate_series(1, 1000000);
>> SELECT 1000000
>> hasegeli=# create index on r2 using brin (i);
>> CREATE INDEX
>> hasegeli=# analyze r2;
>> ANALYZE
>> hasegeli=# explain select * from only r2 where i = 10;
>>                                      QUERY PLAN
>> -------------------------------------------------------------------------------------
>>  Gather  (cost=2867.50..9225.32 rows=1 width=4)
>>    Workers Planned: 2
>>    ->  Parallel Bitmap Heap Scan on r2  (cost=1867.50..8225.22 rows=1 width=4)
>>          Recheck Cond: (i = 10)
>>          ->  Bitmap Index Scan on r2_i_idx  (cost=0.00..1867.50 rows=371082 width=0)
>>                Index Cond: (i = 10)
>> (6 rows)
>>
>> hasegeli=# select * from only r2 where i = 10;

I am able to reproduce the bug, and the attached patch fixes it. The problem is that I was not handling the TBM_EMPTY state properly. I remember that while reviewing the patch Robert mentioned that we might need to handle TBM_EMPTY, and I said that since we don't handle it in non-parallel mode we don't need to handle it here either. But I was wrong.

So the problem is that if the state is not TBM_HASH, the code directly assumes TBM_ONE_PAGE, which is completely wrong. I have fixed that, and also fixed other similar locations. Please verify the fix.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
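[Editor's note: the three-way status handling described above can be sketched in isolation. The enum below mirrors the TBM_EMPTY / TBM_ONE_PAGE / TBM_HASH states mentioned in the thread, but the helper function and its return convention are invented for illustration: handling each state explicitly, instead of assuming "not TBM_HASH means one page", makes the empty bitmap a harmless zero-entry case.]

```c
#include <assert.h>

/*
 * Mirrors the three bitmap states discussed above; the helper below
 * is hypothetical, for illustration only.
 */
typedef enum TBMStatus
{
	TBM_EMPTY,					/* no entries at all */
	TBM_ONE_PAGE,				/* exactly one page entry */
	TBM_HASH					/* a full page table exists */
} TBMStatus;

/*
 * How many page-table entries should an iterator prepare?
 * The buggy pattern was "if (status != TBM_HASH) assume one page",
 * which silently misreads TBM_EMPTY as TBM_ONE_PAGE.  An explicit
 * switch covers every state, so the empty case allocates and
 * iterates nothing.
 */
static int
entries_to_prepare(TBMStatus status, int nentries)
{
	switch (status)
	{
		case TBM_EMPTY:
			return 0;			/* nothing to allocate or iterate */
		case TBM_ONE_PAGE:
			return 1;
		case TBM_HASH:
			return nentries;
	}
	return 0;					/* unreachable; silences compilers */
}
```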
> Please verify the fix.

The same test with both of the patches applied still crashes for me.
On Thu, Mar 16, 2017 at 12:56 AM, Emre Hasegeli <emre@hasegeli.com> wrote:
>> Please verify the fix.
>
> The same test with both of the patches applied still crashes for me.

After the above fix, I am not able to reproduce. Can you give me the backtrace of the crash location, or the dump?

I am trying on the below commit:

commit c5832346625af4193b1242e57e7d13e66a220b38
Author: Stephen Frost <sfrost@snowman.net>
Date:   Wed Mar 15 11:19:39 2017 -0400

+ https://www.postgresql.org/message-id/attachment/50164/brin-correlation-v3.patch
+ fix_tbm_empty.patch

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 16, 2017 at 5:02 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> After above fix, I am not able to reproduce. Can you give me the
> backtrace of the crash location or the dump?
>
> I am trying on the below commit
>
> commit c5832346625af4193b1242e57e7d13e66a220b38
> Author: Stephen Frost <sfrost@snowman.net>
> Date: Wed Mar 15 11:19:39 2017 -0400
>
> + https://www.postgresql.org/message-id/attachment/50164/brin-correlation-v3.patch
> + fix_tbm_empty.patch

Forgot to mention: after the fix I am seeing this output.

postgres=# explain analyze select * from only r2 where i = 10;
                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=2880.56..9251.98 rows=1 width=4) (actual time=3.857..3.857 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on r2  (cost=1880.56..8251.88 rows=1 width=4) (actual time=0.043..0.043 rows=0 loops=3)
         Recheck Cond: (i = 10)
         ->  Bitmap Index Scan on r2_i_idx  (cost=0.00..1880.56 rows=373694 width=0) (actual time=0.052..0.052 rows=0 loops=1)
               Index Cond: (i = 10)
 Planning time: 0.111 ms
 Execution time: 4.449 ms
(9 rows)

postgres=# select * from only r2 where i = 10;
 i
---
(0 rows)

Are you getting the crash with the same test case?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
> Are you getting the crash with the same test case?

Yes. Here is the new backtrace:

> * thread #1: tid = 0x51828fd, 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_write_u32_impl(val=0) at generic.h:57, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
>   * frame #0: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_write_u32_impl(val=0) at generic.h:57 [opt]
>     frame #1: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32_impl(val_=0) at generic.h:163 [opt]
>     frame #2: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32(val=0) + 17 at atomics.h:237 [opt]
>     frame #3: 0x0000000100caf303 postgres`tbm_prepare_shared_iterate(tbm=<unavailable>) + 723 at tidbitmap.c:875 [opt]
>     frame #4: 0x0000000100c74844 postgres`BitmapHeapNext(node=<unavailable>) + 436 at nodeBitmapHeapscan.c:154 [opt]
>     frame #5: 0x0000000100c615b0 postgres`ExecProcNode(node=0x00007fdabf8189f0) + 224 at execProcnode.c:459 [opt]
>     frame #6: 0x0000000100c76ca9 postgres`ExecGather [inlined] gather_getnext(gatherstate=<unavailable>) + 520 at nodeGather.c:276 [opt]
>     frame #7: 0x0000000100c76aa1 postgres`ExecGather(node=<unavailable>) + 497 at nodeGather.c:212 [opt]
>     frame #8: 0x0000000100c61692 postgres`ExecProcNode(node=0x00007fdabf818558) + 450 at execProcnode.c:541 [opt]
>     frame #9: 0x0000000100c5cf70 postgres`standard_ExecutorRun [inlined] ExecutePlan(estate=<unavailable>, planstate=<unavailable>, use_parallel_mode=<unavailable>, operation=<unavailable>, numberTuples=0, direction=<unavailable>, dest=<unavailable>) + 29 at execMain.c:1616 [opt]
>     frame #10: 0x0000000100c5cf53 postgres`standard_ExecutorRun(queryDesc=<unavailable>, direction=<unavailable>, count=0) + 291 at execMain.c:348 [opt]
>     frame #11: 0x0000000100dac0df postgres`PortalRunSelect(portal=0x00007fdac000b240, forward=<unavailable>, count=0, dest=<unavailable>) + 255 at pquery.c:921 [opt]
>     frame #12: 0x0000000100dabc84 postgres`PortalRun(portal=0x00007fdac000b240, count=<unavailable>, isTopLevel='\x01', dest=<unavailable>, altdest=<unavailable>, completionTag=<unavailable>) + 500 at pquery.c:762 [opt]
>     frame #13: 0x0000000100da989b postgres`PostgresMain + 44 at postgres.c:1101 [opt]
>     frame #14: 0x0000000100da986f postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>, dbname=<unavailable>, username=<unavailable>) + 8927 at postgres.c:4066 [opt]
>     frame #15: 0x0000000100d2c113 postgres`PostmasterMain [inlined] BackendRun + 7587 at postmaster.c:4317 [opt]
>     frame #16: 0x0000000100d2c0e8 postgres`PostmasterMain [inlined] BackendStartup at postmaster.c:3989 [opt]
>     frame #17: 0x0000000100d2c0e8 postgres`PostmasterMain at postmaster.c:1729 [opt]
>     frame #18: 0x0000000100d2c0e8 postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) + 7544 at postmaster.c:1337 [opt]
>     frame #19: 0x0000000100ca528f postgres`main(argc=<unavailable>, argv=<unavailable>) + 1567 at main.c:228 [opt]
>     frame #20: 0x00007fffb4e28255 libdyld.dylib`start + 1
>     frame #21: 0x00007fffb4e28255 libdyld.dylib`start + 1
On Thu, Mar 16, 2017 at 3:52 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
>> * thread #1: tid = 0x51828fd, 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_write_u32_impl(val=0) at generic.h:57, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
>> * frame #0: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_write_u32_impl(val=0) at generic.h:57 [opt]
>> frame #1: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32_impl(val_=0) at generic.h:163 [opt]
>> frame #2: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32(val=0) + 17 at atomics.h:237 [opt]

From the call stack I have located the problem. I am reviewing other parts of the code for similar mistakes elsewhere. I will post a patch soon. Thanks for the help.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 16, 2017 at 5:14 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> pg_atomic_write_u32_impl(val=0) at generic.h:57, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
>>> * frame #0: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_write_u32_impl(val=0) at generic.h:57 [opt]
>>> frame #1: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32_impl(val_=0) at generic.h:163 [opt]
>>> frame #2: 0x0000000100caf314 postgres`tbm_prepare_shared_iterate [inlined] pg_atomic_init_u32(val=0) + 17 at atomics.h:237 [opt]
>
> By looking at the call stack I got the problem location. I am
> reviewing other parts of the code if there are the similar mistake at
> other places. Soon I will post the patch. Thanks for the help.

Based on the call stack I have tried to fix the issue. The problem was uninitialized pointer access in some special cases, i.e. TBM_EMPTY, when the pagetable is not created at all. fix_tbm_empty.patch fixed some of them but introduced the one you are seeing in your call stack.

Hopefully, this time I got it correct. Since I am unable to reproduce the issue, I will again need your help in verifying the fix.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
> Hopefully, this time I got it correct. Since I am unable to reproduce
> the issue so I will again need your help in verifying the fix.

It is not crashing with the new patch. Thank you.
On Thu, Mar 16, 2017 at 10:56 AM, Emre Hasegeli <emre@hasegeli.com> wrote:
>> Hopefully, this time I got it correct. Since I am unable to reproduce
>> the issue so I will again need your help in verifying the fix.
>
> It is not crashing with the new patch. Thank you.

Thanks for confirming. Some review comments on v2:

+ if (istate->pagetable)

Please compare explicitly to InvalidDsaPointer.

+ if (iterator->ptbase)
+ ptbase = iterator->ptbase->ptentry;
+ if (iterator->ptpages)
+ idxpages = iterator->ptpages->index;
+ if (iterator->ptchunks)
+ idxchunks = iterator->ptchunks->index;

Similarly.

Dilip, please also provide a proposed commit message describing what this is fixing. Is it just the TBM_EMPTY case, or is there anything else?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
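[Editor's note: Robert's point about comparing explicitly to InvalidDsaPointer can be shown with a small standalone sketch. The typedef and macro below imitate PostgreSQL's dsa.h, where dsa_pointer is an integer offset with zero reserved as invalid, but attach_if_valid() is a made-up helper used purely for illustration.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Imitation of the dsa.h definitions: a dsa_pointer is an integer
 * offset into a shared memory area, with zero reserved as invalid.
 */
typedef uint64_t dsa_pointer;
#define InvalidDsaPointer	((dsa_pointer) 0)

/*
 * Hypothetical helper: act on a dsa_pointer only when it is valid.
 * Writing "ptr != InvalidDsaPointer" instead of the bare truthiness
 * test "if (ptr)" states the intent explicitly, which was the
 * review request above.
 */
static int
attach_if_valid(dsa_pointer ptr)
{
	if (ptr != InvalidDsaPointer)	/* explicit comparison, per review */
		return 1;					/* would translate the pointer here */
	return 0;						/* leave the local pointer unset */
}
```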
On Thu, Mar 16, 2017 at 8:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Thanks for confirming. Some review comments on v2:
>
> + if (istate->pagetable)

fixed

> Please compare explicitly to InvalidDsaPointer.
>
> + if (iterator->ptbase)
> + ptbase = iterator->ptbase->ptentry;
> + if (iterator->ptpages)
> + idxpages = iterator->ptpages->index;
> + if (iterator->ptchunks)
> + idxchunks = iterator->ptchunks->index;
>
> Similarly.

fixed

Also fixed at:

+ if (ptbase)
+ pg_atomic_init_u32(&ptbase->refcount, 0);

> Dilip, please also provide a proposed commit message describing what
> this is fixing. Is it just the TBM_EMPTY case, or is there anything
> else?

Okay, I have added the commit message in the patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Mar 16, 2017 at 8:26 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
>> Hopefully, this time I got it correct. Since I am unable to reproduce
>> the issue so I will again need your help in verifying the fix.
>
> It is not crashing with the new patch. Thank you.

Thanks for verifying.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 16, 2017 at 1:50 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> fixed

Committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company