Thread: Open 6.5 items
SELECT * FROM test WHERE test IN (SELECT * FROM test) fails with strange error
When creating a table with either type inet or type cidr as a primary/unique
  key, "198.68.123.0/24" and "198.68.123.0/27" are considered equal
Fix function pointer calls to take Datum args for char and int2 args (ecgs)
Regression test for new Numeric type
Large Object memory problems
refint problems
invalidate cache on aborted transaction
spinlock stuck problem
benchmark performance problem

Make psql \help, man pages, and sgml reflect changes in grammar
Markup sql.sgml, Stefan's intro to SQL
Generate Admin, User, Programmer hardcopy postscript
Generate INSTALL and HISTORY from sgml sources.
Update ref/lock.sgml, ref/set.sgml to reflect MVCC and locking changes.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> spinlock stuck problem

This time I have tested on another slower/less-memory machine. Seems
things are getting worse. I got:

	LockAcquire: xid table corrupted

This comes from:

	/*
	 * Find or create an xid entry with this tag
	 */
	result = (XIDLookupEnt *) hash_search(xidTable, (Pointer) &item,
										  HASH_ENTER, &found);
	if (!result)
	{
		elog(NOTICE, "LockAcquire: xid table corrupted");
		return STATUS_ERROR;
	}

As you can see, the acquired master lock is never released, and all
backends get stuck. (Of course, the corrupted xid table is a problem
too.) Another error was:

	out of free buffers: time to abort !

I will do more testing...
---
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> LockAcquire: xid table corrupted
> This comes from:
> 	result = (XIDLookupEnt *) hash_search(xidTable, (Pointer) &item,
> 										  HASH_ENTER, &found);
> 	if (!result)
> 	{
> 		elog(NOTICE, "LockAcquire: xid table corrupted");
> 		return STATUS_ERROR;
> 	}
> As you can see, the acquired master lock is never released, and all
> backends get stuck. (Of course, the corrupted xid table is a problem too.)

Actually, the corrupted xid table is *the* problem --- whatever happens
after that is just collateral damage. (The elog should likely be
elog(FATAL) not NOTICE...)

If I recall the dynahash.c code correctly, a null return value
indicates either damage to the structure of the table (ie someone
stomped on memory that didn't belong to them) or running out of memory
to add entries to the table. The latter should be impossible if we
sized shared memory correctly. Perhaps the table size estimation code
has been obsoleted by recent changes?

			regards, tom lane
Thus spake Bruce Momjian
> When creating a table with either type inet or type cidr as a
> primary/unique key, the "198.68.123.0/24" and "198.68.123.0/27" are
> considered equal

So have we decided that this is still to be fixed? If so, it's an easy
fix, but we have to decide which of the following is true.

	198.68.123.0/24 < 198.68.123.0/27
	198.68.123.0/24 > 198.68.123.0/27

Maybe deciding that should be the TODO item. :-)

--
D'Arcy J.M. Cain <darcy@{druid|vex}.net>   |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 424 2871     (DoD#0082)    (eNTP)   |  what's for dinner.
Tom Lane wrote:
>
> If I recall the dynahash.c code correctly, a null return value
> indicates either damage to the structure of the table (ie someone
> stomped on memory that didn't belong to them) or running out of memory
> to add entries to the table. The latter should be impossible if we

Quite different cases, and they should result in different reactions.
If the structure is corrupted then only abort() is the proper thing.
If running out of memory then elog(ERROR) is enough.

> sized shared memory correctly. Perhaps the table size estimation code
> has been obsoleted by recent changes?

lock.h:

/* ----------------------
 * The following defines are used to estimate how much shared
 * memory the lock manager is going to require.
 * See LockShmemSize() in lock.c.
 *
 * NLOCKS_PER_XACT - The number of unique locks acquired in a transaction
 * NLOCKENTS - The maximum number of lock entries in the lock table.
 * ----------------------
 */
#define NLOCKS_PER_XACT 40
                        ^^
Isn't it too low?

#define NLOCKENTS(maxBackends)	(NLOCKS_PER_XACT*(maxBackends))

And now - LockShmemSize() in lock.c:

	/* lockHash table */
	size += hash_estimate_size(NLOCKENTS(maxBackends),
	                           ^^^^^^^^^^^^^^^^^^^^^^
							   SHMEM_LOCKTAB_KEYSIZE,
							   SHMEM_LOCKTAB_DATASIZE);

	/* xidHash table */
	size += hash_estimate_size(maxBackends,
	                           ^^^^^^^^^^^
							   SHMEM_XIDTAB_KEYSIZE,
							   SHMEM_XIDTAB_DATASIZE);

Why is just maxBackends here? NLOCKENTS should be used too
(each transaction lock requires its own xidHash entry).

Vadim
Vadim Mikheev <vadim@krs.ru> writes:
>> If I recall the dynahash.c code correctly, a null return value
>> indicates either damage to the structure of the table (ie someone
>> stomped on memory that didn't belong to them) or running out of memory
>> to add entries to the table. The latter should be impossible if we

> Quite different cases and should result in different reactions.

I agree; will see about cleaning up hash_search's call convention after
6.5 is done. Actually, maybe I should do it now? I'm not convinced yet
whether the reports we're seeing are due to memory clobber or running
out of space... fixing this may be the easiest way to find out.

> #define NLOCKS_PER_XACT 40
> Isn't it too low?

You tell me ... that was the number that was in the 6.4 code, but I
have no idea if it's right or not. (Does MVCC require more locks
than the old stuff?) What is a good upper bound on the number of
concurrently existing locks?

> /* xidHash table */
> size += hash_estimate_size(maxBackends,
>                            SHMEM_XIDTAB_KEYSIZE,
>                            SHMEM_XIDTAB_DATASIZE);

> Why just maxBackends is here? NLOCKENTS should be used too
> (each transaction lock requires its own xidHash entry).

Should it be NLOCKENTS(maxBackends) xid entries, or do you mean
NLOCKENTS(maxBackends) + maxBackends? Feel free to stick in any
estimates that you like better --- what's there now is an interpretation
of what the 6.4 code was trying to do (but it was sufficiently buggy and
unreadable that it was probably coming out with different numbers in
the end...)

			regards, tom lane
"D'Arcy" "J.M." Cain <darcy@druid.net> writes:
> but we have to decide which of the following is true.
> 198.68.123.0/24 < 198.68.123.0/27
> 198.68.123.0/24 > 198.68.123.0/27

I'd say the former, on the same principle that 'abc' < 'abcd'.
Think of the addresses as being bit strings of the specified length,
and compare them the same way character strings are compared.

But if Vixie's got a different opinion, I defer to him...

			regards, tom lane
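[Editor's note: Tom's bit-string rule can be sketched in C. This is a toy
illustration only --- the helper name prefix_cmp and the raw uint32
representation are assumptions, not the backend's actual network_cmp.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compare two IPv4 prefixes as bit strings, the way character strings
 * are compared: a differing bit within the common prefix length decides;
 * otherwise the shorter prefix sorts first ('abc' < 'abcd').
 * Toy sketch -- not the backend's network_cmp.
 */
static int
prefix_cmp(uint32_t a, int abits, uint32_t b, int bbits)
{
	int			common = (abits < bbits) ? abits : bbits;
	uint32_t	mask = common ? (~(uint32_t) 0 << (32 - common)) : 0;

	if ((a & mask) != (b & mask))
		return ((a & mask) < (b & mask)) ? -1 : 1;
	return abits - bbits;		/* shorter prefix first */
}
```

Under this rule 198.68.123.0/24 (0xC6447B00, 24 bits) sorts before
198.68.123.0/27, matching the 'abc' < 'abcd' analogy.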
Tom Lane wrote:
>
> Vadim Mikheev <vadim@krs.ru> writes:
> >> If I recall the dynahash.c code correctly, a null return value
> >> indicates either damage to the structure of the table (ie someone
> >> stomped on memory that didn't belong to them) or running out of memory
> >> to add entries to the table. The latter should be impossible if we
>
> > Quite different cases and should result in different reactions.
>
> I agree; will see about cleaning up hash_search's call convention after
> 6.5 is done. Actually, maybe I should do it now? I'm not convinced yet
> whether the reports we're seeing are due to memory clobber or running
> out of space... fixing this may be the easiest way to find out.

Imho, we have to fix it in some way before 6.5, either by changing
dynahash.c (to return 0x1 if the table is corrupted and 0x0 if out of
space) or by changing elog(NOTICE) to elog(ERROR).

> > #define NLOCKS_PER_XACT 40
> > Isn't it too low?
>
> You tell me ... that was the number that was in the 6.4 code, but I
> have no idea if it's right or not. (Does MVCC require more locks
> than the old stuff?) What is a good upper bound on the number
> of concurrently existing locks?

Probably yes, because writers can continue to work and lock other
tables instead of sleeping on the first lock due to a concurrent
select. I'll change it to 64, but this should be a configurable thing.

> > /* xidHash table */
> > size += hash_estimate_size(maxBackends,
> >                            SHMEM_XIDTAB_KEYSIZE,
> >                            SHMEM_XIDTAB_DATASIZE);
>
> > Why just maxBackends is here? NLOCKENTS should be used too
> > (each transaction lock requires its own xidHash entry).
>
> Should it be NLOCKENTS(maxBackends) xid entries, or do you mean
> NLOCKENTS(maxBackends) + maxBackends? Feel free to stick in any
> estimates that you like better --- what's there now is an interpretation
> of what the 6.4 code was trying to do (but it was sufficiently buggy and
> unreadable that it was probably coming out with different numbers in
> the end...)

Just NLOCKENTS(maxBackends) - I'll change it now.

Vadim
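[Editor's note: the two-failure-mode convention Vadim proposes --- 0x1 for
a corrupted table, 0x0 (NULL) for out-of-space --- could be classified by
callers roughly as below. A sketch under invented names, not the actual
dynahash.c interface.]

```c
#include <stddef.h>

/*
 * Sketch of the proposed hash_search failure convention:
 *   NULL        -> out of space, recoverable, elog(ERROR) is enough
 *   (void *)0x1 -> table structure corrupted, only abort() is proper
 * Hypothetical helper names; not the real dynahash.c code.
 */
#define HASH_OUT_OF_SPACE	((void *) NULL)
#define HASH_CORRUPTED		((void *) 0x1)

enum hash_status
{
	HASH_OK,					/* entry found or entered normally */
	HASH_NOSPACE,				/* caller should elog(ERROR) */
	HASH_BROKEN					/* caller should abort() */
};

static enum hash_status
classify_hash_result(void *result)
{
	if (result == HASH_OUT_OF_SPACE)
		return HASH_NOSPACE;
	if (result == HASH_CORRUPTED)
		return HASH_BROKEN;
	return HASH_OK;
}
```

The point of separating the two sentinels is exactly what is argued above:
running out of space is a recoverable per-transaction error, while a
corrupted shared structure must take the whole backend down.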
Thus spake Tom Lane
> "D'Arcy" "J.M." Cain <darcy@druid.net> writes:
> > but we have to decide which of the following is true.
> > 198.68.123.0/24 < 198.68.123.0/27
> > 198.68.123.0/24 > 198.68.123.0/27
>
> I'd say the former, on the same principle that 'abc' < 'abcd'.

And, in fact, that's what happens if you use the operators. The only
place they are equal is when sorting them, so they can't be used as
primary keys. I guess there is no argument about the sorting order if
we think they should be sorted. There is still the question of whether
or not they should be sorted. There seems to be tacit agreement, but
could we have a little more discussion? The question is: when inet or
cidr is used as the primary key on a table, should values like these be
considered equal? In fact, think about the question separately for each
type, as we may want a different behaviour for each. Here is my
breakdown of the question.

For the inet type, the value specifies primarily, I think, the host, but
also carries information about its place on the network. Given an inet
value you can extract the host, broadcast, netmask and even the cidr
that it is part of. So 198.68.123.0/24 and 198.68.123.0/27 really refer
to the same host but on different networks. Since a host can only be on
one network, there is an argument that they can't both be used as the
primary key in the same table.

A cidr value is primarily a network. In fact, some valid inet values
aren't even valid cidr. So the question is: if one network is part of
another, then should it be possible to have both as a primary key?

Of course, both of these beg the real question (should either of these
types be used as a primary key at all?), but that is a database design
question.

> Think of the addresses as being bit strings of the specified length,
> and compare them the same way character strings are compared.

Not sure that that clarifies it, but we do have the code to order them
in any case. We just need to decide whether we want to.
> But if Vixie's got a different opinion, I defer to him...

Paul's code orders them without regard to netmask, which implies "no"
as the answer to the question, but his original code only referred to
what we eventually called the cidr type. The question would still be
open for the inet type anyway.

--
D'Arcy J.M. Cain <darcy@{druid|vex}.net>   |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 424 2871     (DoD#0082)    (eNTP)   |  what's for dinner.
"D'Arcy" "J.M." Cain <darcy@druid.net> writes:
> And, in fact, that's what happens if you use the operators. The only
> place they are equal is when sorting them, so they can't be used as
> primary keys.

Huh? Indexes and operators are the same thing --- or more specifically,
indexes rely on operators to compare keys. I don't see how it's even
*possible* that an index would think that two keys are equal when the
underlying = operator says they are not.

A little experimentation shows that's indeed what's happening, though.
Weird. Is this a deliberate effect, and if so how did you achieve it?
It looks like what could be a serious bug to me.

> I guess there is no argument about the sorting order
> if we think they should be sorted. There is still the question of
> whether or not they should be sorted. There seems to be tacit agreement
> but could we have a little more discussion. The question is, when inet
> or cidr is used as the primary key on a table, should they be considered
> equal. In fact, think about the question separately as we may want a
> different behaviour for each.

I'd argue that plain indexing ought not try to do anything especially
subtle --- in particular it ought not vary from the behavior of the
comparison operators for the type. If someone wants a table wherein you
can't enter two spellings of the same hostname, the right way would be
to construct a unique functional index using a function that reduces the
INET type into the simpler form.

A good analogy might be a text field where you don't want any two
entries to be equal on a case-insensitive basis. You don't up and
change the behavior of indexing to be case-insensitive; you say

	CREATE UNIQUE INDEX foo_f1_key ON foo (lower(f1) text_ops);

			regards, tom lane
I wrote: > A little experimentation shows that's indeed what's happening, though. > Weird. Is this a deliberate effect, and if so how did you achieve it? Oh, I see it: the network_cmp function is deliberately inconsistent with the regular comparison functions on network values. This is *very bad*. Indexes depend on both the operators and the cmp support function. You cannot have inconsistent behavior between these functions, or indexing will misbehave. Do I need to gin up an example where it fails? regards, tom lane
Hello all,

> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Vadim Mikheev
> Sent: Saturday, May 29, 1999 2:51 PM
> To: Tom Lane
> Cc: t-ishii@sra.co.jp; PostgreSQL-development
> Subject: Re: [HACKERS] Open 6.5 items
>
> Tom Lane wrote:
> >
> > Vadim Mikheev <vadim@krs.ru> writes:
> > >> If I recall the dynahash.c code correctly, a null return value
> > >> indicates either damage to the structure of the table (ie someone
> > >> stomped on memory that didn't belong to them) or running out
> > >> of memory to add entries to the table. The latter should be
> > >> impossible if we
> >
> > > Quite different cases and should result in different reactions.
> >
> > I agree; will see about cleaning up hash_search's call convention after
> > 6.5 is done. Actually, maybe I should do it now? I'm not convinced yet
> > whether the reports we're seeing are due to memory clobber or running
> > out of space... fixing this may be the easiest way to find out.
>
> Imho, we have to fix it in some way before 6.5
> Either by changing dynahash.c (to return 0x1 if table is
> corrupted and 0x0 if out of space) or by changing
> elog(NOTICE) to elog(ERROR).

Another case exists which causes a stuck-spinlock abort.

	status = WaitOnLock(lockmethod, lock, lockmode);

	/*
	 * Check the xid entry status, in case something in the ipc
	 * communication doesn't work correctly.
	 */
	if (!((result->nHolding > 0) && (result->holders[lockmode] > 0)))
	{
		XID_PRINT_AUX("LockAcquire: INCONSISTENT ", result);
		LOCK_PRINT_AUX("LockAcquire: INCONSISTENT ", lock, lockmode);
		/* Should we retry ? */
		return FALSE;

This case returns without releasing LockMgrLock and doesn't even call
elog(). As far as I can see, different entries in xidHash have the same
key when the above case occurs. Moreover, xidHash has been in an
abnormal state since the number of xidHash entries exceeded 256.
Is this bug solved by the maxBackends -> NLOCKENTS(maxBackends) change
by Vadim, or by the change to the hash code by Tom?

As for my test case, xidHash is filled with XactLockTable entries which
have been acquired by XactLockTableWait(). Could those entries be
released immediately after they are acquired?

Thanks.

Hiroshi Inoue
Inoue@tpf.co.jp
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Bruce Momjian
> Sent: Friday, May 28, 1999 1:58 PM
> To: PostgreSQL-development
> Subject: [HACKERS] Open 6.5 items
>
> SELECT * FROM test WHERE test IN (SELECT * FROM test) fails with
>   strange error
> When creating a table with either type inet or type cidr as a
>   primary/unique key, the "198.68.123.0/24" and "198.68.123.0/27"
>   are considered equal
> Fix function pointer calls to take Datum args for char and int2 args (ecgs)
> Regression test for new Numeric type
> Large Object memory problems
> refint problems
> invalidate cache on aborted transaction
> spinlock stuck problem
> benchmark performance problem
>
> Make psql \help, man pages, and sgml reflect changes in grammar
> Markup sql.sgml, Stefan's intro to SQL
> Generate Admin, User, Programmer hardcopy postscript
> Generate INSTALL and HISTORY from sgml sources.
> Update ref/lock.sgml, ref/set.sgml to reflect MVCC and locking changes.

What about mdtruncate() for multi-segment relations? AFAIK, it has not
been solved yet.

Thanks.

Hiroshi Inoue
Inoue@tpf.co.jp
Hiroshi Inoue wrote: > > As far as I see,different entries in xidHash have a same key when above > case occurs. Moreover xidHash has been in abnormal state since the > number of xidHash entries exceeded 256. > > Is this bug solved by change maxBackends->NLOCKENTS(maxBackends) > by Vadim or the change about hash by Tom ? Should be fixed now. > > As for my test case,xidHash is filled with XactLockTable entries which have > been acquired by XactLockTableWait(). > Could those entries be released immediately after they are acquired ? Ops. Thanks! Must be released. Vadim
> > Make psql \help, man pages, and sgml reflect changes in grammar
> > Markup sql.sgml, Stefan's intro to SQL
> > Generate Admin, User, Programmer hardcopy postscript
> > Generate INSTALL and HISTORY from sgml sources.
> > Update ref/lock.sgml, ref/set.sgml to reflect MVCC and locking changes.
>
> What about mdtruncate() for multi-segment relations?
> AFAIK, it has not been solved yet.

I thought we decided that file descriptors are kept by backends, and are
still accessible while new backends don't see the files. Correct?

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> -----Original Message-----
> From: Bruce Momjian [mailto:maillist@candle.pha.pa.us]
> Sent: Monday, May 31, 1999 11:15 AM
> To: Hiroshi Inoue
> Cc: PostgreSQL-development
> Subject: Re: [HACKERS] Open 6.5 items
>
> > > Make psql \help, man pages, and sgml reflect changes in grammar
> > > Markup sql.sgml, Stefan's intro to SQL
> > > Generate Admin, User, Programmer hardcopy postscript
> > > Generate INSTALL and HISTORY from sgml sources.
> > > Update ref/lock.sgml, ref/set.sgml to reflect MVCC and locking changes.
> >
> > What about mdtruncate() for multi-segment relations?
> > AFAIK, it has not been solved yet.
>
> I thought we decided that file descriptors are kept by backends, and are
> still accessible while new backends don't see the files. Correct?

Yes, but other backends could write to unlinked files, which would
vanish before long. I think it's more secure to truncate useless
segments to size 0 than to unlink the segments, even though vacuum would
then never remove useless segments.

Thanks.

Hiroshi Inoue
Inoue@tpf.co.jp
> > I thought we decided that file descriptors are kept by backends, and
> > are still accessible while new backends don't see the files. Correct?
>
> Yes, but other backends could write to unlinked files, which would
> vanish before long.
> I think it's more secure to truncate useless segments to size 0
> than to unlink the segments, even though vacuum would then never
> remove useless segments.

If you truncate, other backends will see the data gone, and will be
writing into the middle of an empty file. Better to remove.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> > > > I thought we decided that file descriptors are kept by backends,
> > > > and are still accessible while new backends don't see the files.
> > > > Correct?
> > >
> > > Yes, but other backends could write to unlinked files, which would
> > > vanish before long.
> > > I think it's more secure to truncate useless segments to size 0
> > > than to unlink the segments, even though vacuum would then never
> > > remove useless segments.
> >
> > If you truncate, other backends will see the data gone, and will be
> > writing into the middle of an empty file. Better to remove.

I couldn't explain more because of my poor English, sorry.

But my test case usually causes a backend abort. My test case is:
while 1 or more sessions frequently insert/update a table, vacuum the
table.

After vacuum, those sessions abort with the message

	ERROR: cannot open segment .. of relation ...

This ERROR finally causes a spinlock freeze, as I reported in the
posting [HACKERS] spinlock freeze ? (Re: INSERT/UPDATE waiting (another
example)).

Comments?

Thanks.

Hiroshi Inoue
Inoue@tpf.co.jp
> I couldn't explain more because of my poor English, sorry.
>
> But my test case usually causes a backend abort.
> My test case is:
> 	While 1 or more sessions frequently insert/update a table,
> 	vacuum the table.
>
> After vacuum, those sessions abort with the message
> 	ERROR: cannot open segment .. of relation ...
>
> This ERROR finally causes a spinlock freeze, as I reported in the
> posting [HACKERS] spinlock freeze ? (Re: INSERT/UPDATE waiting
> (another example)).
>
> Comments?

OK, I buy that. How will truncate fix things? Isn't that going to be
strange too? It is hard to imagine how we are going to modify these
things. I am now leaning to the truncate option, especially considering
that usually only the last segment is going to be truncated.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
>Tom Lane wrote:
>>
>> Vadim Mikheev <vadim@krs.ru> writes:
>> >> If I recall the dynahash.c code correctly, a null return value
>> >> indicates either damage to the structure of the table (ie someone
>> >> stomped on memory that didn't belong to them) or running out of memory
>> >> to add entries to the table. The latter should be impossible if we
>>
>> > Quite different cases and should result in different reactions.
>>
>> I agree; will see about cleaning up hash_search's call convention after
>> 6.5 is done. Actually, maybe I should do it now? I'm not convinced yet
>> whether the reports we're seeing are due to memory clobber or running
>> out of space... fixing this may be the easiest way to find out.
>
>Imho, we have to fix it in some way before 6.5,
>either by changing dynahash.c (to return 0x1 if table is
>corrupted and 0x0 if out of space) or by changing
>elog(NOTICE) to elog(ERROR).
>
>> > #define NLOCKS_PER_XACT 40
>> > Isn't it too low?
>>
>> You tell me ... that was the number that was in the 6.4 code, but I
>> have no idea if it's right or not. (Does MVCC require more locks
>> than the old stuff?) What is a good upper bound on the number
>> of concurrently existing locks?
>
>Probably yes, because writers can continue to work and lock
>other tables instead of sleeping on the first lock due to a concurrent
>select. I'll change it to 64, but this should be a configurable thing.
>
>> > /* xidHash table */
>> > size += hash_estimate_size(maxBackends,
>> >                            SHMEM_XIDTAB_KEYSIZE,
>> >                            SHMEM_XIDTAB_DATASIZE);
>>
>> > Why just maxBackends is here? NLOCKENTS should be used too
>> > (each transaction lock requires its own xidHash entry).
>>
>> Should it be NLOCKENTS(maxBackends) xid entries, or do you mean
>> NLOCKENTS(maxBackends) + maxBackends? Feel free to stick in any
>> estimates that you like better --- what's there now is an interpretation
>> of what the 6.4 code was trying to do (but it was sufficiently buggy and
>> unreadable that it was probably coming out with different numbers in
>> the end...)
>
>Just NLOCKENTS(maxBackends) - I'll change it now.

I have just done a cvs update and saw your changes. I tried the same
testing as I did before (64 concurrent connections, and each connection
executes 100 transactions), but it failed again.

(1) without -B 1024, it failed: out of free buffers: time to abort!

(2) with -B 1024, it went into a stuck spinlock

So I looked into the sources a little bit, and made a minor change to
include/storage/lock.h:

	#define INIT_TABLE_SIZE		100

to:

	#define INIT_TABLE_SIZE		4096

then restarted postmaster with -B 1024 (this will prevent the
out-of-free-buffers problem, I guess). Now everything seems to work
great!

I suspect that the huge INIT_TABLE_SIZE prevented dynamic expansion of
the hash tables, and it seems there's something wrong in the routines
responsible for that. Comments?
--
Tatsuo Ishii
Tatsuo Ishii wrote:
>
> I have just done a cvs update and saw your changes. I tried the same
> testing as I did before (64 concurrent connections, and each
> connection executes 100 transactions), but it failed again.
>
> (1) without -B 1024, it failed: out of free buffers: time to abort!
>
> (2) with -B 1024, it went into a stuck spinlock
>
> So I looked into the sources a little bit, and made a minor change to
> include/storage/lock.h:
>
> 	#define INIT_TABLE_SIZE		100
>
> to:
>
> 	#define INIT_TABLE_SIZE		4096
>
> then restarted postmaster with -B 1024 (this will prevent the
> out-of-free-buffers problem, I guess). Now everything seems to work
> great!
>
> I suspect that the huge INIT_TABLE_SIZE prevented dynamic expansion of
> the hash tables, and it seems there's something wrong in the routines
> responsible for that.

Seems like that -:(

If we don't fix the hash expand code before the 6.5 release, then I
would recommend not using INIT_TABLE_SIZE in

	lockMethodTable->lockHash = (HTAB *) ShmemInitHash(shmemName,
							INIT_TABLE_SIZE, MAX_TABLE_SIZE,
							&info, hash_flags);

and

	lockMethodTable->xidHash = (HTAB *) ShmemInitHash(shmemName,
							INIT_TABLE_SIZE, MAX_TABLE_SIZE,
							&info, hash_flags);

but using NLOCKENTS(maxBackends) instead.

Vadim
Vadim Mikheev <vadim@krs.ru> writes: >> As for my test case,xidHash is filled with XactLockTable entries which have >> been acquired by XactLockTableWait(). >> Could those entries be released immediately after they are acquired ? > Ops. Thanks! Must be released. Does this account for the "ShmemAlloc: out of memory" errors we've been seeing? I spent a good deal of time yesterday grovelling through all the calls to ShmemAlloc, and concluded that (unless there is a memory stomp somewhere) it has to be caused by one of the shared hashtables growing well beyond its size estimate. I did find that the PROC structures are not counted in the initial sizing of the shared memory block. This is no problem at the default limit of 32 backends, but could get to be an issue for hundreds of backends. I will add that item to the size estimate today. regards, tom lane
Vadim Mikheev <vadim@krs.ru> writes: >> I suspect that huge INIT_TABLE_SIZE prevented dynamic expanding the >> hash tables and seems there's something wrong in the routines >> responsible for that. > Seems like that -:( OK, as the last guy to touch dynahash.c I suppose this is my bailiwick... I will look into it today. > If we'll not fix expand hash code before 6.5 release then > I would recommend to don't use INIT_TABLE_SIZE in If we can't figure out what's really wrong then that might be a good kluge solution to get 6.5 out the door. But I'd rather try to fix it right first. Can anyone send me a test case script that reproduces the problem? regards, tom lane
I wrote: >>> I suspect that huge INIT_TABLE_SIZE prevented dynamic expanding the >>> hash tables and seems there's something wrong in the routines >>> responsible for that. > OK, as the last guy to touch dynahash.c I suppose this is my > bailiwick... I will look into it today. It's amazing how much easier it is to see a bug when you know it must be there ;-). I discovered that the hashtable expansion routine would mess up in the case that all the records in the bucket being split ended up in the new bucket rather than the old. In that case it forgot to clear the old bucket's chain header, with the result that all the records appeared to be in both buckets at once. This would not be a big problem until and unless the first record in the chain got deleted --- it would only be correctly removed from the new bucket, leaving the old bucket's chain header pointing at a now-free record (and failing to link to any records that had been added to the shared chain on its behalf in the meanwhile). Disaster ensues. An actual failure via this path seems somewhat improbable, but that may just explain why we hadn't seen it happen very much before... I have committed a fix in dynahash.c. Hiroshi and Tatsuo, would you please grab latest sources and see whether the problems you are observing are fixed? regards, tom lane
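[Editor's note: the failure mode Tom describes can be reproduced with a toy
split routine. This is an assumed simplification of the dynahash expansion
logic, not the committed fix itself; the key point is that the old bucket's
chain header must be reset even when every record migrates to the new
bucket.]

```c
#include <stddef.h>
#include <stdint.h>

typedef struct Rec
{
	uint32_t	hash;
	struct Rec *next;
} Rec;

/*
 * Toy bucket split: records whose hash has `highbit` set move to the
 * new bucket, the rest stay. Clearing *oldb unconditionally up front
 * is the step whose omission caused records to appear in both buckets
 * at once when they all migrated.
 */
static void
split_bucket(Rec **oldb, Rec **newb, uint32_t highbit)
{
	Rec		   *rec = *oldb;

	*oldb = NULL;				/* the forgotten step: reset the old
								 * chain header unconditionally */
	*newb = NULL;
	while (rec)
	{
		Rec		   *next = rec->next;
		Rec		  **dest = (rec->hash & highbit) ? newb : oldb;

		rec->next = *dest;
		*dest = rec;
		rec = next;
	}
}
```

With the reset in place, a bucket whose records all migrate ends up
correctly empty instead of still pointing at records now owned by the new
bucket.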
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> I have just done a cvs update and saw your changes. I tried the same
> testing as I did before (64 concurrent connections, and each
> connection executes 100 transactions), but it failed again.
>
> (1) without -B 1024, it failed: out of free buffers: time to abort!

Right now, the postmaster will let you set any combination of -B and -N
you please. But it seems obvious that there is some minimum number of
buffers per backend below which things aren't going to work very well.
I wonder whether the postmaster startup code ought to enforce a minimum
ratio, say -B at least twice -N ? I have no idea what an appropriate
limit would be, however. Vadim, do you have any thoughts?

			regards, tom lane
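[Editor's note: a minimal sketch of the startup check Tom is floating. The
2x ratio is his tentative number, and the function name is invented.]

```c
/*
 * Hypothetical postmaster startup sanity check: require the buffer
 * pool (-B) to be at least twice the backend limit (-N), per the
 * tentative ratio suggested above.
 */
static int
buffers_ok(int nbuffers, int max_backends)
{
	return nbuffers >= 2 * max_backends;
}
```

If such a check existed, a postmaster started with -N 64 would refuse to
run with fewer than 128 buffers instead of failing later with "out of free
buffers".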
I said: > I did find that the PROC structures are not counted in the initial > sizing of the shared memory block. Er ... yes they are ... never mind ... At this point I am not going to make any further changes to the shared-mem code unless Hiroshi and Tatsuo report that there are still problems. I would still like to alter the calling conventions for hash_search and hash_seq, which are ugly and also dependent on static state variables. But as far as I can tell right now, those are just code beautification items rather than repairs for currently-existing bugs. So it seems best not to risk breaking anything with so little testing time left in the 6.5 cycle. I will put them on my to-do-for-6.6 list instead. regards, tom lane
Tom Lane wrote: > > Vadim Mikheev <vadim@krs.ru> writes: > >> As for my test case,xidHash is filled with XactLockTable entries which have > >> been acquired by XactLockTableWait(). > >> Could those entries be released immediately after they are acquired ? > > > Ops. Thanks! Must be released. > > Does this account for the "ShmemAlloc: out of memory" errors we've been > seeing? I spent a good deal of time yesterday grovelling through all Yes. Vadim
Tom Lane wrote:
>
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > I have just done a cvs update and saw your changes. I tried the same
> > testing as I did before (64 concurrent connections, and each
> > connection executes 100 transactions), but it failed again.
> >
> > (1) without -B 1024, it failed: out of free buffers: time to abort!
>
> Right now, the postmaster will let you set any combination of -B and -N
> you please. But it seems obvious that there is some minimum number of
> buffers per backend below which things aren't going to work very well.
> I wonder whether the postmaster startup code ought to enforce a minimum
> ratio, say -B at least twice -N ? I have no idea what an appropriate
>                ^^^^^^^^^^^^^^^^^

It's enough for a select from a single table using an index, so it's
probably a good ratio.

> limit would be, however. Vadim, do you have any thoughts?

Vadim
Tom,

> It's amazing how much easier it is to see a bug when you know it must be
> there ;-).
>
> I discovered that the hashtable expansion routine would mess up in the
> case that all the records in the bucket being split ended up in the new
> bucket rather than the old. In that case it forgot to clear the old
> bucket's chain header, with the result that all the records appeared to
> be in both buckets at once. This would not be a big problem until and
> unless the first record in the chain got deleted --- it would only be
> correctly removed from the new bucket, leaving the old bucket's chain
> header pointing at a now-free record (and failing to link to any records
> that had been added to the shared chain on its behalf in the meanwhile).
> Disaster ensues.
>
> An actual failure via this path seems somewhat improbable, but that
> may just explain why we hadn't seen it happen very much before...
>
> I have committed a fix in dynahash.c. Hiroshi and Tatsuo, would you
> please grab latest sources and see whether the problems you are
> observing are fixed?

Bingo! Your fix seems to solve the problem! Now 64 concurrent
connections ran 100 transactions each without any problem. Thanks.

BTW, the script I'm using for the heavy load testing is written in Java
(not written by me). Do you want to try it?

> > (1) without -B 1024, it failed: out of free buffers: time to abort!
>
> Right now, the postmaster will let you set any combination of -B and -N
> you please. But it seems obvious that there is some minimum number of
> buffers per backend below which things aren't going to work very well.
> I wonder whether the postmaster startup code ought to enforce a minimum
> ratio, say -B at least twice -N ? I have no idea what an appropriate
> limit would be, however. Vadim, do you have any thoughts?

Just a few questions.

o I observed the backend processes grew ~10MB with -B 1024. Is this
  normal?

o Is it possible to let the backend wait for free buffers in case of
  insufficient shared buffers? (with reasonable retries, of course)
--
Tatsuo Ishii
Tatsuo Ishii wrote:
>
> Just a few questions.
>
> o I observed the backend processes grew ~10MB with -B 1024. Is this
>   normal?

The backend attaches to 1024*8K + other shmem, so ps probably takes it
into account.

> o Is it possible to let the backend wait for free buffers in case of
>   insufficient shared buffers? (with reasonable retries, of course)

Yes, but not now.

Vadim
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Tuesday, June 01, 1999 2:14 AM
> To: Hiroshi Inoue; t-ishii@sra.co.jp
> Cc: PostgreSQL-development
> Subject: Re: [HACKERS] Open 6.5 items
>
> I wrote:
> >>> I suspect that huge INIT_TABLE_SIZE prevented dynamic expanding the
> >>> hash tables and seems there's something wrong in the routines
> >>> responsible for that.
>
> > OK, as the last guy to touch dynahash.c I suppose this is my
> > bailiwick... I will look into it today.
>
> It's amazing how much easier it is to see a bug when you know it must be
> there ;-).
> [snip]
>
> I have committed a fix in dynahash.c. Hiroshi and Tatsuo, would you
> please grab the latest sources and see whether the problems you are
> observing are fixed?

It works fine. The number of xidHash entries exceeded 600 but the
spinlock error didn't occur.

However, when I did vacuum while testing I got the following error
message:

    ERROR: Child itemid marked as unused

TransactionId-s of tuples in an update chain may be out of order.

Thanks.

Hiroshi Inoue
Inoue@tpf.co.jp
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Bruce Momjian
> Sent: Monday, May 31, 1999 3:15 PM
> To: Hiroshi Inoue
> Cc: PostgreSQL-development
> Subject: Re: [HACKERS] Open 6.5 items
>
> > I couldn't explain more because of my poor English, sorry.
> >
> > But my test case usually causes backend abort.
> > My test case is
> >   While 1 or more sessions frequently insert/update a table,
> >   vacuum the table.
> >
> > After vacuum, those sessions abort with the message
> >   ERROR: cannot open segment .. of relation ...
> >
> > This ERROR finally causes spinlock freeze, as I reported in the
> > posting [HACKERS] spinlock freeze ?(Re: INSERT/UPDATE waiting
> > (another example)).
> >
> > Comments ?
>
> OK, I buy that. How will truncate fix things? Isn't that going to be
> strange too? Hard to imagine how we are going to modify these things.
> I am now leaning toward the truncate option, especially considering
> that usually only the last segment is going to be truncated.

I made a patch on trial.

1) Useless segments are never removed by my implementation, because I
   call FileTruncate() instead of File(Name)Unlink().
2) The mdfd_lstbcnt member of MdfdVec was abused in mdnblocks(). I am
   maintaining the value of mdfd_lstbcnt unwillingly. Is it preferable
   to get rid of mdfd_lstbcnt completely?

I'm not sure that this patch has no problems. Please check and comment
on my patch. I don't have > 1G of disk space. Could someone run it in
a > 1G environment?

As Ole Gjerde mentioned, the current implementation by his old patch is
not right. His new patch seems right if vacuum is executed alone.
Please run vacuum while other concurrent sessions are reading or
writing, if you would see the difference.

Thanks.

Hiroshi Inoue
Inoue@tpf.co.jp

*** storage/smgr/md.c.orig	Wed May 26 16:05:02 1999
--- storage/smgr/md.c	Wed Jun 2 15:35:35 1999
***************
*** 674,684 ****
  	segno = 0;
  	for (;;)
  	{
! 		if (v->mdfd_lstbcnt == RELSEG_SIZE
! 			|| (nblocks = _mdnblocks(v->mdfd_vfd, BLCKSZ)) == RELSEG_SIZE)
  		{
- 
- 			v->mdfd_lstbcnt = RELSEG_SIZE;
  			segno++;
  
  			if (v->mdfd_chain == (MdfdVec *) NULL)
--- 674,685 ----
  	segno = 0;
  	for (;;)
  	{
! 		nblocks = _mdnblocks(v->mdfd_vfd, BLCKSZ);
! 		if (nblocks > RELSEG_SIZE)
! 			elog(FATAL, "segment too big in mdnblocks!");
! 		v->mdfd_lstbcnt = nblocks;
! 		if (nblocks == RELSEG_SIZE)
  		{
  			segno++;
  
  			if (v->mdfd_chain == (MdfdVec *) NULL)
***************
*** 714,745 ****
  	int			curnblk,
  				i,
  				oldsegno,
! 				newsegno;
! 	char		fname[NAMEDATALEN];
! 	char		tname[NAMEDATALEN + 10];
  
  	curnblk = mdnblocks(reln);
  	oldsegno = curnblk / RELSEG_SIZE;
  	newsegno = nblocks / RELSEG_SIZE;
  
- 	StrNCpy(fname, RelationGetRelationName(reln)->data, NAMEDATALEN);
- 
- 	if (newsegno < oldsegno)
- 	{
- 		for (i = (newsegno + 1);; i++)
- 		{
- 			sprintf(tname, "%s.%d", fname, i);
- 			if (FileNameUnlink(tname) < 0)
- 				break;
- 		}
- 	}
  #endif
  
  	fd = RelationGetFile(reln);
  	v = &Md_fdvec[fd];
  
  	if (FileTruncate(v->mdfd_vfd, nblocks * BLCKSZ) < 0)
  		return -1;
  
  	return nblocks;
--- 715,766 ----
  	int			curnblk,
  				i,
  				oldsegno,
! 				newsegno,
! 				lastsegblocks;
! 	MdfdVec	  **varray;
  
  	curnblk = mdnblocks(reln);
+ 	if (nblocks > curnblk)
+ 		return -1;
  	oldsegno = curnblk / RELSEG_SIZE;
  	newsegno = nblocks / RELSEG_SIZE;
  
  #endif
  
  	fd = RelationGetFile(reln);
  	v = &Md_fdvec[fd];
  
+ #ifndef LET_OS_MANAGE_FILESIZE
+ 	varray = (MdfdVec **) palloc((oldsegno + 1) * sizeof(MdfdVec *));
+ 	for (i = 0; i <= oldsegno; i++)
+ 	{
+ 		if (!v)
+ 			elog(ERROR, "segment isn't open in mdtruncate!");
+ 		varray[i] = v;
+ 		v = v->mdfd_chain;
+ 	}
+ 	for (i = oldsegno; i > newsegno; i--)
+ 	{
+ 		v = varray[i];
+ 		if (FileTruncate(v->mdfd_vfd, 0) < 0)
+ 		{
+ 			pfree(varray);
+ 			return -1;
+ 		}
+ 		v->mdfd_lstbcnt = 0;
+ 	}
+ 	/* Calculate the # of blocks in the last segment */
+ 	lastsegblocks = nblocks - (newsegno * RELSEG_SIZE);
+ 	v = varray[newsegno];
+ 	pfree(varray);
+ 	if (FileTruncate(v->mdfd_vfd, lastsegblocks * BLCKSZ) < 0)
+ 		return -1;
+ 	v->mdfd_lstbcnt = lastsegblocks;
+ #else
  	if (FileTruncate(v->mdfd_vfd, nblocks * BLCKSZ) < 0)
  		return -1;
+ 	v->mdfd_lstbcnt = nblocks;
+ #endif
  
  	return nblocks;
Hiroshi Inoue wrote:
>
> However, when I did vacuum while testing I got the following error
> message:
>     ERROR: Child itemid marked as unused
>
> TransactionId-s of tuples in an update chain may be out of order.

I see... Need 1-2 days to fix this -:(

Vadim
Hiroshi Inoue wrote:
>
> However, when I did vacuum while testing I got the following error
> message:
>     ERROR: Child itemid marked as unused
>
> TransactionId-s of tuples in an update chain may be out of order.

1. Fix and explanation in xact.c:CommitTransaction():

	RecordTransactionCommit();

	/*
	 * Let others know about no transaction in progress by me.
	 * Note that this must be done _before_ releasing locks we hold
	 * and SpinAcquire(ShmemIndexLock) is required - or bad (too high)
	 * XmaxRecent value might be used by vacuum: UPDATE with xid 0 is
	 * blocked by xid 1' UPDATE, xid 1 is doing commit while xid 2
	 * gets snapshot - if xid 2' GetSnapshotData sees xid 1 as running
	 * then it must see xid 0 as running as well or XmaxRecent = 1
	 * might be used by concurrent vacuum causing
	 * ERROR: Child itemid marked as unused
	 * This bug was reported by Hiroshi Inoue and I was able to reproduce
	 * it with 3 sessions and gdb. - vadim 06/03/99
	 */
	if (MyProc != (PROC *) NULL)
	{
		SpinAcquire(ShmemIndexLock);
		MyProc->xid = InvalidTransactionId;
		MyProc->xmin = InvalidTransactionId;
		SpinRelease(ShmemIndexLock);
	}

2. It was possible to get two versions of the same row from a select.
Fixed by moving the MyProc->xid assignment from StartTransaction()
inside GetNewTransactionId().

Thanks, Hiroshi! And please run your tests - I used just 3 sessions
and gdb.

Vadim
Vadim Mikheev <vadim@krs.ru> writes:
>> ERROR: Child itemid marked as unused
> [ is fixed ]

Great! Vadim (also Hiroshi and Tatsuo), how many bugs remain on your
must-fix-for-6.5 lists? I was just wondering over in the "Freezing
docs" thread whether we had any problems severe enough to justify
delaying the release. It sounds like at least one such problem is
gone...

			regards, tom lane
Tom Lane wrote:
>
> Vadim Mikheev <vadim@krs.ru> writes:
> >> ERROR: Child itemid marked as unused
> > [ is fixed ]
>
> Great! Vadim (also Hiroshi and Tatsuo), how many bugs remain on your
> must-fix-for-6.5 lists? I was just wondering over in the "Freezing
> docs" thread whether we had any problems severe enough to justify
> delaying the release. It sounds like at least one such problem is
> gone...

None in mine. There are still some bad things, but they are old:

1. elog(NOTICE) in the lock manager when locking did not succeed.
   Hopefully our recent changes will reduce the possibility of this.

Hiroshi wrote:

2. > spinlock io_in_progress_lock of a buffer page is not
   > released by operations called by elog() such as
   > ProcReleaseSpins(), ResetBufferPool() etc.

   I tried to fix this before 6.4 but without success (don't remember
   why).

3. > It seems elog(FATAL) doesn't release allocated buffer pages.
   > It's OK ?
   > AFAIC elog(FATAL) causes proc_exit(0) and proc_exit() doesn't
   > call ResetBufferPool().

   Seems to me that elog(FATAL) should call siglongjmp(Warn_restart, 1),
   like elog(ERROR), but force exit in the tcop main loop after
   AbortCurrentTransaction(). AbortTransaction() does pretty nice
   things like RelationPurgeLocalRelation(false) and
   DestroyNoNameRels()...

Vadim
Vadim Mikheev <vadim@krs.ru> writes:
>> It seems elog(FATAL) doesn't release allocated buffer pages.
>> It's OK ?
>> AFAIC elog(FATAL) causes proc_exit(0) and proc_exit() doesn't
>> call ResetBufferPool().

> Seems to me that elog(FATAL) should call siglongjmp(Warn_restart, 1),
> like elog(ERROR), but force exit in the tcop main loop after
> AbortCurrentTransaction(). AbortTransaction() does pretty nice
> things like RelationPurgeLocalRelation(false) and DestroyNoNameRels()...

Seems reasonable to me. It seems to me that elog(FATAL) means "this
backend is too messed up to continue, but I think the rest of the
backends can keep going." So we need to clean up our allocated
resources before quitting. abort() is for the cases where we think
shared memory may be corrupted and everyone must bail out. We might
need to revisit the uses of each routine and make sure that different
error conditions are properly classified.

Of course, if things *are* messed up then trying to AbortTransaction
might make it worse. How bad is that risk?

			regards, tom lane
Tom Lane wrote:
>
> Vadim Mikheev <vadim@krs.ru> writes:
> >> It seems elog(FATAL) doesn't release allocated buffer pages.
> >> It's OK ?
> [snip]
>
> Of course, if things *are* messed up then trying to AbortTransaction
> might make it worse. How bad is that risk?

Don't know. I think that we have no time to fix this in 6.5, -:(

Vadim
> -----Original Message-----
> From: root@sunpine.krs.ru [mailto:root@sunpine.krs.ru]On Behalf Of Vadim
> Mikheev
> Sent: Thursday, June 03, 1999 10:45 PM
> To: Hiroshi Inoue
> Cc: Tom Lane; t-ishii@sra.co.jp; PostgreSQL-development
> Subject: Re: [HACKERS] Open 6.5 items
>
> Hiroshi Inoue wrote:
> >
> > However, when I did vacuum while testing I got the following error
> > message:
> >     ERROR: Child itemid marked as unused
> >
> > TransactionId-s of tuples in an update chain may be out of order.
>
> 1. Fix and explanation in xact.c:CommitTransaction():
> [snip]
>
> 2. It was possible to get two versions of the same row from a select.
> Fixed by moving the MyProc->xid assignment from StartTransaction()
> inside GetNewTransactionId().
>
> Thanks, Hiroshi! And please run your tests - I used just 3 sessions
> and gdb.

Unfortunately, the error still occurs (I changed xact.c by hand as
above, OK?).

It seems there are cases where tuples are updated by older transactions
than their xmin-s, and only some tuples in the middle of an update
chain may be deleted.

I have no idea how to fix this now. It's OK for me to leave this
unsolved, because those cases would rarely occur.

Thanks.

Hiroshi Inoue
Inoue@tpf.co.jp
> -----Original Message-----
> From: root@sunpine.krs.ru [mailto:root@sunpine.krs.ru]On Behalf Of Vadim
> Mikheev
> Sent: Friday, June 04, 1999 2:24 AM
> To: Tom Lane
> Cc: Hiroshi Inoue; t-ishii@sra.co.jp; PostgreSQL-development
> Subject: Re: [HACKERS] Open 6.5 items
>
> Tom Lane wrote:
> >
> > Great! Vadim (also Hiroshi and Tatsuo), how many bugs remain on your
> > must-fix-for-6.5 lists? [snip]
>
> No one in mine.
> [snip]
>
> Hiroshi wrote:
> > 2. > spinlock io_in_progress_lock of a buffer page is not
> > > released by operations called by elog() such as
> > > ProcReleaseSpins(), ResetBufferPool() etc.
>
> I tried to fix this before 6.4 but without success (don't
> remember why).

This is not on my must-fix-for-6.5 list. For the present this is
caused by other bugs, not by bufmgr/smgr itself, and an easy fix may
introduce other bugs.

And on segmented relations: Ole Gjerde, who provided the patch for the
current implementation of mdtruncate(), says

    "First, please reverse my patch to mdtruncate() in md.c as soon as
    possible. It does not work properly in some cases."

I also recommend reversing his patch to mdtruncate(). Though we could
not shrink segmented relations with the old implementation, the result
of vacuum would never be inconsistent(?).

I think we don't have enough time to fix this.

Thanks.

Hiroshi Inoue
Inoue@tpf.co.jp
Hiroshi Inoue wrote:
>
> > 2. It was possible to get two versions of the same row from a select.
> > Fixed by moving the MyProc->xid assignment from StartTransaction()
> > inside GetNewTransactionId().
> >
> > Thanks, Hiroshi! And please run your tests - I used just 3 sessions
> > and gdb.
>
> Unfortunately, the error still occurs (I changed xact.c by hand as
> above, OK?).

Did you add the 2. changes too? Also, I made some changes in
shmem.c:GetSnapshotData(), but it seems they are not relevant. Could
you post me your xact.c and varsup.c, or grab the current sources?

> It seems there are cases where tuples are updated by older
> transactions than their xmin-s, and only some tuples in the middle
> of an update chain may be deleted.
>
> I have no idea how to fix this now.
> It's OK for me to leave this unsolved, because those cases would
> rarely occur.

I would like to see it fixed anyway.

Vadim
> -----Original Message-----
> From: root@sunpine.krs.ru [mailto:root@sunpine.krs.ru]On Behalf Of Vadim
> Mikheev
> Sent: Friday, June 04, 1999 5:47 PM
> To: Hiroshi Inoue
> Cc: Tom Lane; t-ishii@sra.co.jp; PostgreSQL-development
> Subject: Re: [HACKERS] Open 6.5 items
>
> [snip]
>
> Did you add the 2. changes too?
> Also, I made some changes in shmem.c:GetSnapshotData(), but it seems
> they are not relevant. Could you post me your xact.c and varsup.c,
> or grab the current sources?

OK, I will grab the current sources and retry.

Thanks.

Hiroshi Inoue
Inoue@tpf.co.jp
> > Hiroshi wrote:
> > > 2. > spinlock io_in_progress_lock of a buffer page is not
> > > > released by operations called by elog() such as
> > > > ProcReleaseSpins(), ResetBufferPool() etc.
> >
> > I tried to fix this before 6.4 but without success (don't
> > remember why).
>
> This is not on my must-fix-for-6.5 list. For the present this is
> caused by other bugs, not by bufmgr/smgr itself, and an easy fix may
> introduce other bugs.
>
> And on segmented relations: Ole Gjerde, who provided the patch for
> the current implementation of mdtruncate(), says
>     "First, please reverse my patch to mdtruncate() in md.c as soon as
>     possible. It does not work properly in some cases."
>
> I also recommend reversing his patch to mdtruncate().
>
> Though we could not shrink segmented relations with the old
> implementation, the result of vacuum would never be inconsistent(?).
>
> I think we don't have enough time to fix this.

So what do we put in its place when we reverse out the patch?

--
Bruce Momjian                   | http://www.op.net/~candle
maillist@candle.pha.pa.us       | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup.    | Drexel Hill, Pennsylvania 19026
Vadim Mikheev <vadim@krs.ru> writes:
> Tom Lane wrote:
>> Right now, the postmaster will let you set any combination of -B and -N
>> you please. But it seems obvious that there is some minimum number of
>> buffers per backend below which things aren't going to work very well.
>> I wonder whether the postmaster startup code ought to enforce a minimum
>> ratio, say -B at least twice -N ? I have no idea what an appropriate
>                  ^^^^^^^^^^^^^^
> It's enough for a select from a single table using an index, so it's
> probably a good ratio.

I've added a check in postmaster.c to require -B to be at least twice
-N, per this discussion. It also enforces a minimum -B of 16 no matter
how small -N is. I pulled that number out of the air --- anyone have
an idea whether some other number would be better?

The current default values of -B 64, -N 32 are not affected. However,
since the -N default is easily configurable, I wonder whether we
should move the default -B value into config.h as well...

			regards, tom lane
> -----Original Message-----
> From: Bruce Momjian [mailto:maillist@candle.pha.pa.us]
> Sent: Friday, June 04, 1999 11:37 PM
> To: Hiroshi Inoue
> Cc: Vadim Mikheev; Tom Lane; t-ishii@sra.co.jp; PostgreSQL-development
> Subject: Re: [HACKERS] Open 6.5 items
>
> > And on segmented relations: Ole Gjerde, who provided the patch for
> > the current implementation of mdtruncate(), says
> >     "First, please reverse my patch to mdtruncate() in md.c as soon
> >     as possible. It does not work properly in some cases."
> >
> > I also recommend reversing his patch to mdtruncate().
> >
> > Though we could not shrink segmented relations with the old
> > implementation, the result of vacuum would never be inconsistent(?).
> >
> > I think we don't have enough time to fix this.
>
> So what do we put in its place when we reverse out the patch?

Future TODO items?

As far as I can see, there's no consensus on whether we should remove
useless segments (I also think that's preferable, if possible) or only
truncate the segments (as my trial patch does).

Only Bruce and Ole objected to my opinion, and no one agreed with me.
What do other people who would use segmented relations think?

Thanks.

Hiroshi Inoue
Inoue@tpf.co.jp
> Future TODO items?
>
> As far as I can see, there's no consensus on whether we should remove
> useless segments (I also think that's preferable, if possible) or only
> truncate the segments (as my trial patch does).
>
> Only Bruce and Ole objected to my opinion, and no one agreed with me.
> What do other people who would use segmented relations think?

I liked unlinking because it allowed old backends to still see the
segments if they still have open file descriptors, while new backends
can see there is no file there. That seemed nice, but you clearly
demonstrated it caused major problems. Maybe truncation is the answer.
I don't know, but we need to resolve this for 6.5. I can't imagine us
focusing on this like we have in the past few weeks. Let's just figure
out an answer. I am on IRC now if someone can get on to discuss this.
I will even phone someone in the US or Canada to discuss it.

What is it in the backend that causes a backend to think there is
another segment? Does it just go off the end of the maximum segment
size and try to open another, or do we store the number of segments
somewhere? I thought it was the former, in the smgr area. I honestly
don't care if the segment files stay around if that is going to be a
reliable solution.

--
Bruce Momjian
On Sun, 6 Jun 1999, Bruce Momjian wrote:
> I liked unlinking because it allowed old backends to still see the
> segments if they still have open file descriptors, while new backends
> can see there is no file there. That seemed nice, but you clearly
> demonstrated it caused major problems. Maybe truncation is the answer.
> [snip]
>
> What is it in the backend that causes a backend to think there is
> another segment? [snip] I honestly don't care if the segment files
> stay around if that is going to be a reliable solution.

Other than the inode being used, what is wrong with a zero-length
segment file?

Marc G. Fournier              ICQ#7615664           IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org      secondary: scrappy@{freebsd|postgresql}.org
> > What is it in the backend that causes a backend to think there is
> > another segment? [snip]
>
> Other than the inode being used, what is wrong with a zero-length
> segment file?

Nothing is wrong with it. I just thought it would be more reliable to
unlink it, but now I am considering that I was wrong.

--
Bruce Momjian
On Mon, 7 Jun 1999, Bruce Momjian wrote:
> > Other than the inode being used, what is wrong with a zero-length
> > segment file?
>
> Nothing is wrong with it. I just thought it would be more reliable to
> unlink it, but now I am considering that I was wrong.

Just a thought, but if you left it zero length, the dba could use it as
a means of estimating disk space requirements? :)  If buff.0 and buff.1
are zero length but buff.2 isn't, we know that we've filled 2x1gig
buffers plus a little bit, so we can allocate space accordingly? :)

I'm groping here, help me out ... :)

Marc G. Fournier
Systems Administrator @ hub.org
> > Nothing is wrong with it. I just thought it would be more reliable to
> > unlink it, but now I am considering that I was wrong.
>
> Just a thought, but if you left it zero length, the dba could use it as
> a means of estimating disk space requirements? :)
>
> I'm groping here, help me out ... :)

One reason to do the truncate is that if the segment is a symbolic
link to another drive, that link will stay, while unlink will remove
it and the segment will be recreated on the same drive.

--
Bruce Momjian
On Wed, 2 Jun 1999, Hiroshi Inoue wrote:
> I made a patch on trial.
> 1) Useless segments are never removed by my implementation
>    because I call FileTruncate() instead of File(Name)Unlink().
>
> I'm not sure that this patch has no problems.
> Please check and comment on my patch.

I have tried it, and it seems to work.

> As Ole Gjerde mentioned, the current implementation by his old
> patch is not right. His new patch seems right if vacuum is
> executed alone.

Yes, my first patch was horribly wrong :) The second one, as you
mention, only works right if no reading or writing is going on. I'll
talk more about the new patch in later emails.

Ole Gjerde
On Sun, 6 Jun 1999, Bruce Momjian wrote:
> I liked unlinking because it allowed old backends to still see the
> segments if they still have open file descriptors, while new backends
> can see there is no file there. That seemed nice, but you clearly
> demonstrated it caused major problems. Maybe truncation is the answer.
> [snip]
>
> What is it in the backend that causes a backend to think there is
> another segment? [snip] I honestly don't care if the segment files
> stay around if that is going to be a reliable solution.

Personally, I think the right thing is to unlink the unused segments.
For the most part keeping them around is not going to cause any
problems, but I can't really think of any good reasons to keep them
around. Keeping the database directories clean is a good thing in my
opinion.

The new patch from Hiroshi Inoue <Inoue@tpf.co.jp> works. I believe it
is a reliable solution; I just don't agree that it's the right one.
That is probably just a matter of opinion, however. As his patch
doesn't have any immediate problems, I vote for it to be included in
6.5.

Thanks,
Ole Gjerde
Folks, do we have anything to revisit here?

> Tatsuo Ishii wrote:
> >
> > I have just done a cvs update and saw your changes. I tried the same
> > testing as I did before (64 concurrent connections, each connection
> > executing 100 transactions), but it failed again.
> >
> > (1) without -B 1024, it failed: out of free buffers: time to abort!
> >
> > (2) with -B 1024, it went into a stuck spinlock
> >
> > So I looked into the sources a little bit, and made a minor change to
> > include/storage/lock.h:
> >
> >     #define INIT_TABLE_SIZE 100
> >
> > to:
> >
> >     #define INIT_TABLE_SIZE 4096
> >
> > then restarted the postmaster with -B 1024 (this will prevent the
> > out-of-free-buffers problem, I guess). Now everything seems to work
> > great!
> >
> > I suspect that the huge INIT_TABLE_SIZE prevented dynamic expansion
> > of the hash tables, and it seems there's something wrong in the
> > routines responsible for that.
>
> Seems like that -:(
>
> If we don't fix the expand-hash code before the 6.5 release then
> I would recommend not using INIT_TABLE_SIZE in
>
>     lockMethodTable->lockHash = (HTAB *) ShmemInitHash(shmemName,
>         INIT_TABLE_SIZE, MAX_TABLE_SIZE,
>         &info, hash_flags);
>
> and
>
>     lockMethodTable->xidHash = (HTAB *) ShmemInitHash(shmemName,
>         INIT_TABLE_SIZE, MAX_TABLE_SIZE,
>         &info, hash_flags);
>
> but using NLOCKENTS(maxBackends) instead.
>
> Vadim

--
Bruce Momjian
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> Folks, do we have anything to revisit here?

I believe it is fixed --- I went back and found some more bugs in
dynahash.c after seeing Tatsuo's report ;-)

			regards, tom lane

>> Tatsuo Ishii wrote:
>>>> I suspect that the huge INIT_TABLE_SIZE prevented dynamic expansion
>>>> of the hash tables, and it seems there's something wrong in the
>>>> routines responsible for that.
>> [snip]
>> Vadim