Thread: doc/FAQ* files are hosed
Comparing the current-CVS FAQ files with those in the REL6_4 branch is depressing --- almost every one is messed up in one branch or the other.

The main FAQ is newer in the 6.4 branch than in the current branch; the 6.4 version seems to have numerous small fixes that are not in the main branch. FAQ_CVS is also newer in 6.4; it's been de-HTMLized in 6.4 (good) but not in current. FAQ_HPUX, FAQ_Irix, and FAQ_Linux are completely trashed in the 6.4 branch --- all three have been overwritten with some generic document.

I'd check in some fixes, but a look at the CVS logs makes me think that part of the problem is too many cooks stirring the broth already. Whoever is the primary maintainer of these files probably should take responsibility for getting the branches back in sync.

			regards, tom lane
> Comparing the current-CVS FAQ files with those in the REL6_4 branch
> is depressing --- almost every one is messed up in one branch or the
> other.
>
> The main FAQ is newer in the 6.4 branch than in the current branch; the
> 6.4 version seems to have numerous small fixes that are not in the main
> branch. FAQ_CVS is also newer in 6.4; it's been de-HTMLized in 6.4
> (good) but not in current. FAQ_HPUX, FAQ_Irix, and FAQ_Linux are
> completely trashed in the 6.4 branch --- all three have been overwritten
> with some generic document.

I must have done that during the 6.4.2 release process. I decided to grab them from the web site, but it looks like I got the wrong pages somehow. They get over-written with every release, so I have another chance to try it for 6.5.

This does not apply to FAQ_CVS. That's someone else.

-- 
Bruce Momjian                       |  http://www.op.net/~candle
maillist@candle.pha.pa.us           |  (610) 853-3000
+  If your life is a hard drive,    |  830 Blythe Avenue
+  Christ can be your backup.       |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <maillist@candle.pha.pa.us> writes:
>> Comparing the current-CVS FAQ files with those in the REL6_4 branch
>> is depressing --- almost every one is messed up in one branch or the
>> other.

> They get over-written with every release, so I have another chance to try
> it for 6.5.

Oh? Aren't the CVS versions considered the master copies? If not, where *are* the master copies?

			regards, tom lane
Hello -

I've been lurking on this list for a little while. I have just done a little tinkering with Postgres 6.4.2, comparing it to Oracle and MySQL on Linux. It would appear that just doing a lot of inserts (i.e. 40000) in a loop is enormously slower in Postgres, in the two ways that I tried it:

One - using a loop in Java that makes a JDBC call to insert.
Two - using plpgsql as a comparison to PL/SQL.

Perhaps these are bad techniques and a C-based proc would do better?

I really like the idea of an open source DB and am impressed with a lot I see about PostgreSQL - however the speed difference is pretty bad - 4.5 minutes versus about 20 seconds. Is that just the way it goes? Are there options that would make it faster that are not the defaults?

Any reaction would be appreciated.

John Holland
John Holland wrote:
> 
> I've been lurking on this list for a little while. I have just done a
> little tinkering with Postgres 6.4.2, comparing it to Oracle and mySQL on
> Linux. It would appear that just doing a lot of inserts (ie 40000) in a
> loop is enormously slower in Postgres in two ways that I tried it.
> One - using a loop in Java that makes a JDBC call to insert.
> Two - using plpgsql as a comparision to PL/SQL.
> 
> Perhaps these are bad techniques and a C based proc would do better?
> 
> I really like the idea of an open source DB and am impressed with a lot I
> see about PostgreSQL - however the speed difference is pretty bad -
> 4.5 minutes versus about 20 seconds.
> 
> Is that just the way it goes? Are there options that would make it faster
> that are not the defaults?

Oracle uses chained transaction mode (i.e. all queries run in a single transaction until an explicit COMMIT/ABORT), and MySQL has no transactions at all... PostgreSQL uses un-chained transaction mode: each query runs in its own transaction - 40000 transaction commits!!!

Try to use BEGIN/END to run all inserts in a single transaction and please let us know the results.

Vadim
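[Editor's note: Vadim's suggestion amounts to something like the following sketch; the table and column names here are invented for illustration, not taken from John's test.]

```sql
-- Without an explicit transaction, each INSERT commits on its own,
-- so 40000 inserts mean 40000 transaction commits.
-- Wrapping the batch in BEGIN/END commits only once:
BEGIN;
INSERT INTO test (a) VALUES (1);
INSERT INTO test (a) VALUES (2);
-- ... thousands more inserts ...
INSERT INTO test (a) VALUES (40000);
END;  -- single commit for the whole batch
```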
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> >> Comparing the current-CVS FAQ files with those in the REL6_4 branch
> >> is depressing --- almost every one is messed up in one branch or the
> >> other.
> 
> > They get over-written with every release, so I have another chance to try
> > it for 6.5.
> 
> Oh? Aren't the CVS versions considered the master copies? If not,
> where *are* the master copies?

The master copies are on the web site. In fact, people send e-mail to a special account here, and they get transferred to the web site automatically.

-- 
Bruce Momjian
maillist@candle.pha.pa.us
Vadim Mikheev <vadim@krs.ru> writes:
> John Holland wrote:
>> I've been lurking on this list for a little while. I have just done a
>> little tinkering with Postgres 6.4.2, comparing it to Oracle and mySQL on
>> Linux. It would appear that just doing a lot of inserts (ie 40000) in a
>> loop is enormously slower in Postgres in two ways that I tried it.
>> One - using a loop in Java that makes a JDBC call to insert.
>> Two - using plpgsql as a comparision to PL/SQL.
>>
>> I really like the idea of an open source DB and am impressed with a lot I
>> see about PostgreSQL - however the speed difference is pretty bad -
>> 4.5 minutes versus about 20 seconds.

> Try to use BEGIN/END to run all inserts in single transaction
> and please let us know results.

I tried this myself and found that wrapping BEGIN/END around a series of INSERT statements didn't make much difference at all.

On an HP 715 (75MHz RISC box, not very heavy iron by current standards), I inserted about 13000 records into an initially-empty table having 38 columns (just because it's data I had handy...). I timed it at:

	Individual INSERT statements (as generated by pg_dump -d):   33 inserts/sec
	Same with BEGIN/END wrapped around 'em:                      34 inserts/sec
	Single COPY statement (as generated by pg_dump without -d):  1400 inserts/sec

This was for a simple table with no indexes. In reality, this table type has four b-tree indexes on different columns in our normal usage. The COPY speed dropped to 325 inserts/sec when I had the indexes in place. I didn't bother trying the INSERTs that way. The conventional wisdom is that you should use COPY for bulk loading, and this result supports it...

John didn't say what hardware he's using, so I don't know how comparable my result is to his 150 inserts/sec --- that might have been on a table with many fewer columns, or maybe his machine is just faster.

As for where the time is going: "top" showed that the time for the INSERT ops was almost all going into backend CPU time.
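[Editor's note: the two pg_dump output styles Tom compares look roughly like this. The table and values below are a cut-down hypothetical example; real dumps list every column of the table.]

```sql
-- pg_dump -d emits one INSERT statement per row, each of which
-- must be parsed and planned separately by the backend:
INSERT INTO "history" VALUES (64, 14, 252, 'D');
INSERT INTO "history" VALUES (65, 14, 252, 'P');

-- pg_dump without -d emits a single COPY statement; the rows are
-- tab-separated data, so per-statement parse/plan overhead is
-- paid only once for the whole load:
COPY "history" FROM stdin;
64	14	252	D
65	14	252	P
\.
```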
My guess is that most of the cost is in parsing and planning the INSERT statements. Pre-planned statement skeletons might help, but I think the real answer will be to find a way to avoid constructing and parsing SQL statements for simple operations like INSERT. (One reason I'm interested in the idea of a CORBA interface is that it might help here.)

My guess is that Oracle and mySQL have APIs that allow the construction of an SQL INSERT command to be bypassed, and that's why they show up better on this operation.

			regards, tom lane
On Sun, 17 Jan 1999, Tom Lane wrote:
> I tried this myself and found that wrapping BEGIN/END around a series of
> INSERT statements didn't make much difference at all.

Using what API? Some APIs control autocommit for you, so execing "BEGIN" and "END" commands may not do anything. You might want to check your API for a way to set autocommit off. In fact, I would suggest that you never use BEGIN and END except via cli.

Tom
Tom <tom@sdf.com> writes:
> On Sun, 17 Jan 1999, Tom Lane wrote:
>> I tried this myself and found that wrapping BEGIN/END around a series of
>> INSERT statements didn't make much difference at all.

> Using what API?

Sorry, I neglected to specify that it was psql (being driven by hand-trimmed pg_dump scripts).

> Some APIs control autocommit for you, so execing
> "BEGIN" and "END" commands may not do anything.

That's not the case for psql. If it were, I would have measured no difference in speed, rather than a small difference.

			regards, tom lane
On Sun, 17 Jan 1999, Tom Lane wrote:
> Tom <tom@sdf.com> writes:
> > On Sun, 17 Jan 1999, Tom Lane wrote:
> >> I tried this myself and found that wrapping BEGIN/END around a series of
> >> INSERT statements didn't make much difference at all.
> 
> > Using what API?
> 
> Sorry, I neglected to specify that it was psql (being driven by
> hand-trimmed pg_dump scripts).

This also wouldn't give you any benefit from "prepare", which would speed up repeated executions of the same statement. PostgreSQL does allow statements to be prepared, right? If it doesn't, chances are COPY has a way of doing it. It saves a lot of parsing and planning time.

> > Some APIs control autocommit for you, so execing
> > "BEGIN" and "END" commands may not do anything.
> 
> That's not the case for psql. If it were, I would have measured no
> difference in speed, rather than a small difference.
> 
> 			regards, tom lane

Tom
Tom Lane wrote:
> 
> > Try to use BEGIN/END to run all inserts in single transaction
> > and please let us know results.
> 
> I tried this myself and found that wrapping BEGIN/END around a series of
> INSERT statements didn't make much difference at all.
> 
> On an HP 715 (75MHz RISC box, not very heavy iron by current standards),
> I inserted about 13000 records into an initially-empty table having 38
                                                                     ^^
> columns (just because it's data I had handy...). I timed it at:
  ^^^^^^^

So much.

> Individual INSERT statements (as generated by pg_dump -d):
> 	33 inserts/sec
> Same with BEGIN/END wrapped around 'em:
> 	34 inserts/sec
> Single COPY statement (as generated by pg_dump without -d):
> 	1400 inserts/sec
> ...
> 
> John didn't say what hardware he's using, so I don't know how comparable
> my result is to his 150 inserts/sec --- that might have been on a table
> with many fewer columns, or maybe his machine is just faster.
> 
> As for where the time is going: "top" showed that the time for the
> INSERT ops was almost all going into backend CPU time. My guess is
> that most of the cost is in parsing and planning the INSERT statements.
> Pre-planned statement skeletons might help, but I think the real answer
> will be to find a way to avoid constructing and parsing SQL statements
> for simple operations like INSERT. (One reason I'm interested in the
> idea of a CORBA interface is that it might help here.)
> 
> My guess is that Oracle and mySQL have APIs that allow the construction
> of an SQL INSERT command to be bypassed, and that's why they show up
> better on this operation.

For a table with a single int4 column & 2000 INSERTs I have:

	BEGIN/END:  3.5sec   ~600 inserts/sec
	Without:    151sec   ~13 inserts/sec

(IDE disk!)

I guess that you don't use the -F flag..?

Vadim
I wrote:
> As for where the time is going: "top" showed that the time for the
> INSERT ops was almost all going into backend CPU time. My guess is
> that most of the cost is in parsing and planning the INSERT
> statements.

Having an idle hour this evening, I thought it'd be interesting to build a backend with profiling enabled, so as to confirm or deny the above guess. It seems that indeed a lot of time is being wasted, but where it's being wasted might surprise you!

I profiled a backend built from this morning's current CVS sources, being driven by a single psql run reading a series of exactly 5000 INSERT commands. The first few are:

INSERT INTO "fehistory_1" values (64,14,252,'D','-','-','s',1,4027,NULL,'S',1,4088,NULL,32,'Thu Jan 07 15:15:00 1999 EST','Fri Jan 08 08:30:38 1999 EST',4027,32,NULL,NULL,46,NULL);
INSERT INTO "fehistory_1" values (65,14,252,'P','-','-','S',1,4086,NULL,NULL,NULL,NULL,NULL,32,'Fri Jan 08 16:00:00 1999 EST',NULL,NULL,NULL,NULL,NULL,45,NULL);
INSERT INTO "fehistory_1" values (66,14,135,'P','-','-','S',1,13619,NULL,'s',1,12967,NULL,100,'Fri Jan 08 02:00:00 1999 EST',NULL,NULL,NULL,NULL,NULL,44,NULL);

and it goes on for a long while in that vein :-). Total runtime was 278.09 CPU seconds in the backend, 307 seconds elapsed realtime. The table being inserted into has no indexes and was initially empty. The profiler is GNU gprof, which provides both flat and structured profile data.
The top few entries in the flat profile are:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 34.96     97.21    97.21                             _mcount
  7.18    117.19    19.98  8047816     0.00     0.00  tree_cut
  6.56    135.43    18.24   962187     0.02     0.02  recv
  3.55    145.31     9.88  8234278     0.00     0.00  tree_insert
  3.49    155.02     9.71  4292947     0.00     0.01  malloc
  3.31    164.22     9.20  1965056     0.00     0.01  SearchSysCache
  3.24    173.23     9.01  4291775     0.00     0.01  free
  2.44    180.01     6.78  6171667     0.00     0.00  tree_concatenate
  2.17    186.04     6.03  8234069     0.00     0.00  tree_delete
  1.07    189.02     2.98    10250     0.29     0.29  write
  1.05    191.94     2.92                             _psort_cmp
  0.95    194.57     2.63  2983301     0.00     0.00  newNode
  0.92    197.14     2.57  4251762     0.00     0.00  OrderedElemPop
  0.88    199.60     2.46  2746172     0.00     0.00  comphash
  0.82    201.88     2.28  4252495     0.00     0.01  AllocSetAlloc
  0.77    204.02     2.14  1965056     0.00     0.00  SearchSysCacheTuple
  0.73    206.05     2.03   265000     0.01     0.01  yylex
  0.69    207.96     1.91                             $$remU
  0.68    209.86     1.90  4252500     0.00     0.00  OrderedElemPushHead
  0.62    211.58     1.72  4247473     0.00     0.00  palloc
  0.60    213.26     1.68    25004     0.07     0.07  send
  0.56    214.81     1.55  1965056     0.00     0.00  CatalogCacheComputeHashIndex
  0.55    216.33     1.52     5000     0.30     1.07  yyparse
  0.54    217.82     1.49  4251762     0.00     0.01  AllocSetFree
  0.53    219.29     1.47  4252494     0.00     0.00  MemoryContextAlloc
  0.51    220.71     1.42  4252500     0.00     0.00  OrderedElemPushInto
  0.49    222.07     1.36  3430844     0.00     0.00  strncmp
  0.48    223.41     1.34  4205327     0.00     0.00  OrderedSetGetHead
  0.47    224.73     1.32                             elog
  0.45    225.99     1.26  1964976     0.00     0.00  DLRemove
  0.43    227.19     1.20  2510857     0.00     0.00  strcmp
  0.43    228.38     1.19  4200327     0.00     0.00  OrderedElemGetBase
  0.42    229.56     1.18  4245740     0.00     0.01  PortalHeapMemoryAlloc
  0.38    230.62     1.06  4252500     0.00     0.00  OrderedElemPush
  0.37    231.66     1.04  4205327     0.00     0.00  AllocSetGetFirst
  0.36    232.66     1.00  1965145     0.00     0.00  DLAddHead

_mcount is part of the profiling apparatus and can be ignored (although the fact that it's so high up implies that we're executing an awful lot of short routine calls, because mcount runs once per entry and exit of profilable routines).
tree_cut, tree_insert, tree_concatenate, and tree_delete are subroutines of malloc/free. Altogether, malloc/free and friends accounted for 61.39 seconds out of the 180.88 non-overhead CPU seconds in this run.

In other words, we're spending a third of our time mallocing and freeing memory. A tad high, what?

Actually, it's worse than that, because AllocSetAlloc, PortalHeapMemoryAlloc, AllocSetFree, and all of the OrderedElemXXX routines represent our own bookkeeping layer atop malloc/free. That's another 18.66 seconds spent in these top routines, which means that we are real close to expending half the backend's runtime on memory bookkeeping. This needs work.

The other thing that jumps out here is the unreasonably high position of recv(), which is called 962187 times. The script being read by psql was only 957186 characters. Evidently we're invoking a kernel recv() call once per character read from the frontend. I suspect this is an inefficiency introduced by Magnus Hagander's recent rewrite of backend libpq (see, I told you there was a reason for using stdio ;-)). We're gonna have to do something about that, though it's not as critical as the memory-allocation issue. It also appears that send() is now being invoked multiple times per backend reply, which is going to create inefficiencies outside the backend (ie, multiple packets per reply). On a test case with a lot of SELECTs that would show up more than it does here.
Moving on to the dynamic profile, the most interesting items are:

index  % time    self  children    called       name
-----------------------------------------------
                 0.05  118.06        1/1            DoBackend [7]
[8]      65.3    0.05  118.06        1          PostgresMain [8]
                 0.04   57.64     5000/5000        pg_exec_query [9]
                 0.01   39.83     5000/5000        CommitTransactionCommand [11]
                 0.02   18.86     5001/5001        ReadCommand [27]
                 0.01    1.18     5000/5000        StartTransactionCommand [118]
                 0.01    0.36     5001/5001        ReadyForQuery [190]
                 0.04    0.00     5001/5001        EnableNotifyInterrupt [385]
                 0.04    0.00     5000/5000        strspn [388]
                 0.00    0.01        1/1            InitPostgres [495]
                 0.00    0.00     5000/548573      strlen [172]
                 0.00    0.00     5001/93349       memset [366]
                 0.00    0.00        1/1            read_pg_options [601]
                 0.00    0.00        2/2            pq_putint [637]
                 0.00    0.00        1/15002       pq_putnchar [123]
                 0.00    0.00        2/10013       getenv [241]
                 0.00    0.00        1/1            SetPgUserName [683]
                 0.00    0.00        1/1            pq_init [687]
                 0.00    0.00        1/1            proc_exit [704]
                 0.00    0.00        1/1            pq_close [721]
                 0.00    0.00        1/5002        getpgusername [521]
                 0.00    0.00        1/2            FindExec [722]
                 0.00    0.00        5/8            getopt [728]
                 0.00    0.00     5001/5001        DisableNotifyInterrupt [761]
                 0.00    0.00        9/34           pqsignal [806]
                 0.00    0.00        2/5            atoi [837]
                 0.00    0.00        1/2            SetProcessingMode [868]
                 0.00    0.00        1/10002       TransactionFlushEnabled [749]
                 0.00    0.00        1/1            sigsetjmp [934]

("self" is the amount of time (CPU sec) spent directly in the named function, while "children" is the amount of time spent in its callees, pro-rated by the number of calls. For example, strlen was actually measured to use 0.47 sec in the whole run, but since PostgresMain called it less than 1% of all the times it was called, PostgresMain gets blamed for less than 0.01 sec of that total.)

The interesting thing here is the relative times indicated for pg_exec_query, CommitTransactionCommand, and ReadCommand. We already found out why ReadCommand is so slow (recv() per character) ... but why is CommitTransactionCommand so high, when I already discovered that doing only one commit doesn't help much? Patience, we'll get there.
-----------------------------------------------
                 0.03   57.61     5000/5000        pg_exec_query [9]
[10]     31.9    0.03   57.61     5000         pg_exec_query_dest [10]
                 0.09   39.12     5000/5000        pg_parse_and_plan [13]
                 0.00   17.37     5000/5000        ProcessQuery [36]
                 0.02    0.60     5000/5000        SetQuerySnapshot [159]
                 0.01    0.34     5000/5000        CommandCounterIncrement [196]
                 0.02    0.05    10000/4291775     free [22]
-----------------------------------------------
                 0.09   39.12     5000/5000        pg_exec_query_dest [10]
[13]     21.7    0.09   39.12     5000         pg_parse_and_plan [13]
                 0.02   19.28     5000/5000        parser [26]
                 0.03   17.63     5000/5000        planner [34]
                 0.00    1.98     5000/5000        QueryRewrite [90]
                 0.02    0.05    10000/4292947     malloc [21]
                 0.02    0.05    10000/4291775     free [22]
                 0.00    0.02     5000/497779      lappend [91]
                 0.01    0.00     5000/5000        IsAbortedTransactionBlockState [529]
                 0.00    0.00     5000/72779       length [342]

In other words, parsing, planning, and executing an INSERT each take about the same amount of time.

-----------------------------------------------
                 0.01   39.83     5000/5000        PostgresMain [8]
[11]     22.0    0.01   39.83     5000         CommitTransactionCommand [11]
                 0.02   39.81     5000/5000        CommitTransaction [12]
-----------------------------------------------
                 0.02   39.81     5000/5000        CommitTransactionCommand [11]
[12]     22.0    0.02   39.81     5000         CommitTransaction [12]
                 0.00   38.12     5000/5000        AtCommit_Memory [16]
                 0.01    1.10     5000/5000        RecordTransactionCommit [122]
                 0.00    0.30     5000/5000        AtCommit_Locks [211]
                 0.02    0.10     5000/5000        AtEOXact_portals [286]
                 0.03    0.07     5000/5000        DestroyTempRels [305]
                 0.03    0.00     5000/5000        _lo_commit [425]
                 0.01    0.00     5000/5000        AtCommit_Notify [522]
                 0.01    0.00     5000/5000        CloseSequences [523]
                 0.01    0.00     5000/5000        RelationPurgeLocalRelation [531]
                 0.00    0.00     5000/10000       AtCommit_Cache [750]
-----------------------------------------------
                 0.00   38.12     5000/5000        CommitTransaction [12]
[16]     21.1    0.00   38.12     5000         AtCommit_Memory [16]
                 0.01   38.11     5000/5000        EndPortalAllocMode [17]
                 0.00    0.00     5000/10552       MemoryContextSwitchTo [747]
-----------------------------------------------
                 0.01   38.11     5000/5000        AtCommit_Memory [16]
[17]     21.1    0.01   38.11     5000         EndPortalAllocMode [17]
                 0.54   37.55     5000/5000        AllocSetReset [18]
                 0.01    0.00     5000/10000       PortalHeapMemoryGetVariableMemory [449]
                 0.01    0.00     5000/5000        FixedStackPop [528]
                 0.00    0.00     5000/51434       MemoryContextFree [367]
-----------------------------------------------
                 0.54   37.55     5000/5000        EndPortalAllocMode [17]
[18]     21.1    0.54   37.55     5000         AllocSetReset [18]
                 1.47   32.51  4200327/4251762     AllocSetFree [20]
                 1.04    2.53  4205327/4205327     AllocSetGetFirst [64]

In other words, essentially *all* of the CPU time spent in CommitTransaction is spent freeing memory. That's probably why ganging the transactions doesn't help --- it's the same number of memory blocks getting allocated and freed.

-----------------------------------------------
                 0.02   19.28     5000/5000        pg_parse_and_plan [13]
[26]     10.7    0.02   19.28     5000         parser [26]
                 0.02   13.88     5000/5000        parse_analyze [41]
                 1.52    3.81     5000/5000        yyparse [54]
                 0.01    0.01     5000/5000        init_io [474]
                 0.00    0.01     5000/420496      pstrdup [124]
                 0.01    0.00     5000/5000        parser_init [533]

Thomas might know why parse_analyze is taking so much time compared to the rest of the parsing machinery...

I won't bother the list with the rest of the profile, although I'll be happy to send it to anyone who wants to see all of it. Our immediate performance problems seem pretty clear, however: time to rethink memory management.

			regards, tom lane
Tom Lane wrote:
> 
> The other thing that jumps out here is the unreasonably high position of
> recv(), which is called 962187 times. The script being read by psql was
> only 957186 characters. Evidently we're invoking a kernel recv() call
> once per character read from the frontend. I suspect this is an
> inefficiency introduced by Magnus Hagander's recent rewrite of backend
> libpq (see, I told you there was a reason for using stdio ;-)).

At least part of the problem is also the fe-be protocol itself, maybe not in the backend receiving side, but certainly while the front-end is receiving. The main problem is that you must very often wait for a specific end character (instead of sending first the length and then reading the required amount in one chunk), and it requires some trickery to do it efficiently without reading each character separately.

I once did the fe-be protocol in Python (an interpreted scripting language, see http://www.python.org), it was for v6.2. I was quite amazed by the baroqueness of the protocol - it uses a mix of three techniques: 1. wait for an EOT char, 2. receive a Pascal-style string, and 3. get a record consisting of a mix of 1 and 2.

> We're
> gonna have to do something about that, though it's not as critical as
> the memory-allocation issue. It also appears that send() is now being
> invoked multiple times per backend reply, which is going to create
> inefficiencies outside the backend (ie, multiple packets per reply).
> On a test case with a lot of SELECTs that would show up more than it
> does here.

I am sure that we need to change the protocol sometime soon, if we want to increase the performance. I have contemplated something like the X-window protocol, which seems to be quite well designed.

And of course we need some sort of CLI that can do prepared statements and that can use binary fields (not only the file interface). Currently we have some of it, but only in SPI.
Unfortunately most inserts are not done using SPI :(

It may be the CORBA interface that somebody may be working on, or it may be something simpler.

---------------------
Hannu Krosing
I tried to do this...

What I've done is create a plpgsql function that does the inserts. There is a BEGIN/END in the function of course... I also tried doing it with a BEGIN WORK before calling the function in psql, and it did seem that this worked as it should - i.e. a ROLLBACK WORK would get rid of the 40000 inserts. Still, it took about 3.5 minutes, and Oracle does it in about 30 seconds. A factor of 7...

On the plus side for Postgres, trying to find some queries against the data for other speed tests, I found that it seemed to respond quickly - not much thinking to do with one table of integers I guess. Actually I couldn't come up with a query that would take very long - need another table I suppose. I found that UPDATE TEST SET A = B - 1 took only 20 seconds for 40000 records.

The real plus is that Oracle couldn't do that transaction because it was running out of rollback segment space - something I have fixed in the past but I recall being a pain to fix - need to make bigger rollback datafiles or whatever.

Any suggested speed tests would be appreciated.

> > Try to use BEGIN/END to run all inserts in single transaction
> > and please let us know results.
> 
> Vadim
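[Editor's note: John doesn't show his function, but it was presumably something along these lines. All names here are invented; note that a PL/pgSQL function body always runs inside the calling transaction, which is why the external BEGIN WORK / ROLLBACK WORK controls all 40000 inserts.]

```sql
-- Hypothetical reconstruction of the kind of function being timed.
CREATE FUNCTION bulk_insert(int4) RETURNS int4 AS '
DECLARE
    n ALIAS FOR $1;
    i int4;
BEGIN
    FOR i IN 1..n LOOP
        INSERT INTO test (a, b) VALUES (i, i + 1);
    END LOOP;
    RETURN n;
END;
' LANGUAGE 'plpgsql';

-- In psql:
--   BEGIN WORK;
--   SELECT bulk_insert(40000);
--   ROLLBACK WORK;   -- undoes all 40000 inserts, or COMMIT to keep them
```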
Hannu Krosing <hannu@trust.ee> writes:
> I am sure that we need to change the protocol sometime soon, if we
> want to increase the performance.

The immediate problem is just that we've given back performance that we had a week ago. stdio-based libpq didn't have this problem, because it dealt with the kernel a bufferload at a time, not a character at a time.

Also, while the fe/be protocol is certainly a bit crufty, I'm not convinced that a well-designed frontend is going to suffer any real performance problem from parsing the protocol. The trick again is to read into a buffer and parse from the buffer. The last time I profiled frontend libpq, it turned out to be spending all its time in memory allocation, not parsing. (Which I fixed, btw.)

> And of course we need some sort of CLI that can do prepared statements
> and that can use binary fields (not only the file interface).
> Currently we have some of it, but only in SPI.

SPI? I missed something ... what's that?

			regards, tom lane
On Sun, Jan 17, 1999 at 07:23:15PM -0800, Tom wrote:
> This also wouldn't you give you any benefit from "prepare" that would
> speed up repeated executions of the same statement. PostgreSQL does allow
> statements to be prepared right? If it doesn't, chances are COPY has a

Since when? I thought we're still looking for someone to implement it, i.e. at least the PREPARE statement.

Michael
-- 
Michael Meskes                       | Go SF 49ers!
Th.-Heuss-Str. 61, D-41812 Erkelenz  | Go Rhein Fire!
Tel.: (+49) 2431/72651               | Use Debian GNU/Linux!
Email: Michael.Meskes@gmx.net        | Use PostgreSQL!
On Sun, Jan 17, 1999 at 09:08:28PM -0500, John Holland wrote:
> I've been lurking on this list for a little while. I have just done a
> little tinkering with Postgres 6.4.2, comparing it to Oracle and mySQL on
> Linux. It would appear that just doing a lot of inserts (ie 40000) in a
> loop is enormously slower in Postgres in two ways that I tried it.
> One - using a loop in Java that makes a JDBC call to insert.
> Two - using plpgsql as a comparision to PL/SQL.

Did you use the '-F' option? I tried something similar from C and found similar results when calling fsync() after each write. Once I used -F it was way faster and comparable to Oracle.

> Perhaps these are bad techniques and a C based proc would do better?

Feel free to try src/interfaces/ecpg/test/perftest.c

Michael
Tom Lane wrote:
> 
> Also, while the fe/be protocol is certainly a bit crufty, I'm not
> convinced that a well-designed frontend is going to suffer any real
> performance problem from parsing the protocol. The trick again is to
> read into a buffer and parse from the buffer.

That's also what I did, but still a better protocol would have helped.

> > And of course we need some sort of CLI that can do prepared statements
> > and that can use binary fields (not only the file interface).
> >
> > Currently we have some of it, but only in SPI.
> 
> SPI? I missed something ... what's that?

I think it stands for Server Programming Interface (?). Vadim (?) added it as the interface for C and other programming language functions to execute queries inside the backend. AFAIK, it follows the standard prepare-bind-execute-fetch model for its queries. It is quite well documented in the standard PG docs.

-------------------
Hannu Krosing
Vadim Mikheev <vadim@krs.ru> writes:
> For the table with single int4 column & 2000 INSERTs I have:

> 	BEGIN/END:  3.5sec   ~600 inserts/sec
> 	Without:    151sec   ~13 inserts/sec

> (IDE disk!)

> I guess that you don't use -F flag..?

*Do* use -F, you meant of course. You're right, I do. With an fsync after every transaction, I'm sure my test would've been much slower.

However, with a UPS and an OS that only crashes about once every other year, I feel pretty secure using -F ...

			regards, tom lane
Michael Meskes wrote:
> 
> On Sun, Jan 17, 1999 at 07:23:15PM -0800, Tom wrote:
> > This also wouldn't you give you any benefit from "prepare" that would
> > speed up repeated executions of the same statement. PostgreSQL does allow
> > statements to be prepared right? If it doesn't, chances are COPY has a
> 
> Since when? I thought we're still looking for someone to implement it, i.e.
> at least the PREPARE statement.

1. PG can handle prepared queries. The Parser & Planner can parse/plan (i.e. prepare) a query with parameters once, so such a prepared query can be executed many times by the executor without parser/planner invocation.

2. We have to implement some interface to use this ability.

Vadim
On Mon, 18 Jan 1999, Vadim Mikheev wrote:
> 1. PG can handle prepared queries.
>    Parser & Planner can parse/plan (i.e. prepare) query with
>    parameters once, so such prepared query can be executed
>    many times by executor without parser/planner invocation.
> 2. We have to implement some interface to use this ability.

Yes, I need prepared statements for CORBA, too.

Taral
Tom Lane wrote:
> 
> Vadim Mikheev <vadim@krs.ru> writes:
> > For the table with single int4 column & 2000 INSERTs I have:
> 
> > 	BEGIN/END:  3.5sec   ~600 inserts/sec
> > 	Without:    151sec   ~13 inserts/sec
> 
> > (IDE disk!)
> 
> > I guess that you don't use -F flag..?
> 
> *Do* use -F, you meant of course. You're right, I do. With an

Actually no -:)

> fsync after every transaction, I'm sure my test would've been much
> slower.

Just use BEGIN/END to run all queries in a single transaction -:)

> However, with a UPS and an OS that only crashes about once every
> other year, I feel pretty secure using -F ...

However, it's easy to crash Postgres itself and lose committed transactions -:(

Vadim
Vadim Mikheev <vadim@krs.ru> writes:
> Tom Lane wrote:
>> However, with a UPS and an OS that only crashes about once every
>> other year, I feel pretty secure using -F ...

> However, it's easy to crash Postgres itself and lose
> committed transactions -:(

Surely not? The docs say (and a quick look at the code confirms) that -F suppresses calls to fsync(2). It does not suppress writes. Thus, a commit will still write the data out to kernel disk buffers. All that fsync does is force the kernel to execute immediate disk writes for those buffers. If I don't fsync, and the backend crashes, the modified file data is still in kernel disk buffers, and the kernel is still responsible for seeing that those dirty buffers get written out eventually.

(Of course, errors inside Postgres might cause it to write bogus data, but -F isn't going to help or hurt for that.)

Not using -F means that you don't trust your OS, your hardware, and/or your power supply. It has nothing to do with whether you trust Postgres.

			regards, tom lane
Tom Lane wrote:
> 
> Surely not? The docs say (and a quick look at the code confirms)
> that -F suppresses calls to fsync(2). It does not suppress writes.
> Thus, a commit will still write the data out to kernel disk buffers.
> All that fsync does is force the kernel to execute immediate disk
> writes for those buffers. If I don't fsync, and the backend crashes,
> the modified file data is still in kernel disk buffers and the kernel
> is still responsible for seeing that those dirty buffers get written
> out eventually.

You're right.

Vadim
> Not using -F means that you don't trust your OS, your hardware,
> and/or your power supply.  It has nothing to do with whether you
> trust Postgres.

I couldn't have said it better.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
On Tue, Jan 19, 1999 at 09:15:01AM +0700, Vadim Mikheev wrote:
> 1. PG can handle prepared queries.
>    Parser & Planner can parse/plan (i.e. prepare) query with
>    parameters once, so such prepared query can be executed
>    many times by executor without parser/planner invocation.
> 2. We have to implement some interface to use this ability.

I see. Thanks.

Michael
--
Michael Meskes                       |  Go SF 49ers!
Th.-Heuss-Str. 61, D-41812 Erkelenz  |  Go Rhein Fire!
Tel.: (+49) 2431/72651               |  Use Debian GNU/Linux!
Email: Michael.Meskes@gmx.net        |  Use PostgreSQL!
can you explain the -F flag? when is it passed to what? Can this be done
in plpgsql? how?

john

On Mon, 18 Jan 1999, Tom Lane wrote:

> Vadim Mikheev <vadim@krs.ru> writes:
> > For the table with single int4 column & 2000 INSERTs I have:
> > BEGIN/END: 3.5sec  ~600insert/sec
> > Without:   151sec  ~13insert/sec
> > (IDE disk!)
> > I guess that you don't use -F flag..?
>
> *Do* use -F, you meant of course.  You're right, I do.  With an
> fsync after every transaction, I'm sure my test would've been much
> slower.
>
> However, with a UPS and an OS that only crashes about once every
> other year, I feel pretty secure using -F ...
>
> 			regards, tom lane
At 11:29 19/01/99 -0500, you wrote:
>can you explain the -F flag? when is it passed to what? Can this be done
>in plpgsql? how?

Or passed by libpq in a C program that needs to insert (as fast as
possible) a lot of new rows?

>john
>
>On Mon, 18 Jan 1999, Tom Lane wrote:
>
>> Vadim Mikheev <vadim@krs.ru> writes:
>> > For the table with single int4 column & 2000 INSERTs I have:
>> > BEGIN/END: 3.5sec  ~600insert/sec
>> > Without:   151sec  ~13insert/sec
>> > (IDE disk!)
>> > I guess that you don't use -F flag..?
>>
>> *Do* use -F, you meant of course.  You're right, I do.  With an
>> fsync after every transaction, I'm sure my test would've been much
>> slower.
>>
>> However, with a UPS and an OS that only crashes about once every
>> other year, I feel pretty secure using -F ...
>>
>> 			regards, tom lane

------------------------------------------------------------------
Eng. Roberto João Lopes Garcia        E-mail: roberto@mha.com.br
F. 55 11 848 9906                     FAX 55 11 848 9955

MHA Engenharia Ltda
E-mail: mha@mha.com.br                WWW: http://www.mha.com.br

Av Maria Coelho Aguiar, 215 Bloco D 2 Andar
Centro Empresarial de Sao Paulo
Sao Paulo - BRASIL - 05805 000
-------------------------------------------------------------------
On Mon, 18 Jan 1999, Taral wrote:
> On Mon, 18 Jan 1999, Vadim Mikheev wrote:
> > 1. PG can handle prepared queries.
> >    Parser & Planner can parse/plan (i.e. prepare) query with
> >    parameters once, so such prepared query can be executed
> >    many times by executor without parser/planner invocation.
> > 2. We have to implement some interface to use this ability.
>
> Yes, I need prepared statements for CORBA, too.

This would improve JDBC's PreparedStatement as well.

--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgresJava
PDF Generator: http://www.retep.org.uk/pdf
John Holland <jholland@isr.umd.edu> writes:
> can you explain the -F flag? when is it passed to what?

-F is a command-line flag passed to the backend at backend startup.
Since backends are normally started by the postmaster, what you
really do in practice is to start the postmaster with "-o -F".
For example my postmaster start script looks like

	nohup postmaster -i -o "-F" >server.log 2>&1 </dev/null &

What the -F switch actually does is to disable calls to fsync(2),
thereby allowing modified file blocks to hang around in kernel
memory for a little while (up to 30 seconds in most Unixes)
rather than being force-written to disk as soon as each transaction
commits.  If the same block of the database file gets modified
again within that time window (very likely under a repeated-update
load), you just saved a disk write.  On the other hand, if your OS
crashes or your power goes out in those 30 sec, you just lost a
database update that you thought you had committed.

I'm not sure I believe the argument that omitting -F buys very much
safety, even if you do not trust your power company.  Murphy's law
says that a power flicker will happen in the middle of committing
a transaction, not during the 30-second-max window between when you
could've had the data flushed to disk if only you'd used fsync()
and when the swapper process will fsync it on its own.  And in that
case you have a corrupted database anyway.  So my theory is you use
a reliable OS, and get yourself a UPS if your power company isn't
reliable (lord knows mine ain't), and back up your database as often
as you can.  -F buys enough speed that it's worth the small extra risk.

There are competent experts with conflicting opinions, however ;-)

See doc/README.fsync for more about -F and the implications of
using it.

			regards, tom lane
> John Holland <jholland@isr.umd.edu> writes:
> > can you explain the -F flag? when is it passed to what?
>
> -F is a command-line flag passed to the backend at backend startup.
> Since backends are normally started by the postmaster, what you
> really do in practice is to start the postmaster with "-o -F".
> For example my postmaster start script looks like
>
> 	nohup postmaster -i -o "-F" >server.log 2>&1 </dev/null &
>
> What the -F switch actually does is to disable calls to fsync(2),
> thereby allowing modified file blocks to hang around in kernel
> memory for a little while (up to 30 seconds in most Unixes)
> rather than being force-written to disk as soon as each transaction
> commits.  If the same block of the database file gets modified
> again within that time window (very likely under a repeated-update
> load), you just saved a disk write.  On the other hand, if your OS
> crashes or your power goes out in those 30 sec, you just lost a
> database update that you thought you had committed.
>
> I'm not sure I believe the argument that omitting -F buys very much
> safety, even if you do not trust your power company.  Murphy's law
> says that a power flicker will happen in the middle of committing
> a transaction, not during the 30-second-max window between when you
> could've had the data flushed to disk if only you'd used fsync()
> and when the swapper process will fsync it on its own.  And in that
> case you have a corrupted database anyway.  So my theory is you use
> a reliable OS, and get yourself a UPS if your power company isn't
> reliable (lord knows mine ain't), and back up your database as often
> as you can.  -F buys enough speed that it's worth the small extra risk.
>
> There are competent experts with conflicting opinions, however ;-)
>
> See doc/README.fsync for more about -F and the implications of
> using it.

Well said, and I think we have to address this shortcoming.  Vadim has
been reluctant to turn off fsync by default.  I have been trying to
come up with some soft-fsync on my own but haven't hit on anything he
agrees with yet.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
>For example my postmaster start script looks like
>
>	nohup postmaster -i -o "-F" >server.log 2>&1 </dev/null &

I tried the -F option and the results are below. I use a C program
(libpq) that reads from 3 dbf files and passes them to 3 tables in
PostgreSQL 4.0.0. Each row in a dbf file generates a SQL INSERT
command. I'm not using BEGIN TRANSACTION and COMMIT to make all
inserts in just one transaction.

Test without -F option:

  table 1:    697 rows =     4' 39"
  table 2:  22760 rows =  1h 9' 37"
  table 3:  22758 rows =    43' 44"
  index/vacuum/grant all tables  12' 58"

  Total time: 2h 10' 58"

Test with -F option (dbf file was updated and grew a little):

  table 1:    698 rows =     1' 52"
  table 2:  22798 rows =    15' 38"
  table 3:  22796 rows =     7' 42"
  index/vacuum/grant all tables  11' 39"

  Total time: 36' 51"   ~= 72% reduction in time

It will allow me to update the PostgreSQL tables from the dbf files 2
times a day instead of once at night!

Thank you to those who pointed me to the -F option. I will try BEGIN
TRANSACTION and COMMIT to make all inserts in just one transaction.

Roberto

------------------------------------------------------------------
Eng. Roberto João Lopes Garcia        E-mail: roberto@mha.com.br
F. 55 11 848 9906                     FAX 55 11 848 9955

MHA Engenharia Ltda
E-mail: mha@mha.com.br                WWW: http://www.mha.com.br

Av Maria Coelho Aguiar, 215 Bloco D 2 Andar
Centro Empresarial de Sao Paulo
Sao Paulo - BRASIL - 05805 000
-------------------------------------------------------------------
On Wed, 20 Jan 1999, Tom Lane wrote:

> What the -F switch actually does is to disable calls to fsync(2),
> thereby allowing modified file blocks to hang around in kernel
> memory for a little while (up to 30 seconds in most Unixes)
> rather than being force-written to disk as soon as each transaction
> commits.  If the same block of the database file gets modified
> again within that time window (very likely under a repeated-update
> load), you just saved a disk write.

Your efficiencies from running asynchronously run far beyond that. By
delaying writes, you allow the kernel to order them and pass them in
large chunks to the drive, getting very good efficiency. Compare ten
aggregated writes (the asynchronous case) to ten individual writes
(the synchronous case): in the former case, the disk arm needs to move
over the platter once, and you pay the cost of initiating a write
once, whereas in the latter case the disk arm needs to do ten seeks,
and you pay the cost of initiating ten different disk operations,
servicing ten interrupts instead of one, etc. You save a lot more than
just optimizing out cancelling writes.

> I'm not sure I believe the argument that omitting -F buys very much
> safety, even if you do not trust your power company.  Murphy's law
> says that a power flicker will happen in the middle of committing
> a transaction, not during the 30-second-max window between when you
> could've had the data flushed to disk if only you'd used fsync()
> and when the swapper process will fsync it on its own.  And in that
> case you have a corrupted database anyway.

That depends on how you process your transactions, I guess. It is
possible to perform transactions safely, such that no interruption can
violate your db's integrity. I have not read and understood enough of
psql's transaction engine to know how safe it is in the face of
interruption, but there's no inherent reason that it need be unsafe.
The real way to solve this problem is to build a file system which is
more suitable for use in this way than a traditional unix fs. You're
working around the coarse-grained-ness of fsync() no matter what you
do, and there is no good solution until you move away from fsync() and
towards something like fsync(block) or fsync(transaction).

--
Todd Graham Lewis      32°49'N, 83°36'W          (800) 719-4664, x2804
******Linux******      MindSpring Enterprises    tlewis@mindspring.net

"Those who write the code make the rules." -- Jamie Zawinski
Todd Graham Lewis wrote:
>
> On Wed, 20 Jan 1999, Tom Lane wrote:
>
> > What the -F switch actually does is to disable calls to fsync(2),
> > thereby allowing modified file blocks to hang around in kernel

Could you use an mmapped file and flush it according to some
algorithm? Hard on memory but fast.

--
John Ryan                          CEO, Ryan Net Works
Visit us at www.cybertrace.com     john@cybertrace.com
(703) 998-1751 (w)                 3515-A S. Stafford St.
(703) 820-6304 (f)                 Arlington, VA 22206-1191
(703) 626-4130 (c)

-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: 2.6.2

mQBtAzO3iTYAAAEDAOvDARDRxluZdieT11m6ukP/niHI3qnSL9DKmRrLil5M+Cdp
3w4fV2/HvoBhsUvJE0uLwtbFCCBUP5YkGWNM0oU+XxQHP2+yUY8CDEHLncZ6KHV5
KLlfQKDHSQgZ+58PcQAFE7QjSm9obiBSeWFuIDxqb2huQHd3dy5jeWJlcnRyYWNl
LmNvbT4=
=Wl7V
-----END PGP PUBLIC KEY BLOCK-----
Tom Lane wrote:
>
> Having an idle hour this evening, I thought it'd be interesting to build
> a backend with profiling enabled, so as to confirm or deny the above
> guess.  It seems that indeed a lot of time is being wasted, but where
> it's being wasted might surprise you!
...
> In other words, we're spending a third of our time mallocing and freeing
> memory.  A tad high, what?
>
> Actually, it's worse than that, because AllocSetAlloc,
> PortalHeapMemoryAlloc, AllocSetFree, and all of the OrderedElemXXX
> routines represent our own bookkeeping layer atop malloc/free.
> That's another 18.66 seconds spent in these top routines, which means
> that we are real close to expending half the backend's runtime on
> memory bookkeeping.  This needs work.

Yes, it's surprising! I added some debug code to palloc/pfree
and it shows that for INSERT:

1. 80% of allocations are made for <= 32 bytes.
2. pfree is used for 25% of them only (others are freed
   after statement/transaction is done).

Note that our mmgr adds 16 bytes to each allocation
(+ some bytes in malloc) - a great overhead, yes?

I added code to allocate a few big (16K-64K) blocks of memory for
these small allocations to speed up palloc by skiping
AllocSetAlloc/malloc. New code don't free allocated memory (to make
bookkeeping fast) but keeping in mind 2. above and memory overhead it
seems as appropriate thing to do. These code also speed up freeing
when statement/transaction is done, because of only a few blocks have
to be freed now.

I did 5000 INSERTs (into tables with 3 ints and 33 ints) with
BEGIN/END, -F and -B 512 (I run postgres directly, without
postmaster). User times:

                          old        new
   -----------------------------------------
   table with  3 ints    9.7 sec    7.6 sec
   table with 33 ints   59.5 sec   39.9 sec

So, new code 20%-30% faster. Process sizes are the same.

Tom, could you run new code under profiling?

There are still some things to do:

1. SELECT/UPDATE/DELETE often palloc/pfree tuples (sizes are > 32
   bytes), but pfree now requires additional lookup to see is memory
   allocated by AllocSetAlloc or new code. We can avoid this.
2. Index scans palloc/pfree IndexResult for each tuple returned by
   index. This one was annoying me for long time. IndexResult should
   be part of IndexScan structure...
3. psort uses leftist structure (16 bytes) when disk is used for
   sorting. Obviously, big block allocation should be used by the
   lselect code itself.
4. Actually, new mode shouldn't be used by Executor in some cases.

I'll address this in a few days...

BTW, look at memutils.h: new code is like "tunable" mode described
there.

> In other words, essentially *all* of the CPU time spent in
> CommitTransaction is spent freeing memory.  That's probably why
> ganging the transactions doesn't help --- it's the same number of
> memory blocks getting allocated and freed.

It shows that we should get rid of system malloc/free and do
all things in mmgr itself - this would allow us much faster
free memory contexts at statement/transaction end.

Vadim
Vadim Mikheev <vadim@krs.ru> writes:
> Tom Lane wrote:
>> In other words, we're spending a third of our time mallocing and freeing
>> memory.  A tad high, what?

> I added some debug code to palloc/pfree and it shows that for
> INSERT:
> 1. 80% of allocations are made for <= 32 bytes.
> 2. pfree is used for 25% of them only (others are freed
>    after statement/transaction is done).

I had suspected the former (because most of the allocs are parser
nodes) and knew the latter from profile call counts.

> Note that our mmgr adds 16 bytes to each allocation
> (+ some bytes in malloc) - a great overhead, yes?

Youch ... for a list-node-sized request, that's a lot of overhead.

I did some more profiling this weekend, after Magnus Hagander and I
repaired libpq's little problem with calling recv() once per input
byte.  Unsurprisingly, memory allocation is still 50% of backend CPU
time in my test consisting of a long series of INSERT statements.
It's less than that, but still an interesting number, in two other
test scenarios I set up:

(1) SELECT the whole contents of a 6500-row, 38-column table; repeat
    5 times.  Memory allocation takes 8.74 of 54.45 backend CPU sec,
    or 16%.

(2) Load a database from a pg_dump script.  Script size is 1.3MB.
    This is of course mostly CREATE TABLE, COPY from stdin, CREATE
    INDEX.  Memory allocation takes 11.08 of 38.25 sec = 29%.

I believe that the reason memory alloc is such a large percentage of
the INSERT scenario is that this is the only one where lots of SQL
parsing is going on.  In the SELECT scenario, most of the palloc/pfree
traffic is temporary field value representations inside printtup
(more about that below).
In the database load scenario, it seems that btree index building is
the biggest hit; the heaviest callers of palloc are:

                0.00    0.01   10642/438019     _bt_searchr [86]
                0.00    0.01   13531/438019     heap_formtuple [149]
                0.00    0.01   13656/438019     CopyFrom [24]
                0.00    0.01   14196/438019     textin [247]
                0.01    0.01   18275/438019     datetime_in [42]
                0.01    0.03   32595/438019     bpcharin [182]
                0.01    0.03   38882/438019     index_formtuple [79]
                0.01    0.03   39028/438019     _bt_formitem [212]
                0.07    0.16  204564/438019     _bt_setsortkey [78]
[88]     1.3    0.14    0.34  438019         palloc [88]

> I added code to allocate a few big (16K-64K) blocks of memory for
> these small allocations to speed up palloc by skiping
> AllocSetAlloc/malloc.  New code don't free allocated memory (to make
> bookkeeping fast) but keeping in mind 2. above and memory overhead it
> seems as appropriate thing to do.  These code also speed up freeing
> when statement/transaction is done, because of only a few blocks have
> to be freed now.

This is pretty much what I had in mind, but I think that it is not OK
to leave everything to be freed at statement end.  In particular I've
just been looking at printtup(), the guts of SELECT output, and I see
that each field value is palloc'd (by a type-specific output routine)
and then pfree'd by printtup after emitting it.  If you make the
pfrees all no-ops then this means the temporary space needed for
SELECT is at least equal to the total volume of output data ... not
good for huge tables.  (COPY in and out also seem to palloc/pfree each
field value that goes past.)

However, I think we could get around this by making more use of memory
contexts.  The space allocated inside printtup could be palloc'd from
a "temporary" context that gets flushed every so often (say, after
every few hundred tuples of output).  All the pallocs done inside the
parser could go to a "statement" context that is flushed at the end of
each statement.
It'd probably also be a good idea to have two kinds of memory
contexts, the current kind where each block is individually free-able
and the new kind where we only keep track of how to free the whole
context's contents at once.  That way, any particular routine that was
going to allocate a *large* chunk of memory would have a way to free
that chunk explicitly when done with it (but would have to remember
which context it got it from in order to do the free).  For the
typical list-node-sized request, we wouldn't bother with the extra
bookkeeping.

I think that palloc() could be defined as always giving you the
bulk-allocated kind of chunk.  pfree() would become a no-op and
eventually could be removed entirely.  Anyone who wants the other kind
of chunk would call a different pair of routines, where the
controlling memory context has to be named at both alloc and free
time.  (This approach gets rid of your concern about the cost of
finding which kind of context a block has been allocated in inside of
pfree; we expect the caller to know that.)

Getting this right is probably going to take some thought and work,
but it looks worthwhile from a performance standpoint.  Maybe for 6.6?

> It shows that we should get rid of system malloc/free and do
> all things in mmgr itself - this would allow us much faster
> free memory contexts at statement/transaction end.

I don't think we can or should stop using malloc(), but we can
ask it for large blocks and do our own allocations inside those
blocks --- was that what you meant?

			regards, tom lane
Tom Lane wrote:
> I don't think we can or should stop using malloc(), but we can
> ask it for large blocks and do our own allocations inside those
> blocks --- was that what you meant?

I will just jump in with a small idea.
The Gnome crowd is investigating using alloca instead of malloc/free
where applicable. It is a huge save in CPU cycles where it can be used.

regards,
--
-----------------
Göran Thyni
This is Penguin Country. On a quiet night you can hear Windows NT
reboot!
BUGS
     The alloca() function is machine dependent; its use is discouraged.

On Sun, 24 Jan 1999, Goran Thyni wrote:

> Tom Lane wrote:
> > I don't think we can or should stop using malloc(), but we can
> > ask it for large blocks and do our own allocations inside those
> > blocks --- was that what you meant?
>
> I will just jump in with a small idea.
> The Gnome crowd is investigating using alloca instead of malloc/free
> where applicable. It is a huge save in CPU cycles where it can be used.
>
> regards,
> --
> -----------------
> Göran Thyni
> This is Penguin Country. On a quiet night you can hear Windows NT
> reboot!

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org    secondary: scrappy@{freebsd|postgresql}.org
The Hermit Hacker wrote:
> BUGS
>      The alloca() function is machine dependent; its use is discouraged.

Gain a big potential speed boost, loose some portability.
I think most modern unices has a good alloca, DOS and Mac
don't, but who's porting the server there?

Any unices out there missing alloca?

mvh,
--
-----------------
Göran Thyni
This is Penguin Country. On a quiet night you can hear Windows NT
reboot!
Tom Lane wrote:
>
> > Note that our mmgr adds 16 bytes to each allocation
> > (+ some bytes in malloc) - a great overhead, yes?
>
> Youch ... for a list-node-sized request, that's a lot of overhead.

And lists are very often used by planner and - never pfreed.

> Getting this right is probably going to take some thought and work,
> but it looks worthwhile from a performance standpoint.  Maybe for 6.6?

Yes - I have to return to MVCC stuff...
So, I consider my exercizes with mmgr as vacation from MVCC -:)

> > It shows that we should get rid of system malloc/free and do
> > all things in mmgr itself - this would allow us much faster
> > free memory contexts at statement/transaction end.
>
> I don't think we can or should stop using malloc(), but we can
> ask it for large blocks and do our own allocations inside those
> blocks --- was that what you meant?

No. We could ask brk() for large blocks. The problem is where to
handle dynamic allocations. As long as they are handled by malloc we
can't put them in proper blocks of current memory context. But having
our own handling malloc would become useless.

Vadim
On 25-Jan-99 Vadim Mikheev wrote:
>
> Yes - I have to return to MVCC stuff...
> So, I consider my exercizes with mmgr as vacation from MVCC -:)

Dumb question. What is MVCC? It almost sounds like a M$ compiler.

Vince.
--
==========================================================================
Vince Vielhaber -- KA8CSH   email: vev@michvhf.com   flame-mail: /dev/null
       # include <std/disclaimers.h>                 TEAM-OS2
    Online Searchable Campground Listings   http://www.camping-usa.com
"There is no outfit less entitled to lecture me about bloat than the
federal government" -- Tony Snow
==========================================================================
On Sun, 24 Jan 1999, Goran Thyni wrote:

> The Hermit Hacker wrote:
> > BUGS
> >      The alloca() function is machine dependent; its use is discouraged.
>
> Gain a big potential speed boost, loose some portability.
> I think most modern unices has a good alloca, DOS and Mac
> don't, but who's porting the server there?
>
> Any unices out there missing alloca?

Make you a deal...

You build the code such that "#ifdef HAVE_ALLOCA" is true, so that those
platforms that either don't support it, or have broken suport for it,
aren't affect, and you are most welcome to work on it...

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org    secondary: scrappy@{freebsd|postgresql}.org
> On Sun, 24 Jan 1999, Goran Thyni wrote:
>
> > The Hermit Hacker wrote:
> > > BUGS
> > >      The alloca() function is machine dependent; its use is discouraged.
> >
> > Gain a big potential speed boost, loose some portability.
> > I think most modern unices has a good alloca, DOS and Mac
> > don't, but who's porting the server there?
> >
> > Any unices out there missing alloca?
>
> Make you a deal...
>
> You build the code such that "#ifdef HAVE_ALLOCA" is true, so that those
> platforms that either don't support it, or have broken suport for it,
> aren't affect, and you are most welcome to work on it...

As far as I know, alloca is only useful for memory that is used by the
current function or its children. I don't think we have many cases
where we could use that.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
>>>> The Hermit Hacker wrote:
>> You build the code such that "#ifdef HAVE_ALLOCA" is true, so that those
>> platforms that either don't support it, or have broken suport for it,
>> aren't affect, and you are most welcome to work on it...

alloca is treated as a builtin function by gcc, so it should work on
any platform if you use gcc.  If your compiler is not gcc then alloca
is *very* iffy --- as far as I know, it is not specified by any
recognized standard.  So I concur with Marc's opinion: a speedup on
alloca-supporting platforms is nice, but it is not acceptable to break
the code on machines without it.

Bruce Momjian <maillist@candle.pha.pa.us> writes:
> As far as I know, alloca is only useful for memory that is used by the
> current function or its children.  I don't think we have many cases
> where we could use that.

Right, storage allocated by alloca is part of the calling function's
stack frame --- it goes away when that function exits, no sooner and
no later.

I have noticed some places in Postgres where routines palloc storage
that they pfree before exiting.  alloca would be a fine substitute for
that usage.  But as far as I've seen in profiling, the primary callers
of palloc are allocating storage that they will *not* free, but indeed
will return to their caller --- the type-specific conversion routines
all work that way, for example.  So I'm doubtful that adding alloca to
our tool arsenal would be worth the portability headaches it would
cost.

I have been toying with the notion of nested memory contexts, which
would be created at the start of certain major routines and deleted
when they exit.  palloc would allocate storage from the most closely
nested context, and thus would mean "allocate storage that will live
through the current major operation, whatever it is".  This is still a
very half-baked notion and I can't say that it will work, but I
thought I'd throw it into the thread...

			regards, tom lane
Vadim Mikheev <vadim@krs.ru> writes:
> Tom Lane wrote:
>> I don't think we can or should stop using malloc(), but we can
>> ask it for large blocks and do our own allocations inside those
>> blocks --- was that what you meant?

> No. We could ask brk() for large blocks.

I think that would be a bad idea.  brk() is a Unix-ism; I doubt it's
supported on Win NT, for example.  malloc() is a lot more portable.

Another potential portability issue is whether malloc() will coexist
with calling brk() ourselves.  (It *ought* to, but I can believe that
the feature might be broken on some platforms, since it's so seldom
exercised...)  We can't stop all uses of malloc(), because parts of
the C library use it --- stdio, qsort, putenv all do on my machine.

If we're going to grab large chunks and keep them, then any small
inefficiency in doing the grabbing isn't really worth worrying about;
so I don't see the need to bypass malloc() for that.

			regards, tom lane
Tom Lane wrote:
>
> Vadim Mikheev <vadim@krs.ru> writes:
> > Tom Lane wrote:
> >> I don't think we can or should stop using malloc(), but we can
> >> ask it for large blocks and do our own allocations inside those
> >> blocks --- was that what you meant?
>
> > No. We could ask brk() for large blocks.
>
> I think that would be a bad idea.  brk() is a Unix-ism; I doubt it's
> supported on Win NT, for example.  malloc() is a lot more portable.
>
> Another potential portability issue is whether malloc() will coexist
> with calling brk() ourselves.  (It *ought* to, but I can believe that
> the feature might be broken on some platforms, since it's so seldom
> exercised...)  We can't stop all uses of malloc(), because parts of
> the C library use it --- stdio, qsort, putenv all do on my machine.
>
> If we're going to grab large chunks and keep them, then any small
> inefficiency in doing the grabbing isn't really worth worrying about;
> so I don't see the need to bypass malloc() for that.

Ok, I agreed.

Vadim
> Right, storage allocated by alloca is part of the calling function's
> stack frame --- it goes away when that function exits, no sooner and
> no later.

And as I'm sure many already know, there are limits on stack frame
size on many architectures, and limits on total stack size on every
platform. The available size is not in the same league as
malloc()-style heap allocation.

                      - Tom
Bruce Momjian wrote:
> As far as I know, alloca is only useful for memory that is used by the
> current function or its children.  I don't think we have many cases
> where we could use that.

Perhaps not in so many obvious places, but with some restructuring
perhaps. I plan to look into this too, time permitting, don't hold
your breath.

regards,
--
-----------------
Göran Thyni
This is Penguin Country. On a quiet night you can hear Windows NT
reboot!
The Hermit Hacker wrote:
> Make you a deal...
>
> You build the code such that "#ifdef HAVE_ALLOCA" is true, so that those
> platforms that either don't support it, or have broken suport for it,
> aren't affect, and you are most welcome to work on it...

Yeah right, I should keep my mouth shut. :-)

It is a big job, starting with getting profiling working on my box,
analyzing lots of code and then coming up with some optimizations.

We can start with the patch below to detect alloca() in autoconf.

best regards,
--
-----------------
Göran Thyni
This is Penguin Country. On a quiet night you can hear Windows NT
reboot!

diff -rc pgsql-orig/src/configure.in pgsql/src/configure.in
*** pgsql-orig/src/configure.in	Mon Jan 18 07:01:04 1999
--- pgsql/src/configure.in	Tue Jan 26 15:03:12 1999
***************
*** 692,697 ****
--- 692,698 ----
  AC_FUNC_MEMCMP
  AC_TYPE_SIGNAL
  AC_FUNC_VPRINTF
+ AC_FUNC_ALLOCA
  AC_CHECK_FUNCS(tzset memmove sigsetjmp kill sysconf fpclass)
  AC_CHECK_FUNCS(fp_class fp_class_d class)
  AC_CHECK_FUNCS(sigprocmask waitpid setsid fcvt)
diff -rc pgsql-orig/src/include/config.h.in pgsql/src/include/config.h.in
*** pgsql-orig/src/include/config.h.in	Mon Jan 18 07:02:26 1999
--- pgsql/src/include/config.h.in	Tue Jan 26 15:05:37 1999
***************
*** 115,120 ****
--- 115,123 ----
  # define gettimeofday(a,b) gettimeofday(a)
  #endif

+ /* Set to 1 if you have alloca() */
+ #undef HAVE_ALLOCA
+
  /* Set to 1 if you have snprintf() */
  #undef HAVE_SNPRINTF
Before this patch is applied, havae we determined wheteher or not its even
a useful thing to pursue? I thought the general conscious was that, for
us, it wasn't...

On Tue, 26 Jan 1999, Goran Thyni wrote:

> The Hermit Hacker wrote:
> > Make you a deal...
> >
> > You build the code such that "#ifdef HAVE_ALLOCA" is true, so that those
> > platforms that either don't support it, or have broken suport for it,
> > aren't affect, and you are most welcome to work on it...
>
> Yeah right, I should keep my mouth shut. :-)
>
> It is a big job, starting getting with getting profiling working on
> my box, analyzing lots of code and then come up with some optimations.
>
> We can start with the patch below to detect alloca() in autoconf.
>
> best regards,
> --
> -----------------
> Göran Thyni
> This is Penguin Country. On a quiet night you can hear Windows NT
> reboot!

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org    secondary: scrappy@{freebsd|postgresql}.org
The Hermit Hacker <scrappy@hub.org> writes:
> Before this patch is applied, have we determined whether or not it's even
> a useful thing to pursue?  I thought the general consensus was that, for
> us, it wasn't...

If we can demonstrate through profiling that there's someplace where
using alloca offers a worthwhile speedup, then let's go for it.  But,
until we find such a hotspot, I'd be inclined not to spend configure
cycles on testing for alloca...

			regards, tom lane
The Hermit Hacker wrote:
> Before this patch is applied, have we determined whether or not it's even
> a useful thing to pursue?  I thought the general consensus was that, for
> us, it wasn't...

I would have to investigate further before making any final judgement.
But having the test in configure/config.h can't hurt should I or someone
else find a worthwhile speedup using alloca, can it?

If you want something more substantial you will have to wait;
I just posted the patch to make it available to other
eager investigators.

happy hacking,
-- 
-----------------
Göran Thyni
This is Penguin Country. On a quiet night you can hear Windows NT reboot!
On Wed, 27 Jan 1999, Goran Thyni wrote:
> The Hermit Hacker wrote:
> > Before this patch is applied, have we determined whether or not it's even
> > a useful thing to pursue?  I thought the general consensus was that, for
> > us, it wasn't...
>
> I would have to investigate further before making any final judgement.
> But having the test in configure/config.h can't hurt should I or someone
> else find a worthwhile speedup using alloca, can it?
>
> If you want something more substantial you will have to wait;
> I just posted the patch to make it available to other
> eager investigators.

Let's go with the wait scenario... one thing I've always hated with other
packages and configure is watching them run forever checking for things
that are never used, but checking "just 'cause it can" *roll eyes*

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
> > Tom Lane wrote:
> >
> > > Note that our mmgr adds 16 bytes to each allocation
> > > (+ some bytes in malloc) - a great overhead, yes?
> >
> > Youch ... for a list-node-sized request, that's a lot of overhead.
>
> And lists are very often used by planner and - never pfreed.
>
> > Getting this right is probably going to take some thought and work,
> > but it looks worthwhile from a performance standpoint.  Maybe for 6.6?
>
> Yes - I have to return to MVCC stuff...
> So, I consider my exercises with mmgr as vacation from MVCC -:)
>
> > > It shows that we should get rid of system malloc/free and do
> > > all things in mmgr itself - this would allow us much faster
> > > free memory contexts at statement/transaction end.
> >
> > I don't think we can or should stop using malloc(), but we can
> > ask it for large blocks and do our own allocations inside those
> > blocks --- was that what you meant?
>
> No. We could ask brk() for large blocks.
> The problem is where to handle dynamic allocations.
> As long as they are handled by malloc we can't put
> them in proper blocks of current memory context.
> But having our own handling malloc would become useless.
>
> Vadim

We could use 4 methods for dynamic allocation:

1) malloc/free - for persistent storage allocation

2) palloc/pfree - for storage belonging to some context and which we can
   keep track of and free explicitly

3) fast_palloc - for storage which is impossible, too difficult or too
   expensive to keep track of.  This storage should be allocated with
   fast and simple inline code from bigger chunks allocated with palloc.
   This storage would never be freed explicitly, so that code could be
   simple and fast, but the big chunks would be freed automatically at
   the end of the transaction.

4) fast_talloc - we could introduce a `tuple' context handled like
   fast_palloc for storage used only while processing one tuple.
   This storage could be fast allocated from a few big chunks allocated
   with palloc and freed explicitly after the tuple has been processed.
   This could avoid the overhead of many malloc/palloc calls while
   reducing the overall memory usage for transactions which process
   many rows.  The total cost per tuple could be one palloc and one
   pfree.  We could also simply reuse the chunks for every tuple and
   pfree them only at the end of the transaction.  This would cost one
   palloc/pfree per transaction.

This would require revising the code and changing palloc/pfree with the
new functions where appropriate, but this could be done gradually
because the old palloc/pfree are always safe.

-- 
Massimo Dal Zotto

+----------------------------------------------------------------------+
|  Massimo Dal Zotto               email: dz@cs.unitn.it                |
|  Via Marconi, 141                phone: ++39-0461534251               |
|  38057 Pergine Valsugana (TN)    www: http://www.cs.unitn.it/~dz/     |
|  Italy                           pgp: finger dz@tango.cs.unitn.it     |
+----------------------------------------------------------------------+
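The fast_palloc idea (items 3 and 4) can be sketched in a few lines: hand out small pieces from big malloc'd chunks and release everything in one sweep. All names here are invented for illustration, and oversized requests are simply refused rather than forwarded to the regular allocator, which a real implementation would do.

```c
/*
 * Sketch of the fast_palloc scheme with invented names: small requests
 * are carved out of 8K chunks; pool_reset() is the only "free".
 */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define POOL_CHUNK_SIZE 8192

typedef struct PoolChunk
{
    struct PoolChunk *next;
    size_t            used;
    char              data[POOL_CHUNK_SIZE];
} PoolChunk;

typedef struct
{
    PoolChunk *head;
} Pool;

static void *
fast_palloc(Pool *pool, size_t size)
{
    void *p;

    size = (size + 15) & ~(size_t) 15;  /* keep results 16-byte aligned */
    if (size > POOL_CHUNK_SIZE)
        return NULL;    /* big requests would go to the regular palloc() */

    if (pool->head == NULL || pool->head->used + size > POOL_CHUNK_SIZE)
    {
        /* grab a fresh big chunk and push it on the list */
        PoolChunk *c = malloc(sizeof(PoolChunk));

        c->next = pool->head;
        c->used = 0;
        pool->head = c;
    }
    p = pool->head->data + pool->head->used;
    pool->head->used += size;
    return p;
}

/* One sweep at end of transaction replaces thousands of pfree() calls. */
static void
pool_reset(Pool *pool)
{
    while (pool->head)
    {
        PoolChunk *next = pool->head->next;

        free(pool->head);
        pool->head = next;
    }
}
```

The appeal is exactly what Massimo describes: the per-allocation cost is a pointer bump, and the bookkeeping cost is amortized over an entire transaction.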
Massimo Dal Zotto <dz@cs.unitn.it> writes:
> We could use 4 methods for dynamic allocation:

This sounds pretty close to what I was thinking, too.  We already have
the notion of "memory contexts" which can have different alloc/free
implementations, so there's already a structure to put these different
alloc methods into.

However, we probably don't really want to touch each individual palloc()
call in the system to change it to a MemoryContextAlloc(some-context, size)
call.  Not only would that be horribly tedious, but an awful lot of
these calls are made for the purpose of creating an object that the
requesting routine intends to pass back to its caller.  In that
situation, the requesting routine doesn't really *know* what the
lifetime of the storage needs to be --- it's the caller that determines
what's going to happen.

I think what we want is to define palloc() as allocating
not-individually-freeable storage from the current context, which is set
by the most closely nested main control routine (parser, planner,
executor, etc).  That should handle the majority of the existing code
and thus minimize the number of places that have to be touched.
pfree() goes away, eventually, and can be made a no-op meanwhile.

This assumes that the number of places that really *want* to free
storage earlier than end of context is much smaller than the number of
places that don't want to be bothered --- therefore, it makes more sense
to redefine palloc() as the not-separately-freeable kind and change the
places that do want to control their storage lifetime, instead of
changing all the other places.

> 4) fast_talloc - we could introduce a `tuple' context handled like
>    fast_palloc for storage used only while processing one tuple.

Right, that could be the most closely nested context for operations
that iterate over tuples.
A further idea I had is to define allocations from that context as being
only good for one tuple, but to not bother to physically clear the
context after every single tuple --- do it every ten or hundred or so
tuples, instead.  The amount of storage wasted won't be very large, and
it cuts the per-tuple time cost even more.  (Of course, if resetting a
context to empty only takes a few instructions, you might as well just
do it every time.  I'm not sure offhand if that will be true or not.)

			regards, tom lane
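The amortized-reset idea above can be sketched as a simple counter; names are invented and the reset body is elided, since it depends on the context implementation.

```c
/*
 * Sketch of the amortized reset: the per-tuple memory context is only
 * emptied every RESET_INTERVAL tuples, trading a little wasted memory
 * for fewer reset calls.
 */
#include <assert.h>

#define RESET_INTERVAL 100

static int
process_tuples(int ntuples)
{
    int resets = 0;
    int since_reset = 0;
    int i;

    for (i = 0; i < ntuples; i++)
    {
        /* ... per-tuple palloc()s would happen here ... */
        if (++since_reset >= RESET_INTERVAL)
        {
            /* reset the tuple context: all per-tuple storage vanishes */
            since_reset = 0;
            resets++;
        }
    }
    return resets;      /* one reset per RESET_INTERVAL tuples */
}
```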
I have found a bug and tried to trace the source, but now I need some
pointers.

ENV:  Pgsql 6.4.2 (and current) compiled with locale (using Swedish locale)

BUG:
A small table: CREATE TABLE x(txt text);

goran=> SELECT * FROM x;
txt
------
abc
åäö
Grön
Gunnar
GUNNAR
Göran
göran
GÖRAN
(8 rows)

but:

goran=> select * from x WHERE txt LIKE 'G%';
txt
------
Grön
Gunnar
GUNNAR
(3 rows)

The same goes for "select * from x WHERE txt ~ '^G'", which is the same
as the above LIKE statement in the backend.  Otherwise regex works
correctly.

To sum up: a case-sensitive regex anchored with '^', where the first
character following the match is a non-ASCII character, gives misses.

I have tracked it down to the following; 3 functions are called to test
the expression:

1. textregexeq            OK for 5 instances
2. text_ge('G','G')       also OK
3. text_lt('G','G\0xFF')  this is not correct!

Case 3 does not work with strcoll(), in varstr_cmp().  If I change
strcoll() to strncmp() it works as expected, but it probably breaks
sorting etc big time.

Any suggestions how to proceed?

TIA,
-- 
---------------------------------------------
Göran Thyni, JMS Bildbasen, Kiruna
This is Penguin Country. On a quiet night you can hear Windows NT reboot!
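To see why case 3 misbehaves only under locale: the planner's 'G\xFF' upper bound relies on byte-wise comparison, where the 0xFF byte outranks every other byte; strcoll() instead consults the locale's collation table, which may rank the 0xFF character (e.g. 'ÿ' in Latin-1) low, letting "Göran" escape the bound. A small demonstration of the byte-wise half; sign() and bytewise_below_bound() are illustrative helpers, not backend functions.

```c
/*
 * Demonstration helpers (not backend code).  strncmp() compares bytes
 * as unsigned char, so "G\xFF" is a valid upper bound for strings
 * starting with 'G' under byte-wise comparison.  strcoll() would look
 * the bytes up in the locale's collation table instead, where the 0xFF
 * character may rank low - which is how the bound breaks under locale.
 */
#include <assert.h>
#include <string.h>

/* sign of a comparison result: -1, 0 or +1 */
static int
sign(int x)
{
    return (x > 0) - (x < 0);
}

/* the planner's test "txt < 'G\xFF'", done byte-wise */
static int
bytewise_below_bound(const char *s)
{
    return strncmp(s, "G\xFF", 2) < 0;
}
```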
Tom Lane wrote:
> However, we probably don't really want to touch each individual palloc()
> call in the system to change it to a MemoryContextAlloc(some-context, size)
> call.  Not only would that be horribly tedious, but an awful lot of
> these calls are made for the purpose of creating an object that the
> requesting routine intends to pass back to its caller.  In that
> situation, the requesting routine doesn't really *know* what the
> lifetime of the storage needs to be --- it's the caller that determines
> what's going to happen.

There are about 900 calls of palloc() in the backend code.  It is much
less than I expected (we have over 200,000 lines of code).  So I vote
for doing it ONCE HIGH QUALITY, instead of halfway but easy to do.

And I don't think you could avoid at least looking at each individual
palloc().  You defined nesting levels, where at their end the allocated
memory gets freed.  But there are many places where deeper nested
functions put things into structures that are held in upper levels and
thus live longer.  These upper levels continue to use the things the
callees placed in there.

Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #
jwieck@debis.com (Jan Wieck) writes:
> Tom Lane wrote:
>> However, we probably don't really want to touch each individual palloc()
>> call in the system ...

> There are about 900 calls of palloc() in the backend code.  It
> is much less than I expected (we have over 200,000 lines of
> code).

Much less than I thought, also.  And (grep|wc...) over 300 of these
calls are in the utils/adt code, ie, they are for passing back the
results of type conversion functions.  We'll only need to think through
how that should work once, and then it's just a mechanical edit to all
the ADT files.

> So I vote for doing it ONCE HIGH QUALITY, instead of half way
> but easy to do.

OK, I agree.

> And I don't think you could avoid at least looking at each
> individual palloc().  You defined nesting levels, where at
> their end the allocated memory gets freed.  But there are
> many places where deeper nested functions put things into
> structures that are held in upper levels and thus live
> longer.  These upper levels continue to use the things the
> callees placed in there.

What about my prior point that the bottom-level function may not know
how long the caller needs the storage?  Will we have to add a "memory
context to put result in" parameter to a lot of routines?  Ugh, but
maybe it's the only way.

I've noticed that everyone else contributing to this thread has been
thinking in terms of inventing multiple allocation functions with
different names, ie a routine might have to call "palloc" or
"fast_palloc" or "tmp_palloc" or whatever depending on what behavior it
wants.  I really think we are better off to stick to the structure that
we already have (cf. include/nodes/memnodes.h and include/utils/mcxt.h),
in which there's one set of interface routines that take a "context"
parameter, and the context determines the semantics of allocation.
(The interface routines probably ought to be macros, not out-of-line
code, but that's a minor optimization.)  This will make it easier to
handle cases where one routine has to tell another one how to allocate
its result space: you pass a MemoryContext parameter.

			regards, tom lane
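The interface shape described here might look roughly like the following; the struct layout loosely mirrors include/nodes/memnodes.h, but the names and the malloc-backed example context are simplified illustrations, not the real declarations.

```c
/*
 * Simplified illustration of a context-parameter allocation interface.
 * Not the real PostgreSQL declarations.
 */
#include <assert.h>
#include <stdlib.h>

typedef struct MemoryContextData *MemoryContext;

typedef struct
{
    void *(*alloc) (MemoryContext ctx, size_t size);
    void  (*reset) (MemoryContext ctx);     /* frees everything at once */
} MemoryContextMethods;

struct MemoryContextData
{
    const MemoryContextMethods *methods;
};

/*
 * The interface routine is a macro, as suggested, so allocation costs
 * one indirect call and no extra function frame.
 */
#define MemoryContextAlloc(ctx, size) ((ctx)->methods->alloc((ctx), (size)))

/* A trivial method set: pass through to malloc.  A real context would
 * carve from its own blocks and support a cheap reset. */
static void *
passthru_alloc(MemoryContext ctx, size_t size)
{
    (void) ctx;
    return malloc(size);
}

static void
passthru_reset(MemoryContext ctx)
{
    (void) ctx;     /* nothing pooled, nothing to do */
}

static const MemoryContextMethods passthru_methods = {
    passthru_alloc,
    passthru_reset
};
```

A routine that must return storage with caller-determined lifetime then simply takes a MemoryContext argument and calls MemoryContextAlloc(ctx, size) instead of palloc(size); the caller decides which context's semantics apply.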
Tom Lane wrote:
>
> jwieck@debis.com (Jan Wieck) writes:
> > Tom Lane wrote:
> >> However, we probably don't really want to touch each individual palloc()
> >> call in the system ...
>
> > There are about 900 calls of palloc() in the backend code.  It
> > is much less than I expected (we have over 200,000 lines of
> > code).
>
> Much less than I thought, also.  And (grep|wc...) over 300 of these
> calls are in the utils/adt code, ie, they are for passing back the
> results of type conversion functions.  We'll only need to think through
> how that should work once, and then it's just a mechanical edit to all
> the ADT files.

Exactly these 300 ones (except for 20 or 30 of them) are those that the
caller needs :-)

> What about my prior point that the bottom-level function may not know
> how long the caller needs the storage?  Will we have to add a "memory
> context to put result in" parameter to a lot of routines?  Ugh, but
> maybe it's the only way.

I've browsed a little through the code (well, the candidates I thought
are good choices for optimization).  The problem there is that the order
of allocation hasn't anything to do with the lifecycle of the objects.
So if we want to handle allocations differently depending on the object
lifecycle, the allocators (mainly their callers) would have to switch
often between different contexts.  It wouldn't make the code much more
readable if between most of the statements some
MemoryContextStumbleOver() appears.

And many memory contexts means, on the other hand, many concurrently
read/written pages.  Thus I would expect it to affect the caching of
the CPU.

> I've noticed that everyone else contributing to this thread has been
> thinking in terms of inventing multiple allocation functions with
> different names, ie a routine might have to call "palloc" or
> "fast_palloc" or "tmp_palloc" or whatever depending on what behavior it
> wants.

So I.

> I really think we are better off to stick to the structure that
> we already have (cf. include/nodes/memnodes.h and include/utils/mcxt.h),
> in which there's one set of interface routines that take a "context"
> parameter, and the context determines the semantics of allocation.
> (The interface routines probably ought to be macros, not out-of-line
> code, but that's a minor optimization.)  This will make it easier to
> handle cases where one routine has to tell another one how to allocate
> its result space: you pass a MemoryContext parameter.

I came to the same conclusion.  So I continued with the approach of
bigger chunks handled in palloc().  What I have now is something that
gains about a 10% speedup on the regression test, while memory
consumption (visibly watched with top(1)) does not seem to rise
compared against the old version.

Since the bigger blocks are built on top of the existing memory context
model, it would not conflict with any enhanced usage we could make of
it.

I include a patch at the end - please comment.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #

diff -cr src.orig/backend/utils/mmgr/mcxt.c src/backend/utils/mmgr/mcxt.c
*** src.orig/backend/utils/mmgr/mcxt.c	Thu Jan 28 16:47:39 1999
--- src/backend/utils/mmgr/mcxt.c	Fri Jan 29 20:15:41 1999
***************
*** 106,111 ****
--- 106,112 ----
  static struct GlobalMemory TopGlobalMemoryData = {
      T_GlobalMemory,             /* NodeTag tag */
      &GlobalContextMethodsData,  /* ContextMethods method */
+     NULL,                       /* char* smallchunk_block */
      {{0}},                      /* uninitialized OrderedSetData allocSetD */
      "TopGlobal",                /* char* name */
      {0}                         /* uninitialized OrderedElemData elemD */
diff -cr src.orig/backend/utils/mmgr/palloc.c src/backend/utils/mmgr/palloc.c
*** src.orig/backend/utils/mmgr/palloc.c	Thu Jan 28 16:47:39 1999
--- src/backend/utils/mmgr/palloc.c	Fri Jan 29 21:57:06 1999
***************
*** 24,29 ****
--- 24,57 ----
  
  #include "utils/palloc.h"
  
+ 
+ typedef struct PallocBlock {
+     MemoryContext mcxt;
+     int           refcount;
+     char         *unused_chunk;
+     char         *freeptr;
+     char         *endptr;
+ } PallocBlock;
+ 
+ typedef struct PallocChunk {
+     PallocBlock  *blk;
+     Size          size;
+ } PallocChunk;
+ 
+ 
+ #define PALLOC_BLOCKSIZE \
+             (8192 - sizeof(OrderedElemData) - sizeof(Size))
+ #define PALLOC_CHUNK_LIMIT    512
+ #define PALLOC_SIZE_ALIGN(s)  ((s + 15) & ~15)
+ 
+ #define PALLOC_CHUNK_BLKPTR(c) (((PallocChunk *)(c))[-1].blk)
+ #define PALLOC_CHUNK_SIZE(c)   (((PallocChunk *)(c))[-1].size)
+ 
+ #define PALLOC_CHUNK_HDRSZ  MAXALIGN(sizeof(PallocChunk))
+ #define PALLOC_BLOCK_HDRSZ  MAXALIGN(sizeof(PallocBlock))
+ 
+ #define PALLOC_FREESPACE(b) ((b)->endptr - (b)->freeptr)
+ 
  /* ----------------------------------------------------------------
   *      User library functions
   * ----------------------------------------------------------------
***************
*** 66,72 ****
  #ifdef PALLOC_IS_MALLOC
      return malloc(size);
  #else
!     return MemoryContextAlloc(CurrentMemoryContext, size);
  #endif   /* PALLOC_IS_MALLOC */
  }
  
--- 94,168 ----
  #ifdef PALLOC_IS_MALLOC
      return malloc(size);
  #else
!     PallocBlock  *block;
!     char         *chunk;
!     Size          asize = PALLOC_SIZE_ALIGN(size);
! 
!     if (asize >= PALLOC_CHUNK_LIMIT)
!     {
!         chunk = (char *)MemoryContextAlloc(CurrentMemoryContext, size +
!                                            PALLOC_CHUNK_HDRSZ);
!         chunk += PALLOC_CHUNK_HDRSZ;
!         PALLOC_CHUNK_BLKPTR(chunk) = NULL;
!         return (void *)chunk;
!     }
! 
!     block = (PallocBlock *)(CurrentMemoryContext->smallchunk_block);
! 
!     if (block != NULL && PALLOC_FREESPACE(block) < asize)
!     {
!         char       *prev = NULL;
!         Size        chunk_size;
! 
!         chunk = block->unused_chunk;
!         while (chunk)
!         {
!             chunk_size = PALLOC_CHUNK_SIZE(chunk);
!             if (asize == chunk_size)
!             {
!                 if (prev == NULL)
!                     block->unused_chunk = (char *)PALLOC_CHUNK_BLKPTR(chunk);
!                 else
!                     PALLOC_CHUNK_BLKPTR(prev) = PALLOC_CHUNK_BLKPTR(chunk);
! 
!                 block->refcount++;
!                 PALLOC_CHUNK_BLKPTR(chunk) = block;
!                 /*
!                 PALLOC_CHUNK_SIZE(chunk) = size;
!                 */
!                 return (void *)chunk;
!             }
!             prev = chunk;
!             chunk = (char *)PALLOC_CHUNK_BLKPTR(chunk);
!         }
! 
!         block = NULL;
!     }
! 
!     if (block == NULL)
!     {
!         block = (PallocBlock *)MemoryContextAlloc(CurrentMemoryContext,
!                                                   PALLOC_BLOCKSIZE);
!         block->mcxt = CurrentMemoryContext;
!         block->unused_chunk = NULL;
!         block->refcount = 0;
!         block->freeptr = ((char *)block) + PALLOC_BLOCK_HDRSZ +
!                                            PALLOC_CHUNK_HDRSZ;
!         block->endptr = ((char *)block) + PALLOC_BLOCKSIZE;
! 
!         CurrentMemoryContext->smallchunk_block = (void *)block;
!     }
! 
!     chunk = block->freeptr;
!     block->freeptr += PALLOC_CHUNK_HDRSZ + asize;
!     block->refcount++;
!     PALLOC_CHUNK_BLKPTR(chunk) = block;
!     PALLOC_CHUNK_SIZE(chunk) = asize;
! 
!     if (block->freeptr >= block->endptr)
!         block->mcxt->smallchunk_block = NULL;
! 
!     return (void *)chunk;
  #endif   /* PALLOC_IS_MALLOC */
  }
  
***************
*** 76,82 ****
  #ifdef PALLOC_IS_MALLOC
      free(pointer);
  #else
!     MemoryContextFree(CurrentMemoryContext, pointer);
  #endif   /* PALLOC_IS_MALLOC */
  }
  
--- 172,202 ----
  #ifdef PALLOC_IS_MALLOC
      free(pointer);
  #else
!     PallocBlock  *block = PALLOC_CHUNK_BLKPTR(pointer);
! 
!     if (block == NULL)
!     {
!         MemoryContextFree(CurrentMemoryContext, (char *)pointer - PALLOC_CHUNK_HDRSZ);
!         return;
!     }
! 
!     PALLOC_CHUNK_BLKPTR(pointer) = (PallocBlock *)(block->unused_chunk);
!     block->unused_chunk = (char *)pointer;
! 
!     block->refcount--;
!     if (block->refcount == 0)
!     {
!         if (block == (PallocBlock *)(block->mcxt->smallchunk_block))
!         {
!             block->freeptr = ((char *)block) + PALLOC_BLOCK_HDRSZ +
!                                                PALLOC_CHUNK_HDRSZ;
!             block->unused_chunk = NULL;
!         }
!         else
!         {
!             MemoryContextFree(block->mcxt, (void *)block);
!         }
!     }
  #endif   /* PALLOC_IS_MALLOC */
  }
  
***************
*** 100,106 ****
  #ifdef PALLOC_IS_MALLOC
      return realloc(pointer, size);
  #else
!     return MemoryContextRealloc(CurrentMemoryContext, pointer, size);
  #endif
  }
  
--- 220,248 ----
  #ifdef PALLOC_IS_MALLOC
      return realloc(pointer, size);
  #else
!     PallocBlock  *block = PALLOC_CHUNK_BLKPTR(pointer);
!     char         *new;
!     Size          tocopy;
! 
!     if (block == NULL)
!     {
!         new = (char *)MemoryContextRealloc(CurrentMemoryContext,
!                                 (char *)pointer - PALLOC_CHUNK_HDRSZ,
!                                 size + PALLOC_CHUNK_HDRSZ);
!         new += PALLOC_CHUNK_HDRSZ;
!         PALLOC_CHUNK_BLKPTR(new) = NULL;
!         return (void *)new;
!     }
!     else
!     {
!         new = palloc(size);
! 
!         tocopy = PALLOC_CHUNK_SIZE(pointer) > size ?
!                     size : PALLOC_CHUNK_SIZE(pointer);
!         memcpy(new, pointer, tocopy);
!         pfree(pointer);
!         return (void *)new;
!     }
  #endif
  }
  
***************
*** 117,119 ****
--- 259,263 ----
      return nstr;
  }
  
+ 
+ 
diff -cr src.orig/backend/utils/mmgr/portalmem.c src/backend/utils/mmgr/portalmem.c
*** src.orig/backend/utils/mmgr/portalmem.c	Thu Jan 28 16:47:39 1999
--- src/backend/utils/mmgr/portalmem.c	Fri Jan 29 20:15:41 1999
***************
*** 390,395 ****
--- 390,396 ----
      NodeSetTag((Node *) &portal->variable, T_PortalVariableMemory);
      AllocSetInit(&portal->variable.setData, DynamicAllocMode, (Size) 0);
      portal->variable.method = &PortalVariableContextMethodsData;
+     portal->variable.smallchunk_block = NULL;
  
      /*
       * initialize portal heap context
***************
*** 399,404 ****
--- 400,406 ----
      FixedStackInit(&portal->heap.stackData,
                     offsetof(HeapMemoryBlockData, itemData));
      portal->heap.method = &PortalHeapContextMethodsData;
+     portal->heap.smallchunk_block = NULL;
  
      /*
       * set bogus portal name
***************
*** 756,761 ****
--- 758,764 ----
      NodeSetTag((Node *) &portal->variable, T_PortalVariableMemory);
      AllocSetInit(&portal->variable.setData, DynamicAllocMode, (Size) 0);
      portal->variable.method = &PortalVariableContextMethodsData;
+     portal->variable.smallchunk_block = NULL;
  
      /* initialize portal heap context */
      NodeSetTag((Node *) &portal->heap, T_PortalHeapMemory);
***************
*** 763,768 ****
--- 766,772 ----
      FixedStackInit(&portal->heap.stackData,
                     offsetof(HeapMemoryBlockData, itemData));
      portal->heap.method = &PortalHeapContextMethodsData;
+     portal->heap.smallchunk_block = NULL;
  
      /* initialize portal name */
      length = 1 + strlen(name);
***************
*** 918,923 ****
--- 922,928 ----
  
      /* free current mode */
      AllocSetReset(&HEAPMEMBLOCK(context)->setData);
+     context->smallchunk_block = NULL;
      MemoryContextFree((MemoryContext) PortalHeapMemoryGetVariableMemory(context),
                        context->block);
diff -cr src.orig/include/nodes/memnodes.h src/include/nodes/memnodes.h
*** src.orig/include/nodes/memnodes.h	Thu Jan 28 16:47:39 1999
--- src/include/nodes/memnodes.h	Fri Jan 29 20:15:41 1999
***************
*** 60,65 ****
--- 60,66 ----
  {
      NodeTag     type;
      MemoryContextMethods method;
+     void       *smallchunk_block;
  } *MemoryContext;
  
  /* think about doing this right some time but we'll have explicit fields
***************
*** 68,73 ****
--- 69,75 ----
  {
      NodeTag     type;
      MemoryContextMethods method;
+     void       *smallchunk_block;
      AllocSetData setData;
      char       *name;
      OrderedElemData elemData;
***************
*** 79,84 ****
--- 81,87 ----
  {
      NodeTag     type;
      MemoryContextMethods method;
+     void       *smallchunk_block;
      AllocSetData setData;
  } *PortalVariableMemory;
  
***************
*** 86,91 ****
--- 89,95 ----
  {
      NodeTag     type;
      MemoryContextMethods method;
+     void       *smallchunk_block;
      Pointer     block;
      FixedStackData stackData;
  } *PortalHeapMemory;
> I've noticed that everyone else contributing to this thread has been
> thinking in terms of inventing multiple allocation functions with
> different names, ie a routine might have to call "palloc" or
> "fast_palloc" or "tmp_palloc" or whatever depending on what behavior it
> wants.  I really think we are better off to stick to the structure that
> we already have (cf. include/nodes/memnodes.h and include/utils/mcxt.h),
> in which there's one set of interface routines that take a "context"
> parameter, and the context determines the semantics of allocation.
> (The interface routines probably ought to be macros, not out-of-line
> code, but that's a minor optimization.)  This will make it easier to
> handle cases where one routine has to tell another one how to allocate
> its result space: you pass a MemoryContext parameter.

I agree, inline is good.  I can imagine something like makeNode is going
to be tough.

I am going to give the parser as an example.  It palloc's all sorts of
things for rangetables, strings, structures, lists, but eventually just
exports a huge Query structure.  If we could somehow just copy the Query
structure and free all the other parser palloc allocations, that would
be great.

Same thing with the optimizer.  It makes all sorts of lists, but
eventually just exports Plans.  All the other stuff can be pfree'ed.
It is almost as though we could copy the Query structure to another
memory context, and just free all palloc'ed memory.

What we could try is to keep the various memory contexts like the cache
context unchanged, but allow pallocs in the standard memory context to
be allocated from a pool and free'ed automatically.  Our standard memory
context is overused and doesn't need the pfree overhead.  We could make
palloc a macro that tested for the current context, and did standard
pallocs from a free memory pool.
-- 
  Bruce Momjian                         |  http://www.op.net/~candle
  maillist@candle.pha.pa.us             |  (610) 853-3000
  +  If your life is a hard drive,      |  830 Blythe Avenue
  +  Christ can be your backup.         |  Drexel Hill, Pennsylvania 19026
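The copy-out-and-drop idea above can be sketched with invented names: the parser allocates freely into a scratch list, the finished result is copied out, and everything else is released in one sweep. A real version would need a deep copy (copyObject-style) for a Query tree; the shallow memcpy here is purely illustrative.

```c
/*
 * Sketch of "copy the result out, drop the rest".  All names invented.
 */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct ScratchItem
{
    struct ScratchItem *next;   /* every allocation is chained */
} ScratchItem;

typedef struct
{
    ScratchItem *items;
} Scratch;

/* parser-side allocation: no individual free needed, ever */
static void *
scratch_alloc(Scratch *s, size_t size)
{
    ScratchItem *it = malloc(sizeof(ScratchItem) + size);

    it->next = s->items;
    s->items = it;
    return it + 1;      /* user data lives just past the link */
}

/* copy the finished result somewhere that outlives the scratch pool
 * (a Query tree would need a deep copy, not a flat memcpy) */
static void *
scratch_export(const void *result, size_t size)
{
    void *copy = malloc(size);

    memcpy(copy, result, size);
    return copy;
}

/* then everything the parser allocated disappears in one sweep */
static void
scratch_free_all(Scratch *s)
{
    while (s->items)
    {
        ScratchItem *next = s->items->next;

        free(s->items);
        s->items = next;
    }
}
```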
Goran Thyni wrote:
> 3. text_lt('G','G\0xFF')  this is not correct!
>
> Case 3 does not work with strcoll(), in varstr_cmp().
> If I change strcoll() to strncmp() it works as expected,
> but it probably breaks sorting etc big time.

Here is a hack which solves the problem without breaking sorting AFAIK.
I have read a lot of code, but have not found any cleaner way to fix
this.  A cleaner solution would be to find the highest char in the
locale charset, but I found no portable way to do that; any pointers
appreciated.

It is not a beauty but it works.  Please apply to -current since it
fixes the bug, unless someone has a better suggestion.

best regards,
-- 
-----------------
Göran Thyni
This is Penguin Country. On a quiet night you can hear Windows NT reboot!

diff -cr cvs/pgsql/src/backend/utils/adt/varlena.c cvswork/pgsql/src/backend/utils/adt/varlena.c
*** cvs/pgsql/src/backend/utils/adt/varlena.c	Mon Dec 14 07:01:37 1998
--- cvswork/pgsql/src/backend/utils/adt/varlena.c	Sun Jan 31 16:32:52 1999
***************
*** 496,528 ****
  varstr_cmp(char *arg1, int len1, char *arg2, int len2)
  {
      int         result;
!     char       *a1p,
!                *a2p;
! 
  #ifdef USE_LOCALE
!     a1p = (unsigned char *) palloc(len1 + 1);
!     a2p = (unsigned char *) palloc(len2 + 1);
! 
!     memcpy(a1p, arg1, len1);
!     *(a1p + len1) = '\0';
!     memcpy(a2p, arg2, len2);
!     *(a2p + len2) = '\0';
! 
!     result = strcoll(a1p, a2p);
! 
!     pfree(a1p);
!     pfree(a2p);
! 
! #else
! 
!     a1p = arg1;
!     a2p = arg2;
! 
!     result = strncmp(a1p, a2p, Min(len1, len2));
!     if ((result == 0) && (len1 != len2))
!         result = (len1 < len2) ? -1 : 1;
  #endif
! 
      return result;
  }   /* varstr_cmp() */
--- 496,524 ----
  varstr_cmp(char *arg1, int len1, char *arg2, int len2)
  {
      int         result;
!     char       *a1p, *a2p;
  
  #ifdef USE_LOCALE
!     if ((int)arg2[len2 - 1] != -1)
!     {
!         a1p = (unsigned char *) palloc(len1 + 1);
!         a2p = (unsigned char *) palloc(len2 + 1);
!         memcpy(a1p, arg1, len1);
!         *(a1p + len1) = '\0';
!         memcpy(a2p, arg2, len2);
!         *(a2p + len2) = '\0';
!         result = strcoll(a1p, a2p);
!         pfree(a1p);
!         pfree(a2p);
!     }
!     else
  #endif
!     {
!         a1p = arg1;
!         a2p = arg2;
!         result = strncmp(a1p, a2p, Min(len1, len2));
!         if ((result == 0) && (len1 != len2))
!             result = (len1 < len2) ? -1 : 1;
!     }
      return result;
  }   /* varstr_cmp() */
> #ifdef USE_LOCALE
> !     if ((int)arg2[len2 - 1] != -1)

Sorry, I'm not as familiar with the LOCALE code as I should be.  Why are
we testing explicitly for -1?  Is there something special about a
character with all bits set in typical locale support code?

Regards.

                  - Tom
Hi!

On Mon, 1 Feb 1999, Thomas G. Lockhart wrote:
> > #ifdef USE_LOCALE
> > !     if ((int)arg2[len2 - 1] != -1)
>
> Sorry, I'm not as familiar with the LOCALE code as I should be.  Why are
> we testing explicitly for -1?  Is there something special about a
> character with all bits set in typical locale support code?

Other way around - in windows-1251 it is a pretty standard character.

Oleg.
---- 
Oleg Broytmann     National Research Surgery Centre     http://sun.med.ru/~phd/
Programmers don't die, they just GOSUB without RETURN.
"Thomas G. Lockhart" wrote:
>
> > #ifdef USE_LOCALE
> > !     if ((int)arg2[len2 - 1] != -1)
>
> Sorry, I'm not as familiar with the LOCALE code as I should be.  Why are
> we testing explicitly for -1?  Is there something special about a
> character with all bits set in typical locale support code?
>
> Regards.
>
> - Tom

It looks a bit cryptic, yes, but it is not specific to locale.

The planner appends a "(char) -1" (or, in the alternative syntax,
'\xFF') to the test string in some cases of "<=" on text (for example
in "txt ~ '^G'").  This works OK with strncmp, which compares char by
char as unsigned ints, but locale support uses lookup tables where
'\xFF' is not guaranteed to be greater than or equal to every other
unsigned 8-bit character.

Therefore I did a fallback to strncmp (instead of the locale-aware
strcoll) for this special case.  It is not the perfect solution and may
break in a few cases, but as it is now it breaks most usage of LIKE and
~ when locale is enabled.

If we could find, in a portable way, the last usable character in the
charset in use, that would be nice, but I have not found any way to do
that.

regards,
-- 
---------------------------------------------
Göran Thyni, JMS Bildbasen, Kiruna
This is Penguin Country. On a quiet night you can hear Windows NT reboot!
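The "last usable character" idea can at least be prototyped for single-byte charsets: rank every candidate byte with strcoll() under the current LC_COLLATE and keep the winner. This is only a sketch (multibyte encodings would need a different approach entirely, and highest_collating_char() is an invented name, not backend code).

```c
/*
 * Prototype, single-byte charsets only: find the byte value that the
 * current locale's collation ranks highest.  The planner could append
 * this byte instead of a hard-coded '\xFF' when it builds the upper
 * bound for LIKE 'G%'.
 */
#include <assert.h>
#include <string.h>
#include <locale.h>

static unsigned char
highest_collating_char(void)
{
    char probe[2] = {0, 0};
    char best[2] = {1, 0};
    int  c;

    for (c = 2; c < 256; c++)
    {
        probe[0] = (char) c;
        if (strcoll(probe, best) > 0)   /* locale-aware comparison */
            best[0] = probe[0];
    }
    return (unsigned char) best[0];
}
```

In the default "C" locale strcoll() degenerates to strcmp(), so the scan simply returns 255; under a real locale it returns whatever character the collation table puts last.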
Oleg Broytmann wrote:
>
> Hi!
>
> On Mon, 1 Feb 1999, Thomas G. Lockhart wrote:
> > > #ifdef USE_LOCALE
> > > !     if ((int)arg2[len2 - 1] != -1)
> >
> > Sorry, I'm not as familiar with the LOCALE code as I should be.  Why are
> > we testing explicitly for -1?  Is there something special about a
> > character with all bits set in typical locale support code?
>
> Other way around - in windows-1251 it is pretty standard character.

OK, and it is probably not the last character in the alphabet either.
In that case we need another solution, I am afraid.

Anyone got any ideas how to resolve this?

regards,
-- 
---------------------------------------------
Göran Thyni, JMS Bildbasen, Kiruna
This is Penguin Country. On a quiet night you can hear Windows NT reboot!
Hi!

On Mon, 1 Feb 1999, Goran Thyni wrote:
> > Other way around - in windows-1251 it is pretty standard character.
> and it is probably not the last character in the alphabet either.

In win-1251 it is the last character in the alphabet.  In koi8-r it is
not.

Oleg.
---- 
Oleg Broytmann     http://members.xoom.com/phd2/     phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.
Oleg Broytmann wrote:
>
> Hi!
>
> On Mon, 1 Feb 1999, Goran Thyni wrote:
> > > Other way around - in windows-1251 it is pretty standard character.
> > and it is probably not the last character in the alphabet either.
>
> In win-1251 it is the last character in the alphabet.  In koi8-r it is not.

I posted a new patch that corrects the problem in a way that doesn't
break koi8-r (and possibly others).  I hope it will get applied before
beta starts.

mvh,
-- 
---------------------------------------------
Göran Thyni, JMS Bildbasen, Kiruna
This is Penguin Country. On a quiet night you can hear Windows NT reboot!
Hi! On Tue, 2 Feb 1999, Goran Thyni wrote: > I posted a new patch that corrects the problem in a way that > doesn't break koi8-r (and possibly others). I hope it will get > applied before beta starts. I saw the patch. I'll try to apply it and test. BTW, if you are looking into locale - could you please supply a locale regression test for your locale? Look into ...src/test/locale, copy the koi8-r directory and supply your test data. Oleg. ---- Oleg Broytmann National Research Surgery Centre http://sun.med.ru/~phd/ Programmers don't die, they just GOSUB without RETURN.
I wrote: > > I came to the same conclusion. So I continued with the > approach of bigger chunks handled in palloc(). What I have > now is something, that gains about 10% speedup at the > regression test, while memory consumption (visibly watched > with top(1)) seems not to raise compared against old version. > > Since the bigger blocks are built on top of the existing > memory context model, it would not conflict with any enhanced > usage we could make with it. > > I include a patch at the end - please comment. > > > Jan Did anyone play around with it? I've had it installed now for some days and it work's well so far. How close are we to v6.5 BETA? Should I apply it to CURRENT? > diff -cr src.orig/backend/utils/mmgr/mcxt.c src/backend/utils/mmgr/mcxt.c > *** src.orig/backend/utils/mmgr/mcxt.c Thu Jan 28 16:47:39 1999 > --- src/backend/utils/mmgr/mcxt.c Fri Jan 29 20:15:41 1999 > *************** > *** 106,111 **** > --- 106,112 ---- > static struct GlobalMemory TopGlobalMemoryData = { > T_GlobalMemory, /* NodeTag tag */ > &GlobalContextMethodsData, /* ContextMethods method */ > + NULL, /* char* smallchunk_block */ > {{0}}, /* uninitialized OrderedSetData allocSetD */ > "TopGlobal", /* char* name */ > {0} /* uninitialized OrderedElemData elemD */ > diff -cr src.orig/backend/utils/mmgr/palloc.c src/backend/utils/mmgr/palloc.c > *** src.orig/backend/utils/mmgr/palloc.c Thu Jan 28 16:47:39 1999 > --- src/backend/utils/mmgr/palloc.c Fri Jan 29 21:57:06 1999 > *************** > *** 24,29 **** > --- 24,57 ---- > > #include "utils/palloc.h" > > + > + typedef struct PallocBlock { > + MemoryContext mcxt; > + int refcount; > + char *unused_chunk; > + char *freeptr; > + char *endptr; > + } PallocBlock; > + > + typedef struct PallocChunk { > + PallocBlock *blk; > + Size size; > + } PallocChunk; > + > + > + #define PALLOC_BLOCKSIZE \ > + (8192 - sizeof(OrderedElemData) - sizeof(Size)) > + #define PALLOC_CHUNK_LIMIT 512 > + #define PALLOC_SIZE_ALIGN(s) ((s + 15) & ~15) > + > + 
#define PALLOC_CHUNK_BLKPTR(c) (((PallocChunk *)(c))[-1].blk) > + #define PALLOC_CHUNK_SIZE(c) (((PallocChunk *)(c))[-1].size) > + > + #define PALLOC_CHUNK_HDRSZ MAXALIGN(sizeof(PallocChunk)) > + #define PALLOC_BLOCK_HDRSZ MAXALIGN(sizeof(PallocBlock)) > + > + #define PALLOC_FREESPACE(b) ((b)->endptr - (b)->freeptr) > + > /* ---------------------------------------------------------------- > * User library functions > * ---------------------------------------------------------------- > *************** > *** 66,72 **** > #ifdef PALLOC_IS_MALLOC > return malloc(size); > #else > ! return MemoryContextAlloc(CurrentMemoryContext, size); > #endif /* PALLOC_IS_MALLOC */ > } > > --- 94,168 ---- > #ifdef PALLOC_IS_MALLOC > return malloc(size); > #else > ! PallocBlock *block; > ! char *chunk; > ! Size asize = PALLOC_SIZE_ALIGN(size); > ! > ! if (asize >= PALLOC_CHUNK_LIMIT) > ! { > ! chunk = (char *)MemoryContextAlloc(CurrentMemoryContext, size + > ! PALLOC_CHUNK_HDRSZ); > ! chunk += PALLOC_CHUNK_HDRSZ; > ! PALLOC_CHUNK_BLKPTR(chunk) = NULL; > ! return (void *)chunk; > ! } > ! > ! block = (PallocBlock *)(CurrentMemoryContext->smallchunk_block); > ! > ! if (block != NULL && PALLOC_FREESPACE(block) < asize) > ! { > ! char *prev = NULL; > ! Size chunk_size; > ! > ! chunk = block->unused_chunk; > ! while (chunk) > ! { > ! chunk_size = PALLOC_CHUNK_SIZE(chunk); > ! if (asize == chunk_size) > ! { > ! if (prev == NULL) > ! block->unused_chunk = (char *)PALLOC_CHUNK_BLKPTR(chunk); > ! else > ! PALLOC_CHUNK_BLKPTR(prev) = PALLOC_CHUNK_BLKPTR(chunk); > ! > ! block->refcount++; > ! PALLOC_CHUNK_BLKPTR(chunk) = block; > ! /* > ! PALLOC_CHUNK_SIZE(chunk) = size; > ! */ > ! return (void *)chunk; > ! } > ! prev = chunk; > ! chunk = (char *)PALLOC_CHUNK_BLKPTR(chunk); > ! } > ! > ! block = NULL; > ! } > ! > ! if (block == NULL) > ! { > ! block = (PallocBlock *)MemoryContextAlloc(CurrentMemoryContext, > ! PALLOC_BLOCKSIZE); > ! block->mcxt = CurrentMemoryContext; > ! 
block->unused_chunk = NULL; > ! block->refcount = 0; > ! block->freeptr = ((char *)block) + PALLOC_BLOCK_HDRSZ + > ! PALLOC_CHUNK_HDRSZ; > ! block->endptr = ((char *)block) + PALLOC_BLOCKSIZE; > ! > ! CurrentMemoryContext->smallchunk_block = (void *)block; > ! } > ! > ! chunk = block->freeptr; > ! block->freeptr += PALLOC_CHUNK_HDRSZ + asize; > ! block->refcount++; > ! PALLOC_CHUNK_BLKPTR(chunk) = block; > ! PALLOC_CHUNK_SIZE(chunk) = asize; > ! > ! if (block->freeptr >= block->endptr) > ! block->mcxt->smallchunk_block = NULL; > ! > ! return (void *)chunk; > #endif /* PALLOC_IS_MALLOC */ > } > > *************** > *** 76,82 **** > #ifdef PALLOC_IS_MALLOC > free(pointer); > #else > ! MemoryContextFree(CurrentMemoryContext, pointer); > #endif /* PALLOC_IS_MALLOC */ > } > > --- 172,202 ---- > #ifdef PALLOC_IS_MALLOC > free(pointer); > #else > ! PallocBlock *block = PALLOC_CHUNK_BLKPTR(pointer); > ! > ! if (block == NULL) > ! { > ! MemoryContextFree(CurrentMemoryContext, (char *)pointer - PALLOC_CHUNK_HDRSZ); > ! return; > ! } > ! > ! PALLOC_CHUNK_BLKPTR(pointer) = (PallocBlock *)(block->unused_chunk); > ! block->unused_chunk = (char *)pointer; > ! > ! block->refcount--; > ! if (block->refcount == 0) > ! { > ! if (block == (PallocBlock *)(block->mcxt->smallchunk_block)) > ! { > ! block->freeptr = ((char *)block) + PALLOC_BLOCK_HDRSZ + > ! PALLOC_CHUNK_HDRSZ; > ! block->unused_chunk = NULL; > ! } > ! else > ! { > ! MemoryContextFree(block->mcxt, (void *)block); > ! } > ! } > #endif /* PALLOC_IS_MALLOC */ > } > > *************** > *** 100,106 **** > #ifdef PALLOC_IS_MALLOC > return realloc(pointer, size); > #else > ! return MemoryContextRealloc(CurrentMemoryContext, pointer, size); > #endif > } > > --- 220,248 ---- > #ifdef PALLOC_IS_MALLOC > return realloc(pointer, size); > #else > ! PallocBlock *block = PALLOC_CHUNK_BLKPTR(pointer); > ! char *new; > ! Size tocopy; > ! > ! if (block == NULL) > ! { > ! new = (char *)MemoryContextRealloc(CurrentMemoryContext, > ! 
(char *)pointer - PALLOC_CHUNK_HDRSZ, > ! size + PALLOC_CHUNK_HDRSZ); > ! new += PALLOC_CHUNK_HDRSZ; > ! PALLOC_CHUNK_BLKPTR(new) = NULL; > ! return (void *)new; > ! } > ! else > ! { > ! new = palloc(size); > ! > ! tocopy = PALLOC_CHUNK_SIZE(pointer) > size ? > ! size : PALLOC_CHUNK_SIZE(pointer); > ! memcpy(new, pointer, tocopy); > ! pfree(pointer); > ! return (void *)new; > ! } > #endif > } > > *************** > *** 117,119 **** > --- 259,263 ---- > > return nstr; > } > + > + > diff -cr src.orig/backend/utils/mmgr/portalmem.c src/backend/utils/mmgr/portalmem.c > *** src.orig/backend/utils/mmgr/portalmem.c Thu Jan 28 16:47:39 1999 > --- src/backend/utils/mmgr/portalmem.c Fri Jan 29 20:15:41 1999 > *************** > *** 390,395 **** > --- 390,396 ---- > NodeSetTag((Node *) &portal->variable, T_PortalVariableMemory); > AllocSetInit(&portal->variable.setData, DynamicAllocMode, (Size) 0); > portal->variable.method = &PortalVariableContextMethodsData; > + portal->variable.smallchunk_block = NULL; > > /* > * initialize portal heap context > *************** > *** 399,404 **** > --- 400,406 ---- > FixedStackInit(&portal->heap.stackData, > offsetof(HeapMemoryBlockData, itemData)); > portal->heap.method = &PortalHeapContextMethodsData; > + portal->heap.smallchunk_block = NULL; > > /* > * set bogus portal name > *************** > *** 756,761 **** > --- 758,764 ---- > NodeSetTag((Node *) &portal->variable, T_PortalVariableMemory); > AllocSetInit(&portal->variable.setData, DynamicAllocMode, (Size) 0); > portal->variable.method = &PortalVariableContextMethodsData; > + portal->variable.smallchunk_block = NULL; > > /* initialize portal heap context */ > NodeSetTag((Node *) &portal->heap, T_PortalHeapMemory); > *************** > *** 763,768 **** > --- 766,772 ---- > FixedStackInit(&portal->heap.stackData, > offsetof(HeapMemoryBlockData, itemData)); > portal->heap.method = &PortalHeapContextMethodsData; > + portal->heap.smallchunk_block = NULL; > > /* initialize portal name */ > 
length = 1 + strlen(name); > *************** > *** 918,923 **** > --- 922,928 ---- > > /* free current mode */ > AllocSetReset(&HEAPMEMBLOCK(context)->setData); > + context->smallchunk_block = NULL; > MemoryContextFree((MemoryContext) PortalHeapMemoryGetVariableMemory(context), > context->block); > > diff -cr src.orig/include/nodes/memnodes.h src/include/nodes/memnodes.h > *** src.orig/include/nodes/memnodes.h Thu Jan 28 16:47:39 1999 > --- src/include/nodes/memnodes.h Fri Jan 29 20:15:41 1999 > *************** > *** 60,65 **** > --- 60,66 ---- > { > NodeTag type; > MemoryContextMethods method; > + void *smallchunk_block; > } *MemoryContext; > > /* think about doing this right some time but we'll have explicit fields > *************** > *** 68,73 **** > --- 69,75 ---- > { > NodeTag type; > MemoryContextMethods method; > + void *smallchunk_block; > AllocSetData setData; > char *name; > OrderedElemData elemData; > *************** > *** 79,84 **** > --- 81,87 ---- > { > NodeTag type; > MemoryContextMethods method; > + void *smallchunk_block; > AllocSetData setData; > } *PortalVariableMemory; > > *************** > *** 86,91 **** > --- 89,95 ---- > { > NodeTag type; > MemoryContextMethods method; > + void *smallchunk_block; > Pointer block; > FixedStackData stackData; > } *PortalHeapMemory; > > -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #======================================== jwieck@debis.com (Jan Wieck) #
jwieck@debis.com (Jan Wieck) writes: > [ patch to handle small-block allocation specially ] > Did anyone play around with it? I've had it installed now for > some days and it work's well so far. > How close are we to v6.5 BETA? Should I apply it to CURRENT? Well, I think it's a band-aid until we can redesign the memory manager properly, but it could be a useful band-aid ... might as well put it in. The only thing that bothers me about the code itself is lack of comments. In particular it'd be nice to mention somewhere that freeptr is actually one chunk header past the start of the free space --- took a while to puzzle that out. regards, tom lane
> > jwieck@debis.com (Jan Wieck) writes: > > [ patch to handle small-block allocation specially ] > > Did anyone play around with it? I've had it installed now for > > some days and it work's well so far. > > How close are we to v6.5 BETA? Should I apply it to CURRENT? > > Well, I think it's a band-aid until we can redesign the memory manager > properly, but it could be a useful band-aid ... might as well put it in. > > The only thing that bothers me about the code itself is lack of > comments. In particular it'd be nice to mention somewhere that freeptr > is actually one chunkheader past the start of the free space --- took > a while to puzzle that out. Tricky, eh? The AllocSet...() functions do it the same way, so it just continues with that style. But you're right, and when moving it in I'll surely add comments. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #======================================== jwieck@debis.com (Jan Wieck) #
On Tue, 2 Feb 1999, Jan Wieck wrote: > I wrote: > > > > > I came to the same conclusion. So I continued with the > > approach of bigger chunks handled in palloc(). What I have > > now is something, that gains about 10% speedup at the > > regression test, while memory consumption (visibly watched > > with top(1)) seems not to raise compared against old version. > > > > Since the bigger blocks are built on top of the existing > > memory context model, it would not conflict with any enhanced > > usage we could make with it. > > > > I include a patch at the end - please comment. > > > > > > Jan > > Did anyone play around with it? I've had it installed now for > some days and it work's well so far. > > How close are we to v6.5 BETA? Should I apply it to CURRENT? BETA starts upon Vadim's word...I'd say go for it... Marc G. Fournier Systems Administrator @ hub.org primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
> > Since the bigger blocks are built on top of the existing > > memory context model, it would not conflict with any enhanced > > usage we could make with it. > > > > I include a patch at the end - please comment. > > > > > > Jan > > Did anyone play around with it? I've had it installed now for > some days and it work's well so far. > > How close are we to v6.5 BETA? Should I apply it to CURRENT? Good question. I am done with temp tables, so Vadim makes the call on when to start beta. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> > > > Since the bigger blocks are built on top of the existing > > > memory context model, it would not conflict with any enhanced > > > usage we could make with it. > > > > > > I include a patch at the end - please comment. > > > > > > > > > Jan > > > > Did anyone play around with it? I've had it installed now for > > some days and it work's well so far. > > > > How close are we to v6.5 BETA? Should I apply it to CURRENT? > > > Good question. I am done with temp tables, so Vadim makes the call on > when to start beta. Now I'm placing another LOCK request (waiting for Vadim). The problem with constraints in ExecRelCheck() I've just found affects not only COPY. It also occurs with INSERT...SELECT or big UPDATEs. Constraints with many affected tuples eat up backend memory quickly until we run out of swap space. This fix must go into CURRENT and REL6_4 before we release either v6.5 or v6.4.3. There are actually reports from users unable to reload dumps made with v6.4.2. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #======================================== jwieck@debis.com (Jan Wieck) #
> > Good question. I am done with temp tables, so Vadim makes the call on > > when to start beta. > > Now I'm placing another LOCK request (waiting for Vadim). > > The problem with constraints in ExecRelCheck() I've just > found not only affects COPY. It also occurs with > INSERT...SELECT or big UPDATE's. Constraints with many > affected tuples eat up backend memory quickly until out of > swap space. > > This fix must go into CURRENT and REL6_4 before we release > v6.5 or v6.4.3 either. There are actually reports from users > not able to reload dumps made with v6.4.2. You can keep applying fixes like this until near final release date. I have patches in my mailbox to apply today too. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Do you want this applied? [Charset iso-8859-1 unsupported, filtering to ASCII...] > Goran Thyni wrote: > > 3. text_lt('G','G\0xFF') this is not correct! > > > > Case 3 work not work with strcoll(), in varstr_cmp(). > > If I change it to strcoll() to strncmp() it works as expected, > > but it probably breaks sorting etc big time. > > Here is a hack which solves the problem without breaking sorting AFAIK. > I have read a lot of code, but has not found any cleaner way to fix > this. > A cleaner solution would be to find the highest char in locale charset > but I found no portable way to do that, any pointers appriciated. > > It is not a beauty but it works. > Please apply to -current since it fixes the bug, > unless someone has a better suggestion. > > best regards, > -- > ----------------- > G_ran Thyni > This is Penguin Country. On a quiet night you can hear Windows NT > reboot! > diff -cr cvs/pgsql/src/backend/utils/adt/varlena.c cvswork/pgsql/src/backend/utils/adt/varlena.c > *** cvs/pgsql/src/backend/utils/adt/varlena.c Mon Dec 14 07:01:37 1998 > --- cvswork/pgsql/src/backend/utils/adt/varlena.c Sun Jan 31 16:32:52 1999 > *************** > *** 496,528 **** > varstr_cmp(char *arg1, int len1, char *arg2, int len2) > { > int result; > ! char *a1p, > ! *a2p; > ! > #ifdef USE_LOCALE > ! a1p = (unsigned char *) palloc(len1 + 1); > ! a2p = (unsigned char *) palloc(len2 + 1); > ! > ! memcpy(a1p, arg1, len1); > ! *(a1p + len1) = '\0'; > ! memcpy(a2p, arg2, len2); > ! *(a2p + len2) = '\0'; > ! > ! result = strcoll(a1p, a2p); > ! > ! pfree(a1p); > ! pfree(a2p); > ! > ! #else > ! > ! a1p = arg1; > ! a2p = arg2; > ! > ! result = strncmp(a1p, a2p, Min(len1, len2)); > ! if ((result == 0) && (len1 != len2)) > ! result = (len1 < len2) ? -1 : 1; > #endif > ! > return result; > } /* varstr_cmp() */ > > --- 496,524 ---- > varstr_cmp(char *arg1, int len1, char *arg2, int len2) > { > int result; > ! char *a1p, *a2p; > #ifdef USE_LOCALE > ! if ((int)arg2[len2 - 1] != -1) > ! { > ! 
a1p = (unsigned char *) palloc(len1 + 1); > ! a2p = (unsigned char *) palloc(len2 + 1); > ! memcpy(a1p, arg1, len1); > ! *(a1p + len1) = '\0'; > ! memcpy(a2p, arg2, len2); > ! *(a2p + len2) = '\0'; > ! result = strcoll(a1p, a2p); > ! pfree(a1p); > ! pfree(a2p); > ! } > ! else > #endif > ! { > ! a1p = arg1; > ! a2p = arg2; > ! result = strncmp(a1p, a2p, Min(len1, len2)); > ! if ((result == 0) && (len1 != len2)) > ! result = (len1 < len2) ? -1 : 1; > ! } > return result; > } /* varstr_cmp() */ > -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Tom, I assume this has all been corrected, right? > I wrote: > > As for where the time is going: "top" showed that the time for the > > INSERT ops was almost all going into backend CPU time. My guess is > > that most of the cost is in parsing and planning the INSERT > > statements. > > Having an idle hour this evening, I thought it'd be interesting to build > a backend with profiling enabled, so as to confirm or deny the above > guess. It seems that indeed a lot of time is being wasted, but where > it's being wasted might surprise you! > > I profiled a backend built from this morning's current CVS sources, > being driven by a single psql run reading a series of exactly 5000 > INSERT commands. The first few are: > > INSERT INTO "fehistory_1" values (64,14,252,'D','-','-','s',1,4027,NULL,'S',1,4088,NULL,32,'Thu Jan 07 15:15:00 1999 EST','FriJan 08 08:30:38 1999 EST',4027,32,NULL,NULL,46,NULL); > INSERT INTO "fehistory_1" values (65,14,252,'P','-','-','S',1,4086,NULL,NULL,NULL,NULL,NULL,32,'Fri Jan 08 16:00:00 1999EST',NULL,NULL,NULL,NULL,NULL,45,NULL); > INSERT INTO "fehistory_1" values (66,14,135,'P','-','-','S',1,13619,NULL,'s',1,12967,NULL,100,'Fri Jan 08 02:00:00 1999EST',NULL,NULL,NULL,NULL,NULL,44,NULL); > > and it goes on for a long while in that vein :-). Total runtime was > 278.09 CPU seconds in the backend, 307 seconds elapsed realtime. The > table being inserted into has no indexes and was initially empty. > > The profiler is GNU gprof, which provides both flat and structured > profile data. 
The top few entries in the flat profile are: > > % cumulative self self total > time seconds seconds calls ms/call ms/call name > 34.96 97.21 97.21 _mcount > 7.18 117.19 19.98 8047816 0.00 0.00 tree_cut > 6.56 135.43 18.24 962187 0.02 0.02 recv > 3.55 145.31 9.88 8234278 0.00 0.00 tree_insert > 3.49 155.02 9.71 4292947 0.00 0.01 malloc > 3.31 164.22 9.20 1965056 0.00 0.01 SearchSysCache > 3.24 173.23 9.01 4291775 0.00 0.01 free > 2.44 180.01 6.78 6171667 0.00 0.00 tree_concatenate > 2.17 186.04 6.03 8234069 0.00 0.00 tree_delete > 1.07 189.02 2.98 10250 0.29 0.29 write > 1.05 191.94 2.92 _psort_cmp > 0.95 194.57 2.63 2983301 0.00 0.00 newNode > 0.92 197.14 2.57 4251762 0.00 0.00 OrderedElemPop > 0.88 199.60 2.46 2746172 0.00 0.00 comphash > 0.82 201.88 2.28 4252495 0.00 0.01 AllocSetAlloc > 0.77 204.02 2.14 1965056 0.00 0.00 SearchSysCacheTuple > 0.73 206.05 2.03 265000 0.01 0.01 yylex > 0.69 207.96 1.91 $$remU > 0.68 209.86 1.90 4252500 0.00 0.00 OrderedElemPushHead > 0.62 211.58 1.72 4247473 0.00 0.00 palloc > 0.60 213.26 1.68 25004 0.07 0.07 send > 0.56 214.81 1.55 1965056 0.00 0.00 CatalogCacheComputeHashIndex > 0.55 216.33 1.52 5000 0.30 1.07 yyparse > 0.54 217.82 1.49 4251762 0.00 0.01 AllocSetFree > 0.53 219.29 1.47 4252494 0.00 0.00 MemoryContextAlloc > 0.51 220.71 1.42 4252500 0.00 0.00 OrderedElemPushInto > 0.49 222.07 1.36 3430844 0.00 0.00 strncmp > 0.48 223.41 1.34 4205327 0.00 0.00 OrderedSetGetHead > 0.47 224.73 1.32 elog > 0.45 225.99 1.26 1964976 0.00 0.00 DLRemove > 0.43 227.19 1.20 2510857 0.00 0.00 strcmp > 0.43 228.38 1.19 4200327 0.00 0.00 OrderedElemGetBase > 0.42 229.56 1.18 4245740 0.00 0.01 PortalHeapMemoryAlloc > 0.38 230.62 1.06 4252500 0.00 0.00 OrderedElemPush > 0.37 231.66 1.04 4205327 0.00 0.00 AllocSetGetFirst > 0.36 232.66 1.00 1965145 0.00 0.00 DLAddHead > > _mcount is part of the profiling apparatus and can be ignored (although > the fact that it's so high up implies that we're executing an awful lot > of short routine calls, 
because mcount runs once per entry and exit of > profilable routines). tree_cut, tree_insert, tree_concatenate, and > tree_delete are subroutines of malloc/free. Altogether, malloc/free > and friends accounted for 61.39 seconds out of the 180.88 non-overhead > CPU seconds in this run. > > In other words, we're spending a third of our time mallocing and freeing > memory. A tad high, what? > > Actually, it's worse than that, because AllocSetAlloc, > PortalHeapMemoryAlloc, AllocSetFree, and all of the OrderedElemXXX > routines represent our own bookkeeping layer atop malloc/free. > That's another 18.66 seconds spent in these top routines, which means > that we are real close to expending half the backend's runtime on > memory bookkeeping. This needs work. > > The other thing that jumps out here is the unreasonably high position of > recv(), which is called 962187 times. The script being read by psql was > only 957186 characters. Evidently we're invoking a kernel recv() call > once per character read from the frontend. I suspect this is an > inefficiency introduced by Magnus Hagander's recent rewrite of backend > libpq (see, I told you there was a reason for using stdio ;-)). We're > gonna have to do something about that, though it's not as critical as > the memory-allocation issue. It also appears that send() is now being > invoked multiple times per backend reply, which is going to create > inefficiencies outside the backend (ie, multiple packets per reply). > On a test case with a lot of SELECTs that would show up more than it > does here. 
> > > Moving on to the dynamic profile, the most interesting items are: > > index % time self children called name > ----------------------------------------------- > 0.05 118.06 1/1 DoBackend [7] > [8] 65.3 0.05 118.06 1 PostgresMain [8] > 0.04 57.64 5000/5000 pg_exec_query [9] > 0.01 39.83 5000/5000 CommitTransactionCommand [11] > 0.02 18.86 5001/5001 ReadCommand [27] > 0.01 1.18 5000/5000 StartTransactionCommand [118] > 0.01 0.36 5001/5001 ReadyForQuery [190] > 0.04 0.00 5001/5001 EnableNotifyInterrupt [385] > 0.04 0.00 5000/5000 strspn [388] > 0.00 0.01 1/1 InitPostgres [495] > 0.00 0.00 5000/548573 strlen [172] > 0.00 0.00 5001/93349 memset [366] > 0.00 0.00 1/1 read_pg_options [601] > 0.00 0.00 2/2 pq_putint [637] > 0.00 0.00 1/15002 pq_putnchar [123] > 0.00 0.00 2/10013 getenv [241] > 0.00 0.00 1/1 SetPgUserName [683] > 0.00 0.00 1/1 pq_init [687] > 0.00 0.00 1/1 proc_exit [704] > 0.00 0.00 1/1 pq_close [721] > 0.00 0.00 1/5002 getpgusername [521] > 0.00 0.00 1/2 FindExec [722] > 0.00 0.00 5/8 getopt [728] > 0.00 0.00 5001/5001 DisableNotifyInterrupt [761] > 0.00 0.00 9/34 pqsignal [806] > 0.00 0.00 2/5 atoi [837] > 0.00 0.00 1/2 SetProcessingMode [868] > 0.00 0.00 1/10002 TransactionFlushEnabled [749] > 0.00 0.00 1/1 sigsetjmp [934] > > ("self" is the amount of time (CPU sec) spent directly in the named function, > while "children" is the amount of time spent in its callees, pro-rated > by the number of calls. For example, strlen was actually measured to > use 0.47 sec in the whole run, but since PostgresMain called it less > than 1% of all the times it was called, PostgresMain gets blamed for > less than 0.01 sec of that total.) > > The interesting thing here is the relative times indicated for > pg_exec_query, CommitTransactionCommand, and ReadCommand. We already > found out why ReadCommand is so slow (recv() per character) ... but why > is CommitTransactionCommand so high, when I already discovered that > doing only one commit doesn't help much? 
Patience, we'll get there. > > ----------------------------------------------- > 0.03 57.61 5000/5000 pg_exec_query [9] > [10] 31.9 0.03 57.61 5000 pg_exec_query_dest [10] > 0.09 39.12 5000/5000 pg_parse_and_plan [13] > 0.00 17.37 5000/5000 ProcessQuery [36] > 0.02 0.60 5000/5000 SetQuerySnapshot [159] > 0.01 0.34 5000/5000 CommandCounterIncrement [196] > 0.02 0.05 10000/4291775 free [22] > ----------------------------------------------- > 0.09 39.12 5000/5000 pg_exec_query_dest [10] > [13] 21.7 0.09 39.12 5000 pg_parse_and_plan [13] > 0.02 19.28 5000/5000 parser [26] > 0.03 17.63 5000/5000 planner [34] > 0.00 1.98 5000/5000 QueryRewrite [90] > 0.02 0.05 10000/4292947 malloc [21] > 0.02 0.05 10000/4291775 free [22] > 0.00 0.02 5000/497779 lappend [91] > 0.01 0.00 5000/5000 IsAbortedTransactionBlockState [529] > 0.00 0.00 5000/72779 length [342] > > In other words, parsing, planning, and executing an INSERT each take > about the same amount of time. > > ----------------------------------------------- > 0.01 39.83 5000/5000 PostgresMain [8] > [11] 22.0 0.01 39.83 5000 CommitTransactionCommand [11] > 0.02 39.81 5000/5000 CommitTransaction [12] > ----------------------------------------------- > 0.02 39.81 5000/5000 CommitTransactionCommand [11] > [12] 22.0 0.02 39.81 5000 CommitTransaction [12] > 0.00 38.12 5000/5000 AtCommit_Memory [16] > 0.01 1.10 5000/5000 RecordTransactionCommit [122] > 0.00 0.30 5000/5000 AtCommit_Locks [211] > 0.02 0.10 5000/5000 AtEOXact_portals [286] > 0.03 0.07 5000/5000 DestroyTempRels [305] > 0.03 0.00 5000/5000 _lo_commit [425] > 0.01 0.00 5000/5000 AtCommit_Notify [522] > 0.01 0.00 5000/5000 CloseSequences [523] > 0.01 0.00 5000/5000 RelationPurgeLocalRelation [531] > 0.00 0.00 5000/10000 AtCommit_Cache [750] > ----------------------------------------------- > 0.00 38.12 5000/5000 CommitTransaction [12] > [16] 21.1 0.00 38.12 5000 AtCommit_Memory [16] > 0.01 38.11 5000/5000 EndPortalAllocMode [17] > 0.00 0.00 5000/10552 
MemoryContextSwitchTo [747] > ----------------------------------------------- > 0.01 38.11 5000/5000 AtCommit_Memory [16] > [17] 21.1 0.01 38.11 5000 EndPortalAllocMode [17] > 0.54 37.55 5000/5000 AllocSetReset [18] > 0.01 0.00 5000/10000 PortalHeapMemoryGetVariableMemory [449] > 0.01 0.00 5000/5000 FixedStackPop [528] > 0.00 0.00 5000/51434 MemoryContextFree [367] > ----------------------------------------------- > 0.54 37.55 5000/5000 EndPortalAllocMode [17] > [18] 21.1 0.54 37.55 5000 AllocSetReset [18] > 1.47 32.51 4200327/4251762 AllocSetFree [20] > 1.04 2.53 4205327/4205327 AllocSetGetFirst [64] > > In other words, essentially *all* of the CPU time spent in > CommitTransaction is spent freeing memory. That's probably why > ganging the transactions doesn't help --- it's the same number of > memory blocks getting allocated and freed. > > 0.02 19.28 5000/5000 pg_parse_and_plan [13] > [26] 10.7 0.02 19.28 5000 parser [26] > 0.02 13.88 5000/5000 parse_analyze [41] > 1.52 3.81 5000/5000 yyparse [54] > 0.01 0.01 5000/5000 init_io [474] > 0.00 0.01 5000/420496 pstrdup [124] > 0.01 0.00 5000/5000 parser_init [533] > > Thomas might know why parse_analyze is taking so much time compared to > the rest of the parsing machinery... > > > I won't bother the list with the rest of the profile, although I'll be > happy to send it to anyone who wants to see all of it. Our immediate > performance problems seem pretty clear, however: time to rethink memory > management. > > regards, tom lane > > -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
As I remember, this was fixed already, without your patch, right? [Charset iso-8859-1 unsupported, filtering to ASCII...] > Goran Thyni wrote: > > 3. text_lt('G','G\0xFF') this is not correct! > > > > Case 3 work not work with strcoll(), in varstr_cmp(). > > If I change it to strcoll() to strncmp() it works as expected, > > but it probably breaks sorting etc big time. > > Here is a hack which solves the problem without breaking sorting AFAIK. > I have read a lot of code, but has not found any cleaner way to fix > this. > A cleaner solution would be to find the highest char in locale charset > but I found no portable way to do that, any pointers appriciated. > > It is not a beauty but it works. > Please apply to -current since it fixes the bug, > unless someone has a better suggestion. > > best regards, > -- > ----------------- > G_ran Thyni > This is Penguin Country. On a quiet night you can hear Windows NT > reboot! > diff -cr cvs/pgsql/src/backend/utils/adt/varlena.c cvswork/pgsql/src/backend/utils/adt/varlena.c > *** cvs/pgsql/src/backend/utils/adt/varlena.c Mon Dec 14 07:01:37 1998 > --- cvswork/pgsql/src/backend/utils/adt/varlena.c Sun Jan 31 16:32:52 1999 > *************** > *** 496,528 **** > varstr_cmp(char *arg1, int len1, char *arg2, int len2) > { > int result; > ! char *a1p, > ! *a2p; > ! > #ifdef USE_LOCALE > ! a1p = (unsigned char *) palloc(len1 + 1); > ! a2p = (unsigned char *) palloc(len2 + 1); > ! > ! memcpy(a1p, arg1, len1); > ! *(a1p + len1) = '\0'; > ! memcpy(a2p, arg2, len2); > ! *(a2p + len2) = '\0'; > ! > ! result = strcoll(a1p, a2p); > ! > ! pfree(a1p); > ! pfree(a2p); > ! > ! #else > ! > ! a1p = arg1; > ! a2p = arg2; > ! > ! result = strncmp(a1p, a2p, Min(len1, len2)); > ! if ((result == 0) && (len1 != len2)) > ! result = (len1 < len2) ? -1 : 1; > #endif > ! > return result; > } /* varstr_cmp() */ > > --- 496,524 ---- > varstr_cmp(char *arg1, int len1, char *arg2, int len2) > { > int result; > ! char *a1p, *a2p; > #ifdef USE_LOCALE > ! 
if ((int)arg2[len2 - 1] != -1) > ! { > ! a1p = (unsigned char *) palloc(len1 + 1); > ! a2p = (unsigned char *) palloc(len2 + 1); > ! memcpy(a1p, arg1, len1); > ! *(a1p + len1) = '\0'; > ! memcpy(a2p, arg2, len2); > ! *(a2p + len2) = '\0'; > ! result = strcoll(a1p, a2p); > ! pfree(a1p); > ! pfree(a2p); > ! } > ! else > #endif > ! { > ! a1p = arg1; > ! a2p = arg2; > ! result = strncmp(a1p, a2p, Min(len1, len2)); > ! if ((result == 0) && (len1 != len2)) > ! result = (len1 < len2) ? -1 : 1; > ! } > return result; > } /* varstr_cmp() */ > -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Can someone remind me where we left this memory context stuff? > > I've noticed that everyone else contributing to this thread has been > > thinking in terms of inventing multiple allocation functions with > > different names, ie a routine might have to call "palloc" or > > "fast_palloc" or "tmp_palloc" or whatever depending on what behavior it > > wants. I really think we are better off to stick to the structure that > > we already have (cf. include/nodes/memnodes.h and include/utils/mcxt.h), > > in which there's one set of interface routines that take a "context" > > parameter, and the context determines the semantics of allocation. > > (The interface routines probably ought to be macros, not out-of-line > > code, but that's a minor optimization.) This will make it easier to > > handle cases where one routine has to tell another one how to allocate > > its result space: you pass a MemoryContext parameter. > > I agree, inline is good. I can imagine something like makeNode is going > to be tough. > > I am going to give the parser as an example. It palloc's all sorts of > things for rangetables, strings, structures, lists, but eventually just > exports a huge Query structure. > > If we could somehow just grab a copy of the Query structure and free all > the other parser palloc allocations, that would be great. > > Same thing with the optimizer. It makes all sorts of lists, but > eventually just exports Plans. All the other stuff can be pfree'ed. > > It is almost as though we could copy the Query structure to another > memory context, and just free all palloc'ed memory. > > What we could try is to keep the various memory contexts like cache > context unchanged, but allow pallocs in the standard memory context to > be allocated from a pool and free'ed automatically. Our standard memory > context is overused and doesn't need the pfree overhead. We could make > palloc a macro that tested for the current context, and did standard > pallocs from a free memory pool. 
> > -- > Bruce Momjian | http://www.op.net/~candle > maillist@candle.pha.pa.us | (610) 853-3000 > + If your life is a hard drive, | 830 Blythe Avenue > + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 > > -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <maillist@candle.pha.pa.us> writes: > Tom, I assume this has all been corrected, right? The recv() problem is fixed, and Jan took a swipe at reducing the cost of palloc/pfree. I have not done a profile since Jan's changes to see how successful he was. I would like to revisit memory management at some point, but it won't happen for 6.5. regards, tom lane
Bruce Momjian <maillist@candle.pha.pa.us> writes: > Can someone remind me where we left this memory context stuff? Jan put in some changes to allow palloc()'d space to be freed more cheaply. The more sweeping rewrite that I was suggesting is not done, but perhaps we can reconsider it for 6.6 or sometime later. I don't think we need to worry about it for 6.5. regards, tom lane
> > Can someone remind me where we left this memory context stuff? > I've changed AllocSetAlloc() and friends to manage small allocations together in bigger blocks, where sizes are rounded up to powers of two and freed chunks in the blocks are reused in place. Entire blocks are only freed when they become completely empty. Then I've changed palloc() etc. into macros (which started the discussion on mcxt.h - sorry). Overall speedup from the changes (for regression) is about 10-15%. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #======================================== jwieck@debis.com (Jan Wieck) #
> > > > Can someone remind me where we left this memory context stuff? > > > > I've changed AllocSetAlloc() and friends to manage small > allocations together in bigger blocks, where sizes are > rounded up to powers of two and freed chunks in the blocks are > reused in place. Entire blocks are only freed when they > become completely empty. > > Then I've changed palloc() etc. into macros (which started > the discussion on mcxt.h - sorry). > > Overall speedup from the changes (for regression) is > about 10-15%. > Great. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
This is from January 24, 1999. Is it still relevant, and something that should be on the TODO list? > Tom Lane wrote: > > > > Having an idle hour this evening, I thought it'd be interesting to build > > a backend with profiling enabled, so as to confirm or deny the above > > guess. It seems that indeed a lot of time is being wasted, but where > > it's being wasted might surprise you! > ... > > In other words, we're spending a third of our time mallocing and freeing > > memory. A tad high, what? > > > > Actually, it's worse than that, because AllocSetAlloc, > > PortalHeapMemoryAlloc, AllocSetFree, and all of the OrderedElemXXX > > routines represent our own bookkeeping layer atop malloc/free. > > That's another 18.66 seconds spent in these top routines, which means > > that we are real close to expending half the backend's runtime on > > memory bookkeeping. This needs work. > > > > Yes, it's surprising! > > I added some debug code to palloc/pfree and it shows that for > INSERT: > > 1. 80% of allocations are made for <= 32 bytes. > 2. pfree is used for only 25% of them (the others are freed > after the statement/transaction is done). > > Note that our mmgr adds 16 bytes to each allocation > (+ some bytes in malloc) - a great overhead, yes? > > I added code to allocate a few big (16K-64K) blocks > of memory for these small allocations, to speed up > palloc by skipping AllocSetAlloc/malloc. The new code > doesn't free allocated memory (to keep bookkeeping fast), > but keeping 2. above and the memory overhead in mind, > it seems an appropriate thing to do. This code also > speeds up freeing when the statement/transaction is done, > because only a few blocks have to be freed now. > > I did 5000 INSERTs (into tables with 3 ints and 33 ints) > with BEGIN/END, -F and -B 512 (I ran postgres directly, > without the postmaster). User times: > > old new > ----------------------------------------- > table with 3 ints 9.7 sec 7.6 sec > table with 33 ints 59.5 sec 39.9 sec > > So, the new code is 20%-30% faster. 
Process sizes are the same. > > Tom, could you run the new code under profiling? > > There are still some things to do: > > 1. SELECT/UPDATE/DELETE often palloc/pfree tuples > (sizes are > 32 bytes), but pfree now requires an > additional lookup to see whether memory was allocated by > AllocSetAlloc or the new code. We can avoid this. > > 2. Index scans palloc/pfree an IndexResult for each > tuple returned by the index. This one has been annoying me > for a long time. IndexResult should be part of the > IndexScan structure... > > 3. psort uses a leftist structure (16 bytes) when > disk is used for sorting. Obviously, big-block > allocation should be used by the lselect code itself. > > 4. Actually, the new mode shouldn't be used by the Executor > in some cases. > > I'll address this in a few days... > > BTW, look at memutils.h: the new code is like the "tunable" > mode described there. > > > > In other words, essentially *all* of the CPU time spent in > > CommitTransaction is spent freeing memory. That's probably why > > ganging the transactions doesn't help --- it's the same number of > > memory blocks getting allocated and freed. > > It shows that we should get rid of system malloc/free and do > everything in mmgr itself - this would allow much faster > freeing of memory contexts at statement/transaction end. > > Vadim > > -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
I have added this to the TODO list: * improve dynamic memory allocation by introducing tuple-context memory allocation * add pooled memory allocation where allocations are freed only as a group > We could use 4 methods for dynamic allocation: > > 1) malloc/free - for persistent storage allocation > > 2) palloc/pfree - for storage belonging to some context and > which we can keep track of and free explicitly > > 3) fast_palloc - for storage which is impossible, too difficult or too > expensive to keep track of. This storage should be allocated with > fast and simple inline code from bigger chunks allocated with palloc. > This storage would never be freed explicitly, so that the code could be > simple and fast, but the big chunks would be freed automatically at > the end of the transaction. > > 4) fast_talloc - we could introduce a `tuple' context handled like > fast_palloc for storage used only while processing one tuple. > This storage could be fast-allocated from a few big chunks allocated > with palloc and freed explicitly after the tuple has been processed. > This could avoid the overhead of many malloc/palloc calls while reducing > the overall memory usage for transactions which process many rows. > The total cost per tuple could be one palloc and one pfree. > We could also simply reuse the chunks for every tuple and pfree them > only at the end of the transaction. This would cost one palloc/pfree > per transaction. > > This would require revising the code and replacing palloc/pfree with the new > functions where appropriate, but this could be done gradually because the > old palloc/pfree are always safe. 
> > -- > Massimo Dal Zotto > > +----------------------------------------------------------------------+ > | Massimo Dal Zotto email: dz@cs.unitn.it | > | Via Marconi, 141 phone: ++39-0461534251 | > | 38057 Pergine Valsugana (TN) www: http://www.cs.unitn.it/~dz/ | > | Italy pgp: finger dz@tango.cs.unitn.it | > +----------------------------------------------------------------------+ > > -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <maillist@candle.pha.pa.us> writes: > This is from January 24, 1999. Is it still relevant, and something that > should be on the TODO list? Jan's rewrite of alloc sets has probably taken care of the speed issue. Of course we still need to look at how memory allocation is being *used*, in order to fix the within-statement memory leak problems. But that's not what I was complaining about here. regards, tom lane >> Tom Lane wrote: >>>> >>>> Having an idle hour this evening, I thought it'd be interesting to build >>>> a backend with profiling enabled, so as to confirm or deny the above >>>> guess. It seems that indeed a lot of time is being wasted, but where >>>> it's being wasted might surprise you! >> ... >>>> In other words, we're spending a third of our time mallocing and freeing >>>> memory. A tad high, what? >>>> >>>> Actually, it's worse than that, because AllocSetAlloc, >>>> PortalHeapMemoryAlloc, AllocSetFree, and all of the OrderedElemXXX >>>> routines represent our own bookkeeping layer atop malloc/free. >>>> That's another 18.66 seconds spent in these top routines, which means >>>> that we are real close to expending half the backend's runtime on >>>> memory bookkeeping. This needs work. >>>> >> >> Yes, it's surprising! >> >> I added some debug code to palloc/pfree and it shows that for >> INSERT: >> >> 1. 80% of allocations are made for <= 32 bytes. >> 2. pfree is used for only 25% of them (the others are freed >> after the statement/transaction is done). >> >> Note that our mmgr adds 16 bytes to each allocation >> (+ some bytes in malloc) - a great overhead, yes? >> >> I added code to allocate a few big (16K-64K) blocks >> of memory for these small allocations, to speed up >> palloc by skipping AllocSetAlloc/malloc. The new code >> doesn't free allocated memory (to keep bookkeeping fast), >> but keeping 2. above and the memory overhead in mind, >> it seems an appropriate thing to do. 
This code also >> speeds up freeing when the statement/transaction is done, >> because only a few blocks have to be freed now. >> >> I did 5000 INSERTs (into tables with 3 ints and 33 ints) >> with BEGIN/END, -F and -B 512 (I ran postgres directly, >> without the postmaster). User times: >> >> old new >> ----------------------------------------- >> table with 3 ints 9.7 sec 7.6 sec >> table with 33 ints 59.5 sec 39.9 sec >> >> So, the new code is 20%-30% faster. Process sizes are the same. >> >> Tom, could you run the new code under profiling? >> >> There are still some things to do: >> >> 1. SELECT/UPDATE/DELETE often palloc/pfree tuples >> (sizes are > 32 bytes), but pfree now requires an >> additional lookup to see whether memory was allocated by >> AllocSetAlloc or the new code. We can avoid this. >> >> 2. Index scans palloc/pfree an IndexResult for each >> tuple returned by the index. This one has been annoying me >> for a long time. IndexResult should be part of the >> IndexScan structure... >> >> 3. psort uses a leftist structure (16 bytes) when >> disk is used for sorting. Obviously, big-block >> allocation should be used by the lselect code itself. >> >> 4. Actually, the new mode shouldn't be used by the Executor >> in some cases. >> >> I'll address this in a few days... >> >> BTW, look at memutils.h: the new code is like the "tunable" >> mode described there. >> >>>> >>>> In other words, essentially *all* of the CPU time spent in >>>> CommitTransaction is spent freeing memory. That's probably why >>>> ganging the transactions doesn't help --- it's the same number of >>>> memory blocks getting allocated and freed. >> >> It shows that we should get rid of system malloc/free and do >> everything in mmgr itself - this would allow much faster >> freeing of memory contexts at statement/transaction end. >> >> Vadim >> >> > -- > Bruce Momjian | http://www.op.net/~candle > maillist@candle.pha.pa.us | (610) 853-3000 > + If your life is a hard drive, | 830 Blythe Avenue > + Christ can be your backup. | Drexel Hill, Pennsylvania 19026