Thread: Re: Parser Cruft in gram.y
Tom Lane wrote: > the parser tables are basically number-of-tokens wide by > number-of-states high. (In HEAD there are 433 tokens known to the > grammar, all but 30 of which are keywords, and 4367 states.) > > Splitting the grammar into multiple grammars is unlikely to do > much to improve this --- in fact, it could easily make matters > worse due to duplication. I agree that without knowing what percentage would be used by each parser in a split, it could go either way. Consider a hypothetical situation where one parser has 80% and the other 50% of the current combined parser -- 30% overlap on both tokens and grammar constructs. (Picking numbers out of the air, for purposes of demonstration.)

Combined = 433 * 4,367 = 1,890,911
80% = 346 * 3,493 = 1,208,578
50% = 216 * 2,183 = 471,528
Total for split = 1,680,106

Of course if they were both at 80% it would be a higher total than combined, but unless you have a handle on the percentages, it doesn't seem like a foregone conclusion. Do you have any feel for what the split would be? -Kevin
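To make the break-even arithmetic above easier to play with: if both the token count and the state count are assumed to shrink linearly with each sub-parser's share of the grammar (a simplification -- bison state counts don't really scale that way), the split only wins while the sum of the squared shares stays below 1. A small sketch using the same invented figures:

#include <stdio.h>

/* Rough check of the split-vs-combined parser-table sizes discussed above.
 * Assumes both table dimensions scale linearly with each sub-parser's share
 * of the grammar -- an illustration only, not how bison actually behaves. */
int main(void)
{
    const double tokens = 433.0, states = 4367.0;
    const double combined = tokens * states;
    const double shares[][2] = { {0.80, 0.50}, {0.80, 0.80} };

    printf("combined: %.0f cells\n", combined);
    for (int i = 0; i < 2; i++)
    {
        double a = shares[i][0], b = shares[i][1];
        double split = (a * a + b * b) * combined;

        printf("split %.0f%% + %.0f%%: %.0f cells (%s than combined)\n",
               a * 100, b * 100, split,
               split < combined ? "smaller" : "larger");
    }
    return 0;
}

With 80% and 50% this lands close to the 1.68M total above; with 80% and 80% the split comes out larger than the combined table, matching the caveat in the mail.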
"Kevin Grittner" <kgrittn@mail.com> writes: > Tom Lane wrote: >> the parser tables are basically number-of-tokens wide by >> number-of-states high. (In HEAD there are 433 tokens known to the >> grammar, all but 30 of which are keywords, and 4367 states.) >> >> Splitting the grammar into multiple grammars is unlikely to do >> much to improve this --- in fact, it could easily make matters >> worse due to duplication. > Of course if they were both at 80% it would be a higher total than > combined, but unless you have a handle on the percentages, it > doesn't seem like a foregone conclusion. Do you have any feel for > what the split would be? I don't really, but I will note that the scalar-expression subgrammar is a pretty sizable part of the whole, and it's difficult to see how you'd make a useful split that didn't duplicate it. I guess you could push CREATE TABLE, ALTER TABLE, CREATE DOMAIN, ALTER DOMAIN, COPY, and anything else that included expression arguments over into the "main" grammar. But that path leads to more and more stuff getting moved to the "main" grammar over time, making the whole thing more and more questionable. The whole concept seems ugly and unmaintainable in any case. regards, tom lane
On Sat, Dec 15, 2012 at 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <kgrittn@mail.com> writes: >> Tom Lane wrote: >>> the parser tables are basically number-of-tokens wide by >>> number-of-states high. (In HEAD there are 433 tokens known to the >>> grammar, all but 30 of which are keywords, and 4367 states.) >>> >>> Splitting the grammar into multiple grammars is unlikely to do >>> much to improve this --- in fact, it could easily make matters >>> worse due to duplication. > >> Of course if they were both at 80% it would be a higher total than >> combined, but unless you have a handle on the percentages, it >> doesn't seem like a foregone conclusion. Do you have any feel for >> what the split would be? > > I don't really, but I will note that the scalar-expression subgrammar is > a pretty sizable part of the whole, and it's difficult to see how you'd > make a useful split that didn't duplicate it. I guess you could push > CREATE TABLE, ALTER TABLE, CREATE DOMAIN, ALTER DOMAIN, COPY, and > anything else that included expression arguments over into the "main" > grammar. But that path leads to more and more stuff getting moved to > the "main" grammar over time, making the whole thing more and more > questionable. The whole concept seems ugly and unmaintainable in any > case. I have thought a little bit in the past about the sort of thing that Dimitri is proposing, and it seemed to me that one could put DML in one grammar and everything else in another grammar and then decide, based on the first word of the input, which grammar to use. But there are a couple of problems with this. First, the DML grammar has to include an awful lot of stuff, because the grammar for expressions is really complicated and involves a lot of things like special-case syntax for XML that are probably not really carrying their weight but which cannot easily be factored out. Second, the DDL grammar would have to duplicate a lot of stuff that also shows up in the DML grammar, because things like expressions can also show up in DEFAULT or USING clauses, which show up in things like CREATE TABLE and ALTER TABLE and CREATE SCHEMA .. CREATE TABLE. Now either one of these problems by itself might not be sufficient to kill the idea: if the DML grammar were a small subset of the full grammar, one might not mind duplicating some stuff, on the grounds that in most cases that full grammar would not be used, and using only the smaller tables most of the time would be easier on the L1 cache. And on the other hand, if you could get a clean split between the two grammars, then regardless of exactly what the split was, it might seem a win. But it seemed to me when I looked at this that you'd have to duplicate a lot of stuff and the small parser still wouldn't end up being very small, which I found hard to get excited about. I'm frankly kind of shocked at the revelation that the parser is already 14% of the backend. I knew it was big; I didn't realize it was THAT big. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I'm frankly kind of shocked at the revelation that the parser is > already 14% of the backend. I knew it was big; I didn't realize it > was THAT big. Yeah, likewise. It makes me wonder whether we aren't past the size of grammar that bison is a good choice for: considering that gram.y is only circa 1% of the source text, it's surprising to find that it compiles to >10% of the object code. I'm not sure what other tool might be better though. I looked through http://en.wikipedia.org/wiki/Comparison_of_parser_generators#Deterministic_context-free_languages but it seems our options are a bit limited if we want something that produces C. It's not clear to me that any of the likely options are as mature as bison, let alone likely to substantially outperform it. (For instance, Hyacc sounded pretty promising until I got to the part about it not yet supporting %union or %type.) Still, I didn't spend much time on this --- maybe somebody else would like to do a bit more research. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > And on the other hand, if you could get a clean split between the two > grammars, then regardless of exactly what the split was, it might seem > a win. But it seemed to me when I looked at this that you'd have to > duplicate a lot of stuff and the small parser still wouldn't end up > being very small, which I found hard to get excited about. I think the goal is not so much about getting a much smaller parser, but more about having a separate parser that you don't care about the "bloat" of, so that you can improve DDL without fearing main parser performance regressions. If we come out with no regression and no gain on the main query parser, I say it's still worth the separation effort. And anyway we only add things to the main parser (queries) when the standard saith we have to. Tom Lane <tgl@sss.pgh.pa.us> writes: > I'm not sure what other tool might be better though. I looked through > http://en.wikipedia.org/wiki/Comparison_of_parser_generators#Deterministic_context-free_languages > but it seems our options are a bit limited if we want something > that produces C. It's not clear to me that any of the likely options > are as mature as bison, let alone likely to substantially outperform it. > (For instance, Hyacc sounded pretty promising until I got to the part > about it not yet supporting %union or %type.) Still, I didn't spend > much time on this --- maybe somebody else would like to do a bit more > research. I did spend a little time on it too, with a different search angle, and did find an "experimental" thing that might be worth looking at, or maybe not. http://en.wikipedia.org/wiki/Parsing_expression_grammar http://piumarta.com/software/peg/ Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Tue, Dec 18, 2012 at 4:33 AM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> And on the other hand, if you could get a clean split between the two >> grammars, then regardless of exactly what the split was, it might seem >> a win. But it seemed to me when I looked at this that you'd have to >> duplicate a lot of stuff and the small parser still wouldn't end up >> being very small, which I found hard to get excited about. > > I think the goal is not so much about getting a much smaller parser, but > more about having a separate parser that you don't care about the "bloat" > of, so that you can improve DDL without fearing main parser > performance regressions. Well that would be nice, but the problem is that I see no way to implement it. If, with a unified parser, the parser is 14% of our source code, then splitting it in two will probably crank that number up well over 20%, because there will be duplication between the two. That seems double-plus un-good. I can't help but suspect that the way we handle keywords today is monumentally inefficient. The unreserved_keyword productions, et al, just seem somehow badly wrong-headed. We take the trouble to distinguish all of those cases so that we can turn around and not distinguish them. I feel like there ought to be some way to use lexer states to handle this - if we're in a context where an unreserved keyword will be treated as an IDENT, then have the lexer return IDENT when it sees an unreserved keyword. I might be wrong, but it seems like that would eliminate a whole lot of parser state transitions. However, even if I'm right, I have no idea how to implement it. It just seems very wasteful that we have so many parser states that have no purpose other than (effectively) to convert an unreserved_keyword into an IDENT when the lexer could do the same thing much more cheaply given a bit more context. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/18/12 5:10 PM, Robert Haas wrote: > I can't help but suspect that the way we handle keywords today is > monumentally inefficient. The unreserved_keyword productions, et al, > just seem somehow badly wrong-headed. We take the trouble to > distinguish all of those cases so that we can turn around and not > distinguish them. I feel like there ought to be some way to use lexer > states to handle this - if we're in a context where an unreserved > keyword will be treated as an IDENT, then have the lexer return IDENT > when it sees an unreserved keyword. The problem would be the lookahead. You need to know the next token before you can decide what context the current one is in.
On Tue, Dec 18, 2012 at 5:24 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On 12/18/12 5:10 PM, Robert Haas wrote: >> I can't help but suspect that the way we handle keywords today is >> monumentally inefficient. The unreserved_keyword productions, et al, >> just seem somehow badly wrong-headed. We take the trouble to >> distinguish all of those cases so that we can turn around and not >> distinguish them. I feel like there ought to be some way to use lexer >> states to handle this - if we're in a context where an unreserved >> keyword will be treated as an IDENT, then have the lexer return IDENT >> when it sees an unreserved keyword. > > The problem would be the lookahead. You need to know the next token > before you can decide what context the current one is in. Yeah, that's why I don't know how to make it work. It feels like this is partly an artifact of the tool, though. I mean, suppose we haven't read anything yet. Then, the next token can't be an IDENT, so if we see an unreserved keyword, we know we're not going to convert it to an IDENT. OTOH, if we've seen CREATE TABLE, the next token cannot be an unreserved keyword that is intended as a keyword; it has to be something that will reduce to ColId. I guess the problem situation is where we can shift the keyword and then use the following token to decide whether to reduce it to ColId/type_function_name/ColLabel or use some other rule instead; the CREATE INDEX CONCURRENTLY productions might be such a case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
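To make the idea being discussed concrete, here is a toy standalone sketch -- not PostgreSQL code; the token names, the keyword set, and the ident_expected flag are all invented for illustration -- of a filter between lexer and parser that downgrades an unreserved keyword to IDENT when the surrounding context can only accept a name. The part it hand-waves, maintaining that context flag without the lookahead problem raised above, is exactly the open question:

#include <stdbool.h>
#include <stdio.h>

/* Invented token codes and input; this is not the real lexer/parser
 * interface, just an illustration of the keyword-downgrading idea. */
enum token { T_IDENT, T_CREATE, T_TABLE, T_CONCURRENTLY, T_EOF };

struct lexeme { enum token tok; const char *text; };

/* Hypothetical flag meaning "only a name can appear next here", e.g.
 * right after CREATE TABLE.  Maintaining this correctly without parser
 * lookahead is the hard part debated above. */
static bool ident_expected = false;

static struct lexeme filtered_lex(struct lexeme raw)
{
    bool unreserved = (raw.tok == T_CONCURRENTLY);  /* pretend it's unreserved */

    if (unreserved && ident_expected)
        raw.tok = T_IDENT;          /* the parser never sees the keyword token */
    return raw;
}

int main(void)
{
    struct lexeme input[] = {
        { T_CREATE, "create" },
        { T_TABLE, "table" },
        { T_CONCURRENTLY, "concurrently" },     /* used here as a table name */
        { T_EOF, "" },
    };

    for (int i = 0; input[i].tok != T_EOF; i++)
    {
        struct lexeme l = filtered_lex(input[i]);

        printf("%-13s -> token %d%s\n", l.text, (int) l.tok,
               l.tok == T_IDENT ? " (downgraded to IDENT)" : "");
        ident_expected = (input[i].tok == T_TABLE); /* crude context tracking */
    }
    return 0;
}

In the toy input, CONCURRENTLY stands in for an unreserved keyword being used as a table name; the filter never lets the parser see the keyword token in that position.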
On Tue, Dec 18, 2012 at 10:33:12AM +0100, Dimitri Fontaine wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > And on the other hand, if you could get a clean split between the two > > grammars, then regardless of exactly what the split was, it might seem > > a win. But it seemed to me when I looked at this that you'd have to > > duplicate a lot of stuff and the small parser still wouldn't end up > > being very small, which I found hard to get excited about. > > I think the goal is not so much about getting a much smaller parser, but > more about having a separate parser that you don't care about the "bloat" > of, so that you can improve DDL without fearing main parser > performance regressions. In addition to this, there could well be uses for a more modular parser. For example, if it turns out to be possible to do our parser in a way that is exportable and (can be converted to a type that) looks forward, client code could do a lot of interesting (and provably correct) things. > If we come out with no regression and no gain on the main query parser, > I say it's still worth the separation effort. And anyway we only add > things to the main parser (queries) when the standard saith we have to. > > Tom Lane <tgl@sss.pgh.pa.us> writes: > > I'm not sure what other tool might be better though. I looked through > > http://en.wikipedia.org/wiki/Comparison_of_parser_generators#Deterministic_context-free_languages > > but it seems our options are a bit limited if we want something > > that produces C. It's not clear to me that any of the likely options > > are as mature as bison, let alone likely to substantially outperform it. > > (For instance, Hyacc sounded pretty promising until I got to the part > > about it not yet supporting %union or %type.) Still, I didn't spend > > much time on this --- maybe somebody else would like to do a bit more > > research. > > I did spend a little time on it too, with a different search angle, > and did find an "experimental" thing that might be worth looking at, > or maybe not. > > http://en.wikipedia.org/wiki/Parsing_expression_grammar > http://piumarta.com/software/peg/ More angles: http://wwwiti.cs.uni-magdeburg.de/~rosenmue/publications/SKS+08.pdf Might anyone here wish to investigate this, given some kind of resources for the initial study? Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Tue, Dec 18, 2012 at 10:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Yeah, that's why I don't know how to make it work. It feels like this > is partly an artifact of the tool, though. I mean, suppose we haven't > read anything yet. Then, the next token can't be an IDENT, so if we > see an unreserved keyword, we know we're not going to convert it to an > IDENT. OTOH, if we've seen CREATE TABLE, the next token cannot be an > unreserved keyword that is intended as a keyword; it has to be > something that will reduce to ColId. > > I guess the problem situation is where we can shift the keyword and > then use the following token to decide whether to reduce it to > ColId/type_function_name/ColLabel or use some other rule instead; the > CREATE INDEX CONCURRENTLY productions might be such a case. It seems to me the avenue for simplifying the transition table would be keywords that can never be used in the same place. That is, if we replaced all the elements of such a set with a single token then the grammar would be unambiguous and we could insert a check that the right actual token was present in each place it's used. I'm thinking of the various "noise words" that the SQL standard introduces which are all going to be reduced to IDENT except for a few places each of which will only admit one such noise word anyways. I think doing this manually would be unmaintainable since every time we modified the grammar it would introduce random unpredictable conflicts which would be hard to debug. But I wonder if we could preparse the transition table, find any such large sets and rewrite either the transition table or regenerate the grammar and rerun bison on it. Alternately we could just replace the transition table with a representation that is less wasteful such as a list of perfect hash tables just large enough to hold the valid transitions. Or even a single very large perfect hash table where the key is <state,token>. But I'm not entirely convinced any of this is actually useful. Just because the transition table is large doesn't mean it's inefficient. Most of these bytes are being wasted on transitions which can never occur or which can never occur in syntactically valid SQL. The transitions which can occur will still be present in any condensed representation we come up with. The L2 cache might still be large enough to hold these hot transitions which might not be a very large subset at all. valgrind comes with a tool called cachegrind which can emulate the cache algorithm on some variants of various cpus and produce reports. Can it be made to produce a report for a specific block of memory? -- greg
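For what it's worth, a rough standalone sketch of the single-hash-table variant -- the state and token counts are the ones quoted upthread, but the number of valid transitions, the hash function, and the slot layout are pure guesses for illustration and have nothing to do with bison's real tables:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NSTATES  4367
#define NTOKENS  433
#define NVALID   60000          /* guess at how many transitions are valid */
#define TABSIZE  131072         /* power of two, leaves plenty of headroom */

/* One open-addressed slot per valid transition, keyed by (state, token).
 * Empty slots have next == 0; real entries store next + 1. */
struct slot { int32_t state, token, next; };
static struct slot tab[TABSIZE];

static uint32_t hashkey(int state, int token)
{
    return ((uint32_t) state * 2654435761u ^ (uint32_t) token * 40503u)
        & (TABSIZE - 1);
}

static void insert(int state, int token, int next)
{
    uint32_t i = hashkey(state, token);

    while (tab[i].next != 0)
        i = (i + 1) & (TABSIZE - 1);
    tab[i].state = state;
    tab[i].token = token;
    tab[i].next = next + 1;
}

static int lookup(int state, int token)
{
    uint32_t i = hashkey(state, token);

    while (tab[i].next != 0)
    {
        if (tab[i].state == state && tab[i].token == token)
            return tab[i].next - 1;
        i = (i + 1) & (TABSIZE - 1);
    }
    return -1;                  /* no such transition: syntax error */
}

int main(void)
{
    for (int i = 0; i < NVALID; i++)    /* synthetic transitions */
        insert(rand() % NSTATES, rand() % NTOKENS, rand() % NSTATES);

    printf("dense table:  %zu bytes\n",
           (size_t) NSTATES * NTOKENS * sizeof(int16_t));
    printf("hashed table: %zu bytes\n", sizeof(tab));
    printf("lookup(0, 0) = %d\n", lookup(0, 0));
    return 0;
}

Whether something like this would actually be faster is exactly the open question in the last paragraph: the dense table wastes most of its bytes on impossible transitions, but the transitions that do occur are hot either way.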
Greg Stark <stark@mit.edu> writes: > But I'm not entirely convinced any of this is actually useful. Just > because the transition table is large doesn't mean it's inefficient. That's a fair point. However, I've often noticed base_yyparse() showing up rather high on profiles --- higher than seemed plausible at the time, given that its state-machine implementation is pretty tight. Now I'm wondering whether that isn't coming from cache stalls from trying to touch all the requisite parts of the transition table. > valgrind comes with a tool called cachegrind which can emulate the > cache algorithm on some variants of various cpus and produce reports. > Can it be made to produce a report for a specific block of memory? I believe that oprofile can be persuaded to produce statistics about where in one's code are the most cache misses, not just the most wall-clock ticks; which would shed a lot of light on this question. However, my oprofile-fu doesn't quite extend to actually persuading it. regards, tom lane
On Wed, Dec 19, 2012 at 10:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> valgrind comes with a tool called cachegrind which can emulate the >> cache algorithm on some variants of various cpus and produce reports. >> Can it be made to produce a report for a specific block of memory? > > I believe that oprofile can be persuaded to produce statistics about > where in one's code are the most cache misses, not just the most > wall-clock ticks; which would shed a lot of light on this question. > However, my oprofile-fu doesn't quite extend to actually persuading it. perf can certainly do this.

$ perf record -a -e cache-misses pgbench -n -S -T 30
...output elided...
$ perf report -d postgres | grep -v '^#' | head
  8.88%  postgres  base_yyparse
  7.05%  swapper   0x807c
  4.67%  postgres  SearchCatCache
  3.77%  pgbench   0x172dd58
  3.47%  postgres  hash_search_with_hash_value
  3.23%  postgres  AllocSetAlloc
  2.58%  postgres  core_yylex
  1.87%  postgres  LWLockAcquire
  1.83%  postgres  fmgr_info_cxt_security
  1.75%  postgres  0x4d1054

For comparison:

$ perf record -a -e cycles -d postgres pgbench -n -S -T 30
$ perf report -d postgres | grep -v '^#' | head
  6.54%  postgres  AllocSetAlloc
  4.08%  swapper   0x4ce754
  3.60%  postgres  hash_search_with_hash_value
  2.74%  postgres  base_yyparse
  2.71%  postgres  MemoryContextAllocZeroAligned
  2.57%  postgres  MemoryContextAlloc
  2.36%  postgres  SearchCatCache
  2.10%  postgres  _bt_compare
  1.70%  postgres  LWLockAcquire
  1.54%  postgres  FunctionCall2Coll

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/18/12 5:10 PM, Robert Haas wrote: > I can't help but suspect that the way we handle keywords today is > monumentally inefficient. The unreserved_keyword productions, et al, > just seem somehow badly wrong-headed. We take the trouble to > distinguish all of those cases so that we can turn around and not > distinguish them. I feel like there ought to be some way to use lexer > states to handle this - if we're in a context where an unreserved > keyword will be treated as an IDENT, then have the lexer return IDENT > when it sees an unreserved keyword. I might be wrong, but it seems > like that would eliminate a whole lot of parser state transitions. > However, even if I'm right, I have no idea how to implement it. It > just seems very wasteful that we have so many parser states that have > no purpose other than (effectively) to convert an unreserved_keyword > into an IDENT when the lexer could do the same thing much more cheaply > given a bit more context. Another way of attack along these lines might be to use the %glr-parser and then try to cut back on all those redundant rules that were put in to avoid conflicts. The number of keyword categories and such could perhaps also be reduced that way. Of course, this is mostly speculation.
On 18 December 2012 22:10, Robert Haas <robertmhaas@gmail.com> wrote: > Well that would be nice, but the problem is that I see no way to > implement it. If, with a unified parser, the parser is 14% of our > source code, then splitting it in two will probably crank that number > up well over 20%, because there will be duplication between the two. > That seems double-plus un-good. I don't think the size of the parser binary is that relevant. What is relevant is how much of that is regularly accessed. Increasing parser cache misses for DDL and increasing size of binary overall are acceptable costs if we are able to swap out the unneeded areas and significantly reduce the cache misses on the well travelled portions of the parser. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Dec 20, 2012 at 8:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 18 December 2012 22:10, Robert Haas <robertmhaas@gmail.com> wrote: >> Well that would be nice, but the problem is that I see no way to >> implement it. If, with a unified parser, the parser is 14% of our >> source code, then splitting it in two will probably crank that number >> up well over 20%, because there will be duplication between the two. >> That seems double-plus un-good. > > I don't think the size of the parser binary is that relevant. What is > relevant is how much of that is regularly accessed. > > Increasing parser cache misses for DDL and increasing size of binary > overall are acceptable costs if we are able to swap out the unneeded > areas and significantly reduce the cache misses on the well travelled > portions of the parser. I generally agree. We don't want to bloat the size of the parser with wild abandon, but yeah if we can reduce the cache misses on the well-travelled portions that seems like it ought to help. My previous hacky attempt to quantify the potential benefit in this area was: http://archives.postgresql.org/pgsql-hackers/2011-05/msg01008.php On my machine there seemed to be a small but consistent win; on a very old box Jeff Janes tried, it didn't seem like there was any benefit at all. Somehow, I have a feeling we're missing a trick here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2012-12-20 09:11:46 -0500, Robert Haas wrote: > On Thu, Dec 20, 2012 at 8:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On 18 December 2012 22:10, Robert Haas <robertmhaas@gmail.com> wrote: > >> Well that would be nice, but the problem is that I see no way to > >> implement it. If, with a unified parser, the parser is 14% of our > >> source code, then splitting it in two will probably crank that number > >> up well over 20%, because there will be duplication between the two. > >> That seems double-plus un-good. > > > > I don't think the size of the parser binary is that relevant. What is > > relevant is how much of that is regularly accessed. > > > > Increasing parser cache misses for DDL and increasing size of binary > > overall are acceptable costs if we are able to swap out the unneeded > > areas and significantly reduce the cache misses on the well travelled > > portions of the parser. > > I generally agree. We don't want to bloat the size of the parser with > wild abandon, but yeah if we can reduce the cache misses on the > well-travelled portions that seems like it ought to help. My previous > hacky attempt to quantify the potential benefit in this area was: > > http://archives.postgresql.org/pgsql-hackers/2011-05/msg01008.php > > On my machine there seemed to be a small but consistent win; on a very > old box Jeff Janes tried, it didn't seem like there was any benefit at > all. Somehow, I have a feeling we're missing a trick here. I don't think you will see too many cache misses on such a low number of extremely simple statements, so it's not too surprising not to see that big a difference with that. Are we sure it's really cache misses and not just the actions performed in the grammar that make bison code show up in profiles? I remember the latter being the case... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-20 15:45:47 +0100, Andres Freund wrote: > On 2012-12-20 09:11:46 -0500, Robert Haas wrote: > > On Thu, Dec 20, 2012 at 8:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > On 18 December 2012 22:10, Robert Haas <robertmhaas@gmail.com> wrote: > > >> Well that would be nice, but the problem is that I see no way to > > >> implement it. If, with a unified parser, the parser is 14% of our > > >> source code, then splitting it in two will probably crank that number > > >> up well over 20%, because there will be duplication between the two. > > >> That seems double-plus un-good. > > > > > > I don't think the size of the parser binary is that relevant. What is > > > relevant is how much of that is regularly accessed. > > > > > > Increasing parser cache misses for DDL and increasing size of binary > > > overall are acceptable costs if we are able to swap out the unneeded > > > areas and significantly reduce the cache misses on the well travelled > > > portions of the parser. > > > > I generally agree. We don't want to bloat the size of the parser with > > wild abandon, but yeah if we can reduce the cache misses on the > > well-travelled portions that seems like it ought to help. My previous > > hacky attempt to quantify the potential benefit in this area was: > > > > http://archives.postgresql.org/pgsql-hackers/2011-05/msg01008.php > > > > On my machine there seemed to be a small but consistent win; on a very > > old box Jeff Janes tried, it didn't seem like there was any benefit at > > all. Somehow, I have a feeling we're missing a trick here. > > I don't think you will see too many cache misses on such a low number of > extremely simple statements, so it's not too surprising not to see that > big a difference with that. > > Are we sure it's really cache misses and not just the actions performed > in the grammar that make bison code show up in profiles? I remember the > latter being the case... Hm. A very, very quick perf stat -dvvv of pgbench -S -c 20 -j 20 -T 20 later:

     218350.885559 task-clock                #  10.095 CPUs utilized
         1,676,479 context-switches          #   0.008 M/sec
             2,396 cpu-migrations            #   0.011 K/sec
           796,038 page-faults               #   0.004 M/sec
   506,312,525,518 cycles                    #   2.319 GHz                     [20.00%]
   405,944,435,754 stalled-cycles-frontend   #  80.18% frontend cycles idle    [30.32%]
   236,712,872,641 stalled-cycles-backend    #  46.75% backend cycles idle     [40.51%]
   193,459,120,458 instructions              #   0.38  insns per cycle
                                             #   2.10  stalled cycles per insn [50.70%]
    36,433,144,472 branches                  # 166.856 M/sec                   [51.12%]
     3,623,778,087 branch-misses             #   9.95% of all branches         [50.87%]
    50,344,123,581 L1-dcache-loads           # 230.565 M/sec                   [50.33%]
     5,548,192,686 L1-dcache-load-misses     #  11.02% of all L1-dcache hits   [49.69%]
     2,666,461,361 LLC-loads                 #  12.212 M/sec                   [35.63%]
       112,407,198 LLC-load-misses           #   4.22% of all LL-cache hits    [ 9.67%]

      21.629396701 seconds time elapsed

So there is definitely a noticeable rate of cache misses... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-20 15:51:37 +0100, Andres Freund wrote: > On 2012-12-20 15:45:47 +0100, Andres Freund wrote: > > On 2012-12-20 09:11:46 -0500, Robert Haas wrote: > > > On Thu, Dec 20, 2012 at 8:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > > On 18 December 2012 22:10, Robert Haas <robertmhaas@gmail.com> wrote: > > > >> Well that would be nice, but the problem is that I see no way to > > > >> implement it. If, with a unified parser, the parser is 14% of our > > > >> source code, then splitting it in two will probably crank that number > > > >> up well over 20%, because there will be duplication between the two. > > > >> That seems double-plus un-good. > > > > > > > > I don't think the size of the parser binary is that relevant. What is > > > > relevant is how much of that is regularly accessed. > > > > > > > > Increasing parser cache misses for DDL and increasing size of binary > > > > overall are acceptable costs if we are able to swap out the unneeded > > > > areas and significantly reduce the cache misses on the well travelled > > > > portions of the parser. > > > > > > I generally agree. We don't want to bloat the size of the parser with > > > wild abandon, but yeah if we can reduce the cache misses on the > > > well-travelled portions that seems like it ought to help. My previous > > > hacky attempt to quantify the potential benefit in this area was: > > > > > > http://archives.postgresql.org/pgsql-hackers/2011-05/msg01008.php > > > > > > On my machine there seemed to be a small but consistent win; on a very > > > old box Jeff Janes tried, it didn't seem like there was any benefit at > > > all. Somehow, I have a feeling we're missing a trick here. > > > > I don't think you will see too many cache misses on such a low number of > > extremely simple statements, so it's not too surprising not to see that > > big a difference with that. > > > > Are we sure it's really cache misses and not just the actions performed > > in the grammar that make bison code show up in profiles? I remember the > > latter being the case... > > Hm. A very, very quick perf stat -dvvv of pgbench -S -c 20 -j 20 -T 20 later: > > 218350.885559 task-clock # 10.095 CPUs utilized > 1,676,479 context-switches # 0.008 M/sec > 2,396 cpu-migrations # 0.011 K/sec > 796,038 page-faults # 0.004 M/sec > 506,312,525,518 cycles # 2.319 GHz [20.00%] > 405,944,435,754 stalled-cycles-frontend # 80.18% frontend cycles idle [30.32%] > 236,712,872,641 stalled-cycles-backend # 46.75% backend cycles idle [40.51%] > 193,459,120,458 instructions # 0.38 insns per cycle > # 2.10 stalled cycles per insn [50.70%] > 36,433,144,472 branches # 166.856 M/sec [51.12%] > 3,623,778,087 branch-misses # 9.95% of all branches [50.87%] > 50,344,123,581 L1-dcache-loads # 230.565 M/sec [50.33%] > 5,548,192,686 L1-dcache-load-misses # 11.02% of all L1-dcache hits [49.69%] > 2,666,461,361 LLC-loads # 12.212 M/sec [35.63%] > 112,407,198 LLC-load-misses # 4.22% of all LL-cache hits [ 9.67%] > > 21.629396701 seconds time elapsed > > So there is definitely a noticeable rate of cache misses...

L1 misses:

# Samples: 997K of event 'L1-dcache-load-misses'
# Overhead  Command   Shared Object       Symbol
# ........  ........  ..................  ..............................
     6.49%  postgres  postgres            [.] SearchCatCache
     3.65%  postgres  postgres            [.] base_yyparse
     3.48%  postgres  postgres            [.] hash_search_with_hash_value
     3.41%  postgres  postgres            [.] AllocSetAlloc
     1.84%  postgres  postgres            [.] LWLockAcquire
     1.40%  postgres  postgres            [.] fmgr_info_cxt_security
     1.36%  postgres  postgres            [.] nocachegetattr
     1.23%  postgres  libc-2.13.so        [.] _int_malloc
     1.20%  postgres  postgres            [.] core_yylex
     1.15%  postgres  postgres            [.] MemoryContextAllocZeroAligned
     0.94%  postgres  postgres            [.] PostgresMain
     0.94%  postgres  postgres            [.] MemoryContextAlloc
     0.91%  postgres  libc-2.13.so        [.] __memcpy_ssse3_back
     0.89%  postgres  postgres            [.] CatalogCacheComputeHashValue
     0.86%  postgres  postgres            [.] PinBuffer
     0.86%  postgres  [kernel.kallsyms]   [k] __audit_syscall_entry
     0.80%  postgres  libc-2.13.so        [.] __strcmp_sse42
     0.80%  postgres  postgres            [.] _bt_compare
     0.78%  postgres  postgres            [.] FunctionCall2Coll
     0.77%  postgres  libc-2.13.so        [.] malloc
     0.73%  postgres  libc-2.13.so        [.] __memset_sse2
     0.72%  postgres  postgres            [.] GetSnapshotData
     0.69%  postgres  [kernel.kallsyms]   [k] fget_light
     0.69%  postgres  postgres            [.] DirectFunctionCall1Coll
     0.67%  postgres  postgres            [.] hash_search
     0.67%  postgres  libc-2.13.so        [.] 0x000000000011a3a5
     0.66%  postgres  postgres            [.] pgstat_initstats
     0.66%  postgres  postgres            [.] AllocSetFree
     0.65%  postgres  libc-2.13.so        [.] __strlen_sse42
     0.60%  postgres  libc-2.13.so        [.] _int_free
     0.60%  postgres  [kernel.kallsyms]   [k] cpuacct_charge
     0.59%  postgres  postgres            [.] heap_getsysattr
     0.59%  postgres  postgres            [.] MemoryContextAllocZero
     0.58%  postgres  postgres            [.] PopActiveSnapshot
     0.53%  postgres  libc-2.13.so        [.] __memcmp_sse4_1
     0.51%  postgres  postgres            [.] ReadBuffer_common
     0.49%  postgres  postgres            [.] ScanKeywordLookup
     0.49%  postgres  postgres            [.] LockAcquireExtended
     0.47%  postgres  [kernel.kallsyms]   [k] update_cfs_shares
     0.45%  postgres  postgres            [.] SearchCatCacheList
     0.45%  postgres  postgres            [.] new_list
     0.44%  postgres  postgres            [.] get_relation_info

LLC misses:

# Samples: 1M of event 'LLC-load-misses'
# Event count (approx.): 143379713
# Overhead  Command   Shared Object       Symbol
# ........  ........  ..................  ..............................
    25.08%  postgres  postgres            [.] _bt_compare
    12.84%  postgres  postgres            [.] PinBuffer
     9.18%  postgres  postgres            [.] LWLockAcquire
     6.31%  postgres  postgres            [.] GetSnapshotData
     6.08%  postgres  postgres            [.] heap_hot_search_buffer
     5.13%  postgres  postgres            [.] hash_search_with_hash_value
     4.85%  postgres  postgres            [.] _bt_checkpage
     3.95%  postgres  postgres            [.] _bt_moveright
     3.09%  postgres  postgres            [.] heap_page_prune_opt
     2.12%  postgres  postgres            [.] slot_deform_tuple
     1.98%  postgres  postgres            [.] LWLockRelease
     1.82%  postgres  libc-2.13.so        [.] __memcmp_sse4_1
     1.16%  postgres  postgres            [.] ExecProject
     1.16%  postgres  postgres            [.] FunctionCall2Coll
     0.94%  postgres  [kernel.kallsyms]   [k] copy_user_generic_string
     0.94%  postgres  [kernel.kallsyms]   [k] tg_load_down
     0.78%  postgres  [kernel.kallsyms]   [k] find_get_page
     0.73%  postgres  postgres            [.] ProcArrayEndTransaction
     0.73%  postgres  postgres            [.] pfree
     0.71%  postgres  postgres            [.] pgstat_report_xact_timestamp
     0.69%  postgres  postgres            [.] index_fetch_heap
     0.66%  postgres  postgres            [.] LockAcquireExtended
     0.56%  postgres  postgres            [.] LockBuffer
     0.45%  postgres  postgres            [.] slot_getsomeattrs
     0.40%  postgres  postgres            [.] _bt_readpage

So it seems L1 misses are the interesting thing wrt parsing. When doing a source/assembly annotation in SearchCatCache about half of the misses are attributed to the memcpy() directly at the beginning. In base_yyparse the three biggest offenders (15%, 10.5%, 5.58%) really seem to be various kinds of table lookups. So it seems to confirm the various suspicions that the table size might be rather relevant.
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-20 16:04:49 +0100, Andres Freund wrote: > On 2012-12-20 15:51:37 +0100, Andres Freund wrote: > When doing a source/assembly annotation in SearchCatCache about half of > the misses are attributed to the memcpy() directly at the beginning. Using a separate array for storing the arguments instead of copying & modifying cache->cc_skey yields a 2% speedup in pgbench -S for me... The attached patch is clearly not ready and I don't really have time & energy to pursue it right now, but it seems interesting enough to post. The approach seems solid and sensible although the implementation is not (too much c&p, no comments). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
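For readers skimming the thread, here is a toy, self-contained sketch of the kind of change being described -- the struct and field names are invented, and this is neither the actual catcache code nor the posted patch; it only illustrates why skipping a per-lookup memcpy() of a constant key template can matter:

#include <stdio.h>
#include <string.h>

/* Invented stand-ins; not PostgreSQL's actual catcache structures. */
typedef unsigned long Datum;

typedef struct KeySketch
{
    int   attno;                /* constant per cache */
    int   strategy;             /* constant per cache */
    Datum argument;             /* varies per lookup */
    char  pad[48];              /* makes the per-call copy visibly expensive */
} KeySketch;

typedef struct CacheSketch
{
    int       nkeys;
    KeySketch template_keys[4]; /* set up once, when the cache is created */
} CacheSketch;

/* Old style: copy the whole template into a local array, then overwrite
 * the argument fields -- one memcpy per lookup. */
static int lookup_with_copy(CacheSketch *cache, Datum a0, Datum a1)
{
    KeySketch keys[4];

    memcpy(keys, cache->template_keys, sizeof(keys));
    keys[0].argument = a0;
    keys[1].argument = a1;
    return keys[0].attno == 1 && keys[0].argument == 42;   /* pretend to scan */
}

/* New style: leave the template alone and hand the variable arguments
 * to the lookup in a small separate array. */
static int lookup_with_args(const CacheSketch *cache, const Datum args[4])
{
    return cache->template_keys[0].attno == 1 && args[0] == 42;  /* pretend to scan */
}

int main(void)
{
    CacheSketch cache = { .nkeys = 2,
                          .template_keys = { { .attno = 1 }, { .attno = 2 } } };
    Datum args[4] = { 42, 7, 0, 0 };

    printf("copying lookup: %d\n", lookup_with_copy(&cache, 42, 7));
    printf("separate args:  %d\n", lookup_with_args(&cache, args));
    return 0;
}

The constant key metadata then stays in one shared, cache-friendly place, and only the handful of per-call argument values move on each lookup.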
On Thu, Dec 20, 2012 at 3:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <stark@mit.edu> writes: >> But I'm not entirely convinced any of this is actually useful. Just >> because the transition table is large doesn't mean it's inefficient. > > That's a fair point. However, I've often noticed base_yyparse() showing > up rather high on profiles --- higher than seemed plausible at the time, > given that its state-machine implementation is pretty tight. Now I'm > wondering whether that isn't coming from cache stalls from trying to > touch all the requisite parts of the transition table. For what it's worth, the bloat isn't in the parser transition table at all:

516280  yy_transition
147208  yytable
147208  yycheck
146975  base_yyparse
 17468  yypact
  9571  core_yylex
  8734  yystos
  8734  yydefact

Unless I'm confused, yy_transition is in fact the *lexer* transition table. I'm not sure how to reconcile that with the profiling results showing the cache misses in base_yyparse() though. -- greg
On 2012-12-20 15:58:12 +0000, Greg Stark wrote: > On Thu, Dec 20, 2012 at 3:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Greg Stark <stark@mit.edu> writes: > >> But I'm not entirely convinced any of this is actually useful. Just > >> because the transition table is large doesn't mean it's inefficient. > > > > That's a fair point. However, I've often noticed base_yyparse() showing > > up rather high on profiles --- higher than seemed plausible at the time, > > given that its state-machine implementation is pretty tight. Now I'm > > wondering whether that isn't coming from cache stalls from trying to > > touch all the requisite parts of the transition table. > > For what it's worth, the bloat isn't in the parser transition table at all: > 516280 yy_transition > 147208 yytable > 147208 yycheck > 146975 base_yyparse > 17468 yypact > 9571 core_yylex > 8734 yystos > 8734 yydefact > > Unless I'm confused, yy_transition is in fact the *lexer* transition > table. I'm not sure how to reconcile that with the profiling results > showing the cache misses in base_yyparse() though. The scanner is compiled together with the parser, so it's not unlikely that the compiler bunches them together, making only base_yyparse visible in the profile. I quickly checked though, and the top three offenders for L1 caches are in bison, not in the lexer... It's not implausible that the state transitions in the lexer are better cached and/or predicted... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> > Another way of attack along these lines might be to use the %glr-parser and then try to cut back on all those redundant rules that were put in to avoid conflicts. The number of keyword categories and such could perhaps also be reduced that way. > > Of course, this is mostly speculation. > > The GLR output from Bison is licensed under the GPL (unlike the LALR output). So using Bison's GLR mode isn't an option.
On 2012-12-20 23:12:55 +0000, McDevitt, Charles wrote: > > > > Another way of attack along these lines might be to use the %glr-parser > > and then try to cut back on all those redundant rules that were put in > > to avoid conflicts. The number of keyword categories and such could > > perhaps also be reduced that way. > > > > Of course, this is mostly speculation. > > > > > > The GLR output from Bison is licensed under the GPL (unlike the LALR output). > So using Bison's GLR mode isn't an option. That's not the case anymore: http://www.gnu.org/software/bison/manual/html_node/Conditions.html Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> > > > The GLR output from Bison is licensed under the GPL (unlike the LALR output). > > So using Bison's GLR mode isn't an option. > > That's not the case anymore: > http://www.gnu.org/software/bison/manual/html_node/Conditions.html Sorry! My mistake... I didn't realize they changed the rules. I'll be more careful to check these things in the future.
Andres Freund <andres@2ndquadrant.com> writes: > On 2012-12-20 23:12:55 +0000, McDevitt, Charles wrote: >>> Another way of attack along these lines might be to use the %glr-parser >>> and then try to cut back on all those redundant rules that were put in >>> to avoid conflicts. The number of keyword categories and such could >>> perhaps also be reduced that way. >> The GLR output from Bison is licensed under the GPL (unlike the LALR output). >> So using Bison's GLR mode isn't an option. > That's not the case anymore: > http://www.gnu.org/software/bison/manual/html_node/Conditions.html This does mean that we'd have to specify a minimum bison version of 2.2 in order to be squeaky-clean license-wise, if we went over to using the GLR mode. However, that would likely be a good idea anyway from a technical standpoint --- the GLR mode may exist in ancient bison versions, but who knows how bug-free it is. Anyway, this is all merest speculation until somebody actually tries it and sees if a performance gain is possible. Having just re-read the description of GLR mode, I wouldn't be surprised if any savings in table size would be squandered by its handling of ambiguous cases (ie, the need to track and merge multiple parser states). regards, tom lane