Thread: b-tree index search algorithms
Hello, I'm currently on a university research project if performance could be increased by substituting different inter-node search algorithms instead of the currently used binary search. But I'm having troubles understanding how the general b-tree implementation (nbtree.h) is used to represent for example a simple primary key on an integer column. I've debug printed the 'scankey->sk_argument' and all attributes of the index tuples on the pages being traversed (simply ran 'DatumGetInt32' on both) but I never see one of the integers actually appearing in my table being logged when I do a select. This is why I assume that all column values are hashed before they are pushed into the b-tree, but this hashing would have to preserve the order of the keys. I have tried to look at stack traces, but the b-tree implementations seems to be used commonly for many things that it's hard to find the interesting bits for me. I would like to know how the b-tree and column indexes interact. The readmes for the b-tree seem really implementation centric and I'm not getting a hold the whole picture. Regards, Samuel Vogel
Samuel Vogel <s@muel-vogel.de> writes: > I'm currently on a university research project if performance could be > increased by substituting different inter-node search algorithms instead > of the currently used binary search. Hm, what have you got in mind exactly? > But I'm having troubles understanding how the general b-tree > implementation (nbtree.h) is used to represent for example a simple > primary key on an integer column. I've debug printed the > 'scankey->sk_argument' and all attributes of the index tuples on the > pages being traversed (simply ran 'DatumGetInt32' on both) but I never > see one of the integers actually appearing in my table being logged when > I do a select. Not clear what you did wrong from this amount of detail, but integer keys ought to be pretty obvious at the debugger level. > This is why I assume that all column values are hashed before they are > pushed into the b-tree, PG's b-trees do not hash anything. If you're not seeing interpretable key values then you're doing something wrong in your inspection methodology. regards, tom lane
Am 17.07.12 05:21, schrieb Tom Lane: > Samuel Vogel <s@muel-vogel.de> writes: >> I'm currently on a university research project if performance could be >> increased by substituting different inter-node search algorithms instead >> of the currently used binary search. > Hm, what have you got in mind exactly? At first I will try a simple interpolation search, but problems start there since I need to have a numerical representation of the index keys (or map them to one) to do the interpolation. >> But I'm having troubles understanding how the general b-tree >> implementation (nbtree.h) is used to represent for example a simple >> primary key on an integer column. I've debug printed the >> 'scankey->sk_argument' and all attributes of the index tuples on the >> pages being traversed (simply ran 'DatumGetInt32' on both) but I never >> see one of the integers actually appearing in my table being logged when >> I do a select. > Not clear what you did wrong from this amount of detail, but integer > keys ought to be pretty obvious at the debugger level. Okay, to be more specific: Printing 'DatumGetInt32(scankey->sk_argument)' in '_bt_compare' never shows me 50 when I execute this query: SELECT * FROM simpletest WHERE id = 50; >> This is why I assume that all column values are hashed before they are >> pushed into the b-tree, > PG's b-trees do not hash anything. If you're not seeing interpretable > key values then you're doing something wrong in your inspection > methodology. Okay, how are indexes on char/text columns handled then? Are they hashed before being put into the b-tree or is my assumption correct, that in that case the Datum is only a link to where the actual data is stored and only 'scankey->sk_func' knows how to make use of it (compare it). In that case it would be extremly hard to get to a numeric representation which can be used for the interpolation. Regards, Samuel Vogel
Samuel Vogel <s@muel-vogel.de> writes: > Am 17.07.12 05:21, schrieb Tom Lane: >> Samuel Vogel <s@muel-vogel.de> writes: >>> I'm currently on a university research project if performance could be >>> increased by substituting different inter-node search algorithms instead >>> of the currently used binary search. >> Hm, what have you got in mind exactly? > At first I will try a simple interpolation search, but problems start > there since I need to have a numerical representation of the index keys > (or map them to one) to do the interpolation. Dunno about that. btree knows nothing about the datatypes it's working on except that they have comparison functions. Converting the values to some sort of numeric scale that you can interpolate on seems logically dubious and fraught with practical difficulties. Now, we do have some code in selfuncs.c that tries to do that, for some data types, but it's only for planner estimation purposes, and we don't rely very heavily on its results even in that context. Depending on it to be right for search purposes sounds pretty scary. >> Not clear what you did wrong from this amount of detail, but integer >> keys ought to be pretty obvious at the debugger level. > Okay, to be more specific: Printing > 'DatumGetInt32(scankey->sk_argument)' in '_bt_compare' never shows me 50 > when I execute this query: SELECT * FROM simpletest WHERE id = 50; Um, what does it show you? DatumGetInt32 is a macro, and at least in gdb that won't work at all: (gdb) p DatumGetInt32(scankey->sk_argument) No symbol "DatumGetInt32" in current context. However, just looking at the value produces sane answers for me: (gdb) p *scankey $1 = {sk_flags = 196608, sk_attno = 1, sk_strategy = 0, sk_subtype = 23, sk_collation = 0, sk_func = {fn_addr = 0x486ec0<btint4cmp>, fn_oid = 351, fn_nargs = 2, fn_strict = 1 '\001', fn_retset = 0 '\000', fn_stats = 2 '\002', fn_extra= 0x0, fn_mcxt = 0x2b82fe8, fn_expr = 0x0}, sk_argument = 50} (gdb) p scankey->sk_argument $2 = 50 >> PG's b-trees do not hash anything. If you're not seeing interpretable >> key values then you're doing something wrong in your inspection >> methodology. > Okay, how are indexes on char/text columns handled then? The datum values will be pointers to strings. > ... is my assumption correct, that in > that case the Datum is only a link to where the actual data is stored > and only 'scankey->sk_func' knows how to make use of it (compare it). > In that case it would be extremly hard to get to a numeric > representation which can be used for the interpolation. The btree code is (or reasonably can be) aware that such values are pass-by-reference, and how to get to the bits. But the comparison semantics of two different values are not something it knows about except by asking the comparison function. This can be quite a nontrivial matter even for text, since we follow strcoll() comparison rules. regards, tom lane
Am 17.07.12 19:38, schrieb Tom Lane: > btree knows nothing about the datatypes it's working on except that > they have comparison functions. Converting the values to some sort of > numeric scale that you can interpolate on seems logically dubious and > fraught with practical difficulties. Now, we do have some code in > selfuncs.c that tries to do that, for some data types, but it's only > for planner estimation purposes, and we don't rely very heavily on its > results even in that context. Depending on it to be right for search > purposes sounds pretty scary. 'convert_string_to_scalar' and others look interesting in selfuncs.c, thanks for the pointer! >> Okay, how are indexes on char/text columns handled then? > The datum values will be pointers to strings. I can simply dereference it and read all bytes until a null-byte appears (depending on the collation and that I know that it actually is a string)? > The btree code is (or reasonably can be) aware that such values are > pass-by-reference, and how to get to the bits. But the comparison > semantics of two different values are not something it knows about > except by asking the comparison function. This can be quite a > nontrivial matter even for text, since we follow strcoll() comparison > rules. regards, tom lane How would the b-tree know exactly that a value is only a reference? And even in that case you say that it could get the bits, but make no use of it, since it does not know what they represent, right? Regards, Samuel
Samuel Vogel <s@muel-vogel.de> writes: > Am 17.07.12 19:38, schrieb Tom Lane: >> The datum values will be pointers to strings. > I can simply dereference it and read all bytes until a null-byte appears > (depending on the collation and that I know that it actually is a string)? We use a length word and then data, with no trailing null, at least for text and varchar. This area has been hacked for efficiency until it's pretty complicated, but you can read about it in postgres.h. > How would the b-tree know exactly that a value is only a reference? And > even in that case you say that it could get the bits, but make no use of > it, since it does not know what they represent, right? It has access to the data type's basic storage parameters, which are typbyval, typlen, and typalign; and we have standard conventions for identifying the length etc of variable-length values. It's just the meaning of the payload data bytes that's data-type-private. regards, tom lane
Am 18.07.12 23:56, schrieb Tom Lane: > Samuel Vogel <s@muel-vogel.de> writes: >> How would the b-tree know exactly that a value is only a reference? And >> even in that case you say that it could get the bits, but make no use of >> it, since it does not know what they represent, right? > It has access to the data type's basic storage parameters, which are > typbyval, typlen, and typalign; and we have standard conventions for > identifying the length etc of variable-length values. It's just the > meaning of the payload data bytes that's data-type-private. Okay, so with these I know if and how I would have to "dereference" the data. But how do I get to this info from inside _bt_binsrch? RelationGetDescr(rel)->tdtypeid was my closest guess, but I need to get a reference to FormData_pg_type somehow. Regards, Samuel Vogel
Samuel Vogel <s@muel-vogel.de> writes: > Am 18.07.12 23:56, schrieb Tom Lane: >> It has access to the data type's basic storage parameters, which are >> typbyval, typlen, and typalign; and we have standard conventions for >> identifying the length etc of variable-length values. It's just the >> meaning of the payload data bytes that's data-type-private. > Okay, so with these I know if and how I would have to "dereference" the > data. > But how do I get to this info from inside _bt_binsrch? RelationGetDescr(rel)->attrs[n]->attbyval etc. regards, tom lane
Am 20.07.12 01:40, schrieb Tom Lane: > RelationGetDescr(rel)->attrs[n]->attbyval Thanks! Now does 'Relation' refer to the whole table or only the columns that are supposed to be scanned? So will RelationGetDescr(rel)->attrs[0] give me the description of the first column relevant to the current b-tree scan or simply to the first column in the table? Naively I tried RelationGetDescr(rel)->attrs[scankey->sk_attno] but it results in a segmentation fault. A first version of my algorithm is running now (very simple test case) but I had to implement a workaround for the following behavior: The (binary) search is supposed to always find the first matching offset on a page, but I do not understand how this is guaranteed, since the semantics of a binary search do not guarantee this. The type being searched where it throws me off specifically is a 'chunk_id', page contents are: 1: 12000 2: 12001 3: 12002 4: 12003 5: 12004 6: 12004 7: 12005 8: 12005 9: 12006 10: 12006 11: 16393 12: 16393 13: 16394 14: 16394 15: 16395 16: 16395 Binary search finds 11 in 4 steps, interpolation search finds 12 in 3 steps. Now if there was an additional value like "17: 16396", binary search should also find 12 first, right? Regards, Samuel Vogel