Thread: gsoc, text search selectivity and dllist enhancements
Hi,

attached are two patches against HEAD. The smaller one is meant to be committed - it adds some functions that manipulate doubly-linked lists, namely inserting a new cell after or before another cell and swapping two adjacent cells. It felt like being back in the first year of studies. I hope I didn't mess those pointers up.

The gzipped one is WIP for my GSoC project. I've reworked the algorithm for determining most common lexemes. The goal was to avoid scanning through all currently kept lexemes in each iteration of the loop that processes all lexemes from sample tsvectors. Heikki suggested introducing a hashtable, so I did that. It works, passes regression tests and correctly identifies most common lexemes in my toy test database. Of course it's all quite useless without a selectivity operator that could use those statistics. I'm sending it in to maybe get some feedback during the commit fest.

The second patch depends on the first, and also on the one I sent earlier:
http://archives.postgresql.org/message-id/484418B8.6060004@students.mimuw.edu.pl

I'll add the first one to the commit fest page, and I'm sending it to -hackers with congratulations on the decision to ditch -patches ;)

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2
ouden estin

diff --git a/src/backend/lib/dllist.c b/src/backend/lib/dllist.c
index b771fce..79c402e 100644
*** a/src/backend/lib/dllist.c
--- b/src/backend/lib/dllist.c
***************
*** 212,214 ****
--- 212,336 ----
      l->dll_head = e;
      /* We need not check dll_tail, since there must have been > 1 entry */
  }
+
+ /* Insert a node after the given target node. */
+ void
+ DLAddAfter(Dlelem *e, Dlelem *target)
+ {
+     Dllist     *l = target->dle_list;
+
+     e->dle_list = l;
+     e->dle_prev = target;
+     e->dle_next = target->dle_next;
+
+     if (l->dll_tail != target)
+     {
+         /* Target is not the tail */
+         Assert(target->dle_next != NULL);
+         target->dle_next->dle_prev = e;
+     }
+     else
+     {
+         /* Target is the tail */
+         Assert(target->dle_next == NULL);
+         l->dll_tail = e;
+     }
+     target->dle_next = e;
+     return;
+ }
+
+ /* Insert a node before the given target node. */
+ void
+ DLAddBefore(Dlelem *e, Dlelem *target)
+ {
+     Dllist     *l = target->dle_list;
+
+     e->dle_list = l;
+     e->dle_prev = target->dle_prev;
+     e->dle_next = target;
+
+     if (l->dll_head != target)
+     {
+         /* Target is not the head */
+         Assert(target->dle_prev != NULL);
+         target->dle_prev->dle_next = e;
+     }
+     else
+     {
+         /* Target is the head */
+         Assert(target->dle_prev == NULL);
+         l->dll_head = e;
+     }
+     target->dle_prev = e;
+     return;
+ }
+
+ /* Swap a node with its successor */
+ void
+ DLSwapWithNext(Dlelem *e)
+ {
+     Dllist     *l = e->dle_list;
+     Dlelem     *tmp;
+
+     Assert(e->dle_next != NULL);
+
+     tmp = e->dle_next;
+     e->dle_next = tmp->dle_next;
+     if (l->dll_tail != tmp)
+     {
+         Assert(tmp->dle_next != NULL);
+         tmp->dle_next->dle_prev = e;
+     }
+     else
+     {
+         l->dll_tail = e;
+     }
+     if (l->dll_head != e)
+     {
+         Assert(e->dle_prev != NULL);
+         e->dle_prev->dle_next = tmp;
+     }
+     else
+     {
+         l->dll_head = tmp;
+     }
+     tmp->dle_prev = e->dle_prev;
+     e->dle_prev = tmp;
+     tmp->dle_next = e;
+     return;
+ }
+
+ /* Swap a node with its predecessor */
+ void
+ DLSwapWithPrevious(Dlelem *e)
+ {
+     Dllist     *l = e->dle_list;
+     Dlelem     *tmp;
+
+     Assert(e->dle_prev != NULL);
+
+     tmp = e->dle_prev;
+     e->dle_prev = tmp->dle_prev;
+     if (l->dll_head != tmp)
+     {
+         Assert(tmp->dle_prev != NULL);
+         tmp->dle_prev->dle_next = e;
+     }
+     else
+     {
+         l->dll_head = e;
+     }
+     if (l->dll_tail != e)
+     {
+         Assert(e->dle_next != NULL);
+         e->dle_next->dle_prev = tmp;
+     }
+     else
+     {
+         l->dll_tail = tmp;
+     }
+     tmp->dle_next = e->dle_next;
+     e->dle_next = tmp;
+     tmp->dle_prev = e;
+     return;
+ }
diff --git a/src/include/lib/dllist.h b/src/include/lib/dllist.h
index d754b03..597e71f 100644
*** a/src/include/lib/dllist.h
--- b/src/include/lib/dllist.h
***************
*** 72,77 ****
--- 72,81 ----
  extern Dlelem *DLRemHead(Dllist *list); /* remove and return the head */
  extern Dlelem *DLRemTail(Dllist *list);
  extern void DLMoveToFront(Dlelem *e);   /* move node to front of its list */
+ extern void DLAddAfter(Dlelem *e, Dlelem *target);  /* add node after another */
+ extern void DLAddBefore(Dlelem *e, Dlelem *target); /* same, but add before */
+ extern void DLSwapWithNext(Dlelem *e);
+ extern void DLSwapWithPrevious(Dlelem *e);
  
  /* These are macros for speed */
  #define DLGetHead(list)     ((list)->dll_head)
Attachment
Jan Urbański wrote:
> I'll add the first one to the commit fest page, and I'm sending it to
> -hackers with congratulations on the decision to ditch -patches ;)

Hm... someone apparently added this mail to the wiki page independently of me, so there were two duplicate entries. I found the second description better, so I removed my original entry and left the other one.

-- 
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
Jan,

> Hm... someone apparently added this mail to the wiki page independently
> of me, so there were two duplicate entries. I found the second
> description better, so I removed my original entry and left the other
> one.

Yeah, I've been going through -hackers and -patches and adding stuff to the wiki page.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> attached are two patches against HEAD. The smaller one is meant to be
> committed - it adds some functions that manipulate doubly-linked lists,
> namely inserting a new cell after or before another cell and swapping
> two adjacent cells.

> The gzipped one is WIP for my GSoC project. I've reworked the algorithm
> for determining most common lexemes.

I looked over this a bit. I'm not excited about adding functionality to Dllist --- that data structure is barely used at all in the backend, and I think a better case could be made for getting rid of it than adding code to it. The analyze patch doesn't change my mind on the point, because I don't think that Dllist is really helping you there anyway.

The data structure I'd suggest is a simple array of pointers to the underlying hash table entries. Since you have a predetermined maximum number of lexemes to track, you can just palloc the array once --- you don't need the expansibility properties of a list. The only operations you need are "add an entry at the end" (if you keep the array sorted by descending count not ascending), "remove the end entry", and "swap adjacent entries", all of which are actually cheaper on an array than on a Dllist.

Another point is that you don't really need the array to be sorted all the time. Instead of using what is basically an O(N^2) incremental sort, you could consider applying qsort() when you do need it to be sorted, which is at the end or when the table overflows and you need to discard some entries. If you are willing to discard multiple entries per overflow event, this could be quite cheap --- although I think in the worst case where there are many overflows, it'd go back to being O(N^2).

Note BTW that adding a count=1 entry at the end cannot make the array unsorted if it was sorted before. The only event that renders the array unsorted is increasing an item's count to more than the count of its predecessor --- so it might be worth keeping a predecessor pointer in each hashtable entry that identifies its predecessor as of the time of the last array sort, so you could check on-the-fly and avoid unnecessary re-sorts. I'm not totally sure that this idea is a win, but it seems worth investigating.

Other than that, the analyze patch looked generally sane to me, and I think you're on the right track. Please rework and resubmit.

            regards, tom lane
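For illustration, here is a minimal self-contained sketch of the layout suggested above. The struct and function names are invented for the example and are not taken from the patch or the backend: the tracked entries themselves would live in a hash table, the array merely orders pointers to them, and a "predecessor at last sort" field gives a cheap on-the-fly check for whether bumping a count may have unsorted the array.

#include <stddef.h>
#include <stdbool.h>

/* Hypothetical tracked-lexeme entry; the real one would be a hash table entry. */
typedef struct TrackedLexeme
{
    const char *lexeme;
    int         count;
    /* predecessor as of the last time the array was known to be sorted */
    struct TrackedLexeme *pred_at_last_sort;
} TrackedLexeme;

/* Array of pointers kept (mostly) sorted by descending count. */
typedef struct
{
    TrackedLexeme **items;
    int             nitems;
    int             maxitems;
} TrackArray;

/* Append a new count=1 entry; this cannot unsort a descending-order array. */
static bool
track_append(TrackArray *a, TrackedLexeme *e)
{
    if (a->nitems >= a->maxitems)
        return false;           /* caller must evict or re-sort first */
    e->pred_at_last_sort = (a->nitems > 0) ? a->items[a->nitems - 1] : NULL;
    a->items[a->nitems++] = e;
    return true;
}

/* Swap adjacent entries i and i+1 -- O(1), cheaper than relinking list cells. */
static void
track_swap_with_next(TrackArray *a, int i)
{
    TrackedLexeme *tmp = a->items[i];

    a->items[i] = a->items[i + 1];
    a->items[i + 1] = tmp;
}

/* After bumping e->count, check whether a re-sort might now be needed. */
static bool
track_needs_resort(const TrackedLexeme *e)
{
    return e->pred_at_last_sort != NULL &&
           e->count > e->pred_at_last_sort->count;
}

Appending and swapping are O(1); a full qsort() of the pointer array would only be done when track_needs_resort() reports a violation or when the array overflows and low-count entries need to be discarded.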
Tom Lane wrote:
> The data structure I'd suggest is a simple array of pointers
> to the underlying hash table entries. Since you have a predetermined
> maximum number of lexemes to track, you can just palloc the array once
> --- you don't need the expansibility properties of a list.

The number of lexemes isn't predetermined. It's 2 * (longest tsvector seen so far), and we don't know beforehand how long the longest tsvector is. repalloc()ing shouldn't be a problem, though.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> The data structure I'd suggest is a simple array of pointers >> to the underlying hash table entries. Since you have a predetermined >> maximum number of lexemes to track, you can just palloc the array once >> --- you don't need the expansibility properties of a list. > The number of lexemes isn't predetermined. It's 2 * (longest tsvector > seen so far), and we don't know beforehand how long the longest tsvector is. Hmm, I had just assumed without looking too closely that it was stats target times a fudge factor. What is the rationale for doing it as above? I don't think I like the idea of the limit varying over the course of the scan --- that means that lexemes in different places in the input will have significantly different probabilities of surviving to the final result. regards, tom lane
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> The data structure I'd suggest is a simple array of pointers
>>> to the underlying hash table entries. Since you have a predetermined
>>> maximum number of lexemes to track, you can just palloc the array once
>>> --- you don't need the expansibility properties of a list.
>
>> The number of lexemes isn't predetermined. It's 2 * (longest tsvector
>> seen so far), and we don't know beforehand how long the longest tsvector is.
>
> Hmm, I had just assumed without looking too closely that it was stats
> target times a fudge factor. What is the rationale for doing it as
> above? I don't think I like the idea of the limit varying over the
> course of the scan --- that means that lexemes in different places
> in the input will have significantly different probabilities of
> surviving to the final result.

Well, clearly if the list is smaller than the longest tsvector, inserting all elements of that long tsvector will flush out all other entries. Or if we throw away the newly inserted entries, some elements will never have a chance to climb up the list. I'm not sure where the "times two" figure comes from, maybe it's just a fudge factor, but the bottom line is that the minimum size needed depends on the size of the longest tsvector.

(Jan is offline until Saturday...)

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>>> Tom Lane wrote:
>> Hmm, I had just assumed without looking too closely that it was stats
>> target times a fudge factor. What is the rationale for doing it as
>> above? I don't think I like the idea of the limit varying over the
>> course of the scan --- that means that lexemes in different places
>> in the input will have significantly different probabilities of
>> surviving to the final result.
> will never have a chance to climb up the list. I'm not sure where the
> "times two" figure comes from, maybe it's just a fudge factor, but the
> bottom line is that the minimum size needed depends on the size of the
> longest tsvector.

Thank you for the review. The whole algorithm was based on the original analyze routine from analyze.c (compute_minimal_stats() ). The "times two" factor is there, I think its purpose is to reduce the chances of an element dropping off the list before its number of occurrences gets boosted. It doesn't look like a factor of 1.5 or 3 would be much better/worse, I just kept it as I saw it.

It's true that making the limit change in the course of the scan makes the set of surviving lexemes dependent on the order in which entries are fetched. I'm inclined to believe that most of the time it won't change the overall result - some entries will drop off earlier, some later, some will be discarded when it comes to the final pass at the end of the algorithm. Maybe I should do a test - run ANALYZE several times over a large set of documents and see if the result stays the same?

Since the number of tracked lexemes needs to be correlated with the length of the longest tsvector, the only possibility I see is doing two passes through the statistics sample (determine the length first, then do the real work). I wasn't sure if rows fetched by the sampling algorithm are guaranteed to be in memory. If so, then another pass is just a linear cost (and asymptotically insignificant).

About the array vs dllist issue. If the "two passes" idea is better than a "dynamically expanding list", then an array is of course better. I thought about using a one-way list, but having it two-way helps with inserting an element at the end of the 1-occurrence group (and I think that's important, read on) and with the final pass, in which you go through the list backward and stop when you get enough entries to fill the pg_statistics tuple. Maybe these could be dealt with efficiently using a one-way list; it was easier to do with a doubly-linked one.

Keeping the list sorted at all times has an advantage: if you insert new entries at the *end* of the 1-occurrence block and have the ones at the beginning fall off, you are discarding elements that haven't been seen for the longest time. Consider a list:

cat:1 dog:1 duck:2 rhino:3

If the next lexeme is "goat", and you'd put it at the beginning of the list (or put it in anywhere and qsort() it), you could get:

goat:1 cat:1 dog:1 duck:2 rhino:3

And if this would've triggered an overflow, you could lose the goat, when you'd want to lose the cat.

Once again: the whole algorithm is a ripoff from analyze.c, with the dllist instead of an array because you don't know how big a tracking list you will need, and with a hashtable, because the tracking list will probably be large and looking up things linearly wouldn't be fun.

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> Once again: the whole algorithm is a ripoff from analyze.c, with the
> dllist instead of an array because you don't know how big a tracking list
> you will need, and with a hashtable, because the tracking list will
> probably be large and looking up things linearly wouldn't be fun.

Hmmm ... I had forgotten how compute_minimal_stats() works, and looking at it now I'm just itching to rewrite it. You have to realize that that code was written with the assumptions that (1) it was only going to be used for a few weird second-class-citizen data types, and (2) the stats target would typically be around 10 or so. It really wasn't designed to be industrial-strength code. (It was still a big improvement over what we'd had before :-(.) So I'm not very comfortable with taking it as the design base for something that does need to be industrial-strength.

Your point about not wanting the most-recently-added lexeme to drop off first is a good one, but the approach is awfully fragile. Consider what happens if (or when) all the lexemes in the list reach count 2 or more. All other lexemes seen later will fight over the last list slot, and have no chance of getting higher in the list; so the algorithm will completely fail to adapt to any changes in the input statistics that occur after that point is reached. I'm thinking that the idea of periodically cutting off a bunch of low-scoring items, instead of trying to do it "exactly", would actually be better because it'd still have a chance of tracking new data even when the counts get larger.

I don't recommend doing two passes over the data because it's fairly likely that tsvectors would be large enough to be toasted, which'd make fetching them expensive. One idea is to start out with some reasonable estimate of the max lexemes per tsvector (say a couple hundred) and realloc the list bigger in sizable jumps (say 2X) when the estimate is seen to be exceeded. Another possibility is to have a prescan that only determines the width of the physically largest tsvector (you can get the width without detoasting, so this would be quite cheap), and then estimate the number of lexemes corresponding to that using some fudge-factor guess about lexeme sizes, and then stick with the resulting list size regardless of what the actual lexeme counts are. I kinda like the latter since its behavior is order-independent.

            regards, tom lane
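A rough sketch of the second idea, the order-independent width-based estimate. Every name and number below is a placeholder invented for this example; in the backend the raw (untoasted) datum width would come from the toast machinery rather than being passed in by the caller.

#include <stddef.h>

#define ASSUMED_AVG_LEXEME_BYTES 8      /* pure guess at a typical lexeme width */
#define TRACK_FUDGE_FACTOR       2      /* headroom so counts can grow a bit */

/*
 * Fix the tracking-array size from the widest raw tsvector found in a
 * cheap prescan, so the budget never changes in the middle of the scan.
 */
static int
estimate_track_size(size_t max_raw_tsvector_bytes)
{
    int guess = (int) (max_raw_tsvector_bytes / ASSUMED_AVG_LEXEME_BYTES);

    if (guess < 256)
        guess = 256;            /* arbitrary floor for this sketch */
    return guess * TRACK_FUDGE_FACTOR;
}

The attraction is that the tracking budget is fixed before any lexemes are counted, so the set of surviving lexemes cannot depend on the order in which the sample rows happen to be read.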
Tom Lane wrote:
> target would typically be around 10 or so. It really wasn't designed to
> be industrial-strength code. (It was still a big improvement over what
> we'd had before :-(.) So I'm not very comfortable with taking it as the
> design base for something that does need to be industrial-strength.

I did some googling and found a paper describing a method of determining the most frequent elements in a stream, called Lossy Counting:
http://www.vldb.org/conf/2002/S10P03.pdf

The following is an adaptation of an algorithm described in section 4.2 of that paper. I'll summarize it and propose an implementation using existing PG data structures.

The data structure (called D) is a set of entries in the form (e, f, d), where e is the lexeme, f is an integer representing the estimated frequency and d is the maximum possible error in f. We process tsvectors in batches. After each w tsvectors we will be pruning our data structure, deleting entries that are not frequent enough. Let b be the number of the current batch (tsvectors 1 to w are processed with b = 1, tsvectors w+1 to w*2 are processed with b = 2 and so on). Observe that w is an arbitrary number.

Initially, D is empty. For each lexeme e we look up its entry in D. If it's found, we increment f in that entry. If not, we add a new entry in the form of (e, 1, b - 1). After each w tsvectors we delete all entries from D that have f + d <= b.

For example, consider such a list of tsvectors (numbers stand for lexemes, each line is a tsvector):

1 2 3
2 3 4 5
1 2
1 3 4
2 3 4 5
2 4 5 6

Let w = 3. After processing the first tsvector D would look like so:
(1, 1, 0), (2, 1, 0), (3, 1, 0)

After processing the second tsvector D would be:
(1, 1, 0), (2, 2, 0), (3, 2, 0), (4, 1, 0), (5, 1, 0)

After the third tsvector we'd have:
(1, 2, 0), (2, 3, 0), (3, 2, 0), (4, 1, 0), (5, 1, 0)

And now we'd do the pruning. b = 1, so all entries with f + d <= 1 are discarded. D is then:
(1, 2, 0), (2, 3, 0), (3, 2, 0)

After processing the fourth tsvector we get:
(1, 3, 0), (2, 3, 0), (3, 3, 0), (4, 1, 1)

And then:
(1, 3, 0), (2, 4, 0), (3, 4, 0), (4, 2, 1), (5, 1, 1)

And finally:
(1, 3, 0), (2, 5, 0), (3, 4, 0), (4, 3, 1), (5, 2, 1), (6, 1, 1)

Now comes the time for the second pruning. b = 2, so D is left with:
(1, 3, 0), (2, 5, 0), (3, 4, 0), (4, 3, 1), (5, 2, 1)

Finally we return entries with the largest values of f, their number depending on statistics_target. That ends the algorithm.

The original procedure from the paper prunes the entries after each "element batch" is processed, but the batches are of equal size. With tsvectors it seems sensible to divide lexemes into batches of different sizes, so the pruning always takes place after all the lexemes from a tsvector have been processed. I'm not totally sure if that change does not destroy some properties of that algorithm, but it looks reasonably sane.

As to the implementation: like Tom suggested, we could use a hashtable as D and an array of pointers to the hashtable entries, which would get updated each time a new hashtable entry was added. Pruning would then require a qsort() of that table taking (f + d) as the sort key, but other than that we could just add new entries at the end of the array. An open question is: what size should that array be? I think that by knowing how long the longest tsvector is we could come up with a good estimate, but this needs to be thought over. Anyway, we can always repalloc(). Getting the max tsvector length by looking at the datum widths would be a neat trick.

I just wonder: if we have to detoast all these tsvectors anyway, is it not possible to do that and do two passes over already detoasted datums? I've got to admit I don't really understand how toasting works, so I'm probably completely wrong on that.

Does this algorithm sound sensible? I might have not understood some things from that publication; they give some proofs of why their algorithm yields good results and takes up little space, but I'm not all that certain it can be so easily adapted.

If you think the Lossy Counting method has potential, I could test it somehow. Using my current work I could extract a stream of lexemes as ANALYZE sees it and run it through a Python implementation of the algorithm to see if the result makes sense.

This is getting more complicated than I thought it would :^)

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
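To make the bookkeeping concrete, here is a small self-contained C sketch of the pruned counting loop described above. It deliberately uses a linear array and strcmp() lookups instead of a hash table, purely to keep the example short; all names are invented for the example, and the input is the six toy tsvectors from the walkthrough.

#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 1024                /* plenty for this toy example */

typedef struct
{
    const char *lexeme;                 /* e: the element itself */
    int         f;                      /* estimated frequency */
    int         d;                      /* maximum error in f */
} LCEntry;

static LCEntry entries[MAX_ENTRIES];
static int     n_entries = 0;

/* Count one lexeme: bump f if already tracked, else insert (e, 1, b - 1). */
static void
lc_add(const char *lexeme, int bucket)
{
    int i;

    for (i = 0; i < n_entries; i++)
    {
        if (strcmp(entries[i].lexeme, lexeme) == 0)
        {
            entries[i].f++;
            return;
        }
    }
    if (n_entries < MAX_ENTRIES)
    {
        entries[n_entries].lexeme = lexeme;
        entries[n_entries].f = 1;
        entries[n_entries].d = bucket - 1;
        n_entries++;
    }
}

/* Prune: drop every entry with f + d <= current bucket number. */
static void
lc_prune(int bucket)
{
    int i, kept = 0;

    for (i = 0; i < n_entries; i++)
    {
        if (entries[i].f + entries[i].d > bucket)
            entries[kept++] = entries[i];
    }
    n_entries = kept;
}

int
main(void)
{
    /* The six "tsvectors" from the walkthrough, lexemes named "1".."6". */
    const char *docs[6][4] = {
        {"1", "2", "3", NULL},
        {"2", "3", "4", "5"},
        {"1", "2", NULL, NULL},
        {"1", "3", "4", NULL},
        {"2", "3", "4", "5"},
        {"2", "4", "5", "6"},
    };
    int w = 3;                          /* prune after every w tsvectors */
    int i, j, bucket;

    for (i = 0; i < 6; i++)
    {
        bucket = i / w + 1;
        for (j = 0; j < 4 && docs[i][j] != NULL; j++)
            lc_add(docs[i][j], bucket);
        if ((i + 1) % w == 0)
            lc_prune(bucket);
    }
    for (i = 0; i < n_entries; i++)
        printf("%s: f=%d d=%d\n", entries[i].lexeme, entries[i].f, entries[i].d);
    return 0;
}

Running it prints the same final state as the walkthrough: (1, 3, 0), (2, 5, 0), (3, 4, 0), (4, 3, 1), (5, 2, 1).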
Jan Urbański wrote:
> If you think the Lossy Counting method has potential, I could test it
> somehow. Using my current work I could extract a stream of lexemes as
> ANALYZE sees it and run it through a Python implementation of the
> algorithm to see if the result makes sense.

I hacked together a simplistic Python implementation and ran it on a table with 244901 tsvectors, 45624891 lexemes total. I was comparing results from my current approach with the results I'd get from a Lossy Counting algorithm. I experimented with statistics_target set to 10 and 100, and ran pruning in the LC algorithm every 3, 10 or 100 tsvectors. The sample size with statistics_target set to 100 was 30000 rows and that's what the input to the script was - lexemes from these 30000 tsvectors.

I found out that with pruning happening every 10 tsvectors I got precisely the same results as with the original algorithm (same most common lexemes, same frequencies). When I tried pruning after every 100 tsvectors the results changed very slightly (they were a tiny bit more distant from the ones from the original algorithm, and I think a tiny bit more precise, but I didn't give it much attention).

Bottom line seems to be: the Lossy Counting algorithm gives roughly the same results as the algorithm used currently and is also possibly faster (and more scalable wrt. statistics_target).

This should probably get more testing than just running some script 5 times over a fixed set of data, but I already had trouble sucking ~300 MB of tsvectors from one of my production sites, putting it on my laptop and so on.

Do you think it's worthwhile to implement the LC algorithm in C and send it out, so others could try it out? Heck, maybe it's worthwhile to replace the current compute_minimal_stats() algorithm with LC and see how that compares?

Anyway, I can share the Python script if someone would like to do some more tests (I suppose no one would, 'cause you first need to apply my ts_typanalyze patch and then change it some more to extract lexemes from the sample).

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> Do you think it's worthwhile to implement the LC algorithm in C and send
> it out, so others could try it out? Heck, maybe it's worthwhile to
> replace the current compute_minimal_stats() algorithm with LC and see
> how that compares?

Very possibly. I repeat that the current implementation of compute_minimal_stats is very ad-hoc code and wasn't written with an eye to high performance. Replacing it with an algorithm that someone actually thought about might well be worth doing.

            regards, tom lane
Tom Lane wrote:
> Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
>> Do you think it's worthwhile to implement the LC algorithm in C and send
>> it out, so others could try it out? Heck, maybe it's worthwhile to
>> replace the current compute_minimal_stats() algorithm with LC and see
>> how that compares?
>
> Very possibly. I repeat that the current implementation of
> compute_minimal_stats is very ad-hoc code and wasn't written with an eye
> to high performance. Replacing it with an algorithm that someone
> actually thought about might well be worth doing.

Here's a patch that combines both patches included here:
http://archives.postgresql.org/message-id/48649261.5040703@students.mimuw.edu.pl
and adds a C implementation of the Lossy Counting algorithm. It defines two typanalyze functions: ts_typanalyze_std and ts_typanalyze_lc, so you can see what statistics are gathered by each of them. It's meant for easy application to HEAD, updating pg_type and running ANALYZE on a few tables with tsvectors (i.e. testing, not committing).

My observations are: the LC algorithm beats the previous one by a fairly large margin (20-30%) timewise. The results are almost identical; I got discrepancies of about 0.05 for some lexemes' frequencies. I intend to stick with LC for tsvectors and that'll allow me to throw away the Dllist changes. If I want to keep my GSoC schedule I won't be able to implement LC for general statistics gathering, but it's trivial. If no one gets around to it I can do it after the Summer of Code (if only to see how it'll work).

Oh, one important thing. You need to choose a bucket width for the LC algorithm, that is, decide after how many elements you will prune your data structure. I chose to prune after every twenty tsvectors. You might consider:
 - picking some other arbitrary value
 - making it depend on the largest tsvector size
 - making it depend on the statistics_target
 - pruning after each X lexemes instead of after each Y tsvectors, because now the buckets will vary in width and you can argue that the order of input makes a difference again. OTOH the situation here is a bit different: you get streams of mutually different elements (lexemes inside a tsvector are all different) and pruning in the middle of such a stream might be unfair for lexemes that are still to be processed. Hmm, dunno.

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
Attachment
Jan Urbański wrote:

> Oh, one important thing. You need to choose a bucket width for the LC
> algorithm, that is, decide after how many elements you will prune your
> data structure. I chose to prune after every twenty tsvectors.

Do you prune after X tsvectors regardless of the number of lexemes in them? I don't think that preserves the algorithm properties; if there's a bunch of very short tsvectors and then long tsvectors, the pruning would take place too early for the initial lexemes. I think you should count lexemes, not tsvectors.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote:
> Jan Urbański wrote:
>
>> Oh, one important thing. You need to choose a bucket width for the LC
>> algorithm, that is, decide after how many elements you will prune your
>> data structure. I chose to prune after every twenty tsvectors.
>
> Do you prune after X tsvectors regardless of the number of lexemes in
> them? I don't think that preserves the algorithm properties; if there's
> a bunch of very short tsvectors and then long tsvectors, the pruning
> would take place too early for the initial lexemes. I think you should
> count lexemes, not tsvectors.

Yes, that's what I was afraid of. I'm not sure why I was reluctant to prune in the middle of a tsvector, maybe it's just in my head. Still, there's a decision to be made: after how many lexemes should the pruning occur?

-- 
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Jan Urbański wrote:
>> Oh, one important thing. You need to choose a bucket width for the LC
>> algorithm, that is, decide after how many elements you will prune your
>> data structure. I chose to prune after every twenty tsvectors.

> Do you prune after X tsvectors regardless of the number of lexemes in
> them? I don't think that preserves the algorithm properties; if there's
> a bunch of very short tsvectors and then long tsvectors, the pruning
> would take place too early for the initial lexemes. I think you should
> count lexemes, not tsvectors.

Yeah. I haven't read the Lossy Counting paper in detail yet, but I suspect that the mathematical proof of limited error doesn't work if the pruning is done on a variable spacing. I don't see anything very wrong with pruning intra-tsvector; the effects ought to average out, since the point where you prune is going to move around with respect to the tsvector boundaries.

            regards, tom lane
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> Still, there's a decision to be made: after how many lexemes should the
> pruning occur?

The way I think it ought to work is that the number of lexemes stored in the final pg_statistic entry is statistics_target times a constant (perhaps 100). I don't like having it vary depending on tsvector width --- why for example should a column having a few wide tsvectors get a bigger stats entry than one with many narrow ones? (Not to mention the issue of having to estimate the average or max width before you can start the counting run.) But in any case, given a target number of lexemes to accumulate, I'd suggest pruning with that number as the bucket width (pruning distance). Or perhaps use some multiple of the target number, but the number itself seems about right.

The LC paper says that the bucket width w is equal to ceil(1/e) where e is the maximum frequency estimation error, and that the maximum number of table entries needed is log(eN)/e after N lexemes have been scanned. For the values of e and N we are going to be dealing with, this is likely to work out to a few times 1/e, in other words the table size is a few times w. (They prove it's at most 7w given reasonable assumptions about data distribution, regardless of how big N gets; though I think our values for N aren't large enough for that to matter.)

The existing compute_minimal_stats code uses a table size of twice the target number of values, so setting w to maybe a half or a third of the target number would reproduce the current space usage. I don't see a problem with letting it get a little bigger though, especially since we can expect that the lexemes aren't very long. (compute_minimal_stats can't assume that for arbitrary data types...)

            regards, tom lane
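As a purely illustrative back-of-the-envelope check, assume a target of 10,000 tracked lexemes (statistics_target = 100 times the proposed constant of 100), take the sample size Jan reported earlier in the thread (30,000 tsvectors at roughly 186 lexemes apiece), and read the paper's log as a natural logarithm:

w = \lceil 1/e \rceil = 10\,000 \;\Rightarrow\; e = 10^{-4}, \qquad
N \approx 30\,000 \times 186 \approx 5.6 \times 10^{6}\ \text{lexemes}, \qquad
\frac{\log(eN)}{e} \approx \frac{\ln 560}{10^{-4}} \approx 6.3 \times 10^{4} \approx 6.3\,w

So for these assumed numbers the table would indeed top out at "a few times w", just under the paper's 7w bound for skewed distributions.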
Tom Lane wrote:
> The way I think it ought to work is that the number of lexemes stored in
> the final pg_statistic entry is statistics_target times a constant
> (perhaps 100). I don't like having it vary depending on tsvector width

I think the existing code puts at most statistics_target elements in a pg_statistic tuple. In compute_minimal_stats() num_mcv starts with stats->attr->attstattarget and is adjusted only downwards.

My original thought was to keep that property for tsvectors (i.e. store at most statistics_target lexemes) and advise people to set it high for their tsvector columns (e.g. 100x their default).

Also, the existing code decides which elements are worth storing as most common ones by discarding those that are not frequent enough (that's where num_mcv can get adjusted downwards). I mimicked that for lexemes but maybe it just doesn't make sense?

> But in any case, given a target number of lexemes to accumulate,
> I'd suggest pruning with that number as the bucket width (pruning
> distance). Or perhaps use some multiple of the target number, but
> the number itself seems about right.

Fine with me, I'm too tired to do the math now, so I'll take your word for it :)

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> Tom Lane wrote:
>> The way I think it ought to work is that the number of lexemes stored in
>> the final pg_statistic entry is statistics_target times a constant
>> (perhaps 100). I don't like having it vary depending on tsvector width

> I think the existing code puts at most statistics_target elements in a
> pg_statistic tuple. In compute_minimal_stats() num_mcv starts with
> stats->attr->attstattarget and is adjusted only downwards.

> My original thought was to keep that property for tsvectors (i.e. store
> at most statistics_target lexemes) and advise people to set it high for
> their tsvector columns (e.g. 100x their default).

Well, (1) the normal measure would be statistics_target *tsvectors*, and we'd have to translate that to lexemes somehow; my proposal is just to use a fixed constant instead of tsvector width as in your original patch. And (2) storing only statistics_target lexemes would be uselessly small and would guarantee that people *have to* set a custom target on tsvector columns to get useful results. Obviously broken defaults are not my bag.

> Also, the existing code decides which elements are worth storing as most
> common ones by discarding those that are not frequent enough (that's
> where num_mcv can get adjusted downwards). I mimicked that for lexemes
> but maybe it just doesn't make sense?

Well, that's not unreasonable either, if you can come up with a reasonable definition of "not frequent enough"; but that adds another variable to the discussion.

            regards, tom lane
On Wed, 9 Jul 2008, Jan Urbański wrote:

> Jan Urbański wrote:
> Do you think it's worthwhile to implement the LC algorithm in C and send it
> out, so others could try it out? Heck, maybe it's worthwhile to replace the
> current compute_minimal_stats() algorithm with LC and see how that compares?

Teodor and I are using LC for phrase estimation in one application, and from our understanding of the original paper this algorithm might not be good for sampling, since all the theory behind it was about streaming the FULL data. As for the technique, we use a suffix tree, which should be fine for a typical sample size.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Tom Lane wrote:
> Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
>> Tom Lane wrote:
> Well, (1) the normal measure would be statistics_target *tsvectors*,
> and we'd have to translate that to lexemes somehow; my proposal is just
> to use a fixed constant instead of tsvector width as in your original
> patch. And (2) storing only statistics_target lexemes would be
> uselessly small and would guarantee that people *have to* set a custom
> target on tsvector columns to get useful results. Obviously broken
> defaults are not my bag.

Fair enough, I'm fine with a multiplication factor.

>> Also, the existing code decides which elements are worth storing as most
>> common ones by discarding those that are not frequent enough (that's
>> where num_mcv can get adjusted downwards). I mimicked that for lexemes
>> but maybe it just doesn't make sense?
>
> Well, that's not unreasonable either, if you can come up with a
> reasonable definition of "not frequent enough"; but that adds another
> variable to the discussion.

The current definition was "with more occurrences than 0.001 of total rows count, but no less than 2". Copied right off compute_minimal_stats(), I have no problem with removing it. I think its point is to guard you against a situation where all elements are more or less unique, and taking the top N would just give you some random noise. It doesn't hurt, so I'd be for keeping the mechanism, but if people feel differently, then I'll just drop it.

-- 
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
Oleg Bartunov wrote:
> On Wed, 9 Jul 2008, Jan Urbański wrote:
>
>> Jan Urbański wrote:
>> Do you think it's worthwhile to implement the LC algorithm in C and
>> send it out, so others could try it out? Heck, maybe it's worthwhile
>> to replace the current compute_minimal_stats() algorithm with LC and
>> see how that compares?
>
> Teodor and I are using LC for phrase estimation in one application, and
> from our understanding of the original paper this algorithm might not
> be good for sampling, since all the theory behind it was about streaming
> the FULL data. As for the technique, we use a suffix tree, which should
> be fine for a typical sample size.

Hm, that's a good point. I'm only reassured by the fact that it yields roughly the same results as the original algorithm, which means that if LC is bad for sampling, then the current implementation is just as bad.

Come to think of it, the current code is in a way a variant of Lossy Counting, it's just doing the pruning after each and every new element, isn't it?

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> Come to think of it, the current code is in a way a variant of Lossy
> Counting, it's just doing the pruning after each and every new element,
> isn't it?

Interesting comment. In LC's terms we have w=1 therefore e=1 therefore the maximum error is as bad as possible?

            regards, tom lane
Tom Lane wrote:
> Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
>> Come to think of it, the current code is in a way a variant of Lossy
>> Counting, it's just doing the pruning after each and every new element,
>> isn't it?
>
> Interesting comment. In LC's terms we have w=1 therefore e=1 therefore
> the maximum error is as bad as possible?

Well, the similarity doesn't go as far as that IMHO. Having a LC algorithm with w = 1 you'd just go about adding one element, and then deleting it, 'cause it has f = 1 and delta = b_current - 1, so f + delta <= b_current. But scanning the list linearly each time *and* keeping it incrementally sorted sure didn't help the performance ;)

BTW: I don't know if you've noticed - I settled for adding elements to a hashtable and sequentially scanning it at prune time, instead of maintaining an additional array of pointers to hashtable entries and qsort()ing it every time a pruning is called for. I only qsort() the pointers for determining the final MCLs (by allocating an array of the same size as the hashtable, copying the pointers and applying qsort()). That saves a lot of headaches and, if I did my calculations right, is asymptotically faster.

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
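A minimal sketch of that final pass (the type and names below are invented for the example; the real code works on hash-table entries rather than a static array): copy pointers to the surviving entries into a scratch array, qsort() it once by descending frequency, and take the first num_mcv slots.

#include <stdlib.h>

typedef struct
{
    const char *lexeme;
    int         f;                      /* estimated frequency */
    int         d;                      /* maximum error */
} LCEntry;

/* qsort comparator: descending by estimated frequency. */
static int
cmp_by_freq_desc(const void *a, const void *b)
{
    const LCEntry *ea = *(LCEntry * const *) a;
    const LCEntry *eb = *(LCEntry * const *) b;

    if (ea->f != eb->f)
        return (eb->f > ea->f) ? 1 : -1;
    return 0;
}

/*
 * Pick the most common entries: one qsort() at the end instead of keeping
 * the tracking structure incrementally sorted.  Returns how many of the
 * scratch slots hold the chosen entries.
 */
static int
pick_most_common(LCEntry **scratch, LCEntry *entries, int n, int num_mcv)
{
    int i;

    for (i = 0; i < n; i++)
        scratch[i] = &entries[i];
    qsort(scratch, n, sizeof(LCEntry *), cmp_by_freq_desc);
    return (num_mcv < n) ? num_mcv : n; /* scratch[0..result-1] are the MCLs */
}

Pruning by a plain sequential scan is O(n) per prune, so the only O(n log n) step left is this single sort at the end, which is where the claimed asymptotic win over sorting at every prune comes from.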
OK, here's the (hopefully final) version of the typanalyze function for tsvectors. It applies to HEAD and passes regression tests.

I now plan to move towards a selectivity function that'll use the gathered statistics.

Thanks to everyone for their reviews and comments,
Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
Attachment
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> OK, here's the (hopefully final) version of the typanalyze function for
> tsvectors. It applies to HEAD and passes regression tests.

> I now plan to move towards a selectivity function that'll use the
> gathered statistics.

Applied with some revisions. Rather than making pg_statistic stakind 4 be specific to tsvector, I thought it'd be better to define it as "most common elements", with the idea that it could be used for array and array-like types as well as tsvector. (I'm not actually planning to go off and make that happen right now, but it seems like a pretty obvious extension.)

I thought it was a bit schizophrenic to repurpose pg_stats.most_common_freqs for element frequencies while creating a separate column for the elements themselves. What I've done for the moment is to define both most_common_vals and most_common_freqs as referring to the elements in the case of tsvector (or anything else that has stakind 4 in place of stakind 1). You could make an argument for inventing *two* new pg_stats columns instead, but I think that is probably overkill; I doubt it'll be useful to have both MCV and MCELEM stats for the same column. This could easily be changed though.

I removed the prune step after the last tsvector. I'm not convinced that the LC algorithm's guarantees still hold if we prune partway through a bucket, and anyway it's far from clear that we'd save enough in the sort step to compensate for more HASH_REMOVE operations. I'm open to being convinced otherwise.

I made some other cosmetic changes, but those were the substantive ones.

            regards, tom lane
Tom Lane wrote:
> Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> > OK, here's the (hopefully final) version of the typanalyze function for
> > tsvectors. It applies to HEAD and passes regression tests.
> >
> > I now plan to move towards a selectivity function that'll use the
> > gathered statistics.
>
> Applied with some revisions.

Should we have a CHECK_FOR_INTERRUPTS() call in the inner loop of compute_tsvector_stats?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Should we have a CHECK_FOR_INTERRUPTS() call in the inner loop of
> compute_tsvector_stats?

Isn't the vacuum_delay_point() good enough?

            regards, tom lane
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Should we have a CHECK_FOR_INTERRUPTS() call in the inner loop of
> > compute_tsvector_stats?
>
> Isn't the vacuum_delay_point() good enough?

But that's in the outer loop ... I mean here:

Index: src/backend/tsearch/ts_typanalyze.c
===================================================================
RCS file: /home/alvherre/Code/cvs/pgsql/src/backend/tsearch/ts_typanalyze.c,v
retrieving revision 1.1
diff -c -p -r1.1 ts_typanalyze.c
*** src/backend/tsearch/ts_typanalyze.c 14 Jul 2008 00:51:45 -0000  1.1
--- src/backend/tsearch/ts_typanalyze.c 14 Jul 2008 04:59:59 -0000
*************** compute_tsvector_stats(VacAttrStats *sta
*** 206,211 ****
--- 206,213 ----
          {
              bool        found;
  
+             CHECK_FOR_INTERRUPTS();
+
              /* Construct a hash key */
              hash_key.lexeme = lexemesptr + curentryptr->pos;
              hash_key.length = curentryptr->len;

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> Isn't the vacuum_delay_point() good enough?

> But that's in the outer loop ... I mean here:

You'd need one heckuva lot of lexemes in a tsvector to make that important. Do we have CHECK_FOR_INTERRUPTS() in any other loops over tsvector contents? I kinda doubt it ...

(I have no real objection to adding CHECK_FOR_INTERRUPTS there, I'm just questioning the value.)

            regards, tom lane
On Mon, 14 Jul 2008, Tom Lane wrote:

> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Tom Lane wrote:
>>> Isn't the vacuum_delay_point() good enough?
>
>> But that's in the outer loop ... I mean here:
>
> You'd need one heckuva lot of lexemes in a tsvector to make that

Tom, I like the word 'heckuva' (in Russian we have something of the sort), but would you please sometimes put a translation into regular English in a footnote, for education :)

> important. Do we have CHECK_FOR_INTERRUPTS() in any other loops
> over tsvector contents? I kinda doubt it ...
>
> (I have no real objection to adding CHECK_FOR_INTERRUPTS there,
> I'm just questioning the value.)
>
>             regards, tom lane
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On Mon, 14 Jul 2008 11:47:17 +0400 (MSD), Oleg Bartunov <oleg@sai.msu.su> wrote:
> On Mon, 14 Jul 2008, Tom Lane wrote:
>
>> Alvaro Herrera <alvherre@commandprompt.com> writes:
>>> Tom Lane wrote:
>>>> Isn't the vacuum_delay_point() good enough?
>>
>>> But that's in the outer loop ... I mean here:
>>
>> You'd need one heckuva lot of lexemes in a tsvector to make that
>
> Tom, I like the word 'heckuva' (in Russian we have something of the sort),
> but would you please sometimes put a translation into regular English in a
> footnote, for education :)
>

You might want to look around on ' http://www.urbandictionary.com/ '. That site is basically one big footnote to the English language :) It is somewhat of a social network/democratic version of the OED.

Of course, you may have to read through several (odd ball) definitions (some specious, at best) and couple that with the context from the original sentence to get a feel for the word as used.

-r
On Mon, 14 Jul 2008, Rudolf Lippan wrote:

> On Mon, 14 Jul 2008 11:47:17 +0400 (MSD), Oleg Bartunov <oleg@sai.msu.su>
> wrote:
>> On Mon, 14 Jul 2008, Tom Lane wrote:
>>
>>> Alvaro Herrera <alvherre@commandprompt.com> writes:
>>>> Tom Lane wrote:
>>>>> Isn't the vacuum_delay_point() good enough?
>>>
>>>> But that's in the outer loop ... I mean here:
>>>
>>> You'd need one heckuva lot of lexemes in a tsvector to make that
>>
>> Tom, I like the word 'heckuva' (in Russian we have something of the sort),
>> but would you please sometimes put a translation into regular English in a
>> footnote, for education :)
>>
>
> You might want to look around on ' http://www.urbandictionary.com/ '. That
> site is basically one big footnote to the English language :) It is
> somewhat of a social network/democratic version of the OED.

Thanks for the link.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83