Thread: gsoc, text search selectivity and dllist enhancments

gsoc, text search selectivity and dllist enhancments

From
Jan Urbański
Date:
Hi,

attached are two patches against HEAD. The smaller one is meant to be
committed - it adds some functions that manipulate doubly-linked lists,
namely inserting a new cell after or before another cell and swapping
two adjacent cells. It felt like being back in the first year of
studies. I hope I didn't mess those pointers up.
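
Just to make the intended behaviour concrete, here's a quick usage sketch
(not part of the patch; it only relies on the pre-existing Dllist API):

#include "lib/dllist.h"

static void
dllist_demo(void)
{
    Dllist     *list = DLNewList();
    Dlelem     *a = DLNewElem("a");
    Dlelem     *b = DLNewElem("b");
    Dlelem     *c = DLNewElem("c");

    DLAddTail(list, a);
    DLAddTail(list, c);         /* list is now: a c */

    DLAddAfter(b, a);           /* a b c */
    DLSwapWithNext(a);          /* b a c */
    DLSwapWithPrevious(c);      /* b c a */
    DLAddBefore(DLNewElem("x"), DLGetHead(list));   /* x b c a */
}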

The gzipped one is WIP for my GSoC project. I've reworked the algorithm
for determining most common lexemes. The goal was to avoid scanning
through all currently kept lexemes in each iteration of the loop that
processes all lexemes from sample tsvectors. Heikki suggested
introducing a hashtable, so I did that. It works, passes regression tests
and correctly identifies most common lexemes in my toy test database.
Of course it's all quite useless without a selectivity function that
could use those statistics. I'm sending it in to maybe get some feedback
during the commit fest.

The second patch depends on the first, and also on the one I sent earlier:
http://archives.postgresql.org/message-id/484418B8.6060004@students.mimuw.edu.pl

I'll add the first one to the commit fest page, and I'm sending it to
-hackers with congratulations on the decision to ditch -patches ;)

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
diff --git a/src/backend/lib/dllist.c b/src/backend/lib/dllist.c
index b771fce..79c402e 100644
*** a/src/backend/lib/dllist.c
--- b/src/backend/lib/dllist.c
***************
*** 212,214 ****
--- 212,336 ----
      l->dll_head = e;
      /* We need not check dll_tail, since there must have been > 1 entry */
  }
+
+ /* Insert a node after the given target node. */
+ void
+ DLAddAfter(Dlelem *e, Dlelem *target)
+ {
+     Dllist    *l = target->dle_list;
+
+     e->dle_list = l;
+     e->dle_prev = target;
+     e->dle_next = target->dle_next;
+
+     if (l->dll_tail != target)
+     {
+         /* Target is not the tail */
+         Assert(target->dle_next != NULL);
+         target->dle_next->dle_prev = e;
+     }
+     else
+     {
+         /* Target is the tail */
+         Assert(target->dle_next == NULL);
+         l->dll_tail = e;
+     }
+     target->dle_next = e;
+     return;
+ }
+
+ /* Insert a node before the given target node. */
+ void
+ DLAddBefore(Dlelem *e, Dlelem *target)
+ {
+     Dllist    *l = target->dle_list;
+
+     e->dle_list = l;
+     e->dle_prev = target->dle_prev;
+     e->dle_next = target;
+
+     if (l->dll_head != target)
+     {
+         /* Target is not the head */
+         Assert(target->dle_prev != NULL);
+         target->dle_prev->dle_next = e;
+     }
+     else
+     {
+         /* Target is the head */
+         Assert(target->dle_prev == NULL);
+         l->dll_head = e;
+     }
+     target->dle_prev = e;
+     return;
+ }
+
+ /* Swap a node with its successor */
+ void
+ DLSwapWithNext(Dlelem *e)
+ {
+     Dllist    *l = e->dle_list;
+     Dlelem    *tmp;
+
+     Assert(e->dle_next != NULL);
+
+     tmp = e->dle_next;
+     e->dle_next = tmp->dle_next;
+     if (l->dll_tail != tmp)
+     {
+         Assert(tmp->dle_next != NULL);
+         tmp->dle_next->dle_prev = e;
+     }
+     else
+     {
+         l->dll_tail = e;
+     }
+     if (l->dll_head != e)
+     {
+         Assert(e->dle_prev != NULL);
+         e->dle_prev->dle_next = tmp;
+     }
+     else
+     {
+         l->dll_head = tmp;
+     }
+     tmp->dle_prev = e->dle_prev;
+     e->dle_prev = tmp;
+     tmp->dle_next = e;
+     return;
+ }
+
+ /* Swap a node with its predecessor */
+ void
+ DLSwapWithPrevious(Dlelem *e)
+ {
+     Dllist    *l = e->dle_list;
+     Dlelem    *tmp;
+
+     Assert(e->dle_prev != NULL);
+
+     tmp = e->dle_prev;
+     e->dle_prev = tmp->dle_prev;
+     if (l->dll_head != tmp)
+     {
+         Assert(tmp->dle_prev != NULL);
+         tmp->dle_prev->dle_next = e;
+     }
+     else
+     {
+         l->dll_head = e;
+     }
+     if (l->dll_tail != e)
+     {
+         Assert(e->dle_next != NULL);
+         e->dle_next->dle_prev = tmp;
+     }
+     else
+     {
+         l->dll_tail = tmp;
+     }
+     tmp->dle_next = e->dle_next;
+     e->dle_next = tmp;
+     tmp->dle_prev = e;
+     return;
+ }
diff --git a/src/include/lib/dllist.h b/src/include/lib/dllist.h
index d754b03..597e71f 100644
*** a/src/include/lib/dllist.h
--- b/src/include/lib/dllist.h
***************
*** 72,77 ****
--- 72,81 ----
  extern Dlelem *DLRemHead(Dllist *list); /* remove and return the head */
  extern Dlelem *DLRemTail(Dllist *list);
  extern void DLMoveToFront(Dlelem *e);    /* move node to front of its list */
+ extern void DLAddAfter(Dlelem *e, Dlelem *target); /* add node after another */
+ extern void DLAddBefore(Dlelem *e, Dlelem *target); /* same, but add before */
+ extern void DLSwapWithNext(Dlelem *e);
+ extern void DLSwapWithPrevious(Dlelem *e);

  /* These are macros for speed */
  #define DLGetHead(list)  ((list)->dll_head)

Attachment

Re: gsoc, text search selectivity and dllist enhancments

From
Jan Urbański
Date:
Jan Urbański wrote:
> I'll add the first one to the commit fest page, and I'm sending it to
> -hackers with congratulations on the decision to ditch -patches ;)

Hm... someone apparently added this mail to the wiki page independently
of me, so there were two duplicate entries. I found the second
description better, so I removed my original entry and left the other one.

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin



Re: gsoc, text search selectivity and dllist enhancments

From
Josh Berkus
Date:
Jan,

> Hm... someone apparently added this mail to the wiki pag  independently
> of me, so there were two duplicate entries. I found the second
> description better, so I removed my original entry and left the other
> one.

Yeah, I've been going through -hackers and -patches and adding stuff to the 
wiki page.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: gsoc, text search selectivity and dllist enhancments

From
Tom Lane
Date:
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> attached are two patches against HEAD. The smaller one is meant to be 
> commited - it adds some functions that manipulate double-linked lists, 
> namely inserting a new cell after or before another cell and swapping 
> two adjacent cells.

> The gzipped one is WIP for my GSoC project. I've reworked the algorithm 
> for determing most common lexemes.

I looked over this a bit.  I'm not excited about adding functionality to
Dllist --- that data structure is barely used at all in the backend,
and I think a better case could be made for getting rid of it than
adding code to it.  The analyze patch doesn't change my mind on the
point, because I don't think that Dllist is really helping you there
anyway.  The data structure I'd suggest is a simple array of pointers
to the underlying hash table entries.  Since you have a predetermined
maximum number of lexemes to track, you can just palloc the array once
--- you don't need the expansibility properties of a list.  The only
operations you need are "add an entry at the end" (if you keep the
array sorted by descending count not ascending), "remove the end
entry", and "swap adjacent entries", all of which are actually cheaper
on an array than on a Dllist.
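
Something along these lines, roughly (just a sketch; the struct and field
names are invented):

#include "postgres.h"

typedef struct LexemeHashEntry LexemeHashEntry;     /* whatever the hash table stores */

typedef struct
{
    LexemeHashEntry **items;    /* sorted by descending count */
    int         nitems;
    int         maxitems;       /* palloc'd once up front */
} TrackArray;

/* add a new count=1 entry at the (least common) end */
static void
track_add(TrackArray *a, LexemeHashEntry *e)
{
    Assert(a->nitems < a->maxitems);
    a->items[a->nitems++] = e;
}

/* discard the least common entry on overflow */
static void
track_remove_last(TrackArray *a)
{
    a->nitems--;
}

/* move an entry up once its count has grown past its predecessor's */
static void
track_swap_with_prev(TrackArray *a, int i)
{
    LexemeHashEntry *tmp = a->items[i - 1];

    a->items[i - 1] = a->items[i];
    a->items[i] = tmp;
}

All three operations are O(1), which is the point.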

Another point is that you don't really need the array to be sorted all
the time.  Instead of using what is basically an O(N^2) incremental
sort, you could consider applying qsort() when you do need it to be
sorted, which is at the end or when the table overflows and you need to
discard some entries.  If you are willing to discard multiple entries
per overflow event, this could be quite cheap --- although I think in
the worst case where there are many overflows, it'd go back to being
O(N^2).  Note BTW that adding a count=1 entry at the end cannot make the
array unsorted if it was sorted before.  The only event that renders the
array unsorted is increasing an item's count to more than the count of
its predecessor --- so it might be worth keeping a predecessor pointer
in each hashtable entry that identifies its predecessor as of the time
of the last array sort, so you could check on-the-fly and avoid
unnecessary re-sorts.  I'm not totally sure that this idea is a win,
but it seems worth investigating.
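
I.e., something like this (again a sketch with invented names):

#include "postgres.h"

typedef struct LexemeHashEntry
{
    /* ... hash key etc. ... */
    int         count;
    struct LexemeHashEntry *sorted_pred;    /* predecessor as of last qsort() */
} LexemeHashEntry;

static bool needs_resort = false;

static void
bump_count(LexemeHashEntry *e)
{
    e->count++;
    if (e->sorted_pred != NULL && e->count > e->sorted_pred->count)
        needs_resort = true;    /* only now would a real re-sort be required */
}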

Other than that, the analyze patch looked generally sane to me,
and I think you're on the right track.  Please rework and resubmit.
        regards, tom lane


Re: gsoc, text search selectivity and dllist enhancments

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> The data structure I'd suggest is a simple array of pointers
> to the underlying hash table entries.  Since you have a predetermined
> maximum number of lexemes to track, you can just palloc the array once
> --- you don't need the expansibility properties of a list. 

The number of lexemes isn't predetermined. It's 2 * (longest tsvector 
seen so far), and we don't know beforehand how long the longest tsvector is.

repalloc()ing shouldn't be a problem, though.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: gsoc, text search selectivity and dllist enhancments

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> The data structure I'd suggest is a simple array of pointers
>> to the underlying hash table entries.  Since you have a predetermined
>> maximum number of lexemes to track, you can just palloc the array once
>> --- you don't need the expansibility properties of a list. 

> The number of lexemes isn't predetermined. It's 2 * (longest tsvector 
> seen so far), and we don't know beforehand how long the longest tsvector is.

Hmm, I had just assumed without looking too closely that it was stats
target times a fudge factor.  What is the rationale for doing it as
above?  I don't think I like the idea of the limit varying over the
course of the scan --- that means that lexemes in different places
in the input will have significantly different probabilities of
surviving to the final result.
        regards, tom lane


Re: gsoc, text search selectivity and dllist enhancments

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> The data structure I'd suggest is a simple array of pointers
>>> to the underlying hash table entries.  Since you have a predetermined
>>> maximum number of lexemes to track, you can just palloc the array once
>>> --- you don't need the expansibility properties of a list. 
> 
>> The number of lexemes isn't predetermined. It's 2 * (longest tsvector 
>> seen so far), and we don't know beforehand how long the longest tsvector is.
> 
> Hmm, I had just assumed without looking too closely that it was stats
> target times a fudge factor.  What is the rationale for doing it as
> above?  I don't think I like the idea of the limit varying over the
> course of the scan --- that means that lexemes in different places
> in the input will have significantly different probabilities of
> surviving to the final result.

Well, clearly if the list is smaller than the longest tsvector, 
inserting all elements of that long tsvector will flush out all other 
entries. Or if we throw away the newly inserted entries, some elements 
will never have a chance to climb up the list. I'm not sure where the 
"times two" figure comes from, maybe it's just a fudge factor, but the 
bottom line is that the minimum size needed depends on the size of the 
longest tsvector.

(Jan is offline until Saturday...)

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: gsoc, text search selectivity and dllist enhancments

From
Jan Urbański
Date:
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>>> Tom Lane wrote:
>> Hmm, I had just assumed without looking too closely that it was stats
>> target times a fudge factor.  What is the rationale for doing it as
>> above?  I don't think I like the idea of the limit varying over the
>> course of the scan --- that means that lexemes in different places
>> in the input will have significantly different probabilities of
>> surviving to the final result.

> will never have a chance to climb up the list. I'm not sure where the 
> "times two" figure comes from, maybe it's just a fudge factor, but the 
> bottom line is that the minimum size needed depends on the size of the 
> longest tsvector.

Thank you for the review.

The whole algorithm was based on the original analyze routine from 
analyze.c (compute_minimal_stats()). The "times two" factor is there; I 
think its purpose is to reduce the chances of an element dropping off 
the list before its number of occurrences gets boosted. It doesn't look 
like a factor of 1.5 or 3 would be much better/worse, I just kept it as 
I saw it.

It's true that making the limit change in the course of the scan makes 
the set of surviving lexemes dependent on the order in which entries are 
fetched. I'm inclined to believe that most of the time it won't change 
the overall result - some entries will drop off earlier, some later, 
some will be discarded when it comes to the final pass at the end of the 
algorithm. Maybe I should do a test: run ANALYZE several times 
over a large set of documents and see if the result stays the same?

Since the number of tracked lexemes needs to be correlated with the 
length of the longest tsvector, the only possibility I see is doing 
two passes through the statistics sample (determine the length first, 
then do the real work). I wasn't sure if rows fetched by the sampling 
algorithm are guaranteed to be in memory. If so, then another pass is 
just a linear cost (and asymptotically insignificant).

About the array vs. dllist issue. If the "two passes" idea is better than 
the "dynamically expanding list" one, then an array is of course better. I 
thought about using a one-way list, but having it two-way helps with 
inserting an element at the end of the 1-occurrence group (and I 
think that's important, read on) and with the final pass, in which you 
go through the list backwards and stop when you get enough entries to 
fill the pg_statistic tuple.
Maybe these could be dealt with efficiently using a one-way list, but it was 
easier to do with a double-linked one.

Keeping the list sorted at all times has an advantage: if you insert new 
entries at the *end* of the 1-occurrence block and have the ones at the 
beginning fall off, you are discarding elements that haven't been seen 
for the longest time.
Consider a list:
  cat:1  dog:1  duck:2  rhino:3
If the next lexeme is "goat" and you put it at the beginning of the 
list (or put it anywhere and qsort() it), you could get:
  goat:1  cat:1  dog:1  duck:2  rhino:3
And if that triggered an overflow, you could lose the goat, when 
you'd want to lose the cat.

Once again: the whole algorithm is a ripoff from analyze.c, with the 
dllist instead of an array because you don't know how big a tracking list 
you will need, and with a hashtable because the tracking list will 
probably be large and looking up things linearly wouldn't be fun.

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


Re: gsoc, text search selectivity and dllist enhancments

From
Tom Lane
Date:
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> Once again: the whole algorithm is a ripoff from analyze.c, with the 
> dllist instead of an array because you don't know how big tracking list 
> will you need and with a hashtable, because the tracking list will 
> probably be large and looking up things linearly wouldn't be fun.

Hmmm ... I had forgotten how compute_minimal_stats() works, and looking
at it now I'm just itching to rewrite it.  You have to realize that that
code was written with the assumptions that (1) it was only going to be
used for a few weird second-class-citizen data types, and (2) the stats
target would typically be around 10 or so.  It really wasn't designed to
be industrial-strength code.  (It was still a big improvement over what
we'd had before :-(.)  So I'm not very comfortable with taking it as the
design base for something that does need to be industrial-strength.

Your point about not wanting the most-recently-added lexeme to drop off
first is a good one, but the approach is awfully fragile.  Consider what
happens if (or when) all the lexemes in the list reach count 2 or more.
All other lexemes seen later will fight over the last list slot, and
have no chance of getting higher in the list; so the algorithm will
completely fail to adapt to any changes in the input statistics that
occur after that point is reached.  I'm thinking that the idea of
periodically cutting off a bunch of low-scoring items, instead of trying
to do it "exactly", would actually be better because it'd still have a
chance of tracking new data even when the counts get larger.

I don't recommend doing two passes over the data because it's fairly
likely that tsvectors would be large enough to be toasted, which'd make
fetching them expensive.  One idea is to start out with some reasonable
estimate of the max lexemes per tsvector (say a couple hundred) and
realloc the list bigger in sizable jumps (say 2X) when the estimate is
seen to be exceeded.  Another possibility is to have a prescan that only
determines the width of the physically largest tsvector (you can get the
width without detoasting, so this would be quite cheap), and then
estimate the number of lexemes corresponding to that using some
fudge-factor guess about lexeme sizes, and then stick with the resulting
list size regardless of what the actual lexeme counts are.  I kinda like
the latter since its behavior is order-independent.
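
The latter could be as little as something like this (a sketch;
AVG_LEXEME_WIDTH is a pure fudge-factor guess and the names are invented):

#include "postgres.h"
#include "access/tuptoaster.h"

#define AVG_LEXEME_WIDTH 8      /* guessed bytes per lexeme, incl. overhead */

static int
estimate_track_size(Datum *values, bool *nulls, int nrows)
{
    Size        maxwidth = 0;
    int         i;

    for (i = 0; i < nrows; i++)
    {
        if (nulls[i])
            continue;
        /* toast_raw_datum_size() gives the width without detoasting */
        maxwidth = Max(maxwidth, toast_raw_datum_size(values[i]));
    }

    /* keep the 2x factor from the existing patch */
    return (int) (2 * maxwidth / AVG_LEXEME_WIDTH);
}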
        regards, tom lane


Re: gsoc, text search selectivity and dllist enhancments

From
Jan Urbański
Date:
Tom Lane wrote:
> target would typically be around 10 or so.  It really wasn't designed to
> be industrial-strength code.  (It was still a big improvement over what
> we'd had before :-(.)  So I'm not very comfortable with taking it as the
> design base for something that does need to be industrial-strength.

I did some googling and found a paper describing a method, called Lossy 
Counting, of determining the most frequent elements in a stream:
http://www.vldb.org/conf/2002/S10P03.pdf
The following is an adaptation of an algorithm described in section 4.2 
of that paper. I'll summarize it and propose an implementation using 
existing PG data structures.

The data structure (called D) is a set of entries in the form (e, f, d), 
where e is the lexeme, f is an integer representing the estimated 
frequency and d is the maximum possible error in f. We process tsvectors 
in batches. After each w tsvectors we will be pruning our data 
structure, deleting entries that are not frequent enough. Let b be the 
number of the current batch (tsvectors 1 to w are processed with b = 1, 
tsvectors w+1 to w*2 are processed with b = 2 and so on). Observe that w 
is an arbitrary number.
Initially, D is empty. For each lexeme e we look up its entry in D. If 
it's found, we increment f in that entry. If not, we add a new entry in 
the form of (e, 1, b - 1).
After each w tsvectors we delete all entries from D that have f + d <= b.

For example, consider such a list of tsvectors (numbers stand for 
lexemes, each line is a tsvector):
1 2 3
2 3 4 5
1 2
1 3 4
2 3 4 5
2 4 5 6

Let w = 3.
After processing the first tsvector D would look like so:
(1, 1, 0), (2, 1, 0), (3, 1, 0)
After processing the second tsvector D would be:
(1, 1, 0), (2, 2, 0), (3, 2, 0), (4, 1, 0), (5, 1, 0)
After the third tsvector we'd have:
(1, 2, 0), (2, 3, 0), (3, 2, 0), (4, 1, 0), (5, 1, 0)

And now we'd do the pruning. b = 1, so all entries with f + d <= 1 are 
discarded. D is then:
(1, 2, 0), (2, 3, 0), (3, 2, 0)

After processing the fourth tsvector we get:
(1, 3, 0), (2, 3, 0), (3, 3, 0), (4, 1, 1)
And then:
(1, 3, 0), (2, 4, 0), (3, 4, 0), (4, 2, 1), (5, 1, 1)
And finally:
(1, 3, 0), (2, 5, 0), (3, 4, 0), (4, 3, 1), (5, 2, 1), (6, 1, 1)

Now comes the time for the second pruning. b = 2, so D is left with:
(1, 3, 0), (2, 5, 0), (3, 4, 0), (4, 3, 1), (5, 2, 1)

Finally we return entries with the largest values of f, their number 
depending on statistics_target. That ends the algorithm.
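
To make the bookkeeping concrete, here's a toy, self-contained C sketch of
the above (lexemes as plain ints and a linear-scan "table" instead of a real
hashtable) that reproduces the example, with w = 3:

#include <stdio.h>

#define W 3
#define MAXENTRIES 64

typedef struct { int e, f, d; } Entry;

static Entry D[MAXENTRIES];
static int   n = 0;

static void
add_lexeme(int e, int b)
{
    int     i;

    for (i = 0; i < n; i++)
        if (D[i].e == e)
        {
            D[i].f++;           /* seen before: bump f */
            return;
        }
    D[n].e = e;                 /* new entry: (e, 1, b - 1) */
    D[n].f = 1;
    D[n].d = b - 1;
    n++;
}

static void
prune(int b)
{
    int     i, j = 0;

    for (i = 0; i < n; i++)
        if (D[i].f + D[i].d > b)    /* entries with f + d <= b are dropped */
            D[j++] = D[i];
    n = j;
}

int
main(void)
{
    int     tsvectors[6][4] = {
        {1, 2, 3}, {2, 3, 4, 5}, {1, 2}, {1, 3, 4}, {2, 3, 4, 5}, {2, 4, 5, 6}
    };
    int     lens[6] = {3, 4, 2, 3, 4, 4};
    int     b = 1, i, j;

    for (i = 0; i < 6; i++)
    {
        for (j = 0; j < lens[i]; j++)
            add_lexeme(tsvectors[i][j], b);
        if ((i + 1) % W == 0)
            prune(b++);         /* end of batch: prune, then advance b */
    }

    /* prints (1,3,0) (2,5,0) (3,4,0) (4,3,1) (5,2,1), as in the example */
    for (i = 0; i < n; i++)
        printf("(%d, %d, %d)\n", D[i].e, D[i].f, D[i].d);
    return 0;
}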

The original procedure from the paper prunes the entries after each 
"element batch" is processed, but the batches are of equal size. With 
tsvectors it seems sensible to divide lexemes into batches of different 
sizes, so the pruning always takes place after all the lexemes from a 
tsvector have been processed. I'm not totally sure that change doesn't 
destroy some properties of the algorithm, but it looks reasonably sane.

As to the implementation: as Tom suggested, we could use a hashtable 
as D and an array of pointers to the hashtable entries, which would get 
updated each time a new hashtable entry is added. Pruning would 
then require a qsort() of that array, taking (f + d) as the sort key, but 
other than that we could just add new entries at the end of the array. 
An open question is: what size should that array be? I think that by 
knowing how long the longest tsvector is we could come up with a good 
estimate, but this needs to be thought over. Anyway, we can always 
repalloc().

Getting the max tsvector length by looking at the datum widths would be 
a neat trick. I just wonder: if we have to detoast all these tsvectors 
anyway, is it not possible to do that and do two passes over already 
detoasted datums? I've got to admit I don't really understand how 
toasting works, so I'm probably completely wrong on that.

Does this algorithm sound sensible? I might not have understood some 
things from that publication; they give some proofs of why their 
algorithm yields good results and takes up little space, but I'm not 
all that certain it can be so easily adapted.

If you think the Lossy Counting method has potential, I could test it 
somehow. Using my current work I could extract a stream of lexemes as 
ANALYZE sees it and run it through a python implementation of the 
algorithm to see if the result makes sense.

This is getting more complicated, than I thought it would :^)

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


Re: gsoc, text search selectivity and dllist enhancments

From
Jan Urbański
Date:
Jan Urbański wrote:
> If you think the Lossy Counting method has potential, I could test it
> somehow. Using my current work I could extract a stream of lexemes as
> ANALYZE sees it and run it through a python implementation of the
> algorithm to see if the result makes sense.

I hacked together a simplistic python implementation and ran it on a
table with 244901 tsvectors, 45624891 lexemes total. I was comparing
results from my current approach with the results I'd get from a Lossy
Counting algorithm.
I experimented with statistics_target set to 10 and 100, and ran pruning
in the LC algorithm every 3, 10 or 100 tsvectors.
The sample size with statistics_target set to 100 was 30000 rows and
that's what the input to the script was - lexemes from these 30000
tsvectors.
I found out that with pruning happening every 10 tsvectors I got
precisely the same results as with the original algorithm (same most
common lexemes, same frequencies). When I tried pruning after every 100
tsvectors the results changed very slightly (they were a tiny bit more
distant from the ones from the original algorithm, and I think a tiny
bit more precise, but I didn't give it much attention).

Bottom line seems to be: the Lossy Counting algorithm gives roughly the
same results as the algorithm used currently and is also possibly faster
(and more scalable wrt. statistics_target).

This should probably get more testing than just running some script 5
times over a fixed set of data, but I already had trouble sucking ~300
MB of tsvectors from one of my production sites, putting it on my laptop
and so on.
Do you think it's worthwhile to implement the LC algorithm in C and send
it out, so others could try it out? Heck, maybe it's worthwhile to
replace the current compute_minimal_stats() algorithm with LC and see
how that compares?

Anyway, I can share the python script if someone would like to do some
more tests (I suppose no-one would,  'cause you first need to apply my
ts_typanalyze patch and then change it some more to extract lexemes from
the sample).

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin



Re: gsoc, text search selectivity and dllist enhancments

From
Tom Lane
Date:
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> Do you think it's worthwhile to implement the LC algorithm in C and send 
> it out, so others could try it out? Heck, maybe it's worthwhile to 
> replace the current compute_minimal_stats() algorithm with LC and see 
> how that compares?

Very possibly.  I repeat that the current implementation of
compute_minimal_stats is very ad-hoc code and wasn't written with an eye
to high performance.  Replacing it with an algorithm that someone
actually thought about might well be worth doing.
        regards, tom lane


Re: gsoc, text search selectivity and dllist enhancments

From
Jan Urbański
Date:
Tom Lane wrote:
> =?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= <j.urbanski@students.mimuw.edu.pl> writes:
>> Do you think it's worthwhile to implement the LC algorithm in C and send
>> it out, so others could try it out? Heck, maybe it's worthwhile to
>> replace the current compute_minimal_stats() algorithm with LC and see
>> how that compares?
>
> Very possibly.  I repeat that the current implementation of
> compute_minimal_stats is very ad-hoc code and wasn't written with an eye
> to high performance.  Replacing it with an algorithm that someone
> actually thought about might well be worth doing.

Here's a patch that combines both patches included here:
http://archives.postgresql.org/message-id/48649261.5040703@students.mimuw.edu.pl
and adds a C implementation of the Lossy Counting algorithm.

It defines two typanalyze functions: ts_typanalyze_std and
ts_typanalyze_lc, so you can see what statistics are gathered by each of
them. It's meant for easy application to HEAD, updating pg_type and running
ANALYZE on a few tables with tsvectors (i.e. testing, not committing).

My observations are: the LC algorithm beats the previous one by a fairly
large margin (20-30%) timewise. The results are almost identical; I got
discrepancies of about 0.05 for some lexemes' frequencies. I intend to
stick with LC for tsvectors, and that'll allow me to throw away the Dllist
changes.

If I want to keep my GSoC schedule I won't be able to implement LC for
general statistics gathering, but it's trivial. If no one gets around to it
I can do it after the Summer of Code (if only to see how it'll work).

Oh, one important thing. You need to choose a bucket width for the LC
algorithm, that is, decide after how many elements you will prune your
data structure. I chose to prune after every twenty tsvectors. You might
consider:
  - picking some other arbitrary value
  - making it depend on the largest tsvector size
  - making it depend on statistics_target
  - pruning after every X lexemes instead of after every Y tsvectors,
because as it is now the buckets vary in width and you can argue that the
order of input makes a difference again. OTOH the situation here is a
bit different: you get streams of mutually different elements (lexemes
inside a tsvector are all different) and pruning in the middle of such a
stream might be unfair to lexemes that are still to be processed. Hmm,
dunno.

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

Attachment

Re: gsoc, text search selectivity and dllist enhancments

From
Alvaro Herrera
Date:
Jan Urbański wrote:

> Oh, one important thing. You need to choose a bucket width for the LC  
> algorithm, that is decide after how many elements will you prune your  
> data structure. I chose to prune after every twenty tsvectors.

Do you prune after X tsvectors regardless of the number of lexemes in
them?  I don't think that preserves the algorithm properties; if there's
a bunch of very short tsvectors and then long tsvectors, the pruning
would take place too early for the initial lexemes.  I think you should
count lexemes, not tsvectors.


-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: gsoc, text search selectivity and dllist enhancments

From
Jan Urbański
Date:
Alvaro Herrera wrote:
> Jan Urbański wrote:
>
>> Oh, one important thing. You need to choose a bucket width for the LC
>> algorithm, that is decide after how many elements will you prune your
>> data structure. I chose to prune after every twenty tsvectors.
>
> Do you prune after X tsvectors regardless of the numbers of lexemes in
> them?  I don't think that preserves the algorithm properties; if there's
> a bunch of very short tsvectors and then long tsvectors, the pruning
> would take place too early for the initial lexemes.  I think you should
> count lexemes, not tsvectors.

Yes, that's what I was afraid of. I'm not sure why I was reluctant to
prune in the middle of a tsvector, maybe it's just in my head.
Still, there's a decision to be made: after how many lexemes should the
pruning occur?

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin



Re: gsoc, text search selectivity and dllist enhancments

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Jan Urbański wrote:
>> Oh, one important thing. You need to choose a bucket width for the LC  
>> algorithm, that is decide after how many elements will you prune your  
>> data structure. I chose to prune after every twenty tsvectors.

> Do you prune after X tsvectors regardless of the numbers of lexemes in
> them?  I don't think that preserves the algorithm properties; if there's
> a bunch of very short tsvectors and then long tsvectors, the pruning
> would take place too early for the initial lexemes.  I think you should
> count lexemes, not tsvectors.

Yeah.  I haven't read the Lossy Counting paper in detail yet, but I
suspect that the mathematical proof of limited error doesn't work if the
pruning is done on a variable spacing.  I don't see anything very wrong
with pruning intra-tsvector; the effects ought to average out, since the
point where you prune is going to move around with respect to the
tsvector boundaries.
        regards, tom lane


Re: gsoc, text search selectivity and dllist enhancments

From
Tom Lane
Date:
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> Still, there's a decision to be made: after how many lexemes should the 
> pruning occur?

The way I think it ought to work is that the number of lexemes stored in
the final pg_statistic entry is statistics_target times a constant
(perhaps 100).  I don't like having it vary depending on tsvector width
--- why for example should a column having a few wide tsvectors get a
bigger stats entry than one with many narrow ones?  (Not to mention the
issue of having to estimate the average or max width before you can
start the counting run.)

But in any case, given a target number of lexemes to accumulate,
I'd suggest pruning with that number as the bucket width (pruning
distance).   Or perhaps use some multiple of the target number, but
the number itself seems about right.  The LC paper says that the
bucket width w is equal to ceil(1/e) where e is the maximum frequency
estimation error, and that the maximum number of table entries needed
is log(eN)/e after N lexemes have been scanned.  For the values of e
and N we are going to be dealing with, this is likely to work out to
a few times 1/e, in other words the table size is a few times w.
(They prove it's at most 7w given reasonable assumptions about data
distribution, regardless of how big N gets; though I think our values
for N aren't large enough for that to matter.)
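
To put rough numbers on that (these are only guesses: w = 10000 as the
target lexeme count, N around 5.6 million as roughly the number of lexemes
in the 30000-row sample mentioned upthread, and "log" taken as natural log):

#include <stdio.h>
#include <math.h>

int
main(void)
{
    double      w = 10000.0;        /* bucket width = target lexeme count */
    double      e = 1.0 / w;        /* max frequency estimation error */
    double      N = 5.6e6;          /* lexemes scanned in the sample */
    double      max_entries = log(e * N) / e;

    /* prints e = 0.0001 and a bound of about 63000, i.e. roughly 6.3 * w */
    printf("e = %g, max table entries ~ %.0f (about %.1f * w)\n",
           e, max_entries, max_entries / w);
    return 0;
}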

The existing compute_minimal_stats code uses a table size of twice the
target number of values, so setting w to maybe a half or a third of the
target number would reproduce the current space usage.  I don't see a
problem with letting it get a little bigger though, especially since we
can expect that the lexemes aren't very long.  (compute_minimal_stats
can't assume that for arbitrary data types...)
        regards, tom lane


Re: gsoc, text search selectivity and dllist enhancments

From
Jan Urbański
Date:
Tom Lane wrote:
> The way I think it ought to work is that the number of lexemes stored in
> the final pg_statistic entry is statistics_target times a constant
> (perhaps 100).  I don't like having it vary depending on tsvector width

I think the existing code puts at most statistics_target elements in a 
pg_statistic tuple. In compute_minimal_stats() num_mcv starts with 
stats->attr->attstattarget and is adjusted only downwards.
My original thought was to keep that property for tsvectors (i.e. store 
at most statistics_target lexemes) and advise people to set it high for 
their tsvector columns (e.g. 100x their default).
Also, the existing code decides which elements are worth storing as most 
common ones by discarding those that are not frequent enough (that's 
where num_mcv can get adjusted downwards). I mimicked that for lexemes 
but maybe it just doesn't make sense?

> But in any case, given a target number of lexemes to accumulate,
> I'd suggest pruning with that number as the bucket width (pruning
> distance).   Or perhaps use some multiple of the target number, but
> the number itself seems about right. 

Fine with me, I'm too tired to do the math now, so I'll take your word 
for it :)

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


Re: gsoc, text search selectivity and dllist enhancments

From
Tom Lane
Date:
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> Tom Lane wrote:
>> The way I think it ought to work is that the number of lexemes stored in
>> the final pg_statistic entry is statistics_target times a constant
>> (perhaps 100).  I don't like having it vary depending on tsvector width

> I think the existing code puts at most statistics_target elements in a 
> pg_statistic tuple. In compute_minimal_stats() num_mcv starts with 
> stats->attr->attstattarget and is adjusted only downwards.
> My original thought was to keep that property for tsvectors (i.e. store 
> at most statistics_target lexemes) and advise people to set it high for 
> their tsvector columns (e.g. 100x their default).

Well, (1) the normal measure would be statistics_target *tsvectors*,
and we'd have to translate that to lexemes somehow; my proposal is just
to use a fixed constant instead of tsvector width as in your original
patch.  And (2) storing only statistics_target lexemes would be
uselessly small and would guarantee that people *have to* set a custom
target on tsvector columns to get useful results.  Obviously broken
defaults are not my bag.

> Also, the existing code decides which elements are worth storing as most 
> common ones by discarding those that are not frequent enough (that's 
> where num_mcv can get adjusted downwards). I mimicked that for lexemes 
> but maybe it just doesn't make sense?

Well, that's not unreasonable either, if you can come up with a
reasonable definition of "not frequent enough"; but that adds another
variable to the discussion.
        regards, tom lane


Re: gsoc, text search selectivity and dllist enhancments

From
Oleg Bartunov
Date:
On Wed, 9 Jul 2008, Jan Urbański wrote:

> Jan Urbański wrote:


> Do you think it's worthwhile to implement the LC algorithm in C and send it 
> out, so others could try it out? Heck, maybe it's worthwhile to replace the 
> current compute_minimal_stats() algorithm with LC and see how that compares?

Teodor and I are using LC for phrase estimation in one application, and 
from our understanding of the original paper this algorithm might not be
good for sampling, since all the theory behind it was about streaming the 
FULL data. As for the technique, we use a suffix tree, which should be fine
for typical sample sizes.
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: gsoc, text search selectivity and dllist enhancments

From
Jan Urbański
Date:
Tom Lane wrote:
> Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
>> Tom Lane wrote:
> Well, (1) the normal measure would be statistics_target *tsvectors*,
> and we'd have to translate that to lexemes somehow; my proposal is just
> to use a fixed constant instead of tsvector width as in your original
> patch.  And (2) storing only statistics_target lexemes would be
> uselessly small and would guarantee that people *have to* set a custom
> target on tsvector columns to get useful results.  Obviously broken
> defaults are not my bag.

Fair enough, I'm fine with a multiplication factor.

>> Also, the existing code decides which elements are worth storing as most 
>> common ones by discarding those that are not frequent enough (that's 
>> where num_mcv can get adjusted downwards). I mimicked that for lexemes 
>> but maybe it just doesn't make sense?
> 
> Well, that's not unreasonable either, if you can come up with a
> reasonable definition of "not frequent enough"; but that adds another
> variable to the discussion.

The current definition was "with more occurrences than 0.001 of the total 
row count, but no less than 2". Copied right off 
compute_minimal_stats(); I have no problem with removing it. I think its 
point is to guard you against a situation where all elements are more or 
less unique, and taking the top N would just give you some random noise. 
It doesn't hurt, so I'd be for keeping the mechanism, but if people feel 
differently, then I'll just drop it.

-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


Re: gsoc, text search selectivity and dllist enhancments

From
Jan Urbański
Date:
Oleg Bartunov wrote:
> On Wed, 9 Jul 2008, Jan Urbański wrote:
> 
>> Jan Urbański wrote:
>> Do you think it's worthwhile to implement the LC algorithm in C and 
>> send it out, so others could try it out? Heck, maybe it's worthwhile 
>> to replace the current compute_minimal_stats() algorithm with LC and 
>> see how that compares?
> 
> I and Teodor are using LC for phrase estimation in one application and 
> from our understanding of the original paper this algorithm might be
> not good for sampling, since all theory behind was about streaming of 
> FULL data. As for technique we use suffix tree, which should be fine for
> typical sample size.

Hm, that's a good point. I'm only reassured by the fact that it yields 
roughly the same results as the original algorithm, which means that if 
LC is bad for sampling, then the current implementation is just as bad. 
Come to think of it, the current code is in a way a variant of Lossy 
Counting; it's just doing the pruning after each and every new element, 
isn't it?

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


Re: gsoc, text search selectivity and dllist enhancments

From
Tom Lane
Date:
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> Come to think of it, the current code is in a way a variant of Lossy 
> Counting, it's just doing the pruning after each and every new element, 
> isn't it?

Interesting comment.  In LC's terms we have w=1 therefore e=1 therefore
the maximum error is as bad as possible?
        regards, tom lane


Re: gsoc, text search selectivity and dllist enhancments

From
Jan Urbański
Date:
Tom Lane wrote:
> Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
>> Come to think of it, the current code is in a way a variant of Lossy 
>> Counting, it's just doing the pruning after each and every new element, 
>> isn't it?
> 
> Interesting comment.  In LC's terms we have w=1 therefore e=1 therefore
> the maximum error is as bad as possible?

Well, the similarity doesn't go as far as that IMHO. With an LC 
algorithm with w = 1 you'd just go about adding one element and then 
deleting it, 'cause it has f = 1 and delta = b_current - 1, so f + delta 
<= b_current.

But scanning the list linearly each time *and* keeping it incrementally 
sorted sure didn't help the performance ;)

BTW: I don't know if you've noticed - I settled on adding elements to a 
hashtable and sequentially scanning it at prune time, instead of 
maintaining an additional array of pointers to hashtable entries and 
qsort()ing it every time a pruning is called for. I only qsort() the 
pointers for determining the final MCLs (by allocating an array of the 
same size as the hashtable, copying the pointers and applying qsort()). 
That saves a lot of headaches and, if I did my calculations right, is 
asymptotically faster.
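
In case that's unclear, the prune step boils down to roughly this (a
sketch; the struct and field names are approximate, not necessarily what's
in the patch):

#include "postgres.h"
#include "utils/hsearch.h"

typedef struct
{
    char       *lexeme;         /* pointer into the tsvector's lexeme data */
    int         length;
} LexemeHashKey;

typedef struct
{
    LexemeHashKey key;          /* hash key - must be first */
    int         frequency;      /* 'f' from the LC algorithm */
    int         delta;          /* 'd' */
} TrackItem;

static void
prune_lexemes_hashtable(HTAB *lexemes_tab, int b_current)
{
    HASH_SEQ_STATUS scan_status;
    TrackItem  *item;

    hash_seq_init(&scan_status, lexemes_tab);
    while ((item = (TrackItem *) hash_seq_search(&scan_status)) != NULL)
    {
        if (item->frequency + item->delta <= b_current)
        {
            /*
             * Too infrequent: drop the entry.  Deleting the just-returned
             * element is safe during a dynahash sequential scan.
             */
            if (hash_search(lexemes_tab, (void *) &item->key,
                            HASH_REMOVE, NULL) == NULL)
                elog(ERROR, "hash table corrupted");
        }
    }
}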

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


Re: gsoc, text search selectivity and dllist enhancments

From
Jan Urbański
Date:
OK, here's the (hopefully final) version of the typanalyze function for
tsvectors. It applies to HEAD and passes regression tests.

I now plan to move towards a selectivity function that'll use the
gathered statistics.

Thanks to everyone for their reviews and comments,

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

Attachment

Re: gsoc, text search selectivity and dllist enhancments

From
Tom Lane
Date:
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> OK, here's the (hopefully final) version of the typanalyze function for 
> tsvectors. It applies to HEAD and passes regression tests.

> I now plan to move towards a selectivity function that'll use the 
> gathered statistics.

Applied with some revisions.

Rather than making pg_statistic stakind 4 be specific to tsvector,
I thought it'd be better to define it as "most common elements", with
the idea that it could be used for array and array-like types as well as
tsvector.  (I'm not actually planning to go off and make that happen
right now, but it seems like a pretty obvious extension.)

I thought it was a bit schizophrenic to repurpose
pg_stats.most_common_freqs for element frequencies while creating a
separate column for the elements themselves.  What I've done for the
moment is to define both most_common_vals and most_common_freqs as
referring to the elements in the case of tsvector (or anything else that
has stakind 4 in place of stakind 1).  You could make an argument for
inventing *two* new pg_stats columns instead, but I think that is
probably overkill; I doubt it'll be useful to have both MCV and MCELEM
stats for the same column.  This could easily be changed though.

I removed the prune step after the last tsvector.  I'm not convinced
that the LC algorithm's guarantees still hold if we prune partway
through a bucket, and anyway it's far from clear that we'd save enough
in the sort step to compensate for more HASH_REMOVE operations.  I'm
open to being convinced otherwise.

I made some other cosmetic changes, but those were the substantive ones.
        regards, tom lane


Re: gsoc, text search selectivity and dllist enhancments

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> > OK, here's the (hopefully final) version of the typanalyze function for 
> > tsvectors. It applies to HEAD and passes regression tests.
> 
> > I now plan to move towards a selectivity function that'll use the 
> > gathered statistics.
> 
> Applied with some revisions.

Should we have a CHECK_FOR_INTERRUPTS() call in the inner loop of
compute_tsvector_stats?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: gsoc, text search selectivity and dllist enhancments

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Should we have a CHECK_FOR_INTERRUPTS() call in the inner loop of
> compute_tsvector_stats?

Isn't the vacuum_delay_point() good enough?
        regards, tom lane


Re: gsoc, text search selectivity and dllist enhancments

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Should we have a CHECK_FOR_INTERRUPTS() call in the inner loop of
> > compute_tsvector_stats?
> 
> Isn't the vacuum_delay_point() good enough?

But that's in the outer loop ... I mean here:

Index: src/backend/tsearch/ts_typanalyze.c
===================================================================
RCS file: /home/alvherre/Code/cvs/pgsql/src/backend/tsearch/ts_typanalyze.c,v
retrieving revision 1.1
diff -c -p -r1.1 ts_typanalyze.c
*** src/backend/tsearch/ts_typanalyze.c    14 Jul 2008 00:51:45 -0000    1.1
--- src/backend/tsearch/ts_typanalyze.c    14 Jul 2008 04:59:59 -0000
*************** compute_tsvector_stats(VacAttrStats *sta
*** 206,211 ****
--- 206,213 ----
          {
              bool        found;
  
+             CHECK_FOR_INTERRUPTS();
+ 
              /* Construct a hash key */
              hash_key.lexeme = lexemesptr + curentryptr->pos;
              hash_key.length = curentryptr->len;

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: gsoc, text search selectivity and dllist enhancments

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> Isn't the vacuum_delay_point() good enough?

> But that's in the outer loop ... I mean here:

You'd need one heckuva lot of lexemes in a tsvector to make that
important.  Do we have CHECK_FOR_INTERRUPTS() in any other loops
over tsvector contents?  I kinda doubt it ...

(I have no real objection to adding CHECK_FOR_INTERRUPTS there,
I'm just questioning the value.)
        regards, tom lane


Re: gsoc, text search selectivity and dllist enhancments

From
Oleg Bartunov
Date:
On Mon, 14 Jul 2008, Tom Lane wrote:

> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Tom Lane wrote:
>>> Isn't the vacuum_delay_point() good enough?
>
>> But that's in the outer loop ... I mean here:
>
> You'd need one heckuva lot of lexemes in a tsvector to make that

Tom, I like the word 'heckuva' (in Russian we have something of the sort), 
but would you please sometimes put a translation into regular English in a 
footnote, for education :)

> important.  Do we have CHECK_FOR_INTERRUPTS() in any other loops
> over tsvector contents?  I kinda doubt it ...
>
> (I have no real objection to adding CHECK_FOR_INTERRUPTS there,
> I'm just questioning the value.)
>
>             regards, tom lane
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: gsoc, text search selectivity and dllist enhancments

From
Rudolf Lippan
Date:

On Mon, 14 Jul 2008 11:47:17 +0400 (MSD), Oleg Bartunov <oleg@sai.msu.su>
wrote:
> On Mon, 14 Jul 2008, Tom Lane wrote:
> 
>> Alvaro Herrera <alvherre@commandprompt.com> writes:
>>> Tom Lane wrote:
>>>> Isn't the vacuum_delay_point() good enough?
>>
>>> But that's in the outer loop ... I mean here:
>>
>> You'd need one heckuva lot of lexemes in a tsvector to make that
> 
> Tom, I like the 'heckuva' word (in russian we have sort of), but would
> you please sometimes put in footnote a translation to a regular english for
> education :)
> 

You might want to look around on http://www.urbandictionary.com/ . That
site is basically one big footnote to the English language :) It is
somewhat of a social network/democratic version of the OED.

Of course, you may have to read through several (oddball) definitions (some
specious, at best) and couple that with the context from the original
sentence to get a feel for the word as used.

-r





Re: gsoc, text search selectivity and dllist enhancments

From
Oleg Bartunov
Date:
On Mon, 14 Jul 2008, Rudolf Lippan wrote:

>
> On Mon, 14 Jul 2008 11:47:17 +0400 (MSD), Oleg Bartunov <oleg@sai.msu.su>
> wrote:
>> On Mon, 14 Jul 2008, Tom Lane wrote:
>>
>>> Alvaro Herrera <alvherre@commandprompt.com> writes:
>>>> Tom Lane wrote:
>>>>> Isn't the vacuum_delay_point() good enough?
>>>
>>>> But that's in the outer loop ... I mean here:
>>>
>>> You'd need one heckuva lot of lexemes in a tsvector to make that
>>
>> Tom, I like the 'heckuva' word (in russian we have sort of), but would
>> you please sometimes put in footnote a translation to a regular english for
>> education :)
>>
>
> You might want to look around on ' http://www.urbandictionary.com/ '. That
> site is basically one big footnote to the English language :) It is
> somewhat of a social network/democratic version of the OED.

thanks for the link.
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83