Re: gsoc, text search selectivity and dllist enhancments - Mailing list pgsql-hackers

From Jan Urbański
Subject Re: gsoc, text search selectivity and dllist enhancments
Date
Msg-id 4872906C.5020607@students.mimuw.edu.pl
Whole thread Raw
In response to Re: gsoc, text search selectivity and dllist enhancments  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: gsoc, text search selectivity and dllist enhancments
List pgsql-hackers
Tom Lane wrote:
> target would typically be around 10 or so.  It really wasn't designed to
> be industrial-strength code.  (It was still a big improvement over what
> we'd had before :-(.)  So I'm not very comfortable with taking it as the
> design base for something that does need to be industrial-strength.

I did some googling and found a paper describing a method of determining 
most frequent elements from a stream called Lossy Counting:
http://www.vldb.org/conf/2002/S10P03.pdf
The following is an adaptation of an algorithm described in section 4.2 
of that paper. I'll summarize it and propose an implementation using 
existing PG data structures.

The data structure (called D) is a set of entries in the form (e, f, d), 
where e is the lexeme, f is an integer representing the estimated 
frequency and d is the maximum possible error in f. We process tsvectors 
in batches. After each w tsvectors we will be pruning our data 
structure, deleting entries that are not frequent enough. Let b be the 
number of the current batch (tsvectors 1 to w are processed with b = 1, 
tsvectors w+1 to w*2 are processed with b = 2 and so on). Observe that w 
is an arbitrary number.
Initially, D is empty. For each lexeme e we look up its entry in D. If 
it's found, we increment f in that entry. If not, we add a new entry in 
the form of (e, 1, b - 1).
After each w tsvectors we delete all entries from D, that have f + d <= b.

For example, consider such a list of tsvectors (numbers stand for 
lexemes, each line is a tsvector):
1 2 3
2 3 4 5
1 2
1 3 4
2 3 4 5
2 4 5 6

Let w = 3.
After processing the first tsvector D would look like so:
(1, 1, 0), (2, 1, 0), (3, 1, 0)
After processing the second tsvector D would be:
(1, 1, 0), (2, 2, 0), (3, 2, 0), (4, 1, 0), (5, 1, 0)
After the third tsvector we'd have:
(1, 2, 0), (2, 3, 0), (3, 2, 0), (4, 1, 0), (5, 1, 0)

And now we'd do the pruning. b = 1, so all entries with f + d <= 1 are 
discarded. D is then:
(1, 2, 0), (2, 3, 0), (3, 2, 0)

After processing the fourth tsvector we get:
(1, 3, 0), (2, 3, 0), (3, 3, 0), (4, 1, 1)
And then:
(1, 3, 0), (2, 4, 0), (3, 4, 0), (4, 2, 1), (5, 1, 1)
And finally:
(1, 3, 0), (2, 5, 0), (3, 4, 0), (4, 3, 1), (5, 2, 1), (6, 1, 1)

Now comes the time for the second pruning. b = 2, to D is left with:
(1, 3, 0), (2, 5, 0), (3, 4, 0), (4, 3, 1), (5, 2, 1)

Finally we return entries with the largest values of f, their number 
depending on statistics_target. That ends the algorithm.

The original procedure from the paper prunes the entries after each 
"element batch" is processed, but the batches are of equal size. With 
tsvectors it seems sensible to divide lexemes into batches of different 
sizes, so the pruning always takes place after all the lexemes from a 
tsvector have been processed. I'm not totally sure if that change does 
not destroy some properties of that algorithm, but it looks reasonably sane.

As to the implementation: like Tom suggested, we could use a hashtable 
as D and an array of pointers to the hashtable entries, that would get 
updated each time a new hashtable entry would be added. Pruning would 
then require a qsort() of that table taking (f + d) as the sort key, but 
other than that we could just add new entries at the end of the array. 
An open question is: what size should that array be? I think that by 
knowing how long is the longest tsvector we could come up with a good 
estimate, but this needs to be thought over. Anyway, we can always 
repalloc().

Getting the max tsvector length by looking at the datum widths would be 
a neat trick. I just wonder: if we have to detoast all these tsvectors 
anyway, is it not possible to do that and do two passes over already 
detoasted datums? I've got to admit I don't really understand how 
toasting works, so I'm probably completely wrong on that.

Does this algorithm sound sensible? I might have not understood some 
things from that publication, they give some proofs of why their 
algorithm  yields good results and takes up little space, but I'm not 
all that certain it can be so easily adapted.

If you think the Lossy Counting method has potential, I could test it 
somehow. Using my current work I could extract a stream of lexemes as 
ANALYZE sees it and run it through a python implementation of the 
algorithm to see if the result makes sense.

This is getting more complicated, than I thought it would :^)

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: Exposing quals
Next
From: "Andrew Hammond"
Date:
Subject: Re: the un-vacuumable table