Thread: store A LOT of 3-tuples for comparisons

store A LOT of 3-tuples for comparisons

From

Moritz Onken

Date:

22 February 2008, 11:42:48

Hi,

I need to store a lot of 3-tuples of words (e.g. "he", "can",
"drink"), order matters!
The source is about 4 GB of these 3-tuples.
I need to store them in a table and check whether one of them is
already stored, and if that's the case to increment a column named
"count" (or something).

I thought of doing all the inserts without having an index and without
doing the check whether the row is already there. After that I'd do a
"group by" and count(*) on that table. Is this a good idea?

I don't know much about Pgs data types. I'd try to use the varchar
type. But maybe there is a better data type?
What kind of index should I use?

This is for a scientific research.

Thanks in advance

moritz

Re: store A LOT of 3-tuples for comparisons

From

Matthew

Date:

22 February 2008, 11:49:17

On Fri, 22 Feb 2008, Moritz Onken wrote:
> I need to store a lot of 3-tuples of words (e.g. "he", "can", "drink"), order
> matters!
> The source is about 4 GB of these 3-tuples.
> I need to store them in a table and check whether one of them is already
> stored, and if that's the case to increment a column named "count" (or
> something).

My suggestion would be to use three varchar columns to store the 3-tuples.
You should then create a B-tree index on the three columns together.

> I thought of doing all the inserts without having an index and without doing
> the check whether the row is already there. After that I'd do a "group by"
> and count(*) on that table. Is this a good idea?

That sounds like the fastest way to do it, certainly.

Matthew

--
"We have always been quite clear that Win95 and Win98 are not the systems to
use if you are in a hostile security environment." "We absolutely do recognize
that the Internet is a hostile environment." Paul Leach <paulle@microsoft.com>

Re: store A LOT of 3-tuples for comparisons

From

Shane Ambler

Date:

23 February 2008, 03:59:01

Matthew wrote:
> On Fri, 22 Feb 2008, Moritz Onken wrote:

>> I thought of doing all the inserts without having an index and without
>> doing the check whether the row is already there. After that I'd do a
>> "group by" and count(*) on that table. Is this a good idea?
>
> That sounds like the fastest way to do it, certainly.

Yeah I would load the data into a temp 3-column table and then
INSERT INTO mydatatable SELECT w1,w2,w3,count(*) GROUP BY w1,w2,w3
then
CREATE UNIQUE INDEX idx_unique_data ON mydatatable (w1,w2,w3)
if you plan to continue adding to and using the data.

If this is to be an ongoing data collection (with data being added
slowly from here) I would probably setup a trigger to update the count
column.

I am also wondering about the ordering and whether that matters.
Can you have "he", "can", "drink" as well as "drink", "he", "can"
and should they be considered the same? If so you will need a different
tactic.

--

Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz

Re: store A LOT of 3-tuples for comparisons

From

Moritz Onken

Date:

23 February 2008, 05:35:47

>
>
> I am also wondering about the ordering and whether that matters.
> Can you have "he", "can", "drink" as well as "drink", "he", "can"
> and should they be considered the same? If so you will need a
> different tactic.
>

ordering matters. So the 3-column tactic should work.

Thanks for your advice!