Thread: Working with huge amount of data.

Working with huge amount of data.

From: Mario Lopez
Hi guys :-), I am working on a personal project in which I am trying to
make sense of a huge (at least for me) amount of data. I have
approximately 150 million rows of unique words (they are not exactly
words, that is just to explain the situation).

The table I am inserting these into is quite simple, something like this:

CREATE TABLE "public"."names" (
"id" SERIAL,
"name" VARCHAR(255)
) WITHOUT OIDS;

It is a requirement that I can search the varchar column with queries
that look like the following:

SELECT * FROM names WHERE name LIKE 'keyword%'
Or
SELECT * FROM names WHERE name LIKE '%keyword%'

I optimized the first type of query by partitioning on the letter a
name begins with:

CREATE TABLE "public"."names_a" (
CONSTRAINT "names_a_check" CHECK (("name")::text ~~ 'a%'::text)
) INHERITS ("public"."names")
WITHOUT OIDS;
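
A sketch of how the rest of that scheme might look (the "names_b" child
table, the index names, and the 'alpha%' keyword below are invented for
illustration, not taken from the actual setup):

CREATE TABLE "public"."names_b" (
CONSTRAINT "names_b_check" CHECK (("name")::text ~~ 'b%'::text)
) INHERITS ("public"."names")
WITHOUT OIDS;

-- One index per child table; varchar_pattern_ops keeps the index usable
-- for LIKE 'prefix%' searches even in a non-C-locale database.
CREATE INDEX "names_a_name_idx"
    ON "public"."names_a" ("name" varchar_pattern_ops);
CREATE INDEX "names_b_name_idx"
    ON "public"."names_b" ("name" varchar_pattern_ops);

-- A prefix query can then be aimed directly at the partition that
-- matches the keyword's first letter:
SELECT * FROM "public"."names_a" WHERE "name" LIKE 'alpha%';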

The problem arises with the second type of query, where no partitioning
is possible and the search keywords are not known in advance. I have
tried making indexes on the letter a name ends with, and indexes
specifying that it contains a given letter, but none of them work: the
planner only ever makes sequential scans over the table.

For the moment the quickest scan I have been able to make is with
grep!! Surprisingly enough, grep searches a whole 2 GB plain-text file
(one name per line) in about 20 seconds on average, while PostgreSQL
takes around 50 seconds for the first type of query, and the second
type can take up to two minutes, which is completely unacceptable for
an online search engine that has to serve users querying this information.

How do the big search engines, say Google, manage this? I am amazed at
how quickly they can search this amount of information. Is there any
approach I could take? I am open-minded, so anything is acceptable, not
necessarily only PostgreSQL-based solutions (although I would prefer
one). By the way, full-text search in PostgreSQL is ruled out because
what I am looking at are not names that can be decomposed into lexemes;
let's say that this varchar is composed of random garbage.

Thanks for taking the time to read this. :-)


Re: Working with huge amount of data.

From: Vivek Khera
On Feb 11, 2008, at 10:37 AM, Mario Lopez wrote:

> The problem arises with the second type of query, where no
> partitioning is possible and the search keywords are not known in
> advance. I have tried making indexes on the letter a name ends with,
> and indexes specifying that it contains a given letter, but none of
> them work: the planner only ever makes sequential scans over the table.

Postgres doesn't use indexes on prefix-wildcard searches like your
'%word' type search.  It will always sequentially scan the table.


Re: Working with huge amount of data.

From: hubert depesz lubaczewski
On Mon, Feb 11, 2008 at 04:37:24PM +0100, Mario Lopez wrote:
> SELECT * FROM names WHERE name LIKE 'keyword%'
> Or
> SELECT * FROM names WHERE name LIKE '%keyword%'

check this:
http://www.depesz.com/index.php/2007/07/30/indexable-field-like-something/
and this:
http://www.depesz.com/index.php/2007/09/15/speeding-up-like-xxx/
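
In short, those posts index a reversed copy of the value, so that a
leading-wildcard search becomes an indexable prefix search. A minimal
sketch under some assumptions: PostgreSQL 8.3 has no built-in reverse()
(the posts use a PL/Perl function), so the pure-SQL text_reverse and the
index name below are illustrative stand-ins.

-- Illustrative pure-SQL reverse (the linked posts use PL/Perl instead).
CREATE OR REPLACE FUNCTION text_reverse(text) RETURNS text AS $$
    SELECT array_to_string(
        ARRAY(SELECT substr($1, i, 1)
              FROM generate_series(length($1), 1, -1) AS g(i)),
        '');
$$ LANGUAGE sql IMMUTABLE STRICT;

-- Functional index on the reversed value; text_pattern_ops makes it
-- usable for LIKE 'prefix%' matching regardless of locale.
CREATE INDEX names_name_rev_idx
    ON names (text_reverse(name) text_pattern_ops);

-- A leading-wildcard search on name becomes a trailing-wildcard (and
-- therefore indexable) search on the reversed value:
SELECT * FROM names
 WHERE text_reverse(name) LIKE text_reverse('keyword') || '%';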

depesz

--
quicksil1er: "postgres is excellent, but like any DB it requires a
highly paid DBA.  here's my CV!" :)
http://www.depesz.com/ - blog dla ciebie (i moje CV)

Re: Working with huge amount of data.

From: Mario Lopez
Hubert,

Your two posts look pretty cool :). I will read them tonight and get
back to you :)

Thanks!


Re: Working with huge amount of data.

From: Erik Jones
On Feb 11, 2008, at 9:37 AM, Mario Lopez wrote:

> Hi guys :-), I am working on a personal project in which I am
> trying to make sense of a huge (at least for me) amount of data. I
> have approximately 150 million rows of unique words (they are not
> exactly words, that is just to explain the situation).
>
> The table I am inserting these into is quite simple, something
> like this:
>
> CREATE TABLE "public"."names" (
> "id" SERIAL,
> "name" VARCHAR(255)
> ) WITHOUT OIDS;
>
> It is a requirement that I can search the varchar column with
> queries that look like the following:
>
> SELECT * FROM names WHERE name LIKE 'keyword%'
> Or
> SELECT * FROM names WHERE name LIKE '%keyword%'
>
> I optimized the first type of query by partitioning on the letter
> a name begins with:
>
> CREATE TABLE "public"."names_a" (
> CONSTRAINT "names_a_check" CHECK (("name")::text ~~ 'a%'::text)
> ) INHERITS ("public"."names")
> WITHOUT OIDS;
>
> The problem arises with the second type of query, where no
> partitioning is possible and the search keywords are not known in
> advance. I have tried making indexes on the letter a name ends
> with, and indexes specifying that it contains a given letter, but
> none of them work: the planner only ever makes sequential scans
> over the table.
>
> For the moment the quickest scan I have been able to make is with
> grep!! Surprisingly enough, grep searches a whole 2 GB plain-text
> file (one name per line) in about 20 seconds on average, while
> PostgreSQL takes around 50 seconds for the first type of query,
> and the second type can take up to two minutes, which is
> completely unacceptable for an online search engine that has to
> serve users querying this information.
>
> How do the big search engines, say Google, manage this? I am
> amazed at how quickly they can search this amount of information.
> Is there any approach I could take? I am open-minded, so anything
> is acceptable, not necessarily only PostgreSQL-based solutions
> (although I would prefer one). By the way, full-text search in
> PostgreSQL is ruled out because what I am looking at are not names
> that can be decomposed into lexemes; let's say that this varchar
> is composed of random garbage.

A friend of mine actually did exactly what you've tried: grep.  He had
a cron job that would regenerate the text file from the table's data
every five minutes, and his app would shell out to run those kinds of
queries.  Of course, with a setup like that your results can be a
little out of date (by up to the period between runs of the cron job)
but, if you can deal with that, it's actually a pretty simple solution
that doesn't take too much setup.
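
For reference, a sketch of the export half of such a setup (the output
path and the ORDER BY are illustrative): the cron job runs the COPY and
the application then shells out to something like
grep -F 'keyword' /var/tmp/names.txt.

-- Dump the searchable column to a flat file for grep; COPY ... TO
-- writes on the database server, so the path must be writable by the
-- postgres user.
COPY (SELECT name FROM names ORDER BY name)
    TO '/var/tmp/names.txt';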

Erik Jones

DBA | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com




Re: Working with huge amount of data.

From: Mario Lopez
Erik,

Thanks for your answer; this is actually a workable solution because my
data does not get updated very frequently (only every 24 hours). The
problem is that I would like a more advanced version of this; there
must be something I can do. I am going to try what Hubert Depesz
explained in his articles.

Thanks :)



Re: Working with huge amount of data.

From: Jonathan Ballet
Mario Lopez wrote:
> Hi guys :-), I am working on a personal project in which I am trying to
> make sense of a huge (at least for me) amount of data. I have
> approximately 150 million rows of unique words (they are not exactly
> words, that is just to explain the situation).
>
> The table I am inserting these into is quite simple, something like this:
>
> CREATE TABLE "public"."names" (
> "id" SERIAL,
> "name" VARCHAR(255)
> ) WITHOUT OIDS;
>
> It is a requirement that I can search the varchar column with queries
> that look like the following:
>
> SELECT * FROM names WHERE name LIKE 'keyword%'
> Or
> SELECT * FROM names WHERE name LIKE '%keyword%'
>
> I optimized the first type of query by partitioning on the letter a
> name begins with:

AFAIK, you only need to add an index on "name" to speed up the first kind of query.
Have a look at the B-tree description in [1].

  - Jonathan

[1] : http://www.postgresql.org/docs/8.3/interactive/indexes-types.html
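
A minimal sketch of that suggestion (the index name is invented);
EXPLAIN is the quickest way to check whether the planner actually picks
the index for the prefix search:

CREATE INDEX names_name_idx ON "public"."names" ("name");

-- Show whether the prefix query gets an index scan or a sequential
-- scan.  Note that in a non-C-locale database the index needs the
-- varchar_pattern_ops opclass before LIKE 'keyword%' can use it (see
-- the C-locale remark later in the thread).
EXPLAIN SELECT * FROM names WHERE name LIKE 'keyword%';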

Re: Working with huge amount of data.

From: Jeff
On Feb 11, 2008, at 10:37 AM, Mario Lopez wrote:

> SELECT * FROM names WHERE name LIKE 'keyword%'

If you use the C locale, PG can use an index for this query.
> Or
> SELECT * FROM names WHERE name LIKE '%keyword%'
>

But not this one - substring searches are painful.

However, there is some hope - you can check out the pg_trgm module,
which provides trigrams and indexing methods so that '%foo%' type
searches can be done very quickly.
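
A rough sketch of what that looks like, against a recent PostgreSQL
(names are illustrative; on an 8.3-era installation you would load
pg_trgm by running the contrib pg_trgm.sql script and use a GiST index,
and only the % similarity operator, not LIKE, is index-accelerated
there):

-- Recent PostgreSQL: load the extension and build a trigram index.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX names_name_trgm_idx
    ON names USING gin (name gin_trgm_ops);

-- Similarity search with the % operator (matches names whose trigram
-- similarity to 'keyword' exceeds the configured threshold)...
SELECT * FROM names WHERE name % 'keyword';

-- ...and, on recent releases, the same index also serves substring LIKE:
SELECT * FROM names WHERE name LIKE '%keyword%';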

Things like Google don't usually index substrings, just full words,
which makes life easier.  They may also use trigrams and other things
of that nature if they need to do substrings (trigrams can also be
used to provide "Did you mean..." results).

--
Jeff Trout <jeff@jefftrout.com>
www.dellsmartexitin.com
www.stuarthamm.net






Re: Working with huge amount of data. RESULTS!

From: Mario Lopez
Hi!,

I optimized the LIKE 'keyword%' and LIKE '%keyword' with the following
results:

# time /Library/PostgreSQL8/bin/psql -U postgres -d testdb -c "select *
from table1 where varchar_reverse(data) like varchar_reverse('%keyword');"
real    0m0.055s
user    0m0.011s
sys     0m0.006s


# time /Library/PostgreSQL8/bin/psql -U postgres -d testdb -c "select *
from table1 where data like 'keyword%';"
real    0m0.026s
user    0m0.012s
sys     0m0.006s

It works flawlessly, as you can see from the timings; take into
consideration that "table1" has 100 million records. The only problem
is generating the reversed index, which takes about 20 minutes. I guess
it has to do with the plperl function; perhaps a C function for
reversing would do it in less time.

The problem remains with LIKE '%keyword%'. My problem is that I am not
searching for words in a dictionary fashion; suppose my "data" is
random garbage that has common consecutive bytes. How could I generate
a dictionary from this random garbage to make it easier to index?

Thanks




Re: Working with huge amount of data. RESULTS!

From: Alvaro Herrera
Mario Lopez wrote:

> The problem remains with LIKE '%keyword%'. My problem is that I am not
> searching for words in a dictionary fashion; suppose my "data" is
> random garbage that has common consecutive bytes. How could I generate
> a dictionary from this random garbage to make it easier to index?

Using trigrams perhaps?  Try checking whether pg_trgm fits your needs.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Working with huge amount of data. RESULTS!

From: hubert depesz lubaczewski
On Tue, Feb 12, 2008 at 03:45:51PM +0100, Mario Lopez wrote:
> the reversed index, which takes about 20 minutes. I guess it has to do
> with the plperl function; perhaps a C function for reversing would do
> it in less time.

sure. take a look at this:
http://www.depesz.com/index.php/2007/09/05/indexable-field-like-something-update/

> The problem remains with LIKE '%keyword%'. My problem is that I am not
> searching for words in a dictionary fashion; suppose my "data" is
> random garbage that has common consecutive bytes. How could I generate
> a dictionary from this random garbage to make it easier to index?

if you are talking about really general "random bytes", and really
general "random substrings" - you're basically out of luck.

but
think about acceptable limitations.
can you limit min-length of searched substring?
check pg_trgm in contrib.

as for generating a dictionary - show me the data and perhaps i can
think of something.

depesz

--
quicksil1er: "postgres is excellent, but like any DB it requires a
highly paid DBA.  here's my CV!" :)
http://www.depesz.com/ - blog dla ciebie (i moje CV)

Re: Working with huge amount of data. RESULTS!

From: Oleg Bartunov
On Tue, 12 Feb 2008, Mario Lopez wrote:

> Hi!,
>
> I optimized the LIKE 'keyword%' and LIKE '%keyword' with the following
> results:
>
> # time /Library/PostgreSQL8/bin/psql -U postgres -d testdb -c "select * from
> table1 where varchar_reverse(data) like varchar_reverse('%keyword');"
> real    0m0.055s
> user    0m0.011s
> sys     0m0.006s
>
>
> # time /Library/PostgreSQL8/bin/psql -U postgres -d testdb -c "select * from
> table1 where data like 'keyword%';"
> real    0m0.026s
> user    0m0.012s
> sys     0m0.006s
>
> It works flawlessly, as you can see from the timings; take into
> consideration that "table1" has 100 million records. The only problem
> is generating the reversed index, which takes about 20 minutes. I
> guess it has to do with the plperl function; perhaps a C function for
> reversing would do it in less time.
>
> The problem remains with LIKE '%keyword%'. My problem is that I am not
> searching for words in a dictionary fashion; suppose my "data" is
> random garbage that has common consecutive bytes. How could I generate
> a dictionary from this random garbage to make it easier to index?

A suffix tree (array) would speed up the '%keyword%' query, but it is
not currently supported by the GiST extension architecture (we have it
in our TODO), so I see several ways (not tested):

1. try contrib/pg_trgm to reduce the number of candidate words
2. generate all possible substrings and use your tested approach
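
A hedged sketch of option 2, assuming a side table of suffixes (all
object names below are invented for the example). Any substring of a
name is a prefix of one of its suffixes, so a '%keyword%' search over
names becomes an indexable 'keyword%' search over the suffixes, at the
cost of roughly len(name)^2/2 characters of extra storage per name:

CREATE TABLE name_suffixes (
    name_id integer NOT NULL,   -- references names.id
    suffix  text    NOT NULL
);

-- 255 matches the declared maximum length of names.name.
INSERT INTO name_suffixes (name_id, suffix)
SELECT id, substr(name, i)
FROM names, generate_series(1, 255) AS g(i)
WHERE i <= length(name);

CREATE INDEX name_suffixes_suffix_idx
    ON name_suffixes (suffix text_pattern_ops);

-- The substring search becomes a prefix search over the suffixes:
SELECT DISTINCT n.*
FROM names n
JOIN name_suffixes s ON s.name_id = n.id
WHERE s.suffix LIKE 'keyword%';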



     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83