Thread: Indexes for hashes

Indexes for hashes

From
Ivan Voras
Date:
Hi,

I have an application which stores large amounts of hex-encoded hash strings (nearly 100 GB of them), which means:
  • The number of distinct characters (alphabet) is limited to 16
  • Each string is of the same length, 64 characters
  • The strings are essentially random
Creating a B-Tree index on this results in the index size being larger than the table itself, and there are disk space constraints.

I've found the SP-GiST radix tree index, and thought it could be a good match for the data because of the above constraints. An attempt to create it (as in CREATE INDEX ON t USING spgist(field_name)) apparently takes more than 12 hours (while a similar B-tree index takes a few hours at most), so I've interrupted it because "it probably is not going to finish in a reasonable time". Some slides I found on the SP-GiST index suggest that both its build time and size make it unsuitable for this purpose.

My question is: what would be the most size-efficient index for this situation?

Re: Indexes for hashes

From
Torsten Zuehlsdorff
Date:
Hello Ivan,

> I have an application which stores a large amounts of hex-encoded hash
> strings (nearly 100 GB of them), which means:
>
>   * The number of distinct characters (alphabet) is limited to 16
>   * Each string is of the same length, 64 characters
>   * The strings are essentially random
>
> Creating a B-Tree index on this results in the index size being larger
> than the table itself, and there are disk space constraints.
>
> I've found the SP-GIST radix tree index, and thought it could be a good
> match for the data because of the above constraints. An attempt to
> create it (as in CREATE INDEX ON t USING spgist(field_name)) apparently
> takes more than 12 hours (while a similar B-tree index takes a few hours
> at most), so I've interrupted it because "it probably is not going to
> finish in a reasonable time". Some slides I found on the spgist index
> allude that both build time and size are not really suitable for this
> purpose.
>
> My question is: what would be the most size-efficient index for this
> situation?

It depends on what you want to query. What about a BRIN index:
https://www.postgresql.org/docs/9.5/static/brin-intro.html

This will result in a very small index, but depending on what you want to
query, it may or may not fit your needs.
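
A minimal sketch of the suggestion, assuming the table and column names from the original post (t, field_name). Worth noting: BRIN summarizes block ranges, so it only helps when values correlate with physical row order, which essentially random hashes do not:

```sql
-- BRIN stores one small summary (min/max by default) per block range,
-- so the index stays tiny regardless of table size.
CREATE INDEX t_field_brin ON t USING brin (field_name);

-- For random hashes every block range spans nearly the whole key space,
-- so an equality probe like this degenerates to scanning most blocks:
SELECT * FROM t WHERE field_name = '0123...';
```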

Greetings,
Torsten


Re: Indexes for hashes

From
"ktm@rice.edu"
Date:
On Wed, Jun 15, 2016 at 11:34:18AM +0200, Ivan Voras wrote:
> Hi,
>
> I have an application which stores a large amounts of hex-encoded hash
> strings (nearly 100 GB of them), which means:
>
>    - The number of distinct characters (alphabet) is limited to 16
>    - Each string is of the same length, 64 characters
>    - The strings are essentially random
>
> Creating a B-Tree index on this results in the index size being larger than
> the table itself, and there are disk space constraints.
>
> I've found the SP-GIST radix tree index, and thought it could be a good
> match for the data because of the above constraints. An attempt to create
> it (as in CREATE INDEX ON t USING spgist(field_name)) apparently takes more
> than 12 hours (while a similar B-tree index takes a few hours at most), so
> I've interrupted it because "it probably is not going to finish in a
> reasonable time". Some slides I found on the spgist index allude that both
> build time and size are not really suitable for this purpose.
>
> My question is: what would be the most size-efficient index for this
> situation?

Hi Ivan,

If the strings are really random, then maybe a functional index on the first
4, 8, or 16 characters could be used to narrow the search space without
indexing all 64. If they are not "good" random values, you could use a hash
index on the strings. It will be much smaller, since it currently uses a
32-bit hash. It has a number of caveats and is not currently crash-safe, but
it seems like it might work in your environment. You can also use a
functional index on a hash function applied to your values with a btree,
which gives you crash safety.
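
Both variants can be sketched as follows (assuming the table and column names from the original post, t and field_name; hashtext() is PostgreSQL's internal hash function, which is usable but undocumented):

```sql
-- Functional index on an 8-character prefix: narrows candidates,
-- and the full comparison filters out prefix collisions.
CREATE INDEX t_field_prefix ON t (left(field_name, 8));

-- The query must repeat the indexed expression so the planner
-- matches it against the index:
SELECT * FROM t
 WHERE left(field_name, 8) = left('0123...', 8)
   AND field_name = '0123...';

-- Alternatively, a btree over a 32-bit hash of the value: much
-- narrower keys than the 64-char text, and crash-safe (unlike a
-- hash index at the time of this thread).
CREATE INDEX t_field_hash ON t (hashtext(field_name));

SELECT * FROM t
 WHERE hashtext(field_name) = hashtext('0123...')
   AND field_name = '0123...';
```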

Regards,
Ken


Re: Indexes for hashes

From
Ivan Voras
Date:


On 15 June 2016 at 15:03, ktm@rice.edu <ktm@rice.edu> wrote:
On Wed, Jun 15, 2016 at 11:34:18AM +0200, Ivan Voras wrote:
> Hi,
>
> I have an application which stores a large amounts of hex-encoded hash
> strings (nearly 100 GB of them), which means:
>
>    - The number of distinct characters (alphabet) is limited to 16
>    - Each string is of the same length, 64 characters
>    - The strings are essentially random
>
> Creating a B-Tree index on this results in the index size being larger than
> the table itself, and there are disk space constraints.
>
> I've found the SP-GIST radix tree index, and thought it could be a good
> match for the data because of the above constraints. An attempt to create
> it (as in CREATE INDEX ON t USING spgist(field_name)) apparently takes more
> than 12 hours (while a similar B-tree index takes a few hours at most), so
> I've interrupted it because "it probably is not going to finish in a
> reasonable time". Some slides I found on the spgist index allude that both
> build time and size are not really suitable for this purpose.
>
> My question is: what would be the most size-efficient index for this
> situation?

Hi Ivan,

If the strings are really random, then maybe a function index on the first
4, 8, or 16 characters could be used to narrow the search space and not need
to index all 64. If they are not "good" random numbers, you could use a hash
index on the strings. It will be much smaller since it currently uses a 32-bit
hash. It has a number of caveats and is not currently crash-safe, but it seems
like it might work in your environment. You can also use a functional index on
a hash-function applied to your values with a btree to give you crash safety.


Hi,

I figured the hash index might be helpful and I've tried it in the meantime: on one of the smaller tables (which is 51 GB in size), a btree index is 32 GB, while the hash index is 22 GB (so btree is around 45% larger).

I don't suppose there's an effort in progress to make hash indexes use WAL? :D




Re: Indexes for hashes

From
"ktm@rice.edu"
Date:
On Wed, Jun 15, 2016 at 03:09:04PM +0200, Ivan Voras wrote:
> On 15 June 2016 at 15:03, ktm@rice.edu <ktm@rice.edu> wrote:
>
> > On Wed, Jun 15, 2016 at 11:34:18AM +0200, Ivan Voras wrote:
> > > Hi,
> > >
> > > I have an application which stores a large amounts of hex-encoded hash
> > > strings (nearly 100 GB of them), which means:
> > >
> > >    - The number of distinct characters (alphabet) is limited to 16
> > >    - Each string is of the same length, 64 characters
> > >    - The strings are essentially random
> > >
> > > Creating a B-Tree index on this results in the index size being larger
> > > than the table itself, and there are disk space constraints.
> > >
> > > I've found the SP-GIST radix tree index, and thought it could be a good
> > > match for the data because of the above constraints. An attempt to create
> > > it (as in CREATE INDEX ON t USING spgist(field_name)) apparently takes
> > > more than 12 hours (while a similar B-tree index takes a few hours at
> > > most), so I've interrupted it because "it probably is not going to finish
> > > in a reasonable time". Some slides I found on the spgist index allude
> > > that both build time and size are not really suitable for this purpose.
> > >
> > > My question is: what would be the most size-efficient index for this
> > > situation?
> >
> > Hi Ivan,
> >
> > If the strings are really random, then maybe a function index on the first
> > 4, 8, or 16 characters could be used to narrow the search space and not
> > need to index all 64. If they are not "good" random numbers, you could use
> > a hash index on the strings. It will be much smaller since it currently
> > uses a 32-bit hash. It has a number of caveats and is not currently
> > crash-safe, but it seems like it might work in your environment. You can
> > also use a functional index on a hash-function applied to your values with
> > a btree to give you crash safety.
> >
> >
> Hi,
>
> I figured the hash index might be helpful and I've tried it in the
> meantime: on one of the smaller tables (which is 51 GB in size), a btree
> index is 32 GB, while the hash index is 22 GB (so btree is around 45%
> larger).
>
> I don't suppose there's an effort in progress to make hash indexes use WAL?
> :D

Hi Ivan,

Several people have looked at it but it has not made it to the top of anyone's
to-do list. So if you need WAL and crash-safety, a functional index on a hash
of your values is currently your best bet.

Regards,
Ken


Re: Indexes for hashes

From
hubert depesz lubaczewski
Date:
On Wed, Jun 15, 2016 at 11:34:18AM +0200, Ivan Voras wrote:
> I have an application which stores a large amounts of hex-encoded hash
> strings (nearly 100 GB of them), which means:

Why do you keep them hex-encoded instead of using bytea?

I made a sample table with 1 million rows, looking like this:

     Table "public.new"
 Column  | Type  | Modifiers
---------+-------+-----------
 texthex | text  |
 a_bytea | bytea |

values are like:

$ select * from new limit 10;
                             texthex                              |                              a_bytea
------------------------------------------------------------------+--------------------------------------------------------------------
 c968f64426b941bc9a8f6d4e87fc151c7a7192679837618570b7989c67c31e2f | \xc968f64426b941bc9a8f6d4e87fc151c7a7192679837618570b7989c67c31e2f
 61dffbf002d7fc3db9df5953aca7f2e434a78d4c5fdd0db6f90f43ee8c4371db | \x61dffbf002d7fc3db9df5953aca7f2e434a78d4c5fdd0db6f90f43ee8c4371db
 757acf228adf2357356fd38d03b80529771f211e0ad3b35b66a23d6de53e5033 | \x757acf228adf2357356fd38d03b80529771f211e0ad3b35b66a23d6de53e5033
 fba35b8b33a7fccc2ac7d96389d43be509ff17636fe3c0f8a33af6d009f84f15 | \xfba35b8b33a7fccc2ac7d96389d43be509ff17636fe3c0f8a33af6d009f84f15
 ecd8587a8b9acae650760cea8683f8e1c131c4054c0d64b1d7de0ff269ccc61a | \xecd8587a8b9acae650760cea8683f8e1c131c4054c0d64b1d7de0ff269ccc61a
 11782c73bb3fc9f281b41d3eff8c1e7907b3494b3abe7b6982c1e88f49dad2ea | \x11782c73bb3fc9f281b41d3eff8c1e7907b3494b3abe7b6982c1e88f49dad2ea
 5862bd8d645e4d44997a485c616bc18f1acabeaec5df3c3b09b9d4c08643e852 | \x5862bd8d645e4d44997a485c616bc18f1acabeaec5df3c3b09b9d4c08643e852
 2d09a5cca2c03153a55faa3aff13df0f0593f4355a1b2cfcf9237c2931b4918c | \x2d09a5cca2c03153a55faa3aff13df0f0593f4355a1b2cfcf9237c2931b4918c
 2186eb2bcc12319ee6e00f7e08a1d61e379a75c01c579c29d0338693bc31c7c7 | \x2186eb2bcc12319ee6e00f7e08a1d61e379a75c01c579c29d0338693bc31c7c7
 2061bd05049c51bd1162e4d77f72a37f06d2397fc522ef587ed172a5ad8d57aa | \x2061bd05049c51bd1162e4d77f72a37f06d2397fc522ef587ed172a5ad8d57aa
(10 rows)

created two indexes:
create index i1 on new (texthex);
create index i2 on new (a_bytea);

i1 is 91MB, and i2 is 56MB.

Index creation was also much faster: the best of 3 runs for i1 was
4928.982 ms, versus 2047.648 ms for i2.
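
If changing the stored format is an option, the conversion itself is a one-liner (a sketch, assuming the existing text column is named field_name on table t; the new column and table names are illustrative):

```sql
-- decode() parses the 64 hex characters into 32 raw bytes,
-- halving the on-disk width of every value.
ALTER TABLE t ADD COLUMN field_raw bytea;
UPDATE t SET field_raw = decode(field_name, 'hex');

-- encode() converts back whenever the application needs hex text:
SELECT encode(field_raw, 'hex') FROM t LIMIT 1;
```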

Best regards,

depesz



Re: Indexes for hashes

From
Ivan Voras
Date:
Hi,

I understand your idea, and have also been thinking about it. Basically, existing applications would need to be modified, however slightly, and that wouldn't be good.




On 15 June 2016 at 15:38, hubert depesz lubaczewski <depesz@depesz.com> wrote:
On Wed, Jun 15, 2016 at 11:34:18AM +0200, Ivan Voras wrote:
> I have an application which stores a large amounts of hex-encoded hash
> strings (nearly 100 GB of them), which means:

Why do you keep them hex encoded, and not use bytea?

I made a sample table with 1 million rows, looking like this:

     Table "public.new"
 Column  | Type  | Modifiers
---------+-------+-----------
 texthex | text  |
 a_bytea | bytea |

values are like:

$ select * from new limit 10;
                             texthex                              |                              a_bytea
------------------------------------------------------------------+--------------------------------------------------------------------
 c968f64426b941bc9a8f6d4e87fc151c7a7192679837618570b7989c67c31e2f | \xc968f64426b941bc9a8f6d4e87fc151c7a7192679837618570b7989c67c31e2f
 61dffbf002d7fc3db9df5953aca7f2e434a78d4c5fdd0db6f90f43ee8c4371db | \x61dffbf002d7fc3db9df5953aca7f2e434a78d4c5fdd0db6f90f43ee8c4371db
 757acf228adf2357356fd38d03b80529771f211e0ad3b35b66a23d6de53e5033 | \x757acf228adf2357356fd38d03b80529771f211e0ad3b35b66a23d6de53e5033
 fba35b8b33a7fccc2ac7d96389d43be509ff17636fe3c0f8a33af6d009f84f15 | \xfba35b8b33a7fccc2ac7d96389d43be509ff17636fe3c0f8a33af6d009f84f15
 ecd8587a8b9acae650760cea8683f8e1c131c4054c0d64b1d7de0ff269ccc61a | \xecd8587a8b9acae650760cea8683f8e1c131c4054c0d64b1d7de0ff269ccc61a
 11782c73bb3fc9f281b41d3eff8c1e7907b3494b3abe7b6982c1e88f49dad2ea | \x11782c73bb3fc9f281b41d3eff8c1e7907b3494b3abe7b6982c1e88f49dad2ea
 5862bd8d645e4d44997a485c616bc18f1acabeaec5df3c3b09b9d4c08643e852 | \x5862bd8d645e4d44997a485c616bc18f1acabeaec5df3c3b09b9d4c08643e852
 2d09a5cca2c03153a55faa3aff13df0f0593f4355a1b2cfcf9237c2931b4918c | \x2d09a5cca2c03153a55faa3aff13df0f0593f4355a1b2cfcf9237c2931b4918c
 2186eb2bcc12319ee6e00f7e08a1d61e379a75c01c579c29d0338693bc31c7c7 | \x2186eb2bcc12319ee6e00f7e08a1d61e379a75c01c579c29d0338693bc31c7c7
 2061bd05049c51bd1162e4d77f72a37f06d2397fc522ef587ed172a5ad8d57aa | \x2061bd05049c51bd1162e4d77f72a37f06d2397fc522ef587ed172a5ad8d57aa
(10 rows)

created two indexes:
create index i1 on new (texthex);
create index i2 on new (a_bytea);

i1 is 91MB, and i2 is 56MB.

Index creation was also much faster - best out of 3 runs for i1 was 4928.982
ms, best out of 3 runs for i2 was 2047.648 ms

Best regards,

depesz


Re: Indexes for hashes

From
julyanto SUTANDANG
Date:
Hi Ivan, 

How about using crc32, and then indexing the integer result of the crc32 function? You could split the hash into two parts, apply crc32 to each, and create a composite index on both integers (the crc32 results). Instead of 64 characters, you would only employ two integers as the index key.
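
A sketch of this idea, assuming a crc32(text) → integer SQL function is available from an extension (PostgreSQL has no built-in crc32, as the thread notes below); table and column names are from the original post:

```sql
-- Hypothetical: crc32() must come from an extension.
-- Two 4-byte integers replace the 64-character text as the key;
-- the trailing equality check filters crc collisions.
CREATE INDEX t_field_crc ON t (
    (crc32(left(field_name, 32))),
    (crc32(right(field_name, 32)))
);

SELECT * FROM t
 WHERE crc32(left(field_name, 32))  = crc32(left('0123...', 32))
   AND crc32(right(field_name, 32)) = crc32(right('0123...', 32))
   AND field_name = '0123...';
```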

Regards, 

Jul

On Wed, Jun 15, 2016 at 8:54 PM, Ivan Voras <ivoras@gmail.com> wrote:
Hi,

I understand your idea, and have also been thinking about it. Basically, existing applications would need to be modified, however slightly, and that wouldn't be good.




On 15 June 2016 at 15:38, hubert depesz lubaczewski <depesz@depesz.com> wrote:
On Wed, Jun 15, 2016 at 11:34:18AM +0200, Ivan Voras wrote:
> I have an application which stores a large amounts of hex-encoded hash
> strings (nearly 100 GB of them), which means:

Why do you keep them hex encoded, and not use bytea?

I made a sample table with 1 million rows, looking like this:

     Table "public.new"
 Column  | Type  | Modifiers
---------+-------+-----------
 texthex | text  |
 a_bytea | bytea |

values are like:

$ select * from new limit 10;
[... 10 sample rows snipped ...]

created two indexes:
create index i1 on new (texthex);
create index i2 on new (a_bytea);

i1 is 91MB, and i2 is 56MB.

Index creation was also much faster - best out of 3 runs for i1 was 4928.982
ms, best out of 3 runs for i2 was 2047.648 ms

Best regards,

depesz





--


Julyanto SUTANDANG

Equnix Business Solutions, PT
(An Open Source an Open Mind Company)

Pusat Niaga ITC Roxy Mas Blok C2/42.  Jl. KH Hasyim Ashari 125, Jakarta Pusat
T: +6221 22866662 F: +62216315281 M: +628164858028


Caution: The information enclosed in this email (and any attachments) may be legally privileged and/or confidential and is intended only for the use of the addressee(s). No addressee should forward, print, copy, or otherwise reproduce this message in any manner that would allow it to be viewed by any individual not originally listed as a recipient. If the reader of this message is not the intended recipient, you are hereby notified that any unauthorized disclosure, dissemination, distribution, copying or the taking of any action in reliance on the information herein is strictly prohibited. If you have received this communication in error, please immediately notify the sender and delete this message.Unless it is made by the authorized person, any views expressed in this message are those of the individual sender and may not necessarily reflect the views of PT Equnix Business Solutions.

Re: Indexes for hashes

From
Ivan Voras
Date:
Hi,

This idea is similar to the substring one, and while it does give excellent performance and small size, it requires application code modifications, so it's out.


On 15 June 2016 at 15:58, julyanto SUTANDANG <julyanto@equnix.co.id> wrote:
Hi Ivan, 

How about using crc32 ? and then index the integer as the result of crc32 function? you can split the hash into 2 part and do crc32 2x ? and then create composite index on both integer (the crc32 result)
instead of using 64 char, you only employ 2 integer as index key. 

Regards, 

Jul

On Wed, Jun 15, 2016 at 8:54 PM, Ivan Voras <ivoras@gmail.com> wrote:
Hi,

I understand your idea, and have also been thinking about it. Basically, existing applications would need to be modified, however slightly, and that wouldn't be good.




On 15 June 2016 at 15:38, hubert depesz lubaczewski <depesz@depesz.com> wrote:
On Wed, Jun 15, 2016 at 11:34:18AM +0200, Ivan Voras wrote:
> I have an application which stores a large amounts of hex-encoded hash
> strings (nearly 100 GB of them), which means:

Why do you keep them hex encoded, and not use bytea?

I made a sample table with 1 million rows, looking like this:

     Table "public.new"
 Column  | Type  | Modifiers
---------+-------+-----------
 texthex | text  |
 a_bytea | bytea |

values are like:

$ select * from new limit 10;
[... 10 sample rows snipped ...]

created two indexes:
create index i1 on new (texthex);
create index i2 on new (a_bytea);

i1 is 91MB, and i2 is 56MB.

Index creation was also much faster - best out of 3 runs for i1 was 4928.982
ms, best out of 3 runs for i2 was 2047.648 ms

Best regards,

depesz





--


Julyanto SUTANDANG

Equnix Business Solutions, PT
(An Open Source an Open Mind Company)

Pusat Niaga ITC Roxy Mas Blok C2/42.  Jl. KH Hasyim Ashari 125, Jakarta Pusat
T: +6221 22866662 F: +62216315281 M: +628164858028



Re: Indexes for hashes

From
Ivan Voras
Date:
Hi,

Just for testing... is there a fast (i.e. written in C) crc32 or a similar small hash function for PostgreSQL?


On 15 June 2016 at 16:00, Ivan Voras <ivoras@gmail.com> wrote:
Hi,

This idea is similar to the substring one, and while it does give excellent performance and small size, it requires application code modifications, so it's out.


On 15 June 2016 at 15:58, julyanto SUTANDANG <julyanto@equnix.co.id> wrote:
Hi Ivan, 

How about using crc32 ? and then index the integer as the result of crc32 function? you can split the hash into 2 part and do crc32 2x ? and then create composite index on both integer (the crc32 result)
instead of using 64 char, you only employ 2 integer as index key. 

Regards, 

Jul

On Wed, Jun 15, 2016 at 8:54 PM, Ivan Voras <ivoras@gmail.com> wrote:
Hi,

I understand your idea, and have also been thinking about it. Basically, existing applications would need to be modified, however slightly, and that wouldn't be good.




On 15 June 2016 at 15:38, hubert depesz lubaczewski <depesz@depesz.com> wrote:
On Wed, Jun 15, 2016 at 11:34:18AM +0200, Ivan Voras wrote:
> I have an application which stores a large amounts of hex-encoded hash
> strings (nearly 100 GB of them), which means:

Why do you keep them hex encoded, and not use bytea?

I made a sample table with 1 million rows, looking like this:

     Table "public.new"
 Column  | Type  | Modifiers
---------+-------+-----------
 texthex | text  |
 a_bytea | bytea |

values are like:

$ select * from new limit 10;
[... 10 sample rows snipped ...]

created two indexes:
create index i1 on new (texthex);
create index i2 on new (a_bytea);

i1 is 91MB, and i2 is 56MB.

Index creation was also much faster - best out of 3 runs for i1 was 4928.982
ms, best out of 3 runs for i2 was 2047.648 ms

Best regards,

depesz





--


Julyanto SUTANDANG

Equnix Business Solutions, PT
(An Open Source an Open Mind Company)

Pusat Niaga ITC Roxy Mas Blok C2/42.  Jl. KH Hasyim Ashari 125, Jakarta Pusat
T: +6221 22866662 F: +62216315281 M: +628164858028




Re: Indexes for hashes

From
"Joshua D. Drake"
Date:
On 06/15/2016 07:20 AM, Ivan Voras wrote:
> Hi,
>
> Just for testing... is there a fast (i.e. written in C) crc32 or a
> similar small hash function for PostgreSQL?

https://www.postgresql.org/docs/9.5/static/pgcrypto.html

We also have a builtin md5().
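
For example, md5() could back a functional index along the lines suggested earlier in the thread (a sketch using the table and column names from the original post; packing the hex output into bytea halves the key width):

```sql
-- md5() returns 32 hex characters; decode() packs them into
-- 16 raw bytes for a narrower btree key.
CREATE INDEX t_field_md5 ON t (decode(md5(field_name), 'hex'));

-- The query repeats the indexed expression, then re-checks the
-- original value to filter md5 collisions:
SELECT * FROM t
 WHERE decode(md5(field_name), 'hex') = decode(md5('0123...'), 'hex')
   AND field_name = '0123...';
```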

JD
--
Command Prompt, Inc.                  http://the.postgres.company/
                         +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.


Re: Indexes for hashes

From
"ktm@rice.edu"
Date:
On Wed, Jun 15, 2016 at 04:20:46PM +0200, Ivan Voras wrote:
> Hi,
>
> Just for testing... is there a fast (i.e. written in C) crc32 or a similar
> small hash function for PostgreSQL?
>

Hi Ivan,

Here is an extension that provides a number of different hash
functions, including a version of the one used internally:

https://github.com/markokr/pghashlib

Regards,
Ken


Re: Indexes for hashes

From
Jeff Janes
Date:
On Wed, Jun 15, 2016 at 6:16 AM, ktm@rice.edu <ktm@rice.edu> wrote:
> On Wed, Jun 15, 2016 at 03:09:04PM +0200, Ivan Voras wrote:
>> On 15 June 2016 at 15:03, ktm@rice.edu <ktm@rice.edu> wrote:
>>
>>
>> I don't suppose there's an effort in progress to make hash indexes use WAL?
>> :D
>
> Hi Ivan,
>
> Several people have looked at it but it has not made it to the top of anyone's
> to-do list.

I don't know if it is the top of his todo list, but Amit seems pretty
serious about it:

https://www.postgresql.org/message-id/CAA4eK1LfzcZYxLoXS874Ad0+S-ZM60U9bwcyiUZx9mHZ-KCWhw@mail.gmail.com

I hope to give him some help if I get a chance.

Cheers,

Jeff


Re: Indexes for hashes

From
Claudio Freire
Date:
On Wed, Jun 15, 2016 at 6:34 AM, Ivan Voras <ivoras@gmail.com> wrote:
>
> I have an application which stores a large amounts of hex-encoded hash
> strings (nearly 100 GB of them), which means:
>
> The number of distinct characters (alphabet) is limited to 16
> Each string is of the same length, 64 characters
> The strings are essentially random
>
> Creating a B-Tree index on this results in the index size being larger than
> the table itself, and there are disk space constraints.
>
> I've found the SP-GIST radix tree index, and thought it could be a good
> match for the data because of the above constraints. An attempt to create it
> (as in CREATE INDEX ON t USING spgist(field_name)) apparently takes more
> than 12 hours (while a similar B-tree index takes a few hours at most), so
> I've interrupted it because "it probably is not going to finish in a
> reasonable time". Some slides I found on the spgist index allude that both
> build time and size are not really suitable for this purpose.


I've found that hash btree indexes tend to perform well in these situations:

CREATE INDEX ON t USING btree (hashtext(fieldname));

However, you'll have to modify your queries to query on both the
hashtext and the text itself:

SELECT * FROM t WHERE hashtext(fieldname) = hashtext('blabla') AND
fieldname = 'blabla';


Re: Indexes for hashes

From
julyanto SUTANDANG
Date:
Using crc32(data) would be faster than hashtext, since crc32 is hardware-accelerated on Intel (and perhaps other) CPUs.
crc32 is not at all the same thing as a hash, but it is much faster to compute than the alternatives...

Regards, 


On Fri, Jun 17, 2016 at 10:51 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Wed, Jun 15, 2016 at 6:34 AM, Ivan Voras <ivoras@gmail.com> wrote:
>
> I have an application which stores a large amounts of hex-encoded hash
> strings (nearly 100 GB of them), which means:
>
> The number of distinct characters (alphabet) is limited to 16
> Each string is of the same length, 64 characters
> The strings are essentially random
>
> Creating a B-Tree index on this results in the index size being larger than
> the table itself, and there are disk space constraints.
>
> I've found the SP-GIST radix tree index, and thought it could be a good
> match for the data because of the above constraints. An attempt to create it
> (as in CREATE INDEX ON t USING spgist(field_name)) apparently takes more
> than 12 hours (while a similar B-tree index takes a few hours at most), so
> I've interrupted it because "it probably is not going to finish in a
> reasonable time". Some slides I found on the spgist index allude that both
> build time and size are not really suitable for this purpose.


I've found that hash btree indexes tend to perform well in these situations:

CREATE INDEX ON t USING btree (hashtext(fieldname));

However, you'll have to modify your queries to query for both, the
hashtext and the text itself:

SELECT * FROM t WHERE hashtext(fieldname) = hashtext('blabla') AND
fieldname = 'blabla';





--


Julyanto SUTANDANG

Equnix Business Solutions, PT
(An Open Source an Open Mind Company)

Pusat Niaga ITC Roxy Mas Blok C2/42.  Jl. KH Hasyim Ashari 125, Jakarta Pusat
T: +6221 22866662 F: +62216315281 M: +628164858028


Caution: The information enclosed in this email (and any attachments) may be legally privileged and/or confidential and is intended only for the use of the addressee(s). No addressee should forward, print, copy, or otherwise reproduce this message in any manner that would allow it to be viewed by any individual not originally listed as a recipient. If the reader of this message is not the intended recipient, you are hereby notified that any unauthorized disclosure, dissemination, distribution, copying or the taking of any action in reliance on the information herein is strictly prohibited. If you have received this communication in error, please immediately notify the sender and delete this message.Unless it is made by the authorized person, any views expressed in this message are those of the individual sender and may not necessarily reflect the views of PT Equnix Business Solutions.

Re: Indexes for hashes

From
Claudio Freire
Date:
On Fri, Jun 17, 2016 at 1:09 AM, julyanto SUTANDANG
<julyanto@equnix.co.id> wrote:
> Using crc32(data) is faster than hashtext(), since CRC32 is
> hardware-accelerated on Intel (and perhaps other) CPUs.
> CRC32 is not the same kind of hash, but it is much faster than the
> alternatives...
>
> Regards,

Sure, but I've had uniformity issues with crc32.


Re: Indexes for hashes

From
Ivan Voras
Date:
And in any case, there's no crc32 in the built-in pgcrypto module.


On 17 June 2016 at 06:18, Claudio Freire <klaussfreire@gmail.com> wrote:
On Fri, Jun 17, 2016 at 1:09 AM, julyanto SUTANDANG
<julyanto@equnix.co.id> wrote:
> Using crc32(data) is faster than hashtext(), since CRC32 is
> hardware-accelerated on Intel (and perhaps other) CPUs.
> CRC32 is not the same kind of hash, but it is much faster than the
> alternatives...
>
> Regards,

Sure, but I've had uniformity issues with crc32.

Re: Indexes for hashes

From
julyanto SUTANDANG
Date:
CRC32 is great because it is supported in Intel hardware; unfortunately, you have to write something like this:



#include <stdint.h>
#include <stddef.h>
#include <nmmintrin.h>   /* _mm_crc32_u8; compile with -msse4.2 */

uint32_t sse42_crc32(const uint8_t *bytes, size_t len)
{
    uint32_t hash = 0;   /* the intrinsic returns an unsigned value */
    for (size_t i = 0; i < len; i++)
        hash = _mm_crc32_u8(hash, bytes[i]);
    return hash;
}
GCC supports this intrinsic and compiles it down to the hardware instruction, which is really fast.
You can use a 64-bit integer variant to get a wider hash, so don't worry about uniformity.

Btw: CRC32 is not a cryptographic function; it belongs to hashing/checksumming, which is why it is not in pgcrypto.
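[Editor's note: one detail worth adding here. The SSE4.2 crc32 instruction implements the Castagnoli polynomial (CRC-32C), not the zlib/PKZIP CRC-32. A minimal portable software sketch of the same polynomial, useful for cross-checking results on non-Intel hardware, could look like the following. Note that the snippet above seeds the CRC with 0 and skips the final inversion, so its output differs from the standard CRC-32C check value computed here.]

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli) - the same polynomial the SSE4.2
 * crc32 instruction uses.  0x82F63B78 is the reflected polynomial.
 * Uses the conventional all-ones seed and final inversion. */
uint32_t crc32c_sw(const uint8_t *bytes, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= bytes[i];
        for (int k = 0; k < 8; k++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
    }
    return ~crc;   /* final inversion, per the CRC-32C convention */
}
```

The standard check value for CRC-32C is crc32c_sw("123456789") == 0xE3069283, which makes it easy to verify any hardware or software implementation against this one.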

Regards,  


On Fri, Jun 17, 2016 at 3:32 PM, Ivan Voras <ivoras@gmail.com> wrote:
And in any case, there's no crc32 in the built-in pgcrypto module.


On 17 June 2016 at 06:18, Claudio Freire <klaussfreire@gmail.com> wrote:
On Fri, Jun 17, 2016 at 1:09 AM, julyanto SUTANDANG
<julyanto@equnix.co.id> wrote:
> Using crc32(data) is faster than hashtext(), since CRC32 is
> hardware-accelerated on Intel (and perhaps other) CPUs.
> CRC32 is not the same kind of hash, but it is much faster than the
> alternatives...
>
> Regards,

Sure, but I've had uniformity issues with crc32.



