Thread: No index only scan on md5 index

From:
Adam Brusselback
Date:

Hey all,

I have an attachment table in my database which stores a file in a bytea column, the file name, and the size of the file.

Schema: 
CREATE TABLE attachment
(
  attachment_id uuid NOT NULL DEFAULT gen_random_uuid(),
  attachment_name character varying NOT NULL,
  attachment_bytes_size integer NOT NULL,
  attachment_bytes bytea NOT NULL,
  CONSTRAINT attachment_pkey PRIMARY KEY (attachment_id)
);

I do lookups on this table based on the md5 of the attachment_bytes column, so I added an index: 
CREATE INDEX idx_attachment_bytes_md5 ON attachment ((md5(attachment_bytes)::uuid));

Queries like this are sped up by the index no problem:
SELECT attachment_id
FROM attachment
WHERE md5(attachment_bytes)::uuid = 'b2ab855ece13a72a398096dfb6c832aa';

But if I wanted to return the md5 value, it seems to be totally unable to use an index only scan:
SELECT md5(attachment_bytes)::uuid
FROM attachment;

From:
"David G. Johnston"
Date:

On Wednesday, November 25, 2015, Adam Brusselback <> wrote:
Hey all,

I have an attachment table in my database which stores a file in a bytea column, the file name, and the size of the file.

Schema: 
CREATE TABLE attachment
(
  attachment_id uuid NOT NULL DEFAULT gen_random_uuid(),
  attachment_name character varying NOT NULL,
  attachment_bytes_size integer NOT NULL,
  attachment_bytes bytea NOT NULL,
  CONSTRAINT attachment_pkey PRIMARY KEY (attachment_id)
);

I do lookups on this table based on the md5 of the attachment_bytes column, so I added an index: 
CREATE INDEX idx_attachment_bytes_md5 ON attachment ((md5(attachment_bytes)::uuid));

Queries like this are sped up by the index no problem:
SELECT attachment_id
FROM attachment
WHERE md5(attachment_bytes)::uuid = 'b2ab855ece13a72a398096dfb6c832aa';

But if I wanted to return the md5 value, it seems to be totally unable to use an index only scan:
SELECT md5(attachment_bytes)::uuid
FROM attachment;


Ok.

Any reason not to add the uuid column to the table?

AFAIK The system is designed to return data from the heap, not an index.  While it possibly can in some instances if you need to return data you should store it directly in the table.

David J.
From:
Tom Lane
Date:

Adam Brusselback <> writes:
> CREATE TABLE attachment
> (
>   attachment_id uuid NOT NULL DEFAULT gen_random_uuid(),
>   attachment_name character varying NOT NULL,
>   attachment_bytes_size integer NOT NULL,
>   attachment_bytes bytea NOT NULL,
>   CONSTRAINT attachment_pkey PRIMARY KEY (attachment_id)
> );
> CREATE INDEX idx_attachment_bytes_md5 ON attachment
> ((md5(attachment_bytes)::uuid));

> But if I wanted to return the md5 value, it seems to be totally unable to
> use an index only scan:
> SELECT md5(attachment_bytes)::uuid
> FROM attachment;

Nope, sorry, you're out of luck on that, because the check for whether an
index-only scan is feasible checks whether all the variables used in the
query are available from the index.  (Testing whether an index expression
could match everything asked for would greatly slow down planning, whether
or not the index turned out to be applicable, so we don't try.  I have
some rough ideas about making that better, but don't hold your breath.)

IIRC, it does actually get it right in terms of constructing the
finished plan, if you can get past the index-only-scan-is-feasible test.
So some people have done something like this to avoid recalculations of
expensive functions:

create table ff(f1 float8);
create index on ff(sin(f1), f1);
select sin(f1) from ff;  -- can generate IOS and not re-evaluate sin()

But if I'm right in guessing that attachment_bytes can be large,
that's not going to be a workable hack for your case.

Probably the only thing that's going to work for you is to store
md5(attachment_bytes) in its own plain column (you can use a trigger
to compute it for you), and then build a regular index on that,
and query for that column not the md5() expression.

            regards, tom lane


From:
Adam Brusselback
Date:

Main reason I was hoping to not do that, is the value that would be stored in that column is dependent on what is stored in the attachment_bytes column, so to be 100% sure it's correct, you'd need that column controlled by a trigger, disallowing any explicit inserts or updates to the value.  Was having a hard time finding info on this type of thing online though, so I was unsure if Postgres was working as intended, or if I had made a mistake somehow.

If you do know, what are the instances it is able to return data directly from an index instead of having to go to heap?

On Wed, Nov 25, 2015 at 7:55 PM, David G. Johnston <> wrote:
On Wednesday, November 25, 2015, Adam Brusselback <> wrote:
Hey all,

I have an attachment table in my database which stores a file in a bytea column, the file name, and the size of the file.

Schema: 
CREATE TABLE attachment
(
  attachment_id uuid NOT NULL DEFAULT gen_random_uuid(),
  attachment_name character varying NOT NULL,
  attachment_bytes_size integer NOT NULL,
  attachment_bytes bytea NOT NULL,
  CONSTRAINT attachment_pkey PRIMARY KEY (attachment_id)
);

I do lookups on this table based on the md5 of the attachment_bytes column, so I added an index: 
CREATE INDEX idx_attachment_bytes_md5 ON attachment ((md5(attachment_bytes)::uuid));

Queries like this are sped up by the index no problem:
SELECT attachment_id
FROM attachment
WHERE md5(attachment_bytes)::uuid = 'b2ab855ece13a72a398096dfb6c832aa';

But if I wanted to return the md5 value, it seems to be totally unable to use an index only scan:
SELECT md5(attachment_bytes)::uuid
FROM attachment;


Ok.

Any reason not to add the uuid column to the table?

AFAIK The system is designed to return data from the heap, not an index.  While it possibly can in some instances if you need to return data you should store it directly in the table.

David J.

From:
Adam Brusselback
Date:

I appreciate the response Tom, and you are correct that the workaround would not work in my case.

So no index expressions can return the their value without recomputing without that work around?  I learn something new every day it seems. 
Thank you for the alternate method.

-Adam

On Wed, Nov 25, 2015 at 8:01 PM, Tom Lane <> wrote:
Adam Brusselback <> writes:
> CREATE TABLE attachment
> (
>   attachment_id uuid NOT NULL DEFAULT gen_random_uuid(),
>   attachment_name character varying NOT NULL,
>   attachment_bytes_size integer NOT NULL,
>   attachment_bytes bytea NOT NULL,
>   CONSTRAINT attachment_pkey PRIMARY KEY (attachment_id)
> );
> CREATE INDEX idx_attachment_bytes_md5 ON attachment
> ((md5(attachment_bytes)::uuid));

> But if I wanted to return the md5 value, it seems to be totally unable to
> use an index only scan:
> SELECT md5(attachment_bytes)::uuid
> FROM attachment;

Nope, sorry, you're out of luck on that, because the check for whether an
index-only scan is feasible checks whether all the variables used in the
query are available from the index.  (Testing whether an index expression
could match everything asked for would greatly slow down planning, whether
or not the index turned out to be applicable, so we don't try.  I have
some rough ideas about making that better, but don't hold your breath.)

IIRC, it does actually get it right in terms of constructing the
finished plan, if you can get past the index-only-scan-is-feasible test.
So some people have done something like this to avoid recalculations of
expensive functions:

create table ff(f1 float8);
create index on ff(sin(f1), f1);
select sin(f1) from ff;  -- can generate IOS and not re-evaluate sin()

But if I'm right in guessing that attachment_bytes can be large,
that's not going to be a workable hack for your case.

Probably the only thing that's going to work for you is to store
md5(attachment_bytes) in its own plain column (you can use a trigger
to compute it for you), and then build a regular index on that,
and query for that column not the md5() expression.

                        regards, tom lane

From:
Albe Laurenz
Date:

Adam Brusselback wrote:
> I appreciate the response Tom, and you are correct that the workaround would not work in my case.
> 
> So no index expressions can return the their value without recomputing without that work around?  I
> learn something new every day it seems.
> Thank you for the alternate method.

No, what Tom said is that the check whether an "index only scan" was feasible
or not does not consider expressions.

Yours,
Laurenz Albe