How to get an md5/sha256 hash of a really large object in psql?

From: Alex Shan
Date:
Hello,

In my DB I have a large object over 4GB in size.
I need to get its MD5 or SHA256 hash from within a psql query, i.e. without exporting it to the filesystem first.

“SELECT md5(lo_get(loid));” doesn't work — “large object is too large”.

Is there any other way to do it?


Regards,
Al

Re: How to get an md5/sha256 hash of a really large object in psql?

From: Erik Wienhold
Date:
> On 29/07/2023 08:42 CEST Alex Shan <3341018@gmail.com> wrote:
>
> In my DB I have a large object over 4GB in size.
> I need to get its MD5 or SHA256 hash from within a psql query, i.e. without
> exporting it to the filesystem first.
>
> “SELECT md5(lo_get(loid));” doesn't work — “large object is too large”.
>
> Is there any other way to do it?

Is plpython3u [1] an option for you?  If so, you can use Python's hashlib in a
custom function and feed every page of the large object from pg_largeobject to
the selected hash function.

For example:

    BEGIN;

    CREATE EXTENSION plpython3u;

    CREATE OR REPLACE FUNCTION lo_hash(loid oid, name text)
        RETURNS bytea
        LANGUAGE plpython3u
    AS $$
        import hashlib

        hash = hashlib.new(name)

        # Check if large object exists.
        plan = plpy.prepare("""
            SELECT
            FROM pg_largeobject_metadata
            WHERE oid = $1
        """, ['oid'])
        rv = plpy.execute(plan, [loid])

        if rv.nrows() == 0:
            raise ValueError(f"large object {loid} does not exist")

        # Get all pages (possibly zero).
        plan = plpy.prepare("""
            SELECT data
            FROM pg_largeobject
            WHERE loid = $1
            ORDER BY pageno
        """, ['oid'])
        pages = plpy.cursor(plan, [loid])

        for page in pages:
            hash.update(page['data'])

        return hash.digest()
    $$;

    COMMIT;
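
The function streams the object page by page, so memory use stays bounded
regardless of object size.  The same incremental pattern can be sketched in
plain Python, independent of Postgres (the 2048-byte chunk size below mirrors
pg_largeobject's page size, LOBLKSIZE):

```python
import hashlib

def hash_chunks(chunks, algo):
    """Feed an iterable of byte chunks to the named hash,
    the same way lo_hash consumes pg_largeobject pages."""
    h = hashlib.new(algo)
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

# 64 KiB of null bytes split into 32 pages of 2048 bytes each.
pages = [b"\x00" * 2048 for _ in range(32)]
print(hash_chunks(pages, "md5"))     # fcd6bcb56c1689fcef28b57c22475bad
print(hash_chunks(pages, "sha256"))  # de2f256064a0af797747c2b97505dc0b9f3df0de4f489eac731c23ae9ca9cc31
```

Incremental updates produce the same digest as hashing all the bytes in one
call, which is what makes the page-by-page approach valid.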

Testing with 64 KiB of null bytes:

    BEGIN;

    SELECT
      lo_from_bytea(0, decode(repeat('00', 1 << 16), 'hex')) AS test_loid
    \gset

    SELECT
      loid,
      count(*) AS n_pages,
      sum(length(data)) AS n_bytes
    FROM pg_largeobject
    WHERE loid = :test_loid
    GROUP BY loid;

      loid  | n_pages | n_bytes
    --------+---------+---------
     365958 |      32 |   65536
    (1 row)

    SELECT
      :test_loid AS loid,
      lo_hash(:test_loid, 'md5') AS md5;

      loid  |                md5
    --------+------------------------------------
     365958 | \xfcd6bcb56c1689fcef28b57c22475bad
    (1 row)

    SELECT
      :test_loid AS loid,
      lo_hash(:test_loid, 'sha256') AS sha256;

      loid  |                               sha256
    --------+--------------------------------------------------------------------
     365958 | \xde2f256064a0af797747c2b97505dc0b9f3df0de4f489eac731c23ae9ca9cc31
    (1 row)

    ROLLBACK;

Verifying the hashes:

    $ head -c65536 /dev/zero | md5sum
    fcd6bcb56c1689fcef28b57c22475bad  -
    $ head -c65536 /dev/zero | sha256sum
    de2f256064a0af797747c2b97505dc0b9f3df0de4f489eac731c23ae9ca9cc31  -

[1] https://www.postgresql.org/docs/15/plpython.html

--
Erik



Re: How to get an md5/sha256 hash of a really large object in psql?

From: Alex Shan
Date:
Great, thanks a lot!

I will test it on my system.

I tried to do it myself in C with libpq, but got stuck reading the large object...
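
[The client-side loop has the same shape as the server-side one: open the
object, read into a buffer until the read returns nothing, and feed each buffer
to the hash.  A rough sketch of that read loop, with an in-memory stand-in for
the large object (a real program would pass a reader wrapping libpq's lo_read
or psycopg2's lobject.read):]

```python
import hashlib
import io

def hash_stream(read, algo, bufsize=16384):
    """Call read(bufsize) until it returns b'', hashing each buffer,
    mirroring an lo_read loop in a libpq client."""
    h = hashlib.new(algo)
    while True:
        buf = read(bufsize)
        if not buf:
            break
        h.update(buf)
    return h.hexdigest()

# Stand-in for the large object: 64 KiB of null bytes in memory.
lo = io.BytesIO(b"\x00" * (1 << 16))
print(hash_stream(lo.read, "md5"))  # fcd6bcb56c1689fcef28b57c22475bad
```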


On Sat, 29 Jul 2023 at 19:57, Erik Wienhold <ewie@ewie.name> wrote:
> “SELECT md5(lo_get(loid));” doesn't work — “large object is too large”.
>
> Is there any other way to do it?

Is plpython3u [1] an option for you?  In that case you can use Python's hashlib
in a custom function and feed every page from pg_largeobject to a selected hash
function.
...
--
Erik