Thread: Delays between "connection received" and "connection authenticated" because of localhost entries in hba

Delays between "connection received" and "connection authenticated" because of localhost entries in hba

Hi,

we're facing a strange issue with delays between "connection received" and "connection authenticated".

# select version();
                                                              version

-----------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 15.6 (Ubuntu 15.6-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04)
11.4.0,64-bit 
(1 ligne)

I know, this is not the latest minor version.

# \! cat /etc/os-release | head -1
PRETTY_NAME="Ubuntu 22.04.4 LTS"

What we see in the log is this (around 4 seconds delay):

2024-05-07 15:29:50.244 CEST [369909] LOG:  connection received: host=xx.xx.xx.100 port=48434
2024-05-07 15:29:54.518 CEST [369909] LOG:  connection authenticated: identity="xxxxxx" method=md5
(/etc/postgresql/15/main/pg_hba.conf:121)

The matching line is this ( I know md5 ):
host    all             xxxxx     xx.xx.xx.0/24            md5

What we've found out so far is, that this only happens if we have a localhost(or any other hostname) line before the
linematching our connection, something like this: 
host    replication     xxxxx     localhost                md5
host    all             xxxxx     xx.xx.xx.0/24            md5

We don't see it if we do it like this:
host    all             xxxxx     xx.xx.xx.0/24            md5
host    replication     xxxxx     localhost                md5

Has anyone experienced such a behavior? It seems clear that this is somehow related to name resolution but we couldn't
reproducesomething like this on the OS using dig (in a loop several hundred of times). It is also only happening from
timeto time, and not constantly. 

What we're basically looking for is a way to prove the assumption without involving PostgreSQL at all, if that does
makesense? 

Many thanks in advance
Daniel



On 10/29/24 09:30, Daniel Westermann (DWE) wrote:
> Delays between "connection received" and "connection authenticated" because of localhost entries in hba
> 
> Hi,
> 
> we're facing a strange issue with delays between "connection received" and "connection authenticated".
> 
> # select version();
>                                                                version
>
-----------------------------------------------------------------------------------------------------------------------------------
>   PostgreSQL 15.6 (Ubuntu 15.6-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04)
11.4.0,64-bit
 
> (1 ligne)
> 
> I know, this is not the latest minor version.
> 
> # \! cat /etc/os-release | head -1
> PRETTY_NAME="Ubuntu 22.04.4 LTS"
> 
> What we see in the log is this (around 4 seconds delay):
> 
> 2024-05-07 15:29:50.244 CEST [369909] LOG:  connection received: host=xx.xx.xx.100 port=48434
> 2024-05-07 15:29:54.518 CEST [369909] LOG:  connection authenticated: identity="xxxxxx" method=md5
(/etc/postgresql/15/main/pg_hba.conf:121)

1) host=xx.xx.xx.100 = localhost?

2) In postgresql.conf what is listen_addresses set to?

3) What are the settings in /etc/hosts?

> 
> The matching line is this ( I know md5 ):
> host    all             xxxxx     xx.xx.xx.0/24            md5
> 
> What we've found out so far is, that this only happens if we have a localhost(or any other hostname) line before the
linematching our connection, something like this:
 
> host    replication     xxxxx     localhost                md5
> host    all             xxxxx     xx.xx.xx.0/24            md5
> 
> We don't see it if we do it like this:
> host    all             xxxxx     xx.xx.xx.0/24            md5
> host    replication     xxxxx     localhost                md5
> 
> Has anyone experienced such a behavior? It seems clear that this is somehow related to name resolution but we
couldn'treproduce something like this on the OS using dig (in a loop several hundred of times). It is also only
happeningfrom time to time, and not constantly.
 
> 
> What we're basically looking for is a way to prove the assumption without involving PostgreSQL at all, if that does
makesense?
 
> 
> Many thanks in advance
> Daniel
> 
> 

-- 
Adrian Klaver
adrian.klaver@aklaver.com




"Daniel Westermann (DWE)" <daniel.westermann@dbi-services.com> writes:
> we're facing a strange issue with delays between "connection received" and "connection authenticated".

> Has anyone experienced such a behavior? It seems clear that this is somehow related to name resolution but we
couldn'treproduce something like this on the OS using dig (in a loop several hundred of times). It is also only
happeningfrom time to time, and not constantly. 

FWIW, I do have vague memories of other complaints like this, although
not of whether solutions were found.  It was awhile ago though.
It might be worth searching our mailing list archives.

            regards, tom lane



> 1) host=xx.xx.xx.100 = localhost?

No, it is not

> 2) In postgresql.conf what is listen_addresses set to?

*

> 3) What are the settings in /etc/hosts?

Standard Ubuntu settings:

127.0.0.1       localhost
127.0.1.1       abc      cde

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Regards
Daniel



On 10/29/24 10:42 AM, Daniel Westermann (DWE) wrote:
>> 1) host=xx.xx.xx.100 = localhost?
> 
> No, it is not

Is the localhost machine on the same network?

Is the client connection string using IPV4 address or host name?

> 
>> 2) In postgresql.conf what is listen_addresses set to?
> 
> *
> 
>> 3) What are the settings in /etc/hosts?
> 
> Standard Ubuntu settings:
> 
> 127.0.0.1       localhost
> 127.0.1.1       abc      cde
> 
> # The following lines are desirable for IPv6 capable hosts
> ::1     localhost ip6-localhost ip6-loopback
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters

Contents of resolv.conf?

Would it be possible to crank the logging detail up?

> 
> Regards
> Daniel

-- 
Adrian Klaver
adrian.klaver@aklaver.com




missed the mailing list.


On 10/29/24 11:20 AM, Vijaykumar Jain wrote:
> 
> missed the mailing list.

Can you provide a more complete explanation for above?

-- 
Adrian Klaver
adrian.klaver@aklaver.com




On Tue, 29 Oct 2024 at 23:43, Vijaykumar Jain <vijaykumarjain.github@gmail.com> wrote:


On Tue, 29 Oct 2024 at 22:00, Daniel Westermann (DWE) <daniel.westermann@dbi-services.com> wrote:
Delays between "connection received" and "connection authenticated" because of localhost entries in hba

Hi,

we're facing a strange issue with delays between "connection received" and "connection authenticated".

# select version();
                                                              version
-----------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 15.6 (Ubuntu 15.6-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
(1 ligne)

I know, this is not the latest minor version.

# \! cat /etc/os-release | head -1
PRETTY_NAME="Ubuntu 22.04.4 LTS"

What we see in the log is this (around 4 seconds delay):

2024-05-07 15:29:50.244 CEST [369909] LOG:  connection received: host=xx.xx.xx.100 port=48434
2024-05-07 15:29:54.518 CEST [369909] LOG:  connection authenticated: identity="xxxxxx" method=md5 (/etc/postgresql/15/main/pg_hba.conf:121)


Just before we get into os and name resolution stuff..

1) if the database usage is really heavy with schema churn  and or heavy temp usage. can the catalog bloat result in slowless of pg_auth* objects resulting in slow authentication?
 a general check used to be checking the size of the catalog objects and if needed vacuum full in single user mode .... to speed up auth lookup.

i tried with pg latest on ubuntu 22, but did not see any issues. 


So to simulate a similar issue, i understand once connected, it will look up pg_auth* objects cascading to other catalogs for resolution ...
in one session:
 PGPASSWORD=1234 psql -h localhost -p 5432 -d postgres -U postgres -c 'select 1'

in another session

postgres@ubuntu:~$ cat /tmp/db1/postgresql.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
trace_locks = 'on'
trace_lock_table = 1260


postgres=# select 'pg_authid'::regclass::oid;
 oid
------
 1260
(1 row)
postgres=*# lock table pg_authid in access exclusive mode;
LOCK TABLE
postgres=*# rollback;
ROLLBACK


i tried to trace which locks are held when a connection is established after auth.

for example pg_authid , so i enable tracing on this object, and to mock the delay i try to lock the table in access exclusive mode ...
note, this may not be the scenario, but i am trying to rule out pg related delay due to bloat or locking 


postgres@ubuntu:/tmp$ time PGPASSWORD=1234 psql -h localhost -p 5432 -d postgres -U postgres -c 'select 1'
 ?column?
----------
        1
(1 row)


real    0m7.503s  - -delay
user    0m0.001s
sys     0m0.005s

postgres@ubuntu:/tmp$ #after rollback of pg_authid


from the logs
2024-10-29 18:04:46.411 UTC [localhost(33424)] [62661] LOG:  connection received: host=localhost port=33424
2024-10-29 18:04:53.906 UTC [localhost(33424)] [62661] LOG:  connection authenticated: identity="postgres" method=md5 (/tmp/db1/pg_hba.conf:120)


what i tried here was, delay auth but after connection established till lock held, which can be due to bloated catalog objects or long locks due to temp objects churn?

still it does not explain how the order of pg_hba deals with the delay ? 

 anyways, fyi to rule out :)

--
Thanks,
Vijay

Open to work
Resume - Vijaykumar Jain


sorry,  sharing again.

On Tue, 2024-10-29 at 16:30 +0000, Daniel Westermann (DWE) wrote:
Delays between "connection received" and "connection authenticated" because of localhost entries in hba

Maybe check that "files" is the first mechanism for hosts lookups in nsswitch.conf.

afaik dig doesn't follow the same name resolution path as glibc.


On Wed, 30 Oct 2024 at 00:04, Alan Hodgson <ahodgson@lists.simkin.ca> wrote:
On Tue, 2024-10-29 at 16:30 +0000, Daniel Westermann (DWE) wrote:
Delays between "connection received" and "connection authenticated" because of localhost entries in hba

Maybe check that "files" is the first mechanism for hosts lookups in nsswitch.conf.

afaik dig doesn't follow the same name resolution path as glibc.


a connection received should mean resolution was done, right ? am i missing something?
I could try adding delay with traffic control, but then if a connection  established is logged in the log, then name resolution is ok ?

or the syscall to md5 is slow , does it work faster for trust ? just saying :) 

--
Thanks,
Vijay

Open to work
Resume - Vijaykumar Jain
On Wed, 2024-10-30 at 00:15 +0530, Vijaykumar Jain wrote:


On Wed, 30 Oct 2024 at 00:04, Alan Hodgson <ahodgson@lists.simkin.ca> wrote:
On Tue, 2024-10-29 at 16:30 +0000, Daniel Westermann (DWE) wrote:
Delays between "connection received" and "connection authenticated" because of localhost entries in hba

Maybe check that "files" is the first mechanism for hosts lookups in nsswitch.conf.

afaik dig doesn't follow the same name resolution path as glibc.


a connection received should mean resolution was done, right ? am i missing something?

Well, I'm just throwing out something to check, but I meant on the server, not the client. It's trying to resolve "localhost" to check the auth. If there's a delay, that's probably where it is.

I guess I just assumed that's where you were testing dig.

On Wed, 30 Oct 2024 at 00:40, Alan Hodgson <ahodgson@lists.simkin.ca> wrote:
Well, I'm just throwing out something to check, but I meant on the server, not the client. It's trying to resolve "localhost" to check the auth. If there's a delay, that's probably where it is.

I guess I just assumed that's where you were testing dig.

Yep. my mistake. i did not read it correctly. in order to discard the rule, it first has to find out by expanding the hosts in the rule, so yep, dns resolution might be the problem.

also, the 4-5s delay is eerie similar to issues in the past.

maybe disabling ipv6 completely resolves this  for the time being.



--
Thanks,
Vijay

Open to work
Resume - Vijaykumar Jain
>Maybe check that "files" is the first mechanism for hosts lookups in nsswitch.conf.

It is
>>Maybe check that "files" is the first mechanism for hosts lookups in nsswitch.conf.

>afaik dig doesn't follow the same name resolution path as glibc.

I don't know. Is there a way to test the glibc way easily?

> a connection received should mean resolution was done, right ? am i missing something?

>Well, I'm just throwing out something to check, but I meant on the server, not the client. It's trying to resolve
>"localhost"to check the auth. If there's a delay, that's probably where it is. 

>I guess I just assumed that's where you were testing dig.

Yes

Regards
Daniel


>It might be worth searching our mailing list archives.

I did that, but was not able to find anything useful related to this.

Regards
Daniel


On Tue, 2024-10-29 at 16:30 +0000, Daniel Westermann (DWE) wrote:
> we're facing a strange issue with delays between "connection received" and "connection authenticated".
>
> # select version();
>                                                               version
>
-----------------------------------------------------------------------------------------------------------------------------------
>  PostgreSQL 15.6 (Ubuntu 15.6-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04)
11.4.0,64-bit 
> (1 ligne)
>
> I know, this is not the latest minor version.
>
> # \! cat /etc/os-release | head -1
> PRETTY_NAME="Ubuntu 22.04.4 LTS"
>
> What we see in the log is this (around 4 seconds delay):
>
> 2024-05-07 15:29:50.244 CEST [369909] LOG:  connection received: host=xx.xx.xx.100 port=48434
> 2024-05-07 15:29:54.518 CEST [369909] LOG:  connection authenticated: identity="xxxxxx" method=md5
(/etc/postgresql/15/main/pg_hba.conf:121)
>
> The matching line is this ( I know md5 ):
> host    all             xxxxx     xx.xx.xx.0/24            md5
>
> What we've found out so far is, that this only happens if we have a localhost(or any other hostname) line before the
linematching our connection, something like this: 
> host    replication     xxxxx     localhost                md5
> host    all             xxxxx     xx.xx.xx.0/24            md5

I'd "strace" such a database connection to see where the time is spent.

Yours,
Laurenz Albe

>> What we've found out so far is, that this only happens if we have a localhost(or any other hostname) line before the
linematching our connection, something like this: 
>> host    replication     xxxxx     localhost                md5
>> host    all             xxxxx     xx.xx.xx.0/24            md5

... which is quite hard to do if you don't know which one it will be

Regards
Daniel


I'd echo the suggestion to strace this. You can use the pre_auth_delay setting to help facilitate that. See:


Cheers,
Greg

Greg Sabino Mullane <htamfids@gmail.com> writes:
> I'd echo the suggestion to strace this. You can use the pre_auth_delay
> setting to help facilitate that. See:

IIUC, the delays are rare and unpredictable, so that strace'ing seems
unlikely to be practical.

If rebuilding from source is feasible, you could insert monitoring
elog(LOG) calls around the name lookup (that is, around the
pg_getnameinfo_all call in src/backend/libpq/hba.c).  This would
confirm or disprove the theory that the name lookup is the source
of the issue, which is about as much as we'd get out of strace.

            regards, tom lane



>>Greg Sabino Mullane <htamfids@gmail.com> writes:
>> I'd echo the suggestion to strace this. You can use the pre_auth_delay
>> setting to help facilitate that. See:

>IIUC, the delays are rare and unpredictable, so that strace'ing seems
>unlikely to be practical.

>If rebuilding from source is feasible, you could insert monitoring
>elog(LOG) calls around the name lookup (that is, around the
>pg_getnameinfo_all call in src/backend/libpq/hba.c).  This would
>confirm or disprove the theory that the name lookup is the source
>of the issue, which is about as much as we'd get out of strace.

Thanks for all your tips/hints

Regards
Daniel