GNU bug report logs - #49449
28: TLS connection never gets to "open" stage

Previous Next

Package: emacs;

Reported by: Mattias Engdegård <mattiase <at> acm.org>

Date: Tue, 6 Jul 2021 19:42:02 UTC

Severity: normal

Done: Mattias Engdegård <mattiase <at> acm.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 49449 in the body.
You can then email your comments to 49449 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Tue, 06 Jul 2021 19:42:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mattias Engdegård <mattiase <at> acm.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Tue, 06 Jul 2021 19:42:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: bug-gnu-emacs <at> gnu.org
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>
Subject: 28: TLS connection never gets to "open" stage
Date: Tue, 6 Jul 2021 21:12:39 +0200
It seems that the process/TLS apparatus can get into a state where it is unable to ever call the process sentinel "open\n" event; specifically, an async `url-https` does not make forward progress until the connection times out.

This has been observed on macOS in Emacs 28 now and then, and it looks like there is an invariant violation somewhere. To recap:

The "open\n" event is sent to the sentinel in either of two places:

(A) In finish_after_tls_connection, after having successfully called nsm-verify-connection and the condition

  (fd_callback_info[p->outfd].flags & NON_BLOCKING_CONNECT_FD) == 0

being satisfied (process.c:3277).

(B) In wait_reading_process_output, after the descriptor being found writable by `select` and the condition

    NILP (p->gnutls_boot_parameters) && !p->gnutls_p

being satisfied (process.c:5900).

There seems to be a gap in the logic, however: it is perfectly possible for the condition in (A) to fail because the descriptor is still marked nonblocking at that point, and for (B) to fail because gnutls_p=true was set already in gnutls_try_handshake.

Lars, it looks like you wrote at least part of the original logic. Can you see what is going on? It is somewhat complex.

For reference, I'm using the reproduction recipe below; it may or may not exhibit the problem in your particular setup. I'm using gnutls 3.6.15.

(defun busy-wait (s)
  (let ((t0 (current-time)))
    (while (< (time-to-seconds (time-since t0)) s) nil)))

(progn
  (url-http
   #s(url "https" nil nil "elpa.gnu.org" nil "/packages/archive-contents" nil nil t silent t t)
   (lambda (status) (message "callback: status = %S" status))
   '(nil) nil 'tls)
  (busy-wait 1.0))






Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Wed, 07 Jul 2021 19:58:01 GMT) Full text and rfc822 format available.

Message #8 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 49449 <at> debbugs.gnu.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Wed, 07 Jul 2021 21:57:23 +0200
Mattias Engdegård <mattiase <at> acm.org> writes:

> (B) In wait_reading_process_output, after the descriptor being found writable by `select` and the condition
>
>     NILP (p->gnutls_boot_parameters) && !p->gnutls_p
>
> being satisfied (process.c:5900).
>
> There seems to be a gap in the logic, however: it is perfectly
> possible for the condition in (A) to fail because the descriptor is
> still marked nonblocking at that point, and for (B) to fail because
> gnutls_p=true was set already in gnutls_try_handshake.
>
> Lars, it looks like you wrote at least part of the original logic. Can
> you see what is going on? It is somewhat complex.

Yes, it's grown somewhat organically.  :-/

> For reference, I'm using the reproduction recipe below; it may or may not exhibit the problem in your particular setup. I'm using gnutls 3.6.15.
>
> (defun busy-wait (s)
>   (let ((t0 (current-time)))
>     (while (< (time-to-seconds (time-since t0)) s) nil)))
>
> (progn
>   (url-http
>    #s(url "https" nil nil "elpa.gnu.org" nil "/packages/archive-contents" nil nil t silent t t)
>    (lambda (status) (message "callback: status = %S" status))
>    '(nil) nil 'tls)
>   (busy-wait 1.0))

I'm not able to reproduce this on Debian/bullseye, but on Macos I get

callback: status = (:error (error connection-failed "connect" :host "elpa.gnu.o\
rg" :service 443))

after a while.  There's been several reports in the last week of TLS not
working on Macos.  Has Apple pushed something new, or...  did something
else happen lately in this area on Macos?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Thu, 08 Jul 2021 08:00:02 GMT) Full text and rfc822 format available.

Message #11 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 49449 <at> debbugs.gnu.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Thu, 8 Jul 2021 09:59:26 +0200
7 juli 2021 kl. 21.57 skrev Lars Ingebrigtsen <larsi <at> gnus.org>:

> Yes, it's grown somewhat organically.  :-/

Let me first say that the state of the code is not your fault! It's a product, as you say, from organic growth, and it does need a rewrite.

> I'm not able to reproduce this on Debian/bullseye, but on Macos I get
> 
> callback: status = (:error (error connection-failed "connect" :host "elpa.gnu.o\
> rg" :service 443))

Yes, that is my observation too. Obviously the busy-wait part is essential: removing it makes the problem go away.
Essentially, the busy-wait postpones the call to wait_reading_process_output so that when it is eventually called, gnutls_handshake succeeds on the first try instead of first returning GNUTLS_E_AGAIN, which brings us onto a different code path.

> There's been several reports in the last week of TLS not
> working on Macos.  Has Apple pushed something new, or...  did something
> else happen lately in this area on Macos?

No, I've been harassed by this bug for quite some time but only now decided to dig deeper. Most likely it's just a matter of different timing that the process/TLS system doesn't cope with.

First, when the `url-http` call returns we have a Lisp_Process with

 gnutls_p = true
 gnutls_boot_parameters = non-nil
 gnutls_initstage = GNUTLS_STAGE_HANDSHAKE_TRIED (8)

and its file descriptor has a corresponding fd_callback_data with
 flags = FOR_WRITE | NON_BLOCKING_CONNECT_FD

because the asynchronous connect call has not yet been completed.

In the GOOD case (without busy-wait), `wait_reading_process_output` gets called right away (because Emacs has nothing else to do) and gnutls_try_handshake initially fails with E_AGAIN but p->outfd becomes writable so `delete_write_fd` is called to zero the fd_callback_data flags, and when the handshake eventually succeeds, the sentinel is called with the "open\n" event.

In the BAD case (with busy-wait), the TLS handshake succeeds right away while the descriptor flags still has NON_BLOCKING_CONNECT_FD set, so the sentinel isn't called.

Does this jog any memories?





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Thu, 08 Jul 2021 12:55:01 GMT) Full text and rfc822 format available.

Message #14 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 49449 <at> debbugs.gnu.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Thu, 08 Jul 2021 14:54:39 +0200
Mattias Engdegård <mattiase <at> acm.org> writes:

> In the BAD case (with busy-wait), the TLS handshake succeeds right
> away while the descriptor flags still has NON_BLOCKING_CONNECT_FD set,
> so the sentinel isn't called.
>
> Does this jog any memories?

Sorry, nope.  :-/

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Thu, 08 Jul 2021 16:48:02 GMT) Full text and rfc822 format available.

Message #17 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 49449 <at> debbugs.gnu.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Thu, 8 Jul 2021 18:47:11 +0200
[Message part 1 (text/plain, inline)]
8 juli 2021 kl. 14.54 skrev Lars Ingebrigtsen <larsi <at> gnus.org>:

> Sorry, nope.  :-/

I was thinking about 9c74f2fea6bf which introduced the condition of the wait mask, and maybe ac6e085cf6b2 where async TLS negotiation was added.

A hack like the one below would "solve" the problem by pretending that the first handshake attempt always fails with E_AGAIN, but it would be better to do it in a slightly more principled way.

[tls-handshake-hack.diff (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sat, 10 Jul 2021 16:28:02 GMT) Full text and rfc822 format available.

Message #20 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 49449 <at> debbugs.gnu.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sat, 10 Jul 2021 18:27:22 +0200
Mattias Engdegård <mattiase <at> acm.org> writes:

> A hack like the one below would "solve" the problem by pretending that
> the first handshake attempt always fails with E_AGAIN, but it would be
> better to do it in a slightly more principled way.

Hm...  does reverting 234bf1b6363a3d5db8e73c422d87a0bf1aa4b2e3 help any
here?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sat, 10 Jul 2021 16:52:01 GMT) Full text and rfc822 format available.

Message #23 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 49449 <at> debbugs.gnu.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sat, 10 Jul 2021 18:51:08 +0200
10 juli 2021 kl. 18.27 skrev Lars Ingebrigtsen <larsi <at> gnus.org>:

> Hm...  does reverting 234bf1b6363a3d5db8e73c422d87a0bf1aa4b2e3 help any
> here?

No, sorry. (Was worth a shot, though!)





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sat, 10 Jul 2021 16:58:02 GMT) Full text and rfc822 format available.

Message #26 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 49449 <at> debbugs.gnu.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sat, 10 Jul 2021 18:57:45 +0200
Mattias Engdegård <mattiase <at> acm.org> writes:

> 10 juli 2021 kl. 18.27 skrev Lars Ingebrigtsen <larsi <at> gnus.org>:
>
>> Hm...  does reverting 234bf1b6363a3d5db8e73c422d87a0bf1aa4b2e3 help any
>> here?
>
> No, sorry. (Was worth a shot, though!)

Darn!  Well, back to your patch:

-  while ((ret = gnutls_handshake (state)) < 0)
+  if (non_blocking && proc->gnutls_handshakes_tried < 1)
     {
-      if (emacs_gnutls_handle_error (state, ret) == 0) /* fatal */
-	break;
+      /* HACK: don't succeed the first time for nonblocking connections,
+	 because the logic doesn't allow it (bug#49449).  */
+      ret = GNUTLS_E_AGAIN;

I'm not super enthusiastic about this -- it'll artificially make
connections slower?  The problem only seems to occur on Macos, so
it'd be interesting to find out why it doesn't happen on Linux...

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sat, 10 Jul 2021 17:08:01 GMT) Full text and rfc822 format available.

Message #29 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 49449 <at> debbugs.gnu.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sat, 10 Jul 2021 19:07:18 +0200
10 juli 2021 kl. 18.57 skrev Lars Ingebrigtsen <larsi <at> gnus.org>:

> I'm not super enthusiastic about this -- it'll artificially make
> connections slower?  The problem only seems to occur on Macos, so
> it'd be interesting to find out why it doesn't happen on Linux...

I'm not fond about it either; it's just a hack. I'm not sure if it makes connections slower or not (if so, it's 100 ms).

We really should seek a sound understanding of the bug so that a proper solution can be found. However the patch is important in that it does seem to work (and consistently so) and could assist us in developing a theory.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sat, 10 Jul 2021 18:24:02 GMT) Full text and rfc822 format available.

Message #32 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 49449 <at> debbugs.gnu.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sat, 10 Jul 2021 20:23:33 +0200
>  The problem only seems to occur on Macos, so
> it'd be interesting to find out why it doesn't happen on Linux...

Now I've had access to a Linux machine, and it seems that it works because it has getaddrinfo_a (async DNS lookup) so the path taken is different. If HAVE_GETADDRINFO_A is #undef'ed, it seems to behave like macOS and fail in the same way if busy-waiting for one second after the url-http call. Can you confirm?





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sat, 10 Jul 2021 18:56:02 GMT) Full text and rfc822 format available.

Message #35 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 49449 <at> debbugs.gnu.org, larsi <at> gnus.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sat, 10 Jul 2021 21:54:43 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sat, 10 Jul 2021 20:23:33 +0200
> Cc: 49449 <at> debbugs.gnu.org
> 
> >  The problem only seems to occur on Macos, so
> > it'd be interesting to find out why it doesn't happen on Linux...
> 
> Now I've had access to a Linux machine, and it seems that it works because it has getaddrinfo_a (async DNS lookup) so the path taken is different. If HAVE_GETADDRINFO_A is #undef'ed, it seems to behave like macOS and fail in the same way if busy-waiting for one second after the url-http call. Can you confirm?

The busy-wait loop can delay the sentinel call, but do you understand
why the sentinel isn't called once the loop is over?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sat, 10 Jul 2021 19:23:02 GMT) Full text and rfc822 format available.

Message #38 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 49449 <at> debbugs.gnu.org, larsi <at> gnus.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sat, 10 Jul 2021 21:22:09 +0200
10 juli 2021 kl. 20.54 skrev Eli Zaretskii <eliz <at> gnu.org>:

> The busy-wait loop can delay the sentinel call, but do you understand
> why the sentinel isn't called once the loop is over?

Sort of; see previous discussion. In short: once the busy-wait loop is over, the TLS handshake succeeds immediately (because sufficient time has passed) but the logic isn't set up for this path and enters a state where the sentinel cannot be called.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sat, 10 Jul 2021 19:32:02 GMT) Full text and rfc822 format available.

Message #41 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 49449 <at> debbugs.gnu.org, larsi <at> gnus.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sat, 10 Jul 2021 22:31:25 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sat, 10 Jul 2021 21:22:09 +0200
> Cc: larsi <at> gnus.org, 49449 <at> debbugs.gnu.org
> 
> 10 juli 2021 kl. 20.54 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> > The busy-wait loop can delay the sentinel call, but do you understand
> > why the sentinel isn't called once the loop is over?
> 
> Sort of; see previous discussion. In short: once the busy-wait loop is over, the TLS handshake succeeds immediately (because sufficient time has passed) but the logic isn't set up for this path and enters a state where the sentinel cannot be called.

That's my question: why cannot the sentinel be called in this case?
what prevents it from being called?  Apologies if I missed the
explanation, and could you please repeat it?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sat, 10 Jul 2021 19:45:02 GMT) Full text and rfc822 format available.

Message #44 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 49449 <at> debbugs.gnu.org, larsi <at> gnus.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sat, 10 Jul 2021 21:44:21 +0200
10 juli 2021 kl. 21.31 skrev Eli Zaretskii <eliz <at> gnu.org>:

> That's my question: why cannot the sentinel be called in this case?
> what prevents it from being called?

In the failing case, wait_reading_process_output calls gnutls_try_handshake early on, which succeeds and this leads to finish_after_tls_connection being called. Here, we have the condition

  else if ((fd_callback_info[p->outfd].flags & NON_BLOCKING_CONNECT_FD) == 0)

which gates further progress, but this condition is false because the flags have NON_BLOCKING_CONNECT_FD set.

In the successful case, the first call to gnutls_try_handshake from wait_reading_process_output fails because things haven't had the time to be set up yet. This leads to a select being called on the socket for writing (since it's in a nonblocking connect), and when ready, the NON_BLOCKING_CONNECT_FD bit is cleared from the flags.

This is a simplified view. The state is clearly more complex and things need to be done in the proper order.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sat, 10 Jul 2021 20:06:01 GMT) Full text and rfc822 format available.

Message #47 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>, Eli Zaretskii <eliz <at> gnu.org>
Cc: 49449 <at> debbugs.gnu.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sat, 10 Jul 2021 22:05:18 +0200
[Message part 1 (text/plain, inline)]
Here is a slightly more principled solution. It blocks attempts at TLS handshaking until the nonblocking connect has actually been established, since it's pointless to go on beforehand.

[tls-connect.diff (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sun, 11 Jul 2021 06:50:02 GMT) Full text and rfc822 format available.

Message #50 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 49449 <at> debbugs.gnu.org, larsi <at> gnus.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sun, 11 Jul 2021 09:49:03 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sat, 10 Jul 2021 21:44:21 +0200
> Cc: larsi <at> gnus.org, 49449 <at> debbugs.gnu.org
> 
> 10 juli 2021 kl. 21.31 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> > That's my question: why cannot the sentinel be called in this case?
> > what prevents it from being called?
> 
> In the failing case, wait_reading_process_output calls gnutls_try_handshake early on, which succeeds and this leads to finish_after_tls_connection being called. Here, we have the condition
> 
>   else if ((fd_callback_info[p->outfd].flags & NON_BLOCKING_CONNECT_FD) == 0)
> 
> which gates further progress, but this condition is false because the flags have NON_BLOCKING_CONNECT_FD set.

Thanks.  A potentially silly question: why not reset the
NON_BLOCKING_CONNECT_FD bit before we call finish_after_tls_connection
from that place?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sun, 11 Jul 2021 07:43:02 GMT) Full text and rfc822 format available.

Message #53 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 49449 <at> debbugs.gnu.org, larsi <at> gnus.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sun, 11 Jul 2021 09:42:34 +0200
11 juli 2021 kl. 08.49 skrev Eli Zaretskii <eliz <at> gnu.org>:

> why not reset the
> NON_BLOCKING_CONNECT_FD bit before we call finish_after_tls_connection
> from that place?

That's tantamount to jamming a metal screwdriver into the condition on line 3277, which was indeed the first thing I tried. Unfortunately it doesn't work -- after the sentinel is called, no further progress is made, probably because we (by lying) haven't set up the connection properly.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sun, 11 Jul 2021 10:15:01 GMT) Full text and rfc822 format available.

Message #56 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 49449 <at> debbugs.gnu.org, larsi <at> gnus.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sun, 11 Jul 2021 13:14:16 +0300
> Feedback-ID:mattiase <at> acm.or
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sun, 11 Jul 2021 09:42:34 +0200
> Cc: larsi <at> gnus.org, 49449 <at> debbugs.gnu.org
> 
> 11 juli 2021 kl. 08.49 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> > why not reset the
> > NON_BLOCKING_CONNECT_FD bit before we call finish_after_tls_connection
> > from that place?
> 
> That's tantamount to jamming a metal screwdriver into the condition on line 3277, which was indeed the first thing I tried. Unfortunately it doesn't work -- after the sentinel is called, no further progress is made, probably because we (by lying) haven't set up the connection properly.

Did you succeed in understanding what else has to happen before that
flag could be safely reset?

And anyway, if those conditions are not yet set, I wonder why are we
calling finish_after_tls_connection at that place?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sun, 11 Jul 2021 11:31:01 GMT) Full text and rfc822 format available.

Message #59 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 49449 <at> debbugs.gnu.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sun, 11 Jul 2021 13:29:48 +0200
Mattias Engdegård <mattiase <at> acm.org> writes:

> If HAVE_GETADDRINFO_A is #undef'ed, it seems to behave like
> macOS and fail in the same way if busy-waiting for one second after
> the url-http call. Can you confirm?

With the test case and HAVE_GETADDRINFO_A undeffed, I'm still not able
to reproduce the problem on Debian/bullseye.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sun, 11 Jul 2021 11:32:02 GMT) Full text and rfc822 format available.

Message #62 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 49449 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sun, 11 Jul 2021 13:31:10 +0200
Mattias Engdegård <mattiase <at> acm.org> writes:

> Here is a slightly more principled solution. It blocks attempts at TLS
> handshaking until the nonblocking connect has actually been
> established, since it's pointless to go on beforehand.

[...]

> -		    && p->is_non_blocking_client)
> +		    && p->is_non_blocking_client
> +		    /* Don't proceed until we have established a connection. */
> +		    && !(fd_callback_info[p->outfd].flags
> +			 & NON_BLOCKING_CONNECT_FD))

Yeah, I think that's a sensible change (but I haven't tried it).

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sun, 11 Jul 2021 14:27:01 GMT) Full text and rfc822 format available.

Message #65 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 49449 <at> debbugs.gnu.org, larsi <at> gnus.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sun, 11 Jul 2021 16:26:47 +0200
[Message part 1 (text/plain, inline)]
11 juli 2021 kl. 12.14 skrev Eli Zaretskii <eliz <at> gnu.org>:

> Did you succeed in understanding what else has to happen before that
> flag could be safely reset?

Yes. The TCP connection needs to be established, the socket descriptor removed from write monitoring (because it is now connected) and added to read monitoring (so that we can get incoming traffic).

This suggests a second solution: remove the condition on line 3277 on the grounds that since the TLS handshake succeeded, we are evidently connected; then remove the write monitoring and add the read monitoring before calling the sentinel:

[alt.diff (application/octet-stream, attachment)]
[Message part 3 (text/plain, inline)]
> And anyway, if those conditions are not yet set, I wonder why are we
> calling finish_after_tls_connection at that place?

There's no harm in calling `gnutls_handshake`; it will just return E_AGAIN if the connection hasn't been established. On the other hand there's little point in doing so until we have a connection.

Which suggests a third solution: do the handshake right away after establishing the connection. That would go into the code somewhere before line 5900, which right now is a condition that I don't quite understand. I think Lars wrote it but apparently forgot all about it (happens to everyone once in a while).

I still favour the less intrusive patch posted previously (adding a condition at line 5235) since it avoids duplication; there is already far too much of that in the code (everything seems to be done in at least two places). The code is obviously in the need of restructuring, but we shouldn't conflate that effort with fixing this specific bug.


Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sun, 11 Jul 2021 14:29:01 GMT) Full text and rfc822 format available.

Message #68 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 49449 <at> debbugs.gnu.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sun, 11 Jul 2021 16:28:00 +0200
11 juli 2021 kl. 13.29 skrev Lars Ingebrigtsen <larsi <at> gnus.org>:

> With the test case and HAVE_GETADDRINFO_A undeffed, I'm still not able
> to reproduce the problem on Debian/bullseye.

Hard to say what's going on since the problem is somewhat timing-sensitive, but it's easy enough to sprinkle printfs for finding out what code paths are taken.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Sun, 11 Jul 2021 15:03:02 GMT) Full text and rfc822 format available.

Message #71 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 49449 <at> debbugs.gnu.org, larsi <at> gnus.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Sun, 11 Jul 2021 18:01:52 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sun, 11 Jul 2021 16:26:47 +0200
> Cc: larsi <at> gnus.org, 49449 <at> debbugs.gnu.org
> 
> 
> [1:text/plain Hide]
> 
> 11 juli 2021 kl. 12.14 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> > Did you succeed in understanding what else has to happen before that
> > flag could be safely reset?
> 
> Yes. The TCP connection needs to be established, the socket descriptor removed from write monitoring (because it is now connected) and added to read monitoring (so that we can get incoming traffic).
> 
> This suggests a second solution: remove the condition on line 3277 on the grounds that since the TLS handshake succeeded, we are evidently connected; then remove the write monitoring and add the read monitoring before calling the sentinel:
> 
> 
> [2:application/octet-stream Show Save:alt.diff (649B)]
> 
> 
> [3:text/plain Hide]
> 
> 
> > And anyway, if those conditions are not yet set, I wonder why are we
> > calling finish_after_tls_connection at that place?
> 
> There's no harm in calling `gnutls_handshake`; it will just return E_AGAIN if the connection hasn't been established. On the other hand there's little point in doing so until we have a connection.
> 
> Which suggests a third solution: do the handshake right away after establishing the connection. That would go into the code somewhere before line 5900, which right now is a condition that I don't quite understand. I think Lars wrote it but apparently forgot all about it (happens to everyone once in a while).

Thanks for the explanations.

> I still favour the less intrusive patch posted previously (adding a condition at line 5235) since it avoids duplication; there is already far too much of that in the code (everything seems to be done in at least two places). The code is obviously in the need of restructuring, but we shouldn't conflate that effort with fixing this specific bug.

I tend to agree.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Mon, 12 Jul 2021 14:59:01 GMT) Full text and rfc822 format available.

Message #74 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 49449 <at> debbugs.gnu.org, larsi <at> gnus.org
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Mon, 12 Jul 2021 16:57:00 +0200
[Message part 1 (text/plain, inline)]
11 juli 2021 kl. 17.01 skrev Eli Zaretskii <eliz <at> gnu.org>:

>> I still favour the less intrusive patch posted previously (adding a condition at line 5235) since it avoids duplication; there is already far too much of that in the code (everything seems to be done in at least two places). The code is obviously in the need of restructuring, but we shouldn't conflate that effort with fixing this specific bug.
> 
> I tend to agree.

Attached is the patch that I intend to push if there are no objections. The actual change is the same as before and I anticipate no trouble arising from it but tests are usually more fragile.

This issue could very well be the root cause of or at least connected to other bugs: maybe bug#36017 or bug#34341? In any case it's good to see it fixed; it annoyed me (with GNU ELPA in particular) for quite some time and the various unsatisfactory workarounds suggested each time this came up (such as using HTTP instead of HTTPS) are no longer required.

[0001-Block-TLS-handshake-until-TCP-connection-established.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Mon, 12 Jul 2021 15:04:02 GMT) Full text and rfc822 format available.

Message #77 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 49449 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Mon, 12 Jul 2021 17:02:50 +0200
Mattias Engdegård <mattiase <at> acm.org> writes:

> Attached is the patch that I intend to push if there are no
> objections. The actual change is the same as before and I anticipate
> no trouble arising from it but tests are usually more fragile.

Looks good to me.

> This issue could very well be the root cause of or at least connected
> to other bugs: maybe bug#36017 or bug#34341? 

bug#36017 looks similar, at least, so it'd be good to poke that bug and
ask whether this patch fixes that problem, too.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#49449; Package emacs. (Tue, 13 Jul 2021 17:10:01 GMT) Full text and rfc822 format available.

Message #80 received at 49449 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 49449 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>
Subject: Re: bug#49449: 28: TLS connection never gets to "open" stage
Date: Tue, 13 Jul 2021 19:08:57 +0200
12 juli 2021 kl. 17.02 skrev Lars Ingebrigtsen <larsi <at> gnus.org>:

> Looks good to me.

Thanks, pushed to master.

> bug#36017 looks similar, at least, so it'd be good to poke that bug and
> ask whether this patch fixes that problem, too.

Done.





bug closed, send any further explanations to 49449 <at> debbugs.gnu.org and Mattias Engdegård <mattiase <at> acm.org> Request was from Mattias Engdegård <mattiase <at> acm.org> to control <at> debbugs.gnu.org. (Tue, 13 Jul 2021 17:12:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 11 Aug 2021 11:24:09 GMT) Full text and rfc822 format available.

This bug report was last modified 2 years and 256 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.