GNU bug report logs - #63535
Master branch: Error in forw_comment (syntax.c) handling of escaped LFs

Previous Next

Package: emacs;

Reported by: Alan Mackenzie <acm <at> muc.de>

Date: Tue, 16 May 2023 10:58:02 UTC

Severity: normal

Done: Alan Mackenzie <acm <at> muc.de>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 63535 in the body.
You can then email your comments to 63535 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#63535; Package emacs. (Tue, 16 May 2023 10:58:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Alan Mackenzie <acm <at> muc.de>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Tue, 16 May 2023 10:58:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Alan Mackenzie <acm <at> muc.de>
To: bug-gnu-emacs <at> gnu.org
Subject: Master branch: Error in forw_comment (syntax.c) handling of escaped
 LFs
Date: Tue, 16 May 2023 10:57:40 +0000
Hello, Emacs.

In the master branch:

Consider the following C++ Mode buffer:

    // comment \
    comment line 2
    line_3();

..  The backslash at the end of line 1 extends the comment into line 2.

Put point at // on L1, and do:

    M-: (setq s (parse-partial-sexp (point) (+ (point) 9)))

..  s gets the parse state of the inside of the comment.

Now put point at EOL 1, between the backslash and the LF.  Do

    M-: (parse-partial-sexp (point) (point-max) nil nil s 'syntax-table)

..  This ought to leave point at BOL 2, since the syntax before the LF at
EOL 1 is that of a C++ comment, otherwise neutral.  Instead, it leaves
point wrongly at BOL 3.

#########################################################################

The reason for this bug is at L+42 of forw_comment (in syntax.c).  There
we have

   && !(comment_end_can_be_escaped && char_quoted (from, from_byte))

..  Checking char_quoted is wrong.  Instead the function should check the
current parse state.

-- 
Alan Mackenzie (Nuremberg, Germany).




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#63535; Package emacs. (Tue, 16 May 2023 14:04:02 GMT) Full text and rfc822 format available.

Message #8 received at 63535 <at> debbugs.gnu.org (full text, mbox):

From: Alan Mackenzie <acm <at> muc.de>
To: 63535 <at> debbugs.gnu.org
Cc: acm <at> muc.de
Subject: Re: bug#63535: Master branch: Error in forw_comment (syntax.c)
 handling of escaped LFs
Date: Tue, 16 May 2023 14:03:18 +0000
On Tue, May 16, 2023 at 10:57:40 +0000, Alan Mackenzie wrote:
> Hello, Emacs.

> In the master branch:

> Consider the following C++ Mode buffer:

>     // comment \
>     comment line 2
>     line_3();

> ..  The backslash at the end of line 1 extends the comment into line 2.

> Put point at // on L1, and do:

>     M-: (setq s (parse-partial-sexp (point) (+ (point) 9)))

> ..  s gets the parse state of the inside of the comment.

> Now put point at EOL 1, between the backslash and the LF.  Do

>     M-: (parse-partial-sexp (point) (point-max) nil nil s 'syntax-table)

> ..  This ought to leave point at BOL 2, since the syntax before the LF at
> EOL 1 is that of a C++ comment, otherwise neutral.  Instead, it leaves
> point wrongly at BOL 3.

> #########################################################################

> The reason for this bug is at L+42 of forw_comment (in syntax.c).  There
> we have

>    && !(comment_end_can_be_escaped && char_quoted (from, from_byte))

> ..  Checking char_quoted is wrong.  Instead the function should check the
> current parse state.

And here is a patch which fixes it.  I will apply this patch to master
soon if I don't hear any objection.



diff --git a/src/syntax.c b/src/syntax.c
index e9e04e2d638..76d9f16e4ed 100644
--- a/src/syntax.c
+++ b/src/syntax.c
@@ -2344,7 +2344,9 @@ forw_comment (ptrdiff_t from, ptrdiff_t from_byte, ptrdiff_t stop,
 	  && SYNTAX_FLAGS_COMMENT_STYLE (syntax, 0) == style
 	  && (SYNTAX_FLAGS_COMMENT_NESTED (syntax) ?
 	      (nesting > 0 && --nesting == 0) : nesting < 0)
-          && !(comment_end_can_be_escaped && char_quoted (from, from_byte)))
+          && !(comment_end_can_be_escaped &&
+	       (((prev_syntax & 0xff) == Sescape)
+		|| ((prev_syntax & 0xff) == Scharquote))))
 	/* We have encountered a comment end of the same style
 	   as the comment sequence which began this comment
 	   section.  */


-- 
Alan Mackenzie (Nuremberg, Germany).




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#63535; Package emacs. (Tue, 16 May 2023 15:44:02 GMT) Full text and rfc822 format available.

Message #11 received at 63535 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Alan Mackenzie <acm <at> muc.de>
Cc: 63535 <at> debbugs.gnu.org
Subject: Re: bug#63535: Master branch: Error in forw_comment (syntax.c)
 handling of escaped LFs
Date: Tue, 16 May 2023 18:43:59 +0300
> Date: Tue, 16 May 2023 10:57:40 +0000
> From: Alan Mackenzie <acm <at> muc.de>
> 
> Hello, Emacs.
> 
> In the master branch:

Is it different on emacs-29?

>    && !(comment_end_can_be_escaped && char_quoted (from, from_byte))
> 
> ..  Checking char_quoted is wrong.  Instead the function should check the
> current parse state.

Why not both?  IOW, please explain why char_quoted is not TRT here.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#63535; Package emacs. (Tue, 16 May 2023 16:16:02 GMT) Full text and rfc822 format available.

Message #14 received at 63535 <at> debbugs.gnu.org (full text, mbox):

From: Alan Mackenzie <acm <at> muc.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 63535 <at> debbugs.gnu.org
Subject: Re: bug#63535: Master branch: Error in forw_comment (syntax.c)
 handling of escaped LFs
Date: Tue, 16 May 2023 16:15:24 +0000
Hello, Eli.

On Tue, May 16, 2023 at 18:43:59 +0300, Eli Zaretskii wrote:
> > Date: Tue, 16 May 2023 10:57:40 +0000
> > From: Alan Mackenzie <acm <at> muc.de>

> > Hello, Emacs.

> > In the master branch:

> Is it different on emacs-29?

No, the bug has been there since ?2016, having been coded, almost
certainly, by me.  ;-(  The context in 2016 was making an escaped NL in
a C++ line comment continue the comment's fontification onto the next
line.  The (then) new variable comment-end-can-be-escaped configured the
effect of the backslash at EOL.

I have been assuming that it is too unimportant a bug to go into
emacs-29 at this late stage.

> >    && !(comment_end_can_be_escaped && char_quoted (from, from_byte))

> > ..  Checking char_quoted is wrong.  Instead the function should check the
> > current parse state.

> Why not both?  IOW, please explain why char_quoted is not TRT here.

Because parse-partial-sexp is not scanning the backslash.  The scan
starts one character after the backslash, and the syntactic effect of
that backslash is not in the OLDSTATE argument to parse-partial-sexp.

-- 
Alan Mackenzie (Nuremberg, Germany)




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#63535; Package emacs. (Tue, 16 May 2023 16:30:02 GMT) Full text and rfc822 format available.

Message #17 received at 63535 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Alan Mackenzie <acm <at> muc.de>
Cc: 63535 <at> debbugs.gnu.org
Subject: Re: bug#63535: Master branch: Error in forw_comment (syntax.c)
 handling of escaped LFs
Date: Tue, 16 May 2023 19:29:26 +0300
> Date: Tue, 16 May 2023 16:15:24 +0000
> Cc: 63535 <at> debbugs.gnu.org
> From: Alan Mackenzie <acm <at> muc.de>
> 
> > >    && !(comment_end_can_be_escaped && char_quoted (from, from_byte))
> 
> > > ..  Checking char_quoted is wrong.  Instead the function should check the
> > > current parse state.
> 
> > Why not both?  IOW, please explain why char_quoted is not TRT here.
> 
> Because parse-partial-sexp is not scanning the backslash.  The scan
> starts one character after the backslash, and the syntactic effect of
> that backslash is not in the OLDSTATE argument to parse-partial-sexp.

Sorry, I still don't follow: char_quoted doesn't call
parse-partial-sexp, AFAICT.  So why does it matter what
parse-partial-sexp does when we are discussing why char_quoted is not
TRT?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#63535; Package emacs. (Tue, 16 May 2023 17:00:02 GMT) Full text and rfc822 format available.

Message #20 received at 63535 <at> debbugs.gnu.org (full text, mbox):

From: Alan Mackenzie <acm <at> muc.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 63535 <at> debbugs.gnu.org
Subject: Re: bug#63535: Master branch: Error in forw_comment (syntax.c)
 handling of escaped LFs
Date: Tue, 16 May 2023 16:58:55 +0000
Hello, Eli.

On Tue, May 16, 2023 at 19:29:26 +0300, Eli Zaretskii wrote:
> > Date: Tue, 16 May 2023 16:15:24 +0000
> > Cc: 63535 <at> debbugs.gnu.org
> > From: Alan Mackenzie <acm <at> muc.de>

> > > >    && !(comment_end_can_be_escaped && char_quoted (from, from_byte))

> > > > ..  Checking char_quoted is wrong.  Instead the function should check the
> > > > current parse state.

> > > Why not both?  IOW, please explain why char_quoted is not TRT here.

> > Because parse-partial-sexp is not scanning the backslash.  The scan
> > starts one character after the backslash, and the syntactic effect of
> > that backslash is not in the OLDSTATE argument to parse-partial-sexp.

> Sorry, I still don't follow: char_quoted doesn't call
> parse-partial-sexp, AFAICT.

parse-partial-sexp calls forw_comment which (wrongly) calls char_quoted.

> So why does it matter what parse-partial-sexp does when we are
> discussing why char_quoted is not TRT?

parse-partial-sexp is the context in which the bug becomes evident.  If,
in the C++ line comment with escaped NL, you start parse-partial-sexp at
the NL, it behaves as though the scan started at the backslash.  This is
the bug.

The cause of the bug is the use of char_quoted at line 42 of
forw_comment.

-- 
Alan Mackenzie (Nuremberg, Germany).




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#63535; Package emacs. (Tue, 16 May 2023 17:51:02 GMT) Full text and rfc822 format available.

Message #23 received at 63535 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Alan Mackenzie <acm <at> muc.de>,
 Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 63535 <at> debbugs.gnu.org
Subject: Re: bug#63535: Master branch: Error in forw_comment (syntax.c)
 handling of escaped LFs
Date: Tue, 16 May 2023 20:50:50 +0300
> Date: Tue, 16 May 2023 16:58:55 +0000
> Cc: 63535 <at> debbugs.gnu.org
> From: Alan Mackenzie <acm <at> muc.de>
> 
> On Tue, May 16, 2023 at 19:29:26 +0300, Eli Zaretskii wrote:
> > > Date: Tue, 16 May 2023 16:15:24 +0000
> > > Cc: 63535 <at> debbugs.gnu.org
> > > From: Alan Mackenzie <acm <at> muc.de>
> 
> > > > >    && !(comment_end_can_be_escaped && char_quoted (from, from_byte))
> 
> > > > > ..  Checking char_quoted is wrong.  Instead the function should check the
> > > > > current parse state.
> 
> > > > Why not both?  IOW, please explain why char_quoted is not TRT here.
> 
> > > Because parse-partial-sexp is not scanning the backslash.  The scan
> > > starts one character after the backslash, and the syntactic effect of
> > > that backslash is not in the OLDSTATE argument to parse-partial-sexp.
> 
> > Sorry, I still don't follow: char_quoted doesn't call
> > parse-partial-sexp, AFAICT.
> 
> parse-partial-sexp calls forw_comment which (wrongly) calls char_quoted.
> 
> > So why does it matter what parse-partial-sexp does when we are
> > discussing why char_quoted is not TRT?
> 
> parse-partial-sexp is the context in which the bug becomes evident.  If,
> in the C++ line comment with escaped NL, you start parse-partial-sexp at
> the NL, it behaves as though the scan started at the backslash.  This is
> the bug.
> 
> The cause of the bug is the use of char_quoted at line 42 of
> forw_comment.

Thanks, let's see what Stefan has to say about this.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#63535; Package emacs. (Wed, 17 May 2023 22:02:02 GMT) Full text and rfc822 format available.

Message #26 received at 63535 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Alan Mackenzie <acm <at> muc.de>
Cc: 63535 <at> debbugs.gnu.org
Subject: Re: bug#63535: Master branch: Error in forw_comment (syntax.c)
 handling of escaped LFs
Date: Wed, 17 May 2023 18:01:32 -0400
Hi Alan,

> diff --git a/src/syntax.c b/src/syntax.c
> index e9e04e2d638..76d9f16e4ed 100644
> --- a/src/syntax.c
> +++ b/src/syntax.c
> @@ -2344,7 +2344,9 @@ forw_comment (ptrdiff_t from, ptrdiff_t from_byte, ptrdiff_t stop,
>  	  && SYNTAX_FLAGS_COMMENT_STYLE (syntax, 0) == style
>  	  && (SYNTAX_FLAGS_COMMENT_NESTED (syntax) ?
>  	      (nesting > 0 && --nesting == 0) : nesting < 0)
> -          && !(comment_end_can_be_escaped && char_quoted (from, from_byte)))
> +          && !(comment_end_can_be_escaped &&
> +	       (((prev_syntax & 0xff) == Sescape)
> +		|| ((prev_syntax & 0xff) == Scharquote))))
>  	/* We have encountered a comment end of the same style
>  	   as the comment sequence which began this comment
>  	   section.  */

AFAIK this is your code, so you should know better, but AFAICT
`prev_syntax` is not updated in the loop, so it only reflects the syntax
before the beginning of the scanned text, rather than anything near `from`.
Are you sure this is right?


        Stefan





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#63535; Package emacs. (Mon, 22 May 2023 15:00:02 GMT) Full text and rfc822 format available.

Message #29 received at 63535 <at> debbugs.gnu.org (full text, mbox):

From: Alan Mackenzie <acm <at> muc.de>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 63535 <at> debbugs.gnu.org
Subject: Re: bug#63535: Master branch: Error in forw_comment (syntax.c)
 handling of escaped LFs
Date: Mon, 22 May 2023 14:59:32 +0000
Hello, Stefan.

On Wed, May 17, 2023 at 18:01:32 -0400, Stefan Monnier wrote:
> Hi Alan,

[ .... ]

> AFAIK this is your code, so you should know better, but AFAICT
> `prev_syntax` is not updated in the loop, so it only reflects the syntax
> before the beginning of the scanned text, rather than anything near `from`.
> Are you sure this is right?

Thanks, you are correct, the patch was not good.

It turned out to be quite tricky to get working.  As well as
forw_comment, I had to amend scan_sexps_forward to make it return a
quoted state to its caller when this happens at the limit of the scan.

I think the following patch is better.  Would you please have a look at
it, in the hope I haven't made any other silly mistakes.  Thanks!



diff --git a/src/syntax.c b/src/syntax.c
index e9e04e2d638..94b2ac2b591 100644
--- a/src/syntax.c
+++ b/src/syntax.c
@@ -2338,13 +2338,16 @@ forw_comment (ptrdiff_t from, ptrdiff_t from_byte, ptrdiff_t stop,
 	  return 0;
 	}
       c = FETCH_CHAR_AS_MULTIBYTE (from_byte);
+      prev_syntax = syntax;
       syntax = SYNTAX_WITH_FLAGS (c);
       code = syntax & 0xff;
       if (code == Sendcomment
 	  && SYNTAX_FLAGS_COMMENT_STYLE (syntax, 0) == style
 	  && (SYNTAX_FLAGS_COMMENT_NESTED (syntax) ?
 	      (nesting > 0 && --nesting == 0) : nesting < 0)
-          && !(comment_end_can_be_escaped && char_quoted (from, from_byte)))
+          && !(comment_end_can_be_escaped
+	       && ((prev_syntax & 0xff) == Sescape
+		   || (prev_syntax & 0xff) == Scharquote)))
 	/* We have encountered a comment end of the same style
 	   as the comment sequence which began this comment
 	   section.  */
@@ -2368,7 +2371,11 @@ forw_comment (ptrdiff_t from, ptrdiff_t from_byte, ptrdiff_t stop,
           inc_both (&from, &from_byte);
           UPDATE_SYNTAX_TABLE_FORWARD (from);
           if (from == stop) continue; /* Failure */
-        }
+	  c = FETCH_CHAR_AS_MULTIBYTE (from_byte);
+	  prev_syntax = syntax;
+	  syntax = Smax;
+	  code = syntax;
+	}
       inc_both (&from, &from_byte);
       UPDATE_SYNTAX_TABLE_FORWARD (from);
 
@@ -3349,7 +3356,14 @@ do { prev_from = from;				\
 	     are invalid now.  Luckily, the `done' doesn't use them
 	     and the INC_FROM sets them to a sane value without
 	     looking at them. */
-	  if (!found) goto done;
+	  if (!found)
+	    {
+	      if ((prev_from_syntax & 0xff) == Sescape
+		  || (prev_from_syntax & 0xff) == Scharquote)
+		goto endquoted;
+	      else
+		goto done;
+	    }
 	  INC_FROM;
 	  state->incomment = 0;
 	  state->comstyle = 0;	/* reset the comment style */


>         Stefan

-- 
Alan Mackenzie (Nuremberg, Germany).




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#63535; Package emacs. (Mon, 22 May 2023 15:17:01 GMT) Full text and rfc822 format available.

Message #32 received at 63535 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Alan Mackenzie <acm <at> muc.de>
Cc: 63535 <at> debbugs.gnu.org
Subject: Re: bug#63535: Master branch: Error in forw_comment (syntax.c)
 handling of escaped LFs
Date: Mon, 22 May 2023 11:16:40 -0400
> I think the following patch is better.  Would you please have a look at
> it, in the hope I haven't made any other silly mistakes.  Thanks!

I don't see any silly mistake there, sorry.


        Stefan


PS: It does remind me that we really should do ourselves a favor and get rid
of the distinction between `Sescape` and `Scharquote`.
IIRC there's a risk of backward incompatibility, so it has to be done
progressively, but we should start the process.  E.g. first declare one of the
two as obsolete, then emit a warning when we see it being used, ...





Reply sent to Alan Mackenzie <acm <at> muc.de>:
You have taken responsibility. (Mon, 22 May 2023 16:17:01 GMT) Full text and rfc822 format available.

Notification sent to Alan Mackenzie <acm <at> muc.de>:
bug acknowledged by developer. (Mon, 22 May 2023 16:17:01 GMT) Full text and rfc822 format available.

Message #37 received at 63535-done <at> debbugs.gnu.org (full text, mbox):

From: Alan Mackenzie <acm <at> muc.de>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: acm <at> muc.de, 63535-done <at> debbugs.gnu.org
Subject: Re: bug#63535: Master branch: Error in forw_comment (syntax.c)
 handling of escaped LFs
Date: Mon, 22 May 2023 16:16:49 +0000
Hello, Stefan.

On Mon, May 22, 2023 at 11:16:40 -0400, Stefan Monnier wrote:
> > I think the following patch is better.  Would you please have a look at
> > it, in the hope I haven't made any other silly mistakes.  Thanks!

> I don't see any silly mistake there, sorry.

Thanks!  I've committed the patch, and I'm closing the bug.

>         Stefan


> PS: It does remind me that we really should do ourselves a favor and get rid
> of the distinction between `Sescape` and `Scharquote`.
> IIRC there's a risk of backward incompatibility, so it has to be done
> progressively, but we should start the process.  E.g. first declare one of the
> two as obsolete, then emit a warning when we see it being used, ...

Yes.  I think Sescape should be the survivor.  I don't know if
Scharquote is used at all in Emacs code.

-- 
Alan Mackenzie (Nuremberg, Germany).




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 20 Jun 2023 11:24:06 GMT) Full text and rfc822 format available.

This bug report was last modified 309 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.