GNU bug report logs - #61369
Problem with keeping tree-sitter parse tree up-to-date

Previous Next

Package: emacs;

Reported by: Dmitry Gutov <dgutov <at> yandex.ru>

Date: Wed, 8 Feb 2023 15:35:02 UTC

Severity: normal

To reply to this bug, email your comments to 61369 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#61369; Package emacs. (Wed, 08 Feb 2023 15:35:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Dmitry Gutov <dgutov <at> yandex.ru>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Wed, 08 Feb 2023 15:35:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: bug-gnu-emacs <at> gnu.org
Subject: Problem with keeping tree-sitter parse tree up-to-date
Date: Wed, 8 Feb 2023 17:34:25 +0200
1. Have some buffer with text

"use std::Path::{self, Path, PathBuf};  // good: std is a crate name
... (maybe something else
"

2. Have this text in a different buffer (I used scratch):

"


let date = DateTime::<chrono::Utc>::from_utc(date, chrono::Utc);
"

Meaning, the buffer starts with two empty lines.

3. Select the first line from the first buffer including the trailing 
newline, yank and then paste at the beginning of the second buffer.

The second buffer will now look like this:

"use std::Path::{self, Path, PathBuf};  // good: std is a crate name


let date = DateTime::<chrono::Utc>::from_utc(date, chrono::Utc);
"

The parse tree will, however, be (according to treesit-explore-mode):

(source_file
 (use_declaration use
  argument:
   (scoped_use_list
    path: (scoped_identifier path: (identifier) :: name: (identifier))
    ::
    list: (use_list { (self) , (identifier) , (identifier) }))
  ;)
 (line_comment)
 (let_declaration let pattern: (identifier) =
  value:
   (scoped_identifier
    path:
     (generic_type type: (type_identifier) ::
      type_arguments:
       (type_arguments <
        (scoped_type_identifier path: (identifier) :: name: 
(type_identifier))
        >))
    :: name: (identifier))
  (ERROR ( (identifier) ,
   (scoped_identifier path: (identifier) :: name: (identifier))
   ;)
  ;))

But if I edit the buffer after that, e.g. add and remove a space at the 
beginning, the error node disappears:

(source_file
 (use_declaration use
  argument:
   (scoped_use_list
    path: (scoped_identifier path: (identifier) :: name: (identifier))
    ::
    list: (use_list { (self) , (identifier) , (identifier) }))
  ;)
 (line_comment)
 (let_declaration let pattern: (identifier) =
  value:
   (call_expression
    function:
     (scoped_identifier
      path:
       (generic_type type: (type_identifier) ::
        type_arguments:
         (type_arguments <
          (scoped_type_identifier path: (identifier) :: name: 
(type_identifier))
          >))
      :: name: (identifier))
    arguments:
     (arguments ( (identifier) ,
      (scoped_identifier path: (identifier) :: name: (identifier))
      )))
  ;))




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#61369; Package emacs. (Wed, 08 Feb 2023 18:21:02 GMT) Full text and rfc822 format available.

Message #8 received at 61369 <at> debbugs.gnu.org (full text, mbox):

From: Theodor Thornhill <theo <at> thornhill.no>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: 61369 <at> debbugs.gnu.org
Subject: Re: bug#61369: Problem with keeping tree-sitter parse tree up-to-date
Date: Wed, 08 Feb 2023 19:20:19 +0100
Dmitry Gutov <dgutov <at> yandex.ru> writes:

> 1. Have some buffer with text
>
> "use std::Path::{self, Path, PathBuf};  // good: std is a crate name
> ... (maybe something else
> "
>
> 2. Have this text in a different buffer (I used scratch):
>
> "
>
>
> let date = DateTime::<chrono::Utc>::from_utc(date, chrono::Utc);
> "
>
> Meaning, the buffer starts with two empty lines.
>
> 3. Select the first line from the first buffer including the trailing newline,
> yank and then paste at the beginning of the second buffer.
>
> The second buffer will now look like this:
>
> "use std::Path::{self, Path, PathBuf};  // good: std is a crate name
>
>
> let date = DateTime::<chrono::Utc>::from_utc(date, chrono::Utc);
> "

I just want to confirm that I can reproduce this, and that if you skip
the trailing newline from the use-statement, I don't get this behavior.
So it seems like the newline is the crucial point, right?

Theo




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#61369; Package emacs. (Wed, 08 Feb 2023 19:42:01 GMT) Full text and rfc822 format available.

Message #11 received at 61369 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Theodor Thornhill <theo <at> thornhill.no>
Cc: "61369 <at> debbugs.gnu.org" <61369 <at> debbugs.gnu.org>
Subject: Re: bug#61369: Problem with keeping tree-sitter parse tree up-to-date
Date: Wed, 08 Feb 2023 22:41:14 +0300
[Message part 1 (text/html, inline)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#61369; Package emacs. (Fri, 10 Feb 2023 01:23:01 GMT) Full text and rfc822 format available.

Message #14 received at 61369 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: theo <at> thornhill.no, 61369 <at> debbugs.gnu.org
Subject: Re: bug#61369: Problem with keeping tree-sitter parse tree up-to-date
Date: Thu, 9 Feb 2023 17:22:15 -0800
>  I just want to confirm that I can reproduce this, and that if you skip
>  the trailing newline from the use-statement, I don't get this behavior.
>  So it seems like the newline is the crucial point, right?
>
> Yes, same.
>
> Thr trailing newline is necessary.
>
> The empty lines at the beginning of the buffer (being copied to) are necessary to reproduce this as well.

Hmmm, it might be related to how does tree-sitter does incremental
parsing? If the newline is necessary, then I guess it’s not because
Emacs missed characters when reporting edits to tree-sitter.

Yuan




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#61369; Package emacs. (Fri, 10 Feb 2023 01:39:01 GMT) Full text and rfc822 format available.

Message #17 received at 61369 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Yuan Fu <casouri <at> gmail.com>
Cc: theo <at> thornhill.no, 61369 <at> debbugs.gnu.org
Subject: Re: bug#61369: Problem with keeping tree-sitter parse tree up-to-date
Date: Fri, 10 Feb 2023 03:38:04 +0200
On 10/02/2023 03:22, Yuan Fu wrote:
>>   I just want to confirm that I can reproduce this, and that if you skip
>>   the trailing newline from the use-statement, I don't get this behavior.
>>   So it seems like the newline is the crucial point, right?
>>
>> Yes, same.
>>
>> Thr trailing newline is necessary.
>>
>> The empty lines at the beginning of the buffer (being copied to) are necessary to reproduce this as well.
> Hmmm, it might be related to how does tree-sitter does incremental
> parsing? If the newline is necessary, then I guess it’s not because
> Emacs missed characters when reporting edits to tree-sitter.

The newline is somewhat necessary: the scenario doesn't work, for 
example, if the pasted text doesn't include the newline but the buffer 
had an additional (third) one at the top.

But the scenario also doesn't work if some other (any) character is 
removed from the yanked line before pasting: it could be even one after 
the comment instruction (//).

OTOH, if I add an extra char to the yanked line, anywhere, I can skip 
the newline. E.g. I can paste

  use std::path::{self, Path, PathBuf};  // good: std is a crate namee

without a newline and still see the exact same syntax error.

So it looks more like an off-by-one error somewhere. Maybe in our code, 
but maybe in tree-sitter somewhere.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#61369; Package emacs. (Mon, 13 Feb 2023 09:11:01 GMT) Full text and rfc822 format available.

Message #20 received at 61369 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: theo <at> thornhill.no, 61369 <at> debbugs.gnu.org
Subject: Re: bug#61369: Problem with keeping tree-sitter parse tree up-to-date
Date: Mon, 13 Feb 2023 01:10:05 -0800
Dmitry Gutov <dgutov <at> yandex.ru> writes:

> On 10/02/2023 03:22, Yuan Fu wrote:
>>>   I just want to confirm that I can reproduce this, and that if you skip
>>>   the trailing newline from the use-statement, I don't get this behavior.
>>>   So it seems like the newline is the crucial point, right?
>>>
>>> Yes, same.
>>>
>>> Thr trailing newline is necessary.
>>>
>>> The empty lines at the beginning of the buffer (being copied to) are necessary to reproduce this as well.
>> Hmmm, it might be related to how does tree-sitter does incremental
>> parsing? If the newline is necessary, then I guess it’s not because
>> Emacs missed characters when reporting edits to tree-sitter.
>
> The newline is somewhat necessary: the scenario doesn't work, for
> example, if the pasted text doesn't include the newline but the buffer
> had an additional (third) one at the top.
>
> But the scenario also doesn't work if some other (any) character is
> removed from the yanked line before pasting: it could be even one
> after the comment instruction (//).
>
> OTOH, if I add an extra char to the yanked line, anywhere, I can skip
> the newline. E.g. I can paste
>
>   use std::path::{self, Path, PathBuf};  // good: std is a crate namee
>
> without a newline and still see the exact same syntax error.
>
> So it looks more like an off-by-one error somewhere. Maybe in our
> code, but maybe in tree-sitter somewhere.

Some progress report: I added a function that reads the buffer like a
parser would, like this:

DEFUN ("treesit--parser-view",
       Ftreesit__parser_view,
       Streesit__parser_view, 1, 1, 0,
       doc: /* Return the view of PARSER.
Read buffer like PARSER would into a string and return it.  */)
  (Lisp_Object parser)
{
  const ptrdiff_t visible_beg = XTS_PARSER (parser)->visible_beg;
  const ptrdiff_t visible_end = XTS_PARSER (parser)->visible_end;
  const ptrdiff_t view_len = visible_end - visible_beg;

  char *str_buf = xzalloc (view_len + 1);
  uint32_t read = 0;
  TSPoint pos = { 0 };
  for (int idx = 0; idx < view_len; idx++)
    {
      const char *ch = treesit_read_buffer (XTS_PARSER (parser),
					    idx, pos, &read);
      if (read == 0)
	{
	  xfree (str_buf);
	  xsignal1 (Qtreesit_error, make_fixnum (idx));
	}
      else
	str_buf[idx] = *ch;
    }
  Lisp_Object ret_str = make_string (str_buf, view_len);
  xfree (str_buf);
  return ret_str;
}

After I follow the steps and got the error node, I run this function on
the parser, and the returned string looks good.

Next I’ll try to log every character actually read by the parser and see
if anything seems fishy.

Yuan




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#61369; Package emacs. (Tue, 14 Feb 2023 00:00:02 GMT) Full text and rfc822 format available.

Message #23 received at 61369 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: theo <at> thornhill.no, 61369 <at> debbugs.gnu.org
Subject: Re: bug#61369: Problem with keeping tree-sitter parse tree up-to-date
Date: Mon, 13 Feb 2023 15:59:02 -0800
[Message part 1 (text/plain, inline)]
Yuan Fu <casouri <at> gmail.com> writes:

> Dmitry Gutov <dgutov <at> yandex.ru> writes:
>
>> On 10/02/2023 03:22, Yuan Fu wrote:
>>>>   I just want to confirm that I can reproduce this, and that if you skip
>>>>   the trailing newline from the use-statement, I don't get this behavior.
>>>>   So it seems like the newline is the crucial point, right?
>>>>
>>>> Yes, same.
>>>>
>>>> Thr trailing newline is necessary.
>>>>
>>>> The empty lines at the beginning of the buffer (being copied to) are necessary to reproduce this as well.
>>> Hmmm, it might be related to how does tree-sitter does incremental
>>> parsing? If the newline is necessary, then I guess it’s not because
>>> Emacs missed characters when reporting edits to tree-sitter.
>>
>> The newline is somewhat necessary: the scenario doesn't work, for
>> example, if the pasted text doesn't include the newline but the buffer
>> had an additional (third) one at the top.
>>
>> But the scenario also doesn't work if some other (any) character is
>> removed from the yanked line before pasting: it could be even one
>> after the comment instruction (//).
>>
>> OTOH, if I add an extra char to the yanked line, anywhere, I can skip
>> the newline. E.g. I can paste
>>
>>   use std::path::{self, Path, PathBuf};  // good: std is a crate namee
>>
>> without a newline and still see the exact same syntax error.
>>
>> So it looks more like an off-by-one error somewhere. Maybe in our
>> code, but maybe in tree-sitter somewhere.
>
> Some progress report: I added a function that reads the buffer like a
> parser would, like this:
>
> DEFUN ("treesit--parser-view",
>        Ftreesit__parser_view,
>        Streesit__parser_view, 1, 1, 0,
>        doc: /* Return the view of PARSER.
> Read buffer like PARSER would into a string and return it.  */)
>   (Lisp_Object parser)
> {
>   const ptrdiff_t visible_beg = XTS_PARSER (parser)->visible_beg;
>   const ptrdiff_t visible_end = XTS_PARSER (parser)->visible_end;
>   const ptrdiff_t view_len = visible_end - visible_beg;
>
>   char *str_buf = xzalloc (view_len + 1);
>   uint32_t read = 0;
>   TSPoint pos = { 0 };
>   for (int idx = 0; idx < view_len; idx++)
>     {
>       const char *ch = treesit_read_buffer (XTS_PARSER (parser),
> 					    idx, pos, &read);
>       if (read == 0)
> 	{
> 	  xfree (str_buf);
> 	  xsignal1 (Qtreesit_error, make_fixnum (idx));
> 	}
>       else
> 	str_buf[idx] = *ch;
>     }
>   Lisp_Object ret_str = make_string (str_buf, view_len);
>   xfree (str_buf);
>   return ret_str;
> }
>
> After I follow the steps and got the error node, I run this function on
> the parser, and the returned string looks good.
>
> Next I’ll try to log every character actually read by the parser and see
> if anything seems fishy.

I don’t know if it’s good news or bad news, but it doesn’t seem like a
off-by-one. Here is what I did:

1. I applied the attached patch (patch.diff) so that treesit_read_buffer, the
function used by tree-sitter parser to read buffer contents, prints the
position it read and the character it gets to stdout.

2. I open test.rs which contains

"

let date = DateTime::<chrono::Utc>::from_utc(date, chrono::Utc);
"

as in the recipe. I have rust-ts-mode enabled, so Emacs prints the
characters read by the parser to stdout. I type return several times to
separate this first batch of output from the next, which is what I’m
interested in.

3. I paste

"use std::Path::{self, Path, PathBuf};  // good: std is a crate name
"

at the beginning of the buffer. Now the parse tree contains that error
node. I go to the terminal, copy the output out, which looks like:

0 117
1 115
2 101
3 32
0 117
1 115
2 101
...
133 59
134 10
134 10
134 10
134 10

4. I paste this output (output.txt) into a buffer, and reconstruct the text read by
the parser with (setq str (reconstruct)), where reconstruct is:

(defun reconstruct ()
  (goto-char (point-min))
  (let ((result ""))
    (while (< (point) (point-max))
      (let* ((str (buffer-substring (point) (line-end-position)))
             (nums (string-split str))
             (pos (string-to-number (car nums)))
             (char (string-to-number (cadr nums))))
        (when (not (< pos (length result)))
          (setq result (concat result
                               (make-string (- (1+ pos) (length result))
                                            ?0))))
        (setf (aref result pos) char))
      (forward-line 1))
    result))

5. I insert str into a new buffer, and (to my disappointment) the
content is identical to the buffer text.

There are two surprises here: 1) there isn’t an off-by-one bug, 2) the
parser actually read the whole buffer, rather than reading only the new
content. Then there are even less reason for it to create that error
node.

In addition, I inserted a new line in the Rust source buffer (test.rs) (which
fixes the error node), here is what the parser read after that
insertion:

"0000000000000000000000000000000000000000000000000000000000000000000



let 0000 = 000000000000000000000000000000000000000000000000000);"

0 means it didn’t read that position, we can see that the parser read
all the newlines, "let ", " = ", and ");". I can’t discern anything
interesting from that, tho.

Yuan

[output.txt (text/plain, attachment)]
[patch.diff (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#61369; Package emacs. (Wed, 15 Feb 2023 02:18:01 GMT) Full text and rfc822 format available.

Message #26 received at 61369 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Yuan Fu <casouri <at> gmail.com>
Cc: theo <at> thornhill.no, 61369 <at> debbugs.gnu.org
Subject: Re: bug#61369: Problem with keeping tree-sitter parse tree up-to-date
Date: Wed, 15 Feb 2023 04:17:29 +0200
On 14/02/2023 01:59, Yuan Fu wrote:
> There are two surprises here: 1) there isn’t an off-by-one bug, 2) the
> parser actually read the whole buffer, rather than reading only the new
> content. Then there are even less reason for it to create that error
> node.

The parser reads the whole buffer, but if it tries to reparse based on 
the previous parse tree with incorrect positions, it might get into an 
invalid state as a result.

I've tried gdb-ing treesit_tree_edit_1 (after dropping the 'inline' 
qualifier), and here's what I see:

- If I paste the test line without the trailing newline or not, the value.

- If I paste the test line with the trailing newline, the value of 
new_end_byte is still 67. But then it is followed by this call right away:

Thread 1 "emacs" hit Breakpoint 3, treesit_tree_edit_1 
(tree=tree <at> entry=0x5555574139b0, start_byte=start_byte <at> entry=134, 
old_end_byte=old_end_byte <at> entry=134, new_end_byte=135) at treesit.c:739

- If I 'undo' after that, the call is as expected:

Thread 1 "emacs" hit Breakpoint 3, treesit_tree_edit_1 
(tree=0x555557435cd0, start_byte=start_byte <at> entry=0, 
old_end_byte=old_end_byte <at> entry=68, new_end_byte=new_end_byte <at> entry=0) 
at treesit.c:739
739	{

So I tried again to figure out the odd call, with the backtrace:

Thread 1 "emacs" hit Breakpoint 3, treesit_tree_edit_1 
(tree=tree <at> entry=0x5555575b64f0, start_byte=start_byte <at> entry=134, 
old_end_byte=old_end_byte <at> entry=134, new_end_byte=269) at treesit.c:739
739	{
(gdb) backtrace
#0  treesit_tree_edit_1 (tree=tree <at> entry=0x5555575b64f0, 
start_byte=start_byte <at> entry=134, old_end_byte=old_end_byte <at> entry=134, 
new_end_byte=269) at treesit.c:739
#1  0x00005555557cb085 in treesit_sync_visible_region 
(parser=parser <at> entry=XIL(0x555556fc329d)) at treesit.c:931
#2  0x00005555557ccf28 in treesit_ensure_parsed 
(parser=XIL(0x555556fc329d)) at treesit.c:1025
#3  Ftreesit_parser_root_node (parser=XIL(0x555556fc329d)) at treesit.c:1507

treesit.c:739 points to a treesit_tree_edit_1 call which is predicated 
on this condition:

  if (visible_end < BUF_ZV_BYTE (buffer))

...which shouldn't be the case since the buffer is small enough to fit 
in the default window. It might already be the consequence of passing 
the wrong value of new_end_byte to ts_tree_edit, though.

Going back to the first call, the backtrace looks like this:

Thread 1 "emacs" hit Breakpoint 3, treesit_tree_edit_1 
(tree=0x5555574f0ff0, start_byte=start_byte <at> entry=0, 
old_end_byte=old_end_byte <at> entry=0, new_end_byte=new_end_byte <at> entry=67) 
at treesit.c:739
739	{
(gdb) backtrace
#0  treesit_tree_edit_1 (tree=0x5555574f0ff0, 
start_byte=start_byte <at> entry=0, old_end_byte=old_end_byte <at> entry=0, 
new_end_byte=new_end_byte <at> entry=67) at treesit.c:739
#1  0x00005555557cc991 in treesit_record_change (start_byte=1, 
old_end_byte=1, new_end_byte=69) at treesit.c:806
#2  0x00005555556f8bb7 in insert_from_string_1 
(string=XIL(0x55555744c4f4), pos=0, pos_byte=0, nchars=68, nbytes=68, 
inherit=<optimized out>, before_markers=false) at insdel.c:1084

Seems like treesit_record_change turns new_end_byte=69 into 
new_end_byte=67 inside treesit_tree_edit_1.

It seems to fail in this calculation:

  ptrdiff_t new_end_offset = (min (visible_end,
				   max (visible_end, new_end_byte))
			      - visible_beg);

because visible_end is still 68 there. It value gets updated later, 
closer to the end of this function.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#61369; Package emacs. (Wed, 15 Feb 2023 22:45:02 GMT) Full text and rfc822 format available.

Message #29 received at 61369 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Yuan Fu <casouri <at> gmail.com>
Cc: theo <at> thornhill.no, 61369 <at> debbugs.gnu.org
Subject: Re: bug#61369: Problem with keeping tree-sitter parse tree up-to-date
Date: Thu, 16 Feb 2023 00:44:07 +0200
On 15/02/2023 04:17, Dmitry Gutov wrote:
> Seems like treesit_record_change turns new_end_byte=69 into 
> new_end_byte=67 inside treesit_tree_edit_1.
> 
> It seems to fail in this calculation:
> 
>    ptrdiff_t new_end_offset = (min (visible_end,
>                     max (visible_end, new_end_byte))
>                    - visible_beg);
> 
> because visible_end is still 68 there. It value gets updated later, 
> closer to the end of this function.

So FWIW the patch below fixes the problem. But I'm not sure about 
change's clipping by the current restriction, or how it should be 
handled exactly.

diff --git a/src/treesit.c b/src/treesit.c
index cab2f0d5354..9f15b88a8bd 100644
--- a/src/treesit.c
+++ b/src/treesit.c
@@ -794,9 +794,7 @@ treesit_record_change (ptrdiff_t start_byte, 
ptrdiff_t old_end_byte,
 	  ptrdiff_t old_end_offset = (min (visible_end,
 					   max (visible_beg, old_end_byte))
 				      - visible_beg);
-	  ptrdiff_t new_end_offset = (min (visible_end,
-					   max (visible_beg, new_end_byte))
-				      - visible_beg);
+	  ptrdiff_t new_end_offset = max (visible_beg, new_end_byte) - 
visible_beg;
 	  eassert (start_offset <= old_end_offset);
 	  eassert (start_offset <= new_end_offset);






Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#61369; Package emacs. (Fri, 17 Feb 2023 22:33:02 GMT) Full text and rfc822 format available.

Message #32 received at 61369 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: Theodor Thornhill <theo <at> thornhill.no>, 61369 <at> debbugs.gnu.org
Subject: Re: bug#61369: Problem with keeping tree-sitter parse tree up-to-date
Date: Fri, 17 Feb 2023 14:32:22 -0800

> On Feb 15, 2023, at 2:44 PM, Dmitry Gutov <dgutov <at> yandex.ru> wrote:
> 
> On 15/02/2023 04:17, Dmitry Gutov wrote:
>> Seems like treesit_record_change turns new_end_byte=69 into new_end_byte=67 inside treesit_tree_edit_1.
>> It seems to fail in this calculation:
>>   ptrdiff_t new_end_offset = (min (visible_end,
>>                    max (visible_end, new_end_byte))
>>                   - visible_beg);
>> because visible_end is still 68 there. It value gets updated later, closer to the end of this function.
> 
> So FWIW the patch below fixes the problem. But I'm not sure about change's clipping by the current restriction, or how it should be handled exactly.
> 
> diff --git a/src/treesit.c b/src/treesit.c
> index cab2f0d5354..9f15b88a8bd 100644
> --- a/src/treesit.c
> +++ b/src/treesit.c
> @@ -794,9 +794,7 @@ treesit_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte,
>  ptrdiff_t old_end_offset = (min (visible_end,
>   max (visible_beg, old_end_byte))
>      - visible_beg);
> -  ptrdiff_t new_end_offset = (min (visible_end,
> -   max (visible_beg, new_end_byte))
> -      - visible_beg);
> +  ptrdiff_t new_end_offset = max (visible_beg, new_end_byte) - visible_beg;
>  eassert (start_offset <= old_end_offset);
>  eassert (start_offset <= new_end_offset);

Thank you very much! I thought that clipping the change into the fixed visible range, and rely on treesit_sync_visible_region to add back the clipped “tail” when we extend the visible range would be equivalent to not clipping, but I guess clipping and re-adding affects how incremental parsing works inside tree-sitter.

I don’t think this change would have any adverse effect, because if you think of it, inserting text in a narrowed region always extends the region, rather than pushing text at the end out of the narrowed region. So the right thing to do here is in fact not clipping new_end_offset.

I pushed this change.

Yuan



Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#61369; Package emacs. (Sat, 18 Feb 2023 00:12:02 GMT) Full text and rfc822 format available.

Message #35 received at 61369 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Yuan Fu <casouri <at> gmail.com>
Cc: Theodor Thornhill <theo <at> thornhill.no>, 61369 <at> debbugs.gnu.org
Subject: Re: bug#61369: Problem with keeping tree-sitter parse tree up-to-date
Date: Sat, 18 Feb 2023 02:11:22 +0200
On 18/02/2023 00:32, Yuan Fu wrote:
> Thank you very much! I thought that clipping the change into the fixed visible range, and rely on treesit_sync_visible_region to add back the clipped “tail” when we extend the visible range would be equivalent to not clipping, but I guess clipping and re-adding affects how incremental parsing works inside tree-sitter.

It seems like the "repairing" sync used a different range, one that 
didn't include the character number 68 inserted from the beginning.

It just synced the 1 or 2 characters at the end of the buffer, the 
difference between the computed visible_end and the actual BUF_ZV_BYTE.

> I don’t think this change would have any adverse effect, because if you think of it, inserting text in a narrowed region always extends the region, rather than pushing text at the end out of the narrowed region. So the right thing to do here is in fact not clipping new_end_offset.

I figured it could be a problem if both old_end_byte and new_end_byte 
extend past the current restriction.

But I'm not sure whether that could actually happen in practice. The 
obvious attempts (undo a change outside of the narrowing, or revert the 
buffer when narrowing is in effect) didn't play out, but I'm not sure 
whether there is an actual hard limit on modifying the text outside of 
the current restriction.

> I pushed this change.

Thanks. Good to see it make it in.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#61369; Package emacs. (Sat, 18 Feb 2023 01:15:01 GMT) Full text and rfc822 format available.

Message #38 received at 61369 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: Theodor Thornhill <theo <at> thornhill.no>, 61369 <at> debbugs.gnu.org
Subject: Re: bug#61369: Problem with keeping tree-sitter parse tree up-to-date
Date: Fri, 17 Feb 2023 17:14:22 -0800

> On Feb 17, 2023, at 4:11 PM, Dmitry Gutov <dgutov <at> yandex.ru> wrote:
> 
> On 18/02/2023 00:32, Yuan Fu wrote:
>> Thank you very much! I thought that clipping the change into the fixed visible range, and rely on treesit_sync_visible_region to add back the clipped “tail” when we extend the visible range would be equivalent to not clipping, but I guess clipping and re-adding affects how incremental parsing works inside tree-sitter.
> 
> It seems like the "repairing" sync used a different range, one that didn't include the character number 68 inserted from the beginning.
> 
> It just synced the 1 or 2 characters at the end of the buffer, the difference between the computed visible_end and the actual BUF_ZV_BYTE.

That should be enough, no? Because other text didn’t change, they just moved. And tree-sitter should know that they moved. Or maybe I’m misunderstanding what you mean.

> 
>> I don’t think this change would have any adverse effect, because if you think of it, inserting text in a narrowed region always extends the region, rather than pushing text at the end out of the narrowed region. So the right thing to do here is in fact not clipping new_end_offset.
> 
> I figured it could be a problem if both old_end_byte and new_end_byte extend past the current restriction.

That should be fine (ie, technically correct), since when we widen, the clipped text are reparsed by tree-sitter as new text.

> 
> But I'm not sure whether that could actually happen in practice. The obvious attempts (undo a change outside of the narrowing, or revert the buffer when narrowing is in effect) didn't play out, but I'm not sure whether there is an actual hard limit on modifying the text outside of the current restriction.

It is my impression that Emacs in general enforces the narrowing restriction strictly. And we are still correct when exceptions occur.

Yuan



Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#61369; Package emacs. (Sat, 18 Feb 2023 01:27:02 GMT) Full text and rfc822 format available.

Message #41 received at 61369 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Yuan Fu <casouri <at> gmail.com>
Cc: Theodor Thornhill <theo <at> thornhill.no>, 61369 <at> debbugs.gnu.org
Subject: Re: bug#61369: Problem with keeping tree-sitter parse tree up-to-date
Date: Sat, 18 Feb 2023 03:25:59 +0200
On 18/02/2023 03:14, Yuan Fu wrote:
> 
> 
>> On Feb 17, 2023, at 4:11 PM, Dmitry Gutov <dgutov <at> yandex.ru> wrote:
>>
>> On 18/02/2023 00:32, Yuan Fu wrote:
>>> Thank you very much! I thought that clipping the change into the fixed visible range, and rely on treesit_sync_visible_region to add back the clipped “tail” when we extend the visible range would be equivalent to not clipping, but I guess clipping and re-adding affects how incremental parsing works inside tree-sitter.
>>
>> It seems like the "repairing" sync used a different range, one that didn't include the character number 68 inserted from the beginning.
>>
>> It just synced the 1 or 2 characters at the end of the buffer, the difference between the computed visible_end and the actual BUF_ZV_BYTE.
> 
> That should be enough, no? Because other text didn’t change, they just moved. And tree-sitter should know that they moved. Or maybe I’m misunderstanding what you mean.

But the "unsynced" character is at position 68.

And we just tell tree-sitter to update positions 134-136. So it stays 
ignorant of the changed char in the middle of the buffer.

It's not just about not knowing about the change either (the character 
in question is a newline, so its absence wouldn't lead to a syntax 
error), but about wrong offsets in the old parse tree, based on which 
the new tree is generated. That probably creates a wrong picture of the 
source text in the parser.

>>> I don’t think this change would have any adverse effect, because if you think of it, inserting text in a narrowed region always extends the region, rather than pushing text at the end out of the narrowed region. So the right thing to do here is in fact not clipping new_end_offset.
>>
>> I figured it could be a problem if both old_end_byte and new_end_byte extend past the current restriction.
> 
> That should be fine (ie, technically correct), since when we widen, the clipped text are reparsed by tree-sitter as new text.

I guess the effect I was thinking of is that

  XTS_PARSER (lisp_parser)->visible_end

would end up with a higher value than BUF_ZV_BYTE. Not sure if it's a 
problem.

>>
>> But I'm not sure whether that could actually happen in practice. The obvious attempts (undo a change outside of the narrowing, or revert the buffer when narrowing is in effect) didn't play out, but I'm not sure whether there is an actual hard limit on modifying the text outside of the current restriction.
> 
> It is my impression that Emacs in general enforces the narrowing restriction strictly. And we are still correct when exceptions occur.

Very good.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#61369; Package emacs. (Sat, 18 Feb 2023 07:17:01 GMT) Full text and rfc822 format available.

Message #44 received at 61369 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, theo <at> thornhill.no, 61369 <at> debbugs.gnu.org
Subject: Re: bug#61369: Problem with keeping tree-sitter parse tree up-to-date
Date: Sat, 18 Feb 2023 09:15:56 +0200
> Cc: Theodor Thornhill <theo <at> thornhill.no>, 61369 <at> debbugs.gnu.org
> Date: Sat, 18 Feb 2023 02:11:22 +0200
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> 
> I'm not sure whether there is an actual hard limit on modifying the
> text outside of the current restriction.

It is, for most/all practical purposes.  If you try to modify text
outside of the current restriction, you risk many parts of code
barfing or signaling errors on you.  For example, conversion from
character to byte positions and vice versa will stop working (and in a
build with --enable-checking will actually raise SIGABRT).




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#61369; Package emacs. (Sat, 18 Feb 2023 10:06:02 GMT) Full text and rfc822 format available.

Message #47 received at 61369 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: Theodor Thornhill <theo <at> thornhill.no>, 61369 <at> debbugs.gnu.org
Subject: Re: bug#61369: Problem with keeping tree-sitter parse tree up-to-date
Date: Sat, 18 Feb 2023 02:05:04 -0800

> On Feb 17, 2023, at 5:25 PM, Dmitry Gutov <dgutov <at> yandex.ru> wrote:
> 
> On 18/02/2023 03:14, Yuan Fu wrote:
>>> On Feb 17, 2023, at 4:11 PM, Dmitry Gutov <dgutov <at> yandex.ru> wrote:
>>> 
>>> On 18/02/2023 00:32, Yuan Fu wrote:
>>>> Thank you very much! I thought that clipping the change into the fixed visible range, and rely on treesit_sync_visible_region to add back the clipped “tail” when we extend the visible range would be equivalent to not clipping, but I guess clipping and re-adding affects how incremental parsing works inside tree-sitter.
>>> 
>>> It seems like the "repairing" sync used a different range, one that didn't include the character number 68 inserted from the beginning.
>>> 
>>> It just synced the 1 or 2 characters at the end of the buffer, the difference between the computed visible_end and the actual BUF_ZV_BYTE.
>> That should be enough, no? Because other text didn’t change, they just moved. And tree-sitter should know that they moved. Or maybe I’m misunderstanding what you mean.
> 
> But the "unsynced" character is at position 68.
> 
> And we just tell tree-sitter to update positions 134-136. So it stays ignorant of the changed char in the middle of the buffer.
> 
> It's not just about not knowing about the change either (the character in question is a newline, so its absence wouldn't lead to a syntax error), but about wrong offsets in the old parse tree, based on which the new tree is generated. That probably creates a wrong picture of the source text in the parser.

Ok, I made some visualization to understand it, and yeah you are right. I’ll need to modify the comment a bit.

|visible range|

updated range
-------------

|aaaaaa|
|bbbbbbbbaaaa|aa  start: 0, old_end: 0, new_end: 6
 ------          
|bbbbbbbbaaaaaa|  start: 12, old_end: 12, new_end: 14
             --


> 
>>>> I don’t think this change would have any adverse effect, because if you think of it, inserting text in a narrowed region always extends the region, rather than pushing text at the end out of the narrowed region. So the right thing to do here is in fact not clipping new_end_offset.
>>> 
>>> I figured it could be a problem if both old_end_byte and new_end_byte extend past the current restriction.
>> That should be fine (ie, technically correct), since when we widen, the clipped text are reparsed by tree-sitter as new text.
> 
> I guess the effect I was thinking of is that
> 
>  XTS_PARSER (lisp_parser)->visible_end
> 
> would end up with a higher value than BUF_ZV_BYTE. Not sure if it's a problem.

It shouldn’t be, since BUF_ZV_BYTE should automatically grow when user inserts text. Even if it does, we always call treesit_sync_visible_region to sync up visible_beg/end with BUF_(Z)V_BYTE before parsing, so it shouldn’t be a problem.

Yuan



Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#61369; Package emacs. (Sat, 18 Feb 2023 17:22:01 GMT) Full text and rfc822 format available.

Message #50 received at 61369 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, theo <at> thornhill.no, 61369 <at> debbugs.gnu.org
Subject: Re: bug#61369: Problem with keeping tree-sitter parse tree up-to-date
Date: Sat, 18 Feb 2023 19:21:40 +0200
On 18/02/2023 09:15, Eli Zaretskii wrote:
>> Cc: Theodor Thornhill<theo <at> thornhill.no>,61369 <at> debbugs.gnu.org
>> Date: Sat, 18 Feb 2023 02:11:22 +0200
>> From: Dmitry Gutov<dgutov <at> yandex.ru>
>>
>> I'm not sure whether there is an actual hard limit on modifying the
>> text outside of the current restriction.
> It is, for most/all practical purposes.  If you try to modify text
> outside of the current restriction, you risk many parts of code
> barfing or signaling errors on you.  For example, conversion from
> character to byte positions and vice versa will stop working (and in a
> build with --enable-checking will actually raise SIGABRT).

All right, thank you both.




This bug report was last modified 1 year and 278 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.