GNU bug report logs - #66674
30.0.50; Upstream tree-sitter and treesit disagree about fields


Package: emacs;

Reported by: Dominik Honnef <dominik <at> honnef.co>

Date: Sun, 22 Oct 2023 06:32:01 UTC

Severity: normal

Found in version 30.0.50

Done: Yuan Fu <casouri <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 66674 in the body.
You can then email your comments to 66674 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#66674; Package emacs. (Sun, 22 Oct 2023 06:32:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Dominik Honnef <dominik <at> honnef.co>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 22 Oct 2023 06:32:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Dominik Honnef <dominik <at> honnef.co>
To: bug-gnu-emacs <at> gnu.org
Subject: 30.0.50; Upstream tree-sitter and treesit disagree about fields
Date: Sat, 21 Oct 2023 22:36:30 +0200
Tree-sitter's CLI and the publicly hosted playground produce different
parse trees than treesit in Emacs. Specifically, the assignment of nodes
to named fields differs.

Given the following C source:

    void main() {
      int x = // foo
        1+
        // comment
        2;
    }

treesit-explore-mode displays the following tree:

    (translation_unit
     (function_definition type: (primitive_type)
      declarator: 
       (function_declarator declarator: (identifier)
        parameters: (parameter_list ( )))
      body: 
       (compound_statement {
        (declaration type: (primitive_type)
         declarator: 
          (init_declarator declarator: (identifier) = value: (comment)
           (binary_expression left: (number_literal) operator: + right: (comment) (number_literal)))
         ;)
        })))

Note how in the init_declarator node, the 'value' field is a comment
node, and similarly for the 'right' field in the binary_expression node.

Running 'tree-sitter parse file.c', on the other hand, produces the
following tree:

    (translation_unit [0, 0] - [6, 0]
      (function_definition [0, 0] - [5, 1]
        type: (primitive_type [0, 0] - [0, 4])
        declarator: (function_declarator [0, 5] - [0, 11]
          declarator: (identifier [0, 5] - [0, 9])
          parameters: (parameter_list [0, 9] - [0, 11]))
        body: (compound_statement [0, 12] - [5, 1]
          (declaration [1, 2] - [4, 6]
            type: (primitive_type [1, 2] - [1, 5])
            declarator: (init_declarator [1, 6] - [4, 5]
              declarator: (identifier [1, 6] - [1, 7])
              (comment [1, 10] - [1, 16])
              value: (binary_expression [2, 4] - [4, 5]
                left: (number_literal [2, 4] - [2, 5])
                (comment [3, 4] - [3, 14])
                right: (number_literal [4, 4] - [4, 5])))))))

Here, the two comment nodes appear as children without a field name.
IMHO the second tree is the more useful one, as the named fields contain
the semantically important subtrees (e.g. a binary expression is made up
of a left and a right subtree, not a left subtree, a right comment, and
then some unnamed subtree).

Emacs's tree makes writing queries less convenient: instead of being
able to refer to well-defined field names, one has to rely on child
indices to account for comments.
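
To make the difference concrete, here is a minimal Python sketch. It does
not call tree-sitter; the child lists are transcribed by hand from the two
trees above, and `child_by_field` is a hypothetical helper, not part of
treesit or the tree-sitter API. It only illustrates what a lookup by field
name would return under each field assignment.

```python
from typing import List, Optional, Tuple

# Each child is (field name or None, node type), copied by hand from the
# two trees in this report. The CLI output omits anonymous tokens like "=".
Child = Tuple[Optional[str], str]

EMACS_INIT_DECLARATOR: List[Child] = [
    ("declarator", "identifier"),
    (None, "="),
    ("value", "comment"),
    (None, "binary_expression"),
]
CLI_INIT_DECLARATOR: List[Child] = [
    ("declarator", "identifier"),
    (None, "comment"),
    ("value", "binary_expression"),
]

EMACS_BINARY_EXPRESSION: List[Child] = [
    ("left", "number_literal"),
    ("operator", "+"),
    ("right", "comment"),
    (None, "number_literal"),
]
CLI_BINARY_EXPRESSION: List[Child] = [
    ("left", "number_literal"),
    (None, "comment"),
    ("right", "number_literal"),
]


def child_by_field(children: List[Child], field: str) -> Optional[str]:
    """Return the type of the first child tagged with FIELD, or None.

    Hypothetical helper for illustration only.
    """
    for name, node_type in children:
        if name == field:
            return node_type
    return None


if __name__ == "__main__":
    for label, emacs, cli, field in [
        ("init_declarator", EMACS_INIT_DECLARATOR, CLI_INIT_DECLARATOR, "value"),
        ("binary_expression", EMACS_BINARY_EXPRESSION, CLI_BINARY_EXPRESSION, "right"),
    ]:
        print(label, field, "emacs:", child_by_field(emacs, field),
              "cli:", child_by_field(cli, field))
```

Under the assignment shown by treesit-explore-mode, looking up 'value' or
'right' yields the comment; under the CLI's assignment it yields the
expression.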


Further mismatch arises from repeated fields and separators.

Consider the following Go source:

    package pkg
    
    var a, b, c = 1, 2, 3

treesit-explore-mode displays the following tree:

    (source_file
     (package_clause package (package_identifier))
     \n
     (var_declaration var
      (var_spec name: (identifier) name: , (identifier) value: , (identifier) =
       (expression_list (int_literal) , (int_literal) , (int_literal))))
     \n)

Here, the var_spec node has two fields named 'name' even though the
source specifies three names. Furthermore, the second 'name' and the
'value' field are both set to a ',' separator between identifiers. Two of
the three identifiers have no field name.

'tree-sitter parse file.go', on the other hand, produces this more
accurate tree:

    (source_file [0, 0] - [2, 21]
      (package_clause [0, 0] - [0, 11]
        (package_identifier [0, 8] - [0, 11]))
      (var_declaration [2, 0] - [2, 21]
        (var_spec [2, 4] - [2, 21]
          name: (identifier [2, 4] - [2, 5])
          name: (identifier [2, 7] - [2, 8])
          name: (identifier [2, 10] - [2, 11])
          value: (expression_list [2, 14] - [2, 21]
            (int_literal [2, 14] - [2, 15])
            (int_literal [2, 17] - [2, 18])
            (int_literal [2, 20] - [2, 21])))))

This reproduces with 29.1 as well as 30.0.50.
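
The Go mismatch can be summarized with another small Python sketch. As
before, this is only an illustration: the child lists are transcribed by
hand from the two var_spec trees above, and `types_by_field` is a
hypothetical helper that groups child types by their field name.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Each child is (field name or None, node type), copied by hand from the
# two var_spec trees in this report.
Child = Tuple[Optional[str], str]

EMACS_VAR_SPEC: List[Child] = [
    ("name", "identifier"),
    ("name", ","),
    (None, "identifier"),
    ("value", ","),
    (None, "identifier"),
    (None, "="),
    (None, "expression_list"),
]
CLI_VAR_SPEC: List[Child] = [
    ("name", "identifier"),
    ("name", "identifier"),
    ("name", "identifier"),
    ("value", "expression_list"),
]


def types_by_field(children: List[Child]) -> Dict[str, List[str]]:
    """Group node types by field name, skipping children without a field.

    Hypothetical helper for illustration only.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for field, node_type in children:
        if field is not None:
            grouped[field].append(node_type)
    return dict(grouped)


if __name__ == "__main__":
    print("emacs:", types_by_field(EMACS_VAR_SPEC))
    print("cli:  ", types_by_field(CLI_VAR_SPEC))
```

The grouping shows two 'name' entries (one of them a ',') in the tree from
treesit-explore-mode, against three identifiers in the CLI tree.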




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#66674; Package emacs. (Wed, 25 Oct 2023 13:04:02 GMT) Full text and rfc822 format available.

Message #8 received at 66674 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dominik Honnef <dominik <at> honnef.co>, Yuan Fu <casouri <at> gmail.com>
Cc: 66674 <at> debbugs.gnu.org
Subject: Re: bug#66674: 30.0.50;
 Upstream tree-sitter and treesit disagree about fields
Date: Wed, 25 Oct 2023 16:03:10 +0300

Yuan, any comments or suggestions?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#66674; Package emacs. (Sun, 19 Nov 2023 10:09:02 GMT) Full text and rfc822 format available.

Message #11 received at 66674 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: casouri <at> gmail.com
Cc: 66674 <at> debbugs.gnu.org, dominik <at> honnef.co
Subject: Re: bug#66674: 30.0.50;
 Upstream tree-sitter and treesit disagree about fields
Date: Sun, 19 Nov 2023 12:08:08 +0200
Ping!  Yuan, any comments?





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#66674; Package emacs. (Sat, 25 Nov 2023 10:04:01 GMT) Full text and rfc822 format available.

Message #14 received at 66674 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: casouri <at> gmail.com
Cc: 66674 <at> debbugs.gnu.org, dominik <at> honnef.co
Subject: Re: bug#66674: 30.0.50;
 Upstream tree-sitter and treesit disagree about fields
Date: Sat, 25 Nov 2023 12:03:27 +0200
Ping! Ping!  Yuan, please chime in.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#66674; Package emacs. (Sun, 10 Dec 2023 10:08:01 GMT) Full text and rfc822 format available.

Message #17 received at 66674 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 66674 <at> debbugs.gnu.org, dominik <at> honnef.co
Subject: Re: bug#66674: 30.0.50; Upstream tree-sitter and treesit disagree
 about fields
Date: Sun, 10 Dec 2023 02:07:35 -0800

On 11/25/23 2:03 AM, Eli Zaretskii wrote:
> Ping! Ping!  Yuan, please chime in.

Sorry sorry sorry, another missed report. I think this is a bug in
treesit-explore-mode; I'll work on fixing it!

Yuan




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#66674; Package emacs. (Sun, 10 Dec 2023 14:43:02 GMT) Full text and rfc822 format available.

Message #20 received at 66674 <at> debbugs.gnu.org (full text, mbox):

From: Dominik Honnef <dominik <at> honnef.co>
To: Yuan Fu <casouri <at> gmail.com>, Eli Zaretskii <eliz <at> gnu.org>
Cc: 66674 <at> debbugs.gnu.org
Subject: Re: bug#66674: 30.0.50; Upstream tree-sitter and treesit disagree
 about fields
Date: Sun, 10 Dec 2023 15:28:38 +0100
Yuan Fu <casouri <at> gmail.com> writes:

> Sorry sorry sorry, another missed report. I think this is a bug in 
> treesit-explore-mode, I'll work on fixing it!
>
> Yuan

I don't think that's the case, at least not exclusively. I used
treesit-explore-mode to debug patterns that matched in the playground
but not in Emacs. The matching behavior seemed pretty in line with what
treesit-explore-mode reported.




Reply sent to Yuan Fu <casouri <at> gmail.com>:
You have taken responsibility. (Mon, 11 Dec 2023 01:04:02 GMT) Full text and rfc822 format available.

Notification sent to Dominik Honnef <dominik <at> honnef.co>:
bug acknowledged by developer. (Mon, 11 Dec 2023 01:04:02 GMT) Full text and rfc822 format available.

Message #25 received at 66674-done <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Dominik Honnef <dominik <at> honnef.co>, Eli Zaretskii <eliz <at> gnu.org>
Cc: 66674-done <at> debbugs.gnu.org
Subject: Re: bug#66674: 30.0.50; Upstream tree-sitter and treesit disagree
 about fields
Date: Sun, 10 Dec 2023 17:02:48 -0800

On 12/10/23 6:28 AM, Dominik Honnef wrote:
> Yuan Fu <casouri <at> gmail.com> writes:
>
>> Sorry sorry sorry, another missed report. I think this is a bug in
>> treesit-explore-mode, I'll work on fixing it!
>>
>> Yuan
> I don't think that's the case, at least not exclusively. I used
> treesit-explore-mode to debug patterns that matched in the playground
> but not in Emacs. The matching behavior seemed pretty in line with what
> treesit-explore-mode reported.
I do find that treesit-node-field-name is returning wrong field names.
That's why, in the first example, you see the "value" field name given
to the comment node rather than to the binary_expression behind it. In
the actual parse tree, "value" belongs to binary_expression. With the
fix I just pushed to emacs-29, the explorer parse tree for the first
example becomes

(translation_unit
 (function_definition type: (primitive_type)
  declarator:
   (function_declarator declarator: (identifier)
    parameters: (parameter_list ( )))
  body:
   (compound_statement {
    (declaration type: (primitive_type)
     declarator:
      (init_declarator declarator: (identifier) = (comment)
       value: (binary_expression left: (number_literal) operator: +
                                 operator: (comment)
               right: (number_literal)))
     ;)
    })))

which should match the playground.
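
To check the field names outside the explorer, something along these
lines might help. It is only a sketch: it assumes Emacs 29 or later
built with tree-sitter and an installed C grammar, and the names
my/child-fields and my/demo-child-fields are made up for illustration.
It parses the C example from the report and pairs each child of the
init_declarator node with the field name treesit-node-field-name
reports for it.

```elisp
;; Sketch only; assumes Emacs 29+ with tree-sitter and the C grammar.
;; `my/child-fields' and `my/demo-child-fields' are hypothetical names.
(require 'treesit)

(defun my/child-fields (node)
  "Return a list of (FIELD-NAME . TYPE) for every child of NODE.
FIELD-NAME is nil for children that have no field name."
  (mapcar (lambda (child)
            (cons (treesit-node-field-name child)
                  (treesit-node-type child)))
          ;; Without the NAMED argument this includes anonymous
          ;; children such as "=" as well as comments.
          (treesit-node-children node)))

(defun my/demo-child-fields ()
  "Parse the C example from the report and list init_declarator's children."
  (with-temp-buffer
    (insert "void main() {\n  int x = // foo\n    1+\n    // comment\n    2;\n}\n")
    (let* ((parser (treesit-parser-create 'c))
           (root (treesit-parser-root-node parser))
           ;; A string predicate is matched as a regexp against node types.
           (decl (treesit-search-subtree root "init_declarator")))
      (my/child-fields decl))))

(my/demo-child-fields)
```

If the result agrees with the explorer tree above, the comment child
should come back with a nil field name and binary_expression with
"value".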

If you can find a pattern that matches in the playground but not in
Emacs, please post it and I'll see whether anything is wrong.
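
For comparing a pattern between the two tools, a small helper like the
one below could be used on the Emacs side. Again this is a sketch under
assumptions: Emacs 29 or later with the C grammar installed, and the
names my/capture-types and my/demo-captures are hypothetical. The two
patterns are just examples built from the field names discussed in this
thread; substitute the pattern that misbehaves.

```elisp
;; Sketch only; assumes Emacs 29+ with tree-sitter and the C grammar.
;; `my/capture-types' and `my/demo-captures' are hypothetical names.
(require 'treesit)

(defun my/capture-types (source query)
  "Parse SOURCE as C and run QUERY on the root node.
Return a list of (CAPTURE-NAME . NODE-TYPE) pairs."
  (with-temp-buffer
    (insert source)
    (let ((root (treesit-parser-root-node (treesit-parser-create 'c))))
      (mapcar (lambda (capture)
                ;; Each capture is a cons cell (NAME . NODE).
                (cons (car capture) (treesit-node-type (cdr capture))))
              (treesit-query-capture root query)))))

(defun my/demo-captures ()
  "Capture the `value' and `right' fields in the C example from the report."
  (my/capture-types
   "void main() {\n  int x = // foo\n    1+\n    // comment\n    2;\n}\n"
   '((init_declarator value: (_) @value)
     (binary_expression right: (_) @right))))

(my/demo-captures)
```

The same patterns can be pasted into the playground in its usual query
syntax, and the node types captured there compared with the pairs
returned here.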

Yuan




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 08 Jan 2024 12:24:09 GMT)

This bug report was last modified 1 year and 124 days ago.
