GNU bug report logs - #60953
The :match predicate with large regexp in tree-sitter font-lock seems inefficient

Package: emacs;

Reported by: Dmitry Gutov <dgutov <at> yandex.ru>

Date: Fri, 20 Jan 2023 03:54:02 UTC

Severity: normal

Done: Dmitry Gutov <dgutov <at> yandex.ru>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 60953 in the body.
You can then email your comments to 60953 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Fri, 20 Jan 2023 03:54:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Dmitry Gutov <dgutov <at> yandex.ru>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Fri, 20 Jan 2023 03:54:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: bug-gnu-emacs <at> gnu.org
Subject: The :match predicate with large regexp in tree-sitter font-lock seems
 inefficient
Date: Fri, 20 Jan 2023 05:53:12 +0200

[Message part 1 (text/plain, inline)]

In my benchmarking -- using this form in 
test/lisp/progmodes/ruby-mode-resources/ruby.rb after enabling ruby-ts-mode:

  (benchmark-run 1000 (progn (font-lock-mode -1) (font-lock-mode 1) 
(let (treesit--font-lock-fast-mode) (font-lock-ensure))))

the rule added to its font-lock in commit d66ac5285f7

   :language language
   :feature 'builtin-functions
   `((((identifier) @font-lock-builtin-face)
      (:match ,ruby-ts--builtin-methods
       @font-lock-builtin-face)))

...seems to have made it 50% slower.

The profile looked like this:

  9454  84%                   - font-lock-fontify-region
  9454  84%                    - font-lock-default-fontify-region
  8862  79%                     - font-lock-fontify-syntactically-region
  8702  78%                      - treesit-font-lock-fontify-region
   128   1%                         treesit-fontify-with-override
   123   1%                         facep
    84   0% 
treesit--children-covering-range-recurse
    60   0%                       + ruby-ts--comment-font-lock
     4   0%                       + font-lock-unfontify-region
   568   5%                     + font-lock-fontify-keywords-region
    16   0%                     + font-lock-unfontify-region

So there's nothing on the Lisp level to look at.

Looking at the code, apparently we get a cursor and basically iterate 
through all (identifier) nodes, running our predicate manually.

Without trying something more advanced like perf, I took a stab in the 
dark and tried to reduce string allocation in treesit_predicate_match 
(it currently ends up delegating to buffer-substring for every node), 
which seemed inefficient. But while my patch (attached) compiles and 
doesn't crash, it doesn't actually work (the rule's highlighting is 
missing), and the performance was unchanged.

This message was originally longer, but see commit d94dc606a09: I 
switched to using :pred -- thus avoiding embedding the 720-char long 
regexp in the query -- and the performance drop got reduced to like 20%.

As a baseline, this simplified query without predicates and colors every 
identifier in the buffer using the specified face, is still faster (just 
10% over the original):

   :language language
   :feature 'builtin-function
   `(((identifier) @font-lock-builtin-face))

The regexp matching itself doesn't seem to be the problem:

(benchmark 354100 '(string-match-p ruby-ts--builtin-methods "gsub"))

=> Elapsed time: 0.141681s

-- whereas the difference between the benchmarks is on the order of seconds.

I think the marshaling of the long regexp string back and forth could be 
the culprit. Would be nice to fix that somehow.

I also think that trying to reduce the string allocation overhead has 
potential, but so far all my experiments haven't moved the needle 
anywhere noticeable.

[treesit_predicate_match.diff (text/x-patch, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Tue, 24 Jan 2023 04:05:01 GMT) Full text and rfc822 format available.

Message #8 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: 60953 <at> debbugs.gnu.org, Yuan Fu <casouri <at> gmail.com>
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Tue, 24 Jan 2023 06:04:07 +0200

Cc-ing Yuan, just in case.

On 20/01/2023 05:53, Dmitry Gutov wrote:
> In my benchmarking -- using this form in 
> test/lisp/progmodes/ruby-mode-resources/ruby.rb after enabling 
> ruby-ts-mode:
> 
>    (benchmark-run 1000 (progn (font-lock-mode -1) (font-lock-mode 1) 
> (let (treesit--font-lock-fast-mode) (font-lock-ensure))))
> 
> the rule added to its font-lock in commit d66ac5285f7
> 
>     :language language
>     :feature 'builtin-functions
>     `((((identifier) @font-lock-builtin-face)
>        (:match ,ruby-ts--builtin-methods
>         @font-lock-builtin-face)))
> 
> ...seems to have made it 50% slower.
> 
> The profile looked like this:
> 
>    9454  84%                   - font-lock-fontify-region
>    9454  84%                    - font-lock-default-fontify-region
>    8862  79%                     - font-lock-fontify-syntactically-region
>    8702  78%                      - treesit-font-lock-fontify-region
>     128   1%                         treesit-fontify-with-override
>     123   1%                         facep
>      84   0% treesit--children-covering-range-recurse
>      60   0%                       + ruby-ts--comment-font-lock
>       4   0%                       + font-lock-unfontify-region
>     568   5%                     + font-lock-fontify-keywords-region
>      16   0%                     + font-lock-unfontify-region
> 
> So there's nothing on the Lisp level to look at.

I've done some perf recordings now. It seems most/all of the difference 
comes down to garbage collection. Or more concretely, time spent inside 
process_mark_stack.

Without the added query benchmark reports:

(10.13723333 49 1.141649534999999)

And the perf top5 is:

  17.26%  emacs         libtree-sitter.so.0.0       [.] 
ts_tree_cursor_current_status
  10.83%  emacs         libtree-sitter.so.0.0       [.] 
ts_tree_cursor_goto_next_sibling
  10.18%  emacs         libtree-sitter.so.0.0       [.] 
ts_tree_cursor_goto_first_child
   8.37%  emacs         emacs                       [.] process_mark_stack
   4.63%  emacs         libtree-sitter.so.0.0       [.] ts_node_start_point

With this simple query that colors everything:

   :language language
   :feature 'builtin-function
   `((((identifier) @font-lock-builtin-face)))

I get:

(11.993968995 82 1.9326509270000045)

Note the jump in runtime that's larger than the jump in GC.

  17.26%  emacs         libtree-sitter.so.0.0       [.] 
ts_tree_cursor_current_status
  10.83%  emacs         libtree-sitter.so.0.0       [.] 
ts_tree_cursor_goto_next_sibling
  10.18%  emacs         libtree-sitter.so.0.0       [.] 
ts_tree_cursor_goto_first_child
   8.37%  emacs         emacs                       [.] process_mark_stack
   4.63%  emacs         libtree-sitter.so.0.0       [.] ts_node_start_point

The current query looks like this:

   :language language
   :feature 'builtin-function
   `((((identifier) @font-lock-builtin-face)
      (:pred ruby-ts--builtin-method-p @font-lock-builtin-face)))

Benchmarking:

(12.493614359 107 2.558609025999999)

  15.30%  emacs         libtree-sitter.so.0.0       [.] 
ts_tree_cursor_current_status
  14.92%  emacs         emacs                       [.] process_mark_stack
   9.75%  emacs         libtree-sitter.so.0.0       [.] 
ts_tree_cursor_goto_next_sibling
   8.90%  emacs         libtree-sitter.so.0.0       [.] 
ts_tree_cursor_goto_first_child
   3.87%  emacs         libtree-sitter.so.0.0       [.] ts_node_start_point

Here we get the same jump in runtime as in GC. Even though this rule 
ends up coloring much fewer (almost none) nodes in the current buffer. I 
interpret the results like this:

- The jump in runtime of the previous query was probably related to the 
number of nodes needed to be processed, but not with the resulting 
highlighting, even though every identifier in the buffer ends up being 
colored.

- The GC overhead created by the predicates is non-negligible.

And the original query that I tried:

   :language language
   :feature 'builtin-function
   `((((identifier) @font-lock-builtin-face)
      (:match ,ruby-ts--builtin-methods @font-lock-builtin-face)))

Benchmarking:

(16.433451865000002 249 5.908674810000001)

  23.72%  emacs         emacs                    [.] process_mark_stack
  12.33%  emacs         libtree-sitter.so.0.0    [.] 
ts_tree_cursor_current_status
   7.96%  emacs         libtree-sitter.so.0.0    [.] 
ts_tree_cursor_goto_next_sibling
   7.38%  emacs         libtree-sitter.so.0.0    [.] 
ts_tree_cursor_goto_first_child
   3.37%  emacs         libtree-sitter.so.0.0    [.] ts_node_start_point

Here's a significant jump in GC time which is almost the same as the 
difference in runtime. And all of it is spent marking?

I suppose if the problem is allocation of a large string (many times 
over), the GC could be spending a lot of time scanning through the 
memory. Could this be avoided by passing some substitute handle to TS, 
instead of the full string? E.g. some kind of reference to it in the 
regexp cache.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Wed, 25 Jan 2023 03:14:02 GMT) Full text and rfc822 format available.

Message #11 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Tue, 24 Jan 2023 19:13:29 -0800


> On Jan 23, 2023, at 8:04 PM, Dmitry Gutov <dgutov <at> yandex.ru> wrote:
> 
> Cc-ing Yuan, just in case.
> 
> On 20/01/2023 05:53, Dmitry Gutov wrote:
>> In my benchmarking -- using this form in test/lisp/progmodes/ruby-mode-resources/ruby.rb after enabling ruby-ts-mode:
>>   (benchmark-run 1000 (progn (font-lock-mode -1) (font-lock-mode 1) (let (treesit--font-lock-fast-mode) (font-lock-ensure))))
>> the rule added to its font-lock in commit d66ac5285f7
>>    :language language
>>    :feature 'builtin-functions
>>    `((((identifier) @font-lock-builtin-face)
>>       (:match ,ruby-ts--builtin-methods
>>        @font-lock-builtin-face)))
>> ...seems to have made it 50% slower.
>> The profile looked like this:
>>   9454  84%                   - font-lock-fontify-region
>>   9454  84%                    - font-lock-default-fontify-region
>>   8862  79%                     - font-lock-fontify-syntactically-region
>>   8702  78%                      - treesit-font-lock-fontify-region
>>    128   1%                         treesit-fontify-with-override
>>    123   1%                         facep
>>     84   0% treesit--children-covering-range-recurse
>>     60   0%                       + ruby-ts--comment-font-lock
>>      4   0%                       + font-lock-unfontify-region
>>    568   5%                     + font-lock-fontify-keywords-region
>>     16   0%                     + font-lock-unfontify-region
>> So there's nothing on the Lisp level to look at.
> 
> I've done some perf recordings now. It seems most/all of the difference comes down to garbage collection. Or more concretely, time spent inside process_mark_stack.
> 
> Without the added query benchmark reports:
> 
> (10.13723333 49 1.141649534999999)
> 
> And the perf top5 is:
> 
>  17.26%  emacs         libtree-sitter.so.0.0       [.] ts_tree_cursor_current_status
>  10.83%  emacs         libtree-sitter.so.0.0       [.] ts_tree_cursor_goto_next_sibling
>  10.18%  emacs         libtree-sitter.so.0.0       [.] ts_tree_cursor_goto_first_child
>   8.37%  emacs         emacs                       [.] process_mark_stack
>   4.63%  emacs         libtree-sitter.so.0.0       [.] ts_node_start_point
> 
> With this simple query that colors everything:
> 
>   :language language
>   :feature 'builtin-function
>   `((((identifier) @font-lock-builtin-face)))
> 
> I get:
> 
> (11.993968995 82 1.9326509270000045)
> 
> Note the jump in runtime that's larger than the jump in GC.
> 
>  17.26%  emacs         libtree-sitter.so.0.0       [.] ts_tree_cursor_current_status
>  10.83%  emacs         libtree-sitter.so.0.0       [.] ts_tree_cursor_goto_next_sibling
>  10.18%  emacs         libtree-sitter.so.0.0       [.] ts_tree_cursor_goto_first_child
>   8.37%  emacs         emacs                       [.] process_mark_stack
>   4.63%  emacs         libtree-sitter.so.0.0       [.] ts_node_start_point
> 
> The current query looks like this:
> 
>   :language language
>   :feature 'builtin-function
>   `((((identifier) @font-lock-builtin-face)
>      (:pred ruby-ts--builtin-method-p @font-lock-builtin-face)))
> 
> Benchmarking:
> 
> (12.493614359 107 2.558609025999999)
> 
>  15.30%  emacs         libtree-sitter.so.0.0       [.] ts_tree_cursor_current_status
>  14.92%  emacs         emacs                       [.] process_mark_stack
>   9.75%  emacs         libtree-sitter.so.0.0       [.] ts_tree_cursor_goto_next_sibling
>   8.90%  emacs         libtree-sitter.so.0.0       [.] ts_tree_cursor_goto_first_child
>   3.87%  emacs         libtree-sitter.so.0.0       [.] ts_node_start_point
> 
> Here we get the same jump in runtime as in GC. Even though this rule ends up coloring much fewer (almost none) nodes in the current buffer. I interpret the results like this:
> 
> - The jump in runtime of the previous query was probably related to the number of nodes needed to be processed, but not with the resulting highlighting, even though every identifier in the buffer ends up being colored.
> 
> - The GC overhead created by the predicates is non-negligible.
> 
> And the original query that I tried:
> 
>   :language language
>   :feature 'builtin-function
>   `((((identifier) @font-lock-builtin-face)
>      (:match ,ruby-ts--builtin-methods @font-lock-builtin-face)))
> 
> Benchmarking:
> 
> (16.433451865000002 249 5.908674810000001)
> 
>  23.72%  emacs         emacs                    [.] process_mark_stack
>  12.33%  emacs         libtree-sitter.so.0.0    [.] ts_tree_cursor_current_status
>   7.96%  emacs         libtree-sitter.so.0.0    [.] ts_tree_cursor_goto_next_sibling
>   7.38%  emacs         libtree-sitter.so.0.0    [.] ts_tree_cursor_goto_first_child
>   3.37%  emacs         libtree-sitter.so.0.0    [.] ts_node_start_point
> 
> Here's a significant jump in GC time which is almost the same as the difference in runtime. And all of it is spent marking?
> 
> I suppose if the problem is allocation of a large string (many times over), the GC could be spending a lot of time scanning through the memory. Could this be avoided by passing some substitute handle to TS, instead of the full string? E.g. some kind of reference to it in the regexp cache.
> 

FYI the predicates are not processed by tree-sitter, but by us. For example, the #equal predicate is handled by treesit_predicate_equal. For #match, right now we create a string with Fbuffer_substring and pass it to fast_string_match, so it definitely causes a lot of gc’s, as you observed. We can probably match the regexp in-place, just limit the match to the range of the node.

Yuan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Wed, 25 Jan 2023 03:49:02 GMT) Full text and rfc822 format available.

Message #14 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Yuan Fu <casouri <at> gmail.com>
Cc: 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Wed, 25 Jan 2023 05:48:13 +0200

On 25/01/2023 05:13, Yuan Fu wrote:

> FYI the predicates are not processed by tree-sitter, but by us. For example, the #equal predicate is handled by treesit_predicate_equal. For #match, right now we create a string with Fbuffer_substring and pass it to fast_string_match, so it definitely causes a lot of gc’s, as you observed.

Right.

> We can probably match the regexp in-place, just limit the match to the range of the node.

That's what I tried to do in the patch attached to the first message: 
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=60953#5

But the effect on performance was surprisingly hard to notice. It also 
broke the actual highlighting, but that's probably because the regexp 
uses anchors \` and \', which don't really work for fast_looking_at 
calls inside a buffer.

I also experimented with replacing the current 
buffer-substring+string-match-p scheme with looking-at. No difference in 
performance.

Reducing the size of the regexp, however, made a lot of difference. 
ruby-ts--builtin-methods is 721 characters long.

So my current hypothesis is that the extra GC is caused by copying the 
regexp string back and forth. Which seems a bit more difficult to avoid. 
But could be done if we replace that value with some indirection.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Wed, 25 Jan 2023 12:50:01 GMT) Full text and rfc822 format available.

Message #17 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Wed, 25 Jan 2023 14:49:59 +0200

> Cc: 60953 <at> debbugs.gnu.org
> Date: Wed, 25 Jan 2023 05:48:13 +0200
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> 
> > We can probably match the regexp in-place, just limit the match to the range of the node.
> 
> That's what I tried to do in the patch attached to the first message: 
> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=60953#5
> 
> But the effect on performance was surprisingly hard to notice. It also 
> broke the actual highlighting, but that's probably because the regexp 
> uses anchors \` and \', which don't really work for fast_looking_at 
> calls inside a buffer.

The condition for a match in that patch is not correct, AFAIU:

   if (val >= 0)
     return true;
   else
     return false;

It should be "if (val > 0)" instead, since fast_looking_at returns the
number of characters that matched (unlike fast_string_match in the
original code, which returns the _index_ of the match).

Also, fast_string_match is capable of succeeding if the match begins
not at the first character, whereas fast_looking_at does an anchored
match.  Do we expect the text to match from its beginning in this
case?  If not, I think the replacement didn't do what the original
code does, and you should have used search_buffer or maybe
search_buffer_re instead.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Wed, 25 Jan 2023 23:22:01 GMT) Full text and rfc822 format available.

Message #20 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 26 Jan 2023 01:21:08 +0200

On 25/01/2023 14:49, Eli Zaretskii wrote:
>> Cc: 60953 <at> debbugs.gnu.org
>> Date: Wed, 25 Jan 2023 05:48:13 +0200
>> From: Dmitry Gutov <dgutov <at> yandex.ru>
>>
>>> We can probably match the regexp in-place, just limit the match to the range of the node.
>>
>> That's what I tried to do in the patch attached to the first message:
>> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=60953#5
>>
>> But the effect on performance was surprisingly hard to notice. It also
>> broke the actual highlighting, but that's probably because the regexp
>> uses anchors \` and \', which don't really work for fast_looking_at
>> calls inside a buffer.
> 
> The condition for a match in that patch is not correct, AFAIU:
> 
>     if (val >= 0)
>       return true;
>     else
>       return false;
> 
> It should be "if (val > 0)" instead, since fast_looking_at returns the
> number of characters that matched (unlike fast_string_match in the
> original code, which returns the _index_ of the match).

Thank you. Unfortunately, the performance improvement from this patch is 
still fairly negligible.

Even though I got the highlighting to work -- by removing the \` and \' 
anchors from ruby-ts--builtin-methods (reducing the precision a little, 
but that's not important for the benchmark).

Switching to using :pred with function (like I did in commit 
d94dc606a0934) which still uses buffer-substring inside is significantly 
faster.

> Also, fast_string_match is capable of succeeding if the match begins
> not at the first character, whereas fast_looking_at does an anchored
> match.  Do we expect the text to match from its beginning in this
> case?  If not, I think the replacement didn't do what the original
> code does, and you should have used search_buffer or maybe
> search_buffer_re instead.

I suppose one could use a non-anchored regexp with :match, but that's 
not the case with the regexp I'm using currently.

Anyway, that's only going to be important if we find something that I 
missed here with this patch. Because otherwise the major bottleneck is 
somewhere else.

If we do end up using it and try to get it to 100% correctness, I 
suppose a combination of narrow-to-region (so that the \` and \' anchors 
work) with re-search-forward can do the trick.

Although I've tried using that combination inside 
ruby-ts--builtin-method-p (to avoid the buffer-substring call), and it 
wasn't much of an improvement in performance either.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 26 Jan 2023 06:50:01 GMT) Full text and rfc822 format available.

Message #23 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 26 Jan 2023 08:50:04 +0200

> Date: Thu, 26 Jan 2023 01:21:08 +0200
> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> 
> Thank you. Unfortunately, the performance improvement from this patch is 
> still fairly negligible.

This is quite strange, since all of the approaches basically use the
same primitives under the hood.  Perhaps the reason for the slowness
is that the code which computes the text span of a node is slow?
Otherwise, I must be missing something here, since the rest of the
code on the C level is basically the same, give or take some wrappers
that should not change the overall picture.

Yuan, do you have some insights here?

> Switching to using :pred with function (like I did in commit 
> d94dc606a0934) which still uses buffer-substring inside is significantly 
> faster.

If the performance issue is fixed, then the only aspect that we should
perhaps try to improve is consing.  Consing a string each time you
need to fontify increases the GC pressure, so if there's a good way of
avoiding that without performance degradation, we should take it.  Is
it possible to use your :pred technique in a way that doesn't need to
produce strings from buffer text?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 26 Jan 2023 07:18:01 GMT) Full text and rfc822 format available.

Message #26 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 60953 <at> debbugs.gnu.org, Dmitry Gutov <dgutov <at> yandex.ru>
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Wed, 25 Jan 2023 23:17:25 -0800


> On Jan 25, 2023, at 10:50 PM, Eli Zaretskii <eliz <at> gnu.org> wrote:
> 
>> Date: Thu, 26 Jan 2023 01:21:08 +0200
>> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
>> From: Dmitry Gutov <dgutov <at> yandex.ru>
>> 
>> Thank you. Unfortunately, the performance improvement from this patch is 
>> still fairly negligible.
> 
> This is quite strange, since all of the approaches basically use the
> same primitives under the hood.  Perhaps the reason for the slowness
> is that the code which computes the text span of a node is slow?
> Otherwise, I must be missing something here, since the rest of the
> code on the C level is basically the same, give or take some wrappers
> that should not change the overall picture.
> 
> Yuan, do you have some insights here?

Sadly, no.

> 
>> Switching to using :pred with function (like I did in commit 
>> d94dc606a0934) which still uses buffer-substring inside is significantly 
>> faster.
> 
> If the performance issue is fixed, then the only aspect that we should
> perhaps try to improve is consing.  Consing a string each time you
> need to fontify increases the GC pressure, so if there's a good way of
> avoiding that without performance degradation, we should take it.  Is
> it possible to use your :pred technique in a way that doesn't need to
> produce strings from buffer text?

Why is :pred more performant though? They just use string-match-p. If anything, the :pred predicates should be more expensive, since they execute lisp functions and conses tree-sitter nodes into lisp objects.

Yuan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 26 Jan 2023 08:11:02 GMT) Full text and rfc822 format available.

Message #29 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Yuan Fu <casouri <at> gmail.com>
Cc: 60953 <at> debbugs.gnu.org, dgutov <at> yandex.ru
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 26 Jan 2023 10:10:17 +0200

> From: Yuan Fu <casouri <at> gmail.com>
> Date: Wed, 25 Jan 2023 23:17:25 -0800
> Cc: Dmitry Gutov <dgutov <at> yandex.ru>,
>  60953 <at> debbugs.gnu.org
> 
> >> Switching to using :pred with function (like I did in commit 
> >> d94dc606a0934) which still uses buffer-substring inside is significantly 
> >> faster.
> > 
> > If the performance issue is fixed, then the only aspect that we should
> > perhaps try to improve is consing.  Consing a string each time you
> > need to fontify increases the GC pressure, so if there's a good way of
> > avoiding that without performance degradation, we should take it.  Is
> > it possible to use your :pred technique in a way that doesn't need to
> > produce strings from buffer text?
> 
> Why is :pred more performant though? They just use string-match-p. If anything, the :pred predicates should be more expensive, since they execute lisp functions and conses tree-sitter nodes into lisp objects.

Yes, exactly my thoughts.

Perhaps Dmitry could present comparison of profiles from perf which
would allow us to understand the reason(s)?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 26 Jan 2023 17:13:02 GMT) Full text and rfc822 format available.

Message #32 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Yuan Fu <casouri <at> gmail.com>, Eli Zaretskii <eliz <at> gnu.org>
Cc: 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 26 Jan 2023 19:12:05 +0200

On 26/01/2023 09:17, Yuan Fu wrote:
> If anything, the :pred predicates should be more expensive, since they execute lisp functions and conses tree-sitter nodes into lisp objects.

Doesn't the :match predicate "cons tree-sitter nodes into lisp objects"?

IIUC the list of captures is produced inside treesit-query-capture 
exactly the same way before the predicates are processed -- whether they 
are :pred, or :match, or a combination.

But indeed, I (and most other users) would expect :match to be faster 
than :pred if the predicate does a regexp check anyway. That's the 
essence of this bug report.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 26 Jan 2023 17:17:01 GMT) Full text and rfc822 format available.

Message #35 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>, Yuan Fu <casouri <at> gmail.com>
Cc: 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 26 Jan 2023 19:15:51 +0200

On 26/01/2023 10:10, Eli Zaretskii wrote:
>> From: Yuan Fu<casouri <at> gmail.com>
>> Date: Wed, 25 Jan 2023 23:17:25 -0800
>> Cc: Dmitry Gutov<dgutov <at> yandex.ru>,
>>   60953 <at> debbugs.gnu.org
>>
>>>> Switching to using :pred with function (like I did in commit
>>>> d94dc606a0934) which still uses buffer-substring inside is significantly
>>>> faster.
>>> If the performance issue is fixed, then the only aspect that we should
>>> perhaps try to improve is consing.  Consing a string each time you
>>> need to fontify increases the GC pressure, so if there's a good way of
>>> avoiding that without performance degradation, we should take it.  Is
>>> it possible to use your :pred technique in a way that doesn't need to
>>> produce strings from buffer text?
>> Why is :pred more performant though? They just use string-match-p. If anything, the :pred predicates should be more expensive, since they execute lisp functions and conses tree-sitter nodes into lisp objects.
> Yes, exactly my thoughts.
> 
> Perhaps Dmitry could present comparison of profiles from perf which
> would allow us to understand the reason(s)?

I believe I did that in the second message in this thread: 
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=60953#8

To quote the specific profiles, it's

  15.30%  emacs         libtree-sitter.so.0.0       [.]
ts_tree_cursor_current_status
  14.92%  emacs         emacs                       [.] process_mark_stack
   9.75%  emacs         libtree-sitter.so.0.0       [.]
ts_tree_cursor_goto_next_sibling
   8.90%  emacs         libtree-sitter.so.0.0       [.]
ts_tree_cursor_goto_first_child
   3.87%  emacs         libtree-sitter.so.0.0       [.] ts_node_start_point

for :pred vs.

  23.72%  emacs         emacs                    [.] process_mark_stack
  12.33%  emacs         libtree-sitter.so.0.0    [.]
ts_tree_cursor_current_status
   7.96%  emacs         libtree-sitter.so.0.0    [.]
ts_tree_cursor_goto_next_sibling
   7.38%  emacs         libtree-sitter.so.0.0    [.]
ts_tree_cursor_goto_first_child
   3.37%  emacs         libtree-sitter.so.0.0    [.] ts_node_start_point

for :match.

And to continue the quote:

  Here's a significant jump in GC time which is almost the same as the
  difference in runtime. And all of it is spent marking?

  I suppose if the problem is allocation of a large string (many times
  over), the GC could be spending a lot of time scanning through the
  memory. Could this be avoided by passing some substitute handle to TS,
  instead of the full string? E.g. some kind of reference to it in the
  regexp cache.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 26 Jan 2023 18:08:01 GMT) Full text and rfc822 format available.

Message #38 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 26 Jan 2023 20:07:30 +0200

On 26/01/2023 08:50, Eli Zaretskii wrote:
>> Date: Thu, 26 Jan 2023 01:21:08 +0200
>> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
>> From: Dmitry Gutov <dgutov <at> yandex.ru>
>>
>> Thank you. Unfortunately, the performance improvement from this patch is
>> still fairly negligible.
> 
> This is quite strange, since all of the approaches basically use the
> same primitives under the hood.  Perhaps the reason for the slowness
> is that the code which computes the text span of a node is slow?

That code seems to be the same between the two options: 
treesit_predicate_capture_name_to_text basically does the same as 
treesit-node-text (except in C) after iterating through a Lisp list to 
find the node. ruby-ts--builtin-method-p calls treesit-node-text.

And treesit_predicate_pred does the same iteration, so the :pred option 
should just be slower, due to Lisp-related overhead. funcalls and stuff.

> Otherwise, I must be missing something here, since the rest of the
> code on the C level is basically the same, give or take some wrappers
> that should not change the overall picture.

The query object is smaller, though. That's basically my only remaining 
hypothesis.

>> Switching to using :pred with function (like I did in commit
>> d94dc606a0934) which still uses buffer-substring inside is significantly
>> faster.
> 
> If the performance issue is fixed, then the only aspect that we should
> perhaps try to improve is consing.

I wouldn't say it's "fixed", just improved. And :match really should be 
able to be made faster than :pred, since it'll probably be used for 
similar cases (where a lot/most of nodes match).

There seems to be a fair amount of consing going on inside 
treesit-query-capture already: we wrap every TS node in our objects, we 
turn the captured nodes into a Lisp alist, and we turn the predicates 
into a list, turning the strings into "our" strings. The 'make_string' 
function creates a new copy in the memory, right?

One could hope to avoid recreating the list of predicates on every 
match, but that seems to be a limitation of the TS API: 
ts_query_predicates_for_pattern requires a second argument, 
match.pattern_index. Maybe we could memoize that, though?

In any case, that seems to explain why adding or avoiding one 
buffer-substring call per match isn't moving the needle very much.

> Consing a string each time you
> need to fontify increases the GC pressure, so if there's a good way of
> avoiding that without performance degradation, we should take it.  Is
> it possible to use your :pred technique in a way that doesn't need to
> produce strings from buffer text?

The only version I managed to get some (very minor) performance 
improvement is this:

(defun ruby-ts--builtin-method-p (node)
  (goto-char (treesit-node-start node))
  (let ((inhibit-changing-match-data t))
    (re-search-forward ruby-ts--builtin-methods (treesit-node-end node) 
t)))

The improvement is like 200-300ms, whereas the difference between :match 
and :pred in this benchmark is several seconds.

And if I try to bring it back to 100% correctness, to ensure that the 
whole of node text is matched, I have to use narrowing (and string-start 
and string-end anchors in regexp):

(defvar ruby-ts--builtin-methods
  (format "\\`%s\\'" (regexp-opt (append ruby-builtin-methods-no-reqs
                                         ruby-builtin-methods-with-reqs)))
  "Ruby built-in methods.")

(defun ruby-ts--builtin-method-p (node)
  (save-restriction
    (goto-char (treesit-node-start node))
    (narrow-to-region (point) (treesit-node-end node))
    (let ((inhibit-changing-match-data t))
      (re-search-forward ruby-ts--builtin-methods nil t))))

And with that, the performance is again no better than the current 
version. If I also add save-excursion, it's worse.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 26 Jan 2023 18:25:02 GMT) Full text and rfc822 format available.

Message #41 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 26 Jan 2023 20:24:11 +0200

> Date: Thu, 26 Jan 2023 19:15:51 +0200
> Cc: 60953 <at> debbugs.gnu.org
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> 
> On 26/01/2023 10:10, Eli Zaretskii wrote:
> > Perhaps Dmitry could present comparison of profiles from perf which
> > would allow us to understand the reason(s)?
> 
> I believe I did that in the second message in this thread: 
> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=60953#8
> 
> To quote the specific profiles, it's
> 
>    15.30%  emacs         libtree-sitter.so.0.0       [.]
> ts_tree_cursor_current_status
>    14.92%  emacs         emacs                       [.] process_mark_stack
>     9.75%  emacs         libtree-sitter.so.0.0       [.]
> ts_tree_cursor_goto_next_sibling
>     8.90%  emacs         libtree-sitter.so.0.0       [.]
> ts_tree_cursor_goto_first_child
>     3.87%  emacs         libtree-sitter.so.0.0       [.] ts_node_start_point
> 
> for :pred vs.
> 
>    23.72%  emacs         emacs                    [.] process_mark_stack
>    12.33%  emacs         libtree-sitter.so.0.0    [.]
> ts_tree_cursor_current_status
>     7.96%  emacs         libtree-sitter.so.0.0    [.]
> ts_tree_cursor_goto_next_sibling
>     7.38%  emacs         libtree-sitter.so.0.0    [.]
> ts_tree_cursor_goto_first_child
>     3.37%  emacs         libtree-sitter.so.0.0    [.] ts_node_start_point
> 
> for :match.
> 
> And to continue the quote:
> 
>    Here's a significant jump in GC time which is almost the same as the
>    difference in runtime. And all of it is spent marking?
> 
>    I suppose if the problem is allocation of a large string (many times
>    over), the GC could be spending a lot of time scanning through the
>    memory. Could this be avoided by passing some substitute handle to TS,
>    instead of the full string? E.g. some kind of reference to it in the
>    regexp cache.

If you are saying that GC is responsible, then running the benchmark
with gc-cons-threshold set to most-positive-fixnum should produce a
more interesting profile and perhaps a more interesting comparison.
(But I thought you concluded that GC alone cannot explain the
difference in performance?)

Otherwise, the profiles are too similar to support any conclusions,
and the fact that process_mark_stack is in a prominent place doesn't
help.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 26 Jan 2023 19:37:01 GMT) Full text and rfc822 format available.

Message #44 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 26 Jan 2023 21:35:55 +0200

On 26/01/2023 20:24, Eli Zaretskii wrote:
>> Date: Thu, 26 Jan 2023 19:15:51 +0200
>> Cc:60953 <at> debbugs.gnu.org
>> From: Dmitry Gutov<dgutov <at> yandex.ru>
>>
>> On 26/01/2023 10:10, Eli Zaretskii wrote:
>>> Perhaps Dmitry could present comparison of profiles from perf which
>>> would allow us to understand the reason(s)?
>> I believe I did that in the second message in this thread:
>> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=60953#8
>>
>> To quote the specific profiles, it's
>>
>>     15.30%  emacs         libtree-sitter.so.0.0       [.]
>> ts_tree_cursor_current_status
>>     14.92%  emacs         emacs                       [.] process_mark_stack
>>      9.75%  emacs         libtree-sitter.so.0.0       [.]
>> ts_tree_cursor_goto_next_sibling
>>      8.90%  emacs         libtree-sitter.so.0.0       [.]
>> ts_tree_cursor_goto_first_child
>>      3.87%  emacs         libtree-sitter.so.0.0       [.] ts_node_start_point
>>
>> for :pred vs.
>>
>>     23.72%  emacs         emacs                    [.] process_mark_stack
>>     12.33%  emacs         libtree-sitter.so.0.0    [.]
>> ts_tree_cursor_current_status
>>      7.96%  emacs         libtree-sitter.so.0.0    [.]
>> ts_tree_cursor_goto_next_sibling
>>      7.38%  emacs         libtree-sitter.so.0.0    [.]
>> ts_tree_cursor_goto_first_child
>>      3.37%  emacs         libtree-sitter.so.0.0    [.] ts_node_start_point
>>
>> for :match.
>>
>> And to continue the quote:
>>
>>     Here's a significant jump in GC time which is almost the same as the
>>     difference in runtime. And all of it is spent marking?
>>
>>     I suppose if the problem is allocation of a large string (many times
>>     over), the GC could be spending a lot of time scanning through the
>>     memory. Could this be avoided by passing some substitute handle to TS,
>>     instead of the full string? E.g. some kind of reference to it in the
>>     regexp cache.
> If you are saying that GC is responsible, then running the benchmark
> with gc-cons-threshold set to most-positive-fixnum should produce a
> more interesting profile and perhaps a more interesting comparison.

That really helps:

(benchmark-run 1000 (progn (font-lock-mode -1) (font-lock-mode 1) (let 
(treesit--font-lock-fast-mode) (font-lock-ensure))))

=> (16.078430587 251 5.784299419999996)

(let ((gc-cons-threshold most-positive-fixnum)) (benchmark-run 1000 
(progn (font-lock-mode -1) (font-lock-mode 1) (let 
(treesit--font-lock-fast-mode) (font-lock-ensure)))))

=> (10.369389725 0 0.0)

Do you want a perf profile for the latter? It might not be very useful.

> (But I thought you concluded that GC alone cannot explain the
> difference in performance?)

I'm inclined to think the difference is related to copying of the regexp 
string, but whether the time is spent in actually copying it, or 
scanning its copies for garbage later, it was harder to say. Seems like 
it's the latter, though.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 26 Jan 2023 20:02:02 GMT) Full text and rfc822 format available.

Message #47 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 26 Jan 2023 22:01:14 +0200

> Date: Thu, 26 Jan 2023 21:35:55 +0200
> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> 
> > If you are saying that GC is responsible, then running the benchmark
> > with gc-cons-threshold set to most-positive-fixnum should produce a
> > more interesting profile and perhaps a more interesting comparison.
> 
> That really helps:
> 
> (benchmark-run 1000 (progn (font-lock-mode -1) (font-lock-mode 1) (let 
> (treesit--font-lock-fast-mode) (font-lock-ensure))))
> 
> => (16.078430587 251 5.784299419999996)
> 
> (let ((gc-cons-threshold most-positive-fixnum)) (benchmark-run 1000 
> (progn (font-lock-mode -1) (font-lock-mode 1) (let 
> (treesit--font-lock-fast-mode) (font-lock-ensure)))))
> 
> => (10.369389725 0 0.0)
> 
> Do you want a perf profile for the latter? It might not be very useful.

I'd be interested in comparing the profiles of the two techniques, the
:pred and the :match, with GC disabled like that.

> > (But I thought you concluded that GC alone cannot explain the
> > difference in performance?)
> 
> I'm inclined to think the difference is related to copying of the regexp 
> string, but whether the time is spent in actually copying it, or 
> scanning its copies for garbage later, it was harder to say. Seems like 
> it's the latter, though.

If we can avoid the copying, I think it's desirable in any case.  They
are constant regexps, aren't they?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 26 Jan 2023 20:47:03 GMT) Full text and rfc822 format available.

Message #50 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 26 Jan 2023 22:46:30 +0200

[Message part 1 (text/plain, inline)]

On 26/01/2023 20:07, Dmitry Gutov wrote:
> One could hope to avoid recreating the list of predicates on every 
> match, but that seems to be a limitation of the TS API: 
> ts_query_predicates_for_pattern requires a second argument, 
> match.pattern_index. Maybe we could memoize that, though?

Speaking of memoization, here is a POC patch.

It's a definite improvement: with the attached :match almost reaches the 
performance of :pred. Not sure why it's still not faster, though.

(I also tried a more comprehensive memoization using a hash table, the 
resulting performance was slightly worse.)

[memoize_simple.diff (text/x-patch, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 26 Jan 2023 21:28:01 GMT) Full text and rfc822 format available.

Message #53 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 26 Jan 2023 23:26:54 +0200

On 26/01/2023 22:01, Eli Zaretskii wrote:
>> Date: Thu, 26 Jan 2023 21:35:55 +0200
>> Cc:casouri <at> gmail.com,60953 <at> debbugs.gnu.org
>> From: Dmitry Gutov<dgutov <at> yandex.ru>
>>
>>> If you are saying that GC is responsible, then running the benchmark
>>> with gc-cons-threshold set to most-positive-fixnum should produce a
>>> more interesting profile and perhaps a more interesting comparison.
>> That really helps:
>>
>> (benchmark-run 1000 (progn (font-lock-mode -1) (font-lock-mode 1) (let
>> (treesit--font-lock-fast-mode) (font-lock-ensure))))
>>
>> => (16.078430587 251 5.784299419999996)
>>
>> (let ((gc-cons-threshold most-positive-fixnum)) (benchmark-run 1000
>> (progn (font-lock-mode -1) (font-lock-mode 1) (let
>> (treesit--font-lock-fast-mode) (font-lock-ensure)))))
>>
>> => (10.369389725 0 0.0)
>>
>> Do you want a perf profile for the latter? It might not be very useful.
> I'd be interested in comparing the profiles of the two techniques, the
> :pred and the :match, with GC disabled like that.

Curiously, :pred is still faster, but the difference is much smaller:

pred:

(9.212951344 0 0.0)

  18.23%  emacs         libtree-sitter.so.0.0       [.] 
ts_tree_cursor_current_status
  11.61%  emacs         libtree-sitter.so.0.0       [.] 
ts_tree_cursor_goto_next_sibling
  11.43%  emacs         libtree-sitter.so.0.0       [.] 
ts_tree_cursor_goto_first_child
   5.00%  emacs         libtree-sitter.so.0.0       [.] ts_node_start_point
   4.02%  emacs         libtree-sitter.so.0.0       [.] 
ts_tree_cursor_parent_node
   3.97%  emacs         emacs                       [.] re_match_2_internal
   3.36%  emacs         libtree-sitter.so.0.0       [.] 
ts_language_symbol_metadata
   2.45%  emacs         emacs                       [.] 
parse_str_as_multibyte
   1.95%  emacs         emacs                       [.] exec_byte_code
   1.66%  emacs         libtree-sitter.so.0.0       [.] 
ts_tree_cursor_current_node
   1.66%  emacs         libtree-sitter.so.0.0       [.] ts_node_end_point
   1.30%  emacs         emacs                       [.] allocate_vectorlike
   1.24%  emacs         emacs                       [.] find_interval

match:

(10.059083317 0 0.0)

  19.23%  emacs         libtree-sitter.so.0.0  [.] 
ts_tree_cursor_current_status
  12.41%  emacs         libtree-sitter.so.0.0  [.] 
ts_tree_cursor_goto_next_sibling
  11.22%  emacs         libtree-sitter.so.0.0  [.] 
ts_tree_cursor_goto_first_child
   5.21%  emacs         libtree-sitter.so.0.0  [.] ts_node_start_point
   4.22%  emacs         emacs                  [.] re_match_2_internal
   3.97%  emacs         libtree-sitter.so.0.0  [.] 
ts_tree_cursor_parent_node
   3.64%  emacs         libtree-sitter.so.0.0  [.] 
ts_language_symbol_metadata
   2.36%  emacs         emacs                  [.] exec_byte_code
   1.66%  emacs         libtree-sitter.so.0.0  [.] ts_node_end_point
   1.62%  emacs         libtree-sitter.so.0.0  [.] 
ts_tree_cursor_current_node
   1.34%  emacs         libtree-sitter.so.0.0  [.] ts_node_end_byte
   1.28%  emacs         emacs                  [.] allocate_vectorlike
   0.95%  emacs         libtree-sitter.so.0.0  [.] 
ts_tree_cursor_goto_parent

This is with the current code and disabled GC. No additional changes to 
treesit.c.

>>> (But I thought you concluded that GC alone cannot explain the
>>> difference in performance?)
>> I'm inclined to think the difference is related to copying of the regexp
>> string, but whether the time is spent in actually copying it, or
>> scanning its copies for garbage later, it was harder to say. Seems like
>> it's the latter, though.
> If we can avoid the copying, I think it's desirable in any case.  They
> are constant regexps, aren't they?

Yes, but how?

Memoization is one possible step, but then we only avoid re-creating the 
predicate structures for each match. We still send a pretty large query 
and, apparently, get it back..? Might be some copying involved there.

TBH the moderate success the memoization patch shows has me stumped.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Mon, 30 Jan 2023 00:50:02 GMT) Full text and rfc822 format available.

Message #56 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Mon, 30 Jan 2023 02:49:47 +0200

[Message part 1 (text/plain, inline)]

On 26/01/2023 23:26, Dmitry Gutov wrote:
>>>> (But I thought you concluded that GC alone cannot explain the
>>>> difference in performance?)
>>> I'm inclined to think the difference is related to copying of the regexp
>>> string, but whether the time is spent in actually copying it, or
>>> scanning its copies for garbage later, it was harder to say. Seems like
>>> it's the latter, though.
>> If we can avoid the copying, I think it's desirable in any case.  They
>> are constant regexps, aren't they?
> 
> Yes, but how?
> 
> Memoization is one possible step, but then we only avoid re-creating the 
> predicate structures for each match. We still send a pretty large query 
> and, apparently, get it back..? Might be some copying involved there.
> 
> TBH the moderate success the memoization patch shows has me stumped.

Okay, I have cleaned up both experiments that I had. And when combined, 
they make the :match approach a little faster than the :pred one.

I'm still not sure why the difference is so little, given that the :pred 
one has Lisp funcalls and extra allocation, and :match does not.

Still, if nobody has any better ideas, I suggest we install both of 
these changes now. They are attached in separate patches.

memoize_vector.diff improves the performance of both cases. For :pred, 
it's roughly 10%; for :match, it's more.

treesit_predicate_match.diff improves the performance of the latter, 
though only a little: maybe 3-4%.

Code review welcome.

Is applying (and undoing) the narrowing this way legal enough? Or should 
I go through some error handlers, or ensure blocks, etc?

Speaking of pref, the profile looks like this now (very similar to what 
it was before the added rule):

  17.25%  emacs  libtree-sitter.so.0.0       [.] 
ts_tree_cursor_current_status
  10.93%  emacs  libtree-sitter.so.0.0       [.] 
ts_tree_cursor_goto_next_sibling
   9.89%  emacs  libtree-sitter.so.0.0       [.] 
ts_tree_cursor_goto_first_child
   9.01%  emacs  emacs                       [.] process_mark_stack
   4.80%  emacs  libtree-sitter.so.0.0       [.] ts_node_start_point
   3.84%  emacs  emacs                       [.] re_match_2_internal
   3.82%  emacs  libtree-sitter.so.0.0       [.] ts_tree_cursor_parent_node
   3.06%  emacs  libtree-sitter.so.0.0       [.] 
ts_language_symbol_metadata

[memoize_vector.diff (text/x-patch, attachment)]

[treesit_predicate_match.diff (text/x-patch, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Mon, 30 Jan 2023 14:07:02 GMT) Full text and rfc822 format available.

Message #59 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Mon, 30 Jan 2023 16:06:13 +0200

> Date: Mon, 30 Jan 2023 02:49:47 +0200
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
> 
> Code review welcome.

See some below.

> Is applying (and undoing) the narrowing this way legal enough? Or should 
> I go through some error handlers, or ensure blocks, etc?

Mmm... no.  You should use Fnarrow_to_region, I think.

But why do you need to narrow there?  fast_looking_at will not go
beyond end_pos/end_byte anyway, there's no need to restrict it.

Or are you thinking about widening a buffer that is already narrowed?
But if so, can we have parser data beyond the restriction?

> +      Lisp_Object predicates = AREF(predicates_table, match.pattern_index);
> +      if (EQ (predicates, Qt))
> +	{
> +	  predicates = treesit_predicates_for_pattern (treesit_query, 0);
> +	  ASET(predicates_table, match.pattern_index, predicates);

Our style is to leave a blank between ASET and the left parenthesis.

> +  set_buffer_internal (buffer);
> +
> +  TSNode treesit_node = XTS_NODE (node)->node;
> +  ptrdiff_t visible_beg = XTS_PARSER (XTS_NODE (node)->parser)->visible_beg;
> +  uint32_t start_byte_offset = ts_node_start_byte (treesit_node);
> +  uint32_t end_byte_offset = ts_node_end_byte (treesit_node);
> +  ptrdiff_t start_byte = visible_beg + start_byte_offset;
> +  ptrdiff_t end_byte = visible_beg + end_byte_offset;
> +  ptrdiff_t start_pos = buf_bytepos_to_charpos (buffer, start_byte);
> +  ptrdiff_t end_pos = buf_bytepos_to_charpos (buffer, end_byte);
> +  ptrdiff_t old_begv = BEGV;
> +  ptrdiff_t old_zv = ZV;

Since you switch to BUFFER, you can use BYTE_TO_CHAR, no need for
buf_bytepos_to_charpos.

> +  SET_BUF_BEGV(buffer, start_pos);
> +  SET_BUF_ZV(buffer, end_pos);

And here I suggest an additional optimization, since you already know
the byte positions:

  BEGV = start_pos;
  BEGV_BYTE = start_byte;
  ZV = end_pos;
  ZV_BYTE = end_byte;

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Mon, 30 Jan 2023 14:48:01 GMT) Full text and rfc822 format available.

Message #62 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Mon, 30 Jan 2023 16:47:01 +0200

On 30/01/2023 16:06, Eli Zaretskii wrote:

> Our style is to leave a blank between ASET and the left parenthesis.

Sure, thanks.

> Mmm... no.  You should use Fnarrow_to_region, I think.

Last time I tried that (from Lisp, with save-restriction), I think the 
result was measurably slower. I can try it again from C, though.

> But why do you need to narrow there?  fast_looking_at will not go
> beyond end_pos/end_byte anyway, there's no need to restrict it.

The reason for that is to be able to support the \` and \' markers in 
REGEXP. I haven't found any alternative approach that doesn't call 
'substring'.

> And here I suggest an additional optimization, since you already know
> the byte positions:

No real objection from me if you're sure, but I tried that, and the 
benchmarks showed no difference. So I submitted the shorter version.

(I suppose we also could save the previous values of BEGV_BYTE and 
ZV_BYTE to use when restoring.)

Either way, we would use this *instead of* Fnarrow_to_region, right? 
Because it only accepts two arguments.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Mon, 30 Jan 2023 15:10:02 GMT) Full text and rfc822 format available.

Message #65 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Mon, 30 Jan 2023 17:08:39 +0200

> Date: Mon, 30 Jan 2023 16:47:01 +0200
> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> 
> On 30/01/2023 16:06, Eli Zaretskii wrote:
> 
> > But why do you need to narrow there?  fast_looking_at will not go
> > beyond end_pos/end_byte anyway, there's no need to restrict it.
> 
> The reason for that is to be able to support the \` and \' markers in 
> REGEXP. I haven't found any alternative approach that doesn't call 
> 'substring'.

fast_looking_at already does an anchored match, so I'm not sure I
follow.  I don't even understand why you need th \` part, when the
match will either always start from the first position or fail.

And for \', just compare the length of the match returned by
fast_looking_at with the length of the text.

What am I missing?

>  > And here I suggest an additional optimization, since you already know
>  > the byte positions:
> 
> No real objection from me if you're sure, but I tried that, and the 
> benchmarks showed no difference.

Sheer luck: you force SET_BUF_BEGV etc. to call buf_charpos_to_bytepos
for no reason: you already have the byte positions in hand.

> (I suppose we also could save the previous values of BEGV_BYTE and 
> ZV_BYTE to use when restoring.)

Yes.

> Either way, we would use this *instead of* Fnarrow_to_region, right? 

Fnarrow_to_region does little else.  But I need to understand first
why you need to change the restriction.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Mon, 30 Jan 2023 17:16:02 GMT) Full text and rfc822 format available.

Message #68 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Mon, 30 Jan 2023 19:15:07 +0200

On 30/01/2023 17:08, Eli Zaretskii wrote:
>> Date: Mon, 30 Jan 2023 16:47:01 +0200
>> Cc:casouri <at> gmail.com,60953 <at> debbugs.gnu.org
>> From: Dmitry Gutov<dgutov <at> yandex.ru>
>>
>> On 30/01/2023 16:06, Eli Zaretskii wrote:
>>
>>> But why do you need to narrow there?  fast_looking_at will not go
>>> beyond end_pos/end_byte anyway, there's no need to restrict it.
>> The reason for that is to be able to support the \` and \' markers in
>> REGEXP. I haven't found any alternative approach that doesn't call
>> 'substring'.
> fast_looking_at already does an anchored match, so I'm not sure I
> follow.  I don't even understand why you need th \` part, when the
> match will either always start from the first position or fail.

The regexp might include the anchors, or it might not.

It might also use a different anchor like ^ or $ or \b.

See these examples from the documentation:

((_) @bob (#match \"^B.b$\" @bob))

'((
   (compound_expression :anchor (_) @@first (_) :* @@rest)
   (:match "love" @@first)
   ))

> And for \', just compare the length of the match returned by
> fast_looking_at with the length of the text.

This seems to work, i.e. even when before "carpet",

(and (looking-at (regexp-opt '("car" "cardigan" "carpet")))
     (match-string 0))

returns the full match. I was expecting that it could return just "car" 
-- not sure why it doesn't stop there.

But again, to find out whether we need to use the end anchor at all, 
we'd have to parse the regexp, remove the actual anchor before calling 
fast_looking_at, and then add the above check.

One possible alternative, I suppose, would be to create a raw pointer to 
a part of the buffer text and call re_search directly specifying the 
known length of the node in bytes. If buffer text is one contiguous 
region in memory, that is. This way we would regexp test against a 
string (not a buffer), but without creating a separate string object.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Mon, 30 Jan 2023 17:50:02 GMT) Full text and rfc822 format available.

Message #71 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Mon, 30 Jan 2023 19:49:45 +0200

> Date: Mon, 30 Jan 2023 19:15:07 +0200
> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> 
> > fast_looking_at already does an anchored match, so I'm not sure I
> > follow.  I don't even understand why you need th \` part, when the
> > match will either always start from the first position or fail.
> 
> The regexp might include the anchors, or it might not.
> 
> It might also use a different anchor like ^ or $ or \b.

OK, but it always goes only forward, so narrowing to the beginning
shouldn't be necessary.  Right?  And you can use the LIMIT argument to
limit how far it goes forward, right?  So once again, why narrow?

> > And for \', just compare the length of the match returned by
> > fast_looking_at with the length of the text.
> 
> This seems to work, i.e. even when before "carpet",
> 
> (and (looking-at (regexp-opt '("car" "cardigan" "carpet")))
>       (match-string 0))
> 
> returns the full match. I was expecting that it could return just "car" 
> -- not sure why it doesn't stop there.

Because regex search is greedy?

> One possible alternative, I suppose, would be to create a raw pointer to 
> a part of the buffer text and call re_search directly specifying the 
> known length of the node in bytes. If buffer text is one contiguous 
> region in memory, that is.

It isn't, though: there's the gap.  Which is why doing this is not
recommended; instead, use something like search_buffer_re, which
already handles this complication for you.  (Except that
search_buffer_re is a static function, so only code in search.c can
use it.  So you'd need to make it non-static.)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Mon, 30 Jan 2023 18:21:02 GMT) Full text and rfc822 format available.

Message #74 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Mon, 30 Jan 2023 20:20:46 +0200

On 30/01/2023 19:49, Eli Zaretskii wrote:
>> Date: Mon, 30 Jan 2023 19:15:07 +0200
>> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
>> From: Dmitry Gutov <dgutov <at> yandex.ru>
>>
>>> fast_looking_at already does an anchored match, so I'm not sure I
>>> follow.  I don't even understand why you need th \` part, when the
>>> match will either always start from the first position or fail.
>>
>> The regexp might include the anchors, or it might not.
>>
>> It might also use a different anchor like ^ or $ or \b.
> 
> OK, but it always goes only forward, so narrowing to the beginning
> shouldn't be necessary.  Right? 

Are you saying that fast_looking_at ("\\`", ...) will always succeed?

And fast_looking_at ("^", ...), etc.

I would imagine that only fast_looking_at ("\\=", ...) is guaranteed to 
succeed.

> And you can use the LIMIT argument to
> limit how far it goes forward, right?  So once again, why narrow?

I tried to explain that there is a certain expectation (on the part of 
the user/programmer) which anchors are allowed in the :match regexp, and 
what their effects are, and those seem hard to support without narrowing.

>>> And for \', just compare the length of the match returned by
>>> fast_looking_at with the length of the text.
>>
>> This seems to work, i.e. even when before "carpet",
>>
>> (and (looking-at (regexp-opt '("car" "cardigan" "carpet")))
>>        (match-string 0))
>>
>> returns the full match. I was expecting that it could return just "car"
>> -- not sure why it doesn't stop there.
> 
> Because regex search is greedy?

Cool. TIL, thanks. That's not going to help here, but might in other 
situations when my code controls the regexp as well.

>> One possible alternative, I suppose, would be to create a raw pointer to
>> a part of the buffer text and call re_search directly specifying the
>> known length of the node in bytes. If buffer text is one contiguous
>> region in memory, that is.
> 
> It isn't, though: there's the gap.  Which is why doing this is not
> recommended; instead, use something like search_buffer_re, which
> already handles this complication for you.  (Except that
> search_buffer_re is a static function, so only code in search.c can
> use it.  So you'd need to make it non-static.)

Interesting. Does search_buffer_re match the \` anchor at POS and \' at 
LIM? IOW, does in treat the rest of the buffer as non-existing? Or could it?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Mon, 30 Jan 2023 18:43:02 GMT) Full text and rfc822 format available.

Message #77 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Mon, 30 Jan 2023 20:42:34 +0200

> Date: Mon, 30 Jan 2023 20:20:46 +0200
> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> 
> On 30/01/2023 19:49, Eli Zaretskii wrote:
> >> Date: Mon, 30 Jan 2023 19:15:07 +0200
> >> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
> >> From: Dmitry Gutov <dgutov <at> yandex.ru>
> >>
> >>> fast_looking_at already does an anchored match, so I'm not sure I
> >>> follow.  I don't even understand why you need th \` part, when the
> >>> match will either always start from the first position or fail.
> >>
> >> The regexp might include the anchors, or it might not.
> >>
> >> It might also use a different anchor like ^ or $ or \b.
> > 
> > OK, but it always goes only forward, so narrowing to the beginning
> > shouldn't be necessary.  Right? 
> 
> Are you saying that fast_looking_at ("\\`", ...) will always succeed?
> 
> And fast_looking_at ("^", ...), etc.

For example, for "^", if you hint that it must look back to make sure
there's a newline there, then your narrowing will also prevent it from
doing that, right?

> >> One possible alternative, I suppose, would be to create a raw pointer to
> >> a part of the buffer text and call re_search directly specifying the
> >> known length of the node in bytes. If buffer text is one contiguous
> >> region in memory, that is.
> > 
> > It isn't, though: there's the gap.  Which is why doing this is not
> > recommended; instead, use something like search_buffer_re, which
> > already handles this complication for you.  (Except that
> > search_buffer_re is a static function, so only code in search.c can
> > use it.  So you'd need to make it non-static.)
> 
> Interesting. Does search_buffer_re match the \` anchor at POS and \' at 
> LIM? IOW, does in treat the rest of the buffer as non-existing? Or could it?

That is the low-level subroutine called by re-search-forward, so you
know the answers already, I think?  IOW, that function behaves exactly
like re-search-forward in those situations.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Mon, 30 Jan 2023 19:02:01 GMT) Full text and rfc822 format available.

Message #80 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Mon, 30 Jan 2023 21:01:02 +0200

On 30/01/2023 20:42, Eli Zaretskii wrote:
>> Date: Mon, 30 Jan 2023 20:20:46 +0200
>> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
>> From: Dmitry Gutov <dgutov <at> yandex.ru>
>>
>> On 30/01/2023 19:49, Eli Zaretskii wrote:
>>>> Date: Mon, 30 Jan 2023 19:15:07 +0200
>>>> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
>>>> From: Dmitry Gutov <dgutov <at> yandex.ru>
>>>>
>>>>> fast_looking_at already does an anchored match, so I'm not sure I
>>>>> follow.  I don't even understand why you need th \` part, when the
>>>>> match will either always start from the first position or fail.
>>>>
>>>> The regexp might include the anchors, or it might not.
>>>>
>>>> It might also use a different anchor like ^ or $ or \b.
>>>
>>> OK, but it always goes only forward, so narrowing to the beginning
>>> shouldn't be necessary.  Right?
>>
>> Are you saying that fast_looking_at ("\\`", ...) will always succeed?
>>
>> And fast_looking_at ("^", ...), etc.
> 
> For example, for "^", if you hint that it must look back to make sure
> there's a newline there, then your narrowing will also prevent it from
> doing that, right?

fast_looking_at ("^", ...) succeeds inside a narrowing because it always 
succeeds at BOB. Even though there are no physical newlines before BOB.

>>>> One possible alternative, I suppose, would be to create a raw pointer to
>>>> a part of the buffer text and call re_search directly specifying the
>>>> known length of the node in bytes. If buffer text is one contiguous
>>>> region in memory, that is.
>>>
>>> It isn't, though: there's the gap.  Which is why doing this is not
>>> recommended; instead, use something like search_buffer_re, which
>>> already handles this complication for you.  (Except that
>>> search_buffer_re is a static function, so only code in search.c can
>>> use it.  So you'd need to make it non-static.)
>>
>> Interesting. Does search_buffer_re match the \` anchor at POS and \' at
>> LIM? IOW, does in treat the rest of the buffer as non-existing? Or could it?
> 
> That is the low-level subroutine called by re-search-forward, so you
> know the answers already, I think?  IOW, that function behaves exactly
> like re-search-forward in those situations.

So, I suppose not?

But that doesn't answer the question "Could it?".

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Mon, 30 Jan 2023 19:06:01 GMT) Full text and rfc822 format available.

Message #83 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Mon, 30 Jan 2023 21:05:26 +0200

> Date: Mon, 30 Jan 2023 21:01:02 +0200
> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> 
> But that doesn't answer the question "Could it?".

I don't understand what you are asking.  "Could" in what sense?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Mon, 30 Jan 2023 19:59:02 GMT) Full text and rfc822 format available.

Message #86 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Mon, 30 Jan 2023 21:58:22 +0200

On 30/01/2023 21:05, Eli Zaretskii wrote:
>> Date: Mon, 30 Jan 2023 21:01:02 +0200
>> Cc:casouri <at> gmail.com,60953 <at> debbugs.gnu.org
>> From: Dmitry Gutov<dgutov <at> yandex.ru>
>>
>> But that doesn't answer the question "Could it?".
> I don't understand what you are asking.  "Could" in what sense?

Like, would it make sense to try to modify it that way, or extract a 
function that would do that, without writing it from scratch.

Or create a new function which would reuse some common code.

We would call the new function something like match_buffer_substring. 
Optionally, also expose it to Lisp.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Mon, 30 Jan 2023 23:58:01 GMT) Full text and rfc822 format available.

Message #89 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Mon, 30 Jan 2023 15:57:05 -0800


> On Jan 30, 2023, at 11:58 AM, Dmitry Gutov <dgutov <at> yandex.ru> wrote:
> 
> On 30/01/2023 21:05, Eli Zaretskii wrote:
>>> Date: Mon, 30 Jan 2023 21:01:02 +0200
>>> Cc:casouri <at> gmail.com,60953 <at> debbugs.gnu.org
>>> From: Dmitry Gutov<dgutov <at> yandex.ru>
>>> 
>>> But that doesn't answer the question "Could it?".
>> I don't understand what you are asking.  "Could" in what sense?
> 
> Like, would it make sense to try to modify it that way, or extract a function that would do that, without writing it from scratch.
> 
> Or create a new function which would reuse some common code.
> 
> We would call the new function something like match_buffer_substring. Optionally, also expose it to Lisp.

Another option is to change user/programmer’s expectation of the anchor: we could say that the regexp must match the entirety of the node text. IOW, \\` \\' are implied. 

Yuan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Tue, 31 Jan 2023 00:46:02 GMT) Full text and rfc822 format available.

Message #92 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Yuan Fu <casouri <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Tue, 31 Jan 2023 02:44:53 +0200

On 31/01/2023 01:57, Yuan Fu wrote:
> 
> 
>> On Jan 30, 2023, at 11:58 AM, Dmitry Gutov <dgutov <at> yandex.ru> wrote:
>>
>> On 30/01/2023 21:05, Eli Zaretskii wrote:
>>>> Date: Mon, 30 Jan 2023 21:01:02 +0200
>>>> Cc:casouri <at> gmail.com,60953 <at> debbugs.gnu.org
>>>> From: Dmitry Gutov<dgutov <at> yandex.ru>
>>>>
>>>> But that doesn't answer the question "Could it?".
>>> I don't understand what you are asking.  "Could" in what sense?
>>
>> Like, would it make sense to try to modify it that way, or extract a function that would do that, without writing it from scratch.
>>
>> Or create a new function which would reuse some common code.
>>
>> We would call the new function something like match_buffer_substring. Optionally, also expose it to Lisp.
> 
> Another option is to change user/programmer’s expectation of the anchor: we could say that the regexp must match the entirety of the node text. IOW, \\` \\' are implied.

Huh, I guess that's an option too.

A couple reasons not to do that would be:

- Potential breakage in all existing TS modes, a week (?) before we're 
going to release Emacs 29 pretest. Maybe that's okay, I can't say. But 
the breakage from that kind of change could be subtle.

- Compatibility reasons? People writing TS modes for Emacs might be 
coming from other editors/TS integrations.

While TreeSitter docs say the predicates are not handled by it, it does 
show this example:

  (#match? @constant "^[A-Z][A-Z_]+")

The use of '^' anchor seems to imply that the regexp doesn't have to 
otherwise match the whole node text (OTOH it's not clear why the example 
doesn't just say "^[A-Z]" or "^[A-Z][A-Z_]").

The doc also references the Rust crate and WebAssembly binding which 
support #match?.

IIUC Rust uses "re.is_match", which is documented to use "implicit .*? 
at the beginning and end". Which matches our current semantics. 
WebAssembly uses "regex.test", same effect.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Tue, 31 Jan 2023 03:25:01 GMT) Full text and rfc822 format available.

Message #95 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Tue, 31 Jan 2023 05:23:49 +0200

> Date: Mon, 30 Jan 2023 21:58:22 +0200
> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> 
> On 30/01/2023 21:05, Eli Zaretskii wrote:
> >> Date: Mon, 30 Jan 2023 21:01:02 +0200
> >> Cc:casouri <at> gmail.com,60953 <at> debbugs.gnu.org
> >> From: Dmitry Gutov<dgutov <at> yandex.ru>
> >>
> >> But that doesn't answer the question "Could it?".
> > I don't understand what you are asking.  "Could" in what sense?
> 
> Like, would it make sense to try to modify it that way, or extract a 
> function that would do that, without writing it from scratch.
> 
> Or create a new function which would reuse some common code.
> 
> We would call the new function something like match_buffer_substring. 
> Optionally, also expose it to Lisp.

Can you describe what that function should do?  I don't think I have a
clear idea of that.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Tue, 31 Jan 2023 18:17:02 GMT) Full text and rfc822 format available.

Message #98 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Tue, 31 Jan 2023 20:16:01 +0200

On 31/01/2023 05:23, Eli Zaretskii wrote:
>> Date: Mon, 30 Jan 2023 21:58:22 +0200
>> Cc:casouri <at> gmail.com,60953 <at> debbugs.gnu.org
>> From: Dmitry Gutov<dgutov <at> yandex.ru>
>>
>> On 30/01/2023 21:05, Eli Zaretskii wrote:
>>>> Date: Mon, 30 Jan 2023 21:01:02 +0200
>>>> Cc:casouri <at> gmail.com,60953 <at> debbugs.gnu.org
>>>> From: Dmitry Gutov<dgutov <at> yandex.ru>
>>>>
>>>> But that doesn't answer the question "Could it?".
>>> I don't understand what you are asking.  "Could" in what sense?
>> Like, would it make sense to try to modify it that way, or extract a
>> function that would do that, without writing it from scratch.
>>
>> Or create a new function which would reuse some common code.
>>
>> We would call the new function something like match_buffer_substring.
>> Optionally, also expose it to Lisp.
> Can you describe what that function should do?  I don't think I have a
> clear idea of that.

In Lisp that function could be implemented as

  (defun buffer-substring-match (regexp &optional start end
                                 inhibit-modify)
    (string-match regexp
                  (buffer-substring (or start (point-min))
                                    (or end (point-max)))
                  inhibit-modify))

Meaning, it matches the regexp against the buffer substring, with the 
string-start and string-end anchors working.

But it would be implemented in C, meaning we could avoid the extra 
consing and funcall overhead.

It might also be handy to use from Lisp in other cases, where we don't 
need the anchors, but it's easier to call (buffer-substring-match "foo") 
rather than

  (save-excursion
    (goto-char (point-min))
    (re-search-forward "foo" nil t)
    (point))

Probably a little faster, too.

Anyway, it seems like it might be too late as an addition to Emacs 29. 
And we can implement the match predicate using narrowing for this 
release, to be updated later.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Wed, 01 Feb 2023 02:40:02 GMT) Full text and rfc822 format available.

Message #101 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Wed, 1 Feb 2023 04:39:29 +0200

[Message part 1 (text/plain, inline)]

On 31/01/2023 20:16, Dmitry Gutov wrote:
> Anyway, it seems like it might be too late as an addition to Emacs 29. 
> And we can implement the match predicate using narrowing for this 
> release, to be updated later.

So here. I've installed the first patch, which didn't raise up too many 
concerns previously, and here's the new iteration on the second patch. 
Please see the attachment.

To note the numbers: the first patch does quite well to improve the 
performance of modes which use :match in queries which match a lot of 
nodes (e.g. rust-ts-mode or ruby-ts-mode). There boost in those was on 
the order of 25%.

The second one is longer, and the boost (on top of the first one) is 
around 5-6%, stable. Not as impressive, but at least it brings :match's 
performance a little above :pred's in my example.

[treesit_predicate_match.diff (text/x-patch, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Wed, 01 Feb 2023 13:11:02 GMT) Full text and rfc822 format available.

Message #104 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Wed, 01 Feb 2023 15:10:38 +0200

> Date: Tue, 31 Jan 2023 20:16:01 +0200
> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> 
> > Can you describe what that function should do?  I don't think I have a
> > clear idea of that.
> 
> In Lisp that function could be implemented as
> 
>    (defun buffer-substring-match (regexp &optional start end
>                                   inhibit-modify)
>      (string-match regexp
>                    (buffer-substring (or start (point-min))
>                                      (or end (point-max)))
>                    inhibit-modify))
> 
> Meaning, it matches the regexp against the buffer substring, with the 
> string-start and string-end anchors working.
> 
> But it would be implemented in C, meaning we could avoid the extra 
> consing and funcall overhead.

Now I'm confused, because I thought the C functions we were
considering all fit the above description.

> Anyway, it seems like it might be too late as an addition to Emacs 29. 
> And we can implement the match predicate using narrowing for this 
> release, to be updated later.

Right, this is not a catastrophe, IMO.  The work on TS-based modes is
just beginning.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Wed, 01 Feb 2023 13:41:01 GMT) Full text and rfc822 format available.

Message #107 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Wed, 01 Feb 2023 15:39:53 +0200

> Date: Wed, 1 Feb 2023 04:39:29 +0200
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
> 
> So here. I've installed the first patch, which didn't raise up too many 
> concerns previously, and here's the new iteration on the second patch. 

Thanks, but please in the future when you make changes which call
functions from external libraries that were not called previously, be
sure to do the DEF_DLL_FN/LOAD_DLL_FN and the #undef/#define dance
needed to avoid breaking the MS-Windows build.

> Please see the attachment.
> 
> To note the numbers: the first patch does quite well to improve the 
> performance of modes which use :match in queries which match a lot of 
> nodes (e.g. rust-ts-mode or ruby-ts-mode). There boost in those was on 
> the order of 25%.
> 
> The second one is longer, and the boost (on top of the first one) is 
> around 5-6%, stable. Not as impressive, but at least it brings :match's 
> performance a little above :pred's in my example.

Fine by me, if Yuan also approves.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Wed, 01 Feb 2023 15:14:02 GMT) Full text and rfc822 format available.

Message #110 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Wed, 1 Feb 2023 17:13:13 +0200

On 01/02/2023 15:39, Eli Zaretskii wrote:
>> Date: Wed, 1 Feb 2023 04:39:29 +0200
>> From: Dmitry Gutov<dgutov <at> yandex.ru>
>> Cc:casouri <at> gmail.com,60953 <at> debbugs.gnu.org
>>
>> So here. I've installed the first patch, which didn't raise up too many
>> concerns previously, and here's the new iteration on the second patch.
> Thanks, but please in the future when you make changes which call
> functions from external libraries that were not called previously, be
> sure to do the DEF_DLL_FN/LOAD_DLL_FN and the #undef/#define dance
> needed to avoid breaking the MS-Windows build.

Will do, sorry about that.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Wed, 01 Feb 2023 15:16:01 GMT) Full text and rfc822 format available.

Message #113 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Wed, 1 Feb 2023 17:15:43 +0200

On 01/02/2023 15:10, Eli Zaretskii wrote:
>> Date: Tue, 31 Jan 2023 20:16:01 +0200
>> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
>> From: Dmitry Gutov <dgutov <at> yandex.ru>
>>
>>> Can you describe what that function should do?  I don't think I have a
>>> clear idea of that.
>>
>> In Lisp that function could be implemented as
>>
>>     (defun buffer-substring-match (regexp &optional start end
>>                                    inhibit-modify)
>>       (string-match regexp
>>                     (buffer-substring (or start (point-min))
>>                                       (or end (point-max)))
>>                     inhibit-modify))
>>
>> Meaning, it matches the regexp against the buffer substring, with the
>> string-start and string-end anchors working.
>>
>> But it would be implemented in C, meaning we could avoid the extra
>> consing and funcall overhead.
> 
> Now I'm confused, because I thought the C functions we were
> considering all fit the above description.

Except they don't match \` and \' at START and END. Only at actual 
point-min and point-max.

I suppose the new function could be implemented with narrowing as well. 
If we decide it's the best method.

>> Anyway, it seems like it might be too late as an addition to Emacs 29.
>> And we can implement the match predicate using narrowing for this
>> release, to be updated later.
> 
> Right, this is not a catastrophe, IMO.  The work on TS-based modes is
> just beginning.

I also don't see any particular slowdown from altering BEGV and ZV in C. 
So the current method might be the fastest possible anyway.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Wed, 01 Feb 2023 21:21:01 GMT) Full text and rfc822 format available.

Message #116 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Wed, 1 Feb 2023 23:20:50 +0200

On 01/02/2023 15:39, Eli Zaretskii wrote:
>> Please see the attachment.
>>
>> To note the numbers: the first patch does quite well to improve the
>> performance of modes which use :match in queries which match a lot of
>> nodes (e.g. rust-ts-mode or ruby-ts-mode). There boost in those was on
>> the order of 25%.
>>
>> The second one is longer, and the boost (on top of the first one) is
>> around 5-6%, stable. Not as impressive, but at least it brings :match's
>> performance a little above :pred's in my example.
> Fine by me, if Yuan also approves.

For emacs-29, right?

Waiting for Yuan's confirmation.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 02 Feb 2023 02:17:02 GMT) Full text and rfc822 format available.

Message #119 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Wed, 1 Feb 2023 18:16:05 -0800


> On Feb 1, 2023, at 1:20 PM, Dmitry Gutov <dgutov <at> yandex.ru> wrote:
> 
> On 01/02/2023 15:39, Eli Zaretskii wrote:
>>> Please see the attachment.
>>> 
>>> To note the numbers: the first patch does quite well to improve the
>>> performance of modes which use :match in queries which match a lot of
>>> nodes (e.g. rust-ts-mode or ruby-ts-mode). There boost in those was on
>>> the order of 25%.
>>> 
>>> The second one is longer, and the boost (on top of the first one) is
>>> around 5-6%, stable. Not as impressive, but at least it brings :match's
>>> performance a little above :pred's in my example.
>> Fine by me, if Yuan also approves.
> 
> For emacs-29, right?
> 
> Waiting for Yuan's confirmation.

Yeah please go ahead :-)

Yuan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 02 Feb 2023 06:35:02 GMT) Full text and rfc822 format available.

Message #122 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 02 Feb 2023 08:34:16 +0200

> Date: Wed, 1 Feb 2023 23:20:50 +0200
> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> 
> On 01/02/2023 15:39, Eli Zaretskii wrote:
> >> Please see the attachment.
> >>
> >> To note the numbers: the first patch does quite well to improve the
> >> performance of modes which use :match in queries which match a lot of
> >> nodes (e.g. rust-ts-mode or ruby-ts-mode). There boost in those was on
> >> the order of 25%.
> >>
> >> The second one is longer, and the boost (on top of the first one) is
> >> around 5-6%, stable. Not as impressive, but at least it brings :match's
> >> performance a little above :pred's in my example.
> > Fine by me, if Yuan also approves.
> 
> For emacs-29, right?

Yes.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 02 Feb 2023 12:14:02 GMT) Full text and rfc822 format available.

Message #125 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 2 Feb 2023 14:12:54 +0200

[Message part 1 (text/plain, inline)]

On 02/02/2023 08:34, Eli Zaretskii wrote:
>> Date: Wed, 1 Feb 2023 23:20:50 +0200
>> Cc:casouri <at> gmail.com,60953 <at> debbugs.gnu.org
>> From: Dmitry Gutov<dgutov <at> yandex.ru>
>>
>> On 01/02/2023 15:39, Eli Zaretskii wrote:
>>>> Please see the attachment.
>>>>
>>>> To note the numbers: the first patch does quite well to improve the
>>>> performance of modes which use :match in queries which match a lot of
>>>> nodes (e.g. rust-ts-mode or ruby-ts-mode). There boost in those was on
>>>> the order of 25%.
>>>>
>>>> The second one is longer, and the boost (on top of the first one) is
>>>> around 5-6%, stable. Not as impressive, but at least it brings :match's
>>>> performance a little above :pred's in my example.
>>> Fine by me, if Yuan also approves.
>> For emacs-29, right?
> Yes.

Please take a look at this additional patch on top of the other one. It 
is ok?

I just remembered that if we don't want to auto-anchor the regexp, then 
'looking-at' is not the appropriate semantic. So this switches to 
'search_buffer'.

[treesit_predicate_match_use_search_buffer.diff (text/x-patch, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 02 Feb 2023 14:24:02 GMT) Full text and rfc822 format available.

Message #128 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 02 Feb 2023 16:23:10 +0200

> Date: Thu, 2 Feb 2023 14:12:54 +0200
> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> 
> Please take a look at this additional patch on top of the other one. It 
> is ok?
> 
> I just remembered that if we don't want to auto-anchor the regexp, then 
> 'looking-at' is not the appropriate semantic. So this switches to 
> 'search_buffer'.

One gotcha: fast_looking_at returns the length of the match, whereas
search_buffer returns the buffer position where it matched.  Shouldn't
this need a different handling of the return value?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 02 Feb 2023 17:05:01 GMT) Full text and rfc822 format available.

Message #131 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 2 Feb 2023 19:03:52 +0200

On 02/02/2023 16:23, Eli Zaretskii wrote:
>> Date: Thu, 2 Feb 2023 14:12:54 +0200
>> Cc:casouri <at> gmail.com,60953 <at> debbugs.gnu.org
>> From: Dmitry Gutov<dgutov <at> yandex.ru>
>>
>> Please take a look at this additional patch on top of the other one. It
>> is ok?
>>
>> I just remembered that if we don't want to auto-anchor the regexp, then
>> 'looking-at' is not the appropriate semantic. So this switches to
>> 'search_buffer'.
> One gotcha: fast_looking_at returns the length of the match, whereas
> search_buffer returns the buffer position where it matched.  Shouldn't
> this need a different handling of the return value?

Probably not: we just need to know whether it matched or not. And when 
it did not, it returns 0, right? So the same check should work.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 02 Feb 2023 17:27:04 GMT) Full text and rfc822 format available.

Message #134 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 02 Feb 2023 19:26:47 +0200

> Date: Thu, 2 Feb 2023 19:03:52 +0200
> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> 
> > One gotcha: fast_looking_at returns the length of the match, whereas
> > search_buffer returns the buffer position where it matched.  Shouldn't
> > this need a different handling of the return value?
> 
> Probably not: we just need to know whether it matched or not. And when 
> it did not, it returns 0, right? So the same check should work.

Which check?

If the search fails, search_buffer returns a non-positive integer, not
zero.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 02 Feb 2023 17:54:02 GMT) Full text and rfc822 format available.

Message #137 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 2 Feb 2023 19:53:07 +0200

On 02/02/2023 19:26, Eli Zaretskii wrote:
>> Date: Thu, 2 Feb 2023 19:03:52 +0200
>> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
>> From: Dmitry Gutov <dgutov <at> yandex.ru>
>>
>>> One gotcha: fast_looking_at returns the length of the match, whereas
>>> search_buffer returns the buffer position where it matched.  Shouldn't
>>> this need a different handling of the return value?
>>
>> Probably not: we just need to know whether it matched or not. And when
>> it did not, it returns 0, right? So the same check should work.
> 
> Which check?

  if (val > 0)
    return true;
  else
    return false;

The :match predicate is just supposed to check whether the regexp 
matches the node text. It doesn't do anything with the position of the 
match.

> If the search fails, search_buffer returns a non-positive integer, not
> zero.

That should work too. As long as it never returns 0 for success.

Which seems to be confirmed by the check

  if (np <= 0)
    ... signal error

inside search_command.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60953; Package emacs. (Thu, 02 Feb 2023 18:04:01 GMT) Full text and rfc822 format available.

Message #140 received at 60953 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 02 Feb 2023 20:03:42 +0200

> Date: Thu, 2 Feb 2023 19:53:07 +0200
> Cc: casouri <at> gmail.com, 60953 <at> debbugs.gnu.org
> From: Dmitry Gutov <dgutov <at> yandex.ru>
> 
> > If the search fails, search_buffer returns a non-positive integer, not
> > zero.
> 
> That should work too. As long as it never returns 0 for success.
> 
> Which seems to be confirmed by the check
> 
>    if (np <= 0)
>      ... signal error
> 
> inside search_command.

Yes, because buffer position can never be zero.

Reply sent to Dmitry Gutov <dgutov <at> yandex.ru>:
You have taken responsibility. (Thu, 02 Feb 2023 19:45:02 GMT) Full text and rfc822 format available.

Notification sent to Dmitry Gutov <dgutov <at> yandex.ru>:
bug acknowledged by developer. (Thu, 02 Feb 2023 19:45:02 GMT) Full text and rfc822 format available.

Message #145 received at 60953-done <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 60953-done <at> debbugs.gnu.org
Subject: Re: bug#60953: The :match predicate with large regexp in tree-sitter
 font-lock seems inefficient
Date: Thu, 2 Feb 2023 21:44:09 +0200

On 02/02/2023 20:03, Eli Zaretskii wrote:
>> Date: Thu, 2 Feb 2023 19:53:07 +0200
>> Cc:casouri <at> gmail.com,60953 <at> debbugs.gnu.org
>> From: Dmitry Gutov<dgutov <at> yandex.ru>
>>
>>> If the search fails, search_buffer returns a non-positive integer, not
>>> zero.
>> That should work too. As long as it never returns 0 for success.
>>
>> Which seems to be confirmed by the check
>>
>>     if (np <= 0)
>>       ... signal error
>>
>> inside search_command.
> Yes, because buffer position can never be zero.

All right.

Now pushed; thanks, and closing.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 03 Mar 2023 12:24:08 GMT) Full text and rfc822 format available.

This bug report was last modified 2 years and 130 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #60953 The :match predicate with large regexp in tree-sitter font-lock seems inefficient

GNU bug report logs - #60953
The :match predicate with large regexp in tree-sitter font-lock seems inefficient