GNU bug report logs - #33205
26.1; unibyte/multibyte missing in rx.el

Package: emacs;

Reported by: Mattias Engdegård <mattiase <at> acm.org>

Date: Tue, 30 Oct 2018 15:25:08 UTC

Severity: minor

Found in version 26.1

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 33205 in the body.
You can then email your comments to 33205 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#33205; Package emacs. (Tue, 30 Oct 2018 15:25:08 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mattias Engdegård <mattiase <at> acm.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Tue, 30 Oct 2018 15:25:08 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: bug-gnu-emacs <at> gnu.org
Subject: 26.1; unibyte/multibyte missing in rx.el
Date: Tue, 30 Oct 2018 16:03:28 +0100

rx.el has constructs corresponding to all named regexp character
classes ([[:alnum:]], [[:digit:]], etc) except unibyte and multibyte.
This looks like a simple omission.

Or is it on purpose? The ascii and nonascii classes appear very
similar; I haven't been able to see any operational difference from
unibyte and multibyte, respectively. In fact, neither seems to work as
expected on unibyte strings or buffers:

(setq s "A\310")
"A\310"
(multibyte-string-p s)
nil
(string-match-p "A[[:nonascii:]]" s)
nil
(string-match-p "A[[:ascii:]]" s)
nil
(string-match-p "A[[:unibyte:]]" s)
nil
(string-match-p "A[[:multibyte:]]" s)
nil
(string-match-p "A." s)
0

What is going on here? ascii/nonascii and unibyte/multibyte are
supposed to be complementary; if both fail, it's because there is
nothing to match. Yet . matches.

(By the way, you may want to fix a trivial typo in a doc string in
rx.el while you are at it: `indian-tow-byte')
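
For illustration, a minimal sketch of the correspondence for classes that
rx does cover.  The four symbols below are existing rx constructs, and
each one expands to the bracketed class of the same name:

```elisp
(require 'rx)

;; Each rx class symbol expands to the corresponding named class.
(rx digit)     ; "[[:digit:]]"
(rx alnum)     ; "[[:alnum:]]"
(rx ascii)     ; "[[:ascii:]]"
(rx nonascii)  ; "[[:nonascii:]]"
```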

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#33205; Package emacs. (Tue, 30 Oct 2018 17:29:01 GMT) Full text and rfc822 format available.

Message #8 received at 33205 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 33205 <at> debbugs.gnu.org
Subject: Re: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Tue, 30 Oct 2018 19:27:56 +0200

> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Tue, 30 Oct 2018 16:03:28 +0100
> 
> (setq s "A\310")
> "A\310"
> (multibyte-string-p s)
> nil
> (string-match-p "A[[:nonascii:]]" s)
> nil
> (string-match-p "A[[:ascii:]]" s)
> nil
> (string-match-p "A[[:unibyte:]]" s)
> nil
> (string-match-p "A[[:multibyte:]]" s)
> nil
> (string-match-p "A." s)
> 0
> 
> What is going on here? ascii/nonascii and unibyte/multibyte are
> supposed to be complementary; if both fail, it's because there is
> nothing to match. Yet . matches.

I think it's a documentation bug: [:unibyte:] matches only ASCII
characters.  IOW, it tests "unibyteness" in the internal
representation (which might be surprising, I know).

And [:nonascii:] is only defined for multibyte characters.

> (By the way, you may want to fix a trivial typo in a doc string in
> rx.el while you are at it: `indian-tow-byte')

Done, thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#33205; Package emacs. (Wed, 31 Oct 2018 15:28:01 GMT) Full text and rfc822 format available.

Message #11 received at 33205 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 33205 <at> debbugs.gnu.org
Subject: Re: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Wed, 31 Oct 2018 16:27:53 +0100

On Tue 2018-10-30 at 19:27 +0200, Eli Zaretskii wrote:
> I think it's a documentation bug: [:unibyte:] matches only ASCII
> characters.  IOW, it tests "unibyteness" in the internal
> representation (which might be surprising, I know).
> 
> And [:nonascii:] is only defined for multibyte characters.

Thus [:ascii:]/[:nonascii:] cannot be distinguished from
[:unibyte:]/[:multibyte:]. Surely this cannot have been the intention?
Perhaps it's a relic from an earlier implementation. The code certainly
differs (IS_REAL_ASCII vs ISUNIBYTE).

Taking a step back: Do you agree that the missing unibyte/multibyte
should be added to rx, or do you feel that their current relative
uselessness would have them better stay out of it? (I'm neutral on the
subject.)

If there is a useful interpretation of [:unibyte:]/[:multibyte:] today,
perhaps we could make them behave that way.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#33205; Package emacs. (Wed, 31 Oct 2018 15:56:01 GMT) Full text and rfc822 format available.

Message #14 received at 33205 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 33205 <at> debbugs.gnu.org
Subject: Re: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Wed, 31 Oct 2018 17:55:08 +0200

> From: Mattias Engdegård <mattiase <at> acm.org>
> Cc: 33205 <at> debbugs.gnu.org
> Date: Wed, 31 Oct 2018 16:27:53 +0100
> 
> On Tue 2018-10-30 at 19:27 +0200, Eli Zaretskii wrote:
> > I think it's a documentation bug: [:unibyte:] matches only ASCII
> > characters.  IOW, it tests "unibyteness" in the internal
> > representation (which might be surprising, I know).
> > 
> > And [:nonascii:] is only defined for multibyte characters.
> 
> Thus [:ascii:]/[:nonascii:] cannot be distinguished from
> [:unibyte:]/[:multibyte:]. Surely this cannot have been the intention?

I actually looked into this some more, and I think my original
conclusion was wrong.  Let me dwell on that a bit more, and I will
report what I found.  We can then revisit the questions you ask above.

> Taking a step back: Do you agree that the missing unibyte/multibyte
> should be added to rx

I think it depends on what we find regarding the functionality.  It's
possible that it makes no real sense in the context of rx, for example
(although it indeed sounds like an omission).

> If there is a useful interpretation of [:unibyte:]/[:multibyte:] today,
> perhaps we could make them behave that way. 

Right.  Stay tuned, and thanks for pointing out this surprising
behavior.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#33205; Package emacs. (Mon, 05 Nov 2018 16:50:01 GMT) Full text and rfc822 format available.

Message #17 received at 33205 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: mattiase <at> acm.org
Cc: 33205 <at> debbugs.gnu.org
Subject: Re: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Mon, 05 Nov 2018 18:49:07 +0200

> Date: Wed, 31 Oct 2018 17:55:08 +0200
> From: Eli Zaretskii <eliz <at> gnu.org>
> Cc: 33205 <at> debbugs.gnu.org
> 
> > From: Mattias Engdegård <mattiase <at> acm.org>
> > Cc: 33205 <at> debbugs.gnu.org
> > Date: Wed, 31 Oct 2018 16:27:53 +0100
> > 
> > On Tue 2018-10-30 at 19:27 +0200, Eli Zaretskii wrote:
> > > I think it's a documentation bug: [:unibyte:] matches only ASCII
> > > characters.  IOW, it tests "unibyteness" in the internal
> > > representation (which might be surprising, I know).
> > > 
> > > And [:nonascii:] is only defined for multibyte characters.
> > 
> > Thus [:ascii:]/[:nonascii:] cannot be distinguished from
> > [:unibyte:]/[:multibyte:]. Surely this cannot have been the intention?
> 
> I actually looked into this some more, and I think my original
> conclusion was wrong.  Let me dwell on that a bit more, and I will
> report what I found.  We can then revisit the questions you ask above.

After looking into this, my conclusion is that what I wrote above was
not too wrong.  Indeed, currently [:ascii:]/[:nonascii:] cannot be
distinguished from [:unibyte:]/[:multibyte:].  In a nutshell, it turns
out [:unibyte:] is not what one might think it is; you can see that in
re_wctype_to_bit, for example.

Thinking about this and looking at the code, I'd say that support of
named character classes is heavily biased in favor of multibyte text,
not to say supports _only_ multibyte text.  So searching unibyte
strings and unibyte buffers for the likes of [:unibyte:] will only
find ASCII characters.

In multibyte buffers and strings, unibyte characters are stored in
their multibyte representation, so it is no longer trivial to define
what [:unibyte:] means.  However, I discovered that there's a
workaround for what you are trying to do: use ^[:multibyte:] instead
of [:unibyte:].  Observe:

  (setq s "A\310") => "A\310"
  (string-match-p "A[[:ascii:]]" s) => nil
  (string-match-p "A[[:nonascii:]]" s) => nil
  (string-match-p "A[^[:ascii:]]" s) => 0      ;; !!!
  (string-match-p "A[[:unibyte:]]" s) => nil
  (string-match-p "A[^[:multibyte:]]" s) => 0  ;; !!!

That ^[:ascii:] is not the same as [:nonascii:], and the same with
[:unibyte:] vs ^[:multibyte:], is arguably a bug.  The reason for that
becomes clear if you look at how we generate the fastmap in each of
these cases and how we set the bits in the work-area of the range
table, but I don't know enough to say how easy it would be to fix
that.

An alternative is to use an explicit character range, as in [\000-\377],
that works as you'd expect.
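
The negated-class workaround can be wrapped in a small helper.  This is
only a sketch: the name my-first-non-ascii-byte is hypothetical, and the
function relies on the negated-class behaviour reported above, so treat
the first result as what that report implies rather than a guarantee:

```elisp
;; Hypothetical helper name; relies on the behaviour of the negated
;; class [^[:ascii:]] on unibyte strings as reported in this thread.
(defun my-first-non-ascii-byte (string)
  "Return the index of the first non-ASCII byte in unibyte STRING, or nil."
  (string-match-p "[^[:ascii:]]" string))

(my-first-non-ascii-byte "A\310")  ; 1, consistent with the results above
(my-first-non-ascii-byte "ABC")    ; nil: every byte is ASCII
```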

> > Taking a step back: Do you agree that the missing unibyte/multibyte
> > should be added to rx
> 
> I think it depends on what we find regarding the functionality.  It's
> possible that it makes no real sense in the context of rx, for example
> (although it indeed sounds like an omission).
> 
> > If there is a useful interpretation of [:unibyte:]/[:multibyte:] today,
> > perhaps we could make them behave that way. 
> 
> Right.  Stay tuned, and thanks for pointing out this surprising
> behavior.

Well, what do you think now?  Is it worth adding those to rx.el?  I'm
not sure.  How important is it to find unibyte characters in a string,
anyway?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#33205; Package emacs. (Wed, 07 Nov 2018 18:09:01 GMT) Full text and rfc822 format available.

Message #20 received at 33205 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 33205 <at> debbugs.gnu.org
Subject: Re: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Wed, 7 Nov 2018 19:08:43 +0100

On 5 Nov 2018, at 17:49, Eli Zaretskii <eliz <at> gnu.org> wrote:
> After looking into this, my conclusion is that what I wrote above was
> not too wrong.  Indeed, currently [:ascii:]/[:nonascii:] cannot be
> distinguished from [:unibyte:]/[:multibyte:].  In a nutshell, it turns
> out [:unibyte:] is not what one might think it is, you can see that in
> re_wctype_to_bit, for example.

Thank you very much for taking your time to look at this, and for the detailed answer.
My apologies for severely complicating what I initially thought was quite a trifle!

> That ^[:ascii:] is not the same as [:nonascii:], and the same with
> [:unibyte:] vs ^[:multibyte:], is arguably a bug.  The reason for that
> becomes clear if you look at how we generate the fastmap in each of
> these cases and how we set the bits in the work-area of the range
> table, but I don't know enough to say how easy would it be to fix
> that.
> 
> An alternative is to use an explicit character class, as in \000-\377,
> that works as you'd expect.

I'm not sure what I expected [\000-\377] to mean in a multibyte string; one endpoint is ASCII and the other is a raw byte. It does work, as you noted, because two ranges are generated, as if written [\000-\177\200-\377].

In old Emacs versions (I tried 22.1.1), [:unibyte:] appears to include raw bytes in multibyte strings/buffers, and everything in unibyte strings/buffers (aka [\000-\377] in both cases), and [:multibyte:] the complement of that. Thus, at some point the behaviour changed, but I cannot find any NEWS reference to it. It could have been an accident.
Perhaps those char classes didn't see much use.

The old behaviour seems a little more intuitive, but it must be rare to need regex matching of rubbish bytes in multibyte strings. If you could argue that the status quo is fine then I wouldn't necessarily object, but would suggest that at least the code be made explicit about it (and the documentation, as well).

> Well, what do you think now?  Is it worth adding those to rx.el? I'm
> not sure.  How important is it to find unibyte characters in a string,
> anyway?

Unless we manage to make [:unibyte:]/[:multibyte:] more useful in their own right, it's fine to leave rx.el as is, as far as I'm concerned. There is no loss of expressivity.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#33205; Package emacs. (Wed, 07 Nov 2018 19:11:01 GMT) Full text and rfc822 format available.

Message #23 received at 33205 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 33205 <at> debbugs.gnu.org
Subject: Re: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Wed, 07 Nov 2018 21:10:01 +0200

> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Wed, 7 Nov 2018 19:08:43 +0100
> Cc: 33205 <at> debbugs.gnu.org
> 
> I'm not sure what I expected [\000-\377] to mean in a multibyte string; one endpoint is ASCII and the other is a raw byte. It does work, as you noted, because two ranges are generated, as if written [\000-\177\200-\377].

Octal escapes usually mean raw bytes.  Cf the fact that you used \310
in your original recipe.  So the above is expected to match raw bytes
and ASCII characters, i.e. what [:unibyte:] should probably stand for.

> In old Emacs versions (I tried 22.1.1), [:unibyte:] appears to include raw bytes in multibyte strings/buffers, and everything in unibyte strings/buffers (aka [\000-\377] in both cases), and [:multibyte:] the complement of that. Thus, at some point the behaviour changed, but I cannot find any NEWS reference to it. It could have been an accident.

Almost everything regarding internals of unibyte and multibyte
characters changed when we switched to UTF-8 superset as internal
representation.  One consequence is that raw bytes are no longer
represented as themselves in buffers and strings.
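
A minimal sketch of that last point, using the string from the original
recipe.  In a unibyte string the byte is stored as itself; after
conversion with string-to-multibyte it becomes a raw-byte character in
the #x3FFF80..#x3FFFFF range:

```elisp
(let* ((u "A\310")                   ; unibyte string from the report
       (m (string-to-multibyte u)))  ; multibyte string holding the same bytes
  (list (aref u 1)                   ; #xC8: the byte itself
        (aref m 1)                   ; #x3FFFC8: a raw-byte character
        ;; Converting the raw-byte character back yields #xC8 again.
        (multibyte-char-to-unibyte (aref m 1))))
```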

> Perhaps those char classes didn't see much use.

Definitely not.  I cannot even think of a practical use case for them
nowadays.

> The old behaviour seems a little more intuitive, but it must be rare to need regex matching of rubbish bytes in multibyte strings. If you could argue that the status quo is fine then I wouldn't necessarily object, but would suggest that at least the code be made explicit about it (and the documentation, as well).

I can fix the docs, but I don't think I understand what you would like
to do about the code.

> > Well, what do you think now?  Is it worth adding those to rx.el? I'm
> > not sure.  How important is it to find unibyte characters in a string,
> > anyway?
> 
> Unless we manage to make [:unibyte:]/[:multibyte:] more useful in their own right, it's fine to leave rx.el as is, as far as I'm concerned. There is no loss of expressivity.

OK.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#33205; Package emacs. (Wed, 07 Nov 2018 20:20:01 GMT) Full text and rfc822 format available.

Message #26 received at 33205 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 33205 <at> debbugs.gnu.org
Subject: Re: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Wed, 7 Nov 2018 21:19:07 +0100

On 7 Nov 2018, at 20:10, Eli Zaretskii <eliz <at> gnu.org> wrote:
>> Perhaps those char classes didn't see much use.
> 
> Definitely not.  I cannot even think of a practical use case for them
> nowadays.

But were they useful back then, when they were added? If so, for what? Maybe it's been lost in the mists of time.

> 
>> The old behaviour seems a little more intuitive, but it must be rare to need regex matching of rubbish bytes in multibyte strings. If you could argue that the status quo is fine then I wouldn't necessarily object, but would suggest that at least the code be made explicit about it (and the documentation, as well).
> 
> I can fix the docs, but I don't think I understand what would you like
> to do about the code.

If we are content with [:unibyte:]/[:multibyte:] = [:ascii:]/[:nonascii:], then it would be nice if the code were obvious about it. Right now, ISUNIBYTE and IS_REAL_ASCII differ, and it takes some digging to realise that they have the same effect. Removing RECC_UNIBYTE/RECC_MULTIBYTE entirely and using RECC_ASCII/RECC_NONASCII throughout would make the semantics clear.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#33205; Package emacs. (Mon, 19 Nov 2018 20:08:01 GMT) Full text and rfc822 format available.

Message #29 received at 33205 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 33205 <at> debbugs.gnu.org
Subject: Re: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Mon, 19 Nov 2018 21:07:39 +0100


I tried using rx to match raw bytes. (rx (any (?\200 . ?\377))) doesn't work, since that is translated to the corresponding Unicode range; (any (#x3fff80 . #x3fffff)) must be used instead. Maybe that is evident, or would it merit a mention in the doc string?

The alternative formulation (rx (any "\200-\377")) doesn't work either, and this seems to be a bug. Looking at rx-check-any-string, a second bug is revealed: the code uses the regex ".-." to pick out ranges, which means that \n cannot be a range endpoint.

Perhaps you want me to open a new bug for the above? I'm attaching a patch all the same, but you may prefer doing it differently.
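
Both observations can be reproduced with a few expressions.  This is a
sketch of the symptoms only, not of the code in rx-check-any-string:

```elisp
;; ?\200 reads as character 128 (U+0080), not as the raw byte #x80.
(= ?\200 #x80)                    ; t
;; The raw byte #x80 as a character lives at #x3fff80.
(unibyte-char-to-multibyte #x80)  ; #x3fff80

;; Hence the range has to be spelled with raw-byte character codes.
(rx (any (#x3fff80 . #x3fffff)))

;; "." does not match a newline, so ".-." cannot recognise a range
;; that has a newline as an endpoint.
(string-match-p ".-." "a-z")      ; 0
(string-match-p ".-." "\n-a")     ; nil
```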

[rx-any-raw-bytes.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#33205; Package emacs. (Sat, 08 Dec 2018 08:58:04 GMT) Full text and rfc822 format available.

Message #32 received at 33205 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 33205 <at> debbugs.gnu.org
Subject: Re: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Sat, 08 Dec 2018 10:56:40 +0200

> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Mon, 19 Nov 2018 21:07:39 +0100
> Cc: 33205 <at> debbugs.gnu.org
> 
> I tried using rx to match raw bytes. (rx (any (?\200 . ?\377))) doesn't work, since that is translated to the corresponding Unicode range; (any (#x3fff80 . #x3fffff)) must be used instead. Maybe that is evident, or would it merit a mention in the doc string?
> 
> The alternative formulation (rx (any "\200-\377")) doesn't work either, and this seems to be a bug. Looking at rx-check-any-string, a second bug is revealed: the code uses the regex ".-." to pick out ranges, which means that \n cannot be a range endpoint.
> 
> Perhaps you want me to open a new bug for the above? I'm attaching a patch all the same, but you may prefer doing it differently.

Thanks.  For a patch of this size, we would need a copyright
assignment from you.  Would you like to start the legal paperwork for
that?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#33205; Package emacs. (Sat, 08 Dec 2018 09:24:02 GMT) Full text and rfc822 format available.

Message #35 received at 33205 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 33205 <at> debbugs.gnu.org
Subject: Re: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Sat, 8 Dec 2018 10:23:18 +0100

On 8 Dec 2018, at 09:56, Eli Zaretskii <eliz <at> gnu.org> wrote:
> 
> Thanks.  For a patch of this size, we would need a copyright
> assignment from you.  Would you like to start the legal paperwork for
> that?

Yes please. Will you help me, or shall I contact someone in particular?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#33205; Package emacs. (Sat, 08 Dec 2018 11:13:01 GMT) Full text and rfc822 format available.

Message #38 received at 33205 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 33205 <at> debbugs.gnu.org
Subject: Re: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Sat, 08 Dec 2018 13:11:32 +0200

> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sat, 8 Dec 2018 10:23:18 +0100
> Cc: 33205 <at> debbugs.gnu.org
> 
> On 8 Dec 2018, at 09:56, Eli Zaretskii <eliz <at> gnu.org> wrote:
> > 
> > Thanks.  For a patch of this size, we would need a copyright
> > assignment from you.  Would you like to start the legal paperwork for
> > that?
> 
> Yes please. Will you help me, or shall I contact someone in particular?

Thanks.  Form and instructions sent off-list.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#33205; Package emacs. (Fri, 28 Dec 2018 18:18:02 GMT) Full text and rfc822 format available.

Message #41 received at 33205 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 33205 <at> debbugs.gnu.org
Subject: Re: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Fri, 28 Dec 2018 19:17:55 +0100

On Sat 2018-12-08 at 10:56 +0200, Eli Zaretskii wrote:
> 
> Thanks.  For a patch of this size, we would need a copyright
> assignment from you.  Would you like to start the legal paperwork for
> that?

The paperwork has now been processed, I'm told.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#33205; Package emacs. (Sat, 29 Dec 2018 09:24:02 GMT) Full text and rfc822 format available.

Message #44 received at 33205 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 33205 <at> debbugs.gnu.org
Subject: Re: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Sat, 29 Dec 2018 11:23:18 +0200

> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Mon, 19 Nov 2018 21:07:39 +0100
> Cc: 33205 <at> debbugs.gnu.org
> 
> Perhaps you want me to open a new bug for the above? I'm attaching a patch all the same, but you may prefer doing it differently.

Thanks, I have some comments.

First, please provide a ChangeLog-style commit log message describing
the changes.  See CONTRIBUTE for more details.

> +  "Turn a string argument to `any' into a list of characters and, representing
> +ranges, dotted pairs of characters. The original order is not preserved."

The first line of a doc string should be a single full sentence, and
it should mention the arguments of the function.  Also, the first
sentence confused me: what do you mean by this part:

 "... and, representing ranges, dotted pairs of characters"

Finally, please use the US English convention of leaving 2 spaces
between sentences in the documentation.

> +  (let ((decode-char
> +         ;; Make sure raw bytes are decoded as such, to avoid confusion with
> +         ;; U+0080..U+00FF.
> +         (if (multibyte-string-p str)
> +             #'identity
> +           (lambda (c) (if (and (>= c #x80) (<= c #xff))
> +                           (+ c #x3fff00)
> +                         c))))
> +        (len (length str))
> +        (i 0)
> +        (ret nil))
> +    (while (< i len)
> +      (cond ((and (< i (- len 2))
> +                  (= (aref str (+ i 1)) ?-))
> +             ;; Range.
> +             (let ((start (funcall decode-char (aref str i)))
> +                   (end   (funcall decode-char (aref str (+ i 2)))))
> +               (cond ((< start end) (push (cons start end) ret))
> +                     ((= start end) (push start ret)))
> +               (setq i (+ i 3))))
> +            (t
> +             ;; Single character.
> +             (push (funcall decode-char (aref str i)) ret)
> +             (setq i (+ i 1)))))
> +    ret))

This seems to have dropped the validity check which signaled an error
in the original code?  Any reason for that?

Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#33205; Package emacs. (Sat, 29 Dec 2018 09:25:02 GMT) Full text and rfc822 format available.

Message #47 received at 33205 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 33205 <at> debbugs.gnu.org
Subject: Re: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Sat, 29 Dec 2018 11:24:07 +0200

> From: Mattias Engdegård <mattiase <at> acm.org>
> Cc: 33205 <at> debbugs.gnu.org
> Date: Fri, 28 Dec 2018 19:17:55 +0100
> 
> On Sat 2018-12-08 at 10:56 +0200, Eli Zaretskii wrote:
> > 
> > Thanks.  For a patch of this size, we would need a copyright
> > assignment from you.  Would you like to start the legal paperwork for
> > that?
> 
> The paperwork has now been processed, I'm told.

Right, I posted a few minor comments; after fixing this, we can
install your changes.

Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#33205; Package emacs. (Sat, 29 Dec 2018 10:45:02 GMT) Full text and rfc822 format available.

Message #50 received at 33205 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 33205 <at> debbugs.gnu.org
Subject: Re: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Sat, 29 Dec 2018 11:43:56 +0100

On 29 Dec 2018, at 10:23, Eli Zaretskii <eliz <at> gnu.org> wrote:
> 
> First, please provide a ChangeLog-style commit log message describing
> the changes.  See CONTRIBUTE for more details.

Done.

> The first line of a doc string should be a single full sentence, and
> it should mention the arguments of the function.  Also, the first
> sentence confused me: what do you mean by this part:
> 
> "... and, representing ranges, dotted pairs of characters"
> 
> Finally, please use the US English convention of leaving 2 spaces
> between sentences in the documentation.

All done.

> This seems to have dropped the validity check which signaled an error
> in the original code?  Any reason for that?

Just an oversight; check reinstated.
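
For readers without the attachment, here is a stand-alone sketch of the
loop quoted in the review with a range check added.  The function name
my-any-string-items and the error message are hypothetical; this is not
the code from the patch, only an illustration of the same idea:

```elisp
;; Hypothetical stand-alone helper for illustration; it is not the
;; function from the patch and is not part of rx.el.
(defun my-any-string-items (str)
  "Return the characters and ranges that STR denotes in `any' syntax.
Ranges are returned as (START . END) pairs.  Bytes #x80..#xff in a
unibyte STR are mapped to raw-byte characters.  Order is reversed."
  (let ((decode (if (multibyte-string-p str)
                    #'identity
                  (lambda (c)
                    (if (<= #x80 c #xff) (+ c #x3fff00) c))))
        (len (length str))
        (i 0)
        (items nil))
    (while (< i len)
      (if (and (< i (- len 2))
               (= (aref str (1+ i)) ?-))
          ;; Range: validate the endpoints before recording it.
          (let ((start (funcall decode (aref str i)))
                (end (funcall decode (aref str (+ i 2)))))
            (cond ((< start end) (push (cons start end) items))
                  ((= start end) (push start items))
                  (t (error "Invalid range in `any' string: %S" str)))
            (setq i (+ i 3)))
        ;; Single character.
        (push (funcall decode (aref str i)) items)
        (setq i (1+ i))))
    items))

(my-any-string-items "a-z")        ; ((97 . 122))
(my-any-string-items "\200-\377")  ; ((4194176 . 4194303)), raw bytes
(my-any-string-items "\n-a")       ; ((10 . 97)), newline as endpoint
```

A reversed range such as "z-a" signals an error in this sketch.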

Thank you; new patch attached.

[0001-Handle-raw-bytes-and-LF-in-ranges-in-rx-any-argument.patch (application/octet-stream, attachment)]

Reply sent to Eli Zaretskii <eliz <at> gnu.org>:
You have taken responsibility. (Sat, 29 Dec 2018 14:57:02 GMT) Full text and rfc822 format available.

Notification sent to Mattias Engdegård <mattiase <at> acm.org>:
bug acknowledged by developer. (Sat, 29 Dec 2018 14:57:03 GMT) Full text and rfc822 format available.

Message #55 received at 33205-done <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 33205-done <at> debbugs.gnu.org
Subject: Re: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Sat, 29 Dec 2018 16:55:25 +0200

> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sat, 29 Dec 2018 11:43:56 +0100
> Cc: 33205 <at> debbugs.gnu.org
> 
> Thank you; new patch attached.

Thanks, pushed to the master branch.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 27 Jan 2019 12:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 5 years and 82 days ago.
