GNU bug report logs - #3687
23.1.50; inconsistency in multibyte eight-bit regexps

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: emacs; Reported by: YAMAMOTO Mitsuharu <mituharu@HIDDEN>; dated Fri, 26 Jun 2009 10:05:05 UTC; Maintainer for emacs is bug-gnu-emacs@HIDDEN.

Message received at 3687@HIDDEN:


Received: (at 3687) by emacsbugs.donarmstrong.com; 24 Jul 2009 01:08:21 +0000
From mituharu@HIDDEN Thu Jul 23 18:08:20 2009
X-Spam-Checker-Version: SpamAssassin 3.2.5-bugs.debian.org_2005_01_02
	(2008-06-10) on rzlab.ucr.edu
X-Spam-Level: 
X-Spam-Bayes: score:0.5 Bayes not run. spammytokens:Tokens not available.
	hammytokens:Tokens not available.
X-Spam-Status: No, score=-2.9 required=4.0 tests=AWL,HAS_BUG_NUMBER,
	IMPRONONCABLE_2 autolearn=ham version=3.2.5-bugs.debian.org_2005_01_02
Received: from mathmail.math.s.chiba-u.ac.jp (mathmail.math.s.chiba-u.ac.jp [133.82.132.2])
	by rzlab.ucr.edu (8.14.3/8.14.3/Debian-5) with ESMTP id n6O18ERH003794
	for <3687@HIDDEN>; Thu, 23 Jul 2009 18:08:15 -0700
Received: from church.math.s.chiba-u.ac.jp (church [133.82.132.36])
	by mathmail.math.s.chiba-u.ac.jp (Postfix) with ESMTP id CA5552C49;
	Fri, 24 Jul 2009 10:08:11 +0900 (JST)
Date: Fri, 24 Jul 2009 10:08:11 +0900
Message-ID: <wlbpnavqf8.wl%mituharu@HIDDEN>
From: YAMAMOTO Mitsuharu <mituharu@HIDDEN>
To: Stefan Monnier <monnier@HIDDEN>
Cc: 3687 <at> debbugs.gnu.org, Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
In-Reply-To: <jwvbpo74eet.fsf-monnier+emacsbugreports@HIDDEN>
References: <200906260956.n5Q9uo917123@HIDDEN>	<83my7vyute.fsf@HIDDEN>	<wlskhmxy3h.wl%mituharu@HIDDEN>	<83iqiiyq64.fsf@HIDDEN>	<wlab3rn3n2.wl%mituharu@HIDDEN>	<jwvbpo74eet.fsf-monnier+emacsbugreports@HIDDEN>
User-Agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8
 (=?ISO-8859-4?Q?Shij=F2?=) APEL/10.6 Emacs/22.3 (sparc-sun-solaris2.8)
 MULE/5.0 (SAKAKI)
Organization: Faculty of Science, Chiba University
MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka")
Content-Type: text/plain; charset=US-ASCII

>>>>> On Mon, 29 Jun 2009 10:47:30 +0200, Stefan Monnier <monnier@HIDDEN> said:

>> It seemed to be too obvious to explain and I hesitated to do that.
>> Anyway, I assume "C" and "[C]" work equivalently as regexps if the
>> character C has no special meaning in either context.

> Yes, it's pretty obvious, thank you.  I haven't had time to look
> deeper, but that part of the code is pretty nasty because it tries
> to be clever about the fact that values between 128-255 can be
> either latin-1 chars or eight-bit-bytes and it tries to be lenient
> about confusion between the two.

Are there any written specifications explaining how the leniency is
supposed to work?

As for documentation, the description below in the elisp info
(Special Characters in Regular Expressions) probably needs to be
updated.

     The beginning and end of a range of multibyte characters must be in
     the same character set (*note Character Sets::).  Thus,
     `"[\x8e0-\x97c]"' is invalid because character 0x8e0 (`a' with
     grave accent) is in the Emacs character set for Latin-1 but the
     character 0x97c (`u' with diaeresis) is in the Emacs character set
     for Latin-2.  (We use Lisp string syntax to write that example,
     and a few others in the next few paragraphs, in order to include
     hex escape sequences in them.)

     If a range starts with a unibyte character C and ends with a
     multibyte character C2, the range is divided into two parts: one
     is `C..?\377', the other is `C1..C2', where C1 is the first
     character of the charset to which C2 belongs.

     You cannot always match all non-ASCII characters with the regular
     expression `"[\200-\377]"'.  This works when searching a unibyte
     buffer or string (*note Text Representations::), but not in a
     multibyte buffer or string, because many non-ASCII characters have
     codes above octal 0377.  However, the regular expression
     `"[^\000-\177]"' does match all non-ASCII characters (see below
     regarding `^'), in both multibyte and unibyte representations,
     because only the ASCII characters are excluded.
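
A minimal sketch of the last paragraph of that excerpt.  Only the
regexp comes from the manual text; the helper name and the sample
strings are hypothetical, and the results assume an Emacs with
Unicode-based internals (23 or later).

```elisp
(defun my-first-non-ascii (string)
  "Return the index of the first non-ASCII character in STRING, or nil.
Hypothetical helper; only the regexp is taken from the manual excerpt."
  (string-match "[^\000-\177]" string))

(my-first-non-ascii "abc\u00e9")   ; => 3   (multibyte string, U+00E9)
(my-first-non-ascii "abc\xe9")     ; => 3   (unibyte string, byte #xE9)
(my-first-non-ascii "abc")         ; => nil (ASCII only)
```

The negated range is used here because it does not depend on how the
characters above 127 are represented.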

				     YAMAMOTO Mitsuharu
				mituharu@HIDDEN



Acknowledgement sent to YAMAMOTO Mitsuharu <mituharu@HIDDEN>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs@HIDDEN>. Full text available.
Information forwarded to bug-submit-list@HIDDEN, Emacs Bugs <bug-gnu-emacs@HIDDEN>:
bug#3687; Package emacs. Full text available.

Message received at 3687@HIDDEN:


Received: (at 3687) by emacsbugs.donarmstrong.com; 29 Jun 2009 08:47:36 +0000
From monnier@HIDDEN Mon Jun 29 01:47:36 2009
X-Spam-Checker-Version: SpamAssassin 3.2.5-bugs.debian.org_2005_01_02
	(2008-06-10) on rzlab.ucr.edu
X-Spam-Level: 
X-Spam-Bayes: score:0.5 Bayes not run. spammytokens:Tokens not available.
	hammytokens:Tokens not available.
X-Spam-Status: No, score=-2.9 required=4.0 tests=AWL,HAS_BUG_NUMBER
	autolearn=ham version=3.2.5-bugs.debian.org_2005_01_02
Received: from smtp-01.vtx.ch (smtp-01.vtx.ch [212.147.0.84])
	by rzlab.ucr.edu (8.14.3/8.14.3/Debian-5) with ESMTP id n5T8lV9P007086
	for <3687@HIDDEN>; Mon, 29 Jun 2009 01:47:33 -0700
Received: from alfajor.home (dyn.83-228-190-143.dsl.vtx.ch [83.228.190.143])
	by smtp-01.vtx.ch (VTX Services SA) with ESMTP id DF398281D1;
	Mon, 29 Jun 2009 10:47:30 +0200 (CEST)
Received: by alfajor.home (Postfix, from userid 20848)
	id BB10C6433E; Mon, 29 Jun 2009 10:47:30 +0200 (CEST)
From: Stefan Monnier <monnier@HIDDEN>
To: YAMAMOTO Mitsuharu <mituharu@HIDDEN>
Cc: 3687 <at> debbugs.gnu.org, Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
Message-ID: <jwvbpo74eet.fsf-monnier+emacsbugreports@HIDDEN>
References: <200906260956.n5Q9uo917123@HIDDEN>
	<83my7vyute.fsf@HIDDEN> <wlskhmxy3h.wl%mituharu@HIDDEN>
	<83iqiiyq64.fsf@HIDDEN> <wlab3rn3n2.wl%mituharu@HIDDEN>
Date: Mon, 29 Jun 2009 10:47:30 +0200
In-Reply-To: <wlab3rn3n2.wl%mituharu@HIDDEN> (YAMAMOTO
	Mitsuharu's message of "Mon, 29 Jun 2009 12:02:41 +0900")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.94 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

> It seemed to be too obvious to explain and I hesitated to do that.
> Anyway, I assume "C" and "[C]" work equivalently as regexps if the
> character C has no special meaning in either context.

Yes, it's pretty obvious, thank you.
I haven't had time to look deeper, but that part of the code is pretty
nasty because it tries to be clever about the fact that values between
128-255 can be either latin-1 chars or eight-bit-bytes and it tries to
be lenient about confusion between the two.
The behavior you see is clearly a bug.
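
A small sketch of that ambiguity, not taken from the regexp code
itself.  The helper name is hypothetical and the character codes
assume Emacs 23 or later, where eight-bit raw bytes occupy
#x3FFF80..#x3FFFFF.

```elisp
(defun my-byte-interpretations (byte)
  "Return (RAW LATIN1) character codes for BYTE, a number in 128..255.
Hypothetical helper for illustration only."
  (let ((s (unibyte-string byte)))
    (list
     ;; Read as an eight-bit raw byte.
     (aref (string-to-multibyte s) 0)
     ;; Read as a Latin-1 character.
     (aref (decode-coding-string s 'latin-1) 0))))

(my-byte-interpretations #xE9)
;; => (4194281 233), i.e. (#x3FFFE9 #xE9)
```

The same byte value thus maps to two distinct characters in a
multibyte string, depending on which reading is chosen.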


        Stefan




Acknowledgement sent to Stefan Monnier <monnier@HIDDEN>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs@HIDDEN>. Full text available.
Information forwarded to bug-submit-list@HIDDEN, Emacs Bugs <bug-gnu-emacs@HIDDEN>:
bug#3687; Package emacs. Full text available.

Message received at 3687@HIDDEN:


Received: (at 3687) by emacsbugs.donarmstrong.com; 29 Jun 2009 03:02:47 +0000
From mituharu@HIDDEN Sun Jun 28 20:02:47 2009
X-Spam-Checker-Version: SpamAssassin 3.2.5-bugs.debian.org_2005_01_02
	(2008-06-10) on rzlab.ucr.edu
X-Spam-Level: 
X-Spam-Bayes: score:0.5 Bayes not run. spammytokens:Tokens not available.
	hammytokens:Tokens not available.
X-Spam-Status: No, score=-3.4 required=4.0 tests=AWL,HAS_BUG_NUMBER
	autolearn=ham version=3.2.5-bugs.debian.org_2005_01_02
Received: from mathmail.math.s.chiba-u.ac.jp (mathmail.math.s.chiba-u.ac.jp [133.82.132.2])
	by rzlab.ucr.edu (8.14.3/8.14.3/Debian-5) with ESMTP id n5T32g0o011677
	for <3687@HIDDEN>; Sun, 28 Jun 2009 20:02:44 -0700
Received: from church.math.s.chiba-u.ac.jp (church [133.82.132.36])
	by mathmail.math.s.chiba-u.ac.jp (Postfix) with ESMTP id 5A2622C40;
	Mon, 29 Jun 2009 12:02:41 +0900 (JST)
Date: Mon, 29 Jun 2009 12:02:41 +0900
Message-ID: <wlab3rn3n2.wl%mituharu@HIDDEN>
From: YAMAMOTO Mitsuharu <mituharu@HIDDEN>
To: Eli Zaretskii <eliz@HIDDEN>
Cc: 3687 <at> debbugs.gnu.org
Subject: Re: bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
In-Reply-To: <83iqiiyq64.fsf@HIDDEN>
References: <200906260956.n5Q9uo917123@HIDDEN>
	<83my7vyute.fsf@HIDDEN>
	<wlskhmxy3h.wl%mituharu@HIDDEN>
	<83iqiiyq64.fsf@HIDDEN>
User-Agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8
 (=?ISO-8859-4?Q?Shij=F2?=) APEL/10.6 Emacs/22.3 (sparc-sun-solaris2.8)
 MULE/5.0 (SAKAKI)
Organization: Faculty of Science, Chiba University
MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka")
Content-Type: text/plain; charset=US-ASCII

>>>>> On Sat, 27 Jun 2009 12:36:03 +0300, Eli Zaretskii <eliz@HIDDEN> said:

>> Date: Sat, 27 Jun 2009 10:30:10 +0900
>> From: YAMAMOTO Mitsuharu <mituharu@HIDDEN>
>> Cc: 3687@HIDDEN
>> 
>> >>>>> On Fri, 26 Jun 2009 16:43:25 +0300, Eli Zaretskii <eliz@HIDDEN> said:
>> 
>> >> Date: Fri, 26 Jun 2009 18:56:50 +0900 (JST)
>> >> From: YAMAMOTO Mitsuharu <mituharu@HIDDEN>
>> >> Cc: 
>> >> 
>> >> The following results look inconsistent:
>> >> 
>> >> (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80"))
>> >> => 0
>> >> (string-match (string-to-multibyte "\x80") "\x80")
>> >> => nil
>> >> 
>> >> (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80"))
>> >> => nil
>> >> (string-match (string-to-multibyte "[\x80]") "\x80")
>> >> => 0
>> 
>> > Please tell why you think they are inconsistent.
>> 
>> I thought there's no room for argument about their inconsistency with
>> respect to the specification of "[...]" in regexps.

> Well, obviously there is such room.  Please consider explaining why
> you think there's inconsistency.

It seemed to be too obvious to explain and I hesitated to do that.
Anyway, I assume "C" and "[C]" work equivalently as regexps if the
character C has no special meaning in either context.
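
That assumption can be written down as a check.  This is only a
sketch: the helper name is hypothetical, and the sample call uses a
plain ASCII character, for which the result is not in question.

```elisp
(defun my-alternative-agrees-p (char string)
  "Non-nil if CHAR and the alternative [CHAR] behave alike on STRING.
Both regexps must match at the same index, or both must fail.
Hypothetical helper; CHAR must have no special meaning inside or
outside a character alternative."
  (let ((s (char-to-string char)))
    (equal (string-match (regexp-quote s) string)
           (string-match (concat "[" s "]") string))))

(my-alternative-agrees-p ?a "banana")   ; => t (both match at index 1)
```

Called with an eight-bit character and a multibyte string, the same
check would presumably return nil on the build reported here, given
the results quoted above.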

				    YAMAMOTO Mitsuharu
				mituharu@HIDDEN



Acknowledgement sent to YAMAMOTO Mitsuharu <mituharu@HIDDEN>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs@HIDDEN>. Full text available.
Information forwarded to bug-submit-list@HIDDEN, Emacs Bugs <bug-gnu-emacs@HIDDEN>:
bug#3687; Package emacs. Full text available.

Message received at 3687@HIDDEN:


Received: (at 3687) by emacsbugs.donarmstrong.com; 27 Jun 2009 09:36:15 +0000
From eliz@HIDDEN Sat Jun 27 02:36:14 2009
X-Spam-Checker-Version: SpamAssassin 3.2.5-bugs.debian.org_2005_01_02
	(2008-06-10) on rzlab.ucr.edu
X-Spam-Level: 
X-Spam-Bayes: score:0.5 Bayes not run. spammytokens:Tokens not available.
	hammytokens:Tokens not available.
X-Spam-Status: No, score=-4.9 required=4.0 tests=AWL,HAS_BUG_NUMBER
	autolearn=ham version=3.2.5-bugs.debian.org_2005_01_02
Received: from mtaout3.012.net.il (mtaout4.012.net.il [84.95.2.10])
	by rzlab.ucr.edu (8.14.3/8.14.3/Debian-5) with ESMTP id n5R9a9OU002544
	for <3687@HIDDEN>; Sat, 27 Jun 2009 02:36:11 -0700
Received: from conversion-daemon.i_mtaout3.012.net.il by i_mtaout3.012.net.il (HyperSendmail v2004.12) id <0KLW00D004QFTP00@i_mtaout3.012.net.il> for 3687@HIDDEN; Sat, 27 Jun 2009 12:36:03 +0300 (IDT)
Received: from HOME-C4E4A596F7 ([84.229.213.34]) by i_mtaout3.012.net.il (HyperSendmail v2004.12) with ESMTPA id <0KLW0028N5C25SN0@i_mtaout3.012.net.il>; Sat, 27 Jun 2009 12:36:03 +0300 (IDT)
Date: Sat, 27 Jun 2009 12:36:03 +0300
From: Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
In-reply-to: <wlskhmxy3h.wl%mituharu@HIDDEN>
X-012-Sender: halo1@HIDDEN
To: YAMAMOTO Mitsuharu <mituharu@HIDDEN>
Cc: 3687 <at> debbugs.gnu.org
Reply-to: Eli Zaretskii <eliz@HIDDEN>
Message-id: <83iqiiyq64.fsf@HIDDEN>
References: <200906260956.n5Q9uo917123@HIDDEN> <83my7vyute.fsf@HIDDEN> <wlskhmxy3h.wl%mituharu@HIDDEN>

> Date: Sat, 27 Jun 2009 10:30:10 +0900
> From: YAMAMOTO Mitsuharu <mituharu@HIDDEN>
> Cc: 3687@HIDDEN
> 
> >>>>> On Fri, 26 Jun 2009 16:43:25 +0300, Eli Zaretskii <eliz@HIDDEN> said:
> 
> >> Date: Fri, 26 Jun 2009 18:56:50 +0900 (JST)
> >> From: YAMAMOTO Mitsuharu <mituharu@HIDDEN>
> >> Cc: 
> >> 
> >> The following results look inconsistent:
> >> 
> >> (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80"))
> >> => 0
> >> (string-match (string-to-multibyte "\x80") "\x80")
> >> => nil
> >> 
> >> (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80"))
> >> => nil
> >> (string-match (string-to-multibyte "[\x80]") "\x80")
> >> => 0
> 
> > Please tell why you think they are inconsistent.
> 
> I thought there's no room for argument about their inconsistency with
> respect to the specification of "[...]" in regexps.

Well, obviously there is such room.  Please consider explaining why
you think there's inconsistency.



Acknowledgement sent to Eli Zaretskii <eliz@HIDDEN>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs@HIDDEN>. Full text available.
Information forwarded to bug-submit-list@HIDDEN, Emacs Bugs <bug-gnu-emacs@HIDDEN>:
bug#3687; Package emacs. Full text available.

Message received at 3687@HIDDEN:


Received: (at 3687) by emacsbugs.donarmstrong.com; 27 Jun 2009 01:30:18 +0000
From mituharu@HIDDEN Fri Jun 26 18:30:17 2009
X-Spam-Checker-Version: SpamAssassin 3.2.5-bugs.debian.org_2005_01_02
	(2008-06-10) on rzlab.ucr.edu
X-Spam-Level: 
X-Spam-Bayes: score:0.5 Bayes not run. spammytokens:Tokens not available.
	hammytokens:Tokens not available.
X-Spam-Status: No, score=-3.4 required=4.0 tests=AWL,HAS_BUG_NUMBER
	autolearn=ham version=3.2.5-bugs.debian.org_2005_01_02
Received: from mathmail.math.s.chiba-u.ac.jp (mathmail.math.s.chiba-u.ac.jp [133.82.132.2])
	by rzlab.ucr.edu (8.14.3/8.14.3/Debian-5) with ESMTP id n5R1UBcV014656
	for <3687@HIDDEN>; Fri, 26 Jun 2009 18:30:13 -0700
Received: from church.math.s.chiba-u.ac.jp (church [133.82.132.36])
	by mathmail.math.s.chiba-u.ac.jp (Postfix) with ESMTP id 443552C40;
	Sat, 27 Jun 2009 10:30:10 +0900 (JST)
Date: Sat, 27 Jun 2009 10:30:10 +0900
Message-ID: <wlskhmxy3h.wl%mituharu@HIDDEN>
From: YAMAMOTO Mitsuharu <mituharu@HIDDEN>
To: Eli Zaretskii <eliz@HIDDEN>
Cc: 3687 <at> debbugs.gnu.org
Subject: Re: bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
In-Reply-To: <83my7vyute.fsf@HIDDEN>
References: <200906260956.n5Q9uo917123@HIDDEN>
	<83my7vyute.fsf@HIDDEN>
User-Agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8
 (=?ISO-8859-4?Q?Shij=F2?=) APEL/10.6 Emacs/22.3 (sparc-sun-solaris2.8)
 MULE/5.0 (SAKAKI)
Organization: Faculty of Science, Chiba University
MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka")
Content-Type: text/plain; charset=US-ASCII

>>>>> On Fri, 26 Jun 2009 16:43:25 +0300, Eli Zaretskii <eliz@HIDDEN> said:

>> Date: Fri, 26 Jun 2009 18:56:50 +0900 (JST)
>> From: YAMAMOTO Mitsuharu <mituharu@HIDDEN>
>> Cc: 
>> 
>> The following results look inconsistent:
>> 
>> (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80"))
>> => 0
>> (string-match (string-to-multibyte "\x80") "\x80")
>> => nil
>> 
>> (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80"))
>> => nil
>> (string-match (string-to-multibyte "[\x80]") "\x80")
>> => 0

> Please tell why you think they are inconsistent.

I thought there's no room for argument about their inconsistency with
respect to the specification of "[...]" in regexps.

> More importantly, please show real-life examples of code or
> situations where this gets in your way.

If you decode some data containing invalid (undecodable) byte
sequences using a coding system such as utf-8, then such sequences are
embedded in the decoded result as eight-bit characters in multibyte
form.  You can detect particular such sequences by searching the
decoded result with a "character alternative" regexp (or its multibyte
form), provided that such a search works.
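
A minimal sketch of that situation.  The wrapper name and the sample
input are hypothetical, and the stated results assume Emacs 23 or
later.

```elisp
(defun my-decode-utf-8 (bytes)
  "Decode the unibyte string BYTES as UTF-8.  Hypothetical wrapper."
  (decode-coding-string bytes 'utf-8))

;; A lone byte #x80 is not valid UTF-8.
(let ((decoded (my-decode-utf-8 "abc\x80")))
  (list (multibyte-string-p decoded)
        (length decoded)
        ;; The stray byte is kept as an eight-bit character.
        (eq (aref decoded 3) (unibyte-char-to-multibyte #x80))))
;; => (t 4 t)

;; The search one would like to rely on.  Its result is the subject of
;; this report, so no expected value is given:
;; (string-match (string-to-multibyte "[\x80]") (my-decode-utf-8 "abc\x80"))
```

Comparing character codes directly, as in the third element, avoids
the regexp matcher altogether and can serve as a workaround.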

Further examples that look inconsistent:

  (string-match (string-to-multibyte "[\x80\x81]") (string-to-multibyte "\x80"))
  => nil
  (string-match (string-to-multibyte "[\x80-\xbf]") (string-to-multibyte "\x80"))
  => nil
  (string-match (string-to-multibyte "[\x80-\xc0]") (string-to-multibyte "\x80"))
  => 0
  (string-match (string-to-multibyte "[\x80-\xc0]") (string-to-multibyte "\xbf"))
  => 0
  (string-match (string-to-multibyte "[\x80-\xc0]") (string-to-multibyte "\xc0"))
  => nil

> This area is full of subtleties and gotchas, and in general the
> current code does what it does because it needs to cater to many
> different practical situations.

> There could still be bugs, of course.

Yeah.  I found another suspected bug in this area:

  (string-match "[[:unibyte:]]" "\x80")
  => nil
  (string-match "[[:unibyte:]]" (string-to-multibyte "\x80"))
  => nil

				     YAMAMOTO Mitsuharu
				mituharu@HIDDEN



Acknowledgement sent to YAMAMOTO Mitsuharu <mituharu@HIDDEN>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs@HIDDEN>. Full text available.
Information forwarded to bug-submit-list@HIDDEN, Emacs Bugs <bug-gnu-emacs@HIDDEN>:
bug#3687; Package emacs. Full text available.

Message received at 3687@HIDDEN:


Received: (at 3687) by emacsbugs.donarmstrong.com; 26 Jun 2009 13:43:37 +0000
From eliz@HIDDEN Fri Jun 26 06:43:36 2009
X-Spam-Checker-Version: SpamAssassin 3.2.5-bugs.debian.org_2005_01_02
	(2008-06-10) on rzlab.ucr.edu
X-Spam-Level: 
X-Spam-Bayes: score:0.5 Bayes not run. spammytokens:Tokens not available.
	hammytokens:Tokens not available.
X-Spam-Status: No, score=-4.9 required=4.0 tests=AWL,HAS_BUG_NUMBER
	autolearn=ham version=3.2.5-bugs.debian.org_2005_01_02
Received: from mtaout5.012.net.il (mtaout5.012.net.il [84.95.2.13])
	by rzlab.ucr.edu (8.14.3/8.14.3/Debian-5) with ESMTP id n5QDhWZk018958
	for <3687@HIDDEN>; Fri, 26 Jun 2009 06:43:33 -0700
Received: from conversion-daemon.i_mtaout5.012.net.il by i_mtaout5.012.net.il (HyperSendmail v2004.12) id <0KLU00L00M3H2B00@i_mtaout5.012.net.il> for 3687@HIDDEN; Fri, 26 Jun 2009 16:43:26 +0300 (IDT)
Received: from HOME-C4E4A596F7 ([84.229.213.34]) by i_mtaout5.012.net.il (HyperSendmail v2004.12) with ESMTPA id <0KLU00LMRM4D67R0@i_mtaout5.012.net.il>; Fri, 26 Jun 2009 16:43:25 +0300 (IDT)
Date: Fri, 26 Jun 2009 16:43:25 +0300
From: Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
In-reply-to: <200906260956.n5Q9uo917123@HIDDEN>
X-012-Sender: halo1@HIDDEN
To: YAMAMOTO Mitsuharu <mituharu@HIDDEN>,
        3687 <at> debbugs.gnu.org
Reply-to: Eli Zaretskii <eliz@HIDDEN>
Message-id: <83my7vyute.fsf@HIDDEN>
References: <200906260956.n5Q9uo917123@HIDDEN>

> Date: Fri, 26 Jun 2009 18:56:50 +0900 (JST)
> From: YAMAMOTO Mitsuharu <mituharu@HIDDEN>
> Cc: 
> 
> The following results look inconsistent:
> 
>   (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80"))
>   => 0
>   (string-match (string-to-multibyte "\x80") "\x80")
>   => nil
> 
>   (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80"))
>   => nil
>   (string-match (string-to-multibyte "[\x80]") "\x80")
>   => 0

Please tell why you think they are inconsistent.  More importantly,
please show real-life examples of code or situations where this gets
in your way.  This area is full of subtleties and gotchas, and in
general the current code does what it does because it needs to cater
to many different practical situations.

There could still be bugs, of course.



Acknowledgement sent to Eli Zaretskii <eliz@HIDDEN>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs@HIDDEN>. Full text available.
Information forwarded to bug-submit-list@HIDDEN, Emacs Bugs <bug-gnu-emacs@HIDDEN>:
bug#3687; Package emacs. Full text available.

Message received at submit@HIDDEN:


Received: (at submit) by emacsbugs.donarmstrong.com; 26 Jun 2009 09:57:09 +0000
From mituharu@HIDDEN Fri Jun 26 02:57:08 2009
X-Spam-Checker-Version: SpamAssassin 3.2.5-bugs.debian.org_2005_01_02
	(2008-06-10) on rzlab.ucr.edu
X-Spam-Level: 
X-Spam-Bayes: score:0.5 Bayes not run. spammytokens:Tokens not available.
	hammytokens:Tokens not available.
X-Spam-Status: No, score=-1.9 required=4.0 tests=AWL,FOURLA autolearn=no
	version=3.2.5-bugs.debian.org_2005_01_02
Received: from fencepost.gnu.org (fencepost.gnu.org [140.186.70.10])
	by rzlab.ucr.edu (8.14.3/8.14.3/Debian-5) with ESMTP id n5Q9v1vf011356
	for <submit@HIDDEN>; Fri, 26 Jun 2009 02:57:03 -0700
Received: from mail.gnu.org ([199.232.76.166]:37703 helo=mx10.gnu.org)
	by fencepost.gnu.org with esmtp (Exim 4.67)
	(envelope-from <mituharu@HIDDEN>)
	id 1MK8B2-0002BU-L9
	for emacs-pretest-bug@HIDDEN; Fri, 26 Jun 2009 05:57:00 -0400
Received: from Debian-exim by monty-python.gnu.org with spam-scanned (Exim 4.60)
	(envelope-from <mituharu@HIDDEN>)
	id 1MK8Az-0006zM-M3
	for emacs-pretest-bug@HIDDEN; Fri, 26 Jun 2009 05:56:59 -0400
Received: from ntp.math.s.chiba-u.ac.jp ([133.82.132.2]:64183 helo=mathmail.math.s.chiba-u.ac.jp)
	by monty-python.gnu.org with esmtp (Exim 4.60)
	(envelope-from <mituharu@HIDDEN>)
	id 1MK8Ay-0006yQ-W3
	for emacs-pretest-bug@HIDDEN; Fri, 26 Jun 2009 05:56:57 -0400
Received: from church.math.s.chiba-u.ac.jp (church [133.82.132.36])
	by mathmail.math.s.chiba-u.ac.jp (Postfix) with ESMTP id E20962C44
	for <emacs-pretest-bug@HIDDEN>; Fri, 26 Jun 2009 18:56:50 +0900 (JST)
Received: (from mituharu@localhost)
	by church.math.s.chiba-u.ac.jp (8.11.7p1+Sun/8.11.7) id n5Q9uo917123;
	Fri, 26 Jun 2009 18:56:50 +0900 (JST)
Date: Fri, 26 Jun 2009 18:56:50 +0900 (JST)
Message-Id: <200906260956.n5Q9uo917123@HIDDEN>
From: YAMAMOTO Mitsuharu <mituharu@HIDDEN>
To: emacs-pretest-bug@HIDDEN
Subject: 23.1.50; inconsistency in multibyte eight-bit regexps
User-Agent: SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (=?ISO-2022-JP-2?B?U2hpag==?=
 =?ISO-2022-JP-2?B?GyQoRCtXGyhC?=) APEL/10.6 Emacs/23.1.50
 (sparc-sun-solaris2.8) MULE/6.0 (HANACHIRUSATO)
MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka")
Content-Type: text/plain; charset=US-ASCII
X-detected-operating-system: by monty-python.gnu.org: NetBSD 3.0 (DF)

The following results look inconsistent:

  (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80"))
  => 0
  (string-match (string-to-multibyte "\x80") "\x80")
  => nil

  (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80"))
  => nil
  (string-match (string-to-multibyte "[\x80]") "\x80")
  => 0

				     YAMAMOTO Mitsuharu
				mituharu@HIDDEN

In GNU Emacs 23.1.50.1 (sparc-sun-solaris2.8, X toolkit, Xaw3d scroll bars)
 of 2009-06-26 on church
Windowing system distributor `The X.Org Foundation', version 11.0.10402000
configured using `configure  'LDFLAGS=-L/usr/local/lib -R/usr/local/lib' 'CPPFLAGS=-I/usr/local/lib''

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: ja
  value of $XMODIFIERS: nil
  locale-coding-system: japanese-iso-8bit-unix
  default-enable-multibyte-characters: t

Major mode: Fundamental

Minor modes in effect:
  tooltip-mode: t
  tool-bar-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  blink-cursor-mode: t
  global-auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t



Acknowledgement sent to YAMAMOTO Mitsuharu <mituharu@HIDDEN>:
New bug report received and forwarded. Copy sent to Emacs Bugs <bug-gnu-emacs@HIDDEN>. Full text available.
Report forwarded to bug-submit-list@HIDDEN, Emacs Bugs <bug-gnu-emacs@HIDDEN>:
bug#3687; Package emacs. Full text available.
Last modified: Fri, 31 Oct 2014 17:00:04 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.