GNU bug report logs - #45246
28.0.50; etags assertion error


Package: emacs;

Reported by: Gregor Zattler <grfz <at> gmx.de>

Date: Mon, 14 Dec 2020 23:39:02 UTC

Severity: normal

Tags: moreinfo

Found in version 28.0.50

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 45246 in the body.
You can then email your comments to 45246 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#45246; Package emacs. (Mon, 14 Dec 2020 23:39:02 GMT)

Acknowledgement sent to Gregor Zattler <grfz <at> gmx.de>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Mon, 14 Dec 2020 23:39:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Gregor Zattler <grfz <at> gmx.de>
To: bug-gnu-emacs <at> gnu.org
Subject: 28.0.50; etags assertion error
Date: Tue, 15 Dec 2020 00:38:40 +0100


Dear emacs developers,

I use Emacs configured using:
 'configure -C --with-file-notification=inotify --with-cairo
 --without-toolkit-scroll-bars --with-x-toolkit=lucid
 --with-sound=yes --without-gconf --with-mailutils
 --with-x=yes --enable-checking=yes
 --enable-check-lisp-object-type=yes --with-nativecomp
 'CFLAGS=-g -O2
 -fdebug-prefix-map=/home/grfz/src/emacs-feature_native-comp=. -fstack-protector-strong
 -Wformat -Werror=format-security -Wall -fno-pie'
 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2 '
 'LDFLAGS=-Wl,-z,relro -no-pie''



and I get an assertion error when executing the following line:

~/src$ find . -type f -print0 | egrep -zZ -- '(\.el|\.c|\.h)(\.gz)?$' | xargs -0IXXXXX sh -c "/home/grfz/src/emacs-master/lib-src/etags XXXXX || echo XXXXX"
etags: etags.c:4153: C_entries: Assertion `bracelev == typdefbracelev' failed.
Aborted
./xapian-core-1.4.17/include/xapian/unicode.h
etags: etags.c:4153: C_entries: Assertion `bracelev == typdefbracelev' failed.
Aborted
./xapian-core-1.4.17/debian/tmp/usr/include/xapian/unicode.h
etags: etags.c:4153: C_entries: Assertion `bracelev == typdefbracelev' failed.
Aborted
./xapian-core-1.4.17/debian/libxapian-dev/usr/include/xapian/unicode.h
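
In a more readable form, the check above amounts to running etags on each file and printing the names of the files on which it fails. The following is only a sketch of that idea, not the command that was actually run: the function name `check_files` is made up for illustration, and `-o /dev/null` is added only to discard the generated tags.

```shell
# check_files CMD FILE...
# Run CMD (e.g. an etags binary) on each FILE separately and print the
# name of every file for which CMD exits with a non-zero status.
# An assertion failure aborts etags, which counts as a failure here.
# NOTE: check_files is a hypothetical helper, not part of Emacs.
check_files() {
    cmd=$1
    shift
    for f in "$@"; do
        # -o /dev/null: throw the tags output away, keep only the exit status.
        "$cmd" -o /dev/null "$f" || printf '%s\n' "$f"
    done
}

# Example call, using the paths from this report:
#   check_files /home/grfz/src/emacs-master/lib-src/etags \
#       ./xapian-core-1.4.17/include/xapian/unicode.h
```

Running each file in its own etags process keeps one aborting file from hiding the results for the others.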

The file in question is attached to this email.


I do not get an assertion error if I use
/usr/bin/etags.emacs --version ./xapian-core-1.4.17/include/xapian/unicode.h

This etags binary is from the Debian buster distribution.


Thanks for your attention, Gregor

[unicode.h (text/plain, inline)]

/** @file unicode.h
 * @brief Unicode and UTF-8 related classes and functions.
 */
/* Copyright (C) 2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2019 Olly Betts
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
 */

#ifndef XAPIAN_INCLUDED_UNICODE_H
#define XAPIAN_INCLUDED_UNICODE_H

#if !defined XAPIAN_IN_XAPIAN_H && !defined XAPIAN_LIB_BUILD
# error Never use <xapian/unicode.h> directly; include <xapian.h> instead.
#endif

#include <xapian/attributes.h>
#include <xapian/visibility.h>

#include <string>

namespace Xapian {

/** An iterator which returns Unicode character values from a UTF-8 encoded
 *  string.
 */
class XAPIAN_VISIBILITY_DEFAULT Utf8Iterator {
    const unsigned char* p;
    const unsigned char* end;
    mutable unsigned seqlen;

    bool XAPIAN_NOTHROW(calculate_sequence_length() const);

    unsigned get_char() const;

    Utf8Iterator(const unsigned char* p_,
		 const unsigned char* end_,
		 unsigned seqlen_)
	: p(p_), end(end_), seqlen(seqlen_) { }

  public:
    /** Return the raw const char* pointer for the current position. */
    const char* raw() const {
	return reinterpret_cast<const char*>(p ? p : end);
    }

    /** Return the number of bytes left in the iterator's buffer. */
    size_t left() const { return p ? end - p : 0; }

    /** Assign a new string to the iterator.
     *
     *  The iterator will forget the string it was iterating through, and
     *  return characters from the start of the new string when next called.
     *  The string is not copied into the iterator, so it must remain valid
     *  while the iteration is in progress.
     *
     *  @param p_ A pointer to the start of the string to read.
     *
     *  @param len The length of the string to read.
     */
    void assign(const char* p_, size_t len) {
	if (len) {
	    p = reinterpret_cast<const unsigned char*>(p_);
	    end = p + len;
	    seqlen = 0;
	} else {
	    p = NULL;
	}
    }

    /** Assign a new string to the iterator.
     *
     *  The iterator will forget the string it was iterating through, and
     *  return characters from the start of the new string when next called.
     *  The string is not copied into the iterator, so it must remain valid
     *  while the iteration is in progress.
     *
     *  @param s The string to read.  Must not be modified while the iteration
     *		 is in progress.
     */
    void assign(const std::string& s) { assign(s.data(), s.size()); }

    /** Create an iterator given a pointer to a null terminated string.
     *
     *  The iterator will return characters from the start of the string when
     *  next called.  The string is not copied into the iterator, so it must
     *  remain valid while the iteration is in progress.
     *
     *  @param p_ A pointer to the start of the null terminated string to read.
     */
    explicit Utf8Iterator(const char* p_);

    /** Create an iterator given a pointer and a length.
     *
     *  The iterator will return characters from the start of the string when
     *  next called.  The string is not copied into the iterator, so it must
     *  remain valid while the iteration is in progress.
     *
     *  @param p_ A pointer to the start of the string to read.
     *
     *  @param len The length of the string to read.
     */
    Utf8Iterator(const char* p_, size_t len) { assign(p_, len); }

    /** Create an iterator given a string.
     *
     *  The iterator will return characters from the start of the string when
     *  next called.  The string is not copied into the iterator, so it must
     *  remain valid while the iteration is in progress.
     *
     *  @param s The string to read.  Must not be modified while the iteration
     *		 is in progress.
     */
    Utf8Iterator(const std::string& s) { assign(s.data(), s.size()); }

    /** Create an iterator which is at the end of its iteration.
     *
     *  This can be compared to another iterator to check if the other iterator
     *  has reached its end.
     */
    XAPIAN_NOTHROW(Utf8Iterator())
	: p(NULL), end(0), seqlen(0) { }

    /** Get the current Unicode character value pointed to by the iterator.
     *
     *  If an invalid UTF-8 sequence is encountered, then the byte values
     *  comprising it are returned until valid UTF-8 or the end of the input is
     *  reached.
     *
     *  Returns unsigned(-1) if the iterator has reached the end of its buffer.
     */
    unsigned XAPIAN_NOTHROW(operator*() const) XAPIAN_PURE_FUNCTION;

    /** @private @internal Get the current Unicode character
     *  value pointed to by the iterator.
     *
     *  If an invalid UTF-8 sequence is encountered, then the byte values
     *  comprising it are returned with the top bit set (so the caller can
     *  differentiate these from the same values arising from valid UTF-8)
     *  until valid UTF-8 or the end of the input is reached.
     *
     *  Returns unsigned(-1) if the iterator has reached the end of its buffer.
     */
    unsigned XAPIAN_NOTHROW(strict_deref() const) XAPIAN_PURE_FUNCTION;

    /** Move forward to the next Unicode character.
     *
     *  @return An iterator pointing to the position before the move.
     */
    Utf8Iterator operator++(int) {
	// If we've not calculated seqlen yet, do so.
	if (seqlen == 0) calculate_sequence_length();
	const unsigned char* old_p = p;
	unsigned old_seqlen = seqlen;
	p += seqlen;
	if (p == end) p = NULL;
	seqlen = 0;
	return Utf8Iterator(old_p, end, old_seqlen);
    }

    /** Move forward to the next Unicode character.
     *
     *  @return A reference to this object.
     */
    Utf8Iterator& operator++() {
	if (seqlen == 0) calculate_sequence_length();
	p += seqlen;
	if (p == end) p = NULL;
	seqlen = 0;
	return *this;
    }

    /** Test two Utf8Iterators for equality.
     *
     *  @param other	The Utf8Iterator to compare this one with.
     *  @return true iff the iterators point to the same position.
     */
    bool XAPIAN_NOTHROW(operator==(const Utf8Iterator& other) const) {
	return p == other.p;
    }

    /** Test two Utf8Iterators for inequality.
     *
     *  @param other	The Utf8Iterator to compare this one with.
     *  @return true iff the iterators do not point to the same position.
     */
    bool XAPIAN_NOTHROW(operator!=(const Utf8Iterator& other) const) {
	return p != other.p;
    }

    /// We implement the semantics of an STL input_iterator.
    //@{
    typedef std::input_iterator_tag iterator_category;
    typedef unsigned value_type;
    typedef size_t difference_type;
    typedef const unsigned* pointer;
    typedef const unsigned& reference;
    //@}
};

/// Functions associated with handling Unicode characters.
namespace Unicode {

/** Each Unicode character is in exactly one of these categories.
 *
 * The Unicode standard calls this the "General Category", and uses a
 * "Major, minor" convention to derive a two letter code.
 */
typedef enum {
    UNASSIGNED,                         /**< Other, not assigned (Cn) */
    UPPERCASE_LETTER,                   /**< Letter, uppercase (Lu) */
    LOWERCASE_LETTER,                   /**< Letter, lowercase (Ll) */
    TITLECASE_LETTER,                   /**< Letter, titlecase (Lt) */
    MODIFIER_LETTER,                    /**< Letter, modifier (Lm) */
    OTHER_LETTER,                       /**< Letter, other (Lo) */
    NON_SPACING_MARK,                   /**< Mark, nonspacing (Mn) */
    ENCLOSING_MARK,                     /**< Mark, enclosing (Me) */
    COMBINING_SPACING_MARK,             /**< Mark, spacing combining (Mc) */
    DECIMAL_DIGIT_NUMBER,               /**< Number, decimal digit (Nd) */
    LETTER_NUMBER,                      /**< Number, letter (Nl) */
    OTHER_NUMBER,                       /**< Number, other (No) */
    SPACE_SEPARATOR,                    /**< Separator, space (Zs) */
    LINE_SEPARATOR,                     /**< Separator, line (Zl) */
    PARAGRAPH_SEPARATOR,                /**< Separator, paragraph (Zp) */
    CONTROL,                            /**< Other, control (Cc) */
    FORMAT,                             /**< Other, format (Cf) */
    PRIVATE_USE,                        /**< Other, private use (Co) */
    SURROGATE,                          /**< Other, surrogate (Cs) */
    CONNECTOR_PUNCTUATION,              /**< Punctuation, connector (Pc) */
    DASH_PUNCTUATION,                   /**< Punctuation, dash (Pd) */
    OPEN_PUNCTUATION,                   /**< Punctuation, open (Ps) */
    CLOSE_PUNCTUATION,                  /**< Punctuation, close (Pe) */
    INITIAL_QUOTE_PUNCTUATION,          /**< Punctuation, initial quote (Pi) */
    FINAL_QUOTE_PUNCTUATION,            /**< Punctuation, final quote (Pf) */
    OTHER_PUNCTUATION,                  /**< Punctuation, other (Po) */
    MATH_SYMBOL,                        /**< Symbol, math (Sm) */
    CURRENCY_SYMBOL,                    /**< Symbol, currency (Sc) */
    MODIFIER_SYMBOL,                    /**< Symbol, modified (Sk) */
    OTHER_SYMBOL                        /**< Symbol, other (So) */
} category;

namespace Internal {
    /** @private @internal Extract the information about a character from the
     *  Unicode character tables.
     *
     *  Characters outside of the Unicode range (i.e. ch >= 0x110000) are
     *  treated as UNASSIGNED with no case variants.
     */
    XAPIAN_VISIBILITY_DEFAULT
    int XAPIAN_NOTHROW(get_character_info(unsigned ch)) XAPIAN_CONST_FUNCTION;

    /** @private @internal Extract how to convert the case of a Unicode
     *  character from its info.
     */
    inline int get_case_type(int info) { return ((info & 0xe0) >> 5); }

    /** @private @internal Extract the category of a Unicode character from its
     *  info.
     */
    inline category get_category(int info) {
	return static_cast<category>(info & 0x1f);
    }

    /** @private @internal Extract the delta to use for case conversion of a
     *  character from its info.
     */
    inline int get_delta(int info) {
	/* It's implementation defined if sign extension happens when right
	 * shifting a signed int, although in practice sign extension is what
	 * most compilers implement.
	 *
	 * Some compilers are smart enough to spot common idioms for sign
	 * extension, but not all (e.g. GCC < 7 doesn't spot the one used
	 * below), so check what the implementation-defined behaviour is with
	 * a constant conditional which should get optimised away.
	 *
	 * We use the ternary operator here to avoid various compiler
	 * warnings which writing this as an `if` results in.
	 */
	return ((-1 >> 1) == -1 ?
		// Right shift sign-extends.
		info >> 8 :
		// Right shift shifts in zeros so bitwise-not before and after
		// the shift for negative values.
		(info >= 0) ? (info >> 8) : (~(~info >> 8)));
    }
}

/** Convert a single non-ASCII Unicode character to UTF-8.
 *
 *  This is intended mainly as a helper method for to_utf8().
 *
 *  @param ch	The character (which must be > 128) to write to @a buf.
 *  @param buf	The buffer to write the character to - it must have
 *		space for (at least) 4 bytes.
 *
 *  @return	The length of the resultant UTF-8 character in bytes.
 */
XAPIAN_VISIBILITY_DEFAULT
unsigned nonascii_to_utf8(unsigned ch, char* buf);

/** Convert a single Unicode character to UTF-8.
 *
 *  @param ch	The character to write to @a buf.
 *  @param buf	The buffer to write the character to - it must have
 *		space for (at least) 4 bytes.
 *
 *  @return	The length of the resultant UTF-8 character in bytes.
 */
inline unsigned to_utf8(unsigned ch, char* buf) {
    if (ch < 128) {
	*buf = static_cast<unsigned char>(ch);
	return 1;
    }
    return Xapian::Unicode::nonascii_to_utf8(ch, buf);
}

/** Append the UTF-8 representation of a single Unicode character to a
 *  std::string.
 */
inline void append_utf8(std::string& s, unsigned ch) {
    char buf[4];
    s.append(buf, to_utf8(ch, buf));
}

/// Return the category which a given Unicode character falls into.
inline category get_category(unsigned ch) {
    return Internal::get_category(Internal::get_character_info(ch));
}

/// Test if a given Unicode character is "word character".
inline bool is_wordchar(unsigned ch) {
    const unsigned int WORDCHAR_MASK =
	    (1 << Xapian::Unicode::UPPERCASE_LETTER) |
	    (1 << Xapian::Unicode::LOWERCASE_LETTER) |
	    (1 << Xapian::Unicode::TITLECASE_LETTER) |
	    (1 << Xapian::Unicode::MODIFIER_LETTER) |
	    (1 << Xapian::Unicode::OTHER_LETTER) |
	    (1 << Xapian::Unicode::NON_SPACING_MARK) |
	    (1 << Xapian::Unicode::ENCLOSING_MARK) |
	    (1 << Xapian::Unicode::COMBINING_SPACING_MARK) |
	    (1 << Xapian::Unicode::DECIMAL_DIGIT_NUMBER) |
	    (1 << Xapian::Unicode::LETTER_NUMBER) |
	    (1 << Xapian::Unicode::OTHER_NUMBER) |
	    (1 << Xapian::Unicode::CONNECTOR_PUNCTUATION);
    return ((WORDCHAR_MASK >> get_category(ch)) & 1);
}

/// Test if a given Unicode character is a whitespace character.
inline bool is_whitespace(unsigned ch) {
    const unsigned int WHITESPACE_MASK =
	    (1 << Xapian::Unicode::CONTROL) | // For TAB, CR, LF, FF.
	    (1 << Xapian::Unicode::SPACE_SEPARATOR) |
	    (1 << Xapian::Unicode::LINE_SEPARATOR) |
	    (1 << Xapian::Unicode::PARAGRAPH_SEPARATOR);
    return ((WHITESPACE_MASK >> get_category(ch)) & 1);
}

/// Test if a given Unicode character is a currency symbol.
inline bool is_currency(unsigned ch) {
    return (get_category(ch) == Xapian::Unicode::CURRENCY_SYMBOL);
}

/// Convert a Unicode character to lowercase.
inline unsigned tolower(unsigned ch) {
    int info = Xapian::Unicode::Internal::get_character_info(ch);
    if (!(Internal::get_case_type(info) & 2))
	return ch;
    return ch + Internal::get_delta(info);
}

/// Convert a Unicode character to uppercase.
inline unsigned toupper(unsigned ch) {
    int info = Xapian::Unicode::Internal::get_character_info(ch);
    if (!(Internal::get_case_type(info) & 4))
	return ch;
    return ch - Internal::get_delta(info);
}

/// Convert a UTF-8 std::string to lowercase.
inline std::string
tolower(const std::string& term)
{
    std::string result;
    result.reserve(term.size());
    for (Utf8Iterator i(term); i != Utf8Iterator(); ++i) {
	append_utf8(result, tolower(*i));
    }
    return result;
}

/// Convert a UTF-8 std::string to uppercase.
inline std::string
toupper(const std::string& term)
{
    std::string result;
    result.reserve(term.size());
    for (Utf8Iterator i(term); i != Utf8Iterator(); ++i) {
	append_utf8(result, toupper(*i));
    }
    return result;
}

}

}

#endif // XAPIAN_INCLUDED_UNICODE_H

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45246; Package emacs. (Tue, 07 Jun 2022 11:36:02 GMT)

Message #8 received at 45246 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Gregor Zattler <grfz <at> gmx.de>
Cc: 45246 <at> debbugs.gnu.org
Subject: Re: bug#45246: 28.0.50; etags assertion error
Date: Tue, 07 Jun 2022 13:35:48 +0200

Gregor Zattler <grfz <at> gmx.de> writes:

> and I get an assertion error when executing the following line:
>
> ~/src$ find . -type f -print0 | egrep -zZ -- '(\.el|\.c|\.h)(\.gz)?$'
> | xargs -0IXXXXX sh -c "/home/grfz/src/emacs-master/lib-src/etags
> XXXXX || echo XXXXX"
> etags: etags.c:4153: C_entries: Assertion `bracelev == typdefbracelev' failed.
> Aborted

(I'm going through old bug reports that unfortunately weren't resolved
at the time.)

I tried saying

etags unicode.h

on the supplied file, but I didn't see any assertion errors, either with
the etags from Emacs 28 or 29.

Do you still see this problem in recent Emacs versions?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Added tag(s) moreinfo. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Tue, 07 Jun 2022 11:37:02 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45246; Package emacs. (Tue, 07 Jun 2022 14:27:01 GMT)

Message #13 received at 45246 <at> debbugs.gnu.org:

From: Gregor Zattler <grfz <at> gmx.de>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 45246 <at> debbugs.gnu.org
Subject: Re: bug#45246: 28.0.50; etags assertion error
Date: Tue, 07 Jun 2022 16:26:42 +0200


Hi Lars,
* Lars Ingebrigtsen <larsi <at> gnus.org> [2022-06-07; 13:35]:
> Gregor Zattler <grfz <at> gmx.de> writes:
>
>> and I get an assertion error when executing the following line:
>>
>> ~/src$ find . -type f -print0 | egrep -zZ -- '(\.el|\.c|\.h)(\.gz)?$'
>> | xargs -0IXXXXX sh -c "/home/grfz/src/emacs-master/lib-src/etags
>> XXXXX || echo XXXXX"
>> etags: etags.c:4153: C_entries: Assertion `bracelev == typdefbracelev' failed.
>> Aborted
>
> (I'm going through old bug reports that unfortunately weren't resolved
> at the time.)
>
> I tried saying
>
> etags unicode.h
>
> on the supplied file, but I didn't see any assertion errors, either with
> the etags from Emacs 28 or 29.
>
> Do you still see this problem in recent Emacs versions?


Yes:

$ /home/grfz/src/emacs/lib-src/etags /usr/include/xapian/unicode.h
etags: etags.c:4188: C_entries: Assertion `bracelev == typdefbracelev' failed.
Aborted


This is on Debian bullseye.  etags was built in the same
process as this Emacs:



In GNU Emacs 29.0.50 (build 3, x86_64-pc-linux-gnu, X toolkit, cairo version 1.16.0)
 of 2022-05-15 built on no
Repository revision: b26574d7d7c458fec7494484ea5bceeed45f2f02
Repository branch: master
Windowing system distributor 'The X.Org Foundation', version 11.0.12011000
System Description: Debian GNU/Linux 11 (bullseye)

Configured using:
 'configure -C --prefix=/usr/local/stow/emacs-snapshot
 --enable-locallisppath=/etc/emacs:/usr/local/share/emacs/29.0/site-lisp:/usr/local/share/emacs/site-lisp:/usr/share/emacs/29.0/site-lisp:/usr/share/emacs/site-lisp
 --with-sound=yes --without-gconf --with-mailutils --build
 x86_64-linux-gnu
 --infodir=/usr/local/share/info:/usr/share/info --with-json
 --with-file-notification=yes --with-cairo --with-x=yes
 --with-x-toolkit=lucid --without-toolkit-scroll-bars
 --enable-checking=yes,glyphs
 --enable-check-lisp-object-type --with-native-compilation
 'CFLAGS=-g3 -O3
 -ffile-prefix-map=/home/grfz/src/emacs=. -fstack-protector-strong
 -Wformat -Werror=format-security ''

Configured features:
ACL CAIRO DBUS FREETYPE GIF GLIB GMP GNUTLS GPM GSETTINGS
HARFBUZZ JPEG JSON LCMS2 LIBOTF LIBSELINUX LIBSYSTEMD
LIBXML2 M17N_FLT MODULES NATIVE_COMP NOTIFY INOTIFY PDUMPER
PNG RSVG SECCOMP SOUND THREADS TIFF X11 XAW3D XDBE XIM
XINPUT2 XPM LUCID ZLIB





Since there is no unicode.h under ~/src/ ATM, I used a
different unicode.h file this time.  It's attached.

For me this is not an important bug.  If you want to
investigate: Is there anything I can do to help you?

Ciao,
--
Gregor

[unicode.h (text/plain, inline)]

/** @file
 * @brief Unicode and UTF-8 related classes and functions.
 */
/* Copyright (C) 2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2019 Olly Betts
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
 */

#ifndef XAPIAN_INCLUDED_UNICODE_H
#define XAPIAN_INCLUDED_UNICODE_H

#if !defined XAPIAN_IN_XAPIAN_H && !defined XAPIAN_LIB_BUILD
# error Never use <xapian/unicode.h> directly; include <xapian.h> instead.
#endif

#include <xapian/attributes.h>
#include <xapian/visibility.h>

#include <string>

namespace Xapian {

/** An iterator which returns Unicode character values from a UTF-8 encoded
 *  string.
 */
class XAPIAN_VISIBILITY_DEFAULT Utf8Iterator {
    const unsigned char* p;
    const unsigned char* end;
    mutable unsigned seqlen;

    bool XAPIAN_NOTHROW(calculate_sequence_length() const);

    unsigned get_char() const;

    Utf8Iterator(const unsigned char* p_,
		 const unsigned char* end_,
		 unsigned seqlen_)
	: p(p_), end(end_), seqlen(seqlen_) { }

  public:
    /** Return the raw const char* pointer for the current position. */
    const char* raw() const {
	return reinterpret_cast<const char*>(p ? p : end);
    }

    /** Return the number of bytes left in the iterator's buffer. */
    size_t left() const { return p ? end - p : 0; }

    /** Assign a new string to the iterator.
     *
     *  The iterator will forget the string it was iterating through, and
     *  return characters from the start of the new string when next called.
     *  The string is not copied into the iterator, so it must remain valid
     *  while the iteration is in progress.
     *
     *  @param p_ A pointer to the start of the string to read.
     *
     *  @param len The length of the string to read.
     */
    void assign(const char* p_, size_t len) {
	if (len) {
	    p = reinterpret_cast<const unsigned char*>(p_);
	    end = p + len;
	    seqlen = 0;
	} else {
	    p = NULL;
	}
    }

    /** Assign a new string to the iterator.
     *
     *  The iterator will forget the string it was iterating through, and
     *  return characters from the start of the new string when next called.
     *  The string is not copied into the iterator, so it must remain valid
     *  while the iteration is in progress.
     *
     *  @param s The string to read.  Must not be modified while the iteration
     *		 is in progress.
     */
    void assign(const std::string& s) { assign(s.data(), s.size()); }

    /** Create an iterator given a pointer to a null terminated string.
     *
     *  The iterator will return characters from the start of the string when
     *  next called.  The string is not copied into the iterator, so it must
     *  remain valid while the iteration is in progress.
     *
     *  @param p_ A pointer to the start of the null terminated string to read.
     */
    explicit Utf8Iterator(const char* p_);

    /** Create an iterator given a pointer and a length.
     *
     *  The iterator will return characters from the start of the string when
     *  next called.  The string is not copied into the iterator, so it must
     *  remain valid while the iteration is in progress.
     *
     *  @param p_ A pointer to the start of the string to read.
     *
     *  @param len The length of the string to read.
     */
    Utf8Iterator(const char* p_, size_t len) { assign(p_, len); }

    /** Create an iterator given a string.
     *
     *  The iterator will return characters from the start of the string when
     *  next called.  The string is not copied into the iterator, so it must
     *  remain valid while the iteration is in progress.
     *
     *  @param s The string to read.  Must not be modified while the iteration
     *		 is in progress.
     */
    Utf8Iterator(const std::string& s) { assign(s.data(), s.size()); }

    /** Create an iterator which is at the end of its iteration.
     *
     *  This can be compared to another iterator to check if the other iterator
     *  has reached its end.
     */
    XAPIAN_NOTHROW(Utf8Iterator())
	: p(NULL), end(0), seqlen(0) { }

    /** Get the current Unicode character value pointed to by the iterator.
     *
     *  If an invalid UTF-8 sequence is encountered, then the byte values
     *  comprising it are returned until valid UTF-8 or the end of the input is
     *  reached.
     *
     *  Returns unsigned(-1) if the iterator has reached the end of its buffer.
     */
    unsigned XAPIAN_NOTHROW(operator*() const) XAPIAN_PURE_FUNCTION;

    /** @private @internal Get the current Unicode character
     *  value pointed to by the iterator.
     *
     *  If an invalid UTF-8 sequence is encountered, then the byte values
     *  comprising it are returned with the top bit set (so the caller can
     *  differentiate these from the same values arising from valid UTF-8)
     *  until valid UTF-8 or the end of the input is reached.
     *
     *  Returns unsigned(-1) if the iterator has reached the end of its buffer.
     */
    unsigned XAPIAN_NOTHROW(strict_deref() const) XAPIAN_PURE_FUNCTION;

    /** Move forward to the next Unicode character.
     *
     *  @return An iterator pointing to the position before the move.
     */
    Utf8Iterator operator++(int) {
	// If we've not calculated seqlen yet, do so.
	if (seqlen == 0) calculate_sequence_length();
	const unsigned char* old_p = p;
	unsigned old_seqlen = seqlen;
	p += seqlen;
	if (p == end) p = NULL;
	seqlen = 0;
	return Utf8Iterator(old_p, end, old_seqlen);
    }

    /** Move forward to the next Unicode character.
     *
     *  @return A reference to this object.
     */
    Utf8Iterator& operator++() {
	if (seqlen == 0) calculate_sequence_length();
	p += seqlen;
	if (p == end) p = NULL;
	seqlen = 0;
	return *this;
    }

    /** Test two Utf8Iterators for equality.
     *
     *  @param other	The Utf8Iterator to compare this one with.
     *  @return true iff the iterators point to the same position.
     */
    bool XAPIAN_NOTHROW(operator==(const Utf8Iterator& other) const) {
	return p == other.p;
    }

    /** Test two Utf8Iterators for inequality.
     *
     *  @param other	The Utf8Iterator to compare this one with.
     *  @return true iff the iterators do not point to the same position.
     */
    bool XAPIAN_NOTHROW(operator!=(const Utf8Iterator& other) const) {
	return p != other.p;
    }

    /// We implement the semantics of an STL input_iterator.
    //@{
    typedef std::input_iterator_tag iterator_category;
    typedef unsigned value_type;
    typedef size_t difference_type;
    typedef const unsigned* pointer;
    typedef const unsigned& reference;
    //@}
};

/// Functions associated with handling Unicode characters.
namespace Unicode {

/** Each Unicode character is in exactly one of these categories.
 *
 * The Unicode standard calls this the "General Category", and uses a
 * "Major, minor" convention to derive a two letter code.
 */
typedef enum {
    UNASSIGNED,                         /**< Other, not assigned (Cn) */
    UPPERCASE_LETTER,                   /**< Letter, uppercase (Lu) */
    LOWERCASE_LETTER,                   /**< Letter, lowercase (Ll) */
    TITLECASE_LETTER,                   /**< Letter, titlecase (Lt) */
    MODIFIER_LETTER,                    /**< Letter, modifier (Lm) */
    OTHER_LETTER,                       /**< Letter, other (Lo) */
    NON_SPACING_MARK,                   /**< Mark, nonspacing (Mn) */
    ENCLOSING_MARK,                     /**< Mark, enclosing (Me) */
    COMBINING_SPACING_MARK,             /**< Mark, spacing combining (Mc) */
    DECIMAL_DIGIT_NUMBER,               /**< Number, decimal digit (Nd) */
    LETTER_NUMBER,                      /**< Number, letter (Nl) */
    OTHER_NUMBER,                       /**< Number, other (No) */
    SPACE_SEPARATOR,                    /**< Separator, space (Zs) */
    LINE_SEPARATOR,                     /**< Separator, line (Zl) */
    PARAGRAPH_SEPARATOR,                /**< Separator, paragraph (Zp) */
    CONTROL,                            /**< Other, control (Cc) */
    FORMAT,                             /**< Other, format (Cf) */
    PRIVATE_USE,                        /**< Other, private use (Co) */
    SURROGATE,                          /**< Other, surrogate (Cs) */
    CONNECTOR_PUNCTUATION,              /**< Punctuation, connector (Pc) */
    DASH_PUNCTUATION,                   /**< Punctuation, dash (Pd) */
    OPEN_PUNCTUATION,                   /**< Punctuation, open (Ps) */
    CLOSE_PUNCTUATION,                  /**< Punctuation, close (Pe) */
    INITIAL_QUOTE_PUNCTUATION,          /**< Punctuation, initial quote (Pi) */
    FINAL_QUOTE_PUNCTUATION,            /**< Punctuation, final quote (Pf) */
    OTHER_PUNCTUATION,                  /**< Punctuation, other (Po) */
    MATH_SYMBOL,                        /**< Symbol, math (Sm) */
    CURRENCY_SYMBOL,                    /**< Symbol, currency (Sc) */
    MODIFIER_SYMBOL,                    /**< Symbol, modified (Sk) */
    OTHER_SYMBOL                        /**< Symbol, other (So) */
} category;

namespace Internal {
    /** @private @internal Extract the information about a character from the
     *  Unicode character tables.
     *
     *  Characters outside of the Unicode range (i.e. ch >= 0x110000) are
     *  treated as UNASSIGNED with no case variants.
     */
    XAPIAN_VISIBILITY_DEFAULT
    int XAPIAN_NOTHROW(get_character_info(unsigned ch)) XAPIAN_CONST_FUNCTION;

    /** @private @internal Extract how to convert the case of a Unicode
     *  character from its info.
     */
    inline int get_case_type(int info) { return ((info & 0xe0) >> 5); }

    /** @private @internal Extract the category of a Unicode character from its
     *  info.
     */
    inline category get_category(int info) {
	return static_cast<category>(info & 0x1f);
    }

    /** @private @internal Extract the delta to use for case conversion of a
     *  character from its info.
     */
    inline int get_delta(int info) {
	/* It's implementation defined if sign extension happens when right
	 * shifting a signed int, although in practice sign extension is what
	 * most compilers implement.
	 *
	 * Some compilers are smart enough to spot common idioms for sign
	 * extension, but not all (e.g. GCC < 7 doesn't spot the one used
	 * below), so check what the implementation-defined behaviour is with
	 * a constant conditional which should get optimised away.
	 *
	 * We use the ternary operator here to avoid various compiler
	 * warnings which writing this as an `if` results in.
	 */
	return ((-1 >> 1) == -1 ?
		// Right shift sign-extends.
		info >> 8 :
		// Right shift shifts in zeros so bitwise-not before and after
		// the shift for negative values.
		(info >= 0) ? (info >> 8) : (~(~info >> 8)));
    }
}

/** Convert a single non-ASCII Unicode character to UTF-8.
 *
 *  This is intended mainly as a helper method for to_utf8().
 *
 *  @param ch	The character (which must be >= 128) to write to @a buf.
 *  @param buf	The buffer to write the character to - it must have
 *		space for (at least) 4 bytes.
 *
 *  @return	The length of the resultant UTF-8 character in bytes.
 */
XAPIAN_VISIBILITY_DEFAULT
unsigned nonascii_to_utf8(unsigned ch, char* buf);

/** Convert a single Unicode character to UTF-8.
 *
 *  @param ch	The character to write to @a buf.
 *  @param buf	The buffer to write the character to - it must have
 *		space for (at least) 4 bytes.
 *
 *  @return	The length of the resultant UTF-8 character in bytes.
 */
inline unsigned to_utf8(unsigned ch, char* buf) {
    if (ch < 128) {
	*buf = static_cast<unsigned char>(ch);
	return 1;
    }
    return Xapian::Unicode::nonascii_to_utf8(ch, buf);
}

/** Append the UTF-8 representation of a single Unicode character to a
 *  std::string.
 */
inline void append_utf8(std::string& s, unsigned ch) {
    char buf[4];
    s.append(buf, to_utf8(ch, buf));
}

/// Return the category which a given Unicode character falls into.
inline category get_category(unsigned ch) {
    return Internal::get_category(Internal::get_character_info(ch));
}

/// Test if a given Unicode character is a "word character".
inline bool is_wordchar(unsigned ch) {
    const unsigned int WORDCHAR_MASK =
	    (1 << Xapian::Unicode::UPPERCASE_LETTER) |
	    (1 << Xapian::Unicode::LOWERCASE_LETTER) |
	    (1 << Xapian::Unicode::TITLECASE_LETTER) |
	    (1 << Xapian::Unicode::MODIFIER_LETTER) |
	    (1 << Xapian::Unicode::OTHER_LETTER) |
	    (1 << Xapian::Unicode::NON_SPACING_MARK) |
	    (1 << Xapian::Unicode::ENCLOSING_MARK) |
	    (1 << Xapian::Unicode::COMBINING_SPACING_MARK) |
	    (1 << Xapian::Unicode::DECIMAL_DIGIT_NUMBER) |
	    (1 << Xapian::Unicode::LETTER_NUMBER) |
	    (1 << Xapian::Unicode::OTHER_NUMBER) |
	    (1 << Xapian::Unicode::CONNECTOR_PUNCTUATION);
    return ((WORDCHAR_MASK >> get_category(ch)) & 1);
}

/// Test if a given Unicode character is a whitespace character.
inline bool is_whitespace(unsigned ch) {
    const unsigned int WHITESPACE_MASK =
	    (1 << Xapian::Unicode::CONTROL) | // For TAB, CR, LF, FF.
	    (1 << Xapian::Unicode::SPACE_SEPARATOR) |
	    (1 << Xapian::Unicode::LINE_SEPARATOR) |
	    (1 << Xapian::Unicode::PARAGRAPH_SEPARATOR);
    return ((WHITESPACE_MASK >> get_category(ch)) & 1);
}

/// Test if a given Unicode character is a currency symbol.
inline bool is_currency(unsigned ch) {
    return (get_category(ch) == Xapian::Unicode::CURRENCY_SYMBOL);
}

/// Convert a Unicode character to lowercase.
inline unsigned tolower(unsigned ch) {
    int info = Xapian::Unicode::Internal::get_character_info(ch);
    if (!(Internal::get_case_type(info) & 2))
	return ch;
    return ch + Internal::get_delta(info);
}

/// Convert a Unicode character to uppercase.
inline unsigned toupper(unsigned ch) {
    int info = Xapian::Unicode::Internal::get_character_info(ch);
    if (!(Internal::get_case_type(info) & 4))
	return ch;
    return ch - Internal::get_delta(info);
}

/// Convert a UTF-8 std::string to lowercase.
inline std::string
tolower(const std::string& term)
{
    std::string result;
    result.reserve(term.size());
    for (Utf8Iterator i(term); i != Utf8Iterator(); ++i) {
	append_utf8(result, tolower(*i));
    }
    return result;
}

/// Convert a UTF-8 std::string to uppercase.
inline std::string
toupper(const std::string& term)
{
    std::string result;
    result.reserve(term.size());
    for (Utf8Iterator i(term); i != Utf8Iterator(); ++i) {
	append_utf8(result, toupper(*i));
    }
    return result;
}

}

}

#endif // XAPIAN_INCLUDED_UNICODE_H

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45246; Package emacs. (Tue, 07 Jun 2022 15:59:02 GMT) Full text and rfc822 format available.

Message #16 received at 45246 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Gregor Zattler <grfz <at> gmx.de>
Cc: larsi <at> gnus.org, 45246 <at> debbugs.gnu.org
Subject: Re: bug#45246: 28.0.50; etags assertion error
Date: Tue, 07 Jun 2022 18:58:13 +0300

> Cc: 45246 <at> debbugs.gnu.org
> From: Gregor Zattler <grfz <at> gmx.de>
> Date: Tue, 07 Jun 2022 16:26:42 +0200
> 
> > on the supplied file, but I didn't see any assertion errors, either with
> > the etags from Emacs 28 or 29.
> >
> > Do you still see this problem in recent Emacs versions?
> 
> 
> Yes:
> 
> $ /home/grfz/src/emacs/lib-src/etags /usr/include/xapian/unicode.h
> etags: etags.c:4188: C_entries: Assertion `bracelev == typdefbracelev' failed.
> Aborted

Lars, I guess you were trying this in an optimized build, where all
the assertions compile to nothing.

I see this here and will try to take a look when I have time.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45246; Package emacs. (Tue, 07 Jun 2022 16:39:02 GMT) Full text and rfc822 format available.

Message #19 received at 45246 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: larsi <at> gnus.org, 45246 <at> debbugs.gnu.org, Gregor Zattler <grfz <at> gmx.de>
Subject: Re: bug#45246: 28.0.50; etags assertion error
Date: Tue, 07 Jun 2022 18:38:46 +0200

On Jun 07 2022, Eli Zaretskii wrote:

> Lars, I guess you were trying this in an optimized build, where all
> the assertions compile to nothing.

It's not about optimized or not, it's controlled by --enable-checking.

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45246; Package emacs. (Tue, 07 Jun 2022 17:10:02 GMT) Full text and rfc822 format available.

Message #22 received at 45246 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: larsi <at> gnus.org, 45246 <at> debbugs.gnu.org, grfz <at> gmx.de
Subject: Re: bug#45246: 28.0.50; etags assertion error
Date: Tue, 07 Jun 2022 20:08:55 +0300

> Cc: larsi <at> gnus.org, 45246 <at> debbugs.gnu.org
> Date: Tue, 07 Jun 2022 18:58:13 +0300
> From: Eli Zaretskii <eliz <at> gnu.org>
> 
> > $ /home/grfz/src/emacs/lib-src/etags /usr/include/xapian/unicode.h
> > etags: etags.c:4188: C_entries: Assertion `bracelev == typdefbracelev' failed.
> > Aborted
> 
> Lars, I guess you were trying this in an optimized build, where all
> the assertions compile to nothing.
> 
> I see this here and will try to take a look when I have time.

A much smaller test case:

namespace Unicode {

typedef enum {
    UNASSIGNED,
    OTHER_SYMBOL
} category;

}

Hmm...

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45246; Package emacs. (Tue, 07 Jun 2022 17:15:01 GMT) Full text and rfc822 format available.

Message #25 received at 45246 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 45246 <at> debbugs.gnu.org, Gregor Zattler <grfz <at> gmx.de>
Subject: Re: bug#45246: 28.0.50; etags assertion error
Date: Tue, 07 Jun 2022 19:13:50 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

> Lars, I guess you were trying this in an optimized build, where all
> the assertions compile to nothing.

Yup.

> I see this here and will try to take a look when I have time.

Thanks.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45246; Package emacs. (Tue, 07 Jun 2022 17:16:02 GMT) Full text and rfc822 format available.

Message #28 received at 45246 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andreas Schwab <schwab <at> linux-m68k.org>
Cc: larsi <at> gnus.org, 45246 <at> debbugs.gnu.org, grfz <at> gmx.de
Subject: Re: bug#45246: 28.0.50; etags assertion error
Date: Tue, 07 Jun 2022 20:15:10 +0300

> From: Andreas Schwab <schwab <at> linux-m68k.org>
> Cc: Gregor Zattler <grfz <at> gmx.de>,  larsi <at> gnus.org,  45246 <at> debbugs.gnu.org
> Date: Tue, 07 Jun 2022 18:38:46 +0200
> 
> On Jun 07 2022, Eli Zaretskii wrote:
> 
> > Lars, I guess you were trying this in an optimized build, where all
> > the assertions compile to nothing.
> 
> It's not about optimized or not, it's controlled by --enable-checking.

In etags.c?  I see no ENABLE_CHECKING there.  I do see

 #include <assert.h>

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45246; Package emacs. (Tue, 07 Jun 2022 17:35:01 GMT) Full text and rfc822 format available.

Message #31 received at 45246 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: larsi <at> gnus.org, 45246 <at> debbugs.gnu.org, grfz <at> gmx.de
Subject: Re: bug#45246: 28.0.50; etags assertion error
Date: Tue, 07 Jun 2022 19:34:03 +0200

On Jun 07 2022, Eli Zaretskii wrote:

> In etags.c?  I see no ENABLE_CHECKING there.

See src/conf_post.h.

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45246; Package emacs. (Tue, 07 Jun 2022 18:27:01 GMT) Full text and rfc822 format available.

Message #34 received at 45246 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andreas Schwab <schwab <at> linux-m68k.org>
Cc: larsi <at> gnus.org, 45246 <at> debbugs.gnu.org, grfz <at> gmx.de
Subject: Re: bug#45246: 28.0.50; etags assertion error
Date: Tue, 07 Jun 2022 21:25:43 +0300

> From: Andreas Schwab <schwab <at> linux-m68k.org>
> Cc: grfz <at> gmx.de,  larsi <at> gnus.org,  45246 <at> debbugs.gnu.org
> Date: Tue, 07 Jun 2022 19:34:03 +0200
> 
> On Jun 07 2022, Eli Zaretskii wrote:
> 
> > In etags.c?  I see no ENABLE_CHECKING there.
> 
> See src/conf_post.h.

Right, thanks.
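
For readers following the exchange above: the standard assert() macro
expands to nothing when NDEBUG is defined before <assert.h> is included,
so a build-time switch can turn every assertion off regardless of the
optimization level.  The sketch below shows the same idea with a
home-made macro.  It is only an illustration: SKETCH_ENABLE_CHECKING,
sketch_assert and checks_active are hypothetical names, and this is not
the code in src/conf_post.h.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical names throughout; this is NOT what conf_post.h contains.
// Compile with -DSKETCH_ENABLE_CHECKING to keep the checks.

// Reports a failed check in the familiar format and aborts.
[[noreturn]] inline void sketch_assert_fail(const char* expr,
                                            const char* file, int line) {
    std::fprintf(stderr, "%s:%d: Assertion `%s' failed.\n", file, line, expr);
    std::abort();
}

#ifdef SKETCH_ENABLE_CHECKING
# define sketch_assert(cond) \
    ((cond) ? (void) 0 : sketch_assert_fail(#cond, __FILE__, __LINE__))
#else
// Checking disabled: the condition is not even evaluated.
# define sketch_assert(cond) ((void) 0)
#endif

// Returns true when sketch_assert() evaluates its argument in this build.
inline bool checks_active() {
    bool evaluated = false;
    sketch_assert((evaluated = true));
    return evaluated;
}
```

Built without the switch, checks_active() returns false because the
macro discards its argument; that is the sense in which a failing
assertion can go unnoticed in a build configured without checking.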

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45246; Package emacs. (Thu, 09 Jun 2022 17:43:02 GMT) Full text and rfc822 format available.

Message #37 received at 45246 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: grfz <at> gmx.de, larsi <at> gnus.org
Cc: 45246 <at> debbugs.gnu.org
Subject: Re: bug#45246: 28.0.50; etags assertion error
Date: Thu, 09 Jun 2022 20:42:41 +0300

> Date: Tue, 07 Jun 2022 20:08:55 +0300
> From: Eli Zaretskii <eliz <at> gnu.org>
> Cc: grfz <at> gmx.de, larsi <at> gnus.org, 45246 <at> debbugs.gnu.org
> 
> > Cc: larsi <at> gnus.org, 45246 <at> debbugs.gnu.org
> > Date: Tue, 07 Jun 2022 18:58:13 +0300
> > From: Eli Zaretskii <eliz <at> gnu.org>
> > 
> > > $ /home/grfz/src/emacs/lib-src/etags /usr/include/xapian/unicode.h
> > > etags: etags.c:4188: C_entries: Assertion `bracelev == typdefbracelev' failed.
> > > Aborted
> > 
> > Lars, I guess you were trying this in an optimized build, where all
> > the assertions compile to nothing.
> > 
> > I see this here and will try to take a look when I have time.
> 
> A much smaller test case:
> 
> namespace Unicode {
> 
> typedef enum {
>     UNASSIGNED,
>     OTHER_SYMBOL
> } category;
> 
> }
> 
> Hmm...

Heh, turns out it's a "feature": when etags sees a closing brace in
column zero, it by default assumes that's the final brace of a
function or a struct definition, so it resets the brace level.  As you
can see, the above test case (and the original unicode.h) has the
closing brace of the "typedef enum" in column zero.  If you mark the
entire typedef and type "C-M-\", Emacs will indent it, and the problem
will go away.

"etags --help" says:

  -I, --ignore-indentation
	  In C and C++ do not assume that a closing brace in the first
	  column is the final brace of a function or structure definition.

And indeed, invoking "etags -I" compiled with --enable-checking with
the original file avoids the assertion violation.  And in a production
build, etags produces a valid TAGS file even if -I is omitted.

So I think there's nothing to do here, and we should close this bug as
notabug.  Does anyone disagree?
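
The heuristic described above can be modelled with a toy brace counter.
This is a sketch under stated assumptions, not the parser in etags.c:
scan_braces, ScanResult and every variable name are hypothetical, and
the scanner ignores comments, strings and everything else a real parser
must handle.  It only shows why resetting the brace level at a
column-zero '}' can disagree with the level recorded when a typedef body
was opened inside a namespace.

```cpp
#include <string>

// Toy model of the column-zero heuristic; all names are hypothetical.
struct ScanResult {
    int final_level = 0;     // brace depth at end of input
    bool consistent = true;  // false if a typedef body closed at an unexpected depth
};

inline ScanResult scan_braces(const std::string& src, bool ignore_indentation) {
    ScanResult r;
    int level = 0;             // current brace depth
    int typedef_level = -1;    // depth at which a typedef body opened, -1 if none
    bool saw_typedef = false;  // "typedef" seen, body not yet opened
    int column = 0;

    for (std::string::size_type i = 0; i < src.size(); ++i) {
        const char c = src[i];
        if (src.compare(i, 7, "typedef") == 0)
            saw_typedef = true;
        if (c == '{') {
            if (saw_typedef) {
                typedef_level = level;  // remember where the body started
                saw_typedef = false;
            }
            ++level;
        } else if (c == '}') {
            if (!ignore_indentation && column == 0)
                level = 0;  // heuristic: a column-zero brace ends everything
            else if (level > 0)
                --level;
            if (typedef_level >= 0 && level <= typedef_level) {
                if (level != typedef_level)
                    r.consistent = false;  // the mismatch an assertion would catch
                typedef_level = -1;
            }
        }
        column = (c == '\n') ? 0 : column + 1;
    }
    r.final_level = level;
    return r;
}
```

Fed the small test case from this report, the model flags an
inconsistency when ignore_indentation is false and none when it is true
(the analogue of -I); indenting the typedef so that its closing brace
leaves column zero also removes the inconsistency.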

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45246; Package emacs. (Thu, 09 Jun 2022 18:44:01 GMT) Full text and rfc822 format available.

Message #40 received at 45246 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 45246 <at> debbugs.gnu.org, grfz <at> gmx.de
Subject: Re: bug#45246: 28.0.50; etags assertion error
Date: Thu, 09 Jun 2022 20:43:04 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

> "etags --help" says:
>
>   -I, --ignore-indentation
> 	  In C and C++ do not assume that a closing brace in the first
> 	  column is the final brace of a function or structure definition.
>
> And indeed, invoking "etags -I" compiled with --enable-checking with
> the original file avoids the assertion violation.  And in a production
> build, etags produces a valid TAGS file even if -I is omitted.
>
> So I think there's nothing to do here, and we should close this bug as
> notabug.  Does anyone disagree?

I think that sounds correct.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45246; Package emacs. (Thu, 09 Jun 2022 19:01:02 GMT) Full text and rfc822 format available.

Message #43 received at 45246 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 45246 <at> debbugs.gnu.org, grfz <at> gmx.de
Subject: Re: bug#45246: 28.0.50; etags assertion error
Date: Thu, 09 Jun 2022 21:59:55 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: grfz <at> gmx.de,  45246 <at> debbugs.gnu.org
> Date: Thu, 09 Jun 2022 20:43:04 +0200
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> > "etags --help" says:
> >
> >   -I, --ignore-indentation
> > 	  In C and C++ do not assume that a closing brace in the first
> > 	  column is the final brace of a function or structure definition.
> >
> > And indeed, invoking "etags -I" compiled with --enable-checking with
> > the original file avoids the assertion violation.  And in a production
> > build, etags produces a valid TAGS file even if -I is omitted.
> >
> > So I think there's nothing to do here, and we should close this bug as
> > notabug.  Does anyone disagree?
> 
> I think that sounds correct.

Gregor, any objections to closing this bug?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45246; Package emacs. (Thu, 09 Jun 2022 22:34:02 GMT) Full text and rfc822 format available.

Message #46 received at 45246 <at> debbugs.gnu.org (full text, mbox):

From: Gregor Zattler <grfz <at> gmx.de>
To: Eli Zaretskii <eliz <at> gnu.org>, Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 45246 <at> debbugs.gnu.org
Subject: Re: bug#45246: 28.0.50; etags assertion error
Date: Fri, 10 Jun 2022 00:33:07 +0200

Hi Eli, Lars,
* Eli Zaretskii <eliz <at> gnu.org> [2022-06-09; 21:59]:
>> From: Lars Ingebrigtsen <larsi <at> gnus.org>
>> Cc: grfz <at> gmx.de,  45246 <at> debbugs.gnu.org
>> Date: Thu, 09 Jun 2022 20:43:04 +0200
>>
>> Eli Zaretskii <eliz <at> gnu.org> writes:
>>
>> > "etags --help" says:
>> >
>> >   -I, --ignore-indentation
>> > 	  In C and C++ do not assume that a closing brace in the first
>> > 	  column is the final brace of a function or structure definition.
>> >
>> > And indeed, invoking "etags -I" compiled with --enable-checking with
>> > the original file avoids the assertion violation.  And in a production
>> > build, etags produces a valid TAGS file even if -I is omitted.
>> >
>> > So I think there's nothing to do here, and we should close this bug as
>> > notabug.  Does anyone disagree?
>>
>> I think that sounds correct.

I confirm -I avoids the assertion.

> Gregor, any objections to closing this bug?

no.

I must admit, I did not read the man page closely but
anyway I wouldn't have understood the consequences of
setting vs not setting -I.  Perhaps the documentation could
be amended somehow?  I assume this is a trade-off between
speed and robustness?


... Some highly unscientific tests:
I see no difference in speed for the etags from Emacs 26
(as shipped with Debian bullseye) with or without -I.
The optimized-build etags from Emacs 29 is a bit faster
with -I than the Emacs 26 etags with or without -I.


I do not have objections to closing this bug
report, but I wonder why etags treats closing braces in
the first column special if it does not speed up things?





Ciao,
--
Gregor

Reply sent to Eli Zaretskii <eliz <at> gnu.org>:
You have taken responsibility. (Fri, 10 Jun 2022 07:27:02 GMT) Full text and rfc822 format available.

Notification sent to Gregor Zattler <grfz <at> gmx.de>:
bug acknowledged by developer. (Fri, 10 Jun 2022 07:27:03 GMT) Full text and rfc822 format available.

Message #51 received at 45246-done <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Gregor Zattler <grfz <at> gmx.de>
Cc: larsi <at> gnus.org, 45246-done <at> debbugs.gnu.org
Subject: Re: bug#45246: 28.0.50; etags assertion error
Date: Fri, 10 Jun 2022 10:25:52 +0300

> From: Gregor Zattler <grfz <at> gmx.de>
> Cc: 45246 <at> debbugs.gnu.org
> Date: Fri, 10 Jun 2022 00:33:07 +0200
> 
> >> > And indeed, invoking "etags -I" compiled with --enable-checking with
> >> > the original file avoids the assertion violation.  And in a production
> >> > build, etags produces a valid TAGS file even if -I is omitted.
> >> >
> >> > So I think there's nothing to do here, and we should close this bug as
> >> > notabug.  Does anyone disagree?
> >>
> >> I think that sounds correct.
> 
> I confirm -I avoids the assertion.
> 
> > Gregor, any objections to closing this bug?
> 
> no.

OK, done.

> I must admit, I did not read the man page closely but
> anyway I wouldn't have understood the consequences of
> setting vs not setting -I.  Perhaps the documentation could
> be amended somehow?

I've added some notes about this to the manual and to the etags man
page, thanks.

> I assume this is a trade-off between speed and robustness?

No, I think it's more about the correctness of the produced TAGS file
than about speed.  etags's C/C++ parser is extremely naïve and largely
ignores the complicated syntax of the C dialects.  So using the
"closing brace in column zero ends all top-level definitions"
heuristic is useful for preventing 'etags' from being utterly confused
by some sophisticated use of C/C++ facilities, such as macros and the
more arcane syntactic constructs in modern C++: it makes sure the
confusion ends as early as possible.

> I do not have objections to closing this bug
> report, but I wonder why etags treats closing braces in
> the first column special if it does not speed up things?

See above.  Whether this is a real problem, I don't know.  I think
the only way to tell is to try.  At least with our test suite for
'etags', using -I causes regressions, e.g. in cp-src/c.C some tags are
not created.  So I think having this heuristic on by default is a good
thing, overall.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45246; Package emacs. (Fri, 10 Jun 2022 14:02:02 GMT) Full text and rfc822 format available.

Message #57 received at 45246-done <at> debbugs.gnu.org (full text, mbox):

From: Francesco Potortì <pot <at> gnu.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: larsi <at> gnus.org, 45246-done <at> debbugs.gnu.org, Gregor Zattler <grfz <at> gmx.de>
Subject: Re: bug#45246: 28.0.50; etags assertion error
Date: Fri, 10 Jun 2022 16:01:26 +0200

Gregor:
>> I confirm -I avoids the assertion.
>> I must admit, I did not read the man page closely but
>> anyway I wouldn't have understood the consequences of
>> setting vs not setting -I.  Perhaps the documentation could
>> be amended somehow?
>
>> I assume this is a trade-off between speed and robustness?

Eli:
>No, I think it's more about the correctness of the produced TAGS file
>than about speed.  etags's C/C++ parser is extremely naïve and largely
>ignores the complicated syntax of the C dialects.  So using the
>"closing brace in column zero ends all top-level definitions"
>heuristic is useful for preventing 'etags' from being utterly confused
>by some sophisticated use of C/C++ facilities, such as macros and the
>more arcane syntactic constructs in modern C++: it makes sure the
>confusion ends as early as possible.
>
>> I do not have objections to closing this bug
>> report, but I wonder why etags treats closing braces in
>> the first column special if it does not speed up things?
>
>See above.  Whether this is a real problem, I don't know.  I think
>the only way to tell is to try.  At least with our test suite for
>'etags', using -I causes regressions, e.g. in cp-src/c.C some tags are
>not created.  So I think having this heuristic on by default is a good
>thing, overall.

I second all of Eli's statements.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 09 Jul 2022 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 3 years and 168 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
