GNU bug report logs - #41970
Suggestions for corrections to Emacs and Elisp manuals

Package: emacs;

Reported by: Jay Bingham <binghamjc <at> msn.com>

Date: Sat, 20 Jun 2020 21:00:02 UTC

Severity: minor

Fixed in version 29.1

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 41970 in the body.
You can then email your comments to 41970 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#41970; Package emacs. (Sat, 20 Jun 2020 21:00:02 GMT)

Acknowledgement sent to Jay Bingham <binghamjc <at> msn.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sat, 20 Jun 2020 21:00:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Jay Bingham <binghamjc <at> msn.com>
To: bug-gnu-emacs <at> gnu.org
Subject: Suggestions for corrections to Emacs and Elisp manuals
Date: Sat, 20 Jun 2020 15:44:31 -0500
Information about the operators and constructs used to create regular
expressions is contained in two locations in the Info manuals, one in
the Emacs manual (section _15.6 Syntax of Regular Expressions_), the
other in the Elisp manual (section _34.3.1.1 Special Characters in
Regular Expressions_). The first paragraph in section 15.6 of the Emacs
manual provides the justification for maintaining two versions of the
material, even though the two versions contain mostly the same
information. There are legitimate differences; however, not all of the
differences can be attributed to the "features used mainly in Lisp
programs". Here are differences that I have noticed, which I believe
should not be differences.

Section _15.6 Syntax of Regular Expressions_ of the Emacs manual contains
descriptions of the postfix repetition operators ‘\{N\}’ and ‘\{N,M\}’.
These operators are not described in the Elisp manual in section
34.3.1.1, but are described in section _34.3.1.3 Backslash Constructs in
Regular Expressions_, where they are defined as ‘\{M\}’ and ‘\{M,N\}’.
Since the Emacs manual also has a section for backslash constructs,
_15.7 Backslash in Regular Expressions_, moving the descriptions of the
postfix repetition operators to section 15.7 and naming them as they are
named in the Elisp manual would contribute greatly to the consistency of
the two manuals. Additionally, the description of ‘\{M,N\}’ in the Elisp
manual contains information not included in the Emacs manual version
that would be appropriate to include there.

The terminology used in section _15.6 Syntax of Regular Expressions_ to
describe and discuss the ‘[ ... ]’ and ‘[^ ... ]’ constructs is
inconsistent. The first paragraph and the final paragraph in the section
both refer to these constructs as "a character alternative", while the
paragraphs describing them call them a “character set”. In section
34.3.1.1 of the Elisp manual the phrase used consistently to describe
them and refer to them is "a character alternative". It would increase
the consistency of both manuals to use the same terminology to describe
and refer to these constructs. A more grammatically correct phrase to
describe these features would be "a set of alternative characters" (but
when have programming nerds ever been that concerned with grammatical
correctness?). Whatever phrase is used to describe and refer to these
constructs, it should be consistent throughout both manuals (the
introduction to section _34.3.1.2 Character Classes_ in the Elisp manual
included).

In both section _15.6 Syntax of Regular Expressions_ and section
_34.3.1.1 Special Characters in Regular Expressions_, near the end of
each section, is a paragraph which contains the sentence:

As a ‘\’ is not special inside a character alternative, it can never 
remove the special meaning of ‘-’ or ‘]’.

In both sections, in the description of the ‘[ ... ]’ construct, is a
sentence which states that the characters ‘]’, ‘-’ and ‘^’ are special
inside character alternatives.

Shouldn't the sentences found in both sections that are cited
above include the '^' character?

The construct ‘\(?NUM: ... \)’ that is described in the Elisp manual,
section _34.3.1.3 Backslash Constructs in Regular Expressions_, is not
included in the Emacs manual section _15.7 Backslash in Regular
Expressions_; it should be. However, the description of the construct in
section 34.3.1.3 should be modified to make it clear that only the
digits 1 through 9 can be used as NUM. Here is a suggestion for doing that:

‘\(?DIGIT:...\)’

is the explicitly numbered group construct. Normal groups get their
number implicitly, based on their position, which can be inconvenient.
This construct allows a specific group number (limited to the digits 1
through 9, see: ‘\DIGIT’ construct) to be assigned to the group
construct. There is no particular restriction on the numbering, e.g.,
several groups can have the same number in which case the last one to
match (i.e., the rightmost match) will be recorded. Implicitly numbered
groups always get the smallest integer larger than the largest one of
any previous group.

In the Emacs manual section _15.7 Backslash in Regular Expressions_ in 
the description of the ‘\D’ construct the following sentence in the 
second paragraph is misleading:

Then, later on in the regular expression, you can use ‘\’ followed by 
the digit D to mean “match the same text matched the Dth time by the ‘\( 
... \)’ construct”.

This does not agree with the description in the paragraphs that surround
it nor with the description of the construct in the Elisp manual,
section _34.3.1.3 Backslash Constructs in Regular Expressions_. This is
not an error introduced in version 26; it has been present since at
least version 23. It should read:

Then, later on in the regular expression, ‘\’ followed by the digit D 
can be used to mean “match the same text matched by the Dth ‘\( ... \)’ 
construct”.

In section _15.7 Backslash in Regular Expressions_ of the Emacs manual
the descriptions for the constructs ‘\`’, ‘\'’, ‘\=’, ‘\b’, ‘\B’, ‘\<’,
‘\>’, ‘\w’, ‘\W’, ‘\_<’, ‘\_>’, ‘\sC’, ‘\SC’, ‘\cC’ and ‘\CC’ appear in
the order shown here, while in section _34.3.1.3 Backslash Constructs in
Regular Expressions_ of the Elisp manual they appear in the following
order: ‘\w’, ‘\W’, ‘\sCODE’, ‘\SCODE’, ‘\cC’, ‘\CC’, ‘\`’, ‘\'’, ‘\=’,
‘\b’, ‘\B’, ‘\<’, ‘\>’, ‘\_<’ and ‘\_>’, which groups the constructs
that match characters together and those that match empty strings
relative to positions together. This grouping makes much more sense than
the apparently haphazard order used in the Emacs manual. The order in
the Emacs manual should match that of the Elisp manual.

Also, in section _34.3.1.3 Backslash Constructs in Regular Expressions_
of the Elisp manual, the four constructs having placeholders, ‘\sCODE’,
‘\SCODE’, ‘\cC’ and ‘\CC’, do not use the same convention for
specifying the placeholders. Either the constructs ‘\sCODE’ and ‘\SCODE’
should be written as ‘\sC’ and ‘\SC’ or the constructs ‘\cC’ and ‘\CC’
should be written as ‘\cCODE’ and ‘\CCODE’, making the convention
consistent throughout the section. The same convention should be used in
both the Emacs manual and the Elisp manual in all constructs where
placeholders occur. I prefer the use of a mnemonic as a placeholder over
the use of a single character.

Adopting this convention would necessitate changing the ‘\{M\}’,
‘\{M,N\}’ and ‘\D’ constructs as well. I suggest the following:
‘\{NUM\}’, ‘\{MIN,MAX\}’ and ‘\DIGIT’. I prefer the convention used in
the online version of the Elisp manual where placeholders are shown in
lowercase italics. I do not know if that is possible to do or if it
would conflict with the convention of showing placeholders in all caps
that is used in function descriptions. Since it is possible to cause
links to files and the names of variables to be displayed differently in
function descriptions, it should not be difficult to define a mechanism
for displaying placeholders in italics in function descriptions.

In section _34.3.1.3 Backslash Constructs in Regular Expressions_ of the
Elisp manual, in the paragraph that introduces the regular expression
constructs that match the empty string, the word ‘consume’ would be more
appropriate than the phrase ‘use up’.

The format of the descriptions in section _34.3.1.3 Backslash Constructs
in Regular Expressions_ of the Elisp manual is not consistent. I offer
you the following, to which I have attempted to add some consistency by
stating the name of the operator/construct and then describing how it is
used. The corrections and improvements mentioned above are incorporated
into what follows.

For the most part, ‘\’ followed by any character matches only that 
character. However, there are several exceptions: two-character 
sequences starting with ‘\’ that have special meanings. The second 
character in the sequence is always an ordinary character when used on 
its own. Here are the ‘\’ operators and constructs.

‘\|’

is the alternative operator. Two regular expressions A and B with ‘\|’
between them form an expression that matches either the text matched by
A or the text matched by B.

Thus, ‘foo\|bar’ matches either ‘foo’ or ‘bar’ but no other string.

‘\|’ applies to the largest possible surrounding expressions. Only a 
surrounding ‘\( … \)’ grouping can limit the grouping power of ‘\|’.

When full backtracking capability is needed to handle multiple uses of 
‘\|’, use the POSIX regular expression functions (see POSIX Regexps in 
the Elisp manual).

‘\{/num/\}’

is the postfix number of repetitions operator. It specifies the exact
number of consecutive repetitions that the preceding regular expression
must match. For example, ‘x\{4\}’ matches only the string ‘xxxx’;
‘c[ad]\{3\}r’ matches only the eight valid strings that can be created
with two characters in three places, that is the strings: ‘caaar’,
‘caadr’, ‘cadar’, ‘caddr’, ‘cdaar’, ‘cdadr’, ‘cddar’, ‘cdddr’.
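
The eight strings listed for ‘c[ad]\{3\}r’ can be checked mechanically.
The sketch below uses Python's re module purely as a stand-in, on the
assumption that its unescaped ‘{3}’ behaves like the ‘\{3\}’ described
above. It is not Emacs code, and the variable names are illustrative.

```python
import itertools
import re

# Python writes the exact-repetition operator as {3}; the Emacs form
# described above is \{3\}.  (Assumed equivalent for this illustration.)
pattern = re.compile(r"c[ad]{3}r")

# Build every string of the form c + three letters drawn from "ad" + r.
candidates = ["c" + "".join(mid) + "r"
              for mid in itertools.product("ad", repeat=3)]

# Keep only the candidates that the pattern matches in full.
matches = [s for s in candidates if pattern.fullmatch(s)]
print(matches)

# Strings with two or four repetitions are rejected.
print(pattern.fullmatch("caar"))
print(pattern.fullmatch("caaaar"))
```

All eight generated candidates should be accepted, while the shorter and
longer strings should not match.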

‘\{/min/,/max/\}’

is the postfix range of repetitions operator. It specifies the range of
consecutive repetitions between /min/ and /max/ that the preceding
regular expression must match, i.e. at least /min/ times, but no more
than /max/ times. If /min/ is omitted, the minimum is 0, and the
preceding regular expression may match at most /max/ times; if /max/ is
omitted, there is no maximum.

‘\{0,1\}’ or ‘\{,1\}’ is equivalent to ‘?’.

‘\{0,\}’ or ‘\{,\}’ is equivalent to ‘*’.

‘\{1,\}’ is equivalent to ‘+’.

For example, ‘c[ad]\{1,2\}r’ matches only the strings: ‘car’, ‘cdr’, 
‘caar’, ‘cadr’, ‘cdar’, and ‘cddr’.

The maximum value allowed for /num/, /min/ and /max/ is 2**15 − 1.
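
The equivalences between the bounded forms and ‘?’, ‘*’ and ‘+’ can be
spot-checked on a few sample strings. This is a sketch in Python's re
syntax, which writes ‘\{min,max\}’ as ‘{min,max}’; treating the two as
equivalent is an assumption of the example, and the helper name
`matched` is hypothetical.

```python
import re

# Python spells \{min,max\} as {min,max}; assumed equivalent here.
samples = ["", "x", "xx", "xxx"]

def matched(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

# Each bounded form should accept the same samples as its shorthand.
print(matched(r"x{0,1}") == matched(r"x?"))
print(matched(r"x{0,}") == matched(r"x*"))
print(matched(r"x{1,}") == matched(r"x+"))

# One or two letters drawn from "ad" between c and r.
words = ["cr", "car", "cdr", "caar", "cadr", "cdar", "cddr", "caddr"]
in_range = [w for w in words if re.fullmatch(r"c[ad]{1,2}r", w)]
print(in_range)
```

Only the six strings with one or two middle letters should survive the
final filter.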

‘\( … \)’

is the grouping construct that serves three purposes:

1. To enclose a set of ‘\|’ alternatives for other operations. Thus,
   ‘\(foo\|bar\)x’ matches either ‘foox’ or ‘barx’.

2. To enclose a complicated expression for the postfix operators ‘*’,
   ‘+’ and ‘?’ to operate on. Thus, ‘ba\(na\)*’ matches ‘bananana’,
   etc., with any number of (zero or more) ‘na’ strings.

3. To record a matched substring for future reference with ‘\/digit/’
   (described below).

This last application is not a consequence of the idea of a 
parenthetical grouping; it is a separate feature that is assigned as a 
second meaning to the same ‘\( … \)’ construct. In practice there is 
usually no conflict between the two meanings; when there is a conflict, 
a “shy” group (described below) can be used.
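
The three purposes of the grouping construct can be seen side by side in
a short sketch. It is written in Python's re syntax, where ‘\( … \)’ is
spelled ‘( … )’ and ‘\|’ is spelled ‘|’; that correspondence is an
assumption of the example, not a statement about Emacs internals.

```python
import re

# Python writes \( ... \) as ( ... ) and \| as |; assumed equivalent here.

# Purpose 1: the group limits the reach of the alternation.
grouped = re.compile(r"(foo|bar)x")
ungrouped = re.compile(r"foo|barx")  # alternation spans the whole pattern
print(bool(grouped.fullmatch("foox")), bool(ungrouped.fullmatch("foox")))

# Purpose 2: a postfix operator applies to the whole group.
repeated = re.compile(r"ba(na)*")
print(bool(repeated.fullmatch("bananana")), bool(repeated.fullmatch("ba")))

# Purpose 3: the matched substring is recorded for later reference.
m = grouped.fullmatch("barx")
print(m.group(1))
```

Without the group, the alternation splits the pattern into ‘foo’ and
‘barx’, so ‘foox’ is no longer matched in full.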

‘\(?: … \)’

is the “shy” group construct. A shy group serves the first two purposes
of an ordinary group (controlling the nesting of other operators), but
it does not record the matched substring; it can’t be referred back to
with the ‘\/digit/’ construct (see below). This is useful in
mechanically combining regular expressions, so that groups can be added
for syntactic purposes without interfering with the numbering of the
groups that are meant to be referred to.
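
The effect of a shy group on group numbering can be illustrated as
follows. The sketch uses Python's re syntax, in which the shy group is
written ‘(?: … )’; the assumption is that it numbers groups the same way
as the construct described above.

```python
import re

# Python writes the shy group \(?: ... \) as (?: ... ); assumed equivalent.
plain = re.compile(r"(foo|bar)(x+)")
shy = re.compile(r"(?:foo|bar)(x+)")

# The ordinary group takes number 1, pushing (x+) to number 2.
print(plain.groups, plain.fullmatch("fooxx").group(2))

# The shy group is not numbered, so (x+) becomes group 1.
print(shy.groups, shy.fullmatch("fooxx").group(1))
```

Both patterns accept the same strings; only the numbering of the
recorded groups differs.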

‘\(?/digit/: … \)’

is the explicitly numbered group construct. Normal groups get their
number implicitly, based on their position, which can be inconvenient.
This construct allows a specific group number (limited to the digits 1
through 9, see: ‘\/digit/’ construct) to be assigned to the group
construct. There is no particular restriction on the numbering, e.g.,
several groups can have the same number in which case the last one to
match (i.e., the rightmost match) will be recorded. Implicitly numbered
groups always get the smallest integer larger than the largest one of
any previous group.

‘\/digit/’

is the back reference operator. It matches the same text that matched
the /digit/th occurrence of a ‘\( … \)’ construct.

After the end of a ‘\( … \)’ construct, the matcher remembers the
beginning and end of the text matched by that construct. Later in the
regular expression, ‘\’ followed by the /digit/ can be used to match the
same text matched by the /digit/th ‘\( … \)’ construct.

The strings matching the first nine ‘\( … \)’ constructs appearing in a 
regular expression are assigned numbers 1 through 9 in the order that 
the open-parentheses appear in the regular expression. So ‘\1’ through 
‘\9’ can be used to refer to the text matched by the corresponding ‘\( … 
\)’ constructs.

For example, ‘\(.*\)\1’ matches any newline-free string that is composed 
of two identical halves. The ‘\(.*\)’ matches the first half, which may 
be anything, but the ‘\1’ that follows must match the same exact text.

If a ‘\( … \)’ construct matches more than once (which can easily happen 
if it is followed by ‘*’), only the last match is recorded.

If a particular grouping construct in the regular expression was never 
matched—for instance, if it appears inside of an alternative that wasn’t 
used, or inside of a repetition that repeated zero times—then the 
corresponding ‘\digit’ construct never matches anything. For example, 
the regexp ‘\(foo\(b*\)\|lose\)\2’ cannot match ‘lose’ because the 
second alternative inside the larger group matches it, which results in 
‘\2’ being undefined and unable to match anything. It can match ‘foobb’, 
because the first alternative matches ‘foob’ and ‘\2’ matches the second 
‘b’.
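
Both back reference examples above can be tried out in a short sketch.
It is written in Python's re syntax, where groups are spelled ‘( … )’
and ‘\1’ is spelled the same; the assumption is that Python handles a
reference to an unmatched group the way the text describes.

```python
import re

# Python writes \( ... \) as ( ... ); the back reference \1 is unchanged.
halves = re.compile(r"(.*)\1")
same = halves.fullmatch("abcabc")      # two identical halves
different = halves.fullmatch("abcabd")  # halves differ
print(same.group(1), different)

# Group 2 sits inside the first alternative, so it is unset for "lose".
nested = re.compile(r"(foo(b*)|lose)\2")
lose = nested.fullmatch("lose")
foobb = nested.fullmatch("foobb")
print(lose, foobb.group(1), foobb.group(2))
```

For ‘foobb’, the first group records ‘foob’ and the second records a
single ‘b’, which ‘\2’ then matches again.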

The following operators pertaining to words and syntax are controlled by
the setting of the syntax table (/See:/ _Table of Syntax Classes_).

‘\w’

is the word-constituent operator, it matches any word-constituent
character. The syntax table determines which characters these
are. (/See:/ _Table of Syntax Classes_)

‘\W’

is the non-word-constituent operator, it matches any character that is
not a word-constituent. (/See:/ _Table of Syntax Classes_)

‘\s/code/’

is the syntax class operator, it matches any character whose syntax is
/code/. Here /code/ is a character that designates a particular syntax
class: thus, ‘w’ for word constituent, ‘-’ or ‘ ’ for whitespace, ‘.’ for
ordinary punctuation, etc. (/See:/ _Table of Syntax Classes_)

‘\S/code/’

is the non syntax class operator, it matches any character whose syntax
is not /code/. (/See:/ _Table of Syntax Classes_)

‘\c/code/’

is the character category operator, it matches any character that
belongs to the category /code/. For example, ‘\cc’ matches Chinese
characters, ‘\cg’ matches Greek characters, etc. For the description of
the known categories, type ‘M-x describe-categories <RET>’. (/See
also:/ _Category Characters_)

‘\C/code/’

is the non character category operator, it matches any character that
does _not_ belong to category /code/. (/See:/ _Category Characters_)

The following regular expression constructs match the empty string—that 
is, they don't consume any characters—but whether they match depends on 
the context. For all, the beginning and end of the accessible portion of 
the buffer are treated as if they were the actual beginning and end of 
the buffer.

‘\`’

is the beginning of string operator, it matches the empty string, but 
only at the beginning of the string or buffer (or its accessible 
portion) being matched against.

‘\'’

is the end of string operator, it matches the empty string, but only at 
the end of the string or buffer (or its accessible portion) being 
matched against.

‘\=’

is the at point operator, it matches the empty string, but only at point.

‘\b’

is the beginning or end of word operator, it matches the empty string, 
but only at the beginning or end of a word. Thus, ‘\bfoo\b’ matches any 
occurrence of ‘foo’ as a separate word. ‘\bballs?\b’ matches ‘ball’ or 
‘balls’ as a separate word.

‘\b’ matches at the beginning or end of the buffer regardless of what 
text appears next to it.
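
The ‘\bballs?\b’ example can be tried on a sample line of text. In this
sketch Python's re module stands in for the Emacs matcher; ‘\b’ is
spelled the same there, but word characters follow Python's own
definition rather than an Emacs syntax table, which is an assumption
worth keeping in mind.

```python
import re

# \b is spelled the same in Python; what counts as a word character is
# Python's definition, not an Emacs syntax table (assumption of this sketch).
word = re.compile(r"\bballs?\b")

text = "ball balls football ballroom"
found = word.findall(text)
print(found)

# \b also matches at the very start and end of the text being searched.
whole = re.fullmatch(r"\bfoo\b", "foo")
print(bool(whole))
```

The occurrences inside ‘football’ and ‘ballroom’ are skipped because a
word character sits directly next to ‘ball’ on one side.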

‘\B’

is the middle of word operator, it matches the empty string, but _not_ at
the beginning or end of a word.

‘\<’

is the beginning of word operator, it matches the empty string, but only 
at the beginning of a word; furthermore, ‘\<’ matches at the beginning 
of the buffer only if a word-constituent character follows.

‘\>’

is the end of word operator, it matches the empty string, but only at 
the end of a word; furthermore, ‘\>’ matches at the end of the buffer 
only if the contents end with a word-constituent character.

‘\_<’

is the beginning of symbol operator, it matches the empty string, but 
only at the beginning of a symbol. A symbol is a sequence of one or more 
symbol-constituent characters. A symbol-constituent character is a 
character whose syntax is either ‘w’ or ‘_’. It matches at the beginning 
of the buffer only if a symbol-constituent character immediately follows 
the beginning of the buffer. As with words, the syntax table determines 
which characters are symbol-constituent.

‘\_>’

is the end of symbol operator, it matches the empty string, but only at 
the end of a symbol. It matches at the end of the buffer only if a 
symbol-constituent character immediately precedes the end of the buffer.

Not every string is a valid regular expression. For example, a string 
that ends inside a set of alternative characters without a terminating 
‘]’ is invalid, and so is a string that ends with a single ‘\’. If an 
invalid regular expression is passed to any of the search functions, an 
invalid-regexp error is signaled.


J C Bingham
   - Georgetown, TX USA -
___________________________





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41970; Package emacs. (Sat, 20 Jun 2020 21:52:01 GMT)

Message #8 received at 41970 <at> debbugs.gnu.org:

From: Drew Adams <drew.adams <at> oracle.com>
To: Jay Bingham <binghamjc <at> msn.com>, 41970 <at> debbugs.gnu.org
Subject: RE: bug#41970: Suggestions for corrections to Emacs and Elisp manuals
Date: Sat, 20 Jun 2020 21:50:57 +0000 (UTC)
> The terminology used in section 15.6 Syntax of Regular Expressions to describe and discuss the ‘[ ... ]’ and ‘[^ ... ]’ constructs is inconsistent. The first paragraph and the final paragraph in the section both refer to these constructs as "a character alternative", while the paragraphs describing them call them a “character set”. In section 34.3.1.1 of the Elisp manual the phrase used consistently to describe them and refer to them is "a character alternative".

> It would increase the consistency of both manuals to use the same terminology to describe and refer to these constructs. A more grammatically correct phrase to describe these features would be "a set of alternative characters" (but when have programming nerds ever been that concerned with grammatical correctness).

A nit:

These references refer to the syntax construct [...], and not to the set of chars that it represents.  It is wrong to call this construct "a character set", and it would be wrong to call it "a set of alternative characters".  What it _matches_, or represents, is any _one_ char of a set of alternative chars.  But the syntax construct is not a set of chars.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41970; Package emacs. (Mon, 09 May 2022 11:40:02 GMT)

Message #11 received at 41970 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Jay Bingham <binghamjc <at> msn.com>
Cc: 41970 <at> debbugs.gnu.org
Subject: Re: bug#41970: Suggestions for corrections to Emacs and Elisp manuals
Date: Mon, 09 May 2022 13:39:45 +0200
Jay Bingham <binghamjc <at> msn.com> writes:

> Here are differences that I have noticed, which I believe should not
> be differences.

(I'm going through old bug reports that unfortunately weren't resolved
at the time.)

Thanks for the suggested improvements -- I've now adjusted these
sections in the manuals for Emacs 29 (where I agreed with the
suggestions).

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




bug marked as fixed in version 29.1, send any further explanations to 41970 <at> debbugs.gnu.org and Jay Bingham <binghamjc <at> msn.com> Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Mon, 09 May 2022 11:41:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 07 Jun 2022 11:24:08 GMT)

This bug report was last modified 1 year and 322 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.