GNU bug report logs - #42602
Wrong (not-)casechars value for "polish" in ispell-dictionary-base-alist


Package: emacs

Reported by: Sebastian Urban <mrsebastianurban <at> gmail.com>

Date: Wed, 29 Jul 2020 16:13:01 UTC

Severity: normal

Done: Stefan Kangas <stefan <at> marxist.se>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 42602 in the body.
You can then email your comments to 42602 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#42602; Package emacs. (Wed, 29 Jul 2020 16:13:01 GMT)

Acknowledgement sent to Sebastian Urban <mrsebastianurban <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Wed, 29 Jul 2020 16:13:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Sebastian Urban <mrsebastianurban <at> gmail.com>
To: Bug GNU Emacs <bug-gnu-emacs <at> gnu.org>
Subject: Wrong (not-)casechars value for "polish" in
 ispell-dictionary-base-alist
Date: Wed, 29 Jul 2020 18:12:02 +0200
Hello,

for words like:
   męski
   miód
   klątwa
   ślad
   łuk
   żaba
   źrebak
   grzać
   bańka
ispell.el sends to Aspell only part of the word, e.g. "lad" instead of
"ślad", or "kl"/"twa" (depending on the cursor position) instead of
"klątwa".

I think this is because of the wrong value of (NOT-)CASECHARS, which is
ASCII A-z letters and a few chars, of which only ó/Ó is valid for Polish.

Although, for some reason, it doesn't recognize "ó" in word "miód",
sending "mi" or "d". It is on the list of CASECHARS under \363, so it
should work.  Moreover, if I type in regexp-builder "[\363\323]" it
won't recognize ó/Ó, but it doesn't have a problem with other Polish
chars, like "ł" ("[\502]") or "ż" ("[\574]").

If I put in my init.el:
--8<---------------cut here---------------start------------->8---
(setq ispell-program-name "C:/cygwin64/bin/aspell")
(add-hook 'ispell-initialize-spellchecker-hook
          (lambda ()
            (add-to-list 'ispell-local-dictionary-alist
                         '("pl"
                           ;; "[[:alpha:]]"
                           ;; "[^[:alpha:]]"
                           ;; ęóąśłżźćńĘÓĄŚŁŻŹĆŃ
"[A-Za-z\431\363\405\533\502\574\572\407\504\430\323\404\532\501\573\571\406\503]"
"[^A-Za-z\431\363\405\533\502\574\572\407\504\430\323\404\532\501\573\571\406\503]"
                           "[.]" nil nil nil iso-8859-2))))
(setq ispell-dictionary "pl")
--8<---------------cut here---------------end--------------->8---

everything seems to work, even ó/Ó are recognised. "[[:alpha:]]" works
as well, so I left it as an alternative. Changing from iso-8859-2 to
utf-8 doesn't break anything.

Tested on:
- GNU Emacs 26.3 (build 1, x86_64-w64-mingw32) of 2019-08-29,
- GNU Emacs 28.0.50 (build 1, x86_64-w64-mingw32) of 2020-07-05,
with Aspell from Cygwin installation.


S. U.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#42602; Package emacs. (Wed, 29 Jul 2020 18:44:02 GMT)

Message #8 received at 42602 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Sebastian Urban <mrsebastianurban <at> gmail.com>
Cc: 42602 <at> debbugs.gnu.org
Subject: Re: bug#42602: Wrong (not-)casechars value for "polish" in
 ispell-dictionary-base-alist
Date: Wed, 29 Jul 2020 21:43:22 +0300
> From: Sebastian Urban <mrsebastianurban <at> gmail.com>
> Date: Wed, 29 Jul 2020 18:12:02 +0200
> 
> for words like:
>     męski
>     miód
>     klątwa
>     ślad
>     łuk
>     żaba
>     źrebak
>     grzać
>     bańka
> ispell.el sends to Aspell only part of the word, e.g. "lad" instead of
> "ślad", or "kl"/"twa" (depending on the cursor position) instead of
> "klątwa".
> 
> I think this is because wrong value of (NOT-)CASECHARS, which is ASCII
> A-z letters and a few chars of which only ó/Ó is valid for Polish.
> 
> Although, for some reason, it doesn't recognize "ó" in word "miód",
> sending "mi" or "d". It is on the list of CASECHARS under \363, so it
> should work.  Moreover, if I type in regexp-builder "[\363\323]" it
> won't recognize ó/Ó, but it doesn't have a problem with other Polish
> chars, like "ł" ("[\502]") or "ż" ("[\574]").
> 
> If I put in my init.el:
> --8<---------------cut here---------------start------------->8---
> (setq ispell-program-name "C:/cygwin64/bin/aspell")
> (add-hook 'ispell-initialize-spellchecker-hook
>            (lambda ()
>            (add-to-list 'ispell-local-dictionary-alist
>                         '("pl"
>                           ;; "[[:alpha:]]"
>                           ;; "[^[:alpha:]]"
>                           ;; ęóąśłżźćńĘÓĄŚŁŻŹĆŃ
> "[A-Za-z\431\363\405\533\502\574\572\407\504\430\323\404\532\501\573\571\406\503]"
> "[^A-Za-z\431\363\405\533\502\574\572\407\504\430\323\404\532\501\573\571\406\503]"
>                           "[.]" nil nil nil iso-8859-2))))
> (setq ispell-dictionary "pl")
> --8<---------------cut here---------------start------------->8---
> 
> everything seems to work, even ó/Ó are recognised.

I don't understand this change.  Values above octal 377 cannot be
right in the above regexps, because they are supposed to be in Latin-2
encoding, which is a single-byte encoding, and so can only handle
values below octal 400.  How did you come up with those values?

Anyway, I'm quite sure some other factor is at work here.

> Tested on:
> - GNU Emacs 26.3 (build 1, x86_64-w64-mingw32) of 2019-08-29,
> - GNU Emacs 28.0.50 (build 1, x86_64-w64-mingw32) of 2020-07-05,
> with Aspell from Cygwin installation.

Your Emacs is a native MinGW build, whereas Aspell seems to be a
Cygwin build?  If so, you could have incompatibility in character
encoding.  What is your Windows locale?  And what does

  M-: (getenv "LANG") RET

yield inside Emacs?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#42602; Package emacs. (Thu, 30 Jul 2020 11:41:01 GMT)

Message #11 received at 42602 <at> debbugs.gnu.org:

From: Sebastian Urban <mrsebastianurban <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 42602 <at> debbugs.gnu.org
Subject: Re: bug#42602: Wrong (not-)casechars value for "polish" in
 ispell-dictionary-base-alist
Date: Thu, 30 Jul 2020 13:39:55 +0200
> I don't understand this change.  Values above octal 377 cannot be
> right in the above regexps, because they are supposed to be in
> Latin-2 encoding, which is a single-byte encoding, and so can only
> handle values below octal 400.  How did you come up with those
> values?

Basically, C-x = on a char, which gave me octal values.  I thought it
was recognising only A-z + ó/Ó and some other chars that I'm not
interested in, so I swapped those values for the ones corresponding to
the Polish chars.  That's the whole story.

> Anyway, I'm quite sure some other factor is at work here.

Well, I did some tests, e.g. switched back to the original value of
"polish" in my "pl" dictionary, and... it works.  And if I change from
iso-8859-2 to utf-8 in my "pl" (with original value from "polish") it
doesn't work.  So, as you later wrote - wrong character encoding,
I guess.

Looking for a cause (in default settings), I think I found it in
ispell-dictionary-base-alist and ispell-dictionary-alist.  During
"transfer" from *-base-* to ispell-dictionary-alist, the value of
CHARACTER-SET is changed in all cases from iso-* or cp1255 to utf-8,
then ispell uses these (from ispell-dictionary-alist) when it "talks"
with Aspell.
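
The two values can be compared from Lisp.  A minimal sketch, assuming
the entry layout documented in ispell.el, where CHARACTER-SET is the
eighth element of each entry; if the spell checker reports no "polish"
dictionary, the second value is simply nil:

```elisp
(require 'ispell)
;; Let ispell.el (re)build `ispell-dictionary-alist' for the
;; currently configured spell-checker program.
(ispell-set-spellchecker-params)
;; CHARACTER-SET of "polish" before and after the "transfer".
(list (nth 7 (assoc "polish" ispell-dictionary-base-alist))
      (nth 7 (assoc "polish" ispell-dictionary-alist)))
```

Evaluating this with M-: or in *scratch* shows both coding systems
side by side; the result depends on the installed spell checker.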

On the other hand, if I use Emacs 26.3 from Cygwin, everything works
out of the box, I don't even have to set "polish" as default
dictionary. But there, in Cygwin command line, "env | grep LANG" gives
"LANG=pl_PL.UTF-8".

> Your Emacs is a native MinGW build, whereas Aspell seems to be
> a Cygwin build?

Both Emacses are official Win builds, and Aspell is installed through
Cygwin.

> If so, you could have incompatibility in character encoding.  What
> is your Windows locale?

"Polish" everywhere in "Control Panel" -> "Regional and Language".

> And what does M-: (getenv "LANG") RET yield inside Emacs?

"PLK"


S. U.

P.S.
> Moreover, if I type in regexp-builder "[\363\323]" it won't
> recognize ó/Ó, but it doesn't have a problem with other Polish
> chars, like "ł" ("[\502]") or "ż" ("[\574]").

In the "Character List" buffer for unicode-bmp, regexp-builder
(numbers are octal values):
- 0-177 and 400-777 - highlights chars
- 240-377 - doesn't highlight chars (it highlights them if I use hex
  value, or insert them directly)
I didn't check "80h-9Fh" chars.  Chars like C-a were checked by
inserting them with quoted-insert in another buffer.
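
A possible explanation for this pattern, offered as an assumption and
not confirmed anywhere in this thread: per the Emacs Lisp manual, a
string literal whose only non-ASCII content is octal or hex escapes
below 400 (octal) is read as a unibyte string of raw bytes, whereas an
escape above 377 makes the string multibyte.  A small sketch to check
how such literals are read:

```elisp
;; Only escapes below 400 (octal): read as a unibyte string.
(multibyte-string-p "[\363\323]")   ; => nil
;; An escape above 377 (octal) forces a multibyte string.
(multibyte-string-p "[\502]")       ; => t
;; Matching against literal Polish words can then be tried directly.
(list (string-match-p "[\363\323]" "miód")
      (string-match-p "[\502]" "łuk"))
```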





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#42602; Package emacs. (Thu, 30 Jul 2020 13:27:02 GMT)

Message #14 received at 42602 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Sebastian Urban <mrsebastianurban <at> gmail.com>
Cc: 42602 <at> debbugs.gnu.org
Subject: Re: bug#42602: Wrong (not-)casechars value for "polish" in
 ispell-dictionary-base-alist
Date: Thu, 30 Jul 2020 16:26:07 +0300
> From: Sebastian Urban <mrsebastianurban <at> gmail.com>
> Cc: 42602 <at> debbugs.gnu.org
> Date: Thu, 30 Jul 2020 13:39:55 +0200
> 
> > I don't understand this change.  Values above octal 377 cannot be
> > right in the above regexps, because they are supposed to be in
> > Latin-2 encoding, which is a single-byte encoding, and so can only
> > handle values below octal 400.  How did you come up with those
> > values?
> 
> Basically, C-x = on a char, which gave me octal values.

This gives you the Unicode codepoint, not its Latin-2 encoding.  They
are different.  The database in ispell.el uses Latin-2 encodings of
Polish characters.
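
For illustration, the difference can be checked directly in Emacs.  A
minimal sketch, using ś only as an example character and assuming the
stock iso-8859-2 coding system:

```elisp
;; Unicode code point of ś in octal, as reported by C-x =.
(format "%o" ?ś)                       ; => "533"

;; The same character encoded in Latin-2 is a single byte.
(let ((bytes (encode-coding-string "ś" 'iso-8859-2)))
  (list (length bytes)                 ; 1
        (format "%o" (aref bytes 0)))) ; "266"
```

The value 533 is the one that ended up in the regexp above, while 266
is what a Latin-2 regexp would need for the same letter.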

> Well, I did some tests, e.g. switched back to the original value of
> "polish" in my "pl" dictionary, and... it works.  And if I change from
> iso-8859-2 to utf-8 in my "pl" (with original value from "polish") it
> doesn't work.  So, as you later wrote - wrong character encoding,
> I guess.
> 
> Looking for a cause (in default settings), I think I found it in
> ispell-dictionary-base-alist and ispell-dictionary-alist.  During
> "transfer" from *-base-* to ispell-dictionary-alist, the value of
> CHARACTER-SET is changed in all cases from iso-* or cp1255 to utf-8,
> then ispell uses these (from ispell-dictionary-alist) when it "talks"
> with Aspell.
> 
> On the other hand, if I use Emacs 26.3 from Cygwin, everything works
> out of the box, I don't even have to set "polish" as default
> dictionary. But there, in Cygwin command line, "env | grep LANG" gives
> "LANG=pl_PL.UTF-8".

Native MinGW builds cannot use the UTF-8 encoding.

So, do we have a problem to solve, or can this issue be closed?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#42602; Package emacs. (Fri, 31 Jul 2020 10:53:01 GMT)

Message #17 received at 42602 <at> debbugs.gnu.org:

From: Sebastian Urban <mrsebastianurban <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 42602 <at> debbugs.gnu.org
Subject: Re: bug#42602: Wrong (not-)casechars value for "polish" in
 ispell-dictionary-base-alist
Date: Fri, 31 Jul 2020 12:52:47 +0200
>>> I don't understand this change.  Values above octal 377 cannot be
>>> right in the above regexps, because they are supposed to be in
>>> Latin-2 encoding, which is a single-byte encoding, and so can only
>>> handle values below octal 400.  How did you come up with those
>>> values?
>>
>> Basically, C-x = on a char, which gave me octal values.
>
> This gives you the Unicode codepoint, not its Latin-2 encoding.
> They are different.

So, it would work even if I added "\999999999", because Emacs would
not recognize it and would simply ignore it, which means the only
reason it worked was the explicitly set encoding (iso-8859-2)?

> The database in ispell.el uses Latin-2 encodings of Polish
> characters.

As a base, but before ispell.el sends the string to Aspell it
translates it to utf-8, right?  Because that's the only difference
between my custom "pl" dictionary and the value of "polish" in
ispell-dictionary-alist.

> Native MinGW builds cannot use the UTF-8 encoding.

So, with my setup (not saying that it's the best one, it's just
current one, if there is a better one I can change), for Polish lang,
I have to define local dictionary with iso-8859-2 coding?
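
A minimal sketch of what such a local dictionary could look like,
reusing the "[[:alpha:]]" variant that worked in my first message.
The name "pl" is my own choice and has to match a dictionary that the
installed Aspell provides; whether this is the recommended setup is an
open question here:

```elisp
;; Register a local "pl" entry once ispell.el has initialized itself,
;; keeping iso-8859-2 as the CHARACTER-SET for talking to Aspell.
(add-hook 'ispell-initialize-spellchecker-hook
          (lambda ()
            (add-to-list 'ispell-local-dictionary-alist
                         '("pl" "[[:alpha:]]" "[^[:alpha:]]"
                           "[.]" nil nil nil iso-8859-2))))
(setq ispell-dictionary "pl")
```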

> So, do we have a problem to solve, or can this issue be closed?

If it's a problem of MinGW, and my setup, then I guess it's not an
Emacs problem, so yes, it can be closed.


S. U.




Reply sent to Stefan Kangas <stefan <at> marxist.se>:
You have taken responsibility. (Thu, 13 Aug 2020 00:08:02 GMT)

Notification sent to Sebastian Urban <mrsebastianurban <at> gmail.com>:
bug acknowledged by developer. (Thu, 13 Aug 2020 00:08:02 GMT)

Message #22 received at 42602-done <at> debbugs.gnu.org:

From: Stefan Kangas <stefan <at> marxist.se>
To: Sebastian Urban <mrsebastianurban <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 42602-done <at> debbugs.gnu.org
Subject: Re: bug#42602: Wrong (not-)casechars value for "polish" in
 ispell-dictionary-base-alist
Date: Wed, 12 Aug 2020 17:07:50 -0700
Sebastian Urban <mrsebastianurban <at> gmail.com> writes:

>> So, do we have a problem to solve, or can this issue be closed?
>
> If it's a problem of MinGW, and my setup, then I guess it's not an
> Emacs problem, so yes, it can be closed.

I'm therefore closing this bug report.

Best regards,
Stefan Kangas




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 10 Sep 2020 11:24:04 GMT)

This bug report was last modified 3 years and 200 days ago.


