GNU bug report logs - #7668
ispell and dictionary encodings

Please note: This is a static page, with minimal formatting, updated once a day.

Package: emacs; Reported by: Reuben Thomas <rrt@HIDDEN>; dated Fri, 17 Dec 2010 18:25:01 UTC; Maintainer for emacs is bug-gnu-emacs@HIDDEN.

Message received at 7668 <at> debbugs.gnu.org:


In-Reply-To: <20101221113008.GB3440@HIDDEN>
References: <AANLkTin26ZJupakNsWxgteT9A4TOGCZQtAc=H6OTGG7-@mail.gmail.com>
	<20101220113148.GA12469@HIDDEN>
	<AANLkTi=gk2W44z9ghqi72Ls5Zi9-hJr5jRwQrHKUvgD5@HIDDEN>
	<AANLkTik-dEoz+HMsNOnkGcESqgRMsFz+EtXUEEMtiWKq@HIDDEN>
	<20101221113008.GB3440@HIDDEN>
Date: Tue, 21 Dec 2010 23:11:34 +0000
Message-ID: <AANLkTi=hSdBXxK0hZK_FCQCj7h=6a=cCPd9gohJwUnWU@HIDDEN>
Subject: Re: bug#7668: ispell and dictionary encodings
From: Reuben Thomas <rrt@HIDDEN>
To: Agustin Martin <agustin.martin@HIDDEN>
Content-Type: text/plain; charset=ISO-8859-1
X-Spam-Score: -6.1 (------)
X-Debbugs-Envelope-To: 7668
Cc: 7668 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.11
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>,
	<mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/pipermail/debbugs-submit>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>,
	<mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org

On 21 December 2010 11:30, Agustin Martin <agustin.martin@HIDDEN> wrote:
>
> Or, more generally, for versions of the spellcheckers that do not properly
> support different encodings, such as old aspells and hunspells; there are
> still some of them around.

Presumably for the Debian version of ispell.el you don't have to cater
to old versions?

> Putting those fancy quotes in the 'otherchars' section of the dictionary
> definition for ispell.el should make ispell.el consider them part of the word,

Well, it definitely doesn't work with aspell. It seems to work fine
with hunspell (for whatever reason).

Thanks again; if I can be of any assistance don't hesitate to ask.

One other possible confusion: I was reporting a bug about FSF Emacs
(i.e. upstream), and because of session support bug fixes I am currently
using the tip of the emacs-23 source branch, not the version shipped with
Ubuntu maverick (the distro I'm using).

-- 
http://rrt.sc3d.org




Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs@HIDDEN:
bug#7668; Package emacs. Full text available.

Message received at 7668 <at> debbugs.gnu.org:


Date: Tue, 21 Dec 2010 12:30:08 +0100
From: Agustin Martin <agustin.martin@HIDDEN>
To: 7668 <at> debbugs.gnu.org, Reuben Thomas <rrt@HIDDEN>
Subject: Re: bug#7668: ispell and dictionary encodings
Message-ID: <20101221113008.GB3440@HIDDEN>
References: <AANLkTin26ZJupakNsWxgteT9A4TOGCZQtAc=H6OTGG7-@mail.gmail.com>
	<20101220113148.GA12469@HIDDEN>
	<AANLkTi=gk2W44z9ghqi72Ls5Zi9-hJr5jRwQrHKUvgD5@HIDDEN>
	<AANLkTik-dEoz+HMsNOnkGcESqgRMsFz+EtXUEEMtiWKq@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <AANLkTik-dEoz+HMsNOnkGcESqgRMsFz+EtXUEEMtiWKq@HIDDEN>
User-Agent: Mutt/1.5.20 (2009-06-14)

On Mon, Dec 20, 2010 at 03:40:18PM +0000, Reuben Thomas wrote:
> On 20 December 2010 11:31, Agustin Martin <agustin.martin@HIDDEN> wrote:
> 
> [a very helpful reply; thanks]
> 
> > On Fri, Dec 17, 2010 at 06:30:14PM +0000, Reuben Thomas wrote:
> > If you are not going to use XEmacs, but only FSF Emacs, just use [:alpha:]
> > for the case-character and non-case-character strings along with utf-8. That
> is already done automatically for aspell dictionaries, where it is easy to get
> > a list of installed dictionaries and additional info.
> 
> So, the built-in entries of ispell-dictionary-base-alist are
> specifically for ispell? 

Or, more generally, for versions of the spellcheckers that do not properly
support different encodings, such as old aspells and hunspells; there are
still some of them around.

> In that case, it seems a bit odd that they
> are used for hunspell, but perhaps the problem is that you can't get
> hunspell to give you that information about its dictionaries? 

That is indeed part of the problem. Otherwise something like
(ispell-aspell-find-dictionaries) and friends could be used. 'hunspell -D'
does not provide all the info, and does not return control until ^C. 

> But is
> there in any case a reason not to default to using [:alpha:] for
> case-chars and ^[:alpha:] for non-case-chars with hunspell?

Besides old aspells and hunspells, I am trying to improve XEmacs
compatibility for ispell.el and flyspell.el. I keep patched versions for
Debian, so all Emacs flavours use the same ispell.el and flyspell.el. In their
current incarnation, the Debian patched files even support Emacs >= 21.3.
I am currently removing all that compatibility, leaving only Emacs23 and
XEmacs, and would like to keep FSF Emacs ispell.el and flyspell.el
reasonably close to those I use, so I need fewer changes. And XEmacs does not
support [:alpha:].

An intermediate possibility could be to use a hunspell-specific default
dictionary list built on the fly from base-alist, with the encoding set to
utf8 and case/not-case changed to [:alpha:], for FSF Emacs and recent enough
hunspell. Since this would only be done the first time ispell.el invokes
hunspell spellchecking, it seems reasonable. But I have to think about this.
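
To make the idea concrete, here is a rough sketch of such an on-the-fly
transformation. It is only an illustration under assumptions: the helper name
`my-utf8-dictionary-alist' is hypothetical and not part of ispell.el, the
sample entry uses illustrative values, and the sketch ignores any re-encoding
that non-ASCII 'otherchars' strings might need.

```elisp
;; Sketch only: `my-utf8-dictionary-alist' is a hypothetical helper, not part
;; of ispell.el.  Each entry is assumed to have the 8-element layout
;; (NAME CASECHARS NOT-CASECHARS OTHERCHARS MANY-OTHERCHARS-P ARGS
;;  EXTENDED-CHARACTER-MODE CODING-SYSTEM).
(defun my-utf8-dictionary-alist (alist)
  "Return a copy of ALIST using [:alpha:] character classes and utf-8."
  (mapcar (lambda (entry)
            (list (nth 0 entry)        ; dictionary name, unchanged
                  "[[:alpha:]]"        ; casechars
                  "[^[:alpha:]]"       ; not-casechars
                  (nth 3 entry)        ; otherchars, unchanged
                  (nth 4 entry)        ; many-otherchars-p, unchanged
                  (nth 5 entry)        ; spellchecker arguments, unchanged
                  (nth 6 entry)        ; extended character mode, unchanged
                  'utf-8))             ; coding system used with the process
          alist))

;; Example input modelled on a Latin-1 style entry (illustrative values).
(my-utf8-dictionary-alist
 '(("castellano" "[A-Za-z]" "[^A-Za-z]" "[']" nil ("-B") nil iso-8859-1)))
```

The input list is left untouched, so the original base-alist entries would
remain available for spellcheckers that cannot handle utf-8.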

> In case I'm getting too confused, I'll just restate the basic
> objective I have: I want to be able to spell-check (in my case,
> British, but I don't think it matters for this purpose) English with
> a) accents and b) fancy quotes. In these days of utf-8 being widely
> used for English, it seems it should be possible to do at least b) out
> of the box, which currently it isn't, as far as I can see.

Putting those fancy quotes in the 'otherchars' section of the dictionary
definition for ispell.el should make ispell.el consider them part of the
word, but IIRC will not affect hunspell unless they are defined in the TRY
section of the .aff file.

-- 
Agustin





Message received at 7668 <at> debbugs.gnu.org:


In-Reply-To: <AANLkTi=gk2W44z9ghqi72Ls5Zi9-hJr5jRwQrHKUvgD5@HIDDEN>
References: <AANLkTin26ZJupakNsWxgteT9A4TOGCZQtAc=H6OTGG7-@mail.gmail.com>
	<20101220113148.GA12469@HIDDEN>
	<AANLkTi=gk2W44z9ghqi72Ls5Zi9-hJr5jRwQrHKUvgD5@HIDDEN>
Date: Mon, 20 Dec 2010 15:40:18 +0000
Message-ID: <AANLkTik-dEoz+HMsNOnkGcESqgRMsFz+EtXUEEMtiWKq@HIDDEN>
Subject: bug#7668: ispell and dictionary encodings
From: Reuben Thomas <rrt@HIDDEN>
To: 7668 <at> debbugs.gnu.org
Content-Type: text/plain; charset=ISO-8859-1

On 20 December 2010 11:31, Agustin Martin <agustin.martin@HIDDEN> wrote:

[a very helpful reply; thanks]

> On Fri, Dec 17, 2010 at 06:30:14PM +0000, Reuben Thomas wrote:
> If you are not going to use XEmacs, but only FSF Emacs, just use [:alpha:]
> for the case-character and non-case-character strings along with utf-8. That
> is already done automatically for aspell dictionaries, where it is easy to get
> a list of installed dictionaries and additional info.

So, the built-in entries of ispell-dictionary-base-alist are
specifically for ispell? In that case, it seems a bit odd that they
are used for hunspell, but perhaps the problem is that you can't get
hunspell to give you that information about its dictionaries? But is
there in any case a reason not to default to using [:alpha:] for
case-chars and ^[:alpha:] for non-case-chars with hunspell?

In case I'm getting too confused, I'll just restate the basic
objective I have: I want to be able to spell-check (in my case,
British, but I don't think it matters for this purpose) English with
a) accents and b) fancy quotes. In these days of utf-8 being widely
used for English, it seems it should be possible to do at least b) out
of the box, which currently it isn't, as far as I can see.

--
http://rrt.sc3d.org





Message received at 7668 <at> debbugs.gnu.org:


Date: Mon, 20 Dec 2010 12:31:48 +0100
From: Agustin Martin <agustin.martin@HIDDEN>
To: Reuben Thomas <rrt@HIDDEN>, 7668 <at> debbugs.gnu.org
Subject: Re: bug#7668: ispell and dictionary encodings
Message-ID: <20101220113148.GA12469@HIDDEN>
References: <AANLkTin26ZJupakNsWxgteT9A4TOGCZQtAc=H6OTGG7-@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <AANLkTin26ZJupakNsWxgteT9A4TOGCZQtAc=H6OTGG7-@mail.gmail.com>
User-Agent: Mutt/1.5.20 (2009-06-14)

On Fri, Dec 17, 2010 at 06:30:14PM +0000, Reuben Thomas wrote:
> I've just been puzzling my way through ispell.el's dictionary encoding
> code, after switching from aspell to hunspell in order to be able to
> treat Unicode curly single quotes as normal intraword punctuation
> (which it seems aspell cannot be persuaded to do, but that's another
> story).
> 
> I noticed a feature of ispell-dictionary-base-alist, which I don't
> understand: the last (7th) element of each dictionary definition is
> called "Coding System", which seems to be the coding system of the
> case character and non-case-character strings, but it is also passed
> to the spelling program as the input encoding, which is wrong, since
> the input encoding depends on the file to be checked.

That element represents the coding system that will be used for
communication with the dictionary. The case-character and
non-case-character strings should be in that same encoding.

> I currently use the classic workaround of making up my own dictionary
> definition which includes accented characters that I want to be able
> to use in words (which is necessary anyway), and which specifies utf-8
> as the coding system. This only works because I use utf-8 for all my
> text files.

If you are not going to use XEmacs, but only FSF Emacs, just use [:alpha:]
for the case-character and non-case-character strings along with utf-8. That
is already done automatically for aspell dictionaries, where it is easy to get
a list of installed dictionaries and additional info.
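
As an illustration, a definition along those lines might look like the
following sketch. The entry name "en_GB-utf8", the "-d" "en_GB" arguments and
the inclusion of the curly apostrophe (U+2019) in 'otherchars' are assumptions
made for the example; whether a given spellchecker honours them depends on its
version and dictionary.

```elisp
(require 'ispell)

;; Sketch only: the dictionary name and arguments below are assumptions.
(add-to-list 'ispell-local-dictionary-alist
             '("en_GB-utf8"        ; name shown when selecting the dictionary
               "[[:alpha:]]"       ; casechars: any alphabetic character
               "[^[:alpha:]]"      ; not-casechars: everything else
               "['\u2019]"         ; otherchars: straight and curly apostrophe
               t                   ; many-otherchars-p
               ("-d" "en_GB")      ; arguments passed to the spellchecker
               nil                 ; extended character mode
               utf-8))             ; coding system used with the process
```

The entry could then be selected with M-x ispell-change-dictionary.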

> It seems, therefore, that the argument to follow
> ispell-encoding8-command (which itself is mis-documented:
> 
> Command line option prefix to select UTF-8 if supported, nil otherwise.
> If UTF-8 if supported by spellchecker and is selectable from the command line
> this variable will contain \"--encoding=\" for aspell and \"-i \" for hunspell,
> so UTF-8 or other mime charsets can be selected.  That will be set for hunspell
> >=1.1.6 or aspell >= 0.60 in `ispell-check-version'.
> 
> It is not just for selecting UTF-8; indeed, that's the irony: in the
> default configuration it's used mostly to select 8-bit character sets!
> And there are one or two other typos. How about (suitably rewrapped):
> 
> Command line option prefix to select coding system if supported, nil otherwise.
> If the coding system is selectable from the command line
> this variable will contain \"--encoding=\" for aspell and \"-i \" for hunspell,
> so that the input encoding can be selected.  That will be set for hunspell
> >= 1.1.6 or aspell >= 0.60 in `ispell-check-version'.

Agreed, thanks

> Then, the following code in ispell-start-process:
> 
>     ;; If we are using recent aspell or hunspell, make sure we use the right
>     ;; encoding for communication.  ispell or older aspell/hunspell does not
>     ;; support this.
>     (if ispell-encoding8-command
> 	(setq args
> 	      (append args
> 		      (list
> 		       (concat ispell-encoding8-command
> 			       (symbol-name (ispell-get-coding-system)))))))
> 
> needs fixing: rather than using ispell-get-coding-system, it should
> use a prefix of buffer-file-coding-system (without the suffix that
> specifies the line ending).

No, the current code is correct. It tells the spellchecker that
communication with the dictionary will be done in the
(ispell-get-coding-system) coding system. ispell.el does the internal
conversions needed for that in a different place, so everything is
transparent to the user.
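
The kind of conversion meant here can be pictured with the sketch below. It
is not the code ispell.el uses; it only shows, with the generic functions
encode-coding-string and decode-coding-string, that buffer text can be
converted to a dictionary's coding system and back regardless of how the file
itself is encoded. The helper name `my-round-trip' is hypothetical.

```elisp
;; Sketch only: `my-round-trip' is a hypothetical helper.  DICT-CODING stands
;; in for what (ispell-get-coding-system) would return for the dictionary.
(defun my-round-trip (word dict-coding)
  "Encode WORD in DICT-CODING, decode it again, and report the result.
Return (BYTE-COUNT . SAME-P)."
  (let* ((sent (encode-coding-string word dict-coding))   ; bytes for the process
         (back (decode-coding-string sent dict-coding)))  ; reply decoded again
    (cons (string-bytes sent) (string= word back))))

(my-round-trip "caf\u00e9" 'iso-8859-1)   ; one byte per character in Latin-1
(my-round-trip "caf\u00e9" 'utf-8)        ; the accented letter takes two bytes
```

In both cases the decoded word equals the original, which is why the coding
system of the dictionary does not have to match that of the buffer's file.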

> I'm sure I'm missing things here, but if what I've said above makes
> any sense, I'd like to help refine it into a sensible proposal to
> improve ispell.el.

Thanks for looking into this. Will prepare a change with the
`ispell-encoding8-command' documentation fix.

Regards,

-- 
Agustin





Message received at submit <at> debbugs.gnu.org:


Date: Fri, 17 Dec 2010 18:30:14 +0000
Message-ID: <AANLkTin26ZJupakNsWxgteT9A4TOGCZQtAc=H6OTGG7-@mail.gmail.com>
Subject: ispell and dictionary encodings
From: Reuben Thomas <rrt@HIDDEN>
To: bug-emacs <bug-emacs@HIDDEN>
Content-Type: text/plain; charset=ISO-8859-1

I've just been puzzling my way through ispell.el's dictionary encoding
code, after switching from aspell to hunspell in order to be able to
treat Unicode curly single quotes as normal intraword punctuation
(which it seems aspell cannot be persuaded to do, but that's another
story).

I noticed a feature of ispell-dictionary-base-alist, which I don't
understand: the last (7th) element of each dictionary definition is
called "Coding System", which seems to be the coding system of the
case character and non-case-character strings, but it is also passed
to the spelling program as the input encoding, which is wrong, since
the input encoding depends on the file to be checked.

I currently use the classic workaround of making up my own dictionary
definition which includes accented characters that I want to be able
to use in words (which is necessary anyway), and which specifies utf-8
as the coding system. This only works because I use utf-8 for all my
text files.

It seems, therefore, that the argument to follow
ispell-encoding8-command (which itself is mis-documented:

Command line option prefix to select UTF-8 if supported, nil otherwise.
If UTF-8 if supported by spellchecker and is selectable from the command line
this variable will contain \"--encoding=\" for aspell and \"-i \" for hunspell,
so UTF-8 or other mime charsets can be selected.  That will be set for hunspell
>=1.1.6 or aspell >= 0.60 in `ispell-check-version'.

It is not just for selecting UTF-8; indeed, that's the irony: in the
default configuration it's used mostly to select 8-bit character sets!
And there are one or two other typos. How about (suitably rewrapped):

Command line option prefix to select coding system if supported, nil otherwise.
If the coding system is selectable from the command line
this variable will contain \"--encoding=\" for aspell and \"-i \" for hunspell,
so that the input encoding can be selected.  That will be set for hunspell
>= 1.1.6 or aspell >= 0.60 in `ispell-check-version'.

Then, the following code in ispell-start-process:

    ;; If we are using recent aspell or hunspell, make sure we use the right
    ;; encoding for communication.  ispell or older aspell/hunspell does not
    ;; support this.
    (if ispell-encoding8-command
	(setq args
	      (append args
		      (list
		       (concat ispell-encoding8-command
			       (symbol-name (ispell-get-coding-system)))))))

needs fixing: rather than using ispell-get-coding-system, it should
use a prefix of buffer-file-coding-system (without the suffix that
specifies the line ending).

I'm sure I'm missing things here, but if what I've said above makes
any sense, I'd like to help refine it into a sensible proposal to
improve ispell.el.

-- 
http://rrt.sc3d.org




Acknowledgement sent to Reuben Thomas <rrt@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs@HIDDEN. Full text available.
Report forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs@HIDDEN:
bug#7668; Package emacs. Full text available.
Last modified: Fri, 31 Oct 2014 17:00:04 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.