GNU bug report logs - #50247
27.2; wrong `word-wrap' for Chinese characters

Package: emacs;

Reported by: ClaudeMonet <pity4yeats <at> icloud.com>

Date: Sun, 29 Aug 2021 05:19:02 UTC

Severity: minor

Tags: moreinfo

Found in version 27.2

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 50247 in the body.
You can then email your comments to 50247 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#50247; Package emacs. (Sun, 29 Aug 2021 05:19:02 GMT)

Acknowledgement sent to ClaudeMonet <pity4yeats <at> icloud.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 29 Aug 2021 05:19:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: ClaudeMonet <pity4yeats <at> icloud.com>
To: bug-gnu-emacs <at> gnu.org
Subject: 27.2; wrong `word-wrap' for Chinese characters
Date: Sun, 29 Aug 2021 11:14:40 +0800

When `toggle-word-wrap' is enabled, lines that end with Chinese
characters and Chinese punctuation are not separated in the right
way.  Normally, all the Chinese words in a sentence are crowded
together and recognized by Emacs as one single WORD.

E.g. "世界" is a word in Chinese, and "世界人民大团结万岁。" is a full
sentence ending with a full-width period; Emacs recognizes the whole
sentence as one word, and thus wraps lines in the wrong way.

By the way, I think this has long been a problem for Chinese users,
since we use a full-width punctuation system, whereas in English
half-width punctuation is more generally adopted.  Another thing: when
you use the `forward-word' key binding in Emacs, I know English words
are all separated either by punctuation or by blank characters
(<space>, <tab>, etc.), but in Chinese, words in a single sentence are
usually separated by nothing.  I don't know what the normal practice
for "word recognition" tasks is on modern OSes like Mac and Windows.
I guess there is a dictionary mechanism.

A footnote here: for tokenizing Chinese words, there is a Python
tokenizer called "jieba" in the NLP field, which would be a great
reference if you are going to address this issue.  The GitHub link
for "jieba" is:

	https://github.com/fxsjy/jieba

Thanks!


In GNU Emacs 27.2 (build 1, x86_64-apple-darwin18.7.0, NS appkit-1671.60 Version 10.14.6 (Build 18G95))
of 2021-03-28 built on builder10-14.porkrind.org
Windowing system distributor 'Apple', version 10.3.2022
System Description:  macOS 11.5.2

Recent messages:
Wrote /Users/claude/.emacs.d/lisp/init-preload-local.el
Quit
Type "q" in help window to delete it.
C-c C-o is undefined
uncompressing simple.el.gz...done
Mark set
find-function-C-source: The C source file buffer.c is not available
Quit [2 times]

Mark set

Configured using:
'configure --with-ns '--enable-locallisppath=/Library/Application
Support/Emacs/${version}/site-lisp:/Library/Application
Support/Emacs/site-lisp' --with-modules'

Configured features:
NOTIFY KQUEUE ACL GNUTLS LIBXML2 ZLIB TOOLKIT_SCROLL_BARS NS MODULES
THREADS JSON PDUMPER GMP

Important settings:
  value of $LANG: en_CN.UTF-8
  locale-coding-system: utf-8

Major mode: Org

Minor modes in effect:
  default-text-scale-mode: t
  recentf-mode: t
  vertico-mode: t
  marginalia-mode: t
  company-quickhelp-mode: t
  company-quickhelp-local-mode: t
  winner-mode: t
  flycheck-color-mode-line-mode: t
  global-flycheck-mode: t
  flycheck-mode: t
  dimmer-mode: t
  global-anzu-mode: t
  anzu-mode: t
  global-company-mode: t
  company-mode: t
  diredfl-global-mode: t
  shell-dirtrack-mode: t
  savehist-mode: t
  electric-pair-mode: t
  delete-selection-mode: t
  global-auto-revert-mode: t
  global-so-long-mode: t
  mode-line-bell-mode: t
  beacon-mode: t
  show-paren-mode: t
  global-page-break-lines-mode: t
  page-break-lines-mode: t
  whole-line-or-region-global-mode: t
  whole-line-or-region-local-mode: t
  hes-mode: t
  which-key-mode: t
  global-whitespace-cleanup-mode: t
  whitespace-cleanup-mode: t
  global-diff-hl-mode: t
  diff-hl-mode: t
  projectile-rails-global-mode: t
  projectile-mode: t
  ipretty-mode: t
  auto-compile-on-load-mode: t
  auto-compile-on-save-mode: t
  immortal-scratch-mode: t
  desktop-save-mode: t
  ns-auto-titlebar-mode: t
  tooltip-mode: t
  global-eldoc-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  column-number-mode: t
  line-number-mode: t
  auto-fill-function: org-auto-fill-function
  visual-line-mode: t
  transient-mark-mode: t

Load-path shadows:
/Users/claude/.emacs.d/elpa-27.2/magit-20210822.529/magit-section-pkg hides /Users/claude/.emacs.d/elpa-27.2/magit-section-20210819.1119/magit-section-pkg
/Users/claude/.emacs.d/elpa-27.2/seq-2.22/seq hides /Applications/Emacs.app/Contents/Resources/lisp/emacs-lisp/seq

Features:
(shadow sort mail-extr emacsbug sendmail consult-vertico consult
bookmark ielm tabify view cl-print eieio-opt speedbar sb-image ezimage
dframe rainbow-mode help-fns radix-tree switch-window
switch-window-mvborder switch-window-asciiart quail executable cus-edit
cus-start cus-load sanityinc-tomorrow-bright-theme
color-theme-sanityinc-tomorrow default-text-scale recentf tree-widget
orderless vertico marginalia company-quickhelp pos-tip winner windswap
windmove vc-bzr vc-src vc-sccs vc-svn vc-cvs vc-rcs diff-hl-dired
elisp-slime-nav paredit aggressive-indent highlight-quoted
display-line-numbers display-fill-column-indicator rainbow-delimiters
symbol-overlay bug-reference goto-addr flycheck-color-mode-line
flycheck-package package-lint let-alist imenu finder flycheck dimmer
face-remap color anzu company-oddmuse company-keywords company-etags
etags fileloop company-gtags company-dabbrev-code company-dabbrev
company-files company-clang company-capf company-cmake company-semantic
company-bbdb company-php company-template ac-php-core popup xcscope
company-anaconda anaconda-mode xref project pythonic
company-nixos-options nixos-options company pcase disp-table vc-git
vc-darcs org-element avl-tree generator ol-eww eww mm-url url-queue
ol-rmail ol-mhe ol-irc ol-info ol-gnus nnir gnus-sum url url-proxy
url-privacy url-expand url-methods url-history mailcap shr url-cookie
url-domsuf url-util svg xml dom gnus-group gnus-undo gnus-start
gnus-cloud nnimap nnmail mail-source utf7 netrc nnoo gnus-spec gnus-int
gnus-range message rmc puny rfc822 mml mml-sec epa epg epg-config
mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils
mailheader gnus-win gnus nnheader gnus-util rmail rmail-loaddefs rfc2047
rfc2045 ietf-drums text-property-search mail-utils mm-util mail-prsvr
wid-edit ol-docview doc-view image-mode exif dired-x diredfl dired
dired-loaddefs ol-bibtex bibtex ol-bbdb ol-w3m ob-sqlite ob-sql ob-shell
ob-ruby ob-python python tramp-sh docker-tramp tramp-cache tramp
tramp-loaddefs trampver tramp-integration files-x tramp-compat shell
parse-time iso8601 ls-lisp ob-plantuml ob-octave ob-ledger ob-latex
ob-gnuplot ob-dot ob-ditaa ob-R org-clock org ob ob-tangle ob-ref ob-lob
ob-table ob-exp org-macro org-footnote org-src ob-comint org-pcomplete
pcomplete org-list org-faces org-entities time-date noutline outline
org-version ob-emacs-lisp ob-core ob-eval org-table ol org-keys
org-compat org-macs org-loaddefs format-spec find-func cal-menu calendar
cal-loaddefs savehist session elec-pair delsel autorevert filenotify
so-long mode-line-bell beacon paren page-break-lines
whole-line-or-region highlight-escape-sequences which-key diminish
whitespace-cleanup-mode whitespace diff-hl log-view pcvs-util vc-dir
ewoc vc vc-dispatcher diff-mode cl-extra help-mode projectile-rails rake
f dash s inflections inf-ruby ruby-mode smie autoinsert projectile
lisp-mnt grep compile comint ring ibuf-ext ibuffer ibuffer-loaddefs
thingatpt jka-compr ipretty advice auto-compile packed immortal-scratch
uptimes pp server init init-locales init-direnv init-ledger init-dash
init-folding init-misc init-common-lisp init-clojure-cider init-clojure
init-slime init-lisp init-paredit init-nix init-terraform init-docker
init-yaml init-toml init-rust init-nim init-j init-ocaml init-sql
init-rails init-ruby init-purescript init-elm init-haskell init-python
reformatter ansi-color init-http init-haml init-css init-html init-nxml
init-org init-php init-javascript easy-mmode init-erlang erlang-start
init-csv init-markdown init-textile init-crontab init-compile
init-projectile init-github init-git init-darcs init-vc init-whitespace
init-editing-utils init-mmm mmm-auto mmm-vars mmm-utils mmm-compat
init-sessions desktop frameset init-windows init-company
init-hippie-expand init-minibuffer init-recentf init-flycheck
init-ibuffer ibuf-macs init-uniquify init-grep init-isearch init-dired
init-gui-frames ns-auto-titlebar init-osx-keys init-themes init-xterm
init-frame-hooks init-preload-local init-exec-path exec-path-from-shell
init-elpa fullframe finder-inf rx edmacro kmacro slime-autoloads info
package easymenu browse-url url-handlers url-parse auth-source eieio
eieio-core cl-macs eieio-loaddefs password-cache json subr-x map
url-vars seq byte-opt gv bytecomp byte-compile cconv init-site-lisp
cl-seq cl-loaddefs cl-lib init-utils init-benchmarking derived
early-init tooltip eldoc electric uniquify ediff-hook vc-hooks
lisp-float-type mwheel term/ns-win ns-win ucs-normalize mule-util
term/common-win tool-bar dnd fontset image regexp-opt fringe
tabulated-list replace newcomment text-mode elisp-mode lisp-mode
prog-mode register page tab-bar menu-bar rfn-eshadow isearch timer
select scroll-bar mouse jit-lock font-lock syntax facemenu font-core
term/tty-colors frame minibuffer cl-generic cham georgian utf-8-lang
misc-lang vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms
cp51932 hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese composite charscript charprop case-table epa-hook
jka-cmpr-hook help simple abbrev obarray cl-preloaded nadvice loaddefs
button faces cus-face macroexp files text-properties overlay sha1 md5
base64 format env code-pages mule custom widget hashtable-print-readable
backquote threads kqueue cocoa ns multi-tty make-network-process emacs)

Memory information:
((conses 16 632053 354268)
(symbols 48 59409 246)
(strings 32 197826 53863)
(string-bytes 1 5927719)
(vectors 16 69807)
(vector-slots 8 1717944 390092)
(floats 8 911 2031)
(intervals 56 3152 3510)
(buffers 1000 32))




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#50247; Package emacs. (Sun, 29 Aug 2021 07:28:01 GMT)

Message #8 received at 50247 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: ClaudeMonet <pity4yeats <at> icloud.com>
Cc: 50247 <at> debbugs.gnu.org
Subject: Re: bug#50247: 27.2; wrong `word-wrap' for Chinese characters
Date: Sun, 29 Aug 2021 10:26:56 +0300

> Date: Sun, 29 Aug 2021 11:14:40 +0800
> From:  ClaudeMonet via "Bug reports for GNU Emacs,
>  the Swiss army knife of text editors" <bug-gnu-emacs <at> gnu.org>
> 
> When `toggle-word-wrap' is enabled, lines that end with Chinese
> characters and Chinese punctuation are not separated in the right
> way.  Normally, all the Chinese words in a sentence are crowded
> together and recognized by Emacs as one single WORD.
>
> E.g. "世界" is a word in Chinese, and "世界人民大团结万岁。" is a full
> sentence ending with a full-width period; Emacs recognizes the whole
> sentence as one word, and thus wraps lines in the wrong way.

Emacs 28 introduces the variable word-wrap-by-category; if you set
that non-nil, the above should work as you expect, assuming the
Kinsoku rules are good enough for that.  (Since you didn't describe in
detail what you expect the "right way" to be in this case, I couldn't
actually test that the results are as you expect.)
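
A minimal sketch of that setting for an init file, assuming Emacs 28 or
later.  The `boundp' guard and the explicit `require' of kinsoku are
additions for illustration: as far as I know, setting the option through
Customize loads kinsoku.el automatically, while a plain `setq' does not.

```elisp
;; Only Emacs 28 and later define this variable; older versions skip
;; the whole block.
(when (boundp 'word-wrap-by-category)
  ;; Also allow line breaks after characters in the "|" category
  ;; (which covers CJK characters), not just after whitespace.
  (setq word-wrap-by-category t)
  ;; Load the Kinsoku rules so that, for example, a line is not
  ;; started with a full-width period.
  (require 'kinsoku))
```

This only affects buffers where `word-wrap' is in effect, for example
under `visual-line-mode'.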

> By the way, I think this has long been a problem for Chinese users,
> since we use a full-width punctuation system, whereas in English
> half-width punctuation is more generally adopted.

Please elaborate on how this presents a problem in Emacs,
preferably with examples.

> Another thing: when you use the `forward-word' key binding in Emacs,
> I know English words are all separated either by punctuation or by
> blank characters (<space>, <tab>, etc.), but in Chinese, words in a
> single sentence are usually separated by nothing.  I don't know what
> the normal practice for "word recognition" tasks is on modern OSes
> like Mac and Windows.  I guess there is a dictionary mechanism.

Emacs has find-word-boundary-function-table, which can be used to
define our rules.  In general, we try to follow Unicode, but AFAIU
Unicode TR29 doesn't specify any word-breaking rules for Chinese
characters.
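
As a rough sketch of how that table can be used (an illustration, not a
tested recipe): each entry is a function called with POS and LIMIT that
returns a word boundary.  The function name `my-han-char-as-word' and
the range U+4E00..U+9FFF are assumptions made for this example; it
simply treats every Han character as its own word, which is not real
segmentation.

```elisp
(defun my-han-char-as-word (pos limit)
  "Hypothetical boundary function: treat one Han character as one word.
When POS < LIMIT (forward scan), POS is at the first character of a
word; return the position after its last character.  Otherwise
(backward scan) POS is at the last character of a word; return the
position of its first character."
  (if (< pos limit)
      (1+ pos)
    pos))

;; Install the function for the CJK Unified Ideographs block only.
(set-char-table-range find-word-boundary-function-table
                      '(#x4E00 . #x9FFF)
                      #'my-han-char-as-word)
```

A dictionary-based segmenter would replace the body of the function
with a lookup that returns the end (or start) of the longest known
word around POS, bounded by LIMIT.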

> A footnote here: for tokenizing Chinese words, there is a Python
> tokenizer called "jieba" in the NLP field, which would be a great
> reference if you are going to address this issue.  The GitHub link
> for "jieba" is:
> 
> 	https://github.com/fxsjy/jieba

Patches are welcome to add Chinese text segmentation capabilities to
Emacs.




Added tag(s) moreinfo. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sun, 29 Aug 2021 19:35:01 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#50247; Package emacs. (Mon, 27 Sep 2021 10:51:02 GMT)

Message #13 received at 50247 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: ClaudeMonet <pity4yeats <at> icloud.com>, 50247 <at> debbugs.gnu.org
Subject: Re: bug#50247: 27.2; wrong `word-wrap' for Chinese characters
Date: Mon, 27 Sep 2021 12:50:04 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

> Emacs 28 introduces the variable word-wrap-by-category; if you set
> that non-nil, the above should work as you expect, assuming the
> Kinsoku rules are good enough for that.  (Since you didn't describe in
> detail what you expect the "right way" to be in this case, I couldn't
> actually test that the results are as you expect.)
>
>> By the way, I think this has long been a problem for Chinese users,
>> since we use a full-width punctuation system, whereas in English
>> half-width punctuation is more generally adopted.
>
> Please elaborate on how this presents a problem in Emacs,
> preferably with examples.

More information was requested, but no response was given within a
month, so I'm closing this bug report.  If the problem still exists,
please respond to this email and we'll reopen the bug report.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




bug closed, send any further explanations to 50247 <at> debbugs.gnu.org and ClaudeMonet <pity4yeats <at> icloud.com> Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Mon, 27 Sep 2021 10:51:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 25 Oct 2021 11:24:06 GMT)

This bug report was last modified 2 years and 182 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.