GNU bug report logs - #66236
Specific Korean characters break Unicode parsing


Package: sed

Reported by: kristian.jarventaus <at> clausal.com

Date: Wed, 27 Sep 2023 12:44:01 UTC

Severity: normal

To reply to this bug, email your comments to 66236 AT debbugs.gnu.org.


Report forwarded to bug-sed <at> gnu.org:
bug#66236; Package sed. (Wed, 27 Sep 2023 12:44:01 GMT)

Acknowledgement sent to kristian.jarventaus <at> clausal.com:
New bug report received and forwarded. Copy sent to bug-sed <at> gnu.org. (Wed, 27 Sep 2023 12:44:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Kristian Järventaus <kristian <at> clausal.com>
To: bug-sed <at> gnu.org
Subject: Specific Korean characters break Unicode parsing
Date: Wed, 27 Sep 2023 14:38:15 +0300

sed (GNU sed) 4.8
Packaged by Debian


Issue: I have a large amount of data that I want to clean up, in the
following form:

====
GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: 3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648
GET_PAGE: ForkPoolWorker-19, title='Module:munge text', hash: 86aa20ba5f2a310911fc93b32b7ef14de944b233f2894236ed236350cf467a4d
GET_PAGE: ForkPoolWorker-19, title='Module:ko-translit', hash: 3f795c903dc252d3dedad1f7100c22de324986980a475396aabcdd554b886897
GET_PAGE: ForkPoolWorker-19, title='Module:ko-pron', hash: f4dde115a55246e97c0a14ea30f6896d9759e748040b8d45ac9c60ebb073cdcb
GET_PAGE: ForkPoolWorker-19, title='Module:ko', hash: 8ebb346f32119102d15f4b464dcf178912f5ca4889ece0cbeed97ae198a6e743
GET_PAGE: ForkPoolWorker-19, title='Module:ko-pron/data', hash: bd4e173ed2d8f9140b524ba76d7c9862494d8fb798d8e756ea5229a830e815d9
GET_PAGE: ForkPoolWorker-19, title='Template:it-pr', hash: ecdb98dc9ac1387ad4f847c7bc2113fcafd016b2e7b44dc8ae806fcb83c95d62
GET_PAGE: ForkPoolWorker-19, title='traffica', hash: 40728b79d679469e655593a096dbf2780a92b584d1a79d296d3b24a1543832b5
=====

(The title field covers basically every article title on
en.wiktionary.org, so it contains a very wide range of Unicode
characters.)

However, certain Hangeul (Korean) characters break *something*. After
running a substitution on data that looks like the above, I am always
left with a number of unconverted lines, all with Korean titles.

> sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27], hash.*/\1, \2/'


Output:
======
ForkPoolWorker-19, Template:ko-conj/verb
ForkPoolWorker-19, Template:affix
GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: 3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648
ForkPoolWorker-19, Module:munge text
ForkPoolWorker-19, Module:ko-translit
======

I tried to figure out if there was some kind of weird end-of-line
character or something that would stop the regex from processing, and in
all the faulty examples (all with Korean titles) I could find one shared
byte: the one shown as M-m in `cat -v` output, 237 decimal (0xED,
'm' + 128).

=====
'허공'  'M-mM-^WM-^HM-jM-3M-5'
title='평의회'  title='M-mM-^OM-^IM-lM-^]M-^XM-mM-^ZM-^L'
title='풍년화'  title='M-mM-^RM-^MM-kM-^EM-^DM-mM-^YM-^T'
'프로'  'M-mM-^TM-^DM-kM-!M-^\'
기계화  M-jM-8M-0M-jM-3M-^DM-mM-^YM-^T
맹세하다  M-kM-'M-9M-lM-^DM-8M-mM-^UM-^XM-kM-^KM-$
애프터  M-lM-^UM- M-mM-^TM-^DM-mM-^DM-0
고해  M-jM-3M- M-mM-^UM-4
얼큰하다  M-lM-^VM-<M-mM-^AM-0M-mM-^UM-^XM-kM-^KM-$
추가하다  M-lM-6M-^TM-jM-0M-^@M-mM-^UM-^XM-kM-^KM-$
푼체  M-mM-^QM-<M-lM-2M-4
목표어  M-kM-*M-)M-mM-^QM-^\M-lM-^VM-4
=====
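
The byte listings above can also be produced without `cat -v`. The
following is a minimal sketch, assuming a POSIX shell with `od`, `tr`
and `sed` available; the helper names `utf8_hex` and `has_ed_byte` are
made up for illustration and are not part of the original report. It
prints the UTF-8 bytes of a string in hex and checks for the byte 0xED,
which is what `cat -v` displays as M-m.

```sh
# Hypothetical helper: print the bytes of the argument as
# space-separated lowercase hex, e.g. "ed 95 98".
utf8_hex() {
    printf '%s' "$1" | od -An -v -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# Hypothetical helper: succeed if any byte of the argument is 0xED.
has_ed_byte() {
    case " $(utf8_hex "$1") " in
        *" ed "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Show the bytes of each syllable of one failing title from the report.
for ch in 외 출 하 다; do
    printf '%s  %s\n' "$ch" "$(utf8_hex "$ch")"
done
```

For 하 this gives `ed 95 98`, which matches the `M-mM-^UM-^X` sequence
visible in the listings above.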

The version of the above command without anything after the capture block

>  sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=\(.\+\)/\1, \2/'

parses correctly, because the .\+ captures to the end of the line (so my
initial suspicion of a stray end-of-line character was wrong). As far as
I can tell, my Unicode is correct: I don't have much reason to believe
it is mangled, since the file contains basically the titles of every
en.wiktionary.org article, so not just Korean and ASCII. It seems that
the presence of a character containing the M-m byte breaks Unicode
parsing for the rest of the line, which causes anything matched after it
(like the second ["\x27]) to fail, as if the Unicode 'cursor' were out
of sync or something similar.
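
One way to probe this hypothesis, not tried in the original report, is
to rerun the failing substitution in the C locale, where GNU sed treats
every byte as a single character and does no multibyte decoding. The
sketch below assumes GNU sed (for `\+` and `\x27`); the wrapper name
`clean_log_bytewise` is hypothetical. If lines that fail in a UTF-8
locale convert here, that points at multibyte handling rather than at
the expression itself.

```sh
# Sketch only: the report's substitution with multibyte decoding off.
# The function name is an assumption, not something from the report.
clean_log_bytewise() {
    # LC_ALL=C makes sed treat each byte as one character, so "." matches
    # the individual bytes of a Hangul syllable instead of decoding it.
    LC_ALL=C sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27], hash.*/\1, \2/' "$@"
}

# One of the failing lines from the report, with the hash on the same
# line (assuming the line break in the email is only mail wrapping).
printf '%s\n' "GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: 3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648" |
    clean_log_bytewise
```

In the C locale the captured bytes are copied through unchanged, so the
title should come out intact. Whether this avoids the behaviour on the
reporter's system is untested here.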

I can confirm that the presence of specific characters is the cause by 
eliminating individual characters:

====
GET_PAGE: ForkPoolWorker-19, title='외출하다',
GET_PAGE: ForkPoolWorker-19, title='출하다',
GET_PAGE: ForkPoolWorker-19, title='외하다',
GET_PAGE: ForkPoolWorker-19, title='외출다',
GET_PAGE: ForkPoolWorker-19, title='외출하',
GET_PAGE: ForkPoolWorker-20, title='부도덕하다',
GET_PAGE: ForkPoolWorker-20, title='도덕하다',
GET_PAGE: ForkPoolWorker-20, title='부덕하다',
GET_PAGE: ForkPoolWorker-20, title='부도하다',
GET_PAGE: ForkPoolWorker-20, title='부도덕다',
GET_PAGE: ForkPoolWorker-20, title='부도덕하',
====
> sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27],.*/\1, \2/' kor.txt > kor.test
=====
GET_PAGE: ForkPoolWorker-19, title='외출하다',
GET_PAGE: ForkPoolWorker-19, title='출하다',
GET_PAGE: ForkPoolWorker-19, title='외하다',
ForkPoolWorker-19, 외출다
GET_PAGE: ForkPoolWorker-19, title='외출하',
GET_PAGE: ForkPoolWorker-20, title='부도덕하다',
GET_PAGE: ForkPoolWorker-20, title='도덕하다',
GET_PAGE: ForkPoolWorker-20, title='부덕하다',
GET_PAGE: ForkPoolWorker-20, title='부도하다',
ForkPoolWorker-20, 부도덕다
=====

Every single occurrence of this issue that I found (and there were many
of them, because the data is very big) had an M-m byte somewhere in the
Hangeul.
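
This observation can be checked over a whole output file rather than by
eye. A minimal sketch, assuming grep with the `-a` option (as in GNU
grep) and an output file such as the `kor.test` produced above; both
function names are hypothetical. The first counts the lines that were
left unconverted, the second counts how many of those lack the byte
0xED. A result of 0 from the second would mean that every unconverted
line contains the suspect byte.

```sh
# Hypothetical helper: count lines the substitution left untouched.
count_leftovers() {
    LC_ALL=C grep -a -c '^GET_PAGE: ' "$1"
}

# Hypothetical helper: among the untouched lines, count those that do
# NOT contain the byte 0xED (octal 355, shown as M-m by cat -v).
count_leftovers_without_ed() {
    ed_byte=$(printf '\355')
    LC_ALL=C grep -a '^GET_PAGE: ' "$1" | LC_ALL=C grep -a -v -c "$ed_byte"
}

# Usage, with the file name taken from the report:
#   count_leftovers kor.test
#   count_leftovers_without_ed kor.test
```

Both greps run in the C locale so that the pattern is compared byte by
byte and does not depend on how the locale decodes Hangul.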

I can't reproduce this on https://sed.js.org/; there the output is as
expected.


-- 
Kristian Järventaus
Research Assistant / Tutkimusavustaja
Clausal Computing Oy
kristian.jarventaus <at> clausal.com




This bug report was last modified 219 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.