GNU bug report logs - #56386
[PATCH] gnu: Add mecab.


Package: guix-patches;

Reported by: Julien Lepiller <julien <at> lepiller.eu>

Date: Mon, 4 Jul 2022 19:11:02 UTC

Severity: normal

Tags: patch

Done: Julien Lepiller <julien <at> lepiller.eu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 56386 in the body.
You can then email your comments to 56386 AT debbugs.gnu.org in the normal way.


Report forwarded to guix-patches <at> gnu.org:
bug#56386; Package guix-patches. (Mon, 04 Jul 2022 19:11:02 GMT)

Acknowledgement sent to Julien Lepiller <julien <at> lepiller.eu>:
New bug report received and forwarded. Copy sent to guix-patches <at> gnu.org. (Mon, 04 Jul 2022 19:11:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Julien Lepiller <julien <at> lepiller.eu>
To: guix-patches <at> gnu.org
Subject: [PATCH] gnu: Add mecab.
Date: Mon, 4 Jul 2022 21:09:30 +0200
Hi Guix!

This small series adds mecab and two dictionaries. MeCab is a
morphological analysis engine. I'm not sure what that previous sentence
means (:p) but I use it as a segmenter for Japanese in one of my
projects. In fact, the two patches that follow add two dictionary
sources. You need one of them in the same profile as mecab for it to be
useful (with no dictionaries, it segfaults).




Information forwarded to guix-patches <at> gnu.org:
bug#56386; Package guix-patches. (Mon, 04 Jul 2022 19:43:01 GMT)

Message #8 received at 56386 <at> debbugs.gnu.org:

From: Julien Lepiller <julien <at> lepiller.eu>
To: 56386 <at> debbugs.gnu.org
Subject: [PATCH 2/3] gnu: Add mecab-ipadic.
Date: Mon,  4 Jul 2022 21:42:01 +0200
* gnu/packages/language.scm (mecab-ipadic): New variable.
---
 gnu/packages/language.scm | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/gnu/packages/language.scm b/gnu/packages/language.scm
index 3ffe115b51..63654c544b 100644
--- a/gnu/packages/language.scm
+++ b/gnu/packages/language.scm
@@ -970,3 +970,30 @@ (define-public mecab
 collaboration between the Kyoto university and Nippon Telegraph and Telephone
 Corporation.  The engine is independent of any language, dictionary or corpus.")
     (license (list license:gpl2+ license:lgpl2.1+ license:bsd-3))))
+
+(define-public mecab-ipadic
+  (package
+    (name "mecab-ipadic")
+    (version "2.7.0")
+    (source (package-source mecab))
+    (build-system gnu-build-system)
+    (arguments
+     `(#:configure-flags
+       (list (string-append "--with-dicdir=" (assoc-ref %outputs "out")
+                            "/lib/mecab/dic")
+             "--with-charset=utf8")
+       #:phases
+       (modify-phases %standard-phases
+         (add-after 'unpack 'chdir
+           (lambda _
+             (chdir "mecab-ipadic")))
+         (add-before 'configure 'set-mecab-dir
+           (lambda* (#:key outputs #:allow-other-keys)
+             (setenv "MECAB_DICDIR" (string-append (assoc-ref outputs "out")
+                                                   "/lib/mecab/dic")))))))
+    (native-inputs (list mecab)); for mecab-config
+    (home-page "https://taku910.github.io/mecab")
+    (synopsis "Dictionary data for MeCab")
+    (description "This package contains dictionary data derived from
+ipadic for use with MeCab.")
+    (license (license:non-copyleft "mecab-ipadic/COPYING"))))
-- 
2.36.1





Information forwarded to guix-patches <at> gnu.org:
bug#56386; Package guix-patches. (Mon, 04 Jul 2022 19:43:02 GMT)

Message #11 received at 56386 <at> debbugs.gnu.org:

From: Julien Lepiller <julien <at> lepiller.eu>
To: 56386 <at> debbugs.gnu.org
Subject: [PATCH 1/3] gnu: Add mecab.
Date: Mon,  4 Jul 2022 21:42:00 +0200
* gnu/packages/language.scm (mecab): New variable.
* gnu/packages/patches/mecab-variable-param.patch: New file.
* gnu/local.mk (dist_patch_DATA): Add it.
---
 gnu/local.mk                                  |  1 +
 gnu/packages/language.scm                     | 51 ++++++++++++++++++-
 .../patches/mecab-variable-param.patch        | 30 +++++++++++
 3 files changed, 81 insertions(+), 1 deletion(-)
 create mode 100644 gnu/packages/patches/mecab-variable-param.patch

diff --git a/gnu/local.mk b/gnu/local.mk
index faad6cc6b2..87fe75082c 100644
--- a/gnu/local.mk
+++ b/gnu/local.mk
@@ -1490,6 +1490,7 @@ dist_patch_DATA =						\
   %D%/packages/patches/libmemcached-build-with-gcc7.patch	\
   %D%/packages/patches/libmhash-hmac-fix-uaf.patch		\
   %D%/packages/patches/libsigrokdecode-python3.9-fix.patch	\
+  %D%/packages/patches/mecab-variable-param.patch		\
   %D%/packages/patches/mercurial-hg-extension-path.patch       \
   %D%/packages/patches/mesa-opencl-all-targets.patch		\
   %D%/packages/patches/mesa-skip-tests.patch			\
diff --git a/gnu/packages/language.scm b/gnu/packages/language.scm
index 61c9e682ed..3ffe115b51 100644
--- a/gnu/packages/language.scm
+++ b/gnu/packages/language.scm
@@ -4,7 +4,7 @@
 ;;; Copyright © 2018 Nikita <nikita <at> n0.is>
 ;;; Copyright © 2019 Alex Vong <alexvong1995 <at> gmail.com>
 ;;; Copyright © 2020 Ricardo Wurmus <rekado <at> elephly.net>
-;;; Copyright © 2020 Julien Lepiller <julien <at> lepiller.eu>
+;;; Copyright © 2020, 2022 Julien Lepiller <julien <at> lepiller.eu>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -921,3 +921,52 @@ (define-public praat
 analysis (pitch, formant, intensity, ...), speech synthesis, labelling, segmenting
 and manipulation.")
     (license license:gpl2+)))
+
+(define-public mecab
+  (package
+    (name "mecab")
+    (version "0.996")
+    (source (origin
+              (method git-fetch)
+              (uri (git-reference
+                     (url "https://github.com/taku910/mecab")
+                     ;; latest commit
+                     (commit "046fa78b2ed56fbd4fac312040f6d62fc1bc31e3")))
+              (file-name (git-file-name name version))
+              (sha256
+               (base32
+                "1hdv7rgn8j0ym9gsbigydwrbxa8cx2fb0qngg1ya15vvbw0lk4aa"))
+              (patches
+                (search-patches
+                  "mecab-variable-param.patch"))))
+    (build-system gnu-build-system)
+    (native-search-paths
+      (list (search-path-specification
+              (variable "MECAB_DICDIR")
+              (separator #f)
+              (files '("lib/mecab/dic")))))
+    (arguments
+     `(#:phases
+       (modify-phases %standard-phases
+         (add-after 'unpack 'chdir
+           (lambda _
+             (chdir "mecab")))
+         (add-before 'build 'add-mecab-dicdir-variable
+           (lambda _
+             (substitute* "mecabrc.in"
+               (("dicdir = .*")
+                "dicdir = $MECAB_DICDIR"))
+             (substitute* "mecab-config.in"
+               (("echo @libdir@/mecab/dic")
+                "if [ -z \"$MECAB_DICDIR\" ]; then
+  echo @libdir@/mecab/dic
+else
+  echo \"$MECAB_DICDIR\"
+fi")))))))
+    (inputs (list libiconv))
+    (home-page "https://taku910.github.io/mecab")
+    (synopsis "Morphological analysis engine for texts")
+    (description "MeCab is a morphological analysis engine developed as a
+collaboration between the Kyoto university and Nippon Telegraph and Telephone
+Corporation.  The engine is independent of any language, dictionary or corpus.")
+    (license (list license:gpl2+ license:lgpl2.1+ license:bsd-3))))
diff --git a/gnu/packages/patches/mecab-variable-param.patch b/gnu/packages/patches/mecab-variable-param.patch
new file mode 100644
index 0000000000..4457cf3f44
--- /dev/null
+++ b/gnu/packages/patches/mecab-variable-param.patch
@@ -0,0 +1,30 @@
+From 2396e90056706ef897acab3aaa081289c7336483 Mon Sep 17 00:00:00 2001
+From: LEPILLER Julien <julien.lepiller <at> irisa.fr>
+Date: Fri, 19 Apr 2019 11:48:39 +0200
+Subject: [PATCH] Allow variable parameters
+
+---
+ mecab/src/param.cpp | 6 +++++-
+ 1 file changed, 5 insertions(+), 1 deletion(-)
+
+diff --git a/mecab/src/param.cpp b/mecab/src/param.cpp
+index 65328a2..006b1b5 100644
+--- a/mecab/src/param.cpp
++++ b/mecab/src/param.cpp
+@@ -79,8 +79,12 @@ bool Param::load(const char *filename) {
+     size_t s1, s2;
+     for (s1 = pos+1; s1 < line.size() && isspace(line[s1]); s1++);
+     for (s2 = pos-1; static_cast<long>(s2) >= 0 && isspace(line[s2]); s2--);
+-    const std::string value = line.substr(s1, line.size() - s1);
++    std::string value = line.substr(s1, line.size() - s1);
+     const std::string key   = line.substr(0, s2 + 1);
++
++    if(value.find('$') == 0) {
++        value = std::getenv(value.substr(1).c_str());
++    }
+     set<std::string>(key.c_str(), value, false);
+   }
+ 
+-- 
+2.20.1
+
-- 
2.36.1
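
One caveat with the param.cpp change in mecab-variable-param.patch above:
std::getenv returns a null pointer when the variable is unset, and assigning
a null pointer to a std::string is undefined behaviour. This may be related
to the crash mentioned in the cover letter when no dictionary is in the
profile, but that link is an assumption and is not confirmed anywhere in this
thread. A minimal sketch of a null-safe variant follows. The helper name
expand_env_value is hypothetical and not part of MeCab, and falling back to
an empty string is only one possible choice.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not MeCab API): if `value` has the form "$NAME",
// return the content of the environment variable NAME; otherwise return
// `value` unchanged.
std::string expand_env_value(const std::string &value) {
  if (value.size() > 1 && value[0] == '$') {
    const char *env = std::getenv(value.substr(1).c_str());
    // std::getenv yields a null pointer for an unset variable, so check
    // before building a std::string from it. The empty-string fallback
    // is an illustrative assumption.
    return env ? std::string(env) : std::string();
  }
  return value;
}
```

With such a helper, the patched Param::load could pass
expand_env_value(value) to set<std::string>() instead of assigning the
result of std::getenv directly.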





Information forwarded to guix-patches <at> gnu.org:
bug#56386; Package guix-patches. (Mon, 04 Jul 2022 19:43:02 GMT)

Message #14 received at 56386 <at> debbugs.gnu.org:

From: Julien Lepiller <julien <at> lepiller.eu>
To: 56386 <at> debbugs.gnu.org
Subject: [PATCH 3/3] gnu: Add mecab-unidic.
Date: Mon,  4 Jul 2022 21:42:02 +0200
* gnu/packages/language.scm (mecab-unidic): New variable.
---
 gnu/packages/language.scm | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/gnu/packages/language.scm b/gnu/packages/language.scm
index 63654c544b..f97b982cb9 100644
--- a/gnu/packages/language.scm
+++ b/gnu/packages/language.scm
@@ -27,6 +27,7 @@ (define-module (gnu packages language)
   #:use-module (gnu packages autotools)
   #:use-module (gnu packages audio)
   #:use-module (gnu packages base)
+  #:use-module (gnu packages compression)
   #:use-module (gnu packages docbook)
   #:use-module (gnu packages emacs)
   #:use-module (gnu packages freedesktop)
@@ -57,6 +58,7 @@ (define-module (gnu packages language)
   #:use-module (gnu packages xorg)
   #:use-module (guix packages)
   #:use-module (guix build-system cmake)
+  #:use-module (guix build-system copy)
   #:use-module (guix build-system glib-or-gtk)
   #:use-module (guix build-system gnu)
   #:use-module (guix build-system perl)
@@ -997,3 +999,27 @@ (define-public mecab-ipadic
     (description "This package contains dictionary data derived from
 ipadic for use with MeCab.")
     (license (license:non-copyleft "mecab-ipadic/COPYING"))))
+
+(define-public mecab-unidic
+  (package
+    (name "mecab-unidic")
+    (version "3.1.0")
+    (source (origin
+              (method url-fetch)
+              (uri (string-append "https://clrd.ninjal.ac.jp/unidic_archive/cwj/"
+                                  version "/unidic-cwj-" version ".zip"))
+              (sha256
+               (base32
+                "1z132p2q3bgchiw529j2d7dari21kn0fhkgrj3vcl0ncg2m521il"))))
+    (build-system copy-build-system)
+    (arguments
+     `(#:install-plan
+       '(("." "lib/mecab/dic"
+          #:include-regexp ("\\.bin$" "\\.def$" "\\.dic$" "dicrc")))))
+    (native-inputs (list unzip))
+    (home-page "https://clrd.ninjal.ac.jp/unidic/en/")
+    (synopsis "Dictionary data for MeCab")
+    (description "UniDic for morphological analysis is a dictionary for
+analysis with the morphological analyser MeCab, where the short units exported
+from the database are used as entries (heading terms).")
+    (license (list license:gpl2+ license:lgpl2.1 license:bsd-3))))
-- 
2.36.1





Information forwarded to guix-patches <at> gnu.org:
bug#56386; Package guix-patches. (Sun, 17 Jul 2022 19:34:02 GMT)

Message #17 received at 56386 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: Julien Lepiller <julien <at> lepiller.eu>
Cc: 56386 <at> debbugs.gnu.org
Subject: Re: bug#56386: [PATCH] gnu: Add mecab.
Date: Sun, 17 Jul 2022 21:33:21 +0200
Hi,

Julien Lepiller <julien <at> lepiller.eu> writes:

> +    (synopsis "Dictionary data for MeCab")
> +    (description "UniDic for morphological analysis is a dictionary for
> +analysis with the morphological analyser MeCab, where the short units exported
> +from the database are used as entries (heading terms).")
> +    (license (list license:gpl2+ license:lgpl2.1 license:bsd-3))))

Maybe add a comment stating whether this is triple-licensed (at the
user’s choice) or if that means that there are files under each of
these.

Otherwise the whole series LGTM!

Ludo’.




Information forwarded to guix-patches <at> gnu.org:
bug#56386; Package guix-patches. (Thu, 30 Mar 2023 22:44:02 GMT)

Message #20 received at 56386 <at> debbugs.gnu.org:

From: Bruno Victal <mirai <at> makinata.eu>
To: Julien Lepiller <julien <at> lepiller.eu>
Cc: 56386 <at> debbugs.gnu.org
Subject: Re: [bug#56386] [PATCH] gnu: Add mecab.
Date: Thu, 30 Mar 2023 23:43:22 +0100
On 2022-07-04 20:09, Julien Lepiller wrote:
> Hi Guix!
> 
> This small series adds mecab and two dictionaries. MeCab is a
> morphological analysis engine. I'm not sure what that previous sentence
> means (:p) but I use it as a segmenter for Japanese in one of my
> projects. In fact, the two patches that follow add two dictionary
> sources. You need one of them in the same profile as mecab for it to be
> useful (with no dictionaries, it segfaults).
> 
> 
> 

Any updates regarding this?


Cheers,
Bruno




Reply sent to Julien Lepiller <julien <at> lepiller.eu>:
You have taken responsibility. (Sat, 01 Apr 2023 14:44:02 GMT)

Notification sent to Julien Lepiller <julien <at> lepiller.eu>:
bug acknowledged by developer. (Sat, 01 Apr 2023 14:44:02 GMT)

Message #25 received at 56386-done <at> debbugs.gnu.org:

From: Julien Lepiller <julien <at> lepiller.eu>
To: Bruno Victal <mirai <at> makinata.eu>
Cc: 56386-done <at> debbugs.gnu.org
Subject: Re: [bug#56386] [PATCH] gnu: Add mecab.
Date: Sat, 1 Apr 2023 16:43:20 +0200
On Thu, 30 Mar 2023 23:43:22 +0100,
Bruno Victal <mirai <at> makinata.eu> wrote:

> On 2022-07-04 20:09, Julien Lepiller wrote:
> > Hi Guix!
> > 
> > This small series adds mecab and two dictionaries. MeCab is a
> > morphological analysis engine. I'm not sure what that previous
> > sentence means (:p) but I use it as a segmenter for Japanese in one
> > of my projects. In fact, the two patches that follow add two
> > dictionary sources. You need one of them in the same profile as
> > mecab for it to be useful (with no dictionaries, it segfaults).
> > 
> > 
> >   
> 
> Any updates regarding this?
> 
> 
> Cheers,
> Bruno

I had forgotten about this. It's a triple license (at the user's
choice), so I added a comment. Pushed to master as
3ab24ba216ce91210b93ec61554b3343fbc3aaab to
4483296da3e2e1424d12d92d0f56fb428765ca43.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 30 Apr 2023 11:24:06 GMT)

