GNU logs - #70689, boring messages


Message sent to bug-guix@HIDDEN:


X-Loop: help-debbugs@HIDDEN
Subject: bug#70689: guix search doesn't weigh word matches higher than subword matches
Resent-From: Richard Sent <richard@HIDDEN>
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
Resent-CC: bug-guix@HIDDEN
Resent-Date: Wed, 01 May 2024 02:19:02 +0000
Resent-Message-ID: <handler.70689.B.171452992925697 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: report 70689
X-GNU-PR-Package: guix
X-GNU-PR-Keywords: 
To: 70689 <at> debbugs.gnu.org
X-Debbugs-Original-To: bug-guix@HIDDEN
Received: via spool by submit <at> debbugs.gnu.org id=B.171452992925697
          (code B ref -1); Wed, 01 May 2024 02:19:02 +0000
Received: (at submit) by debbugs.gnu.org; 1 May 2024 02:18:49 +0000
Received: from localhost ([127.0.0.1]:34670 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1s1zYW-0006gP-Vp
	for submit <at> debbugs.gnu.org; Tue, 30 Apr 2024 22:18:49 -0400
Received: from lists.gnu.org ([2001:470:142::17]:45604)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <richard@HIDDEN>) id 1s1zYS-0006gJ-5Y
 for submit <at> debbugs.gnu.org; Tue, 30 Apr 2024 22:18:47 -0400
Received: from eggs.gnu.org ([2001:470:142:3::10])
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <richard@HIDDEN>)
 id 1s1zY1-00009Y-QS
 for bug-guix@HIDDEN; Tue, 30 Apr 2024 22:18:17 -0400
Received: from mail-108-mta203.mxroute.com ([136.175.108.203])
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
 (Exim 4.90_1) (envelope-from <richard@HIDDEN>)
 id 1s1zY0-00056f-36
 for bug-guix@HIDDEN; Tue, 30 Apr 2024 22:18:17 -0400
Received: from filter006.mxroute.com ([136.175.111.2] filter006.mxroute.com)
 (Authenticated sender: mN4UYu2MZsgR)
 by mail-108-mta203.mxroute.com (ZoneMTA) with ESMTPSA id
 18f31f1f2a60008ca2.001 for <bug-guix@HIDDEN>
 (version=TLSv1.3 cipher=TLS_AES_256_GCM_SHA384);
 Wed, 01 May 2024 02:18:10 +0000
X-Zone-Loop: 9731ac6d815cff36f4d0f5bd630a76f1d76e9ea50c52
X-Originating-IP: [136.175.111.2]
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
 d=freakingpenguin.com; s=x; h=Content-Type:MIME-Version:Message-ID:Date:
 Subject:To:From:Sender:Reply-To:Cc:Content-Transfer-Encoding:Content-ID:
 Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
 :Resent-Message-ID:In-Reply-To:References:List-Id:List-Help:List-Unsubscribe:
 List-Subscribe:List-Post:List-Owner:List-Archive;
 bh=LbwFs/UmKWQRz4yPK9PcAe70fCpOsc5leLdlZXtGdxQ=; b=MAE7RJE4RcGIqT4ERONauFeXy1
 +brXFG0sdOtXnpyIf/oEUC/UJViOJ1TlMxYy8PHV0Yx5mcvS9Tw+wOzcjBeO2WU40keNstxdSROcX
 ZRO/S7EkJEmytqxapymY0jazjk+YYt2xEi67ECb8UA5XaKU+y76RjAPmDbWChOE+/SATU6A09ndJj
 4tNH5dZzfX1gqNVlYJRF1nkhvl87w7Mf7AWZ9yTHhCQh5lWzKbHglbYSgxB0Fdl8SZoU/JE/jfGo8
 3TrLhCqAWjtAkUUi4sUuYCqP0Hr5Gvng8elB98XRqdL/O1NH1nkE8eOzT3Yy3HoEQgkXxv4n+JdHj
 e5PG0iQA==;
From: Richard Sent <richard@HIDDEN>
Date: Tue, 30 Apr 2024 22:18:03 -0400
Message-ID: <87bk5qcm1w.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain
X-Authenticated-Id: richard@HIDDEN
Received-SPF: pass client-ip=136.175.108.203;
 envelope-from=richard@HIDDEN; helo=mail-108-mta203.mxroute.com
X-Spam_score_int: -16
X-Spam_score: -1.7
X-Spam_bar: -
X-Spam_report: (-1.7 / 5.0 requ) BAYES_00=-1.9, DKIM_INVALID=0.1,
 DKIM_SIGNED=0.1, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001,
 SPF_PASS=-0.001 autolearn=no autolearn_force=no
X-Spam_action: no action
X-Spam-Score: 0.9 (/)
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -0.1 (/)

Hi Guix!

When running guix search, relevance in the synopsis and description
fields is computed strictly from the number of matches, whether the
match is a whole word or a subword. Ideally, if a search string matches
an isolated word in a field, that result should be considered more
relevant than one that merely matches a subword, even multiple times.

To illustrate, imagine trying to find what package provides the `rsh`
binary and running `$ guix search rsh`. This binary is part of
`inetutils` and the description field contains:

> Inetutils is a collection of common network programs, such as an ftp
> client and server, a telnet client and server, an rsh client and
> server, and hostname.

Most likely, this is what the user is interested in. However, inetutils
does not show up until roughly the ~75th result with a relevance of 2
(the lowest possible relevance).

Almost every search result beforehand contains the string "rsh" as a
component of another word, such as "marshaling", "powershell", and
"hershey". However, these match multiple times and are weighted
significantly higher.

Ideally, guix search should rate inetutils higher because the string
"rsh" occurs as its own word, not as a component of another, unrelated
word. (Very, very few people would search "rsh" looking for matches
with "hershey", even if "hershey" occurs multiple times.)

Another example of where this can happen is with "dig", part of the bind
package. Searching for "dig" returns garbage because "dig" is a common
subword. Bind is scored with a relevance of 2, even though bind's
description emphasises that dig is part of it.

This would improve the experience when searching with strings that
commonly occur as subwords.

Since this change can't occur in a vacuum, care should be taken not to
reduce the effectiveness of other reasonably foreseeable search queries.

-- 
Take it easy,
Richard Sent
Making my computer weirder one commit at a time.




Message sent:


Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-Mailer: MIME-tools 5.505 (Entity 5.505)
Content-Type: text/plain; charset=utf-8
X-Loop: help-debbugs@HIDDEN
From: help-debbugs@HIDDEN (GNU bug Tracking System)
To: Richard Sent <richard@HIDDEN>
Subject: bug#70689: Acknowledgement (guix search doesn't weigh word
 matches higher than subword matches)
Message-ID: <handler.70689.B.171452992925697.ack <at> debbugs.gnu.org>
References: <87bk5qcm1w.fsf@HIDDEN>
X-Gnu-PR-Message: ack 70689
X-Gnu-PR-Package: guix
Reply-To: 70689 <at> debbugs.gnu.org
Date: Wed, 01 May 2024 02:19:02 +0000

Thank you for filing a new bug report with debbugs.gnu.org.

This is an automatically generated reply to let you know your message
has been received.

Your message is being forwarded to the package maintainers and other
interested parties for their attention; they will reply in due course.

Your message has been sent to the package maintainer(s):
 bug-guix@HIDDEN

If you wish to submit further information on this problem, please
send it to 70689 <at> debbugs.gnu.org.

Please do not send mail to help-debbugs@HIDDEN unless you wish
to report a problem with the Bug-tracking system.

-- 
70689: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=70689
GNU Bug Tracking System
Contact help-debbugs@HIDDEN with problems


Message sent to bug-guix@HIDDEN:


X-Loop: help-debbugs@HIDDEN
Subject: bug#70689: guix search doesn't weigh word matches higher than subword matches
Resent-From: bokr@HIDDEN
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
Resent-CC: bug-guix@HIDDEN
Resent-Date: Wed, 01 May 2024 13:46:01 +0000
Resent-Message-ID: <handler.70689.B70689.171457115227888 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: followup 70689
X-GNU-PR-Package: guix
X-GNU-PR-Keywords: 
To: Richard Sent <richard@HIDDEN>
Cc: 70689 <at> debbugs.gnu.org
Received: via spool by 70689-submit <at> debbugs.gnu.org id=B70689.171457115227888
          (code B ref 70689); Wed, 01 May 2024 13:46:01 +0000
Received: (at 70689) by debbugs.gnu.org; 1 May 2024 13:45:52 +0000
Received: from localhost ([127.0.0.1]:37558 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1s2AHP-0007Fk-PC
	for submit <at> debbugs.gnu.org; Wed, 01 May 2024 09:45:52 -0400
Received: from mailout.easymail.ca ([64.68.200.34]:35802)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <bokr@HIDDEN>) id 1s2AHL-0007Fe-Qo
 for 70689 <at> debbugs.gnu.org; Wed, 01 May 2024 09:45:49 -0400
Received: from localhost (localhost [127.0.0.1])
 by mailout.easymail.ca (Postfix) with ESMTP id BECE96F10F;
 Wed,  1 May 2024 13:45:20 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=bokr.com; s=easymail;
 t=1714571120; bh=ZAx1XjR9xIRTe+Kl9c8nccd6gXaB4z3GnOIK2yASGi0=;
 h=From:Date:To:Cc:Subject:References:In-Reply-To:From;
 b=Im+xB7UJTH4U9uZYiWi0YDWtf8gwI4YKCV4npqZou54bkXalk+m8a1JcM0mbnAjeV
 EggkePX71DMWzDZj2QIzYosDTHwnMucPlp/qBZ5wBUJZyUkyJYE2XiWC1lODkAgMbH
 VdNrkYgjSGYmjlHgHvP5LGh1kUhUHWbjvBVfh26ApW/zXrOSEq8LeFQayD4ayG5jBP
 /gWMr1oR1LbbK86XP7zCp/ujc0H3zyk+qkk5pxHQwpyPWaKFOmb5XdNRJf+Wjojlo5
 8EoWGNt5b8FOhpAyN87G0wFPOzkSYLw83tSvAyDvPh54T65WWbAfJPFuv1Z5UTsVf9
 DudHki55m8KNw==
X-Virus-Scanned: Debian amavisd-new at emo07-pco.easydns.vpn
Received: from mailout.easymail.ca ([127.0.0.1])
 by localhost (emo07-pco.easydns.vpn [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id jkCzSTOuKSjW; Wed,  1 May 2024 13:45:20 +0000 (UTC)
Received: from localhost (m90-129-222-29.cust.tele2.se [90.129.222.29])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange ECDHE (P-256) server-signature RSA-PSS (2048 bits) server-digest
 SHA256) (No client certificate requested)
 by mailout.easymail.ca (Postfix) with ESMTPSA id CB68D6EC97;
 Wed,  1 May 2024 13:45:19 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=bokr.com; s=easymail;
 t=1714571120; bh=ZAx1XjR9xIRTe+Kl9c8nccd6gXaB4z3GnOIK2yASGi0=;
 h=From:Date:To:Cc:Subject:References:In-Reply-To:From;
 b=Im+xB7UJTH4U9uZYiWi0YDWtf8gwI4YKCV4npqZou54bkXalk+m8a1JcM0mbnAjeV
 EggkePX71DMWzDZj2QIzYosDTHwnMucPlp/qBZ5wBUJZyUkyJYE2XiWC1lODkAgMbH
 VdNrkYgjSGYmjlHgHvP5LGh1kUhUHWbjvBVfh26ApW/zXrOSEq8LeFQayD4ayG5jBP
 /gWMr1oR1LbbK86XP7zCp/ujc0H3zyk+qkk5pxHQwpyPWaKFOmb5XdNRJf+Wjojlo5
 8EoWGNt5b8FOhpAyN87G0wFPOzkSYLw83tSvAyDvPh54T65WWbAfJPFuv1Z5UTsVf9
 DudHki55m8KNw==
From: bokr@HIDDEN
Date: Wed, 1 May 2024 15:45:05 +0200
Message-ID: <20240501134505.GA10144@LionPure>
References: <87bk5qcm1w.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <87bk5qcm1w.fsf@HIDDEN>
User-Agent: Mutt/1.10.1 (2018-07-13)
X-Spam-Score: -2.3 (--)
X-Spam-Score: -3.3 (---)

I like your proposal :)

I'm wondering how [1] compares in what it does for your use(ful) case.
(I am not familiar with Hyper Estraier beyond being prompted for gnu.org searching)

[1] <https://directory.fsf.org/wiki/Hyper_Estraier>

--
Regards,
Bengt Richter




Message sent to bug-guix@HIDDEN:


X-Loop: help-debbugs@HIDDEN
Subject: bug#70689: guix search doesn't weigh word matches higher than subword matches
Resent-From: aurtzy <aurtzy@HIDDEN>
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
Resent-CC: bug-guix@HIDDEN
Resent-Date: Fri, 13 Sep 2024 07:16:02 +0000
Resent-Message-ID: <handler.70689.B70689.17262117036367 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: followup 70689
X-GNU-PR-Package: guix
X-GNU-PR-Keywords: 
To: 70689 <at> debbugs.gnu.org
Cc: Richard Sent <richard@HIDDEN>, bokr@HIDDEN
Received: via spool by 70689-submit <at> debbugs.gnu.org id=B70689.17262117036367
          (code B ref 70689); Fri, 13 Sep 2024 07:16:02 +0000
Received: (at 70689) by debbugs.gnu.org; 13 Sep 2024 07:15:03 +0000
Received: from localhost ([127.0.0.1]:42355 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1sp0WE-0001eR-L4
	for submit <at> debbugs.gnu.org; Fri, 13 Sep 2024 03:15:03 -0400
Received: from mail-io1-f50.google.com ([209.85.166.50]:54692)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <aurtzy@HIDDEN>) id 1sp0WC-0001dp-3w
 for 70689 <at> debbugs.gnu.org; Fri, 13 Sep 2024 03:15:00 -0400
Received: by mail-io1-f50.google.com with SMTP id
 ca18e2360f4ac-82aac438539so20463939f.1
 for <70689 <at> debbugs.gnu.org>; Fri, 13 Sep 2024 00:14:51 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=gmail.com; s=20230601; t=1726211625; x=1726816425; darn=debbugs.gnu.org;
 h=content-transfer-encoding:in-reply-to:from:content-language:subject
 :references:cc:to:user-agent:mime-version:date:message-id:from:to:cc
 :subject:date:message-id:reply-to;
 bh=nX9gp5yExZRDFI1HsgAmCJS1l28/MEfK0eoT4EjmOVc=;
 b=MZtwYtWzAqQ2DPbpSTkyktsyYztPm9/EQfqm6AF/C/MDVVrw+b2gzb77MFulCd72Lt
 chK5H48Pbm5AvV6f/dA2C+59IfMNNBNi2B6SCyGyc45SPtZlCjVeFO9WNUyTcRyTaAei
 IsmvkZ8c9OG+Uay1tW75h8xJkb5wTWxfkEXwJ7vSGzw+u8hP0ltXrV4IcOq0ZWGnu0CI
 QTJ3FGmK1k1xDEMZNONuCNVCGTNRSfLauObyTjuttrvsODonst87bMHEKl+jN4q0FExI
 wUUfw5MQwhlvyp8wH9KbFu4g6hUstE1wU7jcbRyoW5IDmId2/zR3q9r0ueptPvzgCpL9
 Klug==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20230601; t=1726211625; x=1726816425;
 h=content-transfer-encoding:in-reply-to:from:content-language:subject
 :references:cc:to:user-agent:mime-version:date:message-id
 :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
 bh=nX9gp5yExZRDFI1HsgAmCJS1l28/MEfK0eoT4EjmOVc=;
 b=dESBhYXBcKABji9zymgJWVReaYLerjYMwg9wA/Zlal8DyO3k8XYTmqEUmPD+ptS/DF
 aupYb/ca6aITyxXI+XDpFvcMnu9TKcthsMKCu4yFbTbMndgnnRVcwzjIfe4XV5kHWYP6
 AkHNDnSuHW6PFBWvVkPZkDju+PfqlPfuPKgcr63Gc1My/2QTviiVe3CboKqnpOepjZfl
 b9gcDFXpietClcO6PeTf0QvnadjCnnaGNWUNHXn9UshBjJeqx6gOZMgxxtHyTSZXw5g/
 FgJhZbn1sfS9/yJ0irzq698UAmd+5fDcLGjOIUeKbG804s9b2PwcTOe9KOu4MC2/RKxy
 XSRA==
X-Gm-Message-State: AOJu0Yz6gJsB1FS6GJlWGrCAjCGsz0esvIygQlAquBZK0sv9jEuiETTb
 2jqTDhm483oj2v/1FZrVGptzotEP8dXCD+A8MyopUYp4nUUy1ZiI2ifpAQ==
X-Google-Smtp-Source: AGHT+IEoGf4jzIaisHmr8gWds8GnBkyyj/t9oUE7jGCjDhUpvHWE5DLeFKE5MCy6w6rEqIFAha/9iw==
X-Received: by 2002:a92:c54e:0:b0:39b:32f6:5e90 with SMTP id
 e9e14a558f8ab-3a08b739125mr13393705ab.15.1726211624411; 
 Fri, 13 Sep 2024 00:13:44 -0700 (PDT)
Received: from ?IPV6:2600:4808:a053:7600::e413? ([2600:4808:a053:7600::e413])
 by smtp.gmail.com with ESMTPSA id
 8926c6da1cb9f-4d35f89137bsm1047381173.104.2024.09.13.00.13.41
 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
 Fri, 13 Sep 2024 00:13:43 -0700 (PDT)
Message-ID: <b1592bd0-cd96-4e3a-9f79-a7b793cd5d5c@HIDDEN>
Date: Fri, 13 Sep 2024 03:13:41 -0400
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
References: <20240501134505.GA10144@LionPure>
Content-Language: en-US
From: aurtzy <aurtzy@HIDDEN>
In-Reply-To: <20240501134505.GA10144@LionPure>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -0.0 (/)
X-Spam-Score: -1.0 (-)

Hi Richard and bokr,

I've proposed changes to relevance scoring that should help with this 
issue, if you'd like to try it out here: https://issues.guix.gnu.org/73220

Cheers,

aurtzy






Message sent to bug-guix@HIDDEN:


X-Loop: help-debbugs@HIDDEN
Subject: bug#70689: guix search doesn't weigh word matches higher than subword matches
Resent-From: Simon Tournier <zimon.toutoune@HIDDEN>
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
Resent-CC: bug-guix@HIDDEN
Resent-Date: Fri, 13 Sep 2024 15:10:02 +0000
Resent-Message-ID: <handler.70689.B70689.172624018317706 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: followup 70689
X-GNU-PR-Package: guix
X-GNU-PR-Keywords: 
To: Richard Sent <richard@HIDDEN>
Cc: aurtzy <aurtzy@HIDDEN>, 70689 <at> debbugs.gnu.org
Received: via spool by 70689-submit <at> debbugs.gnu.org id=B70689.172624018317706
          (code B ref 70689); Fri, 13 Sep 2024 15:10:02 +0000
Received: (at 70689) by debbugs.gnu.org; 13 Sep 2024 15:09:43 +0000
Received: from localhost ([127.0.0.1]:43980 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1sp7va-0004bU-E4
	for submit <at> debbugs.gnu.org; Fri, 13 Sep 2024 11:09:42 -0400
Received: from mail-wm1-f53.google.com ([209.85.128.53]:45541)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <zimon.toutoune@HIDDEN>) id 1sp7vX-0004b2-5n
 for 70689 <at> debbugs.gnu.org; Fri, 13 Sep 2024 11:09:40 -0400
Received: by mail-wm1-f53.google.com with SMTP id
 5b1f17b1804b1-42cde6b5094so18716655e9.3
 for <70689 <at> debbugs.gnu.org>; Fri, 13 Sep 2024 08:09:29 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=gmail.com; s=20230601; t=1726240104; x=1726844904; darn=debbugs.gnu.org;
 h=content-transfer-encoding:mime-version:user-agent:message-id:date
 :references:in-reply-to:subject:cc:to:from:from:to:cc:subject:date
 :message-id:reply-to;
 bh=q5IukngVSCTeTU6gpYs2PaQPTn4mOIYPOpVThzg514I=;
 b=K+c5VGrrnbVBPdcinpzwcvIC09FXqocuJQa4Ad6X2nzS3adlMTLY7FOpJjNpARlEpi
 SvXotPbFTCa/podYRV1lIox8k7tLs2SyvwN5uJxuBd/sQ5HBUtMJZcMQrO17mnZZPZ8d
 bg7b7spx0/Uqk8NsOGpSAEsShuYQGWH7jO3+gNQ/Gk8/mAgEmkb/wiPA6L5i+ZOox/D2
 8wPED8dvqUgVW24FGM29sepIwSadaoKgX5z4nmKBiX3VUkjRbkynmqk7SzdvnqqVOzeN
 XOYfL7sINv0p5tLH+MzGfVDkKDCV/ot8cGz81SrrO1IzYm0gRFop87Na38G8TQlbLyF5
 7Ntw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20230601; t=1726240104; x=1726844904;
 h=content-transfer-encoding:mime-version:user-agent:message-id:date
 :references:in-reply-to:subject:cc:to:from:x-gm-message-state:from
 :to:cc:subject:date:message-id:reply-to;
 bh=q5IukngVSCTeTU6gpYs2PaQPTn4mOIYPOpVThzg514I=;
 b=LKE9VBOlGoMx8dxCO0D0LnyBc9bC3TPoHNnJc5bfxn+mMbbZwiIZhhry6niXkm21Aa
 Su2hjlJJ93b1z+l1kHl9DUcb8u5NCzSzdRVd+cjYEvPTD0xQB3g7w/8vAriNUAb3U5Pp
 6AjAVMuu2lizFvE9ssrn6x7K9OTBbolxd2tylzMgiXIWqQHPD3DRRgqUnKLuKXtHEDwk
 DU+nojaM5aTSdp0u8JJ923We0vfFtuf+OqRQqYs88Y5j4yjk9Qx/L+M2q1+BquIyCk9T
 cZ/Z0Imsy4yOL4CwAPbXhL6+KpIMo9Qn64Jpmlp+JMAglDb7RRvt1OMOlGfSgYysxhuJ
 MjrA==
X-Gm-Message-State: AOJu0YzS+/XXPUG9ih4/hjghxe0hvKr+3ERB59bvQ/FZfzpvNP/Y8UXp
 M7CGehzxSp+VZ+l3IgYe89TazS6zTxhLCvRl54rtR8IyK8XuiBp7
X-Google-Smtp-Source: AGHT+IEQaJGb7heK9nxFbTSjCWS7K7QpqRkb3VxSUDNxDUUeJGZ5wy4VfeTH2Cijjt1P3tdjdsZ25w==
X-Received: by 2002:a5d:5582:0:b0:371:8319:4dcc with SMTP id
 ffacd0b85a97d-378c2cd5da5mr3613413f8f.2.1726240103438; 
 Fri, 13 Sep 2024 08:08:23 -0700 (PDT)
Received: from lili (roam-nat-fw-prg-194-254-61-40.net.univ-paris-diderot.fr.
 [194.254.61.40]) by smtp.gmail.com with ESMTPSA id
 ffacd0b85a97d-378956de4b9sm17251174f8f.111.2024.09.13.08.08.22
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Fri, 13 Sep 2024 08:08:23 -0700 (PDT)
From: Simon Tournier <zimon.toutoune@HIDDEN>
In-Reply-To: <87bk5qcm1w.fsf@HIDDEN> (Richard Sent's message of
 "Tue, 30 Apr 2024 22:18:03 -0400")
References: <87bk5qcm1w.fsf@HIDDEN>
Date: Fri, 13 Sep 2024 17:08:19 +0200
Message-ID: <877cbfvbfg.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: -0.0 (/)
X-Spam-Score: -1.0 (-)

Hi,

On Tue, 30 Apr 2024 at 22:18, Richard Sent <richard@HIDDEN> wrote:

>> Inetutils is a collection of common network programs, such as an ftp
>> client and server, a telnet client and server, an rsh client and
>> server, and hostname.
>
> Most likely, this is what the user is interested in. However, inetutils
> does not show up until roughly the ~75th result with a relevance of 2
> (the lowest possible relevance).

Using Guix 056910e, I get:

    $ guix search rsh | recsel -CP name | grep -n inetutils
    76:inetutils

Then using the proposed v2 patch#73220 [1], I get:

    $ ./pre-inst-env guix search rsh | recsel -CP name | grep -n inetutils
    34:inetutils

Well, that’s not perfect but a bit better.


> Almost every search result beforehand contains the string "rsh" as a
> component of another word, such as "marshaling", "powershell", and
> "hershey". However, these match multiple times and are weighted
> significantly higher.

Well, if we consider the current implementation, the relevance score of
the highest-ranked result reads:

          4 * 0       name
        + 2 * 0       upstream-name
        + 1 * 0       outputs
        + 3 * 2 * 1   synopsis
        + 2 * 4 * 1   description
        + 1 * 0       file-name
        = 14

where each term means: field-weight * number-of-matches * weight-match

Compared to inetutils:

          4 * 0       name
        + 2 * 0       upstream-name
        + 1 * 0       outputs
        + 3 * 0       synopsis
        + 2 * 1 * 1   description
        + 1 * 0       file-name
        = 2
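
The two breakdowns above can be reproduced with a small sketch.  This is
only an illustration in Python, not the Guile code Guix actually runs:
the field weights are copied from the breakdowns, while the names
`FIELD_WEIGHTS` and `relevance` and the per-field match counts passed in
are assumptions made for the example.

```python
# Field weights as listed in the breakdowns above.
FIELD_WEIGHTS = {
    "name": 4,
    "upstream-name": 2,
    "outputs": 1,
    "synopsis": 3,
    "description": 2,
    "file-name": 1,
}

def relevance(match_counts, weight_match=1):
    """Sum field-weight * number-of-matches * weight-match over all fields.

    match_counts maps a field name to the number of matches in that field;
    fields that are absent contribute nothing.
    """
    return sum(FIELD_WEIGHTS[field] * count * weight_match
               for field, count in match_counts.items())

# Highest-ranked result: 2 matches in the synopsis, 4 in the description.
top = relevance({"synopsis": 2, "description": 4})

# inetutils: a single match, in the description.
inetutils = relevance({"description": 1})

print(top, inetutils)
```

With these inputs the sketch gives 14 for the top result and 2 for
inetutils, matching the sums above.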

Well, this case cannot be improved much.  First, the field-weights are
almost optimal [2].  Second, the number of occurrences depends on the
description; maybe it could be improved, I have not checked yet.

And v2 of #73220 replaces the value of weight-match: the term ’rsh’ in
“an rsh client” should have a higher score than in “uses `json.Marshal'
and `json.Unmarshal'”.

In other words, it reads:

          4 * 0       name
        + 2 * 0       upstream-name
        + 1 * 0       outputs
        + 3 * 0       synopsis
        + 2 * 1 * 3   description
        + 1 * 0       file-name
        = 6

I think this addresses your suggestion.
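
To make the idea concrete, here is a rough sketch of weighting a
whole-word match higher than a subword match.  It is not the code from
patch #73220; the function name `weighted_matches`, the
case-insensitive matching, and the "no letter or digit on either side"
test for a standalone word are all assumptions made for illustration.
Only the weights 3 and 1 and the description field weight 2 come from
the breakdown above.

```python
import re

def weighted_matches(term, text, word_weight=3, subword_weight=1):
    """Add a weight per occurrence of term, higher when it stands alone as a word."""
    total = 0
    for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        # Characters adjacent to the match; treat the string edges as spaces.
        before = text[m.start() - 1] if m.start() > 0 else " "
        after = text[m.end()] if m.end() < len(text) else " "
        standalone = not (before.isalnum() or after.isalnum())
        total += word_weight if standalone else subword_weight
    return total

DESCRIPTION_WEIGHT = 2  # description field weight from the breakdown above

inetutils_text = (
    "Inetutils is a collection of common network programs, such as an ftp "
    "client and server, a telnet client and server, an rsh client and "
    "server, and hostname."
)
marshal_text = "uses `json.Marshal' and `json.Unmarshal'"

# One standalone "rsh": 2 * 3
inetutils_score = DESCRIPTION_WEIGHT * weighted_matches("rsh", inetutils_text)
# Two subword hits (Marshal, Unmarshal): 2 * (1 + 1)
marshal_score = DESCRIPTION_WEIGHT * weighted_matches("rsh", marshal_text)

print(inetutils_score, marshal_score)
```

Under these assumptions the single standalone ’rsh’ scores 6, ahead of
the two subword hits, which score 4.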


> Ideally, guix search should rate inetutils higher because the string
> "rsh" occurs as its own word, not as a component of another, unrelated
> word. (Very, very few people would search "rsh" looking for matches
> with "hershey", even if "hershey" occurs multiple times.)

Again, considering the case at hand: if, instead of the 3 arbitrarily
picked in v2 of #73220, we picked 7, then inetutils would be ranked
first.

Yeah, maybe 3 isn’t enough… And maybe 7 is a good choice.
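
The arithmetic behind 3 versus 7 can be checked quickly.  This sketch
assumes, as in the breakdowns above, that inetutils has exactly one
whole-word match in its description (field weight 2) and that the
current top result keeps its score of 14 because its matches are all
subwords; the names used are made up for the example.

```python
DESCRIPTION_FIELD_WEIGHT = 2  # from the breakdown above
TOP_SCORE = 14                # highest score under the current implementation

def inetutils_score(word_weight):
    # One whole-word match of "rsh" in the description, nothing elsewhere.
    return DESCRIPTION_FIELD_WEIGHT * 1 * word_weight

scores = {w: inetutils_score(w) for w in (1, 3, 7)}

# Smallest whole-word weight for which inetutils reaches the top score
# (ceiling division).
min_weight = -(-TOP_SCORE // DESCRIPTION_FIELD_WEIGHT)

print(scores, min_weight)
```

Under those assumptions a weight of 3 leaves inetutils at 6, and 7 is
the smallest weight that brings it level with a score of 14.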

Do you have examples other than ’rsh’?


> Another example of where this can happen is with "dig", part of the bind
> package. Searching for "dig" returns garbage because "dig" is a common
> subword. Bind is scored with a relevance of 2, even though bind's
> description emphasises that dig is part of it.

Please note that using v2 of #73220 with the weight of 7, the package is
returned “third”: a relevance of 14 (behind 24 and 20).

However, it appears 8th in the list because the order of packages with
the same relevance score is arbitrary.  It just depends on how the
modules are walked.  Therefore, we cannot do much, IMHO.
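
A toy example of how a package can be in the third score tier and still
show up 8th.  The scores 24, 20 and 14 are the ones mentioned above;
every package name except bind, and the walk order itself, are invented
for the illustration.  The name-based tie-break at the end is only one
possible idea, not something guix search is known to do.

```python
# Hypothetical results in module-walk order; names other than "bind" are made up.
results = [
    ("pkg-a", 14), ("pkg-b", 24), ("pkg-c", 14), ("pkg-d", 14),
    ("pkg-e", 20), ("pkg-f", 14), ("pkg-g", 14), ("bind", 14),
]

# Python's sort is stable: packages with equal scores keep their walk order.
ranked = sorted(results, key=lambda r: r[1], reverse=True)
position = [name for name, _ in ranked].index("bind") + 1

# Score tier: rank among the distinct scores, highest first.
tiers = sorted({score for _, score in results}, reverse=True)
tier = tiers.index(14) + 1

# One possible deterministic tie-break: sort equal scores by package name.
by_name = sorted(results, key=lambda r: (-r[1], r[0]))
position_by_name = [name for name, _ in by_name].index("bind") + 1

print(position, tier, position_by_name)
```

With this made-up walk order bind is in tier 3 but at position 8; with
the name tie-break it would land at position 3.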


Cheers,
simon


1: https://issues.guix.gnu.org/73220#1

2: Re: Search improvements (Was: Opposition to new single-letter package name "t")
zimoun <zimon.toutoune@HIDDEN>
Tue, 09 Mar 2021 19:37:23 +0100
id:CAJ3okZ3+hn0nJP98OhnZYLWJvhLGpdTUK+jB0hoM5JArQxO=zw@HIDDEN
https://lists.gnu.org/archive/html/guix-devel/2021-03
https://yhetil.org/guix/CAJ3okZ3+hn0nJP98OhnZYLWJvhLGpdTUK+jB0hoM5JArQxO=zw@HIDDEN





Last modified: Sun, 12 Jan 2025 05:45:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.