GNU bug report logs - #42162
gforge.inria.fr to be taken off-line in Dec. 2020


Package: guix; Severity: important; Reported by: Ludovic Courtès <ludovic.courtes@HIDDEN>; dated Thu, 2 Jul 2020 07:34:01 UTC; Maintainer for guix is bug-guix@HIDDEN.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 27 Aug 2020 18:07:19 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Aug 27 14:07:19 2020
Received: from localhost ([127.0.0.1]:43899 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1kBMIx-0005mn-45
	for submit <at> debbugs.gnu.org; Thu, 27 Aug 2020 14:07:19 -0400
Received: from imta-35.everyone.net ([216.200.145.35]:46824
 helo=imta-38.everyone.net)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <bokr@HIDDEN>) id 1kBMIv-0005me-82
 for 42162 <at> debbugs.gnu.org; Thu, 27 Aug 2020 14:07:18 -0400
Received: from pps.filterd (m0004961.ppops.net [127.0.0.1])
 by imta-38.everyone.net (8.16.0.27/8.16.0.27) with SMTP id 07RI43qe009175;
 Thu, 27 Aug 2020 11:07:15 -0700
Date: Thu, 27 Aug 2020 20:06:51 +0200
From: Bengt Richter <bokr@HIDDEN>
To: zimoun <zimon.toutoune@HIDDEN>
Subject: Re: bug#42162: Recovering source tarballs
Message-ID: <20200827180651.GA3255@LionPure>
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <875za4ykej.fsf@HIDDEN> <86blixyb7c.fsf@HIDDEN>
 <87k0xlaz8p.fsf@HIDDEN> <86lfi0e88r.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <86lfi0e88r.fsf@HIDDEN>
User-Agent: Mutt/1.10.1 (2018-07-13)
X-Spam-Score: -0.4 (/)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org, Timothy Sample <samplet@HIDDEN>,
 Maurice =?utf-8?Q?Br=C3=A9mond?= <Maurice.Bremond@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Reply-To: Bengt Richter <bokr@HIDDEN>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -1.4 (-)

Hi,

On +2020-08-27 11:41:24 +0200, zimoun wrote:
> Hi,
> 
> On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samplet@HIDDEN> wrote:
> > zimoun <zimon.toutoune@HIDDEN> writes:
> >
> >> One question is how this database scales?
> >>
> >> For example, a quick back-of-the-envelope estimate gives ~1.2GB of metadata
> >> for ~14k packages and then an increase of ~700MB per year, both with
> >> Ludo’s code [1].
> >>
> >> [1] <http://issues.guix.gnu.org/issue/42162#11>
> >
> > It’s a good question.  A good part of the size comes from the
> > representation rather than the data.  Compression helps a lot here.  I
> > have a database of 3,912 packages.  It’s 295M uncompressed (which is a
> > little better than your estimation).  If I pass each file through Lzip,
> > it shrinks down to 60M.  That’s more like 15.5K per package, which is
> > almost an order of magnitude smaller than the estimation you used
> > (120K).  I think that makes the numbers rather pleasant, but it comes at
> > the expense of easy storing in Git.
> 
> Thank you for these numbers.  Really interesting!
> 
> First, I do not know if the database needs to be stored with Git.  What
> should be the advantage? (naive question :-))
> 
> 
> On SWH T2430 [1], you explain the “default-header” trick to cut down the
> size.  Nice!
> 
> Moreover, the format is a long list, e.g.,
> 
> --8<---------------cut here---------------start------------->8---
> (headers

How about
    (X-v1-headers
(borrowing the "X-" prefix that RFC 2045 (MIME) uses to mark names that are
not yet a formal standard)?
The idea is to make it easy to script the change to "(headers" once
there is consensus for declaring a new standard. The "v1-" part would allow
a simultaneous "(X-v2-headers" alternative for zimoun's concise suggestion,
or even a base64 of a compressed format. There's a lot that could be borrowed
from the MIME RFCs :)

--8<---------------cut here---------------start------------->8---
6.3.  New Content-Transfer-Encodings

   Implementors may, if necessary, define private Content-Transfer-
   Encoding values, but must use an x-token, which is a name prefixed by
   "X-", to indicate its non-standard status, e.g., "Content-Transfer-
   Encoding: x-my-new-encoding".  Additional standardized Content-
   Transfer-Encoding values must be specified by a standards-track RFC.
   The requirements such specifications must meet are given in RFC 2048.
   As such, all content-transfer-encoding namespace except that
   beginning with "X-" is explicitly reserved to the IETF for future
   use.

   Unlike media types and subtypes, the creation of new Content-
   Transfer-Encoding values is STRONGLY discouraged, as it seems likely
   to hinder interoperability with little potential benefit
--8<---------------cut here---------------end--------------->8---
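
To illustrate how scriptable such a rename could be, here is a minimal Python sketch. The function name `promote_headers` is hypothetical; only the "(X-v1-headers" and "(headers" spellings come from the suggestion above. It is a plain textual substitution, so it would also rewrite a matching string that happened to appear inside a file name.

```python
import re

def promote_headers(text, version=1):
    """Rewrite the provisional "(X-vN-headers" tag to the final "(headers"."""
    # Match the tag only where it opens a list and is followed by
    # whitespace or a closing parenthesis, so longer names are left alone.
    pattern = r"\(X-v%d-headers(?=[\s)])" % version
    return re.sub(pattern, "(headers", text)

sample = '(X-v1-headers\n    ((name "raptor2-2.0.15/")\n     (mode 493)))'
print(promote_headers(sample))
```

Passing `version=2` would promote the "(X-v2-headers" variant instead, leaving v1 data untouched.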

>     ((name "raptor2-2.0.15/")
>      (mode 493)
If you want to be more human-readable with mode, I would put
a chmod argument in place of 493 :)

--8<---------------cut here---------------start------------->8---
$ printf "%o\n" 493
755
$ 
--8<---------------cut here---------------end--------------->8---
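
The same conversion can be sketched in Python, which also gives the symbolic form. This assumes the entry is a directory, which matches typeflag 53 (ASCII "5", the tar type for directories); the variable name `mode` is just for illustration.

```python
import stat

mode = 493                      # decimal value as stored in the metadata

print(format(mode, "o"))        # octal digits, the form chmod expects: 755

# stat.filemode() also needs the file-type bits; S_IFDIR marks a directory.
print(stat.filemode(stat.S_IFDIR | mode))   # drwxr-xr-x

# typeflag 53 is the ASCII code of "5", which tar uses for directories.
print(chr(53))
```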

Hm, could this be a security risk?
I mean, could a mode typo here inadvertently open a door for a nasty modification
by opportunistic code buried in a later-executed, apparently unrelated app?

>      (mtime 1414909500)
One of these might be more human-recognizable :)
--8<---------------cut here---------------start------------->8---
$ date --date='@1414909497' -Is
2014-11-02T07:24:57+01:00
$ date --date='@1414909497' -uIs
2014-11-02T06:24:57+00:00
$ TZ=America/Buenos_Aires date --date='@1414909497' -Is
2014-11-02T03:24:57-03:00
$
$ date --date='@1414909497' -u '+%Y%m%d_%H%M%S'
20141102_062457
# vs 1414909497, which, yes, costs 5 chars less
$ 
--8<---------------cut here---------------end--------------->8---
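
For tools that read the metadata rather than humans, the conversion is cheap in either direction. A minimal Python sketch of the round trip, using the same timestamp as the shell session above (the variable names are illustrative only):

```python
from datetime import datetime, timezone

mtime = 1414909497              # seconds since the Unix epoch

# Interpret the integer as UTC so the result does not depend on the local zone.
utc = datetime.fromtimestamp(mtime, tz=timezone.utc)

print(utc.isoformat())                  # 2014-11-02T06:24:57+00:00
print(utc.strftime("%Y%m%d_%H%M%S"))    # 20141102_062457

# Round trip back to the integer stored in the metadata.
assert int(utc.timestamp()) == mtime
```

Storing the integer and formatting it only for display keeps the database free of time-zone ambiguity.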

>      (chksum 4225)
>      (typeflag 53))
>     ((name "raptor2-2.0.15/build/")
>      (mode 493)
>      (mtime 1414909497)
>      (chksum 4797)
>      (typeflag 53))
>     ((name "raptor2-2.0.15/build/ltversion.m4")
>      (size 690)
>      (mtime 1414908273)
>      (chksum 5958))
> 
>      […])
> --8<---------------cut here---------------end--------------->8---
> 
> which is human-readable.  Is it useful?
> 
> 
> Instead, one could imagine shorter keywords:
>
(X-v2-headers  ;; ;-)
>     ((na "raptor2-2.0.15/")
>      (mo 493)
>      (mt 1414909500)
>      (ch 4225)
>      (ty 53))
> 
> which using your database (commit fc50927) reduces from 295MB to 279MB.
> 
> Or even plain list:
>
(X-v3-headers
>    (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
>    (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)
> 
> where the first element provides the “type” of list to ease the reader.
> 
> 
> Well, the 2 naive questions are: does it make sense to
>  - have the database stored under Git?
>  - have a human-readable format?
> 
> 
> Thank you again for pushing forward this topic. :-)
> 
> All the best,
> simon
> 
> [1] https://forge.softwareheritage.org/T2430#47522
> 
> 
> 

Prefixing "X-" can obviously be used with any tentative name for anything.

I am suggesting it as a counter to premature (and likely clashing) bindings
of valuable names, which IMO is as bad as premature optimization :)

Naming is too important to be defined by first-user flag-planting, ISTM.
-- 
Regards,
Bengt Richter




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 27 Aug 2020 12:50:05 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Aug 27 08:50:05 2020
Received: from localhost ([127.0.0.1]:42064 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1kBHLw-0005Oe-Sb
	for submit <at> debbugs.gnu.org; Thu, 27 Aug 2020 08:50:05 -0400
Received: from eggs.gnu.org ([209.51.188.92]:52292)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <ludo@HIDDEN>) id 1kBHLt-0005Nk-0o
 for 42162 <at> debbugs.gnu.org; Thu, 27 Aug 2020 08:50:03 -0400
Received: from fencepost.gnu.org ([2001:470:142:3::e]:51916)
 by eggs.gnu.org with esmtp (Exim 4.90_1)
 (envelope-from <ludo@HIDDEN>)
 id 1kBHLm-0007j1-Fa; Thu, 27 Aug 2020 08:49:54 -0400
Received: from [2001:660:6102:320:e120:2c8f:8909:cdfe] (port=41772 helo=ribbon)
 by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256)
 (Exim 4.82) (envelope-from <ludo@HIDDEN>)
 id 1kBHLl-0001Kd-QN; Thu, 27 Aug 2020 08:49:54 -0400
From: =?utf-8?Q?Ludovic_Court=C3=A8s?= <ludo@HIDDEN>
To: zimoun <zimon.toutoune@HIDDEN>
Subject: Re: bug#42162: Recovering source tarballs
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <875za4ykej.fsf@HIDDEN> <86blixyb7c.fsf@HIDDEN>
 <87k0xlaz8p.fsf@HIDDEN> <86lfi0e88r.fsf@HIDDEN>
X-URL: http://www.fdn.fr/~lcourtes/
X-Revolutionary-Date: 11 Fructidor an 228 de la =?utf-8?Q?R=C3=A9volution?=
X-PGP-Key-ID: 0x090B11993D9AEBB5
X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc
X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4  0CFB 090B 1199 3D9A EBB5
X-OS: x86_64-pc-linux-gnu
Date: Thu, 27 Aug 2020 14:49:51 +0200
In-Reply-To: <86lfi0e88r.fsf@HIDDEN> (zimoun's message of "Thu, 27 Aug 2020
 11:41:24 +0200")
Message-ID: <87lfi0tfrk.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org, Timothy Sample <samplet@HIDDEN>,
 Maurice =?utf-8?Q?Br=C3=A9mond?= <Maurice.Bremond@HIDDEN>

Hi!

zimoun <zimon.toutoune@HIDDEN> skribis:

> Moreover, the format is a long list, e.g.,
>
> (headers
>     ((name "raptor2-2.0.15/")
>      (mode 493)
>      (mtime 1414909500)
>      (chksum 4225)
>      (typeflag 53))
>     ((name "raptor2-2.0.15/build/")
>      (mode 493)
>      (mtime 1414909497)
>      (chksum 4797)
>      (typeflag 53))
>     ((name "raptor2-2.0.15/build/ltversion.m4")
>      (size 690)
>      (mtime 1414908273)
>      (chksum 5958))
>
>      […])
>
> which is human-readable.  Is it useful?
>
>
> Instead, one could imagine shorter keywords:
>
>     ((na "raptor2-2.0.15/")
>      (mo 493)
>      (mt 1414909500)
>      (ch 4225)
>      (ty 53))
>
> which using your database (commit fc50927) reduces from 295MB to 279MB.

I think it’s nice, at least at this stage, that it’s
human-readable: “premature optimization is the root of all evil”.  :-)

I guess it won’t be difficult to make the format more dense eventually
if that is deemed necessary, using ‘write’ instead of ‘pretty-print’,
using tricks like you write, or even going binary as a last resort.

Ludo’.




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 27 Aug 2020 09:41:36 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Aug 27 05:41:36 2020
Received: from localhost ([127.0.0.1]:41639 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1kBEPY-0003wA-C3
	for submit <at> debbugs.gnu.org; Thu, 27 Aug 2020 05:41:36 -0400
Received: from mail-wm1-f49.google.com ([209.85.128.49]:51782)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <zimon.toutoune@HIDDEN>) id 1kBEPW-0003vw-3f
 for 42162 <at> debbugs.gnu.org; Thu, 27 Aug 2020 05:41:35 -0400
Received: by mail-wm1-f49.google.com with SMTP id w2so4318609wmi.1
 for <42162 <at> debbugs.gnu.org>; Thu, 27 Aug 2020 02:41:34 -0700 (PDT)
Received: from lili (57.246.195.77.rev.sfr.net. [77.195.246.57])
 by smtp.gmail.com with ESMTPSA id v8sm4594222wrm.53.2020.08.27.02.41.26
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Thu, 27 Aug 2020 02:41:27 -0700 (PDT)
From: zimoun <zimon.toutoune@HIDDEN>
To: Timothy Sample <samplet@HIDDEN>
Subject: Re: bug#42162: Recovering source tarballs
In-Reply-To: <87k0xlaz8p.fsf@HIDDEN>
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <875za4ykej.fsf@HIDDEN> <86blixyb7c.fsf@HIDDEN>
 <87k0xlaz8p.fsf@HIDDEN>
Date: Thu, 27 Aug 2020 11:41:24 +0200
Message-ID: <86lfi0e88r.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 Maurice =?utf-8?Q?Br=C3=A9mond?= <Maurice.Bremond@HIDDEN>,
 Ludovic =?utf-8?Q?Court=C3=A8s?= <ludo@HIDDEN>

Hi,

On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samplet@HIDDEN> wrote:
> zimoun <zimon.toutoune@HIDDEN> writes:
>
>> One question is how this database scales?
>>
>> For example, a quick back-of-the-envelope estimate gives ~1.2GB of metadata
>> for ~14k packages and then an increase of ~700MB per year, both with
>> Ludo’s code [1].
>>
>> [1] <http://issues.guix.gnu.org/issue/42162#11>
>
> It’s a good question.  A good part of the size comes from the
> representation rather than the data.  Compression helps a lot here.  I
> have a database of 3,912 packages.  It’s 295M uncompressed (which is a
> little better than your estimation).  If I pass each file through Lzip,
> it shrinks down to 60M.  That’s more like 15.5K per package, which is
> almost an order of magnitude smaller than the estimation you used
> (120K).  I think that makes the numbers rather pleasant, but it comes at
> the expense of easy storing in Git.

Thank you for these numbers.  Really interesting!

First, I do not know if the database needs to be stored with Git.  What
should be the advantage? (naive question :-))


On SWH T2430 [1], you explain the “default-header” trick to cut down the
size.  Nice!

Moreover, the format is a long list, e.g.,

--8<---------------cut here---------------start------------->8---
(headers
    ((name "raptor2-2.0.15/")
     (mode 493)
     (mtime 1414909500)
     (chksum 4225)
     (typeflag 53))
    ((name "raptor2-2.0.15/build/")
     (mode 493)
     (mtime 1414909497)
     (chksum 4797)
     (typeflag 53))
    ((name "raptor2-2.0.15/build/ltversion.m4")
     (size 690)
     (mtime 1414908273)
     (chksum 5958))

     […])
--8<---------------cut here---------------end--------------->8---

which is human-readable.  Is it useful?


Instead, one could imagine shorter keywords:

    ((na "raptor2-2.0.15/")
     (mo 493)
     (mt 1414909500)
     (ch 4225)
     (ty 53))

which using your database (commit fc50927) reduces from 295MB to 279MB.

Or even plain list:

   (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
   (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)

where the first element provides the “type” of list to ease the reader.
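
To make the three variants concrete, here is a small Python sketch that prints the same two entries in each form and compares their lengths. It is only an illustration, not Disarchive's serializer: the "si" abbreviation for "size", the integer type tags (standing in for \x00 and \x01), and all function names are assumptions, and strings are written without any escaping.

```python
# Two entries from the example above, as plain dicts.
ENTRIES = [
    {"name": "raptor2-2.0.15/", "mode": 493, "mtime": 1414909500,
     "chksum": 4225, "typeflag": 53},
    {"name": "raptor2-2.0.15/build/ltversion.m4", "size": 690,
     "mtime": 1414908273, "chksum": 5958},
]

# Two-letter keywords; "si" is a guess, the others follow the message.
SHORT = {"name": "na", "mode": "mo", "size": "si", "mtime": "mt",
         "chksum": "ch", "typeflag": "ty"}

def atom(value):
    """Render one value; strings are quoted but not escaped (sketch only)."""
    return '"%s"' % value if isinstance(value, str) else str(value)

def keyed(entry, rename=None):
    """Association-list form, optionally with shortened keywords."""
    rename = rename or {}
    return "(%s)" % " ".join(
        "(%s %s)" % (rename.get(key, key), atom(value))
        for key, value in entry.items())

def positional(entry, layouts):
    """Plain-list form: a type tag followed by the bare values.

    The tag is the index of the entry's field layout in `layouts`,
    which a reader would need in order to name the fields again.
    """
    layout = tuple(entry)
    if layout not in layouts:
        layouts.append(layout)
    return "(%d %s)" % (layouts.index(layout),
                        " ".join(atom(value) for value in entry.values()))

layouts = []
for entry in ENTRIES:
    forms = (keyed(entry), keyed(entry, SHORT), positional(entry, layouts))
    for form in forms:
        print(len(form), form)
```

The positional form only works if the `layouts` table is stored or agreed upon somewhere, which is the price paid for dropping the keywords.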


Well, the 2 naive questions are: does it make sense to
 - have the database stored under Git?
 - have a human-readable format?


Thank you again for pushing forward this topic. :-)

All the best,
simon

[1] https://forge.softwareheritage.org/T2430#47522




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 26 Aug 2020 21:12:00 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed Aug 26 17:12:00 2020
Received: from localhost ([127.0.0.1]:40981 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1kB2i8-0001u0-3m
	for submit <at> debbugs.gnu.org; Wed, 26 Aug 2020 17:12:00 -0400
Received: from wout1-smtp.messagingengine.com ([64.147.123.24]:57449)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <samplet@HIDDEN>) id 1kB2i6-0001to-If
 for 42162 <at> debbugs.gnu.org; Wed, 26 Aug 2020 17:11:59 -0400
Received: from compute3.internal (compute3.nyi.internal [10.202.2.43])
 by mailout.west.internal (Postfix) with ESMTP id 952B91653;
 Wed, 26 Aug 2020 17:11:52 -0400 (EDT)
Received: from mailfrontend2 ([10.202.2.163])
 by compute3.internal (MEProxy); Wed, 26 Aug 2020 17:11:52 -0400
Received: from mrblack (74-116-186-44.qc.dsl.ebox.net [74.116.186.44])
 by mail.messagingengine.com (Postfix) with ESMTPA id 849C330600A3;
 Wed, 26 Aug 2020 17:11:51 -0400 (EDT)
From: Timothy Sample <samplet@HIDDEN>
To: zimoun <zimon.toutoune@HIDDEN>
Subject: Re: bug#42162: Recovering source tarballs
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <875za4ykej.fsf@HIDDEN> <86blixyb7c.fsf@HIDDEN>
Date: Wed, 26 Aug 2020 17:11:50 -0400
In-Reply-To: <86blixyb7c.fsf@HIDDEN> (zimoun's message of "Wed, 26 Aug 2020
 12:04:55 +0200")
Message-ID: <87k0xlaz8p.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: -0.7 (/)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 Maurice =?utf-8?Q?Br=C3=A9mond?= <Maurice.Bremond@HIDDEN>,
 Ludovic =?utf-8?Q?Court=C3=A8s?= <ludo@HIDDEN>

Hi zimoun,

zimoun <zimon.toutoune@HIDDEN> writes:

> One question is how this database scales?
>
> For example, a quick back-of-the-envelope estimate gives ~1.2GB of metadata
> for ~14k packages and then an increase of ~700MB per year, both with
> Ludo’s code [1].
>
> [1] <http://issues.guix.gnu.org/issue/42162#11>

It’s a good question.  A good part of the size comes from the
representation rather than the data.  Compression helps a lot here.  I
have a database of 3,912 packages.  It’s 295M uncompressed (which is a
little better than your estimation).  If I pass each file through Lzip,
it shrinks down to 60M.  That’s more like 15.5K per package, which is
almost an order of magnitude smaller than the estimation you used
(120K).  I think that makes the numbers rather pleasant, but it comes at
the expense of easy storing in Git.
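
The per-package figures can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch: it assumes 1M = 1024K (with 1M = 1000K the compressed figure comes out nearer 15.3K, so the 15.5K above sits between the two), and the extrapolation to ~14k packages is a naive linear one, not a measurement.

```python
PACKAGES = 3912          # packages in the database
UNCOMPRESSED_M = 295     # total size, uncompressed
LZIP_M = 60              # total size after Lzip

def per_package_k(total_m, packages, k_per_m=1024):
    """Average size per package in K, given a total in M."""
    return total_m * k_per_m / packages

raw_k = per_package_k(UNCOMPRESSED_M, PACKAGES)
lz_k = per_package_k(LZIP_M, PACKAGES)

print("uncompressed: %.1fK per package" % raw_k)
print("lzip:         %.1fK per package" % lz_k)
print("compression:  %.1fx" % (UNCOMPRESSED_M / LZIP_M))

# Naive linear extrapolation to the ~14k packages used in the estimate.
print("14k packages: ~%.0fM compressed" % (lz_k * 14000 / 1024))
```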

> As mentioned [2], should this service be part of SWH (download cooking
> task)?  Or project side?
>
> [2] <https://forge.softwareheritage.org/T2430#47486>

It would be interesting to just have SWH absorb the project.  Since
other distros already know how to produce a “sources.json” and how to
query the SWH archive, it would mean that they benefit for free (and so
would Guix, for that matter).  I’m open to that, but right now having
the freedom to experiment is important.


-- Tim




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 26 Aug 2020 10:05:12 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed Aug 26 06:05:12 2020
Received: from localhost ([127.0.0.1]:37434 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1kAsIq-0005YX-1k
	for submit <at> debbugs.gnu.org; Wed, 26 Aug 2020 06:05:12 -0400
Received: from mail-wr1-f65.google.com ([209.85.221.65]:46966)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <zimon.toutoune@HIDDEN>) id 1kAsIi-0005XL-2s
 for 42162 <at> debbugs.gnu.org; Wed, 26 Aug 2020 06:05:10 -0400
Received: by mail-wr1-f65.google.com with SMTP id r15so1167666wrp.13
 for <42162 <at> debbugs.gnu.org>; Wed, 26 Aug 2020 03:05:04 -0700 (PDT)
Received: from lili (57.246.195.77.rev.sfr.net. [77.195.246.57])
 by smtp.gmail.com with ESMTPSA id a74sm4506921wme.11.2020.08.26.03.04.56
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Wed, 26 Aug 2020 03:04:57 -0700 (PDT)
From: zimoun <zimon.toutoune@HIDDEN>
To: Timothy Sample <samplet@HIDDEN>, Ludovic =?utf-8?Q?Court=C3=A8s?=
 <ludo@HIDDEN>
Subject: Re: bug#42162: Recovering source tarballs
In-Reply-To: <875za4ykej.fsf@HIDDEN>
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <875za4ykej.fsf@HIDDEN>
Date: Wed, 26 Aug 2020 12:04:55 +0200
Message-ID: <86blixyb7c.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 Maurice =?utf-8?Q?Br=C3=A9mond?= <Maurice.Bremond@HIDDEN>

Dear Timothy,

On Thu, 30 Jul 2020 at 13:36, Timothy Sample <samplet@HIDDEN> wrote:

> I call the thing “Disarchive” as in “disassemble a source code archive”.
> You can find it at <https://git.ngyro.com/disarchive/>.  It has a simple
> command-line interface so you can do
>
>     $ disarchive save software-1.0.tar.gz
>
> which serializes a disassembled version of “software-1.0.tar.gz” to the
> database (which is just a directory) specified by the “DISARCHIVE_DB”
> environment variable.  Next, you can run
>
>     $ disarchive load hash-of-something-in-the-db
>
> which will recover an original file from its metadata (stored in the
> database) and data retrieved from the SWH archive or taken from a cache
> (again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Really nice!  Thank you!


>> I think we’d have to maintain a database that maps tarball hashes to
>> metadata (!).  A simple version of it could be a Git repo where, say,
>> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
>> contain the metadata above.  The nice thing is that the Git repo itself
>> could be archived by SWH.  :-)
>
> You mean like <https://git.ngyro.com/disarchive-db/>?  :)
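
The hash-to-path layout described in the quoted text can be sketched in a few lines of Python. This is an illustration only, not how Disarchive names its files: the function name is hypothetical, and the digest is written in hexadecimal, whereas the example hash above appears to use the base32 rendering common in Guix.

```python
import hashlib
from pathlib import Path

def metadata_path(db_dir, tarball_bytes):
    """Return the path under db_dir where a tarball's metadata would live.

    The path is derived only from the tarball's content, so anyone who
    knows the hash can look the metadata up without an index.
    """
    digest = hashlib.sha256(tarball_bytes).hexdigest()
    return Path(db_dir) / "sha256" / digest

print(metadata_path("disarchive-db", b"example tarball contents"))
```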

[...]

> This was generated by a little script built on top of “fold-packages”.
> It downloads Gzip’d tarballs used by Guix packages and passes them on to
> Disarchive for disassembly.  I limited the number to 100 because it’s
> slow and because I’m sure there is a long tail of weird software
> archives that are going to be hard to process.  The metadata directory
> ended up being 13M and the directory cache 2G.

One question is how this database scales.

For example, a quick back-of-the-envelope estimate gives ~1.2GB of metadata
for ~14k packages and then an increase of ~700MB per year, both with
Ludo’s code [1].

[1] <http://issues.guix.gnu.org/issue/42162#11>
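To make the estimate above concrete, here is a minimal sketch that
projects the metadata size over time.  The two figures come from the
estimate itself; the linear growth model and the function name are
assumptions made only for illustration.

```python
# Figures taken from the estimate above; linear growth is an assumption.
INITIAL_METADATA_GB = 1.2   # ~14k packages today
GROWTH_GB_PER_YEAR = 0.7    # estimated yearly increase


def projected_metadata_gb(years: float) -> float:
    """Return the projected metadata size in GB after `years` years."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return INITIAL_METADATA_GB + GROWTH_GB_PER_YEAR * years


if __name__ == "__main__":
    for years in (0, 1, 5, 10):
        print(f"after {years:2d} year(s): ~{projected_metadata_gb(years):.1f} GB")
```

Under these assumptions the metadata alone would pass 8GB after ten
years, which is why fetching single entries rather than the whole
database matters.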



> I could remove most of the Guix stuff so that it would be easy to
> package in Guix, Nix, Debian, etc.  Then, someone™ could write a service
> that consumes a “sources.json” file, adds the sources to a Disarchive
> database, and pushes everything to a Git repo.  I guess everyone who
> cares has to produce a “sources.json” file anyway, so it will be very
> little extra work.  Other stuff like changing the serialization format
> to JSON would be pretty easy, too.  I’m not well connected to these
> other projects, mind you, so I’m not really sure how to reach out.

This service could be really useful.  Yes, it could be easy to update
the database each time Guix produces a new “sources.json”.

As mentioned [2], should this service be part of SWH (download cooking
task)?  Or should it live on the project side?

[2] <https://forge.softwareheritage.org/T2430#47486>


Thank you again for this piece of work.

All the best,
simon




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 23 Aug 2020 16:21:15 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sun Aug 23 12:21:15 2020
Received: from localhost ([127.0.0.1]:54977 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1k9sk6-0006kl-NI
	for submit <at> debbugs.gnu.org; Sun, 23 Aug 2020 12:21:14 -0400
Received: from eggs.gnu.org ([209.51.188.92]:54288)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <ludo@HIDDEN>) id 1k9sk5-0006kV-RF
 for 42162 <at> debbugs.gnu.org; Sun, 23 Aug 2020 12:21:14 -0400
Received: from fencepost.gnu.org ([2001:470:142:3::e]:55390)
 by eggs.gnu.org with esmtp (Exim 4.90_1)
 (envelope-from <ludo@HIDDEN>)
 id 1k9sjz-0001SL-Vt; Sun, 23 Aug 2020 12:21:07 -0400
Received: from [2a01:e0a:1d:7270:af76:b9b:ca24:c465] (port=38296 helo=ribbon)
 by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256)
 (Exim 4.82) (envelope-from <ludo@HIDDEN>)
 id 1k9sjz-000278-FU; Sun, 23 Aug 2020 12:21:07 -0400
From: Ludovic Courtès <ludo@HIDDEN>
To: Timothy Sample <samplet@HIDDEN>
Subject: Re: bug#42162: Recovering source tarballs
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <875za4ykej.fsf@HIDDEN> <87bljvu4p4.fsf@HIDDEN>
 <87d047u0l3.fsf@HIDDEN> <87wo2dnhgb.fsf@HIDDEN>
 <874kpgudic.fsf@HIDDEN>
X-URL: http://www.fdn.fr/~lcourtes/
X-Revolutionary-Date: 7 Fructidor an 228 de la Révolution
X-PGP-Key-ID: 0x090B11993D9AEBB5
X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc
X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4  0CFB 090B 1199 3D9A EBB5
X-OS: x86_64-pc-linux-gnu
Date: Sun, 23 Aug 2020 18:21:05 +0200
In-Reply-To: <874kpgudic.fsf@HIDDEN> (Timothy Sample's message of "Wed, 05
 Aug 2020 14:57:31 -0400")
Message-ID: <87r1rxbafi.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 Maurice Brémond <Maurice.Bremond@HIDDEN>,
 zimoun <zimon.toutoune@HIDDEN>

Hello!

Timothy Sample <samplet@HIDDEN> skribis:

>> If we expose the database over HTTP (like over cgit), we can arrange so
>> that (guix download) simply GETs db.example.org/sha256/xyz.  No need to
>> fetch the whole database.
>>
>> It might be more reasonable to have a real database and a real service
>> around it, I’m sure Chris Baines would agree ;-), but we can choose URLs
>> that could easily be implemented by a “real” service instead of cgit in
>> the future.
>
> I got it working over cgit shortly after sending my last message.  :)  So
> far, I am very much on team “good enough for now”.

Wonderful.  :-)

>> Timothy Sample <samplet@HIDDEN> skribis:
>>
>>> I was imagining an escape hatch beyond this, where one could look up a
>>> provenance record from when Disarchive ingested and verified a source
>>> code archive.  The provenance record would tell you which version of
>>> Guix was used when saving the archive, so you could try your luck with
>>> using “guix time-machine” to reproduce Disarchive’s original
>>> computation.  If we perform database migrations, you would need to
>>> travel back in time in the database, too.  The idea is that you could
>>> work around breakages in Disarchive automatically using the Power of
>>> Guix™.  Just a stray thought, really.
>>
>> Seems to me it Shouldn’t Be Necessary?  :-)
>>
>> I mean, as long as the format is extensible and “future-proof”, we’ll
>> always be able to rebuild tarballs and then re-disassemble them if we
>> need to compute new hashes or whatever.
>
> If Disarchive relies on external compressors, there’s an outside chance
> that those compressors could change under our feet.  In that case, one
> would want to be able to track down exactly which version of XZ was used
> when Disarchive verified that it could reassemble a given source
> archive.

Oh, true.  Gzip and bzip2 are more-or-less “set in stone”, but xz, lzip,
or zstd could change.  Recording the exact version of the implementation
would be a good stopgap.

> Maybe I’m being paranoid, but if the database entries are being
> computed by the CI infrastructure it would be pretty easy to note the
> Guix commit just in case.

Yeah, that makes sense.  At least we could have “notes” in the file
format to store that kind of info.  Using CI is also a good idea.
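As a rough illustration of what such “notes” could look like, here is a
sketch of an entry that carries free-form provenance next to the
metadata.  Everything in it is hypothetical: the field names, the
placeholder values, and the use of JSON are assumptions for the sake of
the example, not Disarchive’s actual file format.

```python
import json


def make_entry(digest, compressor, notes=None):
    """Return a database entry as a dict; every field name is hypothetical."""
    return {
        "digest": digest,
        "compressor": compressor,
        # Free-form provenance, e.g. tool versions or the Guix commit in use.
        "notes": dict(notes or {}),
    }


entry = make_entry(
    "sha256:xyz",
    "xz",
    notes={"xz-version": "<version string>", "guix-commit": "<commit id>"},
)
print(json.dumps(entry, indent=2, sort_keys=True))
```

Keeping the notes in an open-ended mapping means new kinds of
provenance can be recorded later without changing the rest of the
entry.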

>> I was thinking that it might be best to not use Guix for computations.
>> For example, have “disarchive save” not build derivations and instead do
>> everything “here and now”.  That would make it easier for others to
>> adopt.  Wait, looking at the Git history, it looks like you already
>> addressed that point, neat.  :-)
>
> Since my last message I managed to remove Guix as a dependency completely.
> Right now it loads ‘(guix swh)’ opportunistically, but I might just copy
> the code in.  Directory references now support multiple “addresses” so
> that you could have Nix-style, SWH-style, IPFS-style, etc.  Hopefully my
> next message will have a WIP patch enabling Guix to use Disarchive!

Neat, looking forward to it!

Thank you,
Ludo’.




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 5 Aug 2020 18:57:42 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed Aug 05 14:57:42 2020
Received: from localhost ([127.0.0.1]:52256 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1k3Obd-0006ZC-Uw
	for submit <at> debbugs.gnu.org; Wed, 05 Aug 2020 14:57:42 -0400
Received: from wout4-smtp.messagingengine.com ([64.147.123.20]:37231)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <samplet@HIDDEN>) id 1k3Obb-0006Yy-EN
 for 42162 <at> debbugs.gnu.org; Wed, 05 Aug 2020 14:57:40 -0400
Received: from compute3.internal (compute3.nyi.internal [10.202.2.43])
 by mailout.west.internal (Postfix) with ESMTP id 8951CB14;
 Wed,  5 Aug 2020 14:57:33 -0400 (EDT)
Received: from mailfrontend2 ([10.202.2.163])
 by compute3.internal (MEProxy); Wed, 05 Aug 2020 14:57:33 -0400
Received: from mrblack (74-116-186-44.qc.dsl.ebox.net [74.116.186.44])
 by mail.messagingengine.com (Postfix) with ESMTPA id 27710306005F;
 Wed,  5 Aug 2020 14:57:32 -0400 (EDT)
From: Timothy Sample <samplet@HIDDEN>
To: Ludovic Courtès <ludo@HIDDEN>
Subject: Re: bug#42162: Recovering source tarballs
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <875za4ykej.fsf@HIDDEN> <87bljvu4p4.fsf@HIDDEN>
 <87d047u0l3.fsf@HIDDEN> <87wo2dnhgb.fsf@HIDDEN>
Date: Wed, 05 Aug 2020 14:57:31 -0400
In-Reply-To: <87wo2dnhgb.fsf@HIDDEN> (Ludovic Courtès's message of "Wed, 05
 Aug 2020 19:14:12 +0200")
Message-ID: <874kpgudic.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: -0.7 (/)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 Maurice Brémond <Maurice.Bremond@HIDDEN>,
 zimoun <zimon.toutoune@HIDDEN>

Hey,

Ludovic Courtès <ludo@HIDDEN> writes:

> Note that we have <https://guix.gnu.org/sources.json>.  Last I checked,
> SWH was ingesting it in its “qualification” instance, so it should be
> ingesting it for good real soon if it’s not doing it already.

Oh fantastic!  I was going to volunteer to do it, so that’s one thing
off my list.

> One can easily write a procedure that takes a tarball and returns a
> <computed-file> that builds its database entry.  So at each commit, we’d
> just rebuild things that have changed.

That makes more sense.  I will give this a shot soon.

> If we expose the database over HTTP (like over cgit), we can arrange so
> that (guix download) simply GETs db.example.org/sha256/xyz.  No need to
> fetch the whole database.
>
> It might be more reasonable to have a real database and a real service
> around it, I’m sure Chris Baines would agree ;-), but we can choose URLs
> that could easily be implemented by a “real” service instead of cgit in
> the future.

I got it working over cgit shortly after sending my last message.  :)  So
far, I am very much on team “good enough for now”.

> Timothy Sample <samplet@HIDDEN> skribis:
>
>> I was imagining an escape hatch beyond this, where one could look up a
>> provenance record from when Disarchive ingested and verified a source
>> code archive.  The provenance record would tell you which version of
>> Guix was used when saving the archive, so you could try your luck with
>> using “guix time-machine” to reproduce Disarchive’s original
>> computation.  If we perform database migrations, you would need to
>> travel back in time in the database, too.  The idea is that you could
>> work around breakages in Disarchive automatically using the Power of
>> Guix™.  Just a stray thought, really.
>
> Seems to me it Shouldn’t Be Necessary?  :-)
>
> I mean, as long as the format is extensible and “future-proof”, we’ll
> always be able to rebuild tarballs and then re-disassemble them if we
> need to compute new hashes or whatever.

If Disarchive relies on external compressors, there’s an outside chance
that those compressors could change under our feet.  In that case, one
would want to be able to track down exactly which version of XZ was used
when Disarchive verified that it could reassemble a given source
archive.  Maybe I’m being paranoid, but if the database entries are
being computed by the CI infrastructure it would be pretty easy to note
the Guix commit just in case.

> I was thinking that it might be best to not use Guix for computations.
> For example, have “disarchive save” not build derivations and instead do
> everything “here and now”.  That would make it easier for others to
> adopt.  Wait, looking at the Git history, it looks like you already
> addressed that point, neat.  :-)

Since my last message I managed to remove Guix as a dependency completely.
Right now it loads ‘(guix swh)’ opportunistically, but I might just copy
the code in.  Directory references now support multiple “addresses” so
that you could have Nix-style, SWH-style, IPFS-style, etc.  Hopefully my
next message will have a WIP patch enabling Guix to use Disarchive!


-- Tim




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 5 Aug 2020 17:14:36 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed Aug 05 13:14:36 2020
Received: from localhost ([127.0.0.1]:52144 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1k3Mzg-0001WW-3N
	for submit <at> debbugs.gnu.org; Wed, 05 Aug 2020 13:14:35 -0400
Received: from eggs.gnu.org ([209.51.188.92]:52982)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <ludo@HIDDEN>) id 1k3Mzd-0001WE-Ni
 for 42162 <at> debbugs.gnu.org; Wed, 05 Aug 2020 13:14:22 -0400
Received: from fencepost.gnu.org ([2001:470:142:3::e]:60296)
 by eggs.gnu.org with esmtp (Exim 4.90_1)
 (envelope-from <ludo@HIDDEN>)
 id 1k3MzX-0007Gl-4A; Wed, 05 Aug 2020 13:14:15 -0400
Received: from [2a01:e0a:1d:7270:af76:b9b:ca24:c465] (port=45254 helo=ribbon)
 by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256)
 (Exim 4.82) (envelope-from <ludo@HIDDEN>)
 id 1k3MzW-0001dk-9H; Wed, 05 Aug 2020 13:14:14 -0400
From: Ludovic Courtès <ludo@HIDDEN>
To: Timothy Sample <samplet@HIDDEN>
Subject: Re: bug#42162: Recovering source tarballs
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <875za4ykej.fsf@HIDDEN> <87bljvu4p4.fsf@HIDDEN>
 <87d047u0l3.fsf@HIDDEN>
X-URL: http://www.fdn.fr/~lcourtes/
X-Revolutionary-Date: 19 Thermidor an 228 de la Révolution
X-PGP-Key-ID: 0x090B11993D9AEBB5
X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc
X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4  0CFB 090B 1199 3D9A EBB5
X-OS: x86_64-pc-linux-gnu
Date: Wed, 05 Aug 2020 19:14:12 +0200
In-Reply-To: <87d047u0l3.fsf@HIDDEN> (Timothy Sample's message of "Mon, 03
 Aug 2020 12:59:52 -0400")
Message-ID: <87wo2dnhgb.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 Maurice Brémond <Maurice.Bremond@HIDDEN>,
 zimoun <zimon.toutoune@HIDDEN>

Hello!

Timothy Sample <samplet@HIDDEN> skribis:

> Ludovic Courtès <ludo@HIDDEN> writes:
>
>> Wooohoo!  Is it that time of the year when people give presents to one
>> another?  I can’t believe it.  :-)
>
> Not to be too cynical, but I think it’s just the time of year that I get
> frustrated with what I should be working on, and start fantasizing about
> green-field projects.  :p

:-)

>> Timothy Sample <samplet@HIDDEN> skribis:
>>
>>> The header and footer are read directly from the file.  Finding the
>>> compressor is harder.  I followed the approach taken by the pristine-tar
>>> project.  That is, try a bunch of compressors and hope for a match.
>>> Currently, I have:
>>>
>>>     • gnu-best
>>>     • gnu-best-rsync
>>>     • gnu
>>>     • gnu-rsync
>>>     • gnu-fast
>>>     • gnu-fast-rsync
>>>     • zlib-best
>>>     • zlib
>>>     • zlib-fast
>>>     • zlib-best-perl
>>>     • zlib-perl
>>>     • zlib-fast-perl
>>>     • gnu-best-rsync-1.4
>>>     • gnu-rsync-1.4
>>>     • gnu-fast-rsync-1.4
>>
>> I would have used the integers that zlib supports, but I guess that
>> doesn’t capture this whole gamut of compression setups.  And yeah, it’s
>> not great that we actually have to try and find the right compression
>> levels, but there’s no way around it, it seems, and as you write, we can
>> expect a couple of variants to be the most commonly used ones.
>
> My first instinct was “this is impossible – a DEFLATE compressor can do
> just about whatever it wants!”  Then I looked at pristine-tar and
> realized that their hack probably works pretty well.  If I had infinite
> time, I would think about some kind of fully general, parameterized LZ77
> algorithm that could describe any implementation.  If I had a lot of
> time I would peel back the curtain on Gzip and zlib and expose their
> tuning parameters.  That would be nicer, but keep in mind we will have
> to cover XZ, bzip2, and ZIP, too!  There’s a bit of balance between
> quality and coverage.  Any improvement to the representation of the
> compression algorithm could be implemented easily: just replace the
> names with their improved representation.

Yup, it makes sense to not spend too much time on this bit.  I guess
we’d already have good coverage with gzip and xz.
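The “try a bunch of compressors and hope for a match” approach quoted
above can be sketched in a few lines.  This is only an illustration
under stated assumptions: it uses Python's zlib module on zlib-format
streams (not gzip files), the candidate names are borrowed from the
list above, and their mapping to levels 1, 6 and 9 is a guess made for
the example, not what Disarchive or pristine-tar actually do.

```python
import zlib

# Hypothetical candidate table: the level assigned to each name is an
# assumption for illustration only.
CANDIDATES = {
    "zlib-fast": 1,
    "zlib": 6,
    "zlib-best": 9,
}


def find_compressor(compressed: bytes):
    """Return the name of the first candidate that reproduces `compressed`.

    Returns None when no candidate matches bit for bit.
    """
    data = zlib.decompress(compressed)
    for name, level in CANDIDATES.items():
        if zlib.compress(data, level) == compressed:
            return name
    return None


if __name__ == "__main__":
    payload = b"hello disarchive " * 1000
    blob = zlib.compress(payload, 9)
    print(find_compressor(blob))
```

Note that several settings can produce identical output for a given
input, so the name found is one setting that reproduces the bytes, not
necessarily the one originally used.  For reassembly that is enough.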

>> (BTW the code I posted or the one in Disarchive could perhaps replace
>> the one in Gash-Utils.  I was frustrated to not see a ‘fold-archive’
>> procedure there, notably.)
>
> I really like “fold-archive”.  One of the reasons I started doing this
> is to possibly share code with Gash-Utils.  It’s not as easy as I was
> hoping, but I’m planning on improving things there based on my
> experience here.  I’ve now worked with four Scheme tar implementations,
> maybe if I write a really good one I could cap that number at five!

Heh.  :-)  The needs are different anyway.  In Gash-Utils the focus is
probably on simplicity/maintainability, whereas here you really want to
cover all the details of the wire representation.

>>> To avoid hitting the SWH archive at all, I introduced a directory cache
>>> so that I can store the directories locally.  If the directory cache is
>>> available, directories are stored and retrieved from it.
>>
>> I guess we can get back to them eventually to estimate our coverage ratio.
>
> It would be nice to know, but pretty hard to find out with the rate
> limit.  I guess it will improve immensely when we set up a
> “sources.json” file.

Note that we have <https://guix.gnu.org/sources.json>.  Last I checked,
SWH was ingesting it in its “qualification” instance, so it should be
ingesting it for good real soon if it’s not doing it already.

>>> You mean like <https://git.ngyro.com/disarchive-db/>?  :)
>>
>> Woow.  :-)
>>
>> We could actually have a CI job to create the database: it would
>> basically do ‘disarchive save’ for each tarball and store that using a
>> layout like the one you used.  Then we could have a job somewhere that
>> periodically fetches that and adds it to the database.  WDYT?
>
> Maybe....  I assume that Disarchive would fail for a few of them.  We
> would need a plan for monitoring those failures so that Disarchive can
> be improved.  Also, unless I’m misunderstanding something, this means
> building the whole database at every commit, no?  That would take a lot
> of time and space.  On the other hand, it would be easy enough to try.
> If it works, it’s a lot easier than setting up a whole other service.

One can easily write a procedure that takes a tarball and returns a
<computed-file> that builds its database entry.  So at each commit, we’d
just rebuild things that have changed.
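The effect of “just rebuild things that have changed” can be pictured
as a set difference between the tarball hashes referenced at a commit
and the entries already in the database.  The sketch below is a
simplification with a hypothetical function name; in Guix the same
result would fall out of derivation caching rather than an explicit
comparison.

```python
def entries_to_build(current_hashes, existing_entries):
    """Return hashes referenced now that have no database entry yet, sorted."""
    return sorted(set(current_hashes) - set(existing_entries))


if __name__ == "__main__":
    at_commit = ["aaa", "bbb", "ccc"]   # hashes referenced by this commit
    in_database = ["aaa"]               # entries built for earlier commits
    print(entries_to_build(at_commit, in_database))
```

The work per commit is then proportional to the number of new or
changed tarballs, not to the size of the whole package set.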

>> I think we should leave room for other hash algorithms (in the sexps
>> above too).
>
> It works for different hash algorithms, but not for different directory
> hashing methods (like you mention below).

OK.

[...]

>> So it does mean that we could pretty much right away add a fall-back in
>> (guix download) that looks up tarballs in your database and uses
>> Disarchive to reconstruct it, right?  I love solved problems.  :-)
>>
>> Of course we could improve Disarchive and the database, but it seems to
>> me that we already have enough to improve the situation.  WDYT?
>
> I would say that we are darn close!  In theory it would work.  It would
> be much more practical if we had better coverage in the SWH archive
> (i.e., “sources.json”) and a way to get metadata for a source archive
> without downloading the entire Disarchive database.  It’s 13M now, but
> it will likely be 500M with all the Gzip’d tarballs from a recent commit
> of Guix.  It will only grow after that, too.

If we expose the database over HTTP (like over cgit), we can arrange so
that (guix download) simply GETs db.example.org/sha256/xyz.  No need to
fetch the whole database.
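A minimal sketch of the lookup side: building the URL of one entry from
the hash algorithm and the digest, following the
“db.example.org/sha256/xyz” shape given above.  The host is the
placeholder from this message; the https scheme, the function name and
the percent-encoding of path components are assumptions.

```python
from urllib.parse import quote

# "db.example.org" is the placeholder host from the message; https is assumed.
DEFAULT_BASE_URL = "https://db.example.org"


def metadata_url(algorithm: str, digest: str, base_url: str = DEFAULT_BASE_URL) -> str:
    """Build the URL of a single database entry: <base>/<algorithm>/<digest>."""
    if not algorithm or not digest:
        raise ValueError("algorithm and digest must be non-empty")
    base = base_url.rstrip("/")
    return f"{base}/{quote(algorithm, safe='')}/{quote(digest, safe='')}"


if __name__ == "__main__":
    print(metadata_url("sha256", "xyz"))
```

Because the URL depends only on the algorithm and the digest, the same
scheme could be served by cgit today and by a dedicated service later.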

It might be more reasonable to have a real database and a real service
around it, I’m sure Chris Baines would agree ;-), but we can choose URLs
that could easily be implemented by a “real” service instead of cgit in
the future.

> Of course those are not hard blockers, so ‘(guix download)’ could start
> using Disarchive as soon as we package it.  I’ve started looking into
> it, but I’m confused about getting access to Disarchive from the
> “out-of-band” download system.  Would it have to become a dependency of
> Guix?

Yes.  It could be a behind-the-scenes dependency of “builtin:download”;
it doesn’t have to be a dependency of each and every fixed-output
derivation.

> I was imagining an escape hatch beyond this, where one could look up a
> provenance record from when Disarchive ingested and verified a source
> code archive.  The provenance record would tell you which version of
> Guix was used when saving the archive, so you could try your luck with
> using “guix time-machine” to reproduce Disarchive’s original
> computation.  If we perform database migrations, you would need to
> travel back in time in the database, too.  The idea is that you could
> work around breakages in Disarchive automatically using the Power of
> Guix™.  Just a stray thought, really.

Seems to me it Shouldn’t Be Necessary?  :-)

I mean, as long as the format is extensible and “future-proof”, we’ll
always be able to rebuild tarballs and then re-disassemble them if we
need to compute new hashes or whatever.

>> If you feel like it, you’re welcome to point them to your work in the
>> discussion at <https://forge.softwareheritage.org/T2430>.  There’s one
>> person from NixOS (lewo) participating in the discussion and I’m sure
>> they’d be interested.  Perhaps they’ll tell whether they care about
>> having it available as JSON.
>
> Good idea.  I will work out a few more kinks and then bring it up there.
> I’ve already rewritten the parts that used the Guix daemon.  Disarchive
> now only needs a handful of Guix modules ('base32', 'serialization', and
> 'swh' are the ones that would be hard to remove).

An option would be to use (gcrypt base64); another one would be to
bundle (guix base32).

I was thinking that it might be best to not use Guix for computations.
For example, have “disarchive save” not build derivations and instead do
everything “here and now”.  That would make it easier for others to
adopt.  Wait, looking at the Git history, it looks like you already
addressed that point, neat.  :-)

Thank you!

Ludo’.




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 3 Aug 2020 21:10:43 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Aug 03 17:10:43 2020
Received: from localhost ([127.0.0.1]:46173 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1k2hjG-0006iU-SZ
	for submit <at> debbugs.gnu.org; Mon, 03 Aug 2020 17:10:43 -0400
Received: from sender4-of-o51.zoho.com ([136.143.188.51]:21138)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <rekado@HIDDEN>) id 1k2hjC-0006iH-AX
 for 42162 <at> debbugs.gnu.org; Mon, 03 Aug 2020 17:10:42 -0400
Received: from localhost (p54ad4b82.dip0.t-ipconnect.de [84.173.75.130]) by
 mx.zohomail.com with SMTPS id 1596489029919917.0914740202434;
 Mon, 3 Aug 2020 14:10:29 -0700 (PDT)
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <CAJ3okZ2iesMKLLD29qrOJzNBjV=haoPFtpRg9T=0aTA8ZxLQMA@HIDDEN>
User-agent: mu4e 1.4.10; emacs 26.3
From: Ricardo Wurmus <rekado@HIDDEN>
To: zimoun <zimon.toutoune@HIDDEN>
Subject: Re: bug#42162: Recovering source tarballs
In-reply-to: <CAJ3okZ2iesMKLLD29qrOJzNBjV=haoPFtpRg9T=0aTA8ZxLQMA@HIDDEN>
X-URL: https://elephly.net
X-PGP-Key: https://elephly.net/rekado.pubkey
X-PGP-Fingerprint: BCA6 89B6 3655 3801 C3C6  2150 197A 5888 235F ACAC
Date: Mon, 03 Aug 2020 23:10:26 +0200
Message-ID: <87r1snfnb1.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain
X-ZohoMailClient: External
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org


zimoun <zimon.toutoune@HIDDEN> writes:

> Yes, but for example all the packages in gnu/packages/bioconductor.scm
> could be "git-fetch".  Today the source is over url-fetch but it could
> be over git-fetch with https://git.bioconductor.org/packages/flowCore or
> git@HIDDEN:packages/flowCore.

We should do that (and soon), especially because Bioconductor does not
keep an archive of old releases.  We can discuss this on a separate
issue lest we derail the discussion at hand.

-- 
Ricardo




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 3 Aug 2020 17:00:05 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Aug 03 13:00:05 2020
From: Timothy Sample <samplet@HIDDEN>
To: Ludovic Courtès <ludo@HIDDEN>
Subject: Re: bug#42162: Recovering source tarballs
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <875za4ykej.fsf@HIDDEN> <87bljvu4p4.fsf@HIDDEN>
Date: Mon, 03 Aug 2020 12:59:52 -0400
Message-ID: <87d047u0l3.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: -1.7 (-)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 Maurice Brémond <Maurice.Bremond@HIDDEN>,
 zimoun <zimon.toutoune@HIDDEN>

Hi Ludovic,

Ludovic Courtès <ludo@HIDDEN> writes:

> Wooohoo!  Is it that time of the year when people give presents to one
> another?  I can’t believe it.  :-)

Not to be too cynical, but I think it’s just the time of year that I get
frustrated with what I should be working on, and start fantasizing about
green-field projects.  :p

> Timothy Sample <samplet@HIDDEN> skribis:
>
>> The header and footer are read directly from the file.  Finding the
>> compressor is harder.  I followed the approach taken by the pristine-tar
>> project.  That is, try a bunch of compressors and hope for a match.
>> Currently, I have:
>>
>>     • gnu-best
>>     • gnu-best-rsync
>>     • gnu
>>     • gnu-rsync
>>     • gnu-fast
>>     • gnu-fast-rsync
>>     • zlib-best
>>     • zlib
>>     • zlib-fast
>>     • zlib-best-perl
>>     • zlib-perl
>>     • zlib-fast-perl
>>     • gnu-best-rsync-1.4
>>     • gnu-rsync-1.4
>>     • gnu-fast-rsync-1.4
>
> I would have used the integers that zlib supports, but I guess that
> doesn’t capture this whole gamut of compression setups.  And yeah, it’s
> not great that we actually have to try and find the right compression
> levels, but there’s no way around it, it seems, and as you write, we can
> expect a couple of variants to be the most commonly used ones.

My first instinct was “this is impossible – a DEFLATE compressor can do
just about whatever it wants!”  Then I looked at pristine-tar and
realized that their hack probably works pretty well.  If I had infinite
time, I would think about some kind of fully general, parameterized LZ77
algorithm that could describe any implementation.  If I had a lot of
time I would peel back the curtain on Gzip and zlib and expose their
tuning parameters.  That would be nicer, but keep in mind we will have
to cover XZ, bzip2, and ZIP, too!  There’s a bit of balance between
quality and coverage.  Any improvement to the representation of the
compression algorithm could be implemented easily: just replace the
names with their improved representation.

One thing pristine-tar does is reorder the compressor list based on the
input metadata.  A Gzip member usually stores its compression level, so
it makes sense to try everything at that level first before moving on.
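
To make the reordering idea concrete, here is a minimal Python sketch.
It is not pristine-tar's or Disarchive's code.  It assumes the level hint
comes from the XFL byte of the Gzip header (in RFC 1952, 2 means maximum
compression and 4 means fastest); the function names and the rule that
matches a candidate name to a hint are made up for illustration.

```python
# Candidate names are copied from the list above.  Matching them to a
# level by looking for "best" or "fast" in the name is an assumption.
CANDIDATES = [
    "gnu-best", "gnu-best-rsync", "gnu", "gnu-rsync", "gnu-fast",
    "gnu-fast-rsync", "zlib-best", "zlib", "zlib-fast", "zlib-best-perl",
    "zlib-perl", "zlib-fast-perl", "gnu-best-rsync-1.4", "gnu-rsync-1.4",
    "gnu-fast-rsync-1.4",
]


def level_hint(gzip_bytes):
    """Return "best", "fast", or None based on the XFL header byte."""
    if gzip_bytes[:2] != b"\x1f\x8b" or gzip_bytes[2] != 8:
        raise ValueError("not a DEFLATE-compressed Gzip member")
    # Byte 8 of the fixed 10-byte header is XFL.
    return {2: "best", 4: "fast"}.get(gzip_bytes[8])


def name_level(name):
    """Hypothetical helper: guess the level a candidate name stands for."""
    if "best" in name:
        return "best"
    if "fast" in name:
        return "fast"
    return None


def reorder(candidates, hint):
    """Move candidates that agree with the hint to the front.

    sorted() is stable, so candidates keep their relative order within
    the "matches" and "does not match" groups.
    """
    return sorted(candidates, key=lambda name: name_level(name) != hint)


# A hand-built header: magic, CM=8, FLG=0, MTIME=0, XFL=2, OS=3.
example_header = b"\x1f\x8b\x08\x00" + bytes(4) + b"\x02\x03"
ordered = reorder(CANDIDATES, level_hint(example_header))
```

Nothing is dropped from the list; the hint only changes which
candidates are tried first, so a misleading XFL byte costs time but not
correctness.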

>> Originally, I used your code, but I ran into some problems.  Namely,
>> real tarballs are not well-behaved.  I wrote new code to keep track of
>> subtle things like the formatting of the octal values.
>
> Yeah I guess I was too optimistic.  :-)  I wanted to have the
> serialization/deserialization code automatically generated by that
> macro, but yeah, it doesn’t capture enough details for real-world
> tarballs.

I enjoyed your implementation!  I might even bring back its style.  It
was a little stiff for trying to figure out exactly what I needed for
reproducing the tarballs.

> Do you know how frequently you get “weird” tarballs?  I was thinking
> about having something that works for plain GNU tar, but it’s even
> better to have something that works with “unusual” tarballs!

I don’t have hard numbers, but I would say that a good handful (5–10%)
have “X-format” fields, meaning their octal formatting is unusual.  (I’m
looking at “grep -A 10 default-header” over all the S-Exp files.)  The
most charming thing is the “uname” and “gname” fields.  For example,
“rtmidi-4.0.0” was made by “gary” from “staff”.  :)

> (BTW the code I posted or the one in Disarchive could perhaps replace
> the one in Gash-Utils.  I was frustrated to not see a ‘fold-archive’
> procedure there, notably.)

I really like “fold-archive”.  One of the reasons I started doing this
is to possibly share code with Gash-Utils.  It’s not as easy as I was
hoping, but I’m planning on improving things there based on my
experience here.  I’ve now worked with four Scheme tar implementations,
maybe if I write a really good one I could cap that number at five!

>> To avoid hitting the SWH archive at all, I introduced a directory cache
>> so that I can store the directories locally.  If the directory cache is
>> available, directories are stored and retrieved from it.
>
> I guess we can get back to them eventually to estimate our coverage ratio.

It would be nice to know, but pretty hard to find out with the rate
limit.  I guess it will improve immensely when we set up a
“sources.json” file.

>> You mean like <https://git.ngyro.com/disarchive-db/>?  :)
>
> Woow.  :-)
>
> We could actually have a CI job to create the database: it would
> basically do ‘disarchive save’ for each tarball and store that using a
> layout like the one you used.  Then we could have a job somewhere that
> periodically fetches that and adds it to the database.  WDYT?

Maybe....  I assume that Disarchive would fail for a few of them.  We
would need a plan for monitoring those failures so that Disarchive can
be improved.  Also, unless I’m misunderstanding something, this means
building the whole database at every commit, no?  That would take a lot
of time and space.  On the other hand, it would be easy enough to try.
If it works, it’s a lot easier than setting up a whole other service.

> I think we should leave room for other hash algorithms (in the sexps
> above too).

It works for different hash algorithms, but not for different directory
hashing methods (like you mention below).

>> This was generated by a little script built on top of “fold-packages”.
>> It downloads Gzip’d tarballs used by Guix packages and passes them on to
>> Disarchive for disassembly.  I limited the number to 100 because it’s
>> slow and because I’m sure there is a long tail of weird software
>> archives that are going to be hard to process.  The metadata directory
>> ended up being 13M and the directory cache 2G.
>
> Neat.
>
> So it does mean that we could pretty much right away add a fall-back in
> (guix download) that looks up tarballs in your database and uses
> Disarchive to reconstruct it, right?  I love solved problems.  :-)
>
> Of course we could improve Disarchive and the database, but it seems to
> me that we already have enough to improve the situation.  WDYT?

I would say that we are darn close!  In theory it would work.  It would
be much more practical if we had better coverage in the SWH archive
(i.e., “sources.json”) and a way to get metadata for a source archive
without downloading the entire Disarchive database.  It’s 13M now, but
it will likely be 500M with all the Gzip’d tarballs from a recent commit
of Guix.  It will only grow after that, too.

Of course those are not hard blockers, so ‘(guix download)’ could start
using Disarchive as soon as we package it.  I’ve started looking into
it, but I’m confused about getting access to Disarchive from the
“out-of-band” download system.  Would it have to become a dependency of
Guix?

>> Even with the code I have so far, I have a lot of questions.  Mainly I’m
>> worried about keeping everything working into the future.  It would be
>> easy to make incompatible changes.  A lot of care would have to be
>> taken.  Of course, keeping a Guix commit and a Disarchive commit might
>> be enough to make any assembling reproducible, but there’s a
>> chicken-and-egg problem there.
>
> The way I see it, Guix would always look up tarballs in the HEAD of the
> database (no need to pick a specific commit).  Worst that could happen
> is we reconstruct a tarball that doesn’t match, and so the daemon errors
> out.

I was imagining an escape hatch beyond this, where one could look up a
provenance record from when Disarchive ingested and verified a source
code archive.  The provenance record would tell you which version of
Guix was used when saving the archive, so you could try your luck with
using “guix time-machine” to reproduce Disarchive’s original
computation.  If we perform database migrations, you would need to
travel back in time in the database, too.  The idea is that you could
work around breakages in Disarchive automatically using the Power of
Guix™.  Just a stray thought, really.

> Regarding future-proofness, I think we must be super careful about the
> file formats (the sexps).  You did pay attention to not having implicit
> defaults, which is perfect.  Perhaps one thing to change (or perhaps
> it’s already there) is support for other hashes in those sexps: both
> hash algorithms and directory hash methods (SWH dir/Git tree, nar, Git
> tree with different hash algorithm, IPFS CID, etc.).  Also the ability
> to specify several hashes.
>
> That way we could “refresh” the database anytime by adding the hash du
> jour for already-present tarballs.

The hash algorithm is already configurable, but the directory hash
method is not.  You’re right that it should be, and that there should be
support for multiple digests.

>> What if a tarball from the closure of one of the derivations is missing?
>> I guess you could work around it, but it would be tricky.
>
> Well, more generally, we’ll have to monitor archive coverage.  But I
> don’t think the issue is specific to this method.

Again, I’m thinking about the case where I want to travel back in time
to reproduce a Disarchive computation.  It’s really an unlikely
scenario, I’m just trying to think of everything that could go wrong.

>>> Anyhow, we should team up with fellow NixOS and SWH hackers to address
>>> this, and with developers of other distros as well -- this problem is not
>>> just that of the functional deployment geeks, is it?
>>
>> I could remove most of the Guix stuff so that it would be easy to
>> package in Guix, Nix, Debian, etc.  Then, someone™ could write a service
>> that consumes a “sources.json” file, adds the sources to a Disarchive
>> database, and pushes everything to a Git repo.  I guess everyone who
>> cares has to produce a “sources.json” file anyway, so it will be very
>> little extra work.  Other stuff like changing the serialization format
>> to JSON would be pretty easy, too.  I’m not well connected to these
>> other projects, mind you, so I’m not really sure how to reach out.
>
> If you feel like it, you’re welcome to point them to your work in the
> discussion at <https://forge.softwareheritage.org/T2430>.  There’s one
> person from NixOS (lewo) participating in the discussion and I’m sure
> they’d be interested.  Perhaps they’ll tell whether they care about
> having it available as JSON.

Good idea.  I will work out a few more kinks and then bring it up there.
I’ve already rewritten the parts that used the Guix daemon.  Disarchive
now only needs a handful of Guix modules ('base32', 'serialization', and
'swh' are the ones that would be hard to remove).

>> Sorry about the big mess of code and ideas – I realize I may have taken
>> the “do-ocracy” approach a little far here.  :)  Even if this is not
>> “the” solution, hopefully it’s useful for discussion!
>
> You did great!  I had a very rough sketch and you did the real thing,
> that’s just awesome.  :-)
>
> Thanks a lot!

My pleasure!  Thanks for the feedback so far.


-- Tim




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 31 Jul 2020 14:42:13 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Jul 31 10:42:12 2020
From: Ludovic Courtès <ludo@HIDDEN>
To: Timothy Sample <samplet@HIDDEN>
Subject: Re: bug#42162: Recovering source tarballs
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <875za4ykej.fsf@HIDDEN>
X-URL: http://www.fdn.fr/~lcourtes/
X-Revolutionary-Date: 14 Thermidor an 228 de la Révolution
X-PGP-Key-ID: 0x090B11993D9AEBB5
X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc
X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4  0CFB 090B 1199 3D9A EBB5
X-OS: x86_64-pc-linux-gnu
Date: Fri, 31 Jul 2020 16:41:59 +0200
In-Reply-To: <875za4ykej.fsf@HIDDEN> (Timothy Sample's message of "Thu, 30
 Jul 2020 13:36:52 -0400")
Message-ID: <87bljvu4p4.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 Maurice Brémond <Maurice.Bremond@HIDDEN>,
 zimoun <zimon.toutoune@HIDDEN>

Hi Timothy!

Timothy Sample <samplet@HIDDEN> skribis:

> This jumped out at me because I have been working with compression and
> tarballs for the bootstrapping effort.  I started pulling some threads
> and doing some research, and ended up prototyping an end-to-end solution
> for decomposing a Gzip’d tarball into Gzip metadata, tarball metadata,
> and an SWH directory ID.  It can even put them back together!  :)  There
> are a bunch of problems still, but I think this project is doable in the
> short-term.  I’ve tested 100 arbitrary Gzip’d tarballs from Guix, and
> found and fixed a bunch of little gaffes.  There’s a ton of work to do,
> of course, but here’s another small step.
>
> I call the thing “Disarchive” as in “disassemble a source code archive”.
> You can find it at <https://git.ngyro.com/disarchive/>.  It has a simple
> command-line interface so you can do
>
>     $ disarchive save software-1.0.tar.gz
>
> which serializes a disassembled version of “software-1.0.tar.gz” to the
> database (which is just a directory) specified by the “DISARCHIVE_DB”
> environment variable.  Next, you can run
>
>     $ disarchive load hash-of-something-in-the-db
>
> which will recover an original file from its metadata (stored in the
> database) and data retrieved from the SWH archive or taken from a cache
> (again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Wooohoo!  Is it that time of the year when people give presents to one
another?  I can’t believe it.  :-)

> Now some implementation details.  The way I’ve set it up is that all of
> the assembly happens through Guix.  Each step in recreating a compressed
> tarball is a fixed-output derivation: the download from SWH, the
> creation of the tarball, and the compression.  I wanted an easy way to
> build and verify things according to a dependency graph without writing
> any code.  Hi Guix Daemon!  I’m not sure if this is a good long-term
> approach, though.  It could work well for reproducibility, but it might
> be easier to let some external service drive my code as a Guix package.
> Either way, it was an easy way to get started.
>
> For disassembly, it takes a Gzip file (containing a single member) and
> breaks it down like this:
>
>     (gzip-member
>       (version 0)
>       (name "hungrycat-0.4.1.tar.gz")
>       (input (sha256
>                "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1"))
>       (header
>         (mtime 0)
>         (extra-flags 2)
>         (os 3))
>       (footer
>         (crc 3863610951)
>         (isize 194560))
>       (compressor gnu-best)
>       (digest
>         (sha256
>           "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")))

Awesome.
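
For readers who want to see where the “header” and “footer” values in
such a record come from, here is a minimal Python sketch.  It is not
Disarchive's code; it only follows the Gzip layout from RFC 1952 (a
fixed 10-byte header and an 8-byte trailer holding CRC-32 and ISIZE,
both little-endian), and the function name and dictionary layout are
made up for illustration.

```python
import gzip
import zlib


def read_gzip_fields(member):
    """Extract the fixed header fields and the trailer of a Gzip member.

    Assumes a file containing a single member, so that the trailer is
    simply the last 8 bytes.
    """
    if member[:2] != b"\x1f\x8b" or member[2] != 8:
        raise ValueError("not a DEFLATE-compressed Gzip member")
    return {
        "header": {
            "mtime": int.from_bytes(member[4:8], "little"),
            "extra-flags": member[8],   # XFL
            "os": member[9],
        },
        "footer": {
            "crc": int.from_bytes(member[-8:-4], "little"),
            "isize": int.from_bytes(member[-4:], "little"),
        },
    }


# Stand-in data; a real run would read a tarball from disk instead.
payload = b"hungry cat\n" * 100
fields = read_gzip_fields(gzip.compress(payload, mtime=0))

# The trailer describes the uncompressed data, so it can be checked
# without knowing anything about the compressor.
footer_matches = (
    fields["footer"]["crc"] == zlib.crc32(payload)
    and fields["footer"]["isize"] == len(payload)
)
```

The header and trailer can be copied verbatim; only the DEFLATE stream
between them depends on which compressor produced it.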

> The header and footer are read directly from the file.  Finding the
> compressor is harder.  I followed the approach taken by the pristine-tar
> project.  That is, try a bunch of compressors and hope for a match.
> Currently, I have:
>
>     • gnu-best
>     • gnu-best-rsync
>     • gnu
>     • gnu-rsync
>     • gnu-fast
>     • gnu-fast-rsync
>     • zlib-best
>     • zlib
>     • zlib-fast
>     • zlib-best-perl
>     • zlib-perl
>     • zlib-fast-perl
>     • gnu-best-rsync-1.4
>     • gnu-rsync-1.4
>     • gnu-fast-rsync-1.4

I would have used the integers that zlib supports, but I guess that
doesn’t capture this whole gamut of compression setups.  And yeah, it’s
not great that we actually have to try and find the right compression
levels, but there’s no way around it, it seems, and as you write, we can
expect a couple of variants to be the most commonly used ones.
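
As a rough illustration of the “try and find the right compression
levels” step, here is a Python sketch restricted to the zlib integer
levels.  It is an assumption-laden toy: it does not cover GNU gzip's own
implementation or the rsync-friendly variants named in the list, and the
function names are hypothetical.

```python
import zlib


def raw_deflate(data, level):
    """Compress to a raw DEFLATE stream (no zlib or Gzip framing)."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def find_level(data, target_stream):
    """Return a zlib level that reproduces target_stream, or None.

    Several levels can yield identical output for small inputs, so the
    result is "a level that works", not necessarily the one that was
    originally used.
    """
    for level in range(1, 10):
        if raw_deflate(data, level) == target_stream:
            return level
    return None


# Stand-in for an uncompressed tarball and its original DEFLATE stream.
tarball = b"configure\nMakefile.in\nREADME\n" * 200
original_stream = raw_deflate(tarball, 9)
found = find_level(tarball, original_stream)
```

Recording the matching setting is enough to regenerate the stream
later, provided the same compressor implementation is still available.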

> The “input” field likely points to a tarball, which looks like this:
>
>     (tarball
>       (version 0)
>       (name "hungrycat-0.4.1.tar")
>       (input (sha256
>                "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r"))
>       (default-header)
>       (headers
>         ((name "hungrycat-0.4.1/")
>          (mode 493)
>          (mtime 1513360022)
>          (chksum 5058)
>          (typeflag 53))
>         ((name "hungrycat-0.4.1/configure")
>          (mode 493)
>          (size 130263)
>          (mtime 1513360022)
>          (chksum 6043))
>         ...)
>       (padding 3584)
>       (digest
>         (sha256
>           "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1")))
>
> Originally, I used your code, but I ran into some problems.  Namely,
> real tarballs are not well-behaved.  I wrote new code to keep track of
> subtle things like the formatting of the octal values.

Yeah I guess I was too optimistic.  :-)  I wanted to have the
serialization/deserialization code automatically generated by that
macro, but yeah, it doesn’t capture enough details for real-world
tarballs.
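
To illustrate why the formatting of octal values matters, here is a
small Python sketch.  Tar headers store numbers as ASCII octal text in
fixed-width fields, and the same number can be written in more than one
way.  The values in the sexp above appear to be decimal: mode 493 is
octal 755, and typeflag 53 is the ASCII code of “5”, the ustar type for
a directory.  The two layouts shown are only examples of how writers
can differ, not a catalogue of what Disarchive records, and the
function names are hypothetical.

```python
def format_octal(value, width, pad="0", terminator="\0"):
    """Render value as an octal tar field of exactly `width` bytes."""
    digits = format(value, "o")
    body = digits.rjust(width - len(terminator), pad)
    if len(body) + len(terminator) != width:
        raise ValueError("value does not fit in the field")
    return (body + terminator).encode("ascii")


def parse_octal(field):
    """Read an octal tar field, ignoring space and NUL padding."""
    return int(field.strip(b" \0").decode("ascii"), 8)


# Zero-padded and NUL-terminated.
zero_padded = format_octal(493, 8)
# Space-padded with a space-and-NUL terminator.
space_padded = format_octal(493, 8, pad=" ", terminator=" \0")

# Both spellings decode to the same number, so the number alone is not
# enough to reproduce the original header bytes.
same_value = parse_octal(zero_padded) == parse_octal(space_padded) == 493
directory_flag = chr(53)
```

Because both byte strings mean the same mode, a bit-for-bit
reconstruction has to remember which spelling the original used.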

Do you know how frequently you get “weird” tarballs?  I was thinking
about having something that works for plain GNU tar, but it’s even
better to have something that works with “unusual” tarballs!

(BTW the code I posted or the one in Disarchive could perhaps replace
the one in Gash-Utils.  I was frustrated to not see a ‘fold-archive’
procedure there, notably.)

> Even though they are not well-behaved, they are usually
> self-consistent, so I introduced the “default-header” field to set
> default values for all headers.  Any omitted fields in the headers use
> the value from the default header, and the default header takes
> defaults from a “default default header” defined in the code.  Here’s
> a default header from a different tarball:
>
>     (default-header
>       (uid 1199)
>       (gid 30)
>       (magic "ustar ")
>       (version " \x00")
>       (uname "cagordon")
>       (gname "lhea")
>       (devmajor-format (width 0))
>       (devminor-format (width 0)))

Very nice.
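
The three-level lookup (header, then default header, then built-in
defaults) can be sketched in Python as follows.  This is only an
illustration of the idea: the built-in default values are invented,
the function names are hypothetical, and the example mixes a header
from the hungrycat tarball with the default header quoted above, which
comes from a different tarball.

```python
from collections import ChainMap

# Hypothetical stand-in for the "default default header" in the code.
BUILTIN_DEFAULTS = {"uid": 0, "gid": 0, "magic": "ustar", "uname": "", "gname": ""}

_MISSING = object()


def expand(header, default_header):
    """Fill in omitted fields: header wins, then default header, then built-ins."""
    return dict(ChainMap(header, default_header, BUILTIN_DEFAULTS))


def compact(full_header, default_header):
    """Drop every field whose value is already implied by the defaults."""
    implied = {**BUILTIN_DEFAULTS, **default_header}
    return {
        key: value
        for key, value in full_header.items()
        if implied.get(key, _MISSING) != value
    }


default_header = {"uid": 1199, "gid": 30, "uname": "cagordon", "gname": "lhea"}
stored = {"name": "hungrycat-0.4.1/configure", "mode": 493, "size": 130263}

full = expand(stored, default_header)
round_trip = compact(full, default_header)
```

Since most members of a tarball share owner and format fields, storing
only the differences keeps each per-file entry short.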

> Finally, the “input” field here points to an “swh-directory” object.  It
> looks like this:
>
>     (swh-directory
>       (version 0)
>       (name "hungrycat-0.4.1")
>       (id "0496abd5a2e9e05c9fe20ae7684f48130ef6124a")
>       (digest
>         (sha256
>           "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r")))

Yay!

> I have a little module for computing the directory hash like SWH does
> (which is in turn like what Git does).  I did not verify that the 100
> packages were in the SWH archive.  I did verify a couple of packages,
> but I hit the rate limit and decided to avoid it for now.
>
> To avoid hitting the SWH archive at all, I introduced a directory cache
> so that I can store the directories locally.  If the directory cache is
> available, directories are stored and retrieved from it.

I guess we can get back to them eventually to estimate our coverage ratio.
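
For reference, the Git-style directory hash mentioned in the quoted
text can be sketched in a few lines of Python.  This follows Git's
object format only (SHA-1 over a “type length NUL” prefix plus the
payload, with tree entries sorted as if directory names ended in “/”).
It ignores executable bits and symlinks, and whether it matches SWH
identifiers or Disarchive's module in every corner case is an
assumption.

```python
import hashlib


def git_object_id(kind, payload):
    """SHA-1 of a Git object: b"<kind> <length>\\0" followed by payload."""
    prefix = f"{kind} {len(payload)}\0".encode("ascii")
    return hashlib.sha1(prefix + payload).digest()


def blob_id(content):
    return git_object_id("blob", content)


def tree_id(entries):
    """Hash a directory given as {name: bytes (file) or dict (subdirectory)}."""
    records = []
    for name, value in entries.items():
        if isinstance(value, dict):
            mode, oid, sort_key = b"40000", tree_id(value), name + "/"
        else:
            mode, oid, sort_key = b"100644", blob_id(value), name
        record = mode + b" " + name.encode("utf-8") + b"\0" + oid
        records.append((sort_key.encode("utf-8"), record))
    payload = b"".join(record for _, record in sorted(records))
    return git_object_id("tree", payload)


empty_blob = blob_id(b"").hex()
empty_tree = tree_id({}).hex()
example_dir = tree_id({"README": b"hello world\n", "src": {"main.c": b""}}).hex()
```

The identifier depends only on names, modes, and contents, which is why
tar-specific details such as owners and timestamps have to be kept in
separate metadata.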

>> I think we’d have to maintain a database that maps tarball hashes to
>> metadata (!).  A simple version of it could be a Git repo where, say,
>> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
>> contain the metadata above.  The nice thing is that the Git repo itself
>> could be archived by SWH.  :-)
>
> You mean like <https://git.ngyro.com/disarchive-db/>?  :)

Woow.  :-)

We could actually have a CI job to create the database: it would
basically do ‘disarchive save’ for each tarball and store that using a
layout like the one you used.  Then we could have a job somewhere that
periodically fetches that and adds it to the database.  WDYT?

I think we should leave room for other hash algorithms (in the sexps
above too).

> This was generated by a little script built on top of “fold-packages”.
> It downloads Gzip’d tarballs used by Guix packages and passes them on to
> Disarchive for disassembly.  I limited the number to 100 because it’s
> slow and because I’m sure there is a long tail of weird software
> archives that are going to be hard to process.  The metadata directory
> ended up being 13M and the directory cache 2G.

Neat.

So it does mean that we could pretty much right away add a fall-back in
(guix download) that looks up tarballs in your database and uses
Disarchive to reconstruct it, right?  I love solved problems.  :-)

Of course we could improve Disarchive and the database, but it seems to
me that we already have enough to improve the situation.  WDYT?

> Even with the code I have so far, I have a lot of questions.  Mainly I’m
> worried about keeping everything working into the future.  It would be
> easy to make incompatible changes.  A lot of care would have to be
> taken.  Of course, keeping a Guix commit and a Disarchive commit might
> be enough to make any assembling reproducible, but there’s a
> chicken-and-egg problem there.

The way I see it, Guix would always look up tarballs in the HEAD of the
database (no need to pick a specific commit).  Worst that could happen
is we reconstruct a tarball that doesn’t match, and so the daemon errors
out.
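
The safety argument here is that the expected hash is known in advance,
so a bad reconstruction can only fail, never silently succeed.  A
minimal Python sketch of such a fall-back chain is shown next; it is
not the code of (guix download), the function and variable names are
hypothetical, and it uses hexadecimal SHA-256 rather than the base32
form Guix prints.

```python
import hashlib


def fetch_with_fallback(fetchers, expected_sha256):
    """Try each source in turn and accept only content with the right hash.

    `fetchers` is a list of zero-argument callables returning bytes;
    they stand in for "download from the upstream URL", "reconstruct
    from archived metadata", and so on.
    """
    for fetch in fetchers:
        try:
            data = fetch()
        except OSError:
            continue  # source unavailable, try the next one
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
    raise LookupError("no source produced the expected content")


def upstream_is_gone():
    raise OSError("404")


def bad_reconstruction():
    return b"almost the right tarball"


def good_reconstruction():
    return b"the original tarball"


expected = hashlib.sha256(b"the original tarball").hexdigest()
result = fetch_with_fallback(
    [upstream_is_gone, bad_reconstruction, good_reconstruction], expected
)
```

A mismatching reconstruction is treated exactly like an unavailable
mirror, which is why always reading the latest state of the database is
safe.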

Regarding future-proofness, I think we must be super careful about the
file formats (the sexps).  You did pay attention to not having implicit
defaults, which is perfect.  Perhaps one thing to change (or perhaps
it’s already there) is support for other hashes in those sexps: both
hash algorithms and directory hash methods (SWH dir/Git tree, nar, Git
tree with different hash algorithm, IPFS CID, etc.).  Also the ability
to specify several hashes.

That way we could “refresh” the database anytime by adding the hash du
jour for already-present tarballs.

> What if a tarball from the closure of one of the derivations is missing?
> I guess you could work around it, but it would be tricky.

Well, more generally, we’ll have to monitor archive coverage.  But I
don’t think the issue is specific to this method.

>> Anyhow, we should team up with fellow NixOS and SWH hackers to address
>> this, and with developers of other distros as well -- this problem is not
>> just that of the functional deployment geeks, is it?
>
> I could remove most of the Guix stuff so that it would be easy to
> package in Guix, Nix, Debian, etc.  Then, someone™ could write a service
> that consumes a “sources.json” file, adds the sources to a Disarchive
> database, and pushes everything to a Git repo.  I guess everyone who
> cares has to produce a “sources.json” file anyway, so it will be very
> little extra work.  Other stuff like changing the serialization format
> to JSON would be pretty easy, too.  I’m not well connected to these
> other projects, mind you, so I’m not really sure how to reach out.

If you feel like it, you’re welcome to point them to your work in the
discussion at <https://forge.softwareheritage.org/T2430>.  There’s one
person from NixOS (lewo) participating in the discussion and I’m sure
they’d be interested.  Perhaps they’ll tell whether they care about
having it available as JSON.

> Sorry about the big mess of code and ideas – I realize I may have taken
> the “do-ocracy” approach a little far here.  :)  Even if this is not
> “the” solution, hopefully it’s useful for discussion!

You did great!  I had a very rough sketch and you did the real thing,
that’s just awesome.  :-)

Thanks a lot!

Ludo’.




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 30 Jul 2020 17:37:02 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Jul 30 13:37:01 2020
X-ME-Proxy: <xmx:NQUjXwiVBBQHHg7PgFaLQn4ZUHBAjQNQ9haGazzJJakxKlB1RFn8lw>
 <xmx:NQUjX5n8mO67FH0LK-2k5tQxwFcISIB_ahZqoKPwkeyDLPzUV67yEQ>
 <xmx:NQUjX0xvBNptIQO7VGm4TRHUSjVmU4ah_wO4ry9ZMbB54OSyoVS0kg>
 <xmx:NQUjX86zBXjXyejL-wPZr-ir1aqkFi1m2GUIwueJ4pW6HAISOJ0EMA>
Received: from mrblack (74-116-186-44.qc.dsl.ebox.net [74.116.186.44])
 by mail.messagingengine.com (Postfix) with ESMTPA id 36DA0328005E;
 Thu, 30 Jul 2020 13:36:53 -0400 (EDT)
From: Timothy Sample <samplet@HIDDEN>
To: Ludovic Courtès <ludo@HIDDEN>
Subject: Re: bug#42162: Recovering source tarballs
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
Date: Thu, 30 Jul 2020 13:36:52 -0400
Message-ID: <875za4ykej.fsf@HIDDEN>
Cc: 42162 <at> debbugs.gnu.org,
 Maurice Brémond <Maurice.Bremond@HIDDEN>,
 zimoun <zimon.toutoune@HIDDEN>

Hi Ludovic,

Ludovic Courtès <ludo@HIDDEN> writes:

> Hi,
>
> Ludovic Courtès <ludo@HIDDEN> wrote:
>
> [...]
>
> So for the medium term, and perhaps for the future, a possible option
> would be to preserve tarball metadata so we can reconstruct them:
>
>   tarball = metadata + tree
>
> After all, tarballs are byproducts and should be no exception: we should
> build them from source.  :-)
>
> In <https://forge.softwareheritage.org/T2430>, Stefano mentioned
> pristine-tar, which does almost that, but not quite: it stores a binary
> delta between a tarball and a tree:
>
>   https://manpages.debian.org/unstable/pristine-tar/pristine-tar.1.en.html
>
> I think we should have something more transparent than a binary delta.
>
> The code below can “disassemble” and “assemble” a tar.  When it
> disassembles it, it generates metadata like this:
>
> (tar-source
>   (version 0)
>   (headers
>     (("guile-3.0.4/"
>       (mode 493)
>       (size 0)
>       (mtime 1593007723)
>       (chksum 3979)
>       (typeflag #\5))
>      ("guile-3.0.4/m4/"
>       (mode 493)
>       (size 0)
>       (mtime 1593007720)
>       (chksum 4184)
>       (typeflag #\5))
>      ("guile-3.0.4/m4/pipe2.m4"
>       (mode 420)
>       (size 531)
>       (mtime 1536050419)
>       (chksum 4812)
>       (hash (sha256
>               "arx6n2rmtf66yjlwkgwp743glcpdsfzgjiqrqhfegutmcwvwvsza")))
>      ("guile-3.0.4/m4/time_h.m4"
>       (mode 420)
>       (size 5471)
>       (mtime 1536050419)
>       (chksum 4974)
>       (hash (sha256
>               "z4py26rmvsk4st7db6vwziwwhkrjjrwj7nra4al6ipqh2ms45kka")))
> […]
>
> The ‘assemble-archive’ procedure consumes that, looks up file contents
> by hash on SWH, and reconstructs the original tarball…
>
> … at least in theory, because in practice we hit the SWH rate limit
> after looking up a few files:
>
>   https://archive.softwareheritage.org/api/#rate-limiting
>
> So it’s a bit ridiculous, but we may have to store a SWH “dir”
> identifier for the whole extracted tree (a Git-tree hash), since that
> would allow us to retrieve the whole thing in a single HTTP request.
>
> Besides, we’ll also have to handle compression: storing gzip/xz headers
> and compression levels.

This jumped out at me because I have been working with compression and
tarballs for the bootstrapping effort.  I started pulling some threads
and doing some research, and ended up prototyping an end-to-end solution
for decomposing a Gzip’d tarball into Gzip metadata, tarball metadata,
and an SWH directory ID.  It can even put them back together!  :)  There
are a bunch of problems still, but I think this project is doable in the
short-term.  I’ve tested 100 arbitrary Gzip’d tarballs from Guix, and
found and fixed a bunch of little gaffes.  There’s a ton of work to do,
of course, but here’s another small step.

I call the thing “Disarchive” as in “disassemble a source code archive”.
You can find it at <https://git.ngyro.com/disarchive/>.  It has a simple
command-line interface so you can do

    $ disarchive save software-1.0.tar.gz

which serializes a disassembled version of “software-1.0.tar.gz” to the
database (which is just a directory) specified by the “DISARCHIVE_DB”
environment variable.  Next, you can run

    $ disarchive load hash-of-something-in-the-db

which will recover an original file from its metadata (stored in the
database) and data retrieved from the SWH archive or taken from a cache
(again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Now some implementation details.  The way I’ve set it up is that all of
the assembly happens through Guix.  Each step in recreating a compressed
tarball is a fixed-output derivation: the download from SWH, the
creation of the tarball, and the compression.  I wanted an easy way to
build and verify things according to a dependency graph without writing
any code.  Hi Guix Daemon!  I’m not sure if this is a good long-term
approach, though.  It could work well for reproducibility, but it might
be easier to let some external service drive my code as a Guix package.
Either way, it was an easy way to get started.

For disassembly, it takes a Gzip file (containing a single member) and
breaks it down like this:

    (gzip-member
      (version 0)
      (name "hungrycat-0.4.1.tar.gz")
      (input (sha256
               "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1"))
      (header
        (mtime 0)
        (extra-flags 2)
        (os 3))
      (footer
        (crc 3863610951)
        (isize 194560))
      (compressor gnu-best)
      (digest
        (sha256
          "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")))

The header and footer are read directly from the file.  Finding the
compressor is harder.  I followed the approach taken by the pristine-tar
project.  That is, try a bunch of compressors and hope for a match.
Currently, I have:

    • gnu-best
    • gnu-best-rsync
    • gnu
    • gnu-rsync
    • gnu-fast
    • gnu-fast-rsync
    • zlib-best
    • zlib
    • zlib-fast
    • zlib-best-perl
    • zlib-perl
    • zlib-fast-perl
    • gnu-best-rsync-1.4
    • gnu-rsync-1.4
    • gnu-fast-rsync-1.4

This list is inspired by pristine-tar.  The first couple GNU compressors
use modern Gzip from Guix.  The zlib and rsync-1.4 ones use the Gzip and
zlib wrapper from pristine-tar called “zgz”.  The 100 Gzip files I
looked at use “gnu”, “gnu-best”, “gnu-best-rsync-1.4”, “zlib”,
“zlib-best”, and “zlib-fast-perl”.
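
To make the "read the header and footer, then try compressors until one
matches" approach concrete, here is a hedged Python sketch.  It is not
Disarchive's code: all function names are hypothetical, it handles only
a single member with no optional header fields, and Python's zlib
compression levels merely stand in for the compressor variants listed
above.  The field layout follows the gzip format (RFC 1952), where an
extra-flags value of 2 means maximum compression and an OS value of 3
means Unix, matching the sample metadata.

```python
import gzip
import struct
import zlib


def raw_deflate(data, level):
    """Compress to a raw deflate stream (no gzip or zlib wrapper)."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def build_gzip_member(data, level, mtime=0, xfl=0, os_code=3):
    """Reassemble a gzip member from its metadata and uncompressed data."""
    header = struct.pack("<HBBIBB", 0x8B1F, 8, 0, mtime, xfl, os_code)
    footer = struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)
    return header + raw_deflate(data, level) + footer


def read_gzip_member(blob):
    """Split a gzip member into header, footer and raw deflate stream."""
    magic, method, flags, mtime, xfl, os_code = struct.unpack("<HBBIBB", blob[:10])
    if magic != 0x8B1F or method != 8:
        raise ValueError("not a deflate-compressed gzip file")
    if flags != 0:
        raise ValueError("optional header fields are not handled in this sketch")
    crc, isize = struct.unpack("<II", blob[-8:])
    header = {"mtime": mtime, "extra-flags": xfl, "os": os_code}
    footer = {"crc": crc, "isize": isize}
    return header, footer, blob[10:-8]


def matching_levels(deflate_stream):
    """Return every zlib level that reproduces the stream byte for byte."""
    data = zlib.decompress(deflate_stream, -15)
    return [lvl for lvl in range(1, 10) if raw_deflate(data, lvl) == deflate_stream]


if __name__ == "__main__":
    original = build_gzip_member(b"disarchive " * 200, level=9, xfl=2)
    header, footer, stream = read_gzip_member(original)
    print(header, footer, matching_levels(stream))
```

Several levels can produce identical output for small inputs, which is
why the sketch returns a list rather than a single answer.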

(As an aside, I had a way to decompose multi-member Gzip files, but it
was much, much slower.  Since I doubt they exist in the wild, I removed
that code.)

The “input” field likely points to a tarball, which looks like this:

    (tarball
      (version 0)
      (name "hungrycat-0.4.1.tar")
      (input (sha256
               "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r"))
      (default-header)
      (headers
        ((name "hungrycat-0.4.1/")
         (mode 493)
         (mtime 1513360022)
         (chksum 5058)
         (typeflag 53))
        ((name "hungrycat-0.4.1/configure")
         (mode 493)
         (size 130263)
         (mtime 1513360022)
         (chksum 6043))
        ...)
      (padding 3584)
      (digest
        (sha256
          "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1")))

Originally, I used your code, but I ran into some problems.  Namely,
real tarballs are not well-behaved.  I wrote new code to keep track of
subtle things like the formatting of the octal values.  Even though they
are not well-behaved, they are usually self-consistent, so I introduced
the “default-header” field to set default values for all headers.  Any
omitted fields in the headers use the value from the default header, and
the default header takes defaults from a “default default header”
defined in the code.  Here’s a default header from a different tarball:

    (default-header
      (uid 1199)
      (gid 30)
      (magic "ustar ")
      (version " \x00")
      (uname "cagordon")
      (gname "lhea")
      (devmajor-format (width 0))
      (devminor-format (width 0)))

These default values are computed to minimize the noise in the
serialized form.  Here we see for example that each header should have
UID 1199 unless otherwise specified.  We also see that the device fields
should be null strings instead of octal zeros.  Another good example
here is that the magic field has a space after “ustar”, which is not
what modern POSIX says to do.
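
One plausible way to compute such defaults is to take the most common
value of each field and keep only the deviations per header.  The Python
sketch below illustrates that idea; it is an assumption about the
approach, not Disarchive's implementation.  The names split_defaults and
expand are hypothetical, and the sketch assumes every header carries the
same set of fields.

```python
from collections import Counter


def split_defaults(headers, per_entry=("name",)):
    """Factor the most common value of each field into a default header."""
    defaults = {}
    for field in headers[0]:
        if field in per_entry:
            continue  # fields such as the name never get a default
        counts = Counter(h[field] for h in headers)
        defaults[field] = counts.most_common(1)[0][0]
    compact = [
        {k: v for k, v in h.items() if k in per_entry or defaults[k] != v}
        for h in headers
    ]
    return defaults, compact


def expand(defaults, compact):
    """Inverse of split_defaults: fill omitted fields from the defaults."""
    return [{**defaults, **h} for h in compact]


if __name__ == "__main__":
    headers = [
        {"name": "pkg/", "uid": 1199, "gid": 30, "uname": "cagordon"},
        {"name": "pkg/README", "uid": 1199, "gid": 30, "uname": "cagordon"},
        {"name": "pkg/odd", "uid": 0, "gid": 30, "uname": "root"},
    ]
    defaults, compact = split_defaults(headers)
    print(defaults)
    print(compact)
```

The round trip through expand has to return the original headers
exactly, otherwise the reassembled tarball would not match its digest.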

My tarball reader has minimal support for extended headers, but they are
not serialized cleanly (they survive the round-trip, but they are not
human-readable).

Finally, the “input” field here points to an “swh-directory” object.  It
looks like this:

    (swh-directory
      (version 0)
      (name "hungrycat-0.4.1")
      (id "0496abd5a2e9e05c9fe20ae7684f48130ef6124a")
      (digest
        (sha256
          "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r")))

I have a little module for computing the directory hash like SWH does
(which is in turn like what Git does).  I did not verify that the 100
packages were in the SWH archive.  I did verify a couple of packages,
but I hit the rate limit and decided to avoid it for now.
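
As a rough illustration of a Git-style directory hash, the Python sketch
below computes blob and tree identifiers for an in-memory tree.  It is
not the module mentioned above: the names are hypothetical, the tree is
modelled as nested dicts, and only regular non-executable files and
subdirectories are covered (no symlinks, no executable bit).

```python
import hashlib


def git_object_id(kind, payload):
    """SHA-1 over the Git object header followed by the payload."""
    header = f"{kind} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).digest()


def blob_id(data):
    return git_object_id("blob", data)


def tree_id(tree):
    """tree maps names to bytes (file contents) or nested dicts (directories)."""

    def sort_key(item):
        name, value = item
        # Git orders entries as if directory names ended with "/".
        return name.encode("utf-8") + (b"/" if isinstance(value, dict) else b"")

    payload = b""
    for name, value in sorted(tree.items(), key=sort_key):
        if isinstance(value, dict):
            mode, oid = b"40000", tree_id(value)
        else:
            mode, oid = b"100644", blob_id(value)
        payload += mode + b" " + name.encode("utf-8") + b"\0" + oid
    return git_object_id("tree", payload)


if __name__ == "__main__":
    demo = {"README": b"hello\n", "src": {"main.c": b"int main(void) { return 0; }\n"}}
    print(tree_id(demo).hex())
```

The two well-known Git identifiers checked in the test (the blob for
"hello\n" and the empty tree) give some confidence that the object
framing is right.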

To avoid hitting the SWH archive at all, I introduced a directory cache
so that I can store the directories locally.  If the directory cache is
available, directories are stored and retrieved from it.

> How would we put that in practice?  Good question.  :-)
>
> I think we’d have to maintain a database that maps tarball hashes to
> metadata (!).  A simple version of it could be a Git repo where, say,
> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
> contain the metadata above.  The nice thing is that the Git repo itself
> could be archived by SWH.  :-)

You mean like <https://git.ngyro.com/disarchive-db/>?  :)
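
A directory-as-database layout of the kind proposed in the quote can be
sketched in a few lines of Python.  This only mirrors the
"sha256/<hash>" path scheme from the quoted message; the function names
are hypothetical, and the actual layout of the disarchive-db repository
is not described here and may differ.

```python
from pathlib import Path


def metadata_path(db_root, algorithm, digest):
    """Map a tarball hash to the file holding its metadata."""
    return Path(db_root) / algorithm / digest


def save_metadata(db_root, algorithm, digest, text):
    path = metadata_path(db_root, algorithm, digest)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


def load_metadata(db_root, algorithm, digest):
    """Return the stored metadata, or None if the hash is unknown."""
    path = metadata_path(db_root, algorithm, digest)
    return path.read_text(encoding="utf-8") if path.exists() else None
```

Because the database is plain files in a directory, it can be committed
to a Git repository as is.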

This was generated by a little script built on top of “fold-packages”.
It downloads Gzip’d tarballs used by Guix packages and passes them on to
Disarchive for disassembly.  I limited the number to 100 because it’s
slow and because I’m sure there is a long tail of weird software
archives that are going to be hard to process.  The metadata directory
ended up being 13M and the directory cache 2G.

> Thus, if a tarball vanishes, we’d look it up in the database and
> reconstruct it from its metadata plus content store in SWH.
>
> Thoughts?

Obviously I like the idea.  ;)

Even with the code I have so far, I have a lot of questions.  Mainly I’m
worried about keeping everything working into the future.  It would be
easy to make incompatible changes.  A lot of care would have to be
taken.  Of course, keeping a Guix commit and a Disarchive commit might
be enough to make any assembling reproducible, but there’s a
chicken-and-egg problem there.  What if a tarball from the closure of
one of the derivations is missing?  I guess you could work around it, but
it would be tricky.

> Anyhow, we should team up with fellow NixOS and SWH hackers to address
> this, and with developers of other distros as well.  This problem is not
> just that of the functional deployment geeks, is it?

I could remove most of the Guix stuff so that it would be easy to
package in Guix, Nix, Debian, etc.  Then, someone™ could write a service
that consumes a “sources.json” file, adds the sources to a Disarchive
database, and pushes everything to a Git repo.  I guess everyone who
cares has to produce a “sources.json” file anyway, so it will be very
little extra work.  Other stuff like changing the serialization format
to JSON would be pretty easy, too.  I’m not well connected to these
other projects, mind you, so I’m not really sure how to reach out.

Sorry about the big mess of code and ideas – I realize I may have taken
the “do-ocracy” approach a little far here.  :)  Even if this is not
“the” solution, hopefully it’s useful for discussion!


-- Tim




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


From: Ludovic Courtès <ludo@HIDDEN>
To: zimoun <zimon.toutoune@HIDDEN>
Subject: Re: Recovering source tarballs
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <CAJ3okZ2iesMKLLD29qrOJzNBjV=haoPFtpRg9T=0aTA8ZxLQMA@HIDDEN>
 <87365mzil1.fsf@HIDDEN>
 <CAJ3okZ0iMNjv93MM1FkEB3_zXA48Rq3rKXhwwug85fNRRc41Mg@HIDDEN>
 <87k0ywlg1z.fsf@HIDDEN> <86o8o81jic.fsf@HIDDEN>
X-URL: http://www.fdn.fr/~lcourtes/
X-Revolutionary-Date: 5 Thermidor an 228 de la Révolution
X-PGP-Key-ID: 0x090B11993D9AEBB5
X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc
X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4  0CFB 090B 1199 3D9A EBB5
X-OS: x86_64-pc-linux-gnu
Date: Wed, 22 Jul 2020 12:28:50 +0200
In-Reply-To: <86o8o81jic.fsf@HIDDEN> (zimoun's message of "Wed, 22 Jul 2020
 02:27:39 +0200")
Message-ID: <875zafkfml.fsf@HIDDEN>
Cc: 42162 <at> debbugs.gnu.org,
 Maurice Brémond <Maurice.Bremond@HIDDEN>

Hello!

zimoun <zimon.toutoune@HIDDEN> wrote:

> On Tue, 21 Jul 2020 at 23:22, Ludovic Courtès <ludo@HIDDEN> wrote:
>
>>>> >>   • If we no longer deal with tarballs but upstreams keep signing
>>>> >>     tarballs (not raw directory hashes), how can we authenticate our
>>>> >>     code after the fact?
>>>> >
>>>> > Does Guix automatically authenticate code using signed tarballs?
>>>>
>>>> Not automatically; packagers are supposed to authenticate code when they
>>>> add a package (‘guix refresh -u’ does that automatically).
>>>
>>> So I miss the point of having this authentication information in the
>>> future where upstream has disappeared.
>>
>> What I meant above, is that often, what we have is things like detached
>> signatures of raw tarballs, or documents referring to a tarball hash:
>>
>>   https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00009.html
>
> I still do not see why it matters to store detached signatures of raw tarballs.

I’m not saying we (Guix) should store signatures; I’m just saying that
developers typically sign raw tarballs.  It’s a general statement to
explain why storing or being able to reconstruct tarballs matters.

Thanks,
Ludo’.




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


From: zimoun <zimon.toutoune@HIDDEN>
To: Ludovic Courtès <ludo@HIDDEN>
Subject: Re: Recovering source tarballs
In-Reply-To: <87k0ywlg1z.fsf@HIDDEN>
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <CAJ3okZ2iesMKLLD29qrOJzNBjV=haoPFtpRg9T=0aTA8ZxLQMA@HIDDEN>
 <87365mzil1.fsf@HIDDEN>
 <CAJ3okZ0iMNjv93MM1FkEB3_zXA48Rq3rKXhwwug85fNRRc41Mg@HIDDEN>
 <87k0ywlg1z.fsf@HIDDEN>
Date: Wed, 22 Jul 2020 02:27:39 +0200
Message-ID: <86o8o81jic.fsf@HIDDEN>
Cc: 42162 <at> debbugs.gnu.org,
 Maurice Brémond <Maurice.Bremond@HIDDEN>

Hi!

On Tue, 21 Jul 2020 at 23:22, Ludovic Courtès <ludo@HIDDEN> wrote:

>>> >>   • If we no longer deal with tarballs but upstreams keep signing
>>> >>     tarballs (not raw directory hashes), how can we authenticate our
>>> >>     code after the fact?
>>> >
>>> > Does Guix automatically authenticate code using signed tarballs?
>>>
>>> Not automatically; packagers are supposed to authenticate code when they
>>> add a package (‘guix refresh -u’ does that automatically).
>>
>> So I miss the point of having this authentication information in the
>> future where upstream has disappeared.
>
> What I meant above, is that often, what we have is things like detached
> signatures of raw tarballs, or documents referring to a tarball hash:
>
>   https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00009.html

I still do not see why it matters to store detached signatures of raw
tarballs.

The authentication is done now (at packaging time and/or at inclusion in
the lookup table proposal).  I do not see why we would have to
authenticate again later.

IMHO, having a lookup table that returns the signatures from a tarball
hash or an archive of all the OpenPGP keys ever published is another
topic.


>>> But today, we store tarball hashes, not directory hashes.
>>
>> We store what "guix hash" returns. ;-)
>> So it is easy to migrate from tarball hashes to whatever else. :-)
>
> True, but that other thing, as it stands, would be a nar hash (like for
> ‘git-fetch’), not a Git-tree hash (what SWH uses).

Ok, now I am totally convinced that a lookup table is The Right Thing™. :-)

>> I mean, it is "(sha256 (base32" and it is easy to have also
>> "(sha256-tree (base32" or something like that.
>
> Right, but that first and foremost requires daemon support.
>
> It’s doable, but migration would have to take a long time, since this is
> touching core parts of the “protocol”.

Doable but not necessarily tractable. :-)


>> I have not yet done a clear back-of-the-envelope computation.  Roughly,
>> there are ~23 commits on average per day updating packages, so say 70%
>> of them are url-fetch, it is ~16 new tarballs per day, on average.
>> How will the model using a Git repo scale?  Because, naively, the
>> output of "disassemble-archive" in full text (pretty-print format) for
>> the hello-2.10.tar is 120KB and so 16*365*120K = ~700MB per year
>> without considering all the Git internals.  Obviously, it depends on
>> the number of files and I do not know if hello is a representative
>> example.
>
> Interesting, thanks for making that calculation!  We could make the
> format more compact if needed.

Compressing should help.

Considering 14000 packages, based on this 120KB estimation, it leads to:
0.7*14k*120K = ~1.2GB for the Git-repo of the current Guix.
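
The two estimates can be reproduced with a few lines of Python.  The
inputs are the figures quoted in this message (23 commits per day, 70%
url-fetch, 120KB of metadata per tarball, 14000 packages); the variable
names are only illustrative, and integer arithmetic is used to avoid
floating-point rounding.

```python
COMMITS_PER_DAY = 23      # package-updating commits per day (from the message)
URL_FETCH_PERCENT = 70    # assumed share of url-fetch sources
METADATA_KB = 120         # metadata size measured for hello-2.10.tar
PACKAGES = 14_000         # package count used in the message

# 23 * 0.7 = 16.1, rounded down to 16 new tarballs per day.
tarballs_per_day = COMMITS_PER_DAY * URL_FETCH_PERCENT // 100

# Yearly growth of the metadata repository, in KB.
yearly_kb = tarballs_per_day * 365 * METADATA_KB

# Metadata for all url-fetch packages in current Guix, in KB.
current_kb = PACKAGES * URL_FETCH_PERCENT // 100 * METADATA_KB

if __name__ == "__main__":
    print(f"{yearly_kb / 1000:.1f} MB per year")
    print(f"{current_kb / 1_000_000:.2f} GB for current Guix")
```

This gives 700,800 KB (about 700MB) per year and 1,176,000 KB (about
1.2GB) for the current package set, before any Git overhead or
compression.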

Cheers,
simon





Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


From: Ludovic Courtès <ludo@HIDDEN>
To: zimoun <zimon.toutoune@HIDDEN>
Subject: Re: Recovering source tarballs
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <CAJ3okZ2iesMKLLD29qrOJzNBjV=haoPFtpRg9T=0aTA8ZxLQMA@HIDDEN>
 <87365mzil1.fsf@HIDDEN>
 <CAJ3okZ0iMNjv93MM1FkEB3_zXA48Rq3rKXhwwug85fNRRc41Mg@HIDDEN>
X-URL: http://www.fdn.fr/~lcourtes/
X-Revolutionary-Date: 4 Thermidor an 228 de la Révolution
X-PGP-Key-ID: 0x090B11993D9AEBB5
X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc
X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4  0CFB 090B 1199 3D9A EBB5
X-OS: x86_64-pc-linux-gnu
Date: Tue, 21 Jul 2020 23:22:00 +0200
In-Reply-To: <CAJ3okZ0iMNjv93MM1FkEB3_zXA48Rq3rKXhwwug85fNRRc41Mg@HIDDEN>
 (zimoun's message of "Mon, 20 Jul 2020 17:52:09 +0200")
Message-ID: <87k0ywlg1z.fsf@HIDDEN>
Cc: 42162 <at> debbugs.gnu.org,
 Maurice Brémond <Maurice.Bremond@HIDDEN>

Hi!

zimoun <zimon.toutoune@HIDDEN> wrote:

> On Mon, 20 Jul 2020 at 10:39, Ludovic Courtès <ludo@HIDDEN> wrote:
>> zimoun <zimon.toutoune@HIDDEN> wrote:
>> > On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@HIDDEN> wrote:
>
>> There are many many comments in your message, so I took the liberty to
>> reply only to the essence of it.  :-)
>
> Many comments because many open topics. ;-)

Understood, and they’re very valuable but (1) I choose not to just do
email :-), and (2) I like to separate issues in reasonable chunks rather
than long threads addressing all the problems we’ll have to deal with.

I think it really helps keep things tractable!

>> Lookup issue.  :-)  The hash in a CID is not just a raw blob hash.
>> Files are typically chunked beforehand, assembled as a Merkle tree, and
>> the CID is roughly the hash to the tree root.  So it would seem we can’t
>> use IPFS as-is for tarballs.
>
> Using the Git-repo map/table, then it becomes an option, right?
> Well, SWH would be a backend and IPFS could be another one.  Or any
> "cloudy" storage system that could appear in the future, right?

Sure, why not.

>> >>   • If we no longer deal with tarballs but upstreams keep signing
>> >>     tarballs (not raw directory hashes), how can we authenticate our
>> >>     code after the fact?
>> >
>> > Does Guix automatically authenticate code using signed tarballs?
>>
>> Not automatically; packagers are supposed to authenticate code when they
>> add a package (‘guix refresh -u’ does that automatically).
>
> So I am missing the point of having this authentication information in a
> future where upstream has disappeared.

What I meant above, is that often, what we have is things like detached
signatures of raw tarballs, or documents referring to a tarball hash:

  https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00009.html

>> But today, we store tarball hashes, not directory hashes.
>
> We store what "guix hash" returns. ;-)
> So it is easy to migrate from tarball hashes to whatever else. :-)

True, but that other thing, as it stands, would be a nar hash (like for
‘git-fetch’), not a Git-tree hash (what SWH uses).
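
As a rough illustration of why these identifiers are not interchangeable, the Python sketch below computes Git object ids for a flat directory of regular files and compares them with a plain SHA-256 of the same content.  The Git blob and tree formats used here are the standard ones; the nar serialization is not reproduced, and the helper names and the restriction to non-executable files (mode 100644) in a single directory are assumptions made to keep the example short.

```python
import hashlib
from typing import Dict


def tarball_style_hash(data: bytes) -> str:
    """SHA-256 over raw bytes, as one would hash a downloaded file."""
    return hashlib.sha256(data).hexdigest()


def git_blob_id(data: bytes) -> bytes:
    """Raw 20-byte Git blob id: SHA-1 over 'blob <size>' + NUL + content."""
    header = b"blob " + str(len(data)).encode() + b"\0"
    return hashlib.sha1(header + data).digest()


def git_tree_id(files: Dict[str, bytes]) -> str:
    """Git tree id of a flat directory holding only regular files."""
    entries = b""
    # Git sorts tree entries by name (byte-wise for plain files).
    for name in sorted(files, key=lambda n: n.encode()):
        entries += b"100644 " + name.encode() + b"\0" + git_blob_id(files[name])
    header = b"tree " + str(len(entries)).encode() + b"\0"
    return hashlib.sha1(header + entries).hexdigest()


if __name__ == "__main__":
    content = b"hello\n"
    print("sha256 of bytes:", tarball_style_hash(content))
    print("git blob id:    ", git_blob_id(content).hex())
    print("git tree id:    ", git_tree_id({"README": content}))
```

The same bytes yield three unrelated identifiers, so a store indexed by Git-tree hashes cannot be queried with a hash computed over a different serialization.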

> I mean, it is "(sha256 (base32" and it is easy to have also
> "(sha256-tree (base32" or something like that.

Right, but that first and foremost requires daemon support.

It’s doable, but migration would have to take a long time, since this is
touching core parts of the “protocol”.

> I have not yet done clear back-of-the-envelope computations.  Roughly,
> there are ~23 commits on average per day updating packages, so if 70%
> of them are url-fetch, that is ~16 new tarballs per day, on average.
> How will the model using a Git repo scale?  Naively, the
> output of "disassemble-archive" in full text (pretty-print format) for
> hello-2.10.tar is 120KB, so 16*365*120KB = ~700MB per year
> without considering all the Git internals.  Obviously, it depends on
> the number of files and I do not know if hello is a representative
> example.

Interesting, thanks for making that calculation!  We could make the
format more compact if needed.

Thanks,
Ludo’.




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 20 Jul 2020 21:27:42 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Jul 20 17:27:42 2020
Received: from localhost ([127.0.0.1]:36539 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1jxdK1-0004tW-U5
	for submit <at> debbugs.gnu.org; Mon, 20 Jul 2020 17:27:42 -0400
Received: from mail-wr1-f50.google.com ([209.85.221.50]:38265)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <zimon.toutoune@HIDDEN>) id 1jxdJz-0004tI-IE
 for 42162 <at> debbugs.gnu.org; Mon, 20 Jul 2020 17:27:40 -0400
Received: by mail-wr1-f50.google.com with SMTP id a14so4352625wra.5
 for <42162 <at> debbugs.gnu.org>; Mon, 20 Jul 2020 14:27:39 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=from:to:cc:subject:in-reply-to:references:date:message-id
 :mime-version; bh=2+SnLNwqdv1rxWYvVaEUhogUq7I92bNwB0b3aUGnnZs=;
 b=HCQ1Z4wBrBN7aA1e9DM52gB+xgK+FfkWbvo7Z9Y4gVsmH3OormHdu+LgCzrpE3MSvz
 rEaYgij0anUzafOeq2cUP7DVBfimByVXXdeTYAvMi+pZHHuNj/imxD+tdzAw3QMnNs+K
 N5XNUU9srZrEjPM19GPUZrSXLtByshUeP7cm8nH6lAsdHJiFOno/ArCqpG6GoF1YKs4h
 XaBNfob3mZYZpGefPNKv0iDb61ejHoAlkO+0APyvpWAzwutAtgRlQULKNrBzSTyV97/v
 G0lOnDnw1PLeRGf8K8ES2Mh0Namo13fJ3wwXATyJ4xcgvAsGg1ln2iVwVEJT3M8ze0hm
 GXbw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:from:to:cc:subject:in-reply-to:references:date
 :message-id:mime-version;
 bh=2+SnLNwqdv1rxWYvVaEUhogUq7I92bNwB0b3aUGnnZs=;
 b=P7rcW1ghqLwTD0TVpwwXD0VbW9ZLZi/uTUYMh97IcV1Rprq71F1/hODqXCMmqC3pO+
 q+NffoAt60d+byr4L0bTyKuHx8KSgIo+MDSpUyILFpQquxoZlC9PKsks5F+Bh/Q0JoEP
 KzYLKqP1wL51+qKFn8gU1j2zKJw8QXu2RwvZuS/BI+LmaORWoBfsTYd4ukyCGSvGQGti
 SlvX31XFtnp0duGK0HO4rOJ6ZxRPU5k+EbT1xM030udwf10ZGSdk7JldFekhQ2enbt+i
 R+10HhAOJevTsQavFo8IY/CJzWFkHuk5ywO2SSEyQW11AYSnIfjBu5OrjRs+mhJ1pKlY
 0nFQ==
X-Gm-Message-State: AOAM531TgEa9nv1hgt94M3Q4tL4F0okEhfBgAmn9IqS13AMf+f555zo/
 QZlcBfuyED928J89PMOKNl8=
X-Google-Smtp-Source: ABdhPJyJK0i2JsiKiAXirf/MP1XnruW3q/ZCFjUMe16X4+96sZTwbsIyx/y//HOO/THdvnet9OhscQ==
X-Received: by 2002:adf:dfd1:: with SMTP id q17mr22565951wrn.94.1595280453723; 
 Mon, 20 Jul 2020 14:27:33 -0700 (PDT)
Received: from lili ([2a01:e0a:59b:9120:65d2:2476:f637:db1e])
 by smtp.gmail.com with ESMTPSA id p25sm489073wma.39.2020.07.20.14.27.32
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Mon, 20 Jul 2020 14:27:33 -0700 (PDT)
From: zimoun <zimon.toutoune@HIDDEN>
To: Christopher Baines <mail@HIDDEN>, Ludovic =?utf-8?Q?Court=C3=A8s?=
 <ludo@HIDDEN>
Subject: Re: bug#42162: Recovering source tarballs
In-Reply-To: <87a703jk78.fsf@HIDDEN>
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <87a703jk78.fsf@HIDDEN>
Date: Mon, 20 Jul 2020 23:27:32 +0200
Message-ID: <865zahev23.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 Maurice =?utf-8?Q?Br=C3=A9mond?= <Maurice.Bremond@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

Hi Chris,

On Mon, 13 Jul 2020 at 20:20, Christopher Baines <mail@HIDDEN> wrote:

> Going forward, being methodical as a project about storing the tarballs
> and source material for the packages is probably the way to ensure it's
> available for the future. I'm not sure the data storage cost is
> significant, the cost of doing this is probably in working out what to
> store, doing so in a redundant manner, and making the data available.

A really rough estimate is 120KB of metadata on average* per tarball.  So if we
consider 14000 packages and 70% of them are url-fetch, it leads to
14k*0.7*120KB = 1.2GB, which is not significant.  Moreover, if we
extrapolate the numbers, between v1.0.0 and now there are 23 commits per day
modifying gnu/packages/, so 0.7*23*120KB*365 = 700MB per year.  However,
the 120KB of metadata needed to re-assemble the tarball has to be compared to
the 712KB of the raw compressed tarball; both figures are for the hello package.

*Based on the hello package; it depends on the number of files in
 the tarball.  The metadata is stored uncompressed, as a plain sexp.
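
The estimate above can be replayed with a few lines of Python.  This is only a sketch using the figures quoted in this message (14000 packages, 70% url-fetch, 120KB of metadata, 23 commits per day, 712KB tarball) and assuming decimal units (1GB = 10^6 KB); the function names are made up for the example.

```python
KB_PER_MB = 1_000
KB_PER_GB = 1_000_000


def backlog_gb(packages=14_000, url_fetch_share=0.7, metadata_kb=120):
    """Metadata size for all current url-fetch packages, in GB."""
    return packages * url_fetch_share * metadata_kb / KB_PER_GB


def growth_mb_per_year(commits_per_day=23, url_fetch_share=0.7, metadata_kb=120):
    """Yearly metadata growth if every package commit adds one tarball, in MB."""
    return commits_per_day * url_fetch_share * metadata_kb * 365 / KB_PER_MB


def metadata_to_tarball_ratio(metadata_kb=120, tarball_kb=712):
    """Size of the metadata relative to the compressed tarball (hello package)."""
    return metadata_kb / tarball_kb


if __name__ == "__main__":
    print(f"backlog: {backlog_gb():.2f} GB")
    print(f"growth:  {growth_mb_per_year():.0f} MB per year")
    print(f"ratio:   {metadata_to_tarball_ratio():.0%} of the compressed tarball")
```

With hello as the only data point, the metadata comes to roughly a sixth of the compressed tarball, so the totals are sensitive to how representative hello is.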


Therefore, in addition to what to store, redundancy and availability,
one question is how to store it: Git repo?  SQL database?  Something else?



> The Guix Data Service knows about fixed output derivations, so it might
> be possible to backfill such a store by just attempting to build those
> derivations. It might also be possible to use the Guix Data Service to
> work out what's available, and what tarballs are missing.

Missing from where?  The substitutes farm or SWH?


Cheers,
simon




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 20 Jul 2020 20:00:19 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Jul 20 16:00:19 2020
Received: from localhost ([127.0.0.1]:36297 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1jxbxT-0002fN-4Q
	for submit <at> debbugs.gnu.org; Mon, 20 Jul 2020 16:00:19 -0400
Received: from mail-qk1-f179.google.com ([209.85.222.179]:36081)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <zimon.toutoune@HIDDEN>) id 1jxbxQ-0002f7-Le
 for 42162 <at> debbugs.gnu.org; Mon, 20 Jul 2020 16:00:17 -0400
Received: by mail-qk1-f179.google.com with SMTP id g26so6662200qka.3
 for <42162 <at> debbugs.gnu.org>; Mon, 20 Jul 2020 13:00:16 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=mime-version:references:in-reply-to:from:date:message-id:subject:to
 :cc; bh=HXYku00sUsiNQj6z6wEjbeqaMMSI1nifRy07iSTwf6g=;
 b=DcG2XbZKjOGP9sdGpkQ18ee3shGOwZpgkgfB5jIuCAVlYGTq54wW/wS0C7Gp1wa3am
 dWaCRxvupOEdq65jPRLG4qp+WvdGPcOHBLF5wi8SmaL8oN3O8drUl5T4IAndt+oFe3ch
 Les7d6yGrYh/7FdDYoFBd2n05eMd9AZLrMKEAD6ChlXqn7B+cDLqhUQVb5LTwpWu97YG
 RsCmGKxiK+eHDvdbw3KGfDUdKbg1vEjqHocwQj8VOQIi2415WJwC0N1ectUBR9JON5kt
 JMFSh0NjWm6bVKn/HEJl4EzrR83KV3ibgq0wbcCcspdNRDhL/fO8xammU4iD3HmklEsb
 G4Ww==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=HXYku00sUsiNQj6z6wEjbeqaMMSI1nifRy07iSTwf6g=;
 b=KaM8HvDjxM4EaDFr/iq39xEuBRM6ifLq4bNM4du126gPnA2E1i5lhW8Y6bwk07W8Hj
 nb+7yZttnO0HybXmg6FlQckFuCs3M/ImmbqpkKEQamHRcnK8aYQSBQa795JtsLgqyH9A
 iKjjytHXTia90uGf9hfRFwNNhUa4QnK4dBk3Jh1yBisnsJZxzXRsNwdxKGxjPbNyNdXk
 7gnjnKvUh/TqHU02wmE1LXZv0V9tC44INWgsqqi/WJa0okus7onv5GY4W7BX37LH9ptR
 kWuEjtD1cRwWEbXL0hRe6RrTScGj+qqa7o2m0vfuZbU9/SH5emxm5HxMlpgNWFcfygCt
 h4uA==
X-Gm-Message-State: AOAM533Sj5cBJXxk8RIsa0R58+vgmksP/+RUONc0hH6IMK5wjO6Jj/j/
 ZgRlo7lr0/Bq6LxRp9Dv17A5WEZLPE6zfPpn0hY=
X-Google-Smtp-Source: ABdhPJzuwP3jV1JnBTZDKeoXEE0Z7H9Tupxjk9w2EmIlqrDIIKxOcEHBFXH8oq009J7jeEtmb+vyfWJHxahtYQWmeis=
X-Received: by 2002:a05:620a:567:: with SMTP id
 p7mr23932094qkp.232.1595275210929; 
 Mon, 20 Jul 2020 13:00:10 -0700 (PDT)
MIME-Version: 1.0
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <CAJ3okZ2iesMKLLD29qrOJzNBjV=haoPFtpRg9T=0aTA8ZxLQMA@HIDDEN>
 <87365mzil1.fsf@HIDDEN>
 <CAJ3okZ0iMNjv93MM1FkEB3_zXA48Rq3rKXhwwug85fNRRc41Mg@HIDDEN>
 <87wo2ynml7.fsf@HIDDEN>
In-Reply-To: <87wo2ynml7.fsf@HIDDEN>
From: zimoun <zimon.toutoune@HIDDEN>
Date: Mon, 20 Jul 2020 21:59:59 +0200
Message-ID: <CAJ3okZ2ndtsn5t38t+C_odoYDa-m8cdpFG9tnKC8FoKuoHXveA@HIDDEN>
Subject: Re: bug#42162: Recovering source tarballs
To: "Dr. Arne Babenhauserheide" <arne_bab@HIDDEN>
Content-Type: text/plain; charset="UTF-8"
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 =?UTF-8?Q?Maurice_Br=C3=A9mond?= <Maurice.Bremond@HIDDEN>,
 =?UTF-8?Q?Ludovic_Court=C3=A8s?= <ludo@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

On Mon, 20 Jul 2020 at 19:05, Dr. Arne Babenhauserheide <arne_bab@HIDDEN> wrote:
> zimoun <zimon.toutoune@HIDDEN> writes:
> >> > The format of metadata (disassemble) that you propose is schemish
> >> > (obviously! :-)) but we could propose something more JSON-like.
> >>
> >> Sure, if that helps get other people on-board, why not (though sexps
> >> have lived much longer than JSON and XML together :-)).
> >
> > Lived much longer and still less less less used than JSON or XML alone. ;-)
>
> Though this is likely not a function of the format, but of the
> popularity of both Javascript and Java.

Well, popularity matters to attract a broad audience and maybe get
other people on-board, if that is the aim.
JSON seems to be the de facto format, even if it has flaws.  And zillions of
parsers for all languages are floating around, which is not the
case for sexps, even if they are easier to parse.

And JSON is already used in Guix, see [1] for an example.

1: https://guix.gnu.org/manual/devel/en/guix.html#Additional-Build-Options

However, I am not convinced that JSON, or similarly sexps, will scale
well from a Tarball Heritage perspective.

All the best,
simon




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 20 Jul 2020 17:05:54 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Jul 20 13:05:54 2020
Received: from localhost ([127.0.0.1]:35962 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1jxZEg-0004UE-7q
	for submit <at> debbugs.gnu.org; Mon, 20 Jul 2020 13:05:54 -0400
Received: from mout.web.de ([212.227.17.12]:56115)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <arne_bab@HIDDEN>) id 1jxZEe-0004TX-IA
 for 42162 <at> debbugs.gnu.org; Mon, 20 Jul 2020 13:05:53 -0400
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=web.de;
 s=dbaedf251592; t=1595264745;
 bh=3SGT2d30eh7J/1ve3dcPh79Jb1jT2jrUdrH0YRCArgc=;
 h=X-UI-Sender-Class:References:From:To:Cc:Subject:In-reply-to:Date;
 b=Xw5K2Oda5KAMThkslE8f+daDDeKhnt++ZMy1+8VgJWO7tYX4/Us4j5yqCvKzUWfJc
 xipmnRoXwFAroPC1b3VIr/dIZE+7p8p4Vf/IYovIZhfvWuVRAw3I2ZnEO2b5ATjJLL
 ClX5xOS1HcqTl3zLwufiUaaXvCLzPxplsZiN8KQM=
X-UI-Sender-Class: c548c8c5-30a9-4db5-a2e7-cb6cb037b8f9
Received: from fluss ([80.136.20.161]) by smtp.web.de (mrweb101
 [213.165.67.124]) with ESMTPSA (Nemesis) id 0LetYx-1kcRHz3Z9T-00qlU1; Mon, 20
 Jul 2020 19:05:44 +0200
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <CAJ3okZ2iesMKLLD29qrOJzNBjV=haoPFtpRg9T=0aTA8ZxLQMA@HIDDEN>
 <87365mzil1.fsf@HIDDEN>
 <CAJ3okZ0iMNjv93MM1FkEB3_zXA48Rq3rKXhwwug85fNRRc41Mg@HIDDEN>
User-agent: mu4e 1.4.10; emacs 26.3
From: "Dr. Arne Babenhauserheide" <arne_bab@HIDDEN>
To: zimoun <zimon.toutoune@HIDDEN>
Subject: Re: bug#42162: Recovering source tarballs
In-reply-to: <CAJ3okZ0iMNjv93MM1FkEB3_zXA48Rq3rKXhwwug85fNRRc41Mg@HIDDEN>
Date: Mon, 20 Jul 2020 19:05:40 +0200
Message-ID: <87wo2ynml7.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-=";
 micalg=pgp-sha256; protocol="application/pgp-signature"
X-Provags-ID: V03:K1:Uh53OiXaAHEICcIq8MZ6ln2ZPpyr8JffOGvVOU34E3khA7Rl3Aw
 0KNxAcz3WuDYBreOVHuIN2BJzfn/HkPGfTWogvrJ2roB2RGQEhn1kt9gI0ZFrLJo40YQwUB
 Y2TdMeBuiNz9Mwrr6Glae8xhdUdkwcIawdxIsAzCuxEdM77I37rmcFKthf/KK7wRt1RIdc+
 Ai1htoTQSXKdniVwlHjlg==
X-Spam-Flag: NO
X-UI-Out-Filterresults: notjunk:1;V03:K0:W0SqF5sdtls=:f7SfusEtZ0Tk9cN/QRzrdK
 MFfQ0Od0yBUfmYBzgCWYK52K8V6AXlEzZEtmn027tNYECTUlRdeprDcdgeRgWk31DuDDqKJB2
 kw7Rd8Xm+zb8smqAZzUsRtrJ6gb2QT1FRD7QLuN2e65572xRz+7DuIryUZHt5FLYU4gYu8L8V
 EutFVSrlfZsYDi7PvXAfTzvxjGMTJazwydAC/MA8WfZsf/48BGbllRId3aHe/Ks9ecVc49TlZ
 VzBeKJPQP7TzBADOkOOUXmwkAIMeuLeodeUvRrer5/Ho+YQuvn4q9rnOhrs2q6vkymwXYF3rK
 ckHrpeGpQpc4kTFpI90Ggl+7fNpiHnkXvId/m7A9OIYXPrd4cTKOvRzkNRAH984nnuKiybnla
 OD6Abne1At6Be5qFQvnpi57VPdtRk1h3DXSCl5qkakRe2pvy9hIc66xj7FgeWVg/tquLUvYCE
 JHRMvyvdJ94SOr0gZfyEwRSIWqSgOEHQdD5id2pPIFpKn5dpXMK+eXT+/bUWnf3aClOnfC3bY
 Xdbiox5ibLlryM4yNlYnuviwTkFVpAQb2Oj5HkKkQldZBHMaQ1OJp3PGreT7p/2PPJMam9fa8
 FT12rltKJgSOHs/+9/u549aLjZD5aZkMroVZdaXT84pKnmELYaQUsLQ7/rZQcBR2S1GXuE20S
 86piXqNd53f7kS5u3FThP953yIwI83q2ioPmXpx7zyBMUTNmGIngg/7Bj+X/8/EXB3PkCz0Rg
 4H05N4YTzcaH/GVsYu0lTJ1XRiAnS+eqrswd1d7UlLg10+hVGGCfcoNHouBOCbMS9E2F5xJhM
 cjLewPZ18GwH6ZDciatIE26agZfYt0A93XNV6dAGb//eN8GLT3ObyDg6fiAQpEbWKEo0Ol1Et
 hOZPGu5xNQbuvT1vo9ks2WtSanYlgy5U34PhiMCE7rivNM1frgYpIgZnN+e1JHD6aVDXdVA4y
 wS45IG0j8uaVD02h5FPQTKne8TsOdOqqsl/PqI+O2uJQ3xMarCdkvrkVyoxFXOq6Bu9qbqylR
 4GMsM5jcLLKr18o3WgJjG32VTxhHBuXtICyt6ZiCTJVV/xaYaJ2sDyVq690oQeQdn+X8RI3ME
 81/sP0JeCiNYMlAQ6gpgFO+Y5t+OxRTQnfl4FEgTLF52GKQhkxDMJHVcE2jHtPO9ydhMW1Lha
 UImUUJei+zRm2lM5Z5+nRPjl7W1TO+kjc6uo9zZnLXL5iVcI7JOFW/BxQGMSGqRCD3CUwlz0Y
 CHBeAgeTFQJMBvxyG
X-Spam-Score: -0.7 (/)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 Maurice =?utf-8?Q?Br=C3=A9mond?= <Maurice.Bremond@HIDDEN>,
 Ludovic =?utf-8?Q?Court=C3=A8s?= <ludo@HIDDEN>, bug-guix@HIDDEN
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -1.7 (-)

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


zimoun <zimon.toutoune@HIDDEN> writes:
>> > The format of metadata (disassemble) that you propose is schemish
>> > (obviously! :-)) but we could propose something more JSON-like.
>>
>> Sure, if that helps get other people on-board, why not (though sexps
>> have lived much longer than JSON and XML together :-)).
>
> Lived much longer and still less less less used than JSON or XML alone. ;-)

Though this is likely not a function of the format, but of the
popularity of both Javascript and Java.

JSON isn’t a well defined format for arbitrary data (try to store
numbers as keys and reason about what you get as return-values), and
XML is a monster of complexity.
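
A small Python sketch of the numeric-key problem, using only the standard `json` module; the `round_trip` helper is a name introduced here for illustration.  JSON object keys are always strings, so integer keys come back as strings, and an integer key can collide with a string key that was distinct before serialization.

```python
import json


def round_trip(value):
    """Serialize to JSON text and parse it back."""
    return json.loads(json.dumps(value))


if __name__ == "__main__":
    # The integer key 1 is written as the string "1".
    print(round_trip({1: "a"}))
    # Two distinct keys before serialization collapse into one afterwards.
    print(round_trip({1: "a", "1": "b"}))
```

Other JSON libraries may handle such keys differently (for instance by rejecting them), which is the kind of ambiguity referred to above.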

Best wishes,
Arne
-- 
Being apolitical
means being political
without noticing it

--=-=-=
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEE801qEjXQSQPNItXAE++NRSQDw+sFAl8VzucACgkQE++NRSQD
w+vH5A/+O4YSG9c/P8FD66fdhZ/tOcHBvSxfDu0GDfyB6O9gILuHtDM+OJFlcPvB
Nplo+FV/abU5mw7CeEgbVaK/Nv5MPTEHwbZBTZvlpwPPYtFpyLbxwbTqMa6Tgp9z
4Ml/L4FXlDBc1ohEZqJQqWouLOl0LjClMMPBv+rsThZZSBiRdYEUIXOQfrJv7tMi
WjosPJqtQ5Sp9QFxKTwbLayHVvbFyY095EyQVhy/7BY6+thaGGVYCjz0CcozJZhr
M/ebgF32Geu+IQtf1+hnXJCdQj4mEc5ALgz97qT7KXFwOpnZ0hT78dBooZY+5laD
0/pgrWwboNI3kRpDQw0PCeaq05Q3+ppLo++NZ1s+9vDUEW1uvzcjeSqv+Cm+wn++
3KmDDmVN2VuRTGQmBE9XdIqI0SYb65OXzzGaDoB8fvQzZgvMlhcCfrp+BtrjHWy0
UzNT4YScuZVTUgXJ49Hk+enihJAMGTyfOmwMo4eOaoQYxIuKayjtfIk8+CsCJWro
JlWMmPB0golG9EjO6cAK1zWN8gpzXuhATNsUvIqv2qHWzVSsh+rzDAf3xlsDO1rm
JIduKYyyuP9QaEtEjmLzCbER+4Qwzll3vsaelxfrQbs3xIbGKmm4vAzJPI/TVa5J
Y24v09FlMXw8bp+pw9XVC+V0fueRqMMD2+Log/ZTFnHbeX/uzguIswQBAQgAHRYh
BN0ovebZh1yrzkqLHdzPDbMLwQVIBQJfFc7nAAoJENzPDbMLwQVIehgD/3sqChU9
MHZfBv6LXzVixV8F68JW4UxKzEPOzYAr7MDmKgT1VN5gdltKCq+GYCgfD8CXepNw
qqL2K+DbapBBvTGpLXcJp36I0VbdOL04mshW6XMVJP33Cgyg9c5c569TiVV1R0Gk
GHr4eal/jedvaqlhit6qqmbWsI+ERHApnHQS
=6n8s
-----END PGP SIGNATURE-----
--=-=-=--




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 20 Jul 2020 15:52:29 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Jul 20 11:52:28 2020
Received: from localhost ([127.0.0.1]:35887 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1jxY5c-0002a2-Gw
	for submit <at> debbugs.gnu.org; Mon, 20 Jul 2020 11:52:28 -0400
Received: from mail-qt1-f193.google.com ([209.85.160.193]:39308)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <zimon.toutoune@HIDDEN>) id 1jxY5a-0002Zn-RJ
 for 42162 <at> debbugs.gnu.org; Mon, 20 Jul 2020 11:52:27 -0400
Received: by mail-qt1-f193.google.com with SMTP id w9so823405qts.6
 for <42162 <at> debbugs.gnu.org>; Mon, 20 Jul 2020 08:52:26 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=mime-version:references:in-reply-to:from:date:message-id:subject:to
 :cc:content-transfer-encoding;
 bh=6yY2+ItFn799SwQP0b0qpZo6OSFqO3GENHFW05g8sSM=;
 b=ObApXvrb6JfdIOuO2Yd6kHhOnqFDxqFA5OR4KSPxQA2YxTglMnWr4XtShI+p5Fmr/F
 wteauilSVlS+BDM69DCA7q/yoz+/0VS1fECQDh3Yz1JN1WJ8GU/3B88ARsUfMZct2Zpv
 CvB4lAKmCgX9ZsN7JYmihj+yOX/UDQkkg06Aa3Ol68wR9AyZ7W9b8DClRIHw13Ci2r6n
 28RUWH8DBBcQBBj5l5yAD926dUHeGRtYqnZq87ySu3Rek/YVnXTrZSd3nWhgkmBVlL2f
 OA1uY2ZxZIL2TKwM0qYytu6fvUxhMo4dgCHLIQxB276Vph06WNdbCYTKcWpYSFklFMKw
 zXMg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc:content-transfer-encoding;
 bh=6yY2+ItFn799SwQP0b0qpZo6OSFqO3GENHFW05g8sSM=;
 b=FpQSHDzgRuLNYcrduo869TKFX5mD3YFVXXyNhwM/Tm3j0bXJb6Y+GnJdHKzivQRSYZ
 JL4kvI0K/rUtMiYtIqR6i56Yg6G4HHCiGs3DED+RFvXOBCsDjLycH8DGpbzaJphy8E3P
 U+59ih0waWDTs9o1RInzxbG7Od4vO0dR3DXzFcOtz29gv2QCwKpxeQ1T97ZHlAc/3jGf
 McrweSqWsUdRGNZnzZct0lOnlX6mseBA03qN0Fyh/FUFm4FsDFWZoToRu9kCjJHH8OH8
 W9diTtHM2/v/e5VRwD2b1CtssZfqG0YArLCyTHBLBNtmaY3+7cUv1Bvj2eFKcUI7Y+gI
 jgbg==
X-Gm-Message-State: AOAM531opwisS+U+XH7J5nrbKMzA/cdfFr18hx5gZG/7OWs0gEMsIYby
 6EbgW5QRJ1xUXO+6INfca/lN6XAAQCTOd5IXGMc=
X-Google-Smtp-Source: ABdhPJzXsR8UaDNJsfDMd3h34l/gx1x8kbGUM4K3u+a/Fp86UjmqpatewgKU5TS9xe3VH3XHvez6Q/xfINJx9ZL6Yx4=
X-Received: by 2002:aed:34e2:: with SMTP id x89mr37227qtd.313.1595260341047;
 Mon, 20 Jul 2020 08:52:21 -0700 (PDT)
MIME-Version: 1.0
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <CAJ3okZ2iesMKLLD29qrOJzNBjV=haoPFtpRg9T=0aTA8ZxLQMA@HIDDEN>
 <87365mzil1.fsf@HIDDEN>
In-Reply-To: <87365mzil1.fsf@HIDDEN>
From: zimoun <zimon.toutoune@HIDDEN>
Date: Mon, 20 Jul 2020 17:52:09 +0200
Message-ID: <CAJ3okZ0iMNjv93MM1FkEB3_zXA48Rq3rKXhwwug85fNRRc41Mg@HIDDEN>
Subject: Re: Recovering source tarballs
To: =?UTF-8?Q?Ludovic_Court=C3=A8s?= <ludo@HIDDEN>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 =?UTF-8?Q?Maurice_Br=C3=A9mond?= <Maurice.Bremond@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

Hi,

On Mon, 20 Jul 2020 at 10:39, Ludovic Courtès <ludo@HIDDEN> wrote:
> zimoun <zimon.toutoune@HIDDEN> writes:
> > On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@HIDDEN> wrote:

> There are many many comments in your message, so I took the liberty to
> reply only to the essence of it.  :-)

Many comments because many open topics. ;-)


> However, the two examples above are good ideas as to the way forward: we
> could start a url-fetch-to-git-fetch migration in these two cases, and
> perhaps more.

Well, to be honest, I have tried to probe such migration when I opened
this thread:

https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html

and I have tried to summarize the pros/cons arguments here:

https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00448.html


> > What about in addition push to IPFS?  Feasible?  Lookup issue?
>
> Lookup issue.  :-)  The hash in a CID is not just a raw blob hash.
> Files are typically chunked beforehand, assembled as a Merkle tree, and
> the CID is roughly the hash to the tree root.  So it would seem we can’t
> use IPFS as-is for tarballs.

Using the Git-repo map/table, then it becomes an option, right?
Well, SWH would be a backend and IPFS could be another one.  Or any
"cloudy" storage system that could appear in the future, right?


> >>   • If we no longer deal with tarballs but upstreams keep signing
> >>     tarballs (not raw directory hashes), how can we authenticate our
> >>     code after the fact?
> >
> > Does Guix automatically authenticate code using signed tarballs?
>
> Not automatically; packagers are supposed to authenticate code when they
> add a package (‘guix refresh -u’ does that automatically).

So I do not see the point of keeping this authentication information for
a future where upstream has disappeared.
The authentication is done at packaging time.  So once it is done,
merged into master and then pushed to SWH, being able to authenticate
again does not really matter.

And if it matters, all should be updated each time vulnerabilities are
discovered and so I am not sure SWH makes sense for this use-case.


> But today, we store tarball hashes, not directory hashes.

We store what "guix hash" returns. ;-)
So it is easy to migrate from tarball hashes to whatever else. :-)
I mean, it is "(sha256 (base32" and it is easy to have also
"(sha256-tree (base32" or something like that.

That is, in the case where the integrity hash is also used as the lookup key.

> > The format of metadata (disassemble) that you propose is schemish
> > (obviously! :-)) but we could propose something more JSON-like.
>
> Sure, if that helps get other people on-board, why not (though sexps
> have lived much longer than JSON and XML together :-)).

Lived much longer and still less less less used than JSON or XML alone. ;-)


I have not yet done clean back-of-the-envelope computations.  Roughly,
there are ~23 commits on average per day updating packages, so if 70%
of them are url-fetch, that is ~16 new tarballs per day, on average.
How will the model using a Git repo scale?  Naively, the
output of "disassemble-archive" in full text (pretty-print format) for
hello-2.10.tar is 120KB, so 16*365*120KB = ~700MB per year
without considering all the Git internals.  Obviously, it depends on
the number of files and I do not know if hello is a representative
example.
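
The estimate above can be replayed as a small sketch.  The three inputs
are the rough figures quoted in this message; the function name is made
up for illustration, and Git overhead and compression are ignored.

```python
# Back-of-the-envelope estimate; all inputs are the rough figures quoted above.
COMMITS_PER_DAY = 23        # average package-updating commits per day
URL_FETCH_SHARE = 0.70      # share of packages using url-fetch
METADATA_KB = 120           # pretty-printed metadata size for hello-2.10.tar


def yearly_metadata_mb(commits_per_day, url_fetch_share, metadata_kb):
    """Return (tarballs per day, metadata MB per year), ignoring Git overhead."""
    tarballs_per_day = round(commits_per_day * url_fetch_share)
    kb_per_year = tarballs_per_day * 365 * metadata_kb
    return tarballs_per_day, kb_per_year / 1000


tarballs, mb = yearly_metadata_mb(COMMITS_PER_DAY, URL_FETCH_SHARE, METADATA_KB)
print(f"{tarballs} tarballs/day, about {mb:.0f} MB/year")
```

The result scales linearly with the metadata size, so a package with
many more files than hello would shift the total accordingly.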

And I do not know how Git handles binary files if the disassembled
tarball is stored as a .go file, or in any other binary format.


All the best,
simon

PS:
In case someone wants to check where my estimates come from.

--8<---------------cut here---------------start------------->8---
for ci in $(git log --after=v1.0.0 --oneline \
                | grep "gnu:" | grep -E "(Add|Update)" \
                | cut -f1 -d' ')
do
    git --no-pager log -1 $ci --format="%cs"
done | uniq -c > /tmp/commits

guix environment --ad-hoc r-minimal \
     -- R -e 'summary(read.table("/tmp/commits"))'

gzip -dc < $(guix build -S hello) > /tmp/hello.tar
guix repl -L /tmp/tar/

scheme@(guix-user)> (call-with-input-file "hello.tar"
          (lambda (port)
                 (disassemble-archive port)))
--8<---------------cut here---------------end--------------->8---




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 20 Jul 2020 08:39:23 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Jul 20 04:39:23 2020
Received: from localhost ([127.0.0.1]:33685 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1jxRKR-0001i9-JX
	for submit <at> debbugs.gnu.org; Mon, 20 Jul 2020 04:39:23 -0400
Received: from eggs.gnu.org ([209.51.188.92]:53034)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <ludo@HIDDEN>) id 1jxRKP-0001hs-Rs
 for 42162 <at> debbugs.gnu.org; Mon, 20 Jul 2020 04:39:18 -0400
Received: from fencepost.gnu.org ([2001:470:142:3::e]:58466)
 by eggs.gnu.org with esmtp (Exim 4.90_1)
 (envelope-from <ludo@HIDDEN>)
 id 1jxRKJ-0003ov-Mv; Mon, 20 Jul 2020 04:39:11 -0400
Received: from [2001:660:6102:320:e120:2c8f:8909:cdfe] (port=56700 helo=ribbon)
 by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256)
 (Exim 4.82) (envelope-from <ludo@HIDDEN>)
 id 1jxRKG-00010O-Ak; Mon, 20 Jul 2020 04:39:09 -0400
From: =?utf-8?Q?Ludovic_Court=C3=A8s?= <ludo@HIDDEN>
To: zimoun <zimon.toutoune@HIDDEN>
Subject: Re: Recovering source tarballs
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
 <CAJ3okZ2iesMKLLD29qrOJzNBjV=haoPFtpRg9T=0aTA8ZxLQMA@HIDDEN>
X-URL: http://www.fdn.fr/~lcourtes/
X-Revolutionary-Date: 3 Thermidor an 228 de la =?utf-8?Q?R=C3=A9volution?=
X-PGP-Key-ID: 0x090B11993D9AEBB5
X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc
X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4  0CFB 090B 1199 3D9A EBB5
X-OS: x86_64-pc-linux-gnu
Date: Mon, 20 Jul 2020 10:39:06 +0200
In-Reply-To: <CAJ3okZ2iesMKLLD29qrOJzNBjV=haoPFtpRg9T=0aTA8ZxLQMA@HIDDEN>
 (zimoun's message of "Wed, 15 Jul 2020 18:55:21 +0200")
Message-ID: <87365mzil1.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 Maurice =?utf-8?Q?Br=C3=A9mond?= <Maurice.Bremond@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

Hi!

There are many many comments in your message, so I took the liberty to
reply only to the essence of it.  :-)

zimoun <zimon.toutoune@HIDDEN> skribis:

> On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@HIDDEN> wrote:
>
>> For the now, since 70% of our packages use ‘url-fetch’, we need to be
>> able to fetch or to reconstruct tarballs.  There’s no way around it.
>
> Yes, but for example all the packages in gnu/packages/bioconductor.scm
> could be "git-fetch".  Today the source is over url-fetch but it could
> be over git-fetch with https://git.bioconductor.org/packages/flowCore or
> git@HIDDEN:packages/flowCore.
>
> Another example is the packages in gnu/packages/emacs-xyz.scm and the
> ones from elpa.gnu.org are "url-fetch" and could be "git-fetch", for
> example using
> http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD
>
> So I would be more reserved about the "no way around it". :-)  I mean
> the 70% could be a bit mitigated.

The “no way around it” was about the situation today: it’s a fact that
70% of packages are built from tarballs, so we need to be able to fetch
them or reconstruct them.

However, the two examples above are good ideas as to the way forward: we
could start a url-fetch-to-git-fetch migration in these two cases, and
perhaps more.

>> In the short term, we should arrange so that the build farm keeps GC
>> roots on source tarballs for an indefinite amount of time.  Cuirass
>> jobset?  Mcron job to preserve GC roots?  Ideas?
>
> Yes, preserving source tarballs for an indefinite amount of time will
> help.  At least all the packages where "lookup-content" returns #f,
> which means they are not in SWH or they are unreachable -- both is
> equivalent from Guix side.
>
> What about in addition push to IPFS?  Feasible?  Lookup issue?

Lookup issue.  :-)  The hash in a CID is not just a raw blob hash.
Files are typically chunked beforehand, assembled as a Merkle tree, and
the CID is roughly the hash to the tree root.  So it would seem we can’t
use IPFS as-is for tarballs.
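
To illustrate why a raw tarball hash cannot serve as a lookup key in
such a scheme, here is a toy sketch.  It is NOT the actual IPFS CID
algorithm (which has its own chunking, DAG layout and encoding); it only
shows that a root hash computed over chunk hashes differs from the flat
hash of the same bytes, and depends on the chunk size.  The function
names and the tiny chunk size are assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose; real systems use far larger chunks


def flat_hash(data: bytes) -> str:
    """Plain SHA-256 over the whole byte string, like a tarball hash."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(data: bytes, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash each chunk, then hash pairs of hashes until one root remains."""
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)] or [b""]
    level = [hashlib.sha256(c).digest() for c in chunks]
    while len(level) > 1:
        level = [hashlib.sha256(b"".join(level[i:i + 2])).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()


data = b"pretend this is a tarball"
print(flat_hash(data))     # hash of the raw bytes
print(merkle_root(data))   # root of the toy Merkle tree: a different value
```

Knowing only the first value, there is no way to compute the second
without having the bytes at hand, which is the lookup issue.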

>> For the future, we could store nar hashes of unpacked tarballs instead
>> of hashes over tarballs.  But that raises two questions:
>>
>>   • If we no longer deal with tarballs but upstreams keep signing
>>     tarballs (not raw directory hashes), how can we authenticate our
>>     code after the fact?
>
> Does Guix automatically authenticate code using signed tarballs?

Not automatically; packagers are supposed to authenticate code when they
add a package (‘guix refresh -u’ does that automatically).

>>   • SWH internally stores Git-tree hashes, not nar hashes, so we still
>>     wouldn’t be able to fetch our unpacked trees from SWH.
>>
>> (Both issues were previously discussed at
>> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
>>
>> So for the medium term, and perhaps for the future, a possible option
>> would be to preserve tarball metadata so we can reconstruct them:
>>
>>   tarball = metadata + tree
>
> There are different issues at different levels:
>
>  1. how to lookup? what information do we need to keep/store to be able
>     to query SWH?
>  2. how to check the integrity? what information do we need to
>     keep/store to be able to verify that SWH returns what Guix expects?
>  3. how to authenticate? where the tarball metadata has to be stored if
>     SWH removes it?
>
> Basically, the git-fetch source stores 3 identifiers:
>
>  - upstream url
>  - commit / tag
>  - integrity (sha256)
>
> Fetching from SWH requires the commit only (lookup-revision) or the
> tag+url (lookup-origin-revision) then from the returned revision, the
> integrity of the downloaded data is checked using the sha256, right?

Yes.

> Therefore, one way to fix lookup of the url-fetch source is to add an
> extra field mimicking the commit role.

But today, we store tarball hashes, not directory hashes.
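
A minimal sketch of the difference, using Python's standard tarfile
module rather than the Guile code discussed in this thread: two archives
with identical file contents but different timestamps hash differently
as tarballs, while the per-file content hashes agree.  The helper names
(make_tar, disassemble) and the metadata fields are assumptions for the
example, not the format produced by "disassemble-archive".

```python
import hashlib
import io
import tarfile


def make_tar(files, mtime):
    """Build an uncompressed tar in memory; `files` maps names to bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def disassemble(tar_bytes):
    """Split a tar into per-file header metadata plus content hashes."""
    entries = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for info in tar:
            data = tar.extractfile(info).read() if info.isfile() else b""
            entries.append({"name": info.name, "mode": info.mode,
                            "mtime": info.mtime, "size": info.size,
                            "sha256": hashlib.sha256(data).hexdigest()})
    return entries


files = {"hello/README": b"hi\n", "hello/main.c": b"int main(void){return 0;}\n"}
old, new = make_tar(files, mtime=0), make_tar(files, mtime=1_594_000_000)

tarball_hashes = [hashlib.sha256(t).hexdigest() for t in (old, new)]
content_hashes = [[e["sha256"] for e in disassemble(t)] for t in (old, new)]
print(tarball_hashes[0] != tarball_hashes[1])   # do the archive hashes differ?
print(content_hashes[0] == content_hashes[1])   # are the file contents identical?
```

This is why a tarball hash alone cannot be recomputed from an archived
tree: the header metadata has to be kept somewhere as well.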

> The easiest is to store a SWHID or an identifier allowing to deduce the
> SWHID.
>
> I have not checked the code, but something like this:
>
>   https://pypi.org/project/swh.model/
>   https://forge.softwareheritage.org/source/swh-model/
>
> and at package time, this identifier is added, similarly to integrity.

I’m skeptical about adding a field that is practically never used.

[...]

>> The code below can “disassemble” and “assemble” a tar.  When it
>> disassembles it, it generates metadata like this:
>
> [...]
>
>> The ’assemble-archive’ procedure consumes that, looks up file contents
>> by hash on SWH, and reconstructs the original tarball…
>
> Where do you plan to store the "disassembled" metadata?
> And where do you plan to "assemble-archive"?

We’d have a repo/database containing metadata indexed by tarball sha256.

> How should this database that maps tarball hashes to metadata be
> maintained?  Git push hook?  Cron task?

Yes, something like that.  :-)

> What about foreign channels?  Should they maintain their own map?

Yes, presumably.

> To summarize, it would work like this, right?
>
> at package time:
>  - store an integrity identifier (today sha256-nix-base32)
>  - disassemble the tarball
>  - commit to another repo the metadata using the path (address)
>    sha256/base32/<identifier>
>  - push to packages-repo *and* metadata-database-repo
>
> at future time: (upstream has disappeared, say!)
>  - use the integrity identifier to query the database repo
>  - lookup the SWHID from the database repo
>  - fetch the data from SWH
>  - or lookup the IPFS identifier from the database repo and fetch the
>    data from IPFS, for another example
>  - re-assemble the tarball using the metadata from the database repo
>  - check integrity, authentication, etc.

That’s the idea.

> The format of metadata (disassemble) that you propose is schemish
> (obviously! :-)) but we could propose something more JSON-like.

Sure, if that helps get other people on-board, why not (though sexps
have lived much longer than JSON and XML together :-)).

Thanks,
Ludo’.




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 15 Jul 2020 16:55:46 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed Jul 15 12:55:46 2020
Received: from localhost ([127.0.0.1]:53454 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1jvkh4-0004KV-3B
	for submit <at> debbugs.gnu.org; Wed, 15 Jul 2020 12:55:46 -0400
Received: from mail-qt1-f180.google.com ([209.85.160.180]:34101)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <zimon.toutoune@HIDDEN>) id 1jvkh0-0004KG-5d
 for 42162 <at> debbugs.gnu.org; Wed, 15 Jul 2020 12:55:40 -0400
Received: by mail-qt1-f180.google.com with SMTP id w34so2267187qte.1
 for <42162 <at> debbugs.gnu.org>; Wed, 15 Jul 2020 09:55:38 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=mime-version:references:in-reply-to:from:date:message-id:subject:to
 :cc:content-transfer-encoding;
 bh=AKvtKOvImqiqraOsC7W4AmagSpzQR/Pc4DK8OvFqlP0=;
 b=Hy45TSTAwVfMIg8JVIUE1+UuitDTS4bY3j0jhK9SOYhxc7SxFrAcRIGaXCnxikUR8o
 rWcolatV+s/c9JQFDn+Prr35mNAvV0k+HXN9szSHqTiT1rQtnslm6KV9uM1D4Rj2Icin
 dN8pmuFcER65/PJubD0zr/hQfyQnWENS/ASHIxuGPzSWE4KrUo1oIkEIU+3HwZaF4n7Z
 iVqjuhpWz1pqnVyYRs+Ks8J21fFqoqLOeDgJAZXgc/rzjIipY6wVw2Jv+WC6QIly7i1G
 YesTsqmS+Cdi4x8d63O9UGVACcDxG4hcZg01/kPzVL9uhY7+l63aekQ9PDotHvEbD7Ue
 MKKA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc:content-transfer-encoding;
 bh=AKvtKOvImqiqraOsC7W4AmagSpzQR/Pc4DK8OvFqlP0=;
 b=f/p+JAeWkbenAmJvFdAn19ZtjEvE18caaOjbTXcus73UHDsn8n2wsH54bzwn5t3OAy
 M7Kbj55WP+ohAsSCHOz/VuMVFYbd/CbtTg6ayHxATzK1Ld+8ZmxubsY76z797jrjoaxp
 b9A6+e3DJXYq+QuPMEDs9+CaQKTmrB1DjrFlukIZzQvGOenYKsvouhNVgXNiCfcgB7c6
 uxrU2xnIcyFVoKjf+lbOpYFfyre+CxcJApqbTXYQbsKVaNKu+g3IgSZQ0AEb8Of5aAnu
 +aRgbGFXKWt1zYG6FhpyTYqOF6r461Lk/zzIHkffcpwqQI0UD2To8cde7G3pk1UO1OoY
 DOIg==
X-Gm-Message-State: AOAM5328IpM4BLND5lIYh3zBSyVZl8bdZFG3N34p5qaRbFNemNTPXnJr
 /F0DkN2UPYOaIVFRb6C05UO3uoTqElQ8urRgQ3A=
X-Google-Smtp-Source: ABdhPJziRxOVUynGzvTwaViQ1xR3c2sh8Vz6Zg9Z+vdat0tllH+W/qFDibId/nClbg0S+LHSZDyxV4AaWNdW5qllJcE=
X-Received: by 2002:ac8:4649:: with SMTP id f9mr676589qto.313.1594832132217;
 Wed, 15 Jul 2020 09:55:32 -0700 (PDT)
MIME-Version: 1.0
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
In-Reply-To: <87r1tit5j6.fsf_-_@HIDDEN>
From: zimoun <zimon.toutoune@HIDDEN>
Date: Wed, 15 Jul 2020 18:55:21 +0200
Message-ID: <CAJ3okZ2iesMKLLD29qrOJzNBjV=haoPFtpRg9T=0aTA8ZxLQMA@HIDDEN>
Subject: Re: Recovering source tarballs
To: =?UTF-8?Q?Ludovic_Court=C3=A8s?= <ludo@HIDDEN>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 =?UTF-8?Q?Maurice_Br=C3=A9mond?= <Maurice.Bremond@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

Hi Ludo,

Well, you enlarge the discussion to more than the issue of the 5
url-fetch packages on gforge.inria.fr :-)


First of all, you wrote [1] "Migration away from tarballs is already
happening as more and more software is distributed straight from
content-addressed VCS repositories, though progress has been relatively
slow since we first discussed it in 2016."  On the other hand, Guix
more often than not [2] uses "url-fetch" even if "git-fetch" is available
upstream.  In other words, I am not convinced the migration is really
happening...

The issue would be mitigated if Guix transitions from "url-fetch" to
"git-fetch" when possible.

1: https://forge.softwareheritage.org/T2430#45800
2: https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html


Second, trying to do some stats about the SWH coverage, I note that a
non-negligible number of "url-fetch" sources are reachable by
"lookup-content".  Measuring the coverage is not straightforward because
of the 120 requests per hour rate limit and unexpected server errors.
Another story.

Well, I would like to have numbers because I do not know what the
concrete issue is: how many "url-fetch" packages are reachable?  And
if they are unreachable, is it because they are not in SWH yet, or is it
because Guix does not have enough info to look them up?


On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@HIDDEN> wrote:

> For the now, since 70% of our packages use ‘url-fetch’, we need to be
> able to fetch or to reconstruct tarballs.  There’s no way around it.

Yes, but for example all the packages in gnu/packages/bioconductor.scm
could be "git-fetch".  Today the source is over url-fetch but it could
be over git-fetch with https://git.bioconductor.org/packages/flowCore or
git@HIDDEN:packages/flowCore.

Another example is the packages in gnu/packages/emacs-xyz.scm and the
ones from elpa.gnu.org are "url-fetch" and could be "git-fetch", for
example using
http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD

So I would be more reserved about the "no way around it". :-)  I mean
the 70% could be a bit mitigated.


> In the short term, we should arrange so that the build farm keeps GC
> roots on source tarballs for an indefinite amount of time.  Cuirass
> jobset?  Mcron job to preserve GC roots?  Ideas?

Yes, preserving source tarballs for an indefinite amount of time will
help.  At least all the packages where "lookup-content" returns #f,
which means they are not in SWH or they are unreachable -- both are
equivalent from the Guix side.

What about in addition push to IPFS?  Feasible?  Lookup issue?

> For the future, we could store nar hashes of unpacked tarballs instead
> of hashes over tarballs.  But that raises two questions:
>
>   • If we no longer deal with tarballs but upstreams keep signing
>     tarballs (not raw directory hashes), how can we authenticate our
>     code after the fact?

Does Guix automatically authenticate code using signed tarballs?


>   • SWH internally stores Git-tree hashes, not nar hashes, so we still
>     wouldn’t be able to fetch our unpacked trees from SWH.
>
> (Both issues were previously discussed at
> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
>
> So for the medium term, and perhaps for the future, a possible option
> would be to preserve tarball metadata so we can reconstruct them:
>
>   tarball = metadata + tree

There are different issues at different levels:

 1. how to lookup? what information do we need to keep/store to be able
    to query SWH?
 2. how to check the integrity? what information do we need to
    keep/store to be able to verify that SWH returns what Guix expects?
 3. how to authenticate? where the tarball metadata has to be stored if
    SWH removes it?

Basically, the git-fetch source stores 3 identifiers:

 - upstream url
 - commit / tag
 - integrity (sha256)

Fetching from SWH requires the commit only (lookup-revision) or the
tag+url (lookup-origin-revision) then from the returned revision, the
integrity of the downloaded data is checked using the sha256, right?

Therefore, one way to fix lookup of the url-fetch source is to add an
extra field mimicking the commit role.

The easiest is to store a SWHID or an identifier from which the SWHID
can be deduced.

I have not checked the code, but something like this:

  https://pypi.org/project/swh.model/
  https://forge.softwareheritage.org/source/swh-model/

and at package time, this identifier is added, similarly to integrity.
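
As a rough illustration of "deducing" such an identifier at package
time, the sketch below computes the SWHID of a single file's content
from its bytes.  It does not use swh.model; it relies on the fact that
content SWHIDs are built on the Git blob hash (SHA-1 over a
"blob <size>" header, a NUL byte, then the data).  Directory
identifiers need more work and are not covered; the function names are
made up for the example.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """SHA-1 over a Git blob object: header 'blob <size>\\0' followed by the data."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def content_swhid(data: bytes) -> str:
    """Format the blob hash as a SWHID for a content object."""
    return "swh:1:cnt:" + git_blob_sha1(data)


data = b"hello\n"
print(hashlib.sha256(data).hexdigest())  # plain SHA-256 of the same bytes
print(content_swhid(data))               # identifier in the SWHID scheme
```

The two printed values are unrelated hashes of the same bytes, so one
cannot be derived from the other without the content itself.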

Aside, does Guix use the authentication metadata that tarballs provide?


(BTW, I failed [3,4] to package swh.model, so if someone wants to give
it a try...
3: https://lists.gnu.org/archive/html/help-guix/2020-06/msg00158.html
4: https://lists.gnu.org/archive/html/help-guix/2020-06/msg00161.html )


> After all, tarballs are byproducts and should be no exception: we should
> build them from source.  :-)

[...]

> The code below can “disassemble” and “assemble” a tar.  When it
> disassembles it, it generates metadata like this:

[...]

> The ’assemble-archive’ procedure consumes that, looks up file contents
> by hash on SWH, and reconstructs the original tarball…

Where do you plan to store the "disassembled" metadata?
And where do you plan to "assemble-archive"?

I mean,

 What is pushed to SWH? And how?
 What is fetched from SWH? And how?

(Well, answer below. :-))

> … at least in theory, because in practice we hit the SWH rate limit
> after looking up a few files:

Yes, it is 120 requests per hour and 10 saves per hour.  Well, I do not
think they will increase these numbers much in general.  However,
they seem open to raising them for specific machines.  So, I do not want
to speak for them, but we could ask for a higher rate limit for
ci.guix.gnu.org, for example.  Then we need to distinguish between source
substitutes and binary substitutes.  Basically, when a user runs
"guix build foo", if the source is available neither upstream nor on
ci.guix.gnu.org, then ci.guix.gnu.org fetches the missing source from
SWH and delivers it to the user.
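
To give an idea of what the limit means for per-file lookups, here is a
small sketch.  It assumes, for simplicity, that every file of a tarball
costs exactly one lookup request; the function name is hypothetical.

```python
import math

REQUESTS_PER_HOUR = 120  # lookup limit mentioned above


def hours_of_quota(n_files: int, limit: int = REQUESTS_PER_HOUR) -> int:
    """Whole hours of quota needed when each file costs one lookup request."""
    return math.ceil(n_files / limit)


for n in (50, 120, 1000):
    print(n, "files ->", hours_of_quota(n), "hour(s) of quota")
```

Under that assumption, reassembling any tarball with more than 120 files
cannot complete within a single hour of quota.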


>   https://archive.softwareheritage.org/api/#rate-limiting
>
> So it’s a bit ridiculous, but we may have to store a SWH “dir”
> identifier for the whole extracted tree—a Git-tree hash—since that would
> allow us to retrieve the whole thing in a single HTTP request.

Well, the limited resources of SWH are an issue, but SWH is an archive,
not a mirror. :-)

And as I wrote above, we could ask SWH to increase the rate limit for
specific machines such as ci.guix.gnu.org.


> I think we’d have to maintain a database that maps tarball hashes to
> metadata (!).  A simple version of it could be a Git repo where, say,
> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
> contain the metadata above.  The nice thing is that the Git repo itself
> could be archived by SWH.  :-)

How should this database that maps tarball hashes to metadata be
maintained?  Git push hook?  Cron task?

What about foreign channels?  Should they maintain their own map?

To summarize, it would work like this, right?

at package time:
 - store an integrity identifier (today sha256-nix-base32)
 - disassemble the tarball
 - commit to another repo the metadata using the path (address)
   sha256/base32/<identifier>
 - push to packages-repo *and* metadata-database-repo

at future time: (upstream has disappeared, say!)
 - use the integrity identifier to query the database repo
 - lookup the SWHID from the database repo
 - fetch the data from SWH
 - or lookup the IPFS identifier from the database repo and fetch the
   data from IPFS, for another example
 - re-assemble the tarball using the metadata from the database repo
 - check integrity, authentication, etc.

Well, right, it is better than only adding a lookup identifier as I
described above, because it is more general and flexible than having
SWH as the only fallback.
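
The lookup side of the workflow summarized above could be sketched as
follows.  This is only an illustration under several assumptions: the
"database repo" is a plain directory standing in for a Git checkout, the
key is the hex SHA-256 rather than the nix-base32 encoding, the metadata
is serialized as JSON, and the function names are invented.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def store_metadata(repo: Path, tarball: bytes, metadata: dict) -> Path:
    """Write metadata under sha256/<hex digest>, the key a later lookup will use."""
    digest = hashlib.sha256(tarball).hexdigest()
    path = repo / "sha256" / digest
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metadata, sort_keys=True, indent=2))
    return path


def lookup_metadata(repo: Path, sha256_hex: str):
    """Return the stored metadata for a tarball hash, or None if unknown."""
    path = repo / "sha256" / sha256_hex
    return json.loads(path.read_text()) if path.exists() else None


repo = Path(tempfile.mkdtemp())            # stands in for a Git checkout
tarball = b"pretend tarball bytes"
store_metadata(repo, tarball, {"files": [{"name": "hello/README", "mode": 420}]})
found = lookup_metadata(repo, hashlib.sha256(tarball).hexdigest())
print(found)
```

Because the key is the integrity hash Guix already records, nothing new
has to be added to the package definition; committing and pushing the
directory is left out of the sketch.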

The format of metadata (disassemble) that you propose is schemish
(obviously! :-)) but we could propose something more JSON-like.


All the best,
simon




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 13 Jul 2020 19:20:35 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Jul 13 15:20:35 2020
Received: from localhost ([127.0.0.1]:49324 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1jv40A-0001h1-PF
	for submit <at> debbugs.gnu.org; Mon, 13 Jul 2020 15:20:35 -0400
Received: from mira.cbaines.net ([212.71.252.8]:39750)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <mail@HIDDEN>) id 1jv408-0001gt-VD
 for 42162 <at> debbugs.gnu.org; Mon, 13 Jul 2020 15:20:33 -0400
Received: from localhost (unknown [46.237.175.173])
 by mira.cbaines.net (Postfix) with ESMTPSA id C870727BBE1;
 Mon, 13 Jul 2020 20:20:31 +0100 (BST)
Received: from localhost (localhost [local])
 by localhost (OpenSMTPD) with ESMTPA id b0041d41;
 Mon, 13 Jul 2020 19:20:29 +0000 (UTC)
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN> <87r1tit5j6.fsf_-_@HIDDEN>
User-agent: mu4e 1.4.10; emacs 26.3
From: Christopher Baines <mail@HIDDEN>
To: Ludovic =?utf-8?Q?Court=C3=A8s?= <ludo@HIDDEN>
Subject: Re: bug#42162: Recovering source tarballs
In-reply-to: <87r1tit5j6.fsf_-_@HIDDEN>
Date: Mon, 13 Jul 2020 20:20:27 +0100
Message-ID: <87a703jk78.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-=";
 micalg=pgp-sha512; protocol="application/pgp-signature"
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 Maurice =?utf-8?Q?Br=C3=A9mond?= <Maurice.Bremond@HIDDEN>,
 zimoun <zimon.toutoune@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


Ludovic Courtès <ludo@HIDDEN> writes:

> Hi,
>
> Ludovic Courtès <ludo@HIDDEN> skribis:
>
>> There’s this other discussion you mentioned, which I hope will have a
>> positive outcome:
>>
>>   https://forge.softwareheritage.org/T2430
>
> This discussion as well as discussions on #swh-devel have made it clear
> that SWH will not archive raw tarballs, at least not in the foreseeable
> future.  Instead, it will keep archiving the contents of tarballs, as it
> has always done—that’s already a huge service.
>
> Not storing raw tarballs makes sense from an engineering perspective,
> but it does mean that we cannot rely on SWH as a content-addressed
> mirror for tarballs.  (In fact, some raw tarballs are available on SWH,
> but that’s mostly “by chance”, for instance because they appear as-is in
> a Git repo that was ingested.)  In fact this is one of the challenges
> mentioned in
> <https://guix.gnu.org/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/>.
>
> So we need a solution for now (and quite urgently), and a solution for
> the future.
>
> For the now, since 70% of our packages use ‘url-fetch’, we need to be
> able to fetch or to reconstruct tarballs.  There’s no way around it.
>
> In the short term, we should arrange so that the build farm keeps GC
> roots on source tarballs for an indefinite amount of time.  Cuirass
> jobset?  Mcron job to preserve GC roots?  Ideas?

Going forward, being methodical as a project about storing the tarballs
and source material for the packages is probably the way to ensure it's
available for the future. I'm not sure the data storage cost is
significant; the cost of doing this is probably in working out what to
store, doing so in a redundant manner, and making the data available.

The Guix Data Service knows about fixed output derivations, so it might
be possible to backfill such a store by just attempting to build those
derivations. It might also be possible to use the Guix Data Service to
work out what's available, and what tarballs are missing.

Chris

--=-=-=
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQKTBAEBCgB9FiEEPonu50WOcg2XVOCyXiijOwuE9XcFAl8Ms/tfFIAAAAAALgAo
aXNzdWVyLWZwckBub3RhdGlvbnMub3BlbnBncC5maWZ0aGhvcnNlbWFuLm5ldDNF
ODlFRUU3NDU4RTcyMEQ5NzU0RTBCMjVFMjhBMzNCMEI4NEY1NzcACgkQXiijOwuE
9XfAtw/6AtEyqRcimef5NTFchcAigC6fT6DJLcGnyJNUXlfZn6nHU9ao/ev33D5d
MFfKl1YljKf+fA848fZSIe0eBERbkZ+D1oed6SD6Xx8fG9ekCSgGtbmysNEcDDKK
qO5kg/QUbKYODpRW8iZIDMPUQZ0yNQu9KQdvVKIhHIZJnSGNt2XVjRdoCkW+H19m
QVPVdgqZIarkZctOzPegA8FFEi8O/GO7gK4gbizewecgsl1qL0yWBDyUJ9tsWeAH
+EsVykk91y9tHDPfQYfKqik7A0WrK75oeNOqs5QtEqRPjcMzwsDkIO13e5Y3Z5Yl
M7zTs7R/OLSyiSlT5z/1S5RrbMyMMryt0S4uvqjZfFDtgaOHxhVhBg/1kya/H5v1
cB3jq8WpvL6sDYFbSqI9vWPJnQDq5EpIvI16Ri0ygnMAffiz6hhtdn/pCGV7GG5U
7H6ED7gz5FB8YovGED1C9l8dh7h3Hi+1P+JL3KheJyF5bU829wqL9r2l5sOprad0
PEsq52RCwPBuNu8agTbobICimqFnp3B5wySDNEvkXZ4FFlMR6ZdW0BjBnLF0ZRU4
v8FCf+w81lAIksF9UWusZTzb++aMPXsdlHfelyWtOUi5mc1GMNRfCIW/VLIYyZIP
aqVPHoFkTWb6q6XK5tjC302Di/BD9qDEr5g9qFU16Yeq7ywcAjs=
=hgnz
-----END PGP SIGNATURE-----
--=-=-=--




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 11 Jul 2020 15:50:35 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sat Jul 11 11:50:35 2020
Received: from localhost ([127.0.0.1]:44804 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1juHlq-0007f6-6K
	for submit <at> debbugs.gnu.org; Sat, 11 Jul 2020 11:50:34 -0400
Received: from eggs.gnu.org ([209.51.188.92]:35232)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <ludo@HIDDEN>) id 1juHln-0007en-3b
 for 42162 <at> debbugs.gnu.org; Sat, 11 Jul 2020 11:50:32 -0400
Received: from fencepost.gnu.org ([2001:470:142:3::e]:56562)
 by eggs.gnu.org with esmtp (Exim 4.90_1)
 (envelope-from <ludo@HIDDEN>)
 id 1juHlg-0006i8-GC; Sat, 11 Jul 2020 11:50:24 -0400
Received: from [2a01:e0a:1d:7270:af76:b9b:ca24:c465] (port=57928 helo=ribbon)
 by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256)
 (Exim 4.82) (envelope-from <ludo@HIDDEN>)
 id 1juHlf-0007F3-F5; Sat, 11 Jul 2020 11:50:24 -0400
From: =?utf-8?Q?Ludovic_Court=C3=A8s?= <ludo@HIDDEN>
To: zimoun <zimon.toutoune@HIDDEN>
Subject: Recovering source tarballs
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
 <87d05etero.fsf@HIDDEN>
Date: Sat, 11 Jul 2020 17:50:21 +0200
In-Reply-To: <87d05etero.fsf@HIDDEN> ("Ludovic
 \=\?utf-8\?Q\?Court\=C3\=A8s\=22'\?\=
 \=\?utf-8\?Q\?s\?\= message of "Thu, 02 Jul 2020 12:03:39 +0200")
Message-ID: <87r1tit5j6.fsf_-_@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 Maurice =?utf-8?Q?Br=C3=A9mond?= <Maurice.Bremond@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.3 (---)

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Hi,

Ludovic Courtès <ludo@HIDDEN> skribis:

> There’s this other discussion you mentioned, which I hope will have a
> positive outcome:
>
>   https://forge.softwareheritage.org/T2430

This discussion as well as discussions on #swh-devel have made it clear
that SWH will not archive raw tarballs, at least not in the foreseeable
future.  Instead, it will keep archiving the contents of tarballs, as it
has always done; that’s already a huge service.

Not storing raw tarballs makes sense from an engineering perspective,
but it does mean that we cannot rely on SWH as a content-addressed
mirror for tarballs.  (In fact, some raw tarballs are available on SWH,
but that’s mostly “by chance”, for instance because they appear as-is in
a Git repo that was ingested.)  This is one of the challenges mentioned in
<https://guix.gnu.org/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/>.

So we need a solution for now (and quite urgently), and a solution for
the future.

For now, since 70% of our packages use ‘url-fetch’, we need to be
able to fetch or to reconstruct tarballs.  There’s no way around it.

In the short term, we should arrange so that the build farm keeps GC
roots on source tarballs for an indefinite amount of time.  Cuirass
jobset?  Mcron job to preserve GC roots?  Ideas?

For the future, we could store nar hashes of unpacked tarballs instead
of hashes over tarballs.  But that raises two questions:

  • If we no longer deal with tarballs but upstreams keep signing
    tarballs (not raw directory hashes), how can we authenticate our
    code after the fact?

  • SWH internally stores Git-tree hashes, not nar hashes, so we still
    wouldn’t be able to fetch our unpacked trees from SWH.

(Both issues were previously discussed at
<https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
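
To make the second point concrete, here is a minimal sketch of how a Git
blob identifier is computed, next to the plain SHA256 of the same bytes.
It assumes Guile-Gcrypt's (gcrypt hash) and Guix's (guix base16) are
available, as in the attached tar.scm; 'git-blob-hash' is a hypothetical
helper written for illustration, not something Guix or SWH provides.  Git
hashes a "blob SIZE\0" header followed by the contents with SHA-1, and tree
identifiers are built in the same spirit over a listing of entries, which
is why they cannot be compared with a nar hash of the same directory.

```scheme
(use-modules (gcrypt hash)
             (guix base16)
             (rnrs bytevectors))

(define (git-blob-hash data)
  "Return the Git blob identifier (a SHA-1, as a bytevector) of DATA, a
bytevector.  Hypothetical helper, for illustration only."
  (let* ((header (string->utf8
                  (string-append "blob "
                                 (number->string (bytevector-length data))
                                 "\0")))
         (len    (bytevector-length header))
         (object (make-bytevector (+ len (bytevector-length data)))))
    ;; A Git object is the header immediately followed by the raw contents.
    (bytevector-copy! header 0 object 0 len)
    (bytevector-copy! data 0 object len (bytevector-length data))
    (sha1 object)))

(define %empty (make-bytevector 0))

(bytevector->base16-string (git-blob-hash %empty))
;; "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"

(bytevector->base16-string (sha256 %empty))
;; "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
```

The two identifiers describe the same (empty) content, yet neither can be
derived from the other without the content itself.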

So for the medium term, and perhaps for the future, a possible option
would be to preserve tarball metadata so we can reconstruct them:

  tarball = metadata + tree

After all, tarballs are byproducts and should be no exception: we should
build them from source.  :-)

In <https://forge.softwareheritage.org/T2430>, Stefano mentioned
pristine-tar, which does almost that, but not quite: it stores a binary
delta between a tarball and a tree:

  https://manpages.debian.org/unstable/pristine-tar/pristine-tar.1.en.html

I think we should have something more transparent than a binary delta.

The code below can “disassemble” and “assemble” a tar.  When it
disassembles it, it generates metadata like this:

--8<---------------cut here---------------start------------->8---
(tar-source
  (version 0)
  (headers
    (("guile-3.0.4/"
      (mode 493)
      (size 0)
      (mtime 1593007723)
      (chksum 3979)
      (typeflag #\5))
     ("guile-3.0.4/m4/"
      (mode 493)
      (size 0)
      (mtime 1593007720)
      (chksum 4184)
      (typeflag #\5))
     ("guile-3.0.4/m4/pipe2.m4"
      (mode 420)
      (size 531)
      (mtime 1536050419)
      (chksum 4812)
      (hash (sha256
              "arx6n2rmtf66yjlwkgwp743glcpdsfzgjiqrqhfegutmcwvwvsza")))
     ("guile-3.0.4/m4/time_h.m4"
      (mode 420)
      (size 5471)
      (mtime 1536050419)
      (chksum 4974)
      (hash (sha256
              "z4py26rmvsk4st7db6vwziwwhkrjjrwj7nra4al6ipqh2ms45kka")))
[…]
--8<---------------cut here---------------end--------------->8---
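
A note on reading the excerpt above: tar headers store numeric fields as
octal strings, and the attached code parses them with radix 8, so the sexp
shows them in decimal.  The following sketch, which only uses core Guile,
converts a few of the values back to their familiar form;
'mode->octal-string' is a hypothetical helper, not part of tar.scm.

```scheme
(define (mode->octal-string mode)
  "Return MODE, an integer, as the octal string usually shown by 'ls' or
'chmod'.  Hypothetical helper, for illustration only."
  (number->string mode 8))

(mode->octal-string 493)                    ;"755", i.e., rwxr-xr-x
(mode->octal-string 420)                    ;"644", i.e., rw-r--r--

;; 'mtime' is a number of seconds since the Epoch.
(strftime "%Y-%m-%d" (gmtime 1593007723))   ;"2020-06-24"
```

Likewise, typeflag #\5 denotes a directory in the ustar format, which is
why those entries have size 0 and no hash.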

The ‘assemble-archive’ procedure consumes that metadata, looks up file
contents by hash on SWH, and reconstructs the original tarball…

… at least in theory, because in practice we hit the SWH rate limit
after looking up a few files:

  https://archive.softwareheritage.org/api/#rate-limiting

So it’s a bit ridiculous, but we may have to store a SWH “dir”
identifier for the whole extracted tree (a Git-tree hash), since that
would allow us to retrieve the whole thing in a single HTTP request.

Besides, we’ll also have to handle compression: storing gzip/xz headers
and compression levels.
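
As a rough idea of what “storing gzip headers” could mean, the sketch below
extracts the fixed 10-byte gzip header (RFC 1952) from a bytevector and
returns it as an sexp.  The 'gzip-header' sexp layout and the
'gzip-header->sexp' name are assumptions made for this example; optional
header fields (such as the original file name) are not handled, and the
exact compression level cannot be recovered from the header alone (XFL only
hints at “fastest” or “best”).

```scheme
(use-modules (rnrs bytevectors))

(define (gzip-header->sexp bv)
  "Return an sexp describing the fixed 10-byte gzip header at the start of
BV, or #f if BV does not start with the gzip magic bytes.  Hypothetical
helper, for illustration only."
  (and (>= (bytevector-length bv) 10)
       (= #x1f (bytevector-u8-ref bv 0))          ;ID1
       (= #x8b (bytevector-u8-ref bv 1))          ;ID2
       `(gzip-header
         (method ,(bytevector-u8-ref bv 2))       ;8 means "deflate"
         (flags ,(bytevector-u8-ref bv 3))
         (mtime ,(bytevector-u32-ref bv 4 (endianness little)))
         (xfl ,(bytevector-u8-ref bv 8))          ;2 = best, 4 = fastest
         (os ,(bytevector-u8-ref bv 9)))))        ;3 = Unix

(gzip-header->sexp
 (u8-list->bytevector '(#x1f #x8b 8 0 0 0 0 0 2 3)))
;; (gzip-header (method 8) (flags 0) (mtime 0) (xfl 2) (os 3))
```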


How would we put that in practice?  Good question.  :-)

I think we’d have to maintain a database that maps tarball hashes to
metadata (!).  A simple version of it could be a Git repo where, say,
‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
contain the metadata above.  The nice thing is that the Git repo itself
could be archived by SWH.  :-)

Thus, if a tarball vanishes, we’d look it up in the database and
reconstruct it from its metadata plus the content stored in SWH.
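
The lookup side of such a database could be as simple as the following
sketch, assuming a local checkout of the hypothetical Git repo laid out as
ALGORITHM/HASH, with one sexp per file.  The procedure names and the
"/var/cache/tar-db" directory are made up for the example.

```scheme
(define (metadata-file-name root algorithm hash)
  "Return the file name under ROOT that would hold the metadata of the
tarball whose ALGORITHM hash is HASH, a string.  Hypothetical layout."
  (string-append root "/" (symbol->string algorithm) "/" hash))

(define (lookup-tar-metadata root algorithm hash)
  "Return the 'tar-source' sexp recorded for HASH, or #f if the database
checkout at ROOT has no entry for it."
  (let ((file (metadata-file-name root algorithm hash)))
    (and (file-exists? file)
         (call-with-input-file file read))))

(metadata-file-name "/var/cache/tar-db" 'sha256
                    "0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk")
;; "/var/cache/tar-db/sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk"
```

The sexp returned by 'lookup-tar-metadata' would then be passed to
'assemble-archive'.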

Thoughts?

Anyhow, we should team up with fellow NixOS and SWH hackers to address
this, and with developers of other distros as well; this problem is not
just that of the functional deployment geeks, is it?

Ludo’.


--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline; filename=tar.scm
Content-Transfer-Encoding: quoted-printable
Content-Description: the tar assembler/disassembler

;;; GNU Guix --- Functional package management for GNU
;;; Copyright © 2020 Ludovic Courtès <ludo@HIDDEN>
;;;
;;; This file is part of GNU Guix.
;;;
;;; GNU Guix is free software; you can redistribute it and/or modify it
;;; under the terms of the GNU General Public License as published by
;;; the Free Software Foundation; either version 3 of the License, or (at
;;; your option) any later version.
;;;
;;; GNU Guix is distributed in the hope that it will be useful, but
;;; WITHOUT ANY WARRANTY; without even the implied warranty of
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
;;; GNU General Public License for more details.
;;;
;;; You should have received a copy of the GNU General Public License
;;; along with GNU Guix.  If not, see <http://www.gnu.org/licenses/>.

(define-module (tar)
  #:use-module (ice-9 match)
  #:use-module (ice-9 binary-ports)
  #:use-module (rnrs bytevectors)
  #:use-module (srfi srfi-1)
  #:use-module (srfi srfi-9)
  #:use-module (srfi srfi-26)

  #:use-module (gcrypt hash)
  #:use-module (guix base16)
  #:use-module (guix base32)
  #:use-module ((ice-9 rdelim) #:select ((read-string . get-string-all)))
  #:use-module ((guix build utils) #:select (dump-port)) ;for 'assemble-archive'
  #:use-module (web client)
  #:use-module (web response)
  #:export (disassemble-archive
            assemble-archive))


;;;
;;; Tar.
;;;

(define %TMAGIC "ustar\0")
(define %TVERSION "00")

(define-syntax-rule (define-field-type type type-size read-proc write-proc)
  "Define TYPE as a ustar header field type of TYPE-SIZE bytes.  READ-PROC is
the procedure to obtain the value of an object of this type from a bytevector,
and WRITE-PROC writes it to a bytevector."
  (define-syntax type
    (syntax-rules (read write size)
      ((_ size)  type-size)
      ((_ read)  read-proc)
      ((_ write) write-proc))))

(define (sub-bytevector bv offset size)
  (let ((sub (make-bytevector size)))
    (bytevector-copy! bv offset sub 0 size)
    sub))

(define (read-integer bv offset len)
  (string->number (read-string bv offset len) 8))
(define read-integer12 (cut read-integer <> <> 12))
(define read-integer8  (cut read-integer <> <> 8))

(define (read-string bv offset max-len)
  (define len
    (let loop ((len 0))
      (cond ((= len max-len)
             len)
            ((zero? (bytevector-u8-ref bv (+ offset len)))
             len)
            (else
             (loop (+ 1 len))))))

  (utf8->string (sub-bytevector bv offset len)))
(define read-string155 (cut read-string <> <> 155))
(define read-string100 (cut read-string <> <> 100))
(define read-string32 (cut read-string <> <> 32))
(define read-string6 (cut read-string <> <> 6))
(define read-string2 (cut read-string <> <> 2))

(define (read-character bv offset)
  (integer->char (bytevector-u8-ref bv offset)))

(define (read-padding12 bv offset)
  (bytevector-uint-ref bv offset (endianness big) 12))

(define (write-integer! bv offset value len)
  (let ((str (string-pad (number->string value 8) (- len 1) #\0)))
    (write-string! bv offset str len)))
(define write-integer12! (cut write-integer! <> <> <> 12))
(define write-integer8!  (cut write-integer! <> <> <> 8))

(define (write-string! bv offset str len)
  (let* ((str (string-pad-right str len #\nul))
         (buf (string->utf8 str)))
    (bytevector-copy! buf 0 bv offset (bytevector-length buf))))

(define write-string155! (cut write-string! <> <> <> 155))
(define write-string100! (cut write-string! <> <> <> 100))
(define write-string32! (cut write-string! <> <> <> 32))
(define write-string6! (cut write-string! <> <> <> 6))
(define write-string2! (cut write-string! <> <> <> 2))

(define (write-character! bv offset value)
  (bytevector-u8-set! bv offset (char->integer value)))

(define (write-padding12! bv offset value)
  (bytevector-uint-set! bv offset value (endianness big) 12))

(define-field-type integer12     12 read-integer12    write-integer12!)
(define-field-type integer8       8 read-integer8     write-integer8!)
(define-field-type character      1 read-character    write-character!)
(define-field-type string155    155 read-string155    write-string155!)
(define-field-type string100    100 read-string100    write-string100!)
(define-field-type string32      32 read-string32     write-string32!)
(define-field-type string6        6 read-string6      write-string6!)
(define-field-type string2        2 read-string2      write-string2!)
(define-field-type padding12     12 read-padding12    write-padding12!)

(define-syntax define-pack
  (syntax-rules ()
    ((_ type ctor pred
        write-header read-header
        (field-names field-types field-getters) ...)
     (begin
       (define-record-type type
         (ctor field-names ...)
         pred
         (field-names field-getters) ...)

       (define (read-header port)
         "Return the ustar header read from PORT."
         (set-port-encoding! port "ISO-8859-1")
         (let ((bv (get-bytevector-n port (+ (field-types size) ...))))
           (letrec-syntax ((build
                            (syntax-rules ()
                              ((_ bv () offset (fields (... ...)))
                               (ctor fields (... ...)))
                              ((_ bv (type0 types (... ...))
                                  offset (fields (... ...)))
                               (build bv
                                      (types (... ...))
                                      (+ offset (type0 size))
                                      (fields (... ...)
                                              ((type0 read) bv offset)))))))
             (build bv (field-types ...) 0 ()))))

       (define (write-header header port)
         "Serialize HEADER, a <ustar-header> record, to PORT."
         (let* ((len (+ (field-types size) ...))
                (bv  (make-bytevector len)))
           (match header
             (($ type field-names ...)
              (letrec-syntax ((write!
                               (syntax-rules ()
                                 ((_ () offset)
                                  #t)
                                 ((_ ((type value) rest (... ...)) offset)
                                  (begin
                                    ((type write) bv offset value)
                                    (write! (rest (... ...))
                                            (+ offset (type size))))))))
                (write! ((field-types field-names) ...) 0)
                (put-bytevector port bv))))))))))

;; The ustar header.  See <tar.h>.
(define-pack <ustar-header>
  %make-ustar-header ustar-header?
  write-ustar-header read-ustar-header
  (name         string100 ustar-header-name)      ;NUL-terminated if NUL fits
  (mode		 integer8 ustar-header-mode)
  (uid		 integer8 ustar-header-uid)
  (gid		 integer8 ustar-header-gid)
  (size		integer12 ustar-header-size)
  (mtime	integer12 ustar-header-mtime)
  (chksum	 integer8 ustar-header-checksum)
  (typeflag	character ustar-header-type-flag)
  (linkname	string100 ustar-header-link-name)
  (magic	  string6 ustar-header-magic)     ;must be TMAGIC
  (version	  string2 ustar-header-version)   ;must be TVERSION
  (uname	 string32 ustar-header-uname)     ;NUL-terminated
  (gname	 string32 ustar-header-gname)     ;NUL-terminated
  (devmajor	 integer8 ustar-header-device-major)
  (devminor	 integer8 ustar-header-device-minor)
  (prefix	string155 ustar-header-prefix)    ;NUL-terminated if NUL fits
  (padding      padding12 ustar-header-padding))

(define* (make-ustar-header name
                            #:key
                            (mode 0) (uid 0) (gid 0) (size 0)
                            (mtime 0) (checksum 0) (type-flag #\nul)
                            (link-name "")
                            (magic %TMAGIC) (version %TVERSION)
                            (uname "") (gname "")
                            (device-major 0) (device-minor 0)
                            (prefix "") (padding 0))
  (%make-ustar-header name mode uid gid size mtime checksum
                      type-flag link-name magic version uname gname
                      device-major device-minor prefix padding))

(define %zero-header
  ;; The all-zeros header, which marks the end of stream.
  (read-ustar-header (open-bytevector-input-port
                      (make-bytevector 512 0))))

(define (consumer port)
  "Return a procedure that consumes or skips the given number of bytes from
PORT."
  (if (false-if-exception (seek port 0 SEEK_CUR))
      (lambda (len)
        (seek port len SEEK_CUR))
      (lambda (len)
        (define bv (make-bytevector 8192))
        (let loop ((len len))
          (define block (min len (bytevector-length bv)))
          (unless (or (zero? block)
                      (eof-object? (get-bytevector-n! port bv 0 block)))
            (loop (- len block)))))))

(define (fold-archive proc seed port)
  "Read ustar headers from PORT; for each header, call PROC."
  (define skip
    (consumer port))

  (let loop ((result seed))
    (define header
      (read-ustar-header port))

    (if (equal? header %zero-header)
        result
        (let* ((result    (proc header port result))
               (size      (ustar-header-size header))
               (remainder (modulo size 512)))
          ;; It's up to PROC to consume the SIZE bytes of data corresponding
          ;; to HEADER.  Here we consume padding.
          (unless (zero? remainder)
            (skip (- 512 remainder)))
          (loop result)))))


;;;
;;; Disassembling/assembling an archive.
;;;

(define (dump in out size)
  "Copy SIZE bytes from IN to OUT."
  (define buf-size 65536)
  (define buf (make-bytevector buf-size))

  (let loop ((left size))
    (if (<= left 0)
        0
        (let ((read (get-bytevector-n! in buf 0 (min left buf-size))))
          (if (eof-object? read)
              left
              (begin
                (put-bytevector out buf 0 read)
                (loop (- left read))))))))

(define* (disassemble-archive port #:optional
                              (algorithm (hash-algorithm sha256)))
  "Read tar archive from PORT and return an sexp representing its metadata,
including individual file hashes with ALGORITHM."
  (define headers+hashes
    (fold-archive (lambda (header port result)
                    (if (zero? (ustar-header-size header))
                        (alist-cons header #f result)
                        (let ()
                          (define-values (hash-port get-hash)
                            (open-hash-port algorithm))

                          (dump port hash-port
                                (ustar-header-size header))
                          (close-port hash-port)
                          (alist-cons header (get-hash) result))))
                  '()
                  port))

  (define header+hash->sexp
    (match-lambda
      ((header . hash)
       (letrec-syntax ((serialize (syntax-rules ()
                                    ((_)
                                     '())
                                    ((_ (tag get default) rest ...)
                                     (let ((value (get header)))
                                       (append (if (equal? default value)
                                                   '()
                                                   `((tag ,value)))
                                               (serialize rest ...))))
                                    ((_ (tag get) rest ...)
                                     (append `((tag ,(get header)))
                                             (serialize rest ...))))))
         `(,(ustar-header-name header)
           ,@(serialize (mode ustar-header-mode)
                        (uid ustar-header-uid 0)
                        (gid ustar-header-gid 0)
                        (size ustar-header-size)
                        (mtime ustar-header-mtime)
                        (chksum ustar-header-checksum)
                        (typeflag ustar-header-type-flag #\nul)
                        (linkname ustar-header-link-name "")
                        (magic ustar-header-magic "")
                        (version ustar-header-version "")
                        (uname ustar-header-uname "")
                        (gname ustar-header-gname "")
                        (devmajor ustar-header-device-major 0)
                        (devminor ustar-header-device-minor 0)
                        (prefix ustar-header-prefix "")
                        (padding ustar-header-padding 0)

                        (hash (lambda (_)
                                (and
                                 hash
                                 `(,(hash-algorithm-name algorithm)
                                   ,(bytevector->base32-string hash))))
                              #f)))))))

  `(tar-source
    (version 0)
    (headers ,(map header+hash->sexp (reverse headers+hashes)))))

(define (fetch-from-swh algorithm hash)
  (define url
    (string-append "https://archive.softwareheritage.org/api/1/content/"
                   (symbol->string algorithm) ":"
                   (bytevector->base16-string hash) "/raw/"))

  (define-values (response port)
    (http-get url #:streaming? #t #:verify-certificate? #f))

  (if (= 200 (response-code response))
      port
      (throw 'swh-fetch-error url (get-string-all port))))

(define* (assemble-archive source port
                           #:optional (fetch-data fetch-from-swh))
  "Assemble archive from SOURCE, an sexp as returned by
'disassemble-archive'."
  (define sexp->header
    (match-lambda
      ((name . properties)
       (let ((ref (lambda (field)
                    (and=> (assq-ref properties field) car))))
         (make-ustar-header name
                            #:mode (ref 'mode)
                            #:uid (or (ref 'uid) 0)
                            #:gid (or (ref 'gid) 0)
                            #:size (ref 'size)
                            #:mtime (ref 'mtime)
                            #:checksum (ref 'chksum)
                            #:type-flag (or (ref 'typeflag) #\nul)
                            #:link-name (or (ref 'linkname) "")
                            #:magic (or (ref 'magic) "")
                            #:version (or (ref 'version) "")
                            #:uname (or (ref 'uname) "")
                            #:gname (or (ref 'gname) "")
                            #:device-major (or (ref 'devmajor) 0)
                            #:device-minor (or (ref 'devminor) 0)
                            #:prefix (or (ref 'prefix) "")
                            #:padding (or (ref 'padding) 0))))))

  (define sexp->data
    (match-lambda
      ((name . properties)
       (match (assq-ref properties 'hash)
         (((algorithm (= base32-string->bytevector hash)) _ ...)
          (fetch-data algorithm hash))
         (#f
          (open-input-string ""))))))

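  (match source
    (('tar-source ('version 0) ('headers headers) _ ...)
     ;; Note: the end-of-archive zero blocks are not recorded in SOURCE, so
     ;; they are not written here.
     (for-each (lambda (sexp)
                 (let* ((header    (sexp->header sexp))
                        (data      (sexp->data sexp))
                        (remainder (modulo (ustar-header-size header) 512)))
                   (write-ustar-header header port)
                   (dump-port data port)
                   (close-port data)
                   ;; Pad the data to a multiple of 512 bytes, mirroring the
                   ;; padding that 'fold-archive' skips when reading.
                   (unless (zero? remainder)
                     (put-bytevector port
                                     (make-bytevector (- 512 remainder) 0)))))
               headers))))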

--=-=-=--




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.
Severity set to 'important' from 'normal' Request was from Ludovic Courtès <ludo@HIDDEN> to control <at> debbugs.gnu.org. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 2 Jul 2020 10:03:55 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Jul 02 06:03:55 2020
Received: from localhost ([127.0.0.1]:54417 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1jqw4R-0008VR-Cd
	for submit <at> debbugs.gnu.org; Thu, 02 Jul 2020 06:03:55 -0400
Received: from eggs.gnu.org ([209.51.188.92]:47894)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <ludo@HIDDEN>) id 1jqw4M-0008VB-T5
 for 42162 <at> debbugs.gnu.org; Thu, 02 Jul 2020 06:03:54 -0400
Received: from fencepost.gnu.org ([2001:470:142:3::e]:58122)
 by eggs.gnu.org with esmtp (Exim 4.90_1)
 (envelope-from <ludo@HIDDEN>)
 id 1jqw4G-0001Jt-Tx; Thu, 02 Jul 2020 06:03:44 -0400
Received: from [2a01:e0a:1d:7270:af76:b9b:ca24:c465] (port=53918 helo=ribbon)
 by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256)
 (Exim 4.82) (envelope-from <ludo@HIDDEN>)
 id 1jqw4D-0008Ka-Jh; Thu, 02 Jul 2020 06:03:44 -0400
From: =?utf-8?Q?Ludovic_Court=C3=A8s?= <ludo@HIDDEN>
To: zimoun <zimon.toutoune@HIDDEN>
Subject: Re: bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
References: <87mu4iv0gc.fsf@HIDDEN> <86h7uq8fmk.fsf@HIDDEN>
Date: Thu, 02 Jul 2020 12:03:39 +0200
In-Reply-To: <86h7uq8fmk.fsf@HIDDEN> (zimoun's message of "Thu, 02 Jul 2020
 10:50:43 +0200")
Message-ID: <87d05etero.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 42162
Cc: 42162 <at> debbugs.gnu.org,
 Maurice =?utf-8?Q?Br=C3=A9mond?= <Maurice.Bremond@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.3 (---)

zimoun <zimon.toutoune@HIDDEN> skribis:

> On Thu, 02 Jul 2020 at 09:29, Ludovic Courtès <ludovic.courtes@HIDDEN> wrote:
>
>> The hosting site gforge.inria.fr will be taken off-line in December
>> 2020.  This GForge instance hosts source code as tarballs, Subversion
>> repos, and Git repos.  Users have been invited to migrate to
>> gitlab.inria.fr, which is Git only.  It seems that Software Heritage
>> hasn’t archived (yet) all of gforge.inria.fr.  Let’s keep track of the
>> situation in this issue.
>
> [...]
>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> ,pp (lset-difference eq? $7 $8)
>> $11 = (#<package r-spams@HIDDEN gnu/packages/statistics.scm:3931 7f632401a640>
>>  #<package mpfi@HIDDEN gnu/packages/multiprecision.scm:158 7f632ee3adc0>
>>  #<package gf2x@HIDDEN gnu/packages/algebra.scm:103 7f6323ea1280>
>>  #<package gmp-ecm@HIDDEN gnu/packages/algebra.scm:658 7f6323eb4960>
>>  #<package cmh@HIDDEN gnu/packages/algebra.scm:322 7f6323eb4dc0>)
>> --8<---------------cut here---------------end--------------->8---
>
> All 5 use 'url-fetch', so we can expect that sources.json will be up
> before the shutdown in December. :-)

Unfortunately, it won’t help for tarballs:

  https://sympa.inria.fr/sympa/arc/swh-devel/2020-07/msg00001.html

There’s this other discussion you mentioned, which I hope will have a
positive outcome:

  https://forge.softwareheritage.org/T2430

>> (use-modules (guix) (gnu)
>>              (guix svn-download)
>>              (guix git-download)
>>              (guix swh)
>
> It does not work properly unless I replace it with
>
>                ((guix swh) #:hide (origin?))

Oh right, I had overlooked this as I played at the REPL.

Thanks,
Ludo’.




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at 42162 <at> debbugs.gnu.org:


Received: (at 42162) by debbugs.gnu.org; 2 Jul 2020 08:50:54 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Jul 02 04:50:54 2020
Received: from localhost ([127.0.0.1]:54383 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1jquvl-0006ku-TR
	for submit <at> debbugs.gnu.org; Thu, 02 Jul 2020 04:50:54 -0400
Received: from mail-wm1-f66.google.com ([209.85.128.66]:53068)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <zimon.toutoune@HIDDEN>) id 1jquvj-0006kg-Ab
 for 42162 <at> debbugs.gnu.org; Thu, 02 Jul 2020 04:50:52 -0400
Received: by mail-wm1-f66.google.com with SMTP id q15so25837856wmj.2
 for <42162 <at> debbugs.gnu.org>; Thu, 02 Jul 2020 01:50:51 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=from:to:cc:subject:in-reply-to:references:date:message-id
 :mime-version:content-transfer-encoding;
 bh=onE8OM1lYrtY8lgjEDixeaGrwTtQ/o1v44Fg+6vlqC4=;
 b=O5D/Tae54Pzz34R31uW7DoHbllk27pH9CLVVikQ2bw2p1TwTa5XAdkIOmk5VHjeb+y
 jNBPLmKQNIARLovBXREo+BMK1ffegsIkQTReLsiGR30bgd9Zoa74ewtmgQ8s0yNCm7lq
 zwRpkFx/j4upC7O4nYcsNoaA6MGy/jpNrzoMdloKJJN/3ocJUquuSCQ0vEq9VlFCtVof
 lY5wGjfKkyDI+Ijf+mvVYLnlx8ShE68ZU65AvUVSFU/x+6qm9SGMn6y7sTJ1YezkkZfu
 DmSR4M+Fs4ym5scHMVk7cAtAr2TyupOewkyuydOLfLWMFcuarPMfhzXjH5L/1bbQUDxl
 3pSQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:from:to:cc:subject:in-reply-to:references:date
 :message-id:mime-version:content-transfer-encoding;
 bh=onE8OM1lYrtY8lgjEDixeaGrwTtQ/o1v44Fg+6vlqC4=;
 b=KMUU97ihEKBnZvo887TMPVVLihy3hkXc3B4XPIPCtN+A0j6e+HkmREjj+i4XV8ps4V
 8MFdR0/RtFAWYKvwu9BK2/XBDk6S2mNK5sVARUYyhtN3BT1yBrBUESO+MlBh36Tq8TYr
 jYosX3ZIfNluH03OxZjs69rxYxfCTyiJeb7cQK5b+Q3MSSxkkVeh9rIBidM1CoElYxBv
 EOSr4UskX+qhCzs/U0YTQPBh7+0uuVLov8Zq2wa4MxH4cFxHeZvZ15fqnZ95njDezTAT
 EXpYhdmvcOJ12OfNt9VTWojeDNz9lfzdMCD62kcpKk8IxWKqz66JN44Z4BPuo/QxQNvY
 dcgQ==
X-Gm-Message-State: AOAM531ISEZfSFuXCyA1NtpUeEZ2vnoyTAPiTXPeDmwWbTvwv3AYvzw1
 4u8La13+cXnZ5DyzUfM0znQ=
X-Google-Smtp-Source: ABdhPJyH+NDDuaexzdxzyGv/NabdzvDaCzyvr++afx8V/c3M7tThirpCJBYJKZnOkkZb8VGQ0yaWaQ==
X-Received: by 2002:a1c:1d46:: with SMTP id d67mr32700859wmd.152.1593679845356; 
 Thu, 02 Jul 2020 01:50:45 -0700 (PDT)
Received: from lili ([2a01:e0a:59b:9120:65d2:2476:f637:db1e])
 by smtp.gmail.com with ESMTPSA id q4sm9539987wmc.1.2020.07.02.01.50.44
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Thu, 02 Jul 2020 01:50:44 -0700 (PDT)
From: zimoun <zimon.toutoune@HIDDEN>
To: Ludovic =?utf-8?Q?Court=C3=A8s?= <ludovic.courtes@HIDDEN>,
 42162 <at> debbugs.gnu.org
Subject: Re: bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
In-Reply-To: <87mu4iv0gc.fsf@HIDDEN>
References: <87mu4iv0gc.fsf@HIDDEN>
Date: Thu, 02 Jul 2020 10:50:43 +0200
Message-ID: <86h7uq8fmk.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 42162
Cc: Maurice =?utf-8?Q?Br=C3=A9mond?= <Maurice.Bremond@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

Hi Ludo,

On Thu, 02 Jul 2020 at 09:29, Ludovic Courtès <ludovic.courtes@HIDDEN> wrote:

> The hosting site gforge.inria.fr will be taken off-line in December
> 2020.  This GForge instance hosts source code as tarballs, Subversion
> repos, and Git repos.  Users have been invited to migrate to
> gitlab.inria.fr, which is Git only.  It seems that Software Heritage
> hasn’t archived (yet) all of gforge.inria.fr.  Let’s keep track of the
> situation in this issue.

[...]

> --8<---------------cut here---------------start------------->8---
> scheme@(guile-user)> ,pp (lset-difference eq? $7 $8)
> $11 = (#<package r-spams@HIDDEN gnu/packages/statistics.scm:3931 7f632401a640>
>  #<package mpfi@HIDDEN gnu/packages/multiprecision.scm:158 7f632ee3adc0>
>  #<package gf2x@HIDDEN gnu/packages/algebra.scm:103 7f6323ea1280>
>  #<package gmp-ecm@HIDDEN gnu/packages/algebra.scm:658 7f6323eb4960>
>  #<package cmh@HIDDEN gnu/packages/algebra.scm:322 7f6323eb4dc0>)
> --8<---------------cut here---------------end--------------->8---

All 5 use 'url-fetch', so we can expect that sources.json will be up
before the shutdown in December. :-)

Then, all the 14 packages we have from gforge.inria.fr will be
git-fetch, right?  So should we contact upstream to inform us when they
switch?  Then we can adapt the origin.

> (use-modules (guix) (gnu)
>              (guix svn-download)
>              (guix git-download)
>              (guix swh)

It does not work properly unless I replace this line with

               ((guix swh) #:hide (origin?))

Well, I have not investigated further.

>              (ice-9 match)
>              (srfi srfi-1)
>              (srfi srfi-26))
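
For reference, a sketch of the complete module list with that replacement
applied.  The probable cause, stated here as an assumption rather than a
confirmed diagnosis: (guix swh) exports its own 'origin?' predicate (for
Software Heritage origins), which clashes with the 'origin?' from
(guix packages) that the 'match' clause in 'gforge?' relies on.  Running
this needs a Guix checkout or installation on the Guile load path.

```scheme
(use-modules (guix) (gnu)
             (guix svn-download)
             (guix git-download)
             ;; Hide the Software Heritage 'origin?' so that the
             ;; package-source 'origin?' re-exported by (guix) is used.
             ((guix swh) #:hide (origin?))
             (ice-9 match)
             (srfi srfi-1)
             (srfi srfi-26))
```

With this form, 'lookup-content' and the other (guix swh) bindings remain
visible; only the conflicting predicate is left out.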

[...]

> (define archived-source
>   (filter (lambda (package)
>             (let* ((origin (package-source package))
>                    (hash  (origin-hash origin)))
>               (lookup-content (content-hash-value hash)
>                               (symbol->string
>                                (content-hash-algorithm hash)))))
>           packages-on-gforge))

I am a bit lost about the other discussion on falling back for tarballs.
But that's another story. :-)


Cheers,
simon




Information forwarded to bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 2 Jul 2020 07:33:20 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Jul 02 03:33:20 2020
Received: from localhost ([127.0.0.1]:54308 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1jqtih-0004k2-Ko
	for submit <at> debbugs.gnu.org; Thu, 02 Jul 2020 03:33:19 -0400
Received: from lists.gnu.org ([209.51.188.17]:41136)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <ludovic.courtes@HIDDEN>) id 1jqtif-0004jt-20
 for submit <at> debbugs.gnu.org; Thu, 02 Jul 2020 03:33:17 -0400
Received: from eggs.gnu.org ([2001:470:142:3::10]:47102)
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <ludovic.courtes@HIDDEN>)
 id 1jqtie-000180-QI
 for bug-guix@HIDDEN; Thu, 02 Jul 2020 03:33:16 -0400
Received: from mail3-relais-sop.national.inria.fr ([192.134.164.104]:61628)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <ludovic.courtes@HIDDEN>)
 id 1jqtic-0001rW-Mk
 for bug-guix@HIDDEN; Thu, 02 Jul 2020 03:33:16 -0400
X-IronPort-AV: E=Sophos;i="5.75,303,1589234400"; 
 d="scm'?scan'208";a="353342141"
Received: from 91-160-117-201.subs.proxad.net (HELO ribbon) ([91.160.117.201])
 by mail3-relais-sop.national.inria.fr with
 ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 02 Jul 2020 09:29:55 +0200
From: Ludovic Courtès <ludovic.courtes@HIDDEN>
To: <bug-guix@HIDDEN>
Subject: gforge.inria.fr to be taken off-line in Dec. 2020
X-Debbugs-Cc: "Maurice Brémond" <Maurice.Bremond@HIDDEN>
X-URL: http://www.fdn.fr/~lcourtes/
X-Revolutionary-Date: 15 Messidor an 228 de la Révolution
X-PGP-Key-ID: 0x090B11993D9AEBB5
X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc
X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4  0CFB 090B 1199 3D9A EBB5
X-OS: x86_64-pc-linux-gnu
Date: Thu, 02 Jul 2020 09:29:55 +0200
Message-ID: <87mu4iv0gc.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Hello!

The hosting site gforge.inria.fr will be taken off-line in December
2020.  This GForge instance hosts source code as tarballs, Subversion
repos, and Git repos.  Users have been invited to migrate to
gitlab.inria.fr, which is Git only.  It seems that Software Heritage
hasn’t archived (yet) all of gforge.inria.fr.  Let’s keep track of the
situation in this issue.

The following packages have their source on gforge.inria.fr:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,pp packages-on-gforge
$7 = (#<package r-spams@HIDDEN gnu/packages/statistics.scm:3931 7f632401a640>
 #<package ocaml-cudf@HIDDEN gnu/packages/ocaml.scm:295 7f63235eb3c0>
 #<package ocaml-dose3@HIDDEN gnu/packages/ocaml.scm:357 7f63235eb280>
 #<package mpfi@HIDDEN gnu/packages/multiprecision.scm:158 7f632ee3adc0>
 #<package pt-scotch@HIDDEN gnu/packages/maths.scm:2920 7f632d832640>
 #<package scotch@HIDDEN gnu/packages/maths.scm:2774 7f632d832780>
 #<package pt-scotch32@HIDDEN gnu/packages/maths.scm:2944 7f632d8325a0>
 #<package scotch32@HIDDEN gnu/packages/maths.scm:2873 7f632d8326e0>
 #<package gf2x@HIDDEN gnu/packages/algebra.scm:103 7f6323ea1280>
 #<package gmp-ecm@HIDDEN gnu/packages/algebra.scm:658 7f6323eb4960>
 #<package cmh@HIDDEN gnu/packages/algebra.scm:322 7f6323eb4dc0>)
--8<---------------cut here---------------end--------------->8---

‘isl’ (a dependency of GCC) has its source on gforge.inria.fr but it’s
also mirrored at gcc.gnu.org apparently.

Of these, the following are available on Software Heritage:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,pp archived-source
$8 = (#<package ocaml-cudf@HIDDEN gnu/packages/ocaml.scm:295 7f63235eb3c0>
 #<package ocaml-dose3@HIDDEN gnu/packages/ocaml.scm:357 7f63235eb280>
 #<package pt-scotch@HIDDEN gnu/packages/maths.scm:2920 7f632d832640>
 #<package scotch@HIDDEN gnu/packages/maths.scm:2774 7f632d832780>
 #<package pt-scotch32@HIDDEN gnu/packages/maths.scm:2944 7f632d8325a0>
 #<package scotch32@HIDDEN gnu/packages/maths.scm:2873 7f632d8326e0>
 #<package isl@HIDDEN gnu/packages/gcc.scm:925 7f632dc82320>
 #<package isl@HIDDEN gnu/packages/gcc.scm:939 7f632dc82280>)
--8<---------------cut here---------------end--------------->8---

So we’ll be missing these:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,pp (lset-difference eq? $7 $8)
$11 = (#<package r-spams@HIDDEN gnu/packages/statistics.scm:3931 7f632401a640>
 #<package mpfi@HIDDEN gnu/packages/multiprecision.scm:158 7f632ee3adc0>
 #<package gf2x@HIDDEN gnu/packages/algebra.scm:103 7f6323ea1280>
 #<package gmp-ecm@HIDDEN gnu/packages/algebra.scm:658 7f6323eb4960>
 #<package cmh@HIDDEN gnu/packages/algebra.scm:322 7f6323eb4dc0>)
--8<---------------cut here---------------end--------------->8---
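
The set difference above can be tried in plain Guile without Guix.  This
is only a sketch: the symbols below are hypothetical stand-ins for the
<package> records shown in the REPL output, and the list names are made
up for the illustration.  Comparing with 'eq?' works for the real records
because both lists share the very same objects, and it works for symbols
because equal symbols are the same object.

```scheme
(use-modules (srfi srfi-1))

;; Stand-ins (assumption): symbols instead of <package> records.
(define on-gforge
  '(r-spams ocaml-cudf ocaml-dose3 mpfi pt-scotch scotch
    pt-scotch32 scotch32 gf2x gmp-ecm cmh))

(define archived
  '(ocaml-cudf ocaml-dose3 pt-scotch scotch pt-scotch32 scotch32 isl))

;; Elements of ON-GFORGE that do not appear in ARCHIVED.
(define missing
  (lset-difference eq? on-gforge archived))

(for-each (lambda (name)
            (display name)
            (newline))
          missing)
```

Entries that appear only in the second list, such as 'isl' here, do not
affect the result.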

Attached is the code I used for this.

Thanks,
Ludo’.


--=-=-=
Content-Type: text/plain
Content-Disposition: inline; filename=gforge-inria.scm
Content-Description: the code

(use-modules (guix) (gnu)
             (guix svn-download)
             (guix git-download)
             (guix swh)
             (ice-9 match)
             (srfi srfi-1)
             (srfi srfi-26))

;; Return true if PACKAGE's source is fetched from gforge.inria.fr.
(define (gforge? package)
  (define (gforge-string? str)
    (string-contains str "gforge.inria.fr"))

  (match (package-source package)
    ((? origin? o)
     (match (origin-uri o)
       ((? string? url)
        (gforge-string? url))
       (((? string? urls) ...)
        (any gforge-string? urls))                ;or 'find'
       ((? git-reference? ref)
        (gforge-string? (git-reference-url ref)))
       ((? svn-reference? ref)
        (gforge-string? (svn-reference-url ref)))
       (_ #f)))
    (_ #f)))

;; All the packages whose source lives on gforge.inria.fr.
(define packages-on-gforge
  (fold-packages (lambda (package result)
                   (if (gforge? package)
                       (cons package result)
                       result))
                 '()))

;; Subset of PACKAGES-ON-GFORGE whose source content hash is found on
;; Software Heritage.
(define archived-source
  (filter (lambda (package)
            (let* ((origin (package-source package))
                   (hash  (origin-hash origin)))
              (lookup-content (content-hash-value hash)
                              (symbol->string
                               (content-hash-algorithm hash)))))
          packages-on-gforge))

--=-=-=--
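
The URI dispatch used by 'gforge?' in the attachment can be exercised
without Guix.  A minimal sketch, assuming only plain strings and lists of
strings as URIs (the git and svn reference cases are left out); the name
'gforge-uri?' and the URLs in the usage lines are hypothetical.

```scheme
(use-modules (ice-9 match)
             (srfi srfi-1))

(define (gforge-string? str)
  ;; 'string-contains' returns the match index, or #f when absent.
  (string-contains str "gforge.inria.fr"))

;; Hypothetical helper: same string and list clauses as in 'gforge?'.
(define (gforge-uri? uri)
  (match uri
    ((? string? url)
     (gforge-string? url))
    (((? string? urls) ...)
     (any gforge-string? urls))
    (_ #f)))

;; Usage, with made-up URLs:
(gforge-uri? "https://gforge.inria.fr/frs/download.php/example.tar.gz")
(gforge-uri? '("https://example.org/mirror/example.tar.gz"
               "https://gforge.inria.fr/frs/download.php/example.tar.gz"))
(gforge-uri? "https://example.org/example.tar.gz")
```

The first two calls return a true value (the index found by
'string-contains', since 'any' returns the first true result) and the
last one returns #f.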




Acknowledgement sent to Ludovic Courtès <ludovic.courtes@HIDDEN>:
New bug report received and forwarded. Copy sent to Maurice.Bremond@HIDDEN, bug-guix@HIDDEN. Full text available.
Report forwarded to Maurice.Bremond@HIDDEN, bug-guix@HIDDEN:
bug#42162; Package guix. Full text available.
Last modified: Thu, 27 Aug 2020 18:15:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.