GNU bug report logs - #70007
[PATCH] native JSON encoder

Package: emacs;

Reported by: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Date: Tue, 26 Mar 2024 15:35:01 UTC

Severity: normal

Tags: patch

Done: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 70007 in the body.
You can then email your comments to 70007 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Tue, 26 Mar 2024 15:35:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mattias Engdegård <mattias.engdegard <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Tue, 26 Mar 2024 15:35:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Emacs Bug Report <bug-gnu-emacs <at> gnu.org>
Subject: [PATCH] native JSON encoder
Date: Tue, 26 Mar 2024 16:33:52 +0100

[Message part 1 (text/plain, inline)]

If we replace the lisp-to-JSON encoder with native code, we would not need the jansson library for it and it would be faster.

There is ongoing work on a JSON-to-lisp parser, but the author has made it clear that he does not have time to write an encoder, so I spent a morning mashing up the attached patch.

It generally produces the same result as the old code, except:

- The old code incorrectly accepted strings with non-Unicode characters (raw bytes). There is no reason to do this; JSON is UTF-8 only.
- The old code spent a lot of time ensuring that object keys were unique. The new code doesn't: it's a garbage-in garbage-out type of situation.

The new code could do with some optimisation but it's already about twice as fast as the old code, sometimes more.

I'd be very happy if someone could test it with packages that use this interface (json-serialise, json-insert).

[json-serialise.diff (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Tue, 26 Mar 2024 16:47:02 GMT) Full text and rfc822 format available.

Message #8 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Tue, 26 Mar 2024 18:46:00 +0200

> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Tue, 26 Mar 2024 16:33:52 +0100
> 
> If we replace the lisp-to-JSON encoder with native code, we would not need the jansson library for it and it would be faster.
> 
> There is ongoing work on a JSON-to-lisp parser, but the author has made it clear that he does not have time to write an encoder, so I spent a morning mashing up the attached patch.

Thanks for working on this.

> It generally produces the same result as the old code, except:
> 
> - The old code incorrectly accepted strings with non-Unicode characters (raw bytes). There is no reason to do this; JSON is UTF-8 only.

Would it complicate the code not to reject raw bytes?  I'd like to
avoid incompatibilities if it's practical.  Also, Emacs traditionally
doesn't reject raw bytes, leaving that to the application or the user.

> I'd be very happy if someone could test it with packages that use this interface (json-serialise, json-insert).

Yes, please.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Wed, 27 Mar 2024 12:47:01 GMT) Full text and rfc822 format available.

Message #11 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Wed, 27 Mar 2024 13:46:17 +0100

26 mars 2024 kl. 17.46 skrev Eli Zaretskii <eliz <at> gnu.org>:

>> - The old code incorrectly accepted strings with non-Unicode characters (raw bytes). There is no reason to do this; JSON is UTF-8 only.
> 
> Would it complicate the code not to reject raw bytes?  I'd like to
> avoid incompatibilities if it's practical.  Also, Emacs traditionally
> doesn't reject raw bytes, leaving that to the application or the user.

Actually I may have misrepresented the behaviour of the old encoder. It doesn't accept any raw bytes but only sequences that happen to form valid UTF-8. It's quite strange, and I don't really think this was ever intended, just a consequence of the implementation.

This means that it accepts an already encoded unibyte UTF-8 string:

  (json-serialize "\303\251") -> "\"é\""

which is doubly odd since it's supposed to be encoding, but it ends up decoding the characters instead.
Even worse, it accepts mixtures of encoded and decoded chars:

  (json-serialize "é\303\251") -> "\"éé\""

which is just bonkers.
So while we could try to replicate this 'interesting' behaviour it would definitely complicate the code and be of questionable use.

The JSON spec is quite clear that it's UTF-8 only. The only useful deviation that I can think of would be to allow unpaired surrogates (WTF-8) to pass through for transmission of Windows file names, but that would be an extension -- the old encoder doesn't permit those.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Wed, 27 Mar 2024 15:51:02 GMT) Full text and rfc822 format available.

Message #14 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Wed, 27 Mar 2024 16:49:53 +0100

[Message part 1 (text/plain, inline)]

Here is an updated patch. It now ignores duplicated keys in objects represented by alists and plists, just like the old encoder. (I didn't include this in the first draft out of fear it would be slow and complicated, but it turned out just to be complicated.)

The performance is still acceptable, which means at least 2x the speed of the Jansson-based encoder.

[json-serialise.diff (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Wed, 27 Mar 2024 17:42:02 GMT) Full text and rfc822 format available.

Message #17 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>,
 Yuan Fu <casouri <at> gmail.com>
Cc: 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Wed, 27 Mar 2024 19:40:59 +0200

> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Wed, 27 Mar 2024 16:49:53 +0100
> Cc: 70007 <at> debbugs.gnu.org
> 
> Here is an updated patch. It now ignores duplicated keys in objects represented by alists and plists, just like the old encoder. (I didn't include this in the first draft out of fear it would be slow and complicated, but it turned out just to be complicated.)
> 
> The performance is still acceptable, which means at least 2x the speed of the Jansson-based encoder.

Thanks.  A few initial comments and questions, based on a very cursory
reading.

> +/* JSON encoding context */

This is not our comment style.

> +typedef struct {
> +  char *buf;
> +  ptrdiff_t size;		/* number of bytes in buf */
> +  ptrdiff_t capacity;		/* allocated size of buf */
> +  ptrdiff_t chars_delta;        /* size - {number of Unicode chars in buf} */

When you say "Unicode chars", what do you mean? characters or bytes?
If characters, then why do you need to qualify them with "Unicode"?

> +struct symset_tbl
> +{
> +  /* Table used by the containing object if any, so that we can easily
> +     all tables if an error occurs.  */
> +  struct symset_tbl *up;
> +  /* Table of symbols (2**bits entries), Qunbound where unused.  */
> +  Lisp_Object entries[];
                        ^^
Is this portable enough?

> +static struct symset_tbl *
> +alloc_symset_table (int bits)
> +{
> +  struct symset_tbl *st = xmalloc (sizeof *st + (sizeof *st->entries << bits));
> +  int size = 1 << bits;

I'd add an assertion here that BITS is not large enough to produce zero.

> +/* Enlarge the table used by a symset. */
                                        ^^
Two spaces there, please.

> +static NO_INLINE void
> +symset_expand (symset_t *ss)
> +{
> +  struct symset_tbl *old_table = ss->table;
> +  int oldbits = ss->bits;
> +  int oldsize = 1 << oldbits;

I'd add an assertion here about the magnitude of BITS.

> +  while (p < end)
> +    {
> +      unsigned char c = *p;
> +      if (json_plain_char[c])
> +	{
> +	  json_out_byte (jo, c);
> +	  p++;
> +	}
> +      else if (c > 0x7f)
> +	{
> +	  if (STRING_MULTIBYTE (str))
> +	    {
> +	      int n;
> +	      if (c <= 0xc1)
> +		string_not_unicode (str);
> +	      if (c <= 0xdf)
> +		n = 2;
> +	      else if (c <= 0xef)
> +		{
> +		  int v = (((c & 0x0f) << 12)
> +			   + ((p[1] & 0x3f) << 6) + (p[2] & 0x3f));
> +		  if (char_surrogate_p (v))
> +		    string_not_unicode (str);
> +		  n = 3;
> +		}
> +	      else if (c <= 0xf7)
> +		{
> +		  int v = (((c & 0x07) << 18)
> +			   + ((p[1] & 0x3f) << 12)
> +			   + ((p[2] & 0x3f) << 6)
> +			   + (p[3] & 0x3f));
> +		  if (v > MAX_UNICODE_CHAR)
> +		    string_not_unicode (str);
> +		  n = 4;
> +		}
> +	      else
> +		string_not_unicode (str);
> +	      json_out_str (jo, (const char *)p, n);
> +	      jo->chars_delta += n - 1;
> +	      p += n;
> +	    }
> +	  else
> +	    string_not_unicode (str);

This rejects unibyte non-ASCII strings, AFAU, in which case I suggest
to think whether we really want that.  E.g., why is it wrong to encode
a string to UTF-8, and then send it to JSON?

> +static void
> +json_out_float (json_out_t *jo, Lisp_Object f)
> +{
> +  double x = XFLOAT_DATA (f);
> +  if (isinf (x) || isnan (x))
> +    signal_error ("not a finite number", f);

Is JSON unable to handle Inf and NaN?

> +static Lisp_Object
> +json_out_string_result (json_out_t *jo)
> +{
> +  /* FIXME: should this be a unibyte or multibyte string?
> +     Right now we make a multibyte string for test compatibility,
> +     but we are really encoding so unibyte would make more sense.  */

I indeed think this should be a unibyte string, because otherwise
writing it to a file or a process will/might encode it, which would be
wrong.

> +  json_out_t jo = {
> +    .maxdepth = 25,

Is this arbitrary, or is it what JSON expects?  If arbitrary, should
it be customizable? should it be documented?

> +  /* FIXME: Do we really need to do all this work below to insert a string?
> +     Is there no function already written?  I must be missing something.  */

There is no function.  All the insert_from_* functions in insdel.c do
something similar.

Btw, shouldn't json-insert call treesit_record_change?  Yuan?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Wed, 27 Mar 2024 18:58:01 GMT) Full text and rfc822 format available.

Message #20 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Yuan Fu <casouri <at> gmail.com>, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Wed, 27 Mar 2024 19:57:24 +0100

Eli, thank you for your comments!

27 mars 2024 kl. 18.40 skrev Eli Zaretskii <eliz <at> gnu.org>:

>> +/* JSON encoding context */
> 
> This is not our comment style.

I'll go through the code and clean up all comments.

>> +typedef struct {
>> +  char *buf;
>> +  ptrdiff_t size;		/* number of bytes in buf */
>> +  ptrdiff_t capacity;		/* allocated size of buf */
>> +  ptrdiff_t chars_delta;        /* size - {number of Unicode chars in buf} */
> 
> When you say "Unicode chars", what do you mean? characters or bytes?
> If characters, then why do you need to qualify them with "Unicode"?

Characters. Will clarify.

>> +  Lisp_Object entries[];
>                        ^^
> Is this portable enough?

Something I'd like to know, too. We rely on C99 in many other aspects. Are there still compilers that are important to us but don't get this right?

10 years ago this was apparently an issue for IBM XL C 12.1, but modern versions are based on Clang. We could take our chances here; obviously we'll change it if someone complains but it seems unlikely. What do you think?

> I'd add an assertion here that BITS is not large enough to produce zero.

I'll deal with that in some way or another.

> This rejects unibyte non-ASCII strings, AFAU, in which case I suggest
> to think whether we really want that.  E.g., why is it wrong to encode
> a string to UTF-8, and then send it to JSON?

The way I see it, that would break the JSON abstraction: it transports strings of Unicode characters, not strings of bytes. A user who for some reason has a string of bytes that encode Unicode characters can just decode it in order to prove it to us. It's not the JSON encoder's job to decode the user's strings.

(It would also be a pain to deal with and risks slowing down the string serialiser even if it's a case that never happens.)

> Is JSON unable to handle Inf and NaN?

That's right.

>> +  /* FIXME: should this be a unibyte or multibyte string?
>> +     Right now we make a multibyte string for test compatibility,
>> +     but we are really encoding so unibyte would make more sense.  */
> 
> I indeed think this should be a unibyte string, because otherwise
> writing it to a file or a process will/might encode it, which would be
> wrong.

I would prefer that, too, but used multibyte for compatibility with the old code and so that its tests pass.
It should probably be a separate change if we decide that unibyte is better here.

>> +  json_out_t jo = {
>> +    .maxdepth = 25,
> 
> Is this arbitrary, or is it what JSON expects?  If arbitrary, should
> it be customizable? should it be documented?

It's semi-arbitrary but reasonable: the JSON_checker at json.org uses a maximum depth of 20 by default, and many implementations use its test suite. RFC-8259 states that the maximum depth is implementation-dependent.

It's hardly worth making this into a parameter for the user to adjust but I'll clarify the code.

>> +  /* FIXME: Do we really need to do all this work below to insert a string?
>> +     Is there no function already written?  I must be missing something.  */
> 
> There is no function.  All the insert_from_* functions in insdel.c do
> something similar.

Thank you for confirming that. Looks like we could use some abstraction then.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Wed, 27 Mar 2024 19:07:01 GMT) Full text and rfc822 format available.

Message #23 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: casouri <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Wed, 27 Mar 2024 21:05:54 +0200

> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Wed, 27 Mar 2024 19:57:24 +0100
> Cc: Yuan Fu <casouri <at> gmail.com>,
>  70007 <at> debbugs.gnu.org
> 
> Eli, thank you for your comments!

Thanks for working on this in the first place.

> > This rejects unibyte non-ASCII strings, AFAU, in which case I suggest
> > to think whether we really want that.  E.g., why is it wrong to encode
> > a string to UTF-8, and then send it to JSON?
> 
> The way I see it, that would break the JSON abstraction: it transports strings of Unicode characters, not strings of bytes.

What's the difference?  AFAIU, JSON expects UTF-8 encoded strings, and
whether that is used as a sequence of bytes or a sequence of
characters is in the eyes of the beholder: the bytestream is the same,
only the interpretation changes.  So I'm not sure I understand how
this would break the assumption.

> A user who for some reason has a string of bytes that encode Unicode characters can just decode it in order to prove it to us. It's not the JSON encoder's job to decode the user's strings.

I didn't suggest to decode the input string, not at all.  I suggested
to allow unibyte strings, and process them just like you process
pure-ASCII strings, leaving it to the caller to make sure the string
has only valid UTF-8 sequences.  Forcing callers to decode such
strings is IMO too harsh and largely unjustified.

> (It would also be a pain to deal with and risks slowing down the string serialiser even if it's a case that never happens.)

I don't understand why.  Once again, I'm just talking about passing
the bytes through as you do with ASCII characters.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Thu, 28 Mar 2024 19:17:01 GMT) Full text and rfc822 format available.

Message #26 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Theodor Thornhill <theo <at> thornhill.no>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Mattias Engdegård <mattias.engdegard <at> gmail.com>,
 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Thu, 28 Mar 2024 20:16:35 +0100

Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
>> Date: Tue, 26 Mar 2024 16:33:52 +0100
>> 
>> If we replace the lisp-to-JSON encoder with native code, we would not need the jansson library for it and it would be faster.
>> 
>> There is ongoing work on a JSON-to-lisp parser, but the author has made it clear that he does not have time to write an encoder, so I spent a morning mashing up the attached patch.
>
> Thanks for working on this.
>
>> It generally produces the same result as the old code, except:
>> 
>> - The old code incorrectly accepted strings with non-Unicode characters (raw bytes). There is no reason to do this; JSON is UTF-8 only.
>
> Would it complicate the code not to reject raw bytes?  I'd like to
> avoid incompatibilities if it's practical.  Also, Emacs traditionally
> doesn't reject raw bytes, leaving that to the application or the user.
>
>> I'd be very happy if someone could test it with packages that use this interface (json-serialise, json-insert).
>
> Yes, please.

I've been using this along with the json-to-lisp parser for some time
now, and I'm really happy to see these improvements. Thanks a lot!

I haven't seen any issues thus far, and emacs is much more responsive.

I hope both of these patches will soon arrive on emacs 30.

I'll continue using and testing both until then.

Thanks,
Theo

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Thu, 28 Mar 2024 21:00:05 GMT) Full text and rfc822 format available.

Message #29 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Thu, 28 Mar 2024 21:59:38 +0100

27 mars 2024 kl. 20.05 skrev Eli Zaretskii <eliz <at> gnu.org>:

>>> This rejects unibyte non-ASCII strings, AFAU, in which case I suggest
>>> to think whether we really want that.  E.g., why is it wrong to encode
>>> a string to UTF-8, and then send it to JSON?
>> 
>> The way I see it, that would break the JSON abstraction: it transports strings of Unicode characters, not strings of bytes.
> 
> What's the difference?  AFAIU, JSON expects UTF-8 encoded strings, and
> whether that is used as a sequence of bytes or a sequence of
> characters is in the eyes of the beholder: the bytestream is the same,
> only the interpretation changes.

Well no -- JSON transports Unicode strings: the JSON serialiser takes a Unicode string as input and outputs a byte sequence; the JSON parser takes a byte sequence and returns a Unicode string (assuming we are just interested in strings).

That the transport format uses UTF-8 is unrelated; if the user hands an encoded byte sequence to us then it seems more likely that it's a mistake. After all, it cannot have come from a received JSON message.

I think it was just an another artefact of the old implementation. That code incorrectly used encode_string_utf_8 even on non-ASCII unibyte strings and trusted Jansson to validate the result. That resulted in a lot of wasted work and some strange strings getting accepted.

While it's theoretically possible that there are users with code relying on this behaviour, I can't find any evidence for it in the packages that I've looked at.

> I didn't suggest to decode the input string, not at all.  I suggested
> to allow unibyte strings, and process them just like you process
> pure-ASCII strings, leaving it to the caller to make sure the string
> has only valid UTF-8 sequences.

Users of this raw-bytes-input feature (if they exist at all) previously had their input validated by Jansson. While mistakes would probably be detected at the other end I'm not sure it's a good idea.

>  Forcing callers to decode such
> strings is IMO too harsh and largely unjustified.

We usually force them to do so in most other contexts. To take a random example, `princ` doesn't work with encoded strings. But it's rarely a problem.

Let's see how testing goes. We'll find a solution no matter what, pass-through or separate slow-path validation, if it turns out that we really need to after all.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Fri, 29 Mar 2024 06:05:02 GMT) Full text and rfc822 format available.

Message #32 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: casouri <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Fri, 29 Mar 2024 09:04:21 +0300

> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Thu, 28 Mar 2024 21:59:38 +0100
> Cc: casouri <at> gmail.com,
>  70007 <at> debbugs.gnu.org
> 
> 27 mars 2024 kl. 20.05 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> >>> This rejects unibyte non-ASCII strings, AFAU, in which case I suggest
> >>> to think whether we really want that.  E.g., why is it wrong to encode
> >>> a string to UTF-8, and then send it to JSON?
> >> 
> >> The way I see it, that would break the JSON abstraction: it transports strings of Unicode characters, not strings of bytes.
> > 
> > What's the difference?  AFAIU, JSON expects UTF-8 encoded strings, and
> > whether that is used as a sequence of bytes or a sequence of
> > characters is in the eyes of the beholder: the bytestream is the same,
> > only the interpretation changes.
> 
> Well no -- JSON transports Unicode strings: the JSON serialiser takes a Unicode string as input and outputs a byte sequence; the JSON parser takes a byte sequence and returns a Unicode string (assuming we are just interested in strings).
> 
> That the transport format uses UTF-8 is unrelated;

It is not unrelated.  A JSON stream is AFAIK supposed to have strings
represented in UTF-8 encoding.  When a Lisp program produces a JSON
stream, all that should matter to it is that any string there has a
valid UTF-8 sequence; where and how that sequence was obtained is of
secondary importance.

> if the user hands an encoded byte sequence to us then it seems more likely that it's a mistake.

We don't know that.  Since Emacs lets Lisp programs produce unibyte
UTF-8 encoded strings very easily, a program could do just that, for
whatever reasons.  Unless we have very serious reasons not to allow
UTF-8 sequences produced by something other than the JSON serializer
itself (and I think we don't), we should not prohibit it.  The Emacs
spirit is to let bad Lisp program enough rope to hang themselves if
that allows legitimate programs do their job more easily and flexibly.

> After all, it cannot have come from a received JSON message.

It could have, if it was encoded by the calling Lisp program.  It
could also have been received from another source, in unibyte form
that is nonetheless valid UTF-8.  If we force non-ASCII strings to be
multibyte, Lisp programs will be unable to take a unibyte UTF-8 string
received from an external source and plug it directly into an object
to be serialized into JSON; instead, they will have to decode the
string, then let the serializer encode it back -- a clear waste of CPU
cycles.

> I think it was just an another artefact of the old implementation. That code incorrectly used encode_string_utf_8 even on non-ASCII unibyte strings and trusted Jansson to validate the result. That resulted in a lot of wasted work and some strange strings getting accepted.

I'm not talking about the old implementation.  I was not completely
happy with it, either, and in particular with its insistence of
signaling errors due to encoding issues.  I think this is not our
business in this case: the responsibility for submitting a valid UTF-8
sequence, when we get a unibyte string, is on the caller.

> While it's theoretically possible that there are users with code relying on this behaviour, I can't find any evidence for it in the packages that I've looked at.

Once again, my bother is not about some code that expects us to encode
UTF-8 byte sequences -- doing that is definitely not TRT.  What I
would like to see is that unibyte strings are passed through
unchanged, so that valid UTF-8 strings will be okay, and invalid ones
will produce invalid JSON.  This is better than signaling errors,
IMNSHO, and in particular is more in-line with how Emacs handles
unibyte strings elsewhere.

> > I didn't suggest to decode the input string, not at all.  I suggested
> > to allow unibyte strings, and process them just like you process
> > pure-ASCII strings, leaving it to the caller to make sure the string
> > has only valid UTF-8 sequences.
> 
> Users of this raw-bytes-input feature (if they exist at all) previously had their input validated by Jansson. While mistakes would probably be detected at the other end I'm not sure it's a good idea.

Why not?  Once again, if we get a unibyte string, the onus is on the
caller to verify it's valid UTF-8, or suffer the consequences.

> >  Forcing callers to decode such
> > strings is IMO too harsh and largely unjustified.
> 
> We usually force them to do so in most other contexts. To take a random example, `princ` doesn't work with encoded strings. But it's rarely a problem.

There are many examples to the contrary.  For example, primitives that
deal with file names can accept both multibyte and unibyte encoded
strings.

> Let's see how testing goes. We'll find a solution no matter what, pass-through or separate slow-path validation, if it turns out that we really need to after all.

OK.  FTR, I'm not in favor of validation of unibyte strings, I just
suggest that we treat them as plain-ASCII: pass them through without
any validation, leaving the validation to the callers.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 30 Mar 2024 11:42:02 GMT) Full text and rfc822 format available.

Message #35 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 30 Mar 2024 12:41:31 +0100

[Message part 1 (text/plain, inline)]

29 mars 2024 kl. 07.04 skrev Eli Zaretskii <eliz <at> gnu.org>:

> OK.  FTR, I'm not in favor of validation of unibyte strings, I just
> suggest that we treat them as plain-ASCII: pass them through without
> any validation, leaving the validation to the callers.

Actually we are more or less forced to validate unibyte strings as long as the serialiser returns multibyte. Which we agree that it probably shouldn't, but I'd first like to take some time to ensure that returning unibyte won't break anything.

Thank you for pushing the new JSON parser to master. I've rebased my patch and cleaned it up a bit, and it now removes all uses of Jansson from json.c. Since the change involves removing some Windows-specific code, perhaps you would like to check that it still compiles on that platform, if you have the time?

Otherwise I'll push it to master, and will remain ready to make any further adjustments necessary. We can then remove all remaining Jansson references (configuration, installation notes, etc), and make the required NEWS and manual changes.

[0001-New-JSON-encoder-bug-70007.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 30 Mar 2024 13:24:02 GMT) Full text and rfc822 format available.

Message #38 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: casouri <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 30 Mar 2024 16:22:57 +0300

> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Sat, 30 Mar 2024 12:41:31 +0100
> Cc: casouri <at> gmail.com,
>  70007 <at> debbugs.gnu.org
> 
> 29 mars 2024 kl. 07.04 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> > OK.  FTR, I'm not in favor of validation of unibyte strings, I just
> > suggest that we treat them as plain-ASCII: pass them through without
> > any validation, leaving the validation to the callers.
> 
> Actually we are more or less forced to validate unibyte strings as long as the serialiser returns multibyte. Which we agree that it probably shouldn't, but I'd first like to take some time to ensure that returning unibyte won't break anything.

Yes, I was writing about the state of affairs when we change the
serializer to emit unibyte strings.

> Thank you for pushing the new JSON parser to master. I've rebased my patch and cleaned it up a bit, and it now removes all uses of Jansson from json.c. Since the change involves removing some Windows-specific code, perhaps you would like to check that it still compiles on that platform, if you have the time?

It builds and passes the tests, thanks.  But I have a couple of minor
nits below.

> +static struct symset_tbl *
> +make_symset_table (int bits, struct symset_tbl *up)
> +{
> +  int maxbits = min (SIZE_WIDTH - 2 - (word_size < 8 ? 2 : 3), 32);
> +  if (bits > maxbits)
> +    error ("out of memory");	/* Will never happen in practice.  */

This should be a call to memory_full, no?  Or if we must signal this
error here, at least make the error message more specific: mention
JSON or somesuch.

> +{
> +  double x = XFLOAT_DATA (f);
> +  if (isinf (x) || isnan (x))
> +    signal_error ("not a finite number", f);

I'd prefer a more specific error message here, like

  JSON does not allow Inf or NaN

Last, but not least: should we have a json-available-p function that
always returns non-nil, for better backward-compatibility?  Otherwise,
code out there might decide to use json.elm which is not what we want.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 30 Mar 2024 14:23:01 GMT) Full text and rfc822 format available.

Message #41 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 30 Mar 2024 15:22:34 +0100

30 mars 2024 kl. 14.22 skrev Eli Zaretskii <eliz <at> gnu.org>:

> This should be a call to memory_full, no?  Or if we must signal this
> error here, at least make the error message more specific: mention
> JSON or somesuch.

It's academic, but memory_full is fine.

> I'd prefer a more specific error message here, like
> 
>  JSON does not allow Inf or NaN

Sure.

> Last, but not least: should we have a json-available-p function that
> always returns non-nil, for better backward-compatibility?  Otherwise,
> code out there might decide to use json.elm which is not what we want.

Yes, keep json-available-p (that always returns t) for compatibility. We could declare it obsolete but it's not very important.

I made the changes above and installed it on master, as well as a sweeping removal of all things Jansson.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 30 Mar 2024 16:15:01 GMT) Full text and rfc822 format available.

Message #44 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Richard Copley <rcopley <at> gmail.com>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 30 Mar 2024 16:14:45 +0000

On 30/03/2024 14:22, Mattias Engdegård wrote:
> I made the changes above and installed it on master, as well as a sweeping removal of all things Jansson.

Hi Mattias,
A clean build fails with a linker error:

  CCLD     temacs
/usr/bin/ld: /tmp/cc4stUid.ltrans15.ltrans.o: in function `main':
<artificial>:(.text.startup+0x39a9): undefined reference to `syms_of_json'
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:739: temacs] Error 1

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 30 Mar 2024 16:38:01 GMT) Full text and rfc822 format available.

Message #47 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: casouri <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 30 Mar 2024 19:37:08 +0300

> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Sat, 30 Mar 2024 15:22:34 +0100
> Cc: casouri <at> gmail.com,
>  70007 <at> debbugs.gnu.org
> 
> I made the changes above and installed it on master, as well as a sweeping removal of all things Jansson.

Thanks.  The new code failed to link due to an omission in
src/Makefile.in, and also the "clever" initialization in
json_serialize confused make-docfile (so globals.h were generated
incorrectly: they missed the DEFSYMs in syms_of_json).  I think I
fixed that, but please eyeball the changes to see if I missed
something.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 30 Mar 2024 16:42:02 GMT) Full text and rfc822 format available.

Message #50 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Richard Copley <rcopley <at> gmail.com>
Cc: mattias.engdegard <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 30 Mar 2024 19:40:50 +0300

> Cc: 70007 <at> debbugs.gnu.org
> Date: Sat, 30 Mar 2024 16:14:45 +0000
> From: Richard Copley <rcopley <at> gmail.com>
> 
> On 30/03/2024 14:22, Mattias Engdegård wrote:
> > I made the changes above and installed it on master, as well as a sweeping removal of all things Jansson.
> 
> Hi Mattias,
> A clean build fails with a linker error:
> 
>    CCLD     temacs
> /usr/bin/ld: /tmp/cc4stUid.ltrans15.ltrans.o: in function `main':
> <artificial>:(.text.startup+0x39a9): undefined reference to `syms_of_json'
> collect2: error: ld returned 1 exit status
> make[2]: *** [Makefile:739: temacs] Error 1

Please try again, I hope I fixed this.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 30 Mar 2024 16:47:02 GMT) Full text and rfc822 format available.

Message #53 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andy Moreton <andrewjmoreton <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 30 Mar 2024 16:45:54 +0000

On Sat 30 Mar 2024, Richard Copley wrote:

> On 30/03/2024 14:22, Mattias Engdegård wrote:
>> I made the changes above and installed it on master, as well as a sweeping removal of all things Jansson.
>
> Hi Mattias,
> A clean build fails with a linker error:
>
>   CCLD     temacs
> /usr/bin/ld: /tmp/cc4stUid.ltrans15.ltrans.o: in function `main':
> <artificial>:(.text.startup+0x39a9): undefined reference to `syms_of_json'
> collect2: error: ld returned 1 exit status
> make[2]: *** [Makefile:739: temacs] Error 1

Commit 1135ce461d18 ("Always enable native JSON support and remove
Jansson references") removed json.o from configure.ac, but did not add it
in Makefile.in to ensure it is always built.

Adding json.o to base_obj there appears to fix the build.

    AndyM

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 30 Mar 2024 20:22:02 GMT) Full text and rfc822 format available.

Message #56 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 30 Mar 2024 21:21:39 +0100

30 mars 2024 kl. 17.37 skrev Eli Zaretskii <eliz <at> gnu.org>:

> Thanks.  


> The new code failed to link due to an omission in
> src/Makefile.in, and also the "clever" initialization in
> json_serialize confused make-docfile (so globals.h were generated
> incorrectly: they missed the DEFSYMs in syms_of_json).  I think I
> fixed that, but please eyeball the changes to see if I missed
> something.

Blast, I thought I had bootstrapped and tested it but naturally used the wrong tree. Thanks for putting it right.

We should do something about make-docfile one of these days.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 30 Mar 2024 20:38:02 GMT) Full text and rfc822 format available.

Message #59 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Corwin Brust <corwin <at> bru.st>
To: Andy Moreton <andrewjmoreton <at> gmail.com>
Cc: 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 30 Mar 2024 15:36:59 -0500

> >> I made the changes above and installed it on master, as well as a sweeping removal of all things Jansson.

This builds okay for me with MSYS2/Mingw64; here are some Windows
binaries if anyone would like to test:

https://corwin.bru.st/emacs-30/emacs-30-000f91

Thanks!

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 30 Mar 2024 23:31:03 GMT) Full text and rfc822 format available.

Message #62 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Richard Copley <rcopley <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: mattias.engdegard <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 30 Mar 2024 23:29:48 +0000

On Sat, 30 Mar 2024 at 16:40, Eli Zaretskii <eliz <at> gnu.org> wrote:
> Please try again, I hope I fixed this.
Thanks, looks good.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Tue, 02 Apr 2024 14:15:02 GMT) Full text and rfc822 format available.

Message #65 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Tue, 2 Apr 2024 16:13:58 +0200

[Message part 1 (text/plain, inline)]

Looks like the new serialiser inherited a bug from the old code: `json-insert` in a unibyte buffer does not move point correctly. Example:

(with-temp-buffer
  (set-buffer-multibyte nil)
  (json-insert "é")
  (list (buffer-string) (point)))
=> ("\"\303\251\"" 4)

The string is correct but the position should be 5, not 4.

This made me look at the Fjson_insert logic a bit. I'm probably betraying my lack of knowledge about buffer subtleties here, but since the serialiser always produces (correct) UTF-8, shouldn't it be enough to copy the bytes, don't bother with any decoding, and perform the buffer insertion ceremonies?

Proposed patch attached. (There will also be a test, of course.)

[json-insert.diff (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Tue, 02 Apr 2024 16:15:02 GMT) Full text and rfc822 format available.

Message #68 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: casouri <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Tue, 02 Apr 2024 19:13:47 +0300

> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Tue, 2 Apr 2024 16:13:58 +0200
> Cc: casouri <at> gmail.com,
>  70007 <at> debbugs.gnu.org
> 
> Looks like the new serialiser inherited a bug from the old code: `json-insert` in a unibyte buffer does not move point correctly. Example:
> 
> (with-temp-buffer
>   (set-buffer-multibyte nil)
>   (json-insert "é")
>   (list (buffer-string) (point)))
> => ("\"\303\251\"" 4)
> 
> The string is correct but the position should be 5, not 4.

In a build with --enable-checking, this hits an assertion.  So I think
we should add this to the test suite.

> This made me look at the Fjson_insert logic a bit. I'm probably betraying my lack of knowledge about buffer subtleties here, but since the serialiser always produces (correct) UTF-8, shouldn't it be enough to copy the bytes, don't bother with any decoding, and perform the buffer insertion ceremonies?

Yes, I think that's true.

One nit, though: if the result could be an empty string, then we
should not do anything at all, not even invalidate_buffer_caches.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Tue, 02 Apr 2024 17:20:03 GMT) Full text and rfc822 format available.

Message #71 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Tue, 2 Apr 2024 19:19:03 +0200

2 apr. 2024 kl. 18.13 skrev Eli Zaretskii <eliz <at> gnu.org>:

>> This made me look at the Fjson_insert logic a bit. I'm probably betraying my lack of knowledge about buffer subtleties here, but since the serialiser always produces (correct) UTF-8, shouldn't it be enough to copy the bytes, don't bother with any decoding, and perform the buffer insertion ceremonies?
> 
> Yes, I think that's true.

Thank you, now pushed with added tests.

> One nit, though: if the result could be an empty string, then we
> should not do anything at all, not even invalidate_buffer_caches.

I don't think json-insert can ever end up inserting the empty string (unless it signals, and then we won't get to the insertion stage). But I added an assertion for this.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 24 Aug 2024 15:36:01 GMT) Full text and rfc822 format available.

Message #74 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Yuan Fu <casouri <at> gmail.com>, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 24 Aug 2024 17:33:33 +0200

(I'm going through some old unfinished work.)

The only major issue remaining here would be whether `json-serialize` should return a multibyte string, as it does now, or a unibyte string. I think we agree that unibyte is the correct choice in general, and since the code is new in Emacs 30 we should make a decision now because changing it in Emacs 31 will be riskier. On the other hand, the old libjansson implementation returned a multibyte string.

I've looked at packages using the API and they appear entirely agnostic -- the result is almost always a string that is fed into `process-send-string`.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 24 Aug 2024 16:16:01 GMT) Full text and rfc822 format available.

Message #77 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: casouri <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 24 Aug 2024 19:14:22 +0300

> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Sat, 24 Aug 2024 17:33:33 +0200
> Cc: Yuan Fu <casouri <at> gmail.com>,
>  70007 <at> debbugs.gnu.org
> 
> (I'm going through some old unfinished work.)
> 
> The only major issue remaining here would be whether `json-serialize` should return a multibyte string, as it does now, or a unibyte string. I think we agree that unibyte is the correct choice in general, and since the code is new in Emacs 30 we should make a decision now because changing it in Emacs 31 will be riskier. On the other hand, the old libjansson implementation returned a multibyte string.
> 
> I've looked at packages using the API and they appear entirely agnostic -- the result is almost always a string that is fed into `process-send-string`.

We are very late in the release process to make such changes without
reservations.  We should have discussed this much earlier, but that's
water under the bridge.

Since this could potentially (albeit unlikely) break someone's code,
I'm okay with making this change, but with a twist: let's have a
variable which could be let-bound around the call to json-serialize,
to make the result a multibyte string instead of the default unibyte.
This is so if someone comes back crying that we broke his/her code,
they could have an easy fire escape.

OK?

P.S. Of course, the fact that the result is a unibyte string should be
mentioned in the doc string.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 24 Aug 2024 19:48:02 GMT) Full text and rfc822 format available.

Message #80 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 24 Aug 2024 21:45:57 +0200

24 aug. 2024 kl. 18.14 skrev Eli Zaretskii <eliz <at> gnu.org>:

> We are very late in the release process to make such changes without
> reservations.  We should have discussed this much earlier, but that's
> water under the bridge.

Yes, sorry about not following this through in time. But here we are.

> Since this could potentially (albeit unlikely) break someone's code,
> I'm okay with making this change, but with a twist: let's have a
> variable which could be let-bound around the call to json-serialize,
> to make the result a multibyte string instead of the default unibyte.
> This is so if someone comes back crying that we broke his/her code,
> they could have an easy fire escape.

Sure, but it's a bit annoying to saddle us with some baggage that we never expect to be actually used.
And for that matter, it's no easier to bind a variable around the call than putting a call to `decode-coding-string` around it, is it? We could describe that in NEWS.

The null option is to do nothing and stick to what we have now. It's not a terrible choice either.

> P.S. Of course, the fact that the result is a unibyte string should be
> mentioned in the doc string.

Yes, definitely.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sun, 25 Aug 2024 05:09:02 GMT) Full text and rfc822 format available.

Message #83 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>,
 Stefan Monnier <monnier <at> iro.umontreal.ca>,
 Stefan Kangas <stefankangas <at> gmail.com>, Andrea Corallo <acorallo <at> gnu.org>
Cc: casouri <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sun, 25 Aug 2024 08:07:48 +0300

> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Sat, 24 Aug 2024 21:45:57 +0200
> Cc: casouri <at> gmail.com,
>  70007 <at> debbugs.gnu.org
> 
> 24 aug. 2024 kl. 18.14 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> > We are very late in the release process to make such changes without
> > reservations.  We should have discussed this much earlier, but that's
> > water under the bridge.
> 
> Yes, sorry about not following this through in time. But here we are.
> 
> > Since this could potentially (albeit unlikely) break someone's code,
> > I'm okay with making this change, but with a twist: let's have a
> > variable which could be let-bound around the call to json-serialize,
> > to make the result a multibyte string instead of the default unibyte.
> > This is so if someone comes back crying that we broke his/her code,
> > they could have an easy fire escape.
> 
> Sure, but it's a bit annoying to saddle us with some baggage that we never expect to be actually used.
> And for that matter, it's no easier to bind a variable around the call than putting a call to `decode-coding-string` around it, is it? We could describe that in NEWS.
> 
> The null option is to do nothing and stick to what we have now. It's not a terrible choice either.

Let's see what others think about this.  Stefan, Stefan, and Andrea:
any opinions wrt what, if anything, we should do about this on the
emacs-30 branch?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sun, 25 Aug 2024 17:58:02 GMT) Full text and rfc822 format available.

Message #86 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Mattias Engdegård <mattias.engdegard <at> gmail.com>,
 Andrea Corallo <acorallo <at> gnu.org>, casouri <at> gmail.com,
 Stefan Kangas <stefankangas <at> gmail.com>, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sun, 25 Aug 2024 13:55:53 -0400

>> > Since this could potentially (albeit unlikely) break someone's code,
>> > I'm okay with making this change, but with a twist: let's have a
>> > variable which could be let-bound around the call to json-serialize,
>> > to make the result a multibyte string instead of the default unibyte.
>> > This is so if someone comes back crying that we broke his/her code,
>> > they could have an easy fire escape.
>> Sure, but it's a bit annoying to saddle us with some baggage that we never
>> expect to be actually used.
>> And for that matter, it's no easier to bind a variable around the call
>> than putting a call to `decode-coding-string` around it, is it? We could
>> describe that in NEWS.

The variable solution has one advantage: users can (setq ... t) in their
`.emacs` to keep old code working, tho potentially at the cost of
breaking new code.

Another option is to introduce a new name (that outputs a unicode
string) and obsolete the old one.  🙂

But just making a breaking change is not that bad: ELisp has evolved to
be pretty good at making code magically work for both encoded and
decoded strings.

>> The null option is to do nothing and stick to what we have now.
>> It's not a terrible choice either.

Returning an encoded strings is something we do want to avoid in the
long run, so we might as well bite the bullet now.

> Let's see what others think about this.  Stefan, Stefan, and Andrea:
> any opinions wrt what, if anything, we should do about this on the
> emacs-30 branch?

I vote for breaking compatibility.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sun, 25 Aug 2024 18:30:02 GMT) Full text and rfc822 format available.

Message #89 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: mattias.engdegard <at> gmail.com, acorallo <at> gnu.org, casouri <at> gmail.com,
 stefankangas <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sun, 25 Aug 2024 21:26:11 +0300

> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: Mattias Engdegård <mattias.engdegard <at> gmail.com>,
>   Stefan Kangas
>  <stefankangas <at> gmail.com>,  Andrea Corallo <acorallo <at> gnu.org>,
>   casouri <at> gmail.com,  70007 <at> debbugs.gnu.org
> Date: Sun, 25 Aug 2024 13:55:53 -0400
> 
> > Let's see what others think about this.  Stefan, Stefan, and Andrea:
> > any opinions wrt what, if anything, we should do about this on the
> > emacs-30 branch?
> 
> I vote for breaking compatibility.

And against the additional variable to make this more
backward-compatible?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sun, 25 Aug 2024 19:22:02 GMT) Full text and rfc822 format available.

Message #92 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: mattias.engdegard <at> gmail.com, acorallo <at> gnu.org, casouri <at> gmail.com,
 stefankangas <at> gmail.com, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sun, 25 Aug 2024 15:20:18 -0400

> And against the additional variable to make this more
> backward-compatible?

Yup.  The var would be my second-best choice (and I assume it's
immediately declared obsolete).


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sun, 25 Aug 2024 20:11:01 GMT) Full text and rfc822 format available.

Message #95 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: Eli Zaretskii <eliz <at> gnu.org>, Andrea Corallo <acorallo <at> gnu.org>,
 casouri <at> gmail.com, Stefan Kangas <stefankangas <at> gmail.com>,
 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sun, 25 Aug 2024 22:08:30 +0200

25 aug. 2024 kl. 19.55 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:

> The variable solution has one advantage: users can (setq ... t) in their
> `.emacs` to keep old code working, tho potentially at the cost of
> breaking new code.

Actually that's a terrible solution because it affects all code, not just a single package. In addition, it will linger in the user's system indefinitely (.emacs being the prime bioaccumulation tissue) and potentially cause trouble when he or she installs a perfectly working package years later.

So my preferences are:

First place: always unibyte, with no controlling variable.
Very close second place: always multibyte, idem.
Distant third place: contrived new function name for unibyte, like json-serialize-encode (no, I don't like it either).
Very distant fourth place: variable controlling the string type returned.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 31 Aug 2024 09:47:02 GMT) Full text and rfc822 format available.

Message #98 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Kangas <stefankangas <at> gmail.com>, Andrea Corallo <acorallo <at> gnu.org>
Cc: Mattias Engdegård <mattias.engdegard <at> gmail.com>,
 casouri <at> gmail.com, monnier <at> iro.umontreal.ca, 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 31 Aug 2024 12:45:25 +0300

> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Sun, 25 Aug 2024 22:08:30 +0200
> Cc: Eli Zaretskii <eliz <at> gnu.org>,
>  Stefan Kangas <stefankangas <at> gmail.com>,
>  Andrea Corallo <acorallo <at> gnu.org>,
>  casouri <at> gmail.com,
>  70007 <at> debbugs.gnu.org
> 
> 25 aug. 2024 kl. 19.55 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:
> 
> > The variable solution has one advantage: users can (setq ... t) in their
> > `.emacs` to keep old code working, tho potentially at the cost of
> > breaking new code.
> 
> Actually that's a terrible solution because it affects all code, not just a single package. In addition, it will linger in the user's system indefinitely (.emacs being the prime bioaccumulation tissue) and potentially cause trouble when he or she installs a perfectly working package years later.
> 
> So my preferences are:
> 
> First place: always unibyte, with no controlling variable.
> Very close second place: always multibyte, idem.
> Distant third place: contrived new function name for unibyte, like json-serialize-encode (no, I don't like it either).
> Very distant fourth place: variable controlling the string type returned.

Stefan Kangas and Andrea, can I have your opinions on this, please?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 31 Aug 2024 22:18:01 GMT) Full text and rfc822 format available.

Message #101 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Kangas <stefankangas <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>, Eli Zaretskii <eliz <at> gnu.org>
Cc: mattias.engdegard <at> gmail.com, acorallo <at> gnu.org, casouri <at> gmail.com,
 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 31 Aug 2024 15:15:25 -0700

Stefan Monnier <monnier <at> iro.umontreal.ca> writes:

>> And against the additional variable to make this more
>> backward-compatible?
>
> Yup.  The var would be my second-best choice (and I assume it's
> immediately declared obsolete).

I tend to agree with Stefan M here.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 07 Sep 2024 07:27:02 GMT) Full text and rfc822 format available.

Message #104 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: acorallo <at> gnu.org, Stefan Kangas <stefankangas <at> gmail.com>
Cc: mattias.engdegard <at> gmail.com, casouri <at> gmail.com, monnier <at> iro.umontreal.ca,
 70007 <at> debbugs.gnu.org
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 07 Sep 2024 10:26:39 +0300

> From: Stefan Kangas <stefankangas <at> gmail.com>
> Date: Sat, 31 Aug 2024 15:15:25 -0700
> Cc: mattias.engdegard <at> gmail.com, acorallo <at> gnu.org, casouri <at> gmail.com, 
> 	70007 <at> debbugs.gnu.org
> 
> Stefan Monnier <monnier <at> iro.umontreal.ca> writes:
> 
> >> And against the additional variable to make this more
> >> backward-compatible?
> >
> > Yup.  The var would be my second-best choice (and I assume it's
> > immediately declared obsolete).
> 
> I tend to agree with Stefan M here.

Thanks.

Andrea, would you please voice your opinion on this?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 07 Sep 2024 15:49:02 GMT) Full text and rfc822 format available.

Message #107 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Andrea Corallo <acorallo <at> gnu.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: mattias.engdegard <at> gmail.com, 70007 <at> debbugs.gnu.org, casouri <at> gmail.com,
 Stefan Kangas <stefankangas <at> gmail.com>, monnier <at> iro.umontreal.ca
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 07 Sep 2024 11:48:36 -0400

Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Stefan Kangas <stefankangas <at> gmail.com>
>> Date: Sat, 31 Aug 2024 15:15:25 -0700
>> Cc: mattias.engdegard <at> gmail.com, acorallo <at> gnu.org, casouri <at> gmail.com, 
>> 	70007 <at> debbugs.gnu.org
>> 
>> Stefan Monnier <monnier <at> iro.umontreal.ca> writes:
>> 
>> >> And against the additional variable to make this more
>> >> backward-compatible?
>> >
>> > Yup.  The var would be my second-best choice (and I assume it's
>> > immediately declared obsolete).
>> 
>> I tend to agree with Stefan M here.
>
> Thanks.
>
> Andrea, would you please voice your opinion on this?

I'm for returning unibyte indeed.  And for the variable or the second
function I'm kind of neutral, I'd do it only if it's not too much effort
so I'll trust Mattias preference here.

  Andrea

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sat, 07 Sep 2024 15:54:02 GMT) Full text and rfc822 format available.

Message #110 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andrea Corallo <acorallo <at> gnu.org>
Cc: mattias.engdegard <at> gmail.com, 70007 <at> debbugs.gnu.org, casouri <at> gmail.com,
 stefankangas <at> gmail.com, monnier <at> iro.umontreal.ca
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sat, 07 Sep 2024 18:52:49 +0300

> From: Andrea Corallo <acorallo <at> gnu.org>
> Cc: Stefan Kangas <stefankangas <at> gmail.com>,  monnier <at> iro.umontreal.ca,
>   mattias.engdegard <at> gmail.com,  casouri <at> gmail.com,  70007 <at> debbugs.gnu.org
> Date: Sat, 07 Sep 2024 11:48:36 -0400
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> >> From: Stefan Kangas <stefankangas <at> gmail.com>
> >> Date: Sat, 31 Aug 2024 15:15:25 -0700
> >> Cc: mattias.engdegard <at> gmail.com, acorallo <at> gnu.org, casouri <at> gmail.com, 
> >> 	70007 <at> debbugs.gnu.org
> >> 
> >> Stefan Monnier <monnier <at> iro.umontreal.ca> writes:
> >> 
> >> >> And against the additional variable to make this more
> >> >> backward-compatible?
> >> >
> >> > Yup.  The var would be my second-best choice (and I assume it's
> >> > immediately declared obsolete).
> >> 
> >> I tend to agree with Stefan M here.
> >
> > Thanks.
> >
> > Andrea, would you please voice your opinion on this?
> 
> I'm for returning unibyte indeed.  And for the variable or the second
> function I'm kind of neutral, I'd do it only if it's not too much effort
> so I'll trust Mattias preference here.

OK, so let's go with unconditionally unibyte, as it seems to be the
consensus here.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70007; Package emacs. (Sun, 08 Sep 2024 18:35:02 GMT) Full text and rfc822 format available.

Message #113 received at 70007 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: casouri <at> gmail.com, Andrea Corallo <acorallo <at> gnu.org>, 70007 <at> debbugs.gnu.org,
 stefankangas <at> gmail.com, monnier <at> iro.umontreal.ca
Subject: Re: bug#70007: [PATCH] native JSON encoder
Date: Sun, 8 Sep 2024 20:33:33 +0200

7 sep. 2024 kl. 17.52 skrev Eli Zaretskii <eliz <at> gnu.org>:

> OK, so let's go with unconditionally unibyte, as it seems to be the
> consensus here.

That change has now been pushed to emacs-30. Feel free to adjust as required.

bug closed, send any further explanations to 70007 <at> debbugs.gnu.org and Mattias Engdegård <mattias.engdegard <at> gmail.com> Request was from Mattias Engdegård <mattias.engdegard <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 10 Sep 2024 13:27:01 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 09 Oct 2024 11:24:06 GMT) Full text and rfc822 format available.

This bug report was last modified 274 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #70007 [PATCH] native JSON encoder

GNU bug report logs - #70007
[PATCH] native JSON encoder