GNU bug report logs - #10627
char-ready? is broken for multibyte encodings


Package: guile

Reported by: Mark H Weaver <mhw <at> netris.org>

Date: Sat, 28 Jan 2012 10:24:02 UTC

Severity: normal

Done: Andy Wingo <wingo <at> pobox.com>

Bug is archived. No further changes may be made.



Report forwarded to bug-guile <at> gnu.org:
bug#10627; Package guile. (Sat, 28 Jan 2012 10:24:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mark H Weaver <mhw <at> netris.org>:
New bug report received and forwarded. Copy sent to bug-guile <at> gnu.org. (Sat, 28 Jan 2012 10:24:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: bug-guile <at> gnu.org
Subject: char-ready? is broken for multibyte encodings
Date: Sat, 28 Jan 2012 05:21:24 -0500

The R5RS specifies that if 'char-ready?' returns #t, then the next
'read-char' operation is guaranteed not to hang.  This is not currently
the case for ports using a multibyte encoding.

'char-ready?' currently returns #t whenever at least one _byte_ is
available.  This is not correct in general.  It should return #t only if
there is a complete _character_ available.

     Mark
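[Editorial illustration] The gap between "a byte is ready" and "a character is ready" can be shown with a small Python sketch. This is not Guile code: it assumes a UTF-8 port, and the helper names `utf8_char_length` and `complete_char_available` are hypothetical. It only checks the structure of the leading bytes (lead byte plus continuation bytes), which is enough to tell whether a decoder would have to wait for more input.

```python
def utf8_char_length(lead: int) -> int:
    """Number of bytes the UTF-8 sequence starting with this lead byte needs."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    # Invalid lead byte: a decoder can report the error without more input.
    return 1


def complete_char_available(buffered: bytes) -> bool:
    """True if a decoder could produce a character (or a decoding error)
    from the buffered bytes alone, without waiting for more input."""
    if not buffered:
        return False
    need = utf8_char_length(buffered[0])
    head = buffered[:need]
    # A byte that is not a continuation byte (10xxxxxx) ends the sequence
    # early, so the decoder can already report an error.
    if any(b & 0xC0 != 0x80 for b in head[1:]):
        return True
    return len(head) == need


euro = "€".encode("utf-8")  # three bytes: e2 82 ac
one_byte_only = complete_char_available(euro[:1])
whole_char = complete_char_available(euro)
```

With only the first byte of "€" buffered, a byte-level check would report ready, while `complete_char_available` does not; a blocking character read at that point would wait for the remaining two bytes.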





Message #8 received at 10627 <at> debbugs.gnu.org (full text, mbox):

From: Andy Wingo <wingo <at> pobox.com>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 10627 <at> debbugs.gnu.org
Subject: Re: bug#10627: char-ready? is broken for multibyte encodings
Date: Sun, 24 Feb 2013 20:11:50 +0100

On Sat 28 Jan 2012 11:21, Mark H Weaver <mhw <at> netris.org> writes:

> The R5RS specifies that if 'char-ready?' returns #t, then the next
> 'read-char' operation is guaranteed not to hang.  This is not currently
> the case for ports using a multibyte encoding.
>
> 'char-ready?' currently returns #t whenever at least one _byte_ is
> available.  This is not correct in general.  It should return #t only if
> there is a complete _character_ available.

This procedure is omitted in the R6RS because it is not a good
interface.  Besides its semantic difficulties, can you think of a sane
implementation for multibyte characters?

I suggest we document that this procedure only works correctly in
encodings with 1-byte characters and recommend that people use u8-ready?
instead.

Andy
-- 
http://wingolog.org/





Message #11 received at 10627 <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: Andy Wingo <wingo <at> pobox.com>
Cc: 10627 <at> debbugs.gnu.org
Subject: Re: bug#10627: char-ready? is broken for multibyte encodings
Date: Sun, 24 Feb 2013 15:14:05 -0500

Hi Andy,

Andy Wingo <wingo <at> pobox.com> writes:

> On Sat 28 Jan 2012 11:21, Mark H Weaver <mhw <at> netris.org> writes:
>
>> The R5RS specifies that if 'char-ready?' returns #t, then the next
>> 'read-char' operation is guaranteed not to hang.  This is not currently
>> the case for ports using a multibyte encoding.
>>
>> 'char-ready?' currently returns #t whenever at least one _byte_ is
>> available.  This is not correct in general.  It should return #t only if
>> there is a complete _character_ available.
>
> This procedure is omitted in the R6RS because it is not a good
> interface.  Besides its semantic difficulties, can you think of a sane
> implementation for multibyte characters?

Maybe I'm missing something, but I don't see any semantic problem here,
and it seems straightforward to implement.  'char-ready?' should simply
read bytes until either a complete character is available, or no more
bytes are ready.  In either case, all the bytes should then be 'unget'
before returning.  What's the problem?

The only reason I haven't yet fixed this is because it will require some
refactoring in ports.c.  I guess the most straightforward approach is to
generalize 'get_codepoint', 'get_utf8_codepoint', and
'get_iconv_codepoint' to support a non-blocking mode of operation.

What do you think?

  Regards,
    Mark





Message #14 received at 10627 <at> debbugs.gnu.org (full text, mbox):

From: Andy Wingo <wingo <at> pobox.com>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 10627 <at> debbugs.gnu.org
Subject: Re: bug#10627: char-ready? is broken for multibyte encodings
Date: Sun, 24 Feb 2013 23:15:33 +0100

Hi :)

On Sun 24 Feb 2013 21:14, Mark H Weaver <mhw <at> netris.org> writes:

> Andy Wingo <wingo <at> pobox.com> writes:
>
>> On Sat 28 Jan 2012 11:21, Mark H Weaver <mhw <at> netris.org> writes:
>>
>>> The R5RS specifies that if 'char-ready?' returns #t, then the next
>>> 'read-char' operation is guaranteed not to hang.  This is not currently
>>> the case for ports using a multibyte encoding.
>>>
>>> 'char-ready?' currently returns #t whenever at least one _byte_ is
>>> available.  This is not correct in general.  It should return #t only if
>>> there is a complete _character_ available.
>>
>> This procedure is omitted in the R6RS because it is not a good
>> interface.  Besides its semantic difficulties, can you think of a sane
>> implementation for multibyte characters?
>
> Maybe I'm missing something, but I don't see any semantic problem here,
> and it seems straightforward to implement.  'char-ready?' should simply
> read bytes until either a complete character is available, or no more
> bytes are ready.  In either case, all the bytes should then be 'unget'
> before returning.  What's the problem?

The problem is that char-ready? should not read anything.  If you want
to peek, use peek-char.  Note that if the stream is at EOF, char-ready?
should return #t.

Andy
-- 
http://wingolog.org/





Message #17 received at 10627 <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: Andy Wingo <wingo <at> pobox.com>
Cc: 10627 <at> debbugs.gnu.org
Subject: Re: bug#10627: char-ready? is broken for multibyte encodings
Date: Sun, 24 Feb 2013 19:06:30 -0500

Andy Wingo <wingo <at> pobox.com> writes:

> On Sun 24 Feb 2013 21:14, Mark H Weaver <mhw <at> netris.org> writes:
>
>> Maybe I'm missing something, but I don't see any semantic problem here,
>> and it seems straightforward to implement.  'char-ready?' should simply
>> read bytes until either a complete character is available, or no more
>> bytes are ready.  In either case, all the bytes should then be 'unget'
>> before returning.  What's the problem?
>
> The problem is that char-ready? should not read anything.

Okay, but if all bytes read are later *unread*, and the reads never
block, then why does it matter?  The reads in my proposed implementation
are just an internal implementation detail, and it seems to me that the
user cannot tell the difference, as long as he does not peek underneath
the Scheme port abstraction.

If you prefer, perhaps a nicer way to think about it is that
'char-ready?' looks ahead in the putback buffer and/or the read buffer
(refilling it in a non-blocking mode if needed), and returns #t iff a
complete character is present in the buffer(s), or EOF is reached.
However, it seems to me that implementing this in terms of read-byte and
unget-byte is simpler, because it avoids duplication of the logic
regarding putback buffers and refilling of buffers.  Maybe there's some
reason why this is a bad idea, but I haven't heard one.

I agree that 'char-ready?' is an antiquated interface, but it is
nonetheless part of the R5RS (and Guile since approximately forever),
and it is the only way to do a non-blocking read in portable R5RS.  It
seems to me that we ought to try to implement it as well as we can, no?

> If you want to peek, use peek-char.

Okay, but that's a totally different tool with a different use case.
It cannot be used to do non-blocking reads.

> Note that if the stream is at EOF, char-ready? should return #t.

Agreed.

More thoughts?

   Thanks,
     Mark
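[Editorial illustration] The read-then-unget approach described in this message can be sketched in Python as follows. This is a toy model, not Guile's ports.c: `ToyPort`, `read_byte_nonblocking`, and `unget_byte` are hypothetical stand-ins for a port with a non-blocking byte read and a putback buffer, and Python's incremental decoder stands in for the codepoint decoders. A decoding error is treated as "ready" on the assumption that a character read would then report the error rather than hang.

```python
import codecs
from collections import deque


class ToyPort:
    """Hypothetical port: `ready` holds the bytes readable without blocking."""

    def __init__(self, ready: bytes = b"", eof: bool = False,
                 encoding: str = "utf-8"):
        self.ready = deque(ready)
        self.eof = eof
        self.encoding = encoding

    def read_byte_nonblocking(self):
        # Returns the next byte, or None if reading would block.
        return self.ready.popleft() if self.ready else None

    def unget_byte(self, byte: int) -> None:
        self.ready.appendleft(byte)


def char_ready(port: ToyPort) -> bool:
    decoder = codecs.getincrementaldecoder(port.encoding)()
    taken = []
    ready = False
    while True:
        byte = port.read_byte_nonblocking()
        if byte is None:
            # No more bytes right now: ready only if the stream is at EOF.
            ready = port.eof
            break
        taken.append(byte)
        try:
            if decoder.decode(bytes([byte])):
                ready = True  # a complete character was decoded
                break
        except UnicodeDecodeError:
            ready = True  # assumed: the reader would signal an error, not hang
            break
    # Put every byte back, last one first, so the port is left unchanged.
    for byte in reversed(taken):
        port.unget_byte(byte)
    return ready


port = ToyPort("€".encode("utf-8")[:2])  # two of the three bytes of "€"
partial = char_ready(port)
port.ready.append(0xAC)                  # the final byte arrives
complete = char_ready(port)
```

In this model the caller cannot observe the internal reads, because the byte queue is identical before and after each call. It does not address sources where reading itself has side effects.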





Message #20 received at 10627 <at> debbugs.gnu.org (full text, mbox):

From: Daniel Hartwig <mandyke <at> gmail.com>
To: Mark H Weaver <mhw <at> netris.org>
Cc: Andy Wingo <wingo <at> pobox.com>, 10627 <at> debbugs.gnu.org
Subject: Re: bug#10627: char-ready? is broken for multibyte encodings
Date: Mon, 25 Feb 2013 09:23:45 +0800

On 25 February 2013 08:06, Mark H Weaver <mhw <at> netris.org> wrote:
> Andy Wingo <wingo <at> pobox.com> writes:
>
>> On Sun 24 Feb 2013 21:14, Mark H Weaver <mhw <at> netris.org> writes:
>>
>>> Maybe I'm missing something, but I don't see any semantic problem here,
>>> and it seems straightforward to implement.  'char-ready?' should simply
>>> read bytes until either a complete character is available, or no more
>>> bytes are ready.  In either case, all the bytes should then be 'unget'
>>> before returning.  What's the problem?
>>
>> The problem is that char-ready? should not read anything.
>
> Okay, but if all bytes read are later *unread*, and the reads never
> block, then why does it matter?

Taking care to still use sf_input_waiting for soft ports?  Reading
bytes from a soft port could have side effects (i.e. logging action or
similar).





Message #23 received at 10627 <at> debbugs.gnu.org (full text, mbox):

From: Andy Wingo <wingo <at> pobox.com>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 10627 <at> debbugs.gnu.org
Subject: Re: bug#10627: char-ready? is broken for multibyte encodings
Date: Mon, 25 Feb 2013 09:55:44 +0100

Hi Mark,

Are you proposing that `char-ready?' do a nonblocking read if
the buffer is empty?  That could work.

On Mon 25 Feb 2013 01:06, Mark H Weaver <mhw <at> netris.org> writes:

> However, it seems to me that implementing this in terms of read-byte and
> unget-byte is simpler, because it avoids duplication of the logic
> regarding putback buffers and refilling of buffers.

Could work, if the port is nonblocking to begin with.

> I agree that 'char-ready?' is an antiquated interface, but it is
> nonetheless part of the R5RS (and Guile since approximately forever),
> and it is the only way to do a non-blocking read in portable R5RS.  It
> seems to me that we ought to try to implement it as well as we can, no?

Do what you like to do :)  But if it were my time, I would simply
document that it checks for a byte and not a character and move on.

Andy
-- 
http://wingolog.org/





Message #26 received at 10627 <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: Andy Wingo <wingo <at> pobox.com>
Cc: 10627 <at> debbugs.gnu.org
Subject: Re: bug#10627: char-ready? is broken for multibyte encodings
Date: Tue, 26 Feb 2013 14:50:43 -0500

Andy Wingo <wingo <at> pobox.com> writes:
> Are you proposing that `char-ready?' do a nonblocking read if
> the buffer is empty?  That could work.

Yes.  I suspect that something along these lines is already implemented,
because I don't see how 'u8-ready?' could work properly without it.

> Do what you like to do :)  But if it were my time, I would simply
> document that it checks for a byte and not a character and move on.

I'd like to fix it properly.  Let's keep this bug open until it's done.

     Thanks,
       Mark





Message #29 received at 10627 <at> debbugs.gnu.org (full text, mbox):

From: Andy Wingo <wingo <at> pobox.com>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 10627 <at> debbugs.gnu.org
Subject: Re: bug#10627: char-ready? is broken for multibyte encodings
Date: Tue, 26 Feb 2013 20:59:25 +0100

On Tue 26 Feb 2013 20:50, Mark H Weaver <mhw <at> netris.org> writes:

> Andy Wingo <wingo <at> pobox.com> writes:
>> Are you proposing that `char-ready?' do a nonblocking read if
>> the buffer is empty?  That could work.
>
> Yes.  I suspect that something along these lines is already implemented,
> because I don't see how 'u8-ready?' could work properly without it.

It does a poll with a timeout of 0.

Andy
-- 
http://wingolog.org/
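[Editorial illustration] A poll with a timeout of 0 asks the kernel whether a descriptor is readable and returns at once either way. The Python sketch below shows the same idea with `select.select` on a local socket pair. It is an analogy only, not Guile's implementation, and the helper name `byte_ready` is hypothetical.

```python
import select
import socket


def byte_ready(sock: socket.socket) -> bool:
    """Zero-timeout readiness check: returns immediately, never waits."""
    readable, _, _ = select.select([sock], [], [], 0)
    return bool(readable)


reader, writer = socket.socketpair()
before = byte_ready(reader)                 # nothing has been sent yet

# Send only the first byte of the two-byte UTF-8 encoding of "é".
writer.sendall("é".encode("utf-8")[:1])
after = byte_ready(reader)                  # readable, yet no full character

reader.close()
writer.close()
```

After the partial write the socket polls as readable even though only half of a character has arrived, which is the byte-versus-character mismatch this report is about.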




Reply sent to Andy Wingo <wingo <at> pobox.com>:
You have taken responsibility. (Mon, 20 Jun 2016 19:24:01 GMT) Full text and rfc822 format available.

Notification sent to Mark H Weaver <mhw <at> netris.org>:
bug acknowledged by developer. (Mon, 20 Jun 2016 19:24:01 GMT) Full text and rfc822 format available.

Message #34 received at 10627-done <at> debbugs.gnu.org (full text, mbox):

From: Andy Wingo <wingo <at> pobox.com>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 10627-done <at> debbugs.gnu.org
Subject: Re: bug#10627: char-ready? is broken for multibyte encodings
Date: Mon, 20 Jun 2016 21:23:35 +0200

On Tue 26 Feb 2013 20:59, Andy Wingo <wingo <at> pobox.com> writes:

> On Tue 26 Feb 2013 20:50, Mark H Weaver <mhw <at> netris.org> writes:
>
>> Andy Wingo <wingo <at> pobox.com> writes:
>>> Are you proposing that `char-ready?' do a nonblocking read if
>>> the buffer is empty?  That could work.
>>
>> Yes.  I suspect that something along these lines is already implemented,
>> because I don't see how 'u8-ready?' could work properly without it.
>
> It does a poll with a timeout of 0.

In the end I added this to the manual:

    Note that @code{char-ready?} only works reliably for terminals and
    sockets with one-byte encodings.  Under the hood it will return
    @code{#t} if the port has any input buffered, or if the file descriptor
    that backs the port polls as readable, indicating that Guile can fetch
    more bytes from the kernel.  However being able to fetch one byte
    doesn't mean that a full character is available; @xref{Encoding}.  Also,
    on many systems it's possible for a file descriptor to poll as readable,
    but then block when it comes time to read bytes.  Note also that on
    Linux kernels, all file ports backed by files always poll as readable.
    For non-file ports, this procedure always returns @code{#t}, except for
    soft ports, which have a @code{char-ready?} handler.  @xref{Soft Ports}.

    In short, this is a legacy procedure whose semantics are hard to
    provide.  However it is a useful check to see if any input is buffered.
    @xref{Non-Blocking I/O}.

We could try a non-blocking read but at that point we should just
provide a non-blocking read-char, and allow users to unread-char.  That
would be a different bug :)

Andy




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 19 Jul 2016 11:24:04 GMT) Full text and rfc822 format available.
