GNU bug report logs - #10627
char-ready? is broken for multibyte encodings
Reported by: Mark H Weaver <mhw <at> netris.org>
Date: Sat, 28 Jan 2012 10:24:02 UTC
Severity: normal
Done: Andy Wingo <wingo <at> pobox.com>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 10627 in the body.
You can then email your comments to 10627 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-guile <at> gnu.org: bug#10627; Package guile. (Sat, 28 Jan 2012 10:24:02 GMT)
Acknowledgement sent to Mark H Weaver <mhw <at> netris.org>: New bug report received and forwarded. Copy sent to bug-guile <at> gnu.org. (Sat, 28 Jan 2012 10:24:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
The R5RS specifies that if 'char-ready?' returns #t, then the next
'read-char' operation is guaranteed not to hang. This is not currently
the case for ports using a multibyte encoding.
'char-ready?' currently returns #t whenever at least one _byte_ is
available. This is not correct in general. It should return #t only if
there is a complete _character_ available.
Mark
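A minimal sketch of the distinction drawn in the report, assuming the port is encoded as UTF-8 and its buffered bytes are visible as a plain array. The names `byte_ready` and `utf8_char_ready` are hypothetical and are not Guile functions: the first models the reported behavior, the second the behavior the report asks for. Validation is deliberately loose; a malformed sequence counts as ready, on the assumption that a decoder can reject it without waiting for more input.

```c
#include <assert.h>
#include <stddef.h>

/* Sequence length implied by a UTF-8 lead byte. Invalid lead bytes and
   stray continuation bytes count as 1: a decoder can act on them at once. */
size_t utf8_expected_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;
}

/* Reported behavior: any buffered byte makes the port "ready". */
int byte_ready(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    return len > 0;
}

/* Requested behavior: a whole character must be buffered. */
int utf8_char_ready(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len == 0)
        return 0;

    size_t need = utf8_expected_len(buf[0]);

    /* A non-continuation byte inside the sequence means it is malformed;
       the decoder does not need to wait for anything further. */
    for (size_t i = 1; i < len && i < need; i++)
        if ((buf[i] & 0xC0) != 0x80)
            return 1;

    return len >= need;
}
```

With only the first byte of λ (0xCE 0xBB) buffered, `byte_ready` returns 1 while `utf8_char_ready` returns 0, which is the gap the report describes.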
Information forwarded to bug-guile <at> gnu.org: bug#10627; Package guile. (Sun, 24 Feb 2013 19:14:01 GMT)
Message #8 received at 10627 <at> debbugs.gnu.org:
On Sat 28 Jan 2012 11:21, Mark H Weaver <mhw <at> netris.org> writes:
> The R5RS specifies that if 'char-ready?' returns #t, then the next
> 'read-char' operation is guaranteed not to hang. This is not currently
> the case for ports using a multibyte encoding.
>
> 'char-ready?' currently returns #t whenever at least one _byte_ is
> available. This is not correct in general. It should return #t only if
> there is a complete _character_ available.
This procedure is omitted in the R6RS because it is not a good
interface. Besides its semantic difficulties, can you think of a sane
implementation for multibyte characters?
I suggest we document that this procedure only works correctly in
encodings with 1-byte characters and recommend that people use u8-ready?
instead.
Andy
--
http://wingolog.org/
Information forwarded to bug-guile <at> gnu.org: bug#10627; Package guile. (Sun, 24 Feb 2013 20:17:02 GMT)
Message #11 received at 10627 <at> debbugs.gnu.org:
Hi Andy,
Andy Wingo <wingo <at> pobox.com> writes:
> On Sat 28 Jan 2012 11:21, Mark H Weaver <mhw <at> netris.org> writes:
>
>> The R5RS specifies that if 'char-ready?' returns #t, then the next
>> 'read-char' operation is guaranteed not to hang. This is not currently
>> the case for ports using a multibyte encoding.
>>
>> 'char-ready?' currently returns #t whenever at least one _byte_ is
>> available. This is not correct in general. It should return #t only if
>> there is a complete _character_ available.
>
> This procedure is omitted in the R6RS because it is not a good
> interface. Besides its semantic difficulties, can you think of a sane
> implementation for multibyte characters?
Maybe I'm missing something, but I don't see any semantic problem here,
and it seems straightforward to implement. 'char-ready?' should simply
read bytes until either a complete character is available, or no more
bytes are ready. In either case, all the bytes should then be 'unget'
before returning. What's the problem?
The only reason I haven't yet fixed this is because it will require some
refactoring in ports.c. I guess the most straightforward approach is to
generalize 'get_codepoint', 'get_utf8_codepoint', and
'get_iconv_codepoint' to support a non-blocking mode of operation.
What do you think?
Regards,
Mark
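The read-then-unget approach proposed in this message might look roughly like the following. Everything here is illustrative: `struct toy_port` and the `toy_*` functions are hypothetical stand-ins, not the code in ports.c, and the sketch assumes UTF-8 and a get operation that never blocks. It also assumes, as both participants agree later in the thread, that a port at EOF counts as ready.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a port: the bytes obtainable without blocking. */
struct toy_port {
    const unsigned char *ready;  /* bytes available right now */
    size_t len;                  /* how many are available */
    size_t pos;                  /* read cursor */
    int at_eof;                  /* nonzero once the stream has ended */
};

/* Non-blocking get: the next byte, or -1 if none is ready. */
int toy_get_byte(struct toy_port *p)
{
    return p->pos < p->len ? p->ready[p->pos++] : -1;
}

/* Undo the most recent successful toy_get_byte. */
void toy_unget_byte(struct toy_port *p)
{
    assert(p->pos > 0);
    p->pos--;
}

/* Sequence length implied by a UTF-8 lead byte; invalid leads count as 1. */
size_t toy_utf8_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Read until a character is complete or no more bytes are ready,
   then put every byte back before answering. */
int toy_char_ready(struct toy_port *p)
{
    size_t taken = 0, need = 1;
    int c, complete = 0;

    while (!complete && (c = toy_get_byte(p)) != -1) {
        taken++;
        if (taken == 1)
            need = toy_utf8_len((unsigned char)c);
        complete = taken >= need;
    }

    /* Restore the port so the caller cannot observe the lookahead. */
    for (; taken > 0; taken--)
        toy_unget_byte(p);

    return complete || p->at_eof;
}
```

The read cursor is the same before and after the call, which is the property the proposal relies on: the lookahead is invisible to anyone who stays above the port abstraction.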
Information forwarded to bug-guile <at> gnu.org: bug#10627; Package guile. (Sun, 24 Feb 2013 22:18:01 GMT)
Message #14 received at 10627 <at> debbugs.gnu.org:
Hi :)
On Sun 24 Feb 2013 21:14, Mark H Weaver <mhw <at> netris.org> writes:
> Andy Wingo <wingo <at> pobox.com> writes:
>
>> On Sat 28 Jan 2012 11:21, Mark H Weaver <mhw <at> netris.org> writes:
>>
>>> The R5RS specifies that if 'char-ready?' returns #t, then the next
>>> 'read-char' operation is guaranteed not to hang. This is not currently
>>> the case for ports using a multibyte encoding.
>>>
>>> 'char-ready?' currently returns #t whenever at least one _byte_ is
>>> available. This is not correct in general. It should return #t only if
>>> there is a complete _character_ available.
>>
>> This procedure is omitted in the R6RS because it is not a good
>> interface. Besides its semantic difficulties, can you think of a sane
>> implementation for multibyte characters?
>
> Maybe I'm missing something, but I don't see any semantic problem here,
> and it seems straightforward to implement. 'char-ready?' should simply
> read bytes until either a complete character is available, or no more
> bytes are ready. In either case, all the bytes should then be 'unget'
> before returning. What's the problem?
The problem is that char-ready? should not read anything. If you want
to peek, use peek-char. Note that if the stream is at EOF, char-ready?
should return #t.
Andy
--
http://wingolog.org/
Information forwarded to bug-guile <at> gnu.org: bug#10627; Package guile. (Mon, 25 Feb 2013 00:09:01 GMT)
Message #17 received at 10627 <at> debbugs.gnu.org:
Andy Wingo <wingo <at> pobox.com> writes:
> On Sun 24 Feb 2013 21:14, Mark H Weaver <mhw <at> netris.org> writes:
>
>> Maybe I'm missing something, but I don't see any semantic problem here,
>> and it seems straightforward to implement. 'char-ready?' should simply
>> read bytes until either a complete character is available, or no more
>> bytes are ready. In either case, all the bytes should then be 'unget'
>> before returning. What's the problem?
>
> The problem is that char-ready? should not read anything.
Okay, but if all bytes read are later *unread*, and the reads never
block, then why does it matter? The reads in my proposed implementation
are just an internal implementation detail, and it seems to me that the
user cannot tell the difference, as long as he does not peek underneath
the Scheme port abstraction.
If you prefer, perhaps a nicer way to think about it is that
'char-ready?' looks ahead in the putback buffer and/or the read buffer
(refilling it in a non-blocking mode if needed), and returns #t iff a
complete character is present in the buffer(s), or EOF is reached.
However, it seems to me that implementing this in terms of read-byte and
unget-byte is simpler, because it avoids duplication of the logic
regarding putback buffers and refilling of buffers. Maybe there's some
reason why this is a bad idea, but I haven't heard one.
I agree that 'char-ready?' is an antiquated interface, but it is
nonetheless part of the R5RS (and Guile since approximately forever),
and it is the only way to do a non-blocking read in portable R5RS. It
seems to me that we ought to try to implement it as well as we can, no?
> If you want to peek, use peek-char.
Okay, but that's a totally different tool with a different use case.
It cannot be used to do non-blocking reads.
> Note that if the stream is at EOF, char-ready? should return #t.
Agreed.
More thoughts?
Thanks,
Mark
Information forwarded to bug-guile <at> gnu.org: bug#10627; Package guile. (Mon, 25 Feb 2013 01:26:01 GMT)
Message #20 received at 10627 <at> debbugs.gnu.org:
On 25 February 2013 08:06, Mark H Weaver <mhw <at> netris.org> wrote:
> Andy Wingo <wingo <at> pobox.com> writes:
>
>> On Sun 24 Feb 2013 21:14, Mark H Weaver <mhw <at> netris.org> writes:
>>
>>> Maybe I'm missing something, but I don't see any semantic problem here,
>>> and it seems straightforward to implement. 'char-ready?' should simply
>>> read bytes until either a complete character is available, or no more
>>> bytes are ready. In either case, all the bytes should then be 'unget'
>>> before returning. What's the problem?
>>
>> The problem is that char-ready? should not read anything.
>
> Okay, but if all bytes read are later *unread*, and the reads never
> block, then why does it matter?
Taking care to still use sf_input_waiting for soft ports? Reading
bytes from a soft port could have side effects (i.e. logging action or
similar).
Information forwarded to bug-guile <at> gnu.org: bug#10627; Package guile. (Mon, 25 Feb 2013 08:58:01 GMT)
Message #23 received at 10627 <at> debbugs.gnu.org:
Hi Mark,
Are you proposing that `char-ready?' do a nonblocking read if
the buffer is empty? That could work.
On Mon 25 Feb 2013 01:06, Mark H Weaver <mhw <at> netris.org> writes:
> However, it seems to me that implementing this in terms of read-byte and
> unget-byte is simpler, because it avoids duplication of the logic
> regarding putback buffers and refilling of buffers.
Could work, if the port is nonblocking to begin with.
> I agree that 'char-ready?' is an antiquated interface, but it is
> nonetheless part of the R5RS (and Guile since approximately forever),
> and it is the only way to do a non-blocking read in portable R5RS. It
> seems to me that we ought to try to implement it as well as we can, no?
Do what you like to do :) But if it were my time, I would simply
document that it checks for a byte and not a character and move on.
Andy
--
http://wingolog.org/
Information forwarded to bug-guile <at> gnu.org: bug#10627; Package guile. (Tue, 26 Feb 2013 19:53:01 GMT)
Message #26 received at 10627 <at> debbugs.gnu.org:
Andy Wingo <wingo <at> pobox.com> writes:
> Are you proposing that `char-ready?' do a nonblocking read if
> the buffer is empty? That could work.
Yes. I suspect that something along these lines is already implemented,
because I don't see how 'u8-ready?' could work properly without it.
> Do what you like to do :) But if it were my time, I would simply
> document that it checks for a byte and not a character and move on.
I'd like to fix it properly. Let's keep this bug open until it's done.
Thanks,
Mark
Information forwarded to bug-guile <at> gnu.org: bug#10627; Package guile. (Tue, 26 Feb 2013 20:02:01 GMT)
Message #29 received at 10627 <at> debbugs.gnu.org:
On Tue 26 Feb 2013 20:50, Mark H Weaver <mhw <at> netris.org> writes:
> Andy Wingo <wingo <at> pobox.com> writes:
>> Are you proposing that `char-ready?' do a nonblocking read if
>> the buffer is empty? That could work.
>
> Yes. I suspect that something along these lines is already implemented,
> because I don't see how 'u8-ready?' could work properly without it.
It does a poll with a timeout of 0.
Andy
--
http://wingolog.org/
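For readers unfamiliar with the idiom, a poll with a timeout of 0 on a POSIX system can be sketched as follows. `fd_input_waiting` is a hypothetical name chosen for this example and is not Guile's code. The check answers only whether bytes can be fetched, not whether they add up to a character.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Returns 1 if fd polls as readable right now, 0 if it does not,
   and -1 on error. A timeout of 0 makes poll() return immediately. */
int fd_input_waiting(int fd)
{
    struct pollfd pfd;
    int n;

    assert(fd >= 0);
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    n = poll(&pfd, 1, 0);
    if (n < 0 || (pfd.revents & POLLNVAL))
        return -1;

    /* POLLHUP and POLLERR also mean a read will not block. */
    return n > 0;
}
```

On a pipe that holds only the first byte of a two-byte UTF-8 character, this returns 1, so a `char-ready?` built on it alone would report #t even though a character-level read could still wait for the second byte.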
Reply sent to Andy Wingo <wingo <at> pobox.com>: You have taken responsibility. (Mon, 20 Jun 2016 19:24:01 GMT)
Notification sent to Mark H Weaver <mhw <at> netris.org>: bug acknowledged by developer. (Mon, 20 Jun 2016 19:24:01 GMT)
Message #34 received at 10627-done <at> debbugs.gnu.org:
On Tue 26 Feb 2013 20:59, Andy Wingo <wingo <at> pobox.com> writes:
> On Tue 26 Feb 2013 20:50, Mark H Weaver <mhw <at> netris.org> writes:
>
>> Andy Wingo <wingo <at> pobox.com> writes:
>>> Are you proposing that `char-ready?' do a nonblocking read if
>>> the buffer is empty? That could work.
>>
>> Yes. I suspect that something along these lines is already implemented,
>> because I don't see how 'u8-ready?' could work properly without it.
>
> It does a poll with a timeout of 0.
In the end I added this to the manual:
Note that @code{char-ready?} only works reliably for terminals and
sockets with one-byte encodings. Under the hood it will return
@code{#t} if the port has any input buffered, or if the file descriptor
that backs the port polls as readable, indicating that Guile can fetch
more bytes from the kernel. However being able to fetch one byte
doesn't mean that a full character is available; @xref{Encoding}. Also,
on many systems it's possible for a file descriptor to poll as readable,
but then block when it comes time to read bytes. Note also that on
Linux kernels, all file ports backed by files always poll as readable.
For non-file ports, this procedure always returns @code{#t}, except for
soft ports, which have a @code{char-ready?} handler. @xref{Soft Ports}.
In short, this is a legacy procedure whose semantics are hard to
provide. However it is a useful check to see if any input is buffered.
@xref{Non-Blocking I/O}.
We could try a non-blocking read but at that point we should just
provide a non-blocking read-char, and allow users to unread-char. That
would be a different bug :)
Andy
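To make the closing suggestion concrete, here is one way a non-blocking character read could be sketched on a POSIX file descriptor. This is an assumption-laden illustration, not Guile's API: `nb_read_char`, `struct nb_state`, `nb_set_nonblocking` and the status names are all hypothetical, the encoding is assumed to be UTF-8, and continuation bytes are not validated. Bytes of a partly received character are kept in caller-supplied state between calls instead of being pushed back onto the descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

enum nb_status { NB_CHAR, NB_WOULD_BLOCK, NB_EOF, NB_ERROR };

/* Bytes of a character that has only partly arrived. */
struct nb_state {
    unsigned char bytes[4];
    size_t have;
};

/* Put fd into non-blocking mode; returns 0 on success, -1 on failure. */
int nb_set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    return flags < 0 ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Sequence length implied by a UTF-8 lead byte; invalid leads count as 1. */
size_t nb_utf8_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Decode a complete sequence of n bytes (no validation is performed). */
unsigned long nb_decode(const unsigned char *b, size_t n)
{
    static const unsigned char lead_mask[] = { 0x7F, 0x1F, 0x0F, 0x07 };
    unsigned long cp = b[0] & lead_mask[n - 1];
    for (size_t i = 1; i < n; i++)
        cp = (cp << 6) | (b[i] & 0x3F);
    return cp;
}

/* Never blocks, provided fd is in non-blocking mode. Stores a code point
   in *out only when the result is NB_CHAR. */
enum nb_status nb_read_char(int fd, struct nb_state *st, unsigned long *out)
{
    assert(st != NULL && out != NULL);
    for (;;) {
        if (st->have > 0 && st->have >= nb_utf8_len(st->bytes[0])) {
            *out = nb_decode(st->bytes, st->have);
            st->have = 0;
            return NB_CHAR;
        }

        unsigned char b;
        ssize_t n = read(fd, &b, 1);
        if (n == 1)
            st->bytes[st->have++] = b;   /* keep the partial sequence */
        else if (n == 0)
            return NB_EOF;
        else if (errno == EAGAIN || errno == EWOULDBLOCK)
            return NB_WOULD_BLOCK;
        else if (errno != EINTR)
            return NB_ERROR;
    }
}
```

A caller that receives `NB_WOULD_BLOCK` can wait for the descriptor to become readable and call again with the same state; no separate readiness predicate is needed, which is the point of the remark above.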
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 19 Jul 2016 11:24:04 GMT)
This bug report was last modified 8 years and 286 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.