GNU bug report logs - #37633
Column part interpreted wrong in compilation mode

Package: emacs

Reported by: Bernd Paysan <bernd <at> net2o.de>

Date: Sat, 5 Oct 2019 15:45:01 UTC

Severity: normal

Tags: wontfix

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 37633 in the body.
You can then email your comments to 37633 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sat, 05 Oct 2019 15:45:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Bernd Paysan <bernd <at> net2o.de>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sat, 05 Oct 2019 15:45:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Bernd Paysan <bernd <at> net2o.de>
To: bug-gnu-emacs <at> gnu.org
Cc: anton <at> mips.complang.tuwien.ac.at
Subject: Column part interpreted wrong in compilation mode
Date: Sat, 05 Oct 2019 13:12:34 +0200

Compilers like gcc and others (e.g. gforth) output file:line:column on each 
error or warning.  However, “column” here is really the byte offset into the 
line (starting at 1).

Problems arise when tabs and UTF-8 glyphs are involved, e.g. compile

---------------test.c---------------
void foo() {
	printf("test %i", b);
	printf("test你好 %i", c);
}
---------------gcc test.c---------------
-*- mode: compilation; default-directory: "~/tmp/" -*-
Compilation started at Sat Oct  5 12:13:23

gcc test.c
test.c: In function ‘foo’:
test.c:2:2: warning: implicit declaration of function ‘printf’ [-Wimplicit-function-declaration]
    2 |  printf("test %i", b);
      |  ^~~~~~
test.c:2:2: warning: incompatible implicit declaration of built-in function ‘printf’
test.c:1:1: note: include ‘<stdio.h>’ or provide a declaration of ‘printf’
  +++ |+#include <stdio.h>
    1 | void foo() {
test.c:2:20: error: ‘b’ undeclared (first use in this function)
    2 |  printf("test %i", b);
      |                    ^
test.c:2:20: note: each undeclared identifier is reported only once for each function it appears in
test.c:3:26: error: ‘c’ undeclared (first use in this function)
    3 |  printf("test你好 %i", c);
      |                          ^

Compilation exited abnormally with code 1 at Sat Oct  5 12:13:23
---------------snip---------------

When you click on test.c:2:20, it gets you to the second t in 'test'; if you 
click on test.c:3:26, you end up on the '%'.  The expected result would be to 
have the cursor on 'b' and 'c'.
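
To make the mismatch concrete, here is a minimal Python sketch (not Emacs code; the helper name `byte_column_to_char_index` is made up for illustration). It assumes the file is UTF-8 and treats the reported column as a 1-based byte offset into the line, then converts that offset to a character index.

```python
def byte_column_to_char_index(line: str, column: int, encoding: str = "utf-8") -> int:
    """Map a 1-based byte column (as printed by gcc) to a 0-based character index."""
    raw = line.encode(encoding)
    # Bytes that precede the reported position, clamped to the line length.
    prefix = raw[: max(0, min(column - 1, len(raw)))]
    # errors="ignore" drops a trailing partial character if the offset
    # happens to fall inside a multi-byte sequence.
    return len(prefix.decode(encoding, errors="ignore"))


line2 = '\tprintf("test %i", b);'
line3 = '\tprintf("test你好 %i", c);'

print(line2[byte_column_to_char_index(line2, 20)])  # character at test.c:2:20
print(line3[byte_column_to_char_index(line3, 26)])  # character at test.c:3:26
```

Under these assumptions the two lookups land on `b` and `c`, which is the expected result described above.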

The problem has been discussed here two years ago:

https://www.reddit.com/r/emacs/comments/5m3i59/ask_remacs_get_compile_mode_to_treat_column/

Suggested solution: Use byte-to-position to calculate the position in 
compilation-move-to-column.

Since debugging environments can also control Emacs e.g. through emacsclient 
+line:column file, I suggest adding a pattern that indicates that column here 
really means byte position, too, e.g. +line/byte or +line,byte or such. Or 
just interpret it as byte position, too.  gedit e.g. counts a tab as 1 if you 
open a file with +line:column options, but counts one UTF-8 glyph also as 1 
(which is not how compilers count).

Some programming languages (e.g. JavaScript) convert Unicode glyphs and other 
characters into internal character types; for those, the gedit behavior, or 
the behavior with compilation-error-screen-columns set to nil, is probably OK.  
We just need a byte mode here as well: a true/false setting is not enough.

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o id: kQusJzA;7*?t=uy <at> X}1GWr!+0qqp_Cn176t4(dQ*
https://net2o.de/

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sat, 05 Oct 2019 16:09:01 GMT) Full text and rfc822 format available.

Message #8 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Bernd Paysan <bernd <at> net2o.de>
Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sat, 05 Oct 2019 19:08:21 +0300

> Cc: anton <at> mips.complang.tuwien.ac.at
> Date: Sat, 05 Oct 2019 13:12:34 +0200
> From: Bernd Paysan via "Bug reports for GNU Emacs,
>  the Swiss army knife of text editors" <bug-gnu-emacs <at> gnu.org>
> 
> Suggested solution: Use byte-to-position to calculate the position in 
> compilation-move-to-column.

This only works in UTF-8 locales, and is not 100% even there, so it
isn't the right solution.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sat, 05 Oct 2019 16:18:01 GMT) Full text and rfc822 format available.

Message #11 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: bernd <at> net2o.de
Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sat, 05 Oct 2019 19:16:53 +0300

> Date: Sat, 05 Oct 2019 19:08:21 +0300
> From: Eli Zaretskii <eliz <at> gnu.org>
> Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
> 
> > Suggested solution: Use byte-to-position to calculate the position in 
> > compilation-move-to-column.
> 
> This only works in UTF-8 locales, and is not 100% even there, so it
> isn't the right solution.

In general, byte-to-position is meant to be used only for converting
between byte and character positions of text in Emacs buffers.

For byte offsets in external text we have bufferpos-to-filepos, but
that requires us to know the encoding of the external text.  We need
to find a reasonable way of getting that.  Suggestions and patches
welcome.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sat, 05 Oct 2019 17:32:01 GMT) Full text and rfc822 format available.

Message #14 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Bernd Paysan <bernd <at> net2o.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sat, 05 Oct 2019 18:58:15 +0200

On Saturday, 5 October 2019, 18:08:21 CEST, Eli Zaretskii wrote:
> > Cc: anton <at> mips.complang.tuwien.ac.at
> > Date: Sat, 05 Oct 2019 13:12:34 +0200
> > From: Bernd Paysan via "Bug reports for GNU Emacs,
> > 
> >  the Swiss army knife of text editors" <bug-gnu-emacs <at> gnu.org>
> > 
> > Suggested solution: Use byte-to-position to calculate the position in
> > compilation-move-to-column.
> 
> This only works in UTF-8 locales, and is not 100% even there, so it
> isn't the right solution.

It's at least an improvement, though it's not perfect.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sat, 05 Oct 2019 17:32:02 GMT) Full text and rfc822 format available.

Message #17 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Bernd Paysan <bernd <at> net2o.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sat, 05 Oct 2019 19:05:26 +0200

On Saturday, 5 October 2019, 18:16:53 CEST, Eli Zaretskii wrote:
> > Date: Sat, 05 Oct 2019 19:08:21 +0300
> > From: Eli Zaretskii <eliz <at> gnu.org>
> > Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
> > 
> > > Suggested solution: Use byte-to-position to calculate the position in
> > > compilation-move-to-column.
> > 
> > This only works in UTF-8 locales, and is not 100% even there, so it
> > isn't the right solution.
> 
> In general, byte-to-position is meant to be used only for converting
> between byte and character positions of text in Emacs buffers.
> 
> For byte offsets in external text we have bufferpos-to-filepos, but
> that requires us to know the encoding of the external text.  We need
> to find a reasonable way of getting that.  Suggestions and patches
> welcome.

We can likely assume that the auto-detected encoding is the correct one, i.e. 
buffer-file-coding-system can be used (the default for the optional encoding 
system parameter for bufferpos-to-filepos and filepos-to-bufferpos).

I.e. go to the line selected, do a bufferpos-to-filepos on that position, add 
the column-1 to that, and do a filepos-to-bufferpos.  Jump there.

Problem with precision: "exact" requires encoding the entire file, so it's 
slow for large files.  Particularly with automatically generated files, this 
is likely not acceptable, so "approximate" could be good enough.
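
The three steps above can be sketched outside Emacs as follows. This is a Python illustration only, with a hypothetical helper `goto_line_byte_column`; it corresponds to the "exact" strategy, because it encodes the text in full, which is the cost the precision remark refers to.

```python
def goto_line_byte_column(text: str, line: int, column: int, encoding: str = "utf-8") -> int:
    """Return the 0-based character position in TEXT for a 1-based LINE and byte COLUMN."""
    lines = text.split("\n")
    # Character position of the beginning of the requested line.
    line_start = sum(len(l) + 1 for l in lines[: line - 1])
    # Step 1: buffer position -> file byte offset (encode everything before the line).
    file_offset = len(text[:line_start].encode(encoding))
    # Step 2: add the byte column, minus one because columns start at 1.
    target = file_offset + (column - 1)
    # Step 3: file byte offset -> buffer position (decode the byte prefix).
    pos = len(text.encode(encoding)[:target].decode(encoding, errors="ignore"))
    # Do not run past the end of the requested line.
    line_end = line_start + len(lines[line - 1])
    return min(pos, line_end)


source = 'void foo() {\n\tprintf("test %i", b);\n\tprintf("test你好 %i", c);\n}\n'
print(source[goto_line_byte_column(source, 3, 26)])
```

An "approximate" variant would avoid encoding the entire text, trading accuracy in unusual encodings for speed on large files.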

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sat, 05 Oct 2019 17:36:01 GMT) Full text and rfc822 format available.

Message #20 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Bernd Paysan <bernd <at> net2o.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sat, 05 Oct 2019 19:34:59 +0200

On Saturday, 5 October 2019, 18:16:53 CEST, Eli Zaretskii wrote:
> > Date: Sat, 05 Oct 2019 19:08:21 +0300
> > From: Eli Zaretskii <eliz <at> gnu.org>
> > Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
> > 
> > > Suggested solution: Use byte-to-position to calculate the position in
> > > compilation-move-to-column.
> > 
> > This only works in UTF-8 locales, and is not 100% even there, so it
> > isn't the right solution.
> 
> In general, byte-to-position is meant to be used only for converting
> between byte and character positions of text in Emacs buffers.
> 
> For byte offsets in external text we have bufferpos-to-filepos, but
> that requires us to know the encoding of the external text.  We need
> to find a reasonable way of getting that.  Suggestions and patches
> welcome.

Ok, first I tried bufferpos-to-filepos.

(defun compilation-move-to-column (col screen)
  "Go to column COL on the current line.
If SCREEN is non-nil, columns are screen columns, otherwise, they are
just char-counts."
  ;; Test version: SCREEN is ignored and COL is treated as a byte offset.
  (setq col (- col compilation-first-column))
  (let ((realpos (filepos-to-bufferpos
                  (+ (bufferpos-to-filepos (line-beginning-position)
                                           'approximate)
                     col)
                  'approximate)))
    ;; Never move past the end of the current line.
    (goto-char (min realpos (line-end-position)))))

I left out the (if ) with (screen), because I just wanted to test this case.  
For the examples I've used, it works with the 'approximate setting.

I leave the screen part to the Emacs maintainers, because you may want a 
three-way choice: nil for char count, t for screen columns, and 'bytepos for 
a byte-accurate position.  JavaScript (node) is fine with the char-count 
mode.
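
As a rough illustration of how the three interpretations differ, here is a Python sketch (not Emacs code). The function `column_to_char_index`, the mode names, the tab width of 8 and the use of East Asian width to detect double-width glyphs are all assumptions made for this example.

```python
import unicodedata


def column_to_char_index(line: str, column: int, mode: str, tab_width: int = 8) -> int:
    """Interpret a 1-based COLUMN as a 'char', 'screen', or 'byte' column (UTF-8 assumed)."""
    target = column - 1
    if mode == "char":
        return min(target, len(line))
    if mode == "byte":
        prefix = line.encode("utf-8")[:target]
        return len(prefix.decode("utf-8", errors="ignore"))
    if mode == "screen":
        width = 0
        for index, ch in enumerate(line):
            if ch == "\t":
                step = tab_width - width % tab_width  # advance to the next tab stop
            elif unicodedata.east_asian_width(ch) in ("W", "F"):
                step = 2  # wide glyphs occupy two screen columns
            else:
                step = 1
            if width + step > target:
                return index
            width += step
        return len(line)
    raise ValueError(f"unknown mode: {mode}")


line3 = '\tprintf("test你好 %i", c);'
for mode in ("char", "screen", "byte"):
    index = column_to_char_index(line3, 26, mode)
    print(mode, index, repr(line3[index:index + 1]))
```

For column 26 of the third line of test.c, the screen interpretation stops at `%`, the char interpretation runs off the end of the line, and only the byte interpretation reaches `c`.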

Second test-case: iso8859-1 encoded file with

void foo() {
	printf("test %i", b);
	printf("testäöü %i", c);
}

...
test-iso.c:3:23: error: ‘c’ undeclared (first use in this function)
    3 |  printf("test��� %i", c);
      |                       ^
...

works when you click there, too.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sat, 05 Oct 2019 18:54:01 GMT) Full text and rfc822 format available.

Message #23 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Bernd Paysan <bernd <at> net2o.de>
Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sat, 05 Oct 2019 21:53:02 +0300

> From: Bernd Paysan <bernd <at> net2o.de>
> Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
> Date: Sat, 05 Oct 2019 19:05:26 +0200
> 
> We can likely assume that the auto-detected encoding is the correct one, i.e. 
> buffer-file-coding-system can be used (the default for the optional encoding 
> system parameter for bufferpos-to-filepos and filepos-to-bufferpos).

Encoding of subprocess output is generally not auto-detected, it uses
the defaults derived from the locale.  I don't recommend
auto-detecting, because that's quite fragile (and is not needed here
anyway, IMO).

> Problem with precision: "exact" requires encoding the entire file, so it's 
> slow for large files.  Particularly with automatically generated files, this 
> is likely not acceptable, so "approximate" could be good enough.

We cannot use 'exact' here because there's no file per se: we only
have the compiler output.  We must use 'approximate'.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sat, 05 Oct 2019 18:55:02 GMT) Full text and rfc822 format available.

Message #26 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Bernd Paysan <bernd <at> net2o.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sat, 05 Oct 2019 20:54:38 +0200

On Saturday, 5 October 2019, 20:53:02 CEST, Eli Zaretskii wrote:
> > Problem with precision: "exact" requires encoding the entire file, so it's
> > slow for large files.  Particularly with automatically generated files,
> > this is likely not acceptable, so "approximate" could be good enough.
> 
> We cannot use 'exact' here because there's no file per se: we only
> have the compiler output.  We must use 'approximate'.

The buffer that matters is not the compiler output, it's the buffer of the 
source code.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sat, 05 Oct 2019 19:16:02 GMT) Full text and rfc822 format available.

Message #29 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Bernd Paysan <bernd <at> net2o.de>
Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sat, 05 Oct 2019 22:14:38 +0300

> From: Bernd Paysan <bernd <at> net2o.de>
> Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
> Date: Sat, 05 Oct 2019 20:54:38 +0200
> 
> > We cannot use 'exact' here because there's no file per se: we only
> > have the compiler output.  We must use 'approximate'.
> 
> The buffer that matters is not the compiler output, it's the buffer of the 
> source code.

But the column numbers are counted in the compiler output, and no one
said that the compiler output must be encoded the same as the source
file.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sat, 05 Oct 2019 19:25:01 GMT) Full text and rfc822 format available.

Message #32 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Bernd Paysan <bernd <at> net2o.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sat, 05 Oct 2019 21:24:17 +0200

On Saturday, 5 October 2019, 21:14:38 CEST, Eli Zaretskii wrote:
> > From: Bernd Paysan <bernd <at> net2o.de>
> > Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
> > Date: Sat, 05 Oct 2019 20:54:38 +0200
> > 
> > > We cannot use 'exact' here because there's no file per se: we only
> > > have the compiler output.  We must use 'approximate'.
> > 
> > The buffer that matters is not the compiler output, it's the buffer of the
> > source code.
> 
> But the column numbers are counted in the compiler output, and no one
> said that the compiler output must be encoded the same as the source
> file.

The column numbers are written as decimal digits in the compiler output.  They 
are not even calculated, they are just extracted.

Indeed, the compiler output can be in a different encoding, but it doesn't 
matter.  The navigation that needs to change is in the source code file.  This 
is compiler output from compiling an iso-latin encoded file, the compiler 
output itself is utf-8:

test-iso.c:3:23: error: ‘c’ undeclared (first use in this function)
    3 |  printf("test��� %i", c);
      |                       ^

The 23 (minus 1) is the number of bytes from the start of the line to the 
missing variable 'c'.  The three � are there because the compilation buffer 
now contains invalid characters: they are iso-latin characters, which are 
invalid in utf-8.  But this is irrelevant.  All compilation mode does is 
extract test-iso.c (file name), 3 (line number) and 23 (byte index).  
Navigation happens in test-iso.c, which is a file (the C compiler can't 
access Emacs buffers), and autodetection of its encoding is pretty reliable.

There might be some corner cases, where the suggested solution is not perfect, 
but it's much better than what we have now.
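
A small Python sketch of this situation, assuming the compilation buffer decodes the echoed source line as UTF-8 with replacement of invalid bytes; it is an illustration only, not how Emacs decodes process output.

```python
latin1_line = '\tprintf("testäöü %i", c);'.encode("iso-8859-1")

# What a UTF-8 compilation buffer makes of the echoed Latin-1 bytes:
shown = latin1_line.decode("utf-8", errors="replace")
print(shown.count("\ufffd"))

# The byte index from the message still addresses the source file correctly.
column = 23
print(chr(latin1_line[column - 1]))
```

The display of the echoed line is damaged, but the number 23 is plain ASCII digits and still points at `c` in the bytes of test-iso.c.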

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sun, 06 Oct 2019 14:11:01 GMT) Full text and rfc822 format available.

Message #35 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Anton Ertl <anton <at> mips.complang.tuwien.ac.at>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37633 <at> debbugs.gnu.org, bernd <at> net2o.de, anton <at> mips.complang.tuwien.ac.at
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sun, 6 Oct 2019 14:31:12 +0200

On Sat, Oct 05, 2019 at 07:16:53PM +0300, Eli Zaretskii wrote:
> For byte offsets in external text we have bufferpos-to-filepos, but
> that requires us to know the encoding of the external text.  We need
> to find a reasonable way of getting that.  Suggestions and patches
> welcome.

It's the encoding that you assumed for the text when you loaded the
file into the buffer.

The assumption may be wrong, which may cause problems elsewhere, but
should not cause problems for interpreting the byte position, because
the byte position does not depend on the encoding (unlike the
character position).

- anton

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sun, 06 Oct 2019 17:17:02 GMT) Full text and rfc822 format available.

Message #38 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Bernd Paysan <bernd <at> net2o.de>
Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sun, 06 Oct 2019 20:16:43 +0300

> From: Bernd Paysan <bernd <at> net2o.de>
> Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
> Date: Sat, 05 Oct 2019 21:24:17 +0200
> 
> > But the column numbers are counted in the compiler output, and no one
> > said that the compiler output must be encoded the same as the source
> > file.
> 
> The column numbers are written as decimal digits in the compiler output.  They 
> are not even calculated, they are just extracted.
> 
> Indeed, the compiler output can be in a different encoding, but it doesn't 
> matter.  The navigation that needs to change is in the source code file.  This 
> is compiler output from compiling an iso-latin encoded file, the compiler 
> output itself is utf-8:
> 
> test-iso.c:3:23: error: ‘c’ undeclared (first use in this function)
>     3 |  printf("test��� %i", c);
>       |                       ^
> 
> The 23(-1) are the numbers of bytes to get from the start of line to the 
> missing variable 'c'.  The three � are there, because the compilation buffer 
> contains invalid characters now.  They are iso-latin characters, invalid in 
> utf-8.  But this is irrelevant.  All the compilation mode does is extract the 
> test-iso.c (file name), 3 (line number) and 23 (byte index).  Navigation 
> happens in test-iso.c, it's a file (the C compiler can't access emacs 
> buffers), autodetection is pretty reliable.

Sorry, now I'm confused.  Does the compiler count bytes in its output
(where a Latin-1 line could be recoded in UTF-8, and thus have a
different number of bytes), or does it count bytes in the original
file (in this case encoded in Latin-1, i.e. 1 byte per character)?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sun, 06 Oct 2019 17:36:02 GMT) Full text and rfc822 format available.

Message #41 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Bernd Paysan <bernd <at> net2o.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sun, 06 Oct 2019 19:35:33 +0200

On Sunday, 6 October 2019, 19:16:43 CEST, Eli Zaretskii wrote:
> Sorry, now I'm confused.  Does the compiler count bytes in its output
> (where a Latin-1 line could be recoded in UTF-8, and thus have a
> different number of bytes), or does it count bytes in the original
> file (in this case encoded in Latin-1, i.e. 1 byte per character)?

It counts bytes in its input.  The output is just a copy of the input.  The 
compiler (GCC here) neither knows nor cares what encoding the input actually 
uses.  The input is supposed to be ASCII compatible; the compiler does not 
try to be smart.  C symbols are supposed to be ASCII only, and C strings are 
just byte arrays.  Don't overestimate the smartness here.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sun, 06 Oct 2019 17:55:02 GMT) Full text and rfc822 format available.

Message #44 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: anton <at> mips.complang.tuwien.ac.at
Cc: 37633 <at> debbugs.gnu.org, bernd <at> net2o.de
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sun, 06 Oct 2019 20:53:49 +0300

> Date: Sun, 6 Oct 2019 14:31:12 +0200
> From: Anton Ertl <anton <at> mips.complang.tuwien.ac.at>
> Cc: bernd <at> net2o.de, 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
> 
> On Sat, Oct 05, 2019 at 07:16:53PM +0300, Eli Zaretskii wrote:
> > For byte offsets in external text we have bufferpos-to-filepos, but
> > that requires us to know the encoding of the external text.  We need
> > to find a reasonable way of getting that.  Suggestions and patches
> > welcome.
> 
> It's the encoding that you assumed for the text when you loaded the
> file into the buffer.

I'm not sure this is correct.  You are saying that the compiler counts
bytes in the original file, not in its output (which might be encoded
differently).  Do we have conclusive evidence that this is always
true?

> the byte position does not depend on the encoding (unlike the
> character position).

??? The same Latin-1 characters encoded in ISO-8859-1 and in UTF-8
will yield a different number of bytes.  So I don't think I understand
how can you say the above.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sun, 06 Oct 2019 18:55:01 GMT) Full text and rfc822 format available.

Message #47 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Bernd Paysan <bernd <at> net2o.de>
Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sun, 06 Oct 2019 21:54:28 +0300

> From: Bernd Paysan <bernd <at> net2o.de>
> Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
> Date: Sun, 06 Oct 2019 19:35:33 +0200
> 
> It counts bytes in its input.

In that case, using the encoding with which we visited the source is
TRT.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sun, 06 Oct 2019 19:03:02 GMT) Full text and rfc822 format available.

Message #50 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Bernd Paysan <bernd <at> net2o.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: anton <at> mips.complang.tuwien.ac.at, 37633 <at> debbugs.gnu.org
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sun, 06 Oct 2019 21:02:14 +0200

On Sunday, 6 October 2019, 19:53:49 CEST, Eli Zaretskii wrote:
> > Date: Sun, 6 Oct 2019 14:31:12 +0200
> > From: Anton Ertl <anton <at> mips.complang.tuwien.ac.at>
> > Cc: bernd <at> net2o.de, 37633 <at> debbugs.gnu.org,
> > anton <at> mips.complang.tuwien.ac.at
> > 
> > On Sat, Oct 05, 2019 at 07:16:53PM +0300, Eli Zaretskii wrote:
> > > For byte offsets in external text we have bufferpos-to-filepos, but
> > > that requires us to know the encoding of the external text.  We need
> > > to find a reasonable way of getting that.  Suggestions and patches
> > > welcome.
> > 
> > It's the encoding that you assumed for the text when you loaded the
> > file into the buffer.
> 
> I'm not sure this is correct.  You are saying that the compiler counts
> bytes in the original file, not in its output (which might be encoded
> differently).  Do we have conclusive evidence that this is always
> true?

Almost always.  gcc has a gazillion options almost nobody uses.

E.g., you can use -finput-charset=<charset> to transcode input files on 
reading.  It's not a well-tested option, as the output (still iso8859-1) 
shows:

% gcc -finput-charset=iso8859-1 test-iso.c
test-iso.c: In function ‘foo’:
test-iso.c:2:2: warning: implicit declaration of function ‘printf’ [-Wimplicit-function-declaration]
    2 |  printf("test %i", b);
      |  ^~~~~~
test-iso.c:2:2: warning: incompatible implicit declaration of built-in function ‘printf’
test-iso.c:1:1: note: include ‘<stdio.h>’ or provide a declaration of ‘printf’
  +++ |+#include <stdio.h>
    1 | void foo() {
test-iso.c:2:20: error: ‘b’ undeclared (first use in this function)
    2 |  printf("test %i", b);
      |                    ^
test-iso.c:2:20: note: each undeclared identifier is reported only once for each function it appears in
test-iso.c:3:26: error: ‘c’ undeclared (first use in this function)
    3 |  printf("test��� %i", c);
      |                          ^

Here, due to the conversion on read in, the position reported is different (it 
was 3:23 before).
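
The shift from 23 to 26 is consistent with the compiler counting bytes after converting the line to UTF-8. That conversion target is an assumption here, and the helper `byte_column_of` is hypothetical. A Python sketch:

```python
def byte_column_of(line: str, needle: str, encoding: str) -> int:
    """1-based byte column of the last occurrence of NEEDLE when LINE is stored in ENCODING."""
    char_index = line.rindex(needle)
    return len(line[:char_index].encode(encoding)) + 1


line = '\tprintf("testäöü %i", c);'
print(byte_column_of(line, "c", "iso-8859-1"))  # bytes as stored in test-iso.c
print(byte_column_of(line, "c", "utf-8"))       # bytes after conversion to UTF-8
```

Each of the three umlauts takes one byte in ISO-8859-1 and two in UTF-8, which accounts for the difference of three.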

This transparent conversion on reading is rarely used.  Or rather: searching 
the entire GitHub database returns no results for it.

> > the byte position does not depend on the encoding (unlike the
> > character position).
> 
> ??? The same Latin-1 characters encoded in ISO-8859-1 and in UTF-8
> will yield a different number of bytes.  So I don't think I understand
> how can you say the above.

What I'm trying to say: the compiler (unless instructed to convert the file 
on reading) reports the byte position it found in the file.  That's the same 
byte position the editor calculates for that file, regardless of what the 
editor assumed as the encoding.  I.e. if the editor mistook a UTF-8 file for 
an iso8859-1 one, it will see the UTF-8 string "äöü" (6 bytes in UTF-8) as 
"Ã¤Ã¶Ã¼" (6 bytes in iso8859-1).  But it's still 6 bytes.
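
The byte-count claim is easy to check in isolation. A short Python sketch, purely illustrative, that looks only at the bytes on disk:

```python
file_bytes = "äöü".encode("utf-8")          # what is actually stored on disk
misread = file_bytes.decode("iso-8859-1")   # the same bytes read as Latin-1

print(len(file_bytes))
print(misread)
print(misread.encode("iso-8859-1") == file_bytes)
```

Re-encoding the misread text with the same (wrong) coding system reproduces the original bytes, so offsets into the file are unchanged.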

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sun, 06 Oct 2019 19:17:01 GMT) Full text and rfc822 format available.

Message #53 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Bernd Paysan <bernd <at> net2o.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sun, 06 Oct 2019 21:16:42 +0200

On Sunday, 6 October 2019, 20:54:28 CEST, Eli Zaretskii wrote:
> > From: Bernd Paysan <bernd <at> net2o.de>
> > Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
> > Date: Sun, 06 Oct 2019 19:35:33 +0200
> > 
> > It counts bytes in its input.
> 
> In that case, using the encoding with which we visited the source is
> TRT.

Yes.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sun, 06 Oct 2019 19:18:02 GMT) Full text and rfc822 format available.

Message #56 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Bernd Paysan <bernd <at> net2o.de>
Cc: anton <at> mips.complang.tuwien.ac.at, 37633 <at> debbugs.gnu.org
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sun, 06 Oct 2019 22:16:47 +0300

> From: Bernd Paysan <bernd <at> net2o.de>
> Cc: anton <at> mips.complang.tuwien.ac.at, 37633 <at> debbugs.gnu.org
> Date: Sun, 06 Oct 2019 21:02:14 +0200
> 
> if the editor mistook a UTF-8 file for an iso8859-1, it will see an
> UTF-8 string "äöü" (6 bytes UTF-8) as "Ã¤Ã¶Ã¼" (6 bytes iso8859-1).
> But it's still 6 bytes.

Not inside the Emacs buffer, it isn't.
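
One way to read this objection, sketched in Python: once the bytes have been decoded into buffer text, the size of that text in the editor's internal representation need not match the file. UTF-8 is used below as a stand-in for the internal representation, which is a simplifying assumption.

```python
file_bytes = "äöü".encode("utf-8")
buffer_text = file_bytes.decode("iso-8859-1")    # buffer contents after the wrong guess

on_disk = len(buffer_text.encode("iso-8859-1"))  # re-encoded with the visiting coding system
in_buffer = len(buffer_text.encode("utf-8"))     # stand-in for the internal representation
print(on_disk, in_buffer)
```

This is why a conversion that goes through the coding system used to visit the file differs from one that counts bytes of the buffer's internal text.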

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sun, 06 Oct 2019 19:23:02 GMT) Full text and rfc822 format available.

Message #59 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Bernd Paysan <bernd <at> net2o.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: anton <at> mips.complang.tuwien.ac.at, 37633 <at> debbugs.gnu.org
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sun, 06 Oct 2019 21:22:20 +0200

On Sunday, 6 October 2019, 21:16:47 CEST, Eli Zaretskii wrote:
> > From: Bernd Paysan <bernd <at> net2o.de>
> > Cc: anton <at> mips.complang.tuwien.ac.at, 37633 <at> debbugs.gnu.org
> > Date: Sun, 06 Oct 2019 21:02:14 +0200
> > 
> > if the editor mistook a UTF-8 file for an iso8859-1, it will see an
> > UTF-8 string "äöü" (6 bytes UTF-8) as "Ã¤Ã¶Ã¼" (6 bytes iso8859-1).
> > But it's still 6 bytes.
> 
> Not inside the Emacs buffer, it isn't.

I created a Unicode (UTF-8) file:

void main() {
        char *b="ha", *c="ho";
        printf("test %i", b);
        printf("testäöü %i", c);
}

I loaded this into emacs, and reverted the buffer using iso8859-1 coding 
(simulating a wrongly detected encoding).

It then looks like this:

void main() {
	char *b="ha", *c="ho";
	printf("test %i", b);
	printf("testÃ¤Ã¶Ã¼ %i", c);
}

I compiled it with gcc -Wall test-utf.c into a compile-mode buffer.

-*- mode: compilation; default-directory: "~/tmp/" -*-
Compilation started at Sun Oct  6 21:18:24

gcc -Wall test-utf.c 
test-utf.c:1:6: warning: return type of ‘main’ is not ‘int’ [-Wmain]
    1 | void main() {
      |      ^~~~
test-utf.c: In function ‘main’:
test-utf.c:3:2: warning: implicit declaration of function ‘printf’ [-Wimplicit-function-declaration]
    3 |  printf("test %i", b);
      |  ^~~~~~
test-utf.c:3:2: warning: incompatible implicit declaration of built-in function ‘printf’
test-utf.c:1:1: note: include ‘<stdio.h>’ or provide a declaration of ‘printf’
  +++ |+#include <stdio.h>
    1 | void main() {
test-utf.c:3:16: warning: format ‘%i’ expects argument of type ‘int’, but argument 2 has type ‘char *’ [-Wformat=]
    3 |  printf("test %i", b);
      |               ~^   ~
      |                |   |
      |                int char *
      |               %s
test-utf.c:4:22: warning: format ‘%i’ expects argument of type ‘int’, but argument 2 has type ‘char *’ [-Wformat=]
    4 |  printf("testäöü %i", c);
      |                     ~^   ~
      |                      |   |
      |                      int char *
      |                     %s

Compilation finished at Sun Oct  6 21:18:24

If I click on the test-utf.c:4:22 label, I land exactly where I want to be: on
the i of %i.

If I revert this buffer with the correct encoding utf-8-unix, then it still 
navigates to the i of %i, so it's all agnostic to whether the encoding 
detected was correct or wrong.
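
The byte and character counts discussed here can be checked outside Emacs. The following Python sketch is illustrative only. It assumes the file stores "äöü" as UTF-8, and it uses a UTF-8 re-encoding of the mis-decoded text as a stand-in for a buffer that keeps its text in a UTF-8-like internal form (an assumption about the representation, not a measurement of Emacs itself).

```python
# Bytes on disk for "äöü" when the file is saved as UTF-8.
raw = "äöü".encode("utf-8")

# The same bytes read with the right and with the wrong coding system.
as_utf8 = raw.decode("utf-8")
as_latin1 = raw.decode("iso8859-1")

# Byte count on disk, then character counts under each decoding.
print(len(raw), len(as_utf8), len(as_latin1))
print(as_latin1)

# Assumption: model the buffer's internal storage as UTF-8.
internal = as_latin1.encode("utf-8")
print(len(internal))

# Encoding back with the coding system used for reading restores the file bytes.
restored = as_latin1.encode("iso8859-1")
print(restored == raw)
```

Under that model the mis-decoded text has 6 characters but occupies 12 bytes in memory, while encoding it back with iso8859-1 reproduces the original 6 bytes.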

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o id: kQusJzA;7*?t=uy <at> X}1GWr!+0qqp_Cn176t4(dQ*
https://net2o.de/

[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sun, 06 Oct 2019 19:35:01 GMT) Full text and rfc822 format available.

Message #62 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Bernd Paysan <bernd <at> net2o.de>
Cc: anton <at> mips.complang.tuwien.ac.at, 37633 <at> debbugs.gnu.org
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sun, 06 Oct 2019 22:34:15 +0300

> From: Bernd Paysan <bernd <at> net2o.de>
> Cc: anton <at> mips.complang.tuwien.ac.at, 37633 <at> debbugs.gnu.org
> Date: Sun, 06 Oct 2019 21:22:20 +0200
> 
> > > if the editor mistook a UTF-8 file for an iso8859-1, it will see an
> > > UTF-8 string "äöü" (6 bytes UTF-8) as "Ã¤Ã¶Ã¼" (6 bytes iso8859-1).
> > > But it's still 6 bytes.
> > 
> > Not inside the Emacs buffer, it isn't.
> 
> I created a unicode file:
> [...]
> If I revert this buffer with the correct encoding utf-8-unix, then it still 
> navigates to the i of %i, so it's all agnostic to whether the encoding 
> detected was correct or wrong.

Not sure I understand: are you saying that your experiment proves that
my assertion about the number of bytes was incorrect?  Because it
doesn't.

And anyway, I see no reason to argue about this side issue, since we
seem to be in agreement that using the file's encoding is TRT.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sun, 06 Oct 2019 19:37:02 GMT) Full text and rfc822 format available.

Message #65 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Bernd Paysan <bernd <at> net2o.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: anton <at> mips.complang.tuwien.ac.at, 37633 <at> debbugs.gnu.org
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sun, 06 Oct 2019 21:35:57 +0200

[Message part 1 (text/plain, inline)]

Am Sonntag, 6. Oktober 2019, 21:34:15 CEST schrieb Eli Zaretskii:
> Not sure I understand: are you saying that your experiment proves that
> my assertion about the number of bytes was incorrect?  Because it
> doesn't.

No, the experiment supports your assertion.

> And anyway, I see no reason to argue about this side issue, since we
> seem to be in agreement that using the file's encoding is TRT.

Indeed. Using the file's encoding is TRT.

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o id: kQusJzA;7*?t=uy <at> X}1GWr!+0qqp_Cn176t4(dQ*
https://net2o.de/

[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Mon, 07 Oct 2019 07:10:02 GMT) Full text and rfc822 format available.

Message #68 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Anton Ertl <anton <at> mips.complang.tuwien.ac.at>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: anton <at> mips.complang.tuwien.ac.at, bernd <at> net2o.de, 37633 <at> debbugs.gnu.org
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Mon, 7 Oct 2019 09:09:08 +0200

On Sun, Oct 06, 2019 at 08:53:49PM +0300, Eli Zaretskii wrote:
> > the byte position does not depend on the encoding (unlike the
> > character position).
> 
> ??? The same Latin-1 characters encoded in ISO-8859-1 and in UTF-8
> will yield a different number of bytes.  So I don't think I understand
> how can you say the above.

The same bytes have the same number of bytes, whether you interpret
them as having one encoding or some other encoding.  How many
characters these bytes have depends on the encoding.

Of course, if you have transcoded the bytes into some other encoding,
you have to transcode them back for counting.  So for Emacs this means
converting back to the input encoding, and then counting (i.e., what
you describe as TRT (which I guess means The Right Thing)).
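
The "transcode back, then count" step can be sketched in Python. This is a minimal illustration and not Emacs code: `byte_column_to_char_index` is a hypothetical helper, the line text and the column 22 are taken from the gcc output earlier in this thread, and the encodings passed in are assumed to be the ones the text was decoded with.

```python
def byte_column_to_char_index(line_text, byte_column, file_encoding):
    """Map a 1-based byte column to a 0-based character index in line_text."""
    # Convert the decoded line back to the bytes the compiler saw.
    raw = line_text.encode(file_encoding)
    # Bytes that precede the reported column.
    prefix = raw[: byte_column - 1]
    # errors="ignore" drops a trailing partial character if the column
    # points into the middle of a multibyte sequence.
    return len(prefix.decode(file_encoding, errors="ignore"))


# Line 4 of the test file, decoded correctly as UTF-8.
line = '\tprintf("testäöü %i", c);'
i = byte_column_to_char_index(line, 22, "utf-8")
print(i, line[i])

# The same bytes mis-decoded as iso8859-1, as in the experiment above.
misread = line.encode("utf-8").decode("iso8859-1")
j = byte_column_to_char_index(misread, 22, "iso8859-1")
print(j, misread[j])
```

In both cases the index lands on the i of %i, as long as the encoding used for counting is the one the text was decoded with.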

- anton

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37633; Package emacs. (Sat, 23 Apr 2022 13:37:02 GMT) Full text and rfc822 format available.

Message #71 received at 37633 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Bernd Paysan <bernd <at> net2o.de>
Cc: 37633 <at> debbugs.gnu.org, anton <at> mips.complang.tuwien.ac.at
Subject: Re: bug#37633: Column part interpreted wrong in compilation mode
Date: Sat, 23 Apr 2022 15:36:25 +0200

Bernd Paysan <bernd <at> net2o.de> writes:

> Problems arise when tabs and UTF-8 glyphs are involved, e.g. compile
>
> ---------------test.c---------------
> void foo() {
> 	printf("test %i", b);
> 	printf("test你好 %i", c);
> }
> ---------------gcc test.c---------------
> -*- mode: compilation; default-directory: "~/tmp/" -*-
> Compilation started at Sat Oct  5 12:13:23

[...]

> test.c:3:26: error: ‘c’ undeclared (first use in this function)
>     3 |  printf("test你好 %i", c);
>       |                          ^

(I'm going through old bug reports that unfortunately weren't resolved
at the time.)

Amusingly enough, gcc 11.2.0 said this to me

comp.c:4:31: error: 'c' undeclared (first use in this function)
    4 |         printf("test你好 %i", c);
      |                               ^

It's counting the leading TAB character as eight columns...  and then
counting the bytes of Chinese characters individually, ending up with a
column of 31.

So just using `filepos-to-bufferpos' wouldn't fix the current gcc.  We
could implement gcc's logic fully, but that's changing over time, and
other compilers surely have their own logic.  (I wouldn't be surprised
if other compilers count characters instead of bytes in their
column outputs.)  And -finput-charset doesn't help with the column
calculation in gcc.
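
The two column numbers quoted in this report (26 from the original gcc output and 31 from gcc 11.2.0) can be reproduced by hand under different counting conventions. The Python sketch below is not gcc's algorithm: `column` and its `unit` parameter are hypothetical names, and the three conventions (bytes with a TAB as one byte, characters, or display cells with 8-column tab stops and double-width CJK characters) are assumptions chosen for illustration.

```python
import unicodedata


def display_width(ch):
    # East Asian wide and fullwidth characters occupy two terminal cells.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def column(line, index, unit, tab_width=8, encoding="utf-8"):
    """Return the 1-based column of line[index] under one counting convention."""
    prefix = line[:index]
    if unit == "char":
        return len(prefix) + 1
    if unit == "byte":
        return len(prefix.encode(encoding)) + 1
    if unit == "display":
        col = 0
        for ch in prefix:
            if ch == "\t":
                # Advance to the next tab stop.
                col += tab_width - col % tab_width
            else:
                col += display_width(ch)
        return col + 1
    raise ValueError(f"unknown unit: {unit}")


line = '\tprintf("test你好 %i", c);'
idx = line.index("c);")  # position of the undeclared identifier

for unit in ("char", "byte", "display"):
    print(unit, column(line, idx, unit))
```

With these assumptions the byte convention yields 26 and the display-cell convention yields 31, so the 31 reported above is consistent with display-width counting in this model. Counting the TAB as eight columns and each UTF-8 byte as one column would instead give 33 for this line.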

Since the issue is as messy as it is, I don't think there's anything
meaningful we can do here on the Emacs side, so I'm therefore closing
this bug report.  (If somebody has ideas that would work in general
here, please respond and we'll reopen.)

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Added tag(s) wontfix. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sat, 23 Apr 2022 13:37:02 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 37633 <at> debbugs.gnu.org and Bernd Paysan <bernd <at> net2o.de> Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sat, 23 Apr 2022 13:37:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 22 May 2022 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 3 years and 170 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.