GNU bug report logs - #47885
[PATCH] org-table-import: Make it more smarter for interactive use

Previous Next

Package: org-mode;

Reported by: Utkarsh Singh <utkarsh190601 <at> gmail.com>

Date: Mon, 19 Apr 2021 04:44:02 UTC

Severity: normal

Tags: patch

To reply to this bug, email your comments to 47885 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to emacs-orgmode <at> gnu.org:
bug#47885; Package org-mode. (Mon, 19 Apr 2021 04:44:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Utkarsh Singh <utkarsh190601 <at> gmail.com>:
New bug report received and forwarded. Copy sent to emacs-orgmode <at> gnu.org. (Mon, 19 Apr 2021 04:44:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Utkarsh Singh <utkarsh190601 <at> gmail.com>
To: emacs-orgmode <at> gnu.org, bug-gnu-emacs <at> gnu.org
Subject: [PATCH] org-table-import: Make it more smarter for interactive use
Date: Mon, 19 Apr 2021 10:13:31 +0530
Hi,

My previous patch proposed to add support for importing file with
arbitrary name and building upon that this patch tries to make use of it
by making org-table-import smarter by simply adding more separators
(delimiters).

Currently org-table-import 'smartly' guesses only COMMA, TAB and SPACE
as separator whereas this patch tries to add support for ';'(SEMICOLON)
and ':' (COLON).

Here is an example org-table generated using =M-x org-table-import=
/etc/passwd (uses COLON as separator) with private information removed.

| bin                    | x |     1 |     1 |                                | /                     | /usr/bin/nologin   |
| daemon                 | x |     2 |     2 |                                | /                     | /usr/bin/nologin   |
| mail                   | x |     8 |    12 |                                | /var/spool/mail       | /usr/bin/nologin   |
| ftp                    | x |    14 |    11 |                                | /srv/ftp              | /usr/bin/nologin   |
| http                   | x |    33 |    33 |                                | /srv/http             | /usr/bin/nologin   |
| nobody                 | x | 65534 | 65534 | Nobody                         | /                     | /usr/bin/nologin   |
| dbus                   | x |    81 |    81 | System Message Bus             | /                     | /usr/bin/nologin   |
| systemd-journal-remote | x |   981 |   981 | systemd Journal Remote         | /                     | /usr/bin/nologin   |
| systemd-network        | x |   980 |   980 | systemd Network Management     | /                     | /usr/bin/nologin   |
| systemd-oom            | x |   979 |   979 | systemd Userspace OOM Killer   | /                     | /usr/bin/nologin   |
| systemd-resolve        | x |   978 |   978 | systemd Resolver               | /                     | /usr/bin/nologin   |
| systemd-timesync       | x |   977 |   977 | systemd Time Synchronization   | /                     | /usr/bin/nologin   |
| systemd-coredump       | x |   976 |   976 | systemd Core Dumper            | /                     | /usr/bin/nologin   |
| avahi                  | x |   974 |   974 | Avahi mDNS/DNS-SD daemon       | /                     | /usr/bin/nologin   |
| colord                 | x |   973 |   973 | Color management daemon        | /var/lib/colord       | /usr/bin/nologin   |
| rtkit                  | x |   133 |   133 | RealtimeKit                    | /proc                 | /usr/bin/nologin   |
| transmission           | x |   169 |   169 | Transmission BitTorrent Daemon | /var/lib/transmission | /usr/bin/nologin   |
| geoclue                | x |   972 |   972 | Geoinformation service         | /var/lib/geoclue      | /usr/bin/nologin   |
| usbmux                 | x |   140 |   140 | usbmux user                    | /                     | /usr/bin/nologin   |


diff --git a/lisp/org/org-table.el b/lisp/org/org-table.el
index ab66859d6a..5ee4af612b 100644
--- a/lisp/org/org-table.el
+++ b/lisp/org/org-table.el
@@ -846,6 +846,35 @@ org-table-create
       (goto-char pos))
     (org-table-align)))
 
+
+(defun org-table-guess-separator (beg0 end0)
+  "Guess separator for `org-table-convert-region' for region BEG0 to END0.
+
+List of preferred separator:
+comma, TAB, ';', ':' or SPACE
+
+If region contains a line which doesn't contain the required
+separator then discard the separator and search again using next
+separator."
+  (let ((beg (save-excursion
+	       (goto-char (min beg0 end0))
+	       (beginning-of-line 1)
+	       (point)))
+	(end (save-excursion
+	       (goto-char (max beg0 end0))
+	       (end-of-line 1)
+	       (if (bolp) (backward-char 1) (end-of-line 1))
+	       (point))))
+    (save-excursion
+      (goto-char beg)
+      (cond
+       ((not (re-search-forward "^[^\n,]+$" end t)) '(4))
+       ((not (re-search-forward "^[^\n\t]+$" end t)) '(16))
+       ((not (re-search-forward "^[^\n;]+$" end t)) ";")
+       ((not (re-search-forward "^[^\n:]+$" end t)) ":")
+       ((not (re-search-forward "^\\([^'\"][^\n\s][^'\"]\\)+$" end t)) " ")
+       (t nil)))))
+
 ;;;###autoload
 (defun org-table-convert-region (beg0 end0 &optional separator)
   "Convert region to a table.
@@ -862,10 +891,7 @@ org-table-convert-region
 integer  When a number, use that many spaces, or a TAB, as field separator
 regexp   When a regular expression, use it to match the separator
 nil      When nil, the command tries to be smart and figure out the
-         separator in the following way:
-         - when each line contains a TAB, assume TAB-separated material
-         - when each line contains a comma, assume CSV material
-         - else, assume one or more SPACE characters as separator."
+         separator using `org-table-guess-seperator'."
   (interactive "r\nP")
   (let* ((beg (min beg0 end0))
 	 (end (max beg0 end0))
@@ -881,14 +907,9 @@ org-table-convert-region
       (goto-char end)
       (if (bolp) (backward-char 1) (end-of-line 1))
       (setq end (point-marker))
-      ;; Get the right field separator
-      (unless separator
-	(goto-char beg)
-	(setq separator
-	      (cond
-	       ((not (re-search-forward "^[^\n\t]+$" end t)) '(16))
-	       ((not (re-search-forward "^[^\n,]+$" end t)) '(4))
-	       (t 1))))
+      (if (and (not separator)
+               (not (setq separator (org-table-guess-separator beg end))))
+          (error "Unable to guess suitable separator."))
       (goto-char beg)
       (if (equal separator '(4))
 	  (while (< (point) end)
@@ -921,12 +942,8 @@ org-table-convert-region
 (defun org-table-import (file separator)
   "Import FILE as a table.
 
-The command tries to be smart and figure out the separator in the
-following way:
-
-- when each line contains a TAB, assume TAB-separated material;
-- when each line contains a comma, assume CSV material;
-- else, assume one or more SPACE characters as separator.
+The command tries to be smart and figure out the separator using
+`org-table-guess-seperator'.
 
 When non-nil, SEPARATOR specifies the field separator in the
 lines.  It can have the following values:

-- 
Utkarsh Singh
http://utkarshsingh.xyz




Information forwarded to emacs-orgmode <at> gnu.org:
bug#47885; Package org-mode. (Mon, 19 Apr 2021 08:20:02 GMT) Full text and rfc822 format available.

Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Nicolas Goaziou <mail <at> nicolasgoaziou.fr>
To: Utkarsh Singh <utkarsh190601 <at> gmail.com>
Cc: bug-gnu-emacs <at> gnu.org, emacs-orgmode <at> gnu.org
Subject: Re: [PATCH] org-table-import: Make it more smarter for interactive use
Date: Mon, 19 Apr 2021 10:19:03 +0200
Hello,

Utkarsh Singh <utkarsh190601 <at> gmail.com> writes:

> My previous patch proposed to add support for importing file with
> arbitrary name and building upon that this patch tries to make use of it
> by making org-table-import smarter by simply adding more separators
> (delimiters).

Good idea, thank you. Some comments follow.

> +(defun org-table-guess-separator (beg0 end0)
> +  "Guess separator for `org-table-convert-region' for region BEG0 to END0.
> +
> +List of preferred separator:
> +comma, TAB, ';', ':' or SPACE

I suggest to use full names everywhere: comma, TAB, semicolon, colon, or
SPACE.

> +If region contains a line which doesn't contain the required
> +separator then discard the separator and search again using next
> +separator."
> +  (let ((beg (save-excursion
> +	       (goto-char (min beg0 end0))
> +	       (beginning-of-line 1)
> +	       (point)))

  (beginning-of-line 1) + (point) -> (line-beginning-position)

since you don't intent to move point.

> +	(end (save-excursion
> +	       (goto-char (max beg0 end0))
> +	       (end-of-line 1)
> +	       (if (bolp) (backward-char 1) (end-of-line 1))

I'm not sure about what you mean above. First, the second call to
end-of-line is useless, since you're already at the end of the line.
Second, what is wrong if point is at an empty line? Why do you want to
move it back?

> +	       (point))))

You may want to use `line-end-position'.

> +    (save-excursion
> +      (goto-char beg)
> +      (cond
> +       ((not (re-search-forward "^[^\n,]+$" end t)) '(4))
> +       ((not (re-search-forward "^[^\n\t]+$" end t)) '(16))
> +       ((not (re-search-forward "^[^\n;]+$" end t)) ";")
> +       ((not (re-search-forward "^[^\n:]+$" end t)) ":")
> +       ((not (re-search-forward "^\\([^'\"][^\n\s][^'\"]\\)+$" end t)) " ")
> +       (t nil)))))

I think you need to wrap `save-excursion' around each
`re-search-forward' call. Otherwise each test starts at the first line
containing the separator previously tested.
> +
>  ;;;###autoload
>  (defun org-table-convert-region (beg0 end0 &optional separator)
>    "Convert region to a table.
> @@ -862,10 +891,7 @@ org-table-convert-region
>  integer  When a number, use that many spaces, or a TAB, as field separator
>  regexp   When a regular expression, use it to match the separator
>  nil      When nil, the command tries to be smart and figure out the
> -         separator in the following way:
> -         - when each line contains a TAB, assume TAB-separated material
> -         - when each line contains a comma, assume CSV material
> -         - else, assume one or more SPACE characters as separator."
> +         separator using `org-table-guess-seperator'."

I wonder if creating a new function is warranted here. You could add the
news checks after those already present in the code, couldn't you?


Regards,
-- 
Nicolas Goaziou




Information forwarded to emacs-orgmode <at> gnu.org:
bug#47885; Package org-mode. (Mon, 19 Apr 2021 14:24:01 GMT) Full text and rfc822 format available.

Message #11 received at 47885 <at> debbugs.gnu.org (full text, mbox):

From: Utkarsh Singh <utkarsh190601 <at> gmail.com>
To: Nicolas Goaziou <mail <at> nicolasgoaziou.fr>
Cc: 47885 <at> debbugs.gnu.org, emacs-orgmode <at> gnu.org
Subject: Re: [PATCH] org-table-import: Make it more smarter for interactive use
Date: Mon, 19 Apr 2021 19:53:25 +0530
On 2021-04-19, 10:19 +0200, Nicolas Goaziou <mail <at> nicolasgoaziou.fr> wrote:

>> My previous patch proposed to add support for importing file with
>> arbitrary name and building upon that this patch tries to make use of it
>> by making org-table-import smarter by simply adding more separators
>> (delimiters).
>
> Good idea, thank you. Some comments follow.
>
>> +(defun org-table-guess-separator (beg0 end0)
>> +  "Guess separator for `org-table-convert-region' for region BEG0 to END0.
>> +
>> +List of preferred separator:
>> +comma, TAB, ';', ':' or SPACE
>
> I suggest to use full names everywhere: comma, TAB, semicolon, colon, or
> SPACE.
>
>> +If region contains a line which doesn't contain the required
>> +separator then discard the separator and search again using next
>> +separator."
>> +  (let ((beg (save-excursion
>> +	       (goto-char (min beg0 end0))
>> +	       (beginning-of-line 1)
>> +	       (point)))
>
>   (beginning-of-line 1) + (point) -> (line-beginning-position)
>
> since you don't intent to move point.
>
>> +	(end (save-excursion
>> +	       (goto-char (max beg0 end0))
>> +	       (end-of-line 1)
>> +	       (if (bolp) (backward-char 1) (end-of-line 1))
>
> I'm not sure about what you mean above. First, the second call to
> end-of-line is useless, since you're already at the end of the line.
> Second, what is wrong if point is at an empty line? Why do you want to
> move it back?
>
>> +	       (point))))
>
> You may want to use `line-end-position'.
>
>> +    (save-excursion
>> +      (goto-char beg)
>> +      (cond
>> +       ((not (re-search-forward "^[^\n,]+$" end t)) '(4))
>> +       ((not (re-search-forward "^[^\n\t]+$" end t)) '(16))
>> +       ((not (re-search-forward "^[^\n;]+$" end t)) ";")
>> +       ((not (re-search-forward "^[^\n:]+$" end t)) ":")
>> +       ((not (re-search-forward "^\\([^'\"][^\n\s][^'\"]\\)+$" end t)) " ")
>> +       (t nil)))))
>
> I think you need to wrap `save-excursion' around each
> `re-search-forward' call. Otherwise each test starts at the first line
> containing the separator previously tested.
>> +
>>  ;;;###autoload
>>  (defun org-table-convert-region (beg0 end0 &optional separator)
>>    "Convert region to a table.
>> @@ -862,10 +891,7 @@ org-table-convert-region
>>  integer  When a number, use that many spaces, or a TAB, as field separator
>>  regexp   When a regular expression, use it to match the separator
>>  nil      When nil, the command tries to be smart and figure out the
>> -         separator in the following way:
>> -         - when each line contains a TAB, assume TAB-separated material
>> -         - when each line contains a comma, assume CSV material
>> -         - else, assume one or more SPACE characters as separator."
>> +         separator using `org-table-guess-seperator'."

Thanks for reviewing the patch!

> I wonder if creating a new function is warranted here. You could add the
> news checks after those already present in the code, couldn't you?

At first I was also reluctant in creating a new function but decided to
do so because:

+ org-table-convert-region is currently doing two thing 'guessing the
separator' and 'converting the region'.  I thought it was a good idea to
separate out function into it's atomic operations.

+ Current guessing technique is quite basic as it assumes that data
(file that has to be imported) has no error/inconsistency in it.  I
would like to show you the doc string of Python's CSV library
implementation to guess separator (region inside """):

"""
Looks for text enclosed between two identical quotes
(the probable quotechar) which are preceded and followed
by the same character (the probable delimiter).
For example:
    ,'some text',
The quote with the most wins, same with the delimiter.
If there is no quotechar the delimiter can't be determined
this way.
"""

And if this functions fails then we have:

"""
The delimiter /should/ occur the same number of times on
each row. However, due to malformed data, it may not. We don't want
an all or nothing approach, so we allow for small variations in this
number.
1) build a table of the frequency of each character on every line.
2) build a table of frequencies of this frequency (meta-frequency?),
e.g.  'x occurred 5 times in 10 rows, 6 times in 1000 rows,
7 times in 2 rows'
3) use the mode of the meta-frequency to determine the /expected/
frequency for that character
4) find out how often the character actually meets that goal
5) the character that best meets its goal is the delimiter
For performance reasons, the data is evaluated in chunks, so it can
try and evaluate the smallest portion of the data possible, evaluating
additional chunks as necessary.
"""

I tried to do similar in Elisp but currently facing some issues due to
my inexperience in functional programming.  Also moving the 'guessing'
part out the function may lead to development of even better algorithm
than Python counterpart.

Modified version of concerned function:

(defun org-table-guess-separator (beg0 end0)
  "Guess separator for `org-table-convert-region' for region BEG0 to END0.

List of preferred separator:
comma, TAB, semicolon, colon or SPACE.

If region contains a line which doesn't contain the required
separator then discard the separator and search again using next
separator."
  (let* ((beg (save-excursion
		(goto-char (min beg0 end0))
		(line-beginning-position)))
	 (end (save-excursion
		(goto-char (max beg0 end0))
		(line-end-position)))
	 (sep-rexp '((","  "^[^\n,]+$")
		     ("\t" "^[^\n\t]+$")
		     (";"  "^[^\n;]+$")
		     (":"  "^[^\n:]+$")
		     (" "  "^\\([^'\"][^\n\s][^'\"]\\)+$")))
	 (tmp (car sep-rexp))
	 sep)
    (save-excursion
      (goto-char beg)
      (while (and (not sep)
		  (if (save-excursion
			(not (re-search-forward (nth 1 tmp) end t)))
		      (setq sep (nth 0 tmp))
		    (setq sep-rexp (cdr sep-rexp))
		    (setq tmp (car sep-rexp)))))
    sep)))

Version without using iteration:

(defun org-table-guess-separator (beg0 end0)
  "Guess separator for `org-table-convert-region' for region BEG0 to END0.

List of preferred separator:
COMMA, TAB, SEMICOLON, COLON or SPACE.

If region contains a line which doesn't contain the required
separator then discard the separator and search again using next
separator."
  (let ((beg (save-excursion
	       (goto-char (min beg0 end0))
	       (line-beginning-position)))
	(end (save-excursion
	       (goto-char (max beg0 end0))
	       (line-end-position))))
    (save-excursion
      (goto-char beg)
      (cond
       ((save-excursion (not (re-search-forward "^[^\n,]+$" end t))) ",")
       ((save-excursion (not (re-search-forward "^[^\n\t]+$" end t))) "\t")
       ((save-excursion (not (re-search-forward "^[^\n;]+$" end t))) ";")
       ((save-excursion (not (re-search-forward "^[^\n:]+$" end t))) ":")
       ((save-excursion (not (re-search-forward "^\\([^'\"][^\n\s][^'\"]\\)+$" end t))) " ")
       (t nil)))))

--
Utkarsh Singh
http://utkarshsingh.xyz




Information forwarded to emacs-orgmode <at> gnu.org:
bug#47885; Package org-mode. (Tue, 20 Apr 2021 13:41:02 GMT) Full text and rfc822 format available.

Message #14 received at 47885 <at> debbugs.gnu.org (full text, mbox):

From: Nicolas Goaziou <mail <at> nicolasgoaziou.fr>
To: Utkarsh Singh <utkarsh190601 <at> gmail.com>
Cc: 47885 <at> debbugs.gnu.org, emacs-orgmode <at> gnu.org
Subject: Re: [PATCH] org-table-import: Make it more smarter for interactive use
Date: Tue, 20 Apr 2021 15:40:12 +0200
Hello,

Utkarsh Singh <utkarsh190601 <at> gmail.com> writes:

> At first I was also reluctant in creating a new function but decided to
> do so because:
>
> + org-table-convert-region is currently doing two thing 'guessing the
> separator' and 'converting the region'.  I thought it was a good idea to
> separate out function into it's atomic operations.

I understand, but there is sometimes a (difficult) line to draw between
"separating concerns" and "function proliferation". Anyway, that's fine
here.

> + Current guessing technique is quite basic as it assumes that data
> (file that has to be imported) has no error/inconsistency in it.  I
> would like to show you the doc string of Python's CSV library
> implementation to guess separator (region inside """):
>
> """
> Looks for text enclosed between two identical quotes
> (the probable quotechar) which are preceded and followed
> by the same character (the probable delimiter).
> For example:
>     ,'some text',
> The quote with the most wins, same with the delimiter.
> If there is no quotechar the delimiter can't be determined
> this way.
> """
>
> And if this functions fails then we have:
>
> """
> The delimiter /should/ occur the same number of times on
> each row. However, due to malformed data, it may not. We don't want
> an all or nothing approach, so we allow for small variations in this
> number.
> 1) build a table of the frequency of each character on every line.
> 2) build a table of frequencies of this frequency (meta-frequency?),
> e.g.  'x occurred 5 times in 10 rows, 6 times in 1000 rows,
> 7 times in 2 rows'
> 3) use the mode of the meta-frequency to determine the /expected/
> frequency for that character
> 4) find out how often the character actually meets that goal
> 5) the character that best meets its goal is the delimiter
> For performance reasons, the data is evaluated in chunks, so it can
> try and evaluate the smallest portion of the data possible, evaluating
> additional chunks as necessary.
> """

For the problem we're trying to solve, this sounds like over-engineering
to me. Do we want so badly to guess a separator?

> I tried to do similar in Elisp but currently facing some issues due to
> my inexperience in functional programming.  Also moving the 'guessing'
> part out the function may lead to development of even better algorithm
> than Python counterpart.
>
> Modified version of concerned function:
>
> (defun org-table-guess-separator (beg0 end0)
>   "Guess separator for `org-table-convert-region' for region BEG0 to END0.
>
> List of preferred separator:
> comma, TAB, semicolon, colon or SPACE.
>
> If region contains a line which doesn't contain the required
> separator then discard the separator and search again using next
> separator."
>   (let* ((beg (save-excursion
> 		(goto-char (min beg0 end0))
> 		(line-beginning-position)))
> 	 (end (save-excursion
> 		(goto-char (max beg0 end0))
> 		(line-end-position)))

Thinking again about it, this needs extra care, as end0 might end up on
an empty line. You tried to avoid this in your first function, but
I think this was not sufficient either. Actually, beg0 could also start
on an empty line.

This needs to be tested extensively, but as a first approximation,
I think `beg' needs to be defined as:

  (save-excursion
    (goto-char (min beg0 end0))
    (skip-chars-forward " \t\n")
    (if (eobp) (point) (line-beginning-position)))

and `end' as

  (save-excursion
    (goto-char (max beg end0))
    (skip-chars-backward " \t\n" beg)
    (if (= beg (point)) (point) (line-end-position)))

Then you need to bail out if beg = end.

> 	 (sep-rexp '((","  "^[^\n,]+$")

sep-rexp -> sep-regexp

> 		     ("\t" "^[^\n\t]+$")
> 		     (";"  "^[^\n;]+$")
> 		     (":"  "^[^\n:]+$")
> 		     (" "  "^\\([^'\"][^\n\s][^'\"]\\)+$")))

At this point, I suggest to use `rx' macro instead.

> 	 (tmp (car sep-rexp))
> 	 sep)
>     (save-excursion
>       (goto-char beg)
>       (while (and (not sep)
> 		  (if (save-excursion
> 			(not (re-search-forward (nth 1 tmp) end t)))
> 		      (setq sep (nth 0 tmp))
> 		    (setq sep-rexp (cdr sep-rexp))
> 		    (setq tmp (car sep-rexp)))))

I suggest this (yes, I like pattern-matching, `car' and `cdr' are so
80's) instead:

  (save-excursion
    (goto-char beg)
    (catch :found
      (pcase-dolist (`(,sep ,regexp) sep-regexp)
        (save-excursion
          (unless (re-search-forward regexp end t)
          (throw :found sep))))
    nil))

Again all this needs to extensively tested, as there are a lot of
dangers lurking around.

Regards,
-- 
Nicolas Goaziou




Information forwarded to emacs-orgmode <at> gnu.org:
bug#47885; Package org-mode. (Tue, 20 Apr 2021 17:16:01 GMT) Full text and rfc822 format available.

Message #17 received at 47885 <at> debbugs.gnu.org (full text, mbox):

From: Utkarsh Singh <utkarsh190601 <at> gmail.com>
To: Nicolas Goaziou <mail <at> nicolasgoaziou.fr>
Cc: 47885 <at> debbugs.gnu.org, emacs-orgmode <at> gnu.org
Subject: Re: [PATCH] org-table-import: Make it more smarter for interactive use
Date: Tue, 20 Apr 2021 22:45:22 +0530
Hi,

On 2021-04-20, 15:40 +0200, Nicolas Goaziou <mail <at> nicolasgoaziou.fr> wrote:

> For the problem we're trying to solve, this sounds like over-engineering
> to me. Do we want so badly to guess a separator?

Earlier I took is as an assignment to learn Elisp but now I don't think
we should increase complexity this much.

> Thinking again about it, this needs extra care, as end0 might end up on
> an empty line. You tried to avoid this in your first function, but
> I think this was not sufficient either. Actually, beg0 could also start
> on an empty line.
>
> This needs to be tested extensively, but as a first approximation,
> I think `beg' needs to be defined as:
>
>   (save-excursion
>     (goto-char (min beg0 end0))
>     (skip-chars-forward " \t\n")
>     (if (eobp) (point) (line-beginning-position)))
>
> and `end' as
>
>   (save-excursion
>     (goto-char (max beg end0))
>     (skip-chars-backward " \t\n" beg)
>     (if (= beg (point)) (point) (line-end-position)))
>
> Then you need to bail out if beg = end.
>
>> 	 (sep-rexp '((","  "^[^\n,]+$")
>
> sep-rexp -> sep-regexp
>
>> 		     ("\t" "^[^\n\t]+$")
>> 		     (";"  "^[^\n;]+$")
>> 		     (":"  "^[^\n:]+$")
>> 		     (" "  "^\\([^'\"][^\n\s][^'\"]\\)+$")))
>
> At this point, I suggest to use `rx' macro instead.
>
> I suggest this (yes, I like pattern-matching, `car' and `cdr' are so
> 80's) instead:
>
>   (save-excursion
>     (goto-char beg)
>     (catch :found
>       (pcase-dolist (`(,sep ,regexp) sep-regexp)
>         (save-excursion
>           (unless (re-search-forward regexp end t)
>           (throw :found sep))))
>     nil))
>

Thanks! I was not aware of pcase-dolist function.

Function after doing the necessary changes:

(defun org-table-guess-separator (beg0 end0)
  "Guess separator for `org-table-convert-region' for region BEG0 to END0.

List of preferred separator:
comma, TAB, semicolon, colon or SPACE.

If region contains a line which doesn't contain the required
separator then discard the separator and search again using next
separator."
  (let* ((beg (save-excursion
                (goto-char (min beg0 end0))
                (skip-chars-forward " \t\n")
                (if (eobp) (point) (line-beginning-position))))
	 (end (save-excursion
                (goto-char (max beg end0))
                (skip-chars-backward " \t\n" beg)
                (if (= beg (point)) (point) (line-end-position))))
         (sep-regexp '((","  (rx bol (1+ (not (or ?\n ?,))) eol))
		       ("\t" (rx bol (1+ (not (or ?\n ?\t))) eol))
		       (";"  (rx bol (1+ (not (or ?\n ?\;))) eol))
		       (":"  (rx bol (1+ (not (or ?\n ?:))) eol))
		       (" "  (rx bol (1+ (not (or ?' ?\" ))
                                         (not (or ?\s ?\;))
                                         (not (or ?' ?\"))) eol))))
         sep)
    (unless (= beg end)
      (save-excursion
        (goto-char beg)
        (catch :found
          (pcase-dolist (`(,sep ,regexp) sep-regexp)
            (save-excursion
              (unless (re-search-forward (eval regexp) end t)
                (throw :found sep))))
          nil)))))

> Again all this needs to extensively tested, as there are a lot of
> dangers lurking around.

Summary of things that still requires a review:

+ Setting boundary right

+ When using SPACE as separator is it sufficient to check for all for
all non quoted SPACE's?

-- 
Utkarsh Singh
http://utkarshsingh.xyz




Information forwarded to emacs-orgmode <at> gnu.org:
bug#47885; Package org-mode. (Fri, 23 Apr 2021 04:59:01 GMT) Full text and rfc822 format available.

Message #20 received at 47885 <at> debbugs.gnu.org (full text, mbox):

From: Utkarsh Singh <utkarsh190601 <at> gmail.com>
To: Nicolas Goaziou <mail <at> nicolasgoaziou.fr>
Cc: 47885 <at> debbugs.gnu.org, emacs-orgmode <at> gnu.org
Subject: Re: [PATCH] org-table-import: Make it more smarter for interactive use
Date: Fri, 23 Apr 2021 10:28:24 +0530
[Message part 1 (text/plain, inline)]
Hi,

On 2021-04-20, 15:40 +0200, Nicolas Goaziou <mail <at> nicolasgoaziou.fr> wrote:

> Again all this needs to extensively tested, as there are a lot of
> dangers lurking around.

I am attaching my patch which also include my previous suggestion of
including yes-or-no prompt to org-table-import to allow file which don't
have csv, tsv or txt as extension.  Here are some concerns
with require your attention:

+ When using org-table-import interactively if we failed to guess
separator then we will be left with a user-error message and an
'unconverted table'.  We can make use of 'temp-buffer' to import our
file after successfully conversion.

+ Conversion part of org-table-convert-region make a distinction between
'(4) (comma separator) and rest of the separator we should either string
version of comma as AND condition or rewrite to simplify it.

I am willing to do these possible changes but currently waiting for your
review for org-table-guess-separator as there can be more serious bugs
lurking around on my code which I am considering base for these changes.

All the best,
Utkarsh

[org-table.patch (text/x-patch, inline)]
diff --git a/lisp/org/org-table.el b/lisp/org/org-table.el
index 0e93fb271f..84bc981fec 100644
--- a/lisp/org/org-table.el
+++ b/lisp/org/org-table.el
@@ -846,6 +846,42 @@ org-table-create
       (goto-char pos))
     (org-table-align)))
 
+
+(defun org-table-guess-separator (beg0 end0)
+  "Guess separator for `org-table-convert-region' for region BEG0 to END0.
+
+List of preferred separator:
+comma, TAB, semicolon, colon or SPACE.
+
+If region contains a line which doesn't contain the required
+separator then discard the separator and search again using next
+separator."
+  (let* ((beg (save-excursion
+                (goto-char (min beg0 end0))
+                (skip-chars-forward " \t\n")
+                (if (eobp) (point) (line-beginning-position))))
+	 (end (save-excursion
+                (goto-char (max beg end0))
+                (skip-chars-backward " \t\n" beg)
+                (if (= beg (point)) (point) (line-end-position))))
+         (sep-regexp '((","  (rx bol (1+ (not (or ?\n ?,))) eol))
+		       ("\t" (rx bol (1+ (not (or ?\n ?\t))) eol))
+		       (";"  (rx bol (1+ (not (or ?\n ?\;))) eol))
+		       (":"  (rx bol (1+ (not (or ?\n ?:))) eol))
+		       (" "  (rx bol (1+ (not (or ?' ?\" ))
+                                         (not (or ?\s ?\;))
+                                         (not (or ?' ?\"))) eol))))
+         sep)
+    (unless (= beg end)
+      (save-excursion
+        (goto-char beg)
+        (catch :found
+          (pcase-dolist (`(,sep ,regexp) sep-regexp)
+            (save-excursion
+              (unless (re-search-forward (eval regexp) end t)
+                (throw :found sep))))
+          nil)))))
+
 ;;;###autoload
 (defun org-table-convert-region (beg0 end0 &optional separator)
   "Convert region to a table.
@@ -859,20 +895,19 @@ org-table-convert-region
 (4)     Use the comma as a field separator
 (16)    Use a TAB as field separator
 (64)    Prompt for a regular expression as field separator
-integer  When a number, use that many spaces, or a TAB, as field separator
-regexp   When a regular expression, use it to match the separator
-nil      When nil, the command tries to be smart and figure out the
-         separator in the following way:
-         - when each line contains a TAB, assume TAB-separated material
-         - when each line contains a comma, assume CSV material
-         - else, assume one or more SPACE characters as separator."
+integer When a number, use that many spaces, or a TAB, as field separator
+regexp  When a regular expression, use it to match the separator
+nil     When nil, the command tries to be smart and figure out the
+        separator using `org-table-guess-seperator'."
   (interactive "r\nP")
   (let* ((beg (min beg0 end0))
 	 (end (max beg0 end0))
 	 re)
+
     (if (> (count-lines beg end) org-table-convert-region-max-lines)
 	(user-error "Region is longer than `org-table-convert-region-max-lines' (%s) lines; not converting"
 		    org-table-convert-region-max-lines)
+
       (when (equal separator '(64))
 	(setq separator (read-regexp "Regexp for field separator")))
       (goto-char beg)
@@ -881,17 +916,13 @@ org-table-convert-region
       (goto-char end)
       (if (bolp) (backward-char 1) (end-of-line 1))
       (setq end (point-marker))
-      ;; Get the right field separator
-      (unless separator
-	(goto-char beg)
-	(setq separator
-	      (cond
-	       ((not (re-search-forward "^[^\n\t]+$" end t)) '(16))
-	       ((not (re-search-forward "^[^\n,]+$" end t)) '(4))
-	       (t 1))))
+      (when (and (not separator)
+                 (not (setq separator
+                            (org-table-guess-separator (beg end)))))
+        (user-error "Failed to guess separator"))
       (goto-char beg)
       (if (equal separator '(4))
-	  (while (< (point) end)
+          (while (< (point) end)
 	    ;; parse the csv stuff
 	    (cond
 	     ((looking-at "^") (insert "| "))
@@ -905,7 +936,7 @@ org-table-convert-region
 	(setq re (cond
 		  ((equal separator '(4)) "^\\|\"?[ \t]*,[ \t]*\"?")
 		  ((equal separator '(16)) "^\\|\t")
-		  ((integerp separator)
+	          ((integerp separator)
 		   (if (< separator 1)
 		       (user-error "Number of spaces in separator must be >= 1")
 		     (format "^ *\\| *\t *\\| \\{%d,\\}" separator)))
@@ -921,12 +952,8 @@ org-table-convert-region
 (defun org-table-import (file separator)
   "Import FILE as a table.
 
-The command tries to be smart and figure out the separator in the
-following way:
-
-- when each line contains a TAB, assume TAB-separated material;
-- when each line contains a comma, assume CSV material;
-- else, assume one or more SPACE characters as separator.
+The command tries to be smart and figure out the separator using
+`org-table-guess-seperator'.
 
 When non-nil, SEPARATOR specifies the field separator in the
 lines.  It can have the following values:
@@ -938,7 +965,8 @@ org-table-import
 - regexp  When a regular expression, use it to match the separator."
   (interactive "f\nP")
   (when (and (called-interactively-p 'any)
-	     (not (string-match-p (rx "." (or "txt" "tsv" "csv") eos) file)))
+	     (not (string-match-p (rx "." (or "txt" "tsv" "csv") eos) file))
+             (not (yes-or-no-p "File does not havs .txt .txt .csv as extension.  Do you still want to continue? ")))
     (user-error "Cannot import such file"))
   (unless (bolp) (insert "\n"))
   (let ((beg (point))
[Message part 3 (text/plain, inline)]
-- 
Utkarsh Singh
http://utkarshsingh.xyz

Information forwarded to emacs-orgmode <at> gnu.org:
bug#47885; Package org-mode. (Tue, 27 Apr 2021 20:22:01 GMT) Full text and rfc822 format available.

Message #23 received at 47885 <at> debbugs.gnu.org (full text, mbox):

From: Nicolas Goaziou <mail <at> nicolasgoaziou.fr>
To: Utkarsh Singh <utkarsh190601 <at> gmail.com>
Cc: 47885 <at> debbugs.gnu.org
Subject: Re: bug#47885: [PATCH] org-table-import: Make it more smarter for
 interactive use
Date: Tue, 27 Apr 2021 22:21:20 +0200
Hello,

Utkarsh Singh <utkarsh190601 <at> gmail.com> writes:

> I am attaching my patch which also include my previous suggestion of
> including yes-or-no prompt to org-table-import to allow file which don't
> have csv, tsv or txt as extension.  

I suggest to make several patches. Do not try to stuff as many changes
as possible in a single one, please.

> + When using org-table-import interactively if we failed to guess
> separator then we will be left with a user-error message and an
> 'unconverted table'.  We can make use of 'temp-buffer' to import our
> file after successfully conversion.

I'm not sure to understand what you mean.

> + Conversion part of org-table-convert-region make a distinction between
> '(4) (comma separator) and rest of the separator we should either string
> version of comma as AND condition or rewrite to simplify it.

Ditto. But it can be the object of another patch. Let's concentrate on
`org-table-guess-separator' first.

> I am willing to do these possible changes but currently waiting for your
> review for org-table-guess-separator as there can be more serious bugs
> lurking around on my code which I am considering base for these
> changes.

You should definitely write tests for this function. Here's a start:

    (ert-deftest test-org-table/guess-separator ()
      "Test `test-org-table/guess-separator'."
      ;; Test space separator.
      (should
       (equal " "
              (org-test-with-temp-text "a b\nc d"
                (org-table-guess-separator (point-min) (point-max)))))
      (should
       (equal " "
              (org-test-with-temp-text "a b\nc d"
                (org-table-guess-separator (point-min) (point-max)))))
      ;; Test "inverted" region.
      (should
       (equal " "
              (org-test-with-temp-text "a b\nc d"
                (org-table-guess-separator (point-max) (point-min)))))
      ;; Do not error on empty region.
      (should-not
       (org-test-with-temp-text ""
         (org-table-guess-separator (point-max) (point-min))))
      (should-not
       (org-test-with-temp-text "   \n"
         (org-table-guess-separator (point-max) (point-min)))))

> +	 (end (save-excursion
> +                (goto-char (max beg end0))

This should be beg0 instead of beg above.

> +         (sep-regexp '((","  (rx bol (1+ (not (or ?\n ?,))) eol))
> +		       ("\t" (rx bol (1+ (not (or ?\n ?\t))) eol))
> +		       (";"  (rx bol (1+ (not (or ?\n ?\;))) eol))
> +		       (":"  (rx bol (1+ (not (or ?\n ?:))) eol))
> +		       (" "  (rx bol (1+ (not (or ?' ?\" ))
> +                                         (not (or ?\s ?\;))
> +                                         (not (or ?' ?\"))) eol))))

Use
         (sep-regexp
          (list (list ","  (rx bol (1+ (not (or ?\n ?,))) eol))
	        (list "\t" (rx bol (1+ (not (or ?\n ?\t))) eol))
	        (list ";"  (rx bol (1+ (not (or ?\n ?\;))) eol))
	        (list ":"  (rx bol (1+ (not (or ?\n ?:))) eol))
	        (list " "  (rx bol (1+ (not (or ?' ?\" ))
                                       (not (or ?\s ?\;))
                                       (not (or ?' ?\")))
                               eol))))

so you don't need eval below, and rx forms become constants when
compiled.

> +         sep)

This `sep' binding can be removed.

> +    (unless (= beg end)
> +      (save-excursion
> +        (goto-char beg)
> +        (catch :found
> +          (pcase-dolist (`(,sep ,regexp) sep-regexp)
> +            (save-excursion
> +              (unless (re-search-forward (eval regexp) end t)

You can drop the `eval'.

>    (when (and (called-interactively-p 'any)
> -	     (not (string-match-p (rx "." (or "txt" "tsv" "csv") eos) file)))
> +	     (not (string-match-p (rx "." (or "txt" "tsv" "csv") eos) file))
> +             (not (yes-or-no-p "File does not havs .txt .txt .csv as extension.  Do you still want to continue? ")))

"does not have" and ".txt" -> ".tsv" I guess.

Also please provide a patch with a commit message, possibly using `git
format-patch'.

Thanks!

Regards,
-- 
Nicolas Goaziou




Information forwarded to emacs-orgmode <at> gnu.org:
bug#47885; Package org-mode. (Wed, 28 Apr 2021 08:38:01 GMT) Full text and rfc822 format available.

Message #26 received at 47885 <at> debbugs.gnu.org (full text, mbox):

From: Utkarsh Singh <utkarsh190601 <at> gmail.com>
To: Nicolas Goaziou <mail <at> nicolasgoaziou.fr>
Cc: 47885 <at> debbugs.gnu.org
Subject: Re: bug#47885: [PATCH] org-table-import: Make it more smarter for
 interactive use
Date: Wed, 28 Apr 2021 14:07:37 +0530
[Message part 1 (text/plain, inline)]
Hi,

On 2021-04-27, 22:21 +0200, Nicolas Goaziou <mail <at> nicolasgoaziou.fr> wrote:

>> + When using org-table-import interactively if we failed to guess
>> separator then we will be left with a user-error message and an
>> 'unconverted table'.  We can make use of 'temp-buffer' to import our
>> file after successfully conversion.
>
> I'm not sure to understand what you mean.

Note: I will advice you to apply patch no. 2 before trying out the
following example.

1. Download the attached CSV file.  We can call this example.csv

2. Go to *scratch* buffer.

3. Use 'M-x org-table-import' to import example.csv as org-table.

You will see even thought org-table-guess-separator failed in guessing
separator we are still left with unconverted region added to our buffer.

>> + Conversion part of org-table-convert-region make a distinction between
>> '(4) (comma separator) and rest of the separator we should either string
>> version of comma as AND condition or rewrite to simplify it.
>
> Ditto. But it can be the object of another patch. Let's concentrate on
> `org-table-guess-separator' first.
>
>> I am willing to do these possible changes but currently waiting for your
>> review for org-table-guess-separator as there can be more serious bugs
>> lurking around on my code which I am considering base for these
>> changes.
>
> You should definitely write tests for this function. Here's a start:
>
>     (ert-deftest test-org-table/guess-separator ()
>       "Test `test-org-table/guess-separator'."
>       ;; Test space separator.
>       (should
>        (equal " "
>               (org-test-with-temp-text "a b\nc d"
>                 (org-table-guess-separator (point-min) (point-max)))))
>       (should
>        (equal " "
>               (org-test-with-temp-text "a b\nc d"
>                 (org-table-guess-separator (point-min) (point-max)))))
>       ;; Test "inverted" region.
>       (should
>        (equal " "
>               (org-test-with-temp-text "a b\nc d"
>                 (org-table-guess-separator (point-max) (point-min)))))
>       ;; Do not error on empty region.
>       (should-not
>        (org-test-with-temp-text ""
>          (org-table-guess-separator (point-max) (point-min))))
>       (should-not
>        (org-test-with-temp-text "   \n"
>          (org-table-guess-separator (point-max) (point-min)))))
>

I will surely do more testing.

I would also like to simplify the condition for guessing SPACE as
separator due to following cases:

+ field1 'this is field2' 'this is field3' :: In this case we still have
SPACE inside quote (' in this case).

+ Since SPACE is our last valid separator I think searching for a line
which doesn't contains space is more than enough.

Required patch:

[0001-org-table.el-org-table-import-add-yes-and-no-prompt.patch (text/x-patch, inline)]
From 6b112927de73c43edfd08254217808ebff42772a Mon Sep 17 00:00:00 2001
From: Utkarsh Singh <utkarsh190601 <at> gmail.com>
Date: Wed, 28 Apr 2021 10:26:46 +0530
Subject: [PATCH 1/3] org-table.el (org-table-import): add yes-and-no prompt

Add a yes and no prompt for files which don't have .txt, .tsv OR .csv
as file extensions.
---
 lisp/org/org-table.el | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lisp/org/org-table.el b/lisp/org/org-table.el
index 0e93fb271f..e0b2be6892 100644
--- a/lisp/org/org-table.el
+++ b/lisp/org/org-table.el
@@ -938,7 +938,8 @@ org-table-import
 - regexp  When a regular expression, use it to match the separator."
   (interactive "f\nP")
   (when (and (called-interactively-p 'any)
-	     (not (string-match-p (rx "." (or "txt" "tsv" "csv") eos) file)))
+	     (not (string-match-p (rx "." (or "txt" "tsv" "csv") eos) file))
+             (not (yes-or-no-p "File does not have .txt, .tsv or .csv as extension.  Do you still want to continue? ")))
     (user-error "Cannot import such file"))
   (unless (bolp) (insert "\n"))
   (let ((beg (point))
-- 
2.31.1

[0002-org-table.el-org-table-convert-region-move-out-separ.patch (text/x-patch, inline)]
From 9bb017cfc8284075e04faf5496ed560ba48d5bbc Mon Sep 17 00:00:00 2001
From: Utkarsh Singh <utkarsh190601 <at> gmail.com>
Date: Wed, 28 Apr 2021 10:42:32 +0530
Subject: [PATCH 2/3] org-table.el (org-table-convert-region): move out
 separator-guessing

1. Move separator guessing code to org-table-guess-separator (new
function).
2. Add semicolon, colon and SPACE to the list of know separator
(separator which we can guess).
---
 lisp/org/org-table.el | 49 +++++++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 11 deletions(-)

diff --git a/lisp/org/org-table.el b/lisp/org/org-table.el
index e0b2be6892..295f7a9b90 100644
--- a/lisp/org/org-table.el
+++ b/lisp/org/org-table.el
@@ -846,6 +846,39 @@ org-table-create
       (goto-char pos))
     (org-table-align)))
 
+(defun org-table-guess-separator (beg0 end0)
+  "Guess separator for region BEG0 to END0.
+
+List of preferred separator (in order of preference):
+comma, TAB, semicolon, colon or SPACE.
+
+Search for a line which doesn't contain a separator if found
+search again using next preferred separator or else return
+separator as string."
+  (let* ((beg (save-excursion
+                (goto-char (min beg0 end0))
+                (skip-chars-forward " \t\n")
+                (if (eobp) (point) (line-beginning-position))))
+	 (end (save-excursion
+                (goto-char (max beg0 end0))
+                (skip-chars-backward " \t\n" beg)
+                (if (= beg (point)) (point) (line-end-position))))
+         (sep-regexp
+          (list (list ","  (rx bol (1+ (not (or ?\n ?,))) eol))
+		(list "\t" (rx bol (1+ (not (or ?\n ?\t))) eol))
+		(list ";"  (rx bol (1+ (not (or ?\n ?\;))) eol))
+		(list ":"  (rx bol (1+ (not (or ?\n ?:))) eol))
+		(list " "  (rx bol (1+ (not (or ?\n ?\s))) eol)))))
+    (unless (= beg end)
+      (save-excursion
+        (goto-char beg)
+        (catch :found
+          (pcase-dolist (`(,sep ,regexp) sep-regexp)
+            (save-excursion
+              (unless (re-search-forward regexp end t)
+                (throw :found sep))))
+          nil)))))
+
 ;;;###autoload
 (defun org-table-convert-region (beg0 end0 &optional separator)
   "Convert region to a table.
@@ -862,10 +895,7 @@ org-table-convert-region
 integer  When a number, use that many spaces, or a TAB, as field separator
 regexp   When a regular expression, use it to match the separator
 nil      When nil, the command tries to be smart and figure out the
-         separator in the following way:
-         - when each line contains a TAB, assume TAB-separated material
-         - when each line contains a comma, assume CSV material
-         - else, assume one or more SPACE characters as separator."
+         separator using `org-table-guess-seperator'."
   (interactive "r\nP")
   (let* ((beg (min beg0 end0))
 	 (end (max beg0 end0))
@@ -882,13 +912,10 @@ org-table-convert-region
       (if (bolp) (backward-char 1) (end-of-line 1))
       (setq end (point-marker))
       ;; Get the right field separator
-      (unless separator
-	(goto-char beg)
-	(setq separator
-	      (cond
-	       ((not (re-search-forward "^[^\n\t]+$" end t)) '(16))
-	       ((not (re-search-forward "^[^\n,]+$" end t)) '(4))
-	       (t 1))))
+      (when (and (not separator)
+                 (not (setq separator
+                            (org-table-guess-separator beg end))))
+        (user-error "Failed to guess separator"))
       (goto-char beg)
       (if (equal separator '(4))
 	  (while (< (point) end)
-- 
2.31.1

[0003-org-table.el-org-table-import-add-file-prompt.patch (text/x-patch, inline)]
From fef97ffe27ff908647c45f1b066a845e71a0926f Mon Sep 17 00:00:00 2001
From: Utkarsh Singh <utkarsh190601 <at> gmail.com>
Date: Wed, 28 Apr 2021 14:01:31 +0530
Subject: [PATCH 3/3] org-table.el (org-table-import): add file prompt

---
 lisp/org/org-table.el | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lisp/org/org-table.el b/lisp/org/org-table.el
index 295f7a9b90..e904903576 100644
--- a/lisp/org/org-table.el
+++ b/lisp/org/org-table.el
@@ -963,7 +963,8 @@ org-table-import
 - (64)    Prompt for a regular expression as field separator.
 - integer When a number, use that many spaces, or a TAB, as field separator.
 - regexp  When a regular expression, use it to match the separator."
-  (interactive "f\nP")
+  (interactive (list (read-file-name "Import file: ")
+                     (prefix-numeric-value current-prefix-arg)))
   (when (and (called-interactively-p 'any)
 	     (not (string-match-p (rx "." (or "txt" "tsv" "csv") eos) file))
              (not (yes-or-no-p "File does not have .txt, .tsv or .csv as extension.  Do you still want to continue? ")))
-- 
2.31.1

[example.csv (application/octet-stream, attachment)]
[Message part 6 (text/plain, inline)]
-- 
Utkarsh Singh
http://utkarshsingh.xyz

This bug report was last modified 2 years and 335 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.