ENH: ignoring comment lines and empty lines in CSV files #7470
mdmueller wants to merge 2 commits into pandas-dev:master
|
@AmrAS1 thanks for taking this on! lmk if you need any help. We could put this in 0.14.1 if you can do this in the next 2 weeks |
|
@jreback - No problem! I should be done with this pretty soon. Right now I'm just getting test failures with the Python parsing engine, so I'll add this functionality to |
thanks! |
|
will this close #4623 ? |
|
This will close #4623 according to current output: I'll check out the "\t\t \n" case -- I think this would have to be addressed after tokenization in the Cython code, unless there's a better way. I'm a little confused about what you mean by checking for >1 char of the comment char though, doesn't the C parser disallow regex comments anyway? |
|
what I mean is you should disallow for the |
|
Unless I'm missing something, it looks like that's the current functionality already: The tricky thing with the whitespace is for cases like '\t\t ,\t' or '\t a' where the delimiter or non-whitespace show up eventually. I guess I could backtrack in the tokenization algorithm like how \r line terminators are handled, would that be preferable to post-processing? |
|
@AmrAS1 if there is a test for multi-char comments then good. I think handling the back-tracking similar to \r is good (don't want to slow up the general case). |
|
@jreback - I'm almost done, but I found a test failure involving a file which ends with a bunch of tab-filled lines using |
|
I think if u have the delimiter then the line counts |
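For reference, the rule discussed above can be sketched in plain Python. This is only an illustration of the rule as stated in this thread (a whitespace-only line is skipped unless the delimiter shows up in it), not the C tokenizer code; `line_counts` is a hypothetical helper name and the default delimiter is an assumption.

```python
def line_counts(line: str, delimiter: str = ",") -> bool:
    """Return True if the line should produce a row.

    Hypothetical helper illustrating the rule from this thread:
    a line counts if it contains the delimiter or any
    non-whitespace character; otherwise it is treated as blank.
    """
    body = line.rstrip("\r\n")  # drop the line terminator only
    if delimiter in body:
        return True  # the delimiter is present, so the line counts
    return body.strip() != ""  # whitespace-only lines are skipped


# The cases mentioned in the conversation, with "," as the delimiter.
cases = ["\t\t \n", "\t\t ,\t", "\t a", ""]
results = [line_counts(c) for c in cases]

# With a tab delimiter, a tab-filled line contains the delimiter.
tab_line_counts = line_counts("\t\t\n", delimiter="\t")
```

With these inputs only the first and last cases are skipped; the tab-filled line counts when the delimiter itself is a tab.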
|
@jreback - I think I'm done here, but let me know if there are any problems. |
say it more like: read_csv/read_table file parsers now ignore comments provided with the comment=.... (say something about single character), and ignore starting at the comment
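A minimal sketch of the behavior being described, assuming a recent pandas and a single-character comment marker. The sample data is made up for illustration.

```python
from io import StringIO

import pandas as pd

# Made-up sample: a full-line comment before the header, a blank line,
# a trailing comment on a data row, and another full-line comment.
data = (
    "#generated by some tool\n"
    "a,b,c\n"
    "\n"
    "1,2,3#trailing comment\n"
    "#full-line comment\n"
    "4,5,6\n"
)

# comment="#" drops everything from the comment character to the end of
# the line; lines that are entirely comments are skipped altogether.
df = pd.read_csv(StringIO(data), comment="#")
```

Here `df` should hold two rows under the header `a,b,c`, with the commented and blank lines ignored.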
|
pls run a perf check ( pls squash this down to a small number of commits otherwise looks good! |
|
There were two with a ratio >1.2 but both are unrelated to parsing. Here is the bottom of the results: I assume the slower performance is due to previous commits. I'll squash the commits on this PR now. |
|
perf looks fine |
|
you have some failing tests |
|
Looks like it was the same error, I fixed it in the last commit so now I'm waiting on Travis to make sure. |
|
Hm, it's fixed now except for an error in |
|
@AmrAS1 how's this coming along? |
|
@jreback - I tried running Travis on my repo with these changes and it worked except for a timeout. Not sure what's up, is there any way we can restart Travis on this PR? |
|
restarted. do the doc-string / docs need updating? |
|
Updated -- Travis seems to be stuck on a timeout. |
|
push this up again
it will force a new travis build |
|
those are seg faults |
|
It works locally for me |
|
so a way to debug is to have nosetests output each test (to see where it is failing): add the -v argument in ci/script.py (only temporary of course) |
65e7572 to 99f162f
|
Ah, must've forgotten to recompile extensions. The problem was that I forgot to change the parameter name in parser.pyx, I'll amend this. |
99f162f to e2f02d7
This is an API change (that blank lines will be skipped). Maybe make a warning here that this is new in 0.15?
|
Ok added API change notice |
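To make the API change concrete, here is a small sketch assuming the `skip_blank_lines` keyword named in the commits below, with its default of True. The sample data is made up.

```python
from io import StringIO

import pandas as pd

data = "a,b\n1,2\n\n3,4\n"  # made-up sample with one blank line

# Default behavior: the blank line is skipped.
skipped = pd.read_csv(StringIO(data))

# Opting out: the blank line is kept and parsed as a row of NaN.
kept = pd.read_csv(StringIO(data), skip_blank_lines=False)
```

Code that relied on blank lines showing up as all-NaN rows would need to pass `skip_blank_lines=False` to keep that behavior.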
Maybe mention explicitly that it is read_csv (and read_table, I suppose)? That is clearer for users I think. |
Amended to add this in
1d95b5c to cee026d
|
pls rebase and squash |
commit 0e9d792fc9d5159179efd810a1092671dbbef3b1
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Sep 17 14:49:31 2014 -0400
Added warnings about API changes
commit 06472c21000b489841cc8e486ceddf05fd87a1c5
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Fri Sep 12 22:36:06 2014 -0400
Changed parameter name to skip_blank_lines
commit afd3be30b4afcae0d9bc6278237aab6a4c9e7eb8
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Fri Sep 12 21:50:08 2014 -0400
Minor doc changes
commit b47876e074f5f683a9a51768e480e24d9d3249ab
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Fri Sep 12 19:26:22 2014 -0400
Extended blank line skipping to custom line terminated/whitespace delimited reading
commit 3f4a20a831b1bc0ca29779b315dc72d78ad2301e
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Fri Sep 12 11:35:17 2014 -0400
Changed around io docs section
commit 223e17ecdcbe377cc69fd962221e03412f5e54d3
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Tue Sep 9 23:13:37 2014 -0400
Turned empty line skipping into a keyword parameter feature
commit dcd31ca6bd0849eab87ea1c3c5441c8630ca3a35
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Sep 3 21:35:09 2014 -0400
Squashed commit of the following:
commit 9aea77954681c2f7d1336d94366221222d186c2b
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Tue Aug 26 22:43:21 2014 -0400
Fixed header/skiprows combination issue
commit 1975affea3bf0bd6f1769a79e4b0c7fde17962df
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 25 19:35:24 2014 -0400
Added warning/notes about functionality change in docs, removed HTML changes
commit 693c820092d9f17f9101074d29c2d7d53fa5a8ae
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 25 15:38:41 2014 -0400
Fixed problem with HTML reading and infinite loop in PythonParser __init__
commit 2a0a4babac7a5e53279eaa8281d0a51406caeb27
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Mon Jun 23 08:37:33 2014 -0400
Updated docs with new read_csv functionality, removed unreachable code
commit 19b5811e8d78c4e618e19ff5768aa2cfff041620
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 18 21:43:47 2014 -0400
Fixed error in empty/whitespace removal function
commit 3fd11a822cc0bee123d68240c62627da11ee88c2
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 18 18:48:08 2014 -0400
Squashed commit of the following:
commit 60a1cd1bc1042a9959ae75ff006052c433d98825
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 18 18:40:17 2014 -0400
Fixed error with string/numerical types
commit 7fe1bcf75466ea2b19d947aff0769c9f03bc23f5
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 18 17:47:56 2014 -0400
release notes
commit 835e490c8d3a3a96aeb6a6c3846217d36469656b
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 18 17:15:17 2014 -0400
Release note
commit 25cee3167b81b9c81e969629cd83968c6736a94f
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 18 16:56:44 2014 -0400
Fixed whitespace issue, made C parser check for delimiters in whitespace lines
commit 593495eb15162833de78d2da65f377fa977ad225
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 18 15:41:52 2014 -0400
Added new functionality to Python reader
commit 8a8325ed883034f176c929b41fe6fad16420e9b5
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Tue Jun 17 19:52:41 2014 -0400
Adjusted tokenizer to ignore whitespace-only lines, fixed tests
commit 3ea2eed22884a63a6e8dec1b795acdf29b030949
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Mon Jun 16 12:36:14 2014 -0400
Moved tests to C parsing suite, corrected multi-index test
commit d5540311ca44992148932ae27e16fc4d02a2a018
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Mon Jun 16 12:35:46 2014 -0400
Changed empty file handling so that a ValueError is raised as expected
commit 03a4c3d27c18052f04bd7cb862d289eabbc773ba
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Sun Jun 15 23:07:17 2014 -0400
Wrote tests for empty lines and comment lines
commit 01db817e97fc8ee0da85cc17603578b56d294b1b
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Sun Jun 15 23:02:04 2014 -0400
Modified C tokenizer so that comments and empty lines are ignored
cee026d to e4bcb5c
|
@jreback - rebased and squashed |
put the default here (True)
|
@AmrAS1 couple of minor documentation changes. ping when green. |
|
@jreback - Green |
|
side issue: I happen to have an old version of cython installed on 1 of my machines (0.17.1). It didn't compile parser.pyx correctly, but the newer ones do. odd. so going to change the min version (to 0.19.1), which is still pretty old |
|
Interesting, do you know what the compile error was? |
|
merged via 31c2558 |
|
It broke on the float_precision pointer-to-func (was a formatting error). 0.19.1 works fine though |
|
@amanshei Did you get it sorted out? |
|
I did, thank you! |
closes #4466