Skip to content

New CSV implementation #903

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Nov 25, 2024
Merged

New CSV implementation #903

merged 17 commits into from
Nov 25, 2024

Conversation

Jolanrensen
Copy link
Collaborator

@Jolanrensen Jolanrensen commented Oct 2, 2024

See the issue for the progress: #827

TODO:

  • Split double parser into separate PR when FastDoubleParser publishes ConfigurableDoubleParser: Fast double parser #935
  • Clean git history after merging fast double parser

Some high level overview of the changes:

  • introduces new :dataframe-csv module
  • new functions, requiring @OptIn(ExperimentalCsv::class):
    • read(Csv|Tsv|Delim)(Str) based on Deephaven
    • write(Csv|Tsv|Delim), and to(Csv|Tsv|Delim)Str based on Apache commons csv
  • Added ParserOptions.skipTypes
  • parse can now recognize Char
  • Extended supported ColTypes to all the types we can parse
  • Added OOM warning to old readCSV function informing the user about our new experimental functions :)
  • See all checked off boxes in ☂ CSV rework #827 for all the bugs this fixes or issues it takes in mind.

Fixes #787
Fixes #508
Fixes #589
Fixes #469
Fixes #921

Example of the new KDocs and all supported parameters:
image
image
image

@Jolanrensen Jolanrensen force-pushed the new-csv-implementation branch 4 times, most recently from 2f51ebf to ac8a5e5 Compare October 4, 2024 19:25
@Jolanrensen Jolanrensen force-pushed the new-csv-implementation branch 6 times, most recently from bdb5bc2 to cdf47a0 Compare October 16, 2024 16:57
@Jolanrensen Jolanrensen force-pushed the new-csv-implementation branch 10 times, most recently from 56934c2 to 86dc97f Compare October 22, 2024 16:29
@Jolanrensen Jolanrensen linked an issue Oct 23, 2024 that may be closed by this pull request
28 tasks
@Kotlin Kotlin deleted a comment from github-actions bot Oct 23, 2024
@Jolanrensen Jolanrensen mentioned this pull request Oct 28, 2024
@Jolanrensen
Copy link
Collaborator Author

Jolanrensen commented Oct 30, 2024

NOTE: maybe we can try {} catch(OutOfMemoryError) {} around the apache commons read csv and recommend the new experimental csv reader.

@Jolanrensen Jolanrensen force-pushed the new-csv-implementation branch from d837c58 to 258f5b7 Compare November 1, 2024 11:15
@@ -64,8 +67,17 @@ public data class ParserOptions(
val dateTimeFormatter: DateTimeFormatter? = null,
val dateTimePattern: String? = null,
val nullStrings: Set<String>? = null,
val skipTypes: Set<KType> = emptySet(),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe skipTypes be better be inverted to checkTypes. But that would mean exposing all Parsers so users/csv readers can filter which types to consider when parsing and which to skip. WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really weird feature, let's live a little bit with that solution to reflect later in code

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's mostly a performance thing, but the problem is that we don't have a user-facing way to refer to parsers, so we cannot say: "please parse this string column and try this type but not this type". I'm not sure how the API should look here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I managed to leverage convert to at least avoid skipTypes = allTypesExcept(...). Let's see if I can remove it altogether

@Jolanrensen Jolanrensen marked this pull request as ready for review November 1, 2024 11:42
@Jolanrensen
Copy link
Collaborator Author

Jolanrensen commented Nov 1, 2024

Released dev version to test: 0.15.0-dev-4759 0.15.0-dev-4803 0.15.0-dev-4939

@Jolanrensen Jolanrensen force-pushed the new-csv-implementation branch 2 times, most recently from 0b0776c to 92c0515 Compare November 1, 2024 12:43
@Jolanrensen Jolanrensen force-pushed the new-csv-implementation branch from 1114cb0 to 5bd7567 Compare November 15, 2024 14:39
@Jolanrensen Jolanrensen force-pushed the new-csv-implementation branch 2 times, most recently from 2f5c4ec to 703338a Compare November 19, 2024 11:13
@Jolanrensen Jolanrensen force-pushed the new-csv-implementation branch from 703338a to 6bb3502 Compare November 19, 2024 13:43
@zaleslaw zaleslaw self-requested a review November 19, 2024 15:43
…pted both csv implementations to use convertTo. Addition of DataColumn<String>.convertTo overloads to allow for ParserOptions (for nullStrings etc.) Moved DataColumn<String>.convertToDouble to impl. Fixed nullstrings support for it, cleaned the parsers. Added tests for Issue #921
# Conflicts:
#	core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/util/deprecationMessages.kt
@Jolanrensen
Copy link
Collaborator Author

@zaleslaw @koperagen My two latest commits (https://github.com/Kotlin/dataframe/compare/703338a7042c30f8629bfc9e5979c0a6560a067f..53e2f64daa1b8167c56d5c4a4fd33a66cf78bfe2) contain a lot of new changes because I figured out how ParserOptions actually work in the library:

  • There's a global setting DataFrame.parser which should be respected if either: ParserOptions == null or any of its parameters are null. I made sure the csv readers respect this too.
  • I added my new ParserOptions arguments to the global one as well (and I made it not a data class anymore for better future binary compatibility).
  • "skip types except" makes no sense, we have convertTo for that. But there was no overload for stringColumn.convertTo that took any ParserOptions, so I added that.
  • Moved convertToDouble to convertToDoubleImpl and added explanation in KDocs for why (I think) it exists (and it now accepts nullStrings as well).
  • Added more relevant tests

Could you rereview if you like?

@Jolanrensen Jolanrensen requested a review from zaleslaw November 22, 2024 15:13
@zaleslaw
Copy link
Collaborator

zaleslaw commented Nov 25, 2024

Is it a first global setting in KDF, or we had something earlier? Should user set up it manually and if it will be replaced is it global for the whole project, using our library? In that way it's closer to the CONFIG file which is visible or Spring Bean.

I see that it's inherited from external component, isn't it?

I'm ok, but looks like a temporary solution, could you create a ticket with research tag for that

@Jolanrensen
Copy link
Collaborator Author

Is it a first global setting in KDF, or we had something earlier? Should user set up it manually and if it will be replaced is it global for the whole project, using our library? In that way it's closer to the CONFIG file which is visible or Spring Bean.

I see that it's inherited from external component, isn't it?

I'm ok, but looks like a temporary solution, could you create a ticket with research tag for that

DataFrame.parser has been part of the library for years, I just didn't know about it yet (it's also part of the documentation https://kotlin.github.io/dataframe/parse.html, at the bottom). It's similar to Locale.setDefault() on the JVM, so yes, it's global, like a config file, but it can be changed programmatically because it's stored in a singleton object.

In all cases, the global default can also be overridden by providing a ParserOptions instance to the function, so we embrace both a functional and imperative style which I think is cool.

@Jolanrensen Jolanrensen self-assigned this Nov 25, 2024
@Jolanrensen Jolanrensen merged commit 1d2eb2e into master Nov 25, 2024
5 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants