Parsing occurrence text files in DwC archive #25
Comments
Indeed! Basically, can |
Yes, you can pass colClasses, as in the second call:
# Default read: the type of footprintWKT is guessed from the data
occurrence <- finch::dwca_read("http://ipt.vliz.be/eurobis/archive.do?r=manuela_uy&v=1.0", read = TRUE)$data$occurrence.txt
class(occurrence$footprintWKT)
# Same read, forcing footprintWKT to character via colClasses
occurrence <- finch::dwca_read("http://ipt.vliz.be/eurobis/archive.do?r=manuela_uy&v=1.0", read = TRUE, colClasses = c(footprintWKT = "character"))$data$occurrence.txt
class(occurrence$footprintWKT) |
Thanks for the nice example. My question actually goes a step further: what about saving the right column types in the finch repo and using them as defaults in |
@peterdesmet : we currently don't manage types in python-dwca-reader; everything is just assumed to be a string and the conversions are left to the data user. That seemed the simplest sensible approach at the time. I do see the added value for users, however, so I just opened a new issue (BelgianBiodiversityPlatform/python-dwca-reader#76) with some considerations about it, so I won't forget to think about it :) |
thanks for this @damianooldoni definitely seems reasonable to include the column types within this package and use them - do you want to make a PR? |
Having a default option would indeed be a good idea when using the package for data analysis. Still, as I also mentioned in the python-dwca-reader repo, I would keep interpretation of the dtypes an option and not the default. Having all columns as strings/characters on input can be beneficial when, for example, you want to validate the input and have full control over the dtype properties (e.g. our work with whip). |
i can see the advantage of having all strings, @stijnvanhoey - we could have a parameter to toggle this, so that you either get all strings or apply the types above |
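A minimal sketch of what such a toggle could look like, written in base R rather than against finch itself. The names `read_occurrence()`, `as_character` and `dwc_col_types` are hypothetical, and the lookup below is only an illustrative stand-in for the types collected in the gist:

```r
# Hypothetical lookup of column types for a few Darwin Core terms
# (illustrative only, not the contents of the gist or of finch).
dwc_col_types <- c(
  individualCount = "integer",
  decimalLatitude = "numeric",
  footprintWKT    = "character"
)

# Hypothetical reader: all strings by default, known types on request.
read_occurrence <- function(path, as_character = TRUE) {
  # Read the header line to learn which columns the file actually has.
  header <- strsplit(readLines(path, n = 1), "\t", fixed = TRUE)[[1]]
  if (as_character) {
    classes <- rep("character", length(header))
  } else {
    # NA means "let read.delim guess" for columns without a known type.
    classes <- rep(NA_character_, length(header))
    known <- header %in% names(dwc_col_types)
    classes[known] <- dwc_col_types[header[known]]
  }
  read.delim(path, colClasses = classes, na.strings = c("", "NA"),
             quote = "", stringsAsFactors = FALSE)
}

# Tiny tab-delimited example file with an empty footprintWKT in row 1.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "occurrenceID\tfootprintWKT\tindividualCount",
  "occ1\t\t3",
  "occ2\tPOINT (2.5 51.2)\t1"
), path)

as_strings <- read_occurrence(path)                    # every column is character
typed <- read_occurrence(path, as_character = FALSE)   # known types applied
sapply(typed, class)
```

Building a positional vector of classes from the header avoids relying on name matching for columns that are in the lookup but not in the file.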
@damianooldoni curious about my above question ^^ |
yes, @sckott, I think it's a good idea. I'm still trying to find some free time to finish my other PR. But yes, this should be done as well, since I find it very important for avoiding frustration and errors when importing occurrence files. I'll ping you very soon. |
okay, thanks |
This package is going to be archived. |
After a year of working with GBIF data in R and repeatedly running into problems importing occurrence text files correctly, I ended up writing a gist collecting most of the column types that gave me trouble: type_GBIF_occurrence_fields.R. I discussed with colleagues whether to put it in our project package. But, as suggested in trias-project/trias#25 (comment), why not pitch it to the authors of finch? 👍
The typical issue when opening such files is that some DwC fields (columns) are NA for thousands of rows before the first real value appears. This causes parsing failures, because R assigns the type logical to these fields (columns). My first solution was to increase the value of the guess_max parameter, but this is unfeasible for big files, and it is just a workaround anyway.
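To illustrate the effect in a self-contained way, here is a small base R sketch. It is an assumption-laden stand-in: base `read.delim()` inspects the whole column, so the toy file leaves footprintWKT empty in every row to mimic what a reader sees when it only looks at the first rows. The file contents are made up for the example:

```r
# Toy occurrence file in which footprintWKT is empty in every row.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "occurrenceID\tfootprintWKT\tdecimalLatitude",
  "occ1\t\t51.2",
  "occ2\t\t51.3",
  "occ3\t\t51.4"
), path)

# With only missing values to look at, the column type is guessed as logical.
guessed <- read.delim(path, na.strings = c("", "NA"), quote = "")
class(guessed$footprintWKT)   # "logical"

# Declaring the type up front removes the guess for that column.
typed <- read.delim(path, na.strings = c("", "NA"), quote = "",
                    colClasses = c(footprintWKT = "character"))
class(typed$footprintWKT)     # "character"
```

Declaring the type does not depend on how many rows are inspected, which is why a stored set of column types scales to big files where raising the number of guessed rows does not.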