Datasets Data URLs and API generally #6

Open
rufuspollock opened this issue Apr 29, 2016 · 7 comments

rufuspollock commented Apr 29, 2016

From @rgrp on February 24, 2013 18:26

This issue is about the URL / API structure for accessing data (and metadata) from the data packages.

Current Situation

  • For stuff under /data/: /data/{dataset}/datapackage.json and /data/{dataset}.csv
  • For other stuff either at /tools/view/ or /community/ via: http://data.okfn.org/tools/dataproxy/?url={path-to-csv} (though this is not much different from datapipes.okfnlabs.org/csv/raw/?url=.... and leaves much to be desired)

Proposal

/data/ + /community/ data packages

For /data/ and /community/ data packages:

/.../{dataset}/datapackage.json     # the datapackage.json file

## data urls
/.../{dataset}/r/{resource-name-or-order}.{format}  

so e.g.

/.../gdp/r/annual.csv   # resource name
/.../gdp/r/0.csv           # resource by index

Formats that we should support would be:

  • {format} = csv | json | html | raw (by default)
  • {resource-name} = name as in the resources entry. (Also allow the resource's position in the resources list, e.g. 0 for the first resource, 1 for the second, matching the 0.csv example above.)
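
A minimal sketch of how the last path segment in `/.../{dataset}/r/{resource-name-or-order}.{format}` could be split into a resource reference and a format. The function name `parse_resource_ref` is hypothetical, and the sketch assumes zero-based positions (as in the `0.csv` example) and that a missing or unrecognised extension means `raw`.

```python
# Formats named in the proposal; "raw" is assumed when no known extension is given.
FORMATS = {"csv", "json", "html", "raw"}


def parse_resource_ref(segment):
    """Split a segment such as 'annual.csv' or '0.csv' into (resource, format).

    Hypothetical helper: a purely numeric reference is returned as an int and
    treated as a zero-based position in the resources list (an assumption).
    """
    name, dot, ext = segment.rpartition(".")
    if dot and ext in FORMATS:
        resource, fmt = name, ext
    else:
        # No recognised extension: keep the whole segment and default to raw.
        resource, fmt = segment, "raw"
    if resource.isdigit():
        return int(resource), fmt
    return resource, fmt


def find_resource(datapackage, ref):
    """Look up a resource in a parsed datapackage.json by position or by name."""
    resources = datapackage["resources"]
    if isinstance(ref, int):
        return resources[ref]
    return next(r for r in resources if r.get("name") == ref)
```

With this reading, `annual.csv` and `0.csv` address the same resource whenever `annual` is the first entry in `resources`.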

Addressing individual elements

Longer-term we could support addressing individual elements, e.g. addressing rows or cells in a dataset:

.../gdp/r/annual/5/        # row 5 of this dataset, rendered as HTML by default
.../gdp/r/annual/5.csv  # in CSV format
.../gdp/r/annual/5/year/  # cell in row 5, field year (in HTML form by default)

.../{dataset}/r/{resource-name-or-index}/{row-index-or-primary-key}[.html | .csv | .json]
.../{dataset}/r/{resource-name-or-index}/{row-index}/{field-name-or-index}[.html | .csv | .json]

Questions:

  • How do we distinguish a row index from a primary key when both are numerical (which takes precedence)? I'd argue the PK should take precedence, and for the index we have e.g. i:{number}
    • That said, an index is always possible whereas a primary key may be absent ...
  • Support for ranges - see approach to this in datapipes
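
One possible resolution of the index-versus-primary-key question, as a sketch only. It assumes the `i:{number}` form means "row index", that a bare value is a primary key whenever the resource declares one, and that it falls back to a row index otherwise. The function name and the `has_primary_key` flag are hypothetical; ranges are not covered.

```python
def parse_row_selector(selector, has_primary_key):
    """Classify a row selector segment as ('index', n) or ('key', value).

    Assumed rules (not settled in the proposal):
    - 'i:5' always means row index 5;
    - a bare value is a primary key if the resource has one;
    - otherwise a bare value must be a numeric row index.
    """
    if selector.startswith("i:"):
        return ("index", int(selector[2:]))
    if has_primary_key:
        return ("key", selector)
    return ("index", int(selector))
```

Under these rules `.../gdp/r/annual/5/` is a primary-key lookup when `annual` has a primary key, and `.../gdp/r/annual/i:5/` is always the row at index 5.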

Data packages somewhere online

We follow the same structure as in the other case, but instead of putting the data package name in the URL path we pass the data package URL in the query string:

/api/datapackage.json?url={datapackage-url}
/api/data/{resource-name-or-index}.{format}?url={datapackage-url}

# e.g. this returns first resource as CSV
/api/data/0.csv?url=https://raw.github.com/datasets/browser-stats/master/datapackage.json
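
A small sketch of building and reading these query-string URLs with Python's standard `urllib.parse`. The helper names and the `/api` base path default are illustrative assumptions; the point is that the data package URL should be percent-encoded when placed in the `url` parameter.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def remote_resource_url(datapackage_url, resource, fmt="csv", base="/api"):
    """Build /api/data/{resource}.{format}?url={datapackage-url} (hypothetical helper)."""
    # urlencode percent-encodes ':' and '/' so the nested URL survives intact.
    query = urlencode({"url": datapackage_url})
    return f"{base}/data/{resource}.{fmt}?{query}"


def datapackage_url_from_request(request_url):
    """Recover the data package URL from the 'url' query parameter."""
    return parse_qs(urlsplit(request_url).query)["url"][0]


DP = "https://raw.github.com/datasets/browser-stats/master/datapackage.json"
first_as_csv = remote_resource_url(DP, 0)
```

Here `first_as_csv` has the path `/api/data/0.csv`, and `datapackage_url_from_request(first_as_csv)` gives back the original data package URL.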

Discussion

  • data.json is the serialization done in the most obvious way - i.e. convert to a hash
    • alternative: provide this in a results-style format (and include the schema)
  • Should we use download attribute to set filename ...?
    • Not needed in above
  • (Now supported) How do we handle multiple data resources / files?
    • worry about that in the future - so only support first resource for the moment (this is good as it privileges single resource data packages ...)
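
The two JSON serializations discussed above could look roughly like this. This is one reading only: "convert to a hash" is taken to mean one hash (dict) per row keyed by column name, and the `schema` / `results` key names in the wrapped form are assumptions, not something the proposal fixes.

```python
import csv
import io
import json


def rows_as_hashes(csv_text):
    """Plain serialization: a list with one dict per CSV row (values stay strings)."""
    return [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]


def rows_as_results(csv_text, schema):
    """Results-style serialization: the rows wrapped together with their schema.

    The 'schema' and 'results' keys are hypothetical names for illustration.
    """
    return {"schema": schema, "results": rows_as_hashes(csv_text)}


sample = "year,gdp\n2010,1.5\n2011,1.7\n"
schema = {"fields": [{"name": "year"}, {"name": "gdp"}]}
plain_json = json.dumps(rows_as_hashes(sample))
wrapped_json = json.dumps(rows_as_results(sample, schema))
```

The plain form is simpler to consume; the wrapped form lets a client read field metadata without a second request for datapackage.json.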

Appendix

Alternatives

Alternatively, the URLs could be:

{dataset}/{filename}.csv
{dataset}/{filename}.json (CORS enabled ...)

Or

{dataset}/data.csv

Think the former is better ...

Copied from original issue: frictionlessdata/frictionlessdata.io#19

@rufuspollock

@mihi-tr wrote in #83:

I do think we'll need to think along the lines of having CORS enabled access for the datasets. Based on the dataset.json format (which allows relative urls) the api should look like

.../dataset/datapackage.json 

and then have

.../dataset/path-to-data/filename

for the data files - this way it doesn't matter which package url I got pointed at

Alternatively: modify datapackage.json - this is very ugly IMO
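
The relative-URL layout described in this comment can be sketched with the standard library's `urljoin`: a resource path is resolved against wherever the datapackage.json was fetched from, so the same package works from any host. The function name and the example paths are hypothetical.

```python
from urllib.parse import urljoin


def resolve_resource_url(datapackage_url, resource_path):
    """Resolve a (possibly relative) resource path against the datapackage.json URL."""
    # Relative paths are resolved against the package location;
    # absolute URLs are returned unchanged.
    return urljoin(datapackage_url, resource_path)


# Illustrative values only; "data/annual.csv" is an assumed resource path.
base = "http://data.okfn.org/data/gdp/datapackage.json"
resolved = resolve_resource_url(base, "data/annual.csv")
```

With these illustrative values, `resolved` is `http://data.okfn.org/data/gdp/data/annual.csv`.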

@rufuspollock

@mihi-tr I don't know if you saw the extensive refactor of this proposal about a month ago. Please look at the proposal above. As part of #73 I actually implemented most of the proposal, at least for "core" datasets.

Please let me know if it addresses your proxy need, and whether there could be an even better API (I guess your biggest concern is datasets which are not in core, but note I propose a way to handle these, though it is not yet implemented).

@rufuspollock

From @mihi-tr on November 30, 2013 19:22

It would (I think). Would need to test this in a practical environment.

@rufuspollock

@mihi-tr here's an example of the current API style: http://data.okfn.org/data/s-and-p-500-companies/r/constituents.csv

@rufuspollock

Based on a convo with @mihi-tr today, downgrading priority to one star:

  • Unix philosophy - one tool for one job - if we need a data API tool for data packages, why not make it separate and small
    • only counter: the convenience here of a standard structure (but that's minor)
  • The "data package" data API tool is coming soon

@mihi-tr it would still be nice to know what exactly should be in that tool ...

@rufuspollock

Have updated the proposal to flesh out the case for general online data packages, which I now think is the priority (given that we plan not to do much cataloging in this site and in this app).

@rufuspollock

@mihi-tr could you look at the proposal in the main part of the issue about data packages online and let me know if this solves your requirements?
