Commit c3dac4f

Merge pull request #674 from martindurant/docs
Revamp docs
2 parents aaded42 + b19e38e commit c3dac4f

6 files changed

+184
-140
lines changed

docs/source/developer.rst

+24-3
@@ -94,7 +94,28 @@ Implementing async
9494
~~~~~~~~~~~~~~~~~~
9595

9696
Starting in version 0.7.5, we provide async operations for some methods
97-
of some implementations.
97+
of some implementations. Async support in storage implementations is
98+
optional. Special considerations are required for async
99+
development, see :doc:`async`.
98100

99-
This section will contain details on how to implement backends offering
100-
async, once the details are ironed out on our end.
101+
Developing the library
102+
~~~~~~~~~~~~~~~~~~~~~~
103+
104+
The following can be used to install ``fsspec`` in development mode:
105+
106+
.. code-block::
107+
108+
git clone https://github.com/intake/filesystem_spec
109+
cd filesystem_spec
110+
pip install -e .
111+
112+
A number of additional dependencies are required to run tests, see "ci/environment*.yml", as
113+
well as Docker. Most implementation-specific tests should skip if their requirements are
114+
not met.
115+
116+
Development happens by submitting pull requests (PRs) on GitHub.
117+
This repo adheres to flake8 and black coding conventions. You may wish to install
118+
commit hooks if you intend to make PRs, as linting is done as part of the CI.
119+
120+
Docs use sphinx and the numpy docstring style. Please add an entry to the changelog
121+
along with any PR.

docs/source/features.rst

+32-52
@@ -1,18 +1,6 @@
11
Features of fsspec
22
==================
33

4-
Consistent API to many different storage backends. The general API and functionality were
5-
proven with the projects `s3fs`_ and `gcsfs`_ (along with `hdfs3`_ and `adlfs`_), within the
6-
context of Dask and independently. These have been tried and tested by many users and shown their
7-
usefulness over some years. ``fsspec`` aims to build on these and unify their models, as well
8-
as extract out file-system handling code from Dask which does not so comfortably fit within a
9-
library designed for task-graph creation and their scheduling.
10-
11-
.. _s3fs: https://s3fs.readthedocs.io/en/latest/
12-
.. _gcsfs: https://gcsfs.readthedocs.io/en/latest/
13-
.. _hdfs3: https://hdfs3.readthedocs.io/en/latest/
14-
.. _adlfs: https://docs.microsoft.com/en-us/azure/data-lake-store/
15-
164
Here follows a brief description of some notable features of ``fsspec`` that make
175
it an interesting project beyond some other file-system abstractions.
186

@@ -50,20 +38,31 @@ the initiation of the context which actually does the work of creating file-like
5038
# f is now a real file-like object holding resources
5139
f.read(...)
5240
53-
Random Access and Buffering
54-
---------------------------
55-
56-
The :func:`fsspec.spec.AbstractBufferedFile` class is provided as an easy way to build file-like
57-
interfaces to some service which is capable of providing blocks of bytes. This class is derived
58-
from in a number of the existing implementations. A subclass of ``AbstractBufferedFile`` provides
59-
random access for the underlying file-like data (without downloading the whole thing) and
60-
configurable read-ahead buffers to minimise the number of the read operations that need to be
61-
performed on the back-end storage.
41+
File Buffering and random access
42+
--------------------------------
6243

63-
This is also a critical feature in the big-data access model, where each sub-task of an operation
44+
Most implementations create file objects which derive from ``fsspec.spec.AbstractBufferedFile``, and
45+
have many behaviours in common. A subclass of ``AbstractBufferedFile`` provides
46+
random access for the underlying file-like data (without downloading the whole thing).
47+
This is a critical feature in the big-data access model, where each sub-task of an operation
6448
may need only a small part of a file and does not, therefore, want to be forced into downloading the
6549
whole thing.
6650

51+
These files offer buffering of both read and write operations, so that
52+
communication with the remote resource is limited. The size of the buffer is generally configured
53+
with the ``blocksize=`` kwarg at open time, although the implementation may have some minimum or
54+
maximum sizes that need to be respected.
55+
56+
For reading, a number of buffering schemes are available, listed in ``fsspec.caching.caches``
57+
(see :ref:`readbuffering`), or "none" for no buffering at all, e.g., for a simple read-ahead
58+
buffer, you can do
59+
60+
.. code-block:: python
61+
62+
fs = fsspec.filesystem(...)
63+
with fs.open(path, mode='rb', cache_type='readahead') as f:
64+
use_for_something(f)
65+
6766
Transparent text-mode and compression
6867
-------------------------------------
6968

@@ -195,25 +194,6 @@ is called, so that subsequent listing of the given paths will force a refresh. I
195194
addition, some methods like ``ls`` have a ``refresh`` parameter to force fetching
196195
the listing again.
197196

198-
File Buffering
199-
--------------
200-
201-
Most implementations create file objects which derive from ``fsspec.spec.AbstractBufferedFile``, and
202-
have many behaviours in common. These files offer buffering of both read and write operations, so that
203-
communication with the remote resource is limited. The size of the buffer is generally configured
204-
with the ``blocksize=`` kwargs at open time, although the implementation may have some minimum or
205-
maximum sizes that need to be respected.
206-
207-
For reading, a number of buffering schemes are available, listed in ``fsspec.caching.caches``
208-
(see :ref:`readbuffering`), or "none" for no buffering at all, e.g., for a simple read-ahead
209-
buffer, you can do
210-
211-
.. code-block:: python
212-
213-
fs = fsspec.filesystem(...)
214-
with fs.open(path, mode='rb', cache_type='readahead') as f:
215-
use_for_something(f)
216-
217197
URL chaining
218198
------------
219199

@@ -344,10 +324,10 @@ shown (or if none are selected, all files are shown).
344324

345325
The interface provides the following outputs:
346326

347-
- ``.urlpath``: the currently selected item (if any)
348-
- ``.storage_options``: the value of the kwargs box
349-
- ``.fs``: the current filesystem instance
350-
- ``.open_file()``: produces an ``OpenFile`` instance for the current selection
327+
#. ``.urlpath``: the currently selected item (if any)
328+
#. ``.storage_options``: the value of the kwargs box
329+
#. ``.fs``: the current filesystem instance
330+
#. ``.open_file()``: produces an ``OpenFile`` instance for the current selection
351331

352332
Configuration
353333
-------------
@@ -388,16 +368,16 @@ the style ``FSSPEC_{protocol}_{kwargname}=value``.
388368

389369
Configuration is determined in the following order, with later items winning:
390370

391-
- the contents of ini files, and json files in the config directory, sorted
392-
alphabetically
393-
- environment variables
394-
- the contents of ``fsspec.config.conf``, which can be edited at runtime
395-
- kwargs explicitly passed, whether with ``fsspec.open``, ``fsspec.filesystem``
396-
or directly instantiating the implementation class.
371+
#. the contents of ini files, and json files in the config directory, sorted
372+
alphabetically
373+
#. environment variables
374+
#. the contents of ``fsspec.config.conf``, which can be edited at runtime
375+
#. kwargs explicitly passed, whether with ``fsspec.open``, ``fsspec.filesystem``
376+
or directly instantiating the implementation class.
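
The precedence rule above ("later items winning") can be pictured as a layered dictionary merge. The following is only an illustrative sketch in plain Python, not ``fsspec``'s actual implementation: the helper names ``parse_env`` and ``resolve``, and the kwarg names ``timeout``, ``retries`` and ``anon``, are hypothetical. It assumes environment variables follow the ``FSSPEC_{protocol}_{kwargname}=value`` style mentioned earlier.

```python
def parse_env(environ):
    """Collect FSSPEC_{protocol}_{kwargname} variables into a nested dict.

    Hypothetical helper for illustration; not part of fsspec's API.
    """
    conf = {}
    for key, value in environ.items():
        if not key.startswith("FSSPEC_"):
            continue
        parts = key.split("_", 2)
        if len(parts) < 3:
            continue  # no kwarg name present, ignore
        _, protocol, kwarg = parts
        conf.setdefault(protocol.lower(), {})[kwarg.lower()] = value
    return conf


def resolve(protocol, file_conf, env_conf, runtime_conf, kwargs):
    """Merge the four layers in order; later layers override earlier ones."""
    merged = {}
    for layer in (file_conf, env_conf, runtime_conf):
        merged.update(layer.get(protocol, {}))
    merged.update(kwargs)  # explicitly passed kwargs always win
    return merged


# Hypothetical values for each configuration layer
file_conf = {"s3": {"timeout": "10", "retries": "1", "anon": "true"}}
env_conf = parse_env({"FSSPEC_S3_TIMEOUT": "30", "FSSPEC_S3_RETRIES": "2", "HOME": "/root"})
runtime_conf = {"s3": {"retries": "3"}}

result = resolve("s3", file_conf, env_conf, runtime_conf, {"anon": "false"})
```

Here ``timeout`` comes from the environment (overriding the file), ``retries`` from the runtime config (overriding the environment), and ``anon`` from the explicit kwargs.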
397377

398378

399379
Asynchronous
400-
============
380+
------------
401381

402382
Some implementations, those deriving from ``fsspec.asyn.AsyncFileSystem``, have
403383
async/coroutine implementations of some file operations. The async methods have

docs/source/index.rst

+68-45
@@ -1,58 +1,91 @@
1-
FSSPEC: Filesystem interfaces for Python
1+
``fsspec``: Filesystem interfaces for Python
22
======================================
33

4-
Filesystem Spec (FSSPEC) is a project to unify various projects and classes to work with remote filesystems and
5-
file-system-like abstractions using a standard pythonic interface.
4+
Filesystem Spec (``fsspec``) is a project to provide a unified pythonic interface to
5+
local, remote and embedded file systems and bytes storage.
66

7+
Brief Overview
8+
--------------
79

8-
.. _highlight:
10+
There are many places to store bytes, from memory and the local disk to cluster
11+
distributed storage and the cloud. Many files also contain internal mappings of names to bytes,
12+
maybe in a hierarchical directory-oriented tree. Working with all these different
13+
storage media, and their associated libraries, is a pain. ``fsspec`` exists to
14+
provide a familiar API that will work the same whatever the storage backend.
15+
As much as possible, we iron out the quirks specific to each implementation,
16+
so you need do no more than provide credentials for each service you access
17+
(if needed) and thereafter not have to worry about the implementation again.
918

10-
Highlights
11-
----------
19+
Why
20+
---
1221

13-
- based on s3fs and gcsfs
14-
- ``fsspec`` instances are serializable and can be passed between processes/machines
15-
- the ``OpenFiles`` file-like instances are also serializable
16-
- implementations provide random access, to enable only the part of a file required to be read; plus a template
17-
to base other file-like classes on
18-
- file access can use transparent compression and text-mode
19-
- any file-system directory can be viewed as a key-value/mapping store
20-
- if installed, all file-system classes also subclass from ``pyarrow.filesystem.FileSystem``, so
21-
can work with any arrow function expecting such an instance
22-
- writes can be transactional: stored in a temporary location and only moved to the final
23-
destination when the transaction is committed
24-
- FUSE: mount any path from any backend to a point on your file-system
25-
- cached instances tokenised on the instance parameters
22+
``fsspec`` provides two main concepts: a set of filesystem classes with uniform APIs
23+
(i.e., functions such as ``cp``, ``rm``, ``cat``, ``mkdir``, ...) supplying operations on a range of
24+
storage systems; and top-level convenience functions like :func:`fsspec.open`, to allow
25+
you to quickly get from a URL to a file-like object that you can use with a third-party
26+
library or your own code.
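
As a minimal sketch of the two concepts, the snippet below uses the built-in in-memory filesystem so that it needs no credentials or network access. The ``memory://fsspec_overview_demo/...`` paths are made up for illustration, and it assumes ``fsspec`` is installed.

```python
import fsspec

# Concept 1: a filesystem instance with a uniform API
fs = fsspec.filesystem("memory")
fs.pipe("memory://fsspec_overview_demo/a.txt", b"hello")  # write bytes to a path
fs.cp("memory://fsspec_overview_demo/a.txt", "memory://fsspec_overview_demo/b.txt")
copied = fs.cat("memory://fsspec_overview_demo/b.txt")  # read the whole file back
fs.rm("memory://fsspec_overview_demo/a.txt")

# Concept 2: a top-level convenience function, going from URL to file-like object
with fsspec.open("memory://fsspec_overview_demo/b.txt", "rb") as f:
    data = f.read()
```

The same calls should work for other backends by changing the protocol and supplying whatever credentials that backend needs.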
2627

27-
These are described further in the :doc:`features` section.
28+
The section :doc:`background` gives motivation and history of this project, but
29+
most users will want to skip straight to :doc:`usage` to find out how to use
30+
the package and :doc:`features` to see the long list of added functionality
31+
included along with the basic file-system interface.
2832

29-
Installation
30-
------------
3133

32-
.. code-block:: sh
34+
Who uses ``fsspec``?
35+
--------------------
3336

34-
pip install fsspec
37+
You can use ``fsspec``'s file objects with any python function that accepts
38+
file objects, because of *duck typing*.
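
For instance, the standard library's ``json`` module only needs an object with ``read``/``write`` methods, so it can work with files opened through ``fsspec``. This is a small sketch using the in-memory filesystem; the path ``memory://fsspec_duck_demo/records.json`` and the record contents are made up for illustration.

```python
import json

import fsspec

record = {"name": "fsspec", "values": [1, 2, 3]}

# json.dump only requires a text-mode object with a .write() method
with fsspec.open("memory://fsspec_duck_demo/records.json", "w") as f:
    json.dump(record, f)

# json.load only requires a text-mode object with a .read() method
with fsspec.open("memory://fsspec_duck_demo/records.json", "r") as f:
    loaded = json.load(f)
```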
3539

36-
Not all included filesystems are usable by default without installing extra
37-
dependencies. For example to be able to access data in S3:
40+
You may well be using ``fsspec`` already without knowing it.
41+
The following libraries use ``fsspec`` internally for path and file handling:
3842

39-
.. code-block:: sh
43+
#. `Dask`_, the parallel, out-of-core and distributed
44+
programming platform
45+
#. `Intake`_, the data source cataloguing and loading
46+
library and its plugins
47+
#. `pandas`_, the tabular data analysis package
48+
#. `xarray`_ and `zarr`_, multidimensional array
49+
storage and labelled operations
50+
#. `DVC`_, version control system
51+
for machine learning projects
52+
53+
``fsspec`` filesystems are also supported by:
54+
55+
#. `pyarrow`_, the in-memory data layout engine
56+
57+
... plus many more that we don't know about.
58+
59+
.. _Dask: https://dask.org/
60+
.. _Intake: https://intake.readthedocs.io/
61+
.. _pandas: https://pandas.pydata.org/
62+
.. _xarray: http://xarray.pydata.org/
63+
.. _zarr: https://zarr.readthedocs.io/
64+
.. _DVC: https://dvc.org/
65+
.. _pyarrow: https://arrow.apache.org/docs/python/
4066

41-
pip install fsspec[s3]
4267

43-
or
68+
Installation
69+
------------
70+
71+
``fsspec`` can be installed from PyPI or conda and has no dependencies of its own:
4472

4573
.. code-block:: sh
4674
75+
pip install fsspec
4776
conda install -c conda-forge fsspec
4877
49-
Implementations
50-
---------------
78+
Not all filesystem implementations are available without installing extra
79+
dependencies. For example, to be able to access data in GCS, you can use the optional
80+
pip install syntax below, or install the specific package required
5181

52-
This repo contains several file-system implementations, see :ref:`implementations`. However,
53-
the external projects ``s3fs`` and ``gcsfs`` depend on ``fsspec`` and share the same behaviours.
54-
``Dask`` and ``Intake`` use ``fsspec`` internally for their IO needs.
82+
.. code-block:: sh
5583
84+
pip install fsspec[gcs]
85+
conda install -c conda-forge gcsfs
86+
87+
``fsspec`` attempts to provide the right message when you attempt to use a filesystem
88+
for which you need additional dependencies.
5689
The current list of known implementations can be found as follows
5790

5891
.. code-block:: python
@@ -61,12 +94,10 @@ The current list of known implementations can be found as follows
6194
6295
known_implementations
6396
64-
These are only imported on request, which may fail if a required dependency is missing. The dictionary
65-
``fsspec.registry`` contains all imported implementations, and can be mutated by user code, if necessary.
6697
6798
6899
.. toctree::
69-
:maxdepth: 2
100+
:maxdepth: 1
70101
:caption: Contents:
71102

72103
intro.rst
@@ -76,11 +107,3 @@ These are only imported on request, which may fail if a required dependency is m
76107
async.rst
77108
api.rst
78109
changelog.rst
79-
80-
81-
Indices and tables
82-
==================
83-
84-
* :ref:`genindex`
85-
* :ref:`modindex`
86-
* :ref:`search`

docs/source/intro.rst

+17-18
@@ -1,11 +1,5 @@
1-
Introduction
2-
============
3-
4-
To get stuck into using the package, rather than reading about its philosophy and history, you can
5-
skip to :doc:`usage`.
6-
71
Background
8-
----------
2+
==========
93

104
Python provides a standard interface for open files, so that alternate implementations of file-like objects can
115
work seamlessly with many functions which rely only on the methods of that standard interface. A number of libraries
@@ -21,7 +15,7 @@ other file-system implementations simpler.
2115
History
2216
-------
2317

24-
I (Martin Durant) have been involved in building a number of remote-data file-system implementations, principally
18+
We have been involved in building a number of remote-data file-system implementations, principally
2519
in the context of the `Dask`_ project. In particular, several are listed
2620
in `docs`_ with links to the specific repositories.
2721
With common authorship, there is much that is similar between the implementations, for example posix-like naming
@@ -57,21 +51,21 @@ Influences
5751
The following are places to consider when choosing the definitions of how we would like the file-system specification
5852
to look:
5953

60-
- python's `os`_ module and its `path` namespace; also other file-connected
61-
functionality in the standard library
62-
- posix/bash method naming conventions that linux/unix/osx users are familiar with; or perhaps their Windows variants
63-
- the existing implementations for the various backends (e.g.,
64-
`gcsfs`_ or Arrow's
65-
`hdfs`_)
66-
- `pyfilesystems`_, an attempt to do something similar, with a
67-
plugin architecture. This conception has several types of local file-system, and a lot of well-thought-out
68-
validation code.
54+
#. python's `os`_ module and its `path` namespace; also other file-connected
55+
functionality in the standard library
56+
#. posix/bash method naming conventions that linux/unix/osx users are familiar with; or perhaps their Windows variants
57+
#. the existing implementations for the various backends (e.g.,
58+
`gcsfs`_ or Arrow's
59+
`hdfs`_)
60+
#. `pyfilesystems`_, an attempt to do something similar, with a
61+
plugin architecture. This conception has several types of local file-system, and a lot of well-thought-out
62+
validation code.
6963

7064
.. _os: https://docs.python.org/3/library/os.html
7165
.. _gcsfs: http://gcsfs.readthedocs.io/en/latest/api.html#gcsfs.core.GCSFileSystem
7266
.. _pyfilesystems: https://docs.pyfilesystem.org/en/latest/index.html
7367

74-
Not pyfilesystems?
68+
Other similar work
7569
------------------
7670

7771
It might have been conceivable to reuse code in ``pyfilesystems``, which has an established interface and several
@@ -83,6 +77,11 @@ have an interface as close to those as possible. See a
8377

8478
.. _discussion: https://github.com/intake/filesystem_spec/issues/5
8579

80+
Other newer technologies such as `smart_open`_ and ``pyarrow``'s newer file-system rewrite also have some
81+
parts of the functionality presented here, that might suit some use cases better.
82+
83+
.. _smart_open: https://github.com/RaRe-Technologies/smart_open
84+
8685
Structure of the package
8786
------------------------
8887
