Add SQLite to Parquet Conversion Functionality #213
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
@@ -0,0 +1,125 @@
Data architecture
=================

Pycytominer data architecture documentation.

Distinct upstream data sources
------------------------------

Pycytominer's data flow depends on the upstream data source.
Different projects generate different kinds of data, and Pycytominer handles each kind differently.

* `CellProfiler <https://github.com/CellProfiler/CellProfiler>`_ generates `CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`_ data used by Pycytominer.
* `Cytominer-database <https://github.com/cytomining/Cytominer-database>`_ generates `SQLite <https://www.sqlite.org/>`_ databases (which include table data based on the CellProfiler CSVs mentioned above) used by Pycytominer.
* `DeepProfiler <https://github.com/cytomining/DeepProfiler>`_ generates `NPZ <https://numpy.org/doc/stable/reference/routines.io.html?highlight=npz%20format#numpy-binary-files-npy-npz>`_ data used by Pycytominer.
|
||||
SQLite data
-----------

Some parts of Pycytominer consume SQLite data sources.
This data source is currently considered somewhat deprecated for Pycytominer work.
|
||||
**SQLite data structure**

.. mermaid::

    erDiagram
        Image ||--o{ Cytoplasm : contains
        Image ||--o{ Cells : contains
        Image ||--o{ Nuclei : contains

Related SQLite databases have a structure loosely based on the above diagram.
There are generally four tables: Image, Cytoplasm, Cells, and Nuclei.
Each Image may contain zero to many Cells, Nuclei, or Cytoplasm data rows.
|
||||
**SQLite compartments**

The Cytoplasm, Cells, and Nuclei tables are generally referred to as "compartments".
While these three are often included in related SQLite datasets, other compartments may be present as well.
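Because the set of compartments is not fixed, it can help to discover them from the database itself. The following is a minimal sketch, assuming every table other than ``Image`` is a compartment; the ``list_compartments`` helper and the ``Mitochondria`` table are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory stand-in for a Cytominer-database file.
# "Mitochondria" is a made-up extra compartment used for illustration only.
conn = sqlite3.connect(":memory:")
for table in ("Image", "Cytoplasm", "Cells", "Nuclei", "Mitochondria"):
    conn.execute(f"CREATE TABLE {table} (TableNumber TEXT, ImageNumber INTEGER)")


def list_compartments(connection, image_table="Image"):
    """Return all user table names except the image table, sorted by name."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows if name != image_table]


compartments = list_compartments(conn)
print(compartments)
```

Reading table names from ``sqlite_master`` avoids hard-coding the three common compartments.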
|
||||
**SQLite common fields**

Each of the above tables includes ``TableNumber`` and ``ImageNumber`` fields, which are cross-related to data in other tables.
``ObjectNumber`` is sometimes, but not always, related to data across tables.
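The shared keys described above can be sketched with a small in-memory database. This is an illustration only: the ``Image_Width`` and ``Cells_AreaShape_Area`` columns and all values are assumed for the example. The join uses ``TableNumber`` and ``ImageNumber`` only, since ``ObjectNumber`` is not guaranteed to line up across tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Image (TableNumber TEXT, ImageNumber INTEGER, Image_Width INTEGER);
    CREATE TABLE Cells (TableNumber TEXT, ImageNumber INTEGER, ObjectNumber INTEGER,
                        Cells_AreaShape_Area REAL);
    INSERT INTO Image VALUES ('123abc', 1, 1080), ('123abc', 2, 1080);
    INSERT INTO Cells VALUES ('123abc', 1, 1, 250.0), ('123abc', 1, 2, 310.5);
    """
)

# LEFT JOIN keeps images that have no matching compartment rows,
# so an image with zero cells still appears with a count of 0.
query = """
    SELECT Image.TableNumber, Image.ImageNumber, COUNT(Cells.ObjectNumber) AS cell_count
    FROM Image
    LEFT JOIN Cells
        ON Cells.TableNumber = Image.TableNumber
        AND Cells.ImageNumber = Image.ImageNumber
    GROUP BY Image.TableNumber, Image.ImageNumber
    ORDER BY Image.ImageNumber
"""
cell_counts = conn.execute(query).fetchall()
print(cell_counts)
```

The second image has no ``Cells`` rows, which matches the "zero to many" relationship between Image and each compartment.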
|
||||
|
||||
**SQLite data production**

.. mermaid::

    flowchart LR
        subgraph Data
            direction LR
            cellprofiler_data[(CSV Files)] -.-> cytominerdatabase_data[(SQLite File)]
            cytominerdatabase_data[(SQLite File)]
        end
        subgraph Projects
            direction LR
            CellProfiler
            Cytominer-database
            Pycytominer
        end
        CellProfiler --> cellprofiler_data
        cellprofiler_data --> Cytominer-database
        Cytominer-database --> cytominerdatabase_data
        cytominerdatabase_data --> Pycytominer

Related SQLite data is originally created from `CellProfiler <https://github.com/CellProfiler/CellProfiler>`_ CSV data exports.
This CSV data is then converted to SQLite by `Cytominer-database <https://github.com/cytomining/Cytominer-database>`_.
|
||||
**Cytominer-database data transformations**

* Cytominer-database adds a field labeled ``TableNumber`` to all CSV tables from CellProfiler.
  This field is added to address dataset uniqueness, as CellProfiler sometimes resets ``ImageNumber``.
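The uniqueness problem that ``TableNumber`` addresses can be sketched as follows. The two CSV exports are invented for the example, and deriving the value from a content hash is an assumption made for this sketch; it is not necessarily how Cytominer-database computes ``TableNumber``.

```python
import csv
import hashlib
import io

# Two hypothetical CellProfiler exports whose ImageNumber values both start at 1.
export_a = "ImageNumber,Image_Width\n1,1080\n2,1080\n"
export_b = "ImageNumber,Image_Width\n1,1080\n"


def add_table_number(csv_text):
    """Prefix each row with a TableNumber derived from the export's content.

    Hashing the text is an assumption for illustration only.
    """
    table_number = hashlib.sha256(csv_text.encode("utf-8")).hexdigest()[:12]
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{"TableNumber": table_number, **row} for row in reader]


rows = add_table_number(export_a) + add_table_number(export_b)

# ImageNumber alone repeats across exports, but the composite key does not.
image_numbers = [row["ImageNumber"] for row in rows]
composite_keys = [(row["TableNumber"], row["ImageNumber"]) for row in rows]
print(len(set(image_numbers)), len(set(composite_keys)))
```

Once the rows are combined, ``ImageNumber`` is ambiguous on its own, while the pair of ``TableNumber`` and ``ImageNumber`` identifies each image.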
|
||||
Parquet data
------------

Pycytominer currently provides capabilities to convert data into the `Apache Parquet <https://parquet.apache.org/>`_ format.
|
||||
**Parquet from Cytominer-database SQLite data sources**

.. mermaid::

    flowchart LR
        subgraph Data
            direction LR
            cellprofiler_data[(CSV Files)] -.-> cytominerdatabase_data[(SQLite File)]
            cytominerdatabase_data[(SQLite File)] -.-> pycytominer_data[(Parquet File)]
        end
        subgraph Projects
            direction LR
            CellProfiler
            Cytominer-database
            subgraph Pycytominer
                direction LR
                Pycytominer_conversion[Parquet Conversion]
                Pycytominer_work[Parquet-based Work]
            end
        end
        CellProfiler --> cellprofiler_data
        cellprofiler_data --> Cytominer-database
        Cytominer-database --> cytominerdatabase_data
        cytominerdatabase_data --> Pycytominer_conversion
        Pycytominer_conversion --> pycytominer_data
        pycytominer_data --> Pycytominer_work

Pycytominer can convert `Cytominer-database <https://github.com/cytomining/Cytominer-database>`_ SQLite data into Parquet.
The resulting format stores the data from all SQLite tables in a single file, using the joinable keys ``TableNumber`` and ``ImageNumber`` together with null (None-type) values to indicate data relationships (or the lack thereof).

Conversion work may be performed using the following module: :ref:`sqliteconvert`

An example of the resulting Parquet data format for Pycytominer is shown below:
|
||||
|
||||
+--------------+--------------+-------------------------+---------------------+----------------------+------------------------+----------------------------+------------------------+-------------------------+
| TableNumber  | ImageNumber  | Cytoplasm_ObjectNumber  | Cells_ObjectNumber  | Nuclei_ObjectNumber  | Image_Fields...(many)  | Cytoplasm_Fields...(many)  | Cells_Fields...(many)  | Nuclei_Fields...(many)  |
+--------------+--------------+-------------------------+---------------------+----------------------+------------------------+----------------------------+------------------------+-------------------------+
| 123abc       | 1            | Null                    | Null                | Null                 | Image Data...          | Null                       | Null                   | Null                    |
+--------------+--------------+-------------------------+---------------------+----------------------+------------------------+----------------------------+------------------------+-------------------------+
| 123abc       | 1            | 1                       | Null                | Null                 | Null                   | Cytoplasm Data...          | Null                   | Null                    |
+--------------+--------------+-------------------------+---------------------+----------------------+------------------------+----------------------------+------------------------+-------------------------+
| 123abc       | 1            | Null                    | 1                   | Null                 | Null                   | Null                       | Cells Data...          | Null                    |
+--------------+--------------+-------------------------+---------------------+----------------------+------------------------+----------------------------+------------------------+-------------------------+
| 123abc       | 1            | Null                    | Null                | 1                    | Null                   | Null                       | Null                   | Nuclei Data...          |
+--------------+--------------+-------------------------+---------------------+----------------------+------------------------+----------------------------+------------------------+-------------------------+
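The row layout in the table above can be sketched in plain Python. This is not the ``sqliteconvert`` implementation: the ``table_rows`` and ``concat_sparse`` helpers, the feature columns, and the values are all assumed for illustration, and writing the result to a Parquet file is left out because it needs a Parquet library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript(
    """
    CREATE TABLE Image (TableNumber TEXT, ImageNumber INTEGER, Image_Width INTEGER);
    CREATE TABLE Cells (TableNumber TEXT, ImageNumber INTEGER, ObjectNumber INTEGER,
                        Cells_AreaShape_Area REAL);
    CREATE TABLE Nuclei (TableNumber TEXT, ImageNumber INTEGER, ObjectNumber INTEGER,
                         Nuclei_AreaShape_Area REAL);
    INSERT INTO Image VALUES ('123abc', 1, 1080);
    INSERT INTO Cells VALUES ('123abc', 1, 1, 250.0);
    INSERT INTO Nuclei VALUES ('123abc', 1, 1, 90.0);
    """
)

KEYS = ("TableNumber", "ImageNumber")


def table_rows(connection, table):
    """Read one table, renaming ObjectNumber to <table>_ObjectNumber."""
    rows = []
    for record in connection.execute(f"SELECT * FROM {table}"):
        row = dict(record)
        if "ObjectNumber" in row:
            row[f"{table}_ObjectNumber"] = row.pop("ObjectNumber")
        rows.append(row)
    return rows


def concat_sparse(connection, tables):
    """Stack rows from all tables, filling columns a row lacks with None."""
    all_rows = [row for table in tables for row in table_rows(connection, table)]
    columns = list(KEYS)
    for row in all_rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    return columns, [{name: row.get(name) for name in columns} for row in all_rows]


columns, combined = concat_sparse(conn, ["Image", "Cells", "Nuclei"])
for row in combined:
    print(row)
```

Each source row keeps its own fields and carries ``None`` in the columns that belong to other tables, so the shared keys remain the only link between rows.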
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@
Architecture
============

The following pages cover Pycytominer architecture.

.. toctree::
   :maxdepth: 2

   architecture.data
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,3 +5,4 @@ groundwork-sphinx-theme

 mock
 autodoc
+sphinxcontrib-mermaid
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -34,7 +34,8 @@
     aggregate_fields_count,
     aggregate_image_features,
 )
-from .sqlite import (
+from .sqlite.meta import engine_from_str, collect_columns, LIKE_NULLS, SQLITE_AFF_REF
just checking, did you run [...] Come to think of it, we should have an auto-black feature 🤔
I did run [...]
Just a follow up on this: when running [...]
+from .sqlite.clean import (
     clean_like_nulls,
     collect_columns,
     contains_conflicting_aff_storage_class,
@@ -43,3 +44,12 @@
     update_columns_to_nullable,
     update_values_like_null_to_null,
 )
+from .sqlite.convert import (
+    flow_convert_sqlite_to_parquet,
+    multi_to_single_parquet,
+    nan_data_fill,
+    sql_select_distinct_join_chunks,
+    sql_table_to_pd_dataframe,
+    table_concat_to_parquet,
+    to_unique_parquet,
+)
yesss! love it