Add SQLite to Parquet Conversion Functionality #213

Closed · wants to merge 14 commits

125 changes: 125 additions & 0 deletions docs/architecture.data.rst
@@ -0,0 +1,125 @@
Data architecture
=================

Pycytominer data architecture documentation.

Distinct upstream data sources
------------------------------

Pycytominer's data flow differs depending on the upstream data source.
Various projects generate different kinds of data, each of which is handled differently within Pycytominer; a short reading sketch follows the list below.

* `CellProfiler <https://github.com/CellProfiler/CellProfiler>`_ generates `CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`_ data used by Pycytominer.
* `Cytominer-database <https://github.com/cytomining/Cytominer-database>`_ generates `SQLite <https://www.sqlite.org/>`_ databases (which include table data based on the CellProfiler CSVs mentioned above) used by Pycytominer.
* `DeepProfiler <https://github.com/cytomining/DeepProfiler>`_ generates `NPZ <https://numpy.org/doc/stable/reference/routines.io.html?highlight=npz%20format#numpy-binary-files-npy-npz>`_ data used by Pycytominer.
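
The sketch below is illustrative only and shows one way each upstream format is commonly read in Python; the file names are placeholders rather than files produced by these projects.

.. code-block:: python

    # a minimal, hypothetical sketch of reading each upstream format
    import sqlite3

    import numpy as np
    import pandas as pd

    # CellProfiler: per-compartment CSV exports
    cells_csv = pd.read_csv("Cells.csv")

    # Cytominer-database: a single SQLite file containing the CSV-derived tables
    with sqlite3.connect("cytominer_database.sqlite") as con:
        image_table = pd.read_sql("SELECT * FROM Image", con)

    # DeepProfiler: NPZ (compressed NumPy) feature files
    npz = np.load("deepprofiler_features.npz")
    feature_arrays = {key: npz[key] for key in npz.files}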

SQLite data
-----------

Pycytominer consumes SQLite data sources in some areas.
This data source is currently considered deprecated for Pycytominer work.

**SQLite data structure**

.. mermaid::
Member: yesss! love it


erDiagram
Image ||--o{ Cytoplasm : contains
Image ||--o{ Cells : contains
Image ||--o{ Nuclei : contains

Related SQLite databases have a structure loosely based on the diagram above.
There are generally four tables: Image, Cytoplasm, Cells, and Nuclei.
Each Image may contain zero to many Cells, Nuclei, or Cytoplasm data rows.

**SQLite compartments**

The tables Cytoplasm, Cells, and Nuclei are generally referred to as "compartments".
While these are often included within related SQLite datasets, other compartments may also be present.

**SQLite common fields**

Each of the above tables includes ``TableNumber`` and ``ImageNumber`` fields, which cross-relate data across tables.
``ObjectNumber`` is sometimes also related to data across tables, but this is not guaranteed.
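
As a hypothetical illustration of these shared keys, the sketch below counts Cells rows per parent Image row; the file name is a placeholder.

.. code-block:: python

    # a minimal sketch, assuming a Cytominer-database style SQLite file (placeholder name)
    import sqlite3

    import pandas as pd

    with sqlite3.connect("cytominer_database.sqlite") as con:
        # cross-relate compartment rows to Image rows via TableNumber and ImageNumber
        cells_per_image = pd.read_sql(
            """
            SELECT Image.TableNumber, Image.ImageNumber,
                   COUNT(Cells.ObjectNumber) AS cell_count
            FROM Image
            LEFT JOIN Cells
              ON Cells.TableNumber = Image.TableNumber
             AND Cells.ImageNumber = Image.ImageNumber
            GROUP BY Image.TableNumber, Image.ImageNumber
            """,
            con,
        )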


**SQLite data production**

.. mermaid::

flowchart LR
subgraph Data
direction LR
cellprofiler_data[(CSV Files)] -.-> cytominerdatabase_data[(SQLite File)]
cytominerdatabase_data[(SQLite File)]
end
subgraph Projects
direction LR
CellProfiler
Cytominer-database
Pycytominer
end
CellProfiler --> cellprofiler_data
cellprofiler_data --> Cytominer-database
Cytominer-database --> cytominerdatabase_data
cytominerdatabase_data --> Pycytominer

Related SQLite data is originally created from `CellProfiler <https://github.com/CellProfiler/CellProfiler>`_ CSV data exports.
This CSV data is then converted to SQLite by `Cytominer-database <https://github.com/cytomining/Cytominer-database>`_.

**Cytominer-database data transformations**

* Cytominer-database adds a ``TableNumber`` field to every CSV-derived table from CellProfiler.
  This field is added to keep combined datasets unique, because CellProfiler sometimes resets ``ImageNumber`` (see the sketch below).
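
A hypothetical sketch of why this field matters: two CSV exports may each start ``ImageNumber`` at 1, so a per-source ``TableNumber``-style key is needed to keep combined rows unique.

.. code-block:: python

    # a minimal sketch (not Cytominer-database's implementation) of disambiguating
    # repeated ImageNumber values with a per-source key
    import pandas as pd

    export_a = pd.DataFrame({"ImageNumber": [1, 2], "Count_Cells": [10, 12]})
    export_b = pd.DataFrame({"ImageNumber": [1, 2], "Count_Cells": [7, 9]})

    # without an extra key, ImageNumber 1 would appear twice and be ambiguous
    combined = pd.concat(
        [export_a.assign(TableNumber="a1b2"), export_b.assign(TableNumber="c3d4")],
        ignore_index=True,
    )
    assert not combined.duplicated(subset=["TableNumber", "ImageNumber"]).any()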

Parquet data
------------

Pycytominer currently provides the capability to convert data into the `Apache Parquet <https://parquet.apache.org/>`_ format.

**Parquet from Cytominer-database SQLite data sources**

.. mermaid::

flowchart LR
subgraph Data
direction LR
cellprofiler_data[(CSV Files)] -.-> cytominerdatabase_data[(SQLite File)]
cytominerdatabase_data[(SQLite File)] -.-> pycytominer_data[(Parquet File)]
end
subgraph Projects
direction LR
CellProfiler
Cytominer-database
subgraph Pycytominer
direction LR
Pycytominer_conversion[Parquet Conversion]
Pycytominer_work[Parquet-based Work]
end
end
CellProfiler --> cellprofiler_data
cellprofiler_data --> Cytominer-database
Cytominer-database --> cytominerdatabase_data
cytominerdatabase_data --> Pycytominer_conversion
Pycytominer_conversion --> pycytominer_data
pycytominer_data --> Pycytominer_work

Pycytominer includes the capability to convert related `Cytominer-database <https://github.com/cytomining/Cytominer-database>`_ SQLite-based data into Parquet.
The resulting format stores the SQLite table data in a single file, using the joinable keys ``TableNumber`` and ``ImageNumber`` together with null (None-type) values to indicate data relationships (or the lack thereof).

Conversion work may be performed using the following module: :ref:`sqliteconvert`

An example of the resulting Parquet data format for Pycytominer may be found below:


+--------------+--------------+-------------------------+---------------------+-----------------------+------------------------+----------------------------+------------------------+--------------------------+
| TableNumber | ImageNumber | Cytoplasm_ObjectNumber | Cells_ObjectNumber | Nuclei_ObjectNumber | Image_Fields...(many) | Cytoplasm_Fields...(many) | Cells_Fields...(many) | Nuclei_Fields...(many) |
+--------------+--------------+-------------------------+---------------------+-----------------------+------------------------+----------------------------+------------------------+--------------------------+
| 123abc | 1 | Null | Null | Null | Image Data... | Null | Null | Null |
+--------------+--------------+-------------------------+---------------------+-----------------------+------------------------+----------------------------+------------------------+--------------------------+
| 123abc | 1 | 1 | Null | Null | Null | Cytoplasm Data... | Null | Null |
+--------------+--------------+-------------------------+---------------------+-----------------------+------------------------+----------------------------+------------------------+--------------------------+
| 123abc | 1 | Null | 1 | Null | Null | Null | Cells Data... | Null |
+--------------+--------------+-------------------------+---------------------+-----------------------+------------------------+----------------------------+------------------------+--------------------------+
| 123abc | 1 | Null | Null | 1 | Null | Null | Null | Nuclei Data... |
+--------------+--------------+-------------------------+---------------------+-----------------------+------------------------+----------------------------+------------------------+--------------------------+
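
Once converted, the single-file output can be read with standard Parquet tooling. The sketch below is illustrative; the file name is a placeholder and the column names follow the table above.

.. code-block:: python

    # a minimal sketch of consuming the converted parquet output (placeholder file name)
    import pandas as pd

    converted = pd.read_parquet("cytominer_database.parquet")

    # image-level rows: all compartment object-number columns are null
    image_rows = converted[
        converted[
            ["Cytoplasm_ObjectNumber", "Cells_ObjectNumber", "Nuclei_ObjectNumber"]
        ].isna().all(axis=1)
    ]

    # compartment rows: filter on a populated object-number column, e.g. Cells
    cells_rows = converted[converted["Cells_ObjectNumber"].notna()]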

9 changes: 9 additions & 0 deletions docs/architecture.rst
@@ -0,0 +1,9 @@
Architecture
============

The following pages cover pycytominer architecture.

.. toctree::
:maxdepth: 2

architecture.data
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -35,7 +35,7 @@
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ["sphinx.ext.autodoc", "sphinx.ext.napoleon"]
extensions = ["sphinx.ext.autodoc", "sphinx.ext.napoleon", "sphinxcontrib.mermaid"]

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
1 change: 1 addition & 0 deletions docs/index.rst
@@ -15,6 +15,7 @@ Software for processing image-based profiling readouts.
install
tutorial
modules
architecture

Indices and tables
==================
26 changes: 26 additions & 0 deletions docs/pycytominer.cyto_utils.rst
@@ -52,6 +52,32 @@ pycytominer.cyto\_utils.util module
:undoc-members:
:show-inheritance:

pycytominer.cyto\_utils.sqlite.clean module
---------------------------------------------

.. automodule:: pycytominer.cyto_utils.sqlite.clean
:members:
:undoc-members:
:show-inheritance:

.. _sqliteconvert:

pycytominer.cyto\_utils.sqlite.convert module
---------------------------------------------

.. automodule:: pycytominer.cyto_utils.sqlite.convert
:members:
:undoc-members:
:show-inheritance:

pycytominer.cyto\_utils.sqlite.meta module
---------------------------------------------

.. automodule:: pycytominer.cyto_utils.sqlite.meta
:members:
:undoc-members:
:show-inheritance:

pycytominer.cyto\_utils.write\_gct module
-----------------------------------------

1 change: 1 addition & 0 deletions docs/requirements.txt
@@ -5,3 +5,4 @@ groundwork-sphinx-theme

mock
autodoc
sphinxcontrib-mermaid
12 changes: 11 additions & 1 deletion pycytominer/cyto_utils/__init__.py
@@ -34,7 +34,8 @@
aggregate_fields_count,
aggregate_image_features,
)
from .sqlite import (
from .sqlite.meta import engine_from_str, collect_columns, LIKE_NULLS, SQLITE_AFF_REF
Member: just checking, did you run black on all .py files? We've been using black for formatting.

Come to think of it, we should have an auto-black feature 🤔

Member (Author): I did run black on all .py files. I'll improve the look here per your comment.

Member (Author): Just a follow-up on this: when running black against this file, it retains the existing formatting for that line.

from .sqlite.clean import (
clean_like_nulls,
collect_columns,
contains_conflicting_aff_storage_class,
@@ -43,3 +44,12 @@
update_columns_to_nullable,
update_values_like_null_to_null,
)
from .sqlite.convert import (
flow_convert_sqlite_to_parquet,
multi_to_single_parquet,
nan_data_fill,
sql_select_distinct_join_chunks,
sql_table_to_pd_dataframe,
table_concat_to_parquet,
to_unique_parquet,
)
Empty file.
@@ -1,155 +1,17 @@
"""
Pycytominer SQLite utilities
Pycytominer SQLite utilities - cleaning functions
"""

import logging
import os
import sqlite3
from typing import Optional, Tuple, Union

from sqlalchemy import create_engine
from sqlalchemy.engine.base import Engine

logger = logging.getLogger(__name__)

# A reference dictionary for SQLite affinity and storage class types
# See more here: https://www.sqlite.org/datatype3.html#affinity_name_examples
SQLITE_AFF_REF = {
"INTEGER": [
"INT",
"INTEGER",
"TINYINT",
"SMALLINT",
"MEDIUMINT",
"BIGINT",
"UNSIGNED BIG INT",
"INT2",
"INT8",
],
"TEXT": [
"CHARACTER",
"VARCHAR",
"VARYING CHARACTER",
"NCHAR",
"NATIVE CHARACTER",
"NVARCHAR",
"TEXT",
"CLOB",
],
"BLOB": ["BLOB"],
"REAL": [
"REAL",
"DOUBLE",
"DOUBLE PRECISION",
"FLOAT",
],
"NUMERIC": [
"NUMERIC",
"DECIMAL",
"BOOLEAN",
"DATE",
"DATETIME",
],
}
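
# Illustrative lookup only (not part of this module): the reference above maps a
# declared column type name to its SQLite affinity, e.g.
#   affinity = next(aff for aff, names in SQLITE_AFF_REF.items() if "VARCHAR" in names)
#   # affinity == "TEXT"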

# strings which may represent null values
LIKE_NULLS = ("null", "none", "nan")


def engine_from_str(sql_engine: Union[str, Engine]) -> Engine:
"""
Helper function to create engine from a string or return the engine
if it's already been created.

Parameters
----------
sql_engine: str | sqlalchemy.engine.base.Engine
filename of the SQLite database or existing sqlalchemy engine

Returns
-------
sqlalchemy.engine.base.Engine
A SQLAlchemy engine
"""

# check the type of sql_engine passed and create engine if we have a str
if isinstance(sql_engine, str):
# if we don't already have the sqlite filestring, add it
if "sqlite:///" not in sql_engine:
sql_engine = f"sqlite:///{sql_engine}"
engine = create_engine(sql_engine)
else:
engine = sql_engine

return engine
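
# Usage sketch (illustrative only, not part of this module): engine_from_str accepts
# either a SQLite filename or an existing SQLAlchemy engine.
#   engine = engine_from_str("example.sqlite")
#   str(engine.url)  # "sqlite:///example.sqlite"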

from .meta import LIKE_NULLS, SQLITE_AFF_REF, collect_columns, engine_from_str

def collect_columns(
sql_engine: Union[str, Engine],
table_name: Optional[str] = None,
column_name: Optional[str] = None,
) -> list:
"""
Collect a list of columns from the given engine's
database using optional table or column level
specification.

Parameters
----------
sql_engine: str | sqlalchemy.engine.base.Engine
filename of the SQLite database or existing sqlalchemy engine
table_name: str
optional specific table name to check within database, by default None
column_name: str
optional specific column name to check within database, by default None

Returns
-------
list
Returns list, and if populated, contains tuples with values
similar to the following. These may also be accessed by name
similar to dictionaries, as they are SQLAlchemy Row objects.
[('table_name', 'column_name', 'column_type', 'notnull'),...]
"""

# create column list for return result
column_list = []

# create an engine
engine = engine_from_str(sql_engine)

with engine.connect() as connection:
if table_name is None:
# if no table name is provided, we assume all tables must be scanned
tables = connection.execute(
"SELECT name as table_name FROM sqlite_master WHERE type = 'table';"
).fetchall()
else:
# otherwise we will focus on just the table name provided
tables = [{"table_name": table_name}]

for table in tables:

# if no column name is specified we will focus on all columns within the table
sql_stmt = """
SELECT :table_name as table_name,
name as column_name,
type as column_type,
[notnull]
FROM pragma_table_info(:table_name)
"""

if column_name is not None:
# otherwise we will focus on only the column name provided
sql_stmt = f"{sql_stmt} WHERE name = :col_name;"

# append to column list the results
column_list += connection.execute(
sql_stmt,
{"table_name": str(table["table_name"]), "col_name": str(column_name)},
).fetchall()

return column_list
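
# Usage sketch (illustrative only, not part of this module): collect column metadata
# for one table of a SQLite file, e.g.
#   columns = collect_columns("example.sqlite", table_name="Image")
#   # -> [('Image', '<column_name>', '<column_type>', <notnull>), ...]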
logger = logging.getLogger(__name__)


def contains_conflicting_aff_storage_class(