Add SQLite to Parquet Conversion Functionality #213

Closed · wants to merge 14 commits

125 changes: 125 additions & 0 deletions docs/architecture.data.rst
@@ -0,0 +1,125 @@
Data architecture
=================

Pycytominer data architecture documentation.

Distinct upstream data sources
------------------------------

Pycytominer's data flow differs depending on the upstream data source.
Various projects generate different kinds of data, each of which is handled differently within Pycytominer; a short reading sketch follows the list below.

* `CellProfiler <https://github.com/CellProfiler/CellProfiler>`_ generates `CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`_ data used by Pycytominer.
* `Cytominer-database <https://github.com/cytomining/Cytominer-database>`_ generates `SQLite <https://www.sqlite.org/>`_ databases (which include table data based on the CellProfiler CSVs mentioned above) used by Pycytominer.
* `DeepProfiler <https://github.com/cytomining/DeepProfiler>`_ generates `NPZ <https://numpy.org/doc/stable/reference/routines.io.html?highlight=npz%20format#numpy-binary-files-npy-npz>`_ data used by Pycytominer.
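
The sketch below is illustrative only and shows one way each upstream format is commonly read in Python; the file names are placeholders rather than files produced by these projects.

.. code-block:: python

    # a minimal, hypothetical sketch of reading each upstream format
    import sqlite3

    import numpy as np
    import pandas as pd

    # CellProfiler: per-compartment CSV exports
    cells_csv = pd.read_csv("Cells.csv")

    # Cytominer-database: a single SQLite file containing the CSV-derived tables
    with sqlite3.connect("cytominer_database.sqlite") as con:
        image_table = pd.read_sql("SELECT * FROM Image", con)

    # DeepProfiler: NPZ (compressed NumPy) feature files
    npz = np.load("deepprofiler_features.npz")
    feature_arrays = {key: npz[key] for key in npz.files}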

SQLite data
-----------

Pycytominer consumes SQLite data sources in some areas.
This data source is currently considered deprecated for Pycytominer work.

**SQLite data structure**

.. mermaid::
Member: yesss! love it


erDiagram
Image ||--o{ Cytoplasm : contains
Image ||--o{ Cells : contains
Image ||--o{ Nuclei : contains

Related SQLite databases have a structure loosely based on the diagram above.
There are generally four tables: Image, Cytoplasm, Cells, and Nuclei.
Each Image may contain zero to many Cells, Nuclei, or Cytoplasm data rows.

**SQLite compartments**

The tables Cytoplasm, Cells, and Nuclei are generally referred to as "compartments".
While these are often included within related SQLite datasets, other compartments may also be present.

**SQLite common fields**

Each of the above tables includes ``TableNumber`` and ``ImageNumber`` fields, which cross-relate data across tables.
``ObjectNumber`` is sometimes also related to data across tables, but this is not guaranteed.
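
As a hypothetical illustration of these shared keys, the sketch below counts Cells rows per parent Image row; the file name is a placeholder.

.. code-block:: python

    # a minimal sketch, assuming a Cytominer-database style SQLite file (placeholder name)
    import sqlite3

    import pandas as pd

    with sqlite3.connect("cytominer_database.sqlite") as con:
        # cross-relate compartment rows to Image rows via TableNumber and ImageNumber
        cells_per_image = pd.read_sql(
            """
            SELECT Image.TableNumber, Image.ImageNumber,
                   COUNT(Cells.ObjectNumber) AS cell_count
            FROM Image
            LEFT JOIN Cells
              ON Cells.TableNumber = Image.TableNumber
             AND Cells.ImageNumber = Image.ImageNumber
            GROUP BY Image.TableNumber, Image.ImageNumber
            """,
            con,
        )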


**SQLite data production**

.. mermaid::

flowchart LR
subgraph Data
direction LR
cellprofiler_data[(CSV Files)] -.-> cytominerdatabase_data[(SQLite File)]
cytominerdatabase_data[(SQLite File)]
end
subgraph Projects
direction LR
CellProfiler
Cytominer-database
Pycytominer
end
CellProfiler --> cellprofiler_data
cellprofiler_data --> Cytominer-database
Cytominer-database --> cytominerdatabase_data
cytominerdatabase_data --> Pycytominer

Related SQLite data is originally created from `CellProfiler <https://github.com/CellProfiler/CellProfiler>`_ CSV data exports.
This CSV data is then converted to SQLite by `Cytominer-database <https://github.com/cytomining/Cytominer-database>`_.

**Cytominer-database data transformations**

* Cytominer-database adds a ``TableNumber`` field to every CSV-derived table from CellProfiler.
  This field is added to keep combined datasets unique, because CellProfiler sometimes resets ``ImageNumber`` (see the sketch below).
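
A hypothetical sketch of why this field matters: two CSV exports may each start ``ImageNumber`` at 1, so a per-source ``TableNumber``-style key is needed to keep combined rows unique.

.. code-block:: python

    # a minimal sketch (not Cytominer-database's implementation) of disambiguating
    # repeated ImageNumber values with a per-source key
    import pandas as pd

    export_a = pd.DataFrame({"ImageNumber": [1, 2], "Count_Cells": [10, 12]})
    export_b = pd.DataFrame({"ImageNumber": [1, 2], "Count_Cells": [7, 9]})

    # without an extra key, ImageNumber 1 would appear twice and be ambiguous
    combined = pd.concat(
        [export_a.assign(TableNumber="a1b2"), export_b.assign(TableNumber="c3d4")],
        ignore_index=True,
    )
    assert not combined.duplicated(subset=["TableNumber", "ImageNumber"]).any()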

Parquet data
------------

Pycytominer currently provides the capability to convert data into the `Apache Parquet <https://parquet.apache.org/>`_ format.

**Parquet from Cytominer-database SQLite data sources**

.. mermaid::

flowchart LR
subgraph Data
direction LR
cellprofiler_data[(CSV Files)] -.-> cytominerdatabase_data[(SQLite File)]
cytominerdatabase_data[(SQLite File)] -.-> pycytominer_data[(Parquet File)]
end
subgraph Projects
direction LR
CellProfiler
Cytominer-database
subgraph Pycytominer
direction LR
Pycytominer_conversion[Parquet Conversion]
Pycytominer_work[Parquet-based Work]
end
end
CellProfiler --> cellprofiler_data
cellprofiler_data --> Cytominer-database
Cytominer-database --> cytominerdatabase_data
cytominerdatabase_data --> Pycytominer_conversion
Pycytominer_conversion --> pycytominer_data
pycytominer_data --> Pycytominer_work

Pycytominer includes the capability to convert related `Cytominer-database <https://github.com/cytomining/Cytominer-database>`_ SQLite-based data into Parquet.
The resulting format stores the SQLite table data in a single file, using the joinable keys ``TableNumber`` and ``ImageNumber`` together with null (None-type) values to indicate data relationships (or the lack thereof).

Conversion work may be performed using the following module: :ref:`sqliteconvert`

An example of the resulting Parquet data format for Pycytominer may be found below:


+--------------+--------------+-------------------------+---------------------+-----------------------+------------------------+----------------------------+------------------------+--------------------------+
| TableNumber | ImageNumber | Cytoplasm_ObjectNumber | Cells_ObjectNumber | Nuclei_ObjectNumber | Image_Fields...(many) | Cytoplasm_Fields...(many) | Cells_Fields...(many) | Nuclei_Fields...(many) |
+--------------+--------------+-------------------------+---------------------+-----------------------+------------------------+----------------------------+------------------------+--------------------------+
| 123abc | 1 | Null | Null | Null | Image Data... | Null | Null | Null |
+--------------+--------------+-------------------------+---------------------+-----------------------+------------------------+----------------------------+------------------------+--------------------------+
| 123abc | 1 | 1 | Null | Null | Null | Cytoplasm Data... | Null | Null |
+--------------+--------------+-------------------------+---------------------+-----------------------+------------------------+----------------------------+------------------------+--------------------------+
| 123abc | 1 | Null | 1 | Null | Null | Null | Cells Data... | Null |
+--------------+--------------+-------------------------+---------------------+-----------------------+------------------------+----------------------------+------------------------+--------------------------+
| 123abc | 1 | Null | Null | 1 | Null | Null | Null | Nuclei Data... |
+--------------+--------------+-------------------------+---------------------+-----------------------+------------------------+----------------------------+------------------------+--------------------------+
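
Once converted, the single-file output can be read with standard Parquet tooling. The sketch below is illustrative; the file name is a placeholder and the column names follow the table above.

.. code-block:: python

    # a minimal sketch of consuming the converted parquet output (placeholder file name)
    import pandas as pd

    converted = pd.read_parquet("cytominer_database.parquet")

    # image-level rows: all compartment object-number columns are null
    image_rows = converted[
        converted[
            ["Cytoplasm_ObjectNumber", "Cells_ObjectNumber", "Nuclei_ObjectNumber"]
        ].isna().all(axis=1)
    ]

    # compartment rows: filter on a populated object-number column, e.g. Cells
    cells_rows = converted[converted["Cells_ObjectNumber"].notna()]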

9 changes: 9 additions & 0 deletions docs/architecture.rst
@@ -0,0 +1,9 @@
Architecture
============

The following pages cover pycytominer architecture.

.. toctree::
:maxdepth: 2

architecture.data
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -35,7 +35,7 @@
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ["sphinx.ext.autodoc", "sphinx.ext.napoleon"]
extensions = ["sphinx.ext.autodoc", "sphinx.ext.napoleon", "sphinxcontrib.mermaid"]

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
1 change: 1 addition & 0 deletions docs/index.rst
@@ -15,6 +15,7 @@ Software for processing image-based profiling readouts.
install
tutorial
modules
architecture

Indices and tables
==================
26 changes: 26 additions & 0 deletions docs/pycytominer.cyto_utils.rst
@@ -52,6 +52,32 @@ pycytominer.cyto\_utils.util module
:undoc-members:
:show-inheritance:

pycytominer.cyto\_utils.sqlite.clean module
---------------------------------------------

.. automodule:: pycytominer.cyto_utils.sqlite.clean
:members:
:undoc-members:
:show-inheritance:

.. _sqliteconvert:

pycytominer.cyto\_utils.sqlite.convert module
---------------------------------------------

.. automodule:: pycytominer.cyto_utils.sqlite.convert
:members:
:undoc-members:
:show-inheritance:

pycytominer.cyto\_utils.sqlite.meta module
---------------------------------------------

.. automodule:: pycytominer.cyto_utils.sqlite.meta
:members:
:undoc-members:
:show-inheritance:

pycytominer.cyto\_utils.write\_gct module
-----------------------------------------

1 change: 1 addition & 0 deletions docs/requirements.txt
@@ -5,3 +5,4 @@ groundwork-sphinx-theme

mock
autodoc
sphinxcontrib-mermaid
12 changes: 11 additions & 1 deletion pycytominer/cyto_utils/__init__.py
@@ -34,7 +34,8 @@
aggregate_fields_count,
aggregate_image_features,
)
from .sqlite import (
from .sqlite.meta import engine_from_str, collect_columns, LIKE_NULLS, SQLITE_AFF_REF
Member: just checking, did you run black on all .py files? We've been using black for formatting.

Come to think of it, we should have an auto-black feature 🤔

Member (Author): I did run black on all .py files. I'll improve the look here per your comment.

Member (Author): Just a follow-up on this: when running black against this file, it retains the existing formatting for that line.

from .sqlite.clean import (
clean_like_nulls,
collect_columns,
contains_conflicting_aff_storage_class,
@@ -43,3 +44,12 @@
update_columns_to_nullable,
update_values_like_null_to_null,
)
from .sqlite.convert import (
flow_convert_sqlite_to_parquet,
multi_to_single_parquet,
nan_data_fill,
sql_select_distinct_join_chunks,
sql_table_to_pd_dataframe,
table_concat_to_parquet,
to_unique_parquet,
)
Empty file.
@@ -1,155 +1,17 @@
"""
Pycytominer SQLite utilities
Pycytominer SQLite utilities - cleaning functions
"""

import logging
import os
import sqlite3
from typing import Optional, Tuple, Union

from sqlalchemy import create_engine
from sqlalchemy.engine.base import Engine

logger = logging.getLogger(__name__)

# A reference dictionary for SQLite affinity and storage class types
# See more here: https://www.sqlite.org/datatype3.html#affinity_name_examples
SQLITE_AFF_REF = {
"INTEGER": [
"INT",
"INTEGER",
"TINYINT",
"SMALLINT",
"MEDIUMINT",
"BIGINT",
"UNSIGNED BIG INT",
"INT2",
"INT8",
],
"TEXT": [
"CHARACTER",
"VARCHAR",
"VARYING CHARACTER",
"NCHAR",
"NATIVE CHARACTER",
"NVARCHAR",
"TEXT",
"CLOB",
],
"BLOB": ["BLOB"],
"REAL": [
"REAL",
"DOUBLE",
"DOUBLE PRECISION",
"FLOAT",
],
"NUMERIC": [
"NUMERIC",
"DECIMAL",
"BOOLEAN",
"DATE",
"DATETIME",
],
}
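
# Illustrative lookup only (not part of this module): the reference above maps a
# declared column type name to its SQLite affinity, e.g.
#   affinity = next(aff for aff, names in SQLITE_AFF_REF.items() if "VARCHAR" in names)
#   # affinity == "TEXT"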

# strings which may represent null values
LIKE_NULLS = ("null", "none", "nan")


def engine_from_str(sql_engine: Union[str, Engine]) -> Engine:
"""
Helper function to create engine from a string or return the engine
if it's already been created.

Parameters
----------
sql_engine: str | sqlalchemy.engine.base.Engine
filename of the SQLite database or existing sqlalchemy engine

Returns
-------
sqlalchemy.engine.base.Engine
A SQLAlchemy engine
"""

# check the type of sql_engine passed and create engine if we have a str
if isinstance(sql_engine, str):
# if we don't already have the sqlite filestring, add it
if "sqlite:///" not in sql_engine:
sql_engine = f"sqlite:///{sql_engine}"
engine = create_engine(sql_engine)
else:
engine = sql_engine

return engine
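
# Usage sketch (illustrative only, not part of this module): engine_from_str accepts
# either a SQLite filename or an existing SQLAlchemy engine.
#   engine = engine_from_str("example.sqlite")
#   str(engine.url)  # "sqlite:///example.sqlite"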

from .meta import LIKE_NULLS, SQLITE_AFF_REF, collect_columns, engine_from_str

def collect_columns(
sql_engine: Union[str, Engine],
table_name: Optional[str] = None,
column_name: Optional[str] = None,
) -> list:
"""
Collect a list of columns from the given engine's
database using optional table or column level
specification.

Parameters
----------
sql_engine: str | sqlalchemy.engine.base.Engine
filename of the SQLite database or existing sqlalchemy engine
table_name: str
optional specific table name to check within database, by default None
column_name: str
optional specific column name to check within database, by default None

Returns
-------
list
Returns list, and if populated, contains tuples with values
similar to the following. These may also be accessed by name
similar to dictionaries, as they are SQLAlchemy Row objects.
[('table_name', 'column_name', 'column_type', 'notnull'),...]
"""

# create column list for return result
column_list = []

# create an engine
engine = engine_from_str(sql_engine)

with engine.connect() as connection:
if table_name is None:
# if no table name is provided, we assume all tables must be scanned
tables = connection.execute(
"SELECT name as table_name FROM sqlite_master WHERE type = 'table';"
).fetchall()
else:
# otherwise we will focus on just the table name provided
tables = [{"table_name": table_name}]

for table in tables:

# if no column name is specified we will focus on all columns within the table
sql_stmt = """
SELECT :table_name as table_name,
name as column_name,
type as column_type,
[notnull]
FROM pragma_table_info(:table_name)
"""

if column_name is not None:
# otherwise we will focus on only the column name provided
sql_stmt = f"{sql_stmt} WHERE name = :col_name;"

# append to column list the results
column_list += connection.execute(
sql_stmt,
{"table_name": str(table["table_name"]), "col_name": str(column_name)},
).fetchall()

return column_list
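
# Usage sketch (illustrative only, not part of this module): collect column metadata
# for one table of a SQLite file, e.g.
#   columns = collect_columns("example.sqlite", table_name="Image")
#   # -> [('Image', '<column_name>', '<column_type>', <notnull>), ...]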
logger = logging.getLogger(__name__)


def contains_conflicting_aff_storage_class(