[Bug] IndexError: index out of bounds #3665

Open
nick-youngblut opened this issue Feb 4, 2025 · 0 comments

Describe the bug

I'm loading in large h5ad files via scanpy.read_h5ad and then creating/appending via tiledbsoma.io.from_anndata. The relevant code:

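import gc
import logging
import os
from typing import List

import scanpy as sc
import tiledbsoma
import tiledbsoma.io

def append_to_database(db_uri: str, adata: sc.AnnData) -> None: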
    """
    Append an AnnData object to the TileDB database.
    Args:
        db_uri: URI of the TileDB database
        adata: AnnData object to append
    """
    logging.info("  Appending data to TileDB...")

    # Register AnnData objects
    rd = tiledbsoma.io.register_anndatas(
        db_uri,
        [adata],
        measurement_name="RNA",
        obs_field_name="obs_id",
        var_field_name="var_id",
    )

    # Apply resize
    with tiledbsoma.Experiment.open(db_uri) as exp:
        tiledbsoma.io.resize_experiment(
            exp.uri,
            nobs=rd.get_obs_shape(),
            nvars=rd.get_var_shapes()
        )

    # Ingest new data into the db
    tiledbsoma.io.from_anndata(
        db_uri,
        adata,
        measurement_name="RNA",
        registration_mapping=rd,
    )

def create_tiledb(db_uri: str, adata: sc.AnnData) -> None:
    """
    Create a new tiledb database.
    Args:
        db_uri: URI of the TileDB database
        adata: AnnData object to ingest
    """
    logging.info("  Creating new database...")
    tiledbsoma.io.from_anndata(
        db_uri,
        adata,
        measurement_name="RNA",
    )

def load_tiledb(h5ad_files: List[str], db_uri: str, batch_size: int=8) -> None:
    for infile in h5ad_files:
        logging.info(f"Processing {infile}...")
        # load anndata object
        adata = sc.read_h5ad(infile)
        # add to database
        if not os.path.exists(db_uri):
            create_tiledb(db_uri, adata)
        else:
            append_to_database(db_uri, adata)
        # clear memory
        del adata
        gc.collect()

The error that occurs on the append:

[2025-02-03 19:29:47.413] [tiledbsoma] [Process: 848304] [Thread: 848304] [warning] [TileDB-SOMA::ManagedQuery] [unnamed] Invalid column selected: obs_id
Traceback (most recent call last):
  File "/home/nickyoungblut/dev/nextflow/scRecounter/./scripts/tiledb-loader-tahoe.py", line 190, in <module>
    #load_tiledb(h5ad_files, args.db_uri, batch_size=args.threads)
    ^^^^^^
  File "/home/nickyoungblut/dev/nextflow/scRecounter/./scripts/tiledb-loader-tahoe.py", line 184, in main
    #print(h5ad_files); exit();
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nickyoungblut/dev/nextflow/scRecounter/./scripts/tiledb-loader-tahoe.py", line 160, in load_tiledb
    append_to_database(db_uri, adata)
  File "/home/nickyoungblut/dev/nextflow/scRecounter/./scripts/tiledb-loader-tahoe.py", line 113, in append_to_database
    rd = tiledbsoma.io.register_anndatas(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nickyoungblut/miniforge3/envs/tiledb/lib/python3.12/site-packages/tiledbsoma/io/ingest.py", line 225, in register_anndatas
    return ExperimentAmbientLabelMapping.from_anndata_appends_on_experiment(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nickyoungblut/miniforge3/envs/tiledb/lib/python3.12/site-packages/tiledbsoma/io/_registration/ambient_label_mappings.py", line 419, in from_anndata_appends_on_experiment
    registration_data = cls._acquire_experiment_mappings(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nickyoungblut/miniforge3/envs/tiledb/lib/python3.12/site-packages/tiledbsoma/io/_registration/ambient_label_mappings.py", line 376, in _acquire_experiment_mappings
    registration_data = cls.from_isolated_soma_experiment(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nickyoungblut/miniforge3/envs/tiledb/lib/python3.12/site-packages/tiledbsoma/io/_registration/ambient_label_mappings.py", line 242, in from_isolated_soma_experiment
    obs_ids = [e.as_py() for e in batch[1]]
                                  ~~~~~^^^
  File "pyarrow/table.pxi", line 1693, in pyarrow.lib._Tabular.__getitem__
  File "pyarrow/table.pxi", line 1779, in pyarrow.lib._Tabular.column
  File "pyarrow/table.pxi", line 5175, in pyarrow.lib.Table._column
  File "pyarrow/array.pxi", line 598, in pyarrow.lib._normalize_index
IndexError: index out of bounds

The error appears to be caused by the obs_id and/or var_id columns not existing in these h5ad files. However, when I added:

if "obs_id" not in adata.obs.columns:
    adata.obs["obs_id"] = adata.obs.index
if "var_id" not in adata.var.columns:
    adata.var["var_id"] = adata.var.index

... I just get a segfault during the first append (the 2nd h5ad file), after the initial creation of the database from the first h5ad file. The machine has 512 GB of memory, so running out of memory should not be the issue.
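
To narrow down which side is missing the ID fields, a small pre-flight check could be run before calling register_anndatas. This is a minimal sketch, not part of tiledbsoma: the helpers `missing_fields` and `check_id_fields` are hypothetical, and the column lists in the example are made up. It only compares column names and does not touch the database.

```python
from typing import Dict, Iterable, List


def missing_fields(columns: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


def check_id_fields(
    obs_columns: Iterable[str],
    var_columns: Iterable[str],
    obs_field_name: str = "obs_id",
    var_field_name: str = "var_id",
) -> Dict[str, List[str]]:
    """Report which ID fields are missing from the obs and var column lists."""
    return {
        "obs": missing_fields(obs_columns, [obs_field_name]),
        "var": missing_fields(var_columns, [var_field_name]),
    }


# Made-up column names standing in for adata.obs.columns / adata.var.columns.
report = check_id_fields(["cell_type", "sample"], ["gene_symbol", "var_id"])
print(report)  # {'obs': ['obs_id'], 'var': []}
```

For an AnnData object the inputs would be adata.obs.columns and adata.var.columns. For the already-created experiment, the column names could presumably be taken from the schema of exp.obs and exp.ms["RNA"].var, but that part is an assumption and has not been verified against 1.15.4.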

To Reproduce
Provide a code example and any sample input data (e.g. an H5AD) as an attachment to reproduce this behavior.

Versions (please complete the following information):

  • TileDB-SOMA version: 1.15.4
  • Language and language version (e.g. Python 3.9, R 4.3.2): Python 3.12.8
  • OS (e.g. MacOS, Ubuntu Linux): Linux
  • Note: you can use tiledbsoma.show_package_versions() (Python) or tiledbsoma::show_package_versions() (R)

@johnkerl self-assigned this Feb 4, 2025