Data Management
The OmixAtlas class enables users to interact with the functional properties of an OmixAtlas: create and update an OmixAtlas, get a summary of its contents, add, insert, or update the schema, add, update, or delete datasets, query metadata, download data, save data to a workspace, etc.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
token | str | Token copied from Polly. | None |
Usage
from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas(token)
add_datasets(repo_id, source_folder_path, destination_folder_path='', priority='low', validation=False)
This function is used to add new data to an OmixAtlas. Once the function runs successfully, it takes about 30 seconds to log the ingestion request, and within 2 minutes the ingestion log is shown on the data ingestion monitoring dashboard. In order to add datasets to an OmixAtlas, the user must be a Data Contributor at the resource level. Please contact polly.support@elucidata.io if you get an Access Denied error message.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
repo_id | int/str | repo_name/repo_id of the OmixAtlas | required |
source_folder_path | dict | Source folder paths from which the data and metadata files are fetched. The dictionary should have two keys, "data" and "metadata", whose values are the folders where the data and metadata files are stored respectively, i.e. {"data": "<data folder path>", "metadata": "<metadata folder path>"} | required |
destination_folder_path | str | Destination folder structure in S3. Use this only to manage the folder structure in the backend. It is advised not to give any value for this; by default the data goes into the root folder. | '' |
priority | str | Optional parameter (low/medium/high). Priority at which this data has to be ingested into the OmixAtlas. The default value is "low"; other acceptable values are "medium" and "high". | 'low' |
validation | bool | Optional parameter (True/False) to activate validation. False by default, meaning validation is not active. Validation needs to be activated only when validated files are being ingested. | False |
Raises:

Type | Description |
---|---|
paramError | If params are not passed in the desired format or a value is not valid. |
RequestException | If there is an issue in data ingestion. |

Returns:

Type | Description |
---|---|
pd.DataFrame | DataFrame showing the upload status of files |
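A minimal usage sketch is shown below; the repo_id and folder paths are hypothetical, and each metadata json must pair with a data file of the same name:

# hypothetical repo_id and folder layout
repo_id = "1643359804137"
source_folder_path = {
    "data": "ingestion_demo/data/",         # .gct/.h5ad/.vcf files
    "metadata": "ingestion_demo/metadata/"  # matching .json files
}
status_df = omixatlas.add_datasets(repo_id, source_folder_path, priority="medium")
print(status_df)  # upload status of each file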
dataset_metadata_template(repo_key, source='all', data_type='all')
This function is used to fetch the template of dataset level metadata in a given OmixAtlas. In order to ingest the dataset level metadata appropriately into the OmixAtlas, the user needs to ensure the metadata json file contains the keys defined in the dataset level schema.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
repo_key | str/int | repo_name/repo_id of the OmixAtlas | required |
source | str | Source/sources present in the schema. Default value is "all". | 'all' |
data_type | str | Datatype/datatypes present in the schema. Default value is "all". | 'all' |

Returns:

Type | Description |
---|---|
dict | A dictionary with the dataset level metadata template |

Raises:

Type | Description |
---|---|
invalidApiResponseException | attribute/key error |
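A quick sketch of fetching the template; repo_id "9" here is only an illustration:

# fetch the dataset level metadata template and inspect its keys
template = omixatlas.dataset_metadata_template("9")
print(sorted(template.keys()))  # keys expected in each dataset level metadata json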
delete_datasets(repo_id, dataset_ids)
This function is used to delete datasets from an OmixAtlas. Once the function runs successfully, the status of the deletion is visible on the data ingestion monitoring dashboard after ~15 minutes.
It displays a DataFrame with the status of the operation for each file.
In order to delete datasets from an OmixAtlas, the user must be a Data Admin at the resource level. Please contact polly.support@elucidata.io if you get an Access Denied error message.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
repo_id | int | repo_id of the OmixAtlas | required |
dataset_ids | list | List of dataset_ids that the user wants to delete. | required |

Raises:

Type | Description |
---|---|
paramError | If params are not passed in the desired format or a value is not valid. |
RequestException | If there is an issue in data ingestion. |

Returns:

Type | Description |
---|---|
None | |
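A minimal sketch, mirroring the delete example later in this section (the ids are hypothetical):

# delete two datasets from a hypothetical OmixAtlas
repo_id = "1643359804137"
dataset_ids = ["GSE100009_GPL11154", "GSE145009_GPL11124"]
omixatlas.delete_datasets(repo_id, dataset_ids)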
save_to_workspace(repo_id, dataset_id, workspace_id, workspace_path)
Function to download a dataset from an OmixAtlas and save it to a Workspace.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
repo_id | str | repo_id of the OmixAtlas | required |
dataset_id | str | dataset_id that needs to be saved | required |
workspace_id | int | workspace_id of the workspace in which the dataset needs to be saved | required |
workspace_path | str | Path within the workspace where the dataset is to be saved | required |

Returns:

Name | Type | Description |
---|---|---|
json | json | Info about the workspace where the data is saved and which OmixAtlas it came from |
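A minimal sketch; the ids and path below are hypothetical:

# save a dataset from an OmixAtlas into a workspace folder
response = omixatlas.save_to_workspace("9", "GSE107280_GPL11154", 12345, "geo_GSE107280_GPL11154")
print(response)  # json describing where the data was saved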
update_datasets(repo_id, source_folder_path, destination_folder_path='', priority='low', validation=False)
This function is used to update existing data in an OmixAtlas. Once the function runs successfully, it takes about 30 seconds to log the ingestion request, and within 2 minutes the ingestion log is shown on the data ingestion monitoring dashboard. In order to update datasets, the user must be a Data Contributor at the resource level. Please contact polly.support@elucidata.io if you get an Access Denied error message.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
repo_id | int/str | repo_name/repo_id of the OmixAtlas | required |
source_folder_path | dict | Source folder paths from which the data and metadata files are fetched. The dictionary should have two keys, "data" and "metadata", whose values are the folders where the data and metadata files are stored respectively, i.e. {"data": "<data folder path>", "metadata": "<metadata folder path>"} | required |
destination_folder_path | str | Destination folder structure in S3. Use this only to manage the folder structure in the backend. It is advised not to give any value for this; by default the data goes into the root folder. | '' |
priority | str | Optional parameter (low/medium/high). Priority at which this data has to be ingested into the OmixAtlas. The default value is "low"; other acceptable values are "medium" and "high". | 'low' |
validation | bool | Optional parameter (True/False) to activate validation. False by default, meaning validation is not active. Validation needs to be activated only when validated files are being ingested. | False |

Raises:

Type | Description |
---|---|
paramError | If params are not passed in the desired format or a value is not valid. |
RequestException | If there is an issue in data ingestion. |

Returns:

Type | Description |
---|---|
pd.DataFrame | DataFrame showing the upload status of files |
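A minimal sketch, assuming only the metadata of an already ingested dataset changed (the path and repo_id are hypothetical):

# re-ingest only the updated metadata json for an existing dataset
source_folder_path = {"metadata": "ingestion_demo/metadata/"}
status_df = omixatlas.update_datasets("1654268055800", source_folder_path, priority="medium")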
move_data(source_repo_key, destination_repo_key, dataset_ids, priority='medium')
This function is used to move datasets from a source atlas to a destination atlas. It should only be used when the schemas of the source and destination atlases are compatible with each other; otherwise the data may not behave the same way in the destination atlas, or the ingestion may fail. Please contact polly.support@elucidata.io if you get an Access Denied error message.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
source_repo_key | str/int | Source repo key of the dataset_ids. Only repo_id is supported for now. | required |
destination_repo_key | str/int | Destination repo key to which the data needs to be transferred | required |
dataset_ids | list | List of dataset_ids to transfer | required |
priority | str | Optional parameter (low/medium/high). Priority of ingestion. Defaults to "medium". | 'medium' |

Returns:

Type | Description |
---|---|
None | |
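A minimal sketch; the repo keys and dataset id are hypothetical:

# move one dataset between two atlases with compatible schemas
omixatlas.move_data(
    source_repo_key="1643359804137",
    destination_repo_key="1654268055800",
    dataset_ids=["GSE12332_GPL123"],
    priority="medium",
)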
Examples
# Install polly python
pip install polly-python
# Import libraries
import os
from polly.auth import Polly
from polly.omixatlas import OmixAtlas

# Create omixatlas object and authenticate
AUTH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']
Polly.auth(AUTH_TOKEN)
omixatlas = OmixAtlas()
Add data to a new OmixAtlas
Addition of data into a newly created OmixAtlas, also referred to as ingestion, can be done using the polly-python function add_datasets.
The OmixAtlas to which the data is to be added should have a schema that supports the metadata contained in the data.
Please see this FAQ to check if the metadata and the schema of the OmixAtlas match. While adding a dataset, both the metadata file (json) and the data file (h5ad, gct, vcf) are required.
In order to use this function,

- the metadata and data files should be present in separate folders, and the paths to these folders should be provided as a dictionary with keys metadata and data whose values are the metadata folder path and data folder path respectively.
- each metadata file should have a corresponding data file and vice versa. The metadata and data files of a dataset are expected to have the same name as the dataset_id. For example, GSE100009_GPL11154.json and GSE100009_GPL11154.gct are the metadata file and data file for the dataset_id GSE100009_GPL11154.

Once the files are uploaded for ingestion, the ingestion progress and logs can be monitored and fetched on the ingestion monitoring dashboard.
data_source_folder_path = "data_ingestion_demo/data/"
metadata_source_folder_path = "data_ingestion_demo/metadata/"
source_data_folder = {"data": data_source_folder_path, "metadata": metadata_source_folder_path}
repo_id = "<repo_id>"  # repo_id of the destination OmixAtlas
omixatlas.add_datasets(repo_id, source_data_folder, priority="medium")
File Name Message
0 combined_metadata.json File Uploaded
1 GSE100009_GPL11154_raw.gct File Uploaded
2 GSE100013_GPL16791_raw.gct File Uploaded
Please wait for 30 seconds while your ingestion request is getting logged.
Your ingestion request is successfully logged. You can go to ingestion monitoring dashboard for tracking it's status.
Usage of destination_folder/validation flags
- destination_folder: This is an optional parameter specifying where the files will be uploaded in S3. It is advised not to give any value for this, as it is only used in special cases. By default the data is uploaded to the root folder.
- priority: This is an optional parameter too. It states the priority at which this data has to be ingested into the OmixAtlas. The default value is "low". Acceptable values are "low", "medium" and "high".
- validation: This option can be provided as True or False. By default, validation is inactive. When validation needs to be activated, the dataset level metadata has to be generated in a certain way. We'll add an example soon.
Template for dataset level metadata
The keys used in the dataset level metadata must be compatible with the schema of the OmixAtlas. The dataset level metadata template can be fetched using the function shown below. In this example, after fetching the template, we load the keys of the dataset level metadata and check whether they match the schema requirements.
import json

# example: get the metadata template of the destination atlas; here, we are looking at repo_id "9"
data_metadata_template_geo = omixatlas.dataset_metadata_template("9")
template_keys = set(data_metadata_template_geo.keys())

# getting the dataset level metadata keys from the dataset metadata json that is to be ingested
with open('/import/data_ingestion_demo/metadata/GSE95448_GPL19057.json') as f:
    data = json.load(f)
keys_in_json = set(data.keys())

# comparing the keys in the destination atlas template vs the keys in the dataset metadata json;
# the difference below lists template keys missing from the json
intersect = keys_in_json.intersection(template_keys)
template_keys.difference(intersect)
Move data from source to destination
This function is used to move datasets from a source atlas to a destination atlas. It should only be used when the schemas of the source and destination atlases are compatible with each other; otherwise the data may not behave the same way in the destination atlas, or the ingestion may fail.
# example: moving 3 datasets from source "geo_transcriptomics_omixatlas" to destination "rankine_atlas"
omixatlas.move_data(source_repo_key = "geo_transcriptomics_omixatlas", destination_repo_key = "rankine_atlas",
dataset_ids = ["GSE12332_GPL123", "GSE43234_GPL143", "GSE89768_GPL967"])
Update the data or metadata in omixatlas
Data already ingested into the OmixAtlas can be updated by re-ingesting the metadata file of a dataset, the data file of a dataset, or both, depending on what needs to be updated. The update progress can also be seen on the ingestion monitoring dashboard. However, if there is no change in the files, the process will not be initiated and will not be shown on the ingestion monitoring dashboard.
In order to use this function,

- The metadata and data files should be present in separate folders, and the paths to these folders should be provided as a dictionary with keys "metadata" and "data" whose values are the metadata folder path and data folder path respectively, whichever is applicable.
- If the dataset was originally ingested into a specific destination folder, then that path should be provided.
  - In case the destination folder path is not provided, the root directory is taken as the default.
  - In case the destination folder path doesn't match, the system will provide the folder path where the data or metadata is present in the OmixAtlas in a warning message. Please see example 3 below for more details.
- In case the data or metadata being updated has not been ingested before, an appropriate warning is shown and the user is advised to use the add_datasets function to add the data first.
Ex 1: Update both data and metadata files
metadata_folder_path = "repoid_1654268055800_files_test/metadata_2/"
data_folder_path = "repoid_1654268055800_files_test/data_2/"
repo_id= "1654268055800"
source_folder_path = {"metadata":metadata_folder_path, "data": data_folder_path}
omixatlas.update_datasets(repo_id, source_folder_path)
repoid_1654268055800_files_test/data_2/:
ACBC_MSKCC_2015_Copy_Number_AdCC10T.gct
repoid_1654268055800_files_test/metadata_2/:
ACBC_MSKCC_2015_Copy_Number_AdCC10T.json
Processing Metadata files: 100%|██████████| 1/1 [00:00<00:00, 357.57it/s]
Uploading data files: 100%|██████████| 1/1 [00:00<00:00, 6.36files/s]
Please wait for 30 seconds while your ingestion request is getting logged.
Your ingestion request is successfully logged. You can go to ingestion monitoring dashboard for tracking it's status.
File Name Message
0 combined_metadata.json File Uploaded
1 ACBC_MSKCC_2015_Copy_Number_AdCC10T.gct File Uploaded
Ex 2: Update only data files
data_folder_path = "repoid_1654268055800_files_test/data"
repo_id= "1654268055800"
destination_folder_path = "Mutation"
priority = "medium"
source_folder_path = {"data":data_folder_path}
omixatlas.update_datasets(repo_id, source_folder_path, destination_folder_path, priority)
Please wait for 30 seconds while your ingestion request is getting logged.
Your ingestion request is successfully logged. You can go to ingestion monitoring dashboard for tracking it's status.
File Name Message
0 CCLE_Mutation_C3A_LIVER.gct File Uploaded
Ex 3: Update data which doesn't exist OR if destination folder is incorrect
# update with an invalid/absent destination folder path (DFP)
metadata_folder_path = "repoid_1654268055800_files_test/metadata_3/"
data_folder_path = "repoid_1654268055800_files_test/data_3/"
repo_id= "1654268055800"
priority = "medium"
destination_folder_path = "transcriptomics_76" #invalid folder path
source_folder_path = {"metadata":metadata_folder_path, "data": data_folder_path}
omixatlas.update_datasets(repo_id, source_folder_path, destination_folder_path, priority)
Processing Metadata files: 0%| | 0/1 [00:00<?, ?it/s]WARNING: Unable to update metadata/data file BRCA_BCCRC_Mutation_SA018_2.json because corresponding data/metadata file not present in OmixAtlas. Please add the files using add_datasets function. For any questions, please reach out to polly.support@elucidata.io.
WARNING: Unable to update the data/metadata for BRCA_BCCRC_Mutation_SA018_2.json because original data file not present in the provided destination folder path: transcriptomics_76 in the omixatlas. Please choose the required destination folder path from the following:
["upload_folder_1/test_1"].
For any questions, please reach out to polly.support@elucidata.io.
Processing Metadata files: 100%|██████████| 1/1 [00:00<00:00, 351.96it/s]
WARNING: Unable to update metadata/data file BRCA_BCCRC_Mutation_SA018_2.gct because corresponding data/metadata file not present in OmixAtlas. Please add the files using add_datasets function. For any questions, please reach out to polly.support@elucidata.io.
WARNING: Unable to update the data/metadata for BRCA_BCCRC_Mutation_SA018_2.gct because original data file not present in the provided destination folder path: transcriptomics_76 in the omixatlas. Please choose the required destination folder path from the following:
[].
For any questions, please reach out to polly.support@elucidata.io.
Uploading data files: 0files [00:00, ?files/s]
Delete data from OmixAtlas
Ingested datasets can be deleted from the OmixAtlas using the delete_datasets function. A list of dataset_ids can be provided in order to delete multiple datasets. The status of the delete operation can be seen on the ingestion monitoring dashboard.
repo_id = "1643359804137"
dataset_ids = ["GSE100009_GPL11154", "GSE145009_GPL11124"]
omixatlas.delete_datasets(repo_id, dataset_ids)
Data ingestion monitoring dashboard
The data ingestion monitoring dashboard on the GUI allows users to monitor the progress of ingestion runs (add_datasets, update_datasets, delete_datasets). For each dataset undergoing ingestion (addition/update) or deletion, the logs are available here to view and download.
To know more about the ingestion monitoring dashboard, please refer to this section.
How and why to save data in workspace?
Workspaces allow users to download and save data from an analysis or an OmixAtlas so it can be reused, instead of downloading it to the local system. Workspaces act as storage spaces with the additional capability of sharing and collaborating with other users. The workspace_id can be fetched from the URL that appears on opening a workspace on the GUI, and needs to be passed as an integer.
repo_id = "9"
dataset_id = "GSE107280_GPL11154"
workspace_id = 12345
workspace_path= "geo_GSE107280_GPL11154"
omixatlas.save_to_workspace(repo_id, dataset_id, workspace_id, workspace_path)
INFO:root:Data Saved to workspace=12345
{'data': {'type': 'workspace-jobs',
'id': '9/12345',
'attributes': {'destination-gct': '12345/geo_GSE107280_GPL11154/GSE107280_GPL11154_curated.gct',
'destination-json': '12345/geo_GSE107280_GPL11154/GSE107280_GPL11154.json'}}}
Import a file from workspace to Polly Notebooks
Files present in a workspace can be viewed from a notebook and also synced, so that they are available for use in the current analysis notebook as well. Please note that only files present in the same workspace as the analysis/notebook can be synced.
# to list files in the folder "repoid_1654268055800_files_test/" in the current workspace
!polly files list --workspace-path "polly://repoid_1654268055800_files_test/" -y
# copy the files from the folder repoid_1654268055800_files_test/ in the current workspace to the notebook under the folder destination_repoid_1654268055800_files_test/
!polly files sync -s "polly://repoid_1654268055800_files_test/" -d "destination_repoid_1654268055800_files_test/" -y