Working with Cohorts

The Cohort class provides functions to create a cohort, add or remove samples, merge the metadata and data matrix of the samples or datasets in a cohort, and edit or delete a cohort.

Parameters:

token (str, default: None ) –

Authentication token from Polly.

Usage

from polly.cohort import Cohort

cohort = Cohort(token)
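
The token can be kept out of source code by reading it from the environment. The following is a minimal sketch only: the helper read_token and the variable name POLLY_TOKEN are hypothetical names chosen for illustration and are not part of polly.

```python
import os


def read_token(env_var="POLLY_TOKEN"):
    """Return the token stored in the environment variable `env_var`.

    Both this helper and the default variable name are illustrative
    assumptions, not part of the polly library.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to your Polly authentication token")
    return token


# Usage (requires polly to be installed and a valid token):
# from polly.cohort import Cohort
# cohort = Cohort(read_token())
```

Failing early on a missing token keeps the error close to its cause instead of surfacing later inside a cohort call.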

add_to_cohort

add_to_cohort(repo_key, dataset_id=None, sample_id=None)

This function is used to add datasets or samples to a cohort.

Parameters:

repo_key (str) –

repo_key (repo_name or repo_id) of the omixatlas to which the datasets or samples belong.
dataset_id (list / str, default: None ) –

A list of dataset_ids (for repositories where each dataset has one sample) or a single dataset_id as a str (for repositories where one dataset has many samples).
sample_id (list, default: None ) –

List of samples to be added to the cohort. Applicable only for repositories where one dataset has many samples.

Returns:

None –

A message will be displayed on the status of the operation.

Raises:

InvalidParameterException –

Empty or Invalid Parameters.
InvalidCohortOperationException –

This operation is not valid as no cohort has been instantiated.
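
The two accepted shapes of dataset_id described above can be checked before calling add_to_cohort. This is a rough sketch of such a pre-check, assuming the rules are exactly as documented; check_ids is a hypothetical helper and does not reproduce polly's own validation.

```python
def check_ids(dataset_id, sample_id=None):
    """Classify the arguments according to the two documented shapes.

    Returns "one_sample_per_dataset" when dataset_id is a list of ids, or
    "many_samples_per_dataset" when dataset_id is a single str (optionally
    accompanied by a list of sample ids). Raises ValueError otherwise.
    Hypothetical pre-check, not polly's own validation.
    """
    if isinstance(dataset_id, list):
        if not dataset_id or not all(isinstance(d, str) for d in dataset_id):
            raise ValueError("dataset_id list must contain at least one str")
        if sample_id:
            raise ValueError("sample_id applies only with a single dataset_id")
        return "one_sample_per_dataset"
    if isinstance(dataset_id, str) and dataset_id:
        if sample_id is not None and not isinstance(sample_id, list):
            raise ValueError("sample_id must be a list")
        return "many_samples_per_dataset"
    raise ValueError("dataset_id must be a non-empty list or str")


# Usage (cohort is an instantiated Cohort with a cohort created or loaded):
# check_ids(dataset_ids)
# cohort.add_to_cohort("tcga", dataset_id=dataset_ids)
```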

create_cohort

create_cohort(local_path, cohort_name, description, repo_key=None, dataset_id=None, sample_id=None)

This function is used to create a cohort. Call it on a Cohort object after instantiating the class.

Parameters:

local_path (str) –

local path to instantiate the cohort.
cohort_name (str) –

identifier name for the cohort.
description (str) –

description about the cohort.
repo_key (str, default: None ) –

repo_key (repo_name or repo_id) of the omixatlas from which datasets or samples are to be added.
dataset_id (list / str, default: None ) –

A list of dataset_ids (for repositories where each dataset has one sample) or a single dataset_id as a str (for repositories where one dataset has many samples).
sample_id (list, default: None ) –

List of samples to be added to the cohort. Applicable only for repositories where one dataset has many samples.

Returns:

None –

A message will be displayed on the status of the operation.

Raises:

InvalidParameterException –

Empty or Invalid Parameters
InvalidCohortNameException –

The cohort_name does not represent a valid cohort name.
InvalidPathException –

Provided path does not represent a file or a directory.
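
Since an unusable local_path is reported as InvalidPathException, it can help to make sure the directory exists before creating the cohort. A minimal sketch, assuming create_cohort accepts an existing directory as local_path; prepare_cohort_dir is a hypothetical helper, not part of polly.

```python
from pathlib import Path


def prepare_cohort_dir(local_path):
    """Create `local_path` (and any missing parents) and return it as a str.

    Hypothetical convenience helper; raises FileExistsError if the path
    already exists but is a regular file.
    """
    path = Path(local_path).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    return str(path)


# Usage (cohort is an instantiated Cohort):
# cohort.create_cohort(prepare_cohort_dir("/import"), "tcga_data",
#                      "Proteomics datasets")
```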

create_merged_gct

create_merged_gct(file_path, file_name='')

This function is used to merge all the gct files in a cohort into a single gct file.

Parameters:

file_path (str) –

the system path where the gct file is to be written.
file_name (str, default: '' ) –

Identifier for the merged file name; the cohort name is used by default.

delete_cohort

delete_cohort()

This function is used to delete a cohort.

Returns:

None –

A confirmation message on deletion of cohort

edit_cohort

edit_cohort(new_cohort_name=None, new_description=None)

This function is used to edit cohort-level metadata such as the cohort name and description. At least one of the arguments must be provided.

Parameters:

new_cohort_name (str, default: None ) –

new identifier name for the cohort.
new_description (str, default: None ) –

new description about the cohort.

Returns:

message –

A confirmation message once the cohort has been updated.

Raises:

InvalidCohortOperationException –

This operation is not valid as no cohort has been instantiated.
CohortEditException –

No parameter specified for editing in cohort
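
Because calling edit_cohort with no arguments is an error, a caller may want to forward only the fields that were actually supplied. A small sketch of that pattern; edit_changed_fields is a hypothetical wrapper and relies only on the edit_cohort signature shown above.

```python
def edit_changed_fields(cohort, new_cohort_name=None, new_description=None):
    """Forward only the supplied fields to cohort.edit_cohort.

    Returns False without calling edit_cohort when both values are None,
    which is the case documented as raising CohortEditException.
    Hypothetical wrapper, not part of polly.
    """
    fields = {}
    if new_cohort_name is not None:
        fields["new_cohort_name"] = new_cohort_name
    if new_description is not None:
        fields["new_description"] = new_description
    if not fields:
        return False
    cohort.edit_cohort(**fields)
    return True


# Usage (cohort is an instantiated Cohort with a cohort loaded):
# edit_changed_fields(cohort, new_description="Updated description")
```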

is_valid

is_valid()

This function is used to check the validity of a cohort.

Returns:

bool –

A boolean result based on the validity of the cohort.

Raises:

InvalidPathException –

Cohort path does not represent a file or a directory.
InvalidCohortOperationException –

This operation is not valid as no cohort has been instantiated.
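
One way to use the boolean result is as a guard before a longer operation such as exporting a merged GCT file. A minimal sketch; run_if_valid is a hypothetical helper, and raising ValueError on an invalid cohort is a choice made for this example rather than library behaviour.

```python
def run_if_valid(cohort, action):
    """Call action(cohort) only when cohort.is_valid() returns True.

    Hypothetical guard; ValueError is this sketch's own choice of error.
    """
    if not cohort.is_valid():
        raise ValueError("cohort failed is_valid(); reload or recreate it")
    return action(cohort)


# Usage (cohort is an instantiated Cohort with a cohort loaded):
# run_if_valid(cohort, lambda c: c.create_merged_gct("/import"))
```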

load_cohort

load_cohort(local_path)

Function to load an existing cohort into an object. Once loaded, the functions described in this documentation can be used on that object.

Parameters:

local_path (str) –

local path of the cohort.

Returns:

None –

A confirmation message on instantiation of the cohort.

Raises:

InvalidPathException –

This path does not represent a file or a directory.
InvalidCohortPathException –

This path does not represent a Cohort.
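
The log output in the GEO example on this page shows files written under /import/geo_data.pco after create_cohort("/import", "geo_data", ...), which suggests a cohort is stored in a folder named after the cohort with a .pco suffix. The sketch below builds such a path; this layout is inferred from that log output only, and guess_cohort_path is a hypothetical helper, so verify the actual folder on disk before relying on it.

```python
from pathlib import PurePosixPath


def guess_cohort_path(local_path, cohort_name):
    """Return local_path/<cohort_name>.pco as a POSIX-style string.

    The ".pco" layout is an inference from the example logs on this page,
    not a documented guarantee.
    """
    return str(PurePosixPath(local_path) / f"{cohort_name}.pco")


# Usage (cohort is an instantiated Cohort):
# cohort.load_cohort(guess_cohort_path("/import", "geo_data"))
```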

merge_data

merge_data(data_level)

Function to merge metadata (dataset, sample and feature level) or the data matrix of all the samples/datasets in the cohort.

Parameters:

data_level (str) –

identifier to specify the data to be merged - "dataset", "sample", "feature" or "data_matrix"

Returns:

Dataframe –

A pandas dataframe containing the merged data which is ready for analysis
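
When several levels are needed, the four documented data_level values can be requested in one pass. A minimal sketch; merge_levels and DATA_LEVELS are hypothetical names, and the helper only calls merge_data with the values listed above.

```python
# The four data_level values listed in the documentation above.
DATA_LEVELS = ("dataset", "sample", "feature", "data_matrix")


def merge_levels(cohort, levels=DATA_LEVELS):
    """Return {level: cohort.merge_data(level)} for each requested level.

    Rejects any level not listed in DATA_LEVELS before calling the cohort.
    Hypothetical helper, not part of polly.
    """
    unknown = [level for level in levels if level not in DATA_LEVELS]
    if unknown:
        raise ValueError(f"unknown data_level value(s): {unknown}")
    return {level: cohort.merge_data(level) for level in levels}


# Usage (cohort is an instantiated Cohort with a cohort loaded):
# merged = merge_levels(cohort, ["dataset", "sample"])
# merged["sample"].head()
```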

remove_from_cohort

remove_from_cohort(dataset_id=None, sample_id=[])

This function is used for removing datasets or samples from a cohort.

Parameters:

dataset_id (list / str, default: None ) –

A list of dataset_ids (for repositories where each dataset has one sample) or a single dataset_id as a str (for repositories where one dataset has many samples).
sample_id (list, default: [] ) –

List of samples to be removed from the cohort. Applicable only for repositories where one dataset has many samples.

Returns:

None –

A message will be displayed on the status of the operation.

Raises:

InvalidParameterException –

Empty or Invalid Parameters
InvalidCohortOperationException –

This operation is not valid as no cohort has been instantiated.
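
For repositories where one dataset has many samples, dataset_id is a single str, so removing several datasets takes one call per dataset. The sketch below mirrors the add_to_cohort loop in the GEO example; remove_datasets is a hypothetical helper, and it assumes removal by dataset_id alone is accepted.

```python
def remove_datasets(cohort, dataset_ids):
    """Remove each dataset with its own remove_from_cohort call.

    Returns the ids in the order they were removed. Hypothetical helper;
    assumes a str dataset_id without sample_id is a valid call.
    """
    removed = []
    for dataset_id in dataset_ids:
        cohort.remove_from_cohort(dataset_id=dataset_id)
        removed.append(dataset_id)
    return removed


# Usage (cohort is an instantiated Cohort with a cohort loaded):
# remove_datasets(cohort, dataset_ids[1:])
```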

summarize_cohort

summarize_cohort()

Function to return cohort-level metadata and a dataframe of the datasets or samples added to the cohort.

Returns:

Tuple –

A tuple with the first value as the cohort metadata information (name, description and number of dataset(s) or sample(s) in the cohort) and the second value as a dataframe containing the source, dataset_id/sample_id and data type available in the cohort.

Raises:

InvalidCohortOperationException –

This operation is not valid as no cohort has been instantiated.
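
The returned tuple is usually unpacked into its two parts. A minimal sketch, assuming only the tuple shape described above; summary_with_count is a hypothetical helper, and len() on the second value gives the number of rows when it is a pandas dataframe.

```python
def summary_with_count(cohort):
    """Unpack summarize_cohort() and return (metadata, table, n_rows).

    n_rows is len(table), i.e. the number of rows for a dataframe.
    Hypothetical helper, not part of polly.
    """
    metadata, table = cohort.summarize_cohort()
    return metadata, table, len(table)


# Usage (cohort is an instantiated Cohort with a cohort loaded):
# metadata, table, n_rows = summary_with_count(cohort)
```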

Examples

In TCGA

# cohort1 is a Cohort object created as shown under Usage: cohort1 = Cohort(token)
query = <someSQLquery>
results = omixatlas.query_metadata(query)

Query execution succeeded (time taken: 2.13 seconds, data scanned: 0.244 MB)
Fetched 123 rows

dataset_ids = results['dataset_id'].tolist()
cohort1.create_cohort("/import","tcga_data","Proteomics datasets","tcga", dataset_ids)

INFO:root:Cohort Created !


Initializing process...


Verifying Data: 100%|██████████| 123/123 [00:11<00:00, 10.71it/s]
Adding data to cohort: 100%|██████████| 123/123 [00:14<00:00,  8.72it/s]
Adding metadata to cohort: 100%|██████████| 123/123 [00:11<00:00, 10.25it/s]
INFO:root:'123' dataset/s added to Cohort!

dataset_metadata = cohort1.merge_data("dataset")
display(dataset_metadata.head())

All_Metadata_col = cohort1.merge_data("sample")
print("\nColumns/Datasets information")
display(All_Metadata_col.head())

df_real = cohort1.merge_data("data_matrix")
print("\nData Matrix")
display(df_real.head())

In GEO

query = <someSQLquery>
results = omixatlas.query_metadata(query)
dataset_ids = results['dataset_id'].tolist()
cohort1.create_cohort("/import","geo_data","Transcriptomics datasets","geo", dataset_ids[0])

for i in dataset_ids[1:]:
    cohort1.add_to_cohort("geo", i)

INFO:root:Cohort Created !


Initializing process...
Adding data to cohort...
Adding metadata to cohort...


INFO:cmap_logger:Reading GCT: /import/geo_data.pco/geo_GSE120746_GPL18573.gct
INFO:root:'18' sample/s added to Cohort!


Initializing process...
Adding data to cohort...
Adding metadata to cohort...


INFO:cmap_logger:Reading GCT: /import/geo_data.pco/geo_GSE62642_GPL16791.gct
INFO:root:'14' sample/s added to Cohort!


Initializing process...
Adding data to cohort...
Adding metadata to cohort...


INFO:cmap_logger:Reading GCT: /import/geo_data.pco/geo_GSE68719_GPL11154.gct
INFO:root:'73' sample/s added to Cohort!

dataset_metadata = cohort1.merge_data("dataset")
display(dataset_metadata.head())

All_Metadata_col = cohort1.merge_data("sample")
print("\nColumns/Datasets information")
display(All_Metadata_col.head())

df_real = cohort1.merge_data("data_matrix")
print("\nData Matrix")
display(df_real.head())
