Working with Cohorts
The Cohort class provides functions to create a cohort, add or remove datasets or samples, merge the metadata and data matrix of the samples/datasets in a cohort, and edit or delete a cohort.
Parameters:
-
token(str, default:None) –Authentication token from Polly.
Usage
from polly.cohort import Cohort
cohort = Cohort(token)
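A minimal sketch of building the object without hard-coding the token. The helper name `cohort_from_env` and the environment variable name `POLLY_TOKEN` are assumptions made for illustration and are not part of the library. The import is deferred so the missing-token check runs first.

```python
import os


def cohort_from_env(env_var="POLLY_TOKEN"):
    """Build a Cohort from a token stored in an environment variable.

    `env_var` is a hypothetical variable name; use whatever your setup defines.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set {env_var} to your Polly authentication token")
    # Imported here so the check above does not depend on polly being importable.
    from polly.cohort import Cohort
    return Cohort(token)
```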
add_to_cohort
This function is used to add datasets or samples to a cohort.
Parameters:
-
repo_key(str) –repo_key (repo_name or repo_id) of the omixatlas to which the datasets or samples belong.
-
dataset_id(list / str, default:None) –a list of dataset_ids (for repositories where one dataset has one sample) or a single dataset_id as a str (for repositories where one dataset has many samples).
-
sample_id(list, default:None) –list of samples to be added to the cohort; applicable only for repositories where one dataset has many samples.
Returns:
-
None–A message will be displayed on the status of the operation.
Raises:
-
InvalidParameterException–Empty or Invalid Parameters.
-
InvalidCohortOperationException–This operation is not valid as no cohort has been instantiated.
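The two shapes of dataset_id described above can be sketched as a small wrapper. The helper `add_many` and its `one_sample_per_dataset` flag are hypothetical; the sketch does not detect the repository type and relies on the caller to know it. It only assumes that `cohort` exposes add_to_cohort with the parameter names listed above.

```python
def add_many(cohort, repo_key, dataset_ids, one_sample_per_dataset):
    """Add several datasets to a cohort that has already been created or loaded.

    `one_sample_per_dataset` is supplied by the caller (an assumption of this
    sketch); nothing here inspects the repository.
    """
    if one_sample_per_dataset:
        # One dataset has one sample: pass the whole list in a single call.
        cohort.add_to_cohort(repo_key, dataset_id=list(dataset_ids))
    else:
        # One dataset has many samples: pass each dataset_id as a str.
        for dataset_id in dataset_ids:
            cohort.add_to_cohort(repo_key, dataset_id=dataset_id)
```

The second branch mirrors the loop used in the GEO example at the end of this page.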
create_cohort
This function is used to create a cohort. Call it after instantiating a Cohort object.
Parameters:
-
local_path(str) –local path to instantiate the cohort.
-
cohort_name(str) –identifier name for the cohort.
-
description(str) –description about the cohort.
-
repo_key(str, default:None) –repo_key (repo_name or repo_id) of the omixatlas from which the datasets or samples are to be added.
-
dataset_id(list / str, default:None) –a list of dataset_ids (for repositories where one dataset has one sample) or a single dataset_id as a str (for repositories where one dataset has many samples).
-
sample_id(list, default:None) –list of samples to be added to the cohort; applicable only for repositories where one dataset has many samples.
Returns:
-
None–A message will be displayed on the status of the operation.
Raises:
-
InvalidParameterException–Empty or Invalid Parameters
-
InvalidCohortNameException–The cohort_name does not represent a valid cohort name.
-
InvalidPathException–Provided path does not represent a file or a directory.
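Since repo_key, dataset_id and sample_id all default to None, create_cohort can be called with only the first three arguments. The sketch below does that after checking the path locally. The helper `create_empty_cohort` is hypothetical, and the assumption that such a call yields a cohort with no datasets yet is not confirmed by this page.

```python
import os


def create_empty_cohort(cohort, local_path, cohort_name, description):
    """Check the path, then create a cohort without datasets or samples.

    The local check is a convenience of this sketch; the library performs its
    own validation and raises InvalidPathException for a bad path.
    """
    if not os.path.exists(local_path):
        raise FileNotFoundError(f"{local_path} does not exist")
    cohort.create_cohort(local_path, cohort_name, description)
```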
create_merged_gct
This function is used to merge all the gct files in a cohort into a single gct file.
Parameters:
-
file_path(str) –the system path where the gct file is to be written.
-
file_name(str, default:'') –Identifier for the merged file name, cohort name will be used by default.
delete_cohort
This function is used to delete a cohort.
Returns:
-
None–A confirmation message on deletion of cohort
edit_cohort
This function is used to edit cohort-level metadata such as the cohort name and description. At least one of the arguments must be provided.
Parameters:
-
new_cohort_name(str) –new identifier name for the cohort.
-
new_description(str) –new description about the cohort.
Returns:
-
message–A confirmation message on update of the cohort.
Raises:
-
InvalidCohortOperationException–This operation is not valid as no cohort has been instantiated.
-
CohortEditException–No parameter specified for editing in cohort
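A minimal sketch of the "at least one argument" rule, assuming `cohort` exposes edit_cohort with the keyword names new_cohort_name and new_description. The helper `edit_cohort_fields` is hypothetical; it fails early with a plain ValueError instead of waiting for the library's CohortEditException.

```python
def edit_cohort_fields(cohort, new_cohort_name=None, new_description=None):
    """Forward only the fields the caller actually wants to change."""
    changes = {}
    if new_cohort_name is not None:
        changes["new_cohort_name"] = new_cohort_name
    if new_description is not None:
        changes["new_description"] = new_description
    if not changes:
        raise ValueError("Provide new_cohort_name, new_description, or both")
    return cohort.edit_cohort(**changes)
```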
is_valid
This function is used to check the validity of a cohort.
Returns:
-
bool–A boolean result based on the validity of the cohort.
Raises:
-
InvalidPathException–Cohort path does not represent a file or a directory.
-
InvalidCohortOperationException–This operation is not valid as no cohort has been instantiated.
load_cohort
This function loads an existing cohort into an object. Once loaded, the other functions described in this documentation can be called on that object.
Parameters:
-
local_path(str) –local path of the cohort.
Returns:
-
None–A confirmation message on instantiation of the cohort.
Raises:
-
InvalidPathException–This path does not represent a file or a directory.
-
InvalidCohortPathException–This path does not represent a Cohort.
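load_cohort and is_valid are naturally used together. The sketch below loads a cohort and refuses to continue if the validity check fails. The helper `load_valid_cohort` is hypothetical and assumes only the two calls documented above.

```python
def load_valid_cohort(cohort, local_path):
    """Load the cohort at local_path and return the object only if it is valid."""
    cohort.load_cohort(local_path)
    if not cohort.is_valid():
        raise ValueError(f"{local_path} did not load as a valid cohort")
    return cohort
```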
merge_data
Function to merge the metadata (dataset, sample and feature level) or the data matrix of all the samples/datasets in the cohort.
Parameters:
-
data_level(str) –identifier to specify the data to be merged - "dataset", "sample", "feature" or "data_matrix"
Returns:
-
Dataframe–A pandas dataframe containing the merged data which is ready for analysis
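A short sketch that collects one merged result per data_level, using the four identifiers listed above. The helper `merge_all_levels` is hypothetical; it assumes only that `cohort.merge_data(level)` returns the merged dataframe for that level.

```python
DATA_LEVELS = ("dataset", "sample", "feature", "data_matrix")


def merge_all_levels(cohort, levels=DATA_LEVELS):
    """Return a dict mapping each requested data_level to its merged result."""
    merged = {}
    for level in levels:
        if level not in DATA_LEVELS:
            raise ValueError(f"Unknown data_level: {level!r}")
        merged[level] = cohort.merge_data(level)
    return merged
```

Keeping the results in a dict keyed by level makes it easy to pick, for example, `merged["sample"]` for sample metadata and `merged["data_matrix"]` for the values.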
remove_from_cohort
This function is used for removing datasets or samples from a cohort.
Parameters:
-
dataset_id(list / str, default:None) –a list of dataset_ids (for repositories where one dataset has one sample) or a single dataset_id as a str (for repositories where one dataset has many samples).
-
sample_id(list, default:[]) –list of samples to be removed from the cohort; applicable only for repositories where one dataset has many samples.
Returns:
-
None–A message will be displayed on the status of the operation.
Raises:
-
InvalidParameterException–Empty or Invalid Parameters
-
InvalidCohortOperationException–This operation is not valid as no cohort has been instantiated.
summarize_cohort
Function to return the cohort-level metadata and a dataframe of the datasets or samples added to the cohort.
Returns:
-
Tuple–A tuple with the first value as cohort metadata information (name, description and number of dataset(s) or sample(s) in the cohort) and the second value as dataframe containing the source, dataset_id/sample_id and data type available in the cohort.
Raises:
-
InvalidCohortOperationException–This operation is not valid as no cohort has been instantiated.
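The returned tuple can be unpacked directly. In this sketch the helper `cohort_overview` is hypothetical, and the exact type of the metadata value is not assumed; only the two-element tuple described above is relied on.

```python
def cohort_overview(cohort):
    """Unpack summarize_cohort() into metadata and the number of listed entries."""
    metadata, members = cohort.summarize_cohort()
    # len() of a pandas dataframe is its number of rows.
    return metadata, len(members)
```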
Examples
In TCGA
Query execution succeeded (time taken: 2.13 seconds, data scanned: 0.244 MB)
Fetched 123 rows
dataset_ids = results['dataset_id'].tolist()
cohort1.create_cohort("/import","tcga_data","Proteomics datasets","tcga", dataset_ids)
INFO:root:Cohort Created !
Initializing process...
Verifying Data: 100%|██████████| 123/123 [00:11<00:00, 10.71it/s]
Adding data to cohort: 100%|██████████| 123/123 [00:14<00:00, 8.72it/s]
Adding metadata to cohort: 100%|██████████| 123/123 [00:11<00:00, 10.25it/s]
INFO:root:'123' dataset/s added to Cohort!
All_Metadata_col = cohort1.merge_data("sample")
print("\nColumns/Datasets information")
display(All_Metadata_col.head())
In GEO
query = <someSQLquery>
results = omixatlas.query_metadata(query)
dataset_ids = results['dataset_id'].tolist()
cohort1.create_cohort("/import","geo_data","Transcriptomics datasets","geo", dataset_ids[0])
for i in dataset_ids[1:]:
    cohort1.add_to_cohort("geo", i)
INFO:root:Cohort Created !
Initializing process...
Adding data to cohort...
Adding metadata to cohort...
INFO:cmap_logger:Reading GCT: /import/geo_data.pco/geo_GSE120746_GPL18573.gct
INFO:root:'18' sample/s added to Cohort!
Initializing process...
Adding data to cohort...
Adding metadata to cohort...
INFO:cmap_logger:Reading GCT: /import/geo_data.pco/geo_GSE62642_GPL16791.gct
INFO:root:'14' sample/s added to Cohort!
Initializing process...
Adding data to cohort...
Adding metadata to cohort...
INFO:cmap_logger:Reading GCT: /import/geo_data.pco/geo_GSE68719_GPL11154.gct
INFO:root:'73' sample/s added to Cohort!
All_Metadata_col = cohort1.merge_data("sample")
print("\nColumns/Datasets information")
display(All_Metadata_col.head())