Data Management
The OmixAtlas class enables users to interact with the functional properties of an OmixAtlas: create and update an OmixAtlas, get a summary of its contents, add, insert, or update the schema, add, update, or delete datasets, query metadata, download data, save data to a workspace, etc.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
token | str | Token copied from Polly. | None |
Usage
from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas(token)
add_datasets(repo_id, source_folder_path, destination_folder_path='', priority='low', validation=False)
This function is used to add new data to an OmixAtlas. Once the function runs successfully, it takes about 30 seconds to log the ingestion request, and within 2 minutes the ingestion log is shown on the data ingestion monitoring dashboard. In order to add datasets to an OmixAtlas, the user must be a Data Contributor at the resource level. Please contact polly.support@elucidata.io if you get an Access Denied error message.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
repo_id | int/str | repo_name/repo_id of the OmixAtlas | required |
source_folder_path | dict | Source folder paths from which the data and metadata files are fetched. The dictionary should have two keys, "data" and "metadata", whose values are the folders where the data and metadata files are stored respectively, i.e. {"data": "<data folder path>", "metadata": "<metadata folder path>"} | required |
destination_folder_path | str | Destination folder structure in S3. Use this only to manage the folder structure in the backend. It is advised not to give any value for this; by default the data goes into the root folder. | '' |
priority | str | Optional parameter (low/medium/high). Priority at which this data has to be ingested into the OmixAtlas. The default value is "low"; other acceptable values are "medium" and "high". | 'low' |
validation | bool | Optional parameter (True/False) to activate validation. False by default, meaning validation is not active. Validation needs to be activated only when validated files are being ingested. | False |
Raises:

Type | Description |
---|---|
paramError | If params are not passed in the desired format or a value is not valid. |
RequestException | If there is an issue in data ingestion. |

Returns:

Type | Description |
---|---|
pd.DataFrame | DataFrame showing the upload status of files |
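A minimal usage sketch is shown below; the repo_id and folder paths are hypothetical, and each metadata json must pair with a data file of the same name:

# hypothetical repo_id and folder layout
repo_id = "1643359804137"
source_folder_path = {
    "data": "ingestion_demo/data/",         # .gct/.h5ad/.vcf files
    "metadata": "ingestion_demo/metadata/"  # matching .json files
}
status_df = omixatlas.add_datasets(repo_id, source_folder_path, priority="medium")
print(status_df)  # upload status of each file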
dataset_metadata_template(repo_key, source='all', data_type='all')
This function is used to fetch the template of dataset level metadata in a given OmixAtlas. In order to ingest the dataset level metadata appropriately into the OmixAtlas, the user needs to ensure the metadata json file contains the keys defined in the dataset level schema.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
repo_key | str/int | repo_name/repo_id of the OmixAtlas | required |
source | str | Source/sources present in the schema. Default value is "all". | 'all' |
data_type | str | Datatype/datatypes present in the schema. Default value is "all". | 'all' |

Returns:

Type | Description |
---|---|
dict | A dictionary with the dataset level metadata template |

Raises:

Type | Description |
---|---|
invalidApiResponseException | attribute/key error |
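A quick sketch of fetching the template; repo_id "9" here is only an illustration:

# fetch the dataset level metadata template and inspect its keys
template = omixatlas.dataset_metadata_template("9")
print(sorted(template.keys()))  # keys expected in each dataset level metadata json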
delete_datasets(repo_id, dataset_ids)
This function is used to delete datasets from an OmixAtlas. Once the function runs successfully, the status of the deletion is visible on the data ingestion monitoring dashboard after ~15 minutes.
It displays a DataFrame with the status of the operation for each file.
In order to delete datasets from an OmixAtlas, the user must be a Data Admin at the resource level. Please contact polly.support@elucidata.io if you get an Access Denied error message.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
repo_id | int | repo_id of the OmixAtlas | required |
dataset_ids | list | List of dataset_ids that the user wants to delete. | required |

Raises:

Type | Description |
---|---|
paramError | If params are not passed in the desired format or a value is not valid. |
RequestException | If there is an issue in data ingestion. |

Returns:

Type | Description |
---|---|
None | |
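A minimal sketch, mirroring the delete example later in this section (the ids are hypothetical):

# delete two datasets from a hypothetical OmixAtlas
repo_id = "1643359804137"
dataset_ids = ["GSE100009_GPL11154", "GSE145009_GPL11124"]
omixatlas.delete_datasets(repo_id, dataset_ids)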
save_to_workspace(repo_id, dataset_id, workspace_id, workspace_path)
Function to download a dataset from an OmixAtlas and save it to a Workspace.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
repo_id | str | repo_id of the OmixAtlas | required |
dataset_id | str | dataset_id that needs to be saved | required |
workspace_id | int | workspace_id of the workspace in which the dataset needs to be saved | required |
workspace_path | str | Path within the workspace where the dataset is to be saved | required |

Returns:

Name | Type | Description |
---|---|---|
json | json | Info about the workspace where the data is saved and which OmixAtlas it came from |
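A minimal sketch; the ids and path below are hypothetical:

# save a dataset from an OmixAtlas into a workspace folder
response = omixatlas.save_to_workspace("9", "GSE107280_GPL11154", 12345, "geo_GSE107280_GPL11154")
print(response)  # json describing where the data was saved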
update_datasets(repo_id, source_folder_path, destination_folder_path='', priority='low', validation=False)
This function is used to update existing data in an OmixAtlas. Once the function runs successfully, it takes about 30 seconds to log the ingestion request, and within 2 minutes the ingestion log is shown on the data ingestion monitoring dashboard. In order to update datasets, the user must be a Data Contributor at the resource level. Please contact polly.support@elucidata.io if you get an Access Denied error message.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
repo_id | int/str | repo_name/repo_id of the OmixAtlas | required |
source_folder_path | dict | Source folder paths from which the data and metadata files are fetched. The dictionary should have two keys, "data" and "metadata", whose values are the folders where the data and metadata files are stored respectively, i.e. {"data": "<data folder path>", "metadata": "<metadata folder path>"} | required |
destination_folder_path | str | Destination folder structure in S3. Use this only to manage the folder structure in the backend. It is advised not to give any value for this; by default the data goes into the root folder. | '' |
priority | str | Optional parameter (low/medium/high). Priority at which this data has to be ingested into the OmixAtlas. The default value is "low"; other acceptable values are "medium" and "high". | 'low' |
validation | bool | Optional parameter (True/False) to activate validation. False by default, meaning validation is not active. Validation needs to be activated only when validated files are being ingested. | False |

Raises:

Type | Description |
---|---|
paramError | If params are not passed in the desired format or a value is not valid. |
RequestException | If there is an issue in data ingestion. |

Returns:

Type | Description |
---|---|
pd.DataFrame | DataFrame showing the upload status of files |
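A minimal sketch, assuming only the metadata of an already ingested dataset changed (the path and repo_id are hypothetical):

# re-ingest only the updated metadata json for an existing dataset
source_folder_path = {"metadata": "ingestion_demo/metadata/"}
status_df = omixatlas.update_datasets("1654268055800", source_folder_path, priority="medium")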
move_data(source_repo_key, destination_repo_key, dataset_ids, priority='medium')
This function is used to move datasets from a source atlas to a destination atlas. It should only be used when the schemas of the source and destination atlases are compatible with each other; otherwise the data may not behave the same way in the destination atlas, or the ingestion may fail. Please contact polly.support@elucidata.io if you get an Access Denied error message.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
source_repo_key | str/int | Source repo key of the dataset_ids. Only repo_id is supported for now. | required |
destination_repo_key | str/int | Destination repo key to which the data needs to be transferred | required |
dataset_ids | list | List of dataset_ids to transfer | required |
priority | str | Optional parameter (low/medium/high). Priority of ingestion. Defaults to "medium". | 'medium' |

Returns:

Type | Description |
---|---|
None | |
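A minimal sketch; the repo keys and dataset id are hypothetical:

# move one dataset between two atlases with compatible schemas
omixatlas.move_data(
    source_repo_key="1643359804137",
    destination_repo_key="1654268055800",
    dataset_ids=["GSE12332_GPL123"],
    priority="medium",
)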
Examples
# Install polly python
pip install polly-python
# Import libraries
import os
from polly.auth import Polly
from polly.omixatlas import OmixAtlas

# Create omixatlas object and authenticate
AUTH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']
Polly.auth(AUTH_TOKEN)
omixatlas = OmixAtlas()
Add data to a new OmixAtlas
Addition of data into a newly created OmixAtlas, also referred to as ingestion, can be done using the polly-python function add_datasets.
The OmixAtlas to which the data is to be added should have a schema that supports the metadata contained in the data.
Please see this FAQ to check if the metadata and the schema of the OmixAtlas match. While adding a dataset, both the metadata file (json) and the data file (h5ad, gct, vcf) are required.
In order to use this function,

- the metadata and data files should be present in separate folders, and the paths to these folders should be provided as a dictionary with keys metadata and data whose values are the metadata folder path and data folder path respectively.
- each metadata file should have a corresponding data file and vice versa. The metadata and data files of a dataset are expected to have the same name as the dataset_id. For example, GSE100009_GPL11154.json and GSE100009_GPL11154.gct are the metadata file and data file for the dataset_id GSE100009_GPL11154.

Once the files are uploaded for ingestion, the ingestion progress and logs can be monitored and fetched on the ingestion monitoring dashboard.
data_source_folder_path = "data_ingestion_demo/data/"
metadata_source_folder_path = "data_ingestion_demo/metadata/"
source_data_folder = {"data": data_source_folder_path, "metadata": metadata_source_folder_path}
repo_id = "<repo_id>"  # repo_id of the destination OmixAtlas
omixatlas.add_datasets(repo_id, source_data_folder, priority="medium")
File Name Message
0 combined_metadata.json File Uploaded
1 GSE100009_GPL11154_raw.gct File Uploaded
2 GSE100013_GPL16791_raw.gct File Uploaded
Please wait for 30 seconds while your ingestion request is getting logged.
Your ingestion request is successfully logged. You can go to ingestion monitoring dashboard for tracking it's status.
Usage of destination_folder/validation flags
- destination_folder: This is an optional parameter specifying where the files will be uploaded in S3. It is advised not to give any value for this, as it is only used in special cases. By default the data is uploaded to the root folder.
- priority: This is an optional parameter too. It states the priority at which this data has to be ingested into the OmixAtlas. The default value is "low". Acceptable values are "low", "medium" and "high".
- validation: This option can be provided as True or False. By default, validation is inactive. When validation needs to be activated, the dataset level metadata has to be generated in a certain way. We'll add an example soon.
Template for dataset level metadata
The keys used in the dataset level metadata must be compatible with the schema of the OmixAtlas. The dataset level metadata template can be fetched using the function shown below. In this example, after fetching the template, we load the keys of the dataset level metadata and check whether they match the schema requirements.
import json

# example: get the metadata template of the destination atlas; here, we are looking at repo_id "9"
data_metadata_template_geo = omixatlas.dataset_metadata_template("9")
template_keys = set(data_metadata_template_geo.keys())

# getting the dataset level metadata keys from the dataset metadata json that is to be ingested
with open('/import/data_ingestion_demo/metadata/GSE95448_GPL19057.json') as f:
    data = json.load(f)
keys_in_json = set(data.keys())

# comparing the keys in the destination atlas template vs the keys in the dataset metadata json;
# the difference below lists template keys missing from the json
intersect = keys_in_json.intersection(template_keys)
template_keys.difference(intersect)
Move data from source to destination
This function is used to move datasets from a source atlas to a destination atlas. It should only be used when the schemas of the source and destination atlases are compatible with each other; otherwise the data may not behave the same way in the destination atlas, or the ingestion may fail.
# example: moving 3 datasets from source "geo_transcriptomics_omixatlas" to destination "rankine_atlas"
omixatlas.move_data(source_repo_key = "geo_transcriptomics_omixatlas", destination_repo_key = "rankine_atlas",
dataset_ids = ["GSE12332_GPL123", "GSE43234_GPL143", "GSE89768_GPL967"])
Update the data or metadata in omixatlas
Data already ingested into the OmixAtlas can be updated by re-ingesting the metadata file of a dataset, the data file of a dataset, or both, depending on what needs to be updated. The update progress can also be seen on the ingestion monitoring dashboard. However, if there is no change in the files, the process will not be initiated and will not be shown on the ingestion monitoring dashboard.
In order to use this function,

- The metadata and data files should be present in separate folders, and the paths to these folders should be provided as a dictionary with keys "metadata" and "data" whose values are the metadata folder path and data folder path respectively, whichever is applicable.
- If the dataset was originally ingested into a specific destination folder, then that path should be provided.
  - In case the destination folder path is not provided, the root directory is taken as the default.
  - In case the destination folder path doesn't match, the system will provide the folder path where the data or metadata is present in the OmixAtlas in a warning message. Please see example 3 below for more details.
- In case the data or metadata being updated has not been ingested before, an appropriate warning is shown and the user is advised to use the add_datasets function to add the data first.
Ex 1: Update both data and metadata files
metadata_folder_path = "repoid_1654268055800_files_test/metadata_2/"
data_folder_path = "repoid_1654268055800_files_test/data_2/"
repo_id= "1654268055800"
source_folder_path = {"metadata":metadata_folder_path, "data": data_folder_path}
omixatlas.update_datasets(repo_id, source_folder_path)
repoid_1654268055800_files_test/data_2/:
ACBC_MSKCC_2015_Copy_Number_AdCC10T.gct
repoid_1654268055800_files_test/metadata_2/:
ACBC_MSKCC_2015_Copy_Number_AdCC10T.json
Processing Metadata files: 100%|██████████| 1/1 [00:00<00:00, 357.57it/s]
Uploading data files: 100%|██████████| 1/1 [00:00<00:00, 6.36files/s]
Please wait for 30 seconds while your ingestion request is getting logged.
Your ingestion request is successfully logged. You can go to ingestion monitoring dashboard for tracking it's status.
File Name Message
0 combined_metadata.json File Uploaded
1 ACBC_MSKCC_2015_Copy_Number_AdCC10T.gct File Uploaded
Ex 2: Update only data files
data_folder_path = "repoid_1654268055800_files_test/data"
repo_id= "1654268055800"
destination_folder_path = "Mutation"
priority = "medium"
source_folder_path = {"data":data_folder_path}
omixatlas.update_datasets(repo_id, source_folder_path, destination_folder_path, priority)
Please wait for 30 seconds while your ingestion request is getting logged.
Your ingestion request is successfully logged. You can go to ingestion monitoring dashboard for tracking it's status.
File Name Message
0 CCLE_Mutation_C3A_LIVER.gct File Uploaded
Ex 3: Update data which doesn't exist OR if destination folder is incorrect
# update with an invalid/absent destination folder path (DFP)
metadata_folder_path = "repoid_1654268055800_files_test/metadata_3/"
data_folder_path = "repoid_1654268055800_files_test/data_3/"
repo_id= "1654268055800"
priority = "medium"
destination_folder_path = "transcriptomics_76" #invalid folder path
source_folder_path = {"metadata":metadata_folder_path, "data": data_folder_path}
omixatlas.update_datasets(repo_id, source_folder_path, destination_folder_path, priority)
Processing Metadata files: 0%| | 0/1 [00:00<?, ?it/s]WARNING: Unable to update metadata/data file BRCA_BCCRC_Mutation_SA018_2.json because corresponding data/metadata file not present in OmixAtlas. Please add the files using add_datasets function. For any questions, please reach out to polly.support@elucidata.io.
WARNING: Unable to update the data/metadata for BRCA_BCCRC_Mutation_SA018_2.json because original data file not present in the provided destination folder path: transcriptomics_76 in the omixatlas. Please choose the required destination folder path from the following:
["upload_folder_1/test_1"].
For any questions, please reach out to polly.support@elucidata.io.
Processing Metadata files: 100%|██████████| 1/1 [00:00<00:00, 351.96it/s]
WARNING: Unable to update metadata/data file BRCA_BCCRC_Mutation_SA018_2.gct because corresponding data/metadata file not present in OmixAtlas. Please add the files using add_datasets function. For any questions, please reach out to polly.support@elucidata.io.
WARNING: Unable to update the data/metadata for BRCA_BCCRC_Mutation_SA018_2.gct because original data file not present in the provided destination folder path: transcriptomics_76 in the omixatlas. Please choose the required destination folder path from the following:
[].
For any questions, please reach out to polly.support@elucidata.io.
Uploading data files: 0files [00:00, ?files/s]
Delete data from OmixAtlas
Ingested datasets can be deleted from the OmixAtlas using the delete_datasets function. A list of dataset_ids can be provided in order to delete multiple datasets. The status of the delete operation can be seen on the ingestion monitoring dashboard.
repo_id = "1643359804137"
dataset_ids = ["GSE100009_GPL11154", "GSE145009_GPL11124"]
omixatlas.delete_datasets(repo_id, dataset_ids)
Data ingestion monitoring dashboard
The data ingestion monitoring dashboard on the GUI allows users to monitor the progress of ingestion runs (add_datasets, update_datasets, delete_datasets). For each dataset undergoing ingestion (addition/update) or deletion, the logs are available here to view and download.
To know more about the ingestion monitoring dashboard, please refer to this section.
How and why to save data in workspace?
Workspaces allow users to download and save data from an analysis or an OmixAtlas so it can be reused, instead of downloading it to the local system. Workspaces act as storage spaces with the additional capability of sharing and collaborating with other users. The workspace_id can be fetched from the URL that appears on opening a workspace on the GUI, and needs to be passed as an integer.
repo_id = "9"
dataset_id = "GSE107280_GPL11154"
workspace_id = 12345
workspace_path= "geo_GSE107280_GPL11154"
omixatlas.save_to_workspace(repo_id, dataset_id, workspace_id, workspace_path)
INFO:root:Data Saved to workspace=12345
{'data': {'type': 'workspace-jobs',
'id': '9/12345',
'attributes': {'destination-gct': '12345/geo_GSE107280_GPL11154/GSE107280_GPL11154_curated.gct',
'destination-json': '12345/geo_GSE107280_GPL11154/GSE107280_GPL11154.json'}}}
Import a file from workspace to Polly Notebooks
Files present in a workspace can be viewed from a notebook and also synced, so that they are available for use in the current analysis notebook as well. Please note that only files present in the same workspace as the analysis/notebook can be synced.
# to list files in the folder "repoid_1654268055800_files_test/" in the current workspace
!polly files list --workspace-path "polly://repoid_1654268055800_files_test/" -y
# copy the files from the folder repoid_1654268055800_files_test/ in the current workspace to the notebook under the folder destination_repoid_1654268055800_files_test/
!polly files sync -s "polly://repoid_1654268055800_files_test/" -d "destination_repoid_1654268055800_files_test/" -y