
Schema Management

What is data schema?

There are over 2M datasets on Polly, and how users consume this data depends on how they can view and access it. The schema of the data plays an important role in determining how users can consume data in an OmixAtlas, and is therefore key to a good user experience.

Key user-facing properties controlled by the schema of an OmixAtlas

  1. Only the fields included in the schema will be available for querying using polly-python.
  2. Fields available in the filter panels and the size of filters displayed on the front-end.
  3. Fields available in the table view of an OmixAtlas
  4. Fields which are of array type. Based on this property, the type of SQL query for that field will vary.
  5. Fields which are curated by Elucidata
  6. Display name and description for each field
  7. Enabling ontology recommendations in the filter panel

Names of the schema types to be used when defining the schema

  1. Dataset-level metadata:

    • for all file types: files
  2. Sample-level metadata:

    • for gct files: gct_metadata
    • for h5ad files: h5ad_metadata
    • for biom files: biom_col_metadata
  3. Feature-level metadata:

    • for gct files: gct_row_metadata
    • for h5ad files: h5ad_data
    • for biom files: biom_row_metadata
  4. Variant data:

    • for vcf files: vcf_variant_data
    • for vcf.bgz files: vcf_variant_data

Attributes of the schema

Each field attribute is described below, along with its input type and conditions.

field_name
  • Description: The field name that will be used for querying via polly-python.
  • Input: string
  • Conditions: lowercase only; must start with an alphabet; cannot include special characters except _; cannot be longer than 255 characters; cannot be a reserved SQL, DDL or DML keyword; only plain ASCII is recommended to avoid subtle irritations (accented vs unaccented characters, etc.).

original_name
  • Description: The attribute that needs to match the dataset metadata in the JSON file, or the sample/feature metadata field name in the gct/h5ad file.
  • Input: string
  • Conditions: should match the data present in the json, gct or h5ad files; min_length=1, max_length=50.

type
  • Description: The kind of data to be stored in this field; this affects the kinds of queries you will be able to perform on it.
  • Input: one of text, integer, object, boolean, float

is_keyword
  • Description: Whether to store this field as a keyword in Elasticsearch. It must be 1 if the field is to be shown as a filter on the GUI or used for any aggregations. Keyword fields can be used for sorting, aggregations, and term-level (exact match) queries on the frontend. If a field is not a keyword, it will not be possible to sort data based on that value (if/when the frontend supports it), nor to get the number of unique values of that field or its most frequently occurring values.
  • Input: boolean

is_array
  • Description: Whether this field is a list of values.
  • Input: boolean

is_filter
  • Description: Whether to show this field as a filter on the GUI. Applicable only for dataset-level metadata.
  • Input: boolean
  • Conditions: for these fields, is_keyword must be 1.

is_column
  • Description: Whether to show this field as a column in the GUI table.
  • Input: boolean

is_curated
  • Description: Whether this field has been curated by Elucidata.
  • Input: boolean

filter_size
  • Description: The maximum number of values to show for a filter on the GUI. is_filter and is_keyword should be set to 1. Applicable only for dataset-level metadata.
  • Input: integer
  • Conditions: lower limit: 1, upper limit: 3000.

display_name
  • Description: User-friendly name to be shown on the front-end.
  • Input: string
  • Conditions: min_length=1, max_length=50.

description
  • Description: Any information about the field you want to show to the user.
  • Input: string
  • Conditions: min_length=1, max_length=300.

is_ontology
  • Description: Whether ontology recommendations are to be added in the filter panel for curated dataset-level metadata such as disease, tissue, cell lines, drugs and cell types. Applicable only for dataset-level metadata.
  • Input: boolean
  • Conditions: if is_ontology is 1, then is_filter and is_keyword must be 1 too.

is_reserved
  • Description: Whether the field_name is reserved for data management on the data infrastructure.
  • Input: boolean
  • Conditions: certain fields are managed by the data infrastructure, and this attribute controls that. While preparing the schema, users should set it to False.

is_searchable
  • Description: Whether the field_name can be searched using the search bar on the GUI.
  • Input: boolean
  • Conditions: fields that users want to search on should be marked True.
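For illustration, a single field defined with these attributes could look like the sketch below. The field name and values are only an example; complete payloads appear later on this page.

# Illustrative definition of a single field using the attributes above
# (values are only an example; see the complete payloads later on this page)
example_field = {
    "curated_disease": {
        "type": "text",            # data type stored in this field
        "is_keyword": True,        # needed for filters and aggregations
        "is_array": True,          # the field holds a list of values
        "is_filter": True,         # show this field as a filter on the GUI
        "is_column": True,         # show this field as a column in the GUI table
        "is_curated": True,        # curated by Elucidata
        "is_ontology": False,      # no ontology recommendations in the filter panel
        "filter_size": 1000,       # maximum number of filter values shown (1-3000)
        "original_name": "curated_disease",
        "display_name": "Disease",
        "description": "Diseases studied in the dataset",
        "is_searchable": True,     # searchable from the GUI search bar
        "is_reserved": False       # not managed by the data infrastructure
    }
}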

The OmixAtlas class enables users to interact with the functional properties of an OmixAtlas, such as creating and updating an OmixAtlas, getting a summary of its contents, adding, inserting or updating the schema, adding, updating or deleting datasets, querying metadata, downloading data, saving data to a workspace, etc.

Parameters:

  • token (str, default: None ) –

    token copied from Polly.

Usage

from polly.omixatlas import OmixAtlas

omixatlas = OmixAtlas(token)

get_schema

get_schema(repo_key, schema_level=[], source='', data_type='', return_type='dataframe')

Function to get the schema of all the tables in an OmixAtlas. The user needs to be a Data Admin at the resource level to get the schema of an OmixAtlas. Please contact polly@support.com if you get an Access Denied error message.

Parameters:

  • repo_key (str) –

    repo_id OR repo_name. This is a mandatory field.

  • schema_level (list, default: [] ) –

    Table names for which users want to get the schema. Users can get the table names by querying SHOW TABLES IN <repo_name> using the query_metadata function. The default value is all the table names for the repo.

  • source (str, default: '' ) –

    Source for which user wants to fetch the schema. The default value is all the sources in the schema.

  • data_type (str, default: '' ) –

    Datatype for which user wants to fetch the schema. The default value is all the datatypes in the schema.

  • return_type (str, default: 'dataframe' ) –

    Users who intend to query should use the "dataframe" output, while users who want to perform schema management should get the output in "dict" format. The dataframe format doesn't give the complete schema; it only shows the information that helps users write queries. The default value is "dataframe".

Raises:

  • paramException

    When the function parameters passed are not in the right format.

  • RequestException

    There is some issue in fetching the Schema.

  • invalidApiResponseException

    When the data returned is not in the appropriate format.
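
A minimal usage sketch is shown below; the repo name and table name are illustrative, and the full context is given in the Examples section later on this page.

# Fetch the schema of a single table from an OmixAtlas (illustrative names)
from polly.omixatlas import OmixAtlas

omixatlas = OmixAtlas()
schema = omixatlas.get_schema(
    "geo_transcriptomics_omixatlas",   # repo_name or repo_id
    schema_level=["datasets"],         # table(s) to fetch; default is all tables
    return_type="dataframe"            # "dataframe" for querying, "dict" for schema management
)
schema.datasets.head()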

validate_schema

validate_schema(body)

Validate the payload of the schema. If there are errors in the schema, a table of errors is printed; if there are no errors, a success message is printed.

Payload format:

{
  "data": {
    "type": <type>,
    "id": <id>,
    "attributes": {
      "repo_id": <repo_id>,
      "schema_type": <schema_type>,
      "schema": {}
    }
  }
}

Parameters:

  • body (dict) –

    payload of the schema

Raises:

  • paramException

    if payload is not in correct format

Returns:

  • dict ( dict ) –

    Dataframe having all the errors in the schema

insert_schema

insert_schema(repo_key, body)

This function is used to insert the schema in a newly created OmixAtlas. In order to insert the schema, the user must be a Data Admin at the resource level.
Please contact polly@support.com if you get an Access Denied error message.

Parameters:

  • repo_key ((str / int, Optional)) –

    repo_id OR repo_name of the OmixAtlas. Users can get this by running the get_all_omixatlas function. If not passed, it is taken from the payload.

  • body (dict) –

    The payload should be a JSON file for a specific table as per the structure defined for schema.

Raises:

  • RequestException

    Some Issue in Inserting the Schema for the OmixAtlas.

replace_schema

replace_schema(repo_key, body)

This function completely REPLACES the existing schema with the new schema passed in the body, so it can do all the operations (including deletion of fields if they are no longer present in the new incoming schema). A message will be displayed on the status of the operation. In order to replace the schema, the user must be a Data Admin at the resource level.
Please contact polly@support.com if you get an Access Denied error message. For more information, see Examples.

Parameters:

  • repo_key (str / int) –

    repo_id OR repo_name of the OmixAtlas. Users can get this by running the get_all_omixatlas function. If not passed, it is taken from the payload.

  • body (dict) –

    The payload should be a JSON file for a specific table as per the structure defined for schema.

Raises:

  • RequestException

    Some Issue in Replacing the Schema for the OmixAtlas.

  • paramException

    Function parameters are not passed correctly.

update_schema

update_schema(repo_key, body)

This function is used to update the schema of an existing OmixAtlas. Users should use it when they want to edit a field or its attributes in the existing schema, add new fields, or add a new source or datatype. Using update_schema, users can:
1. ADD a new source or datatype to the schema
2. ADD a new field to a source+data_type combination
3. UPDATE the attributes of an existing field
However, using update_schema, users can't perform DELETE operations on any field, source or datatype.

A message will be displayed on the status of the operation. In order to update the schema, the user must be a Data Admin at the resource level.
Please contact polly@support.com if you get an Access Denied error message. For more information, see Examples.

Parameters:

  • repo_key ((str / int, Optional)) –

    repo_id OR repo_name of the OmixAtlas. Users can get this by running the get_all_omixatlas function. If not passed, it is taken from the payload.

  • body (dict) –

    The payload should be a JSON file for a specific table as per the structure defined for schema.

Raises:

  • RequestException

    Some Issue in Updating the Schema for the OmixAtlas.

  • paramException

    Function parameters are not passed correctly.

Examples

# Install polly python
pip install polly-python

# Import libraries
import os

from polly.auth import Polly
from polly.omixatlas import OmixAtlas

# Create omixatlas object and authenticate
AUTH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']
Polly.auth(AUTH_TOKEN)
omixatlas = OmixAtlas()

Before diving into the operations involved in schema management, it's important for users to be familiar with the structure of the schema of a table in OmixAtlas. The structure is shown below:-

# Schema template
{
    "data": {
        "type": "schemas",
        "id": "<repo_id>",
        "attributes": {
            "repo_id": "<repo_id>",
            "schema_type": "<index name>",
            "schema": {
                "<source>": {
                    "<datatype>": {
                        "<field_name>": {
                            "<attribute_name_1>": <value as per the attribute's datatype>,
                            "<attribute_name_2>": <value as per the attribute's datatype>
                        }
                        ..... multiple fields .....
                    } ..... multiple datatypes .....
                } ..... multiple sources .....
            }
        }
    }
}

How to validate a schema?

Schema validation can be done using function named validate_schema.

This function takes the schema payload as input. Users can define the schema payload directly in the Python shell as a dictionary, or keep it in a JSON file and load that file into a dictionary, as sketched below.

Users should prepare the schema payload as per the guidelines mentioned in this section. This function enforces a validation of those rules to minimise human errors that can be made in the schema.
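
For instance, a payload kept in a JSON file can be loaded into a dictionary and validated as sketched below; the file name is illustrative.

# Load a schema payload from a JSON file and validate it
# (the file name "schema_payload.json" is only an example)
import json

from polly.omixatlas import OmixAtlas

with open("schema_payload.json") as f:
    payload = json.load(f)

omixatlas = OmixAtlas()
omixatlas.validate_schema(payload)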

# In case schema validation fails, error table will be printed and dataframe having errors will be returned
repo_id = 1657777478388
# example payload
payload = {
  "data": {
    "type": "schemas",
    "id": "1657777478388",
    "attributes": {
      "repo_id": "1657777478388",
      "schema_type": "files",
      "schema": {
        "all": {
          "all": {
            "alter": {
              "type": "text",
              "is_keyword": True,
              "is_array": False,
              "is_filter": False,
              "is_column": False,
              "is_curated": False,
              "is_ontology": False,
              "filter_size": 1,
              "original_name": "dataset_source",
              "display_name": "Source",
              "description": "Source from where the data was fetched",
              "is_searchable": True,
              "is_reserved": False
            },
            "dataset_id": {
              "type": "text",
              "is_keyword": True,
              "is_array": False,
              "is_filter": False,
              "is_column": True,
              "is_curated": False,
              "is_ontology": False,
              "filter_size": 1,
              "original_name": "dataset_id",
              "display_name": "Dataset ID",
              "description": "Unique ID assocaited with every dataset",
              "is_searchable": True,
              "is_reserved": False
            },
            "curated_cell_line": {
              "type": "text",
              "is_keyword": True,
              "is_array": True,
              "is_filter": True,
              "is_column": True,
              "is_curated": False,
              "is_ontology": False,
              "filter_size": 1000,
              "original_name": "curated_cell_line",
              "display_name": "Cell line",
              "description": "Cell lines from which the samples were derived for this dataset",
              "is_searchable": True,
              "is_reserved": False
            }
          }
        }
      }
    }
  }
}

from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas()

omixatlas.validate_schema(payload)
╒════╤═════════════════╤══════════════════════════════════════════════════════╤═════════════╕
│    │ loc             │ msg                                                  │ type        │
╞════╪═════════════════╪══════════════════════════════════════════════════════╪═════════════╡
│  0 │ ('field_name',) │ The field name alter matches reserved DDL Constants. │ value_error │
╘════╧═════════════════╧══════════════════════════════════════════════════════╧═════════════╛

# In case schema validation is successful, a success message is printed
repo_id = 1657777478388
# example payload
payload = {
  "data": {
    "type": "schemas",
    "id": "1657777478388",
    "attributes": {
      "repo_id": "1657777478388",
      "schema_type": "files",
      "schema": {
        "all": {
          "all": {
            "data_matrix_available": {
              "type": "text",
              "is_keyword": True,
              "is_array": False,
              "is_filter": False,
              "is_column": False,
              "is_curated": False,
              "is_ontology": False,
              "filter_size": 1,
              "original_name": "dataset_source",
              "display_name": "Source",
              "description": "Source from where the data was fetched",
              "is_searchable": True,
              "is_reserved": False
            },
            "dataset_id": {
              "type": "text",
              "is_keyword": True,
              "is_array": False,
              "is_filter": False,
              "is_column": True,
              "is_curated": False,
              "is_ontology": False,
              "filter_size": 1,
              "original_name": "dataset_id",
              "display_name": "Dataset ID",
              "description": "Unique ID assocaited with every dataset",
              "is_searchable": True,
              "is_reserved": False
            },
            "curated_cell_line": {
              "type": "text",
              "is_keyword": True,
              "is_array": True,
              "is_filter": True,
              "is_column": True,
              "is_curated": False,
              "is_ontology": False,
              "filter_size": 1000,
              "original_name": "curated_cell_line",
              "display_name": "Cell line",
              "description": "Cell lines from which the samples were derived for this dataset",
              "is_searchable": True,
              "is_reserved": False
            }
          }
        }
      }
    }
  }
}

from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas()
omixatlas.validate_schema(payload)
Schema has no errors

How to insert schema in new atlas?

In a new atlas, users should use the insert_schema function to add a schema. An example of dataset-level schema to be inserted in a new OmixAtlas with repo ID: 1657777478388 is shown below.

Note: Users should ensure that the repo_id in the schema payload is the same as the repo_id of the OmixAtlas they intend to insert the schema into.

repo_id = 1657777478388
# example payload
payload = {
  "data": {
    "type": "schemas",
    "id": "1657777478388",
    "attributes": {
      "repo_id": "1657777478388",
      "schema_type": "files",
      "schema": {
        "all": {
          "all": {
            "dataset_source": {
              "type": "text",
              "is_keyword": True,
              "is_array": False,
              "is_filter": False,
              "is_column": False,
              "is_curated": False,
              "is_ontology": False,
              "filter_size": 1,
              "original_name": "dataset_source",
              "display_name": "Source",
              "description": "Source from where the data was fetched",
              "is_searchable": True,
              "is_reserved": False
            },
            "dataset_id": {
              "type": "text",
              "is_keyword": True,
              "is_array": False,
              "is_filter": False,
              "is_column": True,
              "is_curated": False,
              "is_ontology": False,
              "filter_size": 1,
              "original_name": "dataset_id",
              "display_name": "Dataset ID",
              "description": "Unique ID assocaited with every dataset",
              "is_searchable": True,
              "is_reserved": False
            },
            "curated_cell_line": {
              "type": "text",
              "is_keyword": True,
              "is_array": True,
              "is_filter": True,
              "is_column": True,
              "is_curated": False,
              "is_ontology": False,
              "filter_size": 1000,
              "original_name": "curated_cell_line",
              "display_name": "Cell line",
              "description": "Cell lines from which the samples were derived for this dataset",
              "is_searchable": True,
              "is_reserved": False
            }
          }
        }
      }
    }
  }
}
from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas()

omixatlas.insert_schema(repo_id, payload)

Change schema in an existing atlas

The schema of an OmixAtlas can be changed using the update_schema or replace_schema functions. These are described below.

update_schema is a PATCH operation; it can:

  • ADD a new source or datatype to the schema
  • ADD a new field to a source+data_type combination
  • UPDATE the attributes of an existing field

However, using update_schema, users can't perform DELETE operations on any field, source or datatype.

Note:

  1. If the schema change involves any form of deletion (source, datatype or field), users must use the replace_schema function as shown below.
  2. While using update_schema, users may provide only the fields that need modification in the payload instead of the entire payload (see the sketch after the code below).
  3. Users should ensure that the repo_id in the schema payload is the same as the repo_id they intend to update the schema for.

replace_schema is a PUT operation. It completely REPLACES the existing schema with the one provided, so it can do all the operations (including deletion of fields if they are no longer present in the new incoming schema).

Note:-

  1. While using replace_schema, users must provide all the fields in the payload.
  2. Users should ensure that the repo_id in the schema payload is the same as the repo_id they intend to replace the schema for.
#updating a schema
from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas()

omixatlas.update_schema(repo_id, payload)
#replacing a schema
from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas()

omixatlas.replace_schema(repo_id, payload)
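
As a sketch of the note above that update_schema accepts a partial payload, the example below modifies a single existing field without passing the full schema. The repo_id, field and attribute values are illustrative.

# Sketch of a partial payload for update_schema (PATCH semantics):
# only the field being modified is included. Values are illustrative.
partial_payload = {
    "data": {
        "type": "schemas",
        "id": "1657777478388",
        "attributes": {
            "repo_id": "1657777478388",
            "schema_type": "files",
            "schema": {
                "all": {
                    "all": {
                        "curated_cell_line": {
                            "type": "text",
                            "is_keyword": True,
                            "is_array": True,
                            "is_filter": True,
                            "is_column": True,
                            "is_curated": True,   # e.g. updating this attribute
                            "is_ontology": False,
                            "filter_size": 1000,
                            "original_name": "curated_cell_line",
                            "display_name": "Cell line",
                            "description": "Cell lines from which the samples were derived for this dataset",
                            "is_searchable": True,
                            "is_reserved": False
                        }
                    }
                }
            }
        }
    }
}

omixatlas.update_schema("1657777478388", partial_payload)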

What happens upon changing schema?

When the schema is changed (using either update_schema or replace_schema), the system first calculates the difference between the incoming and existing schema. Based on the difference, one of the following may happen.

  1. Ultra-fast update of the schema, with no job created. This change happens instantaneously.
  2. Fast update of the schema, with a job created that can be tracked on the ingestion monitoring dashboard. This takes a few minutes.
  3. Reindexing of the tables as per the incoming schema. If reindexing happens, the repository goes into a locked state and no further operation is possible. This can be tracked on the ingestion monitoring dashboard as well, and may take several minutes to a few hours depending on how many datasets are in the atlas and which table is undergoing indexing.

Refer to this release note for more details. https://elucidatainc.atlassian.net/wiki/spaces/LIB/pages/3976562881/Release+notes+0.2.5+-+UX+for+schema+update

Use get_schema response

Background:-

From a querying perspective, there are certain attributes of a field that users should be aware of in order to write appropriate SQL queries: the description, the field type, whether the field is curated, and whether it is an array.

Query if a field is an array:-

"SELECT * FROM geo_transcriptomics_omixatlas.datasets WHERE CONTAINS(curated_disease,'Obesity')"

Query if a field is not an array:-

"SELECT * FROM geo_transcriptomics_omixatlas.datasets WHERE description LIKE '%BRCA%'"

To query the OmixAtlas

In order to support both users who want to query data and users who want to manage it, the get_schema function supports two different return_type values. These are summarised below.

from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas()

schema = omixatlas.get_schema("geo_transcriptomics_omixatlas", return_type = "dataframe")

Note: The response of the get_schema function is a named tuple whose attributes are the table names (datasets, samples, features, samples_singlecell, features_singlecell, etc.).

schema.datasets.head() #for schema of dataset level metadata in gct files
schema.samples.head() #for schema of sample level metadata in gct files
schema.features.head() #for schema of feature level metadata in gct files
schema.samples_singlecell.head() #for schema of sample level metadata in h5ad files
schema.features_singlecell.head() #for schema of feature level metadata in h5ad files

To manage data on Polly

Users who intend to manage data on Polly need the entire schema payload. It can be fetched as shown in the example below:-

from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas()

schema = omixatlas.get_schema("geo_transcriptomics_omixatlas", return_type = "dict")
schema.datasets
   {
     "data": {
       "type": "schemas",
       "id": "1668142600760",
       "attributes": {
         "repo_id": "1668142600760",
         "schema_type": "files",
         "schema": {
           "all": {
             "all": {
               "curated_organism": {
                 "type": "text",
                 "is_keyword": true,
                 "is_array": true,
                 "is_filter": true,
                 "is_column": true,
                 "is_curated": true,
                 "is_ontology": false,
                 "is_searchable": true,
                 "filter_size": 1000,
                 "original_name": "organism",
                 "display_name": "Organism",
                 "description": "Orgnism from which the samples were derived",
                 "is_reserved": false
               },
             <other fields> 
             },
             "created_time": 1668487618265,
             "last_updated_time": 1676267373231,
             "enabled": true,
             "schema_version": 9,
             "last_updated_by": "50"
           }
         }
       }
     }
   }

Similarly, users can fetch the schema of the other tables in an OmixAtlas as follows.

schema.datasets #for schema of dataset level metadata in gct files
schema.samples #for schema of sample level metadata in gct files
schema.features #for schema of feature level metadata in gct files
schema.samples_singlecell #for schema of sample level metadata in h5ad files
schema.features_singlecell #for schema of feature level metadata in h5ad files

How to copy schema between two atlases?

To copy the schema from a source atlas to a destination atlas, users need to perform the following steps:-

  1. get the schema (return_type = "dict") for all the tables in the source atlas.

  2. update the repo_id and id keys in the payload fetched from the source as per the repo_id of the destination atlas.

  3. If the destination atlas is new, use the insert_schema function to insert the schema for all the tables. If the destination atlas already has a schema, then update_schema or replace_schema should be used.

Please note the user needs to have the relevant permissions for this operation (they should be a Data Admin at the resource level). Also, while the schema change is happening in the destination atlas, it gets locked for a short while. While the atlas is locked, no further schema changes are allowed until it is unlocked.

In the example below, we copy the schema from the geo_transcriptomics_omixatlas atlas into a new destination atlas with repo_id 1671618104796.

# create omixatlas object with authentication
from polly.omixatlas import OmixAtlas
omixatlas = OmixAtlas(AUTH_TOKEN)
# Step 1: get the schema for the different tables from the source OA
source_schema = omixatlas.get_schema("geo_transcriptomics_omixatlas", return_type = "dict")
destination_atlas_repo_id = 1671618104796 #example

#dataset level geo_transcriptomics_omixatlas 
source_schema_dataset = source_schema.datasets

#sample level geo_transcriptomics_omixatlas 
source_schema_sample = source_schema.samples

#feature level geo_transcriptomics_omixatlas 
source_schema_feature = source_schema.features
# Step 2: update the fetched payload
#updating in dataset level schema
source_schema_dataset["data"]["id"] = destination_atlas_repo_id
source_schema_dataset["data"]["attributes"]["repo_id"] = destination_atlas_repo_id

#updating in sample level schema
source_schema_sample["data"]["id"] = destination_atlas_repo_id
source_schema_sample["data"]["attributes"]["repo_id"] = destination_atlas_repo_id

#updating in feature level schema
source_schema_feature["data"]["id"] = destination_atlas_repo_id
source_schema_feature["data"]["attributes"]["repo_id"] = destination_atlas_repo_id
# Step 3: inserting the schema in the destination OA
#inserting the dataset level schema from source into new destination omixatlas
omixatlas.insert_schema(destination_atlas_repo_id, source_schema_dataset)

#inserting the sample level schema from source into new destination omixatlas
omixatlas.insert_schema(destination_atlas_repo_id, source_schema_sample)

#inserting the feature level schema from source into new destination omixatlas
omixatlas.insert_schema(destination_atlas_repo_id, source_schema_feature)

Response for each of them would look like this:-

    Schema has been Inserted. Please use get_schema functionality to get the inserted schema.