Curating OmixAtlas - Bulk RNAseq
List of Curated Fields on Polly
Bulk RNASeq Data
1. Dataset-level Metadata
Name | Description | Ontology | GUI Display Name | Polly-Python Display Name |
---|---|---|---|---|
Organism | This field represents the organism from which the samples originated. Organism labels already present in the source metadata are normalized using a normalization model. In case the organism labels are missing, related texts and abstracts are processed and further normalized to get the organism metadata label. | NCBI Taxonomy | Organism | curated_organism |
Tissue | This field represents the tissue(s) from which the samples in the dataset are derived. Tissue labels already present in the source metadata are normalized using a normalization model. In cases where tissue labels are missing, related texts and abstracts are processed and further normalized to get the tissue metadata label. Tissue labels for datasets consist of all tissue names from which the samples are derived. Tissue labels are annotated for samples extracted from a healthy tissue or a diseased tissue. Key specifications for tissue metadata annotations are as follows: | BRENDA Tissue Ontology | Tissue | curated_tissue |
Drug | This field represents the drug(s) that have been used in the treatment of the samples or relate to the experiment in some other way. Drug labels already present in the source metadata are normalized using a normalization model. In cases where drug labels are missing, related texts and abstracts are processed and further normalized to get the drug metadata label. Drug labels are annotated for the following types of sample treatments: Drug labels are not annotated for the following types of sample treatments: Note: Any mention of the drug in the text is included as a drug label irrespective of whether it is being used in the experiment or not. | PubChem | Drug | curated_drug |
Disease | This field represents the disease(s) being studied in the experiment. Disease labels already present in the source metadata are normalized using a normalization model. In case the disease labels are missing, related texts and abstracts are processed and further normalized to get the disease metadata label. Disease labels are annotated when the samples have been collected from diseased tissue or organism. Examples of such cases are as follows: Note: In studies where the cell lines are extracted from a healthy tissue/organism and then conditioned to induce disease, the disease label for the dataset will be "normal", since the sample is not extracted from any diseased tissue or organism. Key specifications for disease metadata annotations are: | MeSH | Disease | curated_disease |
Cell Type | This field represents the cell type of the samples within the study. Cell type labels already present in the metadata at the source are normalized using a normalization model. In case the labels are missing, related texts and abstracts are processed and further normalized to get the cell type metadata label. Cell type labels are annotated in cases where the authors have cultured a particular cell type, either extracted from tissues or developmental organs or generated in the lab, and then used it in further experiments. Key specifications for cell type metadata annotations are as follows: | Cell Ontology | Cell Type | curated_cell_type |
Cell line | This field represents the cell line from which the samples were extracted. It lists the populations of modified cells used for the study. Cell line labels already present in the source metadata are normalized using synonyms present in the cell line ontology we use. Cell line labels are annotated in cases where the authors have cultured a particular cell line, or bought it from organisations such as ATCC, and then used it in further experiments, e.g., MDA-MB-231, HEK-293. Key specifications for cell line metadata annotations are as follows: | Cellosaurus | Cell Line | curated_cell_line |
Other metadata fields | ||||
Name | Description | GUI Display Name | Polly-Python Display Name | |
Alignment algorithm | This field represents the method used for the alignment and quantification of RNASeq data. | Alignment algorithm | alignment_method | |
Reference Gene Annotation | This field represents the gene annotation library used in the processing of data, e.g., Ensembl release 107. | NA | ref_gene_annotations | |
Reference genome | This field represents the reference genome used for alignment, e.g., GRCh38, GRCm38. | NA | reference_genome | |
Gene | This field provides the gene(s) studied in the dataset. | Gene | curated_gene | |
Strain | This field provides the names of the strains/genetic variants of the organism from which the samples are taken. The strain label is annotated for the strain of mice and rats used during the experimental process, with reference to the strain attribute. For example, for the strain attribute "wild-derived", the curated strain label is CASA/RkJ. | NA | curated_strain | |
Chemical Treatment | This field indicates whether or not any sample in the dataset is exposed to chemical treatment. If a dataset has any sample with chemical treatment, the value of 'curated_dataset_has_treatment' is 'True'; otherwise it is 'False'. | Treatment Dataset | curated_dataset_has_treatment | |
Clinical Information | This field indicates whether a dataset is clinical or non-clinical. If a dataset has any clinical sample, the value of 'curated_dataset_is_clinical' is 'True'; otherwise it is 'False'. A dataset is labelled as clinical or non-clinical based on the following definitions. Clinical Data: Samples are collected from humans and are subject to direct measurements. Non-Clinical Data: Samples are collected from in vitro laboratory studies and from in vivo studies in animals. Note: Clinical data does not include: | Clinical Dataset | curated_dataset_is_clinical | |
Abstract | This field provides the abstract of the publication associated with the dataset. | NA | abstract | |
Description | This field provides a brief description of the experiment or the study. | NA | description | |
Overall design | This field provides information on the overall design of the experiment as given by the author. | Overall Design | overall_design | |
Summary | This field provides a detailed summary of the publication (can be the abstract) or a summary of the experiment. | Summary | summary | |
Publication | This field provides the link to the data source providing more information on the dataset. | NA | publication | |
Year | This field provides the year in which the dataset or study is published. | NA | year | |
Data Type | This field provides the type of biomolecular data represented/studied in the dataset. | Data type | data_type | |
Dataset ID | This field provides the unique id for the dataset/study to represent a group of samples. | Dataset ID | dataset_id | |
Number of Samples | This field represents the total number of samples in a dataset. | Samples | total_num_samples | |
Source | This field represents the name of the source repository from where data has been obtained. | NA | dataset_source |
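The "any sample" rule described above for `curated_dataset_has_treatment` and `curated_dataset_is_clinical` can be sketched in a few lines of Python. This is only an illustration of the rule, not Polly's implementation: the per-sample keys `has_treatment` and `is_clinical` and the function name `dataset_flags` are hypothetical, and the 'True'/'False' string values simply mirror how the table quotes them.

```python
from typing import Dict, Iterable


def dataset_flags(samples: Iterable[Dict[str, bool]]) -> Dict[str, str]:
    """Roll hypothetical per-sample booleans up to dataset-level flags.

    A dataset-level flag is 'True' if at least one sample sets the
    corresponding per-sample key, and 'False' otherwise.
    """
    samples = list(samples)
    has_treatment = any(s.get("has_treatment", False) for s in samples)
    is_clinical = any(s.get("is_clinical", False) for s in samples)
    return {
        "curated_dataset_has_treatment": str(has_treatment),
        "curated_dataset_is_clinical": str(is_clinical),
    }


# Two samples: only the second one received a chemical treatment.
example = [
    {"has_treatment": False, "is_clinical": False},
    {"has_treatment": True, "is_clinical": False},
]
flags = dataset_flags(example)
```

With this input, a single treated sample is enough to set the treatment flag for the whole dataset, while the clinical flag stays 'False' because no sample is clinical.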
2. Sample Level Metadata
Name | Description | GUI Display Name | Polly-Python Display Name |
---|---|---|---|
Tissue | This field represents the tissue(s) from which the samples originated. Tissue labels already present in the source metadata are normalized using a normalization model. In cases where tissue labels are missing, related texts and abstracts are processed and further normalized to get the tissue metadata label. Tissue labels are annotated for samples extracted from healthy or diseased tissue. All labels are harmonized with the BRENDA Tissue Ontology. | Tissue | curated_tissue |
Disease | At the sample level, this field represents the disease associated with a particular sample. Disease labels already present in the source metadata are normalized using a normalization model. In case the disease labels are missing, related texts and abstracts are processed and further normalized to get the disease metadata label. Disease labels are annotated for a sample when the sample has been collected from diseased tissue or organism. Examples of such cases are as follows: At the sample level, disease labels are annotated for the following sample type: | Disease | curated_disease |
Drug | This field represents the drugs that have been used in the treatment of a sample. Drug labels already present in the source metadata are normalized using a normalization model. In cases where drug labels are missing, related texts and abstracts are processed and further normalized to get the drug metadata label. Drug labels are annotated for the following types of sample treatments: Drug labels are not annotated for the following types of sample treatments: Note: Any mention of the drug in the text is included as a drug label irrespective of whether it is being used in the experiment or not. | Drug | curated_drug |
Cell line | This field represents the cell line from which the sample was derived. Cell line labels already present in the source metadata are normalized using synonyms present in the cell line ontology we use. The cell line field is curated for a sample if the authors have cultured a particular cell line, or bought it from organizations such as ATCC, and then used it in further experiments. The names of the cell lines are harmonized with the Cellosaurus ontology. | Cell line | curated_cell_line |
Cell Type | This field represents the cell type of the samples within the study. Cell type labels already present in the metadata at the source are normalized using a normalization model. In case the labels are missing, related texts and abstracts are processed and further normalized to get the cell type metadata label. Cell type labels are annotated in cases where the authors have cultured a particular cell type, either extracted from tissues or developmental organs or generated in the lab, and then used it in further experiments. Key specifications for cell type metadata annotations are as follows: | Cell type | curated_cell_type |
Cohort | This field represents the name of the cohort/group to which the sample belongs, providing information about the experimental condition/tissue/cell line/treatment. The cohort names are assigned automatically and might not always represent the actual cohorts studied in the paper. | Cohort name | curated_cohort_name |
Cohort ID | This field represents the numeric ID of the cohort to which the sample belongs. | Cohort ID | curated_cohort_id |
Control | This field indicates whether the sample is a control or a perturbation for a particular experiment. | NA | curated_control |
Drug smiles | This field represents the SMILES code of the drug used in the study. | Drug (SMILES code) | drug_smiles |
Cancer Stage and Grade (Available through metadata download or in the GCT) | This field represents the cancer stage and grade of the clinical samples whose disease label is a cancer. Cancer Stage System: The cancer stage is labelled in terms of the TNM staging system as well as the number staging system. 1. The TNM Staging System: The TNM system is the most widely used cancer staging system. An X after T, N, or M (TX/NX/MX) means that the main tumor, cancer in the lymph nodes, or metastasis, respectively, cannot be measured. 2. Number staging system: For many cancers, the TNM combinations are grouped into five less-detailed stages. The higher the number, the larger the tumor and the more it has spread into nearby tissues. Therefore, for each sample, the fields for cancer stage and grade are as follows: T: Stores the T values from the TNM system. Examples: T2, TX, T1a. N: Stores the N values from the TNM system. Examples: N2, NX, N1b. M: Stores the M values from the TNM system. Examples: M2, MX, M1c. Stage: Stores the value for the number staging system. Examples: Stage 0, Stage 1. Grade: Stores the value for the cancer grading system. Examples: Grade 0, Grade 1. | NA | NA |
Gene | This field represents the gene(s) under study in a sample. | Gene | curated_gene |
Modified genes | This field represents the gene(s) modified in the sample under study. | Gene modified | curated_gene_modified |
Genetic Modification | This field represents the type of genetic modification done on the sample. | Gene modification | curated_genetic_modification_type |
Characteristics [Note: This field provides variations of sample characteristics with respect to various parameters such as genotype, cell type, tissue, etc. In the field name, '_ch1' and '_ch2' represent different variations.] | This field provides a summary of sample characteristics in terms of ID/genotype/origin/cell type/gene etc. | NA | characteristics_ch1; characteristics_ch2 |
Extraction protocol [Note: This field provides brief information on the sample extraction protocol. Details of the extraction protocol for samples with respect to various parameters are available as separate fields. In the field name, '_ch1' and '_ch2' represent different details.] | This field provides a summary of the extraction protocol for the sample. | NA | extract_protocol_ch1; extract_protocol_ch2 |
Treatment Protocol [Note: This field provides brief information on the sample treatment protocol. Details of the treatment protocol for samples with respect to various parameters are available as separate fields. In the field name, '_ch1' and '_ch2' represent different details.] | This field provides a summary of the treatment protocol for the sample. | NA | treatment_protocol_ch1; treatment_protocol_ch2 |
Description | This field provides a brief description of the sample. | NA | description |
Title | This field represents the type/genotype/origin/experimental condition of the sample. | Title | title |
Platform ID | This field represents the unique platform ID as per the GEO platform accession identifier. | NA | platform_id |
Sample ID | Unique ID for each sample as per the GEO sample accession number. | Sample ID | sample_id |
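The T, N, M, Stage, and Grade values described in the Cancer Stage and Grade row can be sanity-checked with a small validator. The sketch below is built only from the example values listed in that row (such as T1a, NX, Stage 0); the regular expressions, the dictionary keys, and the function name `invalid_stage_fields` are assumptions for illustration, and real staging data may contain forms these patterns do not cover.

```python
import re
from typing import Dict, List

# Assumed patterns, inferred only from the documented examples:
# a letter followed by X or a digit with an optional lowercase suffix.
PATTERNS = {
    "T": re.compile(r"T(X|\d[a-z]?)"),
    "N": re.compile(r"N(X|\d[a-z]?)"),
    "M": re.compile(r"M(X|\d[a-z]?)"),
    "Stage": re.compile(r"Stage \d"),
    "Grade": re.compile(r"Grade \d"),
}


def invalid_stage_fields(sample: Dict[str, str]) -> List[str]:
    """Return the names of fields that are present but do not match
    the assumed pattern. Fields missing from the sample are ignored."""
    return [
        field
        for field, pattern in PATTERNS.items()
        if field in sample and not pattern.fullmatch(sample[field])
    ]


sample = {"T": "T1a", "N": "NX", "M": "M2", "Stage": "Stage 1", "Grade": "Grade 0"}
problems = invalid_stage_fields(sample)
```

Using `fullmatch` rather than `match` means a value such as "T2 (approx.)" is reported instead of silently accepted, which is usually the safer default when auditing curated metadata.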
3. Feature Level Metadata
Field | Description | Polly-Python Display Name |
---|---|---|
Data ID | This field represents the unique ID for this data entity on Polly. | data_id |
Name | This field represents the ID of the feature - gene ID, protein ID etc. | name |