Curating Atlas - Bulk RNAseq

List of Curated Fields on Polly

Bulk RNASeq Data

1. Dataset-level Metadata

Name	Description	Ontology	GUI Display Name	Polly-Python Display Name
Organism	This field represents the organism from which the samples originated. Organism labels already present in the source metadata are normalized using a normalization model. In case the organism labels are missing, related texts and abstracts are processed and further normalized to get the organism metadata label.	NCBI Taxonomy	Organism	curated_organism
Tissue	This field represents the tissue(s) from which the samples in the dataset are derived. Tissue labels already present in the source metadata are normalized using a normalization model. In cases where tissue labels are missing, related texts and abstracts are processed and further normalized to get the tissue metadata label. Tissue labels for a dataset consist of all tissue names from which the samples are derived. Tissue labels are annotated for samples extracted from either healthy or diseased tissue. Key specifications for tissue metadata annotations are as follows: All labels are harmonized with Brenda Tissue Ontology. Dual Channel datasets, where the author has studied two different cases in a single sample, are not curated with tissue metadata. Datasets with numerical metadata are not annotated.	Brenda Tissue Ontology	Tissue	curated_tissue
Drug	This field represents the drug(s) that have been used in the treatment of the samples or that relate to the experiment in some other way. Drug labels already present in the source metadata are normalized using a normalization model. In cases where drug labels are missing, related texts and abstracts are processed and further normalized to get the drug metadata label. Drug labels are annotated for the following types of sample treatments: Chemical treatment - Treatment for activation/regulation/inhibition of a protein or nucleic acid. Stimulation - Treatment to elicit an immune response, e.g. cytokine stimulation (Interferon alpha, TNF-a) or LPS. Drug treatment - Treatment with a substance used to treat an illness, relieve a symptom or modify a chemical process in the body for a specific purpose, e.g. 3,5-diethoxycarbonyl-1,4-dihydrocollidine (DDC). Drug labels are not annotated for the following types of sample treatments: Control/Vehicle - The control group receives either no treatment, a standard treatment whose effect is already known, or a placebo; control samples mostly contain organic solvents. Treatment for genetic perturbation - Treatments used for clonal selection or inducible genetic perturbation, e.g. Puromycin, Tamoxifen, Doxycycline. Transfection - RNA molecules (miRNA, shRNA, siRNA etc.) used for genetic manipulation. Other treatments - Treatments such as chemotherapy, radiation therapy or antibodies. Other - Culture media, supplements and detergents, e.g. DMEM, LB Media, Agar Media, glucose, amino acids, fats, lipids, SDS. Note: Any mention of a drug in the text is included as a drug label, irrespective of whether it is used in the experiment.	PubChem	Drug	curated_drug
Disease	This field represents the disease(s) being studied in the experiment. Disease labels already present in the source metadata are normalized using a normalization model. In case the disease labels are missing, related texts and abstracts are processed and further normalized to get the disease metadata label. Disease labels are annotated when the samples have been collected from a diseased tissue or organism. Examples of such cases are as follows: If a cell line has been extracted from a diseased organism/tissue and immortalised for further study, it is annotated for the disease. For example, if the samples are extracted from an osteosarcoma tumor, the disease annotation for the dataset is "osteosarcoma". If a disease model has been created through genetic modification or artificially in a lab for further studies, it is annotated for that disease. For example, a genetically engineered mouse model where an immunodeficient or humanized mouse is implanted with tissue from a patient's tumor. Note: In studies where the cell lines are extracted from a healthy tissue/organism and then conditioned to induce disease, the disease label for the dataset is "normal", since the sample is not extracted from any diseased tissue or organism. Key specifications for disease metadata annotations are: All types of disease are included as disease metadata labels, including viral and bacterial infections, metabolic syndromes such as obesity and diabetes, and cancers such as lung neoplasm and breast neoplasm. Disease labels are annotated for both the full form and the short form; for example, Acute Myeloid Leukaemia is labelled with AML, Myeloid Leukaemia, Leukaemia and Acute Myeloid Leukaemia. For a dataset, the diseases for each sample in the dataset are annotated. A disease mentioned in the metadata of a dataset is labelled irrespective of the study. Processes such as "carcinogenesis" or "tumorigenesis" are not annotated as diseases. Developmental abnormalities are not included. Dual Channel datasets, where the author has studied two different cases in a single sample, are not curated for disease metadata. Tissue, genes, chemical induction, cell type and cell line are not curated as disease.	MeSH	Disease	curated_disease
Cell Type	This field represents the cell type of the samples within the study. Cell type labels already present in the metadata at the source are normalized using a normalization model. In case the labels are missing, related texts and abstracts are processed and further normalized to get the cell type metadata label. Cell type labels are annotated in cases where the authors have cultured a particular cell type, either extracted from tissues or developmental organs or generated in the lab, and then used it in further experiments. Key specifications for cell type metadata annotations are as follows: All labels are harmonized with Cell Ontology. Cell lines are not curated for cell types. For cell types with functional terms such as circulating, associated, derived, etc., only the cell type is annotated. For tissue-specific cell types, the tissue name is labelled along with the cell type, e.g. aortic endothelial cells, spinal motor neurons. Organism terms are not included in the cell type labels, e.g. mouse HSPCs are labelled "HSPCs". Abbreviated cell types with functional conditions are labelled as abbreviated terms, e.g. CTCs.	Cell Ontology	Cell Type	curated_cell_type
Cell line	This field represents the cell line from which the samples were extracted, i.e. the populations of modified cells used for the study. Cell line labels already present in the source metadata are normalized using synonyms present in the cell line ontology. Cell line labels are annotated in cases where the authors have cultured a particular cell line, or bought it from organisations such as ATCC, and then used it in further experiments, e.g. MDA-MB-231, HEK-293. Key specifications for cell line metadata annotations are as follows: All labels are harmonized with the Cellosaurus ontology. Dual Channel datasets, where the author has studied two different cases in a single sample, are not curated for cell line metadata. Datasets with numerical metadata are not annotated.	The Cellosaurus	Cell Line	curated_cell_line
Strain	This field provides the names of the strains/genetic variants of the organism from which the samples are taken. The strain label is annotated for the strain of mice and rats used in the experiment, with reference to the strain attribute. E.g. for the strain attribute "wild-derived", the curated label for the strain name is CASA/RkJ.	MGI	Strain	curated_strain
Gene	This field provides the gene(s) studied in the dataset.	GeneCards	Gene	curated_gene
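
The inclusion and exclusion rules given for the Drug field can be read as a simple lookup by treatment type. The sketch below is only an illustration of those rules in plain Python: the dictionary keys and the helper `is_drug_annotated` are hypothetical names, not part of Polly or Polly-Python, and the actual curation is performed by Polly's models rather than by a lookup like this.

```python
# Treatment types taken from the Drug field description above.
# The key spellings and the helper name are assumptions made for illustration.
DRUG_ANNOTATION_RULES = {
    "chemical treatment": True,
    "stimulation": True,
    "drug treatment": True,
    "control/vehicle": False,
    "genetic perturbation": False,
    "transfection": False,
    "other treatments": False,  # chemotherapy, radiation therapy, antibodies
    "other": False,             # culture media, supplements, detergents
}


def is_drug_annotated(treatment_type):
    """Return True or False per the rules above, or None for an unlisted type."""
    return DRUG_ANNOTATION_RULES.get(treatment_type.strip().lower())
```

Returning None for an unlisted treatment type keeps "not covered by the rules" distinct from "explicitly not annotated".
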
Other metadata fields
Name	Description	GUI Display Name	Polly-Python Display Name
Alignment algorithm	This field represents the alignment method used for the alignment and quantification of RNASeq data	Alignment algorithm	alignment_method
Reference Gene Annotation	This field represents the gene annotation library used in the processing of data. E.g. Ensembl release 107	NA	ref_gene_annotations
Reference genome	This field represents the reference genome used for alignment. Eg. GRCh38, GRCm38	NA	reference_genome
Experimental_factors	This field provides the list of experimental factors that indicate the design of the experiment. It lists the names of the sample-level metadata fields whose values vary across the samples.	Experimental Factors	experimental_factors
Donor Information	This field indicates whether a dataset has a human donor. If the dataset contains samples from a donor, the value of 'curated_dataset_has_donor' is 'True'; otherwise it is 'False'.	Donor Dataset	curated_dataset_has_donor
Abstract	This field provides the abstract of the publication associated with the dataset.	Abstract	abstract
Authors	This field provides the names of the author(s) who published the dataset.	Authors	author
Pubmed ID	This field provides the PubMed IDs of the publication associated with the dataset.	Pubmed IDs	pubmed_ids
Description	This field provides a brief description of the experiment or the study.	Title	description
Overall design	This field provides information on the overall design of the experiment as given by the author.	Overall Design	overall_design
Summary	This field provides a detailed summary of the publication (can be the abstract) or a summary of the experiment.	Summary	summary
Publication Link	This field provides the link to the data source providing more information on the dataset.	Source Link	source_link
Year	This field provides the year in which the dataset or study is published.	Year	year
Data Type	This field provides the type of biomolecular data represented/studied in the dataset.	Data type	data_type
Dataset ID	This field provides the unique id for the dataset/study to represent a group of samples.	Dataset ID	dataset_id
Number of Samples	This field represents the total number of samples in a dataset.	Samples	total_num_samples
Source	This field represents the name of the source repository from where data has been obtained.	NA	dataset_source
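
The Experimental_factors field is described as the list of sample-level metadata fields that vary across samples. A minimal sketch of that idea in plain Python follows. The function name `varying_fields` and the sample records are hypothetical, and this is not Polly's actual implementation; it assumes every sample is a flat dictionary of hashable values.

```python
def varying_fields(samples):
    """Return the sorted names of fields that take more than one value across samples."""
    values_by_field = {}
    for sample in samples:
        for field, value in sample.items():
            # Collect the distinct values seen for each field.
            values_by_field.setdefault(field, set()).add(value)
    return sorted(f for f, vals in values_by_field.items() if len(vals) > 1)


# Illustrative sample-level records (made-up values).
samples = [
    {"curated_tissue": "liver", "curated_drug": "none", "curated_disease": "normal"},
    {"curated_tissue": "liver", "curated_drug": "interferon alpha", "curated_disease": "normal"},
]
experimental_factors = varying_fields(samples)
```

In this made-up example only `curated_drug` differs between the two samples, so it is the only field reported as an experimental factor.
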

2. Sample Level Metadata

Sample-level metadata for bulk RNAseq datasets consists of metadata fields curated by Polly’s curation models as well as the curated source metadata fields. All the sample-level metadata fields are visible on the ‘details’ page of a dataset ID on the OmixAtlas interface. The following fields are available for querying at the sample level using Polly-Python:

Name	Description	Polly-Python Display Name
Tissue	This field represents the tissue(s) from which the samples originated. Tissue labels already present in the source metadata are normalized using a normalization model. In cases where tissue labels are missing, related texts and abstracts are processed and further normalized to get the tissue metadata label. Tissue labels are annotated for samples extracted from healthy or diseased tissue. All labels are harmonized with Brenda Tissue Ontology.	curated_tissue
Disease	At the sample level, this field represents the disease associated with a particular sample. Disease labels already present in the source metadata are normalized using a normalization model. In case the disease labels are missing, related texts and abstracts are processed and further normalized to get the disease metadata label. Disease labels are annotated for a sample when it has been collected from a diseased tissue or organism. Examples of such cases are as follows: If a cell line has been extracted from a diseased organism/tissue and immortalized for further study, it is annotated for the disease. For example, if the samples under study are extracted from osteosarcoma tumors, the disease annotation for the sample is "osteosarcoma". If a disease model has been created through genetic modification or artificially in a lab for further studies, it is annotated for that disease. For example, a genetically engineered mouse model where an immunodeficient or humanized mouse is implanted with tissue from a patient's tumor. At the sample level, disease labels are annotated for the following sample types: Clinical - Samples extracted from diseased patients, tissues, cell lines etc. GEM (Genetically Engineered Models) - Samples extracted from genetically engineered mouse models. Diet Induced - Samples extracted from diet-induced mouse models. Xenograft - Samples extracted from patient-derived xenograft or cell-line-derived xenograft mouse models. Infection - Samples extracted from infected organisms, tissues, cell lines or cultures, e.g. viral or bacterial infection. Other - Any other type of sample for which no disease is mentioned in the metadata but which is not normal.	curated_disease
Drug	This field represents the drugs that have been used in the treatment of a sample. Drug labels already present in the source metadata are normalized using a normalization model. In cases where drug labels are missing, related texts and abstracts are processed and further normalized to get the drug metadata label. Drug labels are annotated for the following types of sample treatments: Chemical treatment - Treatment for activation/regulation/inhibition of a protein or nucleic acid. Stimulation - Treatment to elicit an immune response, e.g. cytokine stimulation (Interferon alpha, TNF-a) or LPS. Drug treatment - Treatment with a substance used to treat an illness, relieve a symptom or modify a chemical process in the body for a specific purpose, e.g. 3,5-diethoxycarbonyl-1,4-dihydrocollidine (DDC). Drug labels are not annotated for the following types of sample treatments: Control/Vehicle - The control group receives either no treatment, a standard treatment whose effect is already known, or a placebo; control samples mostly contain organic solvents. Treatment for genetic perturbation - Treatments used for clonal selection or inducible genetic perturbation, e.g. Puromycin, Tamoxifen, Doxycycline. Transfection - RNA molecules (miRNA, shRNA, siRNA etc.) used for genetic manipulation. Other treatments - Treatments such as chemotherapy, radiation therapy or antibodies. Other - Culture media, supplements and detergents, e.g. DMEM, LB Media, Agar Media, glucose, amino acids, fats, lipids, SDS. Note: Any mention of a drug in the text is included as a drug label, irrespective of whether it is used in the experiment.	curated_drug
Cell line	This field represents the cell line from which the sample was derived. Cell line labels already present in the source metadata are normalized using synonyms present in the cell line ontology. The cell line field is curated for a sample if the authors have cultured a particular cell line, or bought it from organizations such as ATCC, and then used it in further experiments. Cell line names are harmonized with the Cellosaurus ontology.	curated_cell_line
Cell Type	This field represents the cell type of the samples within the study. Cell type labels already present in the metadata at the source are normalized using a normalization model. In case the labels are missing, related texts and abstracts are processed and further normalized to get the cell type metadata label. Cell type labels are annotated in cases where the authors have cultured a particular cell type, either extracted from tissues or developmental organs or generated in the lab, and then used it in further experiments. Key specifications for cell type metadata annotations are as follows: All labels are harmonized with Cell Ontology. Cell lines are not curated for cell types. For cell types with functional terms such as circulating, associated, derived, etc., only the cell type is annotated. For tissue-specific cell types, the tissue name is labelled along with the cell type, e.g. aortic endothelial cells, spinal motor neurons. Organism terms are not included in the cell type labels, e.g. mouse HSPCs are labelled "HSPCs". Abbreviated cell types with functional conditions are labelled as abbreviated terms, e.g. CTCs.	curated_cell_type
Cohort Name	This field represents the name of the cohort/group to which the sample belongs, providing information about the experimental condition/tissue/cell line/treatment. Cohort names are generated automatically and might not always represent the actual cohorts studied in the paper.	curated_cohort_name
Cohort ID	This field represents the numeric ID of the cohort to which the sample belongs.	curated_cohort_id
Control	This field indicates whether the sample is a control or a perturbation for a particular experiment.	curated_control
Cancer Stage - TNM Stage	This field represents the cancer stage of clinical samples whose disease label is a cancer. Cancer stage information is labelled using the TNM staging system as well as the number staging system. The TNM system is the most widely used cancer staging system. T (primary tumor) - T refers to the size and extent of the main tumor. The number after the T ranges from 1 to 4, with 1 for the smallest tumor and 4 for the largest; the higher the number, the larger the tumor or the more it has grown into nearby tissues. T values may be further subdivided to provide more detail, such as T3a and T3b. N (regional lymph nodes) - N refers to the number of nearby lymph nodes that have cancer. The value ranges from 0 to 3, where N0 means no lymph nodes contain cancer and N1, N2 and N3 refer to the number and location of lymph nodes that contain cancer. The higher the number after the N, the more lymph nodes contain cancer. M (distant metastasis) - M refers to whether the cancer has metastasized, where M0 means it has not and M1 means it has. An X after T, N or M (TX/NX/MX) means that the main tumor, lymph node involvement or metastasis, respectively, cannot be measured. T stores the T values from the TNM system, e.g. T2, TX, T1a. N stores the N values from the TNM system, e.g. N2, NX, N1b. M stores the M values from the TNM system, e.g. M0, MX, M1c.	curated_cancer_tnm_stage
Cancer Stage - Number Stage	Number staging system - For many cancers, the TNM combinations are grouped into five less-detailed stages. The higher the number, the larger the cancer tumor and the more it has spread into nearby tissues. Stage 0 - Cancer (abnormal cells) is present but has not spread. Stage 1 - Cancer is present. It is small in size but has not spread. Stage 2 - Cancer has grown but still has not spread. Stage 3 - Cancer is larger and may have spread to the surrounding tissues and/or lymph nodes. Stage 4 - Cancer has spread to distant parts of the body (at least to 1 other body organ). Such cancer is known as "secondary" or "metastatic" cancer.	curated_cancer_stage
Cancer Grade	This field represents the cancer grade of the clinical samples which have disease labels as cancer. Tumor grade is the description of a tumor based on how abnormal the tumor cells and the tumor tissue look under a microscope. The following values are given for the cancer grade label: Grade X (GX) - Cancer grade cannot be assessed. Grade 1 (G1) - Well-differentiated cells. Cancer cells resemble normal cells and aren't growing rapidly. G1 represents a low grade. Grade 2 (G2) - Moderately differentiated cells. Cancer cells don't look like normal cells and are growing faster than normal cells. G2 represents intermediate grade. Grade 3 (G3) - Poorly differentiated cells. Cancer cells look abnormal and may grow or spread more aggressively. G3 represents a high grade. Grade 4 (G4) - Undifferentiated cells. G4 represents a high grade.	curated_cancer_grade
Gene	This field represents the gene(s) under study in a sample.	curated_gene
Modified genes	This field represents the gene(s) modified in the sample under study. Ontology followed: GeneCards.	curated_gene_modified
Genetic Modification	This field represents the type of genetic modification done on the sample.	curated_genetic_modification_type
Sample ID	Unique ID for each sample as per the GEO sample accession number.	sample_id
Title	This field represents the type/genotype/origin/experimental condition of the sample.	title
Description	This field provides a brief description of the sample.	description
Other Sample level metadata fields (Source)	This field provides all the sample-level metadata from the source in a single dictionary.	sample_characteristics
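
The T, N and M components described for curated_cancer_tnm_stage can be separated from a combined stage string with a small regular expression. The sketch below is illustrative only. `parse_tnm` is a hypothetical helper, not a Polly-Python function, and it assumes the label is a single concatenated string such as "T2N1bM0", which may differ from how Polly actually stores the values. The numeric ranges follow the description in the table above.

```python
import re

# Each of T, N and M is followed by X (cannot be measured) or a number,
# optionally with a lowercase sub-category letter (e.g. T1a, N1b, M1c).
TNM_PATTERN = re.compile(
    r"^(?P<T>T(?:X|[1-4][a-d]?))"
    r"(?P<N>N(?:X|[0-3][a-c]?))"
    r"(?P<M>M(?:X|[01][a-c]?))$"
)


def parse_tnm(label):
    """Split a combined TNM string into its T, N and M parts, or return None."""
    match = TNM_PATTERN.match(label.strip())
    if match is None:
        return None
    return match.groupdict()


parsed = parse_tnm("T2N1bM0")
```

Strings that do not fit the assumed pattern return None instead of raising, so unexpected labels can be inspected separately.
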

3. Feature Level Metadata

Field	Description	Polly-Python Display Name
Data ID	This field represents the unique ID for this data entity on Polly.	data_id
Name	This field represents the ID of the feature - gene ID, protein ID etc.	name