Technical FAQs
What is data audit?
The process of data audit begins with a discovery call to understand the data requirements of the prospective customer. The data requirements are captured via questionnaire. Using the points noted in the questionnaire. Accordingly, data audit is performed,
- Internal Data Audit - Finding data on Polly OmixAtlas That is relevant to the customer requirements
- External Data Audit - Find relevant data that is not yet incorporated on Polly OmixAtlas ( if number of datasets is less than 50 we will also do external audit using External Data audit )
The data is then used to create dashboards to review it with the prospect to know more about their data needs. These dashboards help us convey to the customers that,
- We have data they need and we are also able to talk about how this audit is possible because we have standardized schema and annotation
- We have standard process to communicate about data available on Polly
- Can work on their curation needs for custom curation
Initially, after the first call with the prospect, the data audit performed aims to give the customer a broad view of the data of their interest on Polly. In the follow-up calls the data audit process becomes more pin-pointed based on additional requests from the customer like request for gene knockout data, stage data, patient data and more.
Data Audit Workflow
The process of data audit can be divided into 3 main phases
- Discussion Phase - Gathering customer requirements, finding relevant terms and feedback
- Querying and Filtering Phase - Querying data on Polly (usually done using ElasticSearch API), cleaning and summarizing data.
- Dashboarding and Documenting - Connecting data to Google Data Studio page, updating the dashboard as required and documenting the process on Confluence
What are the data types?
Data types | Description | Repositories |
---|---|---|
Transcriptomics | Is the study of the transcriptome —the complete set of RNA transcripts that are produced by the genome, under specific circumstances. The main platforms available are RNA-Seq which is a high-throughput sequencing method or microarray. | HPA, GDC, GEO, CPTAC, Liver OA, cBioPortal, LINCS, TCGA, GTex |
GWAS | A genome-wide association study (abbreviated GWAS) is a research approach used to identify genomic variants that are statistically associated with a risk for a disease or a particular trait. | UKBioBank, gnomAD |
Mutation | A mutation is a change in the DNA sequence of an organism. Mutations can result from errors in DNA replication during cell division, exposure to mutagens or molecular instability. | TCGA, cBioPortal, CPTAC, Liver OA |
Copy Number Variation | Copy number variation is a type of structural change where a region of DNA ( >> 10bp ), which is duplicated , deleted, inverted or is any aberrant structure. | GDC, CPTAC, cBioPortal, TCGA |
Proteomics | Is the study of the proteome —the set of proteins that are produced by the cell,tissue under specific circumstances. The main platforms available are ELISA (enzyme linked immunosorbent assay) Elispot (enzyme linked immunosorbent spot) and RPPA (reverse phase protein array) | CPTAC, TCGA, ImmPort, Liver OA |
Gene Dependency | Algorithm generated score to define the gene required for the cancer to proliferate/survive | DepMap |
Lipidomics | Is the study of the lipidome —the complete set of lipids that are produced by the cell, tissue under specific circumstances. | Metabolomics, TEDDY, CPTAC. Liver OA |
Drug Response | The measurement ( IC50 ) of how much substance is required to inhibit a biological process like proliferation/survivability | PharmacoDB |
Metabolomics | Is the study of the metabolome —the complete set of metabolites that are produced by the cell,tissue under specific circumstances. | TEDDY, CPTAC, Liver OA, Metabolomics |
miRNA | Quantifies the micro RNA transcripts that are produced by the cell,tissue under specific circumstances. | TCGA, Liver OA, CPTAC, GDC |
Titer | A titer is a measurement of the amount or concentration of a substance in a solution. It usually refers to the amount of antibodies found in a person's blood. | ImmPort |
Methylation | Quantifies the epigenetic change that requires transfer of methyl group which regulates gene expression by recruiting proteins involved in gene repression or by inhibiting the binding of transcription factor(s) to DNA | CPTAC, cBioPortal, TCGA |
PCR | Polymerase chain reaction (PCR) is a method widely used to rapidly make millions to billions of copies (complete or partial) of a specific DNA sample | ImmPort |
Fusion | Quantifies the structural change that produces a aberrant RNA transcript which may cause undesired changes in gene function. | cBioPortal |
Single cell | Is the study of the transcriptome —the complete set of RNA transcripts that are produced by a cell under specific circumstances. | Liver OA, Single Cell |
Drug Screens | Is transcriptomics data generated after large amounts of drugs are screened in cell lines to study the effect of drug at various doses and/or timepoints | DepMap, Liver OA |
Cytometry | Cytometry is the measurement of number and characteristics of cells. Variables that can be measured by cytometric methods include cell size, cell count, cell morphology, cell cycle phase, DNA content, and the existence or absence of specific proteins on the cell surface or in the cytoplasm. | ImmPort |
Phosphoproteomics | Phosphoproteomics is a branch of proteomics that identifies, catalogs, and characterizes proteins containing a phosphate group as a posttranslational modification. | CPTAC |
RNAi | Is a transcriptomics data generated after a RNA sequence-specific suppression of gene expression by double-stranded RNA, through translational or transcriptional repression. | DepMap |
Acetylproteomics | Acetylation is a highly conserved and reversible post-translational modification. It mainly takes part in regulating gene expression through modifying nuclear histones, but can also regulate several metabolic enzymes and metabolism pathways. | CPTAC |
Gene Effect | The measurement of the effect size of knocking out a gene, normalized against the distributions of non-essential and pan-essential genes | DepMap, Liver OA |
Gene Expression Reliability | A reliability score is set for all genes and indicates the level of reliability of the analyzed protein expression pattern based on available protein/RNA/gene characterization data. | HPA |
Exome Sequencing | Sequencing exome which is the coding region of the genome | |
Structural Biology | Is the 3D structural coordinates of a protein which may also have ligand/small molecule bound | RCSB |
Lab measurement | Non-omics data like those from blood clinical chemistry results like blood reports that have Whole Blood Count, Red Blood Count, lipid profile | ImmPort |
SNP array | SNP array is a type of DNA microarray which is used to detect polymorphisms within a population. A single nucleotide polymorphism (SNP), a variation at a single site in DNA, is the most frequent type of variation in the genome. | GEO |
What usecases do we have for curated data (prioritise cancer/treatment data)?
We have the following use cases for curated data -
- Enabling target discovery using publicly available data
- Aggregate data from public and propreitary data
- Set up bioinformatics processing pipelines to convert raw data to usable formats
- Data infrastructure on Polly makes data findable for reuse and insight generation
- Enabling curation of high-throughput drug screen data
- Centralized data management, processed data and metadata are stored together
- Generates experiment level/project level reports
- Knowledge graph generation on Polly
- Aggregate data from public and proprietary data
- Create richer knowledge graphs across 35 million auto curated entities on Polly
- Use over 50 billion data points to form relationships over curated metadata
- ETL pipelines support from our Subject matter experts
- Access to large group of bioinformatics and software experts
- Utilize extensive experience in developing and deploying pipelines for wide variety of pharmaceutical and biotech startups