Technical FAQs

What is a data audit?

A data audit begins with a discovery call to understand the data requirements of the prospective customer. These requirements are captured in a questionnaire. Based on the points noted in the questionnaire, the data audit is performed as follows:

Internal Data Audit - Finding data on Polly OmixAtlas that is relevant to the customer requirements
External Data Audit - Finding relevant data that is not yet incorporated on Polly OmixAtlas (if the number of datasets is less than 50, we also perform an external data audit)
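
The rule for when an external audit is added can be written down as a small sketch. It assumes the count refers to the relevant datasets found by the internal audit; the function and constant names are hypothetical and not part of Polly.

```python
# Threshold taken from the text: fewer than 50 datasets triggers an external audit.
EXTERNAL_AUDIT_THRESHOLD = 50


def plan_audits(internal_dataset_count: int) -> list:
    """Return the audits to run, given how many relevant datasets were found on Polly.

    Assumption: the count is the number of datasets found by the internal audit.
    """
    if internal_dataset_count < 0:
        raise ValueError("dataset count cannot be negative")
    audits = ["internal"]
    if internal_dataset_count < EXTERNAL_AUDIT_THRESHOLD:
        audits.append("external")
    return audits


few = plan_audits(12)    # internal audit found only 12 datasets
many = plan_audits(50)   # exactly 50 is not "less than 50"
print(few, many)
```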

The data is then used to create dashboards, which we review with the prospect to learn more about their data needs. These dashboards help us convey to customers that:

We have the data they need, and this audit is possible because we have a standardized schema and annotation
We have a standard process for communicating about the data available on Polly
We can work on their custom curation needs

Initially, after the first call with the prospect, the data audit aims to give the customer a broad view of the data of interest to them on Polly. In follow-up calls, the audit becomes more targeted based on additional requests from the customer, such as requests for gene knockout data, stage data, patient data, and more.

Data Audit Workflow

The data audit process can be divided into three main phases:

Discussion Phase - Gathering customer requirements, finding relevant terms, and collecting feedback
Querying and Filtering Phase - Querying data on Polly (usually done using the Elasticsearch API), then cleaning and summarizing the data
Dashboarding and Documenting Phase - Connecting the data to a Google Data Studio page, updating the dashboard as required, and documenting the process on Confluence
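
The querying and filtering phase could look roughly like the sketch below. It is illustrative only: the field names (`dataset_id`, `data_type`, `disease`) and the sample hits are made up, and the query is only built as an Elasticsearch-style request body, not sent to any Polly endpoint.

```python
from collections import Counter


def build_query(terms, field="disease"):
    """Build an Elasticsearch-style bool query that matches any of the given terms.

    The field name is hypothetical; substitute a field from the actual schema.
    """
    return {
        "query": {
            "bool": {
                "should": [{"match": {field: term}} for term in terms],
                "minimum_should_match": 1,
            }
        }
    }


def clean(records):
    """Drop records without a dataset_id and remove duplicate dataset_ids."""
    seen = set()
    cleaned = []
    for rec in records:
        dataset_id = rec.get("dataset_id")
        if not dataset_id or dataset_id in seen:
            continue
        seen.add(dataset_id)
        cleaned.append(rec)
    return cleaned


def summarize(records, key="data_type"):
    """Count datasets per value of `key`, e.g. per data type, for a dashboard table."""
    return Counter(rec.get(key, "unknown") for rec in records)


# Made-up hits standing in for a query response.
hits = [
    {"dataset_id": "DS1", "data_type": "Transcriptomics", "disease": "liver cancer"},
    {"dataset_id": "DS1", "data_type": "Transcriptomics", "disease": "liver cancer"},
    {"dataset_id": "DS2", "data_type": "Proteomics", "disease": "liver cancer"},
    {"dataset_id": None, "data_type": "Mutation", "disease": "liver cancer"},
]

query = build_query(["liver cancer", "hepatocellular carcinoma"])
summary = summarize(clean(hits))
print(dict(summary))
```

The per-type counts are the kind of summary that can then be exported to a dashboard and refined as the customer adds requests in follow-up calls.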

What are the data types?

Data types	Description	Repositories
Transcriptomics	The study of the transcriptome, the complete set of RNA transcripts produced by the genome under specific circumstances. The main platforms available are RNA-Seq, a high-throughput sequencing method, and microarray.	HPA, GDC, GEO, CPTAC, Liver OA, cBioPortal, LINCS, TCGA, GTEx
GWAS	A genome-wide association study (abbreviated GWAS) is a research approach used to identify genomic variants that are statistically associated with a risk for a disease or a particular trait.	UKBioBank, gnomAD
Mutation	A mutation is a change in the DNA sequence of an organism. Mutations can result from errors in DNA replication during cell division, exposure to mutagens or molecular instability.	TCGA, cBioPortal, CPTAC, Liver OA
Copy Number Variation	Copy number variation is a type of structural change in which a region of DNA (>> 10 bp) is duplicated, deleted, inverted, or otherwise structurally aberrant.	GDC, CPTAC, cBioPortal, TCGA
Proteomics	The study of the proteome, the set of proteins produced by a cell or tissue under specific circumstances. The main platforms available are ELISA (enzyme-linked immunosorbent assay), ELISpot (enzyme-linked immunosorbent spot), and RPPA (reverse phase protein array).	CPTAC, TCGA, ImmPort, Liver OA
Gene Dependency	Algorithm-generated score that identifies the genes required for a cancer to proliferate or survive	DepMap
Lipidomics	The study of the lipidome, the complete set of lipids produced by a cell or tissue under specific circumstances.	Metabolomics, TEDDY, CPTAC, Liver OA
Drug Response	The measurement (IC50) of how much of a substance is required to inhibit a biological process, such as proliferation or survival, by 50%	PharmacoDB
Metabolomics	The study of the metabolome, the complete set of metabolites produced by a cell or tissue under specific circumstances.	TEDDY, CPTAC, Liver OA, Metabolomics
miRNA	Quantifies the microRNA transcripts produced by a cell or tissue under specific circumstances.	TCGA, Liver OA, CPTAC, GDC
Titer	A titer is a measurement of the amount or concentration of a substance in a solution. It usually refers to the amount of antibodies found in a person's blood.	ImmPort
Methylation	Quantifies the epigenetic change involving the transfer of a methyl group, which regulates gene expression by recruiting proteins involved in gene repression or by inhibiting the binding of transcription factor(s) to DNA	CPTAC, cBioPortal, TCGA
PCR	Polymerase chain reaction (PCR) is a method widely used to rapidly make millions to billions of copies (complete or partial) of a specific DNA sample	ImmPort
Fusion	Quantifies the structural change that produces an aberrant RNA transcript, which may cause undesired changes in gene function.	cBioPortal
Single cell	The study of the transcriptome, the complete set of RNA transcripts produced by a single cell under specific circumstances.	Liver OA, Single Cell
Drug Screens	Transcriptomics data generated after large numbers of drugs are screened in cell lines to study the effect of each drug at various doses and/or timepoints	DepMap, Liver OA
Cytometry	Cytometry is the measurement of number and characteristics of cells. Variables that can be measured by cytometric methods include cell size, cell count, cell morphology, cell cycle phase, DNA content, and the existence or absence of specific proteins on the cell surface or in the cytoplasm.	ImmPort
Phosphoproteomics	Phosphoproteomics is a branch of proteomics that identifies, catalogs, and characterizes proteins containing a phosphate group as a posttranslational modification.	CPTAC
RNAi	Transcriptomics data generated after RNA sequence-specific suppression of gene expression by double-stranded RNA, through translational or transcriptional repression.	DepMap
Acetylproteomics	Acetylation is a highly conserved and reversible post-translational modification. It mainly takes part in regulating gene expression through modifying nuclear histones, but can also regulate several metabolic enzymes and metabolism pathways.	CPTAC
Gene Effect	The measurement of the effect size of knocking out a gene, normalized against the distributions of non-essential and pan-essential genes	DepMap, Liver OA
Gene Expression Reliability	A reliability score is set for all genes and indicates the level of reliability of the analyzed protein expression pattern based on available protein/RNA/gene characterization data.	HPA
Exome Sequencing	Sequencing of the exome, which is the coding region of the genome
Structural Biology	The 3D structural coordinates of a protein, which may also have a ligand or small molecule bound	RCSB
Lab measurement	Non-omics data such as blood clinical chemistry results, for example blood reports that include whole blood count, red blood count, and lipid profile	ImmPort
SNP array	SNP array is a type of DNA microarray which is used to detect polymorphisms within a population. A single nucleotide polymorphism (SNP), a variation at a single site in DNA, is the most frequent type of variation in the genome.	GEO
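
The Drug Response row in the table above refers to IC50, the concentration at which a measured response drops to 50% of the untreated control. As a worked illustration, the sketch below estimates IC50 by interpolating on a log-dose scale between the two tested doses that bracket 50%. This is a simplification chosen for illustration, with made-up data; it is not a description of how PharmacoDB or Polly derives its values, and curve fitting is the more common approach.

```python
import math


def estimate_ic50(doses, responses):
    """Estimate IC50 by log-linear interpolation between the points bracketing 50%.

    doses: increasing, positive concentrations.
    responses: percent of the untreated-control response at each dose
               (100 = no inhibition, 0 = full inhibition).
    Returns None if the response never crosses 50% within the tested range.
    """
    points = list(zip(doses, responses))
    for (d_lo, r_lo), (d_hi, r_hi) in zip(points, points[1:]):
        if r_lo >= 50 >= r_hi and r_lo != r_hi:
            # Fraction of the way from the lower dose to the higher dose, on a log10 scale.
            frac = (r_lo - 50) / (r_lo - r_hi)
            log_ic50 = math.log10(d_lo) + frac * (math.log10(d_hi) - math.log10(d_lo))
            return 10 ** log_ic50
    return None


# Made-up viability readings (percent of control) at four doses.
doses = [0.01, 0.1, 1.0, 10.0]
viability = [95.0, 80.0, 20.0, 5.0]
ic50 = estimate_ic50(doses, viability)
print(ic50)
```

Here the 50% crossing falls halfway between 80% at dose 0.1 and 20% at dose 1.0, so the estimate lands at the geometric midpoint of those two doses.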

What use cases do we have for curated data (prioritise cancer/treatment data)?

We have the following use cases for curated data:

Enabling target discovery using publicly available data
Aggregate data from public and proprietary sources
Set up bioinformatics processing pipelines to convert raw data to usable formats
Data infrastructure on Polly makes data findable for reuse and insight generation

Enabling curation of high-throughput drug screen data
Centralized data management: processed data and metadata are stored together
Generate experiment-level and project-level reports

Knowledge graph generation on Polly
Aggregate data from public and proprietary sources
Create richer knowledge graphs across 35 million auto-curated entities on Polly
Use over 50 billion data points to form relationships over curated metadata

ETL pipeline support from our subject matter experts
Access to a large group of bioinformatics and software experts
Utilize extensive experience in developing and deploying pipelines for a wide variety of pharmaceutical and biotech startups