Data Curation
The Curation class contains wrapper functions around the models used for semantic annotations of string/text.
Parameters:
- token (str, default: None) – Token copied from Polly.
Usage
from polly.curation import Curation
curationObj = Curation(token)
annotate_with_ontology
Tags a given piece of text with terms from the ontologies that Polly supports. A "tag" is just an ontology term. This function calls recognise_entity followed by normalize, so users can identify and tag the entities in a text in a single call. Each tag recognised in the text contains the name (the word identified in the text), the entity_type and the ontology_id.
Parameters:
- text (str) – Input text
Returns:
- List[Tag] – Set of unique tags
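The returned tags can be grouped by entity type for downstream use. The following is a minimal sketch, not part of the library: the `Tag` namedtuple is a hypothetical stand-in that assumes only the three fields described above, and `group_tags_by_type` is an illustrative helper name.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the Tag objects returned by annotate_with_ontology;
# only the name, ontology_id and entity_type fields are assumed.
Tag = namedtuple("Tag", ["name", "ontology_id", "entity_type"])


def group_tags_by_type(tags):
    """Group tag names by entity_type, keeping first-seen order."""
    grouped = defaultdict(list)
    for tag in tags:
        if tag.name not in grouped[tag.entity_type]:
            grouped[tag.entity_type].append(tag.name)
    return dict(grouped)


# Sample tags copied from the Examples section of this page.
tags = [
    Tag(name="Mus musculus", ontology_id="NCBI:txid10090", entity_type="species"),
    Tag(name="Adenocarcinoma", ontology_id="MESH:D000230", entity_type="disease"),
]
grouped = group_tags_by_type(tags)
print(grouped)
```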
assign_control_pert_labels
Returns the sample metadata dataframe with two additional columns: is_control (whether the sample is a control sample) and control_prob (the probability that the sample is a control).
Parameters:
- sample_metadata (DataFrame) – Metadata table
- columns_to_exclude (Set[str], default: None) – Any columns which don't play any role in determining the label, e.g. sample id
Returns:
- DataFrame (DataFrame) – Input data frame with two additional columns
Raises:
- requestException – Invalid Request
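Because control_prob is a probability, borderline samples may deserve a manual check. The sketch below is one possible post-processing step, not library behaviour. It assumes the returned DataFrame has been converted to a list of row dicts (for example with pandas `DataFrame.to_dict("records")`); the `split_by_control` helper and its `min_confidence` cutoff are hypothetical.

```python
# Rows as they might look after converting the returned DataFrame to records;
# the values mirror the example output in the Examples section of this page.
rows = [
    {"sample_id": 1, "disease": "control1", "is_control": True, "control_prob": 1.00},
    {"sample_id": 2, "disease": "ctrl2", "is_control": True, "control_prob": 1.00},
    {"sample_id": 3, "disease": "healthy", "is_control": True, "control_prob": 0.96},
    {"sample_id": 4, "disease": "HCC", "is_control": False, "control_prob": 0.08},
]


def split_by_control(rows, min_confidence=0.9):
    """Split sample ids into controls, perturbations and low-confidence rows.

    min_confidence is a hypothetical cutoff on how sure the label is:
    control_prob for controls, 1 - control_prob for everything else.
    """
    controls, perturbations, needs_review = [], [], []
    for row in rows:
        prob = row["control_prob"]
        confidence = prob if row["is_control"] else 1.0 - prob
        if confidence < min_confidence:
            needs_review.append(row["sample_id"])
        elif row["is_control"]:
            controls.append(row["sample_id"])
        else:
            perturbations.append(row["sample_id"])
    return controls, perturbations, needs_review


controls, perturbations, needs_review = split_by_control(rows)
print(controls, perturbations, needs_review)
```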
find_abbreviations
Runs abbreviation detection on its own, without the other annotation steps. Internally calls a normaliser.
Parameters:
- text (str) – The string to detect abbreviations in.
Returns:
- Dict[str, str] – Dictionary with abbreviation as key and full form as value
Raises:
- requestException – Invalid Request
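One possible use of the returned dictionary is to expand abbreviations in other text before further processing. This is a minimal sketch using only the standard library; `expand_abbreviations` is a hypothetical helper, not a Polly function, and the sample dictionary is the one shown in the Examples section.

```python
import re


def expand_abbreviations(text, abbreviations):
    """Replace each standalone abbreviation in text with its full form."""
    if not abbreviations:
        return text
    # Try longer abbreviations first so overlapping keys match correctly.
    keys = sorted(abbreviations, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: abbreviations[m.group(1)], text)


# Dictionary in the shape returned by find_abbreviations.
abbreviations = {"T1D": "Type 1 Diabetes"}
expanded = expand_abbreviations("Non T1D patients were given drug B", abbreviations)
print(expanded)
```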
recognise_entity
Runs an NER model on the given text. The returned value is a list of entities along with span information. Users can simply recognise entities in a given text without any ontology standardisation (unlike the annotate_with_ontology function, which normalises as well).
Parameters:
- text (str) – Input text
- threshold (float, default: None) – Optional parameter. All entities with a score < threshold are filtered out from the output. It's best not to specify a threshold and just use the default value instead.
- normalize_output (bool, default: False) – Whether to normalize the keywords
Returns:
- entities (List[dict]) – List of spans containing the keyword, start/end index of the keyword and the entity type
Raises:
- requestException – Invalid Request
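The effect of a threshold can be reproduced on an already returned list. The sketch below applies the rule described above (entities with a score < threshold are dropped) on the client side and orders the survivors by position in the text. `filter_entities` is a hypothetical helper, and the service may apply its threshold differently; the sample entities are copied from the Examples section.

```python
# Entities in the shape returned by recognise_entity.
entities = [
    {"keyword": "lungs", "entity_type": "tissue",
     "span_begin": 34, "span_end": 39, "score": 0.9985597729682922},
    {"keyword": "ACE2", "entity_type": "gene",
     "span_begin": 52, "span_end": 55, "score": 0.9900580048561096},
    {"keyword": "mice", "entity_type": "species",
     "span_begin": 29, "span_end": 32, "score": 0.989605188369751},
]


def filter_entities(entities, threshold):
    """Drop entities scoring below threshold and sort the rest by span_begin."""
    kept = [e for e in entities if e["score"] >= threshold]
    return sorted(kept, key=lambda e: e["span_begin"])


kept = filter_entities(entities, threshold=0.99)
print([e["keyword"] for e in kept])
```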
standardise_entity
cached
Maps a given mention (keyword) to an ontology term. Given a text and an entity type, users can get the Polly-compatible ontology term for the text, for example from the MeSH ontology.
Parameters:
- mention (str) – Mention of an entity, e.g. "Cardiac arrhythmia"
- entity_type (str) – Should be one of ['disease', 'drug', 'tissue', 'cell_type', 'cell_line', 'species', 'gene']
- context (str, default: None) – The text where the mention occurs. This is used to resolve abbreviations.
- threshold (float, optional) – Optional parameter. All entities with a score < threshold are filtered out from the output. It's best not to specify a threshold and just use the default value instead.
Returns:
- dict (dict) – Dictionary containing the entity type, the ontology (such as NCBI or MeSH), the ontology ID (such as the MeSH ID), the score (confidence score), and synonyms, if any
Raises:
- requestException – Invalid Request
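Callers usually need to tell a matched result from an unmatched one. The sketch below treats a result whose ontology_id is None as unmatched, which is how the 'CUI-less' output in the Examples section looks, and otherwise builds an "ONTOLOGY:ID" string in the same style as the Tag ontology_id values. `format_ontology_id` is a hypothetical helper, and the None check is an assumption drawn from those sample outputs rather than a documented contract.

```python
def format_ontology_id(result):
    """Return 'ONTOLOGY:ID' for a matched result, or None when unmatched."""
    # Assumption: an unmatched mention comes back with ontology_id set to None.
    if result.get("ontology_id") is None:
        return None
    return f"{result['ontology']}:{result['ontology_id']}"


# Results in the shape returned by standardise_entity.
matched = {"ontology": "MESH", "ontology_id": "D003876",
           "name": "Dermatitis, Atopic", "entity_type": "disease",
           "score": 196.61105346679688, "synonym": "atopic dermatitis"}
unmatched = {"ontology": "CUI-less", "ontology_id": None, "name": None,
             "entity_type": "disease", "score": None, "synonym": None}

print(format_ontology_id(matched))
print(format_ontology_id(unmatched))
```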
Examples
# Install polly python
!sudo pip3 install polly-python --quiet
# Import libraries
from polly.auth import Polly
from polly.curation import Curation
import os
import pandas as pd
from json import dumps
import ipywidgets as widgets
# Create curation object and authenticate
AUTH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']
curate = Curation(AUTH_TOKEN)
standardise_entity()
{'ontology': 'NCBI',
'ontology_id': 'txid10090',
'name': 'Mus musculus',
'entity_type': 'species',
'score': None,
'synonym': None}
{'ontology': 'MESH',
'ontology_id': 'C564330',
'name': 'Alzheimer Disease, Familial, 3, with Spastic Paraparesis and Apraxia',
'entity_type': 'disease',
'score': 202.1661376953125,
'synonym': 'ad'}
# With context, returns the desired keyword in case of abbreviation
curate.standardise_entity("AD", "disease",
context="Patients with atopic dermatitis (AD) where given drug A whereas non AD patients were given drug B")
{'ontology': 'MESH',
'ontology_id': 'D003876',
'name': 'Dermatitis, Atopic',
'entity_type': 'disease',
'score': 196.61105346679688,
'synonym': 'atopic dermatitis'}
# Usage of non-matching 'entity_type' returns none values
curate.standardise_entity("Mouse","disease")
{'ontology': 'CUI-less',
'ontology_id': None,
'name': None,
'entity_type': 'disease',
'score': None,
'synonym': None}
recognise_entity()
# Basic example with multiple entities
curate.recognise_entity("Gene expression profiling on mice lungs and reveals ACE2 upregulation")
[{'keyword': 'lungs',
'entity_type': 'tissue',
'span_begin': 34,
'span_end': 39,
'score': 0.9985597729682922},
{'keyword': 'ACE2',
'entity_type': 'gene',
'span_begin': 52,
'span_end': 55,
'score': 0.9900580048561096},
{'keyword': 'mice',
'entity_type': 'species',
'span_begin': 29,
'span_end': 32,
'score': 0.989605188369751}]
# No entity in the text
curate.recognise_entity("Significant upregulation was found in 100 samples")
[]
annotate_with_ontology()
[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species'),
Tag(name='Adenocarcinoma', ontology_id='MESH:D000230', entity_type='disease')]
[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species')]
find_abbreviations()
{}
# A '-' separator between the abbreviation and its full form is not recognised
curate.find_abbreviations("Patient is diagnosed with T1D- Type 1 Diabetes")
{}
# Abbreviation is recognized
curate.find_abbreviations("Patient is diagnosed with T1D (Type 1 Diabetes)")
{'T1D': 'Type 1 Diabetes'}
# Abbreviation does not match the full text
curate.find_abbreviations("Patient is diagnosed with T2D (Type 1 Diabetes)")
{}
assign_control_pert_labels()
sample_metadata = pd.DataFrame({"sample_id": [1, 2, 3, 4], "disease": ["control1", "ctrl2", "healthy", "HCC"],})
sample_metadata
| | sample_id | disease |
|---|---|---|
| 0 | 1 | control1 |
| 1 | 2 | ctrl2 |
| 2 | 3 | healthy |
| 3 | 4 | HCC |

# Returned DataFrame with the is_control and control_prob columns added
| | sample_id | disease | is_control | control_prob |
|---|---|---|---|---|
| 0 | 1 | control1 | True | 1.00 |
| 1 | 2 | ctrl2 | True | 1.00 |
| 2 | 3 | healthy | True | 0.96 |
| 3 | 4 | HCC | False | 0.08 |