Curation App

Overview

The Curation App is a data annotation tool for labeling metadata fields across multiple types of data. You can add dataset-level as well as sample-level annotations to datasets. The app has an integrated active learning system that uses machine learning to suggest annotations, which speeds up curation and lets you create ML models that can curate data by themselves.

The Curation App is integrated with OmixAtlas, and any dataset from an OmixAtlas can be curated with it.

Key Features of Curation App

Curate dataset-level and sample-level metadata by bringing in your own curators or by using our expert curators
Configure your own curation UI with a few clicks, based on your curation needs
Monitor curation tasks and their statuses on a dashboard
Curate datasets quickly using the ML-powered active learning integration, and create your own ML models for curating data at scale

Getting Started

Follow these steps to access and use the app:

Create a Polly login ID
Log in to Polly
Get in touch with your Customer Success Manager or Polly Support for access
Create Curation UI
Create Collection and add datasets
Assign Datasets to curators
Manual Curation
Expert Review
Approve/Reject the complete batch
Export datasets to your workspace for further use, or append labels to the respective OmixAtlas

Note - There is a toggle button to switch between curator and reviewer roles.

Select the Curation icon from the Polly homepage, as shown in Figure 1.

Figure 1 - Curation App Icon

Reviewer Window

Curation UI

In this window you can define the curation fields and curation columns. You can set the field types and field names, and generate a dynamic interface for curation.

Click on the Curation Configs icon. Create a new curation UI or edit an existing UI from the cards provided.

Figure 2 - Curation UI

Click on Create Curation Config to create a new UI for custom curation fields.

Figure 3 - Create new curation UI

Add the required details. Here you can add custom fields, set the input format, and create your own curation type.

Curation Config : Name of the curation Type
Type : Tabular/Free Text
Level : Dataset/Sample
Description : Description for the Curation UI (Example - Name of the fields)
Header Name : Desired Column Name (Example - Time Point)
Key : Column name to be shown on Polly (Example - curated_time_point)
Type : Text/Number/Dropdown
Mandatory : Whether filling in the field is mandatory
Editable : Whether you want the column to be editable or not

Note : You can add more than one field to a curation configuration.
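The field list above can also be pictured as a small data structure. The following is only an illustrative sketch: the class names (CurationField, CurationConfig), the attribute names, and the validation rules are hypothetical and do not describe the app's internal schema or any Polly API. Only the allowed values (Tabular/Free Text, Dataset/Sample, Text/Number/Dropdown) and the Time Point example come from this guide.

```python
from dataclasses import dataclass, field
from typing import List

# Allowed values, copied from the field descriptions in this guide.
CONFIG_TYPES = {"Tabular", "Free Text"}
LEVELS = {"Dataset", "Sample"}
FIELD_TYPES = {"Text", "Number", "Dropdown"}


@dataclass
class CurationField:
    """One column of a curation UI (hypothetical structure)."""
    header_name: str          # desired column name, e.g. "Time Point"
    key: str                  # column name shown on Polly, e.g. "curated_time_point"
    field_type: str = "Text"
    mandatory: bool = False
    editable: bool = True

    def __post_init__(self) -> None:
        if self.field_type not in FIELD_TYPES:
            raise ValueError(f"unknown field type: {self.field_type!r}")


@dataclass
class CurationConfig:
    """A curation configuration holding one or more fields (hypothetical)."""
    name: str
    config_type: str
    level: str
    description: str = ""
    fields: List[CurationField] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.config_type not in CONFIG_TYPES:
            raise ValueError(f"unknown curation type: {self.config_type!r}")
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level!r}")


# Made-up example: one configuration with two fields.
example_config = CurationConfig(
    name="Time Course",
    config_type="Tabular",
    level="Sample",
    description="Time Point, Dose",
    fields=[
        CurationField("Time Point", "curated_time_point", "Text", mandatory=True),
        CurationField("Dose", "curated_dose", "Number"),
    ],
)
```

Validating the allowed values at construction time mirrors what the form does with its dropdowns: an unsupported type is rejected before the configuration is created.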

After filling in the required details, click on the Submit button to create the new UI.

Figure 4 - Input Fields

Edit existing curation UI

Click on the existing UI

Figure 5 - Edit Curation UI

Click on the edit fields and edit the field names according to your requirements

Figure 6 - Edit Curation UI

Once the UI has been created, you can go back to the homepage of the Curation App and start creating collections and adding datasets to them.

Create Collection

A collection is a group of datasets (added directly from OmixAtlas) to be curated with one curation UI. Datasets from multiple OmixAtlases can be added to a single collection. Each collection contains multiple iterations.

In a collection you can fetch thousands of datasets and have them curated in multiple batches by the curators. Each batch is called an iteration.

Note : Each iteration can be fully approved or rejected on the basis of the overall quality of the datasets curated in that iteration.

Create new collection

Click on the atlas icon to reach the following page.

Figure 7 - Select OmixAtlas

Select any OmixAtlas, for example GEO.

Figure 8 - GEO OmixAtlas

Select the datasets of interest using the filter buttons.
Example - If you need neoplasm samples for liver in humans, apply the corresponding filters and click on the Add to Collection button.

Figure 9 - Select Desired Datasets

You can either select an existing collection or create a new one.

Figure 10 - Create New Collection

While creating a new collection, you need to add the following details. If the required curation UI is not available, you can create a new customized UI by following the steps in the Curation UI section.

Collection : Name of the collection (Example - User_Document)

Curation UI : Select the desired UI from dropdown list (Example - CUS Master)

Description : Any description related to collection (Example - Standard Fields, Sample level)

Click on the submit button to create the collection

Figure 11 - Create New Collection

After the collection is created, go back to the Curation App and click on the desired collection card to assign datasets.

Figure 12 - Collection Card on Curation App

Add datasets to existing collection

You can also select the desired datasets and add them to an existing collection, either directly from the OmixAtlas or using Polly Python.

Figure 13 - Add Datasets to Collection

Assign Datasets

Click on the collection you created to reach the page shown in Figure 14.

Figure 14 - View Iterations

Click on the View Iterations button to reach the page shown in Figure 15.

Figure 15 - Assign Datasets

You can assign datasets to curators by filling out the following details in the window that appears (Figure 16):

Iteration - Name of the iteration (Example - User_Document_10-10-2022)

Description - Description of the iteration (Example - Number of datasets to each curator)

Choose number of datasets to assign - Number of datasets to be assigned to each curator (Example - 30)

Curators - The email ids of the curators to whom the datasets will be assigned (Example - amritanjali.kiran@elucidata.io, ritu.tiwari@elucidata.io)

Note - The number of curators should be even, because the curation methodology follows a double-blinded process: each dataset is assigned to two different curators, who curate it independently.
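To see why an even number of curators is needed, the double-blinded assignment can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the app's actual assignment logic: the function name, the pairing of curators in the order they are listed, and the placeholder dataset IDs and email addresses are all hypothetical.

```python
from typing import Dict, List


def assign_double_blind(
    dataset_ids: List[str], curators: List[str], per_curator: int
) -> Dict[str, List[str]]:
    """Give every dataset to exactly two curators (hypothetical scheme).

    Curators are paired in list order; both members of a pair receive the
    same batch of `per_curator` datasets and curate it independently.
    """
    if not curators or len(curators) % 2 != 0:
        raise ValueError("the number of curators must be even")

    pairs = [curators[i:i + 2] for i in range(0, len(curators), 2)]
    needed = per_curator * len(pairs)
    if len(dataset_ids) < needed:
        raise ValueError(f"need {needed} datasets, got {len(dataset_ids)}")

    assignments: Dict[str, List[str]] = {}
    for index, pair in enumerate(pairs):
        batch = dataset_ids[index * per_curator:(index + 1) * per_curator]
        for curator in pair:
            assignments[curator] = list(batch)
    return assignments


# Placeholder IDs and addresses, for illustration only.
datasets = [f"DATASET_{n}" for n in range(1, 7)]
team = ["a@example.com", "b@example.com", "c@example.com", "d@example.com"]
plan = assign_double_blind(datasets, team, per_curator=3)
```

With four curators and three datasets each, the first pair shares the first three datasets and the second pair shares the next three, so one iteration covers six datasets in this sketch.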

Figure 16 - Assign Datasets

After the datasets are assigned to the curators, one iteration is created.

Data Journey

After assignment, datasets move through the following statuses:

TO DO - Assigned datasets not curated
SUBMITTED - Curated Datasets
DISCUSSION - Datasets having less than 100% mutual agreement (consensus)
FINAL SUBMISSION - Datasets with 100% mutual agreement (consensus)
APPROVED - Correct Datasets
REJECTED - Incorrect Datasets

Note - The consensus and status of the datasets are shown in the review window (Figure 19).
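The relationship between consensus and status can be sketched as follows. This guide does not state how consensus is computed, so the formula below (the percentage of fields on which the two curators entered the same label) is an assumption made for illustration, and the function names are hypothetical. Only the rule that less than 100% consensus leads to DISCUSSION and 100% leads to FINAL SUBMISSION is taken from the status list.

```python
from typing import Dict, Optional


def consensus_score(
    labels_a: Dict[str, Optional[str]], labels_b: Dict[str, Optional[str]]
) -> float:
    """Percentage (0-100) of fields where both curators agree (assumed formula)."""
    keys = set(labels_a) | set(labels_b)
    if not keys:
        return 100.0
    agreeing = sum(1 for key in keys if labels_a.get(key) == labels_b.get(key))
    return 100.0 * agreeing / len(keys)


def status_after_both_submit(score: float) -> str:
    """Map a consensus score to the status named in the data journey."""
    return "FINAL SUBMISSION" if score == 100.0 else "DISCUSSION"


# Made-up labels from two curators for the same dataset.
curator_1 = {"curated_tissue": "liver", "curated_disease": "neoplasm"}
curator_2 = {"curated_tissue": "liver", "curated_disease": "normal"}

score = consensus_score(curator_1, curator_2)
status = status_after_both_submit(score)
```

Here the curators agree on one of two fields, so under the assumed formula the dataset would land in DISCUSSION until the disagreement is resolved.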

Figure 17 - Data Journey

Statistics Overview

This page is available under each collection and lets you visualise the status of curated datasets. You can apply filters of your choice, for example a single iteration or a single curator, and see the number of datasets in discussion, final submission, approved, and so on. You can also filter by updated date.

Figure 18 - Statistics Overview

Review Window

In this window, you can analyze the status of datasets by their consensus (mutual agreement) score, using the filter options available. You can filter datasets for a specific curator or by status, for example final submission datasets for curators 1 and 2.

Figure 19 - Consensus Status

Click on the dataset ID to check the labels annotated by both curators.

Click on the Approve button if the labels are correct.

Click on the Reject button if the labels are wrong.

Figure 20 - Reviewer Window

Bulk Approve/Reject

Bulk Approve/Reject allows the user to approve or reject a complete iteration in one click. After the expert review, if 30% or more of the datasets are approved, the complete iteration is considered approved. Upon approving the iteration, datasets in final submission (with 100% consensus) move to approved. Approved datasets can be exported to a workspace for further use or appended back to the original OmixAtlas.

Rejected datasets are discarded and sent for re-curation.
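The 30% rule can be written out as a short check. The sketch below is illustrative only: the function names are hypothetical, and it assumes the percentage is taken over all datasets in the iteration, which this guide does not specify.

```python
from typing import List


def meets_approval_threshold(statuses: List[str], threshold_percent: int = 30) -> bool:
    """True if at least `threshold_percent` of the datasets are APPROVED.

    Assumption: the percentage is computed over every dataset in the iteration.
    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    if not statuses:
        return False
    approved = statuses.count("APPROVED")
    return approved * 100 >= threshold_percent * len(statuses)


def bulk_approve(statuses: List[str]) -> List[str]:
    """Move FINAL SUBMISSION datasets to APPROVED when the threshold is met."""
    if not meets_approval_threshold(statuses):
        return list(statuses)
    return ["APPROVED" if s == "FINAL SUBMISSION" else s for s in statuses]


# Made-up iteration: 3 approved, 5 in final submission, 2 rejected.
iteration = ["APPROVED"] * 3 + ["FINAL SUBMISSION"] * 5 + ["REJECTED"] * 2
after_bulk_approve = bulk_approve(iteration)
```

In this example exactly 30% of the datasets were approved during review, so the iteration qualifies and the five final-submission datasets move to approved, while the rejected ones stay rejected.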

Figure 21 - Bulk Approve/Reject

Export Datasets

After approving the iteration, click on the Export Data button shown in Figure 21. The export data window will pop up, as shown in Figure 22. Fill in the required details.

Iteration - Name of the iteration from which the data will be exported

Select Workspace - The specific workspace in which exported data will be stored

Export File Format - JSON or CSV format of the exported data

Append Labels to OmixAtlases - Select Yes if the labels are to be updated in the original OmixAtlas and select No if the labels are not to be updated in OmixAtlas
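To give a feel for the two export formats, the sketch below serializes the same approved labels as JSON and as CSV using only the Python standard library. The record layout, the dataset ID, and the function name are made up for illustration; the actual structure of the exported file is not described in this guide.

```python
import csv
import io
import json
from typing import Dict, List


def export_labels(records: List[Dict[str, str]], file_format: str) -> str:
    """Serialize label records as JSON or CSV text (hypothetical layout)."""
    fmt = file_format.upper()
    if fmt == "JSON":
        return json.dumps(records, indent=2)
    if fmt == "CSV":
        if not records:
            return ""
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    raise ValueError(f"unsupported export format: {file_format!r}")


# Made-up approved record; the ID and label are placeholders.
approved_rows = [{"dataset_id": "DATASET_1", "curated_time_point": "24h"}]
as_json = export_labels(approved_rows, "JSON")
as_csv = export_labels(approved_rows, "CSV")
```

JSON keeps nested or free-text labels intact, while CSV is convenient for tabular curation results that will be opened in a spreadsheet.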

Figure 22 - Export Data/ Append labels to OmixAtlases

Active Learning

Active learning is a method in supervised machine learning in which a model is trained on a small amount of training data by prioritizing the most informative examples for labeling. In the Curation App we have integrated an active learning assisted machine learning method for free text curation.

While creating a new collection you can enable active learning as shown in Figure 23.

Note : The option to enable active learning will appear only when you select free text curation

Figure 23 - Enable Active Learning

If you have reviewer rights then you will be able to view consensus, model score and model consensus as shown in Figure 24.

Figure 24 - Model Consensus

You will also be able to view the performance of the model in the overall statistics page of the collection as shown in Figure 25.

Figure 25 - Model Performance

Curator Screen

Click on the Curation icon on the Polly homepage and you will be redirected to the collections page. Click on the specific collection to be curated, as shown in Figure 26.

Figure 26 - Open Collection

Status Statistics

As a curator you can view the status of the curated data for each iteration in a particular collection

Figure 27 - Status Statistics

Iteration

Click on View Iterations, as shown in Figure 27, and then click on the Show All button. You will reach the list of datasets (Figure 28), where you can view the status of each dataset and click on a dataset ID to open the curation window.

Figure 28 - Iteration

Curation Window

Clicking on the dataset ID opens the curation window shown in Figure 29. After curating, click on Save Progress (to edit the labels later) or Mark As Complete (if no further changes to the labels are required and the dataset is fully curated).

Figure 29 - Curation Window

Active Learning Curation Window

In the curator window, a curator can view the model scores. Based on these scores, the curator can prioritize datasets with low scores and curate them first. The model score keeps changing with each curation.
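Prioritizing by model score amounts to sorting datasets so that the ones the model is least confident about come first. The sketch below shows this with made-up dataset IDs and scores; the function name is hypothetical and the app's own ordering may differ.

```python
from typing import Dict, List


def prioritize_by_model_score(scores: Dict[str, float]) -> List[str]:
    """Return dataset IDs ordered from lowest to highest model score.

    Ties are broken by dataset ID so the order is deterministic.
    """
    return sorted(scores, key=lambda dataset_id: (scores[dataset_id], dataset_id))


# Made-up confidence scores in the 0-1 range.
model_scores = {"DATASET_1": 0.91, "DATASET_2": 0.35, "DATASET_3": 0.62}
curation_queue = prioritize_by_model_score(model_scores)
```

Because the score changes with each curation, a queue like this would need to be recomputed after every submitted dataset.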

Figure 30 - Model Score

Click on the dataset link to open the curation window shown in Figure 31. Click on Model Predictions to apply active learning.

Figure 31 - Active Learning Curation Window

A new window with the labels predicted by the model will appear. Check all the labels and click on Apply Predictions. After labeling the dataset with the correct labels, click on Mark As Complete or Save Progress.

Figure 32 - Model Predictions

Curator Performance Chart

As a reviewer you can also analyze the performance of each curator by comparing the number of attempted, approved and rejected datasets.
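The three numbers compared in the chart can be tallied from a list of (curator, status) pairs. The sketch below is illustrative only: the function name and input layout are hypothetical, and it assumes that every dataset not in TO DO counts as attempted.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def curator_performance(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, int]]:
    """Count attempted, approved and rejected datasets per curator.

    Assumption: a dataset is "attempted" once it has left the TO DO status.
    """
    tallies = defaultdict(Counter)
    for curator, status in records:
        counts = tallies[curator]  # ensures curators with only TO DO still appear
        if status != "TO DO":
            counts["attempted"] += 1
        if status == "APPROVED":
            counts["approved"] += 1
        elif status == "REJECTED":
            counts["rejected"] += 1
    return {
        curator: {
            "attempted": counts["attempted"],
            "approved": counts["approved"],
            "rejected": counts["rejected"],
        }
        for curator, counts in tallies.items()
    }


# Made-up review outcomes for two curators.
outcomes = [
    ("curator_1", "APPROVED"),
    ("curator_1", "REJECTED"),
    ("curator_1", "TO DO"),
    ("curator_2", "DISCUSSION"),
]
performance = curator_performance(outcomes)
```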

Figure 33 - Curator Performance Chart

Terminologies

Term	Description
Configuration	The table that contains the columns/labels to be curated
Collection	Group of datasets for one particular curation type
Iterations	Batch of assigned datasets that is undergoing curation or has been curated
Assigned Datasets	The datasets that have been sent for curation
Unassigned Datasets	The datasets added to the collection that are yet to be assigned
Consensus	Mutual agreement between the curators; value ranges from 0 to 100
Tabular	Datasets to be curated in rows and columns format
Free Text	Datasets whose text is highlighted to assign labels. Example - Overall Design and Summary for GEO datasets
Model Predictions	Labels predicted by the model
Model Consensus	Mutual agreement between the model and the manual curator; value ranges from 0 to 100
Model Score	Confidence score of the model; value ranges from 0 to 1
Attempted Datasets	Total number of datasets attempted by a curator

VIDEO

https://www.youtube.com/watch?v=AwmDp6WM_RY&list=PLA_38j1m5-Y7PcqysIC3eehBjTHT-O1dT&index=3