Webinar summary – Semantic annotation of images in the FAIR data era

The Ontologies Community of Practice hosted a webinar series to debate, share, and advance our thinking on selected topics in the domain of ontologies. Here’s what we learned from the third webinar.

***

Digital agriculture increasingly relies on the generation of large quantities of images. These images are processed with machine learning techniques to speed up the identification of objects, their classification, visualization, and interpretation. However, images must comply with the FAIR principles to facilitate their access, reuse, and interoperability. As stated in a recent paper authored by the Planteome team (Trigkakis et al., 2018), “Plant researchers could benefit greatly from a trained classification model that predicts image annotations with a high degree of accuracy.”

In this third Ontologies Community of Practice webinar, Justin Preece, Senior Faculty Research Assistant at Oregon State University, presents the module developed by the Planteome project using the Bio-Image Semantic Query User Environment (BISQUE), an online image analysis and storage platform of CyVerse.

The Planteome module is a community platform for plant image segmentation and semantic classification using machine learning techniques. The project focuses on leaf traits, for which predicting detailed features with greater precision requires a large amount of training data.

The segmentation of a photo is the identification of an area (e.g. a fruit) in the image that is then annotated using reference ontologies. Users can manually delineate an area of interest with markup lines, and the module then identifies it as a separate graphical object. Next, the trained model classifies the segment or the whole image to infer which object is represented. Ontology terms with their Uniform Resource Identifier (URI) are then added to the metadata. The module currently uses a public image bank with 5,000 classified images.
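The annotation step described above can be pictured with a minimal Python sketch. It is not the Planteome or BISQUE API: the class names, the `annotate` and `to_metadata` helpers, the file name, and the metadata layout are all hypothetical, and the Plant Ontology URI for "leaf" is given for illustration and should be checked against the ontology before reuse.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OntologyTerm:
    """A reference-ontology term: a human-readable label plus its URI."""
    label: str
    uri: str


@dataclass
class Segment:
    """A user-delineated region, stored as (x, y) pixel coordinates."""
    outline: list
    terms: list = field(default_factory=list)


def annotate(segment, term):
    """Attach an ontology term to a segment, skipping duplicates by URI."""
    if all(existing.uri != term.uri for existing in segment.terms):
        segment.terms.append(term)
    return segment


def to_metadata(image_source, segments):
    """Build a plain-dict metadata record (hypothetical layout)."""
    return {
        "source": image_source,
        "segments": [
            {
                "outline": seg.outline,
                "annotations": [
                    {"label": t.label, "uri": t.uri} for t in seg.terms
                ],
            }
            for seg in segments
        ],
    }


# Illustrative term; verify the identifier in the Plant Ontology before use.
leaf = OntologyTerm("leaf", "http://purl.obolibrary.org/obo/PO_0025034")
segment = annotate(Segment(outline=[(10, 12), (80, 15), (60, 90)]), leaf)
annotate(segment, leaf)  # annotating twice with the same URI is a no-op
metadata = to_metadata("example_leaf.jpg", [segment])
```

Storing the URI next to the label is what makes the tag machine-actionable: the label can change or be translated, while the URI keeps pointing to the same ontology entry.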

By adding bootstrapping techniques on a smaller set of training data, such a module could be used to identify visual disease symptoms. Around 300 images of leaves with fungal infection could be sufficient for tagging and classifying smaller sets of data. Semi-supervised techniques can make multiplicative use of the data to improve the predictive power.
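The webinar summary does not say which semi-supervised technique is meant, so the following is only a sketch of one common scheme, self-training (sometimes called bootstrapping): a model trained on a few labeled examples labels the unlabeled examples it is confident about, then retrains on the enlarged set. The nearest-centroid classifier, the two-dimensional feature vectors, the class names, and the `min_margin` threshold are all assumptions made for illustration; a real image model would be a trained network working on learned features.

```python
import math


def centroids(labeled):
    """Mean feature vector per class from (features, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(x, cents):
    """Return (label, margin): nearest class and the gap to the runner-up."""
    ranked = sorted((math.dist(x, c), y) for y, c in cents.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin


def self_train(labeled, unlabeled, min_margin=1.0, max_rounds=10):
    """Grow the labeled set with confidently predicted unlabeled examples."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(labeled)
        confident, rest = [], []
        for x in pool:
            y, margin = predict(x, cents)
            (confident if margin >= min_margin else rest).append((x, y))
        if not confident:
            break  # nothing new was learned in this round
        labeled.extend(confident)
        pool = [x for x, _ in rest]
    return centroids(labeled), pool


# Hypothetical toy features: two labeled leaves and three unlabeled ones.
seed = [((0.0, 0.0), "healthy"), ((10.0, 10.0), "infected")]
unlabeled = [(1.0, 1.0), (9.0, 9.0), (5.0, 5.0)]
model, still_unlabeled = self_train(seed, unlabeled)
```

In this toy run the two clear-cut examples are absorbed into the training set, while the ambiguous point halfway between the classes stays unlabeled, which is the behaviour one wants: uncertain cases are left for a human annotator.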

The next step of the Planteome project will be to improve knowledge representation by integrating ontology graphs, in order to better infer image classifications and improve interoperability. To this end, a metadata-querying interface is being developed. The current cutting-edge development is the real-time annotation of videos. Justin provides two examples of real-time annotation of maize kernels under fluorescent light and seed shape on the seed selector trail.

The Planteome module is on the CyVerse public repository, but it is not yet publicly published. This is a collaborative project, as the underlying system can take advantage of user-annotated images that contribute to the training set. Collaborators can propose their controlled vocabulary and contact the teams behind the reference ontologies to integrate their content.

Ontologies, which are powerful tools for knowledge representation and reasoning, can effectively support image tagging for indexing, retrieval, and content comparison. Pier Luigi Buttigieg, Data Scientist at the Alfred Wegener Institute, introduces the concept of “deep tagging” for a FAIRer future.

Knowledge representation is a branch of artificial intelligence that aims to model human knowledge as machine-understandable knowledge and to support options for increased expressivity, advanced querying, and data mobilization. Knowledge representation for machine readability is key because enhancing the tagging of information objects with a knowledge representation enables data discovery and, thus, new data analysis. Ontologies, when they are developed using the OBO Foundry principles, are the perfect tools for machine- and human-readable logical representation of knowledge. Entities and relationships are defined on the web of knowledge along with their properties, processes, agents, etc. and possess a usable URI so as to be accessible to any human or machine. Ontology-linked tags can add a new dimension of interoperability to metadata. High-quality ontologies can also allow new kinds of analysis, driven by both data and machine-actionable knowledge. Community ontologies can be shaped by interacting with their developers.

The Planteome module follows the FAIR principles as far as possible; images are provided by public sources that are cited (e.g. Wikimedia) and described with quality metadata so they are findable, accessible, and reusable. Image annotation using publicly available community standard ontologies, together with the URIs of the terms, is a first step towards interoperability.

The future use of ontology graphs will improve this criterion by offering the most expressive way of finding and reusing the images. One could then envisage a query system that can pull all relevant annotated images from various sources for answering a question such as, “Do we have images about what happens to the biodiversity surrounding a volcano after it has erupted?”
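To make the idea of graph-based retrieval concrete, here is a minimal Python sketch, under stated assumptions, of how a query for a general term could also return images tagged with more specific terms. The small is-a hierarchy, the image file names, and the `descendants` and `find_images` helpers are hypothetical; a real system would use term URIs and the relations published by the reference ontologies.

```python
from collections import deque

# Hypothetical is-a hierarchy: each child term maps to its parent terms.
SUBCLASS_OF = {
    "simple leaf": ["leaf"],
    "compound leaf": ["leaf"],
    "leaf": ["plant organ"],
    "fruit": ["plant organ"],
}

# Hypothetical image annotations: image name -> list of tagged terms.
ANNOTATIONS = {
    "img1.jpg": ["simple leaf"],
    "img2.jpg": ["compound leaf", "fruit"],
    "img3.jpg": ["fruit"],
}


def descendants(term, subclass_of):
    """Return the term plus every term below it in the is-a hierarchy."""
    children = {}
    for child, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    seen = {term}
    queue = deque([term])
    while queue:  # breadth-first walk down the hierarchy
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def find_images(term, annotations, subclass_of):
    """Images tagged with the query term or any of its subclasses."""
    wanted = descendants(term, subclass_of)
    return sorted(
        image for image, tags in annotations.items() if wanted & set(tags)
    )


leaf_images = find_images("leaf", ANNOTATIONS, SUBCLASS_OF)
```

A plain keyword search for "leaf" would miss both leaf images here, because neither is tagged with that exact term; expanding the query through the graph is what recovers them.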

After a decade of ontology development, refinement, connections of concepts into semantic networks and data tagging, we are at a tipping point where the promise of knowledge inference and prediction is poised to become reality.

If you have a project for an image base and an analysis objective, please contact the Planteome team via their website.

Work cited: Trigkakis D. et al. (2018). Planteome & BisQue: Automating Image Annotation with Ontologies using Deep-Learning Networks. Proceedings of the 9th International Conference on Biological Ontology (ICBO 2018), Corvallis, Oregon, USA.

***

The Ontologies Community of Practice needs your expert knowledge to improve its content, so please let the team know of any comments, feedback, and suggestions for improvement. Contact information is available on the Ontologies homepage.

***

We thank the panelists for their engaging and inspirational insights:

Justin Preece
Senior Faculty Research Assistant at Oregon State University
Justin Preece designs and develops software for genomics research, including web and desktop applications, data transformation software, and databases. He currently works on the BISQUE project and the Planteome project that aims to provide reference ontologies for describing the function, traits, and phenotypes of plant genes, varieties, and germplasms.

Pier Luigi Buttigieg
Data Scientist at the Alfred Wegener Institute
Pier Luigi Buttigieg’s work focuses on the application of bioinformatics and multivariate statistics to the diverse data sets derived from microbial ecology investigations. Concurrently, he develops and co-leads the Environment Ontology (ENVO) and the Sustainable Development Goals Interface Ontology (SDGIO) in support of semantically consistent data standardization across the sciences and global development agenda.

***

The next webinar in this year’s Ontologies CoP webinar series will take place in December (exact date TBD). Please join the Ontologies LinkedIn group and subscribe to the mailing list if you wish to receive more information and reminders on this topic.

***

Archived webinars can be found here.

September 10, 2019

Elizabeth Arnaud

Ontologies Community of Practice Lead
Bioversity International


AgroFIMS: Your new companion for easy standardization of data collection and description

The Agronomy Field Information Management System (AgroFIMS) allows users to create fieldbooks to collect agronomic data that is already tied to a metadata standard (the CG Core Metadata Schema, aligned with the standard Dublin Core), and semantic standards like the Agronomy Ontology (AgrO), generating data that is Findable, Accessible, Interoperable, and Reusable (FAIR) at collection. AgroFIMS therefore standardizes data collection and description for easy aggregation and inter-linking across disparate datasets. The fieldbooks you create can be exported to the Android-based KDSmart data collection application, and collected data imported back to AgroFIMS for statistical analysis and reports. In 2021 AgroFIMS will allow you to set up agronomic survey questionnaires, for data collection via ODK. It will also allow easy upload of your “born FAIR” data to Dataverse repository platforms with Dublin Core-compliant metadata schemas. Funding for AgroFIMS was provided by the Bill and Melinda Gates Foundation’s Open Access, Open Data Initiative, and the CGIAR Platform for Big Data in Agriculture. AgroFIMS is under GPL license. Go to AGROFIMS →

Responsible Data Management Guidelines to protect privacy

CGIAR Platform for Big Data in Agriculture advocates open data for agricultural research for development. It considers that opening up research data for scrutiny and reuse confers significant benefits to society.

However, the Platform appreciates that not all research data can be open and that a broad range of legitimate circumstances may require data to be restricted.

As an integral component of its advocacy for open data, the Platform promotes responsible data management through the entire research data lifecycle from planning, collecting, storing, disclosing or publishing, transferring, discovery and archiving.

These guidelines were created from information collected from a review of best and emerging practices across various sectors in the fast-changing landscape of privacy and ethics (130 external resources) and from privacy and ethics materials sourced from seven CGIAR centers. A first draft was circulated for input and feedback across CGIAR, and that feedback was incorporated into this edition. It’s important to note that this is an evolving document; the next stage is to consult externally for further input.

These Guidelines are intended to help agricultural researchers handle privacy and personally identifiable information (PII) in the research project data lifecycle.

Check the guidelines →

REUSE / TRANSFER

Ensure consistency with the DMP-PII and the purpose for which prior informed consent has been obtained
Re-evaluate likelihood of (re-)identification and risk of harm, particularly if it involves a public data-set containing PII (as above)
Ensure PII is stored securely to protect privacy (as above)
Minimize use of PII and risk of disclosure through pro-privacy access controls and analytical tools (as above)

Don’t transfer data containing PII unless you have explicit consent
Don’t transfer data containing PII in the absence of a data sharing agreement (identifying aspects such as purpose and scope of use, privacy protection measures, confidentiality and any limitations)
Don’t reuse or transfer PII until any inconsistencies with the DMP-PII and/or purpose compatibility have been resolved (e.g. through updated ethics review or consent from participant)

ARCHIVING / DISCARDING

Plan for archiving or data destruction early in the process. Destroying data can be more secure, however, archiving can be beneficial if the data has ongoing evidentiary, scientific or cultural value. If archiving, identify where and how, the budget require
Ensure DMP-PII and purpose compatibility (as above)
Ensure adequate security measures to protect privacy (as above)

Don’t wait until the end of the project to assess archiving needs when time and resources may be limited
Don’t assume the longevity of a particular format; future-proof your archived data
Don’t forget to budget for archiving data; this should be done as part of your Data Management Plan

PUBLISHING AND DISCOVERY

Ensure DMP-PII and purpose compatibility (as above)
Re-evaluate likelihood of (re-)identification and risk of harm, particularly if it involves a public data-set containing PII
Indicate in metadata the availability of raw data or minimized data containing PII, if available bilaterally
Minimize use of PII and risk of disclosure through pro-privacy access controls and analytical tools

Don’t include PII in public datasets unless absolutely necessary to preserve the data’s analytic potential, scientific utility or benefit to the participant (and subject to participants’ informed consent and a rigorous risk assessment)

STORAGE AND ANALYSIS

Ensure compatibility with the DMP-PII (as above) and also the purpose for which prior informed consent has been obtained

Ensure PII is stored securely to protect privacy, through organizational or project specific safeguards to prevent unauthorized access, accidental disclosure or breach of data (physical & technical)

encryption for the storage and transmission of PII
access control measures to limit access to PII
two-factor or multifactor authentication
cloud services & back-end security

Don’t store data in unsecured locations or on unsecured devices or servers

Don’t store encrypted data and encryption keys in locations where they can be easily accessed simultaneously

Don’t underestimate the importance and value of administrative safeguards to standardize practices (i.e. organizational policies, procedures and maintenance of security measures that are designed to protect private information, data and access)

COLLECTION

Ensure compatibility with the DMP-PII
De-identify data to anonymize by default unless it will impair the data’s analytic potential, scientific utility or benefit to the participant
If you cannot anonymize, minimize the PII and pseudonymize to reduce the disclosure risk
Provide research participants sufficient information to use reasoned judgment to decide whether or not they wish to participate in the project
Ensure informed consent is designed to address the following elements:
- competence, comprehension, full disclosure, voluntariness
- legitimate scientific purpose for which the PII is collected and scope of use (e.g. stored, transferred, published and whether as anonymized, minimized or raw data)
- foreseeable risk of privacy loss and consequences
- meaningful alternatives including opt-in protection/anonymization
- safeguards to protect privacy, conditions on which PII may be shared and any limitations on reuse or third-party access and use of PII
- permission to follow-up or contact the participant and for what purpose (including by third parties)
- participant’s right to withdraw and rights regarding their data (e.g. to be informed; to access; to rectify; to object; to erase)
- inclusion of physical, phone and/or electronic contact (at least two forms of contact) that the participant can reach to exercise his/her rights
- explicit consent and participant’s acknowledgement of understanding
- if written, provide the participant a copy of processed informed consent
Use plain language and adapt informed consent to meet the needs of vulnerable populations (e.g. obtain orally or in local language)

Don’t collect PII unless you have a Data Management Plan and any necessary approvals in place, including the recorded approval of the potential participant
Don’t collect PII unless you absolutely need it
Don’t assume that removal of direct identifiers is sufficient to anonymize data or that all de-identification techniques will result in anonymized data. Consider the risk of re-identification of a research participant, particularly if datasets are combined. If there is a reasonable risk of re-identification the information should be handled as PII (i.e. undertake risk analysis, evaluate stronger anonymization techniques, seek informed consent for the disclosure of data and explain its possible consequences)
Don’t include vulnerable participants or communities if their ability or capacity to provide voluntary informed consent is genuinely in question
Don’t underestimate the potential of quasi or indirect identifiers to identify an individual, particularly the inherent ability of location-based data to identify participants and their communities, and the increased risk of harm this may pose to potentially vulnerable individuals/communities
Avoid seeking overly broad consent that may call into question transparency or a research participant’s understanding regarding the use of their PII. Be specific regarding the activities, purpose and limitations associated with PII so that the participant can make a genuinely informed decision and downstream users can evaluate purpose compatibility and seek fresh consent if needed
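As an illustration of the "minimize the PII and pseudonymize" guidance above, the following Python sketch replaces a direct identifier with a keyed hash and keeps only the fields needed for analysis. This is one possible technique, not a method prescribed by the guidelines; the field names, the demo key, and the helper functions are hypothetical. Pseudonymized data is not anonymized: anyone holding the key can re-link records, and the remaining fields may still act as indirect identifiers.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Derive a stable pseudonym from an identifier with HMAC-SHA256."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def minimize_and_pseudonymize(record, secret_key, id_field, keep):
    """Keep only the listed fields and swap the identifier for a pseudonym."""
    minimized = {name: record[name] for name in keep if name in record}
    minimized["participant_id"] = pseudonymize(record[id_field], secret_key)
    return minimized


# Hypothetical survey record; the key below is for demonstration only and
# must be generated securely and stored apart from the data in practice.
demo_key = b"demo-key-do-not-use-in-production"
raw = {
    "name": "Jane Farmer",
    "phone": "555-0100",
    "crop": "maize",
    "yield_t_ha": 4.2,
}
safe = minimize_and_pseudonymize(
    raw, demo_key, id_field="name", keep=("crop", "yield_t_ha")
)
```

Because the same identifier and key always give the same pseudonym, records from the same participant can still be linked across datasets, which is also why the key deserves the same protection as the PII itself.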

PLANNING AND APPROVAL

Develop a Data Management Plan which governs the handling of PII in the research project and beyond (DMP-PII). It should address:
- the type and nature of PII
- compliance requirements (including necessary forms for obtaining consent, and ethics clearance, if applicable)
- legitimate research objectives that will be advanced by the PII
- foreseeable risks and consequences if participants are identified from the data
- privacy protection measures (or lack thereof) for collection, storage, transfer and publishing
- process for obtaining informed consent
- timeframe or trigger for archiving or deletion of PII
Employ stricter standards for research involving vulnerable populations such as children or illiterate participants or sensitive data such as ethnicity or religious beliefs
Undertake due-diligence of datasets previously collected by you or third parties to ensure you are entitled/permitted to use for your research project
Consult the legal, IRB or ethics clearance committee or any other relevant institutional group for specific institutional, local, regional or national policies and regulatory frameworks that may apply to PII in the context of your work

Don’t leave the handling of PII and privacy protection as an afterthought; plan ahead!
Don’t forget to check local laws and donor or third-party requirements in addition to institutional policies governing research ethics and privacy protection (seek expert support if unsure!)
Don’t ignore ethical practices/standards; if your institution does not have an ethics framework or clearance process in place, self-assess!
In assessing whether information is capable of identifying someone (i.e. PII) don’t limit your focus to direct identifiers, also consider indirect/quasi identifiers. Appreciate this will depend on the context of the research project, the data in question and external data which is or may become otherwise available (i.e. there is no exhaustive list).
In assessing risk of harm don’t forget to consider potential harm to the participant’s community or groups of individuals that can otherwise be identified or associated with the participant
