If Mary has a little lamb, who should know about it?

The GDPR and data science for agricultural development are at odds when it comes to delivering personalized products for the public good.

Photo by Noel Reynolds

The new European law on privacy and data confidentiality is causing ripples across the world, especially for organizations that rely on open data access to provide products for the public benefit.

The General Data Protection Regulation (GDPR) came into effect on May 25, 2018, enforcing new guidelines and restrictions for the collection and storage of personally identifiable information (PII). This puts many organizations in a difficult position, including the international agricultural research centers of CGIAR, which are dedicated to providing research data products as global public goods. Increased open access to such data products is fueling a “Big Data revolution” in agronomy and international development, and the GDPR could significantly impact this progress.

The recent Facebook-Cambridge Analytica scandal and the presumed unmasking of the street artist Banksy have raised awareness that private and public data can be used for unexpected and possibly undesirable purposes. While these examples are outside the scope of research for agricultural development, they illustrate how personal data from farmers might be used for unintended purposes.

If the knowledge that Mary has a little lamb is useful both to those providing much-needed quality services and to wolves, who should know?

The scientific community is, therefore, asking what GDPR means for the provision of these global public goods.

The GDPR could jeopardize efforts to bring the big data revolution to farmers in low- and middle-income countries. This is because many international organizations that work in these countries have ties to Europe and may be exposed to litigation under the regulation. Clear guidelines for implementing the GDPR in research are still lacking, and in most research institutions the relevant policies have not been updated to address the new situation. There is therefore a possibility that legal departments will argue for undue restrictions to be on the safe side.

Depersonalizing data and aggregating it to a level that is beyond all reproach is possible, but doing so may destroy the value of the data for researchers, for product developers, and for the data subjects themselves. One of the big opportunities in digital agriculture is in developing personalized services for farmers. If you aggregate out all the specificity in the data, you can’t personalize the information, and you deny the end users the opportunity to benefit from it.

Before we delve into the GDPR and what it may mean for science, we need to be clear that the issues at hand pertain especially to raw human-subjects research (HSR) data. Much other agricultural research data has already been sufficiently de-identified when used for analysis, and these data sets do not fall under the GDPR.

The main standards of the GDPR that all organizations are expected to comply with when handling data that contains personally identifiable information are:

Transparency and lawfulness: Personal data must be processed lawfully, fairly and transparently.
Purpose: There has to be a specific and legitimate reason behind the collection and the processing of the data. To put it simply, you have to be able to explain clearly and in great detail why you are gathering the data and what you are going to do with it.
Minimization: You should only collect the minimum amount of data that you need for your purpose. On top of that, you should only keep the data that is absolutely necessary to keep.
Accuracy: Personal data should always be accurate and kept up to date. Personal data that is either unreliable or outdated should be reviewed or deleted.
Storage: As soon as the personal data is no longer necessary for your purpose, it has to be deleted. This does not mean all data has to be deleted, but only the personally identifiable information.
Confidentiality and integrity: Privacy-sensitive data should be stored securely.

But that is not enough. At least one of the following criteria needs to be met to comply with the first standard mentioned above:

Consent: The data subject should provide unambiguous positive consent to the processing of his/her personal data for one or more specific purposes.
Performance of contract: The data in question should be processed in order to allow or facilitate the performance of a contract to which the data subject is a party. Data processing can also be a way for the data subject to initiate a contract agreement.
Legal requirement: The data manager is obliged by national or European law to process the personal data. This is not relevant for the cases we are discussing here.
Vital interests: In cases of emergency (e.g., medical), the processing of personal data is allowed in order to protect interests of vital importance either for the data subject or for another person connected with the subject.

Guidelines for how to manage data collection, storage, analytics, and reporting will vary across scientific domains, but the standards outlined above provide a clear demarcation.
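
To make the minimization and storage standards more concrete, here is a minimal Python sketch of removing PII from a record once it is no longer needed, while keeping the non-identifying fields. The field names in `PII_FIELDS`, the two-year `RETENTION` period and the sample record are all hypothetical. The GDPR does not prescribe them, and a real project would take them from its own data management plan.

```python
from datetime import date, timedelta

# Hypothetical field names; a real survey defines its own list of PII fields.
PII_FIELDS = {"name", "phone", "gps_lat", "gps_lon"}
# Hypothetical retention period; the GDPR does not prescribe a fixed number.
RETENTION = timedelta(days=730)


def strip_expired_pii(record, today):
    """Return a copy of record with PII removed once retention has lapsed.

    Non-identifying fields are always kept: only the personally
    identifiable information has to be deleted.
    """
    if today - record["collected_on"] <= RETENTION:
        return dict(record)
    return {k: v for k, v in record.items() if k not in PII_FIELDS}


# Made-up record for illustration only.
record = {
    "name": "Mary",
    "phone": "555-0100",
    "gps_lat": 19.53,
    "gps_lon": -98.85,
    "collected_on": date(2015, 6, 1),
    "flock_size": 1,
}
cleaned = strip_expired_pii(record, today=date(2018, 6, 8))
```

In this sketch the record is older than the assumed retention period, so `cleaned` keeps only the collection date and the flock size.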

Let’s take some key examples of data that would fall under GDPR:

Household survey data on the adoption of agricultural technology and its effect on livelihoods that is made public.
The use of satellite remote sensing data to make inferences about cropping systems.
The development and provision of data-driven services to smallholders using multiple data sources including farmer data.

The first case is a classic human-subjects research (HSR) example. We can assume that the research went through an Institutional Review Board or Ethics Committee for clearance. This means that the issue of consent should have been addressed in detail. The purpose of the data collection is clearly defined in a research protocol. Raw HSR data should always be stored in a secure location. For research purposes, the individual is usually not the object of investigation; instead, depersonalized data are used to infer broader conclusions. The crucial role of ethics committees is to pass judgment on how the data is analyzed.

Most research is not problematic since it does not need personally identifiable data. Personal data is usually collected for follow-up research and/or for auditing enumerators. Geospatial location data of fields and homesteads are often collected but do not need to be published. Data made public through open access is outside the sphere of influence of an ethics committee. It should therefore be stripped of all personally identifiable data, and geospatial coordinates should be blurred to the point where they indicate a general area rather than a specific household. How best to do this is a research question, as the amount of blurring required depends on location and data type. Data such as household rosters should be aggregated to prevent households from being identified when the data is combined with other sources such as social media feeds. Finally, any sensitive questions and their answers should be masked.
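
One simple way to make coordinates point to a general area is to snap them to the centre of a coarse grid cell before publication. The Python sketch below is an illustration under stated assumptions, not a recommended standard. The 0.1-degree cell size (roughly 11 km of latitude), the field names `lat` and `lon`, and the list of direct identifiers are all hypothetical, and a suitable cell size depends on location and data type.

```python
import math


def coarsen(lat, lon, cell_deg=0.1):
    """Snap a coordinate to the centre of its grid cell.

    cell_deg is a hypothetical cell size in decimal degrees; the
    appropriate value depends on location and data type.
    """
    def snap(value):
        return (math.floor(value / cell_deg) + 0.5) * cell_deg

    return round(snap(lat), 6), round(snap(lon), 6)


def publishable(record, direct_identifiers=("name", "phone")):
    """Drop direct identifiers and replace the exact location with a coarse one."""
    out = {k: v for k, v in record.items() if k not in direct_identifiers}
    out["lat"], out["lon"] = coarsen(record["lat"], record["lon"])
    return out


# Made-up household record for illustration only.
household = {"name": "Mary", "phone": "555-0100",
             "lat": 19.5312, "lon": -98.8467, "flock_size": 1}
public_record = publishable(household)
```

All households inside the same cell end up with identical published coordinates, so the location alone no longer singles out a homestead. Other fields can still identify a household, which is why the roster aggregation and masking steps described above remain necessary.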

Providing more detailed metadata on relevant data that has been masked or blurred can give interested third parties the information they need to request the original data for specific lawful purposes.

For maintaining high-quality research data, one generally does not want to update personally identifiable or other data; a dataset is a snapshot in time and needs to be curated as it was collected. In contrast, the GDPR prescribes either updating or deletion. The pragmatic thing to do would be to delete PII as data gets older, but we note that the risk of abuse of the data, in fact, diminishes as the data gets older.

Remotely sensed data from satellites are publicly available, whether free or at a cost. However, the data pertains to the assets of individuals and the decisions they have made on the management thereof. Ideally, such data are used together with field observations, e.g. on input use and crop response at specific locations. Given the large spatial and temporal variability in agriculture, this type of research needs compilations of data from many field research programs, and it thus depends on the availability of open data.

Two opposing viewpoints emerge. On the one hand, there is the strong conviction that the availability of accurately georeferenced field data provides a large benefit to society, as it finally allows us to take spatial variation properly into account in agricultural research in various disciplines, while personal risks are likely minimal. On the other hand, making field data available through open access could be seen as a violation of a subject’s privacy. Moreover, there is no prior informed consent when using satellite data.

Because of the potential for misuse of granular geospatially referenced data, research organizations increasingly require ethics clearance for research using this data. An ethics clearance can address many of the issues related to the data use.

It is, however, very difficult to do an ethics review on open access data, as future data use is unknown.

Open access is fundamental to maintaining public research in agricultural development — otherwise, only private companies and perhaps a few very large research institutions would have relevant data.

The challenge thus is to come up with reasonable guidelines on how to blur PII to strike a balance between individual risk and potential societal benefit. Research is needed to set guidelines about, for example, how much noise to add to location data given the location, and the type of data provided.
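
As an illustration of what “adding noise to location data” could look like in practice, the Python sketch below displaces a point by a random distance in a random direction, with the maximum distance depending on the setting. The radii in `MAX_DIST_M` are hypothetical placeholders, since choosing them is exactly the open research question described above. The coordinates are made up, and the flat-earth approximation is an assumption that only holds for short displacements away from the poles.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000


def displace(lat, lon, max_dist_m, rng):
    """Move a point a random distance (up to max_dist_m) in a random direction.

    Uses a flat-earth approximation, which is adequate for displacements
    of a few kilometres away from the poles.
    """
    bearing = rng.uniform(0, 2 * math.pi)
    # The square root makes displaced points uniform over the disc area.
    dist = max_dist_m * math.sqrt(rng.random())
    dlat = (dist * math.cos(bearing)) / EARTH_RADIUS_M
    dlon = (dist * math.sin(bearing)) / (
        EARTH_RADIUS_M * math.cos(math.radians(lat))
    )
    return lat + math.degrees(dlat), lon + math.degrees(dlon)


# Hypothetical radii: sparsely populated areas may need more noise.
MAX_DIST_M = {"urban": 2_000, "rural": 5_000}

rng = random.Random(42)  # fixed seed only to make the example repeatable
new_lat, new_lon = displace(-1.2921, 36.8219, MAX_DIST_M["rural"], rng)
```

The seed is fixed here only so the example is repeatable. For actual publication the displacement would have to be unpredictable, and the original coordinates would stay in secure storage.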

In the third case, data is used in a contractual relationship between a farmer and a service provider. There is consent, and therefore the use of the data is lawful. It then becomes important for the service provider to adhere to the standards. One of the big opportunities in digital agriculture is in developing personalized services for farmers. If you aggregate out all the specificity in the data, you can’t personalize the information, and you deny the end users the opportunity to benefit from it. Fortunately, the GDPR captures this case explicitly.

There is a clear role for scientific communities to develop guidelines for specific purposes, as the three examples demonstrate.

An overly cautious approach to data sharing would keep it legal, but would also inhibit uses of the data that can benefit the development of the agricultural sector in low- and middle-income countries.

The ability to learn from combined long-term data sets that include field trials, household surveys and remotely sensed data holds enormous promise. We should be careful not to squander the public benefit because of merely hypothetical risks to individuals. Through careful use of prior informed consent, we could exercise both the right to be forgotten (to not be identified) and the right to be remembered (to assure that research data can be used for maximal benefits).

Pragmatic approaches need to be developed to ensure that the protection of individuals is very high while maximizing the potential benefits to society.

Jun 8, 2018

Written by Gideon Kruseman (CIMMYT and CGIAR Platform for Big Data in Agriculture) & Robert Hijmans (UC Davis)

Mexico City, Mexico / California, United States



Responsible Data Management Guidelines to protect privacy

CGIAR Platform for Big Data in Agriculture advocates open data for agricultural research for development. It considers that opening up research data for scrutiny and reuse confers significant benefits to society.

However, the Platform appreciates that not all research data can be open and that a broad range of legitimate circumstances may require data to be restricted.

As an integral component of its advocacy for open data, the Platform promotes responsible data management through the entire research data lifecycle from planning, collecting, storing, disclosing or publishing, transferring, discovery and archiving.

These guidelines were created from information collected from a review of best and emerging practices across various sectors in the fast-changing landscape of privacy and ethics (130 external resources) and from privacy and ethics materials sourced from seven CGIAR centers. A first draft was circulated across CGIAR for input, and the feedback was incorporated into this edition. This is an evolving document; the next stage is to consult externally for further input.

These Guidelines are intended to assist agricultural researchers in handling privacy and personally identifiable information (PII) in the research project data lifecycle.


REUSE / TRANSFER

Ensure consistency with the DMP-PII and the purpose for which prior informed consent has been obtained
Re-evaluate likelihood of (re-)identification and risk of harm, particularly if it involves a public data-set containing PII (as above)
Ensure PII is stored securely to protect privacy (as above)
Minimize use of PII and risk of disclosure through pro-privacy access controls and analytical tools (as above)

Don’t transfer data containing PII unless you have explicit consent
Don’t transfer data containing PII in the absence of a data sharing agreement identifying aspects such as purpose and scope of use, privacy protection measures, confidentiality and any limitations
Don’t reuse or transfer PII until any inconsistencies with the DMP-PII and/or purpose compatibility have been resolved (e.g. through updated ethics review or consent from participant)

ARCHIVING / DISCARDING

Plan for archiving or data destruction early in the process. Destroying data can be more secure, however, archiving can be beneficial if the data has ongoing evidentiary, scientific or cultural value. If archiving, identify where and how, the budget require
Ensure DMP-PII and purpose compatibility (as above)
Ensure adequate security measures to protect privacy (as above)

Don’t wait until the end of the project to assess archiving needs when time and resources may be limited
Don’t assume the longevity of a particular format; future-proof your archived data
Don’t forget to budget for archiving data; this should be done as part of your Data Management Plan

PUBLISHING AND DISCOVERY

Ensure DMP-PII and purpose compatibility (as above)
Re-evaluate likelihood of (re-)identification and risk of harm, particularly if it involves a public data-set containing PII
Indicate in metadata the availability of raw data or minimized data containing PII, if available bilaterally
Minimize use of PII and risk of disclosure through pro-privacy access controls and analytical tools

Don’t include PII in public datasets unless absolutely necessary to preserve the data’s analytic potential, scientific utility or benefit to the participant (and subject to the participant’s informed consent and a rigorous risk assessment)

STORAGE AND ANALYSIS

Ensure compatibility with the DMP-PII (as above) and also the purpose for which prior informed consent has been obtained

Ensure PII is stored securely to protect privacy, through organizational or project specific safeguards to prevent unauthorized access, accidental disclosure or breach of data (physical & technical)

encryption for the storage and transmission of PII
access control measures to limit access to PII
two-factor or multifactor authentication
cloud services & back-end security

Don’t store data in unsecured locations or on unsecured devices or servers

Don’t store encrypted data and encryption keys in locations where they can be easily accessed simultaneously

Don’t underestimate the importance and value of administrative safeguards to standardize practices (i.e. organizational policies, procedures and maintenance of security measures that are designed to protect private information, data and access)

COLLECTION

Ensure compatibility with the DMP-PII
De-identify data to anonymize by default unless it will impair the data’s analytic potential, scientific utility or benefit to the participant
If you cannot anonymize, minimize the PII and pseudonymize to reduce the disclosure risk
Provide research participants sufficient information to use reasoned judgment to decide whether or not they wish to participate in the project
Ensure informed consent is designed to address the following elements:
- competence, comprehension, full disclosure, voluntariness
- legitimate scientific purpose for which the PII is collected and scope of use (e.g. stored, transferred, published and whether as anonymized, minimized or raw data)
- foreseeable risk of privacy loss and consequences
- meaningful alternatives including opt-in protection/anonymization
- safeguards to protect privacy, conditions on which PII may be shared and any limitations on reuse or third- party access and use of PII
- permission to follow-up or contact the participant and for what purpose (including by third- parties)
- participant’s right to withdraw and rights regarding their data (e.g. to be informed; to access; to rectify; to object; to erase)
- inclusion of physical, phone and/or electronic contact details (at least two forms of contact) that the participant can use to exercise his/her rights
- explicit consent and participant’s acknowledgement of understanding
- if written, provide the participant a copy of processed informed consent
Use plain language and adapt informed consent to meet the needs of vulnerable populations (e.g. obtain orally or in local language)
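
Where data cannot be anonymized, pseudonymization can be sketched as replacing each direct identifier with a keyed hash. The Python example below uses HMAC-SHA256 from the standard library. It is one possible technique, not one prescribed by these guidelines. The key and the identifier format `farmer-00123` are hypothetical, and a real key would be generated randomly and stored separately from the data.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Map an identifier to a stable pseudonym using keyed hashing (HMAC-SHA256)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    # Truncated for readability; a longer prefix lowers the collision risk.
    return digest.hexdigest()[:16]


# Hypothetical key for illustration; a real key would be generated randomly
# and kept apart from the pseudonymized data.
KEY = b"replace-with-a-randomly-generated-key"

p1 = pseudonymize("farmer-00123", KEY)
p2 = pseudonymize("farmer-00123", KEY)
p3 = pseudonymize("farmer-00124", KEY)
```

The same identifier always yields the same pseudonym, so records can still be linked for follow-up research. Without the key, the pseudonym cannot simply be recomputed from a list of known names, but anyone holding the key can re-identify participants, so pseudonymized data should still be handled as PII.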

Don’t collect PII unless you have a Data Management Plan and any necessary approvals in place, including the recorded approval of the potential participant
Don’t collect PII unless you absolutely need it
Don’t assume that removal of direct identifiers is sufficient to anonymize data or that all de-identification techniques will result in anonymized data. Consider the risk of re-identification of a research participant, particularly if datasets are combined. If there is a reasonable risk of re-identification the information should be handled as PII (i.e. undertake risk analysis, evaluate stronger anonymization techniques, seek informed consent for the disclosure of data and explain its possible consequences)
Don’t include vulnerable participants or communities if their ability or capacity to provide voluntary informed consent is genuinely in question
Don’t underestimate the potential of quasi or indirect identifiers to identify an individual, particularly the inherent ability of location-based data to identify participants and their communities, and the increased risk of harm this may pose to potentially vulnerable individuals/communities
Avoid seeking overly broad consent that may call into question transparency or a research participant’s understanding regarding the use of their PII; be specific regarding the activities, purpose and limitations associated with PII so that the participant can make a genuinely informed decision and downstream users can evaluate purpose compatibility and seek fresh consent if needed
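
The risk posed by quasi-identifiers can be checked roughly by counting how many records share each combination of their values, an idea usually called k-anonymity. These guidelines do not name that technique, so the Python sketch below is only one possible way to do such a risk analysis. The column names and records are made up.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Made-up records for illustration only.
records = [
    {"village": "A", "age_band": "30-39", "crop": "maize"},
    {"village": "A", "age_band": "30-39", "crop": "maize"},
    {"village": "A", "age_band": "40-49", "crop": "beans"},
    {"village": "B", "age_band": "30-39", "crop": "maize"},
    {"village": "B", "age_band": "30-39", "crop": "wheat"},
]

k_fine = smallest_group(records, ["village", "age_band", "crop"])
k_coarse = smallest_group(records, ["village"])
```

Here `k_fine` is 1, meaning at least one participant is unique on the combination of village, age band and crop even though no name is present. Publishing only the village gives a smallest group of 2. A small value is a signal to aggregate further or to treat the data as PII.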

PLANNING AND APPROVAL

Develop a Data Management Plan which governs the handling of PII in the research project and beyond (DMP-PII). It should address:
- the type and nature of PII
- compliance requirements (including necessary forms for obtaining consent, and ethics clearance, if applicable)
- legitimate research objectives that will be advanced by the PII
- foreseeable risks and consequences if participants are identified from the data
- privacy protection measures (or lack thereof) for collection, storage, transfer and publishing
- process for obtaining informed consent
- timeframe or trigger for archiving or deletion of PII
Employ stricter standards for research involving vulnerable populations such as children or illiterate participants or sensitive data such as ethnicity or religious beliefs
Undertake due-diligence of datasets previously collected by you or third parties to ensure you are entitled/permitted to use for your research project
Consult the legal, IRB or ethics clearance committee or any other relevant institutional group for specific institutional, local, regional or national policies and regulatory frameworks that may apply to PII in the context of your work

Don’t leave the handling of PII and privacy protection as an afterthought; plan ahead!
Don’t forget to check local laws and donor or third-party requirements in addition to institutional policies governing research ethics and privacy protection (seek expert support if unsure!)
Don’t ignore ethical practices/standards; if your institution does not have an ethics framework or clearance process in place, self-assess!
In assessing whether information is capable of identifying someone (i.e. PII) don’t limit your focus to direct identifiers, also consider indirect/quasi identifiers. Appreciate this will depend on the context of the research project, the data in question and external data which is or may become otherwise available (i.e. there is no exhaustive list).
In assessing risk of harm don’t forget to consider potential harm to the participant’s community or groups of individuals that can otherwise be identified or associated with the participant
