Data cleaning, the unsexy but essential aspect of data science
“We have decades of agricultural data that used to be locked up that are now made available. Most of them are unstructured, so it’s kind of a mess,” said Daniel Jimenez.
Data science involves creating models that can predict the future, such as what the yields will be for the next planting season. This work, arguably, is ‘sexy.’
There’s an aspect of data science, though, that is ‘unsexy’ but a prerequisite for developing those predictive models. It’s called data cleaning or data curation.
Depending on the quality of a data set, data cleaning takes up between 60 percent and 80 percent of a data scientist’s or a data analytics team’s time. Daniel Jimenez, who leads the CIAT data team, attributed the long time it takes to clean agricultural data to the lack of standardization of common terms used in the field.
Within the CGIAR Platform for Big Data in Agriculture, there’s a group of experts working to standardize agricultural terms. Both the Ontologies Data community of practice and the Organize module have developed tools for data harmonization.
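To give a sense of what harmonizing terms can look like in practice, here is a minimal sketch in Python. It is not one of the CGIAR tools mentioned above. The synonym table, the function names and the sample records are all hypothetical and chosen only for illustration. The idea is that several spellings of the same crop are mapped to one agreed term, and anything unrecognized is flagged for a person to review.

```python
import unicodedata

# Hypothetical synonym table: variant spellings mapped to one agreed term.
# A real project would draw these from a shared ontology, not a hand-written dict.
CANONICAL_TERMS = {
    "maize": "maize",
    "corn": "maize",
    "maiz": "maize",
    "cassava": "cassava",
    "yuca": "cassava",
    "manioc": "cassava",
}


def normalize(raw):
    """Trim whitespace, lowercase, and strip accents from a raw label."""
    text = unicodedata.normalize("NFKD", raw.strip().lower())
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def harmonize(raw):
    """Return the agreed term for a raw label, or None if it is not recognized."""
    return CANONICAL_TERMS.get(normalize(raw))


# Illustrative records only, not real data.
records = ["Maize", " corn ", "Maíz", "Yuca", "sorgo"]
harmonized = [harmonize(r) for r in records]

# Labels that could not be matched are left for manual review.
unmatched = [r for r, h in zip(records, harmonized) if h is None]
```

In this sketch the unmatched labels are collected rather than guessed at, since a wrong automatic match would quietly corrupt the data set.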
Beyond the CGIAR system, there’s still much work to be done.
Maria Eliza Villarino
Cali, Colombia
This article’s original version is posted on the CIAT blog.