In this interview, Sebastian Neumaier, Research & Innovation co-chair of SEMANTiCS 2022, talks about the opportunities and challenges of open data in the context of AI. To use open data in a trustworthy way, providers need to play an active role and protect the data from both intentional manipulation and unintentional inaccuracies.
You have been working a lot on the decentralization of data exchange in recent years, especially in the context of open data. How do you perceive the status of open data? Have the promises been fulfilled?
Open data initiatives have been successful in many respects. There are, for instance, well-established open government data strategies, and the FAIR principles have become an integral part of the publication of research data. For many companies, however, open data initiatives lack accountability and traceability: they want to determine and trace the use of their datasets, remain in control of their deployment, and decide under what policies their data is made available. To address these needs, we have to work on new data sharing practices, standards, and interoperable infrastructures.
One of your special topics is policy-aware systems. What is this topic about?
New data sharing practices, as discussed above, bring with them a variety of requirements for access and usage policies. Manually ensuring policy compliance can become a time-consuming, costly, and error-prone task, especially when multiple parties are involved. A policy-aware system is a system that includes transparent and interoperable data-sharing policies so that such compliance checks can be modeled and automated.
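As a concrete illustration of such an automated compliance check, the following is a minimal sketch in Python. The policy model here (dataset name, allowed purposes, expiry date) is a deliberately simplified, hypothetical one; real policy-aware systems would typically express policies in an interoperable standard such as the W3C ODRL vocabulary rather than ad-hoc classes.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set

# Hypothetical, simplified policy model for illustration only.
@dataclass
class UsagePolicy:
    dataset: str
    allowed_purposes: Set[str] = field(default_factory=set)
    expires: Optional[date] = None

@dataclass
class AccessRequest:
    dataset: str
    purpose: str
    requested_on: date

def is_compliant(policy: UsagePolicy, request: AccessRequest) -> bool:
    """Automated check: right dataset, permitted purpose, policy not expired."""
    if request.dataset != policy.dataset:
        return False
    if request.purpose not in policy.allowed_purposes:
        return False
    if policy.expires is not None and request.requested_on > policy.expires:
        return False
    return True

policy = UsagePolicy(
    dataset="air-quality-2023",
    allowed_purposes={"research", "journalism"},
    expires=date(2025, 12, 31),
)

print(is_compliant(policy, AccessRequest("air-quality-2023", "research", date(2024, 6, 1))))   # True
print(is_compliant(policy, AccessRequest("air-quality-2023", "marketing", date(2024, 6, 1))))  # False
```

Because the policy is machine-readable, the same check can run automatically for every request and every party involved, which is exactly the manual, error-prone work a policy-aware system is meant to remove.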
What role will open data play in the advancement of AI? What opportunities and risks do you see?
Open data already plays an important role here. Large public datasets help train predictive models; in the health domain, for instance, open data helps pharmaceutical companies improve existing treatments and develop new cures. A machine learning model based on open data played a critical role in the clinical trial of a COVID-19 vaccine by providing recommendations for selecting trial participants based on where viral hotspots were likely to emerge.
Despite these opportunities, the use of open data in AI also carries risks that should be recognized: for instance, the risk that personal information could be released by re-identifying individuals in (anonymized) datasets, or the risk of bias encoded in the data, e.g., when groups are underrepresented or when historical data reflects stereotypes. To use open data in a trustworthy way, providers need to play an active role and protect the data from both intentional manipulation and unintentional inaccuracies. Important aspects here are transparent data governance and sovereign control over access to data.