A brief talk with Laura Hollink

May 11, 2017

Laura Hollink is data science chair at SEMANTiCS2017 and a researcher at CWI, the Netherlands' national research institute for mathematics and computer science. In the Information Access research group at CWI, she and her colleagues develop methods and techniques to support users in accessing information that is heterogeneous, subjective and potentially inconsistent. They collaborate with social scientists and humanities researchers, who typically work with this kind of complex data. She expects the conference to be more interdisciplinary than ever, with more and more research in which Linked Open Data is used not in isolation but in combination with techniques from other fields.

Can you tell us something about your work and research focus?

This work requires a combination of methods: semantic web technology, statistical methods and machine learning used together, rather than any one of them alone. It also demands a critical attitude when using technology to interpret data. We advocate transparency of the entire tool-chain to improve reproducibility and trust.

One of my current research interests is the analysis of how data is used: which queries or requests were made on the data, which pieces of data were clicked, viewed or downloaded, and which were never used. This adds to our understanding of the data by, for example, pointing to the most frequently requested or most popular pieces of data. It also increases our understanding of the technology used to provide access to the data and to what extent this technology meets user needs. We are presently collaborating with the National Library of the Netherlands to study the usage of their archive of historic newspapers. We go beyond the standard query and click log analysis by including semantic annotations of the documents in the collection in our analysis. In this way, we've been able to discover distinct usage patterns for different parts of the collection, and provide concrete recommendations to the library about how to improve their collection metadata, search interface and retrieval mechanism.
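As a rough illustration of what combining click logs with semantic annotations could look like, here is a minimal Python sketch. It is not the analysis carried out with the National Library: the log records, document ids and annotation labels are all invented for the example, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical click log: (query text, id of the clicked document).
click_log = [
    ("amsterdam flood", "doc1"),
    ("football results", "doc2"),
    ("amsterdam flood 1916", "doc1"),
    ("ship arrivals", "doc3"),
]

# Hypothetical semantic annotations attached to each document in the collection.
annotations = {
    "doc1": {"news", "disaster"},
    "doc2": {"sports"},
    "doc3": {"shipping", "news"},
    "doc4": {"advertisement"},
}


def clicks_per_annotation(log, annotations):
    """Count clicks per annotation label instead of per document."""
    counts = Counter()
    for _query, doc_id in log:
        for label in annotations.get(doc_id, ()):
            counts[label] += 1
    return counts


def never_clicked(log, annotations):
    """Return the ids of annotated documents that never appear in the log."""
    clicked = {doc_id for _query, doc_id in log}
    return sorted(set(annotations) - clicked)


counts = clicks_per_annotation(click_log, annotations)
unused = never_clicked(click_log, annotations)
print(counts.most_common())
print(unused)
```

Aggregating by annotation rather than by document is what would let usage patterns be compared across parts of a collection; a real study would of course work on far larger logs and richer annotations.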

Which trends and challenges do you see for linked data and the semantic web?

Now that Linked Open Data is used more and more in real applications, we can - and should - analyse usage of this type of data as well. I have been co-chair of the USEWOD workshop series about usage of Linked Open Data. The workshop was held from 2011 to 2016 at the ESWC, ISWC and WWW conferences. With USEWOD, we aimed to stimulate research into usage of Linked Open Data by providing a platform for novel work in this area. Each year, we published a dataset of log records of Linked Open Data servers. The dataset grew steadily, until it included SPARQL queries and requests to DBpedia, Wikidata, LinkedGeoData, BioPortal, etc. Looking back, I think that the USEWOD dataset is really what pushed this research field forward, even more than the yearly workshop event itself. Usage data is notoriously hard to obtain unless you work at a large web company, and research into usage is hampered as a result. The USEWOD dataset has been (and still is) the basis for many papers published not only in the USEWOD workshop but also in more established venues.

Another recent line of work is the identification of concept drift, the phenomenon where the characteristics of a concept change over time, signifying a shift in meaning. This phenomenon poses problems for applications that use Linked Open Data. To give some examples: semantic annotation systems (also called entity detection and linking systems) might detect the wrong concept in a historic text if a concept has changed over time; existing semantic annotations become invalid when a concept changes; and correspondences between the concepts in two aligned ontologies become invalid too. We set out to tackle this problem by combining data from two sources: structured knowledge in the Linked Open Data Cloud to capture changes in the explicit relations between concepts, and distributional models of large corpora of natural language text to capture more subtle changes in the context in which concepts appear. Together, these two sources allow us to study drift of individual concepts as well as groups of related concepts.
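To make the distributional half of this idea concrete, here is a minimal Python sketch. It is an assumption-laden toy, not the method used in the research: it builds simple co-occurrence vectors for a term in two time slices and scores drift as one minus their cosine similarity. The function names, the window size and the example sentences are all hypothetical, and the structured Linked Open Data side is not covered.

```python
import math
from collections import Counter


def context_vector(sentences, target, window=2):
    """Count the words that occur within `window` tokens of `target`."""
    vec = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vec[tokens[j]] += 1
    return vec


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drift_score(old_sentences, new_sentences, target, window=2):
    """Higher values mean the contexts of `target` differ more between slices."""
    old_vec = context_vector(old_sentences, target, window)
    new_vec = context_vector(new_sentences, target, window)
    return 1.0 - cosine_similarity(old_vec, new_vec)


# Invented toy corpora standing in for an older and a newer time slice.
old_corpus = ["the mouse ate the cheese", "a mouse ran from the cat"]
new_corpus = ["click the mouse button", "the mouse cursor moved"]

print(drift_score(old_corpus, new_corpus, "mouse"))
```

Raw co-occurrence counts are dominated by frequent function words, so a realistic setup would use weighted or learned distributional models trained on large corpora, as the text describes.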

What are your expectations for SEMANTiCS 2017 in Amsterdam, especially for the data science track?

I expect that this year the conference will be more interdisciplinary than ever. I see more and more research in which Linked Open Data is not used in isolation but in combination with techniques from other fields, such as computational linguistics, machine learning, and data mining. I find this combination of explicit and implicit knowledge extremely exciting. This is also the reason why we have a Data Science track this year, to accommodate this research. At the same time, Linked Open Data is now used more and more as a building block for research in other domains. From my own experience in collaborations with the Humanities I know that LOD has become almost mainstream in the world of museums, libraries and archives. In Social Sciences, the use of LOD is growing through studies about Wikipedia, Wikidata, and Open Government Data. As a result, I expect papers that combine traditional research methods in these domains (surveys, interviews, observations, etc.) with semantic technologies such as linking and reasoning.
Finally, I think there is a growing awareness of the need for transparency, both in research findings and in industry applications. The explicit knowledge in the Linked Open Data cloud provides opportunities here. For example, it could contribute to more transparent output of recommender systems or retrieval systems. Still, the methodologies to reach a certain level of transparency are a work in progress. I hope that this year's conference will attract papers and stimulate discussion about how to reach transparency in applications based on Linked Open Data, and how transparency can lead to reproducibility of research results and user trust.