RDF creation in privacy-sensitive environments

Within a large program in Dutch nursing homes, on which we reported last year at Semantics 2021, sensitive health-related and organizational data needs to be converted to RDF. The program uses only open standards; for RDF creation we use RML. During the program, extra tooling became relevant to assist the semantic engineer who supports the health application manager in creating RDF data using an ontology. The RDF data will be queried with a limited set of known queries. The assistance by the semantic engineer is constrained by the fact that they should not have access to the production data.

Creating and reporting quality indicators (KPIs) within a health organization carries many risks of breaching privacy regulations. During the first pilot implementations of the program, tooling was used and created for robust RML mapping. This RML mapping could only be made with some shared manual actions. During these actions it cannot be guaranteed that the semantic engineer gains no insight into sensitive health data.

The process of creating the RML mapping comprises (i) data transformation, (ii) RML processing and (iii) SHACL validation; an additional tool has been created for this process. Proper test data turned out to be crucial: test data based on the legacy data, with which the RML mapping creates RDF data that yields the expected results of the known queries, e.g., the number of employees in a nursing home. The test data is an abstraction of the legacy data and may therefore be called reference test data (RTD). The quality of the RTD has a great impact on the quality of the RML mapping and consequently on the created RDF data. Creating this RTD turned out to be the main challenge.

This leads to the conclusion that the tool supports both control of the RDF creation and the design of the RML and RDF. The RML and RDF design turns out to be complex, while RDF creation is rather linear. The syntax and column semantics result from the application engineer interpreting the legacy exposure model.
The next step is for the semantic engineer to create the RTD that yields the known query results, taking OWL reasoning into account. This appears to be a creative process driven by changes and by feedback from SHACL validations and query results.

The tool is implemented as an executable Python script that can be used for many nursing homes that are subject to the KIK-V program. Other implementations are possible because many programming languages have libraries with similar functionality.

The solution has been tested in several nursing homes. The method will also be challenged by other IT vendors, which is in fact the intention, because nursing homes should not become dependent on an IT solution created by Zorginstituut Nederland.

The presentation will review the above and show the requirements and the process of co-creating the reference test data and the RML mapping. Feedback loops, in terms of the value of different types of SPARQL queries, will also be addressed. The talk will close with a demo of the tooling showing the described process.
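
To illustrate the feedback loop between reference test data and the known queries, a minimal sketch in Python follows. It is not the tool itself: the column names, the `http://example.org/` namespace and all function names are hypothetical, and plain tuples stand in for the RML processor and the SPARQL engine that a real setup would use. It only shows the idea of checking that synthetic RTD leads to the expected result of a known query, here the number of employees per nursing home.

```python
import csv
import io

# Hypothetical reference test data (RTD): synthetic rows that mimic the
# structure of a legacy export without containing any production data.
RTD_CSV = """employee_id,location,role
e1,home-a,nurse
e2,home-a,carer
e3,home-b,nurse
"""

EX = "http://example.org/"  # placeholder namespace, not from the program


def map_rows_to_triples(csv_text):
    """Stand-in for the RML step: turn each CSV row into simple triples."""
    triples = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = EX + "employee/" + row["employee_id"]
        triples.add((subject, "rdf:type", EX + "Employee"))
        triples.add((subject, EX + "worksAt", EX + "location/" + row["location"]))
    return triples


def count_employees(triples, location):
    """Stand-in for a known query: number of employees at one location."""
    target = EX + "location/" + location
    return sum(1 for _, p, o in triples if p == EX + "worksAt" and o == target)


def check_known_queries(triples, expected):
    """Return {location: (actual, expected)} for every result that differs."""
    mismatches = {}
    for location, want in expected.items():
        got = count_employees(triples, location)
        if got != want:
            mismatches[location] = (got, want)
    return mismatches


triples = map_rows_to_triples(RTD_CSV)
mismatches = check_known_queries(triples, {"home-a": 2, "home-b": 1})
print(mismatches or "all known queries return the expected results")
```

In this sketch, a non-empty `mismatches` dictionary would be the signal to adjust either the RTD or the mapping and run the check again, without anyone looking at production data.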

Speakers: