Opportunity
The CAIRE4Aus project aims to establish a world-first clinical data repository of electronic health record (EHR) data that links primary, secondary, and tertiary care and spans diverse data modalities. This will:
- Enable researchers to access diverse, high-quality clinical text data for advancing healthcare innovation.
- Provide a trusted foundation for large-scale data initiatives, supporting not only CAIRE4Aus but also future projects.
- Create a scalable model for privacy-preserving clinical data sharing, strengthening public trust and accelerating research.
In this feasibility study, we address a critical enabler for building this repository: robust de-identification and anonymisation strategies that allow clinical data, specifically clinical text, to be safely stored and shared.
Project Objectives
The project team will:
- Develop reliable de-identification methods and robust protocols to prepare clinical text for safe inclusion in shareable datasets.
- Benchmark existing de-identification tools on local health service datasets and enhance methods for detecting personally identifiable information (PII).
- Annotate a representative sample of clinical texts to support model development and validation, leveraging an existing clinical text corpus created with Austin Health.
- Explore anonymisation techniques, such as the “hide in plain sight” approach, to ensure PII is appropriately obfuscated.
- Establish foundational privacy protocols that will underpin the creation of a secure clinical data lake.
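
As an illustration of the “hide in plain sight” approach named in the objectives, the sketch below replaces detected PII spans with realistic surrogates, so that any identifier a detector misses is harder to tell apart from the substituted values. It is a minimal example under stated assumptions, not the project's method: the function name `hide_in_plain_sight`, the `SURROGATES` pools, and the `(start, end, label)` span format are all hypothetical, and the PII spans are assumed to come from an upstream detector.

```python
import random
import re

# Hypothetical surrogate pools (assumption); a real system would use much larger lists.
SURROGATES = {
    "NAME": ["Alex Morgan", "Jamie Lee", "Sam Carter", "Priya Nair"],
    "HOSPITAL": ["Riverside Hospital", "Northgate Clinic", "Lakeview Medical Centre"],
}


def hide_in_plain_sight(text, spans, seed=0):
    """Replace detected PII spans with realistic surrogates.

    spans: non-overlapping (start, end, label) character offsets, assumed to
    come from an upstream PII detector. Within one document, the same original
    string always maps to the same surrogate, and distinct originals map to
    distinct surrogates.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    mapping = {}               # (label, lowercased original) -> surrogate
    used = set()               # surrogates already assigned in this document
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("spans must not overlap")
        original = text[start:end]
        key = (label, original.lower())
        if key not in mapping:
            # Never reuse a surrogate or pick one equal to the original value.
            candidates = [
                s for s in SURROGATES[label]
                if s not in used and s.lower() != original.lower()
            ]
            if not candidates:
                raise ValueError(f"surrogate pool exhausted for label {label!r}")
            mapping[key] = rng.choice(candidates)
            used.add(mapping[key])
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(mapping[key])        # surrogate in place of the PII
        cursor = end
    pieces.append(text[cursor:])           # untouched text after the last span
    return "".join(pieces), mapping


if __name__ == "__main__":
    # Made-up note; spans are located by exact match purely for demonstration.
    note = "John Citizen was reviewed at Example Hospital. John Citizen reports no pain."
    spans = [(m.start(), m.end(), "NAME") for m in re.finditer("John Citizen", note)]
    spans += [(m.start(), m.end(), "HOSPITAL") for m in re.finditer("Example Hospital", note)]
    surrogate_note, _ = hide_in_plain_sight(note, spans, seed=1)
    print(surrogate_note)
```

The surrogate mapping is kept consistent within a document so that repeated mentions of the same person still read coherently. In practice the mapping itself is sensitive and would need to be discarded or protected.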


