The one who wanted to escape the GDPR thanks to anonymization

The GDPR regulates the processing of personal data, which relates to identified or identifiable persons. Therefore, anonymizing data may seem like a good way to get around this regulation. But is this really the case?


First, it is important to distinguish anonymization from another concept with which it is often confused: pseudonymization.

Pseudonymization consists of "replacing directly identifying data (name, surname, etc.) in a dataset with indirectly identifying data (alias, number in a classification, etc.)". When data have been pseudonymized, it is still possible to recover the identity of individuals through correlation and inference. Pseudonymized data are therefore considered personal data, since they concern potentially identifiable individuals.

Unlike pseudonymization, anonymization is irreversible. Anonymized data can no longer be associated with identified individuals.

To anonymize data and thus eliminate any possibility of re-identification, several techniques exist, recalled by the G29 (Article 29 Working Party) in its opinion published in 2014. We will focus on two of them:

  • Randomization: altering attributes while preserving their overall distribution in the dataset (e.g., swapping the birth dates of individuals);

  • Generalization: changing the scale or order of magnitude of attributes so that they are shared by many people (e.g., replacing the date of birth with an age range).
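As a purely illustrative sketch (the toy data and function names below are invented for this example, not taken from the G29 opinion), the two techniques might look like this in Python:

```python
import random

# Toy dataset: each record links a person to a birth year.
records = [
    {"id": 1, "birth_year": 1984},
    {"id": 2, "birth_year": 1991},
    {"id": 3, "birth_year": 1978},
    {"id": 4, "birth_year": 2002},
]

def randomize_birth_years(rows, seed=42):
    """Randomization: shuffle birth years across records.
    The overall distribution of values is preserved, but the
    link between a given person and their birth year is broken."""
    years = [r["birth_year"] for r in rows]
    random.Random(seed).shuffle(years)
    return [{**r, "birth_year": y} for r, y in zip(rows, years)]

def generalize_to_age_range(rows, current_year=2022, width=10):
    """Generalization: replace the exact birth year with a coarse
    age range shared by many people (e.g., '30-39')."""
    out = []
    for r in rows:
        age = current_year - r["birth_year"]
        low = (age // width) * width
        out.append({"id": r["id"], "age_range": f"{low}-{low + width - 1}"})
    return out

swapped = randomize_birth_years(records)
coarse = generalize_to_age_range(records)
print(sorted(r["birth_year"] for r in swapped))  # same multiset of years as the input
print(coarse[0]["age_range"])
```

Note that randomization keeps the dataset statistically useful (the distribution of birth years is unchanged) while generalization deliberately coarsens it; in practice both are combined and tuned to the dataset at hand.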

We can then question the relevance of anonymization in the context of medical research. 

First, accurate and precise information is essential to build reliable algorithms and thus establish so-called personalized medicine processes. Moreover, if the study is conclusive, it must remain possible to find each patient in order to support them as well as possible, for example by informing them of a genetic anomaly of which they were unaware.

Moreover, it is impossible to anonymize data when working on a small sample: altering the attributes would not be enough to prevent re-identification. Anonymization is therefore unsuitable for research on rare diseases, for example, which affect few people. In the end, anonymized data are only relevant for statistical analysis or, possibly, for generating proofs of concept.
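To make the small-sample problem concrete, one common measure of re-identification risk is k-anonymity: the size of the smallest group of records sharing the same combination of indirectly identifying attributes (k = 1 means someone is uniquely identifiable). The measure and the cohort below are illustrations added here, not part of the article:

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifiers.
    k = 1 means at least one record is uniquely identifiable."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(counts.values())

# Hypothetical rare-disease cohort: even after generalizing the
# birth date to an age range, the sample is so small that each
# (age_range, postcode_area) combination is unique.
cohort = [
    {"age_range": "30-39", "postcode_area": "75"},
    {"age_range": "30-39", "postcode_area": "13"},
    {"age_range": "40-49", "postcode_area": "75"},
]
print(k_anonymity(cohort, ["age_range", "postcode_area"]))  # 1: still re-identifiable
```

With only a handful of patients, no amount of attribute alteration yields large equivalence classes without destroying the data's usefulness, which is exactly why anonymization fails for rare-disease research.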

Attempting to achieve complete anonymization of health data in research is a circuitous route: a colossal upfront investment for a minimal probability of success and, above all, a medical and ethical relevance close to nil.


This is why we prefer to talk about de-identification, which takes a more relevant risk-management approach.

In short, we accept that zero risk does not exist in our era of digitalization pushed to its limits.

How do we do it? We start from an analysis of the processing and of the data used, apply the appropriate technical security measures (randomization, encryption, partitioning, etc.) and organizational measures, and then calculate the scientific probability of re-identification of patients according to the state of the art.

We establish a risk matrix that guides the project and allows it to unfold, putting the Processor in a real decision-making position thanks to a scientific method.
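As a hedged sketch of what such a matrix could look like (the scenarios, scores, and thresholds below are invented for the example and do not describe any actual methodology):

```python
# Hypothetical re-identification risk matrix: each scenario is
# scored on likelihood and severity (1-3), and the product of the
# two drives the decision presented to the decision-maker.
scenarios = [
    ("linkage with a public registry", 2, 3),
    ("inference from rare attribute values", 3, 3),
    ("brute-force attack on pseudonyms", 1, 2),
]

def decision(likelihood, severity):
    score = likelihood * severity
    if score >= 6:
        return "mitigate before processing"  # e.g., more generalization
    if score >= 3:
        return "accept with safeguards"      # e.g., encryption, partitioning
    return "accept"

for name, likelihood, severity in scenarios:
    print(f"{name}: {decision(likelihood, severity)}")
```

The point of such a matrix is not the exact numbers but that it makes the residual risk explicit, so the decision to proceed is an informed one rather than an implicit bet.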


When anonymization is carried out, the starting point is identifiable data: the process itself is therefore considered processing of personal data. The GDPR and its obligations therefore apply: the need to inform the patient and possibly obtain their consent, to put in place security measures, etc.

To comply with the GDPR and avoid any risk of sanctions, it is preferable to call upon a DPO (data protection officer). The DPO will carry out a data protection impact assessment (PIA) as described above, which they may, if necessary, submit for advice to the CNIL (the French supervisory authority, or any other national authority). This study, both regulatory and technical, makes it possible to assess the risks of the desired de-identification.

If anonymization is not the most appropriate solution for your project, your DPO will be able to tell you so and suggest alternatives. For example, if you want to conduct a proof of concept (POC), they may advise you to use synthetic data. Synthetic data are generated by creating "digital twins" of your patients and mixing their attributes with other data. This generation is itself data processing, so you need to inform the patients involved and possibly obtain their consent.

Synthetic data are not anonymized, since it remains possible to re-identify individuals. Finally, they can only be produced from textual data, so they do not apply to medical images, for example.

Using data anonymization can be tempting, but it is essential to ensure that this process is scientifically sound and GDPR compliant!

Published on:
11 May 2022