3. What are the challenges when searching for personal data?
The principal challenge in detecting personal data is completeness, both of the data sources that are processed and of the detection rules themselves.
The term "data source" refers to any location where data is stored, including:
Databases (SQL, NoSQL),
Data that software packages store outside the database (XML, files),
Emails (on the server and on users' machines),
Hidden data (Excel files, CSV exports, documents stored on the network or on user workstations).
Each of these potential data sources should be considered when performing the inventory work. The more detailed the list, the more exhaustive the detection will be.
It is important to note that the GDPR does not protect only sensitive data (e.g. political opinions, race, sexual orientation, religion), whose collection is in any case prohibited, with some exceptions. Any data is considered personal as soon as it can be linked to a person, either directly (because it contains a name, a photo, a fingerprint, a postal address, an e-mail address, a telephone number, a social security number, an internal number, an IP address, a computer connection identifier, a voice recording, etc.) or indirectly, if the link can be made by cross-referencing with other information.
In this context, detection rules must be able to locate any data that could lead to a person being identified, either directly or indirectly (re-identification by cross-referencing).
Such data usually has a particular format that can be detected by more or less complex IT techniques, for example:
Date of birth,
Face in a picture,
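As an illustration of format-based detection, here is a minimal sketch in Python that uses regular expressions to flag a few of the directly identifying formats mentioned above (e-mail addresses, IP addresses, dates). The pattern set and the function name `detect_personal_data` are assumptions made for this example and are deliberately simplified. In particular, a date pattern cannot tell a date of birth from any other date, and detecting a face in a picture requires image analysis that is outside the scope of this sketch.

```python
import re

# Hypothetical, simplified patterns for illustration only.
# Real detection rules need stricter validation and locale-specific formats.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    # Matches ISO dates (YYYY-MM-DD) and DD/MM/YYYY-style dates; it does not
    # check that the date is valid or that it is a date of birth.
    "date": re.compile(r"\b(?:\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4})\b"),
}


def detect_personal_data(text):
    """Return a sorted list of (kind, matched_text) pairs found in text."""
    findings = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((kind, match.group(0)))
    return sorted(findings)


if __name__ == "__main__":
    sample = "Born 1984-03-12, contact jane.doe@example.com from 192.168.0.10"
    for kind, value in detect_personal_data(sample):
        print(kind, value)
```

Each match is only a candidate: a rule of this kind tells you where to look, and the result still has to be reviewed in the context of the data source it came from.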
This personal data should then be protected to ensure that it is used only for its primary purpose, unless it is decoupled from the identifying data (that is, anonymized).
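Indirect identification by cross-referencing can be illustrated with a small sketch. It assumes that each record is a Python dictionary and that a few fields, none of which identifies a person on its own, are treated as "quasi-identifiers". The field names (`postal_code`, `birth_year`, `gender`) and the function `unique_combinations` are hypothetical and chosen only for this example. The function lists the records whose combination of values is unique in the data set, since these are the ones most exposed to re-identification when cross-referenced with another source.

```python
from collections import Counter


def unique_combinations(records, quasi_identifiers):
    """Return the records whose quasi-identifier combination appears only once."""

    def key(record):
        # Build the combination of values used for cross-referencing.
        return tuple(record[field] for field in quasi_identifiers)

    counts = Counter(key(record) for record in records)
    return [record for record in records if counts[key(record)] == 1]


# Hypothetical sample data: no names, yet one record stands out.
records = [
    {"postal_code": "75001", "birth_year": 1984, "gender": "F"},
    {"postal_code": "75001", "birth_year": 1984, "gender": "F"},
    {"postal_code": "69002", "birth_year": 1990, "gender": "M"},
]

at_risk = unique_combinations(records, ["postal_code", "birth_year", "gender"])

if __name__ == "__main__":
    print(at_risk)
```

In this sample, only the third record has a unique combination of values, so it is the one that could be matched to a single person if the same three fields appear alongside a name elsewhere.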
4. What are the main criteria to take into account when selecting a data detection solution?
When choosing a tool for detecting personal data, the first thing to take into account is the ultimate intention behind the detection itself.
For example, your goal may be to remove a person from your entire information system (the right to be forgotten), or simply to extract all or part of your databases for testing purposes. In these two cases you will not manage the discovery of personal data in the same way. In the first case, you will have to search (and above all find) ALL data, whereas in the second case you will only have to manage the subset of data extracted.
The cost of a data detection solution is also a factor to take into account. There is no point in implementing complex processes, such as link searches or the anonymization of low-quality data (e.g. typing errors, data entry errors, scans processed with character recognition), if you are starting from a well-known data source, or if regenerating documents from anonymized data is sufficient for your needs (for example, to create test data sets). Limiting the complexity of the detection makes processing faster and the solution easier and less costly to manage.
It would be risky to embark on a complex process of modeling your complete information system if you simply want to generate test data sets for specific applications or extract anonymized statistical data from certain components of the system.
The scale of the task could quickly turn into a financial drain and demand skills or decision-making authority beyond those of the project manager, introducing a high risk of failure into the project. For this reason, it is vital to limit the scope of the project to the actual need at hand.
Moreover, the GDPR does not require all of a company's information to be anonymized, only data that is used outside of its original purpose. It is therefore very rare for an entire information system to need anonymizing.