Currently I’m working on a big data / data warehouse project. What we want to achieve is to collect data from different data sources (databases, tracking, …) in a central place so that we can extract certain data and analyze it.
Basically the architecture looks like the following:
The data is exported as a snapshot from the source system and saved to a snapshot file to have historical data available e.g. for data scientists.
From there it is put into a database that reflects the current state of the data.
Finally there are different applications accessing the current state of data, exporting it, doing aggregations, displaying charts, …
Of course, such a system also holds user-specific data, so data privacy matters a lot.
We spent a long time discussing the use case where a user’s data has to be deleted from the system, and how to do that.
Our finding was that the hardest part is the snapshots. Because the snapshots are basically exports of the source database, a user’s data can be spread across many files that were collected over a year or longer.
For the current view of the data and the analytics applications it is quite easy to delete and/or overwrite the data, because normally you store e.g. a user’s email only once in such a system. In the snapshots, however, you might have multiple instances of it in different files. In a deletion scenario you would have to go through a lot of files, check each one for the specific user’s entries and delete or overwrite them.
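To make the cost of that scenario concrete, here is a minimal sketch of the "go through every file" approach. The file layout (one JSON object per line, a `user_id` field, files named `snapshot_N.jsonl`) and the function name `scrub_user` are assumptions made up for this illustration, not how our snapshots actually look.

```python
import json
import tempfile
from pathlib import Path


def scrub_user(snapshot_dir: Path, user_id: int) -> int:
    """Rewrite every snapshot file without the given user's rows.

    Returns the number of files that had to be rewritten.
    """
    rewritten = 0
    for path in sorted(snapshot_dir.glob("*.jsonl")):
        lines = path.read_text().splitlines()
        kept = [line for line in lines if json.loads(line)["user_id"] != user_id]
        if len(kept) != len(lines):
            path.write_text("".join(line + "\n" for line in kept))
            rewritten += 1
    return rewritten


# Three hypothetical snapshots that all contain user 42.
snapshot_dir = Path(tempfile.mkdtemp())
for number in (1, 2, 3):
    rows = [
        {"user_id": 42, "email": "a@example.com"},
        {"user_id": 7, "email": "b@example.com"},
    ]
    (snapshot_dir / f"snapshot_{number}.jsonl").write_text(
        "".join(json.dumps(row) + "\n" for row in rows)
    )

files_touched = scrub_user(snapshot_dir, 42)
```

Every snapshot that ever contained the user has to be read and rewritten, so the work grows with the number and size of the snapshots rather than with the amount of data the user actually has.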
So we came up with the idea of encrypting the snapshots: each user’s data in each snapshot is encrypted with that user’s individual key. The keys are stored in a central place. Each user has only one key, so there is e.g. a database table that maps user ids to keys.
During the creation of a snapshot the data is encrypted and saved. Before the import into the current state database it is decrypted, and the decrypted version is deleted right after the import.
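The encrypt-on-export / decrypt-on-import round trip could be sketched roughly like this. Everything here is an assumption for illustration: the in-memory `user_keys` dict stands in for the central key table, the record layout is made up, and the user id is left in clear text so the key can be looked up. The cipher is a stand-in built from the standard library (HMAC-SHA256 in counter mode, without integrity protection) only so the sketch runs without extra packages; a real implementation would use a vetted cipher such as AES-GCM from a maintained crypto library.

```python
import hashlib
import hmac
import json
import secrets

# Hypothetical in-memory key store: user id -> that user's single key.
user_keys: dict[int, bytes] = {}


def key_for(user_id: int) -> bytes:
    """Return the user's key, creating it on first use."""
    if user_id not in user_keys:
        user_keys[user_id] = secrets.token_bytes(32)
    return user_keys[user_id]


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """HMAC-SHA256 in counter mode; a stdlib stand-in for a real cipher."""
    blocks = []
    counter = 0
    while 32 * len(blocks) < length:
        message = nonce + counter.to_bytes(8, "big")
        blocks.append(hmac.new(key, message, hashlib.sha256).digest())
        counter += 1
    return b"".join(blocks)[:length]


def encrypt_record(record: dict) -> dict:
    """Encrypt every field except user_id with the user's key."""
    fields = {k: v for k, v in record.items() if k != "user_id"}
    payload = json.dumps(fields).encode()
    nonce = secrets.token_bytes(16)  # fresh nonce for every record
    stream = _keystream(key_for(record["user_id"]), nonce, len(payload))
    ciphertext = bytes(a ^ b for a, b in zip(payload, stream))
    return {
        "user_id": record["user_id"],
        "nonce": nonce.hex(),
        "data": ciphertext.hex(),
    }


def decrypt_record(stored: dict) -> dict:
    """Reverse encrypt_record; raises KeyError if the user's key is gone."""
    key = user_keys[stored["user_id"]]
    ciphertext = bytes.fromhex(stored["data"])
    stream = _keystream(key, bytes.fromhex(stored["nonce"]), len(ciphertext))
    payload = bytes(a ^ b for a, b in zip(ciphertext, stream))
    return {"user_id": stored["user_id"], **json.loads(payload)}


record = {"user_id": 42, "email": "a@example.com"}
stored = encrypt_record(record)      # this is what goes into the snapshot file
restored = decrypt_record(stored)    # this is what the import step works with
```

Only `stored` is ever written to the snapshot file; `restored` lives just long enough to be imported into the current state database.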
If user data has to be deleted, it is deleted from the current state database and the statistics applications. Additionally, the encryption / decryption key of that specific user is deleted, so the data will still be in the snapshots, but it is no longer (easily) possible to access it.
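The deletion step then boils down to removing one row from the key table. The following sketch assumes a table called `user_keys` in an in-memory SQLite database, and the helper names (`create_key`, `lookup_key`, `forget_user`, `importable`) are hypothetical. It only shows the key handling; the actual encryption is left out.

```python
import secrets
import sqlite3

# Hypothetical central key table: one row, and therefore one key, per user.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_keys (user_id INTEGER PRIMARY KEY, key BLOB NOT NULL)"
)


def create_key(user_id: int) -> bytes:
    """Generate and store a random 256-bit key for a new user."""
    key = secrets.token_bytes(32)
    conn.execute(
        "INSERT INTO user_keys (user_id, key) VALUES (?, ?)", (user_id, key)
    )
    return key


def lookup_key(user_id: int):
    """Return the user's key, or None if it has been deleted."""
    row = conn.execute(
        "SELECT key FROM user_keys WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None


def forget_user(user_id: int) -> None:
    """Delete the key; snapshot entries encrypted with it become unreadable."""
    conn.execute("DELETE FROM user_keys WHERE user_id = ?", (user_id,))


def importable(snapshot_rows: list) -> list:
    """Keep only the rows whose user still has a key and can be decrypted."""
    return [row for row in snapshot_rows if lookup_key(row["user_id"]) is not None]


create_key(7)
create_key(42)
snapshot_rows = [
    {"user_id": 7, "data": "<ciphertext>"},
    {"user_id": 42, "data": "<ciphertext>"},
]

forget_user(42)
still_readable = importable(snapshot_rows)
```

After `forget_user(42)` the snapshot files are untouched, but only the row of user 7 can still be imported. Note that this only holds if no copy of the deleted key survives elsewhere, for example in a backup of the key table.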
At the moment this is more of an idea and has not been implemented yet, but we think it might be a good approach. I will keep you updated on a proof of concept.
Feedback and questions are appreciated as always! 😉