In the project, we virtually integrate heterogeneous personal data sources to facilitate cross-domain studies, e.g., allowing researchers to access the data and perform analysis interactively. Because personal data is highly sensitive, I research and develop a module called 'privacy preservation' that automatically detects the privacy-concern level of users in order to provide sufficient privacy protection. The state-of-the-art approach uses a machine learning model to detect the privacy-concern level that exists in the data itself. Based on the detected level, we inject a sufficient amount of noise into analytic results to protect privacy. Our project therefore requires high-performance computing hardware to perform this task.
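As a minimal sketch of the noise-injection step described above, the following assumes a Laplace mechanism and a hypothetical mapping from detected concern level to a privacy budget (epsilon); the function name, the level-to-epsilon table, and the sensitivity parameter are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

def add_calibrated_noise(true_result, concern_level, sensitivity=1.0):
    """Inject Laplace noise into an analytic result, scaled by the
    detected privacy-concern level: a higher concern level maps to a
    smaller privacy budget (epsilon), hence more noise.

    The level-to-epsilon mapping below is a hypothetical example.
    """
    # Map concern level (1 = low, 3 = high) to a privacy budget.
    epsilon_by_level = {1: 1.0, 2: 0.5, 3: 0.1}
    epsilon = epsilon_by_level[concern_level]
    # Laplace mechanism: noise scale = sensitivity / epsilon.
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_result + noise

# Example: a high-concern user (level 3) gets a noisier query answer
# than a low-concern user (level 1) for the same true result.
noisy_answer = add_calibrated_noise(100.0, concern_level=3)
```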
We were previously approved for the small package on this project; however, due to the large amount of data we have to process, we have recently exhausted the small package's allocation. This new requirement arises because recent state-of-the-art deep learning models are very large and demand substantial computational resources to train (e.g., BERT, ELMo). Although such models are normally reused, in our case we must train our own model to handle our own dataset. Under the current small package, whenever we ran out of CPU/GPU hours we had to improvise ways to complete the work (e.g., asking our collaborators, or trying free resources from Google). These solutions are only temporary;
thus, we are applying for the medium package to meet the project's new requirements.