Recent advances in Machine Learning (ML) have significantly increased the value that can be derived from data. However, the use of this data, especially when it includes private and sensitive information, is subject to stringent regulations such as the EU’s GDPR and the US’s HIPAA. These regulations impose significant constraints on data usage, particularly in the secure handling and processing of Personally Identifiable Information (PII). To ensure compliance, organisations must collect data lawfully, use it transparently, store it securely, and process it only for purposes explicitly consented to by its owner(s). Given that inadvertent data leakage can violate these criteria and incur penalties of up to 4% of global annual turnover under the GDPR, it is crucial for organisations to implement robust security measures to safeguard all private data they wish to utilise. This need has led to the proliferation of a class of techniques known as Privacy Enhancing Technologies (PETs), which are designed to ensure the security, integrity, and confidentiality of data throughout its lifecycle.
This blog post provides an in-depth exploration of the various PETs applicable to ML, discussing their benefits, drawbacks, and available libraries for implementation. By examining a wide range of PETs, this blog post aims to encourage practitioners to reflect on the potential dangers of PII and explore strategies for effectively managing private data while ensuring its security and confidentiality.
PETs are tools and methodologies that ensure the security, integrity, and confidentiality of data in all its possible states throughout its lifecycle. In computing, data can exist in one of three states, which are defined as follows:
To provide the aforementioned guarantees in each of these states, several distinct techniques have been developed. Some of the most commonly used techniques in each category are presented below:
While techniques to protect data at rest and in transit are now widely adopted, techniques for safeguarding data in use have not yet matured.
In the context of ML, data is considered 'in use' because it is actively processed and analysed by algorithms to train models, make predictions, and derive insights. This active processing makes the data susceptible to various types of attacks, such as memory scraping attacks (e.g., CPU side-channel attacks) and malware injection attacks (e.g., Triton attacks). Moreover, data can also be leaked from the model it is used to train through model inversion attacks and statistical inference attacks. Since the focus of this blog post is on exploring PETs relevant to ML, the upcoming sections will examine each PET within the 'in use' category in greater detail to elucidate how they protect against such attacks.
Homomorphic Encryption (HE) is a cryptographic technique that allows computations to be performed directly on encrypted data without requiring access to its plaintext. There are several different types of HE, which differ in the extent and complexity of the operations they support. These include:
The main advantage of HE is that it provides a mathematically verifiable means of performing non-trivial computations on encrypted (sensitive) data without the need for decryption. This ensures the security and privacy of the data even while it is being processed. However, HE is currently limited by several drawbacks:
Despite these challenges, there has been considerable interest in using HE for Privacy-Preserving Machine Learning (PPML). For instance, several open-source libraries, such as TenSEAL, Concrete, and nGraph-HE, have been developed to facilitate the integration of HE into ML pipelines. Additionally, significant efforts have been made by organisations, including DARPA and Chain Reaction, to develop specialised hardware that enhances the efficiency of arithmetic operations on encrypted data. Nevertheless, its widespread implementation in industry ML pipelines is still limited.
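To make the core idea of HE concrete, the following sketch implements the Paillier scheme, a partially homomorphic cryptosystem in which multiplying two ciphertexts yields an encryption of the sum of their plaintexts. The parameters are deliberately tiny and insecure, purely for illustration; a real deployment would use a vetted library such as TenSEAL with production-grade key sizes:

```python
import random
from math import gcd

# Toy Paillier keypair with small hard-coded primes (illustration only;
# real systems use primes of 1024+ bits generated by a vetted library).
p, q = 293, 433
n = p * q
n_sq = n * n
g = n + 1                  # standard generator choice for Paillier
lam = (p - 1) * (q - 1)    # Euler's totient works as lambda when g = n + 1
mu = pow(lam, -1, n)       # modular inverse of lambda mod n

def encrypt(m: int) -> int:
    """Encrypt m under the public key (n, g)."""
    while True:
        r = random.randrange(1, n)
        if gcd(r, n) == 1:
            break
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c: int) -> int:
    """Decrypt c with the private key (lambda, mu)."""
    l = (pow(c, lam, n_sq) - 1) // n
    return (l * mu) % n

# Additive homomorphism: multiplying ciphertexts adds the plaintexts,
# so the sum is computed without ever decrypting c1 or c2.
c1, c2 = encrypt(17), encrypt(25)
total = decrypt((c1 * c2) % n_sq)   # recovers 17 + 25 = 42
```

Note that this toy scheme is only *partially* homomorphic (it supports addition but not multiplication of plaintexts); fully homomorphic schemes such as CKKS, as exposed by TenSEAL, support both at far greater computational cost.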
Secure Multi-Party Computation (SMPC) is a cryptographic protocol that allows multiple parties to jointly compute a function over their inputs while ensuring that these inputs remain private. To achieve this, SMPC utilises a number of core cryptographic primitives, such as:
Since SMPC often builds on HE as a cryptographic primitive, it inherits many of the same advantages. Additionally, SMPC provides the following benefits:
However, SMPC also shares many of HE's drawbacks and faces additional challenges of its own:
Similar to HE, various open-source libraries have been developed to enable the deployment of SMPC in PPML. Notable examples include MP-SPDZ, CrypTen, and TF-Encrypted, with the latter two supporting the widely used ML frameworks PyTorch and TensorFlow respectively. Despite these efforts, SMPC has not yet been adopted at scale for ML, mainly due to the performance constraints associated with it. However, as these libraries become more efficient and encryption accelerator hardware continues to advance, SMPC is expected to see a significant increase in adoption.
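To illustrate how SMPC keeps inputs private, here is a minimal sketch of additive secret sharing, one of the building blocks SMPC protocols commonly use. The three-hospital scenario and all values are hypothetical: each party ends up holding one share of every input, partial sums are computed share-wise, and only the aggregate is ever reconstructed:

```python
import random

PRIME = 2**61 - 1  # field modulus; individual shares look uniformly random

def share(secret: int, n_parties: int = 3) -> list[int]:
    """Split a secret into additive shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares: list[int]) -> int:
    """Recombine shares to recover the shared value."""
    return sum(shares) % PRIME

# Hypothetical scenario: three hospitals jointly compute their total
# patient count. Each hospital shares its private count; each party
# sums the one share it holds from every hospital, and only the
# aggregate is reconstructed -- no individual input is revealed.
inputs = [120, 87, 310]
all_shares = [share(x) for x in inputs]                  # one row per hospital
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]  # per-party sums
total = reconstruct(partial_sums)                        # 120 + 87 + 310 = 517
```

Any single share (or partial sum held by one party) is statistically independent of the underlying input, which is why no party learns anything beyond the final output.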
Trusted Execution Environments (TEEs) are secure areas within a processor dedicated to ensuring the integrity and confidentiality of data and code during execution. By isolating sensitive computations from the main operating system and other applications, they effectively protect sensitive data and operations from unauthorised access and tampering. TEEs can be either software-based (e.g., Microsoft VSM) or hardware-based (e.g., Intel SGX, ARM TrustZone), with the latter providing better performance and security. Overall, TEEs offer the following benefits:
They also present some downsides, including:
Given their relevance to cloud computing, there has been significant interest in the use of TEEs for PPML. To facilitate the utilisation of TEEs, a number of libraries have been created, with the Gramine library standing out for its support of both the PyTorch and TensorFlow libraries. Moreover, hardware manufacturers are actively working on addressing the computational resource and scalability constraints that have limited the use of TEEs in PPML, as exemplified by NVIDIA’s push for confidential computing support in its latest Hopper architecture. As a result of these efforts, TEEs have seen considerable uptake in the industry, with their adoption expected to continue growing.
Federated Learning (FL) is a collaborative ML technique where multiple devices or servers train a model together while keeping the data decentralised and secure on their respective devices. This is done by having each device compute model updates locally on its data, then sending these updates (not the raw data) to be aggregated into a global model, which is then shared back with all devices in an iterative process until the model converges. While the core principle remains the same, the different variations of FL primarily differ in how they handle data distribution, communication patterns, privacy-preserving techniques, model aggregation methods, and the scale of deployment, each tailored to address specific challenges and use cases. This approach to ML presents a unique set of advantages, which are as follows:
Alongside its advantages, FL also comes with its share of challenges and disadvantages:
FL has garnered extensive study in the literature due to its broad applicability across various valuable use cases. This has led to the development of several open-source libraries designed to integrate FL into existing systems, such as TensorFlow Federated (TFF), PySyft, and Flower. Although FL has demonstrated promising results in controlled environments, its inefficiencies make it challenging to deploy effectively in real-world scenarios. As such, it has seen very limited deployment in industry.
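The iterative train-locally-then-aggregate loop described above can be sketched with federated averaging (FedAvg) on a toy one-parameter linear model. The client datasets below are hypothetical (all drawn from y = 3x), and no raw data point ever leaves its owning client; only the locally trained weights are sent for aggregation:

```python
def local_sgd(w: float, data, lr: float = 0.05, epochs: int = 20) -> float:
    """One client's local training: fit y = w * x by gradient descent."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x   # dL/dw for squared error
            w -= lr * grad
    return w

# Hypothetical decentralised datasets held by three separate clients.
clients = [[(1.0, 3.0), (2.0, 6.0)],
           [(0.5, 1.5), (1.5, 4.5)],
           [(2.5, 7.5)]]

w_global = 0.0
for _ in range(10):  # federated rounds
    # Each client trains on its own data, starting from the global model...
    local_weights = [local_sgd(w_global, d) for d in clients]
    # ...and the server averages the returned weights (FedAvg aggregation).
    w_global = sum(local_weights) / len(local_weights)
# w_global converges to ~3.0, the true slope, without pooling any raw data
```

Production FedAvg additionally weights each client's contribution by its dataset size and typically layers secure aggregation or DP noise on top, since the raw updates themselves can still leak information about the local data.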
Differential Privacy (DP) is a mathematical framework that provides a formal privacy guarantee for algorithms operating on datasets. It ensures that the inclusion or exclusion of a single data point does not significantly affect the output of the algorithm. There are several different ways in which this privacy guarantee can be established, including:
In the context of ML, DP is applied to protect individual data points during model training and inference, ensuring that the model's predictions or parameters do not reveal sensitive information about any single data point in the training set. Techniques such as differentially private stochastic gradient descent (DP-SGD) are commonly used to inject Laplacian or Gaussian noise into the training process, effectively preventing the model from overly memorising the data. Additionally, DP can be used to generate synthetic data that preserves the statistical properties of the original dataset without compromising individual privacy, enabling safe data sharing and analysis. The application of these techniques provides several unique advantages:
However, it also presents a few drawbacks:
The relatively minor nature of these drawbacks has led to the development of a wide array of libraries to facilitate the implementation of DP guarantees across various applications. Notable examples include PySyft, PyTorch Opacus, TensorFlow Privacy, and Diffprivlib. Among all the PETs discussed in this section, DP stands out as the most mature and widely utilised. This is particularly evident in its application to ensure privacy when training cutting-edge generative AI models.
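As a concrete illustration of the noise injection underpinning DP, the following sketch applies the classic Laplace mechanism to a counting query. A count has sensitivity 1 (adding or removing one record changes it by at most 1), so Laplace noise with scale 1/epsilon yields an epsilon-DP release. The dataset and the `dp_count` helper are hypothetical, not part of any library mentioned above:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) as an exponential with a random sign."""
    return random.choice([-1, 1]) * scale * -math.log(1 - random.random())

def dp_count(records, predicate, epsilon: float) -> float:
    """Epsilon-DP counting query: sensitivity of a count is 1,
    so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical patient ages; the noisy count masks whether any one
# individual is present in the dataset.
ages = [34, 71, 45, 67, 29, 80, 52]
noisy = dp_count(ages, lambda a: a >= 65, epsilon=1.0)  # true count is 3
```

Smaller epsilon values add more noise (stronger privacy, lower utility); DP-SGD applies the same principle per training step, clipping each gradient to bound its sensitivity before adding noise.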
PETs are rarely used in isolation and are instead seen as primitives that can be combined to create comprehensive privacy-preserving solutions. By integrating multiple PETs, organisations can leverage the strengths of each technique while mitigating their individual weaknesses. For instance, combining HE with FL can enhance data security during model training and update aggregation. Similarly, incorporating DP into models trained using TEEs can protect against privacy leaks from the trained models through model inversion and statistical inference attacks. By tailoring PET combinations to specific use cases, organisations can effectively safeguard sensitive information while still deriving valuable insights from their data through ML.
In conclusion, incorporating PETs into ML workflows is essential for enabling the use of sensitive data that would otherwise remain inaccessible due to stringent privacy regulations, such as the EU’s GDPR and the US’s HIPAA. A variety of PETs have been developed to address this need, each offering distinct advantages and limitations. Of these PETs, DP has emerged as the most viable and widely adopted solution in the industry, mainly due to its ease of implementation and versatility. Although TEEs have attracted comparable interest due to their low performance overhead, their limited computational resources and scalability make them impractical for large-scale ML systems. Similarly, other PETs like HE, SMPC, and FL have shown promise, but their adoption has been hindered by high performance overhead and difficulties in fully mitigating privacy leaks from trained models. While combining PETs can mitigate some of their individual weaknesses, many such combinations remain impractical for real-world deployment. Nonetheless, as data privacy regulations evolve, the development and adoption of these PETs will be crucial for enabling secure, PPML applications, paving the way for broader acceptance and implementation across various industries.