Jul 12, 2024

The Different Privacy Enhancing Technologies for Privacy-Preserving Machine Learning


Recent advances in Machine Learning (ML) have significantly increased the value that can be derived from data. However, the use of this data, especially when it includes private and sensitive information, is subject to stringent regulations such as the EU’s GDPR and the US’s HIPAA. These regulations impose significant constraints on data usage, particularly in the secure handling and processing of Personally Identifiable Information (PII). To ensure compliance, organisations must collect data lawfully, use it transparently, store it securely, and process it only for purposes explicitly consented to by its owner(s). Given that inadvertent data leakage can violate these criteria and incur penalties of up to 4% of gross annual revenue under the GDPR, it is crucial for organisations to implement robust security measures to safeguard all private data they wish to utilise. This need has led to the proliferation of a class of techniques known as Privacy Enhancing Technologies (PETs), which are designed to ensure the security, integrity, and confidentiality of data throughout its lifecycle.

This blog post provides an in-depth exploration of the various PETs applicable to ML, discussing their benefits, drawbacks, and available libraries for implementation. By examining a wide range of PETs, this blog post aims to encourage practitioners to reflect on the potential dangers of PII and explore strategies for effectively managing private data while ensuring its security and confidentiality.

What are PETs?

PETs are tools and methodologies that ensure the security, integrity, and confidentiality of data in all its possible states throughout its lifecycle. In computing, data can exist in one of three states, which are defined as follows:

  • At rest: Data that is stored on a physical or digital medium and is not actively being used or transferred.

  • In transit: Data that is actively being transferred from one location to another over a network.

  • In use: Data that is actively being processed or utilised by a computing system.

To provide the aforementioned guarantees in each of these states, several distinct techniques have been developed. Full-disk encryption, for example, is commonly used to protect data at rest, while transport-layer encryption (e.g., TLS) protects data in transit; techniques for protecting data in use are the subject of the rest of this post.

While techniques to protect data at rest and in transit are now widely adopted, techniques for safeguarding data in use have not yet matured.

Which PETs are Used in ML?

In the context of ML, data is considered 'in use' because it is actively processed and analysed by algorithms to train models, make predictions, and derive insights. This active processing makes the data susceptible to various types of attacks, such as memory scraping attacks (e.g., CPU side-channel attacks) and malware injection attacks (e.g., Triton attacks). Moreover, data can also be leaked from the model it is used to train through model inversion attacks and statistical inference attacks. Since the focus of this blog post is on exploring PETs relevant to ML, the upcoming sections will examine each PET within the 'in use' category in greater detail to elucidate how they protect against such attacks.

Homomorphic Encryption

Homomorphic Encryption (HE) is a cryptographic technique that allows computations to be performed directly on encrypted data without requiring access to its plaintext. There are several different types of HE, which differ in the extent and complexity of the operations they support. These include:

  • Partially Homomorphic Encryption (PHE). Supports an unlimited number of operations of a single type (either addition or multiplication) on ciphertexts.

  • Somewhat Homomorphic Encryption (SHE). Supports both additions and multiplications, but only up to a limited number of operations.

  • Fully Homomorphic Encryption (FHE). Supports arbitrary computations on ciphertexts, at the cost of substantially greater computational overhead.

The main advantage of HE is that it provides a mathematically verifiable means of performing non-trivial computations on encrypted (sensitive) data without the need for decryption. This ensures the security and privacy of the data even while it is being processed. However, HE is currently limited by several drawbacks:

  • Computational Inefficiency. Operations on encrypted data are significantly more resource-intensive compared to plaintext data, resulting in considerable computational overhead.

  • Increased Memory Requirements. Encrypted data requires significantly more memory during computation than plaintext data, leading to higher memory usage.

  • Limited Functionality. HE natively supports only additions and multiplications (i.e., polynomial functions), so common non-linear functions used in ML, such as ReLU and sigmoid, must be approximated by polynomials (see the sketch after this list), which can reduce the accuracy of these computations.

  • Privacy Leaks from Trained Models. Despite encryption, trained models can inadvertently reveal sensitive information through model inversion attacks or statistical inference from model parameters and outputs.
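
To make the limited-functionality point concrete, below is a minimal sketch (plain NumPy, with arbitrarily chosen inputs) of the kind of low-degree polynomial approximation that HE-based ML pipelines typically substitute for a non-linear activation such as the sigmoid. The approximation uses only additions and multiplications, so it can be evaluated homomorphically, but its accuracy degrades away from zero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_poly(x):
    # Degree-3 Taylor expansion of sigmoid around 0: uses only + and *,
    # so it can be evaluated on HE ciphertexts, unlike the exact sigmoid.
    return 0.5 + x / 4.0 - x**3 / 48.0

x = np.linspace(-3, 3, 7)
print(np.round(sigmoid(x), 3))       # exact activation values
print(np.round(sigmoid_poly(x), 3))  # close near 0, drifts for larger |x|
```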

Despite these challenges, there has been considerable interest in using HE for Privacy-Preserving Machine Learning (PPML). For instance, several open-source libraries, such as TenSEAL, Concrete, and nGraph-HE, have been developed to facilitate the integration of HE into ML pipelines. Additionally, significant efforts have been made by organisations, including DARPA and Chain Reaction, to develop specialised hardware that enhances the efficiency of arithmetic operations on encrypted data. Nevertheless, its widespread implementation in industry ML pipelines is still limited.
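
As a rough illustration of how such libraries are used, the sketch below evaluates a toy linear model on encrypted features with TenSEAL's CKKS scheme. The weights, bias, and encryption parameters are arbitrary example values; a production setup would need careful parameter selection and key management.

```python
import tenseal as ts

# Set up a CKKS context (approximate arithmetic over real numbers).
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40
context.generate_galois_keys()  # needed for the rotations used by dot products

weights, bias = [0.5, -1.2, 0.8], 0.1  # toy plaintext model
features = [1.0, 2.0, 3.0]             # sensitive input

enc_features = ts.ckks_vector(context, features)  # encrypt the input
enc_score = enc_features.dot(weights) + bias      # computed on ciphertext

print(enc_score.decrypt())  # ≈ [0.6], recoverable only by the key holder
```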

Secure Multi-Party Computation

Secure Multi-Party Computation (SMPC) is a cryptographic protocol that allows multiple parties to jointly compute a function over their inputs while ensuring that these inputs remain private. To achieve this, SMPC utilises a number of core cryptographic primitives, such as:

  • Secret Sharing. Splits data into shares distributed among multiple parties, enabling secure aggregation of model updates without revealing individual contributions (a minimal sketch follows this list).

  • Homomorphic Encryption. As seen previously, HE allows computations on encrypted data, enabling ML models to perform inference and training on sensitive data without decryption.

  • Garbled Circuits. Provides a secure method to evaluate Boolean circuits representing ML models, ensuring that neither the inputs nor the function is exposed during computation.

  • Zero Knowledge Proofs. Enable one party to prove knowledge of a secret or correctness of a computation without revealing the secret itself, enhancing privacy and security in verifying computations and transactions.
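
As a concrete, heavily simplified illustration of the secret-sharing primitive, the sketch below additively shares two hypothetical inputs among three parties and aggregates them without any party seeing the other's value. Real SMPC frameworks layer authenticated shares, multiplication protocols, and malicious-security machinery on top of this basic idea.

```python
import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret, n_parties):
    """Split `secret` into n additive shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Two hypothetical hospitals each share a patient count among three parties.
shares_a = share(1200, 3)
shares_b = share(850, 3)

# Each party adds the shares it holds; only the combined result is revealed.
partial_sums = [(a + b) % PRIME for a, b in zip(shares_a, shares_b)]
print(reconstruct(partial_sums))  # 2050, without exposing either input
```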

Since SMPC often utilises the HE cryptographic primitive, it offers the same advantages. Additionally, SMPC provides the following benefits:

  • Trust Minimisation. It minimises the need for trust among participants, thereby facilitating collaboration among entities that might otherwise be constrained by privacy concerns.

  • Protection Against Various Threat Models. When designed appropriately, SMPC protocols can ensure data confidentiality against adversaries under the semi-honest, covert, and malicious threat models.

However, SMPC also shares many of the same drawbacks as HE cryptographic primitives and faces additional, unique challenges:

  • Communication Overhead. SMPC protocols often necessitate extensive communication between parties, which can significantly slow down the computation process, especially when dealing with large datasets or complex models.

  • Data Poisoning. Detecting and mitigating data poisoning attacks is more challenging in SMPC because inspecting intermediate values is not possible without compromising privacy, making it harder to ensure the integrity of the computations.

Similar to HE, various open-source libraries have been developed to enable the deployment of SMPC in PPML. Notable examples include Dalskov, CrypTen, and TF-Encrypted, each offering support for widely used ML frameworks like PyTorch and TensorFlow. Despite these efforts, SMPC has not yet been adopted at scale for ML, mainly due to the performance constraints associated with it.  However, as these libraries become more efficient and encryption accelerator hardware continues to advance, SMPC is expected to see a significant increase in adoption.
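
For a flavour of what these libraries look like in practice, here is a minimal CrypTen sketch that secret-shares two toy tensors and computes a dot product on the shares. It runs in CrypTen's single-process mode purely for illustration; a real deployment would launch the computation across several distinct parties.

```python
import torch
import crypten

crypten.init()  # set up the secret-sharing backend (single process here)

x = torch.tensor([0.5, 1.0, 1.5])   # toy private features
w = torch.tensor([0.2, -0.4, 0.6])  # toy private model weights

x_enc = crypten.cryptensor(x)  # secret-share the inputs
w_enc = crypten.cryptensor(w)

y_enc = (x_enc * w_enc).sum() + 0.1  # arithmetic happens on shares
print(y_enc.get_plain_text())        # ≈ 0.7, revealed only on reconstruction
```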

Trusted Execution Environments

Trusted Execution Environments (TEEs) are secure areas within a processor dedicated to ensuring the integrity and confidentiality of data and code during execution. By isolating sensitive computations from the main operating system and other applications, they effectively protect sensitive data and operations from unauthorised access and tampering. TEEs can be either software-based (e.g., Microsoft VSM) or hardware-based (e.g., Intel SGX, ARM TrustZone), with the latter providing better performance and security. Overall, TEEs offer the following benefits:

  • Strong Security Guarantees. TEEs ensure that sensitive computations are protected from unauthorised access and tampering by using hardware isolation and secure cryptographic methods.

  • Minimal Performance Overhead. TEEs generally offer near-native execution speeds because they run directly on the hardware, resulting in minimal performance overhead compared to non-secure environments.

  • Remote Attestability. TEEs enable third parties to verify the integrity and authenticity of the executed code and data, which instils user trust in the security of the computing environment (a simplified sketch of this idea follows this list).
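
The remote-attestation idea can be illustrated with a deliberately simplified sketch: the data owner only provisions sensitive data once the code measurement reported by the enclave matches a value it trusts. The measurement and helper names below are hypothetical, and real TEEs rely on hardware-signed quotes verified against the vendor's attestation infrastructure rather than a plain hash comparison.

```python
import hashlib

# Measurement the data owner trusts (e.g., published by the model developer).
TRUSTED_MEASUREMENT = hashlib.sha256(b"training_code_v1").hexdigest()  # hypothetical

def verify_attestation(reported_measurement: str) -> bool:
    """Accept the enclave only if its reported code measurement is trusted."""
    return reported_measurement == TRUSTED_MEASUREMENT

def provision_data(reported_measurement, sensitive_records):
    if not verify_attestation(reported_measurement):
        raise RuntimeError("Enclave code does not match the trusted measurement")
    # In a real deployment, data would now flow over a secure channel whose key
    # was established as part of the attestation handshake.
    return f"sent {len(sensitive_records)} records to the enclave"

print(provision_data(TRUSTED_MEASUREMENT, ["record_1", "record_2"]))
```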

They also present some downsides, including:

  • Limited Scalability. TEEs are primarily designed for single-machine environments and do not scale well for distributed systems or large-scale cloud deployments. This makes them less suitable for applications requiring extensive parallel processing across multiple nodes.

  • Limited Computational Resources. TEEs typically have constrained memory and other computational resources compared to the main operating system. This limitation can restrict the complexity and size of applications that can be securely executed within the TEE.

  • Privacy Leaks from Trained Models. While TEEs protect data during training, models can leak sensitive information post-training through model inversion attacks or statistical inference from model parameters and outputs.

Given their significance to cloud computing, there has been significant interest in the use of TEEs for PPML. To facilitate the utilisation of TEEs, a number of libraries have been created, with the Gramine library standing out for its support of both the PyTorch and TensorFlow libraries. Moreover, hardware manufacturers are actively working on addressing the computational resource and scalability constraints that have limited the use of TEEs in PPML, as exemplified by NVIDIA’s push for confidential computing support in its latest Hopper architecture.  As a result of these efforts, TEEs have seen considerable uptake in the industry, with their adoption expected to continue growing.

Federated Learning

Federated Learning (FL) is a collaborative ML technique where multiple devices or servers train a model together while keeping the data decentralised and secure on their respective devices. This is done by having each device compute model updates locally on its data, then sending these updates (not the raw data) to be aggregated into a global model, which is then shared back with all devices in an iterative process until the model converges. While the core principle remains the same, the different variations of FL primarily differ in how they handle data distribution, communication patterns, privacy-preserving techniques, model aggregation methods, and the scale of deployment, each tailored to address specific challenges and use cases. This approach to ML presents a unique set of advantages, which are as follows:

  • Data Privacy and Security. Since raw data never leaves the local devices, FL significantly reduces the risk of data breaches and ensures compliance with privacy regulations such as GDPR.

  • Reduced Data Transfer Costs. Only model updates, which are typically much smaller than the actual datasets, are transmitted, leading to lower communication overhead and faster model updates.

  • Access to Diverse Data Sources. FL enables the utilisation of data from various sources that would otherwise be inaccessible due to privacy concerns, leading to more robust and generalised models.

Alongside its advantages, FL also comes with its share of challenges and disadvantages:

  • Complexity in Implementation. Implementing federated learning systems is technically challenging, requiring sophisticated algorithms for synchronisation, aggregation, and privacy preservation.

  • Non-IID Data Distribution. In practice, the data across different devices may not be independently and identically distributed (non-IID), which can lead to slower convergence rates, degraded model accuracy, biased predictions, reduced client participation, and higher communication and computational overheads in FL systems.

  • Heterogeneous Device Capabilities. Devices participating in FL may have different computational power, storage, and energy constraints, which can complicate the training process and affect overall performance.

  • Privacy Leaks from Trained Models. Despite protecting data during training, FL models can still leak sensitive information post-training through model inversion or statistical inference attacks.

  • Privacy Leaks from Local Updates. Local model updates sent to the global model can inadvertently reveal sensitive information about the local data if not adequately protected (e.g., through encryption).

FL has garnered extensive study in the literature due to its broad applicability across various valuable use cases. This has led to the development of several open-source libraries designed to integrate FL into existing systems, such as TensorFlow Federated (TFF), PySyft, and Flower. Despite demonstrating promising results in controlled environments, FL's inefficiencies make it challenging to deploy effectively in real-world scenarios. As such, it has seen very limited deployment in industry.
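
The aggregation step at the heart of this process can be sketched in a few lines. The snippet below implements a FedAvg-style weighted average over toy client parameters (the client values and dataset sizes are made up for illustration); libraries such as Flower or TFF wrap this logic in communication, scheduling, and fault-tolerance machinery.

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """FedAvg aggregation: average client parameters, weighted by local dataset size."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (size / total) for w, size in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

# Three hypothetical clients, each holding a tiny two-layer "model".
clients = [
    [np.array([0.1, 0.2]), np.array([0.5])],
    [np.array([0.3, 0.1]), np.array([0.4])],
    [np.array([0.2, 0.2]), np.array([0.6])],
]
sizes = [100, 50, 150]  # local training examples per client

global_model = fed_avg(clients, sizes)
print(global_model)  # aggregated parameters broadcast back to every client
```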

Differential Privacy

Differential Privacy (DP) is a mathematical framework that provides a formal privacy guarantee for algorithms operating on datasets. It ensures that the inclusion or exclusion of a single data point does not significantly affect the output of the algorithm. There are several different ways in which this privacy guarantee can be established, including:

  • Laplacian Mechanism. Adds noise drawn from the Laplace distribution to the output of a function, calibrated based on the function's sensitivity and the desired privacy level, ensuring pure $\epsilon$-DP (see the sketch after this list).

  • Gaussian Mechanism. Introduces noise from the Gaussian distribution to the function's output, suitable for achieving ($\epsilon$, $\delta$)-DP, where a small probability of exceeding the privacy budget is acceptable.

  • Exponential Mechanism. Selects an output from a set of possible outcomes with a probability proportional to an exponential function of a utility score, balancing utility and privacy for non-numeric data.
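
Returning to the Laplacian mechanism from the list above, the snippet below is a minimal sketch of applying it to a counting query, whose sensitivity is 1 because adding or removing one individual changes the count by at most one. The count and epsilon values are arbitrary examples.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release `true_value` with Laplace noise of scale sensitivity / epsilon."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_value + noise

# Counting query: sensitivity is 1. Smaller epsilon => more noise => stronger privacy.
true_count = 412
for eps in [0.1, 1.0, 10.0]:
    print(eps, laplace_mechanism(true_count, sensitivity=1, epsilon=eps))
```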

In the context of ML, DP is applied to protect individual data points during model training and inference, ensuring that the model's predictions or parameters do not reveal sensitive information about any single data point in the training set. Techniques such as differentially private stochastic gradient descent (DP-SGD) are commonly used to inject Laplacian or Gaussian noise into the training process, effectively preventing the model from overly memorising the data. Additionally, DP can be used to generate synthetic data that preserves the statistical properties of the original dataset without compromising individual privacy, enabling safe data sharing and analysis. The application of these techniques provides several unique advantages:

  • Ease of Implementation: DP is significantly more straightforward to implement compared to the techniques discussed previously.

  • Provable Privacy Guarantees: DP provides mathematically rigorous privacy guarantees, making it easier to quantify the level of privacy protection.

  • Versatility: DP mechanisms can be tailored to a wide variety of data types and algorithms, making them broadly applicable across different fields and use cases. It also provides some level of protection against privacy leakage from Model Inversion Attacks.

However, it also presents a few drawbacks:

  • Reduced Data Utility: The noise added to ensure privacy can reduce the accuracy and utility of the data or the performance of the model.

  • Difficult Parameter Tuning: Selecting appropriate privacy parameters (like $\epsilon$ and $\delta$) can be complex and requires a deep understanding of the trade-offs between privacy and utility.

Because these drawbacks are comparatively mild, a wide array of libraries has been developed to facilitate the implementation of DP guarantees across various applications. Notable examples include PySyft, PyTorch Opacus, TensorFlow Privacy, and Diffprivlib. Among all the PETs discussed in this post, DP stands out as the most mature and widely utilised. This is particularly evident in its application to ensure privacy when training cutting-edge generative AI models.
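
To give a sense of how lightweight this can be in practice, below is a minimal DP-SGD sketch using PyTorch Opacus on a toy model and random data. The network, noise multiplier, clipping norm, and delta are arbitrary illustrative choices rather than recommended settings.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy dataset and model; in practice these would be your real data and network.
X, y = torch.randn(256, 10), torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=32)
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# Wrap model, optimizer, and loader so gradients are clipped and noised (DP-SGD).
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,  # scale of Gaussian noise added to clipped gradients
    max_grad_norm=1.0,     # per-sample gradient clipping threshold
)

for xb, yb in loader:  # one differentially private training epoch
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()
    optimizer.step()

print(privacy_engine.get_epsilon(delta=1e-5))  # privacy budget spent so far
```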

Can PETs be Combined?

PETs are rarely used in isolation and are instead seen as primitives that can be combined to create comprehensive privacy-preserving solutions. By integrating multiple PETs, organisations can leverage the strengths of each technique while mitigating their individual weaknesses. For instance, combining HE with FL can enhance data security during model training and update aggregation. Similarly, incorporating DP into models trained using TEEs can protect against privacy leaks from the trained models through model inversion and statistical inference attacks. By tailoring PET combinations to specific use cases, organisations can effectively safeguard sensitive information while still deriving valuable insights from their data through ML.
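
As one concrete sketch of such a combination, the snippet below adds a DP-style step to federated aggregation: each client update is clipped and Gaussian noise is added before averaging (a simplified DP-FedAvg-style aggregator). The updates, clipping norm, and noise multiplier are illustrative values, and a real system would also track the cumulative privacy budget.

```python
import numpy as np

def dp_fed_avg(client_updates, clip_norm=1.0, noise_multiplier=1.0):
    """Average client updates after clipping each one and adding Gaussian noise."""
    clipped = []
    for update in client_updates:
        norm = np.linalg.norm(update)
        clipped.append(update * min(1.0, clip_norm / max(norm, 1e-12)))  # bound influence
    aggregate = np.mean(clipped, axis=0)
    noise_std = noise_multiplier * clip_norm / len(client_updates)
    return aggregate + np.random.normal(0.0, noise_std, size=aggregate.shape)

# Hypothetical per-client updates (e.g., gradients) for a two-parameter model.
updates = [np.array([0.2, -0.1]), np.array([0.4, 0.3]), np.array([-0.2, 0.1])]
print(dp_fed_avg(updates))  # noisy global update broadcast back to the clients
```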

Conclusion

In conclusion, incorporating PETs into ML workflows is essential for enabling the use of sensitive data that would otherwise remain inaccessible due to stringent privacy regulations, such as the EU’s GDPR and the US’s HIPAA. A variety of PETs have been developed to address this need, each offering distinct advantages and limitations. Of these PETs, DP has emerged as the most viable and widely adopted solution in the industry, mainly due to its ease of implementation and versatility. Although TEEs have attracted comparable interest due to their low performance overhead, their limited computational resources and scalability make them impractical for large-scale ML systems. Similarly, other PETs like HE, SMPC, and FL have shown promise, but their adoption has been hindered by high performance overhead and difficulties in fully mitigating privacy leaks from trained models. While combining PETs can mitigate some of their individual weaknesses, many such combinations remain impractical for real-world implementation. Nonetheless, as data privacy regulations evolve, the development and adoption of these PETs will be crucial for enabling secure, PPML applications, paving the way for broader acceptance and implementation across various industries.

---

Cover Image by Google DeepMind from Pexels.


Subscribe to our newsletter!

Valyu is a data provenance and licensing platform that connects data providers with ML engineers looking for diverse, high-quality datasets for training models.  

#WeBuild 🛠️
